Zheng-Xin Yong (Yong)

@yong_zhengxin

Followers: 926 · Following: 1,462 · Media: 37 · Statuses: 298

PhD @BrownCSDept || RS Intern @ FAIR @AIatMeta 🤖 multilingual + responsible AI. Past: @CohereForAI @BigscienceW

Bay Area
Joined January 2018
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
A quick write-up about what I've learned so far in my PhD. Super grateful to work with @stevebach at @BrownCSDept , and look forward to the remaining two/three years of my education.
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
LLMs such as ChatGPT and BLOOMZ are claimed to be multilingual, but does this mean they can generate code-mixed data? Follow this 🧵 to find out. (1/N) Paper:
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
(Repost for corrected Arxiv) 🧐What’s the best way to quickly adapt large multilingual language models to new languages? We present our new paper from @BigscienceW 🌸: BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting. 📜 [1/9]
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Presenting LexC-Gen, which generates data for extremely low-resource languages. 🤗 You only need 7B LLMs and bilingual lexicons. 🔥 Our synthetic data are competitive with expert-translated data on sentiment and topic classification. Paper + Code: [1/n]
@yong_zhengxin
Zheng-Xin Yong (Yong)
9 months
🚨Cross-Lingual Vulnerabilities in GPT-4 Safeguards We find that translating English inputs into low-resource languages (LRL) increases the chance of bypassing GPT-4’s safety mechanisms from <1% to 79%. Preprint: See thread (1/n)
@yong_zhengxin
Zheng-Xin Yong (Yong)
6 months
I will be presenting ‘Low-Resource Languages Jailbreak GPT-4’ at NeurIPS @solarneurips. ⭐️ It was also selected as the best workshop paper! Let’s grab ☕️ if you wanna chat about AI safety! Happy to also talk about my involvement with the Responsible Release of #Aya w/ @CohereForAI
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Aya has launched!!! 🌍 I'm so excited thinking about how Aya (open-source models & dataset with HUGE language coverage) will accelerate multilingual NLP. It's been so fulfilling to work with other amazing collaborators to get Malay represented and to analyze the safety of Aya.
@CohereForAI
Cohere For AI
4 months
Today, we’re launching Aya, a new open-source, massively multilingual LLM & dataset to help support under-represented languages. Aya outperforms existing open-source models and covers 101 different languages – more than double the coverage of previous models.
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
As a Computer Science PhD student at Brown University, my 12-month stipend for this year is ~$41,000, which covers medical insurance (United Healthcare) and dental insurance (Delta PPO Plus Premier). I have to pay ~$70 in fees per semester. #StipendTransparency
@gautamcgoel
Gautam Goel
2 years
To increase transparency around grad school stipends, retweet this tweet with your department, university, and annual stipend. I'll go first: I'm a PhD student in the Computing and Mathematical Sciences (CMS) department at Caltech, and I'm paid $36k/year. #StipendTransparency
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
I'm feeling so happy about this recognition for the IndoNLP research group, especially since SEA languages have been very underrepresented in major NLP conferences. Please check out their paper:
@AlhamFikri
Alham Fikri Aji
1 year
🇮🇩NusaX was awarded an Outstanding Paper Award 🎉 Amazing work by all coauthors. More work to come from the Indonesian NLP community, stay tuned.
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
Happy to share that BLOOM+1 has been accepted to #ACL2023 / #acl2023nlp (thanks to all @BigscienceW contributors)! If you want to know how to adapt BLOOM/BLOOMZ to an unseen language, come check out our work! TL;DR: smaller models benefit from continued pretraining, and larger models benefit from adapter-based methods.
@yong_zhengxin
Zheng-Xin Yong (Yong)
7 months
@xwang_lk I was able to get it right on my first try. The reproducibility issue is a concern for these models.
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
Accepted the offer a week ago, but I'm still really excited to share that I will be joining the @BrownCSDept PhD program.
@yong_zhengxin
Zheng-Xin Yong (Yong)
6 months
Amazing panel talk on different dimensions of inclusivity (the most important theme in AI right now imo). They touched on how to integrate non-Western values, non-academics such as Redditors/hackers, and local native speakers into our AI field.
@solarneurips
SoLaR @ NeurIPS2023
6 months
Our panel discussion on **Responsible/Safe LLM research** with Roger Grosse, David Bau, Stella Biderman, Vinodkumar Prabhakaran (@RogerGrosse, @davidbau, @BlancheMinerva), moderated by Sara Hooker @sarahookr is starting! Come to Room R06 - R09, 2nd Floor to join the discussion :)
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
I had a blast at #acl2022nlp ! Here's my blog post sharing my first-time experience as a presenter, attendee, and volunteer at the NLP conference in person.
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
Happy to have the opportunity to present PromptSource and ongoing work on unseen language adaptation at #acl2022 ! Made really good friends and had meaningful interactions in my first in-person conference/virtual poster presentation. Photos but capped at 4 (b/c of Twitter lol)
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Thanks AK for tweeting our work! Summary thread incoming 🧵
@_akhaliq
AK
4 months
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons. Data scarcity in low-resource languages can be addressed with word-to-word translations from labeled task data in high-resource languages using bilingual lexicons.
@yong_zhengxin
Zheng-Xin Yong (Yong)
6 months
Such a great talk about real-world harms by @rajiinio ! I learned so much about how LLMs affect diaspora users on YouTube – these real-world multilingual cases haven't been discussed widely enough in the ML research literature.
@solarneurips
SoLaR @ NeurIPS2023
6 months
We now have Inioluwa Deborah Raji @rajiinio on Grounded Evaluations for Assessing Real-World Harms!
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
Just passed my Capstone thesis defense @MinervaSchools under the advisorship of @patrickdkwatson 🔥! Really appreciate all the support I have, especially as a first-gen college student studying abroad. Now I can't wait to graduate and start my research career.
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 months
Curating data and benchmarks is difficult, let alone for languages that are often left behind in AI development. An ambitious initiative currently led by @HolyLovenia is collecting datasets for all SEA languages - if you speak any Southeast Asian language, come contribute!!
@HolyLovenia
Holy Lovenia
3 months
March marks the final month for accepting public datasheets in SEACrowd. Our focus will shift towards implementing dataloaders and submitting private datasets from April onwards. Let's make the most of this time by submitting more public SEA datasheets!
@yong_zhengxin
Zheng-Xin Yong (Yong)
10 months
Personally, the embargo created a lot of stress and forced us to release premature work. We were benchmarking LLMs on code-switching capability for EMNLP, and we were extremely worried that our work would become irrelevant (models getting outdated or updated) after the 5-month embargo.
@shaily99
Shaily
10 months
Re ACL anon debate: if the embargo hurts students most, students should be participating in the discussion (which I haven't seen much). Because sometimes indirect stakeholders (in this case profs) might think they know what the direct stakeholders think, but they don't.
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
🔥 Survey paper released!! After spending lots of time going through code-switching papers, I firmly believe that the biggest priority is to expand the data (more languages and evaluation tasks).
@gentaiscool
Genta Winata
1 year
🚨 New Paper The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges 📰 Paper: We conduct a comprehensive study on 400+ #codeswitching papers published at @aclanthology and ISCA from 1980s to present. #NLProc
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
Come join the project to boost the representations of your languages in multilingual LMs! Been working on the Malay language and there's so much work to be done. This initiative is a big step towards a truly multilingual world.
@CohereForAI
Cohere For AI
1 year
Introducing Aya: An Open Science Initiative to Accelerate Multilingual AI Progress 🧵 [1/11]
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@Teknium1 Shameless self-plug on unique task-specific datasets 🙂 Synthetic task datasets that are competitive with expert-translated gold data for 17 low-resource languages. - sentiment analysis: - topic classification:
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
However, ChatGPT, like many LLMs, is sensitive to prompt wording. When we ask ChatGPT to imagine two bilingual speakers conversing, it is prone to generating unnatural conversations where the two interlocutors speak in different languages. (7/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Thanks Aran for sharing our work!!
@arankomatsuzaki
Aran Komatsuzaki
4 months
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons LexC-Gen generated data is competitive with expert-translated gold data across 17 low-resource languages
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
I'm so happy to see more and more LLMs released for SEA languages! Really appreciate the work the authors have put into this!
@sivil_taram
Qian Liu 🔭
4 months
🚀Introducing Sailor: Open Language Models for South-East Asia🌏 From 🇮🇩Indonesian to 🇹🇭Thai, 🇻🇳Vietnamese to 🇲🇾Malay, Sailor are designed to understand and generate text across diverse linguistic landscapes of SEA region.🌊 Built from the awesome Qwen 1.5 models with careful
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Appreciate all the feedback from @KreutzerJulia @gentaiscool @davlanade @ruochenz_ @AlhamFikri and the Brown University Superlab. [11/n]
@yong_zhengxin
Zheng-Xin Yong (Yong)
10 months
@Teknium1 It’s important to consider model size as well. shows that in tuning for a new language in a limited compute/data setup, smaller models are better with full finetuning whereas larger models are better with adapters.
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
Language adaptation generally improves zero-shot prompting in new languages. We recommend adapter-based methods for adapting larger BLOOM models because they perform better than continued pretraining (previous figure) and are more compute-efficient. [2/9]
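A minimal sketch of what adapter-based adaptation can look like in practice, using LoRA from the peft library as a stand-in for the adapter methods compared in the paper; the model size and hyperparameters below are illustrative assumptions, not the paper's exact setup:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Freeze the pretrained BLOOM weights and train only small adapter modules
# on monolingual text in the new language.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7")
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of weights will train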
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
Finished two days of virtual PhD student visit weekend at @BrownCSDept ! I am glad that I attended most of the events in person, even though I had multiple deadlines due this week and had to wake up at 6 am. Here are 3 things that stood out to me:
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
I fiddled with #ChatGPT and I'm impressed by its multilinguality. For instance, it perfectly translates a trilingually code-mixed sentence and even correctly adds the context to when we'd say such a sentence.
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
While ChatGPT demonstrates high explanatory power, we find that its explanations of why outputs are code-mixed can be incorrect. Using the same example, the underlined word should be labeled Malay instead of Chinese. (10/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
Despite ChatGPT’s relatively good performance, it occasionally suffers from fluency issues and, at times, even fails to follow prompts by mixing languages not specified in our instructions. (9/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
We find that language adaptation is effective regardless of the language family, word order, and whether or not the script system is seen in pretraining. [3/9]
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
Out of the 5 models evaluated, we find that ChatGPT outperforms other models in its ability to generate text with a high level of code-mixing. (6/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
The figure in our first tweet was adapted from work by Renae Cheng. This amazing example showcases the complexity of Singlish, and we recommend reading it for a better understanding of Singlish.
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
Due to the lack of transparency of API-only models (such as ChatGPT), we are also limited in our ability to fully analyze and critique the code-mixing capabilities of these systems. We are concerned about the broader effects these models will have on scientific investigation. (12/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
Code-mixing, or code-switching, refers to the mixing of two or more languages. It’s very common for multilingual speakers to code-mix in colloquial dialogues. This makes collecting such data expensive and challenging. (2/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@BlackHC Yea - this is a good point. See my reply to @deliprao and I agree I might have passed the bar (I didn't know that before writing this post)
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@deliprao Thanks for sharing this! Oof, so it is true that hiring has also looked at shallow indices such as # of publications and H-index. That really sucks. One thing I think is also relevant is the "influencer" game nowadays to boost those metrics. 🥲
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
The most effective method to teach BLOOMZ a new language is to include a new language in the multitask fine-tuning mixture. [7/9]
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
While we are at prompting, we also study the effect of language adaptation on instruction tuning using BLOOMZ. We find that adaptation with monolingual free-form text (such as OSCAR) causes BLOOMZ to lose its prompting capability gained from multitask instruction tuning. [6/9]
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
We created 180 natural language LLM prompts, varying language pairs and topics, and annotated the level of code-mixing: loanwords, topic-related entities, or beyond entities such as clauses and verbs. (5/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
I have been mentored by @TorrentTiago for two years and worked with @FrameNetBrasil for @gsoc 2020. Highly recommend it 🔥! - Interesting research projects - Amazing learning experience (+ chance to publish) - Mentors look out for your best interest
@TorrentTiago
Tiago Torrent
3 years
Third time is the charm, they say! Check out our charming ideas for Google Summer of Code 2021 ( @gsoc ) and come work @FrameNetBrasil this summer! We do have the coolest mentors, you know!
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
We focus our analysis on six languages of Southeast Asia (Chinese, Malay, Indonesian, Tagalog, Vietnamese, and Singlish), whose speakers come from a variety of language families and commonly code-mix. We then generated code-mixed texts using zero-shot prompting. (4/N)
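A minimal sketch of this zero-shot prompting setup, assuming the openai Python client; the prompt wording and model name are illustrative assumptions, not the paper's exact templates:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_code_mixed(lang_a: str, lang_b: str, topic: str) -> str:
    # Zero-shot: no examples of code-mixed text are provided in the prompt.
    prompt = (
        f"Write a short casual sentence about {topic} that mixes "
        f"{lang_a} and {lang_b} the way a bilingual speaker would."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_code_mixed("Malay", "English", "breakfast"))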
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@sivil_taram @bobz00528 Thanks for the mention @sivil_taram ! It seems like code-switching is a non-trivial task for multilingual LLMs. To add to the discussion, my colleague @ruochenz_ has a paper that shows similar findings across sentiment analysis, MT, summarization, and LID
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@infernozzzz No need for 2 pages if you cannot fill them up. I'd go for a full 1-page resume.
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Independent researchers have been helping companies expose safety vulnerabilities & design safer GenAI, but how are we being protected? A much-needed call-to-action (led by @ShayneRedford ) for companies to provide legal and technical safe harbor for researchers!!
@ShayneRedford
Shayne Longpre
4 months
Independent AI research should be valued and protected. In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward. 1/
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
We also urge future research on multilingual LLMs to evaluate on code-mixedness—a linguistic feature prevalent in many parts of the world. As LLMs become increasingly multimodal, capturing code-mixing could allow for more organic user interactions in future applications. (13/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@_akhaliq Hope you feel better soon!
@yong_zhengxin
Zheng-Xin Yong (Yong)
9 months
LLMs like GPT-4 are already powering real-world multilingual applications such as language preservation. Therefore, we urge AI safety research to go beyond high-resource languages. For LLMs to be truly safe, their safeguards need to apply to a wide range of languages. (5/n)
@yong_zhengxin
Zheng-Xin Yong (Yong)
6 months
@sarahookr Would love to collaborate on this direction! We found surprising scaling effects on adaptation in low-resource settings. (Spoiler: more trainable parameters isn't always good – there can be inverse scaling)
@yong_zhengxin
Zheng-Xin Yong (Yong)
6 months
@vinodkpg @solarneurips @RogerGrosse @davidbau @BlancheMinerva @sarahookr Really learned a lot from the panel and your answers on community building. Thank you!!
@yong_zhengxin
Zheng-Xin Yong (Yong)
9 months
Responsible disclosure: We’ve shared our findings with OpenAI before publicly releasing this work. Check out our work here: Work done with @CriMenghini and @stevebach on GPT-4-0613 (n/n)
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
This is our first step towards a broader research goal: effectively use available linguistic materials, such as bilingual lexicons, to address the data scarcity of low-resource languages. [2/n]
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
We also discover that the prompting performance of the adapted BLOOM is primarily determined by the size of the language adaptation training data. We need at least 100K samples (~100M tokens) of the new language for effective adaptation. [5/9]
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
To sum up, while our results show that LLMs hold promise, we recommend that researchers exercise caution when using LLMs to generate code-mixed data. (11/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
9 months
Our work suggests that GPT-4’s RLHF safety training has limited cross-lingual generalization. We believe this creates safety risks that affect all LLM users. (3/n)
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
🥳This work was done during my internship at Google Summer of Code #GSoC 2020 under the guidance of @TorrentTiago , @CzuloOliver , @patrickdkwatson , and Collin F Baker! It’s been a long haul, and lots of future work is needed in this multilingual frame semantics space. [n/n]
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@deliprao 😅 I didn't realize how insane this is until you pointed it out. I guess the PhD application process, which already expects students to have top-tier conference publications under their belt, has numbed me.
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
We use sentence retrieval accuracy to measure the quality of language-independent representation. One surprising result is that continued pretraining (green line) results in poorer language-independent representations as the model scales up. [4/9]
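A minimal sketch of how sentence retrieval accuracy can be computed, assuming row-aligned embeddings of parallel sentences (e.g., mean-pooled hidden states) as NumPy arrays; the paper's exact evaluation setup may differ:

import numpy as np

def retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    # Cosine similarity between every source and target sentence embedding.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T
    # A retrieval is correct when the nearest target sentence is the true
    # translation, i.e., the row index matches the column index.
    predictions = sims.argmax(axis=1)
    return float((predictions == np.arange(len(src_emb))).mean())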
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
We also explore the ability of LLMs in generating Singlish, a creole language that borrows from multiple languages. ChatGPT and davinci-003 can generate Singlish with a 96% success rate, but BLOOMZ and Flan-T5-XXL fail miserably. (8/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 months
@sarahookr @cohere @CohereForAI Congratulations on your two-year anniversary! 🎉 It's incredible and inspiring to see all the progress that has come from @CohereForAI . Here's to more exploration and success in the years to come! 🚀
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Furthermore, we found that simply scaling up synthetic task data is insufficient. You need lexicon-conditioning to close the performance gap with gold translations. [8/n]
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Solution: We propose LexC-Gen (lexicon-conditioned generation) that prompts LLMs to use high-resource-language words from bilingual lexicons to generate task data. This lexicon-compatible synthetic data can now be better word-translated into low-resource languages. [5/n]
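A toy sketch of the lexicon-conditioning idea: sample high-resource-language words that have lexicon entries and ask the LLM to build labeled task data around them. The prompt wording, helper name, and toy lexicon below are illustrative assumptions, not LexC-Gen's actual configuration:

import random

def build_lexcgen_prompt(lexicon: dict[str, str], label: str, k: int = 10) -> str:
    # Condition generation on words guaranteed to have lexicon translations.
    words = random.sample(sorted(lexicon), k)
    return (
        f"Write one English sentence expressing {label} sentiment. "
        f"Use as many of these words as possible: {', '.join(words)}."
    )

lexicon = {"happy": "gembira", "rain": "hujan", "market": "pasar"}  # toy English-Malay entries
print(build_lexcgen_prompt(lexicon, "positive", k=3))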
@yong_zhengxin
Zheng-Xin Yong (Yong)
7 months
@AlhamFikri Interesting finds!! Safety guardrails are very fragile, and I can totally see how mixing languages, as a second-order language attack vector, can bypass whatever multilingual guardrails are in place.
@yong_zhengxin
Zheng-Xin Yong (Yong)
5 months
@singhshiviii @AIatMeta Thank you Shivalika!
@yong_zhengxin
Zheng-Xin Yong (Yong)
9 months
We use Google Translate to translate unsafe inputs into low-resource languages and successfully circumvent the safeguards. This cross-lingual vulnerability is concerning because our approach does not require any adversarial prompt to ‘fool’ the safeguards. (2/n)
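A minimal sketch of the translate-then-query evaluation pipeline described in this thread, assuming the unofficial googletrans package and the openai Python client (the thread does not specify the exact tooling beyond Google Translate):

from googletrans import Translator
from openai import OpenAI

translator = Translator()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_in_language(english_input: str, lang_code: str) -> str:
    # Translate the English input into the target language, then query GPT-4.
    translated = translator.translate(english_input, dest=lang_code).text
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": translated}],
    )
    return response.choices[0].message.content

# "zu" (Zulu) is one example of a low-resource language code to evaluate.
print(query_in_language("Describe your content-safety guidelines.", "zu"))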
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
@SapienzaNLP @IJCAIconf @luigi_proc @edoardo_barba @FMartelli25 @RNavigli @ERC_Research @elexis_eu This is an amazing paper! Would love to use the code, but the GitHub code provided in the paper is not available. Is there any way to access the code?
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
@sarahookr This looks so stunning!
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
As an alternative to manual data collection, we explore the viability of using ChatGPT, InstructGPT (davinci-002 and -003), BLOOMZ, and FLAN-T5-XXL to generate code-mixed data. (3/N)
@yong_zhengxin
Zheng-Xin Yong (Yong)
5 months
@eKhalid_ @AIatMeta Thank you Khalid!
@yong_zhengxin
Zheng-Xin Yong (Yong)
9 months
For instance, 1.2 billion low-resource language speakers can interact with GPT-4 with limited content moderation filters. Bad actors from high-resource language communities can also use publicly available translation tools to bypass the safety guardrails. (4/n)
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
It's amazing to work with @stevebach (and to be under his supervision for my Ph.D.) 👏👏👏 He has a very high level of engagement and cares deeply about the progress of the ML community.
@BigscienceW
BigScience Research Workshop
2 years
🌸 Behind the scenes 👀 We’re delighted to introduce to you @stevebach ! Stephen is co-chairing the Prompt Engineering Working Group, investigating the emergence of 0-shot prompting behaviors in very large language models and making 0-shot prompting abilities more robust. 👏
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
@AlhamFikri @mbzuai Congrats Aji! Can't wait to see your lab grow and the work coming out of it!
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Why bilingual lexicons? Bilingual lexicons cover >5000 languages around the world. We can use them to translate existing labeled task data from high-resource languages to low-resource languages through word-to-word translation. [3/n]
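A toy sketch of this word-to-word translation step, assuming the bilingual lexicon is loaded as a Python dict; real lexicons carry multiple senses per word and need proper tokenization, which this sketch glosses over:

def word_translate(sentence: str, lexicon: dict[str, str]) -> str:
    # Replace each word that has a lexicon entry; keep unknown words as-is.
    return " ".join(lexicon.get(word.lower(), word) for word in sentence.split())

lexicon = {"i": "saya", "eat": "makan", "rice": "nasi"}  # toy English-Malay entries
print(word_translate("I eat rice", lexicon))  # -> saya makan nasi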
@yong_zhengxin
Zheng-Xin Yong (Yong)
4 months
Thank you @deliprao for sharing this data point. This contrast has made me realize how terribly miscalibrated this job market is right now, as supply (or people's interest) heavily outstrips demand. Fingers crossed for all others looking for internships and new-grad jobs
@deliprao
Delip Rao e/σ
4 months
Contrast this with my Google research internship application process 16 years ago: * 1st year phd student * cold emailed someone who did not know me, but I knew their work * got a reply back and a same-day phone chat * one interview with another Google researcher to test how
@yong_zhengxin
Zheng-Xin Yong (Yong)
5 months
@amilah_dul Thank you Amina!
@yong_zhengxin
Zheng-Xin Yong (Yong)
1 year
@srush_nlp yong_zhengxin
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
@faisal_thisis It was great meeting you Fahim! I am sure we will cross paths again at conferences
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
@albertwebson It's amazing to work with you (iterations are a nightmare, but we got it done eventually)!
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
@TaliaRinger Thanks for this thread! Any recommended way to improve creativity, or the ability to synthesize, in research work?
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
@MayeeChen Likewise Mayee!!
@yong_zhengxin
Zheng-Xin Yong (Yong)
3 years
@TaliaRinger Thanks for the detailed response! From how you've described it, it seems to be a combination of observation and deep analysis. And certainly, your instinct for choosing which problems to engage with.
@yong_zhengxin
Zheng-Xin Yong (Yong)
2 years
🔍 We first attempt to characterize frame shifts. We find that frame shifts result from many causes such as the lack of exact translation of certain words (i.e., lexical divergence). We categorize these causes into translational divergences and construal differences. [3/n]