Tengyu Ma Profile
Tengyu Ma

@tengyuma

Followers
35K
Following
253
Media
79
Statuses
549

Assistant professor at Stanford; co-founder of Voyage AI (https://t.co/wpIITHLgF0); working on ML, DL, RL, LLMs, and their theory.

Joined June 2011
@tengyuma
Tengyu Ma
2 years
Adam, a 9-year-old optimizer, is the go-to for training LLMs (e.g., GPT-3, OPT, LLaMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). 🧵⬇️
Tweet media one
96
623
4K
@tengyuma
Tengyu Ma
2 years
📢 Introducing Voyage AI @Voyage_AI_! Founded by a talented team of leading AI researchers and me 🚀🚀. We build state-of-the-art embedding models (e.g., better than OpenAI 😜). We also offer custom models that deliver 🎯 +10-20% accuracy gain in your LLM products. 🧵
Tweet media one
38
94
761
@tengyuma
Tengyu Ma
3 years
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps:
Tweet media one
7
126
665
@tengyuma
Tengyu Ma
4 months
RL + CoT works great for DeepSeek-R1 & o1, but:
1️⃣ Linear-in-log scaling in train & test-time compute
2️⃣ Likely bounded by the difficulty of training problems
Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵
Tweet media one
17
107
556
@tengyuma
Tengyu Ma
4 years
Very honored to be named a 2021 Sloan Fellow. Thanks to all my group members and collaborators for their wonderful work! Thanks for appreciating our work on ML. Check it out on my Twitter homepage or website! #SloanFellow
19
7
533
@tengyuma
Tengyu Ma
2 years
Releasing the code of Sophia 😀, a new optimizer (⬇️). code:
@tengyuma
Tengyu Ma
2 years
Adam, a 9-year-old optimizer, is the go-to for training LLMs (e.g., GPT-3, OPT, LLaMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). 🧵⬇️
Tweet media one
18
97
516
@tengyuma
Tengyu Ma
6 years
If you are interested in training deep models without batchnorm, or in why batchnorm can help training, please check out our paper! arXiv link: . Thanks to @ajmooch for the tweet and re-implementation!
@ajmooch
Andy Brock
6 years
Fixup (formerly ZeroInit) by H. Zhang, Y.N. Dauphin, and @tengyuma (ICLR2019): They manage to train deep nets (10k layers!) w/o BatchNorm, by careful init scaling & initializing the 2nd residual conv to 0. My @PyTorch impl. here:
Tweet media one
3
153
525
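For concreteness, a minimal PyTorch sketch of the recipe described in the quoted tweet (rescale the first conv's init in each residual branch, zero-initialize the second so every branch starts as the identity). This is not the authors' released Fixup code; the toy block and the `num_blocks ** -0.5` scaling for two-layer branches are assumptions based on the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyResidualBlock(nn.Module):
    """A toy two-conv residual branch, used only to illustrate the init recipe."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

def fixup_style_init(model: nn.Module, num_blocks: int) -> None:
    for m in model.modules():
        if isinstance(m, ToyResidualBlock):
            # First conv: standard He init, rescaled by num_blocks ** -0.5
            # (the L^{-1/(2m-2)} factor with m = 2 layers per residual branch).
            nn.init.kaiming_normal_(m.conv1.weight)
            with torch.no_grad():
                m.conv1.weight.mul_(num_blocks ** -0.5)
            # Second conv: zero init, so the residual branch initially outputs zero.
            nn.init.zeros_(m.conv2.weight)

# Example: a 16-block toy network.
net = nn.Sequential(*[ToyResidualBlock(64) for _ in range(16)])
fixup_style_init(net, num_blocks=16)
```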
@tengyuma
Tengyu Ma
4 years
Why does contrastive learning magically produce linearly separable features? We leverage spectral graph theory to analyze it under realistic settings. (In contrast, many prior works require that positive pairs are independent conditioned on the label.)
Tweet media one
1
84
518
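Assuming this refers to the spectral contrastive loss analyzed in that line of work, the objective for an encoder $f$ can be written (up to constants) as

```latex
\mathcal{L}(f) \;=\; -2\,\mathbb{E}_{(x,\,x^{+})}\!\left[f(x)^{\top} f(x^{+})\right]
\;+\; \mathbb{E}_{x,\,x'}\!\left[\left(f(x)^{\top} f(x')\right)^{2}\right],
```

where $(x, x^{+})$ is a positive pair obtained by augmenting the same datapoint and $x, x'$ are drawn independently; minimizing it corresponds to a spectral decomposition of the population augmentation graph, which is the spectral-graph-theory connection mentioned above.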
@tengyuma
Tengyu Ma
4 years
A short introductory survey on nonconvex optimization for machine learning problems, published as a chapter of Beyond the Worst-Case Analysis of Algorithms, edited by @algo_class.
Tweet media one
3
85
491
@tengyuma
Tengyu Ma
6 years
A new paper on improving the generalization of deep models (w.r.t. clean or robust accuracy) by theory-inspired explicit regularizers.
0
80
404
@tengyuma
Tengyu Ma
4 years
Thinking of applying self-supervised learning (SSL) on your uncurated, imbalanced datasets? Good news: we found SSL is more robust to long tails than supervised representations. We also present theoretical and empirical analyses and an improved algorithm.
Tweet media one
5
79
390
@tengyuma
Tengyu Ma
3 months
We joined @MongoDB! @VoyageAI’s best-in-class embedding models and rerankers will be part of MongoDB’s best-in-class database, powering mission-critical AI applications with high-quality semantic retrieval capability. A huge thank you to everyone with us on this journey, and to
Tweet media one
54
25
365
@tengyuma
Tengyu Ma
7 months
WSD learning rate schedules are taking off—lower loss, no pre-set compute budget, & easier continual training. Yet the loss curve is puzzling—it stays high in the stable phase but drops sharply in the decay phase. Our paper explains it with a 'River Valley' structure of the loss! 🧵🧵
Tweet media one
11
62
329
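For reference, a minimal sketch of a generic warmup-stable-decay (WSD) schedule, with an illustrative linear decay to 10% of the peak; the exact decay shape and final fraction used in the paper may differ.

```python
def wsd_lr(step: int, peak_lr: float, warmup_steps: int,
           decay_start: int, total_steps: int, final_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, flat stable phase, short decay at the end."""
    if step < warmup_steps:                     # warmup: 0 -> peak
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                      # stable: hold the peak; no preset budget needed
        return peak_lr
    # decay: anneal linearly from the peak to final_frac * peak
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return peak_lr * (1.0 - (1.0 - final_frac) * progress)
```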
@tengyuma
Tengyu Ma
7 years
Our recent paper tries to explain why over-parameterized models can even help generalization: bigger models always have larger max-margins, and a weak regularizer + logistic loss can give the max-margin!
0
71
280
@tengyuma
Tengyu Ma
5 years
DL models tend to struggle with heteroskedastic and imbalanced datasets, where long-tailed labels have varying levels of uncertainty, partly bc it's hard to distinguish mislabeled, ambiguous, and rare examples. We propose a new regularization technique:
Tweet media one
4
40
279
@tengyuma
Tengyu Ma
3 years
The double descent curves are intriguing, but do you also somewhat miss the classical U-curves (like me)? One possible option: change the x-axis of your visualization. Below, the x-axis is the 2-norm, and the color and arrows indicate how the # of params changes.
Tweet media one
7
40
273
@tengyuma
Tengyu Ma
5 years
We analyze self-training for domain adaptation, semi- and unsupervised learning, showing that pseudolabels are denoised through implicit propagation of correct labels via consistency regularization when data satisfy an expansion property. (More in Fig.)
Tweet media one
3
38
273
@tengyuma
Tengyu Ma
2 years
Updates on Sophia ✨:
1. It also works very well for 1.3B and 6.6B, with little tuning on top of tuned Adam!
2. More tips on efficient hyperparam tuning.
3. More ablations showing all parts are necessary.
Tweet media one
@tengyuma
Tengyu Ma
2 years
Adam, a 9-year-old optimizer, is the go-to for training LLMs (e.g., GPT-3, OPT, LLaMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). 🧵⬇️
Tweet media one
4
39
266
@tengyuma
Tengyu Ma
4 years
RNNs and transformers with *infinite* precision are universal approximators, but are they statistically learnable? We show that even finite-precision transformers can express any Turing machine, and they are learnable with polynomially many samples.
Tweet media one
4
40
246
@tengyuma
Tengyu Ma
4 years
Pretrained language models are trained with losses quite different from downstream tasks; why can they help? Why does prompt tuning work so well? We analyze them under generative assumptions of the language (HMMs or more realistic variants)
Tweet media one
2
42
252
@tengyuma
Tengyu Ma
5 months
What issues could take OpenAI down for 9.5 hours? I am sooo curious. 🧵 Any hypotheses?
Tweet media one
42
6
232
@tengyuma
Tengyu Ma
5 months
Proud to share our best model yet, pushing boundaries again and outperforming all models on all domains (except voyage-code-3 on code). Our binary, 1024-dim embeddings are 5.53% better than OpenAI's float, 3072-dim ones. If you spent $10k monthly on storage, now it's $104 with us!
@VoyageAI
Voyage AI by MongoDB
5 months
📢 Announcing the new SOTA voyage-3-large embedding model!
• +9.74% over OpenAI and +20.71% over Cohere
• flexible dims (256-2048) and quantizations (float, int8, binary)
• +8.56% over OpenAI with 1/24x the storage cost
• +1.16% over OpenAI with 1/192x the storage cost ($10K → $52)
Tweet media one
9
27
230
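The dollar figures follow from bytes-per-vector arithmetic. A quick sanity check of the $10K → ~$104 claim, assuming storage cost scales linearly with vector size:

```python
# float32, 3072-dim baseline vs. 1-bit, 1024-dim binary embeddings
baseline_bytes = 3072 * 4      # 12,288 bytes per vector
binary_bytes   = 1024 // 8     # 128 bytes per vector
reduction = baseline_bytes / binary_bytes    # 96x smaller
print(reduction, round(10_000 / reduction))  # 96.0 104 -> roughly $104/month from $10K
```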
@tengyuma
Tengyu Ma
1 year
Really enjoyed the chat with @saranormous on RAG and embeddings, and all the past work with her on @Voyage_AI_!
@NoPriorsPod
No Priors
1 year
.@NoPriorsPod drop, all about embedding models and retrieval systems with Stanford AI prof and @Voyage_AI_ founder @tengyuma.
- Headroom in embeddings
- Hierarchical memory for LLMs
- RAG vs fine-tuning vs agent chains
- Why we (still) need RAG
- Advice for building RAG systems
6
20
204
@tengyuma
Tengyu Ma
2 years
🆕📢 Check out @Voyage_AI_ (during @OpenAI's drama 👀).
1. Officially the #1 embedding model on the @huggingface MTEB leaderboard, ahead of @OpenAI ada & @cohere's new v3 😜.
2. Now supporting distinguishing docs & queries — different prompts are added in the backend, gaining 1.2% on average.
Tweet media one
11
30
202
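The "different prompts for docs & queries" point corresponds to choosing an input type at embedding time. A hypothetical usage sketch (the client call, model name, and attribute names here are assumptions, not verified documentation):

```python
import voyageai

vo = voyageai.Client()  # assumes VOYAGE_API_KEY is set in the environment

docs = ["Sophia is a second-order optimizer for LLM pre-training."]
# Embed documents and queries with different input types so the backend can
# prepend the appropriate prompt to each.
doc_vecs   = vo.embed(docs, model="voyage-01", input_type="document").embeddings
query_vecs = vo.embed(["what is Sophia?"], model="voyage-01", input_type="query").embeddings
```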
@tengyuma
Tengyu Ma
1 year
🆕📢 @Voyage_AI_'s new embedding models: voyage-2 & voyage-code-2!
1. 🔼 +17% accuracy gain on 11 code datasets vs @OpenAI
2. 🥇 #1 on MTEB & diverse corpora
3. ⚡ Production-ready latency
4. 🛒 Available on AWS Marketplace, meeting compliance requirements
5. 📜 16K context length
#LLMs
Tweet media one
6
37
195
@tengyuma
Tengyu Ma
2 years
Personally, some cool things about this project 😊: (a) it demonstrates that we can still study LLM pre-training in academia where we have much less compute, and create new algorithms with impact.
3
11
186
@tengyuma
Tengyu Ma
1 year
Past works found empirically that one-layer transformers can learn the gradient descent algorithm in context (when they are trained on in-context linear regression datasets). In our paper at #ICLR2024, we give a theoretical proof of why it's GD (rather than other algorithms) 🧵
Tweet media one
4
26
186
@tengyuma
Tengyu Ma
3 years
When and how can models extrapolate to unseen domains? Previous understanding is mostly limited to linear models or well-covered new domains. We take some first baby steps beyond these by considering *structured* domain shifts and nonlinear models. 1/n
Tweet media one
4
32
184
@tengyuma
Tengyu Ma
5 years
We theoretically prove that not only does the scale of the noise in SGD have an implicit regularization effect, but its covariance also matters. For a simple model, SGD with label noise biases towards sparsity and leaves the NTK regime, whereas Gaussian noise does not.
4
21
172
@tengyuma
Tengyu Ma
7 years
Honored to receive the best paper award at COLT for "Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations". Congrats to Yuanzhi and Hongyang!
6
19
168
@tengyuma
Tengyu Ma
4 years
Conventional wisdom: SGD prefers flat local minima. How *rigorously* can we characterize this? We prove that label-noise SGD converges to stationary points of the training loss plus a flatness regularizer, by coupling it with GD on the regularized loss. 0/6
Tweet media one
2
25
171
@tengyuma
Tengyu Ma
1 year
A 100h/week job (e.g., CEO of a startup) apparently easily becomes a 24/7 job, where the job description of the remaining 68h is "forget about what happened in those 100h, just enjoy life, family, and sleep". And it's much much harder to succeed in those 68h than in those 100h.
7
10
166
@tengyuma
Tengyu Ma
8 months
Our journey started last year when we realized that embedding models were under-loved and under-explored. Today, we have the best-in-class embeddings & rerankers, incredible partners such as @AnthropicAI and @harvey__ai, and several deployment options. Huge thanks to our
@VoyageAI
Voyage AI by MongoDB
8 months
Thrilled to share that we've closed $28M in funding, led by @CRV, with continued support from @wing_vc and @saranormous. Also excited to onboard strategic partners @SnowflakeDB and @databricks! Building the world's best models for RAG and search 🧵🧵🧵
8
14
165
@tengyuma
Tengyu Ma
5 years
Some features can spuriously correlate with labels but do not cause them. Models relying on these features are biased and non-robust to varying correlations. We show self-training on an *unlabeled*, more diverse dataset can avoid using spurious features.
Tweet media one
1
28
145
@tengyuma
Tengyu Ma
3 years
Four out of my five NeurIPS submissions simultaneously had a rating of 3 and a rating of 7. None of the five papers has a rating of 5. 😂 is perhaps the best emoji for this tweet: shall I laugh or cry?
11
1
146
@tengyuma
Tengyu Ma
6 years
Model-based planning vs policy optimization? Planning can be stronger when the Q-functions and policies are more complex than the dynamics. Such cases exist in theory and are practically relevant. This inspires a method that outperforms SOTA on humanoid!
1
23
137
@tengyuma
Tengyu Ma
8 months
It's pretty impressive that a 400M model can be better than OpenAI v3 large!
@VoyageAI
Voyage AI by MongoDB
8 months
📢 Announcing a new generation of Voyage embedding models: voyage-3 and voyage-3-lite! When compared with @OpenAI's v3 large:
voyage-3: +7.5% accuracy, 2.2× cheaper, 3× smaller embeddings, 4× context
voyage-3-lite: +3.8% accuracy, 6× cheaper, 6× smaller embeddings, 4× context
Tweet media one
1
13
136
@tengyuma
Tengyu Ma
2 years
Some interesting (surprising?) findings: Larger models know more but are also more receptive to new information in the in-context exemplars. Instruction tuning helps in-context learning of semantically unrelated labels but hurts in-context learning of flipped labels.
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI paper: How do language models do in-context learning? Large language models (GPT-3.5, PaLM) can follow in-context exemplars, even if the labels are flipped or semantically unrelated. This ability wasn’t present in small language models. 1/
Tweet media one
2
22
125
@tengyuma
Tengyu Ma
1 year
OpenAI's embedding v3 models are out 🎉! Curious about their quality? We tested on 11 code retrieval datasets & 9 industry-domain datasets:
1. @OpenAI v3 > ada-002 & Cohere (except v3-small on code)
2. voyage-code-2 is the best, with a +14% margin on code & +3% on industry docs 🚀
Tweet media one
@tengyuma
Tengyu Ma
1 year
🆕📢 @Voyage_AI_'s new embedding models: voyage-2 & voyage-code-2!
1. 🔼 +17% accuracy gain on 11 code datasets vs @OpenAI
2. 🥇 #1 on MTEB & diverse corpora
3. ⚡ Production-ready latency
4. 🛒 Available on AWS Marketplace, meeting compliance requirements
5. 📜 16K context length
#LLMs
Tweet media one
6
22
126
@tengyuma
Tengyu Ma
4 years
Past theory for domain generalization often suggests that the number of training domains has to scale linearly in the dimension. Any hope of working with a more realistic # of domains? We give an algorithm that needs only a log-in-dim # of domains on a toy data model. 1/6
Tweet media one
2
33
119
@tengyuma
Tengyu Ma
9 months
“The length of CoT can be super long”: if I still remember the paper correctly, the number of CoT steps needed is linear in the number of gates needed to solve the problem, or linear in the input length for an NC1-complete problem. I also vaguely remember that it takes an exponential number.
@tydsh
Yuandong Tian
9 months
While CoT is super useful, I kindly disagree that blindly scaling it up is all we need. The paper proposes a universal approximation theorem by explicitly constructing Transformer weights to fit the family of tasks. Although the depth can be constant, the length of CoT can.
3
12
117
@tengyuma
Tengyu Ma
1 year
New engineering interview at @Voyage_AI_ starting from Q3:
1. pick one or more challenging questions that typically take 5 hours
2. ask the candidate to solve them in 1 hour
3. allow them to use GPT-4
Pros: tests the right skill. Cons: need to recalibrate when GPT-5 is out.
15
10
109
@tengyuma
Tengyu Ma
2 years
Sophia applies a stronger penalization to updates in sharp dimensions (w/ large Hessian) than to flat dimensions (w/ small Hessian), ensuring a uniform loss decrease across all parameter dimensions. Adam converges more slowly in flat dimensions. Visualization on a toy case below.
Tweet media one
3
7
111
@tengyuma
Tengyu Ma
6 years
Very honored to receive it!
@PrincetonCS
Princeton Computer Science
6 years
Recent Ph.D. grads Tengyu Ma (center left) and Ryan Beckett (center right) received Honorable Mentions for the @TheOfficialACM's Doctoral Dissertation Award for Outstanding Ph.D. Thesis at their gala event on Saturday. @tengyuma
Tweet media one
3
1
110
@tengyuma
Tengyu Ma
2 years
(b) We used a lot of *theoretical* thinking in the research process, besides recalling our optimization classes :). Joint work with an amazing team at Stanford: @HongLiu9903, @zhiyuanli_, @dlwh, @percyliang.
3
4
107
@tengyuma
Tengyu Ma
3 years
Does the validation pre-training loss in language modeling always correlate with downstream performance? We find it's not necessarily true, even with the same architecture. This means that the implicit bias of pre-training algorithms is another key factor!
Tweet media one
3
20
104
@tengyuma
Tengyu Ma
2 years
Sophia pre-conditions the gradient with a lightweight estimate of the diagonal Hessian, followed by an element-wise clipping (pseudo-code in the first figure), and is easily implementable with the PyTorch code below.
Tweet media one
7
7
94
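The PyTorch snippet referenced above was in the attached image; as a stand-in, here is a minimal, illustrative sketch of the update rule described in the thread (hyperparameter names and default values are placeholders, not the released Sophia implementation).

```python
import torch

@torch.no_grad()
def sophia_step(param, grad, m, h, lr=1e-4, beta1=0.96, beta2=0.99,
                rho=1.0, gamma=0.01, eps=1e-12, weight_decay=0.1,
                hess_diag_estimate=None):
    """One Sophia-style update on a single parameter tensor."""
    # EMA of gradients (momentum).
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # EMA of a diagonal-Hessian estimate, refreshed only every k steps
    # (e.g. via a Hutchinson or Gauss-Newton-Bartlett estimator).
    if hess_diag_estimate is not None:
        h.mul_(beta2).add_(hess_diag_estimate, alpha=1 - beta2)
    # Decoupled weight decay.
    param.mul_(1 - lr * weight_decay)
    # Pre-condition by the diagonal Hessian, then clip element-wise to [-rho, rho].
    update = torch.clamp(m / torch.clamp(gamma * h, min=eps), -rho, rho)
    param.add_(update, alpha=-lr)
```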
@tengyuma
Tengyu Ma
1 year
🆕📢 @Voyage_AI_'s new embedding model for legal and long-context retrieval and RAG: voyage-law-2!
1. 🥇 #1 on the MTEB legal retrieval benchmark by a large margin
2. 📜 Best quality for long context (16K)
3. ✨ Improved quality across domains
4. 🛒 On AWS Marketplace
#RAG #LLMs
Tweet media one
4
25
92
@tengyuma
Tengyu Ma
6 years
How can we make minority/rare classes generalize better? Enforcing class-dependent margins (following a theory-inspired formula) helps a lot! Please check out our recent paper if you are interested in learning from imbalanced datasets @adnothing.
1
19
94
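If this refers to the label-distribution-aware margin (LDAM) loss, the theory-inspired formula assigns larger margins to rarer classes, roughly:

```latex
\Delta_j \;=\; \frac{C}{n_j^{1/4}}, \qquad
\mathcal{L}(x, y) \;=\; -\log
\frac{e^{\,z_y - \Delta_y}}{e^{\,z_y - \Delta_y} + \sum_{j \neq y} e^{\,z_j}},
```

where $n_j$ is the number of training examples in class $j$, $z$ are the logits, and $C$ is a tuned constant.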
@tengyuma
Tengyu Ma
2 years
Our method decomposes the embedding of a word (e.g., "spring") into a sum of multiple sense/discourse embeddings (e.g., the different meanings of "spring"). I still remember those fun times when I learned more meanings of polysemous English words 👇
Tweet media one
@prfsanjeevarora
Sanjeev Arora
2 years
With sparse coding again popular for interpretability in LLMs, please look at older work! "Latent structure in word embeddings", "Atoms of meaning", "Decoding brain fMRI via sentence embeddings"
2
11
91
@tengyuma
Tengyu Ma
3 years
In RL, minimax results bound the regret on the *worst-case* instance; can we customize the algo & bound to a particular instance? Somewhat surprisingly, we identify the optimal asymptotic regret for *each* instance and design an algo to achieve it.
Tweet media one
3
15
88
@tengyuma
Tengyu Ma
4 years
Optimism is the golden rule in bandits and RL, but deep RL practice doesn't use it much. We prove that optimism with neural nets explores excessively. Instead, we design a new algorithm (ViOL) that explores via a model-based curvature estimate, with regret bounds.
1
9
86
@tengyuma
Tengyu Ma
10 months
Super excited to partner with @harvey__ai's team and @gabepereyra! TLDR: our legal embedding model, voyage-law-2, already improves retrieval quality on general legal documents, but fine-tuning it on specific legal data helps a lot more.
@gabepereyra
Gabe Pereyra
10 months
@harvey__ai is excited to partner with @tengyuma and @Voyage_AI_ to build custom legal embeddings for our RAG and agent systems.
Tweet media one
2
15
79
@tengyuma
Tengyu Ma
2 years
On GPT-2 models with sizes from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time (Fig. a & b, above). The scaling law based on model sizes from 125M to 770M (Fig. c, above) favors Sophia over Adam.
1
2
73
@tengyuma
Tengyu Ma
1 year
A cross-encoder reranker can significantly improve your search/retrieval accuracy and is incredibly easy to use — it's just an API call on top of any search method. Very impressed by our team at @Voyage_AI_ for the amazing work on this SOTA reranker!
@VoyageAI
Voyage AI by MongoDB
1 year
🆕📢 We are thrilled to launch rerank-1, our best general-purpose and multilingual reranker! It refines the ranking of your search results with cross-encoder transformers. It outperforms Cohere's english-v3 on English datasets and multilingual-v3 on multilingual datasets 🚀
Tweet media one
3
9
73
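"Just an API call on top of any search method" can be sketched as: retrieve candidates however you like, then rerank them. A hypothetical example (the client signature, model name, and result fields are assumptions, not verified documentation):

```python
import voyageai

vo = voyageai.Client()  # assumes VOYAGE_API_KEY is set in the environment

query = "how do I rotate an API key?"
candidates = [
    "Rotate keys from the dashboard under Settings > API keys.",
    "Our SDK supports batch embedding requests.",
    "Billing is prorated at the end of each month.",
]  # e.g. the top hits from any lexical or vector search

# Cross-encoder reranking as a single call on top of the first-stage retrieval.
reranked = vo.rerank(query=query, documents=candidates, model="rerank-1", top_k=2)
best = [r.document for r in reranked.results]
```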
@tengyuma
Tengyu Ma
2 years
Sophia also improves pre-training stability. It doesn't need the re-parameterization trick where the temperature of attention depends on the layer index. Gradient clipping is triggered much less frequently than with Adam and Lion.
Tweet media one
1
2
68
@tengyuma
Tengyu Ma
2 years
It's great to see ML theory workshops again at NeurIPS.
@zhiyuanli_
Zhiyuan Li
2 years
🚨💡 We are organizing a workshop on Mathematics of Modern Machine Learning (M3L) at #NeurIPS2023! 🚀 Join us if you are interested in exploring theories for understanding and advancing modern ML practice. Submission deadline: October 2, 2023. @M3LWorkshop
Tweet media one
0
4
67
@tengyuma
Tengyu Ma
7 months
With strong multimodal embeddings, we can search PDFs, slides, tables, etc., via screenshots, without unstructured-data ETL. The embedding models will do even more in the future!
@VoyageAI
Voyage AI by MongoDB
7 months
📢 Announcing voyage-multimodal-3, our first multimodal embedding model! It vectorizes interleaved text & images, capturing key visual features from screenshots of PDFs, slides, tables, figures, etc. A 19.63% accuracy gain on 3 multimodal retrieval tasks (20 datasets)! 🧵🧵
Tweet media one
3
6
71
@tengyuma
Tengyu Ma
1 year
In increasing difficulty:
1. train artificial neural nets
2. train one's own biological neural net
3. train others' neural nets
Level 1.5: train others' neural nets when others are also willing to train their own — that's why profs can mentor even though they may fail at 2.
1
2
68
@tengyuma
Tengyu Ma
1 year
Thanks for trying our optimizer! Hope that Sophia can save some compute for FAIR and others :)
@ArmenAgha
Armen Aghajanyan
1 year
No replies here. Decided to try it out on our own benchmarks, consisting of auto-regressive, multi-modal pre-training at scale. Pretty complex setting. Yellow: tuned (LR) AdamW. Purple: tuned (LR) Sophia. Average loss:
Tweet media one
1
3
65
@tengyuma
Tengyu Ma
5 years
Online learning is a classical approach to addressing domain shifts in a data stream, but it requires iterative labeling. We propose to query the labels of only uncertain data points and give a regret guarantee that leverages the hidden domain structure.
1
3
65
@tengyuma
Tengyu Ma
2 years
Simple importance resampling can help select good training data for LMs: our filtered Pile dataset improves GLUE accuracy by 2%! Main idea: estimate importance weights in an n-gram feature space.
@sangmichaelxie
Sang Michael Xie
2 years
Data selection for LMs (GPT-3, PaLM) is done with heuristics that select data by training a classifier for high-quality text. Can we do better? Turns out we can boost downstream GLUE acc by 2+% by adapting the classic importance resampling algorithm. 🧵
Tweet media one
2
7
63
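A minimal, illustrative sketch of the idea (hashed n-gram features, importance weights as a likelihood ratio, Gumbel top-k resampling); this is a simplification, not the released DSIR code.

```python
import numpy as np
from collections import Counter

def hashed_ngram_counts(text: str, n: int = 2, buckets: int = 10_000) -> Counter:
    toks = text.lower().split()
    grams = [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return Counter(hash(g) % buckets for g in grams)

def log_ngram_model(texts, buckets: int = 10_000, alpha: float = 1.0) -> np.ndarray:
    counts = np.full(buckets, alpha)            # add-alpha smoothing
    for t in texts:
        for b, c in hashed_ngram_counts(t, buckets=buckets).items():
            counts[b] += c
    return np.log(counts / counts.sum())

def select_by_importance_resampling(raw_texts, target_texts, k, buckets=10_000, seed=0):
    log_p_target = log_ngram_model(target_texts, buckets)
    log_p_raw = log_ngram_model(raw_texts, buckets)
    # Log importance weight of each raw example under the two n-gram feature models.
    log_w = []
    for t in raw_texts:
        feats = hashed_ngram_counts(t, buckets=buckets)
        log_w.append(sum(c * (log_p_target[b] - log_p_raw[b]) for b, c in feats.items()))
    # Gumbel top-k = sampling without replacement proportional to the weights.
    rng = np.random.default_rng(seed)
    keys = np.asarray(log_w) + rng.gumbel(size=len(raw_texts))
    return [raw_texts[i] for i in np.argsort(-keys)[:k]]
```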
@tengyuma
Tengyu Ma
7 months
Naive question: I'm still confused about why betting on election results is allowed. What if someone (1) bets on candidate A, (2) does something to increase A's perceived chance of success (e.g., releases a surprising poll or some news), and then (3) bets on B? Guaranteed return?
11
2
61
@tengyuma
Tengyu Ma
6 years
Is NTK (neural tangent kernel) the best way towards understanding deep learning? Our work suggests perhaps it is not, because a simple regularizer can provably provide better generalization:
1
12
60
@tengyuma
Tengyu Ma
4 months
and SoTA among whole-proof generation methods on miniF2F, ProofNet, and PutnamBench, and double the previous best results on LeanWorkBook. (Reposting because it seems that this table gets many more views 😝)
Tweet media one
@tengyuma
Tengyu Ma
4 months
RL + CoT works great for DeepSeek-R1 & o1, but:
1️⃣ Linear-in-log scaling in train & test-time compute
2️⃣ Likely bounded by the difficulty of training problems
Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵
Tweet media one
1
8
60
@tengyuma
Tengyu Ma
1 year
hallucination -> guessing properly with quantification of the confidence level 😊.
@neilbband
Neil Band
1 year
When LLMs are unsure, they either hallucinate or abstain. Ideally, they should clearly express truthful confidence levels. Our #ICML2024 work designs an alignment objective to achieve this notion of linguistic calibration in *long-form generations*. 🧵
Tweet media one
1
14
58
@tengyuma
Tengyu Ma
4 years
An exciting report led by @percyliang and @RishiBommasani! I was involved in the theory section (Sec 4.10), which presents an analysis framework used in many recent papers; please check it out! (Just in case you filter tweets by arXiv links only :) )
@StanfordHAI
Stanford HAI
4 years
NEW: This comprehensive report investigates foundation models (e.g. BERT, GPT-3), which are engendering a paradigm shift in AI. 100+ scholars across 10 departments at Stanford scrutinize their capabilities, applications, and societal consequences.
Tweet media one
1
17
56
@tengyuma
Tengyu Ma
3 months
It's tough when one brain has to handle two "PR"s 😇😇—public relations and pull requests. I feel like I am running an MoE—every time I see "PR", my visual cortex does a quick routing to the right part of my brain.
2
1
52
@tengyuma
Tengyu Ma
5 months
Indulging in @OpenAI's incident report, my real-world analogy is: a CEO introduces a new, extensive perf review system, so that their calendar gets blocked with perf reviews 24/7, so that the board can't even schedule a meeting to override the CEO and stop it. 🤷‍♂️🤷‍♀️🤣🤣
4
5
53
@tengyuma
Tengyu Ma
5 years
The main principle is conservatism in the face of uncertainty: we penalize the reward by the uncertainty of the learned dynamics. We also have a new version with more ablation studies! (Same link as below.)
@chelseabfinn
Chelsea Finn
5 years
Offline RL may make it possible to learn behavior from large, diverse datasets (like the rest of ML). We introduce MOPO: Model-based Offline Policy Optimization. w/ Tianhe Yu, Garrett Thomas, Lantao Yu, @StefanoErmon, @james_y_zou, @svlevine, @tengyuma.
0
9
50
@tengyuma
Tengyu Ma
1 year
With the right training process, a multilingual embedding model can also have great performance on English 😊.
@VoyageAI
Voyage AI by MongoDB
1 year
🌍📢 Launching our multilingual embeddings, voyage-multilingual-2!
👑 Average 5.6% gain on evaluated languages, including French, German, Japanese, Spanish, and Korean
📚 32K context length
🛒 On AWS Marketplace
Check us out! 👉🏼 The first 50M tokens are on us. #RAG #LLM
Tweet media one
3
6
48
@tengyuma
Tengyu Ma
2 years
We think embeddings are an under-loved, but incredibly important, part of everyone’s retrieval stack. So we set out to build just better embeddings. Voyage embeddings are SOTA on MTEB and 9 held-out benchmarks covering industry domains. Learn more here:
Tweet media one
3
8
49
@tengyuma
Tengyu Ma
2 years
@bradneuberg @HongLiu9903 @zhiyuanli_ @dlwh @percyliang @StanfordAILab @stanfordnlp @StanfordCRFM @Stanford Unfortunately we've used all of our tiny compute resources on LLMs. We will release the code very soon, and hopefully someone will try it out.
2
0
48
@tengyuma
Tengyu Ma
2 years
Navigating data of diverse quality from multiple sources to train your mega LLMs? Mix them right with DoReMi 🎶: optimized data domain mixtures lead to a 2.6x speed-up! @StanfordAILab @stanfordnlp @StanfordCRFM @GoogleAI
@sangmichaelxie
Sang Michael Xie
2 years
Should LMs train on more books, news, or web data? Introducing DoReMi 🎶, which optimizes the data mixture with a small 280M model. Our data mixture makes 8B Pile models train 2.6x faster, get +6.5% few-shot acc, and get lower pplx on *all* domains! 🧵⬇️
Tweet media one
4
7
48
@tengyuma
Tengyu Ma
2 years
Very cool work! "Let's learn from these examples", "Let's think step by step", and "Let's recall what you knew". Prompting LLMs sounds increasingly similar to tutoring elementary or high school students 😀😀.
@michiyasunaga
Michi Yasunaga
2 years
Introducing Analogical Prompting, a new method to help LLMs solve reasoning problems. Idea: To solve a new problem, humans often draw from past experiences, recalling similar problems they have solved before. Can we prompt LLMs to mimic this? [1/n]
Tweet media one
2
5
44
@tengyuma
Tengyu Ma
2 years
We partnered with @LangChainAI to help enhance their official chatbot, live at . Our base and fine-tuned embeddings both improve retrieval and response quality, and the latter is deployed! Check out for more details.
Tweet media one
5
8
48
@tengyuma
Tengyu Ma
6 years
We released the code of the model-based RL algorithm SLBO developed in the paper . Please check it out!
0
6
44
@tengyuma
Tengyu Ma
1 year
Very excited to announce @Voyage_AI_'s SOTA reranker!
@VoyageAI
Voyage AI by MongoDB
1 year
Rerankers refine the retrieval in RAG. 🆕📢 Excited to announce our first reranker, rerank-lite-1: state-of-the-art in retrieval accuracy on 27 datasets across domains (law, finance, tech, long docs, etc.), enhancing various search methods, vector-based or lexical. 🧵
Tweet media one
2
7
46
@tengyuma
Tengyu Ma
5 years
Existing meta-RL methods tend to suffer from task distribution shifts. We mitigate this issue by proposing model-based adversarial meta-RL (AdMRL), which finds adversarial tasks (with a non-trivial task gradient formula) and trains on them. Arxiv:
Tweet media one
3
7
42
@tengyuma
Tengyu Ma
4 years
Surprised and puzzled by why in-context learning emerges? Please check out our theoretical work on it, led by @sangmichaelxie!
@sangmichaelxie
Sang Michael Xie
4 years
Why can GPT-3 magically learn tasks? It just reads a few examples, without any parameter updates or explicitly being trained to learn. We prove that this in-context learning can emerge from modeling long-range coherence in the pretraining data! (1/n)
Tweet media one
1
3
43
@tengyuma
Tengyu Ma
7 years
Looking forward to UAI! Will give a tutorial on the theory of deep learning. (Thanks to @shakir_za for organizing!)
@shakir_za
Shakir Mohamed
7 years
My work as tutorial chair for @uai2018 is over and we have a fantastic set of tutorials this year. Everyone going to Monterey, you are in for a treat, with @AnimaAnandkumar @zacharylipton @riedelcastro @tengyuma and others ✍🏾📚🏖
0
8
43
@tengyuma
Tengyu Ma
2 years
Hong (@HongLiu9903) will present Sophia at the ES-FOMO workshop at #ICML2023 (7/29, 1pm, Ballroom A). Please join us and check out the latest results of Sophia on 1.5B and 7B models!! (Also adding more results to the repo.) Code: Paper:
@tengyuma
Tengyu Ma
2 years
Adam, a 9-year-old optimizer, is the go-to for training LLMs (e.g., GPT-3, OPT, LLaMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). 🧵⬇️
Tweet media one
12
7
41
@tengyuma
Tengyu Ma
2 years
It's really enjoyable to collaborate with the @LangChainAI team (@baga_tur, @j_schottenstein, @zebriez, and others) on integrating Voyage embeddings and improving Chat LangChain!! 🚀🚀 Try our embeddings and/or let us help improve your RAG!
@LangChainAI
LangChain
2 years
🚀 @Voyage_AI_ + LangChain 🚀 A few weeks ago @tengyuma and the folks at Voyage AI came to us saying they had embedding models that would markedly improve retrieval for Chat LangChain. We agreed to give it a shot and... turns out they were right! Check out our latest blog to
2
3
40
@tengyuma
Tengyu Ma
1 year
Typical feedback for grant proposals, blog posts, theses, etc.:
2019: please polish the language
2024: please do more prompt engineering
0
1
38
@tengyuma
Tengyu Ma
3 years
The more theoretical paper here ( ) will be in #NeurIPS2022. Please check out our poster at Hall J #920, Wed 11am-1pm!
@tengyuma
Tengyu Ma
3 years
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps:
Tweet media one
1
6
38
@tengyuma
Tengyu Ma
1 year
@jiayq Your dad might be reading some AI-generated news article.
2
0
39
@tengyuma
Tengyu Ma
2 months
We scaled STP's training compute by another 2x, and achieved a new SoTA for whole-proof generation methods on miniF2F, ProofNet, and LeanWorkbook! Check out for our updated code, data, and model!
Tweet media one
@tengyuma
Tengyu Ma
4 months
RL + CoT works great for DeepSeek-R1 & o1, but:
1️⃣ Linear-in-log scaling in train & test-time compute
2️⃣ Likely bounded by the difficulty of training problems
Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵
Tweet media one
1
7
38
@tengyuma
Tengyu Ma
4 years
This appears in #ICLR2021. Please check out our paper, videos, poster, code, etc.! ICLR poster link: arXiv: GitHub:
@tengyuma
Tengyu Ma
5 years
DL models tend to struggle with heteroskedastic and imbalanced datasets, where long-tailed labels have varying levels of uncertainty, partly bc it's hard to distinguish mislabeled, ambiguous, and rare examples. We propose a new regularization technique:
Tweet media one
3
4
38
@tengyuma
Tengyu Ma
4 years
#NeurIPS2020 How to accelerate distributed optimizers? We develop FedAc, which accelerates standard local SGD with provable improvements in convergence rate and communication efficiency. It turns out the vanilla acceleration is not the best! Paper:
Tweet media one
1
4
36
@tengyuma
Tengyu Ma
1 year
@OpenAI's embedding v3 models are out 🎉! Curious about their quality? We tested on 11 code retrieval datasets & 9 industry-domain datasets:
1. OpenAI v3 > ada-002 and Cohere (except v3-small on code)
2. voyage-code-2 is the best, with a +14% margin on code & +3% on industry docs 🚀
Tweet media one
@tengyuma
Tengyu Ma
1 year
🆕📢 @Voyage_AI_'s new embedding models: voyage-2 & voyage-code-2!
1. 🔼 +17% accuracy gain on 11 code datasets vs @OpenAI
2. 🥇 #1 on MTEB & diverse corpora
3. ⚡ Production-ready latency
4. 🛒 Available on AWS Marketplace, meeting compliance requirements
5. 📜 16K context length
#LLMs
Tweet media one
2
4
35
@tengyuma
Tengyu Ma
4 months
Please check out our paper for more experiments, examples of generated conjectures, and ablation studies. Joint work with @kefandong. Feedback and comments are more than welcome!
1
2
36
@tengyuma
Tengyu Ma
4 months
Inspired by how mathematicians continue advancing the field, we train an LLM that conjectures and attempts proofs; then we iteratively reinforce/re-train it with correct, elegant, novel, and approachable generated conjectures and correctly generated proofs.
Tweet media one
1
1
34
@tengyuma
Tengyu Ma
2 years
Hong (@HongLiu9903) will give an oral presentation at #ICML2023 on this paper (Ballroom A, Jul 27, 16:04 HST). The poster presentation will be at Poster Session 5 (Exhibit Hall 1), Jul 27, 10:30 HST. Please check them out!
@tengyuma
Tengyu Ma
3 years
Does the validation pre-training loss in language modeling always correlate with downstream performance? We find it's not necessarily true, even with the same architecture. This means that the implicit bias of pre-training algorithms is another key factor!
Tweet media one
0
4
31
@tengyuma
Tengyu Ma
3 years
For the four papers with both 3's and 7's, the total count of ratings is:
# of 3's = 4
# of 4's = 2
# of 6's = 2
# of 7's = 8
Can I compete for the most controversial author? 😂
1
0
31
@tengyuma
Tengyu Ma
4 years
Recent algorithms for offline RL perform well; do they still work in the multi-agent setting? Surprisingly, they may underperform due to challenges in multi-agent optimization. We propose OMAR, which improves action optimization with a zeroth-order approach. 1/n
Tweet media one
3
3
30
@tengyuma
Tengyu Ma
3 years
The results are on linear models. See Figure 8.12 of  page 115, for details. The other two relationships (error vs # params, norm vs # params) are visualized below. Experiments and figures are by Kefan Dong (@kefandong).
Tweet media one
3
3
31
@tengyuma
Tengyu Ma
1 year
It's sad to lose Luca. His papers on spectral graph theory were among the first few papers I read in graduate school and have been inspirational since then.
@boazbaraktcs
Boaz Barak
1 year
Tragic news. Luca Trevisan passed away today. The talk he prepared in his final weeks for the TCS4all workshop will be given virtually in his honor on Monday. I hope many of the TCS community can attend.
0
2
27
@tengyuma
Tengyu Ma
4 years
We need safe RL. Can we hope for *zero* training-time safety violations without prior knowledge of the dynamics (but with a trivial initial safe policy)? Many prior works need >100 violations even for 2-4-dimensional state spaces, and our algorithm brings it to zero!
Tweet media one
2
7
31
@tengyuma
Tengyu Ma
5 years
IMO, the (meta-)reviewer matching at ML conferences can be improved. Since last year I have manually re-adjusted the system's assignments and found that the review quality in my stack improved. The pain point is that my ideal reviewers are already matched to other (suboptimal?) papers.
6
2
30
@tengyuma
Tengyu Ma
2 years
@capetorch @HongLiu9903 @zhiyuanli_ @dlwh @percyliang @StanfordAILab @stanfordnlp @StanfordCRFM @Stanford Do you mean learning rate schedules? We are using the standard cosine LR schedule, which has an initial warmup period and eventually decays to 1/10 of the peak LR. The LR schedule is also plotted in Figure 5a.
0
0
30
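A minimal sketch of the schedule described in the reply (linear warmup, then cosine decay down to 0.1x the peak LR); the warmup length here is illustrative, not the run's actual setting.

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_steps: int = 2000, final_frac: float = 0.1) -> float:
    if step < warmup_steps:                      # linear warmup to the peak LR
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))          # goes 1 -> 0
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)  # peak -> 0.1 * peak
```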