Adam Ibrahim Profile

Adam Ibrahim (@ai_phd)
Followers: 532 · Following: 160 · Media: 3 · Statuses: 74
Paris · Joined June 2019
@ai_phd
Adam Ibrahim
1 year
Our tech report for Zamba-7B-v1 is out. We manage to come close to the performance of Llama 3 8B, Mistral 7B, and others with only 1T tokens, while offering faster inference and less memory usage at a fixed context length. Read on to learn about our not-so-secret sauce!
@QuentinAnthon15
Quentin Anthony
1 year
Zyphra is dropping the tech report for Zamba-7B, along with:
- Model weights (phase 1 and final annealed)
- Inference/generation code (both pure PyTorch and HuggingFace)
- Tech report
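For readers who want to try the release, here is a minimal sketch of loading the model with Hugging Face transformers. The repo ID below is an assumption (the original links were stripped from this page), and early Zamba releases may require trust_remote_code for the custom architecture; check Zyphra's Hugging Face organization for the exact name.

```python
# Minimal sketch: loading the released Zamba weights with Hugging Face
# transformers. The repo ID "Zyphra/Zamba-7B-v1" is an assumption (the
# original links were stripped from this page); early releases may need
# trust_remote_code=True for the custom architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba-7B-v1"  # assumed repo ID; check Zyphra's HF page
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "State-space models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```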
@ai_phd
Adam Ibrahim
3 days
RT @RylanSchaeffer: Another #ICML2025 paper! Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?…
@ai_phd
Adam Ibrahim
1 year
RT @RylanSchaeffer: Excited to announce our paper ⬇️ was selected as an **Outstanding** paper at @TiFA_ICML2024 🔥🔥🔥. What did the paper sh…
@ai_phd
Adam Ibrahim
1 year
RT @RylanSchaeffer: ❤️‍🔥❤️‍🔥 Excited to share our new paper ❤️‍🔥❤️‍🔥. **Why Has Predicting Downstream Capabilities of Frontier AI Models wit…
@ai_phd
Adam Ibrahim
1 year
Worth noting that we're working with @huggingface to release the model over the next week. Stay tuned!
@ai_phd
Adam Ibrahim
1 year
something cool we did.
@BerenMillidge
Beren Millidge
1 year
Extremely excited to announce Zamba! A 7B SSM with a novel architecture, competitive with Gemma-7B and Mistral-7B and significantly beating Llama2-7B, trained on only 1T open training tokens.
@ai_phd
Adam Ibrahim
1 year
RT @TLesort: Look at our preprint on Continual Learning for increasing the scalability of LLM pretraining. A great piece of work led by @…
@ai_phd
Adam Ibrahim
1 year
Here is the full paper of the continual pretraining project I have been working on over the last year. I encourage you to check it out if you pretrain LLMs (in particular, I recommend starting with the takeaways in Section 2 and the Table of Contents at the start of the appendix).
@benjamintherien
Benjamin Thérien
1 year
Interested in seamlessly updating your #LLM on new datasets to avoid wasting previous efforts & compute, all while maintaining performance on past data? Excited to present Simple and Scalable Strategies to Continually Pre-train Large Language Models! 🧵1/N
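As an illustration of the data side of this idea, here is a minimal sketch of mixing a new corpus with a small replay fraction of the previously seen pretraining data using Hugging Face datasets. The dataset names are hypothetical placeholders and the 95/5 mixing ratio is an illustrative assumption, not the paper's exact recipe.

```python
# Sketch: interleave a new pretraining corpus with a small "replay" share
# of the previously seen corpus, so the model keeps seeing old data while
# adapting to the new one. Dataset names are hypothetical placeholders and
# the 95/5 split is an illustrative assumption.
from datasets import load_dataset, interleave_datasets

old_data = load_dataset("old_pretraining_corpus", split="train", streaming=True)  # placeholder
new_data = load_dataset("new_pretraining_corpus", split="train", streaming=True)  # placeholder

# ~95% of samples come from the new corpus, ~5% are replayed from the old one.
mixture = interleave_datasets([new_data, old_data], probabilities=[0.95, 0.05], seed=42)

for example in mixture.take(3):  # peek at a few mixed examples
    print(example)
```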
@ai_phd
Adam Ibrahim
1 year
RT @_akhaliq: Simple and Scalable Strategies to Continually Pre-train Large Language Models. Large language models (LLMs) are routinely pre…
@ai_phd
Adam Ibrahim
1 year
RT @arankomatsuzaki: Mila presents Simple and Scalable Strategies to Continually Pre-train Large Language Models. Shows efficient updates t…
@ai_phd
Adam Ibrahim
1 year
RT @QuentinAnthon15: State-space models (SSMs) like Mamba and mixture-of-experts (MoE) models like Mixtral both seek to reduce the computat…
@ai_phd
Adam Ibrahim
2 years
Looking forward to seeing you at the #NeurIPS2023 #NeurIPS23 ENLSP workshop (rooms 206-207), where we'll have a poster about this work at 16:15!
@ai_phd
Adam Ibrahim
2 years
1/ Ever wondered how to keep pretraining your LLM as new datasets continue to become available, instead of pretraining from scratch every time, wasting prior effort and compute? A thread 🧵
@ai_phd
Adam Ibrahim
2 years
RT @irinarish: @PranshuRanjan1 @SarvamAI Hi-NOLIN Hindi model will be presented by our @NolanoOrg team (@imtejas13 @_AyushKaushal) and col…
@ai_phd
Adam Ibrahim
2 years
RT @ReyhaneAskari: (1/8) The great success of diffusion models such as Stable Diffusion, DALLE & Emu has raised questions about the use o…
@ai_phd
Adam Ibrahim
2 years
RT @M_L_Richter: Rarely been so excited about a paper. Our model has a quality level higher than Stable Diffusion 2.1 at a fraction (less t…
@ai_phd
Adam Ibrahim
2 years
16/ See the arXiv link for a few more results, and stay tuned for the next steps, featuring among others larger models, replay, comparisons with retraining on both datasets mixed together (for those with the budget), evals, and longer sequences of tasks!
@ai_phd
Adam Ibrahim
2 years
15/ This will allow you to leverage having already trained on many tokens, with our results suggesting that it will beat training a new model from scratch on the new dataset.
@ai_phd
Adam Ibrahim
2 years
14/ So, what are the takeaways? Given a budget of iterations to train on a new dataset, you might as well spend it on continuing to pretrain a preexisting model by rewarming the lr and decaying it again to the same max lr value as the one used for the initial pretraining.
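A minimal sketch of that schedule follows, assuming a linear rewarm back up to the original max learning rate followed by a cosine decay; the concrete values (3e-4 max lr, 1% warmup) are illustrative assumptions, not the paper's settings.

```python
# Sketch of the "rewarm then re-decay" recipe from the tweet above: when
# continuing pretraining on a new dataset, warm the learning rate back up
# to the same max value used during the initial pretraining, then decay it
# again (cosine decay here). All concrete values are illustrative assumptions.
import math

def rewarmed_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_frac=0.01):
    """Learning rate at `step` of a continual-pretraining run of `total_steps`."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear rewarming from ~0 back up to the original max lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Usage inside a training loop:
#   for g in optimizer.param_groups:
#       g["lr"] = rewarmed_cosine_lr(step, total_steps)
```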
@ai_phd
Adam Ibrahim
2 years
13/ Turns out the warm-up, crucial for pretraining, comes at a steep cost. Even without changing the dataset (i.e. using only the Pile), rewarming+decaying the lr causes a performance hit from which the model struggles to recover. So we have to be careful/patient to benefit from it!
@ai_phd
Adam Ibrahim
2 years
12/ An important point is that rewarming+redecaying the lr requires training long enough to be beneficial, with some of the final trends being difficult to predict initially. But why do we get that initial increase in the loss when we resume training? Distributional shift?