Adam Ibrahim Profile

Adam Ibrahim (@ai_phd)
Followers: 532 · Following: 160 · Media: 3 · Statuses: 74
Paris · Joined June 2019
@ai_phd
Adam Ibrahim
1 year
Our tech report for Zamba-7B-v1 is out. We manage to come close to the performance of Llama 3 8B, Mistral 7B, and others with only 1T tokens, while offering faster inference and less memory usage at a fixed context length. Read on to learn about our not-so-secret sauce!
@QuentinAnthon15
Quentin Anthony
1 year
Zyphra is dropping the tech report for Zamba-7B, along with:
- Model weights (phase 1 and final annealed)
- Inference/generation code (both pure PyTorch and HuggingFace)
- Tech report
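For readers who want to try the release, here is a minimal sketch of loading the model with Hugging Face transformers. The repo ID below is an assumption (the original links were stripped from this page), and early Zamba releases may require trust_remote_code for the custom architecture; check Zyphra's Hugging Face organization for the exact name.

```python
# Minimal sketch: loading the released Zamba weights with Hugging Face
# transformers. The repo ID "Zyphra/Zamba-7B-v1" is an assumption (the
# original links were stripped from this page); early releases may need
# trust_remote_code=True for the custom architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba-7B-v1"  # assumed repo ID; check Zyphra's HF page
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "State-space models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```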
@ai_phd
Adam Ibrahim
3 days
RT @RylanSchaeffer: Another #ICML2025 paper! Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?…
@ai_phd
Adam Ibrahim
1 year
RT @RylanSchaeffer: Excited to announce our paper ⬇️ was selected as an **Outstanding** paper at @TiFA_ICML2024 🔥🔥🔥. What did the paper sh…
@ai_phd
Adam Ibrahim
1 year
RT @RylanSchaeffer: ❤️‍🔥❤️‍🔥 Excited to share our new paper ❤️‍🔥❤️‍🔥. **Why Has Predicting Downstream Capabilities of Frontier AI Models wit…
@ai_phd
Adam Ibrahim
1 year
Worth noting that we're working with @huggingface to release the model over the next week. Stay tuned!
@ai_phd
Adam Ibrahim
1 year
something cool we did.
@BerenMillidge
Beren Millidge
1 year
Extremely excited to announce Zamba! A 7B SSM with a novel architecture, competitive with Gemma-7B and Mistral-7B and significantly beating Llama2-7B, trained on only 1T open training tokens.
@ai_phd
Adam Ibrahim
1 year
RT @TLesort: Look at our preprint on Continual Learning for increasing the scalability of LLM pretraining. A great piece of work led by @…
@ai_phd
Adam Ibrahim
1 year
Here is the full paper of the continual pretraining project I have been working on over the last year. I encourage you to check it out if you pretrain LLMs (in particular, I recommend starting with the takeaways in Section 2 and the Table of Contents at the start of the appendix).
@benjamintherien
Benjamin Thérien
1 year
Interested in seamlessly updating your #LLM on new datasets to avoid wasting previous efforts & compute, all while maintaining performance on past data? Excited to present Simple and Scalable Strategies to Continually Pre-train Large Language Models! 🧵1/N
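As an illustration of the data side of this idea, here is a minimal sketch of mixing a new corpus with a small replay fraction of the previously seen pretraining data using Hugging Face datasets. The dataset names are hypothetical placeholders and the 95/5 mixing ratio is an illustrative assumption, not the paper's exact recipe.

```python
# Sketch: interleave a new pretraining corpus with a small "replay" share
# of the previously seen corpus, so the model keeps seeing old data while
# adapting to the new one. Dataset names are hypothetical placeholders and
# the 95/5 split is an illustrative assumption.
from datasets import load_dataset, interleave_datasets

old_data = load_dataset("old_pretraining_corpus", split="train", streaming=True)  # placeholder
new_data = load_dataset("new_pretraining_corpus", split="train", streaming=True)  # placeholder

# ~95% of samples come from the new corpus, ~5% are replayed from the old one.
mixture = interleave_datasets([new_data, old_data], probabilities=[0.95, 0.05], seed=42)

for example in mixture.take(3):  # peek at a few mixed examples
    print(example)
```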
@ai_phd
Adam Ibrahim
1 year
RT @_akhaliq: Simple and Scalable Strategies to Continually Pre-train Large Language Models. Large language models (LLMs) are routinely pre…
@ai_phd
Adam Ibrahim
1 year
RT @arankomatsuzaki: Mila presents Simple and Scalable Strategies to Continually Pre-train Large Language Models. Shows efficient updates t…
@ai_phd
Adam Ibrahim
1 year
RT @QuentinAnthon15: State-space models (SSMs) like Mamba and mixture-of-experts (MoE) models like Mixtral both seek to reduce the computat…
@ai_phd
Adam Ibrahim
2 years
Looking forward to seeing you at the #NeurIPS2023 #NeurIPS23 ENLSP workshop (rooms 206-207), where we'll have a poster about this work at 16:15!
@ai_phd
Adam Ibrahim
2 years
1/ Ever wondered how to keep pretraining your LLM as new datasets continue to become available, instead of pretraining from scratch every time, wasting prior effort and compute? A thread 🧵
@ai_phd
Adam Ibrahim
2 years
RT @irinarish: @PranshuRanjan1 @SarvamAI Hi-NOLIN Hindi model will be presented by our @NolanoOrg team (@imtejas13 @_AyushKaushal) and col…
@ai_phd
Adam Ibrahim
2 years
RT @ReyhaneAskari: (1/8) The great success of diffusion models such as Stable Diffusion, DALLE & Emu has raised questions about the use o…
@ai_phd
Adam Ibrahim
2 years
RT @M_L_Richter: Rarely been so excited about a paper. Our model has a quality level higher than Stable Diffusion 2.1 at a fraction (less t…
@ai_phd
Adam Ibrahim
2 years
16/ See the arXiv link for a few more results, and stay tuned for the next steps, featuring among others larger models, replay, comparisons with retraining on both datasets mixed together (for those with the budget), evals, and longer sequences of tasks!
@ai_phd
Adam Ibrahim
2 years
15/ This will allow you to leverage having already trained on many tokens, with our results suggesting that it will beat training a new model from scratch on the new dataset.
@ai_phd
Adam Ibrahim
2 years
14/ So, what are the takeaways? Given a budget of iterations to train on a new dataset, you might as well spend it on continuing to pretrain a preexisting model by rewarming the lr and decaying it again to the same max lr value as the one used for the initial pretraining.
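A minimal sketch of that schedule follows, assuming a linear rewarm back up to the original max learning rate followed by a cosine decay; the concrete values (3e-4 max lr, 1% warmup) are illustrative assumptions, not the paper's settings.

```python
# Sketch of the "rewarm then re-decay" recipe from the tweet above: when
# continuing pretraining on a new dataset, warm the learning rate back up
# to the same max value used during the initial pretraining, then decay it
# again (cosine decay here). All concrete values are illustrative assumptions.
import math

def rewarmed_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_frac=0.01):
    """Learning rate at `step` of a continual-pretraining run of `total_steps`."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear rewarming from ~0 back up to the original max lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Usage inside a training loop:
#   for g in optimizer.param_groups:
#       g["lr"] = rewarmed_cosine_lr(step, total_steps)
```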
@ai_phd
Adam Ibrahim
2 years
13/ Turns out the warm-up, crucial for pretraining, comes at a steep cost. Even without changing the dataset (i.e. using only the Pile), rewarming+decaying the lr causes a performance hit from which the model struggles to recover. So we have to be careful/patient to benefit from it!
@ai_phd
Adam Ibrahim
2 years
12/ An important point is that rewarming+redecaying the lr requires training long enough to be beneficial, with some of the final trends being difficult to predict initially. But why do we get that initial increase in the loss when we resume training? Distributional shift?