It’s finally here 🎉🥳
In case you missed us, MosaicML/Databricks is back at it, with a new best-in-class open weight LLM named DBRX: an MoE with 132B total parameters, 36B active, a 32k context length, and trained on 12T tokens 🤯
The next wave of startups seems to be PhD Students dropping out to build MLOps companies because they got good at training models and that turned out to be more valuable than their actual research
@gdequeiroz
The best way I can articulate it is they care deeply about (or have worked hard at) proofs of things DL people just throw away. After spending several pages proving that you have an unbiased estimator of a parameter, it's pretty annoying to see someone just doing a hyperparameter sweep.
On small overtrained models 💪
To reach the loss of a 67B model,
- A 33B model needs 2.3x compute 🚀
- A 13B model needs 25x compute 🤔
- A 7B model needs 7837x compute 🤡
- A 3B model can't match the 67B. Ever. 🪦
With love, from Chinchilla scaling laws.
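A rough sketch of where numbers like these come from, using the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β with the constants fitted by Hoffmann et al. (2022). The exact ratios in the post come from a different fitted law, so treat the printed figures as illustrative, not a reproduction:

```python
# Chinchilla parametric loss constants from Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_ratio(n_small, n_big, d_big):
    """Extra compute (C ~ 6*N*D) a smaller model needs to match loss(n_big, d_big)."""
    gap = loss(n_big, d_big) - E - A / n_small**alpha  # loss budget left for the data term
    if gap <= 0:
        return float("inf")  # no token count closes the gap: can't ever match
    d_small = (B / gap) ** (1 / beta)
    return (n_small * d_small) / (n_big * d_big)

big_n = 67e9
# Token count where this fitted law says 67B is compute-optimal
# (marginal loss from shrinking N equals that from shrinking D).
d_big = (beta * B / (alpha * A * big_n**-alpha)) ** (1 / beta)

for n in (33e9, 13e9, 7e9, 3e9):
    print(f"{n / 1e9:.0f}B needs {compute_ratio(n, big_n, d_big):.1f}x the compute")
```

Same qualitative story: the gap grows viciously as the model shrinks, and below some size the loss floor of the small model sits above the target forever.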
I'm pretty sure the reason LLMs are not "funny" is that humor specifically goes against their programming. Good jokes typically subvert our expectations, which is the opposite of what autoregressively maximizing the highest-likelihood next token is designed for.
I strongly believe that understanding how pruning/distillation works is the key to understanding how all neural networks work in general. I'm far less interested in "how many weights can we remove?" and more interested in "why the heck can we remove them in the first place?!"
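A toy illustration of the puzzle, on a synthetic least-squares problem (not a neural network, and the setup is entirely made up): fit a dense weight vector, zero out the smallest-magnitude half, and watch predictions barely move, because most of the signal lives in a few large weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
w_true = np.zeros(d)
w_true[:4] = [3.0, -2.0, 1.5, 1.0]          # only a few weights actually matter
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)   # small label noise

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # dense fit: all 20 weights nonzero

# Magnitude pruning: zero out the smallest 50% of fitted weights.
k = d // 2
smallest = np.argsort(np.abs(w_hat))[:k]
w_pruned = w_hat.copy()
w_pruned[smallest] = 0.0

rel_err = np.linalg.norm(X @ w_pruned - y) / np.linalg.norm(y)
print(f"relative error after pruning half the weights: {rel_err:.4f}")
```

Here the "why" is visible by construction (the generator was sparse); the open question in the tweet is why trained networks so often behave as if something similar were true.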
You asked for it and we listened! Today we are proud to announce the release of the open-source MPT-30B. Same great architecture, 1T tokens, and now with 8k (and beyond) context! Try it now on our Hugging Face space.
SF is probably the only place on the planet you can be at a bar talking about tokenizers, and hear further down the bar someone else also talking about tokenizers.
Some people love this, some people loathe this.
That’s not entirely true. We released an open source 30B model, described in great detail the data used to train it, and the framework to train it.
Just add GPUs.
Of course if you pay us, we make dealing with the infra much easier 😉
I think people underestimate how hard it is to train a large model like GPT-3 and up.
Lots of challenges arise when reaching billions of parameters, let alone 10B+ params (data management, training stability, parallelism...).
Only a few have succeeded so far and the recipe is not…
Thanks. This is such a great suggestion! In fact, the story DID read excerpts of the Declaration and then DID hear from “a diverse set” of Americans who relied on it through history. Too bad you didn’t listen! “Missed opportunity.” But it’s not too late:
@AlbalakAlon
Yes! We trained a *new* MPT-7B. Exact same arch and code. We were able to hit the same quality with half the number of tokens / training. It's not quite a 2x reduction in training (larger tokenizer), but pretty dang close. We evaluated it on our newest version of the gauntlet.
Not only is it a great general-purpose LLM, beating Llama 2 70B and Mixtral, but it's an outstanding code model, rivaling or beating the best open weight code models!
I still think the best use of ChatGPT is just generating a template you can correct. Personally, I find editing requires a lot less mental strain than staring at a blank page.
@bartbing71
This isn't the right takeaway, but I hate the hassle the most when I catch cheating. Like … can you cheat better so I can enjoy my evening?
I'm absolutely floored by all the community-driven projects around MPT-7B 🤯. Are you using it for something? Tell us (
@MosaicML
), we would love to hear it!
I don't have a SoundCloud, but if you want to check out the MLOps company I work for, my boss (who hasn't officially quit his PhD) would be very grateful
We are trying to change the math on efficient training. Want to train ImageNet in 27 min? Find out how
@kairyssdal
how much do I need to donate to APM or Marketplace to have him start the show off on a Wednesday saying "In Los Angeles, I am Kai Ryssdal. It is Wednesday, my dudes!"
If it turns out Mistral’s new MoE is just 8 copies of its 7B trained “Branch, Train, Merge” style and compiled into an MoE. I suggest we call it “Mixture of Bastards” MoB.
Fun deep learning tip: make your global batch size divisible by lots of numbers. 960 is way better than 1024. Then you can train on far more combinations of GPUs if you want to soak up more capacity. 64, 80, 96, 120, 240, 480: so many options.
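A quick sketch of the arithmetic behind the tip, assuming pure data parallelism where the GPU count has to divide the global batch size evenly:

```python
# Why 960 is a friendlier global batch size than 1024: it has far more
# divisors, so far more cluster sizes split the batch evenly.

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

for bs in (960, 1024):
    ds = divisors(bs)
    usable = [d for d in ds if 32 <= d <= 512]  # plausible GPU counts
    print(f"{bs}: {len(ds)} divisors, GPU counts in [32, 512]: {usable}")
```

960 = 2^6 · 3 · 5 gives 28 divisors, versus only 11 for 1024 = 2^10, which is exactly the flexibility the tweet is pointing at.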
I have to thank my amazing team (the
@DbrxMosaicAI
Data team
@mansiege
@_BrettLarsen
@ZackAnkner
Sean Owen and Tessa Barton) for their outstanding work. We have truly made a generational improvement in our data. Token for token, our data is twice as good as MPT-7B's was.
People have been talking on Twitter about how few people can train XX-billion param LLMs, but I wonder how many people know the dark arts of building great tokenizers.
Took a look at
@databricks
's new open source 132 billion model called DBRX!
1) Merged attention QKV, clamped between (-8, 8)
2) Not RMSNorm: it keeps LayerNorm's mean removal, unlike Llama
3) 4 active experts out of 16 (Mixtral uses 2 of 8)
4)
@OpenAI
's tiktoken tokenizer with a 100K vocab. Llama splits…
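A minimal sketch of point 1 in NumPy, assuming the clamp is applied to the output of a single fused QKV projection; the shapes, names, and random inputs here are illustrative, not DBRX's actual implementation:

```python
import numpy as np

d_model, seq = 16, 4
rng = np.random.default_rng(0)
w_qkv = rng.normal(size=(d_model, 3 * d_model))  # one merged Q/K/V weight matrix
x = rng.normal(size=(seq, d_model)) * 4          # exaggerated activations

# The clamp step: bound the fused projection's output to (-8, 8),
# which keeps attention inputs from blowing up during training.
qkv = np.clip(x @ w_qkv, -8.0, 8.0)
q, k, v = np.split(qkv, 3, axis=-1)              # back to separate Q, K, V

print(qkv.min(), qkv.max())                       # both land inside [-8, 8]
```

The interesting design choice is that the bound is on the projection output rather than on the weights, so it acts like a cheap activation-stability guardrail.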
BREAKING 🚨:
Nancy Pelosi just bought $5M of the AI company Databricks
Unfortunately, Databricks is a privately held company and not available to be bought by the public
Sorry people, you don’t have access to this one.
Today is the first day of my big boy job. I'm excited to finally be full-time at
@MosaicML
! 🥳 (now excuse me while I go flood our cluster with new experiments)
@johnwil80428495
@UniversityStar
Well, a lot of us love people that are old or have compromised immune systems. If we do the right things, we can save lives.
*correction, not open weights. It’s a commercial friendly licensed model. You’ll have to forgive me I was up late 😅 feel free to download it and try it yourself.
🆕 Check out the recent update of 𝕎𝕚𝕝𝕕𝔹𝕖𝕟𝕔𝕙! We have included a few more models including DBRX-Instruct
@databricks
and StarlingLM-beta (7B)
@NexusflowX
which are both super powerful! DBRX-Instruct is indeed the best open LLM; Starling-LM 7B outperforms a lot of even…
@Tim_Dettmers
Truly the shame should go further up the author list.
That being said I think like 30-50% of deep learning papers of the last decade wouldn’t have been published if they had properly tuned baselines.
It’s coming back! The
@jefrankle
lost a bet with the unbelievably talented
@mansiege
and has been subjected to being rad. What an unfortunate turn of events.
I think some people (not necessarily Jesse) misunderstood why there is a lack of transparency. Meta isn't afraid of transparency, or of giving up secret sauce. Big players will not disclose their data until case law over copyright/fair use is better defined. That doesn't mean…
This follows the trend of large organizations releasing models and promoting their capabilities, while not providing the information necessary to understand their behavior: the training data.
To be clear, this is expected, but also highlights the need for more transparency.
Words cannot express how excited I am about this.
@lilac_ai
is *the* best user experience I have found for exploring, cleaning, and understanding data for LLMs. I can’t wait to work with them to build the future of data!
Incredibly excited to announce that
@lilac_ai
is joining
@databricks
!
With Lilac in Databricks, data curation for LLMs will be elevated to the next level: Enterprise AI 🚀🚀
A huge huge thank you to everyone who's supported us on this journey ❤️
I can't believe it's finally happening. Tomorrow I don my wizard robes and become a Dr. Blakeney (again ... I'm still trying to figure out how that works). I'm gonna try and jump in the river if it isn't flooding. If y'all don't hear from me ... check the news.
If you are hiring anything ML/NN related reach out to my boy. We were in the same PhD cohort. Half of my good ideas in my dissertation he helped me brainstorm. One of the best python programmers I know. Immigration laws in this country are bs and have him scrambling.
Well, bad news. I had to leave Tesla. I have a tight deadline of August 14th to get a new employer and save my immigration status 😬. However, I refuse to let this setback define my journey. I am more determined than ever to continue my work in the world of
#AI
and
#DNN
!
I cannot say enough how much I ❤️ love ❤️ our model gauntlet, both for the speed at which it evaluates on its many tasks and the thoughtfulness that went into organizing them. It's been a godsend for us for selecting pre-training data and making modeling decisions.
How can the ML community measure LLM quality in a holistic and standardized manner?
The Mosaic Model Gauntlet encompasses 34 benchmarks, organized into 6 broad categories of competency, evaluated with our blazingly fast open-source ICL eval harness.
🧵👇
It's hard for me to read the statements by OpenAI as anything other than a cynical advertisement for how powerful their products are, and an attempt to scare people off from throwing their hat in the ring.
Kind of genius if this is what happened. Drop the big expensive model, let people analyze it and be amazed, then distill it to save costs.
*If* that is what occurred *and if* it has regressed, this seems like a case where metrics didn't capture the effects of compression.
Was GPT4 just lobotomized?
It responds to queries a lot faster but seems to perform a lot worse than just a few weeks ago (not following instructions properly, making very obvious coding mistakes etc)
Quite likely they replaced it with a distilled smaller model to save costs?