Jose Javier Gonzalez
@jjgort
Followers: 384 · Following: 106 · Media: 5 · Statuses: 36
Research Scientist at Mosaic AI, Databricks. Working on LLMs.
San Francisco, CA
Joined January 2017
Our latest model DBRX is out today! It's a fast 132B MoE and currently the best open-weights model. DBRX really shines at programming, beating all OSS models on the HumanEval programming benchmark, even code fine-tuned models like CodeLLaMA-70B.
Databricks' DBRX model is AMAZING: generally great, but it CRUSHES code. 132B parameters, 12T tokens, 16 experts, 4 per forward pass, 36B active, ~2.6e24 FLOPs.
HumanEval, 0-shot:
- DBRX: 70.1%
- GPT-4: 67%
- Gemini 1.5 Pro: 71.9%
- Mixtral: 54.8%
- Grok: 63.2%
- LLaMA 2: 32.2%
https://t.co/j0Y5xbH04W
0 replies · 4 reposts · 20 likes
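A quick sanity check on the ~2.6e24 figure quoted above, using the common 6 * active_params * tokens rule of thumb for training FLOPs. This is my own back-of-the-envelope arithmetic, not something stated in the tweets:

```python
# Back-of-the-envelope training-FLOPs estimate for a MoE model,
# using the approximation: FLOPs ~= 6 * active_params * training_tokens.
# The numbers are the ones quoted in the tweet above.

active_params = 36e9      # DBRX active parameters per forward pass (36B)
training_tokens = 12e12   # training tokens (12T)

train_flops = 6 * active_params * training_tokens
print(f"~{train_flops:.2e} FLOPs")  # ~2.59e+24, matching the quoted ~2.6e24
```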
Presenting MultiverSeg, a scalable in-context system for interactively segmenting new datasets, at #ICCV2025 today! Poster 110 (10:45 AM-12:45 PM).
0 replies · 2 reposts · 5 likes
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
119 replies · 460 reposts · 3K likes
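I haven't read the Connectionism post, so the snippet below is only a generic illustration of the broader idea it hints at: keeping weight matrices on a constraint set by following each optimizer step with a projection. The row-normalization constraint and the `project_rows` helper are my own choices for illustration, not the method from the post.

```python
import torch

def project_rows(W: torch.Tensor) -> torch.Tensor:
    """Project each row of W onto the unit sphere (one possible manifold constraint)."""
    return W / W.norm(dim=1, keepdim=True).clamp_min(1e-12)

def constrained_sgd_step(W: torch.Tensor, grad: torch.Tensor, lr: float) -> torch.Tensor:
    """Take a plain gradient step, then project back onto the constraint set."""
    return project_rows(W - lr * grad)

# Toy usage: keep a 4x8 weight matrix row-normalized across updates.
W = project_rows(torch.randn(4, 8))
grad = torch.randn(4, 8)
W = constrained_sgd_step(W, grad, lr=0.1)
```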
RLVR and test-time compute are a powerful combo for enterprises, so much so that @databricks now leads the overall BIRD single-model leaderboard. This isn't about BIRD, though. It's an example of what our customers are accomplishing in their domains with our RL recipe in Agent Bricks.
RLVR isn't just for math and coding! At @databricks, it's impacting products and users across domains. One example: SQL Q&A. We hit the top of the BIRD single-model single-generation leaderboard with our standard TAO+RLVR recipe - the one rolling out in our Agent Bricks product.
1 reply · 6 reposts · 35 likes
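For readers unfamiliar with RLVR (reinforcement learning with verifiable rewards), the key ingredient is a reward you can check programmatically rather than learn. Below is a generic sketch of one common way to verify generated SQL, by executing it and comparing result sets against a reference query; it is an assumption-laden illustration, not the TAO+RLVR recipe used in Agent Bricks.

```python
import sqlite3

def sql_execution_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    """Verifiable reward: 1.0 if the predicted query returns the same rows as the gold query."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = set(map(tuple, conn.execute(pred_sql).fetchall()))
        gold_rows = set(map(tuple, conn.execute(gold_sql).fetchall()))
        return 1.0 if pred_rows == gold_rows else 0.0
    except sqlite3.Error:
        return 0.0  # invalid or failing SQL gets zero reward
    finally:
        conn.close()
```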
This is the most rewarding thing I did during my PhD; looking forward to teaching an updated version!
Missing Semester has grown past 100K subscribers on YouTube. Appreciate all the engagement and support! We plan to teach another iteration of the course in January 2026, revising the curriculum and covering new topics like AI IDEs and vibe coding.
0 replies · 0 reposts · 5 likes
Deep learning training is a mathematical dumpster fire. But it turns out that if you *fix* the math, everything kinda just works... fp8 training, hyperparameter transfer, training stability, and more. [1/n]
15 replies · 150 reposts · 1K likes
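The thread above gives no details, so purely as one concrete example of what "fixing the math" can mean in practice, here is a minimal sketch of width-aware learning-rate scaling in the spirit of muP-style hyperparameter transfer. The scaling rule and numbers are my own illustration, not content from the thread.

```python
def scaled_lr(base_lr: float, base_width: int, width: int) -> float:
    """Shrink the hidden-layer learning rate as model width grows (muP-style heuristic)."""
    return base_lr * base_width / width

# Tune the learning rate once at a small base width, then transfer it to wider models.
for width in (256, 1024, 4096):
    print(width, scaled_lr(base_lr=1e-2, base_width=256, width=width))
```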
*LoRA Learns Less and Forgets Less* is now out in its definitive edition in TMLR! Check out the latest numbers fresh from the @DbrxMosaicAI oven.
New #FeaturedCertification: *LoRA Learns Less and Forgets Less* by Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz et al. https://t.co/P7a2ktDjb8
#learns #regularization #rank
5 replies · 20 reposts · 82 likes
Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm:
Pretraining data experiments are expensive as measuring the impact of data on emergent tasks requires large FLOP scales. How do you determine what subsets of your data are important for the mixture of tasks you care about? We present Domain upsampling: a strategy to better
0 replies · 14 reposts · 41 likes
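The quoted tweet is cut off, so the snippet below is only my reading of the general idea behind domain upsampling: multiply the sampling weights of the domains you care about by a boost factor and renormalize the mixture. The domain names and weights are made up for illustration.

```python
def upsample_domains(mixture: dict[str, float], boost: dict[str, float]) -> dict[str, float]:
    """Multiply selected domains' sampling weights by a boost factor, then renormalize."""
    raw = {domain: weight * boost.get(domain, 1.0) for domain, weight in mixture.items()}
    total = sum(raw.values())
    return {domain: weight / total for domain, weight in raw.items()}

# Hypothetical base mixture; upweight code and math 4x for a cheap targeted experiment.
base = {"web": 0.70, "code": 0.15, "math": 0.05, "books": 0.10}
print(upsample_domains(base, boost={"code": 4.0, "math": 4.0}))
```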
Some notes from the paper:
- 405B trained on 15.6T tokens, 3.8e25 FLOPs
- uses SFT, rejection sampling, and DPO
- annealing is used to judge the quality of domain-specific data (s/o DBRX paper)
405b is out! working on llama 3 has been a truly rewarding experience and i'm super grateful to all my teammates! i'm excited to see how the llama models will be used by the community! p.s. - we wrote a paper and not just a tech report
2 replies · 6 reposts · 40 likes
Popular #LLM scaling laws only factor in training costs and ignore the costs of deployment. In a paper presented at @icmlconf 2024, @databricks Mosaic AI researchers Nikhil Sardana, @JacobianNeuro, and @sashadoubov propose a modified scaling law that considers the cost of both
1 reply · 20 reposts · 53 likes
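A common rule of thumb is ~6ND FLOPs for training and ~2N FLOPs per generated token at inference; once a model is served over enough tokens, the accounting starts to favor smaller models trained on more data. The helper and numbers below are my own illustration of that idea, not the formula from the paper.

```python
def total_flops(params: float, train_tokens: float, inference_tokens: float) -> float:
    """Rule-of-thumb lifetime compute: 6*N*D for training plus ~2*N per token served."""
    return 6 * params * train_tokens + 2 * params * inference_tokens

# Two hypothetical models: a bigger one trained "compute-optimally" vs. a smaller one
# trained on more tokens. Heavy inference demand shifts total cost toward the smaller one.
big = total_flops(params=70e9, train_tokens=1.4e12, inference_tokens=2e12)
small = total_flops(params=30e9, train_tokens=3.5e12, inference_tokens=2e12)
print(f"big:   {big:.2e} FLOPs")
print(f"small: {small:.2e} FLOPs")
```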
People think LoRA is a magic bullet for LLMs. Is it? Does it deliver the same quality as full finetuning but on consumer GPUs? Though LoRA has the advantage of a lower memory footprint, we find that it often substantially underperforms full finetuning. However, it forgets less
22 replies · 104 reposts · 562 likes
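For context on what LoRA actually changes: rather than updating a full weight matrix W, it trains a low-rank correction B·A that is added to the frozen W, which is where the memory savings (and, per the paper, some of the quality gap) come from. A minimal sketch, with hyperparameters chosen arbitrarily:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (minimal sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # full weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))
```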
An interesting bit of nuance missing from throughput charts like this is that tokens != generated text. Because the DBRX / Llama 3 / GPT-4 tokenizers have a larger vocabulary (100k+), they actually generate text noticeably faster (20-30%) than token counts alone would suggest, compared to, say, Mixtral.
Even if your focus is quality, inference speed & price matter when you can achieve the same quality across multiple requests. Back-of-the-envelope illustration: GPT-4 Turbo is (arguably) the highest-quality model available; however, it is served at 18 tokens/s and costs ~$15/M
2 replies · 1 repost · 18 likes
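One rough way to compare models on generated text rather than tokens is to multiply tokens/s by each tokenizer's average characters per token. The character-per-token figures below are made-up placeholders, chosen only to illustrate the ~20-30% effect described above:

```python
def text_throughput(tokens_per_s: float, chars_per_token: float) -> float:
    """Approximate generated characters per second instead of tokens per second."""
    return tokens_per_s * chars_per_token

# Hypothetical numbers: a 100k-vocab tokenizer packs more text into each token
# than a 32k-vocab one, so equal tokens/s does not mean equal text speed.
large_vocab = text_throughput(tokens_per_s=80, chars_per_token=4.8)
small_vocab = text_throughput(tokens_per_s=80, chars_per_token=3.9)
print(f"relative text speed: {large_vocab / small_vocab:.2f}x")  # ~1.23x
```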
@deepwhitman @AIatMeta @lmsysorg no. people misunderstand chinchilla. chinchilla doesn't tell you the point of convergence; it tells you the point of compute optimality: if all you care about is perplexity, then for every FLOPs compute budget, how big a model should you train, and on how many tokens? for reasons not fully
23 replies · 50 reposts · 579 likes
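To make the distinction concrete: Chinchilla answers "for a fixed compute budget C ≈ 6·N·D (N parameters, D tokens), which N and D minimize loss?" (roughly D ≈ 20·N), not "when has the model stopped improving?". A quick sketch of that allocation under the ~20 tokens-per-parameter heuristic:

```python
import math

def chinchilla_optimal(flops_budget: float) -> tuple[float, float]:
    """Rough compute-optimal split using C ~= 6*N*D and the ~20 tokens/param heuristic."""
    n_params = math.sqrt(flops_budget / 120)  # from C = 6*N*(20*N) = 120*N^2
    n_tokens = 20 * n_params
    return n_params, n_tokens

for budget in (1e22, 1e24):
    n, d = chinchilla_optimal(budget)
    print(f"C={budget:.0e}: ~{n:.1e} params on ~{d:.1e} tokens")
```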
DBRX is the best open model on AI2 WildBench!
Check out the recent update of WildBench! We have included a few more models, including DBRX-Instruct @databricks and Starling-LM-beta (7B) @NexusflowX, which are both super powerful! DBRX-Instruct is indeed the best open LLM; Starling-LM 7B outperforms a lot of even
3 replies · 5 reposts · 40 likes
DBRX is so good that it forced 3-4 companies to release "competing" LLMs in the last two days (and we've barely heard about them). Some of my thoughts are summarized below... Prior research from Mosaic: DBRX is the next model in the series of open LLMs released by Mosaic.
4 replies · 56 reposts · 225 likes
TFW the code eval metrics are so good your boss dyes his hair blue
1 reply · 6 reposts · 67 likes
Introducing DBRX: A New Standard for Open LLMs. https://t.co/0HpI6Sdv6J DBRX is a 16x 12B MoE LLM trained on 12T tokens. DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks. Is this thread mostly written by DBRX? Yes!
22 replies · 83 reposts · 473 likes
ML+art project with @dmshanmugam, @jjgort, and artist @agnieszkakurant! Our GAN-based approach generates signatures containing features learned from a collection of MIT and Cambridge residents' signatures. #creativeAI #MachineLearning @mit_caml
https://t.co/mZCzsXFGz6
1 reply · 6 reposts · 18 likes
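I don't know the details of this project's model, so the following is only a generic sketch of the kind of GAN setup the tweet alludes to: a generator mapping noise to signature-like images and a discriminator trained to tell them apart from real ones. All shapes and sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Generic GAN components for 64x64 grayscale "signature" images (illustrative only).
latent_dim = 128

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64), nn.Tanh(),  # pixel values in [-1, 1]
)

discriminator = nn.Sequential(
    nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),  # real/fake logit
)

# Sample noise, generate fake signatures, and score them with the discriminator.
z = torch.randn(16, latent_dim)
fake_images = generator(z).view(16, 1, 64, 64)
logits = discriminator(fake_images.view(16, -1))
```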