Jose Javier Gonzalez

@jjgort

Followers: 384
Following: 106
Media: 5
Statuses: 36

Research Scientist at Mosaic AI, Databricks. Working on LLMs

San Francisco, CA
Joined January 2017
@jjgort
Jose Javier Gonzalez
2 years
Our latest model DBRX is out today! It's a fast 132B MoE and currently the best open-weights model. DBRX really shines at programming, beating all OSS models on the HumanEval programming benchmark, even code-finetuned models like CodeLLaMA-70B
@dylan522p
Dylan Patel
2 years
Databricks DBRX model is AMAZING, generally great but CRUSHES code. 132B parameters, 12T tokens, 16 experts, 4 per forward pass, 36B active, ~2.6e24 FLOPs.
HumanEval, 0-shot:
DBRX - 70.1%
GPT-4 - 67%
Gemini 1.5 Pro - 71.9%
Mixtral - 54.8%
Grok - 63.2%
LLaMA 2 - 32.2%
https://t.co/j0Y5xbH04W
0
4
20
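The quoted breakdown above (16 experts, 4 routed per token, 36B of 132B parameters active) maps onto a standard top-k mixture-of-experts layer. Below is a minimal toy sketch of that routing pattern in PyTorch; the dimensions, router, and expert MLPs are illustrative stand-ins, not DBRX's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to k of n_experts FFNs.

    Illustrative only -- not DBRX's architecture or sizes. With k=4 of 16
    experts active, only a fraction of the layer's parameters run per token.
    """

    def __init__(self, d_model=512, d_ff=1024, n_experts=16, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)        # weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only k experts run per token
            idx = top_idx[:, slot]
            for e in idx.unique().tolist():
                mask = idx == e
                out[mask] += top_w[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = ToyTopKMoE()
tokens = torch.randn(8, 512)
print(moe(tokens).shape)  # torch.Size([8, 512]); each token only touched 4 of 16 experts
```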
@code_star
Cody Blakeney
11 days
@DimitrisPapail
Dimitris Papailiopoulos
11 days
@code_star CONVERT YOUR CODEBASES TO REAL NUMBERS, I REPEAT REAL NUMBERS
90
484
6K
@HalleeWong
Hallee Wong
19 days
Presenting MultiverSeg, a scalable in-context system for interactively segmenting new datasets, at #ICCV2025 today! 📍 Poster 110 (10:45 AM–12:45 PM)
0
2
5
@thinkymachines
Thinking Machines
2 months
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
119
460
3K
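The Connectionism post itself isn't reproduced here. As a loose, generic illustration of what a manifold constraint on weight matrices can look like in training code, the toy below projects each linear layer's rows back to unit norm after every optimizer step; this is only a sketch of the general idea, not the construction described in the post.

```python
import torch
import torch.nn as nn

# Toy manifold-constrained training loop: after each SGD step, retract every
# weight matrix's rows back onto the unit sphere. Generic illustration only,
# not the Modular Manifolds construction from the Connectionism post.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def project_to_unit_rows(module):
    for m in module.modules():
        if isinstance(m, nn.Linear):
            with torch.no_grad():
                m.weight.div_(m.weight.norm(dim=1, keepdim=True).clamp_min(1e-8))

for _ in range(100):
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    project_to_unit_rows(model)  # keep weights on the constraint set
```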
@jefrankle
Jonathan Frankle
3 months
RLVR and test-time compute are a powerful combo for enterprises, so much so that @databricks now leads the overall BIRD single-model leaderboard. This isn't about BIRD, though. It's an example of what our customers are accomplishing in their domains with our RL recipe in Agent Bricks
@jefrankle
Jonathan Frankle
3 months
RLVR isn't just for math and coding! At @databricks, it's impacting products and users across domains. One example: SQL Q&A. We hit the top of the BIRD single-model single-generation leaderboard with our standard TAO+RLVR recipe - the one rolling out in our Agent Bricks product.
1
6
35
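RLVR fits SQL Q&A because the reward can be checked mechanically: run the generated query and compare its result set against a reference. Below is a minimal, hypothetical reward function along those lines using sqlite3; it illustrates the idea of a verifiable reward, not Databricks' TAO+RLVR recipe.

```python
import sqlite3

def sql_reward(db_path: str, generated_sql: str, reference_sql: str) -> float:
    """Return 1.0 if the generated query yields the same rows as the reference.

    Hypothetical verifiable reward for SQL Q&A -- not the Agent Bricks recipe.
    """
    conn = sqlite3.connect(db_path)
    try:
        ref = conn.execute(reference_sql).fetchall()
        try:
            gen = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return 0.0                                  # invalid SQL gets zero reward
        # Compare as order-insensitive multisets of rows.
        return 1.0 if sorted(ref, key=repr) == sorted(gen, key=repr) else 0.0
    finally:
        conn.close()
```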
@jjgort
Jose Javier Gonzalez
3 months
This is the most rewarding thing I did during my PhD, looking forward to teaching an updated version!
@anishathalye
Anish Athalye
3 months
Missing Semester has grown past 100K subscribers on YouTube. Appreciate all the engagement and support! We plan to teach another iteration of the course in January 2026, revising the curriculum and covering new topics like AI IDEs and vibe coding.
0
0
5
@davisblalock
Davis Blalock
4 months
Deep learning training is a mathematical dumpster fire. But it turns out that if you *fix* the math, everything kinda just works… fp8 training, hyperparameter transfer, training stability, and more. [1/n]
15
150
1K
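The thread's specifics aren't shown here, but one well-known instance of "fixing the math" for hyperparameter transfer is μP-style scaling, where the learning rate of hidden weight matrices shrinks as the model gets wider. The sketch below applies a simplified version of that rule by hand; treat the exact scaling and the parameter grouping as assumptions, not a summary of the thread.

```python
import torch
import torch.nn as nn

# Simplified μP-style learning-rate scaling: matrix-shaped (hidden) weights get
# their Adam learning rate divided by the width multiplier, so hyperparameters
# tuned on a narrow proxy roughly transfer to wider models. Illustration only.
base_width, width = 256, 1024
mult = width / base_width
base_lr = 3e-3

model = nn.Sequential(nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width))
hidden, other = [], []
for name, p in model.named_parameters():
    (hidden if p.ndim == 2 else other).append(p)

opt = torch.optim.Adam([
    {"params": hidden, "lr": base_lr / mult},  # weight matrices: lr ~ 1/width
    {"params": other, "lr": base_lr},          # biases etc.: lr unchanged
])
```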
@dan_biderman
Dan Biderman
1 year
*LoRA Learns Less and Forgets Less* is now out in its definitive edition in TMLR 🚀 Check out the latest numbers fresh from the @DbrxMosaicAI oven 👨‍🍳
@TmlrCert
Certified papers at TMLR
1 year
New #FeaturedCertification: "LoRA Learns Less and Forgets Less" by Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz et al. https://t.co/P7a2ktDjb8 #learns #regularization #rank
5
20
82
@mansiege
Mansheej Paul
1 year
Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm:
@code_star
Cody Blakeney
1 year
Pretraining data experiments are expensive as measuring the impact of data on emergent tasks requires large FLOP scales. How do you determine what subsets of your data are important for the mixture of tasks you care about? We present Domain upsampling: a strategy to better
0
14
41
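The quoted thread is cut off, but the gist of domain upsampling is to re-weight how often each pretraining domain is drawn rather than training on the raw mixture. A minimal sketch of that re-weighting is below; the domain names, token counts, and multipliers are made up for illustration and are not the paper's recipe.

```python
import random

# Hypothetical example: upsample smaller, higher-signal domains relative to
# their raw share of the corpus. All numbers here are made up.
domain_tokens = {"web": 8_000, "code": 1_500, "math": 500}   # raw token counts
upsample = {"web": 1.0, "code": 3.0, "math": 4.0}            # per-domain multipliers

weights = {d: n * upsample[d] for d, n in domain_tokens.items()}
total = sum(weights.values())
mix = {d: w / total for d, w in weights.items()}
print(mix)  # effective sampling proportions after upsampling

# Draw a batch of domains according to the upsampled mixture.
batch = random.choices(list(mix), weights=list(mix.values()), k=10)
print(batch)
```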
@sashadoubov
Sasha Doubov
1 year
Some notes from the paper!
- 405B trained on 15.6T tokens, 3.8e25 FLOPs
- Uses SFT, rejection sampling, and DPO
- Annealing is used to judge quality of domain-specific data (s/o DBRX paper)
@louvishh
lovish
1 year
405b is out! working on llama 3 has been a truly rewarding experience and i'm super grateful to all my teammates! i'm excited to see how the llama models will be used by the community! p.s. - we wrote a paper and not just a tech report 😛
2
6
40
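The 3.8e25 figure in the notes above matches the standard approximation of roughly 6 FLOPs per parameter per training token. A quick check, assuming that rule of thumb:

```python
# Rule-of-thumb training compute: ~6 FLOPs per parameter per token.
params = 405e9
tokens = 15.6e12
flops = 6 * params * tokens
print(f"{flops:.2e}")  # ~3.79e+25, consistent with the 3.8e25 in the notes
```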
@DbrxMosaicAI
Databricks Mosaic Research
1 year
Popular #LLM scaling laws only factor in training costs, and ignore the costs of deployment. In a paper presented at @icmlconf 2024, @databricks Mosaic AI researchers Nikhil Sardana, @JacobianNeuro, and @sashadoubov propose a modified scaling law that considers the cost of both
1
20
53
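The tweet above is cut off, but the core accounting is simple: a smaller model trained on more tokens costs extra at training time yet saves compute on every token it later serves. A back-of-the-envelope sketch, assuming the usual ~6·N·D training and ~2·N per-token inference approximations (the paper's exact formulation may differ) and hypothetical model sizes assumed to reach similar quality:

```python
# Lifetime compute for two hypothetical models assumed to be of similar quality:
# a larger, Chinchilla-style model vs. a smaller one trained on more tokens.
# Uses the ~6*N*D training and ~2*N per-token inference approximations.
def lifetime_flops(params, train_tokens, inference_tokens):
    return 6 * params * train_tokens + 2 * params * inference_tokens

inference_tokens = 2e13                                    # tokens served over deployment
big   = lifetime_flops(70e9, 1.4e12, inference_tokens)     # 70B, Chinchilla-style
small = lifetime_flops(30e9, 6.0e12, inference_tokens)     # 30B, overtrained

print(f"big:   {big:.2e} FLOPs")
print(f"small: {small:.2e} FLOPs")  # with heavy inference demand, the smaller model wins
```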
@dan_biderman
Dan Biderman
1 year
People think LoRA is a magic bullet for LLMs. Is it? Does it deliver the same quality as full finetuning but on consumer GPUs? Though LoRA has the advantage of a lower memory footprint, we find that it often substantially underperforms full finetuning. However, it forgets less
22
104
562
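For context on what's being compared: full finetuning updates every weight, while LoRA freezes the base model and trains small low-rank adapters, which is where the lower memory footprint comes from. A minimal sketch with the Hugging Face peft library, using a small stand-in model and illustrative hyperparameters rather than the paper's setup:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small stand-in model; the paper studies much larger LLMs.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Illustrative LoRA hyperparameters -- not the paper's configuration.
config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of weights are trained
```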
@code_star
Cody Blakeney
2 years
An interesting bit of nuance missing from throughput charts like this is that tokens != generated text. Because the DBRX / Llama 3 / GPT-4 tokenizers have larger vocabularies (100k+), they actually generate text much faster (20-30%) than token counts alone would suggest, compared to, say, Mixtral
@ArtificialAnlys
Artificial Analysis
2 years
Even if your focus is quality, inference speed & price matter when you can achieve the same quality across multiple requests. Back-of-the-envelope illustration: GPT-4 Turbo is (arguably) the highest quality model available; however, it is served at 18 tokens/s and costs ~$15/M
2
1
18
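The tokens-vs-text point above is easy to check: a tokenizer with a larger vocabulary packs more characters into each token, so at the same tokens/s it emits more text per second. A rough comparison with tiktoken, using the ~100k-entry cl100k_base vocabulary against its ~50k gpt2 vocabulary as a stand-in for a smaller-vocab tokenizer (Mixtral's own ~32k tokenizer isn't loaded here):

```python
import tiktoken

text = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"

for name in ("cl100k_base", "gpt2"):  # ~100k vocab vs. ~50k vocab
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(text))
    print(f"{name}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
# Fewer tokens for the same text means more text per second at equal tokens/s.
```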
@karpathy
Andrej Karpathy
2 years
@deepwhitman @AIatMeta @lmsysorg no. people misunderstand chinchilla. chinchilla doesn't tell you the point of convergence. it tells you the point of compute optimality. if all you care about is perplexity, for every FLOPs compute budget, how big a model on how many tokens should you train? for reasons not fully
23
50
579
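To make "compute optimality" concrete: under the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter and compute C ≈ 6·N·D, a fixed FLOP budget pins down both the model size and the token count. A quick sketch, assuming those approximations:

```python
import math

# Chinchilla-style rule of thumb: D ~ 20 * N tokens and C ~ 6 * N * D FLOPs,
# so for a fixed budget C: N = sqrt(C / 120) and D = 20 * N.
def compute_optimal(budget_flops):
    n_params = math.sqrt(budget_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal(c)
    print(f"C={c:.0e}: ~{n / 1e9:.1f}B params on ~{d / 1e12:.2f}T tokens")
```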
@jjgort
Jose Javier Gonzalez
2 years
Nice! @KagiHQ now includes links to Hacker News and Reddit discussion threads directly in the search results. It's the small things
1
5
31
@code_star
Cody Blakeney
2 years
DBRX is the best open model on AI2 WildBench! 😀
@billyuchenlin
Bill Yuchen Lin
2 years
🆕 Check out the recent update of WildBench! We have included a few more models including DBRX-Instruct @databricks and StarlingLM-beta (7B) @NexusflowX which are both super powerful! DBRX-Instruct is indeed the best open LLM; Starling-LM 7B outperforms a lot of even
3
5
40
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 years
🧱DBRX🧱 is so good that it forced 3-4 companies to release "competing" LLMs in the last two days (and we've barely heard about them). Some of my thoughts are summarized below...
Prior research from Mosaic. DBRX is the next model in the series of Open LLMs released by Mosaic.
4
56
225
@tessybarton
Tessa Barton
2 years
TFW the code eval metrics are so good your boss does his hair blue
@jefrankle
Jonathan Frankle
2 years
1
6
67
@vitaliychiley
Vitaliy Chiley
2 years
Introducing DBRX: A New Standard for Open LLM 🔔 https://t.co/0HpI6Sdv6J
💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens
🧠 DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks.
Is this thread mostly written by DBRX? Yes! 🧵
22
83
473
@KatieLewisMIT
Katie Lewis
5 years
ML+art project with @dmshanmugam, @jjgort, and artist @agnieszkakurant! Our GAN-based approach generates signatures containing features learned from a collection of MIT and Cambridge residents' signatures. #creativeAI #MachineLearning @mit_caml https://t.co/mZCzsXFGz6
1
6
18