Jose Javier Gonzalez
@jjgort
Followers: 384 · Following: 106 · Media: 5 · Statuses: 36
Research Scientist at Mosaic AI, Databricks. Working on LLMs.
San Francisco, CA
Joined January 2017
Our latest model DBRX is out today! It's a fast 132B MoE and currently the best open-weights model. DBRX really shines at programming, beating all OSS models on the HumanEval programming benchmark, even code fine-tuned models like CodeLLaMA-70B.
Databricks' DBRX model is AMAZING: generally great, but it CRUSHES code. 132B parameters, 12T tokens, 16 experts, 4 per forward pass, 36B active, ~2.6e24 FLOPs.
HumanEval, 0-shot:
- DBRX: 70.1%
- GPT-4: 67%
- Gemini 1.5 Pro: 71.9%
- Mixtral: 54.8%
- Grok: 63.2%
- LLaMA 2: 32.2%
https://t.co/j0Y5xbH04W
0 replies · 4 reposts · 20 likes
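A quick sanity check on the ~2.6e24 figure quoted above, using the common 6 * active_params * tokens rule of thumb for training FLOPs. This is my own back-of-the-envelope arithmetic, not something stated in the tweets:

```python
# Back-of-the-envelope training-FLOPs estimate for a MoE model,
# using the approximation: FLOPs ~= 6 * active_params * training_tokens.
# The numbers are the ones quoted in the tweet above.

active_params = 36e9      # DBRX active parameters per forward pass (36B)
training_tokens = 12e12   # training tokens (12T)

train_flops = 6 * active_params * training_tokens
print(f"~{train_flops:.2e} FLOPs")  # ~2.59e+24, matching the quoted ~2.6e24
```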
Presenting MultiverSeg, a scalable in-context system for interactively segmenting new datasets, at #ICCV2025 today! Poster 110 (10:45 AM-12:45 PM).
0 replies · 2 reposts · 5 likes
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
119 replies · 460 reposts · 3K likes
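I haven't read the Connectionism post, so the snippet below is only a generic illustration of the broader idea it hints at: keeping weight matrices on a constraint set by following each optimizer step with a projection. The row-normalization constraint and the `project_rows` helper are my own choices for illustration, not the method from the post.

```python
import torch

def project_rows(W: torch.Tensor) -> torch.Tensor:
    """Project each row of W onto the unit sphere (one possible manifold constraint)."""
    return W / W.norm(dim=1, keepdim=True).clamp_min(1e-12)

def constrained_sgd_step(W: torch.Tensor, grad: torch.Tensor, lr: float) -> torch.Tensor:
    """Take a plain gradient step, then project back onto the constraint set."""
    return project_rows(W - lr * grad)

# Toy usage: keep a 4x8 weight matrix row-normalized across updates.
W = project_rows(torch.randn(4, 8))
grad = torch.randn(4, 8)
W = constrained_sgd_step(W, grad, lr=0.1)
```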
RLVR and test-time compute are a powerful combo for enterprises, so much so that @databricks now leads the overall BIRD single-model leaderboard. This isn't about BIRD, though. It's an example of what our customers are accomplishing in their domains with our RL recipe in Agent Bricks.
RLVR isn't just for math and coding! At @databricks, it's impacting products and users across domains. One example: SQL Q&A. We hit the top of the BIRD single-model single-generation leaderboard with our standard TAO+RLVR recipe - the one rolling out in our Agent Bricks product.
1 reply · 6 reposts · 35 likes
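For readers unfamiliar with RLVR (reinforcement learning with verifiable rewards), the key ingredient is a reward you can check programmatically rather than learn. Below is a generic sketch of one common way to verify generated SQL, by executing it and comparing result sets against a reference query; it is an assumption-laden illustration, not the TAO+RLVR recipe used in Agent Bricks.

```python
import sqlite3

def sql_execution_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    """Verifiable reward: 1.0 if the predicted query returns the same rows as the gold query."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = set(map(tuple, conn.execute(pred_sql).fetchall()))
        gold_rows = set(map(tuple, conn.execute(gold_sql).fetchall()))
        return 1.0 if pred_rows == gold_rows else 0.0
    except sqlite3.Error:
        return 0.0  # invalid or failing SQL gets zero reward
    finally:
        conn.close()
```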
This is the most rewarding thing I did during my PhD; looking forward to teaching an updated version!
Missing Semester has grown past 100K subscribers on YouTube. Appreciate all the engagement and support! We plan to teach another iteration of the course in January 2026, revising the curriculum and covering new topics like AI IDEs and vibe coding.
0 replies · 0 reposts · 5 likes
Deep learning training is a mathematical dumpster fire. But it turns out that if you *fix* the math, everything kinda just works... fp8 training, hyperparameter transfer, training stability, and more. [1/n]
15 replies · 150 reposts · 1K likes
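The thread above gives no details, so purely as one concrete example of what "fixing the math" can mean in practice, here is a minimal sketch of width-aware learning-rate scaling in the spirit of muP-style hyperparameter transfer. The scaling rule and numbers are my own illustration, not content from the thread.

```python
def scaled_lr(base_lr: float, base_width: int, width: int) -> float:
    """Shrink the hidden-layer learning rate as model width grows (muP-style heuristic)."""
    return base_lr * base_width / width

# Tune the learning rate once at a small base width, then transfer it to wider models.
for width in (256, 1024, 4096):
    print(width, scaled_lr(base_lr=1e-2, base_width=256, width=width))
```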
*LoRA Learns Less and Forgets Less* is now out in its definitive edition in TMLR! Check out the latest numbers fresh from the @DbrxMosaicAI oven.
New #FeaturedCertification: *LoRA Learns Less and Forgets Less* by Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz et al. https://t.co/P7a2ktDjb8
#learns #regularization #rank
5 replies · 20 reposts · 82 likes
Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm:
Pretraining data experiments are expensive as measuring the impact of data on emergent tasks requires large FLOP scales. How do you determine what subsets of your data are important for the mixture of tasks you care about? We present Domain upsampling: a strategy to better
0 replies · 14 reposts · 41 likes
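The quoted tweet is cut off, so the snippet below is only my reading of the general idea behind domain upsampling: multiply the sampling weights of the domains you care about by a boost factor and renormalize the mixture. The domain names and weights are made up for illustration.

```python
def upsample_domains(mixture: dict[str, float], boost: dict[str, float]) -> dict[str, float]:
    """Multiply selected domains' sampling weights by a boost factor, then renormalize."""
    raw = {domain: weight * boost.get(domain, 1.0) for domain, weight in mixture.items()}
    total = sum(raw.values())
    return {domain: weight / total for domain, weight in raw.items()}

# Hypothetical base mixture; upweight code and math 4x for a cheap targeted experiment.
base = {"web": 0.70, "code": 0.15, "math": 0.05, "books": 0.10}
print(upsample_domains(base, boost={"code": 4.0, "math": 4.0}))
```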
Some notes from the paper:
- 405B trained on 15.6T tokens, 3.8e25 FLOPs
- uses SFT, rejection sampling, and DPO
- annealing is used to judge the quality of domain-specific data (s/o DBRX paper)
405b is out! working on llama 3 has been a truly rewarding experience and i'm super grateful to all my teammates! i'm excited to see how the llama models will be used by the community! p.s. - we wrote a paper and not just a tech report
2 replies · 6 reposts · 40 likes
Popular #LLM scaling laws only factor in training costs and ignore the costs of deployment. In a paper presented at @icmlconf 2024, @databricks Mosaic AI researchers Nikhil Sardana, @JacobianNeuro, and @sashadoubov propose a modified scaling law that considers the cost of both
1 reply · 20 reposts · 53 likes
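A common rule of thumb is ~6ND FLOPs for training and ~2N FLOPs per generated token at inference; once a model is served over enough tokens, the accounting starts to favor smaller models trained on more data. The helper and numbers below are my own illustration of that idea, not the formula from the paper.

```python
def total_flops(params: float, train_tokens: float, inference_tokens: float) -> float:
    """Rule-of-thumb lifetime compute: 6*N*D for training plus ~2*N per token served."""
    return 6 * params * train_tokens + 2 * params * inference_tokens

# Two hypothetical models: a bigger one trained "compute-optimally" vs. a smaller one
# trained on more tokens. Heavy inference demand shifts total cost toward the smaller one.
big = total_flops(params=70e9, train_tokens=1.4e12, inference_tokens=2e12)
small = total_flops(params=30e9, train_tokens=3.5e12, inference_tokens=2e12)
print(f"big:   {big:.2e} FLOPs")
print(f"small: {small:.2e} FLOPs")
```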
People think LoRA is a magic bullet for LLMs. Is it? Does it deliver the same quality as full finetuning but on consumer GPUs? Though LoRA has the advantage of a lower memory footprint, we find that it often substantially underperforms full finetuning. However, it forgets less
22 replies · 104 reposts · 562 likes
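For context on what LoRA actually changes: rather than updating a full weight matrix W, it trains a low-rank correction B·A that is added to the frozen W, which is where the memory savings (and, per the paper, some of the quality gap) come from. A minimal sketch, with hyperparameters chosen arbitrarily:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (minimal sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # full weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))
```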
An interesting bit of nuance missing from throughput charts like this is that tokens != generated text. Because the DBRX / Llama 3 / GPT-4 tokenizers have a larger vocabulary (100k+), they actually generate text noticeably faster (20-30%) than token counts alone would suggest, compared to, say, Mixtral.
Even if your focus is quality, inference speed & price matter when you can achieve the same quality across multiple requests. Back-of-the-envelope illustration: GPT-4 Turbo is (arguably) the highest-quality model available; however, it is served at 18 tokens/s and costs ~$15/M
2 replies · 1 repost · 18 likes
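One rough way to compare models on generated text rather than tokens is to multiply tokens/s by each tokenizer's average characters per token. The character-per-token figures below are made-up placeholders, chosen only to illustrate the ~20-30% effect described above:

```python
def text_throughput(tokens_per_s: float, chars_per_token: float) -> float:
    """Approximate generated characters per second instead of tokens per second."""
    return tokens_per_s * chars_per_token

# Hypothetical numbers: a 100k-vocab tokenizer packs more text into each token
# than a 32k-vocab one, so equal tokens/s does not mean equal text speed.
large_vocab = text_throughput(tokens_per_s=80, chars_per_token=4.8)
small_vocab = text_throughput(tokens_per_s=80, chars_per_token=3.9)
print(f"relative text speed: {large_vocab / small_vocab:.2f}x")  # ~1.23x
```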
@deepwhitman @AIatMeta @lmsysorg no. people misunderstand chinchilla. chinchilla doesn't tell you the point of convergence; it tells you the point of compute optimality: if all you care about is perplexity, then for every FLOPs compute budget, how big a model should you train, and on how many tokens? for reasons not fully
23 replies · 50 reposts · 579 likes
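To make the distinction concrete: Chinchilla answers "for a fixed compute budget C ≈ 6·N·D (N parameters, D tokens), which N and D minimize loss?" (roughly D ≈ 20·N), not "when has the model stopped improving?". A quick sketch of that allocation under the ~20 tokens-per-parameter heuristic:

```python
import math

def chinchilla_optimal(flops_budget: float) -> tuple[float, float]:
    """Rough compute-optimal split using C ~= 6*N*D and the ~20 tokens/param heuristic."""
    n_params = math.sqrt(flops_budget / 120)  # from C = 6*N*(20*N) = 120*N^2
    n_tokens = 20 * n_params
    return n_params, n_tokens

for budget in (1e22, 1e24):
    n, d = chinchilla_optimal(budget)
    print(f"C={budget:.0e}: ~{n:.1e} params on ~{d:.1e} tokens")
```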
DBRX is the best open model on AI2 WildBench!
Check out the recent update of WildBench! We have included a few more models, including DBRX-Instruct @databricks and Starling-LM-beta (7B) @NexusflowX, which are both super powerful! DBRX-Instruct is indeed the best open LLM; Starling-LM 7B outperforms a lot of even
3 replies · 5 reposts · 40 likes
DBRX is so good that it forced 3-4 companies to release "competing" LLMs in the last two days (and we've barely heard about them). Some of my thoughts are summarized below... Prior research from Mosaic: DBRX is the next model in the series of open LLMs released by Mosaic.
4 replies · 56 reposts · 225 likes
TFW the code eval metrics are so good your boss dyes his hair blue
1 reply · 6 reposts · 67 likes
Introducing DBRX: A New Standard for Open LLMs. https://t.co/0HpI6Sdv6J DBRX is a 16x 12B MoE LLM trained on 12T tokens. DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks. Is this thread mostly written by DBRX? Yes!
22 replies · 83 reposts · 473 likes
ML+art project with @dmshanmugam, @jjgort, and artist @agnieszkakurant! Our GAN-based approach generates signatures containing features learned from a collection of MIT and Cambridge residents' signatures. #creativeAI #MachineLearning @mit_caml
https://t.co/mZCzsXFGz6
1 reply · 6 reposts · 18 likes
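I don't know the details of this project's model, so the following is only a generic sketch of the kind of GAN setup the tweet alludes to: a generator mapping noise to signature-like images and a discriminator trained to tell them apart from real ones. All shapes and sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Generic GAN components for 64x64 grayscale "signature" images (illustrative only).
latent_dim = 128

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64), nn.Tanh(),  # pixel values in [-1, 1]
)

discriminator = nn.Sequential(
    nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),  # real/fake logit
)

# Sample noise, generate fake signatures, and score them with the discriminator.
z = torch.randn(16, latent_dim)
fake_images = generator(z).view(16, 1, 64, 64)
logits = discriminator(fake_images.view(16, -1))
```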