
Matei Zaharia
@matei_zaharia
Followers: 43K | Following: 20K | Media: 173 | Statuses: 3K
CTO at @Databricks and CS prof at @UCBerkeley. Working on data+AI, including @ApacheSpark, @DeltaLakeOSS, @MLflow, https://t.co/94gROE5Xa0. https://t.co/nmRYAKG0LZ
Joined October 2010
Lots of people are wondering whether #GPT4 and #ChatGPT's performance has been changing over time, so Lingjiao Chen, @james_y_zou and I measured it. We found big changes including some large decreases in some problem-solving tasks:
118
745
3K
Building a ChatGPT-like LLM might be easier than anyone thought. At @Databricks, we tuned a 2-year-old open source model to follow instructions in just 3 hours, and are open sourcing the code. We think this tech will quickly be democratized.
41
489
2K
Very excited to return to UC Berkeley as a professor starting this week. I’ll be collaborating with the Sky Lab, @UCBEPIC, @berkeley_ai and others!
@Berkeley_EECS welcomes @matei_zaharia, who returns to Berkeley EECS as an Associate Professor. Matei’s research interests include computer systems and machine learning. He is also the co-founder and Chief Technologist of Databricks. Welcome back, Matei!
49
43
845
Pretty sure I've seen people driving with only 19 neurons too!
This autonomous car can drive itself using only 19 control neurons. Video: More: (work w/@ISTAustria @tuvienna). #SelfDrivingCars #Autonomy #ML #DL #MachineLearning
6
121
756
MLflow just added first-class support for LLMs, including integrations with @huggingface transformers/pipelines, @OpenAI and @LangChainAI! Open source #LLMOps is here.
4
142
595
Thrilled to receive this award; the credit is due to my students, my mentors, my collaborators in academia and open source, and my colleagues at Databricks for making all this work happen!
The 2023 @ACMSIGOPS Mark Weiser Award was presented to @matei_zaharia for innovation and impact in large-scale data processing. The award was announced at @sospconf. From next year, awards will be announced annually as @sospconf is now an annual conference. See you in #austin in 2024.
52
38
452
One of my favorite announcements: English SDK for @ApacheSpark! No more need to remember weird syntax, just chain transformations in natural language with the familiar Spark API. So many fun examples.
12
71
407
Due to COVID19, we decided to make #SparkAISummit virtual and also *free* for anyone to attend this year! We still have the same great program with over 200 talks and keynotes from @NateSilver538, @jenniferchayes, @apaszke and more. Tune in for the largest data & AI summit ever.
We can’t wait to solve the world’s toughest problems — and it starts with #SparkAISummit, the world’s largest data and machine learning conference. As a global virtual event, we'll converge to shape the future of big data, analytics and AI. Join us:
10
156
357
Our new MOOC on #LLM Foundation Models from the Ground Up is now available! Join me, Chengyin Eng, @sjraymond, Joseph Bradley and @abhi_venigalla for a detailed look at how LLMs are built, how to improve them, and where the field is going.
4
83
351
Congrats to my student @codyaustun (with @pbailis) on defending his PhD today! Cody did amazing work improving the resource and data efficiency of deep learning, including widely used benchmarks (DAWNBench/MLPerf), perf analysis, and new 10-1000x faster algorithms (SVP & SEALS).
14
34
335
I'm super honored to have received a #PECASE award this year. Percy Liang from @StanfordNLP also got one, which is great news for Stanford CS. Congrats to everyone else who received one!
17
26
306
This thread highlights a point we've been seeing for a while: you can't meaningfully talk about capabilities of a *language model*, you have to talk about capabilities of a *system*, including the inference algorithm. 32-CoT is not the same as 5-shot.
.@JeffDean why the need to do 32-CoT Gemini Ultra vs 5-shot GPT-4? Why not just report 5-shot vs 5-shot?
6
40
309
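To make the model-versus-system point concrete, here is a minimal sketch, assuming a toy solver that returns the right answer 60% of the time. It compares one sample against a majority vote over 32 samples. `noisy_solver`, its accuracy, and the answer strings are hypothetical stand-ins, and plain majority voting is a simplification, not the exact procedure behind any reported benchmark number.

```python
import random
from collections import Counter

CORRECT = "42"  # hypothetical ground-truth answer for the toy question


def noisy_solver(rng, p_correct=0.6):
    """Hypothetical stand-in for one sampled chain-of-thought answer."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(["41", "43", "44"])  # spread errors over wrong answers


def majority_vote(k, rng):
    """Sample k answers from the same 'model' and return the most common one."""
    answers = [noisy_solver(rng) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]


def accuracy(k, trials=2000, seed=0):
    """Fraction of trials in which the k-sample system answers correctly."""
    rng = random.Random(seed)
    hits = sum(majority_vote(k, rng) == CORRECT for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("1 sample  :", accuracy(1))
    print("32 samples:", accuracy(32))
```

The underlying solver is identical in both runs; only the inference procedure wrapped around it changes, which is why the two scores are not directly comparable.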
For @VLDB2020, we wrote a paper on @DeltaLakeOSS, one of the most exciting new technologies from Databricks. By adding ACID transactions over cloud object stores, we can provide data-warehouse-like capabilities & performance on low-cost, HA cloud storage.
5
88
300
Excited to share our #Lakehouse technical paper published at #CIDR21. We describe a new class of data platforms that are (1) completely open, (2) efficiently support #MachineLearning, and (3) provide all traditional #DataWarehouse capabilities+performance.
6
87
286
#ApacheSpark 3.0 greatly simplifies writing Python user-defined functions through type hints, and makes it easier for your functions to process data efficiently in batches via Pandas and Apache Arrow. Check out how to use them:
2
75
262
This is a big release: we've spent the past 3 years working on LLM pipelines and retrieval-augmented apps in my group, and came up with this rich programming model based on our learnings. It not only defines but *automatically optimizes* pipelines for you to get great results.
🚨Announcing 𝗗𝗦𝗣𝘆, the framework for solving advanced tasks w/ LMs. Express *any* pipeline as clean, Pythonic control flow. Just ask DSPy to 𝗰𝗼𝗺𝗽𝗶𝗹𝗲 your modular code into auto-tuned chains of prompts or finetunes for GPT, Llama, and/or T5.🧵
1
49
274
#ApacheSpark 2.4 is out today! This release has tons of new features including barrier execution mode for ML applications, higher-order functions in SQL, optional eager evaluation for previewing DataFrames in Jupyter, Scala 2.12 support and more.
0
123
254
As we worked with customers using LLMs, a common pattern we saw was that everyone wanted to add a layer in front of the LLM API to manage credentials, rate limits, etc, and to easily swap between models. We've built this in the open source @MLflow AI Gateway:
4
52
239
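The pattern described above (one layer that holds credentials, enforces rate limits, and lets callers swap models behind a stable route name) can be sketched in a few lines. This is an illustrative toy, not the MLflow AI Gateway API: `TinyGateway`, `register`, and `query` are hypothetical names, and the backends are plain callables standing in for real LLM clients.

```python
import time


class TinyGateway:
    """Hypothetical in-process gateway: routes, stored keys, a simple rate limit."""

    def __init__(self, max_calls_per_minute, clock=time.monotonic):
        self._routes = {}          # route name -> (backend callable, api key)
        self._max_calls = max_calls_per_minute
        self._clock = clock        # injectable clock, handy for testing
        self._calls = []           # timestamps of recent calls

    def register(self, route, backend, api_key):
        """Point a route at a backend; re-registering swaps the model."""
        self._routes[route] = (backend, api_key)

    def query(self, route, prompt):
        """Apply the rate limit, then forward the prompt with the stored key."""
        now = self._clock()
        # Keep only calls made within the last 60 seconds.
        self._calls = [t for t in self._calls if now - t < 60.0]
        if len(self._calls) >= self._max_calls:
            raise RuntimeError("rate limit exceeded")
        self._calls.append(now)
        backend, api_key = self._routes[route]
        return backend(prompt, api_key)


if __name__ == "__main__":
    gw = TinyGateway(max_calls_per_minute=10)
    gw.register("chat", lambda prompt, key: f"model-a says: {prompt}", "key-a")
    print(gw.query("chat", "hello"))
    # Swap the model behind the same route without touching the caller.
    gw.register("chat", lambda prompt, key: f"model-b says: {prompt}", "key-b")
    print(gw.query("chat", "hello"))
```

Callers only ever see the route name, so credentials stay in one place and a model swap is a single `register` call.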
Cool to see this model from @MosaicML being trained on RedPajama and Dolly data. Fully open source AI is becoming a reality -- open source efficient training, curated web dataset, and instruction data. Still early and small model but it will get better.
1
44
241
Want to efficiently query a vector DB while filtering on structured attributes? My student Liana Patel, together with @petereliaskraft and @guestrin, modified HNSW to do this efficiently in ACORN, to appear at SIGMOD:
4
41
231
Databricks just published our #StateofDataAI report, with interesting trends at our enterprise customers: 1. Adoption of LLMs is booming, with use of SaaS LLM APIs exploding since #ChatGPT launched, but the largest use (and growth) still in custom LLMs.
2
55
222
Very cool to see Dolly-v2 hit #1 trending on HuggingFace Hub today. Stay tuned for a lot more LLM infra coming from Databricks soon. And register for our @Data_AI_Summit conference to hear the biggest things as they launch -- online attendance is free.
2
37
223
Large NLP models are expensive and opaque, but maybe it doesn't have to be that way. This exciting work with Omar Khattab and @ChrisGPotts uses retrieval to set SotA results in hard NLP tasks at low cost. Our Baleen paper will be a spotlight at NeurIPS.
3
43
226
Want to build your own chat AI from scratch? We're launching a Building LLMs course at @Data_AI_Summit to teach everyone how to build a Dolly clone: Tiny model, big attitude, for anyone. #DemocratizeAI
5
38
209
It's hard to believe that #ApacheSpark was first released as a research project 10 years ago! My @SparkAISummit keynote (live now) goes through the lessons in the past 10 years and what's new in #ApacheSpark 3.0.
6
39
206
Databricks is now available on @googlecloud! We've also built great integrations with BigQuery, Looker, GCS and Google AI services across the product.
Open #lakehouse platform meets open #cloud with unified data engineering, data science and analytics. Learn more about Databricks on @GoogleCloud:
7
41
209
We're thrilled to announce the keynote speakers for #MLSys2025: @AnimaAnandkumar, @soumithchintala, Ling Liu and @istoica05! Registration is open to attend the conference in Santa Clara.
2
29
210
Very excited that @ApacheSpark won the SIGMOD System Award this year. Congrats to the whole community behind the project!
2022 ACM SIGMOD Awards. Edgar F. Codd Innovations Award goes to Dan Suciu. Contributions Award goes to Christian S. Jensen. Test-of-Time Award goes to “NoDB: Efficient Query Execution on Raw Data Files”. Systems Award goes to “Apache Spark”. Congrats!
5
24
204
We updated the code for Dolly so it only trains in 30 minutes now. It’s nice to be able to experiment quickly with instruction tuning.
We’re actively updating the Dolly repo with model improvements! Make sure to pull the latest changes. At $30 / 30min per training run it’s dead simple to run multiple experiments. Also, 688 stars in 20 hours! Neat!
2
43
198
I gave a keynote at @ACMSoCC about lessons from building a large-scale cloud service at @Databricks. Did you know that Databricks runs millions of VMs/day to process exabytes of data with <200 engineers? Slides here:
2
51
198
Congratulations and so well deserved, Omar! It's been fantastic working together.
I'm excited to share that I will be joining MIT EECS as an assistant professor in Fall 2025! I'll be recruiting PhD students from the December 2024 application pool. Indicate interest if you'd like to work with me on NLP, IR, or ML Systems! Stay tuned for more about my new lab.
3
8
194
Congrats to the #ApacheSpark community on the 3.0 release! Over 440 developers contributed 3400 patches to this release, with big improvements in SQL performance, ANSI SQL support, Python usability and management features.
[ANNOUNCEMENT] Congrats to the Apache Spark community and all the contributors! The Apache Spark 3.0 is here. Try it out!
1
59
189
Exciting times at @Databricks. We're hiring in all departments, so take a look if you want to help shape the next generation of infrastructure for data and AI.
Databricks raises $1B at $28B valuation as it reaches $425M ARR by @alex and @ron_miller.
3
25
190
I'm co-organizing the inaugural research workshop on Compound AI Systems on June 13th. Send in your work on designing & optimizing such systems! Thrilled to have @RichardSocher, @MonicaSLam and @polynoamial as speakers, and to host this at @Data_AI_Summit.
2
32
191
Meet #LakehouseIQ: a knowledge engine from your enterprise that understands your business & data to power AI apps. Every platform is adding an AI assistant, but in data, LLMs don't just work out of the box, because every org has its own jargon, data, etc.
11
88
182
Really cool to see OpenAI o1 launched today. It's another example of the trend towards compound AI systems, not models, getting the best AI results. I'm sure that future versions will not only scale inference, but also use tools (coding, search, etc) for better results.
Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models. AlphaCode, ChatGPT+, Gemini are examples. In this post, we discuss why this is and emerging research on designing & optimizing such systems.
4
24
186
We also have a big announcement for @MLflow today: it's joining the @linuxfoundation as a long-term vendor-neutral home to host the project! We've been blown away with how fast MLflow has grown and hope this leads to even more contributors.
1
78
181
We just posted the first release of open source Unity Catalog! It supports tables, unstructured data, and AI, and we have a great set of partners across data and AI integrating with it. Read more at
2
33
177
The great thing is that for customers wishing to build such models that natively understand their data, the cost could be even less. We have the checkpoints, data cleaning pipeline, instruction tuning pipeline, etc from DBRX — just apply these to your data.
Just $10M and two months to train from scratch a GPT3.5 - Llama2 level model. For context, it probably cost 10-20x more to OAI just a year ago! The more we improve as a field thanks to open-source, the cheaper & more efficient it gets! All companies should now train their own
1
20
157
Probably the thing I’m most excited about with DBRX, it’s super fast! Easily 150 tokens/s for quality comparable to much slower closed models.
6
29
171
Congrats to my student @deepakn94 for defending his PhD! Deepak worked on a ton of exciting systems and ML research, including Weld, DAWNBench/MLPerf, and most recently pipelining methods for efficient DNN training, including PipeDream-2BW (ICML'21) and Megatron's 1T param model.
5
7
164
Really proud of my student @sppalkia who passed his (online) PhD defense today! He's the first of my students to graduate, and he did awesome work accelerating data applications with Weld, Mozart and other systems. You can see his talk and slides here:
1
26
159
Apache Spark (and Databricks) are getting first-class support in @HuggingFace! You can now rapidly load data from these engines for HuggingFace training and inference, giving up to 40% speedups.
2
21
153
Everyone’s excited about vector DBs, but there’s a lot to do to get truly high quality retrieval systems! Check out this paper benchmarking quality, latency and cost.
#acl2023 findings paper for folks working on retrieval leaderboards. Read on: ✅ We show multi-dimensional tradeoffs, e.g. quality, latency & cost (instead of just F1). ✅ Metrics that include concrete efforts, e.g. DynaScore. Code in PrimeQA:
2
23
156
One of my favorite features in the upcoming #ApacheSpark 3.0 is Adaptive Query Execution (AQE), which tunes number of reduce tasks, join algorithms and skew joins automatically. Learn how it works and how it speeds up TPC-DS queries by up to 8x:
0
41
156
Super excited about our agreement to acquire @neondatabase, bringing state-of-the-art, serverless elastic Postgres to Databricks! Building end-to-end data and AI apps is about to get much easier.
I am super excited to announce that we have agreed to acquire Neon, a developer-centric serverless Postgres company. The Neon team engineered a new database architecture that offers speed, elastic scaling, and branching and forking. The capabilities that make Neon great for.
5
11
157
We’re hiring for the RAG / AG research team at Databricks. Come help make AI even better at incorporating real-time data and external tools.
“How’s your sabbatical?” Well…DBRX is GREAT at RAG! If you’ve been using Mixtral/Llama2/GPT3.5, then try DBRX! The combination of RAG with its SoTA capabilities on knowledge/code/reasoning will unlock new CompoundAI opportunities.
2
20
151
So excited about this -- bringing amazing platforms for data and AI together. @NaveenGRao, @hanlintang and @jefrankle have built an amazing team that has steadily reduced the cost of AI training and released breakthroughs like the first open source LLMs with >64K context.
Today we’re announcing plans for @MosaicML to join forces with @databricks! We are excited at the possibilities for this deal including serving the growing number of enterprises interested in LLMs and diffusion models.
4
17
146
Congrats to the whole team at Databricks for the continued ultra-fast growth! We're hiring in all roles to continue simplifying how organizations work with data through technologies such as @DeltaLakeOSS, @MLflow, @ApacheSpark and more.
We're excited to announce that we've raised $400 million to continue our rapid global growth and engineering expansion, an investment that brings our valuation to $6.2 billion. Learn more:
1
23
142
My talk on #Lakehouse systems from #CIDR21 is now online, explaining this new trend in data management systems: You can also find our paper at
2
34
144
Welcome Omar, and really excited to keep working together on research along with the DSPy community.
Some personal news: I'm thrilled to have joined @Databricks @DbrxMosaicAI as a Research Scientist last month, before I start as MIT faculty in July 2025! Expect increased investment into the open-source DSPy community, new research, & strong emphasis on production concerns 🧵
4
6
142
Pretty accurate!
Apache Spark - Query Execution Plan. #apachespark #sql #dataengineering #databricks #scala #python #azure #Hyderabad
5
19
135
#PySpark downloads are growing 3x year-on-year. As a result, the @ApacheSpark community is investing a lot in making its Python APIs easier as part of "Project Zen". Read about some of the work currently in progress, including type hints, viz and docs:
3
42
136
We're serious about an open, compatible foundation for all enterprise data. Very excited to work with the @tabulario team to make the open source data ecosystem even better.
Databricks to acquire @tabulario, a data platform from the original creators of Apache Iceberg. Together, we will bring format compatibility to the lakehouse for @DeltaLakeOSS and @ApacheIceberg.
4
22
125
Super excited about this work, and it's open source! One of the coolest open source frameworks from my research group. It lets developers use language-based models (including retrievers) in a composable way to build complex apps.
Introducing Demonstrate–Search–Predict (𝗗𝗦𝗣), a framework for composing search and LMs w/ up to 120% gains over GPT-3.5. No more prompt engineering. ❌ Describe a high-level strategy as imperative code and let 𝗗𝗦𝗣 deal with prompts and queries. 🧵
2
17
132