
Stella Biderman
@BlancheMinerva
Followers
17K
Following
11K
Media
624
Statuses
13K
Open source LLMs and interpretability research at @BoozAllen and @AiEleuther. My employers disown my tweets. She/her
Joined May 2019
Two years in the making, we finally have 8 TB of openly licensed data with document-level metadata for authorship attribution, licensing details, links to original copies, and more. Hugely proud of the entire team.
Can you train performant language models without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1&2
18
67
554
@AiEleuther But many orgs – @AiEleuther @llm360 DCLM @allen_ai @CMU_AI @StanfordHAI @mbzuai and more – kept proving that wrong so they need to keep raising the capital expenditure to count as "meaningful." And we'll keep meeting it.
0
0
49
The same thing happened before @AiEleuther started training models. People at many companies kept telling academics and non-profits "oh you'll never be able to train a model like GPT-3," "leave model training to companies, just study the behavior of the things we release."
1
1
31
We haven't done the best job promoting it, but the @AiEleuther YouTube channel is a goldmine of AI content.
If you can't make it, no problem! All of our reading groups and speaker series upload to our YouTube. We have over 100 hours of content on topics from ML Scalability and Performance to Functional Analysis to podcasts and interviews featuring our team.
3
10
133
RT @AiEleuther: We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members…
0
21
0
Someone should probably critically analyze the abilities of models to do scientific work.
#NLProc AI Co-Scientists 🤖 can generate ideas, but can they spot mistakes? (not yet! 🚫) In my recent paper, we introduce SPOT, a dataset of STEM manuscripts (math, materials science, chemistry, physics, etc.), annotated with real errors. SOTA models like o3, gemini-2.5-pro
1
0
41
Extremely exciting to see this finally come out. A game-changer for malware detection and analysis.
Led by @rjjoyce8, #EMBER24 has arrived @kdd_news #KDD25, the best, most open, and versatile malware detection benchmark ever! w/ @rjzak @mrphilroth @drhyrum & others, let's try to barely summarize all the new things you can do now! @BoozAllen @CrowdStrike @Cisco 🧵👇
0
1
10
This is incredibly good science. Read the entire thread, I beg you.
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
5
16
157
RT @lschmidt3: Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset m…
0
213
0
RT @storytracer: Common Pile v0.1 is only the beginning. At @AiEleuther we will publish open datasets on a regular basis from now on, using…
0
15
0
RT @AiEleuther: What do we mean by "openly licensed" data? Following the lead of orgs like @publicknowledge @Wikimedia @creativecommons we…
0
2
0