LLM360
@llm360
Followers: 2K · Following: 260 · Media: 33 · Statuses: 123
LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.
Joined November 2023
We are releasing TxT360: a globally deduplicated dataset for LLM pretraining. It combines 99 Common Crawl snapshots and 14 curated sources, with a recipe to easily adjust data weighting and train the most performant models. Dataset: https://t.co/PFpnNAAiJJ Blog: https://t.co/BYNhEDQCEH
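For readers who want to poke at the data, here is a minimal sketch of streaming a few records with the Hugging Face `datasets` library; the repo id "LLM360/TxT360" and the record layout are assumptions based on this announcement, so inspect the fields before relying on them.

```python
# Minimal sketch: stream a few TxT360 documents with the Hugging Face
# `datasets` library. The repo id "LLM360/TxT360" and the field layout are
# assumptions from the announcement, not a confirmed schema.
from datasets import load_dataset

ds = load_dataset("LLM360/TxT360", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(list(doc.keys()))   # inspect the actual fields first
    print(str(doc)[:300])     # preview one record
    if i >= 2:
        break
```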
Check out the FastVideo series, which lets you generate a video in real time (a 5-second video in 1 second)! Try it out on your own hardware too. Kudos to the team for democratizing this by providing efficient methods and the full recipe.
(1/n) With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing the FastWan series, a family of fast video generation models trained via a new recipe we term "sparse distillation", to speed up video denoising time by 70X! Live
Our team is lucky to have "early access" to this work via the IFM talk given by @ssahoo_
"The Diffusion Duality" is out! @ICML2025 Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. Beats AR on 3/7 zero-shot likelihood benchmarks. Paper: https://t.co/0RKsd8NJfB Code: https://t.co/oYE9hDYrGI
KV-caching is great, but will it work for diffusion language models? @zhihanyang_ and team showed how to make it work with a 65x speedup! Check out the new preprint: https://t.co/JwjyLQf33r The LLM360 team is very interested in exploring new architectures.
arxiv.org
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models...
Thrilled to share our new paper: Esoteric Language Models (Eso-LMs)
> Fuses autoregressive (AR) and masked diffusion (MDM) paradigms
> First to unlock KV caching for MDMs (65x speedup!)
> Sets new SOTA on the generation speed-vs-quality Pareto frontier
How? Dive in
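As background for the speedup claims above, here is a minimal sketch of the standard autoregressive KV-cache idea that such work builds on; this is generic illustration code (a toy single-head attention step in PyTorch), not the Eso-LM algorithm itself.

```python
# Minimal sketch of a KV cache: keys/values of already-processed tokens are
# stored so each new step only computes attention for the newly added token.
# Toy single-head setup; not the Eso-LM method described in the paper.
import torch

d = 64
W_q = torch.randn(d, d); W_k = torch.randn(d, d); W_v = torch.randn(d, d)
k_cache, v_cache = [], []

def step(x_new):
    """x_new: (1, d) embedding of the newly generated token."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)      # cache instead of recomputing history
    v_cache.append(x_new @ W_v)
    K = torch.cat(k_cache, dim=0)    # (t, d)
    V = torch.cat(v_cache, dim=0)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                  # (1, d) context for the new token

for _ in range(5):
    out = step(torch.randn(1, d))
print(out.shape)  # torch.Size([1, 64])
```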
The LLM360 team continues to adhere to the values of open source, and we believe data is one of the most important ingredients. Please let our team know if you have any comments. We are eager to hear the voice of the community!
Unlock the power of long-context data with our new Wikipedia Extended and Aligned Europarl datasets in v1.1! We've created long-context documents by extending Wikipedia articles with related abstracts, and aligned multiple Europarl languages into single training samples!
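To make the "aligned into one training sample" idea concrete, here is a minimal sketch that packs several translations of the same Europarl passage into a single long-context document; the tags, field names, and layout are illustrative assumptions, not the actual v1.1 schema.

```python
# Minimal sketch of packing parallel translations of one Europarl passage
# into a single long-context training sample. The [lang] tag format is an
# assumption for illustration, not the TxT360 v1.1 schema.
def pack_aligned_sample(aligned: dict[str, str]) -> str:
    """aligned maps a language code to the same passage in that language."""
    parts = [f"[{lang}]\n{text}" for lang, text in sorted(aligned.items())]
    return "\n\n".join(parts)

sample = pack_aligned_sample({
    "en": "The session is resumed.",
    "de": "Die Sitzung ist wieder eröffnet.",
    "fr": "La session est reprise.",
})
print(sample)
```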
We are releasing our first successful recipe for synthetic data: generated QA pairs. High-quality TxT360 documents include multiple tailored Q&A pairs appended at the end (thanks to the permissive license of @MistralAI models!). #SyntheticData #QA
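A minimal sketch of what "Q&A pairs appended at the end of a document" can look like; `generate_qa_pairs` is a hypothetical stand-in for an LLM call (e.g. a permissively licensed Mistral model), and the exact prompt and output format used in TxT360 are not shown here.

```python
# Minimal sketch of appending generated Q&A pairs to a document.
# generate_qa_pairs is a placeholder for an LLM call; the "### Q&A" layout
# is an illustrative assumption, not the TxT360 format.
def generate_qa_pairs(document: str) -> list[tuple[str, str]]:
    # Placeholder: in practice an LLM would produce questions grounded in
    # the document together with their answers.
    return [("What is the document about?", "A short illustrative example.")]

def augment_with_qa(document: str) -> str:
    qa = generate_qa_pairs(document)
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa)
    return f"{document}\n\n### Q&A\n{qa_text}"

print(augment_with_qa("TxT360 is a pretraining dataset built from 99 Common Crawl snapshots."))
```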
BestOfWeb is a highly refined subset of the TxT360 CC dataset! It is filtered using the ProX document filtering model, which uses quality signals similar to the FineWeb-Edu classifier and adds additional format signals. #DataQuality #WebData #FineWeb
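A minimal sketch of model-based quality filtering of the kind described above: score each document and keep only those above a threshold. `score_document` and the 0.5 cutoff are placeholders, not the actual ProX model or the thresholds used for BestOfWeb.

```python
# Minimal sketch of threshold-based document filtering. The scoring function
# is a toy placeholder; a trained classifier (quality + format signals)
# would produce the score in a real pipeline.
from typing import Iterable, Iterator

def score_document(text: str) -> float:
    # Placeholder heuristic standing in for a learned quality model.
    return min(1.0, len(text.split()) / 200)

def filter_documents(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    for doc in docs:
        if score_document(doc) >= threshold:
            yield doc

kept = list(filter_documents(["short doc", "a much longer document " * 50]))
print(len(kept))  # 1
```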
TxT360 has been updated to v1.1:
- BestOfWeb: high-quality document set from the web
- QA: large-scale synthetic Q&A dataset
- Wiki_extended: extended wiki articles via links
- Europarl Aligned: reformatted long aligned corpus
https://t.co/PFpnNAzKUb
#AIResearch
huggingface.co
The MBZUAI IFM and LLM360 teams' first day at @iclr_conf! Come visit our new Institute of Foundation Models at Booth D04 in Hall 2! We're looking forward to meeting researchers and engineers and introducing them to @mbzuai.
Looking for EleutherAI at #ICLR2025? Come say hi at any of our five posters or the Open Science for Foundation Models workshop, where @BlancheMinerva is giving the opening keynote.
Announcing the first Open Science for Foundation Models (SCI-FM) Workshop at #ICLR2025! Join us in advancing transparency and reproducibility in AI through open foundation models. Looking to contribute? Join our Program Committee: https://t.co/U9eIGY0Qai Learn more at:
We would like to thank the open-source community, including @huggingface, @DeepSeek, @Qwen, @AiEleuther, and many more, for providing invaluable insights, model checkpoints, and data extraction, training, and evaluation frameworks that enabled us to continuously refine and
All data are open, with a detailed technical report documenting our development. Dataset: https://t.co/bIQSNtzCll GitHub: https://t.co/2QfdiLEFwC Paper: https://t.co/B6piYtB6Nx Follow @LLM360_org for updates. #AI #LLM #OpenSource #Mathematics #Data4LLMs #Pretraining
arxiv.org
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open,...
Does it work? The good news is that it's not just intuition: every key design decision in MegaMath has been validated through extensive ablation studies, including extraction pipelines, deduplication strategies, filtering thresholds, and code ratios & SLM recall.
How we did it (cooking massive synthetic data): We synthesized Q&A-style examples, translated code into Python, and generated interleaved text & code blocks, all carefully verified for quality and executability. These formats consistently outperformed all existing synthetic
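A minimal sketch of the executability check mentioned above: run a candidate snippet (e.g. code translated into Python) in a subprocess with a timeout and keep it only if it exits cleanly; this is an illustrative filter, not the actual MegaMath verification pipeline.

```python
# Minimal sketch of an executability check: run a candidate Python snippet
# in a subprocess with a timeout and treat a clean exit as "executable".
import subprocess
import sys

def is_executable(code: str, timeout_s: float = 5.0) -> bool:
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

print(is_executable("print(sum(range(10)))"))  # True
print(is_executable("1/0"))                    # False
```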
How we did it (revisiting the web pipeline): Instead of relying on shallow web scraping, we reprocessed 99 Common Crawl dumps (2014-2024) with optimized HTML reformatting that preserves math symbols (e.g., LaTeX/KaTeX) and a two-phase extraction pipeline to ensure fidelity
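A minimal sketch of LaTeX-preserving extraction, assuming KaTeX-style markup where the original TeX source survives in an `<annotation encoding="application/x-tex">` node: swap the rendered math node for its TeX before flattening the page to text. Illustration only, not the MegaMath pipeline.

```python
# Minimal sketch: before flattening HTML to text, replace rendered
# KaTeX/MathJax nodes with their original TeX source (which KaTeX keeps in
# an <annotation encoding="application/x-tex"> element). Illustration only.
from bs4 import BeautifulSoup, NavigableString

def extract_text_preserving_math(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for ann in soup.find_all("annotation", attrs={"encoding": "application/x-tex"}):
        tex = ann.get_text()
        rendered = ann.find_parent(class_="katex") or ann
        rendered.replace_with(NavigableString(f"${tex}$"))
    return soup.get_text(" ", strip=True)

html = (
    '<p>Euler: <span class="katex"><semantics><mrow>...</mrow>'
    '<annotation encoding="application/x-tex">e^{i\\pi}+1=0</annotation>'
    '</semantics></span></p>'
)
print(extract_text_preserving_math(html))  # Euler: $e^{i\pi}+1=0$
```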
What's in MegaMath? MegaMath is a comprehensive 371B-token collection with top data quality. It is composed of: 279B tokens of math-rich web data, 28B tokens of math-relevant code, and 64B tokens of high-quality synthetic data (QA pairs, translated code,
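A quick sanity check that the listed subsets account for the headline figure (token counts in billions, as quoted above).

```python
# Sanity check: the three quoted subsets sum to the 371B headline size.
web, code, synthetic = 279, 28, 64
assert web + code + synthetic == 371
print(f"total: {web + code + synthetic}B tokens")
```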
Why is this important? Mathematical reasoning is a key feature of advanced LLMs. Training math-proficient models like o1 and DeepSeek-R1 requires large-scale, high-quality, diverse math data. Proprietary corpora, such as Qwen-2.5-Math (1T) and DeepSeekMath (120B), show strong
We proudly present MegaMath, the largest open-source math reasoning pretraining corpus: 371B tokens of high-quality mathematical web, code, and synthetic data, designed to build the data foundation for next-generation math-proficient LLMs like o1 and R1. #LLM #OpenSource