LLM360
@llm360
Followers: 2K · Following: 260 · Media: 33 · Statuses: 123
LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.
Joined November 2023
We are releasing TxT360: a globally deduplicated dataset for LLM pretraining. It combines 99 Common Crawl snapshots and 14 curated sources, with a recipe to easily adjust data weighting and train the most performant models. Dataset: https://t.co/PFpnNAAiJJ Blog: https://t.co/BYNhEDQCEH
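For readers who want to poke at the data, here is a minimal sketch of streaming a few records with the Hugging Face `datasets` library; the repo id "LLM360/TxT360" and the record layout are assumptions based on this announcement, so inspect the fields before relying on them.

```python
# Minimal sketch: stream a few TxT360 documents with the Hugging Face
# `datasets` library. The repo id "LLM360/TxT360" and the field layout are
# assumptions from the announcement, not a confirmed schema.
from datasets import load_dataset

ds = load_dataset("LLM360/TxT360", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(list(doc.keys()))   # inspect the actual fields first
    print(str(doc)[:300])     # preview one record
    if i >= 2:
        break
```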
Check out the FastVideo series, which lets you generate a video in real time (a 5-second video in 1 second)! Try it out on your own hardware too. Kudos to the team for democratizing this by providing efficient methods and the full recipe.
(1/n) With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing the FastWan series, a family of fast video generation models trained via a new recipe we term "sparse distillation", to speed up video denoising time by 70X! Live
Our team is lucky to have "early access" to this work via the IFM talk given by @ssahoo_
"The Diffusion Duality" is out! @ICML2025 Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. Beats AR on 3/7 zero-shot likelihood benchmarks. Paper: https://t.co/0RKsd8NJfB Code: https://t.co/oYE9hDYrGI
KV-caching is great, but will it work for diffusion language models? @zhihanyang_ and team showed how to make it work with a 65x speedup! Check out the new preprint: https://t.co/JwjyLQf33r The LLM360 team is very interested in exploring new architectures.
arxiv.org
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models...
Thrilled to share our new paper: Esoteric Language Models (Eso-LMs)
> Fuses autoregressive (AR) and masked diffusion (MDM) paradigms
> First to unlock KV caching for MDMs (65x speedup!)
> Sets new SOTA on the generation speed-vs-quality Pareto frontier
How? Dive in
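As background for the speedup claims above, here is a minimal sketch of the standard autoregressive KV-cache idea that such work builds on; this is generic illustration code (a toy single-head attention step in PyTorch), not the Eso-LM algorithm itself.

```python
# Minimal sketch of a KV cache: keys/values of already-processed tokens are
# stored so each new step only computes attention for the newly added token.
# Toy single-head setup; not the Eso-LM method described in the paper.
import torch

d = 64
W_q = torch.randn(d, d); W_k = torch.randn(d, d); W_v = torch.randn(d, d)
k_cache, v_cache = [], []

def step(x_new):
    """x_new: (1, d) embedding of the newly generated token."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)      # cache instead of recomputing history
    v_cache.append(x_new @ W_v)
    K = torch.cat(k_cache, dim=0)    # (t, d)
    V = torch.cat(v_cache, dim=0)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                  # (1, d) context for the new token

for _ in range(5):
    out = step(torch.randn(1, d))
print(out.shape)  # torch.Size([1, 64])
```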
The LLM360 team continues to adhere to the values of open source, and we believe data is one of the most important ingredients. Please let our team know if you have any comments. We are eager to hear the voice of the community!
Unlock the power of long-context data with our new Wikipedia Extended and Aligned Europarl datasets in v1.1! We've created long-context documents by extending Wikipedia articles with related abstracts, and aligned multiple Europarl languages into single training samples!
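To make the "aligned into one training sample" idea concrete, here is a minimal sketch that packs several translations of the same Europarl passage into a single long-context document; the tags, field names, and layout are illustrative assumptions, not the actual v1.1 schema.

```python
# Minimal sketch of packing parallel translations of one Europarl passage
# into a single long-context training sample. The [lang] tag format is an
# assumption for illustration, not the TxT360 v1.1 schema.
def pack_aligned_sample(aligned: dict[str, str]) -> str:
    """aligned maps a language code to the same passage in that language."""
    parts = [f"[{lang}]\n{text}" for lang, text in sorted(aligned.items())]
    return "\n\n".join(parts)

sample = pack_aligned_sample({
    "en": "The session is resumed.",
    "de": "Die Sitzung ist wieder eröffnet.",
    "fr": "La session est reprise.",
})
print(sample)
```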
We are releasing our first successful recipe for synthetic data: generated QA pairs. High-quality TxT360 documents include multiple tailored Q&A pairs appended at the end (thanks to the permissive license of @MistralAI models!). #SyntheticData #QA
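A minimal sketch of what "Q&A pairs appended at the end of a document" can look like; `generate_qa_pairs` is a hypothetical stand-in for an LLM call (e.g. a permissively licensed Mistral model), and the exact prompt and output format used in TxT360 are not shown here.

```python
# Minimal sketch of appending generated Q&A pairs to a document.
# generate_qa_pairs is a placeholder for an LLM call; the "### Q&A" layout
# is an illustrative assumption, not the TxT360 format.
def generate_qa_pairs(document: str) -> list[tuple[str, str]]:
    # Placeholder: in practice an LLM would produce questions grounded in
    # the document together with their answers.
    return [("What is the document about?", "A short illustrative example.")]

def augment_with_qa(document: str) -> str:
    qa = generate_qa_pairs(document)
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa)
    return f"{document}\n\n### Q&A\n{qa_text}"

print(augment_with_qa("TxT360 is a pretraining dataset built from 99 Common Crawl snapshots."))
```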
BestOfWeb is a highly refined subset of the TxT360 CC dataset! It is filtered using the ProX document filtering model, which uses quality signals similar to the FineWeb-Edu classifier and adds additional format signals. #DataQuality #WebData #FineWeb
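A minimal sketch of model-based quality filtering of the kind described above: score each document and keep only those above a threshold. `score_document` and the 0.5 cutoff are placeholders, not the actual ProX model or the thresholds used for BestOfWeb.

```python
# Minimal sketch of threshold-based document filtering. The scoring function
# is a toy placeholder; a trained classifier (quality + format signals)
# would produce the score in a real pipeline.
from typing import Iterable, Iterator

def score_document(text: str) -> float:
    # Placeholder heuristic standing in for a learned quality model.
    return min(1.0, len(text.split()) / 200)

def filter_documents(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    for doc in docs:
        if score_document(doc) >= threshold:
            yield doc

kept = list(filter_documents(["short doc", "a much longer document " * 50]))
print(len(kept))  # 1
```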
TxT360 has been updated to v1.1:
- BestOfWeb: high-quality document set from the web
- QA: large-scale synthetic Q&A dataset
- Wiki_extended: extended wiki articles via links
- Europarl Aligned: reformatted long aligned corpus
https://t.co/PFpnNAzKUb
#AIResearch
huggingface.co
The MBZUAI IFM and LLM360 teams' first day at @iclr_conf! Come visit our new Institute of Foundation Models at Booth D04 in Hall 2! We're looking forward to meeting researchers and engineers and introducing them to @mbzuai.
Looking for EleutherAI at #ICLR2025? Come say hi at any of our five posters or the Open Science for Foundation Models workshop, where @BlancheMinerva is giving the opening keynote.
Announcing the first Open Science for Foundation Models (SCI-FM) Workshop at #ICLR2025! Join us in advancing transparency and reproducibility in AI through open foundation models. Looking to contribute? Join our Program Committee: https://t.co/U9eIGY0Qai Learn more at:
We would like to thank the open-source community, including @huggingface, @DeepSeek, @Qwen, @AiEleuther, and many more, for providing invaluable insights, model checkpoints, and data extraction, training, and evaluation frameworks that enabled us to continuously refine and
All data are open, with a detailed technical report documenting our development. Dataset: https://t.co/bIQSNtzCll GitHub: https://t.co/2QfdiLEFwC Paper: https://t.co/B6piYtB6Nx Follow @LLM360_org for updates. #AI #LLM #OpenSource #Mathematics #Data4LLMs #Pretraining
arxiv.org
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open,...
Does it work? The good news is that it's not just intuition: every key design decision in MegaMath has been validated through extensive ablation studies, including extraction pipelines, deduplication strategies, filtering thresholds, and code ratios & SLM recall.
How we did it (cooking massive synthetic data): We synthesized Q&A-style examples, translated code into Python, and generated interleaved text & code blocks, all carefully verified for quality and executability. These formats consistently outperformed all existing synthetic
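A minimal sketch of the executability check mentioned above: run a candidate snippet (e.g. code translated into Python) in a subprocess with a timeout and keep it only if it exits cleanly; this is an illustrative filter, not the actual MegaMath verification pipeline.

```python
# Minimal sketch of an executability check: run a candidate Python snippet
# in a subprocess with a timeout and treat a clean exit as "executable".
import subprocess
import sys

def is_executable(code: str, timeout_s: float = 5.0) -> bool:
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

print(is_executable("print(sum(range(10)))"))  # True
print(is_executable("1/0"))                    # False
```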
How we did it (revisiting the web pipeline): Instead of relying on shallow web scraping, we reprocessed 99 Common Crawl dumps (2014-2024) with optimized HTML reformatting that preserves math symbols (e.g., LaTeX/KaTeX) and a two-phase extraction pipeline to ensure fidelity
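A minimal sketch of LaTeX-preserving extraction, assuming KaTeX-style markup where the original TeX source survives in an `<annotation encoding="application/x-tex">` node: swap the rendered math node for its TeX before flattening the page to text. Illustration only, not the MegaMath pipeline.

```python
# Minimal sketch: before flattening HTML to text, replace rendered
# KaTeX/MathJax nodes with their original TeX source (which KaTeX keeps in
# an <annotation encoding="application/x-tex"> element). Illustration only.
from bs4 import BeautifulSoup, NavigableString

def extract_text_preserving_math(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for ann in soup.find_all("annotation", attrs={"encoding": "application/x-tex"}):
        tex = ann.get_text()
        rendered = ann.find_parent(class_="katex") or ann
        rendered.replace_with(NavigableString(f"${tex}$"))
    return soup.get_text(" ", strip=True)

html = (
    '<p>Euler: <span class="katex"><semantics><mrow>...</mrow>'
    '<annotation encoding="application/x-tex">e^{i\\pi}+1=0</annotation>'
    '</semantics></span></p>'
)
print(extract_text_preserving_math(html))  # Euler: $e^{i\pi}+1=0$
```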
What's in MegaMath? MegaMath is a comprehensive 371B-token collection with top data quality. It is composed of: 279B tokens of math-rich web data, 28B tokens of math-relevant code, and 64B tokens of high-quality synthetic data (QA pairs, translated code,
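A quick sanity check that the listed subsets account for the headline figure (token counts in billions, as quoted above).

```python
# Sanity check: the three quoted subsets sum to the 371B headline size.
web, code, synthetic = 279, 28, 64
assert web + code + synthetic == 371
print(f"total: {web + code + synthetic}B tokens")
```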
Why is this important? Mathematical reasoning is a key feature of advanced LLMs. Training math-proficient models like o1 and DeepSeek-R1 requires large-scale, high-quality, diverse math data. Proprietary corpora, such as Qwen-2.5-Math (1T) and DeepSeekMath (120B), show strong
We proudly present MegaMath, the largest open-source math reasoning pretraining corpus: 371B tokens of high-quality mathematical web, code, and synthetic data, designed to build the data foundation for next-generation math-proficient LLMs like o1 and R1. #LLM #OpenSource