LLM360

@llm360

Followers
2K
Following
260
Media
33
Statuses
123

LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.

Joined November 2023
@llm360
LLM360
1 year
📢📢 We are releasing TxT360: a globally deduplicated dataset for LLM pretraining
🌍 99 Common Crawls
📘 14 Curated Sources
👨‍🍳 recipe to easily adjust data weighting and train the most performant models
Dataset: https://t.co/PFpnNAAiJJ
Blog: https://t.co/BYNhEDQCEH
5
88
245
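For readers who want a concrete picture of what "globally deduplicated" means here, the following is a minimal Python sketch, not the actual TxT360 pipeline: it hashes every document once across all snapshots (rather than deduplicating each snapshot separately) and exposes a toy weighting knob in the spirit of the "recipe" mentioned above. All function and field names are hypothetical.

```python
import hashlib

def global_exact_dedup(snapshots):
    """Keep each document only once across ALL snapshots,
    rather than deduplicating each snapshot separately."""
    seen = set()
    for snapshot in snapshots:          # e.g. 99 Common Crawl dumps
        for doc in snapshot:            # doc: {"text": ..., "source": ...}
            digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc

def upsample(docs, weights):
    """Toy 'recipe' knob: repeat documents from a source according to a
    user-chosen integer weight (default weight is 1)."""
    for doc in docs:
        for _ in range(weights.get(doc["source"], 1)):
            yield doc
```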
@llm360
LLM360
3 months
Check out the FastVideo series, which lets you generate video in real time (a 5-second video in 1 second)! Try it out on your own hardware too. Kudos to the team for democratizing this by providing efficient methods and the full recipe.
@haoailab
Hao AI Lab
3 months
(1/n) 🚀 With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing the FastWan series, a family of fast video generation models trained via a new recipe we term "sparse distillation", to speed up video denoising time by 70X! 🖥️ Live
0
0
4
@llm360
LLM360
5 months
Our team was lucky to get "early access" to this work through the IFM talk given by @ssahoo_
@ssahoo_
Subham Sahoo
5 months
🚨 "The Diffusion Duality" is out! @ICML2025
⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion.
🦾 Beats AR on 3/7 zero-shot likelihood benchmarks.
📄 Paper: https://t.co/0RKsd8NJfB
💻 Code: https://t.co/oYE9hDYrGI
🧠
1
4
12
@llm360
LLM360
5 months
KV-caching is great, but will it work for diffusion language models? @zhihanyang_ and team showed how to make it work, with a 65x speedup 🚀! Check out the new preprint: https://t.co/JwjyLQf33r The LLM360 team is very interested in exploring new architectures.
arxiv.org
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models...
@zhihanyang_
Zhihan Yang
5 months
📢 Thrilled to share our new paper: Esoteric Language Models (Eso-LMs)
> 🔀 Fuses autoregressive (AR) and masked diffusion (MDM) paradigms
> 🚀 First to unlock KV caching for MDMs (65x speedup!)
> 🥇 Sets new SOTA on the generation speed-vs-quality Pareto frontier
How? Dive in 👇
0
7
29
@llm360
LLM360
6 months
The LLM360 team continues to adhere to the values of open source, and we believe data is one of the most important ingredients. Please let our team know if you have any comments. We are eager to hear the voice of the community!
0
0
1
@llm360
LLM360
6 months
๐Ÿ Unlock the power of long-context data with our new Wikipedia Extended and Aligned Europral datasets in v1.1! We've created long context documents by extending Wikipedia articles with related abstracts, and align multiple Europarl languages into one training samples!
1
0
0
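As a rough idea of how such long-context samples can be assembled, here is a hedged Python sketch; field names like "text", the `Related:` formatting, and the language tags are illustrative assumptions, not the released pipeline.

```python
def extend_wikipedia_article(article, linked_abstracts):
    """Build one long-context sample: the full article text followed by
    the abstracts of related (linked) articles."""
    parts = [article["text"]]
    parts += [f"Related: {title}\n{abstract}" for title, abstract in linked_abstracts]
    return "\n\n".join(parts)

def align_europarl(parallel_text, languages):
    """Pack several Europarl languages of the same speech into one sample,
    so aligned translations appear together in a single training document."""
    blocks = [f"[{lang}]\n" + "\n".join(parallel_text[lang]) for lang in languages]
    return "\n\n".join(blocks)
```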
@llm360
LLM360
6 months
We are releasing our first successful synthetic-data recipe: generated QA pairs. Each high-quality TxT360 document includes multiple tailored Q&A pairs appended at the end (thanks to the permissive license of @MistralAI models!). #SyntheticData #QA
1
0
0
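A minimal sketch of the appended-QA idea, assuming the Q&A pairs have already been generated by a permissively licensed model; the `Q:`/`A:` formatting is illustrative, not the exact TxT360 layout.

```python
def append_qa_pairs(document_text, qa_pairs):
    """Append generated Q&A pairs to the end of a pretraining document.
    `qa_pairs` is a list of (question, answer) strings produced by a
    generator model; the layout below is purely illustrative."""
    qa_block = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return document_text + "\n\n" + qa_block
```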
@llm360
LLM360
6 months
🌟 BestOfWeb is a highly refined subset of the TxT360 CC dataset! 📊 It is filtered with the ProX document filtering model, which uses quality signals similar to the FineWeb-Edu classifier and adds additional format signals. #DataQuality #WebData #FineWeb
1
0
0
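A hedged sketch of threshold-based document filtering in Python: `score_fn` stands in for a document-level filtering model such as ProX, the per-signal score dictionary and the 0.8 threshold are illustrative assumptions, and a document is kept only if every signal clears the threshold.

```python
def best_of_web(docs, score_fn, quality_threshold=0.8):
    """Keep only documents whose quality and format signals all clear a
    threshold. `score_fn` is a stand-in for a filtering model."""
    for doc in docs:
        scores = score_fn(doc["text"])   # e.g. {"quality": 0.9, "format": 0.7}
        if min(scores.values()) >= quality_threshold:
            yield doc
```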
@llm360
LLM360
6 months
📢📢 TxT360 has been updated to v1.1:
🌟 BestOfWeb: high-quality doc set from the web
❓ QA: large-scale synthetic Q&A dataset
📖 Wiki_extended: extended wiki articles via links
🌍 Europarl Aligned: reformatted long aligned corpus
https://t.co/PFpnNAzKUb #AIResearch
2
10
28
@llm360
LLM360
7 months
It's the MBZUAI IFM and LLM360 team's first day at @iclr_conf; come visit our new Institute of Foundation Models at Booth D04 in Hall 2! We're looking forward to meeting researchers and engineers and introducing them to @mbzuai.
0
9
28
@AiEleuther
EleutherAI
7 months
Looking for EleutherAI at #ICLR2025? Come say hi at any of our five posters or the Open Science for Foundation Models workshop where @BlancheMinerva is giving the opening keynote. 🧵
2
3
25
@sivil_taram
Qian Liu
10 months
🎉 Announcing the first Open Science for Foundation Models (SCI-FM) Workshop at #ICLR2025! Join us in advancing transparency and reproducibility in AI through open foundation models.
🤝 Looking to contribute? Join our Program Committee: https://t.co/U9eIGY0Qai
🔍 Learn more at:
6
44
175
@llm360
LLM360
7 months
We would like to thank the open-source community, including @huggingface, @DeepSeek, @Qwen, @AiEleuther, and many more, for providing invaluable insights, model checkpoints, and data extraction, training, and evaluation frameworks that enabled us to continuously refine and
0
0
2
@llm360
LLM360
7 months
📈 Does it work? The good news is: it's not just intuition. Every key design decision in MegaMath has been validated through extensive ablation studies, including:
● Extraction pipelines
● Deduplication strategies
● Filtering thresholds
● Code ratios & SLM recall
●
1
0
2
@llm360
LLM360
7 months
🧪 How we did it: cooking massive synthetic data. We synthesized Q&A-style examples, translated code into Python, and generated interleaved text & code blocks, all carefully verified for quality and executability. These formats consistently outperformed all existing synthetic
1
1
2
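The "verified for executability" step mentioned in the tweet above can be approximated with a crude subprocess check. The sketch below is an assumption about how such a verifier might look, not the MegaMath implementation; it simply runs a candidate Python snippet and keeps it only if it exits cleanly within a timeout.

```python
import subprocess
import sys
import tempfile

def is_executable(python_snippet, timeout=5):
    """Run a generated/translated Python snippet in a subprocess and
    report whether it exits cleanly within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(python_snippet)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```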
@llm360
LLM360
7 months
🔧 How we did it: revisiting the web pipeline. Instead of relying on shallow web scraping, we reprocessed 99 Common Crawl dumps (2014–2024) with optimized HTML reformatting that preserves math symbols (e.g., LaTeX/KaTeX) and a two-phase extraction pipeline to ensure fidelity
1
0
2
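As a rough illustration of why math-aware HTML reformatting matters, here is a toy two-phase extractor in Python: phase one rewrites MathJax-style script tags into inline $...$ LaTeX so the math survives, and phase two strips the remaining markup. The regexes are assumptions for illustration only; the actual pipeline is substantially more sophisticated.

```python
import re

# Phase 1 target: MathJax-style math blocks embedded in HTML.
MATH_PATTERN = re.compile(
    r'<script type="math/tex(?:; mode=display)?">(.*?)</script>', re.S
)
# Phase 2 target: any remaining HTML tags.
TAG_PATTERN = re.compile(r"<[^>]+>")

def extract_text_preserving_math(html):
    """Toy two-phase extraction: protect math as $...$ LaTeX first,
    then strip the remaining tags and normalize whitespace."""
    html = MATH_PATTERN.sub(lambda m: f" ${m.group(1)}$ ", html)
    text = TAG_PATTERN.sub(" ", html)
    return re.sub(r"\s+", " ", text).strip()
```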
@llm360
LLM360
7 months
💡 What's in MegaMath? MegaMath is a comprehensive 371B-token collection with top data quality. It is composed of:
📚 279B tokens of math-rich web data
🧑‍💻 28B tokens of math-relevant code
🧠 64B tokens of high-quality synthetic data (QA pairs, translated code,
1
1
2
@llm360
LLM360
7 months
🔍 Why is this important? Mathematical reasoning is a key feature of advanced LLMs. Training math-proficient models like o1 and DeepSeek-R1 requires large-scale, high-quality, diverse math data. Proprietary corpora, such as Qwen-2.5-Math (1T) and DeepSeekMath (120B), show strong
1
0
4
@llm360
LLM360
7 months
We proudly present MegaMath, the largest open-source math reasoning pretraining corpus: 371B tokens of high-quality mathematical web, code, and synthetic data, designed to build the data foundation for next-generation math-proficient LLMs like o1 and R1. 🧵👇 #LLM #OpenSource
3
35
85