Hao Xu @EMNLP
@xuhaoxh
Undergrad @uwcse @uwnlp | 🎻Musician
Joined August 2024 · 175 Followers · 12 Following · 6 Media · 16 Statuses
Wanna 🔎 inside Internet-scale LLM training data w/o spending 💰💰💰? Introducing infini-gram mini, an exact-match search engine with 14x less storage req than the OG infini-gram 😎 We make 45.6 TB of text searchable. Read on to find our Web Interface, API, and more. (1/n) ⬇️
Congratulations to the amazing Hao @xuhaoxh and @liujc1998 for winning the best paper award at EMNLP. Hao is applying for grad schools — highly recommend her.
🎉 Congratulations to all #EMNLP2025 award winners 🎉 Starting with the ✨Best Paper award ✨: "Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index" by Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi https://t.co/DKYhylaopF 1/n
Our infini-gram mini paper received the Best Paper Award at #EMNLP2025!! Really proud 🥹
We are presenting infini-gram mini at EMNLP this week! Check out how to search Internet-scale corpora on Friday, Nov. 7th, 2pm at Hall C3 (Poster Session 7).
(8/n) Joint work with @liujc1998 and thanks to the amazing @nlpnoah @YejinChoinka @HannaHajishirzi for advising this project!! ❤️
(7/n) We envision infini-gram mini being used in other applications, such as data curation and task-specific dataset construction, and many more await your exploration 🧭
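To make one of these envisioned uses concrete, here is a toy Python sketch of task-specific dataset construction via exact-match search. The in-memory scan and the CORPUS list are illustrative stand-ins for locate/count queries against the real index:

```python
# Toy stand-in corpus; a real pipeline would issue locate/count queries
# against the 45.6 TB index instead of scanning raw text.
CORPUS = [
    "Translate 'good morning' into French: bonjour.",
    "A recipe for sourdough bread.",
    "Translate 'thank you' into Spanish: gracias.",
]

def build_task_dataset(corpus, seed_phrases):
    """Keep documents containing any seed phrase verbatim."""
    return [doc for doc in corpus if any(p in doc for p in seed_phrases)]

print(build_task_dataset(CORPUS, ["Translate '"]))  # -> the 2 translation docs
```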
(6/n) Infini-gram mini implements the FM-index at massive scale. Compared to the SOTA implementation (sdsl), our indexing achieves an 18x speedup with 3.2x lower RAM usage, and our on-disk search needs only a modicum of RAM at query time.
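A sketch of why on-disk search can get away with so little RAM: keep only sparse Occ (rank) checkpoints in memory and read the short BWT stretch between checkpoints from disk for each query step. The layout and names below are illustrative, not infini-gram mini's actual on-disk format (Occ itself is sketched more fully after the (5/n) tweet below):

```python
SAMPLE = 64  # BWT bytes per in-RAM checkpoint (illustrative spacing)

def build_checkpoints(bwt: bytes, path: str):
    """Write the BWT to disk; keep one 256-entry count vector per block."""
    with open(path, "wb") as f:
        f.write(bwt)
    checkpoints, counts = [], [0] * 256
    for i, b in enumerate(bwt):
        if i % SAMPLE == 0:
            checkpoints.append(counts.copy())
        counts[b] += 1
    return checkpoints

def occ(path, checkpoints, c: int, i: int) -> int:
    """Occ(c, i): occurrences of byte c in bwt[:i].
    One seek plus at most SAMPLE bytes read, regardless of corpus size."""
    block = min(i // SAMPLE, len(checkpoints) - 1)
    with open(path, "rb") as f:
        f.seek(block * SAMPLE)
        chunk = f.read(i - block * SAMPLE)
    return checkpoints[block][c] + chunk.count(bytes([c]))

bwt = b"annb\x00aa"  # BWT of "banana" (with sentinel)
cps = build_checkpoints(bwt, "bwt.bin")
print(occ("bwt.bin", cps, ord("a"), 6))  # 2 'a's among the first 6 bytes
```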
(5/n) Behind infini-gram mini: we index text with the “FM-index”, a compact data structure widely used in bioinformatics. It stores a compressed variant of the suffix array together with the text corpus, reducing storage overhead to just 0.44x the corpus size.
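For readers new to the FM-index, here is a minimal in-memory Python sketch of its core operation: counting a pattern with LF-mapping backward search. The quadratic BWT construction and linear-scan rank are toy choices for clarity; production indexes use far more compact rank structures:

```python
from bisect import bisect_left

def bwt_from_text(text):
    """Burrows-Wheeler transform via a toy O(n^2 log n) rotation sort."""
    text += "\0"  # unique sentinel, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_count(bwt, pattern):
    """Count occurrences of `pattern` with LF-mapping backward search."""
    sorted_bwt = sorted(bwt)                 # C[c] = chars strictly smaller than c
    def C(c):
        return bisect_left(sorted_bwt, c)
    def occ(c, i):                           # Occ(c, i): occurrences of c in bwt[:i]
        return bwt.count(c, 0, i)            # toy linear scan; real indexes use rank structures
    lo, hi = 0, len(bwt)                     # current suffix-array interval [lo, hi)
    for c in reversed(pattern):              # extend the match one character leftward
        lo, hi = C(c) + occ(c, lo), C(c) + occ(c, hi)
        if lo >= hi:
            return 0                         # pattern absent
    return hi - lo

bwt = bwt_from_text("banana banana band")
print(fm_count(bwt, "ana"))  # -> 4
print(fm_count(bwt, "ban"))  # -> 3
```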
(4/n) We host a bulletin to track benchmark contamination as text corpora evolve, and provide regular updates. You can view results and add new benchmarks to be monitored for contamination. https://t.co/pTwX4OaQJe
(3/n) With infini-gram mini, we detect benchmark contamination at scale. We find several core LM benchmarks to be heavily contaminated in Internet crawls, which can lead to overestimating the capabilities of LMs trained on such data.
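A hedged sketch of the contamination check in Python: flag a benchmark item if a long prefix of it appears verbatim in the corpus. The toy in-memory corpus_count stands in for an infini-gram mini count query (see the API sketch after the (2/n) tweet below):

```python
# Toy crawl standing in for the indexed corpus.
TOY_CRAWL = [
    "The capital of France is Paris. It hosts the Louvre.",
    "completely unrelated page about gardening",
]

def corpus_count(text):
    # Stand-in for an infini-gram mini count query over the real index;
    # the real engine answers this from the FM-index, not a scan.
    return sum(doc.count(text) for doc in TOY_CRAWL)

def contaminated_fraction(benchmark_items, probe_len=20):
    """Share of items whose first `probe_len` characters appear verbatim
    in the corpus; long exact matches rarely happen by chance."""
    hits = sum(
        1 for item in benchmark_items
        if len(item) >= probe_len and corpus_count(item[:probe_len]) > 0
    )
    return hits / len(benchmark_items)

benchmark = ["The capital of France is Paris.", "Name the largest ocean."]
print(contaminated_fraction(benchmark))  # -> 0.5
```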
(2/n) 🔎 Head to our web interface to search the 45.6 TB of text we indexed (including recent Common Crawls): https://t.co/oaFciWZQsN
More resources:
- API documentation: https://t.co/0Xdw2G7EpZ
- Source code: https://t.co/1IfBT6tXKG
- Paper: https://t.co/hJNkdwWV2f
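For programmatic access, a minimal request sketch. The endpoint and JSON schema below follow the original infini-gram API; infini-gram mini's may differ, so treat every field name here as an assumption and consult the API documentation linked above:

```python
import requests

# Schema copied from the original infini-gram API; infini-gram mini's
# endpoint and fields may differ from this illustrative example.
resp = requests.post("https://api.infini-gram.io/", json={
    "index": "v4_rpj_llama_s4",        # an original infini-gram index name
    "query_type": "count",             # exact-match occurrence count
    "query": "natural language processing",
})
print(resp.json())  # e.g. {"count": ..., ...}
```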