Hao Xu @EMNLP Profile
Hao Xu @EMNLP

@xuhaoxh

Followers
175
Following
12
Media
6
Statuses
16

Undergrad @uwcse @uwnlp | 🎻Musician

Joined August 2024
@xuhaoxh
Hao Xu @EMNLP
5 months
Wanna 🔎 inside Internet-scale LLM training data w/o spending 💰💰💰? Introducing infini-gram mini, an exact-match search engine with 14x less storage req than the OG infini-gram 😎 We make 45.6 TB of text searchable. Read on to find our Web Interface, API, and more. (1/n) ⬇️
6
32
115
@HannaHajishirzi
Hanna Hajishirzi
9 days
Congratulations to the amazing Hao @xuhaoxh and @liujc1998 for winning the best paper award at EMNLP. Hao is applying for grad schools — highly recommend her.
@liujc1998
Jiacheng Liu
10 days
Our infini-gram mini paper received the Best Paper Award at #EMNLP2025 !! Really proud 🥹
5
6
111
@emnlpmeeting
EMNLP 2025
9 days
🎉 Congratulations to all #EMNLP2025 award winners 🎉 Starting with the ✨Best Paper award ✨: "Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index" by Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi https://t.co/DKYhylaopF 1/n
2
32
220
@xuhaoxh
Hao Xu @EMNLP
13 days
We are presenting infini-gram mini at EMNLP this week! Check out how to search Internet-scale corpora on Friday, Nov. 7th, 2pm at Hall C3 (Poster Session 7).
@xuhaoxh
Hao Xu @EMNLP
5 months
(8/n) Joint work with @liujc1998 and thanks to the amazing @nlpnoah @YejinChoinka @HannaHajishirzi for advising this project!! ❤️
0
0
4
@xuhaoxh
Hao Xu @EMNLP
5 months
(7/n) We envision infini-gram mini being used in other applications, such as data curation, task-specific dataset construction, and many more that await your exploration 🧭
1
0
6
@xuhaoxh
Hao Xu @EMNLP
5 months
(6/n) Infini-gram mini implements FM-index at massive scale. Compared to the SOTA implementation (sdsl), our indexing has an 18x speedup with 3.2x lower RAM usage, and our on-disk search uses only a modicum of RAM at query time.
2
0
3
@xuhaoxh
Hao Xu @EMNLP
5 months
(5/n) Behind infini-gram mini: we index text with an “FM-index”, a compact data structure frequently used in bioinformatics. It stores a compressed variant of the suffix array alongside the text corpus, greatly reducing storage overhead to 0.44x the corpus size.
1
0
6
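The FM-index lookup behind this tweet can be sketched in miniature. The toy below builds a Burrows-Wheeler transform and counts exact matches with backward search, the same O(|pattern|) counting idea an FM-index provides; it is an illustrative sketch only (all names are made up for the demo), not the paper's compressed, on-disk implementation.

```python
# Toy FM-index: count exact-match occurrences via backward search
# over the Burrows-Wheeler transform (BWT).

def bwt(text):
    """BWT via a full rotation sort (fine for a demo, not for 45.6 TB)."""
    text += "\0"  # unique sentinel, smaller than every other character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def build_fm_index(text):
    last = bwt(text)
    # C[c] = number of characters in the text strictly smaller than c
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[i][c] = occurrences of c in last[:i]; real FM-indexes compress
    # this rank table heavily -- that is where the storage savings live.
    occ, running = [dict()], {}
    for ch in last:
        running[ch] = running.get(ch, 0) + 1
        occ.append(dict(running))
    return C, occ

def count(pattern, C, occ):
    """Backward search: narrow the suffix-array interval one symbol at a time."""
    lo, hi = 0, len(occ) - 1
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[lo].get(ch, 0)
        hi = C[ch] + occ[hi].get(ch, 0)
        if lo >= hi:
            return 0
    return hi - lo

corpus = "the cat sat on the mat"
C, occ = build_fm_index(corpus)
print(count("the", C, occ))  # 2
print(count("at", C, occ))   # 3
```

Each query touches only the rank table, never the original text, which is why a production FM-index can answer counts from a structure smaller than the corpus itself.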
@xuhaoxh
Hao Xu @EMNLP
5 months
(4/n) We host a bulletin to track benchmark contamination as text corpora evolve, and provide regular updates. You can view results and add new benchmarks to be monitored for contamination. https://t.co/pTwX4OaQJe
1
0
5
@xuhaoxh
Hao Xu @EMNLP
5 months
(3/n) With infini-gram mini, we detect benchmark contamination at scale. We find several core LM benchmarks to be heavily contaminated in Internet crawls. This can lead to overestimating the capabilities of LMs if trained on such data.
1
2
9
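The contamination check described above reduces to asking whether benchmark text appears verbatim in the crawl. A minimal sketch, assuming a simple n-gram-overlap criterion (the function names and the n=8 threshold here are illustrative, not the paper's exact procedure), with a Python set standing in for the exact-match search engine:

```python
# Toy contamination check: flag benchmark examples that share at least one
# long n-gram with the training corpus. Real systems query an index over
# terabytes; a set over a tiny corpus stands in for it here.

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(benchmark_examples, corpus_text, n=8):
    """Return the examples whose text overlaps the corpus in some n-gram."""
    corpus_ngrams = ngrams(corpus_text.split(), n)
    return [ex for ex in benchmark_examples
            if ngrams(ex.split(), n) & corpus_ngrams]

corpus = ("the quick brown fox jumps over the lazy dog "
          "while the cat watches from the window sill quietly")
bench = [
    "the quick brown fox jumps over the lazy dog today",   # leaked
    "an entirely different question about something else new here",
]
print(contaminated(bench, corpus))  # only the first example is flagged
```

Training on flagged examples inflates benchmark scores, which is the overestimation risk the tweet warns about.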
@xuhaoxh
Hao Xu @EMNLP
5 months
(2/n) 🔎 Head to our web interface to search the 45.6 TB of text we indexed (including recent Common Crawls): https://t.co/oaFciWZQsN
More resources:
- API documentation: https://t.co/0Xdw2G7EpZ
- Source code: https://t.co/1IfBT6tXKG
- Paper: https://t.co/hJNkdwWV2f
1
0
5