
Jimmy Lin
@lintool
Followers: 14K · Following: 0 · Media: 341 · Statuses: 4K
I profess CS-ly at @UWaterloo about NLP/IR/LLM-ish things. I science at @Primal. Previously, I monkeyed code for @Twitter and slides for @Cloudera.
Joined February 2010
"NLP makes IR interesting and IR makes NLP useful!" - slides from my #sigir2020 summer school talk at: Get your rotten tomatoes and eggs out!.
9
49
296
Happy to share an early draft of "Pretrained Transformers for Text Ranking: BERT and Beyond", our forthcoming book (tentatively, early 2021) by @lintool @rodrigfnogueira @andrewyates
3
59
265
Case in point: "Passage Re-ranking with BERT" by @rodrigfnogueira and @kchonyc was never accepted anywhere because of the "too simple, not novel" laziness. Yet that paper is LITERALLY cited in every single BERT-for-ranking paper ever published since.
1
8
170
Presenting RankVicuna 🦙 - the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting! Brought to you by @rpradeep42 and @sahel_sharify
5
33
171
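For readers wondering what zero-shot listwise reranking actually involves: the model sees the query plus a numbered list of candidate passages and emits an ordering. Below is a minimal Python sketch; the prompt wording and output parsing are illustrative assumptions, not the exact RankVicuna recipe.

```python
# Minimal sketch of a zero-shot listwise reranking prompt and its parser.
# The prompt format and output convention are assumptions for illustration,
# not the exact RankVicuna setup.
import re

def build_listwise_prompt(query: str, passages: list[str]) -> str:
    """Ask an LLM to order candidate passages by relevance to the query."""
    lines = [f"Query: {query}", "Rank the following passages by relevance to the query.", ""]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")
    lines.append("")
    lines.append("Answer with passage numbers from most to least relevant, e.g. [2] > [1] > [3].")
    return "\n".join(lines)

def parse_ranking(response: str, num_passages: int) -> list[int]:
    """Extract a 1-based ordering from the model's response, repairing omissions."""
    seen, order = set(), []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match)
        if 1 <= idx <= num_passages and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Append anything the model dropped, keeping the original order.
    order += [i for i in range(1, num_passages + 1) if i not in seen]
    return order
```

Candidate lists longer than the context window are typically handled with a sliding window over the ranking, reranking overlapping chunks of the list.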
New entrants into the camelidae family 🦙 for retrieval! @xueguang_ma presents RepLLaMA (a dense retrieval model) and RankLLaMA (a pointwise reranker) fine-tuned on (you guessed it!) LLaMA for multi-stage text retrieval:
5
18
140
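In contrast to the listwise approach above, a pointwise reranker like RankLLaMA scores each query-passage pair independently and sorts by score. The sketch below uses a small sentence-transformers cross-encoder as a stand-in scorer (the checkpoint id is an assumption); the pattern is the same whatever model sits underneath.

```python
# Pointwise reranking sketch: score each (query, passage) pair independently,
# then sort by score. The MiniLM cross-encoder is a lightweight stand-in for a
# model like RankLLaMA; swap in whatever scorer you actually use.
from sentence_transformers import CrossEncoder

scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint id

def rerank_pointwise(query: str, passages: list[str]) -> list[tuple[float, str]]:
    scores = scorer.predict([(query, passage) for passage in passages])
    return sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)

ranked = rerank_pointwise(
    "what causes the northern lights",
    ["Auroras occur when charged particles from the sun hit the atmosphere.",
     "The Northern Line is a London Underground railway line."])
for score, passage in ranked:
    print(round(float(score), 3), passage)
```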
Introducing RankZephyr - a fully open-source zero-shot listwise reranking LLM that achieves effectiveness parity with GPT-4! Brought to you by @rpradeep42 and @sahel_sharify!
4
25
132
"Pretrained Transformers for Text Ranking: BERT and Beyond" with @rodrigfnogueira and @andrewyates - it started on June 18, 2020 and culminates here with the official publication. Enjoy! institutional subscribers: retail orders:
1
28
129
Another addition to the "X is all you need" genre of papers: We took OpenAI embeddings of MS MARCO passages and stuffed them into Lucene - turns out you don't need fancy schmancy vector stores for dense retrieval! Lucene will do.
Dense retrieval without requiring a dedicated vector DB! Here's a guide on how you can take OpenAI ada2 embeddings (on MS MARCO passages) and perform retrieval directly using Lucene with our group's Anserini toolkit
9
15
99
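As a rough sketch of the pipeline in the tweets above: embed the query with the same OpenAI ada2 model used for the passages, then score by inner product. The brute-force numpy scoring below is a stand-in for the Lucene HNSW index that Anserini builds, and the embedding files are hypothetical placeholders.

```python
# Dense retrieval over pre-computed OpenAI ada2 passage embeddings.
# Brute-force numpy scoring stands in for the Lucene HNSW index Anserini uses;
# the .npy / .txt files below are hypothetical placeholders.
import numpy as np
from openai import OpenAI  # pip install openai (v1 client)

client = OpenAI()

def embed_query(query: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=query)
    return np.array(resp.data[0].embedding, dtype=np.float32)

passage_vecs = np.load("msmarco_ada2_vectors.npy")          # (num_passages, 1536)
doc_ids = open("msmarco_docids.txt").read().splitlines()    # aligned passage ids

query_vec = embed_query("what is the role of t cells in immunity")
scores = passage_vecs @ query_vec     # ada2 vectors are unit length, so dot = cosine
for rank, idx in enumerate(np.argsort(-scores)[:10], start=1):
    print(rank, doc_ids[idx], float(scores[idx]))
```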
With @rodrigfnogueira and @andrewyates we're happy to share the revised version of our book "Pretrained Transformers for Text Ranking: BERT and Beyond" - significant updates to transformer-based reranking models and dense retrieval techniques!
4
26
97
Presenting AfriBERTa, a pretrained LM for 11 African languages by @Kelechukwu_. What's neat is that its pretraining corpus is 0.04% that of XLMR (<1GB), but AfriBERTa performs just as well on downstream tasks! To appear at the MRL workshop at #EMNLP2021.
2
18
92
Yesterday @rodrigfnogueira @andrewyates and I wrapped up the final preproduction version of "Pretrained Transformers for Text Ranking: BERT and Beyond" - posted on arXiv as v3: it's now in the hands of @MorganClaypool and will be in print soon!
1
16
82
#ACL2023NLP (or #ACL2023) is a great opportunity to plug what is shaping up to be perhaps the largest collection of core NLP faculty in Canada 🇨🇦 at Waterloo @UWCheritonCS - joining @WenhuChen and me next year will be @fredahshi @hllo_wrld @yuntiandeng! Come find us to chat!
1
12
84
Thanks to the tremendous effort of @edwinzhng @1729_gupta @kchonyc we're proud to present the Neural Covidex, our updated AI-powered search interface to @allen_ai 's COVID-19 corpus: Powered primarily by Lucene, T5, and BioBERT.
3
41
79
Introducing SegaBERT! By @Richard_baihe et al. The intuition is to introduce hierarchical position embeddings (paragraph, sentence, token) to better capture context during pretraining: simple idea, fairly large gains!
1
20
78
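A toy sketch of the idea: besides the usual token-position embeddings, every token also receives paragraph-index and sentence-index embeddings, all summed into its input representation. Dimensions and the exact combination below are assumptions for illustration, not the paper's parameterization.

```python
# Toy sketch of hierarchical (paragraph / sentence / token) position embeddings
# in the spirit of SegaBERT. Sizes and the summation scheme are illustrative
# assumptions, not the paper's exact parameterization.
import torch
import torch.nn as nn

class HierarchicalEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768,
                 max_paragraphs=64, max_sentences=128, max_tokens=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.paragraph_pos = nn.Embedding(max_paragraphs, hidden)
        self.sentence_pos = nn.Embedding(max_sentences, hidden)
        self.token_pos = nn.Embedding(max_tokens, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, paragraph_ids, sentence_ids, position_ids):
        # All id tensors have shape (batch, seq_len).
        x = (self.token(token_ids)
             + self.paragraph_pos(paragraph_ids)
             + self.sentence_pos(sentence_ids)
             + self.token_pos(position_ids))
        return self.norm(x)
```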
Why I hate doing reimbursements: the default assumption by @UWaterloo is that you're a criminal trying to embezzle money from research accounts. Maryland was saner. What's your experience been like at other places?
16
0
70
If you're interested in dense retrieval, you'll want to check out this DPR replication effort led by @xueguang_ma. tl;dr - BM25 is better than the original authors made it out to be, plus a free QA boost from better evidence fusion!
5
6
76
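For context, here is what a tuned BM25 baseline looks like in Pyserini, the toolkit behind this replication. The prebuilt index identifier is an assumption; check Pyserini's list of prebuilt indexes for the exact name on your version.

```python
# BM25 baseline with Pyserini. The prebuilt index name is an assumption --
# consult Pyserini's documentation for the exact identifier.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("wikipedia-dpr-100w")  # assumed name
searcher.set_bm25(k1=0.9, b=0.4)  # tuned parameters matter for a strong baseline

hits = searcher.search("who wrote the declaration of independence", k=10)
for rank, hit in enumerate(hits, start=1):
    print(rank, hit.docid, round(hit.score, 3))
```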
@colinraffel You have to grow a grey beard first. And don't say "bag of words". Dress it up as "heuristically weighted sparse representations".
1
0
71
We've written up a description of the Neural Covidex and shared a few thoughts about our journey so far in a submission to the ACL 2020 COVID Workshop: Comments and feedback welcome! #NLProc #acl2020nlp.
1
14
68
New work by @ralph_tang @crystina_z @xueguang_ma adds yet another prompting technique to the mix: *permutation* self-consistency prompting to overcome positional bias in LLMs. Useful for listwise ranking. Read all about it!
0
10
69
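The core idea, sketched below: run the listwise reranker over several random shufflings of the candidates and aggregate the resulting rankings, so no passage is systematically favored by where it happened to sit in the prompt. Here `rank_with_llm` is a placeholder for your listwise reranker, and aggregation by mean rank is a simplification of the paper's procedure.

```python
# Sketch of permutation self-consistency for listwise reranking: shuffle the
# candidates several times, rank each shuffle with the LLM, and aggregate by
# mean rank. `rank_with_llm` is a placeholder; mean-rank aggregation is a
# simplified stand-in for the paper's aggregation step.
import random
from collections import defaultdict

def permutation_self_consistency(query, passages, rank_with_llm, num_perms=5, seed=0):
    rng = random.Random(seed)
    rank_sums = defaultdict(float)
    for _ in range(num_perms):
        perm = list(range(len(passages)))
        rng.shuffle(perm)
        shuffled = [passages[i] for i in perm]
        # rank_with_llm returns 0-based positions into `shuffled`, best first.
        for rank, pos in enumerate(rank_with_llm(query, shuffled)):
            rank_sums[perm[pos]] += rank
    # Lower mean rank means more relevant overall.
    return sorted(range(len(passages)), key=lambda i: rank_sums[i] / num_perms)
```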
Our group has this server with 100TB of disk, and it's always full. Why? These dense retrieval models take up so much $#!& space. But @xueguang_ma et al. came up with simple compression solutions, to appear in #emnlp2021
2
11
68
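To see why there is so much room to save, here is a generic scalar-quantization sketch that shrinks float32 embeddings to 8 bits per dimension. It illustrates the storage arithmetic only; the specific techniques in the paper differ.

```python
# Generic scalar quantization of dense embeddings: float32 -> uint8 is a 4x
# reduction in storage. Illustrates the storage math, not the paper's methods.
import numpy as np

def quantize(vecs: np.ndarray):
    """Map float32 vectors to uint8 plus a corpus-wide offset and scale."""
    lo, hi = float(vecs.min()), float(vecs.max())
    scale = (hi - lo) / 255.0
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + lo

vecs = np.random.randn(1000, 768).astype(np.float32)   # stand-in for real embeddings
codes, lo, scale = quantize(vecs)
print(vecs.nbytes / codes.nbytes)                                # 4.0x smaller
print(float(np.abs(vecs - dequantize(codes, lo, scale)).max()))  # reconstruction error
```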
New work on using doc2query for summarization by @rodrigfnogueira et al. - works surprisingly well! Samples from the CORD-19 corpus related to COVID-19 below.
0
15
64
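For anyone who wants to try doc2query themselves, a minimal sketch with Hugging Face transformers follows; the checkpoint id is assumed to be one of the publicly released doc2query T5 models, so substitute whichever one you actually use.

```python
# doc2query sketch: use a T5 model to generate questions a passage might answer.
# The checkpoint id is an assumption; substitute the doc2query model you use.
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/doc2query-t5-base-msmarco"  # assumed checkpoint id
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

passage = ("Coronaviruses are a large family of viruses that cause illness "
           "ranging from the common cold to more severe diseases.")
inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=64, do_sample=True, top_k=10,
                         num_return_sequences=3)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```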
Thanks to @edwinzhng and @1729_gupta our Anserini IR toolkit can now search @allen_ai 's COVID-19 corpus. @kchonyc connected it up to SciBERT and bam, we have a two-stage neural ranking pipeline! Join and build on our work!
3
22
62
Interested in large-scale machine learning (#hadoop and otherwise)? I recommend this tutorial at #KDD2011: http://t.co/NlEzGEe
0
33
60
LLMs are missing a critical ingredient, and @Primal knows what it is! (Hint: knowledge graphs and neuro-symbolic approaches) Here's a writeup of the journey so far, featuring CEO @YvanCouture - oh btw, I'm the CTO.
2
5
60
In the preprint of our #sigir2019 short paper, we conducted a meta-analysis of 100+ papers reporting results on Robust04. tl;dr - weak baselines are still prevalent (in both neural and non-neural models).
5
17
53
Want to replicate our (@rodrigfnogueira @victoryang118 @kchonyc) doc2query work (currently) sitting on top of the MS MARCO leaderboard? Have we got code for you! paper:
0
12
53
Tutorial slides to go with the book!
Slides of our WSDM 2021 tutorial "Pretrained Transformers for Text Ranking: BERT and Beyond" are available here: with @andrewyates and @lintool.
0
12
52
Apparently, I was recognized as an outstanding area chair at #emnlp2020 and didn't realize it until now. #humblebrag (do people even use this hashtag anymore?)
1
1
51
It's not every day you land a $1M (CAD) grant: announcing our Archives Unleashed 2 project, led by @ianmilligan1. Looking forward to working with @ruebot @jefferson_bail @SamVFritz over the next few years!
2
1
48
Our latest study of BM25 variants, including Lucene's weird doc length encoding, with @kamphuis_c @arjenpdevries @srchvrs. tl;dr - it's okay!
0
18
51
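For reference, one common BM25 formulation (with Lucene's smoothed IDF) is short enough to write down; the variants in the study differ in exactly these details, such as the IDF form and how document length is encoded.

```python
# One BM25 formulation, using Lucene's smoothed IDF and omitting the constant
# (k1 + 1) numerator factor (which only rescales scores). Other variants in the
# study tweak exactly these pieces.
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.9, b=0.4):
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    tf_norm = tf / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# Contribution of a single query term with tf=3 in a 250-token document:
print(bm25_term_score(tf=3, df=1000, doc_len=250, avg_doc_len=300, num_docs=8_800_000))
```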
Sparse or dense representations for retrieval? Or hybrid? psssssh, says @jacklin_64 - neither! Densify sparse lexical reps and tack on dense semantic reps: best of both worlds, and simpler infrastructure too (no need for HNSW or inverted indexes!)
1
7
47
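A heavily simplified sketch of the densify-and-concatenate idea: pool the vocabulary-sized lexical vector down to a small fixed-width vector (max over contiguous buckets here, a crude stand-in for the paper's actual densification scheme), concatenate it with the dense semantic vector, and score everything with a single inner product.

```python
# Simplified sketch of densifying a sparse lexical representation and
# concatenating it with a dense semantic vector. Max-pooling over contiguous
# vocabulary buckets is a crude stand-in for the paper's densification scheme.
import numpy as np

def densify(sparse_vec: np.ndarray, num_buckets: int = 768) -> np.ndarray:
    return np.array([bucket.max() for bucket in np.array_split(sparse_vec, num_buckets)],
                    dtype=np.float32)

def hybrid_rep(sparse_vec: np.ndarray, dense_vec: np.ndarray) -> np.ndarray:
    return np.concatenate([densify(sparse_vec), dense_vec])

vocab_size, dense_dim = 30522, 768
sparse_q, dense_q = np.random.rand(vocab_size), np.random.rand(dense_dim)
sparse_d, dense_d = np.random.rand(vocab_size), np.random.rand(dense_dim)

# A single inner product over the concatenated representation scores the pair.
print(float(hybrid_rep(sparse_q, dense_q) @ hybrid_rep(sparse_d, dense_d)))
```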
We've connected Anserini to Solr to Blacklight to present a search frontend to @allen_ai 's COVID-19 corpus! Check out - awesome work by @edwinzhng and @1729_gupta.
4
18
47
It's hard to build usable software, but tweets like this make all the blood, sweat, and tears worthwhile. Credit goes to an awesome team!.
In the Information Retrieval course I let my students pick the IR toolkit of their choice among all the solutions we have available as a research community. The clear front-runner by a mile was Pyserini, in big part thanks to its extensive documentation!
2
2
48
I'll be giving a talk on a Conceptual Framework for a Representational Approach to Information Retrieval on April 5, 4pm PT as a Pinterest Labs Tech Talk @PinterestEng. RSVP and learn more here!
0
5
49
Happy to share Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in 11 languages by @crystina_z @xueguang_ma @ShiPeng16. tl;dr - think of this as the open-retrieval condition of TyDi. Paper: Data:
2
12
47