
Anmol Mekala
@anmol_mekala
Followers
40
Following
141
Media
5
Statuses
34
AI eng @ Salient | LLM unlearning & benchmarking research | CS @umassamherst, @iitbombay | Formerly @Microsoft
San Francisco, CA
Joined July 2023
📢 New Paper 📢 Struggling to fit very long contexts on your LLM? Considering 4-bit quantization to 2x your context window? Prior work says 4-bit is “good enough,” but on long-context tasks it can drop 16%, with up to 59% drops on specific models❗❗ Details in 🧵👇
4
14
36
Accepted to EMNLP 2025!!
0
0
1
RT @rishanthrajendh: Long-form factuality metrics like FactScore and VeriScore are accurate but slow (~100s/response): they split text into….
0
9
0
RT @selini0: We went from "RL without external rewards" to "RL with any rewards" in less than 6 hours hahaha. Interesting times https://t.c….
0
30
0
RT @corbtt: New paper! We used GRPO to train Qwen 2.5 on 32 randomly-generated Coq programs that don't compile, and it learned to prove the….
0
19
0
RT @MohitIyyer: 4bit quantization works fine with short contexts but can really hurt with longer ones! Check out our paper for more details….
0
9
0
📜 Does quantization affect models’ performance on long-context tasks? Work @UMass_NLP by @aatmakuru6 and myself, guided by @yixiao_song, @mar_kar_ & @MohitIyyer.
arxiv.org
Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these...
0
1
2
Takeaways: ✅ 8-bit quants are consistently robust. ⚠️ Be cautious with 4-bit, especially on long contexts 📚 and multilingual 🌍 tasks. 🚫 Testing a few tasks or models isn’t enough: performance on one doesn’t guarantee others.
1
1
2
Quantization effects vary dramatically across models, even similarly sized ones: Qwen-2.5 72B remains robust under BNB-nf4 (+0.6%) on OneRuler, while Llama-3.1 70B sees a massive 59% drop 📉📉 on the same task! Evaluating a single model family isn’t enough❗️❗️
1
0
2
Long-context tasks (up to 128K): Ruler, OneRuler & NoCha show that quantization losses rise with longer contexts and in multilingual settings. Unlike long-context tasks, long-form generation does not show large drops under quantization.
1
0
2
🔹 8-bit quantization (FP8, GPTQ-int8) maintains near-perfect accuracy (<0.9% avg drop). 🔸 But 4-bit quantization methods (AWQ-int4, GPTQ-int4, BNB-nf4) can degrade sharply on very long contexts 📉
1
0
2
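For reference, here is a minimal sketch (not the paper’s code) of how the BNB-nf4 setting named above is typically configured with Hugging Face transformers + bitsandbytes; the model id and compute dtype are illustrative assumptions, and the FP8/GPTQ/AWQ variants are usually loaded from pre-quantized checkpoints instead.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative model choice; any Llama-3.1 / Qwen-2.5 checkpoint is loaded the same way.
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# BNB-nf4: the 4-bit setting that can degrade sharply on very long contexts.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    device_map="auto",
)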
We benchmark five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4) across five models (Llama-3.1/Qwen-2.5; 7–72B) on 10K examples from five long-context 📚🔍 (up to 128K tokens) and long-form generation tasks 📚✍️.
1
0
2
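A hedged sketch of the evaluation grid this tweet describes (models × quantization methods × tasks); evaluate_long_context() is a hypothetical placeholder rather than the paper’s actual harness, and the model list is illustrative.

from itertools import product

# Illustrative lists; the paper evaluates five Llama-3.1 / Qwen-2.5 models (7-72B).
MODELS = ["Llama-3.1-8B", "Llama-3.1-70B", "Qwen2.5-7B", "Qwen2.5-72B"]
QUANTS = ["bf16", "fp8", "gptq-int8", "awq-int4", "gptq-int4", "bnb-nf4"]
TASKS = ["ruler", "oneruler", "nocha"]  # long-context tasks named in the thread

def evaluate_long_context(model_name: str, quant: str, task: str) -> float:
    """Hypothetical placeholder: load the (quantized) model and return task accuracy."""
    return 0.0  # stub so the sketch runs end to end

results = {
    (m, q, t): evaluate_long_context(m, q, t)
    for m, q, t in product(MODELS, QUANTS, TASKS)
}

# Report each quantized setting's drop relative to the bf16 baseline on the same model/task.
for (m, q, t), acc in sorted(results.items()):
    if q == "bf16":
        continue
    drop = results[(m, "bf16", t)] - acc
    print(f"{m:14s} {q:10s} {t:9s} drop vs bf16 = {drop:+.3f}")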
RT @kenziyuliu: An LLM generates an article verbatim—did it “train on” the article?. It’s complicated: under n-gram definitions of train-se….
0
96
0
RT @vaidehi_patil_: 🚨Exciting @icmlconf workshop alert 🚨. We’re thrilled to announce the #ICML2025 Workshop on Machine Unlearning for Gener….
0
19
0
RT @SketchesbyBoze: the use of chatbots to write essays is a five-alarm fire with the power to destroy education, but we can defeat it easi….
0
106
0
RT @ZhiyuanZeng_: Is a single accuracy number all we can get from model evals?🤔.🚨Does NOT tell where the model fails.🚨Does NOT tell how to….
0
92
0
RT @rohitgandikota: Why do distilled diffusion models generate similar-looking images? 🤔. Our Diffusion Target (DT) visualization reveals t….
0
74
0
RT @goyalsachin007: Realization (again) from research over the past 2 months: A solid open-source framework from “reliable” folks isn’t jus….
0
2
0
RT @WeijiaShi2: Another great work by @pratyushmaini. Excited to see our machine unlearning benchmark, MUSE (🔗, n….
arxiv.org
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to...
0
6
0