Shlok Kumar
@sk2740
Followers: 112 · Following: 1K · Media: 12 · Statuses: 421
Developer | Turning code into solutions
India
Joined November 2017
6/6 Follow-up: Decide how aggressive to go (INT4) via a per-layer sensitivity test (measure loss impact per layer). FFNs usually tolerate quantization better than attention/embeddings, so quantize them harder. Use mixed precision if needed.
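A minimal sketch of such a per-layer sensitivity sweep on a toy two-layer net (all shapes, the INT4 scheme, and the MSE-vs-FP-baseline proxy for loss impact are illustrative assumptions, not any library's API):

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor INT4: integer levels in [-7, 7]
    scale = np.abs(w).max() / 7.0
    return np.round(w / scale).clip(-7, 7) * scale

rng = np.random.default_rng(0)
# Toy stand-ins for transformer sublayers (hypothetical sizes)
layers = {"ffn": rng.normal(size=(64, 64)), "attn": rng.normal(size=(64, 64))}
x = rng.normal(size=(32, 64))

def forward(ws):
    h = np.tanh(x @ ws["attn"])
    return h @ ws["ffn"]

baseline = forward(layers)
sensitivity = {}
for name in layers:
    ws = dict(layers)
    ws[name] = quantize_int4(ws[name])  # quantize ONE layer at a time
    # Loss-impact proxy: mean squared output deviation vs the FP baseline
    sensitivity[name] = float(np.mean((forward(ws) - baseline) ** 2))

# Rank layers by impact: quantize the tolerant ones first, keep the rest FP16
ranked = [n for n, _ in sorted(sensitivity.items(), key=lambda kv: kv[1])]
```

In a real model you'd run this over calibration batches and a true loss, but the loop structure is the same.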
5/6 Quant hurts quality on sensitive layers/tiny models/no calib. Safe on >7B models post-calib (<1% drop). ALWAYS eval before/after on benchmarks (perplexity/acc).
4/6 Outlier-aware methods: SmoothQuant migrates activation outliers into weights, AWQ protects salient channels via scaling, GPTQ compensates quantization error weight-by-weight. These make INT4 viable with tiny loss on LLMs.
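The core SmoothQuant trick (migrating activation outliers into the weights via a per-channel scale, α = 0.5 here; shapes and data are made up) can be sketched in a few lines:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(128, 16))
X[:, 3] *= 50.0            # one outlier activation channel
W = rng.normal(size=(16, 8))

alpha = 0.5                # migration strength, as in SmoothQuant
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
X_s, W_s = X / s, W * s[:, None]   # per-input-channel scale migration

# The product is unchanged (diag scaling cancels), but the activation
# outlier is flattened, so X_s quantizes much better than X.
assert np.allclose(X @ W, X_s @ W_s)
```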
3/6 Per-tensor: 1 scale for whole tensor (simple but hurts acc). Per-channel: per-output-channel scales (better for matmuls, preserves quality). Prefer per-channel.
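A quick NumPy check of why per-channel wins when channel magnitudes vary (synthetic weights, symmetric INT8; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Rows (output channels) at very different magnitudes
W = rng.normal(size=(8, 64)) * rng.uniform(0.01, 1.0, size=(8, 1))

def quant_error(w, scale):
    q = np.round(w / scale).clip(-127, 127)   # symmetric INT8
    return np.mean((q * scale - w) ** 2)

per_tensor = quant_error(W, np.abs(W).max() / 127.0)
per_channel = quant_error(W, np.abs(W).max(axis=1, keepdims=True) / 127.0)
# Small-magnitude channels get a finer scale, so per-channel MSE is lower
```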
2/6 Calibration strategies: Use small dataset for scales (min-max, percentile clip, KL-div). Poor calib = big quality drop. Always required for INT8/INT4.
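The min-max vs percentile tradeoff can be seen on synthetic activations with a couple of injected outliers (numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
acts = np.concatenate([rng.normal(size=100_000), [60.0, -55.0]])  # rare outliers

scale_minmax = np.abs(acts).max() / 127.0            # blown up by the outliers
clip = np.percentile(np.abs(acts), 99.9)             # percentile calibration
scale_pct = clip / 127.0                             # ~18x finer step size

# Cost of clipping: only a tiny fraction of values saturate
clipped_frac = np.mean(np.abs(acts) > clip)
```

Percentile (or KL-based) clipping trades a little saturation on rare outliers for much finer resolution on typical activations.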
1/6 Q: "Explain the tradeoffs of weight quantization (e.g., INT8, INT4). When does quantization hurt quality and when is it safe?" Tradeoffs: ~4×/8× smaller than FP32 (2×/4× vs FP16), faster inference, lower power. Risk: accuracy loss with bad calibration. Safe for large models; hurts small or poorly calibrated ones.
6/6 Follow-up: What changes with very long contexts? KV-cache explodes further → extreme memory usage + bandwidth bottleneck. Often need: quantization, eviction/paging, sliding windows, or switch to linear-attn / Mamba-style archs to escape quadratic memory scaling.
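One of those mitigations, the sliding window, amounts to a bounded KV buffer (hypothetical minimal interface; real systems evict per layer/head and often keep a few "sink" tokens):

```python
from collections import deque

class SlidingKVCache:
    """Keep only the last `window` tokens' K/V, so memory is O(window), not O(S)."""
    def __init__(self, window):
        self.k = deque(maxlen=window)
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)   # oldest entry is evicted automatically
        self.v.append(v_t)

cache = SlidingKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")   # after 10 steps only k6..k9 remain
```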
5/6 Follow-up: How does grouped-query attention (GQA) help? GQA shares fewer KV heads across many Q heads → KV-cache size drops (e.g. 8× smaller). Reduces memory footprint + bandwidth pressure → faster decode at long context.
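Quick arithmetic with assumed Llama-2-70B-like shapes (80 layers, 64 Q heads, 8 KV heads, d_head = 128, fp16) shows where an 8× reduction comes from:

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, bytes_per=2):
    return 2 * layers * kv_heads * d_head * seq_len * bytes_per  # 2 = K and V

mha = kv_cache_bytes(80, 64, 128, seq_len=4096)   # full multi-head attention
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096)    # 8 KV heads shared by 64 Q heads
# mha is ~10 GiB at 4k context; GQA cuts it by exactly 64/8 = 8x
```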
4/6 Compute vs memory regimes: Short context → compute-bound (matmul flops heavy). Long context → memory-bandwidth bound (GPU stalls waiting on KV-cache reads; attn compute is cheap per token but data movement kills perf).
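A back-of-envelope roofline check makes the regime split concrete (assumed round numbers: 7B fp16 model, A100-like peak FLOP/s and HBM bandwidth):

```python
params = 7e9
weight_bytes = params * 2            # fp16: 2 bytes per parameter
flops_per_token = 2 * params         # ~2 FLOPs per parameter per token
intensity = flops_per_token / weight_bytes   # = 1 FLOP per byte moved

a100_flops, a100_bw = 312e12, 2.0e12  # peak fp16 FLOP/s and HBM bytes/s
ridge = a100_flops / a100_bw          # ~156 FLOPs/byte needed to be compute-bound
# intensity (1) << ridge (~156): single-stream decode is bandwidth-bound
```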
3/6 Main bottlenecks in autoregressive generation: KV-cache memory: O(layers × heads × S × d_head) → GBs at long context. Memory bandwidth: loading huge K/V tensors every step dominates GPU HBM traffic.
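Plugging assumed Llama-7B-like shapes (32 layers, 32 heads, d_head = 128, fp16) into that formula shows the GBs:

```python
layers, heads, d_head, bytes_per = 32, 32, 128, 2

def kv_gb(S):
    return 2 * layers * heads * S * d_head * bytes_per / 2**30  # 2 = K and V

# kv_gb(4096) = 2 GiB; kv_gb(32768) = 16 GiB: at long context the
# cache, not the weights, dominates memory
```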
2/6 Forward pass per new token: QKV proj on current token only → load past K/V from KV-cache → causal attn (Q @ K^T → softmax → @ V) → MLP. KV-cache stores all previous K/V → grows O(S) per layer/head.
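The per-token step can be sketched in NumPy (single head, no LN/MLP, random toy weights; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))

def decode_step(x_t, K_cache, V_cache):
    # Project ONLY the current token; past K/V come from the cache
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    K_cache = np.vstack([K_cache, k])   # cache grows O(S) per layer/head
    V_cache = np.vstack([V_cache, v])
    scores = q @ K_cache.T / np.sqrt(d)  # causal for free: only past+current exist
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ V_cache, K_cache, V_cache

for t in range(5):
    out, K_cache, V_cache = decode_step(rng.normal(size=(1, d)), K_cache, V_cache)
```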
1/6 Q: "Walk through the forward pass of a decoder-only transformer. Where are the main memory and compute bottlenecks during autoregressive generation?" A: Embed + pos enc → N causal self-attn layers (LN + MHA + LN + MLP) → final LN + LM head. Autoregressive decode: one token per forward pass.
6/6 In production measure: A/B test real latency + tokens/sec with & without speculation on production traffic. Track effective acceptance rate + compute cost ratio. Only enable if net wall-clock time decreases and doesn't hurt quality.
5/6 How to choose draft model? Usually a 1–4B model quantized (Q4/Q5) that was either distilled from the target or fine-tuned on similar data. Goal: highest tokens-per-second while keeping acceptance rate ≥60–65%.
4/6 Main overhead: you're always running two models. Draft must be much faster (often 5–20×) than target to break even, plus the memory cost of keeping both loaded. Worst case: a net slowdown even at moderate acceptance rates.
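A hedged break-even sketch (k draft tokens at relative cost c per token, n expected tokens accepted per round; all numbers illustrative):

```python
def cost_per_token(k, c, n):
    # k draft steps plus 1 target verification pass, amortized over n tokens
    return (k * c + 1.0) / n

# Plain decoding costs 1.0 per token, so speculation wins only below 1.0
fast_draft = cost_per_token(k=4, c=0.05, n=3.0)   # 20x-cheaper draft: clear win
slow_draft = cost_per_token(k=4, c=0.5, n=3.0)    # 2x-cheaper draft: break-even
```

This is why a draft only 2× cheaper merely breaks even despite 3 accepted tokens per round.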
3/6 It hurts when acceptance rate is low (<~40–50%). Most draft tokens get rejected → you pay draft model cost + full target cost almost every time → net slowdown + wasted compute.
2/6 It helps when acceptance rate is high (typically >60–70%). Good draft → many tokens accepted → effective batch size >1. Common speedups: 1.5–3× latency, up to 5×+ in ideal cases (e.g. code gen, repetitive text).
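Under the usual independence assumption from the speculative decoding literature, expected tokens per verification round for draft length k and per-token acceptance rate a is (1 − a^(k+1)) / (1 − a):

```python
def expected_tokens(a, k):
    # Geometric-series expectation: each extra draft token survives with prob a
    return (1 - a ** (k + 1)) / (1 - a)

low = expected_tokens(0.4, 4)    # ~1.65 tokens/round: little headroom
high = expected_tokens(0.8, 4)   # ~3.36 tokens/round: real speedup potential
```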