Shlok Kumar
@sk2740
Followers: 112 · Following: 1K · Media: 12 · Statuses: 421
Developer | Turning code into solutions
India
Joined November 2017
6/6 Follow-up: Decide how aggressive to go (INT4) via a per-layer sensitivity test (measure loss impact per layer). FFNs usually tolerate quantization better than attention/embeddings, so quantize them harder. Use mixed precision if needed.
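A minimal sketch of such a per-layer sensitivity sweep on a toy two-layer net (all shapes, the INT4 scheme, and the MSE-vs-FP-baseline proxy for loss impact are illustrative assumptions, not any library's API):

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor INT4: integer levels in [-7, 7]
    scale = np.abs(w).max() / 7.0
    return np.round(w / scale).clip(-7, 7) * scale

rng = np.random.default_rng(0)
# Toy stand-ins for transformer sublayers (hypothetical sizes)
layers = {"ffn": rng.normal(size=(64, 64)), "attn": rng.normal(size=(64, 64))}
x = rng.normal(size=(32, 64))

def forward(ws):
    h = np.tanh(x @ ws["attn"])
    return h @ ws["ffn"]

baseline = forward(layers)
sensitivity = {}
for name in layers:
    ws = dict(layers)
    ws[name] = quantize_int4(ws[name])  # quantize ONE layer at a time
    # Loss-impact proxy: mean squared output deviation vs the FP baseline
    sensitivity[name] = float(np.mean((forward(ws) - baseline) ** 2))

# Rank layers by impact: quantize the tolerant ones first, keep the rest FP16
ranked = [n for n, _ in sorted(sensitivity.items(), key=lambda kv: kv[1])]
```

In a real model you'd run this over calibration batches and a true loss, but the loop structure is the same.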
5/6 Quant hurts quality on sensitive layers/tiny models/no calib. Safe on >7B models post-calib (<1% drop). ALWAYS eval before/after on benchmarks (perplexity/acc).
4/6 Outlier-aware methods: SmoothQuant migrates activation outliers into weights, AWQ protects salient channels via scaling, GPTQ compensates quantization error weight-by-weight. These make INT4 viable with tiny loss on LLMs.
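The core SmoothQuant trick (migrating activation outliers into the weights via a per-channel scale, α = 0.5 here; shapes and data are made up) can be sketched in a few lines:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(128, 16))
X[:, 3] *= 50.0            # one outlier activation channel
W = rng.normal(size=(16, 8))

alpha = 0.5                # migration strength, as in SmoothQuant
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
X_s, W_s = X / s, W * s[:, None]   # per-input-channel scale migration

# The product is unchanged (diag scaling cancels), but the activation
# outlier is flattened, so X_s quantizes much better than X.
assert np.allclose(X @ W, X_s @ W_s)
```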
3/6 Per-tensor: 1 scale for whole tensor (simple but hurts acc). Per-channel: per-output-channel scales (better for matmuls, preserves quality). Prefer per-channel.
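A quick NumPy check of why per-channel wins when channel magnitudes vary (synthetic weights, symmetric INT8; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Rows (output channels) at very different magnitudes
W = rng.normal(size=(8, 64)) * rng.uniform(0.01, 1.0, size=(8, 1))

def quant_error(w, scale):
    q = np.round(w / scale).clip(-127, 127)   # symmetric INT8
    return np.mean((q * scale - w) ** 2)

per_tensor = quant_error(W, np.abs(W).max() / 127.0)
per_channel = quant_error(W, np.abs(W).max(axis=1, keepdims=True) / 127.0)
# Small-magnitude channels get a finer scale, so per-channel MSE is lower
```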
2/6 Calibration strategies: Use small dataset for scales (min-max, percentile clip, KL-div). Poor calib = big quality drop. Always required for INT8/INT4.
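The min-max vs percentile tradeoff can be seen on synthetic activations with a couple of injected outliers (numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
acts = np.concatenate([rng.normal(size=100_000), [60.0, -55.0]])  # rare outliers

scale_minmax = np.abs(acts).max() / 127.0            # blown up by the outliers
clip = np.percentile(np.abs(acts), 99.9)             # percentile calibration
scale_pct = clip / 127.0                             # ~18x finer step size

# Cost of clipping: only a tiny fraction of values saturate
clipped_frac = np.mean(np.abs(acts) > clip)
```

Percentile (or KL-based) clipping trades a little saturation on rare outliers for much finer resolution on typical activations.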
1/6 Q: "Explain the tradeoffs of weight quantization (e.g., INT8, INT4). When does quantization hurt quality and when is it safe?" Tradeoffs: ~4×/8× smaller than FP32 (2×/4× vs FP16), faster inference, lower power. Risk: accuracy loss with bad calibration. Safe for large models; hurts small or poorly calibrated ones.
6/6 Follow-up: What changes with very long contexts? KV-cache explodes further → extreme memory usage + bandwidth bottleneck. Often need: quantization, eviction/paging, sliding windows, or switch to linear-attn / Mamba-style archs to escape quadratic memory scaling.
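One of those mitigations, the sliding window, amounts to a bounded KV buffer (hypothetical minimal interface; real systems evict per layer/head and often keep a few "sink" tokens):

```python
from collections import deque

class SlidingKVCache:
    """Keep only the last `window` tokens' K/V, so memory is O(window), not O(S)."""
    def __init__(self, window):
        self.k = deque(maxlen=window)
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)   # oldest entry is evicted automatically
        self.v.append(v_t)

cache = SlidingKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")   # after 10 steps only k6..k9 remain
```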
5/6 Follow-up: How does grouped-query attention (GQA) help? GQA shares fewer KV heads across many Q heads → KV-cache size drops (e.g. 8× smaller). Reduces memory footprint + bandwidth pressure → faster decode at long context.
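Quick arithmetic with assumed Llama-2-70B-like shapes (80 layers, 64 Q heads, 8 KV heads, d_head = 128, fp16) shows where an 8× reduction comes from:

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, bytes_per=2):
    return 2 * layers * kv_heads * d_head * seq_len * bytes_per  # 2 = K and V

mha = kv_cache_bytes(80, 64, 128, seq_len=4096)   # full multi-head attention
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096)    # 8 KV heads shared by 64 Q heads
# mha is ~10 GiB at 4k context; GQA cuts it by exactly 64/8 = 8x
```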
4/6 Compute vs memory regimes: Short context → compute-bound (matmul flops heavy). Long context → memory-bandwidth bound (GPU stalls waiting on KV-cache reads; attn compute is cheap per token but data movement kills perf).
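A back-of-envelope roofline check makes the regime split concrete (assumed round numbers: 7B fp16 model, A100-like peak FLOP/s and HBM bandwidth):

```python
params = 7e9
weight_bytes = params * 2            # fp16: 2 bytes per parameter
flops_per_token = 2 * params         # ~2 FLOPs per parameter per token
intensity = flops_per_token / weight_bytes   # = 1 FLOP per byte moved

a100_flops, a100_bw = 312e12, 2.0e12  # peak fp16 FLOP/s and HBM bytes/s
ridge = a100_flops / a100_bw          # ~156 FLOPs/byte needed to be compute-bound
# intensity (1) << ridge (~156): single-stream decode is bandwidth-bound
```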
3/6 Main bottlenecks in autoregressive generation: KV-cache memory: O(layers × heads × S × d_head) → GBs at long context. Memory bandwidth: loading huge K/V tensors every step dominates GPU HBM traffic.
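Plugging assumed Llama-7B-like shapes (32 layers, 32 heads, d_head = 128, fp16) into that formula shows the GBs:

```python
layers, heads, d_head, bytes_per = 32, 32, 128, 2

def kv_gb(S):
    return 2 * layers * heads * S * d_head * bytes_per / 2**30  # 2 = K and V

# kv_gb(4096) = 2 GiB; kv_gb(32768) = 16 GiB: at long context the
# cache, not the weights, dominates memory
```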
2/6 Forward pass per new token: QKV proj on current token only → load past K/V from KV-cache → causal attn (Q @ K^T → softmax → @ V) → MLP. KV-cache stores all previous K/V → grows O(S) per layer/head.
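The per-token step can be sketched in NumPy (single head, no LN/MLP, random toy weights; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))

def decode_step(x_t, K_cache, V_cache):
    # Project ONLY the current token; past K/V come from the cache
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    K_cache = np.vstack([K_cache, k])   # cache grows O(S) per layer/head
    V_cache = np.vstack([V_cache, v])
    scores = q @ K_cache.T / np.sqrt(d)  # causal for free: only past+current exist
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ V_cache, K_cache, V_cache

for t in range(5):
    out, K_cache, V_cache = decode_step(rng.normal(size=(1, d)), K_cache, V_cache)
```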
1/6 Q: "Walk through the forward pass of a decoder-only transformer. Where are the main memory and compute bottlenecks during autoregressive generation?" A: Embed + pos enc → N causal self-attn layers (LN + MHA + LN + MLP) → final LN + LM head. Autoregressive decode: one token per forward pass.
6/6 In production measure: A/B test real latency + tokens/sec with & without speculation on production traffic. Track effective acceptance rate + compute cost ratio. Only enable if net wall-clock time decreases and doesn't hurt quality.
5/6 How to choose draft model? Usually a 1–4B model quantized (Q4/Q5) that was either distilled from the target or fine-tuned on similar data. Goal: highest tokens-per-second while keeping acceptance rate ≥60–65%.
4/6 Main overhead: you're always running two models. Draft must be much faster (often 5–20×) than target to break even, plus the memory cost of keeping both loaded. Worst case: a net slowdown even at moderate acceptance rates.
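A hedged break-even sketch (k draft tokens at relative cost c per token, n expected tokens accepted per round; all numbers illustrative):

```python
def cost_per_token(k, c, n):
    # k draft steps plus 1 target verification pass, amortized over n tokens
    return (k * c + 1.0) / n

# Plain decoding costs 1.0 per token, so speculation wins only below 1.0
fast_draft = cost_per_token(k=4, c=0.05, n=3.0)   # 20x-cheaper draft: clear win
slow_draft = cost_per_token(k=4, c=0.5, n=3.0)    # 2x-cheaper draft: break-even
```

This is why a draft only 2× cheaper merely breaks even despite 3 accepted tokens per round.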
3/6 It hurts when acceptance rate is low (<~40–50%). Most draft tokens get rejected → you pay draft model cost + full target cost almost every time → net slowdown + wasted compute.
2/6 It helps when acceptance rate is high (typically >60–70%). Good draft → many tokens accepted → effective batch size >1. Common speedups: 1.5–3× latency, up to 5×+ in ideal cases (e.g. code gen, repetitive text).
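Under the usual independence assumption from the speculative decoding literature, expected tokens per verification round for draft length k and per-token acceptance rate a is (1 − a^(k+1)) / (1 − a):

```python
def expected_tokens(a, k):
    # Geometric-series expectation: each extra draft token survives with prob a
    return (1 - a ** (k + 1)) / (1 - a)

low = expected_tokens(0.4, 4)    # ~1.65 tokens/round: little headroom
high = expected_tokens(0.8, 4)   # ~3.36 tokens/round: real speedup potential
```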