Rajeev Ranjan Pandey

@rrpandey_in


PhDing @IITBHU_Varanasi | #ReinforcementLearning | Sharing PhD journey, insights and paper summaries.

Varanasi, India
Joined November 2023
@rrpandey_in · 3 days
If you want the full depth (and it’s worth it!), check out the lecture here:
@rrpandey_in · 3 days
Joelle’s challenge to us:
* Bellman’s value function unified RL for 30 years.
* Now we need new formalisms that capture richer values, unify the field again, and keep RL relevant for society.
@rrpandey_in · 3 days
We need a taxonomy of harms:
* Identify threats.
* Measure baseline risk.
* Measure marginal risk from new models.
* Evaluate mitigation effectiveness.
* Decide if residual risk is tolerable.
@rrpandey_in · 3 days
So what should alignment targets look like?
* Fine-grained (can guide behavior).
* Generalizable.
* Scalable (improves with more feedback).
* Legitimate and auditable.
New ideas: pluralistic alignment.
@rrpandey_in · 3 days
Problems:
* The RLHF advantage may be overstated (baseline issues).
* Leaderboards can be gamed.
* Opaque access and private testing.
Joelle warns: "When we fool ourselves about model quality, we harm the field more than we help."
@rrpandey_in · 3 days
Generative AI is already forcing this shift. The LLM alignment pipeline today:
* Pretraining (learn representations).
* Supervised fine-tuning (imitate human data).
* RLHF (optimize a human preference model).
* Evaluation = leaderboards + human judgment.
But cracks are showing 👇
@rrpandey_in · 3 days
Beyond Bellman:
* Value isn’t just “expected sum of rewards.”
* We need richer, pluralistic, dynamic value systems in AI.
@rrpandey_in · 3 days
Extending Bellman (RL++):
* Researchers added constraints: safety, fairness, privacy, interpretability.
* Formalisms like Constrained MDPs helped.
* Promising results in areas like treatment optimization.
But simply piling on constraints doesn’t scale. Trade-offs get messy.
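For context: a Constrained MDP keeps the usual reward but adds cost signals with budgets, and a common way to trade them off is a Lagrangian relaxation. A minimal sketch of the dual-ascent update on the multiplier — the learning rate, budget, and per-episode costs below are invented for illustration:

```python
# Constrained MDP trade-off sketch: maximize reward subject to an
# expected-cost budget by optimizing reward' = r - lam * c, while
# adjusting the multiplier lam by dual ascent. All numbers are made up.

def lagrangian_reward(r, c, lam):
    """Scalarize reward r and safety cost c with multiplier lam."""
    return r - lam * c

def update_multiplier(lam, avg_cost, budget, lr=0.1):
    """Dual ascent: raise lam when the cost budget is exceeded, never below 0."""
    return max(0.0, lam + lr * (avg_cost - budget))

lam = 0.0
budget = 1.0
for episode_cost in [2.0, 1.5, 1.2, 0.9]:  # hypothetical per-episode costs
    lam = update_multiplier(lam, episode_cost, budget)
print(round(lam, 3))
```

This illustrates the messy-trade-off point: every extra constraint adds another multiplier to tune, and they interact.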
@rrpandey_in · 3 days
Bellman’s assumptions:
* Markov property (the state captures everything).
* Stationarity (the world doesn’t change).
* Scalar rewards (everything = 1 number).
* Additivity (values sum neatly).
* Uniqueness (one true reward).
These assumptions break in messy real-world domains (healthcare, …).
@rrpandey_in · 3 days
Bellman’s gift to RL:
* The value function unified theory and practice.
* Compute the value, get the optimal policy.
* It became our shared language.
* “How’s your value?” became our handshake.
* Enabled benchmarks, modular progress, and decades of growth.
But it carried hidden assumptions.
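The value function being credited here is the fixed point of the Bellman optimality equation (standard textbook notation, not taken from the talk):

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]
```

Acting greedily with respect to $V^{*}$ recovers an optimal policy — that is the “compute the value, get the optimal policy” step above.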
@rrpandey_in · 3 days
I just watched the keynote by @jpineau1 at #RLC2025, titled "Beyond Bellman’s Legacy". Joelle Pineau delivered a provocative talk:
👉 What does value really mean in Reinforcement Learning (RL)?
👉 Can we move beyond Bellman’s equation while still keeping RL coherent?
Here’s the summary 👇
@rrpandey_in · 10 days
So next time you’re torn between instant gratification vs. future payoff… that’s you running a little Value Iteration inside your brain. 😆
@rrpandey_in · 10 days
Limitations of VI:
👉 Real life is messy, and we don’t always know the exact rules or probabilities.
👉 For huge problems (like planning your whole life), Value Iteration gets computationally heavy.
@rrpandey_in · 10 days
What makes Value Iteration powerful:
👉 It always finds the best long-term strategy (if you know the environment).
👉 It balances short-term pleasure with long-term reward.
@rrpandey_in · 10 days
The math version (for the nerds 🤓):
[image: the Value Iteration update equation]
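The attached image didn’t survive the export; the standard Value Iteration update it presumably showed is:

```latex
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V_{k}(s') \,\bigr]
```

Here $\gamma \in [0, 1)$ is the discount factor: how much you care about future reward relative to immediate reward.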
@rrpandey_in · 10 days
Value Iteration says:
👉 Estimate how “good” each situation is.
👉 Update those estimates by imagining the future rewards from your choices.
👉 Repeat until you know which choices really pay off long-term.
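Those three steps can be sketched in a few lines of Python. The toy two-state MDP below (states, transitions, rewards, discount) is made up for illustration:

```python
# Value Iteration on a tiny made-up MDP.
# P[s][a] = list of (probability, next_state, reward) outcomes.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
gamma = 0.9  # discount: how much future reward matters

V = {s: 0.0 for s in P}  # step 1: initial guess of how good each state is
for _ in range(200):     # step 3: repeat until the estimates settle
    V = {
        s: max(  # step 2: back up imagined future rewards over actions
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        for s in P
    }

# Read off the greedy policy from the converged values.
policy = {
    s: max(P[s], key=lambda a, s=s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(policy)
```

With these numbers the agent learns to pay the one-step cost of moving to `s1` and then stay there — exactly the short-term-pain, long-term-gain trade-off described in the thread.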
@rrpandey_in · 10 days
Let’s say you’re deciding whether to cook at home or order takeout. Cooking means effort now, healthier later. Takeout means easy now, maybe regret later. Which should you pick? That’s exactly the kind of question Value Iteration helps answer.
@rrpandey_in · 10 days
Imagine life as a game:
👉 Every situation is a state.
👉 Every choice you make is an action.
👉 How each choice changes your future is the transition.
👉 The payoffs (or regrets) you collect along the way are the reward.
This is a Markov Decision Process (MDP).
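The cook-vs-takeout dinner decision from this thread can be written down as exactly such an MDP. A minimal one-step sketch — all states, transitions, and reward numbers are invented for illustration:

```python
# A tiny MDP for the dinner decision. Rewards are made-up "life points".
actions = ["cook", "takeout"]

# transition[s][a] = next state (deterministic here, for brevity)
transition = {
    "hungry": {"cook": "fed_healthy", "takeout": "fed_lazy"},
}

# reward[s][a] = immediate payoff: cooking costs effort now, takeout feels easy
reward = {
    "hungry": {"cook": -1.0, "takeout": +1.0},
}

# Long-term value of where you end up (health benefits vs. regret).
future_value = {"fed_healthy": +5.0, "fed_lazy": +2.0}

gamma = 0.9  # discount factor: how much tomorrow matters vs. tonight
q = {
    a: reward["hungry"][a] + gamma * future_value[transition["hungry"][a]]
    for a in actions
}
print(max(q, key=q.get))
```

With these invented numbers, the discounted future value of eating healthy outweighs takeout’s instant payoff, so the greedy choice is to cook; shrink `gamma` toward 0 and the answer flips.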
@rrpandey_in · 10 days
Life is full of choices: short-term fun vs. long-term payoff. Value Iteration is the algorithm that helps #RL agents figure it out. Here’s how it works, with some real-world examples 👇
@rrpandey_in · 11 days
Limitation 10: Poor generalization.
CEM learns narrow, task-specific distributions.
Fix: domain randomization, meta-RL, skill-based latent policies.
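For context on why vanilla CEM specializes: it repeatedly refits its sampling distribution to the elite samples of a single fixed objective, so the distribution collapses onto that one task’s optimum. A minimal 1-D sketch (the objective and all hyperparameters are arbitrary):

```python
import random

def cem(objective, mean=0.0, std=5.0, n=100, elite_frac=0.2, iters=30):
    """Cross-Entropy Method: each round, fit a Gaussian to the elite samples."""
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(iters):
        samples = [rng.gauss(mean, std) for _ in range(n)]
        elites = sorted(samples, key=objective, reverse=True)[: int(n * elite_frac)]
        mean = sum(elites) / len(elites)
        std = (sum((x - mean) ** 2 for x in elites) / len(elites)) ** 0.5 + 1e-6
    return mean

# One fixed objective peaked at x = 3: the distribution collapses onto it.
best = cem(lambda x: -(x - 3.0) ** 2)
print(round(best, 2))
```

The shrinking `std` is the narrowness in question: change the objective (a new task) and the collapsed distribution has almost no probability mass near the new optimum — hence fixes like domain randomization, which keeps the training distribution broad.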