Rajeev Ranjan Pandey

@rrpandey_in


PhDing @IITBHU_Varanasi | #ReinforcementLearning | Sharing PhD journey, insights and paper summaries.

Varanasi, India
Joined November 2023
@rrpandey_in · 3 days
If you want the full depth (and it’s worth it!), check out the lecture here:
@rrpandey_in · 3 days
Joelle’s challenge to us:
* Bellman’s value function unified RL for 30 years.
* Now we need new formalisms that capture richer values, unify the field again, and keep RL relevant for society.
@rrpandey_in · 3 days
We need a taxonomy of harms:
* Identify threats.
* Measure baseline risk.
* Measure marginal risk from new models.
* Evaluate mitigation effectiveness.
* Decide if residual risk is tolerable.
@rrpandey_in · 3 days
So what should alignment targets look like?
* Fine-grained (can guide behavior).
* Generalizable.
* Scalable (improves with more feedback).
* Legitimate and auditable.
New ideas: pluralistic alignment.
@rrpandey_in · 3 days
Problems:
* The RLHF advantage may be overstated (baseline issues).
* Leaderboards can be gamed.
* Opaque access and private testing.
Joelle warns: "When we fool ourselves about model quality, we harm the field more than we help."
@rrpandey_in · 3 days
Generative AI is already forcing this shift. The LLM alignment pipeline today:
* Pretraining (learn representations).
* Supervised fine-tuning (imitate human data).
* RLHF (optimize a human preference model).
* Evaluation = leaderboards + human judgment.
But cracks are showing 👇
@rrpandey_in · 3 days
Beyond Bellman:
* Value isn’t just “expected sum of rewards.”
* We need richer, pluralistic, dynamic value systems in AI.
@rrpandey_in · 3 days
Extending Bellman (RL++):
* Researchers added constraints: safety, fairness, privacy, interpretability.
* Formalisms like Constrained MDPs helped.
* Promising results in areas like treatment optimization.
But simply piling on constraints doesn’t scale. Trade-offs get messy.
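For context: a Constrained MDP keeps the usual reward but adds cost signals with budgets, and a common way to trade them off is a Lagrangian relaxation. A minimal sketch of the dual-ascent update on the multiplier — the learning rate, budget, and per-episode costs below are invented for illustration:

```python
# Constrained MDP trade-off sketch: maximize reward subject to an
# expected-cost budget by optimizing reward' = r - lam * c, while
# adjusting the multiplier lam by dual ascent. All numbers are made up.

def lagrangian_reward(r, c, lam):
    """Scalarize reward r and safety cost c with multiplier lam."""
    return r - lam * c

def update_multiplier(lam, avg_cost, budget, lr=0.1):
    """Dual ascent: raise lam when the cost budget is exceeded, never below 0."""
    return max(0.0, lam + lr * (avg_cost - budget))

lam = 0.0
budget = 1.0
for episode_cost in [2.0, 1.5, 1.2, 0.9]:  # hypothetical per-episode costs
    lam = update_multiplier(lam, episode_cost, budget)
print(round(lam, 3))
```

This illustrates the messy-trade-off point: every extra constraint adds another multiplier to tune, and they interact.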
@rrpandey_in · 3 days
Bellman’s assumptions:
* Markov property (the state captures everything).
* Stationarity (the world doesn’t change).
* Scalar rewards (everything = 1 number).
* Additivity (values sum neatly).
* Uniqueness (one true reward).
These assumptions break in messy real-world domains (healthcare, …).
@rrpandey_in · 3 days
Bellman’s gift to RL:
* The value function unified theory and practice.
* Compute the value, get the optimal policy.
* It became our shared language.
* “How’s your value?” became our handshake.
* Enabled benchmarks, modular progress, and decades of growth.
But it carried hidden assumptions.
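The value function being credited here is the fixed point of the Bellman optimality equation (standard textbook notation, not taken from the talk):

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]
```

Acting greedily with respect to $V^{*}$ recovers an optimal policy — that is the “compute the value, get the optimal policy” step above.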
@rrpandey_in · 3 days
I just watched the keynote by @jpineau1 at #RLC2025, titled "Beyond Bellman’s Legacy". Joelle Pineau delivered a provocative talk:
👉 What does value really mean in Reinforcement Learning (RL)?
👉 Can we move beyond Bellman’s equation while still keeping RL coherent?
Here’s the summary 👇
@rrpandey_in · 10 days
So next time you’re torn between instant gratification vs. future payoff… that’s you running a little Value Iteration inside your brain. 😆
@rrpandey_in · 10 days
Limitations of VI:
👉 Real life is messy, and we don’t always know the exact rules or probabilities.
👉 For huge problems (like planning your whole life), Value Iteration gets computationally heavy.
@rrpandey_in · 10 days
What makes Value Iteration powerful:
👉 It always finds the best long-term strategy (if you know the environment).
👉 It balances short-term pleasure with long-term reward.
@rrpandey_in · 10 days
The math version (for the nerds 🤓):
[image: the Value Iteration update equation]
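The attached image didn’t survive the export; the standard Value Iteration update it presumably showed is:

```latex
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V_{k}(s') \,\bigr]
```

Here $\gamma \in [0, 1)$ is the discount factor: how much you care about future reward relative to immediate reward.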
@rrpandey_in · 10 days
Value Iteration says:
👉 Estimate how “good” each situation is.
👉 Update those estimates by imagining the future rewards from your choices.
👉 Repeat until you know which choices really pay off long-term.
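Those three steps can be sketched in a few lines of Python. The toy two-state MDP below (states, transitions, rewards, discount) is made up for illustration:

```python
# Value Iteration on a tiny made-up MDP.
# P[s][a] = list of (probability, next_state, reward) outcomes.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
gamma = 0.9  # discount: how much future reward matters

V = {s: 0.0 for s in P}  # step 1: initial guess of how good each state is
for _ in range(200):     # step 3: repeat until the estimates settle
    V = {
        s: max(  # step 2: back up imagined future rewards over actions
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        for s in P
    }

# Read off the greedy policy from the converged values.
policy = {
    s: max(P[s], key=lambda a, s=s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(policy)
```

With these numbers the agent learns to pay the one-step cost of moving to `s1` and then stay there — exactly the short-term-pain, long-term-gain trade-off described in the thread.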
@rrpandey_in · 10 days
Let’s say you’re deciding whether to cook at home or order takeout. Cooking means effort now, healthier later. Takeout means easy now, maybe regret later. Which should you pick? That’s exactly the kind of question Value Iteration helps answer.
@rrpandey_in · 10 days
Imagine life as a game:
👉 Every situation is a state.
👉 Every choice you make is an action.
👉 How each choice changes your future is the transition.
👉 The payoffs (or regrets) you collect along the way are the reward.
This is a Markov Decision Process (MDP).
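The cook-vs-takeout dinner decision from this thread can be written down as exactly such an MDP. A minimal one-step sketch — all states, transitions, and reward numbers are invented for illustration:

```python
# A tiny MDP for the dinner decision. Rewards are made-up "life points".
actions = ["cook", "takeout"]

# transition[s][a] = next state (deterministic here, for brevity)
transition = {
    "hungry": {"cook": "fed_healthy", "takeout": "fed_lazy"},
}

# reward[s][a] = immediate payoff: cooking costs effort now, takeout feels easy
reward = {
    "hungry": {"cook": -1.0, "takeout": +1.0},
}

# Long-term value of where you end up (health benefits vs. regret).
future_value = {"fed_healthy": +5.0, "fed_lazy": +2.0}

gamma = 0.9  # discount factor: how much tomorrow matters vs. tonight
q = {
    a: reward["hungry"][a] + gamma * future_value[transition["hungry"][a]]
    for a in actions
}
print(max(q, key=q.get))
```

With these invented numbers, the discounted future value of eating healthy outweighs takeout’s instant payoff, so the greedy choice is to cook; shrink `gamma` toward 0 and the answer flips.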
@rrpandey_in · 10 days
Life is full of choices: short-term fun vs. long-term payoff. Value Iteration is the algorithm that helps #RL agents figure it out. Here’s how it works, with some real-world examples 👇
@rrpandey_in · 11 days
Limitation 10: Poor generalization.
CEM learns narrow, task-specific distributions.
Fix: domain randomization, meta-RL, skill-based latent policies.
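For context on why vanilla CEM specializes: it repeatedly refits its sampling distribution to the elite samples of a single fixed objective, so the distribution collapses onto that one task’s optimum. A minimal 1-D sketch (the objective and all hyperparameters are arbitrary):

```python
import random

def cem(objective, mean=0.0, std=5.0, n=100, elite_frac=0.2, iters=30):
    """Cross-Entropy Method: each round, fit a Gaussian to the elite samples."""
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(iters):
        samples = [rng.gauss(mean, std) for _ in range(n)]
        elites = sorted(samples, key=objective, reverse=True)[: int(n * elite_frac)]
        mean = sum(elites) / len(elites)
        std = (sum((x - mean) ** 2 for x in elites) / len(elites)) ** 0.5 + 1e-6
    return mean

# One fixed objective peaked at x = 3: the distribution collapses onto it.
best = cem(lambda x: -(x - 3.0) ** 2)
print(round(best, 2))
```

The shrinking `std` is the narrowness in question: change the objective (a new task) and the collapsed distribution has almost no probability mass near the new optimum — hence fixes like domain randomization, which keeps the training distribution broad.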