Ashwani Kumar Profile
Ashwani Kumar

@ash_at_tt

Followers
377
Following
4K
Media
19
Statuses
63

Deep Learning Engineer | Entrepreneur | PhD

London, United Kingdom
Joined September 2009
@ash_at_tt
Ashwani Kumar
17 days
Let me know what you think
1
0
6
@ash_at_tt
Ashwani Kumar
17 days
Covers:
• Noising for discrete tokens/characters
• Step-by-step implementation of baby diffusion GPT
• Training using the Score Entropy Objective
• Annotated training and inference code in PyTorch
• Inference using parallel denoising (no autoregressive bottleneck)
1
1
15
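The "noising for discrete tokens" step works differently from the Gaussian noise used on images; one common choice is an absorbing (mask) process that replaces characters with a reserved mask token at a rate that grows with the noise level. A minimal sketch of that idea in PyTorch (MASK_ID and the function name are illustrative, not taken from the repo):

```python
import torch

MASK_ID = 0  # hypothetical id reserved for an absorbing "mask" token

def noise_tokens(x: torch.Tensor, t: float) -> torch.Tensor:
    """Corrupt a batch of token ids by replacing each position with MASK_ID
    independently with probability t (absorbing-state noising)."""
    mask = torch.rand(x.shape, device=x.device) < t
    return torch.where(mask, torch.full_like(x, MASK_ID), x)

# Heavier corruption at larger t; at t = 1 every character is masked.
x = torch.randint(1, 65, (2, 16))   # toy character-level ids
x_noisy = noise_tokens(x, t=0.5)
```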
@ash_at_tt
Ashwani Kumar
17 days
I turned @karpathy's baby GPT into a character-level text diffusion model, using @aaron_lou et al.'s score entropy-based training objective.
17
53
977
@PrimaMente
Prima Mente
3 months
1/ Today we announce Pleiades, a series of epigenetic foundation models (90M→7B params) trained on 1.9T tokens of human methylation & genomic data. Pleiades accurately models epigenetics for genomic track prediction, generation & neurodegenerative disease detection from cfDNA,
10
42
145
@ash_at_tt
Ashwani Kumar
4 months
Video walkthrough:
0
0
1
@ash_at_tt
Ashwani Kumar
4 months
Defining a target for the Value head was also a bit confusing. It's simply Values + Advantages, where both Values and Advantages come from the old value head and old policy, before the start of the mini-batch training.
1
0
0
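In code, the value-head target described above is just the returns: old values plus advantages, held fixed across the mini-batch epochs. A minimal sketch (names are illustrative, not taken from the notebooks):

```python
import torch

def value_target(old_values: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Target for the value head: old values + advantages (i.e. the returns).
    Both inputs come from the rollout phase, before mini-batch training,
    so the target stays fixed across the PPO epochs."""
    return (old_values + advantages).detach()

# Usage inside the PPO update (new_values come from the current value head):
# value_loss = 0.5 * (new_values - value_target(old_values, advantages)).pow(2).mean()
```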
@ash_at_tt
Ashwani Kumar
4 months
Advantages are whitened: they are first normalised via a Z-score normalisation, then shifted back to the original mean.
1
0
0
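A minimal sketch of that whitening step as described here, i.e. Z-score normalisation with the mean added back so only the spread is rescaled (the function name and epsilon are illustrative):

```python
import torch

def whiten(advantages: torch.Tensor) -> torch.Tensor:
    """Z-score normalise the advantages, then add the original mean back,
    so only the spread is rescaled."""
    mean, std = advantages.mean(), advantages.std()
    return (advantages - mean) / (std + 1e-8) + mean
```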
@ash_at_tt
Ashwani Kumar
4 months
The ratio of current and old (not SFT) policies in PPO's clip loss adds to the confusion. Clip loss is calculated during the mini-batch training step in PPO, where the old policy (π_θ_old) is the policy before we start our mini-batch training.
1
0
0
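A minimal sketch of the clipped surrogate loss built from that ratio, where old_logprobs is the snapshot of π_θ_old taken before the mini-batch epochs (names and the 0.2 clip range are illustrative, not taken from the notebooks):

```python
import torch

def clipped_policy_loss(logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate loss.

    logprobs:     log π_θ(a|s) from the policy being updated in this mini-batch
    old_logprobs: log π_θ_old(a|s), frozen before mini-batch training starts
    advantages:   advantage estimates from the rollout
    """
    ratio = torch.exp(logprobs - old_logprobs)                  # π_θ / π_θ_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximise surrogate ⇒ minimise negative
```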
@ash_at_tt
Ashwani Kumar
4 months
The reward for PPO doesn't just come from the reward model (RM). It also includes a penalty term that penalizes the policy or model if it diverges too far from the SFT policy (π^SFT).
1
0
0
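A minimal sketch of that combined reward, assuming a per-token penalty proportional to the log-prob gap from the SFT policy and the reward-model score added at the final token (names and the coefficient are illustrative, not taken from the notebooks):

```python
import torch

def per_token_rewards(rm_score: torch.Tensor,
                      policy_logprobs: torch.Tensor,
                      sft_logprobs: torch.Tensor,
                      kl_coef: float = 0.2) -> torch.Tensor:
    """Reward used by PPO: a KL-style penalty at every token for drifting
    away from the SFT policy, plus the reward-model score at the last token.

    rm_score:        (batch,) score per sequence from the reward model
    policy_logprobs: (batch, seq) log-probs of sampled tokens under the policy
    sft_logprobs:    (batch, seq) log-probs of the same tokens under π^SFT
    """
    rewards = -kl_coef * (policy_logprobs - sft_logprobs)   # penalty term
    rewards[:, -1] += rm_score                              # RM score only at the end
    return rewards
```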
@ash_at_tt
Ashwani Kumar
4 months
I implemented Reinforcement Learning from Human Feedback (RLHF) from scratch in Python Notebooks and recorded the step-by-step process in a 3+ hour YouTube video. GitHub repo, surprising details I learned, and YouTube video: 👇
2
3
14
@ash_at_tt
Ashwani Kumar
4 months
The video is also available on YouTube
0
0
1
@ash_at_tt
Ashwani Kumar
4 months
The complete implementation in three Jupyter notebooks is available on GitHub:
1
0
0
@ash_at_tt
Ashwani Kumar
4 months
I recently implemented Reinforcement Learning from Human Feedback (RLHF) step-by-step, including Supervised Fine-Tuning (SFT), Reward Modeling, and Proximal Policy Optimization (PPO). 🧵
1
1
1
@rohanpaul_ai
Rohan Paul
2 years
BREAKING 🔥🤯 Google releases a model with the new Griffin architecture that outperforms transformers. Across multiple sizes, Griffin outperforms the benchmark scores of the transformer baseline in controlled tests, in both the MMLU score across different parameter sizes as well as the
5
114
500
@ash_at_tt
Ashwani Kumar
2 years
Future work aims to extend this framework's capabilities, including building TIs (text interfaces) for interacting with different resource types, online and offline. Your feedback and contributions are welcome. The code repo is available at:
github.com/ash80/backtracking_gpt
A GPT agent with a Text Interface tool.
0
0
0
@ash_at_tt
Ashwani Kumar
2 years
The main limitations of this approach, though, are that it only works with GPT-4 and requires building text interfaces for interacting with different types of resources.
1
0
0
@ash_at_tt
Ashwani Kumar
2 years
In summary, the key features of the framework are dynamic actions, the ability to backtrack, and a human-like information retrieval process.
1
0
0
@ash_at_tt
Ashwani Kumar
2 years
The framework maintains a state, consisting of notes taken by the LLM agent and its past actions, which allows the model to backtrack when it gets stuck on the current path.
1
0
0
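A minimal sketch of what such a state could look like (class and method names are hypothetical, not taken from the ash80/backtracking_gpt repo):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AgentState:
    """State kept across steps: notes written by the LLM agent and the
    actions already taken, so the agent can back off a dead-end path."""
    notes: List[str] = field(default_factory=list)
    past_actions: List[str] = field(default_factory=list)

    def record(self, action: str, note: Optional[str] = None) -> None:
        self.past_actions.append(action)
        if note:
            self.notes.append(note)

    def backtrack(self) -> Optional[str]:
        """Undo the most recent action when the current path is stuck."""
        return self.past_actions.pop() if self.past_actions else None
```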