Francesco Bertolotti
@f14bertolotti
Followers
1K
Following
341
Media
143
Statuses
258
Postdoctoral researcher at the University of Milan
Joined October 2021
Ever wondered why language model embeddings are organized semantically, or why the weight tying technique is so effective? 🤔 Our new paper, spotlighted at ICML24, dives into these questions. Joint work with @w_cazzola. 👉 Read more: https://t.co/OB7b4eg9fi 🧵👇 (1/7)
openreview.net
In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces the model size while often improving training. Interestingly,...
1
5
44
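For context, here is what the tied-embedding technique itself looks like in code: a minimal toy sketch (my own, not the paper's setup) where a single [vocab_size, d_model] matrix serves as both the input embedding and the output projection.

```python
import torch.nn as nn

class TinyLM(nn.Module):
    # Minimal toy LM with weight tying: the unembedding (lm_head) shares its
    # parameter with the input embedding, halving the embedding footprint.
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.body = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # the tying step

    def forward(self, tokens):  # tokens: [batch, seq]
        return self.lm_head(self.body(self.embed(tokens)))
```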
In this paper, the authors try GRPO with a Conciseness Reward Model to reduce answer length. The reward is scaled via a difficulty-annealing schedule, and it is applied only when the outcome is already correct. It seems to work pretty well. 👉 https://t.co/afJAznUGYZ
0
0
23
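To make the shaping concrete, here is a hedged sketch of what such a reward could look like. The exact annealing schedule and weighting are my assumptions, not the paper's formula.

```python
def shaped_reward(is_correct, answer_len, max_len,
                  difficulty, step, total_steps, base=1.0):
    # Hypothetical formula: conciseness bonus fires only on correct answers,
    # and its weight is annealed by difficulty over the course of training.
    if not is_correct:
        return 0.0                            # no length shaping on wrong answers
    anneal = step / total_steps               # assumed annealing schedule
    weight = anneal * (1.0 - difficulty)      # push easier problems harder to be short
    conciseness = max(0.0, 1.0 - answer_len / max_len)
    return base + weight * conciseness
```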
This is a new small LM (1.5B) that achieves strong reasoning performance on AIME, LCB, and GPQA. The authors used model merging and an entropy-maximizing variation of GRPO. Impressive work! 👉 https://t.co/TVPMyR8j6j
1
15
82
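For the curious, this is roughly what an entropy bonus on a policy-gradient loss looks like; the paper's actual GRPO variant may differ in the details.

```python
import torch

def pg_loss_with_entropy_bonus(logits, logprobs_taken, advantages, beta=1e-2):
    # Standard policy-gradient term: push up log-probs of high-advantage tokens.
    pg = -(advantages * logprobs_taken).mean()
    # Entropy of the token distribution; subtracting it rewards high entropy.
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return pg - beta * entropy
```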
These authors propose a super simple jailbreak attack, Ninja, that exploits the long-context deficiencies of LLMs to get an answer. The main idea is to fill the context with benign filler and only then state the goal. Cool work! 👉 https://t.co/z2AtgiwHbK
2
4
15
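The construction is simple enough to sketch; `benign_filler` and the word budget below are hypothetical stand-ins, not the paper's actual payload.

```python
def build_long_context_prompt(benign_filler: str, goal: str,
                              target_words: int = 50_000) -> str:
    # Saturate the context window with innocuous text...
    padding, words = [], 0
    while words < target_words:
        padding.append(benign_filler)
        words += len(benign_filler.split())
    # ...and only then append the goal at the very end.
    return "\n\n".join(padding) + "\n\n" + goal
```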
Other cool abstracts from today's arxiv:
- Deep Research https://t.co/eLJBG4vAq7
- Neural Operator https://t.co/6GZR6ddB01
- LLM Technical Report
0
1
4
The author shows that grokking can arise purely from a quick initial overfitting of the data, followed by movement along the zero-loss region driven by weight decay. 👉 https://t.co/hGF7vkMS8b
5
25
169
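A toy illustration of that mechanism (not the author's experiment): fit a small net past zero loss and watch AdamW's decoupled weight decay keep shrinking the weight norm along the zero-loss region.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
model = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

for step in range(20001):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()   # near-zero gradient once the data is memorized...
    opt.step()        # ...so the update is dominated by the decay term
    if step % 4000 == 0:
        norm = sum(p.pow(2).sum() for p in model.parameters()).sqrt()
        print(f"step {step:6d}  loss {loss.item():.4f}  ||w|| {norm.item():.1f}")
```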
New paper from NVIDIA on autonomous vehicles. The authors bootstrap a vision-language-action model capable of driving from a pre-trained world model (Cosmos). 👉 https://t.co/xR9OrolGaz
0
1
5
From today's arxiv. The authors show that the multilingual failures of reasoning models stem from failures in question understanding. A simple prompting technique is able to mitigate the issue. They also show that the failures are detectable from hidden states. 👉 https://t.co/Z64RYOIepd
0
3
19
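Here is my guess at the flavor of that mitigation; the exact prompt is not from the paper, just a sketch of forcing a comprehension step before any reasoning happens.

```python
def understanding_first_prompt(question: str) -> str:
    # Hypothetical prompt wrapper: make the model restate the question first,
    # so comprehension failures surface before the reasoning begins.
    return (
        "First, restate the question below in English in your own words, "
        "making sure you understand what is being asked. "
        "Then solve it step by step.\n\n"
        f"Question: {question}"
    )
```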
Just skimmed this paper a little more in depth. This is a genuine breakthrough in the field. Huge congrats to the authors.
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
1
1
7
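To make the inversion idea concrete, here is a brute-force sketch of my own (not the paper's algorithm); `hidden_of` is a hypothetical helper returning the last-position hidden state for a given token prefix.

```python
def recover_tokens(target_states, hidden_of, vocab_size, length):
    # If prompt -> hidden state is injective, we can recover tokens greedily:
    # at each position, pick the token whose state matches the target state.
    prefix = []
    for t in range(length):
        errs = []
        for tok in range(vocab_size):
            h = hidden_of(prefix + [tok])  # hypothetical model call
            errs.append((h - target_states[t]).norm().item())
        prefix.append(min(range(vocab_size), key=errs.__getitem__))
    return prefix
```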
Attending ECAI25 in Bologna was a fantastic experience. The organizers truly excelled, even including a wonderful concert. Sincere appreciation to Orchestra @Senzaspine_ for providing such a beautiful and enjoyable evening.
0
1
0
Link to the post: https://t.co/5XkhmmeaNc. Full credits to those who helped me review this are in the post.
publish.obsidian.md
Most attention variants have been designed to retain as much sample efficiency as possible, under the constraint of achieving subquadratic scaling with respect to sequence length. While t…
0
12
89
Let me know if you find any issues or if you experiment with this!
0
0
1
Let me just add that all of this came at quite a cost: training with this strategy takes about 10x longer. However, since these are just the first experiments, I believe there is room for improvement. The code is open source 👉 https://t.co/2yKN88gpIC
github.com
layer-wise K-level optimization
1
0
2
I am tracking only training and validation loss for now. As you can see, the improvement from this approach is quite substantial. Of course, there are a lot of limitations in this evaluation: small context length, too few training steps, no ablations...
1
0
2
Reading the paper, I was wondering: what if we did the same thing for LLMs? There are many strategies for doing this, but let's start by treating each layer as an agent. To test this, I trained Qwen 1.5B and 3B from scratch on the MiniPile dataset.
1
0
2
The k-level paper, https://t.co/nLLmmgpeSJ, proposes to update the policy of an agent by looking at the updated versions of all other agents. It is simply a matter of running backpropagation while all other agents are one step ahead!
arxiv.org
Actor-critic algorithms for deep multi-agent reinforcement learning (MARL) typically employ a policy update that responds to the current strategies of other agents. While being straightforward,...
1
0
2
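Here is a minimal sketch of how I map that idea onto layers. This is my reconstruction, not the exact code in the repo, and it assumes a toy LM whose forward returns logits of shape [batch, seq, vocab].

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def klevel_step(model, inputs, targets, layer_prefixes, lr=1e-2):
    # 2-level update: each layer's gradient is taken while all *other*
    # parameters sit at their one-step-lookahead values.
    params = {n: p.detach().clone().requires_grad_(True)
              for n, p in model.named_parameters()}

    def loss_of(p):
        logits = functional_call(model, p, (inputs,))
        return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    # Level 1: ordinary gradients give the lookahead parameters.
    grads = torch.autograd.grad(loss_of(params), list(params.values()))
    lookahead = {n: (p - lr * g).detach()
                 for (n, p), g in zip(params.items(), grads)}

    # Level 2: update each layer against the others' lookahead versions.
    # Parameters outside the listed layers just keep the lookahead step.
    new_params = dict(lookahead)
    for prefix in layer_prefixes:
        mixed = {n: (params[n] if n.startswith(prefix) else lookahead[n])
                 for n in params}
        own = [n for n in params if n.startswith(prefix)]
        g = torch.autograd.grad(loss_of(mixed), [mixed[n] for n in own])
        for n, gn in zip(own, g):
            new_params[n] = (params[n] - lr * gn).detach()

    with torch.no_grad():
        for n, p in model.named_parameters():
            p.copy_(new_params[n])
```

For a Qwen-style stack, `layer_prefixes` might be something like `[f"model.layers.{i}." for i in range(num_layers)]` (hypothetical naming).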
I had a kind of wild idea a few weeks back. Remember the k-level policy gradient paper for MARL? 🤔 Well, I thought: why not do something similar for layers? So I tried it, and it's not bad. 👉 The details are in a new post https://t.co/5TSEuOxCXo 🧵 Here's the gist.
2
4
27
In this work, the authors theoretically compare layer normalization placements in transformers. The Peri-LN strategy, used by Gemma and OLMo, appears to be better behaved in both the forward and backward pass. 👉 https://t.co/RxU7QQYWGE
4
13
66
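For reference, this is the Peri-LN placement as I understand it, next to the usual alternatives; `sublayer` stands for either the attention or the MLP module.

```python
import torch.nn as nn

class PeriLNBlock(nn.Module):
    # Peri-LN normalizes both the input and the output of each sublayer,
    # while leaving the residual stream itself untouched.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.in_norm = nn.LayerNorm(d_model)
        self.out_norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # Post-LN: norm(x + sublayer(x))
        # Pre-LN:  x + sublayer(norm(x))
        # Peri-LN: x + norm(sublayer(norm(x)))
        return x + self.out_norm(self.sublayer(self.in_norm(x)))
```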
In this paper, the authors show that base models already contain thinking capabilities, which can be elicited using steering vectors obtained from their fine-tuned counterparts. 👉 https://t.co/6Pi5DgNGo4
0
1
21
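The elicitation recipe, as I read it, boils down to something like this sketch; the layer index, `alpha`, and the hook wiring are assumptions on my part.

```python
import torch

def make_steering_hook(base_acts, ft_acts, alpha=1.0):
    # Steering vector: mean activation difference between the fine-tuned and
    # base model on the same prompts (both tensors: [num_prompts, d_model]).
    v = (ft_acts - base_acts).mean(dim=0)

    def hook(module, args, output):
        # Shift the base model's residual stream toward the fine-tuned one;
        # HF decoder layers return tuples, so handle both cases.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage on one decoder layer of a HF-style base model:
# handle = base.model.layers[20].register_forward_hook(make_steering_hook(b, f))
```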
Interesting chunk-based approach to LLM RL, which goes like this:
- Get the prompt.
- The model generates a chunk.
- Carry over the chunk, concatenated to the prompt.
- Repeat generation.
This keeps the context small while allowing for RL. Cool work!! 👉 https://t.co/jozWcDsrm2
1
1
12
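The rollout loop is easy to sketch; `generate` and `carryover_of` below are hypothetical stand-ins for the model call and the chunk-carryover step.

```python
def chunked_rollout(prompt, generate, carryover_of, max_chunks=8, eos="<eos>"):
    chunks, context = [], prompt
    for _ in range(max_chunks):
        chunk = generate(context)  # the model produces one bounded chunk
        chunks.append(chunk)
        if chunk.endswith(eos):
            break
        # Context stays small: prompt plus carryover of the latest chunk only.
        context = prompt + carryover_of(chunk)
    return chunks
```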
A verifiable sparse-attention approach for inference. The probability that vAttention strays more than ε from SDPA is less than δ, and you can control ε and δ to customize the tradeoff between performance and accuracy. 👉 https://t.co/uhf46BZpph
0
7
52
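A toy way to read that guarantee: over many inputs, the sparse output should stray from exact SDPA by more than ε with frequency below δ. The `sparse_attention` below is a hypothetical stand-in for vAttention.

```python
import torch
import torch.nn.functional as F

def empirical_failure_rate(sparse_attention, epsilon, trials=1000, n=256, d=64):
    # Empirically estimate how often the sparse output violates the epsilon
    # bound against exact scaled dot-product attention.
    failures = 0
    for _ in range(trials):
        q, k, v = (torch.randn(1, 4, n, d) for _ in range(3))
        exact = F.scaled_dot_product_attention(q, k, v)
        approx = sparse_attention(q, k, v)
        failures += ((approx - exact).abs().max() > epsilon).item()
    return failures / trials  # compare against the chosen delta
```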