Francesco Bertolotti

@f14bertolotti

Followers 1K · Following 341 · Media 143 · Statuses 258

Postdoctoral researcher at the University of Milan

Joined October 2021
@f14bertolotti
Francesco Bertolotti
1 year
Ever wondered why language model embeddings are organized semantically or why the weight tying technique is so effective? 🤔 Our new paper, spotlighted at ICML24, dives into these questions. Joint work with @w_cazzola. 🔗 Read more: https://t.co/OB7b4eg9fi 🧵👇 (1/7)
openreview.net
In this work, we analyze both theoretically and empirically the effect of tied input-output embeddingsβ€”a popular technique that reduces the model size while often improving training. Interestingly,...
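For anyone unfamiliar with the technique the paper analyzes: weight tying reuses the input embedding matrix as the output projection. A minimal numpy sketch of the general idea (illustrative only, not the paper's model or notation):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 16
E = rng.normal(size=(vocab, dim))   # one shared embedding matrix

def embed(token_ids):
    return E[token_ids]             # input side: look up rows of E

def logits(hidden):
    return hidden @ E.T             # output side: score against the same E

tokens = np.array([3, 14, 15])
hidden = embed(tokens)              # pretend the transformer body is the identity
scores = logits(hidden)
print(scores.shape)                 # (3, 100): one score per vocabulary entry
```

An untied model would keep a separate vocab × dim output matrix; tying removes it outright, and gradients from the output side flow into the same matrix the input side uses.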
@f14bertolotti
Francesco Bertolotti
6 hours
In this paper, the authors try GRPO with a Conciseness Reward Model to reduce answer length. The reward is scaled by difficulty annealing, and it is only applied when the outcome is already correct. It seems to work pretty well. 🔗 https://t.co/afJAznUGYZ
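A hedged sketch of what such a reward could look like (the function name, the annealing rule, and the exact formula are my guesses at the shape of the method, not the paper's definitions):

```python
def shaped_reward(correct: bool, length: int, max_len: int, difficulty: float) -> float:
    """Outcome reward plus an annealed conciseness bonus (illustrative)."""
    if not correct:
        return 0.0                          # never punish length on wrong answers
    anneal = 1.0 - difficulty               # easier problems -> stronger pressure
    conciseness = 1.0 - min(length, max_len) / max_len
    return 1.0 + anneal * conciseness

print(round(shaped_reward(True, 200, 1000, difficulty=0.2), 2))   # 1.64: short and correct
print(round(shaped_reward(True, 1000, 1000, difficulty=0.2), 2))  # 1.0: correct but maximal length
print(round(shaped_reward(False, 50, 1000, difficulty=0.2), 2))   # 0.0: wrong, length ignored
```

Gating the bonus on correctness is the key design point: the model is never rewarded for being short and wrong.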
@f14bertolotti
Francesco Bertolotti
3 days
This is a new small LM (1.5B) that achieves strong reasoning capabilities on AIME, LCB, and GPQA. The authors used model merging and an entropy-maximizing variation of GRPO. Impressive work! 🔗 https://t.co/TVPMyR8j6j
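The tweet names model merging without detail; a common baseline flavor is plain weight interpolation between checkpoints, sketched here on toy state dicts (an assumption about the flavor of merging, not the paper's exact recipe):

```python
import numpy as np

def merge(state_a, state_b, alpha=0.5):
    # elementwise interpolation of two checkpoints' parameters
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

a = {"w": np.ones((2, 2)), "b": np.zeros(2)}
b = {"w": np.zeros((2, 2)), "b": np.ones(2)}
m = merge(a, b, alpha=0.25)
print(m["w"][0, 0], m["b"][0])   # 0.25 0.75: a quarter of a, three quarters of b
```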
@f14bertolotti
Francesco Bertolotti
3 days
These authors propose a super simple jailbreak attack, Ninja, that exploits the long-context deficiencies of LLMs to get an answer. The main idea is to fill the context with a bunch of benign content and then state the goal. Cool work! 🔗 https://t.co/z2AtgiwHbK
@f14bertolotti
Francesco Bertolotti
9 days
Other cool abstracts from today's arxiv:
- Deep Research https://t.co/eLJBG4vAq7
- Neural Operator https://t.co/6GZR6ddB01
- LLM Technical Report
@f14bertolotti
Francesco Bertolotti
9 days
This author shows that grokking can arise from an initial, quick overfitting of the data, followed by movement along the zero-loss region guided purely by weight decay. 🔗 https://t.co/hGF7vkMS8b
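The mechanism is easy to see in a toy problem (my construction, not the paper's experiment): fit f(w1, w2) = w1 * w2 to the target 1. Every point with w1 * w2 = 1 has zero loss, so starting already "overfit" on that set, only weight decay moves the weights, sliding them along the zero-loss region toward the minimum-norm solution w1 = w2 ≈ 1.

```python
lr, wd = 0.05, 0.01
w1, w2 = 2.0, 0.5                      # already zero loss: 2 * 0.5 == 1
for _ in range(20000):
    err = w1 * w2 - 1.0                # loss = err**2
    g1, g2 = 2 * err * w2, 2 * err * w1
    w1 -= lr * g1 + lr * wd * w1       # gradient step + decoupled weight decay
    w2 -= lr * g2 + lr * wd * w2
print(round(w1, 3), round(w2, 3))      # both drift toward 1: the min-norm solution
```

The gradient term only keeps the weights near the zero-loss region; the drift along it comes entirely from the decay term.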
@f14bertolotti
Francesco Bertolotti
10 days
New paper from NVIDIA for autonomous vehicles. The authors bootstrap a vision-language-action model capable of driving from a pre-trained world model (Cosmos). 🔗 https://t.co/xR9OrolGaz
@f14bertolotti
Francesco Bertolotti
11 days
From today's arxiv. The authors show that the multi-language failures of reasoning models come from question-understanding failures. A simple prompting technique is able to mitigate the issue. They also show the failures are detectable from hidden states. 🔗 https://t.co/Z64RYOIepd
@f14bertolotti
Francesco Bertolotti
15 days
Just skimmed this paper a little more in depth. This is a genuine breakthrough in the field. Huge congrats to the authors.
@GladiaLab
GLADIA Research Lab
17 days
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
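A toy picture of why injectivity enables recovery (my illustration, far simpler than the paper's transformer setting): when the token-to-embedding map is injective, distinct tokens land on distinct points, and nearest-neighbor search inverts the map exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 32
W = rng.normal(size=(vocab, dim))           # token i -> embedding W[i]

def invert(embedding):
    # recover the token whose embedding is closest to the observed vector
    return int(np.argmin(np.linalg.norm(W - embedding, axis=1)))

token = 17
recovered = invert(W[token] + 1e-6 * rng.normal(size=dim))  # tiny perturbation
print(recovered)  # 17: exact recovery from the embedding alone
```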
@f14bertolotti
Francesco Bertolotti
16 days
Attending ECAI25 in Bologna was a fantastic experience. The organizers truly excelled, even including a wonderful concert. Sincere appreciation to Orchestra @Senzaspine_ for providing such a beautiful and enjoyable evening.
@f14bertolotti
Francesco Bertolotti
29 days
Let me know if you find any issue or if you experiment with this!
@f14bertolotti
Francesco Bertolotti
29 days
Let me just add that all of this came at quite a cost: training with this strategy took about 10x longer. However, since these are just the first experiments, I believe there is room for improvement. The code is open source 🔗 https://t.co/2yKN88gpIC
github.com
layer-wise K-level optimization (f14-bertolotti/l2l)
@f14bertolotti
Francesco Bertolotti
29 days
I am tracking only training and validation loss for now. As you can see, the improvement from this approach is quite substantial. Of course, there are a lot of limitations in this evaluation: small context length, too few training steps, no ablations...
@f14bertolotti
Francesco Bertolotti
29 days
Reading the paper, I wondered: what if we do the same thing for LLMs? There are a lot of strategies for doing this, but let's just start by treating each layer as an agent. To test this, I trained Qwen 1.5B and 3B from scratch on the MiniPile dataset.
@f14bertolotti
Francesco Bertolotti
29 days
The k-level paper, https://t.co/nLLmmgpeSJ, proposes to update an agent's policy by looking at the updated versions of all the other agents. It is simply a matter of running backpropagation when all the other agents are one step ahead!
arxiv.org
Actor-critic algorithms for deep multi-agent reinforcement learning (MARL) typically employ a policy update that responds to the current strategies of other agents. While being straightforward,...
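The layer-wise analogue can be sketched on a toy two-layer scalar model (an illustrative reading of the idea, not the exact recipe from the post): each layer's gradient is taken against the one-step-lookahead weights of the other layer.

```python
lr = 0.01
w1, w2, x, target = 0.5, 0.5, 1.0, 2.0

def grads(a, b):
    # gradients of ((b * a * x) - target)**2 w.r.t. (a, b)
    err = b * a * x - target
    return 2 * err * b * x, 2 * err * a * x

for _ in range(500):
    g1, g2 = grads(w1, w2)                   # level-1: plain gradients
    la1, la2 = w1 - lr * g1, w2 - lr * g2    # the other agent's lookahead weights
    g1_k, _ = grads(w1, la2)                 # layer 1 responds to updated layer 2
    _, g2_k = grads(la1, w2)                 # layer 2 responds to updated layer 1
    w1, w2 = w1 - lr * g1_k, w2 - lr * g2_k

print(round(w1 * w2, 3))                     # the product approaches the target 2.0
```

In a real network the lookahead weights come from a cheap extra backward pass per layer, which is consistent with the ~10x training-time cost mentioned above.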
@f14bertolotti
Francesco Bertolotti
29 days
I had a kind of wild idea a few weeks back. Remember the K-level policy gradient paper for MARL? 🤔 Well, I thought, why not do something similar for layers? So I tried it, and it's not bad. 🔗 The details are in a new post: https://t.co/5TSEuOxCXo 🧵 Here's the gist.
@f14bertolotti
Francesco Bertolotti
1 month
In this work, the authors theoretically compare LayerNorm placements in transformers. The Peri-LN strategy, used by Gemma and OLMo, appears to be better behaved in both the forward and backward pass. 🔗 https://t.co/RxU7QQYWGE
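A simplified numpy reading of the placements being compared (my sketch, not the paper's formulation): Pre-LN normalizes only a sublayer's input, while Peri-LN also normalizes the sublayer's output before the residual add, which keeps the residual stream from blowing up with depth.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def pre_ln_block(x, sublayer):
    # normalize only the sublayer input
    return x + sublayer(layer_norm(x))

def peri_ln_block(x, sublayer):
    # normalize the sublayer input *and* its output before the residual add
    return x + layer_norm(sublayer(layer_norm(x)))

def run(block, seed=0, depth=8, dim=16):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(4, dim))
    for _ in range(depth):
        w = rng.normal(size=(dim, dim))      # random stand-in for attention / MLP
        x = block(x, lambda h: h @ w)
    return float(np.linalg.norm(x, axis=-1).mean())

print(run(pre_ln_block), run(peri_ln_block))  # Peri-LN keeps the stream much smaller
```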
@f14bertolotti
Francesco Bertolotti
1 month
In this paper, the authors show that base models already contain thinking capabilities, which can be elicited using steering vectors obtained from their fine-tuned counterparts. 🔗 https://t.co/6Pi5DgNGo4
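A common recipe for steering vectors, sketched on synthetic activations (illustrative; the paper derives its vectors from fine-tuned counterparts of the base model): take the mean activation difference between the two models and add a scaled copy to the base model's hidden state at inference.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
h_base = rng.normal(size=(100, dim))        # base-model activations on some prompts
h_ft = h_base + np.full(dim, 0.5)           # fine-tuned: shifted along a "thinking" direction
steer = h_ft.mean(0) - h_base.mean(0)       # mean difference = steering vector

def apply_steering(h, vec, alpha=1.0):
    return h + alpha * vec                  # add the vector at inference time

h = rng.normal(size=dim)
print(np.allclose(apply_steering(h, steer) - h, steer))  # True: shift equals the vector
```

The scale alpha controls how strongly the base model is pushed toward the fine-tuned behavior.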
@f14bertolotti
Francesco Bertolotti
1 month
Interesting chunk-based approach to LLM-RL, which goes like this:
- Get a prompt.
- The model generates a chunk.
- Carry over the chunk, concatenated to the prompt.
- Repeat generation.
This keeps the context small while allowing for RL. Cool work!! 🔗 https://t.co/jozWcDsrm2
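The loop above can be sketched with a stub generator standing in for the LLM (the names and the carryover rule are my guesses at the method's shape, not the paper's interface):

```python
def generate_chunk(context: str, step: int) -> str:
    return f"<chunk{step}>"                  # stub: a real model would decode tokens here

def chunked_rollout(prompt: str, n_chunks: int, carry: int = 1) -> str:
    answer, chunks = "", []
    for step in range(n_chunks):
        # the context holds the prompt plus only the most recent chunk(s),
        # not the full history, so it stays small while RL sees full rollouts
        context = prompt + "".join(chunks[-carry:])
        chunk = generate_chunk(context, step)
        chunks.append(chunk)
        answer += chunk
    return answer

print(chunked_rollout("solve: 2+2", 3))      # <chunk0><chunk1><chunk2>
```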
@f14bertolotti
Francesco Bertolotti
1 month
Verifiable sparse attention approach for inference. The probability that vAttention strays more than ϵ from SDPA is less than δ, and you can control ϵ and δ to customize the tradeoff between performance and accuracy. 🔗 https://t.co/uhf46BZpph
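Not vAttention's actual algorithm, but the standard Hoeffding-style arithmetic shows how an (ϵ, δ) guarantee trades off against work: estimating a bounded quantity to within ϵ with probability at least 1 − δ needs on the order of ln(2/δ) / (2ϵ²) samples, so tightening either knob costs more computation.

```python
import math

def samples_needed(eps: float, delta: float) -> int:
    # Hoeffding bound for a [0, 1]-bounded quantity: n >= ln(2/delta) / (2 eps^2)
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

for eps, delta in [(0.1, 0.05), (0.05, 0.05), (0.1, 0.01)]:
    print(eps, delta, samples_needed(eps, delta))
```

Note the asymmetry: halving ϵ quadruples the work, while shrinking δ costs only logarithmically.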