Francesco Bertolotti
@f14bertolotti
Followers
1K
Following
341
Media
143
Statuses
258
Postdoctoral researcher at the University of Milan
Joined October 2021
Ever wondered why language model embeddings are organized semantically, or why the weight tying technique is so effective? 🤔 Our new paper, spotlighted at ICML24, dives into these questions. Joint work with @w_cazzola. 👉 Read more: https://t.co/OB7b4eg9fi 🧵👇 (1/7)
openreview.net
In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces the model size while often improving training. Interestingly,...
1
5
44
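For context, here is what the tied-embedding technique itself looks like in code: a minimal toy sketch (my own, not the paper's setup) where a single [vocab_size, d_model] matrix serves as both the input embedding and the output projection.

```python
import torch.nn as nn

class TinyLM(nn.Module):
    # Minimal toy LM with weight tying: the unembedding (lm_head) shares its
    # parameter with the input embedding, halving the embedding footprint.
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.body = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # the tying step

    def forward(self, tokens):  # tokens: [batch, seq]
        return self.lm_head(self.body(self.embed(tokens)))
```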
In this paper, the authors try GRPO with a Conciseness Reward Model to reduce answer length. The reward is scaled via a difficulty-annealing schedule, and it is applied only when the outcome is already correct. It seems to work pretty well. 👉 https://t.co/afJAznUGYZ
0
0
23
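To make the shaping concrete, here is a hedged sketch of what such a reward could look like. The exact annealing schedule and weighting are my assumptions, not the paper's formula.

```python
def shaped_reward(is_correct, answer_len, max_len,
                  difficulty, step, total_steps, base=1.0):
    # Hypothetical formula: conciseness bonus fires only on correct answers,
    # and its weight is annealed by difficulty over the course of training.
    if not is_correct:
        return 0.0                            # no length shaping on wrong answers
    anneal = step / total_steps               # assumed annealing schedule
    weight = anneal * (1.0 - difficulty)      # push easier problems harder to be short
    conciseness = max(0.0, 1.0 - answer_len / max_len)
    return base + weight * conciseness
```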
This is a new small LM (1.5B) that achieves strong reasoning performance on AIME, LCB, and GPQA. The authors used model merging and an entropy-maximizing variation of GRPO. Impressive work! 👉 https://t.co/TVPMyR8j6j
1
15
82
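For the curious, this is roughly what an entropy bonus on a policy-gradient loss looks like; the paper's actual GRPO variant may differ in the details.

```python
import torch

def pg_loss_with_entropy_bonus(logits, logprobs_taken, advantages, beta=1e-2):
    # Standard policy-gradient term: push up log-probs of high-advantage tokens.
    pg = -(advantages * logprobs_taken).mean()
    # Entropy of the token distribution; subtracting it rewards high entropy.
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return pg - beta * entropy
```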
These authors propose a super simple jailbreak attack, Ninja, that exploits the long-context deficiencies of LLMs to get an answer. The main idea is to fill the context with benign filler and only then state the goal. Cool work! 👉 https://t.co/z2AtgiwHbK
2
4
15
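The construction is simple enough to sketch; `benign_filler` and the word budget below are hypothetical stand-ins, not the paper's actual payload.

```python
def build_long_context_prompt(benign_filler: str, goal: str,
                              target_words: int = 50_000) -> str:
    # Saturate the context window with innocuous text...
    padding, words = [], 0
    while words < target_words:
        padding.append(benign_filler)
        words += len(benign_filler.split())
    # ...and only then append the goal at the very end.
    return "\n\n".join(padding) + "\n\n" + goal
```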
Other cool abstracts from today's arxiv:
- Deep Research https://t.co/eLJBG4vAq7
- Neural Operator https://t.co/6GZR6ddB01
- LLM Technical Report
0
1
4
The author shows that grokking can arise purely from a quick initial overfitting of the data, followed by movement along the zero-loss region driven by weight decay. 👉 https://t.co/hGF7vkMS8b
5
25
169
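A toy illustration of that mechanism (not the author's experiment): fit a small net past zero loss and watch AdamW's decoupled weight decay keep shrinking the weight norm along the zero-loss region.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
model = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

for step in range(20001):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()   # near-zero gradient once the data is memorized...
    opt.step()        # ...so the update is dominated by the decay term
    if step % 4000 == 0:
        norm = sum(p.pow(2).sum() for p in model.parameters()).sqrt()
        print(f"step {step:6d}  loss {loss.item():.4f}  ||w|| {norm.item():.1f}")
```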
New paper from NVIDIA on autonomous vehicles. The authors bootstrap a vision-language-action model capable of driving from a pre-trained world model (Cosmos). 👉 https://t.co/xR9OrolGaz
0
1
5
From today's arxiv. The authors show that the multilingual failures of reasoning models stem from failures in question understanding. A simple prompting technique is able to mitigate the issue. They also show that the failures are detectable from hidden states. 👉 https://t.co/Z64RYOIepd
0
3
19
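Here is my guess at the flavor of that mitigation; the exact prompt is not from the paper, just a sketch of forcing a comprehension step before any reasoning happens.

```python
def understanding_first_prompt(question: str) -> str:
    # Hypothetical prompt wrapper: make the model restate the question first,
    # so comprehension failures surface before the reasoning begins.
    return (
        "First, restate the question below in English in your own words, "
        "making sure you understand what is being asked. "
        "Then solve it step by step.\n\n"
        f"Question: {question}"
    )
```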
Just skimmed this paper a little more in depth. This is a genuine breakthrough in the field. Huge congrats to the authors.
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
1
1
7
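To make the inversion idea concrete, here is a brute-force sketch of my own (not the paper's algorithm); `hidden_of` is a hypothetical helper returning the last-position hidden state for a given token prefix.

```python
def recover_tokens(target_states, hidden_of, vocab_size, length):
    # If prompt -> hidden state is injective, we can recover tokens greedily:
    # at each position, pick the token whose state matches the target state.
    prefix = []
    for t in range(length):
        errs = []
        for tok in range(vocab_size):
            h = hidden_of(prefix + [tok])  # hypothetical model call
            errs.append((h - target_states[t]).norm().item())
        prefix.append(min(range(vocab_size), key=errs.__getitem__))
    return prefix
```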
Attending ECAI25 in Bologna was a fantastic experience. The organizers truly excelled, even including a wonderful concert. Sincere appreciation to Orchestra @Senzaspine_ for providing such a beautiful and enjoyable evening.
0
1
0
Link to the post: https://t.co/5XkhmmeaNc. Full credits to those who helped me review this are in the post.
publish.obsidian.md
Most attention variants have been designed to retain as much sample efficiency as possible, under the constraint of achieving subquadratic scaling with respect to sequence length. While t…
0
12
89
Let me know if you find any issues or if you experiment with this!
0
0
1
Let me just add that all of this came at quite a cost: training with this strategy takes about 10x longer. However, since these are just the first experiments, I believe there is room for improvement. The code is open source 👉 https://t.co/2yKN88gpIC
github.com
layer-wise K-level optimization
1
0
2
I am tracking only training and validation loss for now. As you can see, the improvement from this approach is quite substantial. Of course, there are a lot of limitations in this evaluation: small context length, too few training steps, no ablations...
1
0
2
Reading the paper, I was wondering: what if we did the same thing for LLMs? There are many strategies for doing this, but let's start by treating each layer as an agent. To test this, I trained Qwen 1.5B and 3B from scratch on the MiniPile dataset.
1
0
2
The k-level paper, https://t.co/nLLmmgpeSJ, proposes to update the policy of an agent by looking at the updated versions of all other agents. It is simply a matter of running backpropagation while all other agents are one step ahead!
arxiv.org
Actor-critic algorithms for deep multi-agent reinforcement learning (MARL) typically employ a policy update that responds to the current strategies of other agents. While being straightforward,...
1
0
2
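Here is a minimal sketch of how I map that idea onto layers. This is my reconstruction, not the exact code in the repo, and it assumes a toy LM whose forward returns logits of shape [batch, seq, vocab].

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def klevel_step(model, inputs, targets, layer_prefixes, lr=1e-2):
    # 2-level update: each layer's gradient is taken while all *other*
    # parameters sit at their one-step-lookahead values.
    params = {n: p.detach().clone().requires_grad_(True)
              for n, p in model.named_parameters()}

    def loss_of(p):
        logits = functional_call(model, p, (inputs,))
        return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    # Level 1: ordinary gradients give the lookahead parameters.
    grads = torch.autograd.grad(loss_of(params), list(params.values()))
    lookahead = {n: (p - lr * g).detach()
                 for (n, p), g in zip(params.items(), grads)}

    # Level 2: update each layer against the others' lookahead versions.
    # Parameters outside the listed layers just keep the lookahead step.
    new_params = dict(lookahead)
    for prefix in layer_prefixes:
        mixed = {n: (params[n] if n.startswith(prefix) else lookahead[n])
                 for n in params}
        own = [n for n in params if n.startswith(prefix)]
        g = torch.autograd.grad(loss_of(mixed), [mixed[n] for n in own])
        for n, gn in zip(own, g):
            new_params[n] = (params[n] - lr * gn).detach()

    with torch.no_grad():
        for n, p in model.named_parameters():
            p.copy_(new_params[n])
```

For a Qwen-style stack, `layer_prefixes` might be something like `[f"model.layers.{i}." for i in range(num_layers)]` (hypothetical naming).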
I had a kind of wild idea a few weeks back. Remember the k-level policy gradient paper for MARL? 🤔 Well, I thought: why not do something similar for layers? So I tried it, and it's not bad. 👉 The details are in a new post https://t.co/5TSEuOxCXo 🧵 Here's the gist.
2
4
27
In this work, the authors theoretically compare layer normalization placements in transformers. The Peri-LN strategy, used by Gemma and OLMo, appears to be better behaved in both the forward and backward pass. 👉 https://t.co/RxU7QQYWGE
4
13
66
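For reference, this is the Peri-LN placement as I understand it, next to the usual alternatives; `sublayer` stands for either the attention or the MLP module.

```python
import torch.nn as nn

class PeriLNBlock(nn.Module):
    # Peri-LN normalizes both the input and the output of each sublayer,
    # while leaving the residual stream itself untouched.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.in_norm = nn.LayerNorm(d_model)
        self.out_norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # Post-LN: norm(x + sublayer(x))
        # Pre-LN:  x + sublayer(norm(x))
        # Peri-LN: x + norm(sublayer(norm(x)))
        return x + self.out_norm(self.sublayer(self.in_norm(x)))
```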
In this paper, the authors show that base models already contain thinking capabilities, which can be elicited using steering vectors obtained from their fine-tuned counterparts. 👉 https://t.co/6Pi5DgNGo4
0
1
21
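The elicitation recipe, as I read it, boils down to something like this sketch; the layer index, `alpha`, and the hook wiring are assumptions on my part.

```python
import torch

def make_steering_hook(base_acts, ft_acts, alpha=1.0):
    # Steering vector: mean activation difference between the fine-tuned and
    # base model on the same prompts (both tensors: [num_prompts, d_model]).
    v = (ft_acts - base_acts).mean(dim=0)

    def hook(module, args, output):
        # Shift the base model's residual stream toward the fine-tuned one;
        # HF decoder layers return tuples, so handle both cases.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage on one decoder layer of a HF-style base model:
# handle = base.model.layers[20].register_forward_hook(make_steering_hook(b, f))
```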
Interesting chunk-based approach to LLM RL, which goes like this:
- Get the prompt.
- The model generates a chunk.
- Carry over the chunk, concatenated to the prompt.
- Repeat generation.
This keeps the context small while allowing for RL. Cool work!! 👉 https://t.co/jozWcDsrm2
1
1
12
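The rollout loop is easy to sketch; `generate` and `carryover_of` below are hypothetical stand-ins for the model call and the chunk-carryover step.

```python
def chunked_rollout(prompt, generate, carryover_of, max_chunks=8, eos="<eos>"):
    chunks, context = [], prompt
    for _ in range(max_chunks):
        chunk = generate(context)  # the model produces one bounded chunk
        chunks.append(chunk)
        if chunk.endswith(eos):
            break
        # Context stays small: prompt plus carryover of the latest chunk only.
        context = prompt + carryover_of(chunk)
    return chunks
```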
A verifiable sparse-attention approach for inference. The probability that vAttention strays more than ε from SDPA is less than δ, and you can control ε and δ to customize the tradeoff between performance and accuracy. 👉 https://t.co/uhf46BZpph
0
7
52
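A toy way to read that guarantee: over many inputs, the sparse output should stray from exact SDPA by more than ε with frequency below δ. The `sparse_attention` below is a hypothetical stand-in for vAttention.

```python
import torch
import torch.nn.functional as F

def empirical_failure_rate(sparse_attention, epsilon, trials=1000, n=256, d=64):
    # Empirically estimate how often the sparse output violates the epsilon
    # bound against exact scaled dot-product attention.
    failures = 0
    for _ in range(trials):
        q, k, v = (torch.randn(1, 4, n, d) for _ in range(3))
        exact = F.scaled_dot_product_attention(q, k, v)
        approx = sparse_attention(q, k, v)
        failures += ((approx - exact).abs().max() > epsilon).item()
    return failures / trials  # compare against the chosen delta
```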