Shangshang Wang
@UpupWang
Followers: 583 · Following: 41 · Media: 24 · Statuses: 55
PhD @CSatUSC | @ShanghaiTechUni | Post-train via RL & Pre-train for AI4Science.
Los Angeles
Joined December 2024
Our code is built on torchtune @PyTorch. We hope that our implementation can also contribute to their new repo for post-training! https://t.co/ooH4AJFKnn
https://t.co/AsWWG1kmKt
github.com · Tora: Torchtune-LoRA for RL (shangshang-wang/Tora)
(Q)DoRA-with-Cache-based GRPO. The standard DoRA layer recalculates the weight norm and magnitude scale on every forward pass; DoRA-with-Cache avoids this by caching these expensive computations.
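A minimal sketch of the caching idea in PyTorch (hypothetical layer and method names, not the actual Tora implementation): common DoRA implementations already detach the merged weight's norm from the autograd graph, so it can be cached across forward passes and invalidated only when the LoRA factors change after an optimizer step.

```python
import torch
import torch.nn as nn

class DoRALinearWithCache(nn.Module):
    """Sketch of a DoRA linear layer that caches the weight-norm
    computation (hypothetical; names do not match the Tora repo)."""

    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        # frozen base weight
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) * 0.02, requires_grad=False
        )
        # trainable low-rank factors
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank
        # DoRA magnitude vector, initialized to the base weight's row norms
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=1))
        self._cached_norm = None  # cached norm of the merged weight

    def invalidate_cache(self):
        # call after every optimizer step, when lora_a / lora_b change
        self._cached_norm = None

    def forward(self, x):
        merged = self.weight + self.scaling * (self.lora_b @ self.lora_a)
        if self._cached_norm is None:
            # the expensive part; it is detached (as in common DoRA
            # implementations), so reusing it does not change gradients
            self._cached_norm = merged.norm(p=2, dim=1, keepdim=True).detach()
        directional = merged / self._cached_norm
        return x @ (self.magnitude.unsqueeze(1) * directional).t()
```

This matters for GRPO in particular: rollout generation runs many forward passes between optimizer steps, so the cached norm stays valid for long stretches; a training loop would simply call invalidate_cache() on every adapted layer right after optimizer.step().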
We provide detailed benchmarking of Qwen2.5 models across sizes (1.5B-32B), comparing LoRA-based and full-parameter training on only 2x A40 GPUs. See below for (Q)LoRA, (Q)DoRA, and (Q)DoRA-with-Cache, where we cache expensive computations to make DoRA more efficient.
Check out our Tina project (efficient RL for reasoning with LoRA) here: https://t.co/qntPWxzDPJ
😋 Want strong LLM reasoning without breaking the bank? We explored just how cost-effectively RL can enhance reasoning using LoRA! [1/9] Introducing Tina: A family of tiny reasoning models with strong performance at low cost, providing an accessible testbed for RL reasoning. 🧵
We now know that LoRA can match full-parameter RL training (from https://t.co/pGxoMLFIGv and our Tina paper https://t.co/dkXdxV3eNj), but what about DoRA, QLoRA, and more? We are releasing a clean LoRA-for-RL repo to explore them all. https://t.co/AsWWG1kmKt
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
LoRA is real for Reasoning. https://t.co/pGxoMLFIGv
This is another amazing collaboration with Julian @julian_asilis, Omer @oemerakgull, Enes, Oliver @olliezliu, and Deqing @DeqingFu in the course taught by Willie @willieneis (both our teacher and our advisor). Thanks, everyone!
Curious about the details behind these efficiency claims? We open-source everything for full reproducibility: Paper: https://t.co/dZ2VMLQWEd Blog: https://t.co/u2V8D0c3Y0 Code: https://t.co/1Kl5MRPwAB Model: https://t.co/GASQjSPJ0m Training Logs:
SAE-Tuning trains models that match the performance of their RL-trained counterparts while reducing cost by >2000x and time by >450x. The trained model is transparent, revealing where reasoning abilities hide, and it is generalizable and modular, enabling transfer across datasets and models.
Such efficiency stems from our novel SAE-Tuning method, which expands the use of SAEs beyond test-time steering. In SAE-Tuning, the SAE first “extracts” latent reasoning features and then guides a standard supervised fine-tuning process to “elicit” reasoning abilities.
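A minimal sketch of the two-stage idea, under my own assumptions (a plain ReLU SAE spliced in via a forward hook on a layer that returns a tensor; the actual SAE-Tuning recipe may differ):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Plain ReLU sparse autoencoder over hidden activations (sketch)."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))  # sparse latent features
        return self.decoder(z), z

# Stage 1 ("extract"): fit the SAE to reconstruct hidden states collected
# from a source reasoning model, with an L1 penalty to enforce sparsity.
def sae_loss(sae, h, l1_coef=1e-3):
    recon, z = sae(h)
    return (recon - h).pow(2).mean() + l1_coef * z.abs().mean()

# Stage 2 ("elicit"): freeze the SAE, splice it into the target model's
# forward pass, then run standard SFT so the student's activations are
# routed through the frozen "reasoning features" during training.
def splice_sae(layer: nn.Module, sae: SparseAutoencoder):
    sae.requires_grad_(False)

    def hook(module, inputs, output):
        recon, _ = sae(output)
        return recon  # returning a value replaces the layer's output

    return layer.register_forward_hook(hook)
```

The hook assumes the chosen layer emits a plain tensor (e.g., an MLP block); a transformer block that returns a tuple would need the hidden state unpacked and repacked around the SAE call.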
👧 Check out more about Tina down below! Paper: https://t.co/dkXdxV3eNj Notion Blog: https://t.co/vue286jaH0 Code: https://t.co/CcTLnx9VaH Model: https://t.co/TgeThEaSQL Training Logs: https://t.co/DWKXxXN4Zp Tina's avatar is generated by GPT-4o based on KYNE's girls.
We also want to express our gratitude to the broader open-source community. This research was made possible by leveraging numerous publicly available resources from the DeepScaleR @Agentica_, STILL, OpenThoughts @bespokelabsai, OpenR1 @huggingface, LIMR, and OpenRS projects.
This is an amazing collaboration with Julian @julian_asilis, Omer @oemerakgull, Enes, and Oliver @olliezliu in the course taught by Willie @willieneis (both our teacher and our advisor). Thanks, everyone!
[9/9] 🚀 We thus hypothesize that LoRA’s effectiveness and efficiency stem from rapidly adapting the reasoning format under RL while preserving base model knowledge, a likely more compute-efficient process than the deep knowledge integration of full-parameter training.
[8/9] 🔬 Observation 2) We consistently observe a training phase transition in format-related metrics (format reward, completion length) but NOT in accuracy-related metrics across most Tina models. The best-performing checkpoint is always found around this transition point.
[7/9] 🩺 Observation 1) In Tina models, performance decreases as training compute increases, in contrast to full-parameter models. This observation highlights a “less compute can yield more performance” phenomenon.