Jonathan Pilault

@J_Pilault

Followers: 327
Following: 11
Media: 4
Statuses: 106

• ML Research Scientist at Silicon Valley startup @ZyphraAI • Former researcher @GoogleDeepMind @nvidia • PhD @Mila_Quebec

Montreal
Joined February 2010
@J_Pilault
Jonathan Pilault
1 year
I am extremely proud of what the team at @ZyphraAI has achieved. Let's keep pushing the boundaries!
@QuentinAnthon15
Quentin Anthony
1 year
For a long time, our training goal at @ZyphraAI had been simply to match dense transformers, but with faster inference and lower training cost. Today we also surpass them with Zamba2-7B.
0
0
6
@Nick__Alonso
Nick Alonso
1 year
1) RAG often struggles on complex multi-hop queries. In this blog post, we at @ZyphraAI discuss and build a graph-based RAG system that tops the leaderboard on a QA benchmark with multi-hop queries and outperforms frontier long-context models at 60x lower cost. https://t.co/QDXUdiWzh5
1
4
12
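A minimal sketch of the graph-based multi-hop retrieval pattern described above, assuming hypothetical entities_of and score helpers (an illustration of the general idea, not Zyphra's actual system): passages sharing an entity are linked, and retrieval hops along those links before answering.

    from collections import defaultdict

    def build_entity_graph(passages, entities_of):
        # Link passage ids that mention at least one common entity.
        by_entity = defaultdict(set)
        for pid, text in passages.items():
            for ent in entities_of(text):
                by_entity[ent].add(pid)
        graph = defaultdict(set)
        for ids in by_entity.values():
            for pid in ids:
                graph[pid] |= ids - {pid}
        return graph

    def multi_hop_retrieve(query, passages, graph, score, hops=2, k=3):
        # Seed with the top-k scoring passages, then expand along graph edges.
        frontier = sorted(passages, key=lambda p: score(query, passages[p]), reverse=True)[:k]
        selected = set(frontier)
        for _ in range(hops - 1):
            neighbors = {n for p in frontier for n in graph[p]} - selected
            frontier = sorted(neighbors, key=lambda p: score(query, passages[p]), reverse=True)[:k]
            selected |= set(frontier)
        return [passages[p] for p in selected]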
@vasud3vshyam
Vasu Shyam
1 year
@ylecun Thanks for sharing! Another little trick that might amuse you is that we identified a function which, upon minimization, produces the forward pass of the attention block:
0
2
25
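One standard way to make this concrete (a sketch consistent with the construction the thread describes, not necessarily the paper's exact notation) is to attach a source term ζ to a logsumexp energy; the gradient at ζ = 0 recovers the attention output:

    F(\zeta) = \log \sum_{a=1}^{N} \exp\left( q \cdot k_a + \zeta \cdot v_a \right),
    \qquad
    \nabla_{\zeta} F \big|_{\zeta = 0}
      = \sum_{a=1}^{N} \frac{\exp(q \cdot k_a)}{\sum_b \exp(q \cdot k_b)}\, v_a
      = \mathrm{softmax}(q K^{\top})\, V .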
@J_Pilault
Jonathan Pilault
1 year
Thank you to my wonderful teammates @vasud3vshyam, @nshepperd1, @BerenMillidge, @QuentinAnthon15
0
0
8
@J_Pilault
Jonathan Pilault
1 year
By using the two-level interconnect topology of GPU clusters, Tree Attention achieves asymptotically faster decoding as we scale output sequence length and the number of GPUs in a cluster, along with lower peak-memory requirements:
1
0
7
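A schematic sketch of what "two-level" means here, assuming a generic associative combine function (hypothetical names, not the released implementation): partial attention results are first reduced within each node over the fast intra-node links, and only one partial per node then crosses the slower inter-node fabric.

    def tree_reduce(parts, combine):
        # Pairwise associative reduction: O(log n) steps instead of O(n).
        while len(parts) > 1:
            parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                     for i in range(0, len(parts), 2)]
        return parts[0]

    def two_level_reduce(per_gpu_partials, gpus_per_node, combine):
        # Level 1: reduce within each node (fast NVLink-class links).
        nodes = [per_gpu_partials[i:i + gpus_per_node]
                 for i in range(0, len(per_gpu_partials), gpus_per_node)]
        node_partials = [tree_reduce(node, combine) for node in nodes]
        # Level 2: reduce one partial per node across the slower interconnect.
        return tree_reduce(node_partials, combine)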
@J_Pilault
Jonathan Pilault
1 year
Unlike Ring Attention's P2P communication, which scales with sequence length, Tree Attention uses an Allreduce that:
• does not scale communication volume with sequence length
• reduces internode communication requirements
• allows better overlap with single-device attention computation
1
0
8
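As a rough back-of-the-envelope reading of that claim (an illustration, not figures from the paper): with p devices each holding a shard of the sequence, a ring must circulate key/value shards through p − 1 sequential P2P steps, while a tree Allreduce merges fixed-size partials (running max, rescaled denominator, partial numerator) in logarithmically many steps:

    T_{\text{ring}} = \Theta(p) \ \text{sequential steps, message size} \propto \text{shard size},
    \qquad
    T_{\text{tree}} = \Theta(\log p) \ \text{steps, message size independent of sequence length}.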
@J_Pilault
Jonathan Pilault
1 year
Tree Attention was derived from the scalar energy-function interpretation of self-attention, which reveals that a tree reduction can be performed across the sequence axis thanks to the associativity of the logsumexp and max operations.
1
0
17
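A minimal numerically stable sketch of why the tree reduction is valid: represent a partial logsumexp as a (running max, rescaled sum) pair; the combine below is associative, so partials can be merged in any tree order and still give the exact answer.

    import math

    def combine(a, b):
        # Associatively merge two (max, sum-of-exponentials) partials.
        (m1, s1), (m2, s2) = a, b
        m = max(m1, m2)
        return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

    def logsumexp_tree(xs):
        # Each element starts as the partial (x, 1.0); reduce pairwise.
        parts = [(x, 1.0) for x in xs]
        while len(parts) > 1:
            parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                     for i in range(0, len(parts), 2)]
        m, s = parts[0]
        return m + math.log(s)

    xs = [0.3, -1.2, 2.5, 0.9]
    assert abs(logsumexp_tree(xs) - math.log(sum(math.exp(x) for x in xs))) < 1e-12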
@J_Pilault
Jonathan Pilault
1 year
Zyphra is proud to release Tree Attention, a fast inference method for extremely long sequence lengths:
• 8x faster inference speed vs. Ring Attention
• 2x less peak memory
• low data-communication volume
Paper: https://t.co/yf5VNRze6W
Code: https://t.co/Th6Fg8eFEr
A 🧵
1
31
152
@QuentinAnthon15
Quentin Anthony
1 year
Zyphra is ecstatic to release Zamba2-small:
- 2.7B Mamba2/Attention hybrid
- Pre-trained on 3T tokens + annealed on 100B high-quality tokens
- Model released on HuggingFace and standalone PyTorch
- SOTA evaluation performance and superior inference efficiency
4
45
203
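A minimal usage sketch, assuming the Zyphra/Zamba2-2.7B checkpoint id on HuggingFace and a transformers version with Zamba2 support (check the model card for exact requirements):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Zyphra/Zamba2-2.7B"  # assumed checkpoint id; see the model card
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "A hybrid of Mamba2 blocks and attention lets small models"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))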
@utkuevci
utku
3 years
Hyped to share JaxPruner: a concise library for sparsity research. JaxPruner includes 10+ easy-to-modify baseline algorithms and provides integration with popular libraries like t5x, scenic, dopamine and fedjax. 1/7 Code: https://t.co/tPwCL03xnE Paper: https://t.co/eedLJj5EVW
1
31
148
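For intuition, here is what the simplest such baseline, one-shot magnitude pruning, does (a generic plain-numpy illustration, not JaxPruner's actual API): zero out the smallest-magnitude weights until a target sparsity is reached.

    import numpy as np

    def magnitude_prune(weights, sparsity):
        # Zero the `sparsity` fraction of entries with the smallest magnitudes.
        flat = np.abs(weights).ravel()
        k = int(sparsity * flat.size)
        if k == 0:
            return weights.copy()
        threshold = np.partition(flat, k - 1)[k - 1]
        return np.where(np.abs(weights) > threshold, weights, 0.0)

    w = np.random.randn(4, 4)
    w_sparse = magnitude_prune(w, 0.75)
    assert (w_sparse == 0).mean() >= 0.75  # at least 75% of entries are now zero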
@QuentinAnthon15
Quentin Anthony
2 years
Zyphra is pleased to announce Zamba-7B:
- 7B Mamba/Attention hybrid
- Competitive with Mistral-7B and Gemma-7B on only 1T fully open training tokens
- Outperforms Llama-2 7B and OLMo-7B
- All checkpoints across training to be released (Apache 2.0)
- Achieved by 7 people, on 128
23
81
427
@RGoroshin
Ross Goroshin
2 years
Last week, I gave a talk at @Mila_Quebec. The talk should be of interest to anyone working on predictive models, particularly in latent space. In collab. with @MahanFathi @ClementGehring @J_Pilault @davidkanaa @pierrelux. See you at @iclr_conf in 🇦🇹! https://t.co/vFBtHDzNju
0
5
18
@MahanFathi
Mahan Fathi
2 years
Course Correcting Koopman Representations Accepted at #ICLR2024! We identify problems with unrolling in imagination and propose an unconventional, simple, yet effective solution: periodically "𝒓𝒆𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈" the latent. 📄 https://t.co/ULNzqAV3bB @GoogleDeepMind 1/🧵
4
19
93
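A heavily simplified sketch of the idea, with hypothetical encode / decode / latent_step functions standing in for the paper's learned maps: instead of unrolling purely in latent space, the latent is periodically decoded to observation space and re-encoded, pulling the trajectory back onto the encoder's manifold.

    def rollout_with_reencoding(x0, steps, period, encode, decode, latent_step):
        # Unroll latent dynamics, re-encoding the latent every `period` steps.
        z = encode(x0)
        trajectory = []
        for t in range(1, steps + 1):
            z = latent_step(z)
            if t % period == 0:
                z = encode(decode(z))  # the periodic "reencoding" correction
            trajectory.append(decode(z))
        return trajectory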
@DavidSKrueger
David Krueger
2 years
My research group @kasl_ai is looking for interns! Applications are due in 2 weeks ***January 29***. The long-awaited form: https://t.co/hLOjuxSfnK Please share widely!!
6
74
276
@RichardSocher
Richard Socher
2 years
@notnotrishi I like the SSM/hyena/Block State Transformers https://t.co/HrQIWgtTIj https://t.co/mveReauq1S They remind me of Q-RNNs https://t.co/mwRsydj5dA and play around with different parallelization ideas. I don't think transformers are that special and there are many equivalent
1
3
26
@MahanFathi
Mahan Fathi
2 years
Why not get the best of both worlds by combining SSMs and Transformers? Excited to share our work at #NeurIPS2023: "Block-State Transformers." BST hits new highs in long-range language modeling and LRA tasks. paper: https://t.co/nHt6OGyez1 1/
8
65
378
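A heavily simplified schematic of the hybrid (hypothetical ssm_scan and block_attention callables; not the paper's exact architecture): an SSM scans the full sequence once to carry long-range information, and attention then operates only within fixed-size blocks, conditioned on the SSM's context states.

    def block_state_layer(tokens, block_size, ssm_scan, block_attention):
        # One SSM pass over the whole sequence yields a context state per position.
        context = ssm_scan(tokens)
        out = []
        for start in range(0, len(tokens), block_size):
            block = tokens[start:start + block_size]
            ctx = context[start:start + block_size]
            # Attention cost stays local to the block; long-range info flows via ctx.
            out.extend(block_attention(block, ctx))
        return out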
@J_Pilault
Jonathan Pilault
13 years
Tips for non-technical entrepreneurs http://t.co/9lmKKTLx
0
0
2
@J_Pilault
Jonathan Pilault
13 years
It's 2012, yet Canadian eCommerce is stuck in the '90s: http://t.co/OXpWKT6n
0
0
1
@J_Pilault
Jonathan Pilault
13 years
#Montreal should be the #innovation and #start-up gatekeeper between Europe and the US. Wouldn't entering each other's markets be easier?
0
0
0
@J_Pilault
Jonathan Pilault
13 years
33% of online shoppers in Canada abandon their cart before checkout due to high shipping costs (eMarketer). Free shipping is a must in Canada.
0
0
1