Jonathan Pilault

@J_Pilault

Followers: 327
Following: 11
Media: 4
Statuses: 106

• ML Research Scientist at Silicon Valley startup @ZyphraAI • Former researcher @GoogleDeepMind @nvidia • PhD @Mila_Quebec

Montreal
Joined February 2010
@J_Pilault
Jonathan Pilault
1 year
I am extremely proud of what the team at @ZyphraAI has achieved. Let's keep pushing the boundaries!
@QuentinAnthon15
Quentin Anthony
1 year
For a long time, our training goal at @ZyphraAI had been simply to match dense transformers, but with faster inference and lower training cost. Today we also surpass them with Zamba2-7B.
0
0
6
@Nick__Alonso
Nick Alonso
1 year
1) RAG often struggles on complex multi-hop queries. In this blog post, we at @ZyphraAI discuss and build a graph-based RAG system that tops the leaderboard on a QA benchmark with multi-hop queries and outperforms frontier long-context models at 60x lower cost. https://t.co/QDXUdiWzh5
1
4
12
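A minimal sketch of the graph-based multi-hop retrieval pattern described above, assuming hypothetical entities_of and score helpers (an illustration of the general idea, not Zyphra's actual system): passages sharing an entity are linked, and retrieval hops along those links before answering.

    from collections import defaultdict

    def build_entity_graph(passages, entities_of):
        # Link passage ids that mention at least one common entity.
        by_entity = defaultdict(set)
        for pid, text in passages.items():
            for ent in entities_of(text):
                by_entity[ent].add(pid)
        graph = defaultdict(set)
        for ids in by_entity.values():
            for pid in ids:
                graph[pid] |= ids - {pid}
        return graph

    def multi_hop_retrieve(query, passages, graph, score, hops=2, k=3):
        # Seed with the top-k scoring passages, then expand along graph edges.
        frontier = sorted(passages, key=lambda p: score(query, passages[p]), reverse=True)[:k]
        selected = set(frontier)
        for _ in range(hops - 1):
            neighbors = {n for p in frontier for n in graph[p]} - selected
            frontier = sorted(neighbors, key=lambda p: score(query, passages[p]), reverse=True)[:k]
            selected |= set(frontier)
        return [passages[p] for p in selected]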
@vasud3vshyam
Vasu Shyam
1 year
@ylecun Thanks for sharing! Another little trick that might amuse you is that we identified a function which, upon minimization, produces the forward pass of the attention block:
0
2
25
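One standard way to make this concrete (a sketch consistent with the construction the thread describes, not necessarily the paper's exact notation) is to attach a source term ζ to a logsumexp energy; the gradient at ζ = 0 recovers the attention output:

    F(\zeta) = \log \sum_{a=1}^{N} \exp\left( q \cdot k_a + \zeta \cdot v_a \right),
    \qquad
    \nabla_{\zeta} F \big|_{\zeta = 0}
      = \sum_{a=1}^{N} \frac{\exp(q \cdot k_a)}{\sum_b \exp(q \cdot k_b)}\, v_a
      = \mathrm{softmax}(q K^{\top})\, V .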
@J_Pilault
Jonathan Pilault
1 year
Thank you to my wonderful teammates @vasud3vshyam, @nshepperd1, @BerenMillidge, @QuentinAnthon15
0
0
8
@J_Pilault
Jonathan Pilault
1 year
By using the two-level interconnect topology of GPU clusters, Tree Attention achieves asymptotically faster decoding as we scale output sequence length and the number of GPUs in a cluster, along with lower peak-memory requirements:
1
0
7
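A schematic sketch of what "two-level" means here, assuming a generic associative combine function (hypothetical names, not the released implementation): partial attention results are first reduced within each node over the fast intra-node links, and only one partial per node then crosses the slower inter-node fabric.

    def tree_reduce(parts, combine):
        # Pairwise associative reduction: O(log n) steps instead of O(n).
        while len(parts) > 1:
            parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                     for i in range(0, len(parts), 2)]
        return parts[0]

    def two_level_reduce(per_gpu_partials, gpus_per_node, combine):
        # Level 1: reduce within each node (fast NVLink-class links).
        nodes = [per_gpu_partials[i:i + gpus_per_node]
                 for i in range(0, len(per_gpu_partials), gpus_per_node)]
        node_partials = [tree_reduce(node, combine) for node in nodes]
        # Level 2: reduce one partial per node across the slower interconnect.
        return tree_reduce(node_partials, combine)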
@J_Pilault
Jonathan Pilault
1 year
Unlike Ring Attention's P2P communication, which scales with sequence length, Tree Attention uses an Allreduce that:
• does not scale communication volume with sequence length
• reduces internode communication requirements
• allows better overlap with single-device attention computation
1
0
8
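As a rough back-of-the-envelope reading of that claim (an illustration, not figures from the paper): with p devices each holding a shard of the sequence, a ring must circulate key/value shards through p − 1 sequential P2P steps, while a tree Allreduce merges fixed-size partials (running max, rescaled denominator, partial numerator) in logarithmically many steps:

    T_{\text{ring}} = \Theta(p) \ \text{sequential steps, message size} \propto \text{shard size},
    \qquad
    T_{\text{tree}} = \Theta(\log p) \ \text{steps, message size independent of sequence length}.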
@J_Pilault
Jonathan Pilault
1 year
Tree Attention was derived from the scalar energy-function interpretation of self-attention, which reveals that a tree reduction can be performed across the sequence axis thanks to the associativity of the logsumexp and max operations.
1
0
17
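A minimal numerically stable sketch of why the tree reduction is valid: represent a partial logsumexp as a (running max, rescaled sum) pair; the combine below is associative, so partials can be merged in any tree order and still give the exact answer.

    import math

    def combine(a, b):
        # Associatively merge two (max, sum-of-exponentials) partials.
        (m1, s1), (m2, s2) = a, b
        m = max(m1, m2)
        return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

    def logsumexp_tree(xs):
        # Each element starts as the partial (x, 1.0); reduce pairwise.
        parts = [(x, 1.0) for x in xs]
        while len(parts) > 1:
            parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                     for i in range(0, len(parts), 2)]
        m, s = parts[0]
        return m + math.log(s)

    xs = [0.3, -1.2, 2.5, 0.9]
    assert abs(logsumexp_tree(xs) - math.log(sum(math.exp(x) for x in xs))) < 1e-12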
@J_Pilault
Jonathan Pilault
1 year
Zyphra is proud to release Tree Attention, a fast inference method for extremely long sequence lengths:
• 8x faster inference speed vs. Ring Attention
• 2x less peak memory
• low data-communication volume
Paper: https://t.co/yf5VNRze6W
Code: https://t.co/Th6Fg8eFEr
A 🧵
1
31
152
@QuentinAnthon15
Quentin Anthony
1 year
Zyphra is ecstatic to release Zamba2-small:
- 2.7B Mamba2/Attention hybrid
- Pre-trained on 3T tokens + annealed on 100B high-quality tokens
- Model released on HuggingFace and standalone PyTorch
- SOTA evaluation performance and superior inference efficiency
4
45
203
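A minimal usage sketch, assuming the Zyphra/Zamba2-2.7B checkpoint id on HuggingFace and a transformers version with Zamba2 support (check the model card for exact requirements):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Zyphra/Zamba2-2.7B"  # assumed checkpoint id; see the model card
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "A hybrid of Mamba2 blocks and attention lets small models"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))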
@utkuevci
utku
3 years
Hyped to share JaxPruner: a concise library for sparsity research. JaxPruner includes 10+ easy-to-modify baseline algorithms and provides integration with popular libraries like t5x, scenic, dopamine and fedjax. 1/7 Code: https://t.co/tPwCL03xnE Paper: https://t.co/eedLJj5EVW
1
31
148
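For intuition, here is what the simplest such baseline, one-shot magnitude pruning, does (a generic plain-numpy illustration, not JaxPruner's actual API): zero out the smallest-magnitude weights until a target sparsity is reached.

    import numpy as np

    def magnitude_prune(weights, sparsity):
        # Zero the `sparsity` fraction of entries with the smallest magnitudes.
        flat = np.abs(weights).ravel()
        k = int(sparsity * flat.size)
        if k == 0:
            return weights.copy()
        threshold = np.partition(flat, k - 1)[k - 1]
        return np.where(np.abs(weights) > threshold, weights, 0.0)

    w = np.random.randn(4, 4)
    w_sparse = magnitude_prune(w, 0.75)
    assert (w_sparse == 0).mean() >= 0.75  # at least 75% of entries are now zero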
@QuentinAnthon15
Quentin Anthony
2 years
Zyphra is pleased to announce Zamba-7B:
- 7B Mamba/Attention hybrid
- Competitive with Mistral-7B and Gemma-7B on only 1T fully open training tokens
- Outperforms Llama-2 7B and OLMo-7B
- All checkpoints across training to be released (Apache 2.0)
- Achieved by 7 people, on 128
23
81
427
@RGoroshin
Ross Goroshin
2 years
Last week, I gave a talk at @Mila_Quebec. The talk should be of interest to anyone working on predictive models, particularly in latent space. In collab. with @MahanFathi @ClementGehring @J_Pilault @davidkanaa @pierrelux. See you at @iclr_conf in 🇦🇹! https://t.co/vFBtHDzNju
0
5
18
@MahanFathi
Mahan Fathi
2 years
Course Correcting Koopman Representations Accepted at #ICLR2024! We identify problems with unrolling in imagination and propose an unconventional, simple, yet effective solution: periodically "𝒓𝒆𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈" the latent. 📄 https://t.co/ULNzqAV3bB @GoogleDeepMind 1/🧵
4
19
93
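A heavily simplified sketch of the idea, with hypothetical encode / decode / latent_step functions standing in for the paper's learned maps: instead of unrolling purely in latent space, the latent is periodically decoded to observation space and re-encoded, pulling the trajectory back onto the encoder's manifold.

    def rollout_with_reencoding(x0, steps, period, encode, decode, latent_step):
        # Unroll latent dynamics, re-encoding the latent every `period` steps.
        z = encode(x0)
        trajectory = []
        for t in range(1, steps + 1):
            z = latent_step(z)
            if t % period == 0:
                z = encode(decode(z))  # the periodic "reencoding" correction
            trajectory.append(decode(z))
        return trajectory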
@DavidSKrueger
David Krueger
2 years
My research group @kasl_ai is looking for interns! Applications are due in 2 weeks ***January 29***. The long-awaited form: https://t.co/hLOjuxSfnK Please share widely!!
6
74
276
@RichardSocher
Richard Socher
2 years
@notnotrishi I like the SSM/hyena/Block State Transformers https://t.co/HrQIWgtTIj https://t.co/mveReauq1S They remind me of Q-RNNs https://t.co/mwRsydj5dA and play around with different parallelization ideas. I don't think transformers are that special and there are many equivalent
1
3
26
@MahanFathi
Mahan Fathi
2 years
Why not get the best of both worlds by combining SSMs and Transformers? Excited to share our work at #NeurIPS2023: "Block-State Transformers." BST hits new highs in long-range language modeling and LRA tasks. paper: https://t.co/nHt6OGyez1 1/
8
65
378
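A heavily simplified schematic of the hybrid (hypothetical ssm_scan and block_attention callables; not the paper's exact architecture): an SSM scans the full sequence once to carry long-range information, and attention then operates only within fixed-size blocks, conditioned on the SSM's context states.

    def block_state_layer(tokens, block_size, ssm_scan, block_attention):
        # One SSM pass over the whole sequence yields a context state per position.
        context = ssm_scan(tokens)
        out = []
        for start in range(0, len(tokens), block_size):
            block = tokens[start:start + block_size]
            ctx = context[start:start + block_size]
            # Attention cost stays local to the block; long-range info flows via ctx.
            out.extend(block_attention(block, ctx))
        return out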
@J_Pilault
Jonathan Pilault
13 years
Tips for non-technical entrepreneurs http://t.co/9lmKKTLx
0
0
2
@J_Pilault
Jonathan Pilault
13 years
It's 2012, yet Canadian eCommerce is stuck in the '90s: http://t.co/OXpWKT6n
0
0
1
@J_Pilault
Jonathan Pilault
13 years
#Montreal should be the #innovation and #start-up gatekeeper between Europe and the US. Wouldn't entering each other's markets be easier?
0
0
0
@J_Pilault
Jonathan Pilault
13 years
33% of online shoppers in Canada abandon their cart before checkout due to high shipping costs (eMarketer). Free shipping is a must in Canada.
0
0
1