Phillip Rust
@rust_phillip
Followers: 393 · Following: 872 · Media: 7 · Statuses: 50
Research Scientist @AIatMeta (FAIR) • PhD @coastalcph
Paris, France
Joined July 2020
Happy to share that our paper on language modelling with pixels has been accepted to ICLR'23 (notable-top-5% / oral) 🎉. Big thanks and congrats to Team-PIXEL @jonasflotz @ebugliarello @esalesk @mdlhx @delliott and looking forward to presenting in Kigali! 🌍 #ICLR2023
Tired of tokenizers/subwords? Check out PIXEL, a new language model that processes written text as images 📸 “Language Modelling with Pixels” 📄 https://t.co/pmp7Yvhx9W 🧑‍💻 https://t.co/RbMemZOpub 🤖 https://t.co/J80eju62eB by @rust_phillip @jonasflotz me @esalesk @mdlhx @delliott
Replies: 9 · Reposts: 34 · Likes: 231
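As a quick aside for anyone wondering what "processing written text as images" looks like in practice, here is a minimal sketch: render the string onto a small canvas, cut the strip into 16×16 patches, and hold out a random subset as reconstruction targets, ViT-MAE style. The canvas size, default PIL font, and 25% masking ratio are illustrative assumptions, not PIXEL's actual rendering setup.

```python
# Illustrative only: render a string to pixels, cut it into 16x16 patches, and
# mask a random subset for reconstruction (ViT-MAE style). Canvas size, font,
# and masking ratio are assumptions, not PIXEL's actual configuration.
import numpy as np
from PIL import Image, ImageDraw

def render_text(text, height=16, width=512):
    """Render a string onto a fixed-size grayscale canvas with PIL's default font."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((2, 2), text, fill=0)
    return np.asarray(img, dtype=np.float32) / 255.0

def to_patches(pixels, patch=16):
    """Split the rendered strip into non-overlapping patch x patch squares."""
    h, w = pixels.shape
    rows, cols = h // patch, w // patch
    pixels = pixels[: rows * patch, : cols * patch]
    return (pixels.reshape(rows, patch, cols, patch)
                  .transpose(2, 0, 1, 3)
                  .reshape(rows * cols, -1))

rng = np.random.default_rng(0)
patches = to_patches(render_text("Tired of tokenizers? Model the pixels."))
mask = rng.random(len(patches)) < 0.25   # hide ~25% of patches
visible, targets = patches[~mask], patches[mask]
print(patches.shape, visible.shape, targets.shape)
```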
Tough week! I also got impacted less than 3 months after joining. Ironically, I just landed some new RL infra features the day before. Life moves on. My past work spans RL, PEFT, Quantization, and Multimodal LLMs. If your team is working on these areas, I’d love to connect.
Meta has gone crazy on the Squid Game! Many new PhD NGs were deactivated today (I am also impacted 🥲, happy to chat)
Replies: 42 · Reposts: 63 · Likes: 504
Humans see text — but LLMs don’t. I wrote a short blog post exploring how models can perceive text visually rather than tokenize it: 🔗 https://t.co/lk5Usj1WqT From PIXEL, CLIPPO, VisInContext, VIST to DeepSeek-OCR, this is a quick story of how vision-centric modeling is…
csu-jpg.github.io
People read visually, not symbolically. Visual tokens and vision-centric MLLMs point to the next paradigm.
Replies: 8 · Reposts: 41 · Likes: 220
I will be presenting this work in person at ACL 🇹🇭 this week. Drop by if you'd like to chat!
Oral: Today (Monday) 16:30
Poster: Tuesday (Tomorrow) 10:30 - 12:00
Introducing “Towards Privacy-Aware Sign Language Translation at Scale” We leverage self-supervised pretraining on anonymized videos, achieving SOTA ASL-to-English translation performance while mitigating risks arising from biometric data. 📄: https://t.co/hMY6eFo46D 🧵(1/9)
Replies: 0 · Reposts: 1 · Likes: 21
This project is a collaboration with my amazing peers and mentors during my internship @AIatMeta: Bowen Shi, @skylrwang, @ncihancamgoz, @j_maillard. ⭐ 🧵(9/9)
Replies: 0 · Reposts: 0 · Likes: 5
For more experiments and all the details, check out our arXiv preprint linked above. We are working on releasing our code and data, so stay tuned! 👨‍💻 🧵(8/9)
Replies: 1 · Reposts: 0 · Likes: 2
We also highlight the importance of pretraining on longer video clips to learn long-range spatio-temporal dependencies 🎬➡️🧠. Even when controlling for the number of video tokens seen, we observe a large boost in performance by scaling from 16 to 128 frames 🚀. 🧵(7/9)
Replies: 1 · Reposts: 0 · Likes: 2
Face blurring incurs a loss of linguistic information in sign languages, leading to performance degradation. We show that such information, when lost during anonymized pretraining, can largely be recovered during finetuning. An effective privacy-performance trade-off ⚖️! 🧵(6/9)
Replies: 1 · Reposts: 0 · Likes: 2
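For readers curious what the anonymization step can look like per frame, here is a hedged sketch using an off-the-shelf OpenCV face detector and Gaussian blur. The detector and blur parameters are placeholder choices, not the anonymization pipeline used in the paper.

```python
# Illustrative only: per-frame face blurring of the kind anonymized pretraining
# data might use. The Haar-cascade detector and Gaussian kernel are placeholder
# choices, not the paper's anonymization pipeline.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def blur_faces(frame_bgr, ksize=(51, 51)):
    """Return a copy of the frame with every detected face region blurred."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    out = frame_bgr.copy()
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
        out[y:y + h, x:x + w] = cv2.GaussianBlur(out[y:y + h, x:x + w], ksize, 0)
    return out
```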
Our best models outperform the prior SOTA for ASL-to-English translation performance on How2Sign by over 3 BLEU in both the finetuned and zero-shot settings 🥇. 🧵(5/9)
Replies: 1 · Reposts: 0 · Likes: 2
🌐 Optionally, an intermediate language-supervised pretraining (LSP) objective can help bridge the modality gap between sign language video inputs and text outputs. 🧵(4/9)
Replies: 1 · Reposts: 0 · Likes: 2
Our method, SSVP-SLT, consists of: 🎥 Self-supervised video pretraining (SSVP) on anonymized, unannotated videos to learn high-quality continuous sign language representations. 🎯 Supervised finetuning on a curated SLT dataset to learn translation-specific information. 🧵(3/9)
Replies: 1 · Reposts: 0 · Likes: 2
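To make the two stages concrete, a toy sketch of the two training signals follows: a masked-reconstruction loss on unlabeled, anonymized clips (stage 1) and a supervised translation loss on curated video-text pairs (stage 2). Every module size, tensor, and objective here is an illustrative assumption, not the SSVP-SLT architecture or released code.

```python
# Toy sketch of the two training stages; module sizes, data, and objectives are
# illustrative assumptions, not the SSVP-SLT architecture or released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_DIM, HID, VOCAB = 3 * 16 * 16, 256, 1000
encoder = nn.Sequential(nn.Linear(PATCH_DIM, HID), nn.ReLU(), nn.Linear(HID, HID))
recon_head = nn.Linear(HID, PATCH_DIM)   # stage 1: reconstruct masked video patches
text_head = nn.Linear(HID, VOCAB)        # stage 2: (toy) per-position text logits

# Stage 1 (SSVP): self-supervised pretraining on anonymized, unannotated clips.
clips = torch.randn(8, 64, PATCH_DIM)            # (clips, video patches, pixels)
mask = torch.rand(8, 64) < 0.75                  # hide most patches, MAE-style
loss_ssvp = F.mse_loss(recon_head(encoder(clips))[mask], clips[mask])

# Stage 2 (SLT): supervised finetuning on curated (video, English text) pairs.
targets = torch.randint(0, VOCAB, (8, 64))       # toy aligned reference tokens
logits = text_head(encoder(clips))
loss_slt = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
print(float(loss_ssvp), float(loss_slt))
```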
Training data scarcity and privacy risks are huge issues in sign language translation (SLT). Our approach is designed to be 🚀 scalable (by enabling training on unlabeled data) 🎭 privacy-aware (through anonymization) 🧵(2/9)
Replies: 1 · Reposts: 0 · Likes: 2
Introducing “Towards Privacy-Aware Sign Language Translation at Scale” We leverage self-supervised pretraining on anonymized videos, achieving SOTA ASL-to-English translation performance while mitigating risks arising from biometric data. 📄: https://t.co/hMY6eFo46D 🧵(1/9)
Replies: 1 · Reposts: 7 · Likes: 20
New preprint "Improving Language Understanding from Screenshots" w/ @zwcolin @AdithyaNLP @danqi_chen. We improve language understanding abilities of screenshot LMs, an emerging family of models that processes everything (including text) via visual inputs https://t.co/Qr9h8EHjUv
Replies: 6 · Reposts: 45 · Likes: 186
In PHD: Pixel-Based Language Modeling of Historical Documents with @NadavBorenstein @rust_phillip and @IAugenstein, we apply pixel language models to processing historical documents as well as to more standard NLP classification tasks. See it in Poster Session 6 on Sunday 10th.
Replies: 1 · Reposts: 5 · Likes: 21
In Text Rendering Strategies for Pixel Language Models with @jonasflotz @rust_phillip and @esalesk, we design new text renderers for visual language processing to improve performance or to squeeze the model down to just 22M parameters. See it in Poster Session 2 on Friday 8th.
Replies: 1 · Reposts: 4 · Likes: 15
Introducing SeamlessM4T, the first all-in-one, multilingual multimodal translation model. This single model can perform tasks across speech-to-text, speech-to-speech, text-to-text translation & speech recognition for up to 100 languages depending on the task. Details ⬇️
Replies: 54 · Reposts: 428 · Likes: 2K
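For anyone who wants to try it, SeamlessM4T is usable through the Hugging Face transformers integration; a text-to-text sketch is below. The checkpoint name and call signatures follow the documented API at the time of writing, so double-check against the current docs.

```python
# Hedged usage sketch: text-to-text translation with SeamlessM4T via the
# Hugging Face transformers integration. Checkpoint name and call signatures
# follow the documented API; verify against current docs before relying on them.
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))
```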
📢 I am hiring a postdoc to join our project on pixel-based natural language processing. The position is based in Copenhagen 🇩🇰 for 18 months. Applications are due by March 29 https://t.co/ZvQtCoWXgH. Informal inquiries are welcome.
Thrilled to receive a grant from @VILLUMFONDEN to carry out blue-skies research on tokenization-free NLP https://t.co/yBRt2L3KgE I will hire PhDs and postdocs to build up the group, so feel free to reach out. We're starting off with a paper at #ICLR2023
https://t.co/xwt7tpI2n6
Replies: 0 · Reposts: 20 · Likes: 32
Thrilled to receive a grant from @VILLUMFONDEN to carry out blue-skies research on tokenization-free NLP https://t.co/yBRt2L3KgE I will hire PhDs and postdocs to build up the group, so feel free to reach out. We're starting off with a paper at #ICLR2023
https://t.co/xwt7tpI2n6
Replies: 9 · Reposts: 21 · Likes: 87