@rust_phillip
Phillip Rust
2 years
Introducing “Towards Privacy-Aware Sign Language Translation at Scale”! We leverage self-supervised pretraining on anonymized videos, achieving SOTA ASL-to-English translation performance while mitigating the risks arising from biometric data. 📄: https://t.co/hMY6eFo46D 🧵(1/9)

Training data scarcity and privacy risks are huge issues in sign language translation (SLT). Our approach is designed to be 🚀 scalable (by enabling training on unlabeled data) 🎭 privacy-aware (through anonymization) 🧵(2/9)
Our method, SSVP-SLT, consists of: 🎥 Self-supervised video pretraining (SSVP) on anonymized, unannotated videos to learn high-quality continuous sign language representations. 🎯 Supervised finetuning on a curated SLT dataset to learn translation-specific information. 🧵(3/9)
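As a rough illustration of the two-stage recipe, here is a toy, runnable sketch. All names and the trivial "models" are invented stand-ins for illustration, not the paper's implementation (which uses a video transformer encoder and a text decoder):

```python
# Toy sketch of the two-stage SSVP-SLT recipe (names and "models" invented).

def pretrain_ssvp(unlabeled_clips):
    """Stage 1: self-supervised pretraining on anonymized, unannotated video.
    Stand-in: 'learn' a global feature mean to reuse as a normalizer,
    mimicking how pretraining yields reusable sign language representations."""
    n = sum(len(clip) for clip in unlabeled_clips)
    mean = sum(sum(clip) for clip in unlabeled_clips) / n
    return {"feature_mean": mean}  # pretend these are encoder weights

def finetune_slt(encoder, labeled_pairs):
    """Stage 2: supervised finetuning on (video, translation) pairs.
    Stand-in: memorize normalized clip signatures -> English sentences."""
    table = {}
    for clip, text in labeled_pairs:
        signature = round(sum(clip) / len(clip) - encoder["feature_mean"], 3)
        table[signature] = text
    return table

# Usage: pretrain on many unlabeled clips, then finetune on a few pairs.
encoder = pretrain_ssvp([[0.1, 0.2, 0.3], [0.4, 0.5], [0.0, 0.6]])
model = finetune_slt(encoder, [([0.2, 0.4], "hello"), ([0.9, 0.7], "thanks")])
```

The point of the split is that stage 1 needs no annotations (so it scales), while stage 2 only has to learn the translation-specific mapping on top of the pretrained representations.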
🌐 Optionally, an intermediate language-supervised pretraining (LSP) objective can help bridge the modality gap between sign language video inputs and text outputs. 🧵(4/9)
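One common way to implement language-supervised pretraining is a CLIP-style contrastive objective that pulls matching video/text embedding pairs together and pushes mismatched pairs apart. Whether this matches the paper's exact LSP loss isn't shown here, so treat this NumPy sketch as illustrative only:

```python
import numpy as np

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style contrastive loss aligning video and text embeddings
    (illustrative; the paper's exact LSP objective may differ)."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) similarities
    labels = np.arange(len(v))              # matching pairs on the diagonal

    def xent(l):
        # numerically stable cross-entropy toward the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetric: video->text (rows) and text->video (columns)
    return (xent(logits) + xent(logits.T)) / 2
```

Minimizing such a loss gives the video encoder a text-aligned embedding space, which is one way to bridge the video-to-text modality gap before finetuning.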
Our best models outperform the prior SOTA for ASL-to-English translation on How2Sign by over 3 BLEU in both the finetuned and zero-shot settings 🥇. 🧵(5/9)
Face blurring incurs a loss of linguistic information in sign languages, leading to performance degradation. We show that such information, when lost during anonymized pretraining, can largely be recovered during finetuning. An effective privacy-performance trade-off ⚖️! 🧵(6/9)
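As one concrete (and deliberately crude) picture of what blurring-based anonymization does to a frame, here is a NumPy pixelation of a face bounding box. The function name, box format, and block size are assumptions for illustration, not the paper's actual blurring pipeline:

```python
import numpy as np

def blur_region(frame, box, block=8):
    """Crude anonymization: replace each block x block patch inside the
    face bounding box with its mean (a box pixelation). Illustrative only;
    the paper's face-blurring pipeline may differ.
    frame: (H, W) grayscale array; box: (top, left, bottom, right)."""
    out = frame.astype(float).copy()
    t, l, b, r = box
    for y in range(t, b, block):
        for x in range(l, r, block):
            y2, x2 = min(y + block, b), min(x + block, r)
            out[y:y2, x:x2] = out[y:y2, x:x2].mean()
    return out
```

Everything outside the box (hands, body pose) is untouched, which is why much of the signal survives; the facial detail destroyed inside the box is exactly the non-manual linguistic information the tweet refers to.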
We also highlight the importance of pretraining on longer video clips to learn long-range spatio-temporal dependencies 🎬➡️🧠. Even when controlling for the number of video tokens seen, we observe a large boost in performance by scaling from 16 to 128 frames 🚀. 🧵(7/9)
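The "controlling for the number of video tokens seen" comparison can be made concrete with back-of-the-envelope arithmetic, assuming typical VideoMAE-style tubelet and patch sizes (the paper's exact configuration may differ):

```python
# Token accounting for "controlling for tokens seen" (assumed sizes:
# temporal tubelet of 2 frames, 14 x 14 spatial patch grid).

def tokens_per_clip(frames, tubelet=2, grid=14 * 14):
    """Video tokens = (frames / temporal tubelet) x spatial patches."""
    return (frames // tubelet) * grid

short = tokens_per_clip(16)    # 8 * 196 = 1568 tokens per 16-frame clip
long = tokens_per_clip(128)    # 64 * 196 = 12544 tokens per 128-frame clip

# To keep total tokens fixed, a 128-frame run sees 8x fewer clips:
clips_ratio = long // short    # 8
```

So the gain from 128-frame clips is not just "more data per clip": at matched token budgets, the longer clips trade breadth (fewer clips) for long-range temporal context within each clip.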
For more experiments and all the details, check out our arXiv preprint linked above. We are working on releasing our code and data, so stay tuned! 👨‍💻 🧵(8/9)
This project is a collaboration with my amazing peers and mentors during my internship @AIatMeta: Bowen Shi, @skylrwang, @ncihancamgoz @j_maillard. ⭐ 🧵(9/9)