
Anuj Diwan
@anuj_diwan
Followers: 781 · Following: 2K · Media: 21 · Statuses: 242
PhD Student @UTCompSci. Prev. Student Researcher @GoogleDeepmind, FAIR (@metaai), @AdobeResearch. 2021 BTech CSE @iitbombay. Interests: NLP, ASR, ML. 🇮🇳🇺🇸
Austin + Mumbai
Joined May 2014
Introducing ParaSpeechCaps, our large-scale style captions dataset that enables rich, expressive control for text-to-speech models! Beyond basic pitch or speed controls, our models can generate speech that sounds "guttural", "scared", "whispered" and more; 59 style tags in total.
ParaSpeechCaps has been accepted to the EMNLP 2025 Main Conference!
RT @ZEYULIU10: LLMs trained to memorize new facts can’t use those facts well. 🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact…
RT @ForbesIndia: A pioneer in machine learning, Sunita Sarawagi has transformed how computers process unstructured data through innovations….
RT @ForbesIndia: Preethi Jyothi is advancing speech and language technologies to make AI more inclusive for low-resource Indian languages.….
RT @EliasEskin: Extremely excited to announce that I will be joining @UTAustin @UTCompSci in August 2025 as an Assistant Professor! 🎉 I’m…
RT @ManyaWadhwa1: Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies n…
0
39
0
RT @ramya_namuduri: Have that eerie feeling of déjà vu when reading model-generated text 👀, but can’t pinpoint the specific words or phrase….
RT @Jess_Riedel: Scott Aaronson announces he's building an Open-Phil backed AI alignment group at UT Austin (🔗 below). Prospective postd…
RT @PuyuanPeng: Announcing the new SotA voice-cloning TTS model: 𝗩𝗼𝗶𝗰𝗲𝗦𝘁𝗮𝗿 ⭐️ VoiceStar is: autoregressive, voice-cloning, robu…
RT @mina1004h: Recent AI models can suggest endless video edits, offering many alternatives to video creators. But how can we easily sift t….
If you'd like an open-source text-to-speech model that follows your style instructions, consider using our ParaSpeechCaps-based model! Model: Paper:
arxiv.org
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal,...
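Since the thread describes prompting the model with free-form style captions built from tags like "guttural", "scared", or "whispered", here is a minimal, purely illustrative sketch of composing such a caption. The function name and the caption template are assumptions for illustration, not the actual ParaSpeechCaps format or model API.

```python
# Hypothetical sketch: turning a handful of ParaSpeechCaps-style tags
# into a free-form style caption for a style-prompted TTS model.
# The template below is an assumption, not the dataset's real schema.

def build_style_caption(intrinsic_tags, situational_tags):
    """Combine speaker-level and utterance-level style tags into one caption."""
    tags = list(intrinsic_tags) + list(situational_tags)
    return "A speaker with a " + ", ".join(tags) + " voice."

caption = build_style_caption(["guttural"], ["scared", "whispered"])
print(caption)  # A speaker with a guttural, scared, whispered voice.
```

In the real pipeline, a caption like this would be passed as the style description alongside the transcript to the finetuned Parler-TTS model mentioned later in the thread.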
Three new state-of-the-art audio models in the API: 🗣️ Two speech-to-text models outperforming Whisper. 💬 A new TTS model you can instruct *how* to speak. 🤖 And the Agents SDK now supports audio, making it easy to build voice agents. Try TTS now at
RT @ai4bharat: 🚀 AI4Bharat: Advancing Indian Language AI - Open & Scalable! 🇮🇳✨. Over the past 4 years, we at AI4Bharat have been on a miss….
RT @berraksismann: Exciting News! 😊 INTERSPEECH 2028 will take place at the River Walk in San Antonio, Texas! ✨ I’m honored to serve as one o…
RT @ArxivSound: "Scaling Rich Style-Prompted Text-to-Speech Datasets," Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi, https://t.…
Thanks to my amazing collaborators @zszheng147, @eunsolc and David Harwath! Paper: Code: Dataset: Model: Demo: HF Space:
We finetune Parler-TTS-Mini-v1 on ParaSpeechCaps and achieve significant improvements in both speech-style consistency and naturalness over our best-performing baseline (which combines existing smaller-scale style datasets)!
ParaSpeechCaps contains 282 hours of human-labelled data and 2427 hours of automatically-labelled data. Human evaluators rate our scaled, automatically-labelled data to be on par with the human-labelled data! We carefully ablate our dataset design choices.
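A quick sanity check of the dataset sizes quoted in the tweet above (282 human-labelled hours, 2427 automatically-labelled hours): the scaled portion is roughly 8.6× the human-labelled portion, for about 2709 hours in total.

```python
# Arithmetic on the dataset sizes stated in the thread.
human_hours = 282
automatic_hours = 2427

total_hours = human_hours + automatic_hours
scale_factor = round(automatic_hours / human_hours, 1)

print(total_hours)   # 2709
print(scale_factor)  # 8.6
```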
ParaSpeechCaps is the first large-scale dataset that supports both speaker-level intrinsic tags and utterance-level situational tags. Our key contribution is a novel pipeline for scalable, automatic style annotations over such a wide variety of rich styles for the first time.
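The distinction above between speaker-level intrinsic tags and utterance-level situational tags could be represented with a small record type like the following. This is a hypothetical sketch: the field names and structure are assumptions for illustration, not the actual ParaSpeechCaps schema.

```python
# Hypothetical record pairing a transcript with the two tag families
# described in the thread. Field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StyleAnnotation:
    transcript: str
    intrinsic_tags: list    # speaker-level, persist across utterances (e.g. "guttural")
    situational_tags: list  # utterance-level, vary per utterance (e.g. "scared")

    def all_tags(self):
        """All style tags that apply to this utterance."""
        return self.intrinsic_tags + self.situational_tags

ex = StyleAnnotation("Stay close to me.", ["guttural"], ["scared", "whispered"])
print(ex.all_tags())  # ['guttural', 'scared', 'whispered']
```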