NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot...
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g.,...