Katsuya

10 months

@niw NaturalSpeech 2 is good too. It may be more consistent than VALL-E-X and better suited to production.

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot...

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g.,...

arxiv.org

Yoshimasa Niwa

@niw

10 months

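VALL-E-X is a bit scary. It runs fast enough on a CPU, and from a voice sample of about 10 seconds it can produce speech that sounds a lot like the actual person. I can see why the original authors never released an implementation or the model.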

Yoshimasa Niwa

@niw

10 months

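@kn I haven't looked into it thoroughly yet, but it seems there are ways to pin down and control things like the tone of voice and the speaking rate. I want to dig into this further.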

Katsuya

@kn

10 months

@niw With models like NaturalSpeech, Tacotron, FastSpeech, and VITS, where a prior model first predicts the pace and pitch of the speech, that kind of control is easy.
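
As a rough illustration of the kind of control described above (a sketch, not code from any of the models named in the thread): when a model predicts per-phoneme durations and a pitch contour before synthesis, those predictions can be rescaled before they reach the decoder. The function name `control_prosody` and its parameters are hypothetical, and the inputs are assumed to be plain lists of frame counts and F0 values in Hz, with 0 marking unvoiced frames.

```python
def control_prosody(durations, pitch_hz, speed=1.0, semitones=0.0):
    """Rescale predicted durations and pitch before decoding (illustrative only)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Faster speech means fewer frames per phoneme; keep voiced phonemes
    # at one frame or more, and leave zero-length entries untouched.
    new_durations = [max(1, round(d / speed)) if d > 0 else 0 for d in durations]
    # One semitone is a factor of 2**(1/12) in frequency; 0 Hz stays 0 Hz.
    factor = 2.0 ** (semitones / 12.0)
    new_pitch = [f0 * factor for f0 in pitch_hz]
    return new_durations, new_pitch


# Example: speak twice as fast and one octave higher.
fast_durations, high_pitch = control_prosody(
    [4, 6, 10], [100.0, 0.0, 220.0], speed=2.0, semitones=12.0
)
```

In a real system the rescaled values would be passed on to the model's length regulator and decoder; that part is model-specific and is not shown here.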
