Cosyvoice TTS for Unity


Cosyvoice TTS for Unity delivers unlimited, high-quality text-to-speech in both the Editor and at runtime, as well as unlimited voice cloning from 5-second sample clips.


by Tempstudio, LLC


Cosyvoice TTS for Unity is a Unity port of Cosyvoice2.


The product includes two components:

  • Runtime TTS: powered by the Unity AI Inference Engine (formerly Sentis and Barracuda) and LLMUnity (llama.cpp), the runtime TTS is high quality, fast, and cross-platform. It generates audio clips from a simple text string and baked speaker data. The package comes with 16 built-in voices (6 English, 6 Chinese, 2 Japanese, and 2 Korean) ready to use. You can also mix languages if you want an exotic feel with accents. A rough integration sketch follows this list.
  • Editor TTS and voice cloning: the TTS also works in the Editor for asset creation. It can take many tries to get the best audio output, and Cosyvoice TTS for Unity offers a dedicated editor panel for this purpose. The editor component also supports voice cloning: you can create new speaker data directly from short sample clips. Due to various technical limitations, voice cloning relies on a few external plugins such as onnxruntime. The TTS itself does not require these plugins; they are used for voice cloning only and can be deleted if you do not need that feature.
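
As a rough illustration of the runtime integration, here is a minimal sketch. The type and member names (NpcVoice, GenerateClipAsync, the speaker name string) are placeholders invented for this example, not the asset's documented API; the shape is the point: text plus speaker data in, an AudioClip out, played through a normal AudioSource.

```csharp
using System.Threading.Tasks;
using UnityEngine;

// Illustrative sketch only: GenerateClipAsync below is a placeholder for
// whatever text-to-AudioClip call the package exposes, and the speaker
// name is made up. Wire these up to the real API shown in the package samples.
public class NpcVoice : MonoBehaviour
{
    public AudioSource audioSource;

    // Placeholder for the package's synthesis call: (text, speaker) -> clip.
    public System.Func<string, string, Task<AudioClip>> GenerateClipAsync;

    public async void Say(string line)
    {
        if (GenerateClipAsync == null)
            return; // nothing wired up yet

        // Assumed flow: synthesize locally, then play like any other clip.
        AudioClip clip = await GenerateClipAsync(line, "english_female_01");
        audioSource.clip = clip;
        audioSource.Play();
    }
}
```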

Everything runs locally on your machine, and synthesis is very fast on modern PCs. Cosyvoice2's voice cloning covers a wide range of emotions and speaking styles; speech quality is excellent in Chinese, solid in English, and weaker in Japanese and Korean.


Disclaimers:

  • While TTS is functional on mobile, performance is quite slow, and it is likely only usable on true flagship devices (e.g. an iPad Air M1, though we don't have one to confirm). On lesser devices, generating 5 seconds of speech can take a minute or more. Combined with the relatively large model size (~2 GB), this asset is likely unsuitable for most mobile use cases. It can still be useful in limited roles, such as generating one-off personalized greetings for your players, thanks to the excellent voice cloning ability.
  • The model is allegedly rather poor in Japanese and Korean. Please do your own research if you are buying this asset primarily for these languages.
  • No text normalization or sentence splitting is included. If you are generating large blocks of text, preprocessing is necessary. This shouldn't be an issue: if you are generating from the Editor, you can do it yourself; if you are generating at runtime from LLM output, you can ask the LLM to perform normalization. A simple splitting sketch follows this list.
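
For runtime use, a minimal sentence-splitting pass might look like the sketch below. The rules (splitting on Latin and CJK terminators, cutting over-long sentences at commas) are deliberately simple and are our own illustration, not part of the package; real normalization of numbers, abbreviations, and so on needs more care.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Minimal preprocessing sketch: split a large block of text into
// sentence-sized chunks before handing each one to the TTS.
public static class TtsPreprocessor
{
    public static List<string> SplitSentences(string text, int maxChars = 120)
    {
        var sentences = new List<string>();

        // Keep the terminator attached to each sentence.
        foreach (Match m in Regex.Matches(text, @"[^.!?。！？]+[.!?。！？]*"))
        {
            string s = m.Value.Trim();
            if (s.Length == 0) continue;

            // Cut very long sentences at commas/semicolons so each
            // synthesis call stays short.
            while (s.Length > maxChars)
            {
                int cut = s.LastIndexOfAny(new[] { ',', '，', ';' }, maxChars);
                if (cut <= 0) cut = maxChars - 1;
                sentences.Add(s.Substring(0, cut + 1).Trim());
                s = s.Substring(cut + 1).Trim();
            }
            if (s.Length > 0) sentences.Add(s);
        }
        return sentences;
    }
}
```

Each returned chunk can then be synthesized separately and queued for playback one clip at a time.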

The asset uses Cosyvoice2 under the Apache License and onnxruntime under the MIT License; see the Third-Party Notices.txt file in the package for details.