---
license: apache-2.0
tags:
- text-to-speech
- tts
- voice-cloning
- speech-synthesis
- pytorch
- audio
- chinese
- english
- zero-shot
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---
    
    # MegaTTS3-WaveVAE: Complete Voice Cloning Model
    
    

**[🚀 GitHub Repository](https://github.com/Saganaki22/MegaTTS3-WaveVAE)**

## About

This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this package includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.

**Key Features:**
- 🎯 Zero-shot voice cloning from any 3-24 second audio sample
- 🌍 Bilingual: Chinese, English, and code-switching
- ⚡ Efficient: 0.45B-parameter diffusion transformer
- 🔧 Complete: includes WaveVAE (missing from the original release)
- 🎛️ Controllable: adjustable voice similarity and clarity
- 💻 Windows ready: one-click installer available

## Quick Start

### Installation

**[📥 One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - automated setup with GPU detection.

Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users.

### Usage Examples

```bash
# Basic voice cloning
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output

# Higher-quality settings
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0

# Web interface (easiest)
python tts/megatts3_gradio.py
# Then open http://localhost:7929
```

## Model Components

- **Diffusion Transformer**: 0.45B-parameter TTS model
- **WaveVAE**: high-quality audio encoder/decoder
- **Aligner**: speech-text alignment model
- **G2P**: grapheme-to-phoneme converter

## Parameters

- `--p_w` (intelligibility): 1.0-5.0; higher = clearer speech
- `--t_w` (similarity): 0.0-10.0; higher = closer to the reference voice
- **Tip**: set `t_w` 0-3 points higher than `p_w`

## Requirements

- Windows 10/11 or Linux
- Python 3.10
- 8 GB+ RAM; NVIDIA GPU recommended
- 5 GB+ storage space

## Credits

- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0]
- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
- **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE)
- **Special Thanks**: MysteryShack on Discord for model information

## Citation

If you use this model, please cite the original research:

```bibtex
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}

@article{ji2024wavtokenizer,
  title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}
```

---

*High-quality voice cloning for research and creative applications. Please use responsibly.*
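If you drive the CLI from Python scripts, a small helper can assemble the documented `infer_cli.py` flags and keep `--p_w`/`--t_w` inside their documented ranges. This is a minimal sketch: the helper name and structure are illustrative and not part of the package; only the flag names and value ranges come from the usage notes above.

```python
def build_infer_command(input_wav, input_text, output_dir="./output",
                        p_w=2.0, t_w=3.0):
    """Assemble the argv for tts/infer_cli.py, clamping p_w and t_w.

    Documented ranges: p_w (intelligibility) in 1.0-5.0,
    t_w (similarity) in 0.0-10.0. Only builds the command list;
    run it yourself with subprocess.run(cmd, check=True).
    """
    p_w = min(max(p_w, 1.0), 5.0)
    t_w = min(max(t_w, 0.0), 10.0)
    return [
        "python", "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", input_text,
        "--output_dir", output_dir,
        "--p_w", str(p_w),
        "--t_w", str(t_w),
    ]

# Example: follow the tip of keeping t_w 0-3 points above p_w.
cmd = build_infer_command("reference.wav", "Your text here", p_w=2.0, t_w=4.0)
```

Building the argv as a list (rather than a shell string) avoids quoting problems when the input text contains spaces or punctuation.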