---
license: apache-2.0
tags:
- text-to-speech
- tts
- voice-cloning
- speech-synthesis
- pytorch
- audio
- chinese
- english
- zero-shot
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---
    
    # MegaTTS3-WaveVAE: Complete Voice Cloning Model
    
    

**[🚀 GitHub Repository](https://github.com/Saganaki22/MegaTTS3-WaveVAE)**

## About

This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this package includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.

**Key Features:**
- 🎯 Zero-shot voice cloning from any 3-24 second audio sample
- 🌍 Bilingual: Chinese, English, and code-switching
- ⚡ Efficient: 0.45B-parameter diffusion transformer
- 🔧 Complete: includes WaveVAE (missing from the original release)
- 🎛️ Controllable: adjustable voice similarity and clarity
- 💻 Windows ready: one-click installer available

## Quick Start

### Installation

**[📥 One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - automated setup with GPU detection.

Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users.

### Usage Examples

```bash
# Basic voice cloning
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output

# Higher-quality settings
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0

# Web interface (easiest)
python tts/megatts3_gradio.py
# Then open http://localhost:7929
```

## Model Components

- **Diffusion Transformer**: 0.45B-parameter TTS model
- **WaveVAE**: high-quality audio encoder/decoder
- **Aligner**: speech-text alignment model
- **G2P**: grapheme-to-phoneme converter

## Parameters

- `--p_w` (intelligibility): 1.0-5.0; higher = clearer speech
- `--t_w` (similarity): 0.0-10.0; higher = closer to the reference voice
- **Tip**: set `t_w` 0-3 points higher than `p_w`

## Requirements

- Windows 10/11 or Linux
- Python 3.10
- 8 GB+ RAM; NVIDIA GPU recommended
- 5 GB+ storage space

## Credits

- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0]
- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
- **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE)
- **Special Thanks**: MysteryShack on Discord for model information

## Citation

If you use this model, please cite the original research:

```bibtex
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}

@article{ji2024wavtokenizer,
  title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}
```

---

*High-quality voice cloning for research and creative applications. Please use responsibly.*
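If you drive the CLI from Python scripts, a small helper can assemble the documented `infer_cli.py` flags and keep `--p_w`/`--t_w` inside their documented ranges. This is a minimal sketch: the helper name and structure are illustrative and not part of the package; only the flag names and value ranges come from the usage notes above.

```python
def build_infer_command(input_wav, input_text, output_dir="./output",
                        p_w=2.0, t_w=3.0):
    """Assemble the argv for tts/infer_cli.py, clamping p_w and t_w.

    Documented ranges: p_w (intelligibility) in 1.0-5.0,
    t_w (similarity) in 0.0-10.0. Only builds the command list;
    run it yourself with subprocess.run(cmd, check=True).
    """
    p_w = min(max(p_w, 1.0), 5.0)
    t_w = min(max(t_w, 0.0), 10.0)
    return [
        "python", "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", input_text,
        "--output_dir", output_dir,
        "--p_w", str(p_w),
        "--t_w", str(t_w),
    ]

# Example: follow the tip of keeping t_w 0-3 points above p_w.
cmd = build_infer_command("reference.wav", "Your text here", p_w=2.0, t_w=4.0)
```

Building the argv as a list (rather than a shell string) avoids quoting problems when the input text contains spaces or punctuation.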