Update: VoxCPM1.5 and fine-tuning support
docs/finetune.md (new file, 355 lines)
@@ -0,0 +1,355 @@
# VoxCPM Fine-tuning Guide

This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.

### 🎓 SFT (Supervised Fine-Tuning)

Full fine-tuning updates all model parameters. Suitable for:
- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed

### ⚡ LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping

## Table of Contents

- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)

---

## Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```

### Required Fields

| Field | Description |
|-------|-------------|
| `audio` | Path to audio file (absolute or relative) |
| `text` | Corresponding transcript |

### Optional Fields

| Field | Description |
|-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |

### Requirements

- Audio format: WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
- Text: Transcript matching the audio content

See `examples/train_data_example.jsonl` for a complete example.
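If your corpus is not already in this shape, a small script can resample the audio and write the manifest. Below is a minimal sketch, not part of the VoxCPM codebase: it assumes each WAV file has a sibling `.txt` transcript, uses `torchaudio` for resampling, and treats the folder names and the `train.jsonl` output path as placeholders.

```python
# Sketch: build a JSONL manifest from a folder of WAV/transcript pairs.
# Targets VoxCPM1.5 (44.1kHz); set TARGET_SR = 16000 for VoxCPM-0.5B.
import json
from pathlib import Path

import torchaudio

TARGET_SR = 44100
data_dir = Path("data/my_speaker")       # hypothetical input folder
out_dir = Path("data/my_speaker_44k")    # resampled copies go here
out_dir.mkdir(parents=True, exist_ok=True)

with open("train.jsonl", "w", encoding="utf-8") as manifest:
    for wav_path in sorted(data_dir.glob("*.wav")):
        text = wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip()

        wav, sr = torchaudio.load(str(wav_path))
        if sr != TARGET_SR:
            wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
        out_path = out_dir / wav_path.name
        torchaudio.save(str(out_path), wav, TARGET_SR)

        duration = wav.shape[-1] / TARGET_SR  # optional field, speeds up filtering
        record = {"audio": str(out_path), "text": text, "duration": round(duration, 2)}
        manifest.write(json.dumps(record, ensure_ascii=False) + "\n")
```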
---

## Full Fine-tuning

Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.00001  # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```

### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```

### Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors         # Model weights (excluding audio_vae)
    ├── config.json               # Model config
    ├── audiovae.pth              # Audio VAE weights
    ├── tokenizer.json            # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
```
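Because this directory mirrors a pretrained model directory, it can also be loaded programmatically. The following is a minimal sketch that assumes `VoxCPMModel.from_local` (shown in the LoRA Hot-swapping section below) accepts a fine-tuned checkpoint directory the same way it accepts a pretrained one:

```python
# Sketch: load a fully fine-tuned checkpoint directly (assumed to behave like a
# pretrained model directory, since it contains model.safetensors, config.json,
# audiovae.pth, and the tokenizer files).
from voxcpm.model import VoxCPMModel

model = VoxCPMModel.from_local(
    "/path/to/checkpoints/finetune_all/step_0002000",
    optimize=True,  # enable torch.compile acceleration, as in the API reference below
)
# Inference then proceeds as with the scripts shown in the Inference section.
```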
---

## LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.0001  # LoRA can use a larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true      # Apply LoRA to the Language Model
  enable_dit: true     # Apply LoRA to the Diffusion Transformer
  enable_proj: false   # Apply LoRA to projection layers (optional)

  r: 32                # LoRA rank (higher = more capacity)
  alpha: 16            # LoRA alpha, scaling = alpha / r
  dropout: 0.0

  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
```

### LoRA Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA to | Attention layers |
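To make the `r`/`alpha` interaction concrete, here is a minimal, self-contained sketch of how a LoRA-adapted linear layer applies the `scaling = alpha / r` factor. It is illustrative only and is not VoxCPM's actual implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained projection
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)    # adapter starts as a no-op
        self.scaling = alpha / r              # e.g. 16 / 32 = 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Only lora_A / lora_B are trained: with r=32 on a 1024x1024 q_proj that is
# 2 * 32 * 1024 = 65,536 trainable parameters instead of ~1M for the full matrix.
layer = LoRALinear(nn.Linear(1024, 1024), r=32, alpha=16)
```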
### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```

### Checkpoint Structure

LoRA training saves only the LoRA parameters:

```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── optimizer.pth
    └── scheduler.pth
```

---

## Inference

### Full Fine-tuning Inference

The checkpoint directory is a complete model; load it directly:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is the voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

### LoRA Inference

LoRA inference requires the training config (for the LoRA structure) and the LoRA checkpoint:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is the LoRA fine-tuned result." \
    --output lora_output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

---

## LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.

### API Reference

```python
from voxcpm.model import VoxCPMModel
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPMModel.from_local(
    pretrained_path,
    optimize=True,  # Enable torch.compile acceleration
    lora_config=lora_cfg,
)

# 2. Load LoRA weights (works after torch.compile)
loaded, skipped = model.load_lora_weights("/path/to/lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.reset_lora_weights()

# 6. Hot-swap to another LoRA
model.load_lora_weights("/path/to/another_lora_checkpoint")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()
```

### Method Reference

| Method | Description | torch.compile Compatible |
|--------|-------------|--------------------------|
| `load_lora_weights(path)` | Load LoRA weights from file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `reset_lora_weights()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ |

---

## FAQ

### 1. Out of Memory (OOM)

- Increase `grad_accum_steps` (gradient accumulation); see the example below
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter out long samples
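As a rough rule of thumb, the effective batch size is `batch_size × grad_accum_steps × number of GPUs`. For example, halving `batch_size` from 16 to 8 while doubling `grad_accum_steps` from 1 to 2 keeps the effective batch size unchanged while reducing peak activation memory.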
### 2. Poor LoRA Performance

- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Ensure `enable_dit: true` (required for voice cloning)
- Increase training steps
- Add more target modules

### 3. Training Not Converging

- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality

### 4. LoRA Not Taking Effect at Inference

- Ensure the LoRA parameters in the inference config match the training config
- Check the `load_lora_weights` return value: `skipped_keys` should be empty
- Verify `set_lora_enabled(True)` is called

### 5. Checkpoint Loading Errors

- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain `lora_weights.safetensors` (or `lora_weights.ckpt`)
docs/performance.md (new file, 46 lines)
@@ -0,0 +1,46 @@
# 📊 Performance Highlights

VoxCPM achieves competitive results on public zero-shot TTS benchmarks.

## Seed-TTS-eval Benchmark

| Model | Parameters | Open-Source | test-EN WER/%⬇ | test-EN SIM/%⬆ | test-ZH CER/%⬇ | test-ZH SIM/%⬆ | test-Hard CER/%⬇ | test-Hard SIM/%⬆ |
|-------|------------|-------------|:--------------:|:--------------:|:--------------:|:--------------:|:----------------:|:----------------:|
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | ❌ | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | ❌ | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| F5-TTS | 0.3B | ✅ | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| MaskGCT | 1B | ✅ | 2.62 | 71.7 | 2.27 | 77.4 | - | - |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 | **6.83** | 72.4 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | **74.7** |
| OpenAudio-s1-mini | 0.5B | ✅ | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |
| **VoxCPM** | 0.5B | ✅ | **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 |

## CV3-eval Benchmark

| Model | zh CER/%⬇ | en WER/%⬇ | hard-zh CER/%⬇ | hard-zh SIM/%⬆ | hard-zh DNSMOS⬆ | hard-en WER/%⬇ | hard-en SIM/%⬆ | hard-en DNSMOS⬆ |
|-------|:---------:|:---------:|:--------------:|:--------------:|:---------------:|:--------------:|:--------------:|:---------------:|
| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - |
| SparkTTS | 5.15 | 11.0 | - | - | - | - | - | - |
| GPT-SoVits | 7.34 | 12.5 | - | - | - | - | - | - |
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 72.6 | 3.81 | 11.96 | 66.7 | 3.95 |
| OpenAudio-s1-mini | 4.00 | 5.54 | 18.1 | 58.2 | 3.77 | 12.4 | 55.7 | 3.89 |
| IndexTTS2 | 3.58 | 4.45 | 12.8 | 74.6 | 3.65 | 8.78 | 74.5 | 3.80 |
| HiggsAudio-v2 | 9.54 | 7.89 | 41.0 | 60.2 | 3.39 | 10.3 | 61.8 | 3.68 |
| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 |
| **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 |
docs/release_note.md (new file, 109 lines)
@@ -0,0 +1,109 @@
# VoxCPM1.5 Release Notes

**Release Date:** December 5, 2025

## 🎉 Overview

We’re thrilled to introduce a major upgrade that improves the audio quality and efficiency of VoxCPM while maintaining its core capabilities of context-aware speech generation and zero-shot voice cloning.

| Feature | VoxCPM | VoxCPM1.5 |
|---------|--------|-----------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |

## 🎵 Model Updates

### 🔊 AudioVAE Sampling Rate: 16kHz → 44.1kHz

The AudioVAE now supports a 44.1kHz sampling rate, which allows the model to:
- 🎯 Clone voices more faithfully, preserving more high-frequency detail and generating higher-quality output

*Note: This upgrade enables higher quality generation when using high-quality reference audio, but does not guarantee that all generated audio will be high-fidelity. The output quality depends on the **prompt speech** quality.*

### ⚡ Token Rate: 12.5Hz → 6.25Hz

We reduced the token rate of the LM backbone from 12.5Hz to 6.25Hz (the LocEnc & LocDiT patch size increased from 2 to 4) while maintaining similar performance on evaluation benchmarks. This change:
- 💨 Reduces the computation needed to generate the same length of audio
- 📈 Provides a foundation for longer audio generation
- 🏗️ Paves the way for training larger models in the future
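As a rough illustration, at 6.25Hz a 10-second utterance corresponds to about 63 LM tokens (6.25 × 10), versus 125 tokens at the previous 12.5Hz rate, so the LM backbone processes roughly half as many tokens for the same audio length.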
## 🔧 Fine-tuning Support

We now support both full fine-tuning and LoRA fine-tuning; please see the [Fine-tuning Guide](finetune.md) for detailed instructions.

## 📚 Documentation

- Updated README with version comparison
- Added comprehensive fine-tuning guide
- Improved code comments and documentation

## 🙏 Our Thanks to You

This release wouldn’t be possible without the incredible feedback, testing, and contributions from our open-source community. Thank you for helping shape VoxCPM1.5!

## 📞 Let's Build Together

Questions, ideas, or want to contribute?

- 🐛 Report an issue: [GitHub Issues on OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM/issues)
- 📖 Dig into the docs: check the [docs/](../docs/) folder for guides and API details

Enjoy the richer sound and powerful new features of VoxCPM1.5 🎉

We can't wait to hear what you create next! 🥂

## 🚀 What We're Working On

We're continuously improving VoxCPM and working on exciting new features:

- 🌍 **Multilingual TTS Support**: We are actively developing support for languages beyond Chinese and English.
- 🎯 **Controllable Expressive Speech Generation**: We are researching controllable speech generation that allows fine-grained control over speech attributes (emotion, timbre, prosody, etc.) through natural language instructions.
- 🎵 **Universal Audio Generation Foundation**: We also hope to explore VoxCPM as a unified audio generation foundation model capable of joint generation of speech, music, and sound effects. However, this is a longer-term vision.

**📅 Next Release**: We plan to release the next version in Q1 2026, which will include significant improvements and new features. Stay tuned for updates! We're committed to making VoxCPM even more powerful and versatile.

## ❓ Frequently Asked Questions (FAQ)

### Q: Does VoxCPM support fine-tuning for personalized voice customization?

**A:** Yes! VoxCPM now supports both full fine-tuning (SFT) and efficient LoRA fine-tuning. You can train personalized voice models on your own data. Please refer to the [Fine-tuning Guide](finetune.md) for detailed instructions and examples.

### Q: Is 16kHz audio quality sufficient for my use case?

**A:** We have upgraded the AudioVAE to support a 44.1kHz sampling rate in VoxCPM1.5, which provides higher quality audio output with better preservation of high-frequency details. This upgrade enables better voice cloning quality and more natural speech synthesis when using high-quality reference audio.

### Q: Has the stability issue been resolved?

**A:** We have made stability optimizations in VoxCPM1.5, including improvements to the training data and model architecture. Based on community feedback, we collected reports of stability issues such as:
- Increased noise and reverberation
- Audio artifacts (e.g., howling/squealing)
- Unstable speaking rate (speeding up)
- Volume fluctuations (increases or decreases)
- Noise artifacts at the beginning and end of audio
- Synthesis issues with very short texts (e.g., "hello")

While we have made improvements in these areas, the issues are not completely resolved and may still occasionally occur, especially with very long or highly expressive inputs. We continue to work on further stability improvements in future versions.

### Q: Does VoxCPM plan to support multilingual TTS?

**A:** Currently, VoxCPM is primarily trained on Chinese and English data. We are actively researching and developing multilingual TTS support for more languages beyond Chinese and English. Please let us know which languages you'd like to see supported!

### Q: Does VoxCPM plan to support controllable generation (emotion, style, fine-grained control)?

**A:** Currently, VoxCPM only supports zero-shot voice cloning and context-aware speech generation. Direct control over specific speech attributes (emotion, style, fine-grained prosody) is limited. However, we are actively researching instruction-controllable expressive speech generation with fine-grained control capabilities, working towards a human instruction-to-speech generation model!

### Q: Does VoxCPM support different hardware chips (e.g., Ascend 910B, XPU, NPU)?

**A:** Currently, we have not yet adapted VoxCPM for different hardware chips. Our main focus remains on developing new model capabilities and improving stability. We encourage you to check whether community developers have done similar work, and we warmly welcome everyone to contribute and promote such adaptations together!

These features are under active development, and we look forward to sharing updates in future releases!
docs/usage_guide.md (new file, 53 lines)
@@ -0,0 +1,53 @@
# 👩‍🍳 A Voice Chef's Guide

Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let's begin.

---

## 🥚 Step 1: Prepare Your Base Ingredients (Content)

First, choose how you'd like to input your text:

### 1. Regular Text (Classic Mode)
- ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.

### 2. Phoneme Input (Native Mode)
- ❌ Turn "Text Normalization" OFF. Enter phoneme text like `{HH AH0 L OW1}` (EN) or `{ni3}{hao3}` (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text—try it out!
- **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin. For English, phonemes are converted using CMUDict. Please refer to the relevant documentation for more details (a small conversion sketch follows below).
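If you want to generate phoneme strings programmatically rather than writing them by hand, the following sketch shows one possible approach using the third-party `pypinyin` and `cmudict` packages. It is illustrative only: VoxCPM's own conversion pipeline may differ, and the exact brace format should match the examples above.

```python
# Sketch: produce {...}-style phoneme text (format assumed from the examples above).
from pypinyin import lazy_pinyin, Style   # pip install pypinyin
import cmudict                            # pip install cmudict

cmu = cmudict.dict()

def zh_to_phonemes(text: str) -> str:
    # "你好" -> "{ni3}{hao3}"
    return "".join("{" + p + "}" for p in lazy_pinyin(text, style=Style.TONE3))

def en_word_to_phonemes(word: str) -> str:
    # "hello" -> "{HH AH0 L OW1}" (first pronunciation listed in CMUDict)
    prons = cmu.get(word.lower())
    if not prons:
        return word  # fall back to the raw word if it is not in the dictionary
    return "{" + " ".join(prons[0]) + "}"

print(zh_to_phonemes("你好"))
print(en_word_to_phonemes("hello"))
```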
---

## 🍳 Step 2: Choose Your Flavor Profile (Voice Style)

This is the secret sauce that gives your audio its unique sound.

### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
- **For a Clean, Studio-Quality Voice:**
  - ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone.

### 2. Cooking au Naturel (Letting the Model Improvise)
- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
- **Pro Tip**: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results!

---

## 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)

You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.

### CFG Value (How Closely to Follow the Recipe)
- **Default**: A great starting point.
- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
- **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.

### Inference Timesteps (Simmering Time: Quality vs. Speed)
- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.

---

Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!