# VoxCPM Fine-tuning Guide
This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.

### 🎓 SFT (Supervised Fine-Tuning)

Full fine-tuning updates all model parameters. Suitable for:

- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed

### ⚡ LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:

- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping
## Table of Contents
- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)

---
## Quick Start: WebUI
For users who prefer a graphical interface, we provide `lora_ft_webui.py`, a comprehensive WebUI for training and inference:

### Launch WebUI

```bash
python lora_ft_webui.py
```

Then open `http://localhost:7860` in your browser.

### Features

- **🚀 Training Tab**: Configure and start LoRA training with an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real-time
  - Resume training from existing checkpoints

- **🎵 Inference Tab**: Generate audio with trained models
  - Automatic base model loading from LoRA checkpoint config
  - Voice cloning with automatic ASR (reference text recognition)
  - Hot-swap between multiple LoRA models
  - Zero-shot TTS without reference audio
## Data Preparation
Training data should be prepared as a JSONL manifest file, with one sample per line:

```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```

### Required Fields

| Field | Description |
|-------|-------------|
| `audio` | Path to audio file (absolute or relative) |
| `text` | Corresponding transcript |

### Optional Fields

| Field | Description |
|-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |

### Requirements

- Audio format: WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
- Text: Transcript matching the audio content

See `examples/train_data_example.jsonl` for a complete example.

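
If you need to build the manifest from a directory of audio files with matching `.txt` transcripts, a minimal sketch along these lines works; the directory layout and file naming below are assumptions for illustration, not a convention required by VoxCPM:

```python
import json
import wave
from pathlib import Path

# Assumed layout: my_dataset/utt001.wav + my_dataset/utt001.txt, etc.
data_dir = Path("my_dataset")

with open("train.jsonl", "w", encoding="utf-8") as manifest:
    for wav_path in sorted(data_dir.glob("*.wav")):
        text = wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip()
        with wave.open(str(wav_path), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()  # optional field
        sample = {"audio": str(wav_path), "text": text, "duration": round(duration, 2)}
        manifest.write(json.dumps(sample, ensure_ascii=False) + "\n")
```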
---
## Full Fine-tuning
Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.00001  # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```

### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```

### Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors        # Model weights (excluding audio_vae)
    ├── config.json              # Model config
    ├── audiovae.pth             # Audio VAE weights
    ├── tokenizer.json           # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
```
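
Because the directory mirrors the layout of the released model, it can also be used wherever a model path is expected. A minimal sketch, assuming `VoxCPM.from_pretrained` accepts a local checkpoint directory (as the `# or local path` comment in the LoRA hot-swapping section suggests):

```python
from voxcpm.core import VoxCPM

# Path to a full fine-tuning checkpoint; assumed to work as a local model directory.
ckpt_dir = "/path/to/checkpoints/finetune_all/step_0002000"

model = VoxCPM.from_pretrained(hf_model_id=ckpt_dir)
audio = model.generate(text="Hello, this is the fine-tuned model.")
```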
---
## LoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.0001  # LoRA can use larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true      # Apply LoRA to Language Model
  enable_dit: true     # Apply LoRA to Diffusion Transformer
  enable_proj: false   # Apply LoRA to projection layers (optional)

  r: 32                # LoRA rank (higher = more capacity)
  alpha: 16            # LoRA alpha, scaling = alpha / r
  dropout: 0.0

  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]

  # Distribution options (optional)
  # hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
  # distribute: true                  # If true, save hf_model_id in lora_config.json
```

### LoRA Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to LM (language model) | `true` |
| `enable_dit` | Apply LoRA to DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA | attention layers |

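
To make the `scaling = alpha / r` relationship concrete, here is a conceptual sketch of how a LoRA-adapted linear layer combines the frozen base weight with the low-rank update. This only illustrates the math; it is not VoxCPM's internal implementation:

```python
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Frozen base weight W plus a low-rank update scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen; only lora_A / lora_B train
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no effect at start
        self.scaling = alpha / r  # alpha=16, r=32 -> 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.lora_A.T) @ self.lora_B.T  # low-rank path through an r-dim bottleneck
        return self.base(x) + self.scaling * delta
```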
### Distribution Options (Optional)

| Parameter | Description | Default |
|-----------|-------------|---------|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in the checkpoint; otherwise save the local `pretrained_path` | `false` |

> **Note**: If `distribute: true`, `hf_model_id` is required.

### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```

### Checkpoint Structure

LoRA training saves LoRA parameters and configuration:

```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors   # Only lora_A, lora_B parameters
    ├── lora_config.json           # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```

The `lora_config.json` contains:

```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}
```

The `base_model` field contains:

- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)

This allows loading LoRA checkpoints without the original training config file.

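
To confirm what a LoRA checkpoint actually stores, you can list the tensors in `lora_weights.safetensors`. A small sketch using the `safetensors` package; the exact key names depend on the model internals, so they are only printed here, not assumed:

```python
from safetensors.torch import load_file

lora_state = load_file("/path/to/checkpoints/finetune_lora/step_0002000/lora_weights.safetensors")

# Only the lora_A / lora_B tensors are stored, which keeps the file small.
total = sum(t.numel() for t in lora_state.values())
print(f"{len(lora_state)} tensors, {total / 1e6:.1f}M parameters")
for name, tensor in sorted(lora_state.items())[:5]:
    print(name, tuple(tensor.shape))
```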
---
## Inference
### Full Fine-tuning Inference

The checkpoint directory is a complete model, so it can be loaded directly:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is the voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

### LoRA Inference

LoRA inference only requires the checkpoint directory (the base model path and LoRA config are read from `lora_config.json`):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is the LoRA fine-tuned result." \
    --output lora_output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

Override the base model path (optional):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use a different base model." \
    --output output.wav
```

---
## LoRA Hot-swapping
LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.

### API Reference

```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or local path
    load_denoiser=False,              # Optional: disable denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is the LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()
```

### Simplified Usage (Load from lora_config.json)

If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:

```python
import json

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load config from checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)
```

Or use the test script directly:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```

### Method Reference

| Method | Description | torch.compile Compatible |
|--------|-------------|--------------------------|
| `load_lora(path)` | Load LoRA weights from file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ |

---
## FAQ
### 1. Out of Memory (OOM)

- Increase `grad_accum_steps` (gradient accumulation); see the effective batch size note below
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter long samples

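
A rough way to reason about the trade-off between `batch_size` and `grad_accum_steps`; the numbers and variable names below are illustrative, not defaults from the config files:

```python
# Per-step memory scales with batch_size, while the optimizer sees the
# effective batch below, so trading batch_size for grad_accum_steps
# lowers peak memory without changing the effective batch.
batch_size = 8        # halved from 16 to cut per-step memory
grad_accum_steps = 2  # doubled to compensate
num_gpus = 4          # e.g. torchrun --nproc_per_node=4

effective_batch = batch_size * grad_accum_steps * num_gpus
print(effective_batch)  # 64, same as batch_size=16 with no accumulation
```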
### 2. Poor LoRA Performance

- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase training steps
- Add more target modules

### 3. Training Not Converging

- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality

### 4. LoRA Not Taking Effect at Inference

- Check that `lora_config.json` exists in the checkpoint directory
- Check the `load_lora()` return value; `skipped_keys` should be empty (see the check below)
- Verify `set_lora_enabled(True)` is called

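
A quick sanity check using the methods from the Method Reference table above. This is a sketch: the checkpoint path is a placeholder, and the `LoRAConfig` values must match those used during training:

```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

lora_ckpt = "/path/to/checkpoints/finetune_lora/step_0002000"

model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",
    lora_config=LoRAConfig(enable_lm=True, enable_dit=True, r=32, alpha=16),
)

# load_lora returns (loaded, skipped); skipped entries typically indicate a
# mismatch between the LoRAConfig and the checkpoint.
loaded, skipped = model.load_lora(lora_ckpt)
print(f"loaded={len(loaded)}, skipped={len(skipped)}")

assert model.lora_enabled, "model was built without a LoRA config"
model.set_lora_enabled(True)  # ensure the LoRA path is active before generating
```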
### 5. Checkpoint Loading Errors

- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain:
  - `lora_weights.safetensors` (or `lora_weights.ckpt`): the LoRA weights
  - `lora_config.json`: the LoRA config and base model path