# VoxCPM Fine-tuning Guide
This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.
### 🎓 SFT (Supervised Fine-Tuning)

Full fine-tuning updates all model parameters. Suitable for:

- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed

### ⚡ LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:

- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping
## Table of Contents

- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)
## Quick Start: WebUI

For users who prefer a graphical interface, we provide `lora_ft_webui.py`, a comprehensive WebUI for training and inference.

### Launch WebUI

```bash
python lora_ft_webui.py
```

Then open http://localhost:7860 in your browser.
### Features

- 🚀 **Training Tab**: Configure and start LoRA training with an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real time
  - Resume training from existing checkpoints
- 🎵 **Inference Tab**: Generate audio with trained models
  - Automatic base model loading from the LoRA checkpoint config
  - Voice cloning with automatic ASR (reference text recognition)
  - Hot-swap between multiple LoRA models
  - Zero-shot TTS without reference audio
## Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```
### Required Fields

| Field | Description |
|---|---|
| `audio` | Path to the audio file (absolute or relative) |
| `text` | Corresponding transcript |
### Optional Fields

| Field | Description |
|---|---|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |
### Requirements

- Audio format: WAV
- Sample rate: 16 kHz for VoxCPM-0.5B, 44.1 kHz for VoxCPM1.5
- Text: transcript matching the audio content

See `examples/train_data_example.jsonl` for a complete example.
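If you build the manifest programmatically, a quick format and sample-rate check can catch problems before training. Below is a minimal sketch; the file paths, sample texts, and the `soundfile` dependency are assumptions for illustration, not part of VoxCPM:

```python
# Minimal sketch: build and sanity-check a JSONL manifest.
import json

import soundfile as sf

samples = [
    {"audio": "data/utt001.wav", "text": "Transcript of utterance 1."},
    {"audio": "data/utt002.wav", "text": "Transcript of utterance 2."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        info = sf.info(s["audio"])
        # 44.1 kHz for VoxCPM1.5; use 16 kHz when targeting VoxCPM-0.5B.
        assert info.samplerate == 44100, f"unexpected sample rate: {info.samplerate}"
        s["duration"] = round(info.duration, 2)  # optional field, speeds up filtering
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```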
## Full Fine-tuning

Full fine-tuning updates all model parameters. It suits large datasets or cases where significant behavior changes are needed.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.00001  # Use a smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```
### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```
### Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors        # Model weights (excluding audio_vae)
    ├── config.json              # Model config
    ├── audiovae.pth             # Audio VAE weights
    ├── tokenizer.json           # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
```
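Because this directory mirrors the pretrained layout, it can be consumed like the base model. A minimal sketch, assuming `VoxCPM.from_pretrained` accepts a local checkpoint directory (as the "or local path" note in the hot-swapping section below indicates):

```python
from voxcpm.core import VoxCPM

# Load the fine-tuned checkpoint directory as if it were the base model.
model = VoxCPM.from_pretrained(
    hf_model_id="/path/to/checkpoints/finetune_all/step_0002000",  # local dir
)
audio = model.generate(text="Hello from the fine-tuned model.")
```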
## LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.0001  # LoRA can use a larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true     # Apply LoRA to the Language Model
  enable_dit: true    # Apply LoRA to the Diffusion Transformer
  enable_proj: false  # Apply LoRA to projection layers (optional)
  r: 32               # LoRA rank (higher = more capacity)
  alpha: 16           # LoRA alpha; scaling = alpha / r
  dropout: 0.0
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
  # Distribution options (optional)
  # hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
  # distribute: true                  # If true, save hf_model_id in lora_config.json
```
### LoRA Parameters

| Parameter | Description | Recommended |
|---|---|---|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA to | Attention layers |
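To make `scaling = alpha / r` concrete, here is a generic, illustrative sketch of a LoRA-adapted linear layer; it shows the standard technique, not VoxCPM's internal implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base  # frozen pretrained projection (e.g., q_proj)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # update starts as a no-op
        self.scaling = alpha / r            # e.g., 16 / 32 = 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

With `r: 32` and `alpha: 16`, the low-rank update is scaled by 0.5; raising `r` without raising `alpha` shrinks the update's relative weight.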
### Distribution Options (Optional)

| Parameter | Description | Default |
|---|---|---|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in the checkpoint; otherwise save the local `pretrained_path` | `false` |

> **Note**: If `distribute: true`, `hf_model_id` is required.
### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
### Checkpoint Structure

LoRA training saves only the LoRA parameters and configuration:

```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```
The `lora_config.json` contains:

```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}
```
The `base_model` field contains:

- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)
This allows loading LoRA checkpoints without the original training config file.
## Inference

### Full Fine-tuning Inference

The checkpoint directory is a complete model; load it directly:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is the voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
### LoRA Inference

LoRA inference only requires the checkpoint directory (the base model path and LoRA config are read from `lora_config.json`):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is the LoRA fine-tuned result." \
    --output lora_output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

Override the base model path (optional):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use a different base model." \
    --output output.wav
```
## LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time, without reloading the entire model.

### API Reference
```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load the model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or a local path
    load_denoiser=False,              # Optional: disable the denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is the LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use the base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get the current LoRA weights
lora_state = model.get_lora_state_dict()
```
### Simplified Usage (Load from lora_config.json)

If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:

```python
import json

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load the config from the checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load the model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)
```
Or use the test script directly:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```
### Method Reference

| Method | Description | torch.compile Compatible |
|---|---|---|
| `load_lora(path)` | Load LoRA weights from a file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get the current LoRA weights | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ |
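As a usage example, the sketch below A/B-tests two adapters on one loaded model; `model` is the instance built in the API Reference above, and the second checkpoint path is a placeholder:

```python
# Generate with the currently loaded adapter.
audio_a = model.generate(text="Sample with adapter A.")

# Hot-swap: load different LoRA weights into the same structure.
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
assert not skipped, f"unexpected skipped keys: {skipped}"
audio_b = model.generate(text="Sample with adapter B.")

# Fall back to the base voice without unloading the weights.
model.set_lora_enabled(False)
audio_base = model.generate(text="Sample without LoRA.")
```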
## FAQ

### 1. How Much Data Is Needed for LoRA Fine-tuning to Converge to a Single Voice?
We have tested with 5 minutes and 10 minutes of data (all audio clips are 3-6s in length). In our experiments, both datasets converged to a single voice after 2000 training steps with default configurations. You can adjust the data amount and training configurations based on your available data and computational resources.
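To check how much audio a manifest actually contains, you can total the durations; a small sketch that prefers the optional `duration` field and otherwise probes the file with `soundfile` (an assumed dependency):

```python
import json

import soundfile as sf

total_seconds = 0.0
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        # Use the optional `duration` field when present; otherwise probe the file.
        total_seconds += sample.get("duration") or sf.info(sample["audio"]).duration
print(f"{total_seconds / 60:.1f} minutes of audio")
```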
### 2. Out of Memory (OOM)

- Increase `grad_accum_steps` (gradient accumulation)
- Decrease `batch_size` (the sketch below shows the trade-off with `grad_accum_steps`)
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter out long samples
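When trading `batch_size` against `grad_accum_steps`, note that the effective batch per optimizer step is their product (times the number of GPUs). Plain arithmetic, not a VoxCPM API:

```python
# Halving batch_size while doubling grad_accum_steps keeps the effective
# batch size constant at a lower peak memory cost.
batch_size = 8        # was 16
grad_accum_steps = 2  # was 1
num_gpus = 4          # torchrun --nproc_per_node=4
effective_batch = batch_size * grad_accum_steps * num_gpus
print(effective_batch)  # 64 samples per optimizer step
```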
### 3. Poor LoRA Performance

- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase the number of training steps
- Add more target modules (see the sketch below)
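For the last point, widen coverage through `target_modules_*` when building the config. In this sketch, the MLP module names (`gate_proj`, `up_proj`, `down_proj`) are an assumption based on LLaMA-style blocks; check your model's actual layer names before using them:

```python
from voxcpm.model.voxcpm import LoRAConfig

lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=64,       # raised rank for more capacity
    alpha=32,   # keep alpha = r/2
    # Attention projections plus MLP layers; the MLP names are an assumption.
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
```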
### 4. Training Not Converging

- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality
### 5. LoRA Not Taking Effect at Inference

- Check that `lora_config.json` exists in the checkpoint directory
- Check the `load_lora()` return value: `skipped_keys` should be empty (see the diagnostic below)
- Verify that `set_lora_enabled(True)` is called
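A quick diagnostic along these lines, where `model` is a VoxCPM instance loaded with a `lora_config` as in the hot-swapping section:

```python
# Diagnose a silently inactive adapter.
loaded, skipped = model.load_lora("/path/to/checkpoints/finetune_lora/step_0002000")
print(f"loaded={len(loaded)}, skipped={len(skipped)}")
assert not skipped, "skipped keys usually mean a rank/target-module mismatch"
assert model.lora_enabled, "LoRA structure was not configured at load time"
model.set_lora_enabled(True)
```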
### 6. Checkpoint Loading Errors

- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain the following (see the sanity check below):
  - `lora_weights.safetensors` (or `lora_weights.ckpt`): the LoRA weights
  - `lora_config.json`: the LoRA config and base model path
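A small sanity check for the expected LoRA files (the checkpoint path is a placeholder):

```python
import os

def check_lora_ckpt(ckpt_dir: str) -> None:
    # Either the safetensors or the ckpt form of the weights is acceptable.
    weights_ok = any(os.path.exists(os.path.join(ckpt_dir, name))
                     for name in ("lora_weights.safetensors", "lora_weights.ckpt"))
    config_ok = os.path.exists(os.path.join(ckpt_dir, "lora_config.json"))
    if not (weights_ok and config_ok):
        raise FileNotFoundError(f"incomplete LoRA checkpoint: {ckpt_dir}")

check_lora_ckpt("/path/to/checkpoints/finetune_lora/step_0002000")
```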