# VoxCPM Fine-tuning Guide

This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.
## 🎓 SFT (Supervised Fine-Tuning)

Full fine-tuning updates all model parameters. Suitable for:

- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed
## ⚡ LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:

- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping
## Table of Contents
- Quick Start: WebUI
- Data Preparation
- Full Fine-tuning
- LoRA Fine-tuning
- Inference
- LoRA Hot-swapping
- FAQ
## Quick Start: WebUI

For users who prefer a graphical interface, we provide `lora_ft_webui.py`, a comprehensive WebUI for training and inference.

### Launch WebUI

```bash
python lora_ft_webui.py
```

Then open http://localhost:7860 in your browser.
### Features

- 🚀 **Training Tab**: Configure and start LoRA training with an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real-time
  - Resume training from existing checkpoints
- 🎵 **Inference Tab**: Generate audio with trained models
  - Automatic base model loading from the LoRA checkpoint config
  - Voice cloning with automatic ASR (reference text recognition)
  - Hot-swap between multiple LoRA models
  - Zero-shot TTS without reference audio
## Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```
### Required Fields

| Field | Description |
|---|---|
| `audio` | Path to audio file (absolute or relative) |
| `text` | Corresponding transcript |
### Optional Fields

| Field | Description |
|---|---|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |
### Requirements

- Audio format: WAV
- Sample rate: 16 kHz for VoxCPM-0.5B, 44.1 kHz for VoxCPM1.5
- Text: transcript matching the audio content

See `examples/train_data_example.jsonl` for a complete example.
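If transcripts live in per-clip text files, the manifest can be assembled with a short script. The sketch below is illustrative rather than part of the repository: it assumes one `.txt` transcript next to each `.wav` file and uses the `soundfile` package (an extra dependency) to fill the optional `duration` field.

```python
import json
from pathlib import Path

import soundfile as sf  # assumption: installed separately, not a VoxCPM requirement

audio_dir = Path("/path/to/wavs")  # placeholder: folder of WAV clips at the required sample rate

with open("train.jsonl", "w", encoding="utf-8") as manifest:
    for wav in sorted(audio_dir.glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            continue  # skip clips without a transcript
        info = sf.info(str(wav))
        sample = {
            "audio": str(wav),
            "text": txt.read_text(encoding="utf-8").strip(),
            # Optional field; speeds up sample filtering during training
            "duration": round(info.frames / info.samplerate, 3),
        }
        manifest.write(json.dumps(sample, ensure_ascii=False) + "\n")
```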
## Full Fine-tuning

Full fine-tuning updates all model parameters. It is suitable for large datasets or cases where significant behavior changes are needed.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.00001  # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```
Training
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
### Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors         # Model weights (excluding audio_vae)
    ├── config.json               # Model config
    ├── audiovae.pth              # Audio VAE weights
    ├── tokenizer.json            # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
```
## LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.0001  # LoRA can use larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true     # Apply LoRA to Language Model
  enable_dit: true    # Apply LoRA to Diffusion Transformer
  enable_proj: false  # Apply LoRA to projection layers (optional)
  r: 32               # LoRA rank (higher = more capacity)
  alpha: 16           # LoRA alpha, scaling = alpha / r
  dropout: 0.0
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
  # Distribution options (optional)
  # hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
  # distribute: true                  # If true, save hf_model_id in lora_config.json
```
### LoRA Parameters

| Parameter | Description | Recommended |
|---|---|---|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA to | Attention layers |
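To make the `r` / `alpha` relationship concrete, here is a conceptual sketch of a LoRA-wrapped linear layer (illustrative only, not VoxCPM's internal implementation): the pretrained weight stays frozen, and two small rank-`r` matrices add an update scaled by `alpha / r`.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Conceptual sketch: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # trainable down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_B.weight)       # the added update starts as a no-op
        self.scaling = alpha / r                  # scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(self.dropout(x)))
```

Higher `r` gives the update more capacity (and more trainable parameters); `alpha` only rescales it, which is why `alpha = r/2` or `alpha = r` are common starting points.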
### Distribution Options (Optional)

| Parameter | Description | Default |
|---|---|---|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in the checkpoint; otherwise save the local `pretrained_path` | `false` |

**Note**: If `distribute: true`, `hf_model_id` is required.
Training
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
### Checkpoint Structure

LoRA training saves the LoRA parameters and configuration:

```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```
The `lora_config.json` contains:

```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}
```
The `base_model` field contains:

- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)

This allows loading LoRA checkpoints without the original training config file.
## Inference

### Full Fine-tuning Inference

The checkpoint directory is a complete model, so it can be loaded directly:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```
With voice cloning:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
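The checkpoint can also be loaded programmatically. A minimal sketch, assuming `VoxCPM.from_pretrained` accepts a local directory in place of a HuggingFace ID (as noted in the hot-swapping API below):

```python
from voxcpm.core import VoxCPM

# Assumption: the step directory works as a local model path because it contains
# config.json, model.safetensors, audiovae.pth, and the tokenizer files.
model = VoxCPM.from_pretrained(
    hf_model_id="/path/to/checkpoints/finetune_all/step_0002000",
)
audio = model.generate(text="Hello, this is the fine-tuned model.")
```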
### LoRA Inference

LoRA inference only requires the checkpoint directory (the base model path and LoRA config are read from `lora_config.json`):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is LoRA fine-tuned result." \
    --output lora_output.wav
```
With voice cloning:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
Override the base model path (optional):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use different base model." \
    --output output.wav
```
## LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.

### API Reference

```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or local path
    load_denoiser=False,              # Optional: disable denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()
```
### Simplified Usage (Load from lora_config.json)

If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:

```python
import json
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load config from checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)
```
Or use the test script directly:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```
### Method Reference

| Method | Description | torch.compile Compatible |
|---|---|---|
| `load_lora(path)` | Load LoRA weights from file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ |
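As a concrete use of these methods, the sketch below swaps between two speaker-specific adapters on a single loaded model. The adapter paths are placeholders, and `model` is assumed to be a VoxCPM instance created with a `LoRAConfig` as in the API Reference above.

```python
# Placeholder adapter checkpoints trained with the same LoRAConfig as `model`
adapters = {
    "speaker_a": "/path/to/checkpoints/speaker_a/step_0002000",
    "speaker_b": "/path/to/checkpoints/speaker_b/step_0002000",
}

outputs = {}
for name, ckpt in adapters.items():
    loaded, skipped = model.load_lora(ckpt)  # hot-swap this adapter's weights in place
    assert not skipped, f"unexpected skipped keys for {name}: {skipped}"
    outputs[name] = model.generate(text=f"Hello from {name}.")

model.set_lora_enabled(False)  # fall back to the plain base model afterwards
```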
## FAQ

### 1. Out of Memory (OOM)

- Increase `grad_accum_steps` (gradient accumulation)
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter long samples

### 2. Poor LoRA Performance

- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase training steps
- Add more target modules

### 3. Training Not Converging

- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality

### 4. LoRA Not Taking Effect at Inference

- Check that `lora_config.json` exists in the checkpoint directory
- Check the `load_lora()` return value: `skipped_keys` should be empty
- Verify `set_lora_enabled(True)` is called

### 5. Checkpoint Loading Errors

- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain:
  - `lora_weights.safetensors` (or `lora_weights.ckpt`) - the LoRA weights
  - `lora_config.json` - the LoRA config and base model path

A quick file check for both cases is sketched below.
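The following sketch checks which of these files are present before loading; the checkpoint path is a placeholder, and each tuple lists the acceptable alternatives named above.

```python
from pathlib import Path

ckpt = Path("/path/to/checkpoint/step_0002000")  # placeholder checkpoint directory

if (ckpt / "lora_config.json").exists():
    # LoRA checkpoint
    required = [("lora_weights.safetensors", "lora_weights.ckpt"), ("lora_config.json",)]
else:
    # Full fine-tuning checkpoint
    required = [("model.safetensors", "pytorch_model.bin"), ("config.json",), ("audiovae.pth",)]

missing = [alts for alts in required if not any((ckpt / name).exists() for name in alts)]
print("missing:", [" or ".join(alts) for alts in missing] or "none")
```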