
VoxCPM Fine-tuning Guide

This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.

🎓 Full Fine-tuning (SFT)

Full fine-tuning updates all model parameters. Suitable for:

  • 📊 Large, specialized datasets
  • 🔄 Cases where significant behavior changes are needed

LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:

  • 🎯 Trains only a small number of additional parameters
  • 💾 Significantly reduces memory requirements and training time
  • 🔀 Supports multiple LoRA adapters with hot-swapping

Quick Start: WebUI

For users who prefer a graphical interface, we provide lora_ft_webui.py, a WebUI that covers both training and inference:

Launch WebUI

python lora_ft_webui.py

Then open http://localhost:7860 in your browser.

Features

  • 🚀 Training Tab: Configure and start LoRA training with an intuitive interface

    • Set training parameters (learning rate, batch size, LoRA rank, etc.)
    • Monitor training progress in real-time
    • Resume training from existing checkpoints
  • 🎵 Inference Tab: Generate audio with trained models

    • Automatic base model loading from LoRA checkpoint config
    • Voice cloning with automatic ASR (reference text recognition)
    • Hot-swap between multiple LoRA models
    • Zero-shot TTS without reference audio

Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}

Required Fields

| Field | Description |
| --- | --- |
| audio | Path to audio file (absolute or relative) |
| text | Corresponding transcript |

Optional Fields

| Field | Description |
| --- | --- |
| duration | Audio duration in seconds (speeds up sample filtering) |
| dataset_id | Dataset ID for multi-dataset training (default: 0) |

Requirements

  • Audio format: WAV
  • Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
  • Text: Transcript matching the audio content

See examples/train_data_example.jsonl for a complete example.
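
If your data is organized as WAV files with matching .txt transcripts, the manifest can be generated with a short script. The snippet below is a hypothetical helper, not part of the repository; it assumes one transcript file per audio file and uses soundfile to fill the optional duration field:

import json
from pathlib import Path

import soundfile as sf

wav_dir = Path("/path/to/wavs")                        # assumption: audio1.wav sits next to audio1.txt
with open("train.jsonl", "w", encoding="utf-8") as out:
    for wav in sorted(wav_dir.glob("*.wav")):
        text = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        duration = sf.info(str(wav)).duration          # optional, but speeds up sample filtering
        sample = {"audio": str(wav), "text": text, "duration": round(duration, 2)}
        out.write(json.dumps(sample, ensure_ascii=False) + "\n")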


Full Fine-tuning

Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.

Configuration

Create conf/voxcpm_v1.5/voxcpm_finetune_all.yaml:

pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.00001   # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

Training

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors     # Model weights (excluding audio_vae)
    ├── config.json            # Model config
    ├── audiovae.pth           # Audio VAE weights
    ├── tokenizer.json         # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
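
As a sketch of loading such a checkpoint from Python (assuming VoxCPM.from_pretrained accepts a local directory in place of a HuggingFace ID, as in the LoRA API example later in this guide, and that soundfile is installed):

import soundfile as sf
from voxcpm.core import VoxCPM

# Point directly at the saved step directory; it contains weights, config, and tokenizer.
model = VoxCPM.from_pretrained(hf_model_id="/path/to/checkpoints/finetune_all/step_0002000")

audio = model.generate(text="Hello, this is the fine-tuned model.")
sf.write("output.wav", audio, 44100)   # sample rate assumed to match the training config (VoxCPM1.5)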

LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.
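
The sketch below illustrates the general idea behind these adapters; it is a simplified illustration, not the actual VoxCPM implementation. The names lora_A and lora_B are the parameters saved in LoRA checkpoints (see Checkpoint Structure below), and the alpha / r scaling matches the LoRA configuration described next.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen Linear layer plus a trainable low-rank update (illustrative only)."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # pretrained weights stay frozen
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no effect before training
        self.scaling = alpha / r                # same scaling rule as the config below
        self.dropout = nn.Dropout(dropout)
        self.enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            delta = self.lora_B @ self.lora_A   # (out_features, in_features), rank at most r
            out = out + self.dropout(x) @ delta.T * self.scaling
        return out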

Configuration

Create conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml:

pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.0001    # LoRA can use larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true        # Apply LoRA to Language Model
  enable_dit: true       # Apply LoRA to Diffusion Transformer
  enable_proj: false     # Apply LoRA to projection layers (optional)
  
  r: 32                  # LoRA rank (higher = more capacity)
  alpha: 16              # LoRA alpha, scaling = alpha / r
  dropout: 0.0
  
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]

# Distribution options (optional)
# hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
# distribute: true                   # If true, save hf_model_id in lora_config.json

LoRA Parameters

| Parameter | Description | Recommended |
| --- | --- | --- |
| enable_lm | Apply LoRA to LM (language model) | true |
| enable_dit | Apply LoRA to DiT (diffusion model) | true (required for voice cloning) |
| r | LoRA rank (higher = more capacity) | 16-64 |
| alpha | Scaling factor, scaling = alpha / r | Usually r/2 or r |
| target_modules_* | Layer names to add LoRA to | attention layers |
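
For example, the configuration above sets r: 32 and alpha: 16, so the LoRA update is scaled by 16 / 32 = 0.5 before it is added to the frozen base weights.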

Distribution Options (Optional)

| Parameter | Description | Default |
| --- | --- | --- |
| hf_model_id | HuggingFace model ID (e.g., openbmb/VoxCPM1.5) | "" |
| distribute | If true, save hf_model_id as base_model in checkpoint; otherwise save local pretrained_path | false |

Note: If distribute: true, hf_model_id is required.
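
For example, to produce a checkpoint that records the HuggingFace ID instead of a local path (so it can be shared and loaded on machines without the original model directory), add the two options shown commented out in the config above:

hf_model_id: "openbmb/VoxCPM1.5"
distribute: true    # lora_config.json will then record the HF ID as base_model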

Training

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

Checkpoint Structure

LoRA training saves LoRA parameters and configuration:

checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors    # Only lora_A, lora_B parameters
    ├── lora_config.json            # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth

The lora_config.json contains:

{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}

The base_model field contains:

  • Local path (default): when distribute: false or not set
  • HuggingFace ID: when distribute: true (e.g., "openbmb/VoxCPM1.5")

This allows loading LoRA checkpoints without the original training config file.


Inference

Full Fine-tuning Inference

The checkpoint directory is a complete model and can be loaded directly:

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav

With voice cloning:

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav

LoRA Inference

LoRA inference only requires the checkpoint directory (base model path and LoRA config are read from lora_config.json):

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is LoRA fine-tuned result." \
    --output lora_output.wav

With voice cloning:

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav

Override base model path (optional):

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use different base model." \
    --output output.wav

LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.

API Reference

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True, 
    enable_dit=True, 
    r=32, 
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or local path
    load_denoiser=False,              # Optional: disable denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",   # Optional: for voice cloning
)

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()

Simplified Usage (Load from lora_config.json)

If your checkpoint contains lora_config.json (saved by the training script), you can load everything automatically:

import json
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load config from checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)

Or use the test script directly:

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"

Method Reference

| Method | Description |
| --- | --- |
| load_lora(path) | Load LoRA weights from file |
| set_lora_enabled(bool) | Enable/disable LoRA |
| unload_lora() | Reset LoRA weights to initial values |
| get_lora_state_dict() | Get current LoRA weights |
| lora_enabled | Property: check if LoRA is configured |
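
These methods can be combined, for example to compare the base model and a loaded adapter on the same sentence (a small sketch using the API shown above):

# Assumes `model` was created with a LoRA config as in the API reference above.
model.set_lora_enabled(False)                  # temporarily fall back to the base weights
base_audio = model.generate(text="Same sentence for both runs.")

model.set_lora_enabled(True)                   # switch the adapter back on
lora_audio = model.generate(text="Same sentence for both runs.")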

FAQ

1. Out of Memory (OOM)

  • Increase grad_accum_steps (gradient accumulation; see the example after this list)
  • Decrease batch_size
  • Use LoRA fine-tuning instead of full fine-tuning
  • Decrease max_batch_tokens to filter long samples
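
For example, lowering batch_size while raising grad_accum_steps keeps the effective batch size (batch_size × grad_accum_steps per GPU) unchanged at a much lower peak memory cost:

batch_size: 4          # was 16
grad_accum_steps: 4    # 4 x 4 = 16 samples per optimizer step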

2. Poor LoRA Performance

  • Increase r (LoRA rank)
  • Adjust alpha (try alpha = r/2 or alpha = r)
  • Increase training steps
  • Add more target modules

3. Training Not Converging

  • Decrease learning_rate
  • Increase warmup_steps
  • Check data quality

4. LoRA Not Taking Effect at Inference

  • Check that lora_config.json exists in the checkpoint directory
  • Check the return value of load_lora(); the list of skipped keys should be empty
  • Verify set_lora_enabled(True) is called

5. Checkpoint Loading Errors

  • Full fine-tuning: checkpoint directory should contain model.safetensors (or pytorch_model.bin), config.json, audiovae.pth
  • LoRA: checkpoint directory should contain:
    • lora_weights.safetensors (or lora_weights.ckpt) - LoRA weights
    • lora_config.json - LoRA config and base model path
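
A quick way to diagnose such errors is to list which expected files are missing before launching inference. The snippet below is a small hypothetical helper, not part of the repository:

from pathlib import Path

EXPECTED = {
    "full": ["model.safetensors", "config.json", "audiovae.pth"],   # pytorch_model.bin may replace model.safetensors
    "lora": ["lora_weights.safetensors", "lora_config.json"],       # lora_weights.ckpt may replace the safetensors file
}

def missing_files(ckpt_dir: str, kind: str) -> list:
    ckpt = Path(ckpt_dir)
    return [name for name in EXPECTED[kind] if not (ckpt / name).exists()]

print(missing_files("/path/to/checkpoints/finetune_lora/step_0002000", kind="lora"))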