VoxCPM Fine-tuning Guide

This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.

🎓 Full Fine-tuning (SFT)

Full fine-tuning updates all model parameters. Suitable for:

  • 📊 Large, specialized datasets
  • 🔄 Cases where significant behavior changes are needed

LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:

  • 🎯 Trains only a small number of additional parameters
  • 💾 Significantly reduces memory requirements and training time
  • 🔀 Supports multiple LoRA adapters with hot-swapping

Table of Contents

  • Quick Start: WebUI
  • Data Preparation
  • Full Fine-tuning
  • LoRA Fine-tuning
  • Inference
  • LoRA Hot-swapping
  • FAQ

Quick Start: WebUI

For users who prefer a graphical interface, we provide lora_ft_webui.py - a comprehensive WebUI for training and inference:

Launch WebUI

python lora_ft_webui.py

Then open http://localhost:7860 in your browser.

Features

  • 🚀 Training Tab: Configure and start LoRA training with an intuitive interface

    • Set training parameters (learning rate, batch size, LoRA rank, etc.)
    • Monitor training progress in real-time
    • Resume training from existing checkpoints
  • 🎵 Inference Tab: Generate audio with trained models

    • Automatic base model loading from LoRA checkpoint config
    • Voice cloning with automatic ASR (reference text recognition)
    • Hot-swap between multiple LoRA models
    • Zero-shot TTS without reference audio

Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}

Required Fields

| Field | Description |
|-------|-------------|
| audio | Path to audio file (absolute or relative) |
| text  | Corresponding transcript |

Optional Fields

| Field | Description |
|-------|-------------|
| duration | Audio duration in seconds (speeds up sample filtering) |
| dataset_id | Dataset ID for multi-dataset training (default: 0) |

Requirements

  • Audio format: WAV
  • Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
  • Text: Transcript matching the audio content
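
If your source audio is at a different sample rate, resample it first. A minimal sketch using librosa (an illustrative choice, not a VoxCPM dependency):

import librosa
import soundfile as sf

# Resample an arbitrary input clip to mono 44.1 kHz (use 16000 for VoxCPM-0.5B).
audio, sr = librosa.load("path/to/raw_clip.wav", sr=44100, mono=True)
sf.write("path/to/audio1.wav", audio, sr)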

See examples/train_data_example.jsonl for a complete example.
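
If you are assembling the manifest from a folder of clips, here is a minimal sketch (the directory layout, per-clip .txt transcripts, and the soundfile dependency are illustrative assumptions, not part of VoxCPM):

import json
from pathlib import Path

import soundfile as sf  # illustrative choice for reading audio headers

# Hypothetical layout: every clip foo.wav sits next to its transcript foo.txt,
# already at the target sample rate (44.1 kHz for VoxCPM1.5).
data_dir = Path("/path/to/clips")

with open("train.jsonl", "w", encoding="utf-8") as out:
    for wav in sorted(data_dir.glob("*.wav")):
        sample = {
            "audio": str(wav),
            "text": wav.with_suffix(".txt").read_text(encoding="utf-8").strip(),
            "duration": round(sf.info(str(wav)).duration, 2),  # optional field
        }
        out.write(json.dumps(sample, ensure_ascii=False) + "\n")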


Full Fine-tuning

Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.

Configuration

Create conf/voxcpm_v1.5/voxcpm_finetune_all.yaml:

pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.00001   # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

Training

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors     # Model weights (excluding audio_vae)
    ├── config.json            # Model config
    ├── audiovae.pth           # Audio VAE weights
    ├── tokenizer.json         # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth

LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

Configuration

Create conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml:

pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.0001    # LoRA can use larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true        # Apply LoRA to Language Model
  enable_dit: true       # Apply LoRA to Diffusion Transformer
  enable_proj: false     # Apply LoRA to projection layers (optional)
  
  r: 32                  # LoRA rank (higher = more capacity)
  alpha: 16              # LoRA alpha, scaling = alpha / r
  dropout: 0.0
  
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]

# Distribution options (optional)
# hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
# distribute: true                   # If true, save hf_model_id in lora_config.json

LoRA Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| enable_lm | Apply LoRA to the LM (language model) | true |
| enable_dit | Apply LoRA to the DiT (diffusion model) | true (required for voice cloning) |
| r | LoRA rank (higher = more capacity) | 16-64 |
| alpha | Scaling factor, scaling = alpha / r | Usually r/2 or r |
| target_modules_* | Layer names to add LoRA to | attention layers |
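
For example, with the defaults above (r: 32, alpha: 16), the LoRA update is scaled by alpha / r = 16 / 32 = 0.5.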

Distribution Options (Optional)

| Parameter | Description | Default |
|-----------|-------------|---------|
| hf_model_id | HuggingFace model ID (e.g., openbmb/VoxCPM1.5) | "" |
| distribute | If true, save hf_model_id as base_model in the checkpoint; otherwise save the local pretrained_path | false |

Note: If distribute: true, hf_model_id is required.

Training

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

Checkpoint Structure

LoRA training saves LoRA parameters and configuration:

checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors    # Only lora_A, lora_B parameters
    ├── lora_config.json            # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth

The lora_config.json contains:

{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}

The base_model field contains:

  • Local path (default): when distribute: false or not set
  • HuggingFace ID: when distribute: true (e.g., "openbmb/VoxCPM1.5")

This allows loading LoRA checkpoints without the original training config file.


Inference

Full Fine-tuning Inference

The checkpoint directory is a complete model, so you can load it directly:

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav

With voice cloning:

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
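
You can also load the checkpoint directly from Python. A minimal sketch, assuming (as in the LoRA hot-swapping example later in this guide) that VoxCPM.from_pretrained accepts a local checkpoint directory and that generate() returns a waveform array; the soundfile call is illustrative:

import soundfile as sf

from voxcpm.core import VoxCPM

# Point at the full fine-tuning checkpoint directory (a complete model).
model = VoxCPM.from_pretrained(
    hf_model_id="/path/to/checkpoints/finetune_all/step_0002000",
)

# Plain TTS; add prompt_wav_path / prompt_text for voice cloning.
audio = model.generate(text="Hello, this is the fine-tuned model.")

sf.write("output.wav", audio, 44100)  # VoxCPM1.5 operates at 44.1 kHz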

LoRA Inference

LoRA inference only requires the checkpoint directory (base model path and LoRA config are read from lora_config.json):

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is LoRA fine-tuned result." \
    --output lora_output.wav

With voice cloning:

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav

Override base model path (optional):

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use different base model." \
    --output output.wav

LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.

API Reference

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True, 
    enable_dit=True, 
    r=32, 
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or local path
    load_denoiser=False,              # Optional: disable denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",   # Optional: for voice cloning
)

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()

Simplified Usage (Load from lora_config.json)

If your checkpoint contains lora_config.json (saved by the training script), you can load everything automatically:

import json
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load config from checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)

Or use the test script directly:

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"

Method Reference

| Method | Description |
|--------|-------------|
| load_lora(path) | Load LoRA weights from file |
| set_lora_enabled(bool) | Enable/disable LoRA |
| unload_lora() | Reset LoRA weights to initial values |
| get_lora_state_dict() | Get current LoRA weights |
| lora_enabled | Property: check if LoRA is configured |

FAQ

1. How Much Data is Needed for LoRA Fine-tuning to Converge to a Single Voice?

We have tested with 5 minutes and 10 minutes of data (all audio clips are 3-6s in length). In our experiments, both datasets converged to a single voice after 2000 training steps with default configurations. You can adjust the data amount and training configurations based on your available data and computational resources.

2. Out of Memory (OOM)

  • Increase grad_accum_steps (gradient accumulation)
  • Decrease batch_size
  • Use LoRA fine-tuning instead of full fine-tuning
  • Decrease max_batch_tokens to filter long samples
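
For example, halving batch_size from 16 to 8 while doubling grad_accum_steps from 1 to 2 keeps the effective batch size at 16 but roughly halves peak activation memory.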

3. Poor LoRA Performance

  • Increase r (LoRA rank)
  • Adjust alpha (try alpha = r/2 or alpha = r)
  • Increase training steps
  • Add more target modules

4. Training Not Converging

  • Decrease learning_rate
  • Increase warmup_steps
  • Check data quality

5. LoRA Not Taking Effect at Inference

  • Check that lora_config.json exists in the checkpoint directory
  • Check the load_lora() return value: skipped_keys should be empty (see the snippet below)
  • Verify set_lora_enabled(True) is called
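
A quick way to check this from Python, building on the hot-swapping API shown above:

# Load (or hot-swap) the LoRA weights and confirm nothing was skipped.
loaded, skipped = model.load_lora("/path/to/checkpoints/finetune_lora/step_0002000")
assert not skipped, f"Unexpected skipped keys: {skipped}"

# Make sure the adapters are active before generating.
model.set_lora_enabled(True)
print(f"LoRA configured: {model.lora_enabled}, loaded {len(loaded)} tensors")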

6. Checkpoint Loading Errors

  • Full fine-tuning: checkpoint directory should contain model.safetensors (or pytorch_model.bin), config.json, audiovae.pth
  • LoRA: checkpoint directory should contain:
    • lora_weights.safetensors (or lora_weights.ckpt) - LoRA weights
    • lora_config.json - LoRA config and base model path