# VoxCPM Fine-tuning Guide
This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.
### 🎓 SFT (Supervised Fine-Tuning)

Full fine-tuning updates all model parameters. Suitable for:

- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed

### ⚡ LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:

- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping
## Table of Contents

- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)
## Quick Start: WebUI

For users who prefer a graphical interface, we provide `lora_ft_webui.py`, a comprehensive WebUI for training and inference.

### Launch WebUI

```bash
python lora_ft_webui.py
```

Then open http://localhost:7860 in your browser.
### Features

- 🚀 **Training Tab**: Configure and start LoRA training with an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real time
  - Resume training from existing checkpoints
- 🎵 **Inference Tab**: Generate audio with trained models
  - Automatic base model loading from the LoRA checkpoint config
  - Voice cloning with automatic ASR (reference text recognition)
  - Hot-swap between multiple LoRA models
  - Zero-shot TTS without reference audio
## Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```
### Required Fields

| Field | Description |
|---|---|
| `audio` | Path to the audio file (absolute or relative) |
| `text` | Corresponding transcript |
### Optional Fields

| Field | Description |
|---|---|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |
### Requirements

- Audio format: WAV
- Sample rate: 16 kHz for VoxCPM-0.5B, 44.1 kHz for VoxCPM1.5
- Text: transcript matching the audio content

See `examples/train_data_example.jsonl` for a complete example.
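If you build the manifest programmatically, a quick format and sample-rate check can catch problems before training. Below is a minimal sketch; the file paths, sample texts, and the `soundfile` dependency are assumptions for illustration, not part of VoxCPM:

```python
# Minimal sketch: build and sanity-check a JSONL manifest.
import json

import soundfile as sf

samples = [
    {"audio": "data/utt001.wav", "text": "Transcript of utterance 1."},
    {"audio": "data/utt002.wav", "text": "Transcript of utterance 2."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        info = sf.info(s["audio"])
        # 44.1 kHz for VoxCPM1.5; use 16 kHz when targeting VoxCPM-0.5B.
        assert info.samplerate == 44100, f"unexpected sample rate: {info.samplerate}"
        s["duration"] = round(info.duration, 2)  # optional field, speeds up filtering
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```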
## Full Fine-tuning

Full fine-tuning updates all model parameters. It suits large datasets or cases where significant behavior changes are needed.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.00001  # Use a smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```
### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```
### Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors        # Model weights (excluding audio_vae)
    ├── config.json              # Model config
    ├── audiovae.pth             # Audio VAE weights
    ├── tokenizer.json           # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
```
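Because this directory mirrors the pretrained layout, it can be consumed like the base model. A minimal sketch, assuming `VoxCPM.from_pretrained` accepts a local checkpoint directory (as the "or local path" note in the hot-swapping section below indicates):

```python
from voxcpm.core import VoxCPM

# Load the fine-tuned checkpoint directory as if it were the base model.
model = VoxCPM.from_pretrained(
    hf_model_id="/path/to/checkpoints/finetune_all/step_0002000",  # local dir
)
audio = model.generate(text="Hello from the fine-tuned model.")
```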
## LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.0001  # LoRA can use a larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true     # Apply LoRA to the Language Model
  enable_dit: true    # Apply LoRA to the Diffusion Transformer
  enable_proj: false  # Apply LoRA to projection layers (optional)
  r: 32               # LoRA rank (higher = more capacity)
  alpha: 16           # LoRA alpha; scaling = alpha / r
  dropout: 0.0
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
  # Distribution options (optional)
  # hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
  # distribute: true                  # If true, save hf_model_id in lora_config.json
```
### LoRA Parameters

| Parameter | Description | Recommended |
|---|---|---|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA to | Attention layers |
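To make `scaling = alpha / r` concrete, here is a generic, illustrative sketch of a LoRA-adapted linear layer; it shows the standard technique, not VoxCPM's internal implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base  # frozen pretrained projection (e.g., q_proj)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # update starts as a no-op
        self.scaling = alpha / r            # e.g., 16 / 32 = 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

With `r: 32` and `alpha: 16`, the low-rank update is scaled by 0.5; raising `r` without raising `alpha` shrinks the update's relative weight.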
### Distribution Options (Optional)

| Parameter | Description | Default |
|---|---|---|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in the checkpoint; otherwise save the local `pretrained_path` | `false` |

> **Note**: If `distribute: true`, `hf_model_id` is required.
### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
### Checkpoint Structure

LoRA training saves only the LoRA parameters and configuration:

```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```
The `lora_config.json` contains:

```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}
```
The `base_model` field contains:

- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)
This allows loading LoRA checkpoints without the original training config file.
## Inference

### Full Fine-tuning Inference

The checkpoint directory is a complete model; load it directly:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is the voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
### LoRA Inference

LoRA inference only requires the checkpoint directory (the base model path and LoRA config are read from `lora_config.json`):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is the LoRA fine-tuned result." \
    --output lora_output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

Override the base model path (optional):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use a different base model." \
    --output output.wav
```
## LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time, without reloading the entire model.

### API Reference
```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load the model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or a local path
    load_denoiser=False,              # Optional: disable the denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is the LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use the base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get the current LoRA weights
lora_state = model.get_lora_state_dict()
```
### Simplified Usage (Load from lora_config.json)

If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:

```python
import json

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load the config from the checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load the model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)
```
Or use the test script directly:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```
### Method Reference

| Method | Description | torch.compile Compatible |
|---|---|---|
| `load_lora(path)` | Load LoRA weights from a file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get the current LoRA weights | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ |
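As a usage example, the sketch below A/B-tests two adapters on one loaded model; `model` is the instance built in the API Reference above, and the second checkpoint path is a placeholder:

```python
# Generate with the currently loaded adapter.
audio_a = model.generate(text="Sample with adapter A.")

# Hot-swap: load different LoRA weights into the same structure.
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
assert not skipped, f"unexpected skipped keys: {skipped}"
audio_b = model.generate(text="Sample with adapter B.")

# Fall back to the base voice without unloading the weights.
model.set_lora_enabled(False)
audio_base = model.generate(text="Sample without LoRA.")
```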
## FAQ

### 1. How Much Data Is Needed for LoRA Fine-tuning to Converge to a Single Voice?
We have tested with 5 minutes and 10 minutes of data (all audio clips are 3-6s in length). In our experiments, both datasets converged to a single voice after 2000 training steps with default configurations. You can adjust the data amount and training configurations based on your available data and computational resources.
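To check how much audio a manifest actually contains, you can total the durations; a small sketch that prefers the optional `duration` field and otherwise probes the file with `soundfile` (an assumed dependency):

```python
import json

import soundfile as sf

total_seconds = 0.0
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        # Use the optional `duration` field when present; otherwise probe the file.
        total_seconds += sample.get("duration") or sf.info(sample["audio"]).duration
print(f"{total_seconds / 60:.1f} minutes of audio")
```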
### 2. Out of Memory (OOM)

- Increase `grad_accum_steps` (gradient accumulation)
- Decrease `batch_size` (the sketch below shows the trade-off with `grad_accum_steps`)
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter out long samples
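When trading `batch_size` against `grad_accum_steps`, note that the effective batch per optimizer step is their product (times the number of GPUs). Plain arithmetic, not a VoxCPM API:

```python
# Halving batch_size while doubling grad_accum_steps keeps the effective
# batch size constant at a lower peak memory cost.
batch_size = 8        # was 16
grad_accum_steps = 2  # was 1
num_gpus = 4          # torchrun --nproc_per_node=4
effective_batch = batch_size * grad_accum_steps * num_gpus
print(effective_batch)  # 64 samples per optimizer step
```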
### 3. Poor LoRA Performance

- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase the number of training steps
- Add more target modules (see the sketch below)
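For the last point, widen coverage through `target_modules_*` when building the config. In this sketch, the MLP module names (`gate_proj`, `up_proj`, `down_proj`) are an assumption based on LLaMA-style blocks; check your model's actual layer names before using them:

```python
from voxcpm.model.voxcpm import LoRAConfig

lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=64,       # raised rank for more capacity
    alpha=32,   # keep alpha = r/2
    # Attention projections plus MLP layers; the MLP names are an assumption.
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
```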
### 4. Training Not Converging

- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality
### 5. LoRA Not Taking Effect at Inference

- Check that `lora_config.json` exists in the checkpoint directory
- Check the `load_lora()` return value: `skipped_keys` should be empty (see the diagnostic below)
- Verify that `set_lora_enabled(True)` is called
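A quick diagnostic along these lines, where `model` is a VoxCPM instance loaded with a `lora_config` as in the hot-swapping section:

```python
# Diagnose a silently inactive adapter.
loaded, skipped = model.load_lora("/path/to/checkpoints/finetune_lora/step_0002000")
print(f"loaded={len(loaded)}, skipped={len(skipped)}")
assert not skipped, "skipped keys usually mean a rank/target-module mismatch"
assert model.lora_enabled, "LoRA structure was not configured at load time"
model.set_lora_enabled(True)
```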
### 6. Checkpoint Loading Errors

- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain the following (see the sanity check below):
  - `lora_weights.safetensors` (or `lora_weights.ckpt`): the LoRA weights
  - `lora_config.json`: the LoRA config and base model path
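A small sanity check for the expected LoRA files (the checkpoint path is a placeholder):

```python
import os

def check_lora_ckpt(ckpt_dir: str) -> None:
    # Either the safetensors or the ckpt form of the weights is acceptable.
    weights_ok = any(os.path.exists(os.path.join(ckpt_dir, name))
                     for name in ("lora_weights.safetensors", "lora_weights.ckpt"))
    config_ok = os.path.exists(os.path.join(ckpt_dir, "lora_config.json"))
    if not (weights_ok and config_ok):
        raise FileNotFoundError(f"incomplete LoRA checkpoint: {ckpt_dir}")

check_lora_ckpt("/path/to/checkpoints/finetune_lora/step_0002000")
```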