
VoxCPM Fine-tuning Guide

This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.

🎓 Full Fine-tuning (SFT)

Full fine-tuning updates all model parameters. Suitable for:

  • 📊 Large, specialized datasets
  • 🔄 Cases where significant behavior changes are needed

LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:

  • 🎯 Trains only a small number of additional parameters
  • 💾 Significantly reduces memory requirements and training time
  • 🔀 Supports multiple LoRA adapters with hot-swapping

Table of Contents

  • Data Preparation
  • Full Fine-tuning
  • LoRA Fine-tuning
  • Inference
  • LoRA Hot-swapping
  • FAQ

Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}

Required Fields

Field | Description
------|------------
audio | Path to audio file (absolute or relative)
text | Corresponding transcript

Optional Fields

Field | Description
------|------------
duration | Audio duration in seconds (speeds up sample filtering)
dataset_id | Dataset ID for multi-dataset training (default: 0)

Requirements

  • Audio format: WAV
  • Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
  • Text: Transcript matching the audio content

See examples/train_data_example.jsonl for a complete example.
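
If you assemble the manifest programmatically, the same script can also fill in the optional duration field. The sketch below is illustrative only: it assumes the soundfile package is installed and that you already have (audio path, transcript) pairs; build_manifest is not part of VoxCPM.

import json
import soundfile as sf  # assumed available; used only to read duration and sample rate

def build_manifest(pairs, manifest_path, expected_sr=44100):
    """Write a JSONL manifest from (audio_path, transcript) pairs (illustrative helper).
    Use expected_sr=44100 for VoxCPM1.5 and 16000 for VoxCPM-0.5B."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_path, transcript in pairs:
            info = sf.info(audio_path)
            if info.samplerate != expected_sr:
                print(f"Warning: {audio_path} is {info.samplerate} Hz, expected {expected_sr} Hz")
            record = {
                "audio": audio_path,
                "text": transcript,
                "duration": round(info.duration, 3),  # optional, speeds up sample filtering
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

build_manifest([("path/to/audio1.wav", "Transcript of audio 1.")], "train.jsonl")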


Full Fine-tuning

Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.

Configuration

Create conf/voxcpm_v1.5/voxcpm_finetune_all.yaml:

pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.00001   # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

Training

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors     # Model weights (excluding audio_vae)
    ├── config.json            # Model config
    ├── audiovae.pth           # Audio VAE weights
    ├── tokenizer.json         # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth

LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

Configuration

Create conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml:

pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.0001    # LoRA can use larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true        # Apply LoRA to Language Model
  enable_dit: true       # Apply LoRA to Diffusion Transformer
  enable_proj: false     # Apply LoRA to projection layers (optional)
  
  r: 32                  # LoRA rank (higher = more capacity)
  alpha: 16              # LoRA alpha, scaling = alpha / r
  dropout: 0.0
  
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]

LoRA Parameters

Parameter | Description | Recommended
----------|-------------|------------
enable_lm | Apply LoRA to the LM (language model) | true
enable_dit | Apply LoRA to the DiT (diffusion transformer) | true (required for voice cloning)
r | LoRA rank (higher = more capacity) | 16-64
alpha | Scaling factor, scaling = alpha / r | Usually r/2 or r
target_modules_* | Layer names to apply LoRA to | attention layers
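
To make the r / alpha relationship concrete, here is a generic sketch of what a LoRA-adapted linear layer computes. It is not VoxCPM's actual implementation, only an illustration of the convention scaling = alpha / r used above.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper, for illustration only (not the VoxCPM implementation)."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # frozen pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r                         # with r=32, alpha=16 this is 0.5
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        lora_update = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return self.base(x) + lora_update * self.scaling

Only lora_A and lora_B are trained; wrapping the attention projections (q_proj, k_proj, v_proj, o_proj) with a layer like this is what target_modules_lm / target_modules_dit select.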

Training

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

Checkpoint Structure

LoRA training saves only LoRA parameters:

checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors    # Only lora_A, lora_B parameters
    ├── optimizer.pth
    └── scheduler.pth
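
If you want to confirm what a LoRA checkpoint contains, it can be inspected with the safetensors library. A minimal sketch (the exact tensor names depend on the target modules you trained):

from safetensors.torch import load_file

state = load_file("/path/to/checkpoints/finetune_lora/step_0002000/lora_weights.safetensors")
total = sum(t.numel() for t in state.values())
print(f"{len(state)} tensors, {total / 1e6:.2f}M trainable parameters")
for name in sorted(state)[:5]:
    print(name, tuple(state[name].shape))   # expect lora_A / lora_B matrices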

Inference

Full Fine-tuning Inference

The checkpoint directory is a complete model; it can be loaded directly:

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav

With voice cloning:

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
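
Since the checkpoint directory has the same layout as a pretrained model, it should also be loadable through the Python API described in the LoRA Hot-swapping section (from_pretrained accepts a local path). A minimal sketch, assuming the step_0002000 directory from above:

from voxcpm.core import VoxCPM

model = VoxCPM.from_pretrained(
    hf_model_id="/path/to/checkpoints/finetune_all/step_0002000",  # local checkpoint directory
)
audio = model.generate(text="Hello, this is the fine-tuned model.")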

LoRA Inference

LoRA inference requires the training config (to rebuild the LoRA structure) and the LoRA checkpoint:

python scripts/test_voxcpm_lora_infer.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is LoRA fine-tuned result." \
    --output lora_output.wav

With voice cloning:

python scripts/test_voxcpm_lora_infer.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav

LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.

API Reference

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True, 
    enable_dit=True, 
    r=32, 
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or local path
    load_denoiser=False,              # Optional: disable denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",   # Optional: for voice cloning
)

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()

Simplified Usage (Auto LoRA Config)

If you only have LoRA weights and don't need a custom config, just provide the path:

from voxcpm.core import VoxCPM

# Auto-create default LoRAConfig when only lora_weights_path is provided
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",
    lora_weights_path="/path/to/lora_checkpoint",  # Will auto-create LoRAConfig
)

Method Reference

Method | Description
-------|------------
load_lora(path) | Load LoRA weights from file
set_lora_enabled(bool) | Enable/disable LoRA
unload_lora() | Reset LoRA weights to initial values
get_lora_state_dict() | Get current LoRA weights
lora_enabled | Property: check if LoRA is configured
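
Putting these methods together, one way to audition several LoRA adapters without reloading the base model (paths and texts are placeholders):

from voxcpm.core import VoxCPM

model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",
    lora_weights_path="/path/to/lora_checkpoint",   # default LoRAConfig is auto-created
)

for ckpt in ["/path/to/lora_checkpoint", "/path/to/another_lora_checkpoint"]:
    loaded, skipped = model.load_lora(ckpt)         # hot-swap adapter weights in place
    print(f"{ckpt}: loaded {len(loaded)} params, skipped {len(skipped)}")
    audio = model.generate(text="Hello, this is LoRA fine-tuned result.")
    # compare or save the outputs here

model.set_lora_enabled(False)                       # fall back to the base model
baseline = model.generate(text="Hello, this is LoRA fine-tuned result.")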

FAQ

1. Out of Memory (OOM)

  • Increase grad_accum_steps (gradient accumulation)
  • Decrease batch_size
  • Use LoRA fine-tuning instead of full fine-tuning
  • Decrease max_batch_tokens to filter long samples

2. Poor LoRA Performance

  • Increase r (LoRA rank)
  • Adjust alpha (try alpha = r/2 or alpha = r)
  • Ensure enable_dit: true (required for voice cloning)
  • Increase training steps
  • Add more target modules

3. Training Not Converging

  • Decrease learning_rate
  • Increase warmup_steps
  • Check data quality

4. LoRA Not Taking Effect at Inference

  • Ensure the inference-time LoRA config (r, alpha, target modules) matches the training config
  • Check the load_lora() return value: the skipped list should be empty (see the sanity check after this list)
  • Verify set_lora_enabled(True) is called
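
A quick sanity check, reusing the API from the LoRA Hot-swapping section (model is a VoxCPM instance loaded as shown there):

loaded, skipped = model.load_lora("/path/to/lora_checkpoint")
assert not skipped, f"Some LoRA parameters were not applied: {skipped}"
model.set_lora_enabled(True)   # make sure the adapter is active before generating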

5. Checkpoint Loading Errors

  • Full fine-tuning: checkpoint directory should contain model.safetensors (or pytorch_model.bin), config.json, audiovae.pth
  • LoRA: checkpoint directory should contain lora_weights.safetensors (or lora_weights.ckpt)