# VoxCPM Fine-tuning Guide
This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.
### 🎓 SFT (Supervised Fine-Tuning)
Full fine-tuning updates all model parameters. Suitable for:
- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed
### ⚡ LoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping
## Table of Contents
- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)
---
## Quick Start: WebUI
For users who prefer a graphical interface, we provide `lora_ft_webui.py`, a comprehensive WebUI for training and inference:
### Launch WebUI
```bash
python lora_ft_webui.py
```
Then open `http://localhost:7860` in your browser.
### Features
- **🚀 Training Tab**: Configure and start LoRA training with an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real-time
  - Resume training from existing checkpoints
- **🎵 Inference Tab**: Generate audio with trained models
  - Automatic base model loading from the LoRA checkpoint config
  - Voice cloning with automatic ASR (reference text recognition)
  - Hot-swapping between multiple LoRA models
  - Zero-shot TTS without reference audio
## Data Preparation
Training data should be prepared as a JSONL manifest file, with one sample per line:
```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
{"audio": "path/to/audio1.wav", "text": "音频1的文本内容。"}
{"audio": "path/to/audio2.wav", "text": "音频2的文本内容。"}
{"audio": "path/to/audio3.wav", "text": "可选的时长字段。", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "多数据集训练可选的 dataset_id", "dataset_id": 1}
```
### Required Fields
| Field | Description |
|-------|-------------|
| `audio` | Path to audio file (absolute or relative) |
| `text` | Corresponding transcript |
### Optional Fields
| Field | Description |
|-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |
### Requirements
- Audio format: WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
- Text: Transcript matching the audio content
See `examples/train_data_example.jsonl` for a complete example.
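
If your corpus is a folder of WAV files with matching `.txt` transcripts, the manifest can be generated with a few lines of Python. A minimal sketch, assuming a `foo.wav`/`foo.txt` pairing convention (the paths and helper name are illustrative, not part of VoxCPM):

```python
import json
import wave
from pathlib import Path

def build_manifest(audio_dir: str, out_path: str) -> None:
    """Write one JSONL line per WAV file, pairing each foo.wav with foo.txt."""
    with open(out_path, "w", encoding="utf-8") as out:
        for wav_path in sorted(Path(audio_dir).glob("*.wav")):
            txt_path = wav_path.with_suffix(".txt")
            if not txt_path.exists():
                continue  # skip audio without a transcript
            with wave.open(str(wav_path)) as w:
                duration = w.getnframes() / w.getframerate()  # optional field
            record = {
                "audio": str(wav_path),
                "text": txt_path.read_text(encoding="utf-8").strip(),
                "duration": round(duration, 2),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

build_manifest("data/my_speaker", "data/train.jsonl")
```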
---
## Full Fine-tuning
Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.
### Configuration
Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
# ...
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.00001  # Use a smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
# ...
lambdas:
  loss/stop: 1.0
```
### Training
```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```
### Checkpoint Structure
Full fine-tuning saves a complete model directory that can be loaded directly:
```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors         # Model weights (excluding audio_vae)
    ├── config.json               # Model config
    ├── audiovae.pth              # Audio VAE weights
    ├── tokenizer.json            # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── ...
```
---
## LoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.
### Configuration
Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
# ...
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.0001  # LoRA can use a larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
# ...
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true     # Apply LoRA to the Language Model
  enable_dit: true    # Apply LoRA to the Diffusion Transformer
  enable_proj: false  # Apply LoRA to projection layers (optional)
  r: 32               # LoRA rank (higher = more capacity)
  alpha: 16           # LoRA alpha; scaling = alpha / r
  dropout: 0.0
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
  # Distribution options (optional)
  # hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
  # distribute: true                  # If true, save hf_model_id in lora_config.json
```
### LoRA Parameters
| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA | Attention layers |
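
To make `r` and `alpha` concrete, the sketch below shows the standard LoRA forward pass as a generic PyTorch module; it illustrates the math, not VoxCPM's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # the pretrained weight stays frozen
        self.scaling = alpha / r                # e.g., alpha=16, r=32 -> 0.5
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)      # so the adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

Raising `r` adds trainable capacity, while `alpha` rescales the update; keeping `alpha / r` fixed preserves the update magnitude when you change `r`, which is why `alpha = r/2` or `alpha = r` are the usual choices.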
### Distribution Options (Optional)
| Parameter | Description | Default |
|-----------|-------------|---------|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in the checkpoint; otherwise save the local `pretrained_path` | `false` |
> **Note**: If `distribute: true`, `hf_model_id` is required.
### Training
```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
### Checkpoint Structure
LoRA training saves LoRA parameters and configuration:
```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```
The `lora_config.json` contains:
```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  ...
}
```
The `base_model` field contains:
- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)

This allows loading LoRA checkpoints without the original training config file.
---
## Inference
### Full Fine-tuning Inference
The checkpoint directory is a complete model; load it directly:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```
With voice cloning:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is the voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
### LoRA Inference
LoRA inference only requires the checkpoint directory (base model path and LoRA config are read from `lora_config.json`):
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is the LoRA fine-tuned result." \
    --output lora_output.wav
```
With voice cloning:
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
Override the base model path (optional):
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use a different base model." \
    --output output.wav
```
---
## LoRA Hot-swapping
LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.
### API Reference
```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load the model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,       # r, alpha, and target_modules_lm mirror the example training config above
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)

model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or a local path
    load_denoiser=False,              # Optional: disable the denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is the LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use the base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get the current LoRA weights
lora_state = model.get_lora_state_dict()
```
### Simplified Usage (Load from lora_config.json)
If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:
```python
import json

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load the config from the checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load the model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,  # weights live alongside lora_config.json
)
```
Or use the test script directly:
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```
### Method Reference
| Method | Description | torch.compile Compatible |
|--------|-------------|--------------------------|
| `load_lora(path)` | Load LoRA weights from file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ |
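
Putting these methods together, a typical hot-swap session might look like the sketch below; it continues from the `model` created in the API Reference above, and the adapter paths are placeholders:

```python
# Continues from the API Reference example above; paths are placeholders.
audio_a = model.generate(text="Sample from the first adapter.")

loaded, skipped = model.load_lora("/path/to/speaker_b_lora")  # swap adapters in place
assert not skipped, f"Unexpected skipped keys: {skipped}"
audio_b = model.generate(text="Sample from the second adapter.")

model.set_lora_enabled(False)  # temporarily fall back to the base voice
audio_base = model.generate(text="Sample from the base model.")
model.set_lora_enabled(True)   # restore the adapter
```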
---
## FAQ
### 1. Out of Memory (OOM)
- Increase `grad_accum_steps` (gradient accumulation; see the sketch below)
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter long samples
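
Gradient accumulation is usually the cheapest fix: gradients from several micro-batches are summed before each optimizer step, so the effective batch size is preserved while peak memory drops. A generic PyTorch sketch of the idea (not the project's training loop):

```python
import torch

def train_epoch(model, loader, optimizer, accum_steps: int = 4):
    """One epoch with gradient accumulation: update every `accum_steps` micro-batches."""
    model.train()
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        (loss / accum_steps).backward()  # scale so the update matches one large batch
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```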
### 2. Poor LoRA Performance
- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase training steps
- Add more target modules
### 3. Training Not Converging
- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality
### 4. LoRA Not Taking Effect at Inference
- Check that `lora_config.json` exists in the checkpoint directory
- Check the `load_lora()` return value: `skipped_keys` should be empty
- Verify that `set_lora_enabled(True)` is called
### 5. Checkpoint Loading Errors
- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain:
  - `lora_weights.safetensors` (or `lora_weights.ckpt`) - LoRA weights
  - `lora_config.json` - LoRA config and base model path