Initial commit with large files ignored

commit 1e44eba871 (parent a266c0a88d) · admin · 2025-12-11 00:12:18 +08:00
17 changed files with 179286 additions and 329 deletions

---
# VoxCPM Fine-tuning Guide
This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.
### 🎓 SFT (Supervised Fine-Tuning)
Full fine-tuning updates all model parameters. Suitable for:
- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed
### ⚡ LoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping
## Table of Contents
- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)
---
## Quick Start: WebUI
For users who prefer a graphical interface, we provide `lora_ft_webui.py` - a comprehensive WebUI for training and inference:
### Launch WebUI
```bash
python lora_ft_webui.py
```
Then open `http://localhost:7860` in your browser.
### Features
- **🚀 Training Tab**: Configure and start LoRA training with an intuitive interface
- Set training parameters (learning rate, batch size, LoRA rank, etc.)
- Monitor training progress in real-time
- Resume training from existing checkpoints
- **🎵 Inference Tab**: Generate audio with trained models
- Automatic base model loading from LoRA checkpoint config
- Voice cloning with automatic ASR (reference text recognition)
- Hot-swap between multiple LoRA models
- Zero-shot TTS without reference audio
## Data Preparation
Training data should be prepared as a JSONL manifest file, with one sample per line:
```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
{"audio": "path/to/audio1.wav", "text": "音频1的文本内容。"}
{"audio": "path/to/audio2.wav", "text": "音频2的文本内容。"}
{"audio": "path/to/audio3.wav", "text": "可选的时长字段。", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "多数据集训练可选的 dataset_id", "dataset_id": 1}
```
### Required Fields
| Field | Description |
|-------|-------------|
| `audio` | Path to audio file (absolute or relative) |
| `text` | Corresponding transcript |
### Optional Fields
| Field | Description |
|-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |
### Requirements
### 要求
- Audio format: WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
- Text: Transcript matching the audio content
See `examples/train_data_example.jsonl` for a complete example.
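If you build the manifest programmatically, a minimal sketch like the following can write and sanity-check it using only the Python standard library (the paths and transcripts here are hypothetical):

```python
import json
import wave

def build_manifest(samples: dict[str, str], out_path: str) -> None:
    """Write a JSONL manifest from a {wav_path: transcript} mapping,
    filling the optional `duration` field from the WAV header."""
    with open(out_path, "w", encoding="utf-8") as out:
        for wav_path, text in samples.items():
            with wave.open(wav_path, "rb") as w:
                duration = w.getnframes() / w.getframerate()
            row = {"audio": wav_path, "text": text, "duration": round(duration, 2)}
            out.write(json.dumps(row, ensure_ascii=False) + "\n")

# Hypothetical usage:
build_manifest({"data/audio1.wav": "Transcript of audio 1."}, "train_data.jsonl")
```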
---
## Full Fine-tuning
Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.
### Configuration
Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
# ...
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.00001  # Use a smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
# ...
lambdas:
  loss/stop: 1.0
```
### Training
```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```
### Checkpoint Structure
Full fine-tuning saves a complete model directory that can be loaded directly:
```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors        # Model weights (excluding audio_vae)
    ├── config.json              # Model config
    ├── audiovae.pth             # Audio VAE weights
    ├── tokenizer.json           # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── ...
```
---
## LoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.
### Configuration
Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:
```yaml
pretrained_path: /path/to/VoxCPM1.5/
# ...
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.0001  # LoRA can use a larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
# ...
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true     # Apply LoRA to the language model
  enable_dit: true    # Apply LoRA to the Diffusion Transformer
  enable_proj: false  # Apply LoRA to the projection layers (optional)
  r: 32               # LoRA rank (higher = more capacity)
  alpha: 16           # LoRA alpha, scaling = alpha / r
  dropout: 0.0
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
  # Distribution options (optional)
  # hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
  # distribute: true                  # If true, save hf_model_id in lora_config.json
```
### LoRA Parameters
| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA to | Attention layers |
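To make the `alpha` and `r` relationship concrete, here is a minimal, generic sketch of how a LoRA-augmented linear layer applies `scaling = alpha / r` (an illustration of the standard technique, not VoxCPM's internal implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)  # LoRA starts as a no-op
        self.scaling = alpha / r            # e.g. 16 / 32 = 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

With `r: 32, alpha: 16` the update is scaled by 0.5; raising `r` adds capacity, while `alpha` controls how strongly the adapter perturbs the base weights.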
### Distribution Options (Optional)
| Parameter | Description | Default |
|-----------|-------------|---------|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in the checkpoint; otherwise save the local `pretrained_path` | `false` |
> **Note**: If `distribute: true`, `hf_model_id` is required.
### Training
```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
### Checkpoint Structure
LoRA training saves LoRA parameters and configuration:
```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```
The `lora_config.json` contains:
```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  ...
}
```
The `base_model` field contains:
- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)
This allows loading LoRA checkpoints without the original training config file.
---
## Inference
### Full Fine-tuning Inference
The checkpoint directory is a complete model; load it directly:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```
With voice cloning:
```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is the voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
### LoRA Inference
LoRA inference only requires the checkpoint directory (base model path and LoRA config are read from `lora_config.json`):
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is the LoRA fine-tuned result." \
    --output lora_output.wav
```
With voice cloning:
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```
Override the base model path (optional):
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use a different base model." \
    --output output.wav
```
---
## LoRA Hot-swapping
LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.
### API Reference
```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load the model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    # ...
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or a local path
    load_denoiser=False,              # Optional: disable the denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is the LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use the base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get the current LoRA weights
lora_state = model.get_lora_state_dict()
```
### Simplified Usage (Load from lora_config.json)
If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:
```python
import json
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load the config from the checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load the model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    # ...
)
```
Or use the test script directly:
```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```
### Method Reference
| Method | Description | torch.compile Compatible |
|--------|-------------|--------------------------|
| `load_lora(path)` | Load LoRA weights from a file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to their initial values | ✅ |
| `get_lora_state_dict()` | Get the current LoRA weights | ✅ |
| `lora_enabled` | Property: check whether LoRA is configured | ✅ |
---
## FAQ
### 1. Out of Memory (OOM)
- Increase `grad_accum_steps` (gradient accumulation; see the sketch after this list)
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter out long samples
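For reference, gradient accumulation keeps the effective batch size at `batch_size * grad_accum_steps` while holding only one micro-batch in memory at a time. A generic PyTorch sketch of the pattern (illustrative only; `compute_loss`, `model`, `dataloader`, and `optimizer` are stand-ins, not the project's actual training loop):

```python
# Illustrative gradient-accumulation pattern.
# Effective batch size = batch_size * grad_accum_steps.
grad_accum_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = compute_loss(model, batch) / grad_accum_steps  # average across micro-batches
    loss.backward()                                       # gradients accumulate in .grad
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```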
### 2. Poor LoRA Performance
- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase training steps
- Add more target modules
### 3. Training Not Converging
- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality
### 4. LoRA Not Taking Effect at Inference
- Check that `lora_config.json` exists in the checkpoint directory
- Check the `load_lora()` return value; `skipped_keys` should be empty
- Verify that `set_lora_enabled(True)` is called
### 5. Checkpoint Loading Errors
- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain:
  - `lora_weights.safetensors` (or `lora_weights.ckpt`) - the LoRA weights
  - `lora_config.json` - the LoRA config and base model path
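A quick way to diagnose loading errors is to check the directory contents before loading; here is a small sketch (file names taken from the checkpoint layouts above):

```python
from pathlib import Path

def missing_files(ckpt_dir: str, kind: str) -> list[str]:
    """Return files missing from a 'full' or 'lora' checkpoint directory."""
    required = {"full": ["config.json", "audiovae.pth"],
                "lora": ["lora_config.json"]}[kind]
    weights = {"full": ("model.safetensors", "pytorch_model.bin"),
               "lora": ("lora_weights.safetensors", "lora_weights.ckpt")}[kind]
    d = Path(ckpt_dir)
    missing = [f for f in required if not (d / f).exists()]
    if not any((d / w).exists() for w in weights):
        missing.append(" or ".join(weights))
    return missing

print(missing_files("checkpoints/finetune_lora/step_0002000", "lora"))  # [] if complete
```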

---
# 📊 Performance Highlights
VoxCPM achieves competitive results on public zero-shot TTS benchmarks.
## Seed-TTS-eval Benchmark
| Model | Parameters | Open-Source | test-EN | | test-ZH | | test-Hard | |
|------|------|------|:------------:|:--:|:------------:|:--:|:-------------:|:--:|
| | | | WER/%⬇ | SIM/%⬆ | CER/%⬇ | SIM/%⬆ | CER/%⬇ | SIM/%⬆ |
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| ... | | | | | | | | |
| **VoxCPM** | 0.5B | ✅ | **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 |
## CV3-eval Benchmark
| Model | zh | en | hard-zh | | | hard-en | | |
|-------|:--:|:--:|:-------:|:--:|:--:|:-------:|:--:|:--:|
| | CER/%⬇ | WER/%⬇ | CER/%⬇ | SIM/%⬆ | DNSMOS⬆ | WER/%⬇ | SIM/%⬆ | DNSMOS⬆ |
| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - |
| ... | | | | | | | | |
| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 |
| **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 |

---
# VoxCPM1.5 Release Notes
**Release Date:** December 5, 2025
## 🎉 Overview
We're thrilled to introduce a major upgrade that improves the audio quality and efficiency of VoxCPM while maintaining its core capabilities of context-aware speech generation and zero-shot voice cloning.
| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |
## 🎵 Model Updates
### 🔊 AudioVAE Sampling Rate: 16kHz → 44.1kHz
The AudioVAE now supports a 44.1kHz sampling rate, which allows the model to:
- 🎯 Clone voices more faithfully, preserving more high-frequency detail and generating higher-quality speech output
*Note: This upgrade enables higher quality generation when using high-quality reference audio, but it does not guarantee that all generated audio will be high-fidelity. The output quality depends on the **prompt speech** quality.*
### ⚡ Token Rate: 12.5Hz → 6.25Hz
We reduced the token rate in the LM backbone from 12.5Hz to 6.25Hz (LocEnc & LocDiT patch size increased from 2 to 4) while maintaining similar performance on evaluation benchmarks. This change:
- 💨 Reduces computational requirements for generating the same length of audio
- 📈 Provides a foundation for longer audio generation
- 🏗️ Paves the way for training larger models in the future
**Model Architecture Clarification**: The core architecture of VoxCPM1.5 remains unchanged from the technical report. The key modification is adjusting the patch size of the local modules (LocEnc & LocDiT) from 2 to 4, which reduces the LM processing rate from 12.5Hz to 6.25Hz. Since the local modules now need to handle longer contexts, we expanded their network depth, resulting in a slightly larger overall model parameter count.
**Generation Speed Clarification**: Although the model parameters have increased, VoxCPM1.5 only requires 6.25 tokens to generate 1 second of audio (compared to 12.5 tokens in the previous version). While the displayed generation speed (xx it/s) may appear slower, the actual Real-Time Factor (RTF = audio duration / processing time) shows no difference or may even be faster.
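As a back-of-the-envelope check of that arithmetic (the iteration speeds below are made-up numbers, purely for illustration):

```python
# Tokens required for 10 seconds of audio at each LM token rate
seconds = 10
tokens_old = seconds * 12.5  # VoxCPM:    125 tokens
tokens_new = seconds * 6.25  # VoxCPM1.5:  62.5 tokens (half as many)

# RTF as defined above: audio duration / processing time (higher = faster).
# Even at a lower displayed it/s, half the tokens can yield an equal or better RTF.
rtf_old = seconds / (tokens_old / 25.0)  # hypothetical 25 it/s -> RTF 2.0
rtf_new = seconds / (tokens_new / 15.0)  # hypothetical 15 it/s -> RTF 2.4
print(rtf_old, rtf_new)
```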
## 🔧 Fine-tuning Support
We now support both full fine-tuning and LoRA fine-tuning; please see the [Fine-tuning Guide](finetune.md) for detailed instructions.
## 📚 Documentation
- Updated README with version comparison
- Added comprehensive fine-tuning guide
- Improved code comments and documentation
## 🙏 Our Thanks to You
This release wouldn't be possible without the incredible feedback, testing, and contributions from our open-source community. Thank you for helping shape VoxCPM1.5!
## 📞 Let's Build Together
Questions, ideas, or want to contribute?
- 🐛 Report an issue: [GitHub Issues on OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM/issues)
- 📖 Dig into the docs: Check the [docs/](../docs/) folder for guides and API details
Enjoy the richer sound and powerful new features of VoxCPM1.5! 🎉
We can't wait to hear what you create next! 🥂
## 🚀 What We're Working On
We're continuously improving VoxCPM and working on exciting new features:
- 🌍 **Multilingual TTS Support**: We are actively developing support for languages beyond Chinese and English.
- 🎯 **Controllable Expressive Speech Generation**: We are researching controllable speech generation that allows fine-grained control over speech attributes (emotion, timbre, prosody, etc.) through natural language instructions.
- 🎵 **Universal Audio Generation Foundation**: We also hope to explore VoxCPM as a unified audio generation foundation model capable of joint generation of speech, music, and sound effects. However, this is a longer-term vision.
**📅 Next Release**: We plan to release the next version in Q1 2026, which will include significant improvements and new features. Stay tuned for updates! We're committed to making VoxCPM even more powerful and versatile.
## ❓ Frequently Asked Questions (FAQ)
### Q: Does VoxCPM support fine-tuning for personalized voice customization?
**A:** Yes! VoxCPM now supports both full fine-tuning (SFT) and efficient LoRA fine-tuning. You can train personalized voice models on your own data. Please refer to the [Fine-tuning Guide](finetune.md) for detailed instructions and examples.
### Q: Is 16kHz audio quality sufficient for my use case?
**A:** We have upgraded the AudioVAE to support 44.1kHz sampling rate in VoxCPM1.5, which provides higher quality audio output with better preservation of high-frequency details. This upgrade enables better voice cloning quality and more natural speech synthesis when using high-quality reference audio.
### Q: Has the stability issue been resolved?
**A:** We have made stability optimizations in VoxCPM1.5, including improvements to the inference code logic, training data, and model architecture. Based on community feedback, we collected some stability issues such as:
- Increased noise and reverberation
- Audio artifacts (e.g., howling/squealing)
- Unstable speaking rate (speeding up)
- Volume fluctuations (increases or decreases)
- Noise artifacts at the beginning and end of audio
- Synthesis issues with very short texts (e.g., "hello")
**What we've improved:**
- By adjusting inference code logic and optimizing training data, we have largely fixed the beginning/ending artifacts.
- By reducing the LM processing rate (12.5Hz → 6.25Hz), we have improved stability on longer speech generation cases.
**What remains:** We acknowledge that long speech stability issues have not been completely resolved. Particularly for highly expressive or complex reference speech, error accumulation during autoregressive generation can still occur. We will continue to analyze and optimize this in future versions.
### Q: Does VoxCPM plan to support multilingual TTS?
**A:** Currently, VoxCPM is primarily trained on Chinese and English data. We are actively researching and developing multilingual TTS support for more languages beyond Chinese and English. Please let us know what languages you'd like to see supported!
### Q: Does VoxCPM plan to support controllable generation (emotion, style, fine-grained control)?
**A:** Currently, VoxCPM only supports zero-shot voice cloning and context-aware speech generation. Direct control over specific speech attributes (emotion, style, fine-grained prosody) is limited. However, we are actively researching instruction-controllable expressive speech generation with fine-grained control capabilities, working towards a human instruction-to-speech generation model!
### Q: Does VoxCPM support different hardware chips (e.g., Ascend 910B, XPU, NPU)?
**A:** Currently, we have not yet adapted VoxCPM for different hardware chips. Our main focus remains on developing new model capabilities and improving stability. We encourage you to check if community developers have done similar work, and we warmly welcome everyone to contribute and promote such adaptations together!
These features are under active development, and we look forward to sharing updates in future releases!

---
# 👩‍🍳 A Voice Chef's Guide
Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let's begin.
---
## 🥚 Step 1: Prepare Your Base Ingredients (Content)
First, choose how you'd like to input your text:
### 1. Regular Text (Classic Mode)
- Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.
### 2. Phoneme Input (Native Mode)
- Turn "Text Normalization" OFF. Enter phoneme text like `{HH AH0 L OW1}` (EN) or `{ni3}{hao3}` (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text—try it out!
- **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin; for English, using CMUDict. Please refer to the relevant documentation for more details, and see the sketch below for a programmatic example.
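If you want to generate the Chinese phoneme tags programmatically, a small sketch using the `pypinyin` package (our tooling assumption; check the output against the project's phoneme documentation) could look like this:

```python
from pypinyin import Style, lazy_pinyin  # pip install pypinyin

def to_phoneme_tags(text: str) -> str:
    """Convert Chinese text to {pinyin+tone} tags, e.g. 你好 -> {ni3}{hao3}."""
    return "".join(f"{{{syllable}}}" for syllable in lazy_pinyin(text, style=Style.TONE3))

print(to_phoneme_tags("你好"))  # {ni3}{hao3}
```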
---
## 🍳 Step 2: Choose Your Flavor Profile (Voice Style)
This is the secret sauce that gives your audio its unique sound.
### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
- **For a Clean, Denoised Voice:**
  - Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone. However, it limits the audio sampling rate to 16kHz, capping the cloning quality ceiling.
- **For High-Quality Audio Cloning (Up to 44.1kHz):**
  - Disable "Prompt Speech Enhancement" to preserve all of the original audio information, including the background atmosphere, and support audio cloning at up to a 44.1kHz sampling rate.
### 2. Cooking au Naturel (Letting the Model Improvise)
- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
- **Pro Tip**: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results!
---
## 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)
You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.
### CFG Value (How Closely to Follow the Recipe)
- **Default**: A great starting point (see the guidance sketch after this list).
- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
- **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.
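For intuition, classifier-free guidance (CFG) runs a text-conditioned and an unconditioned prediction at each step and pushes the output toward the conditioned one; the CFG value plays the role of `cfg_scale` in this generic sketch (the standard formulation, not VoxCPM's internal code):

```python
import torch

def cfg_combine(cond: torch.Tensor, uncond: torch.Tensor, cfg_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance blend: cfg_scale = 1.0 means no extra
    guidance; larger values follow the text conditioning more strictly."""
    return uncond + cfg_scale * (cond - uncond)
```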
### Inference Timesteps (Simmering Time: Quality vs. Speed)
- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
---
Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!