Compare commits

...

10 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| admin | 1e44eba871 | Initial commit with large files ignored | 2025-12-11 00:12:18 +08:00 |
| 刘鑫 | a266c0a88d | add lora funetine webUI; optimize lora save and load logic | 2025-12-09 21:34:39 +08:00 |
| Labmem-Zhouyx | 0779a93697 | Merge branch 'main' of https://github.com/OpenBMB/VoxCPM | 2025-12-07 02:02:08 +08:00 |
| Labmem-Zhouyx | a1f9d0c3b6 | Update: release note | 2025-12-07 01:59:53 +08:00 |
| xliucs | aefba63f71 | Merge pull request #98 from Ayin1412/main (fix incorrect arguments passed in the lora/ft test code) | 2025-12-06 17:38:19 +08:00 |
| Ayin1412 | 58717d7d82 | Fix incorrect arguments passed in the lora/ft test code | 2025-12-06 14:49:35 +08:00 |
| Labmem-Zhouyx | 1b0ff5693c | Update: model parameters | 2025-12-06 01:22:30 +08:00 |
| Labmem-Zhouyx | 762815a5b7 | Update: user guides | 2025-12-05 23:57:43 +08:00 |
| Labmem-Zhouyx | 5b13a35ea6 | Update: gradio description | 2025-12-05 23:47:35 +08:00 |
| Labmem-Zhouyx | 3ba727a615 | Update: gradio description | 2025-12-05 23:38:04 +08:00 |
25 changed files with 180828 additions and 335 deletions

10
.gitignore vendored

@@ -2,3 +2,13 @@ launch.json
 __pycache__
 voxcpm.egg-info
 .DS_Store
+*.safetensors
+*.pth
+*.pt
+*.ckpt
+*.bin
+*.pyc
+.trae/
+.vscode/
+.idea/
+*.log

README.md

@@ -44,13 +44,13 @@ Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses
 ### 📦 Model Versions
 See [Release Notes](docs/release_note.md) for details
 - **VoxCPM1.5** (Latest):
-  - Model Params: 750M
+  - Model Params: 800M
   - Sampling rate of AudioVAE: 44100
   - Token rate in LM Backbone: 6.25Hz (patch-size=4)
   - RTF in a single NVIDIA-RTX 4090 GPU: ~0.15
 - **VoxCPM-0.5B** (Original):
-  - Model Params: 600M
+  - Model Params: 640M
   - Sampling rate of AudioVAE: 16000
   - Token rate in LM Backbone: 12.5Hz (patch-size=2)
   - RTF in a single NVIDIA-RTX 4090 GPU: 0.17
@@ -210,6 +210,8 @@ We're excited to see the VoxCPM community growing! Here are some amazing project
 - **[VoxCPM-NanoVLLM](https://github.com/a710128/nanovllm-voxcpm)** NanoVLLM integration for VoxCPM for faster, high-throughput inference on GPU.
 - **[VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)** ONNX export for VoxCPM supports faster CPU inference.
 - **[VoxCPMANE](https://github.com/0seba/VoxCPMANE)** VoxCPM TTS with Apple Neural Engine backend server.
+- **[PR: LoRA finetune web UI (by Ayin1412)](https://github.com/OpenBMB/VoxCPM/pull/100)**
+- **[voxcpm_rs](https://github.com/madushan1000/voxcpm_rs)** A re-implementation of VoxCPM-0.5B in Rust.
 *Note: The projects are not officially maintained by OpenBMB.*

200
README_zh.md Normal file

@@ -0,0 +1,200 @@
# 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
[![Project Page](https://img.shields.io/badge/Project%20Page-GitHub-blue)](https://github.com/OpenBMB/VoxCPM/) [![Technical Report](https://img.shields.io/badge/Technical%20Report-Arxiv-red)](https://arxiv.org/abs/2509.24650)[![Live Playground](https://img.shields.io/badge/Live%20PlayGround-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [![Samples](https://img.shields.io/badge/Audio%20Samples-Page-green)](https://openbmb.github.io/VoxCPM-demopage)
<div align="center">
<img src="assets/voxcpm_logo.png" alt="VoxCPM Logo" width="40%">
</div>
<div align="center">
👋 Join us on [WeChat](assets/wechat.png)
</div>
## News
* [2025.12.05] 🎉 🎉 🎉 Open-sourced the **VoxCPM1.5** [weights](https://huggingface.co/openbmb/VoxCPM1.5)! The model now supports full-parameter fine-tuning and efficient LoRA fine-tuning, so you can create your own customized version. See the [Release Notes](docs/release_note.md) for details.
* [2025.09.30] 🔥 🔥 🔥 Released the VoxCPM [technical report](https://arxiv.org/abs/2509.24650)!
* [2025.09.16] 🔥 🔥 🔥 Open-sourced the VoxCPM-0.5B [weights](https://huggingface.co/openbmb/VoxCPM-0.5B)!
* [2025.09.16] 🎉 🎉 🎉 Launched the VoxCPM-0.5B [Gradio PlayGround](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo), try it now!
## Overview
VoxCPM is a novel tokenizer-free text-to-speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: **context-aware speech generation** and **true-to-life zero-shot voice cloning**.
Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.
<div align="center">
<img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="90%">
</div>
### 🚀 Key Features
- **Context-Aware, Expressive Speech Generation** - VoxCPM understands the text, infers appropriate prosody, and produces expressive, natural-sounding speech. It spontaneously adapts its speaking style to the content; trained on a 1.8 million-hour bilingual corpus, it delivers vocal expression that closely fits the text.
- **True-to-Life Voice Cloning** - With only a short reference clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker's timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and speaking rate, creating a faithful and natural replica.
- **High-Efficiency Synthesis** - VoxCPM supports streaming synthesis with a real-time factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, making real-time applications possible.
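For a sense of scale, an RTF of 0.17 means that synthesizing 10 seconds of audio takes roughly 10 × 0.17 ≈ 1.7 seconds of GPU time.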
### 📦 Model Versions
See the [Release Notes](docs/release_note.md) for details
- **VoxCPM1.5** (Latest):
  - Model Params: 800M
  - Sampling rate of AudioVAE: 44100
  - Token rate in LM Backbone: 6.25Hz (patch-size=4)
  - RTF on a single NVIDIA RTX 4090 GPU: ~0.15
- **VoxCPM-0.5B** (Original):
  - Model Params: 640M
  - Sampling rate of AudioVAE: 16000
  - Token rate in LM Backbone: 12.5Hz (patch-size=2)
  - RTF on a single NVIDIA RTX 4090 GPU: 0.17
## Quick Start
### 🔧 Install from PyPI
```bash
pip install voxcpm
```
### 1. Model Download (Optional)
By default, the model is downloaded automatically the first time you run a script, but you can also download it in advance.
- Download VoxCPM1.5
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM1.5")
```
- Or download VoxCPM-0.5B
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
```
- Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for ASR (automatic speech recognition) of the speech prompt in the web demo.
```python
from modelscope import snapshot_download
snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
snapshot_download('iic/SenseVoiceSmall')
```
### 2. Basic Usage (Python)
```python
import soundfile as sf
import numpy as np
from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")
# Non-streaming generation
wav = model.generate(
    text="VoxCPM 是 ModelBest 推出的一款创新型端到端 TTS 模型,旨在生成极具表现力的语音。",
    prompt_wav_path=None,      # optional: path to a prompt speech file for voice cloning
    prompt_text=None,          # optional: transcript of the prompt speech
    cfg_value=2.0,             # LM guidance scale on LocDiT; higher adheres more closely to the prompt but may reduce naturalness
    inference_timesteps=10,    # LocDiT inference steps; higher = better quality, lower = faster
    normalize=False,           # enable the external text-normalization tool (disables native raw-text support)
    denoise=False,             # enable the external denoiser; may introduce some distortion and caps the sample rate at 16kHz
    retry_badcase=True,        # enable retry mode for certain bad cases (not interruptible)
    retry_badcase_max_times=3, # maximum number of retries
    retry_badcase_ratio_threshold=6.0, # maximum length ratio for bad-case detection (simple but effective); adjust for slow-paced speech
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
print("已保存: output.wav")
# Streaming generation
chunks = []
for chunk in model.generate_streaming(
    text = "使用 VoxCPM 进行流式文本转语音非常简单!",
    # supports the same parameters as above
):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("output_streaming.wav", wav, model.tts_model.sample_rate)
print("已保存: output_streaming.wav")
```
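Streaming is useful when you want playback to begin before the full waveform is ready. Below is a minimal sketch of chunk-by-chunk playback, assuming the optional third-party `sounddevice` package is installed (it is not a VoxCPM dependency):

```python
import sounddevice as sd  # assumption: pip install sounddevice
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")
sr = model.tts_model.sample_rate

# Write each chunk to the audio device as soon as it is generated.
with sd.OutputStream(samplerate=sr, channels=1, dtype="float32") as stream:
    for chunk in model.generate_streaming(text="Streaming playback example."):
        stream.write(chunk.astype("float32").reshape(-1, 1))
```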
### 3. Command-Line (CLI) Usage
After installation, the entry point is `voxcpm` (or use `python -m voxcpm.cli`).
```bash
# 1) Direct synthesis (single text)
voxcpm --text "VoxCPM 是一款创新的端到端 TTS 模型。" --output out.wav
# 2) Voice cloning (prompt audio + prompt text)
voxcpm --text "VoxCPM 是一款创新的端到端 TTS 模型。" \
  --prompt-audio path/to/voice.wav \
  --prompt-text "参考音频的文本内容" \
  --output out.wav \
  # --denoise
# (Optional) Voice cloning (prompt audio + text file)
voxcpm --text "VoxCPM 是一款创新的端到端 TTS 模型。" \
  --prompt-audio path/to/voice.wav \
  --prompt-file "/path/to/text-file" \
  --output out.wav \
  # --denoise
# 3) Batch processing (one text per line)
voxcpm --input examples/input.txt --output-dir outs
# (Optional) Batch + cloning
voxcpm --input examples/input.txt --output-dir outs \
  --prompt-audio path/to/voice.wav \
  --prompt-text "参考音频的文本内容" \
  # --denoise
# 4) Inference parameters (quality/speed)
voxcpm --text "..." --output out.wav \
  --cfg-value 2.0 --inference-timesteps 10 --normalize
# 5) Model loading
# Prefer a local path
voxcpm --text "..." --output out.wav --model-path /path/to/VoxCPM_model_dir
# Or load from Hugging Face (auto-download/cache)
voxcpm --text "..." --output out.wav \
  --hf-model-id openbmb/VoxCPM1.5 --cache-dir ~/.cache/huggingface --local-files-only
# 6) Denoiser control
voxcpm --text "..." --output out.wav \
  --no-denoiser --zipenhancer-path iic/speech_zipenhancer_ans_multiloss_16k_base
# 7) Help
voxcpm --help
python -m voxcpm.cli --help
```
### 4. Launch the Web Demo
You can run `python app.py` to start a UI that lets you perform voice cloning and voice creation.
```bash
python app.py
```
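When deploying the demo (e.g., on a Space), `app.py` also reads an optional `HF_REPO_ID` environment variable; if it is set, the referenced repository is downloaded into a local `models/` directory before the UI launches. A minimal sketch:

```bash
# Optionally pre-select the checkpoint the demo should fetch into ./models/
HF_REPO_ID=openbmb/VoxCPM1.5 python app.py
```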
### 5. Fine-tuning
VoxCPM1.5 supports full-parameter fine-tuning (SFT) and LoRA fine-tuning, so you can train a personalized voice model on your own data. See the [Fine-tuning Guide](docs/finetune.md) for detailed instructions.
**Quick start:**
```bash
# Full-parameter fine-tuning
python scripts/train_voxcpm_finetune.py \
--config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
# LoRA fine-tuning
python scripts/train_voxcpm_finetune.py \
--config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
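Training data is a JSONL manifest with one sample per line, as described in the [Fine-tuning Guide](docs/finetune.md); the file names below are placeholders:

```jsonl
{"audio": "data/spk1_0001.wav", "text": "Transcript of the first clip."}
{"audio": "data/spk1_0002.wav", "text": "A clip with the optional duration field.", "duration": 3.5}
```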
## 📚 Documentation
- **[Usage Guide](docs/usage_guide.md)** - A detailed guide to using VoxCPM effectively, covering text input modes, voice-cloning tips, and parameter tuning.
- **[Fine-tuning Guide](docs/finetune.md)** - A complete guide to fine-tuning VoxCPM models with SFT and LoRA.
- **[Release Notes](docs/release_note.md)** - Version history and updates.
- **[Performance Benchmarks](docs/performance.md)** - Detailed performance comparisons on public benchmarks.
## ⚠️ Risks and Limitations
- **General model behavior**: Although VoxCPM has been trained on large-scale data, it may still produce unexpected or biased outputs, or outputs containing artifacts.
- **Potential misuse of voice cloning**: VoxCPM's powerful zero-shot voice-cloning capability could be misused.
---

86
app.py

@@ -45,7 +45,8 @@ class VoxCPMDemo:
 repo_id = os.environ.get("HF_REPO_ID", "").strip()
 if len(repo_id) > 0:
     target_dir = os.path.join("models", repo_id.replace("/", "__"))
-    if not os.path.isdir(target_dir):
+    # Check if directory exists AND contains config.json
+    if not os.path.isdir(target_dir) or not os.path.exists(os.path.join(target_dir, "config.json")):
         try:
             from huggingface_hub import snapshot_download  # type: ignore
             os.makedirs(target_dir, exist_ok=True)
@@ -155,45 +156,33 @@ def create_demo_interface(demo: VoxCPMDemo):
gr.HTML('<div class="logo-container"><img src="/gradio_api/file=assets/voxcpm_logo.png" alt="VoxCPM Logo"></div>') gr.HTML('<div class="logo-container"><img src="/gradio_api/file=assets/voxcpm_logo.png" alt="VoxCPM Logo"></div>')
# Quick Start # Quick Start
with gr.Accordion("📋 Quick Start Guide 快速入门", open=False, elem_id="acc_quick"): with gr.Accordion("📋 快速入门", open=False, elem_id="acc_quick"):
gr.Markdown(""" gr.Markdown("""
### How to Use 使用说明 ### 使用说明
1. **(Optional) Provide a Voice Prompt** - Upload or record an audio clip to provide the desired voice characteristics for synthesis. 1. **(可选)提供参考声音** - 上传或录制一段音频,为声音合成提供音色、语调和情感等个性化特征。
**(可选)提供参考声音** - 上传或录制一段音频,为声音合成提供音色、语调和情感等个性化特征 2. **(可选)输入参考文本** - 如果提供了参考语音,请输入其对应的文本内容(支持自动识别)。
2. **(Optional) Enter prompt text** - If you provided a voice prompt, enter the corresponding transcript here (auto-recognition available). 3. **输入目标文本** - 输入您希望模型朗读的文字内容。
**(可选项)输入参考文本** - 如果提供了参考语音,请输入其对应的文本内容(支持自动识别) 4. **生成语音** - 点击"生成语音"按钮,即可为您创造出音频
3. **Enter target text** - Type the text you want the model to speak.
**输入目标文本** - 输入您希望模型朗读的文字内容。
4. **Generate Speech** - Click the "Generate" button to create your audio.
**生成语音** - 点击"生成"按钮,即可为您创造出音频。
""") """)
# Pro Tips # Pro Tips
with gr.Accordion("💡 Pro Tips 使用建议", open=False, elem_id="acc_tips"): with gr.Accordion("💡 使用建议", open=False, elem_id="acc_tips"):
gr.Markdown(""" gr.Markdown("""
### Prompt Speech Enhancement参考语音降噪 ### 参考语音降噪
- **Enable** to remove background noise for a clean, studio-like voice, with an external ZipEnhancer component. - **启用**:通过 ZipEnhancer 组件消除背景噪音但会将音频采样率限制在16kHz限制克隆上限。
**用**通过 ZipEnhancer 组件消除背景噪音,获得更好的音质 - **用**保留原始音频的全部信息包括背景环境声最高支持44.1kHz的音频复刻
- **Disable** to preserve the original audio's background atmosphere.
**禁用**:保留原始音频的背景环境声,如果想复刻相应声学环境。
### Text Normalization文本正则化 ### 文本正则化
- **Enable** to process general text with an external WeTextProcessing component. - **启用**:使用 WeTextProcessing 组件,可支持常见文本的正则化处理。
**用**:使用 WeTextProcessing 组件,可处理常见文本。 - **用**使用 VoxCPM 内置的文本理解能力。如,支持音素输入(如中文转拼音:{ni3}{hao3}英文转CMUDict{HH AH0 L OW1})和公式符号合成,尝试一下!
- **Disable** to use VoxCPM's native text understanding ability. For example, it supports phonemes input ({HH AH0 L OW1}), try it!
**禁用**:将使用 VoxCPM 内置的文本理解能力。如,支持音素输入(如 {da4}{jia1}好)和公式符号合成,尝试一下!
### CFG ValueCFG ### CFG 值
- **Lower CFG** if the voice prompt sounds strained or expressive. - **调低**:如果提示语音听起来不自然或过于夸张,或者长文本输入出现稳定性问题。
**调**如果提示语音听起来不自然或过于夸张 - **调**为更好地贴合提示音频的风格或输入文本, 或者极短文本输入出现稳定性问题
- **Higher CFG** for better adherence to the prompt speech style or input text.
**调高**:为更好地贴合提示音频的风格或输入文本。
### Inference Timesteps推理时间步 ### 推理时间步
- **Lower** for faster synthesis speed. - **调低**:合成速度更快。
**调**:合成速度更快 - **调**:合成质量更佳
- **Higher** for better synthesis quality.
**调高**:合成质量更佳。
""") """)
# Main controls # Main controls
@@ -202,22 +191,22 @@ def create_demo_interface(demo: VoxCPMDemo):
prompt_wav = gr.Audio( prompt_wav = gr.Audio(
sources=["upload", 'microphone'], sources=["upload", 'microphone'],
type="filepath", type="filepath",
label="Prompt Speech (Optional, or let VoxCPM improvise)", label="参考语音(可选,或让 VoxCPM 自由发挥)",
value="./examples/example.wav", value="./examples/example.wav",
) )
DoDenoisePromptAudio = gr.Checkbox( DoDenoisePromptAudio = gr.Checkbox(
value=False, value=False,
label="Prompt Speech Enhancement", label="参考语音增强",
elem_id="chk_denoise", elem_id="chk_denoise",
info="We use ZipEnhancer model to denoise the prompt audio." info="使用 ZipEnhancer 模型对参考音频进行降噪。"
) )
with gr.Row(): with gr.Row():
prompt_text = gr.Textbox( prompt_text = gr.Textbox(
value="Just by listening a few minutes a day, you'll be able to eliminate negative thoughts by conditioning your mind to be more positive.", value="Just by listening a few minutes a day, you'll be able to eliminate negative thoughts by conditioning your mind to be more positive.",
label="Prompt Text", label="参考文本",
placeholder="Please enter the prompt text. Automatic recognition is supported, and you can correct the results yourself..." placeholder="请输入参考文本。支持自动识别,您也可以自行修改结果..."
) )
run_btn = gr.Button("Generate Speech", variant="primary") run_btn = gr.Button("生成语音", variant="primary")
with gr.Column(): with gr.Column():
cfg_value = gr.Slider( cfg_value = gr.Slider(
@@ -225,30 +214,31 @@ def create_demo_interface(demo: VoxCPMDemo):
maximum=3.0, maximum=3.0,
value=2.0, value=2.0,
step=0.1, step=0.1,
label="CFG Value (Guidance Scale)", label="CFG 值 (引导比例)",
info="Higher values increase adherence to prompt, lower values allow more creativity" info="值越高越贴合提示,值越低允许更多的创造性"
) )
inference_timesteps = gr.Slider( inference_timesteps = gr.Slider(
minimum=4, minimum=4,
maximum=30, maximum=30,
value=10, value=10,
step=1, step=1,
label="Inference Timesteps", label="推理时间步",
info="Number of inference timesteps for generation (higher values may improve quality but slower)" info="生成的推理时间步数(值越高可能质量越好,但速度更慢)"
) )
with gr.Row(): with gr.Row():
text = gr.Textbox( text = gr.Textbox(
value="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly realistic speech.", value="VoxCPM 是 ModelBest 推出的一款创新型端到端 TTS 模型,旨在生成极具表现力的语音。",
label="Target Text", label="目标文本",
) )
with gr.Row(): with gr.Row():
DoNormalizeText = gr.Checkbox( DoNormalizeText = gr.Checkbox(
value=False, value=False,
label="Text Normalization", label="文本正则化",
elem_id="chk_normalize", elem_id="chk_normalize",
info="We use wetext library to normalize the input text." info="使用 wetext 库对输入文本进行标准化。"
) )
audio_output = gr.Audio(label="Output Audio") audio_output = gr.Audio(label="输出音频")
# Wiring # Wiring
run_btn.click( run_btn.click(
@@ -267,7 +257,7 @@ def run_demo(server_name: str = "localhost", server_port: int = 7860, show_error
 demo = VoxCPMDemo()
 interface = create_demo_interface(demo)
 # Recommended to enable queue on Spaces for better throughput
-interface.queue(max_size=10).launch(server_name=server_name, server_port=server_port, show_error=show_error)
+interface.queue(max_size=10, default_concurrency_limit=1).launch(server_name=server_name, server_port=server_port, show_error=show_error)
if __name__ == "__main__": if __name__ == "__main__":

View File

@@ -19,6 +19,8 @@ tensorboard: /path/to/logs/finetune_lora
 lambdas:
   loss/diff: 1.0
   loss/stop: 1.0
+# LoRA configuration
 lora:
   enable_lm: true
   enable_dit: true
@@ -26,3 +28,9 @@ lora:
   r: 32
   alpha: 16
   dropout: 0.0
+# Distribution options (optional)
+# - If distribute=false (default): save pretrained_path as base_model in lora_config.json
+# - If distribute=true: save hf_model_id as base_model (hf_model_id is required)
+# hf_model_id: "openbmb/VoxCPM1.5"
+# distribute: true
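For reference, the `base_model` value that these options control is written into the `lora_config.json` saved alongside each LoRA checkpoint (see docs/finetune.md). A sketch of that file when `distribute: true`, with illustrative values:

```json
{
  "base_model": "openbmb/VoxCPM1.5",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    "dropout": 0.0
  }
}
```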

View File

@@ -19,6 +19,8 @@ tensorboard: /path/to/logs/finetune_lora
 lambdas:
   loss/diff: 1.0
   loss/stop: 1.0
+# LoRA configuration
 lora:
   enable_lm: true
   enable_dit: true
@@ -26,3 +28,9 @@ lora:
   r: 32
   alpha: 16
   dropout: 0.0
+# Distribution options (optional)
+# - If distribute=false (default): save pretrained_path as base_model in lora_config.json
+# - If distribute=true: save hf_model_id as base_model (hf_model_id is required)
+# hf_model_id: "openbmb/VoxCPM-0.5B"
+# distribute: true

148
create_repo.py Normal file

@@ -0,0 +1,148 @@
import requests
import json
import subprocess
import os
# Configuration
API_URL = "https://git.aitosuv.com/api/v1/user/repos"
AUTH = ('admin', 'lsy123123')
REPO_DATA = {
"name": "VoxCPM-use",
"description": "声音克隆",
"private": False,
"auto_init": False
}
def run_command(command):
"""Run a shell command and return the output."""
print(f"Running: {command}")
try:
result = subprocess.run(
command,
check=True,
shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True
)
if result.stdout:
print(result.stdout.strip())
return result.stdout.strip()
except subprocess.CalledProcessError as e:
print(f"Error running command: {command}")
print(e.stderr)
return None
def create_gitignore():
"""Create .gitignore if it doesn't exist."""
if not os.path.exists(".gitignore"):
content = """
.venv/
__pycache__/
*.pyc
.trae/
.vscode/
.idea/
*.log
*.safetensors
*.pth
*.pt
*.ckpt
*.bin
"""
with open(".gitignore", "w") as f:
f.write(content.strip())
print("Created .gitignore")
else:
print(".gitignore already exists")
def main():
clone_url = None
# Initialize session and disable proxy
session = requests.Session()
session.trust_env = False
# 1. Create Repository via API
try:
response = session.post(API_URL, auth=AUTH, json=REPO_DATA)
if response.status_code == 201:
print("Repository created successfully")
clone_url = response.json()['clone_url']
elif response.status_code == 422 or response.status_code == 409: # Already exists
print("Repository already exists")
# Fetch existing repo details
user = AUTH[0]
repo_name = REPO_DATA["name"]
get_url = f"https://git.aitosuv.com/api/v1/repos/{user}/{repo_name}"
resp_get = session.get(get_url, auth=AUTH)
if resp_get.status_code == 200:
clone_url = resp_get.json()['clone_url']
else:
print(f"Could not fetch existing repository details. Status: {resp_get.status_code}")
else:
print(f"Failed to create repository: {response.status_code}")
print(response.text)
return
except Exception as e:
print(f"Error: {e}")
return
if not clone_url:
print("Could not determine clone URL. Exiting.")
return
# Embed credentials into the URL for automatic authentication
# Assuming clone_url format: https://git.aitosuv.com/admin/geminiWX.git
# We want: https://admin:lsy123123@git.aitosuv.com/admin/geminiWX.git
if "://" in clone_url:
protocol, rest = clone_url.split("://", 1)
auth_url = f"{protocol}://{AUTH[0]}:{AUTH[1]}@{rest}"
else:
auth_url = clone_url # Fallback if format is unexpected
print(f"Target Remote URL: {clone_url}")
# 2. Local Git Operations
if not os.path.exists(".git"):
print("Initializing git repository...")
run_command("git init")
# Configure git user for this repository
print("Configuring git user...")
run_command(f'git config user.email "{AUTH[0]}@aitosuv.com"')
run_command(f'git config user.name "{AUTH[0]}"')
create_gitignore()
print("Adding files...")
run_command("git add .")
print("Committing changes...")
run_command('git commit -m "Initial commit"')
# Check and configure remote
remotes = run_command("git remote -v")
if remotes and "origin" in remotes:
print("Updating remote 'origin'...")
run_command(f"git remote set-url origin {auth_url}")
else:
print("Adding remote 'origin'...")
run_command(f"git remote add origin {auth_url}")
# Push to remote
print("Pushing to remote...")
# Try pushing to master first, then main if that fails (or vice versa depending on default branch)
# Usually 'master' is default for older git, 'main' for newer.
# We can try checking current branch name.
current_branch = run_command("git rev-parse --abbrev-ref HEAD")
if current_branch:
if run_command(f"git push -u origin {current_branch} -f") is None:
print("Push failed.")
else:
# Fallback if we couldn't get branch name
if run_command("git push -u origin master -f") is None:
run_command("git push -u origin main -f")
if __name__ == "__main__":
main()

docs/finetune.md

@@ -1,75 +1,99 @@
# VoxCPM Fine-tuning Guide # VoxCPM 微调指南
This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning. 本指南介绍了如何使用全量微调Full Fine-tuning)和 LoRA 微调两种方式对 VoxCPM 模型进行微调。
### 🎓 SFT (Supervised Fine-Tuning) ### 🎓 SFT (监督微调)
Full fine-tuning updates all model parameters. Suitable for: 全量微调会更新所有模型参数。适用于:
- 📊 Large, specialized datasets - 📊 大型、专业的数据集
- 🔄 Cases where significant behavior changes are needed - 🔄 需要显著改变模型行为的场景
### ⚡ LoRA Fine-tuning ### ⚡ LoRA 微调
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that: LoRA (Low-Rank Adaptation) 是一种参数高效的微调方法,它:
- 🎯 Trains only a small number of additional parameters - 🎯 仅训练少量额外参数
- 💾 Significantly reduces memory requirements and training time - 💾 显著降低显存需求和训练时间
- 🔀 Supports multiple LoRA adapters with hot-swapping - 🔀 支持多个 LoRA 适配器热插拔
## 目录
- [快速开始WebUI](#快速开始webui)
## Table of Contents - [数据准备](#数据准备)
- [全量微调](#全量微调)
- [Data Preparation](#data-preparation) - [LoRA 微调](#lora-微调)
- [Full Fine-tuning](#full-fine-tuning) - [推理](#推理)
- [LoRA Fine-tuning](#lora-fine-tuning) - [LoRA 热插拔](#lora-热插拔)
- [Inference](#inference) - [常见问题](#常见问题)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)
--- ---
## Data Preparation ## 快速开始WebUI
Training data should be prepared as a JSONL manifest file, with one sample per line: 对于喜欢图形界面的用户,我们提供了 `lora_ft_webui.py` —— 一个用于训练和推理的综合 WebUI
```jsonl ### 启动 WebUI
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."} ```bash
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5} python lora_ft_webui.py
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
``` ```
### Required Fields 然后在浏览器中打开 `http://localhost:7860`
| Field | Description | ### 功能特点
- **🚀 训练标签页**:通过直观的界面配置并启动 LoRA 训练
- 设置训练参数学习率、Batch Size、LoRA Rank 等)
- 实时监控训练进度
- 从现有断点恢复训练
- **🎵 推理标签页**:使用训练好的模型生成音频
- 从 LoRA 检查点配置自动加载基座模型
- 带自动 ASR参考文本识别的声音克隆
- 在多个 LoRA 模型间热切换
- 无参考音频的零样本 TTS
## 数据准备
训练数据应准备为 JSONL 清单文件,每行一个样本:
```jsonl
{"audio": "path/to/audio1.wav", "text": "音频1的文本内容。"}
{"audio": "path/to/audio2.wav", "text": "音频2的文本内容。"}
{"audio": "path/to/audio3.wav", "text": "可选的时长字段。", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "多数据集训练可选的 dataset_id。", "dataset_id": 1}
```
### 必填字段
| 字段 | 描述 |
|-------|-------------| |-------|-------------|
| `audio` | Path to audio file (absolute or relative) | | `audio` | 音频文件路径(绝对或相对路径) |
| `text` | Corresponding transcript | | `text` | 对应的文本内容 |
### Optional Fields ### 可选字段
| Field | Description | | 字段 | 描述 |
|-------|-------------| |-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) | | `duration` | 音频时长(秒),用于加速样本过滤 |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) | | `dataset_id` | 多数据集训练的数据集 ID默认0 |
### Requirements ### 要求
- Audio format: WAV - 音频格式:WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5 - 采样率VoxCPM-0.5B 为 16kHzVoxCPM1.5 为 44.1kHz
- Text: Transcript matching the audio content - 文本:与音频内容匹配的文本
See `examples/train_data_example.jsonl` for a complete example. 查看 `examples/train_data_example.jsonl` 获取完整示例。
--- ---
## Full Fine-tuning ## 全量微调
Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed. 全量微调更新所有模型参数。适用于大数据集或需要显著改变模型行为的情况。
### Configuration ### 配置
Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`: 创建 `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`
```yaml ```yaml
pretrained_path: /path/to/VoxCPM1.5/ pretrained_path: /path/to/VoxCPM1.5/
@@ -85,7 +109,7 @@ log_interval: 10
valid_interval: 1000 valid_interval: 1000
save_interval: 1000 save_interval: 1000
learning_rate: 0.00001 # Use smaller LR for full fine-tuning learning_rate: 0.00001 # 全量微调使用较小的学习率
weight_decay: 0.01 weight_decay: 0.01
warmup_steps: 100 warmup_steps: 100
max_steps: 2000 max_steps: 2000
@@ -99,27 +123,27 @@ lambdas:
loss/stop: 1.0 loss/stop: 1.0
``` ```
### Training ### 训练
```bash ```bash
# Single GPU # GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
# Multi-GPU # GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \ CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
``` ```
### Checkpoint Structure ### 检查点结构
Full fine-tuning saves a complete model directory that can be loaded directly: 全量微调保存完整的模型目录,可以直接加载:
``` ```
checkpoints/finetune_all/ checkpoints/finetune_all/
└── step_0002000/ └── step_0002000/
├── model.safetensors # Model weights (excluding audio_vae) ├── model.safetensors # 模型权重 (不含 audio_vae)
├── config.json # Model config ├── config.json # 模型配置
├── audiovae.pth # Audio VAE weights ├── audiovae.pth # Audio VAE 权重
├── tokenizer.json # Tokenizer ├── tokenizer.json # Tokenizer
├── tokenizer_config.json ├── tokenizer_config.json
├── special_tokens_map.json ├── special_tokens_map.json
@@ -129,13 +153,13 @@ checkpoints/finetune_all/
--- ---
## LoRA Fine-tuning ## LoRA 微调
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements. LoRA (Low-Rank Adaptation) 是一种参数高效的微调方法,仅训练少量额外参数,显著降低显存需求。
### Configuration ### 配置
Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`: 创建 `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`
```yaml ```yaml
pretrained_path: /path/to/VoxCPM1.5/ pretrained_path: /path/to/VoxCPM1.5/
@@ -151,7 +175,7 @@ log_interval: 10
valid_interval: 1000 valid_interval: 1000
save_interval: 1000 save_interval: 1000
learning_rate: 0.0001 # LoRA can use larger LR learning_rate: 0.0001 # LoRA 可以使用较大的学习率
weight_decay: 0.01 weight_decay: 0.01
warmup_steps: 100 warmup_steps: 100
max_steps: 2000 max_steps: 2000
@@ -164,117 +188,159 @@ lambdas:
loss/diff: 1.0 loss/diff: 1.0
loss/stop: 1.0 loss/stop: 1.0
# LoRA configuration # LoRA 配置
lora: lora:
enable_lm: true # Apply LoRA to Language Model enable_lm: true # 对语言模型应用 LoRA
enable_dit: true # Apply LoRA to Diffusion Transformer enable_dit: true # Diffusion Transformer 应用 LoRA
enable_proj: false # Apply LoRA to projection layers (optional) enable_proj: false # 对投影层应用 LoRA (可选)
r: 32 # LoRA rank (higher = more capacity) r: 32 # LoRA rank (越高容量越大)
alpha: 16 # LoRA alpha, scaling = alpha / r alpha: 16 # LoRA alpha, scaling = alpha / r
dropout: 0.0 dropout: 0.0
# Target modules # 目标模块
target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"] target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"] target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
# 分发选项 (可选)
# hf_model_id: "openbmb/VoxCPM1.5" # HuggingFace ID
# distribute: true # 如果为 true在 lora_config.json 中保存 hf_model_id
``` ```
### LoRA Parameters
| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA to | attention layers |
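To make the `scaling = alpha / r` relation above concrete, here is a minimal, generic sketch of a LoRA-augmented linear layer; it illustrates the general technique only (names and structure are illustrative, not VoxCPM's internal implementation):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base                        # pretrained layer, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)      # LoRA branch starts as a no-op
        self.scaling = alpha / r                # e.g. 16 / 32 = 0.5

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

With `r = 32` and `alpha = 16`, the LoRA branch is scaled by 0.5; raising `alpha` toward `r` gives the adapter more influence.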
### Training ### 分发选项 (可选)
| 参数 | 描述 | 默认值 |
|-----------|-------------|---------|
| `hf_model_id` | HuggingFace 模型 ID (例如 `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | 如果为 `true`,将 `hf_model_id` 作为 `base_model` 保存到检查点;否则保存本地 `pretrained_path` | `false` |
> **注意**:如果 `distribute: true`,则必须提供 `hf_model_id`。
### 训练
```bash ```bash
# Single GPU # GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
# Multi-GPU # GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \ CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
``` ```
### Checkpoint Structure ### 检查点结构
LoRA training saves only LoRA parameters: LoRA 训练保存 LoRA 参数和配置:
``` ```
checkpoints/finetune_lora/ checkpoints/finetune_lora/
└── step_0002000/ └── step_0002000/
├── lora_weights.safetensors # Only lora_A, lora_B parameters ├── lora_weights.safetensors # 仅含 lora_A, lora_B 参数
├── lora_config.json # LoRA 配置 + 基座模型路径
├── optimizer.pth ├── optimizer.pth
└── scheduler.pth └── scheduler.pth
``` ```
`lora_config.json` 包含:
```json
{
"base_model": "/path/to/VoxCPM1.5/",
"lora_config": {
"enable_lm": true,
"enable_dit": true,
"r": 32,
"alpha": 16,
...
}
}
```
`base_model` 字段包含:
- 本地路径 (默认):当 `distribute: false` 或未设置时
- HuggingFace ID`distribute: true` 时 (例如 `"openbmb/VoxCPM1.5"`)
这允许在没有原始训练配置文件的情况下加载 LoRA 检查点。
--- ---
## Inference ## 推理
### Full Fine-tuning Inference ### 全量微调推理
The checkpoint directory is a complete model, load it directly: 检查点目录是一个完整的模型,直接加载:
```bash ```bash
python scripts/test_voxcpm_ft_infer.py \ python scripts/test_voxcpm_ft_infer.py \
--ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \ --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
--text "Hello, this is the fine-tuned model." \ --text "你好,这是微调后的模型。" \
--output output.wav --output output.wav
``` ```
With voice cloning: 带声音克隆:
```bash ```bash
python scripts/test_voxcpm_ft_infer.py \ python scripts/test_voxcpm_ft_infer.py \
--ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \ --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
--text "This is voice cloning result." \ --text "这是声音克隆的结果。" \
--prompt_audio /path/to/reference.wav \ --prompt_audio /path/to/reference.wav \
--prompt_text "Reference audio transcript" \ --prompt_text "参考音频的文本内容" \
--output cloned_output.wav --output cloned_output.wav
``` ```
### LoRA Inference ### LoRA 推理
LoRA inference requires the training config (for LoRA structure) and LoRA checkpoint: LoRA 推理只需要检查点目录(基座模型路径和 LoRA 配置从 `lora_config.json` 读取):
```bash ```bash
python scripts/test_voxcpm_lora_infer.py \ python scripts/test_voxcpm_lora_infer.py \
--config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \ --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--text "Hello, this is LoRA fine-tuned result." \ --text "你好,这是 LoRA 微调的结果。" \
--output lora_output.wav --output lora_output.wav
``` ```
With voice cloning: 带声音克隆:
```bash ```bash
python scripts/test_voxcpm_lora_infer.py \ python scripts/test_voxcpm_lora_infer.py \
--config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \ --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--text "This is voice cloning with LoRA." \ --text "这是带 LoRA 的声音克隆。" \
--prompt_audio /path/to/reference.wav \ --prompt_audio /path/to/reference.wav \
--prompt_text "Reference audio transcript" \ --prompt_text "参考音频的文本内容" \
--output cloned_output.wav --output cloned_output.wav
``` ```
覆盖基座模型路径 (可选)
```bash
python scripts/test_voxcpm_lora_infer.py \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--base_model /path/to/another/VoxCPM1.5 \
--text "使用不同的基座模型。" \
--output output.wav
```
--- ---
## LoRA Hot-swapping ## LoRA 热插拔
LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model. LoRA 支持在推理时动态加载、卸载和切换,无需重新加载整个模型。
### API Reference ### API 参考
```python ```python
from voxcpm.core import VoxCPM from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig from voxcpm.model.voxcpm import LoRAConfig
# 1. Load model with LoRA structure and weights # 1. 加载带 LoRA 结构和权重的模型
lora_cfg = LoRAConfig( lora_cfg = LoRAConfig(
enable_lm=True, enable_lm=True,
enable_dit=True, enable_dit=True,
@@ -284,93 +350,113 @@ lora_cfg = LoRAConfig(
target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"], target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
) )
model = VoxCPM.from_pretrained( model = VoxCPM.from_pretrained(
hf_model_id="openbmb/VoxCPM1.5", # or local path hf_model_id="openbmb/VoxCPM1.5", # 或本地路径
load_denoiser=False, # Optional: disable denoiser for faster loading load_denoiser=False, # 可选:禁用降噪器以加快加载
optimize=True, # Enable torch.compile acceleration optimize=True, # 启用 torch.compile 加速
lora_config=lora_cfg, lora_config=lora_cfg,
lora_weights_path="/path/to/lora_checkpoint", lora_weights_path="/path/to/lora_checkpoint",
) )
# 2. Generate audio # 2. 生成音频
audio = model.generate( audio = model.generate(
text="Hello, this is LoRA fine-tuned result.", text="你好,这是 LoRA 微调的结果。",
prompt_wav_path="/path/to/reference.wav", # Optional: for voice cloning prompt_wav_path="/path/to/reference.wav", # 可选:用于声音克隆
prompt_text="Reference audio transcript", # Optional: for voice cloning prompt_text="参考音频的文本内容", # 可选:用于声音克隆
) )
# 3. Disable LoRA (use base model only) # 3. 禁用 LoRA (仅使用基座模型)
model.set_lora_enabled(False) model.set_lora_enabled(False)
# 4. Re-enable LoRA # 4. 重新启用 LoRA
model.set_lora_enabled(True) model.set_lora_enabled(True)
# 5. Unload LoRA (reset weights to zero) # 5. 卸载 LoRA (重置权重为零)
model.unload_lora() model.unload_lora()
# 6. Hot-swap to another LoRA # 6. 热切换到另一个 LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint") loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}") print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")
# 7. Get current LoRA weights # 7. 获取当前 LoRA 权重
lora_state = model.get_lora_state_dict() lora_state = model.get_lora_state_dict()
``` ```
### Simplified Usage (Auto LoRA Config) ### 简化用法 (从 lora_config.json 加载)
If you only have LoRA weights and don't need custom config, just provide the path: 如果你的检查点包含 `lora_config.json`(由训练脚本保存),你可以自动加载所有内容:
```python ```python
import json
from voxcpm.core import VoxCPM from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig
# Auto-create default LoRAConfig when only lora_weights_path is provided # 从检查点加载配置
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
lora_info = json.load(f)
base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])
# 加载带 LoRA 的模型
model = VoxCPM.from_pretrained( model = VoxCPM.from_pretrained(
hf_model_id="openbmb/VoxCPM1.5", hf_model_id=base_model,
lora_weights_path="/path/to/lora_checkpoint", # Will auto-create LoRAConfig lora_config=lora_cfg,
lora_weights_path=lora_ckpt_dir,
) )
``` ```
### Method Reference 或者直接使用测试脚本:
| Method | Description | torch.compile Compatible | ```bash
python scripts/test_voxcpm_lora_infer.py \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--text "Hello world"
```
### 方法参考
| 方法 | 描述 | torch.compile 兼容性 |
|--------|-------------|--------------------------| |--------|-------------|--------------------------|
| `load_lora(path)` | Load LoRA weights from file | ✅ | | `load_lora(path)` | 从文件加载 LoRA 权重 | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ | | `set_lora_enabled(bool)` | 启用/禁用 LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ | | `unload_lora()` | LoRA 权重重置为初始值 | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ | | `get_lora_state_dict()` | 获取当前 LoRA 权重 | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ | | `lora_enabled` | 属性:检查是否配置了 LoRA | ✅ |
--- ---
## FAQ ## 常见问题 (FAQ)
### 1. Out of Memory (OOM) ### 1. 显存溢出 (OOM)
- Increase `grad_accum_steps` (gradient accumulation) - 增加 `grad_accum_steps` (梯度累积步数)
- Decrease `batch_size` - 减小 `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning - 使用 LoRA 微调代替全量微调
- Decrease `max_batch_tokens` to filter long samples - 减小 `max_batch_tokens` 以过滤长样本
### 2. Poor LoRA Performance ### 2. LoRA 效果不佳
- Increase `r` (LoRA rank) - 增加 `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`) - 调整 `alpha` (尝试 `alpha = r/2` `alpha = r`)
- Ensure `enable_dit: true` (required for voice cloning) - 增加训练步数
- Increase training steps - 添加更多目标模块
- Add more target modules
### 3. Training Not Converging ### 3. 训练不收敛
- Decrease `learning_rate` - 减小 `learning_rate` (学习率)
- Increase `warmup_steps` - 增加 `warmup_steps`
- Check data quality - 检查数据质量
### 4. LoRA Not Taking Effect at Inference ### 4. LoRA 在推理时未生效
- Ensure inference config matches training config LoRA parameters - 检查检查点目录下是否存在 `lora_config.json`
- Check `load_lora()` return value - `skipped_keys` should be empty - 检查 `load_lora()` 返回值 - `skipped_keys` 应该为空
- Verify `set_lora_enabled(True)` is called - 确认调用了 `set_lora_enabled(True)`
### 5. Checkpoint Loading Errors ### 5. 检查点加载错误
- Full fine-tuning: checkpoint directory should contain `model.safetensors`(or `pytorch_model.bin`), `config.json`, `audiovae.pth` - 全量微调:检查点目录应包含 `model.safetensors` (或 `pytorch_model.bin`)`config.json``audiovae.pth`
- LoRA: checkpoint directory should contain `lora_weights.safetensors` (or `lora_weights.ckpt`) - LoRA:检查点目录应包含:
- `lora_weights.safetensors` (或 `lora_weights.ckpt`) - LoRA 权重
- `lora_config.json` - LoRA 配置和基座模型路径

docs/performance.md

@@ -1,10 +1,10 @@
# 📊 Performance Highlights # 📊 性能亮点
VoxCPM achieves competitive results on public zero-shot TTS benchmarks. VoxCPM 在公开的零样本 TTS 基准测试中取得了具有竞争力的结果。
## Seed-TTS-eval Benchmark ## Seed-TTS-eval 基准测试
| Model | Parameters | Open-Source | test-EN | | test-ZH | | test-Hard | | | 模型 | 参数量 | 开源 | test-EN | | test-ZH | | test-Hard | |
|------|------|------|:------------:|:--:|:------------:|:--:|:-------------:|:--:| |------|------|------|:------------:|:--:|:------------:|:--:|:-------------:|:--:|
| | | | WER/%⬇ | SIM/%⬆| CER/%⬇| SIM/%⬆ | CER/%⬇ | SIM/%⬆ | | | | | WER/%⬇ | SIM/%⬆| CER/%⬇| SIM/%⬆ | CER/%⬇ | SIM/%⬆ |
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - | | MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
@@ -28,9 +28,9 @@ VoxCPM achieves competitive results on public zero-shot TTS benchmarks.
| **VoxCPM** | 0.5B | ✅ | **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 | | **VoxCPM** | 0.5B | ✅ | **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 |
## CV3-eval Benchmark ## CV3-eval 基准测试
| Model | zh | en | hard-zh | | | hard-en | | | | 模型 | zh | en | hard-zh | | | hard-en | | |
|-------|:--:|:--:|:-------:|:--:|:--:|:-------:|:--:|:--:| |-------|:--:|:--:|:-------:|:--:|:--:|:-------:|:--:|:--:|
| | CER/%⬇ | WER/%⬇ | CER/%⬇ | SIM/%⬆ | DNSMOS⬆ | WER/%⬇ | SIM/%⬆ | DNSMOS⬆ | | | CER/%⬇ | WER/%⬇ | CER/%⬇ | SIM/%⬆ | DNSMOS⬆ | WER/%⬇ | SIM/%⬆ | DNSMOS⬆ |
| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - | | F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - |
@@ -43,4 +43,3 @@ VoxCPM achieves competitive results on public zero-shot TTS benchmarks.
| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 | | CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 | | CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 |
| **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 | | **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 |

docs/release_note.md

@@ -1,109 +1,111 @@
# VoxCPM1.5 Release Notes # VoxCPM1.5 发布说明
**Release Date:** December 5, 2025 **发布日期:** 2025年12月5日
## 🎉 Overview ## 🎉 概览
我们非常激动地推出一次重大升级,在保持 VoxCPM 上下文感知语音生成和零样本声音克隆核心能力的同时,提升了音频质量和效率。
Were thrilled to introduce a major upgrade that improves audio quality and efficiency of VoxCPM, while maintaining the core capabilities of context-aware speech generation and zero-shot voice cloning. | 特性 | VoxCPM | VoxCPM1.5 |
| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------| |---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz | | **Audio VAE 采样率** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz | | **LM Token 速率** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 | | **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ | | **SFT 支持** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ | | **LoRA 支持** | ✅ | ✅ |
## 🎵 Model Updates ## 🎵 模型更新
### 🔊 AudioVAE Sampling Rate: 16kHz → 44.1kHz ### 🔊 AudioVAE 采样率:16kHz → 44.1kHz
The AudioVAE now supports 44.1kHz sampling rate, which allows the model to: AudioVAE 现在支持 44.1kHz 采样率,这使得模型能够:
- 🎯 Clone better, preserving more high-frequency details and generate higher quality voice outputs - 🎯 更好地克隆声音,保留更多高频细节,生成更高质量的语音输出
*注意:此升级在使用高质量参考音频时能生成更高质量的音频,但不能保证所有生成的音频都是高保真的。输出质量取决于**提示语音prompt speech**的质量。*
*Note: This upgrade enables higher quality generation when using high-quality reference audio, but does not guarantee that all generated audio will be high-fidelity. The output quality depends on the **prompt speech** quality.* ### ⚡ Token 速率12.5Hz → 6.25Hz
### ⚡ Token Rate: 12.5Hz 6.25Hz 我们将 LM 主干网络中的 token 速率从 12.5Hz 降低到了 6.25HzLocEnc&LocDiT patch size 从 2 增加到 4同时在评估基准上保持了相似的性能。这一变化
- 💨 降低了生成相同长度音频的计算需求
- 📈 为更长音频生成奠定了基础
- 🏗️ 为未来训练更大的模型铺平了道路
We reduced the token rate in LM backbone from 12.5Hz to 6.25Hz (LocEnc&LocDiT patch size increased from 2 to 4) while maintaining similar performance on evaluation benchmarks. This change: **模型架构说明**VoxCPM1.5 的核心架构与技术报告中保持一致。关键的修改是将局部模块(LocEnc & LocDiT)的 patch size 从 2 调整为 4从而将 LM 处理速率从 12.5Hz 降低到 6.25Hz。由于局部模块现在需要处理更长的上下文,我们扩展了它们的网络深度,导致整体模型参数量略有增加。
- 💨 Reduces computational requirements for generating the same length of audio
- 📈 Provides a foundation for longer audio generation
- 🏗️ Paves the way for training larger models in the future
**生成速度说明**:虽然模型参数增加了,但 VoxCPM1.5 生成 1 秒音频仅需 6.25 个 token相比之前的 12.5 个 token。虽然显示的生成速度xx it/s可能看起来变慢了但实际的实时率RTF = 音频时长 / 处理时间)没有差异,甚至可能更快。
## 🔧 Fine-tuning Support ## 🔧 微调支持
We support full fine-tuning and LoRA fine-tuning now, please see the [Fine-tuning Guide](finetune.md) for detailed instructions. 我们现在支持全量微调和 LoRA 微调,请参阅 [微调指南](finetune.md) 了解详细说明。
## 📚 文档
## 📚 Documentation - 更新了 README增加了版本对比
- 添加了全面的微调指南
- 改进了代码注释和文档
- Updated README with version comparison ## 🙏 感谢大家
- Added comprehensive fine-tuning guide
- Improved code comments and documentation
没有开源社区的反馈、测试和贡献,这次发布是不可能的。感谢你们帮助塑造 VoxCPM1.5
## 🙏 Our Thanks to You ## 📞 让我们共同建设
This release wouldnt be possible without the incredible feedback, testing, and contributions from our open-source community. Thank you for helping shape VoxCPM1.5!
有问题、想法或想要贡献?
## 📞 Let's Build Together - 🐛 报告问题:[OpenBMB/VoxCPM GitHub Issues](https://github.com/OpenBMB/VoxCPM/issues)
Questions, ideas, or want to contribute?
- 🐛 Report an issue: [GitHub Issues on OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM/issues) - 📖 深入文档:查看 [docs/](../docs/) 文件夹获取指南和 API 详情
- 📖 Dig into the docs: Check the [docs/](../docs/) folder for guides and API details 享受 VoxCPM1.5 更丰富的声音和强大的新功能吧 🎉
Enjoy the richer sound and powerful new features of VoxCPM1.5 🎉 我们迫不及待想听到你们接下来的创作!🥂
We can't wait to hear what you create next! 🥂 ## 🚀 我们正在做的事情
## 🚀 What We're Working On 我们正在持续改进 VoxCPM 并致力于开发激动人心的新功能:
We're continuously improving VoxCPM and working on exciting new features: - 🌍 **多语言 TTS 支持**:我们正在积极开发除中文和英文以外的语言支持。
- 🎯 **可控表现力语音生成**:我们正在研究可控语音生成,允许通过自然语言指令对语音属性(情感、音色、韵律等)进行细粒度控制。
- 🎵 **通用音频生成基础**:我们也希望探索 VoxCPM 作为统一的音频生成基础模型,能够联合生成语音、音乐和音效。不过,这是一个长期的愿景。
- 🌍 **Multilingual TTS Support**: We are actively developing support for languages beyond Chinese and English. **📅 下次发布**:我们计划在 2026 年第一季度发布下一个版本,其中将包含重大改进和新功能。敬请关注更新!我们致力于使 VoxCPM 更加强大和通用。
- 🎯 **Controllable Expressive Speech Generation**: We are researching controllable speech generation that allows fine-grained control over speech attributes (emotion, timbre, prosody, etc.) through natural language instructions.
- 🎵 **Universal Audio Generation Foundation**: We also hope to explore VoxCPM as a unified audio generation foundation model capable of joint generation of speech, music, and sound effects. However, this is a longer-term vision.
**📅 Next Release**: We plan to release the next version in Q1 2026, which will include significant improvements and new features. Stay tuned for updates! We're committed to making VoxCPM even more powerful and versatile. ## ❓ 常见问题 (FAQ)
## ❓ Frequently Asked Questions (FAQ) ### Q: VoxCPM 支持个性化声音定制的微调吗?
### Q: Does VoxCPM support fine-tuning for personalized voice customization? **A:** 是的VoxCPM 现在支持全量微调SFT和高效的 LoRA 微调。你可以使用自己的数据训练个性化声音模型。请参阅 [微调指南](finetune.md) 获取详细说明和示例。
**A:** Yes! VoxCPM now supports both full fine-tuning (SFT) and efficient LoRA fine-tuning. You can train personalized voice models on your own data. Please refer to the [Fine-tuning Guide](finetune.md) for detailed instructions and examples. ### Q: 16kHz 音频质量对我的用例足够吗?
### Q: Is 16kHz audio quality sufficient for my use case? **A:** 我们在 VoxCPM1.5 中升级了 AudioVAE 以支持 44.1kHz 采样率,这提供了更高质量的音频输出,更好地保留了高频细节。当使用高质量参考音频时,此升级能实现更好的声音克隆质量和更自然的语音合成。
**A:** We have upgraded the AudioVAE to support 44.1kHz sampling rate in VoxCPM1.5, which provides higher quality audio output with better preservation of high-frequency details. This upgrade enables better voice cloning quality and more natural speech synthesis when using high-quality reference audio. ### Q: 稳定性问题解决了吗?
### Q: Has the stability issue been resolved? **A:** 我们在 VoxCPM1.5 中进行了稳定性优化,包括对推理代码逻辑、训练数据和模型架构的改进。根据社区反馈,我们收集了一些稳定性问题,例如:
- 噪声和混响增加
- 音频伪影(如啸叫/尖叫)
- 语速不稳定(加速)
- 音量波动(忽大忽小)
- 音频开头和结尾的噪声伪影
- 极短文本(如“你好”)的合成问题
**A:** We have made stability optimizations in VoxCPM1.5, including improvements to the training data and model architecture. Based on community feedback, we collected some stability issues such as: **我们改进了什么:**
- Increased noise and reverberation - 通过调整推理代码逻辑和优化训练数据,我们很大程度上修复了开头/结尾的伪影。
- Audio artifacts (e.g., howling/squealing) - 通过降低 LM 处理速率12.5Hz → 6.25Hz),我们提高了长语音生成的稳定性。
- Unstable speaking rate (speeding up)
- Volume fluctuations (increases or decreases)
- Noise artifacts at the beginning and end of audio
- Synthesis issues with very short texts (e.g., "hello")
While we have made improvements to these issues, they have not been completely resolved and may still occasionally occur, especially with very long or highly expressive inputs. We continue to work on further stability improvements in future versions. **还遗留什么:** 我们承认长语音稳定性问题尚未完全解决。特别是对于高表现力或复杂的参考语音,自回归生成过程中的误差累积仍可能发生。我们将继续在未来版本中分析和优化这一点。
### Q: Does VoxCPM plan to support multilingual TTS? ### Q: VoxCPM 计划支持多语言 TTS 吗?
**A:** Currently, VoxCPM is primarily trained on Chinese and English data. We are actively researching and developing multilingual TTS support for more languages beyond Chinese and English. Please let us know what languages you'd like to see supported! **A:** 目前VoxCPM 主要在中文和英文数据上进行训练。我们正在积极研究和开发除中英文以外更多语言的多语言 TTS 支持。请告诉我们你希望支持哪些语言!
### Q: Does VoxCPM plan to support controllable generation (emotion, style, fine-grained control)? ### Q: VoxCPM 计划支持可控生成(情感、风格、细粒度控制)吗?
**A:** Currently, VoxCPM only supports zero-shot voice cloning and context-aware speech generation. Direct control over specific speech attributes (emotion, style, fine-grained prosody) is limited. However, we are actively researching instruction-controllable expressive speech generation with fine-grained control capabilities, working towards a human instruction-to-speech generation model! **A:** 目前VoxCPM 仅支持零样本声音克隆和上下文感知语音生成。对特定语音属性(情感、风格、细粒度韵律)的直接控制是有限的。然而,我们正在积极研究具有细粒度控制能力的指令可控表现力语音生成,致力于实现人类指令到语音的生成模型!
### Q: Does VoxCPM support different hardware chips (e.g., Ascend 910B, XPU, NPU)? ### Q: VoxCPM 支持不同的硬件芯片(如 Ascend 910B, XPU, NPU)吗?
**A:** Currently, we have not yet adapted VoxCPM for different hardware chips. Our main focus remains on developing new model capabilities and improving stability. We encourage you to check if community developers have done similar work, and we warmly welcome everyone to contribute and promote such adaptations together!
These features are under active development, and we look forward to sharing updates in future releases!
**A:** 目前,我们尚未针对不同的硬件芯片适配 VoxCPM。我们的主要重点仍然是开发新的模型能力和提高稳定性。我们鼓励你查看社区开发者是否做了类似的工作我们也热烈欢迎大家共同贡献和推动此类适配
这些功能正在积极开发中,我们期待在未来的版本中分享更新!

docs/usage_guide.md

@@ -1,53 +1,54 @@
# 👩‍🍳 A Voice Chef's Guide # 👩‍🍳 声音大厨指南
Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let's begin. 欢迎来到 VoxCPM 厨房!按照这份食谱,烹饪出完美的生成语音。让我们开始吧。
--- ---
## 🥚 Step 1: Prepare Your Base Ingredients (Content) ## 🥚 第一步:准备基础食材(内容)
First, choose how you'd like to input your text: 首先,选择你输入文本的方式:
### 1. Regular Text (Classic Mode) ### 1. 普通文本(经典模式)
-Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using WeTextProcessing library. -保持“文本标准化 (Text Normalization)”开启。自然地输入文字(例如 "Hello, world! 123")。系统将使用 WeTextProcessing 库自动处理数字、缩写和标点符号。
### 2. Phoneme Input (Native Mode) ### 2. 音素输入(原生模式)
-Turn "Text Normalization" OFF. Enter phoneme text like `{HH AH0 L OW1}` (EN) or `{ni3}{hao3}` (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text—try it out! -关闭“文本标准化 (Text Normalization)”。输入音素文本,如 `{HH AH0 L OW1}` (英语) 或 `{ni3}{hao3}` (中文)以进行精确的发音控制。在此模式下VoxCPM 还支持对其他复杂的非标准化文本的原生理解——快来试试吧!
- **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin. For English, phonemes are converted using CMUDict. Please refer to the relevant documentation for more details. - **音素转换**:对于中文,音素使用拼音转换。对于英语,音素使用 CMUDict 转换。更多详细信息请参考相关文档。
--- ---
## 🍳 Step 2: Choose Your Flavor Profile (Voice Style) ## 🍳 第二步:选择风味(声音风格)
This is the secret sauce that gives your audio its unique sound. 这是让你的音频拥有独特声音的秘制酱料。
### 1. Cooking with a Prompt Speech (Following a Famous Recipe) ### 1. 使用提示语音烹饪(跟随名家食谱)
- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated. - 提示语音Prompt Speech)为 VoxCPM 提供所需的声学特征。说话者的音色、说话风格,甚至背景声音和氛围都将被复制。
- **For a Clean, Studio-Quality Voice:** - **为了获得干净、降噪的声音:**
-Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone. -启用“提示语音增强 (Prompt Speech Enhancement)”。这就像一个噪音过滤器,去除背景嘶嘶声和隆隆声,给你一个纯净、干净的声音克隆。但是,这将限制音频采样率为 16kHz限制了克隆质量的上限。
- **为了获得高质量音频克隆(最高 44.1kHz**
- ❌ 禁用“提示语音增强 (Prompt Speech Enhancement)”以保留所有原始音频信息,包括背景氛围,并支持高达 44.1kHz 采样率的音频克隆。
### 2. Cooking au Naturel (Letting the Model Improvise) ### 2. 自然烹饪(让模型即兴发挥)
- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4. - 如果没有提供参考VoxCPM 将成为一位创意大厨!通过其基础模型 MiniCPM-4 的文本智能,它会根据文本本身推断出合适的说话风格。
- **Pro Tip**: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results! - **专业提示**:用任何文本挑战 VoxCPM——诗歌、歌词、戏剧独白——它可能会带来一些有趣的结果
--- ---
## 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results) ## 🧂 第三步:最后的调味(微调结果)
You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices. 你已经准备好上菜了!但对于想要调整口味的大厨,这里有两个关键的香料。
### CFG Value (How Closely to Follow the Recipe) ### CFG 值(多严格地遵循食谱)
- **Default**: A great starting point. - **默认值**:一个很好的起点。
- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts. - **声音听起来紧张或奇怪?** 降低此值。它告诉模型更加放松和即兴,非常适合富有表现力的提示。
- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash. - **需要最大的清晰度和对文本的忠实度?** 稍微调高它,让模型保持更严格的控制。
- **Short sentences?** Consider increasing the CFG value for better clarity and adherence. - **短句?** 考虑增加 CFG 值以获得更好的清晰度和忠实度。
- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages. - **长文本?** 考虑降低 CFG 值以提高长段落的稳定性和自然度。
### Inference Timesteps (Simmering Time: Quality vs. Speed) ### 推理步数(炖煮时间:质量与速度)
- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments. - **需要快餐?** 使用较低的数值。非常适合快速草稿和实验。
- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness. - **烹饪大餐?** 使用较高的数值。这让模型“炖煮”得更久,提炼音频以获得卓越的细节和自然度。
--- ---
Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours! 祝创作愉快!🎉 从默认设置开始,根据你的项目进行调整。厨房是你的了!

1253
lora_ft_webui.py Normal file

File diff suppressed because it is too large.

.gitattributes

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
assets/voxcpm_model.png filter=lfs diff=lfs merge=lfs -text
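Rules of this form are what `git lfs track` writes; for example, tracking one more pattern appends a matching line (a generic Git LFS workflow, shown only as an illustration):

```bash
git lfs install
git lfs track "*.wav"       # appends a filter=lfs line to .gitattributes
git add .gitattributes
```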

View File

@@ -0,0 +1,272 @@
---
license: apache-2.0
language:
- en
- zh
base_model:
- openbmb/MiniCPM4-0.5B
pipeline_tag: text-to-speech
library_name: voxcpm1.5
tags:
- text-to-speech
- speech
- speech generation
- voice cloning
---
## 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
[![Project Page](https://img.shields.io/badge/Project%20Page-GitHub-blue)](https://github.com/OpenBMB/VoxCPM/) [![Technical Report](https://img.shields.io/badge/Technical%20Report-Arxiv-red)](https://arxiv.org/abs/2509.24650)[![Live Playground](https://img.shields.io/badge/Live%20PlayGround-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [![Samples](https://img.shields.io/badge/Audio%20Samples-Page-green)](https://openbmb.github.io/VoxCPM-demopage)
- VoxCPM1.5
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-OpenBMB-yellow)](https://huggingface.co/openbmb/VoxCPM1.5) [![ModelScope](https://img.shields.io/badge/ModelScope-OpenBMB-purple)](https://modelscope.cn/models/OpenBMB/VoxCPM1.5)
<div align="center">
<img src="assets/voxcpm_logo.png" alt="VoxCPM Logo" width="40%">
</div>
## 🎉 VoxCPM1.5 Updates
**Release Date:** December 5, 2025
VoxCPM1.5 brings improvements in audio quality and efficiency:
| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |
**Key Improvements:**
- 🔊 **Higher Quality**: 44.1kHz sampling rate preserves more high-frequency details for better voice cloning
- ⚡ **More Efficient**: Reduced token rate (6.25Hz) lowers computational cost while maintaining performance
- 🎓 **Fine-tuning Support**: Train personalized voice models with SFT or LoRA
**Note**: Output quality depends on the prompt speech quality. VoxCPM-0.5B remains fully supported with backward compatibility.
## 📚 Model Overview
VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.
Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.
<div align="center">
<img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="90%">
</div>
### 🚀 Key Features
- **Context-Aware, Expressive Speech Generation** - VoxCPM comprehends text to infer and generate appropriate prosody, delivering speech with remarkable expressiveness and natural flow. Trained on a massive 1.8 million-hour bilingual corpus, it spontaneously adapts its speaking style to the content, producing highly fitting vocal expression.
- **True-to-Life Voice Cloning** - With only a short reference audio clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker's timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing to create a faithful and natural replica.
- **High-Efficiency Synthesis** - VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, making real-time applications practical.
## Quick Start
### 🔧 Install from PyPI
``` sh
pip install voxcpm
```
### 1. Model Download (Optional)
By default, the model is downloaded automatically the first time you run a script, but you can also download it in advance.
- Download VoxCPM1.5
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM1.5")
```
- Or Download VoxCPM-0.5B
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
```
- Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for speech prompt ASR in the web demo.
```python
from modelscope import snapshot_download
snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
snapshot_download('iic/SenseVoiceSmall')
```
### 2. Basic Usage
```python
import soundfile as sf
import numpy as np
from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")
# Non-streaming
wav = model.generate(
text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
prompt_wav_path=None, # optional: path to a prompt speech for voice cloning
prompt_text=None, # optional: reference text
cfg_value=2.0, # LM guidance scale on LocDiT; higher values follow the text/prompt more closely but may sound less natural
inference_timesteps=10, # LocDiT inference timesteps; higher for better quality, lower for faster synthesis
normalize=False, # enable the external text-normalization tool (disables native raw-text support)
denoise=False, # enable the external denoiser; it may introduce some distortion and restricts the prompt sampling rate to 16kHz
retry_badcase=True, # enable retries for bad cases (e.g., generation that fails to stop)
retry_badcase_max_times=3, # maximum number of retries
retry_badcase_ratio_threshold=6.0, # maximum length ratio for bad-case detection (simple but effective); increase for slow-paced speech
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
print("saved: output.wav")
# Streaming
chunks = []
for chunk in model.generate_streaming(
text = "Streaming text to speech is easy with VoxCPM!",
# supports same args as above
):
chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("output_streaming.wav", wav, model.tts_model.sample_rate)
print("saved: output_streaming.wav")
```
### 3. CLI Usage
After installation, the entry point is `voxcpm` (or use `python -m voxcpm.cli`).
```bash
# 1) Direct synthesis (single text)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." --output out.wav
# 2) Voice cloning (reference audio + transcript)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
--prompt-audio path/to/voice.wav \
--prompt-text "reference transcript" \
--output out.wav \
# --denoise
# (Optional) Voice cloning (reference audio + transcript file)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
--prompt-audio path/to/voice.wav \
--prompt-file "/path/to/text-file" \
--output out.wav \
# --denoise
# 3) Batch processing (one text per line)
voxcpm --input examples/input.txt --output-dir outs
# (optional) Batch + cloning
voxcpm --input examples/input.txt --output-dir outs \
--prompt-audio path/to/voice.wav \
--prompt-text "reference transcript" \
# --denoise
# 4) Inference parameters (quality/speed)
voxcpm --text "..." --output out.wav \
--cfg-value 2.0 --inference-timesteps 10 --normalize
# 5) Model loading
# Prefer local path
voxcpm --text "..." --output out.wav --model-path /path/to/VoxCPM_model_dir
# Or from Hugging Face (auto download/cache)
voxcpm --text "..." --output out.wav \
--hf-model-id openbmb/VoxCPM1.5 --cache-dir ~/.cache/huggingface --local-files-only
# 6) Denoiser control
voxcpm --text "..." --output out.wav \
--no-denoiser --zipenhancer-path iic/speech_zipenhancer_ans_multiloss_16k_base
# 7) Help
voxcpm --help
python -m voxcpm.cli --help
```
### 4. Start web demo
You can start the web UI by running `python app.py`, which lets you perform Voice Cloning and Voice Creation.
### 5. Fine-tuning
VoxCPM1.5 supports both full fine-tuning (SFT) and LoRA fine-tuning, allowing you to train personalized voice models on your own data. See the [Fine-tuning Guide](docs/finetune.md) for detailed instructions.
**Quick Start:**
```bash
# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
--config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
# LoRA fine-tuning
python scripts/train_voxcpm_finetune.py \
--config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
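After training, a LoRA checkpoint can be loaded directly through the Python API. The sketch below follows the updated `scripts/test_voxcpm_lora_infer.py` in this changeset, which stores `lora_config.json` (base model path and LoRA hyperparameters) next to `lora_weights.safetensors` in each checkpoint folder; the `checkpoints/step_0002000` path is illustrative.
```python
# Minimal sketch: run inference with a LoRA checkpoint
# (mirrors scripts/test_voxcpm_lora_infer.py; the checkpoint path is illustrative).
import json
from pathlib import Path

import soundfile as sf
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

ckpt_dir = Path("checkpoints/step_0002000")   # your LoRA checkpoint directory
lora_info = json.loads((ckpt_dir / "lora_config.json").read_text(encoding="utf-8"))

model = VoxCPM.from_pretrained(
    hf_model_id=lora_info["base_model"],       # base model recorded at training time
    load_denoiser=False,
    lora_config=LoRAConfig(**lora_info["lora_config"]),
    lora_weights_path=str(ckpt_dir),
)
wav = model.generate(text="Hello, this is the LoRA finetuned result.")
sf.write("lora_test.wav", wav, model.tts_model.sample_rate)
```
The same checkpoint can also be hot-reloaded onto an already constructed model via `model.load_lora(str(ckpt_dir))`, as exercised in the test script.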
## 👩‍🍳 A Voice Chef's Guide
Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let's begin.
---
### 🥚 Step 1: Prepare Your Base Ingredients (Content)
First, choose how you'd like to input your text:
1. Regular Text (Classic Mode)
- ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.
2. Phoneme Input (Native Mode)
- ❌ Turn "Text Normalization" OFF. Enter phoneme text like {HH AH0 L OW1} (EN) or {ni3}{hao3} (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text—try it out!
- **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin. For English, phonemes are converted using CMUDict. Please refer to the relevant documentation for more details.
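If you prefer to drive phoneme mode from the Python API rather than the web demo, a minimal sketch looks like the following; the phoneme string is illustrative, and the key point is simply that `normalize=False` keeps the external text-normalization tool out of the way.
```python
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

# Native (phoneme) mode: pass phoneme tokens directly and keep normalization off.
wav = model.generate(
    text="{ni3}{hao3}, VoxCPM!",   # illustrative phoneme + raw text mix
    normalize=False,                # phoneme input requires Text Normalization to be OFF
)
sf.write("phoneme_demo.wav", wav, model.tts_model.sample_rate)
```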
---
### 🍳 Step 2: Choose Your Flavor Profile (Voice Style)
This is the secret sauce that gives your audio its unique sound.
#### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
- **For a Clean, Denoised Voice:**
- ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone. However, it limits the prompt audio sampling rate to 16kHz, which caps the cloning quality ceiling.
- **For High-Quality Audio Cloning (Up to 44.1kHz):**
- ❌ Disable "Prompt Speech Enhancement" to preserve all original audio information, including background atmosphere, and support audio cloning up to 44.1kHz sampling rate.
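In the Python API, the "Prompt Speech Enhancement" switch corresponds to the `denoise` argument of `generate` (it runs the external denoiser on the prompt audio). The sketch below contrasts the two options for the same cloning call; the prompt path and transcript are placeholders.
```python
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

common = dict(
    text="This is a quick voice cloning test.",
    prompt_wav_path="path/to/voice.wav",   # placeholder reference audio
    prompt_text="reference transcript",    # placeholder transcript
)

# Clean clone: prompt enhancement ON (prompt is denoised, limited to 16kHz).
wav_clean = model.generate(denoise=True, **common)
sf.write("clone_denoised.wav", wav_clean, model.tts_model.sample_rate)

# Faithful clone: enhancement OFF (keeps background ambience, supports prompts up to 44.1kHz).
wav_raw = model.generate(denoise=False, **common)
sf.write("clone_raw.wav", wav_raw, model.tts_model.sample_rate)
```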
#### 2. Cooking au Naturel (Letting the Model Improvise)
- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
- **Pro Tip**: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results!
---
### 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)
You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.
#### CFG Value (How Closely to Follow the Recipe)
- **Default**: A great starting point.
- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
- **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.
#### Inference Timesteps (Simmering Time: Quality vs. Speed)
- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
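To make these two knobs concrete, here is a small sketch contrasting a short, punchy line with a long passage; the specific values are illustrative starting points rather than recommendations.
```python
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

# Short sentence: nudge CFG up for clarity, keep timesteps low for a fast draft.
wav_short = model.generate(
    text="Welcome aboard!",
    cfg_value=2.5,            # illustrative: slightly above the 2.0 default
    inference_timesteps=10,
)
sf.write("short_draft.wav", wav_short, model.tts_model.sample_rate)

# Long passage: lower CFG for stability, raise timesteps to let the model "simmer".
wav_long = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
    cfg_value=1.5,            # illustrative: below the default for long-form stability
    inference_timesteps=30,   # illustrative: more refinement at the cost of speed
)
sf.write("long_polished.wav", wav_long, model.tts_model.sample_rate)
```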
---
Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
---
## ⚠️ Risks and limitations
- General Model Behavior: While VoxCPM has been trained on a large-scale dataset, it may still produce outputs that are unexpected, biased, or contain artifacts.
- Potential for Misuse of Voice Cloning: VoxCPM's powerful zero-shot voice cloning capability can generate highly realistic synthetic speech. This technology could be misused for creating convincing deepfakes for purposes of impersonation, fraud, or spreading disinformation. Users of this model must not use it to create content that infringes upon the rights of individuals. It is strictly forbidden to use VoxCPM for any illegal or unethical purposes. We strongly recommend that any publicly shared content generated with this model be clearly marked as AI-generated.
- Current Technical Limitations: Although generally stable, the model may occasionally exhibit instability, especially with very long or expressive inputs. Furthermore, the current version offers limited direct control over specific speech attributes like emotion or speaking style.
- Bilingual Model: VoxCPM is trained primarily on Chinese and English data. Performance on other languages is not guaranteed and may result in unpredictable or low-quality audio.
- This model is released for research and development purposes only. We do not recommend its use in production or commercial applications without rigorous testing and safety evaluations. Please use VoxCPM responsibly.
## 📄 License
The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.

Binary file not shown.


Binary file not shown.

View File

@@ -0,0 +1,60 @@
{
"architecture": "voxcpm",
"lm_config": {
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_size": 1024,
"intermediate_size": 4096,
"max_position_embeddings": 32768,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-05,
"rope_theta": 10000,
"rope_scaling": {
"type": "longrope",
"long_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
"short_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
"original_max_position_embeddings": 32768
},
"vocab_size": 73448,
"scale_emb": 12,
"dim_model_base": 256,
"scale_depth": 1.4,
"use_mup": false
},
"patch_size": 4,
"feat_dim": 64,
"scalar_quantization_latent_dim": 256,
"scalar_quantization_scale": 9,
"residual_lm_num_layers": 8,
"encoder_config": {
"hidden_dim": 1024,
"ffn_dim": 4096,
"num_heads": 16,
"num_layers": 8
},
"dit_config": {
"hidden_dim": 1024,
"ffn_dim": 4096,
"num_heads": 16,
"num_layers": 8,
"cfm_config": {
"sigma_min": 1e-06,
"solver": "euler",
"t_scheduler": "log-norm",
"inference_cfg_rate": 2.0
}
},
"audio_vae_config": {
"encoder_dim": 64,
"encoder_rates": [2, 3, 6, 7, 7],
"latent_dim": 64,
"decoder_dim": 2048,
"decoder_rates": [7, 7, 6, 3, 2],
"sample_rate": 44100
},
"max_length": 8192,
"device": "cuda",
"dtype": "bfloat16"
}
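As a quick sanity check, the numbers in this config line up with the 6.25Hz LM token rate quoted in the model card, assuming `encoder_rates` are the per-stage downsampling strides of the AudioVAE:
```python
# Sketch: derive the latent frame rate and LM token rate implied by this config
# (assumes encoder_rates are the per-stage downsampling strides of the AudioVAE).
from math import prod

sample_rate = 44100
encoder_rates = [2, 3, 6, 7, 7]
patch_size = 4

hop = prod(encoder_rates)             # 1764 audio samples per VAE latent frame
frame_rate = sample_rate / hop        # 44100 / 1764 = 25.0 Hz latent frames
token_rate = frame_rate / patch_size  # 25.0 / 4 = 6.25 Hz LM tokens

print(hop, frame_rate, token_rate)    # 1764 25.0 6.25
```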

View File

@@ -0,0 +1,81 @@
{
"additional_special_tokens": [
{
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|tool_call|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|execute_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|execute_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
],
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

File diff suppressed because it is too large

View File

@@ -0,0 +1,212 @@
{
"add_bos_token": true,
"add_eos_token": false,
"added_tokens_decoder": {
"0": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"101": {
"content": "<|audio_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"102": {
"content": "<|audio_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"103": {
"content": "<|audio_prompt_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"104": {
"content": "<|audio_prompt_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"105": {
"content": "<|background|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"106": {
"content": "<|/background|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"107": {
"content": "<|characters|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"108": {
"content": "<|/characters|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"109": {
"content": "<|speaker_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"110": {
"content": "<|/speaker_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"111": {
"content": "<|span|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"112": {
"content": "<|/span|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73440": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73441": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73442": {
"content": "<|tool_call|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73443": {
"content": "<|execute_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73444": {
"content": "<|execute_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73445": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73446": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73447": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_end|>",
"<|im_start|>",
"<|tool_call|>",
"<|execute_start|>",
"<|execute_end|>",
"<|fim_prefix|>",
"<|fim_middle|>",
"<|fim_suffix|>"
],
"bos_token": "<s>",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"legacy": true,
"model_max_length": 1000000000000000019884624838656,
"pad_token": null,
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false,
"chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
}

View File

@@ -114,7 +114,7 @@ def main():
prompt_text=prompt_text, prompt_text=prompt_text,
cfg_value=args.cfg_value, cfg_value=args.cfg_value,
inference_timesteps=args.inference_timesteps, inference_timesteps=args.inference_timesteps,
max_length=args.max_len, max_len=args.max_len,
normalize=args.normalize, normalize=args.normalize,
denoise=False, denoise=False,
) )

View File

@@ -5,7 +5,6 @@ LoRA inference test script.
Usage: Usage:
python scripts/test_voxcpm_lora_infer.py \ python scripts/test_voxcpm_lora_infer.py \
--config_path conf/voxcpm/voxcpm_finetune_test.yaml \
--lora_ckpt checkpoints/step_0002000 \ --lora_ckpt checkpoints/step_0002000 \
--text "Hello, this is LoRA finetuned result." \ --text "Hello, this is LoRA finetuned result." \
--output lora_test.wav --output lora_test.wav
@@ -13,37 +12,39 @@ Usage:
With voice cloning: With voice cloning:
python scripts/test_voxcpm_lora_infer.py \ python scripts/test_voxcpm_lora_infer.py \
--config_path conf/voxcpm/voxcpm_finetune_test.yaml \
--lora_ckpt checkpoints/step_0002000 \ --lora_ckpt checkpoints/step_0002000 \
--text "This is voice cloning result." \ --text "This is voice cloning result." \
--prompt_audio path/to/ref.wav \ --prompt_audio path/to/ref.wav \
--prompt_text "Reference audio transcript" \ --prompt_text "Reference audio transcript" \
--output lora_clone.wav --output lora_clone.wav
Note: The script reads base_model path and lora_config from lora_config.json
in the checkpoint directory (saved automatically during training).
""" """
import argparse import argparse
import json
from pathlib import Path from pathlib import Path
import soundfile as sf import soundfile as sf
from voxcpm.core import VoxCPM from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig from voxcpm.model.voxcpm import LoRAConfig
from voxcpm.training.config import load_yaml_config
def parse_args(): def parse_args():
parser = argparse.ArgumentParser("VoxCPM LoRA inference test") parser = argparse.ArgumentParser("VoxCPM LoRA inference test")
parser.add_argument(
"--config_path",
type=str,
required=True,
help="Training YAML config path (contains pretrained_path and lora config)",
)
parser.add_argument( parser.add_argument(
"--lora_ckpt", "--lora_ckpt",
type=str, type=str,
required=True, required=True,
help="LoRA checkpoint directory (contains lora_weights.ckpt with lora_A/lora_B only)", help="LoRA checkpoint directory (contains lora_weights.safetensors and lora_config.json)",
)
parser.add_argument(
"--base_model",
type=str,
default="",
help="Optional: override base model path (default: read from lora_config.json)",
) )
parser.add_argument( parser.add_argument(
"--text", "--text",
@@ -98,26 +99,44 @@ def parse_args():
def main(): def main():
args = parse_args() args = parse_args()
# 1. Load YAML config # 1. Check LoRA checkpoint directory
cfg = load_yaml_config(args.config_path) ckpt_dir = Path(args.lora_ckpt)
pretrained_path = cfg["pretrained_path"] if not ckpt_dir.exists():
lora_cfg_dict = cfg.get("lora", {}) or {}
lora_cfg = LoRAConfig(**lora_cfg_dict) if lora_cfg_dict else None
# 2. Check LoRA checkpoint
ckpt_dir = args.lora_ckpt
if not Path(ckpt_dir).exists():
raise FileNotFoundError(f"LoRA checkpoint not found: {ckpt_dir}") raise FileNotFoundError(f"LoRA checkpoint not found: {ckpt_dir}")
# 2. Load lora_config.json from checkpoint
lora_config_path = ckpt_dir / "lora_config.json"
if not lora_config_path.exists():
raise FileNotFoundError(
f"lora_config.json not found in {ckpt_dir}. "
"Make sure the checkpoint was saved with the updated training script."
)
with open(lora_config_path, "r", encoding="utf-8") as f:
lora_info = json.load(f)
# Get base model path (command line arg overrides config)
pretrained_path = args.base_model if args.base_model else lora_info.get("base_model")
if not pretrained_path:
raise ValueError("base_model not found in lora_config.json and --base_model not provided")
# Get LoRA config
lora_cfg_dict = lora_info.get("lora_config", {})
lora_cfg = LoRAConfig(**lora_cfg_dict) if lora_cfg_dict else None
print(f"Loaded config from: {lora_config_path}")
print(f" Base model: {pretrained_path}")
print(f" LoRA config: r={lora_cfg.r}, alpha={lora_cfg.alpha}" if lora_cfg else " LoRA config: None")
# 3. Load model with LoRA (no denoiser) # 3. Load model with LoRA (no denoiser)
print(f"[1/2] Loading model with LoRA: {pretrained_path}") print(f"\n[1/2] Loading model with LoRA: {pretrained_path}")
print(f" LoRA weights: {ckpt_dir}") print(f" LoRA weights: {ckpt_dir}")
model = VoxCPM.from_pretrained( model = VoxCPM.from_pretrained(
hf_model_id=pretrained_path, hf_model_id=pretrained_path,
load_denoiser=False, load_denoiser=False,
optimize=True, optimize=True,
lora_config=lora_cfg, lora_config=lora_cfg,
lora_weights_path=ckpt_dir, lora_weights_path=str(ckpt_dir),
) )
# 4. Synthesize audio # 4. Synthesize audio
@@ -136,7 +155,7 @@ def main():
prompt_text=prompt_text, prompt_text=prompt_text,
cfg_value=args.cfg_value, cfg_value=args.cfg_value,
inference_timesteps=args.inference_timesteps, inference_timesteps=args.inference_timesteps,
max_length=args.max_len, max_len=args.max_len,
normalize=args.normalize, normalize=args.normalize,
denoise=False, denoise=False,
) )
@@ -153,7 +172,7 @@ def main():
prompt_text=prompt_text, prompt_text=prompt_text,
cfg_value=args.cfg_value, cfg_value=args.cfg_value,
inference_timesteps=args.inference_timesteps, inference_timesteps=args.inference_timesteps,
max_length=args.max_len, max_len=args.max_len,
normalize=args.normalize, normalize=args.normalize,
denoise=False, denoise=False,
) )
@@ -170,7 +189,7 @@ def main():
prompt_text=prompt_text, prompt_text=prompt_text,
cfg_value=args.cfg_value, cfg_value=args.cfg_value,
inference_timesteps=args.inference_timesteps, inference_timesteps=args.inference_timesteps,
max_length=args.max_len, max_len=args.max_len,
normalize=args.normalize, normalize=args.normalize,
denoise=False, denoise=False,
) )
@@ -187,7 +206,7 @@ def main():
prompt_text=prompt_text, prompt_text=prompt_text,
cfg_value=args.cfg_value, cfg_value=args.cfg_value,
inference_timesteps=args.inference_timesteps, inference_timesteps=args.inference_timesteps,
max_length=args.max_len, max_len=args.max_len,
normalize=args.normalize, normalize=args.normalize,
denoise=False, denoise=False,
) )
@@ -197,7 +216,7 @@ def main():
# === Test 5: Hot-reload LoRA (load_lora) === # === Test 5: Hot-reload LoRA (load_lora) ===
print(f"\n [Test 5] Hot-reload LoRA (load_lora)...") print(f"\n [Test 5] Hot-reload LoRA (load_lora)...")
loaded, skipped = model.load_lora(str(ckpt_dir)) loaded, skipped = model.load_lora(ckpt_dir)
print(f" Reloaded {len(loaded)} parameters") print(f" Reloaded {len(loaded)} parameters")
audio_np = model.generate( audio_np = model.generate(
text=args.text, text=args.text,
@@ -205,7 +224,7 @@ def main():
prompt_text=prompt_text, prompt_text=prompt_text,
cfg_value=args.cfg_value, cfg_value=args.cfg_value,
inference_timesteps=args.inference_timesteps, inference_timesteps=args.inference_timesteps,
max_length=args.max_len, max_len=args.max_len,
normalize=args.normalize, normalize=args.normalize,
denoise=False, denoise=False,
) )

View File

@@ -14,6 +14,8 @@ import torch
from tensorboardX import SummaryWriter from tensorboardX import SummaryWriter
from torch.optim import AdamW from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup from transformers import get_cosine_schedule_with_warmup
import signal
import os
try: try:
from safetensors.torch import save_file from safetensors.torch import save_file
@@ -56,8 +58,16 @@ def train(
lambdas: Dict[str, float] = {"loss/diff": 1.0, "loss/stop": 1.0}, lambdas: Dict[str, float] = {"loss/diff": 1.0, "loss/stop": 1.0},
lora: dict = None, lora: dict = None,
config_path: str = "", config_path: str = "",
# Distribution options (for LoRA checkpoints)
hf_model_id: str = "", # HuggingFace model ID (e.g., "openbmb/VoxCPM1.5")
distribute: bool = False, # If True, save hf_model_id as base_model; otherwise save pretrained_path
): ):
_ = config_path _ = config_path
# Validate distribution options
if lora is not None and distribute and not hf_model_id:
raise ValueError("hf_model_id is required when distribute=True")
accelerator = Accelerator(amp=True) accelerator = Accelerator(amp=True)
save_dir = Path(save_path) save_dir = Path(save_path)
@@ -171,6 +181,39 @@ def train(
num_training_steps=total_training_steps, num_training_steps=total_training_steps,
) )
# Try to load checkpoint and resume training
start_step = 0
if accelerator.rank == 0:
start_step = load_checkpoint(model, optimizer, scheduler, save_dir)
# Broadcast start_step to all processes
if hasattr(accelerator, 'all_reduce'):
start_step_tensor = torch.tensor(start_step, device=accelerator.device)
accelerator.all_reduce(start_step_tensor)
start_step = int(start_step_tensor.item())
if start_step > 0 and accelerator.rank == 0:
tracker.print(f"Resuming training from step {start_step}")
# Resume tracker for signal handler to read current step
resume = {"step": start_step}
# Register signal handler to save checkpoint on termination (SIGTERM/SIGINT)
def _signal_handler(signum, frame, _model=model, _optim=optimizer, _sched=scheduler, _save_dir=save_dir, _pretrained=pretrained_path, _hf_id=hf_model_id, _dist=distribute, _resume=resume):
try:
cur_step = int(_resume.get("step", start_step))
except Exception:
cur_step = start_step
print(f"Signal {signum} received. Saving checkpoint at step {cur_step} ...")
try:
save_checkpoint(_model, _optim, _sched, _save_dir, cur_step, _pretrained, _hf_id, _dist)
print("Checkpoint saved. Exiting.")
except Exception as e:
print(f"Error saving checkpoint on signal: {e}")
os._exit(0)
signal.signal(signal.SIGTERM, _signal_handler)
signal.signal(signal.SIGINT, _signal_handler)
# Manual epoch management instead of itertools.cycle to support DistributedSampler.set_epoch() # Manual epoch management instead of itertools.cycle to support DistributedSampler.set_epoch()
grad_accum_steps = max(int(grad_accum_steps), 1) grad_accum_steps = max(int(grad_accum_steps), 1)
data_epoch = 0 data_epoch = 0
@@ -191,7 +234,9 @@ def train(
return next(train_iter) return next(train_iter)
with tracker.live(): with tracker.live():
for step in range(num_iters): for step in range(start_step, num_iters):
# update resume step so signal handler can save current progress
resume["step"] = step
tracker.step = step tracker.step = step
optimizer.zero_grad(set_to_none=True) optimizer.zero_grad(set_to_none=True)
@@ -255,10 +300,10 @@ def train(
validate(model, val_loader, batch_processor, accelerator, tracker, lambdas) validate(model, val_loader, batch_processor, accelerator, tracker, lambdas)
if step % save_interval == 0 and accelerator.rank == 0: if step % save_interval == 0 and accelerator.rank == 0:
save_checkpoint(model, optimizer, scheduler, save_dir, step, pretrained_path) save_checkpoint(model, optimizer, scheduler, save_dir, step, pretrained_path, hf_model_id, distribute)
if accelerator.rank == 0: if accelerator.rank == 0:
save_checkpoint(model, optimizer, scheduler, save_dir, num_iters, pretrained_path) save_checkpoint(model, optimizer, scheduler, save_dir, num_iters, pretrained_path, hf_model_id, distribute)
if writer: if writer:
writer.close() writer.close()
@@ -301,7 +346,77 @@ def validate(model, val_loader, batch_processor, accelerator, tracker, lambdas):
model.train() model.train()
def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pretrained_path: str = None): def load_checkpoint(model, optimizer, scheduler, save_dir: Path):
"""
Load the latest checkpoint if it exists.
Returns the step number to resume from, or 0 if no checkpoint found.
"""
latest_folder = save_dir / "latest"
if not latest_folder.exists():
return 0
unwrapped = model.module if hasattr(model, "module") else model
lora_cfg = unwrapped.lora_config
# Load model weights
if lora_cfg is not None:
# LoRA: load lora_weights
lora_weights_path = latest_folder / "lora_weights.safetensors"
if not lora_weights_path.exists():
lora_weights_path = latest_folder / "lora_weights.ckpt"
if lora_weights_path.exists():
if lora_weights_path.suffix == ".safetensors":
from safetensors.torch import load_file
state_dict = load_file(str(lora_weights_path))
else:
ckpt = torch.load(lora_weights_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
# Load only lora weights
unwrapped.load_state_dict(state_dict, strict=False)
print(f"Loaded LoRA weights from {lora_weights_path}")
else:
# Full finetune: load model.safetensors or pytorch_model.bin
model_path = latest_folder / "model.safetensors"
if not model_path.exists():
model_path = latest_folder / "pytorch_model.bin"
if model_path.exists():
if model_path.suffix == ".safetensors":
from safetensors.torch import load_file
state_dict = load_file(str(model_path))
else:
ckpt = torch.load(model_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
unwrapped.load_state_dict(state_dict, strict=False)
print(f"Loaded model weights from {model_path}")
# Load optimizer state
optimizer_path = latest_folder / "optimizer.pth"
if optimizer_path.exists():
optimizer.load_state_dict(torch.load(optimizer_path, map_location="cpu"))
print(f"Loaded optimizer state from {optimizer_path}")
# Load scheduler state
scheduler_path = latest_folder / "scheduler.pth"
if scheduler_path.exists():
scheduler.load_state_dict(torch.load(scheduler_path, map_location="cpu"))
print(f"Loaded scheduler state from {scheduler_path}")
# Try to infer step from checkpoint folders
step_folders = [d for d in save_dir.iterdir() if d.is_dir() and d.name.startswith("step_")]
if step_folders:
steps = [int(d.name.split("_")[1]) for d in step_folders]
resume_step = max(steps)
print(f"Resuming from step {resume_step}")
return resume_step
return 0
def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pretrained_path: str = None, hf_model_id: str = "", distribute: bool = False):
""" """
Save checkpoint with different strategies for full finetune vs LoRA: Save checkpoint with different strategies for full finetune vs LoRA:
- Full finetune: save non-vae weights to model.safetensors (or pytorch_model.bin if safetensors unavailable) - Full finetune: save non-vae weights to model.safetensors (or pytorch_model.bin if safetensors unavailable)
@@ -325,6 +440,17 @@ def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pret
save_file(state_dict, folder / "lora_weights.safetensors") save_file(state_dict, folder / "lora_weights.safetensors")
else: else:
torch.save({"state_dict": state_dict}, folder / "lora_weights.ckpt") torch.save({"state_dict": state_dict}, folder / "lora_weights.ckpt")
# Save LoRA config and base model path to a separate JSON file
# If distribute=True, save hf_model_id; otherwise save local pretrained_path
import json
base_model_to_save = hf_model_id if distribute else (str(pretrained_path) if pretrained_path else None)
lora_info = {
"base_model": base_model_to_save,
"lora_config": lora_cfg.model_dump() if hasattr(lora_cfg, "model_dump") else vars(lora_cfg),
}
with open(folder / "lora_config.json", "w", encoding="utf-8") as f:
json.dump(lora_info, f, indent=2, ensure_ascii=False)
else: else:
# Full finetune: save non-vae weights to model.safetensors # Full finetune: save non-vae weights to model.safetensors
state_dict = {k: v for k, v in full_state.items() if not k.startswith("audio_vae.")} state_dict = {k: v for k, v in full_state.items() if not k.startswith("audio_vae.")}
@@ -345,6 +471,29 @@ def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pret
torch.save(optimizer.state_dict(), folder / "optimizer.pth") torch.save(optimizer.state_dict(), folder / "optimizer.pth")
torch.save(scheduler.state_dict(), folder / "scheduler.pth") torch.save(scheduler.state_dict(), folder / "scheduler.pth")
# Update (or create) a `latest` symlink pointing to the most recent checkpoint folder
latest_link = save_dir / "latest"
try:
if latest_link.exists() or latest_link.is_symlink():
# remove existing link or directory
if latest_link.is_dir() and not latest_link.is_symlink():
shutil.rmtree(latest_link)
else:
latest_link.unlink()
# Create a symlink pointing to the new folder
os.symlink(str(folder), str(latest_link))
except Exception:
# If symlink creation fails (e.g., on Windows or permission issues), fall back to copying
try:
if latest_link.exists():
if latest_link.is_dir():
shutil.rmtree(latest_link)
else:
latest_link.unlink()
shutil.copytree(folder, latest_link)
except Exception:
print(f"Warning: failed to update latest checkpoint link at {latest_link}")
if __name__ == "__main__": if __name__ == "__main__":
from voxcpm.training.config import load_yaml_config from voxcpm.training.config import load_yaml_config
@@ -359,4 +508,3 @@ if __name__ == "__main__":
# Otherwise use command line args (parsed by argbind) # Otherwise use command line args (parsed by argbind)
with argbind.scope(args): with argbind.scope(args):
train() train()

View File

@@ -55,11 +55,12 @@ class VoxCPM:
self.denoiser = ZipEnhancer(zipenhancer_model_path) self.denoiser = ZipEnhancer(zipenhancer_model_path)
else: else:
self.denoiser = None self.denoiser = None
print("Warm up VoxCPMModel...") if optimize:
self.tts_model.generate( print("Warm up VoxCPMModel...")
target_text="Hello, this is the first test sentence.", self.tts_model.generate(
max_len=10, target_text="Hello, this is the first test sentence.",
) max_len=10,
)
@classmethod @classmethod
def from_pretrained(cls, def from_pretrained(cls,

View File

@@ -159,6 +159,7 @@ class MiniCPMAttention(nn.Module):
query_states = query_states.contiguous() query_states = query_states.contiguous()
key_states = key_states.contiguous() key_states = key_states.contiguous()
value_states = value_states.contiguous() value_states = value_states.contiguous()
attn_output = torch.nn.functional.scaled_dot_product_attention( attn_output = torch.nn.functional.scaled_dot_product_attention(
query_states, query_states,
key_states, key_states,
@@ -208,6 +209,7 @@ class MiniCPMAttention(nn.Module):
query_states = query_states.contiguous() query_states = query_states.contiguous()
key_cache = key_cache.contiguous() key_cache = key_cache.contiguous()
value_cache = value_cache.contiguous() value_cache = value_cache.contiguous()
attn_output = torch.nn.functional.scaled_dot_product_attention( attn_output = torch.nn.functional.scaled_dot_product_attention(
query_states, query_states,
key_cache, key_cache,