Compare commits: a7a447b02a ... 1e44eba871 (10 commits)

| SHA1 |
|------|
| 1e44eba871 |
| a266c0a88d |
| 0779a93697 |
| a1f9d0c3b6 |
| aefba63f71 |
| 58717d7d82 |
| 1b0ff5693c |
| 762815a5b7 |
| 5b13a35ea6 |
| 3ba727a615 |
.gitignore (vendored, 12 changed lines)

@@ -1,4 +1,14 @@
 launch.json
 __pycache__
 voxcpm.egg-info
 .DS_Store
+*.safetensors
+*.pth
+*.pt
+*.ckpt
+*.bin
+*.pyc
+.trae/
+.vscode/
+.idea/
+*.log
README.md

@@ -44,13 +44,13 @@ Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses

 ### 📦 Model Versions
 See [Release Notes](docs/release_note.md) for details
 - **VoxCPM1.5** (Latest):
-  - Model Params: 750M
+  - Model Params: 800M
   - Sampling rate of AudioVAE: 44100
   - Token rate in LM Backbone: 6.25Hz (patch-size=4)
   - RTF in a single NVIDIA-RTX 4090 GPU: ~0.15

 - **VoxCPM-0.5B** (Original):
-  - Model Params: 600M
+  - Model Params: 640M
   - Sampling rate of AudioVAE: 16000
   - Token rate in LM Backbone: 12.5Hz (patch-size=2)
   - RTF in a single NVIDIA-RTX 4090 GPU: 0.17

@@ -210,6 +210,8 @@ We're excited to see the VoxCPM community growing! Here are some amazing projects:

 - **[VoxCPM-NanoVLLM](https://github.com/a710128/nanovllm-voxcpm)** NanoVLLM integration for VoxCPM for faster, high-throughput inference on GPU.
 - **[VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)** ONNX export for VoxCPM supports faster CPU inference.
 - **[VoxCPMANE](https://github.com/0seba/VoxCPMANE)** VoxCPM TTS with Apple Neural Engine backend server.
+- **[PR: LoRA finetune web UI (by Ayin1412)](https://github.com/OpenBMB/VoxCPM/pull/100)**
+- **[voxcpm_rs](https://github.com/madushan1000/voxcpm_rs)** A re-implementation of VoxCPM-0.5B in Rust.

 *Note: The projects are not officially maintained by OpenBMB.*
README_zh.md (new file, 200 lines)
# 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

[GitHub](https://github.com/OpenBMB/VoxCPM/) [Paper](https://arxiv.org/abs/2509.24650) [Playground](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [Demo Page](https://openbmb.github.io/VoxCPM-demopage)

<div align="center">
  <img src="assets/voxcpm_logo.png" alt="VoxCPM Logo" width="40%">
</div>

<div align="center">

👋 Contact us on [WeChat](assets/wechat.png)

</div>

## News
* [2025.12.05] 🎉 🎉 🎉 Open-sourced the **VoxCPM1.5** [weights](https://huggingface.co/openbmb/VoxCPM1.5)! The model now supports full-parameter fine-tuning and efficient LoRA fine-tuning, so you can build your own customized version. See the [Release Notes](docs/release_note.md).
* [2025.09.30] 🔥 🔥 🔥 Released the VoxCPM [technical report](https://arxiv.org/abs/2509.24650)!
* [2025.09.16] 🔥 🔥 🔥 Open-sourced the VoxCPM-0.5B [weights](https://huggingface.co/openbmb/VoxCPM-0.5B)!
* [2025.09.16] 🎉 🎉 🎉 Launched the VoxCPM-0.5B [Gradio Playground](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo), try it now!

## Overview

VoxCPM is a novel tokenizer-free text-to-speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two core capabilities: **context-aware speech generation** and **true-to-life zero-shot voice cloning**.

Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

<div align="center">
  <img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="90%">
</div>

### 🚀 Key Features
- **Context-Aware, Expressive Speech Generation** - VoxCPM understands the text and infers appropriate prosody, producing expressive, natural-sounding speech. Trained on a 1.8-million-hour bilingual corpus, it spontaneously adapts its speaking style to the content, yielding speech that closely fits the text.
- **True-to-Life Voice Cloning** - From only a short reference clip, VoxCPM performs accurate zero-shot voice cloning, capturing not just the speaker's timbre but also fine-grained traits such as accent, emotional tone, rhythm, and pacing, creating a faithful and natural replica.
- **Efficient Synthesis** - VoxCPM supports streaming synthesis with a real-time factor (RTF) as low as 0.17 on a consumer NVIDIA RTX 4090 GPU, making real-time applications practical.
### 📦 Model Versions
See the [Release Notes](docs/release_note.md) for details
- **VoxCPM1.5** (Latest):
  - Model Params: 800M
  - Sampling rate of AudioVAE: 44100
  - Token rate in LM Backbone: 6.25Hz (patch-size=4)
  - RTF on a single NVIDIA-RTX 4090 GPU: ~0.15

- **VoxCPM-0.5B** (Original):
  - Model Params: 640M
  - Sampling rate of AudioVAE: 16000
  - Token rate in LM Backbone: 12.5Hz (patch-size=2)
  - RTF on a single NVIDIA-RTX 4090 GPU: 0.17
## Quick Start

### 🔧 Install from PyPI
```bash
pip install voxcpm
```

### 1. Model Download (Optional)
By default, the models are downloaded automatically the first time you run a script, but you can also fetch them in advance.
- Download VoxCPM1.5
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM1.5")
```

- Or download VoxCPM-0.5B
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
```
- Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance prompt speech, and SenseVoice-Small for ASR (automatic speech recognition) of the prompt speech in the web demo.
```python
from modelscope import snapshot_download
snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
snapshot_download('iic/SenseVoiceSmall')
```
### 2. Basic Usage (Python)
```python
import soundfile as sf
import numpy as np
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

# Non-streaming generation
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
    prompt_wav_path=None,       # optional: path to a prompt speech clip for voice cloning
    prompt_text=None,           # optional: reference transcript
    cfg_value=2.0,              # LM guidance on LocDiT; higher sticks closer to the prompt but can hurt naturalness
    inference_timesteps=10,     # LocDiT inference steps; higher for better quality, lower for faster speed
    normalize=False,            # enable external text normalization, which disables native raw-text support
    denoise=False,              # enable external denoising, which may introduce distortion and caps the sample rate at 16kHz
    retry_badcase=True,         # enable retries for certain bad cases (not interruptible)
    retry_badcase_max_times=3,  # maximum number of retries
    retry_badcase_ratio_threshold=6.0,  # maximum length ratio for bad-case detection (simple but effective); tune for slow-paced speech
)

sf.write("output.wav", wav, model.tts_model.sample_rate)
print("saved: output.wav")

# Streaming generation
chunks = []
for chunk in model.generate_streaming(
    text="Streaming text-to-speech with VoxCPM is easy!",
    # supports the same parameters as above
):
    chunks.append(chunk)
wav = np.concatenate(chunks)

sf.write("output_streaming.wav", wav, model.tts_model.sample_rate)
print("saved: output_streaming.wav")
```
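If you would rather hear the audio as it is produced instead of collecting all chunks first, the stream can be fed straight to the sound card. A minimal sketch, assuming the third-party `sounddevice` package is installed and that each chunk is a 1-D float32 array (both are assumptions for the example, not guarantees of the API):

```python
import sounddevice as sd
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")
sr = model.tts_model.sample_rate

# Write each chunk to the output stream as soon as it arrives
with sd.OutputStream(samplerate=sr, channels=1, dtype="float32") as stream:
    for chunk in model.generate_streaming(
        text="Streaming text-to-speech with VoxCPM is easy!",
    ):
        stream.write(chunk)  # assumed: 1-D float32 numpy array per chunk
```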
### 3. Command-Line (CLI) Usage

After installation, the entry point is `voxcpm` (or use `python -m voxcpm.cli`).

```bash
# 1) Direct synthesis (single text)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model." --output out.wav

# 2) Voice cloning (reference audio + text)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model." \
  --prompt-audio path/to/voice.wav \
  --prompt-text "Transcript of the reference audio" \
  --output out.wav \
  # --denoise

# (Optional) Voice cloning (reference audio + text file)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model." \
  --prompt-audio path/to/voice.wav \
  --prompt-file "/path/to/text-file" \
  --output out.wav \
  # --denoise

# 3) Batch processing (one text per line)
voxcpm --input examples/input.txt --output-dir outs
# (Optional) batch + cloning
voxcpm --input examples/input.txt --output-dir outs \
  --prompt-audio path/to/voice.wav \
  --prompt-text "Transcript of the reference audio" \
  # --denoise

# 4) Inference parameters (quality/speed)
voxcpm --text "..." --output out.wav \
  --cfg-value 2.0 --inference-timesteps 10 --normalize

# 5) Model loading
# Prefer a local path
voxcpm --text "..." --output out.wav --model-path /path/to/VoxCPM_model_dir
# Or load from Hugging Face (auto download/cache)
voxcpm --text "..." --output out.wav \
  --hf-model-id openbmb/VoxCPM1.5 --cache-dir ~/.cache/huggingface --local-files-only

# 6) Denoiser control
voxcpm --text "..." --output out.wav \
  --no-denoiser --zipenhancer-path iic/speech_zipenhancer_ans_multiloss_16k_base

# 7) Help
voxcpm --help
python -m voxcpm.cli --help
```
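For batch mode (item 3 above), the input file holds one sentence per line. A quick way to try it; the two sentences below are placeholder content, not a file shipped with the repo:

```bash
cat > examples/input.txt <<'EOF'
The first sentence to synthesize.
The second sentence to synthesize.
EOF
voxcpm --input examples/input.txt --output-dir outs  # one WAV per input line is expected in outs/
```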
### 4. Launch the Web Demo

You can run `python app.py` to launch a UI that supports voice cloning and voice creation.

```bash
python app.py
```
### 5. Fine-tuning

VoxCPM1.5 supports full-parameter fine-tuning (SFT) and LoRA fine-tuning, letting you train personalized voice models on your own data. See the [Fine-tuning Guide](docs/finetune.md) for details.

**Quick start:**
```bash
# Full-parameter fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# LoRA fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
## 📚 Documentation

- **[Usage Guide](docs/usage_guide.md)** - Detailed guide on using VoxCPM effectively, including text input modes, voice-cloning tips, and parameter tuning.
- **[Fine-tuning Guide](docs/finetune.md)** - Complete guide to fine-tuning VoxCPM models with SFT and LoRA.
- **[Release Notes](docs/release_note.md)** - Version history and updates.
- **[Performance Benchmarks](docs/performance.md)** - Detailed performance comparisons on public benchmarks.

## ⚠️ Risks and Limitations

- **General model behavior**: Although VoxCPM was trained on a large-scale dataset, it may still produce unexpected or biased outputs, or outputs containing artifacts.
- **Potential misuse of voice cloning**: VoxCPM's strong zero-shot voice cloning could be misused.

---
app.py (86 changed lines)

@@ -45,7 +45,8 @@ class VoxCPMDemo:
         repo_id = os.environ.get("HF_REPO_ID", "").strip()
         if len(repo_id) > 0:
             target_dir = os.path.join("models", repo_id.replace("/", "__"))
-            if not os.path.isdir(target_dir):
+            # Check if directory exists AND contains config.json
+            if not os.path.isdir(target_dir) or not os.path.exists(os.path.join(target_dir, "config.json")):
                 try:
                     from huggingface_hub import snapshot_download  # type: ignore
                     os.makedirs(target_dir, exist_ok=True)
@@ -155,45 +156,33 @@ def create_demo_interface(demo: VoxCPMDemo):
     gr.HTML('<div class="logo-container"><img src="/gradio_api/file=assets/voxcpm_logo.png" alt="VoxCPM Logo"></div>')

     # Quick Start
-    with gr.Accordion("📋 Quick Start Guide |快速入门", open=False, elem_id="acc_quick"):
+    with gr.Accordion("📋 快速入门", open=False, elem_id="acc_quick"):
         gr.Markdown("""
-            ### How to Use |使用说明
-            1. **(Optional) Provide a Voice Prompt** - Upload or record an audio clip to provide the desired voice characteristics for synthesis.
-            **(可选)提供参考声音** - 上传或录制一段音频,为声音合成提供音色、语调和情感等个性化特征
-            2. **(Optional) Enter prompt text** - If you provided a voice prompt, enter the corresponding transcript here (auto-recognition available).
-            **(可选项)输入参考文本** - 如果提供了参考语音,请输入其对应的文本内容(支持自动识别)。
-            3. **Enter target text** - Type the text you want the model to speak.
-            **输入目标文本** - 输入您希望模型朗读的文字内容。
-            4. **Generate Speech** - Click the "Generate" button to create your audio.
-            **生成语音** - 点击"生成"按钮,即可为您创造出音频。
+            ### 使用说明
+            1. **(可选)提供参考声音** - 上传或录制一段音频,为声音合成提供音色、语调和情感等个性化特征。
+            2. **(可选)输入参考文本** - 如果提供了参考语音,请输入其对应的文本内容(支持自动识别)。
+            3. **输入目标文本** - 输入您希望模型朗读的文字内容。
+            4. **生成语音** - 点击"生成语音"按钮,即可为您创造出音频。
             """)

     # Pro Tips
-    with gr.Accordion("💡 Pro Tips |使用建议", open=False, elem_id="acc_tips"):
+    with gr.Accordion("💡 使用建议", open=False, elem_id="acc_tips"):
         gr.Markdown("""
-            ### Prompt Speech Enhancement|参考语音降噪
-            - **Enable** to remove background noise for a clean, studio-like voice, with an external ZipEnhancer component.
-            **启用**:通过 ZipEnhancer 组件消除背景噪音,获得更好的音质。
-            - **Disable** to preserve the original audio's background atmosphere.
-            **禁用**:保留原始音频的背景环境声,如果想复刻相应声学环境。
+            ### 参考语音降噪
+            - **启用**:通过 ZipEnhancer 组件消除背景噪音,但会将音频采样率限制在16kHz,限制克隆上限。
+            - **禁用**:保留原始音频的全部信息,包括背景环境声,最高支持44.1kHz的音频复刻。

-            ### Text Normalization|文本正则化
-            - **Enable** to process general text with an external WeTextProcessing component.
-            **启用**:使用 WeTextProcessing 组件,可处理常见文本。
-            - **Disable** to use VoxCPM's native text understanding ability. For example, it supports phonemes input ({HH AH0 L OW1}), try it!
-            **禁用**:将使用 VoxCPM 内置的文本理解能力。如,支持音素输入(如 {da4}{jia1}好)和公式符号合成,尝试一下!
+            ### 文本正则化
+            - **启用**:使用 WeTextProcessing 组件,可支持常见文本的正则化处理。
+            - **禁用**:将使用 VoxCPM 内置的文本理解能力。如,支持音素输入(如中文转拼音:{ni3}{hao3};英文转CMUDict:{HH AH0 L OW1})和公式符号合成,尝试一下!

-            ### CFG Value|CFG 值
-            - **Lower CFG** if the voice prompt sounds strained or expressive.
-            **调低**:如果提示语音听起来不自然或过于夸张。
-            - **Higher CFG** for better adherence to the prompt speech style or input text.
-            **调高**:为更好地贴合提示音频的风格或输入文本。
+            ### CFG 值
+            - **调低**:如果提示语音听起来不自然或过于夸张,或者长文本输入出现稳定性问题。
+            - **调高**:为更好地贴合提示音频的风格或输入文本,或者极短文本输入出现稳定性问题。

-            ### Inference Timesteps|推理时间步
-            - **Lower** for faster synthesis speed.
-            **调低**:合成速度更快。
-            - **Higher** for better synthesis quality.
-            **调高**:合成质量更佳。
+            ### 推理时间步
+            - **调低**:合成速度更快。
+            - **调高**:合成质量更佳。
             """)

     # Main controls
@@ -202,22 +191,22 @@ def create_demo_interface(demo: VoxCPMDemo):
             prompt_wav = gr.Audio(
                 sources=["upload", 'microphone'],
                 type="filepath",
-                label="Prompt Speech (Optional, or let VoxCPM improvise)",
+                label="参考语音(可选,或让 VoxCPM 自由发挥)",
                 value="./examples/example.wav",
             )
             DoDenoisePromptAudio = gr.Checkbox(
                 value=False,
-                label="Prompt Speech Enhancement",
+                label="参考语音增强",
                 elem_id="chk_denoise",
-                info="We use ZipEnhancer model to denoise the prompt audio."
+                info="使用 ZipEnhancer 模型对参考音频进行降噪。"
             )
             with gr.Row():
                 prompt_text = gr.Textbox(
                     value="Just by listening a few minutes a day, you'll be able to eliminate negative thoughts by conditioning your mind to be more positive.",
-                    label="Prompt Text",
-                    placeholder="Please enter the prompt text. Automatic recognition is supported, and you can correct the results yourself..."
+                    label="参考文本",
+                    placeholder="请输入参考文本。支持自动识别,您也可以自行修改结果..."
                 )
-            run_btn = gr.Button("Generate Speech", variant="primary")
+            run_btn = gr.Button("生成语音", variant="primary")

         with gr.Column():
             cfg_value = gr.Slider(
@@ -225,30 +214,31 @@ def create_demo_interface(demo: VoxCPMDemo):
                 maximum=3.0,
                 value=2.0,
                 step=0.1,
-                label="CFG Value (Guidance Scale)",
-                info="Higher values increase adherence to prompt, lower values allow more creativity"
+                label="CFG 值 (引导比例)",
+                info="值越高越贴合提示,值越低允许更多的创造性"
             )
             inference_timesteps = gr.Slider(
                 minimum=4,
                 maximum=30,
                 value=10,
                 step=1,
-                label="Inference Timesteps",
-                info="Number of inference timesteps for generation (higher values may improve quality but slower)"
+                label="推理时间步",
+                info="生成的推理时间步数(值越高可能质量越好,但速度更慢)"
             )
             with gr.Row():
                 text = gr.Textbox(
-                    value="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly realistic speech.",
-                    label="Target Text",
+                    value="VoxCPM 是 ModelBest 推出的一款创新型端到端 TTS 模型,旨在生成极具表现力的语音。",
+                    label="目标文本",
                 )
             with gr.Row():
                 DoNormalizeText = gr.Checkbox(
                     value=False,
-                    label="Text Normalization",
+                    label="文本正则化",
                     elem_id="chk_normalize",
-                    info="We use wetext library to normalize the input text."
+                    info="使用 wetext 库对输入文本进行标准化。"
                 )
-            audio_output = gr.Audio(label="Output Audio")
+            audio_output = gr.Audio(label="输出音频")


     # Wiring
     run_btn.click(
@@ -267,7 +257,7 @@ def run_demo(server_name: str = "localhost", server_port: int = 7860, show_error
     demo = VoxCPMDemo()
     interface = create_demo_interface(demo)
     # Recommended to enable queue on Spaces for better throughput
-    interface.queue(max_size=10).launch(server_name=server_name, server_port=server_port, show_error=show_error)
+    interface.queue(max_size=10, default_concurrency_limit=1).launch(server_name=server_name, server_port=server_port, show_error=show_error)


 if __name__ == "__main__":
LoRA fine-tuning config (VoxCPM1.5)

@@ -19,6 +19,8 @@ tensorboard: /path/to/logs/finetune_lora
 lambdas:
   loss/diff: 1.0
   loss/stop: 1.0
+
+# LoRA configuration
 lora:
   enable_lm: true
   enable_dit: true
@@ -26,3 +28,9 @@ lora:
   r: 32
   alpha: 16
   dropout: 0.0
+
+# Distribution options (optional)
+# - If distribute=false (default): save pretrained_path as base_model in lora_config.json
+# - If distribute=true: save hf_model_id as base_model (hf_model_id is required)
+# hf_model_id: "openbmb/VoxCPM1.5"
+# distribute: true
LoRA fine-tuning config (VoxCPM-0.5B)

@@ -19,10 +19,18 @@ tensorboard: /path/to/logs/finetune_lora
 lambdas:
   loss/diff: 1.0
   loss/stop: 1.0
+
+# LoRA configuration
 lora:
   enable_lm: true
   enable_dit: true
+  enable_proj: false
   r: 32
   alpha: 16
   dropout: 0.0
+
+# Distribution options (optional)
+# - If distribute=false (default): save pretrained_path as base_model in lora_config.json
+# - If distribute=true: save hf_model_id as base_model (hf_model_id is required)
+# hf_model_id: "openbmb/VoxCPM-0.5B"
+# distribute: true
create_repo.py (new file, 148 lines)
import requests
import json
import subprocess
import os

# Configuration
API_URL = "https://git.aitosuv.com/api/v1/user/repos"
AUTH = ('admin', 'lsy123123')
REPO_DATA = {
    "name": "VoxCPM-use",
    "description": "声音克隆",
    "private": False,
    "auto_init": False
}


def run_command(command):
    """Run a shell command and return the output."""
    print(f"Running: {command}")
    try:
        result = subprocess.run(
            command,
            check=True,
            shell=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        if result.stdout:
            print(result.stdout.strip())
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        print(f"Error running command: {command}")
        print(e.stderr)
        return None


def create_gitignore():
    """Create .gitignore if it doesn't exist."""
    if not os.path.exists(".gitignore"):
        content = """
.venv/
__pycache__/
*.pyc
.trae/
.vscode/
.idea/
*.log
*.safetensors
*.pth
*.pt
*.ckpt
*.bin
"""
        with open(".gitignore", "w") as f:
            f.write(content.strip())
        print("Created .gitignore")
    else:
        print(".gitignore already exists")


def main():
    clone_url = None

    # Initialize session and disable proxy
    session = requests.Session()
    session.trust_env = False

    # 1. Create Repository via API
    try:
        response = session.post(API_URL, auth=AUTH, json=REPO_DATA)
        if response.status_code == 201:
            print("Repository created successfully")
            clone_url = response.json()['clone_url']
        elif response.status_code == 422 or response.status_code == 409:  # Already exists
            print("Repository already exists")
            # Fetch existing repo details
            user = AUTH[0]
            repo_name = REPO_DATA["name"]
            get_url = f"https://git.aitosuv.com/api/v1/repos/{user}/{repo_name}"
            resp_get = session.get(get_url, auth=AUTH)
            if resp_get.status_code == 200:
                clone_url = resp_get.json()['clone_url']
            else:
                print(f"Could not fetch existing repository details. Status: {resp_get.status_code}")
        else:
            print(f"Failed to create repository: {response.status_code}")
            print(response.text)
            return
    except Exception as e:
        print(f"Error: {e}")
        return

    if not clone_url:
        print("Could not determine clone URL. Exiting.")
        return

    # Embed credentials into the URL for automatic authentication
    # Assuming clone_url format: https://git.aitosuv.com/admin/geminiWX.git
    # We want: https://admin:lsy123123@git.aitosuv.com/admin/geminiWX.git
    if "://" in clone_url:
        protocol, rest = clone_url.split("://", 1)
        auth_url = f"{protocol}://{AUTH[0]}:{AUTH[1]}@{rest}"
    else:
        auth_url = clone_url  # Fallback if format is unexpected

    print(f"Target Remote URL: {clone_url}")

    # 2. Local Git Operations
    if not os.path.exists(".git"):
        print("Initializing git repository...")
        run_command("git init")

    # Configure git user for this repository
    print("Configuring git user...")
    run_command(f'git config user.email "{AUTH[0]}@aitosuv.com"')
    run_command(f'git config user.name "{AUTH[0]}"')

    create_gitignore()

    print("Adding files...")
    run_command("git add .")

    print("Committing changes...")
    run_command('git commit -m "Initial commit"')

    # Check and configure remote
    remotes = run_command("git remote -v")
    if remotes and "origin" in remotes:
        print("Updating remote 'origin'...")
        run_command(f"git remote set-url origin {auth_url}")
    else:
        print("Adding remote 'origin'...")
        run_command(f"git remote add origin {auth_url}")

    # Push to remote
    print("Pushing to remote...")
    # Try pushing to master first, then main if that fails (or vice versa depending on default branch)
    # Usually 'master' is default for older git, 'main' for newer.
    # We can try checking current branch name.
    current_branch = run_command("git rev-parse --abbrev-ref HEAD")
    if current_branch:
        if run_command(f"git push -u origin {current_branch} -f") is None:
            print("Push failed.")
    else:
        # Fallback if we couldn't get branch name
        if run_command("git push -u origin master -f") is None:
            run_command("git push -u origin main -f")


if __name__ == "__main__":
    main()
docs/finetune.md (372 changed lines)
# VoxCPM Fine-tuning Guide

This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.

### 🎓 SFT (Supervised Fine-Tuning)

Full fine-tuning updates all model parameters. Suitable for:
- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed

### ⚡ LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping

## Table of Contents

- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)
---

## Quick Start: WebUI

For users who prefer a graphical interface, we provide `lora_ft_webui.py`, a combined WebUI for training and inference:

### Launch the WebUI

```bash
python lora_ft_webui.py
```

Then open `http://localhost:7860` in your browser.

### Features

- **🚀 Training tab**: Configure and launch LoRA training through an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real time
  - Resume training from an existing checkpoint

- **🎵 Inference tab**: Generate audio with your trained models
  - Base model auto-loaded from the LoRA checkpoint config
  - Voice cloning with automatic ASR (prompt-text recognition)
  - Hot-swap between multiple LoRA models
  - Zero-shot TTS without reference audio
## Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```

### Required Fields

| Field | Description |
|-------|-------------|
| `audio` | Path to the audio file (absolute or relative) |
| `text` | Corresponding transcript |

### Optional Fields

| Field | Description |
|-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |

### Requirements

- Audio format: WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
- Text: a transcript matching the audio content

See `examples/train_data_example.jsonl` for a complete example. A small helper for building such a manifest is sketched below.
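A sketch of such a helper, assuming a flat folder in which each `foo.wav` sits next to a `foo.txt` transcript (that layout is an assumption for the example, not a project convention) and that the `soundfile` package is available:

```python
import json
from pathlib import Path

import soundfile as sf

data_dir = Path("my_dataset")  # assumed layout: foo.wav next to foo.txt
with open("train_manifest.jsonl", "w", encoding="utf-8") as out:
    for wav in sorted(data_dir.glob("*.wav")):
        text = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        info = sf.info(str(wav))  # reads only the header, so this stays fast
        sample = {
            "audio": str(wav),
            "text": text,
            "duration": round(info.frames / info.samplerate, 2),  # optional field
        }
        out.write(json.dumps(sample, ensure_ascii=False) + "\n")
```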
---

## Full Fine-tuning

Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
# ...
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.00001  # use a smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
# ...
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```

### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```

### Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors        # model weights (excluding audio_vae)
    ├── config.json              # model config
    ├── audiovae.pth             # Audio VAE weights
    ├── tokenizer.json           # tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
```
---

## LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
# ...
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.0001  # LoRA can use a larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
# ...
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true      # apply LoRA to the language model
  enable_dit: true     # apply LoRA to the Diffusion Transformer
  enable_proj: false   # apply LoRA to projection layers (optional)

  r: 32                # LoRA rank (higher = more capacity)
  alpha: 16            # LoRA alpha, scaling = alpha / r
  dropout: 0.0

  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]

# Distribution options (optional)
# hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
# distribute: true                  # if true, save hf_model_id in lora_config.json
```
### LoRA Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to the LM (language model) | `true` |
| `enable_dit` | Apply LoRA to the DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA to | attention layers |

### Distribution Options (Optional)

| Parameter | Description | Default |
|-----------|-------------|---------|
| `hf_model_id` | HuggingFace model ID (e.g. `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in the checkpoint; otherwise save the local `pretrained_path` | `false` |

> **Note**: If `distribute: true`, `hf_model_id` must be provided.
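For the defaults above, the adapter update is scaled by `alpha / r = 16 / 32 = 0.5`. In the standard LoRA formulation (general background, not code from this repo), the adapted weight is

$$W' = W + \frac{\alpha}{r}\,BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},$$

so raising `r` at a fixed `alpha` adds capacity but shrinks the effective scale of the learned update, which is why `alpha` is usually moved together with `r` (e.g. `alpha = r/2` or `alpha = r`).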
### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```

### Checkpoint Structure

LoRA training saves the LoRA parameters and config:

```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # only the lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```

`lora_config.json` contains:
```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}
```

The `base_model` field holds:
- a local path (default): when `distribute: false` or unset
- a HuggingFace ID: when `distribute: true` (e.g. `"openbmb/VoxCPM1.5"`)

This allows a LoRA checkpoint to be loaded without the original training config file.
---

## Inference

### Full Fine-tuning Inference

The checkpoint directory is a complete model; load it directly:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is the voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Transcript of the reference audio" \
    --output cloned_output.wav
```

### LoRA Inference

LoRA inference needs only the checkpoint directory (the base model path and LoRA config are read from `lora_config.json`):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is the LoRA fine-tuned result." \
    --output lora_output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Transcript of the reference audio" \
    --output cloned_output.wav
```

Override the base model path (optional):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Using a different base model." \
    --output output.wav
```
---

## LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time, without reloading the entire model.

### API Reference

```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load the model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    # ...
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or a local path
    load_denoiser=False,              # optional: disable the denoiser for faster loading
    optimize=True,                    # enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is the LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",         # optional: for voice cloning
    prompt_text="Transcript of the reference audio",  # optional: for voice cloning
)

# 3. Disable LoRA (use the base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get the current LoRA weights
lora_state = model.get_lora_state_dict()
```

### Simplified Usage (load from lora_config.json)

If your checkpoint contains a `lora_config.json` (saved by the training script), you can load everything automatically:

```python
import json
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load the config from the checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load the model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)
```

Or use the test script directly:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```

### Method Reference

| Method | Description | torch.compile compatible |
|--------|-------------|--------------------------|
| `load_lora(path)` | Load LoRA weights from file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ |
| `lora_enabled` | Property: check whether LoRA is configured | ✅ |
---

## FAQ

### 1. Out of Memory (OOM)

- Increase `grad_accum_steps` (gradient accumulation)
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter out long samples
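When trading these knobs off, a rough sanity check on the effective batch size can help (general training arithmetic with illustrative numbers, not values from this repo):

```python
# Illustrative numbers only: effective samples per optimizer step
batch_size = 4        # per-GPU micro-batch
grad_accum_steps = 8  # gradient-accumulation steps
num_gpus = 1
print(batch_size * grad_accum_steps * num_gpus)  # 32, at the memory cost of a batch of 4
```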
### 2. Poor LoRA Performance

- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase the number of training steps
- Add more target modules

### 3. Training Not Converging

- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality

### 4. LoRA Not Taking Effect at Inference

- Check that `lora_config.json` exists in the checkpoint directory
- Check the `load_lora()` return value: `skipped_keys` should be empty
- Verify that `set_lora_enabled(True)` was called

### 5. Checkpoint Loading Errors

- Full fine-tuning: the checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, and `audiovae.pth`
- LoRA: the checkpoint directory should contain:
  - `lora_weights.safetensors` (or `lora_weights.ckpt`): the LoRA weights
  - `lora_config.json`: the LoRA config and base model path
docs/performance.md

# 📊 Performance Highlights

VoxCPM achieves competitive results on public zero-shot TTS benchmarks.

## Seed-TTS-eval Benchmark

| Model | Parameters | Open-Source | test-EN | | test-ZH | | test-Hard | |
|------|------|------|:------------:|:--:|:------------:|:--:|:-------------:|:--:|
| | | | WER/%⬇ | SIM/%⬆ | CER/%⬇ | SIM/%⬆ | CER/%⬇ | SIM/%⬆ |
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| ... | | | | | | | | |
| **VoxCPM** | 0.5B | ✅ | **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 |

## CV3-eval Benchmark

| Model | zh | en | hard-zh | | | hard-en | | |
|-------|:--:|:--:|:-------:|:--:|:--:|:-------:|:--:|:--:|
| | CER/%⬇ | WER/%⬇ | CER/%⬇ | SIM/%⬆ | DNSMOS⬆ | WER/%⬇ | SIM/%⬆ | DNSMOS⬆ |
| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - |
| ... | | | | | | | | |
| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 |
| **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 |
docs/release_note.md

# VoxCPM1.5 Release Notes

**Release Date:** December 5, 2025

## 🎉 Overview

We're thrilled to introduce a major upgrade that improves the audio quality and efficiency of VoxCPM, while maintaining the core capabilities of context-aware speech generation and zero-shot voice cloning.

| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |
## 🎵 Model Updates

### 🔊 AudioVAE Sampling Rate: 16kHz → 44.1kHz

The AudioVAE now supports a 44.1kHz sampling rate, which allows the model to:
- 🎯 Clone better, preserving more high-frequency detail and generating higher-quality voice outputs

*Note: This upgrade enables higher-quality generation when using high-quality reference audio, but does not guarantee that all generated audio will be high-fidelity. The output quality depends on the **prompt speech** quality.*

### ⚡ Token Rate: 12.5Hz → 6.25Hz

We reduced the token rate in the LM backbone from 12.5Hz to 6.25Hz (LocEnc & LocDiT patch size increased from 2 to 4) while maintaining similar performance on evaluation benchmarks. This change:
- 💨 Reduces the compute needed to generate the same length of audio
- 📈 Provides a foundation for longer audio generation
- 🏗️ Paves the way for training larger models in the future

**Model architecture note**: The core architecture of VoxCPM1.5 is unchanged from the technical report. The key modification is raising the patch size of the local modules (LocEnc & LocDiT) from 2 to 4, which lowers the LM processing rate from 12.5Hz to 6.25Hz. Because the local modules now handle longer contexts, we extended their network depth, slightly increasing the overall parameter count.

**Generation speed note**: Although the model has more parameters, VoxCPM1.5 needs only 6.25 tokens per second of audio (versus 12.5 before). The displayed generation speed (xx it/s) may look slower, but the actual real-time factor (RTF, processing time divided by audio duration) is no worse, and may even be better.
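The effect of the rate change is easy to see with a little arithmetic (illustrative numbers only):

```python
# LM tokens needed per clip, at each token rate
old_rate_hz, new_rate_hz = 12.5, 6.25
for seconds in (1, 10, 60):
    print(seconds, seconds * old_rate_hz, seconds * new_rate_hz)
# 1  -> 12.5 vs 6.25 tokens
# 10 -> 125.0 vs 62.5 tokens
# 60 -> 750.0 vs 375.0 tokens: half the autoregressive steps per second of audio
```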
## 🔧 Fine-tuning Support

We now support full fine-tuning and LoRA fine-tuning; please see the [Fine-tuning Guide](finetune.md) for detailed instructions.

## 📚 Documentation

- Updated README with a version comparison
- Added a comprehensive fine-tuning guide
- Improved code comments and documentation

## 🙏 Our Thanks to You

This release wouldn't be possible without the incredible feedback, testing, and contributions from our open-source community. Thank you for helping shape VoxCPM1.5!

## 📞 Let's Build Together

Questions, ideas, or want to contribute?

- 🐛 Report an issue: [GitHub Issues on OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM/issues)

- 📖 Dig into the docs: check the [docs/](../docs/) folder for guides and API details

Enjoy the richer sound and powerful new features of VoxCPM1.5 🎉

We can't wait to hear what you create next! 🥂

## 🚀 What We're Working On

We're continuously improving VoxCPM and working on exciting new features:

- 🌍 **Multilingual TTS Support**: We are actively developing support for languages beyond Chinese and English.
- 🎯 **Controllable Expressive Speech Generation**: We are researching controllable speech generation that allows fine-grained control over speech attributes (emotion, timbre, prosody, etc.) through natural-language instructions.
- 🎵 **Universal Audio Generation Foundation**: We also hope to explore VoxCPM as a unified audio-generation foundation model capable of jointly generating speech, music, and sound effects. This is a longer-term vision.

**📅 Next Release**: We plan to release the next version in Q1 2026, with significant improvements and new features. Stay tuned! We're committed to making VoxCPM even more powerful and versatile.

## ❓ Frequently Asked Questions (FAQ)

### Q: Does VoxCPM support fine-tuning for personalized voice customization?

**A:** Yes! VoxCPM now supports both full fine-tuning (SFT) and efficient LoRA fine-tuning. You can train personalized voice models on your own data. Please refer to the [Fine-tuning Guide](finetune.md) for detailed instructions and examples.

### Q: Is 16kHz audio quality sufficient for my use case?

**A:** We have upgraded the AudioVAE in VoxCPM1.5 to support a 44.1kHz sampling rate, which provides higher-quality audio output with better preservation of high-frequency details. With high-quality reference audio, this enables better voice-cloning quality and more natural speech synthesis.

### Q: Has the stability issue been resolved?

**A:** We made stability optimizations in VoxCPM1.5, including improvements to the inference code logic, training data, and model architecture. Based on community feedback, we collected stability issues such as:
- Increased noise and reverberation
- Audio artifacts (e.g., howling/squealing)
- Unstable speaking rate (speeding up)
- Volume fluctuations (increases or decreases)
- Noise artifacts at the beginning and end of audio
- Synthesis issues with very short texts (e.g., "hello")

**What we improved:** By adjusting the inference code logic and curating the training data, we have largely fixed the artifacts at the beginning and end of clips. By lowering the LM processing rate (12.5Hz → 6.25Hz), we improved the stability of long-speech generation.

**What remains:** We acknowledge that long-speech stability is not fully solved. Especially with highly expressive or complex reference speech, error accumulation during autoregressive generation can still occur. We will keep analyzing and optimizing this in future versions.

### Q: Does VoxCPM plan to support multilingual TTS?

**A:** Currently, VoxCPM is trained primarily on Chinese and English data. We are actively researching and developing multilingual TTS support for more languages. Let us know which languages you'd like to see supported!

### Q: Does VoxCPM plan to support controllable generation (emotion, style, fine-grained control)?

**A:** Currently, VoxCPM only supports zero-shot voice cloning and context-aware speech generation; direct control over specific speech attributes (emotion, style, fine-grained prosody) is limited. However, we are actively researching instruction-controllable expressive speech generation with fine-grained control, working toward a human-instruction-to-speech model!

### Q: Does VoxCPM support different hardware chips (e.g., Ascend 910B, XPU, NPU)?

**A:** We have not yet adapted VoxCPM for other hardware chips; our main focus remains new model capabilities and stability. We encourage you to check whether community developers have done similar work, and we warmly welcome contributions that drive such adaptations!

These features are under active development, and we look forward to sharing updates in future releases!
docs/usage_guide.md

# 👩‍🍳 A Voice Chef's Guide

Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let's begin.

---

## 🥚 Step 1: Prepare Your Base Ingredients (Content)

First, choose how you'd like to input your text:

### 1. Regular Text (Classic Mode)
- ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.

### 2. Phoneme Input (Native Mode)
- ❌ Turn "Text Normalization" OFF. Enter phoneme text like `{HH AH0 L OW1}` (EN) or `{ni3}{hao3}` (ZH) for precise pronunciation control. In this mode, VoxCPM also natively understands other complex non-normalized text; try it out!
- **Phoneme conversion**: For Chinese, phonemes are written as pinyin; for English, as CMUDict entries. See the relevant documentation for details.
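With the Python API this corresponds to passing the marked-up text with normalization off; a small sketch reusing the `generate` parameters shown in the README (the sample sentence is illustrative):

```python
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

# Curly-brace spans are read as phonemes: pinyin for Chinese, CMUDict for English.
wav = model.generate(
    text="{ni3}{hao3}, this is {HH AH0 L OW1} with exact pronunciation.",
    normalize=False,  # keep normalization off so the phoneme markup reaches the model
)
```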
---

## 🍳 Step 2: Choose Your Flavor Profile (Voice Style)

This is the secret sauce that gives your audio its unique sound.

### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre and speaking style, and even the background sounds and ambiance, will be replicated.
- **For a clean, denoised voice:**
  - ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble for a pure, clean voice clone. Note that it caps the audio sampling rate at 16kHz, which limits the ceiling of cloning quality.
- **For high-quality audio cloning (up to 44.1kHz):**
  - ❌ Disable "Prompt Speech Enhancement" to preserve all of the original audio's information, including the background ambiance, and support cloning at sampling rates up to 44.1kHz.

### 2. Cooking au Naturel (Letting the Model Improvise)
- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style from the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
- **Pro tip**: Challenge VoxCPM with any text: poetry, song lyrics, dramatic monologues. It may deliver some interesting results!
---

## 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)

You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.

### CFG Value (How Closely to Follow the Recipe)
- **Default**: A great starting point.
- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
- **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages. A small sketch for comparing settings follows.
### Inference Timesteps (Simmering Time: Quality vs. Speed)
|
||||
- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
|
||||
- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
|
||||
### 推理步数(炖煮时间:质量与速度)
|
||||
- **需要快餐?** 使用较低的数值。非常适合快速草稿和实验。
|
||||
- **烹饪大餐?** 使用较高的数值。这让模型“炖煮”得更久,提炼音频以获得卓越的细节和自然度。
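
Both knobs are plain arguments to `generate`; the values below are illustrative only:

```python
# Illustrative settings: cfg_value and inference_timesteps are the
# two documented quality/speed knobs of VoxCPM.generate().
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

# Fast draft of a long passage: relaxed guidance, fewer diffusion steps.
draft = model.generate(
    text="A long, winding passage that benefits from a relaxed delivery...",
    cfg_value=1.5,
    inference_timesteps=4,
)

# Final render of a short line: tighter guidance, more refinement steps.
final = model.generate(
    text="Short and punchy!",
    cfg_value=2.5,
    inference_timesteps=16,
)
```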

---

Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
1253
lora_ft_webui.py
Normal file
File diff suppressed because it is too large
36
models/openbmb__VoxCPM1.5/.gitattributes
vendored
Normal file
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
assets/voxcpm_model.png filter=lfs diff=lfs merge=lfs -text
272
models/openbmb__VoxCPM1.5/README.md
Normal file
@@ -0,0 +1,272 @@
---
license: apache-2.0
language:
- en
- zh
base_model:
- openbmb/MiniCPM4-0.5B
pipeline_tag: text-to-speech
library_name: voxcpm1.5
tags:
- text-to-speech
- speech
- speech generation
- voice cloning
---

## 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

[](https://github.com/OpenBMB/VoxCPM/) [](https://arxiv.org/abs/2509.24650)[](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [](https://openbmb.github.io/VoxCPM-demopage)

- VoxCPM1.5

[](https://huggingface.co/openbmb/VoxCPM1.5) [](https://modelscope.cn/models/OpenBMB/VoxCPM1.5)

<div align="center">
<img src="assets/voxcpm_logo.png" alt="VoxCPM Logo" width="40%">
</div>

## 🎉 VoxCPM1.5 Updates

**Release Date:** December 5, 2025

VoxCPM1.5 brings improvements in audio quality and efficiency:

| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |

**Key Improvements:**
- 🔊 **Higher Quality**: 44.1kHz sampling rate preserves more high-frequency details for better voice cloning
- ⚡ **More Efficient**: Reduced token rate (6.25Hz) lowers computational cost while maintaining performance
- 🎓 **Fine-tuning Support**: Train personalized voice models with SFT or LoRA

**Note**: Output quality depends on the prompt speech quality. VoxCPM-0.5B remains fully supported with backward compatibility.
## 📚 Model Overview

VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.

Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.

<div align="center">
<img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="90%">
</div>

### 🚀 Key Features
- **Context-Aware, Expressive Speech Generation** - VoxCPM comprehends text to infer and generate appropriate prosody, delivering speech with remarkable expressiveness and natural flow. It spontaneously adapts speaking style based on content, producing highly fitting vocal expression trained on a massive 1.8 million-hour bilingual corpus.
- **True-to-Life Voice Cloning** - With only a short reference audio clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker's timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing to create a faithful and natural replica.
- **High-Efficiency Synthesis** - VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, making it suitable for real-time applications.
## Quick Start

### 🔧 Install from PyPI
```sh
pip install voxcpm
```

### 1. Model Download (Optional)
The model is downloaded automatically the first time you run a script, but you can also download it in advance.
- Download VoxCPM1.5
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM1.5")
```

- Or download VoxCPM-0.5B
```python
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
```

- Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for speech-prompt ASR in the web demo.
```python
from modelscope import snapshot_download
snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
snapshot_download('iic/SenseVoiceSmall')
```
### 2. Basic Usage
```python
import soundfile as sf
import numpy as np
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

# Non-streaming
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
    prompt_wav_path=None,      # optional: path to a prompt speech for voice cloning
    prompt_text=None,          # optional: reference text
    cfg_value=2.0,             # LM guidance on LocDiT; higher improves adherence to the prompt but may hurt naturalness
    inference_timesteps=10,    # LocDiT inference timesteps; higher for better quality, lower for faster speed
    normalize=False,           # enable external TN tool, but this disables native raw-text support
    denoise=False,             # enable external denoise tool; may cause some distortion and restricts the sampling rate to 16kHz
    retry_badcase=True,        # retry generation for bad cases (e.g., unstoppable generation)
    retry_badcase_max_times=3, # maximum number of retries
    retry_badcase_ratio_threshold=6.0, # maximum length ratio for bad-case detection (simple but effective); can be raised for slow-paced speech
)

sf.write("output.wav", wav, model.tts_model.sample_rate)
print("saved: output.wav")

# Streaming
chunks = []
for chunk in model.generate_streaming(
    text="Streaming text to speech is easy with VoxCPM!",
    # supports the same args as above
):
    chunks.append(chunk)
wav = np.concatenate(chunks)

sf.write("output_streaming.wav", wav, model.tts_model.sample_rate)
print("saved: output_streaming.wav")
```
### 3. CLI Usage

After installation, the entry point is `voxcpm` (or use `python -m voxcpm.cli`).

```bash
# 1) Direct synthesis (single text)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." --output out.wav

# 2) Voice cloning (reference audio + transcript)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
       --prompt-audio path/to/voice.wav \
       --prompt-text "reference transcript" \
       --output out.wav \
       # --denoise

# (Optional) Voice cloning (reference audio + transcript file)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
       --prompt-audio path/to/voice.wav \
       --prompt-file "/path/to/text-file" \
       --output out.wav \
       # --denoise

# 3) Batch processing (one text per line)
voxcpm --input examples/input.txt --output-dir outs
# (optional) Batch + cloning
voxcpm --input examples/input.txt --output-dir outs \
       --prompt-audio path/to/voice.wav \
       --prompt-text "reference transcript" \
       # --denoise

# 4) Inference parameters (quality/speed)
voxcpm --text "..." --output out.wav \
       --cfg-value 2.0 --inference-timesteps 10 --normalize

# 5) Model loading
# Prefer local path
voxcpm --text "..." --output out.wav --model-path /path/to/VoxCPM_model_dir
# Or from Hugging Face (auto download/cache)
voxcpm --text "..." --output out.wav \
       --hf-model-id openbmb/VoxCPM1.5 --cache-dir ~/.cache/huggingface --local-files-only

# 6) Denoiser control
voxcpm --text "..." --output out.wav \
       --no-denoiser --zipenhancer-path iic/speech_zipenhancer_ans_multiloss_16k_base

# 7) Help
voxcpm --help
python -m voxcpm.cli --help
```
### 4. Start web demo

You can start the UI by running `python app.py`, which allows you to perform voice cloning and voice creation.

### 5. Fine-tuning

VoxCPM1.5 supports both full fine-tuning (SFT) and LoRA fine-tuning, allowing you to train personalized voice models on your own data. See the [Fine-tuning Guide](docs/finetune.md) for detailed instructions.
**Quick Start:**
```bash
# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# LoRA fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
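
After training, the saved LoRA checkpoint can be attached at load time or hot-reloaded into a live model. The sketch below follows the arguments used by `scripts/test_voxcpm_lora_infer.py` in this change; the checkpoint path and LoRA values are placeholders:

```python
# Minimal sketch based on scripts/test_voxcpm_lora_infer.py:
# attach LoRA weights at load time, or hot-reload them later.
import json
from pathlib import Path

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

ckpt_dir = Path("checkpoints/step_0002000")  # placeholder path
lora_info = json.loads((ckpt_dir / "lora_config.json").read_text(encoding="utf-8"))

model = VoxCPM.from_pretrained(
    hf_model_id=lora_info["base_model"],
    load_denoiser=False,
    lora_config=LoRAConfig(**lora_info["lora_config"]),
    lora_weights_path=str(ckpt_dir),
)

# Hot-reload updated LoRA weights without re-creating the model
loaded, skipped = model.load_lora(str(ckpt_dir))
print(f"reloaded {len(loaded)} LoRA tensors")
```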

## 👩‍🍳 A Voice Chef's Guide

Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let's begin.

---

### 🥚 Step 1: Prepare Your Base Ingredients (Content)

First, choose how you'd like to input your text:

1. Regular Text (Classic Mode)

- ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.

2. Phoneme Input (Native Mode)

- ❌ Turn "Text Normalization" OFF. Enter phoneme text like {HH AH0 L OW1} (EN) or {ni3}{hao3} (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text—try it out!

- **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin. For English, phonemes are converted using CMUDict. Please refer to the relevant documentation for more details.
---

### 🍳 Step 2: Choose Your Flavor Profile (Voice Style)

This is the secret sauce that gives your audio its unique sound.

#### 1. Cooking with a Prompt Speech (Following a Famous Recipe)

- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.

- **For a Clean, Denoised Voice:**

- ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone. However, this limits the audio sampling rate to 16kHz, capping the achievable cloning quality.

- **For High-Quality Audio Cloning (Up to 44.1kHz):**

- ❌ Disable "Prompt Speech Enhancement" to preserve all original audio information, including background atmosphere, and support audio cloning at up to a 44.1kHz sampling rate.

#### 2. Cooking au Naturel (Letting the Model Improvise)

- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.

- **Pro Tip**: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results!
---

### 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)

You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.

#### CFG Value (How Closely to Follow the Recipe)

- **Default**: A great starting point.

- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.

- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.

- **Short sentences?** Consider increasing the CFG value for better clarity and adherence.

- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.

#### Inference Timesteps (Simmering Time: Quality vs. Speed)

- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.

- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.

---

Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
---

## ⚠️ Risks and limitations

- General Model Behavior: While VoxCPM has been trained on a large-scale dataset, it may still produce outputs that are unexpected, biased, or contain artifacts.

- Potential for Misuse of Voice Cloning: VoxCPM's powerful zero-shot voice cloning capability can generate highly realistic synthetic speech. This technology could be misused to create convincing deepfakes for impersonation, fraud, or disinformation. Users of this model must not use it to create content that infringes upon the rights of individuals. It is strictly forbidden to use VoxCPM for any illegal or unethical purpose. We strongly recommend that any publicly shared content generated with this model be clearly marked as AI-generated.

- Current Technical Limitations: Although generally stable, the model may occasionally exhibit instability, especially with very long or expressive inputs. Furthermore, the current version offers limited direct control over specific speech attributes such as emotion or speaking style.

- Bilingual Model: VoxCPM is trained primarily on Chinese and English data. Performance on other languages is not guaranteed and may result in unpredictable or low-quality audio.

- This model is released for research and development purposes only. We do not recommend its use in production or commercial applications without rigorous testing and safety evaluations. Please use VoxCPM responsibly.

## 📄 License

The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.
BIN
models/openbmb__VoxCPM1.5/assets/voxcpm_logo.png
Normal file
Binary file not shown. (24 KiB)
BIN
models/openbmb__VoxCPM1.5/assets/voxcpm_model.png
LFS
Normal file
Binary file not shown.
60
models/openbmb__VoxCPM1.5/config.json
Normal file
@@ -0,0 +1,60 @@
{
  "architecture": "voxcpm",
  "lm_config": {
    "bos_token_id": 1,
    "eos_token_id": 2,
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "max_position_embeddings": 32768,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "num_key_value_heads": 2,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000,
    "rope_scaling": {
      "type": "longrope",
      "long_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
      "short_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
      "original_max_position_embeddings": 32768
    },
    "vocab_size": 73448,
    "scale_emb": 12,
    "dim_model_base": 256,
    "scale_depth": 1.4,
    "use_mup": false
  },
  "patch_size": 4,
  "feat_dim": 64,
  "scalar_quantization_latent_dim": 256,
  "scalar_quantization_scale": 9,
  "residual_lm_num_layers": 8,
  "encoder_config": {
    "hidden_dim": 1024,
    "ffn_dim": 4096,
    "num_heads": 16,
    "num_layers": 8
  },
  "dit_config": {
    "hidden_dim": 1024,
    "ffn_dim": 4096,
    "num_heads": 16,
    "num_layers": 8,
    "cfm_config": {
      "sigma_min": 1e-06,
      "solver": "euler",
      "t_scheduler": "log-norm",
      "inference_cfg_rate": 2.0
    }
  },
  "audio_vae_config": {
    "encoder_dim": 64,
    "encoder_rates": [2, 3, 6, 7, 7],
    "latent_dim": 64,
    "decoder_dim": 2048,
    "decoder_rates": [7, 7, 6, 3, 2],
    "sample_rate": 44100
  },
  "max_length": 8192,
  "device": "cuda",
  "dtype": "bfloat16"
}
81
models/openbmb__VoxCPM1.5/special_tokens_map.json
Normal file
@@ -0,0 +1,81 @@
{
  "additional_special_tokens": [
    {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<|tool_call|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<|execute_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<|execute_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    }
  ],
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
177952
models/openbmb__VoxCPM1.5/tokenizer.json
Normal file
File diff suppressed because it is too large
212
models/openbmb__VoxCPM1.5/tokenizer_config.json
Normal file
@@ -0,0 +1,212 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "<|audio_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "<|audio_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "<|audio_prompt_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "104": {
      "content": "<|audio_prompt_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "105": {
      "content": "<|background|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "106": {
      "content": "<|/background|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "107": {
      "content": "<|characters|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "108": {
      "content": "<|/characters|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "109": {
      "content": "<|speaker_id|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "110": {
      "content": "<|/speaker_id|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "111": {
      "content": "<|span|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "112": {
      "content": "<|/span|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73440": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73441": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73442": {
      "content": "<|tool_call|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73443": {
      "content": "<|execute_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73444": {
      "content": "<|execute_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73445": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73446": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "73447": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_end|>",
    "<|im_start|>",
    "<|tool_call|>",
    "<|execute_start|>",
    "<|execute_end|>",
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>"
  ],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false,
  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
}
@@ -114,7 +114,7 @@ def main():
        prompt_text=prompt_text,
        cfg_value=args.cfg_value,
        inference_timesteps=args.inference_timesteps,
        max_length=args.max_len,
        max_len=args.max_len,
        normalize=args.normalize,
        denoise=False,
    )
@@ -5,7 +5,6 @@ LoRA inference test script.
Usage:

    python scripts/test_voxcpm_lora_infer.py \
        --config_path conf/voxcpm/voxcpm_finetune_test.yaml \
        --lora_ckpt checkpoints/step_0002000 \
        --text "Hello, this is LoRA finetuned result." \
        --output lora_test.wav
@@ -13,37 +12,39 @@ Usage:
With voice cloning:

    python scripts/test_voxcpm_lora_infer.py \
        --config_path conf/voxcpm/voxcpm_finetune_test.yaml \
        --lora_ckpt checkpoints/step_0002000 \
        --text "This is voice cloning result." \
        --prompt_audio path/to/ref.wav \
        --prompt_text "Reference audio transcript" \
        --output lora_clone.wav

Note: The script reads base_model path and lora_config from lora_config.json
in the checkpoint directory (saved automatically during training).
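
Example lora_config.json (illustrative values; r and alpha will match
your training config, and other LoRAConfig fields may also appear):

    {
      "base_model": "openbmb/VoxCPM1.5",
      "lora_config": {"r": 16, "alpha": 32}
    }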
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import soundfile as sf
|
||||
|
||||
from voxcpm.core import VoxCPM
|
||||
from voxcpm.model.voxcpm import LoRAConfig
|
||||
from voxcpm.training.config import load_yaml_config
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser("VoxCPM LoRA inference test")
|
||||
parser.add_argument(
|
||||
"--config_path",
|
||||
type=str,
|
||||
required=True,
|
||||
help="Training YAML config path (contains pretrained_path and lora config)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--lora_ckpt",
|
||||
type=str,
|
||||
required=True,
|
||||
help="LoRA checkpoint directory (contains lora_weights.ckpt with lora_A/lora_B only)",
|
||||
help="LoRA checkpoint directory (contains lora_weights.safetensors and lora_config.json)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--base_model",
|
||||
type=str,
|
||||
default="",
|
||||
help="Optional: override base model path (default: read from lora_config.json)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--text",
|
||||
@@ -98,26 +99,44 @@ def parse_args():
|
||||
def main():
    args = parse_args()

    # 1. Load YAML config
    cfg = load_yaml_config(args.config_path)
    pretrained_path = cfg["pretrained_path"]
    lora_cfg_dict = cfg.get("lora", {}) or {}
    lora_cfg = LoRAConfig(**lora_cfg_dict) if lora_cfg_dict else None

    # 2. Check LoRA checkpoint
    ckpt_dir = args.lora_ckpt
    if not Path(ckpt_dir).exists():
    # 1. Check LoRA checkpoint directory
    ckpt_dir = Path(args.lora_ckpt)
    if not ckpt_dir.exists():
        raise FileNotFoundError(f"LoRA checkpoint not found: {ckpt_dir}")

    # 2. Load lora_config.json from checkpoint
    lora_config_path = ckpt_dir / "lora_config.json"
    if not lora_config_path.exists():
        raise FileNotFoundError(
            f"lora_config.json not found in {ckpt_dir}. "
            "Make sure the checkpoint was saved with the updated training script."
        )

    with open(lora_config_path, "r", encoding="utf-8") as f:
        lora_info = json.load(f)

    # Get base model path (command line arg overrides config)
    pretrained_path = args.base_model if args.base_model else lora_info.get("base_model")
    if not pretrained_path:
        raise ValueError("base_model not found in lora_config.json and --base_model not provided")

    # Get LoRA config
    lora_cfg_dict = lora_info.get("lora_config", {})
    lora_cfg = LoRAConfig(**lora_cfg_dict) if lora_cfg_dict else None

    print(f"Loaded config from: {lora_config_path}")
    print(f"  Base model: {pretrained_path}")
    print(f"  LoRA config: r={lora_cfg.r}, alpha={lora_cfg.alpha}" if lora_cfg else "  LoRA config: None")

    # 3. Load model with LoRA (no denoiser)
    print(f"[1/2] Loading model with LoRA: {pretrained_path}")
    print(f"\n[1/2] Loading model with LoRA: {pretrained_path}")
    print(f"  LoRA weights: {ckpt_dir}")
    model = VoxCPM.from_pretrained(
        hf_model_id=pretrained_path,
        load_denoiser=False,
        optimize=True,
        lora_config=lora_cfg,
        lora_weights_path=ckpt_dir,
        lora_weights_path=str(ckpt_dir),
    )

    # 4. Synthesize audio
@@ -136,7 +155,7 @@ def main():
        prompt_text=prompt_text,
        cfg_value=args.cfg_value,
        inference_timesteps=args.inference_timesteps,
        max_length=args.max_len,
        max_len=args.max_len,
        normalize=args.normalize,
        denoise=False,
    )
@@ -153,7 +172,7 @@ def main():
        prompt_text=prompt_text,
        cfg_value=args.cfg_value,
        inference_timesteps=args.inference_timesteps,
        max_length=args.max_len,
        max_len=args.max_len,
        normalize=args.normalize,
        denoise=False,
    )
@@ -170,7 +189,7 @@ def main():
        prompt_text=prompt_text,
        cfg_value=args.cfg_value,
        inference_timesteps=args.inference_timesteps,
        max_length=args.max_len,
        max_len=args.max_len,
        normalize=args.normalize,
        denoise=False,
    )
@@ -187,7 +206,7 @@ def main():
        prompt_text=prompt_text,
        cfg_value=args.cfg_value,
        inference_timesteps=args.inference_timesteps,
        max_length=args.max_len,
        max_len=args.max_len,
        normalize=args.normalize,
        denoise=False,
    )
@@ -197,7 +216,7 @@ def main():

    # === Test 5: Hot-reload LoRA (load_lora) ===
    print(f"\n  [Test 5] Hot-reload LoRA (load_lora)...")
    loaded, skipped = model.load_lora(str(ckpt_dir))
    loaded, skipped = model.load_lora(ckpt_dir)
    print(f"  Reloaded {len(loaded)} parameters")
    audio_np = model.generate(
        text=args.text,
@@ -205,7 +224,7 @@ def main():
        prompt_text=prompt_text,
        cfg_value=args.cfg_value,
        inference_timesteps=args.inference_timesteps,
        max_length=args.max_len,
        max_len=args.max_len,
        normalize=args.normalize,
        denoise=False,
    )
@@ -14,6 +14,8 @@ import torch
from tensorboardX import SummaryWriter
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup
import signal
import os

try:
    from safetensors.torch import save_file
@@ -56,8 +58,16 @@ def train(
    lambdas: Dict[str, float] = {"loss/diff": 1.0, "loss/stop": 1.0},
    lora: dict = None,
    config_path: str = "",
    # Distribution options (for LoRA checkpoints)
    hf_model_id: str = "",  # HuggingFace model ID (e.g., "openbmb/VoxCPM1.5")
    distribute: bool = False,  # If True, save hf_model_id as base_model; otherwise save pretrained_path
):
    _ = config_path

    # Validate distribution options
    if lora is not None and distribute and not hf_model_id:
        raise ValueError("hf_model_id is required when distribute=True")

    accelerator = Accelerator(amp=True)

    save_dir = Path(save_path)
@@ -171,6 +181,39 @@ def train(
        num_training_steps=total_training_steps,
    )

    # Try to load checkpoint and resume training
    start_step = 0
    if accelerator.rank == 0:
        start_step = load_checkpoint(model, optimizer, scheduler, save_dir)
    # Broadcast start_step to all processes
    if hasattr(accelerator, 'all_reduce'):
        start_step_tensor = torch.tensor(start_step, device=accelerator.device)
        accelerator.all_reduce(start_step_tensor)
        start_step = int(start_step_tensor.item())

    if start_step > 0 and accelerator.rank == 0:
        tracker.print(f"Resuming training from step {start_step}")

    # Resume tracker for signal handler to read current step
    resume = {"step": start_step}

    # Register signal handler to save checkpoint on termination (SIGTERM/SIGINT)
    def _signal_handler(signum, frame, _model=model, _optim=optimizer, _sched=scheduler, _save_dir=save_dir, _pretrained=pretrained_path, _hf_id=hf_model_id, _dist=distribute, _resume=resume):
        try:
            cur_step = int(_resume.get("step", start_step))
        except Exception:
            cur_step = start_step
        print(f"Signal {signum} received. Saving checkpoint at step {cur_step} ...")
        try:
            save_checkpoint(_model, _optim, _sched, _save_dir, cur_step, _pretrained, _hf_id, _dist)
            print("Checkpoint saved. Exiting.")
        except Exception as e:
            print(f"Error saving checkpoint on signal: {e}")
        os._exit(0)

    signal.signal(signal.SIGTERM, _signal_handler)
    signal.signal(signal.SIGINT, _signal_handler)

    # Manual epoch management instead of itertools.cycle to support DistributedSampler.set_epoch()
    grad_accum_steps = max(int(grad_accum_steps), 1)
    data_epoch = 0
@@ -191,7 +234,9 @@ def train(
            return next(train_iter)

    with tracker.live():
        for step in range(num_iters):
        for step in range(start_step, num_iters):
            # update resume step so signal handler can save current progress
            resume["step"] = step
            tracker.step = step
            optimizer.zero_grad(set_to_none=True)

@@ -255,10 +300,10 @@ def train(
            validate(model, val_loader, batch_processor, accelerator, tracker, lambdas)

        if step % save_interval == 0 and accelerator.rank == 0:
            save_checkpoint(model, optimizer, scheduler, save_dir, step, pretrained_path)
            save_checkpoint(model, optimizer, scheduler, save_dir, step, pretrained_path, hf_model_id, distribute)

    if accelerator.rank == 0:
        save_checkpoint(model, optimizer, scheduler, save_dir, num_iters, pretrained_path)
        save_checkpoint(model, optimizer, scheduler, save_dir, num_iters, pretrained_path, hf_model_id, distribute)
    if writer:
        writer.close()
@@ -301,7 +346,77 @@ def validate(model, val_loader, batch_processor, accelerator, tracker, lambdas):
    model.train()


def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pretrained_path: str = None):
def load_checkpoint(model, optimizer, scheduler, save_dir: Path):
    """
    Load the latest checkpoint if it exists.
    Returns the step number to resume from, or 0 if no checkpoint found.
    """
    latest_folder = save_dir / "latest"
    if not latest_folder.exists():
        return 0

    unwrapped = model.module if hasattr(model, "module") else model
    lora_cfg = unwrapped.lora_config

    # Load model weights
    if lora_cfg is not None:
        # LoRA: load lora_weights
        lora_weights_path = latest_folder / "lora_weights.safetensors"
        if not lora_weights_path.exists():
            lora_weights_path = latest_folder / "lora_weights.ckpt"

        if lora_weights_path.exists():
            if lora_weights_path.suffix == ".safetensors":
                from safetensors.torch import load_file
                state_dict = load_file(str(lora_weights_path))
            else:
                ckpt = torch.load(lora_weights_path, map_location="cpu")
                state_dict = ckpt.get("state_dict", ckpt)

            # Load only lora weights
            unwrapped.load_state_dict(state_dict, strict=False)
            print(f"Loaded LoRA weights from {lora_weights_path}")
    else:
        # Full finetune: load model.safetensors or pytorch_model.bin
        model_path = latest_folder / "model.safetensors"
        if not model_path.exists():
            model_path = latest_folder / "pytorch_model.bin"

        if model_path.exists():
            if model_path.suffix == ".safetensors":
                from safetensors.torch import load_file
                state_dict = load_file(str(model_path))
            else:
                ckpt = torch.load(model_path, map_location="cpu")
                state_dict = ckpt.get("state_dict", ckpt)

            unwrapped.load_state_dict(state_dict, strict=False)
            print(f"Loaded model weights from {model_path}")

    # Load optimizer state
    optimizer_path = latest_folder / "optimizer.pth"
    if optimizer_path.exists():
        optimizer.load_state_dict(torch.load(optimizer_path, map_location="cpu"))
        print(f"Loaded optimizer state from {optimizer_path}")

    # Load scheduler state
    scheduler_path = latest_folder / "scheduler.pth"
    if scheduler_path.exists():
        scheduler.load_state_dict(torch.load(scheduler_path, map_location="cpu"))
        print(f"Loaded scheduler state from {scheduler_path}")

    # Try to infer step from checkpoint folders
    step_folders = [d for d in save_dir.iterdir() if d.is_dir() and d.name.startswith("step_")]
    if step_folders:
        steps = [int(d.name.split("_")[1]) for d in step_folders]
        resume_step = max(steps)
        print(f"Resuming from step {resume_step}")
        return resume_step

    return 0


def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pretrained_path: str = None, hf_model_id: str = "", distribute: bool = False):
    """
    Save checkpoint with different strategies for full finetune vs LoRA:
    - Full finetune: save non-vae weights to model.safetensors (or pytorch_model.bin if safetensors unavailable)
@@ -325,6 +440,17 @@ def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pret
            save_file(state_dict, folder / "lora_weights.safetensors")
        else:
            torch.save({"state_dict": state_dict}, folder / "lora_weights.ckpt")

        # Save LoRA config and base model path to a separate JSON file
        # If distribute=True, save hf_model_id; otherwise save local pretrained_path
        import json
        base_model_to_save = hf_model_id if distribute else (str(pretrained_path) if pretrained_path else None)
        lora_info = {
            "base_model": base_model_to_save,
            "lora_config": lora_cfg.model_dump() if hasattr(lora_cfg, "model_dump") else vars(lora_cfg),
        }
        with open(folder / "lora_config.json", "w", encoding="utf-8") as f:
            json.dump(lora_info, f, indent=2, ensure_ascii=False)
    else:
        # Full finetune: save non-vae weights to model.safetensors
        state_dict = {k: v for k, v in full_state.items() if not k.startswith("audio_vae.")}
@@ -345,6 +471,29 @@ def save_checkpoint(model, optimizer, scheduler, save_dir: Path, step: int, pret
    torch.save(optimizer.state_dict(), folder / "optimizer.pth")
    torch.save(scheduler.state_dict(), folder / "scheduler.pth")

    # Update (or create) a `latest` symlink pointing to the most recent checkpoint folder
    latest_link = save_dir / "latest"
    try:
        if latest_link.exists() or latest_link.is_symlink():
            # remove existing link or directory
            if latest_link.is_dir() and not latest_link.is_symlink():
                shutil.rmtree(latest_link)
            else:
                latest_link.unlink()
        # Create a symlink pointing to the new folder
        os.symlink(str(folder), str(latest_link))
    except Exception:
        # If symlink creation fails (e.g., on Windows or permission issues), fall back to copying
        try:
            if latest_link.exists():
                if latest_link.is_dir():
                    shutil.rmtree(latest_link)
                else:
                    latest_link.unlink()
            shutil.copytree(folder, latest_link)
        except Exception:
            print(f"Warning: failed to update latest checkpoint link at {latest_link}")


if __name__ == "__main__":
    from voxcpm.training.config import load_yaml_config
@@ -358,5 +507,4 @@ if __name__ == "__main__":
    else:
        # Otherwise use command line args (parsed by argbind)
        with argbind.scope(args):
            train()

    train()
@@ -55,11 +55,12 @@ class VoxCPM:
            self.denoiser = ZipEnhancer(zipenhancer_model_path)
        else:
            self.denoiser = None
        print("Warm up VoxCPMModel...")
        self.tts_model.generate(
            target_text="Hello, this is the first test sentence.",
            max_len=10,
        )
        if optimize:
            print("Warm up VoxCPMModel...")
            self.tts_model.generate(
                target_text="Hello, this is the first test sentence.",
                max_len=10,
            )

    @classmethod
    def from_pretrained(cls,
@@ -159,6 +159,7 @@ class MiniCPMAttention(nn.Module):
        query_states = query_states.contiguous()
        key_states = key_states.contiguous()
        value_states = value_states.contiguous()

        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query_states,
            key_states,
@@ -208,6 +209,7 @@ class MiniCPMAttention(nn.Module):
        query_states = query_states.contiguous()
        key_cache = key_cache.contiguous()
        value_cache = value_cache.contiguous()

        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query_states,
            key_cache,
||||
Reference in New Issue
Block a user