# VoxCPM1.5 Release Notes

**Release Date:** December 5, 2025

## 🎉 Overview
We’re thrilled to introduce a major upgrade that improves the audio quality and efficiency of VoxCPM while maintaining its core capabilities of context-aware speech generation and zero-shot voice cloning.

| Feature | VoxCPM | VoxCPM1.5 |
|---------|--------|-----------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |

## 🎵 Model Updates

### 🔊 AudioVAE Sampling Rate: 16kHz → 44.1kHz

The AudioVAE now supports a 44.1kHz sampling rate, which allows the model to:

- 🎯 Clone voices more faithfully, preserving more high-frequency detail and generating higher-quality voice outputs

*Note: This upgrade enables higher-quality generation when using high-quality reference audio, but it does not guarantee that all generated audio will be high-fidelity. Output quality depends on the quality of the **prompt speech**.*
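As a quick sanity check on what the sampling-rate jump buys, the Nyquist limit gives the highest frequency each rate can represent (a minimal arithmetic sketch; the function name is ours, not part of VoxCPM):

```python
# Back-of-the-envelope view of the sampling-rate upgrade (illustrative only).
# By the Nyquist theorem, audio sampled at rate sr can represent
# frequencies up to sr / 2.

def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Highest representable frequency at a given sampling rate."""
    return sample_rate_hz / 2

old_limit = nyquist_limit_hz(16_000)   # VoxCPM
new_limit = nyquist_limit_hz(44_100)   # VoxCPM1.5

print(f"16kHz audio tops out at {old_limit:.0f} Hz")    # 8000 Hz
print(f"44.1kHz audio tops out at {new_limit:.0f} Hz")  # 22050 Hz
```

Much of what makes a cloned voice sound "airy" or natural sits above 8kHz, which is why the higher-rate VAE preserves more timbral detail.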

### ⚡ Token Rate: 12.5Hz → 6.25Hz

We reduced the token rate in the LM backbone from 12.5Hz to 6.25Hz (LocEnc & LocDiT patch size increased from 2 to 4) while maintaining similar performance on evaluation benchmarks. This change:

- 💨 Reduces computational requirements for generating the same length of audio
- 📈 Provides a foundation for longer audio generation
- 🏗️ Paves the way for training larger models in the future

**Model Architecture Clarification**: The core architecture of VoxCPM1.5 remains unchanged from the technical report. The key modification is adjusting the patch size of the local modules (LocEnc & LocDiT) from 2 to 4, which reduces the LM processing rate from 12.5Hz to 6.25Hz. Since the local modules now need to handle longer contexts, we expanded their network depth, resulting in a slightly larger overall model parameter count.
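The numbers above can be tied together with rough arithmetic (a simplified view derived from the figures in these notes, not the exact internals of LocEnc/LocDiT, which operate in the AudioVAE latent space rather than on raw samples):

```python
# Doubling the local-module patch size halves the LM token rate (illustrative).
old_rate_hz, old_patch, new_patch = 12.5, 2, 4
new_rate_hz = old_rate_hz * old_patch / new_patch
print(new_rate_hz)  # 6.25

def samples_per_token(audio_sr_hz: float, token_rate_hz: float) -> float:
    """Waveform samples spanned by one LM token's worth of audio
    (a simplification: the real pipeline goes through the AudioVAE)."""
    return audio_sr_hz / token_rate_hz

print(samples_per_token(16_000, 12.5))   # 1280.0 samples/token in VoxCPM
print(samples_per_token(44_100, 6.25))   # 7056.0 samples/token in VoxCPM1.5
```

Each token now covers roughly 5.5x more waveform, which is why the deeper local modules are needed to model the longer per-token context.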
**Generation Speed Clarification**: Although the model parameters have increased, VoxCPM1.5 only requires 6.25 tokens to generate 1 second of audio (compared to 12.5 tokens in the previous version). While the displayed generation speed (xx it/s) may appear slower, the actual Real-Time Factor (RTF = audio duration / processing time) shows no difference or may even be faster.
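The distinction between displayed step speed (it/s) and the Real-Time Factor can be illustrated with toy numbers (the per-token costs below are hypothetical, chosen only to show the effect; they are not benchmark results):

```python
# Illustrative arithmetic for the token-rate change (assumed numbers, not benchmarks).
# VoxCPM1.5 emits 6.25 LM tokens per second of audio vs. 12.5 before, so even if
# each token step is slower, the Real-Time Factor can hold steady or improve.

def rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Real-Time Factor as defined in these notes: audio duration / processing
    time. Higher is faster (RTF > 1 means faster than real time)."""
    return audio_seconds / processing_seconds

def processing_time(audio_seconds: float, tokens_per_second: float,
                    seconds_per_token: float) -> float:
    """Wall-clock time to generate `audio_seconds` of audio, given the model's
    token rate and an assumed cost per token step."""
    return audio_seconds * tokens_per_second * seconds_per_token

# Hypothetical per-token costs: the larger model is assumed ~1.5x slower per step.
old = processing_time(10.0, tokens_per_second=12.5, seconds_per_token=0.02)
new = processing_time(10.0, tokens_per_second=6.25, seconds_per_token=0.03)

print(rtf(10.0, old))  # 4.0  -> VoxCPM
print(rtf(10.0, new))  # ≈5.33 -> VoxCPM1.5: fewer tokens, higher RTF
```

In this toy setting the new model looks slower per iteration (0.03s vs 0.02s per token) yet finishes the same clip sooner, because it needs half as many tokens.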

## 🔧 Fine-tuning Support

We now support both full fine-tuning and LoRA fine-tuning; please see the [Fine-tuning Guide](finetune.md) for detailed instructions.
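For intuition on why LoRA fine-tuning is lighter than full SFT, compare trainable parameter counts for a single weight matrix (generic LoRA arithmetic, not the VoxCPM training API; the dimensions and rank below are hypothetical):

```python
# Parameter-count arithmetic behind LoRA's efficiency (generic illustration).
# LoRA freezes a d_out x d_in weight W and trains a low-rank update
# B (d_out x r) @ A (r x d_in), so only r * (d_in + d_out) parameters
# per adapted matrix are updated instead of d_in * d_out.

def full_ft_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the full matrix is updated (SFT)."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a rank-`rank` LoRA adapter on the same matrix."""
    return rank * (d_in + d_out)

# Hypothetical 1024x1024 projection with rank-8 adapters:
print(full_ft_params(1024, 1024))  # 1048576 trainable weights under full SFT
print(lora_params(1024, 1024, 8))  # 16384 trainable weights under LoRA (~1.6%)
```

This is why LoRA checkpoints stay small and why LoRA runs fit on much more modest GPUs than full fine-tuning.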

## 📚 Documentation

- Updated README with version comparison
- Added comprehensive fine-tuning guide
- Improved code comments and documentation

## 🙏 Our Thanks to You
This release wouldn’t be possible without the incredible feedback, testing, and contributions from our open-source community. Thank you for helping shape VoxCPM1.5!

## 📞 Let's Build Together

Questions, ideas, or want to contribute?
- 🐛 Report an issue: [GitHub Issues on OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM/issues)
- 📖 Dig into the docs: Check the [docs/](../docs/) folder for guides and API details
Enjoy the richer sound and powerful new features of VoxCPM1.5 🎉
We can't wait to hear what you create next! 🥂

## 🚀 What We're Working On

We're continuously improving VoxCPM and working on exciting new features:
- 🌍 **Multilingual TTS Support**: We are actively developing support for languages beyond Chinese and English.
- 🎯 **Controllable Expressive Speech Generation**: We are researching controllable speech generation that allows fine-grained control over speech attributes (emotion, timbre, prosody, etc.) through natural language instructions.
- 🎵 **Universal Audio Generation Foundation**: We also hope to explore VoxCPM as a unified audio generation foundation model capable of jointly generating speech, music, and sound effects. However, this is a longer-term vision.

**📅 Next Release**: We plan to release the next version in Q1 2026, which will include significant improvements and new features. Stay tuned for updates! We're committed to making VoxCPM even more powerful and versatile.

## ❓ Frequently Asked Questions (FAQ)

### Q: Does VoxCPM support fine-tuning for personalized voice customization?
**A:** Yes! VoxCPM now supports both full fine-tuning (SFT) and efficient LoRA fine-tuning. You can train personalized voice models on your own data. Please refer to the [Fine-tuning Guide](finetune.md) for detailed instructions and examples.
### Q: Is 16kHz audio quality sufficient for my use case?
**A:** We upgraded the AudioVAE in VoxCPM1.5 to support a 44.1kHz sampling rate, which provides higher-quality audio output with better preservation of high-frequency details. When used with high-quality reference audio, this upgrade enables better voice cloning quality and more natural speech synthesis.
### Q: Has the stability issue been resolved?
**A:** We have made stability optimizations in VoxCPM1.5, including improvements to the inference code logic, training data, and model architecture. Based on community feedback, we collected some stability issues such as:

- Increased noise and reverberation
- Audio artifacts (e.g., howling/squealing)
- Unstable speaking rate (speeding up)
- Volume fluctuations (increases or decreases)
- Noise artifacts at the beginning and end of audio
- Synthesis issues with very short texts (e.g., "hello")

**What we've improved:**

- By adjusting the inference code logic and optimizing training data, we have largely fixed the beginning/ending artifacts.
- By reducing the LM processing rate (12.5Hz → 6.25Hz), we have improved stability on longer speech generation.

**What remains:** We acknowledge that long speech stability issues have not been completely resolved. Particularly for highly expressive or complex reference speech, error accumulation during autoregressive generation can still occur. We will continue to analyze and optimize this in future versions.
### Q: Does VoxCPM plan to support multilingual TTS?
**A:** Currently, VoxCPM is primarily trained on Chinese and English data. We are actively researching and developing multilingual TTS support for more languages beyond Chinese and English. Please let us know what languages you'd like to see supported!
### Q: Does VoxCPM plan to support controllable generation (emotion, style, fine-grained control)?
**A:** Currently, VoxCPM only supports zero-shot voice cloning and context-aware speech generation. Direct control over specific speech attributes (emotion, style, fine-grained prosody) is limited. However, we are actively researching instruction-controllable expressive speech generation with fine-grained control capabilities, working towards a human instruction-to-speech generation model!
### Q: Does VoxCPM support different hardware chips (e.g., Ascend 910B, XPU, NPU)?
**A:** Currently, we have not yet adapted VoxCPM for different hardware chips. Our main focus remains on developing new model capabilities and improving stability. We encourage you to check if community developers have done similar work, and we warmly welcome everyone to contribute and promote such adaptations together!
These features are under active development, and we look forward to sharing updates in future releases!