Mirror of https://github.com/OpenBMB/VoxCPM (synced 2025-12-12 19:58:12 +00:00)

## Compare commits (30 commits)
| Author | SHA1 | Date |
|---|---|---|
|  | 41752dc0fa |  |
|  | b0714adcaa |  |
|  | 89f4d917a0 |  |
|  | 5c5da0dbe6 |  |
|  | 5f56d5ff5d |  |
|  | 169c17ddfd |  |
|  | 996c69a1a8 |  |
|  | dc6b6d1d1c |  |
|  | cef6aefb3d |  |
|  | 1a46c5d1ad |  |
|  | 5257ec3dc5 |  |
|  | bdd516b579 |  |
|  | 11568f0776 |  |
|  | e5bcb735f0 |  |
|  | 1fa9e2ca02 |  |
|  | 10f48ba330 |  |
|  | 639b2272ab |  |
|  | 7e8f754ba1 |  |
|  | 032c7fe403 |  |
|  | 5390a47862 |  |
|  | e7012f1a94 |  |
|  | 82332cfc99 |  |
|  | 605ac2d8e4 |  |
|  | 776c0d19fb |  |
|  | ed6e6b4dac |  |
|  | e3108d4a12 |  |
|  | 59fe3f30a1 |  |
|  | 6f2fb45756 |  |
|  | 91128d823d |  |
### .gitignore (vendored, new file, +3)

```diff
@@ -0,0 +1,3 @@
+launch.json
+__pycache__
+voxcpm.egg-info
```
### README.md (48 lines changed)

```diff
@@ -1,13 +1,20 @@
 ## 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
 
-[](https://github.com/OpenBMB/VoxCPM/) [](https://huggingface.co/openbmb/VoxCPM-0.5B) [](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [](https://thuhcsi.github.io/VoxCPM/)
+[](https://github.com/OpenBMB/VoxCPM/) [](https://huggingface.co/openbmb/VoxCPM-0.5B) [](https://modelscope.cn/models/OpenBMB/VoxCPM-0.5B) [](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [](https://openbmb.github.io/VoxCPM-demopage)
 
 <div align="center">
   <img src="assets/voxcpm_logo.png" alt="VoxCPM Logo" width="40%">
 </div>
 
+<div align="center">
+
+👋 Contact us on [WeChat](assets/wechat.png)
+
+</div>
+
 ## News
 * [2025.09.16] 🔥 🔥 🔥 We Open Source the VoxCPM-0.5B [weights](https://huggingface.co/openbmb/VoxCPM-0.5B)!
 * [2025.09.16] 🎉 🎉 🎉 We Provide the [Gradio PlayGround](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) for VoxCPM-0.5B, try it now!
```
```diff
@@ -32,11 +39,6 @@ Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses
-
-
-
-
-
 
 ## Quick Start
 
 ### 🔧 Install from PyPI
```
````diff
@@ -48,7 +50,7 @@ By default, when you first run the script, the model will be downloaded automatically.
 - Download VoxCPM-0.5B
 ```
 from huggingface_hub import snapshot_download
-snapshot_download("openbmb/VoxCPM-0.5B",local_files_only=local_files_only)
+snapshot_download("openbmb/VoxCPM-0.5B")
 ```
 - Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for speech prompt ASR in the web demo.
 ```
````
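The fix above drops an undefined `local_files_only` reference from the README snippet. The ZipEnhancer/SenseVoice-Small download block itself is truncated in this view; as a rough sketch, both models can be fetched from ModelScope like so (only the ZipEnhancer id appears elsewhere in this changeset; the SenseVoice-Small id is an assumption for illustration):

```python
# Sketch: fetch the web demo's auxiliary models from ModelScope.
# "iic/speech_zipenhancer_ans_multiloss_16k_base" appears in this changeset;
# "iic/SenseVoiceSmall" is an assumed id, not taken from this diff.
from modelscope import snapshot_download

snapshot_download("iic/speech_zipenhancer_ans_multiloss_16k_base")  # prompt speech enhancement
snapshot_download("iic/SenseVoiceSmall")                            # prompt ASR for the web demo
```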
````diff
@@ -60,10 +62,12 @@ By default, when you first run the script, the model will be downloaded automatically.
 ### 2. Basic Usage
 ```python
 import soundfile as sf
+import numpy as np
 from voxcpm import VoxCPM
 
 model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
 
+# Non-streaming
 wav = model.generate(
     text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
     prompt_wav_path=None,  # optional: path to a prompt speech for voice cloning
````
````diff
@@ -79,6 +83,18 @@ wav = model.generate(
 
 sf.write("output.wav", wav, 16000)
 print("saved: output.wav")
+
+# Streaming
+chunks = []
+for chunk in model.generate_streaming(
+    text="Streaming text to speech is easy with VoxCPM!",
+    # supports same args as above
+):
+    chunks.append(chunk)
+wav = np.concatenate(chunks)
+
+sf.write("output_streaming.wav", wav, 16000)
+print("saved: output_streaming.wav")
 ```
 
 ### 3. CLI Usage
````
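Since `generate_streaming` yields mono float32 chunks at 16 kHz (per the example above), a caller can also write audio incrementally instead of buffering every chunk; a minimal sketch using soundfile's incremental writer:

```python
# Sketch: stream chunks straight to disk as they are generated,
# assuming mono 16 kHz float32 chunks as in the README example.
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
with sf.SoundFile("output_streaming.wav", mode="w", samplerate=16000, channels=1) as f:
    for chunk in model.generate_streaming(text="Streaming text to speech is easy with VoxCPM!"):
        f.write(chunk)  # append each chunk without holding the full waveform in memory
```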
```diff
@@ -96,6 +112,13 @@ voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
     --output out.wav \
     --denoise
+
+# (Optional) Voice cloning (reference audio + transcript file)
+voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
+    --prompt-audio path/to/voice.wav \
+    --prompt-file "/path/to/text-file" \
+    --output out.wav \
+    --denoise
 
 # 3) Batch processing (one text per line)
 voxcpm --input examples/input.txt --output-dir outs
 # (optional) Batch + cloning
```
```diff
@@ -238,6 +261,13 @@ VoxCPM achieves competitive results on public zero-shot TTS benchmarks:
 
 
+## 📝 TO-DO List
+Please stay tuned for updates!
+- [ ] Release the VoxCPM technical report.
+- [ ] Support higher sampling rate (next version).
+
+
+
 ## 📄 License
 The VoxCPM model weights and code are open-sourced under the [Apache-2.0](LICENSE) license.
 
```
````diff
@@ -258,10 +288,14 @@ This project is developed by the following institutions:
 - <img src="assets/thuhcsi_logo.png" width="28px"> [THUHCSI](https://github.com/thuhcsi)
 
+
+## ⭐ Star History
+[](https://star-history.com/#OpenBMB/VoxCPM&Date)
+
 ## 📚 Citation
 
+The technical report is coming soon, please wait for the release 😊
+
 If you find our model helpful, please consider citing our projects 📝 and starring us ⭐️!
 
 ```bib
````
### app.py (11 lines changed)

```diff
@@ -170,7 +170,7 @@ def create_demo_interface(demo: VoxCPMDemo):
 
     # Pro Tips
     with gr.Accordion("💡 Pro Tips |使用建议", open=False, elem_id="acc_tips"):
-        gr.Markdown(f"""
+        gr.Markdown("""
         ### Prompt Speech Enhancement|参考语音降噪
         - **Enable** to remove background noise for a clean, studio-like voice, with an external ZipEnhancer component.
           **启用**:通过 ZipEnhancer 组件消除背景噪音,获得更好的音质。
```
```diff
@@ -194,10 +194,6 @@ def create_demo_interface(demo: VoxCPMDemo):
           **调低**:合成速度更快。
         - **Higher** for better synthesis quality.
           **调高**:合成质量更佳。
-
-        ### Long Text (e.g., >5 min speech)|长文本 (如 >5分钟的合成语音)
-        While VoxCPM can handle long texts directly, we recommend using empty lines to break very long content into paragraphs; the model will then synthesize each paragraph individually.
-        虽然 VoxCPM 支持直接生成长文本,但如果目标文本过长,我们建议使用换行符将内容分段;模型将对每个段落分别合成。
         """)
 
     # Main controls
```
```diff
@@ -206,7 +202,7 @@ def create_demo_interface(demo: VoxCPMDemo):
             prompt_wav = gr.Audio(
                 sources=["upload", "microphone"],
                 type="filepath",
-                label="Prompt Speech",
+                label="Prompt Speech (Optional, or let VoxCPM improvise)",
                 value="./examples/example.wav",
             )
             DoDenoisePromptAudio = gr.Checkbox(
```
```diff
@@ -244,14 +240,13 @@ def create_demo_interface(demo: VoxCPMDemo):
             text = gr.Textbox(
                 value="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly realistic speech.",
                 label="Target Text",
-                info="Default processing splits text on \\n into paragraphs; each is synthesized as a chunk and then concatenated into the final audio."
             )
             with gr.Row():
                 DoNormalizeText = gr.Checkbox(
                     value=False,
                     label="Text Normalization",
                     elem_id="chk_normalize",
-                    info="We use WeTextPorcessing library to normalize the input text."
+                    info="We use the wetext library to normalize the input text."
                 )
             audio_output = gr.Audio(label="Output Audio")
```
### assets/wechat.png (new binary file, 9.5 KiB)

Binary file not shown.
### pyproject.toml

```diff
@@ -20,12 +20,10 @@ classifiers = [
     "Intended Audience :: Developers",
     "Operating System :: OS Independent",
     "Programming Language :: Python :: 3",
-    "Programming Language :: Python :: 3.8",
-    "Programming Language :: Python :: 3.9",
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",
 ]
-requires-python = ">=3.8"
+requires-python = ">=3.10"
 dependencies = [
     "torch>=2.5.0",
     "torchaudio>=2.5.0",
```
```diff
@@ -34,9 +32,9 @@ dependencies = [
     "gradio",
     "inflect",
     "addict",
-    "WeTextProcessing",
+    "wetext",
     "modelscope>=1.22.0",
-    "datasets>=2,<4",
+    "datasets>=3,<4",
     "huggingface-hub",
     "pydantic",
     "tqdm",
```
```diff
@@ -78,7 +76,7 @@ version_scheme = "post-release"
 
 [tool.black]
 line-length = 120
-target-version = ['py38']
+target-version = ['py310']
 include = '\.pyi?$'
 extend-exclude = '''
 /(
```
### CLI module (argument parsing and `main()`)

```diff
@@ -240,6 +240,7 @@ Examples:
     # Prompt audio (for voice cloning)
     parser.add_argument("--prompt-audio", "-pa", help="Reference audio file path")
     parser.add_argument("--prompt-text", "-pt", help="Reference text corresponding to the audio")
+    parser.add_argument("--prompt-file", "-pf", help="Reference text file corresponding to the audio")
     parser.add_argument("--denoise", action="store_true", help="Enable prompt speech enhancement (denoising)")
 
     # Generation parameters
```
```diff
@@ -279,6 +280,12 @@ def main():
 
     # If prompt audio+text provided → voice cloning
     if args.prompt_audio or args.prompt_text:
+        if not args.prompt_text and args.prompt_file:
+            assert os.path.isfile(args.prompt_file), "Prompt file does not exist or is not accessible."
+
+            with open(args.prompt_file, "r", encoding="utf-8") as f:
+                args.prompt_text = f.read()
+
         if not args.prompt_audio or not args.prompt_text:
             print("Error: Voice cloning requires both --prompt-audio and --prompt-text")
             sys.exit(1)
```
### VoxCPM pipeline module (defines class `VoxCPM`)

```diff
@@ -1,17 +1,17 @@
-import torch
-import torchaudio
 import os
+import re
 import tempfile
+import numpy as np
+from typing import Generator
 from huggingface_hub import snapshot_download
 from .model.voxcpm import VoxCPMModel
-from .utils.text_normalize import TextNormalizer
 
 
 class VoxCPM:
     def __init__(self,
                  voxcpm_model_path : str,
                  zipenhancer_model_path : str = "iic/speech_zipenhancer_ans_multiloss_16k_base",
                  enable_denoiser : bool = True,
+                 optimize: bool = True,
                  ):
         """Initialize VoxCPM TTS pipeline.
 
```
```diff
@@ -22,10 +22,11 @@ class VoxCPM:
             zipenhancer_model_path: ModelScope acoustic noise suppression model
                 id or local path. If None, denoiser will not be initialized.
             enable_denoiser: Whether to initialize the denoiser pipeline.
+            optimize: Whether to optimize the model with torch.compile. True by default, but can be disabled for debugging.
         """
         print(f"voxcpm_model_path: {voxcpm_model_path}, zipenhancer_model_path: {zipenhancer_model_path}, enable_denoiser: {enable_denoiser}")
-        self.tts_model = VoxCPMModel.from_local(voxcpm_model_path)
-        self.text_normalizer = TextNormalizer()
+        self.tts_model = VoxCPMModel.from_local(voxcpm_model_path, optimize=optimize)
+        self.text_normalizer = None
         if enable_denoiser and zipenhancer_model_path is not None:
             from .zipenhancer import ZipEnhancer
             self.denoiser = ZipEnhancer(zipenhancer_model_path)
```
```diff
@@ -33,7 +34,8 @@ class VoxCPM:
             self.denoiser = None
         print("Warm up VoxCPMModel...")
         self.tts_model.generate(
-            target_text="Hello, this is the first test sentence."
+            target_text="Hello, this is the first test sentence.",
+            max_len=10,
         )
 
     @classmethod
```
```diff
@@ -43,6 +45,7 @@ class VoxCPM:
         zipenhancer_model_id: str = "iic/speech_zipenhancer_ans_multiloss_16k_base",
         cache_dir: str = None,
         local_files_only: bool = False,
+        **kwargs,
     ):
         """Instantiate ``VoxCPM`` from a Hugging Face Hub snapshot.
 
```
```diff
@@ -54,6 +57,8 @@ class VoxCPM:
             cache_dir: Custom cache directory for the snapshot.
             local_files_only: If True, only use local files and do not attempt
                 to download.
+        Kwargs:
+            Additional keyword arguments passed to the ``VoxCPM`` constructor.
 
         Returns:
             VoxCPM: Initialized instance whose ``voxcpm_model_path`` points to
```
```diff
@@ -82,9 +87,16 @@ class VoxCPM:
             voxcpm_model_path=local_path,
             zipenhancer_model_path=zipenhancer_model_id if load_denoiser else None,
             enable_denoiser=load_denoiser,
+            **kwargs,
         )
 
-    def generate(self,
+    def generate(self, *args, **kwargs) -> np.ndarray:
+        return next(self._generate(*args, streaming=False, **kwargs))
+
+    def generate_streaming(self, *args, **kwargs) -> Generator[np.ndarray, None, None]:
+        return self._generate(*args, streaming=True, **kwargs)
+
+    def _generate(self,
                  text : str,
                  prompt_wav_path : str = None,
                  prompt_text : str = None,
```
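The refactor above makes the streaming generator the single implementation: `_generate` always yields, `generate_streaming` returns the generator directly, and `generate` just takes its one final item with `next()`. A standalone sketch of the pattern (illustrative names, not VoxCPM code):

```python
from typing import Generator
import numpy as np

def _generate(streaming: bool = False) -> Generator[np.ndarray, None, None]:
    chunks = [np.zeros(4, dtype=np.float32), np.ones(4, dtype=np.float32)]
    if streaming:
        yield from chunks              # one array per generation step
    else:
        yield np.concatenate(chunks)   # a single final array

final_wav = next(_generate(streaming=False))  # non-streaming: exactly one item
stream = list(_generate(streaming=True))      # streaming: many items
```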
```diff
@@ -96,7 +108,8 @@ class VoxCPM:
                  retry_badcase : bool = True,
                  retry_badcase_max_times : int = 3,
                  retry_badcase_ratio_threshold : float = 6.0,
-                 ):
+                 streaming: bool = False,
+                 ) -> Generator[np.ndarray, None, None]:
         """Synthesize speech for the given text and return a single waveform.
 
         This method optionally builds and reuses a prompt cache. If an external
```
```diff
@@ -118,12 +131,24 @@ class VoxCPM:
             retry_badcase: Whether to retry badcase.
             retry_badcase_max_times: Maximum number of times to retry badcase.
             retry_badcase_ratio_threshold: Threshold for audio-to-text ratio.
+            streaming: Whether to return a generator of audio chunks.
         Returns:
-            numpy.ndarray: 1D waveform array (float32) on CPU.
+            Generator of numpy.ndarray: 1D waveform array (float32) on CPU.
+                Yields audio chunks for each generation step if ``streaming=True``,
+                otherwise yields a single array containing the final audio.
         """
-        texts = text.split("\n")
-        texts = [t.strip() for t in texts if t.strip()]
-        final_wav = []
+        if not isinstance(text, str) or not text.strip():
+            raise ValueError("target text must be a non-empty string")
+
+        if prompt_wav_path is not None:
+            if not os.path.exists(prompt_wav_path):
+                raise FileNotFoundError(f"prompt_wav_path does not exist: {prompt_wav_path}")
+
+        if (prompt_wav_path is None) != (prompt_text is None):
+            raise ValueError("prompt_wav_path and prompt_text must both be provided or both be None")
+
+        text = text.replace("\n", " ")
+        text = re.sub(r"\s+", " ", text)
         temp_prompt_wav_path = None
 
         try:
```
```diff
@@ -140,14 +165,14 @@ class VoxCPM:
             else:
                 fixed_prompt_cache = None  # will be built from the first inference
 
-            for sub_text in texts:
-                if sub_text.strip() == "":
-                    continue
-                print("sub_text:", sub_text)
-                if normalize:
-                    sub_text = self.text_normalizer.normalize(sub_text)
-                wav, target_text_token, generated_audio_feat = self.tts_model.generate_with_prompt_cache(
-                    target_text=sub_text,
+            if normalize:
+                if self.text_normalizer is None:
+                    from .utils.text_normalize import TextNormalizer
+                    self.text_normalizer = TextNormalizer()
+                text = self.text_normalizer.normalize(text)
+
+            generate_result = self.tts_model._generate_with_prompt_cache(
+                target_text=text,
                 prompt_cache=fixed_prompt_cache,
                 min_len=2,
                 max_len=max_length,
```
```diff
@@ -156,16 +181,11 @@ class VoxCPM:
                 retry_badcase=retry_badcase,
                 retry_badcase_max_times=retry_badcase_max_times,
                 retry_badcase_ratio_threshold=retry_badcase_ratio_threshold,
+                streaming=streaming,
             )
-                if fixed_prompt_cache is None:
-                    fixed_prompt_cache = self.tts_model.merge_prompt_cache(
-                        original_cache=None,
-                        new_text_token=target_text_token,
-                        new_audio_feat=generated_audio_feat
-                    )
-                final_wav.append(wav)
 
-            return torch.cat(final_wav, dim=1).squeeze(0).cpu().numpy()
+            for wav, _, _ in generate_result:
+                yield wav.squeeze(0).cpu().numpy()
 
         finally:
             if temp_prompt_wav_path and os.path.exists(temp_prompt_wav_path):
```
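Because `from_pretrained` now forwards `**kwargs` to the constructor, which in turn passes `optimize` to `VoxCPMModel.from_local`, torch.compile can be switched off from the top-level API; for example:

```python
# Disable torch.compile (e.g., for debugging); optimize is forwarded
# through from_pretrained -> __init__ -> VoxCPMModel.from_local.
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B", optimize=False)
```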
### model/voxcpm.py (`VoxCPMModel`)

```diff
@@ -19,11 +19,12 @@ limitations under the License.
 """
 
 import os
-from typing import Dict, Optional, Tuple, Union
+from typing import Tuple, Union, Generator, List
 
 import torch
 import torch.nn as nn
 import torchaudio
+import warnings
 from einops import rearrange
 from pydantic import BaseModel
 from tqdm import tqdm
```
```diff
@@ -147,13 +148,23 @@ class VoxCPMModel(nn.Module):
         self.sample_rate = audio_vae.sample_rate
 
 
-    def optimize(self):
-        if self.device == "cuda":
+    def optimize(self, disable: bool = False):
+        try:
+            if disable:
+                raise ValueError("Optimization disabled by user")
+            if self.device != "cuda":
+                raise ValueError("VoxCPMModel can only be optimized on CUDA device")
+            try:
+                import triton
+            except ImportError:
+                raise ValueError("triton is not installed")
             self.base_lm.forward_step = torch.compile(self.base_lm.forward_step, mode="reduce-overhead", fullgraph=True)
             self.residual_lm.forward_step = torch.compile(self.residual_lm.forward_step, mode="reduce-overhead", fullgraph=True)
             self.feat_encoder_step = torch.compile(self.feat_encoder, mode="reduce-overhead", fullgraph=True)
             self.feat_decoder.estimator = torch.compile(self.feat_decoder.estimator, mode="reduce-overhead", fullgraph=True)
-        else:
+        except Exception as e:
+            print(f"Error: {e}")
+            print("Warning: VoxCPMModel can not be optimized by torch.compile, using original forward_step functions")
             self.base_lm.forward_step = self.base_lm.forward_step
             self.residual_lm.forward_step = self.residual_lm.forward_step
             self.feat_encoder_step = self.feat_encoder
```
```diff
@@ -161,8 +172,14 @@ class VoxCPMModel(nn.Module):
         return self
 
 
+    def generate(self, *args, **kwargs) -> torch.Tensor:
+        return next(self._generate(*args, streaming=False, **kwargs))
+
+    def generate_streaming(self, *args, **kwargs) -> Generator[torch.Tensor, None, None]:
+        return self._generate(*args, streaming=True, **kwargs)
+
     @torch.inference_mode()
-    def generate(
+    def _generate(
         self,
         target_text: str,
         prompt_text: str = "",
```
```diff
@@ -174,7 +191,11 @@ class VoxCPMModel(nn.Module):
         retry_badcase: bool = False,
         retry_badcase_max_times: int = 3,
         retry_badcase_ratio_threshold: float = 6.0,  # setting acceptable ratio of audio length to text length (for badcase detection)
-    ):
+        streaming: bool = False,
+    ) -> Generator[torch.Tensor, None, None]:
+        if retry_badcase and streaming:
+            warnings.warn("Retry on bad cases is not supported in streaming mode, setting retry_badcase=False.")
+            retry_badcase = False
         if len(prompt_wav_path) == 0:
             text = target_text
             text_token = torch.LongTensor(self.text_tokenizer(text))
```
```diff
@@ -257,7 +278,7 @@ class VoxCPMModel(nn.Module):
 
         retry_badcase_times = 0
         while retry_badcase_times < retry_badcase_max_times:
-            latent_pred, pred_audio_feat = self.inference(
+            inference_result = self._inference(
                 text_token,
                 text_mask,
                 audio_feat,
```
```diff
@@ -266,7 +287,17 @@ class VoxCPMModel(nn.Module):
                 max_len=int(target_text_length * retry_badcase_ratio_threshold + 10) if retry_badcase else max_len,
                 inference_timesteps=inference_timesteps,
                 cfg_value=cfg_value,
+                streaming=streaming,
             )
+            if streaming:
+                patch_len = self.patch_size * self.chunk_size
+                for latent_pred, _ in inference_result:
+                    decode_audio = self.audio_vae.decode(latent_pred.to(torch.float32))
+                    decode_audio = decode_audio[..., -patch_len:].squeeze(1).cpu()
+                    yield decode_audio
+                break
+            else:
+                latent_pred, pred_audio_feat = next(inference_result)
             if retry_badcase:
                 if pred_audio_feat.shape[0] >= target_text_length * retry_badcase_ratio_threshold:
                     print(f"    Badcase detected, audio_text_ratio={pred_audio_feat.shape[0] / target_text_length}, retrying...")
```
```diff
@@ -276,7 +307,11 @@ class VoxCPMModel(nn.Module):
                     break
             else:
                 break
-        return self.audio_vae.decode(latent_pred.to(torch.float32)).squeeze(1).cpu()
+
+        if not streaming:
+            decode_audio = self.audio_vae.decode(latent_pred.to(torch.float32)).squeeze(1).cpu()
+            decode_audio = decode_audio[..., 640:-640]  # trick: trim the start and end of the audio
+            yield decode_audio
 
     @torch.inference_mode()
     def build_prompt_cache(
```
```diff
@@ -314,7 +349,7 @@ class VoxCPMModel(nn.Module):
             audio = torch.nn.functional.pad(audio, (0, patch_len - audio.size(1) % patch_len))
 
         # extract audio features
-        audio_feat = self.audio_vae.encode(audio.cuda(), self.sample_rate).cpu()
+        audio_feat = self.audio_vae.encode(audio.to(self.device), self.sample_rate).cpu()
 
         audio_feat = audio_feat.view(
             self.audio_vae.latent_dim,
```
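Replacing the hard-coded `.cuda()` with `.to(self.device)` lets prompt-cache building run on whatever device the model was loaded on; the general idiom, in isolation (standalone sketch, not VoxCPM code):

```python
# Sketch of the device-portable idiom: follow a chosen device rather
# than assuming a GPU is present.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
audio = torch.randn(1, 16000)
audio = audio.to(device)  # works on CPU-only machines as well as CUDA
```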
```diff
@@ -366,8 +401,16 @@ class VoxCPMModel(nn.Module):
 
         return merged_cache
 
+    def generate_with_prompt_cache(self, *args, **kwargs) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        return next(self._generate_with_prompt_cache(*args, streaming=False, **kwargs))
+
+    def generate_with_prompt_cache_streaming(
+        self, *args, **kwargs
+    ) -> Generator[Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]], None, None]:
+        return self._generate_with_prompt_cache(*args, streaming=True, **kwargs)
+
     @torch.inference_mode()
-    def generate_with_prompt_cache(
+    def _generate_with_prompt_cache(
         self,
         target_text: str,
         prompt_cache: dict,
```
```diff
@@ -378,7 +421,8 @@ class VoxCPMModel(nn.Module):
         retry_badcase: bool = False,
         retry_badcase_max_times: int = 3,
         retry_badcase_ratio_threshold: float = 6.0,
-    ):
+        streaming: bool = False,
+    ) -> Generator[Tuple[torch.Tensor, torch.Tensor, Union[torch.Tensor, List[torch.Tensor]]], None, None]:
         """
         Generate audio using pre-built prompt cache.
 
```
```diff
@@ -392,10 +436,17 @@ class VoxCPMModel(nn.Module):
             retry_badcase: Whether to retry on bad cases
             retry_badcase_max_times: Maximum retry attempts
             retry_badcase_ratio_threshold: Threshold for audio-to-text ratio
+            streaming: Whether to return a generator of audio chunks
 
         Returns:
-            tuple: (decoded audio tensor, new text tokens, new audio features)
+            Generator of Tuple containing:
+                - Decoded audio tensor for the current step if ``streaming=True``, else final decoded audio tensor
+                - Tensor of new text tokens
+                - New audio features up to the current step as a List if ``streaming=True``, else as a concatenated Tensor
         """
+        if retry_badcase and streaming:
+            warnings.warn("Retry on bad cases is not supported in streaming mode, setting retry_badcase=False.")
+            retry_badcase = False
         # get prompt from cache
         if prompt_cache is None:
             prompt_text_token = torch.empty(0, dtype=torch.int32)
```
```diff
@@ -440,7 +491,7 @@ class VoxCPMModel(nn.Module):
         target_text_length = len(self.text_tokenizer(target_text))
         retry_badcase_times = 0
         while retry_badcase_times < retry_badcase_max_times:
-            latent_pred, pred_audio_feat = self.inference(
+            inference_result = self._inference(
                 text_token,
                 text_mask,
                 audio_feat,
```
```diff
@@ -449,7 +500,21 @@ class VoxCPMModel(nn.Module):
                 max_len=int(target_text_length * retry_badcase_ratio_threshold + 10) if retry_badcase else max_len,
                 inference_timesteps=inference_timesteps,
                 cfg_value=cfg_value,
+                streaming=streaming,
             )
+            if streaming:
+                patch_len = self.patch_size * self.chunk_size
+                for latent_pred, pred_audio_feat in inference_result:
+                    decode_audio = self.audio_vae.decode(latent_pred.to(torch.float32))
+                    decode_audio = decode_audio[..., -patch_len:].squeeze(1).cpu()
+                    yield (
+                        decode_audio,
+                        target_text_token,
+                        pred_audio_feat
+                    )
+                break
+            else:
+                latent_pred, pred_audio_feat = next(inference_result)
             if retry_badcase:
                 if pred_audio_feat.shape[0] >= target_text_length * retry_badcase_ratio_threshold:
                     print(f"    Badcase detected, audio_text_ratio={pred_audio_feat.shape[0] / target_text_length}, retrying...")
```
```diff
@@ -459,16 +524,24 @@ class VoxCPMModel(nn.Module):
                     break
             else:
                 break
+
+        if not streaming:
             decode_audio = self.audio_vae.decode(latent_pred.to(torch.float32)).squeeze(1).cpu()
+            decode_audio = decode_audio[..., 640:-640]  # trick: trim the start and end of the audio
 
-        return (
+            yield (
                 decode_audio,
                 target_text_token,
                 pred_audio_feat
             )
 
+    def inference(self, *args, **kwargs) -> Tuple[torch.Tensor, torch.Tensor]:
+        return next(self._inference(*args, streaming=False, **kwargs))
+
+    def inference_streaming(self, *args, **kwargs) -> Generator[Tuple[torch.Tensor, List[torch.Tensor]], None, None]:
+        return self._inference(*args, streaming=True, **kwargs)
+
     @torch.inference_mode()
-    def inference(
+    def _inference(
         self,
         text: torch.Tensor,
         text_mask: torch.Tensor,
```
```diff
@@ -478,7 +551,8 @@ class VoxCPMModel(nn.Module):
         max_len: int = 2000,
         inference_timesteps: int = 10,
         cfg_value: float = 2.0,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        streaming: bool = False,
+    ) -> Generator[Tuple[torch.Tensor, Union[torch.Tensor, List[torch.Tensor]]], None, None]:
         """Core inference method for audio generation.
 
         This is the main inference loop that generates audio features
```
```diff
@@ -493,11 +567,12 @@ class VoxCPMModel(nn.Module):
             max_len: Maximum generation length
             inference_timesteps: Number of diffusion steps
             cfg_value: Classifier-free guidance value
+            streaming: Whether to yield each step's latent feature or just the final result
 
         Returns:
-            Tuple containing:
-                - Predicted latent features
-                - Predicted audio feature sequence
+            Generator of Tuple containing:
+                - Predicted latent feature at the current step if ``streaming=True``, else final latent features
+                - Predicted audio feature sequence so far as a List if ``streaming=True``, else as a concatenated Tensor
         """
         B, T, P, D = feat.shape
 
```
```diff
@@ -555,6 +630,12 @@ class VoxCPMModel(nn.Module):
             pred_feat_seq.append(pred_feat.unsqueeze(1))  # b, 1, p, d
             prefix_feat_cond = pred_feat
 
+            if streaming:
+                # yield the last three predicted latent features to provide enough context for smooth decoding
+                pred_feat_chunk = torch.cat(pred_feat_seq[-3:], dim=1)
+                feat_pred = rearrange(pred_feat_chunk, "b t p d -> b d (t p)", b=B, p=self.patch_size)
+                yield feat_pred, pred_feat_seq
+
             stop_flag = self.stop_head(self.stop_actn(self.stop_proj(lm_hidden))).argmax(dim=-1)[0].cpu().item()
             if i > min_len and stop_flag == 1:
                 break
```
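Each streaming step therefore yields the last three latent patches, and the callers above decode that window but keep only the final `patch_size * chunk_size` samples, so every emitted chunk is decoded with left context. A toy sketch of the windowing (illustrative names and sizes, not the real decoder):

```python
# Toy sketch of the rolling-window streaming decode: decode the last
# three patches for context, emit only the newest patch's samples.
def stream_decode(latents, decode, patch_len):
    for i in range(len(latents)):
        window = latents[max(0, i - 2): i + 1]  # up to 3 most recent patches
        audio = decode(window)                  # decoded with context
        yield audio[-patch_len:]                # keep only the newest patch

# Illustrative usage: pretend each patch decodes to 320 samples.
chunks = list(stream_decode(
    latents=[0, 1, 2, 3],
    decode=lambda w: list(range(len(w) * 320)),
    patch_len=320,
))
```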
```diff
@@ -569,14 +650,14 @@ class VoxCPMModel(nn.Module):
                 lm_hidden + curr_embed[:, 0, :], torch.tensor([self.residual_lm.kv_cache.step()], device=curr_embed.device)
             ).clone()
 
+        if not streaming:
             pred_feat_seq = torch.cat(pred_feat_seq, dim=1)  # b, t, p, d
             feat_pred = rearrange(pred_feat_seq, "b t p d -> b d (t p)", b=B, p=self.patch_size)
-        feat_pred = feat_pred[..., 1:-1]  # trick: remove the first and last token
-        return feat_pred, pred_feat_seq.squeeze(0).cpu()
+            yield feat_pred, pred_feat_seq.squeeze(0).cpu()
 
     @classmethod
-    def from_local(cls, path: str):
+    def from_local(cls, path: str, optimize: bool = True):
         config = VoxCPMConfig.model_validate_json(open(os.path.join(path, "config.json")).read())
 
         tokenizer = LlamaTokenizerFast.from_pretrained(path)
```
```diff
@@ -602,4 +683,4 @@ class VoxCPMModel(nn.Module):
         for kw, val in vae_state_dict.items():
             model_state_dict[f"audio_vae.{kw}"] = val
         model.load_state_dict(model_state_dict, strict=True)
-        return model.to(model.device).eval().optimize()
+        return model.to(model.device).eval().optimize(disable=not optimize)
```
### utils/text_normalize.py

```diff
@@ -3,40 +3,7 @@ import re
 import regex
 import inflect
 from functools import partial
-from tn.chinese.normalizer import Normalizer as ZhNormalizer
-from tn.english.normalizer import Normalizer as EnNormalizer
+from wetext import Normalizer
-
-
-def normal_cut_sentence(text):
-    # first replace commas inside parentheses with a placeholder
-    text = re.sub(r'([((][^))]*)([,,])([^))]*[))])', r'\1&&&\3', text)
-    text = re.sub('([。!,?\?])([^’”])', r'\1\n\2', text)  # common sentence-ending punctuation not followed by a quote
-    text = re.sub('(\.{6})([^’”])', r'\1\n\2', text)  # English ellipsis not followed by a quote
-    text = re.sub('(\…{2})([^’”])', r'\1\n\2', text)  # Chinese ellipsis not followed by a quote
-    text = re.sub('([. ,。!;?\?\.{6}\…{2}][’”])([^’”])', r'\1\n\2', text)  # sentence-ending punctuation + quote, not followed by another quote
-    # handle splitting of English sentences
-    text = re.sub(r'([.,!?])([^’”\'"])', r'\1\n\2', text)  # period/exclamation/question mark not followed by a quote
-    text = re.sub(r'([.!?][’”\'"])([^’”\'"])', r'\1\n\2', text)  # the part after punctuation plus a quote
-    text = re.sub(r'([((][^))]*)(&&&)([^))]*[))])', r'\1,\3', text)
-    text = [t for t in text.split("\n") if t]
-    return text
-
-
-def cut_sentence_with_fix_length(text : str, length : int):
-    sentences = normal_cut_sentence(text)
-    cur_length = 0
-    res = ""
-    for sentence in sentences:
-        if not sentence:
-            continue
-        if cur_length > length or cur_length + len(sentence) > length:
-            yield res
-            res = ""
-            cur_length = 0
-        res += sentence
-        cur_length += len(sentence)
-    if res:
-        yield res
-
-
 chinese_char_pattern = re.compile(r'[\u4e00-\u9fff]+')
```
```diff
@@ -195,8 +162,8 @@ def clean_text(text):
 class TextNormalizer:
     def __init__(self, tokenizer=None):
         self.tokenizer = tokenizer
-        self.zh_tn_model = ZhNormalizer(remove_erhua=False, full_to_half=False, remove_interjections=False, overwrite_cache=True)
-        self.en_tn_model = EnNormalizer()
+        self.zh_tn_model = Normalizer(lang="zh", operator="tn", remove_erhua=True)
+        self.en_tn_model = Normalizer(lang="en", operator="tn")
         self.inflect_parser = inflect.engine()
 
     def normalize(self, text, split=False):
```
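The `Normalizer(lang=..., operator="tn", ...)` arguments come straight from this diff; a minimal standalone sketch of the replacement wetext API in use:

```python
# Sketch: the wetext normalizers as constructed in the diff above.
from wetext import Normalizer

zh_tn = Normalizer(lang="zh", operator="tn", remove_erhua=True)
en_tn = Normalizer(lang="en", operator="tn")

print(zh_tn.normalize("550 + 320 = 870千卡"))       # Chinese text normalization
print(en_tn.normalize("The meeting is at 10:30."))  # English text normalization
```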
```diff
@@ -208,37 +175,11 @@ class TextNormalizer:
         if re.search(r'([\d$%^*_+≥≤≠×÷?=])', text):  # avoid English hyphens being wrongly normalized as a minus sign
             text = re.sub(r'(?<=[a-zA-Z0-9])-(?=\d)', ' - ', text)  # fix "x-2" being normalized as "x negative 2"
             text = self.zh_tn_model.normalize(text)
-            text = re.sub(r'(?<=[a-zA-Z0-9])-(?=\d)', ' - ', text)
-            text = self.zh_tn_model.normalize(text)
             text = replace_blank(text)
             text = replace_corner_mark(text)
             text = remove_bracket(text)
-            text = re.sub(r'[,,]+$', '。', text)
         else:
             text = self.en_tn_model.normalize(text)
             text = spell_out_number(text, self.inflect_parser)
         if split is False:
             return text
-
-
-if __name__ == "__main__":
-    text_normalizer = TextNormalizer()
-    text = r"""今天我们学习一元二次方程。一元二次方程的标准形式是:
-ax2+bx+c=0ax^2 + bx + c = 0ax2+bx+c=0
-其中,aaa、bbb 和 ccc 是常数,xxx 是变量。这个方程的解可以通过求根公式来找到。
-一元二次方程的解法有几种:
-- 因式分解法:通过将方程因式分解来求解。我们首先尝试将方程表达成两个括号的形式,解决方程的解。比如,方程x2−5x+6=0x^2 - 5x + 6 = 0x2−5x+6=0可以因式分解为(x−2)(x−3)=0(x - 2)(x - 3) = 0(x−2)(x−3)=0,因此根为2和3。
-- 配方法:通过配方将方程转化为完全平方的形式,从而解出。我们通过加上或减去适当的常数来完成这一过程,使得方程可以直接写成一个完全平方的形式。
-- 求根公式:我们可以使用求根公式直接求出方程的解。这个公式适用于所有的一元二次方程,即使我们无法通过因式分解或配方法来解决时,也能使用该公式。
-公式:x=−b±b2−4ac2ax = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}x=2a−b±b2−4ac这个公式可以帮助我们求解任何一元二次方程的根。
-对于一元二次方程,我们需要了解判别式。判别式的作用是帮助我们判断方程的解的个数和性质。判别式 Δ\DeltaΔ 由下式给出:Δ=b2−4ac\Delta = b^2 - 4acΔ=b2−4ac 根据判别式的值,我们可以知道:
-- 如果 Δ>0\Delta > 0Δ>0,方程有两个不相等的实数解。这是因为判别式大于0时,根号内的值是正数,所以我们可以得到两个不同的解。
-- 如果 Δ=0\Delta = 0Δ=0,方程有一个实数解。这是因为根号内的值为零,导致两个解相等,也就是说方程有一个解。
-- 如果 Δ<0\Delta < 0Δ<0,方程没有实数解。这意味着根号内的值是负数,无法进行实数运算,因此方程没有实数解,可能有复数解。"""
-    texts = ["这是一个公式 (a+b)³=a³+3a²b+3ab²+b³ S=(a×b)÷2", "这样的发展为AI仅仅作为“工具”这一观点提出了新的挑战,", "550 + 320 = 870千卡。", "解一元二次方程:3x^2+x-2=0", "你好啊"]
-    texts = [text]
-    for text in texts:
-        text = text_normalizer.normalize(text)
-        print(text)
-        for t in cut_sentence_with_fix_length(text, 15):
-            print(t)
```