Posted on X by Lior Alexander
You can now clone a human voice in real time without tokenization.
OpenBMB just open sourced VoxCPM weights with real time streaming and LoRA fine tuning.
It runs at ~0.15 real time factor on a single RTX 4090.
This removes tokenization from TTS
Research Notes: Real-Time Human Voice Cloning with VoxCPM
Overview
The open-source VoxCPM model is reported to clone a human voice in real time without phoneme or codec tokenization, a step that is often a bottleneck in text-to-speech (TTS) systems [2]. It reportedly runs at a real-time factor (RTF) of about 0.15 on a single RTX 4090 GPU, meaning roughly 0.15 seconds of compute for each second of generated audio. A voice can be cloned from only a few seconds of reference audio, which lowers the barrier to using the technique in practice [2].
Technical Analysis
VoxCPM converts input text into speech without an intermediate phoneme or codec token stage, which has traditionally been a bottleneck in TTS pipelines. Its architecture is described as generating audio in real time with low latency and high-quality output [3]. Dropping the tokenization stage also simplifies cloning: a short reference clip is enough to capture accent, rhythm, tone, and timing [2]. At an RTF of about 0.15 on a single RTX 4090 GPU, synthesis runs several times faster than playback, which leaves headroom for real-time applications.
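The real-time factor quoted above is simple arithmetic, and a short sketch makes it concrete. It uses the common definition of RTF (generation time divided by the duration of the audio produced); the function names are illustrative and are not part of VoxCPM.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio generated."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


# Figure quoted in the post: RTF of about 0.15 on a single RTX 4090.
REPORTED_RTF = 0.15

# Ten seconds of speech would take roughly 1.5 seconds to generate.
print(f"{synthesis_time(10.0, REPORTED_RTF):.2f} s to generate 10 s of audio")

# Equivalently, generation is roughly 6.7 times faster than playback.
print(f"{1.0 / REPORTED_RTF:.1f}x faster than real time")
```

Any RTF below 1 means audio is produced faster than it is played back; the lower the value, the more margin there is for streaming.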
Implementation Details
- VoxCPM: The primary model used for voice cloning, described as a next-generation TTS system [3].
- Pocket TTS: A separate 100M-parameter TTS model that also supports efficient text-to-speech conversion and voice cloning [5].
- LoRA Fine-Tuning: Mentioned in the original post. LoRA adapts a model by training small low-rank adapter weights instead of retraining every parameter.
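To give a sense of why LoRA fine-tuning is cheap, the sketch below counts trainable parameters for a single weight matrix under the standard LoRA formulation, where a frozen matrix W (d_out x d_in) is adjusted by a low-rank product B·A. The layer size and rank are hypothetical values chosen for illustration; they are not VoxCPM's actual dimensions.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, for illustration only.
d_in = d_out = 1024
rank = 8

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tuning: {full} parameters")
print(f"LoRA (rank {rank}): {lora} parameters")
print(f"trainable fraction: {lora / full:.4%}")
```

In this example the adapter trains under 2% of the parameters of the layer it modifies, which is the reason LoRA can adapt a model without retraining the whole system.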
Related Technologies
- Text-to-Speech (TTS): VoxCPM is part of a broader trend in TTS systems aiming to improve efficiency and quality. Pocket TTS, with its 100M-parameter model, exemplifies this progression [5].
- Voice Cloning: This technology builds on advances in speech synthesis, an area where commercial systems such as ElevenLabs are also making progress [4].
- Real-Time Processing: Cloning a voice in real time requires synthesis to run faster than playback (an RTF below 1), a threshold the reported figure of about 0.15 clears with a wide margin.
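The link between RTF and streaming can be illustrated with a toy simulation. This is a simplified model and not VoxCPM's streaming interface: it assumes audio is generated in fixed-length chunks, that each chunk takes `chunk_s * rtf` seconds to produce, and that playback starts as soon as the first chunk is ready. The chunk length is a hypothetical value.

```python
import math


def simulate_streaming(total_audio_s: float, chunk_s: float, rtf: float):
    """Return (first_chunk_latency_s, late_chunks) for a chunked stream.

    A chunk counts as late if it is not ready by the moment gapless
    playback would need it. Chunks are generated back to back.
    """
    n_chunks = math.ceil(total_audio_s / chunk_s)
    gen_time = chunk_s * rtf          # time to generate one chunk
    first_latency = gen_time          # playback starts after the first chunk
    late_chunks = 0
    for i in range(n_chunks):
        ready_at = (i + 1) * gen_time             # when chunk i is finished
        needed_at = first_latency + i * chunk_s   # when playback reaches it
        if ready_at > needed_at + 1e-12:
            late_chunks += 1
    return first_latency, late_chunks


# Reported RTF of about 0.15 with a hypothetical 0.5 s chunk size.
latency, late = simulate_streaming(total_audio_s=5.0, chunk_s=0.5, rtf=0.15)
print(f"RTF 0.15: first audio after {latency:.3f} s, late chunks: {late}")

# For contrast, a slower-than-real-time system (RTF above 1) falls behind.
latency, late = simulate_streaming(total_audio_s=5.0, chunk_s=0.5, rtf=1.5)
print(f"RTF 1.50: first audio after {latency:.3f} s, late chunks: {late}")
```

Under these assumptions, an RTF below 1 never starves playback, and the time to first audio scales with both the chunk length and the RTF.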
Key Takeaways
- Real-time voice cloning without tokenization is now feasible using VoxCPM, enhancing accessibility [2].
- The model's reported real-time factor of about 0.15 on a single RTX 4090 means it generates audio several times faster than playback, which suits real-world applications [3].
- Open-source availability of VoxCPM fosters collaboration and innovation in voice technologies [4].
Further Research
- [1] Real-time Human Voice Cloning with VoxCPM (LinkedIn)
- [2] No phoneme or codec token bottlenecks: It clones voices from seconds of audio (X)
- [3] VoxCPM - Next-Generation Text-to-Speech with Voice Cloning (Website)
- [4] Real-time voice cloning without tokenization and open-source discussion (Facebook)
- [5] Pocket TTS: High-performance TTS and Voice Cloning (Webkul Blog)