Mike Gold

Real-Time Voice Cloning Without Tokenization

Posted on X by Lior Alexander:

You can now clone a human voice in real time without tokenization.

OpenBMB just open sourced VoxCPM weights with real time streaming and LoRA fine tuning.

It runs at ~0.15 real time factor on a single RTX 4090.

๐—ง๐—ต๐—ถ๐˜€ ๐—ฟ๐—ฒ๐—บ๐—ผ๐˜ƒ๐—ฒ๐˜€ ๐˜๐—ผ๐—ธ๐—ฒ๐—ป๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ณ๐—ฟ๐—ผ๐—บ ๐—ง๐—ง๐—ฆ


Research Notes: Real-Time Human Voice Cloning with VoxCPM

Overview

The open-source VoxCPM model clones a human voice in real time without tokenization, that is, without the phoneme or codec token bottlenecks found in other text-to-speech (TTS) pipelines [2]. It runs at a real-time factor (RTF) of roughly 0.15 on a single RTX 4090. Under the usual definition of RTF (processing time divided by the duration of the audio produced), that is about 0.15 seconds of compute per second of speech. A voice can be cloned from just seconds of reference audio [2], which keeps the barrier to entry low.
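As a rough illustration of what a real-time factor of about 0.15 implies, the sketch below assumes the usual definition of RTF (processing time divided by the duration of the audio produced). The function names and the 10-second example are hypothetical and are not taken from VoxCPM.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds: float, rtf: float) -> float:
    """Estimate the compute time needed to synthesize audio_seconds at a given RTF."""
    return audio_seconds * rtf


# Hypothetical example: 10 s of speech at the ~0.15 RTF quoted in the post.
estimate = processing_time(10.0, 0.15)
print(f"{estimate:.2f} s of compute for 10 s of audio")
print(f"about {1 / 0.15:.1f}x faster than playback")
```

Any RTF below 1.0 means audio is generated faster than it plays back, which is the precondition for real-time use.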

Technical Analysis

VoxCPM converts input text directly into speech without an intermediate phoneme or codec tokenization step, which the sources describe as a bottleneck [2][3]. With that step removed, cloning needs only a short reference clip to capture a speaker's accent, rhythm, tone, and timing [2]. At an RTF of about 0.15 on a single RTX 4090, the model generates audio faster than it plays back, which leaves headroom for low-latency streaming output.
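To see why an RTF below 1.0 matters for streaming, here is a minimal simulation. It is a simplified model and not VoxCPM's actual streaming code: it assumes audio is generated in fixed-size chunks, that each chunk takes `chunk_seconds * rtf` to produce, and that playback starts as soon as the first chunk is ready. The chunk size and function name are hypothetical.

```python
def simulate_streaming(num_chunks: int, chunk_seconds: float, rtf: float):
    """Simulate chunked synthesis; return (first_audio_latency, underruns).

    Chunk i finishes generating at (i + 1) * chunk_seconds * rtf.
    Playback starts when chunk 0 is ready and then runs without pausing,
    so chunk i is needed at first_audio_latency + i * chunk_seconds.
    """
    gen_time = chunk_seconds * rtf
    first_audio_latency = gen_time
    underruns = 0
    for i in range(num_chunks):
        ready_at = (i + 1) * gen_time
        needed_at = first_audio_latency + i * chunk_seconds
        # Small tolerance so exact ties are not counted as underruns.
        if ready_at > needed_at + 1e-12:
            underruns += 1
    return first_audio_latency, underruns


# Hypothetical settings: twenty 0.5 s chunks at the quoted ~0.15 RTF.
latency, underruns = simulate_streaming(20, 0.5, 0.15)
print(f"first audio after {latency:.3f} s, underruns: {underruns}")
```

In this model the time to first audio equals the generation time of one chunk, and the playback buffer never runs dry as long as the RTF stays below 1.0. A real system would also have to account for model warm-up, audio I/O, and network delay.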

Implementation Details

  • VoxCPM: The primary model used for voice cloning, described by its website as next-generation TTS with voice cloning [3].
  • Pocket TTS: A 100M-parameter TTS model that also supports voice cloning [5].
  • LoRA Fine-Tuning: Mentioned in the original post. It adapts a pretrained model by training small low-rank matrices rather than retraining all of the weights.
  • Text-to-Speech (TTS): VoxCPM is part of a broader push toward more efficient, higher-quality TTS systems; the compact Pocket TTS is another example [5].
  • Voice Cloning: The technique builds on advances in speech synthesis; 11Lab is cited as another group making progress in this area [4].
  • Real-Time Processing: Generating a cloned voice faster than playback speed is what makes interactive, streaming use practical.
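The efficiency claim behind LoRA fine-tuning can be made concrete with a little arithmetic. The sketch below uses the standard LoRA formulation, in which a frozen weight matrix W of shape d_out x d_in receives a low-rank update (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in. The layer size, rank, and function names are hypothetical and do not describe VoxCPM's actual configuration.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)


def lora_delta(B, A, alpha: float, rank: int):
    """Compute the low-rank update (alpha / rank) * B @ A with plain lists."""
    scale = alpha / rank
    d_out, d_in = len(B), len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(d_in)]
        for i in range(d_out)
    ]


# Hypothetical layer: 1024 x 1024 weights adapted with rank 8.
full = full_finetune_params(1024, 1024)
lora = lora_params(1024, 1024, 8)
print(f"{lora} trainable LoRA parameters vs {full} ({lora / full:.2%})")
```

Only A and B are trained; the original weights stay frozen, and the update can be added to W after training so inference cost is unchanged.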

Key Takeaways

  • Real-time voice cloning without tokenization is now feasible with VoxCPM, using only seconds of reference audio [2].
  • A real-time factor of ~0.15 on a single RTX 4090 means speech is generated faster than it plays back, which suits real-world streaming applications [3].
  • Open-source availability of VoxCPM fosters collaboration and innovation in voice technologies [4].

Further Research

  • [1] Real-time Human Voice Cloning with VoxCPM
    LinkedIn

  • [2] No phoneme or codec token bottlenecks: It clones voices from seconds of audio
    X

  • [3] VoxCPM - Next-Generation Text-to-Speech with Voice Cloning
    Website

  • [4] Real-time voice cloning without tokenization and open-source discussion
    Facebook

  • [5] Pocket TTS: High-performance TTS and Voice Cloning
    Webkul Blog