Kyutai STT
Excerpt
A speech-to-text model optimized for real-time usage.
Summary
Main Summary
Kyutai STT emerges as a state-of-the-art speech-to-text solution, optimized specifically for real-time, interactive use. Its streaming model architecture provides an unmatched trade-off between latency and accuracy, making it ideal for applications that demand immediate responses. The system introduces two key models: kyutai/stt-1b-en_fr, a low-latency bilingual model with an innovative semantic voice activity detector (VAD), and kyutai/stt-2.6b-en, a larger English-only version optimized for maximum accuracy. Unlike traditional models that require the full audio in advance, Kyutai STT transcribes audio as it receives it while keeping accuracy on par with state-of-the-art non-streaming models. In addition, its support for batching makes it possible to handle hundreds of concurrent conversations on a single GPU, underlining its suitability for high-throughput production environments.
Key Points
- Streaming Architecture and Accuracy: Kyutai STT operates as a streaming model that transcribes audio in real time, making it a perfect fit for applications such as Unmute. Despite its real-time nature, it achieves accuracy comparable to state-of-the-art non-streaming models, which have access to the full audio in advance. It also produces well-formatted transcripts with punctuation and word-level timestamps.
- Semantic Voice Activity Detector (VAD): A distinctive feature, particularly useful for cascaded voice chat applications, is its semantic VAD. Instead of relying on a fixed timeout after the user stops speaking, Kyutai STT predicts the probability that the user is done talking based on the content and intonation of their speech. This solves the problem of long pauses confusing traditional VADs by dynamically adapting the pause-prediction delay.
- Low Latency and the "Flush Trick": The kyutai/stt-1b-en_fr model has a latency of 500ms, while kyutai/stt-2.6b-en has a latency of 2.5 seconds. To further reduce response latency in applications such as Unmute, the "flush trick" is used: once the VAD predicts the end of speech, the STT server processes the audio already sent at roughly 4x real time. This cuts the additional wait from 500ms to just 125ms, "warping time" to deliver a complete transcript with minimal delay.
- High Throughput and Delayed Streams Modeling: Kyutai STT is designed for production settings and can transcribe 400 real-time audio streams simultaneously on a single H100 GPU. This capacity comes from its delayed streams modeling architecture, which lets the model run with a large batch size without any additional "glue code" for streaming. This contrasts with solutions such as Whisper-Streaming which, although impressive, does not support batching and therefore achieves far lower throughput.
Analysis and Implications
Kyutai STT has transformative implications for interactive voice applications such as virtual assistants, contact centers, and real-time collaboration tools. Its ability to deliver low latency and high accuracy at the same time, together with its innovative semantic VAD and high throughput, redefines expectations for AI-driven communication platforms. This enables smoother, more natural voice interactions, significantly improving the user experience and opening new possibilities for automating conversational workflows.
Additional Context
Content
A speech-to-text model optimized for real-time usage.
Kyutai STT is a streaming speech-to-text model architecture, providing an unmatched trade-off between latency and accuracy, perfect for interactive applications. Its support for batching allows for processing hundreds of concurrent conversations on a single GPU. We release two models:
- kyutai/stt-1b-en_fr, a low-latency model that understands English and French, and has a built-in semantic voice activity detector.
- kyutai/stt-2.6b-en, a larger English-only model optimized to be as accurate as possible.
Try out kyutai/stt-1b-en_fr here:
Streaming and accurate
Figure: word error rate, lower is better.
Kyutai STT is a streaming model, meaning it transcribes audio in real time as it receives it rather than assuming that the entire audio is known in advance. This makes it well-suited for real-time applications such as Unmute.
It outputs well-formatted transcripts with punctuation, and comes with word-level timestamps as well.
In terms of accuracy, it still performs on par with state-of-the-art non-streaming models that have access to the whole audio at once.
Semantic voice activity detector
For cascaded voice chat applications like Unmute, you need a way to determine that the user is done talking so that a response can be generated.
The most common way of doing this is to have a separate voice activity detection model that determines whether the user is speaking or not, and wait a fixed amount of time after the user is done talking.
In practice, it's impossible to find a waiting time that would fit all cases. People often make long pauses during their sentences, which lead to false positives in the naive approach.
To solve this, Kyutai STT predicts not only the text but also the probability that the user is done talking. The delay for the pause prediction adapts based on the content and intonation of what the user is saying.
You can play around with this in the demo above. Look for "End of speech detected".
The semantic VAD is available in the Rust server but not yet in the other implementations.
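As a rough illustration of how a cascaded application could consume this signal, the sketch below assumes a hypothetical per-frame result that carries an end-of-speech probability; the names, message shape, and threshold are illustrative only and are not part of the actual server API.

```python
# Illustrative sketch only: the real Kyutai STT server exposes this signal
# differently; FrameResult and end_of_speech_prob are hypothetical names.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class FrameResult:
    text: str                  # words transcribed in this frame (may be empty)
    end_of_speech_prob: float  # model's estimate that the user is done talking

def detect_end_of_turn(frames: Iterable[FrameResult], threshold: float = 0.6) -> str:
    """Accumulate a transcript and stop as soon as the semantic VAD is
    confident the user has finished speaking, instead of waiting for a
    fixed silence timeout."""
    transcript = []
    for frame in frames:
        if frame.text:
            transcript.append(frame.text)
        if frame.end_of_speech_prob > threshold:
            break  # hand off to the response generator
    return " ".join(transcript)
```

Because the probability comes from the model's understanding of content and intonation, a mid-sentence pause keeps the probability low and the loop keeps listening, while a finished sentence crosses the threshold quickly.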
Low latency
The delay of kyutai/stt-1b-en_fr is set to 500ms,
meaning words will be transcribed 500ms after they are said.
For kyutai/stt-2.6b-en, the delay is 2.5 seconds.
In Unmute, we use what we call the "flush trick" to reduce the response latency further. Once the voice activity detector predicts that the user is done talking, we have to wait an additional 500ms (the delay of the STT model) to ensure we don't cut off the end of the transcript.
To reduce this delay, we exploit the fact that the speech-to-text server is able to process audio faster than real-time. When we detect the end of speech, we ask the STT server to process the audio we've already sent as fast as it can. The server runs at around 4x real time, so it can process this audio in around 125ms = 500ms/4. This way, we "warp time" and we only have to wait these 125ms to be certain that we've transcribed everything.
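The saving is easy to quantify; the snippet below simply restates the arithmetic from the paragraph above, using the roughly 4x real-time factor quoted for the server.

```python
def flush_wait_ms(stt_delay_ms: float = 500.0, realtime_factor: float = 4.0) -> float:
    """Remaining wait after end-of-speech is detected.

    Without the flush trick we would wait the full STT delay (500 ms) to be
    sure the tail of the utterance has been transcribed. By asking the server
    to process the already-buffered audio as fast as it can (about 4x real
    time), that wait shrinks to delay / realtime_factor.
    """
    return stt_delay_ms / realtime_factor

print(flush_wait_ms())                      # 125.0 ms with the flush trick
print(flush_wait_ms(realtime_factor=1.0))   # 500.0 ms without it
```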
High throughput
Kyutai STT is well-suited to production settings: on an H100, it can transcribe 400 real-time audio streams simultaneously.
This is thanks to the delayed streams modeling architecture (see below), which allows us to run the model with a high batch size without needing any "glue code" on top of the model to allow for streaming.
For comparison, turning Whisper into a streaming model required a whole separate research project, Whisper-Streaming. The system repeatedly runs Whisper on the last few seconds of the audio and stitches the overlapping transcripts together.
Whisper-Streaming is technically impressive but does not support batching, leading to a much lower throughput. For a lower target delay, its throughput decreases further, because it needs to re-run Whisper more often.
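A back-of-the-envelope sketch of why the sliding-window approach scales worse; the window and hop durations below are illustrative assumptions, not measurements of Whisper-Streaming.

```python
def redundant_passes_per_second(window_s: float, hop_s: float) -> float:
    """A sliding-window streaming wrapper re-transcribes a window_s-second
    window every hop_s seconds, so each second of audio is processed roughly
    window_s / hop_s times. A lower target delay means a smaller hop, hence
    more redundant compute per stream."""
    return window_s / hop_s

# Illustrative numbers only:
print(redundant_passes_per_second(window_s=10.0, hop_s=1.0))   # ~10x redundant work
print(redundant_passes_per_second(window_s=10.0, hop_s=0.5))   # ~20x at a lower delay

# A delayed-streams model instead consumes each audio frame exactly once,
# so per-stream cost stays constant and many streams can share one batched
# forward pass per frame.
```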
Implementations
We provide different implementations of Kyutai STT, depending on your use case. Instructions for running all of these are available on GitHub.
PyTorch: for research and tinkering
If you want to call the model from Python for research or experimentation, use our PyTorch implementation.
Rust: for production
If you want to serve Kyutai STT in a production setting, use the Rust implementation. This is what we use in Unmute.
Our robust Rust server provides streaming access to the model over websockets.
See the delayed-streams-modeling repo for how to run it.
We use this server to run Unmute; on an L40S GPU, we can serve 64 simultaneous connections at a real-time factor of 3x.
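For a sense of what a streaming client might look like, here is a minimal Python sketch against a websocket server; the URL, pacing, and message schema are assumptions for illustration only, not the documented protocol of the Rust server (see the delayed-streams-modeling repo for the real instructions).

```python
# Hypothetical client sketch; the endpoint, framing, and JSON fields are
# placeholders, not the actual protocol of the Kyutai STT Rust server.
import asyncio
import json
import websockets  # pip install websockets

async def stream_audio(audio_chunks, url="ws://localhost:8080/api/asr-streaming"):
    async with websockets.connect(url) as ws:
        async def send():
            for chunk in audio_chunks:       # raw audio bytes, chunked in real time
                await ws.send(chunk)
                await asyncio.sleep(0.08)    # pace the stream roughly like live audio

        async def receive():
            async for message in ws:         # server pushes incremental results
                event = json.loads(message)
                print(event.get("text", ""), end="", flush=True)

        # Runs until the server closes the connection.
        await asyncio.gather(send(), receive())

# asyncio.run(stream_audio(chunks_from_microphone()))
```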
MLX: for on-device inference on iPhone and Mac
MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon.
If you want to run the model on a Mac or an iPhone, choose the MLX implementation.
Delayed streams modeling
The main innovation of Kyutai STT is a technique developed at Kyutai called delayed streams modeling, which we pioneered with Moshi.
The usual way of using a language model to do speech-to-text is to use a model that receives the whole audio at once as input and generates the text step-by-step (autoregressively). For instance, this is what Whisper does, using an encoder-decoder transformer.
In Kyutai STT, we instead represent the data as time-aligned streams of text and audio. Essentially, the audio and text are "next to" each other rather than after one another. The text stream is padded to ensure the timing of the text is aligned with the audio. We just delay the text stream by a few frames to allow the speech-to-text some lookahead.
We train Kyutai STT on text-audio data represented this way, teaching it to model both the audio stream and the text stream. During inference, we keep the audio stream fixed, feed in the input audio as we receive it, and use the model to predict the text stream.
Another neat property of this approach is its symmetry. We can get a text-to-speech by delaying the audio stream instead of the text stream, and then keeping the text fixed (teacher-forcing) and predicting the audio instead. A bit of trickery is required to allow the model to predict padding tokens to align the timing of the text stream with the audio stream. We'll provide more details once we open-source the text-to-speech model.
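To make the stream layout concrete, here is a toy sketch of aligning a text stream to audio frames and shifting it by a couple of frames for lookahead; the frame duration, pad token, and word-level granularity are illustrative assumptions, not the model's actual tokenization.

```python
PAD = "<pad>"

def align_streams(audio_frames, word_timings, delay_frames=2, frame_s=0.08):
    """Build a text stream aligned frame-by-frame with the audio stream.

    word_timings: list of (word, start_time_s). Each word is placed at the
    frame where it starts, shifted right by delay_frames so the model sees a
    little audio beyond the word before it must emit it; all other frames are
    padding.
    """
    text_stream = [PAD] * len(audio_frames)
    for word, start_s in word_timings:
        frame = int(start_s / frame_s) + delay_frames
        if frame < len(text_stream):
            text_stream[frame] = word
    return text_stream

# Toy example: 10 audio frames of 80 ms each, two words.
audio = [f"audio_frame_{i}" for i in range(10)]
text = align_streams(audio, [("hello", 0.0), ("world", 0.32)])
for a, t in zip(audio, text):
    print(a, t)

# For STT, the model predicts this (delayed) text stream from the audio
# stream; the symmetric TTS setup would instead delay and predict the audio
# stream while teacher-forcing the text.
```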
We are working on a paper that will explain both models in full detail.
Credits
Kyutai STT, Kyutai TTS, and Unmute were created by Alexandre Défossez, Edouard Grave, Eugene Kharitonov, Laurent Mazare, Gabriel de Marmiesse, Emmanuel Orsini, Patrick Perez, Václav Volhejn, and Neil Zeghidour, with support from the rest of the Kyutai team.