StreamFlow: Streaming Audio Generation from Neural Codec Tokens via Streaming Flow Matching

1 KT Corp., Seoul, Korea
2 Ajou University, Suwon, Korea

Abstract

Diffusion models have demonstrated remarkable generative capabilities, and Conditional Flow Matching (CFM) has improved their inference efficiency by following optimal transport paths. However, CFM-based models still require multiple iterative sampling steps, which makes them unsuitable for real-time or streaming generation. In this paper, we introduce StreamFlow, a streaming generative model designed for real-time audio generation from discrete tokens. StreamFlow uses a causal noising training framework along the time axis and predicts multi-time vector fields at once on each stream, enabling streaming inference with minimal latency. To further improve generalization, we propose Scale-DiT, a Diffusion Transformer architecture that models, normalizes, and scales feature differences prior to skip connections. This design significantly improves the robustness and performance of DiT without increasing the parameter count. We validate StreamFlow on audio reconstruction from the discrete tokens of EnCodec and Mimi, demonstrating both high-fidelity synthesis and streaming capability. Furthermore, we incorporate our model into Moshi, a full-duplex streaming speech language model, by replacing its Mimi decoder.
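The abstract's remark that CFM-based models need several iterative sampling steps can be illustrated with a toy sketch. This is not the StreamFlow implementation: it assumes the standard straight-line (optimal transport) path between a noise sample and a data sample, and it replaces the trained network with a hypothetical `oracle` field that already knows the target. The names `ot_path`, `target_field`, `euler_sample`, and `oracle` are illustrative only.

```python
import random


def ot_path(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_field(x0, x1):
    """Conditional vector field of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(field_fn, x0, num_steps):
    """Integrate dx/dt = field_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = field_fn(x, t)  # one network call per step in a real model
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -0.25, 1.0]                   # toy "audio frame" (made-up values)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample


def oracle(x, t):
    """Hypothetical stand-in for a trained vector-field network."""
    return target_field(x0, x1)


x_hat = euler_sample(oracle, x0, num_steps=4)
```

With the oracle field, the Euler loop lands on `x1` up to floating-point error, because the field is constant along a straight path. A learned field is only an approximation, so the number of steps (and therefore network calls) becomes a latency cost, which is the constraint the abstract refers to.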

[Figure: animation of Streaming Flow Matching]

EnCodec Streaming Reconstruction

GT | EnCodec | StreamFlow-Base | StreamFlow-Small | StreamFlow-Tiny

EnCodec Parallel Reconstruction

GT | Vocos | MBD (10 steps) | RFWave (CFG2) | StreamFlow (4 steps)

Mimi Streaming Reconstruction

GT | Mimi (\( N_q = 4 \)) | StreamFlow (\( N_q = 4 \)) | Mimi (\( N_q = 6 \)) | StreamFlow (\( N_q = 6 \)) | Mimi (\( N_q = 8 \)) | StreamFlow (\( N_q = 8 \))

Replacing Mimi with StreamFlow in a Full-duplex Streaming Model

[Figure: animation of the inference pipeline comparing the Mimi decoder and the proposed StreamFlow, given identical Moshi output]
Input Audio | StreamFlow (\( N_q = 8 \)) | Mimi (\( N_q = 8 \))