prototype · speech research · May 2026
Moshi turn-taking data pipeline
Supporting data pipeline for conversational audio: discovery, diarisation, transcription, Mimi encoding, and turn-yield labels.
My dissertation work is about turn-taking in dialogue systems, and this pipeline was a practical attempt to create the kind of data such systems need. Full-duplex speech models need more than transcripts: they need timing, speaker state, audio-codec tokens, and labels that say whether the system should hold or yield the floor.
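To make "timing" and "speaker state" concrete, here is a minimal Python sketch of how diarised segments could be mapped onto codec frames. It assumes Mimi's published frame rate of 12.5 Hz (80 ms per frame). The Segment schema and the speaker_state helper are hypothetical names for illustration, not the pipeline's actual code.

```python
import math
from dataclasses import dataclass

# Mimi's published frame rate; the rest of this sketch is an assumption.
FRAME_RATE_HZ = 12.5


@dataclass(frozen=True)
class Segment:
    """One diarised speech segment (hypothetical schema)."""
    speaker: str
    start_s: float
    end_s: float


def speaker_state(segments, speaker, duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Return one boolean per codec frame: True if `speaker` talks in that frame."""
    n_frames = math.ceil(duration_s * frame_rate_hz)
    active = [False] * n_frames
    for seg in segments:
        if seg.speaker != speaker:
            continue
        # A frame counts as active if the segment overlaps any part of it.
        first = max(0, math.floor(seg.start_s * frame_rate_hz))
        last = min(n_frames, math.ceil(seg.end_s * frame_rate_hz))
        for i in range(first, last):
            active[i] = True
    return active


# Two-second toy conversation: the user speaks first, then the system.
segments = [Segment("user", 0.0, 1.0), Segment("system", 1.0, 2.0)]
user_active = speaker_state(segments, "user", duration_s=2.0)
system_active = speaker_state(segments, "system", duration_s=2.0)
```

Aligning speaker state to the codec frame grid means each audio token frame has a matching speaker-state value, which is what a frame-level label needs.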
The pipeline stages are deliberately explicit: discover suitable conversational audio, download it, diarise and transcribe it, encode the audio through Mimi, and generate training labels for future user state and turn yielding. It is less a polished product than a research data machine.
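The last stage, labelling, can be sketched in a few lines of Python. This is an assumed scheme, not the pipeline's actual one: the function names, the horizon_frames parameter, and the three-way hold / yield / listen labels are all hypothetical. The input is one boolean per codec frame for each speaker.

```python
def future_user_state(user_active, horizon_frames):
    """labels[t] is True if the user speaks in any frame in (t, t + horizon_frames]."""
    n = len(user_active)
    return [any(user_active[t + 1 : t + 1 + horizon_frames]) for t in range(n)]


def yield_labels(system_active, user_active, horizon_frames):
    """Per-frame turn label under an assumed three-way scheme.

    "listen": the system is silent in this frame.
    "yield":  the system is speaking and the user speaks within the horizon.
    "hold":   the system is speaking and the user stays silent within the horizon.
    """
    if len(system_active) != len(user_active):
        raise ValueError("speaker tracks must have the same number of frames")
    upcoming = future_user_state(user_active, horizon_frames)
    labels = []
    for sys_on, user_soon in zip(system_active, upcoming):
        if not sys_on:
            labels.append("listen")
        elif user_soon:
            labels.append("yield")
        else:
            labels.append("hold")
    return labels


# Toy example: the system talks for 6 frames, the user starts at frame 5.
system_active = [True] * 6 + [False] * 4
user_active = [False] * 5 + [True] * 5
labels = yield_labels(system_active, user_active, horizon_frames=2)
```

The horizon controls how early a yield is signalled. At 12.5 frames per second, a horizon of 2 frames looks 160 ms ahead.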
The local files for this work live under moshi-data-pipeline/ and date it to May 2026.