Full-Duplex-Bench: A Benchmark for Full-duplex Spoken Dialogue Models

Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities

Spoken dialogue modeling introduces unique challenges beyond text-based language modeling, demanding robust turn-taking, backchanneling, and real-time interaction. Although most Spoken Dialogue Models (SDMs) rely on half-duplex processing (handling speech one turn at a time), emerging full-duplex SDMs can listen and speak simultaneously, enabling more natural and engaging conversations. However, current evaluations of such models remain limited, often focusing on turn-based metrics or high-level corpus analyses (e.g., turn gaps, pauses). To address this gap, we present Full-Duplex-Bench, a new benchmark that systematically evaluates key conversational behaviors: pause handling, backchanneling, turn-taking, and interruption management. Our framework uses automatic metrics for consistent and reproducible assessments of SDMs' interactive performance. By offering an open and standardized evaluation benchmark, we aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.

Framework Overview

Figure 1: The Full-Duplex-Bench evaluation framework architecture.

Figure 2: The four key dimensions evaluated in our benchmark.

Evaluation Dimensions

Full-Duplex-Bench evaluates spoken dialogue models across four key conversational dimensions:

1. Pause Handling

Evaluates whether a model can recognize that a speaker has paused but still holds the turn. The ideal model does not take over during natural pauses. Measured by Takeover Rate (TOR); lower is better.
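As a rough illustration, TOR can be read as the fraction of samples in which the model takes the turn. The sketch below assumes each sample has already been labeled with a boolean takeover flag; how that flag is detected is not described on this page, and the function name `takeover_rate` is hypothetical.

```python
def takeover_rate(took_over):
    """Return the fraction of samples in which the model took the turn.

    `took_over` is an iterable of booleans, one per evaluated sample.
    Assumption: takeover detection happened upstream of this function.
    """
    flags = list(took_over)
    if not flags:
        raise ValueError("need at least one sample")
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical labels for four pause-handling samples.
print(takeover_rate([True, False, False, True]))
```

For pause handling a lower value is better, since every takeover during a pause counts as an error.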

2. Backchanneling

Assesses whether a model provides appropriate listener responses (e.g., "uh-huh," "mm-hmm") without taking over the turn. Measured by Takeover Rate (lower is better), Backchannel Frequency (higher is better), and Jensen-Shannon Divergence (JSD, lower is better), which compares the timing of the model's backchannels with human behavior.
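To make the JSD comparison concrete, here is a minimal sketch that compares two timing histograms. It assumes backchannel timings have been binned into the same time windows for the model and for human listeners, and it uses base-2 logarithms so the result lies in [0, 1]; the binning scheme, the log base, and the function names are assumptions, not details taken from the benchmark.

```python
import math


def _normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def _kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(model_counts, human_counts):
    """JSD between two histograms defined over the same bins."""
    if len(model_counts) != len(human_counts):
        raise ValueError("histograms must have the same number of bins")
    p = _normalize(model_counts)
    q = _normalize(human_counts)
    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Hypothetical backchannel counts per time window.
print(jensen_shannon_divergence([3, 1, 0, 2], [2, 2, 1, 1]))
```

Identical histograms give 0 and histograms with no overlapping bins give 1, so a lower value means the model's backchannel timing is closer to the human reference.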

3. Smooth Turn-Taking

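Tests whether a model can detect turn boundaries and respond promptly. Measured by Response Latency: the time between the end of user speech and the start of the model's response. Lower latency indicates smoother turn-taking. The results table also reports Takeover Rate for this dimension, where higher is better because the model is expected to take the turn.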
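The latency definition above amounts to a subtraction of two timestamps. The sketch below assumes both timestamps are given in seconds on a shared clock; the function names are hypothetical, and the handling of negative values (the model starting before the user finishes) is left as-is because this page does not say how the benchmark treats that case.

```python
def response_latency(user_end_s, model_start_s):
    """Seconds from the end of user speech to the start of the model response.

    A negative result means the model started speaking before the user
    finished; this sketch returns it unchanged.
    """
    return model_start_s - user_end_s


def mean_latency(pairs):
    """Average latency over (user_end_s, model_start_s) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one sample")
    return sum(response_latency(u, m) for u, m in pairs) / len(pairs)


# Hypothetical timestamps for two turn-taking samples.
print(mean_latency([(1.0, 1.2), (3.0, 3.4)]))
```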

4. User Interruption

Examines how a model handles and adapts to user interruptions. Evaluated using Takeover Rate (higher is better), GPT-4o Score for response quality (higher is better), and Latency After Interruption, the response time following an interruption (lower is better).

Benchmark Results

Performance comparison of dialogue models across our evaluation dimensions.

Dimension	Pause Handling		Backchannel			Smooth Turn Taking		User Interruption
Model	Synthetic TOR (↓)	Candor TOR (↓)	ICC TOR (↓)	ICC Freq (↑)	ICC JSD (↓)	Candor TOR (↑)	Candor Latency (↓)	Synthetic TOR (↑)	Synthetic GPT-4o (↑)	Synthetic Latency (↓)
dGSLM	0.949	0.953	0.782	0.013	0.950	0.989	0.572	0.895	0.201	3.972
Moshi	1.000	0.989	1.000	0.005	0.977	0.996	0.112	1.000	0.765	0.037
Freeze-Omni	0.672	0.287	0.782	0.002	0.984	0.369	1.168	0.795	3.371	1.200

Note: Bold numbers indicate best performance for each metric. Arrows indicate whether higher (↑) or lower (↓) values are better.
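As a reading aid, the arrow convention can be applied mechanically to find the best model per metric. The sketch below copies the values from the results table; the metric labels, the `"down"`/`"up"` encoding, and the function name `best_models` are illustrative choices, and ties are reported as a list of all tied models.

```python
# Scores copied from the results table, in column order.
RESULTS = {
    "dGSLM": [0.949, 0.953, 0.782, 0.013, 0.950, 0.989, 0.572, 0.895, 0.201, 3.972],
    "Moshi": [1.000, 0.989, 1.000, 0.005, 0.977, 0.996, 0.112, 1.000, 0.765, 0.037],
    "Freeze-Omni": [0.672, 0.287, 0.782, 0.002, 0.984, 0.369, 1.168, 0.795, 3.371, 1.200],
}

# (label, direction): "down" means lower is better, "up" means higher is better.
METRICS = [
    ("Pause Handling / Synthetic TOR", "down"),
    ("Pause Handling / Candor TOR", "down"),
    ("Backchannel / ICC TOR", "down"),
    ("Backchannel / ICC Freq", "up"),
    ("Backchannel / ICC JSD", "down"),
    ("Smooth Turn Taking / Candor TOR", "up"),
    ("Smooth Turn Taking / Candor Latency", "down"),
    ("User Interruption / Synthetic TOR", "up"),
    ("User Interruption / Synthetic GPT-4o", "up"),
    ("User Interruption / Synthetic Latency", "down"),
]


def best_models(results, metrics):
    """Map each metric label to the list of models with the best score."""
    best = {}
    for i, (label, direction) in enumerate(metrics):
        pick = min if direction == "down" else max
        target = pick(scores[i] for scores in results.values())
        best[label] = [model for model, scores in results.items() if scores[i] == target]
    return best


for label, models in best_models(RESULTS, METRICS).items():
    print(f"{label}: {', '.join(models)}")
```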

Audio Examples

Below are sample interactions demonstrating different model behaviors across our evaluation dimensions.

Pause Handling (Synthetic)

Samples 1 to 5, each with responses from dGSLM, Moshi, and Freeze-Omni.

Pause Handling (Candor)

Samples 1 to 5, each with responses from dGSLM, Moshi, and Freeze-Omni.

Backchanneling (ICC)

Samples 1 to 5, each with responses from dGSLM, Moshi, and Freeze-Omni.

Turn-Taking (Candor)

Samples 1 to 5, each with responses from dGSLM, Moshi, and Freeze-Omni.

User Interruption (Synthetic)

Samples 1 to 5, each with responses from dGSLM, Moshi, and Freeze-Omni.