Target Speaker Extraction in a Binaural Setting using HRTFs

Paper: "Binaural Target Speaker Extraction using HRTFs and a Complex-Valued Neural Network"

Abstract

In this work, we aim to imitate the human ability to selectively attend to a single speaker, even in the presence of multiple simultaneous talkers. To achieve this, we propose a novel approach for Binaural Target Speaker Extraction (Bi-TSE) that leverages the listener’s Head-Related Transfer Function (HRTF) to isolate the desired speaker. Notably, our method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets in different languages.

We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals. This deviates from conventional approaches that use spectrograms or treat the real and imaginary components of the STFT as separate real-valued inputs.
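To illustrate what "operating directly on the complex STFT" means in practice, here is a minimal NumPy sketch: the signal is transformed to its complex STFT and passed through a linear layer whose weights are themselves complex-valued, so the real and imaginary parts are never split apart. This is a hypothetical toy layer for illustration only, not the paper's architecture.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Frame the signal, window each frame, and take its FFT -> complex STFT."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)  # (frames, bins), complex

class ComplexLinear:
    """Hypothetical fully complex-valued linear layer: y = x W^T + b,
    with W and b complex, so phase information flows through untouched."""
    def __init__(self, d_in, d_out, rng):
        scale = 1.0 / np.sqrt(d_in)
        self.W = scale * (rng.standard_normal((d_out, d_in))
                          + 1j * rng.standard_normal((d_out, d_in)))
        self.b = np.zeros(d_out, dtype=complex)

    def __call__(self, x):
        return x @ self.W.T + self.b

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)              # 1 s of toy audio at 16 kHz
X = stft(x)                                 # complex STFT, no real/imag split
layer = ComplexLinear(X.shape[-1], X.shape[-1], rng)
Y = layer(X)
print(Y.dtype)  # complex128 -- the computation stays in the complex domain
```

A conventional real-valued network would instead stack `X.real` and `X.imag` as two separate channels; keeping the algebra complex preserves the coupling between magnitude and phase in every operation.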

We begin by evaluating the method in an anechoic, noise-free scenario, where it demonstrates excellent extraction performance while effectively preserving the binaural cues of the target signal. Next, we test a modified variant under mild reverberation. This version remains robust, maintaining speech clarity, preserving source directionality, and simultaneously reducing reverberation.

Block diagram of the proposed method
Figure 1: A block diagram of the proposed method, where \( \boldsymbol{x}_b \) represents the mixed signal in the STFT domain, \( \boldsymbol{h}_{hrtf}(\theta_d,\phi_d,k) \) denotes the (frequency-domain) HRTF for the desired speaker's direction of arrival (DOA), and \( \hat{\tilde{\boldsymbol{s}}}_d \) represents the STFT of the estimated desired signal.
Measurement system
Figure 2: A simulation setup example, illustrating two concurrent speakers: the desired speaker at \( \theta_d = 40^\circ \) (left) and the interferer at \( \theta_i = -30^\circ \) (right), both at elevation \( \phi \). Images sourced from: https://www.freepik.com
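A setup like Figure 2 can be sketched by filtering each source with a direction-dependent transfer function per ear and summing the results into a binaural mixture. The toy HRTF below encodes only an interaural time and level difference from a spherical-head approximation; real HRTFs (as used in the paper) are measured and far richer. All function names and constants here are illustrative assumptions.

```python
import numpy as np

def toy_hrtf(theta_deg, n_bins, fs=16000, head_radius=0.09):
    """Hypothetical frequency-domain HRTF: crude interaural time (ITD)
    and level (ILD) differences from a spherical-head approximation."""
    theta = np.deg2rad(theta_deg)
    freqs = np.fft.rfftfreq(2 * (n_bins - 1), 1 / fs)
    itd = head_radius / 343.0 * np.sin(theta)    # seconds of interaural delay
    ild = 10 ** (3.0 * np.sin(theta) / 20.0)     # crude level ratio, left/right
    left = ild * np.exp(2j * np.pi * freqs * itd / 2)
    right = (1 / ild) * np.exp(-2j * np.pi * freqs * itd / 2)
    return left, right

def spatialize(sig, theta_deg, n_fft=512):
    """Apply the toy HRTF in the frequency domain (one FFT block, for brevity)."""
    S = np.fft.rfft(sig, n_fft)
    hl, hr = toy_hrtf(theta_deg, S.size)
    return np.fft.irfft(S * hl, n_fft), np.fft.irfft(S * hr, n_fft)

rng = np.random.default_rng(1)
s_d = rng.standard_normal(512)       # desired speaker (toy signal)
s_i = rng.standard_normal(512)       # interfering speaker (toy signal)
dl, dr = spatialize(s_d, 40)         # desired at theta_d = 40 degrees
il, ir = spatialize(s_i, -30)        # interferer at theta_i = -30 degrees
x_left, x_right = dl + il, dr + ir   # one block of the binaural mixture x_b
```

With this convention a source at a positive azimuth arrives louder and earlier at the left ear; the extraction network receives `x_left`/`x_right` plus the desired DOA's HRTF and must recover `s_d` while keeping those binaural cues intact.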

Audio Examples and Spectrograms (headphones recommended for the optimal spatial audio experience)

Example from the WSJ0-CSR1 dataset in an anechoic setting; speaker 1 at -35° and speaker 2 at -90°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the Librispeech dataset in an anechoic setting; speaker 1 at 20° and speaker 2 at 80°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the Librispeech MLS French dataset in an anechoic setting; speaker 1 at 80° and speaker 2 at 15°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the WSJ0-CSR1 dataset in a reverberant setting; speaker 1 at -25° and speaker 2 at -75°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2: