Target Speaker Extraction in a Binaural Setting using HRTFs

Paper: "Binaural Target Speaker Extraction using HRTFs"

Authors: Yoav Ellinson and Sharon Gannot

Abstract

In this work, we aim to imitate the human ability to selectively attend to a single speaker, even in the presence of multiple simultaneous talkers. To achieve this, we propose a novel approach for binaural target speaker extraction that leverages the listener’s head-related transfer function (HRTF) to isolate the desired speaker. Notably, our method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets in different languages.

We employ a fully complex-valued neural network that operates directly on the complex-valued short-time Fourier transform (STFT) of the mixed audio signals, and compare it to a real-imaginary (RI)-based neural network, demonstrating the advantages of the former.
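To make the two input representations concrete, here is a minimal NumPy sketch of a complex-valued STFT input next to an RI input in which the real and imaginary parts are stacked as separate real channels. The sample rate, FFT size, hop size, and the random placeholder signal are assumptions for illustration, not the paper's settings, and the networks themselves are not shown.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT of a 1-D signal; returns shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

# Placeholder binaural signal: (left, right) channels, 1 s at an assumed 16 kHz.
rng = np.random.default_rng(0)
binaural = rng.standard_normal((2, 16000))

# Complex-valued input: a single complex array of shape (2, frames, bins).
x_complex = np.stack([stft(ch) for ch in binaural])

# RI input: real and imaginary parts stacked as four real-valued channels,
# ordered (left real, right real, left imag, right imag).
x_ri = np.concatenate([x_complex.real, x_complex.imag], axis=0)
```

Both arrays carry the same information; the difference lies in whether the network treats each time-frequency bin as one complex number or as two unrelated real numbers.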

We first evaluate the method in an anechoic, noise-free scenario, where it achieves excellent extraction performance while faithfully preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation.

A comparative analysis with existing binaural target speaker extraction (TSE) methods demonstrates that our approach attains performance on par with competing techniques in terms of noise reduction and perceptual quality, while offering a clear advantage in preserving binaural cues.
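Binaural cue preservation is commonly quantified through the interaural level difference (ILD) and interaural phase difference (IPD). The sketch below uses these common definitions on left/right complex STFTs; it is an assumption about how such cues can be compared, not the evaluation code or the exact metrics used in the paper, and the function names are hypothetical.

```python
import numpy as np

def binaural_cues(left, right, eps=1e-8):
    """Per-bin ILD (dB) and IPD (radians) from left/right complex STFTs."""
    ild = 20.0 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    ipd = np.angle(left * np.conj(right))
    return ild, ipd

def cue_errors(ref_left, ref_right, est_left, est_right):
    """Mean absolute ILD error (dB) and mean wrapped IPD error (radians)."""
    ild_ref, ipd_ref = binaural_cues(ref_left, ref_right)
    ild_est, ipd_est = binaural_cues(est_left, est_right)
    d_ild = np.mean(np.abs(ild_ref - ild_est))
    # Wrap the phase difference to (-pi, pi] before averaging.
    d_ipd = np.mean(np.abs(np.angle(np.exp(1j * (ipd_ref - ipd_est)))))
    return d_ild, d_ipd
```

An estimate that matches the reference up to a common gain or phase shift applied to both ears leaves both errors at zero, which is the sense in which these measures isolate the spatial cues from overall signal quality.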

Figure 1: A block diagram of the proposed method, where \( \boldsymbol{x}_b \) represents the mixed binaural signal in the STFT domain, \( \boldsymbol{h}_{hrtf}(\theta_d,\phi_d,k) \) denotes the (frequency-domain) HRTF of the desired speaker’s DOA, and \( \hat{\tilde{\boldsymbol{s}}}_d \) represents the STFT of the estimated binaural desired signal.
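One possible reading of the notation in Figure 1, under an assumed anechoic, noise-free model with \( J \) concurrent speakers, is given below. The frame index \( \ell \), the speaker count \( J \), and the monaural source STFTs \( s_j \) are notation introduced here for illustration and are not taken from the paper.

\[ \boldsymbol{x}_b(\ell,k) = \sum_{j=1}^{J} \boldsymbol{h}_{hrtf}(\theta_j,\phi_j,k)\, s_j(\ell,k), \qquad \tilde{\boldsymbol{s}}_d(\ell,k) = \boldsymbol{h}_{hrtf}(\theta_d,\phi_d,k)\, s_d(\ell,k) \]

Under this reading, \( \hat{\tilde{\boldsymbol{s}}}_d \) is an estimate of \( \tilde{\boldsymbol{s}}_d \), the desired speaker as received at the two ears, so the target retains the spatial cues imposed by the HRTF rather than being a monaural signal.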
Figure 2: The simulation convention, illustrating an example with two concurrent speakers positioned at \( \theta_1 = 40^\circ \) (left) and \( \theta_2 = -30^\circ \) (right), with a fixed elevation \( \phi \). Images sourced from: https://www.freepik.com
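The convention in Figure 2 (positive azimuth on the listener's left, negative on the right) can be illustrated with a toy two-speaker anechoic mixture. The sketch below is only a stand-in: instead of measured HRTFs it uses a hypothetical `toy_hrir` helper built from a pure interaural delay (Woodworth's spherical-head approximation) and a crude level difference, and the sample rate, head radius, and gains are assumed values.

```python
import numpy as np

FS = 16000              # sample rate in Hz (assumed)
HEAD_RADIUS = 0.0875    # head radius in metres (assumed)
SPEED_OF_SOUND = 343.0  # metres per second

def toy_hrir(azimuth_deg, length=64):
    """Toy (left, right) impulse responses for |azimuth| <= 90 degrees.

    Positive azimuth places the source on the listener's left, as in Figure 2.
    This is NOT a measured HRTF: only a delay and a gain on the far ear.
    """
    theta = abs(np.deg2rad(azimuth_deg))
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (theta + np.sin(theta))  # Woodworth
    lag = int(round(itd * FS))
    near = np.zeros(length)
    far = np.zeros(length)
    near[0] = 1.0
    far[lag] = 1.0 - 0.3 * np.sin(theta)  # crude level difference (assumed)
    return (near, far) if azimuth_deg >= 0 else (far, near)

def spatialize(mono, azimuth_deg):
    """Render a mono signal to a (2, samples) binaural signal."""
    h_left, h_right = toy_hrir(azimuth_deg)
    return np.stack([np.convolve(mono, h_left), np.convolve(mono, h_right)])

# Two placeholder sources at the azimuths from the Figure 2 example.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal((2, FS))
mixture = spatialize(s1, 40.0) + spatialize(s2, -30.0)
```

With measured HRTFs, the delay-and-gain pair would be replaced by the left and right impulse responses for the chosen azimuth and elevation; the mixing step stays the same.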

Audio Examples and spectrograms

Example from the WSJ0-CSR1 dataset in an anechoic setting; speaker 1 at -35° and speaker 2 at -90°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the LibriSpeech dataset in an anechoic setting; speaker 1 at 20° and speaker 2 at 80°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the LibriSpeech MLS French dataset in an anechoic setting; speaker 1 at 80° and speaker 2 at 15°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the WSJ0-CSR1 dataset in a reverberant setting; speaker 1 at -25° and speaker 2 at -75°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2: