BINAURAL TARGET SPEAKER EXTRACTION USING INDIVIDUALIZED HRTF

Paper: "BINAURAL TARGET SPEAKER EXTRACTION USING INDIVIDUALIZED HRTF"

Abstract

In this work, we address the problem of binaural target-speaker extraction in the presence of multiple simultaneous talkers. We propose a novel approach that leverages the individual listener’s Head-Related Transfer Function (HRTF) to isolate the target speaker. The proposed method is speaker-independent, as it does not rely on speaker embeddings. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods shows that the proposed approach achieves performance comparable to state-of-the-art techniques in terms of noise reduction and perceptual quality, while providing a clear advantage in preserving binaural cues.
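To make the idea of fully complex-valued processing concrete, the sketch below (PyTorch) shows one possible way a complex-valued network could map the binaural mixture STFT and the HRTF of the desired DOA to an extracted binaural STFT via complex masking. This is only an illustrative toy, not the architecture from the paper: the layer sizes, the per-frame processing, and the way the HRTF is concatenated with the mixture are all assumptions.

```python
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    """Fully complex-valued affine layer y = W x + b, built from two real layers:
    (A + iB)(xr + i*xi) = (A xr - B xi) + i(A xi + B xr)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.re = nn.Linear(in_features, out_features)
        self.im = nn.Linear(in_features, out_features)

    def forward(self, x):                       # x: complex tensor (..., in_features)
        xr, xi = x.real, x.imag
        return torch.complex(self.re(xr) - self.im(xi),
                             self.re(xi) + self.im(xr))

class CReLU(nn.Module):
    """Split ReLU applied separately to the real and imaginary parts."""
    def forward(self, x):
        return torch.complex(torch.relu(x.real), torch.relu(x.imag))

class ToyBinauralTSE(nn.Module):
    """Illustrative complex mask estimator (not the paper's model).
    Inputs : mixture STFT x_b (batch, 2 ears, freq, frames), complex;
             HRTF of the desired DOA (batch, 2 ears, freq), complex.
    Output : complex-masked binaural STFT of the desired speaker."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            ComplexLinear(4 * n_freq, hidden),   # 2 ears of mixture + 2 ears of HRTF
            CReLU(),
            ComplexLinear(hidden, 2 * n_freq),   # complex mask for both ears
        )

    def forward(self, x_b, h_hrtf):
        B, E, F, T = x_b.shape
        h = h_hrtf.unsqueeze(-1).expand(-1, -1, -1, T)          # repeat HRTF over frames
        feat = torch.cat([x_b, h], dim=1).permute(0, 3, 1, 2).reshape(B, T, 4 * F)
        mask = self.net(feat).reshape(B, T, E, F).permute(0, 2, 3, 1)
        return mask * x_b                                        # complex masking

# Shape check with random data (257 bins, 100 frames).
x_b = torch.randn(1, 2, 257, 100, dtype=torch.cfloat)   # mixed binaural STFT
h_d = torch.randn(1, 2, 257, dtype=torch.cfloat)         # HRTF at the desired DOA
s_hat = ToyBinauralTSE()(x_b, h_d)                       # -> (1, 2, 257, 100), complex
```

An RI-based variant of the same sketch would simply stack the real and imaginary parts as separate real channels and use ordinary real-valued layers, which is the comparison drawn in the paper.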

Block diagram of the proposed method
Figure 1: A block diagram of the proposed method, where \( \boldsymbol{x}_b \) represents the mixed binaural signal in the STFT domain, \( \boldsymbol{h}_{hrtf}(\theta_d,\phi_d,k) \) denotes the (frequency-domain) HRTF corresponding to the desired speaker’s DOA, and \( \hat{\tilde{\boldsymbol{s}}}_d \) represents the STFT of the estimated binaural desired signal.
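The block diagram operates entirely in the STFT domain; the analysis and synthesis steps around it can be sketched as follows (the frame length, hop size, and window below are assumptions, not the paper's settings).

```python
import torch

fs, n_fft, hop = 16000, 512, 128                 # assumed analysis parameters
win = torch.hann_window(n_fft)

x_t = torch.randn(2, fs * 3)                     # stand-in for a 3 s binaural mixture (left, right)

# Analysis: x_b in Figure 1, shape (2 ears, freq bins, frames), complex-valued.
x_b = torch.stft(x_t, n_fft, hop_length=hop, window=win, return_complex=True)

# ... the network of Figure 1 would map (x_b, h_hrtf) to the estimate below ...
s_hat_stft = x_b                                 # placeholder for the estimated binaural STFT

# Synthesis: back to a binaural waveform for listening and evaluation.
s_hat_t = torch.istft(s_hat_stft, n_fft, hop_length=hop, window=win, length=x_t.shape[-1])
```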
Measurement system
Figure 2: The simulation convention, illustrating an example with two concurrent speakers positioned at \( \theta_1 = 40^\circ \) (left) and \( \theta_2 = -30^\circ \) (right), with a fixed elevation \( \phi \). Images sourced from: https://www.freepik.com
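The simulation convention of Figure 2 amounts to convolving each dry utterance with the left- and right-ear HRIRs for its azimuth and summing per ear. The snippet below sketches this for the two azimuths in the figure; the HRIRs and speech signals are random stand-ins (in practice the individualized HRIRs would come from the listener's measured set, and the utterances from the evaluation datasets).

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(speech, hrir_left, hrir_right):
    """Convolve a mono signal with the left/right HRIRs of its DOA -> (2, T) binaural signal."""
    return np.stack([fftconvolve(speech, hrir_left),
                     fftconvolve(speech, hrir_right)])

fs = 16000
s1 = np.random.randn(3 * fs)          # stand-in for speaker 1 (would be real speech)
s2 = np.random.randn(3 * fs)          # stand-in for speaker 2
hrir_p40 = np.random.randn(2, 256)    # stand-in for the listener's HRIR pair at theta_1 = +40 deg
hrir_m30 = np.random.randn(2, 256)    # stand-in for the listener's HRIR pair at theta_2 = -30 deg

b1 = spatialize(s1, hrir_p40[0], hrir_p40[1])
b2 = spatialize(s2, hrir_m30[0], hrir_m30[1])

T = min(b1.shape[-1], b2.shape[-1])
mixture = b1[:, :T] + b2[:, :T]       # binaural mixture fed to the extraction network
```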

Audio Examples and spectrograms

Example from the WSJ0-CSR1 dataset in an anechoic setting; speaker 1 at -35° and speaker 2 at -90°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the LibriSpeech dataset in an anechoic setting; speaker 1 at 20° and speaker 2 at 80°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the LibriSpeech MLS French dataset in an anechoic setting; speaker 1 at 80° and speaker 2 at 15°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2:

Example from the WSJ0-CSR1 dataset in a reverberant setting; speaker 1 at -25° and speaker 2 at -75°

Mixture spectrogram
Speaker 1 spectrogram
Speaker 2 spectrogram

Ground Truth Speaker 1:

Ground Truth Speaker 2: