Paper: "Binaural Target Speaker Extraction using HRTFs and a Complex-Valued Neural Network"
In this work, we aim to imitate the human ability to selectively attend to a single speaker, even in the presence of multiple simultaneous talkers. To achieve this, we propose a novel approach for Binaural Target Speaker Extraction (Bi-TSE) that leverages the listener’s Head-Related Transfer Function (HRTF) to isolate the desired speaker. Notably, our method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets in different languages.
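To make the role of the HRTF concrete, here is a minimal, hypothetical sketch of how a binaural mixture arises: each speaker's mono signal is convolved with the left- and right-ear head-related impulse responses (HRIRs, the time-domain counterpart of the HRTF) for that speaker's direction, and the two rendered signals sum at the ears. The signals and HRIRs below are random toy data, not measured ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def binauralize(source, hrir_left, hrir_right):
    """Render a mono source as a binaural (2-channel) signal by
    convolving it with the left/right head-related impulse responses."""
    left = np.convolve(source, hrir_left)[: len(source)]
    right = np.convolve(source, hrir_right)[: len(source)]
    return np.stack([left, right])

# Toy stand-ins for real speech and measured HRIRs.
s1 = rng.standard_normal(1600)  # "target" speaker
s2 = rng.standard_normal(1600)  # interfering speaker
hrir_l1, hrir_r1 = rng.standard_normal(64), rng.standard_normal(64)
hrir_l2, hrir_r2 = rng.standard_normal(64), rng.standard_normal(64)

# The binaural mixture observed at the two ears: both rendered
# speakers superimpose, and the extractor must recover s1 from it.
mixture = binauralize(s1, hrir_l1, hrir_r1) + binauralize(s2, hrir_l2, hrir_r2)
print(mixture.shape)  # (2, 1600)
```

Because the target speaker is identified purely by the direction encoded in the HRTF, no speaker embedding is needed, which is what makes the approach speaker-independent.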
We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals. This deviates from conventional approaches that use spectrograms or treat the real and imaginary components of the STFT as separate real-valued inputs.
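The distinction can be illustrated with a small NumPy sketch (not the paper's architecture): the STFT stays a single complex array, and a layer with complex-valued weights acts on it directly, so one complex multiply jointly transforms magnitude and phase instead of treating real and imaginary parts as two unrelated real channels.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Minimal complex STFT: Hann-windowed frames -> one-sided FFT."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=-1)  # (frames, bins), complex

rng = np.random.default_rng(0)
X = stft(rng.standard_normal(4096))  # complex input, no real/imag split

# A toy complex-valued "layer": complex weights applied per frame.
# A complex multiply (a+bi)(c+di) couples the real and imaginary parts,
# which is what a pair of independent real-valued layers would not do.
n_bins = X.shape[-1]
W = (rng.standard_normal((n_bins, n_bins))
     + 1j * rng.standard_normal((n_bins, n_bins))) / np.sqrt(n_bins)
Y = X @ W
print(X.dtype, Y.dtype)  # complex128 complex128
```

A full model would stack such complex layers with complex-valued nonlinearities and invert the STFT at the output; this fragment only shows the input representation and the core operation.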
We begin by evaluating the method in an anechoic, noise-free scenario, where it demonstrates excellent extraction performance while effectively preserving the binaural cues of the target signal. Next, we evaluate a modified variant under mild reverberation. This version remains robust, maintaining speech clarity, preserving source directionality, and simultaneously reducing reverberation.
Ground Truth Speaker 1:
Ground Truth Speaker 2: