Method

We implement three methods to complete our task: Filtering & Wavelet Transform, ICA, and MFCC. Filtering & Wavelet Transform and MFCC aim to cancel noise and extract the voice of a particular speaker, while ICA handles the case where multiple speakers are present in the same audio.

Filtering & Wavelet Transform

Filtering

For audio containing only one person's voice, whose energy is concentrated in a particular frequency band, we can use filtering to eliminate noise. We created a frequency filter that removes any noise falling outside the frequency range of the voice.

 

The filter we chose is a Butterworth bandpass filter, which is easy to design to specification, easy to implement, and has an approximately linear phase, so the filtered audio comes out uniformly. We first select the frequency band with the largest magnitude, set that frequency as the center of the passband, and design the filter accordingly (a code sketch follows the algorithm steps below).

 

Algorithm:

  1. Frequency analysis of the original audio

  2. Determine the frequency band with the largest magnitude

  3. Choose a passband that is centered on the frequency band with the largest magnitude

  4. Design a bandpass filter

  5. Apply the filter to eliminate noise
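
A minimal sketch of these steps in Python, assuming SciPy and NumPy (our actual implementation is in MATLAB; the file names and passband half-width here are illustrative assumptions):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

# 1. Frequency analysis of the original audio
fs, x = wavfile.read("coffee_shop.wav")        # hypothetical file name
x = x.astype(float)
if x.ndim > 1:                                 # collapse stereo to mono if needed
    x = x.mean(axis=1)
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

# 2. Determine the frequency with the largest magnitude (ignore the DC bin)
peak = freqs[np.argmax(spectrum[1:]) + 1]

# 3. Choose a passband centered on that frequency (half-width is a tunable assumption)
half_width = 150.0                             # Hz, chosen for illustration
low, high = max(peak - half_width, 20.0), peak + half_width

# 4. Design the Butterworth bandpass filter
sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")

# 5. Apply the filter (zero-phase) to suppress out-of-band noise
y = sosfiltfilt(sos, x)
wavfile.write("coffee_shop_filtered.wav", fs, y.astype(np.int16))
```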

Using this content-aware filtering, we can eliminate noise whose frequency differs from that of the human voice while leaving the voice itself largely unaffected, regardless of differences in pitch between speakers, languages, etc. We tested the filter on the coffee shop recording.

 

 

 

Figure: Coffee Shop Recording before filtering, time-domain © Tianwei Liu

Figure: Coffee Shop Recording after filtering, time-domain © Tianwei Liu

Figure: Coffee Shop Recording before filtering, frequency-domain © Tianwei Liu

Figure: Coffee Shop Recording after filtering, frequency-domain © Tianwei Liu

As shown in the figure, when applied to the coffee shop recording, this filtering mechanism produced a moderate level of noise cancellation.

 

Wavelet Transform

However, a frequency filter alone still has a limitation: since it does not change anything inside the passband, it cannot eliminate noise whose frequencies are similar to those of the human voice.

 

To solve this problem, we apply a wavelet transform in our system. The wavelet transform is a change of basis onto a family of wavelet functions that form an orthonormal basis, and it provides the frequency content of the signal together with the times at which those frequencies occur. The wavelet transform used to analyze discrete signals is as follows:
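
In its standard discrete form, the wavelet coefficient at scale j and shift k of a length-M signal x[n] is (the exact normalization used in our MATLAB implementation may differ):

$$ W_\psi(j,k) = \frac{1}{\sqrt{M}} \sum_{n=0}^{M-1} x[n]\,\psi_{j,k}[n], \qquad \psi_{j,k}[n] = 2^{j/2}\,\psi\!\left(2^{j}n - k\right) $$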

 

Algorithm:

  1. Normalize the audio signal

  2. Compute the wavelet transform to obtain the 2-dimensional (time-scale) coefficients

  3. Set all components below a threshold to 0

  4. Synthesize (inverse-transform) the coefficients back into an audio signal

After the transformation, we have a 2-D set of coefficients giving the magnitude of each frequency band at different time intervals, much like a spectrogram. Using this information together with normalizing the signal, we can eliminate components that occupy frequency bands similar to the speech but have a much lower magnitude and persist even when the speaker is not speaking.

Figure: Wavelet Transform coefficients of the Coffee Shop Recording © Tianwei Liu
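
A minimal sketch of this thresholding pipeline, assuming the PyWavelets package (our own implementation uses MATLAB's wavelet toolbox; the wavelet family, decomposition level, and threshold are illustrative choices):

```python
import numpy as np
import pywt
from scipy.io import wavfile

fs, x = wavfile.read("coffee_shop_filtered.wav")   # hypothetical: output of the bandpass stage
x = x.astype(float)

# 1. Normalize the audio signal
x /= np.max(np.abs(x))

# 2. Wavelet decomposition: coefficients indexed by scale and time
coeffs = pywt.wavedec(x, wavelet="db8", level=6)

# 3. Zero out components below a threshold (soft threshold on the detail coefficients)
threshold = 0.05 * np.max(np.abs(coeffs[-1]))       # illustrative threshold choice
coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]

# 4. Backward synthesis to an audio signal
y = pywt.waverec(coeffs, wavelet="db8")
wavfile.write("coffee_shop_denoised.wav", fs, (y * 32767).astype(np.int16))
```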

 

We implemented and tested the above algorithm on different speeches, and it gives a better result than using the frequency filter alone. The figure below shows the coffee shop recording after processing by both the frequency filter and wavelet decomposition.

 

 

Figure: Coffee Shop Recording after being processed by both methods, time-domain © Tianwei Liu

As shown in the figure, the noise, previously visible as small waves of lower magnitude that persist for a long time, is greatly reduced, while the human speech portion of the audio is mostly preserved.


 

ICA / Deep Learning Speech Diarization

When multiple speakers stand within a noisy background, we face the well-known cocktail party problem, and it is critical to separate each speaker for later analysis. In this section, we analyze two different methods for speaker diarization: classical Independent Component Analysis (ICA) and state-of-the-art spectral clustering using d-vectors.

 

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) is a classical algorithm for speech diarization. It assumes that the speeches are linearly mixed by a weighting matrix A. It requires an M×N input matrix to extract N features (speakers), where M is the number of samples collected. ICA then tries to find an N×N unmixing matrix A⁻¹ that converts the mixed signal into separated speeches.

 

Algorithm:

  1. Load a mixed audio file; the number of columns of the matrix should correspond to the number of speakers

  2. Prewhiten the signal using prewhiten so that the observations are decorrelated with unit variance

  3. Create an ICA model using rica

  4. Transform the mixed data using the ICA model

 

Assumptions:

  • The speakers are statistically independent of each other

  • The speeches are linearly mixed together

  • The mixed signal contains only additive Gaussian noise

Note: in MATLAB, ICA is available as the rica command.
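
Our implementation uses MATLAB's prewhiten and rica; the sketch below shows the equivalent workflow in Python using scikit-learn's FastICA, a different ICA algorithm used here only to illustrate the steps:

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_speakers(X, n_speakers=2):
    """Separate linearly mixed speech signals.

    X : (M, n_speakers) array of mixed observations,
        one column per mixture and M samples per column.
    Returns an (M, n_speakers) array of estimated sources.
    """
    # whiten="unit-variance" performs the prewhitening step internally
    # (scikit-learn >= 1.1; older versions use whiten=True)
    ica = FastICA(n_components=n_speakers, whiten="unit-variance", random_state=0)
    S_est = ica.fit_transform(X)          # estimated independent sources
    # normalize each recovered source so it can be written out as audio
    return S_est / np.max(np.abs(S_est), axis=0)
```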

 

Deep Learning Speech Diarization

The state-of-the-art deep learning algorithm uses neural-network-based audio embeddings, also known as d-vectors. The algorithm uses a Deep Neural Network (DNN) along with spectral clustering to perform diarization, and it outperformed previous algorithms based on i-vectors. For our project, we used the authors' pretrained model to perform the diarization; a sketch of the workflow is given after the algorithm steps below.

 

Algorithm:

  1. Embed the audio signals to d-vectors

  2. Create a spectral clustering model

  3. Create label/segment pairs for different labels (speakers) 

  4. Based on the pairs, separate the original audio into two separate speeches

 

Acknowledgment: This algorithm follows the steps that Rahul Saxena posted on Medium.com.
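
A rough sketch of these steps in Python. It assumes the resemblyzer package for d-vector embeddings and the spectralcluster package for the clustering step, as in the tutorial referenced above; exact APIs differ between package versions, so this illustrates the workflow rather than reproducing our exact script:

```python
from resemblyzer import VoiceEncoder, preprocess_wav
from spectralcluster import SpectralClusterer

# 1. Embed the audio into d-vectors (one embedding per short sliding window)
wav = preprocess_wav("two_speaker_recording.wav")     # hypothetical file name
encoder = VoiceEncoder()
_, cont_embeds, wav_splits = encoder.embed_utterance(
    wav, return_partials=True, rate=16)

# 2. Spectral clustering of the embeddings, assuming two speakers
clusterer = SpectralClusterer(min_clusters=2, max_clusters=2)
labels = clusterer.predict(cont_embeds)

# 3. Build (speaker label, start time, end time) segments from the cluster labels
sr = 16000                                            # resemblyzer's working sample rate
segments = []
for label, split in zip(labels, wav_splits):
    start, stop = split.start / sr, split.stop / sr   # samples -> seconds
    if segments and segments[-1][0] == label:
        segments[-1] = (label, segments[-1][1], stop) # extend the current segment
    else:
        segments.append((label, start, stop))

# 4. The segments can then be used to cut the original audio into per-speaker tracks
print(segments)
```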

 

Implementation & Analysis:

We have tested ICA on both synthetically mixed data and real recordings. 

 

Synthetically mixed audio:

In this section, we use two audio clips, one of a professor speaking and one of a student speaking, to demonstrate the algorithm on synthetically mixed audio.

 

Audio: Speech of a Professor © https://www.kaggle.com/datasets/wiradkp/mini-speech-diarization

 

Audio: Speech of a Student © https://www.kaggle.com/datasets/wiradkp/mini-speech-diarization

First, we use a random weighting matrix and a random offset to mix these two audio files into a single synthetically mixed recording, as sketched below.
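
A sketch of this mixing step, assuming NumPy and SciPy (the file names and random ranges are illustrative; the actual weights and offset in our test were drawn differently):

```python
import numpy as np
from scipy.io import wavfile

# Load the two source clips (hypothetical file names) and truncate to a common length
_, s1 = wavfile.read("professor.wav")
_, s2 = wavfile.read("student.wav")
M = min(len(s1), len(s2))
S = np.column_stack([s1[:M], s2[:M]]).astype(float)   # (M, 2): one column per speech

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 1.5, size=(2, 2))    # random weighting (mixing) matrix
offset = rng.uniform(-0.1, 0.1, size=2)   # random offset added to each mixture
X = S @ A.T + offset                      # (M, 2): synthetically mixed observations
```

The mixed matrix X can then be prewhitened and passed to rica (or to the FastICA sketch above) to recover the two speakers.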

 

Audio: Mixed student and professor speech © Qianxu Li

 

Then, we prewhitened the mixed audio and performed rica to obtain both the professor and the student.

 

Audio: Professor feature extracted from mixed speech © Qianxu Li

Audio: Student feature extracted from mixed speech © Qianxu Li

 

Listening to the audio, you can clearly recognize the professor and the student. If the original and unmixed clips were not labeled for reference, they would be nearly indistinguishable from one another.

 

The plots below show the mixed signal, the original signals, and the unmixed signals for comparison. The unmixed signals are very similar to the original signals except for their amplitude, which is due to the random offset introduced during the mixing stage and does not affect the intelligibility of the audio. Other than that, rica has been great at separating the two speakers.

Figure: Professor and Student demo © Qianxu Li

Real Audio Input:

The previous example shows that ICA works well for synthetically mixed input. Now we want to test it on a real recording that is not linearly mixed. To simplify the test, we assumed that there were only two speakers. To extract two features (two speakers) with ICA, we need two mixed signals forming an M×2 matrix, where M is the number of samples. However, the recording is a mono signal with only one column of data. To tackle this problem, we introduced a random offset to the original audio and used the result as the second column of the matrix. The matrix then has dimension M×2, and we can perform ICA (a sketch of this step is shown below).
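
A sketch of this step, assuming NumPy and SciPy (the file name and offset magnitude are illustrative):

```python
import numpy as np
from scipy.io import wavfile

fs, mono = wavfile.read("two_speaker_recording.wav")   # hypothetical file name
mono = mono.astype(float)

rng = np.random.default_rng(0)
offset = rng.uniform(-0.1, 0.1)       # random offset, illustrative magnitude

# Stack the original signal and an offset copy to form the M x 2 observation matrix
X = np.column_stack([mono, mono + offset])
```

The resulting matrix X is then prewhitened and passed to rica, exactly as in the synthetic case.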

 

We used the speech recording by Qianxu and Tianwei, which can be found in our data section.

 

Audio: Two Speaker Speech recording © Qianxu Li & Tianwei Liu

We performed a similar procedure on this recording: we prewhitened the signal and then used the rica and transform commands, ideally converting the signal into two separate speeches.

 

Audio: Speaker 1 feature extracted from speech recording © Qianxu Li

Audio: Speaker 2 feature extracted from speech recording © Qianxu Li

 

The speakers are clearly not separated at all. One signal is basically the original signal, and the other is not audible.

 

These results are expected because ICA assumes that the speeches are linearly mixed together. For our real-life recording, the environment introduced non-linear components into the audio file. From the plots below, it is noticeable that the two unmixed recordings are the same as the original recording except for their amplitude; one has nearly zero amplitude, which is why it cannot be heard.

Figure: ICA result on the speech recording © Qianxu Li

Deep Learning Speech Diarization

Since classical ICA cannot solve the cocktail party problem with real audio signals, we looked into deep learning models in Python. This method is learned from Wang, Quan, et al., "Speaker diarization with LSTM," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018.

 

We still used the speech recording by Qianxu and Tianwei.

 

After running through our system, each speaker can be heard clearly with a small error.

 

 

Audio: Speaker 1 feature extracted from speech recording © Qianxu Li

Audio: Speaker 2 feature extracted from speech recording © Qianxu Li

 

The algorithm has successfully separated the two speakers, Qianxu and Tianwei. However, there is a small error at the end of the recording. This is reasonable, as the original paper claims that the algorithm achieved an error of 12%, relatively low compared to other methods.

Figure: Speech recording after being processed by deep learning method © Qianxu Li

These separated audio tracks can then be used for noise cancellation, and users can choose to keep the speakers separated or to reconstruct them into a single audio file.


 

MFCC

Mel Frequency Cepstral Coefficients (MFCCs) are a leading approach used in automatic speech and speaker recognition systems. It is a DSP method in which coefficients are computed from the audio and quantized into a number of vectors using the mel scale, a perceptual frequency scale that replicates the nonlinear critical bandwidth of the human ear.

 

Algorithm:

Matlab:

  1. Load the raw audio file

  2. Prepare and perform refinement of the raw and filtered audio signal

  3. Plot the Spectrogram of each file for preliminary analysis

 

Acknowledgment: This algorithm follows the steps from the paper "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences" by Davis, S. and Mermelstein, P., and from "Spoken Language Processing: A guide to theory, algorithm, and system development" by Huang, X. and Acero, A.

Python:

  1. Install the librosa library and packages in Colab notebook

  2. Perform MFCC on the sample audio

  3. Perform inv-MFCC on the sample audio

 

Acknowledgment: The library and packages used to perform MFCC and inverse MFCC were developed by McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto; the related paper is titled "librosa: Audio and music signal analysis in python".
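
A minimal sketch of these Python steps, assuming librosa (0.8 or later), soundfile, and matplotlib; the file name and parameter values are illustrative:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import soundfile as sf

# 1. Load the raw coffee shop recording (hypothetical file name)
y, sr = librosa.load("coffee_shop.wav", sr=None)

# 2. Compute MFCCs and a mel spectrogram for comparison with the MATLAB result
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
mel = librosa.feature.melspectrogram(y=y, sr=sr)

fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.power_to_db(mel, ref=np.max),
                               sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set_title("Mel spectrogram of the coffee shop recording")
plt.show()

# 3. Inverse MFCC: reconstruct an audio signal from the coefficients
y_rec = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
sf.write("coffee_shop_mfcc_reconstructed.wav", y_rec, sr)
```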

 

Implementation & Analysis:

To tackle the problem, we first used the built-in MFCC functionality in the MathWorks library to produce the mel spectrogram from the mel coefficients.

Figure: Melspectrogram of Coffee Shop Recording © Raj Patel

The resulting spectrogram shows a fading yellow gradient around 55 dB, which we believe to be Raj's voice in the Coffee Shop mel-spectrogram, since average human speech falls between 50 and 65 dB. The distinguishing difference between the Coffee Shop mel-spectrogram and the filtered Coffee Shop mel-spectrogram (after the high-pass filter was applied) is that the loudness is decreased and the high frequencies (yellow readings) are removed. Furthermore, visual comparison of the two figures shows a faint blue strip around 15 kHz after filtering, which we believe to be the pitch of the background noise that is similar to Raj's voice. To tackle this problem, we decided to reconstruct the MFCC into audio to analyze what type of noise is being generated (high or low) and whether it can be eliminated by performing ICA.

 

We then attempted to reconstruct the MFCC into an audio representation. We originally tried to do the conversion in MATLAB but were unable to achieve a clean audio reconstruction. In our first approach, we tried to manually convert the mel coefficients back into a regular frequency basis, but because the coefficients are on a logarithmic scale and some data is discarded when taking the Discrete Cosine Transform (DCT), there is no longer a defined component for every time point. To overcome this difficulty, we tried an inverse MFCC implementation by Min Gang from the MATLAB Central File Exchange, but due to computational and compiling errors we had to halt this approach.

 

Our last attempt used a Python package for music and audio analysis called librosa, developed by McFee, B. et al. With the help of its MFCC functions, we were able to plot the mel-spectrogram and produce a reconstructed audio sample of the raw speech (the coffee shop recording). When comparing the Python Coffee Shop mel-spectrogram to the one computed in MATLAB, the graphs are identical, confirming that the MFCC computation was correct. The analysis of the reconstructed audio sample was very intriguing: the results were not what we expected, as the audio is very discontinuous and lossy (alien-sounding). Our preliminary speculation is that some vital values are filtered out during the calculation of the MFCC coefficients, which is why there is a loud, discontinuous noise.

Figure: Melspectrogram obtained from python © Raj Patel

Audio: Reconstructed Coffee Shop Recording processed by MFCC © Raj Patel
