Method

We implement three methods to complete our task: Filtering & Wavelet Transform, ICA, and MFCC. Filtering & Wavelet Transform and MFCC aim to cancel noise and extract the voice of a particular speaker, while ICA handles the case where multiple speakers are present in the same audio.

Filtering & Wavelet Transform

Filtering

For audio containing only one person's voice, whose energy is concentrated in a particular frequency band, we can use filtering to eliminate noise. We created a frequency filter that removes any noise falling outside the frequency range of the voice.

 

The filter we chose is a Butterworth bandpass filter, which is easy to design to specification, easy to implement, and has an approximately linear phase, so the filtered audio comes out uniformly. We first select the frequency band with the largest magnitude, set that frequency as the center of the passband, and design the filter accordingly (a code sketch follows the algorithm steps below).

 

Algorithm:

  1. Frequency analysis of the original audio

  2. Determine the frequency band with the largest magnitude

  3. Choose a passband that is centered on the frequency band with the largest magnitude

  4. Design a bandpass filter

  5. Apply the filter to eliminate noise
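
A minimal sketch of these steps in Python, assuming SciPy and NumPy (our actual implementation is in MATLAB; the file names and passband half-width here are illustrative assumptions):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

# 1. Frequency analysis of the original audio
fs, x = wavfile.read("coffee_shop.wav")        # hypothetical file name
x = x.astype(float)
if x.ndim > 1:                                 # collapse stereo to mono if needed
    x = x.mean(axis=1)
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

# 2. Determine the frequency with the largest magnitude (ignore the DC bin)
peak = freqs[np.argmax(spectrum[1:]) + 1]

# 3. Choose a passband centered on that frequency (half-width is a tunable assumption)
half_width = 150.0                             # Hz, chosen for illustration
low, high = max(peak - half_width, 20.0), peak + half_width

# 4. Design the Butterworth bandpass filter
sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")

# 5. Apply the filter (zero-phase) to suppress out-of-band noise
y = sosfiltfilt(sos, x)
wavfile.write("coffee_shop_filtered.wav", fs, y.astype(np.int16))
```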

Using this content-aware filtering, we can eliminate noise whose frequency differs from that of the human voice while leaving the voice itself largely unaffected, regardless of differences in pitch between speakers, languages, etc. We tested the filter on the coffee shop recording.

 

 

 

Figure: Coffee Shop Recording before filtering, time-domain © Tianwei Liu

Figure: Coffee Shop Recording after filtering, time-domain © Tianwei Liu

Figure: Coffee Shop Recording before filtering, frequency-domain © Tianwei Liu

Figure: Coffee Shop Recording after filtering, frequency-domain © Tianwei Liu

As shown in the figure, when applied to the coffee shop recording, this filtering mechanism produced a moderate level of noise cancellation.

 

Wavelet Transform

However, a frequency filter alone still has a limitation: since it does not change anything inside the passband, it cannot eliminate noise whose frequencies are similar to those of the human voice.

 

To solve this problem, we apply a wavelet transform in our system. The wavelet transform is a change of basis onto a family of wavelet functions that form an orthonormal basis, and it provides the frequency content of the signal together with the times at which those frequencies occur. The wavelet transform used to analyze discrete signals is as follows:
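
In its standard discrete form, the wavelet coefficient at scale j and shift k of a length-M signal x[n] is (the exact normalization used in our MATLAB implementation may differ):

$$ W_\psi(j,k) = \frac{1}{\sqrt{M}} \sum_{n=0}^{M-1} x[n]\,\psi_{j,k}[n], \qquad \psi_{j,k}[n] = 2^{j/2}\,\psi\!\left(2^{j}n - k\right) $$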

 

Algorithm:

  1. Normalize the audio signal

  2. Compute the wavelet transform to obtain the 2-dimensional (time-scale) coefficients

  3. Set all components below a threshold to 0

  4. Synthesize (inverse-transform) the coefficients back into an audio signal

After the transformation, we have a 2-D set of coefficients giving the magnitude of each frequency band at different time intervals, much like a spectrogram. Using this information together with normalizing the signal, we can eliminate components that occupy frequency bands similar to the speech but have a much lower magnitude and persist even when the speaker is not speaking.

Figure: Wavelet Transform coefficients of the Coffee Shop Recording © Tianwei Liu
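
A minimal sketch of this thresholding pipeline, assuming the PyWavelets package (our own implementation uses MATLAB's wavelet toolbox; the wavelet family, decomposition level, and threshold are illustrative choices):

```python
import numpy as np
import pywt
from scipy.io import wavfile

fs, x = wavfile.read("coffee_shop_filtered.wav")   # hypothetical: output of the bandpass stage
x = x.astype(float)

# 1. Normalize the audio signal
x /= np.max(np.abs(x))

# 2. Wavelet decomposition: coefficients indexed by scale and time
coeffs = pywt.wavedec(x, wavelet="db8", level=6)

# 3. Zero out components below a threshold (soft threshold on the detail coefficients)
threshold = 0.05 * np.max(np.abs(coeffs[-1]))       # illustrative threshold choice
coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]

# 4. Backward synthesis to an audio signal
y = pywt.waverec(coeffs, wavelet="db8")
wavfile.write("coffee_shop_denoised.wav", fs, (y * 32767).astype(np.int16))
```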

 

We implemented and tested the above algorithm on different speeches, and it gives a better result than using the frequency filter alone. The figure below shows the coffee shop recording after processing by both the frequency filter and wavelet decomposition.

 

 

Figure: Coffee Shop Recording after being processed by both methods, time-domain © Tianwei Liu

As shown in the figure, the noise, previously visible as small waves of lower magnitude that persist for a long time, is greatly reduced, while the human speech portion of the audio is mostly preserved.


 

ICA / Deep Learning Speech Diarization

When multiple speakers stand within a noisy background, we face the well-known cocktail party problem, and it is critical to separate each speaker for later analysis. In this section, we analyze two different methods for speaker diarization: classical Independent Component Analysis (ICA) and state-of-the-art spectral clustering using d-vectors.

 

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) is a classical algorithm for speech diarization. It assumes that the speeches are linearly mixed by a weighting matrix A. It requires an M×N input matrix to extract N features (speakers), where M is the number of samples collected. ICA then tries to find an N×N unmixing matrix A⁻¹ that converts the mixed signal into separated speeches.

 

Algorithm:

  1. Load a mixed audio file; the number of columns of the matrix should correspond to the number of speakers

  2. Prewhiten the signal using prewhiten so that the observations are decorrelated with unit variance

  3. Create an ICA model using rica

  4. Transform the mixed data using the ICA model

 

Assumptions:

  • The speakers are statistically independent of each other

  • The speeches are linearly mixed together

  • The mixed signal contains only additive Gaussian noise

Note: in MATLAB, ICA is available as the rica command.
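
Our implementation uses MATLAB's prewhiten and rica; the sketch below shows the equivalent workflow in Python using scikit-learn's FastICA, a different ICA algorithm used here only to illustrate the steps:

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_speakers(X, n_speakers=2):
    """Separate linearly mixed speech signals.

    X : (M, n_speakers) array of mixed observations,
        one column per mixture and M samples per column.
    Returns an (M, n_speakers) array of estimated sources.
    """
    # whiten="unit-variance" performs the prewhitening step internally
    # (scikit-learn >= 1.1; older versions use whiten=True)
    ica = FastICA(n_components=n_speakers, whiten="unit-variance", random_state=0)
    S_est = ica.fit_transform(X)          # estimated independent sources
    # normalize each recovered source so it can be written out as audio
    return S_est / np.max(np.abs(S_est), axis=0)
```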

 

Deep Learning Speech Diarization

The state-of-the-art deep learning algorithm uses neural-network-based audio embeddings, also known as d-vectors. The algorithm uses a Deep Neural Network (DNN) along with spectral clustering to perform diarization, and it outperformed previous algorithms based on i-vectors. For our project, we used the authors' pretrained model to perform the diarization; a sketch of the workflow is given after the algorithm steps below.

 

Algorithm:

  1. Embed the audio signals to d-vectors

  2. Create a spectral clustering model

  3. Create label/segment pairs for different labels (speakers) 

  4. Based on the pairs, separate the original audio into two separate speeches

 

Acknowledgment: This algorithm follows the steps that Rahul Saxena posted on Medium.com.
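
A rough sketch of these steps in Python. It assumes the resemblyzer package for d-vector embeddings and the spectralcluster package for the clustering step, as in the tutorial referenced above; exact APIs differ between package versions, so this illustrates the workflow rather than reproducing our exact script:

```python
from resemblyzer import VoiceEncoder, preprocess_wav
from spectralcluster import SpectralClusterer

# 1. Embed the audio into d-vectors (one embedding per short sliding window)
wav = preprocess_wav("two_speaker_recording.wav")     # hypothetical file name
encoder = VoiceEncoder()
_, cont_embeds, wav_splits = encoder.embed_utterance(
    wav, return_partials=True, rate=16)

# 2. Spectral clustering of the embeddings, assuming two speakers
clusterer = SpectralClusterer(min_clusters=2, max_clusters=2)
labels = clusterer.predict(cont_embeds)

# 3. Build (speaker label, start time, end time) segments from the cluster labels
sr = 16000                                            # resemblyzer's working sample rate
segments = []
for label, split in zip(labels, wav_splits):
    start, stop = split.start / sr, split.stop / sr   # samples -> seconds
    if segments and segments[-1][0] == label:
        segments[-1] = (label, segments[-1][1], stop) # extend the current segment
    else:
        segments.append((label, start, stop))

# 4. The segments can then be used to cut the original audio into per-speaker tracks
print(segments)
```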

 

Implementation & Analysis:

We have tested ICA on both synthetically mixed data and real recordings. 

 

Synthetically mixed audio:

In this section, we use two audio clips, one of a professor speaking and one of a student speaking, to demonstrate the algorithm on synthetically mixed audio.

 

Audio: Speech of a Professor © https://www.kaggle.com/datasets/wiradkp/mini-speech-diarization

 

Audio: Speech of a Student © https://www.kaggle.com/datasets/wiradkp/mini-speech-diarization

First, we use a random weighting matrix and a random offset to mix these two audio files into a single synthetically mixed recording, as sketched below.
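
A sketch of this mixing step, assuming NumPy and SciPy (the file names and random ranges are illustrative; the actual weights and offset in our test were drawn differently):

```python
import numpy as np
from scipy.io import wavfile

# Load the two source clips (hypothetical file names) and truncate to a common length
_, s1 = wavfile.read("professor.wav")
_, s2 = wavfile.read("student.wav")
M = min(len(s1), len(s2))
S = np.column_stack([s1[:M], s2[:M]]).astype(float)   # (M, 2): one column per speech

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 1.5, size=(2, 2))    # random weighting (mixing) matrix
offset = rng.uniform(-0.1, 0.1, size=2)   # random offset added to each mixture
X = S @ A.T + offset                      # (M, 2): synthetically mixed observations
```

The mixed matrix X can then be prewhitened and passed to rica (or to the FastICA sketch above) to recover the two speakers.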

 

Audio: Mixed student and professor speech © Qianxu Li

 

Then, we prewhitened the mixed audio and performed rica to obtain both the professor and the student.

 

Audio: Professor feature extracted from mixed speech © Qianxu Li

Audio: Student feature extracted from mixed speech © Qianxu Li

 

Listening to the audio, you can clearly recognize the professor and the student. If the original and unmixed clips were not labeled for reference, they would be nearly indistinguishable from one another.

 

The plots below show the mixed signal, the original signals, and the unmixed signals for comparison. The unmixed signals are very similar to the original signals except for their amplitude, which is due to the random offset introduced during the mixing stage and does not affect the intelligibility of the audio. Other than that, rica has been great at separating the two speakers.

Figure: Professor and Student demo © Qianxu Li

Real Audio Input:

The previous example shows that ICA works well for synthetically mixed input. Now we want to test it on a real recording that is not linearly mixed. To simplify the test, we assumed that there were only two speakers. To extract two features (two speakers) with ICA, we need two mixed signals forming an M×2 matrix, where M is the number of samples. However, the recording is a mono signal with only one column of data. To tackle this problem, we introduced a random offset to the original audio and used the result as the second column of the matrix. The matrix then has dimension M×2, and we can perform ICA (a sketch of this step is shown below).
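
A sketch of this step, assuming NumPy and SciPy (the file name and offset magnitude are illustrative):

```python
import numpy as np
from scipy.io import wavfile

fs, mono = wavfile.read("two_speaker_recording.wav")   # hypothetical file name
mono = mono.astype(float)

rng = np.random.default_rng(0)
offset = rng.uniform(-0.1, 0.1)       # random offset, illustrative magnitude

# Stack the original signal and an offset copy to form the M x 2 observation matrix
X = np.column_stack([mono, mono + offset])
```

The resulting matrix X is then prewhitened and passed to rica, exactly as in the synthetic case.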

 

We used the speech recording by Qianxu and Tianwei, which can be found in our data section.

 

Audio: Two Speaker Speech recording © Qianxu Li & Tianwei Liu

We performed a similar procedure on this recording: we prewhitened the signal and then used the rica and transform commands, ideally converting the signal into two separate speeches.

 

Audio: Speaker 1 feature extracted from speech recording © Qianxu Li

Audio: Speaker 2 feature extracted from speech recording © Qianxu Li

 

The speakers are clearly not separated at all. One signal is basically the original signal, and the other is not audible.

 

These results are expected because ICA assumes that the speeches are linearly mixed together. For our real-life recording, the environment introduced non-linear components into the audio file. From the plots below, it is noticeable that the two unmixed recordings are the same as the original recording except for their amplitude; one has nearly zero amplitude, which is why it cannot be heard.

Figure: ICA result on the speech recording © Qianxu Li

Deep Learning Speech Diarization

Since classical ICA cannot solve the cocktail party problem with real audio signals, we looked into deep learning models in Python. This method is learned from Wang, Quan, et al., "Speaker diarization with LSTM," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018.

 

We still used the speech recording by Qianxu and Tianwei.

 

After running through our system, each speaker can be heard clearly with a small error.

 

 

Audio: Speaker 1 feature extracted from speech recording © Qianxu Li

Audio: Speaker 2 feature extracted from speech recording © Qianxu Li

 

The algorithm has successfully separated the two speakers, Qianxu and Tianwei. However, there is a small error at the end of the recording. This is reasonable, as the original paper claims that the algorithm achieved an error of 12%, relatively low compared to other methods.

Figure: Speech recording after being processed by deep learning method © Qianxu Li

These separated audio tracks can then be used for noise cancellation, and users can choose to keep the speakers separated or to reconstruct them into a single audio file.


 

MFCC

Mel Frequency Cepstral Coefficients (MFCCs) are a leading approach used in automatic speech and speaker recognition systems. It is a DSP method in which coefficients are computed from the audio and quantized into a number of vectors using the mel scale, a perceptual frequency scale that replicates the nonlinear critical bandwidth of the human ear.

 

Algorithm:

Matlab:

  1. Load the raw audio file

  2. Prepare and perform refinement of the raw and filtered audio signal

  3. Plot the Spectrogram of each file for preliminary analysis

 

Acknowledgment: This algorithm follows the steps from the paper "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences" by Davis, S. and Mermelstein, P., and from "Spoken Language Processing: A guide to theory, algorithm, and system development" by Huang, X. and Acero, A.

Python:

  1. Install the librosa library and packages in Colab notebook

  2. Perform MFCC on the sample audio

  3. Perform inv-MFCC on the sample audio

 

Acknowledgment: The library and packages used to perform MFCC and inverse MFCC were developed by McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto; the related paper is titled "librosa: Audio and music signal analysis in python".
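
A minimal sketch of these Python steps, assuming librosa (0.8 or later), soundfile, and matplotlib; the file name and parameter values are illustrative:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import soundfile as sf

# 1. Load the raw coffee shop recording (hypothetical file name)
y, sr = librosa.load("coffee_shop.wav", sr=None)

# 2. Compute MFCCs and a mel spectrogram for comparison with the MATLAB result
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
mel = librosa.feature.melspectrogram(y=y, sr=sr)

fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.power_to_db(mel, ref=np.max),
                               sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set_title("Mel spectrogram of the coffee shop recording")
plt.show()

# 3. Inverse MFCC: reconstruct an audio signal from the coefficients
y_rec = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
sf.write("coffee_shop_mfcc_reconstructed.wav", y_rec, sr)
```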

 

Implementation & Analysis:

To tackle the problem, we first used the built-in MFCC functionality in the MathWorks library to produce the mel spectrogram from the mel coefficients.

Figure: Melspectrogram of Coffee Shop Recording © Raj Patel

The resulting spectrogram shows a fading yellow gradient around 55 dB, which we believe to be Raj's voice in the Coffee Shop mel-spectrogram, since average human speech falls between 50 and 65 dB. The distinguishing difference between the Coffee Shop mel-spectrogram and the filtered Coffee Shop mel-spectrogram (after the high-pass filter was applied) is that the loudness is decreased and the high frequencies (yellow readings) are removed. Furthermore, visual comparison of the two figures shows a faint blue strip around 15 kHz after filtering, which we believe to be the pitch of the background noise that is similar to Raj's voice. To tackle this problem, we decided to reconstruct the MFCC into audio to analyze what type of noise is being generated (high or low) and whether it can be eliminated by performing ICA.

 

We then attempted to reconstruct the MFCC into an audio representation. We originally tried to do the conversion in MATLAB but were unable to achieve a clean audio reconstruction. In our first approach, we tried to manually convert the mel coefficients back into a regular frequency basis, but because the coefficients are on a logarithmic scale and some data is discarded when taking the Discrete Cosine Transform (DCT), there is no longer a defined component for every time point. To overcome this difficulty, we tried an inverse MFCC implementation by Min Gang from the MATLAB Central File Exchange, but due to computational and compiling errors we had to halt this approach.

 

Our last attempt used a Python package for music and audio analysis called librosa, developed by McFee, B. et al. With the help of its MFCC functions, we were able to plot the mel-spectrogram and produce a reconstructed audio sample of the raw speech (the coffee shop recording). When comparing the Python Coffee Shop mel-spectrogram to the one computed in MATLAB, the graphs are identical, confirming that the MFCC computation was correct. The analysis of the reconstructed audio sample was very intriguing: the results were not what we expected, as the audio is very discontinuous and lossy (alien-sounding). Our preliminary speculation is that some vital values are filtered out during the calculation of the MFCC coefficients, which is why there is a loud, discontinuous noise.

Figure: Melspectrogram obtained from python © Raj Patel

Audio: Reconstructed Coffee Shop Recording processed by MFCC © Raj Patel
