
Idea Finalization Deliverable

This is a preliminary report. For the most up-to-date information, please visit the main website.

Collect a part of what would be your dataset. It could be a set of recordings with some specific common surrounding noises (phone ringing, another person talking in the distance, coffee machine noises, whatever you want to try eliminating).
Use some simple filtering to eliminate noise.

As advised, we recorded an audio sample and applied some basic high-pass filtering to it. The sample is President John F. Kennedy's speech on the decision to go to the moon (NASA), recorded by Raj outdoors with heavy background noise.

Original recording: 

Basic Filtering:
Examining the raw speech, there is a lot of noise, as seen in Fig 1. Basic high-pass filtering eliminates the noise, as seen in Fig 2. To achieve this, we used the fir1 function with the 'high' option to design a high-pass FIR filter, together with the filter function, which filters the input data using a rational transfer function defined by its numerator and denominator coefficients.
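
Below is a minimal sketch of this step. The filename, filter order, and cutoff frequency are illustrative assumptions, not the exact values from our experiment.

% Minimal sketch: high-pass FIR filtering with fir1 and filter.
% Filter order, cutoff, and filename are illustrative assumptions.
[x, fs] = audioread('rawSpeech.wav');     % hypothetical filename
x = x(:, 1);                              % use first channel if stereo

order = 200;                              % FIR filter order (assumption)
fc = 100;                                 % cutoff frequency in Hz (assumption)
b = fir1(order, fc/(fs/2), 'high');       % high-pass FIR design
y = filter(b, 1, x);                      % rational transfer function, a = 1

audiowrite('filteredSpeech.wav', y, fs);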

Fig 1: Raw speech with noise. Fig 2: Filtered speech with a high-pass filter.

When you do, how clear is the human voice in the signal? What other features (other than frequency) distinguish the other sounds from the human voice? Describe how you might quantify those other features for an algorithm.

Besides Frequency:
Besides frequency filtering, we used wave decomposition to eliminate noise. In this approach, we separate the time-domain signal into sinusoidal fundamental waves, where each wave represents one specific frequency. Instead of filtering out specific frequencies, we apply a threshold to the amplitude of the time-domain signal, eliminating the wave components whose amplitude falls below the threshold. If some component of the audio signal is (1) within a particular frequency range and (2) has a time-domain amplitude significantly lower than the rest of the signal, that is a good indication that the component is noise rather than speech. We implemented this method in MATLAB on the sample audio mentioned above; combined with a band-pass filter from 300 Hz to 3400 Hz, it reduces noise more significantly than simple frequency filters alone. However, the human voice after applying both methods is less clear than with frequency filters alone. This may be because some noise components lie within the speech frequency range, so the speech quality is affected when we try to remove them. The result can be heard in 'rawSpeech-both-method.wav', and a sketch of the approach follows below.
‘rawSpeech-both-method.wav’: 
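
Here is a minimal sketch of the combined approach, using a short-time Fourier transform as the wave decomposition. The window size, overlap, and the 10% amplitude threshold are illustrative assumptions, not our exact settings.

% Minimal sketch: band-pass filtering plus amplitude thresholding on an
% STFT decomposition. Parameters are assumptions for illustration.
[x, fs] = audioread('rawSpeech.wav');            % hypothetical filename
x = x(:, 1);                                     % use first channel if stereo

b = fir1(200, [300 3400]/(fs/2), 'bandpass');    % speech-range band-pass
x = filter(b, 1, x);

win = hamming(512);                              % decompose with an STFT
S = stft(x, fs, 'Window', win, 'OverlapLength', 256);

thr = 0.1 * max(abs(S(:)));                      % amplitude threshold (assumption)
S(abs(S) < thr) = 0;                             % drop weak wave components

y = istft(S, fs, 'Window', win, 'OverlapLength', 256);
audiowrite('rawSpeech-both-method.wav', real(y), fs);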



Speech processed only by frequency filter:

Fig 3: A comparison of the spectrograms of the two signals; the vertical axis is frequency and the horizontal axis is time. After processing the signal with both wave decomposition and the frequency filter, the human speech frequency range (approximately 300 Hz - 3400 Hz) is preserved, and noise within that range is also reduced by the wave decomposition.


Besides those two methods, we also researched the paper "Per-Channel Energy Normalization: Why and How" by Vincent Lostanlen et al. In this paper, the authors discuss a way not only to eliminate noise but also to extract sounds with certain features using the PCEN method. The method uses five constant parameters: two depend on the original signal, and the remaining three depend on which feature the user would like to extract from the signal, for example whether the sound source is far from or near the recording position, whether the sound is intense or slow, and whether it is loud or quiet. The method requires convolving the mel spectrogram of the sound signal with a low-pass filter, and we expect it to achieve higher accuracy and compatibility than time-frequency analysis alone (or perhaps to be used together with time-frequency analysis). We look forward to implementing some of the basic ideas of this method in our project as one of our plans.
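
As a starting point, here is a minimal sketch of the core PCEN computation as we understand it from the paper. The parameter values (s, epsilon, alpha, delta, r) below are illustrative assumptions, not the paper's recommended settings.

% Minimal sketch: PCEN on a mel spectrogram. All parameter values are
% assumptions for illustration only.
[x, fs] = audioread('rawSpeech.wav');            % hypothetical filename
x = x(:, 1);                                     % use first channel if stereo
E = melSpectrogram(x, fs);                       % mel spectrogram (bands x frames)

% Smoothed energy M via a first-order low-pass filter along time:
% M(t) = (1 - s) * M(t-1) + s * E(t)
s = 0.025;                                       % smoothing coefficient (assumption)
M = filter(s, [1, s - 1], E, [], 2);

% Per-channel energy normalization.
epsilon = 1e-6; alpha = 0.98; delta = 2; r = 0.5;    % assumptions
P = (E ./ (epsilon + M).^alpha + delta).^r - delta^r;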

Your task is related to something called speaker diarization. Read about this topic and discuss ideas you get in the process. You can find challenge datasets online at Kaggle or here: https://dihardchallenge.github.io/dihard3/.

Speaker Diarization:

Speaker diarization is a method for separating each speaker's audio from a mixed recording. It normally uses machine learning, training the algorithm with audio from each speaker.

The following link is an example of speaker diarization we found in MATLAB.
https://www.mathworks.com/help/audio/ug/speaker-diarization-using-x-vectors.html

We expect that speaker diarization can separate the speaker from the background noise efficiently; we could then take the speaker's audio and perform further filtering to obtain a clear sound.

We have done a lot of research on this topic, and it involves complicated machine learning algorithms. We could not yet get a simple MATLAB example derived from these algorithms working, but we will condense the example and try to make it easier to compute with.

For our project, we think that speaker diarization is similar to wave decomposition, which also separates the original sound into different component waves. Instead of digging into the complicated machine learning algorithms, we could work out an efficient wave decomposition algorithm; along with the correct filter design, we could potentially remove noise with different features. A toy sketch of the basic diarization idea follows below.
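
As a crude stand-in while we work toward the x-vector example, the sketch below clusters frame-level MFCC features with k-means. This is a toy illustration of the diarization idea, not the MathWorks x-vector pipeline; the filename and speaker count are assumptions.

% Toy sketch: cluster frame-level features so each frame gets a speaker
% label. This is NOT the x-vector pipeline from the MathWorks example.
[x, fs] = audioread('conversation.wav');         % hypothetical two-speaker file
x = x(:, 1);                                     % use first channel if stereo

coeffs = mfcc(x, fs);                            % one MFCC row per frame
numSpeakers = 2;                                 % assumed number of speakers
labels = kmeans(coeffs, numSpeakers);            % labels(i): speaker guess for frame i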

Dataset:
This dataset from Kaggle can help us train the system:
https://www.kaggle.com/datasets/wiradkp/mini-speech-diarization
