I thought I’d write a separate post about the fundamentals of sound digitization and frequency analysis before I do a tutorial about detecting notes played by a guitar inside Unity3D. I’m not going into the fine details in this post. I’ll just scratch the surface to give you a basic introduction to the topic and get you familiar with some of the terms.

Digitizing the sound

Sound, as we hear it in the analog domain, is a continuous waveform. When we move to the digital domain, for example when we record a sound to a computer, we sample it at a certain frequency and the signal gets chopped up into discrete values, since we are now using a discrete signal processing system. This means that we lose some of the information that was in the original, continuous waveform. This is where the Nyquist sampling theorem comes into play. According to it, the signal must be sampled at a frequency at least twice as high as the highest frequency contained in the signal. So, if the sample rate is 44100Hz, which is pretty much the standard nowadays, the maximum frequency we can digitize is 22050Hz. This frequency is called the Nyquist frequency. It is equal to half the sampling frequency and is therefore the largest bandwidth a sampled signal can have.

In theory, a Nyquist frequency that is larger than the signal bandwidth is enough to allow perfect reconstruction of the signal from the samples, but the process requires an ideal filter that passes some frequencies unchanged while suppressing all others completely. In practice such a filter cannot be built, so some amount of aliasing is unavoidable. Frequencies higher than the Nyquist frequency will “fold” about the Nyquist frequency back into lower frequencies. For example, if the sample rate is 40kHz, the Nyquist frequency is 20000Hz and a 21000Hz signal will fold, or alias, to 19000Hz.
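To make the folding concrete, here is a minimal sketch in Python (my choice of language for illustration only). The function name alias_frequency is made up for this example, and it assumes a pure tone sampled with no anti-aliasing filter in front. With a 40000Hz sample rate the Nyquist frequency is 20000Hz, so a 21000Hz tone should land at 19000Hz.

```python
def alias_frequency(freq_hz, sample_rate_hz):
    """Return the frequency at which a pure tone shows up after sampling.

    Hypothetical helper: assumes an ideal sampler and no anti-aliasing filter.
    """
    nyquist = sample_rate_hz / 2.0
    # Sampling cannot tell apart tones that differ by a multiple of the sample rate.
    folded = float(freq_hz) % float(sample_rate_hz)
    # Anything above the Nyquist frequency mirrors back below it.
    if folded > nyquist:
        folded = sample_rate_hz - folded
    return folded


print(alias_frequency(21000, 40000))  # 19000.0
print(alias_frequency(1000, 44100))   # 1000.0 (below Nyquist, unchanged)
```

This is why recording hardware normally low-pass filters the signal before sampling it: once a tone has folded down, it cannot be told apart from a real tone at that lower frequency.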

Enter the Fourier

I had been playing and composing music for years when I first found out about the audio frequency spectrum. I had heard terms like spectrum and signal bandwidth before, since my father was a CB radio hobbyist back in the day. In the audio analysis software packages I saw, I noticed this thing called FFT. I had no idea what it was but soon figured out that it was the key to displaying the frequencies contained in a signal. Later I found out that it stood for Fast Fourier Transform.

Let’s dig a bit deeper into the foundations. During my engineering studies I had a course about signal theory, which was pretty much all mathematics, and there I got introduced to different kinds of transforms. One of them was the Discrete Fourier Transform (DFT). The DFT is a mathematical way to move signal data from its original domain (often time) to the frequency domain. The input samples are complex numbers (for audio they are real, so the imaginary part is simply zero) and the output coefficients are complex as well. The frequencies of the output sinusoids are integer multiples of a fundamental frequency, whose period is the length of the analyzed block of samples. The combination of sinusoids obtained through the DFT is therefore periodic with that same period. The Fast Fourier Transform, the FFT, is an algorithm that calculates the DFT and its inverse much faster while still producing the same results. Now, it’s up to you to find yourself a suitable implementation of the FFT. There’s even one built into the Unity3D engine.
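If you want to see the idea in code, here is a rough Python sketch using only the standard library. It is not how any particular library does it: dft is the textbook definition written out directly, and fft is a simple recursive radix-2 variant that assumes the sample count is a power of two. The test signal (a sine that completes exactly 3 cycles in 32 samples) is made up for the example.

```python
import cmath
import math


def dft(samples):
    """Textbook O(N^2) DFT: X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N)."""
    count = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / count)
            for n, x in enumerate(samples))
        for k in range(count)
    ]


def fft(samples):
    """Recursive radix-2 FFT sketch; assumes len(samples) is a power of two."""
    count = len(samples)
    if count == 1:
        return list(samples)
    even = fft(samples[0::2])  # transform of the even-indexed samples
    odd = fft(samples[1::2])   # transform of the odd-indexed samples
    result = [0j] * count
    for k in range(count // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / count) * odd[k]
        result[k] = even[k] + twiddle
        result[k + count // 2] = even[k] - twiddle
    return result


# Made-up test signal: a sine completing exactly 3 cycles in 32 samples.
N = 32
signal = [math.sin(2 * math.pi * 3 * n / N) for n in range(N)]

slow = dft(signal)
fast = fft(signal)

# Both methods should agree up to floating point rounding.
max_difference = max(abs(a - b) for a, b in zip(slow, fast))

# The strongest bin in the first half of the spectrum is the sine's bin.
magnitudes = [abs(c) for c in fast]
peak_bin = max(range(N // 2), key=lambda k: magnitudes[k])
print(peak_bin)  # 3
```

The recursion splits the samples into even and odd halves at every level, which is where the usual "use a power of two" advice for FFT sizes comes from.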

Putting it to work

You managed to read this far? Good. You now know that we can get the frequency data out of a signal by using the FFT. And you know that there are limits on the frequencies we can have in our digitized signal. Now then, when we feed data to our FFT function, we need to decide how many samples, and therefore how many bins, we want to use for our analysis. The more bins we use, the longer it takes but the result is also more accurate. This means we have to find a suitable balance for our application. “But where do I start?” some of you might ask. First, you should use a sample count that is a power of two, like 512, 1024 and so on. Then we can apply some maths to determine what resolution we will get with that number of bins.

Resolution = Samplerate / bins

For example, if our sample rate is the aforementioned Fs = 44100Hz and we use a sample count of N = 1024, we get 44100/1024 = 43.0664, approximately. This means that the lowest frequency we can detect (apart from the constant level in bin 0) is around 43Hz and that the frequency bins are spaced about 43.07Hz apart. This might give you a good enough approximation for applications that don’t require very precise frequency data, where you only need to make out whether the incoming audio signal is strongest in the bass, middle or treble frequencies. But there’s another thing to consider. For a real input signal, the second half of the FFT contains no useful additional information, because it is just a mirror image of the first half. The bin at N/2 corresponds to the Nyquist frequency Fs/2, so the last useful bin for real world applications is N/2-1. Some FFT libraries do not even return the values of the bins above N/2, so be sure to check your implementation for this little detail.
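The arithmetic above can be sketched in a few lines of Python. The helper names (bin_resolution, bin_to_frequency, frequency_to_bin) are hypothetical and only here for illustration. The 440Hz lookup is my own example and not something a particular FFT library provides.

```python
def bin_resolution(sample_rate_hz, sample_count):
    """Spacing between neighbouring FFT bins in Hz."""
    return sample_rate_hz / sample_count


def bin_to_frequency(bin_index, sample_rate_hz, sample_count):
    """Frequency in Hz at the start of the given bin."""
    return bin_index * sample_rate_hz / sample_count


def frequency_to_bin(freq_hz, sample_rate_hz, sample_count):
    """Index of the bin closest to the given frequency."""
    return round(freq_hz * sample_count / sample_rate_hz)


FS = 44100  # sample rate in Hz
N = 1024    # sample count fed to the FFT

print(bin_resolution(FS, N))            # 43.06640625
print(bin_to_frequency(N // 2, FS, N))  # 22050.0, the Nyquist frequency
print(frequency_to_bin(440.0, FS, N))   # 10
```

Note how coarse this is for music: with these settings a 440Hz tone shares bin 10 with everything from roughly 431Hz to 474Hz, which covers more than one semitone.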

A few words about improving precision

Now as for precision, you have two choices: lower the sample rate or increase the number of bins. Or do both. It really depends on the application and especially on the frequency ranges where you need the most precision. But there’s a third way to improve the precision: windowing. We can apply a window function to the samples before running the FFT in order to reduce spectral leakage. If the waveform under analysis has two sinusoids of different frequencies, leakage can make them harder to identify spectrally. If one of the signals is stronger, it can obscure the other completely, and if the frequencies are close together and the sinusoids are of equal strength, leakage can render them unresolvable. One of the most used window functions is the Blackman-Harris window, so that’s probably a good starting point. Go ahead, fire up Wikipedia and read more about window functions. There’s a lot of math behind them, as there is behind all the things discussed in this post.
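Here is a small Python sketch of what windowing does, again using only the standard library. It assumes the common 4-term Blackman-Harris coefficients (0.35875, 0.48829, 0.14128, 0.01168) and a made-up test tone that sits halfway between two bins, which is the worst case for leakage. The helper names are hypothetical, and the slow DFT is only there to keep the example self-contained.

```python
import cmath
import math


def blackman_harris(count):
    """4-term Blackman-Harris window (symmetric form) with the usual coefficients."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    return [
        a0
        - a1 * math.cos(2 * math.pi * n / (count - 1))
        + a2 * math.cos(4 * math.pi * n / (count - 1))
        - a3 * math.cos(6 * math.pi * n / (count - 1))
        for n in range(count)
    ]


def dft_magnitudes(samples):
    """Magnitudes of the first half of a plain O(N^2) DFT."""
    count = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / count)
                for n, x in enumerate(samples)))
        for k in range(count // 2)
    ]


def relative_leakage(samples, far_bin):
    """Level in a bin far from the tone, relative to the spectrum's peak."""
    magnitudes = dft_magnitudes(samples)
    return magnitudes[far_bin] / max(magnitudes)


# Made-up test tone: 10.5 cycles in 64 samples, so it falls between bins 10 and 11.
N = 64
tone = [math.sin(2 * math.pi * 10.5 * n / N) for n in range(N)]

window = blackman_harris(N)
windowed_tone = [x * w for x, w in zip(tone, window)]

# Compare how much energy ends up in bin 25, far away from the tone.
leak_plain = relative_leakage(tone, 25)
leak_windowed = relative_leakage(windowed_tone, 25)
print(leak_plain, leak_windowed)
```

The windowed version should show far less energy in the distant bin. The price is a wider main peak, so two tones that are very close together can become harder to separate, not easier. That trade-off is why there are so many window functions to choose from.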

FFT in Unity3D

Unity3D provides us with a GetSpectrumData function in the AudioSource and AudioListener classes. It has some limitations regarding the number of FFT bins, but the function provides an easy way to gather the frequency data from your audio clips. There are several window functions available for it too. In the next post, I’m going to show you a simple use case for the GetSpectrumData function, and hopefully I’ll get my demonstration video about that ready soon too.