Spectrograms or: How I Learned to Stop Worrying and Love Audio Signal Processing for Machine Learning
Summary
Main Summary
The article "Spectrograms or: How I Learned to Stop Worrying and Love Audio Signal Processing for Machine Learning" explores the divide within the community of practitioners working on artificial intelligence projects applied to sound. According to the author, these practitioners fall into two clearly distinguishable groups: those who are overwhelmed by the complexity of audio signal processing, and those who manage to master these techniques with some fluency. This narrative framing introduces a reflection on how one's initial perception of audio processing can influence the success of machine learning projects. The text suggests that overcoming the initial anxiety about the mathematics and theory behind signal processing is essential to building effective solutions in this field. The piece offers an introspective view of the learning curve in audio signal processing, emphasizing that accepting and mastering these technical tools can turn a frustrating experience into a competitive advantage. The title references the classic "How I Learned to Stop Worrying and Love the Bomb," suggesting a parallel between overcoming nuclear fear and confidently adopting audio processing technologies.
Key Elements
- Division of the AI/audio community: There is a clear dichotomy between practitioners who feel humbled (or overwhelmed) and those who handle AI-for-sound projects with skill, revealing a significant gap in experience and technical confidence.
- Audio signal processing: The article centers on audio signal processing as an essential component of machine learning applied to sound, highlighting its importance in transforming auditory data into representations that algorithms can process.
- Spectral representations: Spectrograms emerge as a fundamental tool in the analysis of audio signals, acting as a bridge between the time domain and the frequency domain to facilitate the training of machine learning models.
- Shift in perspective: The author describes a personal evolution from worry to acceptance of, and even love for, signal processing, suggesting that overcoming psychological barriers is as important as acquiring technical knowledge.
Analysis and Implications
This framing shows how the subjective perception of technical complexity can directly influence the success of artificial intelligence projects in the audio domain. Identifying two psychological "camps" among practitioners suggests that developing competence in audio signal processing requires not only technical training but also a transformation in the practitioner's mindset and confidence. Tools such as spectrograms represent a democratization of access to the analysis of complex signals, allowing practitioners without advanced training in signal engineering to participate effectively in audio machine learning projects.
Additional Context
The reference to the film title "Dr. Strangelove" establishes a cultural frame that connects technological anxiety with the eventual acceptance and mastery of complex tools. This analogy reinforces the idea that overcoming initial uncertainty is a necessary step in adopting advanced signal processing technologies for artificial intelligence applications.
Content
The First Movement: Mechanical Waves, Time-Domain features, and how and why to extract them.
If you’ve ever decided to take on an A.I. sound project, you probably realized soon after it began that the people who take on such projects are divided into two camps. Those who are humbled. And those who are about to be. The reason you may have realized this is that in actuality you will take on two projects. First, the audio signal processing that is required to convert the original signal into spectrograms. (Spectrograms are images of time-frequency domain features that were extracted from wave signals.) Once you have those, you can move forward with a straightforward image classification deep learning project using those spectrograms. Yes, you will never run a model on the sound itself; instead you process the sound into images and build a supervised deep learning model to classify those images. But don’t you worry if this is your first project of this type. I’ve worried for the both of us, and the purpose of this article is to walk you through the process.
So What are Mechanical Waves?
Sound is produced by the vibration of objects. Those vibrations cause molecules to oscillate and collide with each other, thereby changing the air pressure in the local vicinity, which ultimately creates waves. A mechanical wave is a wave that oscillates and transfers energy from one point to another through a medium such as air. The molecules that collide across time produce peaks of compression and valleys of rarefaction, which can be represented with sinusoidal waves. Not to oversimplify, but sound is just energy that travels through space using air as a medium.
Below we can plot a complex waveform, which shows the deviation from a zero level of air disturbance against time. This is what sound looks like.

It is a sample of a cat meowing, but music and other sounds will have the same general format: a waveform with time on the x-axis and the deviation from center on the y-axis. This waveform contains a ton of information, including but not limited to the frequency, intensity, timbre and duration of the sample.
Frequency, Amplitude and Phase

The image above has two black highlights of peaks and dips. The distance from one peak to the next, or from one dip to the next, is called the period. Frequency is just the inverse of the period, F = 1 / T, and is measured in hertz (Hz). For example, a 440 Hz tone has a period of T = 1/440 s, or about 2.27 ms.

Amplitude is how high or low the disturbance of air goes when a sound is made. And phase is just the position of the waveform at time zero. Now that we have a basic understanding of these features, we can think of the first waveform as a combination of many, many single sine waves added together to form a complex waveform.
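To make these three properties concrete, here is a minimal sketch (my own illustration, not code from the project) that builds single sine waves from an amplitude, a frequency and a phase, and sums a few of them into a more complex waveform. The sample rate and the particular frequencies are arbitrary choices for the example.
import numpy as np
import matplotlib.pyplot as plt

sr = 22050                      # samples per second, chosen just for illustration
t = np.arange(0, 0.02, 1 / sr)  # 20 ms worth of time points

def sine(amplitude, frequency, phase, t):
    # a single sine wave: amplitude * sin(2*pi*frequency*t + phase)
    return amplitude * np.sin(2 * np.pi * frequency * t + phase)

a440 = sine(1.0, 440, 0.0, t)                                            # one pure tone
complex_wave = a440 + sine(0.5, 880, 0.3, t) + sine(0.25, 1760, 1.2, t)  # sum of three sines

plt.plot(t, complex_wave)
plt.xlabel('time (s)')
plt.ylabel('deviation from zero')
plt.show()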
If there are more periods in a given span of time, the result is a higher frequency; fewer periods in the same span produce a lower frequency. The higher the frequency, the higher the sound. Think of Mariah Carey’s ability to hit that high “ C ” as high frequency and Barry White’s bass notes as low frequency.
If the amplitude is high you will perceive that sound as loud. If it is low you will hear it as soft. Isn’t that fascinating?
But what if the sound is too soft or the frequency is too high?
If a tree falls in the middle of a forest and no one is there to hear it, does it still make a sound?
There was a sub-culture many moons ago whose members dubbed themselves “ Head-Bangers ”. They listened to really, really loud music. To be frank, they listened to it a bit too loud for my ears to appreciate the music, but it never got so far out of range that I could not hear them jamming a block away. However, I once watched a man use a dog whistle that I could never hear, though the surrounding dogs apparently could. If dogs can hear sounds that humans cannot, then there must be constraints on our ability to hear certain sounds, perhaps due to frequency. Was the dog whistle too high in frequency for me to hear? Yes. Humans can hear frequencies ranging from 20–20,000 Hz. Check out this link for more information.
This concept is really cool because how do we know that there is not an array of sounds surrounding us that are out of our range? Do they not exist merely because we cannot hear them? I have doubts about that.
How we perceive sound and music.
I first started playing guitar and piano seriously in the late 90’s and was fortunate to have as a teacher a pupil of Andres Segovia. He insisted that I tune the guitar with a tuning fork instead of a digital tuner. This is how I learned about A-440 Hz, because a tuning fork produces that frequency when you strike it against another object. He instructed me to go to the piano and find the same “note” on the piano. It was a bit of a trick question because several “notes” sound similar, just higher or lower than A-440. What he meant was to find where that pitch was located on the piano. Pitch is the idea we use for the perception of frequency. We do not hear pitch in a linear fashion but in a logarithmic one. To further support this, I later discovered that the A above 440 Hz was 880 Hz and the one above that was 1760 Hz. Do you see the pattern? The A above or below a given A is just a factor of 2 away. This persuaded me that we do not perceive sound in a linear way. If we did, the difference between one octave and another would remain constant no matter how high or low the pitch. This concept will come in handy later.
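Here is a quick sketch of that pattern (just an illustration of the arithmetic, not code from the project): each A is a factor of 2 above the previous one, and the equal-tempered semitone in between is a constant ratio of 2 to the power 1/12 rather than a constant number of hertz.
octaves = [440 * 2 ** n for n in range(4)]
print(octaves)  # [440, 880, 1760, 3520]

# Because pitch perception is logarithmic, equal-sounding steps (semitones)
# are equal ratios, not equal differences in Hz.
semitone = 2 ** (1 / 12)
print(round(440 * semitone, 2))  # ~466.16 Hz, one semitone above A-440
print(round(880 * semitone, 2))  # ~932.33 Hz, the same step an octave up is twice as wide in Hz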
Why instruments are distinguishable
Another way to tune a guitar is to strike the piano at A-440 Hz and tune your A string to the piano, assuming that it is in tune. Both instruments share the same frequency and sound intensity and therefore can be aligned. But what is that certain thing that makes them sound different? We can close our eyes, hear a pitch played on the piano and hear the exact same pitch on a sax, and yet make a distinction between the two. The reason for this is timbre. Timbre is the color of the tone, or the quality of it. Even within the same instrument, a musician is able to make the same pitch sound different while it remains the same pitch. Yeah… but how is a computer going to be able to make that distinction, you’re asking. Let’s walk through some simple code in Python to see if we can spot a difference when we plot the waveforms of two different instruments playing the same pitch for the same duration.
We are going to need to import the following libraries shown below.
import IPython.display as ipd
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

I have wave files I created of a jazz organ and a Steinway Grand Piano. Both have the same duration, pitch and intensity. This is how you create a path in order to load your wav file.
path = os.getcwd()
file_name = '/Users/your_path/to_folder_where_your_wavefile_'
new_path = os.path.join(path, file_name)
os.chdir(new_path)

If you want to play a particular wav file for a sanity check, use IPython.display (imported above as ipd). Here, A_440_Hz is a list holding the paths to the two wav files.
ipd.Audio(A_440_Hz[0])

Now that you know for certain you’re dealing with a signal, we move on to Librosa.
steinway, sr = librosa.load(A_440_Hz[0])
jazz_organ, sr = librosa.load(A_440_Hz[1])

To show evidence that both have the same duration, here is your proof.
sample_duration = 1 / sr
duration = sample_duration * len(jazz_organ)
print(f'duration of jazz organ signal is: {duration: 2f} seconds')

Output: duration of jazz organ signal is: 8.000000 seconds

duration = sample_duration * len(steinway)
print(f'duration of steinway signal is: {duration: 2f} seconds')

Output: duration of steinway signal is: 8.000000 seconds
Now we can plot these to illustrate the difference in the color of the tone.
plt.figure(figsize=(24,6))
librosa.display.waveplot(jazz_organ, alpha = .5)
plt.title('JAZZ ORGAN A-440-Hz')
plt.ylim(-1, 1)
plt.show()
I have separated the waveform above into parts of an attack, decay, sustain, and release, and together they make up its Sound Envelope. Different sounds will render different envelopes, as you can see from the difference between these two. The attack on the jazz organ is not as dramatic as the attack on the Steinway, but they have similar decays and very distinctive sustains, the jazz organ having a bit of a vibrato pattern.
plt.figure(figsize=(24,6))
librosa.display.waveplot(steinway, alpha = .5)
plt.title('Steinway A-440-Hz')
plt.ylim(-1, 1)
plt.show()
The Steinway’s sustain does not have that much up-and-down action but is very stable, followed by a smooth release.
Earlier, I said that both of these have the same pitch, intensity and frequency, yet we do not see any evidence in these plots that they are the same pitch. Is that what accounts for the difference in the waveforms? Them being different pitches? Not likely. A better explanation for the phenomenon is that these are Time-Domain features, not Frequency-Domain features, which we will go over in the next article. But first, let’s go over a few more time-domain features besides the sound envelope.
Analog Signal to Digital Signal

An analog signal needs to be converted to a digital signal because its values are continuous and we need to process them as discrete values. The continuous values could have infinitely many digits after the decimal, which would mean we would need infinite memory to process them. To do this analog-to-digital conversion we need to employ sampling and quantization.
A sample is a value or a set of values at a particular point in time. Basically, we decide on a period, and at each period we take a sample. The sample rate is how many samples we have per second and is the inverse of the period: Sr = 1 / T. If you are interested in more about sampling, here is a link.
If you look at the image above from the link, you will see the sample dots and the area under the curve highlighted in yellow. If the sampling rate is low it will capture less of the area under the curve, and if the sampling rate is high it will capture more. This is effectively our cost function for the conversion, because we are trying to minimize errors. But should we indiscriminately choose the highest sample rate possible, or is there an ideal sampling rate that we should choose?
The Nyquist frequency is the upper bound on the frequencies we can have in a digital signal without producing artifacts, and its formula is Fn = Sr / 2. The sampling rate for most modern music players is 44,100 Hz, and if we divide by 2 we get 22,050 Hz, which sits just above the upper bound of human hearing. Remember earlier?
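A small sketch of my own (not from the original code) shows why frequencies above the Nyquist frequency create artifacts: a 30 kHz tone sampled at 44,100 Hz produces exactly the same sample values as a lower-frequency alias, so the digital signal can no longer tell the two apart.
import numpy as np

sr = 44100          # samples per second, the common music sample rate
nyquist = sr / 2    # Fn = Sr / 2 = 22050 Hz
print(nyquist)

t = np.arange(0, 0.01, 1 / sr)
too_high = np.sin(2 * np.pi * 30000 * t)        # a 30 kHz tone, above Nyquist
alias = np.sin(2 * np.pi * (30000 - sr) * t)    # the lower frequency it folds down to
print(np.allclose(too_high, alias, atol=1e-6))  # True: the samples are indistinguishable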
Quantization is the second step in the analog-to-digital conversion. It is a similar process but applied to the y-axis, or the amplitude, of a time-domain graph. The idea is to minimize error just like with sampling, and after quantization the result is a smaller set of discrete values than the input values. Check out the link below for more details.
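Quantization can be sketched in a few lines as well (the bit depth and the toy signal here are made up for the example): every amplitude gets snapped onto the nearest of a finite set of levels, trading a small, bounded error for a representation that fits in finite memory.
import numpy as np

bit_depth = 4                 # 4 bits -> 16 discrete levels (CD audio uses 16 bits -> 65,536 levels)
levels = 2 ** bit_depth

analog = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 20))  # stand-in for a continuous signal in [-1, 1]

step = 2.0 / (levels - 1)                    # spacing between allowed amplitude levels
quantized = np.round(analog / step) * step   # snap each value to the nearest level

print(np.max(np.abs(analog - quantized)))    # the quantization error is at most step / 2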
Time-Domain Features
Audio features are descriptors of sound. When we are dealing with waveforms, we are dealing with time, and events happen across time. We can take a look at all the events across time by extracting features along the x-axis of a waveform. The amplitude envelope, root-mean-square energy and zero-crossing rate are some of the features we can extract.
In order to extract those features we need to take our converted digital signal and apply framing to it. Framing means grouping a set number of samples from the signal into a frame. We divide all the samples into frames and ensure that each frame’s time length is long enough for humans to perceive; anything less than 10 ms is below the human ear’s temporal resolution. The duration of a frame = 1 / sample rate * K, where K is the frame size; with the sample rate at 44,100 Hz, a frame size of at least 500 samples keeps us above that threshold, as the quick check below shows.
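As a quick check of that formula (my numbers, since the text only says “at least 500”), a 512-sample frame at 44,100 Hz comes out comfortably above the 10 ms threshold:
sample_rate = 44100   # Hz
frame_size = 512      # samples per frame, a power-of-two choice near the 500 mentioned above

frame_duration = frame_size / sample_rate
print(f'{frame_duration * 1000:.1f} ms')   # ~11.6 ms, above the 10 ms temporal resolution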
The amplitude envelope is the max amplitude value of all the samples in a given frame. Let’s jump into some code and I’ll show you how you can extract it from scratch.
We are going to use Librosa again, but this time our samples will be recordings of cats.
cat_20, sr = librosa.load(cats[20])
len(cats), sr

Output: (26, 22050)

cat_99, sr = librosa.load(cats[99])
len(cat_99)

Output: 246960
If we take a look at the length of both arrays, you will notice that cat_20 is smaller. Now let’s take a look at the first 25 values in each array.
cat_20[:25]

Output:
array([-0.02869544, -0.03226709, -0.02647833, -0.02724418, -0.02664405,
       -0.02373683, -0.02506031, -0.02607695, -0.02617143, -0.02862327,
       -0.03005172, -0.03034695, -0.03152664, -0.03144585, -0.03056201,
       -0.03005065, -0.02873663, -0.02774573, -0.02756811, -0.02704341,
       -0.0272577 , -0.02817268, -0.02817227, -0.0277576 , -0.02757546],
      dtype=float32)

cat_99[:25]

Output:
array([-0.02869544, -0.03226709, -0.02647833, -0.02724418, -0.02664405,
       -0.02373683, -0.02506031, -0.02607695, -0.02617143, -0.02862327,
       -0.03005172, -0.03034695, -0.03152664, -0.03144585, -0.03056201,
       -0.03005065, -0.02873663, -0.02774573, -0.02756811, -0.02704341,
       -0.0272577 , -0.02817268, -0.02817227, -0.0277576 , -0.02757546],
      dtype=float32)
As you can see, these are just numerical values. We are going to iterate through them frame by frame, stepping by a shorter interval to produce overlap. Below is a function that will do this for us.
FRAME_SIZE = 1024
STEP_LENGTH = 512

def amplitude_envelope(signal, frame_size, step_length):
    return np.array([max(signal[sample: sample + frame_size]) for sample in range(0, signal.size, step_length)])
Now that we have the function we just feed the function our input parameters.
ae_cat_99 = amplitude_envelope(cat_99, FRAME_SIZE, STEP_LENGTH)
len(ae_cat_99)

Output: 517

ae_cat_20 = amplitude_envelope(cat_20, FRAME_SIZE, STEP_LENGTH)
len(ae_cat_20)

Output: 55
As you may notice the lengths are much smaller now that we took only the max values for each frame.
Here is what both signals look like before we put them in this function.




The top image does not wrap around the values as tightly as the bottom one does, but notice that the top signal’s duration is quite short.
Amplitude envelopes can be used for genre classification because they give an estimation of loudness. Funk music will have very high amplitudes in a repetitive fashion, while classical music will have gradual crescendos and decrescendos of those amplitudes. Let’s move on to the next feature we can extract.
Root-mean-square energy is another time-domain feature we can extract from a signal, and lucky for us the Librosa package has a method to help us out with that. Take a look at the fourth line of code.
cat_99, _ = librosa.load(cats[99])
FRAME_SIZE = 1024
STEP_LENGTH = 512
rms_cat_99 = librosa.feature.rms(cat_99, frame_length = FRAME_SIZE, hop_length = STEP_LENGTH)[0]

We will add some code to plot this so we can see what it looks like alongside the amplitude envelope.
frames = range(0, rms_cat_99.size)
t = librosa.frames_to_time(frames, hop_length = STEP_LENGTH)

plt.figure(figsize=(24,6))
librosa.display.waveplot(cat_99, alpha = .5)
plt.plot(t, ae_cat_99, color = 'r' )
plt.plot(t, rms_cat_99, color = 'black' )
plt.title('cat # 99')
plt.ylim(-1, 1)
plt.show()

Root-mean-square energy can be used for genre classification too, but it is also very useful for audio segmentation; one example is detecting where a piece of classical music transitions to another movement.
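If you are curious what librosa.feature.rms is doing under the hood, here is a from-scratch sketch using the same framing approach as the amplitude_envelope function above: for each overlapping frame, square the samples, take the mean, then the square root. (The values will differ slightly from librosa’s, since librosa pads and centers its frames by default.)
def rms_energy(signal, frame_size, step_length):
    # for each overlapping frame: square the samples, average them, take the square root
    return np.array([
        np.sqrt(np.mean(signal[sample: sample + frame_size] ** 2))
        for sample in range(0, signal.size, step_length)
    ])

rms_cat_99_manual = rms_energy(cat_99, FRAME_SIZE, STEP_LENGTH)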
Conclusion
We have covered quite a few concepts and ideas, including some code. Don’t overburden yourself with all the different parts, but please do take home the big idea that audio signal processing is necessary to convert signals into spectrograms. A spectrogram is a visual way of representing the signal strength, or “loudness”, of a signal over time at the various frequencies present in a particular waveform. Not only can one see whether there is more or less energy at, for example, 440 Hz vs 880 Hz, but one can also see how energy levels vary over time. With that information we can apply machine learning algorithms to those representations to predict whether a sample belongs to a certain class or not.
We did not cover any frequency-domain features because we did not apply a “very special” transformation to our signal, which left us with just time-domain features to analyze. These features are important descriptors of sound and are vital to AI projects like audio classification, in the same way that square footage is an important attribute for predicting house prices.
In the next article we will cover the “very special” transformation called a fast Fourier transform. Fourier analysis converts a signal from its original domain to a representation in the frequency domain. Then we will be able to produce our spectrograms to use as input images for the image classification part mentioned in the introduction.
Thank you for reading!
I learned about audio signal processing during my capstone project at Metis where I trained a neural network to classify if a certain audio clip came from a dog or a cat. To view some of my projects check out my Github page. And please reach out to me on Linkedin if you have any questions or would like to connect.
Source: SelectFrom