A spectrogram is a visual representation of the frequency content of a signal. It shows how the quantity of energy in different frequency regions varies as a function of time. To make a spectrogram, the signal is divided into many small time sections and each section is analysed in terms of what frequency components are present in it. This analysis is called spectral analysis because the spectrum of each section is calculated and the quantity of each frequency component (that is, each sinusoid) is measured from the spectrum. The quantity of each component is then converted to a grey level in which (normally) low energy components are converted to a white colour, while high energy components are converted to a black colour. These colours are then plotted on a vertical strip corresponding to the time at which the original signal section occurred. The height of the coloured element on this vertical strip represents the frequency of the component.
Thus a spectrogram is a 3-dimensional analysis of a signal: the horizontal dimension is time, the vertical dimension is frequency, and the grey-scale shows the amount of energy occurring in the signal at each time and frequency.
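The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available. The function names spectrogram and to_grey, the Hanning weighting applied to each section, the overlap between neighbouring sections, the 50dB dynamic range and the synthetic 1000Hz test tone are all assumptions made for the example, not part of the description above.

```python
import numpy as np

def spectrogram(signal, sample_rate, section_ms=20.0, step_ms=5.0):
    """Return (times, freqs, energy_db) for a 1-D signal.

    section_ms is the length of each analysed section; step_ms is how far
    we move along the signal between sections (both are assumptions).
    """
    section_len = int(round(sample_rate * section_ms / 1000.0))
    step = max(1, int(round(sample_rate * step_ms / 1000.0)))
    window = np.hanning(section_len)  # taper the edges of each section
    starts = np.arange(0, len(signal) - section_len + 1, step)
    columns = []
    for start in starts:
        section = signal[start:start + section_len] * window
        spectrum = np.fft.rfft(section)        # spectrum of this section
        energy = np.abs(spectrum) ** 2         # energy of each component
        columns.append(10.0 * np.log10(energy + 1e-12))
    energy_db = np.array(columns).T            # rows: frequency, columns: time
    times = (starts + section_len / 2.0) / sample_rate
    freqs = np.fft.rfftfreq(section_len, d=1.0 / sample_rate)
    return times, freqs, energy_db

def to_grey(energy_db, dynamic_range_db=50.0):
    """Map energy to grey levels: 1.0 is white (low), 0.0 is black (high)."""
    top = energy_db.max()
    scaled = (energy_db - (top - dynamic_range_db)) / dynamic_range_db
    return 1.0 - np.clip(scaled, 0.0, 1.0)

# Hypothetical test signal: half a second of a 1000Hz sinusoid.
sample_rate = 16000
t = np.arange(8000) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

times, freqs, energy_db = spectrogram(tone, sample_rate)
grey = to_grey(energy_db)
peak_freq = freqs[np.argmax(energy_db.mean(axis=1))]
print("darkest band is near", peak_freq, "Hz")
```

Each column of grey is one vertical strip of the picture, and each row is one frequency. A plotting library could then display the array with time running left to right and frequency running bottom to top.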
From the description of a spectrogram above, you will see that one part of the analysis process involves dividing the signal up into sections so that each section can be spectrally analysed. What we didn't say is how big these sections should be. Clearly if we choose very large sections, say of half a second, we wouldn't be able to see much of what is going on in a speech signal - each half-second chunk of the picture would be a static set of colours. We know that speech signals change rapidly and we want to choose relatively small sections so that we can see the spectral content of the speech changing from moment to moment.
There is a problem, however: as we make the sections smaller and smaller, it becomes more and more difficult to determine precisely which frequency components are present in the signal. You can see this in a qualitative way: if you only look at a small fraction of a single cycle of a sinusoid, it is very difficult to estimate the frequency. You can guess the duration of a whole cycle, but you might be considerably in error. In the same way, the smaller the section of signal we put into spectral analysis, the poorer the estimates of the frequency components it contains.
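You can check this loss of frequency precision numerically. The sketch below, which assumes NumPy, analyses a single sinusoid using sections of two different lengths and measures how wide the resulting spectral peak is at half its maximum energy. The function name peak_width_hz, the 1000Hz tone, the 16kHz sampling rate and the Hanning weighting are all choices made for this illustration only.

```python
import numpy as np

def peak_width_hz(section_ms, tone_hz=1000.0, sample_rate=16000, nfft=1 << 16):
    """Width in Hz of the spectral peak at half its maximum energy."""
    n = int(round(sample_rate * section_ms / 1000.0))
    t = np.arange(n) / sample_rate
    section = np.sin(2 * np.pi * tone_hz * t) * np.hanning(n)
    # Zero-padding to nfft points only samples the spectrum more finely;
    # it does not add any information about the signal.
    energy = np.abs(np.fft.rfft(section, n=nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sample_rate)
    above = freqs[energy >= 0.5 * energy.max()]
    return above.max() - above.min()

width_long = peak_width_hz(20.0)   # a long section
width_short = peak_width_hz(3.0)   # a short section
print("20ms section: peak about", round(width_long), "Hz wide")
print(" 3ms section: peak about", round(width_short), "Hz wide")
```

The same sinusoid gives a much broader peak when the section is short, because the width of the peak grows roughly in inverse proportion to the section length. A broad peak means we can only say approximately where the frequency component lies.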
So we need to come to some kind of compromise: we want sections short enough to see the temporal detail in the speech signal, but long enough to see the frequency detail too. It turns out that there is no single best compromise value for speech signals. If you need to see the individual harmonics of larynx vibration (the source harmonics), you need sections about 20ms long, which span more than one larynx cycle. Sections of this length are too long to show the time response of the vocal tract resonances (formants) as they respond to each excitation pulse. This is called a "narrow-band" spectrogram, because long sections resolve narrow bands of frequency. On the other hand, if you want to see the detailed formant vibrations that occur within the larynx cycle, you need to use sections smaller than a single glottal cycle (which is about 5ms for women), so sections of about 3ms are commonly used. These short sections give us better temporal detail but poorer frequency detail: each point on the picture reflects energy spread over a wide band of frequencies. This is why the analysis is called a "wide-band" spectrogram. Wide-band analysis is most useful for finding formant frequencies.
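The difference between the two analyses can be illustrated with a toy signal. The sketch below assumes NumPy and builds a very crude speech-like sound: a pulse every 5ms (a 200Hz larynx vibration) in which each pulse excites a single damped 1000Hz resonance. The signal model, the names section_energy_db and harmonic_dip_db, and all the numerical settings are assumptions made for this example; real speech is far more complex. The code compares the energy at 1000Hz, where a harmonic lies, with the energy at 900Hz, which falls between two harmonics.

```python
import numpy as np

sample_rate = 16000   # samples per second (assumed)
period = 80           # 5ms larynx cycle at 16kHz, i.e. 200Hz
formant_hz = 1000.0   # a single toy resonance (assumed)

# Larynx pulses: one impulse every 5ms for 100ms.
pulses = np.zeros(1600)
pulses[::period] = 1.0

# Each pulse excites a damped 1000Hz vibration (2ms decay constant).
n = np.arange(160)
response = np.exp(-n / 32.0) * np.sin(2 * np.pi * formant_hz * n / sample_rate)
speech_like = np.convolve(pulses, response)[:len(pulses)]

def section_energy_db(signal, start, section_ms, nfft=1600):
    """Energy spectrum in dB of one section; bins are 10Hz apart."""
    length = int(round(sample_rate * section_ms / 1000.0))
    section = signal[start:start + length] * np.hanning(length)
    energy = np.abs(np.fft.rfft(section, n=nfft)) ** 2
    return 10.0 * np.log10(energy + 1e-12)

def harmonic_dip_db(section_ms, start=800):
    """Energy at 1000Hz (a harmonic) minus energy at 900Hz (between harmonics)."""
    db = section_energy_db(speech_like, start, section_ms)
    return db[100] - db[90]

narrow_dip = harmonic_dip_db(20.0)   # narrow-band: 20ms sections
wide_dip = harmonic_dip_db(3.0)      # wide-band: 3ms sections
print("narrow-band, 1000Hz vs 900Hz:", round(narrow_dip, 1), "dB")
print("wide-band,   1000Hz vs 900Hz:", round(wide_dip, 1), "dB")
```

With 20ms sections the energy drops sharply between harmonics, so the individual harmonics stand out as separate horizontal lines. With 3ms sections the two frequencies have nearly the same energy: the harmonics are smeared together and only the broad shape of the resonance remains, which is what makes the formants easy to find.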
If you study a wide-band spectrogram of, say, a couple of words of speech, you should be able to see some of the following events:
www.speechandhearing.net | © 2000 Mark Huckvale, University College London