Is the same video compression applied to audio ?
Definitely no. The eye and the ear, even if they are only a few centimeters apart, works very differently. The ear has a much higher dynamic range and resolution. It can pick out more details but it is slower than the eye.
The MPEG committee chose to recommend 3 compression methods and named them Audio Level I, II and III. Level I is the simplest, a sub-band coder with a psycoacustic model. Layer II adds more advanced bit allocation techniques and greater accuracy. Layer III adds a hybrid filterbank and non- uniform quantization plus advanced features like Huffman coding, 18 times higher frequency resolution and bit reservoir technique. Layer I, II and III gives increasing quality/compression ratios with increasing complexity and demands on processing power.
The reason for recommending 3 methods where partly that the testers felt that none of the coders was 100% transparent to all material and partly that the best coder (Layer III) was so computing heavy that it would seriously impact the acceptance of the standard.
The specs say that a valid Layer III decoder shall be able to decode any Layer I, II or III MPEG Audio stream. A Layer II decoder shall be able to decode Layer I and Layer II streams.
Layer-3 is the only audio coding scheme that has been recommended by the ITU-R for the use at 60 kbps channels. Layer-3 is widely in use in ISDN networks and other telecommunication applications, and various consumer applications (e.g. digital satellite broadcasting via Eutelsat or WorldSpace) have decided to use it. A Layer-3 decoder chip (that also decodes Layer-2) is available from ITT Germany.
Well, first you need to know how sound is stored in a computer. Sound is pressure differences in air. When picked up by a microphone and fed through an amplifier this becomes voltage levels. The voltage is sampled by the computer a number of times per second. For CD-audio quality you need to sample 44100 times per second and each sample has a resolution of 16 bits. In stereo this gives you 1,4Mbit per second and you can probably see the need for compression.
To compress audio MPEG tries to remove the irrelevant parts of the signal and the redundant parts of the signal. Parts of the sound that we do not hear can be thrown away. To do this MPEG Audio uses psyco- acustic principles.
How good is MPEG-1 AUDIO compression ?
MPEG can compress to a bitstream of 32kbit/s to 384kbit/s (Layer II). A raw PCM audio bitstream is about 705kbit/s so this gives a max. compression ratio of about 22. Normal compression ratio is more like 1:6 or 1:7. If you think that this is not much please remember that unlike video we are talking about no perceivable quality loss here. 96kbit/s is considered transparent for most practical purposes. This means that you will not notice any difference between the original and the compressed signal for rock'n roll or popular music. For more demanding stuff like piano concerts and such you will need to go up to 128kbit/s.
How does MPEG-1 AUDIO achieve this compression ratio ?
Well, with audio you basically have two alternatives. Either you sample less often or you sample with less resolution (less than 16 bit per sample). If you want quality you can't do much with the sample frequency. Humans can hear sounds with frequencies from about 20Hz to 20kHz. According to the Nyquist theorem you must sample at least two times the highest frequency you want to reproduce. Allowing for imperfect filters, a 44,1kHz sampling rate is a fair minimum. So you either set out to prove the Nyquist theorem is wrong or go to work on reducing the resolution. The MPEG committee chose the latter.
Now, the real reason for using 16 bits is to get a good signal-to- noise (s/n) ratio. The noise we're talking about here is quantization noise from the digitizing process. For each bit you add, you get 6dB better s/n. (To the ear, 6dBu corresponds to a doubling of the sound level.) CD-audio achieves about 90dB s/n. This matches the dynamic range of the ear fairly well. That is, you will not hear any noise coming from the system itself (well, there is still some people arguing about that, but lets not worry about them for the moment). So what happens when you sample to 8 bit resolution ? You get a very noticeable noise floor in your recording. You can easily hear this in silent moments in the music or between words or sentences if your recording is a human voice. Waitaminnit. You don't notice any noise in loud passages, right? This is the masking effect and is the key to MPEG Audio coding. Stuff like the masking effect belongs to a science called psyco- acoustics that deals with the way the human brain perceives sound. And MPEG uses psycoacustic principles when it does its thing.
Say you have a strong tone with a frequency of 1000Hz. You also have a tone nearby of say 1100Hz. This second tone is 18 dB lower. You are not going to hear this second tone. It is completely masked by the first 1000Hz tone. As a matter of fact, any relatively weak sounds near a strong sound is masked. If you introduce another tone at 2000Hz also 18 dB below the first 1000Hz tone, you will hear this. You will have to turn down the 2000Hz tone to something like 45 dB below the 1000Hz tone before it will be masked by the first tone. So the further you get from a sound the less masking effect it has. The masking effect means that you can raise the noise floor around a strong sound because the noise will be masked anyway. And raising the noise floor is the same as using less bits and using less bits is the same as compression.
Let's now try to explain how the MPEG Audio coder goes about its thing. It divides the frequency spectrum (20Hz to 20kHz) into 32 sub-bands. Each sub-band holds a little slice of the audio spectrum. Say, in the upper region of sub-band 8, a 1000Hz tone with a level of 60dB is present. OK, the coder calculates the masking effect of this sound and finds that there is a masking threshold for the entire 8th sub-band (all sounds w. a frequency...) 35dB below this tone. The acceptable s/n ratio is thus 60 - 35 = 25 dB. The equals 4 bit resolution. In addition there are masking effects on band 9-13 and on band 5-7, the effect decreasing with the distance from band 8. I a real-life situation you have sounds in most bands and the masking effects are additive. In addition the coder considers the sensitivity of the ear for various frequencies. The ear is a lot less sensitive in the high and low frequencies. Peak sensitivity is around 2-4kHz, the same region that the human voice occupies.
The sub-bands should match the ear, that is each sub-band should consist of frequencies that have the same psycoacustic properties. In MPEG layer II, each subband is 625Hz wide. It would been better if the sub-bands where narrower in the low frequency range and wider in the high frequency range. To do this you need complex filters. To keep the filters simple they chose to add FFT in parallel with the filtering and use the spectral components from the FFT as additional information to the coder. This way you get higher resolution in the low frequencies where the ear is more sensitive.
But there is more to it. We have explained concurrent masking, but the masking effect also occurs before and after a strong sound (pre- and postmasking)
If there is a significant (30 - 40dB ) shift in level. The reason is believed to be that the brain needs some processing time. Premasking is only about 2 to 5 ms. The postmasking can be up till 100ms. Other bit-reduction techniques involve considering tonal and non- tonal components of the sound. For a stereo signal you have a lot of redundancy between channels. The last step before formatting a Layer-3 stream is Huffman coding.
The coder calculates masking effects by an iterative process until it runs out of time. It is up to the implementor to spend bits in the least obtrusive fashion. For layer II the coder works on 23 ms of sound (1152 samples) at a time. For some material the 23 ms time-window can be a problem. This is normally in a situation with transients where there are large differences in sound level over the 23 ms. The masking is calculated on the strongest sound and the weak parts will drown in quantization noise. This is perceived as a noise-echo by the ear. Layer III addresses this problem specifically.
Layer III needs about 20MIPS per channel for real-time coding. This means a real fast DSP. A full Layer-3 decoder is available from ITT (MAS 3503 C). The encoder is really more complex, if you need reference quality, but this means that you need e.g. 2 DSPs instead of one for Layer-2.
Layer II on the other hand needs only a simple DSP like for example the AD2015 that can be had for a few dollars. The process is asymmetrical, much more processing is needed on the coding side. A decoder could be made to work without hardware assistance on a decent computer.
HP already does full
frame rate
1.15 Mbps MPEG-1 decoding, 352x240x30 fps or 352x288x25 fps on its 712/80
workstation.
That's currently video only; playing the audio causes
B-frames
to be skipped as necessary, but you still get better than 25 fps.
The quality is still far superior than Indeo et al even with
the
frame
skipping. If you reduce the MPEG bit rate down so
that the quality is equivalent to say Indeo you can easily do
30 fps with sound, the cost of decoding MPEG depends on
the bit rate.
Even though the player is software only, it relies on
special halfword instructions in the newer PA-RISC PCXL
chips, and on special frame buffer support for YUV data,
and Color Recovery which yields near 24 bit quality on
an 8 bit frame buffer.
Philips uses MPEG for their new digital video CD's. They say they will start shipping movies and music videos on CD's for their CD-I player by the end of this year. MPEG is accepted by Eureka-147. That means that when digital radio broadcasts starts in Europe a couple of years from now, you will receive MPEG coded audio.
The IUMA
(Internet Underground Music Archive) holds many audio clips
in MPEG compressed format, but you might need to
configure
your WWW browser.
IUMA, has been founded to provide a worldwide audience to
otherwise obscure and unavailable bands and artists.
Which sampling frequencies are used ?
You can have 48kHz, (used in professional sound equipment), 44,1kHz (used in consumer equipment like CD-audio) or 32kHz (used in some communications equipment).
MPEG I allows for two audio channels. These can be either single (mono) dual (two mono channels), stereo or joint stereo (intensity stereo or m/s-stereo). In normal (l/r) stereo one channel carries the left audio signal and one channel carries the right audio signal. In m/s stereo one channel carries the sum signal (l+r) and the other the difference (l-r) signal. In intensity stereo the high frequency part of the signal (above 2kHz) is combined. The stereo image is preserved but only the temporal envelope is transmitted. In addition MPEG allows for pre-emphasis, copyright marks and original/copy marks. MPEG II allows for several channels in the same stream.
Where can I get more details about MPEG audio ?
There is no description of the coder in the specs. The specs describes in great detail the bitstream and suggests psycoacustic models.
A good summary of MPEG-1 audio is :
ISO-MPEG-1 Audio: A generic standard for coding of high-quality digital audio J. Audio Eng. Soc. 42(10):780-792, October 1994.