The CELP Codec

Note: What follows is the English original of an article about this codec that I find very interesting. Unfortunately, when I collected this information (cut and paste) for my personal "library", I could not have imagined that I would one day want to publish it. I kept no record of its origins, and I have not been able to find it on Google (perhaps it is no longer on the Web). I therefore acknowledge here the authorship of its unknown author (if they read these lines and so wish, I am willing to cite them explicitly and attribute to them the rightful ownership of the copyright).

§1 Synopsis

Speech coding systems provide codeword representations of speech signals for communication over a channel or network to one or more system receivers. Each system receiver reconstructs speech signals from the received codewords. The amount of codeword information communicated by a system in a given time period defines the system bandwidth and affects the quality of the speech received by system receivers.

The objective for speech coding systems is to provide the best trade-off between speech quality and bandwidth, given conditions such as the input signal quality, channel quality, bandwidth limitations, and cost. To reduce speech coding system bandwidth, redundancy is removed from the speech signal prior to transmission. Among the redundancies that can be exploited is the periodic nature of voiced speech. In many speech coders, this long-term redundancy is removed with a pitch or long-term predictor. At the system receiver a second long-term predictor is used to regenerate the periodicity in the reconstructed speech signal. Note that the term long-term predictor often refers to related but different structures in the system receiver and the system transmitter.

Long-term predictors are commonly applied to a class of coders called analysis-by-synthesis coders. A well-known representative of this class is code-excited linear prediction (CELP). In analysis-by-synthesis coders, speech signals are coded using a waveform-matching procedure. The speech is divided into segments which are called subframes. For each subframe, a candidate reconstructed speech signal is constructed for each of a large set of parameter configurations. Each of the parameter configurations is fully defined by a number of indices. Each candidate is compared to the original speech signal to determine which candidate most closely matches the original speech. The matching procedure is tailored to the properties of the human auditory system through the use of perceptual weighting. The indices corresponding to the best matching candidate reconstructed speech signal are transmitted over the channel. From the indices, the system receiver determines the correct parameter configuration and creates the reconstructed speech signal.
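
The waveform-matching search described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular CELP standard: the function names `synthesize` and `search_codebook`, the one-tap filter coefficient, the 40-sample subframe, and the random 256-entry codebook are all assumptions made for the example. Perceptual weighting and filter memory carried over from earlier subframes are omitted for brevity.

```python
import random

def synthesize(excitation, lpc):
    """All-pole synthesis filter: y[n] = x[n] - sum_k lpc[k] * y[n-1-k].

    Starts from zero state (no memory from earlier subframes).
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc -= a * out[n - 1 - k]
        out.append(acc)
    return out

def search_codebook(target, codebook, lpc):
    """Return (index, gain, error) of the candidate closest to target."""
    best = None
    for index, codevector in enumerate(codebook):
        candidate = synthesize(codevector, lpc)
        energy = sum(c * c for c in candidate)
        if energy == 0.0:
            continue  # an all-zero candidate cannot be scaled to match
        # Least-squares optimal gain for this candidate.
        gain = sum(t * c for t, c in zip(target, candidate)) / energy
        error = sum((t - gain * c) ** 2 for t, c in zip(target, candidate))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

rng = random.Random(0)
SUBFRAME = 40                       # assumed subframe length in samples
codebook = [[rng.gauss(0.0, 1.0) for _ in range(SUBFRAME)] for _ in range(256)]
lpc = [-0.9]                        # hypothetical one-tap synthesis filter

# Build the target from a known entry so the search has a known answer.
target = [1.5 * s for s in synthesize(codebook[37], lpc)]
index, gain, error = search_codebook(target, codebook, lpc)
print(index, round(gain, 3))
```

In this toy setup the target is entry 37 scaled by 1.5, so the search should recover that index and gain. In a real coder only the winning indices (including a quantized gain) would be transmitted, and the receiver would rebuild the same candidate from them.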

In analysis-by-synthesis coders, the long-term predictor generally is an integral part of the waveform-matching process. In a common configuration, the long-term predictor uses a segment of the past reconstructed signal to match the original signal in the present subframe. The past reconstructed speech is offset in time from the original (present) speech by an interval known as the delay, and it may be scaled by a gain. Both the gain and the delay of the past segment are adjusted to provide the best match to the original speech signal.
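
The joint adjustment of delay and gain can be illustrated with a small exhaustive search. This is a sketch under stated assumptions: the name `ltp_search`, the delay range of 20 to 100 samples, the synthetic periodic test signal, and the rule of repeating the last `delay` samples when the delay is shorter than the subframe are all choices made for the example, not details taken from the article.

```python
import math

def ltp_search(past, target, min_delay, max_delay):
    """Return (delay, gain, error) giving the best match to target.

    `past` holds previously reconstructed samples, oldest first.
    """
    n = len(target)
    best = None
    for delay in range(min_delay, max_delay + 1):
        start = len(past) - delay
        # If the delay is shorter than the subframe, repeat the last
        # `delay` samples periodically to fill the segment.
        segment = [past[start + (i % delay)] for i in range(n)]
        energy = sum(s * s for s in segment)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this delay.
        gain = sum(t * s for t, s in zip(target, segment)) / energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, segment))
        if best is None or error < best[2]:
            best = (delay, gain, error)
    return best

PERIOD = 57  # assumed pitch period of the synthetic "voiced" signal
signal = [math.sin(2 * math.pi * i / PERIOD)
          + 0.3 * math.sin(4 * math.pi * i / PERIOD)
          for i in range(240)]
past = signal[:200]                           # stands in for past reconstructed speech
target = [0.8 * s for s in signal[200:240]]   # present subframe, slightly quieter

delay, gain, error = ltp_search(past, target, 20, 100)
print(delay, round(gain, 3))
```

Because the test signal repeats every 57 samples and the subframe is 0.8 times as loud as the past, the search should settle on a delay of 57 and a gain close to 0.8. The upper delay limit is kept below twice the period here so that only one exact match exists.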

The long-term predictor greatly enhances the coding efficiency of analysis-by-synthesis coders. This is confirmed by objective measurements, which show significant improvements in the signal-to-noise ratio of the reconstructed speech signal. However, the human auditory system is very sensitive to distortions in the speech signal that are related to periodicity. For example, speech coders are often perceived to be noisy or buzzy, and both of these distortions are related to the level of periodicity of the reconstructed speech. These distortions generally become stronger as the coding bit rate is decreased.

The degree of periodicity in a natural speech signal generally decreases with increasing frequency. In a conventional long-term predictor, periodicity is controlled by only one parameter, the long-term predictor gain. Despite the fact that this parameter does not vary with frequency, the periodicity of the reconstructed signal is not constant as a function of frequency. This is because the periodicity is dependent upon nonstationarity of the long-term predictor, as well as other factors. However, this frequency dependence cannot be adjusted separately for different frequencies. This shortcoming may lead to perceptible noise and/or buzziness in the reconstructed speech, especially at low bit rates and in the lower frequency regions, where the human auditory system has a high frequency resolution capability.

Note: CELP uses a codebook of 256 speech patterns. Searching only part of the entries reduces the CPU requirements at the cost of speech quality. Reducing the codebook search to about 170 entries does not noticeably degrade the speech, and even with a search of as few as 32 entries the speech is still very intelligible. The CELP encoding performance listed shows figures for codebook searches from 32 entries up to the full 256.
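
The trade-off in the note can be sketched by limiting how many entries are examined. This is a hedged illustration only: the name `best_match`, the random codebook, and the random target are assumptions, the synthesis filter is left out, and a squared-error figure says nothing about perceived quality. It shows only that the work grows with the number of entries searched and that the best error found can never get worse as the search widens.

```python
import random

def best_match(target, codebook, limit):
    """Search only the first `limit` entries; return (index, error)."""
    best_index, best_error = -1, float("inf")
    for index in range(min(limit, len(codebook))):
        vec = codebook[index]
        energy = sum(v * v for v in vec)
        if energy == 0.0:
            continue
        # Least-squares optimal gain, then the remaining squared error.
        gain = sum(t * v for t, v in zip(target, vec)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, vec))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error

rng = random.Random(1)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(256)]
target = [rng.gauss(0.0, 1.0) for _ in range(40)]

# Search sizes mentioned in the note: 32, about 170, and the full 256.
results = {limit: best_match(target, codebook, limit) for limit in (32, 170, 256)}
for limit, (index, error) in results.items():
    print(limit, index, round(error, 3))
```

The loop body runs `limit` times per subframe, so a 32-entry search costs roughly one eighth of the full 256-entry search in this sketch.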
