A summary of speech coding techniques: AMR-NB, AMR-WB, and EVS

Date: 2024-04-16

I’ve recently become interested in real-time speech coding, so I did some reading on it.
I first came across AMR-NB narrowband coding, and while searching I found several more coding techniques; they are summarized here for future reference.

I. What are AMR and AMR-WB?
AMR stands for Adaptive Multi-Rate, and AMR-WB for Adaptive Multi-Rate Wideband. They are mainly used for audio on mobile devices. Their compression ratio is high, so the quality is relatively poor compared to other compression formats, but since they are mostly used for voice and calls, the results are still very good.

AMR: also known as AMR-NB (narrowband), as opposed to the WB variant below
Voice bandwidth range: 300-3400 Hz
8 kHz sampling rate

AMR-WB: AMR WideBand
Voice bandwidth range: 50-7000 Hz
16 kHz sampling rate
AMR-WB stands for “Adaptive Multi-Rate – Wideband”. With a sampling frequency of 16 kHz, AMR-WB is a wideband speech coding standard adopted by both the ITU-T (International Telecommunication Union’s Telecommunication Standardization Sector) and 3GPP, and is also known as the G.722.2 standard. AMR-WB provides a voice bandwidth of 50-7000 Hz, so speech subjectively sounds more natural, more comfortable, and easier to distinguish than before.


Reference source:

1.Comparison of AMR-NB, AMR-WB, EVS speech coding
https://www.txrjy.com/thread-1030405-1-1.html

2.Audio AMR-NB, AMR-WB
https://blog.csdn.net/weixin_45312249/article/details/120508280


3. EVS, the state-of-the-art audio codec in mobile communications, and the work needed to use it well
https://www.cnblogs.com/talkaudiodev/p/9074554.html
(This one is the most detailed.) The full text follows:

Voice communication was originally wired only; later wired and wireless (mobile) communication competed, and once the price of mobile voice came down, wired voice was clearly at a disadvantage. Today the competitor of mobile voice is OTT (Over The Top) voice, a service provided by Internet companies, generally free of charge, such as WeChat voice. Voice communication technology is thus currently split into two camps, the traditional telecom camp and the Internet camp, whose competition drives the technology forward. On the codec side, the Internet camp proposed the audio codec OPUS (jointly led by the non-profit Xiph.Org Foundation, Skype, Mozilla, and others), a full-band (8 kHz to 48 kHz) codec that supports both voice and music (SILK for voice, CELT for music). It has been accepted by the IETF as the standard voice codec for the Internet (RFC 6716) and is supported by the vast majority of OTT voice apps, so the Internet camp is trending toward unification around it. To meet this competition, the mobile communications standards body 3GPP proposed EVS (Enhanced Voice Services), an audio codec that likewise covers both voice and music. I have successfully added EVS to my phone platform, and it passed China Mobile’s live-network test. The following describes this codec and the work needed to use it.

The EVS codec was standardized by 3GPP in September 2014 and is defined in 3GPP Release 12. It is intended primarily for VoLTE, but also for VoWiFi and fixed-line VoIP. It was developed by a consortium of operators, terminal, infrastructure and chip vendors, and experts in voice and audio coding, including Ericsson, the Fraunhofer Institute for Integrated Circuits, Huawei Technologies, Nokia, Nippon Telegraph and Telephone Corporation (NTT), NTT DOCOMO, Orange (France Telecom), Panasonic, Qualcomm, Samsung Electronics, VoiceAge, and ZTE. It is 3GPP’s best-performing, highest-quality speech and audio codec so far: it is full band (8 kHz to 48 kHz), works over a bit-rate range of 5.9 kbps to 128 kbps, provides very high audio quality for both speech and music signals, and is highly robust against frame loss and delay jitter, bringing users a new experience.

The following figure shows the 3GPP EVS-related specs, from TS 26.441 to TS 26.451.

I have marked the key ones with red boxes. TS 26.441 is the general overview. TS 26.442 is the fixed-point implementation (reference code) written in C, which is the top priority in the later work of using EVS. TS 26.444 contains the test sequences: while optimizing the reference code I saved an optimized version almost every day and ran it against the test sequences; if the output differed, the optimization had introduced a problem, and I went back to the previous version to find which optimization step was wrong. TS 26.445 is the detailed description of the EVS algorithm, nearly 700 pages long and, to be honest, headache-inducing. If you are not working on the algorithms themselves, a general read of the algorithm parts is enough, but the feature descriptions must be read carefully.

EVS uses different coders for speech signals and music signals. The speech coder is based on ACELP (Algebraic Code-Excited Linear Prediction), with linear-prediction modes adapted to different speech classes. Music signals are coded in the frequency domain (MDCT), with special attention to frequency-domain coding efficiency at low delay and low bit rate, so that switching between the speech processor and the audio processor is seamless and reliable. The following figure shows the block diagram of the EVS codec:

Encoding preprocesses the input PCM signal and decides whether it is a speech signal or an audio (music) signal. A speech signal is encoded with the speech encoder to get a bitstream; an audio signal is encoded with the perceptual audio encoder. Decoding uses the information in the bitstream to determine whether the frame is speech or audio: speech is decoded with the speech decoder to get PCM data and then speech bandwidth extension is applied; audio is decoded with the perceptual audio decoder and then frequency bandwidth extension is applied. Finally, post-processing produces the EVS decoder output.

Below is a description of each of the key features of EVS.

1. EVS supports full band (8 kHz-48 kHz) with a bit-rate range of 5.9 kbps to 128 kbps. Each frame is 20 ms long. The following figure shows the distribution of audio bandwidths:

Narrowband (NB) covers 300 Hz-3400 Hz, with a corresponding sampling rate of 8 kHz; AMR-NB uses this sampling rate. Wideband (WB) covers 50 Hz-7000 Hz, with a corresponding sampling rate of 16 kHz; AMR-WB uses this sampling rate. Super-wideband (SWB) covers 20 Hz-14000 Hz, with a corresponding sampling rate of 32 kHz. Fullband (FB) covers 20 Hz-20000 Hz, with a corresponding sampling rate of 48 kHz. EVS supports full band, so it supports four sampling rates: 8 kHz, 16 kHz, 32 kHz, and 48 kHz.
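As a quick sanity check on the band/sampling-rate pairs above: each band’s upper edge must sit below half its sampling rate (the Nyquist limit). A minimal sketch:

```python
# Audio bands as listed above: (low Hz, high Hz, sampling rate Hz).
BANDS = {
    "NB":  (300, 3400, 8000),
    "WB":  (50, 7000, 16000),
    "SWB": (20, 14000, 32000),
    "FB":  (20, 20000, 48000),
}

# Sampling at fs can only represent content up to fs/2 (Nyquist),
# which is why each band pairs with the sampling rate shown.
for name, (lo, hi, fs) in BANDS.items():
    assert hi <= fs / 2, name
print("all bands fit under Nyquist")
```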

The following chart shows the supported bit rates at various sample rates:

As the chart shows, all bit rates are supported only at WB; the other sampling rates support only a subset. Note that EVS is backward compatible with AMR-WB, so it also supports all the AMR-WB bit rates.
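Since every EVS frame is 20 ms (50 frames per second), the number of bits carried per frame follows directly from the bit rate; a minimal sketch:

```python
FRAMES_PER_SEC = 50  # 20 ms frames

def bits_per_frame(bitrate_bps):
    """Bits carried in a single 20 ms frame at the given bit rate."""
    return bitrate_bps // FRAMES_PER_SEC

print(bits_per_frame(8000))    # 160 bits (the example used later for G.192)
print(bits_per_frame(128000))  # 2560 bits at the top EVS rate
```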

2. EVS supports DTX/VAD/CNG/SID, as AMR-WB does. During a call you usually talk about half the time and listen the rest; there is no need to send voice packets while listening, hence DTX (discontinuous transmission). A VAD (Voice Activity Detection) algorithm decides whether the input is voice or silence: a voice packet is sent for voice, and a silence descriptor (SID) packet for silence. When the other side receives a SID packet, a CNG (comfort noise generation) algorithm generates comfort noise. EVS has two CNG algorithms: linear-prediction-domain CNG and frequency-domain CNG. The SID sending mechanism in EVS differs from AMR-WB’s. In AMR-WB, when the VAD detects silence it sends a SID packet, sends a second SID after 40 ms, and then one SID every 160 ms; as soon as the VAD detects voice it sends voice packets again. In EVS the SID sending mechanism is configurable: a SID can be sent at a fixed interval (every few frames, in the range 3-100), or adaptively according to the SNR, with the sending period ranging from 8 to 50 frames. The SID payload size also differs: AMR-WB’s is 40 bits (at 50 frames per second, 50 × 40 = 2000 bps), while EVS’s is 48 bits (50 × 48 = 2400 bps). From the above you can see that DTX has two benefits: it saves bandwidth and increases capacity, and skipping encoding/decoding reduces computation, which in turn reduces power consumption and extends battery life.
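The bandwidth saving from DTX can be estimated with simple arithmetic. A hedged sketch (the 13.2 kbps rate, 8-frame SID period, and 50% voice activity are illustrative choices, not fixed by the spec):

```python
FRAMES_PER_SEC = 50  # 20 ms frames

def dtx_average_bps(voice_bps, sid_bits, sid_period_frames, activity=0.5):
    """Average bit rate with DTX: full-rate voice packets while talking,
    one SID packet every sid_period_frames while silent.
    activity is the fraction of time spent talking."""
    sid_bps = sid_bits * FRAMES_PER_SEC / sid_period_frames
    return activity * voice_bps + (1 - activity) * sid_bps

# EVS at 13.2 kbps, 48-bit SID every 8 frames, talking half the time:
print(dtx_average_bps(13200, 48, 8))  # 6750.0 bps on average
```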

3. EVS also supports PLC (packet loss concealment), as AMR-WB does. However, EVS includes a jitter buffer module (JBM), which no previous codec has included. I didn’t use the JBM in my work and didn’t have time to study it. It deserves study later: the jitter buffer is one of the hard parts of voice communication and one of the bottlenecks of voice quality.

The algorithmic delay of EVS varies with the sampling rate. For WB/SWB/FB the total delay is 32 ms: 20 ms for one frame, 0.9375 ms for input resampling plus 8.75 ms of look-ahead on the encoder side, and 2.3125 ms for time-domain bandwidth extension on the decoder side. For NB the total delay drops to 30.9375 ms, 1.0625 ms less than WB/SWB/FB, with the reduction mainly on the decoder side.
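The WB/SWB/FB delay budget quoted above adds up exactly; reconstructing it (all values in ms):

```python
# Components of the EVS algorithmic delay at WB/SWB/FB, as quoted above.
frame             = 20.0     # one frame
input_resample    = 0.9375   # input resampling, encoder side
encoder_lookahead = 8.75     # encoder-side look-ahead
decoder_tbe       = 2.3125   # time-domain bandwidth extension, decoder side

total_wb = frame + input_resample + encoder_lookahead + decoder_tbe
print(total_wb)           # 32.0 ms
print(total_wb - 1.0625)  # 30.9375 ms, the NB total
```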

The voice quality (MOS) of EVS is significantly better than that of AMR-NB/AMR-WB. The following chart compares the MOS values of these codecs:

As the figure shows, at NB the MOS of EVS-NB is significantly higher than that of AMR-NB at every bit rate; at WB the MOS of EVS-WB is significantly higher than that of AMR-WB at every bit rate; and the MOS of EVS-SWB is close to that of uncoded PCM. The voice quality of EVS is clearly quite good.

The work needed to use EVS well differs from platform to platform; I used it on a phone platform’s audio DSP for voice communication. Here is what I did to support EVS on the phone.

1. Study the EVS specs. I read through all the specs listed earlier. Since I wasn’t implementing the algorithms, I read the algorithm parts coarsely, but the feature descriptions had to be read carefully, as they relate to later use.

2. Build the encoder/decoder applications on a PC. I did this on Ubuntu: take a PCM file as encoder input, generate the corresponding bitstream files under different configurations, then feed a bitstream file to the decoder and decode it back to a PCM file. If the decoded PCM file sounds no different from the original, the toolchain is set up correctly (the reference implementation from the standards body is trustworthy; if there is a difference, the application was built wrong). Building the applications prepares for the later optimization work and also helps in understanding the surrounding implementation, for example how encoded values are turned into a bitstream. The encoded values are stored in indices (up to 1953 of them); each index has two member variables: nb_bits, which says how many bits the index holds, and value, which holds its value. There are two storage formats for the indices: G.192 (ITU-T) and MIME (Multipurpose Internet Mail Extensions). Let’s look at G.192 first; its per-frame storage format is shown in the following figure:

The first word is the synchronization value, either good frame (0x6B21) or bad frame (0x6B20); the second word is the length; then each bit of the indices’ values follows as one word, with a 1 bit stored as 0x0081 and a 0 bit stored as 0x007F. As an example, at a sampling rate of 16000 Hz and a bit rate of 8000 bps, a frame holds 160 bits (160 = 8000/50), which in G.192 format is saved as 160 words. In the figure below the header is 0x6B21, meaning a good frame, the length is 0x00A0, and the following 160 words are the content.
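The G.192 layout just described is easy to emulate. A minimal writer sketch (16-bit words; little-endian is assumed here, while the real files use the host’s word order):

```python
import struct

GOOD_FRAME, BAD_FRAME = 0x6B21, 0x6B20
BIT_ONE, BIT_ZERO = 0x0081, 0x007F

def g192_frame(bits, good=True):
    """One G.192 frame: sync word, length word, then one 16-bit word
    per payload bit (1 -> 0x0081, 0 -> 0x007F)."""
    words = [GOOD_FRAME if good else BAD_FRAME, len(bits)]
    words += [BIT_ONE if b else BIT_ZERO for b in bits]
    return struct.pack("<%dH" % len(words), *words)

# The 160-bit example above (16 kHz input encoded at 8000 bps):
frame = g192_frame([1, 0] * 80)
print(len(frame))  # (2 + 160) words * 2 bytes = 324 bytes
```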

Now the MIME format. Here the values of the indices are packed into serial bytes; for the details of the packing, see the pack_bit() function. In MIME format the first byte of each frame is the header (the lower 4 bits are the bit-rate index; when using WB IO, bits 5 and 6 should be set to 1, which EVS itself does not need), followed by the bitstream. Taking the same example of a 16000 Hz sampling rate at 8000 bps, a 160-bit frame stored in MIME format needs 20 bytes (20 = 160/8), as shown below:

In the figure above, the first 16 bytes are the file header written by the reference code; the 17th byte is the frame header, where 0x2 means the frame is EVS-coded at 8 kbps (the index for 8 kbps is 2); the following 20 bytes are the packed payload.
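The packing step itself can be sketched as follows (MSB-first bit order is assumed here for illustration; the authoritative behavior is the reference code’s pack_bit()):

```python
def pack_bits(bits):
    """Pack a flat list of 0/1 bits into bytes, MSB-first (assumed order)."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

# The same 160-bit / 8000 bps frame packs into 20 payload bytes:
payload = pack_bits([1, 0] * 80)
print(len(payload))     # 20
print(hex(payload[0]))  # 0xaa (bit pattern 10101010)
```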

In voice communication, the values of the indices are packed into serial bytes and sent to the other side as the payload; the receiver unpacks and decodes them to recover the PCM samples.
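Upstream of either storage format, the indices themselves (the nb_bits/value pairs described above) first have to be flattened into a bit list. A hypothetical sketch (the function name and MSB-first order are illustrative, not from the reference code):

```python
def indices_to_bits(indices):
    """Flatten (nb_bits, value) pairs into a flat bit list, emitting each
    value MSB-first in exactly nb_bits bits."""
    bits = []
    for nb_bits, value in indices:
        for shift in range(nb_bits - 1, -1, -1):
            bits.append((value >> shift) & 1)
    return bits

# A 3-bit index holding 0b101, then a 2-bit index holding 0b01:
print(indices_to_bits([(3, 0b101), (2, 0b01)]))  # [1, 0, 1, 0, 1]
```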

3. The original reference code usually cannot be used directly and needs optimization. For how to optimize, see an earlier article of mine (on audio codecs and optimization methods and experience), which covers the general approach. I needed to run it on a DSP with a low clock, only three-hundred-odd MHz, which cannot cope without assembly optimization. I had never used this DSP’s assembly, and optimizing in a short time would have been very difficult, so the boss weighed the options and decided to use the optimized library provided by the DSP IP vendor, who is more expert in assembly.

4. Adapt the reference code’s applications so they can be used later as debugging and verification tools. The original reference code saves files in units of bytes, while the DSP works in units of words (two bytes), so the pack/unpack functions in the reference code must be modified to suit the DSP.

5. To make calls use the EVS codec, the audio DSP and the CP both have to add the corresponding code: first write and debug your own side, then debug jointly. I debugged my side inside the AMR-WB shell (since both EVS and AMR-WB use 20 ms frames), i.e. following the AMR-WB call flow but swapping the codec from AMR-WB to EVS, mainly to verify that the encoder, pack, unpack, and decoder are all OK. The encoder and pack are on the uplink; unpack and the decoder are on the downlink. Their order is as follows:

First the uplink. Save the encoder output in G.192 format, decode it to PCM with the decoder tool, and listen with CoolEdit; if it matches what was spoken, the encoder is OK. Then pack: save the packed bitstream in MIME format, again decode it to PCM with the decoder tool and listen with CoolEdit; if it matches what was spoken, pack is OK. Then the downlink. Since the CP could not yet deliver a correct EVS stream to the audio DSP, I debugged with a loopback: feed the packed bitstream into unpack, save the unpacked result in G.192 format, decode it to PCM with the decoder tool and listen with CoolEdit; if it matches what was spoken, unpack is OK. Finally the decoder: listen with CoolEdit to the PCM it produces; if it matches what was spoken, the decoder is OK. That completes the self-debugging.
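The self-debugging chain above, encoder → pack → unpack → decoder with the uplink’s packed output looped back into the downlink, can be pictured with identity stand-ins (toy functions purely to show the data flow, not the real codec):

```python
# Toy stand-ins for the four stages being verified in the loopback test.
def encoder(pcm):   return [1 if s > 0 else 0 for s in pcm]  # toy "bitstream"
def pack(bits):     return bytes(bits)                       # uplink: bits -> payload
def unpack(buf):    return list(buf)                         # downlink: payload -> bits
def decoder(bits):  return [1 if b else -1 for b in bits]    # toy PCM reconstruction

pcm_in  = [3, -2, 5, -7]
pcm_out = decoder(unpack(pack(encoder(pcm_in))))  # full loopback path
print(pcm_out)  # [1, -1, 1, -1]
```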

6. Joint debugging with the CP. Since the key modules had already been verified during self-debugging, things went relatively smoothly and joint debugging was finished within a few days. After that, phone calls can enjoy the high sound quality that EVS brings.
