Speech Quality Assessment of VoIP: G.711 VS G.722 Based on Interview Tests with Thai Users

Authors: Therdpong Daengsi, Chai Wutiwiwatchai, Apiruck Preechayasomboon, Saowanit Sukparungsee

Journal: International Journal of Information Technology and Computer Science (IJITCS)

Issue: Vol. 4, No. 2, 2012.

Open access

This paper compares two codecs, G.711 and G.722 at 64 kbps, in terms of perceived speech quality, using a subjective method called the interview test. To make the results accurate and reliable, the tests were conducted with 201 subjects, all native speakers of Thai, which is a tonal language. The results for the two codecs are almost the same: the scores are 4.17 for G.722 and 4.14 for G.711. Analysis confirms that G.722 does not provide significantly better speech quality than G.711 for the Thai subjects, which is consistent with previous information. Nevertheless, these results can serve as a benchmark for G.711 and G.722 in speech quality assessment within Thai environments.

Keywords: VoIP, speech quality assessment/evaluation/measurement, subjective methods/tests, G.711, G.722, Thai

Short URL: https://sciup.org/15011658

IDR: 15011658

Published Online March 2012 in MECS

1. Introduction

At present, Voice over IP (VoIP) is one of the most popular services worldwide because it costs less than traditional telephone services, particularly for international calls. However, the Internet Protocol was originally designed for data communication, not for speech or voice communication, which requires real-time support.

VoIP, a modern telecommunication technology, resembles traditional telephony in that it mainly provides voice services over a narrow band, covering frequencies of about 300-3,400 Hz [1]. To compensate for the loss of speech quality caused by the limitations of carrying voice in packets, wideband codecs that support frequencies up to 7,000 Hz [1] are applied. To examine this within a Thai environment, G.722 (a wideband codec at 64 kbps) was selected for this study and compared with G.711 (a narrowband codec), which uses the same bit rate and is generally used in LANs [2-3].

The remainder of this paper is organized as follows: Section 2 gives background information. Section 3 describes the methodology. Section 4 presents the results, and Section 5 provides the analysis and discussion. Finally, Section 6 concludes the paper.

2. Background

2.1 Why Thai Users?

Thai people use the Thai language, which is a tonal language, similar to Chinese. Every tonal language has a special characteristic called a tonal feature: the tone changes the meaning of a word. For example, the Thai word “ปา” means “to throw”, “ป่า” means “a forest”, and “ป้า” means “an aunt”. In contrast, changing the tone in a non-tonal language, such as English, does not change the meaning of a word. Thai has five tones: middle, low, falling, high and rising [4-5], as shown in Fig 1.

2.2 Speech Quality in Telecommunication Networks

The term ‘quality’ is quite subjective and ambiguous. In telecommunication networks, however, speech or voice quality can be described as the result of the subjective judgment of users who perceive the speech delivered over the network [6]. Speech quality is affected not only by network conditions but also by other factors, such as expectation, naturalness, speech characteristics and conversational effort, as presented in Fig 2 [7].

2.3 The Metric of Speech Quality

Mean Opinion Score (MOS), the benchmark for speech quality, is the official scale of speech quality issued by ITU-T [8-9]. It has been described as the most reliable metric of service quality at the end point, that is, of the end user’s QoE of VoIP [10]. Normally, MOS is the average of the values that subjects give on a predefined scale, called the opinion score, as in Table 1. Subjects are asked for their opinion of the performance of the telephone system and/or telephone network [9]. MOS is reported as MOS-LQ or MOS-CQ, denoting Listening Quality and Conversational Quality respectively [11].

2.4 VoIP Overview

VoIP is a modern telecommunication system that emerged after the development of the Internet, and it uses the Internet Protocol to carry voice packets. Its architecture is presented in Fig 3 [12]. The main parts of a VoIP system are the IP signaling protocol, QoS mechanisms, header compression and codecs. More information can be found in the paper “VoIP: A comprehensive survey on a promising technology” [12].

In Thailand, the main organization that promotes VoIP technology is the National Broadcasting and Telecommunication Commission (NBTC), formerly the National Telecommunications Commission (NTC) [13]. NTC has collaborated with King Mongkut’s University of Technology Thonburi to develop tools for VoIP system development, called AsteriskNow for Thailand (ANT) and the new Asterisk Appliance [14], based on Asterisk, an open-source software package that is widely used in several related areas. In particular, it has been used by students, academics and researchers for development and research [15-21].

Fig 1: An example of fundamental frequency contours of five Thai tones [5].

Fig 2: Influencing factors of speech quality.

Fig 3: VoIP architecture

Table 1: Opinion scores and their meaning

Opinion Score	Meaning
5	Excellent
4	Good
3	Fair
2	Poor
1	Bad
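
As a minimal sketch of how a MOS is obtained from the scale in Table 1, the following Python snippet averages a list of opinion scores. The votes are hypothetical and purely illustrative; they are not data from this study, and the function name is an assumption made for the example.

```python
from collections import Counter
from statistics import mean

# Opinion-score labels from Table 1.
OPINION_SCALE = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

def mean_opinion_score(votes):
    """Return the arithmetic mean of a list of opinion scores (1-5)."""
    invalid = [v for v in votes if v not in OPINION_SCALE]
    if invalid:
        raise ValueError(f"votes outside the 1-5 scale: {invalid}")
    return mean(votes)

# Hypothetical votes from ten subjects (illustrative only, not study data).
example_votes = [5, 4, 4, 4, 3, 5, 4, 4, 3, 4]
mos = mean_opinion_score(example_votes)
distribution = Counter(example_votes)

print(f"MOS = {mos:.2f}")
for score in sorted(OPINION_SCALE, reverse=True):
    print(f"{score} ({OPINION_SCALE[score]}): {distribution[score]} votes")
```

Keeping the vote distribution alongside the mean is a deliberate choice: two conditions can share the same MOS while differing in how the votes are spread across the scale.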

Fig 4: Overview of subjective methods for speech quality assessment

2.5 Subjective Speech Quality Assessment Methods

It has been reported that the results of subjective speech quality assessment are highly accurate and reliable [22-23]. Subjective methods are mainly divided into four types, as shown in Fig 4. Conversational opinion tests require well-controlled conditions and an appropriate test setup, in particular two separate soundproof rooms. Listening tests, for which the Absolute Category Rating (ACR) method is recommended [9], do not offer the same realism; for example, they do not cover long-delay situations [24-25]. Like conversational opinion tests, listening tests also require well-controlled conditions, including a laboratory and good speech materials. When two soundproof rooms cannot be provided for conversational opinion tests, or good-quality speech materials are not available for listening tests, interview and survey tests are recommended [9]. This must, however, be compensated for by testing with at least 100 subjects per condition [9].

2.6 Speech Codecs

For VoIP, codec selection is very important, because different codecs, which differ mainly in bit rate and speech coding algorithm (including quantization), provide different levels of speech quality [12, 26]. Codecs are classified into narrowband codecs, broadband or wideband codecs, and multi-mode codecs. Because this study focuses on the perception of speech quality with G.711 and G.722, only these two codecs are described here [2-3, 9, 12]:

1) G.711: the original codec, widely used in ISDN during the growth of digital telephony. It uses Pulse Code Modulation (PCM) and requires a bit rate of 64 kbps. There are two variants, G.711 µ-law and G.711 A-law. The µ-law is mainly used in North America and Japan, whereas the A-law is widely used in the rest of the world, including Thailand and other parts of Asia. Its MOS is about 4.1.
2) G.722: an audio codec that can be used for a variety of higher-quality speech applications. It uses Sub-Band Adaptive Differential Pulse Code Modulation (SB-ADPCM) and basically requires a bit rate of 64 kbps, like G.711. It was the first wideband codec issued by ITU-T and supports a bandwidth of up to 7,000 Hz. Related codecs such as G.722.1 and G.722.2 also exist, but they use different bit rates, not 64 kbps. However, the speech quality provided by G.722 does not appear to be significantly better than that of G.711, since its MOS is also about 4.1.

2.7 A Survey of Previous Works

Some previous works similar to this research are briefly summarized as follows:

1) J.-H. Chen and J. Thyssen compared two codecs, BV16 and BV32, with several other codecs, including a comparison of BV16 with G.711 µ-law and of BV32 with G.722 at several bit rates [27]. However, the results of the two listening tests did not show that G.722 at 64 kbps is significantly better than G.711 µ-law.
2) Z. Cai, N. Kitawaki, T. Yamada and S. Makino compared MOS evaluation characteristics for Chinese, Japanese and English with the G.722 family, but the paper excluded G.711 and Thai [28].
3) ITU-T issued a test plan for G.722 and G.711.1, but both are wideband [29].
4) A. Takahashi, A. Kurashima and H. Aoki described a method for estimating the subjective quality of wideband codecs and presented the relationship between estimated and subjective MOS values for codecs including G.722 and G.711, but the paper did not focus on the MOS comparison [30].
5) M. Graubner, P.S. Mogre, R. Steinmetz and T. Lorenzen presented a new QoE model and proposed an objective QoE metric for assessing listening quality. The paper includes a figure implying that G.722 is significantly better than G.711, but it does not explain this in detail [31].
6) M. N. Ismail reported a study of codec selection for wireless networks, including G.711 and G.722, but it did not suggest that G.722 is significantly better than G.711 in terms of MOS. Moreover, the results came from only 5 users [32].
7) L. Miao et al. focused on superwideband codecs, excluding G.711 and G.722, and did not cover MOS [33].
8) Psytechnics compared VoIP client performance with several codecs, e.g. G.722 at 56 kbps and G.711 µ-law, but the report excluded G.722 at 64 kbps and G.711 A-law [34].
Thus, no previous work has focused on comparing the speech quality perception provided by G.711 A-law and G.722 using a subjective method, particularly with Thai native speakers.

3. Methodology

This section describes the methodology used to conduct the interview test with Thai subjects.

3.1 Purpose

The purpose of this interview test is to assess the speech quality perception of Thai subjects for G.722 at 64 kbps and G.711 A-law, both of which require the same bandwidth of 64 kbps for the payload, and then to compare the results to discover whether G.722 at 64 kbps provides better speech quality than G.711 A-law. The interview test was used instead of the ACR test, which could not be conducted because the available VoIP system is limited in playing wideband audio files as speech materials, and instead of the conversational opinion test, which requires two soundproof rooms.

3.2 Test Facilities

The set of test facilities consisted of:

1) 1 laboratory with the acoustic properties of a good soundproof room (e.g. room noise < 35 dBA and reverberation time of 200-300 ms). The studio room at the Central Library, King Mongkut’s University of Technology North Bangkok (KMUTNB), was therefore selected.
2) 1 IP network, including a switch.
3) 1 VoIP system, implemented by using Asterisk open-source software, version 1.6.2.
4) 2 IP phones, supporting SIP.
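
Both codecs under test place a 64 kbps payload on the IP network listed above. The following Python sketch estimates the resulting per-call packet size and IP-level bit rate. The 20 ms packetization interval and the IPv4/UDP/RTP header sizes without header compression are assumptions made for illustration; the paper does not state the packetization settings used, and Ethernet framing overhead is ignored.

```python
# Per-call packet size and IP-level bit rate for a 64 kbps codec payload.
# Assumptions (not from the paper): 20 ms packetization, IPv4 + UDP + RTP
# headers, no header compression, no Ethernet framing overhead.

CODEC_BIT_RATE_BPS = 64_000   # payload bit rate of G.711 and of G.722 at 64 kbps
PACKET_INTERVAL_S = 0.020     # assumed packetization interval
HEADER_BYTES = 20 + 8 + 12    # IPv4 + UDP + RTP fixed headers

def per_call_load(bit_rate_bps, interval_s, header_bytes):
    """Return (payload bytes, packet bytes, packets/s, IP-level bit rate in bps)."""
    payload_bytes = round(bit_rate_bps * interval_s / 8)
    packet_bytes = payload_bytes + header_bytes
    packets_per_second = round(1 / interval_s)
    ip_bit_rate_bps = packet_bytes * 8 * packets_per_second
    return payload_bytes, packet_bytes, packets_per_second, ip_bit_rate_bps

payload, packet, pps, ip_rate = per_call_load(
    CODEC_BIT_RATE_BPS, PACKET_INTERVAL_S, HEADER_BYTES
)
print(f"payload per packet: {payload} bytes")
print(f"IP packet size:     {packet} bytes")
print(f"packets per second: {pps}")
print(f"IP-level bit rate:  {ip_rate / 1000:.0f} kbps per direction")
```

Under these assumptions the two codecs load the network identically, so a difference in perceived quality between them would come from the coding itself rather than from the bandwidth consumed.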
3.3 Test Condition and Experimental Design

The condition variable for this test is the codec: G.711 A-law, the only G.711 variant used in Thailand, and G.722 at 64 kbps. All other conditions were kept as good as the ‘real’ VoIP system could provide. The test was designed to interview at least 100 subjects per codec, and each interviewee was interviewed for about 3-4 minutes.

3.4 Subjects

At least 200 subjects were required to represent a group of Thai native listeners. They were intended to be students from KMUTNB, although members of the general Thai public who were interested in the research were also welcome. The subjects were therefore fairly homogeneous in age range, educational background in science and technology, and nationality.

3.5 Tasks and Data Gathering

Interviewees were invited to sit in the room one at a time. An interviewer outside the room (either male or female) then made a call and conducted the interview, as in Fig 5, for 3-4 minutes. Before the interview ended, the interviewee was asked to rate the speech quality provided with G.711 or G.722, using the scale in Table 1. The interviewer recorded the data from all subjects on a paper-based form.

4. Results

G.711 was assessed by 60 male and 40 female subjects with an average age of 20.92 years and a standard deviation (StDev) of 3.36 years, whereas G.722 was assessed by 49 male and 52 female subjects with an average age of 21.17 years and a standard deviation of 1.94 years. All 201 subjects were students from KMUTNB, except one person who was a member of the public. The average age of all participants was 21.04 years with a standard deviation of 2.73 years. The results are reported as MOS-CQS, because the interview test is equivalent to the conversational opinion test, and are presented in Table 2 and Fig 6. In addition, the comparison of results between the two IP phones and the possible influence of gender are presented in Fig 7-10.
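
The overall age figures can be cross-checked from the two per-codec groups. The sketch below combines group sizes, means and standard deviations into an overall mean and sample standard deviation. It uses only the rounded values reported above and assumes the reported StDev values are sample standard deviations, so differences of about 0.01 from the reported overall figures are expected.

```python
from math import sqrt

def pooled_mean_and_sd(groups):
    """Combine per-group (n, mean, sd) into an overall n, mean and sample SD."""
    total_n = sum(n for n, _, _ in groups)
    overall_mean = sum(n * m for n, m, _ in groups) / total_n
    # Sum of squares within each group plus sum of squares between group means.
    within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    between = sum(n * (m - overall_mean) ** 2 for n, m, _ in groups)
    overall_sd = sqrt((within + between) / (total_n - 1))
    return total_n, overall_mean, overall_sd

# (number of subjects, mean age, StDev of age) for the G.711 and G.722 groups.
AGE_GROUPS = [(100, 20.92, 3.36), (101, 21.17, 1.94)]

n_all, mean_age, sd_age = pooled_mean_and_sd(AGE_GROUPS)
print(f"n = {n_all}, mean age = {mean_age:.2f}, StDev = {sd_age:.2f}")
```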

Fig 5: Overview of the test facilities

Table 2: Results

Codec	No. of Subjects	MOS-CQS	StDev
G.711	100	4.14	0.60
G.722 at 64 kbps	101	4.17	0.62
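
To give a feel for the uncertainty around the two scores in Table 2, the following Python sketch computes approximate 95% confidence intervals from the reported sample sizes, means and standard deviations. It is an illustration based on a normal approximation (z = 1.96), which is an assumption of this example and not necessarily the procedure used in the study.

```python
from math import sqrt

Z_95 = 1.96  # normal critical value for a two-sided 95% interval

# (codec, number of subjects, MOS-CQS, StDev) as reported in Table 2.
TABLE_2 = [
    ("G.711", 100, 4.14, 0.60),
    ("G.722 at 64 kbps", 101, 4.17, 0.62),
]

def confidence_interval(n, mean, sd, z=Z_95):
    """Return (low, high) bounds of an approximate confidence interval for a mean."""
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width

intervals = {codec: confidence_interval(n, m, sd) for codec, n, m, sd in TABLE_2}
for codec, (low, high) in intervals.items():
    print(f"{codec}: 95% CI for MOS-CQS = [{low:.2f}, {high:.2f}]")

# The intervals overlap when each lower bound lies below the other upper bound.
(low_711, high_711), (low_722, high_722) = intervals.values()
print("intervals overlap:", low_711 <= high_722 and low_722 <= high_711)
```

With roughly 100 subjects per codec, each interval is about ±0.12 wide, several times larger than the 0.03 gap between the two scores.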

Fig 6: Comparison of the percentage of votes between G.711 and G.722 at 64 kbps by all participants.

Fig 7: Comparison of MOS between G.711 and G.722 by IP phone, where N = 47, 53, 53 and 48 and StDev = 0.54, 0.66, 0.65 and 0.57 for G.711 w/ Phone1, G.711 w/ Phone2, G.722 w/ Phone1 and G.722 w/ Phone2 respectively.

Fig 8: Comparison of MOS between G.711 and G.722 by gender of interviewees, where N = 40, 60, 52 and 49 and StDev = 0.62, 0.60, 0.62 and 0.61 for G.711 w/ female interviewees, G.711 w/ male interviewees, G.722 w/ female interviewees and G.722 w/ male interviewees respectively.

Fig 9: Comparison of MOS between G.711 and G.722 by gender of interviewers, where N = 35, 65, 35 and 66 and StDev = 0.52, 0.63, 0.58 and 0.63 for G.711 w/ female interviewers, G.711 w/ male interviewers, G.722 w/ female interviewers and G.722 w/ male interviewers respectively.

Fig 10: Comparison of MOS between G.711 and G.722 by gender of both interviewees and interviewers, where N = 18, 17, 22, 13, 22, 43, 30 and 36 and StDev = 0.57, 0.47, 0.49, 0.72, 0.61, 0.64, 0.65 and 0.61 for G.711 w/ female interviewee/female interviewer, G.711 w/ male interviewee/female interviewer, G.722 w/ female interviewee/female interviewer, G.722 w/ male interviewee/female interviewer, G.711 w/ female interviewee/male interviewer, G.711 w/ male interviewee/male interviewer, G.722 w/ female interviewee/male interviewer and G.722 w/ male interviewee/male interviewer respectively.

From the results in Table 2 and Fig 6, it can be seen that the speech quality perception score (MOS-CQS) of G.722 is slightly higher than that of G.711, as supposed, although the MOS-CQS of 4.17 for G.722 and 4.14 for G.711 are almost the same. Fig 7 shows that the MOS-CQS values for G.711 and G.722 with the two IP phones are also almost the same, although the result for G.722 with Phone2 is the highest at 4.27. Fig 8 shows that, with G.711, the result for female interviewees is slightly lower than that for male interviewees, whereas with G.722 the result for female interviewees is higher than that for male interviewees. In Fig 9, the results for G.711 and G.722 are consistent: the result with female interviewers is slightly higher than the result with male interviewers. Fig 10, an extension of Fig 8-9, shows that G.722 with female interviewee/female interviewer obtained the highest score of 4.36, whereas G.711 with female interviewee/male interviewer obtained the lowest score of 3.91, the only one lower than 4.

5. Analysis and Discussion

The overall results in each figure are almost the same. Therefore, Student’s t-test and ANOVA, at the 95% confidence level, were used to test the following hypotheses.

H1: The speech quality perception of Thai subjects/interviewees to G.711 and G.722 is the same or different

H2: The speech quality perception of Thai subjects/interviewees to different IP phone (under test) referring to G.711 and G.722 is the same or different

H3: The perception of different gender of Thai subjects/interviewees to G.711 and G.722 is the same or different

H4: The perception of Thai subjects/interviewees to different gender of interviewer referring to G.711 and G.722 is the same or different

H5: The perception of the same/opposite gender of subjects/interviewees and interviewers referring to G.711 and G.722 is the same or different.

The output from the Student’s t-test and ANOVA is shown in Table 3. The p-value for H1 is 0.743, which is well above 0.05. Therefore, there is no evidence of a difference between the speech quality perception scores, or QoE values, for G.711 and G.722. In other words, the speech quality perception of this group of Thai subjects for G.711 and G.722 is statistically the same. For H2, the test of variation between the two IP phones resulted in a p-value of 0.437, which means there is no significant difference. For H3, H4 and H5, the tests concerning the gender of interviewee and interviewer resulted in p-values of 0.427, 0.099 and 0.212 respectively, which also means there is no significant difference.
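
The H1 comparison can be roughly re-derived from the summary statistics in Table 2. The Python sketch below computes an equal-variance two-sample t statistic and approximates the two-sided p-value with a normal distribution, which is reasonable for about 200 degrees of freedom. Both the equal-variance form and the normal approximation are assumptions of this example. Because the inputs are the rounded values from Table 2 rather than the raw votes, the result should fall in the same region as the reported p-value of 0.743 without matching it exactly.

```python
from math import sqrt
from statistics import NormalDist

def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / std_err, df

def approx_two_sided_p(t_stat):
    """Normal approximation to the two-sided p-value (adequate for large df)."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

# G.711: n = 100, MOS-CQS = 4.14, StDev = 0.60; G.722: n = 101, 4.17, 0.62.
t_stat, df = pooled_t_from_summary(100, 4.14, 0.60, 101, 4.17, 0.62)
p_value = approx_two_sided_p(t_stat)
print(f"t = {t_stat:.3f}, df = {df}, approximate p = {p_value:.3f}")
```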

Although the human ear can hear a wide range of frequencies (20-20,000 Hz), it has long been known that most speech frequencies produced by human speakers do not exceed 4,000 Hz, which is within the narrow band. This may be the fundamental reason that the subjects did not perceive a difference in speech quality between G.711 and G.722.

Table 3: Hypothesis analysis results

Hypotheses	p-value
H1: MOS-CQS of G.711 VS G.722	0.743
H2: IP phone1 w/ G.711 VS IP phone1 w/ G.722 VS IP phone2 w/ G.711 VS IP phone2 w/ G.722	0.437
H3: Effect of interviewee gender on MOS-CQS of G.711 VS G.722	0.427
H4: Effect of interviewer gender on MOS-CQS of G.711 VS G.722	0.099
H5: Effect of same/opposite gender of interviewee and interviewer on MOS-CQS of G.711 VS G.722	0.212

Remark: Significant at p-value < 0.05
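
The decision rule in the remark above can be applied mechanically to the p-values in Table 3. The short Python sketch below does so; the dictionary and function names are illustrative, and no correction for multiple comparisons is applied because the paper does not mention one.

```python
ALPHA = 0.05  # significance threshold from the remark under Table 3

# p-values as reported in Table 3.
P_VALUES = {"H1": 0.743, "H2": 0.437, "H3": 0.427, "H4": 0.099, "H5": 0.212}

def is_significant(p_value, alpha=ALPHA):
    """A difference is called significant when the p-value is below alpha."""
    return p_value < alpha

decisions = {name: is_significant(p) for name, p in P_VALUES.items()}
for name, significant in decisions.items():
    verdict = "significant difference" if significant else "no significant difference"
    print(f"{name}: p = {P_VALUES[name]:.3f} -> {verdict}")

# The hypothesis closest to the threshold is the one with the smallest p-value.
closest = min(P_VALUES, key=P_VALUES.get)
print(f"closest to the threshold: {closest} (p = {P_VALUES[closest]:.3f})")
```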

6. Conclusion

After conducting the interview tests with a group of Thai native speakers, consisting of 109 male and 92 female subjects, and analyzing the results, it was found that the overall speech quality perception of Thai subjects for the G.711 narrowband codec and the G.722 wideband codec is not significantly different. This is inconsistent with the supposition that G.722 would provide better perceptual speech quality than G.711. Nevertheless, G.722 might be significantly better than G.711 for other applications, such as carrying music, which could be investigated in future work.

Acknowledgment

Thanks to Mr. Wiwat Suwanuntawong, a staff member of the Central Library, KMUTNB. Special thanks to the lecturers who asked their students to participate in the test, and thanks to the students who participated. Thanks also to Mr. Gary Sherriff for editing. Lastly, this paper is dedicated to Dr. Gareth Clayton, the advisor of the first author, who sadly passed away.

References

ITU-T Recommendation P.830. Subjective performance assessment of telephone-band and wideband digital codecs. 1996.
ITU-T Recommendation G.722. 7 kHz audio-coding within 64 kbit/s. 1988.
ITU-T Recommendation G.711. Pulse code modulation (PCM) of voice frequencies. 1988.
C. Wutiwiwatchai and S. Furui. Thai speech processing technology: A review. Speech communication, 2007, 49 : 8-27.
N. Thubthong. A study of various linguistic effects on tone recognition in Thai continuous speech. PhD. Dissertation, 2001, Chulalongkorn University, Thailand.
A. E. Mahdi and D. Picovici. Advances in voice quality measurement in modern telecommunications. Digital Signal Processing, 2009, 19 : 79-103.
H.W. Gierlich and F. Kettler. Advanced speech quality testing of modern telecommunication equipment: An overview. Signal Processing, 2006, 86 : 1327-1340.
T. Uhl. Quality of Service in VoIP Communication. AEU-Int. J. Electron. C., 2004, 58 : 178-182.
ITU-T Recommendation P.800. Methods for subjective determination of transmission quality. 1996.
S. Uemura, N. Fukumoto, H. Yamada, and H. Nakamura,"QoS/QoE measurement system implemented on cellular phone for NGN," In: Proceedings of IEEE Consumer Communications and Networking Conference, 2008, Las Vegas, NV, USA : 117-121.
ITU-T Recommendation P.800.1. Mean Opinion Score (MOS) terminology. 1996.
S. Karapantazis and F.-N. Pavlidou. Voip: A comprehensive survey on a promising technology. Computer Networks, 2009, 53(12) : 2050-2090.
T. Jaruvitayakovit. VoIP Status in Thailand. In: Proceedings of the 1st AUN/Seed-Net Electrical and Electronics Engineering Regional Conference, International Symposium on Multimedia and Communication Technology, 2009, Bangkok, Thailand : 128-130.
V. Vanijja. VoIP Software Using Open Source. http://www.tridi.nbtc.go.th/library/upload/c6.pdf, 2009, (accessed: December 25, 2010)
P. Chichareon, S. Kamolphiwong, S. Saewong and T. Kamolphiwong. Web based Configuration Manager for Asterisk Trunking System. In: Proceedings of the 8th PSU Engineering Conference, 2010, Songkla, Thailand : 198-203.
P. Ratsamimonthon, S. Saewong, T. Angchuan,C. Jantaraprim and S. Kamolphiwong. Service Integration of Voice Communication and Web based Conference. In: Proceedings of the 8th PSU Engineering Conference, 2010, Songkla, Thailand : 204-208.
P. Casaby and S. Puangpronpitag. Problem Evaluation of Security Issues in IP telephony Open Source software. In: Proceedings of the National Conference on Computer Information Technologies, 2010, Chanthaburi, Thailand : 33-38.
S. Toomwan. A Study of Voice over IP. MS. Thesis, 2010, Mahanakorn University of Technology, Thailand.
S. Jaksopha. VoIP Development for World Study Center Co.,Ltd.. MS. Thesis, 2010, Mahanakorn University of Technology, Thailand.
A. J. Johansen. Improvement of SPIT prevention technique based on Turing test. MS. Thesis, 2010, Mahanakorn University of Technology, Thailand.
S. Thewaphon. A Study of Voice over IP for Department of Agricultural Extension. MS. Thesis, 2010, Mahanakorn University of Technology, Thailand.
M. Goudarzi. Evaluation of Voice Quality in 3G Mobile Networks. MS. Thesis, 2008, University of Plymouth, United Kingdom.
T. A. Hall. Objective speech quality measures for Internet telephony. In: Proceedings of SPIE in Voice over IP VoIP Technology, 2001, Denver, CO, USA : 128-136.
Telchemy. Voice Quality Measurement. http://www.telchemy.com/appnotes/TelchemyVoiceQualityMeasurement.pdf, 2008 (accessed: December 10, 2011).
Tektronix. VoIP Service Quality Measurements. http://www.tektronixcommunications.com/sites/tektronixcommunications.com/files/assets/documents/2007.12.27.08.07.37_12403_EN.pdf, 2007 (accessed: December 10, 2011).
Q. Xiao, L Chen and Y. Wang. An Efficient Dimension Reduction Quantization Scheme for Speech Vocal Parameters. International Journal of Information Technology and Computer Science, 2011, 1 : 18-25.
J.-H. Chen and J. Thyssen, "The BroadVoice Speech Coding Algorithm," In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2007, Honolulu, HI, USA : IV-537 - IV-540.
Z. Cai, N. Kitawaki and T. Yamada, and S. Makino. Comparison of MOS Evaluation Characteristics for Chinese, Japanese, and English in IP Telephony. In: Proceedings of the 4th International Universal Communication Symposium, 2010, Beijing, China : 112-115.
ITU-T. Qualification Quality Assessment Test Plan for the joint superwideband extension of G.722 and G.711.1. http://www.ietf.org/mail-archive/web/codec/current/pdfwM2asIvaFq.pdf, 2008 (accessed: December 25, 2011).
A. Takahashi, A. Kurashima, and H. Aoki. Quality Assessment of Wideband Speech Communication Services. NTT Technical Review, 2006, 4(4) : 47-51.
M. Graubner, P.S. Mogre, R. Steinmetz, and T. Lorenzen. A New QoE Model and Evaluation Method for Broadcast Audio Contribution over IP. In: Proceedings of the 20th International Workshop on Network and Operating Systems Support for Digital Audio and Video, 2010, Amsterdam, Netherlands : 57-62.
M. N. Ismail. Best VoIP Codecs Selection for VoIP Conversation over Wireless Carriers Network. Annals. Computer Science Series, 2011, 9 : 57-66.
L. Miao et al. G.711.1 Annex D and G.722 Annex B: New ITU-T Superwideband Codecs. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, Prague, Czech Republic : 5232-5235.
Psytechnics. VoIP client benchmarking report. http://www.ucstrategies.com/uploadedFiles/UC_Information/White_Papers/Microsoft/VoIP_benchmarking_report.pdf, 2007 (accessed: December 10, 2011).
