5th ETSI Speech Quality Test Event Anonymous Test Report


Test Signals, Analyses and Test Conditions



Date: 06.08.2017

3 Test Signals, Analyses and Test Conditions

3.1 Test Setup


The speech quality tests during the 5th ETSI SQTE are carried out for IP gateways and IP phones. The test program represents the HEAD acoustics implementation of the current version of TS 101 329-5 [1] and is implemented in the test system ACQUA. The tests can be carried out in two basic configurations:

  • electrical - electrical connection (IP gateway to IP gateway, see figure 3.1)

  • acoustical - electrical connection (IP terminal to IP gateway, see figure 3.2)

The connection between the gateways and the test system is established through the ISDN simulators (Aethra D2000PRO, E1/ISDN DSS1 connection, A-law compression). Gateways providing POTS interfaces (analog 2-wire) are interconnected to the test system via appropriate POTS interfaces in the Aethra simulators.

To reproduce realistic conditions for acoustical quality measurements, dummy heads (Head and Torso Simulators, HATS, according to ITU-T P.58 [2]) are used to interconnect the IP phones. The HATS are equipped with an artificial mouth and artificial ears (type 3.4 according to ITU-T P.57 [3]). Handsets are positioned according to ITU-T P.64 [4], hands-free devices according to ITU-T P.340 [5].








Fig. 3.1: Gateway to gateway setup (here: 4-wire access)

Fig. 3.2: IP phone to gateway setup

For each configuration the measurements are subdivided into single talk tests, which determine listening speech quality scores, and conversational tests, which are designed to measure implementation parameters of the speech processing algorithms. Besides the listening situation (single talk in sending and receiving direction), these tests focus especially on echo performance, echo cancellation and echo suppression, double talk performance and background noise transmission. The tests make use of real speech, e.g. to calculate listening speech quality scores, of speech-like test signals according to ITU-T Recommendation P.501 [6], and of analysis methods as given in ITU-T Recommendation P.502 [7].

The combination of comprehensive quality analyses on the one hand and analytical tests of the implemented signal processing on the other hand is a powerful basis for potential quality improvement. Moreover, the tests were carried out under different IP impairments introduced by NISTnet.


Description of Test Signals and Analyses


Speech samples were transmitted and analyzed using the Telecommunications Objective Speech Quality Assessment method TOSQA2001 [8], [9] and PESQ according to ITU-T P.862 [10] and P.862.1 [11]. Both analysis methods lead to one-dimensional test results with a high correlation to auditorily perceived speech sound quality for one-way transmission. These methods have been validated for VoIP transmission scenarios and have been used successfully for quality assessment of recordings carried out at electrical interfaces during the 1st, 2nd, 3rd and 4th ETSI SQTE [12], [13], [14], [15], [16].

Note that PESQ can only be applied to recordings carried out at electrical interfaces, because it has not been validated for acoustical interfaces [10]. For recordings at the acoustical interface, therefore, only TOSQA2001 is used.

German speech samples are used for the analyses. Four concatenated speech files (32 seconds each) are transmitted over the connection under test.



Each of these files contains four different sentence pairs (8 seconds long each) uttered by different male and female speakers. These sentence pairs fulfill the requirements of ITU-T Recommendation P.800 [17] and provide 50% speech activity. Figure 3.3 shows the time signal and the typical structure of one of these 32 s speech files.

The speech samples used as electrical input signals are pre-filtered with a modified IRS (send) filter [18], and the active speech level [19] is adjusted to -16 dBm0. For all acoustical input signals, unfiltered speech material without any band limitation is used (ASL 89 dB(SPL) at the MRP of the HATS).



Fig. 3.3: Speech sample (example 1) for MOS testing

For each condition (e.g. network impairment), all four of these 32 s files are transmitted. The transmission is repeated until the defined packet loss is observed. For the calculation of TMOS and MOS-LQO values the 32 s files are divided into the original 8 s sentence pairs. To guarantee analysis conditions comparable to those of the 1st, 2nd, 3rd and 4th SQTE, each of the resulting 16 speech samples is assessed separately by TOSQA2001 and PESQ. The final result for each test condition is obtained by averaging the 16 individual quality scores. In addition to these analyses, the bandwidth was monitored on the IP side, which is especially interesting for VAD tests. These bandwidth figures are analyzed together with the TMOS or MOS-LQO results.
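The segmentation and averaging step described above can be sketched as follows. This is a minimal illustration only: `score_fn` is a hypothetical stand-in for a TOSQA2001 or PESQ call, and the function names and the sample rate are assumptions, not part of the ACQUA test system.

```python
from statistics import mean

SEGMENT_SECONDS = 8   # length of one sentence pair (from the text)
FILE_SECONDS = 32     # length of one concatenated speech file (from the text)


def split_into_sentence_pairs(samples, sample_rate):
    """Cut one 32 s recording into its four 8 s sentence pairs."""
    seg_len = SEGMENT_SECONDS * sample_rate
    n_segments = FILE_SECONDS // SEGMENT_SECONDS
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def condition_score(recordings, sample_rate, score_fn):
    """Average the per-segment scores over all recordings of one condition.

    score_fn is a hypothetical placeholder for the objective quality model.
    Returns the mean score and the number of segments that were scored.
    """
    scores = [score_fn(segment)
              for recording in recordings
              for segment in split_into_sentence_pairs(recording, sample_rate)]
    return mean(scores), len(scores)
```

With four 32 s recordings per condition, `condition_score` scores 16 segments, matching the 16 individual quality scores mentioned in the text.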

The result is influenced by parameters like the speech coder, an implemented AGC, VAD and silence suppression at the sending side, comfort noise generation at the receiving side, or the quality of the implemented packet loss concealment (PLC) and jitter buffer. In the case of terminals the results are further influenced by frequency responses, distortions and other acoustical parameters.

The resulting TMOS or MOS-LQO scores provide a useful, comprehensive quality measure for one-way speech transmission but give little information about which parameter is responsible for the measured quality. This is one motivation to further evaluate the implemented signal processing using sophisticated ITU-T P.501 test signals and analysis methods.

The composite source signal (CS signal, CSS), which is especially suited to measuring the behavior of echo cancellers and non-linear processors, switching characteristics and one-way delay, is shown in figure 3.4. The signal is periodically repeated in order to provide the necessary signal length. The duration of one burst is 350 ms including the pause. The power density spectrum (see figure 3.5) decreases towards higher frequencies: similar to real speech, the excitation energy is higher in the low frequency range. The composite source signal consists of voiced parts, noisy parts and a pause in order to reproduce a temporal modulation.









Fig. 3.4: Test signal

Fig. 3.5: Test signal

Fig. 3.6: Test signal

Artificial voice as described in ITU-T Recommendations P.50 and P.501 is better suited to determining long term parameters. The necessary characteristics of artificial voices are reproduced if the analysis is averaged over the complete duration of approximately 10 s. The time signal is shown in figure 3.6. Figure 3.7 compares the power density spectra for artificial voice (blue curve) and the composite source signal (red).



Fig. 3.7: Test signal

The voiced sound of the composite source signal provides deterministic characteristics. The periodical repetition of this voiced sound with decreasing and increasing test signal level is shown in figure 3.8. The enlarged sequence in figure 3.9 shows the periodicity of this voiced sound. The pitch frequency is approximately 330 Hz, corresponding to a period of approximately 3 ms. This test signal is applied in receiving direction of IP phones or IP gateways (test signal shown in figure 3.10) in order to evaluate packet loss concealment algorithms, phase accuracy and audible disturbances.
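As a short worked check of the figures quoted above, the period follows directly from the pitch frequency. The second line assumes a sampling rate of 8 kHz, which is the rate of the A-law ISDN connection but is not stated explicitly for this signal.

```latex
T_0 = \frac{1}{f_0} = \frac{1}{330\,\mathrm{Hz}} \approx 3.03\,\mathrm{ms}
\qquad
N_0 = \frac{f_s}{f_0} = \frac{8000\,\mathrm{Hz}}{330\,\mathrm{Hz}} \approx 24.2\ \text{samples per period (assuming } f_s = 8\,\mathrm{kHz}\text{)}
```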







Fig. 3.8: Test signal

Fig. 3.9: Test signal (enlarged)

Fig. 3.10: Test signal

A powerful analysis to evaluate and optimize PLC and jitter buffer implementations is the combination of a cross correlation analysis between the transmitted signal and the original test signal and the Relative Approach [20], [21]. Both methods are applied here to the transmitted voiced sound as test signal. Two examples are shown in figures 3.11 and 3.12.
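The cross correlation part of this analysis can be illustrated with a small sketch that finds the lag at which a received signal best matches the reference. This is a generic normalized cross correlation, not the ACQUA implementation. The function names, the windowed 330 Hz test tone and the 8 kHz sample rate are assumptions chosen for illustration.

```python
import math


def cross_correlation(reference, received, max_lag):
    """Return {lag: normalized correlation} for lags in -max_lag..max_lag."""
    energy = sum(x * x for x in reference) * sum(y * y for y in received)
    norm = math.sqrt(energy)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i, x in enumerate(reference):
            j = i + lag
            if 0 <= j < len(received):
                acc += x * received[j]
        result[lag] = acc / norm if norm else 0.0
    return result


def best_lag(reference, received, max_lag):
    """Lag (in samples) at which the correlation is largest."""
    corr = cross_correlation(reference, received, max_lag)
    return max(corr, key=corr.get)


# Illustrative data: a windowed 330 Hz tone, delayed by 5 samples.
SAMPLE_RATE = 8000  # assumed
reference = [math.sin(2 * math.pi * 330 * i / SAMPLE_RATE)
             * 0.5 * (1 - math.cos(2 * math.pi * i / 199))
             for i in range(200)]
received = [0.0] * 5 + reference
print(best_lag(reference, received, max_lag=10))
```

For a periodic signal the search range should stay below half a pitch period, since lags that differ by a full period are otherwise indistinguishable. A phase shift introduced by concealment would show up as a sudden change of the best lag between successive analysis windows.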

The Relative Approach, a hearing model based analysis, compares a forward estimate derived from the signal history with the newly measured signal value. The deviation in time and frequency is displayed as an "estimation error". Unexpected artifacts in the time domain (x-axis) and in the spectral domain (y-axis) are thus found on the basis of the sensitivity of the human ear to these parameters. The estimation error is color coded: the warmer the color, the larger the estimation error. The left hand example (figure 3.11) shows low disturbances in the Relative Approach analysis (upper window). Packet loss is concealed and the phase is properly interpolated, as shown in the cross correlation analysis in the lower window. By contrast, figure 3.12 represents an implementation with low listening speech quality under the influence of packet loss. Significant disturbances, dominant in the lower frequency range, occur in the Relative Approach analysis (see red arrows).



The cross correlation analysis provides further information:

The disturbances are related to phase shifts introduced by the concealment of lost packets (see black arrow in the analysis).












Fig. 3.11: PLC Analysis, example I

Fig. 3.12: PLC Analysis, example II

The VAD performance is measured by applying, processing and analyzing speech signals as well as realistic background noises. To gather comparison data, the background noise tests are carried out in two steps. During a first measurement all relevant signal processing, such as VAD, is disabled. The second recording is then carried out with the relevant signal processing enabled. Both sequences are then analyzed as level vs. time and on a hearing model basis using a differential representation of the Relative Approach.

An example for the VAD tests using the realistic pub background noise scenario is shown in figure 3.13.



The grey curve represents the transmitted background noise signal with VAD and other signal processing components disabled; the green curve shows the level vs. time with the relevant signal processing components enabled. The differences between the two curves give a first indication of level adjustments and level mismatch between the original and the processed background noise signal (see red arrow).
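A level vs. time comparison of this kind can be sketched as below. The window length, the dB reference (full scale 1.0) and the function names are assumptions for illustration. The measurement system's actual time constants and weighting are not given in the text.

```python
import math


def level_vs_time(samples, window):
    """RMS level in dB (relative to full scale 1.0) per consecutive window."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        block = samples[start:start + window]
        rms = math.sqrt(sum(x * x for x in block) / window)
        levels.append(20 * math.log10(max(rms, 1e-10)))  # floor avoids log(0)
    return levels


def level_difference(reference_levels, processed_levels):
    """Processed minus reference level in dB, window by window."""
    return [p - r for r, p in zip(reference_levels, processed_levels)]


# Illustrative data: the processed signal is attenuated by half
# (about 6 dB) in its second half, as a VAD might do during noise.
reference = [0.1 if i % 2 == 0 else -0.1 for i in range(400)]
processed = [x if i < 200 else 0.5 * x for i, x in enumerate(reference)]
diff = level_difference(level_vs_time(reference, 100),
                        level_vs_time(processed, 100))
```

Here `diff` is close to 0 dB for the first two windows and close to -6 dB for the last two, which corresponds to the kind of mismatch between the grey and green curves described above.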

The next question that intuitively comes to mind is how disturbing these modulations really are. Does the VAD processed signal attract more attention than the original noise, transmitted and recorded without VAD? How true is it to the original? A differential analysis, the Δ Relative Approach between the transmitted signal and the unprocessed signal as reference, provides this analysis capability. Figure 3.14 shows the Δ Relative Approach between the two signals from figure 3.13. The analysis detects audible and unexpected features in the transmitted signal if VAD is enabled (see red arrow). These components are disturbing for the user. A homogeneous blue color indicates a high similarity between the original signal and the signal processed by VAD or other components.





Fig. 3.13: Transmission of pub noise



Fig. 3.14: Δ Relative Approach analysis

Another background noise signal consists of recorded organ music and a synthetic part. This sequence provides quickly changing spectral characteristics. It consists of a 5 s extract of real organ music followed by a 10 s synthetically reproduced organ signal. The spectrogram of this signal is shown in figure 3.15.

The frequency content of the original organ signal spreads over the whole transmission range (first part in figure 3.15), whereas the tone sequence varies the spectral content vs. time.



Figure 3.16 shows an analysis result recorded after the transmission over two gateways with enabled VAD. The voice activity detection sends silence packets and updates the spectral coefficients. This can easily be detected in the spectrogram in figure 3.16.





Fig. 3.15: Organ signal

Fig. 3.16: Processed signal

In general the tests in the presence of background noise can be subdivided into two groups. The tests described above determine general performance parameters by applying background noise as the test signal. Other tests focus on the transmission quality when real speech is applied together with simultaneous background noise. These analyses are based on the new objective model described in ETSI EG 202 396-3 [23].

The principle of the ETSI EG 202 396-3 analysis method for terminals (like IP phones) is shown in figure 3.17. Different background noise scenarios are played back via a 4-loudspeaker arrangement plus subwoofer in a test room [22]. Speech sequences (typically eight sentences, two male and two female speakers) are fed via the artificial mouth. In order to partly account for the Lombard effect [24], the speech level is increased by 3 dB compared to the nominal playback level of -4.7 dBPa. The resulting active speech level of -1.7 dBPa at the MRP represents a reasonable level in these scenarios.
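The level arithmetic can be written out as a short worked example, taking the nominal level as -4.7 dBPa. The conversion to dB SPL uses the standard reference pressure of 20 µPa, so that 0 dBPa (1 Pa) corresponds to about 94 dB SPL. The nominal level then agrees with the 89 dB(SPL) quoted earlier for the acoustical input signals.

```latex
L_{\mathrm{speech}} = -4.7\,\mathrm{dBPa} + 3\,\mathrm{dB} = -1.7\,\mathrm{dBPa}

L_{\mathrm{SPL}} = L_{\mathrm{dBPa}} + 20\log_{10}\!\left(\frac{1\,\mathrm{Pa}}{20\,\mu\mathrm{Pa}}\right) \approx L_{\mathrm{dBPa}} + 94\,\mathrm{dB}

-4.7\,\mathrm{dBPa} \approx 89.3\,\mathrm{dB\,SPL}, \qquad -1.7\,\mathrm{dBPa} \approx 92.3\,\mathrm{dB\,SPL}
```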







Fig. 3.17: Speech sample (example 1) for MOS testing

Fig. 3.18: Test sequence used for gateway tests (IRS modified filtered)

During the development of EG 202 396-3, signals transmitted via different terminals and different terminal simulations (“processed” signal in figure 3.17) were assessed in subjective listening tests. The clean speech (“clean”), the processed signal (“processed”) and the unprocessed signal (“unprocessed”, recorded with a measurement microphone close to the terminal's microphone) are used for the objective model. The results of the subjective tests are used to map the objective scores. The model provides a high correlation to the results of a subjective test for three parameters: the speech quality (S-MOS), the quality of the transmitted background noise (N-MOS) and the overall impression (G-MOS, general MOS). This analysis method has been used for IP terminal testing during the 5th ETSI SQTE using the test setup described in figure 3.17. Different background noise scenarios such as café noise, stationary car noise and pub noise were used for testing.

In order to adapt the testing methodology for IP gateways the terminal from figure 3.17 is substituted by an IRS modified send filter. The processed signals (speech and noise, see figure 3.18) are then fed as input signal for IP gateways. The signals are processed through VAD, AGC or other signal processing components in the gateways. The transmitted signal is then recorded on the IP gateway on the other side of the connection and analyzed using EG 202 396-3. Again the three noise scenarios (café noise, car noise, pub noise) are used.

The evaluation of double talk performance (both subscribers talk simultaneously) also requires a second test signal to be applied simultaneously on the opposite transmission path. The two signals that simulate double talk need to be uncorrelated. Figures 3.19 and 3.20 show a combination of two uncorrelated CS signals. If both signals are applied simultaneously, which simulates a double talk period, specific parameters determining transmission quality under double talk conditions can be analyzed, e.g. audible level variations.

During terminal tests the green signal is played back via the artificial mouth of the HATS, picked up by the microphone, and should be transmitted in sending direction. For gateway tests this signal is applied at the near end. The red CS signal bursts are applied in receiving direction of the device under test. These components may lead to echo and should consequently be cancelled or attenuated. The green signal, by contrast, should be transmitted completely.









Fig. 3.19: Double talk test signal I

Fig. 3.20: Enlarged sequence

Fig. 3.21: Level variation

Figure 3.21 shows that the CS signals on both channels are periodically repeated with a level variation of 20 dB in each transmission direction. Note that the entire signal sequence has a duration of 32 s. The critical situation for the double talk performance of echo cancellers is in the middle of this signal: the receive signal level is high (red) whereas the double talk signal level is low (green).

Another combination of these two test signals is shown in figure 3.22. The critical double talk sequence is now at the beginning of the sequence, where the near end (double talk) signal level is low but the receive level is high. Figure 3.23 shows an example of near end signal transmission. The level of the transmitted signal is calculated and referred to the original near end test signal level. Gaps introduced by the NLP can easily be detected in this example (see red arrows); the near end signal is not completely transmitted.
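Detecting such gaps from two level vs. time curves can be sketched as follows. The 10 dB threshold and the function name are hypothetical choices for illustration. The text does not state which criterion is applied in the actual analysis.

```python
def find_gaps(original_levels_db, transmitted_levels_db, threshold_db=10.0):
    """Return (first, last) window index ranges where the transmitted level
    lies more than threshold_db below the original near end level.

    threshold_db is an assumed value, not taken from the report.
    """
    gaps = []
    start = None
    pairs = list(zip(original_levels_db, transmitted_levels_db))
    for i, (orig, trans) in enumerate(pairs):
        attenuated = (orig - trans) > threshold_db
        if attenuated and start is None:
            start = i                      # a gap begins here
        elif not attenuated and start is not None:
            gaps.append((start, i - 1))    # the gap ended one window earlier
            start = None
    if start is not None:                  # gap runs to the end of the data
        gaps.append((start, len(pairs) - 1))
    return gaps


original = [-20.0] * 8
transmitted = [-20.0, -21.0, -45.0, -50.0, -20.0, -20.0, -40.0, -40.0]
print(find_gaps(original, transmitted))
```

In this made-up example, windows 2 to 3 and 6 to 7 are reported as gaps, while the 1 dB deviation in window 1 stays below the threshold.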



This signal can also be used to calculate the echo attenuation during the complete double talk sequence. The echo attenuation is calculated by level subtraction between the echo signal and the original test signal level. The lower excitation energy in the middle of the test signal leads to a lower resolution. Consequently an echo attenuation which is limited by the idle noise in sending direction can only be calculated with a lower resolution. An example is shown in figure 3.24.
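The level subtraction and the effect of the idle noise can be illustrated with a few lines of arithmetic. The -70 dBm0 idle noise level is a hypothetical value chosen for the example. The -16 dBm0 test level and the 20 dB level variation are taken from the text, and the function names are assumptions.

```python
def echo_attenuation_db(test_level_db, echo_level_db):
    """Echo attenuation as the difference between test and echo level."""
    return test_level_db - echo_level_db


def measurable_limit_db(test_level_db, idle_noise_db):
    """Largest attenuation that can be resolved before the echo
    disappears in the idle noise of the sending direction."""
    return test_level_db - idle_noise_db


IDLE_NOISE_DBM0 = -70.0          # hypothetical idle noise level
HIGH_LEVEL_DBM0 = -16.0          # nominal test level from the text
LOW_LEVEL_DBM0 = HIGH_LEVEL_DBM0 - 20.0   # 20 dB level variation

limit_high = measurable_limit_db(HIGH_LEVEL_DBM0, IDLE_NOISE_DBM0)
limit_low = measurable_limit_db(LOW_LEVEL_DBM0, IDLE_NOISE_DBM0)
```

Under these assumptions an attenuation of up to 54 dB can be resolved at the high excitation level but only up to 34 dB at the low level, which is the loss of resolution described above.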







Fig. 3.22: Double talk test signal II

Fig. 3.23: Transmission of near end signal

Fig. 3.24: Echo attenuation

An additional test signal to determine echo during double talk is shown in figure 3.25. The sequence represents a single talk situation in receiving direction for about 2 s (red), then the double talk sequence (green and black) is applied for another 2 s, and the sequence ends with another short single talk period. The power density spectrum of the double talk sequence calculated by Fourier transform is given in figure 3.26.

The two signals show “comb-filter” spectra, which are necessary to distinguish between the double talk signal (coming from the near end) and the echo signal (coming from the echo path as a reaction to the receive signal). The echo attenuation can thus be calculated for each frequency range.
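The idea of separating echo and near end signal by interleaved spectral lines can be sketched with stationary tones. This is a strong simplification of the AM/FM modulated test signal. The tone frequencies, the 8 kHz sample rate, the echo gain of 0.05 and all function names are assumptions for illustration only.

```python
import math


def tone_amplitude(samples, freq, sample_rate):
    """Amplitude of one spectral line via a single-bin DFT."""
    n = len(samples)
    re = sum(x * math.cos(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(samples))
    return 2 * math.hypot(re, im) / n


def echo_attenuation_per_tone(receive, send, tone_freqs, sample_rate):
    """Attenuation in dB at each receive-signal line found in the send signal."""
    result = {}
    for f in tone_freqs:
        a_rx = tone_amplitude(receive, f, sample_rate)
        a_tx = max(tone_amplitude(send, f, sample_rate), 1e-12)
        result[f] = 20 * math.log10(a_rx / a_tx)
    return result


def tones(freqs, n, sample_rate):
    return [sum(math.sin(2 * math.pi * f * i / sample_rate) for f in freqs)
            for i in range(n)]


SAMPLE_RATE, N = 8000, 800             # assumed; 10 Hz line spacing
RECEIVE_FREQS = [300, 700, 1100]       # hypothetical receive comb lines
NEAR_END_FREQS = [500, 900, 1300]      # hypothetical interleaved near end lines

receive = tones(RECEIVE_FREQS, N, SAMPLE_RATE)
near_end = tones(NEAR_END_FREQS, N, SAMPLE_RATE)
send = [d + 0.05 * r for d, r in zip(near_end, receive)]  # near end plus echo
attenuation = echo_attenuation_per_tone(receive, send, RECEIVE_FREQS, SAMPLE_RATE)
```

Because the near end lines fall between the receive lines, the echo level at each receive frequency can be read from the send signal even though the near end signal is present at the same time. In this example every line yields about 26 dB.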





Fig. 3.25: AM/FM modulated test signal

Fig. 3.26: Power density spectrum of AM/FM modulated test signal

Another double talk sequence used for testing is based on real speech instead of speech-like test signals. The structure is shown in figure 3.27. The red signal in receiving direction ensures that the echo canceller is fully converged. Different double talk sequences are then applied at the near end (green). An enlarged part is shown in figure 3.28; the level distribution is analyzed in figure 3.29. The near end signal is active and interrupted by the far end speech. Different analyses can be applied to these signals, e.g. determining level fluctuations or gaps inserted by the NLP.







Fig. 3.27: Speech double talk signal

Fig. 3.28: Enlarged part

Fig. 3.29: Level distribution

Another group of tests applies background noise at the near end together with a second test signal in receiving direction. This typically activates echo cancellers and NLP and may therefore also impair the transmission of the near end background noise. An example is shown in figure 3.30. The red signal (CS signal) is fed in receiving direction; the green signal represents a near end pub noise. Level modulations (attenuation, gaps, ...) are analyzed from a level vs. time representation, as shown for one example in figure 3.31. The black curve represents the background noise signal level when the noise is transmitted without applying a receive signal. The green curve shows the transmitted background noise level while the receive signal is applied. For better orientation, the original receive signal is given in grey. Level differences can be calculated from the black and green analysis curves.

Again, these level analyses can be combined with hearing adequate analyses. How disturbing are these modulations? How true is the modulated signal compared to the original? The Δ Relative Approach as shown in figure 3.32 detects the audible and unexpected modulations. Missing features in the signal (caused by attenuation, gaps, ...) are represented by black; light colors represent sudden unexpected components, e.g. caused by the sudden appearance of signal components. These modulations are disturbing for the user.









Fig. 3.30: Test signal for noise transmission (CSS)

Fig. 3.31: Level modulation

Fig. 3.32: Δ Relative Approach analysis







Fig. 3.33: Test signal for noise transm. (speech)

Fig. 3.34: Recorded time signals

Fig. 3.35: 2D Relative Approach

A similar test is carried out using real speech instead of the composite source signal in receiving direction (see figure 3.33). An example of a transmitted signal is shown in figure 3.34. The grey signal is the original speech; the green signal is transmitted in sending direction. In this example the NLP of the echo canceller introduces attenuation in the transmitted background noise (see red arrow). The missing features are detected in the 2D Relative Approach analysis vs. time in figure 3.35.

The setup for the gateway EC tests is shown in figure 3.36. In order to evaluate the transmission performance of the implemented algorithms in the echo cancellers, one EC is disabled during the tests. This is necessary in order to avoid undesired effects under double talk conditions. In principle the same setting is used for an IP phone under test which is connected to a gateway on the other side of the connection. The IP phone under test substitutes the “EC under Test” on the left hand side in figure 3.36. The ISDN simulator is replaced by the artificial head measurement system in this case.





Fig. 3.36: Test setup for gateway echo canceller testing. One EC is disabled (right hand side in this figure) so as not to introduce additional clipping during double talk evaluations

Various echo tests on the IP gateways are carried out using echo path simulations of cordless telephones. These echoes were recorded using two real DECT phones (Digital Enhanced Cordless Telecommunications [25]), designated in the following as “DECT phone no.1” and “DECT phone no.2”. The echo includes the DECT delay of approximately 30 ms in the echo path plus potential nonlinearities introduced by these phones. The echoes simulate different scenarios:



  • DECT phone no.1 lying on a hard surface, transducers down, echo loss of 17.8 dB according to ITU-T G.122 [26]

  • DECT phone no.1 mounted to a HATS, application force 8 N between the DECT phone and the artificial ear, echo loss of 31.0 dB according to ITU-T G.122

  • DECT phone no.2 mounted to a HATS, application force 8 N between the DECT phone and the artificial ear, echo loss of 36.3 dB according to ITU-T G.122

These echo simulations are used during the tests in combination with a 26 dB hybrid ERL simulating a typical hybrid (pure attenuation) when connecting DECT phones.

The corresponding test setup for IP phones is shown in figure 3.37.





Fig. 3.37: Test setup for IP phone testing. The EC in the interface gateway is disabled (right hand side in this figure) so as not to introduce additional clipping during double talk evaluations
