Google Scholar

Teaching

Workshop: “Applications of Artificial Intelligence and Machine Learning in Audio Quality Models” at the 156th Audio Engineering Society (AES) Convention, 2024.

Together with Jan Skoglund (Google LLC), Arijit Biswas (Dolby Germany GmbH) and Hannes Gamper (Microsoft Research).

This workshop, sponsored by the AES Technical Committee on ML/AI, provided hands-on experience in using machine learning for audio quality modeling. Participants learned how machine learning helps in understanding audio quality perception and improves device and algorithm development by revealing hidden relationships in subject response data. Accurate quality models can predict perceived audio quality, which is crucial for customer experience, especially when subjective testing is expensive or impractical. The aim of this workshop was to disseminate the domain-specific skills necessary for applying ML/DL algorithms to audio quality assessment, including experimental design, data collection, data augmentation, filtering, model design, and cross-validation. The workshop covered historical and modern techniques for merging machine learning with auditory perception, giving participants the tools to evaluate the suitability of existing ML-based quality models for their specific use cases.

Design Choices in a Binaural Perceptual Model for Improved Objective Spatial Audio Quality Assessment (AES 2023)

Spatial audio quality assessment is crucial for attaining immersive user experiences, but subjective evaluations are time-consuming and costly. Thus, automated algorithms have been developed for objective quality assessment. This study focuses on the development of an improved binaural perceptual model for spatial audio quality measurement by choosing the best-performing set of design parameters among previously proposed methods. Existing binaural models, particularly extensions of the Perceptual Evaluation of Audio Quality (PEAQ) method, are investigated to enhance spatial audio quality metrics.

An Improved Metric of Informational Masking for Perceptual Audio Quality Measurement (WASPAA 2023)

Video/Slides

Perceptual audio quality measurement systems algorithmically analyze the output of audio processing systems to estimate possible perceived quality degradation using perceptual models of human audition. In this manner, they save the time and resources associated with the design and execution of listening tests (LTs). Models of disturbance audibility predicting peripheral auditory masking in quality measurement systems have considerably increased the subjective quality prediction performance for signals processed by perceptual audio codecs. In addition, cognitive effects are known to regulate perceived distortion severity by influencing distortion salience. However, the performance gains due to cognitive effect models in quality measurement systems have so far been inconsistent, particularly for music signals. First, this paper presents an improved model of informational masking (IM), an important cognitive effect in quality perception, that considers the complexity of disturbance information around the masking threshold. Second, we incorporate the proposed IM metric into a quality measurement system using a novel interaction analysis procedure between cognitive effects and distortion metrics. The procedure establishes interactions between cognitive effects and distortion metrics using LT data. The proposed IM metric is shown to outperform previously proposed IM metrics in a validation task against subjective quality scores from large and diverse LT databases. In particular, the proposed system showed improved quality prediction for music signals coded with bandwidth extension techniques, where other models frequently fail.
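
For illustration only, a minimal sketch of the underlying idea rather than the published metric: disturbance energy near the masked threshold is weighted by a crude complexity proxy (spectral flatness). The function name, the 6 dB window, and the flatness measure are assumptions made for this sketch.

```python
# Rough illustration of an informational-masking-style weighting (assumptions,
# not the metric from the paper): near-threshold disturbance energy is scaled
# by a simple complexity proxy of the disturbance spectrum.
import numpy as np

def im_style_metric(disturbance_power, masked_threshold):
    """disturbance_power, masked_threshold: per-band power arrays of equal shape."""
    eps = 1e-12
    # Bands where the disturbance hovers around the masked threshold (within +/- 6 dB).
    ratio_db = 10 * np.log10((disturbance_power + eps) / (masked_threshold + eps))
    near = np.abs(ratio_db) < 6.0
    # Spectral flatness of the disturbance as a crude information-complexity proxy.
    flatness = np.exp(np.mean(np.log(disturbance_power + eps))) / (np.mean(disturbance_power) + eps)
    return float(flatness * np.sum(disturbance_power[near]))
```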

Objective Quality Assessment of Perceptually Coded Audio Signals

Presentation

The main goal of audio signal coding and processing is to achieve the best possible sound quality within specific parameters. Quality is closely tied to the listener’s perception, and a reliable quality assessment method is crucial for market acceptance of a digital audio codec. Perceptual audio coding algorithms eliminate redundant and irrelevant information, but can also produce artifacts that degrade the sound quality. Quality assessment is complex due to factors such as different perceptions of low- and high-quality signals and disturbances in the spatial image. Listening tests are considered the gold standard for subjective quality assessment, but computer-based objective quality assessment methods have been developed to reduce costs and time. These systems use perceptual auditory models to predict sound quality, but their performance is limited due to perceptual model limitations. The thesis proposes contributions to improve timbre and spatial aspects of objective quality assessment methods by extending their perceptual models.

A Data-Driven Cognitive Salience Model for Objective Audio Quality Assessment (ICASSP 2022)

Video/Slides

Objective audio quality measurement systems often use perceptual models to predict the subjective quality scores of processed signals, as reported in listening tests. Most systems map different metrics of perceived degradation into a single quality score predicting subjective quality. This requires a quality mapping stage that is informed by real listening test data using statistical learning (i.e., a data-driven approach) with distortion metrics as input features. However, the amount of reliable training data is limited in practice, and usually not sufficient for a comprehensive training of large learning models. Models of cognitive effects in objective systems can, however, improve the learning model. Specifically, considering the salience of certain distortion types, they provide additional features to the mapping stage that improve the learning process, especially for limited amounts of training data. We propose a novel data-driven salience model that informs the quality mapping stage by explicitly estimating the cognitive/degradation metric interactions using a salience measure. Systems incorporating the novel salience model are shown to outperform equivalent systems that only use statistical learning to combine cognitive and degradation metrics, as well as other well-known measurement systems, for a representative validation dataset.
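
As a purely synthetic sketch of the general idea (the data, feature names, and regressor below are illustrative assumptions, not the system from the paper), an explicit interaction feature between a salience measure and a distortion metric can be handed to the quality-mapping stage instead of leaving the interaction for the learner to discover from limited data:

```python
# Synthetic example of a salience/distortion interaction feature for the
# quality-mapping stage. All data here is randomly generated for illustration.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
distortion = rng.random(200)            # placeholder distortion metric per test item
salience = rng.random(200)              # placeholder cognitive salience per test item
sdg = -4 * distortion * salience + rng.normal(0, 0.1, 200)  # synthetic subjective grades

# Explicit interaction column alongside the raw features.
X = np.column_stack([distortion, salience, distortion * salience])
model = Ridge().fit(X, sdg)
print(model.coef_)                      # the interaction term should carry most weight
```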

Can we still use PEAQ? A Performance Analysis of the ITU Standard for the Objective Assessment of Perceived Audio Quality

Video (arXiv)

The Perceptual Evaluation of Audio Quality (PEAQ) method as described in the International Telecommunication Union (ITU) recommendation ITU-R BS.1387 has been widely used for computationally estimating the quality of perceptually coded audio signals without the need for extensive subjective listening tests. However, many reports have highlighted clear limitations of the scheme after the end of its standardization, particularly involving signals coded with newer technologies such as bandwidth extension or parametric multi-channel coding. Until now, no other method for measuring the quality of both speech and audio signals has been standardized by the ITU. Therefore, a further investigation of the causes for these limitations would be beneficial to a possible update of said scheme. Our experimental results indicate that the performance of PEAQ’s model of disturbance loudness is still as good as (and sometimes superior to) other state-of-the-art objective measures, albeit with varying performance depending on the type of degraded signal content (i.e. speech or music). This finding evidences the need for an improved cognitive model. In addition, results indicate that an updated mapping of Model Output Values (MOVs) to PEAQ’s Distortion Index (DI) based on newer training data can greatly improve performance. Finally, some suggestions for the improvement of PEAQ are provided based on the reported results and comparison to other systems.
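
As a sketch of what re-fitting the MOV mapping on newer listening-test data could look like (the file names and model settings below are placeholders; BS.1387 specifies a small one-hidden-layer neural network for this mapping, emulated here with scikit-learn):

```python
# Hypothetical re-training of a MOV-to-quality mapping on newer listening-test
# data. Input files and hyperparameters are assumptions for this sketch.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

movs = np.load("movs.npy")   # shape (n_items, n_movs): Model Output Values per test item (hypothetical file)
sdg = np.load("sdg.npy")     # subjective difference grades from listening tests (hypothetical file)

model = MLPRegressor(hidden_layer_sizes=(3,), activation="logistic",
                     max_iter=5000, random_state=0)
# Cross-validation guards against overfitting the limited listening-test data.
scores = cross_val_score(model, movs, sdg, cv=5, scoring="r2")
print("mean R^2 across folds:", scores.mean())
model.fit(movs, sdg)         # final mapping trained on all available data
```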

Invited Workshop: To PEAQ or Not to PEAQ? - BS.1387 Revisited (Audio Engineering Society 147th Convention)

Speakers: Pablo Delgado and Thomas Sporer. Carefully conducted listening tests are time-consuming and expensive. Computerized, objective measurement schemes for the assessment of perceived audio quality seem to be an adequate replacement. The Recommendation ITU-R BS.1387 (PEAQ) is a standardized method to assess bit-reduced audio signals. However, in the last 20 years both audio coding and listening methods have evolved. In addition, many authors use PEAQ for the assessment of audio processing schemes not known and validated in 1998. This tutorial consists of the following parts:
• explain what PEAQ is, how it was designed and validated;
• show some examples where PEAQ fails to predict perceived quality;
• summarize the work since standardization concerning newer audio coding tools, spatial audio, and listening procedures;
• give an outlook on further developments;
• give advice under which circumstances the current version of PEAQ should be used.
This session was presented in association with the AES Technical Committee on Perception and Subjective Evaluation of Audio Signals.

Objective Measurement of Stereophonic Audio Quality in the Directional Loudness Domain

Automated audio quality prediction is still considered a challenge for stereo or multichannel signals carrying spatial information. A system that accurately and reliably predicts quality scores obtained by time-consuming listening tests can be of great advantage in saving resources, for instance, in the evaluation of parametric spatial audio codecs. Most of the solutions so far work with individual comparisons of distortions of inter-channel cues across time and frequency, which are known to correlate with distortions in the spatial image evoked in the listener. We propose a scene analysis method that considers signal loudness distributed across estimations of perceived source directions on the horizontal plane. The calculation of distortion features in the directional loudness domain (as opposed to the time-frequency domain) seems to provide equal or better correlation with subjectively perceived quality degradation than previous methods, as confirmed by experiments with an extensive database of parametric audio codec listening tests. We investigate the effect of a number of design alternatives (based on psychoacoustic principles) on the overall prediction performance of the associated quality measurement system.

Influence of Binaural Processing on Objective Perceptual Quality Assessment

Objective spatial audio quality measurement systems attempt to predict the perceived quality degradation reported in subjective tests by comparing an original reference and a processed version of the same signal. In this context, binaural processing can be used in conjunction with a perceptual model for improved prediction performance. This paper investigates how variations in the binaural processor—namely head rotations, added reverberation, and simulated room properties—impact the prediction performance of a standardized objective audio (Perceptual Evaluation of Audio Quality–PEAQ) quality measurement scheme that has been extended to include spatial aspects.

Objective Assessment of Spatial Audio Quality Using Directional Loudness Maps

(arXiv) This work introduces a feature extracted from stereophonic/binaural audio signals aiming to represent a measure of perceived quality degradation in processed spatial auditory scenes. The feature extraction technique is based on a simplified stereo signal model considering auditory events positioned towards a given direction in the stereo field using amplitude panning (AP) techniques. We decompose the stereo signal into a set of directional signals for given AP values in the Short-Time Fourier Transform domain and calculate their overall loudness to form a directional loudness representation, or map. Then, we compare directional loudness maps of a reference signal and a deteriorated version to derive a distortion measure aiming to describe the associated perceived degradation scores reported in listening tests. The measure is then tested on an extensive listening test database with stereo signals processed by state-of-the-art perceptual audio codecs using non-waveform-preserving techniques such as bandwidth extension and joint stereo coding, known for presenting a challenge to existing quality predictors. Results suggest that the derived distortion measure can be incorporated as an extension to existing automated perceptual quality assessment algorithms for improving prediction on spatially coded audio signals.
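
A minimal sketch of the general pipeline, under simplifying assumptions (the panning-index decomposition, the direction binning, and the loudness compression below are illustrative stand-ins for the models used in the paper):

```python
# Illustrative directional loudness map from a stereo STFT, plus a simple
# distortion measure between reference and degraded maps. Not the published
# algorithm; decomposition and loudness model are simplified placeholders.
import numpy as np
from scipy.signal import stft

def directional_loudness_map(left, right, fs, n_dirs=21, nperseg=1024):
    """Distribute time-frequency energy over estimated panning directions."""
    _, _, L = stft(left, fs, nperseg=nperseg)
    _, _, R = stft(right, fs, nperseg=nperseg)
    eps = 1e-12
    # Panning index in [-1, 1] per time-frequency bin.
    pan = (np.abs(R) - np.abs(L)) / (np.abs(L) + np.abs(R) + eps)
    energy = np.abs(L) ** 2 + np.abs(R) ** 2
    # Accumulate energy into direction bins for each time frame.
    edges = np.linspace(-1.0, 1.0, n_dirs + 1)
    dl_map = np.zeros((n_dirs, L.shape[1]))
    for t in range(L.shape[1]):
        idx = np.clip(np.digitize(pan[:, t], edges) - 1, 0, n_dirs - 1)
        np.add.at(dl_map[:, t], idx, energy[:, t])
    # Crude loudness compression (placeholder for a proper loudness model).
    return dl_map ** 0.3

def dl_distortion(ref_map, test_map):
    """Average absolute difference between directional loudness maps."""
    return float(np.mean(np.abs(ref_map - test_map)))
```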

Investigations on the Influence of Combined Inter-Aural Cue Distortions in Overall Audio Quality

(arXiv) There is considerable interest in developing algorithms that can predict the audio quality of perceptually coded signals, in order to avoid the cost of extensive listening tests during development. While many established algorithms for predicting the perceived quality of signals with monaural (timbral) distortions are available (PEAQ, POLQA), predicting the quality degradation of stereo and multi-channel spatial signals is still considered a challenge. Audio quality degradation arising from spatial distortions is usually measured in terms of well-known inter-aural cue distortion measures such as Inter-aural Level Difference Distortions (ILDD), Inter-aural Time Difference Distortions (ITDD), and Inter-aural Cross Correlation Distortions (IACCD). However, the extent to which their interaction influences the overall audio quality degradation in complex signals, as expressed for example in a multiple stimulus test, has not yet been thoroughly studied. We propose a systematic approach that introduces controlled combinations of spatial distortions on a representative set of signals and evaluates their influence on overall perceived quality degradation by analyzing listening test scores over said signals. From this study we derive guidelines for designing meaningful distortion measures that consider inter-aural cue distortion interactions.

Energy Aware Modeling of Inter-channel Level Difference Distortion Impact on Spatial Audio Perception

In spatial audio processing, Inter-aural Level Difference Distortions (ILDD) between reference and coded signals play an important role in the perception of quality degradation. In order to reduce costs, there are efforts to develop algorithms that automatically predict the perceptual quality of multichannel/spatial audio processing operations relative to the unimpaired original without requiring extensive listening tests. Correct modelling of perceived ILDD has a great influence on the prediction performance of such automated measurements. We propose an energy-aware model of ILDD perception that accounts for the dependency on the energy content in different spectral regions of the involved signal. Model parameters are fitted to subjective results obtained from listening test data over a synthetically generated audio database with arbitrarily induced ILDD at different intensities, frequency regions, and energy levels. Finally, we compare the performance of our proposed model against two state-of-the-art ILDD models over two extensive databases of real coded signals.
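
For illustration only (the band splitting, the dB-domain ILD, and the energy weighting are assumptions made for this sketch, not the fitted model from the paper), an energy-weighted ILD distortion could be computed roughly as follows:

```python
# Sketch of an energy-weighted ILD distortion: per-band ILDs of reference and
# coded signals are compared, with differences weighted by the reference band
# energy so that energetic spectral regions count more.
import numpy as np
from scipy.signal import stft

def band_ild(left, right, fs, n_bands=24, nperseg=1024):
    """Per-band inter-channel level difference (dB) and band energy."""
    f, _, L = stft(left, fs, nperseg=nperseg)
    _, _, R = stft(right, fs, nperseg=nperseg)
    eps = 1e-12
    edges = np.linspace(0, len(f), n_bands + 1, dtype=int)
    ild, energy = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        el = np.sum(np.abs(L[lo:hi]) ** 2) + eps
        er = np.sum(np.abs(R[lo:hi]) ** 2) + eps
        ild.append(10 * np.log10(el / er))
        energy.append(el + er)
    return np.array(ild), np.array(energy)

def energy_weighted_ildd(ref_lr, test_lr, fs):
    """ref_lr, test_lr: (left, right) channel tuples of reference and coded signals."""
    ild_ref, e_ref = band_ild(*ref_lr, fs)
    ild_test, _ = band_ild(*test_lr, fs)
    weights = e_ref / e_ref.sum()
    return float(np.sum(weights * np.abs(ild_ref - ild_test)))
```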

An Expressive Multidimensional Physical Modelling Percussion Instrument

This paper describes the design, implementation and evaluation of a digital percussion instrument with multidimensional polyphonic control of a real-time physical modelling system. The system utilizes modular parametric control of different physical models, excitations and couplings alongside continuous morphing and unique interaction capabilities to explore and enhance expressivity and gestural interaction for a percussion instrument. Details of the instrument and audio engine are provided together with an experiment that tested the real-time capabilities of the system and the expressive qualities of the instrument. Testing showed that advances in sensor technology have the potential to enhance creativity in percussive instruments and extend gestural manipulation, but will require well-designed and inherently complex mapping schemes.
I provided the ground concept and code. The project was developed together with the Multisensory Experience Laboratory at Aalborg University. Presented in the Proceedings of the 15th Sound and Music Computing Conference, 2018 (download link). Videos: (demo1) (demo2)

On the Effect of Inter-Channel Level Difference Distortions on the Perceived Subjective Quality of Stereo Signals

Perceptual audio coding at low bitrates and stereo enhancement algorithms can affect the perceived quality of stereo audio signals. Besides changes in timbre, the spatial sound image can also be altered, resulting in quality degradations compared to an original reference. While the effects of timbre degradation on quality are well understood, the effects of spatial distortions are not sufficiently known. This paper presents a study designed to quantify the effect of Inter-Channel Level Difference (ICLD) errors on perceived audio quality. Results show systematic effects of ICLD errors on quality: larger ICLD errors led to greater quality degradation. Spectral portions containing relatively higher energy were affected more strongly.
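
As an illustrative sketch of how a controlled ICLD error of the kind studied here might be imposed on a stereo test signal (the band limits, error size, and STFT processing are assumptions, not the exact stimulus generation used in the study):

```python
# Imposing a band-limited level offset on the left channel to create a
# controlled ICLD error (illustrative processing only).
import numpy as np
from scipy.signal import stft, istft

def apply_icld_error(left, right, fs, band=(1000, 4000), error_db=6.0, nperseg=1024):
    f, _, L = stft(left, fs, nperseg=nperseg)
    _, _, R = stft(right, fs, nperseg=nperseg)
    sel = (f >= band[0]) & (f < band[1])
    L[sel] *= 10 ** (error_db / 20)        # boost the left channel within the band
    _, left_mod = istft(L, fs, nperseg=nperseg)
    _, right_mod = istft(R, fs, nperseg=nperseg)
    return left_mod, right_mod
```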

Complexity scaling of audio algorithms: parametrizing the MPEG Advanced Audio Coding rate-distortion loop (DAFx-2016)

Implementations of audio algorithms on embedded devices are required to consume minimal memory and processing power. Such applications can usually tolerate numerical imprecision (distortion) as long as the resulting perceived quality is not degraded. By taking advantage of this error-tolerant nature, the algorithmic complexity can be greatly reduced. In the context of real-time audio coding, these algorithms can benefit from parametrization to adapt rate-distortion-complexity (R-D-C) trade-offs. We propose a modification to the rate-distortion loop in the quantization and coding stage of a fixed-point implementation of the Advanced Audio Coding (AAC) encoder to include complexity scaling. This parametrization could allow the control of algorithmic complexity through instantaneous workload measurements from the target processor’s task scheduler, so that processing resources can be assigned more effectively. Results show that this framework can be tuned to reduce up to 80% of the additional workload caused by the rate-distortion loop while remaining perceptually equivalent to the full-complexity version. Additionally, the modification allows a graceful degradation when transparency cannot be met due to limited computational capabilities.
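
A minimal sketch of the underlying idea, not the fixed-point AAC implementation from the paper (the bit estimate, step-size search, and workload scaling below are placeholders): the number of rate-distortion refinement iterations is capped according to the currently measured workload.

```python
# Toy rate-distortion loop whose iteration budget shrinks when the measured
# processor workload is high (illustrative placeholders throughout).
import numpy as np

def quantize(spectrum, step):
    return np.round(spectrum / step)

def estimate_bits(q):
    # Placeholder bit estimate; a real encoder would run its entropy coder.
    return int(np.sum(np.abs(q)) + np.count_nonzero(q))

def rd_loop(spectrum, target_bits, workload, max_iters=32):
    """Search the quantizer step size, capping iterations by current workload (0..1)."""
    budget = max(1, int(max_iters * (1.0 - workload)))  # fewer iterations under high load
    step = 1.0
    for _ in range(budget):
        if estimate_bits(quantize(spectrum, step)) <= target_bits:
            break
        step *= 1.25  # coarser quantization lowers the bit demand
    return step

spectrum = np.random.randn(1024) * 10
print(rd_loop(spectrum, target_bits=900, workload=0.7))
```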

Acoustic source localization using wireless sensor networks.

I developed an adaptive version of an algorithm suitable for collaborative, distributed acoustic localization of a source based on energy readings over resource-constrained Wireless Sensor Networks. Under certain circumstances, this method outperforms commonly used approaches for a given set of design constraints.
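
A minimal sketch of plain energy-based localization under idealized assumptions (not the adaptive algorithm described above): sensor energy readings decay roughly with the square of distance, so an energy-weighted centroid of the sensor positions already gives a cheap location estimate suitable for constrained nodes.

```python
# Energy-weighted centroid localization from simulated sensor readings.
import numpy as np

positions = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])  # assumed sensor layout (meters)
source = np.array([3.0, 7.0])                                               # true source position

# Simulated energy readings: gain / distance^2 plus measurement noise.
d2 = np.sum((positions - source) ** 2, axis=1)
energies = 1.0 / d2 + np.random.normal(0, 1e-3, size=d2.shape)

# Energy-weighted centroid estimate of the source position.
weights = energies / energies.sum()
estimate = weights @ positions
print("estimated source position:", estimate)
```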