
Quantifying perceptual errors in virtual soundscapes

APR 16, 2021
Machine learning can measure meaningful differences in a key sound-localization parameter.

Photo courtesy of Facebook

Virtual and augmented reality systems face a daunting acoustic challenge: using headphones to render a perceptually accurate soundscape. Generating appropriate timing and amplitude differences between the two ears is only part of the process. Real-world sounds scatter off our outer ears, head, and upper body in frequency-, direction-, and distance-dependent ways. (See the article by Bill Hartmann, Physics Today, November 1999, page 24.) And since everyone’s anatomy is different, those complex dependencies, characterized mathematically for each ear by so-called head-related transfer functions (HRTFs), vary substantially from person to person.

HRTFs are essential signal-processing inputs for generating headphone sounds that can be accurately placed within a virtual space. Using ones that aren’t closely matched to an individual listener’s HRTFs can distort the perceived sound-source direction, size, and coloration and even place a sound inside the listener’s head. But what is close enough—how can one quantify HRTF differences in an acoustically meaningful way?

One common measure, the spectral difference error (SDE), is simple to calculate—it’s the logarithmic difference of two HRTFs—but it doesn’t always correlate well with perceived errors. Ishwarya Ananthabhotla of the MIT Media Lab and Vamsi Krishna Ithapu and W. Owen Brimijoin of Facebook Reality Labs have now shown how a two-stage machine-learning model can generate metrics that are better aligned with perception.
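For concreteness, here is a minimal Python sketch of how an SDE might be computed, assuming the common convention of a root-mean-square decibel difference between two magnitude spectra sampled on the same frequency grid; the function name and the synthetic test spectra are illustrative, not taken from the paper.

import numpy as np

def spectral_difference_error(hrtf_a, hrtf_b):
    """RMS log-magnitude difference, in dB, between two HRTF spectra
    sampled on the same frequency grid."""
    # Convert to magnitude spectra in decibels.
    db_a = 20 * np.log10(np.abs(hrtf_a) + 1e-12)
    db_b = 20 * np.log10(np.abs(hrtf_b) + 1e-12)
    # One common convention: root-mean-square dB difference across frequency.
    return np.sqrt(np.mean((db_a - db_b) ** 2))

# Example with two slightly different synthetic frequency responses.
f = np.linspace(0, 1, 256)
h1 = np.exp(-2j * np.pi * 3 * f) * (1 + 0.30 * f)
h2 = np.exp(-2j * np.pi * 3 * f) * (1 + 0.35 * f)
print(spectral_difference_error(h1, h2))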

To illustrate the approach, the researchers focus on sound localization. They started with a large database of 123 individuals’ measured HRTF frequency spectra for each of 612 directions on a spherical grid. (The photo shows one participant being aligned to the room’s coordinate system prior to the HRTF measurements.) With those data, they trained their model to recognize the relationship between the HRTFs and the corresponding source direction. They then fine-tuned the model through listening tests that evaluated perceived location. For that second stage, the researchers took 30 participants from the original group, presented each with a series of virtual sound sources rendered over headphones using their measured HRTFs, and had the participants indicate the perceived source locations.
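The paper’s architecture and training details aren’t reproduced here, but the two-stage idea can be sketched in PyTorch, assuming a small feedforward network that maps a concatenated left/right magnitude spectrum to a unit direction vector, pretrained on measured source directions and then fine-tuned on perceived ones. The layer sizes, learning rates, and random placeholder data below are all hypothetical.

import torch
from torch import nn

n_freq = 256  # hypothetical number of magnitude bins per ear

# Maps concatenated left/right spectra to a 3-D source-direction vector.
model = nn.Sequential(
    nn.Linear(2 * n_freq, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 3),
)

def direction_loss(pred, target):
    # Cosine distance between predicted and target direction vectors.
    pred = nn.functional.normalize(pred, dim=-1)
    return (1 - (pred * target).sum(dim=-1)).mean()

def train(model, spectra, directions, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = direction_loss(model(spectra), directions)
        loss.backward()
        opt.step()
    return model

# Stage 1: pretrain on measured HRTFs and their true source directions
# (random tensors stand in for the 123-subject, 612-direction database).
hrtf_feats = torch.randn(1000, 2 * n_freq)
true_dirs = nn.functional.normalize(torch.randn(1000, 3), dim=-1)
train(model, hrtf_feats, true_dirs, epochs=50, lr=1e-3)

# Stage 2: fine-tune on sparser perceived-location data from listening tests,
# with a smaller learning rate so the pretrained mapping is largely preserved.
perceived_feats = torch.randn(200, 2 * n_freq)
perceived_dirs = nn.functional.normalize(torch.randn(200, 3), dim=-1)
train(model, perceived_feats, perceived_dirs, epochs=20, lr=1e-4)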

Once the model is trained, it predicts a source’s angular coordinates from a given pair of left and right HRTFs. So for two HRTF pairs, the subtended angle between the two predictions is a ready, perceptually meaningful metric of the HRTF differences. Unlike SDEs, the new metric is monotonic and linear, and it shows less interperson variability. Moreover, the model’s predictions before and after the fine-tuning step are informative: The perceived-localization data are sparse and noisy, and source locations for which the tuning isn’t statistically significant indicate where collecting additional perceptual data may be useful. (I. Ananthabhotla, V. K. Ithapu, W. O. Brimijoin, JASA Express Lett. 1, 044401, 2021.)
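Given two predicted direction vectors, the angular metric itself is simple to evaluate. Here is a minimal sketch; the function name and example vectors are purely illustrative.

import numpy as np

def subtended_angle(dir_a, dir_b):
    """Angle, in degrees, between two predicted source-direction vectors."""
    a = dir_a / np.linalg.norm(dir_a)
    b = dir_b / np.linalg.norm(dir_b)
    return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

# Example: prediction from a listener's own HRTFs vs. a mismatched set.
print(subtended_angle(np.array([0.0, 1.0, 0.0]), np.array([0.2, 0.9, 0.1])))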

More about the authors

Richard J. Fitzgerald, rfitzger@aip.org

