Creating authentic sounds keeps getting easier thanks to successive advances in sound reproduction: mono, stereo, surround sound, and 3D audio. Of these technologies, 3D audio stands out because it processes audio so that sounds appear to come from positions in three-dimensional space, mimicking real-life sounds for a more immersive user experience.

Specifically, 3D audio (also known as spatial sound) delivers an audio experience that improves the sense of immersion by simulating a surround-sound setup. It is sometimes confused with surround sound, but 3D audio goes further: surround sound can move sounds around the listener in all directions on a plane, whereas 3D audio can also make sounds come at the listener from other directions in 3D space. 3D audio processing typically places sound sources at different positions in a virtual 3D space to produce audio that sounds natural.

3D audio is usually produced manually, using raw audio tracks (such as the voice track and the piano track), a digital audio workstation (DAW), and a 3D reverb plugin. This process is slow and costly, and has a high barrier to entry. It can also be daunting for mobile app developers, because obtaining raw audio tracks is a challenge.

Fortunately, Audio Editor Kit from HMS Core can resolve all of these issues through two capabilities that facilitate 3D audio generation: audio source separation for obtaining raw audio tracks and spatial audio for converting 2D audio to 3D audio.

(Audio source separation and spatial audio)

Next, I would like to show you the basics I have learned about these two capabilities and how they can help with creating 3D audio.

Introduction to Audio Source Separation

Most audio that we are exposed to is stereophonic. Stereo audio mixes all audio objects (including voice, piano, and guitar sounds) into two channels, making it difficult to separate the sounds, let alone reshuffle the objects into different positions in a 3D space. This means that mixed audio objects must be separated before performing 2D-to-3D audio conversion.

The audio source separation capability is a valuable tool here. It combines deep learning models trained on a colossal amount of music data with classic signal processing methods. The capability uses the short-time Fourier transform (STFT) to convert the 1D audio signal into a 2D spectrogram, and then feeds the 1D audio signal and the 2D spectrogram into the model as two separate input streams. Using multi-layer residual coding and training on a large amount of data, it obtains a latent-space representation of the specified audio object. Finally, it uses a set of transformation matrices to restore the latent-space representation to the stereo sound signal of that object.
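To make the first step concrete, here is a minimal NumPy sketch of an STFT that turns a 1D signal into a 2D spectrogram. It only illustrates the general technique; the frame length, hop size, window, and test tone are my own assumptions and have nothing to do with the parameters Audio Editor Kit uses internally.

```python
import numpy as np

def stft(signal, frame_len=1024, hop=256):
    """Return a complex spectrogram of shape (num_frames, frame_len // 2 + 1).

    frame_len and hop are illustrative values, not the kit's settings.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the 1D signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # One FFT per frame: rows are time steps, columns are frequency bins.
    return np.fft.rfft(frames, axis=1)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate      # one second of audio
signal = np.sin(2 * np.pi * 440.0 * t)        # a 440 Hz test tone (assumption)

spec = stft(signal)
magnitude = np.abs(spec)                      # the 2D spectrogram
peak_bin = int(magnitude.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 1024       # frequency of the strongest bin
```

The spectrogram keeps time along one axis and frequency along the other, which is what lets a model treat separation as a problem on a 2D image-like input alongside the raw waveform.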

The matrices and network structure used in this process were developed specifically for the audio source separation capability and are designed according to the features of different audio sources. In this way, the capability can ensure that each of the sounds it supports is separated wholly and distinctly, providing high-quality raw audio tracks for creating 3D audio.

Audio source separation utilizes a set of advanced technologies. Here I will name a few:

Audio feature extraction. This technology involves direct extraction from the time domain signals by using an encoder and extraction of spectrogram features from the time domain signals by using the STFT.

Deep learning modeling. This uses residual modules and attention mechanisms to enhance harmonic modeling performance and time sequence correlation for different audio sources.

Multistage Wiener filter (MWF). This technique combines traditional signal processing with deep learning: the model predicts the power spectrum relationship between the target audio object and the non-objects, and the MWF builds and applies the filter coefficients based on that prediction.
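The filtering idea in the last item can be sketched with the classic single-stage Wiener gain, which is a much simpler relative of the multistage filter mentioned above and not the kit's actual implementation. In this sketch the two power spectrograms are random placeholders standing in for what a trained network would predict, and all names are hypothetical.

```python
import numpy as np

def wiener_mask(target_power, residual_power, eps=1e-10):
    """Classic Wiener gain: the target's share of the total power, per bin."""
    return target_power / (target_power + residual_power + eps)

rng = np.random.default_rng(0)

# Placeholder power spectrograms (frames x bins). In a real system these
# would be predicted by the deep learning model, not drawn at random.
target_power = rng.random((4, 5))
residual_power = rng.random((4, 5))

# Placeholder complex spectrogram of the mixed audio.
mixture_spec = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))

mask = wiener_mask(target_power, residual_power)
target_est = mask * mixture_spec            # estimate of the target object
residual_est = (1.0 - mask) * mixture_spec  # everything else
```

Because the two masks sum to one, the estimated object and the remainder always add back up to the original mixture, so nothing is lost or invented by the filtering step.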

(How audio source separation works)

To pave the way for 3D audio creation, audio source separation now supports 12 sound types: voice, accompaniment, drum sound, violin sound, bass sound, piano sound, acoustic guitar sound, electric guitar sound, lead vocalist, accompaniment with the backing vocal voice, stringed instrument sound, and brass stringed instrument sound.

Introduction to Spatial Audio

It's incredible that our ears can tell where a sound comes from just by hearing it. This is possible because a sound reaches our two ears at slightly different times and with slightly different intensities, depending on its direction, and the brain uses these differences to perceive the direction very quickly.
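To get a feel for how small these timing differences are, here is a short sketch based on Woodworth's spherical-head approximation, which estimates the interaural time difference as (r / c) * (θ + sin θ) for a distant source at azimuth θ between 0 and 90 degrees. The head radius and speed of sound below are typical textbook values that I am assuming, not figures from Audio Editor Kit.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in air at about 20 degrees C (assumed)
HEAD_RADIUS = 0.0875     # m, a commonly used average head radius (assumed)

def itd_seconds(azimuth_deg):
    """Interaural time difference for a far source, Woodworth's formula.

    azimuth_deg is measured from straight ahead and should lie in [0, 90].
    """
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS / SPEED_OF_SOUND * (theta + math.sin(theta))

itd_front = itd_seconds(0)    # source straight ahead: no difference
itd_side = itd_seconds(90)    # source directly to one side: largest difference
```

With these values the difference for a source directly to one side comes out at well under a millisecond, which shows how fine the cues are that our hearing works with.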

In the digital world, however, these differences are represented by a series of transformation functions, namely head-related transfer functions (HRTFs). HRTFs describe how a listener's body, for example the head shape and shoulder width, changes a sound on its way to each ear. By applying HRTFs to a point audio source, we can therefore simulate the direct sound arriving from a given direction.

To achieve a high level of audio immersion and ensure that 3D audio can be enjoyed by as many users as possible, the spatial audio capability comes with a set of relatively universal HRTFs. The capability also implements the reverb effect (the echo that appears after a sound is produced). It constructs a realistic acoustic space by using room impulse responses (RIRs) to simulate acoustic phenomena in the physical world, such as reflection, dispersion, and interference. By filtering the audio wave with the HRTFs and RIRs, the spatial audio capability can convert a sound (such as one obtained by using the audio source separation capability) to 3D audio.
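The filtering described above boils down to convolution: the source is convolved with a room impulse response and then with one head-related impulse response (the time-domain form of an HRTF) per ear. The sketch below shows that idea with toy impulse responses I made up for illustration. They are not measured data, the function name is hypothetical, and a real renderer would give each reflection its own direction rather than reuse one pair of filters.

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right, rir):
    """Filter a mono source with an RIR, then with per-ear HRIRs.

    Returns an array of shape (num_samples, 2): left and right channels.
    """
    reverberant = np.convolve(mono, rir)          # add the room's reflections
    left = np.convolve(reverberant, hrir_left)    # what the left ear hears
    right = np.convolve(reverberant, hrir_right)  # what the right ear hears
    return np.stack([left, right], axis=1)

sample_rate = 16000

# Toy HRIRs for a source on the listener's left (assumed values):
# the right ear hears the sound 10 samples (about 0.6 ms) later and quieter.
hrir_left = np.zeros(32)
hrir_left[0] = 1.0
hrir_right = np.zeros(32)
hrir_right[10] = 0.6

# Toy RIR: the direct sound plus a single reflection 25 ms later.
rir = np.zeros(800)
rir[0] = 1.0
rir[400] = 0.3

# A single click as the mono source, so the output shows the filters directly.
mono = np.zeros(100)
mono[0] = 1.0

stereo = spatialize(mono, hrir_left, hrir_right, rir)
```

Feeding in a separated track instead of the click would place that instrument to the listener's left inside the simulated room.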

(How spatial audio works)

HUAWEI Music uses these two capabilities so that users can enjoy 3D audio simply by opening the app and tapping Sci-Fi Audio or Focus on the Sound effects > Featured screen.

(Sci-Fi Audio and Focus)

Audio source separation and spatial audio help streamline 3D audio creation. The next time you want to generate 3D audio effects for your app, try using audio source separation to get the raw audio tracks, then import them into spatial audio and let it do the rest of the work for you. 3D audio is ideal for games and entertainment apps, but I'm curious to know what other fields you think it could be used in. Let me know in the comments section below.

A Review of the 3D Audio Creation Solution
