By Charles Pao, senior marketing specialist, CEVA
Charles Pao joined CEVA's Hillcrest Labs after graduating from Johns Hopkins University with a master's degree in electrical engineering. He started in software development, where he built a black box system for evaluating motion characteristics. Out of a passion for media and communications, he began producing demos and product videos for Hillcrest Labs, and that passion led to an official move into marketing. He is currently the first point of contact for Hillcrest Labs information and support and is responsible for managing its marketing work, alongside a variety of other responsibilities and project management roles. Charles also holds a Bachelor of Science degree in electrical engineering and computer engineering from Johns Hopkins University.
Immersive 3D/spatial audio, combined with XR/360 video, gives you an audio-visual experience like standing in a dense forest: fallen twigs crack under your feet, a deer runs off to the east, and as your eyes follow a red-crested bird flying away, you can hear its wings flapping.
Accurate head tracking helps provide a realistic user experience (UX). Understanding the key factors for evaluating solutions can help you find your way in this growing industry.
Key factors of head tracking
To make the discussion easier to follow, this article first summarizes the key factors in head tracking.
Delay: the time between an audio-visual signal leaving its source and being perceived by the user. For the purposes of this article, we divide it into two parts:
Audio input delay: the time between an audio signal leaving the audio source and being perceived by the user.
Head tracking delay: the time the 3D audio processing takes to adapt to the new head direction when your head moves.
Head tracking accuracy: in this article, we discuss 3-DOF head tracking, which tracks only orientation, rather than 6-DOF head tracking, which tracks both position and orientation. Accuracy refers to the difference between the actual motion and its corresponding position in the extended reality (XR) environment. If the sensor (and its algorithm) is not accurate, you may be able to track head motion in real time, but the motion in the virtual environment will differ from the motion in the real world.
Head tracking smoothness: the clarity of the 3D audio transition, and how perceptible that transition is, when the head changes direction. You want to create an XR experience that is free of jumps. A sudden change in the output will break the immersive experience and can even cause a crash during a game.
Head tracking delay
Without appropriate measuring equipment, delay is not easy to test objectively, but it can be tested subjectively. A study by the Audio Communication Group at TU Berlin shows that the average detection level of human subjects is 108 milliseconds, and the absolute detection threshold for a single sound source is 52 to 73 milliseconds. To clarify, the team studied "total system delay", which refers to the time difference between the speaker's audio output and the device output. The study concluded that, on average, it took 108 milliseconds of delay for humans to notice it after a change in motion. The delay is more noticeable when the sound comes from a single source.
This delay has no effect when listening to prerecorded music or other head-locked audio content. For recorded video, however, lip-sync problems may occur if the display does not delay the image to compensate for the audio input delay. For video games, you do not want to delay the picture, because that would hurt player performance. Low delay is therefore very important for keeping the sound synchronized with the game picture. Some delay will always exist, but the key is to minimize it so that users do not notice its effects.
In spatial audio systems, the spatial audio input is processed using head tracking data, typically through head-related transfer functions (HRTFs), reverberation or other room simulation techniques. There are several common ways to implement a spatial audio system around this processing.
If you run the spatial processing algorithm on the hearable itself, the wireless link only adds to the audio input delay. Since there is no wireless link in the head tracking path, the head tracking delay remains very low. This is a key advantage of performing both spatial processing and head tracking on the same device.
Another method is to perform the spatial audio processing on a mobile device such as a phone. The head tracking information is sent from the hearable to the mobile device, which processes it and then sends the audio back to the user. Because of the additional communication link, this method increases the head tracking delay compared with the previous method. Audio can be transferred from the phone to the headset via Bluetooth, and the Bluetooth delay depends on the audio codec used. The delay of faster codecs can be as low as 50-80 milliseconds, but the delay of more common codecs can be as high as 170-270 milliseconds. Sending the head tracking data typically adds a further delay of 50-100 milliseconds.
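As a rough illustration of what these figures imply, the sketch below adds up a head tracking delay budget for the phone-side method. It assumes, for illustration only, that the head tracking data delay and the Bluetooth codec delay simply add, and it ignores processing time. The function and constant names are hypothetical; the numbers are the ones quoted in this article.

```python
from typing import Tuple

# Figures quoted in this article, in milliseconds (low, high).
CODEC_DELAY_MS = {"fast codec": (50, 80), "common codec": (170, 270)}
HEAD_DATA_DELAY_MS = (50, 100)   # head tracking data sent to the phone
DETECTION_LEVEL_MS = 108         # average detection level from the TU Berlin study


def phone_side_head_tracking_delay(codec: str) -> Tuple[int, int]:
    """Assumed (low, high) head tracking delay when the phone does the processing.

    Assumption: the uplink (head data) and downlink (audio codec) delays add.
    """
    low, high = CODEC_DELAY_MS[codec]
    return low + HEAD_DATA_DELAY_MS[0], high + HEAD_DATA_DELAY_MS[1]


for codec in CODEC_DELAY_MS:
    low, high = phone_side_head_tracking_delay(codec)
    verdict = "can stay below" if low < DETECTION_LEVEL_MS else "exceeds"
    print(f"{codec}: {low}-{high} ms, {verdict} the {DETECTION_LEVEL_MS} ms average detection level")
```

Under that additive assumption, the faster codecs give 100-180 milliseconds and the more common codecs give 220-370 milliseconds, so only the best case of the faster codecs stays under the 108 millisecond average detection level.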
With this understanding of spatial audio systems and the research on human delay detection, we have a rough idea of what good and bad delay looks like in a spatial audio system. Try using a higher-frequency sound to test the delay, because low-frequency sound is not very directional (which is why stereo systems usually have only one subwoofer).
A good sound source for testing delay is a continuous sound that is easy to localize. Ideally, it mixes sounds of multiple frequencies. To keep the test description simple, however, consider using a continuously playing high-frequency tone. Higher frequencies are easier to localize, while a constant tone lets you notice changes in the audio image.
Suppose your headset has a head tracking delay of 200 milliseconds. For good audio rendering, we want the audio image to be off by no more than 5 degrees. This means the user must always move at less than 25 degrees per second (5 degrees divided by 0.2 seconds). To help you visualize it, that is the same as rotating your head 90 degrees in 3.6 seconds. This is quite slow, and you move much faster than that under normal circumstances.
In the test, if you rotate your head 90 degrees in about 1/4 second, you are moving at 360 degrees per second. With a delay of 200 milliseconds, the sound source will be off by 72 degrees, but it will only be in the wrong position for 200 milliseconds. Using the continuous sound as a reference makes this delay easy to identify.
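The arithmetic in the two examples above can be reproduced with a short sketch. The function names are illustrative only; the relation used is simply angular error = head speed × delay.

```python
def max_head_speed(max_error_deg: float, delay_s: float) -> float:
    """Fastest head rotation (deg/s) that keeps the audio image within max_error_deg."""
    return max_error_deg / delay_s


def angular_error(head_speed_deg_s: float, delay_s: float) -> float:
    """How far (deg) the audio image lags the head at a given rotation speed."""
    return head_speed_deg_s * delay_s


delay_s = 0.200  # 200 ms head tracking delay, as in the example

slow_limit = max_head_speed(5.0, delay_s)  # deg/s allowed for a 5 degree error
time_for_90 = 90.0 / slow_limit            # seconds to turn 90 degrees at that limit

fast_speed = 90.0 / 0.25                   # 90 degrees in 1/4 second
fast_error = angular_error(fast_speed, delay_s)

print(f"max speed for a 5 deg error: {slow_limit:.0f} deg/s")
print(f"time to turn 90 deg at that speed: {time_for_90:.1f} s")
print(f"error at {fast_speed:.0f} deg/s: {fast_error:.0f} deg")
```

Changing `delay_s` shows how quickly the allowed head speed grows as the delay shrinks: halving the delay doubles the speed you can move at for the same 5 degree error.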
Accuracy, precision and smoothness
Accuracy is about the difference between the measured movement and the real-world movement, the "true answer". Precision is about how consistently you get the same answer. True accuracy can only be measured using a complete 9-axis solution with a magnetometer. However, because of the magnetic drivers used in audio hardware and the changing user environment, a complete 9-axis head tracking solution is not practical. This is why most spatial audio hardware uses only accelerometers and gyroscopes.
Testing accuracy and smoothness is a bit tricky, but you should be able to test both with your spatial audio software. Clear voice audio (such as a podcast) may be the best tool for testing these criteria. In a podcast, the speaker is in a fixed position, so no matter which way you turn your head, the speaker's voice should come from the same place. When you move your head, the 3D audio should transition from one position to another without any significant drop or change in volume or sound quality.
The gyroscope in a 3D/spatial audio headset is prone to offset (drift), which reduces the overall accuracy of the headset. The software will typically give you several options for dealing with it: manual reset, slow stabilization or fast stabilization.
If you do not correct the offset, you will find that the people you are listening to slowly move around the room over time. Maybe they started right in front of you, but now they are left of center. This effect is not ideal. You can reset the device manually by pressing a designated button (on the device or in the software), which tells it "I am looking straight ahead again" and resets the offset. However, the offset will still build up gradually over time. The slow reset method relies on the assumption that your head is usually facing the object you are looking at. With this assumption, it can reset the gyroscope offset over a few minutes. The fast reset method uses the same idea, but it corrects much more quickly, within a few seconds.
You need to choose the automatic reset method that suits your specific use case. If you mostly look in one direction at a screen, slow reset is the ideal choice, because occasionally looking somewhere other than the screen will not affect the reset, and the sound stays centered on your point of focus. Resetting the "front" direction manually at the start of the activity gives the algorithm a head start, so you do not have to wait a few minutes for it to adjust. However, if you play games on multiple screens at home, play action games on your phone, or take a walk in the park, your direction will change frequently. Fast reset keeps up better with the direction changes in these scenarios.
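The trade-off between slow and fast reset can be sketched with a simple model. This is not CEVA's algorithm: it assumes, for illustration only, that the "front" reference is pulled toward the current head yaw by a first-order (exponential) filter, and the two time constants are hypothetical values chosen to settle over minutes and over seconds. Yaw wrap-around at 360 degrees is ignored.

```python
import math


def recenter(reference_deg: float, yaw_deg: float, dt_s: float, tau_s: float) -> float:
    """Pull the 'front' reference toward the current yaw with time constant tau_s."""
    alpha = 1.0 - math.exp(-dt_s / tau_s)
    return reference_deg + alpha * (yaw_deg - reference_deg)


def front_shift_after_glance(glance_deg: float, glance_s: float, tau_s: float,
                             dt_s: float = 0.01) -> float:
    """How far 'front' moves while the head is held at glance_deg for glance_s seconds."""
    reference = 0.0  # 'front' starts aligned with the screen
    for _ in range(round(glance_s / dt_s)):
        reference = recenter(reference, glance_deg, dt_s, tau_s)
    return reference


SLOW_TAU_S = 120.0  # hypothetical: slow reset, settles over minutes
FAST_TAU_S = 3.0    # hypothetical: fast reset, settles over seconds

for name, tau in (("slow", SLOW_TAU_S), ("fast", FAST_TAU_S)):
    shift = front_shift_after_glance(glance_deg=30.0, glance_s=5.0, tau_s=tau)
    print(f"{name} reset: 'front' shifts {shift:.1f} deg during a 5 s glance 30 deg away")
```

In this model, a 5 second glance away from the screen barely moves "front" under slow reset (a little over 1 degree), while fast reset follows most of the way to the new direction (about 24 degrees). That is why slow reset suits a fixed screen and fast reset suits frequent direction changes.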
As you turn your head while listening to the podcast, pay attention to how well the sound is tracked in space and how smoothly its position changes as it moves (or whether you notice the movement at all). The smoothness of spatial audio shows mainly in how clear the sound stays while its position changes. Whether you rotate your head slowly or quickly, a clean, continuous change in audio position is the sign of a good smoothing algorithm. If you notice audio skipping or obvious quantization while your head is moving, it may be a sign of jump corrections, or of a sensor or system that cannot translate motion smoothly.
With large technology companies creating a range of products with integrated 3D/spatial audio, 3D/spatial audio is becoming mainstream. The more products there are, the more you need to know how to choose the best one. Although the assessment above largely reflects my subjective views, I hope that explaining the ideas and logic behind the evaluation and testing gives you some guidance in the world of 3D/spatial audio. If you want to see the importance of head tracking latency visually, or want more information about HRTFs, check out the webinar video. If you are interested in the content of this article or the webinar, please send us a message to find out which CEVA products can best support your project.