Immersive Audio & IVAS: How Spatial Hearing AI Enables True Telepresence
By Dani Cherkassky, Ph.D., CEO, Kardome Technology — March 2025
Immersive Voice and Audio Service (IVAS), introduced in 3GPP Release 18, brings immersive audio delivery to modern communication. But immersive telepresence isn’t only about how sound is rendered; it also depends on how sound is captured and analyzed in real environments. Spatial Hearing AI addresses this gap by decoding complex soundscapes in enclosed spaces (like cars, offices, and living rooms) using each speaker’s unique acoustic signature, shaped by location and reflections, so spatial audio scenes can feel more natural and easier to follow.
Introduction
Virtual collaboration keeps growing, and expectations for “being there” keep rising. Yet traditional telecommunication, even high-quality voice and video, often misses the subtle dynamics of face-to-face interaction. That gap contributes to “Zoom fatigue,” where users work harder to follow conversations due to limited auditory context and outdated audio assumptions.
Immersive Voice and Audio Service is a meaningful step forward: it combines advanced audio processing with spatial awareness to enable more lifelike communication experiences. Still, the IVAS codec mainly enables transmission of spatial audio information. To deliver truly immersive experiences, systems also need strong spatial audio capture and analysis—the ability to interpret a real environment and produce the right spatial parameters for rendering.
Rendering has advanced rapidly, but spatial audio capture remains comparatively underdeveloped. Spatial Hearing AI is designed to help close that gap by using environmental cues to interpret and extract audio surroundings, closer to how humans navigate complex sound environments.
IVAS Codec Standard Issued by 3GPP Release 18
What is the IVAS Codec?
The Immersive Voice and Audio Service (IVAS) is a 3GPP standard that upgrades mobile calls from monophonic audio to immersive, 3D sound. It is the successor to the EVS mono codec.
The telephone connected people across distances, but even modern voice and video calls can still feel “flat” compared to being in the same room. Better audio quality alone doesn’t fully reproduce the spatial and social cues that help humans track who is speaking and from where.
3GPP Release 18 defines the IVAS codec standard as an extension beyond traditional monophonic voice codecs. It enables transmission of immersive, three-dimensional sound over mobile networks, moving communication beyond standard mono voice calls.
Supported audio formats
IVAS uses advanced audio compression and spatial processing techniques and supports several audio formats, including:
- Stereo — basic spatial audio with two channels.
- Multi-channel — a more immersive experience with multiple audio channels.
- MASA (Metadata-Assisted Spatial Audio) — a format designed for limited form factors (e.g., smartphones). It uses metadata to describe spatial characteristics of the audio scene, enabling efficient immersive audio delivery even with constrained processing power.
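To make the metadata idea more concrete, here is a minimal Python sketch of a spatial descriptor for one time-frequency tile. It is loosely modeled on the direction and energy-ratio descriptors common in parametric spatial audio. The class name `SpatialTile`, its field names, and the helper `mean_directional_share` are illustrative assumptions and do not reproduce the bitstream layout defined by 3GPP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialTile:
    """Hypothetical spatial descriptor for one time-frequency tile."""
    azimuth_deg: float      # horizontal direction of the dominant sound
    elevation_deg: float    # vertical direction of the dominant sound
    direct_to_total: float  # share of energy that is directional, 0..1

    def __post_init__(self):
        if not 0.0 <= self.direct_to_total <= 1.0:
            raise ValueError("direct_to_total must be within [0, 1]")

    @property
    def diffuse_share(self) -> float:
        """Energy share treated as non-directional (ambience, reverberation)."""
        return 1.0 - self.direct_to_total


def mean_directional_share(tiles):
    """Unweighted mean of the directional energy share across tiles."""
    if not tiles:
        raise ValueError("at least one tile is required")
    return sum(t.direct_to_total for t in tiles) / len(tiles)
```

The point of the sketch is that a few numbers per tile, sent alongside a small number of audio channels, can describe where sound comes from and how diffuse it is, which is far more compact than sending every microphone channel.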
IVAS vs. traditional audio codecs
Immersive Communication Architecture
3GPP’s IVAS architecture (TS 26.250) describes an end-to-end approach aimed at rebuilding the original sound environment so the listener feels genuinely present.
Core components of an IVAS immersive module
IVAS communication can be viewed as three building blocks:
- Audio transducers: microphones and loudspeakers that capture and play back sound.
- Spatial modules: a spatial analyzer that interprets the 3D acoustic scene and extracts MASA parameters.
- Communication modules: the IVAS encoder/decoder plus the network layer that transports the audio scene efficiently.
On the receiving side, a spatial synthesizer renders an immersive 3D experience over speakers or headphones using advanced positioning algorithms. For headphone playback, this often includes HRTFs (head-related transfer functions) to place each source around the listener, and it can follow head rotation so the scene stays stable as the listener moves, making playback feel more natural and realistic.
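As a rough illustration of what a renderer has to compute, the sketch below compensates a source direction for head rotation and then estimates the interaural time difference with the Woodworth spherical-head formula. This is a textbook simplification, not an HRTF and not part of the IVAS renderer. The head radius, the sign convention, and both function names are assumptions made for the example.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius
SPEED_OF_SOUND = 343.0   # m/s, air at roughly 20 degrees Celsius


def relative_azimuth_deg(source_az_deg, head_yaw_deg):
    """Source azimuth seen from the rotated head, wrapped to [-180, 180)."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


def woodworth_itd_s(azimuth_deg):
    """Interaural time difference from the Woodworth spherical-head model.

    Valid for frontal azimuths in [-90, 90]; the result carries the
    sign of the azimuth.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must be within [-90, 90] degrees")
    theta = math.radians(abs(azimuth_deg))
    itd = HEAD_RADIUS_M / SPEED_OF_SOUND * (theta + math.sin(theta))
    return math.copysign(itd, azimuth_deg)
```

When the listener turns toward a source, `relative_azimuth_deg` moves toward zero and the time difference shrinks with it, which is the cue that keeps a rendered voice anchored in place.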
The Spatial Analyzer: the critical module for immersion
The spatial analyzer is essential for a truly immersive audio experience. It uses microphone arrays to capture and process sound—from compact near-field setups (two or three mics embedded in PCs or smartphones) to professional far-field arrays (12+ microphones) designed for conferencing scenarios.
Beyond recording audio, the analyzer extracts acoustic properties of the environment and converts them into MASA parameters. Those parameters are sent over the network to the listener, enabling reconstruction of the spatial soundscape on the receiving side.
While spatial audio synthesizers have received significant attention from technology providers and research institutions, spatial audio analyzers have lagged behind, even though they are responsible for capturing the “essence” of an audio scene at the virtual listener’s location.
What the Spatial Analyzer must extract
Traditionally, the analyzer’s output parameters fall into two groups:
1) Sound source positioning (where each source is)
Defined relative to the virtual listener:
- Azimuth: horizontal angle
- Elevation: vertical angle
- Distance: perceived distance from listener
2) Environment model (what the room “does” to sound)
Key elements include:
- Geometry: room shape and dimensions
- Reflections: how sound bounces and creates echoes
- Reverberation: persistence of sound over time
- Absorption: how materials absorb energy (affects decay)
- Noise: background noise profile
In theory, if these parameters are accurately extracted at the capture location, they can be efficiently transmitted and used by the synthesizer to recreate the experience of being in the same room as the speaker.
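Both parameter groups reduce to small computable quantities. The sketch below converts an azimuth, elevation, and distance triple into listener-centred coordinates, and estimates reverberation time from room volume and surface absorption with Sabine's formula. The axis convention (x forward, y left, z up) and the function names are assumptions for illustration. Sabine's formula is itself an approximation that works best in fairly diffuse, moderately absorptive rooms.

```python
import math


def to_cartesian(azimuth_deg, elevation_deg, distance_m):
    """Listener-centred position: x forward, y left, z up (assumed convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    horizontal = distance_m * math.cos(el)
    return (
        horizontal * math.cos(az),
        horizontal * math.sin(az),
        distance_m * math.sin(el),
    )


def sabine_rt60(volume_m3, surfaces):
    """Sabine reverberation time in seconds (metric units).

    surfaces: iterable of (area_m2, absorption_coefficient) pairs.
    """
    absorption = sum(area * alpha for area, alpha in surfaces)
    if absorption <= 0.0:
        raise ValueError("total absorption must be positive")
    return 0.161 * volume_m3 / absorption
```

For example, a 5 m by 4 m by 3 m room has a volume of 60 m³ and 94 m² of surface. With a uniform absorption coefficient of 0.2, `sabine_rt60(60.0, [(94.0, 0.2)])` gives roughly 0.51 s.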
The IVAS standard defines how to transmit spatial audio, but the question remains: how do we accurately capture it? In real-world environments, traditional analysis methods fall short.
Why DOA Alone Isn’t Enough in Enclosed Environments
Much speech research has focused on direction of arrival (DOA) as the primary cue for spatial scene representation. But DOA often overlooks other crucial spatial parameters—especially in enclosed spaces where sound propagation is far more complex.
In real rooms (car, office, living room), sound rarely travels only directly to the listener. It interacts with many surfaces, creating reflections that contribute significantly to the overall soundscape, so a single source can be perceived as arriving from multiple directions (like being in a hall of mirrors).
As a result, relying only on DOA can fail to capture the physical soundscape: it can “describe” only one direction out of many paths the sound arrives from, missing the full environment-driven structure of the scene.
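The multi-path effect can be shown with a first-order image-source model in two dimensions: a single reflecting wall makes one source appear a second time, mirrored behind the wall. This is a deliberately simplified sketch with one wall, no absorption, and a point microphone. The function names and the wall placement at `x = wall_x` are assumptions for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s


def arrival(src, mic):
    """Return (angle_deg, delay_s) of a straight path from src to mic."""
    dx, dy = src[0] - mic[0], src[1] - mic[1]
    angle = math.degrees(math.atan2(dy, dx))
    return angle, math.hypot(dx, dy) / SPEED_OF_SOUND


def first_order_paths(src, mic, wall_x):
    """Direct path plus one reflection off a wall along the line x = wall_x."""
    if src[0] >= wall_x or mic[0] >= wall_x:
        raise ValueError("source and microphone must be on the same side of the wall")
    # Mirroring the source across the wall gives the apparent origin
    # of the reflected sound.
    image = (2.0 * wall_x - src[0], src[1])
    return {"direct": arrival(src, mic), "reflection": arrival(image, mic)}
```

With the microphone at the origin, a source at (1, 1), and a wall at x = 2, the direct sound arrives from 45 degrees. The reflection arrives a few milliseconds later from about 18 degrees. A DOA-only description keeps one of those directions and discards the other, even though both belong to the same talker.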
Spatial Hearing AI Technology: Decoding Soundscapes
Spatial Hearing AI is inspired by the human auditory system, which uses environmental cues to analyze and interpret complex surroundings. By processing multi-dimensional spatial properties, humans can navigate and understand real sound environments.
Spatial Hearing AI is Kardome’s proprietary soundscape analysis method. It decodes spatial cues by inferring each sound source’s unique reflection pattern (room impulse response), which reflects the relative geometry between the source, the listening device, and the environment. This happens transparently, without requiring active input from the sound sources.
By capturing this interplay, Spatial Hearing AI overcomes key limitations of DOA-only methods and enables more accurate decoding of multi-dimensional soundscapes in enclosed environments.
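The role of the room impulse response can be illustrated with the standard linear model, in which the signal at a microphone is the dry source signal convolved with the impulse response from that source position. The sketch below shows only this forward model with toy numbers. It says nothing about how Kardome infers the response, and the function name and sample values are assumptions.

```python
def convolve(signal, impulse_response):
    """Discrete linear convolution of two finite sequences."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# The same dry signal heard from two positions in a room.
dry = [1.0, 0.5]
rir_near = [1.0, 0.0, 0.4]       # strong direct path, one early reflection
rir_far = [0.6, 0.3, 0.0, 0.3]   # weaker direct path, two reflections

heard_near = convolve(dry, rir_near)
heard_far = convolve(dry, rir_far)
```

The two outputs differ even though the dry signal is identical. That position-dependent difference is the information a reflection-pattern signature relies on.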
Unique Acoustic Signature (UAS): the core idea
Spatial Hearing AI processes captured audio in overlapping segments (“frames”). Within each frame, it groups elements that share the same Unique Acoustic Signature (UAS). You can think of a UAS as a fingerprint that helps identify each speaker in a scene based on their location and reflection pattern.
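Splitting audio into overlapping frames is a standard first step in this kind of processing. The following is a minimal sketch. The function name, the frame length, and the hop size are assumptions, and real systems typically also apply a window function to each frame.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sequence into fixed-length frames that start every `hop` samples.

    Frames overlap whenever hop < frame_len. A trailing remainder that
    is too short to fill a whole frame is dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


# Ten samples, frames of four, hop of two: each frame shares half of
# its samples with the previous one.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
```

The overlap means that a sound event falling on a frame boundary still appears whole in at least one frame.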
When a speaker starts talking, the system may initially rely on simplified spatial cues, like DOA, while it learns the environment and builds the speaker’s UAS. Over time, it analyzes the environment and adds features, capturing distinct environmental qualities, so it can characterize sound sources accurately even in challenging acoustics.
Example: Separating Three Speakers in a Room
If three speakers are present, Spatial Hearing AI can identify three distinct acoustic signatures, one per speaker, plus a fourth group for elements that don’t match any UAS (ambient noise).
This signature-based grouping enables accurate soundscape analysis in complex environments with multiple sources and reflections.
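The grouping in the three-speaker example can be sketched as a nearest-signature assignment with a rejection threshold. This is a generic illustration and not Kardome's algorithm. The feature vectors, the use of cosine similarity, the threshold value, and all names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_signature(features, signatures, threshold=0.9):
    """Label each feature vector with its best-matching signature.

    Vectors whose best similarity stays below `threshold` are labeled
    "ambient", the catch-all group for unmatched elements.
    """
    labels = []
    for feature in features:
        best_name, best_score = "ambient", threshold
        for name, signature in signatures.items():
            score = cosine_similarity(feature, signature)
            if score >= best_score:
                best_name, best_score = name, score
        labels.append(best_name)
    return labels


# Hypothetical three-dimensional signatures, one per speaker.
signatures = {
    "speaker_1": [1.0, 0.0, 0.0],
    "speaker_2": [0.0, 1.0, 0.0],
    "speaker_3": [0.0, 0.0, 1.0],
}
features = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.05], [0.1, 0.0, 0.95], [1.0, 1.0, 1.0]]
labels = group_by_signature(features, signatures)
```

The first three vectors each sit close to one signature, and the last one matches none of them well enough, so it falls into the ambient group.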
Use Cases
IVAS-based immersive communication can make audio experiences more engaging and realistic across scenarios such as:
1) Collaboration and conferencing
- Remote “telepresence” meetings: clearer sense of who is speaking and where, even across continents.
- Collaborative workspaces: improved spatial awareness for teams working remotely.
2) Media and entertainment
- Gaming: more realistic sound placement (e.g., footsteps behind you, overhead motion cues).
- VR/AR: stronger presence and immersion through realistic spatial cues.
Acoustic Signature
A listener experiences sound through a combination of direct sound waves and reflected waves, which are shaped by the speaker's location and the room's physical acoustics.
Acoustic Signature Classification
Every sound source possesses a distinct acoustic signature, characterized by its unique spectral and temporal properties, enabling the discrimination of individual speech sources.
Signature-Based Signal Grouping
Signals are classified based on their acoustic signature, not direction, making the algorithm microphone-independent.
Conclusion
Spatial Hearing AI addresses the challenge of complex sound environments by enabling precise soundscape analysis and spatially accurate audio experiences. Whether in virtual meetings, gaming, or VR/AR, it can create more natural immersive interactions by isolating voices and positioning sounds accurately.
This approach provides a foundation for advancing how we communicate and interact with technology, opening new possibilities for engineers and developers working on the next generation of audio solutions.