
<pre class='metadata'>
Group: AOM
Status: WGD
Title: Immersive Audio Model and Formats
Editor: SungHee Hwang, Samsung, hshee@samsung.com
Editor: Felicia Lim, Google, flim@google.com
Repository: AOMediaCodec/iamf
Shortname: iamf
URL: https://aomediacodec.github.io/iamf/
Date: 2023-01-09
Abstract: This document specifies an immersive audio (IA) architecture and model, a standalone IA sequence format and an [[!ISOBMFF]]-based IA container format.
</pre>

<pre class="anchors">
url: https://www.iso.org/standard/68960.html#; spec: ISOBMFF; type: dfn;
	text: AudioSampleEntry
	text: boxtype
	text: grouping_type
	text: SampleGroupDescriptionEntry
	text: channelcount
	text: samplerate
	text: AudioPreRollEntry

url: https://www.iso.org/standard/68960.html#; spec: ISOBMFF; type: property;
	text: iso6
	text: sgpd
	text: stsd
	text: sbgp
	text: edts
	text: stts
	text: prol

url: https://aomedia.org/av1/specification/conventions/; spec: AV1-Convention; type: dfn;
	text: leb128()
	text: Clip3

url: https://www.iso.org/standard/43345.html#; spec: AAC; type: dfn;
	text: raw_data_block()
	text: ADTS
	text: Low Complexity Profile

url: https://opus-codec.org/docs/opus_in_isobmff.html#; spec: OPUS-IN-ISOBMFF; type: dfn;
	text: OpusSpecificBox
	text: OutputChannelCount
	text: OutputGain
	text: ChannelMappingFamily
	text: PreSkip
	text: InputSampleRate


url: https://opus-codec.org/docs/opus_in_isobmff.html#; spec: OPUS-IN-ISOBMFF; type: property;
	text: opus
	text: dOps

url: https://www.iso.org/standard/55688.html#; spec: MP4-Systems; type: dfn;
	text: objectTypeIndication
	text: streamType
	text: upstream
	text: decSpecificInfo()
	text: DecoderConfigDescriptor()
	text: Syntatic Description Language

url: https://www.iso.org/standard/76383.html#; spec: MP4-Audio; type: dfn;
	text: AudioSpecificConfig()
	text: audioObjectType
	text: channelConfiguration
	text: GASpecificConfig()
	text: frameLengthFlag
	text: dependsOnCoreCoder
	text: extensionFlag

url: https://www.iso.org/standard/79110.html#; spec: MP4; type: dfn;
	text: ESDBox

url: https://www.iso.org/standard/79110.html#; spec: MP4; type: property;
	text: mp4a
	text: esds

url: https://tools.ietf.org/html/rfc6381#; spec: RFC6381; type: property;
	text: codecs

url: https://tools.ietf.org/html/rfc8486#; spec: RFC8486; type: dfn;
	text: channel count

url: https://tools.ietf.org/html/rfc7845#; spec: RFC7845; type: dfn;
	text: ID Header
	text: Output Gain

url: https://tools.ietf.org/html/rfc6716#; spec: RFC6716; type: dfn;
	text: opus packet

url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf#; spec: ITU1770-4; type: dfn;
	text: LKFS

url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf#; spec: ITU2051-3; type: dfn;
	text: Loudspeaker configuration for Sound System A (0+2+0)
	text: Loudspeaker configuration for Sound System B (0+5+0)
	text: Loudspeaker configuration for Sound System C (2+5+0)
	text: Loudspeaker configuration for Sound System D (4+5+0)
	text: Loudspeaker configuration for Sound System E (4+5+1)
	text: Loudspeaker configuration for Sound System F (3+7+0)
	text: Loudspeaker configuration for Sound System G (4+9+0)
	text: Loudspeaker configuration for Sound System H (9+10+3)
	text: Loudspeaker configuration for Sound System I (0+7+0)
	text: Loudspeaker configuration for Sound System J (4+7+0)
	text: SP Label

url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2127-0-201906-I!!PDF-E.pdf#; spec: ITU2127-0; type: dfn;
	text:

url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2076-2-201910-I!!PDF-E.pdf#; spec: ITU2076-2; type: dfn;
	text:

url: https://en.wikipedia.org/wiki/Q_(number_format); spec: Q-Format; type: dfn;
	text:

url: https://xiph.org/flac/format.html; spec: FLAC; type: dfn;
	text: METADATA_BLOCK
	text: FRAME
	text: FRAME_HEADER
	text: SUBFRAME
	text: FRAME_FOOTER

url: https://xiph.org/flac/format.html; spec: FLAC; type: property;
	text: fLaC


</pre>

<pre class='biblio'>
{
	"AI-CAD-Mixing": {
		"title": "AI 3D immersive audio codec based on content-adaptive dynamic down-mixing and up-mixing framework",
		"status": "Paper",
		"publisher": "AES",
		"href": "https://www.aes.org/e-lib/browse.cfm?elib=21489"
	},
	"AAC": {
		"title": "Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC)",
		"status": "Standard",
		"publisher": "ISO/IEC",
		"href": "https://www.iso.org/standard/43345.html"
	},
	"MP4-Audio": {
		"title": "Information technology — Coding of audio-visual objects — Part 3: Audio",
		"status": "Standard",
		"publisher": "ISO/IEC",
		"href": "https://www.iso.org/standard/76383.html"
	},
	"MP4-Systems": {
		"title": "Information technology — Coding of audio-visual objects — Part 1: Systems",
		"status": "Standard",
		"publisher": "ISO/IEC",
		"href": "https://www.iso.org/standard/55688.html"
	},
	"OPUS-IN-ISOBMFF": {
		"title": "Encapsulation of Opus in ISO Base Media File Format",
		"status": "Best Practice",
		"publisher": "IETF",
		"href": "https://opus-codec.org/docs/opus_in_isobmff.html"
	},
	"ITU1770-4": {
		"title": "Algorithms to measure audio programme loudness and true-peak audio level",
		"status": "Standard",
		"publisher": "ITU",
		"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf"
	},
	"ITU2051-3": {
		"title": "Advanced sound system for programme production",
		"status": "Standard",
		"publisher": "ITU",
		"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf"
	},
	"Q-Format": {
		"title": "Q (number format)",
		"status": "Best Practice",
		"publisher": "Wikipedia",
		"href": "https://en.wikipedia.org/wiki/Q_(number_format)"
	},
	"BCP47": {
		"title": "BCP 47",
		"status": "Best Practice",
		"publisher": "IETF",
		"href": "https://www.rfc-editor.org/info/bcp47"
	},
	"FLAC": {
		"title": "Free Lossless Audio Codec",
		"status": "Best Practice",
		"publisher": "xiph.org",
		"href": "https://xiph.org/flac/format.html"
	},
	"AV1-Convention": {
		"title": "Conventions",
		"status": "Spec",
		"publisher": "aomedia.org",
		"href": "https://aomedia.org/av1/specification/conventions/"
	},
	"ITU2076-2": {
		"title": "Audio Definition Model",
		"status": "Standard",
		"publisher": "ITU",
		"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2076-2-201910-I!!PDF-E.pdf"
	},
	"ITU2127-0": {
		"title": "Audio Definition Model renderer for advanced sound systems",
		"status": "Standard",
		"publisher": "ITU",
		"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2127-0-201906-I!!PDF-E.pdf"
	}
}
</pre>

# Convention # {#convention}

## Syntax Description ## {#convention-syntaxstructure}

All syntax elements shall conform to the [=Syntatic Description Language=] specified in [[!MP4-Systems]] unless explicitly described otherwise in this specification.

### Data types ### {#convention-data-types}

 <b>leb128()</b> <b>syntaxName</b>
 
 <b>leb128()</b> indicates the type of an unsigned integer. It indicates the following unsigned integer <b>syntaxName</b> shall be encoded by [=leb128()=] specified in [[!AV1-Convention]].
 
 <b>syntaxName</b> is an unsigned integer which is encoded by [=leb128()=] specified in [[!AV1-Convention]].
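
The leb128() coding can be illustrated with a minimal sketch, assuming the little-endian base-128 layout of the AV1 conventions: each byte carries 7 data bits, the most significant bit of a byte signals that more bytes follow, and a field is at most 8 bytes long. The function names `encode_leb128` and `decode_leb128` are hypothetical helpers for illustration, not part of this specification.

```python
def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer as little-endian base-128 bytes."""
    if value < 0:
        raise ValueError("leb128 encodes unsigned integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits carry data
        value >>= 7
        if value:
            out.append(byte | 0x80)  # MSB set: more bytes follow
        else:
            out.append(byte)         # MSB clear: this is the last byte
            return bytes(out)


def decode_leb128(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_pos) for the leb128 field starting at data[pos]."""
    value = 0
    for i in range(8):               # assumed limit: at most 8 bytes per field
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128 field longer than 8 bytes")


encoded = encode_leb128(300)              # b"\xac\x02"
value, next_pos = decode_leb128(encoded)  # (300, 2)
```

Returning the next read position alongside the value lets a parser chain several variable-length fields without tracking their sizes separately.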
 
 <b>sleb128()</b> <b>syntaxName</b>

 <b>sleb128()</b> indicates the type of a signed integer. It indicates the following signed integer <b>syntaxName</b> shall be encoded by [=leb128()=] specified in [[!AV1-Convention]].
 
 <b>syntaxName</b> is a signed integer which is encoded by [=leb128()=] specified in [[!AV1-Convention]].
 
 <b>string</b> <b>syntaxName</b>

 <b>string</b> indicates the type of a string which is terminated by a one-byte null (i.e. 0x00).
 
 <b>syntaxName</b> is a human readable label whose byte representation shall consist of a <b>two-letter primary language subtag</b> and a <b>two-letter region subtag</b> connected by a hyphen ("-"), followed by the byte representation of [=UTF-8_Enc(label)=].
 
 Here, the <b>two-letter primary language subtag</b> and the <b>two-letter region subtag</b> shall conform to [[!BCP47]].

## Arithmetic Operators ## {#convention-arithmetic-operators}

<table class="def">
<tr>
  <td>+</td><td>Addition.</td>
</tr>
<tr>
  <td>-</td><td>Subtraction.</td>
</tr>
<tr>
  <td>*</td><td>Multiplication.</td>
</tr>
<tr>
  <td>floor(x)</td><td>The largest integer that is smaller than or equal to x.</td>
</tr>
<tr>
  <td>sqrt(x)</td><td>The square root of x.</td>
</tr>
</table>

## Function ## {#convention-function}

### Function templates ### {#convention-function-templates}

When the <b>template</b> keyword is used to decorate the <b>class</b> declaration, it indicates that the code is a template with a placeholder type that can be reused by other classes. Only classes that use the template shall be present in the bitstream; the template itself shall not be present in the bitstream. Classes that use a function template shall pass a data type that is specified in either [[!MP4-Systems]] or [[#convention-data-types]].

<b>Example</b>

```
template <class T>
class Foo {
  T t;
}

class Bar {
  Foo<int> f;
}
```

### Mathematical functions ### {#convention-function-mathematical}
 
 <b>Clip3(x, y, z)</b>
 
 It shall conform to [=Clip3=] specified in [[!AV1-Convention]].
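
As an illustration only, the clamping behavior of Clip3 in the AV1 conventions can be sketched as follows: the result is x when z is below x, y when z is above y, and z otherwise. The Python name `clip3` is a hypothetical stand-in for the referenced function.

```python
def clip3(x, y, z):
    """Clamp z to the inclusive range [x, y]."""
    if z < x:
        return x
    if z > y:
        return y
    return z


clip3(0, 255, 300)   # 255
clip3(0, 255, -5)    # 0
clip3(0, 255, 42)    # 42
```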
 
### Function UTF-8 Encoding ### {#convention-function-utf8}

 <b>UTF-8_Enc(label)</b>
 
 <dfn values noexport>UTF-8_Enc(label)</dfn> is the byte representation of the encoded <b>label</b>, which is a null-terminated UTF-8 string as defined in [[!RFC3629]].
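
A minimal sketch of this encoding, assuming a single 0x00 byte as the terminator. The helper names `utf8_enc` and `read_null_terminated` are hypothetical and only illustrate how a writer and a parser might handle the field.

```python
def utf8_enc(label: str) -> bytes:
    """UTF-8 bytes of the label followed by a single null terminator."""
    if "\x00" in label:
        raise ValueError("label must not contain an embedded null")
    return label.encode("utf-8") + b"\x00"


def read_null_terminated(data: bytes, pos: int = 0) -> tuple[str, int]:
    """Return (text, next_pos) for the null-terminated UTF-8 string at data[pos]."""
    end = data.index(b"\x00", pos)   # raises ValueError if no terminator is found
    return data[pos:end].decode("utf-8"), end + 1


encoded = utf8_enc("Café")           # b"Caf\xc3\xa9\x00"
text, next_pos = read_null_terminated(encoded)
```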

# Introduction # {#introduction}

The <dfn noexport>IA sequence</dfn> is a bitstream to represent immersive audio for presentation on a wide range of devices in both dynamic streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g. headsets, mobile phones, tablets, TVs, sound bars, home theater systems and big screens.

The bitstream comprises a number of coded audio substreams and the metadata that describes how to decode, render and mix the substreams to generate an audio signal for playback. The bitstream format itself is codec-agnostic; any supported audio codec may be used to code the audio substreams.

The immersive audio container (<dfn noexport>IAC</dfn>) is the storage format for an immersive audio (IA) sequence in a single [[!ISOBMFF]] track.

The figure below shows the conceptual IAC architecture.

<center><img src="images/Conceptual IAC Architecture.png"></center>
<center><figcaption>Conceptual IAC Architecture</figcaption></center>

For a given input 3D audio,
- Pre-Processor generates Pre-Processed Audio and Codec Agnostic Metadata for immersive audio (IA).
- Audio Codec Enc generates Codec-Dependent Bitstream, which consists of the coded streams, coded from Pre-Processed Audio.
- File Packager generates IAC File by encapsulating IA sequence, which consists of Codec-Dependent Bitstream and Codec Agnostic Metadata, into [[!ISOBMFF]] tracks.
- File Parser reconstructs IA sequence by decapsulating IAC File.
- Audio Codec Dec outputs a decoded Pre-Processed Audio after decoding of Codec-Dependent Bitstream.
- Post-Processor outputs Immersive 3D Audio by using the decoded Pre-Processed Audio and Codec Agnostic Metadata.


The rest of this specification is formulated as follows:
- [[#overview]] describes the high level IA sequence architecture and introduces its components.
- [[#obu-syntax]] specifies the syntax and semantics of the top level IA components and detailed IA components.
- [[#profiles]] specifies the profiles for IA sequences and IA decoders.
- [[#standalone]] specifies the representation of a standalone IA sequence.
- [[#isobmff]] specifies the encapsulation of an IA sequence into [[!ISOBMFF]] tracks.
- [[#processing]] specifies how the IA sequence should be decoded to generate the output immersive 3D audio.
- [[#iacgeneration]] provides a guideline for generating the IA sequence.
- [[#iacconsumption]] provides a guideline for consuming the IA sequence, for different use-cases.


# Overview # {#overview}

## IA sequence Components ## {#iab-components}

The IA sequence includes one or more audio elements, each of which consists of one or more audio substreams. The IA sequence further includes mix presentations and parameters.

- <dfn noexport>Audio substream</dfn> is the actual audio signal, which may be encoded with any compatible audio codec.
- <dfn noexport>Audio element</dfn> is the 3D representation of the audio signals, and is constructed from one or more audio substreams and the metadata describing them. The audio substreams associated with one audio element use the same audio codec.
- <dfn noexport>Mix presentations</dfn> contain metadata that describe how the audio elements are rendered and mixed together for playback through physical loudspeakers or headphones. At any given time, only one mix presentation is used for playback. However, multiple mix presentations can be defined as alternatives to each other within the same IA sequence. Furthermore, the choice of which mix presentation to use at playback is left to the user. For example, multi-language support is implemented by defining different mix presentations, where the first mix describes the use of the audio element with English dialogue, and the second mix describes the use of the audio element with French dialogue.
- <dfn noexport>Parameters</dfn> are the values that are associated with the algorithms used for decoding, reconstructing, rendering and mixing. Parameters may change their values over time and may further be animated; for example, any changes in values may be smoothed over some time interval. Their rate of change is specific to the respective algorithm, and is independent of other algorithms and the frame rates associated with the audio substreams. As such, they may be viewed as a 1D signal that has different metadata specified for different time intervals.


The figure below shows the relationship between the audio substreams, audio elements and mix presentations and the processing flow to obtain the immersive audio playback.

<center><img src="images/decoding_flow_cropped.png" style="width:100%; height:auto;"></center>
<center><figcaption>Processing flow to decode, reconstruct, render and mix the audio signals for immersive audio playback.</figcaption></center>

## Use of OBU Syntax ## {#use-of-obu}

### Descriptors ### {#descriptors}

The descriptor OBUs contain all the information that is required to set up and configure the decoders, reconstruction algorithms, renderers and mixers.

- <dfn noexport>Magic Code OBU</dfn> indicates the start of a full IA sequence description, version and profile version.
- <dfn noexport>Codec Config OBU</dfn> describes information to set up a decoder for an audio substream.
- <dfn noexport>Audio Element OBU</dfn> describes information to combine one or more audio substreams to reconstruct an audio element.
- <dfn noexport>Mix Presentation OBU</dfn> describes information to render and mix one or more audio elements to generate the final audio output.

### Data ### {#data}

The data OBUs contain the actual time-varying data that is required in the generation of the final audio output.

The IA sequence supports the description of multiple audio substreams and algorithms, whose metadata update rates may differ from each other. The update rate for the audio substreams and audio elements is governed by the frame rates of the audio codec used. Since a single bitstream may support multiple codecs, this may lead to multiple different frame rates. The algorithms for rendering and mixing may have parameters that update at rates that differ from each other and from the audio frame rates. Therefore, the IA sequence contains information to facilitate the synchronization of the different audio frames and parameters.

- <dfn noexport>Audio Frame OBU</dfn> provides the raw coded audio frame for an audio substream.
- <dfn noexport>Parameter Block OBU</dfn> provides the time-varying parameter values for an algorithm used in any of the decoding, reconstruction, rendering or mixing steps.
- <dfn noexport>Sync OBU</dfn> provides relative timestamp offsets to synchronize audio frames and parameter blocks.
- <dfn noexport>Temporal Delimiter OBU</dfn> identifies the temporal units.


The figure below shows the linking scheme between the [=obu_id=]s in obu_header and the IDs in the OBU payload.

<center><img src="images/ID Linking Example.png" style="width:100%; height:auto;"></center>
<center><figcaption>ID Linking Scheme</figcaption></center>

In the above figure,
- The codec config OBU indicates that there are two audio elements (audio_element_id = 11 and 12) which are coded using the codec_config() in the OBU.
	- The audio element with audio_element_id = 11 is linked to the audio element OBU with audio_element_id = 11.
		- The audio element OBU indicates that two substreams (substream_id = 31 and 32) compose this audio element.
			- The audio substream with substream_id = 31 is linked to the audio frame OBUs with id = 31.
			- The audio substream with substream_id = 32 is linked to the audio frame OBUs with id = 32.
		- The audio element OBU indicates that there is one parameter block (parameter_id = 71) for demixing_info_parameter_data() which is applied to the audio element.
			- The parameter block with parameter_id = 71 is linked to the parameter block OBU with parameter_id = 71.
		- IAC decoders apply the parameter block to the audio substreams after they are decoded by the substream decoders.
	- The audio element with audio_element_id = 12 is linked to the audio element OBU with audio_element_id = 12.
		- The audio element OBU indicates that one substream (substream_id = 33) composes this audio element.
			- The audio substream with substream_id = 33 is linked to the audio frame OBUs with id = 33.
		- The substream decoder decodes the substream.
- The mix presentation OBU indicates that there are two audio elements (audio_element_id = 11 and 12) which need to be mixed.
	- The audio element with audio_element_id = 11 and the audio element with audio_element_id = 12 are mixed after each of them is decoded.
	- IAC decoders may then apply loudness and DRC control using mix_loudness_info() and drc_config().
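
The linking described above can be sketched as a small lookup, assuming the descriptor OBUs have already been parsed into plain dictionaries. The dictionary layout, the function names and the value codec_config_id = 1 are hypothetical; only the audio element, substream and parameter IDs come from the figure.

```python
# Hypothetical in-memory view of the descriptor OBUs shown in the figure.
# Only the ID fields are modeled; codec_config_id = 1 is an assumed value.
codec_config = {"codec_config_id": 1, "audio_element_ids": [11, 12]}
audio_elements = {
    11: {"substream_ids": [31, 32], "parameter_ids": [71]},
    12: {"substream_ids": [33], "parameter_ids": []},
}
mix_presentation = {"audio_element_ids": [11, 12]}


def substreams_for_mix(mix, elements):
    """Collect the substream IDs whose audio frame OBUs feed this mix."""
    substream_ids = []
    for element_id in mix["audio_element_ids"]:
        substream_ids.extend(elements[element_id]["substream_ids"])
    return substream_ids


def unresolved_elements(mix, codec, elements):
    """Element IDs in the mix that lack an audio element OBU or a codec config."""
    return [
        element_id
        for element_id in mix["audio_element_ids"]
        if element_id not in elements
        or element_id not in codec["audio_element_ids"]
    ]


needed = substreams_for_mix(mix_presentation, audio_elements)   # [31, 32, 33]
```

A parser could run a check such as `unresolved_elements` before decoding, so that a dangling ID is reported once rather than discovered frame by frame.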


# Open Bitstream Unit (OBU) Syntax and Semantics # {#obu-syntax}

## Top Level OBU Syntax and Semantics ## {#top-level-syntax}

The IA sequence uses the OBU syntax.

This section specifies the top-level OBU syntax elements and their semantics.

### Audio OBU Syntax and Semantics ### {#audio-obu}

<b>Syntax</b>

```
class audio_open_bitstream_unit() {
  obu_header();

  if (obu_type == OBU_IA_Magic_Code)
    magic_code_obu();
  else if (obu_type == OBU_IA_Codec_Config)
    codec_config_obu();
  else if (obu_type == OBU_IA_Audio_Element)
    audio_element_obu();
  else if (obu_type == OBU_IA_Mix_Presentation)
    mix_presentation_obu();
  else if (obu_type == OBU_IA_Parameter_Block)
    parameter_block_obu();
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    temporal_delimiter_obu();
  else if (obu_type == OBU_IA_Sync)
    sync_obu();
  else if (obu_type == OBU_IA_Audio_Frame)
    audio_frame_obu_with_no_id();
  else if (obu_type >= 9 && obu_type <= 30)
    audio_frame_obu(obu_type - 9);
  else if (obu_type == 6 || obu_type == 7)
    reserved_obu();

  byte_alignment();
}
```

<b>Semantics</b>

If the syntax element obu_type is equal to OBU_IA_Magic_Code, an ordered series of OBUs is presented to the decoding process as a string of bytes.

OBU data shall start on the first (most significant) bit and shall end on the last bit of the given bytes. The payload of an OBU shall lie between the first bit of the given bytes and the last bit before the first zero bit of the byte_alignment().


### OBU Header Syntax and Semantics ### {#obu-header}

<b>Syntax</b>

```
class obu_header() {
  unsigned int (5) obu_type;
  unsigned int (1) obu_redundant_copy;
  unsigned int (1) obu_trimming_status_flag;
  unsigned int (1) obu_extension_flag;
  leb128() obu_size;

  if (obu_trimming_status_flag) {
    leb128() num_samples_to_trim_at_end;
    leb128() num_samples_to_trim_at_start;
  }
  if (obu_extension_flag == 1)
    leb128() extension_header_size;
}
```

<b>Semantics</b>

OBUs are structured with a header and a payload.

<dfn noexport>obu_type</dfn> specifies the type of data structure contained in the OBU payload.

<pre class = "def">
obu_type: Name of obu_type
   0    : OBU_IA_Codec_Config
   1    : OBU_IA_Audio_Element
   2    : OBU_IA_Mix_Presentation
   3    : OBU_IA_Parameter_Block
   4    : OBU_IA_Temporal_Delimiter
   5    : OBU_IA_Sync
  6~7   : Reserved
   8    : OBU_IA_Audio_Frame
  9~30  : OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21
   31   : OBU_IA_Magic_Code
</pre>
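
The dispatch on obu_type can be sketched as a table lookup, using the values listed above. This is an illustration under the assumption that the argument passed to audio_frame_obu() is the implicit ID carried by OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21; the names `NAMED_OBU_TYPES` and `classify_obu` are hypothetical.

```python
# obu_type values that map to exactly one payload class.
NAMED_OBU_TYPES = {
    0: "codec_config_obu",
    1: "audio_element_obu",
    2: "mix_presentation_obu",
    3: "parameter_block_obu",
    4: "temporal_delimiter_obu",
    5: "sync_obu",
    8: "audio_frame_obu_with_no_id",
    31: "magic_code_obu",
}


def classify_obu(obu_type: int):
    """Return (payload_name, implicit_id or None) for a 5-bit obu_type."""
    if not 0 <= obu_type <= 31:
        raise ValueError("obu_type is a 5-bit field")
    if 9 <= obu_type <= 30:
        return "audio_frame_obu", obu_type - 9   # ID0 .. ID21
    if obu_type in (6, 7):
        return "reserved_obu", None              # parsers may skip these
    return NAMED_OBU_TYPES[obu_type], None


classify_obu(9)    # ("audio_frame_obu", 0)
classify_obu(30)   # ("audio_frame_obu", 21)
```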

<dfn noexport>obu_redundant_copy</dfn> indicates whether this OBU is a redundant copy of the previous OBU in the IA sequence with the same obu_type. A value of 1 shall indicate that it is a redundant copy, while a value of 0 shall indicate that it is not.

It shall always be set to 0 for the following obu_type values:

- OBU_IA_Temporal_Delimiter
- OBU_IA_Sync
- OBU_IA_Audio_Frame
- OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21

<dfn noexport>obu_trimming_status_flag</dfn> indicates whether this OBU has audio samples to be trimmed or not. If it is set to 1, the [=num_samples_to_trim_at_start=] and [=num_samples_to_trim_at_end=] fields shall be present.

<dfn noexport>obu_extension_flag</dfn> indicates whether the [=extension_header_size=] field shall be present. If it is set to 0, the [=extension_header_size=] field shall not be present. Otherwise, the [=extension_header_size=] field shall be present.

This flag shall be set to 0 for the current version of the specification (i.e. [=version=] = 0). An IAC-OBU parser which is conformant with the current version of the specification shall be able to parse this flag and [=extension_header_size=].

NOTE: A future version of the specification may use this flag to specify an extension header field by setting [=obu_extension_flag=] = 1 and signaling the size of the extension header in [=extension_header_size=].

<dfn noexport>obu_size</dfn> shall indicate the size in bytes of the OBU not including the bytes within the obu_header of the preceding fields, i.e. obu_type, obu_redundant_copy, obu_trimming_status_flag and obu_extension_flag.

<dfn noexport>num_samples_to_trim_at_start</dfn> shall indicate the number of samples that need to be trimmed from the start of the samples in this Audio Frame OBU.

<dfn noexport>num_samples_to_trim_at_end</dfn> shall indicate the number of samples that need to be trimmed from the end of the samples in this Audio Frame OBU.

<dfn noexport>extension_header_size</dfn> shall indicate the size in bytes of the extension header including this field.
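
A minimal parsing sketch for obu_header(), assuming fields are read most significant bit first and that leb128() follows the little-endian base-128 layout of the AV1 conventions. The helper names `read_leb128` and `parse_obu_header` are hypothetical.

```python
def read_leb128(data: bytes, pos: int):
    """Return (value, next_pos) for the leb128 field starting at data[pos]."""
    value = 0
    for i in range(8):
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128 field longer than 8 bytes")


def parse_obu_header(data: bytes, pos: int = 0):
    """Return (header_fields, next_pos) for the obu_header at data[pos]."""
    first = data[pos]
    header = {
        "obu_type": first >> 3,                       # upper 5 bits
        "obu_redundant_copy": (first >> 2) & 1,
        "obu_trimming_status_flag": (first >> 1) & 1,
        "obu_extension_flag": first & 1,
    }
    header["obu_size"], pos = read_leb128(data, pos + 1)
    if header["obu_trimming_status_flag"]:
        # Field order follows the syntax: end first, then start.
        header["num_samples_to_trim_at_end"], pos = read_leb128(data, pos)
        header["num_samples_to_trim_at_start"], pos = read_leb128(data, pos)
    if header["obu_extension_flag"]:
        header["extension_header_size"], pos = read_leb128(data, pos)
    return header, pos


# obu_type = 8 (OBU_IA_Audio_Frame), trimming flag set, obu_size = 10,
# 5 samples trimmed at the end, none at the start.
header, next_pos = parse_obu_header(bytes([0x42, 0x0A, 0x05, 0x00]))
```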


### Byte Alignment Syntax and Semantics ### {#obu-bytealignment}

<b>Syntax</b>

```
class byte_alignment() {
  while (get_position() & 7)
    unsigned int (1) zero_bit;
}
```

<b>Semantics</b>

<dfn noexport>zero_bit</dfn> shall be equal to 0 and shall be inserted into the bitstream to align the bit position to a multiple of 8 bits.
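
The loop in byte_alignment() can be mirrored in a short sketch that counts how many zero_bit fields are written for a given bit position. The function name `alignment_bits` is hypothetical, and the bit position is assumed to be counted from the start of the bitstream.

```python
def alignment_bits(bit_position: int) -> int:
    """Number of zero_bit fields byte_alignment() emits at this bit position."""
    count = 0
    while (bit_position + count) & 7:   # mirrors: while (get_position() & 7)
        count += 1
    return count


alignment_bits(13)   # 3, advancing the position to 16
alignment_bits(16)   # 0, already byte aligned
```

The same count can be computed without a loop as `(-bit_position) % 8`.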


### Reserved OBU Syntax and Semantics ### {#obu-reserved}

The reserved OBU allows the extension of this specification with additional OBU types in a way that allows IAC-OBU parsers compliant with this version of the specification to ignore them.


### Magic Code OBU Syntax and Semantics ### {#obu-magiccode}

This section specifies the OBU payload of OBU_IA_Magic_Code.

For this OBU, the OBU header (2 bytes) shall be set to 0xF806.

<b>Syntax</b>

```
class magic_code_obu() {
  unsigned int (32) ia_code;
  unsigned int (8) version;
  unsigned int (8) profile_version;
}
```

<b>Semantics</b>

<dfn noexport>ia_code</dfn> shall be a ‘four-character code’ (4CC) to identify the start of the IA sequence. It shall be 'iamf'.

<dfn noexport>version</dfn> shall indicate the version of an IA sequence. It shall be set to 0 for this version of the specification. Implementations should treat IA sequences where the MSB four bits of the version number match that of a recognized specification as backwards compatible with that specification. That is, the version number can be split into "major" and "minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling compatible changes. For example, an implementation of this specification should accept any stream with a version number of '15' or less, and should assume any stream with a version number '16' or greater is incompatible.

<dfn noexport>profile_version</dfn> shall indicate the profile version of an IA sequence. The MSB four bits shall indicate the profile of an IA sequence. Implementations should treat IA sequences where the MSB four bits of the profile version number match those of a recognized profile as backwards compatible with that profile. That is, the profile version number can be split into "profile major" and "profile minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling changes that are compatible with the profile major version. The semantics of this field shall only be valid when the MSB four bits of [=version=] are equal to 0.
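
The major/minor split of these two fields can be sketched as follows, assuming the payload is read in big-endian byte order and the 4CC is compared as raw ASCII bytes. The function names and the returned dictionary keys are hypothetical.

```python
import struct


def parse_magic_code(payload: bytes):
    """Split the 6-byte magic_code_obu payload into its sub-fields."""
    ia_code, version, profile_version = struct.unpack(">4sBB", payload[:6])
    if ia_code != b"iamf":
        raise ValueError("payload does not start an IA sequence")
    return {
        "version_major": version >> 4,           # MSB four bits
        "version_minor": version & 0x0F,         # LSB four bits
        "profile_major": profile_version >> 4,
        "profile_minor": profile_version & 0x0F,
    }


def is_version_compatible(fields, supported_major: int = 0) -> bool:
    """A stream is treated as compatible when its major version is recognized."""
    return fields["version_major"] == supported_major


is_version_compatible(parse_magic_code(b"iamf\x0f\x00"))   # True: version 15
is_version_compatible(parse_magic_code(b"iamf\x10\x00"))   # False: version 16
```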

### Codec Config OBU Syntax and Semantics ### {#obu-codecconfig}

This section specifies the OBU payload of OBU_IA_Codec_Config.

<b>Syntax</b>

```
class codec_config_obu() {
  leb128() codec_config_id;
  leb128() num_audio_elements;
  for (i = 0; i < num_audio_elements; i++) {
    leb128() audio_element_id;
  }
  codec_config();

}

class codec_config() {
  unsigned int (32) codec_id;
  leb128() num_samples_per_frame;
  signed int (16) roll_distance;
  decoder_config(codec_id);

}
```

<b>Semantics</b>

<dfn noexport>codec_config_id</dfn> shall indicate a unique ID in an IA sequence for a given codec config.

<dfn value noexport for="codec_config_obu()">num_audio_elements</dfn> shall specify the number of audio elements that refer to this codec config.

<dfn value noexport for="codec_config_obu()">audio_element_id</dfn> shall specify the unique ID associated with the specific audio element that refers to this codec config.

<dfn noexport>codec_id</dfn> shall be a ‘four-character code’ (4CC) to identify the codec used to generate the audio substreams. It shall be 'opus' for IAC-OPUS, 'mp4a' for IAC-AAC-LC, 'fLaC' for IAC-FLAC and 'lpcm' for IAC-LPCM.

For ISOBMFF encapsulation, it shall be the same as the [=boxtype=] of its AudioSampleEntry, if it exists.

<dfn noexport>num_samples_per_frame</dfn> shall indicate the frame length, in samples, of the raw coded audio provided by audio_frame_obu().

<dfn noexport>roll_distance</dfn> is a signed integer that gives the number of frames that need to be decoded in order for a frame to be decoded correctly. A negative value indicates the number of frames preceding the frame to be decoded correctly.
- It shall be set to -1 for IAC-AAC-LC and to -R for IAC-OPUS, where R is the smallest integer greater than or equal to 3840 divided by the frame size (e.g. R = 4 when the frame size = 960). IAC-FLAC may ignore this field.
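
The IAC-OPUS rule above amounts to a ceiling division. A minimal sketch, where `opus_roll_distance` is a hypothetical helper name and the frame size is taken from num_samples_per_frame:

```python
def opus_roll_distance(num_samples_per_frame: int) -> int:
    """Return -R, where R is the smallest integer >= 3840 / frame size."""
    if num_samples_per_frame <= 0:
        raise ValueError("frame size must be positive")
    r = -(-3840 // num_samples_per_frame)   # integer ceiling division
    return -r


opus_roll_distance(960)   # -4, matching the example in the text
opus_roll_distance(480)   # -8
```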

<dfn noexport>decoder_config()</dfn> specifies the set of codec parameters required to decode an audio substream for the given codec_id. It shall be byte aligned.
- The codec_id and decoder_config() for IAC-OPUS shall conform to [=Codec_Specific_Info=] of [[#iac-opus-specific]].
- The codec_id and decoder_config() for IAC-AAC-LC shall conform to [=Codec_Specific_Info=] of [[#iac-aac-lc-specific]].
- The codec_id and decoder_config() for IAC-FLAC shall conform to [=Codec_Specific_Info=] of [[#iac-flac-specific]].
- The codec_id and decoder_config() for IAC-LPCM shall conform to [=Codec_Specific_Info=] of [[#iac-lpcm-specific]].

### Audio Element OBU Syntax and Semantics ### {#obu-audioelement}

This section specifies the OBU payload of OBU_IA_Audio_Element.

<b>Syntax</b>

```
class audio_element_obu() {
  leb128() audio_element_id;
  unsigned int (3) audio_element_type;
  unsigned int (5) reserved;

  leb128() num_substreams;
  for (i = 0; i < num_substreams; i++) {
    leb128() audio_substream_id;
  }
  
  leb128() num_parameters;
  for (i = 0; i < num_parameters; i++) {
    leb128() param_definition_type;
    if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
        DemixingParamDefinition demixing_info;
    }
    if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
        ReconGainParamDefinition recon_gain_info;
    }
  }

  if (audio_element_type == CHANNEL_BASED) {
    scalable_channel_layout_config();
  } else if (audio_element_type == SCENE_BASED) {
    ambisonics_config();
  }  
}
```

```
class DemixingParamDefinition() extends ParamDefinition() {
}
```

```
class ReconGainParamDefinition() extends ParamDefinition() {
}

```

<b>Semantics</b>

<dfn value noexport for="audio_element_obu()">audio_element_id</dfn> shall indicate a unique ID in an IA sequence for a given audio element. A Codec Config OBU that refers to that audio element shall use the same value for its [=audio_element_id=] field.

<dfn noexport>audio_element_type</dfn> shall specify the audio representation of this audio element which is constructed from one or more audio substreams.

<pre class = "def">
audio_element_type: The type of audio representation.
   0    : CHANNEL_BASED
   1    : SCENE_BASED
  2~7   : Reserved
</pre>

<dfn noexport>num_substreams</dfn> shall specify the number of audio substreams that are used to reconstruct this audio element.

<dfn noexport>audio_substream_id</dfn> shall specify the unique ID associated with the audio substream that is used to reconstruct this audio element.

Let a particular ChannelGroup's substream be indexed as [<dfn noexport>c</dfn>, <dfn noexport>n_c</dfn>], where
- [=c=] = [1, ..., C] is the ChannelGroup index and C is the number of ChannelGroups.
- [=n_c=] = [1, ..., N_c] is the substream index in the c-th ChannelGroup and N_c is the number of substreams in the c-th ChannelGroup.
- The i-th audio_substream_id maps to a ChannelGroup's substream as follows, where i is the index of the array:

```
[[1, 1], [1, 2], ..., [1, N_1], [2, 1], [2, 2], ..., [2, N_2], ..., [C, 1], [C, 2], ..., [C, N_C]]
```

A ChannelGroup is defined in [[#iacgeneration]]. The order of the substreams in each ChannelGroup, i.e. the semantics of n_c, is specified in [[#syntax-scalable-channel-layout-config]].
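
The index mapping above can be sketched by expanding the per-group substream counts into the ordered list of 1-based pairs. The function name `substream_index_pairs` and the example IDs are hypothetical; the input is assumed to be the list of N_c values in ChannelGroup order.

```python
def substream_index_pairs(substreams_per_group):
    """Expand the per-group counts N_1 .. N_C into ordered 1-based (c, n_c) pairs."""
    return [
        (c, n)
        for c, count in enumerate(substreams_per_group, start=1)
        for n in range(1, count + 1)
    ]


# Three ChannelGroups holding 2, 1 and 3 substreams respectively.
pairs = substream_index_pairs([2, 1, 3])
# The i-th audio_substream_id is paired with the i-th entry of the list.
audio_substream_ids = [31, 32, 33, 34, 35, 36]   # example IDs only
mapping = dict(zip(audio_substream_ids, pairs))
```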


<dfn noexport>num_parameters</dfn> shall specify the number of parameters that are used by the algorithms specified in this audio element.

<dfn noexport>param_definition_type</dfn> specifies the type of the parameter definition. All parameter definition types described in this version of the specification are listed in the table below, along with their associated parameter definitions.

<table class = "def">
<tr>
  <th>param_definition_type</th><th>Parameter definition type</th><th>Parameter definition</th>
</tr>
<tr>
  <td>0</td><td>PARAMETER_DEFINITION_MIX_GAIN</td><td>MixGainParamDefinition</td>
</tr>
<tr>
  <td>1</td><td>PARAMETER_DEFINITION_DEMIXING</td><td>DemixingParamDefinition</td>
</tr>
<tr>
  <td>2</td><td>PARAMETER_DEFINITION_RECON_GAIN</td><td>ReconGainParamDefinition</td>
</tr>
</table>

<dfn noexport>demixing_info</dfn> provides the parameter definition for the demixing information to reconstruct channel audios according to [=loudspeaker_layout=] from scalable channel audio. The parameter definition is provided by DemixingParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in demixing_info_parameter_data().

<dfn noexport>recon_gain_info</dfn> provides the parameter definition for the gain value to reconstruct channel audios according to [=loudspeaker_layout=] from scalable channel audio. The parameter definition is provided by ReconGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in recon_gain_info_parameter_data().

<dfn noexport>scalable_channel_layout_config()</dfn> is a class that provides the metadata required for combining the substreams identified here in order to reconstruct a scalable channel layout.

<dfn noexport>ambisonics_config()</dfn> is a class that provides the metadata required for combining the substreams identified here in order to reconstruct an Ambisonics layout.

### Mix Presentation OBU Syntax and Semantics ### {#obu-mixpresentation}

This section specifies the OBU payload of OBU_IA_Mix_Presentation.

The metadata in mix_presentation() specifies how to render, process and mix one or more audio elements, with details provided in [[#processing-mixpresentation]].

An IA sequence may have one or more mix presentations specified. The IA parser shall select the appropriate mix presentation to process according to the rules specified in [[#processing-mixpresentation-selection]].

A mix presentation may contain one or more sub-mixes. Common use-cases may specify only one sub-mix, which includes all rendered and processed audio elements used in the mix presentation. The use-case for specifying more than one sub-mix arises if an IA multiplexer is merging two or more IA sequences. In this case, it may choose to capture the loudness information from the original IA sequences in multiple sub-mixes, instead of recomputing the loudness information for the final mix.

<b>Syntax</b>
```
class mix_presentation_obu() {
  leb128() mix_presentation_id;
  mix_presentation_annotations();

  leb128() num_sub_mixes;
  for (i = 0; i < num_sub_mixes; i++) {	  
    leb128() num_audio_elements;
    for (j = 0; j < num_audio_elements; j++) {
      leb128() audio_element_id;
      mix_presentation_element_annotations();
      rendering_config();
      element_mix_config();
    }
    output_mix_config();
    
    leb128() num_layouts;
    for (j = 0; j < num_layouts; j++) {
      layout loudness_layout;
      loudness_info loudness; 
    }
  }
}  
```

<b>Semantics</b>

<dfn noexport>mix_presentation_id</dfn> shall indicate a unique ID in an IA sequence for a given mix presentation.

<dfn noexport>mix_presentation_annotations()</dfn> is a class that provides informational metadata that an IA parser should refer to when selecting the mix presentation to use. The metadata may also be used by the playback system to display information to the user, but is not used in the rendering or mixing process to generate the final output audio signal.


<dfn noexport>num_sub_mixes</dfn> specifies the number of sub-mixes.

<dfn value noexport for ="mix_presentation_obu()">num_audio_elements</dfn> shall specify the number of audio elements that are used in this mix presentation to generate the final output audio signal for playback.

<dfn noexport>audio_element_id</dfn> shall indicate the unique ID associated with a specific audio element that is used in this mix presentation.

<dfn noexport>rendering_config()</dfn> is a class that provides the metadata required for rendering the referenced audio element. 

<dfn noexport>element_mix_config()</dfn> is a class that provides the metadata required for applying any processing to the referenced and rendered audio element before being summed with other processed audio elements.

<dfn noexport>output_mix_config()</dfn> is a class that provides the metadata required for post-processing the mixed audio signal to generate the audio signal for playback.

<dfn noexport>num_layouts</dfn> specifies the number of layouts for this sub-mix on which the loudness information was measured.

<dfn noexport>loudness_layout</dfn> identifies the layout that was used to measure the loudness information provided in this sub-mix.

<dfn noexport>loudness</dfn> provides the loudness information which was measured on [=loudness_layout=] for the mixed audio elements by this sub-mix.

The layout specified in [=loudness_layout=] should not be higher than the highest layout among layouts provided by the audio elements. In other words, rendering from an audio element with the highest layout to the [=loudness_layout=] should not require an upmix.

If a sub-mix of a Mix Presentation OBU includes only a single scalable channel audio, then it shall comply with the following:
- [=num_layouts=] shall be greater than or equal to [=num_layers=] specified in [=scalable_channel_layout_config()=] of Audio Element OBU for the [=audio_element_id=].
- The set of [=loudness_layout=]s shall include all of [=loudspeaker_layout=]s specified in the [=channel_audio_layer_config()=]s of Audio Element OBU for the [=audio_element_id=]. 

The highest [=loudness_layout=] specified in one sub-mix is the layout which was used for authoring the sub-mix.

ISSUE: Loudness_info in scalable_channel_audio_layer is removed instead.

#### Mix Presentation Annotations Syntax and Semantics #### {#obu-mixpresentation-annotation}

<b>Syntax</b>
```
class mix_presentation_annotations() {
  string mix_presentation_friendly_label;
}
```

<b>Semantics</b>

<dfn noexport>mix_presentation_friendly_label</dfn> shall specify a human-friendly label to describe this mix presentation.


#### Mix Presentation Element Annotations Syntax and Semantics #### {#obu-mixpresentation-elementannotation}

<b>Syntax</b>
```
class mix_presentation_element_annotations() {
  string audio_element_friendly_label;
}
```

<b>Semantics</b>

<dfn noexport>audio_element_friendly_label</dfn> shall specify a human-friendly label to describe the referenced audio element.

#### Output Mix Config Syntax and Semantics #### {#obu-mixpresentation-outputmix}

output_mix_config() provides a gain value to be applied to the mixed audio signal.

<b>Syntax</b>

```
class output_mix_config() {
  MixGainParamDefinition output_mix_gain;
}
```

<b>Semantics</b>

<dfn noexport>output_mix_gain</dfn> provides the parameter definition for the gain value that is applied to all channels of the mixed audio signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data().

#### Loudness Info Syntax and Semantics #### {#obu-mixpresentation-loudness}

loudness_info() provides loudness information for a given audio signal.

All signed values are stored as signed Q7.8 fixed-point values (in [[!Q-Format]]).

<b>Syntax</b>

```
class loudness_info() {
  unsigned int (8) info_type;
  signed int (16) integrated_loudness;
  signed int (16) digital_peak;

  if (info_type & 1) {
    signed int (16) true_peak;
  }
}
```

<b>Semantics</b>

<dfn noexport>info_type</dfn> is a bitmask that specifies the type of optional loudness information provided. The bits are set as follows, where the first bit is the LSB:

<pre class = "def">
 Bit : Type of information provided
  0  : True peak
 1~7 : Reserved
</pre>

<dfn noexport>integrated_loudness</dfn> provides the integrated loudness information, specified in [=LKFS=] as defined in [[!ITU1770-4]], and measured according to [[!ITU1770-4]].

<dfn noexport>digital_peak</dfn> specifies the digital (sampled) peak value of the audio signal, specified in dBFS.

<dfn noexport>true_peak</dfn> specifies the true peak of the audio signal, specified in dBFS and measured according to [[!ITU1770-4]].

NOTE: [[!ITU1770-4]] adopts the convention of using the dBov unit for dBFS, where the RMS value of a full-scale square wave is 0 dBov. The same convention is adopted here.
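
As a non-normative illustration, loudness_info() could be read as sketched below. The sketch assumes the fields are byte-aligned and that multi-byte fields are stored in big-endian order; neither assumption is stated in this section. The names `parse_loudness_info` and `q7_8_to_float` are hypothetical.

```python
import struct


def q7_8_to_float(raw):
    """Convert a signed Q7.8 integer to a float (8 fractional bits)."""
    return raw / 256.0


def parse_loudness_info(data):
    """Parse loudness_info() from bytes; returns (fields, bytes_consumed).

    Assumption: big-endian, byte-aligned fields.
    """
    info_type, integrated, digital_peak = struct.unpack_from(">Bhh", data, 0)
    fields = {
        "info_type": info_type,
        "integrated_loudness": q7_8_to_float(integrated),  # LKFS
        "digital_peak": q7_8_to_float(digital_peak),       # dBFS
    }
    offset = 5
    if info_type & 1:  # bit 0 (LSB): true peak is present
        (true_peak,) = struct.unpack_from(">h", data, offset)
        fields["true_peak"] = q7_8_to_float(true_peak)     # dBFS
        offset += 2
    return fields, offset
```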

### Parameter Block OBU Syntax and Semantics ### {#obu-parameterblock}

This section specifies the OBU payload of OBU_IA_Parameter_Block.

The metadata specified in this OBU defines the parameter values for an algorithm for an indicated duration, including any animation of the parameter values over this duration. The metadata shall be used in conjunction with a corresponding parameter definition and parameter data specification. The parameter definition shall be specified based on [=ParamDefinition()=]. The parameter data shall provide the values to apply in each parameter block. These shall be specified using the [=AnimatedParameterData()=] function template if parameter animation is supported.

<b>Syntax</b>

```
class parameter_block_obu() {
  leb128() parameter_id;
  leb128() duration;
  leb128() num_segments;
  leb128() constant_segment_interval;

  param_definition_type = get_param_definition_type(parameter_id);

  for (i = 0; i < num_segments; i++) {
    if (constant_segment_interval == 0) {
      leb128() segment_interval;
    }

    if (param_definition_type == PARAMETER_DEFINITION_MIX_GAIN) {
      leb128() animation_type;
      mix_gain_parameter_data(animation_type);
    }
    if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
      demixing_info_parameter_data();
    }
    if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
      recon_gain_info_parameter_data();
    }
  }
}
```

<b>Semantics</b>

<dfn value noexport for="parameter_block_obu()">parameter_id</dfn> shall indicate the unique ID that is associated with a specific parameter definition. All parameter blocks that provide data for that parameter definition shall have the same parameter_id.

<dfn noexport>duration</dfn> shall specify the duration for which this parameter block is valid and applicable. 

<dfn noexport>num_segments</dfn> shall specify the number of different sets of parameter values specified in this parameter block, where each set describes a different segment of the timeline, contiguously.

<dfn noexport>constant_segment_interval</dfn> shall specify the interval of each segment, in the case where all segments except the last segment have equal intervals. If all segments except the last segment do not have equal intervals, the value of constant_segment_interval shall be set to 0. 

<dfn noexport>get_param_definition_type()</dfn> is a run-time function to get the parameter definition type mapped to the parameter_id. 

The Audio Element OBU and/or Mix Presentation OBU maps each parameter_id to a parameter definition type, so an IA decoder can determine the definition type associated with a given parameter_id.

<dfn noexport>segment_interval</dfn> shall specify the interval for the given segment.

Each value of [=duration=], [=constant_segment_interval=] and [=segment_interval=] shall be expressed as the number of ticks at the rate indicated by the time base specified in the corresponding parameter definition.
- Let <dfn noexport>D</dfn> be the value of [=duration=], <dfn noexport>NS</dfn> the value of [=num_segments=], <dfn noexport>CSI</dfn> the value of [=constant_segment_interval=] and <dfn noexport>SI</dfn> the value of [=segment_interval=].
	- When [=CSI=] != 0, [=NS=] x [=CSI=] shall be equal to or greater than [=D=].
		- If [=NS=] x [=CSI=] > [=D=], the actual interval of the last segment shall be [=D=] - ([=NS=] - 1) x [=CSI=].
	- When [=CSI=] = 0, the summation of all [=SI=]s in this parameter block shall be equal to [=D=].
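
The constraints on D, NS, CSI and SI can be checked with a small helper such as the one below. This is a sketch, not a normative procedure: the function name `segment_intervals` is hypothetical, and the check that the last segment has a positive interval is an extra assumption that is not stated in the text above.

```python
def segment_intervals(d, ns, csi, si_list=None):
    """Return the actual interval of each segment, in ticks.

    d, ns and csi are the values D, NS and CSI; si_list holds the SI values
    and is only used when csi == 0.
    """
    if csi != 0:
        if ns * csi < d:
            raise ValueError("NS x CSI shall be equal to or greater than D")
        last = d - (ns - 1) * csi
        if last <= 0:
            # Assumption beyond the text: every segment has a positive interval.
            raise ValueError("last segment would have a non-positive interval")
        return [csi] * (ns - 1) + [last]
    si_list = list(si_list or [])
    if len(si_list) != ns or sum(si_list) != d:
        raise ValueError("the summation of all SIs shall be equal to D")
    return si_list
```

For example, D = 10, NS = 3 and CSI = 4 give segment intervals of 4, 4 and 2 ticks.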

<dfn noexport>animation_type</dfn> specifies the type of animation applied to the parameter values in this parameter block.

<pre class = "def">
animation_type : Animation Type
       0       : STEP
       1       : LINEAR
       2       : BEZIER
</pre>

Classes that take [=animation_type=] as an input argument must use the <dfn noexport>AnimatedParameterData()</dfn> function template. The method of applying the animation is described in [[#processing-animated-params]].

```
template <class T>
class AnimatedParameterData(animation_type) {
  if (animation_type == STEP) {
    T start_point_value;
  }
  if (animation_type == LINEAR) {
    T start_point_value;
    T end_point_value;
  }
  if (animation_type == BEZIER) {
    T start_point_value;
    T end_point_value;
    T control_point_value;
    unsigned int (8) control_point_relative_time;
  }
}
```

<dfn noexport>start_point_value</dfn> shall specify the parameter value that is applied at the start of the segment.

<dfn noexport>end_point_value</dfn> shall specify the parameter value that is applied at the end of the segment.

<dfn noexport>control_point_value</dfn> shall specify the parameter value of the middle control point of a quadratic Bezier curve, i.e. its y-axis value.

<dfn noexport>control_point_relative_time</dfn> shall specify the time of the middle control point of a quadratic Bezier curve, i.e. its x-axis value. This value is expressed as a fraction of the parameter segment interval, with valid values in the range 0 to 1, inclusive. A value equal to 0 or 1 shall indicate that this animation implements a linear Bezier curve, in which case control_point_value shall be ignored by the IA parser. It is stored as an 8-bit, unsigned, fixed-point value with 8 fractional bits (i.e. Q0.8 in [[!Q-Format]]).
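
The normative method of applying the animation is given in [[#processing-animated-params]]. The sketch below only illustrates how the Q0.8 value can be converted to a fraction and how a point on a textbook quadratic Bezier curve is evaluated. It assumes the segment is normalized so that the start point lies at relative time 0 and the end point at relative time 1. The function names are hypothetical.

```python
def q0_8_to_float(raw):
    """Convert an unsigned Q0.8 value (0..255) to a fraction."""
    if not 0 <= raw <= 255:
        raise ValueError("Q0.8 values are 8-bit unsigned")
    return raw / 256.0


def quadratic_bezier_point(t, start, control, end, control_time):
    """Evaluate a quadratic Bezier curve at curve parameter t in [0, 1].

    Control points (assumed normalization): P0 = (0, start),
    P1 = (control_time, control), P2 = (1, end).
    Returns (relative_time, value).
    """
    x = 2 * (1 - t) * t * control_time + t * t
    y = (1 - t) ** 2 * start + 2 * (1 - t) * t * control + t * t * end
    return x, y
```

Note that t is the curve parameter, not the relative time itself. Finding the value at a given relative time requires solving x(t) for t first.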

#### Parameter Definition Syntax and Semantics #### {#parameter-definition}

Parameter definition classes shall inherit from the abstract <dfn noexport>ParamDefinition()</dfn> class. They may optionally further provide default parameter values, which are applied when there are no parameter blocks available.

<b>Syntax</b>

```
abstract class ParamDefinition() {
  leb128() parameter_id;
  leb128() time_base;
}
```

<b>Semantics</b>

<dfn value noexport for="ParamDefinition()">parameter_id</dfn> shall indicate the unique ID in an IA sequence for a given parameter.

<dfn value noexport for="ParamDefinition()">time_base</dfn> shall specify the time base used by this parameter, expressed as seconds per tick. Time-related fields associated with this parameter, such as durations and intervals, shall be expressed in the number of ticks.

### Audio Frame OBU Syntax and Semantics ### {#obu-audioframe}

This section specifies OBU payloads of OBU_IA_Audio_Frame and OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21. 

The first 22 audio substreams in an IA sequence may use the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21, which have predefined audio substream IDs associated with them. This avoids the need to manually specify an audio_substream_id.

<b>Syntax</b>

```
class audio_frame_obu_with_no_id() {
  leb128() audio_substream_id;
  audio_frame_obu(audio_substream_id);
}
```

```
class audio_frame_obu(audio_substream_id) {
  unsigned int (8*coded_frame_size) audio_frame();
}
```

<b>Semantics</b>

<dfn value noexport for="audio_frame_obu_with_no_id()">audio_substream_id</dfn> shall indicate a unique ID in an IA sequence for a given substream. All Audio Frame OBUs of the same substream shall have the same audio_substream_id.

This value must be greater than or equal to 22, in order to avoid collision with the reserved IDs for the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21.
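
A parser might resolve the substream ID as sketched below. The sketch assumes that OBU_IA_Audio_Frame_IDk carries the predefined substream ID k, which is implied but not spelled out above, and it decodes [=leb128()=] as an unsigned little-endian base-128 value of at most 8 bytes. The function names are hypothetical.

```python
def read_leb128(data, offset=0):
    """Decode an unsigned leb128 value; returns (value, next_offset)."""
    value = 0
    for i in range(8):
        byte = data[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if not (byte & 0x80):  # most significant bit clear: last byte
            return value, offset + i + 1
    raise ValueError("leb128 value longer than 8 bytes")


def resolve_audio_substream_id(implicit_index, payload):
    """Return (audio_substream_id, offset of audio_frame() in payload).

    implicit_index is k for OBU_IA_Audio_Frame_IDk, or None for
    OBU_IA_Audio_Frame, whose payload starts with audio_substream_id.
    """
    if implicit_index is not None:
        if not 0 <= implicit_index <= 21:
            raise ValueError("predefined IDs are 0 to 21")
        return implicit_index, 0
    substream_id, offset = read_leb128(payload)
    if substream_id < 22:
        raise ValueError("explicit audio_substream_id must be >= 22")
    return substream_id, offset
```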

<dfn noexport>coded_frame_size</dfn> is the size of [=audio_frame()=] in bytes.

<dfn noexport>audio_frame()</dfn> is the raw coded audio data for the frame. It shall be an [=opus packet=] of [[!RFC6716]] for IAC-OPUS, a [=raw_data_block()=] of [[!AAC]] for IAC-AAC-LC and a [=FRAME=] of [[!FLAC]] for IAC-FLAC.

For IAC-LPCM, [=audio_frame()=] shall be LPCM samples. When more than one byte is used to represent an LPCM sample, the bytes shall be in little-endian order.
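
As an illustration of the byte order, the sketch below unpacks LPCM samples from an audio_frame(). It assumes signed two's complement samples and that the sample size is known from the codec configuration. Both are assumptions for this example, and `decode_lpcm_samples` is a hypothetical name.

```python
def decode_lpcm_samples(audio_frame, bytes_per_sample):
    """Split audio_frame() into signed integer samples (little endian)."""
    if bytes_per_sample < 1 or len(audio_frame) % bytes_per_sample:
        raise ValueError("frame size is not a multiple of the sample size")
    return [
        int.from_bytes(audio_frame[i:i + bytes_per_sample], "little", signed=True)
        for i in range(0, len(audio_frame), bytes_per_sample)
    ]
```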

For this version of the specification, all audio frames for a given substream must be gapless.

### Temporal Delimiter OBU Syntax and Semantics ### {#obu-temporaldelimiter}

This section specifies the OBU payload of OBU_IA_Temporal_Delimiter.

<b>Syntax</b>

```
class temporal_delimiter_obu() {
}
```

NOTE: The Temporal Delimiter OBU has an empty payload.

### Sync OBU Syntax and Semantics ### {#obu-sync}

This section specifies the OBU payload of OBU_IA_Sync.

<b>Syntax</b>

```
class sync_obu() {
  leb128() global_offset;
  leb128() num_obu_ids;
  for (i = 0; i < num_obu_ids; i++) {
    leb128() obu_id;
    unsigned int (1) obu_data_type;
    unsigned int (1) reinitialize_decoder;
    unsigned int (6) reserved;
    sleb128() relative_offset;
  }
}
```

<b>Semantics</b>

<dfn noexport>global_offset</dfn> shall specify the offset that is applied to all substreams and parameters specified in this Sync OBU, in addition to their individual relative offsets.

For this version of the specification, the value of global_offset shall be set to 0.

<dfn noexport>num_obu_ids</dfn> shall specify the number of substream and parameter IDs that this Sync OBU specifies the offset for.

<dfn noexport>obu_id</dfn> shall specify the unique ID associated with the substream or parameter that is being referred to.

<dfn noexport>obu_data_type</dfn> shall specify the type of data that is being referred to.

<pre class = "def">
obu_data_type : Type of OBU data
      0       : SUBSTREAM
      1       : PARAMETER
</pre>

<dfn noexport>reinitialize_decoder</dfn> shall be used to specify the behaviour of a decoder when encountering gaps in the audio substream, where the gap shall be identified as described in [[#standalone-synchronizing-data-obus]]. If obu_data_type does not equal SUBSTREAM, an IAC-OBU parser shall ignore this field.

If reinitialize_decoder = 0, the decoder shall not be reinitialized before decoding the audio frames after the gap. This may be used in the case where it is preferable for the decoder to fill the gap with silence instead.

If reinitialize_decoder = 1, the decoder shall be reinitialized before decoding the audio frames after the gap. If a pre-skip is specified in the relevant Codec Config OBU, it is applicable after reinitializing the decoder.

For this version of the specification, the value of reinitialize_decoder shall be set to 0. If a value of 1 is seen, the IA sequence shall be rejected as invalid.

<dfn value noexport for="sync_obu()">reserved</dfn> shall be set to 0. Reserved units are for future use and shall be ignored by an IAC-OBU parser.

<dfn noexport>relative_offset</dfn> is the offset to position the first audio frame (before trimming) or parameter block with the referenced obu_id that comes after this Sync OBU with respect to the timeline generated before this Sync OBU. If this Sync OBU is the first one, it is the offset from 0. Otherwise, it is the offset from the end of the timeline of Substreams generated from the previous Sync OBU.

The offset shall be indicated in the number of ticks at the time_base specified in the corresponding substream or parameter definition. 

IA encoder and decoder operations related to this field are specified in [[#standalone-synchronizing-data-obus]].

## Detailed OBU Syntax and Semantics ## {#syntax-detailed}

### Scalable Channel Layout Config Syntax and Semantics ### {#syntax-scalable-channel-layout-config}

[=scalable_channel_layout_config()=] contains information regarding the configuration of scalable channel audio.

<b>Syntax</b>

```
class scalable_channel_layout_config() {
  unsigned int (3) num_layers;
  unsigned int (5) reserved;
  for (i = 1; i <= num_layers; i++) {
    channel_audio_layer_config(i);
  }
}

class channel_audio_layer_config(i) {
  unsigned int (4) loudspeaker_layout(i);
  unsigned int (1) output_gain_is_present_flag(i);
  unsigned int (1) recon_gain_is_present_flag(i);
  unsigned int (2) reserved;
  unsigned int (8) substream_count(i);
  unsigned int (8) coupled_substream_count(i);
  if (output_gain_is_present_flag(i) == 1) {
    unsigned int (6) output_gain_flags(i);
    unsigned int (2) reserved;
    signed int (16) output_gain(i);
  }
}
```

When an audio element is composed of G(r) number of substreams, scalable channel audio for the audio element shall be layered into [=num_layers=] = r number of ChannelGroups.
- The order of ChannelGroups in each temporal unit shall be the same as the order of channel_audio_layer_config()s in scalable_channel_layout_config().
- <dfn noexport>ChannelGroup</dfn> is a set of substreams which is able to provide a spatial resolution of audio contents by itself or which is able to provide an enhanced spatial resolution of audio contents by combining with the preceding ChannelGroups within the audio frames.
- ChannelGroup #q consists of G(q)-G(q-1) substreams, where q = 1, 2, ..., r and G(0) = 0.
- An IA frame shall be the set of audio_frame_obus of a single scalable channel audio element that have the same sync offset, with one audio_frame_obu coming from each substream.
- Every IA frame shall have the same number of audio_frame_obus.
- When r > 1, a parameter_block_obu may be present with the IA frame.
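
The relationship between the cumulative counts G(q) and the size of each ChannelGroup can be sketched as follows. The function name is hypothetical, and the check that each ChannelGroup holds at least one substream is an assumption made for this example.

```python
def channel_group_sizes(cumulative_counts):
    """Given [G(1), ..., G(r)], return the size of ChannelGroup #1..#r."""
    sizes = []
    previous = 0  # G(0) = 0
    for g in cumulative_counts:
        if g <= previous:
            # Assumption: every ChannelGroup has at least one substream.
            raise ValueError("G(q) must be strictly increasing")
        sizes.append(g - previous)
        previous = g
    return sizes
```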

<center><img src="images/Immersive Audio Bitstream with scalable channel audio (before OBU packing).png" style="width:100%; height:auto;"></center>
<center><figcaption>Immersive Audio Sequence with scalable channel audio (before OBU packing)</figcaption></center>

The IA decoder shall select one of one or more channel audios provided by scalable channel audio. The IA decoder should select the appropriate channel audio according to the following rules, in order:
- The IA decoder should first attempt to select the channel audio whose loudspeaker layout matches the physical playback layout.
- If there is no match, the IA decoder should select the channel audio with the closest specified loudspeaker layout to the physical layout and then apply up or down-mixing appropriately, after decoding and reconstruction of the channel audio. [[#iacgeneration-scalablechannelaudio-downmixmechanism]] and [[#processing-downmixmatrix]] provide examples of dynamic and static down-mixing matrices for some common layouts that may be used.

<b>Semantics</b>

<dfn noexport>num_layers</dfn> shall indicate the number of ChannelGroups for scalable channel audio. It shall not be set to zero and its maximum number shall be limited to 6.
- For Binaural, this field shall be set to 1.

<dfn noexport>channel_audio_layer_config()</dfn> is a class that provides the information regarding the configuration of ChannelGroup for scalable channel audio. channel_audio_layer_config(i) shall provide information regarding the configuration of ChannelGroup #i.

<dfn noexport>loudspeaker_layout</dfn> shall indicate the channel layout for the channels to be reconstructed from the preceding ChannelGroups and the current ChannelGroup among ChannelGroups for scalable channel audio.

In the current version of the specification, [=loudspeaker_layout=] shall indicate one of 10 channel layouts including Mono, Stereo, 5.1ch, 5.1.2ch, 5.1.4ch, 7.1ch, 7.1.2ch, 7.1.4ch, 3.1.2ch and Binaural, where:
- <dfn noexport>Stereo</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System A (0+2+0)=] of [[!ITU2051-3]].
- <dfn noexport>5.1ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System B (0+5+0)=] of [[!ITU2051-3]].
- <dfn noexport>5.1.2ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System C (2+5+0)=] of [[!ITU2051-3]].
- <dfn noexport>5.1.4ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System D (4+5+0)=] of [[!ITU2051-3]].
- <dfn noexport>7.1ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System I (0+7+0)=] of [[!ITU2051-3]].
- <dfn noexport>7.1.2ch</dfn> is the combination of the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System I (0+7+0)=] of [[!ITU2051-3]] and the left and right top front pair of the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System J (4+7+0)=] of [[!ITU2051-3]].
- <dfn noexport>7.1.4ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System J (4+7+0)=] of [[!ITU2051-3]].
- <dfn noexport>3.1.2ch</dfn> is the front subset (L/C/R/Ltf/Rtf/LFE) of [=7.1.4ch=].

<pre class = "def">
Loudspeaker Layout (4 bits) :  Channel Layout  : Loudspeaker Location Ordering
             0000           :       Mono       : C
             0001           :      Stereo      : L/R
             0010           :      5.1ch       : L/C/R/Ls/Rs/LFE
             0011           :     5.1.2ch      : L/C/R/Ls/Rs/Ltf/Rtf/LFE
             0100           :     5.1.4ch      : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
             0101           :      7.1ch       : L/C/R/Lss/Rss/Lrs/Rrs/LFE
             0110           :     7.1.2ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
             0111           :     7.1.4ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
             1000           :     3.1.2ch      : L/C/R/Ltf/Rtf/LFE
             1001           :     Binaural     : L/R
            others          :     reserved     :
</pre>

```
Where, C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround, 
Rs: Right Surround, Rss: Right Side Surround, 
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear, 
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects
```
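
For illustration, the table above can be held as a lookup keyed by the 4-bit [=loudspeaker_layout=] value. The names `LOUDSPEAKER_LAYOUTS` and `describe_loudspeaker_layout` are hypothetical.

```python
# 4-bit loudspeaker_layout value -> (channel layout, loudspeaker location ordering)
LOUDSPEAKER_LAYOUTS = {
    0b0000: ("Mono", ["C"]),
    0b0001: ("Stereo", ["L", "R"]),
    0b0010: ("5.1ch", ["L", "C", "R", "Ls", "Rs", "LFE"]),
    0b0011: ("5.1.2ch", ["L", "C", "R", "Ls", "Rs", "Ltf", "Rtf", "LFE"]),
    0b0100: ("5.1.4ch", ["L", "C", "R", "Ls", "Rs", "Ltf", "Rtf", "Ltr", "Rtr", "LFE"]),
    0b0101: ("7.1ch", ["L", "C", "R", "Lss", "Rss", "Lrs", "Rrs", "LFE"]),
    0b0110: ("7.1.2ch", ["L", "C", "R", "Lss", "Rss", "Lrs", "Rrs", "Ltf", "Rtf", "LFE"]),
    0b0111: ("7.1.4ch", ["L", "C", "R", "Lss", "Rss", "Lrs", "Rrs",
                         "Ltf", "Rtf", "Ltb", "Rtb", "LFE"]),
    0b1000: ("3.1.2ch", ["L", "C", "R", "Ltf", "Rtf", "LFE"]),
    0b1001: ("Binaural", ["L", "R"]),
}


def describe_loudspeaker_layout(code):
    """Return (layout name, channel ordering); reserved values map to 'reserved'."""
    if not 0 <= code <= 0b1111:
        raise ValueError("loudspeaker_layout is a 4-bit field")
    return LOUDSPEAKER_LAYOUTS.get(code, ("reserved", []))
```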

<dfn noexport>output_gain_is_present_flag</dfn> shall indicate whether output_gain information fields for the ChannelGroup are present.
- 0: No output_gain information fields for the ChannelGroup are present.
- 1: output_gain information fields for the ChannelGroup are present. In this case, the output_gain_flags and output_gain fields are present.

<dfn noexport>recon_gain_is_present_flag</dfn> shall indicate whether recon_gain information fields for the ChannelGroup are present in recon_gain_info_parameter_data().
- 0: No recon_gain information fields for the ChannelGroup are present in recon_gain_info_parameter_data().
- 1: recon_gain information fields for the ChannelGroup are present in recon_gain_info_parameter_data(). In this case, the recon_gain_flags and recon_gain fields are present.

<dfn noexport>substream_count</dfn> shall specify the number of audio substreams. It must be the same as [=num_substreams=] in its corresponding audio_element().

<dfn noexport>coupled_substream_count</dfn> shall specify the number of referenced substreams that are coded as coupled stereo channels.

<dfn noexport>output_gain_flags</dfn> shall indicate the channels which output_gain is applied to. If a bit is set to 1, output_gain shall be applied to the channel. Otherwise, output_gain shall not be applied to the channel.


<pre class = "def">
Bit position : Channel Name
    b5(MSB)  : Left channel (L1, L2, L3)
      b4     : Right channel (R2, R3)
      b3     : Left Surround channel (Ls5)
      b2     : Right Surround channel (Rs5)
      b1     : Left Top Front channel (Ltf)
      b0     : Right Top Front channel (Rtf)

</pre>

<dfn noexport>output_gain</dfn> shall indicate the gain value to be applied to the mixed channels which are indicated by output_gain_flags. It is 20*log10 of the factor by which to scale the mixed channels. It is stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [[!Q-Format]]). Each mixed channel is generated by downmixing two or more input channels.
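
The following sketch shows how the 6-bit flags and the Q7.8 gain value could be interpreted. It is illustrative only: the names are hypothetical, and the raw gain is assumed to have already been read as a signed 16-bit integer.

```python
# Bit position in the 6-bit output_gain_flags field -> channel, per the table above.
OUTPUT_GAIN_FLAG_BITS = {5: "L", 4: "R", 3: "Ls5", 2: "Rs5", 1: "Ltf", 0: "Rtf"}


def channels_with_output_gain(output_gain_flags):
    """List the channels whose flag bit is set, from b5 (MSB) down to b0."""
    return [
        name
        for bit, name in sorted(OUTPUT_GAIN_FLAG_BITS.items(), reverse=True)
        if output_gain_flags & (1 << bit)
    ]


def output_gain_to_linear(raw_q7_8):
    """Convert a signed Q7.8 gain in dB to a linear scale factor."""
    gain_db = raw_q7_8 / 256.0
    return 10.0 ** (gain_db / 20.0)
```

For example, a raw value of -1536 corresponds to -6 dB, i.e. a scale factor of roughly 0.5.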


### Ambisonics Config Syntax and Semantics ### {#syntax-ambisonics-config}

[=ambisonics_config()=] contains information regarding the configuration of Ambisonics.

<b>Syntax</b>

```
class ambisonics_config() {
  leb128() ambisonics_mode;
  if (ambisonics_mode == MONO) {
    ambisonics_mono_config();
  } else if (ambisonics_mode == PROJECTION) {
    ambisonics_projection_config();
  }
}

class ambisonics_mono_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8 * C) channel_mapping;
}

class ambisonics_projection_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8) coupled_substream_count (M);
  unsigned int (16 * (N + M) * C) demixing_matrix;
}
```

<b>Semantics</b>

<dfn noexport>ambisonics_mode</dfn> shall specify the method of coding Ambisonics.

<pre class = "def">
ambisonics_mode: Method of coding Ambisonics.
   0    : MONO
   1    : PROJECTION
</pre>

If ambisonics_mode is equal to MONO, this shall indicate that the Ambisonics channels are coded as individual mono substreams. For IAC-LPCM, [=ambisonics_mode=] shall be equal to MONO.

If ambisonics_mode is equal to PROJECTION, this shall indicate that the Ambisonics channels are first linearly projected onto another subspace before coding as a mix of coupled stereo and mono substreams.

<dfn noexport>output_channel_count</dfn> shall be the same as [=channel count=] in [[!RFC8486]] with the following restrictions:
- The allowed numbers of [=output_channel_count=] are (1+n)^2, where n = 0, 1, 2, ..., 14. 
- In other words, the scene-based audio element shall not include non-diegetic channels.
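
The restriction on [=output_channel_count=] amounts to requiring a perfect square between 1 and 225. A minimal check, with a hypothetical function name, could look like this:

```python
import math


def is_valid_output_channel_count(count):
    """True if count == (1 + n)^2 for some n in 0..14."""
    if count < 1:
        return False
    root = math.isqrt(count)
    return root * root == count and root <= 15
```

A count such as 6 (a first-order scene plus two non-diegetic channels) is rejected by this check.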

[=substream_count=] shall specify the number of audio substreams. It must be the same as [=num_substreams=] in its corresponding audio_element().

<dfn noexport>channel_mapping</dfn> shall be the same as the one for [=ChannelMappingFamily=] = 2 in [[!RFC8486]].

[=coupled_substream_count=] shall specify the number of referenced substreams that are coded as coupled stereo channels, where M <= N.

<dfn noexport>demixing_matrix</dfn> shall be the same as the one for [=ChannelMappingFamily=] = 3 in [[!RFC8486]], except that the byte order of each matrix coefficient shall be converted to big endian.


### Demixing Info Parameter Data Syntax and Semantics ### {#syntax-demixing-info}

<dfn noexport>demixing_info_parameter_data()</dfn> specifies demixing parameter mode to be used to reconstruct output channel audio according to its [=loudspeaker_layout=].

<b>Syntax</b>

```
class demixing_info_parameter_data() {
  unsigned int (3) dmixp_mode;
  unsigned int (5) reserved;
}
```

<b>Semantics</b>

<dfn noexport>dmixp_mode</dfn> shall indicate a mode of pre-defined combinations of five demix parameters.
- 0: mode1, (alpha, beta, gamma, delta, w_idx_offset) = (1, 1, 0.707, 0.707, -1)
- 1: mode2, (alpha, beta, gamma, delta, w_idx_offset) = (0.707, 0.707, 0.707, 0.707, -1)
- 2: mode3, (alpha, beta, gamma, delta, w_idx_offset) = (1, 0.866, 0.866, 0.866, -1)
- 3: reserved
- 4: mode1, (alpha, beta, gamma, delta, w_idx_offset) = (1, 1, 0.707, 0.707, 1)
- 5: mode2, (alpha, beta, gamma, delta, w_idx_offset) = (0.707, 0.707, 0.707, 0.707, 1)
- 6: mode3, (alpha, beta, gamma, delta, w_idx_offset) = (1, 0.866, 0.866, 0.866, 1)
- 7: reserved

<dfn noexport>alpha</dfn> and <dfn noexport>beta</dfn> shall be gain values used for S7to5 down-mixer, <dfn noexport>gamma</dfn> for T4to2 down-mixer, <dfn noexport>delta</dfn> for S5to3 down-mixer and <dfn noexport>w_idx_offset</dfn> shall be the offset to generate a gain value <dfn noexport>w</dfn> used for T2toTF2 down-mixer.
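
The mode table can be expressed as a lookup, as sketched below. The sketch assumes that the 3-bit dmixp_mode occupies the most significant bits of the byte, ahead of the 5 reserved bits. The names are hypothetical.

```python
# dmixp_mode -> (alpha, beta, gamma, delta, w_idx_offset); 3 and 7 are reserved.
DMIXP_MODES = {
    0: (1.0, 1.0, 0.707, 0.707, -1),
    1: (0.707, 0.707, 0.707, 0.707, -1),
    2: (1.0, 0.866, 0.866, 0.866, -1),
    4: (1.0, 1.0, 0.707, 0.707, 1),
    5: (0.707, 0.707, 0.707, 0.707, 1),
    6: (1.0, 0.866, 0.866, 0.866, 1),
}


def parse_demixing_info_parameter_data(byte):
    """Decode one byte of demixing_info_parameter_data().

    Assumption: fields are packed most significant bit first.
    """
    dmixp_mode = (byte >> 5) & 0x7
    params = DMIXP_MODES.get(dmixp_mode)
    if params is None:
        raise ValueError("reserved dmixp_mode: %d" % dmixp_mode)
    alpha, beta, gamma, delta, w_idx_offset = params
    return {
        "dmixp_mode": dmixp_mode,
        "alpha": alpha,
        "beta": beta,
        "gamma": gamma,
        "delta": delta,
        "w_idx_offset": w_idx_offset,
    }
```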

<center><img src="images/Down-mix Mechanism.png" style="width:100%; height:auto;"></center>
<center><figcaption>IA Down-mix Mechanism</figcaption></center>

### Recon Gain Info Parameter Data Syntax and Semantics ### {#syntax-recon-gain-info}

<dfn noexport>recon_gain_info_parameter_data()</dfn> contains recon gain values for demixed channels.

<b>Syntax</b>

```
class recon_gain_info_parameter_data() {
  for (i=0; i< num_layers; i++) {
    if (recon_gain_is_present_flag(i) == 1) {
      leb128() recon_gain_flags(i);
      for (j=0; j< n(i); j++) {
        if (recon_gain_flag(i)(j) == 1)
          unsigned int (8) recon_gain;
      }
    }
  }
}
```

<b>Semantics</b>

<dfn noexport>recon_gain_flags</dfn> shall indicate the channels which recon_gain is applied to.

<left><img src="images/Recon_Gain_Flags.png" style="width:100%; height:auto;"></left>

Each bit of [=recon_gain_flags=] indicates the presence of [=recon_gain=] applied to the channel as depicted in the above figure.
 - 0: It shall indicate that no [=recon_gain=] is present for the channel.
 - 1: It shall indicate that [=recon_gain=] is present for the channel.

<dfn noexport>n(i)</dfn> shall indicate the number of bits for recon_gain_flags(i), where i = 0, 1, ..., [=num_layers=] - 1. It shall be 7 or 12 as depicted in the above figure.

<dfn noexport>recon_gain</dfn> shall indicate the gain value to be applied to the channel, which is indicated by [=recon_gain_flags=], after decoding of the associated frames and demixing operation. The detailed operation using this value is specified in [[#processing-scalablechannelaudio-recongain]].

### Layout Syntax and Semantics ### {#syntax-layout}

The layout class specifies either a binaural system or the list of physical loudspeaker positions according to [[!ITU2051-3]].

<b>Syntax</b>

```
class layout() {
  unsigned int (2) layout_type;
  
  if (layout_type == LOUDSPEAKERS_SP_LABEL) {
    unsigned int (6) num_loudspeakers;
    for (i = 0; i < num_loudspeakers; i++) {
      unsigned int (8) sp_label;
    }
  } 
  else if (layout_type == LOUDSPEAKERS_SS_CONVENTION) {
    unsigned int (4) sound_system;
    unsigned int (2) reserved;
  }
  else if (layout_type == BINAURAL || layout_type == NOT_DEFINED) {
    unsigned int (6) reserved;
  }
}
```

<b>Semantics</b>

<dfn noexport>layout_type</dfn> specifies the layout type. 

<pre class = "def">
layout_type : Layout type
     0      : NOT_DEFINED
     1      : LOUDSPEAKERS_SP_LABEL
     2      : LOUDSPEAKERS_SS_CONVENTION
     3      : BINAURAL
</pre>

- A value of 0 shall indicate no specific layout.
- A value of 1 shall indicate that the layout is defined using the [=SP Label=] of [[!ITU2051-3]].
- A value of 2 shall indicate that the layout is defined using the sound system convention of [[!ITU2051-3]].
- A value of 3 shall indicate that the layout is binaural.

<dfn noexport>num_loudspeakers</dfn> shall specify the number of loudspeakers.

<dfn noexport>sp_label</dfn> shall define the [=SP Label=] as specified in [[!ITU2051-3]].


<style>
.col_border {
  border-left: 1px solid var(--def-border);
}
</style>

<table class="def">
<tr>
  <th>sp_label</th><th>SP label</th><th class="col_border">sp_label</th><th>SP label</th><th class="col_border">sp_label</th><th>SP label</th>
</tr>
<tr>
  <td>0</td><td>M+000</td><td class="col_border">18</td><td>U+000</td><td class="col_border">36</td><td>B+000</td>
</tr>
<tr>
  <td>1</td><td>M+022</td><td class="col_border">19</td><td>U+022</td><td class="col_border">37</td><td>B+022</td>
</tr>
<tr>
  <td>2</td><td>M-022</td><td class="col_border">20</td><td>U-022</td><td class="col_border">38</td><td>B-022</td>
</tr>
<tr>
  <td>3</td><td>M+SC</td><td class="col_border">21</td><td>U+030</td><td class="col_border">39</td><td>B+030</td>
</tr>
<tr>
  <td>4</td><td>M-SC</td><td class="col_border">22</td><td>U-030</td><td class="col_border">40</td><td>B-030</td>
</tr>
<tr>
  <td>5</td><td>M+030</td><td class="col_border">23</td><td>U+045</td><td class="col_border">41</td><td>B+045</td>
</tr>
<tr>
  <td>6</td><td>M-030</td><td class="col_border">24</td><td>U-045</td><td class="col_border">42</td><td>B-045</td>
</tr>
<tr>
  <td>7</td><td>M+045</td><td class="col_border">25</td><td>U+060</td><td class="col_border">43</td><td>B+060</td>
</tr>
<tr>
  <td>8</td><td>M-045</td><td class="col_border">26</td><td>U-060</td><td class="col_border">44</td><td>B-060</td>
</tr>
<tr>
  <td>9</td><td>M+060</td><td class="col_border">27</td><td>U+090</td><td class="col_border">45</td><td>B+090</td>
</tr>
<tr>
  <td>10</td><td>M-060</td><td class="col_border">28</td><td>U-090</td><td class="col_border">46</td><td>B-090</td>
</tr>
<tr>
  <td>11</td><td>M+090</td><td class="col_border">29</td><td>U+110</td><td class="col_border">47</td><td>B+110</td>
</tr>
<tr>
  <td>12</td><td>M-090</td><td class="col_border">30</td><td>U-110</td><td class="col_border">48</td><td>B-110</td>
</tr>
<tr>
  <td>13</td><td>M+110</td><td class="col_border">31</td><td>U+135</td><td class="col_border">49</td><td>B+135</td>
</tr>
<tr>
  <td>14</td><td>M-110</td><td class="col_border">32</td><td>U-135</td><td class="col_border">50</td><td>B-135</td>
</tr>
<tr>
  <td>15</td><td>M+135</td><td class="col_border">33</td><td>U+180</td><td class="col_border">51</td><td>B+180</td>
</tr>
<tr>
  <td>16</td><td>M-135</td><td class="col_border">34</td><td>UH+180</td><td class="col_border">52</td><td>LFE1</td>
</tr>
<tr>
  <td>17</td><td>M+180</td><td class="col_border">35</td><td>T+000</td><td class="col_border">53</td><td>LFE2</td>
</tr>
<tr>
  <td></td><td></td><td class="col_border"></td><td></td><td class="col_border">54 ~ 255</td><td>Reserved</td>
</tr>
</table>


<dfn noexport>sound_system</dfn> shall specify the sound system A to J as specified in [[!ITU2051-3]], 7.1.2ch and 3.1.2ch of [=loudspeaker_layout=] as follows:
 - 0: It shall indicate [=Loudspeaker configuration for Sound System A (0+2+0)=]
 - 1: It shall indicate [=Loudspeaker configuration for Sound System B (0+5+0)=]
 - 2: It shall indicate [=Loudspeaker configuration for Sound System C (2+5+0)=]
 - 3: It shall indicate [=Loudspeaker configuration for Sound System D (4+5+0)=]
 - 4: It shall indicate [=Loudspeaker configuration for Sound System E (4+5+1)=]
 - 5: It shall indicate [=Loudspeaker configuration for Sound System F (3+7+0)=]
 - 6: It shall indicate [=Loudspeaker configuration for Sound System G (4+9+0)=]
 - 7: It shall indicate [=Loudspeaker configuration for Sound System H (9+10+3)=]
 - 8: It shall indicate [=Loudspeaker configuration for Sound System I (0+7+0)=]
 - 9: It shall indicate [=Loudspeaker configuration for Sound System J (4+7+0)=]
 - 10: It shall indicate the same loudspeaker configuration as [=loudspeaker_layout=] = 0110 (i.e. 7.1.2ch)
 - 11: It shall indicate the same loudspeaker configuration as [=loudspeaker_layout=] = 1000 (i.e. 3.1.2ch)
 - 12 ~ 15: Reserved
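
The following informative Python sketch illustrates how the layout() class could be parsed from a byte buffer. The function name `parse_layout` and the returned dictionary keys are hypothetical, and the sketch assumes that fields are packed most significant bit first within each byte.

```python
NOT_DEFINED, LOUDSPEAKERS_SP_LABEL, LOUDSPEAKERS_SS_CONVENTION, BINAURAL = range(4)


def parse_layout(data):
    """Parse layout() from bytes; return (fields, bytes_consumed).

    ASSUMPTION: bit fields are packed MSB first.
    """
    first = data[0]
    layout_type = first >> 6   # unsigned int (2) layout_type
    low6 = first & 0x3F        # remaining 6 bits of the first byte

    if layout_type == LOUDSPEAKERS_SP_LABEL:
        num_loudspeakers = low6
        labels = list(data[1:1 + num_loudspeakers])
        if len(labels) != num_loudspeakers:
            raise ValueError("truncated sp_label list")
        return {"layout_type": layout_type, "sp_labels": labels}, 1 + num_loudspeakers

    if layout_type == LOUDSPEAKERS_SS_CONVENTION:
        # unsigned int (4) sound_system followed by 2 reserved bits
        return {"layout_type": layout_type, "sound_system": low6 >> 2}, 1

    # BINAURAL or NOT_DEFINED: the 6 remaining bits are reserved
    return {"layout_type": layout_type}, 1
```

For example, the single byte 0xA4 carries layout_type = 2 and sound_system = 9 under these assumptions.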

### Rendering Config Syntax and Semantics ### {#syntax-rendering-config}

<b>Syntax</b>

```
class rendering_config() {
  if (audio_element_type == CHANNEL_BASED) {
    itur_bs2127_direct_speakers_config();
  }
  else if (audio_element_type == SCENE_BASED) {
    itur_bs2127_hoa_config();
  }
}
```

<b>Semantics</b>

<dfn value noexport for="rendering_config()">audio_element_type</dfn> is the [=audio_element_type=] value in Audio Element OBU.

<dfn noexport>itur_bs2127_direct_speakers_config()</dfn> is a class that provides the metadata required for rendering a multichannel audio element to a loudspeaker layout, as specified in [[!ITU2127-0]].

<dfn noexport>itur_bs2127_hoa_config()</dfn> is a class that provides the metadata required for rendering an ambisonics audio element to a loudspeaker layout, as specified in [[!ITU2127-0]].

#### ITU-R BS.2127 Direct Speakers Config Syntax and Semantics #### {#syntax-rendering-direct-speakers-config}

The metadata specified in itur_bs2127_direct_speakers_config(), based on [[!ITU2076-2]], provides information about the loudspeaker that is intended to be used for playing back each of the input audio channels. An IA renderer must use the DirectSpeakers rendering method of EAR, specified in [[!ITU2127-0]], in combination with these metadata to render to the output loudspeakers.

The position information is specified in polar coordinates following the convention of [[!ITU2076-2]]. Specifically,

- The origin is in the centre.
- The azimuth angles are expressed in degrees, with 0 degrees as straight ahead, and positive values rotating to the left (anti-clockwise) when viewed from above.
- The elevation angles are expressed in degrees, with 0 degrees horizontally ahead, and positive values going up.
- The distance is a normalized distance, where 1.0 is assumed to be the default radius of the sphere.


<b>Syntax</b>

```
class itur_bs2127_direct_speakers_config() {
  unsigned int (1) distance_flag;
  unsigned int (1) position_bounds_flag;
  unsigned int (1) screen_edge_lock_azimuth_flag;
  unsigned int (1) screen_edge_lock_elevation_flag;
  unsigned int (4) reserved;

  for (int i = 0; i < num_channels; i++) {
    if (distance_flag) {
      signed int (8) distance;
    }

    if (position_bounds_flag) {
      signed int (16) azimuth_max;
      signed int (16) azimuth_min;
      signed int (16) elevation_max;
      signed int (16) elevation_min;
      signed int (8) distance_max;
      signed int (8) distance_min;
    }

    if (screen_edge_lock_azimuth_flag) {
      unsigned int (2) screen_edge_lock_azimuth;
    }
    if (screen_edge_lock_elevation_flag) {
      unsigned int (2) screen_edge_lock_elevation;
    }
    byte_alignment();
  }
}
```

<b>Semantics</b>

<dfn noexport>distance_flag</dfn> indicates if a distance other than the default value of 1.0 is provided.

<dfn noexport>position_bounds_flag</dfn> indicates if the position bounds for the azimuth, elevation and distance are provided.

<dfn noexport>screen_edge_lock_azimuth_flag</dfn> indicates if the screen edge lock value to be used in combination with the azimuth position is provided.

<dfn noexport>screen_edge_lock_elevation_flag</dfn> indicates if the screen edge lock value to be used in combination with the elevation position is provided.

<dfn noexport>num_channels</dfn> indicates the number of audio channels within the audio element. The order of the channels shall be the same as the order in which they are specified in [=loudspeaker_layout=].

<dfn noexport>distance</dfn> specifies the normalized distance from the origin as a signed Q1.6 fixed-point value (in [[!Q-Format]]). It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: coordinate="distance"

<dfn noexport>azimuth_max</dfn> specifies the maximum bound of the azimuth as a signed Q8.7 fixed-point value (in [[!Q-Format]]). It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: coordinate="azimuth", bound="max"

<dfn noexport>azimuth_min</dfn> specifies the minimum bound of the azimuth as a signed Q8.7 fixed-point value (in [[!Q-Format]]). It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: coordinate="azimuth", bound="min"

<dfn noexport>elevation_max</dfn> specifies the maximum bound of the elevation as a signed Q8.7 fixed-point value (in [[!Q-Format]]). It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: coordinate="elevation", bound="max"

<dfn noexport>elevation_min</dfn> specifies the minimum bound of the elevation as a signed Q8.7 fixed-point value (in [[!Q-Format]]). It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: coordinate="elevation", bound="min"

<dfn noexport>distance_max</dfn> specifies the maximum bound of the distance as a signed Q1.6 fixed-point value (in [[!Q-Format]]). It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: coordinate="distance", bound="max"

<dfn noexport>distance_min</dfn> specifies the minimum bound of the distance as a signed Q1.6 fixed-point value (in [[!Q-Format]]). It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: coordinate="distance", bound="min"
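
The fixed-point fields above can be converted to real values as in the following informative Python sketch. The helper names are hypothetical. The sketch assumes that the 8-bit Q1.6 and 16-bit Q8.7 fields are two's complement values with 6 and 7 fractional bits respectively, and that each field has already been read as an unsigned bit pattern.

```python
def q_to_float(raw, total_bits, frac_bits):
    """Interpret an unsigned bit pattern as a two's complement fixed-point value."""
    if raw >= 1 << (total_bits - 1):   # sign bit set
        raw -= 1 << total_bits
    return raw / (1 << frac_bits)


def decode_distance(raw):
    """distance, distance_max, distance_min: 8 bits, 6 fractional bits (Q1.6)."""
    return q_to_float(raw, 8, 6)


def decode_angle(raw):
    """azimuth_* and elevation_*: 16 bits, 7 fractional bits (Q8.7), in degrees."""
    return q_to_float(raw, 16, 7)
```

Under these assumptions, the bit pattern 0x40 decodes to the default distance of 1.0, and 0x0F00 decodes to an angle of 30 degrees.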

<dfn noexport>screen_edge_lock_azimuth</dfn> indicates the edge that the loudspeaker is locked to, to be used in combination with the azimuth position. It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: screenEdgeLock

<pre class = "def">
screen_edge_lock_azimuth : Screen edge to lock to
            0            : LEFT
            1            : RIGHT
            2            : TOP
            3            : BOTTOM
</pre>

<dfn noexport>screen_edge_lock_elevation</dfn> indicates the edge that the loudspeaker is locked to, to be used in combination with the elevation position. It is the same as the following attribute in [[!ITU2076-2]]:

- audioChannelFormat.typeDefinition == "DirectSpeakers"
- sub-element: "position"
- attribute: screenEdgeLock

<pre class = "def">
screen_edge_lock_elevation : Screen edge to lock to
            0              : LEFT
            1              : RIGHT
            2              : TOP
            3              : BOTTOM
</pre>


#### ITU-R BS.2127 HOA Config Syntax and Semantics #### {#syntax-rendering-hoa-config}

<b>Syntax</b>

```
class itur_bs2127_hoa_config() {
}
 ```
 
<b>Semantics</b>
 
NOTE: itur_bs2127_hoa_config() has an empty payload.


### Element Mix Config Syntax and Semantics ### {#syntax-element-mix-config}

[=element_mix_config()=] provides a gain value to be applied to the rendered audio element signal.

<b>Syntax</b>

```
class element_mix_config() {
  MixGainParamDefinition mix_gain;
}
```

<b>Semantics</b>

<dfn noexport>mix_gain</dfn> provides the parameter definition for the gain value that is applied to all channels of the rendered audio element signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data().

#### Mix Gain Parameter Definition and Data Syntax and Semantics #### {#syntax-mix-gain-param}

<b>Syntax</b>

```
class MixGainParamDefinition extends ParamDefinition() {
  signed int (16) default_mix_gain;
}

class mix_gain_parameter_data(animation_type) {
  AnimatedParameterData<signed int (16)> param_data;
}
```

<b>Semantics</b>

<dfn noexport>default_mix_gain</dfn> shall specify the default mix gain value to apply when there are no mix gain parameter blocks provided. This value is expressed in dB and shall be applied to all channels in the rendered audio element. It is stored as a 16-bit, signed, two's complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [[!Q-Format]]).

<dfn noexport>param_data</dfn> shall use the AnimatedParameterData function template. Each of the values defined within this instance (start_point_value, end_point_value and control_point_value) shall be expressed in dB and shall be applied to all channels in the rendered audio element. They are stored as 16-bit, signed, two's complement fixed-point values with 8 fractional bits (i.e. Q7.8 in [[!Q-Format]]).
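
The following informative Python sketch converts a stored mix gain value to dB and then to a linear scale factor. The function names are hypothetical. The conversion from dB to a linear factor assumes an amplitude gain (20·log10 convention), which this section does not state explicitly.

```python
def mix_gain_db(raw):
    """Convert a 16-bit Q7.8 bit pattern to a gain in dB."""
    if raw >= 0x8000:      # sign bit set: two's complement negative value
        raw -= 0x10000
    return raw / 256.0     # 8 fractional bits


def db_to_linear(gain_db):
    """ASSUMPTION: the dB value is an amplitude gain, so linear = 10^(dB/20)."""
    return 10.0 ** (gain_db / 20.0)


def apply_mix_gain(samples, raw_gain):
    """Scale every sample of the rendered audio element by the mix gain."""
    factor = db_to_linear(mix_gain_db(raw_gain))
    return [s * factor for s in samples]
```

A raw value of 0x0000 therefore leaves the signal unchanged, and 0xFA00 corresponds to -6 dB.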


## Codec Specific ## {#codec-specific}

This section defines codec specific information for Codec_Specific_Info and Substream.

- <dfn noexport>Codec_Specific_Info</dfn> shall be composed of [=Codec_ID=] and [=Decoder_Config()=]. Codec_ID shall indicate the codec that has been used to generate a given substream within the IA sequence, and Decoder_Config() shall indicate the decoding parameters that are applied to the substream within the IA sequence.

For legacy codecs, Decoder_Config() shall be exactly the same information as the conventional file parser feeds to the codec decoders for decoding of the substream. For future codecs, Decoder_Config() shall include all decoding parameters that are required to decode Substreams.

- Substream shall be a raw coded stream for one or more channels. The Substream format shall be exactly the same as the sample format (before OBU packing and excluding parameter blocks) of an audio file that consists of a single coded stream produced by the codec identified by Codec_ID.


### IAC-OPUS Specific ### {#iac-opus-specific}

Codec_Specific_Info for IAC-OPUS shall conform to [=ID Header=] with [=ChannelMappingFamily=] = 0 of [[!RFC7845]] with the following constraints:
- [=Channel Count=] should be set to 2.
- [=Output Gain=] shall not be used. In other words, it shall be set to 0dB.
- The byte order of each field in [=ID Header=] shall be converted to big endian.

Substream format shall be [=opus packet=] of [[!RFC6716]] which contains only one single frame of mono or stereo channels and which has non-delimiting frame structure.


### IAC-AAC-LC Specific ### {#iac-aac-lc-specific}

[=Codec_ID=] shall be 'mp4a'.

[=Decoder_Config()=] for IAC-AAC-LC shall be [=DecoderConfigDescriptor()=] of [[!MP4-Systems]], which is a subset of [=ESDBox=] for [[!MP4-Audio]], with the following constraints:
- [=objectTypeIndication=] = 0x40
- [=streamType=] = 0x05 (Audio Stream)
- [=upstream=] = 0
- [=decSpecificInfo()=]: The syntax and values shall conform to [=AudioSpecificConfig()=] of [[!MP4-Audio]] with the following constraints:
	- [=audioObjectType=] = 2
	- [=channelConfiguration=] should be set to 2.
	- [=GASpecificConfig()=]: The syntax and values shall conform to [=GASpecificConfig()=] of [[!MP4-Audio]] with the following constraints:
		- [=frameLengthFlag=] = 0 (1024 lines IMDCT)
		- [=dependsOnCoreCoder=] = 0
		- [=extensionFlag=] = 0

Substream format shall be one single [=raw_data_block()=] of [[!AAC]] which contains only one single frame of mono or stereo channels.

### IAC-FLAC Specific ### {#iac-flac-specific}

[=Codec_ID=] shall be 'fLaC', the FLAC stream marker in ASCII, meaning byte 0 of the stream is 0x66, followed by 0x4C 0x61 0x43.

[=Decoder_Config()=] for IAC-FLAC shall be [=METADATA_BLOCK=] of [[!FLAC]].

Substream format shall be [=FRAME=] of [[!FLAC]], which is composed of [=FRAME_HEADER=], followed by [=SUBFRAME=](s) (one [=SUBFRAME=] per channel) and followed by [=FRAME_FOOTER=].

### IAC-LPCM Specific ### {#iac-lpcm-specific}

[=Codec_ID=] shall be 'lpcm'.

[=Decoder_Config()=] for IAC-LPCM shall be as follows:

```
class decoder_config(lpcm) {
  unsigned int (32) sample_rate;
  unsigned int (8) sample_size;
}
```

<dfn noexport>sample_rate</dfn> shall indicate the sample rate of the input audio in Hz.

<dfn noexport>sample_size</dfn> shall indicate the size of a PCM sample in bits. The value shall be less than or equal to 24.

Substream format shall be the LPCM audio samples for the frame size. 

For IAC-LPCM, one Substream is an individual channel of the input audio. For example, each channel of an Ambisonics signal is one Substream, and each speaker channel of channel-based audio is one Substream.
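
The following informative Python sketch reads the two fields of the IAC-LPCM Decoder_Config() and checks the constraint on sample_size. The function name `parse_lpcm_decoder_config` is hypothetical, and big-endian byte order for the 32-bit sample_rate field is an assumption.

```python
import struct


def parse_lpcm_decoder_config(data):
    """Return (sample_rate, sample_size) from the first 5 bytes of data.

    ASSUMPTION: sample_rate is stored in big-endian byte order.
    """
    if len(data) < 5:
        raise ValueError("decoder_config(lpcm) needs 5 bytes")
    sample_rate, sample_size = struct.unpack(">IB", data[:5])
    if sample_size > 24:
        raise ValueError("sample_size shall be less than or equal to 24")
    return sample_rate, sample_size
```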


# Profiles # {#profiles}

The IA Profiles define a set of capabilities that are required to parse, decode and process the corresponding IA sequence.


## IA Simple Profile ## {#profiles-simple}

This section specifies the conformance points of the simple profile.

Restrictions on the IA sequence:

- There shall be only one unique Codec Config OBU.
- There shall be only one unique Audio Element OBU.
- There shall be only one unique set of Descriptor OBUs.
- There shall not be any Temporal Delimiter OBUs present.
- [=version=] shall be set to 0 for this version of specification.
- [=profile_version=] shall be set to 0 for this version of specification.
	- [=num_sub_mixes=] shall be set to 1 for this profile.
	- All flags of itur_bs2127_direct_speakers_config() shall be set to 0 for this profile.
- [=num_layers=] shall be set to 1 or up to 6 for Channel-based audio element (i.e. scalable channel audio).
    - In this case, [=demixing_info_parameter_data()=] and [=recon_gain_info_parameter_data()=] may be present in the IA sequence.
    - In case of simple scalable channel audio (e.g. mono for layer 1 & stereo for layer 2), [=demixing_info_parameter_data()=] and [=recon_gain_info_parameter_data()=] shall not be present in the bitstream.
    - When num_layers = 1, OBU_IA_Parameter_Block including [=demixing_info_parameter_data()=] may be present in the IA sequence and IA decoders may use the [=demixing_info_parameter_data()=] for dynamic down-mixing.
- All audio frames shall have aligned frame boundaries.

Capabilities of the IA parser, decoder and processor:
- They shall be able to parse an IA sequence with the MSB four bits of [=profile_version=] = 0 and the MSB four bits of [=version=] = 0 (i.e., profile_version = 0 to 15 and version = 0 to 15).
- They shall be able to decode and process up to 16 channels.
- They shall be able to reconstruct one audio element.
- They may use demixing_info_parameter_data() to do down-mixing.

## IA Base Profile ## {#profiles-base}

This section specifies the conformance points of the base profile.

Restrictions on IA sequence:
- There shall be only one unique Codec Config OBU.
- There shall be at most two unique Audio Element OBUs at any one time.
- There may be more than one unique set of Descriptor OBUs.
- There may be Temporal Delimiter OBUs present.
- [=version=] shall be set to 0 for this version of specification.
- [=profile_version=] shall be set to 16 for this version of specification.
	- [=num_sub_mixes=] shall be set to 1 for this profile.
	- All flags of itur_bs2127_direct_speakers_config() shall be set to 0 for this profile.
	- For a given Mix Presentation OBU, it shall include at most one scene-based audio element and it shall include at most one channel-based audio element with [=num_layers=] > 1 for this profile.
- [=num_layers=] shall be set to 1 or up to 6 for Channel-based audio element (i.e. scalable channel audio)
    - In this case, [=demixing_info_parameter_data()=] and [=recon_gain_info_parameter_data()=] may be present in the IA sequence.
    - In case of simple scalable channel audio (e.g. mono for layer 1 & stereo for layer 2), [=demixing_info_parameter_data()=] and [=recon_gain_info_parameter_data()=] shall not be present in the bitstream.
    - When num_layers = 1, OBU_IA_Parameter_Block including [=demixing_info_parameter_data()=] may be present in the IA sequence and IA decoders may use the [=demixing_info_parameter_data()=] for dynamic down-mixing.
- All audio frames shall have aligned frame boundaries.

Capabilities of the IA parser, decoder and processor:
- They shall be able to parse an IA sequence with the MSB four bits of [=profile_version=] = 0 or 1 and the MSB four bits of [=version=] = 0 (i.e., profile_version = 0 to 31 and version = 0 to 15).
- They shall be able to support the capabilities of the Simple Profile.
- They shall be able to decode and process up to 16 channels.
- They shall be able to reconstruct two audio elements.
- They shall be able to mix two audio elements.
- They shall be able to process short-lived audio elements.

## IA Enhanced Profile ## {#profiles-enhanced}

This section specifies the conformance points of the enhanced profile.

Restrictions on IA sequence:
- There may be more than one unique Codec Config OBU.
- There may be more than one unique Audio Element OBU.
- There may be more than one unique Mix Presentation OBU.
- There shall not be Temporal Delimiter OBUs present.
- [=version=] shall be set to 0 for this version of specification.
- [=profile_version=] shall be set to 32 for this version of specification.
- The different Codec Config OBUs may have different [=codec_id=]s specified with the following constraints:
    - The combination of [=codec_id=] = 'fLaC' for one substream and [=codec_id=] = 'opus' for another substream shall not be allowed.
    - The combination of [=codec_id=] = 'fLaC' for one substream and [=codec_id=] = 'mp4a' for another substream shall not be allowed.
- [=num_layers=] shall be set to 1 or up to 6 for Channel-based audio element (i.e. scalable channel audio)
    - In this case, [=demixing_info_parameter_data()=] and [=recon_gain_info_parameter_data()=] may be present.
    - In case of simple scalable channel audio (e.g. mono for layer 1 & stereo for layer 2), [=demixing_info_parameter_data()=] and [=recon_gain_info_parameter_data()=] shall not be present.

Capabilities of the IA parser, decoder and processor:
- They shall be able to parse an IA sequence with the MSB four bits of [=profile_version=] = 0, 1 or 2 and the MSB four bits of [=version=] = 0 (i.e., profile_version = 0 to 47 and version = 0 to 15).
- They shall be able to support the capabilities of the base profile.
- They shall be able to decode and process up to 36 channels.
- They shall be able to decode one or more different audio codecs in the same sequence, with the exception of the following combinations:
    - IAC-FLAC and IAC-OPUS
    - IAC-FLAC and IAC-AAC-LC
- An IA decoder that conforms to this profile shall be able to synchronize two or more audio elements with different frame sizes.
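
The parsing capabilities of the three profiles can be summarized by the following informative Python sketch. The names `profile_index` and `can_parse` are hypothetical, and the sketch assumes that [=profile_version=] and [=version=] are 8-bit values whose upper four bits carry the major number.

```python
SIMPLE, BASE, ENHANCED = 0, 1, 2


def profile_index(profile_version):
    """Return the MSB four bits of an 8-bit profile_version."""
    return (profile_version >> 4) & 0x0F


def can_parse(decoder_profile, profile_version, version):
    """True if a decoder of the given profile is required to parse the sequence.

    A decoder of profile P parses sequences whose profile index is at most P
    and whose version has its MSB four bits equal to 0.
    """
    return (version >> 4) == 0 and profile_index(profile_version) <= decoder_profile
```

For example, a Base Profile decoder is required to parse sequences with profile_version from 0 to 31, but not a sequence with profile_version = 32.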

# Standalone IAC Representation # {#standalone}

This section details the order in which the OBUs shall be sequenced in a standalone IAC representation. It further specifies how the Data OBUs shall be synchronized, with the aid of the Sync OBUs.

## OBU Sequence Order ## {#standalone-obu-sequence-order}

An IA sequence is composed of a set of descriptor OBUs followed by their associated data OBUs. This pattern is repeated as many times as needed.

### Descriptor OBUs ### {#standalone-descriptor-obus}
A set of Descriptor OBUs shall be placed at the beginning of the bitstream in the following order:

1. One Magic Code OBU
2. All Codec Config OBUs
3. All Mix Presentation OBUs
4. All Audio Element OBUs
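
The ordering above can be checked with the following informative Python sketch. The string labels used for the OBU types and the function name `descriptors_in_order` are hypothetical and are not identifiers defined by this specification.

```python
# Hypothetical labels, listed in the required order of appearance.
DESCRIPTOR_ORDER = ["magic_code", "codec_config", "mix_presentation", "audio_element"]


def descriptors_in_order(obu_types):
    """Check that a list of descriptor OBU labels follows the required order.

    Raises ValueError if a label is not one of DESCRIPTOR_ORDER.
    """
    ranks = [DESCRIPTOR_ORDER.index(t) for t in obu_types]
    # Exactly one Magic Code OBU, and the ranks never decrease.
    return obu_types.count("magic_code") == 1 and ranks == sorted(ranks)
```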


### Data OBUs ### {#standalone-data-obus}

One Sync OBU shall be placed immediately after the Descriptor OBUs. This shall be followed by a sequence of Audio Frame OBUs, Parameter Block OBUs, one or more additional Sync OBUs and one or more Temporal Delimiter OBUs, according to the rules below:

- Audio Frame OBUs and Parameter Block OBUs must be ordered by their implied timestamp in the timeline, and may be interleaved.
- If there are multiple Audio Frame OBUs that have the same implied start timestamp, they must be grouped by audio elements.
- A new Sync OBU may be inserted anywhere in the sequence of data OBUs, as frequently as needed.
- Between two Sync OBUs, a sequence of audio frames or parameter blocks must be gapless.
- If an Audio Frame OBU or Parameter Block OBU has a substream or parameter ID that is not defined in the most recent Sync OBU, it must not appear in the bitstream until a new Sync OBU is provided that specifies that ID.
- A Temporal Delimiter OBU may be inserted at the beginning of a temporal unit, defined as a set of all audio frames with the same start timestamp and the same duration from all substreams and all non-redundant parameter blocks with the start timestamp within the duration. A temporal unit may include redundant parameter blocks.
- If Temporal Delimiter OBUs are present, they must be inserted at the beginning of every temporal unit.

Additionally, the following constraints apply to the Audio Frame and Parameter Block OBUs:

- Audio Frame OBUs must be provided non-redundantly, such that for each substream, there shall not be two audio frames that are overlapping in time.
- Parameter Block OBUs may be provided redundantly, such that they contain the same data as a previously provided Parameter Block OBU for the same time region. In this case, the "obu_redundant_copy" field in the OBU header shall be set to 1.
- Redundant Parameter Block OBUs do not need to be ordered by their implied timestamp in the timeline. The implied timestamp should be inferred from the initial non-redundant version.
- Non-redundant Parameter Block OBUs must not provide data for overlapping time regions.

### Refreshing Descriptor OBUs ### {#standalone-obu-sequence-refreshes}

The above describes the full sequence of OBUs for a given set of descriptor OBUs and their associated data OBUs. If the IAC configuration changes, a new set of descriptor OBUs is required. In that case, a new sequence of the complete set of descriptor OBUs, a Sync OBU and their corresponding data OBUs shall follow, in the same order as described above.

NOTE: If an IA sequence contains two audio elements, one of which is short-lived content, the number of audio elements changes (i.e. from one to two or from two to one) depending on whether the short-lived content is present. A new set of descriptor OBUs is therefore present to indicate the change. In this case, the new set of descriptor OBUs includes a new Mix Presentation OBU to provide the proper mixing for the audio element(s) to be mixed.

The descriptor OBUs may additionally be repeated redundantly and as frequently as necessary. In this case, the "obu_redundant_copy" field in the OBU header of each of the descriptor OBUs shall be set to 1.

If there is a set of descriptor OBUs placed mid-stream, there may be parameter blocks that came before them which are still valid and applicable for the duration after the descriptor OBUs. In this case, these parameter blocks must be redundantly copied and placed after the first Sync OBU that follows the descriptor OBUs. This ensures that any receiver joining mid-stream and encountering a set of descriptor OBUs is guaranteed to be able to receive the complete set of metadata that is applicable to all audio frames that come after.

## Synchronizing Data OBUs ## {#standalone-synchronizing-data-obus}

The audio frames and parameter data provided in the Data OBUs may be asynchronous; different audio substreams may have different audio frame sizes, parameter blocks may have different durations from the audio frames, or there may be gaps in a parameter's timeline. This section details how these Data OBUs may be synchronized, based on their duration and the information provided in the Sync OBUs.

The Sync OBU contains two pieces of information that apply to all substreams and parameters that follow it:

1) a relative offset for each of the substreams and parameters, and

2) a global offset.

The relative offsets describe how the substreams and parameters are positioned with respect to the timeline of Substreams generated from the previous Sync OBU. For example, from the previous Sync OBU, Substream 1 has an end timestamp of 960 units, Substream 2 has an end timestamp of 1024 units, Parameter 1 has an end timestamp of 900 units and Parameter 2 has an end timestamp of 1100 units. The relative offsets are calculated with respect to the maximum end timestamp of the Substreams (i.e. 1024 units). So, Substream 1 has a relative offset that is 64 units before Substream 2, Substream 2 has a relative offset of 0 units, Parameter 1 has a relative offset that is 124 units before Substream 2 and Parameter 2 has a relative offset that is 76 units after Substream 2.

<table class="def">
<tr>
  <th>ID (name)</th><th>end timestamp</th><th>Relative offset</th>
</tr>
<tr>
  <td>N/A (Global offset)</td><td>0</td><td>0</td>
</tr>
<tr>
  <td>1 (Substream 1)</td><td>960</td><td>-64</td>
</tr>
<tr>
  <td>2 (Substream 2)</td><td>1024</td><td>+0</td>
</tr>
<tr>
  <td>3 (Parameter 1)</td><td>900</td><td>-124</td>
</tr>
<tr>
  <td>4 (Parameter 2)</td><td>1100</td><td>+76</td>
</tr>
</table>
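
The arithmetic behind the example values can be reproduced with the following informative Python sketch. The function name `relative_offsets` is hypothetical. The sketch assumes that every substream and parameter resumes exactly at its previous end timestamp and that the global offset is 0.

```python
def relative_offsets(end_timestamps, substream_ids):
    """Compute relative offsets against the latest substream end timestamp.

    end_timestamps maps each ID to its end timestamp from the previous Sync OBU.
    substream_ids lists the IDs that are audio substreams (not parameters).
    """
    reference = max(end_timestamps[i] for i in substream_ids)
    return {i: t - reference for i, t in end_timestamps.items()}


ends = {1: 960, 2: 1024, 3: 900, 4: 1100}   # IDs 1 and 2 are substreams
offsets = relative_offsets(ends, substream_ids=[1, 2])
```

With the end timestamps of the example, `offsets` holds -64, 0, -124 and +76 for IDs 1 to 4.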

The global offset defines an additional offset that is applied to all substreams and parameters, and can be used to express intentional gaps between the local frames associated with two Sync OBUs.

The local frame of reference can be positioned in a global frame of reference by using the concatenation rule provided below. This rule specifies how two timelines associated with different Sync OBUs shall be aligned.

<dfn noexport>Concatenation Rule</dfn>

Ignoring the global offset, the new timeline after a Sync OBU is extended based on the timeline generated from the previous Sync OBU and relative offsets in the current Sync OBU. Then, the global offset is applied to additionally shift the new timeline.


The algorithm below may be used to implement the concatenation rule. Whichever implementation is used, the result shall comply with the rule.

```
For a given ID, end_timestamp\[ID]\[0] = 0 (i.e. initial value = 0)

Encoder operation for the Nth Sync OBU 
// Encoders know the position of an OBU on global timeline. 
// So, they know start_timestamp\[ID]\[N] and global_offset\[N].

For each ID in the Nth Sync OBU.

relative_offset\[ID]\[N] = start_timestamp\[ID]\[N] - global_offset\[N]
                           - max(end_timestamp\[ID]\[N-1] for each audio frame ID);
// i.e. relative_offset\[ID]\[1] = start_timestamp\[ID]\[1] 
//                                 - global_offset\[1] for the first Sync OBU.

// end_timestamp\[ID]\[N] for each ID is calculated as follows:

end_timestamp\[ID]\[N] = start_timestamp\[ID]\[N];

for (i = 0; i < M(N); i++) {
  end_timestamp\[ID]\[N] += frame size of ith audio frame
                            (or duration of ith parameter block) having the ID;
}
// M(N) is the number of audio frame OBUs(or parameter block OBUs) having the given ID
// between the Nth Sync OBU and the (N+1)th Sync OBU.

Decoder operation for the Nth Sync OBU 
// Decoders need to extend the timeline generated from the previous Sync OBU.
// For the first Sync OBU which decoders get after joining mid-stream, N is set to 1.

For each ID in the Nth Sync OBU.

start_timestamp\[ID]\[N] = max(end_timestamp\[ID]\[N-1] for each audio frame ID)
                           + relative_offset\[ID]\[N] + global_offset\[N];

end_timestamp\[ID]\[N] = start_timestamp\[ID]\[N];

for (i = 0; i < M(N); i++) {
  end_timestamp\[ID]\[N] += frame size of ith audio frame
                            (or duration of ith parameter block) having the ID;
}
// M(N) is the number of audio frame OBUs
// (or parameter block OBUs) having the given ID
// between the Nth Sync OBU and the (N+1)th Sync OBU.
```
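
The decoder-side steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a normative implementation: the function name `extend_timeline`, the dictionary-based inputs and the explicit `audio_frame_ids` set are hypothetical, and all timestamps are taken to be expressed in samples.

```python
def extend_timeline(prev_end, audio_frame_ids, relative_offset, global_offset, durations):
    """Return (start, end) timestamp dicts for the IDs listed in one Sync OBU.

    prev_end        -- end timestamp per ID from the previous Sync OBU (empty for N = 1)
    audio_frame_ids -- IDs that identify audio substreams rather than parameters
    relative_offset -- relative offset per ID carried in the current Sync OBU
    global_offset   -- global offset carried in the current Sync OBU
    durations       -- per ID, the frame sizes (or parameter block durations)
                       seen before the next Sync OBU
    """
    # The previous timeline ends where its latest audio frame ends (0 initially).
    base = max((prev_end[i] for i in audio_frame_ids if i in prev_end), default=0)
    start = {i: base + off + global_offset for i, off in relative_offset.items()}
    end = {i: start[i] + sum(durations.get(i, ())) for i in start}
    return start, end


# First Sync OBU: two audio substreams (IDs 1 and 2) and one parameter (ID 3).
start1, end1 = extend_timeline({}, {1, 2}, {1: 0, 2: 0, 3: 10}, 100,
                               {1: (960, 960), 2: (960, 960), 3: (1920,)})
# Second Sync OBU: continues from the latest audio end time of the first one.
start2, end2 = extend_timeline(end1, {1, 2}, {1: 0, 2: 0, 3: 0}, 0,
                               {1: (960,), 2: (960,), 3: (960,)})
```

Note that only audio frame IDs contribute to the maximum, so a parameter that ends later than the audio does not shift the next timeline.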


# ISOBMFF IAC Encapsulation # {#isobmff}

## General Requirements & Brands ## {#brands}

A file conformant to this specification satisfies the following:
- It shall conform to the normative requirements of [[!ISOBMFF]]
- It shall have the <dfn value export for="ISOBMFF Brand">iamf</dfn> brand among the compatible brands array of the FileTypeBox
- It shall contain at least one track using an [=IASampleEntry=]
- It SHOULD indicate a structural ISOBMFF brand among the compatible brands array of the FileTypeBox, such as 'iso6'
- It MAY indicate other brands not specified in this specification provided that the associated requirements do not conflict with those given in this specification

Parsers shall support the structures required by the <code>'iso6'</code> brand and MAY support structures required by further ISOBMFF structural brands.


## ISOBMFF IAC Encapsulation with single track ## {#isobmff-singletrack}

This section describes the basic data structures used to signal encapsulation of an IA sequence in [[!ISOBMFF]] containers.

### Requirement of IA sequence ### {#isobmff-singletrack-iasequence}

For encapsulation in ISOBMFF with a single track, the IA sequence shall comply with the bitstream specified in [[#profiles-simple]] or [[#profiles-base]].


### Encapsulation Scheme ### {#isobmff-singletrack-basicencapsulationscheme}

During the encapsulation process, the OBUs of an IA sequence are encapsulated into [[!ISOBMFF]] as follows:
- Magic Code OBU: the version and profile_version fields shall be moved to IASampleEntry.
- Codec Config OBU: 
	- codec_id and decoder_config() shall be moved to IASampleEntry.
	- num_samples_per_frame shall be moved to 'stts'.
	- roll_distance shall be stored as an [=AudioPreRollEntry=] with [=grouping_type=] 'prol'.
- Mix Presentation OBUs and Audio Element OBUs (with OBU syntax) shall be stored as a new sample group with [=grouping_type=] [=iagd=].
- Sync OBU: parse the input timeline using the Sync OBU information and construct a PTRO box that describes the relative offsets for each parameter block.
- Each temporal unit:
	- Temporal Delimiter OBU: shall be discarded if present.
	- Parameter Block OBU for demixing_info_parameter_data() (with OBU syntax) shall be stored as a new sample group with [=grouping_type=] [=demi=].
	- The remaining OBUs of each temporal unit shall be stored as one sample, with no gaps between OBUs.
- Audio Frame OBUs:
	- Select one substream.
	- If [=obu_trimming_status_flag=] of the first Audio Frame OBU of the substream is set to 1, keep parsing the following Audio Frame OBUs of the substream, summing [=num_samples_to_trim_at_start=], until an Audio Frame OBU with [=obu_trimming_status_flag=] = 0 is reached. Then reflect the resulting sum in 'edts'.
	- If [=obu_trimming_status_flag=] of the last Audio Frame OBU of the substream is set to 1, reflect num_samples_to_trim_at_end in 'stts'.
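
The trimming steps for Audio Frame OBUs can be sketched as follows. This is only an illustration: the `AudioFrameTrim` class is a hypothetical, already-parsed view of the trimming fields of one Audio Frame OBU, and writing the results into 'edts' and 'stts' is left out.

```python
from dataclasses import dataclass


@dataclass
class AudioFrameTrim:
    """Hypothetical view of the trimming fields of one Audio Frame OBU."""
    obu_trimming_status_flag: int
    num_samples_to_trim_at_start: int = 0
    num_samples_to_trim_at_end: int = 0


def leading_trim(frames):
    """Sum num_samples_to_trim_at_start over the leading run of flagged frames."""
    total = 0
    for frame in frames:
        if frame.obu_trimming_status_flag == 0:
            break  # first frame without trimming ends the run
        total += frame.num_samples_to_trim_at_start
    return total


def trailing_trim(frames):
    """Return num_samples_to_trim_at_end of the last frame if it is flagged."""
    if frames and frames[-1].obu_trimming_status_flag == 1:
        return frames[-1].num_samples_to_trim_at_end
    return 0


# Frames of one selected substream, in bitstream order.
frames = [AudioFrameTrim(1, 960), AudioFrameTrim(1, 352),
          AudioFrameTrim(0), AudioFrameTrim(1, 0, 200)]
start_trim = leading_trim(frames)
end_trim = trailing_trim(frames)
```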
	

<center><img src="images/IAC Encapsulation Guideline.png" style="width:100%; height:auto;"></center>
<center><figcaption>IAC Encapsulation Scheme</figcaption></center>


### IA Sample Entry ### {#iasampleentry-section}

<pre class="def">
	Sample Entry Type: <dfn value export for="IASampleEntry">iamf</dfn>
	Container:         Sample Description Box ('stsd')
	Mandatory:         Yes
	Quantity:          One or more.
</pre>


The <dfn noexport>IASampleEntry</dfn> identifies that the track contains [=IA Samples=], and uses a single [=codec specific box=].

<b>Syntax</b>

```
class IASampleEntry extends AudioSampleEntry('iamf') {
  unsigned int (8) version;
  unsigned int (8) profile_version;
  CodecSpecificBox config;
}
```

No optional boxes of AudioSampleEntry shall be present.

<b>Semantics</b>

Both [=channelcount=] and [=samplerate=] fields of AudioSampleEntry shall be ignored.

version and profile_version shall be the same as [=version=] and [=profile_version=] in magic_code_obu, respectively.


### Codec Specific Box ### {#codecspecificbox-section}

This section describes a <dfn noexport>codec specific box</dfn>, which carries the decoding parameters needed to decode a single substream of an IA sequence and whose type is determined by codec_id of audio_substream_config(). <code>iamf</code> shall contain exactly one codec specific box regardless of the number of substreams in the IA sequence, so the codec specific box applies to all substreams in the sample data.

#### OPUS Specific Box #### {#codecspecificbox-opus}

This shall be [=OpusSpecificBox=] ('dOps') for the 'opus' AudioSampleEntry which is specified in [[!OPUS-IN-ISOBMFF]].

<pre class="def">
	Box Type:  <dfn export>dOps</dfn>
	Container: IA Sample Entry ('iamf')
	Mandatory: Yes
	Quantity:  One
</pre>


This box shall be for one single substream.

<b>Syntax</b>

It shall be the same as the 'dOps' box for 'opus', except that [=ChannelMappingFamily=] shall be set to 0.

<b>Semantics</b>

It shall be the same as the semantics in [[!OPUS-IN-ISOBMFF]], with the following exceptions:
- [=OutputChannelCount=] should be set to 2. [=OutputChannelCount=] can be ignored because the real value can be determined from the Audio Element OBU and from the [=opus packet=] header.
- In case of [=num_layers=] > 1, [=OutputGain=] shall be set to 0.
- [=ChannelMappingFamily=] shall be set to 0.

#### MP4A Specific Box #### {#codecspecificbox-mp4a}

This shall be [=ESDBox=] ('esds') for 'mp4a' which is specified in [[!MP4]].


<pre class="def">
	Box Type:  <dfn export>esds</dfn>
	Container: IA Sample Entry ('iamf')
	Mandatory: Yes
	Quantity:  One or more
</pre>


This box shall be for one single Substream.

<b>Syntax</b>

It shall be the same as the 'esds' box for [=Low Complexity Profile=] of [[!AAC]] (AAC-LC).

<b>Semantics</b>

It shall be the same as the semantics in 'esds', with the following exception:
- [=channelConfiguration=] field should be set to 2. The real value can be implied from the Audio Element OBU.

ISSUE: We need to add specific boxes for FLAC and LPCM.

### IA Sample Format ### {#iasampleformat}

For tracks using the [=IASampleEntry=], an <dfn noexport>IA Sample</dfn> has the following constraints:
- Each sample shall consist of the remaining OBUs of one temporal unit after the processing specified in [[#isobmff-singletrack-basicencapsulationscheme]].

### IA Sample Group ### {#iasamplegroup}

#### Global Descriptor Sample Group #### {#iasamplegroup-globaldescriptor}

During the encapsulation process, global descriptors shall be discarded from the IA sequence. A new sample group for global descriptors shall be defined by using the 'sgpd' and 'sbgp' boxes with the following requirements:
- [=grouping_type=] shall be set to <dfn noexport>iagd</dfn>.
- [=SampleGroupDescriptionEntry=] shall consist of the Mix Presentation OBUs followed by the Audio Element OBUs, with OBU syntax.

#### Demixing Info Sample Group #### {#iasamplegroup-demixing}

During the encapsulation process, the Parameter Block OBU for demixing_info_parameter_data() shall be discarded from the IA sequence. A new sample group for demixing_info_parameter_data() shall be defined by using the 'sgpd' and 'sbgp' boxes with the following requirements:
- [=grouping_type=] shall be set to <dfn noexport>demi</dfn>.
- Each [=SampleGroupDescriptionEntry=] shall be a Parameter Block OBU for demixing_info_parameter_data(), with OBU syntax.


## Common Encryption ## {#CommonEncryption}
TBA

## Codecs Parameter String ## {#codecsparameter}
DASH and other applications require defined values for the 'Codecs' parameter specified in [[!RFC6381]] for ISO Media tracks. The codecs parameter string for the AOM IA codec shall be:
- For IAC-OPUS

```
	iamf.IAC-specific-needs.Opus
```

- For IAC-AAC-LC

```
	iamf.IAC-specific-needs.mp4a.40.2
```

- For IAC-FLAC

```
	iamf.IAC-specific-needs.fLaC
```

- For IAC-LPCM

```
	iamf.IAC-specific-needs.lpcm
```

<b>IAC-specific-needs</b> shall be <b>V.PV</b> as follows:
- <dfn noexport>V</dfn> shall be four digits and shall represent the version of IA sequence.
	- The first two digits shall represent the major version within the range 0 to 15.
	- The second two digits shall represent the minor version within the range 0 to 15.
- <dfn noexport>PV</dfn> shall be four digits and shall represent the profile version of IA sequence.
	- The first <b>P</b> shall be two digits and shall represent the profile major version within the range 0 to 15.
	- The second <b>V</b> shall be two digits and shall represent the profile minor version within the range 0 to 15.

For example, for this version of the specification:
- The codecs parameter string of IAC-OPUS for the simple profile:

```
	iamf.0000.0000.Opus
```

- The codecs parameter string of IAC-AAC-LC for the base profile:

```
	iamf.0000.0100.mp4a.40.2
```
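
The construction of the codecs parameter string can be sketched as follows. The helper name `iamf_codecs_string` is hypothetical, and the sketch assumes that each two-digit field is the zero-padded decimal representation of its value, which is consistent with the two examples above but is not stated explicitly.

```python
def iamf_codecs_string(version, profile_version, codec):
    """Build 'iamf.V.PV.<codec>' from (major, minor) integer pairs."""
    for value in (*version, *profile_version):
        if not 0 <= value <= 15:
            raise ValueError("version components must be within 0 to 15")
    # Assumption: two zero-padded decimal digits per component.
    v = "{:02d}{:02d}".format(*version)
    pv = "{:02d}{:02d}".format(*profile_version)
    return "iamf.{}.{}.{}".format(v, pv, codec)


simple_opus = iamf_codecs_string((0, 0), (0, 0), "Opus")
base_aac = iamf_codecs_string((0, 0), (1, 0), "mp4a.40.2")
```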

# ISOBMFF IAC Decapsulation # {#isobmff-decapsulation}

## ISOBMFF IAC Decapsulation with single track ## {#isobmff-decapsulation-singletrack}

This section provides a guideline for an IAC parser to reconstruct IA sequences from an IAC file.

When the IAC parser feeds the reconstructed IA sequences to the IAC-OBU parser, the descriptor OBUs shall be placed first, followed by the Temporal Units.

The figure below mirrors the encapsulation scheme of the IA sequence specified in [[#isobmff]].

<center><img src="images/IAC Decapsulation Guideline.png" style="width:100%; height:auto;"></center>
<center><figcaption>IAC Decapsulation Guideline</figcaption></center>


During the decapsulation process, the IAC file is decapsulated into IA sequences which conform to [[#obu-syntax]] as follows:
- Step1: Reconstruction of the descriptor OBUs (one Magic Code OBU, one Codec Config OBU, one or more Mix Presentation OBUs and one or more Audio Element OBUs) for the ith IA sequence.
	- [Step1-1] Magic Code OBU: take the version and profile_version fields from the <code>iamf</code> sample entry and packetize them as an OBU with [=ia_code=] and the pre-fixed header value (i.e. 0xF006).
	- [Step1-2] Codec Config OBU: generate [=codec_id=] and [=decoder_config()=] from the CodecSpecificBox of the <code>iamf</code> sample entry, take num_samples_per_frame from the 'stts' box and roll_distance from [=AudioPreRollEntry=], and packetize them as an OBU with obu_type = OBU_IA_Codec_Config.
	- [Step1-3] Mix Presentation OBUs and Audio Element OBUs: take the ith SampleGroupDescriptionEntry as it is from the SampleGroup with grouping_type = [=iagd=].
	- [Step1-4] Determine, from the SampleGroup, the offset (i1) and number (im) of the Samples to which the ith SampleGroupDescriptionEntry applies.
- Step2: Prepare Sync_OBU with obu_type = OBU_IA_Sync.
- Step3: Reconstruction of the jth Temporal Unit of the ith IA sequence (j = i1, i2, …, im).
	- [Step3-1] If there is a SampleGroup with grouping_type = [=demi=], take the parameter block OBU for the demixing_info and the jth Sample. Otherwise, take the jth Sample as it is.
		- Parameter block OBU for demixing_info: take the SampleGroupDescriptionEntry mapped to the jth Sample as it is from the SampleGroup with grouping_type = [=demi=].
	- [Step3-2] Place Sync_OBU at the front of the result of Step3-1 without gap to reconstruct the jth Temporal Unit.
- Step4: Place the descriptor OBUs, followed by the Sync OBU, followed by the Temporal Units in order (j = i1, i2, …, im) without gap, to reconstruct the ith IA sequence.

[=codec_id=] and [=decoder_config()=] for IAC-OPUS are generated as follows:
- The syntax and values shall conform to [=ID Header=] of [[!RFC7845]] with the following constraints:
	- [=OutputChannelCount=], [=PreSkip=], [=InputSampleRate=], [=OutputGain=] and [=ChannelMappingFamily=] are copied from [=dOps=] box.
	

[=codec_id=] and [=decoder_config()=] for IAC-AAC-LC are generated as follows:
- [=codec_id=]: 'mp4a'
- [=decoder_config()=] is generated from [=DecoderConfigDescriptor()=] of [=esds=] box.


# IAC processing # {#processing}

This section provides a guideline for IA decoding of a given [=IA sequence=].


IA decoding can be done by combining the following processing steps:
- Decoding of a scene-based audio element (Ambisonics decoding)
- Decoding of a channel-based audio element (Scalable Channel Audio decoding)
- Rendering and mixing of each audio element before mixing of multiple audio elements.
	- It may include re-sampling of each audio element.
- Mixing of multiple audio elements with synchronization
- Post processing such as Loudness, DRC and Limiter.

<b>Ambisonics decoding</b> shall conform to [[!RFC8486]], except for codec-specific processing, and shall output Ambisonics channels in ACN (Ambisonics Channel Number) order.

<b>Scalable Channel Audio decoding</b> shall output the channel audio (e.g. 3.1.2ch or 7.1.4ch) for the target channel layout.

The IA decoder is composed of an OBU parser, Codec decoders, an Audio Element Renderer and a Post-processor, as depicted in the figure below.
- The OBU parser shall depacketize the IA sequence and output one or more substreams, one or more decoder_config() (one decoder_config() per audio element), descriptors and parameters.
- The Codec decoder for each substream shall output decoded channels.
- The Audio Element Renderer reconstructs audio channels from the decoded channels of the Codec decoders according to the type of audio element specified in the Audio Element OBU.
	- For a scene-based audio element, it shall output Ambisonics channels.
	- For a channel-based audio element, it shall output audio channels for the given loudspeaker layout.
- The Post-processor outputs audio channels according to the target loudspeaker layout after optional rendering, mixing and post processing such as DRC, Loudness and Limiter.
	- For a given scene-based audio element, one of the mix presentations shall be used to render it.
	- To mix multiple given audio elements, one of the mix presentations shall be used to render each of them.
	
<center><img src="images/IA Decoder Configuration.png" style="width:100%; height:auto;"></center>
<center><figcaption>IA Decoder Configuration</figcaption></center>

## Ambisonics decoding ## {#processing-ambisonics}

This section describes the decoding of Ambisonics.

The figure below shows the flowchart of Ambisonics decoding.
- The OBU parser shall output the substreams for the scene-based audio element in the IA sequence.
	- The OBU parser shall output [=channel_mapping=] or [=demixing_matrix=], according to [=ambisonics_mode=], to the Channel_Mapping/Demixing_Matrix module.
- The Codec decoder shall output as many decoded channels (PCM) as [=output_channel_count=], in the transmission order, after decoding each Substream.
- The Channel_Mapping/Demixing_Matrix module shall apply channel_mapping or demixing_matrix, according to [=ambisonics_mode=], to the decoded channels (PCM) and shall output as many channels as [=output_channel_count=] in ACN order.
- Ambisonics to Channel Format module may convert the output channels to channel audio according to the target loudspeaker layout.

<center><img src="images/Ambisonics Decoding Flowchart.png" style="width:80%; height:auto;"></center>
<center><figcaption>Ambisonics Decoding Flowchart</figcaption></center>

## Scalable Channel Audio decoding ## {#processing-scalablechannelaudio}

This section describes the decoding of Scalable Channel Audio.

The figure below shows the flowchart of Scalable Channel Audio decoding.

<center><img src="images/Channel Audio Decoding Flowchart.png" style="width:80%; height:auto;"></center>
<center><figcaption>Scalable Channel Audio Decoding Flowchart</figcaption></center>

For a given loudspeaker layout (i.e. CL #i) among the list of [=loudspeaker_layout=] in scalable channel layout config,
- OBU Parser shall get substreams for ChannelGroup #1 ~ ChannelGroup #i and pass them to Codec decoder with [=Decoder_Config()=].
- Codec decoder shall output decoded channels (PCM) in the transmission order.
	- For non-scalable audio (i.e. i = 1), its order shall be converted to the loudspeaker location order for CL #1.
- The following processing is additionally applied for scalable audio (i.e. i > 1):
	- When Output_Gain_Is_Present_Flag(j) for ChannelGroup #j (j = 1, 2, …, i-1) is on, the Gain module shall apply Output_Gain(j) to all audio samples of the mixed channels in ChannelGroup #j indicated by Output_Gain_Flag(j).
	- The De-Mixer shall output de-mixed channels (PCM) for CL #i, generated by de-mixing the mixed channels from the Gain module using the non-mixed channels and the demixing parameters for each frame.
	- The Recon_Gain module shall output smoothed channels (PCM) by applying Recon_Gain to each frame of the de-mixed channels.
	- The order of the non-mixed channels and smoothed channels shall be converted to the loudspeaker location order for CL #i after passing through the necessary modules such as Gain, De-Mixer and Recon_Gain.
- The following processing may additionally be applied:
	- The Loudness normalization module may output loudness normalized channels at -24 LKFS from the non-mixed channels and the smoothed channels (if present) by using the loudness value for CL #i.
	- The DRC control module may apply the pre-defined DRC compression to the loudness normalized channels, after which it outputs loudness normalized channels at -16 LKFS.
	- The Limiter module may limit the true peak of the input channels at -1 dB.

The following sections, [[#processing-scalablechannelaudio-gain]], [[#processing-scalablechannelaudio-demixer]] and [[#processing-scalablechannelaudio-recongain]], are only needed for decoding scalable audio with [=num_layers=] > 1.

### Gain ### {#processing-scalablechannelaudio-gain}

The Gain module is the mirror process of the Attenuation module. It recovers the reduced sample values using Output_Gain when its flag for ChannelGroup #j is on. When the flag is off, this module shall be bypassed for ChannelGroup #j. Output_Gain(j) for ChannelGroup #j shall be applied to all samples of the mixed channels in ChannelGroup #j, where the mixed channels are the channels mixed from an input channel audio (i.e. a channel audio for CL #n).

To apply the gain, an implementation MUST use the following:

```
	Sample *= pow(10, Output_Gain(j) / (20.0*256))
```

where Output_Gain(j) is the raw 16-bit value for the jth layer, as specified in [=channel_audio_layer_config()=].
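
As an illustration, the gain formula can be sketched in Python as below. The function names are hypothetical, and the sketch assumes that the raw 16-bit value is a signed (two's complement) integer, as is the case for the output gain field of [[!RFC7845]]; this specification excerpt does not state the signedness.

```python
def output_gain_factor(output_gain_raw):
    """Convert a raw 16-bit Output_Gain value to a linear scale factor."""
    # Assumption: values 0x8000..0xFFFF encode negative gains (two's complement).
    if output_gain_raw >= 0x8000:
        output_gain_raw -= 0x10000
    return 10.0 ** (output_gain_raw / (20.0 * 256))


def apply_output_gain(samples, output_gain_raw):
    """Scale every sample of a mixed channel by the Output_Gain factor."""
    factor = output_gain_factor(output_gain_raw)
    return [s * factor for s in samples]


# A raw value of 20 * 256 corresponds to +20 dB, i.e. a factor of 10.
plus_20_db = output_gain_factor(20 * 256)
```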

### De-mixer ### {#processing-scalablechannelaudio-demixer}

For scalable channel audio with [=num_layers=] > 1, some channels of [=down-mixed audio=] for CL #i are delivered as is but the rest are mixed with other channels for CL #i-1.

De-mixer module reconstructs the rest of the down-mixed audio for CL #i from the mixed channels, which is passed by Gain module, and its relevant non-mixed channels using its relevant demixing parameters.

De-mixing for down-mixed audio for CL #i shall comply with the result by the combination of following surround and top de-mixers:
- Surround de-mixers
	- <dfn noexport>S1to2 de-mixer</dfn>: R2 = 2 x Mono – L2
	- <dfn noexport>S2to3 de-mixer</dfn>: L3 = L2 – 0.707 x C and R3 = R2 – 0.707 x C
	- <dfn noexport>S3to5 de-mixer</dfn>: Ls = 1/δ(k) x (L3 – L5) and Rs = 1/δ(k) x (R3 – R5)
	- <dfn noexport>S5to7 de-mixer</dfn>: Lrs = 1/β(k) x (Ls – α(k) x Lss) and Rrs = 1/β(k) x (Rs – α(k) x Rss)
- Top de-mixers
	- <dfn noexport>TF2toT2 de-mixer</dfn>: Ltf2 = Ltf3 – w(k) x (L3 – L5) and Rtf2 = Rtf3 – w(k) x (R3 – R5)
	- <dfn noexport>T2to4 de-mixer</dfn>: Ltb = 1/γ(k) x (Ltf2 – Ltf4) and Rtb = 1/γ(k) x (Rtf2 – Rtf4)
- Where Ltf2 / Rtf2 are the top channels of x.1.2ch, Ltf3 / Rtf3 are the top channels of 3.1.2ch, Ltf4 / Rtf4 are the top channels of x.1.4ch (x = 5 or 7), and w(k) is determined from the value of wIdx(k).

Initially, wIdx(0) = 0 and the value of wIdx(k) shall be derived as follows:
- <dfn noexport>wIdx(k)</dfn> = Clip3(0, 10, wIdx(k-1) + w_idx_offset(k))

Mapping of wIdx(k) to w(k) should be as follows:
<pre class = "def">
 wIdx(k) :   w(k)
    0    :    0
    1    :  0.0179
    2    :  0.0391
    3    :  0.0658
    4    :  0.1038
    5    :  0.25
    6    :  0.3962
    7    :  0.4342
    8    :  0.4609
    9    :  0.4821
    10    : 0.5
</pre>
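
The update of wIdx(k) and the lookup of w(k) can be sketched as follows. The names `W_TABLE`, `clip3` and `next_w` are illustrative only; `clip3` follows the usual definition of Clip3 (clamp the third argument to the range given by the first two).

```python
# w(k) values indexed by wIdx(k), copied from the mapping table above.
W_TABLE = (0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
           0.3962, 0.4342, 0.4609, 0.4821, 0.5)


def clip3(lo, hi, x):
    """Clamp x to the inclusive range lo..hi."""
    return lo if x < lo else hi if x > hi else x


def next_w(prev_w_idx, w_idx_offset):
    """Return (wIdx(k), w(k)) given wIdx(k-1) and w_idx_offset(k)."""
    w_idx = clip3(0, 10, prev_w_idx + w_idx_offset)
    return w_idx, W_TABLE[w_idx]


# Track wIdx over a few frames, starting from wIdx(0) = 0.
w_idx = 0
history = []
for offset in (1, 1, -1, -1, -1):
    w_idx, w = next_w(w_idx, offset)
    history.append((w_idx, w))
```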

When D_set = { x | S1 < x ≤ Si and x is an integer},
- If 2 is an element of D_set, the combination shall include [=S1to2 de-mixer=].
- If 3 is an element of D_set, the combination shall include [=S2to3 de-mixer=].
- If 5 is an element of D_set, the combination shall include [=S3to5 de-mixer=].
- If 7 is an element of D_set, the combination shall include [=S5to7 de-mixer=].

When Ti = 2,
- If Sj = 3 (j=1,2,…, i-1), the combination shall include [=TF2toT2 de-mixer=].

When Ti = 4,
- If Sj = 3 (j=1,2,…, i-1), the combination shall include [=TF2toT2 de-mixer=] and [=T2to4 de-mixer=].
- Else if Tj = 2 (j=1,2,…, i-1), the combination shall include [=T2to4 de-mixer=].
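
The selection rules above can be sketched as follows. This is an illustrative reading, not a normative one: the function name `select_demixers` and the representation of each channel layout as an (S, T) pair of surround and top channel counts are hypothetical, and conditions such as "Sj = 3 (j=1,2,…, i-1)" are interpreted as "at least one lower layout has Sj = 3".

```python
def select_demixers(layouts, i):
    """Return the de-mixer names needed to reconstruct CL #i.

    layouts -- (S, T) pairs for CL #1 .. CL #n, e.g. (5, 2) for 5.1.2ch
    i       -- 1-based index of the target channel layout
    """
    s1 = layouts[0][0]
    si, ti = layouts[i - 1]
    lower = layouts[: i - 1]  # CL #1 .. CL #(i-1)
    chosen = []
    surround = {2: "S1to2", 3: "S2to3", 5: "S3to5", 7: "S5to7"}
    for count, name in surround.items():
        if s1 < count <= si:  # count is an element of D_set
            chosen.append(name)
    has_s3 = any(s == 3 for s, _ in lower)
    has_t2 = any(t == 2 for _, t in lower)
    if ti == 2 and has_s3:
        chosen.append("TF2toT2")
    elif ti == 4:
        if has_s3:
            chosen += ["TF2toT2", "T2to4"]
        elif has_t2:
            chosen.append("T2to4")
    return chosen


# CL #1 = 2ch, CL #2 = 3.1.2ch, CL #3 = 5.1.2ch, CL #4 = 7.1.4ch.
layouts = [(2, 0), (3, 2), (5, 2), (7, 4)]
for_5_1_2 = select_demixers(layouts, 3)
```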

For example, assume CL #1 = 2ch, CL #2 = 3.1.2ch, CL #3 = 5.1.2ch and CL #4 = 7.1.4ch. To reconstruct the rest (i.e. Ls5/Rs5/Ltf/Rtf) of the down-mixed 5.1.2ch:
- The combination includes [=S2to3 de-mixer=], [=S3to5 de-mixer=] and [=TF2toT2 de-mixer=].
- Ls5 and Rs5 are recovered by S2to3 de-mixer and S3to5 de-mixer.
- Ltf and Rtf are recovered by S2to3 de-mixer and TF2toT2 de-mixer.

```
	Ls5 = 1/δ(k) × (L2 - 0.707 × C - L5) and Rs5 = 1/δ(k) × (R2 - 0.707 × C - R5).
	Ltf = Ltf3 - w(k) x (L2 - 0.707 x C - L5) and Rtf = Rtf3 - w(k) x (R2 - 0.707 x C - R5).
```
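
The two equations above can be checked with a short per-sample sketch. The function name and argument names are hypothetical, and the forward down-mix used in the usage example is obtained simply by inverting the de-mixer equations of this section; it is not quoted from the down-mix mechanism itself.

```python
def demix_to_5_1_2(l2, r2, c, l5, r5, ltf3, rtf3, delta, w):
    """Recover Ls5, Rs5, Ltf and Rtf for one sample of each channel."""
    # S2to3 de-mixer
    l3 = l2 - 0.707 * c
    r3 = r2 - 0.707 * c
    # S3to5 de-mixer
    ls5 = (l3 - l5) / delta
    rs5 = (r3 - r5) / delta
    # TF2toT2 de-mixer
    ltf = ltf3 - w * (l3 - l5)
    rtf = rtf3 - w * (r3 - r5)
    return ls5, rs5, ltf, rtf


# Round trip: build the mixed channels from known 5.1.2ch samples, then de-mix.
delta, w, c = 0.707, 0.25, 0.1
l5, r5, ls5, rs5, ltf, rtf = 0.2, -0.3, 0.4, 0.5, 0.3, -0.1
l3, r3 = l5 + delta * ls5, r5 + delta * rs5
l2, r2 = l3 + 0.707 * c, r3 + 0.707 * c
ltf3, rtf3 = ltf + w * (l3 - l5), rtf + w * (r3 - r5)
recovered = demix_to_5_1_2(l2, r2, c, l5, r5, ltf3, rtf3, delta, w)
```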

### Recon Gain ### {#processing-scalablechannelaudio-recongain}

[=recon_gain=] shall be applied only to the audio samples of the de-mixed channels output by the De-mixer module.
- [=recon_gain_info_parameter_data()=] indicates each channel of CL #i which Recon_Gain needs to be applied to and provides Recon_Gain value for each frame of the channel.
	- Sample (k,i) *= Smoothed_Recon_Gain (k,i), where k is the frame index and i is the sample index of the frame.
	- Smoothed_Recon_Gain (k) = MA_gain (k-1) x e_window + MA_gain (k) x s_window
	- MA_gain (k) = 2 / (N+1) x Recon_Gain (k) / 255 + (1 – 2/(N+1)) x MA_gain (k-1), where MA_gain (0) = 1.
	- e_window[:ps – olen] = 1, e_window[ps – olen: ps] = hanning[olen:], e_window[ps:flen] = 0.
	- s_window[:ps – olen] = 0, s_window[ps – olen: ps] = hanning[:olen], s_window[ps:flen] = 1.
	- Where hanning = np.hanning (2*olen), ps is the pre-skip value, flen is the frame size and olen is the overlap size.
	- Recommended value: N = 7
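
The smoothing above can be sketched in plain Python as follows. The function names are hypothetical, `hanning` reproduces the formula of `numpy.hanning` so that the sketch has no dependencies, and ps is assumed to be at least olen.

```python
import math


def hanning(m):
    """Same values as numpy.hanning(m) for m > 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (m - 1)) for n in range(m)]


def recon_gain_windows(ps, olen, flen):
    """Return (e_window, s_window), each of length flen."""
    h = hanning(2 * olen)
    e_window = [1.0] * (ps - olen) + h[olen:] + [0.0] * (flen - ps)
    s_window = [0.0] * (ps - olen) + h[:olen] + [1.0] * (flen - ps)
    return e_window, s_window


def update_ma_gain(prev_ma_gain, recon_gain, n=7):
    """MA_gain(k) from MA_gain(k-1) and the 8-bit Recon_Gain(k)."""
    a = 2.0 / (n + 1)
    return a * recon_gain / 255.0 + (1.0 - a) * prev_ma_gain


def smoothed_recon_gain(prev_ma_gain, ma_gain, e_window, s_window):
    """Per-sample gain for frame k: cross-fade from MA_gain(k-1) to MA_gain(k)."""
    return [prev_ma_gain * e + ma_gain * s for e, s in zip(e_window, s_window)]


# Example with olen = 64, ps = 720 and a frame size of 1024 samples.
e_window, s_window = recon_gain_windows(ps=720, olen=64, flen=1024)
ma_prev = 1.0                               # MA_gain(0) = 1
ma_cur = update_ma_gain(ma_prev, 0)         # Recon_Gain(1) = 0
gains = smoothed_recon_gain(ma_prev, ma_cur, e_window, s_window)
```

Each sample of the frame is then multiplied by the gain at the same index.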

The figure below shows the smoothing scheme of [=recon_gain=].

<center><img src="images/Smoothing Scheme of Recon Gain.png" style="width:100%; height:auto;"></center>
<center><figcaption>Smoothing Scheme of Recon Gain</figcaption></center>

Recommended values for specific codecs are as follows:
- IAC-OPUS: olen = 60, the pre-skip (ps) value is indicated in Codec_Specific_Info for IAC-OPUS.
- IAC-AAC-LC: olen = 64, ps = 720.

## Mix Presentation ## {#processing-mixpresentation}

An IA sequence may contain more than one mix presentation. [[#processing-mixpresentation-selection]] details how a mix presentation should be selected when more than one is present.

A mix presentation specifies how to render, process and mix one or more audio elements. Each audio element should first be individually rendered and processed before mixing. Then, any additional processing specified by [=output_mix_config()=] should be applied to the mixed audio signal in order to generate the final output audio for playback. [[#processing-mixpresentation-rendering]] details how each audio element should be rendered, while [[#processing-mixpresentation-mixing]] details how the audio elements should be processed and mixed.

### Selecting a Mix Presentation ### {#processing-mixpresentation-selection}

An IA sequence may contain more than one mix presentation. The IA parser should select the appropriate mix presentation in the following order.

1. If there are any user-selectable mixes, the IA parser should select the mix, or mixes, that match the user's preferences. An example might be a mix with a specific language. Mix presentations may use [=mix_presentation_friendly_label=] to describe such mixes.
2. If more than one valid mix remains, the IA parser should select an appropriate mix for rendering, in the following order.
	1. If the playback layout is binaural, i.e. headphones:
		1. Select the mix with [=audio_element_id=] whose [=loudspeaker_layout=] is BINAURAL.
		2. If there is no such mix, select the mix with the highest available [=loudness_layout=].
	2. If the playback layout is loudspeakers:
		1. If there is a mix with a [=loudness_layout=] that matches the playback loudspeaker layout, it should be selected. If more than one mix matches, the first one should be selected.
		2. If there is no such mix, select the mix presentation with the highest available [=loudness_layout=].

### Rendering an Audio Element ### {#processing-mixpresentation-rendering}

After selecting a Mix Presentation, an audio element should be rendered as follows:
- If the selected Mix Presentation OBU includes only one single channel-based audio element, do M2M-Rendering (M2M-Rendering: Multichannel to Multichannel Rendering).
- If the selected Mix Presentation OBU includes only one single scene-based audio element, do A2M-Rendering (A2M-Rendering: Ambisonics to Multichannel Rendering).
- If the selected Mix Presentation OBU includes multiple audio elements:
	- If the audio element is channel-based, then it should follow M2M-Rendering.
	- If the audio element is scene-based, then it should follow A2M-Rendering.
	
For M2M-Rendering,
- The input layout of the IA renderer
	- If [=num_layers=] = 1, use the [=loudspeaker_layout=] of the audio element.
	- Else, use the layout that matches the playback layout, or is the next highest available layout.
- The IA renderer
	- If the playback layout matches a [=loudspeaker_layout=] which can be generated from the highest loudspeaker layout of the audio element according to [[#iacgeneration-scalablechannelaudio-channellayoutgenerationrule]], use demixing_info_parameter_data().
		- If demixing_info_parameter_data() is not delivered, use EAR Direct Speakers renderer ([[!ITU2127-0]]).
	- Else if the playback layout complies with loudspeaker layouts supported by [[!ITU2051-3]], use EAR Direct Speakers renderer ([[!ITU2127-0]]).
	- Else, use implementation-specific renderer.
- The output of the IA renderer: the playback layout
 
For A2M-Rendering,
- The input layout of the IA renderer: Ambisonics
- The IA renderer
	- If the playback layout complies with loudspeaker layouts supported by [[!ITU2051-3]], use EAR HOA renderer ([[!ITU2127-0]]).
	- Else, use implementation-specific renderer.
		- If there is no implementation-specific Ambisonics renderer, use libear to render to the next highest BS2051 layout compared to the playback layout, and then downmix using implementation-specific renderer.
- The output of the IA renderer: the playback layout

This specification supports the rendering of either a multichannel or ambisonics audio element to either a target loudspeaker layout or a binaural output. The choice of the renderer to use when processing a mix presentation depends on the input audio element and the playback layout, as defined in the table below.

<table class="def">
<tr>
  <th>audio_element_type</th><th>Playback layout</th><th>Renderer to use</th>
</tr>
<tr>
  <td>CHANNEL_BASED</td><td>Loudspeaker layouts supported by [[!ITU2051-3]]</td><td>EAR Direct Speakers renderer ([[!ITU2127-0]]).</td>
</tr>
<tr>
  <td>CHANNEL_BASED</td><td>Loudspeaker layouts not supported by [[!ITU2051-3]]</td><td>Implementation-specific loudspeaker renderer.</td>
</tr>
<tr>
  <td>SCENE_BASED</td><td>Loudspeaker layouts supported by [[!ITU2051-3]]</td><td>EAR HOA renderer ([[!ITU2127-0]]).</td>
</tr>
<tr>
  <td>SCENE_BASED</td><td>Loudspeaker layouts not supported by [[!ITU2051-3]]</td><td>Implementation-specific Ambisonics renderer.<br><br>If an implementation-specific Ambisonics renderer is not available, the EAR HOA renderer may be used to render the Ambisonics audio element to the next highest [[!ITU2051-3]] layout compared to the output loudspeaker layout, and the result then downmixed using the implementation-specific loudspeaker renderer.</td>
</tr>
<tr>
  <td>CHANNEL_BASED</td><td>Binaural</td><td>// TODO</td>
</tr>
<tr>
  <td>SCENE_BASED</td><td>Binaural</td><td>// TODO</td>
</tr>
</table>


#### EAR Direct Speakers renderer #### {#processing-mixpresentation-rendering-ear-directspeakers}

In addition to the metadata provided in itur_bs2127_direct_speakers_config(), the IA renderer should provide the following to the EAR Direct Speakers renderer for each audio channel of the audio element:

- speaker label: the label of the speaker position, using the same convention as "SP Label" in [[!ITU2051-3]]. This is defined for each audio channel of the audio element based on the information from [=loudspeaker_layout=].
- azimuth: specifies the azimuth location of the sound. This is mapped from the speaker label as defined in [[!ITU2051-3]].
- elevation: specifies the elevation location of the sound. This is mapped from the speaker label as defined in [[!ITU2051-3]].

In [[!ITU2051-3]], an LFE audio channel may be identified either by an explicit label or its frequency content. In this specification, the LFE channel is identified based on the explicit label only, given by [=loudspeaker_layout=].

#### EAR HOA renderer #### {#processing-mixpresentation-rendering-ear-hoa}

The IA renderer should provide the following metadata to the EAR HOA renderer for each audio channel:

1. Ambisonics order
2. Ambisonics degree
3. Ambisonics normalization method

In this specification, the AmbiX format is adopted, which uses SN3D normalization and ACN channel ordering. Accordingly, the Ambisonics order and degree can be computed from the channel index k as follows:

```
order   n = floor(sqrt(k)),
degree  m = k - n * (n + 1).
```
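
A minimal sketch of this computation is given below; the function name `acn_to_order_degree` is illustrative only. An integer square root is used so that the result is exact for any non-negative channel index.

```python
import math


def acn_to_order_degree(k):
    """Return (order n, degree m) for ACN channel index k >= 0."""
    n = math.isqrt(k)        # floor(sqrt(k)) without floating-point rounding
    m = k - n * (n + 1)
    return n, m


# The first nine ACN channels cover orders 0, 1 and 2.
first_nine = [acn_to_order_degree(k) for k in range(9)]
```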

### Mixing Audio Elements ### {#processing-mixpresentation-mixing}

Each audio element is processed individually before mixing as follows:
1. Render to the playback layout.
2. If all audio elements do not have a common sample rate, re-sample to 48 kHz.
3. If all audio elements do not have a common bit-depth, convert to a common bit-depth. This specification recommends using 16 bits.
4. If [=loudness_layout=] matches with the playback layout, apply any per-element processing according to [=element_mix_config()=]. Otherwise, apply any per-element processing according to implementation-specific element_mix.

The rendered and processed audio elements are then summed, and [=output_mix_config()=] is applied to the sum to generate one sub-mixed audio signal. If there is more than one sub-mix, the outputs of the sub-mixes are further summed to generate the final mixed audio signal.


## Animated Parameters ## {#processing-animated-params}

This section describes how a set of parameters is animated over a segment in a parameter block, using the information provided in [=AnimatedParameterData()=].

Let P0, P1 and P2 be 2D coordinates defined as

```
P0 = (t_start, start_point_value),
P1 = (t_control, control_point_value),
P2 = (t_end, end_point_value),
```

where t_start is the segment start time, t_end is the segment end time and t_control is the control point time given by

```
t_control = t_start + (t_end - t_start) * control_point_relative_time.
```

If [=animation_type=] is equal to STEP, the parameter value provided by [=start_point_value=] should be applied immediately to all samples of the segment.

If [=animation_type=] is equal to LINEAR, the parameter value is linearly interpolated between [=start_point_value=] and [=end_point_value=] as follows:

```
B_linear(a) = (1 - a) * P0 + a * P2,
0 <= a <= 1.
```

If [=animation_type=] is equal to BEZIER, the parameter value is interpolated following a quadratic Bezier curve between [=start_point_value=] and [=end_point_value=] as follows:

```
B_quad(a) = (1 - a)^2 * P0 + 2 * (1 - a) * a * P1 + a^2 * P2,
0 <= a <= 1.
```
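
The following Python sketch evaluates the formulas above for a given curve parameter a. It is illustrative only: the function names are hypothetical, points are plain (time, value) tuples, and the step of finding the a that corresponds to a particular sample time is left out because it is not described in this section.

```python
def control_time(t_start, t_end, control_point_relative_time):
    """t_control = t_start + (t_end - t_start) * control_point_relative_time."""
    return t_start + (t_end - t_start) * control_point_relative_time


def _check_a(a):
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must satisfy 0 <= a <= 1")


def b_linear(a, p0, p2):
    """Linear interpolation between P0 and P2, returned as (time, value)."""
    _check_a(a)
    return tuple((1 - a) * u + a * v for u, v in zip(p0, p2))


def b_quad(a, p0, p1, p2):
    """Quadratic Bezier curve through P0 and P2 with control point P1."""
    _check_a(a)
    return tuple(
        (1 - a) ** 2 * u + 2 * (1 - a) * a * v + a ** 2 * w
        for u, v, w in zip(p0, p1, p2)
    )


# Example segment: time runs from 0 to 10, the parameter value from 0.0 to 1.0.
p0 = (0.0, 0.0)
p1 = (control_time(0.0, 10.0, 0.5), 1.0)
p2 = (10.0, 1.0)
print(b_linear(0.5, p0, p2), b_quad(0.5, p0, p1, p2))
```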

## Post Processing ## {#processing-post}

### Loudness Normalization ### {#processing-post-loudness}

Loudness normalization should be done by adjusting the loudness level to a target value, using the integrated loudness and true peak information provided in loudness_info(). If the true peak information is not available, the digital peak information should be used.

The rendered layouts that were used to measure the loudness information of a sub-mix are provided by [=loudness_layout=]s.

If one of them matches the playback layout, the loudness information should be used directly for normalization. If there is a mismatch between [=loudness_layout=] and the playback layout, the implementation may choose to use the provided loudness information of the highest [=loudness_layout=] as-is. 

If there is more than one selected loudness_info() specified in the mix presentation (i.e. in case of multiple sub-mixes), the implementation should normalize the loudness of each sub-mix independently before summing them.

### Limiter ### {#processing-post-limiter}

The limiter should limit the true peak of the audio signal at -1 dBTP, where true peak is defined in [[!ITU1770-4]]. The limiter should apply to multichannel signals in a linked manner and further support auto-release.


## Down-mix Matrix ## {#processing-downmixmatrix}


### Dynamic Down-mix Matrix ### {#processing-downmixmatrix-dynamic}

This section recommends dynamic down-mixing matrices.

The dynamic down-mixing matrices shall comply with the down-mixing mechanism specified in [[#iacgeneration-scalablechannelaudio-downmixmechanism]].


# IAC Generation Process # {#iacgeneration}

This section provides a guideline for IA encoding for a given input audio format.

The recommended input audio format for IA encoding is as follows:
- Ambisonics format: It shall conform to [=ChannelMappingFamily=] = 2 or 3 of [[RFC8486]].
- Channel Audio format: It shall conform to [=loudspeaker_layout=] specified in channel_audio_layer_config().
- Input Sampling Rate: 48000 Hz
- Bit-depth: 16 bits or 24 bits
	- 16 bits are recommended for IAC-OPUS.
- Input file format: .wav file (Linear PCM, simply referred to as PCM)

For a given input audio and user inputs, the IA encoder shall output an [=IA sequence=] that conforms to [[#obu-syntax]].

Input audio shall be one of the following:
- Ambisonics format
- Channel Audio format

User inputs are:
- Ambisonics mode, to indicate whether [=ChannelMappingFamily=] = 2 or 3 of [[RFC8486]] is used.
- List of channel layouts to be supported for scalable channel audio: it shall conform to [=loudspeaker_layout=].

IA encoding can be done by combining the following generation processes:
- Encoding of an audio element (Ambisonics encoding or Scalable Channel Audio encoding)
- Encoding of mix presentation

The figure below shows the IA encoder configuration for a single audio element.

The IA encoder is composed of a Pre-processor, a Codec encoder and an OBU packetizer.
- The Pre-processor outputs one or more ChannelGroups, descriptors and optional parameter blocks based on the input audio and user inputs.
	- It outputs a single ChannelGroup for a scene-based audio element.
	- It outputs one or more ChannelGroups for a channel-based audio element.
	- It outputs descriptors, which are composed of one Magic Code, one Codec Config, one Audio Element config and one or more Mix Presentation configs.
	- It may output parameter blocks.
		- For a channel-based audio element with [=num_layers=] = 1, it may output parameter blocks for demixing info.
		- For a channel-based audio element with [=num_layers=] > 1, it outputs parameter blocks for demixing_info_parameter_data and recon_gain_info_parameter_data.
		- It may further output parameter blocks for post processing such as Loudness and DRC control.
- The Codec encoder generates one or more substreams from each ChannelGroup based on the Codec Config.
	- Only mono or stereo coding shall be allowed.
		- Channel Audio format: each pair of coupled channels in the same ChannelGroup shall be coded in stereo mode to generate a single substream, and each non-coupled channel in the same ChannelGroup shall be coded in mono mode to generate a single substream.
			- <dfn noexport>Coupled channels</dfn>: L/R, Ls/Rs, Lss/Rss, Lrs/Rrs, Ltf/Rtf, Ltb/Rtb
			- <dfn noexport>Non-coupled channels</dfn>: C, LFE, L
- The OBU packetizer packetizes descriptors, parameter blocks and audio frames into OBUs, and outputs an IA sequence.
	- The temporal unit generator generates a temporal unit for each frame from audio frame OBUs and parameter block OBUs (if present).

<center><img src="images/IA Encoder Configuration.png" style="width:100%; height:auto;"></center>
<center><figcaption>IA Encoder Configuration</figcaption></center>

The order of substreams in each ChannelGroup shall be as follows:
- In ChannelGroup for Ambisonics: The order shall conform to [[RFC8486]].
- In ChannelGroup for Scalable Channel Audio: The order shall conform to the following rules:
	- For IAMF-OPUS, IAMF-AAC-LC and IAMF-FLAC,
		- Coupled Substreams come first, followed by non-coupled Substreams.
		- Coupled Substreams for surround channels come first, followed by those for top channels.
		- Coupled Substreams for front channels come first, followed by those for side, rear and back channels.
		- Coupled Substreams for side channels come first, followed by those for rear channels.
		- The Center channel comes first, followed by LFE, followed by the other one.
		- Here, a <dfn noexport>non-coupled substream</dfn> is a substream coded from one of the non-coupled channels.
	- For IAMF-LPCM,
		- The order of substreams in ChannelGroup complies with "Loudspeaker Location Ordering" specified in [=loudspeaker_layout=].
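
The ordering rules for IAMF-OPUS, IAMF-AAC-LC and IAMF-FLAC can be sketched as a sort key. This is a non-normative illustration: the string labels and the function name `order_substreams` are hypothetical, and the relative order of Ltf/Rtf before Ltb/Rtb is an assumption, since the rules above do not order the top pairs among themselves.

```python
# Coupled pairs: surround before top; front, then side, then rear.
# Assumption: Ltf/Rtf precedes Ltb/Rtb (not stated by the rules above).
COUPLED_ORDER = ["L/R", "Ls/Rs", "Lss/Rss", "Lrs/Rrs", "Ltf/Rtf", "Ltb/Rtb"]
# Non-coupled channels: Center, then LFE, then the remaining one (L).
NON_COUPLED_ORDER = ["C", "LFE", "L"]


def order_substreams(substreams):
    """Return substream labels sorted by the ordering rules sketched above."""
    def key(label):
        if label in COUPLED_ORDER:
            return (0, COUPLED_ORDER.index(label))
        if label in NON_COUPLED_ORDER:
            return (1, NON_COUPLED_ORDER.index(label))
        raise ValueError(f"unknown substream label: {label}")

    return sorted(substreams, key=key)


print(order_substreams(["LFE", "Ltf/Rtf", "C", "L/R"]))
```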

## Ambisonics Encoding ## {#iacgeneration-ambisonics}

For Ambisonics encoding:
- The Pre-processor outputs one ChannelGroup and descriptors, and is composed only of the Meta generator.
	- The Meta generator generates descriptors based on the Ambisonics mode and the number of channels for Ambisonics.
		- [=ambisonics_mode=] shall be set to 0 for [=ChannelMappingFamily=] = 2 of [[RFC8486]] or 1 for [=ChannelMappingFamily=] = 3 of [[RFC8486]].
		- ambisonics_config is set as follows:
			- [=output_channel_count=], [=substream_count=] and [=coupled_substream_count=] shall be set to the number of channels for Ambisonics.
			- [=channel_mapping=] for [=ambisonics_mode=] = 0 is assigned according to the order of substreams in the ChannelGroup.
			- [=demixing_matrix=] for [=ambisonics_mode=] = 1 is assigned according to the order of substreams in the ChannelGroup.
- The Codec encoder outputs as many substreams as indicated by [=substream_count=].
- A temporal unit shall be composed of audio frame OBUs for the substreams.
	- It may be immediately preceded by a temporal delimiter OBU.
	- The order of substreams in the ChannelGroup shall be aligned with [=channel_mapping=] for [=ambisonics_mode=] = 0 or [=demixing_matrix=] for [=ambisonics_mode=] = 1.

## Scalable Channel Audio Encoding ## {#iacgeneration-scalablechannelaudio}

For Scalable Channel Audio encoding:
- The Pre-processor outputs one or more ChannelGroups, descriptors and parameter blocks. It is composed of the Down-mix parameter generator, Down-mixer, Loudness module, ChannelGroup generator, Attenuation module and Meta generator.
	- For non-scalable channel audio (i.e. [=num_layers=] = 1):
		- Parameter blocks for recon_gain_info_parameter_data are not generated.
		- Parameter blocks for demixing_info_parameter_data may be generated by implementers who want to recommend dynamic down-mixing on the decoder side.
		- The Down-mixer, ChannelGroup generator and Attenuation modules are not needed.
	- Down-mix parameter generator shall generate 5 down-mix parameters (α(k), β(k), γ(k), δ(k) and w(k)) by analyzing input channel audio.
	- Down-mixer shall generate down-mixed audios according to the list of channel layouts and the down-mix parameters.
	- Loudness module should output the loudness level ([=LKFS=]) of each down-mixed audio based on [[ITU1770-4]].
	- ChannelGroup generator shall transform the input channel audio to ChannelGroups for scalable channel audio with [=num_layers=] > 1 scalability by using the down-mix parameters and the list of channel layouts.
	- Attenuation module shall apply a gain to the transformed ChannelGroups to prevent clipping.
	- Meta generator generates descriptors, and parameter blocks for each frame.
		- Descriptors shall be set as follows:
			- [=num_layers=] shall be set to the number of channel layouts.
			- [=channel_audio_layer_config()=] shall be set as follows:
				- [=loudspeaker_layout=] shall be set to the ith entry of the list of channel layouts for the ith ChannelGroup.
				- [=output_gain_is_present_flag=] shall be set to 1 for the ith ChannelGroup if attenuation is applied to the mixed channels of the ith ChannelGroup. Otherwise, it shall be set to 0 for the ith ChannelGroup.
				- [=recon_gain_is_present_flag=] shall be set to 1 for the ith ChannelGroup if the preceding ChannelGroups have one or more mixed channels from the down-mixed audio for the ith channel layout. Otherwise, it shall be set to 0 for the ith ChannelGroup. In particular, when [=num_layers=] = 1, this flag shall be set to 0.
				- [=substream_count=] shall be set to the number of substreams composing the ith ChannelGroup.
				- [=coupled_substream_count=] shall be set to the number of coupled substreams among the substreams composing the ith ChannelGroup.
				- [=loudness=] shall be set to the loudness ([=LKFS=]) of the down-mixed audio for the ith channel layout for the ith ChannelGroup.
				- Each bit of [=output_gain_flags=] shall be set to 1 for the ith ChannelGroup if attenuation is applied to the relevant channel of the ith ChannelGroup. Otherwise, it shall be set to 0 for the ith ChannelGroup.
				- [=output_gain=] shall be set to the inverse of the gain that is applied to the channels indicated by output_gain_flags.
		- Parameter blocks can be composed of [=demixing_info_parameter_data()=] and [=recon_gain_info_parameter_data()=]. When [=recon_gain_is_present_flag=] = 0 for all ChannelGroups, recon_gain_info shall not be present in the IA sequence.
			- [=dmixp_mode=] of demixing_info_parameter_data for the kth frame shall be set to indicate (α(k), β(k), γ(k), δ(k)) and w_idx_offset(k), where w_idx_offset(k) = 1 or -1.
			- [=recon_gain_flags=] of recon_gain_info_parameter_data shall be set to indicate the de-mixed channels to which [=recon_gain=] needs to be applied, among the output channels after demixing for the ith channel layout.
			- [=recon_gain=] shall be set to the gain value to be applied to the channel indicated by recon_gain_flags for the ith ChannelGroup.
- The temporal unit for the kth frame shall be composed of audio frame OBUs for the kth frames of the substreams, followed by zero or more parameter block OBUs.
	- It may be immediately preceded by a temporal delimiter OBU.
	- ChannelGroups in a temporal unit shall be placed in order. In other words, the ChannelGroup for the first channel layout shall come first, followed by the ChannelGroup for the second channel layout, followed by the ChannelGroup for the third channel layout and so on.

The figure below shows the IA encoding flowchart for Scalable Channel Audio.
- For a given Channel Audio and a given list of channel layouts for scalability, PCMs for the Channel Audio are passed to the CG Generation module.
- The CG Generation module generates the transformed audio according to the CG generation rule based on the list of CLs and the down-mix parameters.
	- The transformed audio is structured as ChannelGroups.
- Non-mixed channels of the transformed audio (i.e., the original channels of the input channel audio) are input directly to the Codec encoder, but the mixed channels may be input first to the Attenuation module and then to the Codec encoder.
- The Attenuation module reduces all sample values of the mixed channels in the same CG at a uniform rate (Output_Gain).
	- A range of 0 dB to -6 dB is recommended for the attenuation (i.e. a range of 0 dB to 6 dB for Output_Gain).
- The Codec encoder generates the coded substreams from PCMs and passes the substreams and a single decoder_config to the OBU packetizer.
- The OBU packetizer generates descriptor OBUs, which consist of one Magic Code OBU, one Codec Config OBU, one Audio Element OBU and zero or more Mix Presentation OBUs.
	- The Codec Config OBU is generated based on [=decoder_config()=].
- OBU packetizer generates zero or more parameter block OBUs for each frame which contains demixing_info_parameter_data and recon_gain_info_parameter_data.
- OBU packetizer generates audio frame OBUs for each frame of the substreams.
- OBU packetizer generates temporal unit for each frame.
	- A temporal unit consists of audio frame OBUs followed by zero or more parameter block OBUs.
		- It may be immediately preceded by a temporal delimiter OBU.
- The OBU packetizer outputs an IA sequence, which is composed of descriptor OBUs followed by the OBUs for temporal units.
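
As a non-normative sketch of the Attenuation step described above, the following Python function scales every sample of the mixed channels by one uniform gain and reports the corresponding Output_Gain in dB. The function name and the list-of-lists sample representation are hypothetical, and treating the "inverse" of the gain as a sign flip in the dB domain is an assumption.

```python
def attenuate_mixed_channels(channels, attenuation_db):
    """Apply a uniform attenuation (in dB, recommended 0 to -6) to all channels.

    Returns the attenuated channels and the Output_Gain (in dB) that undoes it.
    """
    gain = 10.0 ** (attenuation_db / 20.0)            # dB to linear amplitude
    attenuated = [[s * gain for s in ch] for ch in channels]
    output_gain_db = -attenuation_db                   # assumed inverse, in dB
    return attenuated, output_gain_db


mixed = [[1.0, -0.5], [0.25, 0.0]]
print(attenuate_mixed_channels(mixed, -6.0))
```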

<center><img src="images/IA Encoding Flowchart for Channel Audio Format.png" style="width:80%; height:auto;"></center>
<center><figcaption>IA Encoding Flowchart for Scalable Channel Audio</figcaption></center>

The following sections, [[#iacgeneration-scalablechannelaudio-downmixparameter]], [[#iacgeneration-scalablechannelaudio-downmixmechanism]], [[#iacgeneration-scalablechannelaudio-channellayoutgenerationrule]], [[#iacgeneration-scalablechannelaudio-recongaingeneration]] and [[#iacgeneration-scalablechannelaudio-channelgroupgenerationrule]], are not needed for non-scalable channel audio (i.e., when [=num_layers=] specified in [=scalable_channel_layout_config()=] is set to 1).

### Down-mix parameter and Loudness ### {#iacgeneration-scalablechannelaudio-downmixparameter}

This section describes how to generate down-mix parameters and loudness level for a given channel audio and a given list of channel layouts for scalability.

The figure below shows a block diagram of the down-mix parameter and loudness module, including the down-mixer.

<center><img src="images/Down-mix Parameter and Loudness.png" style="width:100%; height:auto;"></center>
<center><figcaption>IA Down-mix Parameter and Loudness</figcaption></center>

For a given Channel Audio (e.g. 7.1.4ch) and a given list of channel layouts based on the Channel Audio,
- The Down-mix parameter generator shall generate 5 down-mix parameters (α(k), β(k), γ(k), δ(k) and w(k)) by analyzing the input Channel Audio, with reference to [[AI-CAD-Mixing]], where k is a frame index.
	- It is composed of an Audio Scene Classification module and a Height Energy Quantification module as depicted in Figure 11-2.
	- The Audio Scene Classification module generates 4 parameters (α(k), β(k), γ(k), δ(k)) by classifying audio scenes of the input channel audio into three modes.
		- Default scene: Neither Dialog nor Effect
		- Dialog scene: Center-channel oriented and clear dialog/voice sounds
		- Effect scene: Directional and spatially moving sounds.
	- The Height Energy Quantification module generates a surround to height mixing parameter (w(k)), which is decided according to the relative energy difference between the top and surround channels of the input channel audio.
		- If the energy of the top channels is greater than that of the surround channels, then w_idx_offset(k) is set to 1. Otherwise, it is set to -1. w(k) is then calculated based on w_idx_offset(k) and conforms to [[#processing-scalablechannelaudio]].
- The Down-mixer generates down-mixed audio from the input Channel Audio according to the list of channel layouts and the down-mix parameters, and outputs the down-mixed audio for each channel layout to the Loudness module.
	- Although not depicted in the figure, the Down-mixer further generates [=Dmixp_Mode=] and [=Recon_Gains=] for each frame to be passed to the OBU packetizer.
- Loudness module measures the loudness level ([=LKFS=]) of each down-mixed audio based on [[ITU1770-4]], and passes them to OBU packetizer.
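
The w_idx_offset(k) decision of the Height Energy Quantification module can be sketched as below. This is only an illustration: the function names are hypothetical, and measuring "energy" as the sum of squared samples over the frame is an assumption, since this section does not define the energy measure. The derivation of w(k) from w_idx_offset(k) is not shown.

```python
def frame_energy(channels):
    """Sum of squared samples over all given channels of one frame (assumed measure)."""
    return sum(s * s for ch in channels for s in ch)


def w_idx_offset(top_channels, surround_channels):
    """Return 1 if the top channels carry more energy than the surround channels, else -1."""
    return 1 if frame_energy(top_channels) > frame_energy(surround_channels) else -1


print(w_idx_offset([[0.5, 0.5]], [[0.1, 0.1]]))
```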

### Down-mix Mechanism ### {#iacgeneration-scalablechannelaudio-downmixmechanism}

This section specifies the down-mixing mechanism to generate <dfn noexport>down-mixed audio</dfn> for scalable channel audio.

For a given Channel Audio which conforms to [=loudspeaker_layout=], the surround and top channels (if any) are down-mixed separately, step by step, until the target channels are obtained.

Implementers may use another method to obtain the down-mixed audio from the given channel audio, but the resulting down-mixed audio shall comply with the one produced by the mechanism in this section.

Therefore, a down-mixer based on the down-mix mechanism is a combination of the following surround down-mixer(s) and top down-mixer(s), as depicted in the figure below.
- Surround Down-mixers: S7to5 enc., S5to3 enc., S3to2 enc., S2to1 enc.

```
	S7to5 enc.: Ls5 = α(k) x Lss7 + β(k) x Lrs7 and Rs5 = α(k) x Rss7 + β(k) x Rrs7.
	S5to3 enc.: L3 = L5 + δ(k) x Ls5 and R3 = R5 + δ(k) x Rs5
	S3to2 enc.: L2 = L3 + 0.707 x C and R2 = R3 + 0.707 x C
	S2to1 enc.: Mono = 0.5 x (L2 + R2)
```

- Top Down-mixers: T4to2 enc., T2toTF2 enc.

```
	T4to2 enc.: Ltf2 = Ltf4 + γ(k) x Ltb4  and Rtf2 = Rtf4 + γ(k) x Rtb4.
	T2toTF2 enc.: Ltf3 = Ltf2 + w(k) x δ(k) x Ls5 and Rtf3 = Rtf2 + w(k) x δ(k) x Rs5.
```

<center><img src="images/Down-mix Mechanism.png" style="width:100%; height:auto;"></center>
<center><figcaption>IA Down-mix Mechanism</figcaption></center>

```
For example, to get down-mixed 3.1.2ch from 7.1.4ch:
- S3 of 3.1.2ch is generated by using S7to5 and S5to3 encs.
- TF2 of 3.1.2ch is generated by using T4to2 and T2toTF2 encs.
```
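
The 7.1.4ch to 3.1.2ch example can be sketched per sample in Python, chaining the S7to5, S5to3, T4to2 and T2toTF2 equations given above. This is a minimal sketch, not a reference implementation: the dictionary keys and function names are hypothetical, and it assumes that the front L/R, C and LFE channels pass through unchanged between 7.1.4ch and 5.1.x (so L5 = L7 and R5 = R7).

```python
def s7to5(x, alpha, beta):
    """S7to5 enc.: Ls5 = alpha * Lss7 + beta * Lrs7 (and likewise for Rs5)."""
    return (alpha * x["Lss7"] + beta * x["Lrs7"],
            alpha * x["Rss7"] + beta * x["Rrs7"])


def s5to3(l5, r5, ls5, rs5, delta):
    """S5to3 enc.: L3 = L5 + delta * Ls5 (and likewise for R3)."""
    return l5 + delta * ls5, r5 + delta * rs5


def t4to2(x, gamma):
    """T4to2 enc.: Ltf2 = Ltf4 + gamma * Ltb4 (and likewise for Rtf2)."""
    return x["Ltf4"] + gamma * x["Ltb4"], x["Rtf4"] + gamma * x["Rtb4"]


def t2totf2(ltf2, rtf2, ls5, rs5, w, delta):
    """T2toTF2 enc.: Ltf3 = Ltf2 + w * delta * Ls5 (and likewise for Rtf3)."""
    return ltf2 + w * delta * ls5, rtf2 + w * delta * rs5


def downmix_714_to_312(x, alpha, beta, gamma, delta, w):
    """Down-mix one 7.1.4ch sample (dict of channel values) to 3.1.2ch."""
    ls5, rs5 = s7to5(x, alpha, beta)
    # Assumption: L5 = L7 and R5 = R7; C and LFE are passed through.
    l3, r3 = s5to3(x["L7"], x["R7"], ls5, rs5, delta)
    ltf2, rtf2 = t4to2(x, gamma)
    ltf3, rtf3 = t2totf2(ltf2, rtf2, ls5, rs5, w, delta)
    return {"L3": l3, "R3": r3, "C": x["C"], "LFE": x["LFE"],
            "Ltf3": ltf3, "Rtf3": rtf3}


sample = {"L7": 0.5, "R7": 0.25, "C": 0.125, "LFE": 0.0,
          "Lss7": 0.5, "Rss7": 0.25, "Lrs7": 0.5, "Rrs7": 0.5,
          "Ltf4": 0.25, "Rtf4": 0.5, "Ltb4": 0.5, "Rtb4": 0.25}
print(downmix_714_to_312(sample, alpha=1.0, beta=0.5, gamma=0.5, delta=0.5, w=0.25))
```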

### Channel Layout Generation Rule ### {#iacgeneration-scalablechannelaudio-channellayoutgenerationrule}

This section describes the generation rule for channel layouts for scalable channel audio.

For a given channel layout (CL #n) of input Channel Audio, any list of CLs ({CL #i: i = 1, 2, ..., n}) for a scalable channel audio shall conform to the following rules:
- Si ≤ Si+1 and Wi ≤ Wi+1 and Ti ≤ Ti+1, excluding the case where Si = Si+1 and Wi = Wi+1 and Ti = Ti+1, for i = n-1, n-2, …, 1, where the ith Channel Layout is CL #i = Si.Wi.Ti.
- CL #i is one of [=loudspeaker_layouts=] supported in this specification.
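
A check of the first rule can be sketched as follows. It is illustrative only: layouts are represented as hypothetical (S, W, T) tuples, the function name is not part of this specification, and the second rule (membership in the supported loudspeaker layouts) is not checked because the list of supported layouts is defined elsewhere.

```python
def is_valid_layout_list(layouts):
    """Check that consecutive (S, W, T) layouts never shrink and are never identical."""
    for cur, nxt in zip(layouts, layouts[1:]):
        if any(c > n for c, n in zip(cur, nxt)):
            return False    # S, W or T decreases from CL #i to CL #i+1
        if cur == nxt:
            return False    # S, W and T are all equal: excluded by the rule
    return True


# 2ch / 3.1.2ch / 5.1.2ch / 7.1.4ch
print(is_valid_layout_list([(2, 0, 0), (3, 1, 2), (5, 1, 2), (7, 1, 4)]))
```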

Only down-mix paths that conform to the above rules shall be allowed for scalable channel audio with [=num_layers=] > 1, as depicted in the figure below.

<center><img src="images/Down-mix Path.png" style="width:90%; height:auto;"></center>
<center><figcaption>IA Down-mix Path</figcaption></center>

### Recon Gain Generation ### {#iacgeneration-scalablechannelaudio-recongaingeneration}

This section describes how to generate [=recon_gain=].

Recon_Gain needs to be applied to de-mixed channels, so the IA encoder needs to deliver it to IA decoders.

Define the following:
- Level Ok is the signal power for the frame #k of a channel of the down-mixed audio for CL #i.
- Level Mk is the signal power for the frame #k of the relevant mixed channel of the down-mixed audio for CL #i-1.
- Level Dk is the signal power for the frame #k of the de-mixed channel for CL #i (after demixing).

If 10*log10(level Ok / maxL^2) is less than the first threshold value (e.g. -80dB), Recon_Gain (k, i) = 0, where maxL = 32767 for 16 bits.

If 10*log10(level Ok / level Mk) is less than the second threshold value (e.g. -6dB), Recon_Gain (k, i) is set to the value which makes level Ok = Recon_Gain (k, i)^2 x level Dk. Otherwise, Recon_Gain (k, i) = 1. The actual value to be delivered is floor(255*Recon_Gain).
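
The two threshold tests can be sketched in Python as below. This is a hedged illustration rather than a reference implementation: the function names are hypothetical, the first test is assumed to take precedence over the second, zero-power inputs are handled by conventions chosen here, and the gain is clamped to at most 1 so that the delivered value fits in 0..255 (an assumption not stated in this section).

```python
import math


def _ratio_db(num, den):
    """10*log10(num/den); -inf for a zero numerator, +inf for a zero denominator."""
    if num <= 0.0:
        return float("-inf")
    if den <= 0.0:
        return float("inf")
    return 10.0 * math.log10(num / den)


def recon_gain_value(level_o, level_m, level_d, max_l=32767.0,
                     first_threshold_db=-80.0, second_threshold_db=-6.0):
    """Return floor(255 * Recon_Gain) for one channel of one frame."""
    if _ratio_db(level_o, max_l ** 2) < first_threshold_db:
        gain = 0.0                               # original channel is near silent
    elif _ratio_db(level_o, level_m) < second_threshold_db and level_d > 0.0:
        gain = math.sqrt(level_o / level_d)      # level Ok = gain^2 * level Dk
    else:
        gain = 1.0
    gain = min(gain, 1.0)                        # assumption: clamp to 0..1
    return math.floor(255 * gain)


print(recon_gain_value(level_o=1e6, level_m=1e8, level_d=4e6))
```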

```
For example, if we assume CL #i = 7.1.4ch and CL #i-1 = 5.1.2ch, then de-mixed channels are D_Lrs7, D_Rrs7, D_Ltb4 and D_Rtb4.
- D_Lrs7 and D_Rrs7 are de-mixed from Ls5 and Rs5 in the (i-1)th ChannelGroup by using Lss7 and Rss7 in the ith ChannelGroup and its relevant demixing parameters (i.e., α(k) and β(k)), respectively.
- D_Ltb4 and D_Rtb4 are de-mixed from Ltf2 and Rtf2 in the (i-1)th ChannelGroup by using Ltf4 and Rtf4 in the ith ChannelGroup and its relevant demixing parameter (i.e., γ(k)), respectively.

Recon_Gain for D_Lrs7:
- Level Ok is the signal power for the frame #k of Lrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ls5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Lrs7.
Recon_Gain for D_Rrs7:
- Level Ok is the signal power for the frame #k of Rrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rs5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rrs7.
Recon_Gain for D_Ltb4:
- Level Ok is the signal power for the frame #k of Ltb4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ltf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Ltb4.
Recon_Gain for D_Rtb4:
- Level Ok is the signal power for the frame #k of Rtb4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rtf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rtb4.
```

### ChannelGroup Generation Rule ### {#iacgeneration-scalablechannelaudio-channelgroupgenerationrule}

This section describes the generation rule for ChannelGroup.

For a given Channel Audio and the list of CLs ({CL #i: i = 1, 2, ..., n}), the CG Generation module outputs the transformed audio (i.e. ChannelGroups), which shall conform to the following rules:
- It consists of C channels and is structured into n CGs, where C is the number of channels of the Channel Audio.
- CG #1 (called BCG): This CG is the down-mixed audio itself for CL #1 generated from the Channel Audio. It contains C1 channels.
- CG #i (called DCG, i = 2, 3, …, n): This CG contains (Ci – Ci-1) channels. These (Ci – Ci-1) channel(s) consist of the following:
	- (Si – Si-1) surround channel(s) if Si > Si-1. With S_set = { x | Si-1 < x ≤ Si and x is an integer},
		- If 2 is an element of S_set, the L2 channel is contained in this CG #i.
		- If 3 is an element of S_set, the Center channel is contained in this CG #i.
		- If 5 is an element of S_set, the L5 and R5 channels are contained in this CG #i.
		- If 7 is an element of S_set, the Lss7 and Rss7 channels are contained in this CG #i.
	- The LFE channel if Wi > Wi-1 .
	- (Ti – Ti-1) top channels if Ti > Ti-1 .
		- If Ti-1 = 0, the top channels of the down-mixed audio for CL #i are contained in this CG #i.
		- If Ti-1 = 2, the Ltf and Rtf channels of the down-mixed audio for CL #i are contained in this CG #i.
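
The rules for a DCG can be sketched as a function that lists the channels added when going from CL #i-1 to CL #i. This is a non-normative sketch: layouts are hypothetical (S, W, T) tuples, the function names are not part of this specification, and the generic top-channel labels (Ltf, Rtf, Ltb, Rtb) are assumptions used only to show which channels are counted.

```python
# Surround channels added when a given value enters S_set.
SURROUND_ADDITIONS = {2: ["L2"], 3: ["C"], 5: ["L5", "R5"], 7: ["Lss7", "Rss7"]}


def top_channels(t):
    """Assumed generic labels for the top channels of a layout with T = t."""
    return {0: [], 2: ["Ltf", "Rtf"], 4: ["Ltf", "Rtf", "Ltb", "Rtb"]}[t]


def dcg_channels(prev, cur):
    """List the channels contained in CG #i, given CL #i-1 (prev) and CL #i (cur)."""
    (s_prev, w_prev, t_prev), (s, w, t) = prev, cur
    channels = []
    for x in range(s_prev + 1, s + 1):           # S_set = {x | Si-1 < x <= Si}
        channels += SURROUND_ADDITIONS.get(x, [])
    if w > w_prev:
        channels.append("LFE")
    if t > t_prev:
        if t_prev == 0:
            channels += top_channels(t)          # all top channels of CL #i
        elif t_prev == 2:
            channels += ["Ltf", "Rtf"]           # Ltf and Rtf of CL #i
    return channels


layouts = [(2, 0, 0), (3, 1, 2), (5, 1, 2), (7, 1, 4)]
for prev, cur in zip(layouts, layouts[1:]):
    print(cur, dcg_channels(prev, cur))
```

In this example each DCG contains exactly (Ci – Ci-1) channels: 4, 2 and 4 for the three DCGs.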

The figure below shows an example of a transformation matrix with 4 CGs (2ch/3.1.2ch/5.1.2ch/7.1.4ch).

<center><img src="images/Example of Transformation Matrix with 4 CGs.png" style="width:100%; height:auto;"></center>
<center><figcaption>Example of Transformation Matrix with 4 CGs</figcaption></center>

### Mix Presentation Encoding ### {#iacgeneration-mixpresentation}

For a Mix Presentation OBU for a single channel-based audio element, the Mix Presentation OBU shall conform to the following restrictions:
- [=num_sub_mixes=]: set to 1
- [=num_audio_elements=]: set to 1
- [=rendering_config()=]: set to itur_bs2127_direct_speakers_config() with no metadata (i.e. all flags off)
- [=element_mix_config()=]: No parameter block for element_mix and default_mix_gain = 0dB
- [=output_mix_config()=]: No parameter block for output_mix and default_mix_gain = 0dB
- [=num_layouts=]: set to [=num_layers=]
- [=loudness_layout=]: set to L(1), L(2), ..., L([=num_layers=]).
- loudness_info() on L(1), loudness_info on L(2), ..., loudness_info on L([=num_layers=]): loudness information of the rendered audio to the measured layout L(i).
- Where L(i) is the measured layout for the ith layer and i = 1, 2, ..., [=num_layers=]

For a Mix Presentation OBU for a single scene-based audio element, the Mix Presentation OBU shall conform to the following restrictions:
- [=num_sub_mixes=]: set to 1
- [=num_audio_elements=]: set to 1
- [=rendering_config()=]: set to itur_bs2127_hoa_config()
- [=element_mix_config()=]: set to [=mix_gain=]
- [=output_mix_config()=]: set to [=output_mix_gain=]
- [=num_layouts=]: set to M1, the number of loudness information entries that are provided.
- [=loudness_layout=]: set to L(1), L(2), ..., L(M1).
- loudness_info() on L(1), loudness_info on L(2), ..., loudness_info on L(M1): loudness information of the rendered audio to the measured layout L(i).
- Where L(i) is the measured layout for the ith loudness information and i = 1, 2, ..., M1.
- This Mix Presentation is authored by using the highest [=loudness_layout=].
 
For a Mix Presentation OBU for N (>1) audio elements (when [=num_sub_mixes=] = 1), the Mix Presentation OBU shall conform to the following restrictions:
- [=num_sub_mixes=]: set to 1
- [=num_audio_elements=]: set to N
- [=rendering_config()=] for each audio element: set to itur_bs2127_direct_speakers_config() with no metadata (i.e. all flags off) if channel-based or itur_bs2127_hoa_config() if scene-based
- [=element_mix_config()=] for each audio element: set to [=mix_gain=]
- [=output_mix_config()=]: set to [=output_mix_gain=]
- [=num_layouts=]: set to M2, the number of loudness information entries that are provided.
- [=loudness_layout=]: set to L(1), L(2), ..., L(M2).
- loudness_info() on L(1), loudness_info on L(2), ..., loudness_info on L(M2): loudness information of the rendered audio to the measured layout L(i).
- Where L(i) is the measured layout for the ith loudness information and i = 1, 2, ..., M2.
- This Mix Presentation is authored by using the highest [=loudness_layout=].

#### Element Mix Config ####  {#iacgeneration-mixpresentation-mix}

This section provides a guideline to generate element_mix_config().

An IA multiplexer may merge two or more IA sequences. In this case, it should adjust the gain values for [=element_mix_config()=]s as necessary to describe the desired relative gains between the IA sequences when they are summed to generate the final mix. It should also ensure that the gains selected do not result in clipping when the final mix is generated.

### Multiple Audio Elements Encoding ### {#iacgeneration-multipleaudioelements}

This section provides a guideline to generate an IA sequence having multiple audio elements from two IA simple or base profiles.

#### Multiple Audio Elements with One Codec Config #### {#iacgeneration-multipleaudioelements-onecodec}

This section describes how to generate an IA sequence having multiple audio elements from two IA simple profiles with the same codec config OBU. The result shall comply with the base profile of IA sequence.

Step1: Descriptor OBUs are generated as follows:
- Magic Code OBU: take the larger version field and the larger profile version field, respectively.
- Codec Config OBU
	- Take just one codec_id and codec_config().
	- Update num_audio_elements and audio_element_id.
- Mix Presentation OBUs: take all of them and, if needed, generate new ones that are used for mixing multiple audio elements, with the following exception:
	- [=audio_element_id=]s are updated to be aligned with the Codec Config OBU.
- Audio Element OBUs: take all of them, with the following exceptions:
	- [=audio_element_id=]s are updated to be aligned with the Mix Presentation OBUs.
	- [=audio_substream_id=]s are updated to be unique across all Audio Element OBUs.
	- [=audio_element_obu()/parameter_id=]s are updated to be unique across all Audio Element OBUs.

Step2: The ith temporal unit is generated as follows:
- Take all of the temporal units for the ith frames from each audio element and keep the order of temporal units the same as the order of audio element OBUs in the descriptor OBUs, with the following exceptions:
	- [=obu_type=]s are updated to be aligned with the [=audio_element_id=]s specified in Audio Element OBUs.
	- [=parameter_block_obu()/parameter_id=]s in Parameter Block OBUs are updated to be aligned with the [=audio_element_obu()/parameter_id=]s in Audio Element OBUs.
- Add a Sync OBU in front of the ith temporal unit if needed.
	- The new Sync OBU is generated based on the Sync OBUs of each IA sequence and the updated [=audio_substream_id=]s and [=parameter_block_obu()/parameter_id=]s.
- Each temporal unit may be immediately preceded by a temporal delimiter OBU.

Step3: Generate an IA sequence that starts with descriptor OBUs, followed by temporal units in order.
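
One way to make identifiers such as audio_substream_id unique across the merged IA sequences is to renumber them sequentially and keep a per-sequence mapping from old to new values. The sketch below is purely illustrative: the function name and the plain-list data model are hypothetical, and sequential renumbering is one possible policy, not one required by this specification.

```python
def remap_unique_ids(id_lists):
    """Renumber ids from several IA sequences so that they do not collide.

    id_lists holds one list of original ids per input IA sequence.
    Returns one {old_id: new_id} mapping per input IA sequence.
    """
    next_id = 0
    mappings = []
    for ids in id_lists:
        mapping = {}
        for old in ids:
            mapping[old] = next_id
            next_id += 1
        mappings.append(mapping)
    return mappings


# Two IA sequences that both start numbering their substreams at 0.
print(remap_unique_ids([[0, 1], [0, 1, 2]]))
```

The same mappings would then be applied wherever the old ids are referenced, for example in the audio frame OBUs and the new Sync OBU.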

#### Multiple Audio Elements with Multiple Codec Config #### {#iacgeneration-multipleaudioelements-multiplecodec}

This section describes how to generate an IA sequence having multiple audio elements from two IA simple or base profiles with different codec config OBUs. The result shall comply with the enhanced profile of IA sequence.

Step1: Descriptor OBUs are generated as follows:
- Magic Code OBU: take the larger version field and the larger profile version field, respectively.
- Codec Config OBU: if some of the multiple Codec Config OBUs are the same, merge them into one Codec Config OBU, and take each of the others.
	- Update [=audio_element_id=]s to be unique across all Codec Config OBUs.
- Mix Presentation OBUs: take all of them and, if needed, generate new ones that are used after mixing multiple audio elements, with the following exception:
	- [=audio_element_id=]s are updated to be aligned with the Codec Config OBU.
- Audio Element OBUs: take all of them, with the following exceptions:
	- [=audio_element_id=]s are updated to be aligned with the Mix Presentation OBUs.
	- [=audio_substream_id=]s are updated to be unique across all Audio Element OBUs.
	- [=audio_element_obu()/parameter_id=]s are updated to be unique across all Audio Element OBUs.

Step2: Data OBUs are generated as follows:
- Place Temporal Units from multiple audio elements in timing order.
- Add a Sync OBU in front of Temporal Units at frequent intervals.
	- The new Sync OBU is generated based on the Sync OBUs from each IA sequence and the updated [=audio_substream_id=]s and [=parameter_block_obu()/parameter_id=]s.
- A temporal delimiter OBU may immediately precede the temporal units for each audio element.

Step3: Generate an IA sequence that starts with descriptor OBUs, followed by Temporal Units in order.

### Post Processing ### {#iacgeneration-postprocessing}

This section provides a guideline to generate algorithms for post processing.

#### Loudness Config ####  {#iacgeneration-postprocessing-loudness}

This section provides a guideline to generate loudness_config().

//To Do: Fill in how to generate loudness_config()

#### DRC Config ####  {#iacgeneration-postprocessing-drc}

This section provides a guideline to generate drc_config().

//To Do: Fill in how to generate drc_config()


# Consumption of IAC bitstream # {#iacconsumption}

ISSUE: TODO. Fill in example workflows.

# Annex A: Audio Substream Gaps

This annex describes a number of scenarios where a gap may exist in the audio signals, where a gap is defined as no audio frames for some period of time.

## A gap within an audio substream

A gap within an audio substream may be expressed via the Sync OBU offsets. A decoder encountering such a gap may either:

1. insert silent audio frames in the gap without reinitializing, or
2. reinitialize before decoding the audio frames after the gap.

The appropriate behaviour in this case is signalled via the [=reinitialize_decoder=] field in the Sync OBU.

In this version of the specification, gaps within an audio substream are not supported.

## A gap between two audio substreams

A gap may occur if there is a period of time between the end of one substream and the start of another. Such a gap may be expressed via the Sync OBU offsets. Similar to the case of a gap within an audio substream, the behaviour of the decoder is determined by the [=reinitialize_decoder=] field in the Sync OBU.

A gap may further occur if there is a period of time between the end of all substreams and the start of another. This case may be expressed by setting a non-zero value for the [=global_offset=] field in the Sync OBU.

## A gap due to packet loss

In the case where a gap is not signalled by the Sync OBUs, any unexpected absence of audio frames shall be interpreted as packet loss. The IAC parser is unable to guarantee the correctness of the OBUs that follow until the next set of Descriptor OBUs is received.

In this version of the specification, gaps in the audio substreams are not supported, so if a gap is encountered, it can always be interpreted as packet loss.