Encapsulation of FLAC in ISO Base Media File Format Version 0.0.4 (draft) Table of Contents 1 Scope 2 Supporting Normative References 3 Design Rules of Encapsulation 3.1 File Type Identification 3.2 Overview of Track Structure 3.3 Definition of FLAC sample 3.3.1 Sample entry format 3.3.2 FLAC Specific Box 3.3.3 Sample format 3.3.4 Duration of FLAC sample 3.3.5 Sub-sample 3.3.6 Random Access 3.3.6.1 Random Access Point 3.4 Basic Structure (informative) 3.4.1 Initial Movie 3.5 Example of Encapsulation (informative) 4 Acknowledgements 5 Author's Address 1 Scope This document specifies the normative mapping for encapsulation of FLAC coded audio bitstreams in ISO Base Media file format and its derivatives. The encapsulation of FLAC coded bitstreams in QuickTime file format is outside the scope of this specification. 2 Supporting Normative References [1] ISO/IEC 14496-12:2012 Corrected version Information technology — Coding of audio-visual objects — Part 12: ISO base media file format [2] ISO/IEC 14496-12:2012/Amd.1:2013 Information technology — Coding of audio-visual objects — Part 12: ISO base media file format AMENDMENT 1: Various enhancements including support for large metadata [3] FLAC format specification https://xiph.org/flac/format.html Definition of the FLAC Audio Codec stream format [4] FLAC-in-Ogg mapping specification https://xiph.org/flac/ogg_mapping.html Ogg Encapsulation for the FLAC Audio Codec [5] Matroska specification 3 Design Rules of Encapsulation 3.1 File Type Identification This specification does not define any brand to declare files which conform to this specification. Files which conform to this specification shall contain at least one brand which supports the requirements and the requirements described in this clause without contradiction in the compatible brands list of the File Type Box. The minimal support of the encapsulation of FLAC bitstreams in ISO Base Media file format requires the 'isom' brand. 3.2 Overview of Track Structure FLAC coded audio shall be encapsulated into the ISO Base Media File Format as media data within an audio track. + The handler_type field in the Handler Reference Box shall be set to 'soun'. + The Media Information Box shall contain the Sound Media Header Box. + The codingname of the sample entry is 'fLaC'. This specification does not define any encapsulation using MP4AudioSampleEntry with objectTypeIndication specified by the MPEG-4 Registration Authority (http://www.mp4ra.org/). See section 'Sample entry format' for the definition of the sample entry. + The 'dfLa' box is added to the sample entry to convey initializing information for the decoder. See section 'FLAC Specific Box' for the definition of the box contents. + A FLAC sample is exactly one FLAC frame as described in the format specification[3]. See section 'Sample format' for details of the frame contents. + Every FLAC sample is a sync sample. No pre-roll or lapping is required. See section 'Random Access' for further details. 3.3 Definition of a FLAC sample 3.3.1 Sample entry format For any track containing one or more FLAC bitstreams, a sample entry describing the corresponding FLAC bitstream shall be present inside the Sample Table Box. This version of the specification defines only one sample entry format named FLACSampleEntry whose codingname is 'fLaC'. This sample entry includes exactly one FLAC Specific Box defined in section 'FLAC specific box' as a mandatory box and indicates that FLAC samples described by this sample entry are stored by the sample format described in section 'Sample format'. The syntax and semantics of the FLACSampleEntry is shown as follows. The data fields of this box and native FLAC[3] structures encoded within FLAC blocks are both stored in big-endian format, though for purposes of the ISO BMFF container, FLAC native metadata and data blocks are treated as unstructured octet streams. class FLACSampleEntry() extends AudioSampleEntry ('fLaC'){ FLACSpecificBox(); } The fields of the AudioSampleEntry portion shall be set as follows: + channelcount: The channelcount field shall be set equal to the channel count specified by the FLAC bitstream's native METADATA_BLOCK_STREAMINFO header as described in [3]. Note that the FLAC FRAME_HEADER structure that begins each FLAC sample redundantly encodes channel number; the number of channels declared in each FRAME_HEADER MUST match the number of channels declared here and in the METADATA_BLOCK_STREAMINFO header. + samplesize: The samplesize field shall be set equal to the bits per sample specified by the FLAC bitstream's native METADATA_BLOCK_STREAMINFO header as described in [3]. Note that the FLAC FRAME_HEADER structure that begins each FLAC sample redundantly encodes the number of bits per sample; the bits per sample declared in each FRAME_HEADER MUST match the samplesize declared here and the bits per sample field declared in the METADATA_BLOCK_STREAMINFO header. + samplerate: When possible, the samplerate field shall be set equal to the sample rate specified by the FLAC bitstream's native METADATA_BLOCK_STREAMINFO header as described in [3], left-shifted by 16 bits to create the appropriate 16.16 fixed-point representation. When the bitstream's native sample rate is greater than the maximum expressible value of 65535 Hz, the samplerate field shall hold the greatest expressible regular division of that rate. I.e. the samplerate field shall hold 48000.0 for native sample rates of 96 and 192 kHz. In the case of unusual sample rates which do not have an expressible regular division, the maximum value of 65535.0 Hz should be used. High-rate FLAC bitstreams are common, and the native value from the METADATA_BLOCK_STREAMINFO header in the FLACSpecificBox MUST be read to determine the correct sample rate of the bitstream. Note that the FLAC FRAME_HEADER structure that begins each FLAC sample redundantly encodes the sample rate; the sample rate declared in each FRAME_HEADER MUST match the sample rate declared in the METADATA_BLOCK_STREAMINFO header, and here in the AudioSampleEntry portion of the FLACSampleEntry as much as is allowed by the encoding restrictions described above. Finally, the FLACSpecificBox carries codec headers: + FLACSpecificBox This box contains initializing information for the decoder as defined in section 'FLAC specific box'. 3.3.2 FLAC Specific Box Exactly one FLAC Specific Box shall be present in each FLACSampleEntry. This specification defines version 0 of this box. If incompatible changes occur in future versions of this specification, another version number will be defined. The data fields of this box and native FLAC[3] structures encoded within FLAC blocks are both stored in big-endian format, though for purposes of the ISO BMFF container, FLAC native metadata and data blocks are treated as unstructured octet streams. The syntax and semantics of the FLAC Specific Box is shown as follows. class FLACMetadataBlock { unsigned int(1) LastMetadataBlockFlag; unsigned int(7) BlockType; unsigned int(24) Length; unsigned int(8) BlockData[Length]; } aligned(8) class FLACSpecificBox extends FullBox('dfLa', version=0, 0){ for (i=0; ; i++) { // to end of box FLACMetadataBlock(); } } + Version: The Version field shall be set to 0. In the future versions of this specification, this field may be set to other values. And without support of those values, the reader shall not read the fields after this within the FLACSpecificBox. + Flags: The Flags field shall be set to 0. After the FullBox header, the box contains a sequence of FLAC[3] native-metadata block structures that fill the remainder of the box. Each FLACMetadataBlock structure consists of three fields filling a total of four bytes that form a FLAC[3] native METADATA_BLOCK_HEADER, followed by raw octet bytes that comprise the FLAC[3] native METADATA_BLOCK_DATA. + LastMetadataBlockFlag: The LastMetadataBlockFlag field maps semantically to the FLAC[3] native METADATA_BLOCK_HEADER Last-metadata-block flag as defined in the FLAC[3] file specification. The LastMetadataBlockFlag is set to 1 if this MetadataBlock is the last metadata block in the FLACSpecificBox. It is set to 0 otherwise. + BlockType: The BlockType field maps semantically to the FLAC[3] native METADATA_BLOCK_HEADER BLOCK_TYPE field as defined in the FLAC[3] file specification. The BlockType is set to a valid FLAC[3] BLOCK_TYPE value that identifies the type of this native metadata block. The BlockType of the first FLACMetadataBlock must be set to 0, signifying this is a FLAC[3] native METADATA_BLOCK_STREAMINFO block. + Length: The Length field maps semantically to the FLAC[3] native METADATA_BLOCK_HEADER Length field as defined in the FLAC[3] file specification. The length field specifies the number of bytes of MetadataBlockData to follow. + BlockData The BlockData field maps semantically to the FLAC[3] native METADATA_BLOCK_HEADER METADATA_BLOCK_DATA as defined in the FLAC[3] file specification. Taken together, the bytes of the FLACMetadataBlock form a complete FLAC[3] native METADATA_BLOCK structure. Note that a minimum of a single FLACMetadataBlock, consisting of a FLAC[3] native METADATA_BLOCK_STREAMINFO structure, is required. Should the FLACSpecificBox contain more than a single FLACMetadataBlock structure, the FLACMetadataBlock containing the FLAC[3] native METADATA_BLOCK_STREAMINFO must occur first in the list. Other containers that package FLAC audio streams, such as Ogg[4] and Matroska[5], wrap FLAC[3] native metadata without modification similar to this specification. When repackaging or remuxing FLAC[3] streams from another format that contains FLAC[3] native metadata into an ISO BMFF file, the complete FLAC[3] native metadata should be preserved in the ISO BMFF stream as described above. It is also allowed to parse this native metadata and include contextually redundant ISO BMFF-native repackagings and/or reparsings of FLAC[3] native metadata, so long as the native metadata is also preserved. 3.3.3 Sample format A FLAC sample is exactly one FLAC audio FRAME (as defined in the FLAC[3] file specification) belonging to a FLAC bitstreams. The FLAC sample data begins with a complete FLAC FRAME_HEADER, followed by one FLAC SUBFRAME per channel, any necessary bit padding, and ends with the usual FLAC FRAME_FOOTER. Note that the FLAC native FRAME_HEADER structure that begins each FLAC sample redundantly encodes channel count, sample rate, and sample size. The values of these fields must agree both with the values declared in the FLAC METADATA_BLOCK_STREAMINFO structure as well as the FLACSampleEntry box. 3.3.4 Duration of a FLAC sample The duration of any given FLAC sample is determined by dividing the decoded block size of a FLAC frame, as encoded in the FLAC FRAME's FRAME_HEADER structure, by the value of the timescale field in the Media Header Box. FLAC samples are permitted to have variable durations within a given audio stream. FLAC does not use padding values. 3.3.5 Sub-sample Sub-samples are not defined for FLAC samples in this specification. 3.3.6 Random Access This subclause describes the nature of the random access of FLAC sample. 3.3.6.1 Random Access Point All FLAC samples can be independently decoded i.e. every FLAC sample is a sync sample. The Sync Sample Box shall not be present as long as there are no samples other than FLAC samples in the same track. The sample_is_non_sync_sample field for FLAC samples shall be set to 0. 3.4 Basic Structure (informative) 3.4.1 Initial Movie This subclause shows a basic structure of the Movie Box as follows: +----+----+----+----+----+----+----+----+------------------------------+ |moov| | | | | | | | Movie Box | +----+----+----+----+----+----+----+----+------------------------------+ | |mvhd| | | | | | | Movie Header Box | +----+----+----+----+----+----+----+----+------------------------------+ | |trak| | | | | | | Track Box | +----+----+----+----+----+----+----+----+------------------------------+ | | |tkhd| | | | | | Track Header Box | +----+----+----+----+----+----+----+----+------------------------------+ | | |edts|* | | | | | Edit Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | |elst|* | | | | Edit List Box | +----+----+----+----+----+----+----+----+------------------------------+ | | |mdia| | | | | | Media Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | |mdhd| | | | | Media Header Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | |hdlr| | | | | Handler Reference Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | |minf| | | | | Media Information Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | |smhd| | | | Sound Media Header Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | |dinf| | | | Data Information Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | |dref| | | Data Reference Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | | |url | | DataEntryUrlBox | +----+----+----+----+----+----+ or +----+------------------------------+ | | | | | | |urn | | DataEntryUrnBox | +----+----+----+----+----+----+----+----+------------------------------+ | | | | |stbl| | | | Sample Table | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | |stsd| | | Sample Description Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | | |fLaC| | FLACSampleEntry | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | | | |dfLa| FLAC Specific Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | |stts| | | Decoding Time to Sample Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | |stsc| | | Sample To Chunk Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | |stsz| | | Sample Size Box | +----+----+----+----+----+ or +----+----+------------------------------+ | | | | | |stz2| | | Compact Sample Size Box | +----+----+----+----+----+----+----+----+------------------------------+ | | | | | |stco| | | Chunk Offset Box | +----+----+----+----+----+ or +----+----+------------------------------+ | | | | | |co64| | | Chunk Large Offset Box | +----+----+----+----+----+----+----+----+------------------------------+ | |mvex|* | | | | | | Movie Extends Box | +----+----+----+----+----+----+----+----+------------------------------+ | | |trex|* | | | | | Track Extends Box | +----+----+----+----+----+----+----+----+------------------------------+ Figure 1 - Basic structure of Movie Box It is strongly recommended that the order of boxes should follow the above structure. Boxes marked with an asterisk (*) may or may not be present depending on context. For most boxes listed above, the definition is as is defined in ISO/IEC 14496-12 [1]. The additional boxes and the additional requirements, restrictions and recommendations to the other boxes are described in this specification. 3.5 Example of Encapsulation (informative) [File] size = 17790 [ftyp: File Type Box] position = 0 size = 24 major_brand = mp42 : MP4 version 2 minor_version = 0 compatible_brands brand[0] = mp42 : MP4 version 2 brand[1] = isom : ISO Base Media file format [moov: Movie Box] position = 24 size = 757 [mvhd: Movie Header Box] position = 32 size = 108 version = 0 flags = 0x000000 creation_time = UTC 2014/12/12, 18:41:19 modification_time = UTC 2014/12/12, 18:41:19 timescale = 48000 duration = 33600 (00:00:00.700) rate = 1.000000 volume = 1.000000 reserved = 0x0000 reserved = 0x00000000 reserved = 0x00000000 transformation matrix | a, b, u | | 1.000000, 0.000000, 0.000000 | | c, d, v | = | 0.000000, 1.000000, 0.000000 | | x, y, w | | 0.000000, 0.000000, 1.000000 | pre_defined = 0x00000000 pre_defined = 0x00000000 pre_defined = 0x00000000 pre_defined = 0x00000000 pre_defined = 0x00000000 pre_defined = 0x00000000 next_track_ID = 2 [iods: Object Descriptor Box] position = 140 size = 33 version = 0 flags = 0x000000 [tag = 0x10: MP4_IOD] expandableClassSize = 16 ObjectDescriptorID = 1 URL_Flag = 0 includeInlineProfileLevelFlag = 0 reserved = 0xf ODProfileLevelIndication = 0xff sceneProfileLevelIndication = 0xff audioProfileLevelIndication = 0xfe visualProfileLevelIndication = 0xff graphicsProfileLevelIndication = 0xff [tag = 0x0e: ES_ID_Inc] expandableClassSize = 4 Track_ID = 1 [trak: Track Box] position = 173 size = 608 [tkhd: Track Header Box] position = 181 size = 92 version = 0 flags = 0x000007 Track enabled Track in movie Track in preview creation_time = UTC 2014/12/12, 18:41:19 modification_time = UTC 2014/12/12, 18:41:19 track_ID = 1 reserved = 0x00000000 duration = 33600 (00:00:00.700) reserved = 0x00000000 reserved = 0x00000000 layer = 0 alternate_group = 0 volume = 1.000000 reserved = 0x0000 transformation matrix | a, b, u | | 1.000000, 0.000000, 0.000000 | | c, d, v | = | 0.000000, 1.000000, 0.000000 | | x, y, w | | 0.000000, 0.000000, 1.000000 | width = 0.000000 height = 0.000000 [mdia: Media Box] position = 273 size = 472 [mdhd: Media Header Box] position = 281 size = 32 version = 0 flags = 0x000000 creation_time = UTC 2014/12/12, 18:41:19 modification_time = UTC 2014/12/12, 18:41:19 timescale = 48000 duration = 34560 (00:00:00.720) language = und pre_defined = 0x0000 [hdlr: Handler Reference Box] position = 313 size = 51 version = 0 flags = 0x000000 pre_defined = 0x00000000 handler_type = soun reserved = 0x00000000 reserved = 0x00000000 reserved = 0x00000000 name = Xiph Audio Handler [minf: Media Information Box] position = 364 size = 381 [smhd: Sound Media Header Box] position = 372 size = 16 version = 0 flags = 0x000000 balance = 0.000000 reserved = 0x0000 [dinf: Data Information Box] position = 388 size = 36 [dref: Data Reference Box] position = 396 size = 28 version = 0 flags = 0x000000 entry_count = 1 [url : Data Entry Url Box] position = 412 size = 12 version = 0 flags = 0x000001 location = in the same file [stbl: Sample Table Box] position = 424 size = 321 [stsd: Sample Description Box] position = 432 size = 79 version = 0 flags = 0x000000 entry_count = 1 [fLaC: Audio Description] position = 448 size = 63 reserved = 0x000000000000 data_reference_index = 1 reserved = 0x0000 reserved = 0x0000 reserved = 0x00000000 channelcount = 2 samplesize = 16 pre_defined = 0 reserved = 0 samplerate = 48000.000000 [dfLa: FLAC Specific Box] position = 484 size = 50 version = 0 flags = 0x000000 [FLACMetadataBlock] LastMetadataBlockFlag = 1 BlockType = 0 Length = 34 BlockData[34]; [stts: Decoding Time to Sample Box] position = 492 size = 24 version = 0 flags = 0x000000 entry_count = 1 entry[0] sample_count = 18 sample_delta = 1920 [stsc: Sample To Chunk Box] position = 516 size = 40 version = 0 flags = 0x000000 entry_count = 2 entry[0] first_chunk = 1 samples_per_chunk = 13 sample_description_index = 1 entry[1] first_chunk = 2 samples_per_chunk = 5 sample_description_index = 1 [stsz: Sample Size Box] position = 556 size = 92 version = 0 flags = 0x000000 sample_size = 0 (variable) sample_count = 18 entry_size[0] = 977 entry_size[1] = 938 entry_size[2] = 939 entry_size[3] = 938 entry_size[4] = 934 entry_size[5] = 945 entry_size[6] = 948 entry_size[7] = 956 entry_size[8] = 955 entry_size[9] = 930 entry_size[10] = 933 entry_size[11] = 934 entry_size[12] = 972 entry_size[13] = 977 entry_size[14] = 958 entry_size[15] = 949 entry_size[16] = 962 entry_size[17] = 848 [stco: Chunk Offset Box] position = 648 size = 24 version = 0 flags = 0x000000 entry_count = 2 chunk_offset[0] = 686 chunk_offset[1] = 12985 [free: Free Space Box] position = 672 size = 8 [mdat: Media Data Box] position = 680 size = 17001 4 Acknowledgements This spec draws heavily from the Opus-in-ISOBMFF specification work done by Yusuke Nakamura Thank you to Tim Terriberry, David Evans, and Yusuke Nakamura for valuable feedback. Thank you to Ralph Giles for editorial help. 5 Author Address Monty Montgomery