USB Device Class Definition for Audio Data Formats
Release 1.0 March 18, 1998 11
| Offset | Field | Size | Value | Description |
| --- | --- | --- | --- | --- |
| 8 | tLowerSamFreq | 3 | Number | Lower bound in Hz of the sampling frequency range for this isochronous data endpoint. |
| 11 | tUpperSamFreq | 3 | Number | Upper bound in Hz of the sampling frequency range for this isochronous data endpoint. |
Table 2-3: Discrete Number of Sampling Frequencies
| Offset | Field | Size | Value | Description |
| --- | --- | --- | --- | --- |
| 8 | tSamFreq [1] | 3 | Number | Sampling frequency 1 in Hz for this isochronous data endpoint. |
| … | … | … | … | … |
| 8+(ns-1)*3 | tSamFreq [ns] | 3 | Number | Sampling frequency ns in Hz for this isochronous data endpoint. |
Note: In the case of adaptive isochronous data endpoints that support only a discrete number of sampling frequencies, the endpoint must at least tolerate ±1000 PPM inaccuracy on the reported sampling frequencies.
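As an illustration of the tolerance in the note above, the following minimal sketch computes the frequency band that corresponds to a ±1000 PPM deviation from a reported sampling frequency. The function name `tolerance_band_hz` is hypothetical and not part of this specification.

```python
def tolerance_band_hz(nominal_hz, ppm=1000):
    """Return the (low, high) bounds in Hz for a +/- ppm deviation from nominal_hz."""
    delta = nominal_hz * ppm / 1_000_000  # 1000 PPM is 0.1 % of the nominal frequency
    return nominal_hz - delta, nominal_hz + delta


low, high = tolerance_band_hz(48_000)
print(low, high)  # 47952.0 48048.0
```

Under this reading, an adaptive endpoint that reports 48000 Hz would need to accept streams running anywhere between roughly 47952 Hz and 48048 Hz.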
2.2.6 Supported Formats
The following paragraphs list all currently supported Type I Audio Data Formats.
2.2.6.1 PCM Format
The PCM (Pulse Code Modulation) format is the most commonly used audio format to represent audio data streams. The audio data is not compressed and uses a signed two’s-complement fixed point format. It is left-justified (the sign bit is the Msb) and data is padded with trailing zeros to fill the remaining unused bits of the subframe. The binary point is located to the right of the sign bit so that all values lie within the range [-1,+1).
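The left-justification rule can be sketched as follows. This is a minimal illustration, not a normative procedure: the function names are hypothetical, and little-endian byte order within the subframe is an assumption that this section does not state.

```python
def pcm_to_subframe(sample, bit_resolution, subframe_size):
    """Left-justify a signed integer sample in a subframe of subframe_size bytes."""
    total_bits = 8 * subframe_size
    if not 1 <= bit_resolution <= total_bits:
        raise ValueError("bit resolution does not fit in the subframe")
    low, high = -(1 << (bit_resolution - 1)), (1 << (bit_resolution - 1)) - 1
    if not low <= sample <= high:
        raise ValueError("sample out of range for the bit resolution")
    # Shifting left moves the sign bit to the Msb and pads with trailing zeros.
    shifted = sample << (total_bits - bit_resolution)
    # Masking yields the two's-complement bit pattern for negative values.
    pattern = shifted & ((1 << total_bits) - 1)
    return pattern.to_bytes(subframe_size, "little")  # byte order is an assumption


def subframe_to_fraction(data):
    """Interpret a subframe as a fixed-point value in the range [-1, +1)."""
    value = int.from_bytes(data, "little", signed=True)
    return value / (1 << (8 * len(data) - 1))


print(pcm_to_subframe(-1, 20, 3).hex())                      # f0ffff
print(subframe_to_fraction(pcm_to_subframe(0x40000, 20, 3)))  # 0.5
```

Because the sample is left-justified, the fractional interpretation does not depend on the bit resolution: a 20-bit sample and a 24-bit sample of the same amplitude map to the same value in [-1,+1).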
2.2.6.2 PCM8 Format
The PCM8 format is introduced to be compatible with the legacy 8-bit wave format. Audio data is uncompressed and uses 8 bits per sample (bBitResolution = 8). In this case, data is unsigned fixed-point, left-justified in the audio subframe, Msb first. The range is [0,255].
2.2.6.3 IEEE_FLOAT Format
The IEEE_FLOAT format is based on the ANSI/IEEE-754 floating-point standard. Audio data is represented using the basic single-precision format. The basic single-precision number is 32 bits wide and has an 8-bit exponent and a 24-bit mantissa. Both mantissa and exponent are signed numbers, but neither is represented in two's-complement format. The mantissa is stored in sign magnitude format and the exponent in biased form (also called excess-n form). In biased form, there is a positive integer (called the bias) which is subtracted from the stored number to get the actual number. For example, in an eight-bit exponent, the bias is 127. To represent 0, the number 127 is stored. To represent -100, 27 is stored. An exponent of all zeroes and an exponent of all ones are both reserved for special cases, so in an eight-bit field, exponents of -126 to +127 are possible. In the basic floating-point format, the mantissa is assumed to be normalized so that the most significant bit is always one, and therefore is not stored. Only the fractional part is stored.
The 32-bit IEEE-754 floating-point word is broken into three fields. The most significant bit stores the sign of the mantissa, the next group of 8 bits stores the exponent in biased form, and the remaining 23 bits store the magnitude of the fractional portion of the mantissa. For further information, refer to the ANSI/IEEE-754 standard.
The data is conveyed over USB using 32 bits per sample (bBitResolution = 32; bSubframeSize = 4).
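The three-field layout described above can be inspected with a short sketch. The helper names are hypothetical; the code uses only the Python standard library and handles normalized numbers only when reconstructing a value.

```python
import struct


def float32_fields(value):
    """Split a single-precision value into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                      # most significant bit
    biased_exponent = (bits >> 23) & 0xFF  # next 8 bits, bias 127
    fraction = bits & 0x7FFFFF             # remaining 23 bits
    return sign, biased_exponent, fraction


def float32_from_fields(sign, biased_exponent, fraction):
    """Rebuild a normalized value; exponents 0 and 255 are reserved special cases."""
    if not 1 <= biased_exponent <= 254:
        raise ValueError("reserved exponent (special case)")
    mantissa = 1 + fraction / (1 << 23)    # the leading one is implied, not stored
    return (-1) ** sign * mantissa * 2.0 ** (biased_exponent - 127)


print(float32_fields(-0.15625))  # (1, 124, 2097152)
```

Here -0.15625 equals -1.25 × 2^-3, so the stored exponent is 127 - 3 = 124 and the stored fraction is 0.25 × 2^23 = 2097152.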
2.2.6.4 ALaw Format and µLaw Format
Starting from 12- or 16-bit linear PCM samples, simple compression down to 8 bits per sample (one byte per sample) can be achieved by using logarithmic companding. The compressed audio data uses 8 bits per sample (bBitResolution = 8). Data is signed fixed point, left-justified in the subframe, Msb first. The compressed range is [-128,127]. The difference between ALaw and µLaw compression lies in the formulae used to achieve the compression. Refer to the ITU G.711 standard for further details.
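To give a feel for logarithmic companding, the sketch below implements the continuous µ-law curve with µ = 255 on samples normalized to [-1, 1]. It is an illustration only: G.711 itself specifies a segmented (piecewise-linear) 8-bit encoding rather than this closed-form curve, and the function names are hypothetical.

```python
import math

MU = 255  # companding parameter of the continuous mu-law curve


def mu_law_compress(x, mu=MU):
    """Map a linear sample x in [-1, 1] onto the logarithmic mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


for x in (0.01, 0.1, 1.0):
    print(x, round(mu_law_compress(x), 4))
```

Small amplitudes are expanded before quantization and large amplitudes are compressed, which is why 8 bits per sample can cover the dynamic range of a wider linear PCM sample.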
2.3 Type II Formats
Type II formats are used to transmit non-PCM encoded audio data as bitstreams that consist of a sequence of encoded audio frames.
2.3.1 Encoded Audio Frames
An encoded audio frame is a sequence of bits that contains an encoded representation of one or more physical audio channels. The encoding takes place over a fixed number of audio samples. Each encoded audio frame contains enough information to entirely reconstruct (albeit not losslessly) the audio samples encoded in it. No information from adjacent encoded audio frames is needed during decoding. The number of samples used to construct one encoded audio frame depends on the encoding scheme. (For MPEG, the number of samples per encoded audio frame (nf) is 384 for Layer I or 1152 for Layer II. For AC-3, the number of samples is 1536.)
In most cases, the encoded audio frame represents multiple physical audio channels. The number of bits per encoded audio frame may be variable. The content of the encoded audio frame is defined according to the implemented encoding scheme. Where applicable, the bit ordering shall be MSB first, relative to existing standards of serial transmission or storage of that encoding scheme. An encoded audio frame represents an interval longer than the USB frame time of 1 ms. This is typical of audio compression algorithms that use psycho-acoustic or vocal tract parametric models.
Note: It is important to make a clear distinction between an audio frame (see Section 2.2.3, “Audio Frame”) and an encoded audio frame. The overloaded use of the term audio frame could cause confusion. Therefore, this specification will always use the qualifier ‘encoded’ to refer to MPEG or AC-3 encoded audio frames.
2.3.2 Audio Bitstreams
An encoded audio bitstream is a concatenation of a potentially very large number of encoded audio frames, ordered according to ascending time. Subsequent encoded audio frames are independent and can be decoded separately.
2.3.3 USB Packets
Encoded audio bitstreams are packetized when transported over an isochronous pipe. Each USB packet contains only part of a single encoded audio frame. Packet sizes are determined according to the short-packet protocol. The encoded audio frame is broken down into a number of packets, each containing wMaxPacketSize bytes except for the last packet, which may be smaller and contains the remainder of the encoded audio frame. If the MaxPacketsOnly bit D7 in the bmAttributes field of the class-specific endpoint descriptor is set, the last (short) packet must be padded with zero bytes to wMaxPacketSize length. No USB packet may contain bits belonging to different encoded audio frames. If the encoded audio frame length is not a multiple of 8 bits, the last byte in the last packet is padded with zero bits. The decoder must ignore all padded extra bits and bytes. Consecutive encoded audio frames are separated by at least one Transfer Delimiter. A Transfer Delimiter must be sent in all consecutive USB frames until the next encoded audio frame is due. The above rules guarantee that a new encoded audio frame always starts on a USB packet boundary.
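The packetization rules above can be sketched as follows. The function name `packetize` is hypothetical, the encoded audio frame is assumed to be already padded to a whole number of bytes, and Transfer Delimiters between frames are not modeled.

```python
def packetize(frame, max_packet_size, max_packets_only=False):
    """Split one encoded audio frame (bytes) into USB packet payloads."""
    if max_packet_size <= 0:
        raise ValueError("wMaxPacketSize must be positive")
    # Every packet carries wMaxPacketSize bytes except possibly the last one.
    packets = [frame[i:i + max_packet_size]
               for i in range(0, len(frame), max_packet_size)]
    # With MaxPacketsOnly set, the short last packet is padded with zero bytes.
    if max_packets_only and packets and len(packets[-1]) < max_packet_size:
        packets[-1] = packets[-1].ljust(max_packet_size, b"\x00")
    return packets


frame = bytes(range(1, 11))  # a 10-byte stand-in for an encoded audio frame
print([len(p) for p in packetize(frame, 4)])        # [4, 4, 2]
print([len(p) for p in packetize(frame, 4, True)])  # [4, 4, 4]
```

Since each call handles exactly one encoded audio frame, no packet produced here can mix bytes from two frames, which matches the requirement that a new frame always starts on a packet boundary.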
2.3.4 Bandwidth Allocation
The encoded audio frame time tf equals the number of audio samples per encoded audio frame nf divided by the sampling rate fs of the original audio samples.
The allocated bandwidth for the pipe must accommodate for the largest possible encoded audio frame to be transmitted within an encoded audio frame time. This should take into account the Transfer Delimiter requirement and any differences between the time base of the stream and the USB frame timer. The device may choose to consume more bandwidth than necessary (by increasing the reported wMaxPacketSize) to minimize the time needed to transmit an entire encoded audio frame. This can be used to enable early decoding and therefore minimize system latency.
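As a rough illustration of the bandwidth reasoning above, the sketch below computes the encoded audio frame time tf = nf / fs and a lower bound on wMaxPacketSize. The helper names are hypothetical, and reserving exactly one 1 ms USB frame per encoded audio frame for the Transfer Delimiter is an assumption; a real design would add margin for time-base differences.

```python
from fractions import Fraction


def encoded_frame_time(nf, fs):
    """Encoded audio frame time tf = nf / fs in seconds, as an exact fraction."""
    return Fraction(nf, fs)


def min_packet_size(max_frame_bytes, nf, fs, reserved_usb_frames=1):
    """Smallest packet size that moves the largest encoded frame within tf.

    reserved_usb_frames is the number of 1 ms USB frames set aside for
    Transfer Delimiters (assumed to be one here).
    """
    whole_usb_frames = int(encoded_frame_time(nf, fs) * 1000)  # rounded down
    usable = whole_usb_frames - reserved_usb_frames
    if usable < 1:
        raise ValueError("encoded frame time too short for this scheme")
    return -(-max_frame_bytes // usable)  # ceiling division


print(float(encoded_frame_time(1536, 48000)))  # 0.032
print(min_packet_size(2560, 1536, 48000))      # 83
```

In this example an AC-3 frame of 1536 samples at 48 kHz lasts 32 ms; a hypothetical largest frame of 2560 bytes spread over the 31 remaining USB frames needs at least 83 bytes per packet. Reporting a larger wMaxPacketSize shortens the transfer and allows decoding to start earlier.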
2.3.5 Timing
The timing reference point is the beginning of an encoded audio frame. Therefore, the USB packet that contains the first bits (usually the encoded audio frame sync word) of the encoded audio frame is used as a timing reference in USB space. This USB packet is called the reference packet. The transmission of the reference packet of an encoded audio frame should begin at the target playback time of that frame (minus the endpoint’s reported delay) rounded to the nearest USB frame time. This guarantees that, at the receiving end, the arrival of subsequent reference packets matches the encoded audio frame time tf as closely as possible.
2.3.6 Type II Format Type Descriptor
The Type II Format Type descriptor starts with the usual three fields bLength, bDescriptorType and bDescriptorSubtype.
The bFormatType field indicates this is a Type II descriptor. The wMaxBitRate field contains the maximum number of bits per second this interface can handle. It is a measure for the buffer size available in the interface. The wSamplesPerFrame field contains the number of non-PCM encoded audio samples contained within a single encoded audio frame.
The sampling frequency capabilities of the endpoint are reported using the bSamFreqType field and following fields.
Table 2-4: Type II Format Type Descriptor
| Offset | Field | Size | Value | Description |
| --- | --- | --- | --- | --- |
| 0 | bLength | 1 | Number | Size of this descriptor, in bytes: 9+(ns*3) |
| 1 | bDescriptorType | 1 | Constant | CS_INTERFACE descriptor type. |
| 2 | bDescriptorSubtype | 1 | Constant | FORMAT_TYPE descriptor subtype. |
| 3 | bFormatType | 1 | Constant | FORMAT_TYPE_II. Constant identifying the Format Type the AudioStreaming interface is using. |
| 4 | wMaxBitRate | 2 | Number | Indicates the maximum number of bits per second this interface can handle. Expressed in kbits/s. |
| 6 | wSamplesPerFrame | 2 | Number | Indicates the number of PCM audio samples contained in one encoded audio frame. |
| 8 | bSamFreqType | 1 | Number | Indicates how the sampling frequency can be programmed: 0: Continuous sampling frequency; 1..255: The number of discrete sampling frequencies supported by the isochronous data endpoint of the AudioStreaming interface (ns) |
| 9... | | | | See sampling frequency tables, below. |
Depending on the value in the bSamFreqType field, the layout of the next part of the descriptor is as shown in the following tables.
Table 2-5: Continuous Sampling Frequency
| Offset | Field | Size | Value | Description |
| --- | --- | --- | --- | --- |
| 9 | tLowerSamFreq | 3 | Number | Lower bound in Hz of the sampling frequency range for this isochronous data endpoint. |
| 12 | tUpperSamFreq | 3 | Number | Upper bound in Hz of the sampling frequency range for this isochronous data endpoint. |
Table 2-6: Discrete Number of Sampling Frequencies
| Offset | Field | Size | Value | Description |
| --- | --- | --- | --- | --- |
| 9 | tSamFreq [1] | 3 | Number | Sampling frequency 1 in Hz for this isochronous data endpoint. |
| … | … | … | … | … |
| 9+(ns-1)*3 | tSamFreq [ns] | 3 | Number | Sampling frequency ns in Hz for this isochronous data endpoint. |
Note: In the case of adaptive isochronous data endpoints that support only a discrete number of sampling frequencies, the endpoint must at least tolerate ±1000 PPM inaccuracy on the reported sampling frequencies.
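The Type II Format Type descriptor layout in Tables 2-4 through 2-6 can be read with a short parser such as the sketch below. The function name is hypothetical and little-endian byte order for multi-byte fields is assumed. The constant values used in the example bytes (0x24 for CS_INTERFACE, 0x02 for FORMAT_TYPE and FORMAT_TYPE_II) are not given in this section and are included only for illustration. For the continuous case the parser expects 15 bytes, that is, the two 3-byte bounds after offset 9.

```python
import struct


def parse_type_ii_format_descriptor(desc):
    """Parse a Type II Format Type descriptor into a dict keyed by field name."""
    if len(desc) < 9 or desc[0] != len(desc):
        raise ValueError("bLength does not match the descriptor size")
    names = ("bLength", "bDescriptorType", "bDescriptorSubtype", "bFormatType",
             "wMaxBitRate", "wSamplesPerFrame", "bSamFreqType")
    fields = dict(zip(names, struct.unpack_from("<BBBBHHB", desc, 0)))

    def freq(offset):
        # Sampling frequencies are 3-byte numbers, expressed in Hz.
        return int.from_bytes(desc[offset:offset + 3], "little")

    ns = fields["bSamFreqType"]
    if ns == 0:  # continuous sampling frequency range
        if len(desc) != 15:
            raise ValueError("continuous range needs two 3-byte bounds")
        fields["tLowerSamFreq"] = freq(9)
        fields["tUpperSamFreq"] = freq(12)
    else:        # ns discrete sampling frequencies
        if len(desc) != 9 + 3 * ns:
            raise ValueError("descriptor size is not 9+(ns*3)")
        fields["tSamFreq"] = [freq(9 + 3 * i) for i in range(ns)]
    return fields


example = (bytes([15, 0x24, 0x02, 0x02])        # illustrative constant values
           + struct.pack("<HHB", 384, 1152, 2)  # 384 kbits/s, 1152 samples, ns = 2
           + (44100).to_bytes(3, "little")
           + (48000).to_bytes(3, "little"))
print(parse_type_ii_format_descriptor(example)["tSamFreq"])  # [44100, 48000]
```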
2.3.7 Rate Feedback
If the isochronous data endpoint needs explicit rate feedback (adaptive source, asynchronous sink), the feedback pipe shall report the number of equivalent PCM audio samples. The host will accumulate this data and start transmission of an encoded audio frame whenever the current number of samples exceeds the number of samples per encoded audio frame. The remainder is kept in the accumulator.
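The accumulator behavior described above can be sketched as follows. The class name `FrameScheduler` is hypothetical, feedback values are simplified to whole sample counts, and "exceeds" is interpreted here as "reaches or exceeds", which is an assumption.

```python
class FrameScheduler:
    """Host-side accumulator of equivalent PCM samples reported by the feedback pipe."""

    def __init__(self, samples_per_frame):
        self.samples_per_frame = samples_per_frame
        self.accumulator = 0

    def report(self, equivalent_samples):
        """Add one feedback report; return how many encoded frames to start now."""
        self.accumulator += equivalent_samples
        frames_to_start = 0
        # Assumption: a frame is started once a full frame's worth has accumulated.
        while self.accumulator >= self.samples_per_frame:
            self.accumulator -= self.samples_per_frame  # the remainder is kept
            frames_to_start += 1
        return frames_to_start


scheduler = FrameScheduler(samples_per_frame=1152)
starts = [scheduler.report(48) for _ in range(48)]  # 48 samples per report
print(sum(starts), scheduler.accumulator)           # 2 0
```

Keeping the remainder in the accumulator means the long-term rate of encoded audio frames tracks the rate requested by the endpoint, even when that rate is not a whole number of frames per report interval.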
2.3.8 Supported Formats
The following sections list all currently supported Type II Audio Data Formats. Format-specific descriptors and format-specific requests are explained in more detail.
2.3.8.1 MPEG Format
In the current specification, only MPEG decoding aspects are considered. Real-time MPEG encoding peripherals are not (yet) available and consequently are not covered by this specification.
2.3.8.1.1 MPEG Format-Specific Descriptor
The wFormatTag field is a duplicate of the wFormatTag field in the class-specific AudioStreaming interface descriptor. The same field is used here to identify the format-specific descriptor.
The bmMPEGCapabilities bitmap field describes the capabilities of the MPEG decoder built into the AudioStreaming interface.
Some general information must be retrieved from the Format Type-specific descriptor. For instance, the sampling frequencies supported by the decoder are reported through the Format Type-specific descriptor. This includes the ability of the decoder to handle low sampling frequencies (16 kHz, 22.05 kHz and 24 kHz) besides the standard 32 kHz, 44.1 kHz and 48 kHz sampling frequencies.
Bits D2..0 of the bmMPEGCapabilities field are used to indicate which layers this decoder is capable of processing. The different layers relate to the different algorithms that are used during encoding and decoding.
Bit D3 indicates that the decoder can only process the MPEG-1 base stream. Therefore, only Left and Right channels will be output.
Bit D4 indicates that the decoder can handle MPEG-2 streams that contain two independent stereo pairs instead of the normal 3/2 encoding scheme. This bit is only applicable for MPEG-2 decoders.
Bit D5 indicates that the decoder supports the MPEG dual channel mode. In this case, the MPEG-1 base stream does not contain Left and Right channels of a stereo pair but instead contains two independent mono channels. One of these channels can be selected through the proper request (Dual Channel Control) and reproduced over the Left and Right output channels simultaneously.
Bit D6 indicates that the decoder supports the DVD MPEG-2 augmentation to 7.1 channels instead of the standard 5.1 channels.
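The capability bits discussed so far can be decoded with a small helper such as the sketch below. The function name and dictionary keys are hypothetical. The mapping of D0, D1 and D2 to Layers I, II and III is an assumption, since the text above only states that bits D2..0 indicate the supported layers.

```python
def decode_mpeg_capabilities(bm_mpeg_capabilities):
    """Decode bits D6..D0 of bmMPEGCapabilities into a readable dict."""
    layer_names = ("Layer I", "Layer II", "Layer III")  # assumed order D0, D1, D2
    return {
        "layers": [name for bit, name in enumerate(layer_names)
                   if bm_mpeg_capabilities & (1 << bit)],
        "mpeg1_only": bool(bm_mpeg_capabilities & (1 << 3)),         # D3
        "mpeg2_dual_stereo": bool(bm_mpeg_capabilities & (1 << 4)),  # D4
        "dual_channel": bool(bm_mpeg_capabilities & (1 << 5)),       # D5
        "dvd_7_1_augmentation": bool(bm_mpeg_capabilities & (1 << 6)),  # D6
    }


caps = decode_mpeg_capabilities(0b0100011)
print(caps["layers"], caps["dual_channel"])  # ['Layer I', 'Layer II'] True
```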