Universal Serial Bus Device Class Definition for Audio Data Formats
Release 2.0 May 31, 2006 11
2 Audio Data Formats
Audio Data Formats can be divided into two main groups:
• Simple Audio Data Formats
• Extended Audio Data Formats
Simple Audio Data Formats can then be subdivided into four groups according to type.
The first group, Type I, deals with audio data streams that are transmitted over USB and are constructed on a sample-by-sample basis. Each audio sample is represented by a single independent symbol, contained in an audio subslot. Different compression schemes may be used to transform the audio samples into symbols.
Note: This is different from encoding. Compression is considered to take place on a per-audio-sample basis. Each audio sample generates one symbol (e.g. A-law compression, where a 16-bit audio sample is compressed into an 8-bit symbol).
If multiple physical audio channels are formatted into a single audio channel cluster, then the samples taken at time x from successive channels are first placed into audio subslots. These audio subslots are then interleaved, according to the cluster channel ordering as described in the main USB Audio Specification, and grouped into an audio slot. The audio samples taken at time x+1 are interleaved in the same fashion to generate the next audio slot, and so on. The notion of physical channels is explicitly preserved during transmission. A typical example of Type I formats is the standard PCM audio data. The following figure illustrates the concept.
Figure 2-1: Type I Audio Stream
The second group, Type II, deals with those formats that do not preserve the notion of physical channels during the transmission over USB. Typically, all non-PCM encoded audio data streams belong to this group. A number of audio samples, often originating from multiple physical channels and taken over a certain period of time, are encoded into a number of bits in such a way that, after transmission, the original audio samples can be reconstructed to a certain degree of accuracy. The number of bits used for transmission is typically one or more orders of magnitude smaller than the number of bits needed to represent the original PCM audio samples, effectively realizing a considerable bandwidth reduction during transmission.
Figure 2-2: Type II Audio Stream
The third group, Type III, contains special formats that do not fit into either of the previous groups. In fact, they mix characteristics of the Type I and Type II groups to transmit audio data streams over USB. One or more non-PCM encoded audio data streams are packed into “pseudo-stereo samples” and transmitted as if they were real stereo PCM audio samples. The sampling frequency of these pseudo samples matches the sampling frequency of the original PCM audio data streams. Therefore, clock recovery at the receiving end is easier than it is in the case of Type II formats. The drawback is that, unless multiple non-PCM encoded streams are packed into one pseudo-stereo stream, more bandwidth than necessary is consumed.
The fourth group, Type IV, deals with audio streams that are not transmitted over USB. Instead, they interface with the audio function through an AudioStreaming interface that does not contain a USB isochronous IN or OUT endpoint. These streams typically connect via a digital interface like S/PDIF (or some other means of connectivity) but require interaction from the Host before they enter or leave the audio function. A typical example is an external S/PDIF connector that can accept an AC-3 encoded audio stream. This stream is first processed by an AC-3 decoder before the (decoded) logical audio channels enter the audio function through the Input Terminal that represents this S/PDIF connection. The capabilities of the AC-3 decoder are advertised by means of the AC-3 Decoder descriptor and the decoder Controls can be programmed through the AudioStreaming interface.
In addition to the Simple Audio Data Formats described above, Extended Audio Data Formats are defined. These are based on the Simple Audio Data Formats Type I, II, and III definitions, but they provide an optional packet header and, for the Extended Audio Data Format Type I, an optional synchronous (i.e. sample-accurate) control channel. Type IV Audio Data Formats do not have an Extended Audio Data Format definition.
Section A.1, “Format Type Codes” summarizes the Audio Data Formats that are currently supported by the Audio Device Class. The following sections explain those formats in more detail.
2.1 Transfer Delimiter
Isochronous data streams are continuous in nature, although the actual number of bytes sent per packet may vary throughout the lifetime of the stream (for rate adaptation purposes for instance). To indicate a temporary stop in the isochronous data stream without closing the pipe (and thus relinquishing the USB
bandwidth), an in-band Transfer Delimiter needs to be defined. This specification considers two situations to be a Transfer Delimiter. The first is a zero-length data packet and the second is the absence of an isochronous transfer in a USB (micro)frame that would normally have an isochronous transfer. Both situations are considered equivalent and the audio function is expected to behave the same. However, the second type consumes less isochronous USB bandwidth (i.e. zero bandwidth). In both cases, this specification considers a Transfer Delimiter to be an entity that can be sent over the USB.
2.2 Virtual Frame and Virtual Frame Packet Definitions
To better describe packetization for audio the concept of a “virtual frame” (VF) is introduced. A virtual frame is defined as:
$$VF = \text{(micro)frame} \times 2^{(bInterval - 1)}$$
In addition, a “virtual frame packet” (VFP) is introduced. A virtual frame packet is defined as a packet that contains all the samples that are transferred over the bus during a virtual frame. For full-/high-speed endpoints, the virtual frame packets are exactly the same as the physical packets that are transferred over USB. However, for high-speed high-bandwidth endpoints, the virtual frame packet is the concatenation of the two or three physical packets that are transferred over the bus in a microframe.
Note: The USB Specification already considers the 2 or 3 transactions of a high-speed high-bandwidth transfer to be part of a single packet. See Section 5.12.3, “Clock Synchronization”, of the USB Specification.
The above definitions provide a model of ‘one (virtual frame) packet per (virtual) frame’, irrespective of the actual transactions on the USB.
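As an illustration of the virtual frame definition above, the following minimal sketch computes the virtual frame duration from bInterval. The function name and the `high_speed` flag are hypothetical; the base durations assume a 1 ms full-speed frame and a 125 µs high-speed microframe.

```python
from fractions import Fraction

# Assumed base durations, in seconds.
FRAME_S = Fraction(1, 1000)       # full-speed frame: 1 ms
MICROFRAME_S = Fraction(1, 8000)  # high-speed microframe: 125 us


def virtual_frame_duration(b_interval, high_speed):
    """Return the virtual frame duration: (micro)frame * 2**(bInterval - 1)."""
    if b_interval < 1:
        raise ValueError("bInterval must be at least 1")
    base = MICROFRAME_S if high_speed else FRAME_S
    return base * 2 ** (b_interval - 1)
```

Under these assumptions, a high-speed endpoint with bInterval = 4 has a virtual frame of 8 microframes, which is the same 1 ms duration as a full-speed endpoint with bInterval = 1.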
2.3 Simple Audio Data Formats
2.3.1 Type I Formats
The following sections describe the Audio Data Formats that belong to Type I. A number of terms and their definitions are presented.
2.3.1.1 USB Packets
Audio data streams that are inherently continuous must be packetized when sent over the USB. The quality of the packetizing algorithm directly influences the amount of effort needed to reconstruct a reliable sample clock at the receiving side.
The goal must be to keep the instantaneous number of audio slots per virtual frame, ni, as close as possible to the average number of audio slots per virtual frame, nav. The average nav must be calculated as follows:
$$n_{av} = \frac{T_{VF}}{\Delta t}$$
where TVF is the duration of a virtual frame and Δt is the sample time (1/FS). In most cases nav will be a number with a fractional part.
If the sampling rate is a constant, the allowable variation on ni is limited to one audio slot, that is, Δni = 1. This implies that all virtual frame packets must contain either INT(nav) audio slots (small VFP) or INT(nav) + 1 audio slots (large VFP). For all i:
ni = INT(nav) | INT(nav) + 1
Note: In the case where nav = INT(nav), ni may vary between INT(nav) - 1 (small VFP), INT(nav) (medium VFP) and INT(nav) + 1 (large VFP).
Furthermore, a large VFP must be generated as soon as it becomes available. Typically, a source will generate a number of small VFPs as long as the accumulated fractional part of nav remains < 1. Once the
accumulated fractional part of nav becomes ≥ 1, the source must send a large VFP and decrement the accumulator by 1.
Because the source and the sink may have different notions of time (they might each have their own independent sampling clock), the (small VFP)/(large VFP) pattern generated by the source may be different from what the sink expects. Therefore, the sink must be capable of accepting a large VFP at all times.
Example:
Assume FS = 44,100 Hz and TVF = 1 ms. Then nav = 44.1 audio slots. Since the source can only send an integer number of audio slots per VF, it will send small VFPs of 44 audio slots. In each VF, it therefore sends ‘0.1 slot’ too few, and it accumulates this fractional part in an accumulator. After having sent 9 small VFPs of 44 audio slots, at the tenth VF it will have exactly one audio slot in excess and can therefore send a large VFP containing 45 audio slots. Decrementing the accumulator by 1 brings it back to 0, and the process can start all over again. The source will thus produce a repetitive pattern of 9 small VFPs of 44 audio slots followed by 1 large VFP of 45 audio slots. The following table illustrates the process:
| #VF | nav | ni | Fraction | Accumulator |
| n | 44.1 | 44 | 0.1 | 0.1 |
| n+1 | 44.1 | 44 | 0.1 | 0.2 |
| n+2 | 44.1 | 44 | 0.1 | 0.3 |
| n+3 | 44.1 | 44 | 0.1 | 0.4 |
| n+4 | 44.1 | 44 | 0.1 | 0.5 |
| n+5 | 44.1 | 44 | 0.1 | 0.6 |
| n+6 | 44.1 | 44 | 0.1 | 0.7 |
| n+7 | 44.1 | 44 | 0.1 | 0.8 |
| n+8 | 44.1 | 44 | 0.1 | 0.9 |
| n+9 | 44.1 | 45 | 0.1 | 1.0 -> 0 |
| n+10 | 44.1 | 44 | 0.1 | 0.1 |
| n+11 | 44.1 | 44 | 0.1 | 0.2 |
| … | … | … | … | … |
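The accumulator procedure illustrated above can be sketched as follows. This is a minimal illustration, not a normative algorithm: the function name and parameters are hypothetical, and exact rational arithmetic is used so that the accumulated fraction does not drift.

```python
import math
from fractions import Fraction


def vfp_sizes(sample_rate_hz, t_vf_s, count):
    """Return the number of audio slots in each of `count` consecutive VFPs."""
    # Average number of audio slots per virtual frame: nav = TVF / (1 / FS).
    n_av = Fraction(sample_rate_hz) * Fraction(t_vf_s)
    small = math.floor(n_av)      # INT(nav): size of a small VFP
    fraction = n_av - small       # fractional part left over each virtual frame
    accumulator = Fraction(0)
    sizes = []
    for _ in range(count):
        accumulator += fraction
        if accumulator >= 1:
            # A whole extra audio slot is available: send a large VFP.
            sizes.append(small + 1)
            accumulator -= 1
        else:
            sizes.append(small)
    return sizes
```

Calling `vfp_sizes(44100, Fraction(1, 1000), 10)` yields nine entries of 44 followed by one entry of 45, which is the pattern shown in the table.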
2.3.1.2 Pitch Control
If the sampling rate can be varied (to implement pitch control), the allowable variation on ni is limited to one audio slot per virtual frame. For all i:
$$|n_{i+1} - n_i| \le 1$$
Pitch control is restricted to adaptive endpoints only. AudioStreaming interfaces that support pitch control on their isochronous endpoint are required to report this in the class-specific endpoint descriptor. In addition, a Set/Get Pitch Control request is required to enable or disable the pitch control functionality.
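The limit on the variation of ni under pitch control can be checked on a recorded sequence of VFP sizes. The sketch below reads “one audio slot per virtual frame” as meaning that consecutive virtual frame packets differ by at most one audio slot; that reading, and the function name, are assumptions made for illustration.

```python
def slot_counts_vary_by_at_most_one(slot_counts):
    """Return True if consecutive VFP slot counts differ by no more than one."""
    pairs = zip(slot_counts, slot_counts[1:])
    return all(abs(later - earlier) <= 1 for earlier, later in pairs)
```

A sequence such as 44, 45, 46, 46, 45 passes this check, whereas a jump from 44 directly to 46 does not.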
2.3.1.3 Audio Subslot
The basic structure used to represent audio data is the audio subslot. An audio subslot holds a single audio sample. An audio subslot always contains an integer number of bytes.
This specification limits the possible audio subslot sizes (bSubslotSize) to 1, 2, 3 or 4 bytes per audio subslot. An audio sample is represented using a number of bits (bBitResolution) less than or equal to the total number of bits available in the audio subslot, i.e. bBitResolution ≤ bSubslotSize*8.
AudioStreaming endpoints must be constructed in such a way that a valid transfer can take place as long as the reported audio subslot size (bSubslotSize) is respected during transmission. If the reported bits per sample (bBitResolution) do not correspond with the number of significant bits actually used during transfer, the device will either discard trailing significant bits ([actual_bits_per_sample] > bBitResolution) or interpret trailing zeros as significant bits ([actual_bits_per_sample] < bBitResolution).
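The two constraints on bSubslotSize and bBitResolution stated above can be expressed as a small validation routine. This is a sketch only: the function name is hypothetical, and the lower bound of one bit on bBitResolution is an assumption that the text does not state.

```python
VALID_SUBSLOT_SIZES = (1, 2, 3, 4)  # bytes per audio subslot allowed by the text


def check_subslot(b_subslot_size, b_bit_resolution):
    """Raise ValueError if the pair violates the audio subslot constraints."""
    if b_subslot_size not in VALID_SUBSLOT_SIZES:
        raise ValueError("bSubslotSize must be 1, 2, 3 or 4")
    # Assumption: at least one significant bit; the upper bound is from the text.
    if not 1 <= b_bit_resolution <= b_subslot_size * 8:
        raise ValueError("bBitResolution must not exceed bSubslotSize * 8")
```

For example, 24-bit samples are accepted in either a 3-byte or a 4-byte subslot, but not in a 2-byte subslot.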
2.3.1.4 Audio Slot
An audio slot consists of a collection of audio subslots, each containing an audio sample of a different physical audio channel, taken at the same moment in time. The number of audio subslots in an audio slot equals the number of logical audio channels in the audio channel cluster. The ordering of the audio subslots in the audio slot obeys the rules set forth in the USB Audio Specification. All audio subslots must have the same audio subslot size.
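The construction of audio slots from per-channel samples can be sketched as follows. The function name is hypothetical, and the byte layout inside each subslot (signed, little-endian) is an assumption made for illustration; this section does not define it.

```python
def build_audio_slots(channels, subslot_size):
    """Interleave per-channel sample lists into a list of audio slots (bytes).

    `channels` lists the sample sequences in cluster channel order.
    """
    if len({len(samples) for samples in channels}) > 1:
        raise ValueError("all channels must hold the same number of samples")
    slots = []
    # zip(*channels) groups the samples taken at the same moment in time.
    for samples_at_t in zip(*channels):
        # One subslot per channel, all with the same subslot size.
        slot = b"".join(
            sample.to_bytes(subslot_size, "little", signed=True)
            for sample in samples_at_t
        )
        slots.append(slot)
    return slots
```

Each resulting slot holds one subslot per channel, so its length is the number of channels multiplied by the subslot size.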
2.3.1.5 Audio Streams
An audio stream is a concatenation of a potentially very large number of audio slots, ordered according to ascending time. Streams are packetized when transported over USB whereby virtual frame packets can only contain an integer number of audio slots. Each packet always starts with the same channel, and the channel order is respected throughout the entire transmission. If, for any reason, there are no audio slots available to construct a VFP, a Transfer Delimiter must be sent instead.
2.3.1.6 Type I Format Type Descriptor
The Type I format type descriptor starts with the usual three fields: bLength, bDescriptorType, and bDescriptorSubtype.
The bFormatType field indicates this is a Type I descriptor. The bSubslotSize field indicates how many bytes are used to transport an audio subslot. The bBitResolution field indicates how many bits of the total number of available bits in the audio subslot are truly used by the audio function to convey audio information.
Table 2-2: Type I Format Type Descriptor
| Offset | Field | Size | Value | Description |
| 0 | bLength | 1 | Number | Size of this descriptor, in bytes: 6 |
| 1 | bDescriptorType | 1 | Constant | CS_INTERFACE descriptor type. |
| 2 | bDescriptorSubtype | 1 | Constant | FORMAT_TYPE descriptor subtype. |
| 3 | bFormatType | 1 | Constant | FORMAT_TYPE_I. Constant identifying the Format Type the AudioStreaming interface is using. |
| 4 | bSubslotSize | 1 | Number | The number of bytes occupied by one audio subslot. Can be 1, 2, 3 or 4. |
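A descriptor with this layout could be decoded along the following lines. This is a sketch under stated assumptions: the table as shown here ends at offset 4, so placing bBitResolution in the one-byte field at offset 5 is inferred from the stated bLength of 6 and from the field description in the text. The function and type names are hypothetical, and the constant fields are returned without being checked because their numeric codes are not given in this section.

```python
import struct
from collections import namedtuple

TypeIFormat = namedtuple(
    "TypeIFormat",
    "bLength bDescriptorType bDescriptorSubtype "
    "bFormatType bSubslotSize bBitResolution",
)


def parse_type_i_format(data):
    """Decode a 6-byte Type I format type descriptor from `data`."""
    if len(data) < 6:
        raise ValueError("descriptor is shorter than 6 bytes")
    # Six consecutive one-byte fields; offset 5 is assumed to be bBitResolution.
    desc = TypeIFormat(*struct.unpack_from("<6B", data))
    if desc.bLength != 6:
        raise ValueError("bLength must be 6")
    if desc.bSubslotSize not in (1, 2, 3, 4):
        raise ValueError("bSubslotSize must be 1, 2, 3 or 4")
    if desc.bBitResolution > desc.bSubslotSize * 8:
        raise ValueError("bBitResolution exceeds the subslot capacity")
    return desc
```

For instance, `parse_type_i_format(bytes([6, 0x24, 0x02, 0x01, 2, 16]))` returns a record with bSubslotSize 2 and bBitResolution 16. The bytes 0x24, 0x02 and 0x01 are placeholders for the CS_INTERFACE, FORMAT_TYPE and FORMAT_TYPE_I constants; the actual codes should be taken from the specification's appendices.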