USB Audio Analysis: In-Depth Insights into Core Technology and Diverse Applications

The Foundation of USB: The Cornerstone of Connection and Transmission

USB, the Universal Serial Bus, has been a fixture of personal computers for well over a decade. Its connectivity resembles an invisible web, effortlessly linking peripheral devices such as microphones, speakers, external drives, and webcams.

This article focuses on USB audio, the digital audio standard used across personal computers, smartphones, and tablets. It acts as a bridge connecting audio peripherals such as speakers, microphones, and mixers.

Let’s first explore the fundamentals of USB. USB operates under specific protocols, akin to a precisely orchestrated symphony, where the host computer plays the role of the conductor, initiating transmission commands to devices such as USB speakers. Each transmission is precisely directed at a specific device and its designated endpoint, much like an arrow hitting its target. There are four types of transmission: bulk transfer, isochronous transfer, interrupt transfer, and control transfer.

Bulk transfer acts like a steady and reliable messenger, focused on delivering data between the host and the device without errors. USB data packets carry a CRC (cyclic redundancy check) that guards against corruption. In bulk transfer, the data receiver must verify the CRC; if it is correct, the receiver acknowledges the transmission and the data is considered error-free. If the CRC is incorrect, the transmission is not acknowledged and will be retried. If the device is not ready to accept data, it can send a negative acknowledgment (NAK), prompting the host to attempt the transmission again. Bulk transfer is not time-sensitive: it uses whatever bandwidth remains once the transfer types with stricter timing requirements have been scheduled.
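The check-and-retry behaviour described above can be sketched in a few lines of Python. This is a simplified model, not how a real USB controller is programmed: on actual hardware the CRC and retries are handled by the controller, and NAK handling is left out here. The 16-bit CRC uses the parameters commonly listed for USB data packets; the names `bulk_send`, `receive`, and `flaky_channel` are hypothetical and exist only for illustration.

```python
def crc16_usb(data: bytes) -> int:
    """CRC-16 over the payload (polynomial 0x8005, bit-reflected as 0xA001)."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF


def receive(payload: bytes, crc: int) -> bool:
    """Receiver side: acknowledge only when the CRC matches the payload."""
    return crc16_usb(payload) == crc


def bulk_send(payload: bytes, channel, max_attempts: int = 3) -> int:
    """Resend until acknowledged; return the number of attempts used."""
    crc = crc16_usb(payload)
    for attempt in range(1, max_attempts + 1):
        delivered = channel(payload)  # the channel may corrupt the bytes
        if receive(delivered, crc):
            return attempt            # acknowledged: transfer confirmed
    raise IOError("transfer not acknowledged after retries")


def flaky_channel():
    """Hypothetical channel that flips one bit on the first attempt only."""
    calls = {"n": 0}

    def channel(payload: bytes) -> bytes:
        calls["n"] += 1
        if calls["n"] == 1:
            return bytes([payload[0] ^ 0x01]) + payload[1:]
        return payload

    return channel


attempts = bulk_send(b"audio", flaky_channel())
print(attempts)
```

The first attempt fails the CRC check and is silently retried, which is exactly the luxury that isochronous transfer, described next, does not have.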

Isochronous transfer, on the other hand, is like a time-conscious vanguard, used for real-time data transmission between the host and the device. When the host sets up an isochronous endpoint, it allocates specific bandwidth for that endpoint and periodically performs input or output transmissions at that endpoint. For instance, the host may output 1 KB of data to the device every 125 μs. However, because the bandwidth is fixed and limited, there is no time for retransmission if something goes wrong: the data still carries a CRC, but if the receiver detects an error, there is no mechanism for resending it.

Interrupt transfer functions like a diligent patrol officer: the host uses it to poll devices periodically for significant events. For example, the host may poll an audio device to check whether the MUTE button has been pressed. Despite the name, it is regular polling by the host rather than an actual interrupt raised in the host.

Control transfer, like bulk transfer, is acknowledged or rejected and is delivered in a non-real-time manner. It acts as a behind-the-scenes coordinator for operations outside the normal data stream, such as querying device capabilities or endpoint statuses. How device capabilities are described is beyond the scope of this article, but it is worth noting that there are predefined classes such as "USB Audio Class" and "USB Mass Storage Class," which are effective aids for achieving cross-platform interoperability.

All transfers are scheduled in USB frames, the smallest unit of bus time. On high-speed USB the interval is 125 μs (on full-speed USB it is 1 ms), and each interval is marked by the host sending a start-of-frame (SOF) message. Isochronous and interrupt transfers can transmit at most once per frame.

USB Audio: The Artful Arrangement of Data Transmission

Now, let’s examine the intricacies of USB audio. USB audio utilizes isochronous transfer, interrupt transfer, and control transfer to convey audio data. Isochronous transfer serves as a high-speed channel for transmitting audio data; interrupt transfer acts as the guardian of the audio clock, monitoring its availability; and control transfer functions as the master of audio settings, adjusting parameters such as volume and sample rate.

The data requirements vary with the number of channels, the sample bit depth, and the sample rate. Common channel counts include 2 (stereo), 6 (5.1), or even more for studio and DJ applications. The typical sample size is 24 bits, with 16 bits commonly used for traditional audio and 32 bits for high-quality audio. Common sample rates are 44.1, 48, 96, and 192 kHz, with the higher rates often used for high-quality audio production.

Suppose we design a stereo speaker with a sample rate of 96 kHz and a sample size of 24 bits. To simplify data handling between the host and the device, each 24-bit sample is often padded with a zero byte to 32 bits, resulting in a total data throughput of 96,000 samples × 2 channels × 4 bytes = 768,000 bytes/second. The isochronous endpoint transfers once every 125 μs, which equates to 8,000 transfers per second. Dividing the required bytes per second by the number of transfers per second gives the byte count per isochronous transfer: 768,000 / 8,000 = 96 bytes per transfer.
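The same arithmetic can be written as a small helper, which makes it easy to try other formats. This is only a sketch of the calculation above; the function name and parameters are illustrative, and the default of 8,000 transfers per second assumes the 125 μs interval used in the example.

```python
def bytes_per_transfer(sample_rate_hz: int, channels: int,
                       bytes_per_sample: int,
                       transfers_per_second: int = 8000) -> float:
    """Average payload size of one isochronous transfer, in bytes."""
    bytes_per_second = sample_rate_hz * channels * bytes_per_sample
    return bytes_per_second / transfers_per_second


# The stereo 96 kHz speaker from the text, 24-bit samples padded to 4 bytes.
speaker = bytes_per_transfer(96_000, channels=2, bytes_per_sample=4)
print(speaker)

# A 44.1 kHz stream does not divide evenly into 8,000 transfers per second.
cd_rate = bytes_per_transfer(44_100, channels=2, bytes_per_sample=4)
print(cd_rate)
```

When the result is not a whole number of sample frames, as in the 44.1 kHz case, the transfers cannot all be the same size and the host has to vary the number of samples from one transfer to the next.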

Clock Synchronization: The Key to Temporal Coordination

USB clock synchronization is also a crucial element. In the realm of digital audio, it is akin to navigating a maze of time, requiring a shared understanding of a common time concept. As mentioned earlier, USB frames are transmitted 8,000 times per second, while the speaker plays 96,000 samples per second. Only when the speaker and the host reach a consensus on the length of one second can they operate harmoniously.

USB audio provides three modes to ensure that the host and speaker progress together in time. In synchronous mode, the length of one second is defined by the host, which sends data at a specific rate that the device must match. Asynchronous mode operates in the opposite manner: the device defines the length of one second, and the host adjusts accordingly. In adaptive mode, the data stream itself defines the clock, and the device derives its timing from the rate at which data arrives.
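A device declares which of these modes an endpoint uses in its endpoint descriptor. As a rough illustration, the sketch below decodes the `bmAttributes` byte of a standard USB 2.0 endpoint descriptor, where the two lowest bits give the transfer type and the next two bits give the synchronization type. The function name and lookup tables are hypothetical, and the synchronization bits are only meaningful for isochronous endpoints.

```python
# Bits 1..0 of bmAttributes: transfer type.
TRANSFER_TYPES = ("control", "isochronous", "bulk", "interrupt")
# Bits 3..2 of bmAttributes: synchronization type (isochronous only).
SYNC_TYPES = ("none", "asynchronous", "adaptive", "synchronous")


def decode_bm_attributes(bm_attributes: int):
    """Return (transfer type, synchronization type) for an endpoint."""
    transfer = TRANSFER_TYPES[bm_attributes & 0x03]
    sync = SYNC_TYPES[(bm_attributes >> 2) & 0x03]
    return transfer, sync


for value in (0x05, 0x09, 0x0D):
    print(hex(value), decode_bm_attributes(value))
```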

However, neither adaptive nor synchronous mode is flawless, as personal computers often struggle to maintain stable clocks, and other audio sources, such as external digital tape decks, may bring clocks of their own. Asynchronous mode uses an external clock source or a low-jitter clock within the device as the master clock, typically relying on a crystal-based PLL. There are therefore at least two independent clocks in the system: one drives the USB transmission frequency of 8,000 times per second, and the other drives the sample rate (e.g., 96,000 Hz).

These clock frequencies may differ slightly and drift over time. For example, at a nominal 96,000 Hz sample rate, the device may actually consume an average of 12.001 samples per 125 μs interval rather than exactly 12. To ensure that the host sends the correct amount of data, it requests the current rate through a dedicated feedback endpoint. Every few milliseconds, the device reports the average number of samples per interval over the preceding period in 16.16 fixed-point format. If that average is 12.001, the reported value is 0x000C0041 (65536 × 12.001, truncated to an integer).
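The 16.16 conversion is easy to verify by hand. The sketch below encodes a rate as a 32-bit value with 16 integer bits and 16 fractional bits, and decodes it again; the helper names are illustrative, and truncation (rather than rounding) is assumed because it reproduces the value quoted above.

```python
def to_16_16(value: float) -> int:
    """Encode a non-negative number as 16.16 fixed point, truncating."""
    return int(value * 65536) & 0xFFFFFFFF


def from_16_16(raw: int) -> float:
    """Decode a 16.16 fixed-point value back to a float."""
    return raw / 65536


raw = to_16_16(12.001)
print(f"0x{raw:08X}")   # integer part 0x000C = 12, fraction 0x0041 = 65/65536
print(from_16_16(raw))
```

The fractional step of 1/65536 means the decoded value is slightly below 12.001, which is precise enough for the host to track the device clock over many transfers.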

With this average rate, the host can calculate when to send an additional sample: in this case, about 8 of the 8,000 transfers each second (roughly one in every thousand) carry 13 samples instead of 12. The host can also use this value to stay synchronized with the audio device, so that host applications like DVD players keep video and audio in sync. Otherwise, the audio would gradually get ahead of the video; at the rate in this example, the lead would reach about 0.6 seconds after two hours.
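One common way for a host to turn the feedback value into per-transfer sample counts is a fractional accumulator: add the 16.16 value on every transfer, send the integer part, and carry the fractional remainder forward. The sketch below is an assumption about how such a scheduler could look, not a description of any particular host stack, and `packet_sizes` is a hypothetical name.

```python
def packet_sizes(feedback_16_16: int, transfers: int):
    """Yield the number of samples to send in each successive transfer."""
    acc = 0
    for _ in range(transfers):
        acc += feedback_16_16
        yield acc >> 16   # whole samples available for this transfer
        acc &= 0xFFFF     # keep only the fractional remainder


# One second of transfers (8,000) using the feedback value from the text.
sizes = list(packet_sizes(0x000C0041, 8000))
print(sorted(set(sizes)))
print(sum(sizes), sizes.count(13))
```

With the truncated value 0x000C0041, which is slightly below 12.001, the first 8,000 transfers contain 7 transfers of 13 samples and the rest carry 12, so a longer transfer appears about once every 1,008 transfers. No samples are lost along the way, because the remainder is carried forward rather than discarded.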

To maintain a short feedback loop, it is crucial to avoid unnecessarily buffering audio data packets and feedback packets. Any additional buffering will introduce delays in the reports, making it more challenging to maintain smooth traffic. This means that the underlying USB protocol stack and the USB audio protocol stack should be tightly integrated without buffering in between. While implementing this on application processors is quite challenging, it is straightforward on embedded processors with predictable execution times.

In summary, maintaining a consistent concept of time is vital in the digital audio world. The three modes provided by USB audio (synchronous, asynchronous, and adaptive) each play their part in keeping the host and peripheral devices synchronized, with asynchronous mode being the more reliable because the master clock is an external or low-jitter device clock rather than the host's.

Complex Devices: Precise Choices for Multiple Clocks

In complex devices like mixers, several interfaces may each supply a sample clock. USB audio allows designers to deploy clock selectors that choose the clock source from multiple inputs (such as the clock recovered from an S/PDIF connection, a local oscillator, or the clock from an ADAT connection). Users can pick the source, for example the S/PDIF clock, through a control transfer.

Compliance Support: A Smooth Bridge for System Integration

In terms of compliance and native support, adhering to the USB Audio Class 2.0 standard ensures seamless integration between devices and operating systems, allowing easy control of parameters such as volume and sample rate through standard operating system dialog boxes.

The Outstanding Contributions of USB Audio

In conclusion, leveraging the robust advantages of high-speed USB 2.0, USB Audio Class 2.0 successfully establishes a low-latency transmission bridge between PCs and audio devices, ensuring high throughput and excellent sound quality. Its applicability spans a wide range, serving various devices from complex mixers to surround sound systems, PC speakers, and microphones, injecting continuous vitality and potential into the development of the audio field.