Life of a UART-over-USB Packet (RX)
Scenario: STM32 sends data over UART to a PC through an FTDI USB-Serial bridge. Goal: trace one byte from the STM32 TX pin to the return of a user-process read(), hop by hop.
Background
UART is the workhorse of MCU ↔ PC communication in our projects, yet latency and throughput have consistently been worse than expected. After too many rounds of intuition-based tuning, we decided to map out every hop end-to-end rather than guess. The STM32 side is straightforward; the complexity lives in the Linux/x86 software stack on the PC — many layers, each with hidden latency. This post walks through the full path from hardware signal to userspace read() return.
On Linux/ARM hosts (Raspberry Pi, Jetson) the hardware path differs slightly (no PCIe/xHCI, different USB OTG controller), but the OS software path is essentially the same.
Part I: Hardware Path — STM32 to PC Memory
Hardware Architecture
Three physical domains:
- STM32: serializes bytes and clocks them out bit-by-bit at the configured baud rate
- FTDI chip: UART → USB protocol bridge; manages buffering and packet timing
- PC: USB host actively polls for data; path is xHCI → DMA → kernel driver stack → user process
1.1 UART Signal (STM32 → FTDI)
The STM32 UART peripheral serializes each byte into an NRZ waveform, toggling the TX pin at the baud rate. Every 8N1 frame:
idle(H) | start(L) | D0 D1 D2 D3 D4 D5 D6 D7 | stop(H) | idle
The FTDI RXD pin samples this continuously; its internal UART engine recovers the byte and writes it into the on-chip device-side RX FIFO.
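As a minimal sketch of the 8N1 framing described above, the following Python models one frame as a list of line levels. The function names frame_8n1 and decode_8n1 are made up for this illustration; real hardware also oversamples each bit, which is not modeled here.

```python
def frame_8n1(byte):
    """Return the 10 line levels (1 = high, 0 = low) of one 8N1 frame."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in range 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # D0 (LSB) goes out first
    return [0] + data_bits + [1]  # start bit is low, stop bit is high


def decode_8n1(levels):
    """Recover the byte from one 10-level frame, checking start and stop bits."""
    if len(levels) != 10 or levels[0] != 0 or levels[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(levels[1:9]))
```

For example, frame_8n1(0x41) yields a low start bit, the bits of 0x41 starting with the least significant one, and a high stop bit.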
1.2 Inside FTDI: Latency Timer and USB Packetization
FTDI cannot push data spontaneously — USB is Host-driven, so the device can only respond to polls. FTDI has two flush triggers:
| Trigger | Description |
|---|---|
| RX FIFO reaches one USB payload's worth (FS: 62 B, HS: 510 B) | Mark endpoint ready; send on next IN poll |
| Latency Timer expires (default 16 ms, tunable; 1 ms recommended) | Flush whatever is in the FIFO to avoid stalling small packets forever |
On data loss: neither trigger is an overflow point. Data loss happens when the entire FTDI RX FIFO overflows — i.e., the MCU transmit rate persistently exceeds the USB drain rate, causing a UART overrun error.
Latency timer and throughput:
The host polling frequency (FS: every 1 ms, HS: every 125 µs) is independent of latency_timer. The timer only controls when FTDI marks data as ready:
- Continuous high-rate stream (UART rate > 62 B/ms): the 62 B threshold fires frequently; latency_timer barely matters
- Protocol-style communication (request-response, or bursts with a short tail): bytes at the end of each burst stall in the FIFO waiting for the timer, directly increasing round-trip latency. The larger the timer value, the lower the effective throughput
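To put rough numbers on the request-response case, here is a simplified worst-case model. It assumes that a reply whose length is not a multiple of the USB payload size waits one full timer period before it is flushed, and it ignores request time, USB frame timing and host scheduling. The function name and parameters are invented for this sketch.

```python
def max_round_trips_per_s(reply_bytes, baud, latency_timer_ms, usb_payload=62):
    """Rough rate of request/response cycles under a worst-case timer wait."""
    wire_ms = reply_bytes * 10 * 1000.0 / baud      # 8N1: 10 bits per byte
    tail_bytes = reply_bytes % usb_payload           # bytes left below the flush threshold
    timer_wait_ms = latency_timer_ms if tail_bytes else 0.0
    return 1000.0 / (wire_ms + timer_wait_ms)


slow = max_round_trips_per_s(10, 115200, latency_timer_ms=16)
fast = max_round_trips_per_s(10, 115200, latency_timer_ms=1)
```

Under these assumptions a 10-byte reply at 115200 baud is limited to roughly 59 cycles per second with a 16 ms timer and roughly 535 with a 1 ms timer, which is why the timer setting dominates protocol-style traffic.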
# Read / set latency timer (ms; default 16, recommended 1)
cat /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
echo 1 > /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
1.3 USB 2.0: Host Polling → DATA Packet
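USB is host-driven: FTDI cannot push data. It waits for the host controller (xHCI) to issue an IN token on the bulk endpoint, then answers with a DATA packet if the endpoint is ready or a NAK if it is not. The host acknowledges a received DATA packet with an ACK.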
USB FS peak effective throughput is ~1.2 MB/s; USB HS is ~48 MB/s. In virtually all UART use cases USB is not the bottleneck — see Appendix II.
1.4 xHCI: Receive USB Packet → DMA → Interrupt
When xHCI receives a USB DATA packet, three things happen in sequence:
① USB PHY → xHCI on-chip packet buffer: the USB physical layer receives the signal; xHCI's MAC assembles and CRC-verifies the complete USB packet in internal SRAM. This buffer is opaque to software — purely a hardware staging area.
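② DMA write to DRAM: once verified, xHCI's transfer engine DMAs the payload directly into the host DRAM buffer attached to the URB (USB Request Block). The driver allocated that buffer in advance, and its DMA address was handed to xHCI when the URB was submitted. (usb-serial read buffers are ordinary kmalloc() memory that the USB core maps for DMA at submit time, rather than dma_alloc_coherent() allocations.)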
③ MSI triggers CPU interrupt: xHCI issues a PCIe Memory Write TLP to 0xFEE000xx. The Root Complex recognizes the APIC address range and routes it to the target CPU's Local APIC, firing a hard IRQ.
APIC addressing: x86 has two interrupt controllers — Local APIC (one per core, inside the CPU) and I/O APIC (one on the motherboard, routes traditional PCI/ISA interrupt lines). PCIe MSI bypasses the I/O APIC entirely, writing directly to the target CPU's Local APIC MMIO address. Which CPU receives the interrupt is determined by kernel IRQ affinity (default CPU 0; irqbalance auto-balances; /proc/irq/<N>/smp_affinity for manual override).
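The affinity file holds a hexadecimal CPU bitmask, with commas separating 32-bit groups on larger systems. A small sketch for turning that mask into CPU numbers follows; the helper names are invented here, and read_irq_affinity only works on a Linux host with a valid IRQ number.

```python
def cpus_from_affinity_mask(mask_text):
    """Convert an smp_affinity hex mask such as '00000000,0000000f' to CPU numbers."""
    value = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def read_irq_affinity(irq):
    """Read /proc/irq/<irq>/smp_affinity (Linux only) and return the allowed CPUs."""
    with open(f"/proc/irq/{irq}/smp_affinity") as f:
        return cpus_from_affinity_mask(f.read())
```

The IRQ number for the xHCI controller can be looked up in /proc/interrupts before calling read_irq_affinity.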
Hardware-side data locations summary:
| Data | Location |
|---|---|
| Raw UART bytes | FTDI on-chip RX FIFO (device side) |
| USB payload | DRAM URB buffer (host side, driver-allocated) |
| Transfer completion notification | DRAM Event Ring TRB |
| Interrupt signal | MSI → Local APIC IRR |
At this point the byte is in host DRAM. Hardware is done; the OS takes over.
Part II: OS Path — Hard IRQ to User Process
2.1 Hard IRQ: xhci_irq
The CPU acknowledges the APIC interrupt, consults the IDT, and jumps into the xHCI driver's ISR:
xhci_irq()                               # drivers/usb/host/xhci-ring.c
  └─ handle_event()
       └─ handle_tx_event()
            └─ usb_hcd_giveback_urb()    # hand the filled URB back up the stack
Hard IRQ context: no sleeping, no mutex, must return quickly. usb_hcd_giveback_urb() transitions into softirq context.
2.2 Softirq: FTDI Driver Processes the URB
usb_serial_generic_read_bulk_callback()
  └─ ftdi_process_read_urb()             # drivers/usb/serial/ftdi_sio.c
       ├─ strip the 2-byte FTDI modem-status header at the front of each packet
       ├─ tty_insert_flip_string()       # copy payload into the tty flip buffer
       └─ tty_flip_buffer_push()
            └─ queue_work(system_unbound_wq, &buf->work)
                 │
       [softirq returns quickly; remaining work queued for kworker]
Softirq cannot sleep, cannot take a mutex, and must not run long, so the rest is handed off to a workqueue (a kworker thread running in process context, where sleeping and locking are allowed).
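The header-stripping step can be sketched in a few lines of Python. This is an illustration of the idea, not the driver's code: it assumes the URB buffer is a concatenation of USB packets of at most max_packet_size bytes (64 on Full Speed), each starting with two status bytes, and the function name is made up.

```python
def strip_ftdi_status(urb_data, max_packet_size=64):
    """Drop the 2-byte status header of every USB packet in a URB buffer."""
    payload = bytearray()
    for offset in range(0, len(urb_data), max_packet_size):
        packet = urb_data[offset:offset + max_packet_size]
        if len(packet) < 2:
            continue  # too short to hold a status header; skip it
        payload.extend(packet[2:])  # everything after the header is UART data
    return bytes(payload)


# Two packets: a full one (2 + 62 bytes) and a short tail (2 + 3 bytes).
# The status byte values 0x01 0x60 are only example placeholders.
urb = b"\x01\x60" + bytes(range(62)) + b"\x01\x60" + b"abc"
data = strip_ftdi_status(urb)
```

This also shows where the 62-byte payload figure from 1.2 comes from: a 64-byte Full Speed packet carries 2 status bytes and 62 data bytes.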
2.3 Workqueue / kworker: Line Discipline
flush_to_ldisc()                         # drivers/tty/tty_buffer.c
  └─ n_tty_receive_buf_common()          # drivers/tty/n_tty.c
       ├─ handle canonical/raw mode, echo, etc.
       └─ wake_up_interruptible_poll(&tty->read_wait, EPOLLIN | EPOLLRDNORM)
wake_up_interruptible_poll() walks the tty->read_wait wait queue and wakes the tasks sleeping there: each one is set to TASK_RUNNING (the Linux state that covers "runnable") and placed on a scheduler run queue.
2.4 User Process: select() Wakes Up, read() Returns
The user process sleeps inside select([fd], ...). n_tty_poll() registers it on tty->read_wait:
static __poll_t n_tty_poll(struct tty_struct *tty, struct file *file,
                           poll_table *wait)
{
    __poll_t mask = 0;                        // abridged: only the readable-data path is shown

    poll_wait(file, &tty->read_wait, wait);   // register on the wait queue
    if (input_available_p(tty, 1))
        mask |= EPOLLIN | EPOLLRDNORM;
    return mask;
}
Once the kworker calls wake_up_interruptible_poll(), the process is runnable. When the scheduler picks it, select() returns, and the subsequent os.read(fd, N) copies the bytes from the tty buffer into userspace. Journey complete.
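The select() wake-up can be observed without any hardware by using a pseudo-terminal, which by default runs through the same n_tty line discipline. The sketch below is an illustration under that assumption, for Linux or another Unix with pty support; the function name is made up. A timer thread writes to the master side while the main thread sleeps in select() on the slave side.

```python
import os
import pty
import select
import threading
import time
import tty


def pty_wakeup_demo(delay_s=0.05, timeout_s=2.0):
    """Sleep in select() on a pty and return (data, seconds spent waiting)."""
    master_fd, slave_fd = pty.openpty()
    try:
        tty.setraw(slave_fd)  # raw mode: no line buffering, no echo
        writer = threading.Timer(delay_s, os.write, args=(master_fd, b"ping"))
        start = time.monotonic()
        writer.start()
        # Blocks until the line discipline reports readable data or timeout_s passes.
        ready, _, _ = select.select([slave_fd], [], [], timeout_s)
        waited = time.monotonic() - start
        data = os.read(slave_fd, 64) if ready else b""
        writer.join()
        return data, waited
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

The measured wait should be close to delay_s, far below timeout_s, because select() returns as soon as the data is available rather than when the timeout ends.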
2.5 OS Execution Timeline
Key takeaways:
- The interrupt source is the xHCI PCIe controller; FTDI never directly signals the CPU
- Softirq must not block; slow work goes to kworker (process context, can sleep)
- latency_timer affects burst-boundary latency and therefore effective throughput in protocol-based communication
- For continuous high-rate streams, the bottleneck is UART baud rate, not USB capacity
Part III: User-Library Latency Pitfall — pyserial as an Example
The hardware chain and kernel stack work fine; yet latency is still hundreds of milliseconds. The culprit is sometimes the API semantics of the user library. pyserial is a canonical example — any library built around "read exactly N bytes" can cause the same problem.
3.1 pyserial's read-N-bytes Semantics
pyserial's read(size) implementation (serialposix.py):
def read(self, size=1):
    read = bytearray()
    timeout = Timeout(self._timeout)        # e.g. 0.5 s
    while len(read) < size:                 # ← loop until size bytes accumulated
        ready, _, _ = select.select([self.fd, ...], [], [], timeout.time_left())
        if not ready:
            break                           # timeout exit
        buf = os.read(self.fd, size - len(read))
        read.extend(buf)
        if timeout.expired():
            break
    return bytes(read)
Semantics: "accumulate size bytes, or wait until timeout", not "return as soon as data arrives."
3.2 The Blocking Scenario at Stream End
Reading a variable-length sentinel-terminated stream with pyserial:
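while True:
    chunk = uart.read(4096)   # won't return until 4096 bytes or timeout
    buf.extend(chunk)
    if sentinel in buf:
        break
When the last packet (containing the sentinel) is only a few dozen bytes, read(4096) picks those bytes up but cannot reach its 4096-byte target, so it keeps waiting in select() until the timeout (0.5 s in this example) expires. Only then does the caller get a chance to check for the sentinel.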
pyserial has no bug — it was designed for "read a fixed-size block," not "read until sentinel." The mismatch between the two semantics causes a full timeout-worth of tail latency on every stream.
3.3 Fix: select + os.read + Sentinel Detection
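import os
import select
import time

# fd, deadline (a time.monotonic() value), buf (bytearray) and sentinel (bytes) come from the caller
while time.monotonic() < deadline:
    remaining = max(0.0, deadline - time.monotonic())   # never pass select() a negative timeout
    if not select.select([fd], [], [], remaining)[0]:
        break                                   # overall deadline reached
    chunk = os.read(fd, 65536)                  # drain up to 64 KB at once
    buf.extend(chunk)
    if sentinel in buf:
        buf = buf[: buf.index(sentinel)]
        break                                   # ← exit immediately on sentinel, no extra wait
| | pyserial read(4096) | raw select + os.read |
|---|---|---|
| Exit condition | Accumulate 4096 bytes or timeout | Sentinel found → exit immediately |
| Tail latency | ~0.5 s (the full timeout) | ~0 ms (returns as soon as the sentinel arrives) |
| Single read size | size - len(read) bytes | up to 65536 bytes, draining whatever the buffer holds |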
Bottom line: kernel-stack latency is in the microsecond range; it's the user-library API semantics that introduce hundreds of milliseconds at stream end. For variable-length sentinel streams, select + os.read is the right approach.
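The tail-latency difference can be reproduced without a serial port. The sketch below uses an os.pipe() as a stand-in for the tty and two hypothetical helpers, read_exact and read_until, that mimic the two semantics; neither is pyserial's actual code, and the 0.2 s timeout is an arbitrary choice to keep the demo short.

```python
import os
import select
import time


def read_exact(fd, size, timeout_s):
    """'Read exactly size bytes or time out' semantics."""
    data = bytearray()
    deadline = time.monotonic() + timeout_s
    while len(data) < size:
        remaining = deadline - time.monotonic()
        if remaining <= 0 or not select.select([fd], [], [], remaining)[0]:
            break                       # timeout: return what we have
        chunk = os.read(fd, size - len(data))
        if not chunk:
            break                       # writer closed the pipe
        data.extend(chunk)
    return bytes(data)


def read_until(fd, sentinel, timeout_s):
    """'Return as soon as the sentinel has arrived' semantics."""
    data = bytearray()
    deadline = time.monotonic() + timeout_s
    while sentinel not in data:
        remaining = deadline - time.monotonic()
        if remaining <= 0 or not select.select([fd], [], [], remaining)[0]:
            break
        chunk = os.read(fd, 65536)
        if not chunk:
            break
        data.extend(chunk)
    return bytes(data)


def demo(timeout_s=0.2):
    """Return {name: (data, elapsed_seconds)} for both read styles."""
    results = {}
    for name, fn, arg in (("fixed", read_exact, 4096), ("sentinel", read_until, b"\n")):
        r, w = os.pipe()
        os.write(w, b"short tail\n")    # a final packet far smaller than 4096 bytes
        start = time.monotonic()
        data = fn(r, arg, timeout_s)
        results[name] = (data, time.monotonic() - start)
        os.close(r)
        os.close(w)
    return results
```

Both styles return the same bytes, but the fixed-size read spends the whole timeout waiting for data that never comes, while the sentinel read returns almost immediately.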
Appendix I: DPDK Userspace Polling
The normal interrupt-driven path has latency in the tens-of-microseconds range, which is more than adequate for virtually all UART applications. For extreme latency requirements (HFT, telecom fronthaul) the kernel can be bypassed entirely:
Dedicated-core busy polling: DPDK's PMD (Poll Mode Driver) pins one or more CPU cores in a while(1) loop — not waiting for IRQs, but actively polling the DMA ring buffer. The core never sleeps, pushing latency down to a few hundred nanoseconds. The cost: 100% core utilization (vs. an idle core in C6/C7 at 0.3–1 W; a busy-polled core runs 10–30 W — a 10–30× power difference).
x86 offers middle-ground approaches (adaptive interrupt coalescing / NAPI, Intel DDIO, SmartNIC/DPU offload), but zero power and zero latency are physically contradictory — DPDK simply trades power for latency.
For UART-over-USB, DPDK offers almost no practical benefit. FTDI's latency_timer and the USB polling period are delays imposed by the device and the bus, so userspace polling on the host cannot eliminate them (the timer can only be lowered through the driver setting shown in 1.2). The real latency wins come from fixing the user-library semantics (see Part III), not from deploying DPDK.
Appendix II: Throughput Calculations
Actual link throughput is determined by the minimum of the two segment byte rates: the UART segment (STM32 → FTDI) and the USB segment (FTDI → PC). Both have protocol overhead, so byte rate — not wire rate — is the right unit for comparison:
- UART 8N1: 1 start + 8 data + 1 stop = 10 bits/byte → byte rate = baud rate ÷ 10
- USB FS: theoretical ceiling ~1.2 MB/s (19 bulk IN transactions × 64 B payload × 1000 frames/s), but FTDI in practice achieves 300–800 KB/s (system-dependent). After each URB completes the driver must resubmit a new one (software round-trip), and at ~19 interrupts/ms this overhead is significant
- USB HS: theoretical ~48 MB/s (480 Mbps × ~80% efficiency ÷ 8); UART is always the bottleneck
Bottleneck Analysis
Effective throughput = min(UART byte rate, USB byte rate)
| Typical combination | UART byte rate | USB practical ceiling | Bottleneck |
|---|---|---|---|
| 115200 bps + USB FS | 11.3 KB/s | ~800 KB/s | UART |
| 3 Mbps + USB FS (FT232R) | 293 KB/s | ~800 KB/s | UART |
| 12 Mbps + USB FS | 1172 KB/s | ~800 KB/s | USB FS |
| 12 Mbps + USB HS (FT232H) | 1172 KB/s | ~48 MB/s | UART |
Conclusion: in most UART scenarios the baud rate is the limit; pairing a 12 Mbps UART with USB FS makes USB the constraint.
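The min() rule can be checked with a few lines of Python. This sketch assumes 8N1 framing and 1 KB = 1024 B, which is what the first two UART rows of the table imply; the function names and the 800 KB/s practical ceiling passed in are illustrative inputs taken from the table, not measurements.

```python
KIB = 1024

# Theoretical USB FS bulk ceiling from the list above: 19 x 64 B x 1000 frames/s
USB_FS_THEORETICAL = 19 * 64 * 1000  # bytes per second


def uart_byte_rate(baud, bits_per_byte=10):
    """Bytes per second on the UART segment (8N1 uses 10 bits per byte)."""
    return baud / bits_per_byte


def bottleneck(baud, usb_bytes_per_s):
    """Return (limiting segment, effective bytes per second)."""
    uart = uart_byte_rate(baud)
    if uart <= usb_bytes_per_s:
        return "UART", uart
    return "USB", usb_bytes_per_s


for baud in (115200, 3_000_000, 12_000_000):
    segment, rate = bottleneck(baud, 800 * KIB)
    print(f"{baud} bps: {segment}-limited, {rate / KIB:.1f} KB/s")
```

Swapping in a different practical USB ceiling for a given system shows quickly where the crossover baud rate lies.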
Overrun: Conditions and Location
Overrun always occurs at the FTDI on-chip RX FIFO — the junction between the UART input and USB output. When the UART byte rate persistently exceeds the USB drain rate, the FIFO accumulates until it overflows and FTDI signals a UART overrun error; the data is gone.
The PC side (DRAM) never overflows: the kernel never refuses an incoming USB packet (URB buffers are large enough), so back-pressure ultimately manifests as FTDI FIFO overflow, not PC-side overflow.
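How quickly a sustained rate mismatch turns into an overrun can be estimated with a simple fill-rate model. This is a sketch under stated assumptions: constant input and drain rates, and a FIFO size passed in by the caller. The 256-byte figure used below is a placeholder, not a datasheet value, and the function name is invented.

```python
def seconds_until_overrun(fifo_bytes, uart_bytes_per_s, usb_drain_bytes_per_s):
    """Time for the RX FIFO to fill, or None if the drain keeps up."""
    surplus = uart_bytes_per_s - usb_drain_bytes_per_s
    if surplus <= 0:
        return None  # FIFO never fills under sustained load
    return fifo_bytes / surplus


# Placeholder FIFO size of 256 bytes; 12 Mbps UART against an 800,000 B/s drain.
t = seconds_until_overrun(256, 1_200_000, 800_000)
```

With these placeholder numbers the FIFO fills in well under a millisecond, so any sustained mismatch leads to data loss almost immediately; only short bursts can be absorbed.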