May 31, 2026·Quanta Control

Life of a UART-over-USB Packet (RX)

UART · USB · Linux · embedded · FTDI · xHCI · kernel

Scenario: STM32 sends data over UART to a PC through an FTDI USB-Serial bridge. Goal: trace one byte from the STM32 TX pin to a user-process read() return — every hop.

Background

UART is the workhorse of MCU ↔ PC communication in our projects, yet latency and throughput have consistently been worse than expected. After too many rounds of intuition-based tuning, we decided to map out every hop end-to-end rather than guess. The STM32 side is straightforward; the complexity lives in the Linux/x86 software stack on the PC — many layers, each with hidden latency. This post walks through the full path from hardware signal to userspace read() return.

On Linux/ARM hosts (Raspberry Pi, Jetson) the hardware path differs slightly (no PCIe/xHCI, different USB OTG controller), but the OS software path is essentially the same.

Part I: Hardware Path — STM32 to PC Memory

Hardware Architecture

Three physical domains:

•STM32: serializes bytes and clocks them out bit-by-bit at the configured baud rate
•FTDI chip: UART → USB protocol bridge; manages buffering and packet timing
•PC: USB host actively polls for data; path is xHCI → DMA → kernel driver stack → user process

1.1 UART Signal (STM32 → FTDI)

The STM32 UART peripheral serializes each byte into an NRZ waveform, toggling the TX pin at the baud rate. Every 8N1 frame:

 idle(H) | start(L) | D0 D1 D2 D3 D4 D5 D6 D7 | stop(H) | idle

The FTDI RXD pin samples this continuously; its internal UART engine recovers the byte and writes it into the on-chip device-side RX FIFO.
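The frame layout above can be written out as a minimal Python sketch. The function names `frame_8n1` and `frame_time_us` are illustrative only and do not come from any STM32 or FTDI API.

```python
def frame_8n1(byte: int) -> list:
    """Return the line levels of one 8N1 frame: start, D0..D7 (LSB first), stop."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # D0 is sent first
    return [0] + data_bits + [1]                     # start = low, stop = high


def frame_time_us(baud: int, bits_per_frame: int = 10) -> float:
    """Wire time of one frame in microseconds."""
    return bits_per_frame * 1_000_000 / baud


print(frame_8n1(0x55))
print(round(frame_time_us(115200), 1), "us per byte at 115200 baud")
```

Each byte costs 10 bit times on the wire, which is why the byte rate in Appendix II is the baud rate divided by 10.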

1.2 Inside FTDI: Latency Timer and USB Packetization

FTDI cannot push data spontaneously — USB is Host-driven, so the device can only respond to polls. FTDI has two flush triggers:

Trigger	Description
RX FIFO reaches one USB payload's worth (FS: 62 B, HS: 510 B)	Mark endpoint ready; send on next IN poll
Latency Timer expires (default 16 ms, tunable)	Flush whatever is in the FIFO to avoid stalling small packets forever

On data loss: neither trigger is an overflow point. Data loss happens when the entire FTDI RX FIFO overflows — i.e., the MCU transmit rate persistently exceeds the USB drain rate, causing a UART overrun error.

Latency timer and throughput:

The host's polling schedule (bus time is divided into 1 ms frames at FS and 125 µs microframes at HS) is independent of latency_timer. The timer only controls when FTDI marks data as ready:

•Continuous high-rate stream (UART rate > 62 B/ms): the 62 B threshold fires frequently; latency_timer barely matters
•Protocol-style communication (request-response, or bursts with a short tail): bytes at the end of each burst stall in the FIFO waiting for the timer, directly increasing round-trip latency. The larger the timer value, the lower the effective throughput

# Read / set the latency timer (ms; default 16, recommended 1). Writing requires root.
cat /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
echo 1 | sudo tee /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
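To make the effect of the two flush triggers concrete, here is a deliberately simplified model in Python. It assumes a burst arrives back to back, that every full 62-byte payload is flushed by the size trigger, and that the leftover tail waits at most one full timer period. The name `flush_plan` is hypothetical, and the model ignores USB frame alignment and where in its period the timer happens to be.

```python
def flush_plan(burst_len: int, latency_timer_ms: float, payload: int = 62):
    """Split a burst into size-triggered payloads and a timer-triggered tail.

    Returns (full_packets, tail_bytes, worst_case_tail_wait_ms).
    """
    full_packets, tail_bytes = divmod(burst_len, payload)
    # Only a non-empty tail has to wait for the latency timer.
    worst_wait = latency_timer_ms if tail_bytes else 0.0
    return full_packets, tail_bytes, worst_wait


for timer_ms in (16, 1):
    print(timer_ms, "ms timer ->", flush_plan(200, timer_ms))
```

For a 200-byte response, three payloads leave on the size trigger and the last 14 bytes wait for the timer, so the setting shows up almost directly in round-trip time.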

1.3 USB 2.0: Host Polling → DATA Packet

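USB is host-driven: FTDI cannot push data and has to wait for xHCI to ask. The host sends an IN token to the bulk endpoint; FTDI answers with a DATA packet if it has data marked ready, or with a NAK if it does not; the host ACKs each DATA packet it receives correctly.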

USB FS peak effective throughput is ~1.2 MB/s; USB HS is ~48 MB/s. In virtually all UART use cases USB is not the bottleneck — see Appendix II.

1.4 xHCI: Receive USB Packet → DMA → Interrupt

When xHCI receives a USB DATA packet, three things happen in sequence:

① USB PHY → xHCI on-chip packet buffer: the USB physical layer receives the signal; xHCI's MAC assembles and CRC-verifies the complete USB packet in internal SRAM. This buffer is opaque to software — purely a hardware staging area.

② DMA write to DRAM: once verified, xHCI's transfer engine DMAs the payload directly into the URB (USB Request Block) buffer in host DRAM. The driver allocated this buffer ahead of time, and the USB core mapped it for DMA when the URB was submitted.

③ MSI triggers CPU interrupt: after posting a completion TRB to the Event Ring in DRAM, xHCI issues a PCIe Memory Write TLP to an address in the 0xFEExxxxx range. The Root Complex recognizes the APIC address range and routes it to the target CPU's Local APIC, firing a hard IRQ.

APIC addressing: x86 has two interrupt controllers — Local APIC (one per core, inside the CPU) and I/O APIC (one on the motherboard, routes traditional PCI/ISA interrupt lines). PCIe MSI bypasses the I/O APIC entirely, writing directly to the target CPU's Local APIC MMIO address. Which CPU receives the interrupt is determined by kernel IRQ affinity (default CPU 0; irqbalance auto-balances; /proc/irq/<N>/smp_affinity for manual override).
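As a rough illustration of the addressing described above, the sketch below builds an MSI target address in the classic xAPIC layout (bits 31:20 fixed at 0xFEE, destination APIC ID in bits 19:12) and the hex CPU bitmask that /proc/irq/<N>/smp_affinity expects. It assumes no interrupt remapping is in use, and the helper names `msi_address` and `affinity_mask` are made up for this example.

```python
MSI_BASE = 0xFEE00000  # bits 31:20 of an x86 MSI address are fixed at 0xFEE


def msi_address(dest_apic_id: int, redirection_hint: int = 0, dest_mode: int = 0) -> int:
    """Compose an MSI address: destination ID in bits 19:12, RH in bit 3, DM in bit 2."""
    if not 0 <= dest_apic_id <= 0xFF:
        raise ValueError("xAPIC destination ID must fit in 8 bits")
    return MSI_BASE | (dest_apic_id << 12) | (redirection_hint << 3) | (dest_mode << 2)


def affinity_mask(cpus) -> str:
    """Hex CPU bitmask in the form written to /proc/irq/<N>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


print(hex(msi_address(3)))     # MSI write address targeting APIC ID 3
print(affinity_mask([0, 2]))   # mask selecting CPU 0 and CPU 2
```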

Hardware-side data locations summary:

Data	Location
Raw UART bytes	FTDI on-chip RX FIFO (device side)
USB payload	DRAM URB buffer (host side, driver-allocated)
Transfer completion notification	DRAM Event Ring TRB
Interrupt signal	MSI → Local APIC IRR

At this point the byte is in host DRAM. Hardware is done; the OS takes over.

Part II: OS Path — Hard IRQ to User Process

2.1 Hard IRQ: xhci_irq

The CPU acknowledges the APIC interrupt, consults the IDT, and jumps into the xHCI driver's ISR:

xhci_irq()                    # drivers/usb/host/xhci-ring.c
  └─ handle_event()
       └─ handle_tx_event()
            └─ usb_hcd_giveback_urb()   # hand the filled URB back up the stack

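Hard IRQ context: no sleeping, no mutex, must return quickly. Rather than running the serial driver's completion callback here, usb_hcd_giveback_urb() defers it to softirq context (the exact deferral mechanism varies with kernel version).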

2.2 Softirq: FTDI Driver Processes the URB

usb_serial_generic_read_bulk_callback()
  └─ ftdi_process_read_urb()            # drivers/usb/serial/ftdi_sio.c
       ├─ strip the 2-byte FTDI modem-status header at the front of each packet
       ├─ tty_insert_flip_string()      # copy payload into the tty flip buffer
       └─ tty_flip_buffer_push()
            └─ queue_work(system_unbound_wq, &buf->work)
                 │
            [softirq returns quickly; remaining work queued for kworker]

Softirq cannot sleep, cannot take a mutex, and must not run long, so the rest is handed off to a workqueue (a kworker thread running in process context, where sleeping and locking are allowed).
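The header-stripping step can be mimicked in a few lines of Python. This is only a sketch of the idea, not the kernel code: it assumes the URB buffer is a concatenation of USB packets of at most `max_packet` bytes, each starting with two status bytes, and it ignores the status contents, which the real driver inspects for modem-line changes and line errors.

```python
def strip_ftdi_status(urb_data: bytes, max_packet: int = 64) -> bytes:
    """Drop the 2-byte status header from every USB packet in a URB buffer."""
    payload = bytearray()
    for offset in range(0, len(urb_data), max_packet):
        packet = urb_data[offset:offset + max_packet]
        if len(packet) > 2:            # a status-only packet carries no UART data
            payload += packet[2:]
    return bytes(payload)


status = b"\x01\x60"                   # example status bytes, values are arbitrary here
urb = status + b"A" * 62 + status + b"BC"
print(len(urb), "->", len(strip_ftdi_status(urb)))
```

This also explains the 62 B and 510 B figures in section 1.2: each 64-byte (FS) or 512-byte (HS) packet gives up two bytes to the status header.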

2.3 Workqueue / kworker: Line Discipline

flush_to_ldisc()                        # drivers/tty/tty_buffer.c
  └─ n_tty_receive_buf_common()         # drivers/tty/n_tty.c
       ├─ handle canonical/raw mode, echo, etc.
       └─ wake_up_interruptible_poll(&tty->read_wait, EPOLLIN | EPOLLRDNORM)

wake_up_interruptible_poll() walks the tty->read_wait wait queue and wakes the tasks sleeping on it: each one is set to TASK_RUNNING and placed on a scheduler run queue.

2.4 User Process: select() Wakes Up, read() Returns

The user process sleeps inside select([fd], ...). n_tty_poll() registers it on tty->read_wait:

static __poll_t n_tty_poll(struct tty_struct *tty, struct file *file,
                            poll_table *wait)
{
    __poll_t mask = 0;

    poll_wait(file, &tty->read_wait, wait);   // register on the wait queue
    if (input_available_p(tty, 1))
        mask |= EPOLLIN | EPOLLRDNORM;        // data is already readable
    return mask;                              // simplified: hang-up and write-side checks omitted
}

Once kworker calls wake_up_interruptible_poll(), the process is runnable. When the scheduler picks it, select() returns, and the subsequent os.read(fd, N) copies the bytes from the tty buffer into userspace. Journey complete.
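The sleep-then-wake pattern is easy to reproduce from userspace. In the sketch below a pipe stands in for the tty file descriptor and a timer thread plays the role of the kworker that delivers data 50 ms later; both are stand-ins chosen so the example runs without hardware, and `wait_and_read` is a hypothetical helper name.

```python
import os
import select
import threading
import time


def wait_and_read(fd: int, timeout_s: float, max_bytes: int = 4096) -> bytes:
    """Sleep in select() until fd is readable, then return whatever is available."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        return b""                     # timed out, nothing arrived
    return os.read(fd, max_bytes)      # does not wait for max_bytes to accumulate


read_fd, write_fd = os.pipe()
writer = threading.Timer(0.05, os.write, args=(write_fd, b"hello"))
writer.start()

start = time.monotonic()
data = wait_and_read(read_fd, timeout_s=2.0)
elapsed = time.monotonic() - start

writer.join()
os.close(read_fd)
os.close(write_fd)
print(data, f"after {elapsed * 1000:.0f} ms")
```

The call returns shortly after the data is written rather than after the 2 s timeout, which is the behavior Part III relies on.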

2.5 OS Execution Timeline

Key takeaways:

•The interrupt source is the xHCI PCIe controller — FTDI never directly signals the CPU
•Softirq must not block; slow work goes to kworker (process context, can sleep)
•latency_timer affects burst-boundary latency and therefore effective throughput in protocol-based communication
•For continuous high-rate streams, the bottleneck is UART baud rate, not USB capacity

Part III: User-Library Latency Pitfall — pyserial as an Example

The hardware chain and kernel stack work fine; yet latency is still hundreds of milliseconds. The culprit is sometimes the API semantics of the user library. pyserial is a canonical example — any library built around "read exactly N bytes" can cause the same problem.

3.1 pyserial's read-N-bytes Semantics

pyserial's read(size) implementation (serialposix.py):

def read(self, size=1):
    read = bytearray()
    timeout = Timeout(self._timeout)          # e.g. 0.5 s
    while len(read) < size:                   # ← loop until size bytes accumulated
        ready, _, _ = select.select([self.fd, ...], [], [], timeout.time_left())
        if not ready:
            break                             # timeout exit
        buf = os.read(self.fd, size - len(read))
        read.extend(buf)
        if timeout.expired():
            break
    return bytes(read)

Semantics: "accumulate size bytes, or wait until timeout" — not "return as soon as data arrives."

3.2 The Blocking Scenario at Stream End

Reading a variable-length sentinel-terminated stream with pyserial:

while True:
    chunk = uart.read(4096)    # won't return until 4096 bytes or timeout
    buf.extend(chunk)
    if sentinel in buf:
        break

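When the last packet (containing the sentinel) is only a few dozen bytes, os.read() hands those bytes over immediately, but len(read) is still far below 4096. The loop therefore goes back into select() and sits there until the timeout expires, and only then does read(4096) return.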

pyserial has no bug — it was designed for "read a fixed-size block," not "read until sentinel." The mismatch between the two semantics causes a full timeout-worth of tail latency on every stream.

3.3 Fix: select + os.read + Sentinel Detection

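import os
import select
import time

# fd, buf (bytearray), sentinel (bytes) and deadline (a time.monotonic() value) come from the caller
while True:
    remaining = deadline - time.monotonic()
    if remaining <= 0 or not select.select([fd], [], [], remaining)[0]:
        break                      # deadline reached with no new data
    chunk = os.read(fd, 65536)     # returns what is available, up to 64 KB
    if not chunk:
        break                      # EOF (e.g. hang-up)
    buf.extend(chunk)
    if sentinel in buf:
        buf = buf[: buf.index(sentinel)]
        break                      # exit as soon as the sentinel arrives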

Aspect	pyserial read(4096)	raw select + os.read
Exit condition	Accumulate 4096 bytes or timeout	Sentinel found → exit immediately
Tail latency	Up to the full timeout (~0.5 s here)	No added wait once the sentinel has arrived
Single read size	size - len(read) bytes	Up to 65536 bytes per call, draining the buffer

Bottom line: kernel-stack latency is in the microsecond range; it's the user-library API semantics that introduce hundreds of milliseconds at stream end. For variable-length sentinel streams, select + os.read is the right approach.
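The difference between the two semantics can be measured without any serial hardware. The sketch below re-implements both loops against a pipe, which stands in for the tty; `read_exact` imitates the read-N behavior described in 3.1 and is not pyserial's actual code, and `read_until` and `timed` are likewise names invented for this example. The 0.2 s timeout is an arbitrary choice to keep the demo short.

```python
import os
import select
import time


def read_exact(fd, size, timeout_s):
    """Read-N semantics: loop until size bytes have arrived or the timeout expires."""
    buf = bytearray()
    deadline = time.monotonic() + timeout_s
    while len(buf) < size:
        remaining = deadline - time.monotonic()
        if remaining <= 0 or not select.select([fd], [], [], remaining)[0]:
            break
        chunk = os.read(fd, size - len(buf))
        if not chunk:
            break
        buf += chunk
    return bytes(buf)


def read_until(fd, sentinel, timeout_s):
    """Sentinel semantics: return as soon as the sentinel has been seen."""
    buf = bytearray()
    deadline = time.monotonic() + timeout_s
    while sentinel not in buf:
        remaining = deadline - time.monotonic()
        if remaining <= 0 or not select.select([fd], [], [], remaining)[0]:
            break
        chunk = os.read(fd, 65536)
        if not chunk:
            break
        buf += chunk
    return bytes(buf)


def timed(reader, *args):
    start = time.monotonic()
    result = reader(*args)
    return result, time.monotonic() - start


r, w = os.pipe()
os.write(w, b"payload<END>")
got_exact, t_exact = timed(read_exact, r, 4096, 0.2)
os.write(w, b"payload<END>")
got_until, t_until = timed(read_until, r, b"<END>", 0.2)
os.close(r)
os.close(w)
print(f"read-N: {t_exact * 1000:.0f} ms, sentinel: {t_until * 1000:.0f} ms")
```

Both calls receive the same 12 bytes; only the read-N loop pays the full timeout before returning them.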

Appendix I: DPDK Userspace Polling

The normal interrupt-driven path has latency in the tens-of-microseconds range, which is more than adequate for virtually all UART applications. For extreme latency requirements (HFT, telecom fronthaul) the kernel can be bypassed entirely by moving device I/O into a userspace polling loop.

Dedicated-core busy polling: DPDK's PMD (Poll Mode Driver) pins one or more CPU cores in a while(1) loop that polls the DMA ring buffer instead of waiting for IRQs. The core never sleeps, pushing latency down to a few hundred nanoseconds. The cost is 100% core utilization: an idle core in C6/C7 draws roughly 0.3–1 W, while a busy-polled core runs at 10–30 W, a difference of roughly 10–100×.

x86 offers middle-ground approaches (adaptive interrupt coalescing / NAPI, Intel DDIO, SmartNIC/DPU offload), but zero power and zero latency are physically contradictory — DPDK simply trades power for latency.

For UART-over-USB, DPDK offers almost no practical benefit. FTDI's latency_timer and the USB polling period are delays imposed by the device and the bus, and userspace polling on the host cannot eliminate them. The real latency wins come from lowering latency_timer (see 1.2) and fixing the user-library semantics (see Part III), not from deploying DPDK.

Appendix II: Throughput Calculations

Actual link throughput is determined by the minimum of the two segment byte rates: the UART segment (STM32 → FTDI) and the USB segment (FTDI → PC). Both have protocol overhead, so byte rate — not wire rate — is the right unit for comparison:

•UART 8N1: 1 start + 8 data + 1 stop = 10 bits/byte → byte rate = baud rate ÷ 10
•USB FS: theoretical ceiling ~1.2 MB/s (19 bulk IN transactions × 64 B payload × 1000 frames/s), but FTDI in practice achieves 300–800 KB/s (system-dependent) — after each URB completes the driver must resubmit a new one (software round-trip), and at ~19 interrupts/ms this overhead is significant
•USB HS: theoretical ~48 MB/s (480 Mbps × ~80% efficiency ÷ 8); UART is always the bottleneck

Bottleneck Analysis

Effective throughput = min(UART byte rate, USB byte rate)

Typical combination	UART byte rate	USB practical ceiling	Bottleneck
115200 bps + USB FS	11.3 KB/s	~800 KB/s	UART
3 Mbps + USB FS (FT232R)	293 KB/s	~800 KB/s	UART
12 Mbps + USB FS	1172 KB/s	~800 KB/s	USB FS
12 Mbps + USB HS (FT232H)	1172 KB/s	~48 MB/s	UART

Conclusion: in most UART scenarios the baud rate is the limit; pairing a 12 Mbps UART with USB FS makes USB the constraint.
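The arithmetic behind the table can be reproduced with a short Python sketch. The function and constant names are illustrative, and the ~800 KB/s practical USB FS ceiling is simply the upper end of the range quoted above, taken here as 800 × 1024 B/s.

```python
def uart_byte_rate(baud: int, bits_per_frame: int = 10) -> float:
    """Payload bytes per second on an 8N1 link."""
    return baud / bits_per_frame


USB_FS_THEORETICAL = 19 * 64 * 1000   # 19 bulk transactions x 64 B x 1000 frames/s
USB_FS_PRACTICAL = 800 * 1024         # assumed practical ceiling, from the text above


def bottleneck(baud: int, usb_byte_rate: float):
    """Return (limiting segment, effective bytes per second)."""
    uart = uart_byte_rate(baud)
    if uart <= usb_byte_rate:
        return "UART", uart
    return "USB", usb_byte_rate


for baud in (115200, 3_000_000, 12_000_000):
    segment, rate = bottleneck(baud, USB_FS_PRACTICAL)
    print(f"{baud:>10} bps over USB FS: {segment} limits at {rate / 1024:.1f} KB/s")
```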

Overrun: Conditions and Location

Overrun always occurs at the FTDI on-chip RX FIFO — the junction between the UART input and USB output. When the UART byte rate persistently exceeds the USB drain rate, the FIFO accumulates until it overflows and FTDI signals a UART overrun error; the data is gone.

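The PC side (DRAM) is not where the data is lost: the host controller issues an IN token only when a URB buffer is already queued, so a USB packet never arrives without a place to land. If the host falls behind, data simply waits in the FTDI FIFO, so back-pressure ultimately manifests as FTDI FIFO overflow, not PC-side overflow.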
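How quickly a sustained rate mismatch turns into an overrun can be estimated with a one-line fill model. The sketch below assumes constant rates on both sides; the 256-byte FIFO depth is a placeholder value, not a figure from any FTDI datasheet, and `time_to_overflow_s` is a hypothetical helper.

```python
def time_to_overflow_s(fifo_bytes: int, uart_byte_rate: float, usb_drain_rate: float):
    """Seconds until an empty FIFO overflows, or None if the drain keeps up."""
    net_fill = uart_byte_rate - usb_drain_rate   # bytes per second accumulating
    if net_fill <= 0:
        return None
    return fifo_bytes / net_fill


# 12 Mbps UART (1.2 MB/s) against an assumed 800 kB/s USB FS drain rate
print(time_to_overflow_s(256, 1_200_000, 800_000))
# 3 Mbps UART (300 kB/s) against the same drain rate
print(time_to_overflow_s(256, 300_000, 800_000))
```

With a FIFO this small, any persistent mismatch overflows within a millisecond or so, which is why flow control (RTS/CTS) matters at high baud rates.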