May 11, 2026 · Quanta Control

STM32 + Embassy + OctoSPI: A Rust Async Driver

Tags: STM32, Embassy, Rust, OctoSPI, embedded

The STM32U595 includes an OCTOSPI peripheral, routed to its pins through the OCTOSPIM (OctoSPI I/O Manager), that can drive an 8-bit single-transfer-rate interface at up to 125 MHz. Paired with Embassy, the async embedded Rust framework, this gives us a clean, zero-copy abstraction over a fast FPGA communication link.

This post covers the Embassy peripheral setup, the driver layer we wrote for the four BRAM access patterns, and the workflow for using Vivado ILA to catch timing issues on the FPGA side.

The full source is at github.com/quantacontrol/octospi.

Why Embassy?

Embassy brings async/await to bare-metal Rust. For an FPGA communication driver, this matters because:

• DMA-backed transfers: the OCTOSPI peripheral can DMA directly from/to a buffer; Embassy's OctospiWord trait abstracts over transfer widths without runtime overhead.
• Structured concurrency: multiple tasks can share the FPGA link without a mutex in the hot path.
• No RTOS overhead: Embassy's executor is a cooperative scheduler with zero dynamic allocation.

OctoSPI Peripheral Configuration

The STM32U595 OCTOSPI peripheral needs to match the FPGA's frame format exactly:

// stm32/src/lib.rs (simplified)
use embassy_stm32::octospi::{Config, DummyCycles, MemoryType, OctospiWidth, WrapSize};
 
pub fn octospi_config() -> Config {
    let mut cfg = Config::default();
    cfg.fifo_threshold = 4;
    cfg.memory_type = MemoryType::Macronix;   // frame-based, not memory-mapped
    cfg.device_size = 24;                      // address bits
    cfg.chip_select_high_time = 1;
    cfg.free_running_clock = false;
    cfg.clock_mode = false;                    // CPOL=0, CPHA=0
    cfg.wrap_size = WrapSize::None;
    cfg.clock_prescaler = 1;                   // 160 MHz AHB / 2 = 80 MHz SCLK
    cfg.sample_shifting = false;
    cfg.delay_hold_quarter_cycle = false;
    cfg
}

All data transfers use OctospiWidth::OCTO (8-bit). The instruction and address phases use the same 8-bit width to match the FPGA's IO[7:0] input path.
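The prescaler comment in the config follows the usual OCTOSPI rule that the bus clock is the kernel clock divided by prescaler + 1. A minimal sketch of that arithmetic is below; `sclk_hz` is a hypothetical helper for illustration and is not part of Embassy.

```rust
/// Hypothetical helper (not an Embassy API): SCLK frequency for a given
/// prescaler value, assuming the divider is `prescaler + 1`, which is what
/// the "160 MHz / 2 = 80 MHz" comment in the config implies.
pub const fn sclk_hz(kernel_hz: u32, prescaler: u8) -> u32 {
    kernel_hz / (prescaler as u32 + 1)
}
```

With a 160 MHz kernel clock, a prescaler of 1 gives 80 MHz and a prescaler of 0 would run the bus at the full kernel clock.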

The Driver Abstraction

Rather than calling Embassy's raw OCTOSPI HAL directly, we wrap it in a FpgaLink struct that exposes the four opcodes as typed async methods:

// stm32/src/lib.rs
pub struct FpgaLink<'d, T: OctospiInstance> {
    ospi: Octospi<'d, T, Async>,
}
 
impl<'d, T: OctospiInstance> FpgaLink<'d, T> {
    /// Burst write, auto-incrementing address.
    pub async fn write_incr(&mut self, addr: u32, data: &[u32]) {
        self.ospi.write_extended(
            OctospiWidth::OCTO, 0xCA_u8,   // CMD_WRITE_INCR
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice(data),
        ).await.unwrap();
    }
 
    /// Burst read, auto-incrementing address.
    pub async fn read_incr(&mut self, addr: u32, buf: &mut [u32]) {
        self.ospi.read_extended(
            OctospiWidth::OCTO, 0xBA_u8,   // CMD_READ_INCR
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice_mut(buf),
        ).await.unwrap();
    }
 
    /// Write to fixed address (FIFO push).
    pub async fn write_fixed(&mut self, addr: u32, data: &[u32]) {
        self.ospi.write_extended(
            OctospiWidth::OCTO, 0xFE_u8,   // CMD_WRITE_FIXED
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice(data),
        ).await.unwrap();
    }
 
    /// Read from fixed address (FIFO pop).
    pub async fn read_fixed(&mut self, addr: u32, buf: &mut [u32]) {
        self.ospi.read_extended(
            OctospiWidth::OCTO, 0xBE_u8,   // CMD_READ_FIXED
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice_mut(buf),
        ).await.unwrap();
    }
}

Protocol Frame

Every transaction uses the same five-field frame:

CMD (1 byte) → AUX (2 bytes) → ADDR (4 bytes) → DUMMY (8 cycles) → DATA (N bytes)

All fields are clocked on IO[7:0] in STR mode at 100 MHz aclk.

CMD selects one of four opcodes:

Opcode	Hex	Semantics
`CMD_WRITE_INCR`	`0xCA`	Burst write, auto-increment address each word
`CMD_WRITE_FIXED`	`0xFE`	Burst write to the same address (FIFO push)
`CMD_READ_INCR`	`0xBA`	Burst read, auto-increment address
`CMD_READ_FIXED`	`0xBE`	Burst read from fixed address (FIFO pop)

AUX carries the burst length (number of 32-bit words). ADDR is the 32-bit AXI base address. DUMMY gives the FPGA 8 clock cycles to issue an AXI prefetch read before the data phase starts.
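To make the frame layout concrete, here is a rough host-side sketch that packs the CMD, AUX and ADDR fields into wire order and estimates frame length. The names `Opcode`, `encode_header` and `frame_cycles` are made up for this illustration, and the byte order of AUX and ADDR (most-significant byte first) is an assumption, since the post does not state it.

```rust
/// The four opcodes from the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum Opcode {
    WriteIncr = 0xCA,
    WriteFixed = 0xFE,
    ReadIncr = 0xBA,
    ReadFixed = 0xBE,
}

/// Header bytes sent before the dummy cycles: CMD (1) + AUX (2) + ADDR (4).
pub const HEADER_LEN: usize = 7;

/// Dummy cycles between the header and the data phase.
pub const DUMMY_CYCLES: u32 = 8;

/// Pack CMD, AUX and ADDR in the order they appear on the wire.
/// ASSUMPTION: AUX and ADDR go out most-significant byte first. Check
/// octo_spi_slave.sv before relying on this ordering.
pub fn encode_header(op: Opcode, word_count: u16, addr: u32) -> [u8; HEADER_LEN] {
    let mut hdr = [0u8; HEADER_LEN];
    hdr[0] = op as u8;
    hdr[1..3].copy_from_slice(&word_count.to_be_bytes());
    hdr[3..7].copy_from_slice(&addr.to_be_bytes());
    hdr
}

/// Bus clock cycles for one frame in 8-bit STR mode, where each clock moves
/// one byte: header bytes + dummy cycles + 4 bytes per 32-bit word.
/// Chip-select setup and hold time are not included.
pub fn frame_cycles(word_count: u16) -> u32 {
    HEADER_LEN as u32 + DUMMY_CYCLES + 4 * word_count as u32
}
```

Under these assumptions, an 8-word READ_INCR occupies 7 + 8 + 32 = 47 clock cycles on the bus.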

The 7-State FSM

octo_spi_slave.sv implements this as a Mealy FSM with states:

ST_IDLE → ST_CMD → ST_AUX → ST_ADDR → ST_DUMMY → ST_DATA → ST_DONE

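The byte-oriented states (ST_CMD, ST_AUX, ST_ADDR, ST_DATA) advance on byte_ready, which the synchronizer strobes when a full byte has been captured from IO[7:0]. ST_IDLE and ST_DONE follow the synchronized chip select, and ST_DUMMY waits for dummy_done.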

// Simplified state transitions
always_ff @(posedge aclk or negedge aresetn) begin
  if (!aresetn) state <= ST_IDLE;
  else case (state)
    ST_IDLE:  if (!csn_sync)               state <= ST_CMD;
    ST_CMD:   if (byte_ready)              state <= ST_AUX;
    ST_AUX:   if (byte_ready && aux_done)  state <= ST_ADDR;
    ST_ADDR:  if (byte_ready && addr_done) state <= ST_DUMMY;
    ST_DUMMY: if (dummy_done)              state <= ST_DATA;
    ST_DATA:  if (byte_ready && data_done) state <= ST_DONE;
    ST_DONE:  if (csn_sync)                state <= ST_IDLE;
  endcase
end

The ST_DUMMY state is the critical design choice: 8 clock cycles during which the FPGA issues its first AXI read request and receives the response into initial_buf — before a single data byte needs to be driven onto IO[7:0].
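For host-side unit tests it can help to mirror the transition table in software. The sketch below is a Rust model of the simplified transitions shown above, not a translation of the full octo_spi_slave.sv; the `State`, `Inputs` and `next_state` names are invented for this example, while the field names follow the signals in the snippet.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State {
    Idle,
    Cmd,
    Aux,
    Addr,
    Dummy,
    Data,
    Done,
}

/// Inputs sampled on one aclk edge, named after the signals in the snippet.
/// `csn_sync` is active low: `false` means the chip is selected.
#[derive(Clone, Copy, Debug, Default)]
pub struct Inputs {
    pub csn_sync: bool,
    pub byte_ready: bool,
    pub aux_done: bool,
    pub addr_done: bool,
    pub dummy_done: bool,
    pub data_done: bool,
}

/// One clock edge of the simplified FSM. When no transition condition is
/// met, the state holds, just as the `always_ff` block does.
pub fn next_state(state: State, i: Inputs) -> State {
    match state {
        State::Idle if !i.csn_sync => State::Cmd,
        State::Cmd if i.byte_ready => State::Aux,
        State::Aux if i.byte_ready && i.aux_done => State::Addr,
        State::Addr if i.byte_ready && i.addr_done => State::Dummy,
        State::Dummy if i.dummy_done => State::Data,
        State::Data if i.byte_ready && i.data_done => State::Done,
        State::Done if i.csn_sync => State::Idle,
        hold => hold,
    }
}
```

A model like this is only as accurate as the simplified listing it copies, so it is best used to check firmware-side assumptions about frame sequencing rather than as a reference for the RTL.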

Dual-Buffer Prefetch

Read latency hiding works with two buffers:

• initial_buf: loaded during DUMMY cycles from the first AXI read
• prefetch_buf: loaded during the current data word transfer from the next address

When ST_DATA begins, initial_buf is immediately available for the first word. As that word is being shifted out, the FSM fires the next AXI read and loads prefetch_buf. The pattern continues word-by-word:

Cycle:  [DUMMY]    [DATA word 0]       [DATA word 1]        [DATA word 2]
AXI:    addr[0]→   addr[1]→            addr[2]→             addr[3]→
Buffer:            initial_buf ready   prefetch_buf ready   prefetch_buf ready

This keeps one AXI read in flight for every data word and avoids bubbles in the data stream, even though AXI read latency is several cycles.
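The ordering of fetches and outputs can be illustrated with a small functional model. This is a sketch under assumptions, not the RTL: `PrefetchReader` is a hypothetical name, the closure stands in for an AXI read, and the 4-byte address stride per 32-bit word is assumed rather than taken from the post.

```rust
/// ASSUMPTION: addresses are byte addresses, so consecutive 32-bit words
/// are 4 apart. The post does not say whether ADDR counts bytes or words.
pub const STRIDE: u32 = 4;

/// Host-side model of the dual-buffer read path. `axi_read` stands in for
/// one AXI read; the buffer names mirror the two buffers described above.
pub struct PrefetchReader<F: FnMut(u32) -> u32> {
    axi_read: F,
    initial_buf: u32,
    prefetch_buf: u32,
    next_addr: u32,
    first: bool,
}

impl<F: FnMut(u32) -> u32> PrefetchReader<F> {
    /// Models the DUMMY phase: fetch the first word before any data byte
    /// has to be driven onto the bus.
    pub fn start(base_addr: u32, mut axi_read: F) -> Self {
        let initial_buf = axi_read(base_addr);
        Self {
            axi_read,
            initial_buf,
            prefetch_buf: 0,
            next_addr: base_addr + STRIDE,
            first: true,
        }
    }

    /// Models one DATA word: hand out the buffered word and, in the same
    /// step, fetch the following address into `prefetch_buf`. As in the
    /// timing diagram, this always reads one word ahead, so the final call
    /// issues a read whose result is never used.
    pub fn next_word(&mut self) -> u32 {
        let out = if self.first { self.initial_buf } else { self.prefetch_buf };
        self.first = false;
        self.prefetch_buf = (self.axi_read)(self.next_addr);
        self.next_addr += STRIDE;
        out
    }
}
```

The point of the model is the ordering: every word returned by `next_word` was fetched one step earlier, which is what lets the data phase start without waiting on AXI latency.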

The Four BRAM Test Patterns

Each example demonstrates a different access semantic:

bram_a — READ_INCR with auto-fill

BRAM A is backed by bram_incr_fill_master.sv: every time the FPGA sees a read request, it fills the BRAM with an auto-incrementing counter starting at the requested address. Reading N words always returns [addr, addr+1, ..., addr+N-1].

// stm32/examples/bram_a.rs
let mut buf = [0u32; 8];
link.read_incr(0x0000_0000, &mut buf).await;
for (i, &word) in buf.iter().enumerate() {
    assert_eq!(word, i as u32, "BRAM A counter mismatch at index {i}");
}
info!("[RESULT] bram_a PASS");

bram_c — WRITE_INCR + READ_INCR roundtrip

BRAM C is a plain synchronous BRAM. Write a pattern, read it back:

// stm32/examples/bram_c.rs
let write_data: [u32; 4] = [0xDEAD_BEEF, 0xCAFE_BABE, 0x1234_5678, 0xABCD_EF01];
link.write_incr(0x0002_0000, &write_data).await;
 
let mut read_buf = [0u32; 4];
link.read_incr(0x0002_0000, &mut read_buf).await;
assert_eq!(write_data, read_buf);
info!("[RESULT] bram_c PASS");

bram_b — READ_FIXED (hardware FIFO pop)

BRAM B is fed by bram_fifo_master.sv, which enqueues words autonomously. Each READ_FIXED pops one word; READ_INCR would read the same FIFO head repeatedly.

bram_d — WRITE_FIXED + READ_FIXED (software FIFO)

BRAM D implements a circular buffer managed entirely in software. Push with write_fixed, pop with read_fixed, tracking head/tail pointers in firmware.
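The firmware-side pointer tracking could look roughly like the sketch below. `RingIndex` and its methods are hypothetical, and the capacity is a parameter because the post does not give the buffer depth. The struct only does the bookkeeping; the actual transfers would still go through write_fixed and read_fixed.

```rust
/// Hypothetical head/tail bookkeeping for a circular buffer held in BRAM D.
pub struct RingIndex {
    capacity: u32, // number of 32-bit slots (assumed, not from the post)
    head: u32,     // slot that the next pop will read
    tail: u32,     // slot that the next push will write
    len: u32,      // slots currently occupied
}

impl RingIndex {
    pub fn new(capacity: u32) -> Self {
        assert!(capacity > 0, "ring buffer needs at least one slot");
        Self { capacity, head: 0, tail: 0, len: 0 }
    }

    /// Reserve the slot for the next push, or `None` when the buffer is full.
    pub fn push_slot(&mut self) -> Option<u32> {
        if self.len == self.capacity {
            return None;
        }
        let slot = self.tail;
        self.tail = (self.tail + 1) % self.capacity;
        self.len += 1;
        Some(slot)
    }

    /// Release the oldest slot for the next pop, or `None` when empty.
    pub fn pop_slot(&mut self) -> Option<u32> {
        if self.len == 0 {
            return None;
        }
        let slot = self.head;
        self.head = (self.head + 1) % self.capacity;
        self.len -= 1;
        Some(slot)
    }
}
```

Keeping an explicit `len` avoids the usual ambiguity where head == tail could mean either empty or full.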

Run All Tests Automatically

cd fpga
make end2end PROBE_RS_PROBE=<your-serial>

scripts/run_tests.py programs the FPGA and then launches each example, checking for [RESULT] bram_X PASS in the defmt output.
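The log check itself is simple to express. The sketch below shows one way to recognise a result line, written in Rust for consistency with the rest of the post; it is not the code in scripts/run_tests.py. `parse_result` is a hypothetical name, and the FAIL form is an assumption, since only the PASS line appears in the examples.

```rust
/// Parse a log line such as "[RESULT] bram_a PASS" into (test name, passed).
/// Anything before the "[RESULT]" marker (timestamps, log level) is ignored.
/// ASSUMPTION: a failing example prints "FAIL" where a passing one prints
/// "PASS". Lines with any other verdict are treated as unrecognised.
pub fn parse_result(line: &str) -> Option<(&str, bool)> {
    let rest = line.split_once("[RESULT]")?.1;
    let mut fields = rest.split_whitespace();
    let name = fields.next()?;
    match fields.next()? {
        "PASS" => Some((name, true)),
        "FAIL" => Some((name, false)),
        _ => None,
    }
}
```

A runner built on this would treat a missing result line, for example after a panic or a timeout, as a failure rather than silently skipping the test.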