STM32 + Embassy + OctoSPI: A Rust Async Driver
The STM32U595 includes an OCTOSPIM (OctoSPI I/O Manager) peripheral that can drive an 8-bit single-transfer-rate interface at up to 125 MHz. Paired with Embassy — the async embedded Rust framework — this gives us a clean, zero-copy abstraction over a fast FPGA communication link.
This post covers the Embassy peripheral setup, the driver layer we wrote for the four BRAM access patterns, and the workflow for using Vivado ILA to catch timing issues on the FPGA side.
The full source is at github.com/quantacontrol/octospi.
Why Embassy?
Embassy brings async/await to bare-metal Rust. For an FPGA communication driver, this matters because:
- DMA-backed transfers: the OCTOSPIM peripheral can DMA directly from/to a buffer; Embassy's OctospiWord trait abstracts over transfer widths without runtime overhead.
- Structured concurrency: multiple tasks can share the FPGA link without a mutex in the hot path.
- No RTOS overhead: Embassy's executor is a cooperative scheduler with zero dynamic allocation.
OctoSPI Peripheral Configuration
The STM32U595 OCTOSPI peripheral needs to match the FPGA's frame format exactly:
// stm32/src/lib.rs (simplified)
use embassy_stm32::octospi::{Config, DummyCycles, MemoryType, OctospiWidth, WrapSize};

pub fn octospi_config() -> Config {
    let mut cfg = Config::default();
    cfg.fifo_threshold = 4;
    cfg.memory_type = MemoryType::Macronix; // frame-based, not memory-mapped
    cfg.device_size = 24; // address bits
    cfg.chip_select_high_time = 1;
    cfg.free_running_clock = false;
    cfg.clock_mode = false; // CPOL=0, CPHA=0
    cfg.wrap_size = WrapSize::None;
    cfg.clock_prescaler = 1; // 160 MHz AHB / 2 = 80 MHz SCLK
    cfg.sample_shifting = false;
    cfg.delay_hold_quarter_cycle = false;
    cfg
}

All data transfers use OctospiWidth::OCTO (8-bit); the instruction and address phases use the same 8-bit width to match the FPGA's IO[7:0] input path.
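The clock_prescaler comment in the config relies on the STM32 OCTOSPI convention that the divider is the register value plus one, so a prescaler of 1 halves the source clock. A minimal sketch of that arithmetic follows; the function names sclk_hz and prescaler_for are hypothetical helpers, not part of embassy-stm32.

```rust
/// SCLK frequency for a given OCTOSPI source clock and prescaler register value.
/// The peripheral divides by `prescaler + 1`, so 0 means "no division".
pub fn sclk_hz(source_hz: u32, prescaler: u8) -> u32 {
    source_hz / (u32::from(prescaler) + 1)
}

/// Smallest prescaler value whose resulting SCLK does not exceed `max_sclk_hz`.
/// Returns None if even the largest divider (256) is still too fast.
pub fn prescaler_for(source_hz: u32, max_sclk_hz: u32) -> Option<u8> {
    (0..=u8::MAX).find(|&p| sclk_hz(source_hz, p) <= max_sclk_hz)
}
```

With a 160 MHz source clock, sclk_hz(160_000_000, 1) gives the 80 MHz quoted in the config comment.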
The Driver Abstraction
Rather than calling Embassy's raw OCTOSPI HAL directly, we wrap it in a FpgaLink struct that exposes the four opcodes as typed async methods:
// stm32/src/lib.rs
pub struct FpgaLink<'d, T: OctospiInstance> {
    ospi: Octospi<'d, T, Async>,
}

impl<'d, T: OctospiInstance> FpgaLink<'d, T> {
    /// Burst write, auto-incrementing address.
    pub async fn write_incr(&mut self, addr: u32, data: &[u32]) {
        self.ospi.write_extended(
            OctospiWidth::OCTO, 0xCA_u8, // CMD_WRITE_INCR
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice(data),
        ).await.unwrap();
    }

    /// Burst read, auto-incrementing address.
    pub async fn read_incr(&mut self, addr: u32, buf: &mut [u32]) {
        self.ospi.read_extended(
            OctospiWidth::OCTO, 0xBA_u8, // CMD_READ_INCR
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice_mut(buf),
        ).await.unwrap();
    }

    /// Write to fixed address (FIFO push).
    pub async fn write_fixed(&mut self, addr: u32, data: &[u32]) {
        self.ospi.write_extended(
            OctospiWidth::OCTO, 0xFE_u8, // CMD_WRITE_FIXED
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice(data),
        ).await.unwrap();
    }

    /// Read from fixed address (FIFO pop).
    pub async fn read_fixed(&mut self, addr: u32, buf: &mut [u32]) {
        self.ospi.read_extended(
            OctospiWidth::OCTO, 0xBE_u8, // CMD_READ_FIXED
            OctospiWidth::OCTO, addr,
            DummyCycles::_8,
            bytemuck::cast_slice_mut(buf),
        ).await.unwrap();
    }
}

Protocol Frame
Every transaction uses the same five-field frame:
CMD (1 byte) → AUX (2 bytes) → ADDR (4 bytes) → DUMMY (8 cycles) → DATA (N bytes)
All fields are clocked on IO[7:0] in STR mode at 100 MHz aclk.
CMD selects one of four opcodes:
| Opcode | Hex | Semantics |
|---|---|---|
| CMD_WRITE_INCR | 0xCA | Burst write, auto-increment address each word |
| CMD_WRITE_FIXED | 0xFE | Burst write to the same address (FIFO push) |
| CMD_READ_INCR | 0xBA | Burst read, auto-increment address |
| CMD_READ_FIXED | 0xBE | Burst read from fixed address (FIFO pop) |
AUX carries the burst length (number of 32-bit words). ADDR is the 32-bit AXI base address. DUMMY gives the FPGA 8 clock cycles to issue an AXI prefetch read before the data phase starts.
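To make the frame layout concrete, here is a host-side sketch that packs the CMD, AUX and ADDR fields and counts the clocks one frame occupies. It is an illustration, not code from the repository: the names Opcode, encode_header and frame_cycles are hypothetical, and the byte order of AUX and ADDR (most-significant byte first) is an assumption that should be checked against the RTL.

```rust
/// The four opcodes from the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum Opcode {
    WriteIncr = 0xCA,
    WriteFixed = 0xFE,
    ReadIncr = 0xBA,
    ReadFixed = 0xBE,
}

/// Header length in bytes: CMD (1) + AUX (2) + ADDR (4).
pub const HEADER_LEN: usize = 7;
/// Dummy clock cycles between the header and the data phase.
pub const DUMMY_CYCLES: u32 = 8;

/// Pack the header fields into wire order.
/// ASSUMPTION: AUX and ADDR go out most-significant byte first.
pub fn encode_header(op: Opcode, words: u16, addr: u32) -> [u8; HEADER_LEN] {
    let mut h = [0u8; HEADER_LEN];
    h[0] = op as u8;
    h[1..3].copy_from_slice(&words.to_be_bytes()); // AUX: burst length in words
    h[3..7].copy_from_slice(&addr.to_be_bytes()); // ADDR: 32-bit base address
    h
}

/// Clock cycles for one frame. On an 8-bit STR bus each byte takes one
/// clock, so the total is header bytes + dummy cycles + 4 bytes per word.
pub fn frame_cycles(words: u16) -> u32 {
    HEADER_LEN as u32 + DUMMY_CYCLES + 4 * u32::from(words)
}
```

For an 8-word READ_INCR this gives 7 + 8 + 32 = 47 clocks, which shows how much of a short burst is fixed overhead.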
The 7-State FSM
octo_spi_slave.sv implements this as a Mealy FSM with states:
ST_IDLE → ST_CMD → ST_AUX → ST_ADDR → ST_DUMMY → ST_DATA → ST_DONE
Each state watches byte_ready (strobed by the synchronizer when a full byte has been shifted in from IO[7:0]) to advance.
// Simplified state transitions
always_ff @(posedge aclk or negedge aresetn) begin
    if (!aresetn) state <= ST_IDLE;
    else case (state)
        ST_IDLE:  if (!csn_sync) state <= ST_CMD;
        ST_CMD:   if (byte_ready) state <= ST_AUX;
        ST_AUX:   if (byte_ready && aux_done) state <= ST_ADDR;
        ST_ADDR:  if (byte_ready && addr_done) state <= ST_DUMMY;
        ST_DUMMY: if (dummy_done) state <= ST_DATA;
        ST_DATA:  if (byte_ready && data_done) state <= ST_DONE;
        ST_DONE:  if (csn_sync) state <= ST_IDLE;
    endcase
end

The ST_DUMMY state is the critical design choice: it provides 8 clock cycles during which the FPGA issues its first AXI read request and receives the response into initial_buf, before a single data byte needs to be driven onto IO[7:0].
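For host-side unit tests it can help to mirror these transitions in Rust. The sketch below is a hypothetical software model, not part of the repository; it copies only the simplified transitions shown above, with field names borrowed from the RTL signals.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State { Idle, Cmd, Aux, Addr, Dummy, Data, Done }

/// Inputs sampled on one clock edge; names mirror the RTL signals.
#[derive(Clone, Copy, Default)]
pub struct Inputs {
    pub csn_sync: bool,   // chip select is active low: false = selected
    pub byte_ready: bool, // a full byte has been captured from IO[7:0]
    pub aux_done: bool,
    pub addr_done: bool,
    pub dummy_done: bool,
    pub data_done: bool,
}

/// One step of the simplified FSM: returns the state after this clock edge.
pub fn next_state(s: State, i: Inputs) -> State {
    match s {
        State::Idle if !i.csn_sync => State::Cmd,
        State::Cmd if i.byte_ready => State::Aux,
        State::Aux if i.byte_ready && i.aux_done => State::Addr,
        State::Addr if i.byte_ready && i.addr_done => State::Dummy,
        State::Dummy if i.dummy_done => State::Data,
        State::Data if i.byte_ready && i.data_done => State::Done,
        State::Done if i.csn_sync => State::Idle,
        other => other, // no condition met: hold the current state
    }
}
```

Note that Inputs::default() has csn_sync set to false, which means "selected" here, so tests that model an idle bus need to set it to true explicitly.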
Dual-Buffer Prefetch
Read latency hiding works with two buffers:
- initial_buf: loaded during DUMMY cycles from the first AXI read
- prefetch_buf: loaded during the current data word transfer from the next address
When ST_DATA begins, initial_buf is immediately available for the first word. As that word is being shifted out, the FSM fires the next AXI read and loads prefetch_buf. The pattern continues word-by-word:
Cycle:   [DUMMY]     [DATA word 0]       [DATA word 1]        [DATA word 2]
AXI:     addr[0]→    addr[1]→            addr[2]→             addr[3]→
Buffer:              initial_buf ready   prefetch_buf ready   prefetch_buf ready
This overlaps each AXI read with the transfer of the previous word, which avoids bubbles in the data stream even though AXI read latency is several cycles.
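The ordering of reads and buffer hand-offs can be sketched as a small software model. This is not the RTL: burst_read and Trace are hypothetical names, addresses are assumed to step by one per word, each AXI read is assumed to complete within the slot in which it is issued, and the model issues no read past the last word of the burst.

```rust
/// Result of simulating one burst read.
pub struct Trace {
    pub data: Vec<u32>,      // words in the order they are shifted out
    pub axi_reads: Vec<u32>, // addresses in the order they were requested
}

/// Simulate a READ_INCR burst of `words` words starting at `base`.
/// `mem` stands in for the AXI slave behind the FSM.
pub fn burst_read(mem: impl Fn(u32) -> u32, base: u32, words: u32) -> Trace {
    let mut t = Trace { data: Vec::new(), axi_reads: Vec::new() };
    if words == 0 {
        return t;
    }
    // DUMMY slot: the first AXI read fills initial_buf.
    t.axi_reads.push(base);
    let initial_buf = mem(base);
    let mut out = initial_buf; // word currently being shifted out
    for i in 0..words {
        // While word i is on the bus, fetch word i + 1 into prefetch_buf.
        let prefetch_buf = if i + 1 < words {
            let addr = base.wrapping_add(i + 1);
            t.axi_reads.push(addr);
            Some(mem(addr))
        } else {
            None
        };
        t.data.push(out);
        if let Some(next) = prefetch_buf {
            out = next; // prefetch_buf becomes the next word to shift out
        }
    }
    t
}
```

The model makes one property easy to check: every word after the first has already been requested one slot before it is needed.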
The Four BRAM Test Patterns
Each example demonstrates a different access semantic:
bram_a — READ_INCR with auto-fill
BRAM A is backed by bram_incr_fill_master.sv: every time the FPGA sees a read request, it fills the BRAM with an auto-incrementing counter starting at the requested address. Reading N words always returns [addr, addr+1, ..., addr+N-1].
// stm32/examples/bram_a.rs
let mut buf = [0u32; 8];
link.read_incr(0x0000_0000, &mut buf).await;
for (i, &word) in buf.iter().enumerate() {
    assert_eq!(word, i as u32, "BRAM A counter mismatch at index {i}");
}
info!("[RESULT] bram_a PASS");

bram_c — WRITE_INCR + READ_INCR roundtrip
BRAM C is a plain synchronous BRAM. Write a pattern, read it back:
// stm32/examples/bram_c.rs
let write_data: [u32; 4] = [0xDEAD_BEEF, 0xCAFE_BABE, 0x1234_5678, 0xABCD_EF01];
link.write_incr(0x0002_0000, &write_data).await;
let mut read_buf = [0u32; 4];
link.read_incr(0x0002_0000, &mut read_buf).await;
assert_eq!(write_data, read_buf);
info!("[RESULT] bram_c PASS");

bram_b — READ_FIXED (hardware FIFO pop)
BRAM B is fed by bram_fifo_master.sv, which enqueues words autonomously. Each READ_FIXED pops one word; READ_INCR would read the same FIFO head repeatedly.
bram_d — WRITE_FIXED + READ_FIXED (software FIFO)
BRAM D implements a circular buffer managed entirely in software. Push with write_fixed, pop with read_fixed, tracking head/tail pointers in firmware.
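The head/tail bookkeeping for such a buffer might look like the sketch below. RingIndex is a hypothetical helper, not code from the repository, and it stops at slot indices: how a slot maps to an address passed to write_fixed or read_fixed is not specified here. One slot is kept free so that "full" and "empty" are distinguishable, which means N must be at least 2.

```rust
/// Head/tail bookkeeping for a circular buffer with `N` word slots.
/// Holds at most `N - 1` entries; `N` must be at least 2.
pub struct RingIndex<const N: usize> {
    head: usize, // slot of the oldest entry (next to pop)
    tail: usize, // slot the next push will use
}

impl<const N: usize> RingIndex<N> {
    pub const fn new() -> Self {
        Self { head: 0, tail: 0 }
    }

    pub fn is_empty(&self) -> bool {
        self.head == self.tail
    }

    pub fn is_full(&self) -> bool {
        (self.tail + 1) % N == self.head
    }

    pub fn len(&self) -> usize {
        (self.tail + N - self.head) % N
    }

    /// Claim the slot for the next push, or None if the buffer is full.
    pub fn push_slot(&mut self) -> Option<usize> {
        if self.is_full() {
            return None;
        }
        let slot = self.tail;
        self.tail = (self.tail + 1) % N;
        Some(slot)
    }

    /// Release the oldest slot, or None if the buffer is empty.
    pub fn pop_slot(&mut self) -> Option<usize> {
        if self.is_empty() {
            return None;
        }
        let slot = self.head;
        self.head = (self.head + 1) % N;
        Some(slot)
    }
}
```

Firmware would call push_slot before each write and pop_slot before each read, skipping the transfer when the result is None.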
Run All Tests Automatically
cd fpga
make end2end PROBE_RS_PROBE=<your-serial>

scripts/run_tests.py programs the FPGA and then launches each example, checking for [RESULT] bram_X PASS in the defmt output.
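The pass check itself is simple enough to sketch. The real script is Python and its logic is not shown here, so the Rust version below only illustrates the matching rule: the function names are hypothetical, and anything surrounding the [RESULT] marker (timestamps, log levels) is assumed to be ignorable.

```rust
/// Extract the example name from a log line containing `[RESULT] <name> PASS`.
/// Returns None for lines without the marker or with any other status.
pub fn passed_example(line: &str) -> Option<&str> {
    let (_, rest) = line.split_once("[RESULT] ")?;
    let mut tokens = rest.split_whitespace();
    let name = tokens.next()?;
    (tokens.next()? == "PASS").then_some(name)
}

/// True if every expected example reported PASS somewhere in the log.
pub fn all_passed(log: &str, expected: &[&str]) -> bool {
    expected.iter().all(|want| {
        log.lines()
            .filter_map(|line| passed_example(line))
            .any(|got| got == *want)
    })
}
```

A runner would collect the defmt output of each example and fail the run if all_passed returns false for the expected list.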