Surface diving into Tenstorrent's Wormhole architecture

AI Accelerator Blackhole Wormhole Tenstorrent

0: Overview and stats

This blog will focus mainly on Wormhole’s microarchitecture, along with its NoC. I will touch on the peripherals and compiler a lil as well, but won’t be digging too much into the ISA, because honestly, it’s a documentation hell in their tt-metal and ISA documentation repos (not complaining tho, thank you TT for making it open, very cool).

Before understanding their microarchitecture, it’s good to know the performance it delivers (LOL jk, I just put it there to sound like a nerd).
Below is a table comparing their various accelerator cards with their performance and hardware specifications. We’ll discuss everything soon dw.

| Architectural Specification | Grayskull (e150) | Wormhole (n300s - Dual ASIC) | Blackhole (p150a - Single ASIC) |
| --- | --- | --- | --- |
| Process Node | 12nm | 12nm | 6nm |
| Usable Tensix Cores | 120 | 128 (64 per ASIC) | 120 |
| Application CPU Cores | None (Host Dependent) | None (Host Dependent) | 16x SiFive Intelligence X280 |
| On-Die SRAM Capacity | 120 MB | 192 MB | 180 MB |
| DRAM Capacity | 8 GB LPDDR4 | 24 GB GDDR6 | 32 GB GDDR6 |
| DRAM Bandwidth | 118 GB/s | 576 GB/s | 512 GB/s |
| Ethernet Interconnect | N/A (PCIe 4.0 x16 only) | 4x 100 GbE (Warp 100 + QSFP-DD) | 4x 800 GbE (QSFP-DD) |
| Compute Throughput (FP8) | 332 TeraFLOPS | 466 TeraFLOPS | 745 TeraFLOPS |
| Total Board Power (TBP) | 200 W | 300 W | 300 W |

1: What’s a Tensix Tile/Core?

Tenstorrent’s architecture is a spatially distributed grid of compute cores, which they call Tensix cores, connected by a NoC. Fundamentally, the architecture isn’t SIMD/SIMT or dataflow; it’s basically a grid of Tensix cores, and the compiler schedules compute load onto each core.
Per the table, the Wormhole n300s has 64 usable Tensix cores per ASIC (128 across both ASICs) and the Blackhole p150a has 120, arranged in a 2D torus topology and connected by a quasi full duplex NoC (it’s a fancy term I’ll explain later).
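
To make “2D torus” a bit more concrete, here is a minimal sketch of the generic definition: every node has four neighbours, and the edges wrap around. The 8x8 grid size and the function name are assumptions for illustration only; this says nothing about Wormhole’s actual physical layout or routing directions.

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbours of (x, y) on a width x height 2D torus."""
    return {
        "east": ((x + 1) % width, y),    # wraps from the last column to column 0
        "west": ((x - 1) % width, y),    # wraps from column 0 to the last column
        "south": (x, (y + 1) % height),  # wraps from the last row to row 0
        "north": (x, (y - 1) % height),  # wraps from row 0 to the last row
    }

# Hypothetical 8x8 grid (64 nodes); the corner node still has four neighbours.
print(torus_neighbors(7, 0, 8, 8))
# {'east': (0, 0), 'west': (6, 0), 'south': (7, 1), 'north': (7, 7)}
```

The point is simply that there are no “edge” nodes in a torus: the node in the top-right corner is one hop from the top-left and bottom-right ones.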

1.1: Components within a Tensix Core:

Each Tensix core contains:

  1. 1.5 MB SRAM (L1)
  2. 5x Baby RISCV (RV32IM)
  3. 1x Tensix CoProcessor
  4. 2x NoC connections + 1 NoC Overlay

Baby RISC-V Cores

  • They are small 32-bit, in-order, single-issue cores that execute the RV32IM ISA plus a custom Tensix extension.
  • The 5 RISC-V cores are used as controllers and are expected to execute one RV32IM instruction per cycle at a 1 GHz clock.
    The fundamental life cycle of an AI compute kernel, i.e., Data ingress -> Unpacking -> Compute -> Packing -> Data egress, is driven by these 5 cores.
  • They act purely as control circuits and do not perform any of the compute themselves. They are optimized for area and power.
  • The 5 cores are split across 2 main tasks:
    1. NCRISC and BRISC handle NoC communications.
      • These issue asynchronous NoC read and write commands.
    2. TRISC0, TRISC1 and TRISC2 manage the Tensix execution pipeline.
      • TRISC0 drives the unpacker, instructing the DMA engine to load data from SRAM into the 4 KiB srcA and srcB registers.
      • TRISC1 is the math dispatcher that issues instructions to the ALUs.
      • TRISC2 drives the packer, instructing the hardware to format accumulator results and push them into SRAM.

| Core Designation | Primary Hardware Function | Pipeline Stage | Local Instruction Memory | Local Data Memory | Coprocessor Privileges |
| --- | --- | --- | --- | --- | --- |
| RISC-V NC (NCRISC) | NoC 0 Control, DRAM Ingress | Data Movement | ½ KiB Cache + 16 KiB RAM | 4 KiB RAM | Debug Bus Only |
| RISC-V T0 (TRISC0) | Unpacker Hardware Dispatch | Data Unpack | 2 KiB Cache | 2 KiB RAM | High / Full Access |
| RISC-V T1 (TRISC1) | Matrix (FPU) & Vector (SFPU) Dispatch | Compute | ½ KiB Cache | 2 KiB RAM | High / Full Access |
| RISC-V T2 (TRISC2) | Packer Hardware Dispatch | Data Pack | 2 KiB Cache | 2 KiB RAM | High / Full Access |
| RISC-V B (BRISC) | NoC 1 Control, DRAM Egress, Tile Orchestration | Data Movement | 2 KiB Cache | 4 KiB RAM | Moderate |
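
The ingress -> unpack -> compute -> pack -> egress life cycle can be sketched as a chain of stages, one per baby core. This is a toy model, not tt-metal code: the function names, the data format and the elementwise multiply are all made up for illustration, and the generators only capture the ordering of the stages, not the fact that the five cores run at the same time on real hardware.

```python
def ncrisc_ingress(tiles):
    # Stand-in for NoC reads that bring operand tiles into local SRAM.
    for tile in tiles:
        yield tile

def trisc0_unpack(stream):
    # Stand-in for the unpacker loading operands into srcA / srcB.
    for a, b in stream:
        yield {"srcA": a, "srcB": b}

def trisc1_math(stream):
    # Stand-in for math dispatch; here just an elementwise multiply.
    for regs in stream:
        yield [x * y for x, y in zip(regs["srcA"], regs["srcB"])]

def trisc2_pack(stream):
    # Stand-in for the packer formatting results and writing them to SRAM.
    for acc in stream:
        yield tuple(acc)

def brisc_egress(stream):
    # Stand-in for NoC writes that send results out of the core.
    return list(stream)

tiles = [([1, 2], [3, 4]), ([5, 6], [7, 8])]
out = brisc_egress(trisc2_pack(trisc1_math(trisc0_unpack(ncrisc_ingress(tiles)))))
print(out)  # [(3, 8), (35, 48)]
```

Each stage only ever talks to its neighbours in the chain, which is roughly the mental model to keep: the baby cores steer data through fixed-function hardware rather than crunching it themselves.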

Pipeline

SRAM L1/Scratchpad

TT calls this L1, but it’s really just a scratchpad, as mentioned in their tt-metal docs.
Each Tensix core has its own 1.5 MB of SRAM for holding transient data, and across all the Tensix cores on a chip that adds up to plenty of capacity. The architecture avoids complex cache hierarchies and instead uses a flat, software-managed SRAM pool.
The architecture takes a hybrid approach where data can live in both SRAM and DRAM: this avoids keeping excessive data on-chip, while keeping enough intermediate data in SRAM to avoid repeated DRAM accesses.
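
As a quick sanity check, multiplying the per-core figure by the core counts from the table in section 0 reproduces the On-Die SRAM row for Wormhole and Blackhole. The helper name is made up; the numbers are the ones quoted in this post. Grayskull is left out because its 120 MB over 120 cores implies a smaller per-core figure.

```python
SRAM_PER_CORE_MB = 1.5  # per-core figure quoted in this section

def aggregate_sram_mb(cores, per_core_mb=SRAM_PER_CORE_MB):
    """Total on-die SRAM if every usable core contributes per_core_mb."""
    return cores * per_core_mb

# Core counts taken from the table in section 0.
for name, cores in [
    ("Wormhole, one ASIC", 64),
    ("Wormhole n300s (two ASICs)", 128),
    ("Blackhole p150a", 120),
]:
    print(name, aggregate_sram_mb(cores), "MB")
# Wormhole, one ASIC 96.0 MB
# Wormhole n300s (two ASICs) 192.0 MB
# Blackhole p150a 180.0 MB
```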

  • It’s organised as 16 banks (91.5 KiB each), each capable of one 128-bit read or write per cycle.
  • Any port can access any bank, but a bank serves only one port at a time. If there’s a conflict, the remaining ports are forced to wait.
  • Some ports (6 of them) have multiple clients attached via a mux; only one client can use the port at a time, and the rest are forced to wait.
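
The bank numbers above give two easy back-of-the-envelope figures, and the “one port per bank at a time” rule can be modelled in a few lines. This is a sketch under assumptions: the port names are invented, and the “lowest port name wins” tie-break is arbitrary, since the text above doesn’t say how conflicts are actually prioritised.

```python
BANKS = 16
BANK_KIB = 91.5
BYTES_PER_ACCESS = 128 // 8  # one 128-bit access per bank per cycle

print(BANKS * BANK_KIB)          # 1464.0 KiB, a bit under the nominal 1.5 MB (1536 KiB)
print(BANKS * BYTES_PER_ACCESS)  # 256 bytes per cycle, only if every bank is busy

def arbitrate(requests):
    """One cycle of arbitration. requests maps port name -> bank index.

    Returns (granted, stalled): each bank serves one port, the rest wait.
    """
    granted, stalled, busy = {}, [], set()
    for port in sorted(requests):  # assumption: arbitrary fixed priority
        bank = requests[port]
        if bank in busy:
            stalled.append(port)   # bank already taken this cycle
        else:
            busy.add(bank)
            granted[port] = bank
    return granted, stalled

print(arbitrate({"p0": 3, "p1": 3, "p2": 7}))
# ({'p0': 3, 'p2': 7}, ['p1'])
```

The 256 bytes per cycle is a best case; any two ports landing on the same bank in the same cycle eat into it, which is why data placement across banks matters.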

L1

Functionalities supported by the ports:
  • 128-bit reads and writes.
  • Narrow reads, implemented by simply reading 128 bits and discarding the unwanted bits, so they use the same bandwidth as a normal read.
  • Narrow writes, implemented as a read-modify-write operation that blocks the port as well as the bank for 5 cycles, so they cost 5 times the bandwidth of a normal write.
  • Atomics (operations executed as one uninterrupted task) are implemented as read-modify-write operations and block the port and the corresponding bank for 5 cycles. This function is exposed to the NoC and ThCon.
  • Near-memory accumulate (4x FP32 or INT32, sign-magnitude with saturation, or 8x FP16 or BF16) is implemented either as an atomic read-modify-write operation or as a non-atomic operation. If non-atomic, it blocks the port and bank for only 2 cycles, but users have to ensure that concurrent operations are not targeting overlapping addresses in the banks. This function is exposed to the packers.
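
The cycle counts in this list can be turned into a rough cost table for estimating how long a port and bank stay busy. A minimal sketch with made-up operation names; it assumes a full 128-bit read or write takes one cycle (from the “per cycle” figure above), that the atomic flavour of accumulate costs the same 5 cycles as other atomics, and that there is no contention from other ports.

```python
PORT_BUSY_CYCLES = {
    "read128": 1,               # assumption: one cycle per full access
    "write128": 1,              # assumption: one cycle per full access
    "narrow_read": 1,           # same bandwidth as a full read
    "narrow_write": 5,          # read-modify-write
    "atomic": 5,                # read-modify-write
    "accumulate_atomic": 5,     # assumption: same cost as other atomics
    "accumulate_nonatomic": 2,  # caller must avoid overlapping addresses
}

def busy_cycles(ops):
    """Total cycles one port/bank pair is blocked by a sequence of ops."""
    return sum(PORT_BUSY_CYCLES[op] for op in ops)

# Four full writes followed by four narrow writes.
trace = ["write128"] * 4 + ["narrow_write"] * 4
print(busy_cycles(trace))  # 24
```

In this example the four narrow writes account for 20 of the 24 cycles, which is the practical takeaway: full 128-bit writes are much cheaper than narrow ones.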

The physical placement of the DRAM controllers is exploited by default through “interleaved” mode, where data is distributed across all available controllers to balance performance. For specific operations, data can instead be stored in “sharded” mode, where tensors are distributed across multiple Tensix cores. We’ll see both in detail.
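
Until then, here is a rough sketch of the difference between the two modes. Everything in it is an assumption for illustration: the round-robin page mapping, the contiguous row split, the function names, and the controller and core counts are not taken from tt-metal.

```python
def interleaved_controller(page_index, num_controllers):
    # Round-robin: consecutive pages land on consecutive DRAM controllers.
    return page_index % num_controllers

def shard_rows(num_rows, num_cores):
    # Split tensor rows into contiguous, near-equal chunks, one per core.
    base, extra = divmod(num_rows, num_cores)
    shards, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)  # first `extra` cores get one more row
        shards.append(range(start, start + size))
        start += size
    return shards

# Hypothetical: 8 pages over 6 controllers, 10 rows over 4 cores.
print([interleaved_controller(p, 6) for p in range(8)])
# [0, 1, 2, 3, 4, 5, 0, 1]
print([list(s) for s in shard_rows(10, 4)])
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Interleaving spreads traffic so no single controller becomes the bottleneck, while sharding keeps each core’s slice of the tensor close to where it is consumed.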

[To be continued because lazy]