## Microthreading and $\mu$TC for Massive On-Chip Concurrency



A Microgrid consists of **configured rings** of Microthreaded cores (see 3a) and uses a diffuse **COMA** memory system, which allows flexible partitioning of on-chip memory. An SEP core allocates rings of cores to threads, which can then delegate work.

**Figure:** Rendering 15,000 points on 32 cores, shown after 35,000 cycles. Left: without Microthreading (1 thread per core). Right: with Microthreading (256 threads per core). $\Rightarrow$ The Microthreaded version is nearly $4\times$ faster, due to latency handling.

**Figure:** Benchmark results for the Livermore Kernels and for NAS Integer Sort (sorting 1 million numbers).





$\Rightarrow$ Even for large numbers of cores, near-theoretical-maximum performance is achieved!

Dependency rings are circuit switched; the delegation grid is packet switched.

A create instruction distributes a family of threads to a ring of processors.

## References:

K. Bousias, L. Guang, C.R. Jesshope, M. Lankamp (2008). Implementation and Evaluation of a Microthreaded Architecture. Journal of Systems Architecture.

## 4) The Microthreaded C ($\mu$TC) System Level Language

- Captures the ISA extensions of the $\mu$T architecture in the C language
- Intended as a compiler target language for C, C++ etc.

```
/* C Sum of Squares */
float a[100], b[100], sum=0;
for (int i = 0; i < 100; i++) {
    sum += a[i]*a[i] + b[i]*b[i];
}
```

- thread, family, place (resources) are captured as new C types
- **shared** type modifier shares a variable between threads, as an i-structure

```
/* µTC Sum of Squares */
float a[100], b[100], sum=0;
thread void sqsum(shared float s) {
    index i;
    s += a[i]*a[i] + b[i]*b[i];
}
family fid;
place pid;
shared float sum; /* will contain result */
create (fid, pid, 0, 99, sqsum, sum);
sync(fid);
```

Example of a loop in standard C transformed into $\mu$TC, with the dependency captured.