

#### Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10-Million Cores

Haohuan Fu

National Supercomputing Center in Wuxi Department of Earth System Science, Tsinghua University

March 14th 2017 @ SCF

# Outline





# Sunway TaihuLight









# Sunway TaihuLight



| City      | Rank in Top100 |  |  |  |
|-----------|----------------|--|--|--|
| Shanghai  | 1              |  |  |  |
| Suzhou    | 7              |  |  |  |
| Wuxi      | 14             |  |  |  |
| Nantong   | 24             |  |  |  |
| Changzhou | 34             |  |  |  |
| Jiaxing   | 50             |  |  |  |



# **The Sunway Machine Family**





- Sunway BlueLight:
- NSCC-Jinan, 2011
- 16-core processor
- 1 Pflops
- $14^{th}$  of TOP500



- Sunway TaihuLight: - NSCC-Wuxi, 2016
- 260-core processor
- 125 Pflops
- 1<sup>st</sup> of TOP500

# Sunway TaihuLight: Overview

| Entire System          |              |
|------------------------|--------------|
| Peak Performance       | 125 PFlops   |
| Linpack Performance    | 93 PFlops    |
| Total Memory           | 1310.72 TB   |
| Total Memory Bandwidth | 5591.45 TB/s |
| # nodes                | 40,960       |
| # cores                | 10,649,600   |



# **Sunway TaihuLight: Overview**

| Each Node               |             |
|-------------------------|-------------|
| Peak Performance        | 3.06 TFlops |
| Memory                  | 32 GB       |
| Memory Bandwidth        | 136.5 GB/s  |
| # CPU                   | 1           |
| # cores                 | 260         |
| 260 cores per processor |             |

- 0000300
- 4 Core Groups (CGs), each of which has:
  - 1 Management Processing Element (MPE)
  - 64 (8x8) Computing Processing Elements (CPEs)



# SW26010: Sunway 260-Core Processor





- computing node
- computing board
- super node
- cabinet
- entire computing system



- computing node
- computing board
- super node
- cabinet
- entirecomputingsystem





- computing node
- computing board
- super node
- cabinet
- entirecomputingsystem





- computing node
- computing board
- super node
- cabinet
- entirecomputingsystem





### A Five-Level Integration Hierarchy

- computing node
- computing board
- super node

cabinet

entirecomputingsystem



40×4×256×4×(1+8×8) = 10,649,600





40×4×256×4×(1+8×8) = 10,649,600

2D core array with row and column buses



40×4×256×4×(1+8×8) = 10,649,600

2D core array with row and column buses

Network on Chip









## Sunway TaihuLight V.S. Other Systems

| System                       | TaihuLight | Tianhe-2  | Titan     | Sequoia    | Cori      |
|------------------------------|------------|-----------|-----------|------------|-----------|
| Peak Performance (PFlops)    | 125.4      | 54.9      | 27.1      | 20.1       | 27.9      |
| Total Memory (TB)            | 1310       | 1024      | 710       | 1572       | 879       |
| Linpack Performance (PFlops) | 93.0(74%)  | 33.9(62%) | 17.6(65%) | 17.2(85.3) | 14.0(50%) |
| Rank of Top500               | 1          | 2         | 3         | 4          | 5         |
| Performance/Power (Mflops/W) | 6051.3     | 1901.5    | 2142.8    | 2176.6     | 3266.8    |
| Rank of Green500             | 4          | 135       | 100       | 90         | 26        |
| GTEPS                        | 23755.7    | 2061.48   | ###       | 23751      | ###       |
| Rank of Graph500             | 2          | 8         | ###       | 3          | ###       |
| HPCG (Pflops)                | 0.3712     | 0.5801    | 0.3223    | 0.3304     | 0.3554    |
| Rank of HPCG                 | 4          | 2         | 7         | 6          | 5         |





Satoshi Matsuoka @ProfMatsuoka



I was quite impressed with the engineering quality of TaihuLight, different from previous Chinese machines; now truly rivals US, Japan in SC twitter.com/profmatsuoka/s...

下午4:40 - 2016年11月3日 发自 **東京 目黒区** 







```
Satoshi Matsuoka
@ProfMatsuoka
```









Japan i

下午4:40

TaihuLight physical design is excellent with low num. of chips, dual-sided surface mounting of all components for dense cold plate cooling .

```
下午5:57 - 2016年11月3日
```







@ProfMatsuoka



Finally their design was cost&utility conscious. No expensive parts, quacky architecture, etc. Sunway apparently plans to sell the machine.

下午6:08 - 2016年11月3日



# Outline





# **Programming Model on TaihuLight**

#### MPI + X X: (Sunway OpenACC / Athread)

#### MPI

One MPI process runs on one management core (MPE)

#### Sunway OpenACC

Sunway OpenACC conducts data transfer between main memory and local data memory (LDM), and distributes the kernel workload to the computing cores (CPEs)

#### Athread

 Athread is the threading library to manage threads on computing core (CPE), which is used in the Sunway OpenACC implementation



# **Brief Overview of Sunway OpenACC**

- Sunway OpenACC is a directive-based programming tool for SW26010
  - OpenACC2.0 based
  - Extensions for the architecture of SW26010
  - Supported by SWACC/SWAFORT compiler
  - Source-to-Source compiler
  - **Based on ROSE compiler infrastructure (0.9.6a)** 
    - An open Source compiler infrastructure to build source-to-source program transformation and analysis
    - Developed by LLNL



# **Brief View of Sunway OpenACC compiler**



Compute pattern: data into SPM -> calculation -> data out to Main memory

Workload distribution and the size for data transfer are automatically determined by compiler

国家超级计算无键 National Supercomputing Cen

#### The difference between the Memory Models





# The Principal Extension of Sunway OpenACC

#### Extend the usage of data environment directives

- □ Use *data copy* inside the accelerator parallel region
- **c** *copy on parallel* **perform data moving and distributing between LDMs**
- Add new directives/clauses
  - □ *local* clause, to allocate space on **LDM** of CPE thread.
  - Data transform support to speedup data transfer
    - pack/packin/packout clause
    - *swap/swapin/swapout* clause
  - annotate clauses to better controls of data movement and execution from compiler
    - *tilemask, entire, co\_compute*





#### **Directly data transfer between MEM and LDM**

Use *data copy* to handle data moving between Mem and LDM

!\$acc data copyin(A) copyout(B)
!\$acc parallel loop
do i=1,128
 m = func(i)
 do j=1,128
 B(j, i) = A(j, m)
 enddo
enddo
!\$acc end parallel loop
!\$acc end data

#### OpenACC2.0

- Moving A, B between host memory and device memory (e.g. global memory on GPU)
- Executed by host thread

#### Sunway OpenACC

- Moving A(\*, m)、 B(\*, i) between host memory and LDM in each i-loop
- Executed by each CPE thread



#### **Distributed moving data between MEM and LDMs**

 Use *copy clause of parallel* to move and distribute data between Mem and LDMs\_\_\_\_\_\_

```
!$acc parallel loop copyout(C) copyin(A, B)
do i = 1, 256
do j = 1, 512
C(j, i) = A(j, i) + B(j, i)
end do
end do
!$acc end parallel loop
```

#### OpenACC2.0

- Moving A、B between host memory and device memory
- Executed by host thread

#### Sunway OpenACC

- Moving A(\*, i)、 B(\*, i)、 C(\*, i) between host memory and LDM in each i-loop
- Executed by each slave thread on CPE
- Data distribution controlled by compiler
- For readonly arrays with small size , can use *copyin(arr) annotate(entire(arr))* to specify that the *arr* will be put totally into LDM of Each CPE



#### **Control the size of data moved to LDM**

Use *tile clause* to control the granularity of data moved to LDM



#### tile:

- allocate buffer(256, 2, 1) in each LDM for A, B, C.
- same size of data being moved to each LDM each round.
- two loops on j is assigned to each CPE thread.

- Add *tilemask* for better data transfer
  - *Tilemask(var-list)* means the *tile* will not affect the variables in var-list.
  - Move more data in one transfer.
  - Buffer\_C(1,1) will be allocated in LDM without *tilemask*.
  - Buffer\_C(128, 1) will be allocated with *tilemask*.





## Local and private data management

- Add *local clause* to allocate LDM space for private data
  - Usage: *local(var-list)*
  - Variables in var-list are private for CPE thread, and will be placed in LDM.
  - Used for private variables with small size.
- Use *private+copy* to manage private data with large size
  - Array with large size can not be put into LDM.
  - Step 1: private(var-list), private vars will be allocated in private space of each CPE in Main Memory.
  - Step 2: copy(the-same-var-list), the private data will be copied into LDM, piece-by-piece.
  - Buffer\_tmp(256, 2) will be allocated and maintained in LDM.

```
!$acc parallel loop copyin(A, B) copyout(C)
!$acc& private(tmp) copy(tmp)
do i = 1.64
 !$acc loop tile(2)
 do j = 1, 128
   do k = 1, 256
      tmp(k, j) = A(k, j, i) + B(k, j, i)
    end do
    \dots ! some compute on tmp(*,*)
    do k = 1, 256
       C(k, j, i) = tmp(k, j)
    end do
  end do end do ! end of j-loop
  !$acc end loop
end do ! end of i-loop
!$acc end parallel loop
```



# Packing data to improve transfer efficiency

- pack clause
  - pack/packin/packout
  - $\square pack = dataPack + copy$

pack multiple variables into a new variable by MPE, and copy data between MEM and SPM with the new one by CPE.

Most useful for multiple scalars.





# Transposing array to improve transfer efficiency

- swap clause
  - swap/swapin/swapout
  - swap = ArrayTranspose + copy
- Improve the space-locality and the data transmission efficiency to avoid repeated stride copy
- Use CPE threads to perform array transpose for better bandwidth utilization.
- Efficient transpose algorithm , can support 6-dim array.



## **Other Extensions**

#### Cooperative computing

- Treat the host thread the same as device thread
- N device threads, 1 host thread
- N+1 threads execute the parallel loop
- Add *co\_compute* clause
  - Used on *loop* directive
  - annotate(co\_compute)

```
#pragma acc parallel copyout(A)
{
    #pragma acc loop annotate(co_compute)
    for(i = 0; i < 130; i++)
    {
        A[i] = i;
    }
}</pre>
```



## Athread

Threading library to manage threads on CPEs

Similar to posix Pthreads

| Routine                                                             | Functionality                                                                                                         |
|---------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| int athread_init()                                                  | Initialize the athread library                                                                                        |
| <pre>int athread_create(int id, start_routine fpc, void *arg)</pre> | Start a routine with id executing the function pointed by fpc, the parameters for the function fpc are pointed by arg |
| int athread_wait(int id)                                            | Wait for the completion of the thread ID                                                                              |
| int athread_end(int id)                                             | Terminate the thread by specifying thread ID                                                                          |
| <pre>int athread_spawn(start_routine fpc, void *arg)</pre>          | Spawn a set of threads making use of all CPEs                                                                         |
| int<br>athread_get_max_threads()                                    | Get the maximum number of active threads                                                                              |
| <pre>int athread_get_id()</pre>                                     | Get the ID of current threads                                                                                         |



....

## Athread example

#### MPE code

```
\mathbb{R} < > c a.c > No Selection
```

```
#include <athread.h>
extern SLAVE_FUN(fun_sw)();
int main() {
    int a[64] = \{0\};
    int i;
    for(i = 0; i < 64; i ++)
        a[i] = i;
    athread_init();
    athread_spawn(fun_sw, a);
    athread_join();
    for(i = 0; i < 64; i ++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
```



}

## Outline





atmosphere model

















# Increase in Spatial and Temporal Resolution to be Cloud-Resolving and Eddy-Resolving





#### Simulation of more and more detailed physics processes





#### 10240 parallel earths



Online Ensembles 10240 NICAM Samples on K computer

Courtesy of Takemasa Miyoshi's talk at BDEC 2017, Wuxi.



#### **Balancing Science Goals with Computing Power**





## The Gap between Software and Hardware

- millions lines of legacy code
- poor scalability
- written for multi-core, rather than many-core

#### 100T



#### China's models

- pure CPU code
- scaling to hundreds or thousands of cores

#### China's supercomputers

heterogeneous systems with many-core chips

100P

• millions of cores



## **Our Research Goals**





## **Our Research Goals**





## **Highly-Scalable Atmospheric Simulation Framework**



The "Best" Computational Solution



### 2012: 2D SWE Solver on Tianhe-1A

### Starting from shallow wave equation

- cubed-sphere mesh grid
- adjustable partition between CPU and GPU
- scale to 40,000 CPU cores and 3750 GPUs with a sustainable performance of 800 TFlops





"A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations", in *Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)*, pp. 1-12, Shenzhen, 2013.

## 2013: 3D Euler Equation Solver on Tianhe-2



### A Sustained Performance of 1.7 Pflops



"Ultra-scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2", IEEE Transaction on Computers.

### 2013: 2D SWE Solver on FPGA





#### 100x speedup over 6-core CPU, 5x speedup over GPU

| Mesh size: $1024 \times 1024 \times 6$ |                                |         |                 |                                      |                     |
|----------------------------------------|--------------------------------|---------|-----------------|--------------------------------------|---------------------|
| platform                               | performance<br>(points/second) | speedup | power<br>(Watt) | efficiency<br>(points/(second·Watt)) | power<br>efficiency |
| 6-core CPU                             | 4.66K                          | 1       | 225             | 20.71                                | 1                   |
| Tianhe-1A node                         | 110.38K                        | 23x     | 360             | 306.6                                | 14.8x               |
| MaxWorkstation                         | 468.11K                        | 100x    | 186             | 2.52K                                | 121.6x              |
| MaxNode                                | 1.54M                          | 330x    | 514             | 3K                                   | 144.9x              |



"Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine", in Proceedings of the 23rd International Conference on Field Programmable Logic and Applications, 2013.

#### 2013: 2D SWE Solver on FPGA



-pipelined hardware design

### Selected as one of the 27 Significant Papers of FPL in 25 Years (27 out of 1765)

| FPGA | float(8,32)<br>float(11,53) |
|------|-----------------------------|
|      | output                      |

input

**100x** speedup over 6-core CPU, 5x speedup over GPU

| Mesh size: $1024 \times 1024 \times 6$ |                                |         |                 |                                      |                     |
|----------------------------------------|--------------------------------|---------|-----------------|--------------------------------------|---------------------|
| platform                               | performance<br>(points/second) | speedup | power<br>(Watt) | efficiency<br>(points/(second·Watt)) | power<br>efficiency |
| 6-core CPU                             | 4.66K                          | 1       | 225             | 20.71                                | 1                   |
| Tianhe-1A node                         | 110.38K                        | 23x     | 360             | 306.6                                | 14.8x               |
| MaxWorkstation                         | 468.11K                        | 100x    | 186             | 2.52K                                | 121.6x              |
| MaxNode                                | 1.54M                          | 330x    | 514             | 3K                                   | 144.9x              |



"Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine", in Proceedings of the 23rd International Conference on Field Programmable Logic and Applications, 2013.















## **Strong-scaling results**



The 3-km res run: 1.01 SYPD with 10.6M cores, dt=240s, I/O penalty <5%



## Weak-scaling results



The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit

## **Our Research Goals**





#### The CESM Project on Sunway TaihuLight



Tsinghua + BNU 30+ Professors and Students



Four component models, millions lines of codeLarge-scale run on Sunway TaihuLight

- 24,000 MPI processes
- Over one million cores
- 10-20x speedup for kernels
- 2-3x speedup for the entire model

"Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer", in Proceedings of SC 2016.

## **Major Challenges**

a high complexity in application, and a heavy legacy in the code base (millions lines of code)

an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)

misfit between the in-place design philosophy and the new architecture

lack of people with interdisciplinary knowledge and experience



## Workflow of CAM

Pass tracers (u, v) to dynamics



After initialization, the physics and the dynamics are executed in turn during each simulation time-step.



## **Porting of CAM: General Idea**

- Entire code base: 530, 000 lines of code
- Components with regular code patterns
  - e.g. the CAM-SE dynamic core
  - manual OpenACC parallelization and optimization on code and data structures
- Components with irregular and complex code patterns
  - e.g. the CAM physics schemes
  - Ioop transformation tool to expose the right level of parallelism and code size
  - memory footprint analysis and reduction tool



## **Refactoring the Euler Step**





## **Refactoring the Euler Step**

|                                         | · · ·                        | 1 |
|-----------------------------------------|------------------------------|---|
| do ie = nets, nete                      | do ie = nets, nete           |   |
| do k = 1, nlev                          | do q = 1, qsize              |   |
| dp(k) = func_1()                        | do k = 1, nlev               |   |
| do q = 1, qsize                         |                              |   |
|                                         | end do                       |   |
| Qtens(k,q,ie) = func_2(dp(k))<br>end do | end do                       |   |
|                                         | end do                       |   |
| end do                                  |                              |   |
| end do                                  | do ie = nets, nete           |   |
|                                         | do k = 1, nlev               |   |
| do ie = nets, nete                      | dp0 = func_3()               |   |
| do k = 1, nlev                          | dpdiss = func_4()            |   |
| do q = 1, qsize                         | do q = 1, qsize              |   |
| qmin(k,q,ie) =                          | Qtens(k,q,ie) =              | 1 |
| qmax(k,q,ie) =                          | func_5(dp0,                  |   |
| end do                                  | dpdiss)                      |   |
| end do                                  | end do                       |   |
| end do                                  | end do                       |   |
|                                         | do k = 1, nlev               | 1 |
|                                         | dp_star(k) = func_8(dp(k))   |   |
| do ie = nets, nete                      | end do                       |   |
| do k = 1, nlev                          |                              |   |
| dp(k) = func_5()                        | do k = 1, nlev               |   |
| Vstar(k) = func_6()                     | Qtens(k,q,ie) =              |   |
| end do                                  |                              |   |
|                                         | func_9(dp_star(k))<br>end do |   |
| do q = 1, qsize                         |                              |   |
| do k = 1, nlev                          | end do                       |   |
| Qtens(k,q,ie) =                         | Data packing                 |   |
| func_7(dp(k), Vstar(k))                 | end do                       |   |
| end do                                  | (2)                          |   |
|                                         |                              |   |



## **Refactoring the Euler Step**





## **Refactoring of the Physics Schemes**





## Loop Transformation for Phys\_run1



### Variable Storage Space Analysis and Reduction Tool



#### **Basic functions**

- Estimate the storage requirements of the variable and arrays
- Identify the lifespan of the variables and arrays
- Determine whether the variables and arrays of each CPE thread can fit into the 64KB SPM.

#### **Example Explanation**

• The original Fortran function accesses 7 intermediate arrays (A to G) during the computation process. By analyzing the lifespan of these 7 arrays, which are annotated by the lines above these arrays, we can determine that 4 arrays would provide sufficient space to store these 7 arrays in different stages of the execution process.

## **Speedup of Major Kernels in CAM-SE**





## **Speedup of Major Kernels in CAM-PHY**





## CAM model: scalability and speedup



MPE only

MPE+CPE for dynamic core
MPE+CPE for both dynamic core and physics schemes

## Library for Deep Learning (swDNN)

#### swDNN: Provide interface for optimized basic operators

- □ Fully-connected layer (BLAS); Pooling layer
- Activation function; Batch Normalization
- Convolutional Layer(90% time for CNN)

#### **Related Works on other architectures**

| Work                | Platform | Method        |
|---------------------|----------|---------------|
| cuDNN(2014)         | GPU      | GEMM          |
| fbtfft(2014)        | GPU      | FFT           |
| Andrew Lavin (2015) | GPU      | Winograd      |
| Chen Zhang (2015)   | FPGA     | Direct Conv   |
| swDNN               | SW26010  | Blocking GEMM |



### Library for Deep Learning (swDNN)

#### Performance

- **Convolutional performance above 1.6 Tflops with double-precision**
- □ Speedup ranging from **1.91x to 9.75x** compared with cudnnv5.1.



swdnn cudnnv5



swdnn cudnnv5



### Framework for Deep Learning (under development)

### Distributed framework

- □ Customized from *Caffe* with less dependencies
- Two-level Parameter Server Based-on MPI





### swDNN Supported Project: Sunway-Lingo collaborated with Prof. Zhiqing Liu, BUPT



- Original go board to be processed
- Converted to a 48-channel image fed to deep CNN with essential go features such as liberties
- Order of probabilities of plausible moves as outputted by policy network



## Long Term Plan

- Traditional HPC Applications (Science -> Service)
  - weather / climate service
  - seismic data processing service
  - **CFD** simulation framework for Advanced Manufacturing
- Deep Learning Related Applications
  - the swDNN framework
  - collaborating with face++ for face recognition applications
  - collaborating with Sogou for voice recognition and translation
  - customized DNN Sunway chip?
- Big Data Center
  - National Health and Medical Big Data Center at Nanjing



# THANK YOU