# Xbyak\_aarch64; Just-In-Time Assembler for ARMv8-A and Scalable Vector Extension



Kentaro Kawakami <kawakami.k@fujitsu.com> Fujitsu Laboratories Ltd., Kawasaki, Japan

### About Me

- Kentaro Kawakami <kawakami.k@fujitsu.com>
  - GitHub account: kawakami-k
  - Senior Researcher at Platform Innovation project, Fujitsu Laboratories Ltd., Japan
  - Engaged in R&D of AI software for Arm high-performance computing
  - Developing the deep learning software stack for Fugaku, the world first Arm ISA-based supercomputer, for the last two years





## Table of Contents

- What is Xbyak\_aarch64?
  - Sample programs
- Proven working configuration
- Usage of Xbyak\_aarch64
- How to debug programs implemented with Xbyak\_aarch64
- Summary



## What is Xbyak\_aarch64?



## What is Xbyak\_aarch64?

- A JIT assembler for AArch64
- Enables to assemble AArch64 mnemonic at runtime
- We can make generators that produce instruction sequence
- Just-In-Time (JIT) functions are generated by the generators at runtime
- Based on Xbyak that is for x64 CPUs by S. Mitsunari (Cybozu Labs. Inc.)

mov(reg0, 0); ldr(reg1, x[0]); add(reg0, reg0, reg1); ldr(reg1, x[1]); add(reg0, reg0, reg1);

Gen\_func(x, 2);

Various JIT functions can be generated by input parameters

mov(reg0, 0); ldr(reg1, x[0]); add(reg0, reg0, reg1);

Gen\_func(x, 1);



### Minimal Sample Code

/\* Write source code in C++11 or later \*/ #include <xbyak\_aarch64/xbyak\_aarch64.h> using namespace Xbyak aarch64; class Generator : public CodeGenerator { public: Generator() { add(w0, w1, w0); ret(); **}**; int main() { Generator gen; gen.ready(); auto f = gen.getCode<int (\*)(int, int)>(); int a = 3, b = 4;printf("%d + %d = %d¥n", a, b, f(a, b))return 0;

Red bold texts are the functions, classes and instances provided by Xbyak\_aarch64.

Include "Xbyak\_aarch64.h".

Define your class inheriting "CodeGenerator".

Implement instruction sequence you want to do.

Machine code sequence, "add" and "ret" is generated.

Define a function pointer for the machine code sequence.

The machine code sequence can be called as a C++ function.

make & execute

> ./a.out > 3 + 4 = 7 >



### Machine Code Sequence Generation



The machine code sequence can be called as a function.



Supported Instructions

- ARMv8-A, ARMv8.1, ARMv8.2, ARMv8.3 instructions
- Scalable Vector Extension (SVE) instructions



~1K mnemonics

### Advantage of Xbyak\_aarch64 compared to Existing Assembler

- Easier to write assembly code
  - Simple assembly description in C++ syntax
  - Loop unrolling is easy to describe

Dynamical unrolling

. . . .

for (int j = 0; j < 15; ++j) fmla(ZRegS(j), PReg(0), ZRegS(j + 15), ZRegS(31));

Implementation becomes simpler. (The above code generates 15 "fmla" instructions.)

> fmla z0.s, p0.s, z15.s, z31.s fmla z1.s, p0.s, z16.s, z31.s

– x18

fmla z14.s, p0.s, z29.s, z31.s

- Optimization using runtime parameters
  - It is possible to change the instructions using parameters

| Dynamical instruction selection                        |
|--------------------------------------------------------|
| if ( isPowOfTwo(param) ) {                             |
| int bitPos = 0;                                        |
| do { bitPos +=1;                                       |
| } while (param = (param >> 1));                        |
| <pre>Isr(r0, r1, bitPos); // Logical shift right</pre> |
| } else {                                               |
| udiv(r0, r1, param); // unsigned division              |
| }                                                      |

Considering execution latency, "Isl" is more preferable than "udiv"



### Performance Comparison

- Environment: FX700 / GCC 8.3.1
  - CPU: A64FX (ARMv8-A + 512-bit SVE)
  - Compile options: -march=armv8.2-a+sve -fopenmp -O3
- Measurement conditions
  - Reduction operation for N-size array (N = 512)

| o iterated 10 millio             | on times                                   | }                                                                       |
|----------------------------------|--------------------------------------------|-------------------------------------------------------------------------|
| Reference code                   | Reference code + pragma                    | for(int i = numZregs; i > 0; i=i>>1){<br>for(int j =0; j < (i/2); j++){ |
| float reduction(float* A){       | float reduction(float* A){                 | if((j+(i/2)) < numZregs)                                                |
| s = 0;<br>for(i = 0; i < N; i++) | s = 0;<br>#pragma omp simd reduction (+:s) | <pre>fadd(ZRegS(j), ZRegS(j), ZRegS(j+(i/2))); }</pre>                  |
| s += A[i];<br>return s;          | for(i = 0; i < N; i++)<br>s += A[i];       | faddv(SReg(0), PReg(0), ZRegS(0));                                      |
| }                                | return s;                                  | ret();<br>}                                                             |
|                                  | ]}                                         |                                                                         |
| Execution time [sec]             | 6.30 x14 times faste                       | er 0.45 Linaro                                                          |
| 23.24 <u>x51</u>                 | .6 times faster                            |                                                                         |

JIT implementation with Xbyak\_aarch64

size\_t offset = sizeof(float) \* 16;

for(int i = 0; i < numZregs; i++){

void generate(int N) {

ptrue(PRegS(0));

int numZregs = N/16;

add (x1, x0, i \* offset); Idr(ZReg(i), ptr(x1));

# Proven working configuration



### Main Development Target

(Since Xbyak\_aarch64 is an OSS, it is basically provided as is.)

- H/W: Fugaku, Fujitsu PRIMEHPC FX1000/FX700
  - CPU: Fujitsu A64FX, designed for high-performance computing and complies with the ARMv8-A architecture profile and the Scalable Vector Extension (SVE)
- Compiler: FCC(Fujitsu C/C++ compiler)/GCC/LLVM
- Language: C++11 or later
- OS: Linux (RedHat Enterprise Linux 8.x)



## Proven working configurations

(Since Xbyak\_aarch64 is an OSS, it is basically provided as is.)

| H/W                              | CPU                             | OS (64-bit)                                          | Compiler          | Note                                        |
|----------------------------------|---------------------------------|------------------------------------------------------|-------------------|---------------------------------------------|
| FX1000*1                         | Fujitsu A64FX                   | RedHat Enterprise Linux 8.x                          | FCC <sup>*2</sup> | Well-tested                                 |
| FX700*1                          | Fujitsu A64FX                   | CentOS 8                                             | FCC/GCC/LLVM      | oneDNN * 3 works                            |
| IA server                        | QEMU 5.0.0<br>(Linux user mode) | (Host OS running on IA server)<br>Ubuntu 16.04.6 LTS | GCC               | oneDNN works                                |
| MAC mini, 2020                   | Apple M1                        | macOS Big Sur                                        | Apple clang       | oneDNN works                                |
| Raspberry Pi3<br>Model B Rev 1.2 | Broadcom<br>BCM2835             | Ubuntu 18.04.4 LTS                                   | GCC               | Some samples of<br>Xbyak_aarch64<br>works*4 |

\*1 <u>https://www.fujitsu.com/jp/products/computing/servers/supercomputer/</u> (in Japanese)

- \*2 C/C++ compiler of FUJITSU Software Compiler Package
- \*3 oneDNN is one of the applications that uses up all the functionality of Xbyak/Xbyak\_aarch64.
- \*4 A limited number of sample programs has been tested.



## Usage of Xbyak\_aarch64



### **Usage of Mnemonic Functions**

- Prototype declaration of the mnemonic functions
  - <u>https://github.com/fujitsu/xbyak\_aarch64/blob/main/xbyak\_aarch64/xbyak\_aarch64</u>
     <u>mnemonic\_def.h</u>

void add(const WReg &rd, const WReg &rn, const uint32\_t imm, const uint32\_t sh = 0); void add(const WReg &rd, const WReg &rn, const WReg &rm, const ShMod shmod = NONE, const uint32\_t sh = 0); void sub(const XReg &rd, const XReg &rn, const uint32\_t imm, const uint32\_t sh = 0); void sub(const XReg &rd, const XReg &rn, const XReg &rm, const ShMod shmod = NONE, const uint32\_t sh = 0); void ldr(const WReg &rt, const AdrReg &adr); void ldr(const XReg &rt, const AdrReg &adr); void ldr(const ZRegH &zdn, const \_PReg &pg, const ZRegH &zm);

- Usage samples
  - <u>https://github.com/fujitsu/xbyak\_aarch64/tree/main/sample/mnemonic\_syntax/nm.m</u> ake\*.cpp

add(w0, w0, 0x2aa); dump(); add(w0, w0, w2); dump(); sub(x0, x0, 0x2aa); dump(); sub(x0, x0, x2); dump(); ldr(w0, ptr( x3 ) ); dump(); ldr(x0, ptr( x3 ) ); dump(); fmul(z0.h, p7/T\_m, z7.h); dump();



## **General Purpose Register Class**

| Class name defined in Xbyak_aarch64 | Pre-instantiated variable | Remarks                             |
|-------------------------------------|---------------------------|-------------------------------------|
| W/Bog                               | w0, w1,, w30              | 32-bit general purpose registers    |
| WReg                                | wsp, wzr                  | 32-bit stack pointer, zero register |
| VDer                                | x0, x1,, x30              | 64-bit general purpose registers    |
| XReg                                | sp, xzr                   | 64-bit stack pointer, zero register |

WReg dstReg(0); WReg srcReg0(1); WReg srcReg1(2); add(dstReg, srcReg0, srcReg1); add(w0, w1, w2); for(size\_t i=0; i<16; i++) add(WReg(i), WReg(i), WReg(i+1));

- (A) Register instances can be freely defined.
- (B) These two line generate the same machine code of "add w0, w1, w2".
- (C) Register can be instantiated on the fly.

Xbyak\_aarch64 also defines the classes and has the pre-instantiated variables for V (128-bit SIMD), Z (SVE), P (scalable predicate) registers. Please refer <u>README.md</u> of Xbyak\_aarch64.



### Passing parameters to JIT-ed code/

### Receiving return value from JIT-ed code

• As JIT-ed code complies the procedure call standard of AArch64, JIT-ed code can freely exchange parameters with the code generated by compiler.

| Table 2, Gene | eral purpose regi | sters and AAPCS64 usage                                                                                                                            |   | From "Procedure Call Standard                                    |
|---------------|-------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|---|------------------------------------------------------------------|
| Register      | Special           | Role in the procedure call standard                                                                                                                | - | for the Arm 64-bit Architecture                                  |
| SP            |                   | The Stack Pointer.                                                                                                                                 |   | (AArch64)"                                                       |
| r30           | LR                | The Link Register.                                                                                                                                 |   |                                                                  |
| r29           | FP                | The Frame Pointer                                                                                                                                  |   | Generator() {                                                    |
| r19r28        |                   | Callee-saved registers                                                                                                                             |   | add(w0, w1, w0);                                                 |
| r18           |                   | The Platform Register, if needed; otherwise a temporary register. See notes.                                                                       |   | ret();                                                           |
| r17           | IP1               | The second intra-procedure-call temporary register (can be used by call veneers and PLT code); at other times may be used as a temporary register. |   | }                                                                |
| r16           | IP0               | The first intra-procedure-call scratch register (can be used by call veneers and PLT code); at other times may be used as a temporary register.    |   | The first and second parameters are<br>bassed by r0(w0), r1(w1). |
| r9r15         |                   | Temporary registers                                                                                                                                |   | The return value is passed by r0(w0).                            |
| r8            |                   | Indirect result location register                                                                                                                  |   |                                                                  |
| r0r7          |                   | Parameter/result registers                                                                                                                         |   | Connect                                                          |
|               |                   |                                                                                                                                                    |   | VIRTUAL • SPRING 2021                                            |

### Register usage in JIT-ed code

#### Table 2, General purpose registers and AAPCS64 usage

| Register | Special | Role in the procedure call standard                                                                                                                |   |
|----------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------|---|
| SP       |         | The Stack Pointer.                                                                                                                                 |   |
| r30      | LR      | The Link Register.                                                                                                                                 |   |
| r29      | FP      | The Frame Pointer                                                                                                                                  |   |
| r19r28   |         | Callee-saved registers                                                                                                                             | ┝ |
| r18      |         | The Platform Register, if needed; otherwise a temporary register. See notes.                                                                       |   |
| r17      | IP1     | The second intra-procedure-call temporary register (can be used by call veneers and PLT code); at other times may be used as a temporary register. |   |
| r16      | IP0     | The first intra-procedure-call scratch register (can be used by call veneers and PLT code); at other times may be used as a temporary register.    |   |
| r9r15    |         | Temporary registers                                                                                                                                | ┝ |
| r8       |         | Indirect result location register                                                                                                                  | ] |
| r0r7     |         | Parameter/result registers                                                                                                                         | ] |

#### JIT-ed code

- must save the values on these registers to the stack before use them,
- must restore them before "ret" instruction.

JIT-ed code can be freely use these registers.

"Procedure Call Standard" also defines the usage for V (128-bit SIMD) registers, Z (SVE) resisters and P (scalable predicate) registers of SVE, please refer them.



### Register usage in JIT-ed code

```
Generator() {
   /* stp: store register pair
    pre_ptr: Pre-index addressing
    sp: stack pointer register */
   for(size_t i=19; i<=28; i+=2)
      stp(XReg(i), XReg(i+1), pre_ptr(sp, -16));</pre>
```

```
/* Implement what you want to do
with x0 - x7, x9 - x15, x19 - x28. */
```

```
/* Idp: load register pair
    post_ptr: Post-index addressing */
for(size_t i=28; i>=19; i-=2)
    Idp(XReg(i-1), XReg(i), post_ptr(sp, 16));
```

#### **ret**();

Red bold texts are the functions, classes and instances provided by Xbyak\_aarch64.

Save the registers before use them

Restore the registers after use them



### Label and Branch Instructions

Red bold texts are the functions, classes and instances provided by Xbyak\_aarch64.

| Generator() {                  |                                     |                         |            |                         |                |
|--------------------------------|-------------------------------------|-------------------------|------------|-------------------------|----------------|
| Label L1, L2;                  | // Instancing Label cla             | ass of Xbyak_aarch64.   |            |                         |                |
| <mark>L (</mark> L1 <b>)</b> ; | // L function of Xbyak              | _aarch64 registers JIT- | -ed code a | ddress of this position | n to Label L1. |
| add (w0, w1, w0)               | ;                                   |                         |            |                         |                |
| cmp(w0, 13);                   | // Compare the register             | r wO value to the immed | liate valu | ie 13.                  |                |
| <b>b (EQ</b> , L2) ;           | // Branch to L2, if the             | e register w0 value ==  | 13.        |                         |                |
| sub(w1, w1, 1);                | // Decrement loop coun <sup>.</sup> | ter value.              |            |                         |                |
| <mark>b (</mark> L1) ;         | // Unconditional brancl             | h.                      |            |                         |                |
| <mark>L (</mark> L2) ;         |                                     | r                       |            |                         |                |
| <b>ret</b> ();                 | ∧ B+>                               | 0xffffbe7a0000          | add        | w0, w1, w0              |                |
| }                              | JIT-ed code                         | 0xffffbe7a0004          | cmp        | w0, #0xd                |                |
|                                | -oue                                | 0xffffbe7a0008          | b.eq       | 0xffffbe7a0014          | // b.none      |
|                                |                                     | 0xffffbe7a000c          | sub        | w1, w1, #0x1            | ,,             |
|                                |                                     |                         | b          | 0xffffbe7a0000          |                |
|                                |                                     | 0xffffbe7a0014          | ret        | UXIIII Del do 000       |                |
|                                |                                     | 0,11110010014           | Tet        |                         |                |



### **Referencing Static Table**



## Generating and Referencing Table



### Precautions1

- Xbyak\_aarch64 can output instructions that cannot be executed on the CPU running Xbyak\_aarch64.
  - Your CPU may not have support for cryptographic, atomic, SVE instructions etc., but Xbyak\_aarch64 running on your CPU can output machine code of these instructions, which raises the illegal instruction exception.

-> Please check your CPU capability and chose the mnemonic functions.

• In an extreme case, if Xbyak\_aarch64 is run on an x64 machine, any ARMv8-A machine code generated by Xbyak\_aarch64 causes the exception.



#### Table C2-2 Floating-point constant values

### Precautions2

- Xbyak\_aarch64 does not validate every argument passed to the mnemonic functions.
  - Example: immediate value of FMOV FMOV copies an immediate floatingpoint constant into every element of SIMD&FP register FMOV <Vd>.<T>, #<imm>

# Only these constant values are allowed for #<imm> of "FMOV"

From "Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile"

| . f h | bcd   |      |      |      |           |          |         |        |
|-------|-------|------|------|------|-----------|----------|---------|--------|
| efgh  | 000   | 001  | 010  | 011  | 100       | 101      | 110     | 111    |
| 0000  | 2.0   | 4.0  | 8.0  | 16.0 | 0.125     | 0.25     | 0.5     | 1.0    |
| 0001  | 2.125 | 4.25 | 8.5  | 17.0 | 0.1328125 | 0.265625 | 0.53125 | 1.0625 |
| 0010  | 2.25  | 4.5  | 9.0  | 18.0 | 0.140625  | 0.28125  | 0.5625  | 1.125  |
| 0011  | 2.375 | 4.75 | 9.5  | 19.0 | 0.1484375 | 0.296875 | 0.59375 | 1.1875 |
| 0100  | 2.5   | 5.0  | 10.0 | 20.0 | 0.15625   | 0.3125   | 0.625   | 1.25   |
| 0101  | 2.625 | 5.25 | 10.5 | 21.0 | 0.1640625 | 0.328125 | 0.65625 | 1.3125 |
| 0110  | 2.75  | 5.5  | 11.0 | 22.0 | 0.171875  | 0.34375  | 0.6875  | 1.375  |
| 0111  | 2.875 | 5.75 | 11.5 | 23.0 | 0.1796875 | 0.359375 | 0.71875 | 1.4375 |
| 1000  | 3.0   | 6.0  | 12.0 | 24.0 | 0.1875    | 0.375    | 0.75    | 1.5    |
| 1001  | 3.125 | 6.25 | 12.5 | 25.0 | 0.1953125 | 0.390625 | 0.78125 | 1.5625 |
| 1010  | 3.25  | 6.5  | 13.0 | 26.0 | 0.203125  | 0.40625  | 0.8125  | 1.625  |
| 1011  | 3.375 | 6.75 | 13.5 | 27.0 | 0.2109375 | 0.421875 | 0.84375 | 1.6875 |
| 1100  | 3.5   | 7.0  | 14.0 | 28.0 | 0.21875   | 0.4375   | 0.875   | 1.75   |
| 1101  | 3.625 | 7.25 | 14.5 | 29.0 | 0.2265625 | 0.453125 | 0.90625 | 1.8125 |
| 1110  | 3.75  | 7.5  | 15.0 |      | 0.234375  | 0.46875  | 0.9375  | 1.875  |
| 1111  | 3.875 |      |      |      |           |          |         |        |
|       |       |      |      |      |           |          |         |        |

#### Table C2-2 Floating-point constant values

110

0.5

0.53125

0.5625

0.59375

111

1.0

1.0625

1.125

1.1875

101

0.25

0.265625

0.28125

0.296875

### Precautions2

- Xbyak aarch64 does not validate every argument passed to the mnemonic functions.
  - Example: immediate value of FMOV 0



bcd

000

2.0

2.125

2.25

2.375

001

4.0

4.25

4.5

4.75

010

8.0

8.5

9.0

9.5

011

16.0

17.0

18.0

19.0

100

0.125

0.1328125

0.140625

0.1484375

efah

0000

0001

0010

0011

If you use values that are not listed in the table, the operation of the mnemonic function is undefined. The operand validation is the future work.

|      |       |      |      |      | 0.21075   | 0.4575   | 0.075   | 1.75   |
|------|-------|------|------|------|-----------|----------|---------|--------|
| 1101 | 3.625 | 7.25 | 14.5 | 29.0 | 0.2265625 | 0.453125 | 0.90625 | 1.8125 |
| 1110 | 3.75  | 7.5  | 15.0 |      | 0.234375  | 0.46875  | 0.9375  | 1.875  |
| 1111 | 3.875 |      |      |      |           |          |         |        |

# How to debug programs implemented with Xbyak\_aarch64



### Debug JIT-ed Code

- So far, there is no efficient way to debug JIT-ed code 😣
- Basically, it's the same as debugging assembler.
  - I often use GDB with "asm" layout.
- JIT-ed code can be dump as a file and disassembled by "objdump".





Breakpoint 4 at 0xffffbe7a0000

1) Set a break point to the address of the function pointer *f*, before it is called.



| x0<br>x3                                                  | roup: general<br>0x3<br>0x100                                                            | 3<br>256                                                                             | ×1<br>×4                               | 0×4<br>0×100                                                                | 4<br>256                                                                  | x2<br>x5                               | 0xffffbe7a0000<br>0x6                                                                 |
|-----------------------------------------------------------|------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|----------------------------------------|-----------------------------------------------------------------------------|---------------------------------------------------------------------------|----------------------------------------|---------------------------------------------------------------------------------------|
| ×6<br>×9<br>×12<br>×15                                    | bug by Gl                                                                                | OB                                                                                   |                                        |                                                                             |                                                                           |                                        | 4e06a8                                                                                |
| x18<br>x21<br>x24<br>x27<br>x30<br>cpsr<br>vg             | 0x200000005<br>0x445948<br>0x1400000008<br>0x1e00000020<br>0x4045d4<br>0x80000000<br>0x8 | 137438953477<br>4479304<br>85899345928<br>128849018912<br>4212180<br>[ EL=0 N ]<br>8 | x19<br>x22<br>x25<br>x28<br>sp<br>fpsr | 0×fffffffb680<br>0×8<br>0×1200000080<br>0×120000020<br>0×fffffffb610<br>0×0 | 281474976691840<br>8<br>77309411456<br>77309411360<br>0×ffffffffb610<br>0 | x20<br>x23<br>x26<br>x29<br>pc<br>fpcr | 0xa00000080<br>0x1c00000008<br>0x1e00000080<br>0xfffffffb610<br>0xffffbe7a0000<br>0x0 |
| B+> 0xffffbe7<br>0xffffbe7<br>0xffffbe7                   | a0004 ret<br>a0008 .inst 0x00000                                                         | w1                                                                                   |                                        |                                                                             |                                                                           |                                        |                                                                                       |
| 0xffffbe7<br>native proces<br>Starting proc               | 38684 4) The p                                                                           | 0                                                                                    |                                        | e start of JIT-e<br>h the instructi                                         |                                                                           | GDB cor                                | mmand "si".                                                                           |
| Breakpoint 3,<br>(gdb) n<br>(gdb) p/x f<br>\$1 = 0xffffbe | main () at add.cpp:/                                                                     | 26                                                                                   |                                        |                                                                             |                                                                           |                                        |                                                                                       |
| (gdb) b *f<br>Breakpoint 4<br>(gdb) layout<br>(gdb) c     | at 0xffffbe7a0000<br>asm                                                                 | <b>}</b>                                                                             | $\overline{}$                          | 2) Set layou                                                                | it to "asm" or                                                            | "regs"                                 | _                                                                                     |
| Continuing.                                               | 0x0000ffffbe7a0000                                                                       | in ??                                                                                |                                        | 3) Continue                                                                 | execution                                                                 |                                        |                                                                                       |
| (gdb) layout                                              |                                                                                          |                                                                                      |                                        |                                                                             |                                                                           |                                        |                                                                                       |

## Dumping JIT-ed Code

```
#include <xbyak_aarch64/xbyak_aarch64.h>
using namespace Xbyak_aarch64;
class Generator : public CodeGenerator {
public:
    Generator() {
        add(w0, w0, w1);
        ret();
    }
};
int main() {
    Generator gen;
    gen.ready();
    auto f = gen.getCode<int (*)(int, int)>();
    int a = 3;
    int b = 4;
```

```
FILE *fp = fopen("dump.bin", "wb+");
fwrite(gen.getCode(), gen.getSize(), 1, fp);
fclose(fp);
```

```
printf("%d + %d = %d\n", a, b, f(a, b));
```

|   | [kawakami@fx700-01-05 sample]\$ objdump -D -b binary -m AArch64 dump.bin |
|---|--------------------------------------------------------------------------|
|   | dump.bin: file format binary                                             |
|   | Disassembly of section .data:                                            |
|   | $0_{0000000000000} < data>$                                              |
|   | 0: 0b010000 add w0,w0,w1<br>4: d65f03c0 ret                              |
| , | [kawakami@tx700-01-05 sample]\$                                          |
|   |                                                                          |



# Summary



## Summary

- Xbyak\_aarch64; just-in-time assembler for ARMv8-A + SVE, is introduced,
  - which can dynamically generate optimized code considering runtime parameters and make it easier than the existing assembler to implement optimized code at the instruction level.
- Xbyak\_aarch64 is mainly developed to implement the deep learning processing software on the supercomputer Fugaku, but it can be expected to work with a variety of software for ARMv8-a architecture systems.
- Xbyak\_aarch64 is being developed as an OSS. I hope that many people will use Xbyak\_aarch64 on various platforms and participate in its development.
  - Questions, bug reports, pull requests, etc. on Github are welcome. <u>https://github.com/fujitsu/xbyak\_aarch64</u>



### Acknowledgment

• The authors thank S. Mitsunari (Cybozu Labs, Inc.), the developer of the original Xbyak. He contributed helpful advice to Xbyak\_aarch64 and brushed up the source code.



## References

- Xbyak\_aarch64: Just-In-Time assembler for ARMv8-A + SVE
  - O <u>https://github.com/fujitsu/xbyak\_aarch64</u>
- Xbyak: Just-In-Time assembler for x86\_64
  - O <u>https://github.com/herumi/xbyak</u>
- oneDNN: Deep Learning Processing Library
  - O <u>https://github.com/oneapi-src/oneDNN</u>
- oneDNN for A64FX: Deep Learning Processing Library for A64FX
  - O <u>https://github.com/fujitsu/oneDNN</u>
- A64FX: CPU designed for high-performance computing and complies with the ARMv8-A architecture profile and the Scalable Vector Extension (SVE)
  - O Toshio Yoshida, "Fujitsu High Performance CPU for the Post-K Computer," in Proc. Hot Chips 30, Aug. 2018.
- "Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile"
- "Procedure Call Standard for the Arm 64-bit Architecture (AArch64)"
- TechBlog
  - o <u>https://blog.fltech.dev/entry/2020/11/19/fugaku-onednn-deep-dive-en</u>
  - <u>https://blog.fltech.dev/entry/2020/11/18/fugaku-onednn-deep-dive-ja</u> (in Japanese)



# Thank you

Accelerating deployment in the Arm Ecosystem



### What is Xbyak\_aarch64?

- Xbyak\_aarch64 (<u>https://github.com/fujitsu/xbyak\_aarch64</u>) is the Just-In-Time (JIT) assembler for ARMv8-A + Scalable Vector Extension (SVE), inheriting the concept of Xbyak, written in C++11.
- Xbyak (<u>https://github.com/herumi/xbyak</u>) is the Just-In-Time assembler for x86\_64 instruction set architecture (ISA),
  - developed by S. Mitsunari (Cybozu Labs, Inc.),
  - pronounced "kəi-bja-k" (I'm not sure the correct spelling by IPA), <u>https://translate.google.com/?hl=ja&sl=ja&tl=en&text=kaibyaku</u>
  - o The word "Xbyak" is derived from Japanese word "開闢".
    - Its meanings is "the beginning of the world", "exploring the unexplored", etc.
- The main purpose of developing Xbyak\_aarch64 is to port oneDNN, a deep learning processing library for x86\_64, to A64FX (ARMv8-A + SVE).

