# Early results using Fortran's do concurrent standard parallelism on Intel GPUs with the ifx compiler

Ronald M. Caplan, Miko M. Stulajter, Jon A. Linker, and Cooper Downs

Predictive Science Inc.

caplanr@predsci.com

Supported by NSF and NASA

# Outline

- Accelerated computing
- Directives and Fortran standard parallelism
- Previous implementation results on NVIDIA GPUs with nvfortran
- Preliminary implementation results on Intel GPUs with ifx
- Call to action and future outlook



# **Accelerated Computing**

- Overall performance
  - FLOP/s
  - Memory bandwidth
  - Specialized hardware (e.g. ML/DL tensor cores)
- In-house workstations
- Reduce HPC real estate
- Efficient performance
  - Lower energy use
  - Save money





# Directives

- Comments that the compiler can use to generate code that the base language does not support (e.g. parallelism, GPU-offload, data movement, etc.)
- Can produce single source code for multiple targets (GPU, CPU, FPGA, etc.)
- Low-risk: directives can be ignored and the code compiled as before
- Vendor-independent (subject to implementation)
- Great for rapid development and accelerating legacy codes
- Two major directive APIs for accelerated computing: **OpenACC** and **OpenMP**



Original loop:

    do i=1,n
      y(i) = a*x(i) + b
    enddo

**OpenMP** Target:

    !$omp target enter data map(to:x) map(alloc:y)
    !$omp target teams distribute parallel do
    do i=1,n
      y(i) = a*x(i) + b
    enddo
    !$omp end target teams distribute parallel do
    !$omp target exit data map(delete:x) map(from:y)

**OpenACC**:

    !$acc enter data copyin(x) create(y)
    !$acc parallel loop
    do i=1,n
      y(i) = a*x(i) + b
    enddo
    !$acc exit data delete(x) copyout(y)



# Fortran Standard Parallelism: Do Concurrent (DC)

## **ISO Fortran 2008**

- Indicates that the iterations of a loop can be run in any order
- Can hint to the compiler that the loop may be parallelizable
- No support for atomics, device selection, async, conditionals, etc.
- Fortran 2023 has added reductions



| Compiler  | Version          | DO CONCURRENT parallelization support                           |
|-----------|------------------|-----------------------------------------------------------------|
| nvfortran | ≥ 20.11          | CPU with `-stdpar=cpu`<br>GPU with `-stdpar=gpu`                |
| ifx       | ≥ 19.1<br>≥ 23.0 | CPU with `-fopenmp`<br>GPU with `-fopenmp-target-do-concurrent` |
| gfortran  | ≥ 9              | CPU with `-ftree-parallelize-loops=<#Threads>`                  |
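As a minimal sketch of what such a compiler is asked to parallelize, the program below writes the `y = a*x + b` loop with `do concurrent` and no directives. The program name, array size, and values are illustrative assumptions, not taken from the codes discussed in this talk; the flags in the table above are the ones that request CPU or GPU parallelization of these loops.

```fortran
! Hypothetical example: y = a*x + b written with DO CONCURRENT only.
! Without any of the parallelization flags it still compiles and runs
! as an ordinary serial loop.
program axpy_dc
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000              ! illustrative size (assumption)
  real(dp), parameter :: a = 2.0_dp, b = 1.0_dp
  real(dp), allocatable :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))

  ! Each iteration touches only element i, so the order does not matter
  do concurrent (i=1:n)
    x(i) = real(i, dp)
  enddo

  do concurrent (i=1:n)
    y(i) = a*x(i) + b
  enddo

  ! Self-check against the equivalent array expression
  if (maxval(abs(y - (a*x + b))) > 1.0e-12_dp) error stop "axpy_dc: mismatch"
  print '(a,f8.1)', 'y(n) = ', y(n)
end program axpy_dc
```

The same source serves as the CPU and GPU version; only the compiler flags change.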



# Portability

# github.com/AndiH/gpu-lang-compat





Notes from the compatibility table:

- 12: Standard language parallelism of Fortran, mainly do concurrent, is supported on NVIDIA GPUs
- 26: Currently, no (known) way to launch Standard-based parallel algorithms on AMD GPUs
- 38: With Intel oneAPI 2022.3, Intel supports DO CONCURRENT with GPU offloading

Legend entries from the compatibility table:

- Vendor support, but not (yet) entirely comprehensive
- Limited, probably indirect support - but at least some
- No direct support available, but of course one could ISO-C-bind your way through it or directly link the libraries

# **Directives vs. Standard Parallelism**

# Why use DC instead of directives?

- Longevity (ISO)
- Lower code footprint
- Less unfamiliar to domain scientists
- For accelerated computing, directives (e.g. OpenMP) are currently more portable

These also apply to codes that already use directives

# Original Non-Parallelized Code

```
do k=1,np
  do j=1,nt
    do i=1,nrm1
      br(i,j,k) = (phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
    enddo
  enddo
enddo
```

### **OpenACC** Parallelized Code

```
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
!$acc parallel loop default(present) collapse(3) async(1)
do k=1,np
  do j=1,nt
    do i=1,nrm1
      br(i,j,k) = (phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
    enddo
  enddo
enddo
!$acc wait
!$acc exit data delete(phi,dr_i,br)
```

### Fortran's DO CONCURRENT

```
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
  br(i,j,k) = (phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo
```
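To check that the nested loops and the `do concurrent` form compute the same thing, the following self-contained sketch runs both on a small test array and compares the results. The grid sizes, the test field `phi = i**2`, the uniform inverse spacing, and the array name `dr_i` are assumptions made for illustration only.

```fortran
! Hypothetical test harness: nested DO loops versus DO CONCURRENT
program br_dc
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nr = 8, nt = 6, np = 4   ! small illustrative sizes
  integer, parameter :: nrm1 = nr - 1
  real(dp) :: phi(nr,nt,np), dr_i(nrm1)
  real(dp) :: br_loop(nrm1,nt,np), br_conc(nrm1,nt,np)
  integer :: i, j, k

  ! Test field phi = i**2, so phi(i+1)-phi(i) = 2*i+1
  do k=1,np
    do j=1,nt
      do i=1,nr
        phi(i,j,k) = real(i*i, dp)
      enddo
    enddo
  enddo
  dr_i = 1.0_dp   ! inverse radial spacing of a uniform unit grid (assumption)

  ! Original nested loops
  do k=1,np
    do j=1,nt
      do i=1,nrm1
        br_loop(i,j,k) = (phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
      enddo
    enddo
  enddo

  ! Single DO CONCURRENT over the same index space
  do concurrent (k=1:np, j=1:nt, i=1:nrm1)
    br_conc(i,j,k) = (phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
  enddo

  if (maxval(abs(br_loop - br_conc)) > 1.0e-12_dp) error stop "br_dc: versions differ"
  if (abs(br_conc(3,1,1) - 7.0_dp) > 1.0e-12_dp) error stop "br_dc: unexpected value"
  print '(a)', 'nested do and do concurrent agree'
end program br_dc
```

The three loop headers collapse into one index list, which is where most of the reduction in code footprint comes from.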



# Previous Implementations Results on NVIDIA GPUs

# History of our GPU implementations

- 2012-3: Wanted to use GPUs, but not with CUDA due to needing code rewrites, and multiple code bases in multiple languages (Fortran & C)
- 2014-5: NVIDIA's OpenACC implementation mature enough to start using it for small tools (DIFFUSE)
- 2016-7: Implemented OpenACC into a larger code that uses MPI for running on multiple GPUs (POT3D)
- 2018-9: Implemented OpenACC into our production-level MHD code (MAS)
- 2020: Optimized OpenACC implementations, started using in production runs
- 2021: Implemented Fortran standard parallelism with 'do concurrent' (DC) into DIFFUSE
- 2022: Implemented DC into POT3D, but retained small amount of OpenACC for performance
- 2023: Implemented DC into MAS, but retained small amount of OpenACC for performance

**OpenMP** Target features and support starts to be competitive to OpenACC, but we already had large amounts of OpenACC implemented, and felt **OpenACC** was easier/cleaner to code

**Decided to pursue standard Fortran (stdpar) due to NVIDIA's support and Intel's announcement of support.**

**Stdpar allows for cleaner code and more portability than OpenACC**

# Previous Implementations Results on NVIDIA GPUs: DIFFUSE

- Small solar surface magnetic field smoothing tool
- Integrates a 2D spherical surface Laplacian operator with finite differencing and super time stepping
- Parallelized for CPUs with OpenMP and for GPUs with OpenACC
- We replaced all directives with DC, and the code retained its performance on multicore CPUs
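To give a feel for the kind of loop such a tool contains, here is a heavily simplified sketch of one explicit finite-difference diffusion step written with `do concurrent`. It is not DIFFUSE's algorithm: it assumes a uniform Cartesian grid and a plain forward-Euler step instead of the spherical surface operator and super time stepping, and every name and parameter in it is hypothetical.

```fortran
! Hypothetical sketch: one explicit 5-point diffusion step on a
! Cartesian grid (NOT the spherical operator or super time stepping
! used by DIFFUSE)
program diffuse_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 9              ! grid points per side (illustrative)
  real(dp), parameter :: alpha = 0.2_dp    ! nu*dt/dx**2, kept below 0.25
  real(dp) :: u(n,n), unew(n,n)
  integer :: i, j, c

  c = (n+1)/2
  u = 0.0_dp
  u(c,c) = 1.0_dp      ! single hot spot in the middle
  unew = u             ! boundary values are left unchanged

  ! Every interior point reads u and writes only unew(i,j),
  ! so the iterations are independent of each other
  do concurrent (j=2:n-1, i=2:n-1)
    unew(i,j) = u(i,j) + alpha*(u(i+1,j) + u(i-1,j) &
                              + u(i,j+1) + u(i,j-1) - 4.0_dp*u(i,j))
  enddo

  ! Center keeps 1-4*alpha, each of the 4 neighbors gains alpha
  if (abs(unew(c,c)   - 0.2_dp) > 1.0e-12_dp) error stop "center value"
  if (abs(unew(c+1,c) - 0.2_dp) > 1.0e-12_dp) error stop "neighbor value"
  if (abs(sum(unew)   - 1.0_dp) > 1.0e-12_dp) error stop "total not conserved"
  print '(a,f6.3)', 'center after one step: ', unew(c,c)
end program diffuse_sketch
```

Each point performs a handful of floating-point operations per several memory accesses, which is why stencil loops of this type tend to be limited by memory bandwidth.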

Stulajter, et al. "Can Fortran's `do concurrent' Replace Directives for Accelerated Computing?" Lecture Notes in Computer Science, 13194, 3-21. Springer, Cham. (2021)

[Figure: DIFFUSE CPU timings with the Intel compiler for the Original (OpenMP), New (DC+OpenMP), and Experimental (DC) versions. Values legible in the chart: 19, 4.9, 17, 8.3]

# Previous Implementations Results on NVIDIA GPUs: DIFFUSE

- We saw similar performance on NVIDIA GPUs using only DC
- This used the default setting of the NVIDIA compiler when using DC, which activates Unified Managed Memory (UMM)
- Alternatively, we can turn off UMM and manually manage memory with unstructured data movement directives





# Previous Implementations Results on NVIDIA GPUs: POT3D

- POT3D computes approximations of the magnetic field of the Sun's lower atmosphere
- It is parallelized for CPUs with MPI and multiple GPUs with MPI+OpenACC
- Part of the SPEChpc(TM) 2021 benchmark suite
- We converted all "do" loops into DC, and the CPU performance did not change. DC also added the ability to run in hybrid MPI+multicore mode (not tested yet)

# github.com/predsci/POT3D

- "Variations in Finite Difference Potential Fields" Caplan, et al., Ap.J. 915,1 (2021) 44
- https://developer.nvidia.com/blog/using-fortran-standard-parallel-programming-for-gpu-acceleration

[Figure: POT3D CPU timing comparison. Values legible in the chart: 6717.4 and 6767.9]

# Previous Implementations Results on NVIDIA GPUs: POT3D

We replaced all directives with DC, letting **UMM** handle data management

We saw a ~10% slowdown due to issues with UMM+MPI (not present on Grace-Hopper!)

Original performance regained by adding back OpenACC data directives

Hybrid DC+OpenACC still advantageous due to large reduction in number of directives and lines of code, making the code more domain-scientist friendly



### POT3D: Explicit, Managed, and Unified Memories



[Figure: POT3D results on GH200 comparing the Explicit Memory and Pure Fortran versions]

# Previous Implementations Results on NVIDIA GPUs: MAS

- Large (~70,000 lines) in-production code for general-purpose simulations of the Sun's atmosphere used in solar physics and space weather research
- Solves spherical 3D thermodynamic MHD equations using implicit & explicit time-stepping with finite-differences and sparse matrix preconditioned iterative solvers
- Parallelized for multiple CPUs with MPI and multiple GPUs with MPI+OpenACC
- We converted "do" loops into DC and the CPU performance did not change. DC again added the ability to run in hybrid MPI+multicore mode (not tested yet)

"GPU Acceleration of an Established Solar MHD Code using OpenACC" Caplan, et al. J. of Phys.: Conf. Series. ASTRONUM 2018. 1225,1 (2019) 012012

"Acceleration of a production Solar MHD code with Fortran standard parallelism: From OpenACC to `do concurrent'" Caplan, et al. IEEE IPDPSW Proceedings (2023) 582-590.

$$\frac{\partial T}{\partial t} = -\nabla \cdot \left(T\,\mathbf{v}\right) - \left(\gamma - 2\right)\left(T\,\nabla \cdot \mathbf{v}\right) + \frac{\left(\gamma - 1\right)}{2\,k}\frac{m_p}{\rho}\left[-\nabla \cdot \left(\widetilde{\mathbf{q}}_1 + \mathbf{q}_2\right) - \frac{\rho^2}{m_p^2}\,\widetilde{\mathbf{Q}} + \mathbf{H}\right]$$

$$\rho\,\frac{\partial \mathbf{v}}{\partial t} = -\rho\,\mathbf{v} \cdot \nabla \mathbf{v} + \frac{1}{c}\,\mathbf{J} \times \mathbf{B} - \nabla (p + p_w) + \rho\,\mathbf{g} + \mathbf{F_c} + \nabla \cdot (\nu\,\rho\,\nabla \mathbf{v}) + \nabla \cdot \left( \mathbf{S}\,\rho\,\nabla \frac{\partial \mathbf{v}}{\partial t} \right)$$

# predsci.com/mas

# Previous Implementations Results on NVIDIA GPUs: MAS

- We were able to run with pure Fortran using **UMM**, however the issues with **UMM**+MPI severely limited scaling across GPUs
- Adding back OpenACC data directives restored original scaling





# Preliminary Implementation Results on Intel GPUs with ifx

- To test GPU-acceleration on Intel GPUs with DC, we go back to using the DIFFUSE tool
- We start with the pure Fortran version (zero directives)
- Test run is the same as in [Stulajter, et al. (2021)]
- Using the Intel Developer Cloud, we use the **ifx** compiler v2023.2 on an Intel Data Center GPU Max 1100
- DIFFUSE is highly memory-bandwidth bound, so performance on the Max 1100 (1,229 GB/s) is expected to be between that of an NVIDIA V100 (900 GB/s) and an A100 (1,555 GB/s), where the test takes ~35 seconds on an A100
- We first ran the test on a dual-socket Xeon Platinum 8480+ CPU (614 GB/s) and it took a reasonable 95 seconds
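The expectation above can be turned into rough numbers. The sketch below assumes that the run time of a bandwidth-bound code scales inversely with peak memory bandwidth, which is a simplifying assumption of this example and not a claim made in the talk; the bandwidths and the ~35 second A100 time are the figures quoted in the list above.

```fortran
! Hypothetical estimate: scale the A100 run time by the ratio of
! peak memory bandwidths (assumes purely bandwidth-bound behavior)
program bandwidth_estimate
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: t_a100  = 35.0_dp     ! seconds, quoted above
  real(dp), parameter :: bw_a100 = 1555.0_dp   ! GB/s
  character(len=24), parameter :: names(3) = [character(len=24) :: &
       'NVIDIA V100', 'Intel Max 1100', 'Xeon 8480+ (2 sockets)']
  real(dp), parameter :: bw(3) = [900.0_dp, 1229.0_dp, 614.0_dp]   ! GB/s
  real(dp) :: t(3)
  integer :: i

  t = t_a100*bw_a100/bw    ! estimated seconds on each device

  do i=1,3
    print '(a,a,f6.1,a)', names(i), ': ', t(i), ' s (estimated)'
  enddo

  ! The Max 1100 estimate should fall between the A100 and V100 times
  if (.not. (t(2) > t_a100 .and. t(2) < t(1))) error stop "estimate out of range"
end program bandwidth_estimate
```

Under this assumption the Max 1100 estimate is roughly 44 seconds and the CPU estimate is roughly 89 seconds, the latter being close to the 95 seconds measured on the Xeon.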




# console.cloud.intel.com

- For testing on the MAX GPU, we use the tips shown here: www.intel.com/content/www/us/en/developer/videos/offload-fortran-workloads-new-data-center-gpu-max.html
- Compiler flags:
  - `-fiopenmp`
  - `-fopenmp-target-do-concurrent`
  - `-fopenmp-targets=spir64`
  - `-Xopenmp-target-backend "-device pvc"`
- We set the following environment variable to profile the results (adding negligible extra time to the run)
  - `export LIBOMPTARGET_PLUGIN_PROFILE=T`

- The profile of the run shows that the slow performance is due to excessive amounts of CPU-GPU data transfers

LIBOMPTARGET_PLUGIN_PROFILE (LEVEL0) for OMP DEVICE(0) Intel(R) Data Center GPU Max 1100, Thread 0

| Kernel   | Name                                                |
|----------|-----------------------------------------------------|
| Kernel 0 | `_omp_offloading_10301_bd28a_ax__11425`             |
| Kernel 1 | `_omp_offloading_10301_bd28a_ax__11438`             |
| Kernel 2 | `_omp_offloading_10301_bd28a_ax__11443`             |
| Kernel 3 | `_omp_offloading_10301_bd28a_ax__11454`             |
| Kernel 4 | `_omp_offloading_10301_bd28a_diffuse_step_sts_1552` |
| Kernel 5 | `_omp_offloading_10301_bd28a_diffuse_step_sts_1565` |

All times are in msec.

| Name                       | Host Total | Host Average | Host Min | Host Max | Device Total | Device Average | Device Min | Device Max | Count     |
|----------------------------|------------|--------------|----------|----------|--------------|----------------|------------|------------|-----------|
| Compiling                  | 329.88     | 329.88       | 329.88   | 329.88   | 0.00         | 0.00           | 0.00       | 0.00       | 1.00      |
| DataAlloc                  | 63045.73   | 0.03         | 0.00     | 3.16     | 0.00         | 0.00           | 0.00       | 0.00       | 1.81e+06  |
| DataRead (Device to Host)  | 2.09e+06   | 2.16         | 0.01     | 15.08    | 1.44e+06     | 1.49           | 0.00       | 7.40       | 966000.00 |
| DataWrite (Host to Device) | 3.01e+06   | 1.33         | 0.00     | 18.31    | 1.49e+06     | 0.66           | 0.00       | 7.63       | 2.25e+06  |
| Kernel 0                   | 45135.81   | 1.12         | 0.83     | 2.29     | 31140.47     | 0.77           | 0.75       | 0.83       | 40260.00  |
| Kernel 1                   | 1787.53    | 0.04         | 0.04     | 0.90     | 1261.04      | 0.03           | 0.03       | 0.06       | 40260.00  |
| Kernel 2                   | 14269.57   | 0.35         | 0.06     | 2.20     | 318.11       | 0.01           | 0.01       | 0.04       | 40260.00  |
| Kernel 3                   | 1470.01    | 0.04         | 0.02     | 0.41     | 377.62       | 0.01           | 0.01       | 0.04       | 40260.00  |
| Kernel 4                   | 36.64      | 0.61         | 0.60     | 0.65     | 32.84        | 0.55           | 0.54       | 0.56       | 60.00     |
| Kernel 5                   | 30664.22   | 0.76         | 0.74     | 0.95     | 28215.62     | 0.70           | 0.68       | 0.75       | 40200.00  |

Total wall clock time: 5331.8 seconds

- **ifx** does not currently have an equivalent to UMM for DC, so to try to reduce data transfers, we add unstructured data region directives
- OpenACC is (unfortunately) not supported by ifx, so we use the **OpenMP** Target equivalents:

### github.com/intel/intel-application-migration-tool-for-openacc-to-openmp

| **OpenACC**                   | **OpenMP** Target                        |
|-------------------------------|------------------------------------------|
| `!$acc enter data copyin(a)`  | `!$omp target enter data map(to:a)`      |
| `!$acc enter data create(b)`  | `!$omp target enter data map(alloc:b)`   |
| COMPUTE                       | COMPUTE                                  |
| `!$acc exit data copyout(a)`  | `!$omp target exit data map(from:a)`     |
| `!$acc exit data delete(b)`   | `!$omp target exit data map(release:b)`  |
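A minimal sketch of how these unstructured OpenMP Target data directives wrap DC loops is shown below. The arrays `a` and `b` follow the naming of the directives above, while the loop bodies, sizes, and program name are hypothetical. It assumes the compiler offloads the `do concurrent` loops to the same device that the data directives refer to (as with the ifx flags listed earlier); built without OpenMP, the directives are plain comments and the program runs on the host.

```fortran
! Hypothetical example: DO CONCURRENT loops inside an unstructured
! OpenMP Target data region
program dc_with_omp_data
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000, nsteps = 10   ! illustrative sizes (assumption)
  real(dp), allocatable :: a(:), b(:)
  integer :: i, step

  allocate(a(n), b(n))
  a = 1.0_dp

  ! Copy a to the device once and allocate b there
  !$omp target enter data map(to:a)
  !$omp target enter data map(alloc:b)

  ! COMPUTE: the intent is that both arrays stay on the device
  do step=1,nsteps
    do concurrent (i=1:n)
      b(i) = 2.0_dp*a(i)
    enddo
    do concurrent (i=1:n)
      a(i) = 0.5_dp*b(i) + 1.0_dp   ! net effect per step: a = a + 1
    enddo
  enddo

  ! Copy the result back and free the scratch array
  !$omp target exit data map(from:a)
  !$omp target exit data map(release:b)

  if (maxval(abs(a - real(nsteps+1, dp))) > 1.0e-12_dp) error stop "unexpected result"
  print '(a,f6.1)', 'a(1) = ', a(1)
end program dc_with_omp_data
```

Compared with the pure Fortran version, only four comment lines are added, and the loops themselves are untouched.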

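- With the data directives added, the run is still far too slow due to data transfers
- It turns out that **ifx** currently translates **DC** into OpenMP Target as:

[Figure: a `do concurrent (i=1:10)` loop and the OpenMP Target code that ifx generates from it; the loop body and the generated code are not legible here]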



- This mapping always performs CPU-GPU transfers on every loop, even if the data is already on the GPU (through OpenMP data regions)
- According to the specification, the behavior of an OpenMP Target loop with no mapping clause is "copy or present" (copy data if needed, but not if the data is already on the GPU)
- Changing how ifx translates DC (or adding user options for it) should be straightforward

To see what performance we can expect with updated mapping, we converted all DC loops into do loops with unmapped OpenMP target directives, keeping the unstructured data regions
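The conversion described above can be sketched as follows: ordinary `do` loops carry an OpenMP Target directive with no `map` clause and sit inside an unstructured data region. This is an illustrative toy with hypothetical names and sizes, not code from DIFFUSE. Because the arrays are already present on the device, the "copy or present" behavior described above means no per-loop transfers are expected.

```fortran
! Hypothetical example: unmapped OpenMP Target loops inside an
! unstructured data region
program omp_target_unmapped
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000, nsteps = 10   ! illustrative sizes (assumption)
  real(dp), allocatable :: u(:), unew(:)
  integer :: i, step

  allocate(u(n), unew(n))
  do i=1,n
    u(i) = real(i, dp)     ! linear profile: unchanged by the averaging below
  enddo
  unew = u

  ! Unstructured data region: both arrays are copied to the device once
  !$omp target enter data map(to:u,unew)

  do step=1,nsteps
    ! No map clause: u and unew are already present on the device
    !$omp target teams distribute parallel do
    do i=2,n-1
      unew(i) = 0.5_dp*(u(i-1) + u(i+1))
    enddo
    !$omp end target teams distribute parallel do

    !$omp target teams distribute parallel do
    do i=1,n
      u(i) = unew(i)
    enddo
    !$omp end target teams distribute parallel do
  enddo

  !$omp target exit data map(from:u) map(release:unew)

  do i=1,n
    if (abs(u(i) - real(i, dp)) > 1.0e-12_dp) error stop "profile changed"
  enddo
  print '(a)', 'linear profile preserved'
end program omp_target_unmapped
```

Without OpenMP enabled the directives are ignored and the program runs serially, so the same source can be checked on any compiler.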

LIBOMPTARGET_PLUGIN_PROFILE (LEVEL0) for OMP DEVICE(0) Intel(R) Data Center GPU Max 1100, Thread 0

| Kernel   | Name                                                                |
|----------|---------------------------------------------------------------------|
| Kernel 0 | `_omp_offloading_10301_bd3bf_ax__11454`                             |
| Kernel 1 | `_omp_offloading_10301_bd3bf_ax__11470`                             |
| Kernel 2 | `_omp_offloading_10301_bd3bf_ax__11476`                             |
| Kernel 3 | `_omp_offloading_10301_bd3bf_ax__11488`                             |
| Kernel 4 | `_omp_offloading_10301_bd3bf_diffuse_step_sts_1565`                 |
| Kernel 5 | `_omp_offloading_10301_bd3bf_diffuse_step` (remainder not legible)  |

All times are in msec.

| Name                       | Host Total | Host Average | Host Min | Host Max | Device Total | Device Average | Device Min | Device Max | Count     |
|----------------------------|------------|--------------|----------|----------|--------------|----------------|------------|------------|-----------|
| Compiling                  | 372.01     | 372.01       | 372.01   | 372.01   | 0.00         | 0.00           | 0.00       | 0.00       | 1.00      |
| DataAlloc                  | 30.75      | 0.00         | 0.00     | 3.19     | 0.00         | 0.00           | 0.00       | 0.00       | 281854.00 |
| DataRead (Device to Host)  | 548.55     | 0.01         | 0.01     | 2.90     | 144.88       | 0.00           | 0.00       | 1.43       | 80521.00  |
| DataWrite (Host to Device) | 513.99     | 0.01         | 0.00     | 14.16    | 16.82        | 0.00           | 0.00       | 7.36       | 80554.00  |
| Kernel 0                   | 31166.05   | 0.77         | 0.74     | 8.06     | 30979.28     | 0.77           | 0.73       | 0.84       | 40260.00  |
| Kernel 1                   | 1389.45    | 0.03         | 0.03     | 0.88     | 1197.80      | 0.03           | 0.03       | 0.08       | 40260.00  |
| Kernel 2                   | 275.31     | 0.01         | 0.01     | 0.04     | 107.58       | 0.00           | 0.00       | 0.03       | 40260.00  |
| Kernel 3                   | 278.59     | 0.01         | 0.01     | 0.04     | 116.55       | 0.00           | 0.00       | 0.03       | 40260.00  |
| Kernel 4                   | 31.62      | 0.53         | 0.52     | 0.55     | 31.28        | 0.52           | 0.51       | 0.54       | 60.00     |
| Kernel 5                   | 26509.14   | 0.66         | 0.65     | 1.02     | 26345.37     | 0.66           | 0.64       | 0.71       | 40200.00  |

Total wall clock time: 62.1 seconds
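The two profiles can be compared with a little arithmetic. The sketch below takes the host-side DataRead and DataWrite totals and the wall clock times reported for the DC-only run (5331.8 seconds) and for the OpenMP Target run (62.1 seconds), and computes the share of wall clock time spent in transfers and the overall speedup. Treating the host times as sequential and directly comparable to wall clock time is an assumption of this example.

```fortran
! Hypothetical post-processing of the two profiles reported in this talk
program profile_summary
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Host times in msec and wall clock times in seconds, from the profiles
  real(dp), parameter :: read_dc  = 2.09e6_dp, write_dc  = 3.01e6_dp
  real(dp), parameter :: wall_dc  = 5331.8_dp
  real(dp), parameter :: read_omp = 548.55_dp, write_omp = 513.99_dp
  real(dp), parameter :: wall_omp = 62.1_dp
  real(dp) :: frac_dc, frac_omp, speedup

  ! Fraction of wall clock time spent in host-side data transfers
  frac_dc  = (read_dc  + write_dc )/1000.0_dp/wall_dc
  frac_omp = (read_omp + write_omp)/1000.0_dp/wall_omp
  speedup  = wall_dc/wall_omp

  print '(a,f5.1,a)', 'transfer share, DC only:       ', 100.0_dp*frac_dc,  ' %'
  print '(a,f5.1,a)', 'transfer share, OpenMP Target: ', 100.0_dp*frac_omp, ' %'
  print '(a,f5.1,a)', 'speedup:                       ', speedup, ' x'

  ! Sanity checks on the arithmetic
  if (abs(frac_dc - 0.9565_dp) > 1.0e-3_dp) error stop "unexpected DC fraction"
  if (abs(speedup - 85.86_dp)  > 1.0e-2_dp) error stop "unexpected speedup"
end program profile_summary
```

Under these assumptions, transfers account for about 96% of the DC-only run and under 2% of the OpenMP Target run, for a speedup of roughly 86x.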

# Performance Summary for NVIDIA and INTEL GPUs



- The mapping issue in **ifx** is expected to be fixed soon, which should yield efficient results on Intel GPUs for DC with OpenMP for data management
- To use DC only, a system similar to NVIDIA's UMM would need to be implemented/activated in ifx for DC

Run times legible from the summary figure (row and column labels were not recoverable):

- 55.2 seconds, 54.2 seconds, 54.6 seconds
- (value not legible), 4480.0 seconds, 62.1 seconds

# Call to Action and Future Outlook

# Try Fortran's do concurrent (DC) to run your legacy (and new) codes on GPUs!

- On the NVIDIA platform, DC codes can be as fast as directive-based codes, with the best performance obtained by adding some Open(ACC|MP) data directives
- The Intel platform's support for DC is just beginning, and can yield decent performance on compute kernels
- With some manual OpenMP data directives and further compiler updates, DC codes are expected to run well on Intel GPUs in the near future
- With the portability of DC GPU-acceleration on NVIDIA and Intel, there is an incentive for an AMD implementation to be developed

# EXTRA: Preliminary Results on Consumer INTEL GPUs

- We also tested the run on an Intel Arc A750 Limited Edition GPU (512 GB/s Memory Bandwidth)
- We use the OpenMP Target Loops & Data version of the code
- The Arc GPUs do not have hardware for double precision FLOPs, but can compute them using emulation:
  - `export IGC_EnableDPEmulation=1`
  - `export SYCL_DEVICE_WHITE_LIST=""`
  - `export OverrideDefaultFP64Settings=1`
- Compiler flags:
  - `-fiopenmp -fopenmp-targets=spir64`
  - `-Xopenmp-target-backend "-device arc"`
- Total wall clock time: 161.2 seconds
- Rocky Linux 9.2, Kernel 6.3.4-1, ifx 2024.0, Resizable BAR enabled



