ECMWF Newsletter #175

ECMWF collaborates with Swiss partners on GPU porting of FVM dynamical core

Christian Kühnlein (ECMWF)
Till Ehrengruber (CSCS)
Stefano Ubbiali
Nicolai Krieger
Lukas Papritz
Alexandru Calotoiu
Heini Wernli (all ETH Zurich)


Numerical weather prediction models face an increasingly diverse landscape of supercomputing architectures. The ongoing, overarching trend is the shift from purely CPU‑based platforms to heterogeneous systems with GPU accelerators. The resulting boost in computing cores and memory bandwidth offers significant potential for higher numerical resolution and better energy efficiency. However, challenges arise because model programming and implementation depend on the specific hardware, and efficient execution requires targeted optimisation. Serving various hardware targets inevitably involves more complex code, which must be organised carefully to maintain productivity.

To achieve this for ECMWF's non-hydrostatic FVM dynamical core, which was previously implemented in Fortran, we are rewriting and further developing FVM with the domain-specific library GT4Py of the Swiss National Supercomputing Centre (CSCS) and ETH Zurich. Both CSCS and ETH Zurich are also working closely with MeteoSwiss in a related project that is porting the ICON model to GPUs using GT4Py. ECMWF benefits greatly from the leading software development efforts by these partners and from the KILOS project at ETH Zurich, funded by the Platform for Advanced Scientific Computing (PASC). At ECMWF, the GT4Py efforts complement the ongoing GPU adaptation of the operational Integrated Forecasting System (IFS) model.

GT4Py systematically separates the user code from hardware-dependent optimisations, instead of handling both in one source code, as is common in traditional designs. The user code contains the definitions of the computational stencils (e.g. discrete operators of the spatial discretisation, advection schemes, time integration), expressed in the embedded GT4Py domain-specific language (DSL). An optimising toolchain in GT4Py then transforms this high-level representation into a finely tuned implementation for the target hardware architecture (performance results are presented in the first figure).
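To illustrate this separation, a stencil in the first, structured-grid version of GT4Py might look as follows. This is a minimal sketch rather than FVM source code; the field names are invented, and the import path follows the Cartesian interface of recent GT4Py releases and may differ in other versions.

```python
# Illustrative GT4Py stencil definition (not FVM code).
import numpy as np
import gt4py.cartesian.gtscript as gtscript
from gt4py.cartesian.gtscript import PARALLEL, computation, interval


@gtscript.stencil(backend="gt:cpu_kfirst")  # select one of the available backends
def horizontal_laplacian(
    phi: gtscript.Field[np.float64],
    lap: gtscript.Field[np.float64],
):
    # A second-order discrete Laplacian, a typical spatial operator:
    # relative index offsets express the stencil; the loops and the
    # memory layout are generated by the GT4Py toolchain.
    with computation(PARALLEL), interval(...):
        lap = (
            phi[-1, 0, 0] + phi[1, 0, 0]
            + phi[0, -1, 0] + phi[0, 1, 0]
            - 4.0 * phi
        )
```

Calling the decorated function dispatches to the generated CPU or GPU implementation, while the definition itself stays hardware-neutral.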

Selected performance measures. The performance measures were obtained on the hybrid partition (CPU: Intel Xeon E5-2690 v3; GPU: NVIDIA Tesla P100) of the Piz Daint supercomputer at CSCS. The top two plots show the runtime per time step for various GT4Py backends using 64‑bit and 32‑bit floating-point precision on CPUs (gt:cpu_kfirst and gt:cpu_ifirst backends) and GPUs (cuda and dace:gpu backends). The bottom plot shows weak scaling, with the model executing at various resolutions on either CPUs or GPUs, using the gt:cpu_kfirst or the dace:gpu backend, respectively.

Productivity can be maintained with the GT4Py approach because the common high-level ‘driver’ interface is completely agnostic to the target hardware, allowing support for new optimisations and architectures without changing the application. Furthermore, Python is a popular and advanced programming language with concise syntax, comprehensive documentation, an extensive set of libraries (e.g. for unit testing, data analysis, machine learning, visualisation), a low barrier to entry for domain scientists and academia, and relatively straightforward prototyping. Python also integrates seamlessly with lower-level languages such as C++ and Fortran, making it a good fit for DSLs.
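A driver in this style reduces to plain Python that never inspects the hardware. The hypothetical sketch below (function and variable names are illustrative, not from the FVM code) shows the idea: the time loop only manipulates compiled stencil callables and array-like storages, so the same driver runs unchanged whichever backend produced the stencils.

```python
# Hypothetical hardware-agnostic time-stepping driver.
def integrate(stencil_step, phi, phi_new, num_steps):
    for _ in range(num_steps):
        stencil_step(phi=phi, lap=phi_new)  # dispatch to the compiled backend
        phi, phi_new = phi_new, phi         # swap references, no data movement
    return phi
```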

Preliminary results

We present results from an intermediate stage of the FVM porting effort, using the first version of GT4Py, which is restricted to structured grids. We consider the FVM dynamical core coupled to the IFS prognostic cloud microphysics scheme, implemented entirely in Python using GT4Py. For the performance measures shown in the first figure, we performed short runs of the moist baroclinic wave benchmark illustrated in the second figure.

Moist baroclinic wave. Snapshot of a moist baroclinic wave at day ten using the nearly global FVM configuration (a regular grid defined in the latitude range ±80°). Depicted are cloud fraction at about 2 km above the surface (shading) and surface pressure (contour lines at intervals of 10 hPa).

Once ported to GT4Py, FVM can employ various user-selectable backends for CPU and GPU hardware, using either 64‑bit or 32‑bit floating-point precision. Two CPU C++ backends specialised for different array layouts were tested. The two GPU backends tested execute either with native CUDA C++ or by leveraging the data-centric (DaCe) framework of ETH's Scalable Parallel Computing Laboratory. The bottom panel of the first figure shows parallel scaling of the distributed model, which uses the GHEX library (funded by the Partnership for Advanced Computing in Europe, PRACE), across multiple CPU or GPU nodes. Weak scaling is considered from about 14 km down to 1.7 km grid spacing, with a corresponding increase in compute nodes from 64 to 2,048 (grid spacings are given for the equatorial region; finer spacings apply away from the equator because of the regular longitude-latitude grid that is employed temporarily). Near-optimal scaling (i.e. the same runtime) is achieved over the given range, somewhat better with GPUs. Development and testing continue on other platforms with different hardware.
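To give a flavour of how these backend and precision combinations are selected, the sketch below compiles a single stencil definition for the four backends and two precisions tested. It is an illustration under assumptions: the stencil itself is invented, and the placeholder-type mechanism (string annotations resolved via a dtypes mapping) follows our understanding of the first GT4Py version; GPU backends additionally require a CUDA-capable build environment.

```python
# Sketch: one numerical definition, several backend/precision builds.
import numpy as np
import gt4py.cartesian.gtscript as gtscript
from gt4py.cartesian.gtscript import PARALLEL, computation, interval


def diffuse(phi: gtscript.Field["dtype"], out: gtscript.Field["dtype"]):
    # "dtype" is a placeholder resolved at compile time, so 64-bit and
    # 32-bit builds share exactly the same numerical source.
    with computation(PARALLEL), interval(...):
        out = phi + 0.25 * (
            phi[-1, 0, 0] + phi[1, 0, 0]
            + phi[0, -1, 0] + phi[0, 1, 0]
            - 4.0 * phi
        )


# Compile the same definition for each backend/precision pair tested.
compiled = {
    (backend, prec): gtscript.stencil(
        backend=backend, definition=diffuse, dtypes={"dtype": prec}
    )
    for backend in ("gt:cpu_kfirst", "gt:cpu_ifirst", "cuda", "dace:gpu")
    for prec in (np.float64, np.float32)
}
```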

Outlook

The preliminary results presented indicate the potential of the GT4Py approach to achieve performance portability and productivity in the context of future numerical weather prediction models. This is also supported by the latest results with the ICON model, where a first GT4Py-enabled dynamical core on GPUs already matches the performance of the Fortran and OpenACC version, and by an earlier GT4Py porting project for the US FV3 model at the Allen Institute for AI.

The porting and development of FVM continue with the next version of GT4Py, which is equipped with a new declarative interface and supports the unstructured meshes required for the global configuration with the quasi-uniform IFS octahedral grid. The declarative GT4Py introduces important new features for more modular and concise programming in the user code and enables GT4Py to optimise more freely and aggressively.
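The sketch below conveys the flavour of a declarative field operator. It is heavily hedged: the field-view API was still evolving at the time of writing, so the dimension, type, and decorator names here are assumptions based on later gt4py.next releases, and the operator itself is invented for illustration.

```python
# Illustrative field operator in the declarative (field-view) style of the
# newer GT4Py; exact API names are assumptions and vary across versions.
import gt4py.next as gtx
from gt4py.next import float64

CellDim = gtx.Dimension("Cell")                             # unstructured horizontal dimension
KDim = gtx.Dimension("K", kind=gtx.DimensionKind.VERTICAL)  # vertical levels


@gtx.field_operator
def blend(
    a: gtx.Field[gtx.Dims[CellDim, KDim], float64],
    b: gtx.Field[gtx.Dims[CellDim, KDim], float64],
    w: float64,
) -> gtx.Field[gtx.Dims[CellDim, KDim], float64]:
    # Declarative: state what is computed per cell and level; GT4Py is free
    # to choose loop order, parallelisation, and memory layout.
    return w * a + (1.0 - w) * b
```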

Updates on FVM based on the new declarative GT4Py are expected in 2023 and 2024. From mid‑2023, ECMWF will host a computer scientist position for GT4Py, funded by the EU ESiWACE3 project.