>[!info] Learning goals
> - Being able to estimate the cost of an experiment in computer hours
> - Understanding the concept of _strong scaling_
> - Understanding the concept of _weak scaling_
> - Learning how to navigate the balance between costs and waiting time
# Introduction
In this tutorial, we will go through the steps to set up a large-eddy simulation on a supercomputer. We will use the Snellius machine as an example, but the methods apply to any supercomputer.
The example case that we will analyze is a cumulus-topped boundary layer inspired by [Stevens (2007)](https://doi.org/10.1175/JAS3983.1). This case requires the three velocity components, the liquid-water potential temperature, and the specific humidity as prognostic variables, plus one more if a TKE-based LES subgrid scheme is desired.
As a starting point, we aim for a grid of `2048 x 2048 x 256` points with an isotropic grid spacing of `12.5` m, leading to a domain size of `25.6 x 25.6 x 3.2` km$^3$. The simulation is run with a `dt` of `1.0` s, and we aim for 1 hour of simulated time, i.e., 3,600 time steps.
Once we have established the cost of such a run, we will evaluate the cost of expanding the horizontal domain at the same grid spacing, as a larger domain generally leads to better statistics.
>[!info] Questions to ask yourself while setting up simulations
>1. Will I use CPU or GPU?
>2. How much will this run cost?
>3. How long will I need to wait for this to complete?
>4. Can I afford a larger domain size?
# Finding out the cost of a run
The time the supercomputer takes to complete a run is spent on three activities: input/output (IO), calculation, and communication. The latter two generally take up the majority of the time, so we will focus on these to keep things simple.
>[!tip] Rule of thumb: the cheapest possible run is not the run you want to do.
> The cheapest possible run in terms of computer hours will (almost) always be the run that uses as few cores or GPUs as possible, because this minimizes the time spent on communication. Such a run, however, might take too long to complete, so you have to find a balance.
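To make this balance concrete, the wall-clock time and the computer-hour budget of a run can be estimated from the wall-clock time per time step, which the scaling experiments below provide. The following is a minimal Python sketch, not an official cost calculator: the rate of 1 SBU per core-hour on the `rome` partition is an assumption to verify against the Snellius documentation, and the 2.0 s per step in the example is a hypothetical number for illustration.
```python
def estimate_run(time_per_step, n_steps, n_cores, sbu_per_core_hour=1.0):
    """Estimate wall-clock time (h) and budget (SBU) of a run.

    time_per_step:     measured wall-clock seconds per model time step
    n_steps:           number of time steps (simulated time / dt)
    n_cores:           total number of cores used
    sbu_per_core_hour: accounting rate (1 SBU per core-hour assumed here)
    """
    wall_hours = time_per_step * n_steps / 3600.0
    cost_sbu = wall_hours * n_cores * sbu_per_core_hour
    return wall_hours, cost_sbu

# Hypothetical example: 1 simulated hour (3,600 steps of 1.0 s) on one
# `rome` node (128 cores) at an assumed 2.0 s of wall-clock time per step.
wall, cost = estimate_run(time_per_step=2.0, n_steps=3600, n_cores=128)
print(f"{wall:.1f} h wall clock, {cost:.0f} SBU")  # 2.0 h, 256 SBU
```
The same estimate applies to GPU runs once the SBU rate per GPU-hour of the `gpu_h100` partition is filled in.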
# Strong scaling: increasing computer power for a given problem size
First, we explore how much it would cost to increase the computer power (in terms of cores or GPUs), while keeping the number of grid points the same. The result of this experiment is what we call _strong scaling_.
## Strong scaling on the CPU (`2048 x 2048 x 256`)
The Snellius system has different partitions to which your compute job will be submitted. For CPU runs, we will make use of the `rome` partition, where 128 cores per node are available, with 1.75 GB memory per core.
As a first step, we check whether the run fits on one node, which means splitting the domain into pencils of `128 x 64 x 256` grid points, and we increase the number of cores from there. The listing below gives the wall-clock time per time step; the percentages indicate the parallel efficiency relative to the 2-node run.
```
128 cores ( 1 nodes): 2.935 s (113% wrt 2 nodes)
256 cores ( 2 nodes): 1.660 s
512 cores ( 4 nodes): 0.836 s ( 99% wrt 2 nodes)
1,024 cores ( 8 nodes): 0.444 s ( 93% wrt 2 nodes)
2,048 cores (16 nodes): 0.238 s ( 87% wrt 2 nodes)
4,096 cores (32 nodes): 0.114 s ( 91% wrt 2 nodes)
```
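For reference, the percentages in this listing can be reproduced from the timings themselves; a minimal sketch, with the timings copied from the listing above:
```python
# Wall-clock time per step (s) from the CPU strong-scaling test above.
timings = {128: 2.935, 256: 1.660, 512: 0.836,
           1024: 0.444, 2048: 0.238, 4096: 0.114}

ref_cores, ref_time = 256, timings[256]  # the 2-node run is the reference

for cores, time in timings.items():
    # Strong-scaling efficiency: achieved speedup divided by ideal speedup.
    efficiency = (ref_time * ref_cores) / (time * cores)
    print(f"{cores:5d} cores: {time:.3f} s ({efficiency:4.0%} wrt 2 nodes)")
```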
## Strong scaling on the GPU (`2048 x 2048 x 256`)
For this test, we use the NVIDIA H100 GPUs, which can be accessed via the `gpu_h100` partition (4 GPUs per node).
```
1 GPUs ( 0.25 nodes): 0.4072 s
2 GPUs ( 0.5 nodes): 0.3025 s
4 GPUs ( 1 nodes): 0.1321 s
8 GPUs ( 2 nodes): 0.0799 s
16 GPUs ( 4 nodes): 0.0448 s
```
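The GPU listing does not include efficiencies, but the same formula as on the CPU side applies. As a quick estimate from the timings above, here taking the single-GPU run as the reference (a choice made for this sketch, since the listing gives none):
```python
# Wall-clock time per step (s) from the GPU strong-scaling test above.
timings = {1: 0.4072, 2: 0.3025, 4: 0.1321, 8: 0.0799, 16: 0.0448}

for gpus, time in timings.items():
    efficiency = timings[1] / (time * gpus)  # relative to the 1-GPU run
    print(f"{gpus:2d} GPUs: {efficiency:3.0%} wrt 1 GPU")
```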

*Strong scaling on the Snellius supercomputer a) on the Rome CPU nodes (128 cores per node) and b) on the H100 GPU nodes (4 GPUs per node) for a 2,048 x 2,048 x 256 problem size.*
## Comparing cost and speed

*a) Simulated hours per hour of wall-clock time and b) cost per simulated hour in SBU (system billing unit, often also called "core hour").*
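Both quantities follow directly from the time per step measured above. The sketch below computes them for the CPU strong-scaling runs; the rate of 1 SBU per core-hour on `rome` is again an assumption, and the corresponding rate per GPU-hour on `gpu_h100` should be taken from the current Snellius documentation before repeating this for the GPU runs.
```python
DT = 1.0                  # model time step (s)
SBU_PER_CORE_HOUR = 1.0   # assumed accounting rate on the `rome` partition

# Wall-clock time per step (s) from the CPU strong-scaling test.
cpu_timings = {128: 2.935, 256: 1.660, 512: 0.836,
               1024: 0.444, 2048: 0.238, 4096: 0.114}

for cores, time_per_step in cpu_timings.items():
    # a) How much simulated time do we obtain per hour of waiting?
    sim_hours_per_wall_hour = DT / time_per_step
    # b) What does one simulated hour cost?
    wall_hours_per_sim_hour = (3600.0 / DT) * time_per_step / 3600.0
    sbu_per_sim_hour = wall_hours_per_sim_hour * cores * SBU_PER_CORE_HOUR
    print(f"{cores:5d} cores: {sim_hours_per_wall_hour:4.1f} sim h per wall h, "
          f"{sbu_per_sim_hour:4.0f} SBU per sim h")
```
With these assumed rates, the single-node run is indeed the cheapest per simulated hour (roughly 375 SBU versus roughly 470 SBU on 32 nodes), while the larger runs reduce the waiting time per simulated hour from about 3 hours to a few minutes: exactly the trade-off from the rule of thumb above.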
# Weak scaling: extending the problem size
Now, we explore how much it would cost to increase the domain size, while keeping the number of grid points per core or per GPU the same. The result of this experiment is what we call _weak scaling_.
## Weak scaling CPU (building blocks of `2048 x 2048 x 256`)
We keep `2048 x 2048 x 256` grid points per node and grow the domain with the number of nodes. The percentages again give the efficiency relative to the 2-node run.
```
1 nodes: 2,048 x 2,048 x 256: 2.935 s (145.8% wrt 2 nodes)
2 nodes: 4,096 x 2,048 x 256: 4.280 s
4 nodes: 4,096 x 4,096 x 256: 4.076 s (105.0% wrt 2 nodes)
8 nodes: 8,192 x 4,096 x 256: 4.184 s (102.2% wrt 2 nodes)
16 nodes: 8,192 x 8,192 x 256: 4.289 s ( 99.8% wrt 2 nodes)
32 nodes: 16,384 x 8,192 x 256: 4.775 s ( 89.6% wrt 2 nodes)
64 nodes: 16,384 x 16,384 x 256: 5.053 s ( 84.7% wrt 2 nodes)
```
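Because the work per node is constant in this experiment, the weak-scaling efficiency is simply the ratio of the reference time per step to the measured one. A minimal check of the listed percentages (small differences in the last digit can come from rounding of the timings):
```python
# Wall-clock time per step (s) from the CPU weak-scaling test above.
timings = {1: 2.935, 2: 4.280, 4: 4.076, 8: 4.184,
           16: 4.289, 32: 4.775, 64: 5.053}

ref_time = timings[2]  # the 2-node run is again the reference

for nodes, time in timings.items():
    # Weak-scaling efficiency: constant work per node, so a plain time ratio.
    efficiency = ref_time / time
    print(f"{nodes:2d} nodes: {time:.3f} s ({efficiency:6.1%} wrt 2 nodes)")
```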
## Weak scaling GPU (building blocks of `2048 x 2048 x 256`)
We keep `2048 x 2048 x 256` grid points per GPU. For 64 GPUs, two different process decompositions (`1 x 64` and `8 x 8`) are shown.
```
1 GPU: 2,048 x 2,048 x 256: 0.4071 s
2 GPU: 4,096 x 2,048 x 256: 0.6124 s
4 GPU: 4,096 x 4,096 x 256: 0.5570 s
8 GPU: 8,192 x 4,096 x 256: 0.6586 s
16 GPU: 8,192 x 8,192 x 256: 0.7150 s
32 GPU: 16,384 x 8,192 x 256: 0.7780 s
64 GPU: 16,384 x 16,384 x 256: 1.345 s ( 1 x 64)
64 GPU: 16,384 x 16,384 x 256: 1.298 s ( 8 x 8)
```

*Weak scaling on the Snellius supercomputer a) on the Rome CPU nodes (128 cores per node) and b) on the H100 GPU nodes (4 GPUs per node) for a problem size of 2,048 x 2,048 x 256 grid points per node (CPU) or per GPU.*
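Coming back to the question of whether a larger domain is affordable: under the same assumptions as before (3,600 time steps for one simulated hour, an assumed 1 SBU per core-hour on `rome`), the weak-scaling timings give a direct estimate for the largest CPU case.
```python
# Largest CPU weak-scaling run: 16,384 x 16,384 x 256 on 64 nodes.
time_per_step = 5.053        # s per step, from the weak-scaling listing above
n_steps = 3600               # one simulated hour at dt = 1.0 s
n_cores = 64 * 128           # 64 `rome` nodes of 128 cores each

wall_hours = time_per_step * n_steps / 3600.0
cost_sbu = wall_hours * n_cores   # assuming 1 SBU per core-hour
print(f"{wall_hours:.1f} h wall clock, {cost_sbu:.0f} SBU per simulated hour")
```
This gives roughly 5 hours of waiting and about 41,000 SBU per simulated hour. Compared with the 2-node `4,096 x 2,048 x 256` run (about 1,100 SBU per simulated hour), the domain is 32 times larger but the cost is roughly 38 times higher, the difference being the 84.7% weak-scaling efficiency; whether that fits your budget is exactly the balance between costs and waiting time that this tutorial asks you to weigh.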