>[!info] Learning goals
> - Being able to estimate the cost of an experiment in computer hours
> - Understanding the concept of _strong scaling_
> - Understanding the concept of _weak scaling_
> - Learning how to navigate the balance between costs and waiting time
# Introduction
In this tutorial, we will go through the steps to set up a large simulation on a supercomputer. We will use the Snellius machine as an example, but the methods apply to any supercomputer.
The example case that we will analyze is a cumulus-topped boundary layer inspired by the Stevens (200x) study of ballistic growth. This case requires the three velocity components, the liquid-water potential temperature, and the specific humidity as prognostic variables, plus one more if a TKE-based LES subgrid scheme is desired.
As a starting point, we will aim for a grid of `2048 x 2048 x 256` points with an isotropic grid spacing of `12.5` m, leading to a domain size of `25.6 x 25.6 x 3.2` km$^3$.
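Before requesting any resources, it helps to translate these numbers into a problem size and a rough memory footprint. The sketch below (Python) does exactly that; the field count and the use of double precision are illustrative assumptions, not properties of the model.
```python
# Rough problem-size estimate for the target grid.
itot, jtot, ktot = 2048, 2048, 256   # grid points in x, y, z
dx = 12.5                            # isotropic grid spacing (m)

npoints = itot * jtot * ktot
print(f"grid points: {npoints / 1e9:.2f} x 10^9")
print(f"domain: {itot * dx / 1e3:.1f} x {jtot * dx / 1e3:.1f} x {ktot * dx / 1e3:.1f} km^3")

# ASSUMPTION: six prognostic fields (u, v, w, thl, qt, subgrid TKE) in double
# precision. Real codes also carry tendencies and temporaries, so the true
# memory footprint is a multiple of this lower bound.
nfields = 6
mem_gb = npoints * nfields * 8 / 1e9
print(f"prognostic fields alone: ~{mem_gb:.0f} GB")
```
At roughly 52 GB for the prognostic fields alone, the run looks like a plausible candidate for a single CPU node (which offers 128 x 1.75 = 224 GB, see below), but only once temporaries are accounted for; this is what the first strong-scaling test below checks in practice.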
>[!info] Questions to ask yourself
>1. Will I use CPU or GPU?
>2. How much will this run cost?
>3. How long do I need to wait for this run to complete?
# Finding out the cost of a run
The time the supercomputer takes to complete a run is spent on three activities: input/output (IO), computation, and communication. Generally, the latter two take up the majority of the time, although for very large runs things can get more complex; we will look into that later in the tutorial.
>[!tip] Rule of thumb: the cheapest possible run
> The cheapest possible run, in terms of computer hours, will be the run that uses as few cores or GPUs as possible.
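The cost of a run in computer hours is simply the amount of reserved hardware multiplied by the wall-clock time, whether or not that hardware is used efficiently. A minimal sketch is given below; the job sizes are made-up examples, and the translation into Snellius accounting units depends on the billing rate of each partition, so check the SURF documentation for the current rates.
```python
# Computer-hour cost = reserved resources x wall-clock time.
def core_hours(n_nodes: int, cores_per_node: int, wallclock_hours: float) -> float:
    """Core hours charged for a CPU job."""
    return n_nodes * cores_per_node * wallclock_hours

def gpu_hours(n_gpus: int, wallclock_hours: float) -> float:
    """GPU hours charged for a GPU job."""
    return n_gpus * wallclock_hours

# Illustrative examples: a 24-hour job on 8 rome nodes versus one on 4 GPUs.
print(core_hours(8, 128, 24.0), "core hours")  # 24576
print(gpu_hours(4, 24.0), "GPU hours")         # 96.0
```
The rule of thumb above follows directly from this: since parallel efficiency is typically below one, doubling the resources rarely halves the wall-clock time, so the product of the two, and hence the cost, usually grows with the core or GPU count.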
## Running on a CPU
### Strong scaling (`2048 x 2048 x 256`)
The Snellius system has several partitions to which compute jobs can be submitted. For CPU runs we will use the `rome` partition, which offers 128 cores per node with 1.75 GB of memory per core.
As a first test, we check whether the run fits on a single node. With the 128 cores of a node and a two-dimensional domain decomposition (for example `16 x 8`), this means splitting the domain into pencils of `128 x 256 x 256` grid points; from there we refine by adding nodes. The numbers in parentheses in the results below give the parallel efficiency relative to the single-node run.
```
 128 cores ( 1 node ): 2.935 s / rkstep  (efficiency 1.00)
 256 cores ( 2 nodes): 1.660 s / rkstep  (efficiency 0.88)
 512 cores ( 4 nodes): 0.836 s / rkstep  (efficiency 0.88)
1024 cores ( 8 nodes): 0.444 s / rkstep  (efficiency 0.83)
2048 cores (16 nodes): 0.238 s / rkstep  (efficiency 0.77)
4096 cores (32 nodes): 0.114 s / rkstep  (efficiency 0.80)
8192 cores (64 nodes): unclear (large run-to-run variation)
```
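The parallel efficiencies in the table can be recomputed directly from the timings, which also shows how to extend the analysis to new measurements. A short sketch using the measured values above:
```python
# Strong scaling: fixed problem size (2048 x 2048 x 256), growing node count.
# Timings are the measured s/rkstep values from the table above.
timings = {1: 2.935, 2: 1.660, 4: 0.836, 8: 0.444, 16: 0.238, 32: 0.114}

t1 = timings[1]
for nodes, t in timings.items():
    speedup = t1 / t
    efficiency = speedup / nodes
    print(f"{nodes:3d} nodes: speedup {speedup:5.1f}, efficiency {efficiency:.2f}")
```
Efficiency stays around 0.8 up to 32 nodes, so running wider mainly buys a shorter waiting time at a cost overhead of roughly 15-30% compared to a single node.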
### Weak scaling
We keep the workload fixed at `2048 x 2048 x 256` grid points per node and grow the domain with the number of nodes:
```
 1 node :  2048 x  2048 x 256: 2.935 s / rkstep
 2 nodes:  4096 x  2048 x 256: 4.280 s / rkstep
 4 nodes:  4096 x  4096 x 256: 4.076 s / rkstep
 8 nodes:  8192 x  4096 x 256: 4.184 s / rkstep
16 nodes:  8192 x  8192 x 256: 4.289 s / rkstep
32 nodes: 16384 x  8192 x 256: 4.775 s / rkstep
64 nodes: 16384 x 16384 x 256: 5.053 s / rkstep
```
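For weak scaling, the work per node is constant, so the efficiency is simply the single-node time divided by the N-node time. A sketch with the measured values:
```python
# Weak scaling: workload fixed at 2048 x 2048 x 256 grid points per node.
timings = {1: 2.935, 2: 4.280, 4: 4.076, 8: 4.184, 16: 4.289, 32: 4.775, 64: 5.053}

t1 = timings[1]
for nodes, t in timings.items():
    print(f"{nodes:3d} nodes: weak-scaling efficiency {t1 / t:.2f}")
```
The largest drop occurs when going from one to two nodes (from 1.00 to roughly 0.69), which suggests that inter-node communication is the dominant overhead; beyond that, the efficiency decays only slowly, down to roughly 0.58 at 64 nodes.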
## Running on a GPU
### Strong scaling (`2048 x 2048 x 256`)
These tests are run on the H100 GPU partition, which has four GPUs per node.
```
 1 GPU  (0.25 nodes): 0.4072 s / rkstep
 2 GPUs (0.5  nodes): 0.3025 s / rkstep
 4 GPUs (1    node ): 0.1321 s / rkstep
 8 GPUs (2    nodes): 0.0799 s / rkstep
16 GPUs (4    nodes): 0.0448 s / rkstep
```
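With the strong-scaling numbers for both architectures in hand, we can already compare cost against waiting time for the target grid. The sketch below expresses both in wall-clock hours and resource hours per 1000 RK substeps; how core hours and GPU hours compare in the actual accounting depends on the billing rates of the partitions, which you should look up in the SURF documentation.
```python
# Cost versus waiting time per 1000 RK substeps of the 2048 x 2048 x 256 case,
# based on the measured strong-scaling timings above.
cpu = {1: 2.935, 2: 1.660, 4: 0.836, 8: 0.444, 16: 0.238, 32: 0.114}  # nodes -> s/rkstep
gpu = {1: 0.4072, 2: 0.3025, 4: 0.1321, 8: 0.0799, 16: 0.0448}        # GPUs  -> s/rkstep

print("CPU (rome, 128 cores per node):")
for nodes, t in cpu.items():
    wall_h = 1000 * t / 3600
    print(f"  {nodes:3d} nodes: {wall_h:6.3f} h wall clock, {128 * nodes * wall_h:7.1f} core hours")

print("GPU (H100, 4 GPUs per node):")
for ngpus, t in gpu.items():
    wall_h = 1000 * t / 3600
    print(f"  {ngpus:3d} GPUs:  {wall_h:6.3f} h wall clock, {ngpus * wall_h:7.3f} GPU hours")
```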
### Weak scaling
We keep the workload fixed at `2048 x 2048 x 256` grid points per GPU and grow the domain with the number of GPUs:
```
 1 GPU :  2048 x  2048 x 256: 0.4071 s / rkstep
 2 GPUs:  4096 x  2048 x 256: 0.6124 s / rkstep
 4 GPUs:  4096 x  4096 x 256: 0.5570 s / rkstep
 8 GPUs:  8192 x  4096 x 256: 0.6586 s / rkstep
16 GPUs:  8192 x  8192 x 256: 0.7150 s / rkstep
32 GPUs: 16384 x  8192 x 256: 0.7780 s / rkstep
64 GPUs: 16384 x 16384 x 256: 1.345  s / rkstep (1 x 64 decomposition)
64 GPUs: 16384 x 16384 x 256: 1.298  s / rkstep (8 x 8 decomposition)
```
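Finally, the measured time per RK substep can be turned into an estimate for the full experiment. The time step, the number of substeps per time step, and the simulated duration below are placeholder assumptions; substitute the values of your own case.
```python
# Rough total-cost estimate for the full experiment (all three inputs are ASSUMPTIONS).
dt = 2.0               # assumed time step (s)
substeps = 3           # assumed RK substeps per time step
sim_hours = 24.0       # assumed simulated duration (h)

n_rksteps = sim_hours * 3600 / dt * substeps

# Two candidate configurations with their measured s/rkstep from the tables above.
configs = {
    "8 CPU nodes (1024 cores)": (0.444, 8 * 128),  # (s/rkstep, reserved units)
    "4 GPUs (1 H100 node)":     (0.1321, 4),
}

for name, (t_rkstep, units) in configs.items():
    wall_h = n_rksteps * t_rkstep / 3600
    print(f"{name}: {wall_h:5.1f} h wall clock, {units * wall_h:8.0f} resource hours")
```
In practice the time step may be adaptive and IO adds overhead on top of this, so treat such an estimate as a lower bound and include a safety margin when requesting budget.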