Seismic HPC RTM Sizing Worksheet

Section 1 — Model Parameters

Fill these in.

| Parameter | Symbol | Example |
| --- | --- | --- |
| Model length X | \(L_x\) | 40000 m |
| Model length Y | \(L_y\) | 40000 m |
| Model depth | \(L_z\) | 12000 m |
| Minimum velocity | \(v_{\min}\) | 1500 m/s |
| Maximum velocity | \(v_{\max}\) | 4500 m/s |
| Maximum frequency | \(f_{\max}\) | 15 Hz |
| Grid points per wavelength | \(n_\lambda\) | 10 |
| CFL constant | \(C_{\mathrm{CFL}}\) | 0.45 |
| Propagation time | \(T_{\mathrm{prop}}\) | 8 s |
| Number of shots | \(N_{\mathrm{shot}}\) | 20000 |

Section 2 — Hardware Parameters

| Parameter | Symbol | Example |
| --- | --- | --- |
| RTM multiplier | \(\alpha_{\mathrm{RTM}}\) | 2.4 |
| GPU effective throughput (cell updates/s) | \(P_{\mathrm{gpu}}\) | \(6\times10^9\) |
| Strong-scaling efficiency | \(\eta_{\mathrm{scale}}\) | 0.75 |
| GPUs per shot domain | \(G_{\mathrm{shot}}\) | 64 |
| Cluster utilization | \(U\) | 0.85 |
| Target campaign time | \(H_{\mathrm{wall}}\) | 120 hours |

Section 3 — Spatial Grid

Equation:

\[ \Delta x \approx \frac{v_{\min}}{n_\lambda f_{\max}} \]

Where \(n_\lambda\) is the number of grid points per wavelength and \(v_{\min}\) is the minimum velocity in the model.

This ensures that the shortest wavelength is resolved and determines grid spacing.
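With the Section 1 example values, the rule above can be evaluated directly (a minimal sketch; variable names are illustrative):

```python
# Grid spacing from the points-per-wavelength rule (Section 1 example values).
v_min = 1500.0    # m/s, minimum velocity
f_max = 15.0      # Hz, maximum frequency
n_lambda = 10     # grid points per wavelength

dx = v_min / (n_lambda * f_max)
print(dx)  # 10.0 m
```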

Section 4 — Number of Cells

The number of grid cells is determined by the physical model dimensions and the spatial grid spacing.

\[ N_{\mathrm{cell}}=\frac{L_xL_yL_z}{\Delta x^3} \]

where \(L_x\), \(L_y\), and \(L_z\) are the physical model dimensions and \(\Delta x\) is the spatial grid spacing.

Frequency scaling

Grid spacing is chosen to resolve the shortest wavelength in the model:

\[ \Delta x \approx \frac{v_{\min}}{n_\lambda f_{\max}} \]

Substituting this into the cell-count expression gives

\[ N_{\mathrm{cell}} = K_x f_{\max}^3 \qquad\text{where}\qquad K_x=\frac{L_xL_yL_zn_\lambda^3}{v_{\min}^3} \]

So \(N_{\mathrm{cell}}\propto f_{\max}^3\) for a fixed physical model size.

Padded computational domain

In practice, the solver operates on a larger computational grid than the physical model: absorbing boundary layers (PML or sponge zones) must surround the physical domain, and the top and bottom of the model may need different treatment (free surface versus absorber).

Let the padded computational dimensions be

\[ L_x^{\mathrm{comp}} = L_x + 2L_{\mathrm{PML},x},\qquad L_y^{\mathrm{comp}} = L_y + 2L_{\mathrm{PML},y},\qquad L_z^{\mathrm{comp}} = L_z + L_{\mathrm{PML,top}} + L_{\mathrm{PML,bot}} \]

The true number of grid cells used by the simulation is then

\[ N_{\mathrm{cell,true}}= \frac{L_x^{\mathrm{comp}}L_y^{\mathrm{comp}}L_z^{\mathrm{comp}}}{\Delta x^3} \]

Determining absorber thickness

Absorbing boundary layers are usually specified in grid points rather than physical distance.

\[ L_{\mathrm{PML}} = N_{\mathrm{PML}}\Delta x \]

Typical values in production wave-equation solvers are on the order of 20–40 grid cells per absorbing boundary, depending on absorber formulation and reflection tolerance requirements.

The \(f_{\max}^3\) scaling law describes the interior model scaling. Boundary padding slightly increases the total cell count beyond the interior \(f_{\max}^3\) scaling, so actual simulations should compute the cell count using the padded dimensions above rather than the asymptotic scaling rule alone.
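A hedged numeric sketch of the padding effect, assuming \(\Delta x = 10\) m and 30-cell absorbing pads on every face (equal top and bottom pads for simplicity; all values illustrative):

```python
# Interior vs. padded cell counts for the Section 1 example model,
# assuming dx = 10 m and 30-cell absorbing pads on every face
# (equal top/bottom pads for simplicity; all values illustrative).
dx = 10.0
n_pml = 30
pad = n_pml * dx                       # 300 m of padding per face

Lx, Ly, Lz = 40000.0, 40000.0, 12000.0
n_interior = Lx * Ly * Lz / dx**3
n_true = (Lx + 2 * pad) * (Ly + 2 * pad) * (Lz + 2 * pad) / dx**3

# For a model this large the padding overhead is modest (~8%); smaller
# models or thicker absorbers push it toward the 15-30% napkin range.
print(f"{n_interior:.3e} {n_true:.3e} {n_true / n_interior:.3f}")
```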

One-line practical formula

For a uniform grid, if you know the model volume, minimum velocity, target frequency, and points per wavelength, collapse the spatial constant and work in km\(^3\):

\[ N_{\mathrm{cell}} \approx \left(\frac{n_\lambda}{v_{\min}}\right)^3 \left(10^9\right) V_{\mathrm{km}^3} f_{\max}^3 \]

\(10^9\) comes from \(1~\mathrm{km}^3 = 10^9~\mathrm{m}^3\), where \(V_{\mathrm{km}^3}\) is model volume in km\(^3\), \(v_{\min}\) is in m/s, and \(f_{\max}\) is in Hz.

Back-of-the-napkin size and padding correction shortcuts, assuming \(n_\lambda\approx 10\) and \(v_{\min}\approx 1500~\mathrm{m/s}\):

\[ N_{\mathrm{cell,true}}\approx (1.15\text{ to }1.30)N_{\mathrm{cell}} \qquad\text{where}\qquad N_{\mathrm{cell}}\approx 300\,V_{\mathrm{km}^3}f_{\max}^3 \]
\[ \text{and, anticipating Sections 6–7:}\quad W_{\mathrm{prop}} \propto N_{\mathrm{cell}}\,f_{\max} \propto f_{\max}^4 \]
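A quick check of the napkin formula against the Section 1 example (\(V = 40\times 40\times 12~\mathrm{km}^3\)):

```python
# Napkin estimate vs. exact interior count for the example model.
V_km3 = 40 * 40 * 12          # 19200 km^3
f_max = 15.0
n_cell_napkin = 300 * V_km3 * f_max**3
n_cell_exact = 40000.0 * 40000.0 * 12000.0 / 10.0**3   # dx = 10 m
print(f"{n_cell_napkin:.3e} {n_cell_exact:.3e}")  # ~1.94e10 vs 1.92e10
```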

Section 5 — Time Step

CFL constant:

\[ C_{\mathrm{CFL}}=\frac{v_{\max}\Delta t}{\Delta x} \]

where \(v_{\max}\) is maximum propagation velocity, \(\Delta t\) is the time step, and \(\Delta x\) is spatial grid spacing.

\[ \Delta t \le \frac{C_{\mathrm{CFL}}\Delta x}{v_{\max}} \]

Production codes usually choose close to the limit for efficiency:

\[ \Delta t \approx \frac{C_{\mathrm{CFL}}\Delta x}{v_{\max}} \]

Because the time step scales with grid spacing, \(\Delta t\propto \Delta x\). Since grid spacing decreases as maximum frequency increases, \(\Delta x\propto 1/f_{\max}\), so the number of time steps required for a fixed propagation time scales as \(N_t\propto f_{\max}\).

Section 6 — Number of Time Steps

\[ N_t=\frac{T_{\mathrm{prop}}}{\Delta t} \]

Scaling:

\[ N_t\propto f_{\max} \]

This is correct only because \(\Delta x\propto 1/f_{\max}\) and \(\Delta t\propto \Delta x\) from the CFL condition above.

To make the slope explicit, since it depends on model physics and numerics:

\[ N_t = \frac{T_{\mathrm{prop}}n_\lambda v_{\max}}{C_{\mathrm{CFL}}v_{\min}}\,f_{\max} \;\Rightarrow\; N_t = K_t f_{\max} \]
\[ K_t = \frac{T_{\mathrm{prop}}n_\lambda v_{\max}}{C_{\mathrm{CFL}}v_{\min}} \]

So higher frequency yields more time steps linearly, but the slope depends on model physics and numerical methods.
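The slope can be verified numerically with the example values (a sketch):

```python
# Time step and step count for the example (Sections 1, 5, 6).
v_min, v_max = 1500.0, 4500.0
f_max, n_lambda = 15.0, 10
c_cfl, t_prop = 0.45, 8.0

dx = v_min / (n_lambda * f_max)     # 10 m
dt = c_cfl * dx / v_max             # 0.001 s
n_t = t_prop / dt                   # 8000 steps

# Cross-check against the closed form N_t = K_t * f_max.
k_t = t_prop * n_lambda * v_max / (c_cfl * v_min)
print(dt, n_t, k_t * f_max)
```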

Section 7 — Wavefield Work per Propagation

\[ W_{\mathrm{prop}} = N_{\mathrm{cell}}N_t \]

This counts cell updates, not FLOPs.

Spatial discretization: \(N_{\mathrm{cell}}\propto f_{\max}^3\)

Temporal discretization: \(N_t\propto f_{\max}\)

Scaling:

\[ W_{\mathrm{prop}}\propto f_{\max}^4 \]

This is the core seismic HPC scaling law for uniform 3D discretization with fixed physical model size.

\[ W_{\mathrm{prop}} = K_xK_tf_{\max}^4 \]
\[ K_x=\frac{L_xL_yL_zn_\lambda^3}{v_{\min}^3}, \qquad K_t=\frac{T_{\mathrm{prop}}n_\lambda v_{\max}}{C_{\mathrm{CFL}}v_{\min}} \]
\[ \Rightarrow W_{\mathrm{prop}}= \frac{L_xL_yL_zT_{\mathrm{prop}}n_\lambda^4v_{\max}}{C_{\mathrm{CFL}}v_{\min}^4} f_{\max}^4 \]

\(W_{\mathrm{prop}}\) counts total cell updates and is therefore dimensionless; the apparent time units in the constants cancel when combined with \(f_{\max}^4\).

These constants assume explicit time stepping and uniform grid spacing determined by minimum velocity and a fixed points-per-wavelength rule. Adaptive and anisotropic grids invalidate this simple constant-coefficient scaling model.
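For the example model, the work can be computed two equivalent ways, directly and via the closed form (a sketch with Section 1 values):

```python
# Work per propagation for the example, two equivalent ways (Section 7).
v_min, v_max = 1500.0, 4500.0
f_max, n_lambda = 15.0, 10
c_cfl, t_prop = 0.45, 8.0
Lx, Ly, Lz = 40000.0, 40000.0, 12000.0

n_cell = 1.92e10                     # interior cells (Section 4)
n_t = 8000                           # time steps (Section 6)
w_direct = n_cell * n_t

k_x = Lx * Ly * Lz * n_lambda**3 / v_min**3
k_t = t_prop * n_lambda * v_max / (c_cfl * v_min)
w_closed = k_x * k_t * f_max**4
print(f"{w_direct:.3e} {w_closed:.3e}")   # both ~1.5e14 cell updates
```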

Section 8 — RTM Work per Shot

Define an RTM multiplier as an estimate:

\[ \alpha_{\mathrm{RTM}} \approx 2.4 \]

Derived from:

\[ \alpha_{\mathrm{fwd}}\approx 1.0,\qquad \alpha_{\mathrm{bwd}}\approx 1.0,\qquad \alpha_{\mathrm{img}}\approx 0.1\text{–}0.2,\qquad \alpha_{\mathrm{overhead}}\approx 0.1\text{–}0.3 \]

Or better yet:

\[ \alpha_{\mathrm{RTM}}= \frac{\text{measured end-to-end RTM work per shot}}{N_{\mathrm{cell}}N_t} \]

Estimation methodology:

\[ \alpha_{\mathrm{RTM}}= \alpha_{\mathrm{fwd}}+\alpha_{\mathrm{bwd}}+\alpha_{\mathrm{img}}+\alpha_{\mathrm{ovh}} \]

This multiplier depends on stencil order, physics, memory strategy, checkpointing method, imaging condition complexity, communication overhead, GPU occupancy, memory bandwidth behavior, and how your code defines “cell updates/s”.

Important: If \(P_{\mathrm{gpu}}\) already comes from an end-to-end RTM benchmark, do not apply a separate \(\alpha_{\mathrm{RTM}}\) multiplier or the cost will be double-counted.

Then:

\[ W_{\mathrm{rtm,shot}}=\alpha_{\mathrm{RTM}}N_{\mathrm{cell}}N_t \qquad\text{if } P_{\mathrm{gpu}} \text{ is cell updates/s for one stencil propagation} \]
\[ W_{\mathrm{rtm,shot}}=N_{\mathrm{cell}}N_t \qquad\text{if } P_{\mathrm{gpu}} \text{ already means effective RTM cell updates/s} \]

Section 9 — RTM Runtime per Shot

\[ t_{\mathrm{shot}}= \frac{W_{\mathrm{rtm,shot}}}{G_{\mathrm{shot}}P_{\mathrm{gpu,eff}}} \]

in seconds of wall-clock time per shot.
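Putting Sections 7–9 together for the example values, and assuming \(P_{\mathrm{gpu}}\) is a single-propagation rate so the multiplier applies (per the double-counting caveat in Section 8):

```python
# RTM runtime per shot for the example (Sections 8-9), assuming P_gpu is a
# single-propagation rate so the alpha multiplier applies.
alpha_rtm = 2.4
w_prop = 1.536e14            # cell updates per propagation (Section 7)
g_shot = 64                  # GPUs per shot domain
p_gpu = 6e9                  # cell updates/s per GPU, single-GPU rate
eta_scale = 0.75             # strong-scaling efficiency

w_rtm_shot = alpha_rtm * w_prop
p_eff = p_gpu * eta_scale
t_shot = w_rtm_shot / (g_shot * p_eff)
print(t_shot)                # 1280.0 seconds, roughly 21 minutes per shot
```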

Notes on \(P_{\mathrm{gpu}}\)

\(P_{\mathrm{gpu}}\) is effective cell updates per second per GPU and will be bound by one of three things:

\[ P_{\mathrm{gpu}} \approx \min(P_{\mathrm{mem}},P_{\mathrm{comp}},P_{\mathrm{comm}}) \]
\[ \begin{aligned} P_{\mathrm{mem}} &= \text{memory-bandwidth-limited throughput} \\ P_{\mathrm{comp}} &= \text{FLOP-limited throughput} \\ P_{\mathrm{comm}} &= \text{communication-limited throughput} \end{aligned} \]

The core method is:

  1. Estimate bytes per cell update.
  2. Estimate FLOPs per cell update.
  3. Compare those with GPU hardware limits.
  4. Apply efficiency factors for real execution losses.
\[ P_{\mathrm{gpu}} \approx \min\left( \frac{B_{\mathrm{gpu}}\eta_{\mathrm{mem}}}{b_{\mathrm{cell}}}, \frac{F_{\mathrm{gpu}}\eta_{\mathrm{comp}}}{f_{\mathrm{cell}}}, P_{\mathrm{comm}} \right) \]

Memory-bandwidth estimate

\[ P_{\mathrm{mem}}=\frac{B_{\mathrm{gpu}}\eta_{\mathrm{mem}}}{b_{\mathrm{cell}}} \]

You need to estimate \(b_{\mathrm{cell}}\), the effective bytes transferred per cell update.

For a simple acoustic finite-difference time step that touches roughly four 4-byte floating-point values per cell (say, two wavefield time levels, the velocity model, and the output write), a naive estimate might be:

\[ b_{\mathrm{cell,naive}}\approx 4\times 4=16~\text{bytes/cell} \]

For planning, a more realistic acoustic estimate is often:

\[ b_{\mathrm{cell}}\approx 24\text{ to }80~\text{bytes/cell update} \]
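A hedged numeric sketch, assuming roughly 2 TB/s of HBM bandwidth, 70% achievable bandwidth, and a mid-range 40 bytes per cell update (all values illustrative, not measured):

```python
# Memory-bandwidth-limited throughput sketch (illustrative hardware values).
b_gpu = 2.0e12        # bytes/s peak memory bandwidth (assumed)
eta_mem = 0.7         # achievable fraction of peak (assumed)
b_cell = 40.0         # effective bytes per cell update (mid-range from text)

p_mem = b_gpu * eta_mem / b_cell
print(f"{p_mem:.1e}")  # ~3.5e10 cell updates/s upper bound per GPU
```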

Compute-limited estimate

\[ P_{\mathrm{comp}}=\frac{F_{\mathrm{gpu}}\eta_{\mathrm{comp}}}{f_{\mathrm{cell}}} \]

For an acoustic stencil, \(f_{\mathrm{cell}}\) might be on the order of

\[ 50\text{ to }300~\text{FLOPs/update} \]

Arithmetic intensity shortcut

\[ I=\frac{f_{\mathrm{cell}}}{b_{\mathrm{cell}}} \qquad\text{and}\qquad I_{\mathrm{balance}}=\frac{F_{\mathrm{gpu}}}{B_{\mathrm{gpu}}} \]
If \(I < I_{\mathrm{balance}}\), the kernel is memory-bound; otherwise it is compute-bound.
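Applying the shortcut with illustrative hardware numbers (roughly 60 TFLOP/s FP32 and 2 TB/s bandwidth, both assumptions) and the mid-range per-cell costs from the text:

```python
# Roofline-style bound check (hardware numbers are assumptions).
f_gpu = 60e12         # FLOP/s peak (assumed)
b_gpu = 2.0e12        # bytes/s peak bandwidth (assumed)
f_cell = 150.0        # FLOPs per cell update (mid-range from text)
b_cell = 40.0         # bytes per cell update (mid-range from text)

intensity = f_cell / b_cell          # 3.75 FLOP/byte
balance = f_gpu / b_gpu              # 30 FLOP/byte
print(intensity < balance)           # True: the stencil is memory-bound
```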

Communication-limited estimate

When one shot domain is split across multiple GPUs, halo exchange can become the real limit.

\[ V_{\mathrm{halo}}\approx 2h(n_yn_z+n_xn_z+n_xn_y)n_{\mathrm{fields}}b \]

where \(h\) is the halo width in cells, \(n_x,n_y,n_z\) are the local subdomain dimensions, \(n_{\mathrm{fields}}\) is the number of exchanged wavefields, and \(b\) is bytes per value.
\[ t_{\mathrm{comm}}\approx \frac{V_{\mathrm{halo}}}{B_{\mathrm{net,eff}}} \qquad t_{\mathrm{comp}}\approx \frac{n_xn_yn_z}{P_{\mathrm{gpu,single}}} \]
\[ P_{\mathrm{comm}}\approx \frac{N_{\mathrm{local\ cells}}}{t_{\mathrm{comp}}+t_{\mathrm{comm}}-t_{\mathrm{overlap}}} \]
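A worked sketch for one GPU's subdomain, assuming a \(512^3\) local block, halo width 4, two FP32 wavefields, 25 GB/s effective network bandwidth, and no compute/communication overlap (all values illustrative):

```python
# Communication-limited throughput sketch for one GPU's subdomain
# (all values illustrative; no compute/comm overlap assumed).
nx = ny = nz = 512
h, n_fields, b = 4, 2, 4              # halo width, fields, bytes per value
b_net_eff = 25e9                      # bytes/s effective network (assumed)
p_single = 6e9                        # cell updates/s, single GPU

v_halo = 2 * h * (ny * nz + nx * nz + nx * ny) * n_fields * b
t_comm = v_halo / b_net_eff
t_comp = nx * ny * nz / p_single
p_comm = nx * ny * nz / (t_comp + t_comm)
print(f"{p_comm:.2e}")                # below p_single: halos cost throughput
```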

The practical analytic workflow

Estimate \(P_{\mathrm{gpu}}\) without measurements.

Step A: define the kernel precisely.

Step B: estimate per-cell work \(f_{\mathrm{cell}}\) and \(b_{\mathrm{cell}}\).

Step C: take hardware limits.

Step D: apply realistic efficiencies.

Step E: degrade for strong scaling:

\[ P_{\mathrm{gpu,eff}}=P_{\mathrm{gpu,single}}\eta_{\mathrm{scale}} \]
with typical values \(\eta_{\mathrm{scale}}\approx 0.5\text{–}0.9\).

Summary formula:

\[ P_{\mathrm{gpu,eff}} \approx \min\left( \frac{B_{\mathrm{gpu}}\eta_{\mathrm{mem}}}{b_{\mathrm{cell}}}, \frac{F_{\mathrm{gpu}}\eta_{\mathrm{comp}}}{f_{\mathrm{cell}}} \right)\eta_{\mathrm{scale}} \]
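The summary formula can be sketched end to end with the illustrative hardware numbers above plus the Section 2 scaling efficiency (\(\eta_{\mathrm{comp}}=0.5\) is also an assumption):

```python
# End-to-end P_gpu_eff sketch (hardware numbers are assumptions).
b_gpu, eta_mem, b_cell = 2.0e12, 0.7, 40.0
f_gpu, eta_comp, f_cell = 60e12, 0.5, 150.0
eta_scale = 0.75                      # Section 2 example value

p_mem = b_gpu * eta_mem / b_cell      # ~3.5e10: memory bound
p_comp = f_gpu * eta_comp / f_cell    # ~2.0e11: compute bound
p_gpu_eff = min(p_mem, p_comp) * eta_scale
print(f"{p_gpu_eff:.2e}")             # memory bound wins, degraded by scaling
```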

Section 10 — GPU-hours per Shot

\[ \mathrm{GPUh}_{\mathrm{shot}}= \frac{G_{\mathrm{shot}}t_{\mathrm{shot}}}{3600} \]

The shot is processed using \(G_{\mathrm{shot}}\) GPUs simultaneously, and \(t_{\mathrm{shot}}\) is the wall-clock runtime of that shot domain.

Section 11 — Total RTM GPU-hours

\[ \mathrm{GPUh}_{\mathrm{total}}= N_{\mathrm{shot}}\mathrm{GPUh}_{\mathrm{shot}} \]
\[ G_{\mathrm{cluster}}= \frac{\mathrm{GPUh}_{\mathrm{total}}}{U\,H_{\mathrm{wall}}} \]

where \(U\) is cluster utilization and \(H_{\mathrm{wall}}\) is campaign time target.

Because RTM shots are independent, total campaign compute is the sum of GPU-hours across all shots.

\(U\) is not a physics constant. It represents effective cluster utilization accounting for queueing delays, scheduling gaps, node failures, and workflow overhead. Typical planning values are 0.7–0.9.

Because each shot domain occupies \(G_{\mathrm{shot}}\) GPUs, the cluster is naturally sized in whole shot domains:

\[ G_{\mathrm{cluster}} \approx G_{\mathrm{shot}} \left\lceil \frac{N_{\mathrm{shot}}\,t_{\mathrm{shot}}}{3600\,U\,H_{\mathrm{wall}}} \right\rceil \]

where the ceiling term is the number of shot domains run concurrently.

Utilization is intended to absorb this scheduling granularity effect.
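Putting the whole worksheet together for the Section 1–2 example (the rounding to whole shot domains is illustrative):

```python
# End-to-end campaign sizing for the Section 1-2 example.
import math

t_shot = 1280.0        # s per shot (Section 9 example result)
g_shot = 64
n_shot = 20000
u, h_wall = 0.85, 120.0

gpuh_shot = g_shot * t_shot / 3600          # ~22.8 GPU-hours per shot
gpuh_total = n_shot * gpuh_shot             # ~4.6e5 GPU-hours
g_cluster = gpuh_total / (u * h_wall)       # raw GPU count
g_rounded = g_shot * math.ceil(g_cluster / g_shot)  # whole shot domains
print(round(gpuh_total), round(g_cluster), g_rounded)
```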