Seismic HPC RTM Sizing Worksheet

Section 1 — Model Parameters

Fill these in.

| Parameter | Symbol | Example |
| --- | --- | --- |
| Model length X | \(L_x\) | 40000 m |
| Model length Y | \(L_y\) | 40000 m |
| Model depth | \(L_z\) | 12000 m |
| Minimum velocity | \(v_{\min}\) | 1500 m/s |
| Maximum velocity | \(v_{\max}\) | 4500 m/s |
| Maximum frequency | \(f_{\max}\) | 15 Hz |
| Grid points per wavelength | \(n_\lambda\) | 10 |
| CFL constant | \(C_{\mathrm{CFL}}\) | 0.45 |
| Propagation time | \(T_{\mathrm{prop}}\) | 8 s |
| Number of shots | \(N_{\mathrm{shot}}\) | 20000 |

Section 2 — Hardware Parameters

| Parameter | Symbol | Example |
| --- | --- | --- |
| RTM multiplier | \(\alpha_{\mathrm{RTM}}\) | 2.4 |
| GPU effective throughput (cell updates/s) | \(P_{\mathrm{gpu}}\) | \(6\times10^9\) |
| Strong-scaling efficiency | \(\eta_{\mathrm{scale}}\) | 0.75 |
| GPUs per shot domain | \(G_{\mathrm{shot}}\) | 64 |
| Cluster utilization | \(U\) | 0.85 |
| Target campaign time | \(H_{\mathrm{wall}}\) | 120 hours |

Section 3 — Spatial Grid

Equation:

\[ \Delta x \approx \frac{v_{\min}}{n_\lambda f_{\max}} \]

Where \(n_\lambda\) is the number of grid points per wavelength and \(v_{\min}\) is the minimum velocity in the model.

This ensures that the shortest wavelength is resolved and determines grid spacing.
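With the Section 1 example values, the rule above can be evaluated directly (a minimal sketch; variable names are illustrative):

```python
# Grid spacing from the points-per-wavelength rule (Section 1 example values).
v_min = 1500.0    # m/s, minimum velocity
f_max = 15.0      # Hz, maximum frequency
n_lambda = 10     # grid points per wavelength

dx = v_min / (n_lambda * f_max)
print(dx)  # 10.0 m
```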

Section 4 — Number of Cells

The number of grid cells is determined by the physical model dimensions and the spatial grid spacing.

\[ N_{\mathrm{cell}}=\frac{L_xL_yL_z}{\Delta x^3} \]

where \(L_x\), \(L_y\), and \(L_z\) are the physical model dimensions and \(\Delta x\) is the spatial grid spacing.

Frequency scaling

Grid spacing is chosen to resolve the shortest wavelength in the model:

\[ \Delta x \approx \frac{v_{\min}}{n_\lambda f_{\max}} \]

Substituting this into the cell-count expression gives

\[ N_{\mathrm{cell}} = K_x f_{\max}^3 \qquad\text{where}\qquad K_x=\frac{L_xL_yL_zn_\lambda^3}{v_{\min}^3} \]

So \(N_{\mathrm{cell}}\propto f_{\max}^3\) for a fixed physical model size.

Padded computational domain

In practice, the solver operates on a larger computational grid than the physical model: absorbing boundary layers (PML or sponge zones) must surround the physical domain, and the top and bottom of the model may need different treatment (free surface versus absorber).

Let the padded computational dimensions be

\[ L_x^{\mathrm{comp}} = L_x + 2L_{\mathrm{PML},x},\qquad L_y^{\mathrm{comp}} = L_y + 2L_{\mathrm{PML},y},\qquad L_z^{\mathrm{comp}} = L_z + L_{\mathrm{PML,top}} + L_{\mathrm{PML,bot}} \]

The true number of grid cells used by the simulation is then

\[ N_{\mathrm{cell,true}}= \frac{L_x^{\mathrm{comp}}L_y^{\mathrm{comp}}L_z^{\mathrm{comp}}}{\Delta x^3} \]

Determining absorber thickness

Absorbing boundary layers are usually specified in grid points rather than physical distance.

\[ L_{\mathrm{PML}} = N_{\mathrm{PML}}\Delta x \]

Typical values in production wave-equation solvers are on the order of 20–40 grid cells per absorbing boundary, depending on absorber formulation and reflection tolerance requirements.

The \(f_{\max}^3\) scaling law describes the interior model scaling. Boundary padding slightly increases the total cell count beyond the interior \(f_{\max}^3\) scaling, so actual simulations should compute the cell count using the padded dimensions above rather than the asymptotic scaling rule alone.
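A hedged numeric sketch of the padding effect, assuming \(\Delta x = 10\) m and 30-cell absorbing pads on every face (equal top and bottom pads for simplicity; all values illustrative):

```python
# Interior vs. padded cell counts for the Section 1 example model,
# assuming dx = 10 m and 30-cell absorbing pads on every face
# (equal top/bottom pads for simplicity; all values illustrative).
dx = 10.0
n_pml = 30
pad = n_pml * dx                       # 300 m of padding per face

Lx, Ly, Lz = 40000.0, 40000.0, 12000.0
n_interior = Lx * Ly * Lz / dx**3
n_true = (Lx + 2 * pad) * (Ly + 2 * pad) * (Lz + 2 * pad) / dx**3

# For a model this large the padding overhead is modest (~8%); smaller
# models or thicker absorbers push it toward the 15-30% napkin range.
print(f"{n_interior:.3e} {n_true:.3e} {n_true / n_interior:.3f}")
```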

One-line practical formula

For a uniform grid, if you know the model volume, minimum velocity, target frequency, and points per wavelength, collapse the spatial constant and work in km\(^3\):

\[ N_{\mathrm{cell}} \approx \left(\frac{n_\lambda}{v_{\min}}\right)^3 \left(10^9\right) V_{\mathrm{km}^3} f_{\max}^3 \]

\(10^9\) comes from \(1~\mathrm{km}^3 = 10^9~\mathrm{m}^3\), where \(V_{\mathrm{km}^3}\) is model volume in km\(^3\), \(v_{\min}\) is in m/s, and \(f_{\max}\) is in Hz.

Back-of-the-napkin size and padding correction shortcuts, assuming \(n_\lambda\approx 10\) and \(v_{\min}\approx 1500~\mathrm{m/s}\):

\[ N_{\mathrm{cell,true}}\approx (1.15\text{ to }1.30)N_{\mathrm{cell}} \qquad\text{where}\qquad N_{\mathrm{cell}}\approx 300\,V_{\mathrm{km}^3}f_{\max}^3 \]
\[ \text{and, anticipating Sections 6–7:}\quad W_{\mathrm{prop}} \propto N_{\mathrm{cell}}\,f_{\max} \propto f_{\max}^4 \]
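A quick check of the napkin formula against the Section 1 example (\(V = 40\times 40\times 12~\mathrm{km}^3\)):

```python
# Napkin estimate vs. exact interior count for the example model.
V_km3 = 40 * 40 * 12          # 19200 km^3
f_max = 15.0
n_cell_napkin = 300 * V_km3 * f_max**3
n_cell_exact = 40000.0 * 40000.0 * 12000.0 / 10.0**3   # dx = 10 m
print(f"{n_cell_napkin:.3e} {n_cell_exact:.3e}")  # ~1.94e10 vs 1.92e10
```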

Section 5 — Time Step

CFL constant:

\[ C_{\mathrm{CFL}}=\frac{v_{\max}\Delta t}{\Delta x} \]

where \(v_{\max}\) is maximum propagation velocity, \(\Delta t\) is the time step, and \(\Delta x\) is spatial grid spacing.

\[ \Delta t \le \frac{C_{\mathrm{CFL}}\Delta x}{v_{\max}} \]

Production codes usually choose close to the limit for efficiency:

\[ \Delta t \approx \frac{C_{\mathrm{CFL}}\Delta x}{v_{\max}} \]

Because the time step scales with grid spacing, \(\Delta t\propto \Delta x\). Since grid spacing decreases as maximum frequency increases, \(\Delta x\propto 1/f_{\max}\), so the number of time steps required for a fixed propagation time scales as \(N_t\propto f_{\max}\).

Section 6 — Number of Time Steps

\[ N_t=\frac{T_{\mathrm{prop}}}{\Delta t} \]

Scaling:

\[ N_t\propto f_{\max} \]

This is correct only because \(\Delta x\propto 1/f_{\max}\) and \(\Delta t\propto \Delta x\) from the CFL condition above.

To make the slope explicit, since it depends on model physics and numerics:

\[ N_t = \frac{T_{\mathrm{prop}}n_\lambda v_{\max}}{C_{\mathrm{CFL}}v_{\min}}\,f_{\max} \;\Rightarrow\; N_t = K_t f_{\max} \]
\[ K_t = \frac{T_{\mathrm{prop}}n_\lambda v_{\max}}{C_{\mathrm{CFL}}v_{\min}} \]

So higher frequency yields more time steps linearly, but the slope depends on model physics and numerical methods.
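The slope can be verified numerically with the example values (a sketch):

```python
# Time step and step count for the example (Sections 1, 5, 6).
v_min, v_max = 1500.0, 4500.0
f_max, n_lambda = 15.0, 10
c_cfl, t_prop = 0.45, 8.0

dx = v_min / (n_lambda * f_max)     # 10 m
dt = c_cfl * dx / v_max             # 0.001 s
n_t = t_prop / dt                   # 8000 steps

# Cross-check against the closed form N_t = K_t * f_max.
k_t = t_prop * n_lambda * v_max / (c_cfl * v_min)
print(dt, n_t, k_t * f_max)
```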

Section 7 — Wavefield Work per Propagation

\[ W_{\mathrm{prop}} = N_{\mathrm{cell}}N_t \]

This counts cell updates, not FLOPs.

Spatial discretization: \(N_{\mathrm{cell}}\propto f_{\max}^3\)

Temporal discretization: \(N_t\propto f_{\max}\)

Scaling:

\[ W_{\mathrm{prop}}\propto f_{\max}^4 \]

This is the core seismic HPC scaling law for uniform 3D discretization with fixed physical model size.

\[ W_{\mathrm{prop}} = K_xK_tf_{\max}^4 \]
\[ K_x=\frac{L_xL_yL_zn_\lambda^3}{v_{\min}^3}, \qquad K_t=\frac{T_{\mathrm{prop}}n_\lambda v_{\max}}{C_{\mathrm{CFL}}v_{\min}} \]
\[ \Rightarrow W_{\mathrm{prop}}= \frac{L_xL_yL_zT_{\mathrm{prop}}n_\lambda^4v_{\max}}{C_{\mathrm{CFL}}v_{\min}^4} f_{\max}^4 \]

\(W_{\mathrm{prop}}\) counts total cell updates and is therefore dimensionless; the apparent time units in the constants cancel when combined with \(f_{\max}^4\).

These constants assume explicit time stepping and uniform grid spacing determined by minimum velocity and a fixed points-per-wavelength rule. Adaptive and anisotropic grids invalidate this simple constant-coefficient scaling model.
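For the example model, the work can be computed two equivalent ways, directly and via the closed form (a sketch with Section 1 values):

```python
# Work per propagation for the example, two equivalent ways (Section 7).
v_min, v_max = 1500.0, 4500.0
f_max, n_lambda = 15.0, 10
c_cfl, t_prop = 0.45, 8.0
Lx, Ly, Lz = 40000.0, 40000.0, 12000.0

n_cell = 1.92e10                     # interior cells (Section 4)
n_t = 8000                           # time steps (Section 6)
w_direct = n_cell * n_t

k_x = Lx * Ly * Lz * n_lambda**3 / v_min**3
k_t = t_prop * n_lambda * v_max / (c_cfl * v_min)
w_closed = k_x * k_t * f_max**4
print(f"{w_direct:.3e} {w_closed:.3e}")   # both ~1.5e14 cell updates
```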

Section 8 — RTM Work per Shot

Define an RTM multiplier as an estimate:

\[ \alpha_{\mathrm{RTM}} \approx 2.4 \]

Derived from:

\[ \alpha_{\mathrm{fwd}}\approx 1.0,\qquad \alpha_{\mathrm{bwd}}\approx 1.0,\qquad \alpha_{\mathrm{img}}\approx 0.1\text{–}0.2,\qquad \alpha_{\mathrm{overhead}}\approx 0.1\text{–}0.3 \]

Or better yet:

\[ \alpha_{\mathrm{RTM}}= \frac{\text{measured end-to-end RTM work per shot}}{N_{\mathrm{cell}}N_t} \]

Estimation methodology:

\[ \alpha_{\mathrm{RTM}}= \alpha_{\mathrm{fwd}}+\alpha_{\mathrm{bwd}}+\alpha_{\mathrm{img}}+\alpha_{\mathrm{ovh}} \]

This multiplier depends on stencil order, physics, memory strategy, checkpointing method, imaging condition complexity, communication overhead, GPU occupancy, memory bandwidth behavior, and how your code defines “cell updates/s”.

Important: If \(P_{\mathrm{gpu}}\) already comes from an end-to-end RTM benchmark, do not apply a separate \(\alpha_{\mathrm{RTM}}\) multiplier or the cost will be double-counted.

Then:

\[ W_{\mathrm{rtm,shot}}=\alpha_{\mathrm{RTM}}N_{\mathrm{cell}}N_t \qquad\text{if } P_{\mathrm{gpu}} \text{ is cell updates/s for one stencil propagation} \]
\[ W_{\mathrm{rtm,shot}}=N_{\mathrm{cell}}N_t \qquad\text{if } P_{\mathrm{gpu}} \text{ already means effective RTM cell updates/s} \]

Section 9 — RTM Runtime per Shot

\[ t_{\mathrm{shot}}= \frac{W_{\mathrm{rtm,shot}}}{G_{\mathrm{shot}}P_{\mathrm{gpu,eff}}} \]

in seconds of wall-clock time per shot.
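Putting Sections 7–9 together for the example values, and assuming \(P_{\mathrm{gpu}}\) is a single-propagation rate so the multiplier applies (per the double-counting caveat in Section 8):

```python
# RTM runtime per shot for the example (Sections 8-9), assuming P_gpu is a
# single-propagation rate so the alpha multiplier applies.
alpha_rtm = 2.4
w_prop = 1.536e14            # cell updates per propagation (Section 7)
g_shot = 64                  # GPUs per shot domain
p_gpu = 6e9                  # cell updates/s per GPU, single-GPU rate
eta_scale = 0.75             # strong-scaling efficiency

w_rtm_shot = alpha_rtm * w_prop
p_eff = p_gpu * eta_scale
t_shot = w_rtm_shot / (g_shot * p_eff)
print(t_shot)                # 1280.0 seconds, roughly 21 minutes per shot
```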

Notes on \(P_{\mathrm{gpu}}\)

\(P_{\mathrm{gpu}}\) is effective cell updates per second per GPU and will be bound by one of three things:

\[ P_{\mathrm{gpu}} \approx \min(P_{\mathrm{mem}},P_{\mathrm{comp}},P_{\mathrm{comm}}) \]
\[ \begin{aligned} P_{\mathrm{mem}} &= \text{memory-bandwidth-limited throughput} \\ P_{\mathrm{comp}} &= \text{FLOP-limited throughput} \\ P_{\mathrm{comm}} &= \text{communication-limited throughput} \end{aligned} \]

The core method is:

  1. Estimate bytes per cell update.
  2. Estimate FLOPs per cell update.
  3. Compare those with GPU hardware limits.
  4. Apply efficiency factors for real execution losses.
\[ P_{\mathrm{gpu}} \approx \min\left( \frac{B_{\mathrm{gpu}}\eta_{\mathrm{mem}}}{b_{\mathrm{cell}}}, \frac{F_{\mathrm{gpu}}\eta_{\mathrm{comp}}}{f_{\mathrm{cell}}}, P_{\mathrm{comm}} \right) \]

Memory-bandwidth estimate

\[ P_{\mathrm{mem}}=\frac{B_{\mathrm{gpu}}\eta_{\mathrm{mem}}}{b_{\mathrm{cell}}} \]

You need to estimate \(b_{\mathrm{cell}}\), the effective bytes transferred per cell update.

For a simple acoustic finite-difference time step that touches roughly four 4-byte floating-point values per cell (say, two wavefield time levels, the velocity model, and the output write), a naive estimate might be:

\[ b_{\mathrm{cell,naive}}\approx 4\times 4=16~\text{bytes/cell} \]

For planning, a more realistic acoustic estimate is often:

\[ b_{\mathrm{cell}}\approx 24\text{ to }80~\text{bytes/cell update} \]
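A hedged numeric sketch, assuming roughly 2 TB/s of HBM bandwidth, 70% achievable bandwidth, and a mid-range 40 bytes per cell update (all values illustrative, not measured):

```python
# Memory-bandwidth-limited throughput sketch (illustrative hardware values).
b_gpu = 2.0e12        # bytes/s peak memory bandwidth (assumed)
eta_mem = 0.7         # achievable fraction of peak (assumed)
b_cell = 40.0         # effective bytes per cell update (mid-range from text)

p_mem = b_gpu * eta_mem / b_cell
print(f"{p_mem:.1e}")  # ~3.5e10 cell updates/s upper bound per GPU
```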

Compute-limited estimate

\[ P_{\mathrm{comp}}=\frac{F_{\mathrm{gpu}}\eta_{\mathrm{comp}}}{f_{\mathrm{cell}}} \]

For an acoustic stencil, \(f_{\mathrm{cell}}\) might be on the order of

\[ 50\text{ to }300~\text{FLOPs/update} \]

Arithmetic intensity shortcut

\[ I=\frac{f_{\mathrm{cell}}}{b_{\mathrm{cell}}} \qquad\text{and}\qquad I_{\mathrm{balance}}=\frac{F_{\mathrm{gpu}}}{B_{\mathrm{gpu}}} \]
If \(I < I_{\mathrm{balance}}\), the kernel is memory-bound; otherwise it is compute-bound.
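Applying the shortcut with illustrative hardware numbers (roughly 60 TFLOP/s FP32 and 2 TB/s bandwidth, both assumptions) and the mid-range per-cell costs from the text:

```python
# Roofline-style bound check (hardware numbers are assumptions).
f_gpu = 60e12         # FLOP/s peak (assumed)
b_gpu = 2.0e12        # bytes/s peak bandwidth (assumed)
f_cell = 150.0        # FLOPs per cell update (mid-range from text)
b_cell = 40.0         # bytes per cell update (mid-range from text)

intensity = f_cell / b_cell          # 3.75 FLOP/byte
balance = f_gpu / b_gpu              # 30 FLOP/byte
print(intensity < balance)           # True: the stencil is memory-bound
```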

Communication-limited estimate

When one shot domain is split across multiple GPUs, halo exchange can become the real limit.

\[ V_{\mathrm{halo}}\approx 2h(n_yn_z+n_xn_z+n_xn_y)n_{\mathrm{fields}}b \]

where \(h\) is the halo width in cells, \(n_x,n_y,n_z\) are the local subdomain dimensions, \(n_{\mathrm{fields}}\) is the number of exchanged wavefields, and \(b\) is bytes per value.
\[ t_{\mathrm{comm}}\approx \frac{V_{\mathrm{halo}}}{B_{\mathrm{net,eff}}} \qquad t_{\mathrm{comp}}\approx \frac{n_xn_yn_z}{P_{\mathrm{gpu,single}}} \]
\[ P_{\mathrm{comm}}\approx \frac{N_{\mathrm{local\ cells}}}{t_{\mathrm{comp}}+t_{\mathrm{comm}}-t_{\mathrm{overlap}}} \]
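A worked sketch for one GPU's subdomain, assuming a \(512^3\) local block, halo width 4, two FP32 wavefields, 25 GB/s effective network bandwidth, and no compute/communication overlap (all values illustrative):

```python
# Communication-limited throughput sketch for one GPU's subdomain
# (all values illustrative; no compute/comm overlap assumed).
nx = ny = nz = 512
h, n_fields, b = 4, 2, 4              # halo width, fields, bytes per value
b_net_eff = 25e9                      # bytes/s effective network (assumed)
p_single = 6e9                        # cell updates/s, single GPU

v_halo = 2 * h * (ny * nz + nx * nz + nx * ny) * n_fields * b
t_comm = v_halo / b_net_eff
t_comp = nx * ny * nz / p_single
p_comm = nx * ny * nz / (t_comp + t_comm)
print(f"{p_comm:.2e}")                # below p_single: halos cost throughput
```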

The practical analytic workflow

Estimate \(P_{\mathrm{gpu}}\) without measurements.

Step A: define the kernel precisely.

Step B: estimate per-cell work \(f_{\mathrm{cell}}\) and \(b_{\mathrm{cell}}\).

Step C: take hardware limits.

Step D: apply realistic efficiencies.

Step E: degrade for strong scaling:

\[ P_{\mathrm{gpu,eff}}=P_{\mathrm{gpu,single}}\eta_{\mathrm{scale}} \]
with typical values \(\eta_{\mathrm{scale}}\approx 0.5\text{–}0.9\).

Summary formula:

\[ P_{\mathrm{gpu,eff}} \approx \min\left( \frac{B_{\mathrm{gpu}}\eta_{\mathrm{mem}}}{b_{\mathrm{cell}}}, \frac{F_{\mathrm{gpu}}\eta_{\mathrm{comp}}}{f_{\mathrm{cell}}} \right)\eta_{\mathrm{scale}} \]
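The summary formula can be sketched end to end with the illustrative hardware numbers above plus the Section 2 scaling efficiency (\(\eta_{\mathrm{comp}}=0.5\) is also an assumption):

```python
# End-to-end P_gpu_eff sketch (hardware numbers are assumptions).
b_gpu, eta_mem, b_cell = 2.0e12, 0.7, 40.0
f_gpu, eta_comp, f_cell = 60e12, 0.5, 150.0
eta_scale = 0.75                      # Section 2 example value

p_mem = b_gpu * eta_mem / b_cell      # ~3.5e10: memory bound
p_comp = f_gpu * eta_comp / f_cell    # ~2.0e11: compute bound
p_gpu_eff = min(p_mem, p_comp) * eta_scale
print(f"{p_gpu_eff:.2e}")             # memory bound wins, degraded by scaling
```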

Section 10 — GPU-hours per Shot

\[ \mathrm{GPUh}_{\mathrm{shot}}= \frac{G_{\mathrm{shot}}t_{\mathrm{shot}}}{3600} \]

The shot is processed using \(G_{\mathrm{shot}}\) GPUs simultaneously, and \(t_{\mathrm{shot}}\) is the wall-clock runtime of that shot domain.

Section 11 — Total RTM GPU-hours

\[ \mathrm{GPUh}_{\mathrm{total}}= N_{\mathrm{shot}}\mathrm{GPUh}_{\mathrm{shot}} \]
\[ G_{\mathrm{cluster}}= \frac{\mathrm{GPUh}_{\mathrm{total}}}{U\,H_{\mathrm{wall}}} \]

where \(U\) is cluster utilization and \(H_{\mathrm{wall}}\) is campaign time target.

Because RTM shots are independent, total campaign compute is the sum of GPU-hours across all shots.

\(U\) is not a physics constant. It represents effective cluster utilization accounting for queueing delays, scheduling gaps, node failures, and workflow overhead. Typical planning values are 0.7–0.9.

Because each shot domain occupies \(G_{\mathrm{shot}}\) GPUs, the cluster is naturally sized in whole shot domains:

\[ G_{\mathrm{cluster}} \approx G_{\mathrm{shot}} \left\lceil \frac{N_{\mathrm{shot}}\,t_{\mathrm{shot}}}{3600\,U\,H_{\mathrm{wall}}} \right\rceil \]

where the ceiling term is the number of shot domains run concurrently.

Utilization is intended to absorb this scheduling granularity effect.
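Putting the whole worksheet together for the Section 1–2 example (the rounding to whole shot domains is illustrative):

```python
# End-to-end campaign sizing for the Section 1-2 example.
import math

t_shot = 1280.0        # s per shot (Section 9 example result)
g_shot = 64
n_shot = 20000
u, h_wall = 0.85, 120.0

gpuh_shot = g_shot * t_shot / 3600          # ~22.8 GPU-hours per shot
gpuh_total = n_shot * gpuh_shot             # ~4.6e5 GPU-hours
g_cluster = gpuh_total / (u * h_wall)       # raw GPU count
g_rounded = g_shot * math.ceil(g_cluster / g_shot)  # whole shot domains
print(round(gpuh_total), round(g_cluster), g_rounded)
```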