GPU development of OPM Flow has reached a significant milestone: the GPU can now outperform the CPU in certain situations, even though the GPU numerics are at an earlier stage of development than their CPU counterparts. This demonstrates the usefulness of GPU solver implementations for large-scale CO2 storage simulations.
Recent work in the ACROSS project has produced the first measured speedup of OPM Flow on a GPU over multicore CPU runs that use state-of-the-art CPR solvers. The speedup comes from a BiCGSTAB linear solver built with the CuISTL framework, which was developed during ACROSS, together with a new Diagonal ILU (DILU) preconditioner. The improvements were measured on large-scale CO2 storage simulations.
Before ACROSS, the GPU version of the linear solver in OPM Flow struggled to outperform a single CPU core. Now a consumer-grade GPU such as the NVIDIA RTX 4070 Ti using DILU outperforms the top-model AMD Ryzen 9 7950X CPU running a much stronger preconditioner on all 16 of its cores. This is a big milestone for the further parallelization and acceleration of the OPM Flow simulator. The speedup is attributed to CuISTL, which makes it possible to use the Dune Numerics Library implementation of the BiCGSTAB method, and to the DILU preconditioner written in CUDA. The DILU preconditioner uses graph coloring and row reordering of the matrix to extract parallelism and access memory efficiently.
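To illustrate how graph coloring and row reordering expose parallelism, here is a minimal sketch in Python. It uses a generic greedy coloring on a small dense matrix, which is an assumption made for readability and not necessarily the scheme used by the CUDA implementation in OPM Flow. The function names (`greedy_coloring`, `permutation_by_color`, `permute`) are hypothetical. Rows that share a color have no coupling between them, so they could be processed concurrently, and grouping them contiguously makes memory access regular.

```python
def greedy_coloring(A):
    """Assign each row the smallest color not used by a coupled row.

    Rows i and j are coupled if A[i][j] or A[j][i] is nonzero.
    """
    n = len(A)
    colors = [-1] * n
    for i in range(n):
        taken = set()
        for j in range(n):
            coupled = A[i][j] != 0 or A[j][i] != 0
            if j != i and coupled and colors[j] != -1:
                taken.add(colors[j])
        c = 0
        while c in taken:
            c += 1
        colors[i] = c
    return colors


def permutation_by_color(colors):
    """Order rows so that rows of the same color are contiguous."""
    return sorted(range(len(colors)), key=lambda i: (colors[i], i))


def permute(A, perm):
    """Apply the same permutation to rows and columns of A."""
    return [[A[p][q] for q in perm] for p in perm]


# Illustrative test matrix: tridiagonal (1D Laplacian) on 6 cells.
n = 6
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 2.0
    if i > 0:
        A[i][i - 1] = -1.0
    if i < n - 1:
        A[i][i + 1] = -1.0

colors = greedy_coloring(A)
perm = permutation_by_color(colors)
A_perm = permute(A, perm)
```

For this small example the rows alternate between two colors, and after reordering each color forms a block whose rows only couple to rows of the other color.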
The benchmark cases for the performance results are CO2 storage cases derived from SPE11C. The grids are Cartesian with the same number of cells in each spatial direction. On the consumer-grade hardware mentioned above, the scaling experiments showed that the GPU becomes faster than the CPU when the simulation contains 2 million cells or more. The largest simulation had 3 million cells, where the GPU achieved a speedup of 1.17 over the CPU per linear solve.
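As a rough aid for reading these numbers, the following sketch converts the reported figures into per-direction grid sizes and relative solve time. It assumes the cell counts are round totals and that the grid has equally many cells in all three directions. The exact grid dimensions are not stated here, so the per-direction values are only approximations. The helper names are hypothetical.

```python
def cells_per_direction(total_cells):
    """Approximate cells per direction for a grid with n x n x n cells."""
    return round(total_cells ** (1.0 / 3.0))


def time_fraction(speedup):
    """GPU solve time as a fraction of CPU solve time for a given speedup."""
    return 1.0 / speedup


n_break_even = cells_per_direction(2_000_000)  # grid size where the GPU starts to win
n_largest = cells_per_direction(3_000_000)     # largest grid in the experiments
gpu_time = time_fraction(1.17)                 # relative time per linear solve
```

Under these assumptions, a speedup of 1.17 means a GPU linear solve takes roughly 85% of the CPU time.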
This speedup is worth emphasizing not only because we now have a GPU linear solver that is faster than its CPU counterpart, but also because it uses much simpler numerics and therefore still has potential for further improvement. DILU is a generic preconditioner that uses less memory than ILU0 while offering the same degree of parallelism. The CPU code we compare against uses the much more sophisticated two-stage constrained pressure residual (CPR) preconditioner, which employs both algebraic multigrid and ILU0. The CPR preconditioner was developed and tailored specifically for reservoir simulation. Despite not yet having implemented equally sophisticated code for the GPU, we are already at a point where the GPU is competitive with the CPU on large grids.
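The memory advantage of DILU over ILU0 can be seen in a minimal scalar sketch. DILU keeps the off-diagonal entries of the original matrix A and computes only a modified diagonal D, giving the preconditioner M = (D + L) D^{-1} (D + U), where L and U are the strictly lower and upper parts of A. The code below is an illustration under simplifying assumptions (dense storage, scalar entries, serial loops), not the CuISTL implementation, and the function names are hypothetical.

```python
def dilu_diagonal(A):
    """Compute the DILU pivots d for a square matrix A (list of lists).

    Only this diagonal has to be stored in addition to A itself.
    """
    n = len(A)
    d = [A[i][i] for i in range(n)]
    for i in range(n):
        for j in range(i):
            # Update only where both A[i][j] and A[j][i] are nonzero.
            if A[i][j] != 0 and A[j][i] != 0:
                d[i] -= A[i][j] * A[j][i] / d[j]
    return d


def dilu_apply(A, d, r):
    """Solve (D + L) D^{-1} (D + U) z = r for z."""
    n = len(A)
    # Forward sweep: (D + L) y = r
    y = [0.0] * n
    for i in range(n):
        s = r[i] - sum(A[i][j] * y[j] for j in range(i))
        y[i] = s / d[i]
    # Backward sweep: (D + U) z = D y
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = y[i] - s / d[i]
    return z


# Illustrative test matrix: tridiagonal with 4 on the diagonal, -1 off it.
n = 4
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0
    if i > 0:
        A[i][i - 1] = -1.0
    if i < n - 1:
        A[i][i + 1] = -1.0

d = dilu_diagonal(A)
x_true = [1.0, 2.0, 3.0, 4.0]
r = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]
z = dilu_apply(A, d, r)
```

For a tridiagonal matrix such as this one, the DILU factorization coincides with the exact LU factorization, so `z` reproduces `x_true` up to rounding. For general sparse matrices M only approximates A, which is why DILU is used as a preconditioner inside an iterative method such as BiCGSTAB.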