rocALUTION Documentation¶
rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism on top of AMD's Radeon Open Compute (ROCm) runtime and toolchains, targeting modern CPU and GPU platforms. Based on C++ and HIP, it provides a portable, generic and flexible design that allows seamless integration with other scientific software packages.
In the following, three separate chapters are available:
User Manual: This is the manual of rocALUTION. It can be seen as a starting guide for new users but also a reference book for more experienced users.
Design Documentation: The Design Document is targeted at advanced users and developers who want to understand, modify or extend the functionality of the rocALUTION library. To embed rocALUTION into your project, it is not required to read the Design Document.
API: This is a list of API functions provided by rocALUTION.
User Manual¶
Introduction¶
Overview¶
rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism, targeting modern processors and accelerators including multi/many-core CPU and GPU platforms. The main goal of this package is to provide a portable library for iterative sparse methods on state-of-the-art hardware. rocALUTION can be seen as middleware between different parallel backends and application specific packages.
The major features and characteristics of the library are:
- Various backends
Host - fallback backend, designed for CPUs
GPU/HIP - accelerator backend, designed for HIP capable AMD GPUs
OpenMP - designed for multi-core CPUs
MPI - designed for multi-node and multi-GPU configurations
- Easy to use
The syntax and structure of the library provide easy learning curves. With the help of the examples, anyone can try out the library - no knowledge of HIP, OpenMP or MPI programming is required.
- No special hardware requirements
There are no hardware requirements to install and run rocALUTION. If a GPU device and HIP are available, the library will use them.
- Variety of iterative solvers
Fixed-Point iteration - Jacobi, Gauss-Seidel, Symmetric Gauss-Seidel, SOR and SSOR
Krylov subspace methods - CR, CG, BiCGStab, BiCGStab(l), GMRES, IDR, QMRCGSTAB, Flexible CG/GMRES
Mixed-precision defect-correction scheme
Chebyshev iteration
Multiple MultiGrid schemes, geometric and algebraic
- Various preconditioners
Matrix splitting - Jacobi, (Multi-colored) Gauss-Seidel, Symmetric Gauss-Seidel, SOR, SSOR
Factorization - ILU(0), ILU(p) (based on levels), ILU(p,q) (power(q)-pattern method), Multi-Elimination ILU (nested/recursive), ILUT (based on threshold) and IC(0)
Approximate Inverse - Chebyshev matrix-valued polynomial, SPAI, FSAI and TNS
Diagonal-based preconditioner for Saddle-point problems
Block-type of sub-preconditioners/solvers
Additive Schwarz and Restricted Additive Schwarz
Variable type preconditioners
- Generic and robust design
rocALUTION is based on a generic and robust design that allows expansion towards new solvers and preconditioners and support for various hardware types. Furthermore, the design of the library allows the use of all solvers as preconditioners in other solvers. For example, you can easily define a CG solver with a Multi-Elimination preconditioner, where the last block is preconditioned with another Chebyshev iteration method, which in turn is preconditioned with a multi-colored Symmetric Gauss-Seidel scheme.
- Portable code and results
All code based on rocALUTION is portable and independent of HIP or OpenMP. The code will compile and run everywhere. All solvers and preconditioners are based on a single source code, which delivers portable results across all supported backends (variations are possible due to different rounding modes on the hardware). The only difference you will see when changing hardware is the performance.
- Support for several sparse matrix formats
Compressed Sparse Row (CSR), Modified Compressed Sparse Row (MCSR), Dense (DENSE), Coordinate (COO), ELL, Diagonal (DIA), Hybrid format of ELL and COO (HYB).
The code is open source under the MIT license (see License) and hosted on the rocALUTION GitHub page.
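To give a feel for the API described above, the sketch below sets up a CG solver with a simple Jacobi preconditioner, following the pattern used in the samples shipped with the library (a sketch only, not a verified program; more elaborate nestings such as Multi-Elimination follow the same SetPreconditioner() pattern):

```cpp
#include <rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    init_rocalution();

    LocalMatrix<double> mat;
    LocalVector<double> x, rhs;

    // Read the system matrix from a MatrixMarket file
    mat.ReadFileMTX(argv[1]);

    x.Allocate("x", mat.GetN());
    rhs.Allocate("rhs", mat.GetM());
    x.Zeros();
    rhs.Ones();

    // CG solver, preconditioned with Jacobi
    CG<LocalMatrix<double>, LocalVector<double>, double> ls;
    Jacobi<LocalMatrix<double>, LocalVector<double>, double> p;

    ls.SetOperator(mat);
    ls.SetPreconditioner(p);
    ls.Build();

    ls.Solve(rhs, &x);

    stop_rocalution();

    return 0;
}
```

Replacing Jacobi with any other solver or preconditioner object is enough to obtain the nested configurations described above.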
License¶
rocALUTION is distributed as opensource under the following license:
MIT License
Copyright (c) 2018 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Building and Installing¶
Installing from AMD ROCm repository¶
rocALUTION can be installed from the AMD ROCm repository. The repository hosts the single-node, accelerator enabled version of the library. If a different setup is required, e.g. multi-node support, rocALUTION needs to be built from source, see Building from GitHub repository.
For detailed instructions on how to set up ROCm on different platforms, see the AMD ROCm Platform Installation Guide for Linux.
rocALUTION has the following runtime dependencies
Building from GitHub repository¶
To build rocALUTION from source, the following compile-time and runtime dependencies must be met:
- CMake 3.5 or later
- AMD ROCm 2.9 or later (optional, for HIP support)
- rocSPARSE (optional, for HIP support)
- rocBLAS (optional, for HIP support)
- rocPRIM (optional, for HIP support)
- OpenMP (optional, for OpenMP support)
- MPI (optional, for multi-node / multi-GPU support)
- googletest (optional, for clients)
Download rocALUTION¶
The rocALUTION source code is available at the rocALUTION GitHub page. Download the master branch using:
$ git clone -b master https://github.com/ROCmSoftwarePlatform/rocALUTION.git
$ cd rocALUTION
Below are steps to build different packages of the library, including dependencies and clients. It is recommended to install rocALUTION using the install.sh script.
Using install.sh script to build rocALUTION with dependencies¶
The following table lists common uses of install.sh to build dependencies + library. Accelerator support via HIP and OpenMP will be enabled by default, whereas MPI is disabled.
Command
Description

./install.sh -h
Print help information.

./install.sh -d
Build dependencies and library in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies.

./install.sh
Build library in your local directory. It is assumed dependencies are available.

./install.sh -i
Build library, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users.

./install.sh --host
Build library in your local directory without HIP support. It is assumed dependencies are available.

./install.sh --mpi=<dir>
Build library in your local directory with HIP and MPI support. It is assumed dependencies are available.
Using install.sh script to build rocALUTION with dependencies and clients¶
The client contains example code, unit tests and benchmarks. Common uses of install.sh to build them are listed in the table below.
Command
Description

./install.sh -h
Print help information.

./install.sh -dc
Build dependencies, library and client in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies.

./install.sh -c
Build library and client in your local directory. It is assumed dependencies are available.

./install.sh -idc
Build library, dependencies and client, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users.

./install.sh -ic
Build library and client, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users.
Using individual commands to build rocALUTION¶
CMake 3.5 or later is required in order to build rocALUTION without the use of install.sh.
rocALUTION can be built with cmake using the following commands:
# Create and change to build directory
mkdir -p build/release ; cd build/release

# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path>
# to adjust it. In this case, rocALUTION is built with HIP and
# OpenMP support.
# MPI support is disabled.
cmake ../.. -DSUPPORT_HIP=ON \
            -DSUPPORT_MPI=OFF \
            -DSUPPORT_OMP=ON

# Compile rocALUTION library
make -j$(nproc)

# Install rocALUTION to /opt/rocm
sudo make install
GoogleTest is required in order to build all rocALUTION clients.
rocALUTION with dependencies and clients can be built using the following commands:
# Install googletest
mkdir -p build/release/deps ; cd build/release/deps
cmake ../../../deps
sudo make -j$(nproc) install

# Change to build directory
cd ..

# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path>
# to adjust it. By default, HIP and OpenMP support are enabled,
# MPI support is disabled.
cmake ../.. -DBUILD_CLIENTS_TESTS=ON \
            -DBUILD_CLIENTS_SAMPLES=ON

# Compile rocALUTION library
make -j$(nproc)

# Install rocALUTION to /opt/rocm
sudo make install
The compilation process produces the shared library librocalution.so and, if HIP support is enabled, librocalution_hip.so. Ensure that the library objects can be found in your library path. If you do not copy the library to a specific location, you can, under Linux, add the path to the LD_LIBRARY_PATH environment variable.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_rocalution>
Common build problems¶
- Issue: Could not find a package configuration file provided by "ROCM" with any of the following names:
ROCMConfig.cmake rocm-config.cmake
Solution: Install the ROCm CMake modules either from source or from the AMD ROCm repository.
- Issue: Could not find a package configuration file provided by "ROCSPARSE" with any of the following names:
ROCSPARSEConfig.cmake rocsparse-config.cmake
Solution: Install rocSPARSE either from source or from the AMD ROCm repository.
- Issue: Could not find a package configuration file provided by "ROCBLAS" with any of the following names:
ROCBLASConfig.cmake rocblas-config.cmake
Solution: Install rocBLAS either from source or from the AMD ROCm repository.
Simple Test¶
You can test the installation by running a CG solver on a sparse matrix. After successfully compiling the library, the CG solver example can be executed.
cd rocALUTION/build/release/clients/staging
wget ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/laplace/gr_30_30.mtx.gz
gzip -d gr_30_30.mtx.gz
./cg gr_30_30.mtx
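For readers who want to see what the cg client computes algorithmically, below is a minimal, self-contained CG iteration on a small 1D Laplacian (a stand-in chosen here instead of gr_30_30; this sketch has no rocALUTION dependency and only illustrates the method):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// y = A * x for the 1D Laplacian stiffness matrix tridiag(-1, 2, -1),
// used here as a small symmetric positive definite test operator
std::vector<double> laplacian_matvec(const std::vector<double>& x)
{
    std::size_t n = x.size();
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        y[i] = 2.0 * x[i];
        if (i > 0)     y[i] -= x[i - 1];
        if (i + 1 < n) y[i] -= x[i + 1];
    }
    return y;
}

// Unpreconditioned CG for A * x = b with zero initial guess;
// returns the number of iterations performed
int cg_solve(std::vector<double>& x, const std::vector<double>& b,
             double tol, int max_iter)
{
    std::size_t n = b.size();
    x.assign(n, 0.0);

    std::vector<double> r = b; // residual r = b - A * 0
    std::vector<double> p = r; // search direction

    double rho = 0.0;
    for (double v : r) rho += v * v;

    int iter = 0;
    while (iter < max_iter && std::sqrt(rho) > tol)
    {
        std::vector<double> Ap = laplacian_matvec(p);

        double pAp = 0.0;
        for (std::size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];

        double alpha = rho / pAp;
        for (std::size_t i = 0; i < n; ++i)
        {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }

        double rho_new = 0.0;
        for (double v : r) rho_new += v * v;

        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + (rho_new / rho) * p[i];

        rho = rho_new;
        ++iter;
    }
    return iter;
}
```

The rocALUTION cg client performs the same iteration, but on the matrix read from the MatrixMarket file and optionally on the accelerator.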
Basics¶
Operators and Vectors¶
The main objects in rocALUTION are linear operators and vectors. All objects can be moved to an accelerator at run time. The linear operators are defined as local or global matrices (i.e. on a single node or distributed/multi-node) and local stencils (i.e. matrix-free linear operations). The only template parameter of the operators and vectors is the data type (ValueType). The operator data type could be float, double, complex float or complex double, while the vector data type can be int, float, double, complex float or complex double (int is used mainly for the permutation vectors). In the current version, cross ValueType object operations are not supported. Fig. 1 gives an overview of supported operators and vectors. Further details are also given in the Design Documentation.
Each of the objects contains a local copy of the hardware descriptor created by the rocalution::init_rocalution() function. This allows the user to modify it according to their needs and to obtain two or more objects with different hardware specifications (e.g. different amounts of OpenMP threads, HIP block sizes, etc.).
Local Operators and Vectors¶
By Local Operators and Vectors we refer to Local Matrices and Stencils and to Local Vectors. Local means that they stay on a single system. The system can contain several CPUs via a UMA or NUMA memory system and it can also contain an accelerator.

template <typename ValueType>
class LocalMatrix : public rocalution::Operator<ValueType>

LocalMatrix class.
A LocalMatrix is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.
A number of matrix formats are supported. These are CSR, BCSR, MCSR, COO, DIA, ELL, HYB, and DENSE.
Note
For CSR type matrices, the column indices must be sorted in increasing order. For COO matrices, the row indices must be sorted in increasing order. The function Check can be used to check whether a matrix contains valid data. For CSR and COO matrices, the function Sort can be used to sort the row or column indices, respectively.

Template Parameters:
ValueType – can be int, float, double, std::complex<float> and std::complex<double>

template <typename ValueType>
class LocalStencil : public rocalution::Operator<ValueType>

LocalStencil class.
A LocalStencil is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

Template Parameters:
ValueType – can be int, float, double, std::complex<float> and std::complex<double>

template <typename ValueType>
class LocalVector : public rocalution::Vector<ValueType>

LocalVector class.
A LocalVector is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

Template Parameters:
ValueType – can be int, float, double, std::complex<float> and std::complex<double>
Global Operators and Vectors¶
By Global Operators and Vectors we refer to Global Matrices and to Global Vectors. Global means that they can stay on a single node or on multiple nodes in a network. For this type of computation, the communication is based on MPI.

template <typename ValueType>
class GlobalMatrix : public rocalution::Operator<ValueType>

GlobalMatrix class.
A GlobalMatrix is called global, because it can stay on a single node or on multiple nodes in a network. For this type of communication, MPI is used.
A number of matrix formats are supported. These are CSR, BCSR, MCSR, COO, DIA, ELL, HYB, and DENSE.
Note
For CSR type matrices, the column indices must be sorted in increasing order. For COO matrices, the row indices must be sorted in increasing order. The function Check can be used to check whether a matrix contains valid data. For CSR and COO matrices, the function Sort can be used to sort the row or column indices, respectively.

Template Parameters:
ValueType – can be int, float, double, std::complex<float> and std::complex<double>

template <typename ValueType>
class GlobalVector : public rocalution::Vector<ValueType>

GlobalVector class.
A GlobalVector is called global, because it can stay on a single node or on multiple nodes in a network. For this type of communication, MPI is used.

Template Parameters:
ValueType – can be int, float, double, std::complex<float> and std::complex<double>
Backend Descriptor and User Control¶
Naturally, not all routines and algorithms can be performed efficiently on many-core systems (i.e. on accelerators). To provide full functionality, the library has internal mechanisms that check whether a particular routine is implemented on the accelerator. If not, the object is moved to the host and the routine is computed there. This guarantees that your code will run (although maybe not in the most efficient way) with any accelerator, regardless of the functionality available for it.
Initialization of rocALUTION¶
The body of a rocALUTION code is very simple: it should contain the header file and the namespace of the library. The program must contain an initialization call to init_rocalution, which checks and allocates the hardware, and a finalizing call to stop_rocalution, which releases the allocated hardware.

int rocalution::init_rocalution(int rank = -1, int dev_per_node = 1)
Initialize rocALUTION platform.
init_rocalution defines a backend descriptor with information about the hardware and its specifications. All objects created after that contain a copy of this descriptor. If the specifications of the global descriptor are changed (e.g. a different number of threads is set) and new objects are created, only the new objects will use the new configuration.

For control, the library provides the following functions:
set_device_rocalution() is a unified function to select a specific device. If you have compiled the library with a backend and for this backend there are several available devices, you can use this function to select a particular one. This function has to be called before init_rocalution().
set_omp_threads_rocalution() sets the number of OpenMP threads. This function has to be called after init_rocalution().
Example

#include <rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    init_rocalution();

    // ...

    stop_rocalution();

    return 0;
}
Parameters
rank – [in] specifies the MPI rank in a multi-node environment
dev_per_node – [in] number of accelerator devices per node in a multi-GPU environment

int rocalution::stop_rocalution(void)
Shutdown rocALUTION platform.
stop_rocalution shuts down the rocALUTION platform.
Threadcore Mapping¶
The number of threads used by rocALUTION can be modified with the function set_omp_threads_rocalution or by the global OpenMP environment variable (for Unix-like operating systems this is OMP_NUM_THREADS).
During the initialization phase, the library provides affinity thread-core mapping:
If the number of cores (including SMT cores) is greater than or equal to twice the number of threads, then all threads can occupy every second core ID (e.g. 0, 2, 4, ...). This avoids having two threads working on the same physical core when SMT is enabled.
If the number of threads is less than or equal to the number of cores (including SMT), and the previous clause is false, then the threads can occupy every core ID (e.g. 0, 1, 2, 3, ...).
If none of the above criteria is matched, then the default thread-core mapping is used (typically set by the operating system).
Note
The thread-core mapping is available for Unix-like operating systems only.
Note
The user can disable the thread affinity with set_omp_affinity_rocalution, before initializing the library.
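The three mapping rules above can be summarized in a small helper (purely illustrative; the function below is hypothetical and not part of the rocALUTION API):

```cpp
#include <cassert>
#include <vector>

// Illustrative helper: computes the core IDs the affinity rules above
// would assign, or an empty vector when the OS default mapping is kept.
std::vector<int> map_threads_to_cores(int num_threads, int num_cores)
{
    std::vector<int> core_ids;

    if (num_cores >= 2 * num_threads)
    {
        // Every second core ID, so two threads never share a physical core (SMT)
        for (int t = 0; t < num_threads; ++t)
        {
            core_ids.push_back(2 * t);
        }
    }
    else if (num_threads <= num_cores)
    {
        // Every core ID
        for (int t = 0; t < num_threads; ++t)
        {
            core_ids.push_back(t);
        }
    }

    // Otherwise: empty result, the operating system default mapping is used
    return core_ids;
}
```

For example, 4 threads on 8 cores would be pinned to cores 0, 2, 4, 6, while 6 threads on 8 cores would occupy cores 0 through 5.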
OpenMP Threshold Size¶
When working on a small problem, the OpenMP host backend might be slightly slower than using no OpenMP at all. This is mainly attributed to the small amount of work that every thread performs and the large overhead of forking/joining threads. This can be avoided with the OpenMP threshold size parameter in rocALUTION. The default threshold is set to 10,000, which means that all matrices under (and equal to) this size will use only one thread (disregarding the number of OpenMP threads set in the system). The threshold can be modified with set_omp_threshold_rocalution.
Accelerator Selection¶
The accelerator device id to be used for the computation can be selected by the user via set_device_rocalution.
Disable the Accelerator¶
Furthermore, the accelerator can be disabled, without having to recompile the library, by calling disable_accelerator_rocalution.
Backend Information¶
Detailed information about the current backend / accelerator in use, as well as the available accelerators, can be printed with info_rocalution.
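Putting the control functions from the sections above together, a typical initialization sequence might look as follows (a sketch; the device id and thread count are arbitrary example values):

```cpp
#include <rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    // Select the accelerator device (before init_rocalution)
    set_device_rocalution(0);

    init_rocalution();

    // Set the number of OpenMP host threads (after init_rocalution)
    set_omp_threads_rocalution(4);

    // Print information about the backend in use
    info_rocalution();

    // ... do some work

    stop_rocalution();

    return 0;
}
```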
MPI and MultiAccelerators¶
When initializing the library with MPI, the user needs to pass the rank of the MPI process as well as the number of accelerators available on each node. Basically, this way the user can specify the mapping of MPI processes to accelerators - the allocated accelerator will be rank % num_dev_per_node. Thus, the user can run two MPI processes on systems with two accelerators by setting the number of devices per node to 2, as illustrated in the example code below.
#include <rocalution.hpp>
#include <mpi.h>

using namespace rocalution;

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    int num_processes;
    int rank;

    MPI_Comm_size(comm, &num_processes);
    MPI_Comm_rank(comm, &rank);

    int nacc_per_node = 2;

    init_rocalution(rank, nacc_per_node);

    // ... do some work

    stop_rocalution();

    return 0;
}
Automatic Object Tracking¶
rocALUTION supports automatic object tracking. After the initialization of the library, all objects created by the user application can be tracked. Once stop_rocalution is called, all memory from tracked objects gets deallocated. This avoids memory leaks when objects are allocated but never freed. The user can enable or disable tracking by editing src/utils/def.hpp. By default, automatic object tracking is disabled.
Verbose Output¶
rocALUTION provides different levels of output messages. The VERBOSE_LEVEL can be modified in src/utils/def.hpp before the compilation of the library. By setting a higher level, the user will obtain more detailed information about the internal calls and data transfers to and from the accelerators. By default, VERBOSE_LEVEL is set to 2.
Verbose Output and MPI¶
To prevent all MPI processes from printing information to stdout, the default configuration is that only rank 0 outputs information. The user can change the rank or allow all ranks to print by setting LOG_MPI_RANK to -1 in src/utils/def.hpp. If file logging is enabled, all ranks write into the corresponding log files.
Debug Output¶
Debug output will print almost every detail in the program, including object constructor / destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers and preconditioners. The flag DEBUG_MODE can be set in src/utils/def.hpp. When enabled, additional assert() checks are performed during the computation. This might decrease the performance of some operations significantly.
File Logging¶
rocALUTION trace file logging can be enabled by setting the environment variable ROCALUTION_LAYER to 1. rocALUTION will then log each rocALUTION function call including object constructor / destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers and preconditioners. The log file will be placed in the working directory. The log file naming convention is rocalution-rank-<rank>-<time_since_epoch_in_msec>.log. By default, the environment variable ROCALUTION_LAYER is unset, and logging is disabled.
Note
Performance might degrade when logging is enabled.
Versions¶
For checking the rocALUTION version in an application, predefined macros can be used:
#define __ROCALUTION_VER_MAJOR // version major
#define __ROCALUTION_VER_MINOR // version minor
#define __ROCALUTION_VER_PATCH // version patch
#define __ROCALUTION_VER_TWEAK // commit id (sha1)
#define __ROCALUTION_VER_PRE // version prerelease (alpha or beta)
#define __ROCALUTION_VER // version
The final __ROCALUTION_VER holds the version number as 10000 * major + 100 * minor + patch, as defined in src/base/version.hpp.in.
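Since __ROCALUTION_VER packs the components as 10000 * major + 100 * minor + patch, the individual components can be recovered by integer division and remainder (illustrative helper, not part of rocALUTION; the input value below is an arbitrary example):

```cpp
#include <cassert>

struct RocalutionVersion
{
    int major_version;
    int minor_version;
    int patch_version;
};

// Invert the packing rule __ROCALUTION_VER = 10000 * major + 100 * minor + patch
RocalutionVersion decode_rocalution_ver(int ver)
{
    RocalutionVersion v;
    v.major_version = ver / 10000;
    v.minor_version = (ver / 100) % 100;
    v.patch_version = ver % 100;
    return v;
}
```

For example, a packed value of 10000 * 3 + 100 * 1 + 2 decodes to major 3, minor 1, patch 2.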
Singlenode Computation¶
Introduction¶
In this chapter, all base objects (matrices, vectors and stencils) for computation on a single-node (shared-memory) system are described. A typical configuration is illustrated in Fig. 2.
The compute node contains none, one or more accelerators. The compute node could be any kind of shared-memory (single, dual, quad CPU) system.
Note
The host and accelerator memory can be physically different.
ValueType¶
The value (data) type of the vectors and the matrices is defined as a template parameter. The matrix can be of type float (32-bit), double (64-bit) or complex (64/128-bit). The vector can be float (32-bit), double (64-bit), complex (64/128-bit) or int (32/64-bit). Information about the precision of the data type is reported by the rocalution::BaseRocalution::Info() function.
Complex Support¶
Currently, rocALUTION does not support complex computation.
Allocation and Free¶

void rocalution::LocalVector::Allocate(std::string name, IndexType2 size)¶
Allocate a local vector with name and size.
The local vector allocation function requires a name of the object (this is only for information purposes) and corresponding size description for vector objects.
Example

LocalVector<ValueType> vec;

vec.Allocate("my vector", 100);
vec.Clear();
 Parameters
name – [in] object name
size – [in] number of elements in the vector

virtual void rocalution::LocalVector::Clear()¶
Clear (free all data) the object.

void rocalution::LocalMatrix::AllocateCOO(const std::string &name, int nnz, int nrow, int ncol)¶

void rocalution::LocalMatrix::AllocateCSR(const std::string &name, int nnz, int nrow, int ncol)¶

void rocalution::LocalMatrix::AllocateBCSR(const std::string &name, int nnzb, int nrowb, int ncolb, int blockdim)¶

void rocalution::LocalMatrix::AllocateMCSR(const std::string &name, int nnz, int nrow, int ncol)¶

void rocalution::LocalMatrix::AllocateELL(const std::string &name, int nnz, int nrow, int ncol, int max_row)¶

void rocalution::LocalMatrix::AllocateDIA(const std::string &name, int nnz, int nrow, int ncol, int ndiag)¶

void rocalution::LocalMatrix::AllocateHYB(const std::string &name, int ell_nnz, int coo_nnz, int ell_max_row, int nrow, int ncol)¶

void rocalution::LocalMatrix::AllocateDENSE(const std::string &name, int nrow, int ncol)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();
Note
More detailed information on the additional parameters required for matrix allocation is given in Matrix Formats.

virtual void rocalution::LocalMatrix::Clear(void)¶
Clear (free all data) the object.
Matrix Formats¶
Matrices where most of the elements are equal to zero are called sparse. In most practical applications, the number of non-zero entries is proportional to the size of the matrix (e.g. typically, if the matrix \(A \in \mathbb{R}^{N \times N}\), then the number of non-zero elements is of order \(O(N)\)). To save memory, storing zero entries can be avoided by introducing a structure corresponding to the non-zero elements of the matrix. rocALUTION supports sparse CSR, MCSR, COO, ELL, DIA, HYB and dense matrices (DENSE).
Note
The functionality of every matrix object is different and depends on the matrix format. The CSR format provides the highest support for various functions. For a few operations, an internal conversion is performed, however, for many routines an error message is printed and the program is terminated.
Note
In the current version, some of the conversions are performed on the host (disregarding the actual object allocation - host or accelerator).
// Convert mat to CSR storage format
mat.ConvertToCSR();
// Perform a matrix-vector multiplication y = mat * x in CSR format
mat.Apply(x, &y);

// Convert mat to ELL storage format
mat.ConvertToELL();
// Perform a matrix-vector multiplication y = mat * x in ELL format
mat.Apply(x, &y);

// Convert mat to CSR storage format
mat.ConvertTo(CSR);
// Perform a matrix-vector multiplication y = mat * x in CSR format
mat.Apply(x, &y);

// Convert mat to ELL storage format
mat.ConvertTo(ELL);
// Perform a matrix-vector multiplication y = mat * x in ELL format
mat.Apply(x, &y);
COO storage format¶
The most intuitive sparse format is the coordinate format (COO). It represents the non-zero elements of the matrix by their coordinates and requires two index arrays (one for row and one for column indexing) and the values array. An \(m \times n\) matrix is represented by

m - number of rows (integer).
n - number of columns (integer).
nnz - number of non-zero elements (integer).
coo_val - array of nnz elements containing the non-zero matrix values (floating point).
coo_row_ind - array of nnz elements containing the row indices (integer).
coo_col_ind - array of nnz elements containing the column indices (integer).
Note
The COO matrix is expected to be sorted by row indices and column indices per row. Furthermore, each pair of indices should appear only once.
Consider the following \(3 \times 5\) matrix and the corresponding COO structures, with \(m = 3, n = 5\) and \(\text{nnz} = 8\):
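As an illustration, the sketch below builds the COO arrays for a demonstration matrix chosen here with the quoted sizes (m = 3, n = 5, nnz = 8) and performs a simple matrix-vector product over them:

```cpp
#include <cassert>
#include <vector>

// Illustrative 3 x 5 matrix with 8 non-zeros (zero-based indexing):
//
//     ( 1 2 0 3 0 )
// A = ( 0 4 5 0 0 )
//     ( 6 0 0 7 8 )
struct CooMatrix
{
    int m, n, nnz;
    std::vector<double> val;     // non-zero values
    std::vector<int>    row_ind; // row index of each non-zero
    std::vector<int>    col_ind; // column index of each non-zero
};

CooMatrix make_example_coo()
{
    return {3, 5, 8,
            {1, 2, 3, 4, 5, 6, 7, 8},   // coo_val, sorted by row, then column
            {0, 0, 0, 1, 1, 2, 2, 2},   // coo_row_ind
            {0, 1, 3, 1, 2, 0, 3, 4}};  // coo_col_ind
}

// y = A * x, accumulating one non-zero at a time
std::vector<double> coo_matvec(const CooMatrix& A, const std::vector<double>& x)
{
    std::vector<double> y(A.m, 0.0);
    for (int k = 0; k < A.nnz; ++k)
    {
        y[A.row_ind[k]] += A.val[k] * x[A.col_ind[k]];
    }
    return y;
}
```

Multiplying this matrix by a vector of ones yields the row sums (6, 9, 21).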
CSR storage format¶
One of the most popular formats in many scientific codes is the compressed sparse row (CSR) format. In this format, instead of row indices, the row offsets to the beginning of each row are stored. Thus, the elements of each row can be accessed sequentially. However, this format does not allow sequential access to the column entries. The CSR storage format represents an \(m \times n\) matrix by

m - number of rows (integer).
n - number of columns (integer).
nnz - number of non-zero elements (integer).
csr_val - array of nnz elements containing the non-zero matrix values (floating point).
csr_row_ptr - array of m+1 elements that point to the start of every row (integer).
csr_col_ind - array of nnz elements containing the column indices (integer).
Note
The CSR matrix is expected to be sorted by column indices within each row. Furthermore, each pair of indices should appear only once.
Consider the following \(3 \times 5\) matrix and the corresponding CSR structures, with \(m = 3, n = 5\) and \(\text{nnz} = 8\):
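As an illustration with a demonstration matrix chosen here (m = 3, n = 5, nnz = 8), the CSR arrays and a row-wise matrix-vector product look like this; note that csr_row_ptr has m + 1 entries:

```cpp
#include <cassert>
#include <vector>

// Illustrative 3 x 5 matrix with 8 non-zeros (zero-based indexing):
//
//     ( 1 2 0 3 0 )
// A = ( 0 4 5 0 0 )
//     ( 6 0 0 7 8 )
struct CsrMatrix
{
    int m, n, nnz;
    std::vector<int>    row_ptr; // m + 1 offsets to the start of each row
    std::vector<int>    col_ind; // column index of each non-zero
    std::vector<double> val;     // non-zero values
};

CsrMatrix make_example_csr()
{
    return {3, 5, 8,
            {0, 3, 5, 8},               // csr_row_ptr
            {0, 1, 3, 1, 2, 0, 3, 4},   // csr_col_ind, sorted within each row
            {1, 2, 3, 4, 5, 6, 7, 8}};  // csr_val
}

// y = A * x; each row is accessed sequentially via its offset range
std::vector<double> csr_matvec(const CsrMatrix& A, const std::vector<double>& x)
{
    std::vector<double> y(A.m, 0.0);
    for (int i = 0; i < A.m; ++i)
    {
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
        {
            y[i] += A.val[k] * x[A.col_ind[k]];
        }
    }
    return y;
}
```

Multiplying this matrix by a vector of ones again yields the row sums (6, 9, 21).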
ELL storage format¶
The Ellpack-Itpack (ELL) storage format can be seen as a modification of the CSR format without row offset pointers. Instead, a fixed number of elements per row is stored. It represents an \(m \times n\) matrix by

m - number of rows (integer).
n - number of columns (integer).
ell_width - maximum number of non-zero elements per row (integer)
ell_val - array of m × ell_width elements containing the matrix values (floating point).
ell_col_ind - array of m × ell_width elements containing the column indices (integer).
Note
The ELL matrix is assumed to be stored in column-major format. Rows with less than ell_width non-zero elements are padded with zeros (ell_val) and -1 (ell_col_ind).
Consider the following \(3 \times 5\) matrix and the corresponding ELL structures, with \(m = 3, n = 5\) and \(\text{ell_width} = 3\):
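The column-major layout and the -1 padding can be made concrete with a demonstration matrix chosen here (m = 3, n = 5, ell_width = 3):

```cpp
#include <cassert>
#include <vector>

// Illustrative 3 x 5 matrix (zero-based indexing), at most 3 non-zeros per row:
//
//     ( 1 2 0 3 0 )
// A = ( 0 4 5 0 0 )
//     ( 6 0 0 7 8 )
struct EllMatrix
{
    int m, n, width;             // width = maximum non-zeros per row
    std::vector<int>    col_ind; // m * width entries, column-major, -1 = padding
    std::vector<double> val;     // m * width entries, column-major, 0 = padding
};

EllMatrix make_example_ell()
{
    // The entry for row i in slot p is stored at index p * m + i
    return {3, 5, 3,
            {0, 1, 0,   1, 2, 3,   3, -1, 4},  // ell_col_ind
            {1, 4, 6,   2, 5, 7,   3,  0, 8}}; // ell_val
}

// y = A * x; padded entries (col_ind == -1) are skipped
std::vector<double> ell_matvec(const EllMatrix& A, const std::vector<double>& x)
{
    std::vector<double> y(A.m, 0.0);
    for (int i = 0; i < A.m; ++i)
    {
        for (int p = 0; p < A.width; ++p)
        {
            int idx = p * A.m + i; // column-major indexing
            int col = A.col_ind[idx];
            if (col >= 0)
            {
                y[i] += A.val[idx] * x[col];
            }
        }
    }
    return y;
}
```

Only the second row needs padding here, since it has two non-zeros while ell_width is 3.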
DIA storage format¶
If all (or most) of the non-zero entries belong to a few diagonals of the matrix, they can be stored with the corresponding offsets. The values in DIA format are stored as an array of size \(D \times N_D\), where \(D\) is the number of diagonals in the matrix and \(N_D\) is the number of elements in the main diagonal. Since not all values in this array are occupied, the entries that are not accessible are denoted with \(\ast\). They correspond to the offsets in the diagonal array (negative values represent offsets from the beginning of the array). The DIA storage format represents an \(m \times n\) matrix by

m - number of rows (integer)
n - number of columns (integer)
ndiag - number of occupied diagonals (integer)
dia_offset - array of ndiag elements containing the offsets with respect to the main diagonal (integer).
dia_val - array of m × ndiag elements containing the values (floating point).
Consider the following \(5 \times 5\) matrix and the corresponding DIA structures, with \(m = 5, n = 5\) and \(\text{ndiag} = 4\):
HYB storage format¶
The DIA and ELL formats cannot represent efficiently completely unstructured sparse matrices. To keep the memory footprint low, DIA requires the elements to belong to a few diagonals and ELL needs a fixed number of elements per row. For many applications this is a too strong restriction. A solution to this issue is to represent the more regular part of the matrix in such a format and the remaining part in COO format. The HYB format is a mixture between ELL and COO, where the maximum elements per row for the ELL part is computed by nnz/m. It represents a \(m \times n\) matrix by
m 
number of rows (integer). 
n 
number of columns (integer). 
nnz 
number of nonzero elements of the COO part (integer) 
ell_width 
maximum number of nonzero elements per row of the ELL part (integer) 
ell_val 
array of \(m \times \text{ell_width}\) elements containing the ELL part values (floating point) 
ell_col_ind 
array of \(m \times \text{ell_width}\) elements containing the ELL part column indices (integer) 
coo_val 
array of nnz elements containing the COO part values (floating point) 
coo_row_ind 
array of nnz elements containing the COO part row indices (integer) 
coo_col_ind 
array of nnz elements containing the COO part column indices (integer) 
Memory Usage¶
The memory footprint of the different matrix formats is presented in the following table, considering an \(N \times N\) matrix, where the number of non-zero entries is denoted with nnz.

Format    Structure                   Values
DENSE     -                           \(N \times N\)
COO       \(2 \times \text{nnz}\)     \(\text{nnz}\)
CSR       \(N + 1 + \text{nnz}\)      \(\text{nnz}\)
ELL       \(M \times N\)              \(M \times N\)
DIA       \(D\)                       \(D \times N_D\)
For the ELL matrix, \(M\) denotes the maximal number of non-zero elements per row; for the DIA matrix, \(D\) is the number of diagonals and \(N_D\) the size of the main diagonal.
File I/O¶

virtual void rocalution::LocalVector::ReadFileASCII(const std::string &filename)¶
Read vector from ASCII file.
Read a vector from ASCII file.
 Example
LocalVector<ValueType> vec;
vec.ReadFileASCII("my_vector.dat");
 Parameters
filename – [in] name of the file containing the ASCII data.

virtual void rocalution::LocalVector::WriteFileASCII(const std::string &filename) const¶
Write vector to ASCII file.
Write a vector to ASCII file.
 Example
LocalVector<ValueType> vec;
// Allocate and fill vec
// ...
vec.WriteFileASCII("my_vector.dat");
 Parameters
filename – [in] name of the file to write the ASCII data to.

virtual void rocalution::LocalVector::ReadFileBinary(const std::string &filename)¶
Read vector from binary file.
Read a vector from binary file. For details on the format, see WriteFileBinary().
 Example
LocalVector<ValueType> vec;
vec.ReadFileBinary("my_vector.bin");
 Parameters
filename – [in] name of the file containing the data.

virtual void rocalution::LocalVector::WriteFileBinary(const std::string &filename) const¶
Write vector to binary file.
Write a vector to binary file.
The binary format contains a header, the rocALUTION version and the vector data as follows
// Header
out << "#rocALUTION binary vector file" << std::endl;
// rocALUTION version
out.write((char*)&version, sizeof(int));
// Vector data
out.write((char*)&size, sizeof(int));
out.write((char*)vec_val, size * sizeof(double));
 Example
LocalVector<ValueType> vec;
// Allocate and fill vec
// ...
vec.WriteFileBinary("my_vector.bin");
Note
Vector values array is always stored in double precision (e.g. double or std::complex<double>).
 Parameters
filename – [in] name of the file to write the data to.

void rocalution::LocalMatrix::ReadFileMTX(const std::string &filename)¶
Read matrix from MTX (Matrix Market Format) file.
Read a matrix from Matrix Market Format file.
 Example
LocalMatrix<ValueType> mat;
mat.ReadFileMTX("my_matrix.mtx");
 Parameters
filename – [in] name of the file containing the MTX data.

void rocalution::LocalMatrix::WriteFileMTX(const std::string &filename) const¶
Write matrix to MTX (Matrix Market Format) file.
Write a matrix to Matrix Market Format file.
 Example
LocalMatrix<ValueType> mat;
// Allocate and fill mat
// ...
mat.WriteFileMTX("my_matrix.mtx");
 Parameters
filename – [in] name of the file to write the MTX data to.

void rocalution::LocalMatrix::ReadFileCSR(const std::string &filename)¶
Read matrix from CSR (rocALUTION binary format) file.
Read a CSR matrix from binary file. For details on the format, see WriteFileCSR().
 Example
LocalMatrix<ValueType> mat;
mat.ReadFileCSR("my_matrix.csr");
 Parameters
filename – [in] name of the file containing the data.

void rocalution::LocalMatrix::WriteFileCSR(const std::string &filename) const¶
Write CSR matrix to binary file.
Write a CSR matrix to binary file.
The binary format contains a header, the rocALUTION version and the matrix data as follows
// Header
out << "#rocALUTION binary csr file" << std::endl;
// rocALUTION version
out.write((char*)&version, sizeof(int));
// CSR matrix data
out.write((char*)&m, sizeof(int));
out.write((char*)&n, sizeof(int));
out.write((char*)&nnz, sizeof(int));
out.write((char*)csr_row_ptr, (m + 1) * sizeof(int));
out.write((char*)csr_col_ind, nnz * sizeof(int));
out.write((char*)csr_val, nnz * sizeof(double));
 Example
LocalMatrix<ValueType> mat;
// Allocate and fill mat
// ...
mat.WriteFileCSR("my_matrix.csr");
Note
Matrix values array is always stored in double precision (e.g. double or std::complex<double>).
 Parameters
filename – [in] name of the file to write the data to.
Access¶

ValueType &rocalution::LocalVector::operator[](int i)¶

const ValueType &rocalution::LocalVector::operator[](int i) const¶
Access operator (only for host data)
The elements in the vector can be accessed via [] operators, when the vector is allocated on the host.
 Example
// rocALUTION local vector object
LocalVector<ValueType> vec;
// Allocate vector
vec.Allocate("my_vector", 100);
// Initialize vector with 1
vec.Ones();
// Set even elements to -1
for(int i = 0; i < vec.GetSize(); i += 2)
{
    vec[i] = -1;
}
 Parameters
i – [in] access data at index i
 Returns
value at index i
Note
Accessing elements via the [] operators is slow. Use this for debugging purposes only. There is no direct access to the elements of matrices due to the sparsity structure. Matrices can be imported by a copy function. For CSR matrices, this is rocalution::LocalMatrix::CopyFromCSR() and rocalution::LocalMatrix::CopyToCSR().
// Allocate the CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];
// Fill the CSR matrix
// ...
// rocALUTION local matrix object
LocalMatrix<ValueType> mat;
// Import CSR matrix to rocALUTION
mat.AllocateCSR("my_matrix", 345, 100, 100);
mat.CopyFromCSR(csr_row_ptr, csr_col_ind, csr_val);
Raw Access to the Data¶
SetDataPtr¶
For vector and matrix objects, direct access to the raw data can be obtained via pointers. Already allocated data can be set with SetDataPtr. Setting data pointers will leave the original pointers empty.

void rocalution::LocalVector::SetDataPtr(ValueType **ptr, std::string name, int size)¶
Initialize a LocalVector on the host with externally allocated data.
SetDataPtr gives direct access to the raw data via pointers. Already allocated data can be set by passing its pointer. Example
// Allocate vector
ValueType* ptr_vec = new ValueType[200];
// Fill vector
// ...
// rocALUTION local vector object
LocalVector<ValueType> vec;
// Set the vector data, ptr_vec will become invalid
vec.SetDataPtr(&ptr_vec, "my_vector", 200);
Note
Setting the data pointer will leave the original pointer empty (set to NULL).

void rocalution::LocalMatrix::SetDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶

void rocalution::LocalMatrix::SetDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶

void rocalution::LocalMatrix::SetDataPtrMCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶

void rocalution::LocalMatrix::SetDataPtrELL(int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol, int max_row)¶

void rocalution::LocalMatrix::SetDataPtrDIA(int **offset, ValueType **val, std::string name, int nnz, int nrow, int ncol, int num_diag)¶

void rocalution::LocalMatrix::SetDataPtrDENSE(ValueType **val, std::string name, int nrow, int ncol)¶
Initialize a LocalMatrix on the host with externally allocated data.
The SetDataPtr functions give direct access to the raw data via pointers. Already allocated data can be set by passing the corresponding pointers. Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];
// Fill the CSR matrix
// ...
// rocALUTION local matrix object
LocalMatrix<ValueType> mat;
// Set the CSR matrix data, the csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
Note
Setting the data pointers will leave the original pointers empty (set to NULL).
LeaveDataPtr¶
With LeaveDataPtr, the raw data from the object can be obtained. This will leave the object empty.

void rocalution::LocalVector::LeaveDataPtr(ValueType **ptr)¶
Leave a LocalVector to host pointers.
LeaveDataPtr gives direct access to the raw data via pointers. A LocalVector object can leave its raw data to a host pointer. This will leave the LocalVector empty. Example
// rocALUTION local vector object
LocalVector<ValueType> vec;
// Allocate the vector
vec.Allocate("my_vector", 100);
// Fill vector
// ...
ValueType* ptr_vec = NULL;
// Get (steal) the data from the vector, this will leave the local vector
// object empty
vec.LeaveDataPtr(&ptr_vec);

void rocalution::LocalMatrix::LeaveDataPtrCOO(int **row, int **col, ValueType **val)¶

void rocalution::LocalMatrix::LeaveDataPtrCSR(int **row_offset, int **col, ValueType **val)¶

void rocalution::LocalMatrix::LeaveDataPtrMCSR(int **row_offset, int **col, ValueType **val)¶

void rocalution::LocalMatrix::LeaveDataPtrELL(int **col, ValueType **val, int &max_row)¶

void rocalution::LocalMatrix::LeaveDataPtrDIA(int **offset, ValueType **val, int &num_diag)¶

void rocalution::LocalMatrix::LeaveDataPtrDENSE(ValueType **val)¶
Leave a LocalMatrix to host pointers.
The LeaveDataPtr functions give direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty. Example
// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;
// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);
// Fill CSR matrix
// ...
int* csr_row_ptr = NULL;
int* csr_col_ind = NULL;
ValueType* csr_val = NULL;
// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
Note
If the object is allocated on the host, then the pointers obtained from SetDataPtr and LeaveDataPtr will be on the host. If the vector object is on the accelerator, then the data pointers will be on the accelerator.
Note
If the object is moved to and from the accelerator, then the original pointer will be invalid.
Note
Never rely on old pointers, hidden object movement to and from the accelerator will make them invalid.
Note
Whenever you pass pointers to or obtain pointers from a rocALUTION object, you need to use the same memory allocation/free functions. Please check the source code for details (src/utils/allocate_free.cpp for the host and src/base/hip/hip_allocate_free.cpp for HIP).
Copy CSR Matrix Host Data¶

void rocalution::LocalMatrix::CopyFromHostCSR(const int *row_offset, const int *col, const ValueType *val, const std::string &name, int nnz, int nrow, int ncol)¶
Allocates and copies (imports) a host CSR matrix.
If the CSR matrix data pointers are only accessible as constant, the user can create a LocalMatrix object and pass const CSR host pointers. The LocalMatrix will then be allocated and the data will be copied to the backend where the object is located.
 Parameters
row_offset – [in] CSR matrix row offset pointers.
col – [in] CSR matrix column indices.
val – [in] CSR matrix values array.
name – [in] Matrix object name.
nnz – [in] Number of nonzero elements.
nrow – [in] Number of rows.
ncol – [in] Number of columns.
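A minimal usage sketch, in the style of the examples above; csr_row_ptr, csr_col_ind, csr_val and the sizes nnz, nrow and ncol are assumed to be const host CSR data provided by the application:

```cpp
// The CSR arrays are only accessible as const (e.g. owned by another library)
LocalMatrix<ValueType> mat;
mat.CopyFromHostCSR(csr_row_ptr, csr_col_ind, csr_val, "my_matrix", nnz, nrow, ncol);
// mat now holds its own copy; the const host arrays remain untouched
```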
Copy Data¶
The user can copy data to and from a local vector by using CopyFromData() and CopyToData().

void rocalution::LocalVector::CopyFromData(const ValueType *data)¶
Copy (import) vector.
Copy (import) vector data that is described in one array (values). The object has to be allocated first with Allocate(), using the corresponding size of the data.
 Parameters
data – [in] data to be imported.

void rocalution::LocalVector::CopyToData(ValueType *data) const¶
Copy (export) vector.
Copy (export) vector data that is described in one array (values). The output array has to be allocated first, using the corresponding size of the data. The size can be obtained with GetSize().
 Parameters
data – [out] exported data.
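A minimal sketch of the import/export pair, in the style of the examples above; the size 100 and the host array data are arbitrary placeholders:

```cpp
// Allocate the vector with the size of the data first
LocalVector<ValueType> vec;
vec.Allocate("my_vector", 100);
// Import values from a host array of 100 elements
vec.CopyFromData(data);
// Operate on vec
// ...
// Export the values back into an allocated host array of vec.GetSize() elements
vec.CopyToData(data);
```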
Object Info¶

virtual void rocalution::BaseRocalution::Info(void) const = 0¶
Print object information.
Info
can print object information about any rocALUTION object. This information consists of object properties and backend data. Example
mat.Info();
vec.Info();
Copy¶
All matrix and vector objects provide a CopyFrom() function. The destination object should have the same size or be empty. In the latter case, the object is allocated at the source platform.

virtual void rocalution::LocalVector::CopyFrom(const LocalVector<ValueType> &src)¶
Copy vector from another vector.
CopyFrom
copies values from another vector. Example
LocalVector<ValueType> vec1, vec2;
// Allocate and initialize vec1 and vec2
// ...
// Move vec1 to accelerator
// vec1.MoveToAccelerator();
// Now, vec1 is on the accelerator (if available)
// and vec2 is on the host
// Copy vec1 to vec2 (or vice versa) will move data between host and
// accelerator backend
vec1.CopyFrom(vec2);
Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
 Parameters
src – [in] Vector, where values should be copied from.

void rocalution::LocalMatrix::CopyFrom(const LocalMatrix<ValueType> &src)¶
Copy matrix from another LocalMatrix.
CopyFrom
copies values and structure from another local matrix. Source and destination matrix should be in the same format. Example
LocalMatrix<ValueType> mat1, mat2;
// Allocate and initialize mat1 and mat2
// ...
// Move mat1 to accelerator
// mat1.MoveToAccelerator();
// Now, mat1 is on the accelerator (if available)
// and mat2 is on the host
// Copy mat1 to mat2 (or vice versa) will move data between host and
// accelerator backend
mat1.CopyFrom(mat2);
Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
 Parameters
src – [in] Local matrix where values and structure should be copied from.
Note
For vectors, the user can specify source and destination offsets and thus copy only a part of the whole vector into another vector.

virtual void rocalution::LocalVector::CopyFrom(const LocalVector<ValueType> &src, int src_offset, int dst_offset, int size)¶
Copy vector from another vector with offsets and size.
CopyFrom copies values from another vector, using specific source and destination offsets and a given number of entries.
Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
 Parameters
src – [in] Vector, where values should be copied from.
src_offset – [in] source offset.
dst_offset – [in] destination offset.
size – [in] number of entries to be copied.
Clone¶
The copy operators allow you to copy the values of the object to another object, without changing the backend specification of the object. In many algorithms, you might need auxiliary vectors or matrices. These objects can be cloned with CloneFrom().
CloneFrom¶

virtual void rocalution::LocalVector::CloneFrom(const LocalVector<ValueType> &src)¶
Clone the vector.
CloneFrom
clones the entire vector, with data and backend descriptor from another Vector. Example
LocalVector<ValueType> vec;
// Allocate and initialize vec (host or accelerator)
// ...
LocalVector<ValueType> tmp;
// By cloning vec, tmp will have identical values and will be on the same
// backend as vec
tmp.CloneFrom(vec);
 Parameters
src – [in] Vector to clone from.

void rocalution::LocalMatrix::CloneFrom(const LocalMatrix<ValueType> &src)¶
Clone the matrix.
CloneFrom
clones the entire matrix, including values, structure and backend descriptor from another LocalMatrix. Example
LocalMatrix<ValueType> mat;
// Allocate and initialize mat (host or accelerator)
// ...
LocalMatrix<ValueType> tmp;
// By cloning mat, tmp will have identical values and structure and will be on
// the same backend as mat
tmp.CloneFrom(mat);
 Parameters
src – [in] LocalMatrix to clone from.
CloneBackend¶

virtual void rocalution::BaseRocalution::CloneBackend(const BaseRocalution<ValueType> &src)¶
Clone the Backend descriptor from another object.
With
CloneBackend
, the backend can be cloned without copying any data. This is especially useful, if several objects should reside on the same backend, but keep their original data. Example
LocalVector<ValueType> vec;
LocalMatrix<ValueType> mat;
// Allocate and initialize vec and mat
// ...
LocalVector<ValueType> tmp;
// By cloning backend, tmp and vec will have the same backend as mat
tmp.CloneBackend(mat);
vec.CloneBackend(mat);
// The following matrix vector multiplication will be performed on the backend
// selected in mat
mat.Apply(vec, &tmp);
 Parameters
src – [in] Object, where the backend should be cloned from.
Check¶

virtual bool rocalution::LocalVector::Check(void) const¶
Perform a sanity check of the vector.
Checks, if the vector contains valid data, i.e. if the values are not infinity and not NaN (not a number).
 Returns
true – if the vector is ok (empty vector is also ok).
false – if there is something wrong with the values.

bool rocalution::LocalMatrix::Check(void) const¶
Perform a sanity check of the matrix.
Checks, if the matrix contains valid data, i.e. if the values are not infinity and not NaN (not a number) and if the structure of the matrix is correct (e.g. indices cannot be negative, CSR and COO matrices have to be sorted, etc.).
 Returns
true – if the matrix is ok (empty matrix is also ok).
false – if there is something wrong with the structure or values.
Checks, if the object contains valid data. For vectors, the function checks if the values are not infinity and not NaN (not a number). For matrices, this function checks the values and if the structure of the matrix is correct (e.g. indices cannot be negative, CSR and COO matrices have to be sorted, etc.).
Sort¶

void rocalution::LocalMatrix::Sort(void)¶
Sort the matrix indices.
Sorts the matrix by indices.
For CSR matrices, column values are sorted.
For COO matrices, row indices are sorted.
Keying¶

void rocalution::LocalMatrix::Key(long int &row_key, long int &col_key, long int &val_key) const¶
Compute a unique hash key for the matrix arrays.
Typically, it is hard to compare whether two matrices have the same structure (and values). To do so, rocALUTION provides a keying function that generates three keys: one for the row index array, one for the column index array and one for the values array.
 Parameters
row_key – [out] row index array key
col_key – [out] column index array key
val_key – [out] values array key
Graph Analyzers¶
The following functions are available for analyzing the connectivity of the graph of the underlying sparse matrix.
(R)CMK Ordering
Maximal Independent Set
MultiColoring
Zero Block Permutation
Connectivity Ordering
All graph analyzing functions return a permutation vector (integer type), which is supposed to be used with the rocalution::LocalMatrix::Permute() and rocalution::LocalMatrix::PermuteBackward() functions in the matrix and vector classes.
Cuthill-McKee Ordering¶

void rocalution::LocalMatrix::CMK(LocalVector<int> *permutation) const¶
Create permutation vector for CMK reordering of the matrix.
The Cuthill-McKee ordering minimizes the bandwidth of a given sparse matrix.
 Example
LocalVector<int> cmk;
mat.CMK(&cmk);
mat.Permute(cmk);
 Parameters
permutation – [out] permutation vector for CMK reordering

void rocalution::LocalMatrix::RCMK(LocalVector<int> *permutation) const¶
Create permutation vector for reverse CMK reordering of the matrix.
The reverse Cuthill-McKee ordering minimizes the bandwidth of a given sparse matrix.
 Example
LocalVector<int> rcmk;
mat.RCMK(&rcmk);
mat.Permute(rcmk);
 Parameters
permutation – [out] permutation vector for reverse CMK reordering
Maximal Independent Set¶

void rocalution::LocalMatrix::MaximalIndependentSet(int &size, LocalVector<int> *permutation) const¶
Perform maximal independent set decomposition of the matrix.
The Maximal Independent Set algorithm finds a set of maximal size that contains elements which do not depend on other elements in this set.
 Example
LocalVector<int> mis;
int size;
mat.MaximalIndependentSet(size, &mis);
mat.Permute(mis);
 Parameters
size – [out] number of independent sets
permutation – [out] permutation vector for maximal independent set reordering
MultiColoring¶

void rocalution::LocalMatrix::MultiColoring(int &num_colors, int **size_colors, LocalVector<int> *permutation) const¶
Perform multicoloring decomposition of the matrix.
The MultiColoring algorithm builds a permutation (coloring of the matrix) such that no two adjacent nodes in the sparse matrix share the same color.
 Example
LocalVector<int> mc;
int num_colors;
int* block_colors = NULL;
mat.MultiColoring(num_colors, &block_colors, &mc);
mat.Permute(mc);
 Parameters
num_colors – [out] number of colors
size_colors – [out] pointer to array that holds the number of nodes for each color
permutation – [out] permutation vector for multicoloring reordering
Zero Block Permutation¶

void rocalution::LocalMatrix::ZeroBlockPermutation(int &size, LocalVector<int> *permutation) const¶
Return a permutation for saddle-point problems (zero diagonal entries)
For saddle-point problems (i.e. matrices with zero diagonal entries), the Zero Block Permutation maps all zero-diagonal elements to the last block of the matrix.
 Example
LocalVector<int> zbp;
int size;
mat.ZeroBlockPermutation(size, &zbp);
mat.Permute(zbp);
 Parameters
size – [out]
permutation – [out] permutation vector for zero block permutation
Connectivity Ordering¶

void rocalution::LocalMatrix::ConnectivityOrder(LocalVector<int> *permutation) const¶
Create permutation vector for connectivity reordering of the matrix.
Connectivity ordering returns a permutation that sorts the matrix by the number of non-zero entries per row.
 Example
LocalVector<int> conn;
mat.ConnectivityOrder(&conn);
mat.Permute(conn);
 Parameters
permutation – [out] permutation vector for connectivity reordering
Basic Linear Algebra Operations¶
For a full list of functions and routines involving operators and vectors, see the API specifications.
Multinode Computation¶
Introduction¶
This chapter describes all base objects (matrices and vectors) for computation on multi-node (distributed memory) systems.
One or more accelerators can be attached to each compute node. The compute node can be any kind of shared-memory system (single, dual or quad CPU); details on a single node can be found in Fig. 2.
Note
The memory of accelerator and host are physically different. All nodes can communicate with each other via network.
For the communication channel between different nodes (and between the accelerators on single or multiple nodes) the MPI library is used.
rocALUTION supports a non-overlapping type of distribution, where the computational domain is split into several sub-domains with the corresponding information about the boundary and ghost layers. An example is shown in Fig. 5. The square box domain is distributed into four sub-domains. Each sub-domain belongs to a process P0, P1, P2 and P3.
To perform a sparse matrix-vector multiplication (SpMV), each process needs to multiply its own portion of the domain and update the corresponding ghost elements. For P0, this multiplication reads
\(y = Ax = A_I x_I + A_G x_G,\)
where \(I\) stands for interior and \(G\) stands for ghost. \(x_G\) is a vector with three sections, coming from P1, P2 and P3. The whole ghost part of the global vector is used mainly for the SpMV product. It does not play any role in the computation of vector-vector operations.
Code Structure¶
Each object contains two local sub-objects. The global matrix stores the interior and ghost matrix as local objects. Similarly, the global vector stores its data in two local objects. In addition to the local data, the global objects hold information about the global communication through the parallel manager.
Parallel Manager¶

class ParallelManager : public rocalution::RocalutionObj
Parallel Manager class.
The parallel manager class handles the communication and the mapping of the global operators. Each global operator and vector needs to be initialized with a valid parallel manager in order to perform any operation. For many distributed simulations, the underlying operator is already distributed. This information needs to be passed to the parallel manager.
The parallel manager class hosts the following functions:

void rocalution::ParallelManager::SetMPICommunicator(const void *comm)
Set the MPI communicator.

void rocalution::ParallelManager::Clear(void)
Clear all allocated resources.

IndexType2 rocalution::ParallelManager::GetGlobalSize(void) const
Return the global size.

int rocalution::ParallelManager::GetLocalSize(void) const
Return the local size.

int rocalution::ParallelManager::GetNumReceivers(void) const
Return the number of receivers.

int rocalution::ParallelManager::GetNumSenders(void) const
Return the number of senders.

int rocalution::ParallelManager::GetNumProcs(void) const
Return the number of involved processes.

void rocalution::ParallelManager::SetGlobalSize(IndexType2 size)
Initialize the global size.

void rocalution::ParallelManager::SetLocalSize(int size)
Initialize the local size.

void rocalution::ParallelManager::SetBoundaryIndex(int size, const int *index)
Set all boundary indices of this rank's process.

void rocalution::ParallelManager::SetReceivers(int nrecv, const int *recvs, const int *recv_offset)
Set the number of processes the current process is receiving data from, the array of these processes, and the offsets where the boundary for process 'receiver' starts.

void rocalution::ParallelManager::SetSenders(int nsend, const int *sends, const int *send_offset)
Set the number of processes the current process is sending data to, the array of these processes, and the offsets where the ghost part for process 'sender' starts.

void rocalution::ParallelManager::ReadFileASCII(const std::string &filename)
Read file that contains all relevant parallel manager data.

void rocalution::ParallelManager::WriteFileASCII(const std::string &filename) const
Write file that contains all relevant parallel manager data.
To setup a parallel manager, the required information is:
Global size
Local size of the interior/ghost for each process
Communication pattern (what information needs to be sent to whom)
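The setup steps above can be sketched as follows; all sizes and arrays (comm, global_size, local_size, boundary_size, boundary_index, nrecv, recvs, recv_offset, nsend, sends, send_offset) are hypothetical placeholders coming from the application's domain decomposition:

```cpp
ParallelManager pm;
// MPI communicator (e.g. a pointer to an MPI_Comm)
pm.SetMPICommunicator(&comm);
// Global size and the local size of this rank
pm.SetGlobalSize(global_size);
pm.SetLocalSize(local_size);
// Boundary indices of this rank
pm.SetBoundaryIndex(boundary_size, boundary_index);
// Communication pattern: who sends to us and whom we send to
pm.SetReceivers(nrecv, recvs, recv_offset);
pm.SetSenders(nsend, sends, send_offset);
```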
Global Matrices and Vectors¶

const LocalMatrix<ValueType> &rocalution::GlobalMatrix::GetInterior() const¶

const LocalMatrix<ValueType> &rocalution::GlobalMatrix::GetGhost() const¶
The global matrices and vectors store their data via two local objects. For the global matrix, the interior and ghost parts can be accessed via the rocalution::GlobalMatrix::GetInterior() and rocalution::GlobalMatrix::GetGhost() functions, which point to two valid local matrices. Similarly, the global vector can be accessed by rocalution::GlobalVector::GetInterior().
Asynchronous SpMV¶
To minimize latency and to increase scalability, rocALUTION supports asynchronous sparse matrix-vector multiplication. The implementation of the SpMV starts with an asynchronous transfer of the required ghost buffers, while at the same time it computes the interior matrix-vector product. When the computation of the interior SpMV is done, the ghost transfer is synchronized and the ghost SpMV is performed. To reduce traffic over the PCIe bus, the HIP implementation provides a special packing technique that gathers all ghost data into a contiguous memory buffer before transfer.
File I/O¶
The user can store and load all global structures from and to files. For a solver, the necessary data would be
the parallel manager
the sparse matrix
and the vector
Reading/writing from/to files can be done fully in parallel without any communication. Fig. 7 visualizes data of a \(4 \times 4\) grid example which is distributed among 4 MPI processes (organized in \(2 \times 2\)). Each local matrix stores the local unknowns (with local indexing). Fig. 8 furthermore illustrates the data associated with RANK0.
File Organization¶
When the parallel manager, global matrix or global vector are writing to a file, the main file (passed as a file name to this function) will contain information for all files on all ranks.
parallelmanager.dat.rank.0
parallelmanager.dat.rank.1
parallelmanager.dat.rank.2
parallelmanager.dat.rank.3
matrix.mtx.interior.rank.0
matrix.mtx.ghost.rank.0
matrix.mtx.interior.rank.1
matrix.mtx.ghost.rank.1
matrix.mtx.interior.rank.2
matrix.mtx.ghost.rank.2
matrix.mtx.interior.rank.3
matrix.mtx.ghost.rank.3
rhs.dat.rank.0
rhs.dat.rank.1
rhs.dat.rank.2
rhs.dat.rank.3
Parallel Manager¶
The data for each rank can be split into receiving and sending information. For receiving data from neighboring processes, see Fig. 9, RANK0 needs to know what type of data will be received and from whom. For sending data to neighboring processes, see Fig. 10, RANK0 needs to know where and what to send.
To receive data, RANK0 requires:
Number of MPI ranks that will send data to RANK0 (NUMBER_OF_RECEIVERS - integer value).
Which MPI ranks send the data (RECEIVERS_RANK - integer array).
How the received data (from each rank) will be stored in the ghost vector (RECEIVERS_INDEX_OFFSET - integer array). In this example, the first 30 elements will be received from P1 \([0, 2)\) and the second 30 from P2 \([2, 4)\).
To send data, RANK0 requires:
Total size of the sending information (BOUNDARY_SIZE - integer value).
Number of MPI ranks that will receive data from RANK0 (NUMBER_OF_SENDERS - integer value).
Which MPI ranks receive the data (SENDERS_RANK - integer array).
How the sending data (for each rank) will be stored in the sending buffer (SENDERS_INDEX_OFFSET - integer array). In this example, the first 30 elements will be sent to P1 \([0, 2)\) and the second 30 to P2 \([2, 4)\).
The elements that need to be sent (BOUNDARY_INDEX - integer array). In this example, the data that needs to be sent to P1 and P2 is the ghost layer, marked as ghost P0. The vertical stripe needs to be sent to P1 and the horizontal stripe to P2. The local numbering of the unknowns for P1 (the vertical stripe) is 1, 2 (size of 2), stored in BOUNDARY_INDEX. After these 2 elements, the elements for P2 are stored; they are 2, 3 (2 elements).
Matrices¶
Each rank hosts two local matrices, interior and ghost matrix. They can be stored in separate files, one for each matrix. The file format could be Matrix Market (MTX) or binary.
Vectors¶
Each rank holds the local interior vector only. It is stored in a single file. The file could be ASCII or binary.
Solvers¶
Code Structure¶

template<class OperatorType, class VectorType, typename ValueType>
class Solver : public rocalution::RocalutionObj
Base class for all solvers and preconditioners.
Most of the solvers can be performed on the linear operators LocalMatrix, LocalStencil and GlobalMatrix, i.e. the solvers can run locally (on a shared-memory system) or in a distributed manner (on a cluster) via MPI. The only exception is the AMG (Algebraic Multigrid) solver, which has two versions (one for the LocalMatrix and one for the GlobalMatrix class). The only purely local solvers (which do not support global/MPI operations) are the mixed-precision defect-correction solver and all direct solvers.
All solvers need three template parameters: the operator, vector and scalar type.
The Solver class is purely virtual and provides an interface for
SetOperator() to set the operator \(A\), i.e. the user can pass the matrix here.
Build() to build the solver (including preconditioners, sub-solvers, etc.). The user needs to specify the operator first, before calling Build().
Solve() to solve the system \(Ax = b\). The user needs to pass a right-hand side \(b\) and a vector \(x\), where the solution will be obtained.
Print() to show solver information.
ReBuildNumeric() to only rebuild the solver numerically (if possible).
MoveToHost() and MoveToAccelerator() to offload the solver (including preconditioners and subsolvers) to the host/accelerator.
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::DirectLinearSolver< OperatorType, VectorType, ValueType >, rocalution::IterativeLinearSolver< OperatorType, VectorType, ValueType >, rocalution::Preconditioner< OperatorType, VectorType, ValueType >
It provides an interface for

virtual void rocalution::Solver::Build(void)¶
Build the solver (data allocation, structure and numerical computation)

virtual void rocalution::Solver::Solve(const VectorType &rhs, VectorType *x) = 0¶
Solve Operator x = rhs.

virtual void rocalution::Solver::ReBuildNumeric(void)¶
Rebuild the solver only with numerical computation (no allocation or data structure computation)
Iterative Linear Solvers¶

template<class OperatorType, class VectorType, typename ValueType>
class IterativeLinearSolver : public rocalution::Solver<OperatorType, VectorType, ValueType> Base class for all linear iterative solvers.
The iterative solvers are controlled by an iteration control object, which monitors the convergence properties of the solver, i.e. the maximum number of iterations, relative tolerance, absolute tolerance and divergence tolerance. The iteration control can also record the residual history and store it in an ASCII file.
Init(), InitMinIter(), InitMaxIter() and InitTol() initialize the solver and set the stopping criteria.
RecordResidualHistory() and RecordHistory() start the recording of the residual and write it into a file.
Verbose() sets the level of verbose output of the solver (0 - no output, 2 - detailed output, including residual and iteration information).
SetPreconditioner() sets the preconditioning.
All iterative solvers are controlled based on
Absolute stopping criteria, when \(\|r_{k}\|_{L_{p}} < \epsilon_{abs}\)
Relative stopping criteria, when \(\|r_{k}\|_{L_{p}} / \|r_{1}\|_{L_{p}} \leq \epsilon_{rel}\)
Divergence stopping criteria, when \(\|r_{k}\|_{L_{p}} / \|r_{1}\|_{L_{p}} \geq \epsilon_{div}\)
Maximum number of iterations \(N\), when \(k = N\)
where \(k\) is the current iteration, \(r_{k}\) the residual for the current iteration \(k\) (i.e. \(r_{k} = b - Ax_{k}\)) and \(r_{1}\) the starting residual (i.e. \(r_{1} = b - Ax_{init}\)). In addition, a minimum number of iterations \(M\) can be specified. In this case, the solver will not stop iterating before \(k \geq M\).
The \(L_{p}\) norm is used for the computation, where \(p\) can be 1, 2 or \(\infty\). The norm computation can be set with SetResidualNorm() with 1 for \(L_{1}\), 2 for \(L_{2}\) and 3 for \(L_{\infty}\). For the computation with \(L_{\infty}\), the index of the maximum value can be obtained with GetAmaxResidualIndex(). If this function is called and \(L_{\infty}\) was not selected, it will return -1.
The reached criteria can be obtained with GetSolverStatus(), returning
0, if no criteria has been reached yet
1, if absolute tolerance has been reached
2, if relative tolerance has been reached
3, if divergence tolerance has been reached
4, if maximum number of iterations has been reached
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::BaseMultiGrid< OperatorType, VectorType, ValueType >, rocalution::BiCGStab< OperatorType, VectorType, ValueType >, rocalution::BiCGStabl< OperatorType, VectorType, ValueType >, rocalution::CG< OperatorType, VectorType, ValueType >, rocalution::Chebyshev< OperatorType, VectorType, ValueType >, rocalution::CR< OperatorType, VectorType, ValueType >, rocalution::FCG< OperatorType, VectorType, ValueType >, rocalution::FGMRES< OperatorType, VectorType, ValueType >, rocalution::FixedPoint< OperatorType, VectorType, ValueType >, rocalution::GMRES< OperatorType, VectorType, ValueType >, rocalution::IDR< OperatorType, VectorType, ValueType >, rocalution::QMRCGStab< OperatorType, VectorType, ValueType >
It provides an interface for

void rocalution::IterativeLinearSolver::Init(double abs_tol, double rel_tol, double div_tol, int max_iter)¶
Initialize the solver with absolute/relative/divergence tolerance and maximum number of iterations.

void rocalution::IterativeLinearSolver::Init(double abs_tol, double rel_tol, double div_tol, int min_iter, int max_iter)¶
Initialize the solver with absolute/relative/divergence tolerance and minimum/maximum number of iterations.

void rocalution::IterativeLinearSolver::InitMinIter(int min_iter)¶
Set the minimum number of iterations.

void rocalution::IterativeLinearSolver::InitMaxIter(int max_iter)¶
Set the maximum number of iterations.

void rocalution::IterativeLinearSolver::InitTol(double abs, double rel, double div)¶
Set the absolute/relative/divergence tolerance.

void rocalution::IterativeLinearSolver::RecordResidualHistory(void)¶
Record the residual history.

void rocalution::IterativeLinearSolver::RecordHistory(const std::string &filename) const¶
Write the history to file.

virtual void rocalution::IterativeLinearSolver::Verbose(int verb = 1)¶
Set the solver verbosity output.

virtual void rocalution::IterativeLinearSolver::SetPreconditioner(Solver<OperatorType, VectorType, ValueType> &precond)¶
Set a preconditioner of the linear solver.

void rocalution::IterativeLinearSolver::SetResidualNorm(int resnorm)¶
Set the residual norm to \(L_1\), \(L_2\) or \(L_\infty\) norm.
resnorm = 1 -> \(L_1\) norm
resnorm = 2 -> \(L_2\) norm
resnorm = 3 -> \(L_\infty\) norm

virtual int rocalution::IterativeLinearSolver::GetAmaxResidualIndex(void)¶
Return absolute maximum index of residual vector when using \(L_\infty\) norm.

virtual int rocalution::IterativeLinearSolver::GetSolverStatus(void)¶
Return the current status.
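The iteration control described above can be configured before calling Build(). The following is a minimal sketch (not a complete program) assuming a matrix mat, a right-hand-side rhs and a solution vector x have already been allocated and filled:

```cpp
// Sketch: configuring the iteration control of an iterative solver.
CG<LocalMatrix<double>, LocalVector<double>, double> ls;

ls.SetOperator(mat);

// abs tol = 1e-10, rel tol = 1e-8, div tol = 1e+8, at most 10000 iterations
ls.Init(1e-10, 1e-8, 1e+8, 10000);

// Use the L-infinity norm for the residual check
ls.SetResidualNorm(3);

ls.Build();
ls.Solve(rhs, &x);

// 1 = absolute, 2 = relative, 3 = divergence, 4 = max iterations reached
int status = ls.GetSolverStatus();
```

The same Init()/SetResidualNorm() calls apply to any solver derived from IterativeLinearSolver.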
Building and Solving Phase¶
Each iterative solver consists of a building step and a solving step. During the building step, all necessary auxiliary data is allocated and the preconditioner is constructed. After that, the user can call the solving procedure; the solving step can be invoked several times.
When the initial matrix associated with the solver is on the accelerator, the solver will try to build everything on the accelerator. However, some preconditioners and solvers (such as FSAI and AMG) need to be constructed on the host before they can be transferred to the accelerator. If the initial matrix is on the host and we want to run the solver on the accelerator then we need to move the solver to the accelerator as well as the matrix, the righthandside and the solution vector.
Note
If a preconditioner is associated with the solver, it will be moved automatically to the accelerator when you move the solver.
The following example builds the solver entirely on the accelerator, by moving the operator and the vectors before calling Build():
// CG solver
CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
// MultiColored ILU preconditioner
MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;
// Move matrix and vectors to the accelerator
mat.MoveToAccelerator();
rhs.MoveToAccelerator();
x.MoveToAccelerator();
// Set mat to be the operator
ls.SetOperator(mat);
// Set p as the preconditioner of ls
ls.SetPreconditioner(p);
// Build the solver and preconditioner on the accelerator
ls.Build();
// Compute the solution on the accelerator
ls.Solve(rhs, &x);
Alternatively, the solver can be built on the host first and moved to the accelerator afterwards:
// CG solver
CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
// MultiColored ILU preconditioner
MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;
// Set mat to be the operator
ls.SetOperator(mat);
// Set p as the preconditioner of ls
ls.SetPreconditioner(p);
// Build the solver and preconditioner on the host
ls.Build();
// Move matrix and vectors to the accelerator
mat.MoveToAccelerator();
rhs.MoveToAccelerator();
x.MoveToAccelerator();
// Move linear solver to the accelerator
ls.MoveToAccelerator();
// Compute the solution on the accelerator
ls.Solve(rhs, &x);
Clear Function and Destructor¶
The rocalution::Solver::Clear()
function clears all data in the solver, including the associated preconditioner. Thus, the solver is no longer associated with this preconditioner.
Note
The preconditioner is not deleted (via destructor), only a rocalution::Preconditioner::Clear()
is called.
Note
When the destructor of the solver class is called, it automatically calls the Clear() function. Be careful when declaring your solver and preconditioner in different places - we highly recommend calling the Clear() function of the solver manually and not relying on the destructor of the solver.
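When solver and preconditioner live in different scopes, clearing the solver explicitly avoids relying on destruction order. A sketch, assuming mat, rhs and x are already set up:

```cpp
// Sketch: explicit Clear() instead of relying on destructor order.
CG<LocalMatrix<double>, LocalVector<double>, double> ls;
MultiColoredILU<LocalMatrix<double>, LocalVector<double>, double> p;

ls.SetOperator(mat);
ls.SetPreconditioner(p);
ls.Build();
ls.Solve(rhs, &x);

// Free all solver data; this also clears the preconditioner data,
// but does not destroy the preconditioner object itself.
ls.Clear();
```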
Numerical Update¶
Some preconditioners require two phases in their construction: an algebraic phase (e.g. computing a pattern or structure) and a numerical phase (computing the actual values). In cases where the structure of the input matrix is constant (e.g. Newton-like methods), it is not necessary to fully reconstruct the preconditioner. In this case, the user can apply a numerical update to the current preconditioner and pass the new operator with rocalution::Solver::ReBuildNumeric()
. If the preconditioner/solver does not support the numerical update, a full rocalution::Solver::Clear()
and rocalution::Solver::Build()
will be performed.
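A sketch of such a Newton-like update loop, assuming the sparsity pattern of mat stays constant while its values change between solves (ls is any solver with its operator already set):

```cpp
// Sketch: numerical update without a full rebuild.
ls.SetOperator(mat);
ls.Build();

for(int step = 0; step < nsteps; ++step)
{
    // ... update the values of mat here (same sparsity pattern) ...

    // Recompute only the numerical part of the solver/preconditioner;
    // falls back to a full Clear() + Build() if not supported.
    ls.ReBuildNumeric();

    ls.Solve(rhs, &x);
}
```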
FixedPoint Iteration¶

template<class OperatorType, class VectorType, typename ValueType>
class FixedPoint : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> FixedPoint Iteration Scheme.
The FixedPoint iteration scheme is based on an additive splitting of the matrix \(A = M + N\). The scheme reads
\[ x_{k+1} = M^{-1} (b - N x_{k}). \]It can also be reformulated as a weighted defect correction scheme\[ x_{k+1} = x_{k} - \omega M^{-1} (Ax_{k} - b). \]The inversion of \(M\) can be performed by preconditioners (Jacobi, Gauss-Seidel, ILU, etc.) or by any type of solvers. tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::FixedPoint::SetRelaxation(ValueType omega)¶
Set relaxation parameter \(\omega\).
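A damped Jacobi iteration, for instance, is obtained by combining FixedPoint with a Jacobi preconditioner as the approximate inverse of \(M\). A sketch, assuming mat, rhs and x are prepared:

```cpp
// Sketch: Fixed-Point iteration, with Jacobi approximating M^{-1}.
FixedPoint<LocalMatrix<double>, LocalVector<double>, double> fp;
Jacobi<LocalMatrix<double>, LocalVector<double>, double> p;

fp.SetOperator(mat);
fp.SetPreconditioner(p);

// Relaxation parameter omega of the weighted defect-correction update
fp.SetRelaxation(0.8);

fp.Build();
fp.Solve(rhs, &x);
```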
Krylov Subspace Solvers¶
CG¶

template<class OperatorType, class VectorType, typename ValueType>
class CG : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Conjugate Gradient Method.
The Conjugate Gradient method is the best known iterative method for solving sparse symmetric positive definite (SPD) linear systems \(Ax=b\). It is based on orthogonal projection onto the Krylov subspace \(\mathcal{K}_{m}(r_{0}, A)\), where \(r_{0}\) is the initial residual. The method can be preconditioned, where the approximation should also be SPD. SAAD
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
CR¶

template<class OperatorType, class VectorType, typename ValueType>
class CR : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Conjugate Residual Method.
The Conjugate Residual method is an iterative method for solving sparse symmetric semi-positive definite linear systems \(Ax=b\). It is a Krylov subspace method and differs from the much more popular Conjugate Gradient method in that the system matrix is not required to be positive definite. The method can be preconditioned, where the approximation should also be SPD or semi-positive definite. SAAD
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
GMRES¶

template<class OperatorType, class VectorType, typename ValueType>
class GMRES : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Generalized Minimum Residual Method.
The Generalized Minimum Residual method (GMRES) is a projection method for solving sparse (non-)symmetric linear systems \(Ax=b\), based on a restarting technique. The solution is approximated in a Krylov subspace \(\mathcal{K}=\mathcal{K}_{m}\) and \(\mathcal{L}=A\mathcal{K}_{m}\) with minimal residual, where \(\mathcal{K}_{m}\) is the \(m\)-th Krylov subspace with \(v_{1} = r_{0}/\|r_{0}\|_{2}\). SAAD
The Krylov subspace basis size can be set using SetBasisSize(). The default size is 30.
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
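Enlarging the restart basis typically reduces the iteration count at the cost of memory. A brief sketch, assuming mat, rhs and x are set up:

```cpp
// Sketch: GMRES with a larger restart basis (default is 30).
GMRES<LocalMatrix<double>, LocalVector<double>, double> ls;

ls.SetOperator(mat);
ls.SetBasisSize(50);

ls.Build();
ls.Solve(rhs, &x);
```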
FGMRES¶

template<class OperatorType, class VectorType, typename ValueType>
class FGMRES : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Flexible Generalized Minimum Residual Method.
The Flexible Generalized Minimum Residual method (FGMRES) is a projection method for solving sparse (non-)symmetric linear systems \(Ax=b\). It is similar to the GMRES method with the difference that FGMRES is based on a window shifting of the Krylov subspace, which allows the preconditioner \(M^{-1}\) to be a non-constant operator. This can be especially helpful if the operation \(M^{-1}x\) is the result of another iterative process and not a constant operator. SAAD
The Krylov subspace basis size can be set using SetBasisSize(). The default size is 30.
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
BiCGStab¶

template<class OperatorType, class VectorType, typename ValueType>
class BiCGStab : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> BiConjugate Gradient Stabilized Method.
The BiConjugate Gradient Stabilized method is a variation of CGS and solves sparse (non-)symmetric linear systems \(Ax=b\). SAAD
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
IDR¶

template<class OperatorType, class VectorType, typename ValueType>
class IDR : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Induced Dimension Reduction Method.
The Induced Dimension Reduction method is a Krylov subspace method for solving sparse (non-)symmetric linear systems \(Ax=b\). IDR(s) generates residuals in a sequence of nested subspaces. IDR1 IDR2
The dimension of the shadow space can be set by SetShadowSpace(). The default size of the shadow space is 4.
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
FCG¶

template<class OperatorType, class VectorType, typename ValueType>
class FCG : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Flexible Conjugate Gradient Method.
The Flexible Conjugate Gradient method is an iterative method for solving sparse symmetric positive definite linear systems \(Ax=b\). It is similar to the Conjugate Gradient method with the difference that it allows the preconditioner \(M^{-1}\) to be a non-constant operator. This can be especially helpful if the operation \(M^{-1}x\) is the result of another iterative process and not a constant operator. fcg
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
QMRCGStab¶

template<class OperatorType, class VectorType, typename ValueType>
class QMRCGStab : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Quasi-Minimal Residual Conjugate Gradient Stabilized Method.
The Quasi-Minimal Residual Conjugate Gradient Stabilized method is a variant of the Krylov subspace BiCGStab method for solving sparse (non-)symmetric linear systems \(Ax=b\). qmrcgstab
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
BiCGStab(l)¶

template<class OperatorType, class VectorType, typename ValueType>
class BiCGStabl : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> BiConjugate Gradient Stabilized (l) Method.
The BiConjugate Gradient Stabilized (l) method is a generalization of BiCGStab for solving sparse (non-)symmetric linear systems \(Ax=b\). It minimizes residuals over \(l\)-dimensional Krylov subspaces. The degree \(l\) can be set with SetOrder(). bicgstabl
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Chebyshev Iteration Scheme¶

template<class OperatorType, class VectorType, typename ValueType>
class Chebyshev : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Chebyshev Iteration Scheme.
The Chebyshev Iteration scheme (also known as an acceleration scheme) is similar to the CG method but requires the minimum and maximum eigenvalues of the operator. templates
 tparam OperatorType
 can be LocalMatrix, GlobalMatrix or LocalStencil
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Mixed-Precision Defect Correction Scheme¶

template<class OperatorTypeH, class VectorTypeH, typename ValueTypeH, class OperatorTypeL, class VectorTypeL, typename ValueTypeL>
class MixedPrecisionDC : public rocalution::IterativeLinearSolver<OperatorTypeH, VectorTypeH, ValueTypeH> MixedPrecision Defect Correction Scheme.
The Mixed-Precision solver is based on a defect-correction scheme. The current implementation of the library uses host-based correction in double precision and accelerator computation in single precision. The solver implements the scheme
\[ x_{k+1} = x_{k} + A^{-1} r_{k}, \]where the computation of the residual \(r_{k} = b - Ax_{k}\) and the update \(x_{k+1} = x_{k} + d_{k}\) are performed on the host in double precision. Solving the residual system \(Ad_{k} = r_{k}\) is performed on the accelerator in single precision. In addition to the setup functions of the iterative solver, the user needs to specify the inner ( \(Ad_{k} = r_{k}\)) solver. tparam OperatorTypeH
 can be LocalMatrix
 tparam VectorTypeH
 can be LocalVector
 tparam ValueTypeH
 can be double
 tparam OperatorTypeL
 can be LocalMatrix
 tparam VectorTypeL
 can be LocalVector
 tparam ValueTypeL
 can be float
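A sketch of the scheme, with the double-precision outer correction wrapping a single-precision inner CG solver; the Set() call attaching the inner solver is assumed from the library examples, and mat, rhs and x are double-precision objects prepared beforehand:

```cpp
// Sketch: mixed-precision defect correction, double outside, float inside.
MixedPrecisionDC<LocalMatrix<double>, LocalVector<double>, double,
                 LocalMatrix<float>,  LocalVector<float>,  float> mp;

// Inner solver runs in single precision (on the accelerator, if available)
CG<LocalMatrix<float>, LocalVector<float>, float> cg;
cg.Init(1e-10, 1e-8, 1e+8, 10000);

mp.SetOperator(mat); // double-precision operator
mp.Set(cg);          // attach the inner single-precision solver

mp.Build();
mp.Solve(rhs, &x);
```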
MultiGrid Solvers¶
The library provides an algebraic multigrid solver as well as a skeleton for geometric multigrid methods. The BaseMultiGrid class itself does not construct the data for the method. It contains the solution procedure for V-, W- and K-cycles. The AMG has two different versions, for Local (non-MPI) and for Global (MPI) types of computations.

template<class OperatorType, class VectorType, typename ValueType>
class BaseMultiGrid : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType> Base class for all multigrid solvers Trottenberg2003.
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::BaseAMG< OperatorType, VectorType, ValueType >, rocalution::MultiGrid< OperatorType, VectorType, ValueType >
Geometric MultiGrid¶

template<class OperatorType, class VectorType, typename ValueType>
class MultiGrid : public rocalution::BaseMultiGrid<OperatorType, VectorType, ValueType> MultiGrid Method.
The MultiGrid method can be used with external data, such as an externally computed restriction, prolongation and operator hierarchy. The user needs to pass all this information to the solver for each level: the smoothing step, prolongation/restriction, grid traversal and coarse grid solver. Trottenberg2003
Restriction and prolongation operations can be performed in two ways: based on Restriction() and Prolongation() of the LocalVector class, or by matrix-vector multiplication. This is configured by a set function.
Smoothers can be any iterative linear solver. Valid options are Jacobi, Gauss-Seidel, ILU, etc., using a FixedPoint iteration scheme with a predefined number of iterations. The smoothers can also be solvers such as CG, BiCGStab, etc.
The coarse grid solver can be any iterative linear solver type. The class also provides mechanisms to specify where the coarse grid solver is performed, on the host or on the accelerator. The coarse grid solver can be preconditioned.
Grid scaling is based on an \(L_2\) norm ratio.
Operator matrices need to be passed on each grid level.
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Algebraic MultiGrid¶

template<class OperatorType, class VectorType, typename ValueType>
class BaseAMG : public rocalution::BaseMultiGrid<OperatorType, VectorType, ValueType> Base class for all algebraic multigrid solvers.
The Algebraic MultiGrid solver is based on the BaseMultiGrid class. The coarsening is obtained by different aggregation techniques. The smoothers can be constructed inside or outside of the class.
All parameters in the Algebraic MultiGrid class can be set externally, including smoothers and coarse grid solver.
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::PairwiseAMG< OperatorType, VectorType, ValueType >, rocalution::RugeStuebenAMG< OperatorType, VectorType, ValueType >, rocalution::SAAMG< OperatorType, VectorType, ValueType >, rocalution::UAAMG< OperatorType, VectorType, ValueType >

void rocalution::BaseAMG::SetCoarsestLevel(int coarse_size)¶
Set coarsest level for hierarchy creation.

void rocalution::BaseAMG::SetManualSmoothers(bool sm_manual)¶
Set flag to pass smoothers manually for each level.

void rocalution::BaseAMG::SetManualSolver(bool s_manual)¶
Set flag to pass coarse grid solver manually.
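A sketch of an AMG setup using these controls, here with an unsmoothed-aggregation hierarchy (UAAMG) used directly as a solver, assuming mat, rhs and x are prepared:

```cpp
// Sketch: AMG with an automatically built hierarchy.
UAAMG<LocalMatrix<double>, LocalVector<double>, double> amg;

amg.SetOperator(mat);

// Stop coarsening once the coarsest system is small enough
amg.SetCoarsestLevel(300);

// Let the library construct smoothers and the coarse grid solver
amg.SetManualSmoothers(false);
amg.SetManualSolver(false);

amg.Build();
amg.Solve(rhs, &x);
```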
Unsmoothed Aggregation AMG¶

template<class OperatorType, class VectorType, typename ValueType>
class UAAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType> Unsmoothed Aggregation Algebraic MultiGrid Method.
The Unsmoothed Aggregation Algebraic MultiGrid method is based on an unsmoothed aggregation interpolation scheme. stueben
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Smoothed Aggregation AMG¶

template<class OperatorType, class VectorType, typename ValueType>
class SAAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType> Smoothed Aggregation Algebraic MultiGrid Method.
The Smoothed Aggregation Algebraic MultiGrid method is based on a smoothed aggregation interpolation scheme. vanek
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Ruge-Stueben AMG¶

template<class OperatorType, class VectorType, typename ValueType>
class RugeStuebenAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType> RugeStueben Algebraic MultiGrid Method.
The Ruge-Stueben Algebraic MultiGrid method is based on the classic Ruge-Stueben coarsening with direct interpolation. The solver provides high efficiency in terms of solver complexity (i.e. number of iterations). However, it usually has a more expensive building step and requires higher memory usage. stueben
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::RugeStuebenAMG::SetCouplingStrength(ValueType eps)¶
Set coupling strength.
Pairwise AMG¶

template<class OperatorType, class VectorType, typename ValueType>
class PairwiseAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType> Pairwise Aggregation Algebraic MultiGrid Method.
The Pairwise Aggregation Algebraic MultiGrid method is based on a pairwise aggregation matching scheme. It delivers a very efficient building phase, which is suitable for Poisson-like equations. Most of the time it requires a K-cycle in the solving phase to provide a low number of iterations. This version has multi-node support. pairwiseamg
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::PairwiseAMG::SetBeta(ValueType beta)¶
Set beta for pairwise aggregation.

void rocalution::PairwiseAMG::SetOrdering(unsigned int ordering)¶
Set reordering for aggregation.

void rocalution::PairwiseAMG::SetCoarseningFactor(double factor)¶
Set target coarsening factor.
Direct Linear Solvers¶

template<class OperatorType, class VectorType, typename ValueType>
class DirectLinearSolver : public rocalution::Solver<OperatorType, VectorType, ValueType> Base class for all direct linear solvers.
The library provides three direct methods: LU, QR and Inversion (based on QR decomposition). The user can pass a sparse matrix; internally, it will be converted to dense and then the selected method will be applied. Because the matrix is converted to a dense format, these methods should be used only for very small matrices.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::Inversion< OperatorType, VectorType, ValueType >, rocalution::LU< OperatorType, VectorType, ValueType >, rocalution::QR< OperatorType, VectorType, ValueType >

template<class OperatorType, class VectorType, typename ValueType>
class LU : public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType> LU Decomposition.
Lower-Upper Decomposition factors a given square matrix into a lower and an upper triangular matrix, such that \(A = LU\).
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType>
class QR : public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType> QR Decomposition.
The QR Decomposition decomposes a given matrix into \(A = QR\), such that \(Q\) is an orthogonal matrix and \(R\) an upper triangular matrix.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType>
class Inversion : public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType> Matrix Inversion.
Full matrix inversion based on QR decomposition.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Note
These methods can only be used with localtype problems.
Preconditioners¶
In this chapter, all preconditioners are presented. All preconditioners support local operators. They can be used as global preconditioners via a block-Jacobi scheme, which works locally on each interior matrix. To provide fast application, all preconditioners require extra memory to keep the approximated operator.

template<class OperatorType, class VectorType, typename ValueType>
class Preconditioner : public rocalution::Solver<OperatorType, VectorType, ValueType> Base class for all preconditioners.
 tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::AIChebyshev< OperatorType, VectorType, ValueType >, rocalution::AS< OperatorType, VectorType, ValueType >, rocalution::BlockJacobi< OperatorType, VectorType, ValueType >, rocalution::BlockPreconditioner< OperatorType, VectorType, ValueType >, rocalution::DiagJacobiSaddlePointPrecond< OperatorType, VectorType, ValueType >, rocalution::FSAI< OperatorType, VectorType, ValueType >, rocalution::GS< OperatorType, VectorType, ValueType >, rocalution::IC< OperatorType, VectorType, ValueType >, rocalution::ILU< OperatorType, VectorType, ValueType >, rocalution::ILUT< OperatorType, VectorType, ValueType >, rocalution::Jacobi< OperatorType, VectorType, ValueType >, rocalution::MultiColored< OperatorType, VectorType, ValueType >, rocalution::MultiElimination< OperatorType, VectorType, ValueType >, rocalution::SGS< OperatorType, VectorType, ValueType >, rocalution::SPAI< OperatorType, VectorType, ValueType >, rocalution::TNS< OperatorType, VectorType, ValueType >, rocalution::VariablePreconditioner< OperatorType, VectorType, ValueType >
Code Structure¶
The preconditioners provide a solution to the system \(Mz = r\), where either the solution \(z\) is directly computed by the approximation scheme or it is iteratively obtained with \(z = 0\) initial guess.
Jacobi Method¶

template<class OperatorType, class VectorType, typename ValueType>
class Jacobi : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Jacobi Method.
The Jacobi method is for solving a diagonally dominant system of linear equations \(Ax=b\). It solves for each diagonal element iteratively until convergence, such that
\[ x_{i}^{(k+1)} = (1 - \omega)x_{i}^{(k)} + \frac{\omega}{a_{ii}} \left( b_{i} - \sum\limits_{j=1}^{i-1}{a_{ij}x_{j}^{(k)}} - \sum\limits_{j=i+1}^{n}{a_{ij}x_{j}^{(k)}} \right) \] tparam OperatorType
 can be LocalMatrix or GlobalMatrix
 tparam VectorType
 can be LocalVector or GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Note
Damping parameter \(\omega\) can be adjusted by rocalution::FixedPoint::SetRelaxation()
.
(Symmetric) Gauss-Seidel / (S)SOR Method¶

template<class OperatorType, class VectorType, typename ValueType>
class GS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Gauss-Seidel / Successive Over-Relaxation Method.
The Gauss-Seidel / SOR method is for solving a system of linear equations \(Ax=b\). It approximates the solution iteratively with
\[ x_{i}^{(k+1)} = (1 - \omega) x_{i}^{(k)} + \frac{\omega}{a_{ii}} \left( b_{i} - \sum\limits_{j=1}^{i-1}{a_{ij}x_{j}^{(k+1)}} - \sum\limits_{j=i+1}^{n}{a_{ij}x_{j}^{(k)}} \right), \]with \(\omega \in (0,2)\). tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType>
class SGS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Symmetric Gauss-Seidel / Symmetric Successive Over-Relaxation Method.
The Symmetric Gauss-Seidel / SSOR method is for solving a system of linear equations \(Ax=b\). It approximates the solution iteratively.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Note
Relaxation parameter \(\omega\) can be adjusted by rocalution::FixedPoint::SetRelaxation()
.
Incomplete Factorizations¶
ILU¶

template<class OperatorType, class VectorType, typename ValueType>
class ILU : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Incomplete LU Factorization based on levels.
The Incomplete LU Factorization based on levels computes a sparse lower and a sparse upper triangular matrix such that \(A = LU - R\).
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
ILUT¶

template<class OperatorType, class VectorType, typename ValueType>
class ILUT : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Incomplete LU Factorization based on threshold.
The Incomplete LU Factorization based on threshold computes a sparse lower and a sparse upper triangular matrix such that \(A = LU - R\). Fill-in values are dropped depending on a threshold and the maximal number of fill-ins per row. [SAAD]
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
IC¶

template<class OperatorType, class VectorType, typename ValueType>
class IC : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Incomplete Cholesky Factorization without fill-ins.
The Incomplete Cholesky Factorization computes a sparse lower triangular matrix such that \(A=LL^{T} - R\). Additional fill-ins are dropped and the sparsity pattern of the original matrix is preserved.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
AI Chebyshev¶

template<class OperatorType, class VectorType, typename ValueType>
class AIChebyshev : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Approximate Inverse - Chebyshev Preconditioner.
The Approximate Inverse - Chebyshev Preconditioner is an inverse matrix preconditioner with values from a linear combination of matrix-valued Chebyshev polynomials. [chebpoly]
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::AIChebyshev::Set(int p, ValueType lambda_min, ValueType lambda_max)¶
Set the order and the minimum and maximum eigenvalues.
FSAI¶

template<class OperatorType, class VectorType, typename ValueType>
class FSAI : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Factorized Approximate Inverse Preconditioner.
The Factorized Sparse Approximate Inverse preconditioner computes a direct approximation of \(M^{-1}\) by minimizing the Frobenius norm \(\|I - GL\|_{F}\), where \(L\) denotes the exact lower triangular part of \(A\) and \(G:=M^{-1}\). The FSAI preconditioner is initialized by \(q\), based on the sparsity pattern of \(A^{q}\). However, it is also possible to supply external sparsity patterns in form of the LocalMatrix class. [kolotilina]
Note
The FSAI preconditioner is only suited for symmetric positive definite matrices.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
SPAI¶

template<class OperatorType, class VectorType, typename ValueType>
class SPAI : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> SParse Approximate Inverse Preconditioner.
The SParse Approximate Inverse algorithm is an explicitly computed preconditioner for general sparse linear systems. In its current implementation, only the sparsity pattern of the system matrix is supported. The SPAI computation is based on the minimization of the Frobenius norm \(\|AM - I\|_{F}\). [grote]
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
TNS¶

template<class OperatorType, class VectorType, typename ValueType>
class TNS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Truncated Neumann Series Preconditioner.
The Truncated Neumann Series (TNS) preconditioner is based on \(M^{-1} = K^{T} D^{-1} K\), where \(K=(I - LD^{-1}+(LD^{-1})^{2})\), with the diagonal \(D\) of \(A\) and the strictly lower triangular part \(L\) of \(A\). The preconditioner can be computed in two forms - explicitly and implicitly. In the explicit form, the full construction of \(M\) is performed via matrix-matrix operations, whereas in the implicit form, the application of the preconditioner is based on matrix-vector operations only. The matrix format for the stored matrices can be specified.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
MultiColored Preconditioners¶

template<class OperatorType, class VectorType, typename ValueType>
class MultiColored : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Base class for all multi-colored preconditioners.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::MultiColoredILU< OperatorType, VectorType, ValueType >, rocalution::MultiColoredSGS< OperatorType, VectorType, ValueType >

void rocalution::MultiColored::SetPrecondMatrixFormat(unsigned int mat_format, int blockdim = 1)¶
Set a specific matrix type of the decomposed block matrices.

void rocalution::MultiColored::SetDecomposition(bool decomp)¶
Set if the preconditioner should be decomposed or not.
MultiColored (Symmetric) Gauss-Seidel / (S)SOR¶

template<class OperatorType, class VectorType, typename ValueType>
class MultiColoredGS : public rocalution::MultiColoredSGS<OperatorType, VectorType, ValueType> Multi-Colored Gauss-Seidel / SOR Preconditioner.
The Multi-Colored Gauss-Seidel / SOR preconditioner is based on the splitting of the original matrix. Higher parallelism in solving the forward substitution is obtained by performing a multi-colored decomposition. Details on the Gauss-Seidel / SOR algorithm can be found in the GS preconditioner.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType>
class MultiColoredSGS : public rocalution::MultiColored<OperatorType, VectorType, ValueType> Multi-Colored Symmetric Gauss-Seidel / SSOR Preconditioner.
The Multi-Colored Symmetric Gauss-Seidel / SSOR preconditioner is based on the splitting of the original matrix. Higher parallelism in solving the forward and backward substitution is obtained by performing a multi-colored decomposition. Details on the Symmetric Gauss-Seidel / SSOR algorithm can be found in the SGS preconditioner.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::MultiColoredGS< OperatorType, VectorType, ValueType >

void rocalution::MultiColoredSGS::SetRelaxation(ValueType omega)¶
Set the relaxation parameter for the SOR/SSOR scheme.
Note
The preconditioner matrix format can be changed using rocalution::MultiColored::SetPrecondMatrixFormat().
MultiColored Power(q)-pattern method ILU(p,q)¶

template<class OperatorType, class VectorType, typename ValueType>
class MultiColoredILU : public rocalution::MultiColored<OperatorType, VectorType, ValueType> Multi-Colored Incomplete LU Factorization Preconditioner.
Multi-Colored Incomplete LU Factorization based on the ILU(p) factorization with a power(q)-pattern method. This method provides a higher degree of parallelism of forward and backward substitution compared to the standard ILU(p) preconditioner. [Lukarski2012]
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::MultiColoredILU::Set(int p)¶
Initialize a multicolored ILU(p, p+1) preconditioner.

void rocalution::MultiColoredILU::Set(int p, int q, bool level = true)¶
Initialize a multicolored ILU(p, q) preconditioner.
level = true will perform the factorization with levels
level = false will perform the factorization only on the power(q)pattern
Note
The preconditioner matrix format can be changed using rocalution::MultiColored::SetPrecondMatrixFormat().
MultiElimination Incomplete LU¶

template<class OperatorType, class VectorType, typename ValueType>
class MultiElimination : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> MultiElimination Incomplete LU Factorization Preconditioner.
The MultiElimination Incomplete LU preconditioner is based on the following decomposition
\[\begin{split} A = \begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} I & 0 \\ ED^{-1} & I \end{pmatrix} \times \begin{pmatrix} D & F \\ 0 & \hat{A} \end{pmatrix}, \end{split}\]where \(\hat{A} = C - ED^{-1} F\). To make the inversion of \(D\) easier, we permute the matrix before the factorization with a permutation \(P\) to obtain only diagonal elements in \(D\). The permutation here is based on a maximal independent set. This procedure can be applied to the block matrix \(\hat{A}\); in this way we can perform the factorization recursively. In the last level of the recursion, we need to provide a solution procedure. By the design of the library, this can be any kind of solver. [SAAD]
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

inline int rocalution::MultiElimination::GetSizeDiagBlock(void) const¶
Returns the size of the first (diagonal) block of the preconditioner.

inline int rocalution::MultiElimination::GetLevel(void) const¶
Return the depth of the current level.

void rocalution::MultiElimination::Set(Solver<OperatorType, VectorType, ValueType> &AA_Solver, int level, double drop_off = 0.0)¶
Initialize (recursively) ME-ILU with level (depth of recursion).
AA_Solver - defines the last-block solver
drop_off - defines the drop-off tolerance

void rocalution::MultiElimination::SetPrecondMatrixFormat(unsigned int mat_format, int blockdim = 1)¶
Set a specific matrix type of the decomposed block matrices.
Diagonal Preconditioner for Saddle-Point Problems¶

template<class OperatorType, class VectorType, typename ValueType>
class DiagJacobiSaddlePointPrecond : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Diagonal Preconditioner for Saddle-Point Problems.
Consider the following saddle-point problem
\[\begin{split} A = \begin{pmatrix} K & F \\ E & 0 \end{pmatrix}. \end{split}\]For such problems we can construct a diagonal Jacobi-type preconditioner of type\[\begin{split} P = \begin{pmatrix} K & 0 \\ 0 & S \end{pmatrix}, \end{split}\]with \(S=ED^{-1}F\), where \(D\) holds the diagonal elements of \(K\). The matrix \(S\) is fully constructed (via sparse matrix-matrix multiplication). The preconditioner needs to be initialized with two external solvers/preconditioners - one for the matrix \(K\) and one for the matrix \(S\).
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::DiagJacobiSaddlePointPrecond::Set(Solver<OperatorType, VectorType, ValueType> &K_Solver, Solver<OperatorType, VectorType, ValueType> &S_Solver)¶
Initialize solver for \(K\) and \(S\).
(Restricted) Additive Schwarz Preconditioner¶

template<class OperatorType, class VectorType, typename ValueType>
class AS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Additive Schwarz Preconditioner.
The Additive Schwarz preconditioner relies on a preconditioning technique, where the linear system \(Ax=b\) can be decomposed into small sub-problems based on \(A_{i} = R_{i}^{T}AR_{i}\), where \(R_{i}\) are restriction operators. Those restriction operators produce sub-matrices which overlap. This leads to contributions from two preconditioners on the overlapping area, which are scaled by \(1/2\). [RAS]
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::RAS< OperatorType, VectorType, ValueType >

void rocalution::AS::Set(int nb, int overlap, Solver<OperatorType, VectorType, ValueType> **preconds)¶
Set number of blocks, overlap and array of preconditioners.

template<class OperatorType, class VectorType, typename ValueType>
class RAS : public rocalution::AS<OperatorType, VectorType, ValueType> Restricted Additive Schwarz Preconditioner.
The Restricted Additive Schwarz preconditioner relies on a preconditioning technique, where the linear system \(Ax=b\) can be decomposed into small sub-problems based on \(A_{i} = R_{i}^{T}AR_{i}\), where \(R_{i}\) are restriction operators. The RAS method is a mixture of block Jacobi and the AS scheme. In this case, the sub-matrices contain overlapped areas from other blocks, too. [RAS]
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>
The overlapped area is shown in Fig. 11.
BlockJacobi (MPI) Preconditioner¶

template<class OperatorType, class VectorType, typename ValueType>
class BlockJacobi : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> BlockJacobi Preconditioner.
The BlockJacobi preconditioner is designed to wrap any local preconditioner and apply it in a global block fashion locally on each interior matrix.
 tparam OperatorType
 can be GlobalMatrix
 tparam VectorType
 can be GlobalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::BlockJacobi::Set(Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> &precond)¶
Set local preconditioner.
The BlockJacobi (MPI) preconditioner is shown in Fig. 12.
Block Preconditioner¶

template<class OperatorType, class VectorType, typename ValueType>
class BlockPreconditioner : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> BlockPreconditioner.
When handling vector fields, one can typically use different preconditioners and/or solvers for the different blocks. For such problems, the library provides a block-type preconditioner. This preconditioner builds the following block-type matrix
\[\begin{split} P = \begin{pmatrix} A_{d} & 0 & . & 0 \\ B_{1} & B_{d} & . & 0 \\ . & . & . & . \\ Z_{1} & Z_{2} & . & Z_{d} \end{pmatrix} \end{split}\]The solution of \(P\) can be performed in two ways. It can be solved by block-lower-triangular sweeps with inversion of the blocks \(A_{d} \ldots Z_{d}\) and with a multiplication of the corresponding blocks. This is set by SetLSolver() (which is the default solution scheme). Alternatively, it can be used only with an inverse of the diagonal \(A_{d} \ldots Z_{d}\) (block-Jacobi type) by using SetDiagonalSolver().
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

void rocalution::BlockPreconditioner::Set(int n, const int *size, Solver<OperatorType, VectorType, ValueType> **D_solver)¶
Set number, size and diagonal solver.

void rocalution::BlockPreconditioner::SetDiagonalSolver(void)¶
Set diagonal solver mode.

void rocalution::BlockPreconditioner::SetLSolver(void)¶
Set lower triangular sweep mode.

void rocalution::BlockPreconditioner::SetExternalLastMatrix(const OperatorType &mat)¶
Set external last block matrix.

virtual void rocalution::BlockPreconditioner::SetPermutation(const LocalVector<int> &perm)¶
Set permutation vector.
Variable Preconditioner¶

template<class OperatorType, class VectorType, typename ValueType>
class VariablePreconditioner : public rocalution::Preconditioner<OperatorType, VectorType, ValueType> Variable Preconditioner.
The Variable Preconditioner can hold a selection of preconditioners. Thus, any types of preconditioners can be combined. For example, the variable preconditioner can combine Jacobi, GS and ILU - then, the first iteration of the iterative solver will apply Jacobi, the second iteration will apply GS and the third iteration will apply ILU. After that, the solver will start again with Jacobi, GS, ILU.
 tparam OperatorType
 can be LocalMatrix
 tparam VectorType
 can be LocalVector
 tparam ValueType
 can be float, double, std::complex<float> or std::complex<double>

virtual void rocalution::VariablePreconditioner::SetPreconditioner(int n, Solver<OperatorType, VectorType, ValueType> **precond)¶
Set the preconditioner sequence.
Backends¶
The support of accelerator devices is embedded in the structure of rocALUTION. The primary goal is to use this technology whenever possible to decrease the computational time.
Note
Not all functions are ported and present on the accelerator backend. This limited functionality is natural, since not all operations can be performed efficiently on the accelerators (e.g. sequential algorithms, I/O from the file system, etc.).
Currently, rocALUTION supports HIP capable GPUs starting with ROCm 1.9. Due to its design, the library can be easily extended to support future accelerator technologies. Such an extension of the library will not affect the algorithms which are based on it.
If a particular function is not implemented for the accelerator in use, the library will move the object to the host and compute the routine there. In this case, a warning message of level 2 will be printed. For example, if the user wants to perform an ILUT factorization on the HIP backend, which is currently not available, the library will move the object to the host, perform the routine there and print the following warning message
*** warning: LocalMatrix::ILUTFactorize() is performed on the host
Moving Objects To and From the Accelerator¶
All objects in rocALUTION can be moved to the accelerator and to the host.

virtual void rocalution::BaseRocalution::MoveToAccelerator(void) = 0¶
Move the object to the accelerator backend.

virtual void rocalution::BaseRocalution::MoveToHost(void) = 0¶
Move the object to the host backend.
LocalMatrix<ValueType> mat;
LocalVector<ValueType> vec1, vec2;
// Perform matrix vector multiplication on the host
mat.Apply(vec1, &vec2);
// Move data to the accelerator
mat.MoveToAccelerator();
vec1.MoveToAccelerator();
vec2.MoveToAccelerator();
// Perform matrix vector multiplication on the accelerator
mat.Apply(vec1, &vec2);
// Move data to the host
mat.MoveToHost();
vec1.MoveToHost();
vec2.MoveToHost();
Asynchronous Transfers¶
The rocALUTION library also provides asynchronous transfers of data between host and HIP backend.

virtual void rocalution::BaseRocalution::MoveToAcceleratorAsync(void)¶
Move the object to the accelerator backend with async move.

virtual void rocalution::BaseRocalution::MoveToHostAsync(void)¶
Move the object to the host backend with async move.

virtual void rocalution::BaseRocalution::Sync(void)¶
Sync (the async move)
This can be done with rocalution::LocalVector::CopyFromAsync() and rocalution::LocalMatrix::CopyFromAsync() or with MoveToAcceleratorAsync() and MoveToHostAsync(). These functions return immediately and perform the asynchronous transfer in background mode. The synchronization is done with Sync().
When using the MoveToAcceleratorAsync() and MoveToHostAsync() functions, the object will still point to its original location (i.e. the host for MoveToAcceleratorAsync() and the accelerator for MoveToHostAsync()). The object will switch to the new location after the Sync() function is called.
Note
The objects should not be modified during an active asynchronous transfer. However, if this happens, the values after the synchronization might be wrong.
Note
To use the asynchronous transfers, you need to enable the pinned memory allocation. Uncomment #define ROCALUTION_HIP_PINNED_MEMORY in src/utils/allocate_free.hpp.
Systems without Accelerators¶
rocALUTION provides full code compatibility on systems without accelerators. The user can take the code from a GPU system, recompile the same code on a machine without a GPU, and it will provide the same results. Any calls to rocalution::BaseRocalution::MoveToAccelerator() and rocalution::BaseRocalution::MoveToHost() will be ignored.
Memory Allocations¶
All data that is passed to and from rocALUTION uses the memory handling functions described in the code. By default, the library uses standard C++ new and delete functions for the host data. This can be changed by modifying src/utils/allocate_free.cpp.
Allocation Problems¶
If an allocation fails, the library will report an error and exit. If the user requires special treatment, it has to be placed in src/utils/allocate_free.cpp.
Memory Alignment¶
The library can also handle special memory alignment functions. This feature needs to be uncommented before the compilation process in src/utils/allocate_free.cpp.
Pinned Memory Allocation (HIP)¶
By default, the standard host memory allocation is realized by C++ new and delete. For faster PCI-Express transfers on the HIP backend, the user can also use pinned host memory. This can be activated by uncommenting the corresponding macro in src/utils/allocate_free.hpp.
Remarks¶
Performance¶
Solvers can be built on the accelerator. In many cases, this is faster compared to building on the host.
Small-sized problems tend to perform better on the host (CPU), due to the good caching system, while large-sized problems typically perform better on the accelerator devices.
Avoid accessing vectors using [] operators. Use techniques based on rocalution::LocalVector::SetDataPtr() and rocalution::LocalVector::LeaveDataPtr() instead.
By default, the OpenMP backend picks the maximum number of threads available. However, if your CPU supports SMT, it will allow running twice as many threads as there are cores. In many cases, this leads to lower performance. You may observe a performance increase by setting the number of threads (see rocalution::set_omp_threads_rocalution()) equal to the number of physical cores.
If you need to solve a system with multiple right-hand sides, avoid constructing the solver/preconditioner every time.
If you are solving similar linear systems, you might want to consider to use the same preconditioner to avoid long building phases.
In most cases, the classical CSR matrix format performs very similarly to all other formats on the CPU. On accelerators with many cores (such as GPUs), formats such as DIA and ELL typically perform better. However, for general sparse matrices one could use the HYB format to avoid large memory overhead. The multi-colored preconditioners can be performed in ELL for most of the matrices.
Not all matrix conversions are performed on the device; the platform will give you a warning if the object needs to be moved.
If you are deploying the rocALUTION library into another software framework, try to design your integration functions to avoid calling rocalution::init_rocalution() and rocalution::stop_rocalution() every time you call a solver in the library.
Be sure to compile the library with the correct optimization level (-O3).
Check whether your solver is really performed on the accelerator by printing the matrix information (rocalution::BaseRocalution::Info()) just before calling the rocalution::Solver::Solve() function.
Check the configuration of the library for your hardware with rocalution::info_rocalution().
The mixed-precision defect correction technique is recommended for accelerators (e.g. GPUs) with partial or no double precision support. The stopping criterion for the inner solver has to be tuned well for good performance.
Accelerators¶
Avoid PCI-Express communication whenever possible (such as copying data from/to the accelerator). Also check the internal structure of the functions.
Pinned memory allocation (page-locked) can be used for all host memory allocations when using the HIP backend. This provides faster transfers over PCI-Express and allows asynchronous data movement. By default, this option is disabled. To enable the pinned memory allocation, uncomment #define ROCALUTION_HIP_PINNED_MEMORY in file src/utils/allocate_free.hpp.
Asynchronous transfers are available for the HIP backend.
Correctness¶
If you are assembling or modifying your matrix, you can check it in octave/MATLAB by writing it into a matrix-market file and reading it via the mmread() function. You can also input a MATLAB/octave matrix in this way.
Be sure to set the correct relative and absolute tolerance values for your problem.
Check whether the relative stopping criterion is computed based on \(\|b-Ax^{k}\|_{2}/\|b-Ax^{0}\|_{2}\) or \(\|b-Ax^{k}\|_{2}/\|b\|_{2}\).
Solving very ill-conditioned problems by iterative methods without a proper preconditioning technique might produce wrong results. The solver could stop by showing a low relative tolerance based on the residual, but the computed solution might still be wrong.
Building the Krylov subspace for many ill-conditioned problems can be a tricky task. To ensure orthogonality in the subspace, you might want to perform double orthogonalization (i.e. re-orthogonalization) to avoid rounding errors.
If you read/write matrices/vectors from files, check the ASCII format of the values (e.g. 34.3434 or 3.43434E+01).
Supported Targets¶
Currently, rocALUTION is supported under the following operating systems
Ubuntu 16.04, Ubuntu 18.04
CentOS 7
SLES 15
To compile and run rocALUTION with HIP support, AMD ROCm Platform 2.9 or newer is required.
The following HIP devices are currently supported
gfx803 (e.g. Fiji)
gfx900 (e.g. Vega10, MI25)
gfx906 (e.g. Vega20, MI50, MI60)
gfx908
Design Documentation¶
Design and Philosophy¶
rocALUTION is written in C++ and HIP.
The main idea of the rocALUTION objects is that they are separated from the actual hardware specification. Once you declare a matrix, a vector or a solver, they are initially allocated on the host (CPU). Then, every object can be moved to a selected accelerator by a simple function call. The whole execution mechanism is based on runtime type information (RTTI), which allows you to select where and how you want to perform the operations at run time. This is in contrast to template-based libraries, which need this information at compile time.
The philosophy of the library is to abstract the hardware-specific functions and routines from the actual program that describes the algorithm. For most large simulation software based on sparse computation, it is hard, and almost impossible, to adapt and port the implementation in order to use every new technology. On the other hand, new high performance accelerators and devices have the capability to decrease the computational time significantly in many critical parts.
This abstraction layer of the hardware-specific routines is the core of the rocALUTION design. It is built to explore a fine-grained level of parallelism suited for multi/many-core devices. This is in contrast to most of the available parallel sparse libraries, which are mainly based on domain decomposition techniques. Thus, the design of the iterative solvers and the preconditioners is very different. Another cornerstone of rocALUTION is the native support of accelerators - the memory allocation, transfers and specific hardware functions are handled internally in the library.
rocALUTION helps you to use accelerator technologies but does not force you to use them. Even if you offload your algorithms and solvers to the accelerator device, the same source code can be compiled and executed in a system without any accelerator.
Naturally, not all routines and algorithms can be performed efficiently on many-core systems (i.e. on accelerators). To provide full functionality, the library has internal mechanisms to check if a particular routine is implemented on the accelerator. If not, the object is moved to the host and the routine is computed there. This guarantees that your code will run with any accelerator, regardless of the functionality available for it.
Library Source Code Organization¶
The rocALUTION library is split into three major parts:
The src/base/ directory contains all source code that is built on top of the BaseRocalution object as well as the backend structure.
src/solvers/ contains all solvers, preconditioners and their control classes.
In src/utils/, memory (de)allocation, logging, communication, timing and math helper functions are placed.
The src/base/ directory¶
Backend Manager¶
The support of accelerator devices is embedded in the structure of rocALUTION. The primary goal is to use this technology whenever possible to decrease the computational time.
Each technology has its own backend implementation, dealing with platform-specific functionality such as initialization, synchronization and reservation. Currently available backends are for CPU (naive, OpenMP, MPI) and GPU (HIP).
Note
Not all functions are ported and present on the accelerator backend. This limited functionality is natural, since not all operations can be performed efficiently on the accelerators (e.g. sequential algorithms, I/O from the file system, etc.).
The Operator and Vector classes¶
The Operator and Vector classes and their derived local and global classes are the classes exposed by the rocALUTION API.
While granting the user access to all relevant functionality, all hardware relevant implementation details are hidden.
Those linear operators and vectors are the main objects in rocALUTION.
They can be moved to an accelerator at run time.
The linear operators are defined as local or global matrices (i.e. on a single node or distributed/multi-node) and local stencils (i.e. matrix-free linear operations). The only template parameter of the operators and vectors is the data type (ValueType). Fig. 13 gives an overview of supported operators and vectors.
Each of the objects contains a local copy of the hardware descriptor created by the init_rocalution function.
Additionally, each local object that is derived from an operator or vector contains a pointer to a Base-class, a Host-class and an Accelerator-class of the same type (e.g. a LocalMatrix contains pointers to a BaseMatrix, HostMatrix and AcceleratorMatrix).
The Base-class pointer will always point to either the Host-class or the Accelerator-class object, depending on the run time decision of the local object.
Base-classes and their derivatives are further explained in The BaseMatrix and BaseVector classes.
Furthermore, each global object derived from an operator or vector embeds two Local-classes of the same type to store the interior and ghost parts of the global object (e.g. a GlobalVector contains two LocalVector objects).
For more details on distributed data structures, see the user manual.
The BaseMatrix and BaseVector classes¶
The data of a local object is an object pointing to the BaseMatrix class. This pointer refers to either a HostMatrix or an AcceleratorMatrix. The AcceleratorMatrix is created by an object with an implementation in the backend and a matrix format. Switching between host and accelerator matrices is performed in the LocalMatrix class. The LocalVector is organized in the same way.
Each matrix format has its own class for the host and for the accelerator backend. All matrix classes are derived from the BaseMatrix, which provides the base interface for computation as well as for data accessing.
Each local object contains a pointer to a Base-class object. While the Base-class is mainly pure virtual, its derivatives implement all platform-specific functionality. Each of them is coupled to a rocALUTION backend descriptor.
While the HostMatrix, HostStencil and HostVector classes implement all host functionality, AcceleratorMatrix, AcceleratorStencil and AcceleratorVector contain accelerator related device code.
Each of the backend specializations are located in a different directory, e.g. src/base/host for host related classes and src/base/hip for accelerator / HIP related classes.
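To illustrate the runtime backend switching described above, the following minimal sketch moves a matrix and two vectors to the accelerator before applying the operator (the file name matrix.mtx is a placeholder; error handling is omitted):

```cpp
#include <rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    init_rocalution();

    LocalMatrix<double> mat;
    LocalVector<double> x, y;

    // Read a matrix and allocate vectors on the host backend
    mat.ReadFileMTX("matrix.mtx"); // "matrix.mtx" is a placeholder file name
    x.Allocate("x", mat.GetN());
    y.Allocate("y", mat.GetM());
    x.Ones();

    // Move the objects to the accelerator; the Base-class pointer inside each
    // object now points to its Accelerator-class implementation
    mat.MoveToAccelerator();
    x.MoveToAccelerator();
    y.MoveToAccelerator();

    // This SpMV is executed on the accelerator (or on the host, if no
    // accelerator is available)
    mat.Apply(x, &y);

    stop_rocalution();
    return 0;
}
```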
ParallelManager¶
The parallel manager class handles the communication and the mapping of the global operators. Each global operator and vector needs to be initialized with a valid parallel manager in order to perform any operation. For many distributed simulations, the underlying operator is already distributed. This information needs to be passed to the parallel manager. All communication functionality for the implementation of global algorithms is available in the rocALUTION communicator in src/utils/communicator.hpp. For more details on distributed data structures, see the user manual.
The src/solvers/ directory¶
The Solver class and its derived classes can be found in src/solvers. The directory structure is further split into the subclasses DirectLinearSolver in src/solvers/direct, IterativeLinearSolver in src/solvers/krylov, BaseMultiGrid in src/solvers/multigrid and Preconditioner in src/solvers/preconditioners.
Each solver uses an Operator, a Vector and a data type as template parameters to solve a linear system of equations.
The actual solver algorithm is implemented using Operator and Vector functionality.
Most of the solvers can operate on linear operators such as LocalMatrix, LocalStencil and GlobalMatrix, i.e. the solvers can be performed locally (on a shared memory system) or in a distributed manner (on a cluster) via MPI.
All solvers and preconditioners need three template parameters: Operator, Vector and scalar type.
The Solver class is purely virtual and provides an interface for
SetOperator to set the operator, i.e. the user can pass the matrix here.
Build to build the solver (including preconditioners, sub-solvers, etc.). The user needs to specify the operator first, before building the solver.
Solve to solve the sparse linear system. The user needs to pass a right-hand side and a solution / initial guess vector.
Print to show solver information.
ReBuildNumeric to only rebuild the solver numerically (if possible).
MoveToHost and MoveToAccelerator to offload the solver (including preconditioners and sub-solvers) to the host / accelerator.
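Using this interface, a typical solver setup looks similar to the following sketch (assuming mat, rhs and x are already initialized LocalMatrix and LocalVector objects):

```cpp
// Conjugate Gradient solver with a Jacobi preconditioner
CG<LocalMatrix<double>, LocalVector<double>, double>     ls;
Jacobi<LocalMatrix<double>, LocalVector<double>, double> p;

ls.SetOperator(mat);      // pass the matrix
ls.SetPreconditioner(p);  // optional preconditioner
ls.Build();               // build solver and preconditioner

ls.Solve(rhs, &x);        // solve Ax = rhs
ls.Print();               // print solver information

ls.Clear();
```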
The src/utils/ directory¶
In the src/utils directory, all commonly used host (de)allocation, timing, math, communication and logging functionality is gathered.
Furthermore, the rocALUTION IndexType2, which is the indexing type for global, distributed structures, can be adjusted in src/utils/types.hpp. By default, rocALUTION uses 64-bit wide global indexing.
Note
It is not recommended to switch to 32-bit global indexing.
In src/utils/def.hpp,
the verbosity level VERBOSE_LEVEL can be adjusted, see Verbose Output,
the debug mode DEBUG_MODE can be enabled, see Debug Output,
the MPI logging rank LOG_MPI_RANK can be modified, see Verbose Output and MPI,
and object tracking OBJ_TRACKING_OFF can be toggled, see Automatic Object Tracking.
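For illustration, the relevant part of src/utils/def.hpp looks similar to the following excerpt (a sketch; the exact default values and comments may differ between library versions):

```cpp
// src/utils/def.hpp (excerpt)

// Verbosity level of the library (see Verbose Output)
#define VERBOSE_LEVEL 2

// Uncomment to enable debug mode (see Debug Output)
// #define DEBUG_MODE

// Uncomment to restrict MPI logging to a single rank (see Verbose Output and MPI)
// #define LOG_MPI_RANK 0

// Uncomment to disable automatic object tracking
// #define OBJ_TRACKING_OFF
```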
Functionality Extension Guidelines¶
The main purpose of this chapter is to give an overview of the different ways to add user-specific routines, solvers or preconditioners to the rocALUTION library package. Additional features can be added in multiple ways. Solver and preconditioner functionality that builds on already implemented backend functionality will perform well on accelerator devices without the need for expert GPU programming knowledge. Likewise, users that are not interested in using accelerators will not be confronted with HIP and GPU related programming tasks when adding functionality.
In the following sections, different levels of functionality enhancements are illustrated. These examples can be used as guidelines to extend rocALUTION step by step with your own routines. Please note that user-added routines can also be contributed to the main GitHub repository using pull requests.
LocalMatrix Functionality Extension¶
In this example, the LocalMatrix class is extended by an additional routine. The routine shall support both the host and the accelerator backend. Furthermore, the routine requires the matrix to be in CSR format.
API Enhancement¶
To make the new routine available through the API, we first need to modify the LocalMatrix class. The corresponding header file local_matrix.hpp is located in src/base/. The new routine can be added as a public member function, e.g.
...
void ConvertTo(unsigned int matrix_format);
void MyNewFunctionality(void);
virtual void Apply(const LocalVector<ValueType>& in, LocalVector<ValueType>* out) const;
virtual void ApplyAdd(const LocalVector<ValueType>& in,
...
For the implementation of the new API function, it is important to know where this functionality will be available. Since MyNewFunctionality() is only supported for CSR matrices, format conversions are required to add support for any backend and matrix format. This is subject to the API function implementation:
template <typename ValueType>
void LocalMatrix<ValueType>::MyNewFunctionality(void)
{
    // Debug logging
    log_debug(this, "LocalMatrix::MyNewFunctionality()");

#ifdef DEBUG_MODE
    // If we are in debug mode, perform an additional matrix sanity check
    this->Check();
#endif

    // If no non-zero entries, do nothing
    if(this->GetNnz() > 0)
    {
        // As we want to implement our function only for CSR, we first need to convert
        // the matrix to CSR format
        unsigned int format = this->GetFormat();
        this->ConvertToCSR();

        // Call the corresponding base matrix implementation
        bool err = this->matrix_->MyNewFunctionality();

        // Check its return type
        if((err == false) && (this->is_host_() == true))
        {
            // If our matrix is on the host, the function call failed.
            LOG_INFO("Computation of LocalMatrix::MyNewFunctionality() failed");
            this->Info();
            FATAL_ERROR(__FILE__, __LINE__);
        }

        // Run backup algorithm on host, in case the accelerator version failed
        if(err == false)
        {
            // Move matrix to host
            bool is_accel = this->is_accel_();
            this->MoveToHost();

            // Try again
            if(this->matrix_->MyNewFunctionality() == false)
            {
                LOG_INFO("Computation of LocalMatrix::MyNewFunctionality() failed");
                this->Info();
                FATAL_ERROR(__FILE__, __LINE__);
            }

            // On a successful host call, move the data back to the accelerator
            // if initial data was on the accelerator
            if(is_accel == true)
            {
                // Print a warning, that the algorithm was performed on the host
                // even though the initial data was on the device
                LOG_VERBOSE_INFO(2, "*** warning: LocalMatrix::MyNewFunctionality() was performed on the host");

                this->MoveToAccelerator();
            }
        }

        // Convert the matrix back to its initial format
        if(format != CSR)
        {
            // Print a warning, that the algorithm was performed in CSR format
            // even though the initial matrix format was different
            LOG_VERBOSE_INFO(2, "*** warning: LocalMatrix::MyNewFunctionality() was performed in CSR format");

            this->ConvertTo(format);
        }
    }

#ifdef DEBUG_MODE
    // Perform an additional sanity check in debug mode, because this is a non-const function
    this->Check();
#endif
}
Similarly, host-only functions can be implemented. In this case, the initial data explicitly needs to be moved to the host backend by the API implementation.
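A host-only API function could, for instance, follow this pattern (a hedged sketch; MyHostOnlyFunctionality is a hypothetical routine used only for illustration, modeled on the structure shown above):

```cpp
// Sketch of a host-only API function: data is explicitly moved to the host,
// the host implementation is called, and the data is moved back afterwards.
template <typename ValueType>
void LocalMatrix<ValueType>::MyHostOnlyFunctionality(void)
{
    log_debug(this, "LocalMatrix::MyHostOnlyFunctionality()");

    if(this->GetNnz() > 0)
    {
        // Remember where the data initially resided
        bool is_accel = this->is_accel_();

        // Host-only: move data to the host unconditionally
        this->MoveToHost();

        if(this->matrix_->MyHostOnlyFunctionality() == false)
        {
            LOG_INFO("Computation of LocalMatrix::MyHostOnlyFunctionality() failed");
            this->Info();
            FATAL_ERROR(__FILE__, __LINE__);
        }

        // Restore the original backend
        if(is_accel == true)
        {
            LOG_VERBOSE_INFO(2, "*** warning: LocalMatrix::MyHostOnlyFunctionality() was performed on the host");
            this->MoveToAccelerator();
        }
    }
}
```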
The next step is the implementation of the actual functionality in the BaseMatrix class.
Enhancement of the BaseMatrix class¶
To make the new routine available in the base class, we first need to modify the BaseMatrix class. The corresponding header file base_matrix.hpp is located in src/base/. The new routine can be added as a public member function, e.g.
...
virtual bool ILU0Factorize(void);
/// Perform MyNewFunctionality algorithm
virtual bool MyNewFunctionality(void);
/// Perform LU factorization
...
We do not implement MyNewFunctionality() as purely virtual, since we do not supply an implementation for all base classes. We decided to implement it only for the CSR format and thus need to return an error flag, such that the LocalMatrix class is aware of the failure and can convert the matrix to CSR.
template <typename ValueType>
bool BaseMatrix<ValueType>::MyNewFunctionality(void)
{
    return false;
}
Platform-specific Host Implementation¶
So far, our new function will always fail, as there is no backend implementation available yet. To satisfy the rocALUTION host backup philosophy, we need to make sure that there is always a host implementation available. This host implementation needs to be placed in src/base/host/host_matrix_csr.cpp, as we decided to make the routine available for the CSR format.
...
virtual bool ILUTFactorize(double t, int maxrow);
virtual bool MyNewFunctionality(void);
virtual void LUAnalyse(void);
...
template <typename ValueType>
bool HostMatrixCSR<ValueType>::MyNewFunctionality(void)
{
    // Place some asserts to verify sanity of input data
    // Our algorithm works only for square matrices
    assert(this->nrow_ == this->ncol_);
    assert(this->nnz_ > 0);

    // Place the actual host based algorithm here:
    // for illustration, we scale the matrix by its inverse diagonal
    for(int i = 0; i < this->nrow_; ++i)
    {
        int row_begin = this->mat_.row_offset[i];
        int row_end   = this->mat_.row_offset[i + 1];

        bool      diag_found = false;
        ValueType inv_diag   = static_cast<ValueType>(0);

        // Find the diagonal entry
        for(int j = row_begin; j < row_end; ++j)
        {
            if(this->mat_.col[j] == i)
            {
                diag_found = true;
                inv_diag   = static_cast<ValueType>(1) / this->mat_.val[j];
            }
        }

        // Our algorithm works only with full rank
        assert(diag_found == true);

        // Scale the row
        for(int j = row_begin; j < row_end; ++j)
        {
            this->mat_.val[j] *= inv_diag;
        }
    }

    return true;
}
Platform-specific HIP Implementation¶
We can now add an additional implementation for the HIP backend, using the HIP programming framework. This will make our algorithm available on accelerators, and rocALUTION will not switch to the host backend on function calls anymore. The HIP implementation needs to be added to src/base/hip/hip_matrix_csr.cpp in this case.
...
virtual bool ILU0Factorize(void);
virtual bool MyNewFunctionality(void);
virtual bool ICFactorize(BaseVector<ValueType>* inv_diag = NULL);
...
template <typename ValueType>
bool HIPAcceleratorMatrixCSR<ValueType>::MyNewFunctionality(void)
{
    // Place some asserts to verify sanity of input data
    // Our algorithm works only for square matrices
    assert(this->nrow_ == this->ncol_);
    assert(this->nnz_ > 0);

    // Enqueue the HIP kernel
    hipLaunchKernelGGL((kernel_csr_mynewfunctionality),
                       dim3((this->nrow_ - 1) / this->local_backend_.HIP_block_size + 1),
                       dim3(this->local_backend_.HIP_block_size),
                       0,
                       0,
                       this->mat_.row_offset,
                       this->mat_.col,
                       this->mat_.val);

    // Check for HIP execution error before successfully returning
    CHECK_HIP_ERROR(__FILE__, __LINE__);

    return true;
}
The corresponding HIP kernel should be placed in src/base/hip/hip_kernels_csr.hpp.
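The kernel itself could look similar to the following sketch, which mirrors the inverse-diagonal scaling of the host implementation (an illustrative assumption; the launch shown above would then additionally need to pass the row count this->nrow_ to the kernel):

```cpp
// Sketch of a possible kernel_csr_mynewfunctionality, placed in
// src/base/hip/hip_kernels_csr.hpp. One thread processes one matrix row and
// scales it by its inverse diagonal entry, mirroring the host implementation.
template <typename ValueType, typename IndexType>
__global__ void kernel_csr_mynewfunctionality(IndexType        nrow,
                                              const IndexType* row_offset,
                                              const IndexType* col,
                                              ValueType*       val)
{
    IndexType row = blockIdx.x * blockDim.x + threadIdx.x;

    if(row >= nrow)
    {
        return;
    }

    ValueType inv_diag = static_cast<ValueType>(1);

    // Find the diagonal entry of this row
    for(IndexType j = row_offset[row]; j < row_offset[row + 1]; ++j)
    {
        if(col[j] == row)
        {
            inv_diag = static_cast<ValueType>(1) / val[j];
            break;
        }
    }

    // Scale the row
    for(IndexType j = row_offset[row]; j < row_offset[row + 1]; ++j)
    {
        val[j] *= inv_diag;
    }
}
```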
Adding a Solver¶
In this example, a new solver shall be added to rocALUTION.
API Enhancement¶
First, the API for the new solver must be defined. In this example, a new IterativeLinearSolver is added, for which the existing CG solver is a good template. Thus, we first copy src/solvers/krylov/cg.hpp to src/solvers/krylov/mysolver.hpp and src/solvers/krylov/cg.cpp to src/solvers/krylov/mysolver.cpp (assuming we add a Krylov subspace solver).
Next, modify mysolver.hpp and mysolver.cpp to your needs (e.g. change the solver name from CG to MySolver). Each of the virtual functions in the class needs an implementation.
MySolver(): The constructor of the new solver class.
~MySolver(): The destructor of the new solver class. It should call the Clear() function.
void Print(void) const: This function should print some information about the solver.
void Build(void): This function creates all required structures of the solver, e.g. allocates memory and sets the backend of temporary objects.
void BuildMoveToAcceleratorAsync(void): This function should move all solver related objects asynchronously to the accelerator device.
void Sync(void): This function should synchronize all solver related objects.
void ReBuildNumeric(void): This function should rebuild the solver only numerically.
void Clear(void): This function should clean up all solver relevant structures that have been created using Build().
void SolveNonPrecond_(const VectorType& rhs, VectorType* x): This function should perform the solving phase Ax=y without the use of a preconditioner.
void SolvePrecond_(const VectorType& rhs, VectorType* x): This function should perform the solving phase Ax=y with the use of a preconditioner.
void PrintStart_(void) const: This protected function is called upon solver start.
void PrintEnd_(void) const: This protected function is called when the solver ends.
void MoveToHostLocalData_(void): This protected function should move all local solver objects to the host.
void MoveToAcceleratorLocalData_(void): This protected function should move all local solver objects to the accelerator.
Of course, additional solver-specific member functions can be introduced.
Then, to make the new solver visible, we have to add it to the src/rocalution.hpp header:
...
#include "solvers/krylov/cg.hpp"
#include "solvers/krylov/mysolver.hpp"
#include "solvers/krylov/cr.hpp"
...
Finally, the new solver must be added to the CMake compilation list, found in src/solvers/CMakeLists.txt:
...
set(SOLVERS_SOURCES
solvers/krylov/cg.cpp
solvers/krylov/mysolver.cpp
solvers/krylov/fcg.cpp
...
Functionality Table¶
The following tables give an overview of whether a rocALUTION routine is implemented on the host backend, on the accelerator backend, or on both.
LocalMatrix and LocalVector classes¶
All matrix operations (except SpMV) require a CSR matrix.
Note
If the input matrix is not a CSR matrix, an internal conversion will be performed to CSR format, followed by a back conversion to the previous format after the operation. In this case, a warning message on verbosity level 2 will be printed.
LocalMatrix function 
Comment 
Host 
HIP 

Obtain the matrix format 
Yes 
Yes 

Check the matrix for structure and value validity 
Yes 
No 

Allocate CSR matrix 
Yes 
Yes 

Allocate BCSR matrix 
Yes 
Yes 

Allocate MCSR matrix 
Yes 
Yes 

Allocate COO matrix 
Yes 
Yes 

Allocate DIA matrix 
Yes 
Yes 

Allocate ELL matrix 
Yes 
Yes 

Allocate HYB matrix 
Yes 
Yes 

Allocate DENSE matrix 
Yes 
Yes 

Initialize matrix with externally allocated CSR data 
Yes 
Yes 

Initialize matrix with externally allocated MCSR data 
Yes 
Yes 

Initialize matrix with externally allocated COO data 
Yes 
Yes 

Initialize matrix with externally allocated DIA data 
Yes 
Yes 

Initialize matrix with externally allocated ELL data 
Yes 
Yes 

Initialize matrix with externally allocated DENSE data 
Yes 
Yes 

Direct Memory access 
Yes 
Yes 

Direct Memory access 
Yes 
Yes 

Direct Memory access 
Yes 
Yes 

Direct Memory access 
Yes 
Yes 

Direct Memory access 
Yes 
Yes 

Direct Memory access 
Yes 
Yes 

Set all matrix entries to zero 
Yes 
Yes 

Scale all matrix nonzeros 
Yes 
Yes 

Scale matrix diagonal 
Yes 
Yes 

Scale matrix off-diagonal entries 
Yes 
Yes 

Add scalar to all matrix nonzeros 
Yes 
Yes 

Add scalar to matrix diagonal 
Yes 
Yes 

Add scalar to matrix off-diagonal entries 
Yes 
Yes 

Extract submatrix 
Yes 
Yes 

Extract array of non-overlapping submatrices 
Yes 
Yes 

Extract matrix diagonal 
Yes 
Yes 

Extract inverse matrix diagonal 
Yes 
Yes 

Extract lower triangular matrix 
Yes 
Yes 

Extract upper triangular matrix 
Yes 
Yes 

(Forward) permute the matrix 
Yes 
Yes 

(Backward) permute the matrix 
Yes 
Yes 

Create CMK permutation vector 
Yes 
No 

Create reverse CMK permutation vector 
Yes 
No 

Create connectivity (increasing nnz per row) permutation vector 
Yes 
No 

Create multicoloring decomposition of the matrix 
Yes 
No 

Create maximal independent set decomposition of the matrix 
Yes 
No 

Create permutation where zero diagonal entries are mapped to the last block 
Yes 
No 

Create ILU(0) factorization 
Yes 
No 

Create LU factorization 
Yes 
No 

Create ILU(t,m) factorization 
Yes 
No 

Create ILU(p) factorization 
Yes 
No 

Create IC factorization 
Yes 
No 

Create QR decomposition 
Yes 
No 

Read matrix from matrix market file 
Yes 
No 

Write matrix to matrix market file 
Yes 
No 

Read matrix from binary file 
Yes 
No 

Write matrix to binary file 
Yes 
No 

Copy matrix (values and structure) from another LocalMatrix 
Yes 
Yes 

Copy matrix asynchronously 
Yes 
Yes 

Clone an entire matrix (values, structure and backend) from another LocalMatrix 
Yes 
Yes 

Update CSR matrix values (structure remains identical) 
Yes 
Yes 

Copy (import) CSR matrix 
Yes 
Yes 

Copy (export) CSR matrix 
Yes 
Yes 

Copy (import) COO matrix 
Yes 
Yes 

Copy (export) COO matrix 
Yes 
Yes 

Allocate and copy (import) a CSR matrix from host 
Yes 
No 

Convert a matrix to CSR format 
Yes 
No 

Convert a matrix to MCSR format 
Yes 
No 

Convert a matrix to BCSR format 
Yes 
No 

Convert a matrix to COO format 
Yes 
Yes 

Convert a matrix to ELL format 
Yes 
Yes 

Convert a matrix to DIA format 
Yes 
Yes 

Convert a matrix to HYB format 
Yes 
Yes 

Convert a matrix to DENSE format 
Yes 
No 

Convert a matrix 
Yes 

Perform symbolic power computation (structure only) 
Yes 
No 

Matrix addition 
Yes 
No 

Multiply two matrices 
Yes 
No 

Multiply matrix with diagonal matrix (stored in LocalVector) 
Yes 
Yes 

Multiply matrix with diagonal matrix (stored in LocalVector) from left 
Yes 
Yes 

Multiply matrix with diagonal matrix (stored in LocalVector) from right 
Yes 
Yes 

Compute the spectrum approximation with Gershgorin circles theorem 
Yes 
No 

Delete all entries where abs(a_ij) <= drop_off 
Yes 
Yes 

Transpose the matrix 
Yes 
No 

Sort the matrix indices 
Yes 
No 

Compute a unique matrix key 
Yes 
No 

Replace a column vector of a matrix 
Yes 
No 

Replace a row vector of a matrix 
Yes 
No 

Extract a column vector of a matrix 
Yes 
No 

Extract a row vector of a matrix 
Yes 
No 
LocalVector function 
Comment 
Host 
HIP 

Obtain vector size 
Yes 
Yes 

Check vector for valid entries 
Yes 
No 

Allocate vector 
Yes 
Yes 

Synchronize 
Yes 
Yes 

Initialize vector with external data 
Yes 
Yes 

Direct Memory Access 
Yes 
Yes 

Set vector entries to zero 
Yes 
Yes 

Set vector entries to one 
Yes 
Yes 

Set vector entries to scalar 
Yes 
Yes 

Initialize vector with uniformly distributed random numbers 
Yes 
No 

Initialize vector with normally distributed random numbers 
Yes 
No 

Read vector from ASCII file 
Yes 
No 

Write vector to ASCII file 
Yes 
No 

Read vector from binary file 
Yes 
No 

Write vector to binary file 
Yes 
No 

Copy vector (values) from another LocalVector 
Yes 
Yes 

Copy vector asynchronously 
Yes 
Yes 

Copy vector from another LocalVector<float> 
Yes 
Yes 

Copy vector from another LocalVector<double> 
Yes 
Yes 

Copy vector under specified (forward) permutation 
Yes 
Yes 

Copy vector under specified (backward) permutation 
Yes 
Yes 

Clone vector (values and backend descriptor) from another LocalVector 
Yes 
Yes 

Copy (import) vector from array 
Yes 
Yes 

Copy (export) vector to array 
Yes 
Yes 

(Forward) permute vector in-place 
Yes 
Yes 

(Backward) permute vector in-place 
Yes 
Yes 

y = a * x + y 
Yes 
Yes 

y = x + a * y 
Yes 
Yes 

y = b * x + a * y 
Yes 
Yes 

z = a * x + b * y + c * z 
Yes 
Yes 

x = a * x 
Yes 
Yes 

Compute exclusive sum 
Yes 
No 

Compute dot product 
Yes 
Yes 

Compute non-conjugated dot product 
Yes 
Yes 

Compute L2 norm 
Yes 
Yes 

Obtain the sum of all vector entries 
Yes 
Yes 

Obtain the absolute sum of all vector entries 
Yes 
Yes 

Obtain the absolute maximum entry of the vector 
Yes 
Yes 

Perform pointwise multiplication of two vectors 
Yes 
Yes 

Compute vector power 
Yes 
Yes 
Solver and Preconditioner classes¶
Note
The building phase of the iterative solver also depends on the selected preconditioner.
Solver 
Functionality 
Host 
HIP 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
No 

Building 
Yes 
No 

Solving 
Yes 
No 

Building 
Yes 
No 

Solving 
Yes 
Yes 
Preconditioner 
Functionality 
Host 
HIP 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
No 

Building 
Yes 
No 

Solving 
Yes 
No 

Building 
Yes 
No 

Solving 
Yes 
No 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
No 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
Yes 

Solving 
Yes 
Yes 

Building 
Yes 
No 

Solving 
Yes 
Yes 
Clients¶
rocALUTION clients host a variety of different examples as well as a unit test package. For detailed instructions on how to build rocALUTION with clients, see Building from GitHub repository.
Examples¶
The examples collection offers different possible setups of solvers and preconditioners. The following tables give a short overview of the different examples:
Example 
Description 

amg 
Algebraic Multigrid solver (smoothed aggregation scheme, GS smoothing) 
asprecond 
GMRES solver with Additive Schwarz preconditioning 
async 
Asynchronous rocALUTION object transfer 
benchmark 
Benchmarking important sparse functions 
bicgstab 
BiCGStab solver with multi-colored Gauss-Seidel preconditioning 
blockprecond 
GMRES solver with block-wise multi-colored ILU preconditioning 
cgamg 
CG solver with Algebraic Multigrid (smoothed aggregation scheme) preconditioning 
cg 
CG solver with Jacobi preconditioning 
cmk 
CG solver with ILU preconditioning using Cuthill-McKee ordering 
direct 
Matrix inversion 
fgmres 
Flexible GMRES solver with multi-colored Gauss-Seidel preconditioning 
fixedpoint 
Fixed-point iteration scheme using Jacobi relaxation 
gmres 
GMRES solver with multi-colored Gauss-Seidel preconditioning 
idr 
Induced Dimension Reduction solver with Jacobi preconditioning 
key 
Sparse matrix unique key computation 
mepreconditioner 
CG solver with multi-elimination preconditioning 
mixedprecision 
Mixed-precision CG solver with multi-colored ILU preconditioning 
powermethod 
CG solver using Chebyshev preconditioning and power method for eigenvalue approximation 
simplespmv 
Sparse Matrix Vector multiplication 
spprecond 
BiCGStab solver with multi-colored ILU preconditioning for saddle point problems 
stencil 
CG solver using stencil as operator 
tns 
CG solver with Truncated Neumann Series preconditioning 
varprecond 
FGMRES solver with variable preconditioning 
Example (MPI) 
Description 

benchmark_mpi 
Benchmarking important sparse functions 
bicgstab_mpi 
BiCGStab solver with multi-colored Gauss-Seidel preconditioning 
cgamg_mpi 
CG solver with Algebraic Multigrid (pairwise aggregation scheme) preconditioning 
cg_mpi 
CG solver with Jacobi preconditioning 
fcg_mpi 
Flexible CG solver with ILU preconditioning 
fgmres_mpi 
Flexible GMRES solver with SParse Approximate Inverse preconditioning 
globalio_mpi 
File I/O with CG solver and Factorized Sparse Approximate Inverse preconditioning 
idr_mpi 
IDR solver with Factorized Sparse Approximate Inverse preconditioning 
qmrcgstab_mpi 
QMRCGStab solver with ILUT preconditioning 
Unit Tests¶
Multiple unit tests are available to test for bad arguments, invalid parameters, and solver and preconditioner functionality. The unit tests are based on Google Test. The tests cover a variety of different solver, preconditioner and matrix format combinations and can be performed on all available backends.
API¶
This section provides a detailed list of the library API.
Host Utility Functions¶

template<typename DataType>
void rocalution::allocate_host(int size, DataType **ptr)¶ Allocate buffer on the host.
allocate_host
allocates a buffer on the host. Parameters
size – [in] number of elements the buffer needs to be allocated for
ptr – [out] pointer to the position in memory where the buffer should be allocated, it is expected that
*ptr
==NULL
 Template Parameters
DataType – can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.

template<typename DataType>
void rocalution::free_host(DataType **ptr)¶ Free buffer on the host.
free_host
deallocates a buffer on the host.*ptr
will be set to NULL after successful deallocation. Parameters
ptr – [inout] pointer to the position in memory where the buffer should be deallocated, it is expected that
*ptr
!=NULL
 Template Parameters
DataType – can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.

template<typename DataType>
void rocalution::set_to_zero_host(int size, DataType *ptr)¶ Set a host buffer to zero.
set_to_zero_host
sets a host buffer to zero. Parameters
size – [in] number of elements
ptr – [inout] pointer to the host buffer
 Template Parameters
DataType – can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.

double rocalution::rocalution_time(void)¶
Return current time in microseconds.
Backend Manager¶

int rocalution::init_rocalution(int rank = -1, int dev_per_node = 1)¶
Initialize rocALUTION platform.
init_rocalution
defines a backend descriptor with information about the hardware and its specifications. All objects created after that contain a copy of this descriptor. If the specifications of the global descriptor are changed (e.g. a different number of threads is set) and new objects are created, only the new objects will use the new configuration. For control, the library provides the following functions:
set_device_rocalution() is a unified function to select a specific device. If you have compiled the library with a backend and for this backend there are several available devices, you can use this function to select a particular one. This function has to be called before init_rocalution().
set_omp_threads_rocalution() sets the number of OpenMP threads. This function has to be called after init_rocalution().
 Example
#include <rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    init_rocalution();

    // ...

    stop_rocalution();

    return 0;
}
 Parameters
rank – [in] specifies the MPI rank when in a multi-node environment
dev_per_node – [in] number of accelerator devices per node, when in a multi-GPU environment

int rocalution::stop_rocalution(void)¶
Shutdown rocALUTION platform.
stop_rocalution
shuts down the rocALUTION platform.

void rocalution::set_device_rocalution(int dev)¶
Set the accelerator device.
set_device_rocalution
lets the user select the accelerator device that is supposed to be used for the computation. Parameters
dev – [in] accelerator device ID for computation

void rocalution::set_omp_threads_rocalution(int nthreads)¶
Set number of OpenMP threads.
The number of threads which rocALUTION will use can be set with
set_omp_threads_rocalution
or by the global OpenMP environment variable (for Unix-like OS this is OMP_NUM_THREADS).
During the initialization phase, the library provides affinity thread-core mapping:
If the number of cores (including SMT cores) is greater than or equal to two times the number of threads, then all the threads can occupy every second core ID (e.g. 0, 2, 4, \(\ldots\)). This is to avoid having two threads working on the same physical core, when SMT is enabled.
If the number of threads is less than or equal to the number of cores (including SMT), and the previous clause is false, then the threads can occupy every core ID (e.g. 0, 1, 2, 3, \(\ldots\)).
If none of the above criteria is matched, then the default thread-core mapping is used (typically set by the OS).
Note
The thread-core mapping is available only for Unix-like OS.
Note
The user can disable the thread affinity by calling set_omp_affinity_rocalution(), before initializing the library (i.e. before init_rocalution()).
 Parameters
nthreads – [in] number of OpenMP threads

void rocalution::set_omp_affinity_rocalution(bool affinity)¶
Enable/disable OpenMP host affinity.
set_omp_affinity_rocalution
enables / disables OpenMP host affinity. Parameters
affinity – [in] boolean to turn on/off OpenMP host affinity

void rocalution::set_omp_threshold_rocalution(int threshold)¶
Set OpenMP threshold size.
Whenever you want to work on a small problem, you might observe that the OpenMP host backend is (slightly) slower than using no OpenMP. This is mainly attributed to the small amount of work which every thread has to perform and the large overhead of forking/joining threads. This can be avoided by the OpenMP threshold size parameter in rocALUTION. The default threshold is set to 10000, which means that all matrices under (and equal to) this size will use only one thread (disregarding the number of OpenMP threads set in the system). The threshold can be modified with
set_omp_threshold_rocalution
. Parameters
threshold – [in] OpenMP threshold size

void rocalution::info_rocalution(void)¶
Print info about rocALUTION.
info_rocalution
prints information about the rocALUTION platform.

void rocalution::info_rocalution(const struct Rocalution_Backend_Descriptor &backend_descriptor)¶
Print info about specific rocALUTION backend descriptor.
info_rocalution
prints information about the rocALUTION platform of the specific backend descriptor. Parameters
backend_descriptor – [in] rocALUTION backend descriptor

void rocalution::disable_accelerator_rocalution(bool onoff = true)¶
Disable/Enable the accelerator.
If you want to disable the accelerator (without recompiling the code), you need to call
disable_accelerator_rocalution
before init_rocalution(). Parameters
onoff – [in] boolean to turn on/off the accelerator

void rocalution::_rocalution_sync(void)¶
Sync rocALUTION.
_rocalution_sync
blocks the host until all active asynchronous transfers are completed.
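For illustration, the asynchronous transfer functions can be combined with _rocalution_sync as in the following minimal sketch (allocation and fill steps are omitted):

```cpp
LocalVector<double> vec;

// Allocate and fill vec on the host
// ...

// Start an asynchronous transfer of vec to the accelerator
vec.MoveToAcceleratorAsync();

// Perform independent host work here, overlapping with the transfer
// ...

// Block the host until all active asynchronous transfers are completed
_rocalution_sync();
```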
Base Rocalution¶

template<typename ValueType>
class rocalution::BaseRocalution : public rocalution::RocalutionObj¶ Base class for all operators and vectors.
 tparam ValueType
 can be int, float, double, std::complex<float> and std::complex<double>
Subclassed by rocalution::Operator< ValueType >, rocalution::Vector< ValueType >
Public Functions

virtual void MoveToAccelerator(void) = 0¶
Move the object to the accelerator backend.

virtual void MoveToHost(void) = 0¶
Move the object to the host backend.

virtual void MoveToAcceleratorAsync(void)¶
Move the object to the accelerator backend with async move.

virtual void MoveToHostAsync(void)¶
Move the object to the host backend with async move.

virtual void Sync(void)¶
Sync (the async move)

virtual void CloneBackend(const BaseRocalution<ValueType> &src)¶
Clone the Backend descriptor from another object.
With
CloneBackend
, the backend can be cloned without copying any data. This is especially useful, if several objects should reside on the same backend, but keep their original data. Example
LocalVector<ValueType> vec;
LocalMatrix<ValueType> mat;

// Allocate and initialize vec and mat
// ...

LocalVector<ValueType> tmp;
// By cloning the backend, tmp and vec will have the same backend as mat
tmp.CloneBackend(mat);
vec.CloneBackend(mat);

// The following matrix vector multiplication will be performed on the backend
// selected in mat
mat.Apply(vec, &tmp);
 Parameters
src – [in] Object, where the backend should be cloned from.

virtual void Info(void) const = 0¶
Print object information.
Info
can print object information about any rocALUTION object. This information consists of object properties and backend data. Example
mat.Info(); vec.Info();

virtual void Clear(void) = 0¶
Clear (free all data) the object.
Operator¶

template<typename ValueType>
class rocalution::Operator : public rocalution::BaseRocalution<ValueType>¶ Operator class.
The Operator class defines the generic interface for applying an operator (e.g. matrix or stencil) from/to global and local vectors.
 tparam ValueType
 can be int, float, double, std::complex<float> and std::complex<double>
Subclassed by rocalution::GlobalMatrix< ValueType >, rocalution::LocalMatrix< ValueType >, rocalution::LocalStencil< ValueType >
Public Functions

virtual IndexType2 GetM(void) const = 0¶
Return the number of rows in the matrix/stencil.

virtual IndexType2 GetN(void) const = 0¶
Return the number of columns in the matrix/stencil.

virtual IndexType2 GetNnz(void) const = 0¶
Return the number of nonzeros in the matrix/stencil.

virtual int GetLocalM(void) const¶
Return the number of rows in the local matrix/stencil.

virtual int GetLocalN(void) const¶
Return the number of columns in the local matrix/stencil.

virtual int GetLocalNnz(void) const¶
Return the number of nonzeros in the local matrix/stencil.

virtual int GetGhostM(void) const¶
Return the number of rows in the ghost matrix/stencil.

virtual int GetGhostN(void) const¶
Return the number of columns in the ghost matrix/stencil.

virtual int GetGhostNnz(void) const¶
Return the number of nonzeros in the ghost matrix/stencil.

virtual void Apply(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶
Apply the operator, out = Operator(in), where in and out are local vectors.

virtual void ApplyAdd(const LocalVector<ValueType> &in, ValueType scalar, LocalVector<ValueType> *out) const¶
Apply and add the operator, out += scalar * Operator(in), where in and out are local vectors.

virtual void Apply(const GlobalVector<ValueType> &in, GlobalVector<ValueType> *out) const¶
Apply the operator, out = Operator(in), where in and out are global vectors.

virtual void ApplyAdd(const GlobalVector<ValueType> &in, ValueType scalar, GlobalVector<ValueType> *out) const¶
Apply and add the operator, out += scalar * Operator(in), where in and out are global vectors.
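As a point of reference, the semantics of Apply and ApplyAdd can be sketched in plain C++ for a CSR matrix. This is not rocALUTION code; the CsrMatrix struct and the free functions below are illustrative stand-ins for the Operator interface.

```cpp
#include <cassert>
#include <vector>

// Minimal CSR storage: row_ptr has nrow + 1 entries, col_ind/val have nnz.
struct CsrMatrix
{
    int nrow;
    std::vector<int>    row_ptr;
    std::vector<int>    col_ind;
    std::vector<double> val;
};

// out = A * in (the semantics of Operator::Apply)
void Apply(const CsrMatrix& A, const std::vector<double>& in, std::vector<double>* out)
{
    for(int i = 0; i < A.nrow; ++i)
    {
        double sum = 0.0;
        for(int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
        {
            sum += A.val[j] * in[A.col_ind[j]];
        }
        (*out)[i] = sum;
    }
}

// out += scalar * (A * in) (the semantics of Operator::ApplyAdd)
void ApplyAdd(const CsrMatrix& A, const std::vector<double>& in, double scalar, std::vector<double>* out)
{
    for(int i = 0; i < A.nrow; ++i)
    {
        double sum = 0.0;
        for(int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
        {
            sum += A.val[j] * in[A.col_ind[j]];
        }
        (*out)[i] += scalar * sum;
    }
}
```

In rocALUTION itself, the backend (host or accelerator) on which this multiplication runs is determined by the operator's backend descriptor.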
Vector¶

template<typename ValueType>
class rocalution::Vector : public rocalution::BaseRocalution<ValueType>¶
Vector class.
The Vector class defines the generic interface for local and global vectors.
 tparam ValueType
 can be int, float, double, std::complex<float> and std::complex<double>
Subclassed by rocalution::LocalVector< int >, rocalution::GlobalVector< ValueType >, rocalution::LocalVector< ValueType >
Unnamed Group

virtual void CopyFrom(const LocalVector<ValueType> &src)¶
Copy vector from another vector.
CopyFrom copies values from another vector.
Example
LocalVector<ValueType> vec1, vec2;

// Allocate and initialize vec1 and vec2
// ...

// Move vec1 to accelerator
// vec1.MoveToAccelerator();

// Now, vec1 is on the accelerator (if available)
// and vec2 is on the host

// Copy vec1 to vec2 (or vice versa) will move data between host and
// accelerator backend
vec1.CopyFrom(vec2);
Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
 Parameters
src – [in] Vector, where values should be copied from.

virtual void CopyFrom(const GlobalVector<ValueType> &src)¶
Copy vector from another vector.
CopyFrom copies values from another vector.
Example
LocalVector<ValueType> vec1, vec2;

// Allocate and initialize vec1 and vec2
// ...

// Move vec1 to accelerator
// vec1.MoveToAccelerator();

// Now, vec1 is on the accelerator (if available)
// and vec2 is on the host

// Copy vec1 to vec2 (or vice versa) will move data between host and
// accelerator backend
vec1.CopyFrom(vec2);
Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
 Parameters
src – [in] Vector, where values should be copied from.
Unnamed Group

virtual void CloneFrom(const LocalVector<ValueType> &src)¶
Clone the vector.
CloneFrom clones the entire vector, with data and backend descriptor, from another Vector.
Example
LocalVector<ValueType> vec;

// Allocate and initialize vec (host or accelerator)
// ...

LocalVector<ValueType> tmp;

// By cloning vec, tmp will have identical values and will be on the same
// backend as vec
tmp.CloneFrom(vec);
 Parameters
src – [in] Vector to clone from.

virtual void CloneFrom(const GlobalVector<ValueType> &src)¶
Clone the vector.
CloneFrom clones the entire vector, with data and backend descriptor, from another Vector.
Example
LocalVector<ValueType> vec;

// Allocate and initialize vec (host or accelerator)
// ...

LocalVector<ValueType> tmp;

// By cloning vec, tmp will have identical values and will be on the same
// backend as vec
tmp.CloneFrom(vec);
 Parameters
src – [in] Vector to clone from.
Public Functions

virtual IndexType2 GetSize(void) const = 0¶
Return the size of the vector.

virtual int GetLocalSize(void) const¶
Return the size of the local vector.

virtual int GetGhostSize(void) const¶
Return the size of the ghost vector.

virtual bool Check(void) const = 0¶
Perform a sanity check of the vector.
Checks whether the vector contains valid data, i.e. whether all values are finite and not NaN (not a number).
 Returns
true – if the vector is ok (empty vector is also ok).
false – if there is something wrong with the values.

virtual void Clear(void) = 0¶
Clear the object (free all data).

virtual void Zeros(void) = 0¶
Set all values of the vector to 0.

virtual void Ones(void) = 0¶
Set all values of the vector to 1.

virtual void SetRandomUniform(unsigned long long seed, ValueType a = static_cast<ValueType>(-1), ValueType b = static_cast<ValueType>(1)) = 0¶
Fill the vector with uniformly distributed random values from the interval [a,b].

virtual void SetRandomNormal(unsigned long long seed, ValueType mean = static_cast<ValueType>(0), ValueType var = static_cast<ValueType>(1)) = 0¶
Fill the vector with normally distributed random values with the given mean and variance.

virtual void ReadFileASCII(const std::string &filename) = 0¶
Read vector from ASCII file.
Read a vector from ASCII file.
 Example
LocalVector<ValueType> vec;
vec.ReadFileASCII("my_vector.dat");
 Parameters
filename – [in] name of the file containing the ASCII data.

virtual void WriteFileASCII(const std::string &filename) const = 0¶
Write vector to ASCII file.
Write a vector to ASCII file.
 Example
LocalVector<ValueType> vec;

// Allocate and fill vec
// ...

vec.WriteFileASCII("my_vector.dat");
 Parameters
filename – [in] name of the file to write the ASCII data to.

virtual void ReadFileBinary(const std::string &filename) = 0¶
Read vector from binary file.
Read a vector from binary file. For details on the format, see WriteFileBinary().
 Example
LocalVector<ValueType> vec;
vec.ReadFileBinary("my_vector.bin");
 Parameters
filename – [in] name of the file containing the data.

virtual void WriteFileBinary(const std::string &filename) const = 0¶
Write vector to binary file.
Write a vector to binary file.
The binary format contains a header, the rocALUTION version and the vector data as follows
// Header
out << "#rocALUTION binary vector file" << std::endl;

// rocALUTION version
out.write((char*)&version, sizeof(int));

// Vector data
out.write((char*)&size, sizeof(int));
out.write((char*)vec_val, size * sizeof(double));
 Example
LocalVector<ValueType> vec;

// Allocate and fill vec
// ...

vec.WriteFileBinary("my_vector.bin");
Note
Vector values array is always stored in double precision (e.g. double or std::complex<double>).
 Parameters
filename – [in] name of the file to write the data to.
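The quoted layout can be exercised with a stand-alone, stdlib-only sketch that writes and reads back such a file. This is not rocALUTION code; the function names, the file name and the version value below are placeholders, and only the byte layout follows the documented format.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Write a vector in the documented layout: a text header line, the version
// as int, the size as int, then the values in double precision.
void write_binary_vector(const std::string& filename, int version, const std::vector<double>& vec)
{
    std::ofstream out(filename, std::ios::binary);

    // Header
    out << "#rocALUTION binary vector file" << std::endl;

    // Version
    out.write((const char*)&version, sizeof(int));

    // Vector data
    int size = (int)vec.size();
    out.write((const char*)&size, sizeof(int));
    out.write((const char*)vec.data(), size * sizeof(double));
}

// Read it back by consuming the header line, then the two ints, then the data.
std::vector<double> read_binary_vector(const std::string& filename)
{
    std::ifstream in(filename, std::ios::binary);

    std::string header;
    std::getline(in, header); // skip the text header line

    int version = 0;
    in.read((char*)&version, sizeof(int));

    int size = 0;
    in.read((char*)&size, sizeof(int));

    std::vector<double> vec(size);
    in.read((char*)vec.data(), size * sizeof(double));
    return vec;
}
```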

virtual void CopyFromAsync(const LocalVector<ValueType> &src)¶
Async copy from another local vector.

virtual void CopyFromFloat(const LocalVector<float> &src)¶
Copy values from another local float vector.

virtual void CopyFromDouble(const LocalVector<double> &src)¶
Copy values from another local double vector.

virtual void CopyFrom(const LocalVector<ValueType> &src, int src_offset, int dst_offset, int size)¶
Copy vector from another vector with offsets and size.
CopyFrom copies values with specific source and destination offsets and sizes from another vector.
Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
 Parameters
src – [in] Vector, where values should be copied from.
src_offset – [in] source offset.
dst_offset – [in] destination offset.
size – [in] number of entries to be copied.
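The offset semantics can be stated precisely in a plain C++ sketch (not rocALUTION code): size entries starting at src_offset in the source overwrite the entries starting at dst_offset in the destination.

```cpp
#include <cassert>
#include <vector>

// dst[dst_offset + i] = src[src_offset + i] for i in [0, size)
void CopyFrom(std::vector<double>& dst, const std::vector<double>& src,
              int src_offset, int dst_offset, int size)
{
    for(int i = 0; i < size; ++i)
    {
        dst[dst_offset + i] = src[src_offset + i];
    }
}
```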

virtual void AddScale(const LocalVector<ValueType> &x, ValueType alpha)¶
Perform vector update of type this = this + alpha * x.

virtual void AddScale(const GlobalVector<ValueType> &x, ValueType alpha)¶
Perform vector update of type this = this + alpha * x.

virtual void ScaleAdd(ValueType alpha, const LocalVector<ValueType> &x)¶
Perform vector update of type this = alpha * this + x.

virtual void ScaleAdd(ValueType alpha, const GlobalVector<ValueType> &x)¶
Perform vector update of type this = alpha * this + x.

virtual void ScaleAddScale(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta)¶
Perform vector update of type this = alpha * this + x * beta.

virtual void ScaleAddScale(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta)¶
Perform vector update of type this = alpha * this + x * beta.

virtual void ScaleAddScale(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta, int src_offset, int dst_offset, int size)¶
Perform vector update of type this = alpha * this + x * beta with offsets.

virtual void ScaleAddScale(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta, int src_offset, int dst_offset, int size)¶
Perform vector update of type this = alpha * this + x * beta with offsets.

virtual void ScaleAdd2(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta, const LocalVector<ValueType> &y, ValueType gamma)¶
Perform vector update of type this = alpha * this + x * beta + y * gamma.

virtual void ScaleAdd2(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta, const GlobalVector<ValueType> &y, ValueType gamma)¶
Perform vector update of type this = alpha * this + x * beta + y * gamma.
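The update routines above differ only in which operand is scaled. A plain C++ reference (not rocALUTION code; free functions stand in for the member functions, with the vector itself passed explicitly) makes the distinction explicit:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// this = this + alpha * x
void AddScale(std::vector<double>& self, const std::vector<double>& x, double alpha)
{
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = self[i] + alpha * x[i];
}

// this = alpha * this + x
void ScaleAdd(double alpha, std::vector<double>& self, const std::vector<double>& x)
{
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + x[i];
}

// this = alpha * this + beta * x
void ScaleAddScale(double alpha, std::vector<double>& self,
                   const std::vector<double>& x, double beta)
{
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + beta * x[i];
}

// this = alpha * this + beta * x + gamma * y
void ScaleAdd2(double alpha, std::vector<double>& self,
               double beta, const std::vector<double>& x,
               double gamma, const std::vector<double>& y)
{
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + beta * x[i] + gamma * y[i];
}
```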

virtual ValueType Dot(const LocalVector<ValueType> &x) const¶
Compute dot (scalar) product, return this^T x.

virtual ValueType Dot(const GlobalVector<ValueType> &x) const¶
Compute dot (scalar) product, return this^T x.

virtual ValueType DotNonConj(const LocalVector<ValueType> &x) const¶
Compute non-conjugate dot (scalar) product, return this^T x.

virtual ValueType DotNonConj(const GlobalVector<ValueType> &x) const¶
Compute non-conjugate dot (scalar) product, return this^T x.
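For complex value types, the distinction between the two dot products is that one conjugates the left operand and the other does not; for real types they coincide. A plain C++ sketch (not rocALUTION code; free functions with both operands passed explicitly):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Conjugated dot product: sum of conj(a_i) * b_i
std::complex<double> Dot(const std::vector<std::complex<double>>& a,
                         const std::vector<std::complex<double>>& b)
{
    std::complex<double> sum(0.0, 0.0);
    for(std::size_t i = 0; i < a.size(); ++i)
        sum += std::conj(a[i]) * b[i];
    return sum;
}

// Non-conjugate (plain transpose) dot product: sum of a_i * b_i
std::complex<double> DotNonConj(const std::vector<std::complex<double>>& a,
                                const std::vector<std::complex<double>>& b)
{
    std::complex<double> sum(0.0, 0.0);
    for(std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```

For example, with a = (i), Dot(a, a) is 1 while DotNonConj(a, a) is -1.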

virtual ValueType Norm(void) const = 0¶
Compute \(L_2\) norm of the vector, return = sqrt(this^T this)

virtual ValueType Asum(void) const = 0¶
Compute the sum of absolute values of the vector, return = sum(|this|)

virtual int Amax(ValueType &value) const = 0¶
Compute the absolute maximum of the vector; value is set to max(|this|) and the index of that entry is returned.
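A plain C++ reference for these three reductions (not rocALUTION code; free functions with the vector passed explicitly) may help pin down the semantics, in particular that Amax reports both the largest magnitude and its index:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean (L2) norm: sqrt(sum of squares)
double Norm(const std::vector<double>& v)
{
    double sum = 0.0;
    for(double x : v) sum += x * x;
    return std::sqrt(sum);
}

// Sum of absolute values
double Asum(const std::vector<double>& v)
{
    double sum = 0.0;
    for(double x : v) sum += std::abs(x);
    return sum;
}

// Largest magnitude: stores max(|v_i|) in value, returns its index
int Amax(const std::vector<double>& v, double& value)
{
    int idx = 0;
    value = 0.0;
    for(int i = 0; i < (int)v.size(); ++i)
    {
        if(std::abs(v[i]) > value)
        {
            value = std::abs(v[i]);
            idx = i;
        }
    }
    return idx;
}
```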

virtual void PointWiseMult(const LocalVector<ValueType> &x)¶
Perform pointwise multiplication (elementwise) of this = this * x.

virtual void PointWiseMult(const GlobalVector<ValueType> &x)¶
Perform pointwise multiplication (elementwise) of this = this * x.

virtual void PointWiseMult(const LocalVector<ValueType> &x, const LocalVector<ValueType> &y)¶
Perform pointwise multiplication (elementwise) of this = x * y.

virtual void PointWiseMult(const GlobalVector<ValueType> &x, const GlobalVector<ValueType> &y)¶
Perform pointwise multiplication (elementwise) of this = x * y.

virtual void Power(double power) = 0¶
Perform an element-wise power operation on the vector.
Local Matrix¶

template<typename ValueType>
class rocalution::LocalMatrix : public rocalution::Operator<ValueType>¶
LocalMatrix class.
A LocalMatrix is called local because it always stays on a single system. The system can contain several CPUs connected via a UMA or NUMA memory system, and it can also contain an accelerator.
A number of matrix formats are supported. These are CSR, BCSR, MCSR, COO, DIA, ELL, HYB, and DENSE.
Note
For CSR type matrices, the column indices must be sorted in increasing order. For COO matrices, the row indices must be sorted in increasing order. The function Check can be used to check whether a matrix contains valid data. For CSR and COO matrices, the function Sort can be used to sort the column or row indices, respectively.
 tparam ValueType
 can be int, float, double, std::complex<float> and std::complex<double>
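The sortedness requirement for CSR matrices noted above can be sketched as a plain C++ check (not rocALUTION code; Check performs this among other validations): within each row, column indices must strictly increase.

```cpp
#include <cassert>
#include <vector>

// Returns true if, for every row i, the column indices in
// col_ind[row_ptr[i] .. row_ptr[i+1]) are strictly increasing.
bool csr_columns_sorted(int nrow, const std::vector<int>& row_ptr, const std::vector<int>& col_ind)
{
    for(int i = 0; i < nrow; ++i)
    {
        for(int j = row_ptr[i] + 1; j < row_ptr[i + 1]; ++j)
        {
            if(col_ind[j - 1] >= col_ind[j])
            {
                return false;
            }
        }
    }
    return true;
}
```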
Unnamed Group

void AllocateCSR(const std::string &name, int nnz, int nrow, int ncol)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void AllocateBCSR(const std::string &name, int nnzb, int nrowb, int ncolb, int blockdim)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void AllocateMCSR(const std::string &name, int nnz, int nrow, int ncol)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void AllocateCOO(const std::string &name, int nnz, int nrow, int ncol)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void AllocateDIA(const std::string &name, int nnz, int nrow, int ncol, int ndiag)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void AllocateELL(const std::string &name, int nnz, int nrow, int ncol, int max_row)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void AllocateHYB(const std::string &name, int ell_nnz, int coo_nnz, int ell_max_row, int nrow, int ncol)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void AllocateDENSE(const std::string &name, int nrow, int ncol)¶
Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of nonzero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
 Example
LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();
Unnamed Group

void SetDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶
Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; the csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
Note
Setting data pointers will leave the original pointers empty (set to NULL).

void SetDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶
Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; the csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
Note
Setting data pointers will leave the original pointers empty (set to NULL).

void SetDataPtrBCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnzb, int nrowb, int ncolb, int blockdim)¶
Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; the csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid