Note that what I wrote above was specifically about dense matrices. Sparse matrices are much more interesting.

There are three major approaches I can recommend.

First, using chains (singly-linked lists):

Code:

struct matrix_element {
double value;
uint32_t row;
uint32_t col;
uint32_t nextrow; /* Element index to continue on the same column */
uint32_t nextcol; /* Element index to continue on the same row */
};
struct matrix {
uint32_t rows;
uint32_t cols;
uint32_t *col; /* Indices to first element on each column */
uint32_t *row; /* Indices to first element on each row */
size_t size; /* Number of data elements allocated */
size_t used; /* Number of data elements used */
struct matrix_element *data;
};

Here, the data is basically an array of elements. The first element (offset zero) is best used as a sentinel, to signify end of each chain. The maximum size of the matrix this structure can support is 4294967295×4294967295 with up to 4294967295 nonzero elements within the matrix. Indices are used instead of pointers so that you can reallocate the data array whenever necessary. Keeping the data in a linear array caches better than using individually allocated elements. Populating the data structure initially is surprisingly easy: for example, you can just skip the chaining at first, just adding the nonzero elements to the array, and finally construct the chain links using two temporary arrays in a single linear pass.

Second, storing each row or column as a sparse vector.

Code:

struct matrix_rowmajor { /* Row vectors */
size_t rows;
size_t cols;
double *scale; /* Common multiplier for each row */
size_t *size; /* Number of elements on each row */
size_t **col; /* Array for each row: column numbers */
double **value; /* Array for each row: values */
};
struct matrix_colmajor { /* Column vectors */
size_t rows;
size_t cols;
double *scale; /* Common multiplier for each column */
size_t *size; /* Number of elements on each column */
size_t **row; /* Array for each column: column numbers */
double **value; /* Array for each column: values */
};

This structure is most useful with

Gaussian elimination (and similar algorithms), and when multiplying a row major matrix with a column major matrix. It does not vectorize, but even a trivial multiplication implementation should produce very efficient code. The downside is the relatively expensive transformation between the two matrix forms. (Given sufficiently large matrices, the transformation does speed up e.g. matrix multiplication.)

The reason why the column/row number and value are in separate arrays is that usually a 32-bit column number is sufficient (4294967295 columns!) -- you might wish to use uint32_t instead of size_t there. If you combine the column/row number and value in a structure, you lose the 64-bit alignment on the double values, which will hurt performance on most architectures.

If you want vectorization, instead of single doubles, store v2df or v4df vectors (aligned twin or quad doubles), with row/column number divided by two or four. It tends to lead much more complex code, but if the nonzero elements in the sparse matrix are clustered, it may well be worth the added code complexity. (I think I could provide some example code using explicit GCC SSE2/SSE3 built-in vector operations for say matrix multiplication, if someone is sufficiently interested.)

Third, block matrices. There are three subtypes for this case: fixed size blocks, common size blocks, and variable size blocks. Fixed size blocks:

Code:

#define ROWS 4
#define COLS 4
struct matrix_block {
double value[ROWS][COLS];
size_t references; /* Optional, for memory management */
};
struct matrix {
size_t rows;
size_t cols;
size_t stride; /* = (cols + COLS - 1) / COLS */
struct matrix_block *block;
};
static double none = 0.0;
static inline double *element(struct matrix *const m, const size_t row, const size_t col)
{
if (m && row < m->rows && col < m->cols) {
const size_t block = (size_t)(col / COLS)
+ m->stride * (size_t)(row / ROWS);
if (m->block[block])
return (double *)&(m->block[block].value[(col % COLS) + COLS * (row % ROWS)]);
}
return &none;
}

Note that the casts to (size_t) above have a very specific role in C99: they tell the compiler that the result of the division must be treated as a size_t (integer), and thus enforce that the division really is integer division regardless of optimization. Without the cases, the compiler might combine the division with a multiplication, b0rking the block index calculation. (Some programmers use cargo cult programming for this, believing that a temporary variable is required to be sure.)

Since the block size is fixed, you can trivially calculate the block index (by dividing by the block size) and index within the block (by modulus block size). Using power of two sizes (4, 8, 16, 32) means the division and modulus are just bit shifts and masks (GCC will take care of the details, no need to write them explicitly as such), making random accesses much faster.

The major benefit of this method is that since most matrix operations can be efficiently described as operations on block matrices, and since each block is of fixed size and consecutive in memory, the compiler can optimize (especially vectorize) operations on entire blocks fully. For example, AMD CPUs can do two or four operations (multiplications, divisions, additions, substractions, square roots) in parallel via SSE2/SSE3. You do need to redefine the matrix block and random access accessor functions to get GCC to vectorize it; I omitted that because you really need to hardcode COLS to a power of two then. I haven't used block matrices, so I don't know how fast an implementation you can do in practice, but I suspect it depends heavily on the block size.

Obviously, each block can be reused more than once. The

references field is a bit problematic, because it screws the block alignment -- it would be best to have each block aligned on a cacheline boundary for optimal cache behaviour -- but it makes memory management a lot easier. (If your matrices reuse blocks often, you might benefit from creating matrix pair lists (using e.g. an 8-bit hash of the two pointers), so that you do the operation between two blocks only once, then copy it to all targets. I have not tried this, so I am not sure if it is beneficial in real life.)

Common size block is similar, except that COLS and ROWS are structure parameters instead of fixed constants. Variable block size is basically just a linked list of submatrices offset in the actual matrix.

Because you may operate on matrices with different block sizes, your element access will be rather complicated. It is probably best to just remember the last block, and see if the next access is still within the same block; if not, then do a full search for the correct block. If you want to vectorize the per-block operations, you will have to do a generic function, and special case functions (for when the blocks are of same size, square, and/or power of two in size).

Depending on the operations you do on the matrix blocks, you may further benefit from extending the block structure with a type/flags field, and optionally cached properties. Here is an example for variable-sized blocks:

Code:

/* Examples of matrix type flags, not thought out! */
#define MTYPE_IDENTITY (1<<0)
#define MTYPE_DIAGONAL (1<<1)
#define MTYPE_SYMMETRIC (1<<2)
#define MTYPE_POSITIVE_DEFINITE (1<<3)
#define MTYPE_TRIDIAGONAL (1<<4)
#define MTYPE_UPPER_TRIANGULAR (1<<5)
#define MTYPE_LOWER_TRIANGULAR (1<<6)
struct block {
size_t rows;
size_t cols;
size_t rowstep;
size_t colstep;
double *data;
void *storage; /* For data memory management */
size_t eigenvalues;
double *eigenvalue;
int type;
};
struct matrix {
size_t rows;
size_t cols;
size_t blocks;
size_t *block_row; /* Row offsets */
size_t *block_col; /* Column offsets */
struct block *block; /* Array of block definitions */
};

For example, if the matrix is known to be identity, diagonal, tridiagonal, upper triangular, lower triangular, symmetric, positive definite, or so on, mark it such. You could also cache the eigenvalues. Something like this is very likely to help you avoid doing unnecessary block operations altogether, in which case using common or variable sized blocks may well lead to more efficient code -- assuming your algorithm makes maximum use of those features.

The storage void pointer in the block refers to the entity owning the data. You don't need it, but if you want easy memory management, that entity should have a reference count. (When it drops to zero, that data can be released. Since the data points to within that entity, not at the start of the entity, you do need a pointer to the actual entity.)

When accessing the elements of such a matrix in sequential fashion, remember the last accessed block and its row and column offset. That way you only need to check whether the next access is still within the block, and it usually will be. (If not, you do a full search in the actual matrix data structure.)

Hope you find this useful,