LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 02-10-2012, 10:24 PM   #1
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Rep: Reputation: 26
Question Block Matrices in C


Hi guys!

I was looking in Google for some clever implementations of block-defined matrices in C.

Particularly, how to define a matrix M, given its structures in previously defined blocks, i.e.:

Code:
    |B11 ... B1n|
M = |    ...    | 
    |Bm1 ... Bmn|
Any ideas or links will be of tremendous help!

Thanks!
 
Click here to see the post LQ members have rated as the most helpful post in this thread.
Old 02-11-2012, 03:37 AM   #2
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852

Rep: Reputation: 389Reputation: 389Reputation: 389Reputation: 389
The most straightforward solution would be to merge the blocks into a single matrix just by allocating a large enough array and copying everything to it. But the copying would probably take some time. Another approach would be defining it as an array of pointers to blocks.
But everything depends on your situation and requirements: How do the blocks look like? Do they have any special properties(diagonal, tridiagonal, antisymmetric,...)? Are they the same size? What operations are you going to perform on the matrices most often and what are your optimization priorities(memory usage, element access time, efficiency of a certain operation (multiplication, determinant, inverse))? How large will the blocks and matrices be? Every information you know about the matrices/blocks can help you find a more suitable solution.
 
1 members found this post helpful.
Old 02-11-2012, 08:29 AM   #3
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948
Quote:
Originally Posted by ejspeiro View Post
how to define a matrix M, given its structures in previously defined blocks
In general, the added indirection and its effects will kill your performance, unless there are specific added structural requirements you can apply to avoid the indirection. For example, if the matrices are very large, and all the blocks are the same size and type.

In all the cases I've encountered thus far it is much more efficient to create a copy. If you use a library like GNU Scientific Library or ACML (for optimized BLAS and LAPACK libraries), you should use the matrix structures the library provides.

If you are building your own library for dense matrices, then there are two C99 structures I can recommend:
Code:
typedef struct matrix_view  matrix_view_t;
typedef struct matrix       matrix_t;

struct matrix {
    size_t         cols;
    size_t         rows;
    size_t         colstep;
    size_t         rowstep;
    size_t         references;
    double         data[];
};

struct matrix_view {
    int            cols;
    int            rows;
    size_t         colstep;
    size_t         rowstep;
    struct matrix *parent;
    double        *data;
};
The difference between these two is related to memory management and avoiding unnecessary copying. In the simplest cases, the references and the parent fields are unnecessary. (When used, references field will be initialized to one.)

Each matrix_t will contain its data. It is a C99 structure, of unspecific size, and must be dynamically allocated. Freeing the structure itself will also free the data.

A view contains no data, it only refers to data from another matrix (the parent). The references field in the parent matrix is used to keep count of the views, so it will not be released before the last view is also discarded. (Thus the parent matrix can be free()d when either it itself is discarded and the reference count drops to zero, or when discarding a view decreases the reference count of the parent matrix to zero.)

Because access to any row (0 <= row < rows) and column (0 <= col < cols) is via ptr->data[col*ptr->colstep+row*ptr->rowstep] the matrix data can be either row-major or column-major. It also means that a view can be any rectangular subset of a matrix, optionally transposed. To transpose, you just swap the rowstep and colstep wrt. the parent.

It is also possible to mirror the data in the view by negating the rowstep or colstep, but if you do that, you should use ssize_t type for those fields since size_t is unsigned. I've never needed data mirroring myself, though. (Obviously, you can also use every second/third/Nth column and/or row, not just consecutive columns or rows, as long as you do not exceed the parent matrix dimensions.)

When using these structures, usually matrix multiplication and other complex operations will produce a new matrix. I've used inline wrappers to handle the four possible cases where the two sources are either matrices or views, simply by creating temporary views to any real matrices, and having the actual function only work on views. This seems to work well, and is quite possible to e.g. vectorize (for SSE3 and higher). Most of the benefits do not come from having fastest possible basic operations, but from the way you can optimize larger algorithms, to use less memory and fewer operations through the use of views.

Operations to small matrices usually suffer from cache effects. If the matrix is small enough to fit in one or just a few cache lines, it is usually faster to copy the source matrices -- for example, direct copy of the left-side matrix and trasposed copy of the right side matrix -- to maximize cache locality during the operation. Operations on larger matrices might benefit from blocking into cache-local parts, but I haven't done any research on that.

What I have learned, though, is that optimizing individual operations is usually not really worthwhile. The structures should be designed so that they help you, the programmer, to implement the relevant algorithms efficiently. Let the compiler handle the low-level vectorization, loop optimization, and so on. In many cases good structures lets you avoid a lot of low-level operations altogether -- and not doing them at all is always faster than even a superoptimized implementation. You can always go back later on, and even hand-code the basic operations in assembly (for vectorization, for example), if you need to shave a few more cycles out.

Hope you find this useful,
 
2 members found this post helpful.
Old 02-12-2012, 03:00 AM   #4
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Original Poster
Rep: Reputation: 26
Talking

millgates: Thank you very much for your reply. It was helpful to me in the following ways:

First, you mentioned that one of the possible methods for the implementation would be...

Quote:
Another approach would be defining it as an array of pointers to blocks.
and indeed this got me thinking that the blocks are nothing but defined subsets of the parent matrix they define. Therefore, I thought on simply defining a 1D array and simply keep this collection of "inner limits". By this approach, I would do:

PHP Code:
typedef struct {

  
int row_begin;
  
int col_begin;
  
int row_end;
  
int col_end;
  
int hor_extent;
  
int ver_extent;
Block;

typedef struct {
  
double *values;
  
Block *blocks;
  
int rows;
  
int cols;
  
int nvals;
  
int nnz;
  
int nblocks;
  
int block_index;
BlockMatrixDouble
What I like about this approach is that (1) By writing adequate routines for the addition of values and blocks into the parent matrix, the ordering of the blocks wrt the parent matrix doesn't matter, provided that you are capable to efficiently save the instances of the Block defined type. And (2), that -after I coded this and came back- it resembles Nominal Animal's proposal, which is aimed towards considering an efficient use of the memory, by means of avoiding the copies.

Nominal Animal: Thank you very much for your reply. It can be seen that you put some effort in conveying your ideas and I really appreciate that! You actually gave me an idea, which in my head is well related to millgates's suggestion:

Quote:
But everything depends on your situation and requirements: How do the blocks look like? Do they have any special properties(diagonal, tridiagonal, antisymmetric,...)? Are they the same size?
Yes, millgates, this would be another way, your post helped me!

I noticed that, in the case of a stencil matrix, for example, I could even avoid the extra blocks, inside the parent matrix which are nothing but copies of a few blocks, given the symmetry and diagonal-domination of these type of matrices! I could merge my approach, with yours so that by using views of the "base blocks" I could avoid copying the same blocks all across the parent matrix -btw, thanks for the terminology-.

Thank you very much guys!
 
Old 02-12-2012, 03:04 AM   #5
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Original Poster
Rep: Reputation: 26
Btw, I will label the thread as solved, however feel free to pitch in any suggestions since I will consider them very seriously!

Btw, Nominal Animal, man you will have a profound thank you note on my dissertation!

Thank you guys!
 
Old 02-12-2012, 06:12 AM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948
Note that what I wrote above was specifically about dense matrices. Sparse matrices are much more interesting.

There are three major approaches I can recommend.

First, using chains (singly-linked lists):
Code:
struct matrix_element {
    double      value;
    uint32_t    row;
    uint32_t    col;
    uint32_t    nextrow; /* Element index to continue on the same column */
    uint32_t    nextcol; /* Element index to continue on the same row */
};

struct matrix {
    uint32_t               rows;
    uint32_t               cols;
    uint32_t              *col;     /* Indices to first element on each column */
    uint32_t              *row;     /* Indices to first element on each row */
    size_t                 size;    /* Number of data elements allocated */
    size_t                 used;    /* Number of data elements used */
    struct matrix_element *data;
};
Here, the data is basically an array of elements. The first element (offset zero) is best used as a sentinel, to signify end of each chain. The maximum size of the matrix this structure can support is 4294967295×4294967295 with up to 4294967295 nonzero elements within the matrix. Indices are used instead of pointers so that you can reallocate the data array whenever necessary. Keeping the data in a linear array caches better than using individually allocated elements. Populating the data structure initially is surprisingly easy: for example, you can just skip the chaining at first, just adding the nonzero elements to the array, and finally construct the chain links using two temporary arrays in a single linear pass.


Second, storing each row or column as a sparse vector.
Code:
struct matrix_rowmajor { /* Row vectors */
    size_t    rows;
    size_t    cols;
    double   *scale;     /* Common multiplier for each row */
    size_t   *size;      /* Number of elements on each row */
    size_t  **col;       /* Array for each row: column numbers */
    double  **value;     /* Array for each row: values */
};

struct matrix_colmajor { /* Column vectors */
    size_t    rows;
    size_t    cols;
    double   *scale;     /* Common multiplier for each column */
    size_t   *size;      /* Number of elements on each column */
    size_t  **row;       /* Array for each column: column numbers */
    double  **value;     /* Array for each column: values */
};
This structure is most useful with Gaussian elimination (and similar algorithms), and when multiplying a row major matrix with a column major matrix. It does not vectorize, but even a trivial multiplication implementation should produce very efficient code. The downside is the relatively expensive transformation between the two matrix forms. (Given sufficiently large matrices, the transformation does speed up e.g. matrix multiplication.)

The reason why the column/row number and value are in separate arrays is that usually a 32-bit column number is sufficient (4294967295 columns!) -- you might wish to use uint32_t instead of size_t there. If you combine the column/row number and value in a structure, you lose the 64-bit alignment on the double values, which will hurt performance on most architectures.

If you want vectorization, instead of single doubles, store v2df or v4df vectors (aligned twin or quad doubles), with row/column number divided by two or four. It tends to lead much more complex code, but if the nonzero elements in the sparse matrix are clustered, it may well be worth the added code complexity. (I think I could provide some example code using explicit GCC SSE2/SSE3 built-in vector operations for say matrix multiplication, if someone is sufficiently interested.)


Third, block matrices. There are three subtypes for this case: fixed size blocks, common size blocks, and variable size blocks. Fixed size blocks:
Code:
#define  ROWS 4
#define  COLS 4

struct matrix_block {
    double  value[ROWS][COLS];
    size_t  references;          /* Optional, for memory management */
};

struct matrix {
    size_t               rows;
    size_t               cols;
    size_t               stride; /* = (cols + COLS - 1) / COLS */
    struct matrix_block *block;
};

static double none = 0.0;

static inline double *element(struct matrix *const m, const size_t row, const size_t col)
{
    if (m && row < m->rows && col < m->cols) {
        const size_t block = (size_t)(col / COLS)
                           + m->stride * (size_t)(row / ROWS);
        if (m->block[block])
            return (double *)&(m->block[block].value[(col % COLS) + COLS * (row % ROWS)]);
    }
    return &none;
}
Note that the casts to (size_t) above have a very specific role in C99: they tell the compiler that the result of the division must be treated as a size_t (integer), and thus enforce that the division really is integer division regardless of optimization. Without the cases, the compiler might combine the division with a multiplication, b0rking the block index calculation. (Some programmers use cargo cult programming for this, believing that a temporary variable is required to be sure.)

Since the block size is fixed, you can trivially calculate the block index (by dividing by the block size) and index within the block (by modulus block size). Using power of two sizes (4, 8, 16, 32) means the division and modulus are just bit shifts and masks (GCC will take care of the details, no need to write them explicitly as such), making random accesses much faster.

The major benefit of this method is that since most matrix operations can be efficiently described as operations on block matrices, and since each block is of fixed size and consecutive in memory, the compiler can optimize (especially vectorize) operations on entire blocks fully. For example, AMD CPUs can do two or four operations (multiplications, divisions, additions, substractions, square roots) in parallel via SSE2/SSE3. You do need to redefine the matrix block and random access accessor functions to get GCC to vectorize it; I omitted that because you really need to hardcode COLS to a power of two then. I haven't used block matrices, so I don't know how fast an implementation you can do in practice, but I suspect it depends heavily on the block size.

Obviously, each block can be reused more than once. The references field is a bit problematic, because it screws the block alignment -- it would be best to have each block aligned on a cacheline boundary for optimal cache behaviour -- but it makes memory management a lot easier. (If your matrices reuse blocks often, you might benefit from creating matrix pair lists (using e.g. an 8-bit hash of the two pointers), so that you do the operation between two blocks only once, then copy it to all targets. I have not tried this, so I am not sure if it is beneficial in real life.)

Common size block is similar, except that COLS and ROWS are structure parameters instead of fixed constants. Variable block size is basically just a linked list of submatrices offset in the actual matrix.

Because you may operate on matrices with different block sizes, your element access will be rather complicated. It is probably best to just remember the last block, and see if the next access is still within the same block; if not, then do a full search for the correct block. If you want to vectorize the per-block operations, you will have to do a generic function, and special case functions (for when the blocks are of same size, square, and/or power of two in size).

Depending on the operations you do on the matrix blocks, you may further benefit from extending the block structure with a type/flags field, and optionally cached properties. Here is an example for variable-sized blocks:
Code:
/* Examples of matrix type flags, not thought out! */
#define MTYPE_IDENTITY          (1<<0)
#define MTYPE_DIAGONAL          (1<<1)
#define MTYPE_SYMMETRIC         (1<<2)
#define MTYPE_POSITIVE_DEFINITE (1<<3)
#define MTYPE_TRIDIAGONAL       (1<<4)
#define MTYPE_UPPER_TRIANGULAR  (1<<5)
#define MTYPE_LOWER_TRIANGULAR  (1<<6)

struct block {
    size_t         rows;
    size_t         cols;
    size_t         rowstep;
    size_t         colstep;
    double        *data;
    void          *storage;     /* For data memory management */
    size_t         eigenvalues;
    double        *eigenvalue;
    int            type;
};

struct matrix {
    size_t        rows;
    size_t        cols;
    size_t        blocks;
    size_t       *block_row; /* Row offsets */
    size_t       *block_col; /* Column offsets */
    struct block *block;     /* Array of block definitions */
};
For example, if the matrix is known to be identity, diagonal, tridiagonal, upper triangular, lower triangular, symmetric, positive definite, or so on, mark it such. You could also cache the eigenvalues. Something like this is very likely to help you avoid doing unnecessary block operations altogether, in which case using common or variable sized blocks may well lead to more efficient code -- assuming your algorithm makes maximum use of those features.

The storage void pointer in the block refers to the entity owning the data. You don't need it, but if you want easy memory management, that entity should have a reference count. (When it drops to zero, that data can be released. Since the data points to within that entity, not at the start of the entity, you do need a pointer to the actual entity.)

When accessing the elements of such a matrix in sequential fashion, remember the last accessed block and its row and column offset. That way you only need to check whether the next access is still within the block, and it usually will be. (If not, you do a full search in the actual matrix data structure.)

Hope you find this useful,
 
1 members found this post helpful.
Old 02-13-2012, 04:34 PM   #7
mabouali
LQ Newbie
 
Registered: Feb 2012
Posts: 3

Rep: Reputation: Disabled
Hi,

Your main question is already answered pretty well. Just make sure that you check that the blocks are not overlapping. It might happen that the user enters wrong number and then you have overlapping blocks, then nothing will work as you expect, such as matrix multiplications etc.

Or you can leave that up to the person who is going to use your code. But a good program check its inputs.

cheers,
 
1 members found this post helpful.
Old 02-13-2012, 05:37 PM   #8
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Original Poster
Rep: Reputation: 26
mabouali! Good to see you around!

As I told you, thanks for that amazing suggestion!

Welcome to the site!
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Block bitmap for group 416 not in group (block 0) + group descriptors corrupted? quanta Linux - Server 1 12-08-2010 10:40 AM
fsck.ext3 keeps fails with "Error reading block" short read at same block jpletka Linux - Server 2 06-10-2010 02:46 AM
Error reading block "x" (Attempt to read block from....... pvandyk2005 Slackware 6 07-06-2008 05:25 AM
how to block phani.85 Linux - Enterprise 2 12-28-2007 07:44 PM
IPTables and PPTPD :S (to block or not to block) thewonka Linux - Networking 0 03-24-2005 06:58 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 12:21 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration