Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.


Yes, I knew it was not there when I posted the reference; however, the reference stands. It is just a matter of taking a look at the actual book's glossary. Maybe in a Google Books preview? Or at Amazon?

What kind of a-hole logic is that?

Why not just post the definition to the forum?


I'm going to have to agree with sundialsvcs. This is a beautiful thread. (In fact, it is the first time I have subscribed to a thread in over a year.) Please let's not destroy it with nitpicking; and that is not directed at any one poster.

OP's question has been answered - and then some - in a most satisfactory fashion.

So, "implementation of integer math for all numeric functions eliminates rounding errors associated with floating point" is a baseless statement - with a long enough mantissa, floating-point numbers are no worse than integers, and may even be better.

On a 32-bit machine, integers are typically 32 bits, while doubles have a 53-bit mantissa, so doubles are already better out of the box. And gcc already supports 'long double': http://en.wikipedia.org/wiki/Long_double .

A typical Cobol shop would much more frequently issue reports that involve adding, subtracting, and multiplying numeric fields holding dollars and cents - that is, dollar amounts stored as integers with the two decimal places implied. Your example would be much less common, but could still be handled by standard Cobol.

Could you do the same in C or C++? Sure, but you would either have to code your own library functions to automatically adjust for the implied decimals, or use some other non-standard library extension. My original point was that such common business usages are already implemented in standard Cobol, resulting in much simpler and more intuitive coding.


In C++ it's just a matter of appropriate classes and operator overloading. ...

GMP function categories
There are several categories of functions in GMP:
- High-level signed integer arithmetic functions (mpz). There are about 140 arithmetic and logic functions in this category.
- High-level rational arithmetic functions (mpq). This category consists of about 35 functions, but all signed integer arithmetic functions can be used too, by applying them to the numerator and denominator separately.
- High-level floating-point arithmetic functions (mpf). This is the GMP function category to use if the C type `double' doesn't give enough precision for an application. There are about 65 functions in this category.
- C++ class based interface to all of the above. (The C functions and types can of course be used directly from C++ too.)


So, I don't have to code anything - the library already exists. Such libraries exist for various languages, for example, for Perl:

Sergei Steshenko: My original post to this thread was in response to the (off-topic) question, why is Cobol still out there? You have not refuted my stated response to that question. You have merely asserted that you could, as a C++ guru, do the same things that a rookie (or dinosaur) Cobol programmer could do with Cobol, without having to attain your level of expertise or acquaintance with modern C++ library functions. So, do your C++ coding, and acknowledge that there may still be a place for Cobol as well.


My point is that stating that implementing math using integers improves accuracy is nonsense - regardless of language. Accuracy is lost either because we have an insufficient number of bits for the mantissa, or because mathematically the mantissa would need an infinite number of bits (a periodic fraction) while in reality the number of available bits is finite.

My point is that stating that implementing math using integers improves accuracy is nonsense - regardless of language.

That is correct.

The reason integers -- more correctly, fixed-point arithmetic -- is used in some financial software, is that at the designed precision, over all possible values barring overflows and underflows, all transactions (signed sums) are exact. Typical cases are two or four decimal digits; i.e. 100 or 10000 correspond to 1.0, depending on what the appropriate laws/statutes/customs dictate, and whether the application calculates daily interests or only over longer periods, for example monthly.

With floating-point values, only a very limited range provides exact transactions; basically all exponent bits are wasted. Also, IEEE-754 was officially adopted in 1985, and before that, the rules wrt. rounding and LSB error when using floating-point values depended on the hardware. The key point to note is that the range where all operations are exact when using floating-point values is only a very small fraction of the range of possible floating-point values, and the rules wrt. rounding and precision have only been standardized for about 25 years. (Obviously, the range of possible floating-point values is astronomical considering financial needs, but that is beside the point.)

The myth of integers being more precise than floating-point values is directly related. It is not about precision, really, but about having proof that all possible transactions (signed sums) are exact. Overflow is obviously possible with fixed-point arithmetic, but all CPU architectures have overflow/carry flags, it is trivial to check even without overflow flag support, and such transactions can simply be rejected. With floating-point values, it is a lot more difficult to check whether a transaction is accounted for exactly, or whether rounding occurred while summing. If all discrepancies must be flagged, such rounding would be a complete disaster.

Last edited by Nominal Animal; 05-28-2012 at 01:07 PM.

AFAIR, COBOL had a ROUNDED clause that rounded according to the format (pre-defined as a picture string in COBOL). Whenever accuracy was needed, we used ROUNDED with the COMPUTE statement.

The ROUNDED clause didn't need an explicit precision because the PIC clause defined it.

Quote:

Originally Posted by Sergei Steshenko

My point is that stating that implementing math using integers improves accuracy is nonsense - regardless of language. Accuracy is lost either because we have an insufficient number of bits for the mantissa, or because mathematically the mantissa would need an infinite number of bits (a periodic fraction) while in reality the number of available bits is finite.

You are missing the detail that "fixed point" math typically scales by a fixed power of TEN, while floating point scales by a variable power of TWO.

In some fields, the answer you get when rounding only after scaling by a fixed power of ten is correct by definition, while a different rounding is incorrect.

When there is some other inherent advantage of floating point, it is obviously possible to pre-scale the floating-point values by a fixed power of ten, just as you pre-scale integer values by a fixed power of ten in traditional fixed-point operations. Then the comparison (made earlier in this thread) between the integer and the mantissa of the floating-point value would be fair.

Traditionally, floating point is more expensive than integer arithmetic. So if you have to pre-scale either to get "correct" results, you might as well pre-scale integers. But if you have other reasons to choose floating point, it is important to be aware that you can pre-scale it as well. For example, if you need 48 bits of fixed-point precision on an architecture with 48-bit floating-point mantissas and 32-bit integers, going to double-wide integers might be more trouble than going to floating point.

About multiplying: http://stackoverflow.com/questions/3...-ia32-assembly
So if you want to implement a specific function and need it to be really fast, you need to use assembler; C is not good enough. But in most cases C can be really powerful and efficient.

Quote:

You are missing the detail that "fixed point" math typically scales by a fixed power of TEN, while floating point scales by a variable power of TWO.

In some fields, the answer you get when rounding only after scaling by a fixed power of ten is correct by definition, while a different rounding is incorrect.

When there is some other inherent advantage of floating point, it is obviously possible to pre-scale the floating-point values by a fixed power of ten, just as you pre-scale integer values by a fixed power of ten in traditional fixed-point operations. Then the comparison (made earlier in this thread) between the integer and the mantissa of the floating-point value would be fair.

Traditionally, floating point is more expensive than integer arithmetic. So if you have to pre-scale either to get "correct" results, you might as well pre-scale integers. But if you have other reasons to choose floating point, it is important to be aware that you can pre-scale it as well. For example, if you need 48 bits of fixed-point precision on an architecture with 48-bit floating-point mantissas and 32-bit integers, going to double-wide integers might be more trouble than going to floating point.

I am not missing the detail; in fact, I was talking about BCDs (binary-coded decimals).

And I've been running a lot of numeric FP code - some of it requires prescaling.

Quote:

About multiplying: http://stackoverflow.com/questions/3...-ia32-assembly
So if you want to implement a specific function and need it to be really fast, you need to use assembler; C is not good enough. But in most cases C can be really powerful and efficient.

This is true if you know more than the optimizer (hint: on a large project, this becomes more difficult). For trivial software, the compiler writer has probably done the right thing (doubtless, you can find any number of ways of completing a task that you might deem 'more optimal,' but without actually benchmarking it's all programming by voodoo).

------------

I don't know how to put in a horizontal rule - and I've never really been a fan of double posting, so I'm going to add something here on the C vs. C++ vs. asm vs. whatever.

I tend to approach things very methodically. Any language comparison must first start with the language guarantees and anti-guarantees. It is important to understand that any intrinsics, optimizations, pragma flags, and other compiler or platform specific switches are NOT guaranteed by the language. That means that if you are trying to compare languages care must be taken to choose a correct context.

As an example, I'm going to take the following:

Quote:

Originally Posted by Nominal Animal

I am asking for a concrete case where C++ yields faster execution than what I can write in C. You answered with std::valarray, which I consider to be a rather tricky answer, close to trolling. Yes, currently the atomic primitives related to vector operations have not been standardized in C yet, but they are provided by all C compilers I use.

Here, you have admitted in the bolded part that std::valarray DOES provide something not provided by the C standard, which means that any fair comparison between C and C++ on vectorized maths must now be relegated to the following:

Code:

1. Code *someone* writes as a perfect language implementation
2. A library (whether compiled or not) provided by *someone*
3. Non-standard extensions

You can see that 3 should not be used as either point or counter-point. Anyone quoting non-standard extensions as the BASIS for a solution misses the fact that NSE goes both ways. IE: any example you can give in language X using NSE, I can give a counter example in language Y using NSE.

2 MAY or MAY NOT be a valid argument. Any library code which relies on NSE becomes part of 3. In this context, I choose reliance to mean it is inseparable from the NSE. As an example of what I mean, if there were a library called "SomeBigVectorizedMaths.h", but it REQUIRED that you use the GNU C compiler because it used GNU C extensions, I would say that fell under 3. However, if it could be used with ANY C compiler, and merely took advantage of NSE when they were present, I would say that does NOT fall under 3 - it is merely knowledgeable programming. It does not preclude me from using the Tiny C compiler, Microsoft's C compiler, or *insert name* C compiler.

1 is similar to 2. The idea behind 1 is the 'mythical' (I use that to mean we don't need actual working coded stuff, we accept that if relevant portions of the standards are quoted, the example could exist) code which solves the problem using ONLY GUARANTEED LANGUAGE PRIMITIVES. IE: this is the case where the code uses NO NSE ever.

When I gave the example of std::valarray, I was not trolling (and am offended that you would imply such, given my s:n ratio). The question posed was (paraphrased):

Quote:

Originally Posted by Nominal Animal

I am asking for a concrete case where C++ yields faster execution than what I can write in C.
... std::valarray (orgcandman) is irrelevant to this discussion: sure, it is useful, but I can provide the same functionality in C using my own code.

With this understanding and the framework I outlined above, I believe that std::valarray provides something that C lacks. I'm not going to try and say that in all cases C++ is superior to C. However, your question asks this, and I believe std::valarray represents this case (and is therefore why I picked it - making it quite relevant).

I can rely on std::valarray with ANY C++ compiler. It will always exist, and will always provide vectorized maths. So let's analyze it using either part 1 OR part 2 from above.

With part 1, I am not sure of any standardized C code which allows one to specify vectorized maths. If there is, I would be very interested. As it stands, the only way I know of to implement anything like a vectorized math is to build structs and loops. Note, I'm not talking about any NSE here. If there IS a way to do this in standard C, then they are equivalent.

Let's look at part 2 then, since this is potentially more interesting. We both agree that there exists, for most C compilers, a suite of NSE intrinsics that will provide some basic vectorized math operations. This means that anyone who requires vectorized maths does, as a good bet, have access. In the C++ case, this does NOT require studying the intrinsics - the compiler knows exactly how to use the NSE under the hood to generate the correct code. With C, it REQUIRES that someone take a lot of extra time crafting a solution to a problem that has already been solved. This alone makes C++ far more efficient, programmer-time wise. In code generation (and therefore execution time), they look the same (at least under GNU C, the optimizer throws away the same code it will in C). std::valarray has only a few interfaces which _could_ throw (and all are related to memory), and if we accept NSE, we have a good bet that we can remove most, if not all, of the performance bottlenecks traditionally cited against C++. So again, we have gained programmer time from C++.

The other advantages we gain from C++ are type-safety and code reuse. That is, if I want to reuse my vectorizing code for, say, int and float datatypes, the compiler will autogenerate the code rather than my cut/paste. Additionally, the correct call will always be made, meaning that I can confidently turn on -Wall -Werror (something that traditional C-only projects are scared to do when combined with optimizers that perform code-path analysis) from the beginning. This is not specific to std::valarray, but again the example was chosen for a purpose. The drawback of this approach is the associated code-space bloat.

In all of this, I STILL believe that a traditional C language has a valuable purpose, especially when it comes to constrained memory systems. The fact that C doesn't specify certain things should be seen as a BOON to the language, NOT a drawback. I believe that there are plenty of valid places to use either. As I've always said, specific cases are important as there exists no general case software.

Anyway, this is an attempt to elucidate my position. As for your comment here:

Quote:

Originally Posted by Nominal Animal

Does it yield the optimum results? I do not think so.
...
As far as I know, apply() is limited to (one slice of) a single array, it cannot do multiple arrays in tandem. In C, I have no problem with this.

Care to post some code? I think you'll be quite surprised by the latest version of the G++ optimizer. When given complete information it can vectorize most things impressively well. Heck, it even correctly optimized a coworker's naive usage of the class. Additionally, multiple arrays can be implemented with a single array. Heck, you can even think of a matrix as a fiction on top of a single-dimensioned array.

Quote:

Originally Posted by Nominal Animal

the latter to avoid any security implications by the implicit work done by the language or standard library.

In fact, I would argue that C++, by virtue of its type-safety, implicit code reuse, and pre-built collection schemes, far exceeds C in terms of security benefits. In C, it is possible to exploit a mis-typed *printf() routine. In C++, one is encouraged, quite audibly, to avoid the *printf suite. That's just an example, mind you. As someone well versed in security, this is one place where I cannot even fathom C offering more than C++.

Quote:

Originally Posted by orgcandman

Here, you have admitted in the bolded part that std::valarray DOES provide something not provided by the C standard, which means that any fair comparison between C and C++ on vectorized maths must now be relegated to the following:

You skipped the case I'm interested in: the features provided by current C (or C++) compilers, "de facto standard features", if you will.

Personally, I'm mostly interested in GCC, Intel Compiler Collection, and Pathscale C compilers, since these are the ones I've worked with in HPC environments. I can install Open64 and probably have access to Portland Group compiler, too.

Quote:

Originally Posted by orgcandman

When I gave the example of std::valarray, I was not trolling (and am offended that you would imply such, given my s:n ratio). The question posed was (paraphrased):

With this understanding and the framework I outlined above, I believe that std::valarray provides something that C lacks. I'm not going to try and say that in all cases C++ is superior to C. However, your question asks this, and I believe std::valarray represents this case (and is therefore why I picked it - making it quite relevant).

I think we found the disconnect. I quite agree with your characterization above, but I was not arguing about that. Let me try explaining my concern or point in a different way.

Johnsfine stated that, in the hands of an expert, C++ gives equal or occasionally faster execution speed than C. I have not yet seen the faster execution speed than C, when the C code uses the features I consider "de facto standard". (Well, I don't actually consider them that, but they are available in all the compilers I use. If you look at all C compilers, especially the Microsoft one, you'd probably have to limit yourself to C89, which I would find basically hopeless to work in.)

I thought you were trolling, because you only described a feature, not an application of that feature; and I thought I specifically asked for cases where C++ yields faster execution speed.

This is important to me, because it impacts on how I should direct my development efforts in the future. Just because I prefer C now, I can change my mind if new evidence appears. If I find out that moving to C++ (in my case, most likely a tight subset, or perhaps even freestanding C++) yields clearly faster execution speed for my use cases, I'll happily do that. I just have not seen that case yet.

Quote:

Originally Posted by orgcandman

I am not sure of any standardized C code which allows one to specify vectorized maths. If there is, I would be very interested.

Most compilers support these vector extensions. On x86 and x86-64, I heavily rely on these. I believe they actually originated in the Intel Compiler Collection. I am not sure of how complete the AVX support is, because I don't have access to hardware supporting AVX yet.

For example,

Code:

typedef float v4sf __attribute__((vector_size (16)));

v4sf multiply_add(const v4sf a, const v4sf b, const v4sf c)
{
    return a * b + c;
}

compiled in e.g. GCC with SSE2 available, produces perfectly vectorized code (mulps %xmm1, %xmm0; addps %xmm2, %xmm0) for the function.

This is not standardized by anybody yet, as far as I know, so you could say it is a nonstandard extension and therefore invalid wrt. this discussion. I could claim it is a de facto standard, because most C compilers -- certainly the ones used for HPC -- do compile that C code just fine.

The important point to me is that the above code compiles to vectorized code (although I prefer to use the explicit built-ins instead), using the most popular C compilers used in HPC. This, and the built-in functions, are also a very stable feature; this support has existed as-is since MMX times.

Quote:

Originally Posted by orgcandman

With C, it REQUIRES that someone take a lot of extra time crafting a solution to a problem that has been solved. This alone makes C++ far more efficient, programmer time wise.

No, and yes. The reason why I use these intrinsics is that the extra control yields better results. It is very taxing design work to implement a potential model in a way that vectorizes efficiently, but the end result is easily worth it. I'm talking about 25% to 50% increase in computation speed. In my world, the existing solutions, including std::valarray, are poor.

Quote:

Originally Posted by orgcandman

The other advantages we gain from C++ are the type-safety

Yes, yes; I've agreed on those features already. It is not like I'm recommending new programmers to ignore C++ and concentrate on C.

Quote:

Originally Posted by orgcandman

Care to post some code?

I'd rather not, but I guess I have to. My multiband EAM code is too complex for this, but I think Lennard-Jones should be simple enough. I'll include the mathematical stuff and explanations, so others can implement the same algorithm, instead of trying to translate C to some other language. But I warn you, I do have to use the vector built-in functions; it's pretty nasty to read. I'll try to add enough comments.

Actually I found a very useful real world case, which is even simpler: Calculating distances and unit vectors between one atom and all its neighbors.

The difficulty is that atom coordinates are not consecutive in memory. Instead, you have an array of atom identifiers (indexes); the coordinates must be referred to indirectly.

Neighbor data (atom index, distance, unit vector) is consecutive in memory, however. Atom index, distance, and unit vector also have room for one extra entry, and are padded to a multiple of four entries.

There is an additional twist: the distances, unit vectors, and atom numbers must be arranged so that all neighbors closer than a specified limit are listed first. Basically, the same thing as in quicksort: the distance limit is the pivot. The pivot itself belongs to the latter set. This is an important algorithmic optimization relied on elsewhere, so that we do not need to calculate known zeros.

With this code, a simulation relying on atom pair interactions (Morse, Lennard-Jones, EAM, dipole interactions between charged particles) does not need to refer to the scattered atom coordinates, but has both the distance and pair unit vectors readily available. My tests indicate this yields faster execution than the traditional model, where interatom distances and vectors are calculated from coordinates on each run. In particular, the ability to skip the atoms that are known to not interact (albeit being still neighbors), is important. So, this is not an abstract benchmark, but real-world useful code.

Here is the slow, plain, reference C code. This uses separate arrays for each coordinate, but you can use 3- or 4-component point structures or objects just as well.

Code:

int distances(const unsigned int neighbors,  /* Number of neighbors */
              unsigned int *const neighbor,  /* Neighbor atom indexes */
              double *const distance,        /* Consecutive output array: distance */
              double *const deltax,          /* Consecutive output array: unit vector x components */
              double *const deltay,          /* Consecutive output array: unit vector y components */
              double *const deltaz,          /* Consecutive output array: unit vector z components */
              const double *const x,         /* Atoms' x coordinates */
              const double *const y,         /* Atoms' y coordinates */
              const double *const z,         /* Atoms' z coordinates */
              const unsigned int atom,       /* Index for the one atom */
              const double limit)            /* Distance limit, "pivot" */
{
    {
        const double x0 = x[atom],  /* This atom coordinates; */
                     y0 = y[atom],  /* subtract from the other */
                     z0 = z[atom];  /* atom to get delta vector. */
        unsigned int i;
        for (i = 0; i < neighbors; i++) {
            const unsigned int n = neighbor[i];
            const double dx = x[n] - x0,  /* Delta vector components */
                         dy = y[n] - y0,
                         dz = z[n] - z0;
            const double dd = dx*dx + dy*dy + dz*dz;  /* Distance squared */
            const double d = sqrt(dd);                /* Distance */
            /* Save unit vector, */
            deltax[i] = dx / d;
            deltay[i] = dy / d;
            deltaz[i] = dz / d;
            /* and distance. */
            distance[i] = d;
        }
    }
    /* The distance[], deltax[], deltay[], deltaz[] are guaranteed to
     * have at least one extra element, and padded to a multiple of 4 elements.
     * Use a sentinel to avoid ihead < itail checks.
     */
    distance[neighbors] = limit;
    {
        unsigned int ihead = 0;
        unsigned int itail = neighbors - 1;
        while (1) {
            /* Keep close neighbors at start of list */
            while (distance[ihead] < limit)
                ihead++;
            /* All under limit? */
            if (ihead >= itail)
                return ihead;
            /* Keep far neighbors at end of list */
            while (distance[itail] >= limit)
                itail--;
            /* Done? */
            if (ihead >= itail)
                return ihead;
            /* Swap. */
            {
                double dx = deltax[ihead],
                       dy = deltay[ihead],
                       dz = deltaz[ihead],
                       dd = distance[ihead];
                unsigned int nn = neighbor[ihead];
                deltax[ihead] = deltax[itail]; deltax[itail] = dx;
                deltay[ihead] = deltay[itail]; deltay[itail] = dy;
                deltaz[ihead] = deltaz[itail]; deltaz[itail] = dz;
                distance[ihead] = distance[itail]; distance[itail] = dd;
                neighbor[ihead] = neighbor[itail]; neighbor[itail] = nn;
            }
            /* This pair is now in correct order. */
            ihead++;
            itail--;
        }
    }
}

On my machine (Athlon II X4 640), using gcc-4.6.1 and -O3 -fomit-frame-pointer -msse3, the above runs in 80 clocks per neighbor, including all overhead. My test case is a unit cubic lattice, with neighbors included up to ±3 (7³ - 1 = 342 neighbors total) and the pivot at distance 4.5. That should be roughly typical; about 10% of neighbor list members do not interact with the atom at any given point in time.

My reference implementation using SSE3 runs in 47 clocks per neighbor, including all overhead; about 41% fewer clocks than the plain C version:

Code:

typedef double v2df __attribute__((vector_size (16)));

/* TODO: See if using doubles would actually be faster. */
static inline void swap_doubles(void *const ptr1, void *const ptr2)
{
    const unsigned long t = *(unsigned long *)ptr1;
    *(unsigned long *)ptr1 = *(unsigned long *)ptr2;
    *(unsigned long *)ptr2 = t;
}

static inline void swap_uints(void *const ptr1, void *const ptr2)
{
    const unsigned int t = *(unsigned int *)ptr1;
    *(unsigned int *)ptr1 = *(unsigned int *)ptr2;
    *(unsigned int *)ptr2 = t;
}

/* Update the neighbor list for one atom.
 * Return the number of interacting neighbors.
 * The neighbor and distance lists are padded to a multiple of 4,
 * with at least one unused entry at end.
 */
int distances(const unsigned int neighbors,
              unsigned int *const neighbor,
              double *const distance,
              double *const deltax,
              double *const deltay,
              double *const deltaz,
              const double *const x,
              const double *const y,
              const double *const z,
              const unsigned int atom,
              const double limit)
{
    /* Note: neighbor[] array is padded with a safe index
     * to a multiple of four entries. */
    /* Distance calculation part. */
    {
        const v2df x0 = { x[atom], x[atom] },
                   y0 = { y[atom], y[atom] },
                   z0 = { z[atom], z[atom] };
        unsigned int i = 0;
        while (i < neighbors) {
            const unsigned int i1 = neighbor[i++];
            const unsigned int i2 = neighbor[i++];
            const unsigned int i3 = neighbor[i++];
            const unsigned int i4 = neighbor[i++];
            v2df x1 = { x[i1], x[i2] },
                 y1 = { y[i1], y[i2] },
                 z1 = { z[i1], z[i2] },
                 x2 = { x[i3], x[i4] },
                 y2 = { y[i3], y[i4] },
                 z2 = { z[i3], z[i4] },
                 d1, d2, t1, t2;
            /* Subtract current atom coordinates. */
            x1 = __builtin_ia32_subpd(x1, x0);
            y1 = __builtin_ia32_subpd(y1, y0);
            z1 = __builtin_ia32_subpd(z1, z0);
            x2 = __builtin_ia32_subpd(x2, x0);
            y2 = __builtin_ia32_subpd(y2, y0);
            z2 = __builtin_ia32_subpd(z2, z0);
            /* Calculate squared distances. */
            d1 = x1;
            d2 = x2;
            d1 = __builtin_ia32_mulpd(d1, d1);
            d2 = __builtin_ia32_mulpd(d2, d2);
            t1 = y1;
            t2 = y2;
            t1 = __builtin_ia32_mulpd(t1, t1);
            t2 = __builtin_ia32_mulpd(t2, t2);
            d1 = __builtin_ia32_addpd(d1, t1);
            d2 = __builtin_ia32_addpd(d2, t2);
            t1 = z1;
            t2 = z2;
            t1 = __builtin_ia32_mulpd(t1, t1);
            t2 = __builtin_ia32_mulpd(t2, t2);
            d1 = __builtin_ia32_addpd(d1, t1);
            d2 = __builtin_ia32_addpd(d2, t2);
            /* Take square root to get distances. */
            d1 = __builtin_ia32_sqrtpd(d1);
            d2 = __builtin_ia32_sqrtpd(d2);
            /* Divide deltas by distances to get unit vectors. */
            x1 = __builtin_ia32_divpd(x1, d1);
            y1 = __builtin_ia32_divpd(y1, d1);
            z1 = __builtin_ia32_divpd(z1, d1);
            x2 = __builtin_ia32_divpd(x2, d2);
            y2 = __builtin_ia32_divpd(y2, d2);
            z2 = __builtin_ia32_divpd(z2, d2);
            /* Save coordinate deltas. */
            __builtin_ia32_storeupd(&deltax[i-4], x1);
            __builtin_ia32_storeupd(&deltay[i-4], y1);
            __builtin_ia32_storeupd(&deltaz[i-4], z1);
            __builtin_ia32_storeupd(&deltax[i-2], x2);
            __builtin_ia32_storeupd(&deltay[i-2], y2);
            __builtin_ia32_storeupd(&deltaz[i-2], z2);
            /* Save distances. */
            __builtin_ia32_storeupd(&distance[i-4], d1);
            __builtin_ia32_storeupd(&distance[i-2], d2);
        }
    }
    /* Set the distance sentinel. */
    distance[neighbors] = limit;
    /* Neighbor sorting part.
     * Move atoms closer than limit to start of list,
     * and atoms at limit or further to end of list.
     * Other than that, the order is not important.
     */
    {
        unsigned int ihead = 0;
        unsigned int itail = neighbors - 1;
        while (1) {
            /* Keep close neighbors at start of list */
            while (distance[ihead] < limit)
                ihead++;
            /* All close? */
            if (ihead >= itail)
                return ihead;
            /* Keep far neighbors at end of list */
            while (distance[itail] >= limit)
                itail--;
            /* Done? */
            if (ihead >= itail)
                return ihead;
            /* Swap. */
            swap_doubles(&deltax[ihead], &deltax[itail]);
            swap_doubles(&deltay[ihead], &deltay[itail]);
            swap_doubles(&deltaz[ihead], &deltaz[itail]);
            swap_doubles(&distance[ihead], &distance[itail]);
            swap_uints(&neighbor[ihead], &neighbor[itail]);
            /* This pair is now in correct order. */
            ihead++;
            itail--;
        }
    }
}

Let me be the first to say that my SSE3 code is butt-ugly. It belongs in a simulation library if anywhere -- which is exactly where I'll end up putting it. It is already complex enough to escape my assembly optimization skills: I can easily write it in assembly, but GCC seems to produce code at least as fast. I'd need an IDE with the ability to estimate clocks used, pipelines used, stalls, and so on; it is just too slow to find a good balance otherwise.

My point for writing it is that I still believe this code yields the fastest execution time. I really would not mind being proven wrong.

Note that I just wrote this from scratch, it is not something I've yet tried to optimize any further. Other than working in groups of four indirect atoms, it is a very straightforward implementation, I think.

I'd be very interested to see an efficient C++ implementation we could compare to my SSE3 version above. If you need to tweak the problem to solve it better, please do; just remember that neighbor atom coordinates cannot be consecutive in memory - they have to be indirectly referenced - we need both distances and unit vectors, and the neighbors closer than the specified limit should be listed first in the result.
