Assembly program runs 81 times *slower* with 13 fewer instructions.
I have a simple math program that solves a special type of matrix. It works its way down the matrix, then back up. It finishes where it started. It runs on a Pentium 4 on Ubuntu.
Code:
Code:
.section .data

I am measuring how long it takes to run, so I have a loop that makes it run 1,000,000 times. The program runs in 0.31 seconds. I build it with profiling enabled:
Code:
$ gcc test.s -pg

This is an O(n) algorithm, and "downloop" and "backloop" work independently of one another. All they do is operate on the numbers, leaving the structure unchanged. Also, I am using the identity matrix as my test matrix, which is left unchanged by these operations. Why does it run so slowly? |
Without seeing any specific code, one can only guess that EDI is being changed in the 'backloop' loop, affecting the overall loop count. Removing backloop would then remove that effect and change the timing accordingly.
--- rod. |
Maybe I should post the whole thing. I didn't want to bother people with the details; assembly language programs are hard to read and require a lot of explanation. Here's the whole program, including odd comments which probably won't make sense to anyone.
In any event, downloop doesn't change %edi (as far as I can tell). I tried using %esi instead, but that doesn't help. This was the first assembly language program I wrote. At the time, I didn't know I could write fld 4(%eax) instead of:
Code:
add $4, %eax
fld (%eax)

Code:
.section .data

Thanks. |
Hi, Travis86.
I don't know anything about the math behind what you are doing, but I do know that many library routines use techniques that generally increase speed. Do you have any timing comparisons between your routines and library routines -- perhaps those of BLAS or LAPACK? ... cheers, makyo |
Eventually I'd like to benchmark my program against someone else's, but I'm going to wait until I get it running as fast as possible.
|
It's interesting that the instruction in backloop that seems to make the difference is:
Code:
fst (%eax) # store d1

You should dump the matrix in gdb a few times and compare it with backloop in and with it commented out. I'm guessing that store is putting values back into the matrix that the FPU can optimize on (like 0 or 1), and when you don't have it, the FPU has to do more work. |
Did you at least compare the performance of your assembly implementation to an implementation in a high-level language, such as C? If you know how to write the algorithm in assembly, surely you could write it in C as well...
Today's compilers generate pretty efficient code. Another point is that your application may benefit from using SSE instructions (which your P4 machine must support). You could tell the compiler to use SSE instructions wherever possible (I don't remember which gcc flags do that). |
If you can enable floating point exceptions that would be good; it may be possible that the FPU is generating a large number of interrupts and wasting processor time with the exception handling.
|
I think pinniped is on the right track, but I still haven't figured it out.
I thought that when I commented out backloop, things were being left on the floating-point stack, and the processor didn't like the stack overflowing. But even when backloop isn't commented out, things are left on the stack: after downloop there is one number on the floating-point stack, and after backloop there is one number on the floating-point stack. Nonetheless, I noticed three interesting things that suggest the problem does have to do with leaving things on the floating-point stack:

1. If you replace backloop with "fstp %st(0)" (which pops the number off the floating-point stack), the program runs fine.

2. When you comment out just two particular lines in backloop, it runs slowly:
Code:
backloop:

3. When I modified version 2 so that it didn't leave anything on the floating-point stack, it stopped running 115x slower and started running a few percent faster.

I'll keep working on it.

---------

Estabroo, what do you mean by "makes the difference"? If I comment out that line, it works the same. Plus, I'm working with the identity matrix, and all of the values remain unchanged. All entries are either 0 or 1 all of the time.

---------

fuzzyBuzz, SSE instructions perform the same operation on several numbers at once. For mathematical reasons, one row can't be modified before the previous row has been modified, so you can't work on the rows in parallel.

---------

I have done some speed comparisons. I made the best C++ algorithm I could and compiled it with the "-O3" flag. When solving an 8-variable matrix, the assembly version runs about 50% faster (i.e., the assembly version does it in 62.35% of the C++ time). When solving a 1600-variable matrix, the assembly version runs about 8 times faster (the assembly version does it in 12.14% of the C++ time). |
Well, congratulations on getting somewhere with the code. Do be careful when leaving data on the stack; the FPU doesn't have much storage space to start with so things can get really horrible. What does the CPU's manual say about pushing too many variables onto the FPU stack?
|
When you push too many values onto the stack, it raises an exception and starts overwriting the bottom of the stack (section 8.1.2 of the Intel manual).
I do not think that the stack overflowing was/is the problem: whether backloop was commented out or not, values were being left on the stack. I don't know what the problem is, and I can't be sure I've really fixed what's going wrong. Until I get some inspiration, I've decided to just fiddle with my program until it works.

The fast/slow problem has cropped up again with version 2. With some input matrices it would run slow; with others it would run fast. I made some trivial change, and it ran fast for all matrices. Success ... sort of. |
One suggestion: the alignment of your data and code with respect to word/doubleword/other boundaries may affect how fast the CPU can access them. I believe this is one of the ways an optimizing compiler works its magic. See if you can affect performance by inserting dummy data and/or code to force some optimal alignment, or see if there are assembler directives that apply such alignment for you.
--- rod. |