Assembly program runs 81 times *slower* with 13 fewer instructions.
I have a simple math program that solves a special type of matrix. It works its way down the matrix, then back up. It finishes where it started. It runs on a Pentium 4 on Ubuntu.
Code:
.section .data
length: .int 8
matrix:
[a matrix]
.section .text
.globl main
main:
mov $1000000, %edi
testloop:
mov $matrix, %eax # array pointer
[5 load/store and math instructions]
mov length, %ecx
downloop:
[21 load/store and math instructions]
loop downloop
[2 load/store and math instructions]
mov length, %ecx
backloop:
[10 load/store and math instructions]
loop backloop
dec %edi
jnz testloop
I am measuring how long it takes to run, so I have a loop that makes it run 1,000,000 times. The program runs in 0.31 seconds:
Code:
$ gcc test.s -pg
$ ./a.out
$ gprof -b ./a.out
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
75.81       0.23     0.23                             downloop
16.13       0.28     0.05                             backloop
 8.06       0.31     0.03                             testloop
As you can see from gprof, backloop is smaller and faster than downloop. It's 13 instructions, or about 25% of the program. However, when I comment out the "backloop" section, the program runs 81 times slower!
Code:
$ gcc test.s -pg
$ ./a.out
$ gprof -b ./a.out
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
99.74      25.16    25.16                             downloop
 0.26      25.23     0.07                             testloop
What would cause this? It no longer stops where it started, but I hardly think this would cause an 81x speed penalty.
This is an O(n) algorithm, and "downloop" and "backloop" work independently from one another. All they do is operate on numbers, leaving the structure unchanged. Also, I am using the identity matrix for my test matrix, which remains unchanged by these operations.
Without seeing any specific code, one can only guess that EDI is being changed in the 'backloop' loop, and affecting the number of overall loops. Removing the effect changes it accordingly.
--- rod.
Maybe I should post the whole thing. I didn't want to bother people with the details. Assembly language programs are hard to read and require a lot of explanation. Here's the whole program - including odd comments which probably will make no sense to anyone.
In any event, downloop doesn't change %edi (as far as I can tell). I tried using %esi instead, but that doesn't help.
This was the first assembly language program I wrote. At the time I wrote this, I didn't know I could use
Code:
fld 4(%eax)
instead of
Code:
add $4, %eax
fld (%eax)
Code:
.section .data
length: .int 7 # actually length minus 1
matrix: #a b c d
.float 0, 1, 0, 1 # temporary 0 for gdb
.float 0, 1, 0, 1
.float 0, 1, 0, 1
.float 0, 1, 0, 1
.float 0, 1, 0, 1
.float 0, 1, 0, 1
.float 0, 1, 0, 1
.float 0, 1, 0, 1
.section .text
.globl main
main:
mov $1000000, %edi
testloop:
mov $matrix, %eax # array pointer
add $4, %eax # ignore that temporary 0
fld (%eax) # load b
fld1
fstp (%eax) # optional store b=1
add $8, %eax # move up 2 numbers
fld (%eax) # load d
mov length, %ecx
downloop:
fdiv %st(1), %st(0)
fst (%eax) # store d
fxch
sub $4, %eax
fld (%eax) # load c
fdivp %st(0), %st(1)
fst (%eax) # store c
# this is where a new row starts
add $8, %eax
fld (%eax) # load a
fldz
fstp (%eax) # optional store a=0
fmul %st(0), %st(2)
fmulp
add $4, %eax
fsubr (%eax) # load b
fld1
fstp (%eax) # optional store b=1
fxch
add $8, %eax
fsubr (%eax) # load d
loop downloop
fdivp # d/b
fst (%eax) # store d
/*
mov length, %ecx
backloop:
sub $20, %eax # go back 5 spaces (5*4 bytes)
fld (%eax) # load c1
fldz
fstp (%eax) # optional store c=0
add $4, %eax
fld (%eax) # load d1
fstp %st(3)
fmulp
fsubrp
fst (%eax) # store d1
loop backloop
*/
dec %edi
jnz testloop
The program solves tri-diagonal matrices. This is version 1; version 2 has 14 fewer instructions, but it runs 115 times slower. I didn't post version 2, because it's more complicated and corrects for certain math pitfalls. After I solve this problem I'll work on that one.
It's interesting the instruction in backloop that seems to make the difference is:
fst (%eax) # store d1
You should, in gdb, dump the matrix a few times and see the difference when backloop is in and when it isn't. I'm guessing that store is putting back values into the matrix that the fpu can optimize on (like 0 or 1) and when you don't have it, the fpu has to do more work.
Did you at least compare the performance of your assembly implementation to an implementation in a high-level language, such as C? If you know how to write the algorithm in assembly, surely you could write it in C as well...
Today's compilers know how to generate pretty efficient code.
Another point is that your application may benefit from using SSE instructions (which your P4 machine must support). You could tell the compiler to try to use SSE instructions wherever possible (I don't remember which gcc flags those are).
If you can enable floating point exceptions that would be good; it may be possible that the FPU is generating a large number of interrupts and wasting processor time with the exception handling.
I think pinniped is on the right track, but I still haven't figured it out.
I thought when I commented out backloop, things were being left on the floating-point stack, and the processor didn't like the stack overflowing. But even when backloop isn't commented out, things are left on the stack. After downloop, there is one number on the floating-point stack, and after backloop, there is one number on the floating-point stack.
Nonetheless, I noticed three interesting things that seem to indicate the problem has to do with leaving values on the floating-point stack:
1. If you replace backloop with "fstp %st(0)" (which pops the top number off the floating-point stack), the program runs fine.
2. When just two particular lines in backloop are commented out, it still runs slowly:
Code:
backloop:
sub $20, %eax # go back 5 spaces (5*4 bytes)
fld (%eax) # load c1
fldz
fstp (%eax) # optional store c=0
add $4, %eax
fld (%eax) # load d1
fstp %st(3)
# fmulp
# fsubrp
fst (%eax) # store d1
loop backloop
When you replace them with two "fstp %st(0)" instructions, the program runs fine (albeit mathematically incorrectly). fmulp and fsubrp both pop the stack when they finish.
3. When I modified version 2 so that it didn't leave anything on the floating-point stack, it stopped running 115x slower and started running a few percent faster.
I'll keep working on it.
---------
Estabroo, what do you mean by "makes the difference"? If I comment out that line, it works the same. Plus, I'm working with the identity matrix, and all of the values remain unchanged. All entries are either 0 or 1 all of the time.
---------
fuzzyBuzz, SSE instructions perform the same operation on several numbers at once. For mathematical reasons, one row can't be modified before the previous row has been modified. So, you can't work on the rows in parallel.
---------
I have done some speed comparisons. I made the best C++ algorithm I could and compiled it with the "-O3" flag. When solving an 8-variable matrix, the assembly version runs about 50% faster (i.e., the assembly version does it in 62.35% of the C++ time). When solving a 1600-variable matrix, the assembly version runs about 8 times faster (the assembly version does it in 12.14% of the C++ time).
Well, congratulations on getting somewhere with the code. Do be careful when leaving data on the stack; the FPU doesn't have much storage space to start with so things can get really horrible. What does the CPU's manual say about pushing too many variables onto the FPU stack?
When you push too many values onto the stack, it raises an exception and starts overwriting the bottom of the stack (section 8.1.2 of the Intel manual).
I do not think that the stack overflowing was/is the problem. Whether backloop was commented out or not, values were being left on the stack.
I don't know what the matter is, and I can't be sure I've really fixed what's going wrong. I've decided until I get some inspiration, I'll just fiddle with my program until it works.
The fast/slow problem has cropped up again with version 2. With some input matrices it would run slow; with others it would run fast. I made some trivial change, and it ran fast for all matrices.
One suggestion. The alignment of your data and code with respect to word/doubleword/other boundaries may impact how fast the CPU can access such data. I believe this is one of the ways that an optimizing compiler works its magic. See if you can affect performance by inserting dummy data &/or code to force some optimal alignment, or see if there are assembler directives to cause the assembler to apply such alignment optimization.
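In GAS that would look something like this (a sketch; whether 16 bytes is the right boundary for a P4 is a guess):
Code:

```asm
.section .data
        .balign 16        # pad so the data starts on a 16-byte boundary
length: .int 7
        .balign 16
matrix:
        .float 0, 1, 0, 1
```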
--- rod.