LinuxQuestions.org

-   -   Assembly program runs 81 times *slower* with 13 fewer instructions. (https://www.linuxquestions.org/questions/programming-9/assembly-program-runs-81-times-%2Aslower%2A-with-13-fewer-instructions-681810/)

Travis86 11-07-2008 12:06 PM

Assembly program runs 81 times *slower* with 13 fewer instructions.
 
I have a simple math program that solves a special type of matrix. It works its way down the matrix, then back up. It finishes where it started. It runs on a Pentium 4 on Ubuntu.

Code:

.section .data

length: .int 8
matrix:
  [a matrix]

.section .text
.globl main
main:
  mov $1000000, %edi
  testloop:

  mov $matrix, %eax # array pointer

  [5 load/store and math instructions]
 
  mov length, %ecx
  downloop:
    [21 load/store and math instructions]
    loop downloop

  [2 load/store and math instructions]

  mov length, %ecx
  backloop:
    [10 load/store and math instructions]
    loop backloop

  dec %edi
  jnz testloop


I am measuring how long it takes to run, so I have a loop that makes it run 1,000,000 times. The program runs in 0.31 seconds:

Code:

$ gcc test.s -pg
$ ./a.out
$ gprof -b ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %  cumulative  self              self    total         
 time  seconds  seconds    calls  Ts/call  Ts/call  name   
 75.81      0.23    0.23                            downloop
 16.13      0.28    0.05                            backloop
  8.06      0.31    0.03                            testloop

As you can see from gprof, backloop is smaller and faster than downloop. It's 13 instructions, or about 25% of the program. However, when I comment out the "backloop" section, the program runs 81 times slower!

Code:

$ gcc test.s -pg
$ ./a.out
$ gprof -b ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %  cumulative  self              self    total         
 time  seconds  seconds    calls  Ts/call  Ts/call  name   
 99.74    25.16    25.16                            downloop
  0.26    25.23    0.07                            testloop

What would cause this? It no longer stops where it started, but I hardly think this would cause an 81x speed penalty.

This is an O(n) algorithm, and "downloop" and "backloop" work independently from one another. All they do is operate on numbers, leaving the structure unchanged. Also, I am using the identity matrix for my test matrix, which remains unchanged by these operations.

Why does it run so slowly?

theNbomr 11-07-2008 07:34 PM

Without seeing any specific code, one can only guess that EDI is being changed inside the 'backloop' loop, affecting the total number of iterations; removing the loop would change the timing accordingly.
--- rod.

Travis86 11-07-2008 09:55 PM

Maybe I should post the whole thing. I didn't want to bother people with the details; assembly language programs are hard to read and require a lot of explanation. Here's the whole program, including odd comments that probably won't make sense to anyone.

In any event, downloop doesn't change %edi (as far as I can tell). I tried using %esi instead, but that doesn't help.

This was the first assembly language program I wrote. At the time I wrote this, I didn't know I could use
fld 4(%eax)
-- instead of --
add $4, %eax
fld (%eax)

Code:

.section .data

length: .int 7 # actually length minus 1
matrix: #a  b  c    d
 .float  0,  1,  0,  1 # temporary 0 for gdb
 .float  0,  1,  0,  1
 .float  0,  1,  0,  1
 .float  0,  1,  0,  1
 .float  0,  1,  0,  1
 .float  0,  1,  0,  1
 .float  0,  1,  0,  1
 .float  0,  1,  0,  1

.section .text
.globl main
main:
  mov $1000000, %edi
  testloop:

  mov $matrix, %eax # array pointer
  add $4, %eax # ignore that temporary 0
 
  fld (%eax)  # load b
  fld1
  fstp (%eax)  # optional store b=1
  add $8, %eax # move up 2 numbers
  fld (%eax)  # load d

  mov length, %ecx
  downloop:
    fdiv %st(1), %st(0)
    fst (%eax)  # store d
    fxch
    sub $4, %eax
    fld (%eax)  # load c
    fdivp %st(0), %st(1)
    fst (%eax)  # store c
   
    # this is where a new row starts
   
    add $8, %eax
    fld (%eax)    # load a
    fldz
    fstp (%eax)    # optional store a=0
    fmul %st(0), %st(2)
    fmulp
    add $4, %eax
    fsubr (%eax)  # load b
    fld1
    fstp (%eax)    # optional store b=1
    fxch
    add $8, %eax
    fsubr (%eax)  # load d
    loop downloop

  fdivp          # d/b
  fst (%eax)      # store d

/*
  mov length, %ecx
  backloop:
    sub $20, %eax # go back 5 spaces (5*4 bytes)
    fld (%eax)    # load c1
    fldz         
    fstp (%eax)  # optional store c=0
    add $4, %eax
    fld (%eax)    # load d1
    fstp %st(3)
    fmulp
    fsubrp
    fst (%eax)    # store d1
    loop backloop
*/

  dec %edi
  jnz testloop

The program solves tri-diagonal matrices. This is version 1; version 2 has 14 fewer instructions, but it runs 115 times slower. I didn't post version 2, because it's more complicated and corrects for certain math pitfalls. After I solve this problem I'll work on that one.

Thanks.

makyo 11-08-2008 10:35 AM

Hi, Travis86.

I don't know anything about the math behind what you are doing, but I do know that many library routines use techniques that generally increase speed.

Do you have any timing comparisons between your routines and library routines -- perhaps those of blas or lapack? ... cheers, makyo

Travis86 11-08-2008 06:52 PM

Eventually I'd like to benchmark my program against someone else's, but I'm going to wait until I get it running as fast as possible.

estabroo 11-08-2008 10:12 PM

It's interesting that the instruction in backloop which seems to make the difference is:

fst (%eax) # store d1


You should dump the matrix in gdb a few times and compare what it looks like with backloop in and with it commented out. I'm guessing that the store puts values back into the matrix that the FPU can handle quickly (like 0 or 1), and when you don't have it, the FPU has to do more work.

fuzzyBuzz 11-10-2008 03:52 AM

Did you at least compare the performance of your assembly implementation to an implementation in a high-level language, such as C? If you know how to write the algorithm in assembly, surely you could write it in C as well.
Today's compilers generate pretty efficient code.
Another point is that your application may benefit from SSE instructions (which your P4 machine must support). You could tell the compiler to use SSE instructions wherever possible (I don't remember which flags gcc uses for this).

pinniped 11-10-2008 04:36 AM

If you can enable floating-point exceptions, that would be good; it may be that the FPU is generating a large number of interrupts and wasting processor time on exception handling.

Travis86 11-10-2008 12:43 PM

I think pinniped is on the right track, but I still haven't figured it out.

I thought that when I commented out backloop, values were being left on the floating-point stack, and the processor didn't like the stack overflowing. But values are left on the stack even when backloop isn't commented out: there is one number on the floating-point stack after downloop, and one after backloop.

Nonetheless, I noticed three interesting things that seem to indicate the problem has to do with leaving values on the floating-point stack:

1. If you replace backloop with "fstp %st(0)" (which pops one value off the floating-point stack), the program runs fine.

2. When you comment out just two particular lines in backloop, it still runs slowly:

Code:

backloop:
    sub $20, %eax # go back 5 spaces (5*4 bytes)
    fld (%eax)    # load c1
    fldz         
    fstp (%eax)  # optional store c=0
    add $4, %eax
    fld (%eax)    # load d1
    fstp %st(3)
#    fmulp
#    fsubrp
    fst (%eax)    # store d1
    loop backloop

When you replace them with two "fstp %st(0)" instructions, the program runs fine (albeit with mathematically incorrect results). fmulp and fsubrp both pop the stack when they finish.

3. When I modified version 2 so that it didn't leave anything on the floating-point stack, it stopped running 115x slower and started running a few percent faster.

I'll keep working on it.

---------

Estabroo, what do you mean by "makes the difference"? If I comment out that line, it behaves the same. Plus, I'm working with the identity matrix, so all of the values remain unchanged. All entries are either 0 or 1 all of the time.

---------

fuzzyBuzz, SSE instructions perform the same operation on several numbers at once. For mathematical reasons, one row can't be modified before the previous row has been modified, so you can't work on the rows in parallel.

---------

I have done some speed comparisons. I wrote the best C++ version I could and compiled it with the "-O3" flag. When solving an 8-variable matrix, the assembly version runs about 1.6 times faster (i.e., it finishes in 62.35% of the C++ time). When solving a 1600-variable matrix, the assembly version runs about 8 times faster (12.14% of the C++ time).

pinniped 11-10-2008 05:14 PM

Well, congratulations on getting somewhere with the code. Do be careful when leaving data on the stack; the FPU doesn't have much storage space to start with so things can get really horrible. What does the CPU's manual say about pushing too many variables onto the FPU stack?

Travis86 11-14-2008 07:35 PM

When you push too many values onto the stack, the FPU signals an invalid-operation exception and wraps around, overwriting the bottom of the stack (section 8.1.2 of the Intel manual).

I do not think that the stack overflowing was/is the problem. Whether backloop was commented out or not, values were being left on the stack.

I don't know what the problem is, and I can't be sure I've really fixed what was going wrong. I've decided that until I get some inspiration, I'll just fiddle with my program until it works.

The fast/slow problem has cropped up again with version 2. With some input matrices it would run slowly; with others it would run fast. I made some trivial change, and now it runs fast for all matrices.

Success ... sort of.

theNbomr 11-15-2008 11:30 AM

One suggestion: the alignment of your data and code with respect to word/doubleword/other boundaries may affect how fast the CPU can access them. I believe this is one of the ways an optimizing compiler works its magic. See if you can affect performance by inserting dummy data and/or code to force optimal alignment, or see whether there are assembler directives that apply such alignment for you.
--- rod.

