LinuxQuestions.org
Old 11-07-2008, 12:06 PM   #1
Travis86
Member
 
Registered: Dec 2002
Location: The land of GMT -6
Distribution: OS X, PS2 Linux, Ubuntu, IRIX 6.5
Posts: 399

Rep: Reputation: 31
Assembly program runs 81 times *slower* with 13 fewer instructions.


I have a simple math program that solves a special type of matrix. It works its way down the matrix, then back up. It finishes where it started. It runs on a Pentium 4 on Ubuntu.

Code:
.section .data

length: .int 8
matrix:
  [a matrix]

.section .text
.globl main
main:
  mov $1000000, %edi
  testloop:

  mov $matrix, %eax # array pointer

  [5 load/store and math instructions]
  
  mov length, %ecx
  downloop:
    [21 load/store and math instructions]
    loop downloop

  [2 load/store and math instructions]

  mov length, %ecx
  backloop:
    [10 load/store and math instructions]
    loop backloop

  dec %edi
  jnz testloop

I am measuring how long it takes to run, so I have a loop that makes it run 1,000,000 times. The program runs in 0.31 seconds:

Code:
$ gcc test.s -pg
$ ./a.out 
$ gprof -b ./a.out 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 75.81      0.23     0.23                             downloop
 16.13      0.28     0.05                             backloop
  8.06      0.31     0.03                             testloop
As you can see from gprof, backloop is smaller and faster than downloop. It's 13 instructions, or about 25% of the program. However, when I comment out the "backloop" section, the program runs 81 times slower!

Code:
$ gcc test.s -pg
$ ./a.out 
$ gprof -b ./a.out 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 99.74     25.16    25.16                             downloop
  0.26     25.23     0.07                             testloop
What would cause this? It no longer stops where it started, but I hardly think this would cause an 81x speed penalty.

This is an O(n) algorithm, and "downloop" and "backloop" work independently of one another. All they do is operate on the numbers, leaving the structure unchanged. Also, I am using the identity matrix as my test matrix, which remains unchanged by these operations.

Why does it run so slowly?

Last edited by Travis86; 11-08-2008 at 06:48 PM.
 
Old 11-07-2008, 07:34 PM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Without seeing any specific code, one can only guess that EDI is being changed in the 'backloop' loop, affecting the overall number of iterations; removing 'backloop' removes that effect and changes the iteration count accordingly.
--- rod.
 
Old 11-07-2008, 09:55 PM   #3
Travis86
Member
 
Registered: Dec 2002
Location: The land of GMT -6
Distribution: OS X, PS2 Linux, Ubuntu, IRIX 6.5
Posts: 399

Original Poster
Rep: Reputation: 31
Maybe I should post the whole thing. I didn't want to bother people with the details. Assembly language programs are hard to read and require a lot of explanation. Here's the whole program - including odd comments which probably will make no sense to anyone.

In any event, downloop doesn't change %edi (as far as I can tell). I tried using %esi instead, but that doesn't help.

This was the first assembly language program I wrote. At the time I wrote this, I didn't know I could use
fld 4(%eax)
-- instead of --
add $4, %eax
fld (%eax)

Code:
.section .data

length: .int 7 # actually length minus 1
matrix: #a   b   c    d
 .float  0,  1,  0,   1 # temporary 0 for gdb
 .float  0,  1,  0,   1
 .float  0,  1,  0,   1
 .float  0,  1,  0,   1
 .float  0,  1,  0,   1
 .float  0,  1,  0,   1
 .float  0,  1,  0,   1
 .float  0,  1,  0,   1

.section .text
.globl main
main:
  mov $1000000, %edi
  testloop:

  mov $matrix, %eax # array pointer
  add $4, %eax # ignore that temporary 0
  
  fld (%eax)   # load b
  fld1
  fstp (%eax)  # optional store b=1
  add $8, %eax # move up 2 numbers
  fld (%eax)   # load d

  mov length, %ecx
  downloop:
    fdiv %st(1), %st(0)
    fst (%eax)  # store d
    fxch
    sub $4, %eax
    fld (%eax)  # load c
    fdivp %st(0), %st(1)
    fst (%eax)  # store c
    
    # this is where a new row starts
    
    add $8, %eax
    fld (%eax)     # load a
    fldz
    fstp (%eax)    # optional store a=0
    fmul %st(0), %st(2)
    fmulp
    add $4, %eax
    fsubr (%eax)   # load b 
    fld1
    fstp (%eax)     # optional store b=1
    fxch
    add $8, %eax
    fsubr (%eax)   # load d
    loop downloop

  fdivp           # d/b
  fst (%eax)      # store d

/*
  mov length, %ecx
  backloop:
    sub $20, %eax # go back 5 spaces (5*4 bytes)
    fld (%eax)    # load c1
    fldz          
    fstp (%eax)   # optional store c=0
    add $4, %eax
    fld (%eax)    # load d1
    fstp %st(3)
    fmulp
    fsubrp
    fst (%eax)    # store d1
    loop backloop
*/

  dec %edi
  jnz testloop
The program solves tri-diagonal matrices. This is version 1; version 2 has 14 fewer instructions, but it runs 115 times slower. I didn't post version 2, because it's more complicated and corrects for certain math pitfalls. After I solve this problem I'll work on that one.

Thanks.
 
Old 11-08-2008, 10:35 AM   #4
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi, Travis86.

I don't know anything about the math behind what you are doing, but I do know that many library routines use techniques that generally increase speed.

Do you have any timing comparisons between your routines and library routines -- perhaps those of blas or lapack? ... cheers, makyo
 
Old 11-08-2008, 06:52 PM   #5
Travis86
Member
 
Registered: Dec 2002
Location: The land of GMT -6
Distribution: OS X, PS2 Linux, Ubuntu, IRIX 6.5
Posts: 399

Original Poster
Rep: Reputation: 31
Eventually I'd like to benchmark my program against someone else's, but I'm going to wait until I get it running as fast as possible.
 
Old 11-08-2008, 10:12 PM   #6
estabroo
Senior Member
 
Registered: Jun 2008
Distribution: debian, ubuntu, sidux
Posts: 1,126
Blog Entries: 2

Rep: Reputation: 124
It's interesting that the instruction in backloop that seems to make the difference is:

fst (%eax) # store d1


You should dump the matrix in gdb a few times and see the difference when backloop is in and when it isn't. I'm guessing that store is putting values back into the matrix that the FPU can optimize on (like 0 or 1), and when you don't have it, the FPU has to do more work.
 
Old 11-10-2008, 03:52 AM   #7
fuzzyBuzz
Member
 
Registered: Sep 2005
Posts: 34

Rep: Reputation: 15
Did you at least compare the performance of your assembly implementation to an implementation in a high-level language, such as C? If you know how to write the algorithm in assembly, surely you could write it in C as well...
Today's compilers know how to generate pretty efficient code.
Another point is that your application may benefit from using SSE instructions (which your P4 machine must support). You could tell the compiler to try to use SSE instructions wherever possible (I don't remember which flags gcc uses for that).
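As an illustration of this point (editor's sketch, not part of the original post), SSE operates on packed data, so a single instruction can update four single-precision floats at once. A minimal, hypothetical fragment, assuming %eax points at a 16-byte-aligned group of four floats:

Code:
  # Hypothetical packed-SSE fragment (not the OP's code): one mulps multiplies
  # four floats at once, which only pays off if four elements can be updated
  # independently of one another.
  movaps (%eax), %xmm0   # load four aligned floats
  mulps  %xmm1, %xmm0    # multiply all four by the factors in %xmm1
  movaps %xmm0, (%eax)   # store the four results back
For reference, gcc's -mfpmath=sse and -msse2 options tell the compiler to use SSE for floating-point math.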
 
Old 11-10-2008, 04:36 AM   #8
pinniped
Senior Member
 
Registered: May 2008
Location: planet earth
Distribution: Debian
Posts: 1,732

Rep: Reputation: 50
If you can enable floating-point exceptions, that would be good; it may be that the FPU is generating a large number of interrupts and wasting processor time on the exception handling.
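As an illustration of this suggestion (editor's sketch, not part of the original post), the x87 exception masks live in the FPU control word; clearing a mask bit makes that exception raise SIGFPE instead of being handled silently. A minimal sketch, assuming the goal is to unmask the invalid-operation and overflow exceptions near the top of main:

Code:
  # Sketch only: unmask the x87 invalid-operation and overflow exceptions so
  # that they deliver SIGFPE instead of being masked and handled quietly.
  sub    $4, %esp
  fnstcw (%esp)           # store the current FPU control word
  andw   $0xfff6, (%esp)  # clear IM (bit 0) and OM (bit 3)
  fldcw  (%esp)           # reload the modified control word
  add    $4, %esp
If exceptions really are firing in large numbers, the run will now abort with SIGFPE at the first one, which makes the theory easy to test.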
 
Old 11-10-2008, 12:43 PM   #9
Travis86
Member
 
Registered: Dec 2002
Location: The land of GMT -6
Distribution: OS X, PS2 Linux, Ubuntu, IRIX 6.5
Posts: 399

Original Poster
Rep: Reputation: 31
I think pinniped is on the right track, but I still haven't figured it out.

I thought when I commented out backloop, things were being left on the floating-point stack, and the processor didn't like the stack overflowing. But even when backloop isn't commented out, things are left on the stack. After downloop, there is one number on the floating-point stack, and after backloop, there is one number on the floating-point stack.

Nonetheless, I noticed three interesting things that would seem to indicate that the problem has to do with leaving things on the floating-point stack:

1. If you replace backloop with "fstp %st(0)" (which pops the number off the floating-point stack), the program runs fine.

2. When you only comment out two particular lines in backloop, it runs slowly:

Code:
backloop:
    sub $20, %eax # go back 5 spaces (5*4 bytes)
    fld (%eax)    # load c1
    fldz          
    fstp (%eax)   # optional store c=0
    add $4, %eax
    fld (%eax)    # load d1
    fstp %st(3)
#    fmulp
#    fsubrp
    fst (%eax)    # store d1
    loop backloop
When you replace them with two "fstp %st(0)" instructions, the program runs fine (albeit mathematically incorrect). fmulp and fsubrp both pop the stack after they're done.

3. When I modified version 2 so that it didn't leave anything on the floating-point stack, it stopped running 115x slower and started running a few percent faster.

I'll keep working on it.
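To make the stack-balancing idea from point 1 concrete (editor's sketch, not the OP's actual fix): the x87 register stack has only eight slots, so anything left on it at the end of a testloop pass can eventually overflow it across the million iterations. A minimal sketch that simply pops the leftover value before looping:

Code:
  # Sketch only: discard whatever is still parked on the x87 stack before the
  # next outer iteration, so the 8-slot register stack stays balanced.
  fstp %st(0)   # pop and discard the leftover value
  dec  %edi
  jnz  testloop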

---------

Estabroo, what do you mean by "makes the difference"? If I comment out that line, it works the same. Plus, I'm working with the identity matrix, and all of the values remain unchanged. All entries are either 0 or 1 all of the time.

---------

fuzzyBuzz, SSE instructions perform the same operation on several numbers at once. For mathematical reasons, one row can't be modified before the previous row has been modified, so you can't work on the rows in parallel.

---------

I have done some speed comparisons. I made the best C++ algorithm I could and compiled it with the "-O3" flag. When solving an 8-variable matrix, the assembly version runs about 50% faster (i.e., the assembly version does it in 62.35% of the C++ time). When solving a 1600-variable matrix, the assembly version runs about 8 times faster (the assembly version does it in 12.14% of the C++ time).
 
Old 11-10-2008, 05:14 PM   #10
pinniped
Senior Member
 
Registered: May 2008
Location: planet earth
Distribution: Debian
Posts: 1,732

Rep: Reputation: 50
Well, congratulations on getting somewhere with the code. Do be careful when leaving data on the stack; the FPU doesn't have much storage space to start with so things can get really horrible. What does the CPU's manual say about pushing too many variables onto the FPU stack?
 
Old 11-14-2008, 07:35 PM   #11
Travis86
Member
 
Registered: Dec 2002
Location: The land of GMT -6
Distribution: OS X, PS2 Linux, Ubuntu, IRIX 6.5
Posts: 399

Original Poster
Rep: Reputation: 31
When you push too many values onto the stack, it raises an exception and starts overwriting the bottom of the stack (section 8.1.2 of the Intel manual).
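One way to check after the fact whether such an overflow actually occurred (editor's sketch, not from the cited manual section): the x87 status word has a stack-fault flag, SF (bit 6), which is set on stack overflow or underflow and stays set until the exception flags are cleared (e.g. with fnclex). It can be read with fnstsw after a run:

Code:
  # Sketch only: test the sticky stack-fault bit in the x87 status word.
  # The branch target is a hypothetical label.
  fnstsw %ax            # copy the FPU status word into AX
  testw  $0x40, %ax     # bit 6 = SF, set if a stack fault has occurred
  jnz    saw_stack_fault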

I do not think that the stack overflowing was/is the problem. Whether backloop was commented out or not, values were being left on the stack.

I don't know what the matter is, and I can't be sure I've really fixed what's going wrong. I've decided that, until I get some inspiration, I'll just fiddle with my program until it works.

The fast/slow problem has cropped up again with version 2. With some input matrices it would run slow; with others it would run fast. I made some trivial change, and it ran fast for all matrices.

Success ... sort of.
 
Old 11-15-2008, 11:30 AM   #12
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
One suggestion: the alignment of your data and code with respect to word/doubleword/other boundaries may affect how fast the CPU can access that data. I believe this is one of the ways an optimizing compiler works its magic. See if you can affect performance by inserting dummy data and/or code to force some optimal alignment, or see whether there are assembler directives that make the assembler apply such alignment for you.
--- rod.
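As an illustration of the directive approach (editor's sketch, not part of the original post), GAS provides .balign, which pads the current section so that the next label starts on the requested boundary. Applied to the data section from post #3, it might look like this:

Code:
.section .data
        .balign 16       # hypothetical: start the table on a 16-byte boundary
length: .int 7           # actually length minus 1
        .balign 16
matrix:
        .float  0,  1,  0,  1
For code, a .p2align directive in front of hot loop labels such as downloop serves the same purpose.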
 
  

