LinuxQuestions.org
Old 01-29-2007, 08:54 AM   #1
dogpatch
Member
 
Registered: Nov 2005
Location: Central America
Distribution: Mepis, Android
Posts: 490
Blog Entries: 4

Optimizing assembler code


An enormously repetitive routine calls for simple assembler coding to better control optimization and minimize run time. Given a parent C program with a defined array1:
Code:
char array1[81];
Please look at these subroutine assembler code snippets to address, say, the element at offset 24 of the array. Which code will execute faster (fewer clock ticks),
This:
Code:
	movl	$array1, %edi	// Outside loop, so disregard overhead
	.
	.
some_loop:
	movb	24(%edi),%al	// These instructions repeat within heavy loop
	.
	.
	cmpb	%al,24(%edi)
	.
	.
	movb	%al,24(%edi)
or this:
Code:
	movb	(array1+24),%al
	.
	.
	cmpb	%al,(array1+24)
	.
	.
	movb	%al,(array1+24)
Also, do i gain speed performance by defining my array with long rather than char elements, thus achieving better alignment efficiency?
Code:
long array1[81];
My assembler code then becomes:
Code:
	cmpl	%eax,96(%edi)	// element 24 of a 4-byte long array: byte offset 24 * 4 = 96
or
Code:
	cmpl	%eax,(array1+96)
The target processor, if that's germane, is an older AMD-K6 at 500 MHz.
 
Old 01-29-2007, 09:43 AM   #2
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

The target processor is germane - the timing will change across processor models. In addition, the timing on a given processor will change based on the state of the cache and the alignment of the assembly (not assembler) instructions and data.
 
Old 01-29-2007, 09:54 AM   #3
dogpatch
Member
 
Registered: Nov 2005
Location: Central America
Distribution: Mepis, Android
Posts: 490

Original Poster
Blog Entries: 4

OK, so the alignment of the data is important. Is align 4 sufficient, or is align 16 better?

I don't anticipate any issues with the cache, as the assembly (not assembler) code will run pretty much exclusively for long stretches.

As to processor specifics, can you tell me where i might find out which code may be better for the AMD chip?

Last edited by dogpatch; 01-29-2007 at 09:59 AM.
 
Old 01-29-2007, 10:13 AM   #4
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

The higher the alignment value, the more universal the benefit, and the greater the wasted space. Cache lines vary between 32 and 128 bytes on most processors.

Unless you are running your code in a single process OS and/or on a dedicated CPU, your CPU cache will be invalidated and flushed quite a bit. Optimizing your locality of reference will minimize that.

None of the performance concerns require the use of assembly code; C compilers are quite capable of performing the necessary optimizations, and allow for increased portability.

Intel has a considerable amount of information online on the subject. You can start here. This is another good article.

Last edited by macemoneta; 01-29-2007 at 10:16 AM.
 
Old 02-05-2007, 06:33 PM   #5
dogpatch
Member
 
Registered: Nov 2005
Location: Central America
Distribution: Mepis, Android
Posts: 490

Original Poster
Blog Entries: 4

Thanks for the info, and the links. I ended up doing my own benchmark tests, since i thought my endeavor may be a bit out of the ordinary, and here's what i found:

Due to several factors (efficient use of registers and boolean logic, plus creative loop controls and stack use), my assembly code was not just a few percentage points faster than the best i could get from the C compiler, but several times faster.

As i suspected, the cache wasn't an issue. At least, running at runlevel 1 was no faster than at runlevel 2 or higher, provided i wasn't actually running something else simultaneously. I think this was due to the small size of my data and code. Align 4 was adequate. In fact, i saw a significant speed penalty for going to align 32 or higher, probably because my small array then became big enough to span multiple cache lines.

The answer to my original question (if anyone is interested) is that using the index registers edi and esi to address the data was about 10% faster than direct addressing.

Last edited by dogpatch; 02-05-2007 at 06:42 PM.
 
Old 02-05-2007, 09:39 PM   #6
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,610
Blog Entries: 4

As one might expect it would be...

A direct-address instruction must be fetched and decoded, then loaded into an internal micro-register to be used in the fetch. But if the address is already in one of the main segment/offset registers, it's already prepared.

There is always a tradeoff of "speed vs. space." So yes, you might well find that it will save time to store each byte of the value in an integer or long-integer. Takes four times the storage with a 32-bit integer (eight with a 64-bit one), but who cares.
 
Old 02-06-2007, 08:33 PM   #7
dogpatch
Member
 
Registered: Nov 2005
Location: Central America
Distribution: Mepis, Android
Posts: 490

Original Poster
Blog Entries: 4

Actually, i was arguing that, in this case, i got both speed and size benefits by using align 4 rather than align 32 or 64 or 128. I'm guessing that's because the larger data array spanned multiple cache lines, thus slowing things down as well as taking more space in memory. Does this run totally against the grain of processor design principles? (Remember that i'm doing this on an older chip, a 500 MHz AMD.)
 
  

