[SOLVED] Inline assembly

vmelkon · 03-13-2024, 12:18 PM

Hello all,

I know that there are many choices in the world of programming ...
In my case, I have some code that I had written in Intel style for Microsoft VC++ 6. It uses 32 bit addresses. It uses some ordinary x86 instructions and also MMX at some places and SSE at some places.

On Linux, I use Qt Creator as my IDE. I think that underneath it, it uses the g++ compiler.

Step 1: I updated the code to use 64 bit addresses.
Step 2: There are more registers. So I did minor changes to use more registers.
Step 3: Compile the Intel code under Linux? Some people mention using a compiler flag for gcc. I did not do this.
Step 4: So, I learned AT&T style. I learned asm extended assembly for gcc and went ahead and converted from Intel Style to AT&T style. (Maybe I’m nuts?)
Step 5: That %0, %1, %2 stuff. Oh boy! Too late did I learn that it is possible to use labels.

MY MAIN QUESTION:
In my Intel style code, I have

Code:

addps xmm0, xmmword ptr[Global_NfloatArray]

where Global_NfloatArray is some float array.

In AT&T style

Code:

addps %0, %%xmm0;

where %0 represents Global_NfloatArray.
It compiles but doesn’t work.
I think that is because it copies the address into xmm0 instead of reading the RAM that the pointer Global_NfloatArray points to.

So, I guess I need to write

Code:

addps (%0), %%xmm0;

but that doesn’t compile.

Another case:
In my Intel style code, I have

Code:

fld dword ptr[t1]
fld dword ptr[t1+4]

In AT&T style

Code:

fld %1; ?????????
fld %1+4;

where %1 is float t1[100]

I tried to learn by example but can’t find what I am looking for.
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

vmelkon · 03-13-2024, 07:56 PM

After a week of searching, I found the solution.
Apparently, if you have an array that is global, there is something special you need to do.
You need to write (*Global_t1)

For example:

Code:

sint64 Test4000()
{
	sint64 returnVal = 0;	//Initialized so that a defined value is returned
	Global_t1[0]=57.0;
	Global_t1[1]=11.0;

	//This is 64 bit code for Linux
	asm volatile
	(
		"movaps		%0, %%xmm0;"
		"mulps		%%xmm0, %%xmm0;"
		"movaps		%%xmm0, %0;"
		:
		: "m" (*Global_t1)
		:
	);

	return returnVal;
}

ntubski · 03-13-2024, 09:58 PM

Quote:

Originally Posted by vmelkon

Apparently, if you have an array that is global, there is something special you need to do.
You need to write (*Global_t1)

Hmm, I get the same code output with (Global_t1) as with (*Global_t1). I would suggest using register constraints and then letting the compiler figure out how to move the data instead of hard-coding movaps though:

Code:

    asm ("mulps %0, %0\n"
      : "+v" (Global_t1) /* + means read/write, v means "Any EVEX encodable SSE
                            register (%xmm0-%xmm31)." */);

This is generating the following for me at -O1:

Code:

        movq    xmm0, QWORD PTR Global_t1[rip]
        mulps   xmm0, xmm0

        movq    QWORD PTR Global_t1[rip], xmm0

References:
https://gcc.gnu.org/onlinedocs/gcc/Modifiers.html
https://gcc.gnu.org/onlinedocs/gcc/M...nstraints.html (search for x86 family)

vmelkon · 03-14-2024, 03:54 PM

movq seems to copy 64 bit but I want 128 bit, since it would be for copying 4 floats to a xmm0. (According to https://www.felixcloutier.com/x86/movq)

I find the variable constraint thing confusing. There is "r" and "+r" and "g" and various other codes.
Some codes seem to be requests for a specific register; for example, "a" is supposed to mean rax.
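
For the constraint letters mentioned above, here is a minimal sketch, assuming x86-64 and GCC (the helper names add_in_place and mul_high are made up for illustration). "r" asks for any general-purpose register as a read-only input, "+r" is a register that is both read and written, and "a" or "d" pin the operand to rax or rdx (eax or edx for 32-bit values). "g" is looser still: any general register, memory, or immediate.

```cpp
// Hypothetical helper: acc += x, done in inline asm.
long long add_in_place(long long acc, long long x)
{
    asm ("addq %1, %0"
         : "+r" (acc)   // read and written, any general-purpose register
         : "r" (x)      // read only, any general-purpose register
         : "cc");       // addq changes the flags
    return acc;
}

// Hypothetical helper: upper 32 bits of the 64-bit product a * b.
unsigned mul_high(unsigned a, unsigned b)
{
    unsigned hi;
    asm ("mull %2"      // edx:eax = eax * operand
         : "=d" (hi),   // the high half must be taken from edx
           "+a" (a)     // eax is read (multiplicand) and overwritten (low half)
         : "r" (b)      // multiplier, any general-purpose register
         : "cc");       // mull changes the flags
    return hi;
}
```

The letters do not move data by themselves; they tell the compiler where each C++ value has to be when the asm text runs, and the compiler adds whatever moves are needed around it.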

ntubski · 03-14-2024, 06:09 PM

Quote:

Originally Posted by vmelkon

movq seems to copy 64 bit but I want 128 bit, since it would be for copying 4 floats to a xmm0. (According to https://www.felixcloutier.com/x86/movq)

Oh, I saw 2 floats in your example and assumed that was the whole array. Actually, I now see that my suggestion only works with an array of up to 4 elements (and the size has to be visible at compile time). So it's probably not what you want anyway.
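
One way to keep the register-constraint approach for exactly four floats is a GCC vector type, so the operand is 128 bits wide no matter how large the array is. This is only a sketch under assumptions: the typedef v4sf and the function square_first_four are illustrative names, and the array size of 100 is taken from later posts in the thread.

```cpp
#include <cstring>

// GCC/Clang vector extension: four packed floats, 16 bytes.
typedef float v4sf __attribute__((vector_size(16)));

float Global_t1[100];   // same name as in the thread; the size is an assumption

// Squares Global_t1[0..3] in place.
void square_first_four()
{
    v4sf v;
    std::memcpy(&v, Global_t1, sizeof v);   // load 4 floats, no alignment requirement
    asm ("mulps %0, %0"
         : "+x" (v));                       // "x": any SSE register, read and written
    std::memcpy(Global_t1, &v, sizeof v);   // store the 4 results back
}
```

The compiler chooses the xmm register and the load/store instructions, so no clobber list is needed.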

Your code in #2 looks good except that the constraints you put don't tell the compiler that you are using xmm0, and reading from Global_t1. I would update it like this:

Code:

    asm (
         "movaps %0, %%xmm0;\n"
         "mulps %%xmm0, %%xmm0;\n"
         "movaps %%xmm0, %0"
         : "+m" (*Global_t1) /* out (+ means also read as input) */
         :                   /* in */
         : "xmm0"            /* clobber (tell compiler we are overwriting xmm0) */
         );
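
The constraints above can be sketched as a complete function. Two things here are assumptions that go beyond the thread: the array is declared with alignas(16) because movaps faults on an unaligned address, and the operand is viewed as float[4] so the compiler knows that all 16 bytes are read and written, not only Global_t1[0]. The function name is made up.

```cpp
// Same global name as in the thread; alignas(16) is an added assumption.
alignas(16) float Global_t1[100];

// Squares Global_t1[0..3] in place using hard-coded xmm0.
void square_first_four_movaps()
{
    // View the first 16 bytes as one float[4] object for the memory operand.
    float (*v)[4] = reinterpret_cast<float (*)[4]>(Global_t1);
    asm (
        "movaps %0, %%xmm0\n\t"
        "mulps  %%xmm0, %%xmm0\n\t"
        "movaps %%xmm0, %0"
        : "+m" (*v)     /* out, also read as input: 4 floats */
        :               /* in */
        : "xmm0"        /* clobber: xmm0 is overwritten */
    );
}
```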

vmelkon · 03-15-2024, 01:47 PM

For the clobber list, it doesn't seem to recognize r8, r9 and all the way to r15.
This is weird since I used r8 and r9 in my assembly code. (I did not run the code yet).

Also, I forgot to give the other example:
If you want to operate from RAM directly
Intel style code is something like

Code:

float t1[100];      //Globally declared variable
fld dword ptr[t1]
fld dword ptr[t1+4]
fadd dword ptr[t1+8], st(0)

AT&T style would be

Code:

fld %0;
fld 4%0;        //Notice the 4. This adds 4 bytes to the address of t1
fadd 8%0, %%st(0);      //Notice the 8. This adds 8 bytes to the address of t1

ntubski · 03-15-2024, 10:57 PM

Quote:

Originally Posted by vmelkon

For the clobber list, it doesn't seem to recognize r8, r9 and all the way to r15.
This is weird since I used r8 and r9 in my assembly code. (I did not run the code yet).

Hmm, the following compiles for me:

Code:

  /* Add the first 3 elements of t1, and put the result in x */
  float x;
  asm (
       "flds %1;\n"
       "flds %2;\n"
       "faddp;\n"
       "fadds %3;\n"
       : "=t" (x)     /* out. t is "Top of 80387 floating-point stack (%st(0))." */
       : "m" (t1), "m"(t1[1]), "m"(t1[2]) /* in */
       : "st(1)"                         /* clobber */
       );
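
On the r8 to r15 clobber question, here is a minimal sketch that names r8 and r9 in the clobber list (the function sum_via_r8_r9 is hypothetical). It assumes a 64-bit build. These registers do not exist in 32-bit mode, so one possible cause of the error, which the thread does not confirm, is a compiler that targets 32-bit code.

```cpp
// Hypothetical helper: returns a + b, using r8 and r9 as scratch registers.
long long sum_via_r8_r9(long long a, long long b)
{
    long long result;
    asm (
        "movq %1, %%r8\n\t"
        "movq %2, %%r9\n\t"
        "addq %%r9, %%r8\n\t"
        "movq %%r8, %0"
        : "=r" (result)         /* out: written last, after both inputs are read */
        : "r" (a), "r" (b)      /* in */
        : "r8", "r9", "cc"      /* clobber: scratch registers and flags */
    );
    return result;
}
```

Because r8 and r9 are listed as clobbers, the compiler will not place a or b in them.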

Quote:

Also, I forgot to give the other example:
If you want to operate from RAM directly
Intel style code is something like

Code:

float t1[100];      //Globally declared variable
fld dword ptr[t1]
fld dword ptr[t1+4]
fadd dword ptr[t1+8], st(0)

I'm not sure exactly how to get this offset addressing thing working, the compiler seems to prefer %rip relative instead. For example:

Code:

  /* Add the first 3 elements of t1, and put the result in x */
  float x;
  asm (
       "flds %1;\n"
       "flds %2;\n"
       "faddp;\n"
       "fadds %3;\n"
    :  "=t" (x)                      /* out */
    :  "m" (t1), "m"(t1[1]), "m"(t1[2]) /* in */
    :  "st(1)"                         /* clobber */
       );

produces this disassembly:

Code:

   0x00000001400014e5 <+4>:     flds   0x6b15(%rip)        # 0x140008000 <t1>
   0x00000001400014eb <+10>:    flds   0x6b13(%rip)        # 0x140008004 <t1+4>
   0x00000001400014f1 <+16>:    faddp  %st,%st(1)
   0x00000001400014f3 <+18>:    fadds  0x6b0f(%rip)        # 0x140008008 <t1+8>
   0x00000001400014f9 <+24>:    fstps  0xc(%rsp)

(hopefully I got the suffixes right; it seems to give the right output, but I was basically just guessing until it stopped throwing warnings at me)
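
One way to get explicit offsets without depending on how the compiler spells the memory operand is to pass the array's address in a register and write the displacement yourself. This is a sketch, not necessarily what either poster ended up using: the operand name p and the function name are made up, and the extra "m" input is there only to tell the compiler which memory the asm reads.

```cpp
float Global_t1[100];   // same global name as in the thread

// Adds the first 3 elements of Global_t1 on the x87 stack.
float sum_first_three()
{
    float x;
    asm (
        "flds  (%[p])\n\t"      // push Global_t1[0]
        "fadds 4(%[p])\n\t"     // st(0) += Global_t1[1], address + 4 bytes
        "fadds 8(%[p])"         // st(0) += Global_t1[2], address + 8 bytes
        : "=t" (x)                                /* out: result left in st(0) */
        : [p] "r" (Global_t1),                    /* in: base address in a register */
          "m" (*reinterpret_cast<const float (*)[3]>(Global_t1))
                                                  /* in: the 12 bytes that are read */
    );
    return x;
}
```

With a register base, 4(%[p]) expands to something like 4(%rax), which the assembler accepts whether or not the compiler would otherwise use %rip-relative addressing.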

vmelkon · 03-16-2024, 04:39 PM

I didn't even know that you could write "=t" (x).

I think you are having trouble because you wrote
: "m" (t1), "m"(t1[1]), "m"(t1[2]) /* in */
If your array size is huge, that is too much work.
I would do

Code:

float Global_t1[100];
void function()
{
float x;
  asm volatile (
       "flds %0;\n"
       "flds 4%0;\n" /////Address of Global_t1 + 4 bytes
       "fadds 8%0, st(0);\n"  /////Address of Global_t1 + 8 bytes
    :  "=t" (x)                      /* out */
    :  "m" (*Global_t1) /* in */
    :  "st(1)"                         /* clobber */
       );
}

and maybe st(0) needs to be in the clobber as well.

A nicer solution is to use labels to avoid the %0, %1, %2 stuff.

Code:

float Global_t1[100];
void function()
{
float x;
  asm volatile (
       "flds %[Global_t1];\n"
       "flds 4%[Global_t1];\n" /////Address of Global_t1 + 4 bytes
       "fadds 8%[Global_t1], st(0);\n"  /////Address of Global_t1 + 8 bytes
    :  "=t" (x)                      /* out */
    :  [Global_t1] "m" (*Global_t1) /* in */
    :  "st(1)"                         /* clobber */
       );
}

ntubski · 03-16-2024, 07:45 PM

Quote:

Originally Posted by vmelkon

Code:

float x;
  asm volatile (
       "flds %0;\n"
       "flds 4%0;\n" /////Address of Global_t1 + 4 bytes
       "fadds 8%0, st(0);\n"  /////Address of Global_t1 + 8 bytes

Hmm, it compiles on godbolt.org, but on Mingw I get "Error: junk `((%rcx))' after expression" and on my Debian box I get "Error: invalid instruction suffix for `fld'".

Quote:

and maybe st(0) needs to be in the clobber as well.

The "t" is st(0), and it's already listed as an output, so it doesn't need to be in clobber.

vmelkon · 03-17-2024, 12:47 AM

This one works: with : [x] "=m" (x), x receives 70.55.
If I put : [x] "=t" (x) instead, x gets a NaN.

Code:

sint64 Test4000()
{
	sint64 returnVal = 0;	//Initialized so that a defined value is returned
	Global_t1[0]=57.0;
	Global_t1[1]=11.0;
	Global_t1[2]=2.55;
	float x;

	//This is 64 bit code for Linux
	asm volatile
	(
		"fld		%[Global_t1];"			//Load to register st(0)
		"fadd		4%[Global_t1];"			//Add value to what is already in st(0)
		"fadd		8%[Global_t1];"			//Add value to what is already in st(0)
		"fstp		%[x];"			//Store and pop FPU stack. Write to x
		//"fld		4%0;"
		//"faddp		%%st(1), %%st(0);"
		//"fstp		4%0;"
		//"movaps		%0, %%xmm0;"
		//"mulps		%%xmm0, %%xmm0;"
		//"movaps		%%xmm0, %0;"
		: [x] "=m" (x)
		: [Global_t1] "m" (*Global_t1)
		: "st(1)"
	);

	return returnVal;
}

I don't know what exactly flds is. It seems to be a gcc invention. There are others like fldt (https://docs.oracle.com/cd/E19455-01...0ah/index.html)
I think those aren't real x86 FPU instructions, so I replaced them with fld.
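
For reference, in AT&T syntax the trailing letter on an x87 mnemonic gives the size of the memory operand: flds loads a 32-bit single (Intel: fld dword ptr), fldl a 64-bit double (fld qword ptr), and fldt an 80-bit extended value (fld tbyte ptr). They all assemble to forms of the same fld instruction. A small sketch, with made-up function names:

```cpp
// Pushes a 32-bit float with flds and hands st(0) back to the compiler.
float through_flds(float v)
{
    float r;
    asm ("flds %1" : "=t" (r) : "m" (v));   // s suffix: 4-byte memory operand
    return r;
}

// Pushes a 64-bit double with fldl and hands st(0) back to the compiler.
double through_fldl(double v)
{
    double r;
    asm ("fldl %1" : "=t" (r) : "m" (v));   // l suffix: 8-byte memory operand
    return r;
}
```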

vmelkon · 03-17-2024, 09:36 PM

OK, if you use
: [x] "=t" (x)

then
"fstp %[x];" //Store and pop FPU stack. Write to x
needs to be commented out and x receives 70.55.

ntubski · 03-17-2024, 11:11 PM

Quote:

Originally Posted by vmelkon

OK, if you use
: [x] "=t" (x)

then
"fstp %[x];" //Store and pop FPU stack. Write to x
needs to be commented out and x receives 70.55.

Yeah, because otherwise it will try to pop st(0) into st(0) which makes no sense.
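
The two working variants from this exchange, side by side as a sketch. The function names are made up, and the inputs are passed as three separate "m" operands to keep the example short. The "st" clobber in the first variant is an assumption about how best to declare that the asm uses st(0) as scratch space; the thread's own code listed "st(1)" instead.

```cpp
// Variant 1: the asm stores and pops the result itself, so x is a memory output.
float sum3_store(float a, float b, float c)
{
    float x;
    asm (
        "flds  %1\n\t"      // push a
        "fadds %2\n\t"      // st(0) += b
        "fadds %3\n\t"      // st(0) += c
        "fstps %0"          // write st(0) to x and pop
        : "=m" (x)
        : "m" (a), "m" (b), "m" (c)
        : "st"              // st(0) is used temporarily inside the asm
    );
    return x;
}

// Variant 2: the result is left in st(0) and the compiler stores it, so no fstp.
float sum3_top(float a, float b, float c)
{
    float x;
    asm (
        "flds  %1\n\t"      // push a
        "fadds %2\n\t"      // st(0) += b
        "fadds %3"          // st(0) += c, result stays on the stack top
        : "=t" (x)
        : "m" (a), "m" (b), "m" (c)
    );
    return x;
}
```

Mixing the two, an "=t" output together with an fstp inside the asm, pops the value the compiler is about to read, which matches the NaN reported earlier.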