Printing Armstrong numbers from 1 to 500. (C Programming)

Laserbeak · 06-14-2017, 11:37 PM

Quote:

Originally Posted by pan64

Yes, it looks like. I have no idea why. I have AMD FX-8320, it supports SSE4.1.
I have just tried to debug this and it looks like it used the real SSE, there was no emulation. But again, probably I missed something.

I was just thinking, I wonder if there's an alignment problem. I think Intel got rid of alignment requirements but AMD may not have.

Laserbeak · 06-15-2017, 02:12 AM

I was thinking you can actually use 256 bit vectors in these Intel processors:

Quote:

Intel® Advanced Vector Extensions Types

The Intel® AVX intrinsic functions use three new C data types as operands, representing the new registers used as operands to the intrinsic functions. These are __m256, __m256d, and __m256i data types.

The __m256 data type is used to represent the contents of the extended SSE register, the YMM register, used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values.

The __m256d data type can hold four 64-bit double precision floating-point values.

The __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values.

The compiler aligns the __m256, __m256d, and __m256i local and global data to 32-byte boundaries on the stack. To align integer, float, or double arrays, use the __declspec(align) statement.

The Intel® AVX intrinsics also use Intel® SSE2 data types like __m128, __m128d, and __m128i for some operations. See Details of Intrinsics topic for more information.

So you can expand this to come up with much higher narcissistic numbers with no performance hit whatsoever when using SSE. I'll try to work on that later.

Laserbeak · 06-17-2017, 09:05 AM

Let's try to get all Narcissistic number up to 9 digits long.

So let's say we have 9 digit number.

9^9 = 0x17179149, i.e. a 4-byte number (UInt32)

But 9 of them together would be 0xCFD41B91, another 4-byte number (UInt32), so we should be fine with anything that can hold 9 32-bit numbers.

Unfortunately, the __m256i data type can only hold 8 32-bit numbers, so we have to lower our goal to 8 digit Narcissistic numbers, unless we use more than one vector, but we're just learning here, so let's just use one.

Now, what can hold the sum of 8 32-bit numbers? 8 * 0x17179149 = 0xB8BC8A48, another UInt32, which is very serendipitous.

So we should be able to use basically the same code to determine eight 32-bit numbers in an __m256i data variable as we did with the 128-bit variable, right?

Just thinking here...

grail · 06-17-2017, 11:55 AM

Just for some fun and definitely not in the same category as my C friends for speed, here is a solution in gawk that can take any number, but be warned it gets real slow real fast

Code:

#!/usr/bin/awk -f

BEGIN{
	n = ARGV[1]

	for(i = 1; i <= 10^n-1; i++){
		split(sprintf("%0"n"d",i),nums,"")

		for(j in nums){
			x += nums[j]^n
			if(x > i)
				break
		}

		if(x == i)
			print i
		
		x = 0
	}
}

And a Ruby one for some fun too:

Code:

#!/usr/bin/env ruby

n = ARGV[0].to_i
first = "0" * (n - 1) + "1"
last  = "9" * n

(first..last).each do |num|
	puts num if num.split(//).sum{|i| i.to_i ** n} == num.to_i
end

Times for both are sluggish after 5

grail · 06-17-2017, 09:18 PM

So after a bit of sleep, it dawned on me that part of the expense is the constant calculation of the powers, so this speeds things up by a factor of 5 on my machine:

Code:

#!/usr/bin/env ruby

n = ARGV[0].to_i
totals = [0,1]
(2..9).each{|i| totals<<i**n}

(1...10**n).each do |num|
	x = 0
	(1..n).each do |i|
		x += totals[num % 10 ** i / 10 ** (i-1)]
		break if x > num || num < 10 ** i
	end
	puts num if x == num
end

I also removed all string calculations and conversions (except the input from the command line)

Laserbeak · 06-25-2017, 12:21 PM

After dabbling around with FORTRAN, I came up with this. I tried to make it more elegant, but kept seeming to hit a brick wall somewhere, so I just wanted to get something that worked, so here it is:

Code:

      
      integer, dimension (1:3) :: v
      integer :: a, b, c, value, vsum
      
      do a = 0, 5, 1
          do b = 0, 9, 1
              do c = 0, 9, 1
                  v(1) = a
                  v(2) = b
                  v(3) = c
                  vsum = sum(v ** 3)
                  value = a * 100 + b * 10 + c
                  if ( value .eq. vsum) then
                      print *, value, " is an Armstrong number"
                  endif
              enddo
          enddo
      enddo
      END

The output is this:

Code:

 0  is an Armstrong number
 1  is an Armstrong number
 153  is an Armstrong number
 370  is an Armstrong number
 371  is an Armstrong number
 407  is an Armstrong number

RUN FINISHED; exit value 0; real time: 10ms; user: 0ms; system: 0ms

pan64 · 06-26-2017, 02:52 AM

in this solution a^3 and b^3 calculated several times without any benefits, so something like this will definitely speed up:

Code:

do a = 0, 5, 1
# calculate a^3
   do b = 0, 9, 1
#     calculate b^3
        do c = 0, 9, 1
#          calculate c^3
            vsum = 
            value = 
                  if ( value .eq. vsum) then
                      print *, value, " is an Armstrong number"
                  endif

and, actually, using an array instead of a, b, c may allow to use any length, but need to handle the power (which is not really hard, just need to set it dynamically too)

Laserbeak · 06-26-2017, 03:22 AM

Quote:

Originally Posted by pan64

in this solution a^3 and b^3 calculated several times without any benefits, so something like this will definitely speed up:
and, actually, using an array instead of a, b, c may allow to use any length, but need to handle the power (which is not really hard, just need to set it dynamically too)

Oh, yeah, there's no purposeful optimization (other than what the compiler does itself) in here.

But FORTRAN does have the built-in '**' operator that can be used in vectors, so can raise each member of an array to a power in one line (which the modern compilers use SSE commands for so it's one operation anyway).

The problem I ran into was trying to make it flexible to go to any length, i.e. 999999999 or something.

But that seemed elusive, always giving me weird errors. As I've said, I have almost zero experience with FORTRAN except reading one program and rewriting it in C (and everyone should know reading is much easier than writing!)

If you know FORTRAN, please be my guest to try to improve the code!

Edit: Also this code goes to 599, not 500, not that it really matters because there are no Armstrong numbers between 500 and 599. I imagine there's a way in FORTRAN to break out of the loops once the value gets to 501.

Edit again: Just to be clear, in modern processors with vector processing, it doesn't matter whether you're raising one digit to a power or three (or many more). It still does it in one instruction cycle (or at least the number of instruction cycles the power is, I guess most processors don't have a general power instruction, or especially for larger exponents, you [or the compiler] can use logarithmic math like log() and exp() [ x^y = exp(log(x) * y) ] for faster results).

But there are ways to prune the data instructions intelligently, so it doesn't go down paths that obviously won't lead to a number that fits the parameters (here, an Armstrong number).

Still, the time gives 0ms for both usr and sys, so it's hard to improve on that. You'd have to run it many hundreds of thousands of times or even millions to get numbers you can really differentiate.

Beryllos · 06-26-2017, 08:36 PM

Great necro thread!

Quote:

Originally Posted by pan64

in this solution a^3 and b^3 calculated several times without any benefits, so something like this will definitely speed up:

Code:

do a = 0, 5, 1
# calculate a^3
   do b = 0, 9, 1
#     calculate b^3
        do c = 0, 9, 1
#          calculate c^3
            vsum = 
            value = 
                  if ( value .eq. vsum) then
                      print *, value, " is an Armstrong number"
                  endif

Allow me...

Many of the programs in this thread take the approach of starting with a large number and breaking it down to individual digits by integer division and modulo. An alternative approach is to start with separate digits and make up the large number by adding multiples of powers of ten. This turns out to be very fast if you exploit the fact that each multiple of each power of ten only needs to be calculated once, and likewise each cube (or other power) only needs to be calculated once. The nested loop structure involves no multiplications or divisions, only array lookups, additions, and a comparison.

Here is a first try, before further optimization:

Code:

// armstrong3.c
// finds Armstrong numbers with 3 digits

// May be compiled by:
// gcc -lm -o armstrong3 armstrong3.c

#include <stdio.h>
#include <math.h>
#include <malloc.h>

int main()
{
    long i, j;
    long d0, d1, d2; // the digits

//  Allocate memory for tables
    long *b1 = (long *) malloc(10*sizeof(long));
    long *b2 = (long *) malloc(10*sizeof(long));
    long *px = (long *) malloc(10*sizeof(long));

//  fill the tables
    for(i=0;i<10;++i)
    {
        b1[i] = i*10;     // array b1 is digit*base**1
        b2[i] = i*100;    // array b2 is digit*base**2
        px[i] = pow(i,3); // array px is digit**power
    }

    printf("Armstrong numbers:\n");
//  start search
    for(d2=0;d2<10;++d2)
        for(d1=0;d1<10;++d1)
            for(d0=0;d0<10;++d0)
                if(b2[d2]+b1[d1]+d0==px[d2]+px[d1]+px[d0])
                    printf("%ld%ld%ld\n", d2, d1, d0);

//  now do it a million times over, in order to estimate the computation time
    j = 0;
    for(i=0;i<1000000;++i)
        for(d2=0;d2<10;++d2)
            for(d1=0;d1<10;++d1)
                for(d0=0;d0<10;++d0)
                    if(b2[d2]+b1[d1]+d0==px[d2]+px[d1]+px[d0])
                        ++j; // needed to keep the compiler honest  
    printf("Armstrong numbers counted in 1000000 rounds: %ld\n", j);

//  free allocated memory
    free(b1);
    free(b2);
    free(px);

    return 0;
}

When I run it in on an Intel Core2 Duo 2.20GHz (executing in a single thread), this is the time report:

Code:

$ time ./armstrong3
Armstrong numbers:
000
001
153
370
371
407
Armstrong numbers counted in 1000000 rounds: 6000000

real	0m9.338s
user	0m9.328s
sys	0m0.000s

Which means that on average each pass through the innermost loop takes about 9 nanoseconds,

... or each scan of all numbers from 0 to 999 is completed in about 9 microseconds.

To be continued...

Beryllos · 06-26-2017, 10:45 PM

I extended the above method to 9-digit Armstrong numbers involving 9th powers of the digits:

Code:

// armstrong9.c
// finds Armstrong numbers with 9 digits and 9th power 

// May be compiled by:
// gcc -lm -o armstrong9 armstrong9.c

#include <stdio.h>
#include <math.h>
#include <malloc.h>

int main()
{
    long i, j;
    long d0, d1, d2, d3, d4, d5, d6, d7, d8; // the digits

//  Allocate memory for tables
    long *b1 = (long *) malloc(10*sizeof(long));
    long *b2 = (long *) malloc(10*sizeof(long));
    long *b3 = (long *) malloc(10*sizeof(long));
    long *b4 = (long *) malloc(10*sizeof(long));
    long *b5 = (long *) malloc(10*sizeof(long));
    long *b6 = (long *) malloc(10*sizeof(long));
    long *b7 = (long *) malloc(10*sizeof(long));
    long *b8 = (long *) malloc(10*sizeof(long));
    long *px = (long *) malloc(10*sizeof(long));

//  fill tables
    for(i=0;i<10;++i)
    {
        b1[i] = i*10;   // b1 is digit*base**1
        b2[i] = i*100;  // b2 is digit*base**2
        b3[i] = i*1000; // and so on
        b4[i] = i*10000;
        b5[i] = i*100000;
        b6[i] = i*1000000;
        b7[i] = i*10000000;
        b8[i] = i*100000000;
        px[i] = pow(i,9); // px is digit**power
    }

    printf("Armstrong numbers:\n");
//  start search
    for(d8=0;d8<10;++d8)
        for(d7=0;d7<10;++d7)
            for(d6=0;d6<10;++d6)
                for(d5=0;d5<10;++d5)
                    for(d4=0;d4<10;++d4)
                        for(d3=0;d3<10;++d3)
                            for(d2=0;d2<10;++d2)
                                for(d1=0;d1<10;++d1)
                                    for(d0=0;d0<10;++d0)
                                        if(b8[d8]+b7[d7]+b6[d6]+b5[d5]+b4[d4]+b3[d3]+b2[d2]+b1[d1]+d0==px[d8]+px[d7]+px[d6]+px[d5]+px[d4]+px[d3]+px[d2]+px[d1]+px[d0])
                                            printf("%ld%ld%ld%ld%ld%ld%ld%ld%ld\n", d8, d7, d6, d5, d4, d3, d2, d1, d0);

//  free allocated memory
    free(b1);
    free(b2);
    free(b3);
    free(b4);
    free(b5);
    free(b6);
    free(b7);
    free(b8);
    free(px);

    return 0;
}

It tests all 10^9 combinations of the 9 digits in about 26 seconds. Not bad, but we can do much better. Consider the comparison statement in the above program:

Code:

if(b8[d8]+b7[d7]+b6[d6]+b5[d5]+b4[d4]+b3[d3]+b2[d2]+b1[d1]+d0==px[d8]+px[d7]+px[d6]+px[d5]+px[d4]+px[d3]+px[d2]+px[d1]+px[d0])

It involves 17 lookups and 16 additions. Most of the time, most digits are unchanged from the previous pass. We needn't look up and add all the digits every time. In fact, in each of the nested loops, we only have to look up one multiple of a power of ten, look up one 9th power of a digit, and perform two additions. This is the program that does that:

Code:

// armstrong9s.c
// finds Armstrong numbers with 9 digits and 9th power 
// using fewer lookups and additions in the main loop structure

// May be compiled by:
// gcc -lm -o armstrong9s armstrong9s.c

#include <stdio.h>
#include <math.h>
#include <malloc.h>

int main()
{
    long i, j;
    long d0, d1, d2, d3, d4, d5, d6, d7, d8; // the digits
    long bsum1, bsum2, bsum3, bsum4, bsum5, bsum6, bsum7, bsum8; // intermediate sums
    long psum1, psum2, psum3, psum4, psum5, psum6, psum7, psum8; // intermediate sums

// Allocate memory for tables
    long *b1 = (long *) malloc(10*sizeof(long));
    long *b2 = (long *) malloc(10*sizeof(long));
    long *b3 = (long *) malloc(10*sizeof(long));
    long *b4 = (long *) malloc(10*sizeof(long));
    long *b5 = (long *) malloc(10*sizeof(long));
    long *b6 = (long *) malloc(10*sizeof(long));
    long *b7 = (long *) malloc(10*sizeof(long));
    long *b8 = (long *) malloc(10*sizeof(long));
    long *px = (long *) malloc(10*sizeof(long));

//  fill tables
    for(i=0;i<10;++i)
    {
        b1[i] = i*10;   // b1 is digit*base**1
        b2[i] = i*100;  // b2 is digit*base**2
        b3[i] = i*1000; // and so on
        b4[i] = i*10000;
        b5[i] = i*100000;
        b6[i] = i*1000000;
        b7[i] = i*10000000;
        b8[i] = i*100000000;
        px[i] = pow(i,9); // px is digit**power
    }

    printf("Armstrong numbers:\n");
//  start search
    for(d8=0;d8<10;++d8)
    {
        bsum8 = b8[d8]; // update place value of digit d8
        psum8 = px[d8]; // update n-th power of digit d8
        for(d7=0;d7<10;++d7)
        {
            bsum7 = bsum8+b7[d7]; // add place value of digit d7
            psum7 = psum8+px[d7]; // add n-th power of digit d7
            for(d6=0;d6<10;++d6)
            {
                bsum6 = bsum7+b6[d6]; // add place value of digit d6
                psum6 = psum7+px[d6]; // add n-th power of digit d6
                for(d5=0;d5<10;++d5)
                {
                    bsum5 = bsum6+b5[d5]; // and so on
                    psum5 = psum6+px[d5];
                    for(d4=0;d4<10;++d4)
                    {
                        bsum4 = bsum5+b4[d4];
                        psum4 = psum5+px[d4];
                        for(d3=0;d3<10;++d3)
                        {
                            bsum3 = bsum4+b3[d3];
                            psum3 = psum4+px[d3];
                            for(d2=0;d2<10;++d2)
                            {
                                bsum2 = bsum3+b2[d2];
                                psum2 = psum3+px[d2];
                                for(d1=0;d1<10;++d1)
                                {
                                    bsum1 = bsum2+b1[d1];
                                    psum1 = psum2+px[d1];
                                    for(d0=0;d0<10;++d0)
                                        if(bsum1+d0==psum1+px[d0]) // add last digit
                                            printf("%ld%ld%ld%ld%ld%ld%ld%ld%ld\n", d8, d7, d6, d5, d4, d3, d2, d1, d0);
                                }
                            }
                        }
                    }
                }
            }
        }
    }

//  free allocated memory
    free(b1);
    free(b2);
    free(b3);
    free(b4);
    free(b5);
    free(b6);
    free(b7);
    free(b8);
    free(px);

    return 0;
}

and the results:

Code:

$ time ./armstrong9s
Armstrong numbers:
000000000
000000001
146511208
472335975
534494836
912985153

real	0m5.274s
user	0m5.272s
sys	0m0.000s

That's roughly 5 nanoseconds per inner loop cycle.

Beryllos · 06-28-2017, 03:12 PM

Quote:

Originally Posted by Laserbeak

There's two basic approaches -- try to calculate every Armstrong number between 1 and 500 or check every number between 1 and 500 to see if it's an Armstrong number. I assume this has already been made clear.

Which is faster would need testing.

There are other approaches. Here is one which I found as I was reading about Armstrong numbers:

1. Pick a random number in the range (for example, between 100 and 999).
2. Calculate the sum of the third (or other) power of digits.
3. If the sum is equal to the original number, you have found an Armstrong number.
4. Now it gets interesting: If the sum is not equal to the original number, use it as the next starting number. In other words, calculate the sum of the power of its digits and iterate.
5. Break when the sum goes out of range, enters a cycle, or reaches an Armstrong number.

Examples of cycles for digits to the third power:
136, 244, 136, ...
160, 217, 352, 160, ...

I'm not sure that this method is guaranteed to save computation time, but there is at least one reason why it can: All permutations of digits yield the same sum. For example, 145, 154, 415, 451, 514, and 541 all yield 190, which leads to 730 and finally the Armstrong number 370. In addition, some of the numbers in that permutation can be obtained by other numbers with their permutations. For example, 145 comes from 809, 890, 908, and 980. So it is that Armstrong numbers descend from a "tree" of different possible ancestors. Some of these trees are large, some not so large, like that of 407, which can only come from its own permutations.

I have been playing around with this. It appears that it has an advantage for finding longer Armstrong numbers, where the number of permutations is much larger.

I have been coding it in Python, which allows arbitrarily large integers and runs with multiple cores on my machine.

Laserbeak · 06-29-2017, 01:36 AM

Quote:

Originally Posted by Beryllos

There are other approaches. Here is one which I found as I was reading about Armstrong numbers:

1. Pick a random number in the range (for example, between 100 and 999).
2. Calculate the sum of the third (or other) power of digits.
3. If the sum is equal to the original number, you have found an Armstrong number.
4. Now it gets interesting: If the sum is not equal to the original number, use it as the next starting number. In other words, calculate the sum of the power of its digits and iterate.
5. Break when the sum goes out of range, enters a cycle, or reaches an Armstrong number.

Examples of cycles for digits to the third power:
136, 244, 136, ...
160, 217, 352, 160, ...

I'm not sure that this method is guaranteed to save computation time, but there is at least one reason why it can: All permutations of digits yield the same sum. For example, 145, 154, 415, 451, 514, and 541 all yield 190, which leads to 730 and finally the Armstrong number 370. In addition, some of the numbers in that permutation can be obtained by other numbers with their permutations. For example, 145 comes from 809, 890, 908, and 980. So it is that Armstrong numbers descend from a "tree" of different possible ancestors. Some of these trees are large, some not so large, like that of 407, which can only come from its own permutations.

I have been playing around with this. It appears that it has an advantage for finding longer Armstrong numbers, where the number of permutations is much larger.

I have been coding it in Python, which allows arbitrarily large integers and runs with multiple cores on my machine.

This is very interesting (really). but according to everything I've read the only way to determine Narcissistic Numbers (of which Armstrong Numbers are just a subset of three-digit numbers) is by brute force, i.e. testing each number to see if it is one. Even if your algorithm works, how can you make sure that you've found all of the Narcissistic numbers within a range?

grail · 06-29-2017, 02:46 AM

Here is another idea ... if we look at the 1000 numbers with 3 digit combinations, there are only 220 effective combinations that need to be tested as all others are repeats, as per Beryllos' example
of 145, 154, 415, 451, 514, and 541 all yielding the same solution.

The nice thing on working from this direction is the growth of numbers to be tested as the power increases is considerably less than the total number added:

Code:

10 ^ 3 = 220 to test
10 ^ 4 = 715 to test
10 ^ 5 = 2002 to test
10 ^ 6 = 5005 to test
10 ^ 7 = 11440 to test

Of course the problem with the above scenario is trying to come up with what numbers are in the shortlist to be checked against.
As an idea, I tried this in Ruby and just getting the number above took ages, as you can see I never got to 8 or 9

pan64 · 06-29-2017, 04:50 AM

Quote:

Originally Posted by Beryllos

and the results:

Code:

$ time ./armstrong9s
Armstrong numbers:
000000000
000000001
146511208
472335975
534494836
912985153

real	0m5.274s
user	0m5.272s
sys	0m0.000s

I'm sorry, but that is incorrect. The leading zeroes cannot be taken into account, they modify the "length", the number of digits. The valid result is:

Code:

   (?)  0      153     1634    54748   548834  1741725 24678050 146511208
        1      370     8208    92727           4210818 24678051 472335975
        2      371     9474    93084           9800817 88593477 534494836
        3      407                             9926315          912985153
        4
        5
        6
        7
        8
        9

execution time is at about 23 sec on my box (with -O3)

Laserbeak · 06-29-2017, 05:39 AM

@Beryllos -- You really don't have to explicitly free() all that memory before you quit, when the process ends, the kernel will clean it up for you. You only have to free() memory that you malloc() when you're running a program that continues to run indefinitely and you want that memory back.