[SOLVED] Need a fast von Neumann randomness extractor
I tuned the radio to a noisy channel. This is the hardest part, because noise doesn't always mean randomness.
Maybe you should hash before the von Neumann extractor, since the distribution of 0s and 1s in the upper bits might differ from that of the lower bits. I would expect that to be the case whether or not there is clipping during A/D conversion. I only suggest that because it seems like the VN extractor only works well if the data is close to random in the first place. I guess if you're hashing anywhere in there, it might not make a difference where that happens.
Kevin Barry
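For anyone following along: the von Neumann extractor itself is simple. Read the input bits in non-overlapping pairs; a 01 pair emits a 0, a 10 pair emits a 1, and 00 or 11 pairs are discarded. Here is a minimal sketch in C (the function name and the MSB-first bit packing are my own choices, not the OP's vne tool):

```c
#include <stddef.h>
#include <stdint.h>

/* Von Neumann extractor: consume input bits (MSB-first) in
   non-overlapping pairs.  01 -> emit 0, 10 -> emit 1, 00/11 -> discard.
   Returns the number of output bits written into out (packed MSB-first);
   out must be large enough to hold inbits/2 bits. */
static size_t vn_extract(const uint8_t *in, size_t inbits, uint8_t *out)
{
    size_t outbits = 0;
    size_t i;

    for (i = 0; i + 1 < inbits; i += 2) {
        int b0 = (in[i / 8] >> (7 - i % 8)) & 1;
        int b1 = (in[(i + 1) / 8] >> (7 - (i + 1) % 8)) & 1;

        if (b0 == b1)
            continue;  /* 00 or 11: discard the pair */
        if (b0)        /* 10 -> emit 1 */
            out[outbits / 8] |= (uint8_t)(1u << (7 - outbits % 8));
        else           /* 01 -> emit 0 */
            out[outbits / 8] &= (uint8_t)~(1u << (7 - outbits % 8));
        outbits++;
    }
    return outbits;
}
```

If the input bits are independent with a fixed bias P(1) = p, each kept pair is equally likely to be 01 or 10, so the output is unbiased; on average only a fraction p(1 - p) of the input bits survive, which is at best 1 output bit per 4 input bits.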
Thanks for this thread! I learned something here, you guys lit up my week!!!!
Just let me/us know where your "eagle" landed, please
@ sycamorex - it would seem to be a way to find patterns in randomness... didn't know that hissing on the radio was good for something; goes to prove you can always learn something!
Excuse my ignorance, but I'm totally confused and don't really understand much of the thread. Is the OP trying to extract truly random data or find patterns in apparent randomness?
yeah, I've been following this thread and wondering why exactly you're doing it. Is it anything to do with encryption or generating keys?
Quote:
Originally Posted by sycamorex
Excuse my ignorance, but I'm totally confused and don't really understand much of the thread. Is the OP trying to extract truly random data or find patterns in apparent randomness?
I want to do both, actually. But my goal is to make a TRNG, a true random number generator. I probably will not achieve it, but may come close.
Reading that made me want to try to do something like that.
Technically, I don't have too much use for it, I just want a proof of concept for myself. I trust radioactive decay (hotbits) more than radio, but I have tried samples from random.org and they seem to be of good quality. So then, I wanted to do it myself so I don't have to rely on sites that may go down or have daily limits (hotbits).
Ok, it all makes sense now. Thank you for the explanation.
I've tried using the hash before the extractor, but it doesn't make much difference. Either way, both must be used to produce good random output; neither will work well alone. Using the hash first produces slightly smaller output, which is a good thing in theory.
Some more theory:
I am using a short-wave radio for this because HF is prone to atmospheric and solar interference, which means there's plenty of noise and potential randomness.
As long as we are discussing it, I found something very interesting in my quest.
Here is a procedure to determine whether a file is random, and how random.
Code:
The procedure to determine if a file is random or not:

1) Run rawtopgm on the file and view the result in GIMP.
   Horizontal stripe patterns in the image, or patterns or
   major spikes on the histogram                          => not random
   Static-noise image and an irregular or flat histogram  => go to 2
2) Run ent on the file.
   ent fails   => not random
   ent passes  => random
EDIT: I removed incorrect assumptions. I will do more testing, but I need more hotbits.
The diehard and dieharder tests seem quite useless and time consuming.
Last edited by H_TeXMeX_H; 04-14-2012 at 08:42 AM.
It is hard for me to believe what I just posted, but EDIT: it is actually false. Sorry, I must do more tests on hotbits to fix these unreliable and size-dependent tests.
Last edited by H_TeXMeX_H; 04-13-2012 at 10:08 AM.
Quote:
Originally Posted by H_TeXMeX_H
But my goal is to make a TRNG, a true random number generator. I probably will not achieve it, but may come close.
Since you're already using radio, why not build a USB hardware random number generator? (Or buy one?)
For example, you could start with a $16 Teensy++, which you talk to over USB. It has a native USB interface, and is very easy to program using gcc-avr. Then, use one or more hardware random bit generators as the source of random bits, applying the von Neumann extractor on the Teensy to avoid any bias in the result. With more than one generator, you can increase the generated bit rate. You can easily run a PRNG like Xorshift on the Teensy (or several!), and perturb its state continuously using the hardware random bits, making it more or less a true TRNG. You can easily provide both TRNG and T/PRNG outputs in the same device. The hardware parts should cost about $25 total, including a nice tin enclosure. The circuit generates EM noise, so a metal enclosure is a must.
The data rate is small enough that Teensy could easily continuously encrypt the output using a cryptographically secure algorithm, if you want truly random bits.
If you use the raw HID interface (the one used by USB mice, keyboards, joysticks, gamepads), you are limited to 512000 bits per second (1000 packets, 64 bytes each, per second; more than enough for several hardware bits), but you don't need any drivers. On any OS.
This might be a rather interesting project. A kernel driver to feed the random bits from the above hardware USB device to the entropy pool would be very simple, even including some randomness checks on the bits. Or, one could tell the OpenSSL library to obtain initial PRNG state bits from the USB device, for example from a simple userspace application providing it via a pipe (using the HID interface to the USB device).
Quote:
Originally Posted by H_TeXMeX_H
I've tried using the hash before the extractor, but it doesn't make much difference. Either way, both must be used to produce good random output; neither will work well alone. Using the hash first produces slightly smaller output, which is a good thing in theory.
Randomness is such a poorly-defined concept. For a probability distribution over a continuous vector space described by a continuous function, given a vector there is usually some neighborhood of vectors nearby that are considered "pretty much the same". With binary data, however, the assumption is that any difference in bit values in essence gives you something completely different. The goal of randomness for binary data is therefore to make all possible combinations of 0s and 1s of a given length seem equally likely. In reality, however, if the bits are independent Bernoulli random variables with p=0.5, then the number of 1s over N bits has a Binomial(N, 0.5) distribution. Since the variance is linear with respect to N, as the number of bits increases the confidence interval around 0.5*N becomes vanishingly narrow relative to N itself. Since this should also apply to all members of the power set of N bits, that implies that the probability of any discernible order to the bits should decrease as the length of the given subset increases. This obviously isn't possible, since the set containing all of the 1 elements, and all of that set's subsets, are in the power set of N bits. Additionally, there are 2^N elements in the power set, so even for a "modest" sample such as 64 bits it's intractable to verify that the desired randomness exists.
Much like the concept of hashing, where in general there are more sets of data of a given length with a given hash value than there are possible hash values, you can't account for all sources of order in the data. You need to determine the ways that order within the data can be useful to an adversary, which will vary by application, and attempt to make its exploitation disadvantageous. The best way to do that is to obscure how the data is transformed into the "publicly-observable final product", which will prevent inference of the most epistemically-useful ways to attack the data.
Kevin Barry
How about using a Xorshift PRNG, and using the random/non-random data to perturb its state continuously?
Here is a very slow example. Compile it, then run with a single parameter describing how many bytes it outputs per input byte. It will consume the first 16 bytes of standard input to prime the Xorshift generator state, then it will regularly use the input to perturb the generator state. Since Xorshift is supposed to pass the diehard tests, I wonder if the output from this is sufficiently random for you?
Note that I'm not at all sure whether this kind of perturbation (directly xor'ing the input into the generator state) makes sense, and I have not checked whether the output is random or not.
If you don't want to perturb the state at all, you can redirect standard input from /dev/zero, in which case the output is the same as the Wikipedia example code, but with 65536 first values discarded.
Code:
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

static uint32_t xorshift_state[4] = { 0U, 0U, 0U, 0U };

static inline uint32_t xorshift(const uint32_t perturb)
{
    uint32_t temp;

    temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
    xorshift_state[0] = xorshift_state[1];
    xorshift_state[1] = xorshift_state[2];
    xorshift_state[2] = xorshift_state[3];
    temp ^= (temp >> 8U) ^ xorshift_state[3] ^ (xorshift_state[3] >> 19U);
    xorshift_state[3] = temp ^ perturb;

    return temp;
}

static uint32_t fgetu32(FILE *const in)
{
    uint32_t value;

    errno = (fread(&value, sizeof value, 1, in) == 1) ? 0 : ENOENT;
    return value;
}

static void fputu32(uint32_t const value, FILE *const out)
{
    fwrite(&value, sizeof value, 1, out);
}

int main(int argc, char *argv[])
{
    uint32_t value;
    int rate, i;
    char dummy;

    if (argc != 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n"
                "Usage: %s [-h|--help]\n"
                "       %s rate\n"
                "\n"
                "This reads binary input, emitting\n"
                "rate pseudorandom bytes for each input byte.\n"
                "Initial 16 bytes are read for Xorshift state.\n"
                "\n"
                , argv[0], argv[0]);
        return 0;
    }

    if (sscanf(argv[1], "%d %c", &rate, &dummy) != 1 || rate < 1) {
        fprintf(stderr, "%s: Invalid rate.\n", argv[1]);
        return 1;
    }

    /* Prime the Xorshift generator from input. */
    xorshift_state[0] = fgetu32(stdin);
    xorshift_state[1] = fgetu32(stdin);
    xorshift_state[2] = fgetu32(stdin);
    xorshift_state[3] = fgetu32(stdin);
    if (errno)
        return 1;

    /* Since a zero state generates zeros, make sure state is nonzero. */
    if (!xorshift_state[0] && !xorshift_state[1] &&
        !xorshift_state[2] && !xorshift_state[3]) {
        /* Use Wikipedia example state. */
        xorshift_state[0] = 123456789U;
        xorshift_state[1] = 362436069U;
        xorshift_state[2] = 521288629U;
        xorshift_state[3] = 88675123U;
    }

    /* Initial state reflects input too much. Skip. */
    for (i = 0; i < 65536; i++)
        (void)xorshift(0U);

    /* Convert rate to the number of unperturbed words per perturbed word. */
    rate--;

    while (1) {
        value = fgetu32(stdin);
        if (errno)
            return 0;
        fputu32(xorshift(value), stdout);
        i = rate;
        while (i-- > 0)
            fputu32(xorshift(0U), stdout);
    }
}
Edited to add: The above is easily feasible on a Teensy, or just about any ARM microcontroller. The 32-bit shifts may need a bit of trickery on 16- or 8-bit microcontrollers -- rotating each 8-bit unit separately. It should be fast enough to provide at least some thousands of random bits per second.
Last edited by Nominal Animal; 04-13-2012 at 11:37 AM.
I have finally found a good way to extract random data from a radio signal. After lots of tests I have come to the conclusion that the best way to extract the random data is like this:
1) Use the von Neumann extractor first.
2) Then use the Whirlpool hash with 512-bit (64-byte) input (I removed the option to choose other input sizes because it doesn't help).
3) Split the file into 1 MB chunks.
4) Check each chunk with a modified version of ent, posted below.
5) Concatenate the chunks that passed into the final file.
Using this method the extraction ratio is 10:1. So for 100 MB input you will get 10 MB of good quality random data out.
So here's the modified version of ent. You'll need the rest of the ent sources; this is just ent.c:
Code:
/*
   ENT --  Entropy calculation and analysis of putative
           random sequences.

   Designed and implemented by John "Random" Walker in May 1985.

   Multiple analyses of random sequences added in December 1985.

   Bit stream analysis added in September 1997.

   Terse mode output, getopt() command line processing,
   optional stdin input, and HTML documentation added in
   October 1998.

   Documentation for the -t (terse output) option added
   in July 2006.

   Replaced table look-up for chi square to probability
   conversion with algorithmic computation in January 2008.

   For additional information and the latest version,
   see http://www.fourmilab.ch/random/
*/
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <math.h>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#else
#include <unistd.h>
#endif
#include "iso8859.h"
#include "randtest.h"
#define UPDATE "January 28th, 2008"
#define FALSE 0
#define TRUE 1
#ifdef M_PI
#define PI M_PI
#else
#define PI 3.14159265358979323846
#endif
extern double pochisq(const double ax, const int df);
/* HELP -- Print information on how to call */

static void help(void)
{
    fprintf(stderr, "ent --  Calculate entropy of file.  Call");
    fprintf(stderr, "\n        with ent [options] [input-file]");
    fprintf(stderr, "\n");
    fprintf(stderr, "\n        Options:   -b   Treat input as a stream of bits");
    fprintf(stderr, "\n                   -c   Print occurrence counts");
    fprintf(stderr, "\n                   -f   Fold upper to lower case letters");
    fprintf(stderr, "\n                   -t   Terse output in CSV format");
    fprintf(stderr, "\n                   -u   Print this message\n");
    fprintf(stderr, "\nBy John Walker");
    fprintf(stderr, "\n   http://www.fourmilab.ch/");
    fprintf(stderr, "\n   %s\n", UPDATE);
}

/* GETOPT -- Dumb version of getopt for brain-dead Windows. */

#ifdef _WIN32
static int optind = 1;

static int getopt(int argc, char *argv[], char *opts)
{
    static char *opp = NULL;
    int o;

    while (opp == NULL)
    {
        if ((optind >= argc) || (*argv[optind] != '-'))
        {
            return -1;
        }
        opp = argv[optind] + 1;
        optind++;
        if (*opp == 0)
        {
            opp = NULL;
        }
    }
    o = *opp++;
    if (*opp == 0)
    {
        opp = NULL;
    }
    return strchr(opts, o) == NULL ? '?' : o;
}
#endif

/* Main program */

int main(int argc, char *argv[])
{
    int i, oc, opt;
    long ccount[256];      /* Bins to count occurrences of values */
    long totalc = 0;       /* Total character count */
    char *samp;
    double montepi, chip, scc, ent, mean, chisq, compression, montec;
    FILE *fp = stdin;
    unsigned int overall;
    int counts = FALSE,    /* Print character counts */
        fold = FALSE,      /* Fold upper to lower */
        binary = FALSE,    /* Treat input as a bitstream */
        terse = FALSE;     /* Terse (CSV format) output */

    while ((opt = getopt(argc, argv, "bcftu?BCFTU")) != -1)
    {
        switch (toISOlower(opt))
        {
            case 'b':
                binary = TRUE;
                break;
            case 'c':
                counts = TRUE;
                break;
            case 'f':
                fold = TRUE;
                break;
            case 't':
                terse = TRUE;
                break;
            case '?':
            case 'u':
                help();
                return 0;
        }
    }

    if (optind < argc)
    {
        if (optind != (argc - 1))
        {
            fprintf(stderr, "Duplicate file name.\n");
            help();
            return 2;
        }
        if ((fp = fopen(argv[optind], "rb")) == NULL)
        {
            fprintf(stderr, "Cannot open file %s\n", argv[optind]);
            return 2;
        }
    }

#ifdef _WIN32
    /** Warning!  On systems which distinguish text mode and
        binary I/O (MS-DOS, Macintosh, etc.) the modes in the open
        statement for "fp" should have forced the input file into
        binary mode.  But what if we're reading from standard
        input?  Well, then we need to do a system-specific tweak
        to make sure it's in binary mode.  While we're at it,
        let's set the mode to binary regardless of however fopen
        set it.

        The following code, conditional on _WIN32, sets binary
        mode using the method prescribed by Microsoft Visual C 7.0
        ("Monkey C"); this may require modification if you're
        using a different compiler or release of Monkey C.  If
        you're porting this code to a different system which
        distinguishes text and binary files, you'll need to add
        the equivalent call for that system. */
    _setmode(_fileno(fp), _O_BINARY);
#endif

    samp = binary ? "bit" : "byte";
    memset(ccount, 0, sizeof ccount);

    /* Initialise for calculations */
    rt_init(binary);

    /* Scan input file and count character occurrences */
    while ((oc = fgetc(fp)) != EOF)
    {
        unsigned char ocb;

        if (fold && isISOalpha(oc) && isISOupper(oc))
        {
            oc = toISOlower(oc);
        }
        ocb = (unsigned char) oc;
        totalc += binary ? 8 : 1;
        if (binary)
        {
            int b;
            unsigned char ob = ocb;

            for (b = 0; b < 8; b++)
            {
                ccount[ob & 1]++;
                ob >>= 1;
            }
        }
        else
        {
            ccount[ocb]++;     /* Update counter for this bin */
        }
        rt_add(&ocb, 1);
    }
    fclose(fp);

    /* Complete calculation and return sequence metrics */
    rt_end(&ent, &chisq, &mean, &montepi, &scc);

    /* Calculate probability of observed distribution occurring from
       the results of the Chi-Square test */
    chip = pochisq(chisq, (binary ? 1 : 255));

    /* Print bin counts if requested */
    if (counts)
    {
        if (terse)
        {
            fprintf(stderr, "2,Value,Occurrences,Fraction\n");
        }
        else
        {
            fprintf(stderr, "Value Char Occurrences Fraction\n");
        }
        for (i = 0; i < (binary ? 2 : 256); i++)
        {
            if (terse)
            {
                fprintf(stderr, "3,%d,%ld,%f\n", i, ccount[i], ((double) ccount[i] / totalc));
            }
            else
            {
                if (ccount[i] > 0)
                {
                    fprintf(stderr, "%3d %c %10ld %f\n", i, (!isISOprint(i) || isISOspace(i)) ? ' ' : i, ccount[i], ((double) ccount[i] / totalc));
                }
            }
        }
        if (!terse)
        {
            fprintf(stderr, "\nTotal: %10ld %f\n\n", totalc, 1.0);
        }
    }

    /* Print calculated results */
    overall = 1;
    compression = ((100 * ((binary ? 1 : 8) - ent) / (binary ? 1.0 : 8.0)));
    montec = 100.0 * (fabs(PI - montepi) / PI);
    if (!terse)
    {
        if (ent > (binary ? 0.999999 : 7.9998))
        {
            fprintf(stderr, "PASS ");
        }
        else
        {
            fprintf(stderr, "FAIL ");
            overall = 0;
        }
        fprintf(stderr, "Entropy = %f bits/%s\n", ent, samp);

        if (compression < (binary ? 0.000025 : 0.0025))
        {
            fprintf(stderr, "PASS ");
        }
        else
        {
            fprintf(stderr, "FAIL ");
            overall = 0;
        }
        fprintf(stderr, "Compression = %f %%\n", compression);

        if (chip < 0.05 || chip > 0.95)
        {
            fprintf(stderr, "FAIL ");
            overall = 0;
        }
        else
        {
            fprintf(stderr, "PASS ");
        }
        fprintf(stderr, "Chi-exceed = %f %%\n", chip * 100);

        if (mean < (binary ? 0.4997 : 127.3) || mean > (binary ? 0.5003 : 127.7))
        {
            fprintf(stderr, "FAIL ");
            overall = 0;
        }
        else
        {
            fprintf(stderr, "PASS ");
        }
        fprintf(stderr, "Mean = %f\n", mean);

        if (montec > 0.3)
        {
            fprintf(stderr, "FAIL ");
            overall = 0;
        }
        else
        {
            fprintf(stderr, "PASS ");
        }
        fprintf(stderr, "Monte-Carlo = %f %%\n", montec);

        if (fabs(scc) > (binary ? 0.001 : 0.002))
        {
            fprintf(stderr, "FAIL ");
            overall = 0;
        }
        else
        {
            fprintf(stderr, "PASS ");
        }
        fprintf(stderr, "SCC = %f\n", scc);

        if (0 == overall)
        {
            fprintf(stderr, "\nFAIL overall\n");
            return 1;
        }
        else
        {
            fprintf(stderr, "\nPASS overall\n");
        }
    }
    else
    {
        fprintf(stderr, "Entropy,Compression,Chi-exceed,Mean,Monte-Carlo,SCC\n");
        fprintf(stderr, "%f,%f,%f,%f,%f,%f\n", ent, compression, chip * 100, mean, montec, scc);
    }
    return 0;
}
And the script to extract randomness:
Code:
#!/bin/sh
# Extract randomness from a file.
# Depends on vne, whash, ent (modified version), split, etc.

temp=/tmp/randextr

if test $# != 2
then
    echo "Usage: $(basename "$0") infile outfile"
    exit 1
fi
if test -e "$2"
then
    echo "$2 exists"
    exit 1
fi

current="$(pwd)"
mkdir "$temp" || exit 1

# start extraction
vne < "$1" | whash > "$temp/vh"
cd "$temp" || exit 1
if split -d -b 1M -a 5 vh
then
    for i in x*
    do
        echo "$i"
        if ent "$i"
        then
            cat "$i" >> "$current/$2"
        fi
    done
else
    echo "Split failed"
    exit 1
fi
cd "$current"
rm -rf "$temp"
exit 0
I can get radio data at 44.0 kB/s, so that means I can get about 4 kB/s of good quality random data.
Here's an example of data quality:
Code:
bash-4.1$ ent compiled
PASS Entropy = 7.999979 bits/byte
PASS Compression = 0.000256 %
PASS Chi-exceed = 76.254451 %
PASS Mean = 127.489548
PASS Monte-Carlo = 0.014112 %
PASS SCC = 0.000082
PASS overall
This is for an 8 MB file.
Last edited by H_TeXMeX_H; 04-18-2012 at 09:10 AM.
Reason: final update, terse option fixed