Need a fast von Neumann randomness extractor
What would be the most efficient way of coding this:
https://en.wikipedia.org/wiki/Bernou...ess_extraction Code:
input output

The problem is that this may be very inefficient. I mean, how do I work with two bits at a time? I don't even think I can do this in C efficiently. Any hints on how to go about this?

I need it because I want to extract randomness from radio signals, and although hashing the input works well, when I look at the histogram I see that it is not level and that there are many values at the two extremes, maybe due to non-uniformity. I was thinking this extractor would help ... I'm no expert though, but I will try it. One thing I have done is use GIMP on a generated pnm and tried to change the histogram by hand. It has worked quite well, but I would like to do it a different way if possible.

EDIT: The idea is from: http://www.entropykey.co.uk/res/down...xplanation.pdf Basically the von Neumann extractor first, and then the hash (which I have working already). |
Hmm, a quick thought...
I'd use Perl to get around it: use a for loop, but step by two, then get bit (n) and bit (n-1) and compare. A quick table: Quote:
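The quoted table did not survive above; as a hedged summary (consistent with the mapping described later in this thread), the rule for each input bit pair is: 00 discard, 01 output 0, 10 output 1, 11 discard. A tiny illustrative snippet: Code:
/* Hedged summary of the von Neumann rule for one input bit pair
 * (not the table originally quoted here):
 *   00 -> discard      01 -> output 0
 *   10 -> output 1     11 -> discard
 * This demo just prints the rule for all four pairs. */
#include <stdio.h>

int main(void)
{
    unsigned int pair;

    for (pair = 0; pair < 4; pair++) {
        if (pair == 1)
            printf("%u%u -> 0\n", pair >> 1, pair & 1);
        else if (pair == 2)
            printf("%u%u -> 1\n", pair >> 1, pair & 1);
        else
            printf("%u%u -> discard\n", pair >> 1, pair & 1);
    }
    return 0;
}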
Thor |
Ok, I think this code works, but do check it:
PHP Code:
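The posted code block itself did not survive here. As a stand-in, below is a minimal hedged sketch of the kind of extractor described in this thread (stdio-based, walking each byte two bits at a time); it is not the code that was actually posted, and the output bit order (lowest extracted bit first) is an assumption: Code:
/* Hedged sketch, not the originally posted code: read bytes with fread(),
 * walk each byte two bits at a time, emit 0 for a 01 pair and 1 for a 10
 * pair, discard 00 and 11, and pack the output bits into bytes. */
#include <stdio.h>

int main(void)
{
    unsigned char in[4096], out[4096];
    unsigned int acc = 0, nbits = 0;
    size_t n, i, o;
    int pair;

    while ((n = fread(in, 1, sizeof in, stdin)) > 0) {
        o = 0;
        for (i = 0; i < n; i++) {
            for (pair = 0; pair < 4; pair++) {
                unsigned int two = (in[i] >> (2 * pair)) & 3u;
                if (two == 1 || two == 2) {
                    acc |= (two >> 1) << nbits;   /* 01 -> 0, 10 -> 1 */
                    if (++nbits == 8) {
                        out[o++] = (unsigned char)acc;
                        acc = 0;
                        nbits = 0;
                    }
                }
            }
        }
        if (o > 0 && fwrite(out, 1, o, stdout) != o)
            return 1;
    }
    return 0;
}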
EDIT: I've done more testing and it works very well, especially after hashing. So I do raw -> extractor -> hash, and the output is very much random. However, I do want someone to look over the code and see if it does what I think it does. I've never coded stuff like this before. |
I wrote a slightly optimized version. It uses a 256-byte translation table (small enough to stay in CPU cache) to map each input byte to 0 to 4 output bits. Using a single core on Athlon II X4 640 (3GHz) it translates about 200 MB/s (1600 Mbit/s). (Tested using a 256MB file containing all byte values in sequence.)
This is not the fastest I can do, but going faster gets less portable and progressively harder to read and to maintain. Depending on how you use the data, you probably want to tweak the input and output buffer sizes. Code:
#include <unistd.h> Code:
unsigned char *in_next = in_data;

The bitmask array maps each input byte into two nibbles (high and low four bits): the number of extracted bits, and the extracted bits themselves. I used a trivial generator program to compute the array. (Note: I got it wrong on the first edit; this is the fixed version.) When a full byte is extracted, we need to extract the high bits (since they are the oldest bits) and keep the excess (low/new) bits intact. Masking the input (the last line) is not really necessary, since we shift the value before adding the new bits (i.e. new bits are always added to clear bits) -- in other words, the results are the same if you remove the masking altogether -- but I like to be careful. (A sketch of such a table generator follows after this post.)

The program will work with both blocking and non-blocking standard input and output. Let me know if you want it in a "library" form you can embed in your own code. I'll take a look at your own code next, H_TeXMeX_H.

Edit: H_TeXMeX_H, your code should work perfectly, as far as I can tell. I can see nothing wrong in it. (I did not test it, though.) Your code will be limited by stdio.h (fread() and fwrite()) speed; that is why I used low-level I/O instead. You can alleviate that by reading and writing larger buffers at once, or you too can switch to low-level I/O. Also, my tests indicate that on an Athlon II X4 640, doing bit pairs one at a time takes about 5x to 6x as long as the lookup, thus incurring a > 5x slowdown compared to my lookup method.

Note: This is the fixed version, which only works with architectures that have 8-bit unsigned chars; the code does not check that, though. Look at my post #10 in this thread for a better example; it is slightly faster, too, and much more portable. |
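As an illustration of the table described above, here is a hedged sketch of a generator program; it is not Nominal Animal's actual generator, and it assumes the high nibble holds how many bits a given input byte yields while the low nibble holds those bits, packed lowest pair first, which may differ from the layout actually used: Code:
/* Hedged sketch (not the actual generator): build a 256-entry table where,
 * by assumption, the high nibble is the number of extracted bits (0..4)
 * and the low nibble holds those bits, lowest input pair first.
 * Pair 01 yields 0, pair 10 yields 1, 00 and 11 yield nothing. */
#include <stdio.h>

int main(void)
{
    unsigned int value;

    printf("static const unsigned char bitmask[256] = {\n");
    for (value = 0; value < 256; value++) {
        unsigned int bits = 0, count = 0, pair;

        for (pair = 0; pair < 4; pair++) {
            unsigned int two = (value >> (2 * pair)) & 3;
            if (two == 1)                       /* 01 -> output 0 */
                count++;
            else if (two == 2) {                /* 10 -> output 1 */
                bits |= 1u << count;
                count++;
            }                                   /* 00 and 11 -> discarded */
        }

        printf("%s0x%02x%s", (value % 8) ? " " : "    ",
               (count << 4) | bits,
               (value < 255) ? "," : "\n};\n");
        if (value % 8 == 7)
            printf("\n");
    }
    return 0;
}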
Wow, that's way past my abilities. Many thanks for taking the time to write it. I will try it right away :)
|
Ok, I have fixed my code and it works properly now; there was a mistake which I had missed:
PHP Code:
However, there is something unusual... @ Nominal Animal: Using your code I don't get the same result. This is strange. I tested results from both, and mine produces more random output. I'm not sure of the cause. Yours is most certainly the fastest. I will try to investigate the cause of this discrepancy. |
Quote:
Edit: Oh, yes, I mangled the output bit order. Fixed now; both my codes yield the same output, the one I outline in post #9 in this thread. |
Oh, ok, the output order may be it. I'll run some more tests.
|
Yes, there was a bug in my code (in the bitmask array and how it was used). It's fixed now.
Here is the table of bits you should get from eight-bit input bytes, with - indicating a discarded bit pair, a 0 coming from a 01 bit pair and a 1 from a 10 bit pair, most significant bits on the left. The first row corresponds to values 0 to 15, and the last row to values 240 to 255: Code:
---- ---0 ---1 ---- --0- --00 --01 --0- --1- --10 --11 --1- ---- ---0 ---1 ---- Code:
42 dd 08 04 53 42 de 96 f7 f7 42 dd 08 04 53 42 Code:
for ((i=0; i<256; i++)) ; do h=$(printf '%02x' $i) ; printf "\x$h" ; done |
Here is my von Neumann randomness extractor, which provides the output described in my post above.
Instead of hardcoding the array, I precalculate it at the beginning of the program. This way the code should compile on architectures with a different byte size (as long as it is at most the size of an unsigned int, as I use an unsigned int bit accumulator).

The precalculator, init_random(), simply takes each possible input value (unsigned char) and converts it bit pair by bit pair. It consumes and saves ascending bits, starting with the lowest bits first. The value loop is in descending order, but since each value is a self-contained unit, this does not matter. While calculating bit pairs is slow compared to using a lookup, doing it once for each possible input value does not take appreciable time, and it allows portable code. Therefore it is worthwhile. (A rough sketch of the same structure follows after this post.)

You might notice in the inner loop that I handle the "bit accumulator holds more than a word" and "bit accumulator holds exactly a word" cases separately. It turns out that on my architecture and gcc-4.6.1, this is ten percent faster.

This version is roughly 5% faster on my architecture than my previous code in this thread, handling about 218000000 bytes of input per second on an Athlon II X4 640. Code:
#include <unistd.h> |
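The full program did not survive above; here is a hedged sketch of the same overall structure (not Nominal Animal's actual program): a lookup table built at startup plus an unsigned int bit accumulator, using low-level read()/write(). For simplicity it assumes 8-bit bytes and packs output bits lowest-first, so its exact byte stream may not match the output described earlier: Code:
/* Hedged sketch, not the posted program: table built at startup, unsigned
 * int bit accumulator, low-level I/O.  Assumes 8-bit bytes for simplicity,
 * and the output bit-packing order is an assumption. */
#include <unistd.h>

static unsigned char out_bits[256];   /* extracted bits, lowest first */
static unsigned char out_count[256];  /* how many bits were extracted */

static void init_table(void)
{
    unsigned int value, pair;

    for (value = 0; value < 256; value++) {
        unsigned int bits = 0, count = 0;
        for (pair = 0; pair < 4; pair++) {
            unsigned int two = (value >> (2 * pair)) & 3;
            if (two == 1)                   /* 01 -> 0 */
                count++;
            else if (two == 2)              /* 10 -> 1 */
                bits |= 1u << count++;
        }
        out_bits[value]  = (unsigned char)bits;
        out_count[value] = (unsigned char)count;
    }
}

int main(void)
{
    static unsigned char in[65536], out[65536];
    unsigned int acc = 0, acc_bits = 0;
    ssize_t n, i, o;

    init_table();

    while ((n = read(STDIN_FILENO, in, sizeof in)) > 0) {
        o = 0;
        for (i = 0; i < n; i++) {
            acc |= (unsigned int)out_bits[in[i]] << acc_bits;
            acc_bits += out_count[in[i]];
            while (acc_bits >= 8) {
                out[o++] = (unsigned char)(acc & 0xFF);
                acc >>= 8;
                acc_bits -= 8;
            }
        }
        /* Note: a robust version would loop on partial writes. */
        if (o > 0 && write(STDOUT_FILENO, out, (size_t)o) != o)
            return 1;
    }
    return 0;
}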
Thanks, I think that fixed it. It now seems to generate equivalent randomness to mine. It is about 7 times faster, so many thanks again :)
I will mark this solved. |
Here's another take on it, if you're interested. It's shorter than the last one Nominal Animal posted, but about 7% slower. I'm pretty sure the difference is due to more stack usage (i.e. less can be taken care of in registers.)
Code:
#include <stdio.h>

edit: Actually, it takes 3.5x longer when reading from a file and dumping to /dev/null. Sorry about the false claim!
Thanks for the other take. I will try to understand it as well :)
I do not understand one thing: Using the test sequence (0-255), the output for mine and ta0kira: Code:
0x00000000: DD42F7F7 96DE4253 0408DD42 F7F796DE ÝB÷÷.ÞBS..ÝB÷÷.Þ

while the output from Nominal Animal's code is: Code:
0x00000000: 42DD0804 5342DE96 F7F742DD 08045342 BÝ..SBÞ.÷÷BÝ..SB |
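For what it is worth, the two dumps above contain the same extracted bytes, just grouped and ordered differently; one way this can happen is when two programs collect output in accumulators of different widths and flush them in different orders. A small hedged illustration (the byte values are taken from the dumps above; which order is "canonical" depends on the implementation): Code:
/* Illustration only (not code from the thread): the same extracted bytes,
 * flushed from a wider accumulator oldest-first versus newest-first,
 * produce streams that look different even though the content is the same. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Four extracted bytes, values taken from the dumps in this thread. */
    const uint8_t produced[4] = { 0x42, 0xDD, 0x08, 0x04 };
    int i;

    printf("flushed oldest-first: ");
    for (i = 0; i < 4; i++)
        printf("%02X", produced[i]);

    printf("\nflushed newest-first: ");
    for (i = 3; i >= 0; i--)
        printf("%02X", produced[i]);
    printf("\n");
    return 0;
}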
Quote:
Quote:
Kevin Barry |
I see, so it really depends on many things. Still the output is equivalent, so I can use it.
In case anyone is wondering what I'm doing with all this. I hooked up a radio headphone jack to my mic jack using an audio recording cable, something like: http://www.sonicelectronix.com/item_...he-AUX3BL.html I tuned the radio to a noisy channel. This is the hardest part, because noise doesn't always mean randomness. I did tests on the output using ENT and this script: Code:
#!/bin/sh

As per the schema I posted in the OP, I first run a von Neumann extractor; any of these will work just fine, but performance differs, and Nominal Animal's code is unsurpassed. Then I run a hash on blocks of that output. This was hard to code, probably because I'm not that good at C. However, I have coded a small add-on to the whirlpool hash reference implementation (a hedged sketch of such a block-hashing filter appears at the end of this post). This is the part that I changed: Code:
int main(void) {

I can just pipe the output from the radio like this: Code:
rrec | vnemed | whash > output |
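For reference, here is a hedged sketch of what such a block-hashing filter might look like. It is not the actual add-on, and it assumes the NESSIE-style API of the Whirlpool reference implementation (struct NESSIEstruct, NESSIEinit, NESSIEadd, NESSIEfinalize, DIGESTBYTES); adjust the names and headers to match your copy of whirlpool.c: Code:
/* Hedged sketch (not the actual add-on): read 64-byte blocks from stdin,
 * hash each block with Whirlpool, write the 64-byte digest to stdout.
 * Assumes the NESSIE-style API of the reference implementation. */
#include <stdio.h>
#include "nessie.h"   /* or whirlpool.h, depending on your copy */

#ifndef DIGESTBYTES
#define DIGESTBYTES 64
#endif

#define BLOCK 64      /* 512-bit input blocks, as used in this thread */

int main(void)
{
    unsigned char block[BLOCK], digest[DIGESTBYTES];
    struct NESSIEstruct w;
    size_t n;

    while ((n = fread(block, 1, BLOCK, stdin)) == BLOCK) {
        NESSIEinit(&w);
        NESSIEadd(block, 8 * BLOCK, &w);   /* length argument is in bits */
        NESSIEfinalize(&w, digest);
        if (fwrite(digest, 1, DIGESTBYTES, stdout) != DIGESTBYTES)
            return 1;
    }
    /* A trailing partial block (n < BLOCK) is simply discarded here. */
    return 0;
}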
yeah, I've been following this thread and wondering why exactly you're doing it:). Is it anything to do with encryption or generating keys?
|
Quote:
Kevin Barry |
Thanks for this thread! I learned something here, you guys lit up my week!!!!
:D Just let me/us know where your "eagle" landed, please :) @ sycamorex - it would seem to be a way to find patterns in randomness... didn't know that hissing on the radio was good for something, goes to prove you can always learn something!!! Thanks Thor
Quote:
|
Quote:
Quote:
I should have linked this before: http://www.random.org/history/ Reading that made me want to try to do something like that. Technically, I don't have too much use for it, I just want a proof of concept for myself. I trust radioactive decay (hotbits) more than radio, but I have tried samples from random.org and they seem to be of good quality. So then, I wanted to do it myself so I don't have to rely on sites that may go down or have daily limits (hotbits). |
Quote:
|
I've tried using the hash before the extractor, but it doesn't make much difference. Either way, both must be used to produce good random output; neither will work well alone. Using the hash first produces slightly smaller output, which is a good thing in theory.
Some more theory: I am using a short-wave radio for this, because it is HF radio and is prone to atmospheric and solar interference. This means that there's plenty of noise and potential randomness. Also see: http://en.wikipedia.org/wiki/Noise_%28radio%29 http://en.wikipedia.org/wiki/Cosmic_noise |
As long as we are discussing it, I found something very interesting in my quest.
Here are algorithms to determine if a file is random and how random. Code:
The procedure to determine if a file is random or not:

The diehard and dieharder tests seem quite useless and time consuming.
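The procedure itself did not survive in the block above. Based on the ent-based checks described later in the thread, a check along these lines is presumably what is meant; the statistics below (entropy and arithmetic mean) are only two of the ones ent reports, and the thresholds are illustrative guesses, not the values used in the thread: Code:
/* Hedged sketch of an ent-style check (not the actual procedure posted):
 * compute entropy in bits per byte and the arithmetic mean of a buffer,
 * and compare against illustrative thresholds.  Link with -lm. */
#include <stdio.h>
#include <math.h>

static int looks_random(const unsigned char *buf, size_t len)
{
    unsigned long count[256] = { 0 };
    double entropy = 0.0, mean = 0.0;
    size_t i;

    for (i = 0; i < len; i++)
        count[buf[i]]++;

    for (i = 0; i < 256; i++) {
        if (count[i]) {
            double p = (double)count[i] / (double)len;
            entropy -= p * log2(p);
        }
        mean += (double)i * (double)count[i];
    }
    mean /= (double)len;

    /* Illustrative thresholds only; tune against known-good data. */
    return entropy > 7.99 && mean > 126.5 && mean < 128.5;
}

int main(void)
{
    static unsigned char buf[1 << 20];   /* 1 MiB, as used later in the thread */
    size_t n = fread(buf, 1, sizeof buf, stdin);

    if (n == sizeof buf && looks_random(buf, n)) {
        puts("chunk looks random");
        return 0;
    }
    puts("chunk rejected (short read or failed checks)");
    return 1;
}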
It is hard for me to believe what I just posted, but it is... EDIT: actually false. Sorry, I must do more tests on hotbits to fix these unreliable and size-dependent tests.
|
Quote:
For example, you could start with a $16 Teensy++ to interface to via USB. It has a native USB interface, and is very easy to program using gcc-avr. Then, use a hardware random bit generator (one or more) as a source of random bits, using the von Neumann extractor in the Teensy to avoid any bias in the result. If you have more than one, you can increase the generated bit rate.

You can easily write a PRNG like Xorshift on the Teensy (or several!), and perturb its state continuously using the random bits, making it more or less a true TRNG. You can easily provide both TRNG and T/PRNG in the same device. The hardware parts should cost about $25 total, including a nice tin enclosure. The circuit generates EM noise, so a metal enclosure is a must.

The data rate is small enough that the Teensy could easily continuously encrypt the output using a cryptographically secure algorithm, if you want truly random bits. If you use the raw HID interface (the one used by USB mice, keyboards, joysticks, and gamepads), you are limited to 512000 bits per second (1000 packets, 64 bytes each, per second; more than enough for several hardware bits), but you don't need any drivers. On any OS.

This might be a rather interesting project. A kernel driver to feed the random bits from the above hardware USB device to the entropy pool would be very simple, even including some randomness checks on the bits. Or, one could tell the OpenSSL library to obtain initial PRNG state bits from the USB device, for example from a simple userspace application providing it via a pipe (using the HID interface to the USB device). |
Quote:
Much like the concept of hashing, where in general there are more sets of data of a given length with a given hash value than there are possible hash values, you can't account for all sources of order in the data. You need to determine the ways that order within the data can be useful to an adversary, which will vary by application, and attempt to make its exploitation disadvantageous. The best way to do that is to obscure how the data is transformed into the "publicly-observable final product", which will prevent inference of the most epistemically-useful ways to attack the data. Kevin Barry |
I suppose I could buy one, but I am reluctant to buy any of those because they don't provide any sample of the data they generate.
From the above tests I only trust /dev/random and hotbits (uses geiger counter). Other methods may just produce pseudorandom data. |
EDIT: Wrong info removed.
|
How about using a Xorshift PRNG, and using the random/non-random data to perturb its state continuously?
Here is a very slow example. Compile it, then run it with a single parameter describing how many bytes it outputs per input byte. It will consume the first 16 bytes of standard input to prime the Xorshift generator state, then it will regularly use the input to perturb the generator state. Since Xorshift is supposed to pass the diehard tests, I wonder if the output from this is sufficiently random for you?

Note that I'm not at all sure whether this kind of perturbation (directly xor'ing the input into the generator state) makes sense, and I have not checked whether the output is random or not. If you don't want to perturb the state at all, you can redirect standard input from /dev/zero, in which case the output is the same as the Wikipedia example code, but with the first 65536 values discarded. Code:
#include <unistd.h> |
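The program itself did not survive above; below is a hedged minimal sketch of the idea (not Nominal Animal's code): a xorshift128 state primed from the first 16 input bytes, XOR-perturbed by each further input byte, emitting N output bytes per input byte. Unlike the described program, this sketch does not discard the first 65536 values and does not guard against an all-zero state: Code:
/* Hedged sketch, not the posted program: xorshift128 seeded from the first
 * 16 bytes of stdin, then perturbed by XOR-ing each further input byte into
 * the state, emitting N pseudorandom bytes per input byte. */
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char *argv[])
{
    uint32_t s[4];                 /* xorshift128 state */
    unsigned char byte, out;
    long n, i;

    if (argc != 2 || (n = atol(argv[1])) < 1) {
        fprintf(stderr, "Usage: %s bytes-per-input-byte\n", argv[0]);
        return 1;
    }

    /* Prime the state from the first 16 input bytes.  (A robust version
     * would loop here, and reject an all-zero state.) */
    if (read(STDIN_FILENO, s, sizeof s) != (ssize_t)sizeof s)
        return 1;

    while (read(STDIN_FILENO, &byte, 1) == 1) {
        s[0] ^= byte;                              /* perturb the state */
        for (i = 0; i < n; i++) {
            /* One step of Marsaglia's xorshift128. */
            uint32_t t = s[0] ^ (s[0] << 11);
            s[0] = s[1]; s[1] = s[2]; s[2] = s[3];
            s[3] = s[3] ^ (s[3] >> 19) ^ t ^ (t >> 8);
            out = (unsigned char)(s[3] & 0xFF);
            if (write(STDOUT_FILENO, &out, 1) != 1)
                return 1;
        }
    }
    return 0;
}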
I have finally found a good way to extract random data from a radio signal. After lots of tests I have come to the conclusion that the best way to extract the random data is like this:
1) Use the von Neumann extractor first.
2) Use the whirlpool hash with 512-bit = 64-byte input (I removed the option to choose other input byte sizes because it doesn't help).
3) Split the file into 1 MB chunks.
4) Check each chunk with a modified version of ent, posted below.
5) Concatenate the chunks that passed into the final file.

Using this method the extraction ratio is 10:1, so for 100 MB of input you will get 10 MB of good-quality random data out. So here's the modified version of ent (you'll need the rest of it, this is just ent.c): Code:
/* Code:
#!/bin/sh

Here's an example of data quality: Code:
bash-4.1$ ent compiled |
I have an even better solution now; in fact, it is exactly what I wanted. I edited ENT some more, and now I can make a pipe that outputs random data and only random data (as checked by ent). I had to make all printfs output to stderr, which should be fine, so stdout outputs only random data in 1 MiB chunks. (You can edit the chunk size in the source, but you must take into account the changes that must be made to the threshold parameters, because I have found that using ent on small files is unreliable, and that 1 MiB is a reasonable size to check.)
The modified ent code (I call it entpipe): Code:
/*

This is a script to run it all: Code:
#!/bin/sh Code:
bash-4.1$ ent random

I've also had great fun programming in C. This is probably the most I've done since high school. I'm just glad Linux segfaults instead of crashing like only Window$ computers used to. |
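To round this off, here is a hedged bare-bones sketch of the chunk-gating idea described in the last two posts (not the modified ent source, which tracks more statistics): read 1 MiB chunks from stdin, write a chunk to stdout only if it clears a simple entropy check, and report the statistics to stderr. The 7.99 bits-per-byte threshold is illustrative only: Code:
/* Hedged sketch of the entpipe idea (not the modified ent source): pass
 * 1 MiB chunks from stdin to stdout only if they clear a simple entropy
 * check; diagnostics go to stderr.  Link with -lm. */
#include <stdio.h>
#include <math.h>

#define CHUNK (1 << 20)

int main(void)
{
    static unsigned char buf[CHUNK];
    size_t n, i;

    while ((n = fread(buf, 1, CHUNK, stdin)) == CHUNK) {
        unsigned long count[256] = { 0 };
        double entropy = 0.0;

        for (i = 0; i < CHUNK; i++)
            count[buf[i]]++;
        for (i = 0; i < 256; i++)
            if (count[i]) {
                double p = (double)count[i] / (double)CHUNK;
                entropy -= p * log2(p);
            }

        fprintf(stderr, "chunk entropy: %.6f bits/byte - %s\n",
                entropy, entropy > 7.99 ? "kept" : "dropped");
        if (entropy > 7.99 && fwrite(buf, 1, CHUNK, stdout) != CHUNK)
            return 1;
    }
    return 0;   /* a trailing partial chunk is discarded */
}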