[SOLVED] Data distribution among lines within a file with bash

phazeman · 04-22-2012, 07:14 AM

Hi All

I need to create a text file and distribute some numbers among its lines by percentage. What I mean exactly:
I want to set a percentage for each number and then fill the lines according to that percentage
500 - 20%
501 - 30%
502 - 50%
I need 20% of the lines to contain the number "500", 30% to contain "501" and the remaining 50% to contain "502". The numbers need to be placed randomly, not one block after another.

Any help will be appreciated !

colucix · 04-22-2012, 07:47 AM

Well, you can fill the file with numbers in sequence (according to their percentage) and scramble it later. You can try the shuf command or the shuffle function in perl, e.g.

Code:

shuf file

or

Code:

cat file | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);'
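As a rough sketch of the "fill in sequence, then scramble" idea, assuming the 20/30/50 split from the first post: the helper name fill_in_sequence and the temporary file are made up for illustration, and shuf has to be installed for the last step.

```shell
#!/bin/bash
# Hypothetical helper: print 500, 501 and 502 in order, each repeated
# according to its share of the requested number of lines.
fill_in_sequence() {
  local total=$1 n500 n501 n502
  n500=$(( total * 20 / 100 ))
  n501=$(( total * 30 / 100 ))
  n502=$(( total - n500 - n501 ))   # the remainder absorbs rounding losses
  # yes repeats its argument forever; head keeps only the first N copies
  yes 500 | head -n "$n500"
  yes 501 | head -n "$n501"
  yes 502 | head -n "$n502"
}

tmp=$(mktemp)                  # scratch file, stands in for "file" above
fill_in_sequence 10 > "$tmp"   # 2 x 500, 3 x 501, 5 x 502, still in order
shuf "$tmp"                    # same lines, random order
rm -f "$tmp"
```

The counts are fixed before shuffling, so the percentages hold exactly (up to integer rounding) no matter how the lines end up ordered.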

Hope this helps.

phazeman · 04-22-2012, 08:43 AM

Since each line contains more fields than just that number (I left them out to keep the post short), it's not possible to do it that way.

The line will look like:
<some number>, <some number>, 500, <some number>

I'm generating the additional numbers with "for" loops and that's not a problem, but I can't think of a mechanism for this specific distribution...

colucix · 04-22-2012, 09:11 AM

It would be useful to see the relevant piece of your script. As a first idea, I would generate the shuffled sequence of numbers and then insert them into the original text one at a time in a loop.

phazeman · 04-22-2012, 09:14 AM

This is the script I have so far:

for i in `seq -w 0 255`; do
  for j in `seq -w 0 255`; do
    echo -e "930000${i}${j},<here should come the distributed number>,,0,,930000${i}${j},930000${i}${j},English" >> test.txt
  done
done

colucix · 04-22-2012, 10:21 AM

Ok. I would assign the shuffled sequence of numbers to an array and then use the Nth element of the array inside the loops, by increasing the index of the array by one at each iteration. Here we go:

Code:

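#!/bin/bash
lines=65536                      # 256 * 256 lines are written by the loops below
p1=$(( lines * 20 / 100 ))       # how many lines get 500
p2=$(( lines * 30 / 100 ))       # how many lines get 501
p3=$(( lines * 50 / 100 ))       # how many lines get 502
p4=$(( lines - p1 - p2 - p3 ))   # leftover from integer division, also assigned to 502
# Print the values in order, shuffle them and store the result in an array
sequence=( $(echo "$(seq 1 $p1 | awk '{print 500}' && seq 1 $p2 | awk '{print 501}' && seq 1 $p3 | awk '{print 502}' && seq 1 $p4 | awk '{print 502}')" | shuf) )
for i in $(seq -w 0 255)
do
  for j in $(seq -w 0 255)
  do
    echo "930000${i}${j},${sequence[((c++))]},,0,,930000${i}${j},930000${i}${j},English"
  done
done > test.txt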

The ((c++)) index expression increases the variable c by one at each iteration. With this notation (inherited from the C language) the variable is incremented after its value is used. This means that at the first iteration the index is 0 (c is unset at that point, and bash treats an unset variable as 0 in arithmetic), at the second iteration it is 1 and so on. This is exactly what we want, since array elements in bash are numbered starting from 0.
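To see the post-increment in isolation, here is a tiny sketch with a three-element array standing in for the shuffled sequence (the array name values and the loop items x, y, z are made up for the demo):

```shell
#!/bin/bash
# Stand-in for the shuffled sequence; the names here are for illustration only.
values=(500 501 502)
c=0   # explicit here for clarity; an unset variable would also count as 0

for name in x y z; do
  # The subscript is evaluated as arithmetic: c's current value is used
  # as the index, and only afterwards is c increased by one.
  echo "${name},${values[c++]}"
done

echo "c is now $c"
```

This prints x,500 then y,501 then z,502 and finally "c is now 3". Note that the increment only sticks because the expansion happens in the current shell; inside a $( ... ) command substitution the change to c would be lost when the subshell exits.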

Hope this helps.

phazeman · 04-23-2012, 01:39 AM

This looks very promising! But I can't run it since my Linux system doesn't have shuf, and I can't find an RPM for it anywhere (RHEL 5.5 Tikanga, 32-bit). The official ISO doesn't have it either. Apparently the coreutils RPM in RHEL 5.5 doesn't include it...

colucix · 04-23-2012, 02:17 AM

Indeed, shuf is available only in more recent versions of coreutils. You can try the perl command instead, that is:

Code:

sequence=( $(echo "$(seq 1 $p1 | awk '{print 500}' && seq 1 $p2 | awk '{print 501}' && seq 1 $p3 | awk '{print 502}' && seq 1 $p4 | awk '{print 502}')" | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);') )

This ensures compatibility with older systems.
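If the same script has to run on both old and new systems, one possible approach is a small wrapper that picks whichever tool is present. This is only a sketch: shuffle_lines is a made-up name, and it assumes perl with List::Util is installed wherever shuf is missing.

```shell
#!/bin/bash
# Hypothetical wrapper: shuffle the lines read from standard input,
# using shuf when it is installed and the perl one-liner otherwise.
shuffle_lines() {
  if command -v shuf >/dev/null 2>&1; then
    shuf
  else
    perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);'
  fi
}

# Usage: pipe any list of lines through it
seq 1 10 | shuffle_lines
```

In the script above, the `| shuf` or `| perl ...` at the end of the sequence=( ... ) line could then be replaced by `| shuffle_lines`.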

phazeman · 04-23-2012, 03:43 AM

Thank you very much! Looks like it solved the problem!

colucix · 04-23-2012, 03:49 AM

Glad to hear it!

Please, mark this thread as SOLVED. Thanks!