Linux - Software: This forum is for Software issues.
I wrote this bash script to create a big text file with a goal of five million lines.
It performs three reads from static files (5words, 9words, 10words), concatenates some string values, and writes the result to a file using redirection.
The problem is that I ran this overnight last night and it had only created 2258425 lines in the target file.
I'm afraid the reads and the writes are bottlenecking on my single conventional hard drive. Because of that, I don't want to try to run it in parallel, because then it would just brutalize my drive even more.
I'm thinking of mapping a filesystem in memory and putting both the source and target files on that. Any other ideas? Script is below.
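On the in-memory filesystem idea: most Linux systems already mount a tmpfs at /dev/shm, so you can try it without root or a custom mount. A minimal sketch, with a three-word sample file standing in for the real 5words list:

```shell
# /dev/shm is normally a tmpfs, so reads and writes here hit RAM, not disk
workdir=/dev/shm/gen$$            # $$ keeps the path unique per run
mkdir -p "$workdir"
printf 'alpha\nbravo\ncharlie\n' > "$workdir/5words"   # stand-in word list
picked=$(shuf -n 1 "$workdir/5words")
echo "$picked"
rm -rf "$workdir"
```

The same pattern extends to the target file: redirect the script's output to a path under /dev/shm and copy the finished CSV back to disk once at the end.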
di11rod
Code:
#!/bin/bash
# for ((...)) is a bashism, so the shebang must be bash, not sh
for ((counter=1; counter<=5000000; counter++))
do
ID=$counter`shuf -n 1 5words`
firstName=`shuf -n 1 9words| sed 's/.*/\u&/'`
lastName=`shuf -n 1 10words| sed 's/.*/\u&/'`
managerID="X5X"
fullName=$firstName" "$lastName
email=$firstName"."$lastName"@company.com"
department="IT"
region="Europe"
location="Schipol"
inactivity="False"
costcenter="Admin"
echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"
done
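One write-side detail worth checking, since the post says the output goes to a file via redirection: if the redirection appends inside the loop, the target file is opened and closed once per line. Redirecting the whole loop keeps a single file handle open for the full run. A minimal sketch (the temp file is a stand-in for the real FiveMilList.csv):

```shell
outfile=$(mktemp)                       # stand-in for the real target file
for ((counter=1; counter<=3; counter++)); do
    echo "row $counter"
done > "$outfile"                       # file opened once, not once per line
lines=$(wc -l < "$outfile")
echo "$lines"
rm -f "$outfile"
```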
I thought the reads were bottlenecking when I read this post, but I guess that isn't true, as the files being read are cached. The way the script treats them, however, is quite intensive: it reads the files again on every iteration and capitalises the second and third items on every row, over and over again. You could make that more efficient in a very easy way.
I rewrote the script slightly to read the input files once, capitalize them once, and then reuse them for every row, with shuf as the only external command needed to get the desired output.
This way, no file handles need to be reopened on every iteration, which is friendlier to the processor.
Code:
#!/bin/bash
# for ((...)) is a bashism, so the shebang must be bash, not sh
#input files to variables
words5=`cat 5words`
words9=`cat 9words | sed 's/.*/\u&/'`
words10=`cat 10words | sed 's/.*/\u&/'`
for ((counter=1; counter<=5000000; counter++))
do
ID=$counter`echo "$words5" | shuf -n 1`
firstName=`echo "$words9" | shuf -n 1`
lastName=`echo "$words10" | shuf -n 1`
managerID="X5X"
fullName="$firstName $lastName"
email="$firstName.$lastName@company.com"
department="IT"
region="Europe"
location="Schipol"
inactivity="False"
costcenter="Admin"
echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"
done
I do not really know the number of lines in 5words, 9words and 10words, but invoking shuf that many times (15,000,000) looks very inefficient to me. You can probably find a much cheaper way to get a random value.
I spent some time with it this evening. I modified the script per Rhoekstra's suggestion; I thought that was a much smarter design than my original draft. Surprisingly, though, putting the source word lists into variables and pushing those lists through shuf seems to run slower than reading them off the file system. I checked the environment and there is plenty of memory, CPU, and hard drive bandwidth. It's just crazy.
Maybe I do need to avoid shuf and try to pick them out of an array randomly. I'll rework it and post back.
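Picking from a bash array indexed with $RANDOM avoids forking shuf for every field. A minimal sketch of that idea, assuming the lists fit comfortably in memory (the sample file stands in for the real 9words; note that $RANDOM tops out at 32767, which is fine for word lists shorter than that):

```shell
tmpdir=$(mktemp -d)
printf 'adam\neve\njo\n' > "$tmpdir/9words"     # stand-in word list
mapfile -t words9 < "$tmpdir/9words"            # read the file exactly once
n9=${#words9[@]}
rows=""
for ((counter=1; counter<=5; counter++)); do
    firstName=${words9[RANDOM % n9]}            # no subprocess per row
    rows="$rows$counter,$firstName"$'\n'
done
printf '%s' "$rows"
rm -rf "$tmpdir"
```

Capitalisation could still be done once up front, with the same sed call Rhoekstra uses, before mapfile reads the list.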
Here's some performance stuff to consider:
(FiveMilList.csv is the target file of my script; the two line counts below, taken 87 seconds apart, work out to roughly 3-4 lines written per second)
Code:
# date
Tue Jan 14 18:18:25 CST 2014
# cat FiveMilList.csv | wc -l
2261002
# date
Tue Jan 14 18:19:52 CST 2014
# cat FiveMilList.csv | wc -l
2261300