Old 01-12-2014, 10:17 PM   #1
di11rod
Member
 
Registered: Jan 2004
Location: Austin, TEXAS
Distribution: CentOS 6.5
Posts: 211

Rep: Reputation: 32
need help optimizing simple BASH script


I wrote this bash script to create a big text file with a goal of five million lines.

It performs three reads from static files (5words, 9words, 10words), concatenates some string values together, and writes the result to a file using redirection.

The problem is that I ran this overnight last night and it had only created 2258425 lines in the target file.

I'm afraid the reads and writes are bottlenecking on my single conventional hard drive. Because of that, I don't want to run it in parallel, since that would just hammer the drive even more.

I'm thinking of mapping a filesystem in memory and putting both the source and target files on it. Any other ideas? The script is below.

di11rod

Code:
#!/bin/bash
# writes five million CSV rows to stdout; redirect to a file to capture them

for ((counter=1; counter<=5000000; counter++))
do
        ID=$counter`shuf -n 1 5words`
        firstName=`shuf -n 1 9words| sed 's/.*/\u&/'`
        lastName=`shuf -n 1 10words| sed 's/.*/\u&/'`
        managerID="X5X"
        fullName=$firstName" "$lastName
        email=$firstName"."$lastName"@company.com"
        department="IT"
        region="Europe"
        location="Schipol"
        inactivity="False"
        costcenter="Admin"

        echo $ID","$firstName","$lastName","$managerID","$fullName","$email","$department","$region","$location","$inactivity","$costcenter

done
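
For the in-memory filesystem idea mentioned above, here is a minimal tmpfs sketch (not from the thread). It assumes root access, a made-up mount point /mnt/ramdisk, and a placeholder script name; tmpfs lives in RAM, so anything written there disappears at reboot.

Code:
# create and mount a RAM-backed filesystem (size is a rough guess, adjust as needed)
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=1g tmpfs /mnt/ramdisk

# put the word lists there and write the target file there as well
cp 5words 9words 10words /mnt/ramdisk/
cd /mnt/ramdisk
./makelist.sh > FiveMilList.csv    # makelist.sh is a placeholder for the script above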
 
Old 01-13-2014, 02:46 AM   #2
rhoekstra
Member
 
Registered: Aug 2004
Location: The Netherlands
Distribution: RedHat 2, 3, 4, 5, Fedora, SuSE, Gentoo
Posts: 372

Rep: Reputation: 42
When I first read this post I thought the reads were the bottleneck, but that probably isn't true, since the input files end up cached after the first pass. What is expensive is how they are handled: on every single row you open and read all three files again and re-capitalise the words again, over and over. You could make that more efficient in a very easy way.

I rewrote the script slightly to read the input files once and capitalise them once, then reuse them on every row, leaving shuf as the only external command per field.

This way no file handles need to be opened on every iteration, which is more processor friendly.

Code:
#!/bin/bash

#input files to variables
words5=`cat 5words`
words9=`cat 9words | sed 's/.*/\u&/'`
words10=`cat 10words | sed 's/.*/\u&/'`

for ((counter=1; counter<=5000000; counter++))
do
        ID=$counter`echo "$words5"| shuf -n 1`
        firstName=`echo "$words9"| shuf -n 1`
        lastName=`echo "$words10"| shuf -n 1`
        managerID="X5X"
        fullName="$firstName $lastName"
        email="$firstName.$lastName@company.com"
        department="IT"
        region="Europe"
        location="Schipol"
        inactivity="False"
        costcenter="Admin"

        echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"

done

Last edited by rhoekstra; 01-13-2014 at 02:51 AM.
 
Old 01-13-2014, 06:07 AM   #3
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,838

Rep: Reputation: 7308
I do not know how many lines 5words, 9words and 10words contain, but invoking shuf that many times (15,000,000 calls for 5,000,000 rows) looks very wasteful to me. There is probably a much cheaper way to get a random value.
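
As an illustration of that cheaper random value (not from the thread), a minimal bash sketch that indexes an array with $RANDOM instead of forking shuf. It assumes bash 4's mapfile, and it combines two $RANDOM draws because a single draw only spans 0-32767 while the word lists here are longer than that.

Code:
#!/bin/bash
# load a word list into an array once
mapfile -t words9 < 9words

# pick a random element without running any external command;
# RANDOM is 0..32767, so two draws are combined to cover longer lists
idx=$(( (RANDOM * 32768 + RANDOM) % ${#words9[@]} ))
echo "${words9[idx]}"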
 
Old 01-13-2014, 08:14 AM   #4
rhoekstra
Member
 
Registered: Aug 2004
Location: The Netherlands
Distribution: RedHat 2, 3, 4, 5, Fedora, SuSE, Gentoo
Posts: 372

Rep: Reputation: 42
Indeed, it would be better to do this in a language like Perl: read the file entries into an array once, then randomly pick an element on each iteration.
 
Old 01-14-2014, 06:25 PM   #5
di11rod
Member
 
Registered: Jan 2004
Location: Austin, TEXAS
Distribution: CentOS 6.5
Posts: 211

Original Poster
Rep: Reputation: 32
Thanks for the ideas on this!

I spent some time with it this evening. I modified the script per Rhoekstra's suggestion, and I thought that was a much smarter design than my original draft. Surprisingly, though, putting the source word lists into variables and piping them through shuf seems to run slower than reading them off the file system. I checked the environment and there is plenty of memory, CPU, and hard drive bandwidth. It's just crazy.

Maybe I do need to avoid shuf and try to pick them out of an array randomly. I'll rework it and post back.

Here's some performance stuff to consider:

(FiveMilList.csv is the target file of my script)

Code:
# date
Tue Jan 14 18:18:25 CST 2014
# cat FiveMilList.csv | wc -l
2261002


# date
Tue Jan 14 18:19:52 CST 2014
# cat FiveMilList.csv | wc -l
2261300
Pretty slow!
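(For scale: 2261300 - 2261002 = 298 lines in the 87 seconds between those two checks, roughly 3.4 lines per second, which works out to around 17 days for the full five million lines.)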


Code:
# cat 10words | wc -l
54529

# cat 5words | wc -l
31812

# cat 9words | wc -l
65644
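
A sketch (not from the thread) of what that array-based rework could look like in bash, using mapfile and the same $RANDOM indexing shown earlier instead of shuf. It assumes bash 4+ and GNU sed; treat it as a starting point rather than a tested drop-in replacement.

Code:
#!/bin/bash
# read and capitalise each word list exactly once
mapfile -t words5  < 5words
mapfile -t words9  < <(sed 's/.*/\u&/' 9words)
mapfile -t words10 < <(sed 's/.*/\u&/' 10words)

n5=${#words5[@]}
n9=${#words9[@]}
n10=${#words10[@]}

for ((counter=1; counter<=5000000; counter++))
do
        # random indices; $RANDOM alone tops out at 32767, so combine two draws
        idx5=$(( (RANDOM * 32768 + RANDOM) % n5 ))
        idx9=$(( (RANDOM * 32768 + RANDOM) % n9 ))
        idx10=$(( (RANDOM * 32768 + RANDOM) % n10 ))

        ID="$counter${words5[idx5]}"
        firstName="${words9[idx9]}"
        lastName="${words10[idx10]}"
        managerID="X5X"
        fullName="$firstName $lastName"
        email="$firstName.$lastName@company.com"
        department="IT"
        region="Europe"
        location="Schiphol"
        inactivity="False"
        costcenter="Admin"

        echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"
done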
 
Old 01-15-2014, 01:23 AM   #6
rhoekstra
Member
 
Registered: Aug 2004
Location: The Netherlands
Distribution: RedHat 2, 3, 4, 5, Fedora, SuSE, Gentoo
Posts: 372

Rep: Reputation: 42
Sorry it didn't work out as expected; I had my doubts when I wrote that version. You could try a Perl equivalent and see how it performs:

Code:
#!/usr/bin/env perl
use strict;
use warnings;

my $word5=[];
my $word9=[];
my $word10=[];

# read a word list into the given array ref, optionally capitalising the first letter of each line
sub load {
        my $input=shift;
        my $file=shift;
        my $uc=shift;
        open IN, "<", "$file" or die("Problem: $!");
        while( my $line=<IN> ) {
                chomp( $line );
                if( defined( $uc ) ) {
                        $line =~ s/^(.)/\u$1/;
                }
                push(@$input, $line );
        }
        close IN;
        return $input;
}

load( $word5, "5words" );
load( $word9, "9words" , "uc");
load( $word10, "10words" , "uc");

my ( $ID, $firstName, $lastName, $managerID, $fullName, $email, $department, $region, $location, $inactivity, $costcenter );
$managerID="X5X";
$department="IT";
$region="Europe";
$location="Schiphol";
$inactivity="False";
$costcenter="Admin";

open( OUT, ">", "output.txt" ) or die("Problem: $!");
# five million rows, numbered 1..5000000 like the shell version
for( my $i=1; $i<=5000000; $i++ ) {
        my $w5=$word5->[int(rand(@$word5))];
        my $w9=$word9->[int(rand(@$word9))];
        my $w10=$word10->[int(rand(@$word10))];
        $ID="$i$w5";
        $firstName="$w9";
        $lastName="$w10";
        $fullName="$firstName $lastName";
        $email="$firstName.$lastName\@company.com";
        print OUT "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter\n";
}
close OUT;
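
One difference from the shell versions above: this writes straight to output.txt rather than to stdout, so no shell redirection is needed; save it under any name (say, makelist.pl) and run it with perl makelist.pl.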
 
  

