01-12-2014, 10:17 PM | #1
Member | Registered: Jan 2004 | Location: Austin, TEXAS | Distribution: CentOS 6.5 | Posts: 211
need help optimizing simple BASH script
I wrote this bash script to create a big text file, with a goal of five million lines.
On each pass it reads from three static files (5words, 9words, 10words), concatenates some string values, and writes the result to a file using redirection.
The problem is that I ran it overnight last night and it had only created 2,258,425 lines in the target file.
I'm afraid the reads and the writes are bottlenecking on my single conventional hard drive. Because of that, I don't want to try running it in parallel, since that would just brutalize the drive even more.
I'm thinking of mapping a filesystem in memory and putting both the source and target files on it. Any other ideas? Script is below.
di11rod
Code:
#!/bin/bash
# Note: the C-style for loop is a bashism, so the shebang should be bash rather than sh.
for ((counter=1; counter<=5000000; counter++))
do
    # pick one random word from each list; capitalise the first letter of the names
    ID=$counter$(shuf -n 1 5words)
    firstName=$(shuf -n 1 9words | sed 's/.*/\u&/')
    lastName=$(shuf -n 1 10words | sed 's/.*/\u&/')
    managerID="X5X"
    fullName="$firstName $lastName"
    email="$firstName.$lastName@company.com"
    department="IT"
    region="Europe"
    location="Schipol"
    inactivity="False"
    costcenter="Admin"
    echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"
done
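For reference, the in-memory filesystem idea I mentioned would be something along these lines (untested on my end; the mount point name is just an example):
Code:
# create a RAM-backed filesystem and copy the word lists onto it (run as root)
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
cp 5words 9words 10words /mnt/ramdisk/
# then run the script from /mnt/ramdisk so both the reads and the output file hit RAM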
01-13-2014, 02:46 AM | #2
Member | Registered: Aug 2004 | Location: The Netherlands | Distribution: RedHat 2, 3, 4, 5, Fedora, SuSE, Gentoo | Posts: 372
When I first read this post I thought the reads were the bottleneck, but I don't think that's true, since the files are cached after the first read. What is expensive is how they are handled: the files are re-read and the second and third fields re-capitalised on every single row, over and over again. That can be made more efficient quite easily.
I rewrote the script slightly to read the input files once and capitalise them once, then reuse them for every row, with shuf as the only external command needed to get the desired output.
This way, no file handles have to be opened on every iteration, which is easier on the processor.
Code:
#!/bin/bash
# read the input files into variables once, capitalising the name lists up front
words5=$(cat 5words)
words9=$(sed 's/.*/\u&/' 9words)
words10=$(sed 's/.*/\u&/' 10words)
for ((counter=1; counter<=5000000; counter++))
do
    # pick one random entry from each pre-loaded list
    ID=$counter$(echo "$words5" | shuf -n 1)
    firstName=$(echo "$words9" | shuf -n 1)
    lastName=$(echo "$words10" | shuf -n 1)
    managerID="X5X"
    fullName="$firstName $lastName"
    email="$firstName.$lastName@company.com"
    department="IT"
    region="Europe"
    location="Schipol"
    inactivity="False"
    costcenter="Admin"
    echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"
done
Last edited by rhoekstra; 01-13-2014 at 02:51 AM.
01-13-2014, 06:07 AM | #3
LQ Addict | Registered: Mar 2012 | Location: Hungary | Distribution: debian/ubuntu/suse ... | Posts: 24,674
I do not know how many lines 5words, 9words and 10words contain, but invoking shuf that many times (15,000,000: three calls per iteration over five million iterations) looks very suspicious to me. You can probably find a much cheaper way to get a random value.
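For example (just a sketch, not tested; it assumes bash 4 for mapfile and GNU sed, and combines two $RANDOM values because $RANDOM alone only goes up to 32767), the lists could be loaded into arrays once and indexed directly, so nothing has to fork inside the loop:
Code:
#!/bin/bash
# load each word list into an array once; capitalise the name lists up front
mapfile -t words5 < 5words
mapfile -t words9 < <(sed 's/.*/\u&/' 9words)
mapfile -t words10 < <(sed 's/.*/\u&/' 10words)
n5=${#words5[@]}; n9=${#words9[@]}; n10=${#words10[@]}

for ((counter=1; counter<=5000000; counter++))
do
    # pick random indexes; RANDOM*32768+RANDOM covers lists longer than 32767 lines
    ID=$counter${words5[ (RANDOM*32768+RANDOM) % n5 ]}
    firstName=${words9[ (RANDOM*32768+RANDOM) % n9 ]}
    lastName=${words10[ (RANDOM*32768+RANDOM) % n10 ]}
    echo "$ID,$firstName,$lastName,X5X,$firstName $lastName,$firstName.$lastName@company.com,IT,Europe,Schipol,False,Admin"
done
Since no external command is started inside the loop, this should be far cheaper than 15,000,000 shuf invocations.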
01-13-2014, 08:14 AM | #4
Member | Registered: Aug 2004 | Location: The Netherlands | Distribution: RedHat 2, 3, 4, 5, Fedora, SuSE, Gentoo | Posts: 372
Indeed, it would be better to do this in a language like Perl: read the file entries into an array once and randomly pick an element on each iteration.
01-14-2014, 06:25 PM | #5
Member | Registered: Jan 2004 | Location: Austin, TEXAS | Distribution: CentOS 6.5 | Posts: 211 | Original Poster
Thanks for the ideas on this!
I spent some time with it this evening and modified the script per rhoekstra's suggestion. I thought that was a much smarter design than my original draft. Surprisingly, though, putting the source word lists into variables and pushing them through shuf seems to run slower than reading them off the file system. I checked the environment and there is plenty of memory, CPU, and hard drive bandwidth available. It's just crazy.
Maybe I do need to avoid shuf and try to pick the words out of an array randomly. I'll rework it and post back.
Here's some performance data to consider (FiveMilList.csv is the target file of my script):
Code:
# date
Tue Jan 14 18:18:25 CST 2014
# cat FiveMilList.csv | wc -l
2261002
# date
Tue Jan 14 18:19:52 CST 2014
# cat FiveMilList.csv | wc -l
2261300
Pretty slow! That is roughly 300 new lines in about 90 seconds, only three or four lines per second; at that rate five million lines would take more than two weeks.
Code:
# cat 10words | wc -l
54529
# cat 5words | wc -l
31812
# cat 9words | wc -l
65644
01-15-2014, 01:23 AM | #6
Member | Registered: Aug 2004 | Location: The Netherlands | Distribution: RedHat 2, 3, 4, 5, Fedora, SuSE, Gentoo | Posts: 372
Sorry it didn't work out as expected; I did have my doubts when writing that version. You could try a Perl equivalent to see how that does:
Code:
#!/usr/bin/env perl
use strict;
use warnings;

my $word5  = [];
my $word9  = [];
my $word10 = [];

# read a word file into an array ref, optionally capitalising the first letter
sub load {
    my $input = shift;
    my $file  = shift;
    my $uc    = shift;
    open IN, "<", $file or die("Problem: $!");
    while ( my $line = <IN> ) {
        chomp($line);
        if ( defined($uc) ) {
            $line =~ s/^(.)/\u$1/;
        }
        push( @$input, $line );
    }
    close IN;
    return $input;
}

load( $word5,  "5words" );
load( $word9,  "9words",  "uc" );
load( $word10, "10words", "uc" );

my ( $ID, $firstName, $lastName, $managerID, $fullName, $email,
     $department, $region, $location, $inactivity, $costcenter );

# constant fields
$managerID  = "X5X";
$department = "IT";
$region     = "Europe";
$location   = "Schiphol";
$inactivity = "False";
$costcenter = "Admin";

open( OUT, ">", "output.txt" ) or die("Problem: $!");
for ( my $i = 1; $i <= 5000000; $i++ ) {
    # pick a random element from each preloaded list
    my $w5  = $word5->[ int( rand(@$word5) ) ];
    my $w9  = $word9->[ int( rand(@$word9) ) ];
    my $w10 = $word10->[ int( rand(@$word10) ) ];
    $ID        = "$i$w5";
    $firstName = "$w9";
    $lastName  = "$w10";
    $fullName  = "$firstName $lastName";
    $email     = "$firstName.$lastName\@company.com";
    print OUT "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter\n";
}
close OUT;