Linux - Software: This forum is for Software issues.
I wrote this bash script to create a big text file with a goal of five million lines.
It performs three reads from static files (5words, 9words, 10words), concatenates some string values, and writes the result to a file using redirection.
The problem is that I ran this overnight last night and it had only created 2258425 lines in the target file.
I'm afraid the reads and the writes are bottlenecking on my single conventional hard drive. Because of that, I don't want to try to run it in parallel, because then it would just brutalize my drive even more.
I'm thinking of mapping a filesystem in memory and putting both the source and target files on that. Any other ideas? Script is below.
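On the in-memory filesystem idea: most Linux systems already mount a tmpfs at /dev/shm, so you can try it without root or a custom mount. A minimal sketch, with a three-word sample file standing in for the real 5words list:

```shell
# /dev/shm is normally a tmpfs, so reads and writes here hit RAM, not disk
workdir=/dev/shm/gen$$            # $$ keeps the path unique per run
mkdir -p "$workdir"
printf 'alpha\nbravo\ncharlie\n' > "$workdir/5words"   # stand-in word list
picked=$(shuf -n 1 "$workdir/5words")
echo "$picked"
rm -rf "$workdir"
```

The same pattern extends to the target file: redirect the script's output to a path under /dev/shm and copy the finished CSV back to disk once at the end.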
di11rod
Code:
#!/bin/bash
# for ((...)) is a bashism, so the shebang must be bash, not sh
for ((counter=1; counter<=5000000; counter++))
do
ID=$counter`shuf -n 1 5words`
firstName=`shuf -n 1 9words| sed 's/.*/\u&/'`
lastName=`shuf -n 1 10words| sed 's/.*/\u&/'`
managerID="X5X"
fullName=$firstName" "$lastName
email=$firstName"."$lastName"@company.com"
department="IT"
region="Europe"
location="Schipol"
inactivity="False"
costcenter="Admin"
echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"
done
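One write-side detail worth checking, since the post says the output goes to a file via redirection: if the redirection appends inside the loop, the target file is opened and closed once per line. Redirecting the whole loop keeps a single file handle open for the full run. A minimal sketch (the temp file is a stand-in for the real FiveMilList.csv):

```shell
outfile=$(mktemp)                       # stand-in for the real target file
for ((counter=1; counter<=3; counter++)); do
    echo "row $counter"
done > "$outfile"                       # file opened once, not once per line
lines=$(wc -l < "$outfile")
echo "$lines"
rm -f "$outfile"
```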
I thought the reads were bottlenecking when I read this post, but I guess that isn't true, as the files being read are cached. The way the script treats them, however, is quite intensive: it reads the files again on every iteration and capitalises the second and third items on every row, over and over again. You could make that more efficient in a very easy way.
I rewrote the script slightly to read the input files once, capitalize them once, and then reuse them for every row, with shuf as the only external command needed to get the desired output.
This way, no file handles need to be reopened on every iteration, which is friendlier to the processor.
Code:
#!/bin/bash
# for ((...)) is a bashism, so the shebang must be bash, not sh
#input files to variables
words5=`cat 5words`
words9=`cat 9words | sed 's/.*/\u&/'`
words10=`cat 10words | sed 's/.*/\u&/'`
for ((counter=1; counter<=5000000; counter++))
do
ID=$counter`echo "$words5" | shuf -n 1`
firstName=`echo "$words9" | shuf -n 1`
lastName=`echo "$words10" | shuf -n 1`
managerID="X5X"
fullName="$firstName $lastName"
email="$firstName.$lastName@company.com"
department="IT"
region="Europe"
location="Schipol"
inactivity="False"
costcenter="Admin"
echo "$ID,$firstName,$lastName,$managerID,$fullName,$email,$department,$region,$location,$inactivity,$costcenter"
done
I do not really know the number of lines in 5words, 9words and 10words, but invoking shuf that many times (15,000,000) looks very inefficient to me. You can probably find a much cheaper way to get a random value.
I spent some time with it this evening. I modified the script per Rhoekstra's suggestion; I thought that was a much smarter design than my original draft. Surprisingly, though, putting the source word lists into variables and pushing those lists through shuf seems to run slower than reading them off the file system. I checked the environment and there is plenty of memory, CPU, and hard drive bandwidth. It's just crazy.
Maybe I do need to avoid shuf and try to pick them out of an array randomly. I'll rework it and post back.
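Picking from a bash array indexed with $RANDOM avoids forking shuf for every field. A minimal sketch of that idea, assuming the lists fit comfortably in memory (the sample file stands in for the real 9words; note that $RANDOM tops out at 32767, which is fine for word lists shorter than that):

```shell
tmpdir=$(mktemp -d)
printf 'adam\neve\njo\n' > "$tmpdir/9words"     # stand-in word list
mapfile -t words9 < "$tmpdir/9words"            # read the file exactly once
n9=${#words9[@]}
rows=""
for ((counter=1; counter<=5; counter++)); do
    firstName=${words9[RANDOM % n9]}            # no subprocess per row
    rows="$rows$counter,$firstName"$'\n'
done
printf '%s' "$rows"
rm -rf "$tmpdir"
```

Capitalisation could still be done once up front, with the same sed call Rhoekstra uses, before mapfile reads the list.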
Here's some performance stuff to consider:
(FiveMilList.csv is the target file of my script; the two line counts below, taken 87 seconds apart, work out to roughly 3-4 lines written per second)
Code:
# date
Tue Jan 14 18:18:25 CST 2014
# cat FiveMilList.csv | wc -l
2261002
# date
Tue Jan 14 18:19:52 CST 2014
# cat FiveMilList.csv | wc -l
2261300