LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
Old 09-08-2015, 07:38 PM   #31
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660

Quote:
Originally Posted by Rozak View Post
are you sure nothin is wrong with the command?
Sure? No. I'm not sure of anything except death and taxes.

I wrote the code on my computer and tested it using sample data posted by you. It worked here. I gave you the entire program. I don't know why it fails on your computer.

For the third time: Seek help from a knowledgeable person at your location.

Daniel B. Martin
 
Old 09-08-2015, 07:56 PM   #32
Rozak
LQ Newbie
 
Registered: Sep 2015
Posts: 23

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin View Post
Sure? No. I'm not sure of anything except death and taxes.

I wrote the code on my computer and tested it using sample data posted by you. It worked here. I gave you the entire program. I don't know why it fails on your computer.

For the third time: Seek help from a knowledgeable person at your location.

Daniel B. Martin
Thank you for devoting your time, Daniel! You answered me more than 10 times! Thank you very much. If people in my department were as nice as you, I would not have to come to this site.
Best,
Zahra
 
Old 09-09-2015, 08:41 AM   #33
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Rep: Reputation: 2081
Quote:
Originally Posted by Rozak View Post
My input file is hap.txt and my output file is uniqhap.txt I changed them to these but I get this error:
skarimi@signal[19:20][~]$ cd mkhap
skarimi@signal[19:26][~/mkhap]$ awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt \ |awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt
awk: cmd. line:1: fatal: cannot open file ` ' for reading (No such file or directory)
skarimi@signal[19:27][~/mkhap]$
should I add a program to my file? is this the problem? can you please guide me?
It's because of the \, just remove it. It's needed in danielbmartin's original code because he put a newline after it.
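For reference, here is a minimal sketch of the corrected one-line pipeline with the stray backslash removed, run on tiny made-up sample data (the real hap.txt is not shown in the thread):

```shell
# tiny made-up stand-in for hap.txt: 3 rows, 2 space-separated columns
printf '1 2\n1 3\n2 2\n' > hap.txt

# same two-stage pipeline, backslash removed: the first awk collects the
# unique values of each column (one output line per column), the second
# awk transposes those lines back into columns
awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt |
awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt

cat uniqhap.txt
```

Note each output field is prefixed with a space because the awk code concatenates with `b[j]" "$j`; that matches the original program's behavior.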
 
2 members found this post helpful.
Old 09-09-2015, 09:19 AM   #34
Rozak
LQ Newbie
 
Registered: Sep 2015
Posts: 23

Original Poster
Rep: Reputation: Disabled

Quote:
Originally Posted by ntubski View Post
It's because of the \, just remove it. It's needed in danielbmartin's original code because he put a newline after it.
I did this before; it did not help solve the problem.
 
Old 09-09-2015, 09:32 AM   #35
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
I have to say, I would be curious how long that would take to run, given how large your original file is. You will want to have a good amount of memory.
 
Old 09-09-2015, 10:26 AM   #36
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

What is the size of your file? If the numbers from your first post are real, it is probably about 30 gigabytes or so, right? In that case the problem is not sorting the columns (which is easy by itself), but doing it in a reasonable amount of time.

You already know how to process a single column -- just do sort|uniq or sort -u, maybe with the -n flag to sort numbers correctly. Now you need a way to extract a particular column from the input file. In my experiments, cut -d ' ' -f N is much faster for this than awk/perl etc.

Okay, now we can extract any column and process it as needed. Because the columns are large and there are lots of them, it is better to store them in temporary files. The last step is to join these sorted columns into a single file, side by side -- this can be done with the paste utility. Because there are lots of columns you cannot just do paste col-* -- you'll get a "too many arguments" error. So we do this in two steps -- joining, say, 1000 columns at a time into temporary files, and then joining those temporary files into the final one.

The last thing is that we definitely want to utilize multiple processors/cores for all these steps. In the following script I use the GNU parallel utility for this (dunno how it works in win8, though...)

Code:
#!/bin/bash

fn="$1"
outfn='sorted.dat'

cols=$(head -n1 "$fn" | wc -w)
echo $fn $cols columns

# split input file into columns, sort (numerically) and uniq each column
# and save to files col-0001 etc.
seq -w $cols | parallel --progress "cut -d ' ' -f {1} $fn | sort -nu > col-{1}"

# join all columns (side-by-side) into a single file in two steps:
# first, create few files with 1000 columns each, then join
# these files in order.
ls col-* | parallel --progress -N1000 -k --files paste | parallel --xargs paste {} ';' rm {} > $outfn

# remove column files
rm col-*
Run it in a folder with enough free space (at least the size of the input file), as follows:
Code:
/tmp$ time ./sort-colums.sh /tmp/big.dat 
/tmp/big.dat 5000 columns

Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/5000/100%/0.0s 

Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/5/100%/0.0s 

real	3m47.886s
user	12m40.702s
sys	1m10.267s
I created a sample input file with 5000 columns and 1000 rows using the following command:
Code:
$ seq 5e6 | parallel -N5000 echo > /tmp/big.dat
Now the bad news.
As you can see, this file (38M) is processed in about 4 minutes on my 4-core laptop with 10G of RAM. Your input file is about 1000 times larger, so execution time on my machine would be roughly 70 hours, or 3 days.
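To make the two-step paste described above concrete at toy scale (all file names here are made up): join small groups of the per-column files first, then join the group files.

```shell
# six tiny column files standing in for the thousands of col-* files
for j in 1 2 3 4 5 6; do printf '%s\n' "$j" "$((j*10))" > "col-$j"; done

# step 1: paste columns in groups (here 3 per group instead of 1000)
paste col-1 col-2 col-3 > group-1
paste col-4 col-5 col-6 > group-2

# step 2: paste the group files into the final side-by-side result
paste group-1 group-2 > joined.dat
rm col-*
```

This avoids ever handing paste more arguments than the system allows, which is the point of the two-step join in the script above.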

Last edited by firstfire; 09-09-2015 at 10:28 AM.
 
Old 09-09-2015, 10:45 AM   #37
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
Nice idea, firstfire, but my concerns would be:

1. Creating 50k files (i.e. the number of cols in the current file, according to the OP). Having the system sort through all of these would increase the times significantly (I think).

2. Your option to sort the files/columns would not help the OP, as he wants the data to remain in place if it is to be kept, so anything currently in row 1 would need to remain in that row.
(Of course row 1 will always stay exactly as-is, since all points are unique at this level.)
 
Old 09-09-2015, 10:50 AM   #38
Rozak
LQ Newbie
 
Registered: Sep 2015
Posts: 23

Original Poster
Rep: Reputation: Disabled
Okay, yes, you were right. But after running it, I checked the result, and it was not what it should be: the number of columns was less than in the original data, and they were relocated as well! :(
 
Old 09-09-2015, 10:57 AM   #39
Rozak
LQ Newbie
 
Registered: Sep 2015
Posts: 23

Original Poster
Rep: Reputation: Disabled
Okay, I work in Linux, but Linux does not recognize this command: parallel. What should I do if I want to keep working in Linux? Look:
skarimi@signal[11:42][~/mkhap]$ seq -w $cols | parallel --progress "cut -d ' ' -
f {1} $hap.txt | sort -nu > col-{1}"
bash: parallel: command not found...
skarimi@signal[11:42][~/mkhap]$ man parallel
No manual entry for parallel
 
Old 09-09-2015, 11:43 AM   #40
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
You've asked for a Linux command to process your data file, and several suggestions have been provided, but a more general solution (in my opinion) would be to drop your data into a database system and extract what you want from it when you want it.

For example, if your file were in a SQL database, a "SELECT DISTINCT COL_NUMBER FROM DATA_TABLE ORDER BY INPUT_ORDER;" might be all you need.

Perhaps we could be more helpful if you explained why you want these columns of unique values. There may be better solutions to that problem.
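As a hypothetical sketch of the database idea above using SQLite (the table name, column names, and sample data are all invented; a real loader would generate the CREATE TABLE statement from the file's column count):

```shell
# made-up 2-column sample standing in for the real space-separated file
printf '1 2\n1 3\n2 2\n' > hap.txt

# create a fresh database, define the table, and bulk-import the file
rm -f geno.db
sqlite3 geno.db <<'EOF'
CREATE TABLE data_table (c1 INTEGER, c2 INTEGER);
.separator " "
.import hap.txt data_table
EOF

# distinct values of one column, in numeric order
sqlite3 geno.db 'SELECT DISTINCT c1 FROM data_table ORDER BY c1;'
```

Once the data is loaded, each "unique column" query is a one-liner, and an index per column would make repeated queries fast.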
 
Old 09-09-2015, 11:44 AM   #41
Rozak
LQ Newbie
 
Registered: Sep 2015
Posts: 23

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by firstfire View Post
<snip>
My original data size is 5 GB (I was wrong: I have 10,000 columns, not 50,000). And I had a question about your code. The first line of your script is:
fn="$1", where I put my original file name: hap.txt="1"
but I got my first error here:
skarimi@signal[12:39][~/mkhap]$ hap.txt="1"
bash: hap.txt=1: command not found...
skarimi@signal[12:42][~/mkhap]$

So, I am wondering what I am doing wrong? Meanwhile, the Linux I am working in does not recognize parallel; it is not even in its manual. Why is that?
 
Old 09-09-2015, 01:33 PM   #42
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
You do not need to change anything in the script to try it. Just save it in some file, say sort-columns.sh, make it executable with chmod +x sort-columns.sh, and execute it as ./sort-columns.sh hap.txt (note the ./ at the beginning).

What kind of Linux? On Debian/Ubuntu you install it with sudo apt-get install parallel. Or download the parallel package from the link I provided in my previous post and install it with the package manager of your OS, e.g. sudo dpkg -i parallel*.deb on Debian.
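Until parallel is installed, a plain sequential loop does the same per-column work, just one column at a time (slower). A minimal sketch, on made-up sample data:

```shell
#!/bin/bash
# fallback without GNU parallel: split, sort -nu, and paste sequentially
printf '1 2\n1 3\n2 2\n' > hap.txt   # tiny stand-in for the real file
fn=hap.txt

cols=$(head -n1 "$fn" | wc -w)
for j in $(seq -w "$cols"); do
    # $((10#$j)) strips the zero-padding that seq -w adds, so cut gets
    # a plain field number while the col-* files stay zero-padded for ls
    cut -d ' ' -f "$((10#$j))" "$fn" | sort -nu > "col-$j"
done
paste col-* > sorted.dat
rm col-*
```

On a file this small the whole thing is instant; on the real 5 GB file it would be several times slower than the parallel version, but it needs nothing beyond coreutils.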

@grail: 1. We can store these 50K files in a directory hierarchy, say 50 directories with 1000 files in each:
Code:
#!/bin/bash

fn="$1"
outfn='sorted.dat'

process_column='sort -nu' # or any other script

cols=$(head -n1 $fn | wc -w)
echo $fn $cols columns

seq -w $cols | parallel -N 1000 printf -vd 'block-%05d' {#} ';' mkdir \$d ';' printf '%s\\n' \$d/{} | \
 	parallel --eta "cut -d ' ' -f {/} $fn | $process_column > {}"

ls -d block-* | parallel --progress -k --files 'paste {}/*' | parallel --xargs paste {} ';' rm {} > $outfn

rm -rf block-*
2. Maybe
 
Old 09-09-2015, 02:39 PM   #43
Rinndalir
Member
 
Registered: Sep 2015
Posts: 733

Rep: Reputation: Disabled
Is the original data online? Or a portion of it? 10,000 columns and 117,000 rows is 6-7GB.
 
Old 09-09-2015, 08:16 PM   #44
Rozak
LQ Newbie
 
Registered: Sep 2015
Posts: 23

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Rinndalir View Post
Is the original data online? Or a portion of it? 10,000 columns and 117,000 rows is 6-7GB.
No, it is not online. It is genotype data of dairy cattle; it is part of my thesis. If you want to see it, I can send you a part of it.
 
Old 09-09-2015, 10:54 PM   #45
Rinndalir
Member
 
Registered: Sep 2015
Posts: 733

Rep: Reputation: Disabled
Quote:
Originally Posted by Rozak View Post
no it is not online it is a genotype data of dairy cattle. it is part of my thesis. if you want to see I can send a part of that for you.
I only wanted it to test the timing of a Python program I wrote for this data. No big deal.
 
  

