How to find unique characters within each column in a txt file in Linux?
Sure? No. I'm not sure of anything except death and taxes.
I wrote the code on my computer and tested it using sample data posted by you. It worked here. I gave you the entire program. I don't know why it fails on your computer.
For the third time: Seek help from a knowledgeable person at your location.
Daniel B. Martin
Thank you for devoting your time, Daniel! You answered me more than 10 times! Thank you very much. If people in my department were as nice as you, I would not have to come to this site.
Best,
Zahra
My input file is hap.txt and my output file is uniqhap.txt. I changed the file names in the code accordingly, but I get this error:
skarimi@signal[19:20][~]$ cd mkhap
skarimi@signal[19:26][~/mkhap]$ awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt \ |awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt
awk: cmd. line:1: fatal: cannot open file ` ' for reading (No such file or directory)
skarimi@signal[19:27][~/mkhap]$
Should I add a program to my file? Is this the problem? Can you please guide me?
It's because of the \; just remove it. It was needed in danielbmartin's original code because he put a newline after it.
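For reference, here is the same pipeline with the stray backslash removed, split at the pipe instead (ending a line with | is safe in the shell):
Code:
awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt |
awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt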
I have to say, I would be curious how long that would take to run with your original file, it being so large. You will want to have a good amount of memory.
What is the size of your file? If the numbers from your first post are real, then it is probably about 30 gigabytes or so, right? In this case the problem is not sorting the columns (which is easy by itself), but doing it in a reasonable amount of time.

You already know how to process a single column -- just do sort|uniq or sort -u, maybe with the -n flag to sort numbers correctly. Next you need a way to extract a particular column from the input file. In my experiments, cut -d ' ' -f N is much faster for this than awk/perl etc.

Okay, now we can extract any column and process it as needed. Because the columns are large and there are lots of them, it is better to store them in temporary files. The last step is to join these sorted columns side by side into a single file -- this can be done with the paste utility. Because there are lots of columns, you cannot just do paste col-* -- you'll get a "Too many arguments" error. So we do this in two steps -- joining, say, 1000 columns at a time into temporary files, and then joining those temporary files into the final one.

The last thing is that we definitely want to utilize multiple processors/cores for all these steps. In the following script I use the GNU parallel utility for this (dunno how it works in Win8 though...)
Code:
#!/bin/bash
fn="$1"
outfn='sorted.dat'
cols=$(head -n1 "$fn" | wc -w)
echo $fn $cols columns
# split input file into columns, sort (numerically) and uniq each column
# and save to files col-0001 etc.
seq -w $cols | parallel --progress "cut -d ' ' -f {1} $fn | sort -nu > col-{1}"
# join all columns (side-by-side) into a single file in two steps:
# first, create few files with 1000 columns each, then join
# these files in order.
ls col-* | parallel --progress -N1000 -k --files paste | parallel --xargs paste {} ';' rm {} > $outfn
# remove column files
rm col-*
Run it in a folder with enough free space (at least the size of the input file), as follows:
Code:
/tmp$ time ./sort-columns.sh /tmp/big.dat
/tmp/big.dat 5000 columns
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/5000/100%/0.0s
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/5/100%/0.0s
real 3m47.886s
user 12m40.702s
sys 1m10.267s
I created a sample input file with 5000 columns and 1000 rows using the following command:
Code:
$ seq 5e6 | parallel -N5000 echo > /tmp/big.dat
Now the bad news.
As you can see, this file (38M) is processed in about 4 minutes on my 4-core laptop with 10G of RAM. Your input file is about 1000 times larger, so on my machine the execution time would scale to roughly 1000 x 4 min, i.e. about 70 hours, or 3 days.
1. Creating 50k files (i.e. the number of columns in the current file, according to the OP) and then having the system sort through all of them would increase the times significantly (I think).
2. Your option to sort the files/columns would not help the OP, as he wants the data to remain in place if it is to be kept, so anything currently in row 1 would need to remain in that row
(of course row 1 will always stay exactly as it is, since all points are unique at this level). A commented sketch of the order-preserving approach follows below.
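For what it's worth, danielbmartin's first awk stage posted earlier already keeps each column's values in their original row order. Here is the same idea written out with comments, as a sketch (uniq_cols.awk is a hypothetical file name):
Code:
# uniq_cols.awk -- keep the first occurrence of each value in every column,
# preserving encounter order within the column.
{
    for (j = 1; j <= NF; j++)
        if (!seen[j SUBSEP $j]++)        # first time this value shows up in column j?
            col[j] = col[j] " " $j       # append it, keeping the original order
}
END {
    for (j = 1; j <= NF; j++)            # one output line per column
        print col[j]
}
Run it as awk -f uniq_cols.awk hap.txt; each output line then holds one column's unique values in their original order (the transpose step shown earlier turns these lines back into columns).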
Okay, yes, you were right. But after running it, I checked the result, and it was not what it should be: the number of columns was less than in the original data, and they were relocated as well! :(
Okay, I work in Linux, and Linux does not recognize this command: parallel. What should I do if I want to keep working in Linux? Look:
skarimi@signal[11:42][~/mkhap]$ seq -w $cols | parallel --progress "cut -d ' ' -f {1} $hap.txt | sort -nu > col-{1}"
bash: parallel: command not found...
skarimi@signal[11:42][~/mkhap]$ man parallel
No manual entry for parallel
You've asked for a Linux command to process your data file, and several suggestions have been provided, but a more general solution (in my opinion) would be to drop your data into a database system and extract what you want from that database when you want it.
For example, if your data were in an SQL database, a "SELECT DISTINCT COL_NUMBER FROM DATA_TABLE ORDER BY INPUT_ORDER;" might be all you need.
Perhaps we could be more helpful if you explained why you want these columns of unique values. There may be better solutions to that problem.
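A minimal sketch of that idea with sqlite3, assuming a hypothetical table haps with a single column c1 loaded from a one-column file hap1.csv (none of these names come from the thread):
Code:
#!/bin/bash
# Sketch only: load one column of the data into SQLite, then pull out
# the distinct values. "haps", "c1" and hap1.csv are made-up names.
sqlite3 hap.db <<'SQL'
CREATE TABLE IF NOT EXISTS haps (c1 TEXT);
.mode csv
.import hap1.csv haps
SELECT DISTINCT c1 FROM haps;
SQL
Once the data is loaded, each per-column query reads the database rather than re-scanning the whole multi-gigabyte text file.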
Quote: (the parallel-based script post above, quoted here in full)
My original data size is 5 GB (I was wrong: I have 10,000 columns, not 50,000). And I had a question about your code. The first line of your code is
fn="$1", where I put my original file name: hap.txt="1"
But I got my first error here:
skarimi@signal[12:39][~/mkhap]$ hap.txt="1"
bash: hap.txt=1: command not found...
skarimi@signal[12:42][~/mkhap]$
So, I am wondering, what am I doing wrong? Meanwhile, the Linux I am working on does not recognize parallel; it is not even in its manual pages. Why is that?
You do not need to change anything in the script to try it. Just save it in some file, say sort-columns.sh, then make it executable with chmod +x sort-columns.sh and execute it as ./sort-columns.sh hap.txt (note the ./ at the beginning).
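To illustrate the point about "$1" (demo.sh is just a made-up example name):
Code:
#!/bin/bash
# demo.sh -- "$1" is the first command-line argument; it is filled in
# automatically when you run the script, so you never edit it by hand.
fn="$1"
echo "input file: $fn"
Running ./demo.sh hap.txt prints input file: hap.txt -- which is exactly how fn="$1" picks up hap.txt in the sorting script.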
What kind of Linux? On Debian/Ubuntu you install it with sudo apt-get install parallel. Or download the parallel package from the link I provided in my previous post and install it with the package manager of your OS, e.g. sudo dpkg -i parallel*.deb on Debian.
@grail: 1. We can store these 50K files in a directory hierarchy, say 50 directories with 1000 files in each.
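A rough sketch of that layout, assuming the col-* files produced by the script above (the d01, d02, ... directory names and the loop itself are illustrative):
Code:
# Sketch: spread the col-* files over subdirectories, 1000 files apiece,
# so that no single directory ends up holding all 50K files.
i=0
for f in col-*; do
    d=$(printf 'd%02d' $((i / 1000 + 1)))   # d01, d02, ... one per 1000 files
    mkdir -p "$d"
    mv "$f" "$d/"
    i=$((i + 1))
done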