How to find unique characters within each column in a txt file in Linux?
Sure? No. I'm not sure of anything except death and taxes.
I wrote the code on my computer and tested it using sample data posted by you. It worked here. I gave you the entire program. I don't know why it fails on your computer.
For the third time: Seek help from a knowledgeable person at your location.
Daniel B. Martin
Thank you for devoting your time, Daniel! You answered me more than 10 times! Thank you very much. If people in my department were as nice as you, I would not have to come to this site.
Best,
Zahra
My input file is hap.txt and my output file is uniqhap.txt. I changed the file names in the code accordingly, but I get this error:
skarimi@signal[19:20][~]$ cd mkhap
skarimi@signal[19:26][~/mkhap]$ awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt \ |awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt
awk: cmd. line:1: fatal: cannot open file ` ' for reading (No such file or directory)
skarimi@signal[19:27][~/mkhap]$
Should I add a program to my file? Is this the problem? Can you please guide me?
It's because of the \; just remove it. It was needed in danielbmartin's original code because he put a newline after it.
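For reference, here is the same pipeline with the stray backslash removed, split at the pipe instead (ending a line with | is safe in the shell):
Code:
awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt |
awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt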
I have to say, I would be curious how long that would take to run with your original file, it being so large. You will want to have a good amount of memory.
What is the size of your file? If the numbers from your first post are real, then it is probably about 30 gigabytes or so, right? In this case the problem is not sorting the columns (which is easy by itself), but doing it in a reasonable amount of time.

You already know how to process a single column -- just do sort|uniq or sort -u, maybe with the -n flag to sort numbers correctly. Next you need a way to extract a particular column from the input file. In my experiments, cut -d ' ' -f N is much faster for this than awk/perl etc.

Okay, now we can extract any column and process it as needed. Because the columns are large and there are lots of them, it is better to store them in temporary files. The last step is to join these sorted columns side by side into a single file -- this can be done with the paste utility. Because there are lots of columns, you cannot just do paste col-* -- you'll get a "Too many arguments" error. So we do this in two steps -- joining, say, 1000 columns at a time into temporary files, and then joining those temporary files into the final one.

The last thing is that we definitely want to utilize multiple processors/cores for all these steps. In the following script I use the GNU parallel utility for this (dunno how it works in Win8 though...)
Code:
#!/bin/bash
fn="$1"
outfn='sorted.dat'
cols=$(head -n1 "$fn" | wc -w)
echo $fn $cols columns
# split input file into columns, sort (numerically) and uniq each column
# and save to files col-0001 etc.
seq -w $cols | parallel --progress "cut -d ' ' -f {1} $fn | sort -nu > col-{1}"
# join all columns (side-by-side) into a single file in two steps:
# first, create few files with 1000 columns each, then join
# these files in order.
ls col-* | parallel --progress -N1000 -k --files paste | parallel --xargs paste {} ';' rm {} > $outfn
# remove column files
rm col-*
Run it in a folder with enough free space (at least the size of the input file), as follows:
Code:
/tmp$ time ./sort-columns.sh /tmp/big.dat
/tmp/big.dat 5000 columns
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/5000/100%/0.0s
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/5/100%/0.0s
real 3m47.886s
user 12m40.702s
sys 1m10.267s
I created a sample input file with 5000 columns and 1000 rows using the following command:
Code:
$ seq 5e6 | parallel -N5000 echo > /tmp/big.dat
Now the bad news.
As you can see, this file (38M) is processed in about 4 minutes on my 4-core laptop with 10G of RAM. Your input file is about 1000 times larger, so on my machine the execution time would scale to roughly 1000 x 4 min, i.e. about 70 hours, or 3 days.
1. Creating 50k files (i.e. the number of columns in the current file, according to the OP) and then having the system sort through all of them would increase the times significantly (I think).
2. Your option to sort the files/columns would not help the OP, as he wants the data to remain in place if it is to be kept, so anything currently in row 1 would need to remain in that row
(of course row 1 will always stay exactly as it is, since all points are unique at this level). A commented sketch of the order-preserving approach follows below.
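For what it's worth, danielbmartin's first awk stage posted earlier already keeps each column's values in their original row order. Here is the same idea written out with comments, as a sketch (uniq_cols.awk is a hypothetical file name):
Code:
# uniq_cols.awk -- keep the first occurrence of each value in every column,
# preserving encounter order within the column.
{
    for (j = 1; j <= NF; j++)
        if (!seen[j SUBSEP $j]++)        # first time this value shows up in column j?
            col[j] = col[j] " " $j       # append it, keeping the original order
}
END {
    for (j = 1; j <= NF; j++)            # one output line per column
        print col[j]
}
Run it as awk -f uniq_cols.awk hap.txt; each output line then holds one column's unique values in their original order (the transpose step shown earlier turns these lines back into columns).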
Okay, yes, you were right. But after running it, I checked the result, and it was not what it should be: the number of columns was less than in the original data, and they were relocated as well! :(
Okay, I work in Linux, and Linux does not recognize this command: parallel. What should I do if I want to keep working in Linux? Look:
skarimi@signal[11:42][~/mkhap]$ seq -w $cols | parallel --progress "cut -d ' ' -f {1} $hap.txt | sort -nu > col-{1}"
bash: parallel: command not found...
skarimi@signal[11:42][~/mkhap]$ man parallel
No manual entry for parallel
You've asked for a Linux command to process your data file, and several suggestions have been provided, but a more general solution (in my opinion) would be to drop your data into a database system and extract what you want from that database when you want it.
For example, if your data were in an SQL database, a "SELECT DISTINCT COL_NUMBER FROM DATA_TABLE ORDER BY INPUT_ORDER;" might be all you need.
Perhaps we could be more helpful if you explained why you want these columns of unique values. There may be better solutions to that problem.
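A minimal sketch of that idea with sqlite3, assuming a hypothetical table haps with a single column c1 loaded from a one-column file hap1.csv (none of these names come from the thread):
Code:
#!/bin/bash
# Sketch only: load one column of the data into SQLite, then pull out
# the distinct values. "haps", "c1" and hap1.csv are made-up names.
sqlite3 hap.db <<'SQL'
CREATE TABLE IF NOT EXISTS haps (c1 TEXT);
.mode csv
.import hap1.csv haps
SELECT DISTINCT c1 FROM haps;
SQL
Once the data is loaded, each per-column query reads the database rather than re-scanning the whole multi-gigabyte text file.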
Quote: (the parallel-based script post above, quoted here in full)
My original data size is 5 GB (I was wrong: I have 10,000 columns, not 50,000). And I had a question about your code. The first line of your code is
fn="$1", where I put my original file name: hap.txt="1"
But I got my first error here:
skarimi@signal[12:39][~/mkhap]$ hap.txt="1"
bash: hap.txt=1: command not found...
skarimi@signal[12:42][~/mkhap]$
So, I am wondering, what am I doing wrong? Meanwhile, the Linux I am working on does not recognize parallel; it is not even in its manual pages. Why is that?
You do not need to change anything in the script to try it. Just save it in some file, say sort-columns.sh, then make it executable with chmod +x sort-columns.sh and execute it as ./sort-columns.sh hap.txt (note the ./ at the beginning).
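To illustrate the point about "$1" (demo.sh is just a made-up example name):
Code:
#!/bin/bash
# demo.sh -- "$1" is the first command-line argument; it is filled in
# automatically when you run the script, so you never edit it by hand.
fn="$1"
echo "input file: $fn"
Running ./demo.sh hap.txt prints input file: hap.txt -- which is exactly how fn="$1" picks up hap.txt in the sorting script.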
What kind of Linux? On Debian/Ubuntu you install it with sudo apt-get install parallel. Or download the parallel package from the link I provided in my previous post and install it with the package manager of your OS, e.g. sudo dpkg -i parallel*.deb on Debian.
@grail: 1. We can store these 50K files in a directory hierarchy, say 50 directories with 1000 files in each.
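A rough sketch of that layout, assuming the col-* files produced by the script above (the d01, d02, ... directory names and the loop itself are illustrative):
Code:
# Sketch: spread the col-* files over subdirectories, 1000 files apiece,
# so that no single directory ends up holding all 50K files.
i=0
for f in col-*; do
    d=$(printf 'd%02d' $((i / 1000 + 1)))   # d01, d02, ... one per 1000 files
    mkdir -p "$d"
    mv "$f" "$d/"
    i=$((i + 1))
done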