07-05-2012, 09:19 PM
|
#1
|
Member
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81
Rep:
|
awk split file into variable number of files
Hi,
I have a file like this
Code:
link1 chr1 57505384
link2 chr1 137313720
link3 chr2 26351096
link4 chr2 87616977
link5 chr4 153210661
link6 chr5 76008854
link7 chr7 81543608
link8 chr8 93738351
link9 chr9 14298503
link10 chr9 67889613
link11 chr9 67889613
link12 chr9 67889613
link13 chr9 67889613
link14 chr9 67889613
link15 chr15 49329448
link16 chr15 61920423
link17 chr15 62182866
link18 chr15 62334341
I need to split the file into X files, where X is the number of unique values in $2 (8 in this example). I wrote a script that does this with 8 versions of the following command, but I would like a simpler solution if possible, since I will have many files with different names to process:
Code:
awk '$2 ~ /chr1\y/{print $0 > "chr1"}' input_file
How can I write this as one line, so that the search value changes automatically and, at each change in $2, the matching records are written to a file named after that value (which will always be chr[whatever], where whatever can be 1-19, X, Y, or M)? I hope my question is clear. Thanks for any help you can offer.
|
|
|
07-05-2012, 10:24 PM
|
#2
|
Member
Registered: Jun 2012
Location: Porto Alegre-Brazil
Distribution: Slackware- 14, Debian Wheezy, Ubuntu Studio, Tails
Posts: 88
Rep:
|
Hello
Could you give an example of the final result you want?
Cheers
|
|
|
07-06-2012, 03:50 AM
|
#3
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028
|
My first question back would be: Is it guaranteed that the file is sorted by $2?
|
|
|
07-06-2012, 05:13 AM
|
#4
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 23,274
|
Why don't you write:
Code:
awk ' { print $0 > $2 } ' input_file
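A quick way to try this on the sample data, as a sketch only: the scratch directory and the shortened input are there purely for illustration. The parentheses around the target are not in the one-liner above. They are harmless here, and they become important once a file name is built by concatenation, because an unparenthesized concatenation after > is not handled the same way by every awk.

```shell
# Work in a scratch directory so the output files do not clutter anything.
cd "$(mktemp -d)" || exit 1

# A shortened copy of the sample data from the first post.
cat > input_file <<'EOF'
link1 chr1 57505384
link2 chr1 137313720
link3 chr2 26351096
link9 chr9 14298503
link15 chr15 49329448
EOF

# Each line goes to a file named after its second field.
awk '{ print > ($2) }' input_file

# List the files that were created and show one of them.
ls
cat chr1
```

Note that > truncates a file the first time awk opens it during a run, so running the command again replaces the old output rather than adding to it.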
|
|
|
07-06-2012, 06:11 AM
|
#5
|
Member
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81
Original Poster
Rep:
|
Sorry, I thought my example was clear. The result would be 8 files, with names determined by the value in $2 ("chr1" has two lines of data, "chr9" has six lines of data, etc., but the line counts could just as easily be 0 as 1e6):
file "chr1":
Code:
link1 chr1 57505384
link2 chr1 137313720
file "chr2":
Code:
link3 chr2 26351096
link4 chr2 87616977
file "chr4":
Code:
link5 chr4 153210661
file "chr5":
Code:
link6 chr5 76008854
file "chr7":
Code:
link7 chr7 81543608
file "chr8":
Code:
link8 chr8 93738351
file "chr9":
Code:
link9 chr9 14298503
link10 chr9 67889613
link11 chr9 67889613
link12 chr9 67889613
link13 chr9 67889613
link14 chr9 67889613
and file "chr15":
Code:
link15 chr15 49329448
link16 chr15 61920423
link17 chr15 62182866
link18 chr15 62334341
@pan64, that just creates an empty file.
@grail, yes, they will be sorted on $2 (sort -Vk2). Would it be possible to use an unsorted file? I'm not planning to, since one of the first steps in the script I have so far is to sort, but it might be a useful thing to know.
Thanks!
Last edited by captainentropy; 07-06-2012 at 03:28 PM.
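On the unsorted question: as far as I can tell, splitting on $2 directly does not depend on the order of the lines, since every line is simply sent to the file named by its own second field. Below is a minimal sketch, assuming a made-up unsorted input. It also closes each file right after writing, which is an addition of mine for the case where $2 takes so many distinct values that an awk implementation might run out of open files. Closing means the file has to be reopened with >> (append), so old output must be removed before a re-run.

```shell
cd "$(mktemp -d)" || exit 1

# Deliberately unsorted input (made-up order, same kind of records).
cat > unsorted_file <<'EOF'
link9 chr9 14298503
link1 chr1 57505384
link10 chr9 67889613
link2 chr1 137313720
EOF

# Append to the file named by $2, then close it straight away so that only
# one output file is open at any time.
awk '{ print >> ($2); close($2) }' unsorted_file

cat chr9
```

With only a couple of dozen possible chromosome names the close() call is probably unnecessary, and the plain > form without it should do.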
|
|
|
07-06-2012, 09:13 AM
|
#6
|
Member
Registered: Feb 2009
Posts: 347
Rep:
|
See if this meets your requirement:
Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} { print $0 >> ("file" i) }}' data
The new files are named file1, file2, etc. (assuming the file is sorted on column 2).
Last edited by bsat; 07-06-2012 at 09:29 AM.
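The same idea can be written without the array. This is only a sketch: the names n, prev and out are my own, and the input still has to be sorted (or at least grouped) on column 2. A counter is bumped each time $2 differs from the previous line, and the previous output file is closed before the next one is started.

```shell
cd "$(mktemp -d)" || exit 1

cat > data <<'EOF'
link1 chr1 57505384
link2 chr1 137313720
link3 chr2 26351096
link5 chr4 153210661
EOF

# Start a new numbered file on the first line and whenever column 2 changes.
awk 'NR == 1 || $2 != prev {
         if (out != "") close(out)   # finished with the previous file
         n++
         out = "file" n
         prev = $2
     }
     { print > out }' data

ls
```

Because the counter is incremented before the first write, the numbering starts at file1.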
|
|
|
07-06-2012, 10:14 AM
|
#7
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028
|
In that case, pan64's solution is the best choice.
|
|
|
07-06-2012, 01:43 PM
|
#8
|
Member
Registered: Jun 2012
Location: Porto Alegre-Brazil
Distribution: Slackware- 14, Debian Wheezy, Ubuntu Studio, Tails
Posts: 88
Rep:
|
Quote:
Originally Posted by bsat
See if this meets your requirement:
Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} { print $0 >> ("file" i) }}' data
The new files are named file1, file2, etc. (assuming the file is sorted on column 2).
|
I tested it here, and it splits the data the way @captainentropy showed in the previous post!
Cheers
|
|
|
07-06-2012, 03:57 PM
|
#9
|
Member
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81
Original Poster
Rep:
|
@pan64, I'm sorry, I made a mistake - your solution worked perfectly! (I did try it at 3am though, and I made a silly mistake). So simple a solution. And it named the files appropriately.
@bsat, yours works too (as Alchemikos said). I made one modification, though: the file name uses temp[i] instead of i, so that it reflects the content more explicitly (e.g. "chr1", "chr15", etc.):
Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} { print $0 >> ("whatever prefix I need" temp[i]) }}' input_data
Thank you all! Time and aggravation savers this forum is.
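If the prefix changes from run to run, one option is to pass it in with -v instead of editing the script each time. A minimal sketch, assuming a hypothetical prefix sampleA_ and combining it with the plain split on $2. The parentheses keep the concatenated file name unambiguous.

```shell
cd "$(mktemp -d)" || exit 1

cat > input_data <<'EOF'
link1 chr1 57505384
link2 chr1 137313720
link15 chr15 49329448
EOF

# prefix is a hypothetical name, set on the command line with -v.
awk -v prefix="sampleA_" '{ print > (prefix $2) }' input_data

ls
```

This should leave sampleA_chr1 and sampleA_chr15 next to the input file.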
|
|
|
07-09-2012, 11:17 PM
|
#10
|
Member
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81
Original Poster
Rep:
|
OK, the solution above worked for that example, but I have a new complication that I can't figure out. The files I'm creating are to be read by a program that requires each pair of "links" to be on two lines (in the previous example there was in reality a second link on each line, which I removed for clarity). What is to become the second line (the link pair) was created with this:
Code:
awk 'BEGIN{t=1}{print "link"t,$1,$2,$2+50,"link"t,"chrX","500000","500000";t+=1}'
So then I have something like this:
Code:
link1 chr1 57505384 57505434 link1 chrX 500000 500000
link2 chr1 137313720 137313770 link2 chrX 500000 500000
link3 chr2 26351096 26351144 link3 chrX 500000 500000
link4 chr2 87616977 87617027 link4 chrX 500000 500000
link5 chr4 153210661 153210711 link5 chrX 500000 500000
I modified the previous solution as follows to split each record into the two-line format:
Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} {print $1,$2,$3,$4"\n",$5,$6,$7,$8 >> "whatever prefix I need"temp[i]}}' input_data
And now it looks like this:
file "chr1":
Code:
link1 chr1 57505384 57505434
 link1 chrX 500000 500000
link2 chr1 137313720 137313770
 link2 chrX 500000 500000
file "chr2":
Code:
link3 chr2 26351096 26351144
 link3 chrX 500000 500000
link4 chr2 87616977 87617027
 link4 chrX 500000 500000
file "chr4":
Code:
link5 chr4 153210661 153210711
 link5 chrX 500000 500000
The data are in the correct places but I can't get rid of or prevent the leading space from appearing on the second line of each link. I can remove them after they're split into separate files no problem, but I'm hoping there's a way I can do this in one step (and I can't figure it out).
Thanks for any help you might be able to offer.
Last edited by captainentropy; 07-10-2012 at 03:09 PM.
Reason: spelling
|
|
|
07-10-2012, 12:31 AM
|
#11
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028
|
Well, it looks a little overcomplicated, but the answer to your question is to remove the comma in front of $5.
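Spelled out, the change looks something like this. It is a sketch on two of the sample lines, using the plain split on $2 and a made-up prefix out_ in place of the real one. With no comma between "\n" and $5, nothing is inserted after the newline.

```shell
cd "$(mktemp -d)" || exit 1

cat > input_data <<'EOF'
link1 chr1 57505384 57505434 link1 chrX 500000 500000
link2 chr1 137313720 137313770 link2 chrX 500000 500000
EOF

# $4 "\n" $5 is a single concatenated item, so the second output line
# starts directly with $5 instead of with a separator.
awk '{ print $1, $2, $3, $4 "\n" $5, $6, $7, $8 > ("out_" $2) }' input_data

cat out_chr1
```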
|
|
|
07-10-2012, 03:06 PM
|
#12
|
Member
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81
Original Poster
Rep:
|
Thanks, grail! I didn't realize \n obviated the need for the following comma. I'm applying a lot of jiujitsu to my data files. I think I can simplify the code I'm writing a bit more, but it's doing what I need now, thanks to all the help here.
cheers!
|
|
|
07-10-2012, 10:54 PM
|
#13
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028
|
Just to clarify: when using 'print', items separated by commas are printed with OFS placed between them. Without a comma the items are simply concatenated, so '\n', any other string, or nothing at all can go between them. The nice part about the comma is that if you have set a specific OFS and want it between items, e.g. OFS="|", a pipe is placed wherever you put a comma between the items passed to 'print'.
The guide below is invaluable:
http://www.gnu.org/software/gawk/man...ode/index.html
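A small demonstration of the difference, as a sketch on a throwaway input line (the "a b c" input is made up purely for illustration):

```shell
# Commas put OFS between the items.
echo "a b c" | awk 'BEGIN { OFS = "|" } { print $1, $2, $3 }'   # prints a|b|c

# No commas: the items are concatenated with nothing in between.
echo "a b c" | awk '{ print $1 $2 $3 }'                         # prints abc

# A mix: "\n" is concatenated between $1 and $2, OFS goes between $2 and $3,
# so this prints a on one line and b|c on the next.
echo "a b c" | awk 'BEGIN { OFS = "|" } { print $1 "\n" $2, $3 }'
```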
|
|
|