LinuxQuestions.org
Old 07-05-2012, 09:19 PM   #1
captainentropy
Member
 
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81

Rep: Reputation: 0
awk split file into variable number of files


Hi,

I have a file like this
Code:
link1 chr1 57505384
link2 chr1 137313720
link3 chr2 26351096
link4 chr2 87616977
link5 chr4 153210661
link6 chr5 76008854
link7 chr7 81543608
link8 chr8 93738351
link9 chr9 14298503
link10 chr9 67889613
link11 chr9 67889613
link12 chr9 67889613
link13 chr9 67889613
link14 chr9 67889613
link15 chr15 49329448
link16 chr15 61920423
link17 chr15 62182866
link18 chr15 62334341
I need to split the file into X files, where X is the number of unique values in $2 (8 in this example). I wrote a script that does this with 8 versions of the following, but I would like a simpler solution if possible, since I will have many files with different names to process:
Code:
awk '$2 ~ /chr1\y/{print $0 > "chr1"}' input_file
How can I write this as one line so that the search value is incremented and, at each change in $2, those records are written to a file named after that value (which will always be chr[whatever], where whatever can be 1-19, X, Y, or M)? I hope my question is clear. Thanks for any help you can offer.
 
Old 07-05-2012, 10:24 PM   #2
Alchemikos
Member
 
Registered: Jun 2012
Location: Porto Alegre-Brazil
Distribution: Slackware- 14, Debian Wheezy, Ubuntu Studio, Tails
Posts: 88

Rep: Reputation: 6
Hello

Please give an example of the final result.

Cheers
 
Old 07-06-2012, 03:50 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028

Rep: Reputation: 3200
My first question back would be: Is it guaranteed that the file is sorted by $2?
 
Old 07-06-2012, 05:13 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 23,274

Rep: Reputation: 7696
why don't you write:
Code:
awk ' { print $0 > $2 } ' input_file
 
1 member found this post helpful.
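For anyone following along, here is a minimal sketch of the one-liner above run against a few lines of the sample data. The scratch directory and the file name input_file are assumptions for illustration only. The parentheses around $2 are optional in this simple case, but they keep the redirection target unambiguous if the name is later built by concatenation.

```shell
# Work in a throwaway directory so the output files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A few lines of the sample data from the first post.
cat > input_file <<'EOF'
link1 chr1 57505384
link2 chr1 137313720
link3 chr2 26351096
link9 chr9 14298503
link10 chr9 67889613
EOF

# Each line is written to a file named after its second field.
# awk truncates an output file only the first time it opens it,
# so later lines with the same $2 are added to the same file.
awk '{ print > ($2) }' input_file

# Show how many lines ended up in each file.
wc -l chr1 chr2 chr9
```

With this input, chr1 and chr9 should each receive two lines and chr2 one line.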
Old 07-06-2012, 06:11 AM   #5
captainentropy
Member
 
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81

Original Poster
Rep: Reputation: 0
Sorry, I thought my example was clear. The result would be 8 files, named after the value in $2 ("chr1" has two lines of data, "chr9" has six lines of data, etc., but the number of lines per file could be anywhere from 0 to 1e6):

file "chr1":
Code:
link1 chr1 57505384
link2 chr1 137313720
file "chr2":
Code:
link3 chr2 26351096
link4 chr2 87616977
file "chr4":
Code:
link5 chr4 153210661
file "chr5":
Code:
link6 chr5 76008854
file "chr7":
Code:
link7 chr7 81543608
file "chr8":
Code:
link8 chr8 93738351
file "chr9":
Code:
link9 chr9 14298503
link10 chr9 67889613
link11 chr9 67889613
link12 chr9 67889613
link13 chr9 67889613
link14 chr9 67889613
and file "chr15":
Code:
link15 chr15 49329448
link16 chr15 61920423
link17 chr15 62182866
link18 chr15 62334341
@pan64, that just creates an empty file.
@grail, yes, they will be sorted on $2 (sort -Vk2). Would it be possible to use an unsorted file? I'm not planning to, since one of the first steps in the script I have so far is to sort, but it would be useful to know.

Thanks!

Last edited by captainentropy; 07-06-2012 at 03:28 PM.
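On the unsorted question: when the output name is taken from $2 on every line, the result does not depend on the order of the input. The sketch below is one way to try that, assuming a hypothetical file called unsorted_input. It also closes each file right after writing, which is a common way to stay under the open-file limit that some awk implementations hit when $2 has many distinct values. Because a closed file gets reopened later, it appends with >> and clears any old output first.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same kind of data as in the thread, deliberately out of order on field 2.
cat > unsorted_input <<'EOF'
link9 chr9 14298503
link1 chr1 57505384
link10 chr9 67889613
link3 chr2 26351096
link2 chr1 137313720
EOF

# Remove leftovers from earlier runs, because >> would append to them.
rm -f chr1 chr2 chr9

# Append each line to the file named by field 2, then close that file
# so that only one output file is open at any time.
awk '{ out = $2; print >> out; close(out) }' unsorted_input

cat chr1
```

Lines keep their original relative order within each output file. Opening and closing a file for every line is slower than leaving files open, so this trade-off only makes sense when the number of distinct values is large.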
 
Old 07-06-2012, 09:13 AM   #6
bsat
Member
 
Registered: Feb 2009
Posts: 347

Rep: Reputation: 72
See if this meets your requirement

Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} { print $0 >> "file"i }}' data
The new files are named file1, file2, etc., because i is incremented before the first line is written. (Assuming the file is sorted according to column 2)

Last edited by bsat; 07-06-2012 at 09:29 AM.
 
2 members found this post helpful.
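The same idea can also be written with a plain variable instead of an array. This is only a sketch of the technique in the post above, not a replacement for it: the names prev and n and the input file name data_sorted are made up for the example, and it assumes the input is sorted on column 2 and that $2 is never empty.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > data_sorted <<'EOF'
link1 chr1 57505384
link2 chr1 137313720
link3 chr2 26351096
link5 chr4 153210661
EOF

# Bump the counter whenever field 2 differs from the previous line,
# then write the current line to the file for the current counter value.
# prev starts out empty, so the very first line already sets n to 1.
awk '$2 != prev { n++; prev = $2 }
     { print > ("file" n) }' data_sorted

ls
```

Since the counter is incremented before anything is printed, the first output file is file1. This version uses > rather than >>, so rerunning it overwrites earlier output instead of appending to it.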
Old 07-06-2012, 10:14 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028

Rep: Reputation: 3200
In that case, pan64's solution is the best choice.
 
Old 07-06-2012, 01:43 PM   #8
Alchemikos
Member
 
Registered: Jun 2012
Location: Porto Alegre-Brazil
Distribution: Slackware- 14, Debian Wheezy, Ubuntu Studio, Tails
Posts: 88

Rep: Reputation: 6
Quote:
Originally Posted by bsat
See if this meets your requirement

Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} { print $0 >> "file"i }}' data
The new files are named file1, file2, etc. (Assuming the file is sorted according to column 2)

I tested it here, and it works as @captainentropy demonstrated in the last reply!

Cheers
 
Old 07-06-2012, 03:57 PM   #9
captainentropy
Member
 
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81

Original Poster
Rep: Reputation: 0
@pan64, I'm sorry, I made a mistake - your solution worked perfectly! (I did try it at 3am though, and I made a silly mistake). So simple a solution. And it named the files appropriately.

@bsat, yours works too (as Alchemikos said). I made a modification, though (the output file name at the end), so that the name is more explicit about the content (e.g. "chr1", "chr15", etc.):

Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} { print $0 >> "whatever prefix I need"temp[i]}}' input_data
Thank you all! Time and aggravation savers this forum is.
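If the prefix changes from run to run, one option is to pass it in with -v rather than editing the program text each time. A small sketch, where the variable name prefix, the value sampleA_ and the file name input_data are placeholders and not anything from the posts above:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > input_data <<'EOF'
link1 chr1 57505384
link2 chr1 137313720
link15 chr15 49329448
EOF

# prefix is set on the command line. The parentheses make sure the
# concatenation is evaluated before the redirection in every awk.
awk -v prefix="sampleA_" '{ print > (prefix $2) }' input_data

ls
```

This should leave sampleA_chr1 and sampleA_chr15 next to input_data.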
 
Old 07-09-2012, 11:17 PM   #10
captainentropy
Member
 
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81

Original Poster
Rep: Reputation: 0
ok, the solution above worked for the example above but I have a new factor that I can't figure out. The files I'm creating are to be read by a program that requires a pair of "links" to be on two lines (in the previous example there was in reality a second link on each line that I eliminated for clarity). What is to become the second line (the link pair) was created with this:

Code:
awk 'BEGIN{t=1}{print "link"t,$1,$2,$2+50,"link"t,"chrX","500000","500000";t+=1}'
So then I have something like this:
Code:
link1 chr1 57505384 57505434 link1 chrX 500000 500000
link2 chr1 137313720 137313770 link2 chrX 500000 500000
link3 chr2 26351096 26351144 link3 chrX 500000 500000
link4 chr2 87616977 87617027 link4 chrX 500000 500000
link5 chr4 153210661 153210711 link5 chrX 500000 500000
I modified the previous solution to this to split it into the two line format:
Code:
awk 'BEGIN{i=0;temp[i]="xx"} {if($2 != temp[i]) {i++;temp[i]=$2} {print $1,$2,$3,$4"\n",$5,$6,$7,$8 >> "whatever prefix I need"temp[i]}}' input_data
And now it looks like this:

file "chr1":
Code:
link1 chr1 57505384 57505434
 link1 chrX 500000 500000
link2 chr1 137313720 137313770
 link2 chrX 500000 500000
file "chr2":
Code:
link3 chr2 26351096 26351144
 link3 chrX 500000 500000
link4 chr2 87616977 87617027
 link4 chrX 500000 500000
file "chr4":
Code:
link5 chr4 153210661 153210711
 link5 chrX 500000 500000
The data are in the correct places but I can't get rid of or prevent the leading space from appearing on the second line of each link. I can remove them after they're split into separate files no problem, but I'm hoping there's a way I can do this in one step (and I can't figure it out).

Thanks for any help you might be able to offer.

Last edited by captainentropy; 07-10-2012 at 03:09 PM. Reason: spelling
 
Old 07-10-2012, 12:31 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028

Rep: Reputation: 3200
Well, it looks a little overcomplicated, but the answer to your question is to remove the comma in front of $5.
 
1 member found this post helpful.
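A minimal sketch of that fix, using one line of the paired data and printing to the terminal instead of a file. The shell variable names are just for the demonstration. The only change between the two commands is the comma after "\n": with the comma, awk emits OFS (a space by default) before $5; without it, "\n" and $5 are concatenated directly.

```shell
line='link1 chr1 57505384 57505434 link1 chrX 500000 500000'

# Original form: the comma after "\n" inserts OFS (a space) before $5.
with_comma=$(printf '%s\n' "$line" | awk '{ print $1,$2,$3,$4"\n",$5,$6,$7,$8 }')

# Fixed form: no comma after "\n", so the second line starts in column 1.
without_comma=$(printf '%s\n' "$line" | awk '{ print $1,$2,$3,$4"\n"$5,$6,$7,$8 }')

printf '%s\n' "$with_comma" "$without_comma"
```

The first pair printed has the leading space on its second line, and the second pair does not.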
Old 07-10-2012, 03:06 PM   #12
captainentropy
Member
 
Registered: Mar 2010
Location: Berkeley
Distribution: Ubuntu, Mint, CentOS
Posts: 81

Original Poster
Rep: Reputation: 0
thanks grail! I didn't realize \n obviated the need for the following comma. There's a lot of jiujitsu I'm applying to my data files. I think I can simplify the code I'm writing a bit more but it's doing what I need now thanks to all the help here.

cheers!
 
Old 07-10-2012, 10:54 PM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,028

Rep: Reputation: 3200
Just to clarify, when using 'print', items separated by a comma will have OFS placed between them. This means that '\n' or anything else, or nothing at all, can be used between items in place of the comma. The nice part about the comma is that if you have set a specific OFS and want it between items, e.g. OFS="|", awk will place a pipe wherever you put a comma between items passed to 'print'.

The guide below is invaluable:

http://www.gnu.org/software/gawk/man...ode/index.html
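To see the OFS behaviour described above in isolation, here is a short sketch. The sample line and the shell variable names are only for illustration.

```shell
line='link1 chr1 57505384'

# Commas between print items are replaced by OFS on output.
piped=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "|" } { print $1, $2, $3 }')

# Items written side by side without a comma are concatenated
# with nothing between them, whatever OFS is set to.
joined=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "|" } { print $1 $2 $3 }')

printf '%s\n' "$piped" "$joined"
```

The first line printed is link1|chr1|57505384 and the second is link1chr157505384.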
 
  

