LinuxQuestions.org
Old 06-05-2013, 12:21 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Rep: Reputation: Disabled
how can I split a file into many files using a string in awk or sed


Hi guys,

I have a file called DataDictionary.txt that has a bunch of "groups" all in a single column, like this:

SRIG_NAME: FILENAME_TIG
data.sap_1_ecp
data.sap_2_ecp
data.sap_3_ecp
data.sap_4_ecp
etc...
SRIG_END: FILENAME_TIG
SRIG_NAME: BSG_BSG
info.bsg_1.csv
info.bsg_2.csv
info.bsg_3.csv
info.bsg_4.csv
etc...
SRIG_END: BSG_BSG
SRIG_NAME: CMP34_ADY
cmp34_data.ady._1.dat
cmp34_data.ady._2.dat
cmp34_data.ady._3.dat
cmp34_data.ady._4.dat
etc...
SRIG_END: CMP34_ADY

and the file continues on this way for a very long time.

I would like to break up the one long file into many smaller files. I think this could be done by using the beginning and ending strings of each "group" to parse on

SRIG_NAME:
SRIG_END:

and the name of each of the new text files would be the string that follows

SRIG_NAME:
SRIG_END:

so in my example I'd have

FILENAME_TIG.txt
BSG_BSG.txt
CMP34_ADY.txt

can anybody help me?

I know a little awk and sed so I can follow along

Thanks so much guys! Tabitha



oh, and here's what I've already been working with

#!/bin/sh
awk -F "," '$1 == SRIG_NAME: {print FILENAME}' DataDictionary.txt | uniq > FILENAME_TIG.txt

Last edited by atjurhs; 06-05-2013 at 12:33 PM. Reason: but that should only get the first occurrence and it gets messed up output
 
Old 06-05-2013, 02:15 PM   #2
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,800
Blog Entries: 4

Rep: Reputation: 286
awk could probably do this better, but first you can try the split command (I assume that every group has 7 lines):
Code:
~$ split -l 7 file.txt newfile
~$ ls
newfileaa newfileab newfileac .....
 
Old 06-05-2013, 04:48 PM   #3
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by shivaa View Post
awk could probably do this better, but first you can try the split command (I assume that every group has 7 lines):
Code:
~$ split -l 7 file.txt newfile
~$ ls
newfileaa newfileab newfileac .....
thanks Shivaa for the thought! but the number of lines in each "group" varies

I'm pretty sure that the script will have to key off of the SRIG_NAME: and SRIG_END: strings

thanks again, Tabby
 
Old 06-05-2013, 09:29 PM   #4
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 805

Rep: Reputation: 186
awk can redirect output to multiple files based on a value within the file.

A hint.
If a line starts with SRIG_NAME:, then all subsequent data (including the current line) gets written to a file named after the 2nd field of that SRIG_NAME: line (FILENAME_TIG, BSG_BSG, etc.).
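
To illustrate only the general awk feature mentioned here (not the full answer), a minimal sketch follows. It routes each line to a file named after its first field. The input file `sample.txt`, its contents, and the `.out` suffix are made up for the demonstration.

```shell
# Made-up sample input: a key in field 1, a value in field 2.
printf '%s\n' 'fruit apple' 'veg carrot' 'fruit pear' > sample.txt

# The redirection target is an expression, so each key gets its own file.
# Within one awk run, ">" truncates a file only the first time it is opened;
# later prints to the same name are appended.
awk '{ out = $1 ".out"; print $2 > out }' sample.txt

cat fruit.out   # apple, then pear
cat veg.out     # carrot
```

Building the name in a variable first (`out`) sidesteps the question of how a particular awk parses a concatenation after `>`.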

OK

Last edited by AnanthaP; 06-05-2013 at 09:30 PM.
 
Old 06-05-2013, 09:41 PM   #5
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by AnanthaP View Post
awk can redirect output to multiple files based on a value within the file.

A hint.
If a line starts with SRIG_NAME:, then all subsequent data (including the current line) gets written to a file named after the 2nd field of that SRIG_NAME: line (FILENAME_TIG, BSG_BSG, etc.).

OK
Thanks AnanthaP for your reply!

yep, that's exactly the idea, and when it reads SRIG_END it ends writing lines for that "group" and starts again looking for the next SRIG_NAME

the question is how to implement this?

Tabby
 
Old 06-06-2013, 03:02 PM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Code:
csplit -z -f 'srigfile_' -b '%03d.txt' infile.txt '/^SRIG_NAME/' '{*}'
This creates individual files named srigfile_000.txt, etc.

See info csplit for details on how to use it properly.
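
To see what that command does on a small scale, here is a sketch that assumes GNU csplit (from coreutils). The three-group `sample_dict.txt` is made-up data shaped like the example in post #1, and the `demo_` prefix is chosen only to keep the output apart from real files.

```shell
# Made-up miniature of DataDictionary.txt.
cat > sample_dict.txt <<'EOF'
SRIG_NAME: FILENAME_TIG
data.sap_1_ecp
data.sap_2_ecp
SRIG_END: FILENAME_TIG
SRIG_NAME: BSG_BSG
info.bsg_1.csv
SRIG_END: BSG_BSG
SRIG_NAME: CMP34_ADY
cmp34_data.ady._1.dat
SRIG_END: CMP34_ADY
EOF

# Start a new piece at every line that begins with SRIG_NAME.
#   -z  drop the empty piece that would precede the first match
#   -f  output name prefix, -b  printf-style suffix format
#   -s  suppress the per-file byte counts csplit normally prints
csplit -s -z -f 'demo_' -b '%03d.txt' sample_dict.txt '/^SRIG_NAME/' '{*}'

ls demo_*.txt            # demo_000.txt demo_001.txt demo_002.txt
head -n 1 demo_001.txt   # SRIG_NAME: BSG_BSG
```

Each piece keeps its SRIG_NAME: line as line 1 and its SRIG_END: line as the last line, which is what makes a later rename by first line possible.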


PS: Please use [code][/code] tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques. Thanks.

Last edited by David the H.; 06-06-2013 at 03:09 PM.
 
1 members found this post helpful.
Old 06-06-2013, 03:15 PM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Oh, and here's a simple loop for renaming the files to the desired strings from the text.

Code:
for oname in srigfile_00*; do
   read -r _ nname <"$oname"
   mv "$oname" "$nname.txt"
done
This should work as long as the new name is the second space-delimited field on the first line of each file. But be sure that there aren't any duplicate names.

Last edited by David the H.; 06-06-2013 at 03:19 PM. Reason: add a bit more
 
1 members found this post helpful.
Old 06-06-2013, 08:52 PM   #8
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 805

Rep: Reputation: 186
Hi atjurhs,

The idea was to give you just a hint so that you can try it yourself. You seem to have made a start with awk (in post #1).

I refer you to the standard help on awk (below).
http://www.gnu.org/software/gawk/manual/gawk.html

OK
 
Old 06-07-2013, 11:03 AM   #9
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by David the H. View Post
Oh, and here's a simple loop for renaming the files to the desired strings from the text.

Code:
for oname in srigfile_00*; do
   read -r _ nname <"$oname"
   mv "$oname" "$nname.txt"
done
This should work as long as the new name is the second space-delimited field on the first line of each file. But be sure that there aren't any duplicate names.
The csplit command worked great, thanks so much! Its output was 1765 files named srigfile_000.txt through srigfile_1765.txt.

The oname loop hasn't been as successful.

If I run it as a bash script with #!/bin/bash, thinking that maybe the path to my bin is somehow messed up, the command line gives me back nothing and there is no change to the srigfile names.

If I run it without the #!/bin/bash, the command line gives me back ./script: line 2: srigfile_00*: No such file or directory, and the mv command of course says it cannot stat `srigfile_00*'.

So I tried changing around the string of srigfile_00* but that had no effect either; it still can't find the srigfiles, and sometimes it even deleted all the srigfiles, yikes!

I double-checked the fields on the first line of each of the newly created srigfiles from the csplit command, and they are space delimited, but I don't think this part of the script is getting accessed yet?

can you tell where I'm going wrong?
 
Old 06-09-2013, 05:50 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
It's usually a good idea not to run a possibly-destructive command like mv until you've confirmed that it's configured correctly. The easiest way is to just stick an echo at the front of the command, then you'll see a printout of what would actually be executed after the variables are expanded.

I don't really see what could be wrong with what I posted though. It's just a simple globbing pattern and for loop.

Since you have many more files than I used for testing, you'll probably need to shorten the glob to something like "srig*" (with three-digit numbering, "srigfile_00*" only matches srigfile_000.txt through srigfile_009.txt). Just keep it long enough to match only the files you want. "printf '%s\n' <glob>" can be used to list out all the files matched by that pattern, one per line.

The read command inside the loop just takes the first line from each file and splits it into two variables; the first word on the line goes into the throw-away "_", and all the rest into the nname variable, for use as the new filename.
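
A quick way to check what read does with such a line, as a small sketch: `piece_demo.txt` is a made-up stand-in for one of the csplit pieces.

```shell
# Made-up stand-in for one csplit piece.
printf '%s\n' 'SRIG_NAME: FILENAME_TIG' 'data.sap_1_ecp' > piece_demo.txt

# read consumes only the first line and splits it on whitespace:
# word 1 lands in the throw-away "_", the remainder in nname.
read -r _ nname < piece_demo.txt

printf 'new name: %s.txt\n' "$nname"   # new name: FILENAME_TIG.txt
```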

Check to see that you haven't made any syntax or spelling errors. And of course the loop needs to be run in the same directory as the files, or else it would have to be made more complex. Make sure the new names don't have any illegal filename characters or other conflicts either, as I mentioned before.

I highly doubt there are any problems with your PATH or other low-level issue like that. If you haven't had any problems before, then they aren't likely to be a factor now. It's certainly either a syntax or matching error of some kind.

Also, another thought, could the files have dos-style line-endings in them? If so, you may need to run them through dos2unix or a similar converter first.
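
Putting those checks together, here is a hedged sketch of a more defensive rename loop. It previews each mv with echo and strips a trailing carriage return in case the pieces have DOS line endings. The `srigfile_*` glob follows the csplit prefix used earlier in the thread; the two sample pieces and the `dry_run` switch are made up for the demonstration.

```shell
# Made-up pieces; the second one has DOS (CRLF) line endings.
printf 'SRIG_NAME: FILENAME_TIG\ndata.sap_1_ecp\n' > srigfile_000.txt
printf 'SRIG_NAME: BSG_BSG\r\ninfo.bsg_1.csv\r\n'  > srigfile_001.txt

dry_run=1   # hypothetical switch: set to 0 to really rename

for oname in srigfile_*; do
    [ -f "$oname" ] || continue          # the glob matched nothing
    read -r _ nname < "$oname"
    nname=${nname%"$(printf '\r')"}      # drop a trailing CR, if any
    [ -n "$nname" ] || continue          # no usable name on line 1
    if [ "$dry_run" -eq 1 ]; then
        echo mv "$oname" "$nname.txt"    # preview only, nothing is moved
    else
        mv -- "$oname" "$nname.txt"
    fi
done
```

With `dry_run=1` the loop only prints the mv commands it would run, so a typo in a variable name shows up in the preview before any file is touched.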
 
Old 06-09-2013, 10:24 AM   #11
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 5,027

Rep: Reputation: 845
david's is probably better. what i would've done is grep for a list of line numbers to feed into sed:
Code:
grep -n SRIG file.txt
and then parse them with sed.
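
One way to flesh that idea out, as a sketch only: grep -n supplies the line numbers of the marker lines, and sed -n 'START,ENDp' prints each numbered range. It assumes every SRIG_NAME: line is followed by its own SRIG_END: line before the next group starts; `sample_dict.txt` is made-up data.

```shell
# Made-up miniature of DataDictionary.txt.
cat > sample_dict.txt <<'EOF'
SRIG_NAME: FILENAME_TIG
data.sap_1_ecp
data.sap_2_ecp
SRIG_END: FILENAME_TIG
SRIG_NAME: BSG_BSG
info.bsg_1.csv
SRIG_END: BSG_BSG
SRIG_NAME: CMP34_ADY
cmp34_data.ady._1.dat
SRIG_END: CMP34_ADY
EOF

# grep -n prints "LINENO:SRIG_NAME: NAME", so splitting on ":" yields
# the line number, the marker, and the group name (with a leading space).
grep -n '^SRIG_' sample_dict.txt |
while IFS=: read -r lineno tag name; do
    case $tag in
        SRIG_NAME) start=$lineno; out=${name# }.txt ;;
        SRIG_END)  sed -n "${start},${lineno}p" sample_dict.txt > "$out" ;;
    esac
done

head -n 1 CMP34_ADY.txt   # SRIG_NAME: CMP34_ADY
```

This rereads the input once per group, so on a file with well over a thousand groups it is likely to be much slower than a single-pass approach.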
 
Old 06-09-2013, 10:50 AM   #12
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian Jessie / sid
Posts: 1,471

Rep: Reputation: 444
here is how I would do it

Code:
# Collect every group name, then cut that group's line range out with sed.
for i in $(awk '/^SRIG_NAME:/{print $NF}' DataDictionary.txt); do
    sed '/^SRIG_NAME:.'"$i"'/,/^SRIG_END/!d' DataDictionary.txt > "${i}.txt"
done
 
1 members found this post helpful.
Old 06-10-2013, 03:20 PM   #13
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
Both David's and Firerat's scripts work. David, I apologize: I mistyped "nname" as just "name" in the mv command. The output of echo pointed me to my error.

you know it's funny how you can see a script and follow along and know what it's doing at each step, but still know you couldn't write it yourself. I get some of it and then get stuck, or I start off down the wrong road

but thanks so much guys!

Tabby
 
Old 06-10-2013, 05:18 PM   #14
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian Jessie / sid
Posts: 1,471

Rep: Reputation: 444
Hi, glad it worked for you.
I had a little think and rewrote it in awk alone:

Code:
awk '{if ( $1 == "SRIG_NAME:" ){FileName = $NF; print $0 > (FileName ".txt"); next}; {print $0 >> (FileName ".txt")}}' DataDictionary.txt
Should be much faster than the awk/sed combo I posted

However, there is a little catch: it will overwrite the output when a group name is duplicated, which you can of course avoid by using ">>" globally
 
Old 06-11-2013, 01:04 PM   #15
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Those little typos get you every time. In hindsight I probably should've used variable names that were a bit clearer to read, like "oldname" and "newname", instead of the shorter ones. I usually use "fname" and "dname" myself for files and directories, so I was keeping with the same pattern.

@Firerat, Nicely done. Just a couple of quick suggestions. "print" on its own is the same as "print $0", and an "else" would probably be a better choice to connect the two commands, rather than "next".

I believe you could also reduce it down to just this (untested, 'cause I'm lazy):

Code:
awk '$1 == "SRIG_NAME:" { FileName = $NF } { print >> (FileName ".txt") }' DataDictionary.txt
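
For an input with well over a thousand groups, a variant worth considering is sketched below. It is an illustration under stated assumptions, not a tested replacement for the commands above: it closes each output file at SRIG_END: (some awk implementations limit how many files may be open at once) and skips any line that falls outside a group. `sample_dict.txt`, its stray line, and the variable name `out` are made up here.

```shell
# Made-up miniature of DataDictionary.txt, with one stray line between groups.
cat > sample_dict.txt <<'EOF'
SRIG_NAME: FILENAME_TIG
data.sap_1_ecp
data.sap_2_ecp
SRIG_END: FILENAME_TIG
stray line outside any group
SRIG_NAME: BSG_BSG
info.bsg_1.csv
SRIG_END: BSG_BSG
SRIG_NAME: CMP34_ADY
cmp34_data.ady._1.dat
SRIG_END: CMP34_ADY
EOF

awk '
    # A new group starts: remember the output name taken from field 2.
    $1 == "SRIG_NAME:" { out = $2 ".txt" }
    # Inside a group, write the line (marker lines included).
    out != ""          { print > out }
    # The group ends: close the file and stop writing until the next name.
    $1 == "SRIG_END:"  { close(out); out = "" }
' sample_dict.txt

wc -l FILENAME_TIG.txt BSG_BSG.txt CMP34_ADY.txt
```

Because each file is closed and later reopened with ">", a repeated group name overwrites the earlier output. Switching to ">>" appends instead, but then old output files need to be removed before a re-run.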

Last edited by David the H.; 06-11-2013 at 01:05 PM.
 
  


All times are GMT -5.
