LinuxQuestions.org


patatahead 04-19-2012 10:55 AM

How to split a text based on keywords and put each block in a separate file?
 
hi,

I'm trying to separate certain blocks of text from a file and put each of them into its own file. I've searched the net for resources and luckily landed on this forum; I hope you can help me out.

To be exact, I'm working on a file containing SQL commands. I want to extract each command and put it into its own file, with the filename taken from the object name (i.e. CREATE TABLE tableA gives tableA.tab as the filename).

here's what the SQL file looks like

CREATE TABLE PHONEBOOK_TABLE ...
;
------------------------------------------------

CREATE VIEW PHONEBOOK_VIEW ...
;

------------------------------------------------
CREATE VIEW
PHONEBOOK_HIDDEN_VIEW ...
;


Now, following the code posted above I was only able to remove the --- lines, but I'm clueless about what to do next. Is there a way to read the third word in a paragraph (regardless of whether it is separated by a space or a line break) and use that as a filename? Example below:

file PHONEBOOK_TABLE.tab contains
CREATE TABLE PHONEBOOK_TABLE ...
;

file PHONEBOOK_HIDDEN_VIEW.vw contains
CREATE VIEW
PHONEBOOK_HIDDEN_VIEW ...
;

hope you guys can help me out. Thanks

quarlington 04-19-2012 02:53 PM

If I'm reading what you're trying to do correctly, this code may be a quick and dirty start of a solution - you'll still have to deal with the file extensions though:


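#!/bin/bash
# Quick and dirty: split a file of SQL statements into one file per statement,
# naming each file after the third word of the statement (the object name).

INFILE=$1
DIR=$(pwd)
FILE=$DIR/$INFILE

STATEMENT=""
while IFS= read -r LINE
do
    # Skip dash-only separator lines, and blank lines between statements
    [[ $LINE =~ ^-+$ ]] && continue
    [[ -z $LINE && -z $STATEMENT ]] && continue
    STATEMENT="$STATEMENT $LINE"
    printf '%s\n' "$LINE" >> "$DIR/tmpfile.tmp"
    # A line containing ";" ends the current statement
    if printf '%s\n' "$LINE" | grep -q ';'; then
        NEWFILENAME=$(awk '{print $3}' <<< "$STATEMENT")
        mv "$DIR/tmpfile.tmp" "$DIR/$NEWFILENAME"
        STATEMENT=""
    fi
done < "$FILE"

exit 0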
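
One possible way to deal with the file extensions, as a minimal sketch. The helper names ext_for_type and filename_for_statement are hypothetical, and only the TABLE -> .tab and VIEW -> .vw mappings come from the first post; the .sql fallback is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: map the object type (second word of a CREATE
# statement) to a file extension. The "sql" fallback is an assumption.
ext_for_type() {
    case "$1" in
        TABLE) echo "tab" ;;
        VIEW)  echo "vw" ;;
        *)     echo "sql" ;;
    esac
}

# Hypothetical helper: build the file name from a statement that has
# already been joined onto one line (like $STATEMENT in the script above).
filename_for_statement() {
    set -f        # disable globbing so a "*" in the SQL is not expanded
    set -- $1     # split into words: $1=CREATE $2=type $3=object name
    set +f
    echo "$3.$(ext_for_type "$2")"
}

filename_for_statement "CREATE TABLE PHONEBOOK_TABLE ... ;"       # PHONEBOOK_TABLE.tab
filename_for_statement "CREATE VIEW PHONEBOOK_HIDDEN_VIEW ... ;"  # PHONEBOOK_HIDDEN_VIEW.vw
```

In the loop above, the result of filename_for_statement "$STATEMENT" could then be used in place of $NEWFILENAME.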

grail 04-19-2012 03:31 PM

Well, searching the forums can help too:

http://www.linuxquestions.org/questi...-files-940465/

patatahead 04-21-2012 04:40 AM

Hi Quarlington,

Thanks for this one. I'll try to decipher it; please correct me if I'm wrong. Let's see if I understood it right.

Quote:

Originally Posted by quarlington (Post 4657704)
If I'm reading what you're trying to do correctly, this code may be a quick and dirty start of a solution - you'll still have to deal with the file extensions though:


#!/bin/bash

INFILE=$1 # Pass the first parameter to the variable INFILE
DIR=$(pwd) # Set the value for variable DIR to the present working directory
FILE=$DIR/$INFILE # set the value of variable FILE to the fully qualified file name

while read LINE # while loop; where does the variable LINE come from?
do
STATEMENT="$STATEMENT $LINE" # hmm, I got lost here: are you passing each statement in the while loop to the variable STATEMENT? What does the variable LINE contain?

echo "$LINE" >> $DIR/tmpfile.tmp # append the line to a temporary file
echo "$LINE" | grep \; > /dev/null # look for the ";" character in the line and send the output to /dev/null (discard it)
if [ $? -eq 0 ]; then
NEWFILENAME=$(echo $STATEMENT | awk '{print $3}')
mv $DIR/tmpfile.tmp $DIR/$NEWFILENAME
STATEMENT=""
fi # If my idea of looping through the file, reading it line by line, is correct, won't this create multiple files with the third word as the file name, breaking the block of SQL into multiple files, one line per file?

done < $FILE

exit 0

Thanks.

patatahead 04-21-2012 04:48 AM

Quote:

Originally Posted by grail (Post 4657725)
Well, searching the forums can help too:

http://www.linuxquestions.org/questi...-files-940465/

Guru Grail,

Thank you for responding.

Here's what I got from the other post

Quote:

Originally Posted by grail (Post 4657725)
awk 'BEGIN{i=1}/^[A-Z].*proc/,/\//{print > "File"i}/\//{i++}' orig_file

I've seen people use this tool and I've studied it a little. Based on the cryptic statement above, what I understood is:

awk 'BEGIN{i=1} # BEGIN statement, setting the variable i to the value 1
/^[A-Z].*proc/ # find all lines starting with a character from A to Z; the .* I don't understand, but used in conjunction with proc it might mean any line with characters preceding "proc". Did I understand it right?
/\//{print > "File"i}/\//{i++}' # All I understood here is that you search for the "/" symbol and print the line into File[i], where i would be the number; then there is another search for the "/" symbol, and {i++} would represent a loop increment?

Hmm, I don't have access to a unix box at the moment, but will this code of yours print the lines one by one to a file?

Thank you.

grail 04-21-2012 05:28 AM

Not too bad a shot at understanding it. Let me flesh it out a little more to hopefully make it clear:

BEGIN{i=1} - the upshot here is that we initialise the variable 'i'. The other piece of information is that BEGIN is only ever performed once, before any file is read.

/^[A-Z].*proc/,/\// - As you can see, I have included the test for the slash (/), because this is called a range. It means: from a line that starts (^) with a capital letter ([A-Z]), followed by zero or more of any character (.*) and then the string "proc", perform the tasks inside the curly braces on every line until you reach a line containing a slash (/), that line included.

{print > "File"i} - Only while the previous expression is true, print the current line into a file called "FileN", where N is the current value of the variable "i".

/\//{i++} - This is completely separate from the previous test and action. On any line that contains a slash (/), increment the variable "i" by 1.

So with a few changes this could be made to process your data, but the upside is that your data is actually a little easier. Awk has a variable called RS (record separator) which allows one to define what makes a single record. As your data, assuming the example is correct, always has a line of dashes between each record, you can use this as the RS.

As the solution is trivial I will let you investigate further. Let me know if you get stuck.

Also, here is a valuable resource for awk: - http://www.gnu.org/software/gawk/man...ode/index.html
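
To watch the range pattern and the counter work together, here is a small sketch. The contents of orig_file are made up for illustration (the data from the other thread is not shown here), and the file name is built with parentheses, ("File" i), so the concatenation is unambiguous in every awk.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not clutter the current one.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: two blocks, each closed by a line holding "/".
cat > orig_file <<'EOF'
CREATE proc one
  body one
/
this line is outside any range
CREATE proc two
  body two
/
EOF

awk 'BEGIN { i = 1 }
     /^[A-Z].*proc/, /\// { print > ("File" i) }   # every line inside the range
     /\//                 { i++ }                   # a slash closes the block
' orig_file

head File1 File2
```

Each block should end up in its own file (File1, File2), and the line between the two blocks should be written nowhere because it lies outside both ranges.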

patatahead 04-23-2012 10:03 AM

Hi Grail,

I've read your awk pages and managed to do this line

awk 'BEGIN{RS="---------------------------------------------------------------------------";FS="\n"} /^$/,/\;/ { print $0 }' file_SQL.txt

The RS is the record separator, and since the blocks of code are separated by dashed lines, I used those as the RS; after it comes the field separator. If I'm right, FS="\n" makes each line of a record one field. Next, I placed my search condition: starting with a blank line until it reaches something with ";".

Interestingly, when I executed this with print $0, the first block of SQL text came out.

CREATE TABLE PHONEBOOK_TABLE ...
some statements ...
;

When I tried changing the $0 to $1, it showed nothing.
When I tried $2, it showed only the first statement:
CREATE TABLE PHONEBOOK_TABLE ...

I'm trying to imagine that, say, the whole file contains 10 blocks of code separated by dashes. How could I tell awk that I want those 10 blocks of code in a collection, and then loop through it one by one (per block, not per line), saving each block to a file?

Am I still on the right track?
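
A small sketch that may explain the empty $1. It assumes an awk that accepts a regular expression as RS (gawk and mawk do; strictly POSIX awks use only the first character), and the sample input, which starts with a dashed line, is made up. Because the RS here does not include the newline that follows the dashes, every record begins with that newline, and with FS="\n" the text before it becomes an empty first field.

```shell
#!/bin/sh
# Hypothetical sample input; the first line is a dashed separator.
sample=$(mktemp)
cat > "$sample" <<'EOF'
------------------------------------------------
CREATE TABLE PHONEBOOK_TABLE ...
;
------------------------------------------------
CREATE VIEW PHONEBOOK_VIEW ...
;
EOF

# RS = "-+" means "one or more dashes"; FS = "\n" makes each line a field.
# The bare pattern NF skips records that have no fields at all.
out=$(awk 'BEGIN { RS = "-+"; FS = "\n" }
           NF {
               printf "record %d has %d fields\n", NR, NF
               for (n = 1; n <= NF; n++) printf "  $%d = [%s]\n", n, $n
           }' "$sample")
printf '%s\n' "$out"
```

awk already loops over the records one by one, so no separate collection is needed: the action block runs once per block of SQL. Putting the newline into the separator, as in RS = "-+\n", should remove the empty leading field.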

schneidz 04-23-2012 11:20 AM

Here's my stab at it (I cheated a little by editing the input to make all the stanzas a standard format):
Code:

[schneidz@hyper patatahead]$ cat patatahead.txt | while read line; do if [ "`echo $line | grep "\---"`" ]; then  read fout; read line; echo $fout > `echo $fout | awk '{print $3}'`.txt; fi; echo $line >> `echo $fout | awk '{print $3}'`.txt ; done
[schneidz@hyper patatahead]$ head *.txt
==> patatahead.txt <==
----------------------------------------
CREATE TABLE PHONEBOOK_TABLE ...
;
------------------------------------------------
CREATE VIEW PHONEBOOK_VIEW ...
;

------------------------------------------------
CREATE VIEW PHONEBOOK_HIDDEN_VIEW ...
;

==> PHONEBOOK_HIDDEN_VIEW.txt <==
CREATE VIEW PHONEBOOK_HIDDEN_VIEW ...
;

==> PHONEBOOK_TABLE.txt <==
CREATE TABLE PHONEBOOK_TABLE ...
;

==> PHONEBOOK_VIEW.txt <==
CREATE VIEW PHONEBOOK_VIEW ...
;


grail 04-23-2012 11:51 AM

I am with schneidz in that the format not being uniform could cause issues, so you would need to advise whether this is the actual data or whether it looks more like the data in post #8.

Making the same assumption, in awk it would look like:
Code:

awk '{print > ($3 ".txt")}' RS="-+\n" file
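
A slightly fuller sketch of the same idea, under the same uniform-format assumption. It assumes gawk or mawk for the regex RS; the .tab and .vw extensions come from the first post, while the "sql" fallback and the sample file contents are assumptions. The NF >= 3 guard skips the empty record that appears when the input starts with a dashed line, which would otherwise be written to a file called ".txt".

```shell
#!/bin/sh
# Work in a scratch directory so the demo files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input in the uniform format of post #8 (assumed, not real data).
cat > file_SQL.txt <<'EOF'
----------------------------------------
CREATE TABLE PHONEBOOK_TABLE ...
;
------------------------------------------------
CREATE VIEW PHONEBOOK_VIEW ...
;
EOF

awk 'BEGIN {
         RS = "-+\n"              # regex RS: needs gawk or mawk
         ext["TABLE"] = "tab"     # extensions taken from the first post
         ext["VIEW"]  = "vw"
     }
     NF >= 3 {                    # skip empty records
         sub(/\n+$/, "")          # drop trailing newlines; print adds one back
         suffix = ($2 in ext) ? ext[$2] : "sql"   # "sql" fallback is an assumption
         out = $3 "." suffix
         print > out
         close(out)
     }' file_SQL.txt

ls
```

One caveat: any SQL line that happens to end in a dash would also be treated as a separator, so a stricter pattern may be needed for real data.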

patatahead 04-26-2012 09:20 AM

Apologies for the delay in responding.

The dashed lines in the SQL file I have are uniform. I've read @schneidz's post and boy, what a way to attack the problem! I liked it! So basically, we loop through the doc, catch the dashed lines ("---"), and echo the rest into the file. Sweet!

how did you guys get to know those tricks? i'm quite interested to know :)

@grail: guru, I know that \n stands for new line, what does -+ stand for?

grail 04-26-2012 10:30 AM

+ is regex for "one or more of the previous item", so in this case one or more dashes (-).
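
A quick way to see what -+ matches, as a small sketch using grep -E. The four test lines are made up for illustration, and the ^ and $ anchors are added here so that only lines consisting entirely of dashes count.

```shell
#!/bin/sh
# Count the lines that consist only of one or more dashes.
# Input lines: "-", "------", "a-b" and an empty line.
count=$(printf '%s\n' '-' '------' 'a-b' '' | grep -E -c '^-+$')
echo "$count"   # prints 2: only the first two lines are all dashes
```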

btw. The dashes are not what we were concerned about; the issue was the following comparison between your data and schneidz's:
Code:

CREATE VIEW
PHONEBOOK_HIDDEN_VIEW ...
;

CREATE VIEW PHONEBOOK_HIDDEN_VIEW ...
;

So our scripts require the CREATE to have all its parts on one line, as in your example there is no $3 on either line.
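
For the awk version this may be worth testing before reformatting the data. With awk's default field separator, newlines count as whitespace just like spaces and tabs, so once RS turns the whole block into a single record, the object name should still arrive as $3 even when the CREATE is split across lines. A minimal check, assuming gawk or mawk for the regex RS:

```shell
#!/bin/sh
# The split-line statement from the thread, fed to awk as ONE record.
# RS = "-+\n" (a regex record separator) assumes gawk or mawk.
name=$(printf 'CREATE VIEW\nPHONEBOOK_HIDDEN_VIEW ...\n;\n' |
       awk 'BEGIN { RS = "-+\n" } NF >= 3 { print $3 }')
echo "$name"   # prints PHONEBOOK_HIDDEN_VIEW
```

The line-by-line shell loop is a different matter: it only ever sees one line at a time, so there the split statement really does have no third word on either line.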

patatahead 04-29-2012 09:08 AM

hi Grail,

Yeah, too bad, poorly written code that is! I had a situation where it got much worse than this: imagine a keyword split in two due to wrong line sizing. Splitting these SQLs into different files is one thing; getting the name out of numerous create-object lines is another story. Sigh ... but many thanks for your and schneidz's effort. I don't have access to a unix box on weeknights and weekends. That's why today I converted my Windows box into a Linux box. :)


All times are GMT -5.