LinuxQuestions.org


cristalp 11-18-2011 12:25 PM

AWK: split the file into multiple file and request for explanation of a known code
 
Dear Experts,

I have a file that looks like this:
Code:

input_wez

.....
.....
.....
end

useless content


input_rty

....
....
....
end

useless content

input_utl

....
....
....
end

useless content

...
...

I want to split the file on each input_***/end pair and discard the useless content in between.
Each output file should be named after the three-letter code following "input_", plus a sequence number, like:
Code:

wez_1.txt
rty_2.txt
utl_3.txt
...

***_9999.txt
...

The content of each file should be:
Code:

....
....
....
end

Please note that the output file contains neither the "input_***" line nor the empty line that follows it. Each file starts directly with the content, which begins two lines after the "input_***" title in the big file.

I modified someone else's code and can now get close to that result with:
Code:

awk -F_ '/input/{ f=$2; n++; next} f{print > f "_" n ".pdb"} /END/{close(f);f=x}' INPUTFILE
But the output file from this code looks like:
Code:

#EMPTY LINE APPEARED HERE
....
....
....
end

Please note that this code does not eliminate the empty line between the "input_***" line and the content.


My questions are:

1. How can I eliminate the empty line with the simplest modification to the above awk code?

2. In the above awk code, what is the meaning of the f before
Code:

f{print > f "_" n ".pdb"}
Why, when I replace it with
Code:

{print > f "_" n ".pdb"}
does it give me files named _n.pdb instead of ***_n.pdb?
Is this a general method for writing to files?
What is the general usage and purpose of
Code:

f{....}
?


3. At the end of my awk code, when I close the file with
Code:

{close(f);f=x}
Why do I need to reset f to x? If I do not, why do I get the "useless content" at the end of each output file? What is the logic behind this?

If you understand the code better than I do, could you please explain these two parts in a bit more detail?

I know these questions may be annoying, but I am really trying hard to understand AWK and I hope to use it more freely. To do that I need a better and deeper understanding. I hope these questions do not disturb you too much; if you don't like them, please just ignore them. I would thank you all the same!

grail 11-18-2011 12:47 PM

Quote:

1. How can I eliminate the empty line with the simplest modification to the above awk code?
Change the order
Quote:

2. In the above awk code, what is the meaning of the f before
I'll answer with a question, what is the point of the following in your code (answer is the same):
Code:

/input/
Quote:

Why do I need to reset f to x?
What does 'x' equal?

cristalp 11-18-2011 12:48 PM

Hi all,

I found an answer to the 1st question, but maybe not the simplest method:

Code:

awk 'BEGIN {FS = "_"} /input/{ f=$2; n++;next} f{if (NF > 0) print > f "_" n ".txt"} /END/{close(f);f=x}' INPUTFILE
Any better ideas??

Thanks!

firstfire 11-18-2011 01:45 PM

Hi, cristalp.

Try this:
Code:

awk -F_ '/input/{ f=$2; n++; m=0; next } {m++} m>1&&f{print > (f "_" n ".pdb")} /end/{close(f "_" n ".pdb"); f=0}' test.txt
On your questions:
1. To eliminate the empty line (if you mean the line right after input_*), one can use an additional counter `m', which counts the lines after input_*, and print only lines with m > 1. See the example above.
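To try the counter idea without touching real data, here is a minimal self-contained sketch. The sample lines, the temporary directory `demo_dir` and the variable `out` are made up for illustration; it writes .txt files as asked in the first post, and it assumes the markers sit alone at the start of a line (hence the anchored `/^input_/` and `/^end$/`), which may not hold for the real file.

```shell
# Hypothetical sample data modelled on the layout in the first post.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

cat > test.txt <<'EOF'
input_wez

line a1
line a2
end

useless content

input_rty

line b1
end

useless content
EOF

# out : name of the file being written ("" while outside a block)
# m   : number of lines read since the last input_* header
awk -F_ '
/^input_/    { n++; out = $2 "_" n ".txt"; m = 0; next }
             { m++ }
m > 1 && out { print > out }                      # m == 1 is the empty line, skip it
/^end$/      { if (out) close(out); out = "" }    # stop writing after the end line
' test.txt

ls    # should list rty_2.txt, test.txt and wez_1.txt
```

Keeping the whole file name in one variable means the same string is used for `print >` and for `close()`, so the file that was opened is the one that gets closed.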

2,3.
In the code
Code:

f{print > f "_" n ".pdb"}
the `f' before `{' is a pattern, in this case a logical expression. The action in braces is executed only if the variable `f' has a non-empty, non-zero value. As you can see, I use a more complex logical expression, `m>1&&f', to decide whether to print a line or not.
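The general shape is `condition { action }`. A tiny sketch of a bare variable used as the condition, independent of the file-splitting task (the variable name `show` and the four input words are only for illustration):

```shell
flag_demo=$(printf '%s\n' one two three four |
awk '
NR == 2 { show = 1 }   # switch the flag on when line 2 is read
show    { print }      # bare variable as pattern: runs only while show is true
NR == 3 { show = 0 }   # switch it off after line 3 has been printed
')
printf '%s\n' "$flag_demo"   # prints "two" and "three"
```

Lines 1 and 4 are skipped because `show` is unset (false) or 0 when the middle rule is tested.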

Resetting f to x means resetting f to the empty string (because the variable `x' is never set), so that f becomes logically false. Note that I reset `f' to zero, which has the same effect.
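You can check this behaviour of unset variables directly. A small sketch (the printed messages are just labels I made up):

```shell
unset_demo=$(awk 'BEGIN {
    # x is never assigned, so it acts as "" in string context and 0 in numeric context
    if (x == "") print "x equals the empty string"
    if (x == 0)  print "x equals zero"
    if (!x)      print "x is false"
    f = "wez"; if (f)  print "a non-empty f is true"
    f = x;     if (!f) print "f = x is false"
    f = 0;     if (!f) print "f = 0 is false"
}')
printf '%s\n' "$unset_demo"
```

All six messages are printed, which is why `f=x`, `f=""` and `f=0` are interchangeable here.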

If you remove `f' and use just {print > f "_" n ".pdb"}, you get not only wez_1.pdb etc., but also _1.pdb etc. The _n.pdb files contain what you called 'useless content', i.e. the lines that follow the n-th input_***...end record. This happens because every line is printed regardless of the value of `f', and f=="" while the useless content is being read.
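A small sketch that reproduces this effect on made-up data (the directory `guard_dir` and the one-record sample file are assumptions for illustration; the patterns are anchored here, unlike in the original one-liner):

```shell
guard_dir=$(mktemp -d)
cd "$guard_dir" || exit 1
printf '%s\n' 'input_wez' '' 'data' 'end' '' 'useless content' > test.txt

# No guard in front of the print action: every line that reaches it is written somewhere.
awk -F_ '
/^input_/ { f = $2; n++; next }
          { print > (f "_" n ".pdb") }
/^end$/   { close(f "_" n ".pdb"); f = "" }
' test.txt

ls    # _1.pdb now exists next to wez_1.pdb
```

After the `end` line f is empty, so the file name built from `f "_" n ".pdb"` collapses to `_1.pdb` and the trailing lines land there.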

Note that /END/ in your code should read /end/ (if the input file uses lowercase `end'), because awk regular expressions are case-sensitive.
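A quick way to see which lines each pattern matches (the three sample lines are made up; the anchored `/^end$/` is an optional tightening I am suggesting, not something the original code used):

```shell
case_demo=$(printf '%s\n' 'end' 'END' 'the end is near' |
awk '
/END/   { print "upper    : " $0 }   # matches only the upper-case line
/end/   { print "lower    : " $0 }   # matches any line containing "end"
/^end$/ { print "anchored : " $0 }   # matches only a line that is exactly "end"
')
printf '%s\n' "$case_demo"
```

The unanchored /end/ also fires on "the end is near", so if the real data can contain the word inside a content line, anchoring the pattern may be safer.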

Hope this helps. I apologize for my poor English.

cristalp 11-23-2011 07:29 AM

Thanks a lot, firstfire. Your explanation is very clear and very helpful, and your English is in fact good. Thanks again!


All times are GMT -5.