LinuxQuestions.org - [SOLVED] AWK: split the file into multiple file and request for explanation of a known code

Dear Experts,

I have a file that looks like this:

Code:

input_wez

.....
.....
.....
end

useless content


input_rty

....
....
....
end

useless content

input_utl

....
....
....
end

useless content

...
...

I want to split the file at each input_***/end pair and discard the useless content.
Each output file should be named with the three-letter code after "input_" plus a sequence number, like:

Code:

wez_1.txt
rty_2.txt
utl_3.txt
...

***_9999.txt
...

The content of each file should be:

Code:

....
....
....
end

Please note that the output file contains neither the "input_***" line nor the empty line that follows it. Each output file starts with the content that begins two lines after the "input_***" header in the big file.

I modified someone else's code and can now get close to the desired result with:

Code:

awk -F_ '/input/{ f=$2; n++; next} f{print > f "_" n ".pdb"} /END/{close(f);f=x}' INPUTFILE

But the output file from this code looks like:

Code:

#EMPTY LINE APPEARED HERE
....
....
....
end

Note that this code does not remove the empty line between the "input_***" line and the content.

My questions are:

1. How can I eliminate the empty line with the simplest modification to the above awk code?

2. In the above awk code, what is the meaning of the f before

Code:

f{print > f "_" n ".pdb"}

Why, when I replace it with

Code:

{print > f "_" n ".pdb"}

does it give me file names like _n.pdb instead of ***_n.pdb?
Is this a general method for writing to files?
What is the general usage and purpose of

Code:

f{....}

?

3. At the end of my awk code, when I close the file with

Code:

{close(f);f=x}

Why do I need to reset f to x? If I don't, why do I get the "useless content" at the end of each output file? What is the logic behind this?

If you understand the code better than I do, could you please explain these two parts in a bit more detail?

I know these questions may be annoying, but I am trying very hard to understand AWK and I really hope to use it more freely. For that I need a better and deeper understanding. I hope these questions do not disturb you too much. If you don't like them, please just ignore them. I would thank you all the same!

Quote:

1. How can I eliminate the empty line with the simplest modification to the above awk code?

Change the order

Quote:

2. In the above awk code, what is the meaning of the f before

I'll answer with a question: what is the purpose of the following in your code? (The answer is the same.)

Code:

/input/

Quote:

Why do I need to reset f to x?

What does 'x' equal?

Hi all,

I found an answer to the first question, but maybe not the simplest one:

Code:

awk 'BEGIN {FS = "_"} /input/{ f=$2; n++;next} f{if (NF > 0) print > f "_" n ".txt"} /END/{close(f);f=x}' INPUTFILE

Any better ideas??

Thanks!

Hi, cristalp.

Try this:

Code:

awk -F_ '/input/{ f=$2; n++; m=0; next;} {m++} m>1&&f{print > f "_" n ".pdb"} /end/{close(f); f=0}' test.txt

On your questions:
1. To eliminate the empty line (if you mean the line after input_*), one can use an additional counter `m' that counts the lines after input_* and print only lines with m > 1. See the example above.

2,3.
In the code

Code:

f{print > f "_" n ".pdb"}

the `f' before `{' is a pattern, here a logical expression. The action in braces is executed only if the variable `f' has a non-empty, non-zero value. As you can see, I use a more complex logical expression, `m>1&&f', to decide whether to print a line.
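A minimal sketch of an expression used as a pattern, assuming a small made-up input. The flag name `f` follows the thread; the sample lines and the printed prefix are only illustrative.

```shell
# Illustrative input: one marked block with stray lines around it.
result=$(printf '%s\n' 'junk' 'input_abc' 'line1' 'end' 'junk' |
  awk -F_ '
    /input/ { f = $2; next }     # remember the tag, skip the header line
    f       { print f ": " $0 }  # runs only while f is non-empty
    /end/   { f = "" }           # f is false again, later lines are skipped
  ')
printf '%s\n' "$result"
```

Only the lines between the header and `end` (inclusive) should be printed, each prefixed with the tag taken from the header.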

Resetting f to x sets f to the empty string (because the variable `x' is never set), so that f is logically false. Note that I reset `f' to zero, with the same effect.
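A small sketch of why `f=x` works, assuming `x` is never assigned anywhere in the program. An unassigned awk variable compares equal to both the empty string and zero, so it is false when used as a condition.

```shell
result=$(awk 'BEGIN {
  # x was never assigned, so it is both the empty string and zero
  if (x == "" && x == 0) print "x is empty and zero"
  f = "wez"; if (f)  print "f is true"
  f = x;     if (!f) print "f is false after f = x"
}')
printf '%s\n' "$result"
```

Writing `f=""` or `f=0` states the intent more directly than relying on an unused variable name.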

If you remove `f' and use just {print > f "_" n ".pdb"}, you get not only wez_1.pdb etc., but also _1.pdb etc. Each _n.pdb file contains what you called 'useless content', which follows the n-th input_***...end record. This happens because every line is printed regardless of the value of `f', and f=="" for the useless content.
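The effect can be reproduced with a tiny test file. This is only a sketch: the file name big.txt, the temporary directory, and the sample lines are assumptions made for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'input_wez' 'aaa' 'end' 'useless' > big.txt

# No guard on the print rule: every line not skipped by "next" is written,
# including lines that arrive while f is empty.
awk -F_ '/input/ { f = $2; n++; next }
         { print > (f "_" n ".txt") }
         /end/ { f = "" }' big.txt

ls "$workdir"
```

Besides wez_1.txt, the directory should now contain _1.txt holding the line that followed `end`. The parentheses around the file name expression are a precaution, since an unparenthesized concatenation after `>` is parsed differently by some awk implementations.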

Note that /END/ in your code should be /end/ (if the input file uses lowercase `end'), because awk regular expressions are case-sensitive.

Hope this helps. I apologize for my poor English.
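Putting these points together, here is one possible commented variant. It is a sketch, not the code from the thread. The variable names `out` and `skip` are hypothetical, and it assumes that exactly one line after each input_*** header is to be dropped. It also keeps the full output name in a variable, because close() identifies a file by the same string that was used in the redirection, so close(f) with only the tag in `f' would not close wez_1.pdb.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'input_wez' '' 'aaa' 'bbb' 'end' '' 'useless content' \
              'input_rty' '' 'ccc' 'end' 'useless content' > big.txt

awk -F_ '
  /^input_/ { n++; out = $2 "_" n ".txt"; skip = 1; next }  # start a new block
  skip      { skip = 0; next }         # drop the single line after the header
  out       { print > out }            # inside a block: write the line
  /^end$/   { close(out); out = "" }   # close the same name that was opened
' big.txt

cat wez_1.txt rty_2.txt
```

Closing each file as soon as its block ends keeps the number of simultaneously open files small, which matters when the input holds thousands of blocks.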

Thanks a lot, firstfire. Your explanation is very clear and very helpful, and your English is in fact good. Thanks again!