redirecting input from file in awk script

konsolebox · 09-25-2012, 01:50 AM

Trd300: Do you want to read the file as you send data to it? Perhaps you need to use a pipe instead. (See man mkfifo).

AnanthaP · 09-25-2012, 02:15 AM

I haven't tried it out yet, but I would read and process output1.txt in the END block. (The basic call would remain

Quote:

gawk -f myprog.awk input1.txt

.)

Also, I would generate records in output1.txt iff (NR%2 == 0), and in that case generate as many lines as length($3). Note that this would handle the missing last line in your output.

OK

Trd300 · 09-25-2012, 03:48 AM

Quote:

Do you want to read the file as you send data to it?

konsolebox: the goal would be to convert step-wise algorithms (e.g. "1.awk" and "2.awk") into one single script "myprog.awk".

Single script "myprog.awk":

Code:

original input ---> myprog.awk ---> final output

Step-wise algorithm:

Code:

original input ---> 1.awk ---> output1 ---> 2.awk ---> final output

(output1 being the file that I am trying to redirect inside myprog.awk)

Do you see my point?
Do I really need to use a named pipe to do that?

AnanthaP: I tried konsolebox's 2nd strategy (using the END section), but it didn't change anything.

konsolebox · 09-25-2012, 03:59 AM

@Trd300: If you want to run 1.awk and 2.awk simultaneously, you would need a pipe or some other runtime medium for that. Using an ordinary file (in this case, the output1 file) won't work: the two awk scripts can't usefully have the file open at the same time, one for output and one for input. It would work if you ran 1.awk first and 2.awk afterwards. I'm not sure whether there are techniques to do that on an ordinary file; as far as I know there are none, or it would be hard in awk. Maybe in C, with the file opened as binary.
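For the simultaneous case, an anonymous shell pipe is usually enough and no named pipe is required. A minimal sketch, assuming two hypothetical one-line awk programs standing in for 1.awk and 2.awk (upper-casing and line numbering are placeholders, not the real steps):

```shell
# Stage 1 stands in for 1.awk, stage 2 for 2.awk (both hypothetical).
# The pipe plays the role of output1, so no intermediate file is created.
stage1='{ print toupper($0) }'
stage2='{ print NR ": " $0 }'

out=$(printf 'alpha\nbeta\n' | awk "$stage1" | awk "$stage2")
printf '%s\n' "$out"
```

With real script files the same shape would be `awk -f 1.awk input | awk -f 2.awk`.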

Anyway, perhaps you're just giving that as an example to show what you really want (myprog.awk), so how about grail's suggestion of using arrays? Do you still need the intermediate output1 file for later operations?
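The array idea can be sketched as follows, assuming the intermediate results fit in memory. The array name `results` and the two placeholder steps (upper-casing, then numbering) are hypothetical and only stand in for the real 1.awk and 2.awk logic:

```shell
# Single awk program: step 1 fills an array, step 2 runs in END over that array.
out=$(printf 'alpha\nbeta\n' | awk '
{
    # Step 1: keep each intermediate result in memory instead of writing output1
    results[++n] = toupper($0)
}
END {
    # Step 2: walk the stored results as if they were the lines of output1
    for (k = 1; k <= n; k++)
        print k ": " results[k]
}')
printf '%s\n' "$out"
```

Because the second step only starts in END, it sees the complete set of intermediate results, which is the same guarantee the two-script version gets from running 1.awk to completion first.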

Trd300 · 09-25-2012, 04:34 AM

Actually the step-wise algorithm looks more like this:

Code:

original input ---> 1.awk ---> output1
        output1 ---> 2.awk ---> final output

OK, I see what you mean.

I am not sure arrays will fix the problem. In myprog.awk, a first function1 will produce results1, and a different function2 will produce results2. As I cannot assign different values (results1 and results2) to the same variable, I thought redirecting results1 and results2 to the same file by concatenating them would sort the problem out:

Code:

BEGIN{}

<define function1 here>

<define function2 here>


{print function1($X) > "output1.txt"}

{print function2($X) >> "output1.txt"}

{    close("output1.txt");
     RS=ORS="\n";  
     while((getline < "output1.txt") > 0){<keep working on "output1.txt" as an input>}
}


END{}

I am gonna try using getline from a coprocess, although I don't know if there is a way to concatenate the results of the different functions.

grail · 09-25-2012, 04:51 AM

I must be reading this all wrong, as I am getting very lost now.

If we assume the 1.awk / 2.awk approach, my understanding is that you would create a temporary file after running 1.awk on the original input and then run 2.awk on the temporary file to produce the final output. Is this correct?

If above is correct, is it not simply a case of first performing the necessary tasks on the original data and then any follow up tasks to produce the desired output?

Again, I would request a before and after picture of the data. It seems to me you may be trying to fit a square peg into a round hole, when this is not necessarily the process you should be using.

Trd300 · 09-25-2012, 05:47 AM

Quote:

If we assume the 1.awk / 2.awk approach, my understanding is that you would create a temporary file after running 1.awk on the original input and then run 2.awk on the temporary file to produce the final output. Is this correct?

Yes it is correct.

Quote:

If above is correct, is it not simply a case of first performing the necessary tasks on the original data and then any follow up tasks to produce the desired output?

Yes it is a case like that.

Code:

original input ---> function1---> results1
                                           ----> concatenate results1 & 2 ---> process ---> final output
               ---> function2---> results2

konsolebox · 09-25-2012, 06:07 AM

Does that mean the input is read by two functions twice (one file at a time), or twice per line? How about the concatenated output as well?

PTrenholme · 09-25-2012, 03:55 PM

If you're using a version of gawk that supports it (Version 4 does; I'm not sure about version 3), you could consider something like this:

Code:

BEGIN {
# Expand the argument list so each input file name is duplicated:
  for (i=1; i<ARGC; ++i) {
# Is this a valid (readable) file?
    if ((getline test < ARGV[i]) > 0) {
      close(ARGV[i])
      for (j=ARGC;j>i;--j) {
        ARGV[j]=ARGV[j-1]
      }
      ++ARGC
      ++i # So the outer loop skips the duplicate we've added . . .
    }
   }
   process_count=0
}
BEGINFILE {
# Is this a readable file?
  if (ERRNO != "") {   # ERRNO is the empty string when the file opened successfully
#   Process the non-file value.
    nextfile
  }
  ++process_count
}
process_count==1 {
# Do the stuff for the first pass through the file . . .
}
process_count==2 {
# Do your thing for the second pass through the file . . .
}
ENDFILE {
  if (process_count==2) {
    process_count=0
#   Any other EOF processing you want . . .
  }
}
END {
# Final clean-up and termination processing . . .
}
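Where BEGINFILE and ENDFILE are not available, a common portable alternative is to name the input file twice on the command line and tell the passes apart with `NR == FNR`, which only holds while the first copy is being read. A minimal sketch with hypothetical data (three numbers, printed with their share of the total), assuming a single input file:

```shell
tmp=$(mktemp)
printf '10\n30\n60\n' > "$tmp"

# The same file is listed twice, so awk reads it in two passes.
out=$(awk '
NR == FNR { total += $1; next }        # first pass: only accumulate
{ print $1, $1 * 100 / total "%" }     # second pass: total is now known
' "$tmp" "$tmp")

rm -f "$tmp"
printf '%s\n' "$out"
```

Unlike the ARGV-duplicating version above, this relies on the caller passing the file name twice, and the `NR == FNR` test is only reliable for one input file.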

konsolebox · 09-25-2012, 04:36 PM

Actually, if it's on a per-line basis, Trd300 could just take the variable ($0 or other) that stores the input and pass it to the two functions. If it's on a per-file basis, he/she could read the file twice with:

Code:

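# Compare against 0: getline returns -1 on error, which would otherwise loop forever
while ((getline < input) > 0) {
    # ...
}

close(input)    # closing lets the next getline start again from the top

while ((getline < input) > 0) {
    # ...
}

close(input)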

The latter builds on my earlier suggestion of using only the BEGIN block.

Trd300 · 09-25-2012, 08:19 PM

Here is an example I've seen on the web.

input:

Code:

@XXXXXX|YYY
12345678
...

First, write the numbers on the same line as the preceding record, separated by a pipe (and remove the "@"):

Code:

XXXXXX|YYY|12345678
...

To do that, set the RS as "@" and delete the "\n".

Then we use 2 functions:
function1: convert block of 2 numbers to letters (according to a conversion array)
function2: reverse the string of numbers

1) From the original input file, using function1, convert blocks of 2 numbers to letters, starting from the 1st number, then the 2nd, then the 3rd, ... until the end of the string.
2) Then, still with the same input, using function2, reverse the original string of numbers and do the same as in 1) to it.
3) Concatenate the results of 1) with the results of 2) in the same output (from which we removed $2), to get this intermediate file:

Code:

XXXXXX|aceg      # start from 1st number (i.e. 12345678)
XXXXXX|bdfx      # start from 2nd number (i.e. 2345678)
XXXXXX|ceg       # start from 3rd number (i.e. 345678)
XXXXXX|dfx        # start from 4th number (i.e. 45678)
XXXXXX|eg         # start from 5th number (i.e. 5678)
XXXXXX|fx          # start from 6th number (i.e. 678)
XXXXXX|g           # start from 7th number (i.e. 78)
XXXXXX|x           # start from last number (i.e. 8)
XXXXXX|hjln       # same but after reversing the string starting from 1st number (i.e. 87654321)
XXXXXX|ikmx      # same but after reversing the string starting from 2nd number (i.e. 7654321)
etc...

4) Keep processing the intermediate file (e.g. keep the strings with more than 2 letters, or with a specific letter,...)

Here is how I tried to do it:

Code:

BEGIN{
         RS="@"; FS=OFS="|"; conv["12"]="a"; conv["23"]="b"; conv["34"]="c"; conv["45"]="d"; conv["56"]="e"; conv["67"]="f"; conv["78"]="g";
         conv["87"]="h"; conv["76"]="i"; conv["65"]="j"; conv["54"]="k"; conv["43"]="l"; conv["32"]="m"; conv["21"]="n"
         }

function convert(field, start){
         letter = ""
         block = substr (field, start, 2)
         while (block != ""){
              letter = letter (block in conv ? conv[block] : "x")
              start = start + 2
              block = substr (field, start, 2)
         }
         return letter
}

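function rev(field,    rever, l, i){       # extra parameters act as local variables
         rever = ""
         l = length(field)
         for (i=l; 0<i; i--){               # i is local here, so the caller's loop counter i is not overwritten
              rever = rever substr (field, i, 1)
         }
         return rever
}      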



NR==1{next}

NR>1{
          sub("\n", "|")       # write second line next to the preceding one
          gsub("\n", "")
         }

{
     for(i=1; i<=length($3); i++){
          print $1 FS convert($3, i) > "intermediate.txt"    # step 1) and output in a file (we removed $2)
     }
     
     for(i=1; i<=length($3); i++){
          print $1 FS convert(rev($3), i) >> "intermediate.txt"    # step 2) (we removed $2) and 3) concatenate in the same file
     }
}

##### BLOCK BELOW DOESN'T WORK ######

{
     close("intermediate.txt");
     RS=ORS="\n"; FS=OFS="|";                 # re-define RS, FS to be able to use "intermediate.txt" as if it was the input of a second command-line
     while((getline < "intermediate.txt") > 0){
           if(length($2) > 2) {print $0}          # note that previous $3 in original input becomes $2 in "intermediate.txt"
           else{next}
  
           ... <keep processing "intermediate.txt">

}

konsolebox · 09-25-2012, 08:30 PM

Code:

{
     for(i=1; i<=length($3); i++){
          print $1 FS convert($3, i) > "intermediate.txt"    # step 1) and output in a file (we removed $2)
     }
     
     for(i=1; i<=length($3); i++){
          print $1 FS convert(rev($3), i) >> "intermediate.txt"    # step 2) (we removed $2) and 3) concatenate in the same file
     }
}

For that, I think you should use >> for the first step as well, and truncate intermediate.txt once in the BEGIN block. Do that only if the current version doesn't work, that is, if the file gets truncated again every time the first step is reached.
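That suggestion can be sketched as follows. In awk, `>` truncates a file when it is opened, so a file that is closed and then written with `>` again loses its earlier contents, while `>>` keeps them. The variable name `f`, the sample records and the `first:` / `second:` prefixes are hypothetical; the stale line simulates leftovers from a previous run:

```shell
f=$(mktemp)
echo 'stale line from an earlier run' > "$f"

printf 'a\nb\nc\n' | awk -v f="$f" '
BEGIN { printf "" > f; close(f) }   # empty the file once, before any record
{
    print "first:" $0 >> f          # append, so closing and reopening is safe
    print "second:" $0 >> f
    close(f)                        # a later ">" would truncate here; ">>" does not
}'

out=$(cat "$f")
rm -f "$f"
printf '%s\n' "$out"
```

Without the BEGIN truncation, `>>` alone would keep appending to whatever an earlier run left behind, which would produce duplicated data.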

Trd300 · 09-25-2012, 08:40 PM

When I delete the "while((getline ...)" block after redirecting the output to "intermediate.txt" for the second time, the file contains the correct data.

If I do the same with ">>" at the first redirection, the file contains the data in duplicate.

The last block is the issue!

konsolebox · 09-25-2012, 08:55 PM

Sorry, I tried to examine the whole thread, but it's still not apparent what ~final~ output you really want to have. We could help better if we knew that. It's somewhat confusing to follow the procedures as described.

---- Add ----

I mean, at least we need a real example showing the data from its original form to the final output.

Trd300 · 09-25-2012, 09:22 PM

I understand it can be confusing.
My last post, the one with the code, explains pretty much everything. You don't need to look before that post.

input:

Code:

@XXXXXX|YYY
12345678

"intermediate.txt":

Code:

##### Results from the first call of the function ######
XXXXXX|aceg      # start from 1st number (i.e. 12345678)
XXXXXX|bdfx      # start from 2nd number (i.e. 2345678)                                                               
XXXXXX|ceg       # start from 3rd number (i.e. 345678)
XXXXXX|dfx        # start from 4th number (i.e. 45678)
XXXXXX|eg         # start from 5th number (i.e. 5678)
XXXXXX|fx          # start from 6th number (i.e. 678)
XXXXXX|g           # start from 7th number (i.e. 78)
XXXXXX|x           # start from last number (i.e. 8)
###### Results from the second call of the function after reversing the string ######
XXXXXX|hjln       # same but after reversing the string starting from 1st number (i.e. 87654321)
XXXXXX|ikmx      # same but after reversing the string starting from 2nd number (i.e. 7654321)
etc...                   # same as previous line until the end of the reverse string

final output (if, in the last block of the code, where I redirect "intermediate.txt" as the new input, I keep only the lines whose $2 is more than 2 letters long, for instance):

Code:

XXXXXX|aceg      # start from 1st number (i.e. 12345678)
XXXXXX|bdfx      # start from 2nd number (i.e. 2345678)
XXXXXX|ceg       # start from 3rd number (i.e. 345678)
XXXXXX|dfx        # start from 4th number (i.e. 45678)
XXXXXX|hjln       # same but after reversing the string starting from 1st number (i.e. 87654321)
XXXXXX|ikmx      # same but after reversing the string starting from 2nd number (i.e. 7654321)
etc...

The problem is the transition between the block where I use the functions and concatenate both results, and the block where I want to use "intermediate.txt" as a new input.
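One way that transition might be handled, sketched under assumptions rather than as a definitive fix: write the intermediate file in the main rule, then close it and read it back in END, so the second stage runs once after all input has been processed instead of once per record. The variable `mid`, the empty-record test, and reading into `line` (so that getline does not overwrite $0) are choices made for this sketch, and the filter is the "more than 2 letters" example from above:

```shell
tmp=$(mktemp)
out=$(printf '@XXXXXX|YYY\n12345678\n' | awk -v mid="$tmp" '
BEGIN {
    RS = "@"; FS = OFS = "|"
    n = split("12 23 34 45 56 67 78 87 76 65 54 43 32 21", key, " ")
    split("a b c d e f g h i j k l m n", val, " ")
    for (k = 1; k <= n; k++) conv[key[k]] = val[k]
    printf "" > mid; close(mid)            # start from an empty intermediate file
}
function convert(field, start,    letter, block, piece) {
    letter = ""
    block = substr(field, start, 2)
    while (block != "") {
        piece = (block in conv) ? conv[block] : "x"
        letter = letter piece
        start += 2
        block = substr(field, start, 2)
    }
    return letter
}
function rev(field,    res, i) {
    res = ""
    for (i = length(field); i > 0; i--) res = res substr(field, i, 1)
    return res
}
$0 == "" { next }                          # skip the empty record before the first "@"
{
    sub("\n", "|"); gsub("\n", "")         # join the number line onto the header line
    for (i = 1; i <= length($3); i++) print $1, convert($3, i) >> mid
    r = rev($3)
    for (i = 1; i <= length(r); i++) print $1, convert(r, i) >> mid
}
END {
    close(mid)                             # flush and rewind before reading back
    RS = "\n"                              # the intermediate file is line-oriented
    while ((getline line < mid) > 0) {
        split(line, part, "|")
        if (length(part[2]) > 2) print line
    }
    close(mid)
}')
rm -f "$tmp"
printf '%s\n' "$out"
```

If the intermediate file is not needed for anything else, the same END logic could loop over an array filled in the main rule instead of reading a file.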