[SOLVED] how do i keep the header

tabbygirl1990 · 01-17-2014, 12:27 PM

hi guys,

i wrote an awk script that does the stuff below

(filtering a file row by row based on criteria in each column) except it does not include the original file's header. can someone please show me how to keep the header in the output file for the columns of data that are kept?

Code:

BEGIN {
       FS = " "
       }
       {
        if ($3=="42" && $5=="the answer to the universe")
        printf("%f %f %d %f %f %s\n", $1, $2, $3, $11, $12, $5)
        }
END{}

and I run the script using the command line:

Code:

awk -f row_parsing_tool.awk  input.inp > output.out

thanks guys! tabby

schneidz · 01-17-2014, 12:39 PM

Code:

awk 'NR == 1 {print $0}' tabbygirl.txt

tabbygirl1990 · 01-17-2014, 12:53 PM

that doesn't do what i need at all

that will simply pull the header (the first line) of the file, which has not been filtered.

i need to have the headers that go along with the filtered file, in this case the headers of columns $1, $2, $3, $11, $12, $5

i'm thinking it will take another

Code:

printf

statement and then a

Code:

cat

selfprogrammed · 01-17-2014, 05:07 PM

I often would like to know how to do that too. But it seems that all these editing tools will only treat all lines equally, with the same rules or commands applied to every selected line.

That leaves a two pass procedure as the most universal solution.
Get the headers to one temp file, the filtered rows to another temp file, then cat them back together.
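
A minimal sketch of that two-pass idea, assuming the header is the first line of the file. The file names, column names and values here are made up for illustration, and the filter is only a stand-in for the real criteria.

```shell
# Made-up three-column sample; not the real data from this thread.
cat > twopass.inp <<'EOF'
NAME OPERATOR SPEC
r1 SM 5
r2 DK 4
EOF

hdr=$(mktemp)    # temp file for the header columns
body=$(mktemp)   # temp file for the filtered rows

# Pass 1: pull the wanted columns from the header line only.
awk 'NR == 1 { print $1, $3 }' twopass.inp > "$hdr"
# Pass 2: filter the data rows, keeping the same columns in the same order.
awk 'NR > 1 && $2 == "SM" { print $1, $3 }' twopass.inp > "$body"

# Stitch the two pieces back together and clean up.
cat "$hdr" "$body" > twopass.out
rm -f "$hdr" "$body"
cat twopass.out
```

The input file is read twice, which is the price of keeping each awk command trivial.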

Firerat · 01-17-2014, 07:03 PM

schneidz did provide the answer !

Code:

awk 'NR == 1 {print $0}' tabbygirl.txt

is essentially

if line is 1 then print line 1

add it to your script

Code:

BEGIN {
   FS = " "
   }
   {
   if ( NR == 1 ) {
       print $0
       } else {
        if ($3=="42" && $5=="the answer to the universe")
           printf("%f %f %d %f %f %s\n", $1, $2, $3, $11, $12, $5)
       }
   }
END{}

untested, but you get the idea

edit, obviously replace print $0 with "printf <desired formatting> $1, $2, $3, $11, $12, $5",
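
A rough sketch of what that edit could look like, using the same column order as the original printf. The sample file is made up (single-letter column names, with "yes" standing in for the real $5 criterion), and LC_ALL=C is only there to pin the decimal point in the %f output.

```shell
# Made-up sample: 12 columns, values chosen for illustration only.
cat > sample.inp <<'EOF'
A B C D E F G H I J K L
1.5 2.5 42 x yes x x x x x 3.5 4.5
1.0 2.0 7 x yes x x x x x 3.0 4.0
9.5 8.5 42 x no x x x x x 7.5 6.5
EOF

cat > header_same_columns.awk <<'EOF'
# Header line: the names are plain strings, so print is enough.
NR == 1 { print $1, $2, $3, $11, $12, $5; next }

# Data lines: same columns, same order, numeric formatting via printf.
$3 == "42" && $5 == "yes" {
    printf("%f %f %d %f %f %s\n", $1, $2, $3, $11, $12, $5)
}
EOF

LC_ALL=C awk -f header_same_columns.awk sample.inp > sample.out
cat sample.out
```

Because the header rule ends with next, the filter rule never sees the header line, so the %f and %d formats are only applied to data rows.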

syg00 · 01-17-2014, 07:12 PM

Save the fields of interest from the first record into some variable. If you eventually find anything to print, print the saved variable once (set a flag), then the record(s) to follow. Use print rather than printf, as the header will be simple strings, and there is no need to use cat or any other external command.

D'oh - too slow at typing.

smeezekitty · 01-18-2014, 12:08 AM

If two commands are acceptable, you can do this

Code:

head -n 1 input.inp > output.out && awk -f row_parsing_tool.awk  input.inp >> output.out

Firerat · 01-18-2014, 04:02 PM

@smeezekitty

just a few problems with that approach

the first is the field order, head -n1 is not much use since it won't 'reorder' the fields

another is highlighted by syg00
that is "do we want a header if we would have no data?"

To get round both use a single awk script
have the first record (NR == 1) 'capture' the header fields to some variable (BEGIN itself runs before any input is read),
now test each record
when the condition is 'true' check the header variable, if set print it and then unset it (or set it to null, e.g. Header=""), then print the data line, repeat with all records

should only get the header once, and only when there was actual output data
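
Something along these lines, as an untested-in-anger sketch: the header is captured on the first record (NR == 1) rather than in BEGIN, since BEGIN runs before any input is read. The sample data, the file names and the op/spec variables are made up for illustration.

```shell
# Made-up four-column sample; not the real data from this thread.
cat > deferred.inp <<'EOF'
NAME OPERATOR SPEC SCORE
r1 SM 5 0.91
r2 DK 4 0.90
r3 SM 5 0.96
EOF

cat > deferred_header.awk <<'EOF'
# Remember the wanted header fields, but do not print them yet.
NR == 1 { header = $1 " " $2 " " $4; next }

# op and spec are passed in with -v on the command line.
$2 == op && $3 == spec {
    if (header != "") {   # first matching row: emit the header, then clear it
        print header
        header = ""
    }
    print $1, $2, $4
}
EOF

# One run with matching rows, one run where nothing matches.
awk -v op=SM -v spec=5 -f deferred_header.awk deferred.inp > deferred_match.out
awk -v op=XX -v spec=5 -f deferred_header.awk deferred.inp > deferred_nomatch.out
cat deferred_match.out
```

The second run leaves deferred_nomatch.out empty, so there is no orphan header when no row passes the filter.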

smeezekitty · 01-18-2014, 04:50 PM

I was under the impression that she wanted the first line verbatim, but I could be wrong

grail · 01-18-2014, 11:38 PM

I am not sure exactly which header we are talking about, ie where it appears in the data (perhaps because no example data was provided (hint)).

However, if we are able to assume that the header is in fact the first row within the file, simply adding this criterion to the existing one would do the trick.

I would add that the current setting of FS is also not required as white space is the default.

So it could just be:

Code:

NR == 1 || ($3=="42" && $5=="the answer to the universe"){printf("%f %f %d %f %f %s\n", $1, $2, $3, $11, $12, $5)}

tabbygirl1990 · 01-20-2014, 12:08 PM

schneidz - i'm sorry, i didn't mean to be snooty at all, one of those days, thank you for your help.

when i ran firerat's script in post 5, my output file came back empty

when i ran grail's command line in post #10, my output file came back with the line after the line that met the filter criteria, but also no header

so, here's an example space-delimited input file

Code:

DATE		TIME       OPERATOR VERSION  RUN_ID  SPEC      DTG      FAIL  END    ONGOING  FEATURE_1   FEATURE_2      FEATURE_3 GOODNESS     
12/04/2013      6:00:011.27   SM    2.6.8 6   90501   5   921996008.31 FALSE 5  *     5      0.711503131 5660093.929    6.22      0.91
12/05/2013      6:00:011.3     DK    2.6.8 6   90501   4   921996009.31 FALSE 8  *     8      0.567142359 5660095.848    0.53      0.90
12/06/2013      6:00:011.41    SM    2.6.8 5   90503   2   921996009.01 FALSE 8  *     8      0.708699814 5660097.221    0.54      0.91
12/06/2013      6:00:011.41    JF    2.6.8 6   90501   5   921996010.31 FALSE 3  *     3      0.142189285 5660100.259    -0.27      0.08
12/09/2013      6:00:011.55    SM    2.6.8 6   90501   1   921996010.01 FALSE 8  *     8      0.213247275 5660103.596    -0.27      0.08
12/10/2013      6:00:011.41   SM    2.6.8 4   90503   5   921996011.31 FALSE 8  *     8      0.91836074 5660103.492    0.53       0.91
12/10/2013      6:00:011.32   SM    2.6.8 4   90501   5   921996011.01 TRUE 1
12/11/2013      6:00:011.21   DK    2.6.8 4   90501   3   921996015.01 TRUE 1
12/11/2013      6:00:011.42   SM    2.6.8 4   90501   3   921996015.01 FALSE 10  *    10      0.864147301 5660105.265    0.622     0.91
12/12/2013      6:00:011.50   JF    2.6.8 4   90501   3   921996015.31 FALSE 8  *     8      0.539123318 5660104.795    0.622     0.92
12/13/2013      6:00:011.15   SM    2.6.8 4   90503   5   921996016.01 FALSE 2  *     2      0.922633758 5660109.457    7.05      0.96

if i filter on OPERATOR=SM and SPEC=5 then what I'd like to get out is

Code:

DATE		TIME       OPERATOR VERSION  RUN_ID  SPEC      DTG      FAIL  END FEATURE_1   FEATURE_2      FEATURE_3 GOODNESS     
12/04/2013      6:00:011.27   SM    2.6.8 6   90501   5   921996008.31  FALSE  5  0.711503131 5660093.929    6.22      0.91
12/10/2013      6:00:011.41   SM    2.6.8 4   90503   5   921996011.31  FALSE  8  0.91836074 5660103.492    0.53      0.91
12/10/2013      6:00:011.32   SM    2.6.8 4   90501   5   921996011.01  TRUE  1
12/13/2013      6:00:011.15   SM    2.6.8 4   90503   5   921996016.01  FALSE  2  0.922633758 5660109.457    7.05      0.96

the files that i'm trying to process are much much bigger but i think this little one covers all the cases

thanks so much guys!!!

tabby

grail · 01-20-2014, 07:56 PM

Tabby, I see an issue prior to the solution: you have more columns of data than you have of header. This means that once you pass the VERSION column, the header and data become out of sync. Not sure if your current formatting has allowed for this?

Also, looking at your data, your reference in your format for printf to %f and %d will not match most of the data presented.

So I will leave these 2 issues to you, but the sort of thing I would look at doing is:

Code:

BEGIN{    fmt[1] = "%s %s %s %s %s\n" # header
          fmt[2] = "<the format for other lines>"
}

NR == 1 || ( $3 == "SM" && $7 == 5 ){
    printf(fmt[NR==1?1:2],<choose your columns here>)
}

tabbygirl1990 · 01-20-2014, 09:49 PM

thanks sooo much grail !!!

yep, that's actually the way the files are after VERSION, the headers and the data columns don't line up 1 for 1 and when FAIL is set to TRUE then no more data is written to that line

i know i can't have a different number of argument types in the format statement of fmt[2], but is there a way to "PAD" the control characters in fmt[1] ?

Code:

fmt[1] = "%s  %s  %s  %s PAD %s  %s  %s  %s %s  %s  %s  %s %s\n" 
fmt[2] = " printf("%s  %s  %s  %s  %d  %d  %d %f  %s  %d  %f  %f  %f  %f\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $13,  $14, $15, $16)"

as i understand "<choose your columns here>" it would be the column calls in fmt[2] yes?

Code:

NR==1?1:2

i thought that was really cool using the ?: operator for NR, i'll have to think of using that more

thanks!!!

tabby

grail · 01-20-2014, 10:02 PM

For more on padding have a look here

Unfortunately I seem to have led you astray for fmt[2]. It should be of the same format as fmt[1] but using different modifiers to display the data you need, like

Code:

fmt[2]= "%s  %s  %s  %s  %d  %d  %d %f  %s  %d  %f  %f  %f  %f\n"

Whereas the "<choose your columns here>" part would be:

Code:

printf(fmt[NR==1?1:2],$1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $13,  $14, $15, $16)

Hope that is a little clearer
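
One possible way around the mismatched field counts, offered as a sketch and not as the fmt[] approach above: drop the fixed printf formats and loop over a column list, stopping when a row runs out of fields. The three data rows are taken from the example file with the whitespace collapsed. The column lists assume 14 header fields and 16 data fields (the VERSION value splits in two and the "*" counts as a field). The file names and the emit() helper are made up.

```shell
cat > runs.inp <<'EOF'
DATE TIME OPERATOR VERSION RUN_ID SPEC DTG FAIL END ONGOING FEATURE_1 FEATURE_2 FEATURE_3 GOODNESS
12/04/2013 6:00:011.27 SM 2.6.8 6 90501 5 921996008.31 FALSE 5 * 5 0.711503131 5660093.929 6.22 0.91
12/05/2013 6:00:011.3 DK 2.6.8 6 90501 4 921996009.31 FALSE 8 * 8 0.567142359 5660095.848 0.53 0.90
12/10/2013 6:00:011.32 SM 2.6.8 4 90501 5 921996011.01 TRUE 1
EOF

cat > select_columns.awk <<'EOF'
BEGIN {
    # Header and data rows have different field counts, so each gets its own list.
    nh = split("1 2 3 4 5 6 7 8 9 11 12 13 14", hcols, " ")      # skips ONGOING
    nd = split("1 2 3 4 5 6 7 8 9 10 13 14 15 16", dcols, " ")   # skips "*" and ONGOING
}

# Print the listed fields of the current record, stopping at the first
# column number beyond NF so short (FAIL = TRUE) rows are not padded.
function emit(cols, n,    i, line) {
    line = ""
    for (i = 1; i <= n && (cols[i] + 0) <= NF; i++)
        line = (i == 1) ? $(cols[i]) : line " " $(cols[i])
    print line
}

NR == 1               { emit(hcols, nh); next }
$3 == "SM" && $7 == 5 { emit(dcols, nd) }
EOF

awk -f select_columns.awk runs.inp > runs.out
cat runs.out
```

Everything is printed as the original text, so no %d or %f conversion touches the values, and the TRUE row simply comes out with its ten fields.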

smeezekitty · 01-20-2014, 11:15 PM

Here is a working Perl solution

Code:

$oc = "SM";
$sp = "5";

<>;
print("DATE		TIME       OPERATOR VERSION  RUN_ID  SPEC      DTG      FAIL  END FEATURE_1   FEATURE_2      FEATURE_3 GOODNESS\n");
while(<>){
    ($date, $time, $opr, $version, $v2, $run, $spec, $dtg, $fail, $end, $p, $ongoing, $f1, $f2, $f3, $goodness) = split(' ',$_);
    if($oc eq $opr && $sp == $spec){print ($date, "      ", $time, "   ", $opr, "    ", $version, " ", $v2, "   ", $run, "   ", $spec, "   ", $dtg, "  ", $fail, "  ", $end, "  ", $f1, " ", $f2, "    ", $f3, "      ", $goodness, "\n");}
}