[SOLVED] Awk with missing fields

grail · 04-26-2012, 12:47 PM

Ok ... so here is what I came up with. Obviously you can change the output as needed. Also, I went with an awk script instead of bash calling awk, but I am sure you can edit as required.

Code:

#!/usr/bin/awk -f

match($0,/^[0 ](.{8}) (.{8}) (.{4}) (.{8}) (.{8}) (.{3}) (.{2}) (.{8}) (.{8}) (.{8}) (.{3})(.{4}) (.{5}) (.{28}) (.{5}) (.*)/,f){
    for(i=1; i <= 16; i++){
        gsub(/^ *| *$/,"",f[i])
        printf "%s",f[i](i==16?"\n":"|")
    }
}

And you run it like so:

Code:

./awk_script --re-interval file

From gawk version 4 onwards you no longer need that switch.

PS. You pulled a dodgy with fields 11 and 12, as they do not have a space between them but a hyphen.

This was corrected and allowed for.
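The three-argument match() used above is a gawk extension. Where only a plain awk is available, the same cut-and-trim idea can be sketched with substr(). This is only an illustration: the three-field layout and the name trim_fixed are made up for the example and are not the real 16-field record.

```shell
# Hypothetical layout (not the real record): 1 flag character, then
# fields of width 8, 4 and 3, each separated by one space.
trim_fixed() {
  awk '{
    # "start,length" pairs for each field in the assumed layout
    n = split("2,8 11,4 16,3", spec, " ")
    out = ""
    for (i = 1; i <= n; i++) {
      split(spec[i], p, ",")
      f = substr($0, p[1], p[2])
      gsub(/^ +| +$/, "", f)          # strip padding on both sides
      out = out (i > 1 ? "|" : "") f
    }
    print out
  }'
}

# Build a padded sample line and convert it.
printf ' %-8s %-4s %-3s\n' AB 12 X | trim_fixed
```

For the sample line this prints AB|12|X. Extending it to the real record would mean listing all 16 start/length pairs, which is why the single regex is the tidier choice when gawk is on hand.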

Linux_Kidd · 04-26-2012, 03:20 PM

that is elegant. so how to add/combine that to post #24 script, i need the 1st awk to ignore lines of data (aka noise) that are not needed, etc.

ah, as you see F76-F82 is consecutive in raw data w/o h20, my bad. i needed to separate them, etc.
11 = $76$77$78
12 = $79$80$81$82

grail · 04-27-2012, 12:22 AM

No combining required; that replaces both scripts.

Firstly I would try running the script with a test file that contains the "noise". The point here is that unless the "noise" exactly matches the 'match' function, it will be ignored.

If this does not work because you have lines with the exact same format but wish to ignore them based on a pattern, simply put the pattern in slashes (//) and 'and' it (&&) with match.
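A minimal sketch of that last suggestion follows. The regex is a short stand-in for the real 16-field one, SKIP is a made-up noise marker, and keep_wanted is a made-up name. The two-argument match() is used so it runs in any awk; the gawk array form should combine with && in the same way.

```shell
keep_wanted() {
  # A line is kept only if it does NOT contain SKIP *and* fits the layout regex.
  awk '!/SKIP/ && match($0, /^[0 ][A-Z]+ [0-9]+$/) { print substr($0, 2) }'
}

printf '%s\n' ' AAA 001' ' SKIP 002' '0BBB 003' 'header line' | keep_wanted
```

The "header line" entry is dropped because it does not fit the layout at all, while the SKIP line fits the layout but is rejected by the extra pattern.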

Let me know if any of this is unclear.

Linux_Kidd · 04-27-2012, 07:43 AM

the elegance works fine, except i need to run it in bash script, or, call this awk from a bash script and be able to send output to $FILE along with $i, etc.
see, i use two scripts, one to verify the directory and the 2nd (awk processing) does the rest. if you notice i pass $FILE to the awk script (actually its a bash script, i just name it .awk , etc) and i print out $2 from the awk script into last field of my file. i do this so that if any data shows up funny i know which file caused the problem, etc. the script(s) currently process 187 files, and a new file gets added daily.

Code:

#!/bin/bash -l
# written by me
umask 026
NOW=`date +%F%T`
FILE="$HOME/HEAP.$NOW.txt"
if [ -d "$HOME/HEAP" ]
then
  echo ""
  echo "HEAP folder was found in $HOME."
  echo "Please wait, processing files..."
  echo "1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|FILENAME" >> $FILE
   for i in `ls $HOME/HEAP/`; do /var/scripts/convert.awk $HOME/HEAP/$i $FILE; done
  echo ""
  echo "All done, your output file is $FILE"
  echo "have a nice day..."
  echo ""
else
  echo ""
  echo "HEAP folder in directory $HOME does not exist."
  echo "Please make sure this directory exists and has"
  echo "files in it."
  echo ""
fi

grail · 04-27-2012, 10:20 AM

Please do not take this the wrong way ...

Code:

for i in `ls $HOME/HEAP/`; do /var/scripts/convert.awk $HOME/HEAP/$i $FILE; done

I am hoping this means you can confirm that absolutely no files contain spaces, tabs or newlines in the name. Otherwise this is a big no-no. Much safer to use:

Code:

for i in $HOME/HEAP/*; ...
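A rough sketch of why the glob is safer, using a throwaway directory and a made-up helper name (count_files): the glob hands over each name whole, even with a space in it, and the -e test skips the literal pattern the shell leaves behind when nothing matches.

```shell
count_files() {
  n=0
  for f in "$1"/*; do
    [ -e "$f" ] || continue   # an unmatched glob stays as a literal "*"
    n=$((n + 1))
  done
  echo "$n"
}

dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/with space.txt"
count_files "$dir"            # glob: one item per file
ls "$dir" | wc -w             # words in the ls output, which is what a backtick loop iterates over
rm -r "$dir"
```

With these two files the glob loop counts 2 items, while the ls output splits into 3 words.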

I have to back up here as another part looks ... unusual:

Code:

FILE="$HOME/HEAP.$NOW.txt"

Is the dot (.) between HEAP and $NOW correct? Or should it be a slash like:

Code:

$HOME/HEAP/$i

Here is a way you could make it an awk script:

Code:

#!/usr/bin/awk -f

BEGIN{
    if(ARGV[1] ~ "HEAP/\\*"){
        print "HEAP folder in directory",ENVIRON["HOME"],"does not exist or"
        print "no files were available"
        exit
    }

    file = strftime("%F%T")".txt"

    print "1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|FILENAME" > file
}

match($0,/^[0 ](.{8}) (.{8}) (.{4}) (.{8}) (.{8}) (.{3}) (.{2}) (.{8}) (.{8}) (.{8}) (.{3})(.{4}) (.{5}) (.{28}) (.{5}) (.*)/,f){
    for(i=1; i <= 16; i++){
        gsub(/^ *| *$/,"",f[i])
        # the last column is the input file name, to match the FILENAME header
        printf "%s",f[i](i==16?"|" FILENAME "\n":"|") > file
    }
}

END{
    # exit in BEGIN still runs END, so only report when an output file was set
    if(file != ""){
        print "All done, your output file is",file
        print "have a nice day..."
    }
}

Then you would call it like so:

Code:

/var/scripts/convert.awk $HOME/HEAP/*

Have a play and let me know if you have any questions.

Linux_Kidd · 04-27-2012, 10:59 AM

ok, suggestion for using * for filename understood, but it is guaranteed the file names have no spaces. i did however make the change for the better, etc.

as for FILE="$HOME/HEAP.$NOW.txt"
this is correct, this is my output file. i name my output file at run time which is named with a timestamp to the second. the script will never be ran twice within the same second by same uid, etc. so everytime it runs the output is a unique file (for some uid's having date/time in the filename is easier than ls -al, etc).

not sure i have time to test this elegance, might need to leave what i have since i have already trained the uid's on how to run what i have, which is "log in via ssh, type /var/scripts/process.sh and hit enter".

grail · 04-27-2012, 11:56 AM

No probs with the file name ... I was a little confused as the start of the file name was the same as the directory ... so just checking

Quote:

log in via ssh, type /var/scripts/process.sh and hit enter

So process.sh then calls /var/scripts/convert.awk? You could just as easily have them call a single script, as they do now, with no need to then branch off to a second one. Just put the code that does the work in the script they call.

Quote:

not sure i have time to test this elegance

I can fully understand, as putting things into a live environment that you aren't 100% sure of is not flash.

As I have been playing, I thought I might show you another way (just to keep that mind of yours guessing (lol)):

Code:

#!/bin/bash

regex='^[0 ](.{8}) (.{8}) (.{4}) (.{8}) (.{8}) (.{3}) (.{2}) (.{8}) (.{8}) (.{8}) (.{3})(.{4}) (.{5}) (.{28}) (.{5}) (.*)'

umask 026
NOW=$(date +%F%T)
FILE="$HOME/HEAP.$NOW.txt"
if [ -d "$HOME/HEAP" ]
then
  echo ""
  echo "HEAP folder was found in $HOME."
  echo "Please wait, processing files..."
  echo "1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|FILENAME" >> $FILE

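  for i in "$HOME"/HEAP/*
  do
    while IFS="" read -r line
    do
      if [[ $line =~ $regex ]]
      then
        # n, not i, so the outer loop's file name in $i is left alone
        for (( n = 1; n <= 16; n++ ))
        do
          # read with the default IFS trims the padding around the field
          read field <<< "${BASH_REMATCH[n]}"
          (( n == 16 )) && end="\n" || end="|"
          echo -ne "$field$end" >> "$FILE"
        done
      fi
    done < "$i"
  done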
  
  echo ""
  echo "All done, your output file is $FILE"
  echo "have a nice day..."
  echo ""
else
  echo ""
  echo "HEAP folder in directory $HOME does not exist."
  echo "Please make sure this directory exists and has"
  echo "files in it."
  echo ""
fi

Linux_Kidd · 04-27-2012, 01:02 PM

i tried your bash script, no dice.

i ran your bash vs my 2 scripts. each way processes 187 txt files in the dir.

your bash:
2min30sec producing 28,989 lines of output

my scripts:
28sec producing 29,190 lines of output (this output was verified to be correct)

not sure where it choked. i'll use this for reference. thnx.

grail · 04-27-2012, 02:05 PM

Yeah the awk will always run quicker as that is its thing, but of course the bash has the nicety of being all bash

Obviously it worked fine on the data you gave me for testing. The tests on my machine also show awk outperforms bash even for the small amount of data:

Code:

# bash
real	0m0.034s
user	0m0.012s
sys	0m0.016s

#awk
real	0m0.008s
user	0m0.000s
sys	0m0.004s

I do find the size of the gap a little odd, i.e. just over 200 out of 29000+. I would have expected a larger gap if a recurring item was being missed.

Oh well ... it was a bit of fun
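One thing that can make a `while read` loop come up short, though nothing in the thread confirms it is the cause here, is a file whose last line has no trailing newline: read returns failure on that line, so the loop body never sees it. A small sketch (the function names are made up for the example):

```shell
count_plain() {
  n=0
  while IFS= read -r line; do n=$((n + 1)); done
  echo "$n"
}

count_with_guard() {
  n=0
  # read fails at end of input but still fills $line; the -n test rescues it
  while IFS= read -r line || [ -n "$line" ]; do n=$((n + 1)); done
  echo "$n"
}

printf 'one\ntwo' | count_plain        # input whose final line lacks a newline
printf 'one\ntwo' | count_with_guard
```

The first call counts 1 line and the second counts 2. awk treats an unterminated final line as a record, so it would not lose lines this way.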

grail · 04-27-2012, 02:27 PM

Ok ... one last edition which I finally worked out ... just seemed cool (just the part doing the work):

Code:

regex='^[0 ](.{8}) (.{8}) (.{4}) (.{8}) (.{8}) (.{3}) (.{2}) (.{8}) (.{8}) (.{8}) (.{3})(.{4}) (.{5}) (.{28}) (.{5}) (.*)'

# set once, outside the loop; "${temp[*]}" then joins items with the first IFS character
IFS="|$IFS"

for i in "$HOME"/HEAP/*
do
  while IFS="" read -r line
  do
    if [[ $line =~ $regex ]]
    then
      read -a temp <<<"${BASH_REMATCH[*]:1}"
      echo "${temp[*]}"
    fi
  done<"$i"
done

unset IFS

And also 3 or 4 times faster than previous bash (on the small data)
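That last version rests on a shell rule: a quoted `[*]` expansion (and likewise "$*") joins its items with the first character of IFS. A minimal sketch, with a made-up function name, that scopes the IFS change to a subshell so nothing needs to be unset afterwards:

```shell
# The function body is a subshell, so the caller's IFS is untouched.
join_pipe() (
  IFS='|'
  printf '%s\n' "$*"
)

join_pipe alpha "two words" omega
```

This prints alpha|two words|omega. In the version above, IFS also still contains whitespace, so `read -a` would split a field with an embedded blank into two items. It is worth checking against real data before relying on it.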