Please Help with AWK code to parse XML messages

JamesOwen · 01-30-2012, 02:04 PM

Hi guys,

Can I please get some help with this code?

I have an XML feed file, which is a rapidly changing temporary file, and I need to capture the content of this file as soon as data arrives.

Example of the data

Quote:

[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="John Smith"><Age="23"><D.O.B="11-10-1988"> <Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Emy Williams"><Age="23"><D.O.B="01-05-1988"> <Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Jack Adam"><Age="66"><D.O.B="24-07-1945"> <Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Charlie Daniel"><Age="38"><D.O.B="15-08-1973"> <Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Ruby James"><Age="38"><D.O.B="11-03-1973"> <Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Sophie Thomas"><Age="20"><D.O.B="12-09-1991"><Gender="Female">"

Required data output

Quote:

8:30,Male,23,1
8:31,Female,23,1
8:32,Female,30,4
8:33,Male,50,10

Time is current time.

This is the awk code that I have so far, but it doesn't do what I need it to do. Can I please get help with it?

All I want the code to do is run for 2 minutes, process the counts, write them to the output, and then do the same process again and again.

Code:

awk 'BEGIN { INTERVAL=120;    "date +%s"|getline sec;
    NEXT=sec+120;}

    {
        if(sec >= NEXT)
        {
           printf( "\nSummary\n" );
           for( x in agcount )
              printf( "%s,%d\n", x, agcount[x] ) | "sort";

           NEXT=sec+120;
        }

        gsub( ">", "" );        # strip unneeded junk and make "foo bar" easy to capture
        gsub( " ", "~" );
        gsub( "<", " " );

        for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
        {
            if( split( $(i), a, "=" ) == 2 )
            {
                gsub(  "\"", "", a[2] );
                gsub(  "~", " ", a[2] );
                values[a[1]] = a[2];
            }
        }

        #gcount[values["Gender"]]++;         # collect counts
        #acount[values["Age"]]++;
        agcount[values["Gender"]","values["Age"]]++;

        printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
    }' input-file

I can't use gawk or the cron scheduler.

Will anyone be able to help me with this?

Any help would be greatly appreciated.

James

cbtshare · 01-30-2012, 04:29 PM

Can it be a shell script?

Also, what are the numbers at the end of the output?

JamesOwen · 01-30-2012, 04:42 PM

Yes, it can be a shell script.

The numbers at the end are counts for each gender and age, so if there are 2 males of age 34, then instead of writing male,34,1 twice, it's easier to have male,34,2.

Thanks

cbtshare · 01-30-2012, 09:12 PM

Below is a shell script to help you out.

data.txt has

Code:

 cat data.txt 
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="John Smith"><Age="23"><D.O.B="11-10-1988"><Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Emy Williams"><Age="23"><D.O.B="01-05-1988"><Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Jack Adam"><Age="66"><D.O.B="24-07-1945"><Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Charlie Daniel"><Age="38"><D.O.B="15-08-1973"><Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Ruby James"><Age="38"><D.O.B="11-03-1973"><Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Sophie Thomas"><Age="20"><D.O.B="12-09-1991"><Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Sophie Thomas"><Age="20"><D.O.B="12-09-1991"><Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Sophie Thomas"><Age="20"><D.O.B="12-09-1991"><Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Sophie Thomas"><Age="20"><D.O.B="12-09-1991"><Gender="Female">"

Code:

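#!/bin/bash
# Author: cbtshare
# Purpose: Grab specific fields from a file, format them, and print a count of each combination to the screen.

FILELOCATION=/var/www/data.txt

# Start with an empty result file.
> result.txt

while IFS= read -r line
do
    # The text before the first comma is the date and time.
    DATE=$(echo "$line" | cut -d "," -f1)
    # The 8th "="-separated field holds the quoted gender value.
    GENDER=$(echo "$line" | cut -d "=" -f8 | cut -d '"' -f2)
    # The 5th "<"-separated field is the Age element.
    AGE=$(echo "$line" | cut -d "<" -f5 | cut -d '"' -f2)

    echo "$AGE,$GENDER,$DATE" >> result.txt
done < "$FILELOCATION"

# Count identical lines.
sort result.txt | uniq -c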

output

Quote:

1 66,Male,[date+time]
1 38,Male,[date+time]
1 38,Female,[date+time]
1 23,Male,[date+time]
1 23,Female,[date+time]
4 20,Female,[date+time]

grail · 01-30-2012, 11:54 PM

Please show the exact format for date + time.

Assuming the file is always in the same format (and not currently including date + time), the following works:

Code:

awk -F"[<>]+" '{gsub(/^.*="|"$/,"",$(NF-1));gsub(/^.*="|"$/,"",$5);total[$5,$(NF-1)]++}END{for( x in total)print x,total[x]}' file

Of course we can easily tidy the output.
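
A minimal sketch of one way to tidy it, assuming the same field layout as the sample lines (Age in the 5th field and Gender in the second-to-last field when splitting on runs of < and >). The function name summarise and the inline sample lines are only for illustration. It prints gender,age,count and sorts the result:

```shell
#!/bin/sh
# Sketch only: assumes every line has the same layout as the sample data,
# i.e. Age is field 5 and Gender is the second-to-last field when the line
# is split on runs of "<" and ">".
summarise() {
    awk -F'[<>]+' '
    {
        age = $5
        gender = $(NF - 1)
        # Strip everything up to and including =" and the closing quote.
        gsub(/^.*="|"$/, "", age)
        gsub(/^.*="|"$/, "", gender)
        total[gender "," age]++
    }
    END {
        for (key in total)
            print key "," total[key]
    }' "$@" | sort
}

# Demo on two illustrative lines; pass a file name instead to read a file.
summarise <<'EOF'
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="John Smith"><Age="23"><D.O.B="11-10-1988"><Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0?"><data changeMsg><NAME="Emy Williams"><Age="23"><D.O.B="01-05-1988"><Gender="Female">"
EOF
```

On those two lines it prints Female,23,1 and Male,23,1. Building the key as gender "," age instead of using total[a,b] avoids the unprintable SUBSEP character in the output.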

JamesOwen · 01-31-2012, 03:32 PM

@cbtshare,

This is a good way for me to start, but the only problem is that I am reading the data from a rapidly changing ksh file.

I am using a pipe to read the ksh file, and what I want is to read from the pipe every 2 minutes and write to an output file.

Is this something that can be done using a shell script?

Also, is there a way to add the counts to the loop?

Thank you all again

chrism01 · 01-31-2012, 07:58 PM

I assume you mean you are reading the output from a ksh file, not reading the ksh prog file.

Where does the 2 mins thing come from?
Does the ksh prog produce a new file every 2 mins?
Does it output for 2 mins then overwrite the same file?
In either case, synchronisation is key to avoid losing data.

In either case (or even if this is a continuous stream being output, e.g. like a logfile), I would highly recommend http://search.cpan.org/~mgrabnar/Fil...0.99.3/Tail.pm which is designed to handle those situations.
I've used it myself; very handy.

cbtshare · 01-31-2012, 09:12 PM

Quote:

Originally Posted by JamesOwen

@cbtshare,

This is a good way for me to start, but the only problem is that I am reading the data from a rapidly changing ksh file.

I am using a pipe to read the ksh file, and what I want is to read from the pipe every 2 minutes and write to an output file.

Is this something that can be done using a shell script?

Also, is there a way to add the counts to the loop?

Thank you all again

Yes, definitely. You can use cron to run the script at any interval you want. You can add counts as well:

count=0, and to increase the count, let "count+=1"

You can also use the sleep command anywhere you want to pause the script.
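
As a rough sketch of a counter and a pause inside a loop (one option when cron is not available): run_repeatedly is a hypothetical helper name, and the pause is done with sleep, which suspends the script for the given number of seconds.

```shell
#!/bin/sh
# Sketch only: run_repeatedly is a hypothetical helper name.
# Usage: run_repeatedly SECONDS ROUNDS COMMAND [ARGS...]
run_repeatedly() {
    interval=$1
    rounds=$2
    shift 2
    count=0
    while [ "$count" -lt "$rounds" ]
    do
        "$@"                      # run the command once
        count=$((count + 1))      # increase the counter
        if [ "$count" -lt "$rounds" ]
        then
            sleep "$interval"     # pause before the next round
        fi
    done
    echo "completed $count runs"
}

# Example: print the time three times, one second apart.
run_repeatedly 1 3 date +%H:%M:%S
```

With an interval of 120 the command would run once every 2 minutes for as many rounds as requested.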

JamesOwen · 02-02-2012, 01:56 PM

@chrism01,

Quote:

Where does the 2 mins thing come from?
Does the ksh prog produce a new file every 2 mins?
Does it output for 2 mins then overwrite the same file?
In either case, synchronisation is key to avoid losing data.

The 2-minute thing came about because I want the script to loop for 2 minutes and not until the end of the file. This will help me log the messages coming in every 2 minutes.

The ksh file produces a new message every couple of seconds, and each new message overwrites the previous one.

And yes, you are right, I want to avoid losing data.

Quote:

In either case (or even if this is a continuous stream being output, e.g. like a logfile), I would highly recommend http://search.cpan.org/~mgrabnar/Fil...0.99.3/Tail.pm which is designed to handle those situations.

This link is Perl code, and I am not familiar with Perl. I have never used it. Also, wouldn't File::Tail just get the end of the file? In my file, each new message overwrites the previous one as soon as it arrives, so I am not sure this will do what I want.

@cbtshare,

I can't use the cron scheduler, and this is why I am not sure how I could solve this issue.

All, please help.

Thank you all again

James

chrism01 · 02-02-2012, 06:25 PM

Quote:

The ksh file produces a new message every couple of seconds, and each new message overwrites the previous one.
...
In my file, each new message overwrites the previous one as soon as it arrives

Taking this to mean what it says, you are saying that the output_file (using that term loosely) only ever actually contains one msg (the latest), which is overwritten by the next/new msg approx(!) every 1 or 2 secs.
In that case, I don't get the 2 mins thing at all. You've got to grab each msg immediately or you will lose it...
So, you do need to use something like that Perl module or e.g.

Code:

tail -f output_file | your post-processing prog

In fact, you could use the first example on that Perl page pretty much as is.

Going back to bash soln, maybe instead of having the ksh file write to the ever-changing file, just pipe the output directly thus

Code:

ksh_prog | post_process_prog

# or output to log and pipe to prog
ksh_prog | tee ksh.log | post_process_prog
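
For the post-processing program itself, a rough sketch in plain awk (no gawk extensions) might look like the following. It assumes the same line layout as the sample data, relies on date +%s being available (as in the original awk attempt), and uses a hypothetical function name, summarise_stream. Note that awk only checks the clock when a new line arrives, so a summary is written on the first message after the interval has passed, and once more when the input ends.

```shell
#!/bin/sh
# Sketch only: summarise_stream is a hypothetical name, and the field
# positions assume the sample line layout (Age in field 5, Gender second to last).
summarise_stream() {
    # $1 = seconds between summaries (defaults to 120)
    awk -F'[<>]+' -v interval="${1:-120}" '
    # Current time in epoch seconds, via the external date command.
    function now(   t) {
        "date +%s" | getline t
        close("date +%s")
        return t + 0
    }
    # Print one line per gender/age pair, then start a fresh window.
    function emit_summary(   key, stamp) {
        "date +%H:%M" | getline stamp
        close("date +%H:%M")
        for (key in total)
            print stamp "," key "," total[key]
        split("", total)    # portable way to empty the array
    }
    BEGIN { next_flush = now() + interval }
    {
        if (now() >= next_flush) {
            emit_summary()
            next_flush = now() + interval
        }
        age = $5
        gender = $(NF - 1)
        gsub(/^.*="|"$/, "", age)
        gsub(/^.*="|"$/, "", gender)
        total[gender "," age]++
    }
    END { emit_summary() }'
}

# Intended use (ksh_prog stands for the program producing the messages):
#   ksh_prog | summarise_stream 120 >> summary.log
```

Each output line has the form time,gender,age,count. Calling date for every message is slow, but with one message every couple of seconds that cost should not matter. Depending on the awk in use, output redirected to a file may be buffered and appear with some delay.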

JamesOwen · 02-05-2012, 09:55 AM

Thank you all for replying, but I don't think I have explained myself well.

I have a .ksh file which contains XML messages.

What I need is to parse and capture these XML messages, then store the output in a log file.

The two-minute thing is something I came up with because the log file could become large if every message is written to it. But if I collect the messages for 2 minutes, I will be able to get a summary output like this example:

Quote:

1 66,Male,[date+time]
1 38,Male,[date+time]
1 38,Female,[date+time]
1 23,Male,[date+time]
1 23,Female,[date+time]
4 20,Female,[date+time]

[date+time] is the time when the output was written to the log file.

Can anyone please help with these questions?

How can I parse the XML messages using this Perl code (the link)?

After this, how can I save the output to a log file?

Any advice will be appreciated.

Thank you all again,
James

JamesOwen · 02-06-2012, 12:57 PM

Hi guys,

Can someone please help with this issue?

Thank you all

JamesOwen · 02-07-2012, 01:43 PM

Guys,

I don't want to bump my thread, but can I please get help with this problem?

Thanks

James

grail · 02-08-2012, 02:11 AM

I am not sure I understand your current issue. You have been presented with code to parse the XML and retrieve the data. Redirecting this into a new file should be trivial.
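
As a minimal sketch of the redirection, assuming the summary lines arrive on standard input: log_with_time is a hypothetical helper name and summary.log is a placeholder file name. Each line is prefixed with the time at which it was logged and appended to the file.

```shell
#!/bin/sh
# Sketch only: log_with_time is a hypothetical helper name.
# Reads summary lines on standard input and appends them to the given
# log file, each prefixed with the time at which it was logged.
log_with_time() {
    logfile=$1
    stamp=$(date '+%Y-%m-%d %H:%M')
    while IFS= read -r line
    do
        printf '%s,%s\n' "$stamp" "$line"
    done >> "$logfile"
}

# Example with two made-up summary lines; the log file name is a placeholder.
printf '%s\n' 'Male,23,1' 'Female,23,1' | log_with_time "${TMPDIR:-/tmp}/summary.log"
```

Using >> rather than > means earlier summaries are kept when the next batch is logged.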

Are you able to explain where you are now stuck?

JamesOwen · 02-08-2012, 02:29 PM

@grail,

Which code are you referring to?

If you are referring to the bash code, it does almost the same as my AWK code.

I could parse and retrieve the data and then redirect it to an output file. But the only problem with AWK is that it reads the whole file at once, and what I want is to read part of the file each minute or so.

If this is not possible, then I would like to parse the messages and log the output to a file, but this should include the current time and the count of the messages.

For the Perl link, I am getting this error, and I am not sure how it is supposed to parse XML messages.

Quote:

Can't locate File/Tail.pm in @INC (@INC contains:

Any help with this will be appreciated.

Thanks again
James