Help needed in formatting the Output file

raosr020 · 08-28-2014, 12:45 PM

Hi All,

I need your help resolving the issue below.

I have a file called "data.txt" with the following lines:

Code:

TT: <tell://me/sreenivas> 
<tell://me/100>

TT: <tell://me/sudheer> 
<tell://me/300>

TT: <tell://me/sreenivas> 
<tell://me/200>

TT: <tell://me/sudheer> 
<tell://me/400>

I want the output in the format below. Please help me.

Code:

TT: <tell://me/sreenivas>
<tell://me/100>
<tell://me/200>

TT: <tell://me/sudheer> 
<tell://me/300>
<tell://me/400>

Explanation of the above output:
If the text between "<tell://me/" and ">" is the same on several of the lines that contain "TT", keep only one of those lines.
That line should be followed by all the data lines that came after each of the matching "TT" lines.

Looking forward to your help as soon as possible. Let me know if you have any questions.

With Regards,
SRK

jpollard · 08-28-2014, 03:07 PM

This is possible (though awkward) to do with "awk", and likely much easier in perl.

The approach is simply to save every TT entry in a hash table of arrays. Each time you get a duplicate TT entry, you push the additional records onto the end of the array associated with that key (until you reach a blank line).

After you reach the end of the file, you can output everything: print each key in the hash, followed by each entry in its nested array.

The major problem occurs if the input file has millions of records to process. You could run out of memory adding entries to the hash table, or one of the arrays.
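
To make the idea concrete, here is a minimal sketch in Python rather than perl or awk. The function name group_records and the sample text are made up for illustration, and it assumes each "TT:" line is followed directly by its data lines. A dict of lists stands in for the hash table of arrays; a Python dict (3.7 and later) also keeps keys in the order they were first inserted.

```python
def group_records(lines):
    """Group data lines under the "TT:" line that precedes them.

    Hypothetical helper, not from the thread: `lines` is any iterable
    of text lines; the result maps each distinct "TT:" line to the
    list of data lines collected for it.
    """
    table = {}      # "TT:" line -> list of data lines
    key = None      # current "TT:" line, or None outside a section
    for raw in lines:
        line = raw.strip()
        if line.startswith("TT:"):
            key = line
            table.setdefault(key, [])   # create the list only once
        elif line.startswith("<tell:") and key is not None:
            table[key].append(line)     # push onto this key's list
        elif not line:
            key = None                  # a blank line ends the section
    return table


# Assumed sample, shortened from the data in the first post.
sample = """TT: <tell://me/sreenivas>
<tell://me/100>

TT: <tell://me/sudheer>
<tell://me/300>

TT: <tell://me/sreenivas>
<tell://me/200>
""".splitlines()

for tt_line, records in group_records(sample).items():
    print(tt_line)
    print("\n".join(records))
    print()
```

Everything is held in memory until the end, so the concern about very large inputs applies to this sketch as well.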

jefro · 08-28-2014, 03:34 PM

It might be easier to use a Python script, or (I forget the name) a program that bundles all the parts needed so you don't have to ship Python itself.

I might be able to write it, but I'm sure a good Python programmer could do this in maybe 10 lines or less. Maybe 3 lines.
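
As a rough idea of how short that could be, here is an untuned sketch. The names TT_RE and merge are invented for this example, and the line shapes are assumed from the sample data in the first post. It keys on the text between "<tell://me/" and ">", as the original request describes, so trailing spaces on the "TT:" lines do not create separate groups.

```python
import re
from collections import defaultdict

# Assumed shape of a "TT:" line; captures the name after "<tell://me/".
TT_RE = re.compile(r"^TT:\s*<tell://me/([^>]*)>")


def merge(text):
    """Return the regrouped text, one block per distinct name."""
    groups = defaultdict(list)   # name -> data lines, in first-seen order
    name = None
    for line in text.splitlines():
        m = TT_RE.match(line)
        if m:
            name = m.group(1)
            groups[name]         # touch the key so first-seen order is kept
        elif line.startswith("<tell://me/") and name is not None:
            groups[name].append(line.strip())
    return "\n\n".join(
        "\n".join([f"TT: <tell://me/{n}>"] + recs)
        for n, recs in groups.items()
    )
```

With the file from the first post, something like print(merge(open("data.txt").read())) should print the regrouped records.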

jpollard · 08-28-2014, 07:36 PM

Try this:

Code:

#!/usr/bin/perl

$k = "";

while (<>) {
  
    if (/^TT: \</) { # a key is identified
        $k = $_;
        $tbl{$k} = [] if (!defined($tbl{$k})); # create new entry only if it doesn't exist
    } elsif (/^\<tell:/) {
        push(@{$tbl{$k}},$_); # add a record to the array for this key (explicit dereference; push on a bare reference fails on newer perls)
    } elsif (/^$/) {          # blank lines have things start over
        $k = "";
    }
}

# entire file has been read

foreach $k (keys(%tbl)) {
    print $k, @{$tbl{$k}};   # output the key record, and all data records associated with it
    print "\n";              # and restore the blank line between sections
}

Make the script executable and run as "script <input_file >output_file".

This works for your sample input.

syg00 · 08-28-2014, 09:00 PM

Are those keys guaranteed to come out in "entry" order when printed?

jpollard · 08-28-2014, 10:02 PM

Nope. Because the keys (the "TT:" records) are stored in a hash table, they could come out in any order.

Now if you are referring to the "<tell:" records, then yes - these are in the array in the same order they were read.

It is possible to maintain the order of the "TT:" records, though, by using another hash table (and a record counter). If the key is undefined in the current hash table (tbl in my example), just use the record number as the key in this second hash, with the key used in the original table as its value.

At the end of the data gathering, sort the keys of the new hash table numerically (they are record numbers), then use each one to retrieve the key from the new hash table, and then output the data from the first hash table.

This should only add several new lines and change two existing lines:

Code:

#!/usr/bin/perl

$k = "";
$record_counter=0;

while (<>) {
    $record_counter++;   # new record read
    if (/^TT: \</) { # a key is identified
        $k = $_;
        if (!defined($tbl{$k})) { # only add definitions if they don't exist
            $tbl{$k} = [];        # new TT: line seen
            $newtbl{$record_counter} = $k;   # and where it was seen
        }
    } elsif (/^\<tell:/) {
        push(@{$tbl{$k}},$_); # add a record to the array for this key (explicit dereference)
    } elsif (/^$/) {          # blank lines have things start over
        $k = "";
    }
}

# entire file has been read

foreach $i (sort { $a <=> $b } keys(%newtbl)) { # the keys of newtbl are record numbers,
                                  # so a numeric sort will force the correct order
    $k = $newtbl{$i};        # get the key for this entry
    print $k, @{$tbl{$k}};   # output the key record, and all data records associated with it
    print "\n";              # and restore the blank line between sections
}

I haven't tested this version, but I think it would work.

syg00 · 08-28-2014, 10:32 PM

I was trying something similar with assoc arrays in awk - like you said above, ugly ...

Easy enough to get the data as wanted, but not sorted as requested. I thought I recalled hashes had the same issue.

jpollard · 08-29-2014, 04:22 AM

I was thinking about it more, and it is also possible to use an array to determine the order - just push the key onto a new array when it is not found. The advantage this has is that it eliminates the need for the sort, and even the record counter. Since the array of keys is in the proper order things just come out right:

Code:

#!/usr/bin/perl

$k = "";

while (<>) {
    if (/^TT: \</) { # a key is identified
        $k = $_;
        if (!defined($tbl{$k})) { # only add definitions if they don't exist
            $tbl{$k} = [];        # new TT: line seen
            push(@newarray,$k);   # and add to the order it was seen
        }
    } elsif (/^\<tell:/) {
        push(@{$tbl{$k}},$_); # add a record to the array for this key (explicit dereference)
    } elsif (/^$/) {          # blank lines have things start over
        $k = "";
    }
}

# entire file has been read

foreach $k (@newarray) {     # the keys in newarray are in the order read
    print $k, @{$tbl{$k}};   # output the key record, and all data records associated with it
    print "\n";              # and restore the blank line between sections
}

This version hasn't been tested either, but is really not that different from the original.

As they say "Always a different way to do the same thing".

BTW, it shouldn't even be necessary to "start things over", so the "elsif (/^$/)" branch and its $k = "" line could be dropped. I thought resetting the key would help point out errors by creating a null key reference, but if there are no records other than "TT:" and "<tell:", no null keys are ever recorded.

grail · 08-29-2014, 10:59 AM

Here are some alternatives:

Code:

#!/usr/bin/awk -f
# Note: o[a][...] is an array of arrays, a gawk 4.0+ extension, so awk must be gawk here.

BEGIN{ FS = "[/>]"
       c = 1 
     }   

/^TT/{
  a = $0

  if(!(a in o)) 
    b[c++] = a 

  getline
  
  o[a][$(NF-1)] = $0
}

END{
  for(i = 1; i < c; i++)
  {
    print b[i]
    for(j in o[b[i]])
      print o[b[i]][j]
  }
}

Or maybe a confusing one-liner:

Code:

ruby -ne 'o ||= {}; if /^TT/;a = $_;o["#{a}"] ||= [];end;o["#{a}"] << $_ if /^</;END{o.each{|k,v| puts "#{k}#{v.sort.join}"}}' file