Shell script to find/replace build new TAB record

ljungers · 01-18-2007, 10:18 AM

Hi to all. I'm a newbie to the world of shell scripting, but I hope someone will have an idea I can use. I have a large text file that contains one big memo field with several constants, each followed by a value most of the time. I plan to convert this file to a MySQL table that will have several fields plus a large text memo for whatever remains.

For the most part this memo field has constants like “P:” for the phone number and “Name:” for the person's name, with their values following, in the front part of the memo.
I have been using the sed command with the constants that I know have values and are physically in the correct positions. Example: FixedRecord=`sed -e "s/[\ ]* P: / /"` to change the phone number constant to a tab in front of the phone number value. After several -e expressions of sed, the FixedRecord is echoed to a file for later processing.

Now I need to figure out a way to do the same thing for constants that may or may not have values following them in the rest of the record. Does anyone know of an easy way to check if a constant exists and, if it does exist, whether it has a value?

For example, I want to check for the existence of the “Item Number: Desc: Start Date: End Date: Duration: Project: Supervisor:” constants, then check if they have values following them. So far I have written the following script:
RecordPart_1to19=`OrigRecord | cut -d' ' -f1-19` # first 19 tab fields of record
RecordPart_20=`OrigRecord | cut -d' ' -f20` # rest of record to search for new fields
RecordPart_20 contains the following for tab conversion.
Item Number: H14J649
Missing Desc: # this constant is missing and so is its value, TAB still required
Start Date: 9/16/2004
End Date:
Duration: 0017
Project:
Supervisor: John Doe

Results I want would be that the new record I build would contain “$RecordPart_1to19 TAB H14J649 TAB TAB 9/16/2004 TAB TAB 0017 TAB TAB John Doe TAB $RecordPart_20”
# RecordPart_20 when outputted should only contain what remains after the found constants and constant values are removed.

I just need some ideas, or to be pointed to an example script that shows how to accomplish this task.

Thanks in advance for any and all help.

matthewg42 · 01-18-2007, 11:36 AM

Hi ljungers,

while I think it's possible to do what you want to do in a shell script, it isn't going to be very efficient. It might be better to move up a level of sophistication from sed to the likes of awk or Perl. This way, you will only have to read the input record once, and can have variables and more logic inside your program.

Does this "memo" get printed on standard output when you execute the program OrigRecord? This is how it appears from the sample you provided above.

If so, I'd probably go about it like this (I'm a Perl fan, so that's what I'd use):

Code:

#!/usr/bin/perl -w

use strict;

open(my $input, "-|", "./OrigRecord") or die "couldn't execute OrigRecord: $!\n";
my %data = ();
while (<$input>) {
        chomp;     # this takes the \n off the end of the line
        my ($field, $value) = split(/:/, $_, 2);
        next unless defined $value;   # skip lines that have no "field: value" colon
        $value =~ s/^\s+//;   # strip leading whitespace
        $field = lc($field);  # make field name lower case
        $data{$field} = $value;
}
close($input);

print "\$RecordPart_1to19";
foreach my $field ( "item number", "missing desc", "start date", "end date", "duration", "project", "supervisor" ) {
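        # print an empty field when the constant is absent; a literal "0" value is kept
        print "\t" . (defined $data{$field} ? $data{$field} : "");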
}
print "\n";

You could replace the \t with the literal string TAB for testing to show that you are getting enough tabs, or do what I did - use the \t, but send the output through "od -tc". The output with the data you provided above is as follows:

Code:

0000000   $   R   e   c   o   r   d   P   a   r   t   _   1   t   o   1
0000020   9  \t   H   1   4   J   6   4   9  \t  \t   9   /   1   6   /
0000040   2   0   0   4  \t  \t   0   0   1   7  \t  \t   J   o   h   n
0000060       D   o   e  \n
0000065

Is that what you wanted?
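
As a quicker check than reading the od dump by eye, the tab count can also be verified by counting tab-delimited fields per line. This is only a minimal sketch; count_fields is a hypothetical helper name, not something from the script above.

```shell
#!/bin/sh
# count_fields (hypothetical helper): print the number of
# tab-delimited fields found on each line of standard input.
count_fields() {
    awk -F '\t' '{ print NF }'
}
```

A record like the one dumped above contains 7 tabs, so piping it through count_fields should report 8 fields.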

ljungers · 01-18-2007, 12:08 PM

Thanks matthewg42 for the reply. I forgot to show how I was getting the OrigRecord variable that the cut is used on. This is how I'm doing it.

ORGIFS="$IFS"
IFS="
"
for OrigRecord in `cat the_input_file_with_memo.txt`
do
RecordPart_1to19=`echo "$OrigRecord" | cut -d' ' -f1-19` # 1st 19 fields have been built
etc., etc., etc.

I have looked at Perl, but for now I'm trying to keep my head above water with the shell scripts. I would like to use awk or whatever it takes to accomplish this last task. Would it be better to build an array with the constants and a condition indicator to show whether each constant was found and whether its value was found?

I hope this helps some. Thanks
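
For the record loop described in this post, one possible variation is a `while read` loop, which reads one line at a time without changing IFS for the whole script and without the filename expansion that an unquoted `cat` substitution undergoes. The sketch below is only an illustration under assumptions: split_records and the HEAD:/MEMO: labels are hypothetical names, the fields are assumed to be separated by real tab characters (the default delimiter for cut), and `-f20-` is used so that everything from field 20 onward is kept.

```shell
#!/bin/sh
# split_records (hypothetical helper): for each line of the file named
# in $1, print the first 19 tab-delimited fields and the remainder.
split_records() {
    # IFS= and -r keep leading whitespace and backslashes intact;
    # the extra test handles a final line that lacks a trailing newline.
    while IFS= read -r OrigRecord || [ -n "$OrigRecord" ]; do
        # cut splits on tabs by default
        RecordPart_1to19=$(printf '%s\n' "$OrigRecord" | cut -f1-19)
        RecordPart_20=$(printf '%s\n' "$OrigRecord" | cut -f20-)
        printf 'HEAD: %s\n' "$RecordPart_1to19"
        printf 'MEMO: %s\n' "$RecordPart_20"
    done < "$1"
}
```

It could be called as `split_records the_input_file_with_memo.txt`. Starting cut twice per record is slow for tens of thousands of records, so a single awk pass over the whole file may be preferable once the logic is settled.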

ljungers · 01-18-2007, 03:05 PM

Quote:

Originally Posted by matthewg42

The output with the data you provided above is as follows:

Code:

0000000   $   R   e   c   o   r   d   P   a   r   t   _   1   t   o   1
0000020   9  \t   H   1   4   J   6   4   9  \t  \t   9   /   1   6   /
0000040   2   0   0   4  \t  \t   0   0   1   7  \t  \t   J   o   h   n
0000060       D   o   e  \n
0000065

Is that what you wanted?

Yes, that is what I want to accomplish, plus the remainder of $RecordPart_20 less the found constants and their values.

Is that possible to do?

matthewg42 · 01-18-2007, 07:29 PM

Quote:

Originally Posted by ljungers

Yes, that is what I want to accomplish, plus the remainder of $RecordPart_20 less the found constants and their values.

Is that possible to do?

Anything's possible with perl

Alas, I don't understand what you mean.

ljungers · 01-19-2007, 09:46 AM

I have a large ASCII file of 29233 records, all text characters. I have been able to build the first 19 fields because the record allowed for this (the original tabs were probably removed), thus creating a record with 19 tab-delimited fields and a text memo field, giving

RecordPart_1to19 # first 19 tab delimited fields
RecordPart_20 # field 20 that has constants and values plus transcription notes

Now I wish to build new fields from data in this field20 and place them after the existing 19 fields. There are anywhere from 12 – 18 known constants in this field20 “ITEM NO: NAME: START DATE: etc. etc.” that may or may not be in this record. When a constant is present it may or may not have a value.

Example: let's say we have 6 possible constants and values
Constant1='ITEM NO:'
Constant2='DESC:'
Constant3='NAME:'
Constant4='START DATE:'
Constant5='END DATE:'
Constant6='PROJECT CODE:'

Field20 contains “ITEM NO: 27J426 NAME: START DATE: 12/05/2006 END DATE: PROJECT CODE: 1B2C This is text at the end of field20 and can be as big as 8-12kb“

The results would be

Value1=27J426
Value2=
Value3=
Value4=12/05/2006
Value5=
Value6=1B2C

Plus the constants and values should be removed/deleted from field 20 because we have the values for the new tab delimited fields and do not want this data repeated. The output record should be something like this (spaces used to make it more readable and TAB=hex 09, VT 0B)

“$RecordPart_1to19 TAB Value1 TAB Value2 TAB Value3 TAB Value4 TAB Value5 TAB Value6 TAB $RecordPart_20”

I hope this explains the task I’m trying to accomplish. I wish to keep this process in a shell script, using whatever commands I need. With shell scripts I am able to sort of keep my head above water while keeping everything I have in the same language (shell scripts).

Thanks for any help and ideas on this.
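
The task described in this post could be sketched in a shell function that hands the work to awk. This is a rough illustration rather than a finished solution, and it rests on assumptions: extract_fields is a hypothetical name, the constant list is the six-item example above, the constants do not contain commas, and each value is taken to be a single token with no spaces in it. A constant that is missing, or that is followed directly by another known constant, yields an empty value.

```shell
#!/bin/sh
# extract_fields (hypothetical helper): read field-20 text on stdin, one
# record per line, and print the six values followed by the remaining
# text, all separated by tabs.
extract_fields() {
    awk -v consts='ITEM NO:,DESC:,NAME:,START DATE:,END DATE:,PROJECT CODE:' '
    BEGIN { n = split(consts, c, ",") }

    # Return 1 when string s begins with any known constant.
    function starts_with_const(s,    i) {
        for (i = 1; i <= n; i++)
            if (index(s, c[i]) == 1) return 1
        return 0
    }

    {
        rest = $0
        out = ""
        for (k = 1; k <= n; k++) {
            val = ""
            p = index(rest, c[k])
            if (p > 0) {
                before = substr(rest, 1, p - 1)
                after = substr(rest, p + length(c[k]))
                sub(/^ +/, "", after)
                # Assumption: a value is one token without spaces.
                if (!starts_with_const(after) && match(after, /^[^ ]+/)) {
                    val = substr(after, 1, RLENGTH)
                    after = substr(after, RLENGTH + 1)
                    sub(/^ +/, "", after)
                }
                # Drop the constant and its value from the memo text.
                rest = before after
            }
            out = out val "\t"
        }
        print out rest
    }'
}
```

For the sample Field20 above this should give 27J426, two empty fields, 12/05/2006, one empty field, 1B2C, and then the trailing text, each separated by a tab. The full record could then be assembled with something like `printf '%s\t%s\n' "$RecordPart_1to19" "$(printf '%s\n' "$RecordPart_20" | extract_fields)"`.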

matthewg42 · 01-19-2007, 04:47 PM

This might be a good way to split up the last portion, but it is based on the assumption that every field has a value and that there is no space in the value:

Code:

#!/usr/bin/perl -w

use strict;

my $input = "ITEM NO: 27J426 NAME: START DATE: 12/05/2006 END DATE: PROJECT CODE: 1B2C";

print "INPUT DATA IS: $input\n\n";
while ( length($input) > 0 ) {
        if ( $input =~ s/^([^:]+):\s*([^\s]+)(\s+)?// ) {
                print "field name is $1\n";
                print "field value is $2\n";
                print "remaining input is $input\n\n";
        }
        else {
                print "oh dear, we can't match the pattern \"field: value\"\n";
                last;
        }
}