Need script to clean up file

h/w · 01-16-2008, 01:43 PM

Quote:

Originally Posted by stevemcb

I didn't get a chance to try it last night, but when I ran it this morning, there are still issues (lines that start with a pipe).

Strange, when I had tried it, it seemed to work. At least it did on the sample you'd given above.

Let's just wait for other members to chime in now. This isn't a biggie really, but with my rusty skills it's taking me more time than it should.

chrism01 · 01-16-2008, 05:09 PM

I think what the OP is saying is that if there is a 'newline' followed immediately by a pipe symbol '|', then remove just the newline...

h/w · 01-16-2008, 05:45 PM

Quote:

Originally Posted by chrism01

I think what the OP is saying is that if there is a 'newline' followed immediately by a pipe symbol '|', then remove just the newline...

Right, which is a pain to do with sed (for me, at least). The awk script I'd written earlier tried to do that: it checks for a '|' at the start of a line and, if it finds one, appends that line to the previous line.
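For anyone following along, the logic described here (a line that begins with '|' gets glued onto the previous line, removing only the newline) could be sketched roughly as below. This is just an illustration in Python, not the awk script mentioned above. The function name `join_pipe_lines` is made up for the example, and it assumes the pipe is the very first character of a continuation line.

```python
from typing import Iterable, Iterator


def join_pipe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield records, gluing any line that starts with '|' onto the previous one."""
    current = None  # record being built; None until the first line arrives
    for raw in lines:
        line = raw.rstrip("\n")
        if current is not None and line.startswith("|"):
            # Continuation: drop only the newline, keep the pipe itself.
            current += line
        else:
            if current is not None:
                yield current
            current = line
    if current is not None:
        yield current  # flush the last record


if __name__ == "__main__":
    sample = ["aaa\n", "bbb\n", "|ccc\n", "ddd\n", "|eee\n", "|fff\n"]
    for record in join_pipe_lines(sample):
        print(record)
```

It only ever holds one record in memory, so it should cope with a large file if you pass it an open file object instead of a list.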

chrism01 · 01-17-2008, 12:03 AM

Ok, not pretty, but seems to work:

Code:

#!/usr/bin/perl -w

use locale;             # Ensure correct charset for eg 'uc()'
use strict;             # Enforce declarations

my (
    $out_rec, $new_rec, $in_rec, $pipe
   );

open(DATA, "<t.dat") or die "Can't open t.dat $!\n";
while( defined($in_rec = <DATA> ) )
{
    chomp($in_rec);

    if(substr($in_rec, 0, 1) eq '|' )
    {
        $new_rec = substr($in_rec, 1, length($in_rec) -1);
        $out_rec .= $new_rec;
        $pipe = 1;
    }
    else
    {
        $out_rec .= "\n".$in_rec;
        $pipe = 0;
    }

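    # A non-pipe line starts a new record, so emit what has accumulated.
    if( !$pipe )
    {
        print "$out_rec";
        $out_rec = "";
    }
}
# Flush any text still buffered when the file ends on a pipe line.
print $out_rec if defined($out_rec);
print "\n";
close(DATA) or die "Can't close t.dat $!\n";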

data file:

Code:

aaasasdasdasdddsd
dddddddddddddddd
|ffffffffffffffffff
gggggggggggggg
hhhhhhhhhhhhhhhh
|jjjjjjjjjjjjjjjj
|kkkkkkkkkkkkkkkk
llllllllllllllll

Output:

Code:

aaasasdasdasdddsd
ddddddddddddddddffffffffffffffffff
gggggggggggggg
hhhhhhhhhhhhhhhhjjjjjjjjjjjjjjjjkkkkkkkkkkkkkkkk
llllllllllllllll

Note the extra blank line at the start .... grrrr

PS: now that's odd, there were no blank lines in my input file and 1 extra at the start of my output, but when I copy/pasted it, it's different ... hmmmmmmmmmmmm

stevemcb · 01-17-2008, 06:01 AM

Sorry, was away last night and didn't see the new posts. I'll try the perl script in a little while once I get my head screwed back on straight this morning.

stevemcb · 01-17-2008, 06:12 AM

BTW, the way I have been editing the file (by hand) is to visually identify a line that starts with a pipe, put my cursor to the left of the pipe and hit the backspace key, which makes it part of the previous line (record).

That information is just in case any clarification was needed.

HTH, and thanks for the help.
Stevemcb

h/w · 01-17-2008, 08:39 AM

Quote:

Originally Posted by stevemcb

BTW, the way I have been editing the file (by hand) is to visually identify a line that starts with a pipe, put my cursor to the left of the pipe and hit the backspace key, which makes it part of the previous line (record).

That information is just in case any clarification was needed.

HTH, and thanks for the help.
Stevemcb

So, that means the line starts with a space followed by a pipe. Which is probably why my earlier awk script failed, as it was looking for lines starting with pipes.
Try this mod:

Code:

awk 'BEGIN{nxt="";}{curr=$0;getline nxt;if(index(nxt, " |")== 1){print curr nxt;}else{print $0;}}' < inputfile > outputfile

angrybanana · 01-17-2008, 02:35 PM

This works with your sample data.

Code:

$ awk -F'\n?\\|' '$1=$1' OFS='|' RS= uglydb
2003123|A15690195|3|N|1994-03-15 00:00:00|OPS$LSANDERS|SOUTHERN|COMPANY|64A PERIMETER CENTER EAST||ATLANTA|GA|US|30346|||REPLACEMENT IS DOA|REPLACE DOA REPLACEMENT|1993-12-16 00:00:00|1993-12-21 00:00:00|1993-12-21 00:00:00|1994-01-11 00:00:00|1994-03-15 00:00:00|N||THIS DOES NOT APPEAR TO BE A DUPLICATE OF CLAIM A14659942. PRIOR CLAIM WAS SERVICED BY NW COMPUTER SUPPORT IN WA STATE. SERIAL #'S MUST HAVE BEEN TYPED IN INCORRECTLY.|A|||CDR-74|||||0||||||||||||00000194.0013.0005
2001235|A15078491|3|N|1994-06-28 00:00:00|OPS$LSANDERS|NPPD||PO BOX 499||COLUMBUS|NE|US|68601|||DOA|REPALCED MONITOR|||||1994-03-15 00:00:00|N|Pending more than 60 days with no resolution; claim rejected|YELLOW STICKY ATTACHED TO CLAIM INDICATED THAT MONITOR WAS RETURNED ON MRA NUMBER #41801 ON 02/22/94.....SOP MRA POINTS TO CLAIM NUMBER #A15078478|R|||JC-1532VMA-2|||||||||||||||||00000194.0014.0005

Edit: Perl way of doing the same thing. Both of these load the whole file into RAM, so if the db is HUGE it might not be a good idea.

Code:

perl -0pe 's#\n\|#\|#g' uglydb

Another Edit: Just realized I forgot to add a 'g' at the end of the regex; it worked with the example because it was only 2 records.
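The whole-file substitution in the Perl one-liner amounts to replacing every newline-plus-pipe with a bare pipe. A rough Python equivalent, purely for illustration, might look like this. The name `join_whole_text` is hypothetical, and like the one-liners above it assumes the entire text fits in memory.

```python
import re


def join_whole_text(text: str) -> str:
    """Remove every newline that is immediately followed by '|'."""
    # The lookahead keeps the pipe in place and removes just the newline.
    return re.sub(r"\n(?=\|)", "", text)


if __name__ == "__main__":
    print(join_whole_text("a|b\n|c\nd|e\n|f\n|g\n"), end="")
```

`re.sub` replaces every match by default, so there is no separate 'g' flag to forget here.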

chrism01 · 01-17-2008, 04:46 PM

angrybanana: That's why I designed mine not to load the whole file into memory.
h/w: as per my prev post, I think it's actually newline then pipe.
He's saying how he manually fixed it by backspacing to delete the newline.

angrybanana · 01-17-2008, 08:14 PM

Quote:

Originally Posted by chrism01

angrybanana: That's why I designed mine not to load the whole file into memory.

You're right, my answer wasn't good.

Here's a better/corrected version of my awk code. (This needs GNU awk.)

Code:

awk '$1=$1' RS='\n[^|]' uglydb

stevemcb · 01-18-2008, 05:29 AM

I took a fresh copy of the file and used the "awk 'BEGIN{nxt="";}{curr=$0;getline nxt;if(index(nxt, " |")== 1){print curr nxt;}else{print $0;}}' < inputfile > outputfile" on it.

It went from 308,000 records down to 190,000 records, and a quick visual scan of the results makes me think that it did, indeed, process all the records that started with a pipe. Now I need to go back to the guy who created the file to begin with and verify how many records were in the database to make sure I didn't lose any.

I'll get back to you as soon as I can.

Thanks,
Stevemcb
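One possible sanity check while waiting on the original record count: joining continuation lines should only ever delete newlines, so the input and output ought to be identical once newlines are ignored. A minimal sketch of that idea follows. The helper name `only_newlines_removed` is invented for this example, and it assumes both files are small enough to read into memory as strings.

```python
def only_newlines_removed(before: str, after: str) -> bool:
    """True if `after` equals `before` once all newlines are ignored."""
    return before.replace("\n", "") == after.replace("\n", "")


if __name__ == "__main__":
    original = "a\n|b\nc\n"
    joined = "a|b\nc\n"      # continuation glued on, nothing else touched
    truncated = "a\nc\n"     # a line went missing
    print(only_newlines_removed(original, joined))
    print(only_newlines_removed(original, truncated))
```

If this comes back False for the real input and output files, some text was dropped or altered along the way, not just re-joined.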

h/w · 01-18-2008, 05:55 AM

You could also check using:
wc -l inputfile
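grep -c '^ |' inputfile

The first command gives the total number of lines and the second counts the lines that start with a space and a pipe. The difference between the two should give you the number of records.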
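The same arithmetic can be sketched in a few lines of Python, in case it is handier than comparing two command outputs by eye. The function name `expected_record_count` is made up for the example, and the default prefix of a bare '|' is an assumption: the thread leaves open whether continuation lines start with a pipe or with a space and then a pipe, so pass `" |"` if it is the latter.

```python
from typing import Iterable


def expected_record_count(lines: Iterable[str], prefix: str = "|") -> int:
    """Return total lines minus the lines that begin with `prefix`."""
    total = 0
    continuations = 0
    for line in lines:
        total += 1
        if line.startswith(prefix):
            continuations += 1
    return total - continuations


if __name__ == "__main__":
    sample = ["aaa\n", "bbb\n", "|ccc\n", "ddd\n", "|eee\n", "|fff\n"]
    print(expected_record_count(sample))
```

For the six sample lines this prints 3, matching the three records you would get after joining. Passing an open file object instead of a list works the same way and reads one line at a time.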