Rogue line terminators in csv file

DKMCG · 12-13-2012, 02:04 PM

Hi everyone, I have a little issue with rogue line terminators in a CSV file I receive from a client. I am trying to figure out if there is a way to clean up these bad terminators with a script. Here is an example:

Before:
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG
HIJK~6789
AB~12~CD~345~EFG HIJK~6789

After:
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789

I'm thinking the best way to do this is to scan for each line terminator and, once one is found, grab the first 9 characters after it. If characters 3, 6 and 9 -eq ~ then it's a good terminator; otherwise replace it with a white space (or delete it). The problem is I have no idea how to code it. I have Perl on my server as well as Korn shell. Any help would be appreciated.

Thanks in advance,
DM

danielbmartin · 12-13-2012, 03:07 PM

Help us to help you. You gave a sample input file (that's good) and some words (also good). Construct a sample output file which corresponds to your sample input and post it here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

DKMCG · 12-13-2012, 03:16 PM

Quote:

Originally Posted by danielbmartin

Help us to help you. You gave a sample input file (that's good) and some words (also good). Construct a sample output file which corresponds to your sample input and post it here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

Hi Daniel,

Thanks for the response. I updated the post to include the desired results.

danielbmartin · 12-13-2012, 04:17 PM

Quote:

Originally Posted by DKMCG

If characters 3, 6 and 9 -eq ~ then its a good terminator, else replace it with a white space (or delete it).

With this input file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~6666

This code ...

Code:

awk -F "" '{if ($3=="~" && $6=="~" && $9=="~") print}'  $InFile

... produces this output file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
AB~15~CD~555~EFG HIJK~6666

Daniel B. Martin

DKMCG · 12-13-2012, 04:29 PM

Quote:

Originally Posted by danielbmartin

With this input file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~6666

This code ...

Code:

awk -F "" '{if ($3=="~" && $6=="~" && $9=="~") print}'  $InFile

... produces this output file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
AB~15~CD~555~EFG HIJK~6666

Daniel B. Martin

Hi Daniel,

The desired results would be

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG HIJK~6666
AB~15~CD~555~EFG HIJK~6666

What I am trying to accomplish is removing the line terminator from that fifth column and mending it back together with the other half of the record on the fourth line in my before example. Additionally, in the actual data the terminator is not always in the fifth column; it can be embedded in one of several columns of a record, so don't get hung up on coding for a specific column.

AnanthaP · 12-13-2012, 07:37 PM

One sure way I use to see the actual byte values of a rogue line terminator is od -c, especially if it's near the beginning of a file.

(Note that it isn't visible in the example and the :se li option of vi may also display only what it can).

Once you know what the rogue character is, you can simply replace it in the file with sed or vi itself.
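
As a rough illustration of that inspect-then-replace workflow, here is a minimal sketch. It assumes, purely for the sake of the example, that the rogue terminator turns out to be a CR+LF pair while genuine record terminators are a bare LF; the real file may well differ, which is what the od dump is for. The sample data and the join_crlf name are made up.

```shell
# Hypothetical sample: the second record is split by a stray CR+LF pair,
# while genuine record terminators are a bare LF.
sample=$(mktemp)
printf 'AB~13~CD~333~EFG HIJK~5555\nAB~14~CD~444~EFG\r\nHIJK~6666\n' > "$sample"

# Dump every byte; control characters show up as escapes such as \r and \n.
od -c "$sample"

# Under that assumption, a line ending in CR is the first half of a broken
# record: drop the CR and join the next line with a single space.
join_crlf() {
    awk '{ if (sub(/\r$/, "")) printf "%s ", $0; else print }' "$1"
}

join_crlf "$sample"
```

If the dump shows that the rogue and genuine terminators are the same byte, this kind of simple replacement cannot tell them apart and a structural check on the records is needed instead.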

By the way, if it happens in a CSV file (presumably structured output of a previous process), I would suggest that you look at that process also to eliminate the rogue line terminator at source.

OK

danielbmartin · 12-13-2012, 08:25 PM

Quote:

Originally Posted by DKMCG

What I am trying to accomplish is removing the line terminator from that fifth column and mending it back together with the other half of the record on the fourth line in my before example. Additionally, in the actual data the terminator is not always in the fifth column; it can be embedded in one of several columns of a record, so don't get hung up on coding for a specific column.

Can we say that "healthy" lines always have 26 characters? If so, we can "connect" any line shorter than 26 with the next line. This is one way to do it.

Code:

# print only lines of 26 characters or longer
# write to OutFile
sed -n '/^.\{26\}/p'  "$InFile"  \
> "$OutFile"

# print only lines of fewer than 26 characters
# join pairs of lines side-by-side (like "paste")
# append to OutFile (mended records end up after the healthy ones)
sed '/^.\{26\}/d'  "$InFile"  \
|sed '$!N;s/\n/ /'          \
>> "$OutFile"

I'm confident there is a cleaner way to do this with awk but I haven't figured it out... yet.

Daniel B. Martin

danielbmartin · 12-13-2012, 09:20 PM

With this input file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~7777

... use this awk ...

Code:

awk -F "" '{if (NF<25) {getline a; $0=$0" "a;}} 1' $InFile

... to produce this output file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG HIJK~6666
AB~15~CD~555~EFG HIJK~7777

Daniel B. Martin
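
A note on portability: splitting a line into one field per character with -F "" is a common awk extension but is not specified by POSIX. A minimal sketch of the same idea using length($0) instead is shown below. The join_short function and the min_len variable are hypothetical names, and the threshold of 26 is taken from the sample data rather than from the real file.

```shell
# Join any line shorter than a full record with the line that follows it.
# $1 = input file, $2 = length of a complete record (26 in the sample data).
join_short() {
    awk -v min_len="$2" '
        length($0) < min_len {
            # Append the next line only if there actually is one
            if ((getline rest) > 0) $0 = $0 " " rest
        }
        { print }
    ' "$1"
}

# Usage: join_short "$InFile" 26
```

Like the one-liner above, this mends a record that was broken once; a record split into three or more pieces would need a loop.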

sundialsvcs · 12-13-2012, 10:53 PM

One big issue that I have with scenarios like this one is: "Are you inadvertently committing a worse sin by trying to 'correct' these 'errors'?"

Whatever computer program produced this file is the party that is ultimately responsible for its accuracy and completeness. If you discover that the data is not absolutely consistent, and demonstrably complete, then I think that you probably would be well-advised to reject it, without further explanation and with no attempt to "fix it."

My reasoning is thus: any algorithm that you might devise to "fix it" is necessarily based upon an assumption, about what's actually wrong with the program that produced this file. Those assumptions, in turn, are based upon the inconsistencies that you've observed so far, and upon your human judgments about what was "meant" and what the data "should have been." But ... "there's always one more bug." And the worst possible outcome here is "garbage in garbage multiplied." The data-integrity of this source file is ... non-existent.

theNbomr · 12-14-2012, 08:22 AM

Agree with sundialsvcs. If this is the one and only instance of this file (it's not a CSV file, BTW), then it might be easier to use a decent editor to edit it by hand. If the file format will be reproduced over and over, then it would make sense to discover the reason behind the apparently inconsistent format, and/or to ascertain the algorithm by which it is produced. This will prevent unexpected behaviors if the format changes.
Having said that, it looks like a simple algorithm to correct the formatting would be to replace any newline character(s) that are adjacent to a whitespace character with the whitespace character alone. It would be possible to give an exact example if a sample of the output of od, unambiguously showing all of the file content, were provided.
--- rod.

DKMCG · 12-14-2012, 08:26 AM

Quote:

Originally Posted by danielbmartin

With this input file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~7777

... use this awk ...

Code:

awk -F "" '{if (NF<25) {getline a; $0=$0" "a;}} 1' $InFile

... to produce this output file ...

Code:

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG HIJK~6666
AB~15~CD~555~EFG HIJK~7777

Daniel B. Martin

Hi Daniel,

That is looking pretty good. The actual length of the records is something like 45 columns and they are delivered as fixed-width columns, so we can probably count the total number of characters and use that, but obviously if there are any changes we would need to modify this code as well. I am going to run some tests using awk and see how we fare. Thanks for the solution.

danielbmartin · 12-14-2012, 08:59 AM

Quote:

Originally Posted by DKMCG

The actual length of the records is something like 45 columns and they are delivered as fixed width columns so we can probably count the total number of characters and use that but obviously if there are any changes we would need to modify this code as well.

No need to modify the code. Rather than have a line-length criterion hard-coded, let the program determine the length of the longest line in the file and feed that value to awk as an external variable. Try this ...

Code:

L=$(wc -L < "$InFile")   # L = Longest Line Length (wc -L is a GNU extension)
awk -F "" -v L="$L" '{if (NF<L) {getline a; $0=$0" "a;}} 1' "$InFile"

Daniel B. Martin
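
Where wc -L is not available (it is not part of POSIX wc), the longest line can be found by awk itself by reading the file twice. The following is only a sketch under the same assumptions as above: every intact record has the same width and at least one record in the file is intact. The join_to_longest name is hypothetical.

```shell
# $1 = input file; it is passed to awk twice so it can be read in two passes.
join_to_longest() {
    awk '
        # Pass 1: remember the length of the longest line
        NR == FNR { if (length($0) > max) max = length($0); next }
        # Pass 2: keep appending following lines until the record is full
        {
            while (length($0) < max && (getline rest) > 0)
                $0 = $0 " " rest
            print
        }
    ' "$1" "$1"
}

# Usage: join_to_longest "$InFile" > "$OutFile"
```

The while loop keeps appending until the record reaches full width, so a record broken in more than one place is also put back on one line, with a space substituted at each break.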

DKMCG · 12-14-2012, 09:55 AM

Quote:

Originally Posted by theNbomr

Agree with sundialsvcs. If this is the one and only instance of this file (it's not a CSV file, BTW), then it might be easier to use a decent editor to edit it by hand. If the file format will be reproduced over and over, then it would make sense to discover the reason behind the apparently inconsistent format, and/or to ascertain the algorithm by which it is produced. This will prevent unexpected behaviors if the format changes.
Having said that, it looks like a simple algorithm to correct the formatting would be to replace any newline character(s) that are adjacent to a whitespace character with the whitespace character alone. It would be possible to give an exact example if a sample of the output of od, unambiguously showing all of the file content, were provided.
--- rod.

I totally agree that the source should be correcting this, and if not the source then someone on the business side on our end; either way, IT should not be manipulating the data. The source does run some cleanup before delivery but they don't catch all terminators (obviously). We have gone back and mentioned it to them and they manually clean up the data in their system, but only after we encounter an issue and report it. They are one of the larger clients and are handled with kid gloves... this means management does not want to keep "bugging" them with these issues and asked us to find a solution.

The pattern around the bad terminator is not always the same. It's usually located in a field that captures a company name or in a description field, so the data surrounding the terminator is never the same and the terminator sits at different positions within the field. This is why I figured that doing a pattern match after the terminator would be the best solution and would have a very high percentage of success in cleaning up the bad terminators.
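
For reference, the pattern match described in the first post (characters 3, 6 and 9 of a line are "~" when the line starts a genuine record) could be sketched in awk roughly as follows. This is an untested-on-real-data sketch based only on the sample records in this thread; the mend_records name is made up, and the regular expression would have to be adapted to the actual fixed-width layout.

```shell
# $1 = input file. Lines that do not look like the start of a record are
# treated as the tail of the previous record and joined with a space.
mend_records() {
    awk '
        # Characters 3, 6 and 9 are "~": this line starts a new record
        /^..~..~..~/ {
            if (have) print rec
            rec = $0
            have = 1
            next
        }
        # Anything else continues the record collected so far
        {
            if (have) rec = rec " " $0
            else { rec = $0; have = 1 }
        }
        END { if (have) print rec }
    ' "$1"
}

# Usage: mend_records "$InFile" > "$OutFile"
```

Because the decision is made from the text after each terminator, it does not matter which field contains the break or how many breaks a record has. The weak spot is a continuation line that happens to have "~" in positions 3, 6 and 9; such a line would be mistaken for a new record.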

theNbomr · 12-14-2012, 12:29 PM

Quote:

Originally Posted by DKMCG

The pattern around the bad terminator is not always the same. It's usually located in a field that captures a company name or in a description field, so the data surrounding the terminator is never the same and the terminator sits at different positions within the field. This is why I figured that doing a pattern match after the terminator would be the best solution and would have a very high percentage of success in cleaning up the bad terminators.

Kind of important to say that up front, so people don't create solutions that are not sufficiently general.

--- rod.

AnanthaP · 12-14-2012, 09:10 PM

Quote:

The pattern around the bad terminator is not always the same. Its usually located in a field that captures company name or a description field. so the data surrounding the terminator is never the same a

But presumably the terminator itself remains the same, or does it also change?

Normally, if it happens in text data (as in your case) that comes from a web page (assuming it's a web page), it might mean that some special character needs to be stored with a full Unicode representation to translate correctly.

OK