LinuxQuestions.org
Latest LQ Deal: Latest LQ Deals
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 12-13-2012, 02:04 PM   #1
DKMCG
LQ Newbie
 
Registered: Dec 2012
Location: Central New Jersey
Posts: 5

Rep: Reputation: Disabled
Rogue line terminators in csv file


Hi Everyone, I have a little issue with rogue line terminators in a csv file i receive from a client. I am trying to figure out of there is a way via script to clean up these bad terminators here is an example:

Before:
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG
HIJK~6789
AB~12~CD~345~EFG HIJK~6789

After:
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789
AB~12~CD~345~EFG HIJK~6789


Im thinking the best way to do this is to scan for each line terminator and once found, grab the first 9 characters after the line terminator. If characters 3, 6 and 9 -eq ~ then its a good terminator, else replace it with a white space (or delte it). Problem is i have no idea how to code it. I do have perl on my server as well as korn shell. Any help would be appreciated.

Thanks in advance,
DM

Last edited by DKMCG; 12-13-2012 at 03:14 PM.
 
Old 12-13-2012, 03:07 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Help us to help you. You gave a sample input file (that's good) and some words (also good). Construct a sample output file which corresponds to your sample input and post it here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin
 
Old 12-13-2012, 03:16 PM   #3
DKMCG
LQ Newbie
 
Registered: Dec 2012
Location: Central New Jersey
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin View Post
Help us to help you. You gave a sample input file (that's good) and some words (also good). Construct a sample output file which corresponds to your sample input and post it here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin
Hi Daniel,

Thanks for the response, i updated the post to include the desired results.
 
Old 12-13-2012, 04:17 PM   #4
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by DKMCG View Post
If characters 3, 6 and 9 -eq ~ then its a good terminator, else replace it with a white space (or delete it).
With this input file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~6666
This code ...
Code:
awk -F "" '{if ($3=="~" && $6=="~" && $9=="~") print}'  $InFile
... produces this output file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
AB~15~CD~555~EFG HIJK~6666
Daniel B. Martin
 
Old 12-13-2012, 04:29 PM   #5
DKMCG
LQ Newbie
 
Registered: Dec 2012
Location: Central New Jersey
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin View Post
With this input file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~6666
This code ...
Code:
awk -F "" '{if ($3=="~" && $6=="~" && $9=="~") print}'  $InFile
... produces this output file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
AB~15~CD~555~EFG HIJK~6666
Daniel B. Martin
Hi Daniel,

The desired results would be

AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG HIJK~6666
AB~15~CD~555~EFG HIJK~6666

What i am trying to accomplish is removing the line terminator from that 5th column and mending it back together with the other half of the record on the fourth line in my before example. additionally, The actual data is not the fifth column, the terminator can be embedded in one of several columns of a record so dont get hung up on coding for a specific column.

Last edited by DKMCG; 12-13-2012 at 04:36 PM.
 
Old 12-13-2012, 07:37 PM   #6
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Posts: 952

Rep: Reputation: 217Reputation: 217Reputation: 217
One sure way I use to see the actual bit values of a rogue line terminator is
od -c specially if it's near the beginning of a file.

(Note that it isn't visible in the example and the :se li option of vi may also display only what it can).

Once you know what the rogue character is, you can simply replace it in the file with sed or vi itself.

By the way, if it happens in a CSV file (presumably structured output of a previous process), I would suggest that you look at that process also to eliminate the rogue line terminator at source.

OK

Last edited by AnanthaP; 12-13-2012 at 07:39 PM.
 
Old 12-13-2012, 08:25 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by DKMCG View Post
What i am trying to accomplish is removing the line terminator from that 5th column and mending it back together with the other half of the record on the fourth line in my before example. additionally, The actual data is not the fifth column, the terminator can be embedded in one of several columns of a record so dont get hung up on coding for a specific column.
Can we say that "healthy" lines always have 26 characters? If so, we can "connect" any line shorter than 26 with the next line. This is one way to do it.
Code:
# print only lines of 26 characters or longer
# write to OutFile
sed -n '/^.\{26\}/p'  $InFile  \
> $OutFile

# print only lines of less than 26 characters
# join pairs of lines side-by-side (like "paste")
# append to OutFile
sed '/^.\{26\}/d'  $InFile  \
|sed '$!N;s/\n/ /'          \
>> $OutFile
I'm confident there is a cleaner way to do this with awk but I haven't figured it out... yet.

Daniel B. Martin
 
Old 12-13-2012, 09:20 PM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
With this input file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~7777
... use this awk ...
Code:
awk -F "" '{if (NF<25) {getline a; $0=$0" "a;}} 1' $InFile
... to produce this output file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG HIJK~6666
AB~15~CD~555~EFG HIJK~7777
Daniel B. Martin

Last edited by danielbmartin; 12-13-2012 at 09:23 PM. Reason: Tighten the code, slightly
 
Old 12-13-2012, 10:53 PM   #9
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941
One big issue that I have, with scenarios like this one, is ... "are you inadvertantly committing a worse sin, by trying to 'correct' these 'errors?'"

Whatever computer program produced this file, is the party that is ultimately responsible for its accuracy and completeness. If you discover that the data is not absolutely consistent, and demonstrably complete, then I think that you probably would be well-advised to reject it .. without further explanation, and with no attempt to "fix it."

My reasoning is thus: any algorithm that you might devise to "fix it" is necessarily based upon an assumption, about what's actually wrong with the program that produced this file. Those assumptions, in turn, are based upon the inconsistencies that you've observed so far, and upon your human judgments about what was "meant" and what the data "should have been." But ... "there's always one more bug." And the worst possible outcome here is "garbage in garbage multiplied." The data-integrity of this source file is ... non-existent.
 
Old 12-14-2012, 08:22 AM   #10
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908
Agree with sundialsvcs. If this is the one and only instance of this file (it's not a CSV file, BTW), then it might be easier to use a decent editor to edit it by hand. If the file format will be reproduced over and over, then it would make sense to discover the reason behind the apparently inconsistent format, and/or to ascertain the algorithm by which it is produced. This will prevent unexpected behaviors if the format changes.
Having said that, it looks like a simple algorithm to correct the formatting would be to replace any newline character(s) that are adjacent to a whitespace character with the whitespace character alone. It would be possible to give an exact example if the a sample of the output of od unambiguously showing all of the file content were provided.
--- rod.
 
Old 12-14-2012, 08:26 AM   #11
DKMCG
LQ Newbie
 
Registered: Dec 2012
Location: Central New Jersey
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin View Post
With this input file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG
HIJK~6666
AB~15~CD~555~EFG HIJK~7777
... use this awk ...
Code:
awk -F "" '{if (NF<25) {getline a; $0=$0" "a;}} 1' $InFile
... to produce this output file ...
Code:
AB~12~CD~222~EFG HIJK~4444
AB~13~CD~333~EFG HIJK~5555
AB~14~CD~444~EFG HIJK~6666
AB~15~CD~555~EFG HIJK~7777
Daniel B. Martin
Hi Daniel,

That is looking pretty good, The actual lenght of the records is something like 45 columns and they are delivered as fixed width columns so we can probably count the total number of characters and use that but obviously if there are any changes we would need to modify this code as well. I am going to run so tests using AWK and see how we fare. Thanks for the solution.
 
Old 12-14-2012, 08:59 AM   #12
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by DKMCG View Post
The actual length of the records is something like 45 columns and they are delivered as fixed width columns so we can probably count the total number of characters and use that but obviously if there are any changes we would need to modify this code as well.
No need to modify the code. Rather than have a line-length criterion hard-coded, let the program determine the length of the longest line in the file and feed that value to awk as an external variable. Try this ...
Code:
L=$(cat $InFile |wc -L)   # L = Longest Line Length
awk -F "" -v L="$L" '{if (NF<L) {getline a; $0=$0" "a;}} 1' $InFile
Daniel B. Martin
 
Old 12-14-2012, 09:55 AM   #13
DKMCG
LQ Newbie
 
Registered: Dec 2012
Location: Central New Jersey
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by theNbomr View Post
Agree with sundialsvcs. If this is the one and only instance of this file (it's not a CSV file, BTW), then it might be easier to use a decent editor to edit it by hand. If the file format will be reproduced over and over, then it would make sense to discover the reason behind the apparently inconsistent format, and/or to ascertain the algorithm by which it is produced. This will prevent unexpected behaviors if the format changes.
Having said that, it looks like a simple algorithm to correct the formatting would be to replace any newline character(s) that are adjacent to a whitespace character with the whitespace character alone. It would be possible to give an exact example if the a sample of the output of od unambiguously showing all of the file content were provided.
--- rod.

I totally agree that the source should be correcting andif not the source someone on the business side on our end but either way IT should not be manipulating the data. The source does run some clenaup before delivery but they dont catch all terminators (obviously). We have gone back and mentioned it to them and they manually clean up the data in their system, but only after we encounter an issue and report it. They are one of ht elarger clients and are handled with kid gloves...this means management does not want to keep "bugging" them with these issues and asked us to find a solution.

The pattern around the bad terminator is not always the same. Its usually located in a field that captures company name or a description field. so the data surrounding the terminator is never the same and located at different positions within the field. This is why i figured that doing a pattern match after the terminator would be the best solution and have a very high percentage of success in cleaning up the bad terms.
 
Old 12-14-2012, 12:29 PM   #14
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908
Quote:
Originally Posted by DKMCG View Post
The pattern around the bad terminator is not always the same. Its usually located in a field that captures company name or a description field. so the data surrounding the terminator is never the same and located at different positions within the field. This is why i figured that doing a pattern match after the terminator would be the best solution and have a very high percentage of success in cleaning up the bad terms.
Kind of important to say that up front, so people don't create solutions that are not sufficiently general.

--- rod.
 
Old 12-14-2012, 09:10 PM   #15
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Posts: 952

Rep: Reputation: 217Reputation: 217Reputation: 217
Quote:
The pattern around the bad terminator is not always the same. Its usually located in a field that captures company name or a description field. so the data surrounding the terminator is never the same a
But presumably the terminator remains the same or does it also change?

Normally if it happens in text data - as in your case - in a web page (assuming its a web page), it might mean that some special character needs to be stored with a full unicode representation to translate correctly.

OK
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
File with \r as line terminators. stf92 Linux - Software 2 08-12-2010 09:21 AM
[SOLVED] Removing line terminators from a plain text file. stf92 Linux - Software 11 07-31-2010 07:05 AM
How to remove CRLF, CR line terminators keshavk Linux - Newbie 2 08-02-2009 12:56 PM
replace line in CSV file and rename file connected to that name wademac Linux - Newbie 3 07-15-2009 01:09 PM
Erase first line of a csv file imkornhulio Programming 2 02-07-2009 06:28 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 08:12 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration