Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux, and any language is fair game.
Hi everyone, I have a little issue with rogue line terminators in a CSV file I receive from a client. I am trying to figure out if there is a way, via script, to clean up these bad terminators. Here is an example:
I'm thinking the best way to do this is to scan for each line terminator and, once found, grab the first 9 characters after it. If characters 3, 6 and 9 are ~ then it's a good terminator; otherwise replace it with a white space (or delete it). Problem is, I have no idea how to code it. I do have Perl on my server as well as Korn shell. Any help would be appreciated.
Help us to help you. You gave a sample input file (that's good) and some words (also good). Construct a sample output file which corresponds to your sample input and post it here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.
Daniel B. Martin
Hi Daniel,
Thanks for the response; I updated the post to include the desired results.
What I am trying to accomplish is removing the line terminator from that 5th column and mending the record back together with its other half on the fourth line in my before example. Additionally, the bad terminator is not confined to the fifth column; it can be embedded in any of several columns of a record, so don't get hung up on coding for a specific column.
One sure way I use to see the actual bit values of a rogue line terminator is od -c, especially if it's near the beginning of a file.
(Note that it isn't visible in the example, and the :se li option of vi may also display only what it can.)
Once you know what the rogue character is, you can simply replace it in the file with sed or vi itself.
By the way, if it happens in a CSV file (presumably structured output of a previous process), I would suggest that you look at that process also to eliminate the rogue line terminator at source.
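To illustrate the od -c approach (file name is illustrative): inspect the raw bytes near the top of the file, and if the rogue terminator turns out to be a carriage return, it can be stripped in one pass:

```shell
# Show the raw bytes of the first few lines; \r and \n show up literally
# in od -c output, so a stray \r (or \r \n pair) is easy to spot.
head -5 suspect.csv | od -c | head

# If the rogue terminator is a carriage return, delete every CR:
tr -d '\r' < suspect.csv > clean.csv
```

If the rogue byte is something other than \r, substitute it in the tr (or sed) command accordingly.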
Can we say that "healthy" lines always have 26 characters? If so, we can "connect" any line shorter than 26 with the next line. This is one way to do it.
Code:
# print only lines of 26 characters or longer
# write to OutFile
sed -n '/^.\{26\}/p' "$InFile" \
    > "$OutFile"

# print only lines of less than 26 characters
# join pairs of lines side-by-side (like "paste")
# append to OutFile
sed '/^.\{26\}/d' "$InFile" \
    | sed '$!N;s/\n/ /' \
    >> "$OutFile"
I'm confident there is a cleaner way to do this with awk but I haven't figured it out... yet.
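One possible awk sketch of the same idea (assuming, as above, that healthy records are at least 26 characters; fragments are glued together with a space until the record is long enough, which also preserves the original line order):

```shell
awk -v min=26 '
  { rec = rec $0 }                  # append this line to the pending record
  length(rec) >= min {              # long enough: a healthy (or repaired) record
      print rec; rec = ""; next
  }
  { rec = rec " " }                 # still short: a space stands in for the bad newline
  END { if (rec != "") print rec }  # flush any leftover fragment at end of file
' "$InFile" > "$OutFile"
```

Unlike the pairwise sed join, this also handles a record that was split into more than two pieces.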
One big issue that I have, with scenarios like this one, is: "are you inadvertently committing a worse sin by trying to 'correct' these 'errors'?"
Whatever computer program produced this file is the party that is ultimately responsible for its accuracy and completeness. If you discover that the data is not absolutely consistent and demonstrably complete, then I think that you would probably be well-advised to reject it ... without further explanation, and with no attempt to "fix it."
My reasoning is thus: any algorithm that you might devise to "fix it" is necessarily based upon an assumption about what's actually wrong with the program that produced this file. Those assumptions, in turn, are based upon the inconsistencies that you've observed so far, and upon your human judgments about what was "meant" and what the data "should have been." But ... "there's always one more bug." And the worst possible outcome here is "garbage in, garbage multiplied." The data integrity of this source file is ... non-existent.
Agree with sundialsvcs. If this is the one and only instance of this file (it's not a CSV file, BTW), then it might be easier to use a decent editor to edit it by hand. If the file format will be reproduced over and over, then it would make sense to discover the reason behind the apparently inconsistent format, and/or to ascertain the algorithm by which it is produced. This will prevent unexpected behaviors if the format changes.
Having said that, it looks like a simple algorithm to correct the formatting would be to replace any newline character(s) that are adjacent to a whitespace character with the whitespace character alone. It would be possible to give an exact example if a sample of the output of od unambiguously showing all of the file content were provided.
--- rod.
That is looking pretty good. The actual length of the records is something like 45 columns, and they are delivered as fixed-width columns, so we can probably count the total number of characters and use that, but obviously if there are any changes we would need to modify this code as well. I am going to run some tests using awk and see how we fare. Thanks for the solution.
No need to modify the code. Rather than have a line-length criterion hard-coded, let the program determine the length of the longest line in the file and feed that value to awk as an external variable. Try this ...
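The original code attachment is not shown here, but the idea described can be sketched in two passes (variable names are illustrative; fragments shorter than the longest line are joined to the next line with a space):

```shell
# Pass 1: let awk discover the length of the longest line in the file,
# instead of hard-coding a record length.
max=$(awk '{ if (length($0) > m) m = length($0) } END { print m }' "$InFile")

# Pass 2: feed that value to awk as an external variable; any line (or
# accumulated fragment) shorter than the longest line is treated as a
# broken record and glued to what follows.
awk -v min="$max" '
  { rec = rec $0 }
  length(rec) >= min { print rec; rec = ""; next }
  { rec = rec " " }
  END { if (rec != "") print rec }
' "$InFile" > "$OutFile"
```

Note this assumes at least one healthy, full-length record exists in the file, and that joined fragments restore (at least) that length.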
I totally agree that the source should be correcting this, and if not the source, someone on the business side on our end; either way, IT should not be manipulating the data. The source does run some cleanup before delivery, but they don't catch all terminators (obviously). We have gone back and mentioned it to them, and they manually clean up the data in their system, but only after we encounter an issue and report it. They are one of the larger clients and are handled with kid gloves... this means management does not want to keep "bugging" them with these issues and asked us to find a solution.
The pattern around the bad terminator is not always the same. It's usually located in a field that captures a company name or a description, so the data surrounding the terminator is never the same and sits at different positions within the field. This is why I figured that doing a pattern match after the terminator would be the best solution and would have a very high percentage of success in cleaning up the bad terminators.
Kind of important to say that up front, so people don't create solutions that are not sufficiently general.
But presumably the terminator remains the same or does it also change?
Normally, if it happens in text data - as in your case - on a web page (assuming it's a web page), it might mean that some special character needs to be stored with a full Unicode representation to translate correctly.