Tab separated file - remove CR/LF if it occurs before n tab characters ?
I have a tab-separated data file coming from a source application that should contain a given number of columns, say 23, with rows terminated by CR/LF. Unfortunately, the application allows users to hit Return within a text box, and that gets carried through into the file. As a result, some rows have, say, 15 columns, then an errant CR/LF (or several), and the rest of the row continues on the next line(s).
I need a way to process the file "line by line" (sed, awk, etc.), and basically count tab characters and remove any CR/LF that is not after the 23rd tab character. This may involve joining multiple lines together in this way until each line has exactly 23 tabs and an ending CR/LF.
See attached image for a snippet of the data & issue - lines 168 & 169 illustrate a line break.
Pointing me in the right direction is also welcome.
I would set the field separator (-F or FS) to a single tab character, set a FIELDS variable to the expected number of fields per line, then set a counter to zero.
Then in the execution loop for each line add NF (number of fields) to the counter.
If the counter is less than FIELDS, print the line with a trailing tab.
If the counter equals FIELDS, print the line with a trailing CRLF and reset the counter to zero.
If the counter is greater than FIELDS, then one or more consecutive lines had more than the expected number of fields: print an error and exit.
That should be fairly simple so long as your input file always has the correct number of fields per line, or per consecutive lines.
Last edited by astrogeek; 06-11-2019 at 02:34 PM.
Reason: tpoys
For completeness, here is an example of what I had in mind, saved as the file tabs.awk:
Code:
$ cat tabs.awk
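# Join consecutive lines until the expected number of fields is reached.
BEGIN{ FS="\t"; fields=10 }
( cnt+=NF )>fields{
    printf("ERROR: Too many fields in line: %s\n",NR)
    err=1   # exit still runs the END rule, so flag the error to skip the check there
    exit 1
}
cnt<fields{ printf("%s\t",$0) }           # record incomplete: join with a tab
cnt==fields{ printf("%s\n",$0); cnt=0 }   # record complete: end the line, reset
END{ if(!err && cnt>0)
    printf("\nERROR: Too few fields in last line!\n")
}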
And the example text with visible tabs...
Code:
$ cat -T example.txt |sed 's/\^I/<TAB>/g'
Some<TAB>tab separated<TAB>text<TAB>with<TAB>ten<TAB>fields<TAB>inline<TAB>eight<TAB>nine<TAB>ten
One
Two
Three
Four<TAB>Five<TAB>Six<TAB>Seven
Eight
Nine<TAB>Ten
Another<TAB>tab<TAB>separated<TAB>line<TAB>with only five fields
followed<TAB>by<TAB>another<TAB>with five<TAB>fields
This<TAB>line<TAB>has<TAB>a<TAB>few<TAB>empty<TAB><TAB><TAB><TAB>fields
And the end result...
Code:
$ awk -f tabs.awk example.txt
Some tab separated text with ten fields inline eight nine ten
One Two Three Four Five Six Seven Eight Nine Ten
Another tab separated line with only five fields followed by another with five fields
This line has a few empty fields
One important characteristic of this approach is that it aggregates consecutive lines until they add up to the desired number of fields, but it does not introduce any breaks where they do not already occur in the input data. This seemed consistent with the idea that the user could introduce line breaks within a single record but should not be able to change the number of fields in a record, so an incorrect number of fields in a line or an aggregate is treated as an error. This makes staying in sync with the first field of each record more robust. Hope that makes sense.
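One caveat worth spelling out: tabs.awk adds up NF and joins lines with a tab, so every line break counts as a field boundary, as in the example text. If the stray CR/LF instead falls inside a field (the text box case from the original question), counting tab characters may be a better fit. Below is a minimal sketch of that variant, not a tested drop-in. The function name join_records and its argument are made up for illustration. It assumes you pass the number of tabs a complete record contains (usually one fewer than the number of columns), that input lines may end in CR/LF, and that the stray break should simply be removed. It also cannot detect a break inside the last field of a record, because the tab count is already complete at that point.

```shell
# Hypothetical helper: join_records TABS
# Reads a tab-separated file on stdin and writes repaired records to stdout.
join_records() {
    awk -v tabs="$1" '
    { sub(/\r$/, "") }                  # drop the CR of a CR/LF line ending
    {
        cnt += gsub(/\t/, "&")          # gsub returns the number of tabs on this line
        buf = buf $0                    # glue the pieces together, removing the break
        if (cnt > tabs) {
            printf("ERROR: too many tabs at input line %d\n", NR) > "/dev/stderr"
            err = 1                     # exit still runs END, so flag the error
            exit 1
        }
        if (cnt == tabs) {              # record complete: emit it with CR/LF
            printf("%s\r\n", buf)
            buf = ""
            cnt = 0
        }
    }
    END {
        if (!err && (cnt > 0 || buf != "")) {
            printf("ERROR: incomplete last record\n") > "/dev/stderr"
            exit 1
        }
    }'
}
```

For a 23-column file it could be called as join_records 22 < broken.tsv > fixed.tsv, where both file names are placeholders. To keep the words on either side of a removed break apart, the buf assignment could insert a space between the pieces instead of joining them directly.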
Last edited by astrogeek; 06-12-2019 at 03:52 PM.
Reason: Expand comments on record sync and errors