Tab separated file - remove CR/LF if it occurs before n tab characters ?

thesnow · 06-11-2019, 01:20 PM

I have a tab-separated data file coming from a source application that should contain a given number of columns, say 23, and rows are terminated by CR/LF. Unfortunately, the application allows users to hit Return within a text box, and that gets carried through into the file. The result being that for some rows, there are 15 columns, then an errant CR/LF (or multiple), then the rest of the row is finished on the next line(s).

I need a way to process the file "line by line" (sed, awk, etc.), and basically count tab characters and remove any CR/LF that is not after the 23rd tab character. This may involve joining multiple lines together in this way until each line has exactly 23 tabs and an ending CR/LF.

See attached image for a snippet of the data & issue - lines 168 & 169 illustrate a line break.

Pointing me in the right direction is also welcome.

astrogeek · 06-11-2019, 02:26 PM

Awk would my tool of choice for this task.

I would set the field separator (-F or FS) to a single tab character, set a FIELDS variable to the expected number of fields per line, then set a counter to zero.

Then in the execution loop for each line add NF (number of fields) to the counter.

If counter is <FIELDS print the line with a trailing tab.

If counter is ==FIELDS print the line with a trailing CRLF, reset counter to zero.

If counter is >FIELDS then one or more consecutive lines had more than the expected number of fields - print an error and exit.

That should be fairly simple so long as your input file always has the correct number of fields per line, or per consecutive lines.

danielbmartin · 06-12-2019, 06:05 AM

I created a miniature of your problem. The InFile is comma delimited; the line length is 5 columns.

With this InFile ...

Code:

one,two,three,four,five
six,seven,eight,nine,ten
eleven,twelve,thirteen
fourteen,fifteen
sixteen,seventeen,eighteen,nineteen,twenty
apple,,banana,,cherry

... this code ...

Code:

tr ',' '\n' <$InFile  \
|paste -sd',,,,\n'    \
>$OutFile

... produced this OutFile ...

Code:

one,two,three,four,five
six,seven,eight,nine,ten
eleven,twelve,thirteen,fourteen,fifteen
sixteen,seventeen,eighteen,nineteen,twenty
apple,,banana,,cherry

Daniel B. Martin

.

thesnow · 06-12-2019, 09:55 AM

I fought hard with the awk loop and awk won. In the end I went a slightly different route with awk, with some inspiration from here:

Code:

BEGIN {
  RS="\t"
}

{
  gsub("\r\n","")
  printf "%s%s",$0,(NR%102?"\t":"\n")
}

astrogeek · 06-12-2019, 03:14 PM

Ok, glad you found a solution!

For completeness, here is an example of what I had in mind, saved as the file tabs.awk:

Code:

$ cat tabs.awk
BEGIN{ FS="\t"; fields=10 }
( cnt+=NF )>fields{
        printf("ERROR: Too many fields in line: %s\n",NR)
        exit 1
        }
cnt<fields{     printf("%s\t",$0) }
cnt==fields{ printf("%s\n",$0); cnt=0 }
END{ if(cnt>0)
        printf("\nERROR: Too few fields in last line!\n")
        }

And the example text with visible tabs...

Code:

$ cat -T example.txt |sed 's/\^I/<TAB>/g'
Some<TAB>tab separated<TAB>text<TAB>with<TAB>ten<TAB>fields<TAB>inline<TAB>eight<TAB>nine<TAB>ten
One
Two
Three
Four<TAB>Five<TAB>Six<TAB>Seven
Eight
Nine<TAB>Ten
Another<TAB>tab<TAB>separated<TAB>line<TAB>with only five fields
followed<TAB>by<TAB>another<TAB>with five<TAB>fields
This<TAB>line<TAB>has<TAB>a<TAB>few<TAB>empty<TAB><TAB><TAB><TAB>fields

And the end result...

Code:

$ awk -f tabs.awk example.txt
Some    tab separated   text    with    ten     fields  inline  eight   nine    ten
One     Two     Three   Four    Five    Six     Seven   Eight   Nine    Ten
Another tab     separated       line    with only five fields   followed        by      another with five       fields
This    line    has     a       few     empty                           fields

One important characteristic of this approach is that it will aggregate consecutive lines which end with a break on the desired number of fields, butit will not introduce any breaks where they do not already occur in the input data. This seemed consistent with the idea that the user could introduce line breaks within a single record, but should not be able to change the number of fields in a record, so that if an incorrect number of fields in a line or an aggregate occurs it is treated as an error. This makes sync with first field of records more robust. Hope that makes sense.

syg00 · 06-12-2019, 06:04 PM

If (when) you get unexplained errors in your output at some point in the future, come back and re-read astrogeek' notes.