LinuxQuestions.org - Tab separated file - remove CR/LF if it occurs before n tab characters ?

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Tab separated file - remove CR/LF if it occurs before n tab characters ? (https://www.linuxquestions.org/questions/programming-9/tab-separated-file-remove-cr-lf-if-it-occurs-before-n-tab-characters-4175655524/)

Tab separated file - remove CR/LF if it occurs before n tab characters ?

I have a tab-separated data file coming from a source application that should contain a given number of columns, say 23, and rows are terminated by CR/LF. Unfortunately, the application allows users to hit Return within a text box, and that gets carried through into the file. The result being that for some rows, there are 15 columns, then an errant CR/LF (or multiple), then the rest of the row is finished on the next line(s).

I need a way to process the file "line by line" (sed, awk, etc.), and basically count tab characters and remove any CR/LF that is not after the 23rd tab character. This may involve joining multiple lines together in this way until each line has exactly 23 tabs and an ending CR/LF.

See attached image for a snippet of the data & issue - lines 168 & 169 illustrate a line break.

Pointing me in the right direction is also welcome.

Awk would my tool of choice for this task.

I would set the field separator (-F or FS) to a single tab character, set a FIELDS variable to the expected number of fields per line, then set a counter to zero.

Then in the execution loop for each line add NF (number of fields) to the counter.

If counter is <FIELDS print the line with a trailing tab.

If counter is ==FIELDS print the line with a trailing CRLF, reset counter to zero.

If counter is >FIELDS then one or more consecutive lines had more than the expected number of fields - print an error and exit.

That should be fairly simple so long as your input file always has the correct number of fields per line, or per consecutive lines.

I created a miniature of your problem. The InFile is comma delimited; the line length is 5 columns.

With this InFile ...

Code:

one,two,three,four,five

six,seven,eight,nine,ten

eleven,twelve,thirteen

fourteen,fifteen

sixteen,seventeen,eighteen,nineteen,twenty

apple,,banana,,cherry

... this code ...

Code:

tr ',' '\n' <$InFile  \

|paste -sd',,,,\n'    \

>$OutFile

... produced this OutFile ...

Code:

one,two,three,four,five

six,seven,eight,nine,ten

eleven,twelve,thirteen,fourteen,fifteen

sixteen,seventeen,eighteen,nineteen,twenty

apple,,banana,,cherry

Daniel B. Martin

.

I fought hard with the awk loop and awk won. In the end I went a slightly different route with awk, with some inspiration from here:

Code:

BEGIN {

  RS="\t"

}



{

  gsub("\r\n","")

  printf "%s%s",$0,(NR%102?"\t":"\n")

}

Ok, glad you found a solution!

For completeness, here is an example of what I had in mind, saved as the file tabs.awk:

Code:

$ cat tabs.awk

BEGIN{ FS="\t"; fields=10 }

( cnt+=NF )>fields{

        printf("ERROR: Too many fields in line: %s\n",NR)

        exit 1

        }

cnt<fields{    printf("%s\t",$0) }

cnt==fields{ printf("%s\n",$0); cnt=0 }

END{ if(cnt>0)

        printf("\nERROR: Too few fields in last line!\n")

        }

And the example text with visible tabs...

Code:

$ cat -T example.txt |sed 's/\^I/<TAB>/g'

Some<TAB>tab separated<TAB>text<TAB>with<TAB>ten<TAB>fields<TAB>inline<TAB>eight<TAB>nine<TAB>ten

One

Two

Three

Four<TAB>Five<TAB>Six<TAB>Seven

Eight

Nine<TAB>Ten

Another<TAB>tab<TAB>separated<TAB>line<TAB>with only five fields

followed<TAB>by<TAB>another<TAB>with five<TAB>fields

This<TAB>line<TAB>has<TAB>a<TAB>few<TAB>empty<TAB><TAB><TAB><TAB>fields

And the end result...

Code:

$ awk -f tabs.awk example.txt

Some    tab separated  text    with    ten    fields  inline  eight  nine    ten

One    Two    Three  Four    Five    Six    Seven  Eight  Nine    Ten

Another tab    separated      line    with only five fields  followed        by      another with five      fields

This    line    has    a      few    empty                          fields

One important characteristic of this approach is that it will aggregate consecutive lines which end with a break on the desired number of fields, butit will not introduce any breaks where they do not already occur in the input data. This seemed consistent with the idea that the user could introduce line breaks within a single record, but should not be able to change the number of fields in a record, so that if an incorrect number of fields in a line or an aggregate occurs it is treated as an error. This makes sync with first field of records more robust. Hope that makes sense.

If (when) you get unexplained errors in your output at some point in the future, come back and re-read astrogeek' notes.