Tab-separated file: remove CR/LF if it occurs before n tab characters?
I have a tab-separated data file coming from a source application. It should contain a fixed number of columns, say 23, with rows terminated by CR/LF. Unfortunately, the application allows users to hit Return within a text box, and that line break gets carried through into the file. As a result, some rows have 15 columns, then one or more errant CR/LFs, and then the rest of the row continues on the following line(s).
I need a way to process the file "line by line" (sed, awk, etc.), and basically count tab characters and remove any CR/LF that is not after the 23rd tab character. This may involve joining multiple lines together in this way until each line has exactly 23 tabs and an ending CR/LF. See attached image for a snippet of the data & issue - lines 168 & 169 illustrate a line break. Pointing me in the right direction is also welcome. |
Awk would be my tool of choice for this task.
I would set the field separator (-F or FS) to a single tab character, set a FIELDS variable to the expected number of fields per line, and initialize a counter to zero. Then, in the main block, add NF (the number of fields) to the counter for each line. If the counter is less than FIELDS, print the line with a trailing tab. If the counter equals FIELDS, print the line with a trailing CRLF and reset the counter to zero. If the counter is greater than FIELDS, then one or more consecutive lines had more than the expected number of fields, so print an error and exit. That should be fairly simple so long as your input file always has the correct number of fields per line, or per group of consecutive lines. |
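A minimal sketch of the tab-counting idea from the question, offered as a variation on the counter described above rather than as that exact method. It counts tab characters instead of fields, because a line break in the middle of a field makes that field appear on both physical lines, so summing NF would over-count by one per break. The names SEPS, sample.tsv and fixed.tsv are made up for the demo, and the sample uses 3 columns (2 tabs) to keep it short.

```shell
#!/bin/sh
# Sketch only. SEPS, sample.tsv and fixed.tsv are hypothetical names for this demo.
# A 3-column record holds 2 tabs; for 23 columns use SEPS=22
# (or 23 if every row also ends with a trailing tab).
SEPS=2

# CRLF-terminated sample; the second record is broken inside its second field.
printf 'a\tb\tc\r\nd\tfoo\r\nbar\tf\r\ng\th\ti\r\n' > sample.tsv

awk -v seps="$SEPS" '
{
    sub(/\r$/, "")               # strip the CR; the LF is already gone as the record separator
    buf = buf $0                 # glue this physical line onto the pending record
    tabs += gsub(/\t/, "&")      # gsub returns the number of tabs matched; "&" leaves the text unchanged
    if (tabs == seps) {          # record complete: emit it with a single CRLF
        printf "%s\r\n", buf
        buf = ""; tabs = 0
    } else if (tabs > seps) {    # more tabs than one record can hold: stop rather than guess
        printf("input line %d: %d tabs, expected %d\n", NR, tabs, seps) | "cat 1>&2"
        bad = 1; exit 1
    }
}
END {
    if (!bad && buf != "") {     # the file ended in the middle of a record
        print "incomplete final record" | "cat 1>&2"
        exit 1
    }
}
' sample.tsv > fixed.tsv
```

The broken pieces are glued together with nothing in between, so `foo` and `bar` become `foobar`; appending a space to buf before gluing would be an easy change if that reads better for your data. One limitation to keep in mind: a stray break inside the last column cannot be detected by counting tabs, because the first piece already has the full number of tabs and the leftover piece gets glued onto the front of the next record.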
I created a miniature of your problem. The InFile is comma delimited; the line length is 5 columns.
With this InFile ...
Code:
one,two,three,four,five
Code:
tr ',' '\n' <$InFile \
Code:
one,two,three,four,five |
I fought hard with the awk loop and awk won. In the end I went a slightly different route with awk, with some inspiration from here:
Code:
BEGIN { |
Ok, glad you found a solution!
For completeness, here is an example of what I had in mind, saved as the file tabs.awk:
Code:
$ cat tabs.awk
Code:
$ cat -T example.txt |sed 's/\^I/<TAB>/g'
Code:
$ awk -f tabs.awk example.txt |
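Whichever script you end up using, it may help to check the field counts before and after the repair. A small sketch, assuming a hypothetical 3-column sample file named example.tsv (counts.txt is also a made-up name): it tallies how many physical lines have each field count, so broken rows show up as unexpected counts.

```shell
#!/bin/sh
# Hypothetical sample: two intact 3-column rows and one row split across two lines.
printf 'a\tb\tc\r\nd\tfoo\r\nbar\tf\r\ng\th\ti\r\n' > example.tsv

# Tally physical lines by their number of tab-separated fields.
awk -F'\t' '
{ n[NF]++ }
END { for (k in n) printf "%d fields: %d lines\n", k, n[k] }
' example.tsv | sort -n > counts.txt

cat counts.txt
```

On a clean file every line should land in a single bucket (23 fields in the case described in the question); any other bucket points at rows that still need joining.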
If (when) you get unexplained errors in your output at some point in the future, come back and re-read astrogeek's notes.