LinuxQuestions.org
Old 06-11-2019, 01:20 PM   #1
thesnow
Member
 
Registered: Nov 2010
Location: Minneapolis, MN
Distribution: Ubuntu, Red Hat, Mint
Posts: 172

Rep: Reputation: 56
Tab separated file - remove CR/LF if it occurs before n tab characters ?


I have a tab-separated data file coming from a source application that should contain a given number of columns, say 23, and rows are terminated by CR/LF. Unfortunately, the application allows users to hit Return within a text box, and that gets carried through into the file. The result is that some rows have 15 columns, then an errant CR/LF (or several), and the rest of the row continues on the next line(s).

I need a way to process the file "line by line" (sed, awk, etc.), and basically count tab characters and remove any CR/LF that is not after the 23rd tab character. This may involve joining multiple lines together in this way until each line has exactly 23 tabs and an ending CR/LF.

See attached image for a snippet of the data & issue - lines 168 & 169 illustrate a line break.

Pointing me in the right direction is also welcome.
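Before repairing anything it can help to see which lines are affected. The sketch below is only an illustration: the function name report_bad_lines and its argument are made up for this example, and it assumes an intact row contains a fixed number of tab characters (23 in the description above). It prints the number and tab count of every line that deviates.

```shell
# report_bad_lines TABS: hypothetical helper, reads the file on stdin.
# TABS is the assumed number of tab characters in an intact row.
report_bad_lines() {
    awk -v tabs="$1" '
    {
        n = gsub(/\t/, "\t")            # gsub returns how many tabs it matched
        if (n != tabs)
            print NR ": " n " tabs"     # line number and actual tab count
    }'
}
```

Usage would be something like report_bad_lines 23 < export.tsv, where export.tsv stands in for the real file name. A row split in two shows up as two consecutive short lines.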
Attached image: 2019-06-11 13_15_06.png (150.2 KB)
 
Old 06-11-2019, 02:26 PM   #2
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,269
Blog Entries: 24

Rep: Reputation: 4196
Awk would be my tool of choice for this task.

I would set the field separator (-F or FS) to a single tab character, set a FIELDS variable to the expected number of fields per line, then set a counter to zero.

Then in the execution loop for each line add NF (number of fields) to the counter.

If counter is <FIELDS print the line with a trailing tab.

If counter is ==FIELDS print the line with a trailing CRLF, reset counter to zero.

If counter is >FIELDS then one or more consecutive lines had more than the expected number of fields - print an error and exit.

That should be fairly simple so long as your input file always has the correct number of fields per line, or per consecutive lines.
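To watch the counter described above at work, a throwaway trace can be useful. This is a sketch, not the script itself: trace_fields is a hypothetical name, and unlike the steps above it keeps going after an overflow instead of exiting, so that every problem line is visible in one pass.

```shell
# trace_fields FIELDS: hypothetical helper, reads tab-separated text on stdin.
# Prints: line number, NF, running field total, state of the record.
trace_fields() {
    awk -F '\t' -v fields="$1" '
    {
        cnt += NF
        state = (cnt < fields) ? "partial" : ((cnt == fields) ? "complete" : "overflow")
        printf "%d\t%d\t%d\t%s\n", NR, NF, cnt, state
        if (cnt >= fields)
            cnt = 0                     # start counting the next record
    }'
}
```

Lines marked partial are the ones that would be printed with a trailing tab, and lines marked complete are the ones that would end a record.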

Last edited by astrogeek; 06-11-2019 at 02:34 PM. Reason: tpoys
 
1 members found this post helpful.
Old 06-12-2019, 06:05 AM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
I created a miniature of your problem. The InFile is comma delimited; the line length is 5 columns.

With this InFile ...
Code:
one,two,three,four,five
six,seven,eight,nine,ten
eleven,twelve,thirteen
fourteen,fifteen
sixteen,seventeen,eighteen,nineteen,twenty
apple,,banana,,cherry
... this code ...
Code:
tr ',' '\n' <$InFile  \
|paste -sd',,,,\n'    \
>$OutFile
... produced this OutFile ...
Code:
one,two,three,four,five
six,seven,eight,nine,ten
eleven,twelve,thirteen,fourteen,fifteen
sixteen,seventeen,eighteen,nineteen,twenty
apple,,banana,,cherry
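The same idea might be carried over to a tab-separated file with CR/LF line endings roughly as follows. This is a sketch under assumptions: the function name regroup is hypothetical, the output uses plain LF endings, and, as in the comma example, every stray line break is assumed to sit between two fields. A break inside a field would split that field in two and shift everything after it.

```shell
# regroup N: hypothetical helper, reads tab-separated text on stdin
# and writes N fields per line on stdout.
regroup() {
    cols=$1
    # Build the delimiter list for paste: (N-1) tabs, then one newline.
    delims=$(printf '%*s' "$((cols - 1))" '' | sed 's/ /\\t/g')'\n'
    # Drop CRs, put one field on each line, then regroup N at a time.
    tr -d '\r' | tr '\t' '\n' | paste -sd "$delims" -
}
```

For the 23-column file from the first post this would be called as regroup 23 < InFile > OutFile.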
Daniel B. Martin

.

Last edited by danielbmartin; 06-12-2019 at 06:06 AM. Reason: Clarify wording, no change to code.
 
2 members found this post helpful.
Old 06-12-2019, 09:55 AM   #4
thesnow
Member
 
Registered: Nov 2010
Location: Minneapolis, MN
Distribution: Ubuntu, Red Hat, Mint
Posts: 172

Original Poster
Rep: Reputation: 56
I fought hard with the awk loop and awk won. In the end I went a slightly different route with awk, with some inspiration from here:

Code:
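BEGIN {
  RS="\t"    # read the input one tab-terminated field at a time
}

{
  gsub("\r\n","")    # drop every CR/LF found in the field, stray or genuine
  # 102 is presumably the number of fields per row in the real file:
  # print a newline after every 102nd field and a tab after all others
  printf "%s%s",$0,(NR%102?"\t":"\n")
}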

Last edited by thesnow; 06-12-2019 at 09:57 AM. Reason: code tag
 
Old 06-12-2019, 03:14 PM   #5
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,269
Blog Entries: 24

Rep: Reputation: 4196
Ok, glad you found a solution!

For completeness, here is an example of what I had in mind, saved as the file tabs.awk:

Code:
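$ cat tabs.awk
# fields = expected fields per record; cnt = fields seen so far in this record
BEGIN{ FS="\t"; fields=10 }
# Add this line's fields to the running total; overshooting is an error
( cnt+=NF )>fields{
        printf("ERROR: Too many fields in line: %s\n",NR)
        exit 1
        }
# Record still incomplete: the line break becomes a field separator
cnt<fields{     printf("%s\t",$0) }
# Record complete: end it with a newline and start counting afresh
cnt==fields{ printf("%s\n",$0); cnt=0 }
# exit also runs END, so report a short record only if cnt did not overshoot
END{ if(cnt>0 && cnt<fields)
        printf("\nERROR: Too few fields in last line!\n")
        }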
And the example text with visible tabs...

Code:
$ cat -T example.txt |sed 's/\^I/<TAB>/g'
Some<TAB>tab separated<TAB>text<TAB>with<TAB>ten<TAB>fields<TAB>inline<TAB>eight<TAB>nine<TAB>ten
One
Two
Three
Four<TAB>Five<TAB>Six<TAB>Seven
Eight
Nine<TAB>Ten
Another<TAB>tab<TAB>separated<TAB>line<TAB>with only five fields
followed<TAB>by<TAB>another<TAB>with five<TAB>fields
This<TAB>line<TAB>has<TAB>a<TAB>few<TAB>empty<TAB><TAB><TAB><TAB>fields
And the end result...

Code:
$ awk -f tabs.awk example.txt
Some    tab separated   text    with    ten     fields  inline  eight   nine    ten
One     Two     Three   Four    Five    Six     Seven   Eight   Nine    Ten
Another tab     separated       line    with only five fields   followed        by      another with five       fields
This    line    has     a       few     empty                           fields
One important characteristic of this approach is that it will aggregate consecutive lines until a line break falls on the desired number of fields, but it will not introduce any breaks where they do not already occur in the input data. This seemed consistent with the idea that the user could introduce line breaks within a single record, but should not be able to change the number of fields in a record, so an incorrect number of fields in a line or an aggregate is treated as an error. This makes sync with the first field of each record more robust. Hope that makes sense.
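The script above turns each joined line break into a tab, which suits breaks that fall between fields. If the break sits inside a text field and the file uses CR/LF, as the first post describes, a variant that counts tab characters and joins lines without inserting anything may fit better. The following is a minimal sketch, not a tested replacement: join_broken_rows is a hypothetical name, and it assumes every complete record contains exactly the given number of tabs with the CR/LF directly after the last one.

```shell
# join_broken_rows TABS: hypothetical helper, reads stdin, writes stdout.
# TABS is the assumed number of tab characters in one complete record.
join_broken_rows() {
    awk -v tabs="$1" '
    {
        sub(/\r$/, "")                  # drop the CR left over from a CR/LF
        seen += gsub(/\t/, "\t")        # gsub returns how many tabs it matched
        if (seen > tabs) {
            printf "ERROR: too many tabs at line %d\n", NR > "/dev/stderr"
            failed = 1
            exit 1
        }
        if (seen < tabs) {
            printf "%s", $0             # incomplete: join with the next line
        } else {
            printf "%s\r\n", $0         # complete: restore the CR/LF
            seen = 0
        }
    }
    END {
        if (!failed && seen > 0) {
            printf "\nERROR: last record is incomplete\n" > "/dev/stderr"
            exit 1
        }
    }'
}
```

It would be run as join_broken_rows 23 < InFile > OutFile. Like the script above, it never adds a record boundary that is not already a line break in the input, and it stops with an error as soon as a record has too many tabs.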

Last edited by astrogeek; 06-12-2019 at 03:52 PM. Reason: Expand comments on record sync and errors
 
1 members found this post helpful.
Old 06-12-2019, 06:04 PM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,139

Rep: Reputation: 4122
If (when) you get unexplained errors in your output at some point in the future, come back and re-read astrogeek's notes.
 
  



Tags
awk, sed


