LinuxQuestions.org
Download your favorite Linux distribution at LQ ISO.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 04-04-2015, 02:50 AM   #1
AZTiger98
LQ Newbie
 
Registered: Apr 2015
Posts: 2

Rep: Reputation: Disabled
Removing multiple lines from cell data in a .csv file


I am trying to process some .csv files with Linux as follows:

Some fields have data with newline characters embedded, like so:

"Bob Smith
531 Pennsylvania Avenue
Washington, DC"

(I verified the existence of the " via Wordpad. The file is too large to easily edit in Wordpad to get all the data for each row on a single line).

what linux command would I use on the files to get the data in each cell on one line?

I have tried:

1. awk -v RS="" '{gsub (/\n/,"")}1' file > newfile

but the cell data was still being read in as if "531 Pennsylvania Avenue" was a brand new row in the CSV file.

2. Command 1 followed by awk -v RS="" '{gsub (/\r/,"")}1' newfile > finalFile

but that resulted in all of the data in the file being put onto a single line.

3. awk -v RS="" '{gsub (/\r\n/,"")}1' file > newFile

But that result was the same as attempt number 2.

How can I preprocess the file so that:

"Bob Smith
531 Pennsylvania Avenue
Washington, DC"

is read as a single field on a single line as part of the row it should be associated with, like

"Bob Smith 531 Pennsylvania Avenue Washington, DC"
 
Old 04-04-2015, 03:35 AM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,832

Rep: Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820
Looks ok if you are using a supported gawk release.
I'd guessing you are testing for the wrong thing. Dump a few records out in hex and check what's really there.
 
Old 04-04-2015, 03:46 AM   #3
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,104

Rep: Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267Reputation: 2267
probably this helps you: http://stackoverflow.com/questions/1...riable-working
 
Old 04-04-2015, 04:41 AM   #4
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348
For handling a CSV file with embedded new line characters, I would suggest a perl solution.
Based on http://stackoverflow.com/questions/1...edded-newlines
Code:
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new ({
    binary => 1,
    sep_char => ',',
    eol => $/,                # to make $csv->print use newlines
    always_quote => 0,        # set to 1 to keep your numbers quoted
});

while (my $row = $csv->getline( *DATA )) {
    $row->[1] =~ s/[\n]+/ /g;
    $csv->print(*STDOUT, $row);
}

__DATA__
1,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",3,4
produces
Quote:
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4
You may need to be careful with CRLF line endings if Windows is involved.
 
Old 04-04-2015, 01:45 PM   #5
AZTiger98
LQ Newbie
 
Registered: Apr 2015
Posts: 2

Original Poster
Rep: Reputation: Disabled
How do I implement the code above if I'm getting the data from the .csv file and then writing back to it, rather than the data being at the end of the perl script file??
 
Old 04-04-2015, 08:16 PM   #6
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348
A quick and dirty method is to copy everything from the start up to and including the __DATA__ line from the code block in post #5 to a file, say parse.pl, and then execute
Code:
cat parse.pl <input.csv> | perl - > <output.csv>
 
Old 04-05-2015, 07:04 AM   #7
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 804

Rep: Reputation: 186Reputation: 186
Quote:
but the cell data was still being read in
What is cell in this context?

Shouldn't RS be inside the awk script '{ ... }'? I mean it's a built in and not an application variable passed to awk by -v

Secondly, how about trying FS="\n" followed by print $1 $2 $3 of for i=1;i<=NF;i++

OK
 
Old 04-05-2015, 08:39 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,832

Rep: Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820
No, that's a special usage for multi-line processing. Section 4.8 in my copy of the doco.
Quote:
Originally Posted by AZTiger98 View Post
2. Command 1 followed by awk -v RS="" '{gsub (/\r/,"")}1' newfile > finalFile

but that resulted in all of the data in the file being put onto a single line.
Sorry, I didn't pay enough attention to this yesterday. My copy of LibreOffice Calc generates "\n" if you have a multiline entry in a cell. Is your data from Excel ?. You need to take care to only change a "\r" (in your case apparently) when it appears within double quotes - not the ones at end of rows. Try using gensub rather than gsub - the benefit is you can use back-references in your regex substitution.
 
Old 04-05-2015, 09:34 AM   #9
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348
From https://www.gnu.org/software/gawk/ma...y-Content.html
Quote:
NOTE: Some programs export CSV data that contains embedded newlines between the double quotes. gawk provides no way to deal with this. Since there is no formal specification for CSV data, there isnt much more to be done; the FPAT mechanism provides an elegant solution for the majority of cases, and the gawk maintainer is satisfied with that.
 
Old 04-05-2015, 11:28 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684
So I would be curious to see what more data looks like?

A simple awk for the above would be:
Code:
awk '$1=$1' FS="\n" RS="" file
Now this does not address the csv side but none was supplied in the data, however, a small change could use the FPAT feature:
Code:
awk 'gsub(/\n/," ")' FPAT="([^,]+)|(\"[^\"]+\")" RS="" file
But without seeing what additional data looks like and if there are in fact any commas in the output, it is hard to provide a general solution.
 
Old 04-05-2015, 07:55 PM   #11
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348
The problem that gawk has with processing CSV files with embedded new lines, is that there is no way of differentiating between new lines between double quotes and normal new lines.
A substitution of new lines leads to output that is one long line.
Say the input is
Quote:
1,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",3,4
5,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",6,7
then
Code:
awk 'gsub(/\n/," ")' FPAT="([^,]+)|(\"[^\"]+\")" RS=""
produces
Quote:
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4 5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7
 
Old 04-06-2015, 06:40 AM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684Reputation: 2684
I agree but it is also why I mentioned seeing more data would help, so if using the current example I would use:
Code:
awk 'ORS = NR(percentage)3 ? " " : RS'
Note: I had to write the word percentage because for some reason it will not allow me to include the symbol when I post??

Last edited by grail; 04-06-2015 at 06:42 AM.
 
Old 04-06-2015, 07:19 AM   #13
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,832

Rep: Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820Reputation: 1820
Ain't that easy. Imagine a spreadsheet of 50 cells by 80 rows. Any cell of which can have any number of embedded \n (hopefully usually none).

It's not hard to set up a testcase that looks ugly even with only a few cells. I was able to handle a single embedded \n in any cell pretty easily, but indeterminate numbers of \n are a bear to deal with.
 
Old 04-06-2015, 09:27 AM   #14
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348Reputation: 1348
Yes, it starts to get ugly. You can create a loop to check for zero or an even number of double quotes in a line and then keep reading a line and adding to the input until the condition is satisfied.
With the input
Quote:
1,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",3,4
5,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",6,7
then
Code:
awk 'FPAT="([^,]+)|(\"[^\"]+\")"{x=$0;while((gsub(/\"/,"\"",x)%2)!=0){getline;x=x " ";x=x $0};$0=x;print}'
produces
Quote:
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4
5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7

Last edited by allend; 04-06-2015 at 10:39 AM.
 
Old 04-06-2015, 10:35 AM   #15
273
LQ Addict
 
Registered: Dec 2011
Location: UK
Distribution: Debian Sid AMD64, Raspbian Wheezy, various VMs
Posts: 5,990

Rep: Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611Reputation: 1611
Sorry, this is off topic but I am testing a bug.
Code:
does ()
come out OK?
No, it does not.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Removing specific lines from a file (multiple conditions) Sushukka Linux - General 8 01-23-2015 09:29 AM
Why does open office put lines into adjacent cell when I am splitting one cell 1sweetwater! Linux - Software 1 12-03-2014 02:36 PM
[SOLVED] C Program Trouble Removing Multiple New Lines CincinnatiKid Programming 6 01-20-2014 09:50 AM
[SOLVED] Extract multiple lines of data from a text file. shawnamiller Programming 8 04-30-2010 12:46 PM
Removing quotes from a CSV file with Perl Grafbak Programming 5 07-25-2006 04:23 AM


All times are GMT -5. The time now is 06:29 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration