Old 09-11-2012, 02:17 PM   #1
K-Veikko
LQ Newbie
 
Registered: Jul 2005
Posts: 11

Rep: Reputation: 0
Convert text paragraph for database


I have successfully scanned and OCR'd thousands of "famous quotes". Now I have a text file:

Code:
One or more
lines of text immediately followed by.
The author name on one line.
[Two line]
[breaks.]
To copy these quotes to a database, I'd like to reorganize the lines into a CSV file:

Code:
"One or more lines of text immediately followed by.","The author name on one line."
[Line break]
What sed or awk command (or maybe perl) would do this? Preferably a one-line command that can be used in a pipe.
 
Old 09-11-2012, 03:41 PM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362
Hi.

Code:
$ cat infile
One or more
lines of text immediately
followed by.
The author name on one line.


More
lines.
firstfire
$ sed -nr 'H; x; $ba; /\n\n/{:a; s/^\n|\n\n//g; s/\n/ /g; s/([^.]+\.) +(.*)/"\1", "\2"/; p; b}; x;' infile
"One or more lines of text immediately followed by.", "The author name on one line."
"More lines.", "firstfire"
This is hardly a one-liner though..

Last edited by firstfire; 09-12-2012 at 11:17 AM. Reason: Fixed file name
 
Old 09-11-2012, 06:24 PM   #3
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941
GNU-awk:

Code:
$ awk 'BEGIN { RS = "\n\n\n" }
{
  gsub(/^|$/,"\"")
  sub(/\n+"$/,"\"")
  $0 = gensub(/\n([^\n]+"$)/,"\",\"\\1","g")
  gsub(/\n/," ")
  sub(/$/,"\n")
  print
}' file
Even this is hardly a one-liner though...
 
Old 09-12-2012, 06:30 AM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,399

Rep: Reputation: 814
GNU Awk:
Code:
# a one liner should be less than 80 characters
# 84 chars puts it a bit over
awk -F'\n' -vRS='\n\n\n' '{author=$NF; NF--; printf("\"%s\",\"%s\"\n", $0, author)}'

# we can squeeze it down a bit: 62 chars
awk -F\\n -vRS='\n\n\n' -vQ=\" '{a=$NF;NF--;$0=Q$0Q","Q a Q}1'

# 60 chars, requires gawk version 4+
awk -F\\n -vRS=\\n{3} -vQ=\" '{a=$NF;NF--;$0=Q$0Q","Q a Q}1'

# But if the last "famous quote" isn't followed by 2 line breaks you need
# -vRS='\n(\n\n|$)'

# I also thought a=$(NF--); should be equivalent to a=$NF;NF--;
# but this doesn't work for some reason...
 
Old 09-12-2012, 09:05 AM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,067

Rep: Reputation: 284
As a learning exercise I like to implement and test solutions posted by respondents to interesting problems such as this one. My test program (shown below) uses four solutions -- my own, and those already posted by firstfire, colucix, and ntubski. I constructed a test file of real-world quotations.

It is perplexing to find that the output files from the four solutions differ to some degree. Perhaps I have misunderstood the problem; perhaps there is ambiguity in the OP's problem statement.

Input file ...
Code:
Politics is the art of looking for trouble, finding it everywhere,
diagnosing it incorrectly, and applying the wrong remedies.
Groucho Marx

Too bad all the people who know how to run the country are
busy driving cabs and cutting hair.
George Burns

My husband and I are either going to buy a dog or have a child.
We can't decide whether to ruin our carpet or ruin our lives.
Rita Rudner

Men occasionally stumble over the truth, but most of them
pick themselves up and hurry off as if nothing happened.
Winston Churchill

Giving money and power to government is like giving whiskey and
car keys to teenage boys.
P.J. O'Rourke

The difference between genius and stupidity is that genius has limits.
Albert Einstein
My test program ...
Code:
#!/bin/bash
#   Daniel B. Martin   Sep12
#
#   To execute this program, launch a terminal session and enter:
#   bash /home/daniel/Desktop/LQfiles/dbm472.bin
#
#  This program inspired by
#  http://www.linuxquestions.org/questions/programming-9/
#    convert-text-paragraph-for-database-4175426737/

# File identification  
InFile='/home/daniel/Desktop/LQfiles/dbm472inp.txt'
OutFile1='/home/daniel/Desktop/LQfiles/dbm472out1.txt'
OutFile2='/home/daniel/Desktop/LQfiles/dbm472out2.txt'
OutFile3='/home/daniel/Desktop/LQfiles/dbm472out3.txt'
OutFile4='/home/daniel/Desktop/LQfiles/dbm472out4.txt'

# 1) Change all line breaks to tildes.
# 2) Remove the trailing tilde(s) left by the line break(s) at the end of
#    the file, so the last author is not absorbed into the last quote.
# 3) Change all double tildes to single line breaks.
# 4) Prefix and postfix every line with a double-quote .. and ..
#    replace the last tilde in each line with a quoted comma.
# 5) Change all tildes to blanks.
echo; echo "Method of DBM"
tr "\n" "~"  < $InFile            \
|sed -r 's/~+$//'                 \
|sed -r 's/~~/\n/g'               \
|sed -r 's/(.*)~(.*)/"\1","\2"/'  \
|tr '~' ' '                       \
> $OutFile1
cat $OutFile1


echo; echo "Method of LQ member firstfire"
sed -nr 'H; x; $ba; /\n\n/{:a; s/^\n|\n\n//g;
  s/\n/ /g; s/([^.]+\.) +(.*)/"\1", "\2"/; p; b}; x;' $InFile > $OutFile2
cat $OutFile2

echo; echo "Method of LQ moderator colucix"
awk 'BEGIN { RS = "\n\n\n" }
{
  gsub(/^|$/,"\"")
  sub(/\n+"$/,"\"")
  $0 = gensub(/\n([^\n]+"$)/,"\",\"\\1","g")
  gsub(/\n/," ")
  sub(/$/,"\n")
  print
}' $InFile > $OutFile3
cat $OutFile3

echo; echo "Method of LQ Senior Member ntubski"
awk -F\\n -vRS='\n(\n\n|$)' -vQ=\" '{a=$NF;NF--;$0=Q$0Q","Q a Q}1' $InFile > $OutFile4
cat $OutFile4

echo; echo "Normal end of job."; echo
exit
Readers are invited to comment on any aspect of my testing and/or correct their solutions.
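To see at a glance which of the four output files actually differ, a small comparison loop can help. This is only a sketch: the function name `report_diffs` is made up for illustration, and it relies on nothing beyond the standard `cmp -s` utility.

```shell
# Hypothetical helper: compare every pair of the given files and
# report whether each pair is byte-for-byte identical.
report_diffs() {
  while [ "$#" -gt 1 ]; do
    first=$1
    shift
    for other in "$@"; do
      if cmp -s "$first" "$other"; then
        echo "same: $first $other"
      else
        echo "differ: $first $other"
      fi
    done
  done
}
```

In the test program above it could be called as `report_diffs "$OutFile1" "$OutFile2" "$OutFile3" "$OutFile4"` just before the final `echo`.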

Daniel B. Martin

Last edited by danielbmartin; 09-12-2012 at 09:11 AM. Reason: Correct t7po
 
Old 09-12-2012, 02:21 PM   #6
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,399

Rep: Reputation: 814
Quote:
Originally Posted by danielbmartin
It is perplexing to find that the output files from the four solutions differ to some degree. Perhaps I have misunderstood the problem; perhaps there is ambiguity in the OP's problem statement.
firstfire, colucix, and I all understood the separator between quotes to be 2 empty lines (which is 3 newline characters), whereas you understood it to be 2 newline characters (which is 1 empty line). Upon rereading the original post I suspect your interpretation was the intended one, although the notation of the example input is kind of confusing...


Code:
One or more
lines of text immediately followed by.
The author name on one line.
[Two line] # looks like
[breaks.]  # 2 empty lines
Code:
"One or more lines of text immediately followed by.","The author name on one line."
[Line break] # but I didn't think this indicated an empty line
I think the lesson here is always give a concrete example input and output.
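When no concrete sample is available, one way to settle the separator question is to count the runs of empty lines in the real file. A minimal POSIX awk sketch follows; the function name `blank_runs` is hypothetical, and it assumes that "empty" means a line with no characters at all (no stray spaces or carriage returns from the OCR step).

```shell
# Hypothetical helper: report how many consecutive empty lines occur
# between blocks of text, and how often each run length appears.
blank_runs() {
  awk '
    /^$/ { run++; next }             # count consecutive empty lines
    run  { seen[run]++; run = 0 }    # a non-empty line closes the run
    END  {
      for (n in seen)
        printf "%d empty line(s): %d time(s)\n", n, seen[n]
    }
  ' "$@"
}
```

Run as `blank_runs infile`. A single line of output means the separator is consistent; several lines mean the file mixes separators. Empty lines at the very end of the file are not counted, because no text follows them.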

Last edited by ntubski; 09-12-2012 at 05:54 PM. Reason: I should speak for myself :/
 
Old 09-12-2012, 02:42 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,067

Rep: Reputation: 284
Quote:
Originally Posted by ntubski
I think the lesson here is always give a concrete example input and output.
Let's call on OP to clarify and provide this. He might use my test input file or one of his own.

Daniel B. Martin
 
Old 09-13-2012, 10:58 AM   #8
K-Veikko
LQ Newbie
 
Registered: Jul 2005
Posts: 11

Original Poster
Rep: Reputation: 0
Thank you very much for these answers.

I quickly noticed that my question was inaccurate about how many line breaks there are. Thanks for the comment, ntubski.
- So I replaced \n\n\n with \n\n\n* where appropriate.

- The use of "gensub" in a script worked only after installing gawk.
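Both points (a variable number of line breaks, and gensub being gawk-only) can also be sidestepped with awk's paragraph mode, which should work in any POSIX awk. The following is a sketch rather than a drop-in replacement for the answers above: the function name `quotes_to_csv` is made up, and doubling embedded double quotes is an added assumption about what the database import expects, not something requested in this thread.

```shell
# Hypothetical helper: turn blank-line-separated quote blocks into CSV.
# The last line of each block is taken as the author.
quotes_to_csv() {
  awk '
    BEGIN { RS = ""; FS = "\n" }     # paragraph mode: one or more empty lines end a record
    {
      author = $NF                   # last line of the block
      quote = ""
      for (i = 1; i < NF; i++)       # join the remaining lines with single spaces
        quote = quote (i > 1 ? " " : "") $i
      gsub(/"/, "\"\"", quote)       # assumption: escape embedded quotes by doubling them
      gsub(/"/, "\"\"", author)
      printf "\"%s\",\"%s\"\n", quote, author
    }
  ' "$@"
}
```

It reads files given as arguments or standard input, so it fits in a pipe: `cat infile | quotes_to_csv > quotes.csv`. Because paragraph mode treats any run of empty lines as one separator, the same command handles input separated by one empty line or by two.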

--

"Actually, we know only when we know little: with knowledge, doubt grows.","Goethe, Maximen und Reflexionen."
 
  



Tags
awk, perl, sed

