LinuxQuestions.org


K-Veikko 09-11-2012 02:17 PM

Convert text paragraph for database
 
I have successfully scanned and OCR'd thousands of "famous quotes". Now I have a text file like this:

Code:

One or more
lines of text immediately followed by.
The author name on one line.
[Two line]
[breaks.]

To copy these quotes into a database, I'd like to reorganize the lines into a CSV file:

Code:

"One or more lines of text immediately followed by.","The author name on one line."
[Line break]

What is the sed or awk command to do this (or maybe Perl)? Preferably a one-line command that I can use in a pipe.

firstfire 09-11-2012 03:41 PM

Hi.

Code:

$ cat infile
One or more
lines of text immediately
followed by.
The author name on one line.


More
lines.
firstfire
$ sed -nr 'H; x; $ba; /\n\n/{:a; s/^\n|\n\n//g; s/\n/ /g; s/([^.]+\.) +(.*)/"\1", "\2"/; p; b}; x;' infile
"One or more lines of text immediately followed by.", "The author name on one line."
"More lines.", "firstfire"

This is hardly a one-liner though..

colucix 09-11-2012 06:24 PM

GNU-awk:

Code:

$ awk 'BEGIN { RS = "\n\n\n" }
{
  gsub(/^|$/,"\"")
  sub(/\n+"$/,"\"")
  $0 = gensub(/\n([^\n]+"$)/,"\",\"\\1","g")
  gsub(/\n/," ")
  sub(/$/,"\n")
  print
}' file

Even this is hardly a one-liner though... ;)

ntubski 09-12-2012 06:30 AM

GNU Awk:
Code:

# a one liner should be less than 80 characters
# 84 chars puts it a bit over
awk -F'\n' -vRS='\n\n\n' '{author=$NF; NF--; printf("\"%s\",\"%s\"\n", $0, author)}'

# we can squeeze it down a bit: 62 chars
awk -F\\n -vRS='\n\n\n' -vQ=\" '{a=$NF;NF--;$0=Q$0Q","Q a Q}1'

# 60 chars, requires gawk version 4+
awk -F\\n -vRS=\\n{3} -vQ=\" '{a=$NF;NF--;$0=Q$0Q","Q a Q}1'

# But if the last "famous quote" isn't followed by 2 line breaks you need
# -vRS='\n(\n\n|$)'

# I also thought a=$(NF--); should be equivalent to a=$NF;NF--;
# but this doesn't work for some reason...


danielbmartin 09-12-2012 09:05 AM

As a learning exercise I like to implement and test solutions posted by respondents to interesting problems such as this one. My test program (shown below) uses four solutions -- my own, and those already posted by firstfire, colucix, and ntubski. I constructed a test file of real-world quotations.

It is perplexing to find that the output files from the four solutions differ to some degree. Perhaps I have misunderstood the problem; perhaps there is ambiguity in the OP's problem statement.

Input file ...
Code:

Politics is the art of looking for trouble, finding it everywhere,
diagnosing it incorrectly, and applying the wrong remedies.
Groucho Marx

Too bad all the people who know how to run the country are
busy driving cabs and cutting hair.
George Burns

My husband and I are either going to buy a dog or have a child.
We can't decide whether to ruin our carpet or ruin our lives.
Rita Rudner

Men occasionally stumble over the truth, but most of them
pick themselves up and hurry off as if nothing happened.
Winston Churchill

Giving money and power to government is like giving whiskey and
car keys to teenage boys.
P.J. O'Rourke

The difference between genius and stupidity is that genius has limits.
Albert Einstein

My test program ...
Code:

#!/bin/bash
#  Daniel B. Martin  Sep12
#
#  To execute this program, launch a terminal session and enter:
#  bash /home/daniel/Desktop/LQfiles/dbm472.bin
#
#  This program inspired by
#  http://www.linuxquestions.org/questions/programming-9/
#    convert-text-paragraph-for-database-4175426737/

# File identification 
InFile='/home/daniel/Desktop/LQfiles/dbm472inp.txt'
OutFile1='/home/daniel/Desktop/LQfiles/dbm472out1.txt'
OutFile2='/home/daniel/Desktop/LQfiles/dbm472out2.txt'
OutFile3='/home/daniel/Desktop/LQfiles/dbm472out3.txt'
OutFile4='/home/daniel/Desktop/LQfiles/dbm472out4.txt'

# 1) Change all line breaks to tildes.
# 2) Change all double tildes to single line breaks.
# 3) Prefix and postfix every line with a double-quote .. and ..
#    replace the last tilde in each line with a comma.
# 4) Change all tildes to blanks.
echo; echo "Method of DBM"
tr "\n" "~"  < $InFile            \
|sed -r 's/~~/\n/g'              \
|sed -r 's/(.*)~(.*)/"\1","\2"/'  \
|tr '~' ' '                      \
> $OutFile1
cat $OutFile1


echo; echo "Method of LQ member firstfire"
sed -nr 'H; x; $ba; /\n\n/{:a; s/^\n|\n\n//g;
  s/\n/ /g; s/([^.]+\.) +(.*)/"\1", "\2"/; p; b}; x;' $InFile > $OutFile2
cat $OutFile2

echo; echo "Method of LQ moderator colucix"
awk 'BEGIN { RS = "\n\n\n" }
{
  gsub(/^|$/,"\"")
  sub(/\n+"$/,"\"")
  $0 = gensub(/\n([^\n]+"$)/,"\",\"\\1","g")
  gsub(/\n/," ")
  sub(/$/,"\n")
  print
}' $InFile > $OutFile3
cat $OutFile3

echo; echo "Method of LQ Senior Member ntubski"
awk -F\\n -vRS='\n(\n\n|$)' -vQ=\" '{a=$NF;NF--;$0=Q$0Q","Q a Q}1' $InFile > $OutFile4
cat $OutFile4

echo; echo "Normal end of job."; echo
exit
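The test program prints each output file so the results can be compared by eye. A minimal sketch of a mechanical check is shown here, assuming the four output files already exist; the function name `compare_outputs` is hypothetical and not part of the original script. It only reports whether each file is byte-identical to the first one, so a difference as small as the space after the comma in firstfire's output will show up as "differs".

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): compare every
# remaining file against the first one, byte for byte.
compare_outputs() {
  ref=$1
  shift
  for f in "$@"; do
    if cmp -s "$ref" "$f"; then
      echo "$f: same as $ref"
    else
      echo "$f: differs from $ref"
    fi
  done
}

# Usage with the variables from the test program (assumed to be set):
# compare_outputs "$OutFile1" "$OutFile2" "$OutFile3" "$OutFile4"
```

Running `diff` on any pair reported as different then shows exactly where the solutions disagree.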

Readers are invited to comment on any aspect of my testing and/or correct their solutions.

Daniel B. Martin

ntubski 09-12-2012 02:21 PM

Quote:

Originally Posted by danielbmartin (Post 4778513)
It is perplexing to find that the output files from the four solutions differ to some degree. Perhaps I have misunderstood the problem; perhaps there is ambiguity in the OP's problem statement.

firstfire, colucix, and I all understood the separator between quotes to be 2 empty lines (which is 3 newline characters), whereas you understood it to be 2 newline characters (which is 1 empty line). Upon rereading the original post I suspect your interpretation was the intended one, although the notation of the example input is kind of confusing...


Code:

One or more
lines of text immediately followed by.
The author name on one line.
[Two line] # looks like
[breaks.]  # 2 empty lines

Code:

"One or more lines of text immediately followed by.","The author name on one line."
[Line break] # but nobody thought this indicated an empty line

I think the lesson here is always give a concrete example input and output.
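When the real input is too large to post, one way to remove the ambiguity is to measure the separator directly. The following is a minimal sketch, assuming that "empty" means a line with no characters at all (a line holding only spaces or a stray carriage return from the OCR step would not count); the function name `blank_runs` is hypothetical. It prints each run length of consecutive empty lines together with how often that run occurs.

```shell
#!/bin/bash
# Hypothetical helper: report how many runs of consecutive empty lines
# of each length occur. Output lines have the form "<run length> <count>".
blank_runs() {
  awk '
    /^$/ { n++; next }            # extend the current run of empty lines
    n    { count[n]++; n = 0 }    # a non-empty line ends the run
    END  {
      if (n) count[n]++           # a run that reaches the end of the file
      for (k in count) print k, count[k]
    }
  ' "$@" | sort -n
}

# Two records separated by two empty lines (three newline characters):
printf 'quote one\nAuthor A\n\n\nquote two\nAuthor B\n' | blank_runs
```

A single output line starting with 2 would confirm the "two empty lines" reading, a line starting with 1 the "one empty line" reading, and several lines would mean the separator is not consistent.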

danielbmartin 09-12-2012 02:42 PM

Quote:

Originally Posted by ntubski (Post 4778783)
I think the lesson here is always give a concrete example input and output.

Let's call on OP to clarify and provide this. He might use my test input file or one of his own.

Daniel B. Martin

K-Veikko 09-13-2012 10:58 AM

Thank you very much for these answers.

I quickly noticed that my question was inaccurate about how many line breaks there are. Thanks for the comment, ntubski.
- So I replaced \n\n\n with \n\n\n* where appropriate.

- The use of "gensub" in a script worked only after installing gawk.
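For anyone who cannot install gawk: changing \n\n\n to \n\n\n* makes the separator "two or more newlines", and awk's paragraph mode (RS set to the empty string) gives similar behaviour in any POSIX awk without gensub. The following is a minimal sketch, assuming the last line of each block is the author. The function name `quotes_to_csv` is hypothetical, and doubling embedded double quotes is an assumption about what the database's CSV import expects.

```shell
#!/bin/bash
# Hypothetical helper: turn blank-line-separated quote blocks into CSV
# lines of the form "quote text","author".
quotes_to_csv() {
  awk '
    BEGIN { RS = ""; FS = "\n" }   # paragraph mode: blank lines separate records
    {
      author = $NF                 # the last line of a block is the author
      text = ""
      for (i = 1; i < NF; i++)     # join the remaining lines with single spaces
        text = text (i > 1 ? " " : "") $i
      gsub(/"/, "\"\"", text)      # assumed CSV convention: double embedded quotes
      gsub(/"/, "\"\"", author)
      printf "\"%s\",\"%s\"\n", text, author
    }
  ' "$@"
}

# Usage in a pipe (the file names are placeholders):
# quotes_to_csv < quotes.txt > quotes.csv
```

Because paragraph mode treats any run of blank lines as one separator, the same command handles both one and two empty lines between quotes.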

--

"Properly speaking, we know only when we know little: with knowledge, doubt grows.","Goethe, Maximen und Reflexionen."

