[SOLVED] html from bash

rbees · 03-11-2015, 07:21 PM

Ladies & Gents

As always, thanks for the wonderful guidance this site and its users provide.

This time I am trying to get a jump on what I want to do. I have to either generate 475 data files by hand or access the existing data, which is in html files, from my bash script. Needless to say, I am not that interested in spending 2 weeks doing copy/paste to generate those files. So I thought I would explore what it would take to access said html data. Maybe I can do it in less than 2 weeks. :>]

So I have some 900 html files that contain the Hebrew Bible on my localhost, but I have not been able to get curl to access them; if I curl localhost it pulls up the nginx server. Would I have to configure the server to access them? I would prefer not. Well, color me brain dead, duhhhh. It helps if you use the right kind of URL: file:/// vs http://. So this code will pull the file and strip the html:

Code:

curl -s "file://$HOME/bin/shabbat/JPS/et1403.htm" | w3m -dump -T text/html

Guess the numbers may not matter that much after all, but we will see.

What I need to do:
1. Get the part of the weekly reading; there are nine of them, and each may or may not span several files. E.g. Genesis 4:23 - 5:24: each chapter is in a separate file, so in this case two different files are required. I don't think it will ever be more than three.
2. Locate the actual verse the reading starts on.
3. Locate the actual verse the reading ends on, which may be in a different file.
4. Strip the html tags, keeping the paragraph formatting but stripping the verse numbers, and export it to plain text for use with espeak or similar.
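
Steps 2 to 4 can be sketched for a single chapter file with awk alone. This is only a rough sketch: it assumes every verse sits on one line that carries a `<B>n</B>` marker, as in the sample file further down, and `extract_verses` is a made-up helper name. It drops paragraph breaks rather than keeping them, so that part of step 4 is not covered.

```shell
#!/usr/bin/env bash
# Sketch: print verses FIRST..LAST of one chapter file as plain text.
# Assumption: each verse is on one line containing <B>n</B> (sample-file layout).
# extract_verses is a hypothetical helper, not part of any package.
extract_verses() {
    local first=$1 last=$2 file=$3
    awk -v first="$first" -v last="$last" '
        # Only lines with a <B>number</B> marker are treated as verses.
        match($0, /<[Bb]>[0-9]+<\/[Bb]>/) {
            n = substr($0, RSTART + 3, RLENGTH - 7) + 0   # the verse number
            if (n >= first && n <= last) {
                line = $0
                sub(/<[Bb]>[0-9]+<\/[Bb]> */, "", line)    # drop the verse number
                gsub(/<[^<>]*>/, "", line)                 # drop remaining tags
                print line
            }
        }
    ' "$file"
}

# Demo on three verse lines copied from the sample file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<A NAME="2"> </A>
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<A NAME="3"> </A>
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
<A NAME="4"> </A>
<B>4</B> The sun will be turned into darkness, and the moon into blood, before the great and terrible day of HaShem come.
EOF
extract_verses 2 3 "$sample"   # prints verses 2 and 3 without tags or numbers
rm -f "$sample"
```

A reading that spans two chapters would call this once per file, for example from the start verse to a large end number in the first file and from 1 to the end verse in the second.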

Thoughts?

Any sample code I could look at?

Thanks again

The smallest of the htm files as a sample:

Code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Joel 3 / Hebrew Bible in English</TITLE>
<SCRIPT TYPE="text/javascript" SRC="em.js"></SCRIPT>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<DIV ALIGN="JUSTIFY">
<CENTER>
<TABLE CELLPADDING="10" CELLSPACING="10" WIDTH="100%">
<TR ALIGN="CENTER">
<TD VALIGN=TOP BGCOLOR="#FFFFCC"><P ALIGN=CENTER>
<FONT SIZE="-1">
<A HREF="et0.htm">Bible</A> -
Joel - <A HREF="et14.htm">All</A><BR>Chapter
<A HREF="et1401.htm">1</A>
<A HREF="et1402.htm">2</A>
3
<A HREF="et1404.htm">4</A>
</FONT></P>
</TD></TR>
</TABLE>
</CENTER>

<H1 ALIGN="CENTER">Joel Chapter 3</H1>
<A NAME="1"> </A>
<P><B>1</B> And it will come to pass afterward, that I will pour out My spirit upon all flesh; and your sons and your daughters will prophesy, your old men will dream dreams, your young men will see visions;
<A NAME="2"> </A>
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<A NAME="3"> </A>
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
<A NAME="4"> </A>
<B>4</B> The sun will be turned into darkness, and the moon into blood, before the great and terrible day of HaShem come.
<A NAME="5"> </A>
<B>5</B> And it will come to pass, that whosoever will call on the name of HaShem will be delivered; for in mount Zion and in Jerusalem there will be those that escape, as HaShem has said, and among the remnant those whom HaShem will call.</P>
<A NAME="6"> </A>
<CENTER>
<TABLE CELLPADDING="10" CELLSPACING="10" WIDTH="100%">
<TR ALIGN="CENTER">
<TD VALIGN=TOP BGCOLOR="#FFFFCC"><P ALIGN=CENTER>
<FONT SIZE="-1">
<A HREF="et0.htm">Bible</A> -
Joel - <A HREF="et14.htm">All</A><BR>Chapter
<A HREF="et1401.htm">1</A>
<A HREF="et1402.htm">2</A>
3
<A HREF="et1404.htm">4</A>
</FONT></P>
</TD></TR>
</TABLE>
<P>
<B><A NAME="Mail">Got a question or comment?</A> <SCRIPT TYPE="text/javascript">email('et1403')</SCRIPT></B></P>
</CENTER>
</DIV></BODY></HTML>

grail · 03-11-2015, 10:28 PM

In this case, bash is not really the go-to tool. I would suggest looking into Perl or Ruby, as either can strip detail from html relatively easily.

I am more familiar with Ruby so will use it as the example, but you may find Perl more to your liking and it probably has greater support on LQ (generally speaking).

So once you have Ruby installed (will leave this to you as it is system dependent) you will need the nokogiri gem which is used to do the heavy lifting:

Code:

$ gem install nokogiri -r

The above will give you a local copy, i.e. accessible by you but not by others on the same box.

Then if you look at the attached, it is a simple example of how to use some of the features

rbees · 03-12-2015, 09:03 AM

Thanks grail,

I have chosen to try to use Perl for this, as it is more widely installed on systems and is installed on the headless server that the whole project will eventually be on. It also has the benefit of running on most platforms, giving the project a wider user base.

That said I have started working my way through this book hosted at https://www.perl.org/books/beginning-perl/

Thanks again.

schneidz · 03-12-2015, 10:07 AM

quick-and-dirty:

Code:

[schneidz@hyper ~]$ egrep "^.H1 ALIGN=\"CENTER\"|.B.[0-9]*./B." rbees.html 
<H1 ALIGN="CENTER">Joel Chapter 3</H1>
<P><B>1</B> And it will come to pass afterward, that I will pour out My spirit upon all flesh; and your sons and your daughters will prophesy, your old men will dream dreams, your young men will see visions;
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
<B>4</B> The sun will be turned into darkness, and the moon into blood, before the great and terrible day of HaShem come.
<B>5</B> And it will come to pass, that whosoever will call on the name of HaShem will be delivered; for in mount Zion and in Jerusalem there will be those that escape, as HaShem has said, and among the remnant those whom HaShem will call.</P>

what is the end game here ?

rbees · 03-12-2015, 10:36 AM

Thanks schneidz,

The "end game" is to extract the weekly Torah (first 5 books of the bible) reading from the html files, parse it, stripping out all the computer speak, and pass the end result to espeak/festival for text to speech processing at a specific time. There are two choices for html file type: either one file containing the whole book of, say, Genesis, or one file per chapter. The weekly reading is broken up into 7 readings called aliyahs, with two more tacked on for good measure.

You can see a "very crude" beginning at https://github.com/rbees/Shabbat-Shofar I have made a lot of improvements lately but until I get the readings part setup I don't want to push them.

schneidz · 03-12-2015, 10:49 AM

i would package them up by books so that you can parse thru a single text file with awk or sed.
i'm not very religious so it's not immediately obvious to me what a reading is, but maybe you could have a list of line numbers that equate to the weekly readings.

i competed in a hackathon where we used raspberry-pis (pifm) to broadcast a message using espeak.

rbees · 03-13-2015, 01:54 PM

The references and a description can be seen at.
http://en.wikipedia.org/wiki/Weekly_Torah_portion

But even if I packed them up so that each week was a separate file, I would have to generate 61 separate files for the Torah portion plus 61 for the Haftara portion. Then each week is divided into seven parts, plus a conclusion and a Haftara. Then there are the special readings for new months and other things too. That is the work I want to get away from.

firstfire · 03-14-2015, 05:09 AM

Hi.

Your sample HTML file is not very well suited for retrieving data because the sentences are not wrapped in any tags, so you cannot target them using CSS selectors:

Code:

..
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
..

Fortunately there are a lot of alternatives on the net, e.g. this one (not sure about the wording though).

For example, first sentence of first chapter of Genesis looks like:

Code:

<span class="text Gen-1-1"><span class="chapternum">1&nbsp;</span>In the beginning God created the heavens and the earth. </span>

So, you can refer to it using .Gen-1-1 CSS selector.

Let's retrieve whole Genesis book:

Code:

curl 'https://www.biblegateway.com/passage/?search=Genesis+[1-50]&version=CJB' -o /tmp/gen-#1.html

This will create 50 html files in /tmp: gen-1.html through gen-50.html.

To interpret CSS selectors I will use html-xml-utils (standard package in debian/ubuntu distros).

To retrieve the 10th sentence of the third chapter we may do:

Code:

$ hxnormalize -l 3000 -x /tmp/gen-3.html 2>/dev/null  | hxselect -cs '\n' .Gen-3-10 | hxremove span | hxremove sup | sed 's/<[^<>]\+>//g' 
He answered, I heard your voice in the garden, and I was afraid, because I was naked, so I hid myself.

hxnormalize -x fixes input HTML file so that it is suitable for hxselect. Using -l 3000 we set maximum line length (so that each sentence will be on a separate line). hxselect applies CSS selector. hxremove strips unnecessary tags (with content). Finally, sed removes remaining markup (<i> etc). Alternatively, instead of sed, you may use lynx -stdin -dump to interpret resulting html as text.
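
The final `sed` step relies on `\+`, which is a GNU extension to basic regular expressions. A roughly equivalent sketch using extended regular expressions (`sed -E`) is given here. `strip_tags` is a hypothetical name, and the `&nbsp;` replacement is an added assumption about what the markup may contain, not something the pipeline above does.

```shell
#!/usr/bin/env bash
# Sketch: remove leftover inline tags from text on stdin.
# strip_tags is a hypothetical helper name.
strip_tags() {
    # First expression drops anything that looks like <tag ...> or </tag>;
    # second turns non-breaking-space entities into plain spaces (assumption).
    sed -E 's/<[^<>]+>//g; s/&nbsp;/ /g'
}

printf '%s\n' '<i>He answered</i>, <b>I heard</b> your voice' | strip_tags
```

Like the original expression, this only handles tags that open and close on the same line, which is why the `-l` option to hxnormalize matters.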

schneidz · 03-14-2015, 10:00 AM

Quote:

Originally Posted by rbees

The references and a description can be seen at.
http://en.wikipedia.org/wiki/Weekly_Torah_portion

But even if I packed them up so that each week was a separate file, I would have to generate 61 separate files for the Torah portion plus 61 for the Haftara portion. Then each week is divided into seven parts, plus a conclusion and a Haftara. Then there are the special readings for new months and other things too. That is the work I want to get away from.

i don't understand this addressing. e.g.- Ex. 1:1-6:1
is this exodus book 1: verses 1-6: book 1 ?

rbees · 03-14-2015, 08:17 PM

Thanks firstfire & schneidz

I understand that the sample file I posted may not be easy to extract the data from in an automated way but only by hand. I am willing to do so, but would really rather not.

[Rant warning]
Not to get into a religious debate here but the problem with most of the downloadable "Bibles" in english out there is that they are "christian" translations and are translated in a way that supports their christian theology. They are highly frowned on in Jewish circles. The one you provided a link to is a prime example, even though it has a supposed "Jewish" name it is still a christian bible and contains the christian new testament and seeks to proselytize Jews and lead us away from the way G0d told us to live. [/Rant]

Quote:

i don't understand this addressing. e.g.- Ex. 1:1-6:1 is this exodus book 1: verses 1-6: book 1

Yes and no.

Ex refers to the Hebrew book of Shemot, which has the English name Exodus.
1:1 refers to chapter 1 verse 1
6:1 refers to chapter 6 verse 1.
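
That addressing can be split into its parts with a bash regular expression. The following is a minimal sketch, assuming references shaped like `Ex. 1:1-6:1` or `Genesis 4:23 - 5:24`; `parse_ref` and the variable names it sets are made up for illustration. If the end chapter is left out, it is assumed to be the same as the start chapter.

```shell
#!/usr/bin/env bash
# Sketch: split "Book C:V-C:V" into book, start chapter/verse, end chapter/verse.
# parse_ref and the globals it sets (book, start_ch, start_v, end_ch, end_v)
# are hypothetical names.
parse_ref() {
    local ref=$1
    # book, optional dot, start chapter:verse, dash, optional end chapter, end verse
    local re='^([A-Za-z]+)\.? +([0-9]+):([0-9]+) *- *(([0-9]+):)?([0-9]+)$'
    if [[ $ref =~ $re ]]; then
        book=${BASH_REMATCH[1]}
        start_ch=${BASH_REMATCH[2]}
        start_v=${BASH_REMATCH[3]}
        end_ch=${BASH_REMATCH[5]:-$start_ch}   # same chapter when omitted
        end_v=${BASH_REMATCH[6]}
    else
        echo "cannot parse: $ref" >&2
        return 1
    fi
}

parse_ref 'Ex. 1:1-6:1' && echo "$book $start_ch:$start_v to $end_ch:$end_v"
```

The book abbreviation still has to be mapped to a full name or file prefix separately.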

Note that the christian verse numberings are not always the same as the Hebrew numberings.

Sorry for the rant

Thanks again

firstfire · 03-14-2015, 10:59 PM

Okay, no problem.
How about this link: Genesis 1? It is from the same resource as your samples, but contains both the Hebrew and English versions side by side, formatted as a table. Each row of this table corresponds to a separate verse.

Download whole Genesis book:

Code:

curl 'http://www.mechon-mamre.org/p/pt/pt01[01-50].htm' -o 'gen-#1.html'

This creates 50 files gen-01.html ... gen-50.html

Fetch, say, the 20th verse of the 5th chapter:

Code:

$ hxnormalize -x -l 1000 gen-05.html | hxselect -s '\n' -c 'tr:nth-child(20) > td:nth-child(2)' | hxremove b
  And all the days of Jared were nine hundred sixty and two years; and he died.

rbees · 03-15-2015, 12:48 PM

Thanks

I have worked up a string that seems to work. Gotta love sed one-liners.

Code:

w3m -dump -T text/html $HOME/bin/shabbat/JPS/et0106.htm | sed '1,5d' | sed -e :a -e '$d;N;2,6ba' -e 'P;D' | sed -n '/9/,/22/p' > testfile

It can probably be trimmed up by someone that really understands how it does what it does better than I do.
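
For what it is worth, the first two sed stages drop the first 5 lines and the last 6 lines of the w3m output. A sketch of the same trimming in a single awk call follows; `trim_lines` is a hypothetical name, and the demo uses `seq` in place of the real w3m output.

```shell
#!/usr/bin/env bash
# Sketch: drop the first HEAD and the last TAIL lines of stdin.
# trim_lines is a hypothetical helper name.
trim_lines() {
    awk -v head="$1" -v tail="$2" '
        NR <= head { next }                 # skip the leading lines
        { buf[NR] = $0 }                    # hold each later line back
        NR - tail > head {                  # release a line once TAIL newer lines exist
            print buf[NR - tail]
            delete buf[NR - tail]
        }
    '
}

# Same effect as: sed 1,5d piped into the "delete last 6 lines" one-liner.
seq 1 20 | trim_lines 5 6    # prints 6 through 14
```

The last stage, `sed -n '/9/,/22/p'`, matches any line containing those digits anywhere, so it may start or stop on the wrong line when a number also appears inside the text.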

Now all I have to do is put in the variables and build an associative array for the files.
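
A minimal sketch of such an associative array, assuming the file names follow the pattern `et` + two-digit book number + two-digit chapter seen in et1403.htm (Joel 3). Only Joel's number comes from the sample file; 01 for Genesis is an assumption taken from the pt01 download above, and every other book still has to be filled in. `book_num`, `chapter_file`, and `JPS_DIR` are made-up names. Requires bash 4 or later.

```shell
#!/usr/bin/env bash
# Sketch: map a book name and chapter number to a chapter file path.
# book_num, chapter_file and JPS_DIR are hypothetical names.
# Joel=14 comes from the sample file et1403.htm; Genesis=01 is an assumption
# based on the pt01 download; the remaining books are left to be filled in.
declare -A book_num=(
    [Genesis]=01
    [Joel]=14
)

JPS_DIR=${JPS_DIR:-$HOME/bin/shabbat/JPS}

chapter_file() {
    local book=$1 chapter=$2
    local num=${book_num[$book]:-}
    [[ -n $num ]] || { echo "unknown book: $book" >&2; return 1; }
    # 10# forces base 10 so a chapter written as 08 is not read as octal.
    printf '%s/et%s%02d.htm\n' "$JPS_DIR" "$num" "$((10#$chapter))"
}

chapter_file Joel 3
```

With the default directory this prints the path of the Joel 3 file used in the first post.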

On a different note, I am not sure Perl is any easier than bash, but I am still working on it.

Thanks again

rbees · 03-22-2015, 07:31 PM

moderator please delete