As always, thanks for the wonderful guidance this site and its users provide.
This time I am trying to get a jump on what I want to do. I have to either generate 475 data files by hand or access the existing data, which is in html files, from my bash script. Needless to say, I am not that interested in spending 2 weeks doing copy/paste to generate those files. So I thought I would explore what it would take to access said html data. Maybe I can do it in less than 2 weeks. :>]
So I have some 900 html files that contain the Hebrew Bible on my localhost, but I had not been able to get curl to access them; if I curl localhost it pulls up the nginx server. Would I have to configure the server to access them? I would prefer not. Well, color me brain dead, duhhhh. It helps if you use the right kind of URL: file:/// vs http://. With a file:/// URL, curl pulls the file and the html can be stripped.
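The code originally posted at this point is not preserved, so the following is only a minimal sketch of the idea, not the poster's script. The function name strip_html and the file path are placeholders, and it assumes no tag spans more than one line, which holds for the sample file further down. The verse numbers stay in the output, and the text inside the SCRIPT element would survive too.

```shell
# Hypothetical helper: remove anything that looks like an HTML tag from stdin,
# then drop lines that are left empty or contain only whitespace.
# Assumption: every tag opens and closes on the same line.
strip_html() {
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d'
}

# Usage (the path is a placeholder for wherever the files actually live):
#   curl -s "file:///path/to/et1403.htm" | strip_html
```

A plain tag-stripping regex like this is fragile on general HTML; it is only reasonable here because the files are simple and uniform.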
I guess the numbers may not matter that much after all, but we will see.
What I need to do:
1. Get the part of the weekly reading; there are nine parts, which may or may not span several files. E.g. Genesis 4:23 - 5:24: each chapter is in a separate file, so in this case two different files are required. I don't think it will ever be more than three.
2. Locate the actual verse the reading starts on.
3. Locate the actual verse the reading ends on, which may be in a different file.
4. Strip the html tags, keeping the paragraph formatting but stripping the verse numbers, and export it to plain text for use with espeak or similar.
Thoughts?
Any sample code I could look at?
Thanks again
The smallest of the htm files as a sample:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Joel 3 / Hebrew Bible in English</TITLE>
<SCRIPT TYPE="text/javascript" SRC="em.js"></SCRIPT>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<DIV ALIGN="JUSTIFY">
<CENTER>
<TABLE CELLPADDING="10" CELLSPACING="10" WIDTH="100%">
<TR ALIGN="CENTER">
<TD VALIGN=TOP BGCOLOR="#FFFFCC"><P ALIGN=CENTER>
<FONT SIZE="-1">
<A HREF="et0.htm">Bible</A> -
Joel - <A HREF="et14.htm">All</A><BR>Chapter
<A HREF="et1401.htm">1</A>
<A HREF="et1402.htm">2</A>
3
<A HREF="et1404.htm">4</A>
</FONT></P>
</TD></TR>
</TABLE>
</CENTER>
<H1 ALIGN="CENTER">Joel Chapter 3</H1>
<A NAME="1"> </A>
<P><B>1</B> And it will come to pass afterward, that I will pour out My spirit upon all flesh; and your sons and your daughters will prophesy, your old men will dream dreams, your young men will see visions;
<A NAME="2"> </A>
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<A NAME="3"> </A>
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
<A NAME="4"> </A>
<B>4</B> The sun will be turned into darkness, and the moon into blood, before the great and terrible day of HaShem come.
<A NAME="5"> </A>
<B>5</B> And it will come to pass, that whosoever will call on the name of HaShem will be delivered; for in mount Zion and in Jerusalem there will be those that escape, as HaShem has said, and among the remnant those whom HaShem will call.</P>
<A NAME="6"> </A>
<CENTER>
<TABLE CELLPADDING="10" CELLSPACING="10" WIDTH="100%">
<TR ALIGN="CENTER">
<TD VALIGN=TOP BGCOLOR="#FFFFCC"><P ALIGN=CENTER>
<FONT SIZE="-1">
<A HREF="et0.htm">Bible</A> -
Joel - <A HREF="et14.htm">All</A><BR>Chapter
<A HREF="et1401.htm">1</A>
<A HREF="et1402.htm">2</A>
3
<A HREF="et1404.htm">4</A>
</FONT></P>
</TD></TR>
</TABLE>
<P>
<B><A NAME="Mail">Got a question or comment?</A> <SCRIPT TYPE="text/javascript">email('et1403')</SCRIPT></B></P>
</CENTER>
</DIV></BODY></HTML>
In this case, bash is not really the go-to tool. I would suggest looking into Perl or Ruby, as either can extract detail from html relatively easily.
I am more familiar with Ruby so will use it as the example, but you may find Perl more to your liking and it probably has greater support on LQ (generally speaking).
So once you have Ruby installed (will leave this to you as it is system dependent) you will need the nokogiri gem which is used to do the heavy lifting:
Code:
$ gem install nokogiri -r
The above will give you a local copy, i.e. accessible by you but not by others on the same box.
Then if you look at the attached, it is a simple example of how to use some of the features
I have chosen to try to use Perl for this, as it is more widely installed on systems and is installed on the headless server that the whole project will eventually be on. It also has the benefit of running on most platforms, which gives the project a wider potential user base.
[schneidz@hyper ~]$ egrep "^.H1 ALIGN=\"CENTER\"|.B.[0-9]*./B." rbees.html
<H1 ALIGN="CENTER">Joel Chapter 3</H1>
<P><B>1</B> And it will come to pass afterward, that I will pour out My spirit upon all flesh; and your sons and your daughters will prophesy, your old men will dream dreams, your young men will see visions;
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
<B>4</B> The sun will be turned into darkness, and the moon into blood, before the great and terrible day of HaShem come.
<B>5</B> And it will come to pass, that whosoever will call on the name of HaShem will be delivered; for in mount Zion and in Jerusalem there will be those that escape, as HaShem has said, and among the remnant those whom HaShem will call.</P>
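Building on the grep output above, here is a minimal awk sketch that keeps only a range of verses and drops the verse numbers and remaining tags. The function name verses is made up for illustration, and it assumes the layout of the sample file: one verse per line, each introduced by <B>n</B>. It does not preserve paragraph breaks.

```shell
# Hypothetical helper: print verses FIRST..LAST of one chapter file as plain text.
# Assumption: each verse is on a single line that contains <B>n</B> before its text.
verses() {
    # $1 = first verse, $2 = last verse; the chapter's HTML is read from stdin
    awk -v first="$1" -v last="$2" '
        match($0, /<B>[0-9]+<\/B>/) {
            # the digits sit between "<B>" (3 chars) and "</B>" (4 chars)
            n = substr($0, RSTART + 3, RLENGTH - 7) + 0
            if (n < first || n > last) next
            text = substr($0, RSTART + RLENGTH)   # everything after </B>
            gsub(/<[^>]*>/, "", text)             # drop leftover tags such as </P>
            sub(/^[ \t]+/, "", text)              # trim leading blanks
            print text
        }'
}

# Usage (file name taken from the sample's own links):
#   verses 2 4 < et1403.htm
```

Navigation links such as <A HREF="et1401.htm">1</A> are ignored because they contain no <B>n</B> marker.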
The "end game" is to extract the weekly Torah (the first five books of the Bible) reading from the html files, parse it, stripping out all the computer speak, and pass the end result to espeak/festival for text-to-speech processing at a specific time. There are two choices for html file type: either one file containing a whole book, say Genesis, or one file per chapter. The weekly reading is broken up into 7 readings called aliyot, with two more tacked on for good measure.
You can see a "very crude" beginning at https://github.com/rbees/Shabbat-Shofar. I have made a lot of improvements lately, but until I get the readings part set up I don't want to push them.
i would package them up by books so that you can parse thru a single text file with awk or sed.
i'm not very religious so it's not immediately obvious to me what a reading is, but maybe you could have a list of line numbers that equate to the weekly readings.
i competed in a hackathon where we used raspberry-pis (pifm) to broadcast a message using espeak.
But even if I packed them up so that each week was a separate file, I would have to generate 61 separate files for the Torah portion plus 61 for the Haftara portion. Then each week is divided into seven parts, plus a conclusion and a Haftara. Then there are the special readings for new months and other things too. That is the work I want to get away from.
Your sample HTML file is not very well suited for retrieving data because the verses are not wrapped in any tags, so you cannot target them using CSS selectors:
Code:
..
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
..
Fortunately there are a lot of alternatives on the net, e.g. this one (not sure about the wording though).
For example, the first verse of the first chapter of Genesis looks like:
Code:
<span class="text Gen-1-1"><span class="chapternum">1 </span>In the beginning God created the heavens and the earth. </span>
So you can refer to it using the .Gen-1-1 CSS selector.
To interpret CSS selectors I will use html-xml-utils (a standard package in Debian/Ubuntu distros).
To retrieve the 10th verse of the third chapter we may do
Code:
$ hxnormalize -l 3000 -x /tmp/gen-3.html 2>/dev/null | hxselect -cs '\n' .Gen-3-10 | hxremove span | hxremove sup | sed 's/<[^<>]\+>//g'
He answered, I heard your voice in the garden, and I was afraid, because I was naked, so I hid myself.
hxnormalize -x fixes the input HTML file so that it is suitable for hxselect. With -l 3000 we set the maximum line length (so that each sentence ends up on a separate line). hxselect applies the CSS selector. hxremove strips unnecessary tags (with their content). Finally, sed removes the remaining markup (<i> etc.). Alternatively, instead of sed, you may use lynx -stdin -dump to render the resulting html as text.
Last edited by firstfire; 03-14-2015 at 10:13 AM.
Reason: Fix english.
i don't understand this addressing. e.g.- Ex. 1:1-6:1
is this exodus book 1: verses 1-6: book 1 ?
I understand that the data in the sample file I posted may not be easy to extract in an automated way, only by hand. I am willing to do it by hand, but would really rather not.
[Rant warning]
Not to get into a religious debate here, but the problem with most of the downloadable "Bibles" in English out there is that they are "Christian" translations and are translated in a way that supports their Christian theology. They are highly frowned on in Jewish circles. The one you provided a link to is a prime example: even though it has a supposedly "Jewish" name, it is still a Christian bible, contains the Christian New Testament, and seeks to proselytize Jews and lead us away from the way G0d told us to live. [/Rant]
Quote:
i don't understand this addressing. e.g.- Ex. 1:1-6:1 is this exodus book 1: verses 1-6: book 1
Yes and no.
Ex. actually refers to the Hebrew book of Shemot, which has the English name Exodus.
1:1 refers to chapter 1 verse 1
6:1 refers to chapter 6 verse 1.
Note that the Christian verse numberings do not always match the Hebrew numberings.
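The addressing explained above can be split mechanically. The following is a minimal sketch, with parse_ref being a made-up name; it assumes a single-word book name (so "1 Kings" would need extra handling) and also accepts the short form where the end verse is in the same chapter.

```shell
# Hypothetical helper: split "Book C:V-C:V" (or "Book C:V-V") into five fields:
#   book start_chapter start_verse end_chapter end_verse
# Assumption: the book name is one word, optionally ending in a dot.
# Input that matches neither form is printed unchanged.
parse_ref() {
    printf '%s\n' "$1" | sed -E \
        -e 's/^([A-Za-z.]+) +([0-9]+):([0-9]+) *- *([0-9]+):([0-9]+)$/\1 \2 \3 \4 \5/' \
        -e 's/^([A-Za-z.]+) +([0-9]+):([0-9]+) *- *([0-9]+)$/\1 \2 \3 \2 \4/'
}

# Usage:
#   parse_ref "Ex. 1:1-6:1"            prints: Ex. 1 1 6 1
#   parse_ref "Genesis 4:23 - 5:24"    prints: Genesis 4 23 5 24
```

With the start and end split out like this, a script can tell that Genesis 4:23 - 5:24 needs the chapter 4 file from verse 23 to its end and the chapter 5 file from verse 1 to verse 24.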
Okay, no problem.
How about this link: Genesis 1? It is from the same resource as your samples, but contains both the Hebrew and English versions side by side, formatted as a table. Each row of this table corresponds to a separate verse.
$ hxnormalize -x -l 1000 gen-05.htm | hxselect -s '\n' -c 'tr:nth-child(20) > td:nth-child(2)' | hxremove b
And all the days of Jared were nine hundred sixty and two years; and he died.