LinuxQuestions.org
Old 03-11-2015, 07:21 PM   #1
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Wheezy, Jessie
Posts: 921

Rep: Reputation: 46
html from bash


Ladies & Gents

As always, thanks for the wonderful guidance this site and its users provide.

This time I am trying to get a jump on what I want to do. I have to either generate 475 data files by hand or access the existing data, which is in html files, from my bash script. Needless to say, I am not that interested in spending 2 weeks doing copy/paste to generate those files. So I thought I would explore what it would take to access said html data. Maybe I can do it in less than 2 weeks. :>]

I have some 900 html files that contain the Hebrew Bible on my localhost, but I have not been able to get curl to access them; if I curl localhost it pulls up the nginx server. Would I have to configure the server to serve them? I would prefer not to. Well, color me brain dead, duhhhh. It helps if you use the right kind of URL: file:/// vs http://. So this code will pull the file and strip the html:
Code:
curl file:///$HOME/bin/shabbat/JPS/et1403.htm -s | w3m -dump -T text/html
Guess the numbers may not matter that much after all, but we will see.

What I need to do:
1. Get each part of the weekly reading (there are nine of them), which may or may not span several files, e.g. Genesis 4:23 - 5:24. Each chapter is in a separate file, so in this case two different files are required; I don't think it will ever be more than three.
2. Locate the actual verse the reading starts on.
3. Locate the actual verse the reading ends on, which may be in a different file.
4. Strip the HTML tags, keeping the paragraph formatting but stripping the verse numbers, and export it to plain text for use with espeak or similar.

Thoughts?

Any sample code I could look at?

Thanks again

The smallest of the htm files as a sample:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Joel 3 / Hebrew Bible in English</TITLE>
<SCRIPT TYPE="text/javascript" SRC="em.js"></SCRIPT>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<DIV ALIGN="JUSTIFY">
<CENTER>
<TABLE CELLPADDING="10" CELLSPACING="10" WIDTH="100%">
<TR ALIGN="CENTER">
<TD VALIGN=TOP BGCOLOR="#FFFFCC"><P ALIGN=CENTER>
<FONT SIZE="-1">
<A HREF="et0.htm">Bible</A> -
Joel - <A HREF="et14.htm">All</A><BR>Chapter
<A HREF="et1401.htm">1</A>
<A HREF="et1402.htm">2</A>
3
<A HREF="et1404.htm">4</A>
</FONT></P>
</TD></TR>
</TABLE>
</CENTER>

<H1 ALIGN="CENTER">Joel Chapter 3</H1>
<A NAME="1"> </A>
<P><B>1</B> And it will come to pass afterward, that I will pour out My spirit upon all flesh; and your sons and your daughters will prophesy, your old men will dream dreams, your young men will see visions;
<A NAME="2"> </A>
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<A NAME="3"> </A>
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
<A NAME="4"> </A>
<B>4</B> The sun will be turned into darkness, and the moon into blood, before the great and terrible day of HaShem come.
<A NAME="5"> </A>
<B>5</B> And it will come to pass, that whosoever will call on the name of HaShem will be delivered; for in mount Zion and in Jerusalem there will be those that escape, as HaShem has said, and among the remnant those whom HaShem will call.</P>
<A NAME="6"> </A>
<CENTER>
<TABLE CELLPADDING="10" CELLSPACING="10" WIDTH="100%">
<TR ALIGN="CENTER">
<TD VALIGN=TOP BGCOLOR="#FFFFCC"><P ALIGN=CENTER>
<FONT SIZE="-1">
<A HREF="et0.htm">Bible</A> -
Joel - <A HREF="et14.htm">All</A><BR>Chapter
<A HREF="et1401.htm">1</A>
<A HREF="et1402.htm">2</A>
3
<A HREF="et1404.htm">4</A>
</FONT></P>
</TD></TR>
</TABLE>
<P>
<B><A NAME="Mail">Got a question or comment?</A> <SCRIPT TYPE="text/javascript">email('et1403')</SCRIPT></B></P>
</CENTER>
</DIV></BODY></HTML>
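
For a file laid out like the sample above, steps 2 to 4 of the list could be sketched roughly as follows. This is only a sketch, not a tested solution for all 900 files: it assumes every verse sits on its own line and starts with <B>number</B> as in the sample, and the function name extract_verses is made up for illustration.

```shell
#!/usr/bin/env bash
# extract_verses FILE START END   (hypothetical helper, not from the thread)
# Print verses START..END of one chapter file as plain text, one verse per
# line, with the HTML tags and the verse numbers removed.
# Assumption: each verse is on its own line and begins with <B>number</B>,
# optionally preceded by <P>, exactly as in the sample file.
extract_verses() {
    local file=$1 start=$2 end=$3
    awk -v s="$start" -v e="$end" '
        match($0, /<B>[0-9]+<\/B>/) {
            # The match is "<B>n</B>": 3 characters, the number, 4 characters.
            n = substr($0, RSTART + 3, RLENGTH - 7) + 0
            if (n < s + 0 || n > e + 0) next
            # A <P> in front of the verse starts a new paragraph.
            if (printed && index(substr($0, 1, RSTART - 1), "<P>")) print ""
            text = substr($0, RSTART + RLENGTH)
            gsub(/<[^<>]*>/, "", text)            # drop leftover tags such as </P>
            gsub(/^[ \t]+|[ \t]+$/, "", text)     # trim surrounding blanks
            print text
            printed = 1
        }
    ' "$file"
}
```

For example, extract_verses "$HOME/bin/shabbat/JPS/et1403.htm" 2 4 > reading.txt would write verses 2 to 4 of Joel 3 to a text file, which espeak can then read with its -f option. A reading that spans two chapters would need one call per file.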
 
Old 03-11-2015, 10:28 PM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
In this case, bash is not really the go-to tool. I would suggest looking into Perl or Ruby, as either can pull details out of html relatively easily.

I am more familiar with Ruby so will use it as the example, but you may find Perl more to your liking and it probably has greater support on LQ (generally speaking).

So once you have Ruby installed (will leave this to you as it is system dependent) you will need the nokogiri gem which is used to do the heavy lifting:
Code:
$ gem install nokogiri -r
The above will give you a local copy, i.e. accessible by you but not by others on the same box.

Then have a look at the attached file; it is a simple example of how to use some of the features.
Attached Files
File Type: txt dr_who_html_scaper.rb.txt (719 Bytes, 29 views)
 
Old 03-12-2015, 09:03 AM   #3
rbees
Thanks grail,

I have chosen to try Perl for this, as it is more widely installed on systems and is already on the headless server that the whole project will eventually run on. It also runs on most platforms, which gives the project a wider potential user base.

That said, I have started working my way through this book hosted at https://www.perl.org/books/beginning-perl/

Thanks again.
 
Old 03-12-2015, 10:07 AM   #4
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
quick-and-dirty:
Code:
[schneidz@hyper ~]$ grep -E '^<H1 ALIGN="CENTER"|<B>[0-9]+</B>' rbees.html
<H1 ALIGN="CENTER">Joel Chapter 3</H1>
<P><B>1</B> And it will come to pass afterward, that I will pour out My spirit upon all flesh; and your sons and your daughters will prophesy, your old men will dream dreams, your young men will see visions;
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
<B>4</B> The sun will be turned into darkness, and the moon into blood, before the great and terrible day of HaShem come.
<B>5</B> And it will come to pass, that whosoever will call on the name of HaShem will be delivered; for in mount Zion and in Jerusalem there will be those that escape, as HaShem has said, and among the remnant those whom HaShem will call.</P>
what is the end game here ?
 
Old 03-12-2015, 10:36 AM   #5
rbees
Thanks schneidz,

The "end game" is to extract the weekly Torah reading (the Torah being the first five books of the Bible) from the html files, parse it, stripping out all the computer speak, and pass the end result to espeak/festival for text-to-speech processing at a specific time. There are two choices of html file type: either one file containing a whole book, say Genesis, or one file per chapter. The weekly reading is broken up into 7 readings called aliyot, with two more tacked on for good measure.

You can see a "very crude" beginning at https://github.com/rbees/Shabbat-Shofar
I have made a lot of improvements lately, but until I get the readings part set up I don't want to push them.
 
Old 03-12-2015, 10:49 AM   #6
schneidz
i would package them up by books so that you can parse thru a single text file with awk or sed.
i'm not very religious so it's not immediately obvious to me what a reading is, but maybe you could have a list of line numbers that equate to the weekly readings.

i competed in a hackathon where we used raspberry-pis (pifm) to broadcast a message using espeak.

Last edited by schneidz; 03-12-2015 at 11:28 AM.
 
Old 03-13-2015, 01:54 PM   #7
rbees
The references and a description can be seen at:
http://en.wikipedia.org/wiki/Weekly_Torah_portion

But even if I packed them up so that each week was a separate file, I would have to generate 61 separate files for the Torah portion plus 61 for the Haftara portion. Then each week is divided into seven parts, plus a conclusion and a Haftara. Then there are the special readings for new months and other things too. That is the work I want to get away from.

Last edited by rbees; 03-13-2015 at 01:55 PM.
 
Old 03-14-2015, 05:09 AM   #8
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

Your sample HTML file is not very well suited for retrieving data because the verses are not wrapped in any tags, so you cannot target them using CSS selectors:
Code:
..
<B>2</B> And also upon the servants and upon the handmaids in those days will I pour out My spirit.
<B>3</B> And I will shew wonders in the heavens and in the earth, blood, and fire, and pillars of smoke.
..
Fortunately there are a lot of alternatives on the net, e.g. this one (not sure about the wording though).

For example, the first verse of the first chapter of Genesis looks like:
Code:
<span class="text Gen-1-1"><span class="chapternum">1&nbsp;</span>In the beginning God created the heavens and the earth. </span>
So, you can refer to it using .Gen-1-1 CSS selector.

Let's retrieve the whole book of Genesis:
Code:
curl 'https://www.biblegateway.com/passage/?search=Genesis+[1-50]&version=CJB' -o /tmp/gen-#1.html
This will create 50 html files gen-1.html, etc.

To interpret CSS selectors I will use html-xml-utils (standard package in debian/ubuntu distros).

To retrieve the 10th verse of the third chapter we may do:
Code:
$ hxnormalize -l 3000 -x /tmp/gen-3.html 2>/dev/null  | hxselect -cs '\n' .Gen-3-10 | hxremove span | hxremove sup | sed 's/<[^<>]\+>//g' 
He answered, I heard your voice in the garden, and I was afraid, because I was naked, so I hid myself.
hxnormalize -x fixes up the input HTML file so that it is suitable for hxselect. With -l 3000 we set the maximum line length (so that each verse ends up on a single line). hxselect applies the CSS selector. hxremove strips unnecessary tags (together with their content). Finally, sed removes the remaining markup (<i> etc). Alternatively, instead of sed, you may use lynx -stdin -dump to interpret the resulting html as text.

Last edited by firstfire; 03-14-2015 at 10:13 AM. Reason: Fix english.
 
Old 03-14-2015, 10:00 AM   #9
schneidz
Quote:
Originally Posted by rbees
The references and a description can be seen at:
http://en.wikipedia.org/wiki/Weekly_Torah_portion

But even if I packed them up so that each week was a separate file, I would have to generate 61 separate files for the Torah portion plus 61 for the Haftara portion. Then each week is divided into seven parts, plus a conclusion and a Haftara. Then there are the special readings for new months and other things too. That is the work I want to get away from.


i don't understand this addressing. e.g.- Ex. 1:1-6:1
is this exodus book 1: verses 1-6: book 1 ?
 
Old 03-14-2015, 08:17 PM   #10
rbees
Thanks firstfire & schneidz

I understand that the data in the sample file I posted may not be easy to extract in an automated way, and that it may have to be done by hand. I am willing to do so, but would really rather not.

[Rant warning]
Not to get into a religious debate here, but the problem with most of the downloadable "Bibles" in English out there is that they are "Christian" translations and are translated in a way that supports their Christian theology. They are highly frowned on in Jewish circles. The one you provided a link to is a prime example: even though it has a supposedly "Jewish" name, it is still a Christian bible, contains the Christian New Testament, and seeks to proselytize Jews and lead us away from the way G0d told us to live. [/Rant]

Quote:
i don't understand this addressing. e.g.- Ex. 1:1-6:1 is this exodus book 1: verses 1-6: book 1
Yes and no.

Ex. actually refers to the Hebrew book of Shemot, which has the English name Exodus.
1:1 refers to chapter 1 verse 1
6:1 refers to chapter 6 verse 1.
So Ex. 1:1-6:1 runs from chapter 1 verse 1 through chapter 6 verse 1.

Note that the Christian verse numbering is not always the same as the Hebrew numbering.
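
The chapter:verse addressing described above can be split into its four numbers mechanically. The following is a minimal bash sketch, assuming references are written with a plain hyphen as in Ex. 1:1-6:1 and that the book name has already been removed; the function name parse_ref is hypothetical.

```shell
#!/usr/bin/env bash
# parse_ref REF   (hypothetical helper, not from the thread)
# Split a reference such as "4:23-5:24" into
#   start-chapter start-verse end-chapter end-verse
# A range inside one chapter, e.g. "3:1-5", reuses the start chapter.
parse_ref() {
    local ref=${1// /}     # drop spaces so "4:23 - 5:24" is accepted too
    local re='^([0-9]+):([0-9]+)-(([0-9]+):)?([0-9]+)$'
    if [[ $ref =~ $re ]]; then
        local sc=${BASH_REMATCH[1]} sv=${BASH_REMATCH[2]}
        # Group 4 is empty when the end chapter was left out.
        local ec=${BASH_REMATCH[4]:-$sc} ev=${BASH_REMATCH[5]}
        printf '%s %s %s %s\n' "$sc" "$sv" "$ec" "$ev"
    else
        printf 'unrecognised reference: %s\n' "$1" >&2
        return 1
    fi
}
```

A caller could then do read -r sc sv ec ev < <(parse_ref "1:1-6:1") and loop over chapters sc to ec, taking whole chapters in between. Any differences between Hebrew and Christian numbering would have to be handled separately.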

Sorry for the rant

Thanks again

Last edited by rbees; 03-14-2015 at 08:19 PM.
 
Old 03-14-2015, 10:59 PM   #11
firstfire
Okay, no problem.
How about this link: Genesis 1? It is from the same resource as your samples, but contains both the Hebrew and English versions side by side, formatted as a table. Each row of this table corresponds to a separate verse.

Download the whole book of Genesis:
Code:
curl 'http://www.mechon-mamre.org/p/pt/pt01[01-50].htm' -o gen-#1.html
This creates 50 files gen-01.html ... gen-50.html

Fetch, say, the 20th verse of the 5th chapter:
Code:
$ hxnormalize -x -l 1000 gen-05.html  | hxselect -s '\n' -c 'tr:nth-child(20) > td:nth-child(2)' | hxremove b
  And all the days of Jared were nine hundred sixty and two years; and he died.
 
Old 03-15-2015, 12:48 PM   #12
rbees
Thanks

I have worked up a pipeline that seems to work. Gotta love sed one-liners.
Code:
w3m -dump -T text/html $HOME/bin/shabbat/JPS/et0106.htm | sed '1,5d' | sed -e :a -e '$d;N;2,6ba' -e 'P;D' | sed -n '/9/,/22/p' > testfile
It can probably be trimmed up by someone that really understands how it does what it does better than I do.

Now all I have to do is put in the variables and build an associative array for the files.
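
The associative array mentioned here might look something like the sketch below. Only two entries are filled in: 14 for Joel comes from the sample file name et1403.htm, and 01 for Genesis is inferred from et0106.htm and the pt01 URL earlier in the thread. The helper name chapter_file and its optional directory argument are assumptions for illustration, and it needs bash 4 or later.

```shell
#!/usr/bin/env bash
# Book name -> two-digit book number used in the et<book><chapter>.htm names.
# Only these two values are taken (or inferred) from the thread; the rest of
# the table would have to be filled in from the actual file listing.
declare -A BOOK_NUM=(
    [Genesis]=01
    [Joel]=14
)

# chapter_file BOOK CHAPTER [DIR]   (hypothetical helper)
# Print the path of the file that should hold the given chapter.
chapter_file() {
    local book=$1 chapter=$2 dir=${3:-$HOME/bin/shabbat/JPS}
    local num=${BOOK_NUM[$book]:-}
    if [[ -z $num ]]; then
        printf 'unknown book: %s\n' "$book" >&2
        return 1
    fi
    # 10# forces base 10 so a chapter written as "08" is not read as octal.
    printf '%s/et%s%02d.htm\n' "$dir" "$num" "$((10#$chapter))"
}
```

With this, chapter_file Genesis 6 prints the path ending in et0106.htm, which could then be passed to the w3m pipeline in place of the hard-coded file name.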

On a different note, I am not sure Perl is any easier than bash, but I am still working on it.

Thanks again
 
 
  

