[SOLVED] Need bash script to check existence of images within an HTML document

thunor · 10-25-2010, 03:06 PM

Hi,

My first question

I've written a bash script to convert my Firefox bookmarks.html document into something that's as close to valid HTML as I can get it (I can then store it on my web space or on an SD card for example).

My script adds img tags next to each URL for any favicons that I choose to download and it looks like this:

Code:

<li><img src="images/groups.google.com.ico" alt="" width="16" height="16"> <a href="http://groups.google.com/" target="_blank">Google Groups</a>
<li><img src="images/www.bbc.co.uk.ico" alt="" width="16" height="16"> <a href="http://www.bbc.co.uk/news/" target="_blank">BBC News - Home</a>
<li><img src="images/happypenguin.org.ico" alt="" width="16" height="16"> <a href="http://happypenguin.org/" target="_blank">http://happypenguin.org/</a>

I have downloaded some favicons and named them as in the script, but I've 600 URLs and I won't be downloading 600 favicons, so I have an image called blank.png that I'd like to be referenced instead. So what I'd like to do is go through these lines and rename the images to images/blank.png when they are not found to exist within the images/ folder.

[EDIT] I imagine it's going to require something using grep, an "if [ -f xxx ]; then" and sed, but I don't know how to put it together. I do know that the regexp '/<img src="\([^"]*\)"/' will match the image name using sed.

Cheers,
Thunor

TB0ne · 10-25-2010, 03:31 PM

Quote:

Originally Posted by thunor

Hi,
I've written a bash script to convert my Firefox bookmarks.html document into something that's as close to valid HTML as I can get it (I can then store it on my web space or on an SD card for example).

My script adds img tags next to each URL for any favicons that I choose to download and it looks like this:

I have downloaded some favicons and named them as in the script, but I've 600 URLs and I won't be downloading 600 favicons, so I have an image called blank.png that I'd like to be referenced instead. So what I'd like to do is go through these lines and rename the images to images/blank.png when they are not found to exist within the images/ folder.

Ok...we know what you NEED, now show us what you've DONE. Where are you getting stuck? Post your script, and we can try to help, but we're not going to write your whole script for you.

This page:
http://www.ibm.com/developerworks/li...ry/l-sed2.html

has examples of using regexes in sed, which you could use to strip the xxxx.xxx.ico names and replace them with blank.png.

thunor · 10-25-2010, 03:45 PM

Quote:

Originally Posted by TB0ne

Ok...we know what you NEED, now show us what you've DONE...

Sorry. I didn't want to scare people off by dumping a large amount of sed commands.

[EDIT] To make things clearer, I should point out that I don't require the script I've already created to be fixed; it's not faulty. I need to append it with something to achieve what I've mentioned previously.

I've read those IBM sed tutorials; I used those to write my script - very informative.

What I've done works i.e. it creates a really nice valid html document of a list of URLs, but the created html document is also full of non-existent images which I'd like to fix. The images are placed into the images/ folder from where the script is executed. One image is called file.png which is used for file:// URLs, and then there's blank.png which I've already mentioned.

Code:

#!/bin/bash

sed \
	-e 's/ ICON="[^"]*"//' \
	-e 's/ ADD_DATE="[^"]*"//' \
	-e 's/ LAST_VISIT="[^"]*"//' \
	-e 's/ LAST_MODIFIED="[^"]*"//' \
	-e 's/ LAST_CHARSET="[^"]*"//' \
	-e 's/ ID="[^"]*"//' \
	-e 's/ FEEDURL="[^"]*"//' \
	-e 's/ PERSONAL_TOOLBAR_FOLDER="[^"]*"//' \
	-e 's/ SHORTCUTURL="[^"]*"//' \
	-e '/<DD>.*/d' \
	-e 's/<p>//' \
	-e 's/DL>/ul>/' \
	-e 's/DT>/li>/' \
	-e 's/<A HREF/<img src="" alt="" width="16" height="16"> <a href/' \
	-e 's/<a href="[^"]*"/& target="_blank"/' \
	-e 's/<\/A>/<\/a>/' \
	-e 's|\(<img src="\)\(.*https://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
	-e 's|\(<img src="\)\(.*http://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
	-e 's|\(<img src="\)\(.*file://\)|\1images/file\.png\2|' \
	-e 's|\(<img src="\)"|\1images/blank\.png"|' \
	-e 's|<!DOCTYPE NETSCAPE-Bookmark-file-1>|<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" \
	"http://www.w3.org/TR/html4/loose.dtd">\n<html>\n<head>|' \
	-e "s/<\/TITLE>/&\n\n\
<style type=\"text\/css\">\n\
	body \{ font-size: 12pt; font-family: arial, verdana, sans-serif; color: #000000; background-color: #ffffff; \}\n\
	hr \{ color: #d0d0e0; border-style:solid; border-width:20px; \}\n\
	ul \{ list-style-type: none; \}\n\
	h1 \{ font-size: 16pt; \}\n\
	h3 \{ font-size: 14pt; \}\n\
	a \{ color: #202020; text-decoration: none; \}\n\
	a:link \{ \}\n\
	a:visited \{ color: #a00000; \}\n\
	a:hover \{ color: #ff8040; text-decoration: underline; \}\n\
<\/style>\n\n<\/head>\n<body>\n\nLast updated $(date)\n/" \
	$(find ~/.mozilla/firefox -name bookmarks.html) > mybookmarks.html

echo -e "\n</body>\n</html>" >> mybookmarks.html

thunor · 10-25-2010, 04:35 PM

Let's start again, ignoring the script that I've already written that does something else I don't require fixing.

I want to read an html document line by line...

Code:

<li><img src="images/groups.google.com.ico" alt="" width="16" height="16"> <a href="http://groups.google.com/" target="_blank">Google Groups</a>

possibly using sed and '/<img src="\([^"]*\)"/' to get a match on the image name, then check to see if the image physically exists. If it does, just output the entire line unchanged; if it doesn't, output the following line instead:

Code:

<li><img src="images/blank.png" alt="" width="16" height="16"> <a href="http://groups.google.com/" target="_blank">Google Groups</a>

I imagine that the file-existence check is going to require "if [ -f xxx ]; then blabla; else blabla; fi" but I don't know how to put this together, i.e. how to use the data sed found with the file-checking code.
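One way the read, check and rewrite steps described above could fit together without sed at all is sketched below. This is only an illustration, not the solution adopted later in the thread: the function name fix_images is made up, it assumes at most one <img src="..."> per line (as in the sample lines), and it assumes the script is run from the directory that contains images/.

```shell
#!/bin/bash
# fix_images (hypothetical name): read HTML on stdin, write it to stdout,
# pointing any <img src="..."> whose file is missing at images/blank.png.
# The body runs in a subshell ( ... ) so its variables do not leak out.
fix_images() (
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *'<img src="'*)
                before=${line%%'<img src="'*}   # text before the img tag
                rest=${line#*'<img src="'}      # text after <img src="
                src=${rest%%'"'*}               # the image path itself
                after=${rest#*'"'}              # text after the closing quote
                if [ ! -f "$src" ]; then
                    # Image file not found: substitute the placeholder
                    printf '%s<img src="images/blank.png"%s\n' "$before" "$after"
                    continue
                fi
                ;;
        esac
        # Lines without an img tag, or with an existing image, pass through
        printf '%s\n' "$line"
    done
)
```

It would be called as something like "fix_images < input.html > output.html", where both file names are placeholders.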

thunor · 10-25-2010, 05:33 PM

I'm off to bed now, but I think it might require something a lot more complicated than I was thinking.

Possibly store each line of an html document in an array.
Iterate through the array in a "for" loop.
Read a line from the array.
Isolate the image name and store it in a variable (how?).
Check the physical existence of the image.
If the image doesn't exist then set the variable to "images/blank.png".
Pass the line to sed to change the image name appending it to >> mynewbookmarks.html.
Process the next line from the array.

I know that you can use variables in sed if you use "s/bla/$BLA/" instead of the usual single quotes.
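As a small illustration of the double-quote point, the sketch below passes a shell variable into a sed substitution. The helper name replace_src is hypothetical, and | is used as the sed delimiter on the assumption that the new path contains slashes but no | or & characters.

```shell
#!/bin/bash
# replace_src (hypothetical name): replace the src of the first img tag on
# each line of stdin with the path given as the first argument.
replace_src() {
    new_src=$1
    # Double quotes let the shell expand $new_src before sed sees the script;
    # the literal double quotes of the HTML therefore need backslashes.
    sed "s|<img src=\"[^\"]*\"|<img src=\"$new_src\"|"
}
```

For example, piping a line through "replace_src images/blank.png" rewrites its img src to images/blank.png and leaves lines without an img tag untouched.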

[EDIT] I've sort of written the code to dump the image name, which I could then put into a variable, but it also prints lines that don't match, such as </body> and </head> etc.

Code:

echo "<li><img src=\"images/groups.google.com.ico\" alt=\"\" width=\"16\" height=\"16\">" | sed -e 's/.*<img src="\([^"]*\)".*/\1/'
images/groups.google.com.ico

echo "</body>" | sed -e 's/.*<img src="\([^"]*\)".*/\1/'
</body>
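The non-matching lines are printed because sed echoes every line by default. A possible fix, sketched here with a made-up function name extract_img_src, is to suppress the default output with -n and add the p flag to the substitution so that only lines where it succeeded are printed.

```shell
#!/bin/bash
# extract_img_src (hypothetical name): print only the img src values found
# on stdin; lines without an <img src="..."> produce no output.
extract_img_src() {
    # -n  : do not print lines automatically
    # ...p: print the line only if the substitution matched
    sed -n 's/.*<img src="\([^"]*\)".*/\1/p'
}
```

With this, the first echo example above still prints images/groups.google.com.ico, while the </body> example prints nothing.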

grail · 10-25-2010, 09:14 PM

How about something like:

Code:

awk 'BEGIN{OFS=FS="\""}/img src/{cmd="test -e "$2;if(system(cmd))$2="images/blank.png"}1' input_file
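For readers who haven't used awk, the one-liner can be unpacked roughly as follows. This is a sketch rather than grail's exact code: the wrapper name fix_favicons is made up, and the single quotes placed around the path (written as \047) are an addition so that paths containing spaces survive the trip through the shell. A path containing a single quote would still break it.

```shell
#!/bin/bash
# fix_favicons (hypothetical name): expanded form of the awk one-liner.
# Arguments are passed straight to awk, so give it the HTML file name.
fix_favicons() {
    awk '
    # Split and rejoin fields on the double quote, so that on an img line
    # field 2 is the value of the first quoted attribute, i.e. the src path.
    BEGIN { FS = OFS = "\"" }

    /img src/ {
        # system() returns the exit status of the command: 0 if the file
        # exists, non-zero otherwise. \047 is a single quote.
        cmd = "test -e \047" $2 "\047"
        if (system(cmd) != 0)
            $2 = "images/blank.png"
    }

    # Print every line, modified or not (the trailing 1 in the one-liner).
    { print }
    ' "$@"
}
```

Because each img line starts a separate test process via system(), a file with several hundred bookmarks may take a moment to process.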

thunor · 10-26-2010, 09:43 AM

Quote:

Originally Posted by grail

How about something like:

Code:

awk 'BEGIN{OFS=FS="\""}/img src/{cmd="test -e "$2;if(system(cmd))$2="images/blank.png"}1' input_file

Yay

Thanks very much for that.

I haven't used/explored awk yet but I can understand what it's doing. I'm going to read up about it this evening.

My script is complete now. I'll post it here in case anybody else wants to make use of it -- save it as firefoxtomybookmarks.

Regards,
Thunor

Code:

#!/bin/bash

# firefoxtomybookmarks
# ====================
# 
# This script will create a valid HTML document of your Mozilla Firefox
# bookmarks.html file that is located somewhere within your home folder.
# Why? So that you can put it on an SD card or USB drive, upload it to
# your personal web space or access your bookmarks using multiple
# browsers across multiple OS installations.
# 
# To the left of each URL can be an icon commonly known as a favicon.
# It is named uniquely for the URL so that the favicon for www.ibm.com
# would be www.ibm.com.ico. For something similar to
# http://worldofspectrum.org/magazines/ it would be 
# worldofspectrum.org.ico. You don't need to download favicons for each
# and every URL, only the ones that you want.
# 
# To set this up, create a file structure somewhere like this:
# 
# firefoxtomybookmarks
# images/
# images/blank.png
# images/file.png
# 
# blank.png (16x16) is used for URLs that don't have a corresponding
# favicon. file.png (16x16) is used for local "file://" URLs.
# 
# For each URL that you want to have a favicon, download them and name
# them as described earlier. How to locate them depends on the website,
# but the most efficient way I've found is to open the URL in your
# browser, view the source and look for "favi" which should give you a
# <link rel="SHORTCUT ICON" type="image/x-icon" href="/favicon.ico">.
# You can then append the href to the domain name to gain access to the
# image. If you don't find a reference to favicon within the source
# then it'll be in the root of the domain name e.g. 
# domainname.com/favicon.ico.
# 
# Now you can execute this script. It will create a mybookmarks.html
# file and set-up the favicons depending on what was found to exist
# within the images folder. Thereafter, every time you add or reorganise
# your bookmarks within Firefox or you remove or add favicons to the
# images folder, you should re-run this script.

# I've broken down the process into stages so that I can document what
# I'm doing and you can inspect the results of each stage.

# Stage1
# ------
# Firstly your existing Firefox bookmarks.html file is copied to a new
# file called mybookmarks.stage1.html which has had unnecessary
# attributes and tags removed: <DD> lines are deleted, and the <DL> and
# <DT> tags are changed to <ul> and <li>.

sed \
-e 's/ ICON="[^"]*"//' \
-e 's/ ADD_DATE="[^"]*"//' \
-e 's/ LAST_VISIT="[^"]*"//' \
-e 's/ LAST_MODIFIED="[^"]*"//' \
-e 's/ LAST_CHARSET="[^"]*"//' \
-e 's/ ID="[^"]*"//' \
-e 's/ FEEDURL="[^"]*"//' \
-e 's/ PERSONAL_TOOLBAR_FOLDER="[^"]*"//' \
-e 's/ SHORTCUTURL="[^"]*"//' \
-e '/<DD>.*/d' \
-e 's/<p>//' \
-e 's/DL>/ul>/' \
-e 's/DT>/li>/' \
$(find ~/.mozilla/firefox -name bookmarks.html) > mybookmarks.stage1.html

# Stage2
# ------
# Next we take mybookmarks.stage1.html and add initial empty image tags,
# add target="_blank" attributes to the anchors so that URLs are opened
# within new windows when they are clicked, and then the favicon image
# names are uniquely created for the URLs. Anything that isn't prefixed
# with https://, http:// or file:// will use images/blank.png.

sed \
-e 's/<A HREF/<img src="" alt="" width="16" height="16"> <a href/' \
-e 's/<a href="[^"]*"/& target="_blank"/' \
-e 's/<\/A>/<\/a>/' \
-e 's|\(<img src="\)\(.*https://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
-e 's|\(<img src="\)\(.*http://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
-e 's|\(<img src="\)\(.*file://\)|\1images/file\.png\2|' \
-e 's|\(<img src="\)"|\1images/blank\.png"|' \
mybookmarks.stage1.html > mybookmarks.stage2.html

# Stage3
# ------
# Now the results from the last stage are used to create a valid HTML
# document which includes redefined styles for many of the HTML tags.
# Feel free to change the style properties to suit your own personal
# taste. I've chosen a very thick horizontal rule so that I can see
# them when speedily scrolling up and down the page.

sed \
-e 's|<!DOCTYPE NETSCAPE-Bookmark-file-1>|<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\
	"http://www.w3.org/TR/html4/loose.dtd">\n\
<html>\
<head>\n|' \
-e "s/<\/TITLE>/&\n\n\
<style type=\"text\/css\">\n\
	body \{ font-size: 12pt; font-family: arial, verdana, sans-serif; color: #000000; background-color: #ffffff; \}\n\
	hr \{ color: #d0d0e0; border-style:solid; border-width:20px; \}\n\
	ul \{ list-style-type: none; \}\n\
	h1 \{ font-size: 16pt; \}\n\
	h3 \{ font-size: 14pt; \}\n\
	a \{ color: #202020; text-decoration: none; \}\n\
	a:link \{ \}\n\
	a:visited \{ color: #a00000; \}\n\
	a:hover \{ color: #ff8040; text-decoration: underline; \}\n\
<\/style>\n\n<\/head>\n<body>\n\nLast updated $(date)\n/" \
mybookmarks.stage2.html > mybookmarks.stage3.html

echo -e "\n</body>\n</html>" >> mybookmarks.stage3.html

# mybookmarks.html
# ----------------
# This last stage then checks the physical existence of the favicons,
# and those that are not found are set to use images/blank.png instead.
# Depending on how many bookmarks you have and the speed of your
# computer, this may take a few seconds. This piece of code was
# supplied by user grail from the linuxquestions.org forums
# ( http://www.linuxquestions.org/questions/user/grail-490946/ ).

awk 'BEGIN{OFS=FS="\""}/img src/{cmd="test -e "$2;if(system(cmd))$2="images/blank.png"}1' \
mybookmarks.stage3.html > mybookmarks.html

# That's it :) The temporary files created at each stage can be deleted
# here if you wish. I've left them so you can view the results.
# 
# 2010-10-26 Thunor Sif Ese <thunorsif_at_hotmail.com>

Sergei Steshenko · 10-26-2010, 10:20 AM

Quote:

Originally Posted by thunor

...
[EDIT] I imagine it's going to require something using grep ...

Now think about the following:

what is your 'grep' approach going to yield if an image URL is found inside an HTML comment?
how about the fact that HTML is not a line-oriented language?
how about not reinventing the wheel and using a ready-made full-fledged HTML parser, e.g. http://search.cpan.org/search?query=...arser&mode=all ?