LinuxQuestions.org
Old 10-25-2010, 03:06 PM   #1
thunor
LQ Newbie
 
Registered: Oct 2010
Posts: 6

Rep: Reputation: 0
Need bash script to check existence of images within an HTML document


Hi,

My first question

I've written a bash script to convert my Firefox bookmarks.html document into something that's as close to valid HTML as I can get it (I can then store it on my web space or on an SD card for example).

My script adds img tags next to each URL for any favicons that I choose to download and it looks like this:

Code:
<li><img src="images/groups.google.com.ico" alt="" width="16" height="16"> <a href="http://groups.google.com/" target="_blank">Google Groups</a>
<li><img src="images/www.bbc.co.uk.ico" alt="" width="16" height="16"> <a href="http://www.bbc.co.uk/news/" target="_blank">BBC News - Home</a>
<li><img src="images/happypenguin.org.ico" alt="" width="16" height="16"> <a href="http://happypenguin.org/" target="_blank">http://happypenguin.org/</a>
I have downloaded some favicons and named them as in the script, but I've 600 URLs and I won't be downloading 600 favicons, so I have an image called blank.png that I'd like to be referenced instead. So what I'd like to do is go through these lines and rename the images to images/blank.png when they are not found to exist within the images/ folder.

[EDIT] I imagine it's going to require something using grep, an "if [ -f xxx ]; then" and sed, but I don't know how to put it together. I do know that the regexp '/<img src="\([^"]*\)"/' will match the image name using sed.

Cheers,
Thunor

Last edited by thunor; 10-25-2010 at 03:37 PM. Reason: Changed to 'bash script' to be clearer
 
Old 10-25-2010, 03:31 PM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,634

Rep: Reputation: 7965
Quote:
Originally Posted by thunor
Hi,
I've written a bash script to convert my Firefox bookmarks.html document into something that's as close to valid HTML as I can get it (I can then store it on my web space or on an SD card for example).

My script adds img tags next to each URL for any favicons that I choose to download and it looks like this:

I have downloaded some favicons and named them as in the script, but I've 600 URLs and I won't be downloading 600 favicons, so I have an image called blank.png that I'd like to be referenced instead. So what I'd like to do is go through these lines and rename the images to images/blank.png when they are not found to exist within the images/ folder.
Ok...we know what you NEED, now show us what you've DONE. Where are you getting stuck? Post your script, and we can try to help, but we're not going to write your whole script for you.

This page:
http://www.ibm.com/developerworks/li...ry/l-sed2.html

has examples of using regexes in sed, which you could use to strip the xxxx.xxx.ico names and replace them with blank.png.
 
Old 10-25-2010, 03:45 PM   #3
thunor
LQ Newbie
 
Registered: Oct 2010
Posts: 6

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by TB0ne
Ok...we know what you NEED, now show us what you've DONE...
Sorry. I didn't want to scare people off by dumping a large number of sed commands.

[EDIT] To make things clearer, I should point out that I don't need the script I've already created to be fixed; it's not faulty. I need to append something to it to achieve what I've mentioned previously.

I've read those IBM sed tutorials; I used those to write my script - very informative.

What I've done works, i.e. it creates a really nice valid HTML document containing a list of URLs, but the created document is also full of references to non-existent images, which I'd like to fix. The images live in an images/ folder inside the directory the script is executed from. One image is called file.png, which is used for file:// URLs, and then there's blank.png, which I've already mentioned.

Code:
#!/bin/bash

sed \
	-e 's/ ICON="[^"]*"//' \
	-e 's/ ADD_DATE="[^"]*"//' \
	-e 's/ LAST_VISIT="[^"]*"//' \
	-e 's/ LAST_MODIFIED="[^"]*"//' \
	-e 's/ LAST_CHARSET="[^"]*"//' \
	-e 's/ ID="[^"]*"//' \
	-e 's/ FEEDURL="[^"]*"//' \
	-e 's/ PERSONAL_TOOLBAR_FOLDER="[^"]*"//' \
	-e 's/ SHORTCUTURL="[^"]*"//' \
	-e '/<DD>.*/d' \
	-e 's/<p>//' \
	-e 's/DL>/ul>/' \
	-e 's/DT>/li>/' \
	-e 's/<A HREF/<img src="" alt="" width="16" height="16"> <a href/' \
	-e 's/<a href="[^"]*"/& target="_blank"/' \
	-e 's/<\/A>/<\/a>/' \
	-e 's|\(<img src="\)\(.*https://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
	-e 's|\(<img src="\)\(.*http://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
	-e 's|\(<img src="\)\(.*file://\)|\1images/file\.png\2|' \
	-e 's|\(<img src="\)"|\1images/blank\.png"|' \
	-e 's|<!DOCTYPE NETSCAPE-Bookmark-file-1>|<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" \
	"http://www.w3.org/TR/html4/loose.dtd">\n<html>\n<head>|' \
	-e "s/<\/TITLE>/&\n\n\
<style type=\"text\/css\">\n\
	body \{ font-size: 12pt; font-family: arial, verdana, sans-serif; color: #000000; background-color: #ffffff; \}\n\
	hr \{ color: #d0d0e0; border-style:solid; border-width:20px; \}\n\
	ul \{ list-style-type: none; \}\n\
	h1 \{ font-size: 16pt; \}\n\
	h3 \{ font-size: 14pt; \}\n\
	a \{ color: #202020; text-decoration: none; \}\n\
	a:link \{ \}\n\
	a:visited \{ color: #a00000; \}\n\
	a:hover \{ color: #ff8040; text-decoration: underline; \}\n\
<\/style>\n\n<\/head>\n<body>\n\nLast updated $(date)\n/" \
	$(find ~/.mozilla/firefox -name bookmarks.html) > mybookmarks.html

echo -e "\n</body>\n</html>" >> mybookmarks.html

Last edited by thunor; 10-25-2010 at 04:05 PM.
 
Old 10-25-2010, 04:35 PM   #4
thunor
LQ Newbie
 
Registered: Oct 2010
Posts: 6

Original Poster
Rep: Reputation: 0
Let's start again and set aside the script that I've already written; it does a different job and doesn't need fixing.

I want to read an html document line by line...
Code:
<li><img src="images/groups.google.com.ico" alt="" width="16" height="16"> <a href="http://groups.google.com/" target="_blank">Google Groups</a>
possibly using sed and '/<img src="\([^"]*\)"/' to get a match on the image name, then check to see if the image physically exists and if it does then just output the entire line unchanged, but if it doesn't exist then output the following line instead:
Code:
<li><img src="images/blank.png" alt="" width="16" height="16"> <a href="http://groups.google.com/" target="_blank">Google Groups</a>
I imagine that the file-existence check is going to require "if [ -f xxx ]; then blabla; else blabla; fi", but I don't know how to put this together, i.e. how to use the data sed found with the file-checking code.
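
A minimal sketch of that idea in plain bash, without sed: read the document line by line, pull the image path out with a regular-expression match, and test it with [ -f ]. The function name fix_missing_images is hypothetical, and the sketch assumes at most one img tag per line and that it is run from the directory that contains images/.

```bash
#!/bin/bash
# fix_missing_images (hypothetical name): reads HTML on stdin and writes
# it to stdout, pointing missing images at images/blank.png.
fix_missing_images() {
    local line img old
    local new='<img src="images/blank.png"'
    # Bash ERE equivalent of the sed pattern <img src="\([^"]*\)"
    local re='<img src="([^"]*)"'
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line =~ $re ]]; then
            img=${BASH_REMATCH[1]}          # the captured image path
            if [ ! -f "$img" ]; then        # relative to the current directory
                old="<img src=\"$img\""
                line=${line/"$old"/$new}    # quoted pattern is matched literally
            fi
        fi
        printf '%s\n' "$line"               # lines without an image pass through
    done
}
```

It could be called as `fix_missing_images < mybookmarks.html > mynewbookmarks.html`.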

Last edited by thunor; 10-25-2010 at 04:38 PM.
 
Old 10-25-2010, 05:33 PM   #5
thunor
LQ Newbie
 
Registered: Oct 2010
Posts: 6

Original Poster
Rep: Reputation: 0
I'm off to bed now, but I think it might require something a lot more complicated than I was thinking.

Possibly store each line of an html document in an array.
Iterate through the array in a "for" loop.
Read a line from the array.
Isolate the image name and store it in a variable (how?).
Check the physical existence of the image.
If the image doesn't exist then set the variable to "images/blank.png".
Pass the line to sed to change the image name, appending the result to mynewbookmarks.html with >>.
Process the next line from the array.

I know that you can use variables in sed if you put the expression in double quotes, as in "s/bla/$BLA/", instead of the usual single quotes.
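
As a small illustration of the double-quote point, here is a sketch in which the shell expands a variable before sed sees the expression. The helper name set_img_src is hypothetical, and the sketch assumes the new path contains no "|", "&" or "\" characters, which sed would treat specially.

```bash
#!/bin/bash
# set_img_src (hypothetical name): rewrites the src of the first <img> tag
# on each input line to the path given as $1.
set_img_src() {
    local new_src=$1
    # Double quotes let the shell expand $new_src; the inner quotes are
    # escaped. "|" is the delimiter so "/" in the path needs no escaping.
    sed "s|<img src=\"[^\"]*\"|<img src=\"$new_src\"|"
}
```

For example, `echo '<li><img src="images/x.ico" alt="">' | set_img_src images/blank.png` should yield the same line with images/blank.png as the source.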

[EDIT] I've sort of written the code to dump the image name, which I could then put into a variable, but it also prints lines that don't match, such as </body> and </head>.
Code:
echo "<li><img src=\"images/groups.google.com.ico\" alt=\"\" width=\"16\" height=\"16\">" | sed -e 's/.*<img src="\([^"]*\)".*/\1/'
images/groups.google.com.ico

echo "</body>" | sed -e 's/.*<img src="\([^"]*\)".*/\1/'
</body>
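
One way to drop the non-matching lines is to combine sed's -n option with the p flag on the substitution, so a line is printed only when the pattern matched. A sketch, with the hypothetical helper name extract_img_src, assuming at most one img tag per line:

```bash
#!/bin/bash
# extract_img_src (hypothetical name): prints only the image paths found
# on stdin, one per line.
extract_img_src() {
    # -n suppresses sed's default output; the trailing "p" prints a line
    # only if the substitution on it succeeded.
    sed -n 's/.*<img src="\([^"]*\)".*/\1/p'
}
```

Fed the two echo inputs above, this should print images/groups.google.com.ico and nothing for the </body> line.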

Last edited by thunor; 10-25-2010 at 06:05 PM.
 
Old 10-25-2010, 09:14 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
How about something like:
Code:
awk 'BEGIN{OFS=FS="\""}/img src/{cmd="test -e "$2;if(system(cmd))$2="images/blank.png"}1' input_file
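
For readers new to awk, the one-liner can be spread out and commented roughly as follows. This is a sketch of the same technique, not grail's exact code: the wrapper name blank_missing_images is hypothetical, and the path is additionally wrapped in double quotes so that names containing spaces survive the shell.

```bash
#!/bin/bash
# blank_missing_images (hypothetical name): expanded form of the one-liner.
blank_missing_images() {
    awk '
    BEGIN { OFS = FS = "\"" }        # split and rejoin fields on double quotes
    /img src/ {
        # With the double quote as separator, $1 is <li><img src=
        # and $2 is the image path.
        cmd = "test -e \"" $2 "\""   # shell command that checks the file
        if (system(cmd) != 0)        # non-zero exit status: file is missing
            $2 = "images/blank.png"  # assigning a field rebuilds the line
    }
    1                                # always-true pattern: print every line
    ' "$@"
}
```

It takes the input file as its argument, e.g. `blank_missing_images input_file`. Because system() starts a shell for every matching line, a document with several hundred bookmarks may take a few seconds.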
 
Old 10-26-2010, 09:43 AM   #7
thunor
LQ Newbie
 
Registered: Oct 2010
Posts: 6

Original Poster
Rep: Reputation: 0
Cool

Quote:
Originally Posted by grail
How about something like:
Code:
awk 'BEGIN{OFS=FS="\""}/img src/{cmd="test -e "$2;if(system(cmd))$2="images/blank.png"}1' input_file
Yay Thanks very much for that.

I haven't used/explored awk yet but I can understand what it's doing. I'm going to read up about it this evening.

My script is complete now. I'll post it here in case anybody else wants to make use of it -- save it as firefoxtomybookmarks.

Regards,
Thunor

Code:
#!/bin/bash

# firefoxtomybookmarks
# ====================
# 
# This script will create a valid HTML document of your Mozilla Firefox
# bookmarks.html file that is located somewhere within your home folder.
# Why? So that you can put it on an SD card or USB drive, upload it to
# your personal web space or access your bookmarks using multiple
# browsers across multiple OS installations.
# 
# To the left of each URL can be an icon commonly known as a favicon.
# It is named uniquely for the URL so that the favicon for www.ibm.com
# would be www.ibm.com.ico. For something similar to
# http://worldofspectrum.org/magazines/ it would be 
# worldofspectrum.org.ico. You don't need to download favicons for each
# and every URL, only the ones that you want.
# 
# To set this up, create a file structure somewhere like this:
# 
# firefoxtomybookmarks
# images/
# images/blank.png
# images/file.png
# 
# blank.png (16x16) is used for URLs that don't have a corresponding
# favicon. file.png (16x16) is used for local "file://" URLs.
# 
# For each URL that you want to have a favicon, download the icon and
# name it as described earlier. How to locate it depends on the website,
# but the most efficient way I've found is to open the URL in your
# browser, view the source and look for "favi" which should give you a
# <link rel="SHORTCUT ICON" type="image/x-icon" href="/favicon.ico">.
# You can then append the href to the domain name to gain access to the
# image. If you don't find a reference to favicon within the source
# then it'll be in the root of the domain name e.g. 
# domainname.com/favicon.ico.
# 
# Now you can execute this script. It will create a mybookmarks.html
# file and set-up the favicons depending on what was found to exist
# within the images folder. Thereafter, every time you add or reorganise
# your bookmarks within Firefox or you remove or add favicons to the
# images folder, you should re-run this script.

# I've broken down the process into stages so that I can document what
# I'm doing and you can inspect the results of each stage.

# Stage1
# ------
# Firstly your existing Firefox bookmarks.html file is copied to a new
# file called mybookmarks.stage1.html which has had unnecessary tags
# removed, and the <DL> and <DT> tags are changed to <ul> and <li>.

sed \
-e 's/ ICON="[^"]*"//' \
-e 's/ ADD_DATE="[^"]*"//' \
-e 's/ LAST_VISIT="[^"]*"//' \
-e 's/ LAST_MODIFIED="[^"]*"//' \
-e 's/ LAST_CHARSET="[^"]*"//' \
-e 's/ ID="[^"]*"//' \
-e 's/ FEEDURL="[^"]*"//' \
-e 's/ PERSONAL_TOOLBAR_FOLDER="[^"]*"//' \
-e 's/ SHORTCUTURL="[^"]*"//' \
-e '/<DD>.*/d' \
-e 's/<p>//' \
-e 's/DL>/ul>/' \
-e 's/DT>/li>/' \
$(find ~/.mozilla/firefox -name bookmarks.html) > mybookmarks.stage1.html

# Stage2
# ------
# Next we take mybookmarks.stage1.html and add initial empty image tags,
# add target="_blank" attributes to the anchors so that URLs are opened
# in new windows when they are clicked, and then the favicon image names
# are uniquely created for the URLs. Anything that isn't prefixed with
# https://, http:// or file:// will use images/blank.png.

sed \
-e 's/<A HREF/<img src="" alt="" width="16" height="16"> <a href/' \
-e 's/<a href="[^"]*"/& target="_blank"/' \
-e 's/<\/A>/<\/a>/' \
-e 's|\(<img src="\)\(.*https://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
-e 's|\(<img src="\)\(.*http://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
-e 's|\(<img src="\)\(.*file://\)|\1images/file\.png\2|' \
-e 's|\(<img src="\)"|\1images/blank\.png"|' \
mybookmarks.stage1.html > mybookmarks.stage2.html

# Stage3
# ------
# Now the results from the last stage are used to create a valid HTML
# document which includes redefined styles for many of the HTML tags.
# Feel free to change the style properties to suit your own personal
# taste. I've chosen very thick horizontal rules so that I can see
# them when speedily scrolling up and down the page.

sed \
-e 's|<!DOCTYPE NETSCAPE-Bookmark-file-1>|<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\
	"http://www.w3.org/TR/html4/loose.dtd">\n\
<html>\
<head>\n|' \
-e "s/<\/TITLE>/&\n\n\
<style type=\"text\/css\">\n\
	body \{ font-size: 12pt; font-family: arial, verdana, sans-serif; color: #000000; background-color: #ffffff; \}\n\
	hr \{ color: #d0d0e0; border-style:solid; border-width:20px; \}\n\
	ul \{ list-style-type: none; \}\n\
	h1 \{ font-size: 16pt; \}\n\
	h3 \{ font-size: 14pt; \}\n\
	a \{ color: #202020; text-decoration: none; \}\n\
	a:link \{ \}\n\
	a:visited \{ color: #a00000; \}\n\
	a:hover \{ color: #ff8040; text-decoration: underline; \}\n\
<\/style>\n\n<\/head>\n<body>\n\nLast updated $(date)\n/" \
mybookmarks.stage2.html > mybookmarks.stage3.html

echo -e "\n</body>\n</html>" >> mybookmarks.stage3.html

# mybookmarks.html
# ----------------
# This last stage then checks the physical existence of the favicons,
# and those that are not found are set to use images/blank.png instead.
# Depending on how many bookmarks you have and the speed of your
# computer, this may take a few seconds. This piece of code was
# supplied by user grail from the linuxquestions.org forums
# ( http://www.linuxquestions.org/questions/user/grail-490946/ ).

awk 'BEGIN{OFS=FS="\""}/img src/{cmd="test -e "$2;if(system(cmd))$2="images/blank.png"}1' \
mybookmarks.stage3.html > mybookmarks.html

# That's it :) The temporary files created at each stage can be deleted
# here if you wish. I've left them so you can view the results.
# 
# 2010-10-26 Thunor Sif Ese <thunorsif_at_hotmail.com>

Last edited by thunor; 10-26-2010 at 10:17 AM.
 
Old 10-26-2010, 10:20 AM   #8
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by thunor
...
[EDIT] I imagine it's going to require something using grep ...
Now think about the following:
  1. what is your 'grep' approach going to yield if an image URL is found inside an HTML comment?
  2. how about the fact that HTML is not a line-oriented language?
  3. how about not reinventing the wheel and using a ready-made full-fledged HTML parser, e.g. http://search.cpan.org/search?query=...arser&mode=all ?
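
Points 1 and 2 can be seen with two made-up inputs. The sketch below counts the lines that the /img src/ pattern used earlier in the thread would select; the helper name count_img_lines and both sample lines are invented for illustration. The script in this thread relies on each bookmark sitting on its own line, so how much this matters depends on where the HTML comes from.

```bash
#!/bin/bash
# count_img_lines (hypothetical name): counts the input lines that a
# line-oriented /img src/ match would select.
count_img_lines() {
    grep -c 'img src' || true   # grep exits non-zero when the count is 0
}

# 1. A tag inside an HTML comment is still selected.
printf '%s\n' '<!-- <li><img src="images/old.ico" alt=""> -->' | count_img_lines   # prints 1

# 2. A tag split across two lines is never selected.
printf '%s\n' '<li><img' '  src="images/a.ico" alt="">' | count_img_lines          # prints 0
```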
 
  

