[SOLVED] Need bash script to check existence of images within an HTML document
Hi,
My first question
I've written a bash script to convert my Firefox bookmarks.html document into something that's as close to valid HTML as I can get it (I can then store it on my web space or on an SD card for example).
My script adds img tags next to each URL for any favicons that I choose to download and it looks like this:
I have downloaded some favicons and named them as in the script, but I've 600 URLs and I won't be downloading 600 favicons, so I have an image called blank.png that I'd like to be referenced instead. So what I'd like to do is go through these lines and rename the images to images/blank.png when they are not found to exist within the images/ folder.
[EDIT] I imagine it's going to require something using grep, an "if [ -f xxx ]; then" and sed, but I don't know how to put it together. I do know that the regexp '/<img src="\([^"]*\)"/' will match the image name using sed.
Cheers,
Thunor
Last edited by thunor; 10-25-2010 at 03:37 PM.
Reason: Changed to 'bash script' to be clearer
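As a starting point, here is a minimal sketch of the existence check on its own: it lists every image that a generated document references but that is not on disk. The function name list_missing_images is hypothetical, the img tags are assumed to look like the ones quoted later in this thread (src as the first attribute, in double quotes), and grep -o is assumed to be available (it is in GNU and BSD grep, but it is not required by POSIX).

```shell
#!/bin/bash
# Hypothetical helper: print each image path referenced by an
# <img src="..."> tag in the given HTML file that does not exist on disk.
# Paths are checked relative to the current directory.
list_missing_images() {
    local html=$1 img
    # grep -o prints only the matched part, one match per line.
    grep -o '<img src="[^"]*"' "$html" |
        sed 's/^<img src="//; s/"$//' |
        sort -u |
        while IFS= read -r img; do
            # Skip empty src values; report paths that are not regular files.
            if [ -n "$img" ] && [ ! -f "$img" ]; then
                printf '%s\n' "$img"
            fi
        done
}
```

Run from the directory that contains images/, something like "list_missing_images mybookmarks.html" would show which favicons are still to be downloaded or replaced.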
Ok...we know what you NEED, now show us what you've DONE. Where are you getting stuck? Post your script, and we can try to help, but we're not going to write your whole script for you.
Sorry. I didn't want to scare people off by dumping a large amount of sed commands.
[EDIT] To make things clearer, I should point out that I don't require the script I've already created to be fixed; it's not faulty. I need to append it with something to achieve what I've mentioned previously.
I've read those IBM sed tutorials; I used those to write my script - very informative.
What I've done works, i.e. it creates a really nice valid HTML document containing a list of URLs, but the created document is also full of references to non-existent images, which I'd like to fix. The images are placed in the images/ folder, relative to where the script is executed. One image is called file.png, which is used for file:// URLs, and then there's blank.png, which I've already mentioned.
Possibly using sed and '/<img src="\([^"]*\)"/' to get a match on the image name, then checking whether the image physically exists: if it does, just output the entire line unchanged, but if it doesn't, output the following line instead:
I imagine that the file existence check is going to require "if [ -f xxx ]; then blabla; else blabla; fi", but I don't know how to put this together, i.e. how to use the data sed found with the file-checking code.
I'm off to bed now, but I think it might require something a lot more complicated than I was thinking.
Possibly store each line of an html document in an array.
Iterate through the array in a "for" loop.
Read a line from the array.
Isolate the image name and store it in a variable (how?).
Check the physical existence of the image.
If the image doesn't exist then set the variable to "images/blank.png".
Pass the line to sed to change the image name, appending the result to mynewbookmarks.html with >>.
Process the next line from the array.
I know that you can use variables in sed if you use "s/bla/$BLA/" instead of the usual single quotes.
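The steps above can be sketched in plain bash roughly as follows. This is only one possible arrangement, under some assumptions: it reads the document line by line with "while read" rather than storing it in an array, and it uses bash's own [[ =~ ]] regex match and parameter substitution in place of sed. The function name fix_missing_images is hypothetical, and each line is assumed to hold at most one img tag.

```shell
#!/bin/bash
# Hypothetical function: reads HTML on stdin and writes it to stdout,
# pointing any <img src="..."> whose file is missing at images/blank.png.
fix_missing_images() {
    local line img
    local re='<img src="([^"]*)"'
    local blank='src="images/blank.png"'
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Isolate the image name; BASH_REMATCH[1] holds the captured group.
        if [[ $line =~ $re ]]; then
            img=${BASH_REMATCH[1]}
            # Check the physical existence of the image.
            if [ ! -e "$img" ]; then
                # The quoted pattern is matched literally, not as a glob.
                line=${line/"src=\"$img\""/$blank}
            fi
        fi
        printf '%s\n' "$line"
    done
}
```

It could be used as "fix_missing_images < mybookmarks.stage3.html > mynewbookmarks.html", run from the directory that contains images/. Lines without an img tag pass through unchanged.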
[EDIT] I've sort of written the code to dump the image name, which I could then put into a variable, but it also prints lines that don't match, such as </body> and </head>.
Code:
echo "<li><img src=\"images/groups.google.com.ico\" alt=\"\" width=\"16\" height=\"16\">" | sed -e 's/.*<img src="\([^"]*\)".*/\1/'
images/groups.google.com.ico
echo "</body>" | sed -e 's/.*<img src="\([^"]*\)".*/\1/'
</body>
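The non-matching lines come through because sed prints every input line by default, whether or not the substitution succeeded. A small sketch of the usual way around that: -n turns off the automatic printing, and the p flag on the s command prints a line only when a substitution was made. The function name extract_img_src is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read lines on stdin and print only the src value
# of lines that contain an <img src="..."> tag.
extract_img_src() {
    # -n: do not print lines automatically.
    # p flag: print the result only when the substitution matched.
    sed -n 's/.*<img src="\([^"]*\)".*/\1/p'
}
```

Piping the first echo from above through extract_img_src gives images/groups.google.com.ico, while piping echo "</body>" through it gives no output at all, so the result can be captured with img=$(...) and tested for being empty.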
I haven't used/explored awk yet but I can understand what it's doing. I'm going to read up about it this evening.
My script is complete now. I'll post it here in case anybody else wants to make use of it -- save it as firefoxtomybookmarks.
Regards,
Thunor
Code:
#!/bin/bash
# firefoxtomybookmarks
# ====================
#
# This script will create a valid HTML document of your Mozilla Firefox
# bookmarks.html file that is located somewhere within your home folder.
# Why? So that you can put it on an SD card or USB drive, upload it to
# your personal web space or access your bookmarks using multiple
# browsers across multiple OS installations.
#
# To the left of each URL can be an icon commonly known as a favicon.
# It is named uniquely for the URL so that the favicon for www.ibm.com
# would be www.ibm.com.ico. For something similar to
# http://worldofspectrum.org/magazines/ it would be
# worldofspectrum.org.ico. You don't need to download favicons for each
# and every URL, only the ones that you want.
#
# To set this up, create a file structure somewhere like this:
#
# firefoxtomybookmarks
# images/
# images/blank.png
# images/file.png
#
# blank.png (16x16) is used for URLs that don't have a corresponding
# favicon. file.png (16x16) is used for local "file://" URLs.
#
# For each URL that you want to have a favicon, download them and name
# them as described earlier. How to locate them depends on the website,
# but the most efficient way I've found is to open the URL in your
# browser, view the source and look for "favi" which should give you a
# <link rel="SHORTCUT ICON" type="image/x-icon" href="/favicon.ico">.
# You can then append the href to the domain name to gain access to the
# image. If you don't find a reference to favicon within the source
# then it'll be in the root of the domain name e.g.
# domainname.com/favicon.ico.
#
# Now you can execute this script. It will create a mybookmarks.html
# file and set-up the favicons depending on what was found to exist
# within the images folder. Thereafter, every time you add or reorganise
# your bookmarks within Firefox or you remove or add favicons to the
# images folder, you should re-run this script.
# I've broken down the process into stages so that I can document what
# I'm doing and you can inspect the results of each stage.
# Stage1
# ------
# Firstly your existing Firefox bookmarks.html file is copied to a new
# file called mybookmarks.stage1.html which has had unnecessary
# attributes and <DD> lines removed, and the <DL> and <DT> tags changed
# to <ul> and <li>.
sed \
-e 's/ ICON="[^"]*"//' \
-e 's/ ADD_DATE="[^"]*"//' \
-e 's/ LAST_VISIT="[^"]*"//' \
-e 's/ LAST_MODIFIED="[^"]*"//' \
-e 's/ LAST_CHARSET="[^"]*"//' \
-e 's/ ID="[^"]*"//' \
-e 's/ FEEDURL="[^"]*"//' \
-e 's/ PERSONAL_TOOLBAR_FOLDER="[^"]*"//' \
-e 's/ SHORTCUTURL="[^"]*"//' \
-e '/<DD>.*/d' \
-e 's/<p>//' \
-e 's/DL>/ul>/' \
-e 's/DT>/li>/' \
$(find ~/.mozilla/firefox -name bookmarks.html) > mybookmarks.stage1.html
# Stage2
# ------
# Next we take mybookmarks.stage1.html and add initial empty image tags,
# add target="_blank" attributes to the anchors so that URLs are opened
# within new windows when they are clicked, and then the favicon image
# names are uniquely created for the URLs. Anything that isn't prefixed
# with https://, http:// or file:// will use images/blank.png.
sed \
-e 's/<A HREF/<img src="" alt="" width="16" height="16"> <a href/' \
-e 's/<a href="[^"]*"/& target="_blank"/' \
-e 's/<\/A>/<\/a>/' \
-e 's|\(<img src="\)\(.*https://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
-e 's|\(<img src="\)\(.*http://\)\([^/]*\)/|\1images/\3\.ico\2\3/|' \
-e 's|\(<img src="\)\(.*file://\)|\1images/file\.png\2|' \
-e 's|\(<img src="\)"|\1images/blank\.png"|' \
mybookmarks.stage1.html > mybookmarks.stage2.html
# Stage3
# ------
# Now the results from the last stage are used to create a valid HTML
# document which includes redefined styles for many of the HTML tags.
# Feel free to change the style properties to suit your own personal
# taste. I've chosen very thick horizontal rules so that I can see
# them when speedily scrolling up and down the page.
sed \
-e 's|<!DOCTYPE NETSCAPE-Bookmark-file-1>|<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\
"http://www.w3.org/TR/html4/loose.dtd">\n\
<html>\
<head>\n|' \
-e "s/<\/TITLE>/&\n\n\
<style type=\"text\/css\">\n\
body \{ font-size: 12pt; font-family: arial, verdana, sans-serif; color: #000000; background-color: #ffffff; \}\n\
hr \{ color: #d0d0e0; border-style:solid; border-width:20px; \}\n\
ul \{ list-style-type: none; \}\n\
h1 \{ font-size: 16pt; \}\n\
h3 \{ font-size: 14pt; \}\n\
a \{ color: #202020; text-decoration: none; \}\n\
a:link \{ \}\n\
a:visited \{ color: #a00000; \}\n\
a:hover \{ color: #ff8040; text-decoration: underline; \}\n\
<\/style>\n\n<\/head>\n<body>\n\nLast updated $(date)\n/" \
mybookmarks.stage2.html > mybookmarks.stage3.html
echo -e "\n</body>\n</html>" >> mybookmarks.stage3.html
# mybookmarks.html
# ----------------
# This last stage then checks the physical existence of the favicons,
# and those that are not found are set to use images/blank.png instead.
# Depending on how many bookmarks you have and the speed of your
# computer, this may take a few seconds. This piece of code was
# supplied by user grail from the linuxquestions.org forums
# ( http://www.linuxquestions.org/questions/user/grail-490946/ ).
awk 'BEGIN{OFS=FS="\""}/img src/{cmd="test -e "$2;if(system(cmd))$2="images/blank.png"}1' \
mybookmarks.stage3.html > mybookmarks.html
# That's it :) The temporary files created at each stage can be deleted
# here if you wish. I've left them so you can view the results.
#
# 2010-10-26 Thunor Sif Ese <thunorsif_at_hotmail.com>
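For anyone puzzling over the awk line in the last stage, here is the same logic spread out with comments, as a sketch rather than a replacement. The wrapper function replace_missing_icons is hypothetical, and one deliberate difference is labelled in the comments: the path is wrapped in single quotes before it is handed to the shell, which is an assumption that no image name itself contains a single quote.

```shell
#!/bin/bash
# Hypothetical wrapper around the awk step of the script.
# Usage: replace_missing_icons INPUT.html > OUTPUT.html
replace_missing_icons() {
    awk '
        BEGIN { OFS = FS = "\"" }   # split and rejoin fields on double quotes
        /img src/ {
            # $2 is the text between the first pair of quotes: the src value.
            # Difference from the one-liner: \047 is the octal escape for a
            # single quote, so the path reaches the shell quoted.
            cmd = "test -e \047" $2 "\047"
            if (system(cmd) != 0)   # non-zero exit status: file not found
                $2 = "images/blank.png"
        }
        1                           # print every line, changed or not
    ' "$1"
}
```

Because system() starts a shell for every matching line, this is where the few seconds mentioned in the script's comments go when there are several hundred bookmarks.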