07-26-2011, 02:19 PM, #1
LQ Newbie (Registered: Jul 2011, Posts: 10)
Loop through list of URLs in txt file, parse out parameters, pass to wget in bash.
What I have:
1. a list of URLs in a text file (i.e. in this form: http://www.domain.tld/more-stuff-here)
2. a script that extracts parameters from a URL (example below)
3. a script that downloads a file with wget (example below)
I want to create a loop that:
1. takes a text file of URLs
2. parses $host and $host_and_path from each URL
3. sends $host and $host_and_path to the wget script
4. creates a file name by appending the time/date (i.e. mm:dd:yy:hh:mm:ss) to $host
Feel free to let me know if I could clarify anything. Also open to code examples to play with instead of outright answers.
Thanks!
Example of URL parsing script:
Code:
#!/bin/sh
# test-domain-parse.sh
# (Can't remember where I found this, but I didn't write it)
# ask for URL. note: want to pull in URLs from txt file (instead of printf)
# and then pass $host and $host_and_path to wget script
printf "Paste the URL you would like to normalize: -> "
read full_url
# extract the protocol
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${full_url/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo ${url/$user@/} | cut -d/ -f1)"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"
host_and_path="$(echo ${url/$user@/} )"
echo " host: $host"
echo " host_and_path: $host_and_path"
Example of wget script:
Code:
#!/bin/sh
# wget-url-test.sh
# note: I would like to pass URL from the parsing script and NOT use printf
printf "What URL would you like to PDF? ->"
read URL
# echo $URL
# note: I would like to pass NORMALIZED_URL from parsing script ($host)
# and to append with yy:mm:dd:hh:mm:ss instead of naming file with printf
printf "What would you like to name the file? ->"
read NORMALIZED_URL
wget -O $NORMALIZED_URL.png --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" "pdfmyurl.com?url=$URL&--png&--page-size=A1&--redirect-delay=500"
07-26-2011, 07:54 PM, #2
LQ Guru (Registered: Aug 2004, Location: Sydney, Distribution: Rocky 9.x, Posts: 18,443)
Unfortunately, I can't get to that link. Anyway, what you need is a loop like
Code:
# assumes no spaces in urls
for full_url in $(cat <yourfilehere> )
do
# insert first script here, skipping printf & read cmds
# append 2nd script here, skipping printf, read cmds. (Not sure what a 'normalized_url' is.)
# if it is $host and you want to append the time, then
host=${host}${timestamp}
# but you'll have to get the timestamp from somewhere
done
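One way to get that timestamp (a sketch; date is a standard tool, and the variable name timestamp is just an example):
Code:
# epoch seconds make for unique, sortable file names
timestamp=$(date +%s)
# or a formatted stamp closer to the requested mm:dd:yy:hh:mm:ss
timestamp=$(date '+%m:%d:%y:%H:%M:%S')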
Here are some good bash links
http://rute.2038bug.com/index.html.gz
http://tldp.org/LDP/Bash-Beginners-G...tml/index.html
http://www.tldp.org/LDP/abs/html/
That should get you started.
07-26-2011, 11:42 PM, #3
LQ Newbie, Original Poster (Registered: Jul 2011, Posts: 10)
Thanks Chris. That definitely helped. I used "export date=$(date +%s)" to append to "$host" for the file naming convention.
Strangely, instead of iterating through the list and processing each line, the script only processes the last line of the text file. "test-urls.txt" is an 11-line file with no spaces. It contains URLs in this form: http://[host.com]/[pages-go-here]
I'll look into this but if you have any suggestions in the meantime feel free to share.
Here's the updated code:
Code:
#!/bin/sh
# wget-url-test.sh
for full_url in $(cat test-urls.txt)
do
# extract the protocol
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${full_url/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo ${url/$user@/} | cut -d/ -f1)"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"
host_and_path="$(echo ${url/$user@/} )"
export date=$(date +%s)
wget -O "${host}${date}".png --referer="http://www.google.com" \
--user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" \
"pdfmyurl.com?url=$host_and_path&--png&--page-size=A1&--redirect-delay=500"
done
07-26-2011, 11:47 PM, #4
Moderator (Registered: Apr 2002, Location: earth, Distribution: slackware by choice, others too :} ... android., Posts: 23,067)
What line separators are you using? Is this a Linux or DOS file?
Cheers,
Tink
07-27-2011, 12:18 AM, #5
LQ Guru (Registered: Sep 2009, Location: Perth, Distribution: Arch, Posts: 10,038)
Just out of curiosity, are you aware that the greps you are using are doing nothing?
If we assume the format you provided is correct for each line of the file ( http://[host.com]/[pages-go-here]), then
something like:
Code:
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
Here full_url is only one line, so doing a grep of one line serves no purpose, at least not without any switches to reduce what has been passed in.
Also, I am not exactly sure what details are in the url lines in the file but are you aware that wget can read url information directly from a file? (just a thought)
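For reference, that looks like the sketch below (it fetches every URL listed in the file, one per line, but it doesn't give you per-URL control over the output file names the way your loop does):
Code:
# download every URL listed in test-urls.txt
wget -i test-urls.txt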
07-27-2011, 12:22 AM, #6
LQ Guru (Registered: Aug 2004, Location: Sydney, Distribution: Rocky 9.x, Posts: 18,443)
Also, no need to use the 'export' keyword to define a var unless you want it to be visible to a sub-shell.
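For example (a quick sketch; the variable name stamp is just an illustration):
Code:
stamp=$(date +%s)   # visible within this script
export stamp        # only needed if a child process must see $stamp in its environment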
07-27-2011, 12:22 AM, #7
LQ Newbie, Original Poster (Registered: Jul 2011, Posts: 10)
Tink, I'm using CR line terminators and this is a Mac OS file.
Thanks
07-27-2011, 12:32 AM, #8
LQ Newbie, Original Poster (Registered: Jul 2011, Posts: 10)
grail, no, I wasn't aware of that. Übernoob with this stuff.
re: your thought, would this be wget's "-i" option? If so, the reason I didn't use it was that I also want to parse out the host of each URL and use the value of host to name the files I'm downloading. But if it's something else, I could look into it. Thanks
In case it helps, here are sample lines from the text file:
Chris, thanks for the feedback.
07-27-2011, 01:08 AM, #9
LQ 5k Club (Registered: Dec 2008, Location: Tamil Nadu, India, Distribution: Debian, Posts: 8,578)
Do you have to use sh? Can you use bash? That is, could the first line be #!/bin/bash?
The reason for asking is that sh may effectively be several different shells depending on the distro and, even if it is linked to bash, bash called as sh has only a subset of its full functionality.
Regarding only getting the last line, and to evolve the code to work with URLs that include spaces, the outer loop could be changed to
Code:
while read -r full_url
do
...
done < test-urls.txt
07-27-2011, 01:12 AM, #10
LQ 5k Club (Registered: Dec 2008, Location: Tamil Nadu, India, Distribution: Debian, Posts: 8,578)
Code:
url="$(echo ${full_url/$proto/})"
is functionally equivalent to
Code:
url=${full_url/$proto/}
but, as the protocol must be the leftmost part of the full URL, this is more appropriate
Code:
url=${full_url#$proto}
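For instance (an illustration, assuming a made-up URL like http://host.tld/page):
Code:
full_url='http://host.tld/page'
proto='http://'
echo "${full_url/$proto/}"   # host.tld/page  (replaces the first match, wherever it occurs)
echo "${full_url#$proto}"    # host.tld/page  (strips the match only from the front)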
07-27-2011, 01:21 AM, #11
LQ Newbie, Original Poster (Registered: Jul 2011, Posts: 10)
Thanks, I put #!/bin/bash instead.
So it appears that special characters in the URLs could be preventing the script from working as intended.
To troubleshoot, I substituted the URLs with random strings without any special characters and can echo each line just fine. However, even using the -r option in the script below doesn't produce any output when I reinsert the URLs into the text file.
Code:
#!/bin/bash
# test-echo-urls.sh
while read -r full_url
do
echo "$full_url"
done < test-urls.txt
Or (without quotes around $full_url)
Code:
#!/bin/bash
# test-echo-urls.sh
while read -r full_url
do
echo $full_url
done < test-urls.txt
07-27-2011, 03:46 AM, #12
LQ Newbie, Original Poster (Registered: Jul 2011, Posts: 10)
Problem solved: it was a filetype issue, as Tink may have alluded to earlier. The script copied/pasted in comment #3 works.
I noticed that when the "file" command returned "ASCII text" and nothing else for the file in question, the script worked.
However, when the "file" command returned, for example, "ASCII text, with CR line terminators," the script did not work.
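For anyone hitting the same thing, converting the Mac-style CR line endings to Unix LF is enough to make the loop behave (a sketch; tr is standard, and the output file name test-urls-unix.txt is just an example):
Code:
# rewrite CR (old Mac) line endings as LF (Unix), then point the loop at the new file
tr '\r' '\n' < test-urls.txt > test-urls-unix.txt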
07-27-2011, 04:04 AM, #13
LQ 5k Club (Registered: Dec 2008, Location: Tamil Nadu, India, Distribution: Debian, Posts: 8,578)
Quote:
Originally Posted by dchol
Problem solved: it was a filetype issue, as Tink may have alluded to earlier. The script copied/pasted in comment #3 works.
I noticed that when the "file" command returned "ASCII text" and nothing else for the file in question, the script worked.
However, when the "file" command returned, for example, "ASCII text, with CR line terminators," the script did not work.
Good
So how does the script look now? There may be things that can be tidied up, such as replacing
Code:
user="$(echo $url | grep @ | cut -d@ -f1)"
with a parameter expansion, if you are interested.
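A sketch of what that replacement could look like (an assumption, following the parameter-expansion style from post #10, and keeping user empty when the URL has no @, as the grep did):
Code:
# set user only when the URL actually contains an @
if [[ $url == *@* ]]; then
    user=${url%%@*}
else
    user=''
fi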
07-27-2011, 10:40 AM, #14
LQ Guru (Registered: Sep 2009, Location: Perth, Distribution: Arch, Posts: 10,038)
May I also ask if the content of the text file with urls you posted in #8 is incomplete? I ask as it obviously has no user and/or host details anywhere in it (this may be confidential), so the other lines for setting user and host seem to have nothing to work on?
07-27-2011, 11:57 AM, #15
LQ Newbie, Original Poster (Registered: Jul 2011, Posts: 10)
catkin, yep, will happily paste in the new code, assuming I'm able to later today. Thanks
grail, the script I adapted for processing the URLs was written by someone else, and I didn't (still don't, to a certain extent) understand all the code. There was actually no user information in the original file, I just kept that line in there because I wasn't quite ready to mess with that part. Before posting the finished script, however, I plan to strip out all the superfluous code so you'll be able to see how I'm using it then. Thanks