LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 07-26-2011, 02:19 PM   #1
dchol
LQ Newbie
 
Registered: Jul 2011
Posts: 10

Rep: Reputation: Disabled
Loop through list of URLs in txt file, parse out parameters, pass to wget in bash.


What I have:

1. a list of URLs in a text file (e.g. in this form: http://www.domain.tld/more-stuff-here)

2. script that extracts parameters from text file with URLs (example below)

3. script that downloads file with wget (example below)

I want to create a loop that:

1. takes a text file of URLs

2. parses $host and $host_and_path from each URL

3. sends $host and $host_and_path to the wget script

4. creates a file name by appending the time/date to $host (e.g. mm:dd:yy:hh:mm:ss)
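The steps above could be sketched roughly like this (just a sketch to play with: urls.txt, the date format, and the simplified parsing are all placeholders, not the final script):

```shell
#!/bin/bash
# hypothetical sample input; in practice this is the existing text file of URLs
printf 'http://www.domain.tld/more-stuff-here\n' > urls.txt

while read -r full_url; do
    url=${full_url#*://}              # strip the protocol (http://, https://, ...)
    host=${url%%/*}                   # host = everything before the first /
    stamp=$(date +%m-%d-%y-%H-%M-%S)  # mm-dd-yy-HH-MM-SS
    echo "would fetch $full_url as ${host}-${stamp}.png"
done < urls.txt
```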

Feel free to let me know if I could clarify anything. Also open to code examples to play with instead of outright answers.

Thanks!

Example of URL parsing script:

Code:
#!/bin/sh
# test-domain-parse.sh
# (Can't remember where I found this, but I didn't write it)

# ask for URL.  note: want to pull in URLs from txt file (instead of printf)
# and then pass $host and $host_and_path to wget script

printf "Paste the URL you would like to normalize:  -> "

read full_url

# extract the protocol
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${full_url/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo ${url/$user@/} | cut -d/ -f1)"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"
host_and_path="$(echo ${url/$user@/} )"

echo "  host: $host"
echo "  host_and_path: $host_and_path"
Example of wget script:

Code:
#!/bin/sh

# wget-url-test.sh

# note: I would like to pass URL from the parsing script and NOT use printf
printf "What URL would you like to PDF? ->"

read URL

# echo $URL

# note: I would like to pass NORMALIZED_URL from parsing script ($host)
# and to append with yy:mm:dd:hh:mm:ss instead of naming file with printf
printf "What would you like to name the file? ->"

read NORMALIZED_URL

wget -O $NORMALIZED_URL.png --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" "pdfmyurl.com?url=$URL&--png&--page-size=A1&--redirect-delay=500"
 
Old 07-26-2011, 07:54 PM   #2
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,289

Rep: Reputation: 2034
Unfortunately, I can't get to that link. Anyway, what you need is a loop like
Code:
# assumes no spaces in urls
for full_url in $(cat <yourfilehere> )
do

#insert first script here, skipping printf & read cmds

# append 2nd script here, skipping printf, read cmds. (Not sure what a 'normalized_url' is.)
# if it is $host and you want to append a yy:mm:dd:hh:mm:ss timestamp, then
host=${host}${timestamp}

# but you'll have to get the timestamp from somewhere

done
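One hedged way to get that timestamp is date(1) with a format string (the yy:mm:dd:hh:mm:ss layout mirrors the comment above; the host value is hypothetical):

```shell
stamp=$(date +%y:%m:%d:%H:%M:%S)  # yy:mm:dd:hh:mm:ss
host="example.com"                # hypothetical host value
fname="${host}${stamp}"
echo "$fname"
```

Colons are legal in Linux file names, though dashes travel better if the files ever move to other systems.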
Here are some good bash links
http://rute.2038bug.com/index.html.gz
http://tldp.org/LDP/Bash-Beginners-G...tml/index.html
http://www.tldp.org/LDP/abs/html/

That should get you started.
 
1 member found this post helpful.
Old 07-26-2011, 11:42 PM   #3
dchol
LQ Newbie
Original Poster
Thanks Chris. That definitely helped. I used "export date=$(date +%s)" to append to "$host" for file naming convention.

Strangely, instead of iterating through the list and processing each line, the script only processes the last line of the text file. "test-urls.txt" is an 11-line file with no spaces. It contains URLs in this form: http://[host.com]/[pages-go-here]

I'll look into this but if you have any suggestions in the meantime feel free to share.

Here's the updated code:

Code:
#!/bin/sh

# wget-url-test.sh

for full_url in $(cat test-urls.txt)
do

# extract the protocol
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${full_url/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo ${url/$user@/} | cut -d/ -f1)"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"
host_and_path="$(echo ${url/$user@/} )"

export date=$(date +%s)

wget -O "${host}${date}".png --referer="http://www.google.com"  \
--user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" \
"pdfmyurl.com?url=$host_and_path&--png&--page-size=A1&--redirect-delay=500"

done
 
Old 07-26-2011, 11:47 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,986
Blog Entries: 11

Rep: Reputation: 880
What line separators are you using? Is this a Linux or a DOS file?


Cheers,
Tink
 
1 member found this post helpful.
Old 07-27-2011, 12:18 AM   #5
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
Just out of curiosity, are you aware that the greps you are using are doing nothing?

If we assume the format you provided is correct for each line of the file (http://[host.com]/[pages-go-here]), then
something like:
Code:
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
Here full_url is only one line, so doing a grep on one line serves no purpose, at least not without any switches to reduce what has been passed in.
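To illustrate with a small, self-contained sketch (the sample line is hypothetical): grep on a single already-matching line just passes it through, so the sed on its own gives the same result:

```shell
line="http://host.example/path"   # hypothetical sample line
echo "$line" | grep ://           # one matching line: passes through unchanged

# the sed alone already extracts the protocol
proto="$(echo "$line" | sed -e 's,^\(.*://\).*,\1,')"
echo "$proto"                     # http://
```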

Also, I am not exactly sure what details are in the URL lines in the file, but are you aware that wget can read URL information directly from a file? (Just a thought.)
 
Old 07-27-2011, 12:22 AM   #6
chrism01
Guru
Also, no need to use the 'export' keyword to define a var unless you want it to be visible to a sub-shell.
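A quick demonstration of the difference (a sketch; the variable names are arbitrary):

```shell
foo=1           # plain shell variable: visible in this shell only
export bar=2    # exported: also visible to child processes

# a child shell sees only the exported variable
bash -c 'echo "foo=${foo:-unset} bar=${bar:-unset}"'   # prints: foo=unset bar=2
```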
 
Old 07-27-2011, 12:22 AM   #7
dchol
LQ Newbie
Original Poster
Tink, I'm using CR line terminators and this is a Mac OS file.

Thanks
 
Old 07-27-2011, 12:32 AM   #8
dchol
LQ Newbie
Original Poster
grail, no, I wasn't aware of that. Übernoob with this stuff.

re: your thought, would this be wget's "-i" option? If so, the reason I didn't use it is that I also want to parse out the host of each URL and use the value of host to name the files I'm downloading. But if there's something else, I could look into it. Thanks

In case it helps, here are sample lines from the text file:

Chris, thanks for the feedback.
 
Old 07-27-2011, 01:08 AM   #9
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,551
Blog Entries: 28

Rep: Reputation: 1176
Do you have to use sh? Can you use bash? That is, could the first line be #!/bin/bash?

Reason for asking is that sh may effectively be several different shells depending on the distro and, even if it is linked to bash, bash when called as sh has a subset of its full functionality.

Regarding only getting the last line, and to evolve the code to work with URLs including spaces, the outer loop could be changed to
Code:
while read -r full_url
do
    ...
done < test-urls.txt
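A self-contained demo of that loop (the file name and URLs are made up), showing it also keeps a URL containing a space intact:

```shell
#!/bin/bash
# hypothetical demo file; the second line deliberately contains a space
printf 'http://a.example/x\nhttp://b.example/with space\n' > /tmp/urls-demo.txt

while read -r full_url; do
    echo "got: $full_url"
done < /tmp/urls-demo.txt
```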
 
Old 07-27-2011, 01:12 AM   #10
catkin
LQ 5k Club
Code:
url="$(echo ${full_url/$proto/})"
is functionally equivalent to
Code:
url=${full_url/$proto/}
but, as the protocol must be the leftmost part of the full URL, this is more appropriate
Code:
url=${full_url#$proto}
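For example (hypothetical URL):

```shell
full_url="http://user@host.example/some/path"   # hypothetical URL
proto="http://"
url=${full_url#$proto}   # strip the shortest prefix matching $proto
echo "$url"              # user@host.example/some/path
```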
 
1 member found this post helpful.
Old 07-27-2011, 01:21 AM   #11
dchol
LQ Newbie
Original Poster
Thanks, I put #!/bin/bash instead.

So it appears the special characters in the URLs could prevent the script from working as intended.

To troubleshoot, I substituted the URLs with random strings without any special characters and can echo each line just fine. However, even using the -r option in the script below doesn't produce any output when I reinsert the URLs into the text file.

Code:
#!/bin/bash

# test-echo-urls.sh

while read -r full_url
do 
    echo "$full_url"
done < test-urls.txt
Or (without quotes around $full_url)

Code:
#!/bin/bash

# test-echo-urls.sh

while read -r full_url
do 
    echo $full_url
done < test-urls.txt
 
Old 07-27-2011, 03:46 AM   #12
dchol
LQ Newbie
Original Poster
Problem solved: it was a filetype issue, as Tink may have alluded to earlier. The script copied/pasted in comment #3 works.

I noticed that when the "file" command returned "ASCII text" and nothing else for the file in question, the script worked.

However, when the "file" command returned, for example, "ASCII text, with CR line terminators," the script did not work.
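For anyone hitting the same thing, one hedged way to repair such a file is tr(1), which can turn the classic-Mac bare-CR terminators into Unix LF (the file names here are made up):

```shell
# build a demo file with bare CR line terminators, as "file" reported
printf 'http://a.example/x\rhttp://b.example/y\r' > /tmp/mac-urls.txt

tr '\r' '\n' < /tmp/mac-urls.txt > /tmp/unix-urls.txt   # CR -> LF

while read -r u; do echo "got: $u"; done < /tmp/unix-urls.txt
```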
 
Old 07-27-2011, 04:04 AM   #13
catkin
LQ 5k Club
Quote:
Originally Posted by dchol View Post
Problem solved: it was a filetype issue, as Tink may have alluded to earlier. The script copied/pasted in comment #3 works.

I noticed that when the "file" command returned "ASCII text" and nothing else for the file in question, the script worked.

However, when the "file" command returned, for example, "ASCII text, with CR line terminators," the script did not work.
Good

So how does the script look now? There may be things that can be tidied up, such as replacing
Code:
user="$(echo $url | grep @ | cut -d@ -f1)"
with
Code:
user=${url%%@*}
if you are interested.
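A small example of that expansion (hypothetical value), plus one behavioral difference from the grep version worth knowing about:

```shell
url="user@host.example/some/path"   # hypothetical value
user=${url%%@*}   # remove the longest suffix matching '@*': keeps text before the first @
echo "$user"      # user

# caveat: with no @ present, ${url%%@*} leaves the string unchanged,
# whereas the original grep pipeline would have produced an empty string
```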
 
Old 07-27-2011, 10:40 AM   #14
grail
Guru
May I also ask if the content of the text file with URLs you posted in #8 is incomplete? I ask as it obviously has no user and/or host details anywhere in it (this may be confidential), so the other lines for setting user and host seem to have nothing to work on?
 
Old 07-27-2011, 11:57 AM   #15
dchol
LQ Newbie
Original Poster
catkin, yep, will happily paste in the new code, assuming I'm able to later today. Thanks

grail, the script I adapted for processing the URLs was written by someone else, and I didn't (still don't, to a certain extent) understand all the code. There was actually no user information in the original file; I just kept that line in there because I wasn't quite ready to mess with that part. Before posting the finished script, however, I plan to strip out all the superfluous code, so you'll be able to see how I'm using it then. Thanks
 
  

