LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 06-17-2010, 06:29 AM   #1
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Rep: Reputation: 15
Question: How to extract a string from a line?


Hi !

I would like to make a shell script that does the following:
- fetch a web page using curl
- download a picture from the page
- find a particular line in the page's source code
- extract a string (a link path) from that line
- go to another page of the website using the result of the previous step
- repeat all this until step 3 finds no match

I know how to:
- get the page's source code using curl (simple: just run "curl http://www.mywebpage.com")
- download the picture (as simple as the first step)
- find a line in the source code (grep is made for that, isn't it?)
- fetch another page from the result of the previous step (the same as the first step)

But I really don't know how to do the fourth step of my script.
In fact, I don't know regular expressions, nor even which shell command could help me do this.

To be clearer, I would like to store in a variable the string represented by the star (*) in this: <a href="*">Next ></a>
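For reference, this kind of capture can even be done with plain POSIX parameter expansion, no regex needed (the sample line here is made up):

```shell
line='<a href="/comics/16/">Next ></a>'   # made-up sample line
tmp=${line#*href=\"}     # strip everything up to and including href="
echo "${tmp%%\"*}"       # strip from the next double quote onward
# → /comics/16/
```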

Can you help me, please ?

Thanks !

Pit
 
Old 06-17-2010, 07:27 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,552

Rep: Reputation: 2898
Well I would probably use sed in this fashion:
Code:
var=$(sed -r -n '/Next/s/[^"]*"([^"]*).*/\1/p' source_code)
This assumes "Next" doesn't appear anywhere else in the source code of the page.
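For instance, fed a made-up anchor line of the kind described above, the command pulls out the quoted path:

```shell
# feed a sample anchor line through the sed command above
echo '<a href="/comics/16/">Next ></a>' |
  sed -r -n '/Next/s/[^"]*"([^"]*).*/\1/p'
# → /comics/16/
```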
 
Old 06-17-2010, 01:52 PM   #3
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Hi !

Thank you, but it's not working in my case: it picks up the "-" in the source code and tries to use it as arguments.
Moreover, when I test it, it only gives me "padding: 0px 7px;" and not "/comics/133" as it should.

Thank you again ! :-)
 
Old 06-17-2010, 02:19 PM   #4
arizonagroovejet
Senior Member
 
Registered: Jun 2005
Location: England
Distribution: openSUSE, Fedora, CentOS
Posts: 1,093

Rep: Reputation: 197
It sounds like you're trying to download all strips from an online comic. Is that right?
If so, there might be an easier way. Look to see where the images you're interested in are stored. They might be all in the same directory and/or they might have filenames in a format that contains a common string and then some numbers, e.g. a date. Something like

http://somewhere.com/comics/comic_20090510.jpg
http://somewhere.com/comics/comic_20090512.jpg
http://somewhere.com/comics/comic_20090514.jpg

If so, then you just work out the pattern of the numbers and do a loop using seq. E.g.

Code:
for i in $(seq 10 2 14);do wget http://somewhere.com/comics/comic_200905${i}.jpg;done
You can put loops inside loops if you want.


Code:
for month in $(seq 1 12); do
  for date in $(seq 10 2 14); do
    wget http://somewhere.com/comics/comic_2009${month}${date}.jpg
  done
done
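One caveat with the sketch above: months 1-9 come out unpadded, so they won't match date-style filenames like 20090510. seq -w pads with leading zeros (echo stands in for wget here, and the URL is still the made-up example):

```shell
# seq -w pads every number to the width of the largest, so months
# and days come out as 01..12 and 01..31
for month in $(seq -w 1 12); do
  for date in $(seq -w 1 31); do
    echo "http://somewhere.com/comics/comic_2009${month}${date}.jpg"
  done
done
# first line printed: http://somewhere.com/comics/comic_20090101.jpg
```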
 
Old 06-17-2010, 04:41 PM   #5
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Hi !

Arf! Caught out... ^^'

Yes, I would like to download all the strips of Cyanide & Happiness so that I can read them directly on my computer.
I just hope it's not a bad thing to do...

And I can't download them using a number pattern, because the strips are all named differently, without numbers. :-/

That's why I would like to do it using curl and "Next >". ;-)
 
Old 06-17-2010, 06:07 PM   #6
arizonagroovejet
Senior Member
 
Registered: Jun 2005
Location: England
Distribution: openSUSE, Fedora, CentOS
Posts: 1,093

Rep: Reputation: 197
You could buy the book.

What you want to do is certainly possible. I've done it just for the hell of it; it took about ten minutes, though I didn't use the method you seem to be going after, messing around trying to find the URL for the next comic, because that's a crazy, overly complicated way to do it.

Have you actually tried "download the picture (as simple as the first step)"? If you grab a comic page with curl, the line that contains the URL for the actual comic image is very long. I looked at it for a minute or so before deciding that trying to extract the URL using a regexp was way too difficult.

Anyway, I don't feel inclined to give you a complete solution; I don't want to rob you of the satisfaction of working it out for yourself. I will tell you that it can be done using only the commands for, curl, grep, seq and tr.
 
Old 06-17-2010, 07:39 PM   #7
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Yes, I could buy the book, but I'm not doing this instead of buying the book; it's just so I don't have to use the website.
It's full of ads and too slow for me (one page takes about 10 seconds to load completely!).

If I decide that the comics deserve a donation, I will give them some money (but I won't buy a book that would cost me double in customs fees; I've already had that problem).

What do you mean by "trying to download the picture (as simple as the first step)"? If you're talking about the right-click option, I won't do that, because it's not what I want to do (I just want a copy of the comics on my computer), and it's not really possible to do that for more than 1000 strips. :-/

Quote:
I've done it just for the hell of it, took about ten minutes, though I didn't use the method you seem to be going after with
You've downloaded all the strips? How did you do that? 0_o
 
Old 06-18-2010, 12:14 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,552

Rep: Reputation: 2898
OK ... I think I've got it, now that I have seen the source:
Code:
var=$(curl -s <site> | sed -r -n '/Next/s/.*(comics\/[0-9]+\/)">Next.*/\1/p')
These are the first two results, starting from the first page:
Quote:
comics/39/
comics/40/
 
Old 06-18-2010, 04:43 AM   #9
arizonagroovejet
Senior Member
 
Registered: Jun 2005
Location: England
Distribution: openSUSE, Fedora, CentOS
Posts: 1,093

Rep: Reputation: 197
Quote:
Originally Posted by moicpit View Post
You've downloaded all the strips ?
Nope, but I wrote a script which will do it. Or at least I believe it will do it, assuming that all the comic pages use the same layout. It worked on the five most recent comics.


I haven't actually got my script to hand right now (it's at home and I am not), but I'll give you another hint. The URLs for the comic pages contain a sequential number, although there are gaps. E.g. the first comic appears to be at /comics/15/, the next one is /comics/40/, and there is nothing at /comics/2080.

Code:
for i in $(seq 1 2084); do [stuff with curl and grep etc]; done
Also, the tr command can be used to break a really long line up into multiple lines, which you can then pass to grep to extract the line you want. E.g.

Code:
tr '"' '\n'
Will replace any " characters in the line with newlines.

Code:
me@mine:~> echo 'lorry"car"boat'
lorry"car"boat
me@mine:~> echo 'lorry"car"boat' | tr '"' '\n'
lorry
car
boat
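Combining the two hints, a long HTML line can be split on its double quotes and the image URL fished out with grep (the img line here is invented):

```shell
# split the line on double quotes, then keep only the piece ending in .jpg
echo '<img alt="a comic" src="http://www.example.net/files/strip1.jpg">' |
  tr '"' '\n' |
  grep '\.jpg$'
# → http://www.example.net/files/strip1.jpg
```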
 
Old 06-18-2010, 01:34 PM   #10
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Thank you all !

I'll try with your solutions and I'll tell you when it's done (or if i've got another problem -_-').

If everything works fine, I'll give you my script. ;-)
 
Old 06-20-2010, 12:22 PM   #11
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Hi !

So, I really don't understand regexes! They look like Chinese to me! And I tried... :-/

Can you tell me what regex would make my script find the URL of the picture to download?
In this case, the HTML that displays the picture looks like this:
Code:
<img alt="Cyanide and Happiness, a daily webcomic" src="http://www.explosm.net/db/files/Comics/pingpong0001.jpg">
I would like to find the whole address. Because it's not always "http://www.explosm.net/db/files/Comics/***.jpg", I can't just search for "http://www.explosm.net/db/files/Comics/".
But I think the "alt" stays unchanged on all the pages. ;-)

So, please, help me! ^^'
 
Old 06-20-2010, 02:28 PM   #12
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 336

Rep: Reputation: 141
Try
Code:
curl -s http://www.explosm.net/comics/2082/ |
sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p'

http://www.explosm.net/db/files/Comics/Rob/deadbody.png
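Tested offline against the sample img tag quoted earlier in the thread, the same sed line pulls out the src value:

```shell
# extract the src attribute from the sample img tag
echo '<img alt="Cyanide and Happiness, a daily webcomic" src="http://www.explosm.net/db/files/Comics/pingpong0001.jpg">' |
  sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p'
# → http://www.explosm.net/db/files/Comics/pingpong0001.jpg
```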
 
Old 06-21-2010, 10:43 AM   #13
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Thank you all! I now have a functional script!

Here is the code:
Code:
#!/bin/sh

echo "=================== Download picture from $1$2 =================="

# Fetch the page; quoting "$source_code" below preserves its line breaks,
# which also makes the earlier tr workaround for "-" unnecessary
source_code=$(curl -s "$1$2")

echo "$source_code" > source_code.html

link=$(sed -r -n '/Next/s/.*(comics\/[0-9]+\/)">Next.*/\1/p' source_code.html)
echo "link = $link"

picture=$(sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p' source_code.html)
echo "picture = $picture"

picture_name=$(basename "$picture")

curl -s "$picture" > "$picture_name"

# Recurse to the next page while a "Next >" link was found
if [ -n "$link" ]
then
    "$0" "$1" "$link"
else
    exit 1
fi

exit 0
You can run it by typing this:
Code:
./suck_images.sh http://www.explosm.net/ comics/15/
It still has some bugs, so I sometimes need to relaunch it, but it works. ;-)
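The two sed extractions used in the script can also be wrapped as functions and driven by a while loop instead of recursion. A sketch only: the main loop is commented out because it needs network access, and the URLs are the ones from this thread:

```shell
# Pull the "Next >" link path out of a page's source (sed line from this thread)
extract_next() {
  sed -r -n '/Next/s/.*(comics\/[0-9]+\/)">Next.*/\1/p'
}

# Pull the comic image URL out of a page's source
extract_img() {
  sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p'
}

# Main loop sketch (uncomment to run; requires network access):
# page="comics/15/"
# while [ -n "$page" ]; do
#   html=$(curl -s "http://www.explosm.net/$page")
#   img=$(printf '%s\n' "$html" | extract_img)
#   [ -n "$img" ] && curl -s "$img" -o "$(basename "$img")"
#   page=$(printf '%s\n' "$html" | extract_next)
# done
```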
 
  

