LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 06-17-2010, 06:29 AM   #1
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Rep: Reputation: 15
Question: How to extract a string from a line?


Hi !

I would like to make a shell script that does the following:
- fetch a web page using curl
- download a picture from the page
- find a particular line in the page's source code
- extract a string (a link path) from that line
- go to another page of the website using the result of the previous step
- repeat all this until step 3 finds no match

I know how to:
- get the page's source code using curl (simple: just run "curl http://www.mywebpage.com")
- download the picture (as simple as the first step)
- find a line in the source code (grep is made for that, isn't it?)
- fetch another page from the result of the previous step (the same as the first step)

But I really don't know how to do the fourth step of my script.
In fact, I don't know regular expressions, nor even which shell command could help me do this.

To be clearer, I would like to store in a variable the string represented by the star (*) in this: <a href="*">Next ></a>
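For reference, this kind of capture can even be done with plain POSIX parameter expansion, no regex needed (the sample line here is made up):

```shell
line='<a href="/comics/16/">Next ></a>'   # made-up sample line
tmp=${line#*href=\"}     # strip everything up to and including href="
echo "${tmp%%\"*}"       # strip from the next double quote onward
# → /comics/16/
```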

Can you help me, please ?

Thanks !

Pit
 
Old 06-17-2010, 07:27 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,552

Rep: Reputation: 2898
Well I would probably use sed in this fashion:
Code:
var=$(sed -r -n '/Next/s/[^"]*"([^"]*).*/\1/p' source_code)
This assumes "Next" doesn't appear anywhere else in the source code of the page.
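For instance, fed a made-up anchor line of the kind described above, the command pulls out the quoted path:

```shell
# feed a sample anchor line through the sed command above
echo '<a href="/comics/16/">Next ></a>' |
  sed -r -n '/Next/s/[^"]*"([^"]*).*/\1/p'
# → /comics/16/
```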
 
Old 06-17-2010, 01:52 PM   #3
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Hi !

Thank you, but it's not working in my case: it picks up the "-" in the source code and tries to use it as arguments.
Moreover, when I test it, it only gives me "padding: 0px 7px;" and not "/comics/133" as it should.

Thank you again ! :-)
 
Old 06-17-2010, 02:19 PM   #4
arizonagroovejet
Senior Member
 
Registered: Jun 2005
Location: England
Distribution: openSUSE, Fedora, CentOS
Posts: 1,093

Rep: Reputation: 197
It sounds like you're trying to download all strips from an online comic. Is that right?
If so, there might be an easier way. Look to see where the images you're interested in are stored. They might be all in the same directory and/or they might have filenames in a format that contains a common string and then some numbers, e.g. a date. Something like

http://somewhere.com/comics/comic_20090510.jpg
http://somewhere.com/comics/comic_20090512.jpg
http://somewhere.com/comics/comic_20090514.jpg

If so, then you just work out the pattern of the numbers and do a loop using seq. E.g.

Code:
for i in $(seq 10 2 14);do wget http://somewhere.com/comics/comic_200905${i}.jpg;done
You can put loops inside loops if you want.


Code:
for month in $(seq 1 12); do
  for date in $(seq 10 2 14); do
    wget http://somewhere.com/comics/comic_2009${month}${date}.jpg
  done
done
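One caveat with the sketch above: months 1-9 come out unpadded, so they won't match date-style filenames like 20090510. seq -w pads with leading zeros (echo stands in for wget here, and the URL is still the made-up example):

```shell
# seq -w pads every number to the width of the largest, so months
# and days come out as 01..12 and 01..31
for month in $(seq -w 1 12); do
  for date in $(seq -w 1 31); do
    echo "http://somewhere.com/comics/comic_2009${month}${date}.jpg"
  done
done
# first line printed: http://somewhere.com/comics/comic_20090101.jpg
```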
 
Old 06-17-2010, 04:41 PM   #5
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Hi !

Arf! Caught out... ^^'

Yes, I would like to download all the strips of Cyanide & Happiness so that I can read them directly on my computer.
I just hope it's not a bad thing to do...

And I can't download them using a number pattern, because the strips are all named differently, without numbers. :-/

That's why I would like to do it using curl and "Next >". ;-)
 
Old 06-17-2010, 06:07 PM   #6
arizonagroovejet
Senior Member
 
Registered: Jun 2005
Location: England
Distribution: openSUSE, Fedora, CentOS
Posts: 1,093

Rep: Reputation: 197
You could buy the book.

What you want to do is certainly possible. I've done it just for the hell of it; it took about ten minutes, though I didn't use the method you seem to be going after, messing around trying to find the URL for the next comic, because that's a crazy, overly complicated way to do it.

Have you actually tried "download the picture (as simple as the first step)"? If you grab a comic page with curl, the line that contains the URL for the actual comic image is very long. I looked at it for a minute or so before deciding that trying to extract the URL using a regexp was way too difficult.

Anyway, I don't feel inclined to give you a complete solution; I don't want to rob you of the satisfaction of working it out for yourself. I will tell you that it can be done using only the commands for, curl, grep, seq and tr.
 
Old 06-17-2010, 07:39 PM   #7
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Yes, I could buy the book, but I'm not doing this instead of buying the book; it's just so I don't have to use the website.
It's full of ads and too slow for me (one page takes about 10 seconds to load completely!).

If I decide that the comics deserve a donation, I will give them some money (but I won't buy a book that would cost me double in customs fees; I've already had that problem).

What do you mean by "trying to download the picture (as simple as the first step)"? If you're talking about the right-click option, I won't do that, because it's not what I want to do (I just want a copy of the comics on my computer), and it's not really possible to do that for more than 1000 strips. :-/

Quote:
I've done it just for the hell of it, took about ten minutes, though I didn't use the method you seem to be going after with
You've downloaded all the strips? How did you do that? 0_o
 
Old 06-18-2010, 12:14 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,552

Rep: Reputation: 2898
OK ... I think I've got it, now that I have seen the source:
Code:
var=$(curl -s <site> | sed -r -n '/Next/s/.*(comics\/[0-9]+\/)">Next.*/\1/p')
These are the first two results, starting from the first page:
Quote:
comics/39/
comics/40/
 
Old 06-18-2010, 04:43 AM   #9
arizonagroovejet
Senior Member
 
Registered: Jun 2005
Location: England
Distribution: openSUSE, Fedora, CentOS
Posts: 1,093

Rep: Reputation: 197
Quote:
Originally Posted by moicpit View Post
You've downloaded all the strips ?
Nope, but I wrote a script which will do it. Or at least I believe it will do it, assuming that all the comic pages use the same layout. It worked on the five most recent comics.


I haven't actually got my script to hand right now (it's at home and I am not), but I'll give you another hint. The URLs for the comic pages contain a sequential number, although there are gaps. E.g. the first comic appears to be at /comics/15/, the next one is /comics/40/, and there is nothing at /comics/2080.

Code:
for i in $(seq 1 2084); do [stuff with curl and grep etc]; done
Also, the tr command can be used to break a really long line up into multiple lines, which you can then pass to grep to extract the line you want. E.g.

Code:
tr '"' '\n'
Will replace any " characters in the line with newlines.

Code:
me@mine:~> echo 'lorry"car"boat'
lorry"car"boat
me@mine:~> echo 'lorry"car"boat' | tr '"' '\n'
lorry
car
boat
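Combining the two hints, a long HTML line can be split on its double quotes and the image URL fished out with grep (the img line here is invented):

```shell
# split the line on double quotes, then keep only the piece ending in .jpg
echo '<img alt="a comic" src="http://www.example.net/files/strip1.jpg">' |
  tr '"' '\n' |
  grep '\.jpg$'
# → http://www.example.net/files/strip1.jpg
```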
 
Old 06-18-2010, 01:34 PM   #10
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Thank you all !

I'll try with your solutions and I'll tell you when it's done (or if i've got another problem -_-').

If everything works fine, I'll give you my script. ;-)
 
Old 06-20-2010, 12:22 PM   #11
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Hi !

So, I really don't understand regexes! They look like Chinese to me! And I tried... :-/

Can you tell me what regex would make my script find the URL of the picture to download?
In this case, the HTML that displays the picture looks like this:
Code:
<img alt="Cyanide and Happiness, a daily webcomic" src="http://www.explosm.net/db/files/Comics/pingpong0001.jpg">
I would like to find the whole address. Because it's not always "http://www.explosm.net/db/files/Comics/***.jpg", I can't just search for "http://www.explosm.net/db/files/Comics/".
But I think the "alt" stays unchanged on all the pages. ;-)

So, please, help me! ^^'
 
Old 06-20-2010, 02:28 PM   #12
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 336

Rep: Reputation: 141
Try
Code:
curl -s http://www.explosm.net/comics/2082/ |
sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p'

http://www.explosm.net/db/files/Comics/Rob/deadbody.png
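Tested offline against the sample img tag quoted earlier in the thread, the same sed line pulls out the src value:

```shell
# extract the src attribute from the sample img tag
echo '<img alt="Cyanide and Happiness, a daily webcomic" src="http://www.explosm.net/db/files/Comics/pingpong0001.jpg">' |
  sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p'
# → http://www.explosm.net/db/files/Comics/pingpong0001.jpg
```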
 
Old 06-21-2010, 10:43 AM   #13
moicpit
LQ Newbie
 
Registered: Mar 2010
Posts: 26

Original Poster
Rep: Reputation: 15
Thank you all! I now have a functional script!

Here is the code:
Code:
#!/bin/sh

echo "=================== Download picture from $1$2 =================="

# Fetch the page; quoting "$source_code" below preserves its line breaks,
# which also makes the earlier tr workaround for "-" unnecessary
source_code=$(curl -s "$1$2")

echo "$source_code" > source_code.html

link=$(sed -r -n '/Next/s/.*(comics\/[0-9]+\/)">Next.*/\1/p' source_code.html)
echo "link = $link"

picture=$(sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p' source_code.html)
echo "picture = $picture"

picture_name=$(basename "$picture")

curl -s "$picture" > "$picture_name"

# Recurse to the next page while a "Next >" link was found
if [ -n "$link" ]
then
    "$0" "$1" "$link"
else
    exit 1
fi

exit 0
You can run it by typing this:
Code:
./suck_images.sh http://www.explosm.net/ comics/15/
It still has some bugs, so I sometimes need to relaunch it, but it works. ;-)
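The two sed extractions used in the script can also be wrapped as functions and driven by a while loop instead of recursion. A sketch only: the main loop is commented out because it needs network access, and the URLs are the ones from this thread:

```shell
# Pull the "Next >" link path out of a page's source (sed line from this thread)
extract_next() {
  sed -r -n '/Next/s/.*(comics\/[0-9]+\/)">Next.*/\1/p'
}

# Pull the comic image URL out of a page's source
extract_img() {
  sed -n 's/.*<img alt="[^"]*" src="\([^"]*\).*/\1/p'
}

# Main loop sketch (uncomment to run; requires network access):
# page="comics/15/"
# while [ -n "$page" ]; do
#   html=$(curl -s "http://www.explosm.net/$page")
#   img=$(printf '%s\n' "$html" | extract_img)
#   [ -n "$img" ] && curl -s "$img" -o "$(basename "$img")"
#   page=$(printf '%s\n' "$html" | extract_next)
# done
```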
 
  

