I would like to make a shell script that will do the following:
- fetch a web page's source code using curl
- download a picture from the web page
- find a particular line in the source code of the web page
- extract a string (a link path) from that line
- go to another page of the website using the result of the previous step
- repeat all this until there is no match for step 3
I know how to:
- get the web page's source code using curl (simple, we just have to run "curl http://www.mywebpage.com")
- download the picture (as simple as the first step)
- find a line in the source code (grep is made for that, isn't it?)
- go to another page from the result of the previous step (it's the same as the first step)
But I really don't know how I can do the fourth step of my script.
In fact, I don't know regular expressions, or even which shell command could help me do this.
To be clearer, I would like to store in a variable the string represented by the star (*) in this: <a href="*">Next ></a>
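One way to grab just that href value is to let grep find the line and let sed strip away the surrounding markup. This is only a rough sketch: it assumes the "Next >" link appears exactly like the snippet above and only once per page, and the URL here is a placeholder.
Code:
# Fetch the page, keep the line with the "Next >" link, and cut out the href value
next_path=$(curl -s "http://www.mywebpage.com" \
    | grep 'Next >' \
    | sed 's/.*<a href="\([^"]*\)">Next >.*/\1/')
echo "$next_path"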
Thank you, but it's not working in my case because it recognizes the "-" in the source code and tries to use it as an argument.
Moreover, when I run it in my tests, it only gives me "padding: 0px 7px;" and not "/comics/133" like it should.
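If the problem is that grep is treating a string beginning with "-" as a command-line option, you can usually get around it with -e or --. This is just a guess at what is going wrong, since I can't see the exact command (page.html stands in for whatever file or pipe you are searching):
Code:
# Both forms stop grep from reading a leading "-" in the pattern as an option
grep -e '-something' page.html
grep -- '-something' page.html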
It sounds like you're trying to download all strips from an online comic. Is that right?
If so, there might be an easier way. Look to see where the images you're interested in are stored. They might all be in the same directory, and/or they might have filenames in a format that contains a common string and then some numbers, e.g. a date. Something like
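this, purely as an illustration (the URL and filename pattern here are made up; adjust them to whatever you actually see in the page source):
Code:
# Grab a numbered series of images directly, assuming a predictable filename pattern
for i in $(seq -w 1 100); do
    curl -s -O "http://www.example.com/comics/strip-$i.jpg"
done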
Yes, I would like to download all the strips of Cyanide & Happiness so that I can see them directly on my computer.
I just hope that it's not a bad thing to do this...
And I can't download them using a number pattern because all the strips are named differently and without numbers. :-/
That's why I would like to do it using curl and "Next >". ;-)
What you want to do is certainly possible. I've done it just for the hell of it; it took about ten minutes, though I didn't use the method you seem to be going after, messing around trying to find the URL for the next comic, because that's a crazily over-complicated way to do it.
Have you actually tried "download the picture (as simple as the first step)"? If you grab a comic page with curl, the line that contains the URL for the actual comic image is very long. I looked at it for a minute or so before deciding that trying to extract the URL with a regexp was way too difficult.
Anyway, I don't feel inclined to give you a solution to what you're trying to do; I don't want to rob you of the satisfaction of working it out for yourself. I will tell you that it can be done using only the commands for, curl, grep, seq and tr.
Yes, I could buy the book, but I'm not doing this to replace buying the book; it's just so I don't have to use the website.
It's too full of ads and too slow for me (one page takes about 10 seconds to download completely!).
If I decide that the comic deserves a donation, I will give them some money (but I won't buy a book that will cost me double in customs fees; I've already had that problem).
What do you mean by "trying to download the picture (as simple as the first step)"? If you're talking about the right-click option, I won't do that, because it's not what I want (just having a copy of the comics on my computer), and it's not really "possible" to do that for more than 1000 strips. :-/
Quote:
I've done it just for the hell of it, took about ten minutes, though I didn't use the method you seem to be going after with
You've downloaded all the strips? How did you do that? 0_o
Nope, but I wrote a script which will do it. Or at least I believe it will do it, assuming that all the comic pages use the same layout. It worked on the five most recent comics.
I haven't actually got my script to hand right now (it's at home and I am not), but I'll give you another hint. The URLs for the comic pages contain a sequential number, although there are gaps: e.g. the first comic appears to be at /comics/15/, the next one is /comics/40, and there is nothing at /comics/2080.
Code:
for i in $(seq 1 2084); do [stuff with curl and grep etc]; done
Also, the tr command can be used to break a really long line up into multiple lines, which you can then pass to grep and tell grep to extract the line you want. E.g.
Code:
tr '"' '\n'
That will replace any " characters in the line with newlines.
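As a generic illustration of that trick (the URL and the .jpg pattern here are just placeholders, not the actual solution):
Code:
# Split the page source at every double quote, then pick out the lines of interest
curl -s "http://www.mywebpage.com" | tr '"' '\n' | grep '\.jpg$'
Each attribute value ends up on its own line, so a plain grep is enough to pull out the one you want.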
So, I really don't understand regexes! They look like Chinese to me! And I tried... :-/
Can you tell me what regex would let my script find the URL of the picture to download?
In this case, the HTML to display the picture looks like this :
Code:
<img alt="Cyanide and Happiness, a daily webcomic" src="http://www.explosm.net/db/files/Comics/pingpong0001.jpg">
But I would like to find the whole address. Because it isn't always "http://www.explosm.net/db/files/Comics/***.jpg", I can't just search for "http://www.explosm.net/db/files/Comics/".
But I think that the "alt" stays unchanged on all the pages. ;-)
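Since the alt text seems to stay the same, one possible approach (just a sketch, building on the tr hint above, and assuming that alt value really is identical on every page; the /comics/15/ URL is taken from the earlier post) is:
Code:
# Split the source on double quotes, then take the src value that follows the constant alt text
curl -s "http://www.explosm.net/comics/15/" \
    | tr '"' '\n' \
    | grep -A 2 'Cyanide and Happiness, a daily webcomic' \
    | grep '\.jpg$'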