Linux CMD - Increment URL and Site check
Evening folks
I am unsure if this is the correct section, but here goes.
Mazda, in all their wisdom, have a copy of their diagnostic manual on their site. Unfortunately the URL needs to be incremented by 100 or so at a time, and even then there is no guarantee that a given URL will exist. Two examples:
https://euroesi.mazda.co.jp/esicont/...2f1800300.html
https://euroesi.mazda.co.jp/esicont/...2f1801000.html
What I am hoping to find is a one-liner for the terminal that will start at id0102f1800300 and increment the last four digits by 100 at a time. After each increment it should check whether the page exists. Once I have a 200, I then need to get the title of the page and, at the end, put it all nicely in a text file so I can make sense of it later on.
Regards
P
You could use seq for that.
Code:
for n in $(seq -w 300 100 1000); do echo wget ... "http://www.example.com/2f180$n"; done
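To see what the loop feeds into the URL, it may help to run seq on its own first. The sketch below only prints page names and downloads nothing. The function name page_ids is made up for illustration, and the id0102f180 prefix is taken from the starting id in the question.

```shell
# page_ids FIRST STEP LAST  (hypothetical helper, not from the thread)
# seq -w pads every number with leading zeros to the width of the
# widest value, so 300 becomes 0300 when LAST is 1000.
page_ids() {
    for n in $(seq -w "$1" "$2" "$3"); do
        printf 'id0102f180%s.html\n' "$n"
    done
}

# Prints id0102f1800300.html through id0102f1801000.html, one per line.
page_ids 300 100 1000
```

The zero padding is what keeps the last four digits aligned, so the first name matches the id0102f1800300 starting point.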
Cheers for the reply. I was hoping for a one-liner, but after a bit of reading I don't think I'll get that.
At the moment I use
Code:
for n in $(seq -w 100 100 999900); do echo "https://euroesi.mazda.co.jp/esicont/eu_eng/mazda3/20060311105619/html/id0110f1$n.html"; done >> /home/list.txt
Code:
while read -r url; do
I still need to find out how to put a title against each link I get a 200 from, but I will try to get that done tomorrow. Cheers for the reply.
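The while loop in the post above is cut off, so the following is only a guess at the general shape, not a reconstruction of it. It is a minimal sketch that reads URLs from standard input and keeps those that answer with a 200. The function names http_status and keep_ok are invented for illustration, and it asks curl for the status code directly with -w '%{http_code}' instead of parsing the header lines.

```shell
# http_status URL  (hypothetical helper)
# -s silences progress output, -o /dev/null discards the page body,
# -w '%{http_code}' prints only the numeric status (000 if the
# connection failed).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# keep_ok  (hypothetical helper)
# Reads one URL per line on stdin and prints only those returning 200.
keep_ok() {
    while read -r url; do
        if [ "$(http_status "$url")" = "200" ]; then
            printf '%s\n' "$url"
        fi
    done
}
```

Assuming the list was written to /home/list.txt as in the post, it could be used as keep_ok < /home/list.txt > ok.txt, where ok.txt is an arbitrary output name.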
You don't need both head and awk. The latter will do it all:
Code:
... | awk 'NR==1 { print $2; exit; }'
The title should be in the title element, and you can extract it from your curl output using grep, whether it is all on one line or split over multiple lines:
Code:
grep -m 1 -P -z -o '(?s)(?<=<title>).*?(?=</title>)'
However, speaking of Perl, given the direction you seem to be heading, you should really be using Perl for this. Take a look at the CPAN module LWP and also an HTML parsing module such as HTML::TokeParser or one of the others.
Also, be sure to take it easy on the target site. Add some waits inside your loop and maybe rate limit the download as well.
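Putting the grep expression and the advice about waits together might look roughly like this. It is a sketch under several assumptions: GNU grep built with -P support, a lowercase <title> tag with no attributes, and made-up names (extract_title, fetch_title, DELAY) plus an arbitrary 50k rate cap.

```shell
# extract_title  (hypothetical helper)
# Reads HTML on stdin and prints the text between <title> and </title>.
# -z makes grep treat the whole input as a single record, so the match
# may span lines; with -z each match ends in a NUL byte, which tr
# converts back to a newline.
extract_title() {
    grep -m 1 -P -z -o '(?s)(?<=<title>).*?(?=</title>)' | tr '\0' '\n'
}

# fetch_title URL  (hypothetical helper)
# Downloads one page at a capped rate, prints "URL<TAB>TITLE", then
# sleeps for DELAY seconds (default 2) before returning.
fetch_title() {
    title=$(curl -s --limit-rate 50k "$1" | extract_title)
    printf '%s\t%s\n' "$1" "$title"
    sleep "${DELAY:-2}"
}

# Offline demonstration of the extraction step only.
printf '<html><head><title>Demo page</title></head></html>\n' | extract_title
```

Given a file of working URLs, here called ok.txt as a placeholder, something like while read -r url; do fetch_title "$url"; done < ok.txt > titles.txt would then produce the text file asked for in the first post, one URL and title per line.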