Old 09-22-2017, 12:22 PM   #1
pr0xibus
Member
 
Registered: Apr 2004
Location: Scotland
Distribution: Slackware
Posts: 215

Rep: Reputation: 44
Linux CMD - Increment URL and Site check


Evening folks

I am unsure if this is the correct section but here goes.

So Mazda, in all their wisdom, have a copy of their diagnostic manual on their site. Unfortunately the URL needs to be incremented by 100 or so at a time, and even then there is no guarantee it will be correct. Two examples below:

https://euroesi.mazda.co.jp/esicont/...2f1800300.html
https://euroesi.mazda.co.jp/esicont/...2f1801000.html

What I am hoping to find out is a one-liner for the terminal that will start off at

id0102f1800300 and increment the last four digits by 100 at a time. Once the increment has happened, check to see if the page exists, along with the title of the page.

Once I have the 200, I then need to get the title of the page and, at the end, put it all neatly in a text file so I can make sense of it later on.

Regards
P
 
Old 09-22-2017, 12:39 PM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
You could use seq for that.

Code:
 for n in $(seq -w 300 100 1000); do echo wget ... "http://www.example.com/2f180$n"; done;
See "man seq" for the details.
 
Old 09-22-2017, 05:10 PM   #3
pr0xibus
Member
 
Registered: Apr 2004
Location: Scotland
Distribution: Slackware
Posts: 215

Original Poster
Rep: Reputation: 44
Cheers for the reply. I was hoping for a one-liner, but after a bit of reading I don't think I'll get that.

At the moment I use

Code:
for n in $(seq -w 100 100 999900); do echo "https://euroesi.mazda.co.jp/esicont/eu_eng/mazda3/20060311105619/html/id0110f1$n.html"; done >> /home/list.txt
After that is done I use

Code:
while read -r url; do
        # take the status code from the first line of the response headers
        res=$(curl -sSD- -o /dev/null -- "$url" | head -1 | awk '{ print $2 }')

        if (( res == 200 )); then
                printf '%s is a valid url.\n' "$url"
        fi

done < /home/list.txt
to list out which URLs give me a 200 reply from the site.
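
Both steps could probably be rolled into a single loop, something like this untested sketch (/home/valid.txt is just an example name):

Code:
for n in $(seq -w 100 100 999900); do
        url="https://euroesi.mazda.co.jp/esicont/eu_eng/mazda3/20060311105619/html/id0110f1$n.html"
        # keep only the status code from the first header line
        res=$(curl -sSD- -o /dev/null -- "$url" | head -1 | awk '{ print $2 }')
        [ "$res" = 200 ] && printf '%s is a valid url.\n' "$url"
done >> /home/valid.txt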

I still need to find out how to put a title against each link that returns a 200, but I will try to get that done tomorrow.

Cheers for the reply

Last edited by pr0xibus; 09-22-2017 at 05:11 PM.
 
Old 09-22-2017, 11:07 PM   #4
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
You don't need both head and awk. The latter will do it all:

Code:
... | awk 'NR==1 { print $2; exit; }'
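Applied to your loop, that would be:

Code:
res=$(curl -sSD- -o /dev/null -- "$url" | awk 'NR==1 { print $2; exit; }')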
However, you might save the whole HTTP response into a temp file using /bin/tempfile so you can also extract the title without querying the server a second time.

The title should be in the title element, and you can extract that from your curl output using grep, whether it is all on one line or split over multiple lines:

Code:
grep -m 1 -P -z -o '(?s)(?<=<title>).*?(?=</title>)'
... if your version of grep supports PCRE, so you can use positive look-behind and look-ahead assertions to print only the title itself.
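
Putting the pieces together, something along these lines should work. An untested sketch: I've used mktemp here rather than tempfile since it is more widely available, and /home/results.txt is just an example name:

Code:
while read -r url; do
        tmp=$(mktemp) || exit 1
        # -D - sends the headers to stdout; the body lands in the temp file
        res=$(curl -sS -D - -o "$tmp" -- "$url" | awk 'NR==1 { print $2; exit; }')
        if [ "$res" = 200 ]; then
                # grep -z separates its output with NULs, so strip them
                title=$(grep -m 1 -P -z -o '(?s)(?<=<title>).*?(?=</title>)' "$tmp" | tr -d '\0')
                printf '%s\t%s\n' "$url" "$title"
        fi
        rm -f -- "$tmp"
done < /home/list.txt > /home/results.txt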

However, speaking of Perl: given the direction you seem to be heading, you should be using Perl for this. Take a look at the CPAN module LWP and also an HTML parsing module like HTML::TokeParser or one of the others.

Also, be sure to take it easy on the target site. Add some waits inside your loop, and maybe rate-limit the download as well.
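
With curl that could look something like this (the values are arbitrary):

Code:
while read -r url; do
        curl -sS --limit-rate 50k -o /dev/null -D - -- "$url"
        sleep 2        # pause between requests to go easy on the server
done < /home/list.txt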
 
  

