Old 09-22-2017, 12:22 PM   #1
pr0xibus
Member
 
Registered: Apr 2004
Location: Scotland
Distribution: Slackware
Posts: 218

Rep: Reputation: 44
Linux CMD - Increment URL and Site check


Evening folks

I am unsure if this is the correct section but here goes.

So Mazda, in all their wisdom, have a copy of their diagnostic manual on their site; unfortunately the URL needs to be incremented by 100 or so at a time, and even then there is no guarantee it will be correct. Two examples below:

https://euroesi.mazda.co.jp/esicont/...2f1800300.html
https://euroesi.mazda.co.jp/esicont/...2f1801000.html

What I am hoping to find out is a one-liner for the terminal that will start off at

id0102f1800300 and increment the last 4 digits by 100 at a time; once the increment has happened, check whether the page exists, along with the title of the page.

Once I have the 200 I then need to get the title of the page and, at the end, put it all nicely in a text file so I can make sense of it later on.

Regards
P
 
Old 09-22-2017, 12:39 PM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,522
Blog Entries: 4

Rep: Reputation: 3831
You could use seq for that.

Code:
 for n in $(seq -w 300 100 1000); do echo wget ... "http://www.example.com/2f180$n"; done;
See "man seq" for the details.
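The -w flag pads the numbers with leading zeros to a common width, which matches the four-digit page suffix. For example, the command above generates:

Code:
$ seq -w 300 100 1000
0300
0400
0500
0600
0700
0800
0900
1000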
 
Old 09-22-2017, 05:10 PM   #3
pr0xibus
Member
 
Registered: Apr 2004
Location: Scotland
Distribution: Slackware
Posts: 218

Original Poster
Rep: Reputation: 44
Cheers for the reply. I was hoping for a one-liner, but after a bit of reading I don't think I'll get that.

At the moment I use

Code:
for n in $(seq -w 100 100 999900); do echo "https://euroesi.mazda.co.jp/esicont/eu_eng/mazda3/20060311105619/html/id0110f1$n.html"; done >> /home/list.txt
After that is done I use

Code:
while read -r url; do
        # the status code is the second field of the first header line,
        # e.g. "HTTP/1.1 200 OK"
        res=$(curl -sSD- -o /dev/null -- "$url" | head -1 | awk '{ print $2 }')

        if (( res == 200 )); then
                printf '%s is a valid url.\n' "$url"
        fi

done < /home/list.txt
To list out which URLs give me a 200 reply from the site.
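I suppose the two steps could also be folded into a single loop so there is no intermediate list file; something like this (untested, and /home/valid.txt is just an example name):

Code:
# untested: generate each URL, check it, and record the 200s in one pass
for n in $(seq -w 100 100 999900); do
        url="https://euroesi.mazda.co.jp/esicont/eu_eng/mazda3/20060311105619/html/id0110f1$n.html"
        res=$(curl -sSD- -o /dev/null -- "$url" | head -1 | awk '{ print $2 }')
        if (( res == 200 )); then
                printf '%s\n' "$url" >> /home/valid.txt
        fi
done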

I still need to find out how to put a title against each link I get a 200 from, but will try and get that done tomorrow.

Cheers for the reply

Last edited by pr0xibus; 09-22-2017 at 05:11 PM.
 
Old 09-22-2017, 11:07 PM   #4
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,522
Blog Entries: 4

Rep: Reputation: 3831
You don't need both head and awk. The latter will do it all:

Code:
... | awk 'NR==1 { print $2; exit; }'
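Dropped into the loop from your last post, that would look like:

Code:
res=$(curl -sSD- -o /dev/null -- "$url" | awk 'NR==1 { print $2; exit; }')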
However, you might save the whole HTTP response into a temp file using /bin/tempfile so you can also extract the title without querying the server a second time.
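A rough sketch of that idea, using mktemp here rather than tempfile (either works):

Code:
# one request: the headers go to the pipe for the status check,
# the body is kept in a temp file for the title extraction below
tmp=$(mktemp)
res=$(curl -sSD- -o "$tmp" -- "$url" | awk 'NR==1 { print $2; exit; }')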

The title should be in the title element, and you can extract it from your curl output using grep, whether it is all on one line or split over multiple lines:

Code:
grep -m 1 -P -z -o '(?s)(?<=<title>).*?(?=</title>)'
... if your version of grep supports PCRE, so you can use positive look-behind and look-ahead assertions to print only the title itself.
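Put together with the temp file from the sketch above, usage might look like this (the tr strips the trailing NUL that -z produces, and titles.txt is just an example name):

Code:
title=$(grep -m 1 -P -z -o '(?s)(?<=<title>).*?(?=</title>)' "$tmp" | tr -d '\0')
printf '%s\t%s\n' "$url" "$title" >> /home/titles.txt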

However, speaking of Perl: given the direction you seem to be heading, you should be using Perl for this. Take a look at the CPAN module LWP, and also an HTML parsing module like HTML::TokeParser or one of the others.

Also, be sure to take it easy on the target site. Add some waits inside your loop and maybe rate limit the download as well.
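For example, inside the loop; the two-second wait and the 50k cap are arbitrary numbers to tune:

Code:
sleep 2          # be polite: pause between requests
res=$(curl -sS --limit-rate 50k -D- -o "$tmp" -- "$url" | awk 'NR==1 { print $2; exit; }')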
 
  

