As david1941 points out, there may be other ways to attack this problem. It all depends on why you need this data and what you're going to do with it. If you need to process the names in a script, a browser extension may not work for you.
Your set of commands filters the output from lynx -dump, which means you get a lot of URLs and, for some web sites, the text "links:" (the latter part of the "Hidden links:" header) followed by some more URLs.
You could filter on "^http://" (a "^" in a regular expression means "the start of the line") to lose any blank lines and the "links:" header. I've used "www.microsoft.com" in this example, as the page contains an unusually large number of links and should be a good test case:
Code:
lynx -dump http://www.microsoft.com \
| grep -A999 "^References$" \
| tail -n +3 \
| awk '{print $2 }' \
| grep "http://"
(I've used backslash escaping to split the code across multiple lines, as this improves readability.)
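If you want to see what each stage of that filter does without hitting a real site, here is a small sketch that runs the same chain over canned text. The sample text is made up and only approximates what lynx -dump prints; the function names sample_dump and extract_urls are my own, not part of any tool.

```shell
#!/bin/sh
# Made-up text shaped roughly like the end of `lynx -dump` output.
sample_dump() {
cat <<'EOF'
   Some page text with a [1]link and [2]another.

References

   1. http://www.example.com/
   2. http://www.example.com/about/index.html

   Hidden links:
   4. http://cdn.example.net/img/logo.png
EOF
}

# Same filter chain as above, but reading stdin instead of calling lynx.
extract_urls() {
    grep -A999 '^References$' \
    | tail -n +3 \
    | awk '{ print $2 }' \
    | grep '^http://'
}

sample_dump | extract_urls
```

Without the final grep you would also get an empty line and the word "links:" in the output, which is exactly the noise the "^http://" filter removes.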
The search-and-replace function in sed could then be used to remove the "http://" part as well as everything after the host name (the path):
Code:
lynx -dump http://www.microsoft.com \
| grep -A999 "^References$" \
| tail -n +3 \
| awk '{print $2 }' \
| grep "http://" \
| sed -e "s/http:\/\///" -e "s/\/.*//"
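As a side note, sed lets you pick a delimiter other than "/", which saves you from escaping every slash. The sketch below does the same two substitutions that way on a few made-up URLs. The function name strip_to_host is hypothetical, the "^" anchor and the sort -u at the end (to drop duplicate host names) are my additions and not part of the pipeline above.

```shell
#!/bin/sh
# Reads URLs on stdin and prints only the host-name part.
# "|" is used as the sed delimiter so the slashes need no escaping.
strip_to_host() {
    sed -e 's|^http://||' -e 's|/.*||'
}

# Made-up input; sort -u removes repeated host names.
printf '%s\n' \
    'http://www.example.com/' \
    'http://www.example.com/about/index.html' \
    'http://cdn.example.net/img/logo.png' \
| strip_to_host | sort -u
```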
That should leave you with a simple list of host names. You could use a while loop to read each name into a variable and feed that to the host command:
Code:
lynx -dump http://www.microsoft.com \
| grep -A999 "^References$" \
| tail -n +3 \
| awk '{print $2 }' \
| grep "http://" \
| sed -e "s/http:\/\///" -e "s/\/.*//" \
| while read -r hname ; do
    host "$hname"
done
However, the output from the host command is somewhat unpredictable. The host name may have an A record, in which case one or more IP addresses are returned, or it could actually be a CNAME, in which case the host command will attempt to follow the pointer and recursively resolve the name (good), and will report its progress as it goes along (not necessarily what you want to see in your list). Further filtering through grep and sed could be used to return just the IP address(es):
Code:
lynx -dump http://www.microsoft.com \
| grep -A999 "^References$" \
| tail -n +3 \
| awk '{print $2 }' \
| grep "http://" \
| sed -e "s/http:\/\///" -e "s/\/.*//" \
| while read -r hname ; do
    host "$hname" \
    | grep "has address" \
    | sed "s/.*has address //"
done
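To make it clearer why the grep on "has address" is enough, here is a sketch that runs the same two filters over canned text. The sample is my approximation of what host prints for a CNAME that points at a name with two A records; the names and addresses are invented, and sample_host_output and only_addresses are hypothetical helper names.

```shell
#!/bin/sh
# Invented text shaped like typical `host` output for an aliased name.
sample_host_output() {
cat <<'EOF'
www.example.com is an alias for web.example.net.
web.example.net has address 192.0.2.10
web.example.net has address 192.0.2.11
web.example.net has IPv6 address 2001:db8::10
web.example.net mail is handled by 10 mail.example.net.
EOF
}

# Keep only the A-record lines, then strip everything before the address.
only_addresses() {
    grep 'has address' | sed 's/.*has address //'
}

sample_host_output | only_addresses
```

The alias, IPv6, and mail lines do not contain the exact phrase "has address", so they fall away and only the two IPv4 addresses remain.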
You now have a simple list of IP addresses, but not the corresponding host names. If you also want the host names, the output from the host command needs to be parsed line by line (as a host name can resolve to more than one IP address). An echo statement inside a second while loop would do the trick:
Code:
lynx -dump http://www.microsoft.com \
| grep -A999 "^References$" \
| tail -n +3 \
| awk '{print $2 }' \
| grep "http://" \
| sed -e "s/http:\/\///" -e "s/\/.*//" \
| while read -r hname ; do
    host "$hname" \
    | grep "has address" \
    | sed "s/.*has address //" \
    | while read -r addr; do
        echo "$addr $hname"
    done
done
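If you'd like to experiment with the nested loop without doing any DNS lookups, one option is to shadow host with a shell function that returns canned answers. This is only a testing trick: the stub below and its data are entirely made up, and pair_addresses is a hypothetical name for the loop shown above.

```shell
#!/bin/sh
# Stand-in for the real host command; returns invented answers.
host() {
    case "$1" in
        www.example.com)
            printf '%s\n' \
                'www.example.com has address 192.0.2.10' \
                'www.example.com has address 192.0.2.11' ;;
        cdn.example.net)
            printf '%s\n' \
                'cdn.example.net is an alias for edge.example.net.' \
                'edge.example.net has address 198.51.100.7' ;;
        *)
            echo "Host $1 not found: 3(NXDOMAIN)" ;;
    esac
}

# Reads host names on stdin, prints one "address hostname" line per A record.
pair_addresses() {
    while read -r hname; do
        host "$hname" \
        | grep 'has address' \
        | sed 's/.*has address //' \
        | while read -r addr; do
            echo "$addr $hname"
        done
    done
}

printf '%s\n' www.example.com cdn.example.net | pair_addresses
```

Note that the name printed next to each address is the one you asked about, not the target of the alias, because $hname comes from the outer loop.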
While this works, it is:
- only one of many ways to solve this problem, and
- almost certainly not the best way
In fact, many would find this script ridiculously convoluted, and would opt to replace most of the code with a much shorter and arguably better awk program.
My point was to demonstrate how you can use common filtering and substitution mechanisms to alter the output from one command into any format you like, and not how to produce the shortest and most efficient solution to this particular problem.
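For comparison, here is a rough sketch of what the awk route might look like. It folds the grep, tail, awk, grep, and sed stages into a single awk program and also skips host names it has already printed. It runs on made-up sample text rather than real lynx output, and sample_dump and extract_hosts are names I invented for the demo, so treat it as a starting point rather than a drop-in replacement.

```shell
#!/bin/sh
# Made-up text shaped roughly like the end of `lynx -dump` output.
sample_dump() {
cat <<'EOF'
   Some page text with a [1]link and [2]another.

References

   1. http://www.example.com/
   2. http://www.example.com/about/index.html

   Hidden links:
   4. http://cdn.example.net/img/logo.png
EOF
}

# One awk program: find the References section, keep http:// URLs,
# reduce each to its host name, and print every host name once.
extract_hosts() {
    awk '
        /^References$/ { in_refs = 1; next }
        in_refs && $2 ~ /^http:\/\// {
            name = $2
            sub(/^http:\/\//, "", name)   # drop the scheme
            sub(/\/.*/, "", name)         # drop the path
            if (!seen[name]++) print name
        }
    '
}

sample_dump | extract_hosts
```

With real data you would pipe lynx -dump into extract_hosts instead of sample_dump, and could then feed the result to the while loop as before.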
I'd recommend you take a closer look at what commands like awk, sed, cut, and join (as well as regular expressions in general) can do when it comes to manipulating text.