LinuxQuestions.org
Old 10-22-2021, 08:28 PM   #1
ericlindellnyc
Member
 
Registered: Jun 2017
Posts: 181

Rep: Reputation: Disabled
Batch download linked assets from PDF files in website


I have downloaded a website that has many PDFs on it for download. I used
Code:
wget -r -l 2 -N -t1 --mirror --convert-links --adjust-extension --page-requisites --no-parent --no-check-certificate http://mileswmathis.com/
to get the PDFs.
I thought that adding -l 2 would also retrieve the media assets linked from the PDFs, but it did not. wget follows links in an HTML file, but not in a PDF.
There are great PDFs on this site, but I would like the linked assets also, such as when it links to an image or video file -- or anything else, for that matter.
I don't know how to batch sift through the PDFs to find the links and download linked content.
Ideas?
Thank you.
 
Old 10-23-2021, 01:21 AM   #2
b1bb2
Member
 
Registered: Oct 2021
Posts: 90

Rep: Reputation: Disabled
I did a similar project. Here are the project links to get you started. If needed, I can look for more details later. Sorry, I do not remember the commands offhand.
b1bb2.com/b/Downloads/application/makehumancommunity.org/sites/default/files/20210901/assetts/CC0/CC0.htm
http://www.makehumancommunity.org/fo...p?f=20&t=19858

If I were doing your project, I would first convert the PDFs to text (for example with pdftotext). Make sure no link is split across lines. Then use grep, which prints each line that contains a given pattern. A link is easy to detect.
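
A minimal sketch of the text-conversion approach, assuming pdftotext from poppler-utils is installed and the PDFs sit somewhere under one directory. The function names extract_urls and collect_pdf_urls are made up for illustration. Keep in mind that pdftotext only sees URLs printed in the visible text, so a hyperlink whose clickable text differs from its target will be missed, and a URL wrapped across two lines will be cut short.

```shell
# Print every http(s) URL found in the text on stdin, one per line, de-duplicated.
# Trailing sentence punctuation (. , ; :) is trimmed from each match.
extract_urls() {
    grep -Eo 'https?://[^[:space:]<>")]+' |
        sed -E 's/[.,;:]+$//' |
        LC_ALL=C sort -u
}

# Hypothetical driver: convert every PDF under a directory (default: .) to text
# on stdout and collect the URLs. Requires pdftotext.
collect_pdf_urls() {
    find "${1:-.}" -type f -iname '*.pdf' -exec pdftotext {} - \; 2>/dev/null |
        extract_urls
}
```

Something like `collect_pdf_urls mileswmathis.com > urls.txt` followed by `wget -N -i urls.txt -P assets` should then fetch everything on the list. The file name urls.txt and the directory name assets are placeholders.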
 
Old 10-23-2021, 08:10 AM   #3
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
To preserve links, you may convert PDF to HTML first, say, with pdftohtml from poppler-utils.

But AFAICT, many of the links are to his other essays, so if you've downloaded them all into the same directory, you'll just have to adjust them by removing http://mileswmathis.com/ from them.
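
A rough sketch of both steps, assuming the HTML produced by pdftohtml writes links as double-quoted href attributes. The function names extract_hrefs and strip_site_prefix are hypothetical, and the default prefix is the site from the first post.

```shell
# Print the href targets found in the HTML on stdin, one per line, de-duplicated.
extract_hrefs() {
    grep -Eio 'href="[^"]+"' |
        sed -E 's/^[Hh][Rr][Ee][Ff]="//; s/"$//' |
        LC_ALL=C sort -u
}

# Read one URL per line and drop a leading site prefix, so links to the
# mirrored site become relative paths. URLs on other hosts pass through unchanged.
strip_site_prefix() {
    prefix=${1:-http://mileswmathis.com/}
    while IFS= read -r url; do
        # The quoted pattern is matched literally, not as a glob.
        printf '%s\n' "${url#"$prefix"}"
    done
}
```

For a single file, `pdftohtml -i -noframes -stdout essay.pdf | extract_hrefs | strip_site_prefix` would list its link targets (essay.pdf is a placeholder name). Lines that still start with http are the external assets left to download.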
 
  



Tags
batch, download, link, media, pdf



All times are GMT -5. The time now is 03:17 PM.
