LinuxQuestions.org
Old 07-29-2017, 04:46 AM   #1
rupeshforu3
Member
 
Registered: Jun 2013
Location: India
Distribution: any Linux, BSD, Solaris, sco unixware, Windows 8
Posts: 59

Rep: Reputation: 0
How to copy file properties and directory structure of a website without any contents


Hi, I am Rupesh from India. I have examined a website that contains about 50,000 mp3 files, of which I want roughly 11,000. I have already downloaded 8,000 mp3 files and want to download up to 3,000 more from the same website and discard the remaining files.

I downloaded the files using an offline browser called Extreme Picture Finder. It has an option called "skip if the destination file exists", and I plan to re-download the files with that option selected. The application also has options for scanning and spidering the website, all of which I understand.

Previously, after downloading files with the offline browser, I copied them to another directory; the original directory structure was lost, so the files now sit in different directories.

The 11,000 files I want come to about 135 GB, of which I have downloaded 93 GB so far. If I can obtain the website's directory structure, along with files that carry only the filenames and no data, I can keep my local file and directory names the same as the website's.

At present I have openSUSE Leap 42.2 installed on my system. Upon opening a terminal emulator and issuing the command ls -R > filenames.txt, I can obtain a list of local filenames and directory names. Is there a command or tool to obtain just the filenames and directory names of a directory on a website and store the output in a text file?

So please suggest a way to obtain the list of directory names, and the filenames contained in those directories, and store it in a text file. If possible, please also suggest how to recreate the website's directory structure and filenames locally without any file content.
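
To make this concrete, here is a rough local sketch of what I am after, where paths.txt is a hypothetical text file holding one relative path per line (producing such a list from the website is exactly the part I do not know how to do):

Code:
# Recreate the directory tree and zero-length placeholder files
# from a list of relative paths, one per line.
while IFS= read -r path; do
    mkdir -p "$(dirname "$path")"   # create the parent directories
    touch "$path"                   # create an empty file with the same name
done < paths.txt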

Regards,
Rupesh.
 
Old 07-29-2017, 07:29 AM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,632

Rep: Reputation: 7965
Quote:
Originally Posted by rupeshforu3 View Post
I have examined a website that contains about 50,000 mp3 files, of which I want roughly 11,000... So please suggest a way to obtain the list of directory names, and the filenames contained in those directories, and store it in a text file.
Ok...we'll suggest you write a script to just compare what you want to download with what you've already downloaded. Since you've been here for several years, under both "rupeshforu3" and "rupeshforu", and have asked about scripting/programming for quite some time, this should be fairly easy for you to do:
http://www.linuxquestions.org/questi...eg-4175605139/
http://www.linuxquestions.org/questi...ux-4175516540/
http://www.linuxquestions.org/questi...an-4175442279/
http://www.linuxquestions.org/questi...ml#post4985952

...and since your thread involves what is essentially stealing/copyright violations from a website (by your own admission), I'm reporting this to the moderators for review.
 
Old 08-02-2017, 08:59 AM   #3
rupeshforu3
Member
 
Registered: Jun 2013
Location: India
Distribution: any Linux, BSD, Solaris, sco unixware, Windows 8
Posts: 59

Original Poster
Rep: Reputation: 0
I am not going to steal anyone's data or harm others. The site I want to download from is a non-profit spiritual website, and they distribute the files freely. The website itself clearly states that it does not host any copyrighted material, and it asks visitors to report anything they find that is copyrighted. For reference, I am providing the website address below. As the content they provide is not copyrighted, anyone can download it.


http://www.pravachanam.com/


Regards,
Rupesh.
 
Old 08-02-2017, 09:05 AM   #4
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,632

Rep: Reputation: 7965
Quote:
Originally Posted by rupeshforu3 View Post
I am not going to steal anyone's data or harm others. The site I want to download from is a non-profit spiritual website, and they distribute the files freely...
Be that as it may...you still haven't shown any work or effort on your part, and you have done this here previously. Again, as you've been told before, we WILL NOT write your scripts for you, but we will be happy to help if you're stuck. The first step is for you to post what you have done/tried on your own, which you have not, despite being asked.

You have been registered here for four years; your previous questions are in this same vein, going back to 2013:
http://www.linuxquestions.org/questi...eg-4175605139/
http://www.linuxquestions.org/questi...re-4175478332/

You said four years ago that you were a 'newbie'...that is not the case after four years. So again, you will have to write your own scripts. There are ample tutorials, scripting examples, etc. that you can find with a simple Google search, and that you should be familiar with after four years. Once you have a script written, post it if you can't make it work and we will all be happy to help you.
 
Old 08-02-2017, 11:10 AM   #5
IsaacKuo
Senior Member
 
Registered: Apr 2004
Location: Baton Rouge, Louisiana, USA
Distribution: Debian Stable
Posts: 2,546
Blog Entries: 8

Rep: Reputation: 465
Quote:
Originally Posted by rupeshforu3 View Post
So please suggest a way to obtain the list of directory names, and the filenames contained in those directories, and store it in a text file.
There is generally no way to get this directly unless you have ssh or ftp access to the site's files. Instead, you can use a web spider to crawl the web site, following every page it can reach to discover all of the linked files.

So you end up downloading the content of the HTML pages pretty much no matter what.

Not precisely what you asked for, but try something like this:

Code:
wget --spider --recursive --level=inf --no-verbose --output-file=outfile.txt http://www.pravachanam.com/
This will spider through the entire web site, downloading all of the HTML pages it can find in search of linked files. Then you can get a list of all of the mp3 files found with:

Code:
cat outfile.txt | grep ".mp3" | awk '{print $4}'
This will give you a list of URLs. You can then process that list to figure out which ones you already have, and use some method to download the remaining URLs...
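
As a rough sketch of that last step (everything here is hypothetical: urls.txt would hold the list extracted above, and ~/music is wherever the already-downloaded mp3s live), something like this filters out what you already have and fetches the rest:

Code:
# Build a list of mp3 filenames already on disk.
find ~/music -type f -name '*.mp3' -printf '%f\n' | sort > have.txt

# Keep only the URLs whose basenames are not in that list,
# then hand the remainder to wget.
while IFS= read -r url; do
    grep -qxF "$(basename "$url")" have.txt || echo "$url"
done < urls.txt > missing.txt

wget --no-clobber --input-file=missing.txt
Matching on the bare filename assumes the names are unique across directories; if they are not, compare full relative paths instead.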
 
Old 08-02-2017, 11:18 AM   #6
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,305
Blog Entries: 3

Rep: Reputation: 3720
Quote:
Originally Posted by IsaacKuo View Post
Code:
cat outfile.txt | grep ".mp3" | awk '{print $4}'
Or

Code:
awk '/\.mp3$/ { print $4; }' outfile.txt
 
Old 08-02-2017, 11:38 AM   #7
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,305
Blog Entries: 3

Rep: Reputation: 3720
It would be best if you could connect with SSH and then use find or rsync.

If you use wget, you'll have to download all the HTML files anyway just to be able to follow the links. However, you might want to look more closely at some of its options: --reject, --delete-after, and --recursive.
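
A rough sketch of both routes (user@host and /path/to/mp3s are placeholders, and the ssh/rsync variants only work if the site owner actually gives you access):

Code:
# With shell access: dump every directory and file name, no content
ssh user@host 'find /path/to/mp3s' > remote-listing.txt

# Or recreate just the directory tree locally, with no files at all
rsync -av --include='*/' --exclude='*' user@host:/path/to/mp3s/ ./mirror/

# Without shell access: crawl with wget, refuse to fetch the mp3s and
# delete each fetched page after it has been processed, keeping the log
wget --recursive --level=inf --no-verbose --output-file=structure.txt \
     --reject mp3 --delete-after http://www.pravachanam.com/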
 
Old 08-03-2017, 03:35 AM   #8
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by rupeshforu3 View Post
I have examined a website that contains about 50,000 mp3 files, of which I want roughly 11,000. I have already downloaded 8,000 mp3 files and want to download up to 3,000 more from the same website and discard the remaining files.
In that case the thread title indicates an X-Y problem: what you THINK is the solution to your problem isn't.

What you really want is this:
http://dt.iki.fi/download-filetype-website
 
Old 08-08-2017, 01:48 PM   #9
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

Rep: Reputation: 1015
Even if the OP isn't technically in violation of the law, I'm sure the site in question doesn't want users running download robots.
 
  

