Old 10-11-2012, 08:25 PM   #1
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Rep: Reputation: Disabled
preserving file contents while using wget or curl


Anyone know how to preserve file contents while using wget or curl? I have to download a series of JSON objects and want to save myself the trouble of having to write a crazy awk/sed parser. Everything has been merged into one freaking line. I would like the contents to remain intact, the way they appear on the internet.

EDIT:
At the rate this thread is moving, it would be faster to write a parsing script, which is what I am going to do! If someone wants to reply with an answer for future reference, or for other people searching for the same answer, then feel free. Thanks
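For anyone who finds this later: if the server itself serves the JSON minified (all on one line), then wget/curl are faithfully giving you exactly those bytes, and a pretty-printer will restore readable formatting. A minimal sketch (the URL here is just a placeholder):

wget -qO- 'https://example.com/feed.json' | python -m json.tool > feed.json

python -m json.tool ships with Python, so there is nothing extra to install.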

Last edited by amboxer21; 10-11-2012 at 08:49 PM.
 
Old 10-12-2012, 07:32 AM   #2
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
You... gave up on an answer after half an hour? Bearing in mind that in my time zone you posted at 2 AM, you're only really going to get an answer from people living near you (globally speaking) or with freakish sleep patterns. Anyway.

Do you have an example of the file you're downloading? Whenever I have used wget or curl, the file has downloaded exactly as it was on the web. The only things I can think of are:
  • Different line endings (shouldn't really be a problem; I would expect the all-on-one-line symptom if you were downloading a file with Unix line endings onto Windows)
  • The file uses <br/> for line endings and has no actual line breaks

However, it would be easier to work out what's going wrong if we had a link to the actual file.
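In the meantime, a quick way to narrow down which of those it is: look at the raw bytes of the download. A rough sketch (placeholder URL):

curl -s 'https://example.com/file.json' -o file.json
file file.json                  # mentions CRLF line terminators if endings are the issue
head -c 200 file.json | od -c   # shows \r, \n and \t literally; <br/> tags show up as text

If od -c shows no \n at all, the file really is one line on the server.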
 
Old 10-12-2012, 08:49 AM   #3
arunchinnachamy
LQ Newbie
 
Registered: Dec 2010
Posts: 4
Blog Entries: 1

Rep: Reputation: 0
amboxer21,
Like Snark1994 mentioned, wget and curl download the file exactly as it is on the internet. What you see in the browser probably differs from the wget/curl output because the browser formats the content based on its content type. A link to the file might help us understand the issue better.
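To check what the server actually declares, you can inspect the response headers; a minimal sketch with a placeholder URL:

curl -sI 'https://example.com/feed.json' | grep -i '^content-type'

If that comes back as application/json, the browser is pretty-printing it for display, and the single-line download is the true content.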
 
Old 10-12-2012, 10:46 AM   #4
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Original Poster
Rep: Reputation: Disabled
Like I said, I already wrote a one-liner to parse the info. For some odd reason, wget was removing tabs and newline characters and saving everything as one huge line. Whatever.

Last edited by amboxer21; 10-12-2012 at 10:48 AM.
 
Old 10-13-2012, 07:13 AM   #5
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
Well done for solving your own problem, and for marking the thread as 'SOLVED'. If you could post your one-liner, it would help other people who are having a similar problem.

Thanks,
 
Old 10-13-2012, 10:18 AM   #6
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Original Poster
Rep: Reputation: Disabled
...See below.

Last edited by amboxer21; 10-13-2012 at 01:49 PM.
 
Old 10-13-2012, 01:49 PM   #7
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Original Poster
Rep: Reputation: Disabled
Updated....

Last edited by amboxer21; 10-20-2012 at 02:53 AM.
 
Old 10-20-2012, 02:53 AM   #8
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Original Poster
Rep: Reputation: Disabled
I figured I would share something I am working on at the moment: a program to download a whole photostream. This is the parser part of it.

The file contains 1,243 words, so I cannot post it here. Here's a link -> http://www.4shared.com/file/UMrpIUld/photostream.html

That's just a file that consists of a bunch of JSON objects with the newlines and tabs removed. I have parsed out all of the http(s) URLs.

The parser ->
# split the JSON on commas, strip quotes/braces/backslashes and leading keys, then print only lines that look like http(s) URLs ending in jpg
awk '{gsub(",", "\n"); print}' photostream | sed -n 's/["{\\]//g;s/^[a-zA-Z0-9]*\://g;/\(^[http].*\:\).*\([jpg]$\)/p'

The file is obtained with an access token that allows you to fetch the JSON; the URLs are then parsed out with the parser above. What do you think? If anyone wants to tighten up the parser, that would be cool, and I welcome any suggestions!
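One possible tightening, untested against the actual feed, so treat it as a sketch: the bracket expressions [http] and [jpg] in the sed above each match a single character rather than the literal words (they happen to work here), and grep -o can pull the URLs out in one step. The tr handles feeds that escape slashes as \/:

tr -d '\\' < photostream | grep -oE 'https?://[^"[:space:]]+\.jpg'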

I keep procrastinating and pushing this off to the side because the next step is difficult. I wrote the downloader, but haven't accounted for the next URL set, which resides at the bottom of the file but is currently parsed out. I would have to separate the photo URLs from the next-set URL, run the parsed URLs through the downloader, and fetch the next photoset before proceeding. It's a bigger pain than it sounds!
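For what it's worth, once the next-set URL can be told apart from the photo URLs, the whole thing is a small loop. A rough sketch, assuming (this is a guess at the feed's structure) that the next-set link is the last non-.jpg URL on each page; the starting URL is a placeholder:

next='https://example.com/photostream?page=1'   # placeholder starting URL
while [ -n "$next" ]; do
    curl -s "$next" | tr -d '\\' > page.json
    # download every photo URL on this page
    grep -oE 'https?://[^"[:space:]]+\.jpg' page.json | while read -r url; do
        wget -q "$url"
    done
    # guess: the next-set link is the last non-.jpg URL in the page; empty when there is none
    next=$(grep -oE 'https?://[^"[:space:]]+' page.json | grep -v '\.jpg$' | tail -n 1)
done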

Maybe someone wants to help write this? We could incorporate some Perl to automate the access-token process, and a GUI with GTK!?

THOUGHTS? ADVICE?
 
  


