LinuxQuestions.org

Chriswaterguy 11-27-2007 08:11 AM

editing a very large HTML file (or, extracting URLs from a file)
 
I've worked out the answer, but since I'd already written out this question, I'll post it anyway, in case someone finds it useful.

---
PROBLEM:
I have a file made up of concatenated HTML files. I was going to open it and do some sorting and search-and-replace work. (The aim is to make a tab-delimited file of URLs for a Google custom search.)

But now it won't open in gedit - too big at 600 KB, I guess. And if I try to open it in OpenOffice, it opens it as HTML, in a semi-WYSIWYG mode rather than as source, in spite of the .txt suffix.

I can view the source by opening it in Firefox, but when I copy and paste, only part of the file is pasted (how much depends on which program I'm pasting into).

Can I set OpenOffice to open it as text? Or is there another WYSIWYG program that will let me edit a large file like this? (I'd rather not learn a terminal-based editor just to do one simple task.)

Or (and perhaps this is more useful), is there a program or command-line tool that will let me extract just the URLs from the file?

---
SOLUTION:
Open in Opera, view source. Choose Edit -> Select all.*

Copy, and paste into OpenOffice. No problem. No idea why, but it works.

* Ctrl-A doesn't work for some reason - several shortcuts don't work in Opera on Ubuntu; I don't know about other distros.

b0uncer 11-27-2007 08:35 AM

Another obvious solution would have been to use the console: grep, sed and awk do a pretty good job of picking data out of large files, and tying it all together in a shell script usually makes it even better. No need to open big files in big applications; just feed the text to the basic Unix (in this case Linux) tools and have it done. Or, if you don't want to fiddle until it's perfect, they'll at least help you shrink the data a lot, so you can then open the result with gedit or something else and pick out what you wanted.
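
A minimal sketch of that kind of pipeline, not a definitive recipe. The function name extract_urls and the file name big.txt in the usage note are made up for illustration, and it assumes every link is written as href="..." with double quotes. The -o option is not in POSIX, but GNU and BSD grep both have it.

```shell
#!/bin/sh
# Sketch: list the distinct link targets in a large HTML file
# without opening the file in an editor.
# extract_urls is a hypothetical helper name, not from the thread.
extract_urls() {
    # -o prints only the matched text; -i matches href and HREF alike.
    grep -oi 'href="[^"]*"' "$1" |
        # Strip the leading href=" and the trailing quote.
        sed -e 's/^[^"]*"//' -e 's/"$//' |
        # Sort and drop duplicate URLs.
        sort -u
}
```

Something like `extract_urls big.txt > urls.txt` should then leave a file small enough to open in gedit.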

jschiwal 11-27-2007 09:08 AM

Is this what you want?
Code:

grep -o 'HREF="[^"]*"' bookmarks.html
HREF="http://en-US.add-ons.mozilla.com/en-US/firefox/bookmarks/"
HREF="http://www.opensuse.org/"
HREF="http://software.opensuse.org/"
HREF="http://software.opensuse.org/search?baseproject=openSUSE:10.3"
HREF="http://news.opensuse.org/?feed=rss2"
HREF="http://www.novell.com/linux/"
HREF="http://www.novell.com/coolsolutions/slp/"
HREF="http://www.novell.com/support/products/suselinux/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/central/"
HREF="http://en-US.fxfeeds.mozilla.com/en-US/firefox/livebookmarks/"
HREF="http://www.twit.tv/node/feed"
HREF="http://feeds.feedburner.com/linuxquestions/latest"
HREF="http://feeds.feedburner.com/linuxquestions/noreplies"
HREF="http://www.scifi.com/scifiwire/rss/index.xml"
HREF="http://feeds.feedburner.com/AllAboutLinux"
HREF="http://en-US.www.mozilla.com/en-US/firefox/help/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/customize/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/community/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/about/"
HREF="http://www.crankygeeks.com/"
HREF="http://www.lostaddress.org/"
HREF="http://polishlinux.org/dragonia/dragonia_eng.pdf"
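
Since the original aim was a tab-delimited file of URLs, here is a rough sketch of how the HREF="..." matches above could be taken one step further. The function name urls_to_tsv, the two-column "URL, then label" layout and the label value are assumptions for illustration only; check what the custom search import actually expects before relying on it.

```shell
#!/bin/sh
# Sketch: turn HREF="..." matches into "URL<TAB>label" lines.
# urls_to_tsv and the label column are hypothetical, not from the thread.
urls_to_tsv() {
    # $1 = HTML file, $2 = text to put in the second column
    grep -o 'HREF="[^"]*"' "$1" |
        # Split each match on double quotes; field 2 is the bare URL.
        awk -F'"' -v label="$2" '{ printf "%s\t%s\n", $2, label }'
}
```

For example, `urls_to_tsv bookmarks.html mylabel > urls.tsv`. Add -i to the grep call if the file uses lowercase href.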


chrism01 11-27-2007 06:07 PM

To be honest, if you are going to stay with Linux, knowing one CLI editor is going to come in very handy. Sometimes the GUI tool isn't the right approach, e.g. if the GUI breaks.
Personally I use vim, and that definitely won't choke on 600K. I've gone into files of tens of MB, possibly larger; it just runs a bit slower as the file gets bigger.


All times are GMT -5.