Old 11-27-2007, 08:11 AM   #1
Registered: Apr 2007
Distribution: CrunchBang 10 Statler
Posts: 106

Rep: Reputation: 16
editing a very large HTML file (or, extracting URLs from a file)

I've worked out the answer, but since I'd already written out this question, I'll post it anyway, in case someone finds it useful.

I have a file made up of concatenated HTML files. I was going to open it and do some sorting and search-and-replace work. (The aim is to make a tab-delimited file of URLs for a Google custom search.)

But now it won't open in gedit; at 600 KB it's too big, I guess. And if I try to open it in OpenOffice, it opens as HTML in a semi-WYSIWYG mode rather than as source, in spite of the .txt suffix.

I can view the source by opening it in Firefox, but when I copy and paste, only part of the file is pasted (how much depends on which program I'm pasting into).

Can I set OpenOffice to open it as text? Or is there another WYSIWYG program that will let me edit a large file like this? (I'd rather not learn a terminal-based editor just to do one simple task.)

Or (and perhaps this is more useful) is there a program or command-line tool that will let me extract just the URLs from the file?

Open in Opera, view source. Choose Edit -> Select all.*

Copy, and paste into OpenOffice. No problem. No idea why, but it works.

* Ctrl-A doesn't work for some reason. Several shortcuts don't work in Opera on Ubuntu; I don't know about other distros.
Old 11-27-2007, 08:35 AM   #2
LQ Guru
Registered: Aug 2003
Distribution: CentOS, OS X
Posts: 5,131

Rep: Reputation: Disabled
Another obvious solution would have been the console: grep, sed and awk do a pretty good job of picking data out of large files, and tying them together in a shell script usually makes it even better. There's no need to open big files in big applications; just feed the text to the basic Unix (in this case Linux) tools and have it done. And if you don't want to fiddle until the result is perfect, they'll at least shrink the data a lot, so you can open the result in gedit or something else and pick out what you wanted.
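As a rough sketch of the "shrink it first" idea, assuming the links sit in ordinary href attributes, something like this keeps only the lines that mention a link. The function name and the file names big.txt and links-only.txt are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines of a file that contain a link,
# so the result is small enough to open in a graphical editor.
shrink_to_links() {
    # -i: match href= or HREF= regardless of case
    grep -i 'href=' "$1"
}

# Example use (file names are assumptions):
#   shrink_to_links big.txt > links-only.txt
```

The output still contains the surrounding markup on each matching line, so it is only a first pass, not a finished URL list.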
Old 11-27-2007, 09:08 AM   #3
LQ Guru
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 671
Is this what you want:
grep -io 'href="[^"]*"' bookmarks.html
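To go from the matched attributes to a plain list of URLs, one possible follow-up is to strip the href="..." wrapper and drop duplicates. This is only a sketch: it assumes double-quoted href attributes and a grep that supports -o (GNU grep does), and the function name extract_urls and the output file name urls.txt are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: list each distinct URL found in double-quoted
# href attributes, one per line, sorted.
extract_urls() {
    # -o prints only the matching part; -i accepts href or HREF
    grep -io 'href="[^"]*"' "$1" |
        # remove everything up to the opening quote, then the closing quote
        sed 's/^[^"]*"//; s/"$//' |
        # sort and remove duplicate URLs
        sort -u
}

# Example use (file names are assumptions):
#   extract_urls bookmarks.html > urls.txt
```

Single-quoted or unquoted href values are not matched by this pattern, so it may miss some links in hand-written HTML.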
Old 11-27-2007, 06:07 PM   #4
LQ Guru
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,395

Rep: Reputation: 2395
To be honest, if you are going to stay with Linux, knowing one CLI editor is going to come in very handy. Sometimes the GUI tool isn't the right approach, e.g. if the GUI breaks.
Personally I use vim, and that definitely won't choke on 600 KB. I've gone into files of tens of MB, possibly larger; it just runs a bit slower as the file gets bigger.

Last edited by chrism01; 11-27-2007 at 06:10 PM.

