Linux - Software This forum is for Software issues.
11-27-2007, 08:11 AM   #1
Member
Registered: Apr 2007
Distribution: CrunchBang 10 Statler
Posts: 106
editing a very large HTML file (or, extracting URLs from a file)
I've worked out the answer, but since I'd already written out this question, I'll post it anyway, in case someone finds it useful.
---
PROBLEM:
I have a file made up of concatenated HTML files. I was going to open it and do some sorting and search-and-replace work. (The aim is to make a tab-delimited file of URLs for a Google custom search.)
But now it won't open in gedit - too big at 600 KB, I guess. And if I try to open it in OpenOffice, it opens as HTML, in a semi-WYSIWYG mode rather than as source, in spite of the .txt suffix.
I can view the source by opening it in Firefox, but when I copy and paste, only part of the file is pasted (how much depends on which program I'm pasting into).
Can I set OpenOffice to open it as text? Or is there another WYSIWYG program that will let me edit a large file like this? (I'd rather not learn to use a terminal-based editor just to do one simple task.)
Or (and perhaps this is more useful) is there a program or command-line tool that will let me extract just the URLs from the file?
---
SOLUTION:
Open in Opera, view source. Choose Edit -> Select all.*
Copy, and paste into OpenOffice. No problem. No idea why, but it works.
* Ctrl-A doesn't work for some reason. Several shortcuts don't work in Opera on Ubuntu; I don't know about other distros.

11-27-2007, 08:35 AM   #2
Guru
Registered: Aug 2003
Distribution: CentOS, OS X
Posts: 5,131
Another obvious solution would have been to use the console: grep, sed and awk do a pretty good job of picking data out of large files, and tying them together in a shell script usually makes it even better. There's no need to open big files in big applications; just feed the text to the basic Unix (in this case Linux) tools and have it done. Or, if you don't want to fiddle until it's perfect, they'll at least help you shrink the data a lot, so you can then open the result in gedit or something else and pick out what you wanted.
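As a rough illustration of the "shrink it first" idea, here is a minimal sketch that keeps only the lines mentioning a link, so the result is small enough for a GUI editor. The file names (combined.txt, links-only.txt) and the sample content are made up for the example; substitute your own file.

```shell
#!/bin/sh
# Hypothetical file names, used only for this sketch.
# Build a tiny stand-in for the big concatenated HTML file.
printf '%s\n' \
  '<html><body><p>Some text</p>' \
  '<a href="http://example.org/">Example</a>' \
  '<p>More text</p>' \
  '<A HREF="http://example.com/page">Page</A>' \
  '</body></html>' > combined.txt

# Keep only lines that contain an href attribute.
# -i ignores case, so both href= and HREF= match.
grep -i 'href=' combined.txt > links-only.txt

# Compare line counts before and after filtering.
wc -l < combined.txt
wc -l < links-only.txt
```

This only drops lines without links; a line holding several links, or a link split across lines, would still need more careful handling.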

11-27-2007, 09:08 AM   #3
Moderator
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733
Is this what you want?
Code:
grep -o 'HREF="[^"]*" ' bookmarks.html
HREF="http://en-US.add-ons.mozilla.com/en-US/firefox/bookmarks/"
HREF="http://www.opensuse.org/"
HREF="http://software.opensuse.org/"
HREF="http://software.opensuse.org/search?baseproject=openSUSE:10.3"
HREF="http://news.opensuse.org/?feed=rss2"
HREF="http://www.novell.com/linux/"
HREF="http://www.novell.com/coolsolutions/slp/"
HREF="http://www.novell.com/support/products/suselinux/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/central/"
HREF="http://en-US.fxfeeds.mozilla.com/en-US/firefox/livebookmarks/"
HREF="http://www.twit.tv/node/feed"
HREF="http://feeds.feedburner.com/linuxquestions/latest"
HREF="http://feeds.feedburner.com/linuxquestions/noreplies"
HREF="http://www.scifi.com/scifiwire/rss/index.xml"
HREF="http://feeds.feedburner.com/AllAboutLinux"
HREF="http://en-US.www.mozilla.com/en-US/firefox/help/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/customize/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/community/"
HREF="http://en-US.www.mozilla.com/en-US/firefox/about/"
HREF="http://www.crankygeeks.com/"
HREF="http://www.lostaddress.org/"
HREF="http://polishlinux.org/dragonia/dragonia_eng.pdf"
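To get from output like this to a plain list of URLs, one possible follow-up is to strip the HREF="..." wrapper and remove duplicates. The sketch below is a variation, not the command above: it drops the trailing space from the pattern and adds -i so lowercase href also matches. The file names (sample-bookmarks.html, urls.txt) and the sample lines are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sample input, standing in for a real bookmarks file.
printf '%s\n' \
  '<DT><A HREF="http://www.opensuse.org/" ADD_DATE="1">openSUSE</A>' \
  '<DT><A HREF="http://www.novell.com/linux/" ADD_DATE="2">Novell</A>' \
  '<DT><a href="http://www.opensuse.org/">openSUSE again</a>' \
  > sample-bookmarks.html

# 1. grep -io prints only the matched href="..." text, ignoring case.
# 2. sed removes everything up to the first quote, then the closing quote.
# 3. sort -u sorts the URLs and drops duplicates.
grep -io 'href="[^"]*"' sample-bookmarks.html |
  sed 's/^[^"]*"//; s/"$//' |
  sort -u > urls.txt

cat urls.txt
```

The result has one URL per line, which should be a reasonable starting point for building a tab-delimited file by adding whatever extra columns are needed.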

11-27-2007, 06:07 PM   #4
Guru
Registered: Aug 2004
Location: Brisbane
Distribution: Centos 6.4, Centos 5.9
Posts: 15,272
To be honest, if you are going to stay with Linux, knowing one CLI editor is going to come in very handy. Sometimes the GUI tool isn't the right approach, e.g. if the GUI breaks.
Personally I use vim, and that definitely won't choke on 600K. I've gone into files of tens of MB, possibly larger; it just runs a bit slower as the file gets bigger.
Last edited by chrism01; 11-27-2007 at 06:10 PM.

All times are GMT -5.