Old 10-11-2005, 06:00 PM   #1
agtlewis
Member
 
Registered: Oct 2005
Distribution: Fedora 4
Posts: 40

Rep: Reputation: 15
How to parse a 2GB file with Linux commands


Hello,

I have a 2 GB RDF feed that contains lots of URLs. Is there any way to extract all of the URLs from the command line?

At the very least I would like a way to delete every line in the file that does not contain a URL; then I can write a PHP script to extract the URLs and insert them into a MySQL database.

Any help/advice appreciated.

Thanks.
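For reference, a minimal shell sketch of the "keep only lines with a URL" step, assuming GNU grep and a hypothetical input file named content.rdf (the real filename will differ):

Code:
# keep only the lines that contain an http URL
grep 'http://' content.rdf > lines_with_urls.txt

# or skip the intermediate step and pull out just the URLs themselves
grep -o 'http://[^"]*' content.rdf > url.txt

The resulting url.txt could then be bulk-loaded into MySQL (for example with LOAD DATA LOCAL INFILE) rather than inserted row by row from PHP, if that fits the schema.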
 
Old 10-11-2005, 06:33 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Not knowing the internal structure of the RDF thingamabob, I can't give you a turn-key answer. Chances are sed or awk are the right tools for the job, though.


Cheers,
Tink
 
Old 10-11-2005, 06:43 PM   #3
agtlewis
Member
 
Registered: Oct 2005
Distribution: Fedora 4
Posts: 40

Original Poster
Rep: Reputation: 15
Hi,

Here is a snippet of the document. I will research the commands you noted.

Code:
<Topic r:id="Top/Arts/Movies/Titles/1">
  <catid>54803</catid>
</Topic>

<Topic r:id="Top/Arts/Movies/Titles/1/10_Rillington_Place">
  <catid>205108</catid>
  <link r:resource="http://www.britishhorrorfilms.co.uk/rillington.shtml"/>
  <link r:resource="http://www.shoestring.org/mmi_revs/10-rillington-place.html"/>
  <link r:resource="http://www.tvguide.com/movies/database/ShowMovie.asp?MI=22983"/>
  <link r:resource="http://us.imdb.com/title/tt0066730/"/>
</Topic>

<ExternalPage about="http://www.britishhorrorfilms.co.uk/rillington.shtml">
  <d:Title>British Horror Films: 10 Rillington Place</d:Title>
  <d:Description>Review which looks at plot especially the shocking features of it.</d:Description>
  <topic>Top/Arts/Movies/Titles/1/10_Rillington_Place</topic>
</ExternalPage>
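Based on that snippet, every URL shows up as a quoted attribute value (r:resource="..." or about="..."), so a sed sketch along these lines might also work, assuming the file is called rdf.txt and no line carries more than one URL:

Code:
# print whatever sits between ="http and the closing quote, one URL per line
sed -n 's/.*="\(http[^"]*\)".*/\1/p' rdf.txt > url.txt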
 
Old 10-11-2005, 06:56 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
If that's all there is to it, something like
Code:
awk -F"\"" '$2 ~ "http" {print $2 }' rdf.txt > url.txt
may be all you need.

[edit]
If it works, can you please time the execution and post
back how long it took? :}
[/edit]
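A quick breakdown of that one-liner, plus a timed run (the second variant is an assumption for the case where a single line carries several quoted URLs, which the posted snippet doesn't):

Code:
# the field separator is a double quote, so the first quoted value on a line lands in $2;
# the $2 ~ "http" test prints it only when it looks like a URL
time awk -F'"' '$2 ~ "http" {print $2}' rdf.txt > url.txt

# variant that checks every quoted field on the line
time awk -F'"' '{for (i = 2; i <= NF; i += 2) if ($i ~ /^http/) print $i}' rdf.txt > url.txt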


Cheers,
Tink

Last edited by Tinkster; 10-11-2005 at 06:58 PM.
 
Old 10-11-2005, 07:53 PM   #5
agtlewis
Member
 
Registered: Oct 2005
Distribution: Fedora 4
Posts: 40

Original Poster
Rep: Reputation: 15
Tinkster,

Thank you very much!! The command you gave me was able to parse the file and correctly extract all of the URLs.

As for the time involved, I'm not sure if that thing is written in pure assembly or what, but it only took 4 minutes and 45 seconds on a Pentium 4 running at 2.0 GHz with 512 MB of RAM.
 
Old 10-11-2005, 08:06 PM   #6
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Heh - glad I could help.

awk is written in C, and if you find perl overkill (or the task isn't complex enough for a perl script), awk is an excellent (and very fast and efficient) tool for the job. The speed doesn't surprise me. :}


Cheers,
Tink
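For comparison, a rough perl equivalent of the awk one-liner (an illustration only, not something benchmarked in this thread) would be:

Code:
# print every quoted http URL found on each line
perl -ne 'print "$1\n" while /"(http[^"]*)"/g' rdf.txt > url.txt

Either way it's a single streaming pass over the file, which is why even a 2GB input finishes in minutes.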
 
  

