LinuxQuestions.org
Old 02-03-2009, 05:49 PM   #1
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Rep: Reputation: 30
remove all links from a file with grep


Hi all. I'm trying to save a page of my site via curl and then strip out everything except the links to "some-domain.com", using grep, so that I end up with a nice clean HTML file containing nothing but links. I can't seem to get it to work right.

Here's what I have so far:

Code:
curl http://mydomain.com/myfile.html > links.html
grep -i "http://some-domain.com/" links.html > links2.html && mv links2.html links.html
The first line saves myfile.html to my server as links.html, and the second line is meant to pull out all the some-domain.com links. But it includes a lot of HTML as well, because grep isn't extracting the links themselves, it's keeping the whole lines the links are on. Is there a better way to do line 2 so that I can extract just the http://some-domain.com links to a file?
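For reference, grep can print only the matched text rather than the whole line if it is given the -o option. The following is a minimal sketch, assuming the links appear inside double-quoted href attributes; the function name extract_links and the sample HTML are made up for illustration.

```shell
# Hypothetical helper: print only the matching URLs, one per line.
# -o prints just the matched text, -i ignores case (as in the original command).
# With no file argument, grep reads standard input.
extract_links() {
  grep -io 'http://some-domain\.com/[^"]*' "$@"
}

# Sample input, invented for this example:
printf '%s\n' '<p><a href="http://some-domain.com/a.html">A</a> <a href="http://other.com/x">X</a></p>' | extract_links
```

The [^"]* part stops the match at the closing quote of the href, so surrounding HTML is left out.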

Last edited by steve51184; 02-03-2009 at 05:52 PM.
 
Old 02-03-2009, 06:52 PM   #2
jhwilliams
Senior Member
 
Registered: Apr 2007
Location: Portland, OR
Distribution: Debian, Android, LFS
Posts: 1,168

Rep: Reputation: 211
This is a job for sed or awk, but not grep.

I would try something of this form, first:
Code:
cat original_file | sed 's,search_pattern,replacement,g' > clean_file
In particular, you might like something like...

Code:
cat links.html | sed 's,<a href="http://some-domain.com/">\([a-z,A-Z,\ ]*\)</a>,\1,g' > new_links.html
Note: I have not tested that regex, so I'm not sure if it's 100% correct.
 
Old 02-03-2009, 06:56 PM   #3
jhwilliams
Senior Member
 
Registered: Apr 2007
Location: Portland, OR
Distribution: Debian, Android, LFS
Posts: 1,168

Rep: Reputation: 211
Woops. I'm sorry. I didn't read your post right. What I have above will remove links from a file -- not compile a list of only those links.

So, okay. grep & sed join forces:

Code:
cat input_file.html | grep -E "<a href=\"domain.com\">" | sed 's,.*\(<a href=\"domain.com.*\">[a-z,A-Z,\ ]*</a>\).*,\1,g' > output_file.html
how about that one?

Again, the regex is long and complicated so it may need tweaking.
 
Old 02-03-2009, 07:22 PM   #4
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by jhwilliams
Woops. I'm sorry. I didn't read your post right. What I have above will remove links from a file -- not compile a list of only those links.

So, okay. grep & sed join forces:

Code:
cat input_file.html | grep -E "<a href=\"domain.com\">" | sed 's,.*\(<a href=\"domain.com.*\">[a-z,A-Z,\ ]*</a>\).*,\1,g' > output_file.html
how about that one?

Again, the regex is long and complicated so it may need tweaking.
Thank you very much for the quick reply. I've edited that command to include the domain I want to extract, but it just outputs a blank file?

Also, correct me if I'm wrong, but that command will only work if the link is to JUST domain.com, right? So a link to domain.com/bla.php won't get extracted?
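One likely cause of the blank file, assuming the hrefs in the page carry a scheme and a path: the grep pattern in the quoted command only matches an href that is exactly "domain.com", so nothing ever reaches sed. A small sketch to check this, where count_matches and the sample line are hypothetical:

```shell
# Hypothetical helper: count the input lines that match pattern $1.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_matches() {
  grep -c -e "$1" || true
}

sample='<a href="http://domain.com/bla.php">bla</a>'

# Pattern as posted: the href must be exactly "domain.com".
printf '%s\n' "$sample" | count_matches '<a href="domain.com">'

# Looser pattern: scheme included, any path allowed up to the closing quote.
printf '%s\n' "$sample" | count_matches '<a href="http://domain\.com[^"]*">'
```

The first count is 0 and the second is 1 for this sample, which is consistent with the guess that domain.com/bla.php is not picked up by the posted pattern.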
 
Old 02-03-2009, 09:26 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,337

Rep: Reputation: 4176
You don't need the cat nor the grep - sed can do regex extended. Personally if you just want the links themselves, I would go for something simpler like
Code:
 sed -rn 's|.*(http://somewhere[^"]*).*|\1|p' links.html > just_the_links.txt
Adjust as required for the address. Presumes the address is first on the line, and you only want one per line ...
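On the "one per line" caveat: because the leading .* is greedy, a line holding several matching links yields only the last one. A rough workaround, assuming every link sits in a double-quoted href, is to break the input at each ">" first so that no line holds more than one tag. The name links_per_tag and the sample input are made up; -E is used here as the more portable spelling of -r.

```shell
# Hypothetical helper: one URL per matching href, even when several
# links share a line in the input.
links_per_tag() {
  # Turn every ">" into a newline so each tag sits on its own line,
  # then pull the URL out of any href that points at the target host.
  tr '>' '\n' | sed -En 's|.*href="(http://somewhere[^"]*)".*|\1|p'
}

printf '%s\n' '<a href="http://somewhere/a">A</a> <a href="http://somewhere/b">B</a>' | links_per_tag
```

Anchoring the match on href="..." also keeps a URL that appears as link text from being picked up a second time.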

Last edited by syg00; 02-03-2009 at 09:29 PM.
 
Old 02-03-2009, 10:17 PM   #6
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by syg00
You don't need the cat nor the grep - sed can do regex extended. Personally if you just want the links themselves, I would go for something simpler like
Code:
 sed -rn 's|.*(http://somewhere[^"]*).*|\1|p' links.html > just_the_links.txt
Adjust as required for the address. Presumes the address is first on the line, and you only want one per line ...
That works PERFECTLY, thank you!

I now have a file that looks like this:

Code:
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
Now I have another question: I need these links formatted. How can I make the file look like this?

Code:
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
I'm guessing this would be best done with PHP, but I don't know PHP, so I really hope I can do it from the command line.

Last edited by steve51184; 02-03-2009 at 10:19 PM.
 
Old 02-03-2009, 10:34 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,337

Rep: Reputation: 4176
If that (or most of it) is in the input file (likely), then you need something like what @jhwilliams provided. You'd just need to plug the "<br>" onto the end of the right-hand side of the substitution.
Should be pretty straightforward, though that regex will need some tailoring to handle the "." in the address.

You could also (easily) use sed to achieve your final output using the file generated from my offering - not much point running over the data twice though. Either way, it will be a nice exercise for you to learn regex.
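As a rough sketch of the single-pass idea: the extraction command from earlier in the thread can emit the finished anchor directly by reusing the captured URL in the replacement. The format_links name and the sample line are made up for illustration, and it still keeps only one link per input line.

```shell
# Hypothetical helper: go from raw HTML to formatted links in one sed call.
format_links() {
  # -n plus the p flag prints only lines where the substitution succeeded.
  # The dot in domain\.com is escaped so it matches a literal ".".
  sed -En 's|.*(http://domain\.com[^"]*).*|<a href="\1">\1</a><br>|p' "$@"
}

printf '%s\n' '<li><a href="http://domain.com/whatever.bla">some text</a></li>' | format_links
```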
 
Old 02-03-2009, 10:42 PM   #8
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
OK, I have an idea of how to do this, but I need the name of the program(s) (and an example if you wish, lol). Here goes:

Let's say x = the content of each line...

Can I not put <a href=" at the start of each line, then x, then ">, then x again, then </a><br> at the end?

this will turn:
Code:
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
into:
Code:
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
This will be the best option as far as I can tell. I just need a push in the right direction.
 
Old 02-03-2009, 10:52 PM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,337

Rep: Reputation: 4176
As I said, the data is likely to already be largely in that format in your downloaded html file.
However sed will easily allow you to do what you want in one (more) command. The back reference (the "\1") can be used more than once in the right hand side substitution, along with other text as needed.
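A minimal sketch of that back-reference idea, assuming the input is a file with one bare URL per line (such as just_the_links.txt from earlier in the thread). The wrap_links name is hypothetical.

```shell
# Hypothetical helper: wrap each input line in an anchor tag.
wrap_links() {
  # \(.*\) captures the whole line; \1 is then used twice on the
  # right-hand side. "|" is the delimiter because "</a>" contains a "/".
  sed 's|^\(.*\)$|<a href="\1">\1</a><br>|' "$@"
}

printf '%s\n' 'http://domain.com/whatever.bla' | wrap_links
```

With a file, the call would look like wrap_links just_the_links.txt > links.html. Since the whole line is matched, sed's & could stand in for \1 here as well.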
 
Old 02-03-2009, 11:03 PM   #10
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
Can I get a little more info on how to do this with sed, please? I've searched Google but can't find anything.
 
Old 02-03-2009, 11:18 PM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,337

Rep: Reputation: 4176
this tutorial gets recommended a bit here on LQ (that link is for references; go to the top of the page).
Also has a link to his tute on regex.
It's old but still o.k.

Also try the sed homepage on sourceforge - lots of links for info there.
 
Old 02-03-2009, 11:23 PM   #12
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
any chance of an example please?
 
Old 02-03-2009, 11:43 PM   #13
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
OK, I have a PHP script that lists all the files within a folder and displays each one inside a hyperlink. If it can be edited to open a file instead of a folder, then with a little modification it should work. But again, I know nothing about PHP and would really, really love the help :|

Code:
<?

// Define the full path to your folder from root
$path = "full/path/to/folder";

// Open the folder
$dir_handle = @opendir($path) or die("Unable to open $path");

// Loop through the files
while ($file = readdir($dir_handle)) {

if($file == "." || $file == ".." || $file == "index.php" )

continue;
echo "<li><a href=\"$file\">$file</a></li>";

}

// Close
closedir($dir_handle);

?>
 
Old 02-04-2009, 12:03 AM   #14
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
ok made a script but i'm getting errors:

Code:
<?php $fh=fopen($file,'links.txt'); while(($line=fgets($fh))!= null){echo "<a href=\"$line\">$line</a>"; } fclose($fh) ?>
Quote:

Warning: fgets(): supplied argument is not a valid stream resource in /var/www/test.php on line 1

Warning: fclose(): supplied argument is not a valid stream resource in /var/www/test.php on line 1

Last edited by steve51184; 02-04-2009 at 12:05 AM.
 
Old 02-04-2009, 12:09 AM   #15
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
got it working

Quote:
<?php $fh=fopen("links.txt",'r'); while(($line=fgets($fh))!= null){echo "<a href=\"$line\">$line</a><br>"; } fclose($fh) ?>
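One thing to watch in the PHP version: fgets() returns each line with its trailing newline, so the newline ends up inside the href value. For comparison, here is a command-line sketch of the same loop that avoids this. It assumes links.txt holds one URL per line; the function name links_to_html is made up.

```shell
# Hypothetical shell counterpart of the PHP loop above.
links_to_html() {
  # $1 = file with one URL per line (links.txt in this thread).
  # "|| [ -n "$url" ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue   # skip blank lines
    printf '<a href="%s">%s</a><br>\n' "$url" "$url"
  done < "$1"
}

# Usage (not run here): links_to_html links.txt > links.html
```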
 
  


All times are GMT -5.