02-03-2009, 05:49 PM | #1
steve51184 | Member | Registered: Dec 2006 | Posts: 381
remove all links from a file with grep
Hi all. I'm trying to save a page of my site with curl and then use grep to strip out everything except the links to "some-domain.com", so that I end up with a nice clean HTML file containing nothing but links. I can't seem to get it to work right.
Here's what I have so far:
Code:
curl http://mydomain.com/myfile.html > links.html
grep -i "http://some-domain.com/" links.html > links2.html && mv links2.html links.html
The first line saves myfile.html to my server as links.html, and the second line is meant to pull out all the some-domain.com links. However, it includes a lot of HTML along with them, because grep keeps the whole line each link is on rather than just the link. Is there a better way to do line 2, so that only the http://some-domain.com links end up in the file?
Last edited by steve51184; 02-03-2009 at 05:52 PM.

02-03-2009, 06:52 PM | #2
jhwilliams | Senior Member | Registered: Apr 2007 | Location: Portland, OR | Distribution: Debian, Android, LFS | Posts: 1,168
This is a job for sed or awk, but not grep.
I would try something of this form, first:
Code:
cat original_file | sed 's,search_pattern,replacement,g' > clean_file
In particular, you might like something like...
Code:
cat links.html | sed 's,<a href="http://some-domain\.com/[^"]*">\([^<]*\)</a>,\1,g' > new_links.html
Note: I have not tested that regex, so I'm not sure if it's 100% correct.
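For anyone who wants to try that kind of substitution, here is a minimal sketch run against a one-line sample. The sample HTML and the variable names are made up for illustration, and it assumes double-quoted hrefs and link text with no nested tags. It uses `|` as the delimiter instead of a comma; sed accepts other delimiter characters after the `s`, and picking one that does not occur in the pattern avoids escaping.

```shell
# Hypothetical one-line sample; a real page will be messier.
sample='<p>See <a href="http://some-domain.com/a.html">page A</a> and <a href="http://other.org/">other</a>.</p>'

# Drop the <a ...> wrapper around some-domain.com links, keeping the link text (\1).
# [^"]* matches the rest of the URL, [^<]* matches the link text.
unwrapped=$(printf '%s\n' "$sample" |
  sed 's|<a href="http://some-domain\.com/[^"]*">\([^<]*\)</a>|\1|g')

printf '%s\n' "$unwrapped"
```

Links to other domains are left alone, since the pattern only matches hrefs that start with http://some-domain.com/.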

02-03-2009, 06:56 PM | #3
jhwilliams | Senior Member | Registered: Apr 2007 | Location: Portland, OR | Distribution: Debian, Android, LFS | Posts: 1,168
Whoops, I'm sorry. I didn't read your post properly. What I have above will remove links from a file, not compile a list of only those links.
So, okay. grep & sed join forces:
Code:
cat input_file.html | grep -E "<a href=\"domain.com\">" | sed 's,.*\(<a href=\"domain.com.*\">[a-z,A-Z,\ ]*</a>\).*,\1,g' > output_file.html
how about that one?
Again, the regex is long and complicated so it may need tweaking.

02-03-2009, 07:22 PM | #4
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
Thank you very much for the quick reply. I've edited that command to include the domain I want to pull out, but it just outputs a blank file.
Also, correct me if I'm wrong, but won't that command only work if the link is to JUST domain.com? So a link to domain.com/bla.php won't get picked up?
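On the second question, that reading looks right: the grep pattern `<a href="domain.com">` only matches an href that is exactly domain.com, so a link to domain.com/bla.php never reaches sed. Separately, the commas inside `[a-z,A-Z,\ ]` look likely to clash with the comma used as the `s` delimiter (GNU sed, as far as I know, ends the pattern at the first unescaped delimiter, even inside brackets). Below is a rough sketch that allows any path after the domain. It assumes double-quoted hrefs and at most one wanted link per line; the sample page and the temp directory are made up for illustration.

```shell
# Hypothetical sample page, written to a temp dir so nothing real is touched.
workdir=$(mktemp -d)
cat > "$workdir/input_file.html" <<'EOF'
<html><body>
<p>Intro text with no links.</p>
<li><a href="http://domain.com/bla.php">Bla page</a> (new)</li>
<li><a href="http://elsewhere.org/x.html">Elsewhere</a></li>
<li><a href="http://domain.com/">Home</a></li>
</body></html>
EOF

# -n plus the p flag prints only lines where the substitution happened,
# so no separate grep is needed. [^"]* lets the URL carry any path.
sed -n 's|.*\(<a href="http://domain\.com/[^"]*">[^<]*</a>\).*|\1|p' \
  "$workdir/input_file.html" > "$workdir/output_file.html"

cat "$workdir/output_file.html"
```

Only the two domain.com anchors should survive; the surrounding `<li>` markup and the elsewhere.org link are dropped.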

02-03-2009, 09:26 PM | #5
syg00 | LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,337
You don't need the cat or the grep; sed can handle extended regex itself. Personally, if you just want the links themselves, I would go for something simpler like
Code:
sed -rn 's|.*(http://somewhere[^"]*).*|\1|p' links.html > just_the_links.txt
Adjust as required for the address. Presumes you only want one link per line (with several on a line, the greedy leading .* keeps only the last one) ...
Last edited by syg00; 02-03-2009 at 09:29 PM.
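If a line can hold more than one link, one possible alternative is grep's `-o` option, which prints each match on its own output line instead of the whole input line. Note that `-o` is a GNU/BSD grep extension rather than POSIX, and the sample file below is invented; `http://somewhere` stands in for the real address, as in the post above.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: two wanted links share the first line.
cat > "$workdir/links.html" <<'EOF'
<a href="http://somewhere.example/one.html">one</a> <a href="http://somewhere.example/two.html">two</a>
<a href="http://other.example/three.html">three</a>
EOF

# -o prints only the matched text, one match per output line.
# [^"]* stops at the closing quote of the href attribute.
grep -o 'http://somewhere[^"]*' "$workdir/links.html" > "$workdir/just_the_links.txt"

cat "$workdir/just_the_links.txt"
```

Keep in mind that this also picks up the URL wherever it appears as link text, so a page that shows the address in both places would list it twice.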

02-03-2009, 10:17 PM | #6
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
That works PERFECTLY, thank you.
I now have a file that looks like this:
Code:
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
Now I have another question: I need these links formatted. How can I make the file look like this?
Code:
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
I'm guessing this would be best done with PHP, but I don't know PHP, so I really hope I can do it from the command line.
Last edited by steve51184; 02-03-2009 at 10:19 PM.

02-03-2009, 10:34 PM | #7
syg00 | LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,337
If that (or most of it) is already in the input file (likely), then you need something like what @jhwilliams provided. You'd just need to plug the "<br>" onto the end of the right-hand side of the substitution.
It should be pretty straightforward, though that regex will need some tailoring to handle the "." in the address.
You could also (easily) use sed to produce your final output from the file generated by my offering, though there's not much point running over the data twice. Either way, it will be a nice exercise for you to learn regex.
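On the "." point: an unescaped dot in a regex matches any single character, so a pattern like `domain.com` also matches text such as `domainXcom`. A small sketch with made-up input strings shows the difference:

```shell
# Two made-up candidate strings: only the first contains the real host name.
candidates='http://domain.com/a
http://domainXcom/b'

# Unescaped dot: matches any character, so both lines are counted.
loose=$(printf '%s\n' "$candidates" | grep -c 'domain.com')

# Escaped dot: matches a literal "." only.
strict=$(printf '%s\n' "$candidates" | grep -c 'domain\.com')

echo "loose=$loose strict=$strict"
```

The same escaping applies inside a sed pattern, which is why `\.` is the safer spelling when matching a host name.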

02-03-2009, 10:42 PM | #8
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
OK, I have an idea of how to do this, but I need the name of the program(s) (and an example, if you wish, lol). Here goes:
Let's say x = the content of each line...
Can I not put <a href=" at the start of each line, then x, then ">, then x again, then </a><br> at the end?
This will turn:
Code:
http://domain.com/whatever.bla
http://domain.com/whatever.bla
http://domain.com/whatever.bla
into:
Code:
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
<a href="http://domain.com/whatever.bla">http://domain.com/whatever.bla</a><br>
This is the best option I can think of; I just need a push in the right direction.

02-03-2009, 10:52 PM | #9
syg00 | LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,337
As I said, the data is likely to already be largely in that format in your downloaded html file.
However sed will easily allow you to do what you want in one (more) command. The back reference (the "\1") can be used more than once in the right hand side substitution, along with other text as needed.

02-03-2009, 11:03 PM | #10
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
Can I get a little more info on how to do this with sed, please? I've searched Google but can't find anything.

02-03-2009, 11:18 PM | #11
syg00 | LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,337
This tutorial gets recommended a bit here on LQ (that link is for references; go to the top of the page).
It also has a link to his tutorial on regex.
It's old but still OK.
Also try the sed homepage on SourceForge; there are lots of links for info there.

02-03-2009, 11:23 PM | #12
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
any chance of an example please?
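A minimal sketch of the back-reference idea described in the replies above: capture the whole line once and use `\1` twice on the right-hand side. It assumes just_the_links.txt holds one URL per line, as in the earlier output; the sample URLs, the temp directory, and the formatted_links.html name are made up for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-in for the file produced by the earlier sed command.
printf '%s\n' 'http://domain.com/whatever.bla' 'http://domain.com/other.bla' \
  > "$workdir/just_the_links.txt"

# \(.*\) captures the whole line; \1 is then used twice in the replacement.
sed 's|\(.*\)|<a href="\1">\1</a><br>|' "$workdir/just_the_links.txt" \
  > "$workdir/formatted_links.html"

cat "$workdir/formatted_links.html"
```

The same replacement could instead go on the end of the extraction command, for example `sed -rn 's|.*(http://somewhere[^"]*).*|<a href="\1">\1</a><br>|p' links.html`, which would avoid running over the data twice.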

02-03-2009, 11:43 PM | #13
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
OK, I have a PHP script that lists all the files in a folder and wraps each one in a hyperlink. If it can be edited to read from a file instead of a folder, then with a little modification this should work. But again, I know nothing about PHP and would really, really love the help :|
Code:
<?php
// Define the full path to your folder from root
$path = "full/path/to/folder";
// Open the folder
$dir_handle = @opendir($path) or die("Unable to open $path");
// Loop through the files; compare against false so a file named "0" doesn't end the loop
while (false !== ($file = readdir($dir_handle))) {
    // Skip the directory entries and this script itself
    if ($file == "." || $file == ".." || $file == "index.php")
        continue;
    echo "<li><a href=\"$file\">$file</a></li>";
}
// Close
closedir($dir_handle);
?>

02-04-2009, 12:03 AM | #14
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
OK, I made a script, but I'm getting errors:
Code:
<?php $fh=fopen($file,'links.txt'); while(($line=fgets($fh))!= null){echo "<a href=\"$line\">$line</a>"; } fclose($fh) ?>
Quote:
Warning: fgets(): supplied argument is not a valid stream resource in /var/www/test.php on line 1
Warning: fclose(): supplied argument is not a valid stream resource in /var/www/test.php on line 1
|
Last edited by steve51184; 02-04-2009 at 12:05 AM.

02-04-2009, 12:09 AM | #15
steve51184 | Member | Registered: Dec 2006 | Posts: 381 | Original Poster
Got it working:
Code:
<?php $fh=fopen("links.txt",'r'); while(($line=fgets($fh))!==false){$line=trim($line); echo "<a href=\"$line\">$line</a><br>\n"; } fclose($fh); ?>
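Since the original hope was to stay on the command line, here is a rough shell equivalent of that PHP loop. It is only a sketch: it assumes links.txt has one URL per line with no characters that need HTML escaping, and the sample URLs, the temp directory, and the links.html output name are made up for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-in for the real links.txt.
printf '%s\n' 'http://domain.com/whatever.bla' 'http://domain.com/other.bla' \
  > "$workdir/links.txt"

# Read the file line by line; IFS= and -r keep each line exactly as written.
while IFS= read -r line; do
  # Skip blank lines rather than emitting empty links.
  [ -n "$line" ] || continue
  printf '<a href="%s">%s</a><br>\n' "$line" "$line"
done < "$workdir/links.txt" > "$workdir/links.html"

cat "$workdir/links.html"
```

Unlike fgets in PHP, read strips the trailing newline itself, so no extra trimming step is needed here.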
All times are GMT -5.