deleting extra text within a line... (tearing out remaining hair..!)
Hi all,
I'm trying to tidy up a web-scrape text file so that it shows just the URLs. As below, most lines now contain only the URL, but I'm struggling to work out how to delete extra text within a line without deleting the whole line, i.e. a command that says "delete from this word to the end of the line" or "delete everything between word1 and word2".

//newzealandtrails.com/sites/all/themes/nztrails/css/print.css?nmkrag");
//newzealandtrails.com/sites/all/themes/nztrails/css/tabs.css?nmkrag");
//newzealandtrails.com/sites/default/files/ctools
//newzealandtrails.com/sites/default/files/New20Zealand%20Trails.png"
//newzealandtrails.com/sites/default/files/nztrails-logo_0_0.png" alt="New Zealand Trails" /></a></div>
//newzealandtrails.com/welcome-new-zealand-trails" st_title="" class="st_sharethis_button" displayText="sharethis"></span>
//player.vimeo.com/video/71298207" webkitallowfullscreen="" width="500px"></iframe></div>
//w.sharethis.com/button/buttons.js"></script>

Thanks in advance
That would (normally) be sed, but constructing the regex gets interesting if the text varies.
If you (always) want to lose all the text after the blank, use cut (or awk if there is more processing to be done). Maybe better to start again: have a look at what "lynx -dump ..." produces; it pulls all the URLs out for you in a group. Minimal editing is then needed to get the lot.
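A minimal sketch of the cut/awk suggestion, assuming the scrape is saved as scrape.txt (a hypothetical name; here it is created in a temporary directory from three of the lines in the original post). Note that splitting on the first blank still leaves the trailing quote on lines such as the .png one, and does nothing for lines that contain no blank at all, so some further cleanup would be needed.

```shell
# Work in a temporary directory so no existing files are overwritten.
tmp=$(mktemp -d)

# Sample input: three lines copied from the original post (scrape.txt is an assumed name).
cat > "$tmp/scrape.txt" <<'EOF'
//newzealandtrails.com/sites/default/files/ctools
//newzealandtrails.com/sites/default/files/nztrails-logo_0_0.png" alt="New Zealand Trails" /></a></div>
//w.sharethis.com/button/buttons.js"></script>
EOF

# Keep only the first space-delimited field of each line.
# Lines without a space are passed through unchanged.
cut -d' ' -f1 "$tmp/scrape.txt" > "$tmp/cut_out.txt"

# Same idea with awk, which splits on runs of whitespace by default.
awk '{print $1}' "$tmp/scrape.txt" > "$tmp/awk_out.txt"

cat "$tmp/cut_out.txt"
```

awk becomes the better choice once more processing is needed per line, since further conditions and substitutions can go in the same script.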
Welcome to LQ!
This is an example where the 'sed' command can be used. Inspecting your file suggests that you want to delete from the double-quote character to the end of the line, which can be written as the regular expression ".* Using this with sed's substitute command gives
Code:
sed 's/".*//g' scrape.txt
where scrape.txt is the name of your file. Redirect the output to a file by adding '> output.txt' to the above command.
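A short sketch of the sed approach, which also covers the two general cases asked about in the original post ("delete from this word to the end of the line" and "delete everything between word1 and word2"). The file names and the word1/word2 test string are assumptions made for illustration only. The trailing g flag is left out here because .* already runs to the end of the line, so at most one match per line is possible.

```shell
# Work in a temporary directory so no existing files are overwritten.
tmp=$(mktemp -d)

# Sample input: three lines copied from the original post (scrape.txt is an assumed name).
cat > "$tmp/scrape.txt" <<'EOF'
//newzealandtrails.com/sites/all/themes/nztrails/css/print.css?nmkrag");
//newzealandtrails.com/sites/default/files/ctools
//player.vimeo.com/video/71298207" webkitallowfullscreen="" width="500px"></iframe></div>
EOF

# Delete from the first double quote to the end of each line.
# Lines without a double quote are left untouched.
sed 's/".*//' "$tmp/scrape.txt" > "$tmp/urls.txt"
cat "$tmp/urls.txt"

# General pattern 1: delete from a given word to the end of the line.
to_end=$(printf '%s\n' 'aaa word1 bbb word2 ccc' | sed 's/word1.*//')

# General pattern 2: delete everything from word1 through word2, both words included.
between=$(printf '%s\n' 'aaa word1 bbb word2 ccc' | sed 's/word1.*word2//')

printf '[%s]\n[%s]\n' "$to_end" "$between"
```

Because .* is greedy, the second pattern deletes up to the last word2 on the line if it occurs more than once.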