LinuxQuestions.org

pulsar1279 04-10-2015 07:07 PM

deleting extra text within a line... (tearing out remaining hair..!)
 
Hi all,
I'm trying to tidy up a web-scrape text file so that it shows just the URLs.
As shown below, I've got most lines down to just the URL, but I'm struggling to work out how to delete the extra text within a line without deleting the whole line.
I.e. a command that says "delete from this word to the end of the line", or "delete everything between word1 and word2".

//newzealandtrails.com/sites/all/themes/nztrails/css/print.css?nmkrag");
//newzealandtrails.com/sites/all/themes/nztrails/css/tabs.css?nmkrag");
//newzealandtrails.com/sites/default/files/ctools
//newzealandtrails.com/sites/default/files/New20Zealand%20Trails.png" //newzealandtrails.com/sites/default/files/nztrails-logo_0_0.png" alt="New Zealand Trails" /></a></div>
//newzealandtrails.com/welcome-new-zealand-trails" st_title="" class="st_sharethis_button" displayText="sharethis"></span>
//player.vimeo.com/video/71298207" webkitallowfullscreen="" width="500px"></iframe></div>
//w.sharethis.com/button/buttons.js"></script>

Thanks in advance

syg00 04-10-2015 07:59 PM

That would (normally) be sed, but constructing the regex gets interesting if the text varies.
If you (always) want to lose all the text after the blank, use cut (or awk if there is more processing to be done).

Maybe better to start again: have a look at what "lynx -dump ..." produces; it yanks all the URLs for you in a group. Minimal editing to get the lot.
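
For reference, a minimal sketch of the cut and awk approach mentioned above, assuming the unwanted text always starts at the first blank. The two sample lines are taken from the scrape in the first post; the variable names are only for illustration.

```shell
# Two sample lines from the scrape: one with trailing junk after a blank,
# one that is already a bare URL.
sample='//player.vimeo.com/video/71298207" webkitallowfullscreen="" width="500px"></iframe></div>
//newzealandtrails.com/sites/default/files/ctools'

# cut: split each line on the blank and keep only the first field.
# Lines that contain no blank are passed through unchanged.
cut_out=$(printf '%s\n' "$sample" | cut -d' ' -f1)

# awk: same idea; $1 is the first whitespace-separated field.
awk_out=$(printf '%s\n' "$sample" | awk '{print $1}')

printf '%s\n' "$cut_out"
printf '%s\n' "$awk_out"
```

Note that splitting on the blank leaves the trailing double quote on the first line, because the quote comes before the blank. Splitting on the quote itself (cut -d'"' -f1) would drop it as well.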

allend 04-10-2015 08:03 PM

Welcome to LQ!

This is an example where the 'sed' command can be used.
Inspecting your file suggests that you want to delete from the double quote character to the end of the line, which can be written as the regular expression ".*
Using this in combination with sed's substitute command (s) gives
Code:

sed 's/".*//g' scrape.txt
Here scrape.txt stands for the name of your scrape file.
It is important to put single quotes around the sed expression to protect it from being interpreted by the shell.
Redirect the output to a file by adding '> output.txt' (any file name other than the input file) to the above command.
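
A small end-to-end sketch of the above, assuming a throwaway directory and three sample lines copied from the first post. The file names scrape.txt and output.txt and the words word1 and word2 are placeholders, not anything fixed by sed.

```shell
# Work in a temporary directory so nothing existing gets overwritten.
tmp=$(mktemp -d)

# Sample lines copied from the scrape in the first post.
cat > "$tmp/scrape.txt" <<'EOF'
//newzealandtrails.com/sites/all/themes/nztrails/css/tabs.css?nmkrag");
//newzealandtrails.com/sites/default/files/ctools
//w.sharethis.com/button/buttons.js"></script>
EOF

# Delete from the first double quote to the end of each line,
# writing the result to a separate output file.
sed 's/".*//g' "$tmp/scrape.txt" > "$tmp/output.txt"
cat "$tmp/output.txt"

# The other case from the question: delete everything between two words,
# keeping the words themselves.
between=$(printf 'keep word1 drop this word2 keep\n' | sed 's/word1.*word2/word1 word2/')
printf '%s\n' "$between"
```

Lines without a double quote pass through unchanged. One thing to watch for: a line holding two URLs separated by a quote keeps only the first one, since everything after the first quote is removed.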


All times are GMT -5.