LinuxQuestions.org


mierdatuti 12-17-2013 06:21 AM

Extract string between two different characters
 
Hi,
I have this code:
Code:

      <span class="resultsAd">VOLVO C70 2.3</span><br>
                        <span class="resultsAd">NIKON D5000 + accesorios</span><br>
                        <span class="resultsAd">IPhone 16GB Libre Perfecto Estado</span><br>
                        <span class="resultsAd">AUDI A3 Sportback 2.0 TDI Ambition</span><br>
                        <span class="resultsAd">Mitsubishi Colt 1.1 12v Inform 5p.</span><br>
                        <span class="resultsAd">Piso en calle Camarena, 90</span><br>

Can I extract this text with sed?

Code:

VOLVO C70 2.3
NIKON D5000 + accesorios
IPhone 16GB Libre Perfecto Estado
AUDI A3 Sportback 2.0 TDI Ambition
Mitsubishi Colt 1.1 12v Inform 5p.
Piso en calle Camarena, 90

Many thanks

Guttorm 12-17-2013 06:25 AM

http://stackoverflow.com/questions/7...sed-or-similar

druuna 12-17-2013 06:40 AM

Short answer: don't use sed to strip HTML tags; use a dedicated program such as html2text.

Although sed can handle the specific example you gave, it works line by line, so it becomes rather hard (close to impossible) to use when the opening and closing HTML tags aren't on the same line.

This would work for your specific example:
Code:

sed -r 's%.*sAd">(.*)</sp.*%\1%' input
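To illustrate the point about tags that are not on the same line, here is a small sketch. The sample text is hypothetical (the second ad has been split across two lines on purpose), and `-E` is used as the equivalent of `-r` in GNU sed.

```shell
# Hypothetical sample: the second ad's closing tag sits on the next line.
sample='<span class="resultsAd">VOLVO C70 2.3</span><br>
<span class="resultsAd">NIKON D5000
+ accesorios</span><br>'

# sed applies the substitution to one line at a time, so only a line that
# contains both anchors (sAd"> and </sp) is rewritten.
result=$(printf '%s\n' "$sample" | sed -E 's%.*sAd">(.*)</sp.*%\1%')
printf '%s\n' "$result"
```

Only the first line is reduced to its text. The two halves of the split ad pass through unchanged, tags included, which is why a real HTML parser is the safer choice for anything beyond this one-line layout.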

danielbmartin 12-17-2013 07:41 AM

This would work for your specific example:
Code:

cut -d\> -f2- "$InFile" \
| cut -d\< -f1 > "$OutFile"

Daniel B. Martin
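For anyone wondering why two cut calls are needed, this sketch runs the stages separately on one line of the sample. The variable names `line`, `stage1` and `stage2` are made up for the illustration.

```shell
# One line from the original post, indentation included.
line='      <span class="resultsAd">VOLVO C70 2.3</span><br>'

# Stage 1: drop everything up to and including the first ">".
stage1=$(printf '%s\n' "$line" | cut -d'>' -f2-)

# Stage 2: keep only what comes before the first remaining "<".
stage2=$(printf '%s\n' "$stage1" | cut -d'<' -f1)

printf '%s\n%s\n' "$stage1" "$stage2"
```

The first stage still carries the closing tags; the second stage trims them off. Like the sed version, this relies on each ad sitting on a single line with exactly one tag in front of the text.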

danielbmartin 12-17-2013 07:56 AM

This would work for your specific example:
Code:

awk -F "<|>" '{print $3}' "$InFile" > "$OutFile"

Daniel B. Martin
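To see why the text ends up in `$3`, the following sketch prints every field awk produces for one sample line when both `<` and `>` act as separators. The variable names are hypothetical.

```shell
# One line from the original post, indentation included.
line='      <span class="resultsAd">VOLVO C70 2.3</span><br>'

# Print each field with its index so the split is visible.
fields=$(printf '%s\n' "$line" |
  awk -F '<|>' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

Field 1 is the leading indentation, field 2 is the contents of the opening tag, and field 3 is the ad text. If a line carried an extra tag before the span, the text would shift to a later field, so this only suits input shaped exactly like the sample.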

Sydney 12-17-2013 02:42 PM

I would use a program that is already able to extract the text. The lynx text based web browser is excellent for this type of thing.
Code:

syd@computer:~/Desktop$ lynx --dump a.html
  VOLVO C70 2.3
  NIKON D5000 + accesorios
  IPhone 16GB Libre Perfecto Estado
  AUDI A3 Sportback 2.0 TDI Ambition
  Mitsubishi Colt 1.1 12v Inform 5p.
  Piso en calle Camarena, 90


KarlJoe 12-25-2013 12:01 AM

Code:

sed -rne 's/(<.*">)(.*)(<\/.*>)/\2/p'

amboxer21 12-25-2013 12:55 PM

Code:

sed 's/<[^>]*>//g' test

EDIT: add the -i switch to write the changes back to the file in place:
Code:

sed -i 's/<[^>]*>//g' test
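Note that stripping the tags alone keeps the indentation that was in front of each span. A possible variation, offered as a sketch rather than a fix to the command above, adds two more expressions to trim leading and trailing whitespace. The two-line sample is taken from the original post; the variable names are made up.

```shell
# Two indented lines from the original post.
sample='      <span class="resultsAd">VOLVO C70 2.3</span><br>
                        <span class="resultsAd">Piso en calle Camarena, 90</span><br>'

# 1st expression: delete every <...> tag.
# 2nd and 3rd: trim whitespace at the start and end of each line.
cleaned=$(printf '%s\n' "$sample" |
  sed -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//')
printf '%s\n' "$cleaned"
```

As with the other sed answers, `[^>]*` only matches within one line, so a tag broken across lines would not be removed.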

kurumi 12-29-2013 10:59 PM

If you have Ruby, you can use Nokogiri to parse your HTML:

Code:

require 'nokogiri'

# Parse the HTML file ("file" is the input path used in this example).
page = Nokogiri::HTML(File.read("file"))

# Print the text of every <span> whose class attribute is "resultsAd".
page.search("span").each do |el|
  puts el.text if el['class'] == "resultsAd"
end

Result:
Code:

# ruby test.rb
VOLVO C70 2.3
NIKON D5000 + accesorios
IPhone 16GB Libre Perfecto Estado
AUDI A3 Sportback 2.0 TDI Ambition
Mitsubishi Colt 1.1 12v Inform 5p.
Piso en calle Camarena, 90


