LinuxQuestions.org - extract string between two different characters

Hi
I have this code:

Code:

      <span class="resultsAd">VOLVO C70 2.3</span><br>

                        <span class="resultsAd">NIKON D5000 + accesorios</span><br>

                        <span class="resultsAd">IPhone 16GB Libre Perfecto Estado</span><br>

                        <span class="resultsAd">AUDI A3 Sportback 2.0 TDI Ambition</span><br>

                        <span class="resultsAd">Mitsubishi Colt 1.1 12v Inform 5p.</span><br>

                        <span class="resultsAd">Piso en calle Camarena, 90</span><br>

Can I extract this text from it with sed?

Code:

VOLVO C70 2.3

NIKON D5000 + accesorios

IPhone 16GB Libre Perfecto Estado

AUDI A3 Sportback 2.0 TDI Ambition

Mitsubishi Colt 1.1 12v Inform 5p.

Piso en calle Camarena, 90

Many thanks

Short answer: don't use sed to strip HTML tags; use a dedicated program such as html2text.

sed can handle the specific example you gave, but it becomes very hard (close to impossible) once an element's opening and closing tags are not on the same line.

This would work for your specific example:

Code:

sed -r 's%.*sAd">(.*)</sp.*%\1%' input
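To make the multi-line caveat concrete, here is a small sketch. The sample input is made up (the second span is deliberately split across two lines) and the function names are only for illustration. `sed -E` is used as the portable spelling of GNU sed's `-r`. The workaround assumes every ad ends with `<br>` and that the original line breaks inside the text do not matter, so treat it as a demonstration rather than a general fix.

```shell
# Hypothetical sample: the second span is split across two lines.
sample() {
  printf '%s\n' \
    '<span class="resultsAd">VOLVO C70 2.3</span><br>' \
    '<span class="resultsAd">NIKON D5000' \
    '+ accesorios</span><br>'
}

# Line-by-line substitution: only the first line matches; the two
# halves of the split span pass through with their tags intact.
per_line() {
  sample | sed -E 's%.*sAd">(.*)</sp.*%\1%'
}

# Workaround sketch: join everything into one line, start a new line
# after each <br>, then print only the lines where the pattern matches.
joined() {
  sample | tr '\n' ' ' \
    | sed 's%<br>%&\
%g' \
    | sed -nE 's%.*sAd">(.*)</sp.*%\1%p'
}
```

Calling `per_line` shows the leftover tags, while `joined` prints one ad per line. A real HTML parser is still the safer choice for anything less regular than this.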

This would work for your specific example:

Code:

cut -d'>' -f2- "$InFile" \
  | cut -d'<' -f1 > "$OutFile"

Daniel B. Martin

This would work for your specific example:

Code:

awk -F "<|>" '{print $3}' "$InFile" >"$OutFile"
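In case it helps to see why the awk approach picks field 3: with `<` and `>` as separators, `$1` is the leading whitespace, `$2` is the contents of the opening tag, and `$3` is the text. The sketch below is a slight variation that prints only lines carrying the `resultsAd` class, so blank lines and unrelated tags are skipped. The function name `extract_ads` is made up for this example, and it still assumes one complete span per line.

```shell
# Sketch: print the span text only for lines that contain the
# resultsAd class. Reads the named files, or stdin if none are given.
extract_ads() {
  # Fields when splitting on "<" or ">":
  #   $1 = leading whitespace, $2 = opening tag contents, $3 = text
  awk -F '[<>]' '/class="resultsAd"/ { print $3 }' "$@"
}
```

It can be used as `extract_ads "$InFile" > "$OutFile"`.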

Daniel B. Martin

I would use a program that is already able to extract the text. The lynx text-based web browser is excellent for this type of thing.

Code:

syd@computer:~/Desktop$ lynx --dump a.html

  VOLVO C70 2.3

  NIKON D5000 + accesorios

  IPhone 16GB Libre Perfecto Estado

  AUDI A3 Sportback 2.0 TDI Ambition

  Mitsubishi Colt 1.1 12v Inform 5p.

  Piso en calle Camarena, 90

Code:

sed -rne 's/(<.*">)(.*)(<\/.*>)/\2/p'

Code:

sed 's/<[^>]*>//g' test

EDIT: add the -i switch to edit the file in place (GNU sed syntax), which makes the changes permanent:

Code:

sed -i 's/<[^>]*>//g' test
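One thing to note about stripping tags this way: the indentation in front of each span and any blank lines stay in the output. A possible extension, sketched below, trims the whitespace and drops empty lines in the same sed call. The function name `strip_tags` is only for illustration, and like the command above it assumes no tag is split across lines.

```shell
# Sketch: remove tags, trim surrounding whitespace, drop empty lines.
# Reads the named files, or stdin if none are given.
strip_tags() {
  sed -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d' "$@"
}
```

For example, `strip_tags test` prints one ad per line without leading spaces.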

If you have Ruby, you can use Nokogiri to parse your HTML:

Code:

require 'nokogiri'

# Parse the HTML file ("file" is the input path used in this example).
page = File.open("file") { |f| Nokogiri::HTML(f) }

# Print the text of every <span> whose class attribute is "resultsAd".
page.search("span").each do |el|
  if el['class'] == "resultsAd"
    puts el.text
  end
end

Result:

Code:

# ruby test.rb 

VOLVO C70 2.3

NIKON D5000 + accesorios

IPhone 16GB Libre Perfecto Estado

AUDI A3 Sportback 2.0 TDI Ambition

Mitsubishi Colt 1.1 12v Inform 5p.

Piso en calle Camarena, 90