I want to pluck the title out of articles in 'Annals of Internal Medicine'. The titles appear in lines such as:
Quote:
<span id="ctl00_scm6MainContent_lblArticleTitle">What Can Medical Education Learn From Facebook and Netflix?<span class="alternateTitle">What Can Medical Education Learn From Facebook and Netflix?</span></span>
|
They recently introduced the alternate title, often a repeat of the
'main' title, sometimes a shortened version. Before that I could grep
on ctl00_scm6MainContent_lblArticleTitle and strip the html tags.
To address this complication I changed the algorithm to:
Quote:
Title=`grep ctl00_scm6MainContent_lblArticleTitle $file | sed 's/<span class="alternateTitle">/\
/g' | head -1 | html-strip`
|
which didn't work: it still returned both titles.
My problem is that
Quote:
grep ctl00_scm6MainContent_lblArticleTitle $file | sed 's/<span class="alternateTitle">/\
/g' | head -1 | html-strip
|
works outside of backticks (``). The 'sed' command doesn't behave the same way inside them.
I've missed something basic. I've used this maneuver before.
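
For reference, here is a minimal sketch of one way the same extraction might be written without relying on a backslash-newline inside backticks. My understanding is that at least some shells (bash among them) drop a backslash-newline pair inside `...` before sed ever sees it, even within single quotes, which would turn the substitution into a plain deletion of the tag and leave both titles on one line. The sketch assumes $(...) is available, cuts the line at the alternate-title span instead of splitting it, and uses a hypothetical strip_tags function as a stand-in for html-strip; the sample file is made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical sample file standing in for a downloaded article page.
file=$(mktemp)
cat > "$file" <<'EOF'
<html><body>
<span id="ctl00_scm6MainContent_lblArticleTitle">What Can Medical Education Learn From Facebook and Netflix?<span class="alternateTitle">What Can Medical Education Learn From Facebook and Netflix?</span></span>
</body></html>
EOF

# Stand-in for html-strip (an assumption, not the real tool):
# delete anything that looks like an HTML tag.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# $(...) instead of backticks, and no newline in the sed script:
# drop the alternate-title span and everything after it on the line.
Title=$(grep 'ctl00_scm6MainContent_lblArticleTitle' "$file" |
        sed 's/<span class="alternateTitle">.*//' |
        head -n 1 |
        strip_tags)

rm -f "$file"
printf '%s\n' "$Title"
```

With the sample line above this prints only the main title. If the backticks have to stay, the same sed expression should still work there, since it contains no backslash for the shell to reinterpret.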