Keep specific text from a line in a bash script

bmxakias · 01-17-2020, 06:23 AM

Hello

I have a file (file.html) that contains a few lines following a pattern like:

Code:

<td width="1%" nowrap="nowrap" align="right"><a href="/word-something/saf6059eb20/some-text-2015-web-710z-yts-lt" title="Super duper text (2015) [WEBfor] [532a] [YTR LT]"><img src="//images.some.info/dl_icon.png" alt="get..." width="28" height="21" border="0" align="absmiddle"></a></td>
<td width="1%" nowrap="nowrap" align="right"><a href="/word-something/s1a148a0a69/hello-of-a-blabla-1999-bit" title="Other nice text tha i will like to keep (5487) TREUsi"><img src="//images.some.info/dl_icon.png" alt="get..." width="28" height="21" border="0" align="absmiddle"></a></td>
<td width="1%" nowrap="nowrap" align="right"><a href="/word-something/s68ee3a70d3/bye-in-all-third-time-2067-5903f-amzn-web-ty-ddp2-1-h-245-ntu" title="A good one yes 1968 8731w AMDR WEB-TE DDU6 1 K 131-NTE"><img src="//images.some.info/dl_icon.png" alt="get..." width="28" height="21" border="0" align="absmiddle"></a></td>

I would like to clean that file and keep only the titles like:

Quote:

Super duper text (2015) [WEBfor] [532a] [YTR LT]
Other nice text tha i will like to keep (5487) TREUsi
A good one yes 1968 8731w AMDR WEB-TE DDU6 1 K 131-NTE

either in the same file or written to a new file.

Thank you

syg00 · 01-17-2020, 06:30 AM

Well-formed data (really well-formed data over every line) can be parsed simply with sed. Otherwise you might need something more specific, pup maybe.
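A minimal sketch of the sed approach, assuming every line of interest holds exactly one title="..." attribute and no title contains a literal double quote. The helper name extract_titles is hypothetical, chosen only for illustration:

```shell
# Sketch only: assumes one title="..." per line and no embedded double quotes.
# extract_titles is a hypothetical helper: HTML on stdin, one title per line out.
extract_titles() {
    # -n suppresses default printing; the trailing p prints only the lines
    # where the substitution matched, so lines without a title are dropped.
    sed -n 's/.*title="\([^"]*\)".*/\1/p'
}

# Usage against the file from the question:
#   extract_titles < file.html > titles.txt
printf '%s\n' '<td><a href="/x" title="Super duper text (2015) [WEBfor]"><img src="i.png"></a></td>' | extract_titles
```

Writing to a new file and then renaming it is safer than editing file.html in place, since a wrong pattern would otherwise destroy the input.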

Turbocapitalist · 01-17-2020, 06:35 AM

If that is HTML or XHTML then you'll need a proper parser to manage that task. sed is not the right language for that.

XPath is one possibility. There are a lot of easy-to-find XPath utilities out there, and you could use an XPath expression like either of these, depending on the larger context within the document:

Code:

'//td/a/@title'

'//tr/td[1]/a/@title'

Show a little more of the structure from that part of the XHTML document so we can see the context and provide a more precise answer.
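As one possible illustration, the first expression above could be run with xmllint from libxml2, assuming that tool is installed. This is a sketch, not the only XPath utility that would work, and xpath_titles is a hypothetical helper name. It asks for the match count first and then prints each attribute value with string(), which avoids depending on how a given xmllint version formats attribute nodes:

```shell
# Sketch only: assumes xmllint (part of libxml2) is available on PATH.
# xpath_titles is a hypothetical helper; the parentheses run the body in a
# subshell so the variables n and i do not leak into the caller.
xpath_titles() (
    file=$1
    # count() reports how many title attributes the expression matches.
    n=$(xmllint --html --xpath 'count(//td/a/@title)' "$file" 2>/dev/null)
    i=1
    while [ "$i" -le "${n:-0}" ]; do
        # string() returns the value of the i-th matching attribute.
        xmllint --html --xpath "string((//td/a/@title)[$i])" "$file" 2>/dev/null
        i=$((i + 1))
    done
)

# Usage against the file from the question:
#   xpath_titles file.html > titles.txt
```

Because the HTML parser reads the attribute values, entities such as quotes inside a title come out decoded.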

individual · 01-17-2020, 06:54 AM

Quote:

Originally Posted by syg00

Well-formed data (really well-formed data over every line) can be parsed simply with sed. Otherwise you might need something more specific, pup maybe.

I'm glad you suggested pup!

Code:

pup 'a attr{title}' < file.html

boughtonp · 01-17-2020, 09:16 AM

As has been mentioned, the correct way to read text from HTML is with an HTML parser.

But a very quick and dirty solution that might be good enough for a one-off is:

Code:

$ grep -Po '(?<=title=")[^"]+' file.html

Apart from breaking on malformed HTML, the other downside is that HTML entities are not decoded (so a well-formed title containing quotes will show each one as &quot;, for example).
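If undecoded entities matter, a rough workaround is to pipe the extracted titles through a small filter. The sketch below handles only five common entities and is an assumption about what the titles contain, not a complete decoder; decode_entities is a hypothetical helper name:

```shell
# Sketch only: decodes five common entities, nothing else.
# decode_entities is a hypothetical helper: text on stdin, decoded text out.
decode_entities() {
    # &amp; is handled last so that "&amp;quot;" becomes "&quot;" and is
    # not decoded a second time. In the replacement, \& is a literal "&".
    sed -e 's/&quot;/"/g' \
        -e "s/&#39;/'/g" \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g'
}

# Usage with the grep one-liner (GNU grep with -P support assumed):
#   grep -Po '(?<=title=")[^"]+' file.html | decode_entities
printf '%s\n' 'Tom &amp; Jerry &quot;Special&quot;' | decode_entities
```

Numeric and less common named entities pass through unchanged, which is another reason a real parser is the better choice for anything beyond a one-off.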

If this isn't a one-off then you should explain the general task you're trying to achieve, because there's probably a simpler solution. (Perhaps involving the site's Atom/RSS feed, for example.)

bmxakias · 01-17-2020, 10:57 AM

Great, thank you!