[SOLVED] awk anyone?

gregors · 02-17-2021, 12:20 AM

Hi there!

I want to make a news page (HTML with lots of pictures) easier to read. Since my cat|tr|grep chain didn't get me what I want, I think awk might be able to do the job.
The problem is that I don't want just single lines but combinations of related lines.

To make a long story short: I want to look for a line with

<span class="teaser__headline">++ Japan beginnt mit Impfprogramm ++</span>

and take its (visible) text, combining it with the text from the following line that starts with

<p class="teaser__shorttext">Fünf Monate ...

So I need to look for "teaser__headline", take the text from that element, and follow it with the text from the line that contains "teaser__shorttext".

The result should look similar to

++ Japan beginnt mit Impfprogramm ++
Fünf Monate ...

The next (?) step would be to see how things are linked with <a> tags and use them to make my result clickable ...

If there's a better forum for my question, please let me know. And if, like me, you don't know awk: sorry to bother you ...

TIA

Gregor

ondoho · 02-17-2021, 12:54 AM

First of all, I'd check whether that site has an RSS/Atom feed that might already contain a condensed version of the news.

Otherwise, I wouldn't use awk/grep etc. but a tool that is designed to deal with HTML (and similar code like XML).
Two such tools are xmllint (part of libxml2) and xmlstarlet.
The thing you want to learn is "XPath queries". Yes, there's a small learning curve, but you'll soon appreciate working with the code you're parsing, not against it.

Just look around for a suitable tutorial.

If you give us example code we can help more.

gregors · 02-17-2021, 01:00 AM

Quote:

Originally Posted by ondoho

First of all, I'd check whether that site has an RSS/Atom feed that might already contain a condensed version of the news.

Thanks a lot for this hint! In fact there is an RSS feed for that page.

Gregor

syg00 · 02-17-2021, 01:04 AM

awk is awesome, and this is certainly doable, but you'll get a bunch of recommendations to use a "proper" tool that understands the format. pup is one such, and CPAN will have a few as well if you're into Perl.
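
For the awk route, a minimal sketch might look like the following. It assumes each headline and short text sits entirely on one line, as in the two snippets quoted in the question. The file name sample.html, its surrounding markup, and the function name extract_teasers are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample input, modelled on the two lines quoted in the question.
cat > sample.html <<'EOF'
<div class="teaser">
<span class="teaser__headline">++ Japan beginnt mit Impfprogramm ++</span>
<img src="pic.jpg">
<p class="teaser__shorttext">Fünf Monate ...</p>
</div>
EOF

# Print the visible text of every headline/shorttext line in the given file.
# Assumption: each element opens and closes on the same line.
extract_teasers() {
    awk '
    /teaser__headline/ || /teaser__shorttext/ {
        line = $0
        gsub(/<[^>]*>/, "", line)          # drop every tag on the line
        gsub(/^[ \t]+|[ \t]+$/, "", line)  # trim surrounding whitespace
        print line
    }
    ' "$1"
}

extract_teasers sample.html
```

On the sample above this prints the headline text followed by the short text, one per line. Real pages often split an element across several lines or nest further tags inside it, and that is where this line-based approach breaks down and the HTML-aware tools become the better choice.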