LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

01-17-2020, 06:23 AM   #1
bmxakias
Member
 
Registered: Jan 2016
Posts: 254

Rep: Reputation: Disabled
Keep specific text from a line in bash script


Hello

I have a file (file.html) that contains a few lines following a pattern like:

Code:
<td width="1%" nowrap="nowrap" align="right"><a href="/word-something/saf6059eb20/some-text-2015-web-710z-yts-lt" title="Super duper text (2015) [WEBfor] [532a] [YTR LT]"><img src="//images.some.info/dl_icon.png" alt="get..." width="28" height="21" border="0" align="absmiddle"></a></td>
<td width="1%" nowrap="nowrap" align="right"><a href="/word-something/s1a148a0a69/hello-of-a-blabla-1999-bit" title="Other nice text tha i will like to keep (5487) TREUsi"><img src="//images.some.info/dl_icon.png" alt="get..." width="28" height="21" border="0" align="absmiddle"></a></td>
<td width="1%" nowrap="nowrap" align="right"><a href="/word-something/s68ee3a70d3/bye-in-all-third-time-2067-5903f-amzn-web-ty-ddp2-1-h-245-ntu" title="A good one yes 1968 8731w AMDR WEB-TE DDU6 1 K 131-NTE"><img src="//images.some.info/dl_icon.png" alt="get..." width="28" height="21" border="0" align="absmiddle"></a></td>
I would like to clean that file and keep only the titles like:

Quote:
Super duper text (2015) [WEBfor] [532a] [YTR LT]
Other nice text tha i will like to keep (5487) TREUsi
A good one yes 1968 8731w AMDR WEB-TE DDU6 1 K 131-NTE
either in the same file or written to a new file.

Thank you
 
01-17-2020, 06:30 AM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Well formed data (really well formed data over every line) can be simply parsed with sed. Else you might be up for using something more specific - pup maybe.
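
As a rough illustration of the sed approach, here is a minimal sketch. It assumes every line of interest has exactly one title="..." attribute with no embedded double quotes; sample.html is a made-up file name that mirrors the lines in the question, not the real file.

```shell
# Build a small sample in the same shape as the lines in the question
# (sample.html is a hypothetical name used only for this sketch).
cat > sample.html <<'EOF'
<td width="1%"><a href="/word-something/saf6059eb20/some-text" title="Super duper text (2015) [WEBfor] [532a] [YTR LT]"><img src="//images.some.info/dl_icon.png" alt="get..."></a></td>
<td width="1%"><a href="/word-something/s1a148a0a69/hello" title="Other nice text tha i will like to keep (5487) TREUsi"><img src="//images.some.info/dl_icon.png" alt="get..."></a></td>
EOF

# -n suppresses the default output; the trailing p prints only the lines
# where the substitution matched. The capture group holds everything
# between title=" and the next double quote.
titles=$(sed -n 's/.*title="\([^"]*\)".*/\1/p' sample.html)

printf '%s\n' "$titles"
```

Lines without a title attribute are simply skipped, but anything less regular than this (attributes split across lines, single-quoted values) would likely need a real parser.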
 
01-17-2020, 06:35 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
If that is HTML or XHTML then you'll need a proper parser to manage that task. sed is not the right language for that.

XPath is one possibility. There are a lot of easy to find XPath utilities out there and you could use an xpath like either of these depending on the larger context within the document:

Code:
'//td/a/@title'

'//tr/td[1]/a/@title'
Show a little more of the structure from that part of the XHTML document so we can see the context and provide a more precise answer.
 
01-17-2020, 06:54 AM   #4
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
Quote:
Originally Posted by syg00
Well formed data (really well formed data over every line) can be simply parsed with sed. Else you might be up for using something more specific - pup maybe.
I'm glad you suggested pup!
Code:
<links pup 'a attr{title}'
 
01-17-2020, 09:16 AM   #5
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,599

Rep: Reputation: 2546
As has been mentioned, the correct way to read text from HTML is with an HTML parser.

But a very quick and dirty solution that might be good enough for a one-off is:
Code:
$ grep -Po '(?<=title=")[^"]+' file.html
Apart from the risk of malformed HTML, the other downside is that HTML entities are not decoded (so a well-formed title containing quotes will show them as &quot;, for example).
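
To illustrate the entity issue, here is a hedged sketch that decodes a handful of common entities with sed after extraction. It uses plain grep -o rather than -P, since -P is a GNU grep extension and may not be available everywhere. entities.html and decoded.txt are made-up names, and the list of entities is deliberately short, not complete.

```shell
# entities.html is a hypothetical sample with an encoded quote and ampersand.
cat > entities.html <<'EOF'
<td><a href="/x/1" title="The &quot;best&quot; of R&amp;B (1999)"><img src="i.png"></a></td>
EOF

# 1. grep -o keeps only the title="..." part of each matching line.
# 2. The first sed strips the title=" prefix and the closing quote.
# 3. The second sed decodes a few common entities; &amp; goes last so that
#    text such as &amp;quot; is not decoded twice.
grep -o 'title="[^"]*"' entities.html \
  | sed 's/^title="//; s/"$//' \
  | sed 's/&quot;/"/g; s/&lt;/</g; s/&gt;/>/g; s/&amp;/\&/g' \
  > decoded.txt

cat decoded.txt
```

Numeric entities and the long tail of named ones are not handled here, which is one more reason a proper HTML parser is the safer choice for anything beyond a one-off.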

If this isn't a one-off then you should explain the general task you're trying to achieve, because there's probably a simpler solution. (Perhaps involving the site's Atom/RSS feed, for example.)
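
Since the question asked for the result either in a new file or in the same file, here is a small sketch of both, under the same well-formed-lines assumption as the quick and dirty approach. page.html and titles_only.txt are made-up names for illustration.

```shell
# page.html is a hypothetical stand-in for the real file.
cat > page.html <<'EOF'
<td><a href="/word-something/s68ee3a70d3/bye" title="A good one yes 1968 8731w AMDR WEB-TE DDU6 1 K 131-NTE"><img src="//images.some.info/dl_icon.png" alt="get..."></a></td>
EOF

# Option 1: write the titles to a new file and leave the original untouched.
sed -n 's/.*title="\([^"]*\)".*/\1/p' page.html > titles_only.txt

# Option 2: replace the original. Redirecting straight back onto page.html
# would truncate it before sed reads it, so write to a temporary file first
# and move it into place only if sed succeeded.
tmp=$(mktemp)
sed -n 's/.*title="\([^"]*\)".*/\1/p' page.html > "$tmp" && mv "$tmp" page.html

cat page.html
```

Keeping a copy of the original before trying option 2 is a sensible precaution, because the HTML is gone once the file has been replaced.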


Last edited by boughtonp; 01-17-2020 at 09:17 AM.
 
01-17-2020, 10:57 AM   #6
bmxakias
Member
 
Registered: Jan 2016
Posts: 254

Original Poster
Rep: Reputation: Disabled
Great thank you !!!!
 
  


All times are GMT -5. The time now is 07:53 AM.
