LinuxQuestions.org
Old 10-31-2015, 03:31 AM   #1
gordium
LQ Newbie
 
Registered: Oct 2015
Distribution: Debian, Ubuntu
Posts: 2

Rep: Reputation: Disabled
Replacing text which contains html tags - character escaping problem


Hello,
I am using perl pie and sed commands to remove text fragments from multiple files, but I can't remove text that contains HTML tags.

Successful commands:
perl -pi -e 's/Demotext<br \/>DemoCompany 22.0.1/ReplacedText/g' report.html
sed -i 's/jskl.*4ksjC=/newText/g' report.html

The commands above replace text for me, but I cannot use them for the text below.

<a class="class-1" href="http://url.tld/"><img src='data:image/png;base64,gibberish' alt='alttext'/></a>

I have been searching for this for 4 days. Any help will be appreciated.

Last edited by gordium; 10-31-2015 at 03:32 AM. Reason: typo
 
Old 10-31-2015, 05:31 AM   #2
mike acker
Member
 
Registered: Feb 2014
Location: Michigan
Distribution: LMDE MINT AMD 64
Posts: 91

Rep: Reputation: Disabled
have you tried opening the document in LibreOffice and then doing a "save as" text ?
 
Old 10-31-2015, 07:18 AM   #3
wpeckham
Senior Member
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, Fedora, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, Vsido, tinycore, Q4OS
Posts: 2,226

Rep: Reputation: 893
gsar might work for you.

Have you tried using GSAR for the job? The 'Generic Search and Replace' tool has never let me down.
It is the fastest tool I have found. Its only limitation is that it does not handle regex or wildcards.

It can be hard to get, as it is not in every repository.
I have a copy if you need me to make it available.
 
Old 10-31-2015, 07:36 AM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 5,928

Rep: Reputation: 1408
interesting.

do you want to remove the complete block of html, or just the tags (=everything between < and >)?
and what's the bigger picture here? maybe you are approaching the problem from the wrong side.

also please use code tags for code, your first post is somewhat confusing.
 
Old 10-31-2015, 08:38 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,516

Rep: Reputation: 2893
I do not understand a few things here:

1. Why would you use sed and perl??

2. Neither the sed nor the perl would work on the displayed line as the data in both scripts is not in the line

Maybe you could try and explain a little further.
 
Old 11-02-2015, 12:44 AM   #6
gordium
LQ Newbie
 
Registered: Oct 2015
Distribution: Debian, Ubuntu
Posts: 2

Original Poster
Rep: Reputation: Disabled
Okay, let me explain my problem step by step.

We are generating hourly html reports from logs.
Team members have to open, change a few things on these html reports by hand on windows.
But I told them they don't have to do this by hand every hour: I can search and replace text recursively with the perl pie command on Linux, with the help of cron.
They loved it.

So I wrote a basic bash script.
It generates the HTML report, searches for three long text fragments and replaces them with the given text.
But one of the text block that I have to remove is this:
Code:
<a class="class-1" href="http://url.tld/"><img src='data:image/png;base64,gibberish' alt='alttext'/></a>
And I can't search for and remove this piece of HTML with perl pie.
I am using sed or perl pie because I know only these two to search and replace text, but I am open to any suggestion.

I hope I explained better this time.
Thanks in advance and I really appreciate it.
 
Old 11-02-2015, 09:46 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,516

Rep: Reputation: 2893
Ok, so I am guessing the issue is a regex to get only this line? Is there nothing unique about it compared to other anchor lines in the html?

Also, as you are using perl, why not just write a perl script instead of a bash with perl in it? (just a thought)
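
If the fragment is always the same literal text, no hand-written regex may be needed. The sketch below is one possible approach, not something tested against the real reports: it passes the fragment to Perl through an environment variable and wraps it in \Q...\E, so Perl treats every character literally and the quotes and slashes need no escaping in the shell or in the pattern. The variable name fragment and the demo file are assumptions for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real reports.
tmp=$(mktemp -d)

# Read the fragment verbatim; the quoted here-document disables all shell expansion.
IFS= read -r fragment <<'EOF'
<a class="class-1" href="http://url.tld/"><img src='data:image/png;base64,gibberish' alt='alttext'/></a>
EOF
export fragment

# Demo report (made up): the fragment sits between two ordinary lines.
printf '%s\n' '<p>before</p>' "$fragment" '<p>after</p>' > "$tmp/report.html"

# \Q...\E quotes regex metacharacters in the interpolated value, so the
# fragment is matched as plain text; the empty replacement deletes it.
perl -pi -e 's/\Q$ENV{fragment}\E//g' "$tmp/report.html"

cat "$tmp/report.html"
```

This only matches when the fragment sits on a single line and is byte-for-byte identical to the text in the file. If the base64 data changes between reports, a pattern with a wildcard for that part would be needed instead.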
 
Old 11-02-2015, 03:31 PM   #8
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 5,928

Rep: Reputation: 1408
Quote:
Originally Posted by gordium View Post
But one of the text block that I have to remove is this:
Code:
<a class="class-1" href="http://url.tld/"><img src='data:image/png;base64,gibberish' alt='alttext'/></a>
i just tried to formulate a sed command that would remove that, but it's way too complex.
my guess is, you have to rethink the problem from the beginning and find the tool that does what you need.

e.g., extracting info from html files, i find xmllint to be helpful (esp. the xpath options). it's part of libxml2, iirc.

or:
Quote:
Originally Posted by gordium View Post
We are generating hourly html reports from logs.
Team members have to open, change a few things on these html reports by hand
so you first generate html from logs, then remove the html again?
why don't you extract the info you need from the logs first, then generate the right html.

Last edited by ondoho; 11-02-2015 at 03:34 PM.
 
Old 11-02-2015, 06:57 PM   #9
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,374

Rep: Reputation: 2383
As per grail & ondoho.. basically use a proper Perl program to generate the html from the logs and don't bother generating (or at least outputting) the lines you don't want.
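
A minimal sketch of the generate-it-right idea, written in shell rather than Perl to stay close to the existing bash script. The log format ("LEVEL message", one entry per line), the file names and the sample entries are all invented for illustration; the real logs will differ.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical log: one entry per line, "LEVEL message".
cat > "$tmp/app.log" <<'EOF'
ERROR disk <sda1> is 95% full
INFO backup finished
ERROR login failed for user "bob" & "eve"
EOF

# Escape the characters that are special in HTML text (& must go first).
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

{
    printf '%s\n' '<html><body>' '<h1>Hourly report</h1>' '<ul>'
    # Emit only the wanted entries; nothing has to be removed afterwards.
    grep '^ERROR ' "$tmp/app.log" | html_escape | sed 's/.*/<li>&<\/li>/'
    printf '%s\n' '</ul>' '</body></html>'
} > "$tmp/report.html"

cat "$tmp/report.html"
```

Because the unwanted parts are never written, there is no search-and-replace step left to get wrong.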
 
Old 11-03-2015, 04:55 AM   #10
wpeckham
Senior Member
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, Fedora, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, Vsido, tinycore, Q4OS
Posts: 2,226

Rep: Reputation: 893
Better than an edit...

+1 chrism01

I have built systems of monitor and analysis pages, often with simple mysql back ends updated by scripts, and NEVER wanted to edit pages in place. I always GENERATED the pages using perl or bash. Today I would be likely to use python, but even old ksh without extensions or updates could do this job.

It requires more thought and more script, but making correct edits when multiple embedded delimiters are involved is tricky for any tool. Generating the correct page is FAR easier.

Here is another idea to add to the above: if you MUST edit, try using the original page as a base for a template. Replace the part that will change into something easy to search for and that contains few or no delimiters. Generate your finished pages from the templates by replacing ONLY those easy to find and replace field holders in a copy of the template.

There are so many different kinds of software tools for search and replace simply because it is a tricky problem under some conditions. GENERATING text (and HTML/XHTML is just formatted text) is far easier, and has been solved by every "hello world" program ever written in pretty much the same way.

I also have an idea for a combined approach, but it would require identifying comments added to the html document to delimit the blocks to be replaced. Easier to generate the page directly, but an interesting problem to consider.
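
One way the comment-delimited variant could look, assuming the report generator can be made to wrap the block in marker comments. The marker text (BEGIN logo / END logo) is invented for this sketch, and each marker is assumed to sit on its own line.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Demo report with the unwanted block wrapped in hypothetical marker comments.
cat > "$tmp/report.html" <<'EOF'
<p>before</p>
<!-- BEGIN logo -->
<a class="class-1" href="http://url.tld/"><img src='data:image/png;base64,gibberish' alt='alttext'/></a>
<!-- END logo -->
<p>after</p>
EOF

# A sed address range deletes every line from the BEGIN marker through the
# END marker, whatever lies in between; the block's contents are never matched.
sed '/<!-- BEGIN logo -->/,/<!-- END logo -->/d' "$tmp/report.html" > "$tmp/clean.html"

cat "$tmp/clean.html"
```

Only the two markers have to be matched, so the base64 data and nested quotes inside the block no longer matter.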

Last edited by wpeckham; 11-03-2015 at 04:58 AM.
 
Old 11-03-2015, 06:03 PM   #11
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,374

Rep: Reputation: 2383
imho, even if you must edit (rather than going with my prev suggestion), one-liners are fiddly and hard to debug, even when the job can be done that way.
Just write a proper Perl program to do the editing; it gives you much more control and is much easier to debug.
Perl is really good at this sort of thing (or pick another lang if you prefer).

PS: thx wpeckham
 
  


All times are GMT -5. The time now is 11:59 AM.