LinuxQuestions.org
Old 09-24-2012, 01:13 PM   #1
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Rep: Reputation: 0
Binary codes in Bash regular expression.


On GNU bash, version 4.2.24(1)-release (x86_64-pc-linux-gnu)
I am trying to process text like:

déclarée au titre de la loi de juillet 1901, qui vise à la préservation,
la conservation et la mise en valeur du patrimoine ferroviaire de la Compagnie du Nord,
devenue SNCF région Nord. </p>

(don't worry about what it means)
The file containing the above is piped to the script "passthrough" thus:
cat ~/mytrickyfile.htm|./passthrough
I wish to replace the accented characters (Latin-9 encoded) with their HTML entity equivalents, but not every character that has an entity: for example, the "<" must be left alone.
e.g. é is code hex e9 (at least it is when I do a hex dump of the file) and should be replaced by &eacute; .

I have tried a lot of variations in regular expressions and I just cannot get any to work. Here is my bash file which simply tells me whether I have a match and what it is. Can somebody help me get the correct regex and make the script below match?

#!/bin/bash
# regex=".*['\xA0'-'\xFF'].*" # works to some degree
regex="(.*)\xE9(.*)" #doesn't work for me.
data=""

function htmlizelatin () {
    if [[ $data =~ $regex ]]; then
        echo "this matches: $data"
        echo "matching substring: ${BASH_REMATCH[0]}"
        let i=1
        n=${#BASH_REMATCH[*]}
        echo "rematch= ${BASH_REMATCH}"
        while [[ $i -lt $n ]]
        do
            echo " capture[$i]: ${BASH_REMATCH[$i]}"
            let i++
        done
    else
        echo "this fails: $data"
    fi
}
while read data; do
    htmlizelatin
done
 
Old 09-24-2012, 06:09 PM   #2
porphyry5
Member
 
Registered: Jul 2010
Location: oregon usa
Distribution: Slackware 14.1, Arch
Posts: 437

Rep: Reputation: 19
Quote:
Originally Posted by sanktwo
On GNU bash, version 4.2.24(1)-release (x86_64-pc-linux-gnu)
I am trying to process text like:

déclarée au titre de la loi de juillet 1901, qui vise à la préservation,
la conservation et la mise en valeur du patrimoine ferroviaire de la Compagnie du Nord,
devenue SNCF région Nord. </p>
I had a similar problem needing to anglicize the names of some of my music files, as accented letters bothered certain apps. Wasn't able to do it with REs, used tr instead. But I believe tr will only substitute on a one for one basis, so that won't help here. Have you considered using a substitution table?

Both bash and awk have associative arrays, so you could use the accented letter as the index for the corresponding &...;

Also, all &...; codes can be specified with a 3 digit decimal number for ... Possibly those numbers correspond to the ascii value of the character concerned.
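
A rough sketch of the substitution-table idea in plain bash, for what it is worth. It assumes bash 4 or later (for declare -A) and forces LC_ALL=C so that ${line:i:1} steps through bytes rather than UTF-8 characters; the table name entity, the function name htmlize_line and the three entries are made up for illustration. On the numeric form: references such as &#233; use the Unicode code point, which matches the Latin-9 byte value for most accented letters but not for a few, such as the euro sign at 0xA4.

```shell
#!/bin/bash
# Sketch only: byte-wise lookup table in pure bash (needs bash 4+).
LC_ALL=C   # assumption: treat every byte as one character

# Hypothetical table: Latin-9 byte -> named HTML entity.
declare -A entity=(
  [$'\xE0']='&agrave;'
  [$'\xE8']='&egrave;'
  [$'\xE9']='&eacute;'
)

htmlize_line() {
  local line=$1 out='' ch i
  for (( i = 0; i < ${#line}; i++ )); do
    ch=${line:i:1}
    if [[ -n ${entity[$ch]+set} ]]; then
      out+=${entity[$ch]}     # byte is in the table: emit its entity
    else
      out+=$ch                # anything else (including <) passes through
    fi
  done
  printf '%s\n' "$out"
}
```

Because only bytes listed in the table are touched, existing markup such as < and > is left alone. A loop like this is slow on large files, so it is more of a demonstration than a recommendation.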
 
Old 09-24-2012, 06:40 PM   #3
SecretCode
Member
 
Registered: Apr 2011
Location: UK
Distribution: Kubuntu 11.10
Posts: 562

Rep: Reputation: 102
I think you should be thinking of searching unicode rather than binary. Regex Tutorial - Unicode Characters and Properties

I also suspect you'd get better results with Perl, which is the go-to tool for regular expression work
 
Old 09-24-2012, 07:56 PM   #4
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 333

Rep: Reputation: 141
Try using ANSI-C Quoting to insert the actual character instead of its escape sequence.
http://www.gnu.org/software/bash/man..._002dC-Quoting
Code:
regex=$'(.*)\xE9(.*)'
GNU sed supports escape sequences. This replaces all '\xE9' characters with '&eacute;'
Code:
sed 's/\xE9/\&eacute;/g' infile > outfile
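
A small self-contained check of the sed command above, fed from printf instead of a file. It assumes GNU sed (for the \xE9 escape); the LC_ALL=C prefix is an addition not in the original post, intended to make sed treat the input as plain bytes regardless of the session locale. The function name latin_e_acute is made up.

```shell
# Sketch only: replace the Latin-9 byte 0xE9 with &eacute; on stdin.
latin_e_acute() {
  # \& in the replacement is a literal ampersand (a bare & would mean "the match").
  LC_ALL=C sed 's/\xE9/\&eacute;/g'
}

# printf octal escape \351 is the byte 0xE9.
printf 'd\351clar\351e au titre de la loi </p>\n' | latin_e_acute
```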
 
Old 09-25-2012, 07:40 AM   #5
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Original Poster
Rep: Reputation: 0
Character by character in Bash

Quote:
Originally Posted by porphyry5
I had a similar problem needing to anglicize the names of some of my music files, as accented letters bothered certain apps. Wasn't able to do it with REs, used tr instead. But I believe tr will only substitute on a one for one basis, so that won't help here. Have you considered using a substitution table?

Both bash and awk have associative arrays, so you could use the accented letter as the index for the corresponding &...;

Also, all &...; codes can be specified with a 3 digit decimal number for ... Possibly those numbers correspond to the ascii value of the character concerned.
I was about to give up on my quest to "use native Bash" rather than the two alternatives I had considered: 1. split the line into characters then a case statement or 2. invoke another program to do the work; but then decided to ask in this forum in case I was doing something stupid with RE.

You are right, tr is not appropriate. I did think about using a character I never come across in the 7bit range, substituting that, then using Bash to re-substitute, but that gets a bit ugly.
Regarding entity names versus codes: since I read the generated HTML from time to time, it is much easier if the entity names are there. I do have tables of Latin-9 and the codes, so it is not a big problem either way. Speed is not a big issue, since I don't do this very often, though there is a LOT of it when I do. I think I will try calling an external program several times, one entity per call. I have about 20 entities to substitute in hundreds of files.

Many thanks for adding to my conviction that RE in Bash will not permit splitting at 8 bit characters.
 
Old 09-25-2012, 08:01 AM   #6
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Original Poster
Rep: Reputation: 0
unicode versus latin9

Quote:
Originally Posted by SecretCode
I think you should be thinking of searching unicode rather than binary. Regex Tutorial - Unicode Characters and Properties

I also suspect you'd get better results with Perl, which is the go-to tool for regular expression work
Hi Secretcode, I cannot tell a lie, I baulked at learning Perl - but I guess I should be more brave and add it to all the other languages I have learned... I did read that it is the doyen of text processing languages.

Regarding Unicode: the files I am dealing with are not in UTF-8; they are in Latin-9 (more precisely, the encoding ISO/IEC 8859-15:1999).
Of course, for the first 256 characters there isn't that much difference between the two character sets (apart from things like the Euro sign, Latin Capital Ligature OE, etc.), except that characters above 127 take at least 16 bits in UTF-8. I am not sure that I want to investigate serving UTF-8 files on the web - I certainly have no experience of that. If I cannot get a normal Unix utility to work, I may well have a go at writing a Perl program - wish me luck.
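
To see the size difference for yourself, something like the following should do. It is only a sketch and assumes an iconv that knows ISO-8859-15; the helper name utf8_len is made up, and its argument is the byte value written in octal (351 is 0xE9, 244 is 0xA4).

```shell
# Sketch only: how many bytes does one Latin-9 byte become in UTF-8?
utf8_len() {
  # "\\$1" builds a printf octal escape such as \351 from the argument.
  printf "\\$1" | iconv -f ISO-8859-15 -t UTF-8 | wc -c | tr -d ' '
}

utf8_len 351   # e acute
utf8_len 244   # euro sign in Latin-9
```

The accented letters come out as two bytes each, while the euro sign, which Latin-9 places at 0xA4, needs three.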
 
Old 09-25-2012, 08:10 AM   #7
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Original Poster
Rep: Reputation: 0
Looks like it is going to be SED

Quote:
Originally Posted by Kenhelm
Try using ANSI-C Quoting to insert the actual character instead of its escape sequence.
http://www.gnu.org/software/bash/man..._002dC-Quoting
Code:
regex=$'(.*)\xE9(.*)'
GNU sed supports escape sequences. This replaces all '\xE9' characters with '&eacute;'
Code:
sed 's/\xE9/\&eacute;/g' infile > outfile
Well I tried your first suggestion - no change sorry, still fails.

However, sed seems to accept the RE with no problem, so I guess this is a "feature" (bug?) of bash.
At worst I can call sed around 20 times per file to do all the substitutions, one at a time. I don't know sed very well, so maybe it can do more than one at a time; I will look. At least it seems that it will work.
I have already spent a day battling with bash and its REs, so an external program it is, and sed looks like the one. Many thanks for your time and efforts.
 
Old 09-25-2012, 08:36 AM   #8
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Original Poster
Rep: Reputation: 0
I think I may have found the culprit...

Quote:
Originally Posted by sanktwo
On GNU bash, version 4.2.24(1)-release (x86_64-pc-linux-gnu)
I am trying to process text like:

déclarée au titre de la loi de juillet 1901, qui vise à la préservation,
la conservation et la mise en valeur du patrimoine ferroviaire de la Compagnie du Nord,
devenue SNCF région Nord. </p>
Thanks to the comment about unicode I was prompted to check my locale on bash. It reads:
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
and there is no 8859 locale available on my linux.

oops.. if Bash uses that (and I think it does) then it will not process ISO/IEC 8859-15 files properly. Sigh.
Personally I would have preferred bash just to treat files in the "unix" way, i.e. as a stream of bytes, and leave the user to handle UTF-8, but that is just me. Thanks for your help.
 
Old 09-25-2012, 08:55 AM   #9
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Original Poster
Rep: Reputation: 0
Some hypotheses are better than others...

Well, it was a good hypothesis while it lasted. I compiled a locale: sudo locale-gen --no-purge en_GB.ISO-8859-15 then did LANG=en_GB.iso885915 then ran the tests.
Sigh, no difference, though the command "locale" in the bash session now shows
Code:
LANG=en_GB.iso885915
LANGUAGE=
LC_CTYPE="en_GB.iso885915"
LC_NUMERIC="en_GB.iso885915"
LC_TIME="en_GB.iso885915"
LC_COLLATE="en_GB.iso885915"
LC_MONETARY="en_GB.iso885915"
LC_MESSAGES="en_GB.iso885915"
LC_PAPER="en_GB.iso885915"
LC_NAME="en_GB.iso885915"
LC_ADDRESS="en_GB.iso885915"
LC_TELEPHONE="en_GB.iso885915"
LC_MEASUREMENT="en_GB.iso885915"
LC_IDENTIFICATION="en_GB.iso885915"
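
None of the locale experiments above tried the plain C locale. One possible variation, offered as an assumption and not as a confirmed fix for bash 4.2: set LC_ALL=C inside the script so that both bash and the regex library work byte by byte, and build the pattern with ANSI-C quoting so that it contains the real 0xE9 byte. The function name match_eacute is made up.

```shell
#!/bin/bash
# Sketch only: match a raw Latin-9 byte with [[ =~ ]] under the C locale.
LC_ALL=C                    # assumption: byte-wise matching is what is wanted
regex=$'(.*)\xE9(.*)'       # ANSI-C quoting puts the actual 0xE9 byte in the pattern

match_eacute() {
  # Succeeds if $1 contains the byte 0xE9; captures are left in BASH_REMATCH.
  [[ $1 =~ $regex ]]
}

if match_eacute $'r\xE9gion Nord'; then
  printf 'before: %s\n' "${BASH_REMATCH[1]}"
  printf 'after:  %s\n' "${BASH_REMATCH[2]}"
else
  echo "no match"
fi
```

Note that $regex is left unquoted on the right-hand side of =~; quoting it would make bash treat the whole pattern as a literal string.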
 
Old 09-25-2012, 09:07 AM   #10
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
PHP has a good thing for that: http://php.net/manual/en/function.ge...tion-table.php
 
Old 09-26-2012, 08:52 AM   #11
porphyry5
Member
 
Registered: Jul 2010
Location: oregon usa
Distribution: Slackware 14.1, Arch
Posts: 437

Rep: Reputation: 19
Quote:
Originally Posted by sanktwo
though there is a LOT of it; I think I will try using multiple calls of an external program to do each entity one at a time. I have about 20 entities to substitute in hundreds of files.

Many thanks for adding to my conviction that RE in Bash will not permit splitting at 8 bit characters.
You don't need to make multiple calls to an external app; for example, sed can process all 20 of your entities in the same pass through the file. Check out the -e option
http://www.grymoire.com/Unix/Sed.html#uh-13

Sed can also process multiple files in the same invocation; see the next entry on that page
http://www.grymoire.com/Unix/Sed.html#uh-14
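
As an illustration of the -e approach, here is a sketch with three of the substitutions chained in one sed call. It assumes GNU sed for the \xHH escapes; LC_ALL=C is an added precaution so that sed works on bytes, and the function name convert_accents is made up. Any file names given as arguments are passed straight to sed, and with no arguments it reads standard input.

```shell
# Sketch only: several Latin-9 bytes replaced in a single pass.
convert_accents() {
  LC_ALL=C sed \
    -e 's/\xE0/\&agrave;/g' \
    -e 's/\xE8/\&egrave;/g' \
    -e 's/\xE9/\&eacute;/g' \
    "$@"
}

printf 'qui vise \340 la pr\351servation, </p>\n' | convert_accents
```

When several files are named, sed writes one concatenated stream to standard output, so converting hundreds of files separately still needs a shell loop or GNU sed's -i option.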
 
Old 09-26-2012, 01:16 PM   #12
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Original Poster
Rep: Reputation: 0
php for text processing

Quote:
Originally Posted by konsolebox
Thanks Konsolebox, I had a look at the reference you gave, but not being a php programmer I could not figure out whether I could easily choose just which characters are converted.
If I wanted to convert everything for which HTML has entities, it would be easy: I can use "recode ..HTML_4.0".
The problem is that the file already has HTML in it, e.g. <, which gets converted as well. That is very inconvenient.
I need to be very selective about what gets translated into entities.

I think I will stick with SED for the time being.
 
Old 09-27-2012, 08:39 AM   #13
sanktwo
LQ Newbie
 
Registered: Nov 2003
Location: UK + France
Distribution: Kubuntu 12.04, Suse 10, 9.3, Ubuntu netbook remix, Windows xp, Windows 2000, Ubuntu 8 on desktop
Posts: 13

Original Poster
Rep: Reputation: 0
The sed solution is fine for me

Quote:
Originally Posted by porphyry5
You don't need to make multiple calls to an external app, for example sed can process all 20 of your entities in the same pass through the file. Check out the -e option
Thanks porphyry5, it did just as you said and is just fine for my application. I think SED is a bit easier than learning php or perl so I went with that. If anyone is interested attached is my simple test script which takes standard in and converts a few eight bit characters coded with Latin 9 ISO/IEC 8859-15 (or in this case ISO/IEC 8859-1 if you prefer) and puts it on standard out.
In the end I decided to use the -f option and put the commands in a file. It seems to work just fine and is certainly fast enough for me.
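
The attached script is not reproduced in the thread, so the following is only a guess at the general shape of a sed -f setup, not the poster's actual file. It assumes GNU sed for the \xHH escapes; the temporary script file, its three rules and the function name htmlize are all hypothetical.

```shell
# Sketch only: keep the substitutions in a sed script file and run it with -f.
script=$(mktemp)
cat > "$script" <<'EOF'
# Hypothetical rules: one Latin-9 byte per line, \& is a literal ampersand.
s/\xE0/\&agrave;/g
s/\xE7/\&ccedil;/g
s/\xE9/\&eacute;/g
EOF

htmlize() {
  # LC_ALL=C is an added precaution so sed handles the input as bytes.
  LC_ALL=C sed -f "$script" "$@"
}

printf 'patrimoine fran\347ais, r\351gion Nord. </p>\n' | htmlize
```

Keeping the rules in a file means the list of about 20 entities can grow without touching the calling script.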

Now to get on with my real work, many thanks to all who responded.
Attached Files
File Type: txt changeeightbit.txt (920 Bytes, 10 views)
 
  

