Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place! |
12-23-2016, 07:51 AM
|
#1
|
LQ Newbie
Registered: Dec 2016
Posts: 3
|
Using SED command to allow Cyrillic characters in XML file
I have a .prog script that filters an XML output file so that only ASCII letters, digits and a few special characters are kept.
A new requirement means Cyrillic text will now also appear in the XML output, and I want my script to let those characters through as well.
Currently, I am using the sed command below:
sed -e "2,$ s/[^a-zA-Z0-9/?  ).,'+{}\n\r<>_\"= -]//g" "$cm_file" > "$work_file"
Can anyone please help me with this?
Regards,
Gurpreet
|
|
|
12-23-2016, 08:59 AM
|
#2
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
|
Using sed to parse XML files is pretty much always going to be a nightmare at some point. Your script lists the characters to keep and removes everything else. As your list of what to keep is growing, I would suggest changing it to select what to remove, or better yet use a language like Perl, Python, Ruby, etc., which can parse XML correctly and let you display what you want / need.
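As a rough illustration of the parser-based route, here is a minimal Python sketch. It assumes the cleaning only needs to touch element text (not tag names or attributes), and the allowed set plus the Cyrillic range U+0400-U+04FF are assumptions, not something taken from the original script. The function name clean_element_text is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Characters to drop from element text: anything outside ASCII letters,
# digits, a few punctuation marks, and the Cyrillic block U+0400-U+04FF.
# Both the punctuation list and the Cyrillic range are assumptions.
UNWANTED = re.compile(r"[^a-zA-Z0-9 .,'+_=\u0400-\u04FF-]")

def clean_element_text(xml_text):
    """Parse the XML and strip unwanted characters from element text only."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text:
            elem.text = UNWANTED.sub("", elem.text)
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")
```

Because the parser separates markup from text, the tags themselves never pass through the filter, so the angle brackets and slashes do not have to be in the allowed list.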
|
|
|
12-24-2016, 03:12 AM
|
#3
|
Member
Registered: Dec 2012
Location: Mauritius
Distribution: Slackware
Posts: 567
|
Hi,
Yes, you will need a higher-level programming language, like Perl or Python. sed is very powerful, but in the C/POSIX locale it works on single bytes, so Cyrillic characters (which are multibyte in UTF-8) are not matched as characters. You can switch to a UTF-8 locale, but you would still need to spell out the "dictionary" of allowed characters for the comparison. This sounds more like a job for Python.
And as grail said, the task might be easier if you choose to select what to remove instead.
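For what it is worth, a minimal Python sketch of the same keep-list idea could look like this. The ASCII part of the character class is copied from the sed command in post #1; the Cyrillic range U+0400-U+04FF is an assumption about which letters are needed, and filter_lines is a hypothetical name. Like the sed address 2,$ it leaves the first line alone.

```python
import re

# Keep-list from the sed command in post #1, plus the Cyrillic block
# U+0400-U+04FF (assumed range). Anything not listed gets deleted.
DISALLOWED = re.compile(r"""[^a-zA-Z0-9/? ).,'+{}\n\r<>_"= \u0400-\u04FF-]""")

def filter_lines(lines):
    """Keep line 1 untouched, strip disallowed characters from line 2 onward."""
    return [
        line if index == 0 else DISALLOWED.sub("", line)
        for index, line in enumerate(lines)
    ]
```

Python strings are Unicode, so the range works on characters rather than bytes and does not depend on the locale.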
|
|
|
12-27-2016, 01:45 AM
|
#4
|
LQ Newbie
Registered: Dec 2016
Posts: 3
Original Poster
|
Quote:
Originally Posted by grail
Using sed to parse XML files is pretty much always going to be a nightmare at some point. Your script lists the characters to keep and removes everything else. As your list of what to keep is growing, I would suggest changing it to select what to remove, or better yet use a language like Perl, Python, Ruby, etc., which can parse XML correctly and let you display what you want / need.
|
I like your suggestion, but we have a limitation here: I cannot use any other language. If I go for selecting what to remove, can you please guide me on how to do that, since I am not very comfortable with sed?
|
|
|
12-27-2016, 02:28 AM
|
#5
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
|
You will need to supply an example of what you are starting with and what the final output should be.
Also, please use [code][/code] tags around your examples to maintain formatting and help with readability.
|
|
|
12-27-2016, 02:42 AM
|
#6
|
LQ Newbie
Registered: Dec 2016
Posts: 3
Original Poster
|
Hi,
We will have an XML output file that contains Cyrillic characters in addition to the characters allowed by the command below:
sed -e "2,$ s/[^a-zA-Z0-9/?  ).,'+{}\n\r<>_\"= -]//g" "$cm_file" > "$work_file"
In the XML file, we have a tag <Name> </Name>. Currently, it carries ASCII characters only,
for example <Name>ABCD109</Name>.
Now, the requirement says Cyrillic text will come in this tag:
<Name>Обозначения использования</Name>
Initially, I thought I could add a Cyrillic character range the same way as the ASCII ranges, but that does not seem to be possible.
Can you please share how to modify this sed command, or create a new one, so that the Cyrillic characters in this tag are kept? Please help.
Regards,
Gurpreet Kaur
|
|
|
12-27-2016, 04:38 AM
|
#7
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
|
So, here are the problems I have with the last post:
1. No use of 'code' tags as requested
2. No before and after example
3. As mentioned, sed is the wrong tool for non-ASCII related work
New information:
1. sed is also the wrong tool to use to parse XML
So, once again, please provide an actual example or be clearer about the example you have provided.
Code:
<Name>Обозначения использования</Name>
Using the above as a possible example, what exactly is wrong with the data shown? What exactly are you trying to remove / keep?
As mentioned by aragorn2101, perl, python or ruby would be the better choices to parse XML data.
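To make the "what exactly gets removed" question concrete, a small Python sketch like the following can list the characters a keep-list filter would delete from a given line. The keep-list mirrors the sed command from post #1 with an assumed Cyrillic range U+0400-U+04FF added; removed_chars is a hypothetical helper name.

```python
import re

# Keep-list modelled on the sed command in post #1, with the Cyrillic
# block U+0400-U+04FF added (the range is an assumption).
ALLOWED = re.compile(r"""[a-zA-Z0-9/? ).,'+{}\n\r<>_"= \u0400-\u04FF-]""")

def removed_chars(text):
    """Return the distinct characters the filter would delete, sorted."""
    return sorted({ch for ch in text if not ALLOWED.fullmatch(ch)})
```

Running it on a sample line before and after a change to the keep-list gives the kind of before/after evidence asked for above.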
|
|
|
All times are GMT -5. The time now is 10:39 AM.