Using SED command to allow Cyrillic characters in XML file
I have a .prog script that filters an XML output file, keeping only ASCII letters, digits, and a few special characters.
Under a new requirement, the XML output will also contain Cyrillic text, and I want my script to let those characters through as well.
Currently, I am using the sed command below:
sed -e "2,$ s/[^a-zA-Z0-9/?).,'+{}\n\r<>_\"= -]//g" "$cm_file" > "$work_file"
Using sed to parse XML files is pretty much always going to be a nightmare at some point. Your script lists the characters to keep and deletes everything else. As your list of what to keep is growing, I would suggest changing it to select what to remove, or better yet use a language like Perl, Python, Ruby, etc., which can parse XML correctly and let you display what you want or need.
Yes, a higher-level programming language like Perl or Python would serve you better. sed is very powerful, but the character set in your expression is ASCII only. You can tell your system to use UTF-8 by changing the locale, but you would still need to spell out a "dictionary" of the Cyrillic characters for the program to compare against. This sounds more like a job for Python.
And as grail said, the task might be easier if you choose to select what to remove instead.
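The "select what to remove" idea can be sketched as below. This is only an illustration: the function name strip_unwanted and the characters in the bracket expression (# @ ! * % ~) are hypothetical, since the thread never says which characters must be dropped. It assumes the input is UTF-8 and keeps the original 2,$ address so that line 1 is left alone.

```shell
#!/bin/sh
# Sketch: delete a deny-list of characters instead of maintaining an allow-list.
# The characters inside [...] are hypothetical examples; replace them with
# whatever must not reach the output file.
strip_unwanted() {
    # 2,$ skips line 1, as the original command does.
    # LC_ALL=C makes sed work byte by byte, so multibyte UTF-8 text
    # (Cyrillic included) is passed through untouched.
    LC_ALL=C sed -e '2,$ s/[#@!*%~]//g'
}

# Sample input: line 1 is not filtered, lines 2 and 3 are.
printf '%s\n' \
    '<?xml version="1.0"?>' \
    '<Name>AB#CD*109!</Name>' \
    '<Name>Обозначения@ использования</Name>' | strip_unwanted
```

In the script itself this would be used as strip_unwanted < "$cm_file" > "$work_file". Because only the listed characters are deleted, new alphabets need no further changes to the expression.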
I like your suggestion, but we have a limitation here: I cannot use any other language. If I go for selecting what to remove, can you please guide me on how to do that? I am not very comfortable with sed.
We will have an XML output file that contains Cyrillic characters in addition to the characters allowed by the command below:
sed -e "2,$ s/[^a-zA-Z0-9/?).,'+{}\n\r<>_\"= -]//g" "$cm_file" > "$work_file"
In the XML file, we have a <Name> </Name> tag. Currently, it carries ASCII characters only,
for example <Name>ABCD109</Name>.
Now, the requirement says Cyrillic text will come in this tag:
<Name>Обозначения использования</Name>
Initially, I thought I could add a Cyrillic character range the same way as the ASCII ranges, but that does not seem to be possible.
Can you please share how to modify this sed command, or create a new one, so that this tag also keeps the Cyrillic characters? Please help.
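One way to keep the existing allow-list and still let Cyrillic through is sketched below. It is not the only approach and makes several assumptions: the file is UTF-8, sed is GNU sed (the \x80-\xFF escape is a GNU extension), and the function name filter_keep_utf8 is made up for the example. Rather than naming Cyrillic letters, it keeps every non-ASCII byte, so any other non-ASCII text would pass through as well.

```shell
#!/bin/sh
# Sketch: original allow-list plus all non-ASCII bytes (assumes GNU sed, UTF-8 input).
filter_keep_utf8() {
    # LC_ALL=C: sed treats the input as single bytes, whatever locales are installed.
    # \x80-\xFF covers every byte of a multibyte UTF-8 sequence, so Cyrillic survives.
    # | is used as the s-command delimiter so the / in the class needs no escaping.
    # \$ and \" are shell escapes inside the double-quoted string.
    LC_ALL=C sed -e "2,\$ s|[^a-zA-Z0-9/?).,'+{}\n\r<>_\"=\x80-\xFF -]||g"
}

# Sample input: line 1 is not filtered; * and ! are not in the allow-list.
printf '%s\n' \
    '<?xml version="1.0"?>' \
    '<Name>ABCD109</Name>' \
    '<Name>Обозначения* использования!</Name>' | filter_keep_utf8
```

In the script this would replace the current command as filter_keep_utf8 < "$cm_file" > "$work_file". The * and ! in the sample are deleted because they are not in the allow-list, while the Cyrillic letters remain. Restricting the filter to Cyrillic only, for instance with a literal Cyrillic range in a UTF-8 locale, may also work with GNU sed, but that depends on the locale being installed and is not shown here.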