Using SED command to allow Cyrillic characters in XML file
I have a .prog script which filters an XML output file so that only ASCII letters, digits and a few special characters are kept.
Now, under a new requirement, Cyrillic text will also be printed in the XML output, and I want my script to let those characters through as well. Currently, I am using the sed command below:
sed -e "2,$ s/[^a-zA-Z0-9/?:().,'+{}\n\r<>_"= -]//g" $cm_file > $work_file
Can anyone please help me with this? Regards, Gurpreet |
Using sed to parse XML files is pretty much always going to become a nightmare at some point. Your script lists the characters to keep rather than the ones to remove. As your list of what to keep is growing, I would suggest changing it to select what to remove or, better yet, using a language like Perl, Python or Ruby, which can parse XML correctly and let you display what you want or need. |
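To illustrate the "select what to remove" idea, here is a minimal Python sketch. It assumes the unwanted data is control characters (Unicode category Cc) other than newline, carriage return and tab. That choice, and the name strip_unwanted, are assumptions made for illustration; the thread never says which characters actually have to go.

```python
import unicodedata

# Assumption: only control characters are unwanted, apart from these three.
KEEP_CONTROLS = {"\n", "\r", "\t"}

def strip_unwanted(text: str) -> str:
    """Remove control characters (category Cc) except newline, CR and tab."""
    return "".join(
        ch for ch in text
        if ch in KEEP_CONTROLS or unicodedata.category(ch) != "Cc"
    )

sample = "<Name>Обозначения\x07 использования</Name>\n"
print(strip_unwanted(sample))
```

With this approach Cyrillic (or any other script) passes through without being listed, because only the explicitly unwanted characters are named.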
Hi,
Yes, you will need a higher-level programming language, like Perl or Python. sed is very powerful, but in the default C locale it matches single bytes, which effectively limits your character class to ASCII. You can tell your system to use UTF-8 by changing the locale, but you will still need to define the set of Cyrillic characters (a "dictionary") for a program to compare against. This sounds more like a job for Python. And as grail said, the task might be easier if you choose to select what to remove instead. |
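As a rough sketch of what that could look like in Python: the character class below copies the whitelist from the sed expression above and adds the Unicode Cyrillic block (U+0400 to U+04FF). The names DISALLOWED and filter_lines are hypothetical, and treating that one block as "all Cyrillic" is an assumption; the real requirement may need more or fewer characters.

```python
import re

# Same whitelist as the sed expression, plus the Cyrillic block U+0400-U+04FF.
# Anything NOT in this class is deleted. The trailing "-" is a literal hyphen.
DISALLOWED = re.compile(r"""[^a-zA-Z0-9/?:().,'+{}\n\r<>_"= \u0400-\u04FF-]""")

def filter_lines(lines):
    """Yield filtered lines; line 1 is left untouched, like sed's 2,$ address."""
    for number, line in enumerate(lines, start=1):
        yield line if number == 1 else DISALLOWED.sub("", line)

demo = [
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    "<Name>Обозначения использования №1 ©</Name>\n",
]
for out in filter_lines(demo):
    print(out, end="")
```

When reading and writing real files, open them with encoding="utf-8" (assuming the XML is UTF-8) so that Python sees characters rather than raw bytes.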
You will need to supply an example of what you are starting with and what the final output should be.
Also, please use the [code][/code] tags around your examples to maintain formatting and help with readability :) |
Hi,
We will have an XML output file which contains Cyrillic characters in addition to the characters allowed by the command below:
sed -e "2,$ s/[^a-zA-Z0-9/?:().,'+{}\n\r<>_"= -]//g" $cm_file > $work_file
In the XML file, we have a tag <Name> </Name>. Currently, only ASCII characters pass through it, e.g. <Name>ABCD109</Name>. Now the requirement says Cyrillic text will appear in this tag: <Name>Обозначения использования</Name>
Initially, I thought I could add a Cyrillic character range just like the ASCII ranges, but that does not seem to be possible. Can you please share how to modify this sed command, or create a new one, so that this tag also keeps the Cyrillic characters? Please help. :( Regards, Gurpreet Kaur |
So, here are the several problems I have with the last post:
1. No use of 'code' tags as requested
2. No before and after example
3. As mentioned, sed is the wrong tool for non-ASCII related work
New information:
1. sed is also the wrong tool to use to parse XML
So, once again, please provide an actual example, or be clearer about the example you have provided.
Code:
<Name>Обозначения использования</Name>
As mentioned by aragorn2101, Perl, Python or Ruby would be the better choices to parse XML data. |
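For what it is worth, a parser-based version might look something like the sketch below, using Python's standard xml.etree.ElementTree. The <Items> wrapper element, the function name clean_names and the allowed set (ASCII letters, digits, space and the Cyrillic block U+0400 to U+04FF) are all assumptions for illustration, since no full sample file was posted.

```python
import re
import xml.etree.ElementTree as ET

# Assumed allowed set for <Name> text: ASCII letters, digits, space, Cyrillic.
NOT_ALLOWED = re.compile(r"[^a-zA-Z0-9 \u0400-\u04FF]")

def clean_names(xml_text: str) -> str:
    """Parse the XML and strip disallowed characters from each <Name> text."""
    root = ET.fromstring(xml_text)
    for name in root.iter("Name"):
        if name.text:
            name.text = NOT_ALLOWED.sub("", name.text)
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")

# Hypothetical input; the real file layout was not shown in the thread.
doc = "<Items><Name>Обозначения использования™</Name><Name>ABCD109</Name></Items>"
print(clean_names(doc))
```

Because the parser separates markup from text, the filter only has to describe what is allowed inside <Name>, and characters such as < and > no longer need to be in the whitelist.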