LinuxQuestions.org

Gurpreet86 12-23-2016 07:51 AM

Using SED command to allow Cyrillic characters in XML file
 
I have a .prog script that filters an XML output file, keeping only ASCII letters, digits, and a fixed set of special characters.

A new requirement means Cyrillic text will also be printed in the XML output. I want my script to let those characters through as well.

Currently, I am using the sed command below:

sed -e "2,$ s/[^a-zA-Z0-9/?:().,'+{}\n\r<>_\"= -]//g" $cm_file > $work_file

Can anyone please help me with this?

Regards,
Gurpreet

grail 12-23-2016 08:59 AM

Using sed to parse XML files is pretty much always going to be a nightmare at some point. Your script seems to be looking for data to remove rather than what to keep. As your list of what to keep is growing, I would suggest changing it to select what to remove, or better yet use a language like Perl, Python, Ruby, etc., which can parse XML correctly and allow you to display what you want / need.

aragorn2101 12-24-2016 03:12 AM

Hi,

Yes, you will need a higher level programming language, like Perl or Python. sed is very powerful but the character set is ASCII. You can tell your system to use UTF-8 characters by changing the locale but you will need to set up a "dictionary" of these characters in order for a program to do comparison. This sounds more like a job in Python.

And as grail said, the task might be easier if you choose to select what to remove instead.
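
The "select what to remove" idea above could look roughly like the sketch below. The list of unwanted characters and the function name strip_unwanted are hypothetical, since the thread never states which characters must go. Running sed under LC_ALL=C makes it compare single bytes, so the ASCII-only class can never match part of a multibyte UTF-8 character and Cyrillic text passes through untouched.

```shell
#!/bin/sh
# Sketch only: delete a hypothetical blacklist instead of keeping a whitelist.
# The characters in the bracket expression are an assumption; adjust as needed.
# "2,$" leaves line 1 (the XML declaration) alone, as the original command does.
strip_unwanted() {
    LC_ALL=C sed -e '2,$ s/[!@#$%^&*|~;]//g'
}

# Possible usage with the variables from the original post:
#   strip_unwanted < "$cm_file" > "$work_file"
```

Because nothing outside the listed characters is touched, new alphabets need no further changes to the pattern.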

Gurpreet86 12-27-2016 01:45 AM

Quote:

Originally Posted by grail (Post 5645481)
Using sed to parse XML files is pretty much always going to be a nightmare at some point. Your script seems to be looking for data to remove rather than what to keep. As your list of what to keep is growing, I would suggest changing it to select what to remove, or better yet use a language like Perl, Python, Ruby, etc., which can parse XML correctly and allow you to display what you want / need.

I like your suggestion, but we have a limitation here: I cannot use any other language. If I go with selecting what to remove, can you please guide me on how to do that? I am not very comfortable with the sed command.

grail 12-27-2016 02:28 AM

You will need to supply an example of what you are starting with and what the final output should be.

Also, please use the [code][/code] tags around your examples to maintain formatting and help with readability :)

Gurpreet86 12-27-2016 02:42 AM

Hi,

We will have an XML output file that contains Cyrillic characters in addition to the characters allowed by the command below:

sed -e "2,$ s/[^a-zA-Z0-9/?:().,'+{}\n\r<>_\"= -]//g" $cm_file > $work_file

In the XML file, we have a tag <Name> </Name>. Currently, it carries ASCII characters only,

for example <Name>ABCD109</Name>.

Now, the requirement says Cyrillic text will appear in this tag:

<Name>Обозначения использования</Name>

Initially, I thought I could add a Cyrillic character range, the same way the ASCII ranges are listed, but that is not possible.

Can you please share how to modify this sed command or create a new one so that this tag also displays the Cyrillic characters? Please help.
:(

Regards,
Gurpreet Kaur
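
One way the sed command in this post might be extended is sketched below. It is not a confirmed answer from the thread. It assumes GNU sed (for the \xHH escapes) and UTF-8 encoded input, and the function name keep_ascii_and_utf8 is made up for illustration. Under LC_ALL=C sed compares single bytes, and every byte of a multibyte UTF-8 character lies in the range 0x80-0xFF, so adding that range to the existing whitelist lets Cyrillic through.

```shell
#!/bin/sh
# Sketch only: the original whitelist plus all bytes from 0x80 to 0xFF.
# Assumes GNU sed and UTF-8 input. The embedded double quote is escaped
# so that the shell passes the whole script to sed as one argument.
keep_ascii_and_utf8() {
    LC_ALL=C sed -e "2,\$ s/[^a-zA-Z0-9/?:().,'+{}\n\r<>_\"= \x80-\xff-]//g"
}

# Possible usage with the variables from the original post:
#   keep_ascii_and_utf8 < "$cm_file" > "$work_file"
```

The trade-off is that this keeps every non-ASCII character, not only Cyrillic, because a negated byte class cannot single out one script.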

grail 12-27-2016 04:38 AM

So, here are several problems I have with the last post:

1. No use of 'code' tags as requested

2. No before and after example

3. As mentioned, sed is the wrong tool for non-ASCII related work

New information:

1. sed is also the wrong tool to use to parse XML

So, once again, please provide an actual example or be clearer about the example you have provided.
Code:

<Name>Обозначения использования</Name>
Using the above as a possible example, what exactly is wrong with the data shown? What exactly are you trying to remove / keep?

As mentioned by aragorn2101, perl, python or ruby would be the better choices to parse XML data.


All times are GMT -5.