LinuxQuestions.org
Old 12-23-2016, 07:51 AM   #1
Gurpreet86
LQ Newbie
 
Registered: Dec 2016
Posts: 3

Rep: Reputation: Disabled
Using SED command to allow Cyrillic characters in XML file


I have a .prog script that filters an XML output file, keeping only ASCII letters, digits, and a few special characters.

Now, under a new requirement, the XML output will also contain Cyrillic text, and I want my script to let those characters through as well.

Currently, I am using the sed command below:

sed -e "2,$ s/[^a-zA-Z0-9/?).,'+{}\n\r<>_\"= -]//g" $cm_file > $work_file

Can anyone please help me with this?

Regards,
Gurpreet
 
Old 12-23-2016, 08:59 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
Using sed to parse XML files is pretty much always going to be a nightmare at some point. Your script lists what to keep and removes everything else. As your list of what to keep is growing, I would suggest changing it to select what to remove, or better yet use a language like Perl, Python, Ruby, etc., which can parse XML correctly and allow you to display what you want / need.
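
A minimal sketch of the "select what to remove" idea, assuming the file is UTF-8 and the unwanted characters are a handful of ASCII punctuation marks. The list below is illustrative and not taken from the original script, and the function name strip_unwanted is hypothetical.

```shell
# strip_unwanted FILE: delete an explicit list of unwanted ASCII characters
# from line 2 onwards and print the result. (Hypothetical helper.)
strip_unwanted() {
    # LC_ALL=C makes sed compare single bytes. The bytes that encode
    # Cyrillic letters in UTF-8 are all >= 0x80, so they never match
    # the ASCII characters in the remove-list.
    # "2,$" leaves line 1 (typically the XML declaration) untouched.
    LC_ALL=C sed -e '2,$ s/[!#$%&(*:;@^`|~]//g' "$1"
}

# Small demonstration on a throwaway file.
tmp=$(mktemp)
printf '%s\n' '<?xml version="1.0"?>' \
              '<Name>Обозначения #использования!</Name>' > "$tmp"
strip_unwanted "$tmp"
# prints:
#   <?xml version="1.0"?>
#   <Name>Обозначения использования</Name>
rm -f "$tmp"
```

With a remove-list, anything not named in the list passes through, so new alphabets need no change to the command.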
 
2 members found this post helpful.
Old 12-24-2016, 03:12 AM   #3
aragorn2101
Member
 
Registered: Dec 2012
Location: Mauritius
Distribution: Slackware
Posts: 567

Rep: Reputation: 301
Hi,

Yes, you will need a higher-level programming language, like Perl or Python. sed is very powerful but the character set is ASCII. You can tell your system to use UTF-8 characters by changing the locale, but you will need to set up a "dictionary" of these characters so that a program can do the comparison. This sounds more like a job for Python.

And as grail said, the task might be easier if you choose to select what to remove instead.
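
On the locale point, one possible way to keep the existing keep-list without any locale setup, assuming GNU sed and a UTF-8 encoded file: run sed in the C locale, where every byte counts as one character, and add the byte range 0x80-0xFF to the list. Every byte of a multi-byte UTF-8 character falls in that range. This is only a sketch: keep_with_non_ascii is a hypothetical name, the ASCII part of the list is copied from the command in post #1, and it lets all non-ASCII text through, not just Cyrillic.

```shell
# keep_with_non_ascii FILE: keep the characters from the original list
# plus every non-ASCII byte, from line 2 onwards. (Hypothetical helper.)
keep_with_non_ascii() {
    # Carriage return plus the byte range 0x80-0xFF (octal 200-377),
    # built with printf so the script itself stays plain ASCII.
    extra=$(printf '\r\200-\377')
    # "|" is the s-command delimiter so the "/" in the list needs no escaping.
    # The final "-" stays last in the bracket so it is taken literally.
    LC_ALL=C sed -e "2,\$ s|[^a-zA-Z0-9/?).,'+{}<>_\"= ${extra}-]||g" "$1"
}

# Small demonstration on a throwaway file.
tmp=$(mktemp)
printf '%s\n' '<?xml version="1.0"?>' \
              '<Name>Обозначения #использования!</Name>' > "$tmp"
keep_with_non_ascii "$tmp"
# prints:
#   <?xml version="1.0"?>
#   <Name>Обозначения использования</Name>
rm -f "$tmp"
```

In a UTF-8 locale, GNU sed's [[:alpha:]] class should also match Cyrillic letters, but that depends on the locale being installed on the machine.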
 
Old 12-27-2016, 01:45 AM   #4
Gurpreet86
LQ Newbie
 
Registered: Dec 2016
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by grail
Using sed to parse XML files is pretty much always going to be a nightmare at some point. Your script lists what to keep and removes everything else. As your list of what to keep is growing, I would suggest changing it to select what to remove, or better yet use a language like Perl, Python, Ruby, etc., which can parse XML correctly and allow you to display what you want / need.
I like your suggestion, but we have a limitation here: I cannot use any other language. If I go for selecting what to remove, could you please guide me on how to do that? I am not very comfortable with sed.
 
Old 12-27-2016, 02:28 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
You will need to supply an example of what you are starting with and what the final output should be.

Also, please use the [code][/code] tags around your examples to maintain formatting and help with readability.
 
Old 12-27-2016, 02:42 AM   #6
Gurpreet86
LQ Newbie
 
Registered: Dec 2016
Posts: 3

Original Poster
Rep: Reputation: Disabled
Hi,

We will have an XML output file that contains Cyrillic characters in addition to the allowed characters listed in the command below:

sed -e "2,$ s/[^a-zA-Z0-9/?).,'+{}\n\r<>_\"= -]//g" $cm_file > $work_file

In the XML file, we have a tag <Name> </Name>. Currently, it carries ASCII characters only, for example:

<Name>ABCD109</Name>

Now the requirement says Cyrillic text will appear in this tag:

<Name>Обозначения использования</Name>

Initially, I thought I could add a Cyrillic character range, as with the ASCII ranges, but that is not possible.

Can you please share how to modify this sed command or create a new one so that this tag also displays the Cyrillic characters? Please help.


Regards,
Gurpreet Kaur
 
Old 12-27-2016, 04:38 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
So, here are the problems I have with the last post:

1. No use of 'code' tags as requested

2. No before and after example

3. As mentioned, sed is the wrong tool for non-ASCII related work

New information:

1. sed is also the wrong tool to use to parse XML

So, once again, please provide an actual example or be clearer about the example you have provided.
Code:
<Name>Обозначения использования</Name>
Using the above as a possible example, what exactly is wrong with the data shown? What exactly are you trying to remove / keep?

As mentioned by aragorn2101, perl, python or ruby would be the better choices to parse XML data.
 
  

