LinuxQuestions.org
Old 05-25-2012, 04:46 AM   #1
hanae
Member
 
Registered: May 2012
Posts: 33

Rep: Reputation: Disabled
Extract tokens from XML file


I have an XML file in this format:
<?xml version="1.0"?>
<!DOCTYPE LCTL_TEXT SYSTEM "ltf.v1.2.dtd">
<LCTL_TEXT lang="BER" source_file="articles1.ltf.xml" source_type="web_news" author="LDC" encoding="UTF-8">
<DOC id="articles1" lang="BER">
<TEXT>
<SEG id="articles1revu-1" start_char="0" end_char="27">
<ORIGINAL_TEXT>I have a book</ORIGINAL_TEXT>
<TOKEN id="articles1revu-1-1" start_char="0" end_char="2">I</TOKEN>
<TOKEN id="articles1revu-1-2" start_char="4" end_char="10">have</TOKEN>
<TOKEN id="articles1revu-1-4" start_char="14" end_char="18">a</TOKEN>
<TOKEN id="articles1revu-1-5" start_char="20" end_char="20">book</TOKEN>
I want the result to be:
I
have
a
book
I have several files in this format. I want to go through them all and extract only the tokens.
Is there a way to do this using grep?

Thank you
 
Old 05-25-2012, 04:57 AM   #2
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Have a look at this thread
http://www.linuxquestions.org/questi...estion-945390/

It's possible to do it in sed (you can adapt my solution), but as other LQ members stated there, you're better off using an XML parser.

Last edited by sycamorex; 05-25-2012 at 05:00 AM.
 
Old 05-25-2012, 05:11 AM   #3
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Yes, I tried this shell script to first extract the lines that contain TOKEN and then apply the sed command, but it doesn't work:
Quote:
for this in *.txt;
do
grep TOKEN, sed's/\(.[^"]*\)"\(.[^"]*\)"\(.[^"]*\)"\(.[^"]*\)"\(.[^>]*\)>\(.[^<]*\).*/\6/' $this > "$this.$$"
mv "$this.$$" "$this"

done
I want to do both at the same time.
 
Old 05-25-2012, 05:30 AM   #4
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Okay, if you do want to go the sed way...

1. First check that the sed command does what you want on its own. Don't write a bash script straight away.
2. The sed command is malformed ("sed's"?!)
3. You don't have to create a temporary file for permanent changes (see sed's -i flag)
4. 'grep TOKEN' is not necessary. Sed can do it for you. Besides, there should be a pipe (|) symbol between them, not a comma (,)

eg.

Code:
sed '/TOKEN/ s/oldpattern/newpattern/' infile
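
As a rough sketch of that pattern applied to the sample from post #1: the address /<TOKEN/ selects the lines to work on, and -n together with the p flag prints only the lines where the substitution succeeded. The scratch directory, the file name sample.ltf.xml and the regular expression are illustrative assumptions, not something posted in this thread, and the approach only holds while every TOKEN element sits on its own line, as in the sample.

```shell
# Hypothetical sample file mirroring the TOKEN lines from post #1.
workdir=$(mktemp -d)
cat > "$workdir/sample.ltf.xml" <<'EOF'
<SEG id="articles1revu-1" start_char="0" end_char="27">
<ORIGINAL_TEXT>I have a book</ORIGINAL_TEXT>
<TOKEN id="articles1revu-1-1" start_char="0" end_char="2">I</TOKEN>
<TOKEN id="articles1revu-1-2" start_char="4" end_char="10">have</TOKEN>
<TOKEN id="articles1revu-1-4" start_char="14" end_char="18">a</TOKEN>
<TOKEN id="articles1revu-1-5" start_char="20" end_char="20">book</TOKEN>
</SEG>
EOF

# -n suppresses sed's default output. On lines containing "<TOKEN",
# strip the opening and closing tags and print (p) what is left.
tokens=$(sed -n '/<TOKEN/ s/.*<TOKEN[^>]*>\([^<]*\)<\/TOKEN>.*/\1/p' "$workdir/sample.ltf.xml")
printf '%s\n' "$tokens"
```

This prints to the terminal and leaves the input file untouched, which makes it easy to check the expression before trying the -i flag from point 3.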

Last edited by sycamorex; 05-25-2012 at 05:31 AM.
 
Old 05-25-2012, 05:48 AM   #5
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Yeah, I tried it this way, but it still doesn't give the right result:
Quote:
for this in *.txt;
do
grep TOKEN | sed -i 's/^<.*>\([^<].*\)<.*>$/\1/'
done
Is there anything wrong?

Thanks
 
Old 05-25-2012, 06:45 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7308
Yes, grep needs a filename:
grep TOKEN $this | .....
 
Old 05-25-2012, 06:51 AM   #7
Nylex
LQ Addict
 
Registered: Jul 2003
Location: London, UK
Distribution: Slackware
Posts: 7,464

Rep: Reputation: Disabled
Do you really need to do this using tools like sed and grep? Python's ElementTree, for example, would be perfect for this.
 
Old 05-25-2012, 06:53 AM   #8
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
I still couldn't get the correct result!
 
Old 05-25-2012, 06:58 AM   #9
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Thank you Nylex for your suggestion, but I am not familiar with Python :s

Thanks
 
Old 05-25-2012, 06:58 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7308
Quote:
Originally Posted by hanae View Post
I still couldn't get the correct result!
You could try to fix it yourself:

grep TOKEN $this | sed 's/^<.*>\([^<].*\)<.*>$/\1/' > $this

sed -i cannot be used with pipes.
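
One thing to watch with a pipeline like the one above: the shell sets up the "> $this" redirection, truncating that file, as the pipeline starts, so reading from and writing to the same file will typically leave it empty. A minimal sketch that writes to a temporary file first and loops over several files follows. The scratch directory, the file names one.txt and two.txt, the ".tmp.$$" suffix and the slightly tightened sed expression are illustrative assumptions.

```shell
# Work in a scratch directory with two hypothetical input files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '<ORIGINAL_TEXT>I have</ORIGINAL_TEXT>' \
  '<TOKEN id="a-1" start_char="0" end_char="0">I</TOKEN>' \
  '<TOKEN id="a-2" start_char="2" end_char="5">have</TOKEN>' > one.txt
printf '%s\n' '<TOKEN id="b-1" start_char="0" end_char="3">book</TOKEN>' > two.txt

for this in *.txt; do
  # Write to a temporary file first, then replace the original only if
  # sed exited successfully. Redirecting straight to "$this" would
  # empty the input before grep could read it.
  grep '<TOKEN' "$this" | sed 's/^<[^>]*>\([^<]*\)<.*>$/\1/' > "$this.tmp.$$" \
    && mv "$this.tmp.$$" "$this"
done

cat one.txt two.txt
```

Because this overwrites the originals, it is safer to run it on copies of the files until the output looks right.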
 
Old 05-25-2012, 07:06 AM   #11
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Thank you very much, pan64. Indeed, I tried to fix it, but I still couldn't get the correct result; it just does nothing to the file.

Thanks
 
Old 05-25-2012, 07:11 AM   #12
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
No, no ... heed thou the sage advice to use an XML parsing package, and a real programming language to drive it. This is not a suitable application for "a bash script."

For example, look at this link ... http://search.cpan.org/~shlomif/XML-....98/LibXML.pod ... which happens to be a description of a Perl-language binding for the libxml2 library. My purpose here is to point out, if you follow the related links from this page, (a) what a sturdy library like libxml2 can do, and (b) how you do not need to write a C program to use it. If you follow links about "XPath expressions," and if you google for "XSLT stylesheets," you'll begin to get an idea of just what fully-tested options are at your disposal.

Don't "monkey around" trying to "cobble up" a "solution" that will wind up just wasting you a lot of time while producing a decidedly inferior, if not utterly useless bag of code. (And mind you, I am saying this bluntly but in the nicest and most respectful way possible.) There's a big fat deep river in that direction, and no bridge.
 
Old 05-25-2012, 07:21 AM   #13
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Thank you, but I believe there can be an easy way to do it using sed and grep.

Thanks
 
Old 05-25-2012, 09:07 AM   #14
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by hanae View Post
Thank you, but I believe there can be an easy way to do it using sed and grep.

Thanks
I'm not at a Linux computer now (still at work), so I can't test any sed solutions, but perhaps you should listen to the majority of the posters in this thread:
Code:
sed/grep are not tools meant to be used for parsing XML documents
 
Old 05-25-2012, 09:16 AM   #15
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
OK, maybe you are right. Do you know of any easy XML parser that I can use?
 
  

