LinuxQuestions.org
Old 04-03-2018, 02:41 AM   #1
ranjitabraham
LQ Newbie
 
Registered: Mar 2018
Posts: 3

Rep: Reputation: Disabled
Regular expression to select the field correctly which has a - in it


Hello All,
I have an XML file like the one below and I need a regular expression to select each <todo-item> in it. I wrote the expression like this: ([\r\n]+)(?=\s*\<todo-item\>)

but I think the - between todo and item is stopping it from matching correctly. Can anyone shed some light on how to change the regex?

PHP Code:
<todo-items type="array">
 <todo-item>
  <project-id type="integer">353705</project-id>
  <tasklist-istemplate type="boolean">false</tasklist-istemplate>
  <hastickets type="boolean">false</hastickets>
  <order type="integer">2003</order>
  <comments-count type="integer">0</comments-count>
  <created-on type="date">2018-02-21T06:26:43Z</created-on>
  <canedit type="boolean">true</canedit>
  <has-predecessors type="integer">0</has-predecessors>
  <id type="integer">17223695</id>
  <completed type="boolean">false</completed>
  <position type="integer">2003</position>
  <estimated-minutes type="integer">0</estimated-minutes>
  <description/>
  <progress type="integer">0</progress>
  <harvest-enabled type="boolean">false</harvest-enabled>
  <parenttaskid type="integer">17223687</parenttaskid>
  <responsible-party-lastname>xxx</responsible-party-lastname>
  <company-id type="integer">103131</company-id>
  <creator-id type="integer">316954</creator-id>
  <project-name>asdfasdfasdf</project-name>
  <start-date type="integer">20180403</start-date>
  <tasklist-private type="boolean">true</tasklist-private>
  <lockdownid type="integer">806894</lockdownid>
  <cancomplete type="boolean">true</cancomplete>
  <responsible-party-id>317122,221525,316954</responsible-party-id>
  <creator-lastname>asdfasdfsdf</creator-lastname>
  <has-reminders type="boolean">false</has-reminders>
  <has-unread-comments type="boolean">false</has-unread-comments>
  <todo-list-name>Phase Two</todo-list-name>
  <due-date-base type="integer">20180403</due-date-base>
  <private type="integer">2</private>
  <userfollowingcomments type="boolean">false</userfollowingcomments>
  <responsible-party-summary>You 2 others</responsible-party-summary>
  <status>new</status>
  <todo-list-id type="integer">1533948</todo-list-id>
  <predecessors type="array"/>
  <tags type="array"/>
  <content>ffdddffdfdfdfdfdfdfdfdfdfd</content>
  <responsible-party-type>Person</responsible-party-type>
  <company-name>as dfcsdfsdfs</company-name>
  <creator-firstname>asdfasdfasdfasdf</creator-firstname>
  <last-changed-on type="date">2018-03-29T10:55:28Z</last-changed-on>
  <due-date type="integer">20180403</due-date>
  <has-dependencies type="integer">2</has-dependencies>
  <attachments-count type="integer">0</attachments-count>
  <userfollowingchanges type="boolean">false</userfollowingchanges>
  <priority/>
  <responsible-party-firstname>asdfasdfasdf</responsible-party-firstname>
  <viewestimatedtime type="boolean">true</viewestimatedtime>
  <responsible-party-ids>317122,221525,316954</responsible-party-ids>
  <responsible-party-names>cdcdcdcdcdcdcdcd</responsible-party-names>
  <tasklist-lockdownid type="integer">806894</tasklist-lockdownid>
  <timeislogged type="integer">0</timeislogged>
 </todo-item>
 
Old 04-03-2018, 03:13 AM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 16,855

Rep: Reputation: 2518
You haven't said which regex engine or which tool you are using. Are we to guess PHP?

In general, parsing XML yourself is pointless; use one of the appropriate tools. No regex engine I use cares at all about a minus sign unless it is inside a bracket expression.
Why do you require line feeds on input? Normal stream tools strip the line feed. I would remove the first subexpression completely, but I don't do PHP ...
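
To illustrate the point about the minus sign, here is a minimal sketch using Python's `re` module (an assumption for demonstration only; it is not the engine the OP is using, and the sample strings are made up). It shows the hyphen matching literally outside brackets, forming a range inside them, and being literal again when placed last in the brackets.

```python
import re

tag = "<todo-item>"

# Outside a bracket expression the hyphen is an ordinary character.
literal = re.search(r"<todo-item>", tag)

# Inside brackets, a hyphen between two characters forms a range:
# [a-c] means a, b or c.
range_hits = re.findall(r"[a-c]", "a-b-c-d")

# Placed last inside the brackets, the hyphen is literal again.
literal_hits = re.findall(r"[ac-]", "a-b-c-d")

print(literal.group(0))
print(range_hits)
print(literal_hits)
```

So a tag name such as todo-item needs no special treatment for its hyphen in a pattern like the one posted.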
 
1 member found this post helpful.
Old 04-03-2018, 03:17 AM   #3
ranjitabraham
LQ Newbie
 
Registered: Mar 2018
Posts: 3

Original Poster
Rep: Reputation: Disabled
Sorry for the confusion. I am actually importing this XML into Splunk, so my props.conf looks like this:
[project]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)(?=\s*\<todo-item\>)
DATETIME_CONFIG = CURRENT
KV_MODE = xml

With this I am able to ingest other XML files without any issue. I was just thinking the data is not being split because of the - in the tag name; that's why I posted the question. Again, apologies for the confusion.
Thanks
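
For anyone who wants to test the pattern outside Splunk, here is a rough sketch in Python. It rests on two assumptions: that LINE_BREAKER ends one event where capture group 1 starts, discards the group's text, and begins the next event after it (my reading of the Splunk documentation), and that Python's `re` behaves like Splunk's engine for this particular pattern. The `break_events` helper and the sample data are hypothetical.

```python
import re

# The pattern from props.conf above. Python's re treats \< and \> as
# literal < and >; Splunk's engine is assumed to do the same.
LINE_BREAKER = re.compile(r"([\r\n]+)(?=\s*\<todo-item\>)")


def break_events(raw: str) -> list:
    """Split raw text at each match, dropping the text of capture group 1."""
    events = []
    start = 0
    for m in LINE_BREAKER.finditer(raw):
        events.append(raw[start:m.start(1)])
        start = m.end(1)
    events.append(raw[start:])
    return events


sample = (
    '<todo-items type="array">\n'
    ' <todo-item>\n'
    '  <id type="integer">1</id>\n'
    ' </todo-item>\n'
    ' <todo-item>\n'
    '  <id type="integer">2</id>\n'
    ' </todo-item>\n'
    '</todo-items>\n'
)

events = break_events(sample)
for number, event in enumerate(events):
    print(number, repr(event))

# With no newline in front of <todo-item>, the pattern has nothing to match.
one_line = "<todo-items><todo-item/><todo-item></todo-item></todo-items>"
print(len(break_events(one_line)))
```

In this sketch the hyphenated tag name does not get in the way: the sample splits in front of each <todo-item> line. The pattern only fires when a line break comes before the tag, which may be worth checking in the real file.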
 
Old 04-03-2018, 03:22 AM   #4
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Ubuntu, Devuan, OpenBSD
Posts: 3,275
Blog Entries: 3

Rep: Reputation: 1445
Agreed. It is quite pointless to try to manage XML with pure regex; a proper parser is needed for that. There are several easy-to-use, mature XML parsers on CPAN: see XML::TreeBuilder, XML::XPathEngine, or XML::Twig.

There are also several standalone XML parsers also based on XPath. xmllint is one.
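
As a sketch of the parser approach, the following uses Python's standard `xml.etree.ElementTree` rather than the CPAN modules named above (a substitution for illustration; the shortened sample is taken from the first post). It selects every `<todo-item>` by name and collects its child elements into a dictionary.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the XML in the first post.
xml_text = """<todo-items type="array">
 <todo-item>
  <project-id type="integer">353705</project-id>
  <id type="integer">17223695</id>
  <status>new</status>
  <todo-list-name>Phase Two</todo-list-name>
 </todo-item>
</todo-items>"""

root = ET.fromstring(xml_text)

# findall() takes the tag name as-is; the hyphen needs no escaping.
items = []
for item in root.findall("todo-item"):
    items.append({child.tag: child.text for child in item})

print(items)
```

The parser takes care of whitespace, attributes and nesting, so none of that has to be encoded in a pattern.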
 
1 member found this post helpful.
Old 04-03-2018, 03:35 AM   #5
Michael Uplawski
Member
 
Registered: Dec 2015
Location: Normandy, France
Distribution: Debian buster/sid
Posts: 669
Blog Entries: 21

Rep: Reputation: 419
Quote:
Originally Posted by Turbocapitalist
There are also several standalone XML parsers also based on XPath. xmllint is one.
... the nokogiri standalone executable is another.

I am responding, however, because I have made the same mistake a few times and tried to parse XML with regexes. Done the way the OP describes, and as the other contributors have commented, it is indeed pointless.

An XML parser, on the other hand, is called an XML parser because it parses XML. Think about it.
 
Old 04-03-2018, 05:31 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 16,855

Rep: Reputation: 2518
Quote:
Originally Posted by ranjitabraham
I am actually importing this XML into Splunk.
... I am able to ingest other XML files without any issue.
Sorry, I briefly looked at splunk when it first emerged years ago, but don't use it.
I just had a brief look at the regex doco - interesting; I've not seen \r and \n used like that as anchors before. Can't help - hopefully others with splunk experience can assist.
 
Old 04-03-2018, 09:09 AM   #7
MadeInGermany
Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 832

Rep: Reputation: 364
Whatever RE engine it is, a literal < and > should not be escaped.
\<todo-item\> should be <todo-item>
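
A small check, again with Python's `re` as a stand-in engine (an assumption, not the OP's tool): here both spellings match the same text, so the backslashes are merely redundant. In GNU grep and sed, however, `\<` and `\>` are word-boundary anchors rather than literal angle brackets, which is a good reason to leave them unescaped. How Splunk's engine treats them has not been established in this thread.

```python
import re

line = " <todo-item>"

# Escaped and unescaped angle brackets, as discussed above.
escaped = re.search(r"\<todo-item\>", line)
plain = re.search(r"<todo-item>", line)

print(escaped.group(0), escaped.span())
print(plain.group(0), plain.span())
```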
 
Old 04-03-2018, 10:47 AM   #8
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 9,078
Blog Entries: 4

Rep: Reputation: 3169
While you can backslash-escape a literal dash, Turbocapitalist's admonition to "use a real XML parser" is a very sound one.

There are two general approaches. One reads the XML and builds an in-memory data structure. The other parses the XML and, while doing so, makes subroutine-calls to back-end routines of your own devising. Both are "known good" tools for handling the vagaries of XML.
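
The second, callback-driven approach can be sketched with Python's standard `xml.sax` module (chosen for illustration; the `TodoHandler` class and the sample data are hypothetical). The parser walks the XML and calls the handler's methods as it meets each start tag, run of text and end tag; the handler here simply collects the `<id>` of every `<todo-item>`.

```python
import xml.sax


class TodoHandler(xml.sax.ContentHandler):
    """Collect the text of each <id> that sits directly inside a <todo-item>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.ids = []     # collected id values
        self._text = []   # text pieces seen since the last start tag

    def startElement(self, name, attrs):
        self.stack.append(name)
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if self.stack[-2:] == ["todo-item", "id"]:
            self.ids.append("".join(self._text).strip())
        self.stack.pop()
        self._text = []


xml_bytes = b"""<todo-items type="array">
 <todo-item>
  <id type="integer">17223695</id>
  <status>new</status>
 </todo-item>
 <todo-item>
  <id type="integer">17223696</id>
  <status>new</status>
 </todo-item>
</todo-items>"""

handler = TodoHandler()
xml.sax.parseString(xml_bytes, handler)
print(handler.ids)
```

The tree-building approach holds the whole document in memory, while the callback style only keeps what the handler chooses to keep, which can matter for very large files.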
 
Old 04-03-2018, 07:12 PM   #9
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: Slackware
Posts: 8,293

Rep: Reputation: 3369
Obligatory:

Parsing Html The Cthulhu Way
 
1 member found this post helpful.
  



Tags
regex

