LinuxQuestions.org


ranjitabraham 04-03-2018 02:41 AM

Regular expression to select the field correctly which has a - in it
 
Hello All,
I have an XML file like this and I need a regular expression to select the <todo-item> element from it. I wrote the expression like this: ([\r\n]+)(?=\s*\<todo-item\>)

but I think the - between todo and item is causing it to not match correctly. Can anyone shed some light on how to change the regex?

PHP Code:

<todo-items type="array">
 <todo-item>
 <project-id type="integer">353705</project-id>
 <tasklist-istemplate type="boolean">false</tasklist-istemplate>
 <hastickets type="boolean">false</hastickets>
 <order type="integer">2003</order>
 <comments-count type="integer">0</comments-count>
 <created-on type="date">2018-02-21T06:26:43Z</created-on>
 <canedit type="boolean">true</canedit>
 <has-predecessors type="integer">0</has-predecessors>
 <id type="integer">17223695</id>
 <completed type="boolean">false</completed>
 <position type="integer">2003</position>
 <estimated-minutes type="integer">0</estimated-minutes>
 <description/>
 <progress type="integer">0</progress>
 <harvest-enabled type="boolean">false</harvest-enabled>
 <parenttaskid type="integer">17223687</parenttaskid>
 <responsible-party-lastname>xxx</responsible-party-lastname>
 <company-id type="integer">103131</company-id>
 <creator-id type="integer">316954</creator-id>
 <project-name>asdfasdfasdf</project-name>
 <start-date type="integer">20180403</start-date>
 <tasklist-private type="boolean">true</tasklist-private>
 <lockdownid type="integer">806894</lockdownid>
 <cancomplete type="boolean">true</cancomplete>
 <responsible-party-id>317122,221525,316954</responsible-party-id>
 <creator-lastname>asdfasdfsdf</creator-lastname>
 <has-reminders type="boolean">false</has-reminders>
 <has-unread-comments type="boolean">false</has-unread-comments>
 <todo-list-name>Phase Two</todo-list-name>
 <due-date-base type="integer">20180403</due-date-base>
 <private type="integer">2</private>
 <userfollowingcomments type="boolean">false</userfollowingcomments>
 <responsible-party-summary>You 2 others</responsible-party-summary>
 <status>new</status>
 <todo-list-id type="integer">1533948</todo-list-id>
 <predecessors type="array"/>
 <tags type="array"/>
 <content>ffdddffdfdfdfdfdfdfdfdfdfd</content>
 <responsible-party-type>Person</responsible-party-type>
 <company-name>as dfcsdfsdfs</company-name>
 <creator-firstname>asdfasdfasdfasdf</creator-firstname>
 <last-changed-on type="date">2018-03-29T10:55:28Z</last-changed-on>
 <due-date type="integer">20180403</due-date>
 <has-dependencies type="integer">2</has-dependencies>
 <attachments-count type="integer">0</attachments-count>
 <userfollowingchanges type="boolean">false</userfollowingchanges>
 <priority/>
 <responsible-party-firstname>asdfasdfasdf</responsible-party-firstname>
 <viewestimatedtime type="boolean">true</viewestimatedtime>
 <responsible-party-ids>317122,221525,316954</responsible-party-ids>
 <responsible-party-names>cdcdcdcdcdcdcdcd</responsible-party-names>
 <tasklist-lockdownid type="integer">806894</tasklist-lockdownid>
 <timeislogged type="integer">0</timeislogged>
 </todo-item>


syg00 04-03-2018 03:13 AM

You haven't said which regex engine or which tool you are using. Are we to guess PHP?

In general, parsing XML yourself is pointless; use one of the appropriate tools. No regex engine I use cares at all about a minus sign unless it is inside a bracket expression.
Why do you require line feeds on input? Normal stream tools strip the line feed. I would remove the first subexpression completely, but I don't do PHP ...

ranjitabraham 04-03-2018 03:17 AM

Sorry for the confusion. I am actually importing this XML into Splunk. My props.conf looks like this:
[project]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)(?=\s*\<todo-item\>)
DATETIME_CONFIG = CURRENT
KV_MODE = xml

With this I am able to ingest other XML files without any issue. I was just thinking the data is not being split because of the - in between. That's why I posted the question. Again, apologies for the confusion.
Thanks
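
One way to check whether the hyphen is to blame is to run the same pattern outside Splunk. The sketch below uses Python's re module, which is not the engine Splunk uses but, as far as I know, treats this particular pattern the same way. The sample text and the count_breaks helper are made up for illustration.

```python
import re

# The LINE_BREAKER pattern from the props.conf above, unchanged.
LINE_BREAKER = r"([\r\n]+)(?=\s*\<todo-item\>)"

# Hypothetical two-record sample with one tag per line.
sample = (
    '<todo-items type="array">\n'
    " <todo-item>\n"
    "  <id>1</id>\n"
    " </todo-item>\n"
    " <todo-item>\n"
    "  <id>2</id>\n"
    " </todo-item>\n"
    "</todo-items>\n"
)


def count_breaks(text):
    """Count the positions where the pattern would split the text."""
    return len(re.findall(LINE_BREAKER, text))


# One break in front of each <todo-item>, hyphen and all.
print(count_breaks(sample))

# The same data with every newline removed: the pattern needs at least
# one \r or \n before <todo-item>, so nothing can match here.
print(count_breaks(sample.replace("\n", "")))
```

If the first count equals the number of records, the hyphen is not what stops the match. The second case shows that the pattern depends on a line break directly before each <todo-item>, so a file delivered as a single line would not be split by it.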

Turbocapitalist 04-03-2018 03:22 AM

Agreed. It is quite pointless to use pure regex to try to manage XML. A proper parser is needed for that. You have several easy-to-use, mature XML parsers in CPAN. See XML::TreeBuilder, XML::XPathEngine, or XML::Twig there on CPAN.

There are also several standalone XML parsers based on XPath. xmllint is one.
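
As a rough illustration of the parser route, here is a minimal sketch in Python rather than Perl, using the standard library's xml.etree.ElementTree instead of the CPAN modules named above. The XML is a cut-down, hypothetical version of the record from the first post.

```python
import xml.etree.ElementTree as ET

# Cut-down sample based on the XML in the original question.
xml_text = """<todo-items type="array">
 <todo-item>
  <id type="integer">17223695</id>
  <todo-list-name>Phase Two</todo-list-name>
  <status>new</status>
 </todo-item>
</todo-items>"""

root = ET.fromstring(xml_text)

# Hyphenated tag names need no special treatment in the path.
items = []
for item in root.findall("todo-item"):
    # Map each child tag to its text content.
    items.append({child.tag: child.text for child in item})

print(items)
```

The parser handles whitespace, attributes and self-closing tags on its own, which is exactly the part a hand-written regex tends to get wrong.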

Michael Uplawski 04-03-2018 03:35 AM

Quote:

Originally Posted by Turbocapitalist (Post 5838742)
There are also several standalone XML parsers based on XPath. xmllint is one.

... the nokogiri standalone executable is another.

I am responding, however, because I have made the same error a few times and tried to parse XML with regexes. The way the OP describes it, and as the other contributors have commented, it is pointless indeed.

An XML parser, on the other hand, is called an XML parser because it parses XML. Think about it.

syg00 04-03-2018 05:31 AM

Quote:

Originally Posted by ranjitabraham (Post 5838741)
I am actually importing this XML into Splunk.
... I am able to ingest other XML files without any issue.

Sorry, I briefly looked at Splunk when it first emerged years ago, but don't use it.
I just had a brief look at the regex doco - interesting; I've not seen \r and \n used like that as anchors before. Can't help - hopefully others with Splunk experience can assist.

MadeInGermany 04-03-2018 09:09 AM

Whatever RE engine it is, a literal < and > should not be escaped.
\<todo-item\> should be <todo-item>
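
How much the backslashes matter depends on the engine. In Perl-style engines a backslash before punctuation generally just yields the literal character, while in GNU grep and sed \< and \> are word-boundary anchors, which is one reason to leave them off. A small sketch in Python's re module (Perl-style in this respect; the test line is made up) shows the escaped and unescaped forms matching the same text:

```python
import re

# Hypothetical input line containing the tag from the question.
line = " <todo-item>"

# Escaped angle brackets, as in the original pattern.
escaped = re.search(r"\<todo-item\>", line)
# Unescaped angle brackets, as suggested above.
plain = re.search(r"<todo-item>", line)
# An escaped hyphen is also just a literal hyphen outside brackets.
dash = re.search(r"<todo\-item>", line)

print(escaped.group(0), plain.group(0), dash.group(0))
```

So in an engine of this kind the escapes are redundant rather than harmful, and dropping them is the portable choice.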

sundialsvcs 04-03-2018 10:47 AM

While you can backslash-escape a literal dash, Turbocapitalist's admonition to "use a real XML parser" is a very sound one.

There are two general approaches. One reads the XML and builds an in-memory data structure. The other parses the XML and, while doing so, makes subroutine calls to back-end routines of your own devising. Both are "known good" tools for handling the vagaries of XML.
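
The two approaches can be sketched side by side. This is only an illustration in Python, using xml.dom.minidom for the in-memory structure and xml.sax for the callback style; the sample data and the StatusHandler class are hypothetical.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical two-record sample.
xml_text = (
    '<todo-items type="array">'
    "<todo-item><status>new</status></todo-item>"
    "<todo-item><status>done</status></todo-item>"
    "</todo-items>"
)

# Approach 1: build the whole document in memory, then query it.
dom = minidom.parseString(xml_text)
tree_statuses = [
    node.firstChild.data for node in dom.getElementsByTagName("status")
]


# Approach 2: the parser calls back into our handler as it reads.
class StatusHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.statuses = []
        self._in_status = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "status":
            self._in_status = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect them.
        if self._in_status:
            self._buf.append(content)

    def endElement(self, name):
        if name == "status":
            self.statuses.append("".join(self._buf))
            self._in_status = False


handler = StatusHandler()
xml.sax.parseString(xml_text.encode("utf-8"), handler)
event_statuses = handler.statuses

print(tree_statuses, event_statuses)
```

The in-memory version is shorter and easier to query. The callback version never holds the whole document, which matters for very large files.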

dugan 04-03-2018 07:12 PM

Obligatory:

Parsing Html The Cthulhu Way

