Regular expression to select the field correctly which has a - in it
Hello All,
I have an XML file like this and I need a regular expression to select the <todo-item> from below. I wrote the expression like this: ([\r\n]+)(?=\s*\<todo-item\>) but I think the - between todo and item is causing it not to match correctly. Can anyone shed some light on how to change the regex? PHP Code:
|
You haven't said which regex engine or which tool you are using. Are we to guess PHP?
In general, parsing XML yourself is pointless; use one of the appropriate tools. No regex engine I use cares at all about a minus sign unless it is inside a bracket expression. Why do you require line feeds on input? Normal stream tools strip the line feed. I would remove the first subexpression completely, but I don't do PHP ... |
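To illustrate the point about the minus sign, here is a minimal sketch using Python's re module as a stand-in (an assumption; the thread has not yet established which engine is in use). It shows that a hyphen is an ordinary literal outside brackets and only acts as a range operator inside a bracket expression.

```python
import re

# Outside brackets, '-' is an ordinary literal character.
literal_match = re.search(r"<todo-item>", "text <todo-item> text") is not None

# Inside brackets, an unescaped '-' between two characters denotes a range.
range_matches_b = re.fullmatch(r"[a-c]", "b") is not None

# Escaping the '-' inside brackets turns it back into a literal,
# so the class now holds only 'a', '-' and 'c'.
escaped_matches_b = re.fullmatch(r"[a\-c]", "b") is not None
escaped_matches_dash = re.fullmatch(r"[a\-c]", "-") is not None

print(literal_match, range_matches_b, escaped_matches_b, escaped_matches_dash)
```

If this holds for the engine in question, the hyphen in todo-item is unlikely to be the cause of the failed match.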
Sorry for the confusion. I am actually importing this XML into Splunk. My props.conf looks like this:
[project]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)(?=\s*\<todo-item\>)
DATETIME_CONFIG = CURRENT
KV_MODE = xml
With this I am able to ingest other XML files without any issue. I just thought the data was not being split because of the - in between. That's why I posted the question. Again, apologies for the confusion. Thanks |
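One way to check the pattern outside Splunk is to simulate the line breaking. The sketch below is only an approximation: it uses Python's re module rather than Splunk's PCRE engine, the sample XML is hypothetical (the original file was not posted), and split_events is an illustrative helper, not a Splunk API. It follows the documented LINE_BREAKER behaviour of discarding the text matched by the first capture group and starting a new event after it.

```python
import re

# The pattern is copied verbatim from the props.conf above.
LINE_BREAKER = re.compile(r"([\r\n]+)(?=\s*\<todo-item\>)")

# Hypothetical sample; the poster's actual XML is not shown in the thread.
SAMPLE = (
    "<todo-items>\n"
    "  <todo-item>\n"
    "    <id>1</id>\n"
    "  </todo-item>\n"
    "  <todo-item>\n"
    "    <id>2</id>\n"
    "  </todo-item>\n"
    "</todo-items>\n"
)

def split_events(raw):
    """Split raw text into events, dropping the first capture group."""
    events = []
    start = 0
    for match in LINE_BREAKER.finditer(raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    return events

events = split_events(SAMPLE)
for number, event in enumerate(events):
    print(number, repr(event))
```

In this simulation the pattern does break before each <todo-item>, hyphen included, which suggests looking elsewhere for the cause, for example at whether the real file has a line break before each <todo-item>.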
Agreed. It is quite pointless to use pure regex to try to manage XML; a proper parser is needed for that. There are several easy-to-use, mature XML parsers on CPAN: see XML::TreeBuilder, XML::XPathEngine, or XML::Twig.
There are also several standalone XML tools based on XPath; xmllint is one. |
I am responding, however, because I have made the same error a few times and tried to parse XML with regexes. As the OP describes it, and as the other contributors have commented, it is pointless indeed. An XML parser, on the other hand, is called an XML parser because it parses XML. Think about it. |
I just had a brief look at the regex doco - interesting; I've not seen \r and \n used like that as anchors before. Can't help - hopefully others with splunk experience can assist. |
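Whatever RE engine it is, a literal < and > should not be escaped; in some engines (GNU grep, for example) \< and \> are word-boundary anchors rather than literal characters.
\<todo-item\> should be <todo-item> |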
While you can backslash-escape a literal dash, Turbocapitalist's admonition to "use a real XML parser" is a very sound one.
There are two general approaches. One reads the XML and builds an in-memory data structure. The other parses the XML and, while doing so, makes subroutine-calls to back-end routines of your own devising. Both are "known good" tools for handling the vagaries of XML. |
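The two approaches can be sketched with Python's standard library (chosen here only for illustration; the thread itself mentions Perl modules and xmllint). The sample document and the TodoHandler class are hypothetical. The first half builds an in-memory tree with xml.etree.ElementTree; the second half uses xml.sax, which calls back into a handler while it parses.

```python
import xml.etree.ElementTree as ET
import xml.sax

# Hypothetical sample; the element names other than todo-item are assumed.
SAMPLE = """<todo-items>
  <todo-item><id>1</id><content>Write report</content></todo-item>
  <todo-item><id>2</id><content>Review budget</content></todo-item>
</todo-items>"""

# Approach 1: read the XML and build an in-memory data structure.
root = ET.fromstring(SAMPLE)
tree_ids = [item.findtext("id") for item in root.findall("todo-item")]

# Approach 2: parse the XML and receive a callback for each event.
class TodoHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser each time an opening tag is seen.
        if name == "todo-item":
            self.count += 1

handler = TodoHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)

print(tree_ids, handler.count)
```

The tree approach is usually the more convenient one for small files, while the callback approach avoids holding the whole document in memory.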