LinuxQuestions.org


j-me 04-30-2012 10:59 AM

extract directory and xml data to create a comma delimited file
 
I need to extract part of the directory path and then read build.xml to extract certain fields into a comma-delimited file.
Example directory path (the job name and the build timestamp are the required items):

/tmp/jobs/Deploy_EVP_Test_Rate/builds/2012-01-04_14-41-42/build.xml

build.xml extract:
<value>EVP_3.5.4</value>
<result>SUCCESS</result>

ONLY the first occurrence of: <string>e010430</string>

Is there a good way to accomplish this using awk in a loop, or with some other tool? I'm not familiar with Perl.
Thank you.

Kustom42 04-30-2012 12:07 PM

Sed would be a good way to go if you are looking to find and replace content. So you could use sed to find the first occurrence of '<string>' and replace it with ','.

Code:

sed 's/<string>/,/' build.xml > build_new.xml
You may need to create a more in-depth sed command, but that's just a quick basic one.

Sergei Steshenko 04-30-2012 03:04 PM

Quote:

Originally Posted by Kustom42 (Post 4666888)
Sed would be a good way to go if you are looking to find and replace content. So you could use sed to find the first occurrence of '<string>' and replace it with ','.

Code:

sed 's/<string>/,/' build.xml > build_new.xml
You may need to create a more in-depth sed command, but that's just a quick basic one.

XML is not a line-oriented language, so for a reliable solution one needs a full-fledged parser, not a simple thing like 'sed'.

Kustom42 04-30-2012 03:18 PM

I don't use XML, but can you elaborate on that a bit more? Are you saying all of the XML code is on one line? If not, then sed or awk would work just fine. Or is there something else I am missing?

Sergei Steshenko 04-30-2012 03:24 PM

Quote:

Originally Posted by Kustom42 (Post 4666989)
I don't use XML, but can you elaborate on that a bit more? Are you saying all of the XML code is on one line? If not, then sed or awk would work just fine. Or is there something else I am missing?

Whom are you asking?

Anyway, I suggest starting from http://en.wikipedia.org/wiki/Token_%28parser%29 and http://en.wikipedia.org/wiki/Parsing

j-me 04-30-2012 03:25 PM

I'm not looking to find and replace; I understand how to do that. What I want is to output the extracted data into a file that I can read into a database or spreadsheet, as follows:

Deploy_EVP_Test_Rate,2012-01-04_14-41-42,EVP_3.5.4,SUCCESS,e010430

with multiple lines of other data in the same format. This is for deploy information, so the finished product will be a large file.
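
For illustration, the first two fields of that line can be taken from the path alone. This is a minimal Python sketch, assuming the layout is always .../<job>/builds/<timestamp>/build.xml as in the example path; the function name job_and_timestamp is made up for this example.

```python
from pathlib import PurePosixPath


def job_and_timestamp(build_xml_path):
    """Return (job name, build timestamp) for .../<job>/builds/<timestamp>/build.xml."""
    path = PurePosixPath(build_xml_path)
    timestamp = path.parent.name          # directory that holds build.xml
    job = path.parent.parent.parent.name  # two levels up, past "builds"
    return job, timestamp


print(",".join(job_and_timestamp(
    "/tmp/jobs/Deploy_EVP_Test_Rate/builds/2012-01-04_14-41-42/build.xml")))
# prints: Deploy_EVP_Test_Rate,2012-01-04_14-41-42
```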

Kustom42 04-30-2012 03:40 PM

I am having trouble following what you are trying to accomplish. If your file looks as follows:

<string>value1</string>
<string>value2</string>
<string>value3</string>
<string>value4</string>

And you want your output to look as:

value1,value2,value3,value4

Then why not just use cut and tr, or sed?

Something like the command below would give you the output in a comma-delimited format. It is a crude pipeline, not something I would use for a final result, but I just want to clarify where exactly you want to get.

Code:

cut -d'>' -f2 /tmp/test.txt | cut -d '<' -f1 | tr '\n' ','

Sergei Steshenko 04-30-2012 03:43 PM

Quote:

Originally Posted by Kustom42 (Post 4667002)
I am having trouble following what you are trying to accomplish. If your file looks as follows:

<string>value1</string>
<string>value2</string>
<string>value3</string>
<string>value4</string>

And you want your output to look as:

value1,value2,value3,value4

Then why not just use cut and tr, or sed?

Something like the command below would give you the output in a comma-delimited format. It is a crude pipeline, not something I would use for a final result, but I just want to clarify where exactly you want to get.

Code:

cut -d'>' -f2 /tmp/test.txt | cut -d '<' -f1 | tr '\n' ','

From the point of view of XML,

Code:

<string>value1</string>
is equivalent to

Code:

<string>
  value1
</string>

and other similar variations, so 'sed' is not a solution.
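
To make the point concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. It assumes the application treats the whitespace around the value as insignificant: the parser preserves that whitespace in the element text, so the sketch strips it. The <culprits> wrapper and the function name first_string are only for illustration.

```python
import xml.etree.ElementTree as ET

one_line = "<culprits><string>value1</string></culprits>"
multi_line = """<culprits>
<string>
  value1
</string>
</culprits>"""


def first_string(xml_text):
    """Return the text of the first <string> element with surrounding whitespace removed."""
    root = ET.fromstring(xml_text)
    return root.findtext(".//string").strip()


# Both layouts give the same value, which a line-by-line cut or sed would not.
print(first_string(one_line) == first_string(multi_line))
```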

Kustom42 04-30-2012 03:56 PM

OK Sergei, I see what you are saying about XML parsing; I was aware of that. But if he knows that his XML file is in a specific format, then wouldn't sed or similar bash tools work? If he knows that all of his strings are on one line, then that could be a solution, right?

Otherwise, I see where you are going in referring him to an interpreted programming language that can parse XML. Something like Python using xml.parsers.expat would be a viable solution in this case, correct?

Sergei Steshenko 04-30-2012 04:02 PM

Quote:

Originally Posted by Kustom42 (Post 4667016)
OK Sergei, I see what you are saying about XML parsing; I was aware of that. But if he knows that his XML file is in a specific format, then wouldn't sed or similar bash tools work? If he knows that all of his strings are on one line, then that could be a solution, right?

Otherwise, I see where you are going in referring him to an interpreted programming language that can parse XML. Something like Python using xml.parsers.expat would be a viable solution in this case, correct?

Or a Perl XML parser (there are a few), or Ruby, or C, or C++, but in any case a full-fledged parser.

Kustom42 04-30-2012 04:05 PM

Sergei, thanks for clarifying. I'm still new to some aspects of Linux; I come here and learn something new every day.

Sergei Steshenko 04-30-2012 04:16 PM

Quote:

Originally Posted by Kustom42 (Post 4667030)
... I'm still new to some aspects of Linux. ...

This has nothing to do with Linux specifically. This is not an OS-related issue.

David the H. 05-01-2012 01:51 AM

I'll have to agree with Sergei generally. Not only is the free-form nature of xml (and html) a problem for these tools, but regular expressions are not well designed for parsing nested data sets. Only a real, dedicated parser is capable of handling all the intricacies of xml-type formats.

That said, if all you need is some trivial editing of, or extraction from, well-formed files with a known, unchanging format, then sed/awk/whatever can indeed usually do the job. Just don't trust them for use on arbitrary input or mission-critical work.


As for the OP's question, could we get an actual full example of the xml input to work with? Just giving verbal descriptions and short selections isn't good enough if you want us to help you find a true solution.

And when you do, please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.

j-me 05-01-2012 07:51 AM

I hope this is helpful

Directory: jobs/Deploy_EVP_Test/builds/2012-01-31_12-43-55
build.xml [in its entirety, as it looks]:

<?xml version='1.0' encoding='UTF-8'?>
<build>
<actions>
<hudson.model.ParametersAction>
<parameters>
<hudson.model.StringParameterValue>
<name>tagName</name>
<description></description>
<value>EVP_3.5.8</value>
</hudson.model.StringParameterValue>
</parameters>
</hudson.model.ParametersAction>
<hudson.model.CauseAction>
<causes>
<hudson.model.Cause_-UserCause>
<authenticationName>its5939</authenticationName>
</hudson.model.Cause_-UserCause>
</causes>
</hudson.model.CauseAction>
<hudson.scm.SubversionTagAction>
<build class="build" reference="../../.."/>
<tags class="hudson.util.CopyOnWriteMap$Tree">
<no-comparator/>
<entry>
<hudson.scm.SubversionSCM_-SvnInfo>
<url>http://svn.fbfs.com/TS/scripts/deploy</url>
<revision>602</revision>
</hudson.scm.SubversionSCM_-SvnInfo>
<list/>
</entry>
<entry>
<hudson.scm.SubversionSCM_-SvnInfo>
<url>http://svn.fbfs.com/TS/scripts/deployEnv</url>
<revision>547</revision>
</hudson.scm.SubversionSCM_-SvnInfo>
<list/>
</entry>
</tags>
</hudson.scm.SubversionTagAction>
<hudson.scm.SVNRevisionState>
<revisions>
<entry>
<string>http://svn.fbfs.com/TS/scripts/deployEnv</string>
<long>547</long>
</entry>
<entry>
<string>http://svn.fbfs.com/TS/scripts/deploy</string>
<long>602</long>
</entry>
</revisions>
</hudson.scm.SVNRevisionState>
<hudson.plugins.descriptionsetter.DescriptionSetterAction>
<description>EVP_3.5.8</description>
</hudson.plugins.descriptionsetter.DescriptionSetterAction>
</actions>
<number>275</number>
<result>SUCCESS</result>
<description>EVP_3.5.8</description>
<duration>140982</duration>
<charset>UTF-8</charset>
<keepLog>false</keepLog>
<builtOn>thomidsrv05</builtOn>
<workspace>/var/hudson</workspace>
<hudsonVersion>1.362</hudsonVersion>
<scm class="hudson.scm.SubversionChangeLogParser"/>
<culprits>
<string>e012835</string>
<string>tsl7713</string>
<string>e009137</string>
</culprits>
</build>

Requirements: extract two fields from the directory name [Deploy_<whatever> and the build date in full yyyy-mm-dd_hh-mm-ss format],
and from build.xml: the version, the result, and ONLY the first string in culprits.


Resulting file:
Deploy_EVP_Test,2012-01-31_12-43-55,EVP_3.5.8,SUCCESS,e012835

There are 100,000+ entries, so this will have to loop over all the files to create the resulting file.

Is this descriptive enough?
thank you.
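
A hedged sketch of the build.xml part of these requirements, using Python's standard xml.etree.ElementTree parser. It assumes the version is the <value> of the parameter whose <name> is tagName, as in the posted file. SAMPLE is a trimmed copy of that file, and the function name extract_fields is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the posted build.xml, keeping only the elements used here.
SAMPLE = """<build>
<actions>
<hudson.model.ParametersAction>
<parameters>
<hudson.model.StringParameterValue>
<name>tagName</name>
<description></description>
<value>EVP_3.5.8</value>
</hudson.model.StringParameterValue>
</parameters>
</hudson.model.ParametersAction>
</actions>
<result>SUCCESS</result>
<culprits>
<string>e012835</string>
<string>tsl7713</string>
<string>e009137</string>
</culprits>
</build>"""


def extract_fields(root):
    """Return (version, result, first culprit) from a parsed <build> element."""
    version = ""
    # Assumption: the version is the value of the parameter named "tagName".
    for param in root.iter("hudson.model.StringParameterValue"):
        if param.findtext("name") == "tagName":
            version = param.findtext("value", default="")
            break
    result = root.findtext("result", default="")
    # findtext returns the first match only, so later culprits are ignored.
    first_culprit = root.findtext("culprits/string", default="")
    return version, result, first_culprit


print(",".join(extract_fields(ET.fromstring(SAMPLE))))
# prints: EVP_3.5.8,SUCCESS,e012835
```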

j-me 05-01-2012 08:32 AM

Additional info:
directory view: /jobs/

drwxr-xr-x 5 appadmin users 4096 2011-09-22 10:11 Deploy_ACH_ACHServices_Test
drwxr-xr-x 5 appadmin users 4096 2011-08-11 08:39 Deploy_ALU_GF_Test
drwxr-xr-x 5 appadmin users 4096 2011-08-11 08:41 Deploy_ALU_GF_Test_Support
drwxr-xr-x 5 appadmin users 4096 2012-04-25 15:06 Deploy_AUS_AURulesEngine_GF_Test
drwxr-xr-x 5 appadmin users 4096 2012-04-25 15:08 Deploy_AUS_AURulesEngine_GF_Test_Support
drwxr-xr-x 5 appadmin users 4096 2012-01-31 12:46 Deploy_EVP_Test
drwxr-xr-x 5 appadmin users 4096 2012-01-11 13:40 Deploy_EVP_Test_Rate
drwxr-xr-x 5 appadmin users 4096 2012-01-11 13:42 Deploy_EVP_Test_Support
drwxr-xr-x 5 appadmin users 4096 2012-03-14 16:19 Deploy_HRM_MyTime_GF_Test
drwxr-xr-x 5 appadmin users 4096 2012-01-20 12:58 Deploy_XCD_XCDCommFW_Test_Support

Each contains a builds directory, which contains MANY yyyy-mm-dd_hh-mm-ss directories and a few other numeric folders (1, 2, 3, 4, 5, etc.). I ONLY want to obtain build.xml from the yyyy-mm-dd_hh-mm-ss directories.

thanks.
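
One possible way to loop over that layout, sketched in Python under stated assumptions: job directories match Deploy_*, the version is the <value> of the parameter named tagName, and files that fail to parse are skipped. The names build_rows and write_csv are hypothetical.

```python
import csv
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Matches yyyy-mm-dd_hh-mm-ss; numeric folders such as "1" or "275" do not match.
TIMESTAMP_DIR = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}")


def build_rows(jobs_root):
    """Yield [job, timestamp, version, result, first culprit] for each build.xml."""
    for build_xml in sorted(Path(jobs_root).glob("Deploy_*/builds/*/build.xml")):
        timestamp = build_xml.parent.name
        if not TIMESTAMP_DIR.fullmatch(timestamp):
            continue  # skip the numeric 1, 2, 3, ... folders
        try:
            root = ET.parse(build_xml).getroot()
        except ET.ParseError:
            continue  # assumption: malformed files are skipped, not fatal
        version = ""
        for param in root.iter("hudson.model.StringParameterValue"):
            if param.findtext("name") == "tagName":
                version = param.findtext("value", default="")
                break
        yield [
            build_xml.parent.parent.parent.name,  # job directory, e.g. Deploy_EVP_Test
            timestamp,
            version,
            root.findtext("result", default=""),
            root.findtext("culprits/string", default=""),  # first culprit only
        ]


def write_csv(jobs_root, out_path):
    """Write every row found under jobs_root to out_path as comma-delimited text."""
    with open(out_path, "w", newline="") as out:
        csv.writer(out).writerows(build_rows(jobs_root))
```

Calling write_csv("/tmp/jobs", "deploys.csv") would then produce one line per timestamped build; both paths here are only examples. Rows are generated one at a time, so 100,000+ files do not need to be held in memory at once apart from the sorted list of paths.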


All times are GMT -5.