[SOLVED] Script to print repeated values separated by line break

Perseus · 04-05-2014, 02:02 AM

Hello everyone,

I've been trying on my own and searching for help, without success so far. Maybe someone could help me, please; there may also be a better way to do it.

What I want is to print, for each "Report" (SReport), its respective values in separate columns and, for nodes
that appear several times, to add an "end of line" between each value. For example, for the first report (SReport), within node "MR_NRanges", NA appears 3 times, so I want to print NA like this: 763LF358LF852, where LF represents an end of line (written that way only to make it easy to see).

NA, NRB, SubRangeB and SubRangeE could appear inside the parent nodes "MR_NRanges" and "PK_NRanges" one or more times.

Thanks in advance for any help.

This is the XML input file:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<REPORT-01-NUUMAX16 >
  <SReport>
    <RepName>JEUOP</RepName>
    <RepIn>KUI</RepIn>
    <RepIni>
      <Report>
        <ReportType>Regular</ReportType>
        <ReportData>
          <MainSec>
            <Date>2014-03-15</Date>
            <Indicators_MAX-MR>
              <MR_NRanges>
                <MRValues>
                  <MR_ValRanges>
                    <NA>763</NA>
                    <NRB>91</NRB>
                    <SubRange>
                      <SubRangeB>000</SubRangeB>
                      <SubRangeE>899</SubRangeE>
                    </SubRange>
                  </MR_ValRanges>
                </MRValues>
                <MRValues>
                  <MR_ValRanges>
                    <NA>358</NA>
                    <NRB>95</NRB>
                    <SubRange>
                      <SubRangeB>130</SubRangeB>
                      <SubRangeE>149</SubRangeE>
                    </SubRange>
                  </MR_ValRanges>
                </MRValues>
                <MRValues>
                  <MR_ValRanges>
                    <NA>852</NA>
                    <NRB>76</NRB>
                    <SubRange>
                      <SubRangeB>200</SubRangeB>
                      <SubRangeE>299</SubRangeE>
                    </SubRange>
                  </MR_ValRanges>
                </MRValues>				
              </MR_NRanges>
              <PK_NRanges>
                <MRValues>
                  <MR_ValRanges>
                    <NA>441</NA>
                    <NRB>97</NRB>
                    <SubRange>
                      <SubRangeB>786</SubRangeB>
                      <SubRangeE>789</SubRangeE>
                    </SubRange>
                  </MR_ValRanges>
                </MRValues>
                <MRValues>
                  <MR_ValRanges>
                    <NA>705</NA>
                    <NRB>98</NRB>
                    <SubRange>
                      <SubRangeB>677</SubRangeB>
                      <SubRangeE>859</SubRangeE>
                    </SubRange>
                  </MR_ValRanges>
                </MRValues>
              </PK_NRanges>
            </Indicators_MAX-MR>
            <MAX03_NRanges>
              <MXA>999</MXA>
              <MXB>87</MXB>
            </MAX03_NRanges>
          </MainSec>
        </ReportData>
      </Report>
    </RepIni>
  </SReport>
  <SReport>
    <RepName>KURMT</RepName>
    <RepIn>MUR</RepIn>
    <RepIni>
      <Report>
        <ReportType>Regular</ReportType>
        <ReportData>
          <MainSec>
            <Date>2014-03-19</Date>
            <Indicators_MAX-MR>
              <MR_NRanges>
                <MRValues>
                  <MR_ValRanges>
                    <NA>256</NA>
                    <NRB>12</NRB>
                    <SubRange>
                      <SubRangeB>100</SubRangeB>
                      <SubRangeE>999</SubRangeE>
                    </SubRange>
                  </MR_ValRanges>
                </MRValues>			
              </MR_NRanges>
              <PK_NRanges>
                <MRValues>
                  <MR_ValRanges>
                    <NA>113</NA>
                    <NRB>12</NRB>
                    <SubRange>
                      <SubRangeB>466</SubRangeB>
                      <SubRangeE>899</SubRangeE>
                    </SubRange>
                  </MR_ValRanges>
                </MRValues>
              </PK_NRanges>
            </Indicators_MAX-MR>
            <MAX03_NRanges>
              <MXA>398</MXA>
              <MXB>02</MXB>
            </MAX03_NRanges>
          </MainSec>
        </ReportData>
      </Report>
    </RepIni>
  </SReport>
</REPORT-01-NUUMAX16>

I was able to come up with the following awk code, but it is not printing the output I want.

Code:

awk 'BEGIN{
           FS = "<|>"; OFS = "|"
           print "||||MR_NRanges||||PK_Nranges|||||\nRepName|RepIn|ReportType|Date|NA|NRB|SubRangeB|SubRangeE|NA|NRB|SubRangeB|SubRangeE|MXA|MXB"}


$2 == "RepName"			{a=1; RN = $3}
$2 == "RepIn"			{RI = $3}
$2 == "ReportType"		{RT = $3}
$2 == "Date"			{DT = $3}
$2 == "MR_NRanges"		{b = 1}
$2 == "NA"		        {NA1 = b?(NA1?NA1"\n"$3:$3):""}
$2 == "NRB"			{NRB1 = b?(NRB1?NRB1"\n"$3:$3):""}
$2 == "SubRangeB"		{sRB1 = b?(sRB1?sRB1"\n"$3:$3):""}
$2 == "SubRangeE"		{sRE1 = b?(sRE1?sRE1"\n"$3:$3):""}
$2 == "/MR_NRanges"		{b = 0}
$2 == "PK_NRanges"		{c = 1}
$2 == "NA"			{NA2 = 	c?(NA2?NA2"\n"$3:$3):""}
$2 == "NRB"			{NRB2 = c?(NRB2?NRB2"\n"$3:$3):""}
$2 == "SubRangeB"		{sRB2 = c?(sRB2?sRB2"\n"$3:$3):""}
$2 == "SubRangeE"		{sRE2 = c?(sRE2?sRE2"\n"$3:$3):""}
$2 == "/PK_NRanges"		{c = 0}
$2 == "MXA"			{MXA = $3}
$2 == "MXB"			{MXB = $3; print RN,RI,RT,DT,NA1,NRB1,sRB1,sRE1,NA2,NRB2,sRB2,sRE2,MXA,MXB}
' input.xml

The desired output is below (LF represents an end of line, to make it easy to see):

Code:

||||MR_NRanges||||PK_Nranges|||||
RepName|RepIn|ReportType|Date|NA|NRB|SubRangeB|SubRangeE|NA|NRB|SubRangeB|SubRangeE|MXA|MXB
JEUOP|KUI|Regular|2014-03-15|763LF358LF852|91LF95LF76|000LF130LF200|899LF149LF299|441LF705|97LF98|786LF677|789LF859|999|87
KURMT|MUR|Regular|2014-03-19|256|12|100|999|113|12|466|899|398|02

BR

grail · 04-05-2014, 03:56 AM

How do you know that MR_NRanges is a node? Looking at the xml data I see nothing specific about this entry over many others that clarifies how to know it is a node?

Also, whilst I am always happy to champion awk as a great tool, if you are needing such a low level reference to data within an xml construct, I would recommend looking at something
like Ruby or Perl which have specific modules for interacting with xml data.

Perseus · 04-05-2014, 07:16 PM

Hello grail,

Well, MR_NRanges and PK_NRanges are sub-elements inside each SReport block.

I've tried with awk since it is the tool I know a little, but thanks for the suggestion about ruby or perl.

Code:

  <SReport>
    <RepName>JEUOP</RepName>
    <RepIn>KUI</RepIn>
    <RepIni>
      <Report>
        <ReportType>Regular</ReportType>
        <ReportData>
          <MainSec>
            <Date>2014-03-15</Date>
            <Indicators_MAX-MR>
              <MR_NRanges>
                <MRValues>
                  <MR_ValRanges>
                    <NA>763</NA>
                    <NRB>91</NRB>
                    <SubRange>
                      <SubRangeB>000</SubRangeB>
                      <SubRangeE>899</SubRangeE>
                    </SubRange>

Perseus · 04-06-2014, 02:56 AM

Hello grail,

I found an example using Ruby's REXML and am testing it with my XML input, trying to print only RepName, ReportType, and the NA
and NRB values that belong to MR_NRanges. The output should be (for repeated nodes I'm putting a "," here):

Code:

RepName|ReportType|NA|NRB
JEUOP|Regular|763,358,852|91,95,76
KURMT|Regular|256|12

and I am getting this:

Code:

RepName|ReportType|NA
JEUOP|KURMT|Regular|Regular|763,358,852,256,
JEUOP|KURMT|Regular|Regular|763,358,852,256,

The code I'm trying is below:

Code:

#!/usr/bin/ruby -w

require 'rexml/document'
include REXML

xmlfile = File.new("input_1.xml")
xmldoc = Document.new(xmlfile)

print "RepName|ReportType|NA|NRB\n"

xmldoc.elements.each("REPORT-01-NUUMAX16/SReport/") {
	xmldoc.elements.each("//RepName") {|a| print a.text, "|"}
	xmldoc.elements.each("//RepIni/Report/ReportType") {|b| print b.text, "|"}
    xmldoc.elements.each("//Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NA") {|c| print c.text, ","}
	print "|"
	xmldoc.elements.each("//Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NRB") {|d| print d.text, ","}
	puts
}

Could you help me with this, please?

Thanks in advance.

grail · 04-06-2014, 05:39 AM

Ok ... I haven't played with this particular module much and there is still a pesky comma in the wrong spot (which just annoys me), but this is what I have so far:

Code:

require 'rexml/document'
include REXML

xmlfile = File.new("f.xml")
xmldoc = Document.new(xmlfile)

print "RepName|ReportType|NA|NRB\n"

xmldoc.elements.each("REPORT-01-NUUMAX16/SReport/") {
    |e| 

    print e.elements["RepName"].text + "|" 
    e.elements.each("RepIni/Report"){
        |a|

        print a.elements["ReportType"].text, "|" 
        a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NA"){|c| print c.text, ","}
        print "|" 
        a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NRB"){|c| print c.text, ","}
    }   
    puts
}

There is probably a cleaner method. I did see that you can iterate over items until a match is found and then print that, so it may be worth playing with it a little more.

Perseus · 04-07-2014, 01:09 AM

Hi grail,

Thank you for the fix. I was able to print the other values, but as you said, a comma is still present as the last
character for me too. Maybe storing the content of each row in a variable before printing could be useful to remove
the extra "," and then print, but I wasn't able to change the code to store it in a variable first. Maybe you can help
me one more time to fix that part.

This is the code I have so far.

Code:

#!/usr/bin/ruby -w
require 'rexml/document'
include REXML

xmlfile = File.new("input_1.xml")
xmldoc = Document.new(xmlfile)

print "RepName|RepIn|ReportType|Date|NA|NRB|SubRangeB|SubRangeE|NA|NRB|SubRangeB|SubRangeE|MXA|MXB\n"

xmldoc.elements.each("REPORT-01-NUUMAX16/SReport/") {
    |e| 

    print e.elements["RepName"].text + "|" + e.elements["RepIn"].text + "|"
    e.elements.each("RepIni/Report"){
        |a|

        print	a.elements["ReportType"].text, "|" +
				a.elements["ReportData/MainSec/Date"].text + "|"
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NA"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NRB"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeB"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeE"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/NA"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/NRB"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeB"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeE"){|c| print c.text, ","}
				print "|" 
				a.elements.each("ReportData/MainSec/MAX03_NRanges/MXA"){|c| print c.text, ","}
				print "|"
				a.elements.each("ReportData/MainSec/MAX03_NRanges/MXB"){|c| print c.text, ","}
				
    }   
    puts
}

and the output so far is:

Code:

RepName|RepIn|ReportType|Date|NA|NRB|SubRangeB|SubRangeE|NA|NRB|SubRangeB|SubRangeE|MXA|MXB
JEUOP|KUI|Regular|2014-03-15|763,358,852,|91,95,76,|000,130,200,|899,149,299,|441,705,|97,98,|786,677,|789,859,|999,|87,
KURMT|MUR|Regular|2014-03-19|256,|12,|100,|999,|113,|12,|466,|899,|398,|02,

Thanks again for the help.

grail · 04-07-2014, 01:25 AM

I did find a way to remove that, although with the addition of several fields I am not sure it is practical:

Code:

xmldoc.elements.each("REPORT-01-NUUMAX16/SReport/") {
	|e|

	print e.elements["RepName"].text, "|"
	e.elements.each("RepIni/Report"){
		|a|

		f = { "NA" => [], "NRB" => [] }
		print a.elements["ReportType"].text, "|"

		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NA"){|c| f["NA"] << c.text}
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NRB"){|c| f["NRB"] << c.text}
		puts [f["NA"].join(","),f["NRB"].join(",")].join("|")
	}
}

I feel there should be a way to somehow use the value as a reference in the hash and then append the data wanted into the associated array.

Hope that helps.
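The idea in the remark above, using each element's own name as the hash key, could look roughly like the sketch below. It is only an illustration under assumptions: the inline XML is a trimmed copy of the sample from the first post, and the names `base`, `fields`, and `rows` are made up for this example rather than taken from the thread.

```ruby
require 'rexml/document'

# Trimmed inline sample so the sketch runs on its own; the real script
# would read the XML file instead.
xml = <<~XML
  <REPORT-01-NUUMAX16>
    <SReport>
      <RepName>JEUOP</RepName>
      <RepIni><Report><ReportType>Regular</ReportType><ReportData><MainSec>
        <Indicators_MAX-MR><MR_NRanges>
          <MRValues><MR_ValRanges><NA>763</NA><NRB>91</NRB></MR_ValRanges></MRValues>
          <MRValues><MR_ValRanges><NA>358</NA><NRB>95</NRB></MR_ValRanges></MRValues>
        </MR_NRanges></Indicators_MAX-MR>
      </MainSec></ReportData></Report></RepIni>
    </SReport>
  </REPORT-01-NUUMAX16>
XML

doc  = REXML::Document.new(xml)
base = "RepIni/Report/ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges"

rows = []
doc.elements.each("REPORT-01-NUUMAX16/SReport") do |report|
  # One empty array per wanted element name; the keys decide what is kept.
  fields = { "NA" => [], "NRB" => [] }

  # Visit every child of every MR_ValRanges and file it under its own name.
  report.elements.each("#{base}/*") do |el|
    fields[el.name] << el.text if fields.key?(el.name)
  end

  rows << ([report.elements["RepName"].text] +
           fields.values.map { |v| v.join(",") }).join("|")
end

puts rows
```

Adding another column then only means adding another key to `fields`, at the cost of a single XPath query that visits all children of `MR_ValRanges`.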

Perseus · 04-08-2014, 12:27 AM

Hello grail,

Following your example I was able to print all values for this input file, and they seem
to be printed the way I want. I'm not sure how to use the values as a reference in the hash.

What I've been trying last is to add double quotes at the beginning and at the end of each column that
contains repeated values.

I mean, this is the output I have so far:

Code:

RepName|RepIn|ReportType|Date|NA|NRB|SubRangeB|SubRangeE|NA|NRB|SubRangeB|SubRangeE|MXA|MXB
JEUOP|KUI|Regular|2014-03-15|763,358,852|91,95,76|000,130,200|899,149,299|441,705|97,98|786,677|789,859|999|87
KURMT|MUR|Regular|2014-03-19|256|12|100|999|113|12|466|899|398|02

And in the end I want repeated values enclosed in double quotes, as below:
*(if a value appears only once, it can be printed with or without double quotes, it doesn't matter. For example, column 5 of row 2
has only one value, 256, so it could be printed without double quotes.)

Code:

RepName|RepIn|ReportType|Date|NA|NRB|SubRangeB|SubRangeE|NA|NRB|SubRangeB|SubRangeE|MXA|MXB
JEUOP|KUI|Regular|2014-03-15|"763,358,852"|"91,95,76"|"000,130,200"|"899,149,299"|"441,705"|"97,98"|"786,677"|"789,859"|999|87
KURMT|MUR|Regular|2014-03-19|256|12|100|999|113|12|466|899|398|02

Is there a short way to insert the double quotes like that?

Thanks again for the help.

This is the code I have so far:

Code:

#!/usr/bin/ruby -w
require 'rexml/document'
include REXML

xmlfile = File.new("input_1.xml")
xmldoc = Document.new(xmlfile)

print "RepName|RepIn|ReportType|Date|NA|NRB|SubRangeB|SubRangeE|NA|NRB|SubRangeB|SubRangeE|MXA|MXB\n"

xmldoc.elements.each("REPORT-01-NUUMAX16/SReport/") {
	|e|

	print e.elements["RepName"].text + "|" + e.elements["RepIn"].text + "|"
	e.elements.each("RepIni/Report"){
		|a|

		print	a.elements["ReportType"].text, "|" +
				a.elements["ReportData/MainSec/Date"].text + "|"
				
		mr = { "NA" => [], "NRB" => [], "SubRangeB" => [], "SubRangeE" => []}
		pk = { "NA" => [], "NRB" => [], "SubRangeB" => [], "SubRangeE" => []}
		m3 = { "MXA" => [], "MXB" => [] }

		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NA"){|c| mr["NA"] << c.text}
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NRB"){|c| mr["NRB"] << c.text}
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeB"){|c| mr["SubRangeB"] << c.text}
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeE"){|c| mr["SubRangeE"] << c.text}	
		
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/NA"){|c| pk["NA"] << c.text}
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/NRB"){|c| pk["NRB"] << c.text}
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeB"){|c| pk["SubRangeB"] << c.text}
		a.elements.each("ReportData/MainSec/Indicators_MAX-MR/PK_NRanges/MRValues/MR_ValRanges/SubRange/SubRangeE"){|c| pk["SubRangeE"] << c.text}	
		
		a.elements.each("ReportData/MainSec/MAX03_NRanges/MXA"){|c| m3["MXA"] << c.text}
		a.elements.each("ReportData/MainSec/MAX03_NRanges/MXB"){|c| m3["MXB"] << c.text}					
		
		puts [mr["NA"].join(","),mr["NRB"].join(","),mr["SubRangeB"].join(","),mr["SubRangeE"].join(",")].join("|") + "|" +
			 [pk["NA"].join(","),pk["NRB"].join(","),pk["SubRangeB"].join(","),pk["SubRangeE"].join(",")].join("|") + "|" +
			 [m3["MXA"].join(","),m3["MXB"].join(",")].join("|")	
	}
}

Regards

grail · 04-08-2014, 02:25 AM

Well it is messy (pretty sure there should be a better way), but this works:

Code:

puts "\"#{f["NA"].join(",")}\"|\"#{f["NRB"].join(",")}\""
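If the quotes should appear only around fields that actually hold repeated values (as in the desired output above), one option is a small helper that checks the array size before joining. This is a sketch, not something from the thread: `format_field` is a hypothetical name, and the hash `f` below is filled with sample values by hand.

```ruby
# Hypothetical helper: join the values and add double quotes only when
# the field holds more than one value.
def format_field(values, sep = ",")
  joined = values.join(sep)
  values.size > 1 ? %("#{joined}") : joined
end

# Sample data shaped like the hash of arrays used in the script.
f = { "NA" => %w[763 358 852], "NRB" => %w[91 95 76], "MXA" => %w[999] }

line = f.values.map { |v| format_field(v) }.join("|")
puts line
```

The `%("...")` literal avoids the backslash-escaped quotes, which tends to be easier to read once several fields are involved.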

Perseus · 04-09-2014, 01:05 AM

Hello grail,

Thanks again for your help. I've tried it and it works. Following your comment, I found another way to do it:

Code:

puts f.values.map{|z| '"' + z.join(",") + '"'}.join('|')

Now I have my last 2 questions. If you have enough time, it would be great if you could help me again.

1- Since this script uses an XML module, I cannot concatenate several XML files into a single one and run the script on it, because
the result is detected as a badly formed XML file. So, having several XML files with the same format in a directory, how can
I make this script take all the XML files and write the result to a single output file, with each line holding the values of one XML file?
* Each XML file has only one top node called "SReport".

2- I've been trying to shorten the a.elements.each(...) commands by assigning the path to a string, but when I do that,
even though I do not receive a syntax error message, the output is not correct.

I've tried to change from this:

Code:

a.elements.each("ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/NA"){|c| mr["NA"] << c.text}

to this:

Code:

str="ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/"
a.elements.each(str << "NA"){|c| mr["NA"] << c.text}

but it is not working and I don't know why.

Many thanks again grail.

Regards

grail · 04-09-2014, 05:04 AM

I am not sure I understand question 1? Are you saying you don't know how to pass the file names to the script? Have a look at the ARGV array.

As for number 2, what you have written seems to work fine for me.

What version of ruby are you on?
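One possible cause for question 2, assuming the same `str` variable is reused for several paths (the thread does not show the full modified script): `String#<<` appends in place, so after the first call `str` already ends in "NA" and the next call builds a path that matches nothing. A single call, as in the snippet above, would still behave correctly. The variable names below are only for illustration.

```ruby
base = "ReportData/MainSec/Indicators_MAX-MR/MR_NRanges/MRValues/MR_ValRanges/"

# `<<` modifies the receiver, so the prefix grows with every call.
str    = base.dup
first  = (str << "NA").dup   # ".../MR_ValRanges/NA"
second = (str << "NRB").dup  # ".../MR_ValRanges/NANRB", not the NRB path

# `+` and interpolation build a new string and leave the prefix untouched.
na_path  = base + "NA"
nrb_path = "#{base}NRB"

puts first, second, na_path, nrb_path
```

With `+` or interpolation the shared prefix can be reused for every element name, for example `a.elements.each(base + "NA") { ... }`.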

Perseus · 04-10-2014, 01:13 AM

Hello grail,

The ruby version is 1.9.3p448 and I'm not sure why the printout changes when I replace that string with a variable.

Actually, I want the script to process all XML files inside a folder.

I found Dir command and it seems to work like this:

Code:

#!/usr/bin/ruby -w
require 'rexml/document'
include REXML

print "Headers..."

Dir.glob("*.xml").each do |file|  
	# My script
end

But it is not printing anything if I put the full path:

Code:

Dir.glob("C:\XMLs\latest\*.xml").each do |file|  
	# My script
end

I tried in Cygwin and in Ruby for Windows.

Thanks for the help.

grail · 04-10-2014, 08:08 AM

Well, I am using a later version of ruby (2.1.1p76), so I am not sure if that makes a difference.

As for Dir, are you saying even a simple, puts file, with nothing else also prints nothing?
That works for me, of course on linux but I do not see why that would matter.

Perseus · 04-10-2014, 08:28 AM

Hello grail,

Yes, I tried to simply print the names of the files in the folder by putting the full path inside the Dir command, and it doesn't work that way; it only works if I put just the wildcard, like this: "*.xml".

Is there some other way to do the loop over all files?

Thanks again

grail · 04-10-2014, 10:01 AM

Well it did work for me just fine ... I guess the only thing I can think of off the top of my head is that your path may be incorrect?

As for alternatives, you could simply pass the files to the script on the command line:

Code:

./your_script.rb path/to/files/*.xml

Then simply loop over each one.
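A likely reason the full Windows path printed nothing is the backslashes: inside a double-quoted Ruby string, unrecognised escapes such as `\X`, `\l` and `\*` lose their backslash, and `Dir.glob` treats a backslash as an escape character rather than a path separator, so forward slashes are the safer choice. The sketch below shows that, plus a way to accept either command-line arguments or a directory. `xml_files` is a hypothetical helper, and the temporary directory and file names exist only to make the example runnable.

```ruby
require 'tmpdir'

# The backslashes are consumed by the string literal, so this is no longer
# the intended path.
broken = "C:\XMLs\latest\*.xml"

# Hypothetical helper: use the file names given on the command line if any,
# otherwise glob a directory. File.join builds the pattern with "/".
def xml_files(args, dir)
  return args unless args.empty?
  Dir.glob(File.join(dir, "*.xml")).sort
end

# Demonstration on a throwaway directory with made-up file names.
found = Dir.mktmpdir do |dir|
  %w[a.xml b.xml notes.txt].each do |name|
    File.write(File.join(dir, name), "<SReport/>")
  end
  xml_files([], dir).map { |path| File.basename(path) }
end

puts broken
puts found
```

In the real script this could be called as `xml_files(ARGV, "C:/XMLs/latest").each { |file| ... }`, with the header printed once before the loop so that every XML file contributes one line to the same output.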