LinuxQuestions.org

using SED or AWK to cut data from a file, between certain characters (https://www.linuxquestions.org/questions/linux-newbie-8/using-sed-or-awk-to-cut-data-from-a-file-between-certain-characters-840035/)

Kenhelm 10-27-2010 08:15 AM

This works for me with the posted file fragment.
The --recover option tells xmllint to "Output any parsable portions of an invalid document" (from the man page).
It tries to repair the incomplete XML by adding the missing end tags.

Code:

sed -n '/^ <document>/,/^$/s/^ //p' file | tr -d '\n' | xmllint --format --recover -

-:1: parser error : Couldn't find end of Start Tag objectFie line 1
tLifePercentage</fieldName><fieldValue>0.00</fieldValue></objectField><objectFie
                                                                              ^
-:1: parser error : Premature end of data in tag level2Object line 1
tLifePercentage</fieldName><fieldValue>0.00</fieldValue></objectField><objectFie
                                                                              ^
-:1: parser error : Premature end of data in tag level1Object line 1
tLifePercentage</fieldName><fieldValue>0.00</fieldValue></objectField><objectFie
                                                                              ^
-:1: parser error : Premature end of data in tag level0Object line 1
tLifePercentage</fieldName><fieldValue>0.00</fieldValue></objectField><objectFie
                                                                              ^
-:1: parser error : Premature end of data in tag document line 1
tLifePercentage</fieldName><fieldValue>0.00</fieldValue></objectField><objectFie
                                                                              ^
<?xml version="1.0"?>
<document>
  <docRequestID>2010-10-22-11.57.22.903813</docRequestID>
  <docStylesheet>Thunderhead</docStylesheet>
  <requestType>claim</requestType>
  <level0Object>
    <objectType>transaction</objectType>
    <objectID>900</objectID>
    <objectSeq>1</objectSeq>
    <level1Object>
      <objectType>lifelite</objectType>
      <objectID>901</objectID>
      <objectSeq>1</objectSeq>
      <level2Object>
        <objectType>documentHeader</objectType>
        <objectID>100</objectID>
        <objectSeq>1</objectSeq>
        <objectField>
          <fieldID>1500</fieldID>
          <fieldName>transactionType</fieldName>
          <fieldValue>6</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1501</fieldID>
          <fieldName>lifeliteReference</fieldName>
          <fieldValue>000231263</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1502</fieldID>
          <fieldName>requestorUserid</fieldName>
          <fieldValue>LV20073</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1503</fieldID>
          <fieldName>requestDate</fieldName>
          <fieldValue>2010-10-22</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1504</fieldID>
          <fieldName>requestTime</fieldName>
          <fieldValue>6</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1505</fieldID>
          <fieldName>busProcess</fieldName>
          <fieldValue>LLP0101</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1506</fieldID>
          <fieldName>insert</fieldName>
          <fieldValue>N</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1507</fieldID>
          <fieldName>adviserName</fieldName>
          <fieldValue>PHIL</fieldValue>
        </objectField>
      </level2Object>
      <level2Object>
        <objectType>recipient</objectType>
        <objectID>110</objectID>
        <objectSeq>2</objectSeq>
        <objectField>
          <fieldID>1510</fieldID>
          <fieldName>rcpntPartyId</fieldName>
          <fieldValue>7510134</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1511</fieldID>
          <fieldName>companyCode</fieldName>
          <fieldValue>LVG</fieldValue>
        </objectField>
      </level2Object>
      <level2Object>
        <objectType>claim</objectType>
        <objectID>120</objectID>
        <objectSeq>3</objectSeq>
        <objectField>
          <fieldID>1107</fieldID>
          <fieldName>claimRef</fieldName>
          <fieldValue>V1058036</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1108</fieldID>
          <fieldName>totalClaimAmount</fieldName>
          <fieldValue>10000.00</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1109</fieldID>
          <fieldName>totalGroupClaimAmt</fieldName>
          <fieldValue>10000.00</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1533</fieldID>
          <fieldName>totalFundAmt</fieldName>
          <fieldValue>100000.00</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1110</fieldID>
          <fieldName>trivialityInd</fieldName>
          <fieldValue>N</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1111</fieldID>
          <fieldName>reducedPensionAmt</fieldName>
          <fieldValue>3750.00</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1112</fieldID>
          <fieldName>firstPaymentDate</fieldName>
          <fieldValue>2010-11-19</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1113</fieldID>
          <fieldName>paymentType</fieldName>
          <fieldValue>IN ADVANCE</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1114</fieldID>
          <fieldName>paymentInterval</fieldName>
          <fieldValue>QUARTERLY</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1115</fieldID>
          <fieldName>lumpSumAmt</fieldName>
          <fieldValue>25000.00</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1116</fieldID>
          <fieldName>residualSum</fieldName>
          <fieldValue>75000.00</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1117</fieldID>
          <fieldName>slaPerc</fieldName>
          <fieldValue>0.000</fieldValue>
        </objectField>
        <objectField>
          <fieldID>1118</fieldID>
          <fieldName>jointLifePercentage</fieldName>
          <fieldValue>0.00</fieldValue>
        </objectField>
        <objectFie/>
      </level2Object>
    </level1Object>
  </level0Object>
</document>
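
To see the effect of --recover in isolation, here is a minimal sketch run against a made-up, deliberately truncated fragment (the fragment and the variable names are assumptions for illustration, not data from the original file). The parser errors still go to stderr, which is discarded here; only the recovered tree is kept.

```shell
#!/bin/bash
# Hypothetical truncated input: the final tag is cut off mid-name.
fragment='<document><objectField><fieldID>1118</fieldID></objectField><objectFie'

if command -v xmllint >/dev/null 2>&1; then
    # --recover asks xmllint to output whatever it managed to parse;
    # --format re-indents that output. Errors on stderr are dropped.
    recovered=$(printf '%s' "$fragment" | xmllint --format --recover - 2>/dev/null)
    printf '%s\n' "$recovered"
else
    echo "xmllint is not installed; nothing to demonstrate" >&2
fi
```

Since the repaired document can silently lose the truncated part, it is probably worth keeping the error messages visible while working on real data.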


grail 10-27-2010 09:40 AM

Quote:

Are you not getting the line spaces in the extracted XML then? I get one on every line. Any ideas how I can stop this?

No, I am not, but as I said, I suspect your input file was written and saved under Windows. Do you have dos2unix or something similar that you can run over the file?
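
If dos2unix is not to hand, the Windows line-ending theory can be checked with standard tools. The following is only a sketch on a made-up two-line sample (the temporary files and variable names are assumptions): it counts lines that end in a carriage return, then strips the carriage returns with tr. One relevant detail is that in a CRLF file a visually blank line still contains a carriage return, so the sed pattern /^$/ will not match it.

```shell
#!/bin/bash
# Hypothetical sample saved with Windows (CRLF) line endings.
sample=$(mktemp)
printf ' <document>\r\n \r\n' > "$sample"

# A literal carriage return, built portably.
cr=$(printf '\r')

# Count the lines ending in CR; a non-zero count suggests CRLF endings.
crlf_lines=$(grep -c "${cr}\$" "$sample")
echo "lines ending in CR: $crlf_lines"

# Delete every carriage return, then count again.
cleaned=$(mktemp)
tr -d '\r' < "$sample" > "$cleaned"
remaining=$(grep -c "${cr}\$" "$cleaned")
echo "lines ending in CR after stripping: $remaining"

rm -f "$sample" "$cleaned"
```

On a real log, the same tr command could be placed at the front of the pipeline so that sed only ever sees Unix line endings.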

hugh86 10-28-2010 03:10 AM

@Kenhelm

Did your code work then? Did you just run that line, or did you use it alongside the script I posted? I'm new to all this and not sure whether it's an addition to what I have already done.

Code:

#!/bin/bash
echo "getXML"

echo -n "Enter the source file name WITH extension : "
read -r infile
echo "Processing... : "
sleep 1
echo -n "Enter output file name (extension not applicable) : "
read -r outfile
# Print every line from "Sending XML" through "Message sending ended".
# The variables are quoted so file names containing spaces still work.
sed -n '/Sending XML/,/Message sending ended/p' "$infile" > "$outfile"
echo "Processing XML... : "
sleep 1
echo "Success..Data should be in '$outfile' if compiled correctly"


Thank you

Kenhelm 10-28-2010 04:17 AM

I just ran the line of code I posted.
This is your script with the code inserted.
Code:

#!/bin/bash
echo "getXML"

echo -n "Enter the source file name WITH extension : "
read -r infile
echo "Processing... : "
sleep 1
echo -n "Enter output file name (extension not applicable) : "
read -r outfile

# Extract the XML block, join it onto one line, then let xmllint re-indent it.
# The variables are quoted so file names containing spaces still work.
sed -n '/^ <document>/,/^$/s/^ //p' "$infile" |
tr -d '\n' |
xmllint --format - > "$outfile"

echo "Processing XML... : "
sleep 1
echo "Success..Data should be in '$outfile' if compiled correctly"

/^ <document>/,/^$/ selects the lines from ' <document>' to the next empty line.
s/^ // removes the single leading space from each line.
tr -d '\n' removes the newlines, putting all the XML onto a single line.
If the XML is valid you shouldn't need the '--recover' option to xmllint, but if you get parsing error messages, try putting it back in.
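
The sed and tr steps described above can be tried without xmllint on a small made-up log (the sample lines and the temporary file are assumptions for illustration only). Note that the empty line that closes the range is not printed, because the substitution does not match it and the p flag only prints lines where a substitution was made.

```shell
#!/bin/bash
# Hypothetical log excerpt: each XML line has one leading space and the
# block is followed by an empty line.
log=$(mktemp)
printf '%s\n' \
    'Sending XML' \
    ' <document>' \
    ' <requestType>claim</requestType>' \
    ' </document>' \
    '' \
    'Message sending ended' > "$log"

# Select the range, strip the leading space, print only the changed lines,
# then delete the newlines so the XML ends up on a single line.
xml=$(sed -n '/^ <document>/,/^$/s/^ //p' "$log" | tr -d '\n')
printf '%s\n' "$xml"
# prints: <document><requestType>claim</requestType></document>

rm -f "$log"
```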


All times are GMT -5.