LinuxQuestions.org


azheruddin 05-07-2014 12:57 AM

Splitting file on content basis
 
Dear all,

I have a file of around 20 MB and want to split it by content, using awk or the split utility.

I have already split it by size, but the resulting files are of no use, so I want to split it by content instead.

I need to split this file by content and add the opening and closing tags to each split file.
For example, the original file starts with these opening tags:
<?xml version="1.0" encoding="UTF-8"?>
<ns0:ABCFile xmlns:ns0="urn:PQR:OTHERS:WXYZ:HELLOTEST">
<ABCFileHeader>
<RecordType>01</RecordType>
<Date>20140405</Date>
<TotalRecord>46048</TotalRecord> // 46048/4 = 11512 records in each file
</ABCFileHeader>
.
.
An actual record looks like this:
<ABRecordDetail>
<RecordType>02</RecordType>
<LineItem>0000000002</LineItem>
<CompanyCode>PQR</CompanyCode>
<ABDate>20130901</ABDate>
<CurrencyKey>PVR</CurrencyKey>
<AmountInDC>0</AmountInDC>
<AmountInLC>0</AmountInLC>
<CostCenter>BBN</CostCenter>
<FType>DTH</FType>
<QNumber>VBR3581 </QNumber>
<SNumber>9kBQ</SNumber>
<VNumber>BBGRB</VNumber>
<Assignment>0945</Assignment>
</ABRecordDetail>

The 15 lines above make up one record. The original file has 46048 such records, so I want to split it so that each file holds 46048/4 = 11512 records, plus the opening and closing tags.

Opening tags:

<?xml version="1.0" encoding="UTF-8"?>
<ns0:ABCFile xmlns:ns0="urn:PQR:OTHERS:WXYZ:HELLOTEST">
<ABCFileHeader>
<RecordType>01</RecordType>
<Date>20140405</Date>
<TotalRecord>46048</TotalRecord> // 46048/4 = 11512 records in each file, so in each split file the tag would be <TotalRecord>11512</TotalRecord>
</ABCFileHeader>

Closing tag:
</ns0:ABCFile>

I hope that is clear. Put simply, the file needs to be split by content (by record, i.e. in units of 15 lines), with the fixed tags added at the top and bottom of each file.

sag47 05-07-2014 08:38 AM

Your best bet is to use a language that has an XML parsing library. Use an option parsing library to take options (such as how many splits or the name of the output file) and then output the split files (e.g. file001.xml file002.xml etc). You're not going to get a decent solution unless you use real parsing.

pan64 05-07-2014 09:28 AM

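Actually, you can try setting the record separator to </ABRecordDetail> and printing each record into a file whose name is derived from the record number (the record number divided by the number of records per file, rounded down).
Code:

awk ' BEGIN { RS = ORS = "</ABRecordDetail>" }   # a multi-character RS needs gawk or mawk
      # 11512 records per file; skip the trailing chunk that holds only the closing tag
      /<ABRecordDetail>/ { filename = "file" int((NR - 1) / 11512) ".xml"
                           print > filename }
' inputfile

but it was not tested, and it does not add the header and closing tag to each file.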
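
A line-based variant of the same idea, which also writes the opening and closing tags into every piece, might look like the sketch below. It is not the only way to do it and has only been checked against small samples. It assumes that every record starts with a line containing <ABRecordDetail> and ends with a line containing </ABRecordDetail>, that everything before the first record is the header, and that the closing tag is </ns0:ABCFile> as in the question. The function name split_records, its three arguments, and the output file names are made up for illustration.

```shell
# Sketch only: split_records, its arguments and the output names are illustrative.
# Usage: split_records INPUT RECORDS_PER_FILE OUTPUT_PREFIX
split_records() {
    awk -v per="$2" -v prefix="$3" '
        # Write the buffered records as one complete XML file.
        function write_part(    out, head) {
            if (n == 0) return
            out = sprintf("%s%03d.xml", prefix, ++part)
            head = header
            # Put the number of records in this piece into the header.
            sub(/<TotalRecord>[0-9]+<\/TotalRecord>/, "<TotalRecord>" n "</TotalRecord>", head)
            printf "%s%s</ns0:ABCFile>\n", head, body > out
            close(out)
            body = ""
            n = 0
        }
        /<ABRecordDetail>/    { in_record = 1; seen_record = 1 }   # a record starts
        in_record             { body = body $0 "\n" }              # collect record lines
        !seen_record          { header = header $0 "\n" }          # lines before the first record
        /<\/ABRecordDetail>/  { in_record = 0; if (++n == per) write_part() }
        END                   { write_part() }                     # leftover records, if any
    ' "$1"
}
```

For the file in the question this would be called as split_records inputfile 11512 file, which would write file001.xml, file002.xml and so on. Each piece is buffered in memory before it is written, so that the record count in the header can be set to the number of records actually in that piece.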

