Splitting file on content basis
Dear all,

I have one file of around 20 MB and want to split it on a content basis using awk or the split utility. I have already split it by size, but the resulting files are of no use, so I need to split it by content instead, adding the opening and closing tags to each split file.

For example, the original file starts with these opening tags:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<ns0:ABCFile xmlns:ns0="urn:PQR:OTHERS:WXYZ:HELLOTEST">
<ABCFileHeader>
<RecordType>01</RecordType>
<Date>20140405</Date>
<TotalRecord>46048</TotalRecord>
</ABCFileHeader>

An actual record looks like this:

Code:
<ABRecordDetail>
<RecordType>02</RecordType>
<LineItem>0000000002</LineItem>
<CompanyCode>PQR</CompanyCode>
<ABDate>20130901</ABDate>
<CurrencyKey>PVR</CurrencyKey>
<AmountInDC>0</AmountInDC>
<AmountInLC>0</AmountInLC>
<CostCenter>BBN</CostCenter>
<FType>DTH</FType>
<QNumber>VBR3581 </QNumber>
<SNumber>9kBQ</SNumber>
<VNumber>BBGRB</VNumber>
<Assignment>0945</Assignment>
</ABRecordDetail>

These 15 lines make up one record, and the original file has 46048 such records. I want to split it so that each file gets 46048 / 4 = 11512 records, plus the opening and closing tags.

Opening tags for each split file (the same as in the original, except that TotalRecord becomes 11512):

Code:
<?xml version="1.0" encoding="UTF-8"?>
<ns0:ABCFile xmlns:ns0="urn:PQR:OTHERS:WXYZ:HELLOTEST">
<ABCFileHeader>
<RecordType>01</RecordType>
<Date>20140405</Date>
<TotalRecord>11512</TotalRecord>
</ABCFileHeader>

Closing tag:

Code:
</ns0:ABCFile>

I hope that is clear. Put simply, the file needs to be split on a record basis (15 lines per record), with the fixed tags added at the top and bottom of each file.
Your best bet is to use a language that has an XML parsing library. Use an option parsing library to take options (such as how many splits or the name of the output file) and then output the split files (e.g. file001.xml file002.xml etc). You're not going to get a decent solution unless you use real parsing.
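As a rough illustration of the parser-based approach, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name split_records and the out_pattern parameter are made up for this example; the tag names (ABCFileHeader, TotalRecord, ABRecordDetail) are taken from the sample in the question. It assumes the whole file fits in memory, which should be fine at around 20 MB, and that the sample is well-formed XML. ElementTree chooses the namespace prefix itself when writing, so the prefix in the output may not be spelled exactly as in the input, though the namespace URI is kept.

```python
import copy
import xml.etree.ElementTree as ET


def split_records(src_path, parts, record_tag="ABRecordDetail",
                  out_pattern="file{:03d}.xml"):
    """Split src_path into at most `parts` files, each with its own header.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    root = ET.parse(src_path).getroot()
    header = root.find("ABCFileHeader")
    records = root.findall(record_tag)

    # Ceiling division so that no record is left over; at least 1 per file.
    per_file = max(1, -(-len(records) // parts))

    out_paths = []
    for index, start in enumerate(range(0, len(records), per_file), start=1):
        chunk = records[start:start + per_file]

        # New root with the same (namespaced) tag as the original root.
        new_root = ET.Element(root.tag)
        new_root.text = root.text  # keep the whitespace after the root tag

        # Copy the header and patch TotalRecord to this file's record count.
        new_header = copy.deepcopy(header)
        new_header.find("TotalRecord").text = str(len(chunk))
        new_root.append(new_header)
        new_root.extend(chunk)

        path = out_pattern.format(index)
        ET.ElementTree(new_root).write(path, encoding="UTF-8",
                                       xml_declaration=True)
        out_paths.append(path)
    return out_paths
```

Calling split_records("input.xml", 4) would write file001.xml to file004.xml in the current directory (input.xml is a placeholder name). The number of splits and the output name could be taken from the command line with an option parser such as argparse.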
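Actually, you can try setting the record separator to </ABRecordDetail> (a multi-character RS needs gawk or mawk) and printing each record into a file whose name is derived from the record number, e.g. the record number divided by the number of records per file.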
Code:
awk ' BEGIN {RS="</ABRecordDetail>"}
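The awk command above is cut off in the post, so its full form is not known. As a rough sketch of the same record-separator idea, written in Python rather than awk, the file can be treated as plain text and cut after every closing </ABRecordDetail>. The function name split_by_separator and its parameters are made up for this example. It assumes the record tags carry no attributes and that the header contains a numeric <TotalRecord> element, as in the sample from the question. Unlike a real parser, it does not check that the XML is well-formed.

```python
import re


def split_by_separator(text, parts, record_tag="ABRecordDetail",
                       closing="</ns0:ABCFile>"):
    """Return a list of XML strings, each holding a share of the records.

    Hypothetical helper for illustration; works on text, not on parsed XML.
    """
    open_tag = "<%s>" % record_tag
    close_tag = "</%s>" % record_tag

    # Everything before the first record is the fixed header block.
    first = text.index(open_tag)
    header = text[:first]
    body = text[first:text.rindex(closing)]

    # Cut after each closing tag, like a record separator, and drop the
    # whitespace-only remainder that follows the last record.
    records = [chunk.strip() + close_tag + "\n"
               for chunk in body.split(close_tag) if chunk.strip()]

    # Ceiling division; at least one record per output.
    per_file = max(1, -(-len(records) // parts))

    outputs = []
    for start in range(0, len(records), per_file):
        group = records[start:start + per_file]
        # Patch the record count in the copied header.
        head = re.sub(r"<TotalRecord>\d+</TotalRecord>",
                      "<TotalRecord>%d</TotalRecord>" % len(group), header)
        outputs.append(head + "".join(group) + closing + "\n")
    return outputs
```

Each returned string could then be written to its own file, for example file001.xml, file002.xml and so on.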