LinuxQuestions.org


nbkisnz 05-12-2012 02:16 AM

Search and replace a pattern preceding another pattern
 
Hi All,

I have an XML file which ideally should contain records in the format "<EMPLOYEE><EMP_NAME>ABC</EMP_NAME><EMP_ID>XY21Z</EMP_ID></EMPLOYEE><DEPARTMENT><DEPT_NAME>HR</DEPT_NAME></DEPARTMENT>"

But due to some issues in the ETL, some records are getting created in the following format:
"<EMPLOYEE><EMP_NAME>ABC</EMP_NAME><EMP_ID>XY21Z<DEPARTMENT><DEPT_NAME>HR</DEPT_NAME></DEPARTMENT>"

Due to certain limitations in the ETL process, I have to rectify such records using shell scripting. Is there a command that can find all occurrences of the "<DEPARTMENT>" tag that aren't preceded by "</EMPLOYEE>", so that I can replace those occurrences with the correct format?

Thanks in advance
Sriram

grail 05-12-2012 03:34 AM

If I understand correctly, maybe something like:
Code:

sed -n '/\/EMPLOYEE.*DEPARTMENT/!p' file
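As a variation on the same idea, a grep pipeline can list the offending lines together with their line numbers. This is only a sketch: find_broken is a made-up helper name, and it assumes one record per line, as in the samples above.

```shell
# List lines that contain <DEPARTMENT> without </EMPLOYEE> directly before it.
# Assumption: one record per line; find_broken is a hypothetical helper name.
find_broken() {
    # First grep: keep lines with a <DEPARTMENT> tag, prefixed by line number.
    # Second grep: drop the lines where </EMPLOYEE> immediately precedes it.
    grep -n '<DEPARTMENT>' "$@" | grep -v '</EMPLOYEE>[[:space:]]*<DEPARTMENT>'
}
```

Called as find_broken file (or with the data on standard input), it prints each suspect line as NUMBER:LINE, which makes it easy to check the damage before changing anything.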

Nominal Animal 05-12-2012 07:30 AM

Please use [CODE][/CODE] tags around the data to make it more readable. See:
Quote:

Originally Posted by nbkisnz (Post 4676477)
Code:

<EMPLOYEE>
 <EMP_NAME>ABC</EMP_NAME>
 <EMP_ID>XY21Z</EMP_ID>
</EMPLOYEE>
<DEPARTMENT>
 <DEPT_NAME>HR</DEPT_NAME>
</DEPARTMENT>

But due to some issues in the ETL, some records are getting created in the following format:
Code:

<EMPLOYEE>
 <EMP_NAME>ABC</EMP_NAME>
 <EMP_ID>XY21Z
  <DEPARTMENT>
  <DEPT_NAME>HR</DEPT_NAME>
  </DEPARTMENT>


Are those records really that broken? Are the <EMP_ID> and <EMPLOYEE> elements never closed at all?

It is not difficult to fix things like this using awk with < as the record separator, and > as the field separator. You just need to keep a stack of the currently open elements, and manipulate it to correct the structure.

However, it seems to me both your records are broken. The first one uses the braindead Microsoft approach, where sibling nodes apply to each other. (Do not expect that kind of data model to survive any standard XML tools: they expect elements to be "containers", where an element applies only to its own contents, the elements nested within it.) The latter one leaves elements open, and is therefore not even well-formed XML.

Could you verify exactly what needs to be done to fix the broken records, and whether the correct records are formatted as you first displayed?
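The awk approach described above (< as the record separator, > as the field separator, plus a stack of open elements) might be sketched roughly like this. It rests on assumptions the thread never confirms: that EMPLOYEE and DEPARTMENT are meant to be top-level siblings, and that tags carry no attributes. The name fix_xml is made up for illustration.

```shell
# Sketch only: close any elements still open when a top-level element starts.
# Assumptions: EMPLOYEE and DEPARTMENT are top-level siblings, tags have no
# attributes, and fix_xml is a hypothetical helper name.
fix_xml() {
    awk '
    BEGIN {
        RS = "<"; FS = ">"; ORS = ""
        toplevel["EMPLOYEE"] = 1
        toplevel["DEPARTMENT"] = 1
    }

    # Emit closing tags until only "keep" elements remain on the stack.
    function close_down_to(keep) {
        while (depth > keep) { print "</" stack[depth] ">"; depth-- }
    }

    # Anything before the first "<" (normally nothing) passes through as-is.
    NR == 1 { print; next }

    {
        tag  = $1                           # "EMP_ID" or "/EMP_ID"
        text = substr($0, length(tag) + 2)  # whatever follows the first ">"

        if (substr(tag, 1, 1) == "/") {
            # Closing tag: find the matching open element on the stack.
            name = substr(tag, 2)
            i = depth
            while (i > 0 && stack[i] != name) i--
            if (i > 0) {
                close_down_to(i)            # close unclosed children first
                print "<" tag ">"
                depth = i - 1
            }
            # A closing tag with no matching open element is dropped.
        } else {
            if (tag in toplevel)
                close_down_to(0)            # siblings never nest
            stack[++depth] = tag
            print "<" tag ">"
        }
        print text
    }

    # Close whatever is still open at end of input.
    END { if (depth > 0) { close_down_to(0); print "\n" } }
    ' "$@"
}
```

Run as fix_xml file > fixed.xml. Records that are already correct pass through unchanged, since nothing is open when their <DEPARTMENT> tag arrives.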

theNbomr 05-13-2012 01:50 PM

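If newlines, tabs, and other eye-candy whitespace don't matter, and the only damage is a missing </EMP_ID></EMPLOYEE> directly before <DEPARTMENT>, this seems to do the job without touching records that are already correct:
Code:

sed 's/<EMP_ID>\([^<]*\)<DEPARTMENT>/<EMP_ID>\1<\/EMP_ID><\/EMPLOYEE><DEPARTMENT>/g' LQnbkisnz.xml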
--- rod.

