LinuxQuestions.org
Old 06-24-2013, 04:25 PM   #1
weebo1
LQ Newbie
 
Registered: Jun 2013
Posts: 3

Rep: Reputation: Disabled
file conversion issue


Hello

After a few hours of unsuccessful googling, I decided to ask the pros.
This is my problem:
The output of an annotation tool is saved as an .xml file.
I need to extract the lines that contain a certain pattern from the many files in a folder.
Here is the catch: it works for some files but not for others. I tried to break the problem down.
My "detective" work (aka googling like crazy) has led me to the following conclusions:
1. The output of
Code:
grep "start" test.anvil
is:
Code:
      <el index="0" start="0" end="0.93332">
      <el index="1" start="0.93332" end="1.93331">
      <el index="2" start="1.93331" end="3.1333">
So, my grep command works.

2. The output of
Code:
file -bi test.anvil
is:
Code:
text/xml
3. The output of
Code:
grep "start" test_not_working.anvil
is:
nothing.

4. The output of
Code:
file -bi test_not_working.anvil
is:
Code:
text/plain; charset=utf-16
I tried iconv in every way I could think of, with no success. I get the error:
Code:
iconv: illegal input sequence at position 0.
I tried messing with the XML file itself. Nothing.
My only mistake was that when I saved the file in the annotation tool, I didn't choose ISO-8859-1 encoding from the beginning and left the default value, UTF-8.
I really don't know what else I can try. Everything I found on Google was related to iconv, and nothing worked for me, not even the //IGNORE option.
Any help is more than appreciated.

Thanks a lot
 
Old 06-24-2013, 04:41 PM   #2
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,116

Rep: Reputation: 1530
Have you tried opening a copy of the file in vim?

Then try this and write out the file:

Code:
:set fileencoding=UTF-8
 
Old 06-27-2013, 03:58 AM   #3
weebo1
LQ Newbie
 
Registered: Jun 2013
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thanks for your reply. I tried and I still get errors.
Any other suggestions?


Thanks
 
Old 06-27-2013, 05:19 AM   #4
mddnix
Member
 
Registered: Mar 2013
Distribution: Redhat, Ubuntu
Posts: 516

Rep: Reputation: 139
Code:
iconv -f UTF-16 -t UTF-8 test_not_working.anvil | grep 'start'
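
For anyone following along, here is a minimal sketch of why a plain grep prints nothing on a UTF-16 file and what the pipeline above does on a well-formed one. The file name sample_utf16.anvil and its single line are made up for the demonstration (the line is borrowed from post #1), and the sketch assumes the input really is valid UTF-16. It does not by itself explain the "illegal input sequence" error reported earlier.

```shell
#!/bin/sh
# Build a small UTF-16 sample file (hypothetical name) in a scratch directory.
tmp=$(mktemp -d)
printf '<el index="0" start="0" end="0.93332">\n' |
    iconv -f UTF-8 -t UTF-16 > "$tmp/sample_utf16.anvil"

# In UTF-16 each ASCII character is stored next to a NUL byte, so the
# byte sequence "start" never occurs and grep finds no matching line.
plain_matches=$(grep -c 'start' "$tmp/sample_utf16.anvil")

# Decoding to UTF-8 first makes the text greppable again.
converted_matches=$(iconv -f UTF-16 -t UTF-8 "$tmp/sample_utf16.anvil" | grep -c 'start')

echo "plain grep matches:    $plain_matches"
echo "after iconv to UTF-8:  $converted_matches"
rm -rf "$tmp"
```

If the second count is also zero, or iconv stops with an error on your own file, the file is probably not plain UTF-16 and the reported encoding is worth double-checking.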
 
Old 06-28-2013, 01:45 AM   #5
weebo1
LQ Newbie
 
Registered: Jun 2013
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thanks, but iconv was one of the first things I tried, with no success.
 
Old 06-28-2013, 01:52 AM   #6
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 4,116

Rep: Reputation: 1530
Is there any way you can upload a small file that is giving you problems? After scrubbing it of proprietary/confidential information, of course.

I'd like to see what's what.
 
Old 06-28-2013, 01:56 PM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1958
Try checking for the existence of a byte order mark (BOM), and perhaps DOS-style line endings.
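
One way to run both checks from the shell is sketched below. It is only an illustration: the sample file is generated on the spot (UTF-16LE with a BOM and CRLF line endings, a hypothetical stand-in for the real data), and the script recognises just three common BOMs.

```shell
#!/bin/sh
# Hypothetical sample: UTF-16LE text with a BOM and DOS (CRLF) line endings.
tmp=$(mktemp -d)
f="$tmp/sample.anvil"
printf '\377\376' > "$f"    # BOM bytes FF FE, written as octal escapes
printf '<el start="0">\r\n' | iconv -f UTF-8 -t UTF-16LE >> "$f"

# Look at the first bytes in hex:
#   ff fe = UTF-16LE BOM, fe ff = UTF-16BE BOM, ef bb bf = UTF-8 BOM
bom=$(head -c 3 "$f" | od -An -tx1 | tr -d ' \n')
case "$bom" in
    fffe*)  encoding=UTF-16LE ;;
    feff*)  encoding=UTF-16BE ;;
    efbbbf) encoding=UTF-8 ;;
    *)      encoding=unknown ;;
esac
echo "first bytes: $bom -> $encoding"

# Skip the 2-byte BOM, decode, and count lines that end in a carriage return.
# A non-zero count means DOS-style line endings.
cr=$(printf '\r')
crlf_lines=$(tail -c +3 "$f" | iconv -f "$encoding" -t UTF-8 | grep -c "$cr\$")
echo "lines with DOS line endings: $crlf_lines"
rm -rf "$tmp"
```

On a real file, point f at the file in question and drop the two printf lines that build the sample. If the encoding comes out as unknown, the iconv step will fail, and the hex bytes are the thing to look at.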

In any case, XML is not an easily greppable data format. I suggest running each file through the hxpipe utility, from the html-xml-utils package. It converts the XML into a line-based format that's safer and easier for grep/sed/awk to parse. You will have to clean up the lines a bit, though, to remove the formatting flags.

The package also has a few other tools like hxclean and hxnormalize which can help tidy up broken xml.


And finally, I agree with szboardstretcher. It would help to see a larger sample of the data, and preferably the original.

Edit: Here's an example of how hxpipe can be used. With the lines given in the OP, you can extract the "start" data like this:

Code:
$ text='<el index="0" start="0" end="0.93332">
<el index="1" start="0.93332" end="1.93331">
<el index="2" start="1.93331" end="3.1333">'

$ hxpipe <<<"$text"
-
Aindex CDATA 0
Astart CDATA 0
Aend CDATA 0.93332
(el
-\n
Aindex CDATA 1
Astart CDATA 0.93332
Aend CDATA 1.93331
(el
-\n
Aindex CDATA 2
Astart CDATA 1.93331
Aend CDATA 3.1333
(el
-\n

$ hxpipe <<<"$text" | sed -n '/^Astart/ s/.*CDATA //p'
0
0.93332
1.93331

Last edited by David the H.; 06-28-2013 at 02:06 PM. Reason: as stated
 
  


All times are GMT -5. The time now is 11:02 AM.
