LinuxQuestions.org
Old 08-23-2006, 10:22 PM   #1
Arjuna
LQ Newbie
 
Registered: May 2006
Posts: 11

Rep: Reputation: 0
Converting from Word for Mac .doc to txt


I am in the process of transferring 2000 documents to a MySQL database by extracting the text and some header information from them. The documents were written in various versions of MS Word, the earliest ones mainly with Word for Mac, perhaps version 5. They span the years from 1990 up to today. I was planning to do the work with antiword, but I ran into this issue:

$ antiword -s *
MacWord: fast saved documents are not supported yet

Are there any workarounds? Are there any other programs that would extract the data from Word for Mac files? OpenOffice does not have a filter for this particular format. WordPerfect could perhaps do it, but can you use it from the command line or a script, and have it dump the result into .txt?

I am using antiword Version: 0.35.
Edit: Also tried the latest release 0.37 with the same result.
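
A minimal sketch of how the batch run could be scripted so that unsupported files end up in a list instead of stopping the job. It assumes antiword writes the converted text to standard output and either exits with a non-zero status or prints nothing there when it cannot read a file; that failure behaviour has not been verified. `convert_all` and its `command` parameter are made-up names for illustration.

```python
import subprocess


def convert_all(paths, command=("antiword",)):
    """Run a converter on each file and sort the results.

    Returns two dicts keyed by path: converted text, and error messages
    for the files the converter could not handle.
    """
    converted, failed = {}, {}
    for path in paths:
        result = subprocess.run(
            [*command, str(path)],
            capture_output=True,
            text=True,
            errors="replace",  # old documents may contain undecodable bytes
        )
        # Assumption: a failure shows up as a non-zero exit status or as
        # empty standard output. Adjust if the converter behaves differently.
        if result.returncode != 0 or not result.stdout.strip():
            failed[str(path)] = (result.stderr or result.stdout).strip()
        else:
            converted[str(path)] = result.stdout
    return converted, failed
```

Calling `convert_all(list_of_doc_paths)` would give one dict ready to be written out as .txt files or loaded into the database, and a second dict naming the files (for example the fast saved MacWord ones) that still need another tool.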

Last edited by Arjuna; 08-23-2006 at 10:33 PM.
 
Old 08-24-2006, 11:55 AM   #2
tredegar
Guru
 
Registered: May 2003
Location: London, UK
Distribution: Ubuntu 10.04, mostly
Posts: 6,007

Rep: Reputation: 367
This (older "word" files) is a big problem. I had a lot of trouble reading old Word95(?) documents that I just assumed would be converted. They were not. This is the trouble with proprietary formats.

My (inelegant) solution was to take all my files to someone's computer with windows running. There might be a windows batch utility to convert (update) them, but I only had a few. I read them in, saved them in a better format, and took them back to linux to read them.

Now all my docs are saved as openoffice (sxw) and I just re-save as doc when I am sending something to someone who can only run windows.

HTH
 
Old 08-24-2006, 06:03 PM   #3
Arjuna
LQ Newbie
 
Registered: May 2006
Posts: 11

Original Poster
Rep: Reputation: 0
Yes, I have hundreds of them, and that is the problem. If nothing else comes up, I will have to look into WordPerfect or the batch solution you are suggesting. Thanks for the idea.
 
Old 08-24-2006, 06:22 PM   #4
PingFloyd
Member
 
Registered: Jun 2006
Posts: 94

Rep: Reputation: 16
I don't know if it will solve your problem, but you might try opening them with Abiword and see how well it does.

Personally, I try to stay away from proprietary formats like word as much as possible. Sometimes there is no choice, but I will often save things in RTF format if it's got enough functionality to get the job done.

Word format is probably the one to avoid saving as whenever possible since Microsoft has a real bad habit of making each version incompatible (I have a feeling that that isn't an accident on their part either). It's funny how even they themselves have issues converting between different versions of it when it is their own format. To me this is a red flag and ranks it as a format to avoid like the plague and to only use if there is absolutely no other solution to solve what you want to do.
 
Old 08-24-2006, 07:54 PM   #5
Arjuna
LQ Newbie
 
Registered: May 2006
Posts: 11

Original Poster
Rep: Reputation: 0
Thanks for the tip. I tried Abiword, with the same result. It complains "The file was not recognised as document. There can be unrecognised formattings in the document" or something to that effect.

I have done quite a bit of research on this, and it seems only MS Word can cope with this format. Some suggest that WordPerfect can too. I am keeping the MS Word script as my last resort, as I do not know Windows scripting by heart.

As installing WordPerfect 8.1 on Ubuntu Dapper seems quite a tedious task, I would love it if somebody could tell me whether you can operate it from the command line and have it just dump the .doc data as plain text. This info would be a great help.

To tredegar and PingFloyd: Sometimes you get this kind of project, where the organisation started as the effort of a few non-technical people who began producing material. Over the years the organisation has grown large, and now there is a need for a streamlined document management system. Of course, your suggestion of not using .doc is good by all means.
 
Old 08-24-2006, 07:55 PM   #6
J.W.
LQ Veteran
 
Registered: Mar 2003
Location: Milwaukee, WI
Distribution: Mint
Posts: 6,642

Rep: Reputation: 69
I'll echo PingFloyd's suggestion to try Abiword. I don't know if it can support all previous formats, but I likewise have several old Word docs that OpenOffice could not handle properly, while Abiword handled them correctly.

I'm not familiar with any capability to run a mass conversion via the CLI, but worst case scenario is that for any files that could not be mass-converted, it might be possible to manually open the file and save the information.
 
Old 08-24-2006, 08:49 PM   #7
Arjuna
LQ Newbie
 
Registered: May 2006
Posts: 11

Original Poster
Rep: Reputation: 0
JW, As I wrote above, I installed and tried abiword. Did not work. Thanks anyway.
 
Old 08-24-2006, 09:56 PM   #8
J.W.
LQ Veteran
 
Registered: Mar 2003
Location: Milwaukee, WI
Distribution: Mint
Posts: 6,642

Rep: Reputation: 69
Sorry about that, from the timestamps I was just typing my reply when you posted. Good luck with it anyway
 
Old 08-25-2006, 09:47 AM   #9
tredegar
Guru
 
Registered: May 2003
Location: London, UK
Distribution: Ubuntu 10.04, mostly
Posts: 6,007

Rep: Reputation: 367
Arjuna,
As some of your "word" files are old I think you are going to have to use microsoft to do the conversion / updates for you. Just borrow someone else's computer. Once you have done this, you'll be able to read your files.

Even microsoft recognise this as a problem. See this link
"Word for Windows Batch Conversion Macro":
http://support.microsoft.com/kb/107439/EN-US/

HTH
 
Old 08-25-2006, 11:40 AM   #10
PingFloyd
Member
 
Registered: Jun 2006
Posts: 94

Rep: Reputation: 16
Quote:
Originally Posted by Arjuna
As installing WordPerfect 8.1 on Ubuntu Dapper seems quite a tedious task, I would love it if somebody could tell me whether you can operate it from the command line and have it just dump the .doc data as plain text. This info would be a great help.
What happens if you open them up in say vim or some other text editor? My guess is that all the special formatting shows up as garbage, but that the text is still at least there. If that's the case, then it's probably just a matter of writing a script that will filter out the "garbage" (formatting garble) and create new text files. You may be able to use grep to act as a filter for this. There may be another way. I'm sure there probably is with how flexible Linux is.
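
A rough sketch of that filtering idea in Python rather than grep: keep only runs of bytes that look like text and drop everything else. It assumes the text is stored as single-byte Mac Roman characters, which is a guess about these old files, not something confirmed in this thread. The function name `extract_text_runs` and the minimum run length of 4 are arbitrary choices for illustration.

```python
import re


def extract_text_runs(data, min_length=4, encoding="mac_roman"):
    """Return the runs of text-like bytes found in raw file contents.

    A byte counts as text if it is a tab, printable ASCII, or in the
    upper half of the byte range (accented letters in Mac Roman).
    Runs shorter than min_length are treated as formatting noise.
    """
    pattern = re.compile(rb"[\t\x20-\x7e\x80-\xff]{%d,}" % min_length)
    return [match.group().decode(encoding) for match in pattern.finditer(data)]


def file_to_text(path, **options):
    """Read a file as bytes and join its text runs, one per line."""
    with open(path, "rb") as handle:
        return "\n".join(extract_text_runs(handle.read(), **options))
```

Since binary formatting data also contains bytes in the upper range, the output will probably still include some junk; raising `min_length` trades lost short fragments for less noise.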

Anyway, that's just an idea of one way to possibly tackle it.

Another possibility: what if you use wine and an old version of word (or a "word compatible" program)? Then it may be easier to work a script into things, since you could write a shell script instead of having to deal with the limitations of windows for such automated tasks.
 
Old 09-02-2006, 03:46 PM   #11
Arjuna
LQ Newbie
 
Registered: May 2006
Posts: 11

Original Poster
Rep: Reputation: 0
I recently got an email reply from the Antiword developer. According to him, there is no more support coming for the MacWord format, as there is not enough information about it. It is a more complex format than the Win-Word format. He further writes "The fast saved format is a lot more complex than the normal format and so, with insufficient information impossible to do." He would be delighted if anybody had some info about the MacWord format prior to version 6.

So the antiword route came to an end. I decided to go with convert.wiz, the MS Word batch conversion macro, but that one does not work recursively. Not such an easy task here. Perhaps a script that encodes the directory structure in each filename, copies all the files into one folder, and then runs the conversion would do. Let's try.
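
One way the flattening step might look, as a sketch only. The name `flatten_tree` and the `__` separator are invented for this example, and it assumes that no original file or folder name already contains the separator and that the target folder lies outside the source tree.

```python
import shutil
from pathlib import Path


def flatten_tree(source_root, target_dir, separator="__"):
    """Copy every file under source_root into the single folder target_dir.

    The relative directory path is folded into the new filename, so
    1990/jan/report becomes 1990__jan__report. Returns a dict mapping
    each flat name back to its original relative path.
    """
    source_root = Path(source_root)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for path in sorted(source_root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source_root)
        flat_name = separator.join(relative.parts)
        destination = target_dir / flat_name
        if destination.exists():
            # Two different paths collapsed to the same flat name.
            raise FileExistsError(f"name collision for {relative}")
        shutil.copy2(path, destination)
        mapping[flat_name] = relative
    return mapping
```

The returned mapping can be saved so that the converted files are sorted back into their original folders after the batch macro has run.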
 
Old 09-04-2006, 12:04 PM   #12
Arjuna
LQ Newbie
 
Registered: May 2006
Posts: 11

Original Poster
Rep: Reputation: 0
I am posting this message with the permission of the author:
Quote:
I'm sorry to tell you there are no versions of Antiword in the near future with more support for the Mac-Word format. The reason for this is simple: there is not enough information available about this format.

Since the Mac-Word format is even more complex than the Win-Word format it is very difficult to reverse engineer. The Mac-Word format and the Win-Word format were united for the first time in Word 6.

The fast saved format is a lot more complex than the normal format and so, with insufficient information impossible to do. That is also the reason why the people at OpenOffice have not tried, even with a full-time programmer staff it takes far too much effort.

If you have any information about the Mac-Word format for any version of Mac-Word older than Word 6, please let me know.

Kind Regards,
Adri van Os
-Arjuna
 
  

