LinuxQuestions.org


Arjuna 08-23-2006 10:22 PM

Converting from Word for Mac .doc to txt
 
I am in the process of transferring 2000 documents to a MySQL database by extracting the text and some header information from them. The documents were written in various versions of MS Word, the earliest ones mainly with Word for Mac, perhaps version 5. They span the years from 1990 up to today. I was planning to do the work with antiword, but I ran into this issue:

$ antiword -s *
MacWord: fast saved documents are not supported yet

Are there any workarounds? Are there any other programs that would extract the data from Word for Mac files? OpenOffice does not have a filter for this particular format. WordPerfect could perhaps do it, but can you use it from the command line or a script, and have it dump the result into .txt?

I am using antiword version 0.35.
Edit: I also tried the latest release, 0.37, with the same result.
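
One way to size the problem before picking a workaround is to run antiword on each file separately and record which ones it rejects. The Python sketch below does that; it is only an illustration, not something used in this thread. The function name `triage` and its `command` parameter are hypothetical, and the sketch assumes antiword writes the converted text to standard output and signals failure with a non-zero exit status or empty output.

```python
import subprocess
from pathlib import Path


def triage(paths, command=("antiword",)):
    """Run a converter on each file and split the results.

    Returns (converted, failed): converted maps path -> extracted text,
    failed maps path -> whatever the converter printed on stderr.
    Assumption: a non-zero exit status or empty output means failure.
    """
    converted, failed = {}, {}
    for path in paths:
        result = subprocess.run([*command, str(path)], capture_output=True)
        text = result.stdout.decode("utf-8", errors="replace")
        if result.returncode == 0 and text.strip():
            converted[str(path)] = text
        else:
            failed[str(path)] = result.stderr.decode("utf-8", errors="replace").strip()
    return converted, failed


if __name__ == "__main__":
    # "docs" is a placeholder directory name.
    ok, bad = triage(sorted(Path("docs").rglob("*.doc")))
    print(f"{len(ok)} converted, {len(bad)} failed")
    for name, message in bad.items():
        print(f"{name}: {message}")
```

The files that end up in the failed group are the ones that would need a different tool, such as the Word-based conversion discussed later in the thread.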

tredegar 08-24-2006 11:55 AM

This (older "word" files) is a big problem. I had a lot of trouble reading old Word95(?) documents that I just assumed would be converted. They were not. This is the trouble with proprietary formats.

My (inelegant) solution was to take all my files to someone's computer with windows running. There might be a windows batch utility to convert (update) them, but I only had a few. I read them in, saved them in a better format, and took them back to linux to read them.

Now all my docs are saved as openoffice (sxw) and I just re-save as doc when I am sending something to someone who can only run windows.

HTH

Arjuna 08-24-2006 06:03 PM

Yes, I have hundreds of them, and that is the problem. If nothing else comes up, I will have to look into WordPerfect or the batch solution you are suggesting. Thanks for the idea.

PingFloyd 08-24-2006 06:22 PM

I don't know if it will solve your problem, but you might try opening them with Abiword and see how well it does.

Personally, I try to stay away from proprietary formats like word as much as possible. Sometimes there is no choice, but I will often save things in RTF format if it's got enough functionality to get the job done.

Word format is probably the one to avoid saving as whenever possible since Microsoft has a real bad habit of making each version incompatible (I have a feeling that that isn't an accident on their part either). It's funny how even they themselves have issues converting between different versions of it when it is their own format. To me this is a red flag and ranks it as a format to avoid like the plague and to only use if there is absolutely no other solution to solve what you want to do.

Arjuna 08-24-2006 07:54 PM

Thanks for the tip. I tried Abiword, with the same result. It complains "The file was not recognised as document. There can be unrecognised formattings in the document" or something similar.

I have done quite a bit of research on this, and it seems only MS Word can cope with this format. Some suggest that WordPerfect can as well. I am keeping the MS Word script as my last resort, as I do not know Windows scripting by heart.

As installing WordPerfect 8.1 on Ubuntu Dapper seems like quite a tedious task, I would love it if somebody could tell me whether you can operate it from the command line and have it just dump the .doc data as plain text. This info would be a great help :)

To tredegar and PingFloyd: Sometimes you get this kind of project, where the organisation started as the effort of a few non-technical people who began producing material. Over the years the organisation has grown large, and now there is a need for a streamlined document management system. Of course your suggestion of not using .doc is good by all means. :)

J.W. 08-24-2006 07:55 PM

I'll echo PingFloyd's suggestion to try Abiword. I don't know if it can support all previous formats, but I likewise have several old Word docs that OpenOffice could not handle properly, while Abiword handled them correctly.

I'm not familiar with any capability to run a mass conversion via the CLI, but worst case scenario is that for any files that could not be mass-converted, it might be possible to manually open the file and save the information.

Arjuna 08-24-2006 08:49 PM

JW, As I wrote above, I installed and tried abiword. Did not work. Thanks anyway. :)

J.W. 08-24-2006 09:56 PM

Sorry about that, from the timestamps I was just typing my reply when you posted. Good luck with it anyway

tredegar 08-25-2006 09:47 AM

Arjuna,
As some of your "word" files are old I think you are going to have to use microsoft to do the conversion / updates for you. Just borrow someone else's computer. Once you have done this, you'll be able to read your files.

Even microsoft recognise this as a problem. See this link
"Word for Windows Batch Conversion Macro":
http://support.microsoft.com/kb/107439/EN-US/

HTH

PingFloyd 08-25-2006 11:40 AM

Quote:

Originally Posted by Arjuna
As installing WordPerfect 8.1 on Ubuntu Dapper seems like quite a tedious task, I would love it if somebody could tell me whether you can operate it from the command line and have it just dump the .doc data as plain text. This info would be a great help :)

What happens if you open them up in, say, vim or some other text editor? My guess is that all the special formatting shows up as garbage, but that the text is still at least there. If that's the case, then it's probably just a matter of writing a script that will filter out the "garbage" (formatting garble) and create new text files. You may be able to use grep to act as a filter for this. There may be another way. I'm sure there probably is with how flexible Linux is.

Anyway, that's just an idea of one way to possibly tackle it.
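
The filtering idea above can be sketched in Python roughly as follows. This is a minimal illustration, not a tested converter for Mac Word files: it simply keeps runs of printable ASCII bytes, much like the Unix `strings` utility. The function name `extract_text_runs` and the minimum run length of 4 are assumptions chosen for the example. Accented characters would be dropped, and in a fast-saved file the recovered pieces may not come out in reading order.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_run: int = 4) -> List[str]:
    """Return runs of printable ASCII (space to '~', plus tab) found in data.

    Runs shorter than min_run bytes are treated as formatting noise.
    """
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_run)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    import sys

    # Usage: python extract_runs.py somefile.doc > somefile.txt
    with open(sys.argv[1], "rb") as handle:
        for run in extract_text_runs(handle.read()):
            print(run)
```

The output would still need manual checking, since font names and other metadata stored as text survive the filter along with the document body.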

Another possibility: what if you use Wine with an old version of Word (or a "Word compatible" program)? It might then be easier to work a script into things, since you could write a shell script instead of having to deal with the limitations of Windows for automating this kind of thing.

Arjuna 09-02-2006 03:46 PM

I recently got an email reply from the Antiword developer. According to him, no further support for the MacWord format is coming, as there is not enough information about that format. It is a more complex format than the Win-Word format. He further writes "The fast saved format is a lot more complex than the normal format and so, with insufficient information impossible to do." He would be delighted if anybody had some info about the MacWord format prior to version 6.

So the antiword route has come to an end. I decided to go with convert.wiz, the MS Word batch conversion macro, but it does not work recursively, so this is not an easy task. Perhaps a script that encodes the directory structure in each filename, copies all the files into one folder, and then runs the conversion would do. Let's try. :)
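
A flattening script along those lines could look like the Python sketch below. It is only an illustration of the idea, not the script actually used here. The names `flat_name` and `flatten_tree` and the `__` separator are hypothetical, and the sketch assumes the target folder lies outside the source tree and that no original filename already contains the separator.

```python
import shutil
from pathlib import Path

SEPARATOR = "__"  # assumed not to occur in any original file or folder name


def flat_name(relative_path: Path, separator: str = SEPARATOR) -> str:
    """Encode a relative path such as a/b/doc1.doc as a__b__doc1.doc."""
    return separator.join(relative_path.parts)


def flatten_tree(source_root, target_dir, separator: str = SEPARATOR) -> dict:
    """Copy every file under source_root into target_dir under a flat name.

    Returns a mapping of flat name -> original path, so converted files
    can later be matched back to their place in the directory tree.
    """
    source_root, target_dir = Path(source_root), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = {}
    for path in sorted(source_root.rglob("*")):
        if not path.is_file():
            continue
        name = flat_name(path.relative_to(source_root), separator)
        destination = target_dir / name
        if destination.exists():
            raise FileExistsError(f"name collision for {path}: {destination}")
        shutil.copy2(path, destination)  # copy2 also preserves timestamps
        copied[name] = path
    return copied
```

After the batch conversion has run on the flat folder, splitting each converted filename on the separator recovers the original directory path.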

Arjuna 09-04-2006 12:04 PM

I am posting this message with the permission of the author:
Quote:

I'm sorry to tell you there are no versions of Antiword in the near future with more support for the Mac-Word format. The reason for this is simple: there is not enough information available about this format.

Since the Mac-Word format is even more complex than the Win-Word format it is very difficult to reverse engineer. The Mac-Word format and the Win-Word format were united for the first time in Word 6.

The fast saved format is a lot more complex than the normal format and so, with insufficient information impossible to do. That is also the reason why the people at OpenOffice have not tried, even with a full-time programmer staff it takes far too much effort.

If you have any information about the Mac-Word format for any version of Mac-Word older than Word 6, please let me know.

Kind Regards,
Adri van Os
-Arjuna


All times are GMT -5.