Converting from Word for Mac .doc to txt
I am in the process of transferring 2000 documents to a MySQL database by extracting the text and some header information from them. The documents were written in various versions of MS Word, the earliest ones mainly with Word for Mac, perhaps version 5. They span the years from 1990 up to today. I was planning to do the work with antiword, but I came across this issue:

$ antiword -s *
MacWord: fast saved documents are not supported yet

Are there any workarounds? Are there any other programs that would extract the data from Word for Mac files? Open Office does not have a filter for this particular format. WordPerfect could perhaps do it, but can you use it from the command line or a script, and have it dump the result into .txt? I am using antiword version 0.35. Edit: I also tried the latest release, 0.37, with the same result. |
This (older "word" files) is a big problem. I had a lot of trouble reading old Word95(?) documents that I just assumed would be converted. They were not. This is the trouble with proprietary formats.
My (inelegant) solution was to take all my files to someone's computer with windows running. There might be a windows batch utility to convert (update) them, but I only had a few. I read them in, saved them in a better format, and took them back to linux to read them. Now all my docs are saved as openoffice (sxw) and I just re-save as doc when I am sending something to someone who can only run windows. HTH |
Yes, I have hundreds of them, and that is the problem. If nothing else comes up, I will have to look into WordPerfect or the batch solution you are suggesting. Thanks for the idea.
|
I don't know if it will solve your problem, but you might try opening them with Abiword and see how well it does.
Personally, I try to stay away from proprietary formats like word as much as possible. Sometimes there is no choice, but I will often save things in RTF format if it's got enough functionality to get the job done. Word format is probably the one to avoid saving as whenever possible since Microsoft has a real bad habit of making each version incompatible (I have a feeling that that isn't an accident on their part either). It's funny how even they themselves have issues converting between different versions of it when it is their own format. To me this is a red flag and ranks it as a format to avoid like the plague and to only use if there is absolutely no other solution to solve what you want to do. |
Thanks for the tip. I tried Abiword, with the same result. It complains "The file was not recognised as document. There can be unrecognised formattings in the document" or something similar.
I have done quite a bit of research on this, and it seems only MS Word can cope with this format. Some suggest that WordPerfect can as well. I am keeping the MS Word script as my last resort, as I do not know Windows scripting by heart. As installing WordPerfect 8.1 on Ubuntu Dapper seems quite a tedious task, I would love it if somebody could tell me whether you can operate it from the command line and have it just dump the .doc data as plain text. This info would be a great help :) To tredegar and PingFloyd: Sometimes you get this kind of project, where the organisation started as the effort of a few non-technical people who began producing material. Over the years the organisation has grown large, and now there is a need for a streamlined document management system. Of course your suggestion of not using .doc is good by all means. :) |
I'll echo PingFloyd's suggestion to try Abiword. I don't know if it can support all previous formats, but I likewise have several old Word docs that OpenOffice could not handle properly, while Abiword handled them correctly.
I'm not familiar with any capability to run a mass conversion via the CLI, but worst case scenario is that for any files that could not be mass-converted, it might be possible to manually open the file and save the information. |
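For anyone wanting to try the "mass-convert what you can, open the rest by hand" approach, here is a minimal sketch in Python. It is only an illustration, not something tested against these files: the function name `triage` is made up, and it assumes the converter (antiword by default) writes the extracted text to standard output and either exits with a non-zero status or prints nothing when it cannot read a file. That exit-status behaviour is an assumption and should be checked on a few known-bad files first.

```python
import subprocess


def triage(paths, command=("antiword",)):
    """Run `command <path>` on each file and sort the results.

    Returns (converted, failed): `converted` maps each path to the text
    the converter printed, `failed` lists the paths to handle by hand.
    Assumption: a failed conversion shows up as a non-zero exit status
    or as empty output.
    """
    converted = {}
    failed = []
    for path in paths:
        try:
            result = subprocess.run(
                [*command, str(path)],
                capture_output=True,
                check=False,
            )
        except FileNotFoundError:
            # The converter program itself is not installed.
            failed.append(str(path))
            continue
        text = result.stdout.decode("utf-8", errors="replace")
        if result.returncode == 0 and text.strip():
            converted[str(path)] = text
        else:
            failed.append(str(path))
    return converted, failed
```

Calling `triage(list_of_doc_paths)` would then give the text of the files antiword could read, plus a list of the ones left over for MS Word.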
JW, As I wrote above, I installed and tried abiword. Did not work. Thanks anyway. :)
|
Sorry about that, from the timestamps I was just typing my reply when you posted. Good luck with it anyway
|
Arjuna,
As some of your "word" files are old I think you are going to have to use microsoft to do the conversion / updates for you. Just borrow someone else's computer. Once you have done this, you'll be able to read your files. Even microsoft recognise this as a problem. See this link "Word for Windows Batch Conversion Macro": http://support.microsoft.com/kb/107439/EN-US/ HTH |
Anyway, that's just an idea of one way to possibly tackle it. Another possibility: use wine with an old version of Word (or a "Word-compatible" program). That might make it easier to script things, since you could write a shell script instead of having to deal with the limitations of Windows for this kind of automation. |
I recently got an email reply from the Antiword developer. According to him, no further support for the MacWord format is coming, as there is not enough information about that format. It is a more complex format than the Win-Word format. He further writes: "The fast saved format is a lot more complex than the normal format and so, with insufficient information impossible to do." He would be delighted if anybody had some information about the MacWord format prior to version 6.
So the antiword route has come to an end. I decided to go with convert.wiz, the MS Word batch conversion macro, but that one does not work recursively. Not such an easy task. Perhaps a script that encodes the directory structure in each file name, copies all the files into one folder, and then runs the conversion would do. Let's try. :) |
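The flattening idea above could look roughly like this in Python. This is a hedged sketch, not the script that was actually used: the function name `flatten_tree` and the `__` separator are arbitrary choices, and it assumes the original file names do not already contain that separator in a way that would make two files collide (the code stops with an error if they do).

```python
import shutil
from pathlib import Path


def flatten_tree(src_root, dest_dir, sep="__"):
    """Copy every file under src_root into the single folder dest_dir.

    The relative directory path is folded into the new file name, so
    "1994/reports/a.doc" becomes "1994__reports__a.doc". Returns a dict
    mapping each flat name back to the original relative path.
    """
    src_root = Path(src_root)
    dest_dir = Path(dest_dir)
    src_res, dest_res = src_root.resolve(), dest_dir.resolve()
    # Refuse a destination inside the source tree, otherwise the copies
    # would be picked up again while walking.
    if dest_res == src_res or src_res in dest_res.parents:
        raise ValueError("dest_dir must be outside src_root")
    dest_dir.mkdir(parents=True, exist_ok=True)

    mapping = {}
    for path in sorted(src_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src_root)
        flat_name = sep.join(rel.parts)
        if flat_name in mapping:
            raise ValueError(f"name collision on {flat_name!r}")
        shutil.copy2(path, dest_dir / flat_name)
        mapping[flat_name] = rel.as_posix()
    return mapping
```

The returned mapping makes it possible to put the converted .txt files back into the original directory layout after the batch conversion has run on the flat folder.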
I am posting this message with the permission of the author:
Quote:
|