LinuxQuestions.org
Old 06-02-2004, 02:24 PM   #1
linux_ub
Member
 
Registered: May 2004
Location: NY
Distribution: fedora core 1
Posts: 65

Rep: Reputation: 17
Microsoft Word .doc to text


Hi,
I am building an application whose input files are Microsoft Word documents. The documents are mostly text with formatting such as bold and italic; there are no tables or images. I want to read the contents of a Word file from a Java class. I have written a program that reads the file one byte at a time and outputs each byte that is a readable character (ASCII 32 to 127).
The text of the document comes out fine, but there is a lot of garbage before and after it. How can I get rid of this garbage and extract just the readable text from the Word document?
Thanks in advance.
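
For reference, the byte-filtering approach described above can be sketched roughly as follows. This is not the original program; the class name `PrintableFilter`, the method `printableRuns`, and the minimum run length of 4 are assumptions made for illustration. Dropping short runs only reduces the noise: a .doc file is a binary container, so items such as font names and document properties also show up as printable bytes, and text stored as 16-bit characters may be missed entirely.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class PrintableFilter {
    // Keep only runs of at least minRun printable ASCII bytes (32..126),
    // one run per line. Shorter runs are dropped as likely binary noise.
    static String printableRuns(byte[] data, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 32 && c <= 126) {
                run.append((char) c);
            } else {
                flush(run, out, minRun);
            }
        }
        flush(run, out, minRun); // emit a run that reaches the end of the data
        return out.toString();
    }

    // Append the current run to the output if it is long enough, then reset it.
    private static void flush(StringBuilder run, StringBuilder out, int minRun) {
        if (run.length() >= minRun) {
            out.append(run).append('\n');
        }
        run.setLength(0);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.print(printableRuns(data, 4)); // 4 is an arbitrary threshold
    }
}
```

Because the filter has no knowledge of the file's structure, it cannot tell document text from metadata, which is why a format-aware tool is the more reliable route.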
 
Old 06-02-2004, 04:00 PM   #2
Hko
Senior Member
 
Registered: Aug 2002
Location: Groningen, The Netherlands
Distribution: ubuntu
Posts: 2,524

Rep: Reputation: 93
You could convert the Word documents by running the "wvText" program from your Java program (wv = Word View). Or, if you want to do it in Java only, read the sources of the program.

There's a whole family of wv... tools: wvLatex, wvCleanLatex, wvRTF, wvPDF, .... Chances are good they are already included with your distribution (Debian includes them, at least). If not, get them from http://wvware.sourceforge.net/
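
Calling wvText from Java could look something like the sketch below. It assumes `wvText` is on the PATH and is invoked as `wvText input.doc output.txt`; the class name `WvTextRunner`, the helper names, and the choice to read the result as UTF-8 are assumptions, not something taken from the wv documentation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class WvTextRunner {
    // Build the command line: wvText <input.doc> <output.txt>
    static List<String> buildCommand(Path doc, Path txt) {
        return Arrays.asList("wvText", doc.toString(), txt.toString());
    }

    // Run wvText on the given document and return the extracted text.
    static String convert(Path doc) throws IOException, InterruptedException {
        Path txt = Files.createTempFile("wvtext", ".txt");
        try {
            Process p = new ProcessBuilder(buildCommand(doc, txt))
                    .inheritIO() // let wvText's own messages reach the console
                    .start();
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IOException("wvText exited with status " + exit);
            }
            // Assumption: the output is readable as UTF-8; adjust the charset
            // if your wvText build writes something else.
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt); // clean up the temporary output file
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.print(convert(Paths.get(args[0])));
    }
}
```

Run it as `java WvTextRunner mydocument.doc`. Writing to a temporary file and reading it back keeps the Java side simple, at the cost of one extra file per conversion.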
 
Old 06-04-2004, 01:42 PM   #3
linux_ub
Member
 
Registered: May 2004
Location: NY
Distribution: fedora core 1
Posts: 65

Original Poster
Rep: Reputation: 17
Thanks a lot, I will try this out.
 
Old 06-04-2004, 03:00 PM   #4
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris10, Solaris 11, Ubuntu, OEL
Posts: 9,165

Rep: Reputation: 243
If you want a Java-only solution, you may want to have a look at POI:

http://jakarta.apache.org/poi/hwpf/index.html
 
  


All times are GMT -5.
