LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 10-06-2007, 06:29 AM   #1
munna_dude
Member
 
Registered: Dec 2006
Posts: 362

Rep: Reputation: 30
HTML to text conversion


Hi all,

How can I convert HTML to plain text? (I need the text only.)

Please help me.

Thank you in advance.
 
Old 10-06-2007, 07:15 AM   #2
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 38
Check if this is what you need:

Code:
lynx --dump file.htm>file.txt
 
Old 10-06-2007, 07:50 AM   #3
munna_dude
Member
 
Registered: Dec 2006
Posts: 362

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by radoulov
Check if this is what you need:

Code:
lynx --dump file.htm>file.txt
Is there any way to store the whole text of the web page in one string?

Please help me.

Thank you in advance.
 
Old 10-06-2007, 08:15 AM   #4
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 67
Do you mean all on one line?
 
Old 10-06-2007, 08:18 AM   #5
munna_dude
Member
 
Registered: Dec 2006
Posts: 362

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by matthewg42
Do you mean all on one line?
In a single array.
 
Old 10-06-2007, 08:23 AM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
What exactly are you trying to do?
 
Old 10-06-2007, 08:24 AM   #7
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 67
Please - you have to provide more information. We are not mind readers. What language? Is the HTML in a file, or do you need to download it first?
 
Old 10-06-2007, 11:48 AM   #8
/bin/bash
Senior Member
 
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802

Rep: Reputation: 47
ARRAY=( $(lynx --dump webpage.html) )
 
Old 10-17-2007, 12:31 AM   #9
munna_dude
Member
 
Registered: Dec 2006
Posts: 362

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by /bin/bash
ARRAY=( $(lynx --dump webpage.html) )
Yes, I tried this, but I am not getting the whole page in one array.
Code:
ARRAY=( $(lynx -dump http://google.com) )
echo $ARRAY
But it prints only this text:
Quote:
[1]iGoogle
What do I have to do to store all the data in the string?

Please help me.

Thank you in advance.
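A note on the snippet above: in bash, `$ARRAY` without an index expands to element 0 only, which would explain why a single word is printed. The following is a minimal sketch, assuming bash; `dump_page` is a hypothetical stand-in for the `lynx -dump` call so that the example runs without network access.

```shell
#!/usr/bin/env bash
# dump_page is a hypothetical stand-in for: lynx -dump http://google.com
dump_page() {
    printf '%s\n' '[1]iGoogle | [2]Sign in' '' 'India'
}

# Unquoted $(...) undergoes word splitting (and filename globbing),
# so the array gets one element per whitespace-separated word.
ARRAY=( $(dump_page) )

echo "$ARRAY"          # element 0 only: [1]iGoogle
echo "${#ARRAY[@]}"    # number of elements
echo "${ARRAY[*]}"     # all elements joined by single spaces

# For one string that keeps the line breaks, skip the array entirely.
TEXT=$(dump_page)
echo "$TEXT"
```

Quoting matters here: `"$TEXT"` keeps the newlines, while an unquoted `$TEXT` would be split into words again.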
 
Old 10-17-2007, 03:15 AM   #10
munna_dude
Member
 
Registered: Dec 2006
Posts: 362

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by /bin/bash
ARRAY=( $(lynx --dump webpage.html) )
Yes, it is working, but the output looks like this:
Quote:
[1]iGoogle | [2]Sign in India Web [3]Images [4]Groups [5]News [6]Scholar [7]more » _______________________________________________________ Google Search I'm Feeling Lucky [8]Advanced Search [9]Preferences [10]Language Tools Search: (_) the web (_) pages from India Google.co.in offered in: [11]Hindi [12]Bengali [13]Telugu [14]Marathi [15]Tamil [16]Advertising Programs - [17]About Google - [18]We're Hiring - [19]Go to Google.com ©2007 Google References 1. http://www.google.co.in/url?sa=p&pre...Wg-J5Dx8ZlW-dA 2. https://www.google.com/accounts/Logi...e.co.in/&hl=en 3. http://images.google.co.in/imghp?oe=...1&hl=en&tab=wi 4. http://groups.google.co.in/grphp?oe=...1&hl=en&tab=wg 5. http://news.google.co.in/nwshp?oe=IS...1&hl=en&tab=wn 6. http://scholar.google.com/schhp?oe=I...1&hl=en&tab=ws 7. http://www.google.co.in/intl/en/options/ 8. http://www.google.co.in/advanced_search?hl=en 9. http://www.google.co.in/preferences?hl=en 10. http://www.google.co.in/language_tools?hl=en 11. http://www.google.co.in/hi 12. http://www.google.co.in/bn 13. http://www.google.co.in/te 14. http://www.google.co.in/mr 15. http://www.google.co.in/ta 16. http://www.google.co.in/intl/en/ads/ 17. http://www.google.co.in/intl/en/about.html 18. http://www.google.co.in/intl/en/jobs/ 19. http://www.google.com/ncr
Is there any way to get the output like this (line by line) using an array?
Quote:
[1]iGoogle | [2]Sign in

India

Web [3]Images [4]Groups [5]News [6]Scholar [7]more »

_______________________________________________________
Google Search I'm Feeling Lucky [8]Advanced Search
[9]Preferences
[10]Language Tools
Search: (_) the web (_) pages from India

Google.co.in offered in: [11]Hindi [12]Bengali [13]Telugu [14]Marathi
[15]Tamil
[16]Advertising Programs - [17]About Google - [18]We're Hiring - [19]Go
to Google.com

©2007 Google

References

1. http://www.google.co.in/url?sa=p&pre...Wg-J5Dx8ZlW-dA
2. https://www.google.com/accounts/Logi...e.co.in/&hl=en
3. http://images.google.co.in/imghp?oe=...1&hl=en&tab=wi
4. http://groups.google.co.in/grphp?oe=...1&hl=en&tab=wg
5. http://news.google.co.in/nwshp?oe=IS...1&hl=en&tab=wn
6. http://scholar.google.com/schhp?oe=I...1&hl=en&tab=ws
7. http://www.google.co.in/intl/en/options/
8. http://www.google.co.in/advanced_search?hl=en
9. http://www.google.co.in/preferences?hl=en
10. http://www.google.co.in/language_tools?hl=en
11. http://www.google.co.in/hi
12. http://www.google.co.in/bn
13. http://www.google.co.in/te
14. http://www.google.co.in/mr
15. http://www.google.co.in/ta
16. http://www.google.co.in/intl/en/ads/
17. http://www.google.co.in/intl/en/about.html
18. http://www.google.co.in/intl/en/jobs/
19. http://www.google.com/ncr
Please help me.

Thank you in advance.
 
Old 10-17-2007, 12:40 PM   #11
/bin/bash
Senior Member
 
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802

Rep: Reputation: 47
Quote:
is there any way to get the output like this(line by line)
by using array.
The problem with that is that there are no linefeeds or carriage returns in HTML pages, so you can't read them line by line.

I don't know of any way to do that.
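For what it is worth, the text that `lynx --dump` prints does contain linefeeds; they are lost when the unquoted `$(...)` is split on whitespace. A minimal sketch of a line-per-element array, assuming bash 4 or later for `mapfile`; `dump_page` and `PAGE_LINES` are hypothetical names, and `dump_page` stands in for the `lynx` call:

```shell
#!/usr/bin/env bash
# dump_page is a hypothetical stand-in for: lynx --dump webpage.html
dump_page() {
    printf '%s\n' '[1]iGoogle | [2]Sign in' '' 'India' '[9]Preferences'
}

# mapfile (bash 4+) stores one input line per array element;
# -t strips the trailing newline from each element.
# Process substitution keeps the array in the current shell.
mapfile -t PAGE_LINES < <(dump_page)

echo "lines: ${#PAGE_LINES[@]}"
for i in "${!PAGE_LINES[@]}"; do
    printf '%d: %s\n' "$i" "${PAGE_LINES[$i]}"
done
```

Empty lines are kept as empty elements, so the array indices line up with the line numbers of the dump (starting at 0).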
 
Old 10-17-2007, 01:13 PM   #12
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Code:
awk '{
 # The reference list starts at the first " 1. " in the line.
 p=index($0," 1. ")
 firstpart=substr($0,1,p-1)  # page text before the references
 lastpart=substr($0,p+1)     # "1. http://..." till end
 # Split the references at each "N. "; [0-9]+ keeps 10. to 19. whole.
 n=split(lastpart,arr,/[ ]*[0-9]+\.[ ]+/)
 # Split the page text at each [N] link marker.
 m=split(firstpart,arr2,/\[[0-9]+\]/)
 for(i=2;i<=m;i++) print arr2[i]
 for(i=2;i<=n;i++) print i-1". "arr[i]
}' "file"
Output:
Code:
# ./test.sh
iGoogle |
Sign in India Web
Images
Groups
News
Scholar
more » _______________________________________________________ Google Search I'm Feeling Lucky
Advanced Search
Preferences
Language Tools Search: (_) the web (_) pages from India Google.co.in offered in:
Hindi
Bengali
Telugu
Marathi
Tamil
Advertising Programs -
About Google -
We're Hiring -
Go to Google.com ©2007 Google References
1. http://www.google.co.in/url?sa=p&pre...Wg-J5Dx8ZlW-dA
2. https://www.google.com/accounts/Logi...e.co.in/&hl=en
3. http://images.google.co.in/imghp?oe=...1&hl=en&tab=wi
4. http://groups.google.co.in/grphp?oe=...1&hl=en&tab=wg
5. http://news.google.co.in/nwshp?oe=IS...1&hl=en&tab=wn
6. http://scholar.google.com/schhp?oe=I...1&hl=en&tab=ws
7. http://www.google.co.in/intl/en/options/
8. http://www.google.co.in/advanced_search?hl=en
9. http://www.google.co.in/preferences?hl=en
10. http://www.google.co.in/language_tools?hl=en
11. http://www.google.co.in/hi
12. http://www.google.co.in/bn
13. http://www.google.co.in/te
14. http://www.google.co.in/mr
15. http://www.google.co.in/ta
16. http://www.google.co.in/intl/en/ads/
17. http://www.google.co.in/intl/en/about.html
18. http://www.google.co.in/intl/en/jobs/
19. http://www.google.com/ncr
 
Old 10-18-2007, 12:26 AM   #13
munna_dude
Member
 
Registered: Dec 2006
Posts: 362

Original Poster
Rep: Reputation: 30
Hi all,
I have text like the sample below. Could I have a terminal command, please, that does the following?
1) Removes the bracketed markers ([5] ... [234XXX]).
2) Prints the records line by line.
Quote:
[5]AAAAA [6]GGGGGG [7]25OC200 - - 5389.80 5203.10 5673.15 5255.00 89085 234115.83 5143.90 [8]AAAAA [9]GGGGGG [10]29NO2007 - - 5389.00 5211.00 5674.00 5269.95 5265 13869.14 5143.90 [11]BBBBBB [12]GGGGGG [13]25OC200 [14]CE [15]5600.00 74.00 15.00 168.60 34.05 810 2281.01 5143.90 [16]AAAAA [17]GGGGGG [18]27DE200 - - 5410.00 5125.00 5664.05 5265.00 694 1828.64 5143.90
I want output like this:

Quote:
AAAAA GGGGGG 25OC200 - - 5389.80 5203.10 5673.15 5255.00 89085 234115.83 5143.90
AAAAA GGGGGG 29NO2007 - - 5389.00 5211.00 5674.00 5269.95 5265 13869.14 5143.90
BBBBBB GGGGGG 25OC200 CE 5600.00 74.00 15.00 168.60 34.05 810 2281.01 5143.90
AAAAA GGGGGG 27DE200 - - 5410.00 5125.00 5664.05 5265.00 694 1828.64 5143.90
Please help me.

Thank you in advance.
 
Old 10-18-2007, 06:52 AM   #14
/bin/bash
Senior Member
 
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802

Rep: Reputation: 47
sed -r 's/\[[0-9]+\]//g' file
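The sed command removes the markers but leaves everything on one line. A possible way to also get one record per line is sketched below, assuming every record has the same number of whitespace-separated fields; the default of 12 is an assumption read off the sample rows in this thread, and `strip_and_wrap` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# strip_and_wrap: remove [N] link markers, then print a fixed number of
# fields per output line. The default of 12 fields is an assumption
# taken from the sample rows in this thread.
strip_and_wrap() {
    local per_line=${1:-12}
    sed -E 's/\[[0-9]+\]//g' |
    awk -v n="$per_line" '{
        # Emit a newline after every n-th field and after the last one.
        for (i = 1; i <= NF; i++)
            printf "%s%s", $i, ((i % n == 0 || i == NF) ? "\n" : " ")
    }'
}

# Demo with a shortened, made-up record layout (3 fields per line).
echo '[5]AAAAA [6]GGGGGG 5143.90 [8]BBBBBB [9]GGGGGG 5143.90' | strip_and_wrap 3
```

For the data in the post above, `strip_and_wrap < file` would be the call. If the records do not all have the same field count, this approach breaks and the split would need some other cue.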
 
Old 10-19-2007, 06:17 AM   #15
/bin/bash
Senior Member
 
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802

Rep: Reputation: 47
I think I figured out how to put this into an array line by line. It first requires dumping the output into a temp file.
Code:
# Dump the rendered page to a temp file (plain text, despite the name).
lynx --dump http://www.google.co.in/ >temp.html
exec 3< temp.html   # open the file read-only on descriptor 3
COUNT=0
# read -r keeps backslashes literal; one line per array element.
while read -r LINE <&3;do
  ((COUNT++))
  ARRY[$COUNT]="$LINE"
done
exec 3<&-           # close descriptor 3

echo "13 = ${ARRY[13]}"
The output shows that array element 13 is indeed a whole line.
13 = Google.co.in offered in: [11]Hindi [12]Bengali [13]Telugu [14]Marathi

The temp file is needed because if you just pipe the output of lynx into the while loop, the loop runs in a subshell, so the variables set inside it are local to that subshell and you cannot use them afterwards.

Last edited by /bin/bash; 10-19-2007 at 06:19 AM. Reason: Tuypo
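The temp file can probably be avoided with process substitution, which feeds the loop without putting it in a subshell. A minimal sketch, assuming bash with its default settings (`lastpipe` off); `dump_page` is a hypothetical stand-in for the `lynx --dump` call, and `PIPED` exists only to show the subshell effect:

```shell
#!/usr/bin/env bash
# dump_page is a hypothetical stand-in for: lynx --dump http://www.google.co.in/
dump_page() {
    printf '%s\n' 'first line' 'second line' 'third line'
}

# Piped form: the loop runs in a subshell, so PIPED is still empty afterwards.
PIPED=()
dump_page | while read -r LINE; do PIPED+=("$LINE"); done
echo "after pipe: ${#PIPED[@]}"

# Process substitution: the loop runs in the current shell, no temp file.
ARRY=()
while read -r LINE; do
    ARRY+=("$LINE")
done < <(dump_page)
echo "after process substitution: ${#ARRY[@]}"
echo "second element = ${ARRY[1]}"
```

Note that this sketch indexes the array from 0, whereas the loop in the post above starts at 1.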
 
  



All times are GMT -5.
