10-06-2007, 06:29 AM
|
#1
|
Member
Registered: Dec 2006
Posts: 362
Rep:
|
html to text conversion
Hi all,
How can I convert an HTML file to plain text? I need only the text.
Please help me.
Thank you in advance.
|
|
|
10-06-2007, 07:15 AM
|
#2
|
Member
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212
Rep:
|
Check if this is what you need:
Code:
lynx --dump file.htm>file.txt
|
|
|
10-06-2007, 07:50 AM
|
#3
|
Member
Registered: Dec 2006
Posts: 362
Original Poster
Rep:
|
Quote:
Originally Posted by radoulov
Check if this is what you need:
Code:
lynx --dump file.htm>file.txt
|
Is there any way to store the whole text of the web page
in one string?
Please help me.
Thank you in advance.
|
|
|
10-06-2007, 08:15 AM
|
#4
|
Senior Member
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530
Rep:
|
Do you mean all on one line?
|
|
|
10-06-2007, 08:18 AM
|
#5
|
Member
Registered: Dec 2006
Posts: 362
Original Poster
Rep:
|
Quote:
Originally Posted by matthewg42
Do you mean all on one line?
|
In a single array.
|
|
|
10-06-2007, 08:23 AM
|
#6
|
Senior Member
Registered: Aug 2006
Posts: 2,697
|
What exactly are you trying to do?
|
|
|
10-06-2007, 08:24 AM
|
#7
|
Senior Member
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530
Rep:
|
Please - you have to provide more information. We are not mind readers. What language? Is the HTML in a file, or do you need to download it first?
|
|
|
10-06-2007, 11:48 AM
|
#8
|
Senior Member
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802
Rep:
|
ARRAY=( $(lynx --dump webpage.html) )
|
|
|
10-17-2007, 12:31 AM
|
#9
|
Member
Registered: Dec 2006
Posts: 362
Original Poster
Rep:
|
Quote:
Originally Posted by /bin/bash
ARRAY=( $(lynx --dump webpage.html) )
|
Yes, I tried this, but I am not getting the whole page in one array:
Code:
ARRAY=( $(lynx -dump http://google.com) )
echo $ARRAY
But it prints only this text.
What do I have to do to store all of the data in the string?
Please help me.
Thank you in advance.
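A minimal sketch of what is probably happening here, assuming bash: `echo $ARRAY` expands to the first element only, and the unquoted `$(...)` has already split the text into one element per word. The sample text below is a stand-in for the lynx output (an assumption, so the sketch runs without network access), and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Stand-in for the output of `lynx -dump URL` (hypothetical sample text).
page_text='Google Search
Advanced Search'

# Unquoted command substitution is split on spaces, tabs, and newlines,
# so the array receives one element per word.
words=( $(printf '%s\n' "$page_text") )

echo "$words"          # same as ${words[0]}: the first word only
echo "${#words[@]}"    # number of elements in the array
echo "${words[@]}"     # every element, separated by spaces

# To keep the whole page in a single string, skip the array entirely.
text=$(printf '%s\n' "$page_text")
echo "$text"
```

With a real page, `text=$(lynx -dump URL)` would hold the full dump, and quoting it as `"$text"` keeps the line breaks when it is printed.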
|
|
|
10-17-2007, 03:15 AM
|
#10
|
Member
Registered: Dec 2006
Posts: 362
Original Poster
Rep:
|
Quote:
Originally Posted by /bin/bash
ARRAY=( $(lynx --dump webpage.html) )
|
Yes, it is working, but the output is like this.
Is there any way to get the output like this (line by line)
by using an array?
Please help me.
Thank you in advance.
|
|
|
10-17-2007, 12:40 PM
|
#11
|
Senior Member
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802
Rep:
|
Quote:
is there any way to get the output like this(line by line)
by using array.
|
The problem with that is that there are no linefeeds or carriage returns in HTML pages, so you can't read them line by line.
I don't know of any way to do that.
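One possible approach, assuming bash: the text produced by `lynx --dump` does contain newlines, but an unquoted `$(...)` splits on every space, tab, and newline alike. Restricting `IFS` to a newline for the assignment keeps each line together as one element. The sample text is a hypothetical stand-in for the lynx output.

```shell
#!/usr/bin/env bash
# Stand-in for the output of `lynx --dump URL` (hypothetical sample text).
page_text='Google Search   I am Feeling Lucky
Advanced Search
Preferences'

old_ifs=$IFS
IFS=$'\n'    # split on newlines only
set -f       # no filename globbing while the expansion is unquoted
lines=( $(printf '%s\n' "$page_text") )
set +f
IFS=$old_ifs

echo "${#lines[@]}"    # number of lines stored
echo "${lines[1]}"     # second line, spaces intact
```

Note that this drops empty lines, because consecutive newlines count as a single separator.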
|
|
|
10-17-2007, 01:13 PM
|
#12
|
Senior Member
Registered: Aug 2006
Posts: 2,697
|
Code:
awk '{
match($0,/1\./) # find the first "1." (start of the References list)
lastpart=substr($0,RSTART) # string from "1." to the end
lastpartstart=RSTART-3
n=split(lastpart,arr,/[ ]*[0-9]\.[ ]*/) # n = number of pieces in arr
match($0,/\[[0-9]*\]/) # find the first [N] link marker
firstpart=substr($0,RSTART,RSTART+lastpartstart)
m=split(firstpart,arr2,/\[[0-9]*\]/) # m = number of pieces in arr2
for(i=2;i<=m;i++) print arr2[i] # page text, one piece per line
for(i=2;i<=n;i++) print i-1". "arr[i] # numbered references
}' "file"
output:
Code:
# ./test.sh
iGoogle |
Sign in India Web
Images
Groups
News
Scholar
more » _______________________________________________________ Google Search I'm Feeling Lucky
Advanced Search
Preferences
Language Tools Search: (_) the web (_) pages from India Google.co.in offered in:
Hindi
Bengali
Telugu
Marathi
Tamil
Advertising Programs -
About Google -
We're Hiring -
Go to Google.com ©2007 Google References
1. http://www.google.co.in/url?sa=p&pre...Wg-J5Dx8ZlW-dA
2. https://www.google.com/accounts/Logi...e.co.in/&hl=en
3. http://images.google.co.in/imghp?oe=...1&hl=en&tab=wi
4. http://groups.google.co.in/grphp?oe=...1&hl=en&tab=wg
5. http://news.google.co.in/nwshp?oe=IS...1&hl=en&tab=wn
6. http://scholar.google.com/schhp?oe=I...1&hl=en&tab=ws
7. http://www.google.co.in/intl/en/options/
8. http://www.google.co.in/advanced_search?hl=en
9. http://www.google.co.in/preferences?hl=en 1
10. http://www.google.co.in/language_tools?hl=en 1
11. http://www.google.co.in/hi 1
12. http://www.google.co.in/bn 1
13. http://www.google.co.in/te 1
14. http://www.google.co.in/mr 1
15. http://www.google.co.in/ta 1
16. http://www.google.co.in/intl/en/ads/ 1
17. http://www.google.co.in/intl/en/about.html 1
18. http://www.google.co.in/intl/en/jobs/ 1
19. http://www.google.com/ncr
|
|
|
10-18-2007, 12:26 AM
|
#13
|
Member
Registered: Dec 2006
Posts: 362
Original Poster
Rep:
|
hi all
i have a text like this
can i have a terminal command please
1). i would like to remove [5]....[234XXX] in this
2). and line by line
Quote:
[5]AAAAA [6]GGGGGG [7]25OC200 - - 5389.80 5203.10 5673.15 5255.00 89085 234115.83 5143.90 [8]AAAAA [9]GGGGGG [10]29NO2007 - - 5389.00 5211.00 5674.00 5269.95 5265 13869.14 5143.90 [11]BBBBBB [12]GGGGGG [13]25OC200 [14]CE [15]5600.00 74.00 15.00 168.60 34.05 810 2281.01 5143.90 [16]AAAAA [17]GGGGGG [18]27DE200 - - 5410.00 5125.00 5664.05 5265.00 694 1828.64 5143.90
|
I want output like this:
Quote:
AAAAA GGGGGG 25OC200 - - 5389.80 5203.10 5673.15 5255.00 89085 234115.83 5143.90
AAAAA GGGGGG 29NO2007 - - 5389.00 5211.00 5674.00 5269.95 5265 13869.14 5143.90
BBBBBB GGGGGG 25OC200 CE 5600.00 74.00 15.00 168.60 34.05 810 2281.01 5143.90
AAAAA GGGGGG 27DE200 - - 5410.00 5125.00 5664.05 5265.00 694 1828.64 5143.90
|
Please help me.
Thank you in advance.
|
|
|
10-18-2007, 06:52 AM
|
#14
|
Senior Member
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802
Rep:
|
sed -r 's/\[[0-9]+\]//g' file
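A rough sketch of handling both requests together, assuming every record has exactly 12 fields as in the sample above (that count is read off the sample, not confirmed by the poster). The function name `strip_and_wrap` and its field-count argument are illustrative.

```shell
#!/usr/bin/env bash
# strip_and_wrap N: remove [number] markers from stdin,
# then print the remaining fields N per line.
strip_and_wrap() {
  local per_line=$1
  sed -E 's/\[[0-9]+\]//g' |
    awk -v n="$per_line" '{
      for (i = 1; i <= NF; i++)
        printf "%s%s", $i, ((i % n == 0 || i == NF) ? "\n" : " ")
    }'
}

# First two records of the sample from the post above.
sample='[5]AAAAA [6]GGGGGG [7]25OC200 - - 5389.80 5203.10 5673.15 5255.00 89085 234115.83 5143.90 [8]AAAAA [9]GGGGGG [10]29NO2007 - - 5389.00 5211.00 5674.00 5269.95 5265 13869.14 5143.90'

printf '%s\n' "$sample" | strip_and_wrap 12
```

For a file, `strip_and_wrap 12 < file` would do the same. If the number of fields per record varies, the fixed count no longer works and the record boundary has to be detected some other way.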
|
|
|
10-19-2007, 06:17 AM
|
#15
|
Senior Member
Registered: Jul 2003
Location: Indiana
Distribution: Mandrake Slackware-current QNX4.25
Posts: 1,802
Rep:
|
I think I figured out putting this into an array line-by-line. It first requires dumping it into a temp file.
Code:
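lynx --dump http://www.google.co.in/ >temp.html
exec 3< temp.html          # open the dump for reading on file descriptor 3
COUNT=0
while read -r LINE <&3; do
((COUNT++))
ARRY[$COUNT]="$LINE"       # elements are numbered from 1
done
exec 3<&-                  # close file descriptor 3
echo "13 = ${ARRY[13]}"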
The output shows that array element 13 is indeed a whole line.
13 = Google.co.in offered in: [11]Hindi [12]Bengali [13]Telugu [14]Marathi
The temp file is needed because, if you pipe the output of lynx straight into the while loop, the loop runs in a subshell, so the variables set inside it are gone once the loop ends and you cannot use them.
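A possible alternative that avoids the temp file, assuming bash: feed the loop through process substitution, `< <(command)`, so the loop itself stays in the current shell and the array survives. In this sketch `fetch_page` is a hypothetical stand-in for `lynx --dump URL`, and `lines` is an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `lynx --dump URL`, so the sketch runs offline.
fetch_page() {
  printf '%s\n' 'first line of text' '' '   third line, indented'
}

lines=()
# Input comes from process substitution rather than a pipe,
# so the loop body runs in the current shell.
while IFS= read -r line; do
  lines+=( "$line" )
done < <(fetch_page)

echo "count = ${#lines[@]}"
echo "2 = ${lines[2]}"
```

Here `IFS=` and `-r` keep leading whitespace and backslashes as they are, empty lines are stored as empty elements, and indexing starts at 0 rather than 1.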
Last edited by /bin/bash; 10-19-2007 at 06:19 AM.
Reason: Tuypo
|
|
|
All times are GMT -5.