Slackware This Forum is for the discussion of Slackware Linux.
05-09-2007, 07:16 AM
#1
Member
Registered: Dec 2004
Location: Hyderabad, India
Distribution: Slackware 13, Ubuntu 12.04
Posts: 440
Rep:
extracting data from html files into one text file
Hi!
Here is my problem (it may not be appropriate to this forum, in which case I am sorry for posting here) -
The exam results for the children in one class at my school will be available online at a particular web site. I have to enter the roll number of each candidate, and the marks that student obtained in each subject are then displayed. What I normally do is save each web page showing a student's marks and manually copy the marks for each subject into a single tab-separated text file by cutting and pasting, like this -
Quote:
1. John Doe 67 65 83 98
2. Amitabh Bachchan 87 78 93 73
It may not be possible to download the marks of all the students as a single file. So, after downloading the HTML file for each student, is there a way of extracting the data from these files automatically using scripts and storing it in a single tab-separated text file?
Thanks in advance,
Last edited by adityavpratap; 05-09-2007 at 07:17 AM .
05-09-2007, 08:46 AM
#2
Member
Registered: Nov 2004
Location: Turkey
Distribution: Slackware
Posts: 145
Rep:
Yes, it is possible. Can you post some example HTML?
05-09-2007, 08:51 AM
#3
Member
Registered: Oct 2004
Location: NY
Distribution: Slackware
Posts: 200
Rep:
I'd recommend using a Perl script with regular expressions. If you don't have any programming experience, don't worry, you'll pick it up quickly.
05-09-2007, 09:27 AM
#4
Senior Member
Registered: Aug 2006
Location: Harrisburg, PA
Distribution: Slackware 64 14.2
Posts: 1,180
Rep:
Hi all,
Couldn't something like this be done by using Lynx to dump each student's web page to a text file:
Code:
lynx -dump http://www.studentpage.edu >> studentfiles.txt
The >> will append each student's web page to the end of the last one in studentfiles.txt.
If you have each student's page listed in a single text file you could have the bash shell automatically read each URL and output it to the studentfiles.txt file.
Code:
while read -r url; do lynx -dump "$url" >> studentfiles.txt; done < urls.txt
Note: I'm just learning bash, and I got the above code from another web site, but I think it will work.
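For what it's worth, the same loop could be wrapped in a small function that writes each page to its own file instead of one big file, which may make per-student parsing easier later. This is only a sketch: the function name dump_pages, the layout of the list file (one URL per line) and the page-N.txt output names are my own assumptions, not anything from the results site.

```shell
#!/bin/sh
# Sketch: dump every URL listed in a file to its own text file.
# dump_pages, the list file and the page-N.txt names are illustrative only.
dump_pages() {
    list=$1      # text file with one URL per line
    outdir=$2    # directory that receives page-1.txt, page-2.txt, ...
    n=0
    mkdir -p "$outdir" || return 1
    # The list is read on file descriptor 3 so that the command inside
    # the loop cannot accidentally consume it through standard input.
    while IFS= read -r url <&3 || [ -n "$url" ]; do
        [ -n "$url" ] || continue          # skip blank lines
        n=$((n + 1))
        lynx -dump "$url" > "$outdir/page-$n.txt"
    done 3< "$list"
}
```

Called as `dump_pages urls.txt pages`, it should leave pages/page-1.txt, pages/page-2.txt and so on, one per URL.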
Then I'd use grep to grab the relevant lines and output them to another text file. Without seeing the HTML, I'm not sure what's required.
I'm also not sure about how to insert tabs for tab delimiting. Maybe somebody else can help.
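On the tab question, one possible approach (a sketch, not the definitive answer) is to let printf write the tabs, since \t in a printf format string stands for a tab. The name and marks below are made-up sample data.

```shell
#!/bin/sh
# Sketch: join fields with tabs. The values are invented sample data.
name="John Doe"
marks="67 65 83 98"

# printf turns \t in its format string into a tab character.
# $marks is deliberately unquoted so each mark becomes its own argument,
# and printf repeats the format once per argument.
row=$(printf '%s' "$name"; printf '\t%s' $marks)
printf '%s\n' "$row"

# Alternative: paste -s joins the lines of its input with tabs.
printf '%s\n' "$name" 67 65 83 98 | paste -s -
```

Both commands print the same line, with the name and the four marks separated by tabs.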
Regards,
-Drew
05-09-2007, 01:26 PM
#5
LQ Guru
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
All great advice. It would sure help if you post some examples of the HTML code tho.
05-09-2007, 03:55 PM
#6
Member
Registered: Oct 2004
Location: NY
Distribution: Slackware
Posts: 200
Rep:
Alternatively you can use wget instead of lynx, sed for parsing the data and grep for filtering it. sed is a powerful tool for extracting pieces of text. If you posted some samples of your HTML it would be much easier to give you some hints. In fact, well-formed XHTML is also XML, so if the pages are XHTML you could use XML parsing tools to extract the data.
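To make the sed suggestion a bit more concrete, here is a rough sketch of pulling one value out of a line by the id of the element that holds it. The sample line and the lblMarks id are placeholders I made up; the real ids depend on the actual HTML.

```shell
#!/bin/sh
# Sketch: print the text inside the element whose id is lblMarks.
# The sample markup and the id are placeholders, not taken from a real page.
sample='<td><DIV id="lblMarks">74 </DIV></td>'

# -n suppresses normal output; the p flag prints only lines where the
# substitution matched. \([^<]*\) captures everything up to the next tag.
value=$(printf '%s\n' "$sample" |
    sed -n 's/.*id="lblMarks">\([^<]*\)<.*/\1/p')

# Strip the single trailing blank left after the number in the sample.
value=${value% }
printf '%s\n' "$value"
```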
05-09-2007, 09:54 PM
#7
Senior Member
Registered: Oct 2005
Location: New Mexico
Distribution: Slackware
Posts: 1,639
Rep:
Perl would be a fine choice, but please check out the many modules available rather than rolling your own regular expressions. A nice article on processing html with perl can be found at
http://www.perl.com/pub/a/2006/01/19...zing_html.html .
Brian
05-10-2007, 12:16 AM
#8
Member
Registered: Dec 2004
Location: Hyderabad, India
Distribution: Slackware 13, Ubuntu 12.04
Posts: 440
Original Poster
Rep:
;-)
Sorry for not posting sample html files. But the results are not online yet. They will be made available online at around 5:30 PM (IST). But I'll try to send old files as soon as I can locate them. Thanks for your valuable suggestions.
05-10-2007, 12:35 AM
#9
Member
Registered: Dec 2004
Location: Hyderabad, India
Distribution: Slackware 13, Ubuntu 12.04
Posts: 440
Original Poster
Rep:
Here is a previous year's file -
Quote:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<title>APOnline - SSC Marks Memorandum</title>
<meta content="Microsoft Visual Studio 7.0" name="GENERATOR">
<meta content="C#" name="CODE_LANGUAGE">
<meta content="pragma" name="no-cache">
<meta content="JavaScript (ECMAScript)" name="vs_defaultClientScript">
<meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
<LINK href="applystyles.css" type="text/css" rel="stylesheet">
<SCRIPT language="JavaScript" src="printFunction.js"></SCRIPT>
<script language="javascript">
function trimField()
{
var Strfldvalue=document.SSCForm.txtRollNumber.value;
var Strtrmvalue="";
var j=0;
for(k=0; k<Strfldvalue.length; k++)
{
if(Strfldvalue.charAt(k)==" " )j++;
else
{
if(j<Strfldvalue.length)
Strtrmvalue=Strfldvalue.substring(j,Strfldvalue.length);
break;
}
}
document.SSCForm.txtRollNumber.value=Strtrmvalue;
}
function ValidateForm()
{
var msg ="";
trimField();
var regdNumber = document.SSCForm.txtRollNumber.value;
if(regdNumber == "")
{
msg = "Enter Regd. Number.";
}
else if(isNaN(regdNumber))
{
msg = "The Regd. Number you entered should be a numeric value.";
}
else if(regdNumber.indexOf(".") != -1)
{
msg = "Entered value for Regd. Number should not be a decimal value.";
}
else if(parseFloat(regdNumber) < 0)
{
msg= "Entered value for Regd. should be a positive value.";
}
if(msg.length > 0)
{
alert(msg);
document.SSCForm.txtRollNumber.focus();
return false;
}
else
{
return true;
}
}
function DisplayCongrats()
{
var msg = document.SSCForm.lblCongrats.value;
if(msg.length > 0)
{
alert(msg);
}
}
function DisplayCongrats()
{
if(document.SSCForm.lblCongrats.value.length !=0)
{
alert(document.SSCForm.lblCongrats.value);
}
}
</script>
</HEAD>
<body leftMargin="0" topMargin="0" onload="DisplayCongrats()" marginheight="0" marginwidth="0"
MS_POSITIONING="GridLayout">
<DIV id="PrintContent" align="center">
<table cellSpacing="1" cellPadding="3" width="610" align="center" border="0">
<tr>
<td class="govhead" align="center"><IMG src="./images/andhralogo.jpg"><br>
Government of Andhra Pradesh
</td>
</tr>
<tr>
<td class="formtext" align="center"><span class="head1">SSC Public Examinations
Regular, March 2006</span>
<BR>
</td>
</tr>
<tr>
<td id="tdMarksList" align="center" width="610" height="25">
<div class="head3" align="center">
<center>Marks List</center>
</div>
</td>
</tr>
<tr>
<td class="mandatory" align="center">
<table id="resultTable" bordercolor="#aaaaaa" cellspacing="0" cellpadding="3" width="610" border="1">
<tr width="=305">
<td class="formbg1" nowrap="nowrap" width="95" colspan="2"> Roll No.
</td>
<td class="formbg2" width="185" colspan="2">
<DIV id="lblRollNo">0132158</DIV>
</td>
<td class="formbg1" nowrap="nowrap" width="68" colspan="2"> Date
</td>
<td class="formbg2" width="257" colspan="2">
<DIV id="lblDate">04/05/2006</DIV>
</td>
</tr>
<tr>
<td class="formbg1" nowrap="nowrap" colspan="2"> Name of the Candidate
</td>
<td class="formbg2" colspan="6">
<DIV id="lblNameOfCandidate">UPADHYAY VARUN</DIV>
</td>
</tr>
<tr>
<td class="formbg1" width="99" colspan="2"> Center.Name.
</td>
<td class="formbg2" width="190" colspan="2">
<DIV id="lblCNo">GOVT HIGH SCHOOL SHAHINAYAT GUNJ HYD</DIV>
</td>
<td class="formbg1" width="56" colspan="2"> </td>
<td class="formbg2" width="252" colspan="2">
<DIV id="lblMedium"></DIV>
</td>
</tr>
<tr>
<td class="formtext" colspan="8" height="15">
</td>
</tr>
<tr class="formbg1">
<td class="formtext" nowrap="nowrap" align="center" width="75">I Lang
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75">II Lang
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75">Maths
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75">Science
</td>
<td class="formtext" nowrap="nowrap" align="center" width="85">Social Studies
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75">III Lang
</td>
<td class="urlbottom1" nowrap="nowrap" align="center" width="75">Total
</td>
<td class="urlbottom1" nowrap="nowrap" align="center" width="75">Result
</td>
</tr>
<tr class="formbg2">
<td class="formtext" align="center" width="75">
<DIV id="lblFirstLanguage" noWrap="">66 </DIV>
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblEnglish">77 </DIV>
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblMaths">74 </DIV>
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblScience">55 </DIV>
</td>
<td class="formtext" nowrap="nowrap" align="center" width="85"><DIV id="lblSocial">85 </DIV>
</td>
<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblSecondLanguage">59 </DIV>
</td>
<td class="urlbottom1" nowrap="nowrap" align="center" width="75">
<DIV id="lblTotal">416 </DIV>
</td>
<td class="urlbottom1" nowrap="nowrap" align="center" width="75">
<DIV id="lblResult">First Class</DIV>
</td>
</tr>
<tr>
<td class="govhead" colspan="8" height="25">NOTE: This information is provided to
the candidate on his/her online request and is only a prototype list.
</td>
</tr>
</table>
</td>
</tr>
<TR>
<td>
</td>
</TR>
</table>
</DIV>
<form name="SSCForm" method="post" action="ShowSSCResults.aspx" id="SSCForm" onsubmit="return ValidateForm()">
<input type="hidden" name="__VIEWSTATE" value="dDw5MzI2MTQzOTM7dDw7bDxpPDE+O2k8Mz47aTw1PjtpPDc+O2k8OT47PjtsPHQ8cDxsPFZpc2libGU7PjtsPG88dD47P j47Oz47dDxwPGw8aW5uZXJodG1sO1Zpc2libGU7PjtsPFxlO288Zj47Pj47Oz47dDxwPGw8VmlzaWJsZTs+O2w8bzx0Pjs+PjtsP Gk8MD47aTwxPjtpPDI+O2k8NT47PjtsPHQ8O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8MT47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O 2w8MDEzMjE1ODs+Pjs7Pjs+Pjt0PDtsPGk8MT47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O2w8MDQvMDUvMjAwNjs+Pjs7Pjs+Pjs+P jt0PDtsPGk8MT47PjtsPHQ8O2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJodG1sOz47bDxVUEFESFlBWSBWQVJVTjs+Pjs7Pjs+Pjs+P jt0PDtsPGk8MT47PjtsPHQ8O2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJodG1sOz47bDxHT1ZUIEhJR0ggU0NIT09MIFNIQUhJTkFZQ VQgR1VOSiBIWUQ7Pj47Oz47Pj47Pj47dDw7bDxpPDA+O2k8MT47aTwyPjtpPDM+O2k8ND47aTw1PjtpPDY+O2k8Nz47PjtsPHQ8O 2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJodG1sOz47bDw2Njs+Pjs7Pjs+Pjt0PDtsPGk8MD47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O 2w8Nzc7Pj47Oz47Pj47dDw7bDxpPDA+Oz47bDx0PHA8bDxpbm5lcmh0bWw7PjtsPDc0Oz4+Ozs+Oz4+O3Q8O2w8aTwwPjs+O2w8d DxwPGw8aW5uZXJodG1sOz47bDw1NTs+Pjs7Pjs+Pjt0PDtsPGk8MD47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O2w8ODU7Pj47Oz47P j47dDw7bDxpPDA+Oz47bDx0PHA8bDxpbm5lcmh0bWw7PjtsPDU5Oz4+Ozs+Oz4+O3Q8O2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJod G1sOz47bDw0MTY7Pj47Oz47Pj47dDw7bDxpPDE+Oz47bDx0PHA8bDxpbm5lcmh0bWw7PjtsPEZpcnN0IENsYXNzOz4+Ozs+Oz4+O z4+Oz4+O3Q8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Oz47dDw7bDxpPDM+Oz47bDx0PHA8bDxWaXNpYmxlOz47bDxvPHQ+Oz4+O zs+Oz4+Oz4+Oz4BgFu44vd/hpFN7SqVSNlw0wSXyw==" />
<table cellSpacing="1" cellPadding="3" width="610" align="center" border="0">
<tr>
<td class="head2" height="25">This special edition of SSC results has been powered
by APONLINE.
</td>
</tr>
<tr>
<td class="formtext">
<P>&nbsp;
<STRONG>Enter Regd. Number</STRONG>
<input class="formtext" id="txtRollNumber" style="WIDTH: 112px; HEIGHT: 18px" type="text"
maxLength="7" size="13" name="txtRollNumber"> <INPUT class="formtext" id="SubmitButton" type="submit" value="Submit" name="SubmitButton"> &nbsp;
<A href="ResultsHome.aspx">HOME</A>
<!-- <br>
&nbsp;
<STRONG>Select District</STRONG>
<select name="drpdwndist" id="drpdwndist">
<option selected="selected" value="00">Select</option>
<option value="35">Adilabad</option>
<option value="23">Ananthapur-1</option>
<option value="24">Ananthapur-2</option>
<option value="18">Chittoor-1</option>
<option value="19">Chittoor-2</option>
<option value="22">Cudapah</option>
<option value="08">East Godavari-1</option>
<option value="09">East Godavari-2</option>
<option value="14">Guntur-1</option>
<option value="15">Guntur-2</option>
<option value="01">Hyderabad-1</option>
<option value="02">Hyderabad-2</option>
<option value="03">Hyderabad-3</option>
<option value="33">Karimnagar-1</option>
<option value="34">Karimnagar-2</option>
<option value="30">Khammam</option>
<option value="12">Krishna-1</option>
<option value="13">Krishna-2</option>
<option value="20">Kurnool</option>
<option value="25">Mahaboobnagar</option>
<option value="28">Medak</option>
<option value="26">Nalgonda-1</option>
<option value="27">Nalgonda-2</option>
<option value="17">Nellore</option>
<option value="29">Nizamabad</option>
<option value="16">Prakasham</option>
<option value="36">RangaReddy-1</option>
<option value="37">RangaReddy-2</option>
<option value="05">Srikakulam</option>
<option value="38">Visakhapatnam-1</option>
<option value="39">Visakhapatnam-2</option>
<option value="06">Vizianagaram</option>
<option value="31">Warangal-1</option>
<option value="32">Warangal-2</option>
<option value="10">West Godavari-1</option>
<option value="11">West Godavari-2</option>
</select>
&nbsp; </P>
-->
<P><br>
&nbsp; &nbsp; &nbsp; <input name="printButton" id="printButton" type="button" class="formtext" onclick="PrintThisPageWithCount('SSCResults2005')" value=" Print " />
<input name="lblCongrats" id="lblCongrats" type="hidden" class="formtext" value="Congratulations, UPADHYAY VARUN" />
<BR>
<br>
</P>
<P></P>
</td>
</tr>
</table>
</form>
<script language="javascript" src="creditsfooter.js"></script>
</body>
</HTML>
The subject labels and marks obtained have been displayed in red.
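Since the red highlighting does not survive in plain text: each value appears to sit in a DIV whose id starts with lbl. A quick way to list those, sketched here against a cut-down copy of the page saved under the hypothetical name sample.html, might be:

```shell
#!/bin/sh
# Sketch: list every <DIV id="lbl..."> together with the text that follows it.
# sample.html is a hypothetical cut-down copy of the page quoted above.
cat > sample.html <<'EOF'
<DIV id="lblRollNo">0132158</DIV>
<DIV id="lblNameOfCandidate">UPADHYAY VARUN</DIV>
<td class="formtext" align="center" width="75"><DIV id="lblMaths">74 </DIV>
<DIV id="lblTotal">416 </DIV>
EOF

# grep -o prints only the matching part of each line; the trailing sed
# removes the blanks the page leaves after the numbers.
fields=$(grep -o '<DIV id="lbl[A-Za-z]*"[^>]*>[^<]*' sample.html | sed 's/ *$//')
printf '%s\n' "$fields"
```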
05-10-2007, 10:30 AM
#10
Member
Registered: Oct 2004
Location: NY
Distribution: Slackware
Posts: 200
Rep:
Hi,
You can try this to parse the file. It's just one of many ways to do it, but it should work for you:
Code:
grep -Eo '(lblEnglish|lblMaths|lblScience)">[0-9]+' file.html | sed 's/">/ /'
If you have many files, you can put it in a loop and redirect the output to a file, for example:
Code:
# Let the shell expand the glob directly; quoting 'ls ...' would loop once over a literal string
for f in /myfiles/*.html; do
    grep -Eo '(lblEnglish|lblMaths|lblScience)">[0-9]+' "$f" | sed 's/">/ /' >> output.txt
done
That will parse the files and append the extracted marks to output.txt.
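If the goal is one tab-separated line per student, as in the first post, a variation along these lines might do it. Treat it as a sketch built on the ids in the sample page (lblRollNo, lblNameOfCandidate, lblFirstLanguage and so on): the results directory, the marks.tsv name and the field helper are hypothetical, and the two cut-down pages are generated only so the example can run on its own.

```shell
#!/bin/sh
# Sketch: one tab-separated row per saved page.
# The "results" directory, the field helper and the sample data are illustrative.
mkdir -p results
for n in 1 2; do
    cat > "results/student$n.html" <<EOF
<DIV id="lblRollNo">013215$n</DIV>
<DIV id="lblNameOfCandidate">STUDENT $n</DIV>
<DIV id="lblFirstLanguage" noWrap="">6$n </DIV>
<td align="center"><DIV id="lblEnglish">7$n </DIV>
<td align="center"><DIV id="lblMaths">8$n </DIV>
EOF
done

# field FILE ID: print the text of the DIV with the given id,
# minus any trailing blanks.
field() {
    sed -n "s|.*<DIV id=\"$2\"[^>]*>\([^<]*[^< ]\) *</DIV>.*|\1|p" "$1"
}

tab=$(printf '\t')
: > marks.tsv                      # start with an empty output file
for f in results/*.html; do
    row=$(field "$f" lblRollNo)
    for id in lblNameOfCandidate lblFirstLanguage lblEnglish lblMaths; do
        row="$row$tab$(field "$f" "$id")"
    done
    printf '%s\n' "$row" >> marks.tsv
done
cat marks.tsv
```

Adding lblScience, lblSocial, lblSecondLanguage, lblTotal or lblResult to the inner list would extend each row with those columns.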
Best,
Chris
Last edited by krizzz; 05-11-2007 at 09:05 AM .