LinuxQuestions.org
Old 05-09-2007, 07:16 AM   #1
adityavpratap
Member
 
Registered: Dec 2004
Location: Hyderabad, India
Distribution: Slackware 13, Ubuntu 12.04
Posts: 440

Rep: Reputation: 32
extracting data from html files into one text file


Hi!
Here is my problem (it may not be appropriate for this forum, in which case I apologize for posting it here):
The exam results for one class at my school will be available online at a particular website. I have to enter each candidate's roll number, and the marks that student obtained in each subject are then displayed. What I normally do is save each web page showing a student's marks and then manually copy the marks for each subject into a single tab-separated text file by cutting and pasting, like this:
Quote:
1. John Doe 67 65 83 98
2. Amitabh Bachchan 87 78 93 73
It may not be possible to download the marks of all the students as a single file. So, after downloading each student's HTML file, is there a way to extract the data from these files automatically with a script and store it in a single tab-separated text file?
Thanks in advance,

Last edited by adityavpratap; 05-09-2007 at 07:17 AM.
 
Old 05-09-2007, 08:46 AM   #2
scoban
Member
 
Registered: Nov 2004
Location: Turkey
Distribution: Slackware
Posts: 145

Rep: Reputation: 16
Yes, it is possible. Can you post some example HTML?
 
Old 05-09-2007, 08:51 AM   #3
krizzz
Member
 
Registered: Oct 2004
Location: NY
Distribution: Slackware
Posts: 200

Rep: Reputation: 30
I'd recommend using a Perl script with regular expressions. If you don't have any programming experience, don't worry, you'll pick it up quickly.
 
Old 05-09-2007, 09:27 AM   #4
Lufbery
Senior Member
 
Registered: Aug 2006
Location: Harrisburg, PA
Distribution: Slackware 64 14.2
Posts: 1,180
Blog Entries: 29

Rep: Reputation: 135
Hi all,

Couldn't something like this be done by using Lynx to dump each student's web page to a text file:

Code:
lynx -dump http://www.studentpage.edu >> studentfiles.txt
The >> will append each student's web page to the end of the last one in studentfiles.txt.

If you have each student's page listed in a single text file you could have the bash shell automatically read each URL and output it to the studentfiles.txt file.

Code:
while IFS= read -r url; do lynx -dump "$url" >> studentfiles.txt; done < urls.txt
Note: I'm just learning bash, and I got the above code from another web site, but I think it will work.

Then I'd use grep to grab the relevant lines and output them to another text file. Without seeing the HTML, I'm not sure what's required.

I'm also not sure about how to insert tabs for tab delimiting. Maybe somebody else can help.
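
On the tab question, one option is paste, which can join several lines into one using a tab as the separator. This is only a minimal sketch, assuming the values you grep out arrive one per line; the helper name join_tabs and the sample values are made up for illustration.

```shell
#!/bin/sh
# join_tabs: read lines on stdin and print them as one tab-separated row.
# (hypothetical helper name, used here only for illustration)
join_tabs() {
    # -s joins all input lines into one; -d '\t' uses a tab as the delimiter
    paste -s -d '\t' -
}

# Example: five values, one per line, become a single tab-separated row.
printf '%s\n' "John Doe" 67 65 83 98 | join_tabs
```

Appending the result with >> to the same file for every student would build up the table one row at a time.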

Regards,

-Drew
 
Old 05-09-2007, 01:26 PM   #5
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
All great advice. It would sure help if you posted some examples of the HTML code, though.
 
Old 05-09-2007, 03:55 PM   #6
krizzz
Member
 
Registered: Oct 2004
Location: NY
Distribution: Slackware
Posts: 200

Rep: Reputation: 30
Alternatively, you can use wget instead of lynx to fetch the pages, then grep to filter the relevant lines and sed to pull out the values. sed is a powerful tool for extracting pieces of text. If you post some samples of your HTML, it will be much easier to give you hints. Also, well-formed XHTML is valid XML, so if the pages turn out to be XHTML you could use XML parsing tools to extract the data.
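
As a rough illustration of the sed step, here is a sketch that pulls the number out of an element with a known id. It assumes the element and its value sit on one line; the helper name extract_by_id, the id "maths", and the sample line are invented for the example and will need adjusting to the real pages.

```shell
#!/bin/sh
# extract_by_id ID: for each input line containing an element whose id
# attribute equals ID, print the digits inside it (hypothetical helper).
extract_by_id() {
    # -n plus the p flag prints only lines where the substitution matched;
    # \( \) captures the digits and \1 keeps just that capture.
    sed -n "s/.*id=\"$1\"[^>]*>\([0-9][0-9]*\)<.*/\1/p"
}

# Example with a made-up line of HTML:
printf '%s\n' '<td id="maths">74</td>' | extract_by_id maths
```

Lines that do not contain the requested id produce no output, so the same filter can be run over a whole page.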
 
Old 05-09-2007, 09:54 PM   #7
BCarey
Senior Member
 
Registered: Oct 2005
Location: New Mexico
Distribution: Slackware
Posts: 1,639

Rep: Reputation: Disabled
Perl would be a fine choice, but please check out the many modules available rather than rolling your own regular expressions. A nice article on processing html with perl can be found at http://www.perl.com/pub/a/2006/01/19...zing_html.html.

Brian
 
Old 05-10-2007, 12:16 AM   #8
adityavpratap
Member
 
Registered: Dec 2004
Location: Hyderabad, India
Distribution: Slackware 13, Ubuntu 12.04
Posts: 440

Original Poster
Rep: Reputation: 32
;-)
Sorry for not posting sample html files. But the results are not online yet. They will be made available online at around 5:30 PM (IST). But I'll try to send old files as soon as I can locate them. Thanks for your valuable suggestions.
 
Old 05-10-2007, 12:35 AM   #9
adityavpratap
Member
 
Registered: Dec 2004
Location: Hyderabad, India
Distribution: Slackware 13, Ubuntu 12.04
Posts: 440

Original Poster
Rep: Reputation: 32
Here is a previous year's file -
Quote:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >

<HTML>

<HEAD>

<title>APOnline - SSC Marks Memorandum</title>

<meta content="Microsoft Visual Studio 7.0" name="GENERATOR">

<meta content="C#" name="CODE_LANGUAGE">

<meta content="pragma" name="no-cache">

<meta content="JavaScript (ECMAScript)" name="vs_defaultClientScript">

<meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">

<LINK href="applystyles.css" type="text/css" rel="stylesheet">

<SCRIPT language="JavaScript" src="printFunction.js"></SCRIPT>

<script language="javascript">

function trimField()

{

var Strfldvalue=document.SSCForm.txtRollNumber.value;

var Strtrmvalue="";

var j=0;

for(k=0; k<Strfldvalue.length; k++)

{

if(Strfldvalue.charAt(k)==" " )j++;

else

{

if(j<Strfldvalue.length)

Strtrmvalue=Strfldvalue.substring(j,Strfldvalue.length);

break;

}

}

document.SSCForm.txtRollNumber.value=Strtrmvalue;

}

function ValidateForm()

{

var msg ="";

trimField();

var regdNumber = document.SSCForm.txtRollNumber.value;

if(regdNumber == "")

{

msg = "Enter Regd. Number.";

}

else if(isNaN(regdNumber))

{

msg = "The Regd. Number you entered should be a numeric value.";

}

else if(regdNumber.indexOf(".") != -1)

{

msg = "Entered value for Regd. Number should not be a decimal value.";

}

else if(parseFloat(regdNumber) < 0)

{

msg= "Entered value for Regd. should be a positive value.";

}

if(msg.length > 0)

{

alert(msg);

document.SSCForm.txtRollNumber.focus();

return false;

}

else

{



return true;



}

}



function DisplayCongrats()

{

var msg = document.SSCForm.lblCongrats.value;

if(msg.length > 0)

{

alert(msg);

}

}

function DisplayCongrats()

{

if(document.SSCForm.lblCongrats.value.length !=0)

{

alert(document.SSCForm.lblCongrats.value);

}

}

</script>

</HEAD>

<body leftMargin="0" topMargin="0" onload="DisplayCongrats()" marginheight="0" marginwidth="0"

MS_POSITIONING="GridLayout">

<DIV id="PrintContent" align="center">

<table cellSpacing="1" cellPadding="3" width="610" align="center" border="0">

<tr>

<td class="govhead" align="center"><IMG src="./images/andhralogo.jpg"><br>

Government of Andhra Pradesh

</td>

</tr>

<tr>

<td class="formtext" align="center"><span class="head1">SSC Public Examinations

Regular, March 2006</span>

<BR>

</td>

</tr>

<tr>

<td id="tdMarksList" align="center" width="610" height="25">

<div class="head3" align="center">

<center>Marks List</center>

</div>

</td>



</tr>

<tr>

<td class="mandatory" align="center">

<table id="resultTable" bordercolor="#aaaaaa" cellspacing="0" cellpadding="3" width="610" border="1">

<tr width="=305">

<td class="formbg1" nowrap="nowrap" width="95" colspan="2">&nbsp;Roll No.

</td>

<td class="formbg2" width="185" colspan="2">

<DIV id="lblRollNo">0132158</DIV>

</td>

<td class="formbg1" nowrap="nowrap" width="68" colspan="2">&nbsp;Date

</td>

<td class="formbg2" width="257" colspan="2">

<DIV id="lblDate">04/05/2006</DIV>

</td>

</tr>

<tr>

<td class="formbg1" nowrap="nowrap" colspan="2">&nbsp;Name of the Candidate

</td>

<td class="formbg2" colspan="6">

<DIV id="lblNameOfCandidate">UPADHYAY VARUN</DIV>

</td>

</tr>

<tr>

<td class="formbg1" width="99" colspan="2">&nbsp;Center.Name.&nbsp;&nbsp;

</td>

<td class="formbg2" width="190" colspan="2">

<DIV id="lblCNo">GOVT HIGH SCHOOL SHAHINAYAT GUNJ HYD</DIV>

</td>

<td class="formbg1" width="56" colspan="2">&nbsp;</td>

<td class="formbg2" width="252" colspan="2">

<DIV id="lblMedium"></DIV>

</td>

</tr>

<tr>

<td class="formtext" colspan="8" height="15">&nbsp;

</td>

</tr>

<tr class="formbg1">

<td class="formtext" nowrap="nowrap" align="center" width="75">I Lang

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75">II Lang

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75">Maths

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75">Science

</td>

<td class="formtext" nowrap="nowrap" align="center" width="85">Social Studies

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75">III Lang

</td>

<td class="urlbottom1" nowrap="nowrap" align="center" width="75">Total

</td>

<td class="urlbottom1" nowrap="nowrap" align="center" width="75">Result

</td>

</tr>

<tr class="formbg2">

<td class="formtext" align="center" width="75">

<DIV id="lblFirstLanguage" noWrap="">66</DIV>

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblEnglish">77</DIV>

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblMaths">74</DIV>

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblScience">55</DIV>

</td>

<td class="formtext" nowrap="nowrap" align="center" width="85"><DIV id="lblSocial">85</DIV>

</td>

<td class="formtext" nowrap="nowrap" align="center" width="75"><DIV id="lblSecondLanguage">59</DIV>

</td>

<td class="urlbottom1" nowrap="nowrap" align="center" width="75">

<DIV id="lblTotal">416</DIV>

</td>

<td class="urlbottom1" nowrap="nowrap" align="center" width="75">

<DIV id="lblResult">First Class</DIV>

</td>

</tr>

<tr>

<td class="govhead" colspan="8" height="25">NOTE: This information is provided to

the candidate on his/her online request and is only a prototype list.

</td>

</tr>

</table>



</td>

</tr>

<TR>

<td>

</td>

</TR>

</table>

</DIV>

<form name="SSCForm" method="post" action="ShowSSCResults.aspx" id="SSCForm" onsubmit="return ValidateForm()">

<input type="hidden" name="__VIEWSTATE" value="dDw5MzI2MTQzOTM7dDw7bDxpPDE+O2k8Mz47aTw1PjtpPDc+O2k8OT47PjtsPHQ8cDxsPFZpc2libGU7PjtsPG88dD47P j47Oz47dDxwPGw8aW5uZXJodG1sO1Zpc2libGU7PjtsPFxlO288Zj47Pj47Oz47dDxwPGw8VmlzaWJsZTs+O2w8bzx0Pjs+PjtsP Gk8MD47aTwxPjtpPDI+O2k8NT47PjtsPHQ8O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8MT47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O 2w8MDEzMjE1ODs+Pjs7Pjs+Pjt0PDtsPGk8MT47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O2w8MDQvMDUvMjAwNjs+Pjs7Pjs+Pjs+P jt0PDtsPGk8MT47PjtsPHQ8O2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJodG1sOz47bDxVUEFESFlBWSBWQVJVTjs+Pjs7Pjs+Pjs+P jt0PDtsPGk8MT47PjtsPHQ8O2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJodG1sOz47bDxHT1ZUIEhJR0ggU0NIT09MIFNIQUhJTkFZQ VQgR1VOSiBIWUQ7Pj47Oz47Pj47Pj47dDw7bDxpPDA+O2k8MT47aTwyPjtpPDM+O2k8ND47aTw1PjtpPDY+O2k8Nz47PjtsPHQ8O 2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJodG1sOz47bDw2Njs+Pjs7Pjs+Pjt0PDtsPGk8MD47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O 2w8Nzc7Pj47Oz47Pj47dDw7bDxpPDA+Oz47bDx0PHA8bDxpbm5lcmh0bWw7PjtsPDc0Oz4+Ozs+Oz4+O3Q8O2w8aTwwPjs+O2w8d DxwPGw8aW5uZXJodG1sOz47bDw1NTs+Pjs7Pjs+Pjt0PDtsPGk8MD47PjtsPHQ8cDxsPGlubmVyaHRtbDs+O2w8ODU7Pj47Oz47P j47dDw7bDxpPDA+Oz47bDx0PHA8bDxpbm5lcmh0bWw7PjtsPDU5Oz4+Ozs+Oz4+O3Q8O2w8aTwxPjs+O2w8dDxwPGw8aW5uZXJod G1sOz47bDw0MTY7Pj47Oz47Pj47dDw7bDxpPDE+Oz47bDx0PHA8bDxpbm5lcmh0bWw7PjtsPEZpcnN0IENsYXNzOz4+Ozs+Oz4+O z4+Oz4+O3Q8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Oz47dDw7bDxpPDM+Oz47bDx0PHA8bDxWaXNpYmxlOz47bDxvPHQ+Oz4+O zs+Oz4+Oz4+Oz4BgFu44vd/hpFN7SqVSNlw0wSXyw==" />



<table cellSpacing="1" cellPadding="3" width="610" align="center" border="0">

<tr>

<td class="head2" height="25">This special edition of SSC results has been powered

by APONLINE.

</td>

</tr>

<tr>

<td class="formtext">

<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;& nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

<STRONG>Enter Regd. Number</STRONG> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

<input class="formtext" id="txtRollNumber" style="WIDTH: 112px; HEIGHT: 18px" type="text"

maxLength="7" size="13" name="txtRollNumber">&nbsp;&nbsp;&nbsp;<INPUT class="formtext" id="SubmitButton" type="submit" value="Submit" name="SubmitButton">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n bsp;

<A href="ResultsHome.aspx">HOME</A>

<!-- <br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

<STRONG>Select District</STRONG> &nbsp;&nbsp;&nbsp;&nbsp;

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

<select name="drpdwndist" id="drpdwndist">

<option selected="selected" value="00">Select</option>

<option value="35">Adilabad</option>

<option value="23">Ananthapur-1</option>

<option value="24">Ananthapur-2</option>

<option value="18">Chittoor-1</option>

<option value="19">Chittoor-2</option>

<option value="22">Cudapah</option>

<option value="08">East Godavari-1</option>

<option value="09">East Godavari-2</option>

<option value="14">Guntur-1</option>

<option value="15">Guntur-2</option>

<option value="01">Hyderabad-1</option>

<option value="02">Hyderabad-2</option>

<option value="03">Hyderabad-3</option>

<option value="33">Karimnagar-1</option>

<option value="34">Karimnagar-2</option>

<option value="30">Khammam</option>

<option value="12">Krishna-1</option>

<option value="13">Krishna-2</option>

<option value="20">Kurnool</option>

<option value="25">Mahaboobnagar</option>

<option value="28">Medak</option>

<option value="26">Nalgonda-1</option>

<option value="27">Nalgonda-2</option>

<option value="17">Nellore</option>

<option value="29">Nizamabad</option>

<option value="16">Prakasham</option>

<option value="36">RangaReddy-1</option>

<option value="37">RangaReddy-2</option>

<option value="05">Srikakulam</option>

<option value="38">Visakhapatnam-1</option>

<option value="39">Visakhapatnam-2</option>

<option value="06">Vizianagaram</option>

<option value="31">Warangal-1</option>

<option value="32">Warangal-2</option>

<option value="10">West Godavari-1</option>

<option value="11">West Godavari-2</option>



</select>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</P>

-->

<P><br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs p;&nbsp;&nbsp;&nbsp;<input name="printButton" id="printButton" type="button" class="formtext" onclick="PrintThisPageWithCount('SSCResults2005')" value=" Print " />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

<input name="lblCongrats" id="lblCongrats" type="hidden" class="formtext" value="Congratulations, UPADHYAY VARUN" />

<BR>

<br>

</P>

<P></P>

</td>

</tr>

</table>

</form>

<script language="javascript" src="creditsfooter.js"></script>

</body>

</HTML>

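The subject labels and marks obtained were highlighted in red in the original post. The marks are the contents of the DIV elements with ids lblFirstLanguage, lblEnglish, lblMaths, lblScience, lblSocial, lblSecondLanguage and lblTotal.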
 
Old 05-10-2007, 10:30 AM   #10
krizzz
Member
 
Registered: Oct 2004
Location: NY
Distribution: Slackware
Posts: 200

Rep: Reputation: 30
Hi,

You can try this to parse the file. It's just one of countless ways to do it, but it should work for you:

Code:
grep -Eo '(lblEnglish|lblMaths|lblScience)">[0-9]+' file.html | sed 's/">/ /'
If you have many files, you can put it in a loop and append the output to a file, for example:

Code:
for f in /myfiles/*.html; do
    grep -Eo '(lblEnglish|lblMaths|lblScience)">[0-9]+' "$f" | sed 's/">/ /' >> output.txt
done
That will parse the files and append the summarized output to output.txt.
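
To get one tab-separated line per student, as asked in the first post, something along these lines might work. It is only a sketch: it assumes every page has the same DIV ids as the sample in post #9 and that each DIV sits on a single line, and the function names field and row are made up for illustration.

```shell
#!/bin/sh
# field FILE ID: print the text inside <DIV id="ID" ...>...</DIV>.
# Assumes the whole DIV sits on one line, as in the sample page.
field() {
    sed -n "s/.*<DIV id=\"$2\"[^>]*>\([^<]*\)<\/DIV>.*/\1/p" "$1" | head -n 1
}

# row FILE: print roll number, name, the six marks and the total,
# separated by tabs, on a single line.
row() {
    tab=$(printf '\t')
    out=$(field "$1" lblRollNo)
    for id in lblNameOfCandidate lblFirstLanguage lblEnglish lblMaths \
              lblScience lblSocial lblSecondLanguage lblTotal; do
        out="$out$tab$(field "$1" "$id")"
    done
    printf '%s\n' "$out"
}
```

With those two functions defined, running for f in /myfiles/*.html; do row "$f"; done > marks.txt should leave one line per student in marks.txt (the directory and output file name are placeholders). A missing DIV shows up as an empty column rather than shifting the others.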

Best,
Chris

Last edited by krizzz; 05-11-2007 at 09:05 AM.
 
  

