Old 10-18-2007, 04:43 PM   #1
Registered: Dec 2003
Location: USA
Distribution: Debian
Posts: 40

Rep: Reputation: 15
Script to extract blocks of text from many files


I have about 2000 html files with blocks of info which I need to extract.

I am thinking bash but I am more fluent in php so that would be better.

The file names are pro1.html pro2.html ... pro1980.html

So what I need is a script that:
1) opens each file
2) tries to find a start string, and skips the file if the start string does not exist
3) finds an end string
4) copies/prints everything in between to another file.

So that all the selected text ends up in one single long file.

Can you get me started on this?


start_string = "<table>"
end_string = "</table>"
selected_text = ''
Old 10-18-2007, 05:21 PM   #2
Senior Member
Registered: Dec 2002
Location: England
Distribution: Used to use Mandrake/Mandriva
Posts: 2,794

Rep: Reputation: 116Reputation: 116
Is it XHTML or HTML4? Is it well-formed XML (i.e. matching opening and closing tags, nested correctly)? Do you know about XSLT?

I'm more of a Python or Java man myself, but I'm a little rusty and tired at the moment. It could probably be done with a grep statement, of course.

As you say, if you're writing the file searching yourself, you need to:
create an empty output file buffer
list the files
open them one at a time
start from the beginning, treat the file contents as one long string, and look for the start string
if you find it, look from that place onward for the end string
if you find that too, append what's in between to the output file buffer (alternatively, copy everything after the start string until you reach the end string, if there is one)
close the input file and move on to the next file until none are left
write the output file

You may want to learn an efficient string matching algorithm if you're DIYing this.
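
As a rough illustration of the steps above, here is a minimal sketch in plain shell. It assumes the markers from the first post (<table> and </table>) and takes only the first block in each file. The function name extract_block and the output file name all_blocks.txt are made up for the example. It reads each file as one string and relies on the shell's own parameter expansion for the searching, so no hand-written string matching is needed.

```shell
# Sketch only: extract_block and all_blocks.txt are illustrative names.
start_string='<table>'
end_string='</table>'

# Print the text between the first start_string and the next end_string
# in file $1. Returns 1 and prints nothing if either marker is missing.
extract_block() {
    content=$(cat "$1") || return 1       # whole file as one string
    case $content in
        *"$start_string"*) ;;             # start string found
        *) return 1 ;;                    # not found: skip this file
    esac
    rest=${content#*"$start_string"}      # drop everything up to the first start string
    case $rest in
        *"$end_string"*) ;;               # end string found after the start
        *) return 1 ;;
    esac
    block=${rest%%"$end_string"*}         # keep what precedes the first end string
    printf '%s\n' "$block"
}

# Usage: collect the block from every file into one long file.
# for f in pro*.html; do extract_block "$f"; done > all_blocks.txt
```

The markers themselves are not included in the output. Files with several tables would need a loop around the two expansions.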
Old 10-18-2007, 05:29 PM   #3
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910Reputation: 910Reputation: 910Reputation: 910Reputation: 910Reputation: 910Reputation: 910Reputation: 910
Don't have files to test with handy, but this *should* work

awk '/<table>/,/<\/table>/' *.html > tables
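
One thing to keep in mind with a glob: the shell expands it in lexical order, so pro10.html comes before pro2.html. If the order of the blocks in the output matters, a counting loop is one way around that. The sketch below assumes the pro1.html ... pro1980.html naming from the first post; extract_tables is a hypothetical helper name. The awk range pattern prints whole lines, from the line containing <table> through the line containing </table>, markers included.

```shell
# Sketch only: extract_tables is an illustrative name.
# $1 = directory holding the files, $2 = highest file number.
extract_tables() {
    i=1
    while [ "$i" -le "$2" ]; do
        f="$1/pro$i.html"
        # Skip numbers with no file; awk prints nothing for a file
        # in which <table> never appears.
        [ -f "$f" ] && awk '/<table>/,/<\/table>/' "$f"
        i=$((i + 1))
    done
}

# Usage: extract_tables . 1980 > tables
```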

Old 10-19-2007, 01:47 AM   #4
Registered: Dec 2003
Location: USA
Distribution: Debian
Posts: 40

Original Poster
Rep: Reputation: 15
I have already got this done, but I am still going to test your awk line because it would make a lot of things easier.

Old 10-19-2007, 02:31 AM   #5
Registered: Nov 2005
Location: Davao City, Philippines
Distribution: RHEL, CentOS, Ubuntu, Mint
Posts: 139

Rep: Reputation: 20
Another simple solution, though it has not been tested:

for file in *.html; do
    sed -n '/start_string/,/end_string/p' "$file" >> outputfile
done
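
Both the sed and the awk range forms work on whole lines, and they behave differently when the start and end strings sit on the same line. The small sketch below shows this with made-up sample text: sed only starts looking for the end pattern on the line after the start match, so its range stays open, while awk closes the range on the same line.

```shell
# Sketch only: the sample text is invented for the demonstration.
sample='<table><tr><td>one line</td></tr></table>
<p>unrelated paragraph</p>'

# sed checks for the end pattern starting on the line AFTER the start
# match, so the range stays open and the unrelated line is printed too.
sed_out=$(printf '%s\n' "$sample" | sed -n '/<table>/,/<\/table>/p')

# awk ends the range on the same line when that line matches both patterns.
awk_out=$(printf '%s\n' "$sample" | awk '/<table>/,/<\/table>/')

printf 'sed printed %s line(s)\n' "$(printf '%s\n' "$sed_out" | wc -l | tr -d ' ')"
printf 'awk printed %s line(s)\n' "$(printf '%s\n' "$awk_out" | wc -l | tr -d ' ')"
```

If the HTML files keep each table on a single line, the awk form is the safer of the two.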

Last edited by yongitz; 10-19-2007 at 02:33 AM.

