LinuxQuestions.org
Old 10-18-2007, 05:43 PM   #1
gruessle
Member
 
Registered: Dec 2003
Location: USA
Distribution: Debian
Posts: 40

Rep: Reputation: 15
Script to extract blocks of text from many files


Hi

I have about 2000 html files with blocks of info which I need to extract.

I am thinking bash but I am more fluent in php so that would be better.

The file names are pro1.html pro2.html ... pro1980.html

So what I need is a script that:
1) opens each file
2) looks for a start string, and skips the file if the start string does not exist
3) finds an end string
4) copies/prints everything in between to another file.

All the selected text should end up in one single long file.

Can you get me started on this?

Code:
#!/bin/bash

# bash assignments must not have spaces around "="
start_string="<table>"
end_string="</table>"
selected_text=''
 
Old 10-18-2007, 06:21 PM   #2
Proud
Senior Member
 
Registered: Dec 2002
Location: England
Distribution: Used to use Mandrake/Mandriva
Posts: 2,794

Rep: Reputation: 116
Is it XHTML or HTML4? Is it well-formed XML (i.e. matching opening and closing tags, nested correctly)? Do you know about XSLT?

I'm more of a Python or Java man myself, but I'm a little rusty and tired atm. It could probably be done with a grep statement, of course.

As you say, if you're writing the file searching yourself, you need to:
create an empty output file buffer
list the files
open them one at a time
start from the beginning, treat the file contents as a long string, look for the start string
if you find it, look from that place onward for the end string
if you find it, append what's in between to the output file buffer (or, alternatively, copy everything after the start string up to the end string, if there is one)
close input file and move on to next file until none left
write output file
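
For what it's worth, the steps above can be sketched in plain shell roughly like this. The function name `extract_between` and its argument order are made up for illustration, and it assumes each file is small enough to hold in a shell variable (fine for ordinary HTML pages). It takes the stricter option: a file is skipped unless both the start and the end string are found.

```shell
# Hypothetical helper, not from the thread:
#   extract_between START END OUTFILE FILE...
# Appends the text between the first START and the next END of each file.
extract_between() {
    start=$1 end=$2 out=$3
    shift 3
    : > "$out"                        # create/empty the output file
    for file in "$@"; do
        content=$(cat "$file")        # treat the file as one long string
        case $content in
            *"$start"*) ;;            # start string found, carry on
            *) continue ;;            # no start string: skip this file
        esac
        rest=${content#*"$start"}     # everything after the first start string
        case $rest in
            *"$end"*) ;;              # end string found after the start
            *) continue ;;            # no end string: skip this file
        esac
        block=${rest%%"$end"*}        # cut at the first end string
        printf '%s\n' "$block" >> "$out"
    done
}
```

Something like `extract_between '<table>' '</table>' all_tables.txt pro*.html` would then collect the blocks into one file. The start and end strings are matched literally here, not as regular expressions.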

You may want to learn an efficient string matching algorithm if you're DIYing this.
 
Old 10-18-2007, 06:29 PM   #3
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,999
Blog Entries: 11

Rep: Reputation: 881
Don't have files to test with handy, but this *should* work:

Code:
awk '/<table>/,/<\/table>/' *.html > tables
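
One thing to watch with a range pattern: as far as I know, awk does not reset the range at file boundaries, so a file with a `<table>` but no `</table>` would keep printing into the next file. A rough sketch that forgets any unclosed block at the start of each file (the wrapper name `extract_tables` is only for illustration):

```shell
# Hypothetical wrapper, not from the thread: extract_tables FILE...
# Prints whole lines from a <table> line through the matching </table> line.
extract_tables() {
    awk '
        FNR == 1     { inside = 0 }   # first line of a file: drop any unclosed block
        /<table>/    { inside = 1 }   # start marker seen
        inside       { print }        # print while inside a block
        /<\/table>/  { inside = 0 }   # end marker seen (that line was still printed)
    ' "$@"
}
```

Usage would be along the lines of `extract_tables pro*.html > tables`. An unclosed block is still printed up to the end of its own file, it just no longer leaks into the next one.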


Cheers,
Tink
 
Old 10-19-2007, 02:47 AM   #4
gruessle
Member
 
Registered: Dec 2003
Location: USA
Distribution: Debian
Posts: 40

Original Poster
Rep: Reputation: 15
I have already got this done, but I am still going to test your awk line because it would make a lot of things easier.

Thanks
 
Old 10-19-2007, 03:31 AM   #5
yongitz
Member
 
Registered: Nov 2005
Location: Davao City, Philippines
Distribution: RHEL, CentOS, Ubuntu, Mint
Posts: 139

Rep: Reputation: 20
Another simple solution here, though it has not been tested:

Code:
for file in *.html
do
    sed -n '/start_string/,/end_string/p' "$file" >> outputfile
done
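
A possible variation on the loop, with the markers passed in as arguments and files without the start string skipped up front. The name `extract_with_sed` is invented for this sketch. The markers are used as sed regular expressions, and `|` is used as the address delimiter so that the `/` in `</table>` needs no escaping (so this assumes the markers themselves contain no `|`).

```shell
# Hypothetical helper, not from the thread: extract_with_sed START END FILE...
# Prints the lines from the START match through the END match of each file.
extract_with_sed() {
    start=$1 end=$2
    shift 2
    for file in "$@"; do
        # skip the file if the start string does not occur in it
        grep -qF -e "$start" "$file" || continue
        # \|...| makes "|" the regex delimiter instead of "/"
        sed -n "\|$start|,\|$end|p" "$file"
    done
}
```

It could be called as `extract_with_sed '<table>' '</table>' pro*.html > outputfile`. Note that sed only starts looking for the end pattern on the line after the one that matched the start, so a block that opens and closes on the same line will run on to the next closing line.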

Last edited by yongitz; 10-19-2007 at 03:33 AM.
 
  

