LinuxQuestions.org
Old 03-08-2008, 09:23 PM   #1
arispipis
LQ Newbie
 
Registered: Mar 2008
Location: Nicosia,Cyprus
Posts: 4

Rep: Reputation: 0
shell programming - file open


Hello there,
I have to write a shell program that takes as input a directory containing other directories and files. I only want the .htm and .html files. I have used
ls -R $directoryName | grep -E '*.htm[l]?' > htmlFiles.txt

to do this.
Now I have to open these files and do some text processing. I have tried
the following:

textProcessing(){
for file in `cat htmlFiles.txt`
do
grep -v '<*>' $file >> tempFile # get rid of all <...> tags
grep -v '&*;' tempFile >> newTempFile # get rid of all &...; tags

done
}

but it did not work. I have no idea whether my approach is wrong.
I need a way to pick out the files I want and then read them one by one.

I appreciate your time and effort.
Aris
 
Old 03-08-2008, 11:57 PM   #2
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
In regular expressions, "*" means "the previous character repeated 0 or more times" and "." means "any single character", so ".*" means "any character repeated 0 or more times", which is what "*" alone means in the shell. "<*>" therefore means "0 or more '<' followed by '>'": both ">" and "<<<<<<<<<<<<<<>" are valid matches, but "<br>" as a whole is not. I recommend you switch "<*>" to "<[^<]+>". You also need sed instead of grep to make the substitution, because grep -v drops every line that contains a match rather than removing just the matched text.
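To see the difference between the two patterns, here is a small sketch that runs both against three sample lines. The sample text and the variable names are made up for illustration, and it assumes a grep that supports the -o option (GNU and BSD grep do; it is not in POSIX).

```shell
# Sample input: a bare ">", a run of "<" followed by ">", and a real tag.
sample='>
<<<<>
<br>'

# -E enables extended regular expressions; -o prints only the matched text,
# one match per output line.
star_matches=$(printf '%s\n' "$sample" | grep -oE '<*>')
tag_matches=$(printf '%s\n' "$sample" | grep -oE '<[^<]+>')

printf 'matches for <*>:\n%s\n' "$star_matches"
printf 'matches for <[^<]+>:\n%s\n' "$tag_matches"
```

The first pattern never returns "<br>" as a whole; on that line it only picks up the trailing ">". The second pattern returns the complete tag and ignores the other two lines.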

Here is what "<[^<]+>" means:
  • "<": match a "<"
  • "[...]": pick a character from the list
  • "[^...]": pick a character besides one in the list
  • "[^<]": pick a character that isn't '<'
  • "+": match the preceding 1 or more times
  • "[^<]+": 1 or more characters that aren't '<'
  • ">": match a ">"
  • "<[^<]+>": "<" followed by 1 or more non-'<' followed by ">"
Here is how to use it:
Code:
sed -r "s/<[^<]+>//g" "$file" > tempFile
Here is what the sed line means:
  • "s/.../.../": match the first part and replace with the second
  • "s/...//": delete matching portions
  • "g": apply the substitution to every match on the line, not just the first
  • "s/<[^<]+>//g": delete all tags
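Putting the pieces together, the original task (walk a directory, pick the .htm/.html files, strip the markup) could be sketched roughly as follows. This is only one way to do it: strip_markup and text_processing are hypothetical names, the ".txt" output naming is an assumption, and the entity pattern "&#?[A-Za-z0-9]+;" is a guess at what "get rid of all &...; tags" was meant to cover. It uses find instead of ls -R so that each file comes with its full path, and -E, which GNU sed accepts as another spelling of -r.

```shell
# strip_markup (hypothetical helper): reads text on standard input and
# removes <...> tags first, then &...; entities such as &nbsp; or &#160;.
strip_markup() {
    sed -E -e 's/<[^<]+>//g' -e 's/&#?[A-Za-z0-9]+;//g'
}

# text_processing DIR (hypothetical wrapper): for every .htm or .html file
# under DIR, write a stripped copy next to it with ".txt" appended.
text_processing() {
    find "$1" -type f \( -name '*.htm' -o -name '*.html' \) |
    while IFS= read -r file; do
        # Quoting "$file" keeps paths with spaces intact.
        strip_markup < "$file" > "$file.txt"
    done
}
```

Call it as text_processing "$directoryName". Reading the file list line by line copes with spaces in names but not with newlines in names, and a tag that spans several lines is left alone because sed works one line at a time.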
ta0kira

PS: the "/" delimiter used with sed above can be replaced with any other character, which is handy when "/" is actually a part of your pattern. Example:
Code:
find ~ | sed "s@/home/`whoami`/@-> @"
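For comparison, here is the same kind of substitution written both ways on a fixed string. The path and the user name "alice" are invented for the example.

```shell
# Hypothetical sample path, used only for illustration.
path='/home/alice/docs/index.html'

# With "/" as the delimiter, every "/" inside the pattern needs a backslash.
with_slash=$(printf '%s\n' "$path" | sed 's/\/home\/alice\//-> /')

# With "@" as the delimiter, the same pattern can be written as-is.
with_at=$(printf '%s\n' "$path" | sed 's@/home/alice/@-> @')

printf '%s\n%s\n' "$with_slash" "$with_at"
```

Both commands produce the same result; the second is just easier to read.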

Last edited by ta0kira; 03-09-2008 at 12:13 AM.
 
  

