shell programming

arispipis · 03-08-2008, 09:23 PM

Hello there,
I have to write a shell program that takes as input a directory containing other directories and files. I only want the htm[l]? files. I have used
ls -R $directoryName | grep -E '*.htm[l]?' > htmlFiles.txt

to do this.
Now I have to open these files and do some text processing. I have tried the following:

textProcessing(){
for file in `cat htmlFiles.txt`
do
grep -v '<*>' $file >> tempFile # get rid of all <...> tags
grep -v '&*;' tempFile >> newTempFile # get rid of all &...; tags

done
}

but it did not work. I have no idea whether my approach is wrong.
I need a way to isolate the files I want and then read them one by one.

I appreciate your time and effort.
Aris

ta0kira · 03-08-2008, 11:57 PM

In regular expressions, "*" means "the previous character repeated 0 or more times" and "." means "any single character," so ".*" means "any character repeated 0 or more times," which is what "*" on its own means in a shell glob. Your "<*>" therefore means "0 or more '<' followed by '>'": both ">" and "<<<<<<<<<<<<<<>" are valid matches, but "<br>" as a whole is not. I recommend you change "<*>" to "<[^<]+>". You also need sed instead of grep to make the substitution, because grep -v drops every line that contains a match, whereas sed can delete just the matching text.

Here is what "<[^<]+>" means:

"<": match a "<"
"[...]": match any one character from the list
"[^...]": match any one character that is not in the list
"[^<]": match any one character that isn't '<'
"+": match the preceding item 1 or more times
"[^<]+": 1 or more characters that aren't '<'
">": match a ">"
"<[^<]+>": "<" followed by 1 or more non-'<' characters followed by ">"

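To see the difference between the two patterns, here is a small sketch using grep with -o (print only the matched text) and -E (extended regular expressions, which "+" needs). The sample line and the helper name show_matches are made up for illustration, and -o is assumed to be available (GNU and BSD grep both have it).

```shell
#!/bin/sh
# Hypothetical helper: print every part of the string $2 that matches
# the extended regular expression $1, one match per line.
show_matches() {
    printf '%s\n' "$2" | grep -oE -e "$1"
}

sample='<p>Hello</p><br>'

# "<*>" is zero or more "<" followed by ">", so only the three ">"
# characters match; no whole tag is ever matched.
show_matches '<*>' "$sample"

# "<[^<]+>" matches the whole tags: <p>, </p> and <br>.
show_matches '<[^<]+>' "$sample"
```

One caveat: because "[^<]" also accepts ">", a line such as "<b>a > b</b>" can match more than the tag itself. "<[^>]+>" is a common variant that stops at the first ">".
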
Here is how to use it:

Code:

sed -r "s/<[^<]+>//g" "$file" > tempFile

Here is what the sed line means:

"-r": use extended regular expressions, which "+" requires (GNU sed)
"s/.../.../": match the first part and replace it with the second
"s/...//": delete the matching portions
"g": apply the substitution to every match on the line, not only the first
"s/<[^<]+>//g": delete all tags

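To tie this back to the original loop, here is a rough sketch rather than a definitive solution. It assumes a sed that accepts -E (the spelling of -r that current GNU and BSD sed both take), and the names strip_markup and process_tree are made up for illustration. It uses find instead of "ls -R", because "ls -R" prints bare file names without their directory, so names read back from htmlFiles.txt usually cannot be opened. The second sed expression deletes "&...;" entities in the same way as tags.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and write it to stdout with
# <...> tags and &...; entities deleted. Works line by line, so a tag
# that spans several lines is not removed.
strip_markup() {
    sed -E -e 's/<[^<]+>//g' -e 's/&[^&;]+;//g'
}

# Hypothetical helper: run strip_markup on every .htm/.html file below
# directory $1 and collect the results in file $2.
# Assumes file names do not contain newlines.
process_tree() {
    : > "$2"    # start with an empty output file
    find "$1" -type f \( -name '*.htm' -o -name '*.html' \) -print |
    while IFS= read -r file
    do
        strip_markup < "$file" >> "$2"
    done
}
```

Usage would be something like: process_tree "$directoryName" newTempFile
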
ta0kira

PS: The "/" delimiter used with sed above can be replaced with almost any other character, which helps when "/" is itself part of your pattern. Example:

Code:

find ~ | sed "s@/home/`whoami`/@-> @"
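
A fixed-input version of the same idea, so the result does not depend on who runs it. The helper name mark_prefix and the sample paths are made up for illustration, and the "^" anchor is an addition that limits the match to the start of each path.

```shell
#!/bin/sh
# Hypothetical helper: replace the leading directory $1 with "-> " in
# each path read from stdin. "@" is the delimiter, so the slashes in $1
# need no escaping. $1 is used as a regular expression and must not
# contain "@" itself.
mark_prefix() {
    sed "s@^$1/@-> @"
}

# Only the first path starts with /home/aris/, so only it is rewritten:
#   -> site/index.html
#   /tmp/other.html
printf '%s\n' /home/aris/site/index.html /tmp/other.html | mark_prefix /home/aris
```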