Delete lines from a file by their length

dayamoon · 04-27-2010, 03:41 PM

Hello, I've got a file with sorted words, one on each line.
How can I delete the lines whose words are only 1 or 2 letters long? I guess a good way would be AWK and its length() function, but even with that I don't know how to delete those particular lines.
Thanks in advance!

sycamorex · 04-27-2010, 03:46 PM

Welcome to LQ.

Is it your homework?
Can you post a sample of your file? What are the other lines like? Longer or empty?

dayamoon · 04-27-2010, 03:55 PM

OK, it's part of a bigger project. I take some files from a folder, delete every character except letters, stem the text, and delete the stopwords. Then I have to count the distinct words, but some garbage still remains, such as 1-letter words, that I want to remove.

dayamoon · 04-27-2010, 03:58 PM

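# If the argument is a directory, concatenate its .txt files into f
if [ -d "$1" ]
then
cat "$1"/*.txt > f
file=f
fi
# Strip digits and punctuation, then lowercase everything.
# ("-" and "_" are already removed by sed; inside tr, ";-_" would be a range that also deletes A-Z)
sed "s/--/ /g" < "$file" | sed "s/[-'_]//g" | sed "s/[0-9]//g" | tr -d '=:;|"<>.,?!@#*&^[](){}' | tr 'A-Z' 'a-z' > ff

./stop.txt "ff" stopwords.txt ffs.txt
gcc stemming.c -o stem; ./stem "ffs.txt"
# One word per line, drop empty lines, then count occurrences
tr -s ' ' '\n' <"ffs.txts" | grep -v '^$' | sort | uniq -c > Index/Vocabulary.txt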

It goes something like this, but in Vocabulary.txt I still want to remove the lines with 1-letter words.

sycamorex · 04-27-2010, 03:59 PM

Can you please put some effort into writing correctly? If I understand your second post correctly, you can do it with 'sed' and the answer is on this page:
http://sed.sourceforge.net/sed1line.txt

sycamorex · 04-27-2010, 04:01 PM

Wouldn't:

Code:

sed -n '/^.\{3\}/p' vocabulary.txt

do the trick?

If you want to make the changes permanent, just add the '-i' option.

dayamoon · 04-27-2010, 04:11 PM

Ahh, OK, I guess I wasn't clear. Vocabulary.txt also contains the word frequency, so for a line like "4266 a" (a occurs 4266 times in the file), sed didn't delete it. Maybe because I'm using the bash shell?

sycamorex · 04-27-2010, 04:14 PM

Bash is the standard shell on most Linux distributions, and 'sed' is a separate program that behaves the same whichever shell you start it from.

The sed command prints all the lines that are longer than 2 characters. Isn't that what you wanted to achieve?

It'd be easier if you posted a representative extract of the file.
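
To illustrate the point about line length, here is a minimal sketch. The pattern `^.\{3\}` tests the whole line, so a count and the space after it are counted as characters too. The function name `at_least_three` and the sample lines are made up for this illustration only.

```shell
# Hypothetical helper: print only the lines that have at least 3 characters.
# It is the same sed command as above, reading from stdin.
at_least_three() {
    sed -n '/^.\{3\}/p'
}

# "4266 a" and "3 ab" are kept because the whole line is 3+ characters long;
# only the bare "ab" line is dropped.
printf '4266 a\n3 ab\nab\n' | at_least_three
```

So the command does what it says for a file with one bare word per line, but it cannot tell a short word from a long one once anything else shares the line.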

MTK358 · 04-27-2010, 04:15 PM

Code:

grep ...

dayamoon · 04-27-2010, 04:19 PM

4266 a
1 aaah
3 ab
14 abandon
1 abash
4 abat
1 abdic
1 abduct
1 abhorr
4 abid
8 abil
25 abinet
1 abject
29 abl
1 ablest
1 abnorm
1 abod
1 abolit
6 abomin
1 aborigin
348 about
49 abov
4 abreast
7 abroad
6 abrupt
20 abruptli
10 absenc

After running
$ sed -n '/^.\{3\}/p' Index/Vocabulary.txt > voc.txt
I got the same:

4266 a
1 aaah
3 ab
14 abandon
1 abash
4 abat
1 abdic
1 abduct
1 abhorr
4 abid
8 abil
25 abinet
1 abject
29 abl
1 ablest
1 abnorm
1 abod
1 abolit
6 abomin
1 aborigin
348 about
49 abov
4 abreast
7 abroad
6 abrupt
20 abruptli
10 absenc

sycamorex · 04-27-2010, 04:20 PM

ok, I get what you mean.

MTK358 · 04-27-2010, 04:25 PM

Why did you never say that there was a number before each word?!?

Code:

sed -rn 's:[0-9]* .{3,}:&:p'

dayamoon · 04-27-2010, 04:33 PM

Maybe because at the beginning I said that I wanted to delete a line by the length of a word.
Still no success, but can I get an explanation of the s: &: p? What exactly do they do?

Thank you everyone though...
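
On the question about `s: &: p`: in sed, `s:PATTERN:REPLACEMENT:` is the substitute command with `:` as the delimiter, `&` in the replacement stands for the text that matched (so the line is replaced by itself), and the trailing `p` prints the line when a substitution happened. Together with `-n`, only matching lines are printed. A likely reason it still fails: the pattern is not anchored, and `uniq -c` in GNU coreutils pads the counts with leading spaces, so a leading space followed by any three characters already matches. The sketch below is one possible anchored variant, assuming lines of the form "spaces, count, spaces, word"; the function name `sed_keep_long_words` is made up for this example, and `-E` is used for extended regular expressions in place of `-r`.

```shell
# Hypothetical sed-only filter, assuming each line looks like
# "<optional spaces><count><spaces><word>" as produced by `uniq -c`.
# -n: do not print by default; -E: extended regular expressions;
# /.../p: print only the lines that match the anchored pattern.
sed_keep_long_words() {
    sed -nE '/^ *[0-9]+ +[^ ]{3,}$/p'
}

# Of these three lines, only the "aaah" line has a word of 3+ letters.
printf '   4266 a\n      1 aaah\n      3 ab\n' | sed_keep_long_words
```

Anchoring with `^` and `$` forces the `{3,}` count to apply to the word alone instead of to whatever follows the first space.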

sycamorex · 04-27-2010, 04:36 PM

In that case, try awk:

Code:

awk 'length($2) > 2' vocabulary.txt
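
A quick way to check this on a few of the sample lines, as a sketch: awk splits each line on runs of whitespace, so any leading padding in front of the count does not matter, and `$2` is the word. The function name `keep_long_words` and the inline sample are only for illustration.

```shell
# Hypothetical helper: keep only the lines whose second field (the word)
# is longer than 2 characters. Reads stdin, writes matching lines unchanged.
keep_long_words() {
    awk 'length($2) > 2'
}

# Sample input shaped like `uniq -c` output (count, then word);
# the "a" and "ab" lines are dropped.
keep_long_words <<'EOF'
   4266 a
      1 aaah
      3 ab
     14 abandon
EOF
```

To keep the result, redirect it to a new file (for example `awk 'length($2) > 2' Index/Vocabulary.txt > voc.txt`) and not onto the input file itself, since the shell truncates the target before awk reads it.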

dayamoon · 04-27-2010, 04:40 PM

Aww! It worked! Thank you so much! The second column's length... Thank you!
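
For the script posted earlier in the thread, the same length test could also be applied before counting, so short words never reach Vocabulary.txt. This is only a sketch of that idea: the function name `count_long_words` is made up, and at that stage each line holds a bare word, so the test is on `$0` (the whole line) and not `$2`.

```shell
# Hypothetical helper: read space-separated words on stdin and print
# "count word" lines for the words longer than 2 characters.
count_long_words() {
    tr -s ' ' '\n' |          # one word per line
    awk 'length($0) > 2' |    # drop empty lines and 1-2 letter words
    sort | uniq -c            # count each remaining word
}

printf 'a ab about a about abl\n' | count_long_words
```

Because the awk filter also discards empty lines, it would stand in for the `grep -v '^$'` step in that pipeline.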