shell script: find subwords with 'compilation' word

suresheva31 · 09-29-2005, 01:00 PM

Hello,

Can someone please tell me how to write a bash or shell script for the following questions:

Given a word such as compilation, the subword problem can be described as follows: list all of the words that can be created using only the letters in the given word. For example, the subwords of compilation include clap and mop. They also include lotion, because there are two o's available in the word. In general, if a word has more than two i's, more than two o's, or more than one of any other letter of compilation, it cannot be a subword.

I have a dictionary with 1000 words in it, and I need to find all the subwords of compilation.

The only thing I know is that I need to use the grep command with a regular expression, but I am not very familiar with regular expressions.

Any help would be much appreciated.

Thanks

Suresh

Hko · 09-29-2005, 01:07 PM

Homework?
What code did you write so far, and what's the problem with it?

suresheva31 · 09-29-2005, 03:20 PM

Quote:

Originally posted by Hko
Homework?
What code did you write so far, and what's the problem with it?

Nope, this is for work. I need to find something for my work; that is why.

addy86 · 09-29-2005, 03:54 PM

What you're looking for is an anagram generator. Search the net for it.

bigearsbilly · 09-30-2005, 02:38 AM

On Linux (not POSIX) there is a printf function which does random
strings like you require. I can't remember what it's called, but strangely I ran across it yesterday
while looking up vsnprintf.

It's part of the printf family; it's in a man page.

bigearsbilly · 09-30-2005, 09:12 AM

I got intrigued so knocked this recursive solution up.
It doesn't filter duplicates.

Could it be more elegant? any offers?

Here's an anagram creator. Unfort. it's not quite what you want as
you need sub-words too.
should have read the spec. properly. phew!

Code:

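#include <stdio.h>
#include <string.h>

#define BUFSZ 1023

/*
** (c) George W. Bush
*/

/*
** Print every ordering of the letters still unused in `remains`,
** each prefixed by `so_far`. Used letters are marked with '_',
** so the input itself must not contain '_'.
*/
void anagramerizerate(const char *so_far, const char *remains)
{
    char right_buf[BUFSZ + 1];
    char left_buf[BUFSZ + 1];
    char blob[2] = { '\0', '\0' };
    char *p;
    int done = 0;

    /* strncpy leaves a truncated copy unterminated, so terminate by hand */
    strncpy(left_buf, so_far, BUFSZ);
    left_buf[BUFSZ] = '\0';
    strncpy(right_buf, remains, BUFSZ);
    right_buf[BUFSZ] = '\0';

    for (p = right_buf; *p; p++) {

        if (*p == '_') {
            continue;   /* letter already used */
        }

        blob[0] = *p;
        strncat(left_buf, blob, BUFSZ - strlen(left_buf));
        *p = '_';
        anagramerizerate(left_buf, right_buf);
        done++;

        /* undo this choice before trying the next letter */
        strncpy(left_buf, so_far, BUFSZ);
        left_buf[BUFSZ] = '\0';
        strncpy(right_buf, remains, BUFSZ);
        right_buf[BUFSZ] = '\0';
    }
    if (!done) puts(so_far);    /* no letters left: a complete anagram */
}

void anagram(const char *string)
{
    anagramerizerate("", string);
}

int main(int argc, char **argv)
{
    int i;

    /* treat each command-line argument as a separate word */
    for (i = 1; i < argc; i++) {
        anagram(argv[i]);
    }

    return 0;
}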

Code:

$ anagram how do
how
hwo
ohw
owh
who
woh
do
od

suresheva31 · 10-01-2005, 10:40 AM

Quote:

Originally posted by bigearsbilly
I got intrigued so knocked this recursive solution up.
It doesn't filter duplicates.

Could it be more elegant? any offers?

Here's an anagram creator. Unfort. it's not quite what you want as
you need sub-words too.
should have read the spec. properly. phew!


Is there any other way we could do this, using the grep command and regular expressions?

Suresh

bigearsbilly · 10-03-2005, 03:41 AM

If I could think of an easy way
I wouldn't have done it in C

It's not really easy to do character level
stuff in any scripting languages, or grep/sed -like tools.
They are more designed for line-oriented operations and strings.
Also, I think, anything you may do would be horribly messy and contrived and
could be incredibly slow.

But I'm willing to be shown otherwise!
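
For this particular word a grep-only filter does look possible, because the rules reduce to three pattern checks: keep words made only of letters that occur in compilation, drop words that repeat a letter that occurs once, and drop words with a third o or i. A rough sketch, assuming one lowercase word per line on standard input; the function name subwords_grep is made up here, and the letter lists are hard-coded for compilation rather than derived from it:

```shell
# Hypothetical helper: filter standard input down to subwords of "compilation".
# Stage 1 keeps lines built only from the letters a c i l m n o p t.
# Stage 2 drops words that repeat a letter occurring once in the target.
# Stage 3 drops words with a third i or o (both occur twice in the target).
subwords_grep() {
    grep -x '[acilmnopt][acilmnopt]*' |
    grep -v -e 'a.*a' -e 'c.*c' -e 'l.*l' -e 'm.*m' -e 'n.*n' -e 'p.*p' -e 't.*t' |
    grep -v -e 'i.*i.*i' -e 'o.*o.*o'
}

# Small demonstration with an inline word list.
printf '%s\n' clap mop lotion total compile | subwords_grep
# prints clap, mop and lotion, one per line
```

Against a real dictionary this would be run as subwords_grep < words.txt. A different target word needs different letter lists, so this is a special case rather than a general tool.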

sajith · 10-03-2005, 08:33 AM

Hi Suresh,

Please go through the Linux man page to learn more about the grep command.

There are also the egrep and fgrep commands.

To learn about more Linux commands, please go through the following link:

http://linuxreviews.org/man/

suresheva31 · 10-03-2005, 09:46 PM

Code:

typeset words=0
while read str	
do
	typeset i=0
	while [ $i -lt 9 ]
	do
		count[$i]=0
		let i=i+1
	done
	typeset flag=1
	i=0
	let j=i+1
	while [ $i -lt 64 ] && [ -n ${str:$i:$j} ] && [ ${str:$i:$j}!='\n']
	do
		if [ ${str:$i:$j}!=c ] && [ ${str:$i:$j}!=o ] && [ ${str:$i:$j}!=m ] && [ ${str:$i:$j}!=p ] && [ ${str:$i:$j}!=i ] && [ ${str:$i:$j}!=l ] && [ ${str:$i:$j}!=a ] && [ ${str:$i:$j}!=t ] && [ ${str:$i:$j}!=n ]
		then
			flag=0
			break
		fi
		let i=i+1
		let j=i+1
	done

	if [ $flag==0 ]
	then
		continue
	fi

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==c ] 
		then let count[0]=count[0]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==o ] 
		then let count[1]=count[1]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==m ] 
		then let count[2]=count[2]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==p ] 
		then let count[3]=count[3]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==i ] 
		then let count[4]=count[4]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==l ] 
		then let count[5]=count[5]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==a ] 
		then let count[6]=count[6]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==t ] 
		then let count[7]=count[7]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
	do
		if [ ${str:$i:$j}==n ] 
		then let count[8]=count[8]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	j=1
	while [ $i -lt 9 ]
	do
		if [ $i!=1 ] && [ $i!=4 ] && [ ${count:$i:$j} -gt 1 ]
		then flag=0;
		fi
		let i=i+1
		let j=i+1
	done
	
	if [ ${count:1:2} -gt 2 ] && [ ${count:4:5} -gt 2 ]
	then flag=0
	fi

	if [ ${#str} -gt 2 ]
	then flag=0
	fi

	if [ $flag==1 ]
	then
		echo $str
	fi

done < words.txt

Here is the code I wrote, but I am getting the following error:

[: ./test.sh 152: unbalanced []

where 152 is the line of the code "done < words.txt"

Can someone please tell me where I am making a mistake?

thanks

suresh

suresheva31 · 10-03-2005, 10:27 PM

The code runs now with no errors (it was a syntax error, a missing space), but no output is being produced.

Code:

typeset words=0
while read str
do
	typeset i=0
	while [ $i -lt 9 ]
	do
		count[$i]=0
		let i=i+1
	done
	typeset flag=1
	i=0
	let j=i+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}!="c" ] && [ ${str:$i:$j}!="o" ] && [ ${str:$i:$j}!="m" ] && [ ${str:$i:$j}!="p" ] && [ ${str:$i:$j}!="i" ] && [ ${str:$i:$j}!="l" ] && [ ${str:$i:$j}!="a" ] && [ ${str:$i:$j}!="t" ] && [ ${str:$i:$j}!="n" ]
		then
			flag=0
			break	
		fi
		let i=i+1
		let j=i+1
	done

	if [ $flag==0 ]
	then
		continue
	fi

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==c ] 
		then let count[0]=count[0]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==o ] 
		then let count[1]=count[1]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==m ] 
		then let count[2]=count[2]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==p ] 
		then let count[3]=count[3]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==i ] 
		then let count[4]=count[4]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==l ] 
		then let count[5]=count[5]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==a ] 
		then let count[6]=count[6]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==t ] 
		then let count[7]=count[7]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	let j=j+1
	while [ $i -lt 64 ] && [ -n ${str:$i:$j} ] && [ ${str:$i:$j}!="\n" ]
	do
		if [ ${str:$i:$j}==n ] 
		then let count[8]=count[8]+1
		fi
		let i=i+1
		let j=i+1
	done

	i=0
	j=1
	while [ $i -lt 9 ]
	do
		if [ $i!=1 ] && [ $i!=4 ] #&& [ ${count:$i:$j} -gt 1 ]
		then flag=0;
		fi
		let i=i+1
		let j=i+1
	done
	
	if [ $count[1] > 2 ] && [ $count[4] > 2 ]
	then flag=0
	fi

	if [ ${#str} > 2 ]
	then flag=0
	fi

	if [ $flag==1 ]
	then echo $str
	fi

done < words.txt

Could someone please give me tips on why this is not working?
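
A likely reason for the silence is the [ $flag==0 ] test. Without spaces around the operator, [ receives the single word 1==0, and a one-argument test only asks whether that string is non-empty. The condition is therefore always true and continue skips every word. The same applies to the != tests on the characters. A small demonstration (the variable names r1 and r2 are just for illustration):

```shell
flag=1

# No spaces around ==: the shell hands [ the single word "1==0",
# and a one-argument test is true for any non-empty string.
if [ $flag==0 ]; then r1=true; else r1=false; fi

# With spaces (and quotes), [ sees three arguments and really compares them.
if [ "$flag" = 0 ]; then r2=true; else r2=false; fi

echo "unspaced: $r1, spaced: $r2"
# prints: unspaced: true, spaced: false
```

Two other things in the script are worth checking. Inside [ ], a bare > is a redirection, so [ ${#str} > 2 ] creates a file named 2 instead of comparing numbers; -gt is the numeric comparison. And ${str:$i:$j} means "start at offset i, take j characters", so picking out a single character needs a length of 1.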

bigearsbilly · 10-05-2005, 04:17 AM

Suresh, you will not be able to do this using shell tools.
They are not up to the job.

Using my C example I did the anagrams of 'compilation' and it took
11 minutes, and that's just using the whole word not even subwords.
This produces 40 million words; ten million when duplicates are removed.
So including subwords, it is going to be 50 million at least, maybe.
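
The first two figures match the arithmetic: compilation has 11 letters, so there are 11! orderings, and because o and i each appear twice, every distinct string is produced 2! * 2! = 4 times. A quick check in shell arithmetic (the fact helper is only for illustration):

```shell
# fact N: print N! using shell integer arithmetic (hypothetical helper).
fact() {
    local n=$1 result=1
    while [ "$n" -gt 1 ]; do
        result=$((result * n))
        n=$((n - 1))
    done
    echo "$result"
}

total=$(fact 11)                 # every ordering of the 11 letters
distinct=$((total / (2 * 2)))    # o and i each occur twice
echo "orderings: $total, distinct: $distinct"
# prints: orderings: 39916800, distinct: 9979200
```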

You will need to use a much more intelligent approach than brute-force.

If 'compilation' produces 50 million combinations then obviously this is
too large a list.
You will need to apply rules to remove impossible combinations, like
say, three letters in a row.

If you prune it down to maybe 1 million, it is still larger than your dictionary!

As the dictionary is a very small subset of possible 'words'
maybe turn it on its head, and use the dictionary entries to search the word.
See if each word in the dictionary can be made from the target word.

It's not a trivial problem.
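
The last idea, testing each dictionary entry against the target word instead of generating anagrams, can be sketched in bash roughly as follows. This is a minimal sketch, not a tuned implementation: the function names is_subword and list_subwords are made up for illustration, the word list is assumed to hold one lowercase word per line, and the substring expansions need bash rather than plain sh.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative, not from the thread.

# is_subword WORD TARGET
# Succeeds if WORD can be spelled from the letters of TARGET, using each
# letter no more often than it occurs in TARGET.
is_subword() {
    local word=$1 pool=$2 ch
    [ -n "$word" ] || return 1          # an empty line is not a word
    while [ -n "$word" ]; do
        ch=${word:0:1}                  # next letter of the candidate
        case $pool in
            *"$ch"*) pool=${pool/"$ch"/} ;;   # cross one copy off the pool
            *)       return 1 ;;              # letter missing or used up
        esac
        word=${word:1}
    done
    return 0
}

# list_subwords TARGET
# Reads candidate words from standard input and prints those that qualify.
list_subwords() {
    local target=$1 w
    while IFS= read -r w; do
        if is_subword "$w" "$target"; then
            printf '%s\n' "$w"
        fi
    done
}

# Small demonstration with an inline word list.
printf '%s\n' clap mop lotion total nation compile | list_subwords compilation
# prints clap, mop and lotion, one per line
```

With a real dictionary this would be run as list_subwords compilation < words.txt. Each candidate costs at most one pass over its own letters, so the work grows with the size of the dictionary rather than with the number of possible letter arrangements.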