shell script: find subwords with 'compilation' word

ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Welcome to LinuxQuestions.org, a friendly and active Linux Community.

You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!

Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.

If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.

Having a problem logging in? Please visit this page to clear all LQ-related cookies.

Introduction to Linux - A Hands on Guide

This guide was created as an overview of the Linux Operating System, geared toward new users as an exploration tour and getting started guide, with exercises at the end of each chapter.
For more advanced trainees it can be a desktop reference, and a collection of the base knowledge needed to proceed with system and network administration. This book contains many real life examples derived from the author's experience as a Linux system and network administrator, trainer and consultant. They hope these examples will help you to get a better understanding of the Linux system and that you feel encouraged to try out things on your own.

Click Here to receive this Complete Guide absolutely free.

shell script: find subwords with 'compilation' word

Hello,

Can someone please tell me how to write a bash or shell script for the following questions:

Given a word such as compilation, the subword problem can be described as follows: list all of the words which can be created, using only letters in the given word. For example subwords of compilation word include clap, and mop. It would also contain the subword lotion, because there were two o's available to use in the word. In general, if a word has more the two i's or more than two o's or more than one repetition of any other letter of compilation it cannot be a subword.

I have a dictionery with 1000 of words in it, and I need to find all the subword in the compilation.

Only thing that I know is the I need to use grep command with regular expression in it. But I am not quite familiar with regular expression.

on linux (not POSIX) there is a printf function which does random
strings like you require, can't remember what, but strangely I ran across it yesyerday
while looking up vnsprintf.

it's part of the printf family, it's in a man page

If I could think of an easy way
I wouldn't have done it in C

It's not really easy to do character level
stuff in any scripting languages, or grep/sed -like tools.
They are more designed for line-oriented operations and strings.
Also, I think, anything you may do would be horribly messy and contrived and
could be incredibly slow.

typeset words=0
while read str
do
typeset i=0
while [ $i -lt 9 ]
do
count[$i]=0
let i=i+1
done
typeset flag=1
i=0
let j=i+1
while [ $i -lt 64 ] && [ -n ${str:$i:$j} ] && [ ${str:$i:$j}!='\n']
do
if [ ${str:$i:$j}!=c ] && [ ${str:$i:$j}!=o ] && [ ${str:$i:$j}!=m ] && [ ${str:$i:$j}!=p ] && [ ${str:$i:$j}!=i ] && [ ${str:$i:$j}!=l ] && [ ${str:$i:$j}!=a ] && [ ${str:$i:$j}!=t ] && [ ${str:$i:$j}!=n ]
then
flag=0
break
fi
let i=i+1
let j=i+1
done
if [ $flag==0 ]
then
continue
fi
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==c ]
then let count[0]=count[0]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==o ]
then let count[1]=count[1]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==m ]
then let count[2]=count[2]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==p ]
then let count[3]=count[3]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==i ]
then let count[4]=count[4]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==l ]
then let count[5]=count[5]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==a ]
then let count[6]=count[6]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==t ]
then let count[7]=count[7]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!='\0' ] && [ ${str:$i:$j}!='\n' ]
do
if [ ${str:$i:$j}==n ]
then let count[8]=count[8]+1
fi
let i=i+1
let j=i+1
done
i=0
j=1
while [ $i -lt 9 ]
do
if [ $i!=1 ] && [ $i!=4 ] && [ ${count:$i:$j} -gt 1 ]
then flag=0;
fi
let i=i+1
let j=i+1
done
if [ ${count:1:2} -gt 2 ] && [ ${count:4:5} -gt 2 ]
then flag=0
fi
if [ ${#str} -gt 2 ]
then flag=0
fi
if [ $flag==1 ]
then
echo $str
fi
done < words.txt

Here is the code I wrote, but I am gettingt the following error.

[: ./test.sh 152: unbalanced []

where 152 is the line of the code "done < words.txt"

Can someone please tell me where I am making a mistake?

The code works now with no errors(it was a syntax that missed a space), but there is no output been produced.

Code:

typeset words=0
while read str
do
typeset i=0
while [ $i -lt 9 ]
do
count[$i]=0
let i=i+1
done
typeset flag=1
i=0
let j=i+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}!="c" ] && [ ${str:$i:$j}!="o" ] && [ ${str:$i:$j}!="m" ] && [ ${str:$i:$j}!="p" ] && [ ${str:$i:$j}!="i" ] && [ ${str:$i:$j}!="l" ] && [ ${str:$i:$j}!="a" ] && [ ${str:$i:$j}!="t" ] && [ ${str:$i:$j}!="n" ]
then
flag=0
break
fi
let i=i+1
let j=i+1
done
if [ $flag==0 ]
then
continue
fi
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==c ]
then let count[0]=count[0]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==o ]
then let count[1]=count[1]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==m ]
then let count[2]=count[2]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==p ]
then let count[3]=count[3]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==i ]
then let count[4]=count[4]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==l ]
then let count[5]=count[5]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==a ]
then let count[6]=count[6]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ ${str:$i:$j}!="\0" ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==t ]
then let count[7]=count[7]+1
fi
let i=i+1
let j=i+1
done
i=0
let j=j+1
while [ $i -lt 64 ] && [ -n ${str:$i:$j} ] && [ ${str:$i:$j}!="\n" ]
do
if [ ${str:$i:$j}==n ]
then let count[8]=count[8]+1
fi
let i=i+1
let j=i+1
done
i=0
j=1
while [ $i -lt 9 ]
do
if [ $i!=1 ] && [ $i!=4 ] #&& [ ${count:$i:$j} -gt 1 ]
then flag=0;
fi
let i=i+1
let j=i+1
done
if [ $count[1] > 2 ] && [ $count[4] > 2 ]
then flag=0
fi
if [ ${#str} > 2 ]
then flag=0
fi
if [ $flag==1 ]
then echo $str
fi
done < words.txt

Could someone please give me tips on why this is not working?

Suresh, you will not be able to do this using shell tools.
They are not up to the job.

Using my C example I did the anagrams of 'compilation' and it took
11 minutes, and that's just using the whole word not even subwords.
This produces 40 million words; ten million when duplicates are removed.
So including subwords is going to a be 50 million at least maybe.

You will need to use a much more intelligent approach than brute-force.

If 'compilation' produces 50 million combinations then obviously this is
too large a list.
You will need to apply rules to remove impossible combinations, like
say, three letters in a row.

If you prune it down to maybe 1 million It is still larger than your dictionary!

As the dictionary is a very small subset of possible 'words'
maybe turn it on it's head, and use the dictionary entries to search the word.
See if each word in the dictionary can be made from the target word.

LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.