Command and Conquer

Nimoy · 11-03-2003, 02:03 PM

Okay here is my predicament!

I have two ASCII files containing words.

The files are structured like this

værktøj p418.spr
væv p446.spr
walkman p444.spr
wc papir p141.spr
wc p140.spr

One has the words in a foreign language and one has them in English.

The two files have one thing in common and that is the pxxx.spr placed after each word/words.

The phrase pxxx.spr (where xxx is any number) points to a specific picture, and the same word points to the same picture in whichever language.

The difference between the two is that the English file contains a lot of words that are irrelevant.

What I need to do is therefore to find a way to replace each of the foreign words with the corresponding English word in a clean file. All of course based upon the foreign file, which is the most current file.

I think the logical way to go about it would be the following way.

Read the Foreign text file, identify the first pxxx.spr

Copy that particular pxxx.spr to a new file

Read the English text file, find pxxx.spr

Copy the English word in front of pxxx.spr to the new file, inserting it before the pxxx.spr so that the line looks something like this:

ENGLISHWORD1 pxxx.spr

Next I would go back to reading the foreign text file, identify the next pxxx.spr...

Then follow the same procedure as above, put ENGLISHWORD2 and pxxx.spr into the new file below the first one - then find the third pxxx.spr etc. all the way until the end of the foreign file, producing a file that would look something like this:

ENGLISHWORD1 pxxx.spr
ENGLISHWORD2 pxxx.spr
ENGLISHWORD3 pxxx.spr
ENGLISHWORD4 pxxx.spr
ENGLISHWORD5 pxxx.spr
ENGLISHWORD6 pxxx.spr
ENGLISHWORD7 pxxx.spr

Thus I would have produced a new file containing the English words without all the extra words contained in the old English text file and a file that corresponded in content to the foreign text file.

I am 100% certain that this can be done via the command line using commands such as grep, cat, etc., as these are simple text files we are talking about. If anyone can help me out by solving this one, he/she will have done me/us a huge favor.

You can copy the above foreign text example and save it as the foreign file, and then use the below example as the English file. (clutter included)

tools p418.spr
baboon p999.spr
loom p446.spr
banana p566.spr
walkman p444.spr
toilet paper p141.spr
toilet p140.spr

I am only posting this due to severe time constraints on my part and wouldn't ask for help unless I needed it at this moment in time!

This is not homework - This is something I need to take care of in support of the charity http://globability.org to which I belong - and which I am currently directing toward OpenSource development.

Please refrain from asking me to go figure - RTFM or any of the sort... well you know what I mean

If you are willing to take the challenge do post a reply - If not please ignore this post entirely!!!

scorbett · 11-04-2003, 03:35 PM

Try this script:

Code:

#!/bin/bash
# parser.sh

if [ -z "$1" ]
then
  echo "USAGE: $0 <englishfile>" >&2
  exit 1
fi

while read -r crap
do
  # get picture from next line of input (assuming a single space
  # separator and exactly one word in front of the picture name)
  pict=$(echo "$crap" | cut -d ' ' -f 2)

  # skip blank lines, which would otherwise match every English line
  [ -z "$pict" ] && continue

  # find english word for this picture (note no error handling)
  englishWord=$(grep -F -- "$pict" "$1" | cut -d ' ' -f 1)

  # output the english word and the picture to stdout
  echo "$englishWord $pict"
done

This script reads the "foreign" file from standard input. For every picture it finds in the input it looks up the corresponding English word, then writes the English word and the picture to standard output. You call it like this:

cat <foreignFile> | parser.sh <englishFile>

I don't do any error checking in here - this is just a quick and dirty script. This also probably isn't the fastest way to do this - if these files are very large you should probably write a C program to do this instead.
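For larger files, a single pass can avoid running grep once per input line. The following is only a sketch, not part of the script above: translate is a hypothetical function name, and it assumes that the picture name is always the last space-separated field on a line and that each picture appears at most once in the English file. Under those assumptions it also copes with entries of more than one word, such as "toilet paper".

```shell
#!/bin/bash
# Sketch of a one-pass lookup. "translate" is a hypothetical name.
# Assumption: the picture name is the LAST field on every line.

translate() {
  # $1 = English file, $2 = foreign file
  awk '
    NF == 0 { next }              # ignore blank lines in either file

    # First file (English): remember the word(s) for each picture.
    NR == FNR {
      pict = $NF
      $NF = ""                    # drop the picture field
      sub(/ +$/, "")              # trim the separator left behind
      english[pict] = $0
      next
    }

    # Second file (foreign): print the English entry if one is known.
    $NF in english { print english[$NF], $NF }
  ' "$1" "$2"
}

# Sample data taken from the first post.
work=$(mktemp -d)
printf '%s\n' 'værktøj p418.spr' 'væv p446.spr' 'walkman p444.spr' \
  'wc papir p141.spr' 'wc p140.spr' > "$work/testdk.txt"
printf '%s\n' 'tools p418.spr' 'baboon p999.spr' 'loom p446.spr' \
  'banana p566.spr' 'walkman p444.spr' 'toilet paper p141.spr' \
  'toilet p140.spr' > "$work/entest.txt"

translate "$work/entest.txt" "$work/testdk.txt"
```

With the sample files this should print the five English entries in the order of the foreign file, leaving out baboon and banana. Foreign lines whose picture is not in the English file are silently skipped, which may or may not be what you want.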

scorbett · 11-04-2003, 03:36 PM

By the way, I forgot to mention that my script will dump its output to stdout, so if you want to direct this to a new file you should do something like this:

cat foreignFile | parser.sh englishFile > newOutputFile

Hope that helps.

Nimoy · 11-04-2003, 05:14 PM

First of all, thanks for taking the challenge.

I'm fine with the quick and dirty approach as we are not talking huge files etc.... plus I'm on my 2.8GHz laptop so I should have a little computing power at my disposal.

I saved everything into my Area51/groundzero directory, i.e. parser.sh, entest.txt (the English file) and testdk.txt (the Danish file). (Yes, I only took the CLI commands and omitted the "text".)

The script looks like this:
code:
#!/bin/bash
# parser.sh

if [ "$1" = "" ]
then
  echo "USAGE: $0 <englishfile>"
  exit
fi

while read crap
do
  # get picture from next line of input (assuming space separator)
  pict=`echo "$crap" | cut -d ' ' -f 2`

  # find english word for this picture (note no error handling)
  englishWord=`grep $pict $1 | cut -d ' ' -f 1`

  # output the english word and the picture to stdout
  echo "$englishWord $pict"
done

I tried to run it but failed - then I checked and found I had not set any permissions on the .sh, as one should according to a script tutorial I have followed.

So I typed chmod 755 parser.sh

and then I typed the following

cat testdk.txt | parser.sh entest.txt > newOutputFile.txt

To my dismay the result was the following

bash: parser.sh: command not found

Do I need to toss parser.sh in a particular place, or is the problem something else? - I have tried running with the phrase code: both included and excluded from the script.

Any ideas?

Nimoy · 11-04-2003, 05:19 PM

Too tired - I've just spotted a mistake of mine - I need to replace the text pict with the correct word spr right?

However the script ought to run nevertheless just not finding anything if I am correct!

scorbett · 11-04-2003, 05:19 PM

Hmm... if your PATH doesn't include your current directory, you should type this instead:

cat testdk.txt | ./parser.sh entest.txt > newOutputFile.txt

The "./" in front of parser.sh tells bash to look in the current directory for it. Alternatively, you can just copy parser.sh to something like /usr/local/bin or some other directory on your PATH so that bash knows where to look for it.
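A small, self-contained way to see the difference, assuming a throwaway script called hello.sh (a hypothetical name, created only for this demo) in a temporary directory:

```shell
#!/bin/bash
# hello.sh is a hypothetical stand-in for parser.sh.

demo_dir=$(mktemp -d)
printf '#!/bin/bash\necho "hello from the current directory"\n' > "$demo_dir/hello.sh"
chmod 755 "$demo_dir/hello.sh"

(
  cd "$demo_dir" || exit 1

  # Bare name: bash only searches the directories listed in $PATH.
  if command -v hello.sh >/dev/null 2>&1; then
    echo "hello.sh was found via PATH"
  else
    echo "hello.sh was not found via PATH"
  fi

  # Explicit path: no PATH search is involved, so this runs regardless.
  ./hello.sh
)
```

On a system where neither "." nor an empty entry is on PATH, the first check should report that hello.sh was not found, while the explicit ./hello.sh runs anyway.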

Nimoy · 11-05-2003, 03:50 PM

Yup, like I said last night, way too tired - I tried it a moment ago and the script seems to work fairly well - there is some odd behavior, but some of it might be my fault...

- I'll be checking the original source files for my own errors, I have found a few

But one error that I am not responsible for, and one that puzzles me, is that in the middle of the output some lines are missing the pxxx.spr even though the right word is included - but wait, there is more!

The words do not exist in the Danish source file but seem to have been thrown in from the UK source file by the sorting mechanism itself.

We are talking about 30 words or so... very odd indeed! But not a devastating problem

Your script will probably make it possible for me to put roundabout 7 more languages through the grinder, as the script should not care which language is paired with the UK source text. So this has really saved me a lot of work - even if I have to do some checking afterward.

Thanks a bunch - You just made me a very happy customer!
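
One possible explanation for the stray lines, which the thread does not confirm: the script takes the second space-separated field as the picture name, so an entry with more than one word (like "wc papir p141.spr" in the sample) yields "papir" instead of "p141.spr", and grep then returns whatever English lines happen to contain that text. A rough check, with find_suspect_lines as a hypothetical helper name, is to list every line that does not consist of exactly two fields:

```shell
#!/bin/bash
# "find_suspect_lines" is a hypothetical helper, not part of parser.sh.
# It prints the line number and content of every line that does not
# split into exactly two fields on single spaces, i.e. every line where
# the second field is not guaranteed to be the picture name.

find_suspect_lines() {
  # $1 = word list to check
  awk -F'[ ]' 'NF != 2 { print FNR ": " $0 }' "$1"
}

# Danish sample from the first post.
check_dir=$(mktemp -d)
printf '%s\n' 'værktøj p418.spr' 'væv p446.spr' 'walkman p444.spr' \
  'wc papir p141.spr' 'wc p140.spr' > "$check_dir/testdk.txt"

find_suspect_lines "$check_dir/testdk.txt"
```

For the sample data this should flag only line 4, the "wc papir" entry. Running the same check against the English file may be worthwhile too, since the script also keeps only the first word of each English match.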