LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Identify words having repeated character strings (http://www.linuxquestions.org/questions/programming-9/identify-words-having-repeated-character-strings-930345/)

danielbmartin 02-20-2012 11:02 AM

Identify words having repeated character strings
 
Have: a file containing English-language words, one word per line, all lower case.

Want: a file containing only those words which have repeated substrings of length 4 or more. Examples:
abracadabra
handstands
hodgepodge
lightweight
nationalization

The desired output file was generated with this grep:
Code:

cat < $WrdLst            \
|grep -e '\(....\).*\1'  \
> $Work20

The output file did not contain alfalfa. That's understandable; the grep was looking for any 4-letter sequence followed by zero or more characters, followed by that same 4-letter sequence.

The 4 letters alfa appear twice in alfalfa but the substrings overlap.

How may my code be improved (or replaced) to find alfalfa?

Daniel B. Martin

firstfire 02-20-2012 11:56 AM

Hmm.. The problem is a bit fuzzy. The 'alfalfa' may be matched by '(.{3})\1' but that's probably not what you want. On the other hand the two 4-character substrings in this word have one common character. Do you want to find all such strings (with one or no common charaters and 4 characters length)?

grail 02-20-2012 12:06 PM

Isn't this the same concept as the consecutive alphabet solution?
Code:

awk '{for( i = 1; i <= length - 4; i++)if(substr($0,i+1) ~ substr($0,i,4)){print;next}}' $WrdLst
And I know you have been told many times but like to ignore it, the cat you use is still not required.

danielbmartin 02-20-2012 12:09 PM

Quote:

Originally Posted by firstfire (Post 4607432)
Do you want to find all such strings (with one or no common characters and 4 characters length)?

I want to find all words which have two identical 4-letter substrings, regardless of overlap. To give contrived examples, the string qqqqq has two identical 4-letter substrings, one starting at position 1 and another starting in position 2. The string ababab has two identical 4-letter substrings, one starting at position 1 and another starting in position 3.

Daniel B. Martin

firstfire 02-20-2012 12:12 PM

How about
Code:

$ egrep '(.+)(.{2})\1\2\1' /usr/share/dict/words
Mississippi
Mississippi's
Mississippian
Mississippians
alfalfa
alfalfa's
assesses
dispossesses
entente
entente's
ententes
possesses
prepossesses
reassesses
repossesses


grail 02-20-2012 12:30 PM

@firstfire - this solution does not seem to capture any of the original words in example.

danielbmartin 02-20-2012 01:18 PM

Quote:

Originally Posted by grail (Post 4607439)
And I know you have been told many times but like to ignore it, the cat you use is still not required.

I have not ignored your advice. To my eyes, the cat improves code readability. Since it performs no logic I assume the cost of the cat is negligible, particularly in code where it is the first of a long string of piped commands. I have not tested this assumption and may be wrong.

Daniel B. Martin

firstfire 02-20-2012 01:36 PM

Quote:

Originally Posted by grail (Post 4607459)
@firstfire - this solution does not seem to capture any of the original words in example.

I know, but it does captures most of normal words.

Here is another solution, which captures provided examples:
Code:

$ echo ababab | sed -r 's/.*/&:&/; s/((.{4}).*):.*\2/[\1]:\2/'
[ababab]:abab

If matched, the word surrounded by brackets; substring shown after a colon. BUT it also matches any other word (4 or more chars), where substring is a whole word. And it is not a bug, such words indeed satisfy given conditions ("regardless of overlap").

danielbmartin 02-20-2012 01:44 PM

Quote:

Originally Posted by grail (Post 4607439)
Code:

awk '{for( i = 1; i <= length - 4; i++)if(substr($0,i+1) ~ substr($0,i,4)){print;next}}' $WrdLst

I ran this pipe ...
Code:

# Find words which contain repeated 4-letter sequences
# such as abracadabra, hodgepodge, lightweight, whippersnapper.
# This version picks up words with overlapping repeated sequences
# such as alfalfa, entente, possesses, ratatat.
Method of LQ Senior Member grail.
echo; echo "Writing Work26"
< $WrdLst \
awk '{for( i = 1; i <= length - 4; i++)
      if(substr($0,i+1) ~ substr($0,i,4))
      {print;next}}' \
> $Work26

... and it found words with repeated fields which overlapped and also words with repeated fields with no overlap. So far, so good. This code also found words which I didn't know were in the input file, and which do not meet the criterion. Examples:
Code:

ch^ateau
cr^epe
d_eb^acle
d_eb^acles

I guess the special characters indicate accent marks but don't understand why the pipe selected them.

Daniel B. Martin

firstfire 02-20-2012 03:13 PM

Hi.

Slow, but works:
Code:

sed -r 's/.*/\L&:&/; /^(.*)(.{4,}).+:\1.+\2/!d; s/:.*//'
Method:
1) duplicate a word (transforming to lowercase): "reassesses" => "reassesses:reassesses".
2) line should match: <prefix><substring>.+:<prefix>.+<substring>
with the same <substring>. Here <prefix> and .+ (one or more chars; note the `+') are used to prevent matching completely overlapping patterns.
Eg: <prefix>="rea"; <substring>="sses"; <.+>[1]="ses"; <.+>[2]="sse".
2a) No match -- delete a line and go to the next one.
2b) Else -- remove second word.

The whole idea of this mess is to transform overlapping patterns to non-overlapping ones (they are in two different words).

EDIT: Wow! I just downloaded so-called cheap-sed from this page. It runs more than 3 times faster than awk solution presented above and about 60 times faster that GNU sed! Really impressing. Unfortunately it does not supports the `-r' flag so one have to escape all meta characters.

Nominal Animal 02-20-2012 03:46 PM

My awk suggestion:
Code:

awk '{ for (i = length($0)-3; i > 1; i--) if (index($0, substr($0, i, 4)) < i) { print ; next } }' input-file > output-file
mawk-1.3.3 takes only a third (0.30s) of the time gawk-3.1.8 does (0.85s) when scanning /usr/share/dict/words on my machine, by the way.

The logic is simple: Starting at the end, check if each four-character substring is repeated earlier in this record. If yes, then print the record and move to the next record. The first four-letter substring does not need to be checked, since all the other checks will compare against it already.

Addendum:

For fun, I wrote basically the same logic in C using fgets() and strstr() ; the execution time was just 0.13s (45% of mawk execution time, 15% of gawk execution time). Replace strstr() with a double loop of memcmp() tests and the time halves (to 0.06s). Replace it with a triple loop that checks character by character, and the time halves again, to 0.03s. (About 80 megabytes per second, in other words.)

The fastest method I could come up with is to memory-map the word list file (using mmap()), and scan and check the words using simple open-coded loops.

It scans the word list in 0.02s, at about 120 megabytes per second. (Actually, since that is much too short a time for a reliable measurement, I checked using a hundred-times duplicated wordlist, 250 megabytes/23.5 million words. It does that correctly in 1.94s, i.e. 125 megabytes, or 12 million words, per second.)

I don't really want to show the C source code, because it is such an utter hack; I'm lazy, I didn't comment it or handle the SIGBUS if the file changes, and so on. But, here it is anyway:
Code:

/* Compile using
 *      gcc this.c -Wall -O2 -o program-name
 * or
 *      gcc this.c -Wall -O3 -fomit-frame-pointer -ftree-vectorize -march=native -mtune=native -o program-name
 * for an optimized version (on x86-64).
*/
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

static inline int repeats(const unsigned char *const word, const size_t length)
{
    const size_t    limit = length - 4;
    size_t          i, o;

    if (length <= 4)
        return 0;

    for (o = limit; o > 0; o--)
        for (i = 0; i < o; i++)
            if (word[i] == word[o])
            if (word[i+1] == word[o+1])
            if (word[i+2] == word[o+2])
            if (word[i+3] == word[o+3])
                return 1;
    return 0;
}

static int wrerr(const char *p)
{
    if (p && *p) {
        int              saved_errno = errno;
        const char *const q = p + strlen(p);
        ssize_t          n;

        while (p < q) {
            n = write(STDERR_FILENO, p, (size_t)(q - p));
            if (n > (ssize_t)0)
                p += n;
            else
            if (n != (ssize_t)-1) {
                saved_errno = errno;
                return EIO;
            } else
            if (errno != EINTR) {
                const int retval = errno;
                errno = saved_errno;
                return retval;
            }
        }

        errno = saved_errno;
    }

    return 0;
}

static unsigned char      out_data[65536];
static const size_t      out_size = sizeof(out_data);
static size_t            out_used = 0;
static int                out_errno = 0;

static void out_flush(void)
{
    int                  saved_errno = errno;
    unsigned char        *head = out_data;
    unsigned char *const  tail = out_data + out_used;
    ssize_t              bytes;

    out_used = 0;

    while (head < tail) {

        bytes = write(STDOUT_FILENO, head, (size_t)(tail - head));
        if (bytes > (ssize_t)0) {
            head += bytes;

        } else
        if (bytes != (ssize_t)-1) {
            if (!out_errno)
                out_errno = EIO;
            errno = saved_errno;
            return;

        } else
        if (errno != EINTR) {
            if (!out_errno)
                out_errno = errno;
            errno = saved_errno;
            return;

        }
    }

    errno = saved_errno;
    return;
}

static void out(const void *data, size_t length)
{
    if (length < (size_t)1)
        return;

    if (out_used + length < out_size) {
        memcpy(out_data + out_used, data, length);
        out_used += length;
        return;
    }

    out_flush();

    while (length >= out_size) {
        memcpy(out_data, data, out_size);
        out_used = out_size;
        data = (const void *)((const char *)data + out_used);
        length -= out_size;
        out_flush();
    }

    if (length > 0) {
        memcpy(out_data + out_used, data, length);
        out_used += length;
    }

    return;
}

struct map {
    int      descriptor;
    void    *mem;
    size_t  memsize;
    size_t  filesize;
};

static inline void *map_ptr(struct map *const file)
{
    return file->mem;
}

static inline void *map_end(struct map *const file)
{
    return (void *)( (char *)(file->mem) + file->filesize );
}

static int close_map(struct map *const file)
{
    int saved_errno, result, retval;

    if (!file)
        return errno = EINVAL;

    saved_errno = errno;
    retval = 0;

    if (file->memsize > (size_t)0) {
        if (munmap(file->mem, file->memsize) == -1)
            retval = errno;
    }
    file->mem = NULL;
    file->memsize = 0;
    file->filesize = 0;

    if (file->descriptor != -1) {
        do {
            result = close(file->descriptor);
        } while (result == -1 && errno == EINTR);
        if (result == -1 && !retval)
            retval = errno;
    }
    file->descriptor = -1;

    if (retval)
        return errno = retval;

    errno = saved_errno;
    return 0;
}

static int open_map(struct map *const file, const char *const filename)
{
    int        descriptor, result;
    struct stat info;
    long        page;
    size_t      npages, length;
    void      *area;

    if (!file || !filename || !*filename)
        return errno = EINVAL;

    do {
        descriptor = open(filename, O_RDONLY | O_NOCTTY);
    } while (descriptor == -1 && errno == EINTR);
    if (descriptor == -1)
        return errno;

    do {
        result = fstat(descriptor, (struct stat *)&info);
    } while (result == -1 && errno == EINTR);
    if (result == -1) {
        const int saved_errno = errno;
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = saved_errno;
    }

    if (info.st_size < (off_t)1) {
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = ENODATA;
    }

    page = sysconf(_SC_PAGE_SIZE);
    if (page < 1L) {
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = ENOTSUP;
    }

    length = (size_t)info.st_size;
    if ((off_t)length != info.st_size) {
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = EOVERFLOW;
    }
 
    npages = length / (size_t)page;
    if (length % (size_t)page)
        npages++;

    area = mmap(NULL, npages * (size_t)page, PROT_READ, MAP_SHARED | MAP_NORESERVE, descriptor, 0);
    if (area == MAP_FAILED) {
        const int err = errno;
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = err;
    }

    file->descriptor = descriptor;
    file->mem = area;
    file->memsize = npages * (size_t)page;
    file->filesize = length;

    return 0;
}

/* Consider all ASCII control characters whitespace. */
static inline int whitespace(const unsigned char c)
{
    return (c >= 0 && c <= 32);
}

int main(int argc, char *argv[])
{
    const unsigned char *word, *delim, *next, *ends;
    struct map          file;
    int                  arg;

    for (arg = 1; arg < argc; arg++) {

        if (open_map(&file, argv[arg])) {
            const char *const errmsg = strerror(errno);
            wrerr(argv[arg]);
            wrerr(": ");
            wrerr(errmsg);
            wrerr(".\n");
            return 1;
        }

        next = (const unsigned char *)map_ptr(&file);
        ends = (const unsigned char *)map_end(&file);

        /* Skip leading garbage. */
        while (next < ends && whitespace(*next))
            next++;

        while (next < ends) {

            /* Start of word. */
            word = next;

            /* Find end of word. */
            while (next < ends && !whitespace(*next))
                next++;

            /* End of word delimiter. */
            delim = next;

            /* Skip whitespace. */
            while (next < ends && whitespace(*next))
                next++;

            /* Word is [ word, delim ), whitespace [ delim, next ). */
            if (repeats(word, (size_t)(delim - word)))
                out(word, (size_t)(next - word));
        }
        out_flush();

        ends = (const unsigned char *)map_end(&file);

        if (close_map(&file)) {
            const char *const errmsg = strerror(errno);
            wrerr(argv[arg]);
            wrerr(": ");
            wrerr(errmsg);
            wrerr(".\n");
            return 1;
        }
    }

    return 0;
}


grail 02-21-2012 02:16 AM

hmmm ... well it took me a minute but as mine is using the regex comparison it does a comparison where it now uses ^ as the start of the string and hence the comparison looks like:
Code:

ch^ateau|ateau|^ate
#word|string to compare|what we are looking for

Here the 'what we are looking for' is a string starting with the letters "ate", which "ateau" does :(

NA's index option is better in comparing these accented words.

Cedrik 02-21-2012 01:23 PM

With perl...
Code:

perl -ne 'print if /(?=(....))(...).*\1/' /usr/share/dict/words
I use (....) syntax instead of (.{4}), as it seems to be slightly faster in my system for some reason (?)
The reg exp uses zero-width positive look-ahead assertion

(edit)
It's slower than NA's awk solution ;)

(edit2)
Note that gnu grep has -P option to interpret the pattern as perl regular expression
Code:

grep -P '(?=(....))(...).*\1' /usr/share/dict/words
It seems faster than gawk or perl...

danielbmartin 02-21-2012 04:02 PM

Quote:

Originally Posted by Cedrik (Post 4608407)
Note that gnu grep has -P option to interpret the pattern as perl regular expression
Code:

grep -P '(?=(....))(...).*\1' /usr/share/dict/words

This grep failed to recognize ratatat.

Daniel B. Martin

Cedrik 02-21-2012 04:25 PM

This one will do:
Code:

grep -P '(?=(....))..*\1'


All times are GMT -5. The time now is 01:06 AM.