LinuxQuestions.org
Register a domain and help support LQ
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 02-20-2012, 10:02 AM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,093

Rep: Reputation: 288Reputation: 288Reputation: 288
Identify words having repeated character strings


Have: a file containing English-language words, one word per line, all lower case.

Want: a file containing only those words which have repeated substrings of length 4 or more. Examples:
abracadabra
handstands
hodgepodge
lightweight
nationalization

The desired output file was generated with this grep:
Code:
cat < $WrdLst            \
|grep -e '\(....\).*\1'  \
> $Work20
The output file did not contain alfalfa. That's understandable; the grep was looking for any 4-letter sequence followed by zero or more characters, followed by that same 4-letter sequence.

The 4 letters alfa appear twice in alfalfa but the substrings overlap.

How may my code be improved (or replaced) to find alfalfa?

Daniel B. Martin
 
Old 02-20-2012, 10:56 AM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Rep: Reputation: 364Reputation: 364Reputation: 364Reputation: 364
Hmm.. The problem is a bit fuzzy. The 'alfalfa' may be matched by '(.{3})\1' but that's probably not what you want. On the other hand the two 4-character substrings in this word have one common character. Do you want to find all such strings (with one or no common charaters and 4 characters length)?
 
1 members found this post helpful.
Old 02-20-2012, 11:06 AM   #3
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,503

Rep: Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893
Isn't this the same concept as the consecutive alphabet solution?
Code:
awk '{for( i = 1; i <= length - 4; i++)if(substr($0,i+1) ~ substr($0,i,4)){print;next}}' $WrdLst
And I know you have been told many times but like to ignore it, the cat you use is still not required.
 
1 members found this post helpful.
Old 02-20-2012, 11:09 AM   #4
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,093

Original Poster
Rep: Reputation: 288Reputation: 288Reputation: 288
Quote:
Originally Posted by firstfire View Post
Do you want to find all such strings (with one or no common characters and 4 characters length)?
I want to find all words which have two identical 4-letter substrings, regardless of overlap. To give contrived examples, the string qqqqq has two identical 4-letter substrings, one starting at position 1 and another starting in position 2. The string ababab has two identical 4-letter substrings, one starting at position 1 and another starting in position 3.

Daniel B. Martin
 
Old 02-20-2012, 11:12 AM   #5
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Rep: Reputation: 364Reputation: 364Reputation: 364Reputation: 364
How about
Code:
$ egrep '(.+)(.{2})\1\2\1' /usr/share/dict/words
Mississippi
Mississippi's
Mississippian
Mississippians
alfalfa
alfalfa's
assesses
dispossesses
entente
entente's
ententes
possesses
prepossesses
reassesses
repossesses
 
1 members found this post helpful.
Old 02-20-2012, 11:30 AM   #6
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,503

Rep: Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893
@firstfire - this solution does not seem to capture any of the original words in example.
 
1 members found this post helpful.
Old 02-20-2012, 12:18 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,093

Original Poster
Rep: Reputation: 288Reputation: 288Reputation: 288
Quote:
Originally Posted by grail View Post
And I know you have been told many times but like to ignore it, the cat you use is still not required.
I have not ignored your advice. To my eyes, the cat improves code readability. Since it performs no logic I assume the cost of the cat is negligible, particularly in code where it is the first of a long string of piped commands. I have not tested this assumption and may be wrong.

Daniel B. Martin
 
Old 02-20-2012, 12:36 PM   #8
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Rep: Reputation: 364Reputation: 364Reputation: 364Reputation: 364
Quote:
Originally Posted by grail View Post
@firstfire - this solution does not seem to capture any of the original words in example.
I know, but it does captures most of normal words.

Here is another solution, which captures provided examples:
Code:
$ echo ababab | sed -r 's/.*/&:&/; s/((.{4}).*):.*\2/[\1]:\2/'
[ababab]:abab
If matched, the word surrounded by brackets; substring shown after a colon. BUT it also matches any other word (4 or more chars), where substring is a whole word. And it is not a bug, such words indeed satisfy given conditions ("regardless of overlap").

Last edited by firstfire; 02-20-2012 at 12:54 PM.
 
1 members found this post helpful.
Old 02-20-2012, 12:44 PM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,093

Original Poster
Rep: Reputation: 288Reputation: 288Reputation: 288
Quote:
Originally Posted by grail View Post
Code:
awk '{for( i = 1; i <= length - 4; i++)if(substr($0,i+1) ~ substr($0,i,4)){print;next}}' $WrdLst
I ran this pipe ...
Code:
# Find words which contain repeated 4-letter sequences
# such as abracadabra, hodgepodge, lightweight, whippersnapper.
# This version picks up words with overlapping repeated sequences
# such as alfalfa, entente, possesses, ratatat.
Method of LQ Senior Member grail.
echo; echo "Writing Work26"
< $WrdLst \
awk '{for( i = 1; i <= length - 4; i++)
      if(substr($0,i+1) ~ substr($0,i,4))
      {print;next}}' \
> $Work26
... and it found words with repeated fields which overlapped and also words with repeated fields with no overlap. So far, so good. This code also found words which I didn't know were in the input file, and which do not meet the criterion. Examples:
Code:
ch^ateau
cr^epe
d_eb^acle
d_eb^acles
I guess the special characters indicate accent marks but don't understand why the pipe selected them.

Daniel B. Martin
 
Old 02-20-2012, 02:13 PM   #10
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Rep: Reputation: 364Reputation: 364Reputation: 364Reputation: 364
Hi.

Slow, but works:
Code:
sed -r 's/.*/\L&:&/; /^(.*)(.{4,}).+:\1.+\2/!d; s/:.*//'
Method:
1) duplicate a word (transforming to lowercase): "reassesses" => "reassesses:reassesses".
2) line should match: <prefix><substring>.+:<prefix>.+<substring>
with the same <substring>. Here <prefix> and .+ (one or more chars; note the `+') are used to prevent matching completely overlapping patterns.
Eg: <prefix>="rea"; <substring>="sses"; <.+>[1]="ses"; <.+>[2]="sse".
2a) No match -- delete a line and go to the next one.
2b) Else -- remove second word.

The whole idea of this mess is to transform overlapping patterns to non-overlapping ones (they are in two different words).

EDIT: Wow! I just downloaded so-called cheap-sed from this page. It runs more than 3 times faster than awk solution presented above and about 60 times faster that GNU sed! Really impressing. Unfortunately it does not supports the `-r' flag so one have to escape all meta characters.

Last edited by firstfire; 02-20-2012 at 02:48 PM. Reason: Explain a bit.
 
1 members found this post helpful.
Old 02-20-2012, 02:46 PM   #11
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942
My awk suggestion:
Code:
awk '{ for (i = length($0)-3; i > 1; i--) if (index($0, substr($0, i, 4)) < i) { print ; next } }' input-file > output-file
mawk-1.3.3 takes only a third (0.30s) of the time gawk-3.1.8 does (0.85s) when scanning /usr/share/dict/words on my machine, by the way.

The logic is simple: Starting at the end, check if each four-character substring is repeated earlier in this record. If yes, then print the record and move to the next record. The first four-letter substring does not need to be checked, since all the other checks will compare against it already.

Addendum:

For fun, I wrote basically the same logic in C using fgets() and strstr() ; the execution time was just 0.13s (45% of mawk execution time, 15% of gawk execution time). Replace strstr() with a double loop of memcmp() tests and the time halves (to 0.06s). Replace it with a triple loop that checks character by character, and the time halves again, to 0.03s. (About 80 megabytes per second, in other words.)

The fastest method I could come up with is to memory-map the word list file (using mmap()), and scan and check the words using simple open-coded loops.

It scans the word list in 0.02s, at about 120 megabytes per second. (Actually, since that is much too short a time for a reliable measurement, I checked using a hundred-times duplicated wordlist, 250 megabytes/23.5 million words. It does that correctly in 1.94s, i.e. 125 megabytes, or 12 million words, per second.)

I don't really want to show the C source code, because it is such an utter hack; I'm lazy, I didn't comment it or handle the SIGBUS if the file changes, and so on. But, here it is anyway:
Code:
/* Compile using
 *      gcc this.c -Wall -O2 -o program-name
 * or
 *      gcc this.c -Wall -O3 -fomit-frame-pointer -ftree-vectorize -march=native -mtune=native -o program-name
 * for an optimized version (on x86-64).
*/
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

static inline int repeats(const unsigned char *const word, const size_t length)
{
    const size_t    limit = length - 4;
    size_t          i, o;

    if (length <= 4)
        return 0;

    for (o = limit; o > 0; o--)
        for (i = 0; i < o; i++)
            if (word[i] == word[o])
            if (word[i+1] == word[o+1])
            if (word[i+2] == word[o+2])
            if (word[i+3] == word[o+3])
                return 1;
    return 0;
}

static int wrerr(const char *p)
{
    if (p && *p) {
        int               saved_errno = errno;
        const char *const q = p + strlen(p);
        ssize_t           n;

        while (p < q) {
            n = write(STDERR_FILENO, p, (size_t)(q - p));
            if (n > (ssize_t)0)
                p += n;
            else
            if (n != (ssize_t)-1) {
                saved_errno = errno;
                return EIO;
            } else
            if (errno != EINTR) {
                const int retval = errno;
                errno = saved_errno;
                return retval;
            }
        }

        errno = saved_errno;
    }

    return 0;
}

static unsigned char      out_data[65536];
static const size_t       out_size = sizeof(out_data);
static size_t             out_used = 0;
static int                out_errno = 0;

static void out_flush(void)
{
    int                   saved_errno = errno;
    unsigned char        *head = out_data;
    unsigned char *const  tail = out_data + out_used;
    ssize_t               bytes;

    out_used = 0;

    while (head < tail) {

        bytes = write(STDOUT_FILENO, head, (size_t)(tail - head));
        if (bytes > (ssize_t)0) {
            head += bytes;

        } else
        if (bytes != (ssize_t)-1) {
            if (!out_errno)
                out_errno = EIO;
            errno = saved_errno;
            return;

        } else
        if (errno != EINTR) {
            if (!out_errno)
                out_errno = errno;
            errno = saved_errno;
            return;

        }
    }

    errno = saved_errno;
    return;
}

static void out(const void *data, size_t length)
{
    if (length < (size_t)1)
        return;

    if (out_used + length < out_size) {
        memcpy(out_data + out_used, data, length);
        out_used += length;
        return;
    }

    out_flush();

    while (length >= out_size) {
        memcpy(out_data, data, out_size);
        out_used = out_size;
        data = (const void *)((const char *)data + out_used);
        length -= out_size;
        out_flush();
    }

    if (length > 0) {
        memcpy(out_data + out_used, data, length);
        out_used += length;
    }

    return;
}

struct map {
    int      descriptor;
    void    *mem;
    size_t   memsize;
    size_t   filesize;
};

static inline void *map_ptr(struct map *const file)
{
    return file->mem;
}

static inline void *map_end(struct map *const file)
{
    return (void *)( (char *)(file->mem) + file->filesize );
}

static int close_map(struct map *const file)
{
    int saved_errno, result, retval;

    if (!file)
        return errno = EINVAL;

    saved_errno = errno;
    retval = 0;

    if (file->memsize > (size_t)0) {
        if (munmap(file->mem, file->memsize) == -1)
            retval = errno;
    }
    file->mem = NULL;
    file->memsize = 0;
    file->filesize = 0;

    if (file->descriptor != -1) {
        do {
            result = close(file->descriptor);
        } while (result == -1 && errno == EINTR);
        if (result == -1 && !retval)
            retval = errno;
    }
    file->descriptor = -1;

    if (retval)
        return errno = retval;

    errno = saved_errno;
    return 0;
}

static int open_map(struct map *const file, const char *const filename)
{
    int         descriptor, result;
    struct stat info;
    long        page;
    size_t      npages, length;
    void       *area;

    if (!file || !filename || !*filename)
        return errno = EINVAL;

    do {
        descriptor = open(filename, O_RDONLY | O_NOCTTY);
    } while (descriptor == -1 && errno == EINTR);
    if (descriptor == -1)
        return errno;

    do {
        result = fstat(descriptor, (struct stat *)&info);
    } while (result == -1 && errno == EINTR);
    if (result == -1) {
        const int saved_errno = errno;
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = saved_errno;
    }

    if (info.st_size < (off_t)1) {
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = ENODATA;
    }

    page = sysconf(_SC_PAGE_SIZE);
    if (page < 1L) {
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = ENOTSUP;
    }

    length = (size_t)info.st_size;
    if ((off_t)length != info.st_size) {
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = EOVERFLOW;
    }
 
    npages = length / (size_t)page;
    if (length % (size_t)page)
        npages++;

    area = mmap(NULL, npages * (size_t)page, PROT_READ, MAP_SHARED | MAP_NORESERVE, descriptor, 0);
    if (area == MAP_FAILED) {
        const int err = errno;
        do {
            result = close(descriptor);
        } while (result == -1 && errno == EINTR);
        return errno = err;
    }

    file->descriptor = descriptor;
    file->mem = area;
    file->memsize = npages * (size_t)page;
    file->filesize = length;

    return 0;
}

/* Consider all ASCII control characters whitespace. */
static inline int whitespace(const unsigned char c)
{
    return (c >= 0 && c <= 32);
}

int main(int argc, char *argv[])
{
    const unsigned char *word, *delim, *next, *ends;
    struct map           file;
    int                  arg;

    for (arg = 1; arg < argc; arg++) {

        if (open_map(&file, argv[arg])) {
            const char *const errmsg = strerror(errno);
            wrerr(argv[arg]);
            wrerr(": ");
            wrerr(errmsg);
            wrerr(".\n");
            return 1;
        }

        next = (const unsigned char *)map_ptr(&file);
        ends = (const unsigned char *)map_end(&file);

        /* Skip leading garbage. */
        while (next < ends && whitespace(*next))
            next++;

        while (next < ends) {

            /* Start of word. */
            word = next;

            /* Find end of word. */
            while (next < ends && !whitespace(*next))
                next++;

            /* End of word delimiter. */
            delim = next;

            /* Skip whitespace. */
            while (next < ends && whitespace(*next))
                next++;

            /* Word is [ word, delim ), whitespace [ delim, next ). */
            if (repeats(word, (size_t)(delim - word)))
                out(word, (size_t)(next - word));
        }
        out_flush();

        ends = (const unsigned char *)map_end(&file);

        if (close_map(&file)) {
            const char *const errmsg = strerror(errno);
            wrerr(argv[arg]);
            wrerr(": ");
            wrerr(errmsg);
            wrerr(".\n");
            return 1;
        }
    }

    return 0;
}

Last edited by Nominal Animal; 02-20-2012 at 08:36 PM.
 
2 members found this post helpful.
Old 02-21-2012, 01:16 AM   #12
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,503

Rep: Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893Reputation: 1893
hmmm ... well it took me a minute but as mine is using the regex comparison it does a comparison where it now uses ^ as the start of the string and hence the comparison looks like:
Code:
ch^ateau|ateau|^ate
#word|string to compare|what we are looking for
Here the 'what we are looking for' is a string starting with the letters "ate", which "ateau" does

NA's index option is better in comparing these accented words.
 
1 members found this post helpful.
Old 02-21-2012, 12:23 PM   #13
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 242Reputation: 242Reputation: 242
With perl...
Code:
perl -ne 'print if /(?=(....))(...).*\1/' /usr/share/dict/words
I use (....) syntax instead of (.{4}), as it seems to be slightly faster in my system for some reason (?)
The reg exp uses zero-width positive look-ahead assertion

(edit)
It's slower than NA's awk solution

(edit2)
Note that gnu grep has -P option to interpret the pattern as perl regular expression
Code:
grep -P '(?=(....))(...).*\1' /usr/share/dict/words
It seems faster than gawk or perl...

Last edited by Cedrik; 02-21-2012 at 01:57 PM.
 
2 members found this post helpful.
Old 02-21-2012, 03:02 PM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,093

Original Poster
Rep: Reputation: 288Reputation: 288Reputation: 288
Quote:
Originally Posted by Cedrik View Post
Note that gnu grep has -P option to interpret the pattern as perl regular expression
Code:
grep -P '(?=(....))(...).*\1' /usr/share/dict/words
This grep failed to recognize ratatat.

Daniel B. Martin
 
1 members found this post helpful.
Old 02-21-2012, 03:25 PM   #15
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 242Reputation: 242Reputation: 242
This one will do:
Code:
 grep -P '(?=(....))..*\1'

Last edited by Cedrik; 02-21-2012 at 03:31 PM.
 
1 members found this post helpful.
  


Reply

Tags
grep, sed, strings


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Select words with alphabetical-order character strings danielbmartin Programming 7 02-17-2012 12:08 AM
printing rows with repeated strings verse123 Linux - Newbie 4 12-20-2011 04:56 AM
[SOLVED] Array of words(strings)- please help! stefanolima Programming 4 07-12-2010 04:15 AM
[SOLVED] How can I extract strings based on character positions? btacuso Linux - Newbie 8 03-25-2010 12:31 PM


All times are GMT -5. The time now is 01:06 PM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration