Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
| Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
 |
GNU/Linux Basic Guide
This 255-page guide will provide you with the keys to understand the philosophy of free software, teach you how to use and handle it, and give you the tools required to move easily in the world of GNU/Linux. Many users and administrators will be taking their first steps with this GNU/Linux Basic guide and it will show you how to approach and solve the problems you encounter.
Click Here to receive this Complete Guide absolutely free. |
|
 |
|
02-20-2012, 10:02 AM
|
#1
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 765
Rep: 
|
Identify words having repeated character strings
Have: a file containing English-language words, one word per line, all lower case.
Want: a file containing only those words which have repeated substrings of length 4 or more. Examples:
abracadabra
handstands
hodgepodge
lightweight
nationalization
The desired output file was generated with this grep:
Code:
cat < $WrdLst \
|grep -e '\(....\).*\1' \
> $Work20
The output file did not contain alfalfa. That's understandable; the grep was looking for any 4-letter sequence followed by zero or more characters, followed by that same 4-letter sequence.
The 4 letters alfa appear twice in alfalfa but the substrings overlap.
How may my code be improved (or replaced) to find alfalfa?
Daniel B. Martin
|
|
|
|
02-20-2012, 10:56 AM
|
#2
|
|
Member
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 577
|
Hmm.. The problem is a bit fuzzy. The 'alfalfa' may be matched by '(.{3})\1' but that's probably not what you want. On the other hand the two 4-character substrings in this word have one common character. Do you want to find all such strings (with one or no common charaters and 4 characters length)?
|
|
|
1 members found this post helpful.
|
02-20-2012, 11:06 AM
|
#3
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,325
|
Isn't this the same concept as the consecutive alphabet solution?
Code:
awk '{for( i = 1; i <= length - 4; i++)if(substr($0,i+1) ~ substr($0,i,4)){print;next}}' $WrdLst
And I know you have been told many times but like to ignore it, the cat you use is still not required.
|
|
|
1 members found this post helpful.
|
02-20-2012, 11:09 AM
|
#4
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 765
Original Poster
Rep: 
|
Quote:
Originally Posted by firstfire
Do you want to find all such strings (with one or no common characters and 4 characters length)?
|
I want to find all words which have two identical 4-letter substrings, regardless of overlap. To give contrived examples, the string qqqqq has two identical 4-letter substrings, one starting at position 1 and another starting in position 2. The string ababab has two identical 4-letter substrings, one starting at position 1 and another starting in position 3.
Daniel B. Martin
|
|
|
|
02-20-2012, 11:12 AM
|
#5
|
|
Member
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 577
|
How about
Code:
$ egrep '(.+)(.{2})\1\2\1' /usr/share/dict/words
Mississippi
Mississippi's
Mississippian
Mississippians
alfalfa
alfalfa's
assesses
dispossesses
entente
entente's
ententes
possesses
prepossesses
reassesses
repossesses
|
|
|
1 members found this post helpful.
|
02-20-2012, 11:30 AM
|
#6
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,325
|
@firstfire - this solution does not seem to capture any of the original words in example.
|
|
|
1 members found this post helpful.
|
02-20-2012, 12:18 PM
|
#7
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 765
Original Poster
Rep: 
|
Quote:
Originally Posted by grail
And I know you have been told many times but like to ignore it, the cat you use is still not required.
|
I have not ignored your advice. To my eyes, the cat improves code readability. Since it performs no logic I assume the cost of the cat is negligible, particularly in code where it is the first of a long string of piped commands. I have not tested this assumption and may be wrong.
Daniel B. Martin
|
|
|
|
02-20-2012, 12:36 PM
|
#8
|
|
Member
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 577
|
Quote:
Originally Posted by grail
@firstfire - this solution does not seem to capture any of the original words in example.
|
I know, but it does captures most of normal words.
Here is another solution, which captures provided examples:
Code:
$ echo ababab | sed -r 's/.*/&:&/; s/((.{4}).*):.*\2/[\1]:\2/'
[ababab]:abab
If matched, the word surrounded by brackets; substring shown after a colon. BUT it also matches any other word (4 or more chars), where substring is a whole word. And it is not a bug, such words indeed satisfy given conditions ("regardless of overlap").
Last edited by firstfire; 02-20-2012 at 12:54 PM.
|
|
|
1 members found this post helpful.
|
02-20-2012, 12:44 PM
|
#9
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 765
Original Poster
Rep: 
|
Quote:
Originally Posted by grail
Code:
awk '{for( i = 1; i <= length - 4; i++)if(substr($0,i+1) ~ substr($0,i,4)){print;next}}' $WrdLst
|
I ran this pipe ...
Code:
# Find words which contain repeated 4-letter sequences
# such as abracadabra, hodgepodge, lightweight, whippersnapper.
# This version picks up words with overlapping repeated sequences
# such as alfalfa, entente, possesses, ratatat.
Method of LQ Senior Member grail.
echo; echo "Writing Work26"
< $WrdLst \
awk '{for( i = 1; i <= length - 4; i++)
if(substr($0,i+1) ~ substr($0,i,4))
{print;next}}' \
> $Work26
... and it found words with repeated fields which overlapped and also words with repeated fields with no overlap. So far, so good. This code also found words which I didn't know were in the input file, and which do not meet the criterion. Examples:
Code:
ch^ateau
cr^epe
d_eb^acle
d_eb^acles
I guess the special characters indicate accent marks but don't understand why the pipe selected them.
Daniel B. Martin
|
|
|
|
02-20-2012, 02:13 PM
|
#10
|
|
Member
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 577
|
Hi.
Slow, but works:
Code:
sed -r 's/.*/\L&:&/; /^(.*)(.{4,}).+:\1.+\2/!d; s/:.*//'
Method:
1) duplicate a word (transforming to lowercase): "reassesses" => "reassesses:reassesses".
2) line should match: <prefix><substring>.+:<prefix>.+<substring>
with the same <substring>. Here <prefix> and .+ (one or more chars; note the `+') are used to prevent matching completely overlapping patterns.
Eg: <prefix>="rea"; <substring>="sses"; <.+>[1]="ses"; <.+>[2]="sse".
2a) No match -- delete a line and go to the next one.
2b) Else -- remove second word.
The whole idea of this mess is to transform overlapping patterns to non-overlapping ones (they are in two different words).
EDIT: Wow! I just downloaded so-called cheap-sed from this page. It runs more than 3 times faster than awk solution presented above and about 60 times faster that GNU sed! Really impressing. Unfortunately it does not supports the `-r' flag so one have to escape all meta characters.
Last edited by firstfire; 02-20-2012 at 02:48 PM.
Reason: Explain a bit.
|
|
|
1 members found this post helpful.
|
02-20-2012, 02:46 PM
|
#11
|
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
My awk suggestion:
Code:
awk '{ for (i = length($0)-3; i > 1; i--) if (index($0, substr($0, i, 4)) < i) { print ; next } }' input-file > output-file
mawk-1.3.3 takes only a third (0.30s) of the time gawk-3.1.8 does (0.85s) when scanning /usr/share/dict/words on my machine, by the way.
The logic is simple: Starting at the end, check if each four-character substring is repeated earlier in this record. If yes, then print the record and move to the next record. The first four-letter substring does not need to be checked, since all the other checks will compare against it already.
Addendum:
For fun, I wrote basically the same logic in C using fgets() and strstr() ; the execution time was just 0.13s (45% of mawk execution time, 15% of gawk execution time). Replace strstr() with a double loop of memcmp() tests and the time halves (to 0.06s). Replace it with a triple loop that checks character by character, and the time halves again, to 0.03s. (About 80 megabytes per second, in other words.)
The fastest method I could come up with is to memory-map the word list file (using mmap()), and scan and check the words using simple open-coded loops.
It scans the word list in 0.02s, at about 120 megabytes per second. (Actually, since that is much too short a time for a reliable measurement, I checked using a hundred-times duplicated wordlist, 250 megabytes/23.5 million words. It does that correctly in 1.94s, i.e. 125 megabytes, or 12 million words, per second.)
I don't really want to show the C source code, because it is such an utter hack; I'm lazy, I didn't comment it or handle the SIGBUS if the file changes, and so on. But, here it is anyway:
Code:
/* Compile using
* gcc this.c -Wall -O2 -o program-name
* or
* gcc this.c -Wall -O3 -fomit-frame-pointer -ftree-vectorize -march=native -mtune=native -o program-name
* for an optimized version (on x86-64).
*/
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
static inline int repeats(const unsigned char *const word, const size_t length)
{
const size_t limit = length - 4;
size_t i, o;
if (length <= 4)
return 0;
for (o = limit; o > 0; o--)
for (i = 0; i < o; i++)
if (word[i] == word[o])
if (word[i+1] == word[o+1])
if (word[i+2] == word[o+2])
if (word[i+3] == word[o+3])
return 1;
return 0;
}
static int wrerr(const char *p)
{
if (p && *p) {
int saved_errno = errno;
const char *const q = p + strlen(p);
ssize_t n;
while (p < q) {
n = write(STDERR_FILENO, p, (size_t)(q - p));
if (n > (ssize_t)0)
p += n;
else
if (n != (ssize_t)-1) {
saved_errno = errno;
return EIO;
} else
if (errno != EINTR) {
const int retval = errno;
errno = saved_errno;
return retval;
}
}
errno = saved_errno;
}
return 0;
}
static unsigned char out_data[65536];
static const size_t out_size = sizeof(out_data);
static size_t out_used = 0;
static int out_errno = 0;
static void out_flush(void)
{
int saved_errno = errno;
unsigned char *head = out_data;
unsigned char *const tail = out_data + out_used;
ssize_t bytes;
out_used = 0;
while (head < tail) {
bytes = write(STDOUT_FILENO, head, (size_t)(tail - head));
if (bytes > (ssize_t)0) {
head += bytes;
} else
if (bytes != (ssize_t)-1) {
if (!out_errno)
out_errno = EIO;
errno = saved_errno;
return;
} else
if (errno != EINTR) {
if (!out_errno)
out_errno = errno;
errno = saved_errno;
return;
}
}
errno = saved_errno;
return;
}
static void out(const void *data, size_t length)
{
if (length < (size_t)1)
return;
if (out_used + length < out_size) {
memcpy(out_data + out_used, data, length);
out_used += length;
return;
}
out_flush();
while (length >= out_size) {
memcpy(out_data, data, out_size);
out_used = out_size;
data = (const void *)((const char *)data + out_used);
length -= out_size;
out_flush();
}
if (length > 0) {
memcpy(out_data + out_used, data, length);
out_used += length;
}
return;
}
struct map {
int descriptor;
void *mem;
size_t memsize;
size_t filesize;
};
static inline void *map_ptr(struct map *const file)
{
return file->mem;
}
static inline void *map_end(struct map *const file)
{
return (void *)( (char *)(file->mem) + file->filesize );
}
static int close_map(struct map *const file)
{
int saved_errno, result, retval;
if (!file)
return errno = EINVAL;
saved_errno = errno;
retval = 0;
if (file->memsize > (size_t)0) {
if (munmap(file->mem, file->memsize) == -1)
retval = errno;
}
file->mem = NULL;
file->memsize = 0;
file->filesize = 0;
if (file->descriptor != -1) {
do {
result = close(file->descriptor);
} while (result == -1 && errno == EINTR);
if (result == -1 && !retval)
retval = errno;
}
file->descriptor = -1;
if (retval)
return errno = retval;
errno = saved_errno;
return 0;
}
static int open_map(struct map *const file, const char *const filename)
{
int descriptor, result;
struct stat info;
long page;
size_t npages, length;
void *area;
if (!file || !filename || !*filename)
return errno = EINVAL;
do {
descriptor = open(filename, O_RDONLY | O_NOCTTY);
} while (descriptor == -1 && errno == EINTR);
if (descriptor == -1)
return errno;
do {
result = fstat(descriptor, (struct stat *)&info);
} while (result == -1 && errno == EINTR);
if (result == -1) {
const int saved_errno = errno;
do {
result = close(descriptor);
} while (result == -1 && errno == EINTR);
return errno = saved_errno;
}
if (info.st_size < (off_t)1) {
do {
result = close(descriptor);
} while (result == -1 && errno == EINTR);
return errno = ENODATA;
}
page = sysconf(_SC_PAGE_SIZE);
if (page < 1L) {
do {
result = close(descriptor);
} while (result == -1 && errno == EINTR);
return errno = ENOTSUP;
}
length = (size_t)info.st_size;
if ((off_t)length != info.st_size) {
do {
result = close(descriptor);
} while (result == -1 && errno == EINTR);
return errno = EOVERFLOW;
}
npages = length / (size_t)page;
if (length % (size_t)page)
npages++;
area = mmap(NULL, npages * (size_t)page, PROT_READ, MAP_SHARED | MAP_NORESERVE, descriptor, 0);
if (area == MAP_FAILED) {
const int err = errno;
do {
result = close(descriptor);
} while (result == -1 && errno == EINTR);
return errno = err;
}
file->descriptor = descriptor;
file->mem = area;
file->memsize = npages * (size_t)page;
file->filesize = length;
return 0;
}
/* Consider all ASCII control characters whitespace. */
static inline int whitespace(const unsigned char c)
{
return (c >= 0 && c <= 32);
}
int main(int argc, char *argv[])
{
const unsigned char *word, *delim, *next, *ends;
struct map file;
int arg;
for (arg = 1; arg < argc; arg++) {
if (open_map(&file, argv[arg])) {
const char *const errmsg = strerror(errno);
wrerr(argv[arg]);
wrerr(": ");
wrerr(errmsg);
wrerr(".\n");
return 1;
}
next = (const unsigned char *)map_ptr(&file);
ends = (const unsigned char *)map_end(&file);
/* Skip leading garbage. */
while (next < ends && whitespace(*next))
next++;
while (next < ends) {
/* Start of word. */
word = next;
/* Find end of word. */
while (next < ends && !whitespace(*next))
next++;
/* End of word delimiter. */
delim = next;
/* Skip whitespace. */
while (next < ends && whitespace(*next))
next++;
/* Word is [ word, delim ), whitespace [ delim, next ). */
if (repeats(word, (size_t)(delim - word)))
out(word, (size_t)(next - word));
}
out_flush();
ends = (const unsigned char *)map_end(&file);
if (close_map(&file)) {
const char *const errmsg = strerror(errno);
wrerr(argv[arg]);
wrerr(": ");
wrerr(errmsg);
wrerr(".\n");
return 1;
}
}
return 0;
}
Last edited by Nominal Animal; 02-20-2012 at 08:36 PM.
|
|
|
2 members found this post helpful.
|
02-21-2012, 01:16 AM
|
#12
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,325
|
hmmm ... well it took me a minute but as mine is using the regex comparison it does a comparison where it now uses ^ as the start of the string and hence the comparison looks like:
Code:
ch^ateau|ateau|^ate
#word|string to compare|what we are looking for
Here the 'what we are looking for' is a string starting with the letters "ate", which "ateau" does
NA's index option is better in comparing these accented words.
|
|
|
1 members found this post helpful.
|
02-21-2012, 12:23 PM
|
#13
|
|
Senior Member
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140
|
With perl...
Code:
perl -ne 'print if /(?=(....))(...).*\1/' /usr/share/dict/words
I use (....) syntax instead of (.{4}), as it seems to be slightly faster in my system for some reason (?)
The reg exp uses zero-width positive look-ahead assertion
(edit)
It's slower than NA's awk solution
(edit2)
Note that gnu grep has -P option to interpret the pattern as perl regular expression
Code:
grep -P '(?=(....))(...).*\1' /usr/share/dict/words
It seems faster than gawk or perl...
Last edited by Cedrik; 02-21-2012 at 01:57 PM.
|
|
|
2 members found this post helpful.
|
02-21-2012, 03:02 PM
|
#14
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 765
Original Poster
Rep: 
|
Quote:
Originally Posted by Cedrik
Note that gnu grep has -P option to interpret the pattern as perl regular expression
Code:
grep -P '(?=(....))(...).*\1' /usr/share/dict/words
|
This grep failed to recognize ratatat.
Daniel B. Martin
|
|
|
1 members found this post helpful.
|
02-21-2012, 03:25 PM
|
#15
|
|
Senior Member
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140
|
This one will do:
Code:
grep -P '(?=(....))..*\1'
Last edited by Cedrik; 02-21-2012 at 03:31 PM.
|
|
|
1 members found this post helpful.
|
| Thread Tools |
Search this Thread |
|
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
All times are GMT -5. The time now is 04:03 AM.
|
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|