Linux - Newbie: This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Simple, but not very fast, so if the files are large, I'd try awk or something.
When you give grep a list of regexps it checks each one for every line, so the runtime is O(Pn) (P is the number of patterns, n is number of lines to search in). This will be much faster with -F because then grep knows it has just plain strings and uses a much faster algorithm which is O(P+n). However, since we want to find occurrences only at the beginning of lines we can't use that in this case.
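To illustrate the trade-off, here is a small sketch (the file names and contents are invented for the example): with -F grep matches the fixed strings anywhere in a line using its fast algorithm, while anchoring the patterns forces the slower regexp engine.

```shell
printf 'abs\nbat\nball\ncar\n' > file1            # the keywords
printf 'abs\nbata\ncricket-bat\nballz\n' > file2  # the lines to search
# -F: fast fixed-string matching, but it hits keywords anywhere in a line
grep -cF -f file1 file2                 # prints 4: every line contains a keyword
# anchoring restricts matches to line starts, but loses the -F fast path
sed 's/^/^/' file1 | grep -c -f - file2 # prints 3: cricket-bat no longer matches
```

Note how the anchored count differs: "cricket-bat" contains "bat" but does not start with it.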
Here is an awk program which combines all the keywords into a single regexp, so that the search should be O(P+n). (It doesn't work, as millgates points out.)
Code:
#!/usr/bin/awk -f
NR == FNR {
    for (i = 1; i <= length($0); i++) {
        char = substr($0, i, 1);
        if (!index(charsets[i], char))
            charsets[i] = charsets[i] char;
    }
}

function regexp_range(charset,    i, c, reg_range) {
    for (i = 1; i <= length(charset); i++) {
        c = substr(charset, i, 1);
        if (index("\\]-^", c))
            reg_range = reg_range "\\" c;
        else
            reg_range = reg_range c;
    }
    return "[" reg_range "]";
}

NR != FNR && !kw_regexp {
    kw_regexp = "^";
    for (i = 1; i in charsets; i++) {
        kw_regexp = kw_regexp regexp_range(charsets[i])
    }
    # print kw_regexp ; exit
}

NR != FNR && match($0, kw_regexp) {
    kw[substr($0, RSTART, RLENGTH)]++;
}

END {
    for (w in kw) { print w, kw[w]; }
}
Last edited by ntubski; 04-18-2013 at 10:55 AM.
Reason: note my script doesn't work
I ran the above awk script and got this error:
awk: linux.awk:19: (FILENAME=zonecrap FNR=1) fatal: Invalid range end: /^[01#23456789abcdefghijklmnopqrstuvwxyz][291s0cft6rdpxqabeghijklmnouvwz-y][id5679ct01oalwsupmhzrenqvgfbjykx�][ra0684hbfikmv573osculeytdngwqjpxz-129][enlfviaywhkmbpustordxcgjzq12356ı4897/
Actually, we do not know what your problem is; probably you do not have enough memory. Splitting file2 may help with that, but if you want a really efficient solution you would need to sort file1 and file2. You can simply execute:
sort file1 > file1.sorted
sort file2 > file2.sorted
to check how much time they need.
After that sort there can be a very quick and efficient solution... (without splitting)
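To sketch what that quick solution could look like (assuming, additionally, that no keyword is itself a prefix of another keyword): with both files sorted, a single merge-style pass works, because once a file2 line sorts past every possible extension of a keyword, that keyword can never match a later line. The file names and contents here are made up for the example.

```shell
printf 'abs\nball\nbat\ncar\n' > file1.sorted           # sorted keywords
printf 'abs\nballz\nbata\ncricket-bat\n' > file2.sorted # sorted lines
awk '
BEGIN { j = 1 }
NR == FNR { key[++m] = $0; cnt[m] = 0; next }   # load sorted keywords
{
    # drop keywords that no later (sorted) line can start with
    while (j <= m && key[j] < $0 && substr($0, 1, length(key[j])) != key[j])
        j++
    if (j <= m && substr($0, 1, length(key[j])) == key[j])
        cnt[j]++
}
END { for (i = 1; i <= m; i++) print key[i], cnt[i] }
' file1.sorted file2.sorted
```

This reads each file exactly once, so it is O(P+n) in lines touched; the caveat is that nested keywords (say, "ba" and "ball") would break the single-match assumption.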
Also, it would be nice to know:
Do keywords occur at the beginning of lines (in file2), only once in a line, or randomly?
Does every keyword from file1 exist in file2, or are a few of them missing?
I think we can help you to solve it, just give us more info (instead of paying).
(Your example is not a sorted list.)
To answer your questions:
In file2, keywords occur one on each line.
Files 1 and 2 are sorted and contain only unique keywords.
Not every keyword in file1 is in file2.
example file
abc
bal
cat
dog
etc
Please allow me to explain my requirements once more in detail.
file2 has 100 million keywords.
file1 has 20 million.
example
file1
abs
bat
ball
car
file2
abs
bata
cricket-bat
ballz
I want the script to take every keyword from file1 and count how many keywords in file2 begin with that keyword.
result is
abs 1
bat 1
ball 1
car 0
I hope you people can help me crack this.
I've been working on this project for weeks.
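For what it's worth, here is one way to do exactly that, as a sketch (untested at this scale) using the small example files above: load the file1 keywords into an awk array and test every prefix of each file2 line against it. It needs enough memory to hold all 20 million keywords (likely several GB), but it does not require sorted input.

```shell
printf 'abs\nbat\nball\ncar\n' > file1            # example keywords
printf 'abs\nbata\ncricket-bat\nballz\n' > file2  # example lines
awk '
NR == FNR {                 # first file: store keywords, track the longest
    kw[$0] = 0
    if (length($0) > maxlen) maxlen = length($0)
    next
}
{                           # second file: test each prefix of the line
    n = length($0) < maxlen ? length($0) : maxlen
    for (i = 1; i <= n; i++)
        if (substr($0, 1, i) in kw)
            kw[substr($0, 1, i)]++
}
END { for (w in kw) print w, kw[w] }
' file1 file2 | sort
```

On the example data this prints abs 1, ball 1, bat 1, car 0, matching the expected result; each file2 line costs at most one hash lookup per character up to the longest keyword.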
Quote:
i ran the above awk script and got this error
awk: linux.awk:19: (FILENAME=zonecrap FNR=1) fatal: Invalid range end: /^[01#23456789abcdefghijklmnopqrstuvwxyz][291s0cft6rdpxqabeghijklmnouvwz-y][id5679ct01oalwsupmhzrenqvgfbjykx�][ra0684hbfikmv573osculeytdngwqjpxz-129][enlfviaywhkmbpustordxcgjzq12356ı4897/
Oh, I was a bit sloppy: my script didn't escape special characters. I'll edit my post with a correction, but probably the scripts that take advantage of the fact you have sorted files will be faster and not wrong.
Quote:
let's see how much time it takes to compare the big files.
You might try running things on a bit less than the whole big file, say just the first 100MB to get an idea of how fast it will be for the entire thing.
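For example (a sketch: seq just manufactures stand-in data, and the wc call stands in for whatever script is actually being timed):

```shell
# Build a stand-in for the real big file, then slice off its first part.
seq 1 100000 > file2
head -c 100000000 file2 > file2.sample   # at most the first ~100 MB
time wc -l < file2.sample                # substitute the real awk run here
```

head -c may cut the final line in half, which is harmless for a timing estimate; multiply the measured time by the size ratio to project the full run.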
Last edited by ntubski; 04-18-2013 at 10:53 AM.
Reason: note my script is wrong