Linux - Newbie: This Linux forum is for members that are new to Linux.
I created two files containing Linux user names. File A contains 49000 user names, while file B contains 800 user names. I want to create a file C that has all the records of file A that are not in B, so
C contains 48200 user names.
Can anyone tell me how to do this? A sample of file A
sort A > a.srt
sort B > b.srt
# keep only the lines unique to a.srt (diff marks them with "< "); drop hunk headers and "> " lines
diff a.srt b.srt | sed -n 's/^< //p' > C
Note that a.srt and b.srt are created (or overwritten), so change the file names if you need to.
Reading 'man sort', 'man diff', 'man sed' and 'man bash' is a good thing to put on your medium-term to-do list. Or 'info bash' and so on...
I want to create a file "C" having all records of file A but not B,
By this do you mean that "C" should contain everything in "A", except what also appears in "B"? If so,
Code:
grep -Fvxf B A >C
should work & quite quickly -- it's the "F" option that makes it fast. It also doesn't disturb the order of the file.
I had a problem of similar magnitude a few months back, a long list of fixed patterns & a longer file to search through. That is when I discovered the "F" option.
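For anyone unsure what each letter in -Fvxf does, here is a minimal sketch on a handful of names. The temporary directory and the sample names are assumptions made up for illustration; substitute your own A and B.

```shell
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data: A is the full list, B holds the names to drop.
printf '%s\n' root bin ftp neo jhons maria leo xerox > A
printf '%s\n' jhons maria leo > B

# -F    treat each pattern as a fixed string, not a regular expression
# -x    match whole lines only, so "leo" would not knock out "leonard"
# -v    invert the match: print the lines that do NOT match
# -f B  read the patterns from file B, one per line
grep -Fvxf B A > C

cat C
```

The names left in C keep the order they had in A, which is the main practical difference from the sort-based methods.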
Running on a 1GHz box w/ 512MiB of RAM, here are some speed tests that simulate the file sizes you mentioned --
Set up 2 test files using numbers instead of names, reversing the order to approximate real life disorder:
Code:
$ >1; for X in {49000..1}; do echo $X >>1; done
$ >2; for X in {999..200}; do echo $X >>2; done
Time the grep technique:
Code:
$ time grep -Fvxf 2 1 >3
real 0m0.088s
user 0m0.044s
sys 0m0.006s
Time the sort & comm technique:
Code:
$ time sort 1 > 1s
real 0m0.400s
user 0m0.351s
sys 0m0.013s
$ time sort 2 > 2s
real 0m0.009s
user 0m0.005s
sys 0m0.004s
$ time comm -23 1s 2s >3s
real 0m0.084s
user 0m0.046s
sys 0m0.007s
It is interesting to note 2 things: comm is slightly faster than grep, & the necessary sorting takes 83% of the time used by this method. While it's very fast in this test, using files of real user names will probably only slow it down.
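The timings only matter if both methods give the same answer, so here is a rough sketch that checks that. It is an illustration, not part of the original test run: the file names f1, f2 and so on are made up, and awk stands in for the brace-expansion loops so the snippet also runs in a plain sh.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Rebuild the two numeric test files in reverse order.
awk 'BEGIN { for (i = 49000; i >= 1; i--) print i }' > f1
awk 'BEGIN { for (i = 999; i >= 200; i--) print i }' > f2

# Method 1: grep keeps the original order of f1.
grep -Fvxf f2 f1 > out.grep

# Method 2: comm needs both inputs sorted the same way.
sort f1 > f1.srt
sort f2 > f2.srt
comm -23 f1.srt f2.srt > out.comm

# Sort grep's output so the two results can be compared byte for byte.
sort out.grep | cmp -s - out.comm && echo "same set of lines"

# Count the surviving lines (49000 - 800).
grep -c '' out.grep
```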
I really don't understand what you are doing (or trying to do). It's probably not what you are asking for.
Below is an example based on the data you gave in post 5 of this thread:
Code:
$ cat A
root
bin
ftp
neo
jhons
maria
leo
xerox
$ cat B
jhons
maria
leo
$ cat A.sorted
bin
ftp
jhons
leo
maria
neo
root
xerox
$ cat B.sorted
jhons
leo
maria
$ comm -3 A.sorted B.sorted
bin
ftp
neo
root
xerox
$ comm -3 A.sorted B.sorted > C
$ wc -l A.sorted B.sorted C
8 A.sorted
3 B.sorted
5 C
Looks correct to me.
And comm's man page says the following:
-3 suppress lines that appear in both files
This will definitely not add the 2 files together.
Could it be that the infiles are dos/windows text files instead of linux/unix text files?
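One way to test the DOS line-ending theory is sketched below. The sample files are invented for the demonstration, and whether this is what actually happened on the server is only a guess.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical data: A has Unix line endings, B has DOS (CRLF) endings.
printf '%s\n' bin ftp jhons leo > A
printf '%s\r\n' jhons leo > B

# With CRLF in B, "leo<CR>" never equals "leo", so nothing gets removed
# and all lines of A come through.
grep -Fvxf B A | grep -c ''

# Count the lines in B that end in a carriage return;
# a non-zero count means DOS line endings are present.
cr=$(printf '\r')
grep -c "$cr\$" B

# Strip the carriage returns, then the subtraction behaves as expected.
tr -d '\r' < B > B.unix
grep -Fvxf B.unix A > C
cat C
```

If dos2unix is installed it does the same job as the tr line; tr is used here only because it is available everywhere.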
Thanks druuna,
it works on my PC, which runs FC6. But when I ran it on the server system I was getting
wc -l C
51xxx
which is greater than file A. I will check it on that system.
Description
File A contains all the records; file B contains some of the records of file A, which means file B is a subset of file A.
The result file (out-file C) contains
C=A-B
C will contain all the records that aren't in file B. Thanks
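Given C=A-B, a quick sanity check is to compare line counts. This is a sketch under two assumptions: B really is a subset of A, and neither file contains duplicate names. The sample data is made up.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data where B is a subset of A.
printf '%s\n' root bin ftp neo jhons maria leo xerox > A
printf '%s\n' jhons maria leo > B
grep -Fvxf B A > C

# grep -c '' counts lines without the padding some wc versions add.
a=$(grep -c '' A)
b=$(grep -c '' B)
c=$(grep -c '' C)

# If B is a subset of A and there are no duplicates, C must have a - b lines.
if [ "$c" -eq $((a - b)) ]; then
    echo "OK: $c = $a - $b"
else
    echo "Mismatch: C has $c lines, expected $((a - b))"
fi
```

A result larger than A, like the 51xxx above, fails this check immediately and points to a problem with the input files or with the options used.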
Sounds exactly like my analysis:
Quote:
Originally Posted by archtoad6
By this do you mean that "C" should contain everything in "A", except what also appears in "B"?
Did you bother to try my grep -Fvxf technique? I showed its results & the code to generate my test files. If you're stuck on the sort; comm method, did you notice that I show the correct options for comm, -23, in my analysis of the relative speeds of the 2 ways that have been suggested? Sorry, I didn't explicitly point out the error I discovered in the earlier post, an error that may have been due to the slightly hazy phrasing of the problem.
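The difference between -3 and -23 only shows up when B holds a line that A lacks. A small sketch, with invented data in which "zed" appears in B only:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical data: "zed" is in B but not in A.
printf '%s\n' bin ftp jhons leo | sort > A.sorted
printf '%s\n' jhons leo zed | sort > B.sorted

# -3 hides only the common lines; lines unique to B still appear,
# indented with a tab, so the output is not purely "A minus B".
comm -3 A.sorted B.sorted > with3

# -23 hides column 2 as well, leaving only the lines unique to A.
comm -23 A.sorted B.sorted > with23

# Show the line count of each result file.
grep -c '' with3 with23
```

When B is a strict subset of A the two give the same output, which is why the small example earlier in the thread looked correct with -3.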
Did you read the associated "FM"s? That's always a good idea when trying unfamiliar commands. It's unfortunate that "RTFM" is brusque & a bit rude, because it's always good advice. Maybe if we use "RTFM " when it's meant as a good natured reminder.
BTW, re-reading my post I noticed that the following is unclear:
Quote:
Originally Posted by archtoad6
It is interesting to note 2 things: comm is slightly faster than grep, & the necessary sorting takes 83% of the time used by this method. While it's very fast in this test, using files of real user names will probably only slow it down.
I should have said "while comm ITSELF is slightly faster than grep". I didn't mean to obscure the fact that the total time for sort & comm combined is almost 6 times what it takes grep to do the job.