script string modification

Ammad · 01-04-2007, 11:25 PM

I created two files containing Linux usernames. File A contains 49000 usernames, while file "B" contains 800 usernames. I want to create a file "C" having all records of file A but not B.

C should contain 48200 usernames.
Can anyone tell me how to do this? A sample of file A:

root
bin
ftp
apache
ammad
neo

druuna · 01-05-2007, 01:49 AM

Hi,

Take a look at the diff and comm commands.

Something like the following will show unique entries in file1 only:

comm -3 file1 file2

See manpages for details.

PS: I do believe that both commands need a sorted input, so sort probably needs to be used too.

raskin · 01-05-2007, 01:50 AM

sort A > a.srt
sort B > b.srt
diff a.srt b.srt | sed -n 's/^< //p' > C

Note that a.srt and b.srt are created (or overwritten), so change the file names if you need to.
Reading 'man sort', 'man diff', 'man sed' and 'man bash' is a good thing to include in your medium-term to-do list. Or 'info bash' and so on...
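
A small sketch of the diff approach on the sample names from the first post, keeping only the lines that diff marks with "< " (those found only in the first file). The scratch directory and the contents of B are assumptions for illustration:

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

printf '%s\n' root bin ftp apache ammad neo > A   # sample names from the first post
printf '%s\n' ftp neo > B                         # hypothetical subset to remove

sort A > a.srt
sort B > b.srt

# diff prefixes lines found only in the first file with "< ".
# sed -n with the p flag prints only those lines, so change markers
# such as "1,3d0" and any "> " lines never reach C.
diff a.srt b.srt | sed -n 's/^< //p' > C

cat C
```

Note that C comes out in sorted order, not in the original order of A.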

archtoad6 · 01-05-2007, 08:05 AM

Quote:

Originally Posted by Ammad

I want to create a file "C" having all records of file A but not B.

By this do you mean that "C" should contain everything in "A", except what also appears in "B"? If so,

Code:

grep -Fvxf B A >C

should work & quite quickly -- it's the "F" option that makes it fast. It also doesn't disturb the order of the file.
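
A minimal sketch of what each flag contributes, using the sample names from the first post. The scratch directory, the contents of B, and the extra name "neon" are assumptions for illustration:

```shell
cd "$(mktemp -d)"

# "neon" is a made-up name, added to show what -x protects against.
printf '%s\n' root bin ftp apache ammad neo neon > A
printf '%s\n' ftp neo > B

# -f B  read the patterns from file B, one per line
# -F    treat each pattern as a fixed string, not a regular expression
# -x    match only when a pattern equals the whole line
# -v    invert the match: print the lines of A that match no pattern
grep -Fvxf B A > C

cat C
```

Without -x, the pattern "neo" would also match "neon" as a substring and remove it from C.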

I had a problem of similar magnitude a few months back, a long list of fixed patterns & a longer file to search through. That is when I discovered the "F" option.

Running on a 1GHz box w/ 512MiB of RAM, here are some speed tests that simulate the file sizes you mentioned --

Set up 2 test files using numbers instead of names, reversing the order to approximate real life disorder:

Code:

$ >1; for X in {49000..1}; do echo $X >>1; done
$ >2; for X in {999..200}; do echo $X >>2; done

Time the grep technique:

Code:

$ time  grep -Fvxf 2 1 >3

real    0m0.088s
user    0m0.044s
sys     0m0.006s

Time the sort & comm technique:

Code:

$ time sort 1 > 1s

real    0m0.400s
user    0m0.351s
sys     0m0.013s

$ time sort 2 > 2s

real    0m0.009s
user    0m0.005s
sys     0m0.004s

$ time comm -23 1s 2s >3s

real    0m0.084s
user    0m0.046s
sys     0m0.007s

Verify the file sizes:

Code:

$ wc {1..3}{,s}
  49000   49000  282894 1
  49000   49000  282894 1s
    800     800    3200 2
    800     800    3200 2s
  48200   48200  279694 3
  48200   48200  279694 3s
 196000  196000 1131576 total

It is interesting to note 2 things: comm is slightly faster than grep, & the necessary sorting takes 83% of the time used by this method. While it's very fast in this test, using files of real user names will probably only slow it down.
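
The percentages can be rechecked from the "real" timings above. This is only arithmetic on the numbers already shown, done in awk for convenience:

```shell
# Timings (real, in seconds) copied from the runs above.
summary=$(awk 'BEGIN {
    sort1 = 0.400; sort2 = 0.009; comm = 0.084; grep = 0.088
    total = sort1 + sort2 + comm
    printf "sort + comm total: %.3fs\n", total
    printf "share spent sorting: %.0f%%\n", 100 * (sort1 + sort2) / total
    printf "total vs grep: %.1fx\n", total / grep
}')

echo "$summary"
```

Single runs this short are noisy, so the ratios are rough indications, not benchmarks.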

Ammad · 01-06-2007, 01:10 AM

thanks for all, but i want to do this.

file-A
root
bin
ftp
neo
jhons
maria
leo
xerox

file-B
jhons
maria
leo

out-file-c

root
bin
ftp
neo
xerox

Description
file-A contains all records; file-B contains some records of file-A, meaning file-B is a subset of file-A.
Result file (out-file-c) contains

C=A-B

C will contain all records that aren't in file-b

thanks

druuna · 01-06-2007, 05:02 AM

Hi,

The above answer(s) do solve your problem. To recap:

1) sort file A (sort A > A.sorted),
2) sort file B (sort B > B.sorted),
3) get unique entries in file A.sorted (comm -3 A.sorted B.sorted > C.unique).

Steps 1 and 2 are essential.

The grep example given by archtoad6 (grep -Fvxf B A > C) also works (no need to sort files A and B first).

You do need to try out the given commands, and read the info/man pages.
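
One detail worth checking in step 3 is which columns comm suppresses. A sketch, assuming already-sorted input and a made-up name "zed" that appears only in B:

```shell
cd "$(mktemp -d)"

printf '%s\n' bin ftp jhons leo neo root > A.sorted   # already in sorted order
printf '%s\n' jhons leo zed > B.sorted                # "zed" is a made-up name not in A

# -3 hides only the "in both files" column, so lines unique to B remain,
# indented by a tab as column 2.
comm -3 A.sorted B.sorted > C.with3

# -23 also hides column 2, leaving exactly the lines unique to A.
comm -23 A.sorted B.sorted > C.with23

cat C.with3
echo ---
cat C.with23
```

When B is a strict subset of A, column 2 is empty and both forms give the same output.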

Ammad · 01-06-2007, 10:40 AM

thanks druuna,
but comm -3 outputs both the column 1 and column 2 values

and i did this
file1=100 entries
file2=10 entries

sort file1 > file1.srt
sort file2 > file2.srt

comm -3 file1.srt file2.srt > file3

wc -l file3
110

raskin · 01-06-2007, 11:11 AM

You were told to use 'comm -1', as I understood. Also did you try other solutions, like 'grep -Fvxf' or diff ?

druuna · 01-06-2007, 11:14 AM

Hi,

I really don't understand what you are doing (or trying to do). It's probably not what you are asking for.

Below is an example based on the data you gave in post 5 of this thread:

Code:

$ cat A
root
bin
ftp
neo
jhons
maria
leo
xerox

$ cat B
jhons
maria
leo

$ cat A.sorted 
bin
ftp
jhons
leo
maria
neo
root
xerox

$ cat B.sorted 
jhons
leo
maria

$ comm -3 A.sorted B.sorted
bin
ftp
neo
root
xerox

$ comm -3 A.sorted B.sorted > C

$ wc -l A.sorted B.sorted C
 8 A.sorted
 3 B.sorted
 5 C

Looks correct to me.

And comm's man page says the following:

-3 suppress lines that appear in both files

This will definitely not add the 2 files together.

Could it be that the infiles are dos/windows text files instead of linux/unix text files?
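
A sketch of how DOS line endings could produce that symptom, assuming one file has CRLF endings and the other plain LF (file names and contents are made up). A count of 110 from 100 + 10 entries is what comm -3 prints when no line in the two files compares equal, and CRLF is one possible cause:

```shell
cd "$(mktemp -d)"
export LC_ALL=C   # byte-wise ordering, so the result does not depend on the locale

printf '%s\r\n' bin ftp jhons leo > A.dos   # CRLF line endings, as a Windows editor would write
printf '%s\n' jhons leo > B.unix            # plain LF line endings

# "jhons\r" and "jhons" are different strings, so nothing counts as common
# and comm -3 prints every line from both files: 4 + 2 = 6 lines.
comm -3 A.dos B.unix | wc -l

# Stripping the carriage returns first restores the expected result.
tr -d '\r' < A.dos > A.unix
comm -23 A.unix B.unix > C

cat C
```

Running "file A" or "cat -A A" (GNU cat) on the real input is a quick way to see whether carriage returns are present.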

Hope this helps.

Ammad · 01-06-2007, 11:39 AM

thanks druuna
It works on my PC, where I am using FC6. When I ran it on the server system, I was getting
wc -l C
51xxx
which was greater than file A. i will check it on that system.

thanks for your precious time.

archtoad6 · 01-07-2007, 08:37 AM

Quote:

Originally Posted by Ammad

thanks for all, but i want to do this. . . .

Description
file-A contains all records; file-B contains some records of file-A, meaning file-B is a subset of file-A.
Result file (out-file-c) contains

C=A-B

C will contain all records that aren't in file-b thanks

Sounds exactly like my analysis:

Quote:

Originally Posted by archtoad6

By this do you mean that "C" should contain everything in "A", except what also appears in "B"?

Did you bother to try my grep -Fvxf technique? I showed its results & the code to generate my test files. If you're stuck on the sort; comm method, did you notice that I show the correct options for comm, -23, in my analysis of the relative speeds of the 2 ways that have been suggested? Sorry, I didn't explicitly point out the error I discovered in the earlier post, an error that may have been due to the slightly hazy phrasing of the problem.

Did you read the associated "FM"s? That's always a good idea when trying unfamiliar commands. It's unfortunate that "RTFM" is brusque & a bit rude, because it's always good advice. Maybe if we use "RTFM :)" when it's meant as a good-natured reminder.

BTW, re-reading my post I noticed that the following is unclear:

Quote:

Originally Posted by archtoad6

It is interesting to note 2 things: comm is slightly faster than grep, & the necessary sorting takes 83% of the time used by this method. While it's very fast in this test, using files of real user names will probably only slow it down.

I should have said "while comm ITSELF is slightly faster than grep". I didn't mean to obscure the fact that the total time for sort & comm combined is almost 6 times what it takes grep to do the job.