LinuxQuestions.org
Old 01-05-2007, 12:25 AM   #1
Ammad
Member
 
Registered: Apr 2004
Distribution: redhat 9.0, fc4, redhat as 4
Posts: 522

Rep: Reputation: 31
script string modification


I created two files containing Linux user names. File A contains 49000 user names, while file B contains 800 user names. I want to create a file C containing all records of file A that are not in B.

C should contain 48200 user names.
Can anyone tell me how to do this? A sample of file A:

root
bin
ftp
apache
ammad
neo
 
Old 01-05-2007, 02:49 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
Hi,

Take a look at the diff and comm commands.

Something like the following will show unique entries in file1 only:

comm -3 file1 file2

See manpages for details.

PS: I do believe that both commands need a sorted input, so sort probably needs to be used too.
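
For reference, a small sketch of how comm's three output columns and the -1/-2/-3 switches interact. The file names and sample entries here are made up for illustration; only the sort and comm options are real. Note that -3 on its own still prints column 2 (lines found only in the second file), so -23 is the combination that leaves just the lines unique to the first file.

```shell
# Throwaway working directory; all file names below are placeholders.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

printf '%s\n' root bin ftp neo > file1
printf '%s\n' neo guest        > file2   # "guest" is NOT in file1

# comm expects both inputs sorted with the same collation.
LC_ALL=C sort file1 > file1.srt
LC_ALL=C sort file2 > file2.srt

# Column 1 = only in file1, column 2 = only in file2, column 3 = in both.
# -3 hides column 3 only, so "guest" still shows up (tab-indented).
only_3=$(LC_ALL=C comm -3 file1.srt file2.srt)

# -23 hides columns 2 and 3, leaving the lines unique to file1.
only_23=$(LC_ALL=C comm -23 file1.srt file2.srt)

printf 'comm -3:\n%s\n\ncomm -23:\n%s\n' "$only_3" "$only_23"
```

When every line of file2 also appears in file1, column 2 is empty and the two commands print the same thing.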

Last edited by druuna; 01-06-2007 at 12:17 PM. Reason: Changed a typo
 
Old 01-05-2007, 02:50 AM   #3
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,899

Rep: Reputation: 68
sort A > a.srt
sort B > b.srt
# keep only the lines diff marks with "<" (present in a.srt but not in b.srt)
diff a.srt b.srt | sed -n 's/^< //p' > C

Note that a.srt and b.srt are created (or overwritten), so change the file names if you need to.
Reading 'man sort', 'man diff', 'man sed' and 'man bash' is a good thing to put on your medium-term to-do list. Or 'info bash' and so on...
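
A sketch of what diff prints for two sorted lists, using made-up sample names. The normal diff format mixes change commands (such as 1,3d0) with data lines marked "<" (only in the first file) and ">" (only in the second), so a filter that keeps just the "<" lines is one way to get "A but not B". The file names follow the commands above; the sample contents are assumptions.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

printf '%s\n' root bin ftp apache ammad neo > A
printf '%s\n' ftp neo                       > B

sort A > a.srt
sort B > b.srt

# Raw diff output: change commands plus "< " / "> " data lines.
# diff exits with status 1 when the files differ, which is expected here.
diff a.srt b.srt || true

# Keep only the lines that diff marks as present in a.srt alone and
# strip the "< " marker. -n plus the p flag prints substituted lines only.
diff a.srt b.srt | sed -n 's/^< //p' > C

cat C
```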
 
Old 01-05-2007, 09:05 AM   #4
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 233
Quote:
Originally Posted by Ammad
I want to create a file C containing all records of file A that are not in B.
By this do you mean that "C" should contain everything in "A", except what also appears in "B"? If so,
Code:
grep -Fvxf B A >C
should work & quite quickly -- it's the "F" option that makes it fast. It also doesn't disturb the order of the file.
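
To see what each letter in -Fvxf contributes, here is a small sketch with invented entries (the file names A, B and C follow the thread; the sample data and the extra output files are assumptions). -f B reads the patterns from file B, -F treats them as fixed strings rather than regular expressions, -x requires a pattern to match the whole line, and -v inverts the match so only non-matching lines are printed.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

printf '%s\n' root leo leonard a.b axb > A
printf '%s\n' leo a.b                  > B

# Whole-line, fixed-string matching: only "leo" and "a.b" are removed.
grep -Fvxf B A > C

# Without -x, "leo" also matches inside "leonard", which is then lost.
grep -Fvf B A > C_no_x

# Without -F, the "." in "a.b" is a regex wildcard and also matches "axb".
grep -vxf B A > C_no_F

cat C
```

With this data C keeps root, leonard and axb, while C_no_x loses leonard and C_no_F loses axb, which is why all four options matter for a list of user names.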

I had a problem of similar magnitude a few months back, a long list of fixed patterns & a longer file to search through. That is when I discovered the "F" option.

Running on a 1GHz box w/ 512MiB of RAM, here are some speed tests that simulate the file sizes you mentioned --

Set up 2 test files using numbers instead of names, reversing the order to approximate real life disorder:
Code:
$ >1; for X in {49000..1}; do echo $X >>1; done
$ >2; for X in {999..200}; do echo $X >>2; done
Time the grep technique:
Code:
$ time  grep -Fvxf 2 1 >3

real    0m0.088s
user    0m0.044s
sys     0m0.006s
Time the sort & comm technique:
Code:
$ time sort 1 > 1s

real    0m0.400s
user    0m0.351s
sys     0m0.013s

$ time sort 2 > 2s

real    0m0.009s
user    0m0.005s
sys     0m0.004s

$ time comm -23 1s 2s >3s

real    0m0.084s
user    0m0.046s
sys     0m0.007s
Verify the file sizes:
Code:
$ wc {1..3}{,s}
  49000   49000  282894 1
  49000   49000  282894 1s
    800     800    3200 2
    800     800    3200 2s
  48200   48200  279694 3
  48200   48200  279694 3s
 196000  196000 1131576 total
It is interesting to note 2 things: comm is slightly faster than grep, & the necessary sorting takes 83% of the time used by this method. While it's very fast in this test, using files of real user names will probably only slow it down.
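
If the sorted copy of the big file is not needed afterwards, the sort and comm steps can be chained, since comm accepts "-" for standard input. This is only a sketch on a scaled-down version of the numeric test data above; the file names A, B and C and the LC_ALL=C setting (to keep sort and comm on the same collation) are assumptions, not part of the timings shown. In bash, comm -23 <(sort A) <(sort B) does the same without any temporary file.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

# Scaled-down test data: 20..1 in A, 9..5 in B, both in reverse order.
seq 20 -1 1 > A
seq 9 -1 5  > B

export LC_ALL=C   # same collation for sort and comm

sort B > B.sorted
# "-" tells comm to read its first input from the pipe.
sort A | comm -23 - B.sorted > C

wc -l < C
```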
 
Old 01-06-2007, 02:10 AM   #5
Ammad
Member
 
Registered: Apr 2004
Distribution: redhat 9.0, fc4, redhat as 4
Posts: 522

Original Poster
Rep: Reputation: 31
Thanks to all, but this is what I want to do.

file-A
root
bin
ftp
neo
jhons
maria
leo
xerox

file-B
jhons
maria
leo

out-file-c

root
bin
ftp
neo
xerox


Description
file-A contains all records; file-B contains some of the records of file-A, meaning file-B is a subset of file-A.
Result file (out-file-c) contains

C=A-B

C will contain all records that aren't in file-B.

thanks
 
Old 01-06-2007, 06:02 AM   #6
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
Hi,

The above answer(s) do solve your problem. To recap:

1) sort file A (sort A > A.sorted),
2) sort file B (sort B > B.sorted),
3) get unique entries in file A.sorted (comm -3 A.sorted B.sorted > C.unique).

Steps 1 and 2 are essential.

The grep example given by archtoad6 (grep -Fvxf B A > C) also works (no need to sort files A and B first).

You do need to try out the given commands, and read the info/man pages.
 
Old 01-06-2007, 11:40 AM   #7
Ammad
Member
 
Registered: Apr 2004
Distribution: redhat 9.0, fc4, redhat as 4
Posts: 522

Original Poster
Rep: Reputation: 31
Thanks druuna,
but comm -3 outputs both the column 1 and the column 2 values.

This is what I did:
file1=100 entries
file2=10 entries


sort file1 > file1.srt
sort file2 > file2.srt

comm -3 file1.srt file2.srt > file3


wc -l file3
110
 
Old 01-06-2007, 12:11 PM   #8
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,899

Rep: Reputation: 68
You were told to use 'comm -1', as I understood it. Also, did you try the other solutions, like 'grep -Fvxf' or diff?
 
Old 01-06-2007, 12:14 PM   #9
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
Hi,

I really don't understand what you are doing (or trying to do). It's probably not what you are asking for.

Below is an example based on the data you gave in post 5 of this thread:
Code:
$ cat A
root
bin
ftp
neo
jhons
maria
leo
xerox

$ cat B
jhons
maria
leo

$ cat A.sorted 
bin
ftp
jhons
leo
maria
neo
root
xerox

$ cat B.sorted 
jhons
leo
maria

$ comm -3 A.sorted B.sorted
bin
ftp
neo
root
xerox

$ comm -3 A.sorted B.sorted > C

$ wc -l A.sorted B.sorted C
 8 A.sorted
 3 B.sorted
 5 C
Looks correct to me.

And comm's man page says the following: "-3 suppress lines that appear in both files". This will definitely not add the 2 files together.

Could it be that the infiles are dos/windows text files instead of linux/unix text files?
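
One way to check for that, sketched with made-up file names: a line ending in a carriage return (DOS/Windows style) never equals the same name without it, so comm, diff and grep -x all treat the two as different entries. The sketch below counts the lines containing a CR and strips them with tr; A.dos and A.unix are placeholders.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

# Simulate a DOS/Windows text file: every line ends in CR LF.
printf '%s\r\n' root bin ftp > A.dos

cr=$(printf '\r')

# Count the lines that contain a carriage return (0 for a clean Unix file).
# grep -c exits non-zero when the count is 0, hence the "|| true".
dos_lines=$(grep -c "$cr" A.dos || true)
echo "lines with CR: $dos_lines"

# Delete every carriage return to get Unix line endings.
tr -d '\r' < A.dos > A.unix

unix_lines=$(grep -c "$cr" A.unix || true)
echo "lines with CR after cleanup: $unix_lines"
```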

Hope this helps.
 
Old 01-06-2007, 12:39 PM   #10
Ammad
Member
 
Registered: Apr 2004
Distribution: redhat 9.0, fc4, redhat as 4
Posts: 522

Original Poster
Rep: Reputation: 31
Thanks druuna,
it works on my PC, where I am using FC6. When I ran it on the server system, I was getting
wc -l C
51xxx
which is greater than file A. I will check it on that system.

Thanks for your precious time.
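
A result larger than file A suggests that some lines of B have no exact match in A on that system, in which case comm -3 prints them in column 2 on top of the lines unique to A. A quick sketch for checking that, with invented sample data: comm -13 lists the lines found only in the second file, which should be empty if B really is a subset of A. The trailing spaces in the sample B are a deliberate, hypothetical cause of the mismatch, not a diagnosis of the real files.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

printf '%s\n'  root bin ftp neo leo > A
printf '%s \n' neo leo              > B   # note the trailing space on each line

LC_ALL=C sort A > A.sorted
LC_ALL=C sort B > B.sorted

# Lines that are in B but have no exact match in A.
LC_ALL=C comm -13 A.sorted B.sorted > only_in_B
echo "entries of B missing from A: $(wc -l < only_in_B)"

# comm -3 keeps those lines (tab-indented), so the result outgrows A;
# comm -23 drops them.
LC_ALL=C comm -3  A.sorted B.sorted | wc -l
LC_ALL=C comm -23 A.sorted B.sorted | wc -l
```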
 
Old 01-07-2007, 09:37 AM   #11
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 233
Quote:
Originally Posted by Ammad
Thanks to all, but this is what I want to do. . . .

Description
file-A contains all records; file-B contains some of the records of file-A, meaning file-B is a subset of file-A.
Result file (out-file-c) contains

C=A-B

C will contain all records that aren't in file-B. Thanks
Sounds exactly like my analysis:
Quote:
Originally Posted by archtoad6
By this do you mean that "C" should contain everything in "A", except what also appears in "B"?
Did you bother to try my grep -Fvxf technique? I showed its results & the code to generate my test files. If you're stuck on the sort; comm method, did you notice that I show the correct options for comm, -23, in my analysis of the relative speeds of the 2 ways that have been suggested? Sorry, I didn't explicitly point out the error I discovered in the earlier post, an error that may have been due to the slightly hazy phrasing of the problem.

Did you read the associated "FM"s? That's always a good idea when trying unfamiliar commands. It's unfortunate that "RTFM" is brusque & a bit rude, because it's always good advice. Maybe if we use "RTFM " when it's meant as a good natured reminder.

BTW, re-reading my post I noticed that the following is unclear:
Quote:
Originally Posted by archtoad6
It is interesting to note 2 things: comm is slightly faster than grep, & the necessary sorting takes 83% of the time used by this method. While it's very fast in this test, using files of real user names will probably only slow it down.
I should have said "while comm ITSELF is slightly faster than grep". I didn't mean to obscure the fact that the total time for sort & comm combined is almost 6 times what it takes grep to do the job.
 
  

