Script to remove repetition from file

talat · 02-21-2008, 06:27 AM

Hi Guys,

Consider the following scenario. I have a file which has a list of users, e.g.

jone
micheal
jone
jone
steve
adam
steve

Now, as you can see, this list has repetition as well. I need to remove the repetition from this file, as it has hundreds of entries. Can I have a sample script? Please guide.

acid_kewpie · 02-21-2008, 06:29 AM

Code:

sort file.txt | uniq
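
The sort step matters here because uniq only collapses adjacent duplicate lines. A small sketch, assuming the sample names from the first post (sample_users is a hypothetical helper that stands in for file.txt):

```shell
# Hypothetical helper: prints the sample list from the question, one name per line.
sample_users() { printf '%s\n' jone micheal jone jone steve adam steve; }

# uniq alone only merges adjacent repeats, so non-adjacent duplicates survive.
sample_users | uniq

# Sorting first brings all duplicates together, so uniq can remove them.
sample_users | sort | uniq
```

The first pipeline still prints jone and steve twice; the second prints each name once, in sorted order.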

AnanthaP · 02-21-2008, 06:43 AM

I thought of sort too but that would destroy the original order.

So using awk (maybe):
In each line,
if associative_array($0) doesnt exist, then the value in the array is NR;
On EOF,
Sort by the value and dump out.

Have to develop it but seems OK.

End
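
One possible sketch of the idea outlined above, assuming a POSIX awk plus sort and cut. The function name dedupe_keep_order and the array name first are made up for illustration, and the final sort is done outside awk rather than inside it:

```shell
# Hypothetical helper; reads the list from file arguments or from stdin.
dedupe_keep_order() {
    awk '
        # Record the line number of the first appearance of each line.
        !($0 in first) { first[$0] = NR }
        # At EOF, emit "line-number<TAB>line" for each distinct line.
        END { for (line in first) print first[line] "\t" line }
    ' "$@" |
    sort -n |      # restore the original order by first line number
    cut -f2-       # drop the line-number column
}

printf '%s\n' jone micheal jone jone steve adam steve | dedupe_keep_order
```

With the sample list this should print jone, micheal, steve, adam, i.e. each name once in order of first appearance.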

acid_kewpie · 02-21-2008, 06:49 AM

True, but I can't see that the original order could really matter in this scenario.

talat · 02-23-2008, 03:43 AM

Many thanks guys

/bin/bash · 02-23-2008, 05:47 AM

$ cat file
jone
micheal
jone
jone
steve
adam
steve

$ sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' file
jone
micheal
steve
adam

HTH
HANDY ONE-LINERS FOR SED (Unix stream editor) Apr. 26, 2004
Latest version of this file is usually at:
http://sed.sourceforge.net/sed1line.txt
http://www.student.northpark.edu/pem...d/sed1line.txt

ghostdog74 · 02-23-2008, 06:50 AM

Code:

# sort -u file
adam
jone
micheal
steve

# awk '!x[$0]++' file
jone
micheal
steve
adam

Quote:

sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' file

don't think OP will understand.

pixellany · 02-23-2008, 07:04 AM

Quote:

sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' file

don't think OP will understand.

I'm not sure if there are 100 people in the WORLD who would understand.....

They say that C gives you the power to write incomprehensible code. SED's pretty good at that too........

angrybanana · 02-23-2008, 11:11 PM

Quote:

sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' file

Wow.. my head hurts just looking at that. I'm not that great with sed, can someone please explain that?

anyways, here's a shorter/more readable awk solution to your problem

Code:

$ awk 'seen[$0]!=1{print} {seen[$0]=1}' file
jone
micheal
steve
adam
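
For anyone wanting the sed one-liner unpacked, here is a commented sketch of the sed1line.txt form of the command, one command per line. The wrapper name dedupe_sed is hypothetical, and LC_ALL=C is an assumption added so that the [ -~] range is read in plain ASCII order:

```shell
# Hypothetical wrapper; reads from file arguments or from stdin.
dedupe_sed() {
    LC_ALL=C sed -n '
        # Append the hold space (the lines kept so far) to the current line.
        G
        # Double the first newline, so the history part always begins with a newline.
        s/\n/&&/
        # If the current line plus its newline shows up again later in the
        # buffer, it is a repeat: delete the buffer and start the next cycle.
        /^\([ -~]*\n\).*\n\1/d
        # Not a repeat: undo the doubled newline,
        s/\n//
        # save "current line + history" back to the hold space,
        h
        # and print only up to the first newline, i.e. the current line.
        P
    ' "$@"
}

printf '%s\n' jone micheal jone jone steve adam steve | dedupe_sed
```

The hold space grows with every distinct line, so for very large inputs the awk versions are likely the more practical choice.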

kaz2100 · 02-24-2008, 09:28 AM

Hya,

I am trying to understand that sed command (and regular expression). However, it seems that I need more time.

So far, I have found that the script works with sed on a Macintosh (most probably the BSD one; sed -v or --version gives me an error). But GNU sed version 4.1.5 (on Penguin, Debian lenny and etch) does not work, even with the --posix option.

I will update.

Happy Penguins!

kaz2100 · 02-24-2008, 12:12 PM

Hya,

update to post #10.

After

Code:

setenv LANG C

the sed script works as expected. LANG was en_US, when the script did not work.

Now I know it is off topic.

Happy Penguins!
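
Note that setenv is csh/tcsh syntax. A rough equivalent for Bourne-style shells (sh, bash) is sketched below, assuming the sed1line.txt form of the one-liner and the sample list from the first post. LC_ALL is used instead of LANG because it takes precedence over LANG and the individual LC_* variables:

```shell
# Bourne-style equivalent of the csh line, for the rest of the session:
#   LANG=C; export LANG
# Or set the locale for a single command only. LC_ALL takes precedence over LANG.
deduped=$(printf '%s\n' jone micheal jone jone steve adam steve |
    LC_ALL=C sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P')
printf '%s\n' "$deduped"
```

Setting the variable only on the sed command leaves the locale of the rest of the session untouched.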

pixellany · 02-24-2008, 12:12 PM

$ sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' file

I can decipher everything except the part in bold.
"[a-f]" means anything in the range of a thru f (it can also mean A thru F---it does on my system).

I assume that "[ -~]" is meant to mean everything from " " (space) to "~". After several experiments, I am finding that ranges that include more than alphas and digits can be ambiguous and unpredictable, if for no other reason than that characters within a range can have a special meaning. I have never seen anything about this in the books.

makyo · 02-24-2008, 01:44 PM

Hi.

Quote:

Originally Posted by pixellany

... I have never seen anything about this in the books.

Quote:

"Caution: ranges are locale-sensitive, and thus not portable."

-- Classic Shell Programming, page 34, POSIX meta-characters table, Robbins and Beebe, O'Reilly, 2005

On the other hand, I skimmed Effective AWK Programming, and didn't see any warning, nor in Programming Perl, 3rd. Perhaps such warnings are taken for granted by the time one is ready for awk and perl ... cheers, makyo
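
A small sketch of the portable side of that caution, assuming a POSIX grep. The function name in_a_to_f is hypothetical. Only the C-locale behaviour is shown, because the result in other locales depends on the system and tool version:

```shell
# Reports whether the given character falls inside [a-f] under the C locale.
in_a_to_f() {
    if printf '%s\n' "$1" | LC_ALL=C grep -q '[a-f]'; then
        echo "match"
    else
        echo "no match"
    fi
}

in_a_to_f b    # match
in_a_to_f B    # no match: in the C locale, ranges follow ASCII code order
```

Without the LC_ALL=C prefix, whether B matches may vary from one system to another, which is the non-portability the quoted warning refers to.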

chrism01 · 02-24-2008, 05:28 PM

That's why Perl has the

use locale;

pragma available.

Actually I thought this page would mention it (http://perldoc.perl.org/perltrap.html) but it doesn't

/bin/bash · 03-02-2008, 08:10 AM

I can't find my handy little reference but I believe [:print:] and [ -~] are the same thing.
So it would match any non control character, i.e. any ascii character not below char(32).
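
That belief can be checked for the C locale with a short sketch like the one below. It assumes an awk whose printf "%c" turns a number into the character with that code. The variable names are made up for illustration. tr takes the class as [:print:]; inside a sed or grep bracket expression it would be written [[:print:]].

```shell
# All ASCII characters 1..127 except newline, as one string.
ascii=$(LC_ALL=C awk 'BEGIN { for (i = 1; i < 128; i++) if (i != 10) printf "%c", i }')

# Keep only the characters inside each set (tr -cd deletes everything else).
by_range=$(printf '%s' "$ascii" | LC_ALL=C tr -cd ' -~')
by_class=$(printf '%s' "$ascii" | LC_ALL=C tr -cd '[:print:]')

if [ "$by_range" = "$by_class" ]; then
    echo "same set in the C locale"
else
    echo "different sets"
fi
echo "${#by_range} characters"    # space (32) through ~ (126) is 95 characters
```

Both sets leave out the control characters below 32 as well as DEL (127). In other locales [:print:] can include additional characters, so the equivalence is only claimed here for the C locale.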