Duplicate removal/text manipulation

fdiaz05 · 03-17-2011, 06:41 PM

Hey guys, wondering if anyone can help with this little dilemma.

I'm trying to remove lines from a syslog text file that contain duplicate strings. For example:

Mar 10 06:51:11[http-8080-1] INFO com.MYCOMPANY.webservices.userservice.web.UserServiceController [u:2533274802474744|360] Authorize [platformI$tformIdAndOs=2533274802474744|360, userRegion=America|360]

then a few lines down

Mar 10 06:52:03 [http-8080-1] INFO com.MYCOMPANY.webservices.userservice.web.UserServiceController [u:2533274802474744|360] Authorize [platformI$tformIdAndOs=2533274802474744|360, userRegion=America|360

Both lines have the same u: number. I need to remove the duplicates and leave just one line per u: number. The file contains many duplicates across different u: numbers, and it's 14,000 lines long.

Can anyone tell me whether I can use awk, sed, or sort for something like this, i.e. removing lines that contain a certain string that's a duplicate?

Any help is appreciated! thanks

k3lt01 · 03-17-2011, 07:00 PM

I do a very similar thing when I build a hosts file by combining multiple files, which always leaves multiple entries for the same web addresses.

It probably won't serve your purpose, but it may give you a few tips.

Code:

sort /home/michael/hosts | tr '\t'  ' ' | tr -s ' ' | uniq >| /home/michael/hosts.new

fdiaz05 · 03-17-2011, 07:03 PM

Can you explain? I see it, but where is the indicator you are using in your case to do the text manipulation?

k3lt01 · 03-17-2011, 07:22 PM

sort- sorts the lines into alphabetical order so lines starting with a will be placed before lines starting with b etc.

tr '\t' ' ' | tr -s ' '- this part cleans up white space: the first tr turns each tab into a space, and tr -s ' ' squeezes runs of spaces down to a single space.

uniq- deletes duplicate entries, but only when the identical lines are next to each other, which is why the sort comes first. So if I have more than 1 line saying something like abcdefg.com in the combined host file, the output at the end will only have 1 abcdefg.com line in it.

Everything is in the man pages.
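To make the pieces above concrete, here is a minimal sketch run against made-up sample data. The function name dedupe_hosts and the sample entries are assumptions for illustration only. Note one deliberate variation from the one-liner above: the whitespace cleanup runs before sort, so lines that differ only in spacing end up adjacent by the time uniq sees them.

```shell
#!/bin/sh
# dedupe_hosts is a hypothetical helper name, not part of any tool.
dedupe_hosts() {
    # 1. tr '\t' ' '  : turn tabs into spaces
    # 2. tr -s ' '    : squeeze repeated spaces into one
    # 3. sort         : bring identical lines next to each other
    # 4. uniq         : drop adjacent duplicate lines
    tr '\t' ' ' | tr -s ' ' | sort | uniq
}

# Made-up sample: the first and third entries differ only in whitespace.
result=$(printf '127.0.0.1\tabcdefg.com\n127.0.0.1 example.org\n127.0.0.1   abcdefg.com\n' | dedupe_hosts)
printf '%s\n' "$result"
```

This only removes lines that are identical in full after cleanup; it does not compare on one field within the line.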

fdiaz05 · 03-17-2011, 07:41 PM

I did, but I'm looking for something that keys on a particular indicator: if the value after the u: is not unique, I need to remove that entire line. It's a bit tricky.
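One way this could be approached is with awk, keyed on the number after u: rather than on the whole line. This is only a sketch under a few assumptions: the key is the run of digits following "[u:" (the part after the | is ignored), the first line seen for each key is the one to keep, and lines with no u: field are passed through untouched. The function name keep_first_per_user and the shortened sample lines are made up for illustration.

```shell
#!/bin/sh
# keep_first_per_user is a hypothetical helper name.
keep_first_per_user() {
    awk '
        # Find "[u:" followed by digits; RSTART/RLENGTH mark the match.
        match($0, /\[u:[0-9]+/) {
            # Strip the 3-character "[u:" prefix to get just the digits.
            key = substr($0, RSTART + 3, RLENGTH - 3)
            # Skip the line if this key has been seen before.
            if (seen[key]++) next
        }
        # First occurrence of a key, or a line without a u: field.
        { print }
    '
}

# Shortened, made-up log lines: the first and third share a u: number.
deduped=$(printf '%s\n' \
    'Mar 10 06:51:11 [http-8080-1] INFO Controller [u:2533274802474744|360] Authorize' \
    'Mar 10 06:51:40 [http-8080-2] INFO Controller [u:1111111111111111|360] Authorize' \
    'Mar 10 06:52:03 [http-8080-1] INFO Controller [u:2533274802474744|360] Authorize' \
    | keep_first_per_user)
printf '%s\n' "$deduped"
```

Against a real file the function would be fed from it, e.g. keep_first_per_user < input.log > output.log, where both file names are placeholders. Because it does not sort, the surviving lines stay in their original order.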