Author Topic: Searching for duplicate names in a file..  (Read 770 times)

Agent007

  • Member
  • **
  • Posts: 120
  • Kudos: 0
« on: 20 January 2003, 16:23 »
Hi all,


I have a text file with tonnes of usernames, and I need to search it for duplicate usernames. Doing this by hand is very tedious, so is there some script that would do the trick? I was also wondering whether vi would be able to search and delete them?


thanks,
007
AMD Athlon processor
256MB SDRAM
Linux Distro - RedHat 9.0

flap

  • Member
  • **
  • Posts: 1,268
  • Kudos: 137
« Reply #1 on: 20 January 2003, 16:45 »
sort filename | uniq > newfilename

will remove duplicates from filename and output the new list to newfilename.
"While envisaging the destruction of imperialism, it is necessary to identify its head, which is none other than the United States of America." - Ernesto Che Guevara

http://counterpunch.org
http://globalresearch.ca


voidmain

  • VIP
  • Member
  • ***
  • Posts: 5,605
  • Kudos: 184
    • http://voidmain.is-a-geek.net/
« Reply #2 on: 20 January 2003, 17:24 »
You can also just add the "-u" param to the "sort" command:

"sort -u" is basically the same as saying "sort | uniq".  Now "sort -u filename" will only throw out duplicates if "entire lines" are the same.

e.g. A file with data like this:

...
joe
bill
joe
mary
joe
...

after running the sort -u command on it will output:

...
bill
joe
mary
...

Now if the file contained:

...
joe:Studly
bill:Nurdy
joe:studly
mary:Sexy
joe:studly dude
...

then "sort -u" would output the following, since no two lines match exactly:

...
bill:Nurdy
joe:Studly
joe:studly
joe:studly dude
mary:Sexy
...

However, if there is some format to the text file containing your usernames then you can use sort keys. And of course you could "cut" out the columns or fields you want to find unique patterns on, and you may want to ignore case. Sort is really a powerful tool and here is the "info" page that doesn't do it full justice:

http://voidmain.kicks-ass.net/man/?parm=sort&docType=info
http://voidmain.kicks-ass.net/man/?parm=sort&docType=man

If your file doesn't just contain the simple userid all in the same case and you are having trouble getting the right command, just paste in a sample of the file and I can give you a command to use to accomplish what you want.
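
As a rough sketch of the two ideas mentioned above (sort keys and ignoring case), here is how they might look on the joe/bill/mary sample from this post. The scratch file and the variable names are only illustrative, and LC_ALL=C is set as an assumption so that the comparison does not depend on your locale:

```shell
# Sample data from the example above, written to a scratch file
# (the variable name "users" is only illustrative).
users=$(mktemp)
printf '%s\n' 'joe:Studly' 'bill:Nurdy' 'joe:studly' 'mary:Sexy' 'joe:studly dude' > "$users"

# Treat ':' as the field separator and compare on field 1 only,
# so every "joe:..." line counts as the same key and one is kept.
by_name=$(LC_ALL=C sort -t: -k1,1 -u "$users")
printf '%s\n' "$by_name"

# Compare whole lines, but fold lower case to upper case first,
# so "joe:Studly" and "joe:studly" count as duplicates.
ignoring_case=$(LC_ALL=C sort -f -u "$users")
printf '%s\n' "$ignoring_case"

rm -f "$users"
```

The first command leaves one line per name; the second still keeps "joe:studly dude" because that line differs in more than just case.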

[ January 20, 2003: Message edited by: void main ]

Someone please remove this account. Thanks...

Agent007

« Reply #3 on: 20 January 2003, 21:33 »
Void Main,

As you can see below, "user11" appears twice. I want only one instance of that username.

 
quote:

  n0 tty n0    skdj@xyzt Async interface      00:00:06   PPP: 900.n.00.n82
  n0 tty n0    lala@xyz Async interface      00:00:0n   PPP: 900.n.00.n29
  n0 tty n0    user11@xyz Async interface      00:00:00   PPP: 900.n.00.n42
  n0 tty n0    user11@xyzt Async interface      00:0n:49   PPP: 900.n.00.n42
  n0 tty n0    weoi@xyzt Async interface      00:00:00   PPP: 900.n.00.n85
  nn tty nn    awer@xyz Async interface      00:00:52   PPP: 900.n.00.n00
  66 tty 66    it@xyzt Async interface      00:02:04   PPP: 900.6.00.649




I tried the sort command below, but it did not remove the duplicates. Please give me the correct syntax.

 
quote:

[root@localhost root]# sort -u test.txt | uniq > test1.txt




thanks & rgds,
007

voidmain

« Reply #4 on: 20 January 2003, 22:30 »
A couple of things. If you use "-u" on your sort then you don't need "uniq" in your pipe as well. However, whether you use "sort -u" or "sort | uniq", entire lines are compared unless you give a key or field parameter. If all you are concerned about is seeing the IDs and nothing else then this command would do what you want:

$ cut -f4 -d' ' test.txt | cut -f1 -d'@' | sort -u

Which will list your IDs like:

awer
it
lala
skdj
user11
weoi
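
If the aim is to see which IDs are repeated, rather than to remove the repeats, one possible variation is to sort without "-u" and pass the result to "uniq -d", which prints one copy of each line that occurs more than once. This is only a sketch: it assumes the fields are separated by single spaces, and the scratch file stands in for test.txt.

```shell
# Sample lines in the same layout as the posted data (single spaces assumed).
log=$(mktemp)
cat > "$log" <<'EOF'
n0 tty n0 skdj@xyzt Async interface 00:00:06 PPP: 900.n.00.n82
n0 tty n0 lala@xyz Async interface 00:00:0n PPP: 900.n.00.n29
n0 tty n0 user11@xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 user11@xyzt Async interface 00:0n:49 PPP: 900.n.00.n42
n0 tty n0 weoi@xyzt Async interface 00:00:00 PPP: 900.n.00.n85
EOF

# Extract the IDs, sort them so repeats sit next to each other, then let
# "uniq -d" print one copy of each ID that occurs more than once.
dupes=$(cut -f4 -d' ' "$log" | cut -f1 -d'@' | sort | uniq -d)
printf '%s\n' "$dupes"

rm -f "$log"
```

With these five sample lines the only ID printed is user11.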

Do you want the entire line output? And what command did your data come from? Is the formatting exactly the same in your sample vs what is in your actual file? If the @ sign wasn't there it would be easy with a single sort command, in fact we could change the @ sign to a space and have it do what you want by:

$ cat test.txt | tr '@' ' ' | sort -k4,4 -u

which lists this:
nn tty nn awer xyz Async interface 00:00:52 PPP: 900.n.00.n00
66 tty 66 it xyzt Async interface 00:02:04 PPP: 900.6.00.649
n0 tty n0 lala xyz Async interface 00:00:0n PPP: 900.n.00.n29
n0 tty n0 skdj xyzt Async interface 00:00:06 PPP: 900.n.00.n82
n0 tty n0 user11 xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 weoi xyzt Async interface 00:00:00 PPP: 900.n.00.n85

I would have to think a little more about doing it without changing the @ sign. This worked with the data you posted pasted into a file.

[ January 20, 2003: Message edited by: void main ]


Agent007

« Reply #5 on: 20 January 2003, 23:15 »
Thanks a million Void Main!! That really worked. You're right, I only wanted the IDs to be listed. By the way, how does it actually work? I mean, what is the -f1 -d'@' for? Also, why the need for pipes?

thanks & rgds,
007

flap

« Reply #6 on: 20 January 2003, 23:30 »
quote:
Originally posted by void main:
I would have to think a little more about doing it without changing the @ sign.


How about this?

tr '@' ' ' < test.txt | sort -k4,4 -u | gawk '{print $1 " " $2 " " $3 " " $4 "@" $5 " " $6 " " $7 " " $8 " " $9 " " $10}'

Or is there a way of making that gawk statement smaller?
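
One possibly shorter form, offered only as a sketch and using a different technique from the pipeline above: let awk split field 4 on the "@" itself and print a line only the first time its ID is seen. This leaves each line untouched, so nothing has to be put back together, but it keeps the input order instead of sorting. The array names "id" and "seen" are arbitrary, and the scratch file stands in for test.txt.

```shell
# Sample lines in the same layout as the posted data (single spaces assumed).
log=$(mktemp)
cat > "$log" <<'EOF'
n0 tty n0 skdj@xyzt Async interface 00:00:06 PPP: 900.n.00.n82
n0 tty n0 lala@xyz Async interface 00:00:0n PPP: 900.n.00.n29
n0 tty n0 user11@xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 user11@xyzt Async interface 00:0n:49 PPP: 900.n.00.n42
n0 tty n0 weoi@xyzt Async interface 00:00:00 PPP: 900.n.00.n85
EOF

# split() puts the part of field 4 before the '@' into id[1];
# the line is printed only the first time that ID is seen.
first_per_id=$(awk '{ split($4, id, "@"); if (!seen[id[1]]++) print }' "$log")
printf '%s\n' "$first_per_id"

rm -f "$log"
```

Piping the result through "sort -k4,4" afterwards would give a sorted listing if that matters.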
"While envisaging the destruction of imperialism, it is necessary to identify its head, which is none other than the United States of America." - Ernesto Che Guevara

http://counterpunch.org
http://globalresearch.ca


voidmain

« Reply #7 on: 20 January 2003, 23:34 »
quote:
Originally posted by flap:


How about this?

tr '@' ' ' < test.txt | sort -k4,4 -u | gawk '{print $1 " " $2 " " $3 " " $4 "@" $5 " " $6 " " $7 " " $8 " " $9 " " $10}'

Or is there a way of making that gawk statement smaller?



I'm sure there is, and gawk/awk is very powerful. Unfortunately my brain was already full 10 years ago before reaching the awk chapter. And I don't believe your command will actually prevent lines with duplicate IDs (before the @).

[ January 20, 2003: Message edited by: void main ]


flap

« Reply #8 on: 20 January 2003, 23:40 »
Well, the duplicate IDs have already been removed by your command, the output of which is piped through gawk.
"While envisaging the destruction of imperialism, it is necessary to identify its head, which is none other than the United States of America." - Ernesto Che Guevara

http://counterpunch.org
http://globalresearch.ca


voidmain

« Reply #9 on: 20 January 2003, 23:42 »
quote:
Originally posted by Agent007:
Thanks a million Void Main!! That really worked. You're right, I only wanted the IDs to be listed. By the way, how does it actually work? I mean, what is the -f1 -d'@' for? Also, why the need for pipes?

thanks & rgds,
007



It's really simple once you play with some of the basic UNIX commands. The first part of the command, "cut -f4 -d' ' test.txt", says to break each line of the file into columns separated by spaces (the "' '") and then output only the fourth column. That output will be in the form "userid@host". So you pipe that output into "cut -f1 -d'@'", which splits its input into columns delimited by '@' characters. That results in two columns, and the "-f1" says to output only the first column, which is the "userid". Take that output and pipe it directly into the "sort -u" command, which sorts the input, removes duplicates and then spits the result back at you. It would be wise to invest in a shell programming book. This will become second nature to you...
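
To make the stages above easier to follow, here is a small sketch that runs each part of the pipeline separately and keeps the intermediate results. The variable names stage1 to stage3 and the scratch file are only illustrative, and the sample assumes fields separated by single spaces:

```shell
# Three sample lines in the layout posted above (single spaces assumed).
log=$(mktemp)
cat > "$log" <<'EOF'
n0 tty n0 lala@xyz Async interface 00:00:0n PPP: 900.n.00.n29
n0 tty n0 user11@xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 user11@xyzt Async interface 00:0n:49 PPP: 900.n.00.n42
EOF

# Stage 1: split each line on spaces and keep field 4 ("userid@host").
stage1=$(cut -f4 -d' ' "$log")

# Stage 2: split that on '@' and keep field 1 (the bare userid).
stage2=$(printf '%s\n' "$stage1" | cut -f1 -d'@')

# Stage 3: sort the userids and drop repeated lines.
stage3=$(printf '%s\n' "$stage2" | sort -u)

printf 'stage 1:\n%s\n\nstage 2:\n%s\n\nstage 3:\n%s\n' "$stage1" "$stage2" "$stage3"

rm -f "$log"
```

A pipe does the same thing without the intermediate variables: each command reads the previous command's output directly.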

voidmain

« Reply #10 on: 20 January 2003, 23:43 »
quote:
Originally posted by flap:
Well, the duplicate IDs have already been removed by your command, the output of which is piped through gawk.


You're right. I'm having a bad hair day. That would indeed work.  