LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 11-22-2012, 01:57 PM   #1
ASTRAPI
Member
 
Registered: Feb 2007
Posts: 210

Rep: Reputation: 16
Question: Remove duplicated words from two big wordlist txt files


Hello

I am using Ubuntu 64-bit, and I have two txt files on my desktop:

file1.txt and file2.txt, each a wordlist of about 200 MB.

file1.txt content example:

Code:
word1
word2
bla
blabla
123456
andsoon
file2.txt content example:

Code:
word21
123456
word32
blaster
blabla
andsoon2
1) I need to remove the words blabla and 123456, since they exist in both wordlists, and keep each in one file only; or, even better, output the results to a new file on my desktop named output.txt that contains the contents of both file1.txt and file2.txt but no duplicated words.

2) The second thing I want is to remove all words/digits/characters that are shorter than 5 characters. How can I do this so that the smallest word is 5 characters?

Thank you
 
Old 11-22-2012, 02:25 PM   #2
mmoreno80
LQ Newbie
 
Registered: Nov 2012
Posts: 8

Rep: Reputation: Disabled
Quote:
Originally Posted by ASTRAPI View Post
1) I need to remove the words blabla and 123456, since they exist in both wordlists, and keep each in one file only; or, even better, output the results to a new file on my desktop named output.txt that contains the contents of both file1.txt and file2.txt but no duplicated words.
That can be easily done with a standard GNU/Linux distribution:

Code:
$ cat file1.txt file2.txt | sort | uniq > output.txt
Output will contain sorted words without duplicates.
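A quick sanity check of that pipeline, rebuilding the two sample lists from the first post (file names assumed to match):

```shell
# Recreate the two sample word lists from the first post
printf '%s\n' word1 word2 bla blabla 123456 andsoon > file1.txt
printf '%s\n' word21 123456 word32 blaster blabla andsoon2 > file2.txt

# Concatenate, sort, and merge duplicate lines
cat file1.txt file2.txt | sort | uniq > output.txt

cat output.txt
```

Of the 12 input lines, the two shared entries (blabla and 123456) are kept once each, leaving 10 unique lines in output.txt.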

Quote:
Originally Posted by ASTRAPI View Post
2) The second thing I want is to remove all words/digits/characters that are shorter than 5 characters. How can I do this so that the smallest word is 5 characters?
Code:
egrep -v '^[[:alnum:]]{1,4}$' output.txt > output2.txt
File output2.txt will contain those words with 5 or more characters.
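A small check of that filter on some made-up sample words (on current systems egrep is a deprecated alias for grep -E; both work here):

```shell
# Drop lines that consist of 1-4 alphanumeric characters only
printf '%s\n' cat word 12345 elephant abc |
    egrep -v '^[[:alnum:]]{1,4}$'
# prints:
# 12345
# elephant
```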

If you need the output in random order, finally you could do something like this:

Code:
 shuf output2.txt > output3.txt
That is my two cents.

Last edited by mmoreno80; 11-22-2012 at 02:26 PM.
 
1 members found this post helpful.
Old 11-22-2012, 02:25 PM   #3
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
1) Well, I suppose this is homework and they want you to use sed...

I would simply cat both files together and then use uniq. Read the manpage for the uniq command:
Code:
cat file1.txt file2.txt | sort | uniq -u > output.txt
2) Take a look at sed.

Markus
 
Old 11-22-2012, 02:35 PM   #4
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
And here is the sed solution for 2):
Code:
sed -n '/\w\{5,\}/p' < output.txt
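A quick check of what that pattern keeps, on made-up sample lines (GNU sed assumed; note that \w\{5,\} requires 5 consecutive word characters, so punctuation-heavy lines can be dropped even when long):

```shell
printf '%s\n' hello hi 12345 'ab,cd' longword |
    sed -n '/\w\{5,\}/p'
# prints:
# hello
# 12345
# longword
```

Here "ab,cd" is 5 characters but never 5 word characters in a row, so it is filtered out; that matters for the mixed symbol lists discussed later in the thread.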
Markus
 
Old 11-22-2012, 03:08 PM   #5
ASTRAPI
Member
 
Registered: Feb 2007
Posts: 210

Original Poster
Rep: Reputation: 16
1) Can I use those commands on huge txt files, like 2 or 4 GB, without any problem?

2) Are the above commands OK to use if my txt files also contain digits and special characters, since I want to keep those too (just not the duplicated ones)?

3) Also, for the second command, is it OK to remove digits and special characters as above but keep all digits and special characters of 5 characters and up?

Last question:
4) How can I split one of my huge wordlists in half, so I get two files with an equal number of words inside?

5) Which one is the correct command?

Code:
cat file1.txt file2.txt | sort | uniq > output.txt
or

Code:
cat file1.txt file2.txt | sort | uniq -u > output.txt
Thank you

Last edited by ASTRAPI; 11-22-2012 at 03:21 PM.
 
Old 11-24-2012, 06:19 PM   #6
ASTRAPI
Member
 
Registered: Feb 2007
Posts: 210

Original Poster
Rep: Reputation: 16
please?
 
Old 11-24-2012, 06:28 PM   #7
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
Quote:
Originally Posted by ASTRAPI View Post
1) Can I use those commands on huge txt files, like 2 or 4 GB, without any problem?
You'll have to try it out. I would suppose that it depends on your amount of RAM. And I don't know how long this would run.

Quote:
2) Are the above commands OK to use if my txt files also contain digits and special characters, since I want to keep those too (just not the duplicated ones)?
You will have to try that out.

Quote:
3) Also, for the second command, is it OK to remove digits and special characters as above but keep all digits and special characters of 5 characters and up?
As above, try it out; if it doesn't work, post the problem so that we can help.

Quote:
Last question:
4) How can I split one of my huge wordlists in half, so I get two files with an equal number of words inside?

5) Which one is the correct command?

Code:
cat file1.txt file2.txt | sort | uniq > output.txt
or

Code:
cat file1.txt file2.txt | sort | uniq -u > output.txt
Thank you
Both are correct; referring to the manpage, -u is needed, but it is also the default.

I would strongly recommend that you try it out. If it doesn't work, you may ask for help.

Markus

Last edited by markush; 11-25-2012 at 01:53 AM. Reason: changed a mistake
 
Old 11-24-2012, 07:10 PM   #8
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Arch
Posts: 3,013

Rep: Reputation: 1225
Quote:
Originally Posted by markush View Post
Both are correct, referring to the manpage -u is needed but is also the default.
You may want to read the manpage again; the default behaviour is not the same as what the -u option does.

Quote:
man uniq(1):
...
With no options, matching lines are merged to the first occurrence.
...
-u, --unique
only print unique lines
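The difference is easy to see on a tiny sorted sample:

```shell
printf '%s\n' a a b | uniq      # merges duplicates -> a, b
printf '%s\n' a a b | uniq -u   # drops duplicated lines entirely -> b
```

With plain uniq, one copy of every line survives; with -u, any line that occurred more than once disappears completely, which is exactly the wrong behaviour for merging two wordlists.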
 
Old 11-25-2012, 01:51 AM   #9
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
ntubski, yes, you are right, it's not the same. When I tried the above code at first, it made no difference, but it seems that it's better to omit the -u option.

Markus
 
Old 11-25-2012, 04:34 AM   #10
ASTRAPI
Member
 
Registered: Feb 2007
Posts: 210

Original Poster
Rep: Reputation: 16
Here are my results:

I created file1.txt:

Code:
urwuorwuor
5
3t3t
4y647y
t3t3
ty45
6y356t3t3t3
#%@^$%&^@#$%,@%
g
%^#FG#@%&U^#$G@$
%$Y
$@Y,@$
3647585
ertwe
TERTERTGE
ERT
ss
sdfwrfw
t,4$&!@#$%&*();'.y54t4
rthrh

And file2.txt:

Code:
urwguorwuor
5
3t3t
4y647y
t3t3
ty45
6y356ht3t3t3
#%@^n$%&^@#$%,@%
gh
%^#FG#@%&U^#$G@$
%$Y
$@,Y,@$
3647585
ertwe
TERTERTGE
ERT
ssm
sdfwrfw
t,4$&!@#$%&*();'.y54t4
rthrhm
Then using:

Code:
cat file1.txt file2.txt | sort | uniq -u > output.txt
I got output.txt with this content:

Code:
#%@^$%&^@#$%,@%
6y356ht3t3t3
6y356t3t3t3
g
gh
#%@^n$%&^@#$%,@%
rthrh
rthrhm
ss
ssm
urwguorwuor
urwuorwuor
$@,Y,@$
$@Y,@$
TOTALLY WRONG!

Then I tried without -u:

Code:
cat file1.txt file2.txt | sort | uniq > output2.txt
And I got this output2.txt content:

Code:
#%@^$%&^@#$%,@%
3647585
3t3t
4y647y
5
6y356ht3t3t3
6y356t3t3t3
ERT
ertwe
%^#FG#@%&U^#$G@$
g
gh
#%@^n$%&^@#$%,@%
rthrh
rthrhm
sdfwrfw
ss
ssm
t3t3
t,4$&!@#$%&*();'.y54t4
TERTERTGE
ty45
urwguorwuor
urwuorwuor
$@,Y,@$
$@Y,@$
%$Y
That seems OK, if I am not wrong.

Then I tried this:

Code:
egrep -v '^[[:alnum:]]{1,4}$' output2.txt > output3.txt
And I got this content in output3.txt:

Code:
#%@^$%&^@#$%,@%
3647585
4y647y
6y356ht3t3t3
6y356t3t3t3
ertwe
%^#FG#@%&U^#$G@$
#%@^n$%&^@#$%,@%
rthrh
rthrhm
sdfwrfw
t,4$&!@#$%&*();'.y54t4
TERTERTGE
urwguorwuor
urwuorwuor
$@,Y,@$
$@Y,@$
%$Y
Again, if I am correct, it seems OK except the last line, which is 3 characters:

Code:
%$Y
Any ideas?

Thank you
 
Old 11-25-2012, 07:40 AM   #11
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,800
Blog Entries: 4

Rep: Reputation: 286
You can try the more filter instead of cat to read/open large files, as:
Code:
more file1.txt file2.txt | sort -u > output.txt
Also, you can replace the "sort | uniq" filters with simply "sort -u", as that will do the same thing.
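A quick check that the two forms agree, on made-up sample lines:

```shell
printf '%s\n' b a b c a | sort | uniq > via_uniq.txt
printf '%s\n' b a b c a | sort -u     > via_sort_u.txt

# cmp exits 0 only when the files are byte-identical
cmp -s via_uniq.txt via_sort_u.txt && echo "identical"
```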

Last edited by shivaa; 11-25-2012 at 07:43 AM.
 
Old 11-25-2012, 08:09 AM   #12
ASTRAPI
Member
 
Registered: Feb 2007
Posts: 210

Original Poster
Rep: Reputation: 16
I tried this:

Code:
more file1.txt file2.txt | sort | uniq > output2.txt
This is what I got:

Code:
::::::::::::::
#%@^$%&^@#$%,@%
3647585
3t3t
4y647y
5
6y356ht3t3t3
6y356t3t3t3
ERT
ertwe
%^#FG#@%&U^#$G@$
file1.txt
file2.txt
g
gh
#%@^n$%&^@#$%,@%
rthrh
rthrhm
sdfwrfw
ss
ssm
t3t3
t,4$&!@#$%&*();'.y54t4
TERTERTGE
ty45
urwguorwuor
urwuorwuor
$@,Y,@$
$@Y,@$
%$Y
What do you think?

These lines are wrong, and maybe more are incorrect:

Code:
::::::::::::::
file1.txt
file2.txt
I don't care about putting the words in order; I just care about removing duplicated words, digits, and symbols, if that helps...


*If I am not wrong, this is OK for the first job:

Code:
cat file1.txt file2.txt | sort | uniq > output2.txt
I think I must adjust this to remove everything from one up to four characters, so the smallest word/digit/symbol entry will be 5 characters...

Code:
egrep -v '^[[:alnum:]]{1,4}$' output2.txt > output3.txt
But the above did not remove this:

Code:
%$Y



Thank you

Last edited by ASTRAPI; 11-25-2012 at 08:14 AM.
 
Old 11-25-2012, 09:01 AM   #13
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,800
Blog Entries: 4

Rep: Reputation: 286
First job:
The more filter is used to view a file page by page. If you open 2 files with more, then at the top of the output it prints the filenames, as:
Code:
:::::::::::
file1.txt
:::::::::::
A
B and so on...
:::::::::::
file2.txt
:::::::::::
C
D and so on...
So those extra lines are just the filename headers that more inserts into the output; either remove them manually after generating the output, or use:
Code:
more file1.txt file2.txt | grep -v ":" | sort -u > output.txt
Or, better, use the previously suggested command for the first job, i.e.:
Code:
cat file1.txt file2.txt | sort | uniq > output.txt
Second job:
Code:
awk 'length($0) >= 5' output2.txt > final_output.txt
And final_output.txt will contain all unique entries with a length of 5 characters or more; after that you can remove the output.txt file.
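A quick check of that awk filter on a mix of words and symbols (sample lines assumed):

```shell
printf '%s\n' '%$Y' 12345 ab 'long$line' |
    awk 'length($0) >= 5'
# prints:
# 12345
# long$line
```

Unlike the earlier [[:alnum:]] pattern, length($0) counts every character, so symbol lines like %$Y are judged purely by their length.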

Last edited by shivaa; 11-25-2012 at 09:07 AM.
 
Old 11-25-2012, 10:47 AM   #14
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 670
Also look at the comm command. It will list lines unique to file1, lines unique to file2, and lines common to both files.

The sort command has a unique option, so you don't need to use the uniq command.
Code:
cat file1.txt file2.txt | sort -u > output.txt
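Note that comm needs its inputs sorted; a minimal sketch with hypothetical lists:

```shell
printf '%s\n' apple both zebra | sort > s1.txt
printf '%s\n' both cherry      | sort > s2.txt

comm -12 s1.txt s2.txt   # lines common to both files -> both
comm -23 s1.txt s2.txt   # lines unique to s1.txt -> apple, zebra
```

The digits name the columns to suppress: -12 leaves only the "common" column, -23 leaves only the "unique to file 1" column.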
 
Old 11-25-2012, 11:16 AM   #15
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Arch
Posts: 3,013

Rep: Reputation: 1225
Quote:
Originally Posted by ASTRAPI View Post
Then i try this:

Code:
egrep -v '^[[:alnum:]]{1,4}$' output2.txt > output3.txt
...
Again, if I am correct, it seems OK except the last line, which is 3 characters:

Code:
%$Y
The egrep command you used will print every line that is NOT 1-4 alphanumeric characters. Since "%$Y" has some NON alphanumeric characters, it is printed. You started this thread talking about word lists, but now your sample data contains nonwords full of punctuation. Do you want to treat them differently or not? If you just wanted to remove any lines with 1-4 characters (of any type), use "." instead of "[[:alnum:]]".


Quote:
Originally Posted by shivaa View Post
You can try more filter instead of cat to read/open large files, as:
Code:
more file1.txt file2.txt | sort -u > output.txt
No, cat has no problem with large files; you only need more if you want to view a large file in the terminal, since more lets you scroll through it.


About efficiency: it will probably be more efficient to filter out the short words before sorting and eliminating duplicates.
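A sketch of that ordering on tiny stand-in files, assuming short lines of any character type should go (grep -h suppresses the file-name prefixes grep adds when given multiple files):

```shell
printf '%s\n' abc 12345 longword > file1.txt
printf '%s\n' 12345 xy           > file2.txt

# Filter first, then sort/deduplicate the smaller stream
grep -hE '^.{5,}$' file1.txt file2.txt | sort -u > output.txt
cat output.txt
# 12345
# longword
```

Because '.' matches any character, this keeps 5-character symbol lines like %$Y,! that the [[:alnum:]] version would also keep, while still dropping everything shorter.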
 