[SOLVED] Remove duplicated words from two big wordlist txt files

ASTRAPI · 11-25-2012, 10:36 AM

@ntubski

Yes, I need it for every character, whether it is a letter, a number or a special character...

So for first job i think the best is:

Code:

cat file1.txt file2.txt | sort | uniq > output.txt

(I don't care about viewing the content after the job is done. I just need to have the correct stuff inside.)

For the second one:

Code:

egrep -v '^[[:.:]]{1,4}$' output2.txt > output3.txt

Is the above correct, with [[:.:]]?

or

Code:

awk 'length($0) >= 5' output2.txt > final_output.txt

?

Thank you

ntubski · 11-25-2012, 10:45 AM

Use just "." not "[[:.:]]".

Any of

Code:

egrep -v '^.{1,4}$' output2.txt > output3.txt
egrep '^.{5,}$' output2.txt > output3.txt
awk 'length($0) >= 5' output2.txt > final_output.txt

should work fine. Although the first command lets blank lines through as well (you could change to {0,4} to fix that).
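To see the difference on a small sample, here is a quick sketch. The sample.txt file and its contents are made up for illustration, and `grep -E` is used as the equivalent of `egrep`:

```shell
# Hypothetical sample: a 3-char word, a blank line, a 4-char word,
# a 5-char word and an 8-char word.
tmpdir=$(mktemp -d)
printf '%s\n' 'abc' '' 'abcd' 'abcde' 'pass1234' > "$tmpdir/sample.txt"

# Drops lines of 1-4 characters, but the blank line gets through.
grep -Ev '^.{1,4}$' "$tmpdir/sample.txt" > "$tmpdir/drop_1_4.txt"

# Drops lines of 0-4 characters, so the blank line goes too.
grep -Ev '^.{0,4}$' "$tmpdir/sample.txt" > "$tmpdir/drop_0_4.txt"

# Keeps only lines of 5 or more characters.
grep -E '^.{5,}$' "$tmpdir/sample.txt" > "$tmpdir/keep_5_grep.txt"
awk 'length($0) >= 5' "$tmpdir/sample.txt" > "$tmpdir/keep_5_awk.txt"

# Print each result; an empty line shows up as "[]".
for f in drop_1_4 drop_0_4 keep_5_grep keep_5_awk; do
    echo "== $f"
    sed 's/.*/[&]/' "$tmpdir/$f.txt"
done
```

Only the first result should still contain the blank line; the other three should be identical.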

ASTRAPI · 11-25-2012, 03:07 PM

All seems to work great

Can i use this for many files?

Code:

cat file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt | sort | uniq > output.txt

Is there any system limit for big txt files, or the possibility of a timeout or something?

I have a very fast PC with a Core i7 3770K and 16 GB of RAM.

Thank you

ntubski · 11-25-2012, 03:57 PM

Quote:

Originally Posted by ASTRAPI

Can i use this for many files?

Yup.

Quote:

Is there any system limit for big txt files, or the possibility of a timeout or something?

Nope, the only timeout will be your patience running out.
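If patience does become the issue, GNU sort has a few knobs that may help with very large inputs. A rough sketch, assuming GNU coreutils sort; the buffer size, the temp directory and the sample files are placeholders, not recommendations:

```shell
# Tiny stand-in files so the sketch can be run as-is (made-up data).
workdir=$(mktemp -d)
printf '%s\n' 'zebra' 'apple' 'mango' > "$workdir/file1.txt"
printf '%s\n' 'apple' 'kiwi' 'zebra' > "$workdir/file2.txt"

# LC_ALL=C  compare raw bytes instead of locale collation rules (usually faster)
# -u        drop duplicate lines (same result as piping into uniq)
# -S 256M   in-memory buffer size before sort spills to temporary files
# -T DIR    where those temporary files go (needs enough free space)
LC_ALL=C sort -u -S 256M -T "$workdir" \
    "$workdir/file1.txt" "$workdir/file2.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

For multi-gigabyte wordlists, the directory given to -T should sit on a disk with at least as much free space as the input.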

shivaa · 11-25-2012, 09:39 PM

One shot solution for both jobs:

Code:

awk 'length($0) >= 5' file1.txt file2.txt | sort -u > output.txt
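
A small check that the one-liner gives the same result as the two separate steps (the sample files and their contents are made up for illustration):

```shell
demo=$(mktemp -d)
printf '%s\n' 'password' 'abc' 'letmein' 'qwerty' > "$demo/file1.txt"
printf '%s\n' 'qwerty' 'abcd' 'password' 'dragon' > "$demo/file2.txt"

# Two steps, as earlier in the thread: merge + dedupe, then drop short lines.
cat "$demo/file1.txt" "$demo/file2.txt" | sort | uniq > "$demo/merged.txt"
awk 'length($0) >= 5' "$demo/merged.txt" > "$demo/two_step.txt"

# One shot: filter first, then sort -u (same as sort | uniq).
awk 'length($0) >= 5' "$demo/file1.txt" "$demo/file2.txt" | sort -u > "$demo/one_shot.txt"

# cmp prints nothing and returns 0 when the files are identical.
cmp "$demo/two_step.txt" "$demo/one_shot.txt" && echo "same result"
```

Filtering before sorting also means sort has fewer lines to work through.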

ASTRAPI · 11-26-2012, 01:07 AM

ok all done great

Now I have one big 8 GB file...

How can I split it into 500 MB parts or, better, 1 GB parts?

Thank you

shivaa · 11-26-2012, 02:22 AM

Let's say you have a file longfile.txt that has 100 lines. You can then split it into parts using the head and tail commands and save those parts in separate files.
For instance, for the first 20 lines:

Code:

head -20 longfile.txt > output1.txt

For the next 20 lines, i.e. lines 21-40, and then lines 41-60:

Code:

cat longfile.txt | head -40 | tail -20 > output2.txt
cat longfile.txt | head -60 | tail -20 > output3.txt

And so on...
To check how many lines longfile.txt has, use:

Code:

cat longfile.txt | wc -l

Suppose the result is 500000. Then, for four parts of 125000 lines each, you can use:

Code:

head -125000 longfile.txt > output1.txt
cat longfile.txt | head -250000 | tail -125000 > output2.txt
cat longfile.txt | head -375000 | tail -125000 > output3.txt

And so on for as many parts as you want.
You can also use head and tail with the -c option to split the data by bytes, but that is less convenient, so better try it as described above.
Also read the man pages of head and tail for a better understanding.
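
Here is a sketch of the same idea as a loop, so the line ranges don't have to be worked out by hand. It swaps the order (tail -n +START first, then head), which also copes with a short last part. The 100-line stand-in file, `chunk` and the part_N.txt names are just placeholders:

```shell
work=$(mktemp -d)
seq 1 100 > "$work/longfile.txt"        # stand-in file: lines containing 1..100

chunk=20                                 # lines per part
total=$(($(wc -l < "$work/longfile.txt")))
part=1
start=1
while [ "$start" -le "$total" ]; do
    # tail -n +N prints from line N to the end; head then keeps the first $chunk of those.
    tail -n "+$start" "$work/longfile.txt" | head -n "$chunk" > "$work/part_$part.txt"
    part=$((part + 1))
    start=$((start + chunk))
done

head -n 1 "$work/part_2.txt"    # first line of the second part: 21
tail -n 1 "$work/part_2.txt"    # last line of the second part: 40
```

Each pass rereads the file from the top, so on a very large file this gets slow as the number of parts grows.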

ASTRAPI · 11-26-2012, 03:05 AM

Will it be OK if I use:

Code:

cat longfile.txt | wc -l

result: 8.000.000

and then use:

Code:

split -l 1000000 longfile.txt new

Will it create eight files (new.txt, newb.txt, newc.txt, newd.txt, newe.txt, newf.txt, newg.txt, newh.txt) with 1000000 lines in each?

Thank you

shivaa · 11-26-2012, 03:20 AM

The split command is also a good option. You can use it as:

Code:

split -n <number_of_parts> <filename>

For example, to split the file filename into 10 parts, do:

Code:

split -n 10 filename

It will then create 10 new files of equal size, named xaa, xab, xac and so on, in the current working directory.
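
A small sketch of what that looks like, assuming a GNU split that has -n (older versions don't, so the example checks first). One thing to be aware of: plain `-n 4` cuts by bytes and can break a line in the middle, while GNU's `-n l/4` keeps whole lines. The sample file and the part_ prefix are made up:

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
seq 1 1000 > filename                 # stand-in for the real file

if split --help 2>/dev/null | grep -q -e '--number'; then
    # l/4 = four parts of roughly equal size, never splitting inside a line.
    split -n l/4 filename part_
    ls part_*                         # lists part_aa, part_ab, part_ac, part_ad
    cat part_* | cmp - filename && echo "parts add up to the original"
else
    echo "this split has no -n option; use split -l or split -b instead"
fi
```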

ASTRAPI · 11-26-2012, 03:41 AM

I just tried:

Code:

split -n 4 longfile.txt

And i got this:

Code:

split -n 4 output2.txt
split: invalid option -- 'n'
Try `split --help' for more information.

It would be great to be able to split the file into four equal-size parts...

Thank you

shivaa · 11-26-2012, 08:14 AM

Option -n is only available in newer versions of GNU split (the one on recent Ubuntu releases has it; I am not sure about other Linux flavors), and it looks like your version doesn't have it. In that case, better use:

Code:

split -l 1000000 longfile.txt new    ## Creates 8 new files named newaa, newab, newac...
split -l 2000000 longfile.txt new    ## Creates 4 new files named newaa, newab, newac...

Note: Each file will have the same number of lines in this case, so the file sizes will be roughly (not exactly) equal.
Or, if you want to split the file by size, check the size of the file (I assume here it is 8 GB), work out an 8th of it in MB (8192 / 8 = 1024) and pass that to split -b with the m suffix. Without a suffix, -b counts plain bytes. Keep in mind that -b can cut a line in two at a part boundary:

Code:

du -m longfile.txt
8192    longfile.txt
split -b 1024m longfile.txt new   ## Creates 8 new files named newaa, newab, newac...
split -b 2048m longfile.txt new   ## Creates 4 new files named newaa, newab, newac...
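
To avoid doing the arithmetic by hand, the per-part numbers can be computed in the shell. A minimal sketch on a made-up 1000-line file split into 4 parts; `parts`, `per_part`, `bytes` and the size_ prefix are just illustrative names:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
seq 1 1000 > longfile.txt                      # stand-in for the real 8 GB file

parts=4
lines=$(($(wc -l < longfile.txt)))             # total number of lines
per_part=$(( (lines + parts - 1) / parts ))    # round up so nothing is left over

split -l "$per_part" longfile.txt new          # creates newaa, newab, newac, newad
wc -l new??                                    # 250 lines in each part

# Same idea by size: bytes per part, rounded up. Unlike -l, -b can cut a line in half.
bytes=$(($(wc -c < longfile.txt)))
split -b "$(( (bytes + parts - 1) / parts ))" longfile.txt size_

# Either way, gluing the parts back together gives the original file.
cat new?? | cmp - longfile.txt && echo "line-based parts OK"
cat size_?? | cmp - longfile.txt && echo "size-based parts OK"
```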

schneidz · 11-26-2012, 10:15 AM

Quote:

Originally Posted by ASTRAPI

All seems to work great

Can i use this for many files?

Code:

cat file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt | sort | uniq > output.txt

Is there any system limit for big txt files, or the possibility of a timeout or something?

I have a very fast PC with a Core i7 3770K and 16 GB of RAM.

Thank you

Why don't you just try stuff to find out if it works to your satisfaction?

ASTRAPI · 11-26-2012, 12:30 PM

All working great

Is it safe to rename the newaa file to newaa.txt?

Thank you

shivaa · 11-26-2012, 08:11 PM

As you wish, there is no problem in renaming. :-)
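
For example, a small loop can add the extension to every part at once. A sketch, assuming the parts were made with the new prefix and the default two-letter suffixes (newaa, newab, ...); the sample file here is made up:

```shell
d=$(mktemp -d)
cd "$d" || exit 1
seq 1 100 > longfile.txt
split -l 25 longfile.txt new          # creates newaa, newab, newac, newad

# new?? matches "new" plus exactly two characters, so longfile.txt is left alone.
for f in new??; do
    mv -- "$f" "$f.txt"
done

ls new*.txt
# The contents are untouched; only the names changed.
cat new??.txt | cmp - longfile.txt && echo "still identical"
```

Newer GNU split versions can also add the extension directly with --additional-suffix=.txt, if your version has that option.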

BTW, if there are no more issues, please mark the thread as solved from the top right side of the page, under the Thread Tools option.