[SOLVED] Remove duplicated words from two big wordlist txt files
Let's say you have a file longfile.txt that has 100 lines; you can then split it into parts using the head and tail commands and save those parts in separate files.
For instance, for the first 20 lines:
head -20 longfile.txt > output1.txt
For the next 20 lines, i.e. lines 21-40:
cat longfile.txt | head -40 | tail -20 > output2.txt
And for lines 41-60:
cat longfile.txt | head -60 | tail -20 > output3.txt
And so on...
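The head/tail pattern above can be wrapped in a small helper so the arithmetic is done only once. This is just a sketch: the function name line_range and the 100-line demo.txt (a stand-in for longfile.txt) are my own, not from the thread.

```shell
# Demo input: numbers 1..100, one per line (stand-in for longfile.txt).
seq 1 100 > demo.txt

# Print lines START..END of FILE.
# head keeps the first END lines; tail keeps the last (END - START + 1) of those.
line_range() {
    head -n "$2" "$3" | tail -n "$(( $2 - $1 + 1 ))"
}

line_range 1 20 demo.txt  > output1.txt   # lines 1-20
line_range 21 40 demo.txt > output2.txt   # lines 21-40
line_range 41 60 demo.txt > output3.txt   # lines 41-60
```

The count passed to tail is always the length of the range, not its starting line, which is the easy thing to get wrong when typing these by hand.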
To check how many lines longfile.txt has, use:
cat longfile.txt | wc -l
Suppose it gives 500000 as the result; then, for four parts of 125000 lines each, you can use:
head -125000 longfile.txt > output1.txt
cat longfile.txt | head -250000 | tail -125000 > output2.txt
cat longfile.txt | head -375000 | tail -125000 > output3.txt
And so on for as many parts as you want.
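If you don't want to work out the numbers by hand, the same idea can be scripted. A rough sketch, assuming you want 4 parts; I use sed -n 'START,ENDp' here instead of head/tail because it copes with a last part that is shorter than the others. The file demo.txt and the variable names are made up for the demo.

```shell
# Demo input: 500 lines (stand-in for longfile.txt).
seq 1 500 > demo.txt

parts=4
total=$(wc -l < demo.txt)
# Lines per part, rounded up so a remainder does not need an extra part.
chunk=$(( (total + parts - 1) / parts ))

i=1
while [ "$i" -le "$parts" ]; do
    start=$(( (i - 1) * chunk + 1 ))
    end=$(( i * chunk ))
    # Print only lines start..end; an end past the last line is harmless.
    sed -n "${start},${end}p" demo.txt > "output$i.txt"
    i=$(( i + 1 ))
done
```

Each pass rereads the file from the top, so for a really big file split will be faster, but this shows where the numbers come from.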
You can also use the head and tail commands with the -c option to split the data by bytes, but that is less convenient, so better try the approach above.
Also read the man pages of head and tail for a better understanding.
The -n option of split (split into a given number of files) is not specific to Ubuntu: it comes with newer versions of GNU coreutils, so perhaps it's simply not available in your version. In that case you had better use:
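split -l 1000000 longfile.txt new ## To create 8 new files named newaa, newab, newac... (for a file of 8000000 lines)
split -l 2000000 longfile.txt new ## To create 4 new files named newaa, newab, newac...
Note: Each file will have the same number of lines in this case (the last one may have fewer); the sizes in bytes can differ.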
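To convince yourself that nothing got lost, you can count the lines of each piece and compare the joined pieces with the original. A small sketch on a 1000-line demo.txt (my own stand-in for longfile.txt):

```shell
# Demo input: 1000 lines (stand-in for longfile.txt).
seq 1 1000 > demo.txt

# 250 lines per piece: writes newaa, newab, newac, newad.
split -l 250 demo.txt new

# Line count of every piece.
wc -l new??

# Join the pieces in name order and compare with the original.
cat new?? | cmp - demo.txt && echo "pieces match the original"
```

The new?? pattern matches the two-letter suffixes that split adds by default, and the shell expands it in alphabetical order, which is the order the pieces were written in.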
Or if you want to split that file on a size basis, check the size of the file (I assume here it is 8GB), calculate its 8th part, convert that into KB (size in MB x 1024) and split:
du -sh longfile.txt
split -b 1048576k longfile.txt new ## 1GB pieces: to create 8 new files named newaa, newab, newac...
split -b 2097152k longfile.txt new ## 2GB pieces: to create 4 new files named newaa, newab, newac...
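The size per piece can also be calculated by the shell instead of by hand. A minimal sketch, assuming 8 pieces; demo.txt and the prefix piece_ are just names I picked for the demo.

```shell
# Demo input (stand-in for longfile.txt).
seq 1 1000 > demo.txt

parts=8
bytes=$(wc -c < demo.txt)
# Bytes per piece, rounded up so the remainder does not spill into a ninth file.
size=$(( (bytes + parts - 1) / parts ))

# Writes piece_aa, piece_ab, ... with at most $size bytes each.
split -b "$size" demo.txt piece_
ls -l piece_??
```

Keep in mind that -b cuts at exact byte positions, so a word sitting on a boundary can end up half in one piece and half in the next. For wordlists the -l form is usually the safer choice.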