[SOLVED] Remove duplicated words from two big wordlist txt files
Let's say you have a file longfile.txt that has 100 lines. You can split it into parts using the head and tail commands and save those parts in separate files.
For instance, for the first 20 lines:
Code:
head -n 20 longfile.txt > output1.txt
For the next 20 lines (lines 21-40), and then lines 41-60:
Code:
head -n 40 longfile.txt | tail -n 20 > output2.txt
head -n 60 longfile.txt | tail -n 20 > output3.txt
And so on...
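The repeated head/tail commands can also be written as a loop. The following is only a sketch: the scratch directory, the 100-line sample file and the chunk size of 20 are assumptions made so that it runs on its own. It uses tail -n +START piped into head -n, a variant of the head | tail pipeline above that also copes with a shorter final part.

```shell
# Self-contained demo: build a 100-line sample file in a scratch directory.
workdir=$(mktemp -d)
seq 1 100 > "$workdir/longfile.txt"

chunk=20                                   # lines per part (illustrative value)
total=$(wc -l < "$workdir/longfile.txt")   # 100 for the sample file
part=1
start=1
while [ "$start" -le "$total" ]; do
    # tail -n +N starts printing at line N; head -n keeps the next $chunk lines.
    tail -n +"$start" "$workdir/longfile.txt" | head -n "$chunk" > "$workdir/output$part.txt"
    start=$((start + chunk))
    part=$((part + 1))
done

# Show the line count of every part that was written.
wc -l "$workdir"/output*.txt
```

Each pass re-reads the file from the top, so for very large files this is slower than a single split run.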
To check how many lines longfile.txt has, use:
Code:
wc -l < longfile.txt
Suppose it reports 500000. To split that into four parts of 125000 lines each, you can use:
Code:
head -n 125000 longfile.txt > output1.txt
head -n 250000 longfile.txt | tail -n 125000 > output2.txt
head -n 375000 longfile.txt | tail -n 125000 > output3.txt
And so on for as many parts as you want.
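The arithmetic behind those numbers can be left to the shell. This sketch assumes the 500000-line total from above and four parts (both values are just examples), and prints the line range each part would cover. It rounds the part size up so that no lines are left over when the total does not divide evenly.

```shell
total=500000      # line count as reported by wc -l (example value)
parts=4           # number of parts wanted (example value)

# Ceiling division: round up so the last part absorbs any remainder.
per_part=$(( (total + parts - 1) / parts ))

i=1
while [ "$i" -le "$parts" ]; do
    start=$(( (i - 1) * per_part + 1 ))
    end=$(( i * per_part ))
    # The final part may be shorter than the others.
    if [ "$end" -gt "$total" ]; then
        end=$total
    fi
    echo "part $i: lines $start-$end"
    i=$((i + 1))
done
```

The head value for each part is its end line, and the tail value is per_part (or the shorter length of the last part).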
You can also use head and tail with the -c option to split the data by bytes, but that is less convenient, so the line-based approach above is the better choice.
Also read the man pages of head and tail for a better understanding.
I have only confirmed that the -n option is available on Ubuntu (I am not sure about other Linux flavors), so perhaps in your case it's not available. In that case it is better to use:
Code:
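split -l 1000000 longfile.txt new ## 1000000 lines per file: 8 files for an 8000000-line input, named newaa, newab, newac...
split -l 2000000 longfile.txt new ## 2000000 lines per file: 4 files for the same input, named newaa, newab, newac...
Note: Every file gets the same number of lines (except possibly the last one), so the file sizes will be similar but not necessarily identical.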
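To see what split -l produces without touching a large file, here is a small sketch. The scratch directory, the 80-line sample file and the value 10 are assumptions chosen for the demo. It splits the sample into pieces of 10 lines and checks that the pieces, concatenated in name order, match the original.

```shell
# Self-contained demo: an 80-line sample file in a scratch directory.
workdir=$(mktemp -d)
seq 1 80 > "$workdir/longfile.txt"

# 10 lines per piece; output names are the prefix plus a two-letter suffix.
split -l 10 "$workdir/longfile.txt" "$workdir/new"

# List the pieces that were created.
ls "$workdir"

# The glob expands in sorted order, so cat rebuilds the original file.
cat "$workdir"/new* | cmp - "$workdir/longfile.txt" && echo "pieces reassemble to the original"
```

The same check works on the real file once the split has finished.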
Or if you want to split the file by size, check the size of the file (I assume here it is 8 GB), calculate one eighth of it, convert that to KB (size in MB x 1024) and split. Without a suffix, split -b counts bytes, so the K suffix is needed:
Code:
du -sh longfile.txt
8.0G	longfile.txt
split -b 1048576K longfile.txt new ## 1 GB per file: 8 new files named newaa, newab, newac...
split -b 2097152K longfile.txt new ## 2 GB per file: 4 new files named newaa, newab, newac...
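Rather than working out the byte count by hand, the shell can compute it. This is a sketch under stated assumptions: the scratch directory, the small sample file and the choice of 8 parts are only there to make it runnable. It takes the file size from wc -c and rounds the piece size up.

```shell
# Self-contained demo: a small sample file in a scratch directory.
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/longfile.txt"

parts=8                                        # number of pieces wanted (example value)
size=$(wc -c < "$workdir/longfile.txt")        # file size in bytes
part_bytes=$(( (size + parts - 1) / parts ))   # ceiling division

# With no suffix, the -b argument is a plain byte count.
split -b "$part_bytes" "$workdir/longfile.txt" "$workdir/new"

ls -l "$workdir"/new*
cat "$workdir"/new* | cmp - "$workdir/longfile.txt" && echo "byte-for-byte identical"
```

Keep in mind that a byte-based split can cut a line in two at a piece boundary, which matters for a wordlist with one word per line. The line-based split -l avoids that.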