I am working on a research project using article and edit data from Wikipedia. Someone put together a script for me to combine the data in a particular way, but it doesn't do what I need (I think they had another idea in mind). I know nothing about PHP and only the very basics of Bash. Could someone help fix this script?
The script and sample files are attached in the "sample.txt" file, which is actually a zip file, so you will need to change the extension to .zip first. I'm not sure that will work, but the forum wouldn't let me upload a zip file. If it doesn't work, I'll try another approach.
The files to be processed are in two directories, /article_info and /user_info.
Basically, the script should open a text file in article_info and read the names of the users. It should then read all of those users' text files from user_info into one combined file. It should exclude every line in each user file that has fewer than 6 (i.e. 5 or fewer) edits in column 3.
Next, it should find all items in column 2 that are the same (i.e. duplicates) and preserve those or output them to a new file.
The final product would be a file with all duplicates (or triplicates, etc.) preserved and nothing else. The Hydrology-output.txt file attached is a sample of the output.
It should be able to do this for all files in article_info (about 150) and there will be times when different user_info files are used in more than one article_info file. However, since it should do each article_info file one at a time this should be okay.
Not sure if it has to be in PHP, but I did what you want in Python, based on your description.
Code:
#!/usr/bin/env python
from os import listdir
from sys import exit
from os.path import isfile
if len(listdir("./article_info")) == 0:
    print("empty directory...")
    exit(1)
for file in listdir("./article_info"):
    users=[]
    f=open("./article_info/" + file,'r')
    #grab all the users out of the files in article info (skip the header row)
    for line in f:
        if line.split()[0] != "rev_user_text":
            users.append(line.rsplit(None,1)[0])
    f.close()
    #open ./file-output.txt file and write to it
    f=open("./" + file.rsplit('.',1)[0] + "-output." + file.rsplit('.',1)[1], 'w')
    #now work on each user in ./user_info
    for user in users:
        #skip users that have no file in ./user_info
        if not isfile("./user_info/" + user + ".txt"):
            continue
        fuser=open("./user_info/" + user + ".txt",'r')
        for line in fuser:
            if line.split()[0] != "rev_user_text":
                #keep only lines with more than 5 edits (last column)
                if int(line.rsplit(None,1)[1]) > 5:
                    f.write(line)
        fuser.close()
    f.close()
Thanks for an interesting text manipulation problem. With my script you could have multiple articles in the article_info folder. I win the race with your friend because my script is shorter (including comments, not empty lines).
Thanks for the help. Python is fine. Doesn't really matter to me.
I ran this and it seems to have worked, but the output isn't quite what I need. It looks like it properly combined all the user files into one new file, but the next step is to find all the duplicate article names (column 2) and keep those lines (that is, lines whose column 2 matches column 2 of at least one other line), either by deleting everything else or by writing the matches to a new file. Either way, the final product would be a file with just the duplicates (or triplicates, etc.) of the article titles, along with the user name and edit count on each line. Let me know if that doesn't make sense.
EDIT: you also win the race for a script that ran first time
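For the remaining step described above (keeping only the lines whose article title shows up more than once), something along these lines might work. This is only a sketch and not part of the posted script: it assumes the columns are tab-separated and that the article title is the second column, and the names title_of, keep_repeated_titles and filter_file are made up for illustration. If the real files use a different separator, the sep argument would need to change.

```python
from collections import Counter


def title_of(line, title_col=1, sep="\t"):
    """Return the article title field of a line, or None if the line is too short.

    Assumption: columns are separated by `sep` and the title is column 2
    (index 1). Adjust both if the real files are laid out differently.
    """
    fields = line.rstrip("\n").split(sep)
    return fields[title_col] if len(fields) > title_col else None


def keep_repeated_titles(lines, title_col=1, sep="\t"):
    """Keep only the lines whose title appears on at least two lines."""
    lines = list(lines)
    # First pass: count how many lines carry each title.
    counts = Counter(title_of(ln, title_col, sep) for ln in lines)
    counts.pop(None, None)  # ignore blank or malformed lines
    # Second pass: keep lines whose title was counted more than once.
    return [ln for ln in lines if counts.get(title_of(ln, title_col, sep), 0) > 1]


def filter_file(src, dst):
    """Read the combined file `src` and write only repeated-title lines to `dst`."""
    with open(src) as fin:
        kept = keep_repeated_titles(fin)
    with open(dst, "w") as fout:
        fout.writelines(kept)
```

It could then be called once per combined file, for example filter_file("Hydrology-output.txt", "Hydrology-duplicates.txt"), where the second file name is just a placeholder. Lines keep their original order; sorting the kept lines by title first would group the duplicates together if that is easier to read.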