PHP Help

tuxlux · 09-27-2011, 06:42 PM

I am working on a research project using article and edit data from Wikipedia. Someone put together a script for me to combine data in a particular way but it doesn't do what I need (I think they had another idea in mind). I know nothing about PHP and only the very basics in Bash. I was wondering if someone could help fix this script.

Script and sample files are attached in the "sample.txt" file which is actually a zip file so you would need to change the extension to .zip first. Not sure if that will work, but it wouldn't let me upload a zip file. If this doesn't work, I'll try another approach.

Files to be worked are in two directories /article_info and /user_info.

Basically, the script should open a text file in article_info and read the names of the users. It should then read all of those users' text files from user_info into one combined file, excluding every line in each user file with fewer than 6 (i.e. 5 or fewer) edits in column 3.

Next, it should find all items in column 2 that are the same (i.e. duplicates) and preserve those or output them to a new file.

The final product would be a file with all duplicates (or triplicates, etc.) preserved and nothing else. The Hydrology-output.txt file attached is a sample of the output.

It should be able to do this for all files in article_info (about 150). Some user_info files will be referenced by more than one article_info file, but since the script should process each article_info file one at a time, this should be okay.

Thanks much.

sag47 · 09-27-2011, 10:17 PM

Not sure if it has to be in php but I did what you want based on your description in python.

Code:

#!/usr/bin/env python
from os import listdir
from sys import exit
from os.path import isfile

if len(listdir("./article_info")) == 0:
  print("empty directory...")
  exit(1)

for file in listdir("./article_info"):
  users=[]
  f=open("./article_info/" + file,'r')
  
  #grab all the users out of the files in article info
  for line in f:
    #skip blank lines and the header row
    if line.strip() and line.split()[0] != "rev_user_text":
      users.append(line.rsplit(None,1)[0])
  f.close()
  
  #open ./file-output.txt file and write to it
  f=open("./" + file.rsplit('.',1)[0] + "-output." + file.rsplit('.',1)[1], 'w')
  #now work on each user in ./user_info
  for user in users:
    if not isfile("./user_info/" + user + ".txt"):
      continue
    fuser=open("./user_info/" + user + ".txt",'r')
    for line in fuser:
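      #skip blank lines and the header row
      if line.strip() and line.split()[0] != "rev_user_text":
        #keep only lines with more than 5 edits (last column)
        if int(line.rsplit(None,1)[1]) > 5: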
          f.write(line)
    fuser.close()
  f.close()

Thanks for an interesting text manipulation problem. With my script you could have multiple articles in the article_info folder. I win the race with your friend because my script is shorter (including comments, not empty lines).

SAM

tuxlux · 09-27-2011, 11:31 PM

Thanks for the help. Python is fine. Doesn't really matter to me.

I ran this and it seems to have worked, but the output isn't quite what I need. It properly combined all the user files into one new file, but the next step is to find all the duplicate article names (col 2) and keep only those lines, i.e. lines whose column 2 matches column 2 of at least one other line. Everything else should be deleted, or the matching lines written to a new file. Either way, the final product would be a file containing just the duplicates (or triplicates, etc.) of the article titles, along with the user name and edit count on each line. Let me know if that doesn't make sense.

EDIT: you also win the race for a script that ran first time
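
For reference, the remaining duplicate-filtering step might be sketched roughly as below. This is only a sketch under assumptions that are not confirmed anywhere in the thread: it assumes Python 3, that the columns in the combined file are tab-separated, and that the article title is column 2. The function name keep_duplicate_titles and the sample rows are made up for illustration.

```python
from collections import Counter


def keep_duplicate_titles(lines, sep="\t"):
    """Return only the lines whose column 2 (article title) occurs more than once.

    Assumption: columns are separated by `sep` (tab is a guess; the real
    files may use a different delimiter).
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append((fields[1], line))

    # count how often each title appears across all combined user lines
    counts = Counter(title for title, _ in rows)

    # keep the original order, dropping titles that were seen only once
    return [line for title, line in rows if counts[title] > 1]


if __name__ == "__main__":
    # made-up sample rows: user, article title, edit count
    sample = [
        "Alice\tHydrology\t12\n",
        "Bob\tHydrology\t7\n",
        "Alice\tGeology\t9\n",
    ]
    for line in keep_duplicate_titles(sample):
        print(line, end="")
```

With the made-up sample, only the two Hydrology lines are kept, because Geology appears once. In practice the lines passed in would be the ones that survived the edit-count filter for a single article_info file, and the returned list would be written to that article's output file.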