07-13-2014, 05:13 AM   #1
Vinter
Member
 
Registered: Feb 2005
Location: Germany
Distribution: Aptosid
Posts: 148

Rep: Reputation: 19
Help with optimizing a PHP script that processes large amounts of JSON data?


Hi!

I recently got into writing scripts that use the API of reddit.com, and my most promising project is one that tracks users across the various subforums of the site. As a bit of background: apart from doing statistical analysis of the data, the purpose is to let the user tag participants as belonging to certain groups, so that they and their potential biases are recognizable in general discussions. This is achieved through the Reddit Enhancement Suite, which provides direct access to its tagging facilities via JSON input and output. My script takes RES's JSON output about the current tags, merges it with the information I have collected, and returns a JSON blob that can be handed back to RES.

Locally, everything works fine; however, the amount of data to be processed can be large (I'm currently tracking ~17,000 users, on top of whatever pre-existing tags a user has), and my webhost routinely struggles or collapses. I'm planning to tackle this in various ways, but for the moment my biggest problem is large amounts of JSON being submitted by the user, which will happen routinely when people want to update tags I generated before. So I'm looking for ways to improve my script's performance. However, I only started learning PHP (and JS and HTML for the interface) two days ago, so I'm very unsure how to optimize it and still quite shaky about the languages in general.

So, can anyone help me by looking over the code below? Any general comments about improving it are also welcome, of course!

For reference, the site itself is here: http://taglog.site90.com, and the code for the index is here: http://pastebin.com/pduC1yYN

The JSON is simple, formatted as {"user":{"tag":"x","votes":"0","color":"blue"}} for the RES tags and {"user":{"Sub1":5,"Sub2":1,"SubN":36}} for the userbase.

Code:
<?php
header('Content-type: text/plain');
if ( ! empty($_POST['return_tagfile']) ) {
	header('Content-Disposition: attachment; filename="tags.txt"');
}
ini_set('max_execution_time', 300);
$users = json_decode(file_get_contents('users'), true);
if ( ! empty($_FILES['tagfile']['tmp_name']) ) {
	$oldtags = json_decode(file_get_contents($_FILES['tagfile']['tmp_name']), true);
} else {
	// Undo one level of backslash escaping before decoding the pasted JSON.
	$oldtags = json_decode(stripslashes($_POST["oldtags"]), true);
}
$tags = "";
if ( ! isset($_POST['mincomments']) || $_POST['mincomments'] === '' ) {
	$_POST['mincomments'] = 2;
}
foreach ( $users as $user => $subs ) {
	// Remember this user's pre-existing RES tag (if any) before removing it
	// from $oldtags, so the leftover entries can be appended unchanged below.
	$old = isset($oldtags[$user]) ? $oldtags[$user] : null;
	unset($oldtags[$user]);
	$user_groups = "";
	$largest_group = null;
	$largest_group_count = 0;
	$i = 1;
	while ( array_key_exists("name$i",$_POST) ) {
		$group_posts = 0;
		if ( $_POST["name$i"] and $_POST["group$i"] ) {
			foreach ( $_POST["group$i"] as $sub ) {
				if ( isset($subs[$sub]) ) {
					$group_posts += $subs[$sub];
				}
			}
		}
		if ( $group_posts >= $_POST["mincomments"] ) {
			$user_groups = $user_groups.$_POST["name$i"].":".$group_posts." ";
		}
		if ( $group_posts > $largest_group_count ) {
			$largest_group_count = $group_posts;
			$largest_group = $i;
		}
		$i += 1;
	}
	if ( $old ) {
		// Preserve any text the user had placed outside the {...} group block.
		$pre = preg_split("/{/", $old['tag']);
		$pre = trim($pre[0]);
		if ( $pre ) {
			$pre = $pre.' ';
		}
		$parts = preg_split("/}/", $old['tag']);
		$post = trim(array_pop($parts));
		if ( $post ) {
			$post = ' '.$post;
		}
		if ( $old['votes'] ) {
			$votes = $old['votes'];
		} else {
			$votes = 0;
		}
		// Keep the existing color unless the user asked to override it.
		if ( $old['color'] && ( ! isset($_POST['override']) || $_POST['override'] != "true" ) ) {
			$color = $old['color'];
		} else {
			$color = $_POST["col$largest_group"];
		}
	} else {
		$pre = '';
		$post = '';
		$votes = "0";
		$color = $_POST["col$largest_group"];
	}
	if ( $user_groups ) {
		$tags = $tags.'"'.$user.'":{"votes":'.$votes.',"color":"'.$color.'","tag":"'.$pre.'{'.trim($user_groups).'}'.$post.'"},';
	}
}
if ( $oldtags ) { foreach ( $oldtags as $user => $old ) {
	$tags = $tags.'"'.$user.'":{"votes":'.$old['votes'].',"color":"'.$old['color'].'","tag":"'.$old['tag'].'"},';
} }
$tags = "{".trim($tags,',')."}";
echo $tags;
?>
Thank you very much in advance for any suggestions!
Regards,
David
 
07-13-2014, 04:25 PM   #2
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 9,078
Blog Entries: 4

Rep: Reputation: 3176
Okay, and without dumpster-diving into your particular code, here is a general suggestion:

Cleanly separate the process of "discovering work that needs to be done," from "carrying out a particular unit of work," from "managing the processing that is being done by the other two." In other words, a variable number of processes simply look for work to do and put a description of it into an SQL table or set of tables. Likewise, another set of processes (when signaled to do so by their boss) scoop up a work request from the table and carry it out, reporting their progress or lack thereof. A third process or set of processes wears the pointy hat, and among its duties is load-balancing. (There is, as usual, a hierarchy of managers.)

The workflow should be restartable: if one process dies, the work is cleanly rolled back and the unit of work either goes back to be re-attempted or gets marked dead. It must also be scalable and manageable.

So, any single PHP script does not do everything, just as a single worker in an organization does not try to be a jack-of-all-trades hero.

There are, by the way, existing workflow management systems, in various languages, which already address this general type of problem – both for computer workloads and for human ones. As much as possible, "do not do a thing already done."
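
To make the shape of that concrete, here is a minimal sketch of such a work queue. It assumes MySQL accessed through PDO; the "jobs" table layout, the connection details, and the process_job() function are all placeholders of mine, not anything from the original script:

Code:
<?php
// Minimal work-queue sketch, assuming a hypothetical "jobs" table:
//
//   CREATE TABLE jobs (
//       id      INT AUTO_INCREMENT PRIMARY KEY,
//       payload TEXT NOT NULL,              -- e.g. one chunk of the JSON
//       status  ENUM('pending','running','done','dead') DEFAULT 'pending',
//       tries   INT DEFAULT 0
//   );

// Placeholder for the actual unit of work (decode, merge, re-encode, ...).
function process_job($payload) {
	// ... real processing would go here ...
}

$db = new PDO('mysql:host=localhost;dbname=taglog', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

while (true) {
	// Claim one pending job atomically so two workers never grab the same row.
	$db->beginTransaction();
	$stmt = $db->query("SELECT id, payload FROM jobs
	                    WHERE status = 'pending'
	                    ORDER BY id LIMIT 1 FOR UPDATE");
	$job = $stmt->fetch(PDO::FETCH_ASSOC);
	if ( ! $job ) {
		$db->commit();
		break;                               // nothing left to do
	}
	$db->prepare("UPDATE jobs SET status = 'running', tries = tries + 1
	              WHERE id = ?")->execute(array($job['id']));
	$db->commit();

	try {
		process_job($job['payload']);
		$db->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")
		   ->execute(array($job['id']));
	} catch (Exception $e) {
		// Send the unit of work back to 'pending', or mark it dead after
		// three attempts, so the workflow stays restartable.
		$db->prepare("UPDATE jobs SET status = IF(tries >= 3, 'dead', 'pending')
		              WHERE id = ?")->execute(array($job['id']));
	}
}
?>
The FOR UPDATE lock is what keeps two workers from claiming the same row, and the status column is what makes the whole thing restartable.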
 
07-13-2014, 10:02 PM   #3
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,498

Rep: Reputation: 1806
I'm not very familiar with PHP. But a general rule for optimizing code in scripting languages is to write less of it, since any code you write will be slower than the built-in stuff written in C. So instead of putting together the JSON yourself, it may be faster to use json_encode(); I expect it will be clearer at least.
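
For instance, a rough sketch of that idea (the user names, tag values, and the sample $oldtags array are made-up examples; json_encode() itself is the real built-in):

Code:
<?php
// Collect the result in nested PHP arrays and let the built-in json_encode()
// produce the final string, instead of concatenating it by hand.

$result = array();

// One entry per tracked user, in the same shape the script already builds:
$result['some_tracked_user'] = array(
	'votes' => 0,
	'color' => 'blue',
	'tag'   => '{Sub1:5 Sub2:36}',
);

// Leftover pre-existing RES tags can be copied over unchanged:
$oldtags = array(
	'some_other_user' => array('tag' => 'x', 'votes' => '0', 'color' => 'blue'),
);
foreach ( $oldtags as $user => $old ) {
	$result[$user] = $old;
}

echo json_encode($result);
?>
As a bonus, json_encode() takes care of escaping quotes and backslashes inside tag text, which the hand-built string currently doesn't.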
 
  

