I want to let people know I've just released a new version of colmux, a tool in the collectl-utils package on SourceForge.
In brief, it lets you monitor an entire cluster and look at virtually any performance metric in a 'top'-style display. It is similar to the top command but works with any collectl command, which means you can display the top-n NFS clients, nodes that are heating up, memory hogs, flaky network interfaces, slow disks, or just about anything else you can think of. It's been tested on clusters of over 2K nodes, so I know it works reasonably well in the face of a lot of data, and that is with monitoring once a second!
To read more, see:
http://collectl-utils.sourceforge.net/colmux.html
I can promise you will never look at cluster monitoring the same way again.
-mark