Need advice on setting up historical process monitoring on Linux
Hello,
I'm working on a research project where we typically start a job on a server, let it run for a day or two, and then come back to look at the results. The problem is that in some cases we're starving the server of resources, which impacts our results because some jobs fail.
I can watch the server manually with something like htop to see which processes are hogging memory (the "command" column is very useful here), but obviously it isn't feasible to sit and do this all day.
I've tried starting atop writing to a rawfile before launching the job. The problem is that the machine I analyze the results on doesn't run exactly the same Linux version as the server, so the atop versions differ and the rawfile won't open on the analysis machine.
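For reference, here's roughly what I'm doing now (the path and the 60-second interval are just placeholders I picked):

    # on the server, before starting the job: write a sample every 60s
    atop -w /tmp/jobrun.raw 60

    # later, on the analysis machine -- this is the step that fails:
    atop -r /tmp/jobrun.raw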
I've also tried netdata, but it doesn't let me drill down to the per-process level.
The best-case scenario is that I can generate some kind of file that accurately records resource usage per process while the job is executing. atop would actually be great if it worked, but the version mismatch making the rawfile incompatible rules it out for me.
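Worst case, I'm considering just sampling ps into a CSV myself, something like this untested sketch (the interval, output path, and top-20 cutoff are arbitrary choices of mine):

    #!/bin/sh
    # naive sampler: timestamp, pid, %cpu, resident memory (KiB), command
    OUT=/tmp/job_procs.csv
    echo "time,pid,pcpu,rss_kib,command" > "$OUT"
    while true; do
        # top 20 processes by resident memory; '=' suppresses headers
        ps -eo pid=,pcpu=,rss=,comm= --sort=-rss | head -n 20 | \
        while read -r pid pcpu rss comm; do
            echo "$(date +%s),$pid,$pcpu,$rss,$comm" >> "$OUT"
        done
        sleep 60
    done

But I'd much rather use an existing tool than maintain something like this myself.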
I can probably dedicate about 5 hours to this. Any suggestions?
Thanks!