Web application freezes for some offices, not others
I'm pretty sure this is a network problem, but hey I could be wrong. I will try to keep the first post brief to minimize my biases.
We have a pretty standard LAPP application (1st P for Postgres). The web servers run RHEL 5 and Postgres runs on CentOS 5, all co-located in a data center.
Our users range from the home office user all the way up to large companies with headquarters and offices in different states.
Sometime around last Saturday, users started complaining that the system would freeze in all areas of our software. I saw the problem myself on a very simple page: we have a drop-down that, when you select an item, posts to the server to get info about that item. If you did this ten times in a row, you would never make it through all ten. Sometimes it would stop at 3, sometimes at 7, but it never made it all the way. The amount of data returned by the sample page was minimal. Firebug would claim that it was waiting for the server, usually while getting the page that held our CSS.
By stop I mean Firefox would say “transferring data from your messed up server” but the data would never get there. You would have to completely reload the page in Firefox and start over, and then the problem would happen again a short while later. (Chrome and IE had the issue too.)
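For anyone who wants to run the same test from an affected network without a browser, here is a minimal sketch of the "ten times in a row" check as a script. It is an illustration only: the URL in the usage note is hypothetical, it uses a plain GET rather than the form post the drop-down does, and the 15 second timeout is an arbitrary choice. A request that stalls mid-transfer should show up as a failed attempt with a long elapsed time.

```python
import http.client
import time
import urllib.request


def fetch_once(url, timeout):
    """Fetch url and return the number of body bytes read."""
    # The timeout applies to each socket operation, including the body read,
    # so a transfer that stalls part-way raises instead of hanging forever.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return len(response.read())


def repeat_requests(url, count=10, timeout=15.0, fetch=fetch_once):
    """Request url `count` times in a row and record how each attempt ended."""
    results = []
    for attempt in range(1, count + 1):
        started = time.monotonic()
        try:
            size = fetch(url, timeout)
            outcome = "ok, %d bytes" % size
        except (OSError, http.client.HTTPException) as exc:
            # Timeouts, connection resets and URL errors are all OSError subclasses.
            outcome = "failed: %r" % (exc,)
        elapsed = time.monotonic() - started
        results.append((attempt, outcome, elapsed))
    return results
```

Usage would look something like `for row in repeat_requests("http://app.example.com/item?id=1"): print(row)`, substituting the real page. Comparing the output from an affected office with the output from an unaffected one would show whether the stall depends on the client network.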
The fun part: only certain offices have the problem. Everyone else ran normally day in and day out without a complaint. The offices that were affected, from a single home office to a whole offsite branch, would experience the problem on every computer on their network.
Server logs don't appear to show anything strange. All other users are churning along happily, database usage is low, and nothing is timing out. Apache is happy, top doesn't show anything exciting on the webserver, and neither does lsof. The colo handles the firewall, NAT and routers.
Users that had the problem
3-7 User office Speakeasy T1, Edgemarc 4500 Router
1 User home office Sprint Wireless
5 User office Level3 Bonded 3Gig
? User office ? T1
? User office ? DSL
All users claimed that they had no other issues with any other websites (except BofA).
Users not having the problem
Mega User Office with Mega bandwidth
1 User offices with Cable Modems
A lot more didn't have the problem than did. Maybe a 10 to 1 ratio.
Now it starts to get stranger.
For the T1 Office, the problem went away, apparently by itself, after 2 days.
For the Sprint Wireless office, problem went away after 3 days.
For the Bonded 3 Gig office, problem went away after 4 days.
I know that nothing was changed on the servers that fixed this problem. Even after the T1 Office was working, all of the others were still having the problem. I am fairly certain that nothing was changed in the offices that reported the problem.
Others are probably still having this problem.
I was able to reproduce the problem on older versions of our software (>1 month old), on software that we didn't write (phpPgAdmin), and on all of the servers that we have at the data center.
So, any idea what could cause this? How did the systems apparently self repair?
I can supply more info as needed, but I cannot recreate the issue on a client that I control.
Thanks
Rusty