01 August 2012
UCMA apps, load testing, and timeouts
One of the things that often comes up too late in a development cycle is testing your application under load. Sometimes we think to do this early on, but more often than not, throwing traffic at an application until it breaks is one of the last things developers do. The problem is that a load issue found at the end of a dev cycle can be very difficult to fix, and it can call into question some of the fundamental aspects of your architecture.
Recently, I found an issue when testing a UCMA app that looked remarkably like calls to the app were being throttled. I was even able to reproduce the error using a trivial UCMA app that simply started an ApplicationEndpoint, registered for AudioVideoCallReceived, accepted a call, and waited (if you want to follow along, the source code for the app and Snooper traces are on SkyDrive here). I used SIPp, which is a great SIP traffic generation tool, to send calls through a mediation server at 10 calls per second, up to 100 calls. After about the first 30 calls, I started noticing failures. Here’s a sample trace from the front end server:
And here’s the same side of the call from the app server:
As you can see, the original INVITE went to the app and was received, and the app promptly responded with the 100, 180, and 200. The front end sent the INVITE at 14:10:21, and if we ignore the time difference between the two servers, you can see that the app server’s responses all went out within milliseconds. If you look at the front end trace, though, it appears that the 100 took 31 seconds to be processed by the front end server. There’s a lot of signalling going through that system at that point, but did I mention that the front end had 24 CPU cores and 64 GB of memory? And that it barely registered the traffic? This was a real puzzle, and I posted in a few forums (both public and private) trying to find an answer.
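If you want to spot this kind of gap without eyeballing every call in a trace, a small script can do the subtraction for you. The following is only a rough sketch in Python: it assumes you have already pulled out, for each call, the time the front end sent the INVITE and the time it processed the first response. The call IDs, the timestamp format, the millisecond values, and the one-second threshold are all made up for illustration; only the 14:10:21 send time and the 31 second gap come from the trace described above.

```python
from datetime import datetime

# Assumed timestamp layout (HH:MM:SS.mmm); adjust to match your own trace export.
TIME_FORMAT = "%H:%M:%S.%f"


def seconds_between(sent: str, seen: str) -> float:
    """Elapsed seconds between two same-day timestamps from the same machine."""
    delta = datetime.strptime(seen, TIME_FORMAT) - datetime.strptime(sent, TIME_FORMAT)
    return delta.total_seconds()


def slow_responses(events, threshold=1.0):
    """Return (call_id, delay) for calls whose first response took longer than threshold.

    events is a list of (call_id, invite_sent, first_response_seen) tuples.
    """
    flagged = []
    for call_id, invite_sent, response_seen in events:
        delay = seconds_between(invite_sent, response_seen)
        if delay > threshold:
            flagged.append((call_id, delay))
    return flagged


# Hypothetical sample: one healthy call, and one shaped like the failing call above.
sample = [
    ("call-1", "14:10:20.100", "14:10:20.112"),
    ("call-31", "14:10:21.000", "14:10:52.000"),
]

flagged = slow_responses(sample)
print(flagged)
```

Both timestamps for a call should come from the same server's trace, which sidesteps the clock difference between the front end and the app server.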
As it turns out, the solution involved no code changes, and the cause was something completely out of my control. Evidently, a power failure had reset the configuration on the switch that these systems were connected to, which, for some reason, put the ports in half-duplex mode, while Windows assumed full duplex. A duplex mismatch like this can cause some really serious performance problems, because the two ends of the link disagree about when it is safe to transmit, so frames get dropped and have to be retransmitted. Once I found our network admin, convinced him that this was probably a network issue, and got him to fix the duplex mode, all was well again, and traffic started flowing smoothly.
I suppose there are two morals here: one, never assume that the network just works (just like you should never assume that the power is always going to be on), and two, sometimes you can blame the network guy and be right about it.