7 Key Themes in Maximising Your Performance Testing Outcomes
Having recently been engaged to evaluate a client's approach to load testing a web-based application, we wanted to share some of the themes arising from the outcomes in the hope that some of these can assist you in your own performance testing journey.
To set the scene, the system under test provided a client-centric healthcare viewing capability. It was implemented as an ASP.NET Model View Controller (MVC) web application backed by a set of .NET Windows Communication Foundation (WCF) web services and hosted in Internet Information Services (IIS) on the client's own data centre infrastructure. The MVC web application provided users with a view of client data obtained via the WCF web services, with some of this data in turn obtained on-demand from a remote third-party service.
We were engaged to evaluate the client's approach to load testing and to suggest any process or configuration changes that could affect the outcomes of their test efforts. Although influencing the architecture of the overall solution was outside the scope of this engagement, we were also asked to identify bottlenecks within specific components where possible.
Here are some of the key themes we encountered during this evaluation, together with some fairly specific examples of recommendations we made where these themes were affecting the test outcomes.
Theme 1: Analyse & create a realistic usage model
We discovered that while some initial usage analysis had been performed, the usage model implemented in the load testing framework was unrealistic. The usage analysis had identified the anticipated number of requests per minute to a specific screen within the system during peak periods. However, analysis had stopped short of understanding how a user typically interacts with the system, and consequently, the load testing framework modelled this peak number of users (let's say 100) all simultaneously sending the same request to the system at the first millisecond of the peak period. Further to this, the usage model simulated 0 users using the system prior to this (no prior load), and indeed 0 users using the system after that first millisecond (no sustained load). Some of the improvements we suggested included:
Changing the usage model employed by the load testing framework to model typical system usage more realistically, moving away from an instantaneous vertical spike from 0 to 100 users and instead allowing for various steady-state usage profiles, for example 100 users per minute with requests distributed across that minute
Extending the modelling of each user session to include not only loading a specific screen, but also typical system usage while in that screen, for example reviewing different tabs of information, downloading additional information, or performing additional functions
Extending the duration of each test run to provide a more realistic average when the system is under varying levels of sustained load
Extending the modelling of usage over time to model expected usage curves across a day, for instance lesser use earlier in the morning, ramping to peak use late morning, reducing somewhat during lunchtime, then ramping back to peak use again in mid-afternoon, before again tailing away later in the afternoon. Depending on the resources available, this could potentially be modelled across a shorter period, for instance 1-2 hours
Further improving the realism of the test scenario by simulating other realistic load on the system and environment, for example background processing or use of shared infrastructure by other systems
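The suggestions above can be sketched as a small arrival schedule. This is a minimal illustration rather than the framework used in the engagement: the profile points, function names and the one-hour duration are all assumptions chosen to show sessions being spread across each minute and the rate following a compressed day-shaped curve.

```python
import random

# Hypothetical day-shaped profile compressed into one test run:
# (fraction of run elapsed, fraction of peak load).
DAY_PROFILE = [(0.0, 0.2), (0.3, 1.0), (0.5, 0.6), (0.7, 1.0), (1.0, 0.2)]


def target_rate(elapsed_s, duration_s, peak_per_minute, profile=DAY_PROFILE):
    """Linearly interpolate the target arrivals per minute at a point in the run."""
    x = min(max(elapsed_s / duration_s, 0.0), 1.0)
    for (x0, y0), (x1, y1) in zip(profile, profile[1:]):
        if x0 <= x <= x1:
            frac = (x - x0) / (x1 - x0)
            return peak_per_minute * (y0 + frac * (y1 - y0))
    return peak_per_minute * profile[-1][1]


def arrival_offsets(duration_s, peak_per_minute, seed=0):
    """Spread session start times across each minute instead of firing them all at t=0."""
    rng = random.Random(seed)
    offsets = []
    for minute_start in range(0, duration_s, 60):
        sessions = round(target_rate(minute_start, duration_s, peak_per_minute))
        offsets.extend(minute_start + rng.uniform(0, 60) for _ in range(sessions))
    return sorted(offsets)


# One-hour run peaking at 100 new sessions per minute (illustrative values).
schedule = arrival_offsets(duration_s=3600, peak_per_minute=100)
```

Each offset is the number of seconds into the run at which a simulated user session should start; a load driver would then play the full session (tabs, downloads and so on) from that point.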
Theme 2: Analyse & establish realistic performance targets: what's important?
Building on the creation of a realistic usage model, it's also important to analyse and establish realistic performance targets. It's not particularly helpful to make a blanket assertion that all system screens must respond in under 3 seconds without understanding how users interact with the system and how its response time may be affected by other constraints. We all want our systems to respond as close to instantly as possible, but this is rarely achievable. Given that, we should focus on what is important, particularly to our users. In this particular case, we determined that three metrics were particularly important to the client:
The initial response time: the time between when the user initiates an action and when they see something happening, and can potentially start interacting with the screen
The total response time: the time before all content is loaded and the user can fully interact with the screen
The number of concurrent users: the number of user sessions that can be supported concurrently without crossing a defined threshold for initial and total response time
Having established these metrics, we were able to re-set performance targets around these, for example:
Support a sustained load of 100 concurrent users while providing an initial response time not exceeding 5 seconds and a total response time not exceeding 15 seconds.
This is just an example, but as we'll see with some of our other themes, we discovered that in the case of the system under test, the original target of response time under 3 seconds was never going to be achievable.
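A target expressed this way is straightforward to check mechanically. The sketch below assumes each test session yields an (initial, total) response-time pair in seconds and that the target applies to the average; both the function name and the use of the mean are assumptions, and a percentile may suit some systems better.

```python
from statistics import mean


def evaluate_targets(samples, max_initial_s=5.0, max_total_s=15.0):
    """Compare average response times against the example targets.

    `samples` is a list of (initial_s, total_s) pairs, one per user session.
    """
    avg_initial = mean(initial for initial, _ in samples)
    avg_total = mean(total for _, total in samples)
    return {
        "avg_initial_s": avg_initial,
        "avg_total_s": avg_total,
        "initial_ok": avg_initial <= max_initial_s,
        "total_ok": avg_total <= max_total_s,
    }


# Illustrative samples only, not measurements from the engagement.
report = evaluate_targets([(4.2, 12.0), (5.1, 16.5), (3.9, 11.0)])
```

Reporting the two metrics separately keeps it visible when a screen becomes usable quickly but still takes too long to finish loading.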
Theme 3: Verify performance of dependencies
Having created a realistic usage model and established our performance targets, we explored the system architecture to attempt to identify any bottlenecks. During this analysis, we explored not only the system itself, but other components of the system architecture that the system was dependent upon. This is an important step when performance testing any system, but particularly in a distributed system where the system itself may not be entirely in control of its destiny. Here again we made several very interesting discoveries:
The remote third-party service had been mocked to remove the actual third-party service as a dependency and to simulate a consistent response time (based on Production metrics) for the services invoked by the system under test. Ordinarily this is an excellent approach to avoid having variability in external dependencies influence test outcomes. However, we discovered that as load on the mocked service increased, it did not in fact provide a consistent response time, but that its response time increased significantly in line with load. So, in effect, while it had been an assumption that mocking the service would allow the test efforts to focus on the system under test rather than this dependency, the mocked service itself had been contributing to compromising the test results
Due to the original usage model employed (instantaneously increasing from 0 to 100 users and then to 0 again), we identified a bottleneck in the rate at which the IIS web server was able to create new pipeline instances to service an instantaneous vertical spike in requests. This reinforces the impact of the original poorly devised usage model on the test outcomes. Web servers (not just IIS) are designed to scale their processing resources up and down as they observe load: as observed load increases, the web server allocates more resources to service the load; as observed load decreases, the web server gradually releases those resources back to the operating system. In the completely unrealistic case where IIS had observed absolutely no load and was then instantaneously asked to service 100 users simultaneously, it scrambled to allocate resources (pipeline instances in the IIS world) to process the incoming requests. We were able to observe this directly through performance monitoring: IIS was queuing incoming requests while it had insufficient pipeline instances to process them, and the queuing decreased as it allocated sufficient instances. This also highlights how important it is to identify performance counters that can readily assist with detecting these kinds of bottlenecks.
Our final observation here leads nicely into our next theme…
Theme 4: Test with Production-realistic configuration (infrastructure, application topology, application configuration)
We also found that the load testing was being conducted against a different configuration to production. It's OK to load test against a non-production configuration if you're attempting to establish a performance baseline for a non-production configuration. But you certainly cannot and should not assume that you can establish a baseline using one configuration and extrapolate this for a completely different configuration. Here in particular we found:
The infrastructure specs for the load test environment were significantly different to the specs for the production environment. Consequently, we implemented a change to the environment under test to utilise production-equivalent specs in order to obtain a more production-realistic set of test results
A review of the configuration settings for the system under test discovered that while it supported caching, this had been disabled, even though utilising caching is the vendor recommended configuration, and as such it was intended to be utilised in the production environment. So once again, we implemented a change to the system under test to configure it consistent with the production environment, including re-enabling caching.
Theme 5: Test with realistic data sets
While investigating why caching had been disabled, we discovered this had been done in an attempt to simulate a greater volume of test data than was actually available in the environment under test. The rationale was that rather than loading additional test data into the system database, disabling caching would allow multiple user sessions to simulate accessing distinct client data (when it was in fact not distinct). This rationale is flawed for a number of reasons:
By disabling caching, you're testing with a system configuration that is inconsistent with your target production configuration
You're disabling a mechanism that has clearly and deliberately been included in the system under test to achieve better performance outcomes
You're ignoring the effect that volume of data may have on performance of other components of the system under test such as its database or web services: if the production database is anticipated to have 1,000 or 2,000 or 5,000 distinct clients, you should be testing with that volume of data present
The solution to this was fairly simple and a step that could easily have been taken much earlier: We inserted 1,000 test clients into the system database to provide a more realistic data set.
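Seeding distinct test records is usually only a few lines of scripting. The sketch below uses an in-memory SQLite database purely as a stand-in; the `clients` table, its columns and the naming scheme are hypothetical and do not reflect the actual system's schema.

```python
import sqlite3


def seed_clients(conn, count=1000):
    """Insert `count` distinct synthetic clients into a hypothetical `clients` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clients (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    rows = [(i, f"Test Client {i:05d}") for i in range(1, count + 1)]
    conn.executemany("INSERT INTO clients (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the real system database
inserted = seed_clients(conn, 1000)
```

With distinct records in place, each simulated user session can request a different client, so caching can stay enabled as it would be in production.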
Theme 6: Verify the correct function of the load test framework itself
In this case, Selenium had previously been selected as the engine for driving the load test framework. Selenium is an excellent tool for scripting and testing browser-based interactions as part of functional system testing; however, it can be fairly heavy to use for load testing as it requires distinct "driver" browser instances to be created to simulate user sessions. To attempt to achieve the target number of concurrent users, a Java-based framework had been constructed around Selenium to distribute load testing driver instances to multiple nodes. These nodes were various virtual machines in the client's development and test environments.
Across the course of the engagement we revealed multiple issues with this approach, including:
Anti-virus software on one of the load test nodes was blocking access to the system under test, and consequently its driver instances were timing out. These were being counted as failures by the test framework
Load from other systems on the load test nodes was affecting the test framework, in particular lack of memory resulting in the framework being either unable to create Selenium driver instances or these instances responding more slowly. The best example of this was when one of the load test nodes that was part of an active-passive cluster for another system (and had been the passive node) became the active node as a result of a failover
Excessively long requests that failed to time out were deliberately being excluded from test results, further compromising results
The overhead associated with the load test framework in fact meant that no single node was capable of simulating more than 20 concurrent users (requiring at least 5 nodes to achieve 100 concurrent users) and further these nodes could only sustain those 20 users for 1 minute before exhausting the resources of the host. Worse, the custom Java-based test framework had very limited granularity in its reporting to enable detecting issues with the framework itself. This meant that often test results were compromised by the framework and did not in any way accurately reflect the performance of the system under test.
This reflects that if you're conducting any kind of performance testing, it's equally important to verify the performance of your test framework itself.
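Two of the findings above lend themselves to a quick sanity check: how many nodes a target load needs, and whether hung or overlong requests are counted rather than discarded. The following is a minimal sketch with hypothetical function names; the 30-second timeout in the usage line is an assumption, not a value from the engagement.

```python
import math


def nodes_required(target_users, users_per_node):
    """Smallest number of load-generator nodes that can host the target user count."""
    return math.ceil(target_users / users_per_node)


def summarise(durations_s, timeout_s):
    """Summarise one run; a duration of None means the request never completed.

    Requests that hang or exceed the timeout are counted as failures
    rather than dropped, so they stay visible in the results.
    """
    failed = sum(1 for d in durations_s if d is None or d > timeout_s)
    total = len(durations_s)
    return {
        "total": total,
        "failed": failed,
        "failure_rate": failed / total if total else 0.0,
    }


nodes = nodes_required(target_users=100, users_per_node=20)
summary = summarise([1.2, None, 40.0, 2.5], timeout_s=30.0)
```

Recording per-node figures alongside the system's own metrics also makes it easier to tell a slow load generator apart from a slow system under test.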
Theme 7: Identify & analyse system-specific bottlenecks
Having remediated and re-established confidence in the load testing approach and framework, we were finally in a position to focus on the performance of the system under test.
With the preceding remediation applied, we observed an average initial response time of 7 seconds and reduced the reported average total response time from 65 seconds to 25 seconds. While this is a significant improvement, achieved mostly through appropriate configuration and better testing techniques, it's still a long way from the targets of 5 and 15 seconds respectively, and still not ideal from the perspective of the end-user experience.
Through further analysis of system logs it became clear there was still a significant bottleneck in obtaining information via the remote third-party service and displaying it to the user. We were able to observe that, for each user session, the system was initiating multiple concurrent XHR requests from the browser to the application server to obtain information to display, and each of these requests required an interaction with the remote third-party service. However, the behaviour we were observing in the browser didn't align with the system logs, in which it appeared that these requests were being serviced sequentially.
On this occasion we were fortunate enough to have access to the product development team for the system under test, and together to be able to identify the issue as being related to the use of ASP.NET Session State in the MVC controller. Although the concurrent XHR requests from the user's browser session were being accepted by IIS for processing, because they were all from the same user session (bound to an ASP.NET Session), they were being processed sequentially due to the locking model employed by ASP.NET Session State. In effect this meant that the average total response time we were observing was almost exactly the sum of the response time for each XHR request processed sequentially, and as currently implemented could never be better than this.
Working further with us, the product development team refactored the ASP.NET controller to use a read-only session, freeing ASP.NET to process requests for the same session in parallel. With these changes we were able to further reduce the average initial response time to less than 500 ms and the average total response time to 5-6 seconds.
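The effect of a per-session exclusive lock can be illustrated with a small simulation. This is an analogy written in Python with asyncio, not ASP.NET code: the handler, the lock and the 50 ms durations are all assumptions standing in for the XHR requests and the session state lock described above.

```python
import asyncio
import time


async def handle(duration_s, session_lock=None):
    """Simulate one XHR request; hold the session lock for its duration if one is given."""
    if session_lock is None:
        await asyncio.sleep(duration_s)
    else:
        async with session_lock:
            await asyncio.sleep(duration_s)


async def page_load(durations, exclusive):
    """Issue all requests concurrently and return the elapsed wall-clock time."""
    lock = asyncio.Lock() if exclusive else None
    started = time.perf_counter()
    await asyncio.gather(*(handle(d, lock) for d in durations))
    return time.perf_counter() - started


durations = [0.05, 0.05, 0.05, 0.05]  # four simulated requests of 50 ms each
serial = asyncio.run(page_load(durations, exclusive=True))    # roughly the sum
parallel = asyncio.run(page_load(durations, exclusive=False))  # roughly the longest
print(f"exclusive lock: {serial:.2f}s, no lock: {parallel:.2f}s")
```

With the lock held, total time approaches the sum of the individual request times; without it, total time approaches the duration of the slowest request, which mirrors the improvement seen once the session was made read-only.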
Hopefully you’ve enjoyed our performance testing journey. It was certainly an exceedingly interesting and rewarding engagement with some great outcomes for our client. And while none of the themes we discussed should really come as much of a surprise, it does serve to illustrate the importance of thinking through anything you do – or risk your efforts being significantly compromised.