Mature Optimization - Part 2
Recap
This is part 2 of a series about finding a structured, repeatable way to optimize software. In part 1 of this series, we talked about a process for making decisions about performance without having to worry about committing the "premature optimization" sin. Basically, it comes down to these steps:
1. Goals - Set some performance goals early in the life of the project and continue to revisit and adjust them as needed.
2. Testing - Test your system to see if it meets the current set of performance goals.
3. Monitoring - Monitor usage of the system and understand how your users actually interface with your software.
4. Optimizations - Optimize your software if the steps above dictate that it's time. In other words, optimize if testing and/or monitoring show that your system's performance is out of line with your goals.
In part 2 I'll be discussing more details about each of these steps and giving you some examples of how to apply them. Let's do it.
Goals
Setting goals for your software's performance is really important. Without these you're just making guesses at how performant your system needs to be. So how do we go about establishing some reasonable goals? This is probably the most "creative" part of the process. You're going to have to think about things like users, usage patterns, data access, and technologies. What type of software you are working with will have a large impact on this as well, so I'm going to break it up into three common system types and talk about each one independently:
Web apps
Mobile apps
APIs
Of course there are other types of systems out there like video games and desktop apps that will have their own unique performance constraints. Sorry if I didn't mention your favorite above, but you can email me at dave@ultravioletsoftware.com or use the comments section if you have specific questions.
Web apps
Raw number of users is a common and important consideration here. How many users do you currently have or are you expecting to have in the near future? This can be hard to determine but there are some techniques you can use to get a reasonable guesstimate. If you know of another service similar to yours, you might consider asking them how many users they have. If you have a marketing department, they may have a goal they are targeting. If you are a startup and you are following a lean approach (which I highly recommend), your MVP work may give you an idea of how many people are interested.
Another factor you should consider with a website is usage patterns. Beyond raw user count, how many concurrent users will you have? Do you have peak usage times? Do all your users essentially use the system the same way? What about admin users?
Do you have resource-hungry features? Are you doing image processing or running some kind of simulation? That's going to eat up CPU cycles and you won't be able to have too many users doing those kinds of operations on one box. The same goes for features that involve a lot of disk access, require a lot of RAM, or use a lot of network bandwidth.
Mobile apps
Mobile apps usually share a lot of the same issues with web apps because they are often connected to the same backend via a set of services. So in general you should have many of the same goals. Where mobile apps differ is that they can distribute a lot of the resource usage across your user base rather than having it all concentrated on your server. So if you are doing image processing on the mobile device, only that one device is affected. Your server and all other users remain blissfully unaware. This can be significant in terms of scalability of your system. Of course, it introduces a lot of other issues that will have to be considered as well. Code duplication is one. Also, mobile devices aren't as powerful as big servers so they can't do as much work as fast. Depending on what you're doing, it may still be a better user experience to have some of the heavy lifting done on the server.
Another big issue with mobile devices is that hardware varies widely and you don't have much control over it. Choose the hardware and OS requirements for your app carefully. Don't make them overly restrictive but make sure that users on the low end have a shot at a reasonable experience.
APIs
Setting goals for your API can be a bit trickier than it is for web and mobile apps. This is because APIs are consumed by computers rather than people. This means that they are capable of rapid-fire bursts of traffic that no human could ever hope to match. Fortunately, many API calls are still triggered by human actions such as clicking a button or opening a page. This means you can still tie API usage to user count and usage patterns in many cases.
Of course there are some completely automated systems that consume APIs and it can be difficult or impossible to control how these systems work. This has been a historical problem for APIs and is usually solved by introducing some sort of rate limiting scheme whereby the number of calls a particular API consumer can make is capped within a certain time period (say 100 per minute).
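To make the rate limiting idea concrete, here is a minimal sketch of one common scheme, a fixed-window limiter that caps each consumer at 100 calls per minute. The class name, the `allow` method, and the consumer ids are all made up for illustration, and a real API would more likely lean on a gateway or shared store than on in-process memory like this.

```python
import time
from collections import defaultdict


class FixedWindowRateLimiter:
    """Caps each consumer at `limit` calls per `window_seconds` window.

    Hypothetical illustration only; state lives in process memory.
    """

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        # consumer id -> [window start time, calls counted in that window]
        self._windows = defaultdict(lambda: [None, 0])

    def allow(self, consumer_id):
        """Return True if this call is within the consumer's cap, else False."""
        now = self.clock()
        window = self._windows[consumer_id]
        # Start a fresh window on first use or once the current one has expired.
        if window[0] is None or now - window[0] >= self.window_seconds:
            window[0] = now
            window[1] = 0
        if window[1] >= self.limit:
            return False
        window[1] += 1
        return True
```

A request handler would call `limiter.allow(api_key)` before doing any work and reject the call (for HTTP, typically with a 429 status) when it returns False. Fixed windows allow short bursts around a window boundary; sliding-window or token-bucket schemes smooth that out at the cost of a little more bookkeeping.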
An example
Let's say that you're building a system that is intended to help small stores (e.g. Kwik-E-Mart) manage their inventory. The system lets the user search and browse current inventory, view recent sales, get some insight into incoming supply shipments, and view some analytics.
Let's say that there is a mobile and web application front end to this thing (which I've just decided is called Hoard). Both the web and mobile apps talk to the back end via a set of web services. Pretty standard architecture.
You want to establish some performance goals, so you ask the product lead how many users they expect to have in the first year. He's not 100% sure but they are hoping for about 25,000. You research the closest thing you can find to Hoard already in the market and find something called Cyber Stocker. You call up their sales team and bribe them for some user base info. You have to sit through a demo but you eventually learn that they have 75,000 users, about 50,000 of which have been active in the last year. Good job on the stealth ops. Given that Cyber Stocker has been in business about four years, it seems reasonable to assume Hoard could accrue 25,000 users in the first year.
Raw user count: 25,000
Now you start to think about usage patterns. When will people be using Hoard? Through requirements research, you learn that most small shop keepers don't check inventory every day. They do it once a week either on Monday or Friday (I have no idea if that's true but this is my hypothetical world and that's how people do it there). That means you will probably have most of your traffic occurring during those two days of the week. You decide you need to be able to support 15,000 users (slightly over half the user population) during peak time.
Concurrent user count: 15,000
Now you start to think about the really resource intensive operations that will be supported. Most of the features don't involve a lot of computation, it's mostly just pulling records out of a database and putting them on screen. There is a sales report that involves some large database queries and a bit of post processing of data. It's not a report that is run all the time but the system needs to be able to generate it on the fly. You decide that you need to be able to support generating 1 sales report per minute with little noticeable effect on the rest of the system.
Resource intensive operation: 1 sales report per minute
Okay, so here is a summary of your goals for the hypothetical Hoard system:
25,000 total users
15,000 concurrent users
1 sales report per minute
In real life you'd probably have quite a few more goals than these, but I can only make these blogs so long before they can no longer be effectively prooof red.
Testing
So now that you know what performance goals your system needs to meet, it's time to find out how well it stacks up. This is usually done by writing a suite of automated tests that will exercise the system in some way and record the results (e.g. the number of requests handled per second). Here are some general guidelines:
Look into existing products that can help you with this. I've used JMeter, Grinder, and LoadRunner in addition to writing my own systems. I'd strongly recommend trying an existing product before writing your own test system. While you can get exactly what you want out of your own code, writing a highly scalable load tester is not a small job.
Make sure that you carefully record your results in a way that lets you easily compare them across tests. It's important to spend a little time thinking about this and make sure that you can get useful data out of your tests. Many tools come out of the box with summaries and reports. If you write your own system, exporting data to a spreadsheet can be a good way to go.
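If you do end up rolling your own, one low-tech way to keep results comparable is to append one row per test run to a CSV file that opens directly in a spreadsheet. This is only a sketch; the column names and the `append_result` helper are my own invention, so swap in whatever metrics you actually care about.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column set; adjust to the metrics your tests produce.
FIELDS = ["timestamp", "test_name", "concurrent_users",
          "requests_per_second", "avg_response_ms"]


def append_result(path, test_name, concurrent_users,
                  requests_per_second, avg_response_ms):
    """Append one test run to a CSV file, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "test_name": test_name,
            "concurrent_users": concurrent_users,
            "requests_per_second": requests_per_second,
            "avg_response_ms": avg_response_ms,
        })
```

Because every run lands in the same file with the same columns, comparing this week's numbers to last week's is a matter of sorting or charting a single sheet.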
Do your best to automate as much of the process as is practical. You will be running these tests a lot and the more overhead there is setting them up and post-processing the results, the longer it will take to get answers. Let the computer take care of the drudgery!
Testing Hoard
To continue our example with our sweet inventory app, Hoard, let's say that it's now a few months into development, and you want to see how the system is stacking up. You decide to test out the system under a full load of 15,000 concurrent users. You decide to simulate a Friday inventory scenario where throughout the day most of the users will be requesting inventory data. Through some early beta testing with users you have learned that most users browse through their inventory one page at a time, checking it up against actual stock in the store. On average they ask for a new page of data roughly once per minute. Here is what usage looks like throughout the day:
As you can see, usage peaks around 10:00 AM, when there are 15,000 users on the system at the same time. It then dips for lunch and picks up again until around 3:00 PM, when it starts to drop off. At its 10:00 AM peak, with each user asking for a page roughly once per minute, the system needs to support 250 (15,000 / 60) requests for a page of inventory data per second. That means the web service responsible for sending out a page of inventory data must be load tested at this rate.
You set up a Grinder test to determine if your system can meet this level of load. After running the test you see that the system only gets up to about 200 requests per second. Time to optimize some code.
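Here is the arithmetic behind that verdict as a tiny script. The numbers come straight from the Hoard scenario (15,000 concurrent users, one page request per user per minute, roughly 200 requests per second measured); the function and variable names are just for illustration.

```python
def required_requests_per_second(concurrent_users, seconds_between_requests):
    """Average request rate if each user sends one request per interval."""
    return concurrent_users / seconds_between_requests


peak_users = 15_000          # concurrent users at the 10:00 AM peak
seconds_between_pages = 60   # each user asks for a page about once per minute
measured_rps = 200           # what the load test actually reached

required_rps = required_requests_per_second(peak_users, seconds_between_pages)
shortfall = required_rps - measured_rps

print(f"required: {required_rps:.0f} req/s, measured: {measured_rps} req/s")
if measured_rps >= required_rps:
    print("goal met")
else:
    print(f"short by {shortfall:.0f} req/s")
```

Keep in mind this is an average rate. It assumes requests are spread evenly across the minute, so real traffic with bursts will need some headroom on top of it.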
Optimizations
Congratulations, you are in no danger of optimizing prematurely! You have identified a verifiable performance problem that must be addressed. The question is what are you going to do about it? Well that depends greatly on what the issue is. Do you have an issue with RAM consumption, CPUs pegging, databases taking too long to run their queries? The possible situations are limitless so I won't spend too much time going over solutions. I will give you some pointers on good places to start your diagnostic work:
CPU pegging
No more RAM available
No more disk space available
Network bandwidth is saturated
Slow database queries
Slow 3rd party libraries or external services
Optimizing Hoard
In our scenario we can't get inventory data out of Hoard fast enough. In order to figure out where the bottleneck lies, you start investigating the resource usage on the server while your tests are running. RAM usage looks fine, CPU isn't doing much, and the disk is fairly active but not out of control. Could it be that the database simply can't get the data out fast enough? Maybe you need to optimize the query. To find out, you try running the inventory page queries directly against the database to see how many pages you can get out in a second. Turns out you can get over 500 pages per second. So it's not the database. Next you look at the network bandwidth and notice that it's completely saturated. Since you're running the server in Amazon EC2, you spin up a new instance with higher network throughput and rerun your tests. Now you're getting almost 400 requests per second! So great, you've met your performance goals.
Keep in mind that there are lots of tools out there for finding bottlenecks, and which ones you use will depend greatly on your platform, your technology stack, and your experience level. Let's discuss some of them.
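The "run it directly and count" step can be as simple as a timing loop around a single operation. Below is a rough sketch: `measure_throughput` is a helper I made up, and `fetch_inventory_page` is a stand-in that just burns a little CPU where the real inventory page query would go.

```python
import time


def measure_throughput(operation, iterations=1000):
    """Call `operation` `iterations` times in a row and return calls per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        operation()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a very coarse timer.
    return iterations / elapsed if elapsed > 0 else float("inf")


def fetch_inventory_page():
    # Placeholder: replace with the real inventory page query.
    sum(range(1000))


if __name__ == "__main__":
    rate = measure_throughput(fetch_inventory_page)
    print(f"{rate:.0f} pages per second")
```

This runs the calls one after another from a single thread, so it tells you how fast one caller can go, not how the component behaves under concurrent load. It is still a quick way to rule a component in or out before reaching for heavier tools.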
Monitoring
The remaining step is system monitoring. Over time the number of users on your system is going to change. Hopefully as people realize how great your software is, it will increase. That means that your current set of goals may eventually become insufficient to cover usage and you'll need to set new goals, re-test, and possibly optimize.
So how do we monitor usage? Well there are a number of great tools out there for doing that. All your major databases will have very good resource monitoring tools, query analyzers, and optimization features available. Windows comes with Resource Monitor and Performance Monitor. Unix and Linux come with lots of tools like "top" and "iostat". Keep in mind that cloud environments like EC2 are really VMs running on a shared machine, which may have an effect on the stats you get out of some of these tools. In the end, you will have to get familiar with the tools available for your particular scenario.
There are also a lot of excellent 3rd party tools available. One of the more popular options is New Relic, which is an impressive system but also fairly expensive and may be overkill for some situations. There are other free tools out there that may suit your needs just fine. Nagios and Zabbix both come to mind.
Whatever tool you choose, make sure you set it up to alert you when your software is in danger of exceeding its limitations. Does your API support 1000 requests per hour? You're going to want to know if you had a day where you got close to that mark. How quickly has it been trending up? Do you have a few months before it's likely to cross the 1000 mark or a few days? You need the data to know!
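To answer the "months or days?" question you need some kind of trend estimate. Most monitoring tools can chart this for you, but as a sketch of the idea, the function below fits a straight line through each day's peak hourly request count and projects when it would reach the limit. The function name and the sample numbers are invented, and a linear trend is an assumption that real traffic may not honor.

```python
def days_until_limit(daily_peaks, limit):
    """Estimate days until a straight-line trend through `daily_peaks` hits `limit`.

    `daily_peaks` holds one value per day, oldest first (for example each
    day's busiest hour). Returns 0.0 if the latest value is already at or
    over the limit, and None if the trend is flat or falling.
    """
    n = len(daily_peaks)
    if n < 2:
        raise ValueError("need at least two days of data")
    if daily_peaks[-1] >= limit:
        return 0.0

    # Ordinary least-squares fit of peak against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_peaks) / n
    numerator = sum((x - mean_x) * (y - mean_y)
                    for x, y in enumerate(daily_peaks))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing_day = (limit - intercept) / slope
    # Express the answer relative to the most recent day in the data.
    return max(0.0, crossing_day - (n - 1))


# Invented sample data: peak requests per hour over five days.
print(days_until_limit([600, 650, 700, 750, 800], limit=1000))
```

With the sample data the peak grows by 50 requests per hour each day, so the projection lands four days out. An alert that fires when this number drops below, say, 30 gives you time to set new goals, re-test, and optimize before users feel it.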
Of course you'll also want to monitor storage space, RAM, CPU usage, security, DoS, and various other metrics that are important for all software.
Wrap it up
This concludes our tour of software performance strategies. I hope you found it useful. Like I said in the beginning, it's a fair amount of work, but if you are serious about keeping your system up and running and yourself asleep at night, you do what it takes.