Sunday, 29 April 2012

Performance Testing on a shoestring

The setting for this particular story is the run-up to the biggest yearly horse racing event in the UK, the Grand National at Aintree.

Even people who don't usually bet have a flutter, heading to the nearest betting shop or hitting the net. Good news for an online gambling company, mixed news when it comes to the servers, which need to cope with a lot more traffic than at any other time of the year. Quite a few online gaming companies had problems on the day.

Our servers held and we only load tested one aspect of our system. Here's the story.

About two weeks before the race it was decided that we wanted our payment section tested after all. Right, plenty of time, better get started. We looked into JMeter, but there was limited knowledge of it in the team, so we asked some developers for help. After a lot of head-scratching the verdict was: yes, it can be done, with 2 devs in about 3 weeks. So the idea of using JMeter was out of the window. A commercial solution was not feasible due to timescales and some other factors.

We decided, since it had to be quick and was to be a one-off (we'll worry about next year later), to build an in-house solution. While I was running around gathering the required information, investigating possible pitfalls, etc., our automation expert (don't tell him, he will ask for more money) used AutoIT to code the load tool.
Yes, you read that right. That's about the most unlikely tool for a performance test that I can think of, but I employed most of my team members and I trust their abilities, so it was time to put my money where my mouth is.

We knew that we wanted to get 8+ transactions a second to test that our system and our third party payment provider could take the heat. We also wanted that sustained over a period of time. The half hour before the race is the busiest, so we decided to just run it for that long. That's the detailed requirements gathering out of the way...

I won't go into the details of the system for obvious reasons, but rather explain how we went about it. We ran a transaction manually and found that it took between 10 and 25 seconds to complete, including filling out the form again for a second try. We also found that we could fit 8 browser windows on a monitor and show all necessary fields and buttons when setting the window size to 35%.

The idea was then to click "Submit" on all 8 browsers at the same time (or a millisecond apart). That worked fine. So, we could hit the system with 8 transactions. Once.
OK, no problem. Since we needed about 25 seconds for a round trip, that was maintainable: we added another 5 seconds in case the system slowed down and then went to hunt down 30 PCs in the office (an experience in itself) to start them a second apart.
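
The arithmetic behind the 30 PCs can be written down as a quick sanity check. This is only a sketch in Python (the real tool was written in AutoIT), and the function and constant names are made up for illustration.

```python
def machines_needed(round_trip_s: int, buffer_s: int, stagger_s: int = 1) -> int:
    """Machines required so that one machine fires every `stagger_s` seconds
    and each machine is free again before its next turn comes round."""
    cycle_s = round_trip_s + buffer_s
    return -(-cycle_s // stagger_s)  # ceiling division


WINDOWS_PER_MACHINE = 8   # browser windows that fit on one monitor
ROUND_TRIP_S = 25         # slowest manual transaction we measured
BUFFER_S = 5              # headroom in case the system slowed down

machines = machines_needed(ROUND_TRIP_S, BUFFER_S)
rate_per_second = WINDOWS_PER_MACHINE      # one machine fires each second
total_in_half_hour = rate_per_second * 30 * 60

print(machines, rate_per_second, total_in_half_hour)  # 30 8 14400
```

So half an hour at 8 transactions a second comes to 14,400 submissions in total.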

After having identified the machines we set all 30 to the same screen resolution (I didn't know how many different monitors we had until that day), deployed our performance tool to each machine and set up the machines. That involved setting the IE homepage to the desired URL, running the setup tool, clicking the necessary buttons and fields so that the correct X/Y coordinates were captured. The AutoIT application would then calculate the relative difference for the other 7 windows.
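
To give an idea of what "calculate the relative difference" means, here is a rough sketch in Python rather than AutoIT. The 4x2 grid, the window size and the captured coordinates are all assumptions for the example, not the values we actually used.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def click_points(captured: Point, window_w: int, window_h: int,
                 cols: int = 4, rows: int = 2) -> List[Point]:
    """Return the click position of one control in every window.

    `captured` is the X/Y recorded in the top-left window. The other
    windows are assumed to be identical copies shifted by whole window
    widths and heights.
    """
    x, y = captured
    return [(x + col * window_w, y + row * window_h)
            for row in range(rows)
            for col in range(cols)]


# Hypothetical numbers: a 1920x1080 screen split into 4x2 windows of 480x540,
# with the Submit button captured at (200, 400) in the first window.
points = click_points((200, 400), window_w=480, window_h=540)
print(points[0], points[3], points[4], points[7])
# (200, 400) (1640, 400) (200, 940) (1640, 940)
```

This only works if every window has the same size and zoom, which is why all 30 machines had to be set to the same screen resolution first.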

That was a lot of manual setup but we didn't have the time to code for more. Each machine got its own ID from 1 to 30 (set up in the application and posted on the monitor) and we coded an editable start time into it. For example, the machine with ID 1 would start at 08:30:01, ID 2 would start at 08:30:02, etc., and then loop round, starting again at 08:30:31.
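
The schedule itself is simple enough to sketch in a few lines. Again this is Python for illustration, not the AutoIT code, and `start_times` is a name invented for this example.

```python
from datetime import datetime, timedelta
from typing import List

MACHINES = 30  # also the cycle length in seconds, one machine per second


def start_times(machine_id: int, base: datetime, rounds: int) -> List[datetime]:
    """Firing times for one machine: offset by its ID, then every 30 seconds."""
    if not 1 <= machine_id <= MACHINES:
        raise ValueError("machine_id must be between 1 and 30")
    first = base + timedelta(seconds=machine_id)
    return [first + timedelta(seconds=MACHINES * n) for n in range(rounds)]


# The date is arbitrary; only the time of day matters here.
base = datetime(2000, 1, 1, 8, 30, 0)
for t in start_times(1, base, rounds=3):
    print(t.strftime("%H:%M:%S"))
# 08:30:01
# 08:30:31
# 08:31:01
```

Because every machine works from the same base time and its own ID, all PC clocks need to agree to within a second for this to hold.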

Quite a bit of thinking and discussion went into this bit, because we could have let each window just start again as soon as its transaction had finished. Depending on the system response time, though, the transactions would have drifted apart, potentially resulting in a much higher number of transactions in some seconds and none in others. I was more concerned about the former, as it could theoretically have been 30*8=240 transactions at the same time, which would very likely have created serious problems.
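
A toy simulation shows the difference between the two approaches. This is a sketch under assumptions: the round trip is drawn uniformly from 10 to 25 whole seconds, which is only a guess based on our manual timings, and none of the names come from the real tool.

```python
import random
from collections import Counter

MACHINES, WINDOWS, CYCLE_S, DURATION_S = 30, 8, 30, 1800


def fixed_schedule() -> Counter:
    """Machine m fires all its windows at second m, m+30, m+60, ..."""
    per_second = Counter()
    for machine in range(1, MACHINES + 1):
        for t in range(machine, DURATION_S + 1, CYCLE_S):
            per_second[t] += WINDOWS
    return per_second


def free_running(seed: int = 1) -> Counter:
    """Each window resubmits as soon as its previous transaction finishes."""
    rng = random.Random(seed)
    per_second = Counter()
    for machine in range(1, MACHINES + 1):
        for _ in range(WINDOWS):
            t = machine  # same staggered first start as the fixed schedule
            while t <= DURATION_S:
                per_second[t] += 1
                t += rng.randint(10, 25)  # assumed round trip in seconds
    return per_second


fixed, free = fixed_schedule(), free_running()
print("fixed schedule: min", min(fixed.values()), "max", max(fixed.values()))
print("free running:   min", min(free.values()), "max", max(free.values()))
```

With the fixed schedule every second gets exactly 8 submissions. The free-running version is uneven, and it is bound to peak above 8 because every window loops faster than once per 30 seconds, so the overall load goes up as well as becoming less predictable.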

So we had our 8 transactions a second running on a 30-second cycle, looping around indefinitely until we stopped them. Setup was completed in the evening after a 13+ hour day. We also did a small test run just to make sure it would work.

Before starting the next morning we had sys admins and several other people in place to monitor the various system components during the test and in case something went seriously wrong.

Execution threw up some problems, not least some network issues that hadn't been identified before. Also, some pages threw errors that the application couldn't recover from. Time was running out and people came into the office wanting their PCs to work on, so work was stopped. Two days later, after resolving the issues and using the extra time to put some resilience into the AutoIT scripts so that recovery after errors was better, we tried again.

This time it worked really well. Some machines had to be started again but overall we put the desired load on the system.

Lessons learnt:
  • Regardless of how much you ask people for information, there's always something that someone has forgotten to mention that will ruin your day. Do a test run and plan for round 2.
  • Don't use IE if you don't have a standard company build on all PCs. Use a browser that you install yourself and control so you don't run into configuration issues when you least need it.
  • Ask people for help. Running it yourself will only stress you out. I found that once we explained the somewhat mad plan, people were happy to assist.
  • Tell people if you're hijacking their PCs, as you may not get a chance to revert all changes. That was done for most, but the ones we forgot were understandably miffed that we changed their settings without warning.
  • Get a demonstration of what will be monitored and what can be saved for analysis later. Assuming it will all be there can lead to disappointment and repeats of the test.
  • Don't ask others to do the boring jobs. Being hands on helps the rest of the team to see that you're serious about making this work.
  • It's possible to get the job done, regardless of whether you're prepared, have the tools or the time. Determination, the belief that the job can be done and trust in others will get you a long way.
If anyone has had similar experiences, I'd like to hear about them.

Thanks for reading.

Saturday, 28 April 2012

Do I want to write this?

I usually keep my private and work net presence clearly separated. This one is different, but I decided to post it as I couldn't get the testing mindset out of my private life and it may help. It also explains why I haven't blogged or been active in the testing scene in the last couple of months. Not that I think that needs explaining.

Germany/Wuppertal, December 2011, Intensive Care Unit. I browse through the notes the nurses left for the last two days, noting how their structure is easily recognisable to the next ICU nurse. It's easy to see at a glance what drugs and treatments the patient got in the last 24 hours. It has to be; any failure here could be fatal.

Looking at the syringes next to my father's bed I wonder what they all are and take note of the names. He's been in a coma since before Christmas. Reading the drug names comes easy; I worked as a pharmaceutical research scientist for over a decade. Memories of that come back. At home I find that Ketamine is for disassociating the body from pain and that the street price has fallen over the last couple of years. For some reason that stuck with me. I read up on resuscitation, survival rates and the side effects like personality changes and brain disorders. I read a lot and educated myself, but sometimes I learn things that I really don't want to know.

Beds, syringes, forms, power, gas supplies, room layout, etc. are all standardized so that any nurse or doctor can take over where the last one left off. In an ICU that's of vital importance. Each shift has a 1-hour handover/scrum to brief the next shift on what happened. I wonder what would happen if we were to do 1-hour handovers each day. In my current line of work it's not that important to know exactly where the last person left off. It's useful, but no one dies if you miss a piece of information.

Seeing that everyone working in the ICU absolutely has to know everything about each patient/project was an eye opener. Of course that comes at a cost. But in that context that cost is worth paying.
So how much is it worth in our projects that everyone knows everything about the project? How many people in the project have no idea what their colleagues are working on? Is that OK or is that acceptable? How is the risk covered that information goes missing?

I learned a lot more in these weeks. How to recognise if people make mistakes and where the system fails; who puts in more effort than the rest; for some nurses the relatives play a bigger part, for others the patient is the only important thing. Most are somewhere in between. I reckon that's the Manager in me making these observations.
The people who I think of as "best" without defining what I mean exactly all have a passion for what they're doing. They're not only knowledgeable but are emotionally involved. I can say the same about the testing scene or probably any other craft that people are working in.

Of course the "learning" during this time wasn't purely to do with this mindset. Most of it was on the emotional side, as can be expected. I learned quite a bit about what my approach to thinking and learning is compared to my parents' and what is self-taught. But watching myself making these observations was a convincing sign that I'm working in the right job.

My whole family spent Christmas in Germany (unplanned and at very short notice; 6 hours from getting the call to leaving for the airport with my wife, 8-year-old son and all the Christmas presents) while my father was in a coma the whole time. He woke up in January and I flew back to Germany to spend some time with him. He died in February.

You won't be forgotten.