Risk is…

Risk adds complexity to all software engineering projects. It's difficult to define: some define it subjectively as an uncertainty, others define it mathematically as a function of impact and likelihood. However you define it, there's no one-size-fits-all answer for how to approach it, yet we find it shapes our whole test approach: what we test, when we test and how much we test.

When I read that the Ministry of Testing July/August bloggers club challenge was “Risk is…”, it took me a while to decide exactly what to write about. Eventually, I decided to use the definition that a risk is a potential mode of failure to explore risks that have been realised as actual failures in my recent projects. I hope that these anecdotes either teach you something and stop you making the mistakes we made, or raise a smile as you remember the time that you already did!

Risk: Not enough time to investigate all the bad smells

Whilst testing a recent release on a tight timeline, we ran our regression tests and found some issues. One issue in particular we investigated and traced back to our test network configuration: some settings had been changed and were now inconsistent with others. With this set-up, we couldn't expect the clients to work. Having debugged the issue this far, and knowing that it would not be trivial to resolve, we made a call to move the issue onto our backlog for after the release deadline and focus our remaining time on other issues raised by the regression testing.

Unfortunately, this proved to be the wrong call. Before we had a chance to pick the task up from our backlog, issues raised in the field highlighted that there was a regression in this functionality: the broken configuration had been hiding a bug. We could have found this had we decided to fix the configuration and rerun these particular tests; however, due to time pressures on the project, we had decided against this. The risk that we would miss some issues due to our tight timeline came true.

Risk: Too many variables to test exhaustively

We develop very complex products; our front-end clients expose a range of different features based on the combination of configuration set at the levels of individual users, their businesses and their service providers. This results in a vast test space; the skill of our testers is to analyse the potential interactions between different variables and prioritise them against the quality criteria a project is aiming for. In other words, we evaluate the likelihood and impact of an issue, and use this, the risk, to direct our testing. Inevitably, we get this wrong sometimes, and in a recent project we missed a key interaction, resulting in a crash after upgrade for users with a particular setting.
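To illustrate how quickly that test space grows, here's a minimal sketch; the setting names and values are invented for illustration, not our actual product configuration, but the combinatorics are the point:

```python
from itertools import product

# Hypothetical configuration options at each of the three levels.
# A real product would have far more settings than this.
user_settings = {"theme": ["light", "dark"], "notifications": ["all", "mentions", "off"]}
business_settings = {"sso_enabled": [True, False], "call_recording": [True, False]}
provider_settings = {"region": ["eu", "us", "apac"], "feature_pack": ["basic", "plus"]}

all_settings = {**user_settings, **business_settings, **provider_settings}
names = list(all_settings)

# Every full configuration is one choice per setting: the Cartesian product.
combinations = list(product(*(all_settings[n] for n in names)))

# Even six toy settings give 2 * 3 * 2 * 2 * 3 * 2 = 144 configurations,
# which is why exhaustive testing stops being feasible almost immediately.
print(len(combinations))
```

With dozens of real settings the product runs into the millions, which is why risk-based prioritisation of the likely interactions, rather than exhaustive coverage, is the only workable approach.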

Risk: Getting the balance between conflicting requirements wrong

Due to the range of servers in our solution, we often try to limit the number of servers that an enhancement impacts in order to simplify the version compatibility requirements and the complexity of the delivery to customers. This was an important angle in the design of a recent project we delivered; however, once it hit the field, we saw the impact this decision had on diagnosability – something we'd not spotted in testing! An intermediate server was not changed, and therefore, despite the various new error cases we could hit, it always sent the same error code to the client. As a result, the diagnostics our support team received from clients (which is the easiest information for them to access) appeared the same across a range of issues seen in the field.
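A rough sketch of the failure mode, with invented server behaviour and error codes rather than our actual protocol: the back end now distinguishes several new failure cases, but the unchanged intermediate server has no mapping for them, so every one collapses to the same legacy code by the time it reaches the client.

```python
# Hypothetical: new failure modes the back end can now report.
BACKEND_ERRORS = ["quota_exceeded", "token_expired", "unsupported_codec"]

def intermediate_response(backend_error: str) -> int:
    """Simulate the unchanged intermediate server translating a back-end
    error into the code it sends to the client."""
    # The intermediate server predates these error cases, so its mapping
    # has no entries for them and everything falls through to one
    # generic legacy code.
    legacy_mapping: dict[str, int] = {}
    return legacy_mapping.get(backend_error, 500)

# From the client's (and the support team's) point of view,
# every distinct back-end failure looks identical.
codes = {intermediate_response(e) for e in BACKEND_ERRORS}
print(codes)
```

The client-side diagnostics are the cheapest for support to collect, so losing the distinction at this hop is exactly where it hurts most.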

In this case, we'd prioritised a requirement that was important to some stakeholders, but at the cost of others. Had we evaluated this risk more accurately and realised the cost of this trade-off, we'd probably have arrived at a different decision when weighing these conflicting requirements.

Risk: Test environment not simulating the real world

As hard as we try to simulate our real-world deployments with our test environments, there will always be minor differences with impacts you haven't pre-empted. In this case, we made assumptions about user devices and network conditions as part of two fixes in a release. When combined, these assumptions simply didn't stack up, and as different engineers were working on the two issues, nobody had a full picture of the risk involved in making both of these changes.

We learn from mistakes like these by improving our internal processes, but we also improve our test practices. Testing in prod is a hot topic in the test community, but as a B2B organisation, this isn't easy for us to do: there are thousands of prods and we don't own them. Fortunately, we do own one of them, as we dogfood our solutions. After falling foul of this risk, we resolved to make better use of beta testing in our dogfooding programme, as doing so may well have saved us in this case.

