One of the first stages of designing a system’s architecture is gathering the non-functional requirements.
This stage usually goes like this:
MEMI: So, what are the most important non-functional requirements?
CUSTOMER: … What are non-functional requirements?
M: Well, you know, all the requirements that are not related directly to the system’s functionality, such as reliability, data growth, performance…
C: Yes! Performance! I want that!
M: OK, so how fast do you need this process to be?
C: Mmm…1ms?
These kinds of conversations, combined with the fact that modern-day computers are REALLY FAST, have made us believe that performance is perhaps the most important aspect of a system.
Google “How to make my code” and let autocomplete do its magic, and you’ll see the first option is “…fast”. Hit Enter, and you’ll be looking at more than 29,500,000 results.
So, performance is important, and we should invest a lot of time improving it.
But is it?
Many times, improving performance comes at the expense of other non-functional requirements (let’s go with NFRs from now on, OK?). Here is a quick example:
Say we’re designing a service (let’s call it the Broker) that should pass messages to other services. The Broker receives the message to publish via a REST API, and should forward it to the relevant services, per configuration. Basically – a Pub/Sub service.
This is what the flow looks like:
Now, we have two ways of implementing this Broker.
Alternative 1:
- Publisher calls Broker’s API
- Broker gets subscribers list from an in-memory store
- Broker calls subscribers, passes them the relevant data
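To make this concrete, here is a minimal Python sketch of the first alternative. The class name, the topic-to-URL mapping and the subscriber endpoints are all hypothetical, and the REST layer in front of the Broker is omitted:

```python
# Sketch of Alternative 1: everything stays in memory, nothing is persisted.
import json
import urllib.request

class InMemoryBroker:
    def __init__(self, subscribers_by_topic):
        # In-memory store, e.g. {"orders": ["http://svc-a/notify", "http://svc-b/notify"]}
        self._subscribers = subscribers_by_topic

    def publish(self, topic, message):
        # Step 2: get the subscribers list from the in-memory store
        for url in self._subscribers.get(topic, []):
            # Step 3: call each subscriber, passing it the relevant data.
            # If the process dies anywhere in this loop, the message is simply gone.
            data = json.dumps(message).encode("utf-8")
            request = urllib.request.Request(
                url, data=data, headers={"Content-Type": "application/json"})
            urllib.request.urlopen(request)
```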
Alternative 2:
- Publisher calls Broker’s API
- Broker stores the message in a persistent DB
- Another thread queries the DB periodically (say, every 500ms) and looks for new messages
- When such messages are found, the code locks the records in the DB and retrieves those messages
- It then retrieves the relevant subscribers from an in-memory store, and…
- Calls the subscribers, passing them the relevant data
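And here is a matching sketch of the second alternative, using SQLite as a stand-in for the persistent DB. The schema, the names and the status-flag “locking” are simplified assumptions for illustration, and the actual delivery is left to a callback:

```python
# Sketch of Alternative 2: persist first, deliver later from a polling loop.
import sqlite3
import time

class PersistentBroker:
    def __init__(self, db_path="broker.db", poll_interval=0.5):
        self._db = sqlite3.connect(db_path)
        self._db.execute("""CREATE TABLE IF NOT EXISTS messages (
                                id     INTEGER PRIMARY KEY AUTOINCREMENT,
                                topic  TEXT,
                                body   TEXT,
                                status TEXT DEFAULT 'new')""")
        self._poll_interval = poll_interval  # e.g. 0.5s, as in the flow above

    def publish(self, topic, body):
        # Step 2: store the message durably before doing anything else
        self._db.execute("INSERT INTO messages (topic, body) VALUES (?, ?)", (topic, body))
        self._db.commit()

    def poll_once(self, send):
        # Steps 4-5: look for new messages and mark them as taken
        # (a status flag stands in for real record locking here)
        rows = self._db.execute(
            "SELECT id, topic, body FROM messages WHERE status = 'new'").fetchall()
        for msg_id, topic, body in rows:
            self._db.execute("UPDATE messages SET status = 'sent' WHERE id = ?", (msg_id,))
            self._db.commit()
            send(topic, body)  # step 6: call the subscribers

    def run(self, send):
        # Step 3: a separate loop/thread queries the DB periodically
        while True:
            self.poll_once(send)
            time.sleep(self._poll_interval)
```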
Let’s discuss two questions about those alternatives:
- Which one is faster? – Well, that’s an easy one. The first alternative is definitely much faster. There is no timer, no write to a physical DB (and the disk I/O that comes with it), and everything happens very, very fast in memory.
- Which one is better? – That’s an entirely different question, and it depends heavily on the actual NFRs of the system. One of the most prominent trade-offs in the software architecture industry is the Reliability / Performance trade-off. In general, this trade-off looks like this:
As you can see, the more performant your solution is, the less reliable it is.
Let’s see how this trade-off is reflected in our Broker.
Take a look again at the first alternative implementation. Here are the steps:
- Publisher calls Broker’s API
- Broker gets subscribers list from an in-memory store
- Broker calls subscribers, passing them the relevant data
And now imagine this:
After the Broker gets the subscribers list (step 2), it encounters a problem and shuts down. It doesn’t really matter what the reason is – it could be a hardware malfunction, a bug in the code, anything. And the question is – what will happen to the message in this case?
Well, in this case – the message is simply gone.
Since it was stored only in the Broker’s memory, it cannot be restored and re-sent.
So what have we got? A blazing-fast Broker with very low reliability. Every little bug can cause message loss.
Can you live with that?
That depends, of course, on your NFRs. Perhaps you can (for example, if you’re streaming a high volume of stock data, and when one data point goes missing, the next one, which arrives 100ms later, will cover for it), and perhaps you can’t (for example, if you’re sending messages regarding financial transactions).
On the other hand, if we implement the second alternative – we don’t have that problem. If the Broker shuts down during operation – the message is still safely stored in the DB, and another instance of the Broker can retrieve it and send it.
This way, we get a much slower process, but a much more reliable one.
Looking at the trade-off chart from above, the two alternatives will be represented like this:
This is, of course, a very simplistic description, but you get the gist.
So we now know that reliability can come at the expense of performance.
Are there more factors that can outweigh performance? Well, apparently there are. Here are a few:
- User Interaction – If at the end of your flow there is an actual, human user, then you don’t always have to make sacrifices to gain performance. Most humans won’t notice the difference between 1000ms and 1200ms in an information system (note that realtime apps / games are very different in this aspect). Don’t believe me? Try this awesome test.
- Fire & Forget – When the system performs a task that you don’t really care when it will end – the performance is not very important. A great example of that are almost all the kinds of nightly batches that crunch numbers and produce insights for future reports. You usually don’t really care if the crunching, which began at 2:00am, will end at 2:17am or 2:22am.
So what is the bottom line?
Quite simply – performance is one of the most important NFRs. But it’s not the only one, and it’s definitely not the holy grail of every system.
Always look at performance in light of other, perhaps more important, NFRs, and only then decide how much you’re going to invest in it.
What’s your take on that? Let me know in the comments!
We can also try to achieve both points (#1 & #2) in the above graph, in some situations:
We opt for #1, but also have a reliable polling mechanism (simulating #2) against the source-of-truth.
If we can achieve that, we get both performance & reliability without one compromising the other.
(Sorry for the late response, somehow the notification was lost…)
Yes, you’re right.
We can definitely achieve that this way.
Problem is, we then need to develop two distribution mechanisms, which adds more complexity. For example, it won’t be easy to make sure we don’t publish the same message twice – once via the publisher path and once via the timer.
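To illustrate that duplicate-publish concern, here is a hedged sketch of one common mitigation – giving each message a unique ID and letting both paths check it before delivering. All names here are hypothetical, and the set of delivered IDs is kept in-process only for brevity:

```python
# Hypothetical sketch: deduplicating delivery when both the fast in-memory path
# and the polling path may try to send the same message.
import threading

class DeliveryDeduplicator:
    def __init__(self):
        self._delivered = set()        # IDs of messages already sent (in-process only)
        self._lock = threading.Lock()  # the two paths may run concurrently

    def deliver(self, message_id, send):
        with self._lock:
            if message_id in self._delivered:
                return False           # the other path already handled this message
            self._delivered.add(message_id)
        send()                         # the actual call to the subscribers
        return True
```

In a real system the delivered-IDs record would itself have to be stored reliably, which is exactly the extra complexity mentioned above.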