PaaS Design Lessons Learned - part 2: Design Objectives & Principles

Building a PaaS is hard, yet everyone seems to be doing it. In this blog series I look back on a decade of building bespoke PaaS, and the lessons learned along the way. In this second part we’ll dive into Design Objectives and Principles.

May 25, 2020
consultancy cloud architecture

This is the second post in this series. Make sure to check out Part 1 if you haven’t already.

Why –> What –> How

Previously I discussed the different meanings of the word ‘platform’ and how the meaning depends on your organization, the abstraction level and its importance, and how to leverage a DevRel Chapter to make sure you end up building the best platform possible and are continuously able to adapt and improve.

Today will be about taking our platform sketch from Part 1, and adding Design Objectives and Design Principles which will focus on how the platform should be built. After all, a platform that is hard to improve or maintain will turn into a very expensive piece of tech debt. So let’s prevent that.

To make sure we don’t paint ourselves into the proverbial corner, it’s good practice to define Design Objectives as part of your platform architecture. Think of them as non-functional requirements that document the resulting behaviours or capabilities of platform components, but from a platform team perspective.

The Game of Wants

When designing a platform, there’s a wide variety of stakeholders to consider: the platform engineers who build and run it, the developers who build and run their software on top of it, the product owners of those respective bits of software, the Security Officer, the DPO, the Data Analysts, etcetera. They all want things from the platform, and while they come from very different points of view, their ‘wants’ are often interlinked. Even if that’s not immediately apparent, collecting all these ‘wants’ in a single place can give you interesting insights.

Let’s play a Game of Wants. We’ll gather various stakeholders in a room and pick a topic. Each stakeholder then writes down on sticky notes what they want with regard to that topic.

Example: Platform Maintenance

Let’s discuss the topic of Platform Maintenance. This will mean different things to different people. To some, it means ‘downtime’; to others it may mean ‘new functionality’, ‘things may break’, or ‘a lot of work on my plate’. As a result, the answers you get during the Game of Wants are very, very different.

Product Owner:

  1. I don’t want downtime for our customers
  2. My developers need to be able to keep working at least during office hours

Security Officer:

  1. I want you to keep everything up-to-date. No lagging behind with updates.

Risk Manager:

  1. Maintenance steps need to be testable
  2. Rollbacks need to be possible

Software Developer:

  1. I need the platform to constantly improve, and add functionality we need
  2. Bugs in the platform need to be addressed a.s.a.p.
  3. I don’t want the platform to be down while I’m using it

Platform Engineer:

  1. I want maintenance to be straightforward
  2. I want maintenance steps to be testable
  3. I don’t want to do maintenance at weird times like 3am on a Saturday
  4. I want to do maintenance in small steps (because it’s easier)
  5. I want to be able to test improvements side-by-side to see if they work properly

Let’s look at this small set of very different wants. What can we get from this?

  • We need to be able to do maintenance regularly, basically all the time. Add functionality, fix bugs, stay up-to-date.
  • Maintenance needs to be easy
  • Maintenance needs to consist of testable steps, with a means of going back
  • Maintenance should happen without downtime

Wait... NO DOWNTIME? Where did that come from? Well, the Platform Engineer doesn’t want to do maintenance at weird times, and the Developer and PO want the platform to not be down during office hours. So either we find a way to do maintenance without downtime, or we need to compromise here. But we’re not going to. Don’t compromise because something looks hard.
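
The way ‘no downtime’ falls out of the combined wants can be sketched as a small set intersection. This is only an illustration: the 09:00 to 17:00 office hours, and reading ‘weird times’ as ‘anything outside office hours’, are assumptions made for the sketch, not something the stakeholders specified.

```python
# Hypothetical model: for each stakeholder, the hours of the day in which
# they would tolerate maintenance downtime.
HOURS = set(range(24))
OFFICE_HOURS = set(range(9, 17))  # assumption: 09:00-17:00

acceptable_downtime = {
    # PO and Developer: no downtime during office hours.
    "Product Owner": HOURS - OFFICE_HOURS,
    "Software Developer": HOURS - OFFICE_HOURS,
    # Platform Engineer: no maintenance at weird times (assumed: outside office hours).
    "Platform Engineer": OFFICE_HOURS,
}


def common_window(windows):
    """Return the hours that every stakeholder accepts for downtime."""
    return set.intersection(*windows.values())


window = common_window(acceptable_downtime)
# An empty window means no maintenance slot satisfies everyone,
# so maintenance has to happen without downtime.
needs_zero_downtime = not window
print(sorted(window), needs_zero_downtime)
```

Under these assumptions the common window is empty, which is exactly the point where the choice becomes ‘compromise’ or ‘zero downtime’.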

Design Objectives & Principles

Now that we’ve collected the ‘wants’ on a subject and got a list of requirements out of that, how do we turn these into parameters that we can use for our technical architecture? For this, we document Design Objectives and Principles. They are a bit similar, and they typically go hand-in-hand, but there’s a clear distinction between the two:

  • Design Principle: high-level description of how something should be done, or how something should behave.
  • Design Objective: describes a capability of the resulting product.

In this particular case, the ‘wants’ for Platform Maintenance on our container platform turned into the following Design Objectives and Principles:

Design Principles:

  1. Decoupled State: anything that absolutely needs to have state should store it outside of the container clusters, using databases or cloud storage offerings.
  2. Ephemeral & Stateless Container Clusters: the actual container clusters are considered disposable. Any state (scheduler, service discovery, runtime config) is treated as ‘volatile’.
  3. Immutability: every component of the Container Platform is treated as an immutable object. This means that components are replaced rather than updated/reconfigured.

Design Objectives:

  1. Zero-Downtime Maintenance: maintenance can be performed at any time as it introduces no downtime whatsoever
  2. Parallel Isolated Container Clusters: we can run multiple isolated container clusters within the same environment
  3. Dynamic Cluster Wiring: we can pick any specific cluster (or set of clusters) to be used for actual customer traffic at any given time

As you can see, the Principles give us guidance on how to build the platform, and if we follow those principles, we achieve our objectives. And by achieving our objectives, we get a platform where maintenance can be done without downtime, at any time, and where maintenance consists of building a parallel new cluster next to the existing one. This means maintenance doesn’t affect running production, no gradual upgrade scenarios need to be tested, and our ‘rollback scenario’ consists of simply switching back to the old cluster.
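
To make the interplay between these Principles and Objectives a bit more concrete, here is a minimal sketch in Python. It is not how any particular platform implements this; the `Cluster` and `Environment` classes and their methods are hypothetical names invented for illustration. Clusters are immutable and disposable, several of them can exist side by side, and ‘wiring’ decides which ones receive customer traffic. State is assumed to live outside the clusters, so it does not appear in the model at all.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    """An immutable, disposable cluster: it is replaced, never updated."""
    name: str
    version: str


@dataclass
class Environment:
    """Holds parallel clusters plus the wiring that selects the live ones."""
    clusters: dict = field(default_factory=dict)
    live: tuple = ()
    previous: tuple = ()

    def build(self, name, version):
        # Maintenance means building a new cluster next to the existing ones.
        cluster = Cluster(name, version)
        self.clusters[name] = cluster
        return cluster

    def wire(self, *names):
        # Dynamic Cluster Wiring: point customer traffic at the given clusters.
        unknown = [n for n in names if n not in self.clusters]
        if unknown:
            raise KeyError(f"unknown clusters: {unknown}")
        self.previous, self.live = self.live, tuple(names)

    def rollback(self):
        # Rollback is just switching the wiring back to the old clusters.
        if not self.previous:
            raise RuntimeError("no previous wiring to roll back to")
        self.live, self.previous = self.previous, self.live

    def retire(self, name):
        # Only clusters that serve no traffic may be thrown away.
        if name in self.live:
            raise ValueError("cannot retire a cluster that serves traffic")
        if name in self.previous:
            self.previous = ()
        del self.clusters[name]


env = Environment()
env.build("blue", "1.0")
env.wire("blue")
env.build("green", "1.1")  # runs in parallel, receives no traffic yet
env.wire("green")          # the actual 'maintenance': switch traffic over
env.rollback()             # something is wrong: switch back to blue
```

Note that nothing in this sketch ever modifies a running cluster. The same `build` and `wire` steps also cover recovery from failure, because a broken cluster is replaced rather than repaired.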

Not every ‘want’ is a ‘need’. Or is it?

In the example we tried to give all stakeholders what they wanted, and it resulted in a non-functional requirement of Zero-Downtime Maintenance, and some Design Principles to help us get it. Nobody actually asked for those specific things, though. However, combining certain ‘wants’ can sometimes dramatically raise the bar. And sometimes that’s completely fine. Looking at our example, you may argue that nobody actually needs Zero-Downtime, because the occasional maintenance window is completely fine. And the platform will be cheaper to build if we just accept those maintenance windows and pay some Platform Engineers for working on the weekends.

However, this is where you need to put your CFO on hold. We may want to accept some planned downtime, but we still want our maintenance steps to be easy and testable. We still want easy rollbacks. Side-by-side testing. If we build our platform according to the Principles outlined in the example, we get all of that, and we get Zero-Downtime Maintenance. We also get an easy way of recovering from failure using proven steps: instead of trying to fix your burning cluster in the middle of the night, you just roll out a fresh one. Just like you do for maintenance. Less stress, no duct tape, better post-mortems.

So this ‘want’ may actually turn out to be a massive ‘need’.

To catch these ‘hidden needs’ as early as possible, we can improve our Game of Wants by using multiple categories (colors) of sticky notes for ‘needs’ and ‘wants’. If needed, you can add ‘outside requirements’ (legislation, compliance regulations) in a separate color, so that certain needs and wants can be linked to them for a more fine-grained view. In order to not confuse stakeholders early on, it’s usually best to start with just one color: the ‘wants’. Get as many of them as you can, and then start grouping them. Which ones are similar, and how can we summarize them? Which ones contradict each other? Then play Round 2, to further refine:

  • Wants vs Needs: Which wants are actually needs?
  • New Insights: Which new wants or needs did we discover from Round 1?
  • Reasoning: Why do we need certain things (external requirements, linked requirements, etcetera)?
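
If you want to keep track of the sticky notes after the session, the two rounds map onto a very small data model. The sketch below is one possible way to record them, not a prescribed tool: the `Note` fields, the group labels and the example reason are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    # One sticky note color per category.
    WANT = "want"
    NEED = "need"
    OUTSIDE_REQUIREMENT = "outside requirement"


@dataclass
class Note:
    stakeholder: str
    text: str
    kind: Kind = Kind.WANT  # Round 1: every note starts out as a 'want'
    group: str = ""         # hypothetical label assigned while grouping
    reason: str = ""        # Round 2: why this is actually needed


def by_group(notes):
    """Collect similar notes under their group label."""
    groups = defaultdict(list)
    for note in notes:
        groups[note.group].append(note)
    return dict(groups)


def promote(note, reason):
    """Round 2: reclassify a want as a need and record the reasoning."""
    note.kind = Kind.NEED
    note.reason = reason


notes = [
    Note("Product Owner", "I don't want downtime for our customers", group="downtime"),
    Note("Software Developer", "I don't want the platform to be down while I'm using it", group="downtime"),
    Note("Risk Manager", "Rollbacks need to be possible", group="testable steps"),
]
groups = by_group(notes)
promote(notes[2], "linked requirement: recovering from failure with proven steps")
```

Keeping the reason next to each need makes it easier to revisit the decision later, for example when someone asks why Zero-Downtime Maintenance is on the list at all.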