PaaS Design Lessons Learned - part 1

Building a PaaS is hard, yet everyone seems to be doing it. In this blog series I will look back on a decade of building bespoke PaaS, and lessons learned along the way.

May 11, 2020
consultancy cloud architecture

I’ve spent the better part of the past decade designing, building, and running platforms. Those platforms have differed in many ways, but what all of them had in common was that they were built so that software could be deployed and run. In recent years, more and more companies have begun to realize they need to ‘be good at shipping software’ and have subsequently increased their efforts in that area, often resulting in the creation of bespoke platforms.

In this series of blog posts I will go over various aspects of designing platforms. Things to consider, approaches that work (or don’t), and general guidelines for designing platforms that are usable, maintainable, secure, and allow for future development.

Before we begin: technology

Let’s start by addressing the big T-word. Yes, there usually is a lot of technology involved in creating a PaaS. However, I am not going to tell you which pieces of tech to glue together, or how to do it. After building platforms for over a decade, I’ve learned that whatever tech stack you end up using for a platform will not be the stack you’ll use for the next one. So I won’t be listing the tech I glued together on my most recent endeavour, for two reasons:

Whatever worked in my situation isn’t necessarily the correct match for yours (if that doesn’t make sense to you, this blog series will be for you)
This kind of information doesn’t age well

I’ll touch on technology every now and then, though, mostly to provide examples. But there’s a lot more to building a good platform than technology, and that’s what I want to focus on in this series of blogs.

What is a platform?

Depending on who you ask, you’ll get vastly different answers. To some, the answer will be ‘AWS’. Others will say ‘Kubernetes’. Heroku. But also: GitLab. The answer can go pretty much anywhere, from an infrastructure provider (with or without an API) to a collection of infrastructure, VCS, and CI/CD tools. And basically anything in between. So that doesn’t quite answer the question. Or does it?

I think all of the answers I mentioned above are completely correct. They differ because different organizations or teams use different levels of abstraction, from very low to very high. But they all have something in common: the platform is a service or collection of services that is made available to a product organization or team for the purpose of delivering business value through software. Its purpose is to offer Happy-Flow-as-a-Service (and maybe then some).

The platforms I’ve worked on usually consisted of some abstraction on top of infrastructure, and a tool-as-a-service model for things like code hosting, CI, and deployment. In various shapes and sizes.

Abstraction Level

I’ve mentioned the abstraction level a few times now, and the fact that it may differ from platform to platform. Or rather: from organization to organization. The abstraction level essentially tells you how much you’re getting as a service. Getting a lot as a service may sound great, and it very well may be great. But the abstraction level is a bit of a double-edged sword. Let me explain.

Consider a platform with a very high abstraction level. It offers ‘everything’: code hosting, predefined CI/CD pipeline templates (based on app type or programming language), artifact storage, uniform deployment, and a Containers-as-a-Service infrastructure that handles app scheduling, monitoring, logging, metrics, dashboards, storage, databases, secret management, routing, load-balancing, CDN, DNS, backups, restores, etcetera.

There are some clear advantages to having such a platform:

Easy onboarding. Most of the heavy lifting is done by the platform, and there’s a paved way to shipping stuff.
Low complexity. You don’t have to worry about the intricacies of complex (distributed) systems, so there is less to learn.
High efficiency. You don’t have to spend time configuring supporting infra/services. You build your app, the platform offers the rest.
Happy developers. Developers who aren’t comfortable with infrastructure, CI/CD, or operations, or who simply prefer focusing on just their code, will be very happy.
Consistency. Because ‘everything’ is handled by the platform, the way it is used will be very consistent across teams.
Compliance. Because ‘everything’ is handled by the platform, you can build your compliance controls into the platform.

But the convenience also has a cost:

Bottom-up innovation and experimentation are hard. When the way things are done is heavily standardized, maybe even ‘set in stone’, it’s hard to come up with something new (that doesn’t fit that standard just yet) and validate it in a real-life scenario. Experimentation (validating in production) may be completely impossible.
Hidden complexity. Hiding complexity, while convenient when everything is working well, may also be annoying as the platform basically is a ‘black box’ for developers. It can also mean that developers come up with less-than-ideal implementations by leveraging wiggle room in your abstraction.
Unhappy developers. Developers (usually those somewhat experienced with infra/cloud technology) who prefer being less restricted in their freedom to tinker with things are not going to like the restrictions the platform puts on them.

As you can see, some advantages are also disadvantages. And if we take things to the other extreme, with a platform that is basically an empty AWS/GCP/Azure account, you’ll see a similar but reversed list. Teams get all the freedom, but need to figure out lots of things themselves, which increases complexity and reduces efficiency and consistency. It makes compliance harder. It makes it harder for some developers to do things. But your ‘rockstar devs’ that would otherwise feel restricted will love it. And it’s much easier to come up with something new.

In my experience, if you’re building a PaaS, you’re going to end up somewhere in between. Where exactly depends heavily on the people building and using it, the engineering culture, the business objectives, and to some extent the budget available. And it definitely pays off to get this right. Just don’t expect that ‘getting it right’ means that all developers love your platform. Accept that a small portion won’t. Also, ‘getting it right’ isn’t final. The software organization will evolve. Technology will evolve. The team building the PaaS will evolve. So your abstraction level should evolve as well. Sometimes that evolution simply means building a new platform and deprecating the old one. But in other cases you may have introduced a rigid ‘happy flow’ that you can extend later on to allow for more freedom. Teams who need or want that freedom can use it; teams who are happy where they are can just carry on as before.

I think the most useful approach is working towards a sensible level of configurability, with rock-solid defaults. Make it easy to be default, but also make it possible to be special. But it’s definitely more of a ‘2.0 feature’ than something you’ll build in your first iteration. You need time, usage, and feedback to make the right decisions there. So what does version 1.0 look like? Version 1.0 may consist of consuming off-the-shelf PaaS services until you run into their limits. Or it may be a low-abstraction setup (empty AWS account, vanilla Kubernetes, etcetera) that is flexible but harder to use. Either way, you need Version 1.0 to learn. Learn what is needed, and what isn’t. What works, and what doesn’t. Which assumptions were wrong. You need to get feedback from all kinds of people. The ones that love your ‘happy flow’ (or who even struggle with it) and the ‘rockstar devs’ alike. But while the rockstars may be the most vocal critics of your platform, they are also usually the minority. Also, don’t forget the non-devs. Your QA people, product owners, etcetera.

However, the ‘rockstar devs’ who think your platform is too restrictive just might be the ‘Rockstar DevRel’ people you need to recruit to your platform team.
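To make ‘easy to be default, possible to be special’ a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the setting names (replicas, cpu_millicores, memory_mb, expose_publicly), the MAX_REPLICAS guardrail, and the resolve_config helper are all hypothetical and not taken from any real platform.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DeployConfig:
    """Hypothetical per-app deployment settings; every field has a platform default."""
    replicas: int = 2
    cpu_millicores: int = 250
    memory_mb: int = 256
    expose_publicly: bool = False


# Hypothetical guardrail: teams may deviate from the defaults, but only within limits.
MAX_REPLICAS = 20


def resolve_config(overrides: Optional[dict] = None) -> DeployConfig:
    """Start from the platform defaults and apply a team's explicit overrides.

    Unknown setting names raise TypeError (from the dataclass constructor),
    and out-of-range values raise ValueError.
    """
    config = replace(DeployConfig(), **(overrides or {}))
    if not 1 <= config.replicas <= MAX_REPLICAS:
        raise ValueError(f"replicas must be between 1 and {MAX_REPLICAS}")
    return config


if __name__ == "__main__":
    # A team that is happy with the defaults supplies nothing at all.
    print(resolve_config())
    # A team with special needs overrides only what it cares about.
    print(resolve_config({"replicas": 6, "expose_publicly": True}))
```

The point of the sketch is the shape rather than the settings themselves: the default path requires zero configuration, while deviations are explicit, small, and validated by the platform.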

DevRel

DevRel, you say? Yes. What if you could leverage some of those great devs, who have clear opinions on how to build software and how the platform should work, and have them act as liaisons between software teams and the PaaS team? That way they can directly influence the PaaS backlog, raise issues, advocate for features, even work with the PaaS team to make some things a reality. Similarly, they can work with the software teams to help them get the most from the PaaS, and help them with issues they may be having, or work with the platform team to make sure certain solutions are usable for developers.

If your organization is big enough that it has a Platform Team and its own PaaS, then it is big enough to warrant having a DevRel Chapter/Team. Whether or not this DevRel Team is also involved with building/running the platform is up to you. But at the very least, the DevRel Team offers a place to share knowledge, identify needs and wants, and improve communication. If your organization is coming from traditional silos and wants to adopt DevOps, move to the cloud, and develop its own purpose-built platform in the process, start with the DevRel Team.

Next in Part 2

This first part was mostly about ‘what is a platform anyway?’, the abstraction level and how your organization plays a part in that. In Part 2 we’re going to dive into Design Objectives, and how creating a set of guiding principles influences what you’re building and how you’re building it.
