Over the last five years, I have managed multiple teams working on distributed systems, predominantly in Go, JavaScript, and .NET. These teams are tasked with building out many new services, which poses some interesting challenges.
From a management perspective, the ability to reallocate engineers across teams without friction is essential for agility. Onboarding engineers is easier if services share some kind of common standard / contract, especially when the service catalogue is large. On the other hand, I want to give engineers as much flexibility as possible to choose the best tools for the job. In most cases, it is best to let the market decide (cathedral/bazaar).
Managers need to think very carefully about where they draw the “standardisation” line. As a result, this document starts with a small contract, and the rest focuses on “best practices”. It covers application NFRs and developer Quality of Life (QoL) in roughly equal measure.
Maybe this won’t work for your ecosystem; your mileage may vary.
The Application Entrypoint
All apps contain the following three files. Together they act as an engineer’s very first entry point into a service, so I recommend them as a contract across all services. This is the only contract in this document; everything else is best practice.
- README.md
- Makefile
- a programmatic entry point into the app
- source-controlled
- language-agnostic
- not everything will be in the app language (scripts, migrations, compiled dependencies)
- what language-agnostic task runner will you use? something at the OS level?
- `Make` has been around since 1976; it’s available in just about every Linux distribution out there. It’s also not particularly Windows-portable (which may be strategically useful, when you think about it)
- .env.example
- an explicit declaration of all environment variables that the app requires to function
- programmatic, human-readable, source-controlled
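To make the contract concrete, here is a minimal sketch of what the `Makefile` and `.env.example` might look like for a hypothetical Go service. Target names, scripts, and variables are illustrative, not part of the contract (and note that Make recipes must be indented with tabs):

```makefile
# Makefile — the language-agnostic entry point into the app
.PHONY: build test migrate run

build:        ## compile the app
	go build -o bin/app ./cmd/app

test:         ## run unit tests
	go test ./...

migrate:      ## run database migrations (not in the app language)
	./scripts/migrate.sh

run: build    ## start the app locally
	./bin/app
```

```
# .env.example — an explicit declaration of every env var the app needs
DATABASE_URL=postgres://localhost:5432/app_dev
KAFKA_BROKERS=localhost:9092
LOG_LEVEL=debug
```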
The Local Developer Experience
The most important subject to me. A new engineer’s first impression; a seconded engineer’s onboarding speed. The cost of a poor developer experience scales linearly, O(n), with the number of engineers, and every engineering manager should sweat when they read that. Some high-level considerations:
- do devs have short feedback loops?
- can devs deploy prod-like infra locally?
- if I hear “it works on *my* machine” one more time…
- can devs incrementally upgrade their services?
- can devs run multiple versions of javascript/postgres/etc locally?
- can devs profile apps with prod-like accuracy?
- wildly helpful if done right
- requires performance tests
App Infra
As above, can devs deploy prod-like infra locally? How do you spin up your application’s infrastructure for local development? For example:
- db (postgres, etc)
- messaging (kafka, zookeeper, kowl, etc)
- containerisation/orchestration/mesh (docker, k8s, istio, helm, etc)
- observability (ELK, grafana, etc)
- cicd (jenkins, etc)
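One common way to answer “prod-like infra locally” is a compose file checked into each repo. A minimal sketch, where the images, versions, and settings below are purely illustrative and should mirror whatever your production estate actually runs:

```yaml
# docker-compose.yml — local, prod-like infrastructure (illustrative versions)
services:
  db:
    image: postgres:12
    environment:
      POSTGRES_PASSWORD: dev
    ports: ["5432:5432"]
  zookeeper:
    image: bitnami/zookeeper:3.8
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"
  kafka:
    image: bitnami/kafka:3.4
    environment:
      KAFKA_CFG_ZOOKEEPER_CONNECT: zookeeper:2181
      ALLOW_PLAINTEXT_LISTENER: "yes"
    ports: ["9092:9092"]
  grafana:
    image: grafana/grafana:9.5.2
    ports: ["3000:3000"]
```

Pinning versions in this file is also what lets devs run multiple postgres/kafka versions side by side, and incrementally upgrade a service’s infra by bumping one image tag.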
Unit Tests
- ensure a high coverage (>85% or >90% as a baseline)
- a DI container will likely help
- introduce a gate into your CI build pipeline that fails a build if coverage goes down
- this is a great way to reduce the chance of incremental deterioration in software quality
- this can be done by outputting a coverage report from your unit testing harness, and then performing a delta in your pipeline
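A sketch of that coverage gate, assuming a Go toolchain and a hypothetical `coverage-baseline.txt` file committed to the repo; the comparison itself is plain awk and works for any stack that can emit a coverage percentage:

```shell
#!/bin/sh
# Fail the build if unit-test coverage drops below the committed baseline.

# Returns success (0) when current coverage >= baseline.
coverage_ok() {
  baseline="$1"; current="$2"
  awk -v b="$baseline" -v c="$current" 'BEGIN { exit (c + 0 >= b + 0) ? 0 : 1 }'
}

# Illustrative wiring for a Go service:
#   go test ./... -coverprofile=cover.out
#   current=$(go tool cover -func=cover.out | awk '/^total:/ { sub("%", "", $3); print $3 }')
#   coverage_ok "$(cat coverage-baseline.txt)" "$current" \
#     || { echo "coverage dropped below baseline"; exit 1; }
```

A ratchet variant (overwrite the baseline whenever coverage goes up) gives you the “no incremental deterioration” property without anyone maintaining the number by hand.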
Component Tests
Aka integration tests, aka acceptance tests. Basically small-scale integration tests where all but immediate dependencies are mocked. For example, the app, its database, and kafka are concrete, whereas any external APIs/etc are mocked. A lot of bugs are caught here that are hard to unit test, e.g. did you remember to update your migrations?
- you can treat component tests as unit tests with no mocks
- i.e. you can use your existing testing harness if you don’t want to add moving parts
- containerised dependencies
- containers can be executed inside your CI pipeline
- think about whether two builds can run in parallel
- this is trivial with testcontainers
- it’s a little trickier with docker-compose, you need to understand dind (docker-in-docker), but I like how transparent it is (less magic under the hood than testcontainers)
- however, I suspect tools like testcontainers can help you to combine coverage numbers across unit & component tests, which would be cool
- sanity load tests
- if you can easily spin up your app and its infra, what stops you from running a 30s sanity load test?
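If you go the docker-compose route, giving each CI build its own compose project name is what lets two builds run in parallel on one agent, since each project gets isolated networks and volumes. A sketch, where `BUILD_ID` is a hypothetical CI-provided variable and the make target is whatever your existing harness exposes:

```shell
#!/bin/sh
# Run component tests against containerised dependencies, isolated per CI build.

# Derive a unique compose project name from the CI build id.
project_name() {
  echo "myapp-ci-$1"
}

run_component_tests() {
  p="$(project_name "${BUILD_ID:-local}")"
  docker compose -p "$p" up -d                 # isolated networks/volumes per project
  trap 'docker compose -p "$p" down -v' EXIT   # always tear down, even on failure
  make test-component                          # existing test harness, no mocks
}
```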
Jenkins
- linting, formatting, unit testing
- shared .editorconfig file (nuget package?)
- publish coverage numbers to jenkins
- track changes
- static checks
- sonarqube
- config (sonar-project.properties)
- fail on critical severity
- image scanning
- snyk/trivy
- fail on critical severity
- component tests
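Pulled together, the stages above might look something like this in a declarative Jenkinsfile. This is only a sketch of the shape, not a mandated layout; the make targets are hypothetical and the scanner invocations assume tools are already on the agent:

```groovy
// Jenkinsfile — illustrative stage layout, not a standard
pipeline {
  agent any
  stages {
    stage('Lint & Format')   { steps { sh 'make lint' } }
    stage('Unit Tests')      { steps { sh 'make test-unit' } }       // publish coverage after this
    stage('Static Analysis') { steps { sh 'sonar-scanner' } }        // reads sonar-project.properties
    stage('Image Scan')      { steps { sh 'trivy image --exit-code 1 --severity CRITICAL myapp:latest' } }
    stage('Component Tests') { steps { sh 'make test-component' } }
  }
}
```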
Load Tests
- k6
- great for API tests
- kafka
- but what about asynchronous services (i.e. no API)?
- k6 can do “kafka” load tests, but these tests exercise the kafka cluster, not your app
- if you can create an ephemeral topic and an ephemeral consumer, it’s pretty easy to load it up with messages and then track the consumer lag
- once the lag hits zero, your load test is complete
- history
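A sketch of the consumer-lag approach: pump messages into the ephemeral topic, then poll the consumer group until total lag hits zero. The parsing assumes the standard column layout of `kafka-consumer-groups.sh --describe` (LAG is the sixth column); the broker address and group name are illustrative:

```shell
#!/bin/sh
# Sum the LAG column of `kafka-consumer-groups.sh --describe --group <g>` output.
# The first line is a header, so skip it.
total_lag() {
  awk 'NR > 1 { sum += $6 } END { print sum + 0 }'
}

# Illustrative polling loop: the load test is complete when lag reaches zero.
#   while :; do
#     lag=$(kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#             --describe --group my-ephemeral-group | total_lag)
#     [ "$lag" -eq 0 ] && break
#     sleep 1
#   done
```

Recording the wall-clock time from first publish to zero lag gives you a throughput number you can track across builds.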
Sitrus
- can you automatically do ephemeral E2E (or big integration) tests?
- sitrus
- sitrus can create an ephemeral k8s namespace in the sbx cluster
- e.g. a `sit-123-rewards` namespace underneath the sbx namespace
- sitrus will then trigger each app to deploy itself
- this way sitrus doesn’t need to know about each app’s dependencies (postgres/couchbase/elastic/etc)
- can also be used for DPT
- similar to above: if you can easily spin up your app and its infra, what stops you from running a load test?
- database clusters have been split => so some tests can run concurrently
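Without knowing sitrus’s internals, the general pattern is easy to sketch: derive a namespace name from the ticket and service, create it, let each app deploy itself into it, and tear it all down afterwards. All names and commands below are illustrative:

```shell
#!/bin/sh
# Ephemeral E2E namespace lifecycle (illustrative; sitrus presumably does more).

# k8s namespace names must be lowercase RFC 1123 labels.
ns_name() {
  echo "sit-$1-$2" | tr '[:upper:]' '[:lower:]'
}

# Illustrative lifecycle:
#   ns="$(ns_name 123 rewards)"
#   kubectl create namespace "$ns"
#   helm upgrade --install myapp ./chart -n "$ns"   # each app deploys itself here
#   make test-e2e E2E_NAMESPACE="$ns"
#   kubectl delete namespace "$ns"                  # deleting the ns removes everything in it
```

Deleting the namespace at the end is what makes the whole environment genuinely ephemeral: every deployment, service, and volume claim inside it goes with it.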
Releasing
- semantic versioning
- use the Conventional Commit standard
- use the “Merge Check” bitbucket plugin to enforce the standard (copy dropbears regex)
- use the Semantic Release library to auto-generate a semver and release notes, and generate a jenkins release object
- i.e. adding a “version” tag will lead to a semver and release notes being automagically generated, along with a jenkins release object
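I don’t have dropbear’s exact regex to hand, so here is only a simplified sketch of the kind of check the merge plugin would enforce; the type list and pattern are illustrative:

```shell
#!/bin/sh
# Simplified Conventional Commits check — a sketch, not dropbear's actual regex.
CC_REGEX='^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([a-z0-9-]+\))?(!)?: .+'

is_conventional() {
  echo "$1" | grep -Eq "$CC_REGEX"
}
```

Under semantic-release’s common defaults, `fix:` commits bump the patch version, `feat:` commits bump the minor, and a breaking-change marker bumps the major, which is what makes the semver fully automatic.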
Miscellaneous tooling that helps keep local environments isolated: node dotenv, python venvs, golang vendoring, dockerised infra (e.g. a pg12 container), etc.