- Stateless - The state of a service should be determined by a shared database and should not depend on data local to the application. Storage should be treated as a service in itself, and the antiquated thinking of storage as a device should be avoided.
- Scale linearly - An application should run as a single process with as small a footprint as possible. This enables the SRE team to scale services linearly in a granular fashion. Code logic should not necessitate a specific number of instances but should be capable of scaling up or down as load changes. Discrete functionality is preferred so that there is a single, obvious metric to scale upon.
- Minimal configuration - Services should require little to no configuration. We have found configuration management to be a ripe source of human error, so services should ship with sane defaults and infer as much as possible on startup from environment variables or service discovery. Thread and memory footprints should configure automatically to maximize resource usage on an instance. A side benefit is that the fewer configuration options a service exposes, the fewer permutations need to be tested.
- Robust communication - A great quote from Release It! is, "Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk." Therefore, we ask that all integration points of a service be enumerated and have a proper harness to torture-test data input and output. Communication should be asynchronous whenever possible (and it almost always is). Adequate controls should exist at integration points, including circuit breakers, time-outs, bulkheads and protocol handshaking.
- Application Visibility - At a minimum, all inbound and outbound transactions should have telemetry that provides visibility into the number and size of transactions, their type and the time each transaction took to execute. Services should know whether they are functioning correctly and make this health status available to the SRE team through some type of API (usually REST). Logging is also a critical component of application visibility. It should be obvious when reading the log files whether the service is working.
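To make the "minimal configuration" principle concrete, here is a minimal Python sketch of a service inferring its settings at startup. The variable names `PORT`, `WORKERS` and `LOG_LEVEL`, the default values, and the `load_settings` helper are all hypothetical, chosen only for illustration. The idea is that every setting has a sane default, so the service starts with no configuration at all.

```python
import os


def load_settings(env=None):
    """Build settings from environment variables, falling back to defaults.

    All variable names and defaults here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    # Size the worker pool from the machine rather than from a config file.
    cpus = os.cpu_count() or 1
    return {
        "port": int(env.get("PORT", "8080")),
        "workers": int(env.get("WORKERS", cpus)),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }


if __name__ == "__main__":
    # With an empty environment the service still gets a complete configuration.
    print(load_settings({}))
```

Passing the environment in as a plain mapping keeps the function easy to test, which fits the point above about fewer options meaning fewer permutations to test.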
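One of the controls named under "robust communication" is the circuit breaker. The sketch below is a deliberately simplified, single-threaded Python version, not a production implementation. The class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are assumptions made for illustration. After a run of consecutive failures the breaker "opens" and rejects calls immediately. Once a cool-down has passed it lets one trial call through to see whether the dependency has recovered.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit immediately.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound integration point, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function that already enforces its own time-out. A real service would also need thread safety and telemetry on state changes.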
There were three sources that we relied heavily on for inspiration. I want to explicitly call these out so we can give credit where credit is due and encourage people to look up these fantastic resources:
Finally, a word on the importance of principles. Modern systems are complex. Distributed cloud systems are particularly complex because of the number of integration points. The allure of intuitively understanding systems with thousands of nodes is toxic. When designing these distributed systems, we prefer the pattern/anti-pattern approach, relying on principles rather than attempting to enumerate every possible failure scenario. You will hit failure edge cases you never thought of. Therefore, relying on principles instead of our own capability to exhaustively understand a system is a practical dose of humility.