My takeaways from QCon SF 2018

Date: 2018-11-26 | architectures | qcon | software |



Last month, I took a trip out to San Francisco to attend QCon [hamy: link] as part of my yearly conference benefit at APT [hamy: link]. This was the first major tech conference I'd attended, and I was bombarded by information from every direction. This post is my attempt to solidify takeaways and organize additional info for efficient future consumption.

Key takeaways

In recent years, there have been huge improvements in deploy infrastructure and companies of all sizes have been making moves to incorporate them into their architectures
- Containers - Docker leading the pack, with k8s et al being used for orchestration
- Service meshes - To federate policies in increasingly distributed, sprawling systems
Teams are evolving what it means to be efficient and the processes to get there, typically aiming to increase autonomy and empowerment on an individual/team level -> velocity -> better outcomes
- "Full cycle" devs/teams
- Focus on teams creating good tools rather than just fighting fires (read: re-usable solutions rather than one-off hacks). Related: Writing tests for bugs you've encountered [hamy: link]
- Testing in production-representative environments (and in some cases, testing right in prod)
No one has it figured out, everyone is just experimenting. Okay, I didn't really learn that one there, it's more of a truth of life to me but I like pointing it out whenever it makes itself obvious.

Talks, loosely organized by major theme

Related reads

Microservices / architectures

Microservices in Gaming

The video [hamy: link]

In this talk, scale problems are presented against "current" architectures and you're walked through the thought process of modernizing them to meet these new challenges

Gaming systems (League of Legends)

Require:

low latency
shared state wrt matchmaking/in-game
rapid development cycles

Had a two-layer system

Client - on end user machine
Service - All the other services, but was shipped as one

[hamy: img of the arch]

Takeaway

Need to be micro all the way down or risk bottlenecks in the services still to migrate

Media streaming systems (Hulu)

Called out bit.ly/hulu-landscape for further reading

Requires:

Lots of caching for seamless browsing experiences (think about your endless Netflix search for the perfect show)
Supply metadata that is needed everywhere (size, support, length)
real-time playback
many integrations on edge networks (for instance every device kind that needs to play stuff)

In the original arch, they'd built a lot of stuff in house but there were limitations to what they built as compared to commercial/oss offerings. He specifically mentions the lack of "scripting deploys" from devs as a particular issue.

Takeaways

If anything can go wrong, it will go wrong
Prioritize scaling enhancements based on scale:query; the lower the factor you can support, the more you should prioritize increasing it
Have "circuit breakers" at crucial junctures (ideally every juncture) to keep errors from spilling out to upstream services. Circuit breakers were talked about a lot at the conference. They're really just a federated way to stop upstream apps from calling downstream apps that are down or otherwise having issues, so as to 1) not totally destroy the downstream app and 2) not totally destroy the upstream request tree just because one minor call didn't work out.
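To make the circuit breaker idea concrete, here's a minimal sketch in Python. None of this is from the talk: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the injectable `clock` are names I made up for illustration, and real implementations (usually a library or a service mesh) do a lot more.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a cool-off."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the breaker is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: protect the downstream app and the caller's request tree.
                raise CircuitOpenError("downstream is cooling off")
            # Cool-off elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The upstream caller wraps each downstream call in `breaker.call(...)` and treats `CircuitOpenError` as its cue to use a fallback instead of waiting on a service that is already struggling.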

Microservices in general

Cloud v. DC, go cloud for:

Elasticity
Abstraction

Why microservices

ownership + independence => velocity and reliability, achieved via:
- fast, granular deploys
- built-in CI checks
operational and dev scaling

Cloud is especially nice because you're building on the shoulders of constantly improving infrastructure. Dude recommends to just pick a cloud and run with it, you can worry about genericizing your infra later.

Reactive DDD (Domain Driven Design)

The video [hamy: link]

This talk was very much about moving your architecture from a rigid push system to a more organic poll system, which can increase its flexibility and (implied) efficiency. This is very much the argument for message queues like Kafka, which can be subscribed to and played back at any time, over message queues like RabbitMQ, which prioritize single receiver, single play.
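Here's a toy, in-memory sketch of that difference. It is not how Kafka or RabbitMQ actually work (both have far more going on, and both can be configured in other ways); `ReplayableLog` and `WorkQueue` are hypothetical names that only model "read from any offset, any number of times" versus "one receiver, one play".

```python
from collections import deque


class ReplayableLog:
    """Append-only log: readers pick their own offset and can re-read at will."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read(self, offset=0):
        # Reading never removes anything, so any consumer can replay history.
        return self._events[offset:]


class WorkQueue:
    """Queue: each message goes to exactly one receiver and is then gone."""

    def __init__(self):
        self._messages = deque()

    def put(self, message):
        self._messages.append(message)

    def get(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None
```

A new service added months later can call `read(0)` on the log and rebuild its view from the full history, which is the flexibility the talk was arguing for. With the queue, whatever was consumed before that service existed is simply not there.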

[hamy: image of event-driven arch]

Reactive architectures lend themselves well to code fluency, or the idea that code easily and accurately conveys the real-world processes it is modeling. Note that this is very similar to one of the primary goals of functional programming.

DDD's goals are to be

Fluent
Explicit

Thus both synergize well with each other.

Creating an architecture like this lends itself well to async operations and thus can increase your ability to scale (though remember that async ops can never increase actual throughput if all other things are treated equal)

Microservices talk

Just a system that is also autonomous with features like:

Easy scaling
Isolated failure
Resiliency
Elasticity

Isolated failure:

Can cache results rather than fetching
- Can achieve this by creating a layer that determines when to update this cache while having all reads hit the cache itself
Guard against cascading failure
- Set timeouts for GET and response (i.e. don't hang forever)
- Use fallbacks whenever possible - don't just fail because something else did
Rethink problems to be async / reactive - many times you don't need things to happen right now, having them done "eventually" gives much more room for elasticity in your backend services
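The cache and fallback points above can be sketched together. This is my own illustration, not code from the talk: `CachedReader`, `fetch`, `ttl`, and `clock` are hypothetical names, and the `fetch` callable is assumed to enforce its own timeout (for example `urllib.request.urlopen(url, timeout=2)`).

```python
import time


class CachedReader:
    """All reads hit the cache; a refresh step decides when to update it."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self.fetch = fetch      # hypothetical callable that hits the downstream service
        self.ttl = ttl          # seconds before a cached value counts as stale
        self.clock = clock      # injectable so the behavior is easy to test
        self.value = None
        self.fetched_at = None  # None until the first successful fetch

    def refresh(self):
        """Try to update the cache; keep the old value if the fetch fails."""
        try:
            self.value = self.fetch()
            self.fetched_at = self.clock()
            return True
        except Exception:
            return False

    def read(self, default=None):
        stale = self.fetched_at is None or self.clock() - self.fetched_at > self.ttl
        if stale:
            self.refresh()
        # Fallback: serve the last good value, or the default if there never was one.
        return self.value if self.fetched_at is not None else default
```

The design choice here is that a failed or timed-out fetch degrades to slightly old data rather than an error, so the caller doesn't fail just because something else did.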

Resiliency

Don't let anyone break your internal state
Use event sourcing/logging along with idempotent operations so you can replay events and re-create state after a failure
Respect other services - if they're failing, don't hit them so much
Use a service mesh to build in policies, "respect", to systems
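As a rough sketch of the event log plus idempotent operations idea, assuming each event carries a unique `id` (my assumption, as are the `account` and `amount` fields and the `apply_events` name):

```python
def apply_events(events, balances=None, seen=None):
    """Fold events into state, skipping any event id that was already applied."""
    balances = {} if balances is None else balances
    seen = set() if seen is None else seen
    for event in events:
        if event["id"] in seen:
            continue  # idempotent: replaying an event must not change the state
        seen.add(event["id"])
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["amount"]
    return balances, seen


# Hypothetical event log for illustration.
events = [
    {"id": 1, "account": "alice", "amount": 50},
    {"id": 2, "account": "bob", "amount": 20},
    {"id": 3, "account": "alice", "amount": -30},
]

# After a crash, state can be rebuilt from scratch by replaying the whole log,
# and accidental double delivery of an event is harmless.
balances, seen = apply_events(events)
balances, seen = apply_events(events, balances, seen)
```

Because applying the same event twice is a no-op, recovery can be as blunt as "replay everything" without worrying about where exactly the failure happened.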

Elasticity

k8s can help deal with this as it runs off a config and just does it for you

Airbnb: monolith to microservice

SOA design tenets

Services own data, reads + writes
Services address a specific concern
Data mutations publish based on domain events

SOA best practices

Consistency - standardize service building
Auto generate code for app
Test and deploy - use production replay traffic to battle test new code
Observability

Scaling for cryptomania

Volume testing is good

Perfect parity is not necessary
Use capacity cycle to determine where to increase

Good instrumentation (monitoring/observability) will surface problems whereas bad will obscure them.

Pillars of realistic load testing

Data
Patterns of traffic
Systems

Faster feedback means faster progress

Want 'real' load behavior, so capture and play back real traffic

Production ready microservices

Production Readiness: Can trust service to handle production traffic

Stability and reliability

Stable dev cycle
Stable deploy cycle
Dependency management
Routing discovery
Onboarding and dependency procedures

Scalability and performance

Understand growth-scales
Resource awareness
Dependency scaling
Constant evaluation
Traffic management
Capacity planning

Fault tolerance and disaster recovery

Avoid single point of failure
Resiliency engineering

Containers

K8s commandments

To go fast, you must start deliberately
Always let them know your next move - move past docker build
Never trust - use pod security policies
Never get high off what kube supplies
Never mix internal/external traffic - lots of cool tools out there to help like service meshes
If you think you know what's happening in your cluster, forget it - have observability and logging
Keep your storage and management separate
Use tools
- Package management
  - Helm2
- Config management
  - ksonnet
  - pulumi
  - ballerina
- others
  - skaffold
  - kustomize

Observability

Connectivity, observability, and monitoring

Connect

Service discovery
Resiliency
load balancing

Monitor

metrics
logs
tracing

Manage

fine-grained traffic control
policy on requests

Secure

workload identities throughout infra
service to service authentication

A service mesh moves all of these functionalities outside of the app which helps reduce developer overhead and can be used to change federated policies very fast (when compared with having each dev team roll them out themselves).

Some examples of service meshes:

envoy
linkerd
istio

How istio works [hamy : link to the slides?]

Team processes

Netflix: full cycle engineers

The video [hamy: link]

This talk was particularly interesting because it covered many of the inefficiencies I've seen in large systems - different teams waiting on each other for things and how stuff can often get lost in the cracks. It presents a way to minimize these botched hand-offs, though it also presents the downsides of the paradigm.

The basis of this talk is essentially that it is an antipattern to need to throw a CM ticket over a silo wall in order to perform your necessary job functions.

Context:

dev team
dev ops
central Ops team

all of whom communicate with each other as part of the dev cycle.

Issues:

Devs and testers don't understand the machines
devops / ops don't understand the apps

This leads to high communication overhead

Issues:

people are cautious because they don't understand everything
- leads to long troubleshooting sessions and a high mean time to resolution (MTTR)
poor feedback (as in low quantity/quality) means slow turn-around and prioritization

Solution:

Create specialist teams which will build tools to help overcome each of these problems
Add a specialist onto each team to help them interact with underlying/touching technologies with confidence

To do this, your org must:

increase staffing
increase training via bootcamps and shadowing of other teams/roles
prioritize the creation of these paved-road tools
- "The tooling you build is the tooling your developers need"

Tradeoffs - note this is not for everyone

change is scary
each team will need to balance more priorities
- this will be empowering
- will also lead to increased interruptions and cognitive load
- each member will need to decide how they can best fit in/work with the system

Improvements:

Tooling that is opinionated and uses the best practices
Metrics - to measure impact and areas for improvement

Chaos Engineering with Containers

To chaos:

Monitoring and observability
Incident management
Cost of downtime/hr

Uses:

Outage reproduction
On-call training
Strengthen new prod builds

Process:

Minimize the blast radius
- Don't start in prod
- Don't start at 100% fail
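One way to picture "don't start at 100% fail" is a wrapper that injects failures into only a fraction of calls. This is a hedged sketch rather than anything shown in the talk: `with_chaos`, `failure_rate`, and `rng` are hypothetical names, and real chaos tooling works at the container or network level rather than inside your functions.

```python
import random


def with_chaos(func, failure_rate, rng=random.random):
    """Wrap func so that roughly failure_rate of calls raise an injected error."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        # rng() returns a float in [0, 1), so a rate of 0 never fails
        # and a rate of 1 always fails.
        if rng() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)

    return wrapper
```

You'd start with something like `with_chaos(call_service, 0.01)` in a non-prod environment (where `call_service` is whatever function you want to stress), watch your monitoring, and only then widen the blast radius.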

Testing in production

You should test in both testing && prod

Code confidence can increase as it gets more load
When we move to complex systems, it becomes harder to know what might fail
- Debugging much more complex -> give it prod traffic
- We often underinvest in prod tooling, yet the code that ships/deploys our code is some of the most important

Every deploy = process + code + system

By deploying to prod, can catch 80% of bugs with 20% of the effort

Other environments do not have the same traffic (staging != prod)
Real unpredictables (acts of god, acts of stupidity)

The process

Test before prod
- Does it reasonably work?
Test in prod
- Literally everything else

Observability/monitoring useful to determine what you should actually build

Feature flags good for testing in prod
Use canaries (slowly roll out/fork traffic), can even shadow
Allow multiple versions at once
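A feature flag with a slow percentage rollout can be as small as a hash check. This is a minimal sketch under my own assumptions (the `in_rollout` name, hashing `feature:user_id` with SHA-256, and bucketing into 100 slots are illustrative choices, not something from the talk):

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Because the bucket is derived from the user and feature rather than a random draw, a given user sees the same version on every request, and raising `percent` from 5 to 50 only adds users to the new code path without flipping anyone back. The old and new versions run side by side the whole time, which is the "multiple versions at once" point above.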

misc

Data science as an art

Explorations through solutions

Think and try
hypothesize and experiment

Estimating QCon SF 2018's revenue

