My takeaways from QCon SF 2018
Date: 2018-11-26 | qcon | software | architectures |
[hamy: change to publish on monday]
Last month, I took a trip out to San Francisco to attend QCon [hamy: link] as part of my yearly conference benefit at APT [hamy: link]. This was the first major tech conference I'd attended, and I was bombarded by information from every direction - this post is my attempt to solidify my takeaways and organize additional info for efficient future consumption.
Key takeaways
In recent years, there have been huge improvements in deploy infrastructure and companies of all sizes have been making moves to incorporate them into their architectures
- Containers - Docker leading the pack, with k8s et al being used for orchestration
- Service meshes - To federate policies in increasingly distributed, sprawling systems
Teams are evolving both what it means to be efficient and the processes to attain that goal, typically aiming to increase autonomy and empowerment at the individual/team level -> velocity -> better outcomes
- "Full cycle" devs/teams
- Focus on teams creating good tools rather than just fighting fires (read: re-usable solutions rather than one-off hacks). Related: Writing tests for bugs you've encountered [hamy: link]
- Testing in production-representative environments (and in some cases, testing right in prod)
No one has it figured out; everyone is just experimenting. Okay, I didn't really learn that one there - it's more of a truth of life to me, but I like pointing it out whenever it makes itself obvious.
Talks, loosely organized by major theme
Microservices / architectures
Microservices in Gaming
The video [hamy: link]
In this talk, scale problems are presented against "current" architectures and you're walked through the thought process of modernizing them to meet these new challenges
Gaming systems (League of Legends)
Require:
- low latency
- shared state wrt matchmaking/in-game
- rapid development cycles
Had a two-layer system
- Client - on end user machine
- Service - all the other services, shipped as a single unit
[hamy: img of the arch]
Takeaway
- Need to be micro all the way down, or you risk bottlenecks in the services you haven't yet migrated
Media streaming systems (Hulu)
Called out bit.ly/hulu-landscape for further reading
Requires:
- Lots of caching for seamless browsing experiences (think about your endless Netflix search for the perfect show)
- Metadata that is needed everywhere (size, support, length)
- Real-time playback
- Many integrations on edge networks (for instance, every kind of device that needs to play content)
In the original arch, they'd built a lot of stuff in house, but there were limitations to what they built compared to commercial/OSS offerings. He specifically mentioned the lack of "scripting deploys" by devs as a particular issue.
Takeaways
- If anything can go wrong, it will go wrong
- Prioritize scaling enhancements based on the scale:query factor - the lower the factor you can support, the more you should prioritize increasing it
- Have "circuit breakers" at crucial junctures (ideally every juncture) to prevent errors from spilling out to upstream services. Circuit breakers were talked about a lot at the conference and are really just a federated way to prevent upstream apps from calling downstream apps that are downed or otherwise having issues so as to 1) not totally destroy the downstream app and 2) not totally destroy the upstream request tree just cause one minor call didn't work out.
Microservices in general
Cloud v. DC, go cloud for:
- Elasticity
- Abstraction
Why microservices
- ownership + independence => velocity and reliability, achieved via:
- fast, granular deploys
- built-in CI checks
- operational and dev scaling
Cloud is especially nice because you're building on the shoulders of constantly improving infrastructure. Dude recommends just picking a cloud and running with it; you can worry about genericizing your infra later.
Reactive DDD (Domain Driven Design)
The video [hamy: link]
This talk was very much about moving your architectures from a rigid push system to a more organic poll system, as that can increase their flexibility and (implied) efficiency. This is very much the argument for message queues like Kafka, which can be subscribed to and played back at any time, over message queues like RabbitMQ, which prioritize single receiver, single play.
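As a toy illustration of why that replayability matters (my own sketch, not from the talk - real Kafka consumers track offsets per partition via consumer groups): an append-only log keeps events around, so any consumer can start reading from any offset.

```python
class EventLog:
    """Toy append-only log: events are kept around, and each consumer tracks
    its own offset - roughly the Kafka model, as opposed to a queue that
    deletes a message once a single receiver has consumed it."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)

    def read_from(self, offset):
        # Any consumer can (re)read from any point in history.
        return self.events[offset:]

log = EventLog()
log.publish({"type": "MatchCreated", "id": 1})
log.publish({"type": "MatchFinished", "id": 1})

# A consumer that joins late (or is replaying after a failure)
# still sees the full history, starting from offset 0.
for event in log.read_from(0):
    print(event)
```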
[hamy: image of event-driven arch]
Reactive architectures lend themselves well to code fluency, or the idea that code easily and accurately conveys the real-world processes it is modeling. Note that this is very similar to one of the primary goals of functional programming.
DDD's goals are to be
- Fluent
- Explicit
Thus both synergize well with each other.
Creating an architecture like this lends itself well to async operations and thus can increase your ability to scale (though remember that async ops alone can never increase actual throughput, all other things being equal)
Microservices talk
Just a system that is also autonomous with features like:
- Easy scaling
- Isolated failure
- Resiliency
- Elasticity
Isolated failure:
- Can cache results rather than fetching
- Can achieve this by creating a layer that determines when to update this cache while having all reads hit the cache itself
- Guard against cascading failure
- Set timeouts for GET and response (i.e. don't hang forever)
- Use fallbacks whenever possible - don't just fail because something else did (see the sketch after this list)
- Rethink problems to be async / reactive - many times you don't need things to happen right now, having them done "eventually" gives much more room for elasticity in your backend services
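Here's a minimal sketch of the timeout-plus-fallback idea from the list above (my own illustration; `recommendation_service` and `last_known_recs` are hypothetical names):

```python
import concurrent.futures

# Shared pool so a timed-out call keeps running in the background
# without blocking the caller (a real service would bound/monitor this).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def get_with_fallback(fetch, cached_value, timeout_s=0.5):
    """Try the live call, but never hang forever: on timeout or error,
    serve the (possibly stale) cached value instead of failing."""
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # timeout, or an error raised by fetch itself
        return cached_value

# Usage sketch (hypothetical names): recommendations degrade gracefully to
# the last known-good value rather than erroring out the whole page.
# recs = get_with_fallback(lambda: recommendation_service.get(user_id),
#                          cached_value=last_known_recs)
```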
Resiliency
- Don't let anyone break your internal state
- Have event sourcing/logging along with idempotent operations so operations can be replayed and state re-created after a failure (see the sketch after this list)
- Respect other services - if they're failing, don't hit them so hard
- Use a service mesh to build in policies, "respect", to systems
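A tiny sketch of what idempotent, replay-safe handling looks like (my own illustration - the event shape and names are made up):

```python
class PaymentProcessor:
    """Idempotent handler: replaying the same event (e.g. after a crash
    followed by an event-log replay) doesn't double-apply the side effect."""

    def __init__(self):
        self.processed_ids = set()  # in real life, durable storage
        self.balances = {}

    def handle(self, event):
        # Events carry a unique id, so a replay is detected and skipped.
        if event["id"] in self.processed_ids:
            return
        self.balances[event["account"]] = (
            self.balances.get(event["account"], 0) + event["amount"]
        )
        self.processed_ids.add(event["id"])

processor = PaymentProcessor()
event = {"id": "evt-42", "account": "alice", "amount": 10}
processor.handle(event)
processor.handle(event)  # replayed after a failure: state is unchanged
assert processor.balances["alice"] == 10
```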
Elasticity
- k8s can help deal with this, as it runs off declarative config and handles the scaling for you
Airbnb: monolith to microservice
SOA design tenets
- Services own data, reads + writes
- Services address a specific concern
- Data mutations publish based on domain events
SOA best practices
- Consistency - standardize service building
- Auto generate code for app
- Test and deploy - use production replay traffic to battle-test new code (see the sketch below)
- Observability
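The replay idea, roughly (my own sketch - Airbnb's actual tooling wasn't shown; `load_requests` and the clients are hypothetical): fire captured production requests at both the current build and the candidate, then diff the responses.

```python
def replay_and_compare(recorded_requests, call_current, call_candidate):
    """Battle-test a new build by replaying captured production requests
    against both the current and candidate versions and diffing responses."""
    mismatches = []
    for request in recorded_requests:
        expected = call_current(request)
        actual = call_candidate(request)
        if expected != actual:
            mismatches.append((request, expected, actual))
    return mismatches

# Usage sketch (hypothetical callables wrapping HTTP clients):
# diffs = replay_and_compare(load_requests("prod-2018-11-01.log"),
#                            call_current=prod_client.get,
#                            call_candidate=canary_client.get)
# assert not diffs, f"{len(diffs)} responses changed"
```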
Scaling for cryptomania
Volume testing is good
- Perfect parity is not necessary
- Use capacity cycle to determine where to increase
Good instrumentation (monitoring/observability) will surface problems, whereas bad instrumentation will obscure them.
Pillars of realistic load testing
- Data
- Patterns of traffic
- Systems
Faster feedback means faster progress
You want 'real' load - capture and play back real traffic behaviors
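A toy sketch of the capture-and-playback idea (my own illustration; `load_recorded` and `http_post` are hypothetical): replaying requests with their original relative timing preserves the traffic pattern, not just the volume.

```python
import time

def replay_with_original_timing(recorded, send, speedup=1.0):
    """Replay (timestamp, request) pairs preserving their relative spacing,
    so bursts and lulls look like production traffic rather than a flat rate."""
    if not recorded:
        return
    start_wall = time.time()
    start_rec = recorded[0][0]
    for timestamp, request in recorded:
        # Wait until this request is "due" relative to the recording.
        due = start_wall + (timestamp - start_rec) / speedup
        delay = due - time.time()
        if delay > 0:
            time.sleep(delay)
        send(request)

# Usage sketch (hypothetical): replay yesterday's peak hour at 2x speed.
# replay_with_original_timing(load_recorded("peak-hour.log"), send=http_post, speedup=2.0)
```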
Production ready microservices
Production Readiness: Can trust service to handle production traffic
Stability and reliability
- Stable dev cycle
- Stable deploy cycle
- Dependency management
- Routing and discovery
- Onboarding and dependency procedures
Scalability and performance
- Understand growth-scales
- Resource awareness
- Dependency scaling
- Constant evaluation
- Traffic management
- Capacity planning
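To make the capacity planning bullet concrete, a back-of-the-envelope sketch (the numbers and the 30% headroom are made up for illustration):

```python
import math

def instances_needed(peak_qps, growth_factor, per_instance_qps, headroom=0.3):
    """How many instances to provision for projected peak traffic,
    leaving some headroom for spikes and failed instances."""
    projected_qps = peak_qps * growth_factor
    usable_capacity = per_instance_qps * (1 - headroom)
    return math.ceil(projected_qps / usable_capacity)

# e.g. 4,000 QPS today, expecting 2.5x growth, 500 QPS per instance:
print(instances_needed(peak_qps=4000, growth_factor=2.5, per_instance_qps=500))  # 29
```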
Fault tolerance and disaster recovery
- Avoid single point of failure
- Resiliency engineering
Containers
K8s commandments
- To go fast, you must start deliberately
- Always let them know your next move - move past docker build
- Never trust - use pod security policies
- Never get high off what kube supplies
- Never mix internal/external traffic - lots of cool tools out there to help, like service meshes
- If you think you know what's happening in your cluster, forget it - have observability and logging
- Keep your storage and management separate
- Use tools
  - Package management
    - Helm 2
  - Config management
    - ksonnet
    - pulumi
    - ballerina
  - Others
    - skaffold
    - kustomize
Observability
Connectivity, observability, and monitoring
Connect
- Service discovery
- Resiliency
- load balancing
Monitor
- metrics
- logs
- tracing
Manage
- fine-grained traffic control
- policy on requests
Secure
- workload identities throughout infra
- service to service authentication
A service mesh moves all of these functionalities outside of the app, which helps reduce developer overhead and lets you change federated policies very fast (compared with having each dev team roll them out themselves).
Some examples of service meshes:
- envoy
- linkerd
- istio
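To make "moves all of these functionalities outside of the app" concrete: without a mesh, every service ends up hand-rolling per-request policies like the retry/backoff wrapper sketched below (my own illustration of the general idea, not how any particular mesh implements it); a sidecar proxy like Envoy applies equivalent policies outside your process, configured centrally.

```python
import time
import random

def call_with_policy(send_request, retries=3, base_backoff_s=0.1):
    """Per-request resiliency policy (retries with jittered exponential
    backoff) of the kind a service mesh sidecar would otherwise apply
    uniformly, outside the application code."""
    for attempt in range(retries + 1):
        try:
            return send_request()
        except Exception:
            if attempt == retries:
                raise
            # Jittered exponential backoff so retries don't synchronize.
            time.sleep(base_backoff_s * (2 ** attempt) * random.random())

# Usage sketch (hypothetical client):
# user = call_with_policy(lambda: users_client.get("/users/42"))
```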
How istio works [hamy : link to the slides?]
Team processes
Netflix: full cycle engineers
The video [hamy: link]
This talk was particularly interesting because it covered many of the inefficiencies I've seen in large systems - different teams waiting on each other for things, and stuff often slipping through the cracks. It presents a way to minimize these botched hand-offs, though it also presents the downsides of the paradigm.
The basis of this talk is essentially that it is an antipattern to need to throw a CM ticket over a silo wall in order to perform your necessary job functions.
Context:
- dev team
- dev ops
- central Ops team
all of whom communicate with each other as part of the dev cycle.
Issues:
- Devs and testers don't understand the machines
- devops / ops don't understand the apps
This leads to high communication overhead
Issues:
- people are cautious because they don't understand everything
- leads to long troubleshooting sessions and a high mean time to resolution (MTTR)
- poor feedback (as in low quantity/quality) means slow turn-around and prioritization
Solution:
- Create specialist teams which will build tools to help overcome each of these problems
- Add a specialist onto each team to help them interact with underlying/adjacent technologies with confidence
To do this, your org must:
- increase staffing
- increase training via bootcamps and shadowing of other teams/roles
- prioritize the creation of these paved-road tools
- "The tooling you build is the tooling your developers need"
Tradeoffs - note this is not for everyone
- change is scary
- each team will need to balance more priorities
- this will be empowering
- will also lead to increased interruptions and cognitive load
- each member will need to decide how they can best fit in/work with the system
Improvements:
- Tooling that is opinionated and uses best practices
- Metrics - to measure impact and areas for improvement
Chaos Engineering with Containers
To chaos, you need:
- Monitoring and observability
- Incident management
- Cost of downtime/hr
Uses:
- Outage reproduction
- On-call training
- Strengthen new prod builds
Process:
- Minimize the blast radius
- Don't start in prod
- Don't start at 100% fail
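A toy sketch of keeping the blast radius small (my own illustration, not a specific tool): failures get injected into only a small, configurable fraction of requests.

```python
import random

def chaos_wrap(handler, failure_rate=0.01):
    """Wrap a request handler so a small, configurable fraction of requests
    fail on purpose - start tiny (and not in prod), then ramp up."""
    def wrapped(request):
        if random.random() < failure_rate:
            raise RuntimeError("chaos: injected failure")
        return handler(request)
    return wrapped

# Usage sketch: 1% of staging traffic fails, letting you verify fallbacks,
# alerts, and dashboards before widening the experiment.
# handle_request = chaos_wrap(handle_request, failure_rate=0.01)
```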
Testing in production
You should test in both testing && prod
- Code confidence can increase as it gets more load
- When we move to complex systems, it becomes harder to know what might fail
- Debugging is much more complex -> give it prod traffic
- We often under-invest in prod tooling, yet the code that ships/deploys our code is some of the most important
Every deploy = process + code + system
By deploying to prod, you can catch 80% of bugs with 20% of the effort
- Other environments do not have the same traffic (staging != prod)
- Real unpredictables (acts of god, acts of stupidity)
The process
- Test before prod
- Does it reasonably work?
- Test in prod
- Literally everything else
Observability/monitoring is useful to determine what you should actually build
- Feature flags are good for testing in prod (see the sketch after this list)
- Use canaries (slowly roll out/fork traffic), can even shadow
- Allow multiple versions at once
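A minimal sketch of flag-gated canarying (my own illustration; the flag name and code paths are hypothetical): hash users into stable buckets so a feature can be rolled out to a small slice of production traffic and ramped up gradually.

```python
import hashlib

def in_canary(user_id, flag_name, rollout_percent):
    """Deterministically bucket users so a feature flag can be rolled out
    to a small, stable slice of production traffic and ramped up gradually."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent

# Usage sketch (hypothetical code paths): serve the new version to 5% of
# users, watch the metrics, then ramp the percentage up.
# if in_canary(user_id, "new-ranking-model", rollout_percent=5):
#     return new_ranking(user_id)
# return current_ranking(user_id)
```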
Misc
Data science as an art
Explorations through solutions
- Think and try
- hypothesize and experiment
Related reads
Want more like this?
The best / easiest way to support my work is by subscribing for future updates and sharing with your network.