Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.
Who builds distributed systems:
- SaaS market
- Startups: CustomMade, Instagram, HubSpot
- Mature: Akamai, Facebook, Twitter
- Enterprise market
- Startup: Basho (Riak), Infinio, Hadapt
- Mature: VMWare, Vertica
- ...and graduate students
High-level components:
- front-end: load balancing routers
- handlers, caching, storage, business services
- infra-services: logging, updates, authentication
Low-level components:
- RPCs (semantics, failure)
- coordination (consensus, Paxos)
- persistence (serialization semantics)
- caching
- abstractions (queues, jobs, workflows)
Business needs will affect scale and architecture
- dating website core data: OkCupid uses 2 beefy database servers
- analytics distributed DB: Vertica/Netezza clusters have around 100 nodes
- mid-size SaaS company: HubSpot uses around 100 single-node DBs or around
10 node HBase clusters
- MySQL mostly
- Akamai, Facebook, Amazon: tens of thousands of machines
Small SaaS startup:
- early on the best thing is to figure out if you have a good idea that people would buy
- typically use a platform like Heroku, Google App Engine, AWS, Joyent, CloudFoundry
Midsized SaaS:
- need more control than what PaaS offers
- scale may enable you to build better solutions more cheaply
- open source solutions can help you
Mature SaaS:
- Jepsen tool
- "Ensure your design works if scale changes by 10x or 20x; the right solution for x often not optimal for 100x", Jeff Dean
How to think about your design:
- understand what your system needs to do and the semantics
- understand workload scale then estimate (L2 access time, network latency) and plan to understand performance
- "telemetry beats event logging"
- logs can be hard to understand: getting a good story out is difficult
- logging: first line of defense, doesn't scale well
- logs on different machines
- what if timestamps are useless because clocks are not synced
- lots of tools around logging
- having log data in queryable format tends to be very useful
- monitoring, telemetry, alerting
- annotate code with timing and counting events
- measure how big a memory queue is or how long a request takes and you can count it
- can do telemetry at multiple granularities so we can break long requests into smaller pieces and pinpoint problems
- in classroom settings you don't have to set up a bunch of machines
- as your business scales new machines need to be set up => must automate
- separate configuration from app
- HubSpot uses a ZooKeeper like system that allows apps to get config values
- Maven for dependencies in Java
- Jenkins for continuous integration testing
- automated testing makes it easy to verify newly introduced changes to your code
- UI testing can be a little harder (simulate clicks, different layout in different browsers)
- front end changes => must change tests?
- people: how do you get together and build the thing
- analogy: software engineering process is sort of like a distributed system
with unreliable components.
- somehow must build reliable software on a reliable schedule
- gotta take care of your people: culture has to be amenable to people growing, learning and failing
- waterfall: big design upfront and then implement it
- agile/scrum: don't know the whole solution, need to iterate on designs
- kanban:
- lean:
- making a big change on fast changing code base
- if you branch and then merge your changes, chances are the codebase has changed drastically
- you can try to have two different branches deployed such that the new branch can be tested in production
- culture changes with growth
- need to pay attention to culture and happiness of employees
- very important to measure happiness
- having small teams might help because people can own projects