An updated and organized reading list illustrating the patterns of scalable, reliable, and performant large-scale systems. Concepts are explained in articles by prominent engineers and in credible references. Case studies are taken from battle-tested systems that serve millions to billions of users.

If your system goes slow

Understand your problem first: is it a scalability problem (fast for a single user but slow under heavy load) or a performance problem (slow even for a single user)? Review some design principles and check how scalability and performance problems are solved at tech companies. The intelligence section is for those who work with data and machine learning at big (data) and deep (learning) scale.
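
The distinction between the two kinds of problem can be sketched as a small check. This is only an illustration of the definitions in this section, not a diagnostic tool: the function name, its parameters, and the idea of comparing measured latencies against a single target are all assumptions made for the example.

```python
def classify_slowness(single_user_ms: float, under_load_ms: float, target_ms: float) -> str:
    """Classify a slow system using the two definitions in this section.

    All three arguments are hypothetical measurements in milliseconds:
    latency seen by one user on an idle system, latency under heavy load,
    and the latency you consider acceptable.
    """
    # Slow even for a single user: a performance problem.
    if single_user_ms > target_ms:
        return "performance"
    # Fast for a single user but slow under heavy load: a scalability problem.
    if under_load_ms > target_ms:
        return "scalability"
    # Within the target in both cases.
    return "ok"


if __name__ == "__main__":
    print(classify_slowness(single_user_ms=40, under_load_ms=900, target_ms=200))
    print(classify_slowness(single_user_ms=450, under_load_ms=900, target_ms=200))
```

A system that is slow for a single user is reported as a performance problem even if it also degrades under load, since the single-user case has to be fixed before load behavior can be judged.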

If your system goes down

"Even if you lose all one day, you can build all over again if you retain your calm!" - Thuan Pham, former CTO of Uber. So, keep calm and mind the availability and stability matters!

If you are having a system design interview

Look at some interview notes and real-world architectures with complete diagrams to get a comprehensive view before designing your system on a whiteboard. You can check some talks by engineers from tech giants to learn how they build, scale, and optimize their systems. There are some selected books for you (most of them are free)! Good luck!

If you are building your dream team

The goal of scaling a team is not to grow its size but to increase its output and value. In the organization section, you can find out how tech companies reach that goal through hiring, management, organization, culture, and communication.

Community power

Contributions are greatly welcome! You may want to take a look at the contribution guidelines. If you see a link here that is no longer maintained or is not a good fit, please submit a pull request!

Many long hours of hard work have gone into this project. If you find it helpful, please share on Facebook, on Twitter, on Weibo, or on your chat groups! Knowledge is power, knowledge shared is power multiplied. Thank you!

Content

Principle
Scalability
Availability
Stability
Performance
Intelligence
Architecture
Interview
Organization
Talk
Book

Principle

Lessons from Giant-Scale Services - Eric Brewer, UC Berkeley & Google
Designs, Lessons and Advice from Building Large Distributed Systems - Jeff Dean, Google
How to Design a Good API & Why it Matters - Joshua Bloch, CMU & Google
On Efficiency, Reliability, Scaling - James Hamilton, VP at AWS
Things to Keep in Mind When Building a Platform for the Enterprise - Heidi Williams, VP Platform at Box
Principles of Chaos Engineering
Finding the Order in Chaos
The Twelve-Factor App
Clean Architecture
High Cohesion and Low Coupling
Monoliths and Microservices
CAP Theorem and Trade-offs
CP Databases and AP Databases
Stateless vs Stateful Scalability
Scale Up vs Scale Out
Scale Up vs Scale Out: Hidden Costs
ACID and BASE
Blocking/Non-Blocking and Sync/Async
Performance and Scalability of Databases
Database Isolation Levels and Effects on Performance and Scalability
The Probability of Data Loss in Large Clusters
Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence
SQL vs NoSQL
SQL vs NoSQL - Lesson Learned at Salesforce
NoSQL Databases: Survey and Decision Guidance
How Sharding Works
Consistent Hashing
Consistent Hashing: Algorithmic Tradeoffs
Don’t be tricked by the Hashing Trick
Uniform Consistent Hashing at Netflix
Eventually Consistent - Werner Vogels, CTO at Amazon
Cache is King
Anti-Caching
Understand Latency
Latency Numbers Every Programmer Should Know
The Calculus of Service Availability
Architecture Issues When Scaling Web Applications: Bottlenecks, Database, CPU, IO
Common Bottlenecks
Life Beyond Distributed Transactions
Relying on Software to Redirect Traffic Reliably at Various Layers
Breaking Things on Purpose
Avoid Over Engineering
Scalability Worst Practices
Use Solid Technologies - Don’t Re-invent the Wheel - Keep It Simple!
Simplicity by Distributing Complexity
Why Over-Reusing is Bad
Performance is a Feature
Make Performance Part of Your Workflow
The Benefits of Server Side Rendering over Client Side Rendering
Automate and Abstract: Lessons at Facebook
AWS Do's and Don'ts
(UI) Design Doesn’t Scale - Stanley Wood, Design Director at Spotify
Linux Performance
Building Fast and Resilient Web Applications - Ilya Grigorik
Accept Partial Failures, Minimize Service Loss
Design for Resiliency
Design for Self-healing
Design for Scaling Out
Design for Evolution
Learn from Mistakes

Scalability

Microservices and Orchestration
Distributed Caching
Distributed Locking
Distributed Tracking, Tracing, and Measuring
Distributed Scheduling
Distributed Monitoring and Alerting
Distributed Security
Distributed Messaging, Queuing, and Event Streaming
Distributed Logging
Distributed Searching
Distributed Storage
- In-memory Storage
- Object Storage
Relational Databases
NoSQL Databases
Time Series Databases
Distributed Repositories, Dependencies, and Configurations Management
Scaling Continuous Integration and Continuous Delivery

Availability

Resilience Engineering: Learning to Embrace Failure
Failover
Load Balancing
Rate Limiting
Autoscaling
Availability in Globally Distributed Storage Systems at Google
NodeJS High Availability at Yahoo
Operations (11 parts) at LinkedIn
Monitoring Powers High Availability for LinkedIn Feed
Supporting Global Events at Facebook
High Availability at BlaBlaCar
High Availability at Netflix
High Availability Cloud Infrastructure at Twilio
Automating Datacenter Operations at Dropbox
Globalizing Player Accounts at Riot Games

Stability

Circuit Breaker
Timeouts
Crash-safe Replication for MySQL at Booking.com
Bulkheads: Partition and Tolerate Failure in One Part
Steady State: Always Put Logs on Separate Disk
Throttling: Maintain a Steady Pace
Multi-Clustering: Improving Resiliency and Stability of a Large-scale Monolithic API Service at LinkedIn
Determinism (4 parts) in League of Legends Server

Performance

Performance Optimization on OS, Storage, Database, Network
Performance Optimization by Tuning Garbage Collection
Performance Optimization on Image, Video, Page Load
Performance Optimization by Brotli Compression
Performance Optimization on Languages and Frameworks

Intelligence

Big Data
Distributed Machine Learning

Architecture

Systems We Make
Tech Stack (2 parts) at Uber
Tech Stack at Medium
Tech Stack at Shopify
Building Services (4 parts) at Airbnb
Architecture of Evernote
Architecture of Chat Service (3 parts) at Riot Games
Architecture of League of Legends Client Update
Architecture of Ad Platform at Twitter
Architecture of API Gateway at Uber
Basic Architecture of Slack
Back-end at LinkedIn
Back-end at Flickr
Infrastructure (3 parts) at Zendesk
Cloud Infrastructure at Grubhub
Real-time Presence Platform at LinkedIn
Settings Platform at LinkedIn
Nearline System for Scale and Performance (2 parts) at Glassdoor
Real-time User Action Counting System for Ads at Pinterest
API Platform at Riot Games
Games Platform at The New York Times
Kabootar: Communication Platform at Swiggy
Simone: Distributed Simulation Service at Netflix
Seagull: Distributed System that Helps Running > 20 Million Tests Per Day at Yelp
PriceAggregator: Intelligent System for Hotel Price Fetching (3 parts) at Agoda
Phoenix: Testing Platform (3 parts) at Tinder
Hexagonal Architecture at Netflix
Architecture of Play API Service at Netflix
Architecture of Sticker Services at LINE
Stack Overflow Enterprise at Palantir
Architecture of Following Feed, Interest Feed, and Picked For You at Pinterest
API Specification Workflow at WeWork
Media Database at Netflix
Member Transaction History Architecture at Walmart
Sync Engine (2 parts) at Dropbox
Architectures of Finance and Banking Systems
- Bank Backend at Monzo
- Trading Platform for Scale at Wealthsimple
- Core Banking System at Margo Bank
- Architecture of Nubank
- Tech Stack at TransferWise
- Tech Stack at Addepar
- Avoiding Double Payments in a Distributed Payments System at Airbnb

Interview

Designing Large-Scale Systems
Explaining Low-Level Systems (OS, Network/Protocol, Database, Storage)
"What Happens When... and How" Questions

Organization

Engineering Levels at SoundCloud
Engineering Roles at Palantir
Scaling Engineering Teams at Twitter
Scaling Decision-Making Across Teams at LinkedIn
Scaling Data Science Team at GOJEK
Scaling Agile at Zalando
Scaling Agile at bol.com
Lessons Learned from Scaling a Product Team at Intercom
Hiring, Managing, and Scaling Engineering Teams at Typeform
Scaling the Datagram Team at Instagram
Scaling the Design Team at Flexport
Team Model for Scaling a Design System at Salesforce
Building Analytics Team (4 parts) at Wish
From 2 Founders to 1000 Employees at TransferWise
Lessons Learned Growing a UX Team from 10 to 170 at Adobe
Five Lessons from Scaling at Pinterest
Approach Engineering at Vinted
Using Metrics to Improve the Development Process (and Coach People) at Indeed
Mistakes to Avoid while Creating an Internal Product at Skyscanner
RACI (Responsible, Accountable, Consulted, Informed) at Etsy
Four Pillars of Leading People (Empathy, Inspiration, Trust, Honesty) at Zalando
Pair Programming at Shopify
Distributed Responsibility at Asana
Rotating Engineers at Zalando
Experiment Idea Review at Pinterest
Tech Migrations at Spotify
Improving Code Ownership at Yelp
Agile Code Base at eBay
Code Review
- Code Review at Palantir
- Code Review at LINE
- Code Reviews at Medium
- Code Review at LinkedIn
- Code Review at Disney
- Code Review at Netlify

Talk

Distributed Systems in One Lesson - Tim Berglund, Senior Director of Developer Experience at Confluent
Building Real Time Infrastructure at Facebook - Jeff Barber and Shie Erlich, Software Engineer at Facebook
Building Reliable Social Infrastructure for Google - Marc Alvidrez, Senior Manager at Google
Building a Distributed Build System at Google Scale - Aysylu Greenberg, SDE at Google
Site Reliability Engineering at Dropbox - Tammy Butow, Site Reliability Engineering Manager at Dropbox
How Google Does Planet-Scale for Planet-Scale Infra - Melissa Binde, SRE Director for Google Cloud Platform
Netflix Guide to Microservices - Josh Evans, Director of Operations Engineering at Netflix
Achieving Rapid Response Times in Large Online Services - Jeff Dean, Google Senior Fellow
Architecture to Handle 80K RPS Celebrity Sales at Shopify - Simon Eskildsen, Engineering Lead at Shopify
Lessons of Scale at Facebook - Bobby Johnson, Director of Engineering at Facebook
Performance Optimization for the Greater China Region at Salesforce - Jeff Cheng, Enterprise Architect at Salesforce
How GIPHY Delivers a GIF to 300 Million Users - Alex Hoang and Nima Khoshini, Services Engineers at GIPHY
High Performance Packet Processing Platform at Alibaba - Haiyong Wang, Senior Director at Alibaba
Solving Large-scale Data Center and Cloud Interconnection Problems - Ihab Tarazi, CTO at Equinix
Scaling Dropbox - Kevin Modzelewski, Back-end Engineer at Dropbox
Scaling Reliability at Dropbox - Sat Kriya Khalsa, SRE at Dropbox
Scaling with Performance at Facebook - Bill Jia, VP of Infrastructure at Facebook
Scaling Live Videos to a Billion Users at Facebook - Sachin Kulkarni, Director of Engineering at Facebook
Scaling Infrastructure at Instagram - Lisa Guo, Instagram Engineering
Scaling Infrastructure at Twitter - Yao Yue, Staff Software Engineer at Twitter
Scaling Infrastructure at Etsy - Bethany Macri, Engineering Manager at Etsy
Scaling Real-time Infrastructure at Alibaba for Global Shopping Holiday - Xiaowei Jiang, Senior Director at Alibaba
Scaling Data Infrastructure at Spotify - Matti (Lepistö) Pehrs, Spotify
Scaling Pinterest - Marty Weiner, Pinterest’s founding engineer
Scaling Slack - Bing Wei, Software Engineer (Infrastructure) at Slack
Scaling Backend at YouTube - Sugu Sougoumarane, SDE at YouTube
Scaling Backend at Uber - Matt Ranney, Chief Systems Architect at Uber
Scaling Global CDN at Netflix - Dave Temkin, Director of Global Networks at Netflix
Scaling Load Balancing Infra to Support 1.3 Billion Users at Facebook - Patrick Shuff, Production Engineer at Facebook
Scaling (a NSFW site) to 200 Million Views A Day And Beyond - Eric Pickup, Lead Platform Developer at MindGeek
Scaling Counting Infrastructure at Quora - Chun-Ho Hung and Nikhil Gar, SEs at Quora
Scaling Git at Microsoft - Saeed Noursalehi, Principal Program Manager at Microsoft
Scaling Multitenant Architecture Across Multiple Data Centres at Shopify - Weingarten, Engineering Lead at Shopify

Book

Big Data, Web Ops & DevOps Ebooks - O'Reilly (Online - Free)
Google Site Reliability Engineering (Online - Free)
Distributed Systems for Fun and Profit (Online - Free)
What Every Developer Should Know About SQL Performance (Online - Free)
Beyond the Twelve-Factor App - Exploring the DNA of Highly Scalable, Resilient Cloud Applications (Free)
Chaos Engineering - Building Confidence in System Behavior through Experiments (Free)
The Art of Scalability
Web Scalability for Startup Engineers
Scalability Rules: 50 Principles for Scaling Web Sites

Donation

Roses are red. Violets are blue. Binh likes sweet. Treat Binh a tiramisu? 🍰
