Designing Data-intensive Applications – Chapter 1 Notes

Three main concerns:

1. Reliability 

2. Scalability

3. Maintainability 

Reliability:

1. App performs as expected

2. Can tolerate mistakes made by the user

3. Prevent unauthorized access and abuse

  • Things that can go wrong are called faults. A system should be fault-tolerant, but that doesn't mean it copes with every possible fault; tolerating certain types of faults is what makes sense.
  • Faults are not the same as failures: a fault is defined as one component of the system deviating from its spec, while a failure is when the system as a whole stops providing the required service to the user.
  • We can't reduce the probability of faults to zero, so it makes sense to trigger faults deliberately and verify that the fault-tolerance machinery actually works.
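Deliberate fault injection can be sketched as a wrapper that makes calls fail at random, so callers' fault handling gets exercised continuously. This is a toy illustration; the names (`with_fault_injection`, the error type, the rate) are all hypothetical, not from the book:

```python
import random

def with_fault_injection(fn, failure_rate=0.1, rng=random.Random(42)):
    """Wrap fn so that calls fail at random, exercising callers' fault handling."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")  # simulated component failure
        return fn(*args, **kwargs)
    return wrapped

# Example: a deliberately flaky version of an otherwise reliable call.
flaky_fetch = with_fault_injection(lambda: "ok", failure_rate=0.1)
```

Callers that survive the injected `ConnectionError`s will also survive real ones of the same type.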

Hardware faults:

1. Hard disks crash, RAM becomes faulty, a machine gets unplugged. In a cluster with 10,000 disks, we should expect about one disk to fail per day.

2. This can be tackled by adding redundancy: RAID for disks, hot-swappable CPUs, etc.

3. Redundancy of power sources (dual power supplies, backup generators)

4. Rolling upgrades to minimize downtime

5. Design systems that can tolerate the loss of entire machines

Software Errors:

  • A software bug that causes every instance of an application server to crash given a particular bad input
  • A runaway process that uses up shared resources: RAM, CPU time, network bandwidth, etc.
  • A service that the system depends on slows down or becomes unresponsive
  • Cascading failures, where a small fault in one component triggers faults in other components

There is no quick solution to software errors, but lots of small things help:

  1. Carefully think about the assumptions and interactions in the system
  2. Thorough testing
  3. Process isolation
  4. Allowing processes to crash and restart
  5. Measuring, monitoring, and analyzing system behaviour in production
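Point 4 (allowing processes to crash and restart) can be sketched as a small supervisor loop. `supervise`, `flaky_worker`, and the restart policy below are hypothetical illustrations, not anything from the book:

```python
import time

def supervise(worker, max_restarts=3, backoff_sec=0.0):
    """Run worker(); if it crashes, restart it, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # escalate: restarting did not help
            time.sleep(backoff_sec)  # brief pause before restarting

# Hypothetical flaky worker: crashes on the first two runs, then succeeds.
attempts = {"n": 0}
def flaky_worker():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated crash")
    return "done"
```

Real supervisors (systemd, Kubernetes, Erlang/OTP) apply the same idea at the process level.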

Human Errors:

Human errors are the leading cause of outages, whereas hardware faults account for only 10–15% of outages.

How to deal with Human errors:

  1. Minimize opportunities for human errors
  2. Decouple the places where people make the most mistakes from the places where mistakes can cause failures; provide non-prod sandbox environments for experimentation and testing
  3. Test thoroughly:
    1. Unit testing, integration testing, automated testing
  4. Allow quick and easy recovery from human errors:
    1. Easy rollbacks
    2. Provide tools to recompute data if data processed already was incorrect
  5. Set up detailed and clear monitoring:
    1. So it is easy to debug errors and fix them 
    2. Metrics can signal early if there are issues in the assumptions and constraints

Scalability:

A system's ability to cope with increased load.

How to describe load: 

Load is described with a few numbers we call load parameters.

The choice of parameters will depend on the architecture of the system. 

Could be : 

  1. Requests per second
  2. Ratio of reads to writes
  3. Number of active users
  4. Cache hit ratio
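As a sketch, these load parameters could be computed from a (hypothetical) access log; the log format and variable names below are made up for illustration:

```python
from collections import Counter

# Hypothetical access-log entries: (timestamp_sec, method, cache_status).
log = [
    (0, "GET", "hit"), (0, "GET", "miss"), (1, "POST", "miss"),
    (1, "GET", "hit"), (2, "GET", "hit"), (2, "POST", "miss"),
]

window_sec = max(t for t, _, _ in log) - min(t for t, _, _ in log) + 1
requests_per_sec = len(log) / window_sec             # 6 requests over 3 s

methods = Counter(method for _, method, _ in log)
read_write_ratio = methods["GET"] / methods["POST"]  # reads per write

get_statuses = [status for _, method, status in log if method == "GET"]
cache_hit_ratio = get_statuses.count("hit") / len(get_statuses)
```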

Describing performance: this helps to investigate what happens when the load increases.

1. When you increase a load parameter and keep system resources (CPU, memory, network, etc.) unchanged, how is performance affected?

2. When you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?

Difference between latency and response time

Response time: Actual time to process + Networking delay

Latency: How long request was waiting before it was handled (it was latent in this duration awaiting service)

Percentiles are better than averages: the mean does not tell us how many users actually experienced a given response time.
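A quick illustration of why: in a hypothetical sample where 97% of requests take 20 ms and 3% take around a second, the mean looks fine while the percentiles expose the tail. The `percentile` helper (nearest-rank method) is just for this sketch:

```python
import statistics

def percentile(values, p):
    """p-th percentile (0 < p <= 100) by the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical sample: 97 fast requests, 3 slow outliers (times in ms).
response_times = [20] * 97 + [900, 950, 1000]

mean = statistics.mean(response_times)  # pulled up by the three outliers
p50 = percentile(response_times, 50)    # what a typical user sees
p99 = percentile(response_times, 99)    # what the slowest 1% see
```

Here the mean (~48 ms) describes no actual user: most saw 20 ms, the unlucky 1% saw ~1 s.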

Amazon found that a 100 ms increase in response time reduced sales by 1%.

It helps to discuss and analyze tail latencies like p95, p99, and p99.9, since the customers with the slowest requests often have the most data in their accounts and hence could be the most valuable customers to cater to.

But chasing p99.9 can mean expensive changes across the board with diminishing returns: those response times are easily affected by random events outside our control, and are correspondingly hard to fix.

Head of line blocking:

Queueing delays often account for a large part of response time at high percentiles. A server can process only a small number of requests in parallel (limited e.g. by its CPU cores), so it takes only a few slow requests to hold up the processing of subsequent ones.
Even if the later requests could have been completed quickly on their own, clients experience slow response times because they wait behind the few slow requests.
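A minimal single-worker queue simulation (all numbers hypothetical) shows the effect: one 200 ms request inflates the perceived response time of every request queued behind it:

```python
def simulate(arrival_ms, service_ms):
    """Response times (queueing wait + service) for a single-worker server."""
    free_at = 0
    responses = []
    for arrive, service in zip(arrival_ms, service_ms):
        start = max(arrive, free_at)        # wait if the server is still busy
        free_at = start + service
        responses.append(free_at - arrive)  # what the client perceives
    return responses

arrivals = [0, 10, 20, 30, 40, 50]  # one request every 10 ms
services = [5, 200, 5, 5, 5, 5]     # the second request is slow
```

Requests 3 through 6 each need only 5 ms of work, yet their clients wait close to 200 ms.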

When load testing, the client should keep sending requests independently of response times; if it waits for each request to finish before sending the next, it keeps the queue artificially shorter than in reality and skews the measurements.

 How to keep up with the load?

Horizontal scaling: distribute the load across many smaller machines (scaling out).

Vertical scaling: move to a more powerful machine (scaling up). With only a few powerful machines there is less to maintain, but high-end machines are expensive.

In reality, good architectures usually involve a mix of the two strategies: several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

Elastic systems:
Can add or remove computing resources as load changes, either automatically based on metrics from monitoring, or manually scaled up and down by operators.
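One simple automatic policy is target tracking: scale the instance count proportionally so that average utilization moves toward a target. This sketch is hypothetical (the function name, the 60% CPU target, and the fleet limits are all made up for illustration):

```python
import math

def desired_instances(current, cpu_utilization, target=0.6, min_n=1, max_n=10):
    """Target-tracking rule: scale so average CPU heads toward `target`."""
    wanted = math.ceil(current * cpu_utilization / target)
    return max(min_n, min(max_n, wanted))  # clamp to the allowed fleet size
```

For example, 4 instances running at 90% CPU would scale out to 6, while 4 instances at 30% would scale in to 2.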

There is no one-size-fits-all scalable architecture: it all depends on the volume of reads, volume of writes, size of the data, complexity of the data, access patterns, etc.

Maintainability:  

The majority of the cost of software is not in its initial development but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, adding new use cases, repaying technical debt, and adding new features.

Design principles for maintainability:

Operability:

Making it easier for the operations team to keep the system running smoothly.

Simplicity: 

Making the system easy for new engineers to understand.

Evolvability: 

Making it easy for engineers to change the system in the future and add new features.