https://www.youtube.com/watch?v=VZePNGQojfA
An overview of common, existing stability patterns.
Reliability engineering is practiced in many fields, such as mechanical and aerospace engineering, which can serve as references. First, some definitions are required.
availability: the probability that the system is operating at time T. Availability is the success of the mission you give to a piece of work.
stability: an architectural characteristic; producing availability despite faults and errors.
fault: an incorrect internal state in the application, created by a defect or by injection (e.g. bad network packets). Faults happen. You can tolerate a fault and correct it (e.g. with exceptions, as in Java or Python), or use a fault-intolerant system, as in Erlang, that crashes if something goes wrong.
error: an observably incorrect operation.
failure: a loss of availability; the mission fails and the system becomes unresponsive. You must avoid this case.
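
The two ways of dealing with a fault can be sketched in Python. This is only an illustration under assumptions: the function names, the default port, and the `supervise` helper are hypothetical, and `supervise` is a very rough stand-in for an Erlang-style supervisor, not a description of how Erlang works internally.

```python
def parse_port_tolerant(raw, default=8080):
    """Tolerate the fault: catch the exception and correct the state."""
    try:
        return int(raw)
    except ValueError:
        # The input was bad; fall back to a known-good value (assumed default).
        return default


def parse_port_crashing(raw):
    """Do not tolerate the fault: let the exception end the task."""
    return int(raw)


def supervise(task, max_restarts=3):
    """Restart a crashed task from scratch, up to max_restarts times."""
    for _ in range(max_restarts + 1):
        try:
            return task()
        except Exception:
            # Discard whatever state the task had and start it afresh.
            continue
    raise RuntimeError("task kept crashing; giving up")
```

In the first style the code keeps running with a corrected state; in the second the faulty state is thrown away and something outside the task decides whether to restart it.
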
# Stability Anti Patterns
Integration points: side effects caused by things outside your control, such as a socket, a process, or a pipe, can kill the system. It is impossible to find all the errors before one appears. Can we prevent these issues?
Well-behaved errors are defined in the spec: they are expected behavior, documented somewhere.
Wicked errors are outside the scope of the spec: they exist and sometimes happen, but you have no documentation about them.
Failures propagate quickly
Large systems fail faster than small ones
To isolate these failures, you can use circuit breakers, timeouts... and add more tests in development.
Chain reaction: even if everything is redundant, when a node fails the others can be affected because they take over its load. The failure moves horizontally across a tier.
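
The arithmetic behind a chain reaction can be sketched as follows. The model is an assumption made for illustration: the load is spread evenly over the healthy nodes, every node has the same capacity, and a node drops out as soon as its share exceeds that capacity. The function names are hypothetical.

```python
def load_per_node(total_load, healthy_nodes):
    """Share of the load carried by each healthy node (even spread assumed)."""
    if healthy_nodes <= 0:
        raise ValueError("no healthy nodes left")
    return total_load / healthy_nodes


def survivors_after_one_failure(total_load, nodes, capacity):
    """Return how many nodes are still healthy after a single node fails."""
    healthy = nodes - 1  # the initial failure
    while healthy > 0 and load_per_node(total_load, healthy) > capacity:
        # An overloaded node drops out, which raises the share of the rest.
        healthy -= 1
    return healthy
```

With ten nodes of capacity 100 and a total load of 800, losing one node leaves nine nodes at about 89 each and the tier survives. With a total load of 950, the nine remaining nodes are already above capacity, so they fail one after another until none is left.
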
Self-denial attacks: good marketing can kill you because of the connections generated by customers. A good solution is to defend yourself, but also to ask the marketing team for the calendar and schedule of external communications.
Hints: desk-check ratios and check the many unseen things. If the system is overloaded, you should fail fast.
# Stability Patterns
Use timeouts and consider delayed retries, but retries can make users wait for a common result: failure.
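
A minimal sketch of a delayed retry, assuming the wrapped operation enforces its own timeout and raises `TimeoutError` when it expires. The name `call_with_retry` and its parameters are hypothetical, and the linearly growing delay is just one possible choice.

```python
import time


def call_with_retry(operation, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Call operation until it succeeds or the attempts are exhausted.

    operation is assumed to apply its own timeout and raise TimeoutError.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts:
                # Wait before retrying: an immediate retry usually hits the
                # same problem that caused the timeout.
                sleep(delay_s * attempt)
    # Every attempt timed out: the caller waited only to learn of the failure.
    raise last_error
```

The operation could, for example, wrap `socket.create_connection(address, timeout=2.0)`, which raises `TimeoutError` on Python 3.10 and later. Keeping `attempts` small bounds how long a user waits for what may still end in failure.
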
Use circuit breakers together with timeouts, and expose and report their state.
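
A circuit breaker with an exposed state could look roughly like this. It is a sketch, not a reference implementation: the class names, the threshold, and the three state labels are assumptions, and the wrapped operation is expected to apply its own timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        """Expose the state so it can be reported to monitoring."""
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a dependency that is known to be failing. After `reset_timeout_s` a single trial call decides whether the circuit closes again.
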
Use bulkheads (partitioning the system): each partition must have its own pool of threads or workers.
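
One way to picture bulkheads is a separate, bounded thread pool per partition, so that work stuck in one partition cannot consume the workers of another. The `Bulkheads` class and the partition names used with it are hypothetical; only `concurrent.futures.ThreadPoolExecutor` comes from the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One bounded worker pool per partition."""

    def __init__(self, pool_sizes):
        # pool_sizes maps a partition name to its number of worker threads.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, partition, func, *args):
        """Run func on the pool that belongs to this partition only."""
        return self._pools[partition].submit(func, *args)

    def shutdown(self):
        """Wait for running work to finish and release all pools."""
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

For instance, with `Bulkheads({"payments": 4, "search": 8})`, a slow payment provider can block at most four threads, while search requests keep being served by their own eight.
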
Use the fail-fast method: avoid slow responses, blocked threads, and cascading failures.
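
Failing fast can be sketched with a bounded queue that rejects new work immediately when it is full, rather than letting callers block. The class and exception names are hypothetical; `queue.Queue.put_nowait` and `queue.Full` are standard library.

```python
import queue


class OverloadedError(Exception):
    """Raised immediately when no more work can be accepted."""


class FailFastQueue:
    def __init__(self, max_pending):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        """Accept the request, or fail at once if the queue is full."""
        try:
            # put_nowait never blocks: a full queue is reported right away.
            self._pending.put_nowait(request)
        except queue.Full:
            raise OverloadedError("too many pending requests") from None

    def next_request(self):
        """Take the oldest pending request (raises queue.Empty if none)."""
        return self._pending.get_nowait()
```

The caller learns about the overload in microseconds and can show an error or try elsewhere, instead of holding a thread until a timeout expires.
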
Note: we reinvent Remote Procedure Call (RPC) every day; it is the same thing under a different name (e.g. message passing).
# Questions
- How do you monitor a system for failure? The absence of signals is itself the signal to monitor. If something goes wrong, that is the state you should be monitoring.
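
Monitoring for the absence of a signal can be sketched as a heartbeat check: each service reports in regularly, and the monitor flags the ones that have gone quiet. The `HeartbeatMonitor` class, its method names, and the silence threshold are assumptions made for this example.

```python
import time


class HeartbeatMonitor:
    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_seen = {}

    def beat(self, service):
        """Record that the service just reported in."""
        self._last_seen[service] = self._clock()

    def silent_services(self):
        """Return the services whose last heartbeat is too old."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.max_silence_s
        )
```

A service that hangs or dies cannot send an error, but it also stops sending heartbeats, so alerting on `silent_services()` catches exactly the case where nothing is reported at all.
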