As the world embraces the transition to microservice architecture, we're essentially moving from a single, tightly-knit process to a landscape where our application is divided into multiple independent processes. However, this shift introduces a level of uncertainty, because processes now need to communicate with one another over the network. While inter-process communication enables greater flexibility and scalability, it also unlocks a new failure mode, presenting us with a new set of considerations to ensure seamless cooperation between services and to improve application resilience to failure.
One helpful tool we can use to improve application resilience is called a "circuit breaker." Imagine it like a smart switch that senses when things are going wrong and stops the trouble from spreading. This switch can prevent small problems from growing into big ones and keeps microservices working smoothly together. In microservices terms, it helps prevent cascading failures caused by network issues, service unavailability, or slow response times. It acts as a proxy between the client and the services being called, and monitors those calls. If a service fails to respond within a specified time or returns too many errors, the circuit breaker stops sending further requests to that service and immediately returns either an error message or a fallback response to the client.
Inspiration Drawn From Electrical Circuit Breaker
The Circuit Breaker pattern in Microservice Architecture draws its inspiration from its real-world counterpart, the electrical circuit breaker. Just as an electrical circuit breaker is designed to protect electrical systems from overloads and faults, the software circuit breaker serves a similar purpose in safeguarding software applications from failures and disruptions.
In an electrical system, a circuit breaker is a device that automatically interrupts the flow of electricity by tripping the circuit when it detects abnormal conditions, such as an overload or a short circuit. This interruption prevents damage to the electrical system and reduces the risk of fire or other hazards. It provides a fail-safe mechanism to prevent further harm while allowing for easy restoration once the issue is resolved.
Similarly, in software engineering, the circuit breaker pattern operates as a fail-safe mechanism to protect software applications from failing or degrading due to service failures, network issues, or unexpected errors. It monitors the health and responsiveness of services or components that the application depends on. When the pattern detects a certain threshold of failures or errors, it "trips" or "opens" the circuit breaker, temporarily blocking requests to the failing component. Unlike an electrical circuit breaker, once the issue is resolved or conditions improve, the circuit breaker can gradually transition back to the "closed" state on its own, allowing normal operation to resume.
Why Circuit Breakers Are Beneficial
Microservice systems have the potential to become quite large and intricate. As more services are added, the way they communicate becomes more complicated. Sometimes, when a user asks for something, the needed information isn't just in one service. Instead, that service has to ask another service for the information, which might then ask another, and so on. This can create a complex web of connections between services. Handling just one user request might involve many services working together in this web. However, if something goes wrong, like a service not responding or crashing, it can affect not only one service, but several others too. This is where we need to be cautious and put safety measures in place to handle failures. Without these measures, problems can quickly spread throughout the system, which is especially risky given how big these systems can become. This is where circuit breakers come into play: they act like protective barriers to help manage and control the impact of these potential issues. Circuit Breaker can provide the following benefits:
- Failing Fast: It monitors the outgoing requests to other services and takes action if it detects that one of those services is unavailable. It fails all outgoing requests to that service, since they are likely to fail too. By doing this, it ensures that the system doesn't waste resources on requests that won't succeed and frees up the threads. This way, users don't have to wait indefinitely for a response that might not come.
- Fallback: When a service is down, a Circuit Breaker can be configured to switch to a fallback mode. It may return cached responses or default values for the requests going towards that service. It may even try to call a secondary service when the primary is unavailable. It may also throw a customized error so that users of the application can easily understand what went wrong with the system.
- Automatic Recovery: The circuit breaker detects issue resolution and automatically restores normal operations by "closing" the circuit. This ensures a seamless system recovery without requiring manual intervention.
- Avoiding Cascading Failures: Circuit breakers prevent cascading failures by quickly stopping requests to a failing service, ensuring other parts of the system remain unaffected. This containment helps maintain system stability and prevents widespread disruptions.
Failures Cascading Through The System
In large microservice setups, various services work together like a connected network to get things done. If one part fails, it can quickly degrade or even completely shut down another part. Imagine there are four services involved in handling a user's request. Each service assigns a thread to work on that request, and these threads are all waiting for the request to finish. Let's see how problems can spread across this setup.
Let's assume the fourth service isn't responding. This causes a delay in the third, second, and first services, as they're all waiting for their tasks to be completed.
When the first service's request to the second service times out, it retries the call.
The second service's request to the third service times out, leading to a retry. Notice that the additional call from the second service to the third service is triggered by the retry attempt from the first service.
This pattern continues: the third service's request to the fourth service times out, resulting in a retry with two extra calls because of retries from the first and second services. As we go down the chain, services start to suffer due to many occupied threads.
As more users make similar requests, the services degrade from the bottom of the chain upward, because their threads are tied up waiting. With no threads available, new requests pile up in queues indefinitely.
Gradually, either a portion or the entire system starts suffering due to one failing service, essentially cascading the failure throughout the system.
Even without the retry feature, the issue we discussed would still exist, albeit to a lesser extent. In this case, each call would tie up a thread in the first, second, and third services until the timeout period ends. When dealing with a microservice handling significant traffic, the impact remains substantial. Multiple users making the same request concurrently could block numerous threads across these services. Unlike a single application in a monolithic setup, threads from three service applications are now held up until timeouts, leading to resource wastage. A sudden surge in traffic could potentially overwhelm these three services, resulting in a deteriorated user experience for those making different types of requests involving at least one of these services.
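The retry walkthrough above can be approximated with a little arithmetic. The sketch below is a simplified worst-case model, not an exact replay of the timeline: it assumes every call times out and every caller retries the same number of times, so each hop multiplies the number of calls. The function name `calls_per_service` and its parameters are hypothetical.

```python
from __future__ import annotations


def calls_per_service(num_services: int, retries_per_call: int) -> list[int]:
    """Worst-case number of calls each service receives for one user request.

    Assumption: every call times out and every caller retries
    `retries_per_call` times, so each hop multiplies the call count
    by (1 + retries_per_call).
    """
    attempts = 1 + retries_per_call
    # Index 0 is the first service (it receives the single user request);
    # the service at depth d receives attempts ** d calls.
    return [attempts ** depth for depth in range(num_services)]


if __name__ == "__main__":
    print(calls_per_service(4, 1))  # [1, 2, 4, 8]
    print(calls_per_service(4, 0))  # [1, 1, 1, 1]
```

With one retry per hop, the unresponsive fourth service may receive up to eight calls for a single user request under these assumptions. Without retries the call count stays flat, but each call still holds a thread in every upstream service until the timeout expires.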
Circuit Breaker
A circuit breaker works well for operations that could potentially fail, such as remote calls. It works like a protective layer around a function and monitors it for failures. If the number of failures (or, in more sophisticated implementations, the failure rate) exceeds a predefined threshold, it prevents further calls to the function. Circuit breakers can be implemented as state machines. Here are the different states they can have:
- Closed State: In the closed state, the circuit breaker permits requests to flow through to the service. It observes the requests, keeping a close watch on response times and error rates. It considers requests taking longer than a predefined value as failed. As long as the error count or error rate remains within acceptable levels, the circuit breaker maintains its closed state, allowing requests to proceed uninterrupted to the service. If the error count or error rate is higher than the acceptable level, it transitions to the open state.
- Open State: In the open state, the circuit breaker interrupts the flow of requests to the service. An error message is returned to the client, indicating the service's unavailability. This state remains in effect for a specified duration, during which requests are denied access to the service. The intention here is to shield the service from overwhelming request loads and to allow it a period of time for recovery and restoration. After the reset period is over, the Circuit Breaker transitions into the half-open state.
- Half-open State: Half-open is the state where the Circuit Breaker checks whether the service has recovered. It does this by letting a limited number of requests go through. If these requests succeed, it moves back to the closed state, showing that normal operation is back. But if the requests fail, the circuit breaker goes back to the open state. Optionally, the reset timeout can be extended when the circuit breaker moves from the half-open state back to the open state.
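The three states above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not a production implementation: it counts consecutive failures rather than a failure rate, does not enforce call timeouts, does not cap concurrent trial requests in the half-open state, and is not thread-safe. All names (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`, `half_open_successes`) are hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 half_open_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold      # consecutive failures that trip the breaker
        self.reset_timeout = reset_timeout              # seconds to stay open
        self.half_open_successes = half_open_successes  # trial successes needed to close again
        self._clock = clock                             # injectable so tests can control time
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Reset period is over: let trial requests through.
            self.state = State.HALF_OPEN
            self._successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.half_open_successes:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0  # a success resets the consecutive-failure count

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # any failed trial request reopens the circuit
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

A caller would wrap each remote call, for example `breaker.call(fetch_inventory, item_id)`, where `fetch_inventory` is a placeholder for whatever function performs the request, and would handle `CircuitOpenError` as the signal to fail fast or use a fallback.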
Configurable Parameters
In Closed State
- Number of failed requests among the last n requests that trips the circuit breaker.
- Time to wait for the response before a request is considered failed.
In Open State
- Time to wait in open state before transitioning to half-open state.
In Half-open State
- Number of allowed requests
- Success rate threshold
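The parameters listed above could be grouped into a single configuration object. The sketch below is one possible shape; the field names and default values are illustrative assumptions, not settings taken from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitBreakerConfig:
    # Closed state
    sliding_window_size: int = 10         # n: how many recent requests are inspected
    failure_count_threshold: int = 5      # failures within the window that trip the breaker
    call_timeout_seconds: float = 2.0     # slower calls are counted as failed
    # Open state
    open_wait_seconds: float = 30.0       # time before moving to half-open
    # Half-open state
    half_open_permitted_calls: int = 3    # trial requests allowed through
    half_open_success_rate_threshold: float = 0.66  # share of trials that must succeed

    def __post_init__(self):
        if not 0 < self.failure_count_threshold <= self.sliding_window_size:
            raise ValueError("failure_count_threshold must be in 1..sliding_window_size")
        if not 0.0 < self.half_open_success_rate_threshold <= 1.0:
            raise ValueError("half_open_success_rate_threshold must be in (0, 1]")
        if self.half_open_permitted_calls < 1:
            raise ValueError("half_open_permitted_calls must be at least 1")

    def should_open(self, recent_results):
        """recent_results: sequence of booleans, True meaning the call failed."""
        window = list(recent_results)[-self.sliding_window_size:]
        return sum(window) >= self.failure_count_threshold
```

Validating the values at construction time means an inconsistent configuration, such as a failure threshold larger than the window, is rejected at startup rather than silently producing a breaker that can never trip.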
Fallback Strategies
When the Circuit Breaker rejects a call to a failed service in open or half-open states, it can trigger a fallback strategy.
- It may respond to the request using cached responses
- It may return a null value or any other default value
- It may return a user-friendly error message
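The three strategies above can be combined in one small helper. This is a sketch under assumptions: `ServiceUnavailable` stands in for whatever error the circuit breaker raises when it rejects a call, and `with_fallback`, `UserFacingError`, and the message text are hypothetical names chosen for illustration.

```python
class ServiceUnavailable(Exception):
    """Stand-in for the error a circuit breaker raises when it rejects a call."""


class UserFacingError(Exception):
    """Error carrying a message that is safe to show to end users."""


_MISSING = object()  # sentinel so that None can be used as a real default value


def with_fallback(primary, cache, key, default=_MISSING):
    """Call primary(); if the call is rejected, fall back in order:
    cached response, then default value, then a user-friendly error."""
    try:
        value = primary()
    except ServiceUnavailable as exc:
        if key in cache:
            return cache[key]          # strategy 1: last known good response
        if default is not _MISSING:
            return default             # strategy 2: default value
        raise UserFacingError(         # strategy 3: readable error message
            "This feature is temporarily unavailable. Please try again shortly."
        ) from exc
    cache[key] = value                 # remember the latest good response
    return value
```

The order of the strategies is a design choice; a system where stale data is unacceptable might skip the cache and go straight to the default value or the error.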
Monitoring
Monitoring a circuit breaker's state changes is essential and should always be logged. The current state should also be accessible for querying and monitoring purposes. Keeping track of how often the state changes can reveal potential issues in the larger system. Monitoring the number of requests within a specific time frame provides insights into service dependencies. Additionally, having a way for Operations to manually trip or reset the circuit breaker is valuable. This monitoring process also conveniently collects metrics about call volumes and response times, making it easier to fine tune the circuit breaker parameters if needed.
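As a rough illustration of these monitoring points, the sketch below logs every state change, counts transitions, exposes the current state for querying, and offers manual trip and reset hooks. The class name `BreakerStateMonitor`, its methods, and the logger name are assumptions made for this example.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")


class BreakerStateMonitor:
    def __init__(self, name, initial_state="closed"):
        self.name = name
        self.state = initial_state    # current state, available for querying
        self.transitions = Counter()  # (old_state, new_state) -> count

    def transition(self, new_state, reason):
        if new_state == self.state:
            return  # not a state change, nothing to log
        logger.warning("breaker %s: %s -> %s (%s)",
                       self.name, self.state, new_state, reason)
        self.transitions[(self.state, new_state)] += 1
        self.state = new_state

    def force_open(self):
        """Manual trip, for example during planned maintenance."""
        self.transition("open", "manual trip by operator")

    def force_close(self):
        """Manual reset once the operator knows the service is healthy."""
        self.transition("closed", "manual reset by operator")

    def snapshot(self):
        """Summary suitable for a health endpoint or metrics export."""
        return {
            "name": self.name,
            "state": self.state,
            "transition_count": sum(self.transitions.values()),
        }
```

A rising `transition_count` for one breaker is the kind of signal that can point to an unstable dependency or to thresholds that need tuning.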
Available Implementations of the Circuit Breaker Pattern
Some widely recognized implementations are Netflix Hystrix, Resilience4j, Sentinel, and Istio. Netflix's Hystrix, an earlier framework, is now in maintenance mode, so it should not be used in new projects. Resilience4j, inspired by Hystrix, has taken its place, even gaining traction within Netflix's own projects. Designed for functional programming, Resilience4j is a lightweight fault tolerance library.
Alibaba's Sentinel offers language support for Java, Go, and C++, while Istio, a service mesh technology, features a built-in circuit breaker function. However, remember that Istio requires adopting its comprehensive service mesh framework to fully leverage this specific feature.
When it comes to designing enterprise microservices, making sure services stay operational even during failures is key. This is where the circuit breaker comes into play: think of it as a useful tool that can manage transient errors and prevent failures from cascading throughout complex systems. Using circuit breakers helps your system remain robust, even when unanticipated problems arise.