Feb 25, 2017

Hystrix

A microservice architecture brings a lot of flexibility to application development and deployment, but it introduces a new level of complexity when it comes to handling transactions and inter-service communication.

 

A microservices architecture is like a huge web of services, each talking to one or many other services.

   
Microservices Architecture


Each service has its own performance and reliability targets, expressed as a Service Level Agreement (SLA), which may in turn be affected by the performance of the services it depends on.

 

A front-facing service could be held hostage if one or more of the services it depends on is unable to meet its SLA. This in turn would impact the SLA of the front-facing service and might end up hurting the user experience of the application.

This problem is magnified in large-scale systems.

 
Service Issues
A service and its dependencies

This is where the 'Circuit Breaker' design pattern comes into the picture. The pattern works much like an electrical circuit breaker: the idea is to trip the circuit when something goes wrong, preventing the issue from escalating into a disaster.

The diagram below illustrates the design pattern as explained by Martin Fowler in 'Application Architecture'.

Martin Fowler - Circuit Breaker Pattern


Key Parameters Affecting API SLAs in a Microservice Architecture:

 

1. Connection Timeouts
	
   Occurs when the client is unable to connect to a service within a given timeframe, typically because the service is slow or unresponsive.
   
   

2. Read Timeouts

   Occurs when the client is unable to read the response from the service within a given timeframe. The service may be doing a lot of computation or preparing the data to be returned inefficiently.
	

3. Exceptions Caused By

* Bad data sent to the service by the client
* The service being down
* An issue on the service
* An issue on the client while parsing the response, such as a response format change on the service that the client is unaware of
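To make the difference between these failure modes concrete, here is a minimal sketch that uses only Python's standard `socket` module. The function name `probe`, the category strings and the 3-second defaults are all hypothetical choices made for illustration. A real client would normally be an HTTP library with its own timeout settings.

```python
import socket

def probe(host, port, connect_timeout=3.0, read_timeout=3.0):
    """Classify one request into the failure modes listed above (illustrative only)."""
    try:
        # Connection phase: bounded by connect_timeout.
        conn = socket.create_connection((host, port), timeout=connect_timeout)
    except socket.timeout:
        return "connect_timeout"
    except OSError:
        # For example "connection refused" when nothing is listening.
        return "service_down"
    with conn:
        # Read phase: bounded separately by read_timeout.
        conn.settimeout(read_timeout)
        try:
            conn.sendall(b"ping\n")
            data = conn.recv(1024)
        except socket.timeout:
            return "read_timeout"
        except OSError:
            return "service_error"
        # An empty read means the peer closed the connection without replying.
        return "ok" if data else "service_error"
```

Keeping the two timeouts separate matters: a service can accept connections promptly and still take too long to produce a response.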
 

Building Resilient Services

Netflix has built Hystrix, a library which implements the circuit breaker pattern. This library will help us build resilient services.

    

Key advantages of the Circuit Breaker Pattern:


* Fail fast and rapid recovery.
* Prevent cascading failure.
* Fallback and gracefully degrade when possible.

Hystrix library:


* Implements the circuit breaker pattern.
* Provides near real-time monitoring via the Hystrix stream and Hystrix dashboard.
* Generates monitoring events which can be published to external systems like Graphite.
   

Use Case: NewsFeed Aggregation Service

Overview:

The NewsFeed Aggregation service is responsible for delivering the data used to render the 'Recent Activities' page for a user. Recent Activities is similar to the Facebook timeline: it tracks all activities of the user and of other related users.

 
News feed events aggregation service

The aggregation service depends on 3 other microservices. The News Feed service is a wrapper around a data store. The data from the News Feed service is further enriched with information retrieved from two other microservices, user-service and photo-service.

The feed is shown on the landing page of the application and acts as a driver for user engagement on the site. Hence, it is of utmost importance that the operation to fetch the required news feed data is fast and reliable.

 

Users should not see a long 'loading...' animation, or no data at all. Slow, buggy applications are a user killer.

Application Flow Diagram: Before Integrating With Hystrix

Here's a flow diagram showing how the user feed is fetched and enriched. The Newsfeed Aggregation service has three clients, each talking to different services (news feed, user and photo service).

Application flow without Hystrix
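As a rough sketch of this flow, the aggregation step might look like the code below. Everything in it is an assumption made for illustration: the function names, the stub clients that stand in for the three remote services, and the field names `actor_id` and `photo_id`. None of it comes from the actual service.

```python
def aggregate_feed(user_id, fetch_feed, fetch_user, fetch_photo):
    """Fetch a user's feed and enrich each event with user and photo details."""
    enriched = []
    for event in fetch_feed(user_id):
        enriched.append({
            **event,
            "actor": fetch_user(event["actor_id"]),    # call to user-service
            "photo": fetch_photo(event["photo_id"]),   # call to photo-service
        })
    return enriched

# Stub clients standing in for the three remote services (hypothetical data).
def fetch_feed(user_id):
    return [{"event": "liked", "actor_id": 7, "photo_id": 42}]

def fetch_user(actor_id):
    return {"id": actor_id, "name": "Alice"}

def fetch_photo(photo_id):
    return {"id": photo_id, "url": f"/photos/{photo_id}.jpg"}

feed = aggregate_feed(1, fetch_feed, fetch_user, fetch_photo)
```

Every enriched event depends on all three clients succeeding, which is why a single slow dependency affects the whole page.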

Each client has a 'Read Timeout' and a 'Connect Timeout' associated with it. Let's assume both are configured to be 3 seconds. If a dependent service doesn't respond within 3 seconds, or if the calling service is unable to establish a connection within 3 seconds, the client throws a read or connect timeout exception.

In the happy path scenario, everything works as usual and the dependent services return data within a few milliseconds. The application is able to retrieve the feed and return it to the user.

However, if one of the dependent services goes down, every user will see an increase in response time (at least 3 seconds, based on the timeout set). Pages will load more slowly and the user experience will suffer.
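A back-of-the-envelope calculation shows the size of the effect. The 3-second timeout comes from the scenario above. The 50 ms healthy latency and the assumption that the three calls are made one after another are illustrative guesses, not measurements.

```python
HEALTHY_MS = 50     # assumed latency of a healthy call (illustrative)
TIMEOUT_MS = 3000   # the 3-second timeout from the scenario above

def response_time_ms(services_down, total_services=3):
    """Response time for one request when the calls are made sequentially."""
    healthy = total_services - services_down
    return healthy * HEALTHY_MS + services_down * TIMEOUT_MS

print(response_time_ms(0))  # 150
print(response_time_ms(1))  # 3100
```

Under these assumptions, one unavailable dependency makes every request roughly twenty times slower for as long as the outage lasts.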


Application Flow Diagram: After Integrating With Hystrix

Application flow with Hystrix integrated

Using the Hystrix library, we can have the service execute a fallback operation once the failure rate for calls to an external service reaches a certain threshold. At that point the circuit is considered open. While the circuit is open, the service executes only the fallback operation and avoids reaching out to the external service for a certain cooldown period, which saves the expensive 3-second timeout on each API call to the external service.

After the cooldown period, Hystrix allows one request to go out to the external service. If it succeeds, the circuit is closed again and all subsequent calls reach out to the external service until the failure threshold is reached again. If it fails, the circuit stays open for another cooldown period.
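The open, cooldown and trial-request cycle can be sketched in a few lines of plain Python. This is not the Hystrix API and not how Hystrix is implemented. It is a simplified, single-threaded illustration that trips after a number of consecutive failures instead of tracking a failure rate, and the class name, parameters and defaults are all hypothetical.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open -> trial request -> closed."""

    def __init__(self, failure_threshold=3, cooldown_seconds=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Open: skip the remote call entirely and fail fast.
                return fallback()
            # Cooldown elapsed: fall through and let this one trial request out.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the circuit, or re-trip it after a failed trial request.
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_photo_from_service, use_placeholder_photo)`, where both function names are placeholders. While the circuit is open the fallback returns immediately, so users no longer pay the timeout on every request.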