Introduction
Circuit breaker logic is an extreme condition where one needs to stop the application automatically if something is wrong. This problem can be with our application itself or with the third-party APIs on which our application is dependent. To implement circuit breaker logic you must have a good understanding of retry policies and transient exceptions. in electronic terms circuit breaking or opening a circuit means cutting the wire at a single point to stop the flow of current so it does not harm the wire or components connected to that wire.
Structure
We have an existing eCommerce system say Contoso which gets a million hits per second. In Contoso, we have a customer module hosted as a web service. We have an order processing module which is a internal module which indicates the warehouse to start processing the items in the order. We wanted to make both the modules decoupled and async so we introduced a serverless layer that employes azure functions and event hubs.
We used event hubs because it is capable of handeling millions of requests per second until the threshold is reached i.e. 1000 requests per second or 1 MB of request data per second and we have more control on retries in cases of event hub then service bus or queue storage. When a customer makes an order we store the order in any DB and push the data in the event hub. The azure function is triggered by the event hub which sends the order data to the backend process via Refit.
Requirement
After this new implementation, a new problem arose. What if the backend order processing app is down for few minutes? In this case, the azure function will start giving errors for every request filling the precious logs with errors. You also know that if the backend API is down that function is running purposeless and just increasing compute units and cost. Also because of this we have to implement a lookup service to retrieve the lost data which was not processed because of backend pai downtime.
So our requirement is to know if our service is giving a large number of error in a limited time so that we can stop the function app and fix the issue. We also want to retry one message almost 5 times if it generates a transient error
Now comes the benefit of the event hub. Events hubs maintain a pointer which indicates that up to this location events have been processed. When the event hub delivers a message to the function app it expects a response, if the response is a success then the event hub increases the pointer by one otherwise it does not increase the counter and the same message can be retried.
Solution
We can either use [FixedDelayRetry] or [ExponentialBackOffRetry] attributes to start the retry for more details see here.
[ExponentialBackoffRetry(5, "00:00:05", "00:00:50")]
The above attribute will retry the same request 5 times after the first request. It will delay the first request by 5 seconds and the last request by 50 seconds. Now we have made one change in our code. in catch blocks instead of just logging the error, we need to throw so that the event hub knows that and it does not increase the pointer Now before stopping the function app we have to create the azure logic app.
You can see in the logic app designer I have added one trigger and one action. You can use HTTP Trigger or event grid trigger but I already had an event hub employed for this so I used event hub trigger.
Configure the trigger with eventhub instance name, consumer group (default or any), and other details set connection to eventhub.
Connect to eventhub using APIConnection.
Here select the eventhub namespace and press create. After this add action as the next step and use Azure resource manager to stop the function app with the following configuration,
Here you can filter the data according to your need or before this HTTPAction, you can introduce a condition to filter the data. and you can also add additional email send action to notify the DevOps saying "something is wrong stopping function app"
You need to make sure that this logic app has appropriate access to the function app which it is going to stop.
Now the final circuit breaker logic and how to call this logic app.
Create a Redis cache instance in Azure where we will store our error count. I wanted to stop the function app I received 40 errors within 100 seconds at any two-point in a timeframe so I set up a cache with sliding expiry.
As soon as we send the "stop app" message on the given eventhub to which the logic app is listening it will stop our function app and save us precious resources, money, and logs.
One good thing about this is whenever the function app is restarted manually or automatically it again starts listening to eventhub message from where it left. Also, keep in mind each partition of eventhub has a separate pointer so.
Hope this helps.
For further discussion feel free to connect with me.
Thanks.
No comments:
Post a Comment