Building a Fault-Tolerant System on AWS
Imagine you have an order processing system that communicates with other services, any of which can fail at random. Well, you start losing a ton of money, and your manager is not happy.
Since nobody is happy with this situation, you start investigating what's happening. You find out that the code does create orders, but they never get processed if any of the services fails.
What's happening? Your service depends on every HTTP request to the other services succeeding, and if one fails, the order doesn't get properly stored in the database. So you realize that you have something like the picture below.
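In code, that kind of tight coupling might look roughly like the sketch below. All names here (place_order_sync, charge, ship, invoice, save) are made up for illustration and are not taken from the real service; the point is only that one failing call aborts everything after it.

```python
class ServiceDown(Exception):
    """Raised when a downstream HTTP call fails (hypothetical error type)."""


def place_order_sync(order, charge, ship, invoice, save):
    """Fragile flow: every downstream call must succeed, in sequence."""
    charge(order)   # HTTP call to the payment service
    ship(order)     # HTTP call to the shipment service
    invoice(order)  # HTTP call to the invoice service
    save(order)     # only reached if none of the calls above raised
```

If ship raises ServiceDown, save is never called, so the order is never properly stored.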
You start pulling your hair out and thinking about how to change this, and then you remember reading about fault-tolerant systems built on queues. Since you are on AWS, you start searching and find out about SQS.
Wait, but there are two different queue types:
- Standard queue
- FIFO queue
Well, if you are building order processing, there are a couple of requirements you should think about. It's important that orders are processed in the proper order and that there are no duplicates; otherwise you will charge people twice for the same order.
For this particular use case a FIFO queue is the better choice: it preserves the order of messages within a message group and supports exactly-once processing by removing duplicates sent within the deduplication interval. A standard queue only offers best-effort ordering and at-least-once delivery.
OK, so now we need to redesign our solution around messages. Our Order Processing service will accept the order through an HTTP request and send it to SQS. The other services will pick up the order and process it asynchronously. We are going to favor asynchronous communication over synchronous.
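Here is a minimal sketch of the sending side, assuming a boto3 SQS client is passed in as sqs and the queue is a FIFO queue. The function name, the message body layout, and the choice of customer_id as the message group are assumptions made for illustration, not part of the original design.

```python
import json


def enqueue_order(sqs, queue_url, order):
    """Publish an order:placed message to a FIFO queue (illustrative helper)."""
    body = json.dumps({"type": "order:placed", "order": order})
    return sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        # Messages sharing a group ID are delivered in the order they were sent.
        # Grouping by customer is an assumption; grouping by order would also work.
        MessageGroupId=str(order["customer_id"]),
        # A repeated send with the same ID inside the deduplication interval is
        # dropped. The event type is part of the ID so that a later order:paid
        # message for the same order is not treated as a duplicate.
        MessageDeduplicationId=f"order:placed:{order['id']}",
    )
```

With real AWS credentials you would create the client with boto3.client("sqs") and pass the URL of a queue whose name ends in .fifo. The HTTP handler can return as soon as send_message succeeds, because from that point the order is held by SQS.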
OK, so now our system is more resilient. Here is what's happening:
- We receive the order request and push a message into SQS, so the order is now stored in SQS. The message order:placed is published.
- The payment service polls SQS and waits for an order:placed message. When it receives one, it tries to process it. If processing fails, the message is not deleted, so SQS makes it available again after the visibility timeout and it is retried later.
- If the payment succeeds, the payment service publishes an order:paid message to SQS, which will be picked up by the invoice service.
- In parallel, the shipment service also picks up order:placed and prepares the order for shipment. (A message in a single SQS queue is handled by one consumer, so in practice the shipment service needs its own queue that receives a copy of order:placed.)
- Once order:paid is in SQS, the invoice service picks it up and generates the PDF invoice for it.
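The consuming side of the steps above can be sketched as follows, again assuming a boto3 SQS client passed in as sqs. The names poll_once and handle are hypothetical; handle stands for whatever the payment, shipment, or invoice service does with a message.

```python
import json


def poll_once(sqs, queue_url, handle):
    """Receive one batch and delete only the messages that were handled."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling: wait up to 20 seconds for messages
    )
    processed = 0
    for message in response.get("Messages", []):
        try:
            handle(json.loads(message["Body"]))
        except Exception:
            # Leave the message in the queue: it becomes visible again after
            # the visibility timeout and will be redelivered. Stop here so
            # later messages are not handled ahead of the failed one.
            break
        # Deleting is what tells SQS the message was processed successfully.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        processed += 1
    return processed
```

A service would call poll_once in a loop. The important design choice is that a message is deleted only after handle returns without raising, which is what makes the retry behavior possible.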
Now you have changed your code, moved from synchronous to asynchronous communication, and can recover from failures. If anything fails at any stage, the message is kept and retried.
The problem with this approach is that it will keep retrying unless we specify a certain threshold. But if you set a threshold and simply drop the message once it is reached, your message is lost again. In the next post, we will cover the concept of Dead Letter Queues and how they help us define a threshold and still recover from failures.