When Failsafes Fail and What to Do About It

by Digital Asset October 2, 2020

Legacy systems that support mission-critical networks, the kind that simply cannot go down, usually have some form of failover system. In a recent real-world case, the Tokyo Stock Exchange experienced exactly this type of outage: its primary system went down, which triggered a failover to a secondary system, which also failed. In another, the US 911 system had an hour-long outage across 14 states. In today’s world these things happen. In tomorrow’s world, with the right technology, they don’t need to.

Breaking: There is a major 911 outage across multiple areas in the United States. The 911 lines are down in parts of Arizona, Minnesota, and Nevada.
— PM Breaking News (@PMBreakingNews) September 28, 2020

Two Eggs, One Basket

Running two systems in parallel for failover is common practice in legacy systems, but the nail-biting part for any IT engineer is the switchover, when the primary goes down and the secondary takes over. That transition carries a lot of risk and stress. Everyone holds their breath and then either breathes a sigh of relief when it’s over or panics, loses a lot of money, and calls their significant other to say they’ll be late to dinner during a quick sprint across the server room floor.

When the world's third largest exchange, halts trading for a whole day due to hardware failure, you know that operational risk should be taken seriously. #TokyoStockExchange #riskmanagement
— Manisha Bhardwaj Jain (@ManishaYadnya) October 1, 2020

More Baskets

Distributed ledgers solve this problem by being in a constant state of failover. Their nodes run side by side, continually checking that they’re operating properly with each other. There is no primary and secondary in a distributed ledger; instead there is node 1, node 2, node 3, and so on. If node 1 were to experience a hardware issue and fail, nodes 2 and 3 would continue their operations uninterrupted. IT would have time to diagnose node 1 and bring it back up, the trading day would end, and everyone would go home to eat dinner on time.

These nodes prevent a failure through two mechanisms, relay and consensus:

Relay: When a node receives a new transaction, say a trade, it relays it immediately to the rest of the nodes in the network so that all nodes have seen the same transaction at around the same time.
Consensus: Every so often, say every few seconds, the nodes compare the latest transactions sent to them and programmatically settle on the new set of data. They then continue on their merry way, receiving and processing transactions, and settle up once again just a few moments later.
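
To make the two mechanisms concrete, here is a minimal sketch in Python. It assumes a toy in-memory network: the `Node` and `Network` classes, the method names, and the simple majority rule are all hypothetical illustrations, not Daml or any particular ledger’s actual protocol. Real consensus protocols are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ledger node (illustrative only, not a real ledger API)."""
    name: str
    alive: bool = True
    pending: list = field(default_factory=list)  # relayed but not yet settled
    ledger: list = field(default_factory=list)   # settled transactions


class Network:
    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}

    def live(self):
        return [n for n in self.nodes.values() if n.alive]

    def submit(self, receiver, tx):
        """Relay: the receiving node forwards tx to every live node right away."""
        if not self.nodes[receiver].alive:
            raise RuntimeError(f"{receiver} is down; submit to another node")
        for node in self.live():
            node.pending.append(tx)

    def settle(self):
        """Consensus (toy rule): settle what a majority of all nodes has seen."""
        live = self.live()
        quorum = len(self.nodes) // 2 + 1
        if len(live) < quorum:
            raise RuntimeError("too few live nodes to reach agreement")
        # Collect candidate transactions in first-seen order.
        candidates = []
        for node in live:
            for tx in node.pending:
                if tx not in candidates:
                    candidates.append(tx)
        agreed = [tx for tx in candidates
                  if sum(tx in n.pending for n in live) >= quorum]
        for node in live:
            node.ledger.extend(agreed)
            # Anything not yet agreed stays pending for the next round.
            node.pending = [tx for tx in node.pending if tx not in agreed]
        return agreed

    def fail(self, name):
        self.nodes[name].alive = False

    def recover(self, name):
        """Bring a node back by copying state from a live peer."""
        peer = self.live()[0]
        node = self.nodes[name]
        node.ledger = list(peer.ledger)
        node.pending = list(peer.pending)
        node.alive = True


net = Network(["node1", "node2", "node3"])
net.submit("node1", "trade-A")
net.settle()

net.fail("node1")                 # simulated hardware failure
net.submit("node2", "trade-B")    # the other nodes keep accepting trades
net.settle()

net.recover("node1")              # node 1 catches up after repair
for node in net.nodes.values():
    print(node.name, node.ledger)
```

In this sketch, node 1 goes down between the two trades, nodes 2 and 3 still form a majority and keep settling, and once node 1 is recovered all three nodes hold the same two settled trades.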

This is how distributed ledgers provide a level of fault tolerance not previously seen in legacy systems. Just as importantly, these nodes can be distributed within an organization and across geographies, much as you would with legacy systems. The difference is in when the data syncs. Legacy systems sync up every so often, or not until a critical issue is encountered. Distributed ledgers, on the other hand, sync continuously across the wire, making sure every node is on the same page. So you’re never putting all your eggs in one basket.

Distributed ledgers are easy to deploy. Try our demo in your web browser here:

Build, Deploy, and Run a Daml application