All you didn't know but you wanted to know about smart contract exception handling
What would be your reaction to a programming language developer who told you: "I've created this new language which deliberately doesn't have any exception handling because I want to encourage developers to write perfect, error-free code by catastrophically failing on any exception"?
Some may think, "If the Apollo engineers could do it, so can I." But if you are anything like me, your reaction is probably more along the lines of "But... my entire development process is handling exceptions until my tests pass..."
Yet virtually all smart contract languages, and that includes Daml, have taken that approach at launch. Why? And why are they now introducing these features? And why did it take so long and why are the exception handling features so weird?
In this blog series I'll try to answer these questions. In this first post, I'll give a take on the historic background to the lack of exception handling, and why this was shortsighted. In part two I'll talk about the difficulties of providing exception handling in a smart contract language - especially one with strong privacy like Daml - and establish a few properties such a feature must have. And lastly, in part three, I'll compare the Daml exception handling feature to other exception handling features, pointing out both its quirks and the ways in which it is more powerful than your classic C++-style try/catch.
- Part 1: Why smart contract exception handling is a key feature.
- Part 2: Why smart contract exception handling is so subtle and difficult.
- Part 3: Why the resulting features are both the best and the weirdest.
Why even go without exception handling?
Smart contracts serve as the schema of distributed and decentralized systems like blockchains and distributed ledgers. They encapsulate data and the rules for transforming that data, both in terms of processes and permissions on those processes. Probably the most well-known type of smart contract is the token, which represents an asset like cash. In that case the data is who owns how much, and the rules say who can change ownership by creating (minting), transferring, or destroying (burning) value.
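To make this concrete, here is a minimal sketch of what such a token could look like in Daml. It is an illustration rather than a production asset model: the field names `issuer` and `owner`, and the decision to have both the old and the new owner authorize a `Transfer`, are assumptions made for this post.

```daml
module Cash where

-- A cash position: who issued it, who owns it, and how much it is worth.
template Cash
  with
    issuer : Party
    owner : Party
    amount : Decimal
  where
    signatory issuer, owner
    -- A position must always hold a strictly positive amount.
    ensure amount > 0.0

    -- The current and the new owner jointly move the position.
    -- The choice is consuming: the old position is archived and a
    -- new one is created for the new owner.
    choice Transfer : ContractId Cash
      with
        newOwner : Party
      controller owner, newOwner
      do
        create this with owner = newOwner
```

The data is the `with` block, and the rules are the `signatory`, `ensure`, and `controller` clauses together with the body of the choice.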
When you call the API of a smart contract, like the `Transfer` choice in the Daml contract above, the result of that call is either an error, or a transaction that gets committed. It's black or white: a transfer is valid or it is not. The developer of such a smart contract needs to carefully specify the happy path by adding lots of checks. That's why smart contract languages are full of keywords like `assert`, `require`, `ensure`, `error`, `revert`, etc. Bonus points if you know which of these are Daml and which are other languages.
In case of errors, these need to be handled by the client application (or user) that submitted the transaction in the first place. For example, if you get an error that you couldn't pay your $200 bill because the `Cash` asset you tried to use only has $100 on it, you need to find a bigger `Cash` position and try again.
In many respects, methods on smart contracts are like transactions in SQL in that it's all about the happy path. And error trapping is a pretty recent development in SQL databases, too, for the same reasons. If I'm not mistaken, T-SQL got TRY/CATCH with SQL Server 2005 and its THROW statement only with SQL Server 2012, not long before Solidity launched as the first smart contract language for blockchains.
That analogy with database transactions goes further. Transactionality or atomicity of smart contract calls is a key property of the system. To build more complex transactions, one composes smart contracts as in this example where two `Cash` contracts get swapped.
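A swap of this kind could be sketched as follows. The `Swap` template, its field names, and the exact precondition in the `assert` are assumptions for illustration. The `Cash` template is repeated so that the module stands on its own.

```daml
module CashSwap where

template Cash
  with
    issuer : Party
    owner : Party
    amount : Decimal
  where
    signatory issuer, owner
    ensure amount > 0.0

    choice Transfer : ContractId Cash
      with
        newOwner : Party
      controller owner, newOwner
      do
        create this with owner = newOwner

-- Both parties have agreed to exchange two specific cash positions.
template Swap
  with
    partyA : Party
    partyB : Party
    cashACid : ContractId Cash -- position currently owned by partyA
    cashBCid : ContractId Cash -- position currently owned by partyB
  where
    signatory partyA, partyB

    choice Settle : (ContractId Cash, ContractId Cash)
      controller partyA
      do
        cashA <- fetch cashACid
        cashB <- fetch cashBCid
        -- Precondition: each party still owns the position it promised.
        assert (cashA.owner == partyA && cashB.owner == partyB)
        -- Both transfers run in one transaction. If either one fails,
        -- the whole transaction is rejected and nothing is committed.
        cashForB <- exercise cashACid Transfer with newOwner = partyB
        cashForA <- exercise cashBCid Transfer with newOwner = partyA
        return (cashForA, cashForB)
```

Because `Swap` is signed by both parties, the body of `Settle` carries the authority of both, which is what allows it to exercise the jointly controlled `Transfer` choices.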
It is absolutely crucial here that the two transfers either happen together or not at all. "Handling" a failure in any of the preconditions (like the `assert`) or either of the transfers is just not in the spirit of the contract. All-or-nothing semantics are a core feature. Furthermore, smart contract execution is expensive. On public networks, transaction costs scale with transaction complexity so it is not in anyone's interest to do any computational work that isn't directly on the happy path of a transaction.
So in short, exception handling is difficult to realize (see part two of the blog series), is subtle from a user point of view (see part three of the blog series), has no practical value for the early canonical use-cases, runs counter to the all-or-nothing execution model, and is expensive.
Oh how wrong we were
I've already given part of the game away above by mentioning that most transactional SQL dialects have exception handling these days. It's all about scale. At the complexity of the above contracts it's almost hard not to write Apollo-style bug-free code, and even more importantly, it's easy for both the developer and the end-user to understand the boundaries of the happy path. Off-ledger exception handling is practical. You try to `Settle` and if it fails with an error message you can make sense of, you try again with a different `Cash` contract.
At Digital Asset, we don't just build the Daml platform, we help our customers build the next generation of financial infrastructure on that platform. In doing so, we have learnt a lot of things about what it means to use smart contracts - Daml smart contracts, specifically - at scale.
Business is full of Exceptions
Bearer instruments like cash are a wonderful example for smart contracts because their rules are so clear-cut and easy to understand. There are virtually no exceptions other than "you don't have the funds" or "you are trying to break the rules".
Most business isn't quite that black and white, nor as predictable. Let's take one of the most well understood examples of smart contract based bearer instruments: [ERC-20 tokens](https://eips.ethereum.org/EIPS/eip-20). The `transfer` method of that standard is allowed to throw exceptions under almost any circumstance. There are even entire standards like [ERC-1404](https://erc1404.org/) dedicated to restricting ERC-20. Maybe an asset transfer that would have worked yesterday fails tomorrow because of new political sanctions.
In short, almost anything can fail in practice. The more complex your system, the more points of failure there are, and the harder it is to understand the happy path.
Minimize Off-Ledger State
The systems that smart contract languages like Daml run on are highly resilient. And the solutions being built in Daml - financial infrastructure, for example - require that resiliency. But resilient smart contract infrastructure is not enough. The robustness and availability of the system must extend to the entire solution. All automation and integration components must be able to deal with faults and outages, too. As soon as you need essential state in integration components, you need to protect and manage that state too, which is expensive.
In pure CQRS architectures you can't get around that. If your application logic is to monitor for events, and based on those events send commands, it needs to keep track of which events have been processed and which ones have not.
Exceptions that get caught off-ledger are just such events. Imagine this scenario:
1. Your integration component receives an event that someone wants to swap some cash.
2. It sends a command to settle the swap.
3. The integration component receives a completion event that says the settlement failed.
4. It sends a new command to try again with a different cash position.
At no point from 1 to 4 has the state of the underlying smart contracts changed. So if there needs to be a failover to a new node or new integration component, the information about where in that process things stand is lost.
It is much easier to build resilient systems with stateless integration components. Stateless in the sense that they rely only on the smart contract state. But that doesn't play well with off-ledger exception handling. The ledger state before and after the exception is identical. Thus to make this work, the exceptions have to be handled on-ledger, at a minimum by recording them.
Off-Ledger Exception Handling does not Scale
When your asset swap goes wrong, you only have so many levers you can pull. You as a developer of the swap, or as a user of the swap or the asset, also understand the rules and can predict transaction validity with great certainty. So you just try again with a different asset that you are pretty sure will work.
Now imagine you are trying to build a collateral optimizer. The transaction it spits out consists of dozens of asset allocations, splits, merges, locks, return requests, etc.
To have any chance of handling the exceptions that could occur here, you need to understand not only the collateral management contracts, but also every dependency of that model, and how it all fits together. Going back to the ERC-20/1404 topic above, you need to understand every exception that any one of the diverse asset contracts could have emitted and how to handle them.
The complexity for the developer of the integration component or the next layer of smart contract functionality grows exponentially with each layer of functionality.
DIY Exception Handling really is Rocket Science
So exceptions need to be handled as low down in the application as possible. You can "re-throw" them, of course, but ideally the number of exceptions that a consumer of smart contract code needs to understand is limited.
And the rocket scientists of old knew how to do this already: Careful precondition checking before any operation that could fail and conditional behaviour if the precondition is violated.
Let's write a `Split` choice with such exception handling for our `Cash` contract.
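Here is a minimal sketch of such a choice. The field names, the error text, and the decision to make the choice `nonconsuming` with an explicit `archive self` are assumptions. I chose the latter so that an invalid request leaves the original position untouched instead of archiving it.

```daml
module CashSplit where

template Cash
  with
    issuer : Party
    owner : Party
    amount : Decimal
  where
    signatory issuer, owner
    ensure amount > 0.0

    -- Returns an error text instead of aborting when the precondition fails.
    nonconsuming choice Split : Either Text (ContractId Cash, ContractId Cash)
      with
        splitAmount : Decimal
      controller owner
      do
        -- Check the precondition first. Without this check, an out-of-range
        -- `splitAmount` would violate `ensure` on one of the new positions
        -- and abort the whole transaction.
        if splitAmount <= 0.0 || splitAmount >= amount
          then return (Left "Split amount is out of range")
          else do
            -- Only now do we change the ledger.
            archive self
            splitCid <- create this with amount = splitAmount
            restCid <- create this with amount = amount - splitAmount
            return (Right (splitCid, restCid))
```

A caller pattern-matches on the result: `Right (splitCid, restCid)` carries the two new positions, `Left err` carries the error text.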
We have prevented aborting due to the `ensure amount > 0.0` clause on `Cash`. Depending on whether `splitAmount` is in the valid range or not, the choice returns two new `Cash` positions or an error text. This can be handled by a calling contract or stateless automation. This can be abstracted nicely using things like monad transformers.
But if we try to make use of this in a slightly more sophisticated `Swap`, we get into trouble:
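The following sketch shows where things go wrong. It assumes a swap in which both sides first split off the amount they owe and then transfer it. The names `buyer`, `seller`, `buyerAmount`, and `sellerAmount` are hypothetical, and `Split` is the error-returning sketch from before, repeated so that the module stands on its own.

```daml
module SwapWithSplit where

template Cash
  with
    issuer : Party
    owner : Party
    amount : Decimal
  where
    signatory issuer, owner
    ensure amount > 0.0

    nonconsuming choice Split : Either Text (ContractId Cash, ContractId Cash)
      with
        splitAmount : Decimal
      controller owner
      do
        if splitAmount <= 0.0 || splitAmount >= amount
          then return (Left "Split amount is out of range")
          else do
            archive self
            splitCid <- create this with amount = splitAmount
            restCid <- create this with amount = amount - splitAmount
            return (Right (splitCid, restCid))

    choice Transfer : ContractId Cash
      with
        newOwner : Party
      controller owner, newOwner
      do
        create this with owner = newOwner

template Swap
  with
    buyer : Party
    seller : Party
    buyerCashCid : ContractId Cash
    buyerAmount : Decimal
    sellerCashCid : ContractId Cash
    sellerAmount : Decimal
  where
    signatory buyer, seller

    nonconsuming choice Settle : Either Text (ContractId Cash, ContractId Cash)
      controller buyer
      do
        buyerSplit <- exercise buyerCashCid Split with splitAmount = buyerAmount
        case buyerSplit of
          Left err ->
            -- Nothing has been written yet, so passing the error on is safe.
            return (Left err)
          Right (buyerPaymentCid, _) -> do
            sellerSplit <- exercise sellerCashCid Split with splitAmount = sellerAmount
            case sellerSplit of
              Left err ->
                -- Trouble: the buyer's position is already split. Returning
                -- `Left` here commits that split without any swap.
                return (Left err)
              Right (sellerPaymentCid, _) -> do
                archive self
                toSeller <- exercise buyerPaymentCid Transfer with newOwner = seller
                toBuyer <- exercise sellerPaymentCid Transfer with newOwner = buyer
                return (Right (toBuyer, toSeller))
```

The first `Left` branch is harmless. The second one is the problem case discussed next.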
If a later step fails, I'm stuck. I can rethrow, but I've already split the buyer's cash. I guess I have to write a `Merge` choice to tidy up the mess I have already created and live with the fact that the ledger will contain a `Split`/`Merge` pair that doesn't actually do anything.
Oh dear. Now I'm really stuck, because that `Merge` can fail, too. To tidy up one exception, I need to perform an action that can itself fail. I cannot rethrow, but have to abort the entire transaction at this point.
The only solution to this is not to write anything before we are sure that everything will go through, in a pattern some call "read before write". That pattern makes sure that the ledger stays clean (i.e. no action and counter-action pairs), and that I don't maneuver myself into an unhandleable dead end like the above.
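In code, the shape of "read before write" could look roughly like this. The sketch is deliberately incomplete: `validateAllTheThings` here is a hypothetical helper that only checks the one failure mode we know about (split amounts out of range), and the `Swap` fields are the same assumed names as before. `Split` and `Transfer` are the plain, aborting variants, since the write phase is only entered once validation has passed.

```daml
module ReadBeforeWrite where

template Cash
  with
    issuer : Party
    owner : Party
    amount : Decimal
  where
    signatory issuer, owner
    ensure amount > 0.0

    -- Plain `Split`: aborts via `ensure` if `splitAmount` is out of range.
    choice Split : (ContractId Cash, ContractId Cash)
      with
        splitAmount : Decimal
      controller owner
      do
        splitCid <- create this with amount = splitAmount
        restCid <- create this with amount = amount - splitAmount
        return (splitCid, restCid)

    choice Transfer : ContractId Cash
      with
        newOwner : Party
      controller owner, newOwner
      do
        create this with owner = newOwner

-- Hypothetical and deliberately incomplete: only the split ranges are checked.
validateAllTheThings : Cash -> Decimal -> Cash -> Decimal -> Either Text ()
validateAllTheThings buyerCash buyerAmount sellerCash sellerAmount
  | buyerAmount <= 0.0 || buyerAmount >= buyerCash.amount =
      Left "Buyer's split amount is out of range"
  | sellerAmount <= 0.0 || sellerAmount >= sellerCash.amount =
      Left "Seller's split amount is out of range"
  | otherwise = Right ()

template Swap
  with
    buyer : Party
    seller : Party
    buyerCashCid : ContractId Cash
    buyerAmount : Decimal
    sellerCashCid : ContractId Cash
    sellerAmount : Decimal
  where
    signatory buyer, seller

    nonconsuming choice Settle : Either Text (ContractId Cash, ContractId Cash)
      controller buyer
      do
        -- Read phase: fetch and validate without changing the ledger.
        buyerCash <- fetch buyerCashCid
        sellerCash <- fetch sellerCashCid
        case validateAllTheThings buyerCash buyerAmount sellerCash sellerAmount of
          Left err -> return (Left err)
          Right () -> do
            -- Write phase: only reached once every known check has passed.
            archive self
            (buyerPaymentCid, _) <- exercise buyerCashCid Split with splitAmount = buyerAmount
            (sellerPaymentCid, _) <- exercise sellerCashCid Split with splitAmount = sellerAmount
            toSeller <- exercise buyerPaymentCid Transfer with newOwner = seller
            toBuyer <- exercise sellerPaymentCid Transfer with newOwner = buyer
            return (Right (toBuyer, toSeller))
```

Returning `Left` from the read phase leaves the ledger exactly as it was, so there is nothing to tidy up. The catch is that the validation has to anticipate every way the write phase could fail.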
Now I can't actually write the `validateAllTheThings` function, because I haven't written a `Transfer` choice that could return an error yet. I don't know all the possible error scenarios. The other problem is that to even get my hands on the input to the transfers, I first have to simulate `Swap`. To scale this, I effectively have to run Daml inside Daml to first simulate what I want to do, catch any errors from the validation, and if there weren't any do the real thing.
It's possible, but thanks to immutable state and side-effects that can't always be rolled back, it may actually be harder to handle exceptions than it was using early low-level programming models.
Daml is a platform for composable multi-party applications at scale. It needs to support our vision for a global economic network of businesses that transact across organisational, regulatory, and technological boundaries.
But without the ability to handle the majority of business exceptions gracefully in the core business logic on-ledger, even single complex applications are hard to build in a robust fashion. And that problem gets exponentially worse as different developers try to compose their smart contracts into networks of applications. Exception handling in smart contracts is a must-have feature.
We've known this truth for a good while - we've been developing enterprise-scale smart contract applications in Daml since 2016. The reason we didn't "just do it" much earlier is that there are some interesting difficulties and tradeoffs to be made with such a feature. Those will be explored in depth in part 2 of this series.