What the Hell is a Monad?

Posted: 2011/09/09
In the programming languages literature, the concept of the monad is widely known and almost as widely misunderstood. Some authors have gone so far as to call monads “the essence of functional programming”, while others seem to think that monads are little more than an especially elaborate kind of syntactic sugar. A consequence of this broad confusion is a lack of clear explanations of what exactly a monad is. Authors seem sharply divided into two camps: experts on monads and category theory, who don’t need to explain anything to other members of their clique, and experts on everything else, who just don’t care much about monads at all. Didactic explanations of monads are notoriously difficult to come by, and where explanations do exist, they are almost always dense with category theory. Category theory’s biggest strength, its abstractness, can also be a severe weakness when it is used in tutorial explanations: something concrete ultimately has to be shown in order to forge an understanding, and it’s hard to find much that’s decidedly concrete in category theory.
What follows is what I personally believe is missing from the literature: a self-contained, intuitive explanation of what monads are and why they can be a useful specification tool. No mathematical background or experience with a specific programming language is presumed, only a basic familiarity with the usual concepts of functional programming – which is what I would tend to expect if you’ve actually chosen to read this article.
The conventional explanation of monads is that they’re used to specify side effects in functional programming languages. This is a correct explanation, but it’s not very informative if you’ve never programmed using monads and know nothing else about them. There are two shortcomings. Firstly, although there are a lot of examples and vague intuitions, there seems to be no single, authoritative definition of what precisely is meant by “side-effect” in the general sense. Secondly, even given a good definition of side-effects, monads are not essentially functional, and could be constructed in any adequately expressive language. It’s true that monads are very useful in the context of functional languages for several reasons, but it’s something of a misconception to suppose that monads only make sense in the context of functional languages, and I suspect this misconception fuels the confused notion that monads are simply convenient syntax for certain kinds of functional idioms.
Effects are closely tied to the notion of referential transparency. By definition, a computation is referentially transparent whenever it can be replaced by its result (as an expressible value) in the source text without affecting the meaning of the program. All terminating computations in a pure functional language are referentially transparent, as are all primitive expressions in C that do not involve pointers. In general, a computation is side-effecting or effectful whenever it is not referentially transparent, so that its meaning in the context of the program can only be determined by actually performing the computation. It should be noted that the relatively academic terms “referential transparency” and “effectful computation” are very closely related to the notion of orthogonality as familiar to experienced programmers and software engineers. While these terms don’t quite coincide (mostly due to the fact that the former have very precise technical meanings, while the latter is a much higher-level design concept), monadic style is a powerful tool for enforcing orthogonality by making syntactically and semantically explicit the separation between the different purposes and consequences of different computations.
Good design dictates that a distinct block of code should do just one thing, and that it should be clear what that one thing is. Side-effects are anything other than the returned result that must be considered when running that code. The virtue of monads as a programming tool lies in their ability to clearly distinguish effects, and thus to factor computations into actions that very clearly do one thing or another.
Even in a pure lambda calculus with a strong type system, anomalous things can happen. Take as an example one of the simplest and most familiar kinds of anomalous conditions, the old divide-by-zero error. Divide-by-zero is a possibility in any language with the usual complement of arithmetic operations, and since the result of a zero-division is undefined, any subsequent computation that uses that result is also undefined. For this reason, processors are usually designed to throw a hardware exception in the event of a divide-by-zero. However, it’s also possible to write programs that internally catch and throw their own divide-by-zero faults and thus circumvent the low-level code that would trigger a hardware exception. In the hardware case, some line in the circuit checks for a zero in one of the operands of a divide instruction and sets an exception if an attempt is made to execute the instruction. In the software case, the program itself examines the data passed to a routine where a division may take place, and returns a special error value if a zero is passed in at the wrong place. Intuitively, both software and hardware divide-by-zero exceptions have something in common: they are both consequences of running a computation, but are not themselves the defined result of that computation. In both cases, the computation produces something apart from its defined output value.
An easy way to modify functional programs to handle anomalous conditions, e.g. divide by zero, is to lift susceptible functions into a special kind of sum type:
data Maybe a = Just a | Nothing
where the constructor Just specifies an ordinary value, and Nothing stands for the distinguished error condition. Thus, if f :: Int -> Int, and f performs a division with its argument, f can be modified to something equivalent to:
f' x = if x == 0 then Nothing else Just (f x)
The new function f' knows that f will create an anomalous condition if passed a zero, and so circumvents f entirely if this condition is detected. However, this doesn’t quite solve the problem of specifying how a program should behave in the presence of an anomalous condition: the value resulting from an anomalous condition will have to propagate upward through the call tree. While f' can screen f from anomalous inputs and set an error condition, this error condition will have to be passed to whatever function calls f', and so on, possibly all the way up to the top-level call. Any or all of these callers may in turn have to handle the error value that originates with f'.
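To make this concrete, suppose (as a hypothetical instance; the constant 100 is arbitrary) that f divides a fixed constant by its argument:

f :: Int -> Int
f x = 100 `div` x

f' :: Int -> Maybe Int
f' x = if x == 0 then Nothing else Just (f x)

Then f' 5 evaluates to Just 20, while f' 0 evaluates to Nothing rather than producing an undefined result.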
If it is not quite clear that other changes will be necessary throughout the program, consider what the type judgments throughout the program will look like as a result of the change from f to f'. For example, if
g :: Int -> Int
then for any value v,
g (f v)
is well-typed, but
g (f' v)
is not well-typed, as g is being passed the wrong type of argument.
The difficulty here is that f may do more than one thing; besides doing some arithmetic, f may also cause an error. Although f can be changed into f', the change will necessitate other changes throughout the program to reflect the possible propagation of the effects of the error state. What we want is a systematic way to map a “pure”, i.e. non-effectful, program (one that uses f) into one that can account for a divide-by-zero error (one that uses f') without a lot of extra fuss.
The first thing such a transformation will need is some way of distinguishing computations causing or depending upon effects from computations independent of effects, since only some parts of the program will need to be changed. The data type Maybe provides exactly this with its Just constructor; Just indicates that its argument can be considered as-is because it does not set the error condition. Conversely, anything in Maybe that is not wrapped by Just must cause the error condition. The distinction between effecting and non-effecting computations is thus present in the discrimination between the constructors of the sum type Maybe.
The next thing the transformation will need is some way of specifying how effects propagate through program execution. Recall that this is important because the consequences of an error condition will not necessarily be localized to the function in which it occurs. For these purposes, let’s adopt the simple convention that the error condition is unrecoverable, i.e. that once it has been set, no other function can unset it. Since we have already accepted that Just indicates that a result has been produced free from error conditions, it is relatively straightforward to transform individual functions, using the transformation of f to f' as above. If g is any function of suitable type and v any value of suitable type, and assuming a call-by-value evaluation strategy, the following transformation should then be performed:
⌈ g (f v) ⌉ = ⌈ f v ⌉ >>= \v' -> ⌈ g v' ⌉
where the infix operator >>= is defined as:
(>>=) :: Maybe a -> (a -> Maybe b) -> Maybe b
x >>= f =
  case x of
    Just u  -> f u
    Nothing -> Nothing
Take a moment to really understand these definitions. Since evaluation is performed in a call-by-value fashion, f v should be evaluated before applying g, hence the placement of f v on the left-hand side of the >>= operator. The >>= operator itself will check the result of f v before proceeding with execution, propagating the error condition forward if it is present, and passing the value through normally if it is not.
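As a concrete illustration, take the hypothetical f' from above together with a second, error-free step (double is my own example name):

double :: Int -> Maybe Int
double x = Just (x * 2)

-- f' 5 >>= double   evaluates to   Just 40
-- f' 0 >>= double   evaluates to   Nothing   (the error propagates past double)

Note that double never needs to inspect the error condition itself; >>= handles the propagation.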
The very fastidious reader (or just one already accustomed to monads) may have noted that the transformation ⌈-⌉ above also could have been written:
⌈ g (f v) ⌉ = ⌈ f v ⌉ >>= g
This is fundamentally the same, however, since g is eta-equivalent to \v' -> g v'. Including the extra parameter just happens to be a habit of my own, as its presence emphasizes the passing of data between actions, and because it may often be useful to have it lexically bound for use in other actions in scope.
What the transformation above provides is a way to add the effect of an error condition to functions that should be considered susceptible, together with a systematic (albeit quite simple) way of propagating any effects through the program without disturbing non-effecting results. A monad is simply a concept of effectfulness (in this case, error conditions) together with a way to distinguish non-effecting computations (in this case the constructors of Maybe) and a way of sequentially propagating effects forward (in this case, the >>= operator). In general, how exactly all of these things are defined will depend upon what notion of effectfulness is chosen. The effect of exceptional conditions embodied by the Maybe monad is one, but it is not the only one, as we shall see.
Many programmers migrating from an imperative language to their first functional language are perplexed and frustrated by the absence of anything like assignments. Imperative programming makes very liberal use of state transitions, with very large and complex states changing in complicated ways each time a statement is executed. This is useful for some applications, but not so helpful with others. Good functional programming style tries to isolate uses of state so that possible states are relatively small and sparse, and their transitions are very clear. While ML and Scheme provide reference types that do allow for reading and writing mutable storage locations, accepted practice makes sparse use of them. Haskell, for its part, does not incorporate reference types into the pure language at all. While there are good reasons for this aversion, it is nonetheless sometimes convenient to have a state that can be read and written as execution proceeds. For example, it is often quite useful to incorporate a counter into your program. While functional style does not allow for direct assignments, its usual variable-binding conventions can nonetheless be harnessed to accomplish the same end.
The naive way to implement a mutable state without references is simply to thread an additional argument through all functions in your program, while making sure that each function returns the new value of the state in addition to its usual return value. While this should work in principle, it creates a real mess in the code, and it is very easy to write a function call that is passed the wrong state argument, which will lead to all kinds of subtle bugs. While you’re welcome (perhaps even encouraged) to go and try writing a non-trivial program this way, you won’t enjoy the experience. As is usually the case in programming, a little bit of careful thought about what you’re actually trying to do will lead to a much better solution.
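To see the mess for yourself, here is a sketch of the naive approach with two hypothetical processing steps; note how easy it would be to pass c0 instead of c1 in the second call:

stepA :: Int -> Int -> (Int, Int)   -- value and counter in; value and counter out
stepA x c = (x * 2, c + 1)

stepB :: Int -> Int -> (Int, Int)
stepB x c = (x + 3, c + 1)

pipeline :: Int -> Int -> (Int, Int)
pipeline x c0 =
  let (y, c1) = stepA x c0
      (z, c2) = stepB y c1   -- must remember to thread c1 here, not c0
  in (z, c2)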
Suppose we want to implement a program that contains a global counter of type Int. The intuition that state should be threaded through all function calls and returns is basically correct. However, a close inspection will reveal that there is a pattern to how functions are augmented to receive and return state. Firstly, the function needs to be passed an extra argument that binds the current state. Secondly, the function needs to return the new value of the state together with its result value. Although not at first obvious, this can be neatly summarized in a single binary operation on a counter-transition datatype:
type Counter a = Int -> (a, Int)

(>>=) :: Counter a -> (a -> Counter b) -> Counter b
s >>= f = \c ->
  let (v, c') = s c
  in  (f v) c'
(This is a different definition than the >>= presented for Maybe above. The reuse of the notation is deliberate, but don’t be confused.)
The datatype Counter reflects a lifting of “pure” functional values into a setting where they occur together with transitions on the counter state. That is, any value may come with a side-effect on the state. The >>= operator uses this datatype to formalize the procedure of passing in the state argument (from a state transition) to a function and returning a new one with its result. This is actually much less complicated than it appears. Firstly, the lambda that binds the parameter c specifies the extra argument that must be passed in as the current state. Next, the state transition s is applied to that state c to obtain a value and a new state. If f is a side-effecting function, i.e. one that produces not just a value but also a new counter state, then f is passed the value v resulting from the transition s. Closely reading the types, it should be clear that the result of applying f to the value v has type Counter b, which is identical to Int -> (b, Int). Thus, (f v) can be applied to the state c' resulting from the transition s to obtain a tuple consisting of a value and a new state. Take a moment to think through all of this and understand why it should be.
In the case of Maybe, we saw why it was important to specify how effects propagate through the program. The very essence of a mutable state is that effects must propagate globally in the background. The >>= operator specifies how exactly this should work. However, notice that Counter is not a simple sum-type like Maybe; discriminating between side-effecting and non-effecting computations is not as easy as simply discriminating between constructors, as with Just and Nothing. There is, nonetheless, an easy way to do so. Firstly, we must specify which computations have no effect. Since there is no Just constructor, let’s introduce a unary operator return that simply injects an arbitrary value into the stream of state transitions without altering the state. This is easy:
return x = \n -> (x, n)
The transition return is trivial; the state is unaffected. Meanwhile, an arbitrary pure value can be placed in the value-slot of the tuple.
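For instance, applying the injected action to a concrete counter value:

-- (return 5) 7   evaluates to   (5, 7)   -- the value 5 is injected; the counter 7 passes through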
This leaves unanswered the question of what exactly should be done to specify side-effecting computations. However, this matter is open to much more interpretation. While there will always be some need for computations that do just one thing, i.e. that are functionally pure, side-effects may take on a wide array of forms depending upon what’s necessary and convenient. Thus, the actual application in question needs to be considered.
We’ll likely need to read the current value of the counter, which is easy to construct as:
get = \n -> (n, n)
This little function demonstrates an important but somewhat subtle feature of the sequencing operation >>=; the value of state remains in the background unless it is deliberately moved to the foreground as a value. The function get does exactly this, allowing for the value of the counter to be passed to another possibly side-effecting function using >>=.
We’ll certainly want to change the value of the counter as well, otherwise nothing can actually be counted. One way to define this behavior is:
put v = \_ -> ((), v)
(Note that I use Haskell’s notation for a function with a dummy parameter, as '_'; this is only to stress that changing the state in this coarse way takes no account of the preceding state.)
The function put simply replaces the current state value with a new state. While this may be a useful general-purpose operation, counters are much more often incremented or decremented. As such, we might choose to use inc and dec functions instead of a general-purpose put:
inc :: Counter ()
inc = \n -> ((), n + 1)

dec :: Counter ()
dec = \n -> ((), n - 1)
To illustrate how >>= works, notice that these also could have been defined using get and put:
inc' :: Counter ()
inc' = get >>= \n -> put (n + 1)

dec' :: Counter ()
dec' = get >>= \n -> put (n - 1)
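Unfolding the definitions given above confirms that inc' coincides with inc (a sketch; dec' works out the same way):

--   inc' = get >>= \n -> put (n + 1)
--        = \c -> let (v, c') = get c in (put (v + 1)) c'   -- definition of >>=
--        = \c -> (put (c + 1)) c                           -- since get c = (c, c)
--        = \c -> ((), c + 1)                               -- definition of put
--        = inc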
I should emphasize, again, that these are only possible features to throw in; they are chosen and defined as may or may not be expedient. Only return will be necessary in order to ground the distinction between side-effecting and non-effecting computations.
What we now have is a nice way of threading a counter through a program without passing around a lot of extra stuff. Any computations independent of the counter must simply be wrapped by a return in order to be included in the stream of state transitions. What may not, however, be obvious is what the initial value of the counter will be, or how the final result will be obtained. The attentive reader may have noticed that the actions above specify what will happen, but do not actually cause it to happen. The result of chaining many state transitions with >>=, after all, will simply be an action of type Counter a, which is itself a function. In order to actually run the effecting computations, the function must be applied to an initial counter value. This is easy to specify:
run x n = x n
which will return a value and a final counter state. Note, however, that if the value in the left-hand component of the tuple is removed and used in some other, arbitrary functional expression, no information about the effects in x would be propagated. Running the action is necessary to obtain a result, but also means that pure values may escape from the effectful setting in which they were produced.
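Putting the pieces together, here is a hypothetical end-to-end sketch; countZeros is my own example, and Prelude’s versions of >>= and return are hidden so that the local definitions apply:

import Prelude hiding ((>>=), return)

type Counter a = Int -> (a, Int)

(>>=) :: Counter a -> (a -> Counter b) -> Counter b
s >>= f = \c -> let (v, c') = s c in (f v) c'

return :: a -> Counter a
return x = \n -> (x, n)

get :: Counter Int
get = \n -> (n, n)

inc :: Counter ()
inc = \n -> ((), n + 1)

run :: Counter a -> Int -> (a, Int)
run x n = x n

-- Count the zeros in a list; the counter is threaded invisibly by >>=.
countZeros :: [Int] -> Counter Int
countZeros []     = get
countZeros (x:xs) = (if x == 0 then inc else return ()) >>= \_ -> countZeros xs

-- run (countZeros [0, 1, 0, 2, 0]) 0   evaluates to   (3, 3)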
What we’ve just seen here is a special case of the common state monad. The state monad is virtually identical to the Counter construction above, except that it parameterizes over the type of the state, as:
type State s a = s -> (a, s)
Observing that Counter a is identical to State Int a, it should be quite easy to see how all of the constructions above can then be generalized.
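Sketching that generalization explicitly (again assuming Prelude’s >>= and return are hidden):

(>>=) :: State s a -> (a -> State s b) -> State s b
m >>= f = \st -> let (v, st') = m st in (f v) st'

return :: a -> State s a
return x = \st -> (x, st)

get :: State s s
get = \st -> (st, st)

put :: s -> State s ()
put v = \_ -> ((), v)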
Thus far, we’ve proceeded using only examples. An effort has been made to allude to some of the patterns and important features of a monad without having to state them in an artificial way. The motivation behind this decision is that, while the formal definition of a monad has deep consequences for its behavior and usefulness, its relation to any sort of concrete meaning or usefulness is quite obscure. That said, it’s important to understand this definition. Hopefully, with the two examples above, the utility of this definition will become clear.
Formally, a monad consists of three things: a functor m on types, a return :: a -> m a (or “unit”) operator, and a >>= :: m a -> (a -> m b) -> m b (or “bind”) operator. The operators return and >>= must also satisfy the following laws:
return v >>= f   =   f v                        (left-unit)
x >>= return     =   x                          (right-unit)
(x >>= f) >>= g  =   x >>= (\v -> f v >>= g)    (associativity)
Any construction with those three pieces, satisfying those three conditions, is a monad. While this seems highly abstract, the abstraction is of a good kind; it provides a very general framework for describing a very wide range of computational behaviors. To see how exactly this works, it suffices to return to the examples above, and see how they satisfy these definitions.
The purpose of the functor m is to relate pure functional types to some suitable counterpart that reflects the possibility of side-effects. In the case of Maybe, the functor m takes an ordinary type a to a sum-type a + Nothing. In the case of Counter (or State Int if you prefer), the functor takes a type a to the state-transition type Int -> (a, Int).
The purpose of return is to specify computations that are effect-free; this allows ordinary functional computations to interface with the side-effecting constructions. The return operation for Counter is just what was specified above. For Maybe, it is even simpler; the constructor Just behaves in exactly the right way. That is, for Maybe, return x = Just x.
The purpose of >>= is to sequence computations in a way that specifies how effects propagate. Definitions for >>= in both Maybe and Counter were given above; the reader is encouraged to go back and momentarily review them.
What is most important about all these parts, however, is how they fit together. This aspect is what makes constructing a new monad relatively difficult; there are many possible ways that m, return and >>= could be defined, but it is how exactly they relate that makes them into a monad. The nature of this relationship is expressed in the three monad laws above. These laws, however, are not purely for the sake of formalism, as they describe exactly the same intuitions we used to construct Maybe and Counter above. A construction that did not satisfy them would, in some concrete sense, fail to realize the necessary notions of the distinction between side-effecting and non-effecting computations, and the expected way in which effects propagate.
First let’s consider the two unit laws. When considering return in the examples above, it was important that return clearly distinguish computations with no effects. Since return v has no effects, there should be nothing to propagate forward when return v is sequenced on the left-hand side of a >>=. Thus, it should be possible simply to cancel out the trivial left-hand side of such a >>=, and consider only the value to be passed to f. This behavior is what the left-unit law expresses. Similarly, since return has no side-effects, it should cancel on the right-hand-side of >>= as well, since it has no effects to add to whatever action precedes it.
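Both unit laws can be checked directly against the Maybe definitions; here is the left-unit case as a sketch (the right-unit case follows by a similar case analysis on x):

--   return v >>= f
-- = Just v >>= f   -- return is Just for Maybe
-- = f v            -- the Just case of >>=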
Note, however, that not all returns will cancel as the result of a right-unit. Consider, for example:
get >>= \v -> return (v + 1)
Although return appears to the right of the >>=, the function that properly serves as the right-hand operand to >>= is not simply return; it is (\v -> return (v + 1)), which is not the same function. Thus the above cannot be simplified to just get. The reader is encouraged to work this through to see why.
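For the reader who wants to check, unfolding the definitions shows what this action actually is:

--   get >>= \v -> return (v + 1)
-- = \n -> let (v, n') = get n in (return (v + 1)) n'   -- definition of >>=
-- = \n -> (return (n + 1)) n                           -- since get n = (n, n)
-- = \n -> (n + 1, n)                                   -- definition of return

The result reads the counter and produces its successor while leaving the state untouched, which is plainly not the same action as get = \n -> (n, n).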
The important point, however, is just that the two unit laws formalize the intuition that return must induce no side-effects in terms of the effect-sequencing defined by >>=. This feature is necessary if a clean distinction between side-effecting and non-effecting computations is to be possible. If return couldn’t be erased from a sequence of >>=-chained actions, that would necessarily mean that it had some effect that had to be carried forward, or that depended upon a previous side-effect.
The law of associativity of >>= is less clear, but no less important. Associativity is essential to any meaningful notion of sequentiality, in which one action can be said to precede or follow another in a well-defined way. While the relation between the two may not seem quite clear, recall that parentheses indicate order of operation, and >>= is a binary operator that produces a new action from two existing actions. Writing out a sequence of binds specifies a sequential composition of side-effecting actions. If one pair of actions is parenthesized, then they must be composed first. If the ordering of these compositions matters in addition to the ordering of actions specified by the left-to-right order of the binds themselves, i.e. if associativity does not hold, then >>= would have some effect on its arguments other than simply propagating a single effect forward through execution.
To see why this is the case, consider the Maybe monad above, and suppose that u is an action in Maybe such that:
(u >>= \x -> Nothing) >>= \y -> u   ≠   u >>= (\x -> Nothing >>= \y -> u)
Reducing the parenthesized binds on the LHS and RHS, this means that:
Nothing >>= \y -> u   ≠   u >>= \x -> Nothing
which would mean that following a computation with an uncaught sticky exception has a different effect than preceding that computation with a sticky exception. Such a situation does not really make sense. A consequence of this kind of behavior would be that >>= could do something more than just propagate effects forward to the next action; it would also be free to take into account arbitrary other features of successive computations in deciding the result of composing its two arguments. In this case, >>= would not merely carry the effect of an exceptional condition from one action to another; it would have some mysterious, ill-defined power to anticipate an exception in its RHS argument and catch it before it had even happened. What intuition that should correspond to, I am not sure, but it has nothing to do with side-effects as they are generally understood.
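For contrast, a short case analysis on x shows that the Maybe operator defined above really does satisfy the law:

--   If x = Nothing:
--     (Nothing >>= f) >>= g          =  Nothing >>= g  =  Nothing
--     Nothing >>= (\v -> f v >>= g)  =  Nothing
--   If x = Just u:
--     (Just u >>= f) >>= g           =  f u >>= g
--     Just u >>= (\v -> f v >>= g)   =  f u >>= g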
What this all comes down to is that the formal definition of a monad specifies a way of lifting a computation into a setting where side-effects may occur at any point in execution, together with well-defined notions of what it means for a computation to occur with no side-effect, and a well-defined notion of what it means for effects to propagate from one action to the next. Some of the difficulty of understanding what exactly a monad is may come from the fact that it captures not just a single data structure or type, but a particular relationship between a data structure, a distinction between pure and side-effecting computations, and a concept of sequential composition. The goal here has been to illustrate that this construction is not artificial, but is rooted in fairly concrete and familiar computer science concepts that it just happens to articulate in a very powerful and flexible way.
The above presentation was driven by examples, but restricted itself to only two. The reason is that there are quite a few other monads, but most of them rely upon powerful notions with a steep learning curve. The best example of this is the continuation monad:
return x = \k -> k x

x >>= f = \k -> x (\y -> (f y) k)
which makes control-flow itself an effect, so that any computation may have the effect of radically changing the course of execution. As it happens, continuations provide control flow that is so flexible (perhaps too flexible) that it surpasses even the expressiveness of unrestricted GOTO statements. One way to view continuations, informally, is as a kind of state monad in which the type of the state is executable code itself.
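As a small taste of what this means, here is a hypothetical sketch (abort and example are my own names, not standard library functions) in which a computation discards its continuation entirely:

import Prelude hiding ((>>=), return)

type Cont r a = (a -> r) -> r

return :: a -> Cont r a
return x = \k -> k x

(>>=) :: Cont r a -> (a -> Cont r b) -> Cont r b
x >>= f = \k -> x (\y -> (f y) k)

-- abort ignores its continuation and produces a final answer immediately.
abort :: r -> Cont r a
abort r = \_ -> r

example :: Cont Int Int
example = return 1 >>= \x ->
          abort 42 >>= \y ->
          return (x + y)   -- never reached: abort threw the continuation away

-- example id   evaluates to   42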
A number of other useful monads include: the resumption monad, a restricted form of continuation; the environment monad, a formalization of the idea of lexical scoping of variables that happens to be applicable to other kinds of nested-context behaviors; and the parser monad, an idea popularized by Graham Hutton and Erik Meijer that later found its way into the powerful and quite useful Parsec library for constructing parsers in Haskell.
The really sharp reader may also have wondered whether monads can be combined with one another. As it happens, they can. A down-to-earth explanation of how this happens or what it means is well beyond the scope of the current presentation, but the interested reader should consult Haskell’s monad transformer libraries or Sheng Liang’s “Monad Transformers and Modular Interpreters” for a detailed treatment. In short, monad transformers act to layer different kinds of effects over one another, which happens to be quite powerful, though challenging to get one’s head around at first.
The main object, however, has been to present the concept of a monad in a motivated way that presumes minimal prerequisites. Monads turn out to be quite useful in functional programming practice, but their adoption is arguably hampered by the difficulty of understanding what exactly monads are and how they work. Many practical explanations correctly detail how to use monads, but give little notion of why monads are a suitable structuring technique, or why they should behave as they do, while theoretical explanations require so much background as to be over the heads of most non-specialists. It’s my sincere hope that the reader has found this explanation to lie comfortably between the two, providing just enough examples to remain concrete but just enough generality to make clear the motivation behind the connections made. Thanks for reading.