Willow Data Model

Status: Final

In this document, we define the core data model of Willow.

Willow is a system for giving meaningful, hierarchical names to arbitrary sequences of bytes (called payloads), not unlike a filesystem. For example, you might give the name blogidea1 to the bytestring Dear reader, I've got a great idea.

You also give the name blogidea2 to the bytestring (watch this space).

A (one-dimensional) list containing the two paths "blog/idea/1" and "blog/idea/2", with a stylised file next to each path. Idea 1 shows a lightbulb, idea 2 shows a deeply smug expression.

A little later you overwrite the existing entry at path blogidea2 with I've made a mistake. Willow tracks the timestamp of each assignment, and the new entry overwrites the old one.

The same visualization of paths as before, but now with second dimension, a time axis. The smug-faced file disappears in the first time step, being replaced by a sweating face at a second timestep.

That night you decide it would be best if everyone forgot about the whole thing. By writing a new entry at blogidea, our previous entries are deleted. Think of it as overwriting a directory in a file system with an empty file. We call this mechanism prefix pruning.

The same visualization as before, but both paths and files got deleted by adding the third path "blog/idea" with a stylised file whistling in a totally inconspicuous way.
The entries prefixed by blogidea are deleted by a newer entry at that prefix.

Things would be rather chaotic if everyone wrote to the same blog. Instead, entries live in separate subspaces — intuitively, each user writes to their own, separate universe of data. Willow allows for various ways of controlling who gets to write to which subspace, from simple per-user access control to sophisticated capability systems.

Stylised files with friendly icons arranged in a now three-dimensional space. Adding to the path and time dimensions of the preceeding drawings, a depth dimension shows three different people to signify different subspaces. They look happy, one waves to the viewer, good vibes all around.
The three dimensions of Willow’s data model: paths, timestamps, and subspaces.

Willow further allows the aggregation of subspaces into completely independent namespaces. Data from a public wiki should live in a separate namespace than data from a photo-sharing application for my family. Some namespaces should allow anyone to set up subspaces within them, others might require authorisation from a trusted manager. Willow offers a flexible mechanism for using different policies on a per-namespace basis.

Three simplified three-dimensional visualisations of namespaces, each shaded in a different color. One is labeled "family", one "wiki", and one "project".
Three completely independent namespaces.

This constitutes a full overview of the data model of Willow. Applications read and write payloads from and to subspaces, addressing via hierarchical paths. Willow tracks timestamps of write operations, newer writes replace older writes in the manner of a traditional file system. These data collections live in namespaces; read and write access to both namespaces and subspaces can be controlled through a variety of policies.

Now we can almost delve into the precise definition of these concepts.

Parameters

Some questions in protocol design have no clear-cut answer. Should namespaces be identified via human-readable strings, or via the public keys of some digital signature scheme? That depends entirely on the use-case. To sidestep such questions, the Willow data model is generic over certain choices of parameters. You can instantiate Willow to use strings as the identifiers of namespaces, or you could have it use 256 bit integers, or urls, or iris scans, etc.

This makes Willow a higher-order protocol: you supply a set of specific choices for its parameters, and in return you get a concrete protocol that you can then use. If different systems instantiate Willow with non-equal parameters, the results will not be interoperable, even though both systems use Willow.

We give precise semantics to these parameters in the spec proper, the list need not fully make sense on the first read-through.An instantiation of Willow must define concrete choices of the following parameters:

Concepts

Willow can store arbitrary bytestrings of at most bytes. We call such a bytestring a Payload.

A Path is a sequence of at most max_component_count many bytestrings, each of at most max_component_length bytes, and whose total number of bytes is at most max_path_length. The bytestrings that make up a Path are called its Components.

A Timestamp is a 64-bit unsigned integer, that is, a natural number between zero (inclusive) and 2^64 - 1 (exclusive). Timestamps are to be interpreted as a time in microseconds since the Unix epoch.

The metadata associated with each Payload is called an Entry:

The metadata for storing a Payload.
struct Entry
The identifier of the namespace to which the Entry belongs.
The identifier of the subspace to which the Entry belongs.
The Path to which the Entry was written.
Wall-clock timestamps may come as a surprise. We are cognisant of their limitations, and use them anyway. To learn why, please see Timestamps, really?
The claimed creation time of the Entry.
The length of the Payload in bytes.
The result of applying hash_payload to the Payload.

A PossiblyAuthorisedEntry is a pair of an Entry and an AuthorisationToken. An AuthorisedEntry is a PossiblyAuthorisedEntry for which is_authorised_write returns true.

a is a prefix of a and of ab, but not of ab.A Path s is a prefix of a Path t if the first Components of t are exactly the Components of s.

We can now formally define which Entries overwrite each other and which can coexist. An Entry e1 is newer than another Entry e2 if

A store is a set of AuthorisedEntries such that

When two peers connect and wish to update each other, they compute the joins of all their stores with equal NamespaceIds. Doing so efficiently can be quite challenging, we recommend our Willow General Purpose Sync protocol.Formally, adding a new Entry to a store consists of computing the join of the original store and a singleton store containing only the new Entry.The join of two stores that store Entries of the same namespace_id is the store obtained as follows:

A namespace is the join over all11No matter in which order and groupings the stores are joined the result is always the same. Stores form a join semi-lattice (also known as a state-based CRDT) under the join operation. stores with Entries of a given NamespaceId. Note that this concept only makes sense as an abstract notion, since no participant in a distributed system can ever be certain that it has (up-to-date) information about all existing stores. A subspace is the set of all Entries of a given SubspaceId in a given namespace.

Further Reading

The Willow data model stays fairly compact by deliberately sidestepping some rather important questions. In this section, we point to our answers for the most important ones.

How can we precisely delimit meaningful groups of Entries, for example, all recipes that Alex posted on their blog in the past three months? Grouping Entries always incurs a tradeoff between expressivity (which sets of Entries can be characterised) and efficiency (how quickly a database can retrieve all its Entries of an arbitrary grouping). We present a carefully crafted selection of ways of grouping Entries here.

How should we encode the concepts of Willow for storage or network transmission? Due to the parameterised nature of Willow, there can be no overarching answer, but we cover some recurring aspects of the question here.

How should we select the AuthorisationToken and is_authorised_write parameters? Different deployments of Willow will have different needs. We provide Meadowcap, a capability-based solution that should be suitable for most use-cases.

How do we efficiently and securely compute joins over a network to synchronise data between peers? Again, different settings require different answers, but we provide the Willow General Purpose Sync protocol as a well-engineered, privacy-preserving solution that should be applicable to a wide range of scenarios.

How can we encrypt Entries while retaining the semantics of the original, unencrypted data? This question lies at the heart of end-to-end encryption for Willow, and we discuss our findings here.

How can a database provide efficient access to Entries? We give an introduction to the types of queries that a data store for Willow should support, and present some data structures for supporting them efficiently here.

A Willow emblem: a stylised drawing of a Willow’s branch tipping into a water surface, next to a hand-lettered display of the word "Willow".