24 December 2018

What’s In A Datom?

Abstract
Datomic adopted the datom as the fundamental unit of data. Playing with this notion, we observe that different communication contexts call for slightly different types of datoms.

Recall the humble datom.

[e a v tx added?]

;; e - entity
;; a - attribute
;; v - value
;; tx - transaction id
;; added? - flag indicating addition / retraction

Establishing a denotational semantic domain like that is great fun, because it invites us to look at each constituent individually and to consider what other things could reasonably take its place, and what the resulting thing would mean.

Entities

According to the Datomic glossary, e is “the first component of a datom, specifying who or what the datom is about”. Within a system, identifying entities by positive integers (eids) shouldn’t usually leave much to be desired.

The question of what to put in the e-slot becomes more interesting once we consider communication across system boundaries. Separate systems might not use the same identification scheme. Even if they do, they need to coordinate the assignment of identifiers so as to avoid collisions.

A common example of this arises in communication between a web application and a server. Here, negative eids might indicate entities that have been created on the client but are not yet known to the server or other clients. Alternatively, clients might use a UUID scheme to avoid coordination entirely.

We can therefore add the new shape [uuid a v tx added?] to our collection, for use in communication between separate eid domains.
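As a rough illustration of what crossing eid domains might involve, the sketch below rewrites incoming [uuid a v tx added?] datoms into locally identified ones. The names resolve-eid, import-datom and empty-mapping are hypothetical and not part of Datomic; the allocation scheme (a simple counter) is an assumption made for brevity.

```clojure
;; Hypothetical sketch: none of these names exist in Datomic.
;; A mapping tracks which uuids are already known locally and which
;; eid to hand out next.
(def empty-mapping {:uuid->eid {} :next-eid 1})

(defn resolve-eid
  "Returns [mapping' eid], allocating a fresh local eid for an unseen uuid."
  [{:keys [uuid->eid next-eid] :as mapping} uuid]
  (if-let [eid (get uuid->eid uuid)]
    [mapping eid]
    [(-> mapping
         (assoc-in [:uuid->eid uuid] next-eid)
         (update :next-eid inc))
     next-eid]))

(defn import-datom
  "Rewrites [uuid a v tx added?] into [e a v tx added?] for the local domain."
  [mapping [uuid a v tx added?]]
  (let [[mapping' eid] (resolve-eid mapping uuid)]
    [mapping' [eid a v tx added?]]))

;; Two datoms about the same uuid end up on the same local eid.
(let [u       (java.util.UUID/randomUUID)
      [m1 d1] (import-datom empty-mapping [u :person/name "Ada" 1000 true])
      [_  d2] (import-datom m1 [u :person/age 36 1000 true])]
  [d1 d2])
;; => [[1 :person/name "Ada" 1000 true] [1 :person/age 36 1000 true]]
```

The mapping is threaded through explicitly, so only the receiving system ever assigns eids and the sender needs no knowledge of them.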

Attributes and Values

At least in the Clojure world, strong, global names (in the form of fully qualified keywords) are found in the a-slot. This should be considered a great blessing and display of wisdom and kindness. It will take a much less biased mind to even consider other things to fill their place.

Similarly, values (numbers, strings, booleans, maybe instants, etc…) are well understood and liked. We do not mess around with those.

Time

To Clojure’s and Datomic’s eternal credit, immutable values and reified, logical system time are well established in our community. It took an outsider to teach me that timestamps can be so much more still. In particular, timestamps most certainly don’t have to be scalars. Non-scalar timestamps allow us to talk about multiple axes of time, manage speculative multi-user computations, and work with heterogeneous data sources.

We will therefore write the more general t for timestamp, when talking about datoms.
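To make the idea of non-scalar timestamps a little more concrete, here is a minimal sketch that assumes a timestamp is a vector with one integer per time axis, ordered by the product partial order. Both the encoding and the names t<=, comparable? and join are illustrative assumptions, not taken from Datomic or any particular system.

```clojure
;; Assumption: a timestamp is a vector of integers, one per time axis,
;; e.g. [system-time valid-time]. This encoding is illustrative only.

(defn t<=
  "Product partial order: t1 precedes t2 iff it does so on every axis."
  [t1 t2]
  (every? true? (map <= t1 t2)))

(defn comparable?
  "True if the two timestamps are ordered one way or the other."
  [t1 t2]
  (or (t<= t1 t2) (t<= t2 t1)))

(defn join
  "Least upper bound: the earliest timestamp that is at or after both inputs."
  [t1 t2]
  (mapv max t1 t2))

(t<= [1 5] [2 7])         ;; => true
(comparable? [1 7] [2 5]) ;; => false, neither happened before the other
(join [1 7] [2 5])        ;; => [2 7]
```

Unlike scalar transaction ids, such timestamps are only partially ordered, which is what leaves room for several axes of time to advance independently.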

Multiplicity

Datomic has set semantics. Consequently, datoms can only have one out of two possible multiplicities: 0 and 1. At any given logical point in time, an additive datom allows us to go from 0 to 1, a retractive one allows us to go from 1 to 0. The added?-slot of a datom allows us to indicate whether it is meant to be additive or retractive.

Datomic comes with two transaction functions to create additive and retractive datoms respectively: :db/add and :db/retract. If we allow for a slight reformulation here, we could imagine the group (#{0 1} add) governing the addition and retraction of datoms under set semantics:

;; sets
'(#{0 1} add)

(= (add 0 0)           0)
(= (add 0 1) (add 1 0) 1)
(= (add 1 1)           0)

With the operation written like this, we are naturally led to ask whether other groups could reasonably take its place.

;; multisets
'(integer? +)

(= (add 0 0)             0)
(= (add 0 1)  (add 1 0)  1)
(= (add 1 1)             2)
(= (add -1 1) (add 1 -1) 0)
;; ...

;; probabilities
'([0..1] ???)

Again, we might look at other systems for some inspiration. In any case, we might want to write the more general diff (difference, as in change in multiplicity) in place of added? when thinking about what a datom can be.
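One way to see what swapping the group means in practice is to fold a sequence of [e a v t diff] datoms into multiplicities, with the group operation passed in as a parameter. The following is only a sketch: add-set, add-multiset and consolidate are hypothetical names, and the sample data is made up.

```clojure
;; Illustrative only: these names are not Datomic or 3DF functions.

(defn add-set
  "Addition modulo 2, the group on #{0 1} from the text.
  Note that under modulo 2 a diff of -1 behaves like a diff of 1."
  [x y]
  (mod (+ x y) 2))

(def add-multiset
  "Ordinary integer addition, the group used for multisets."
  +)

(defn consolidate
  "Folds [e a v t diff] datoms into a map from [e a v] to multiplicity,
  combining diffs with `add` and dropping entries that cancel out to 0."
  [add datoms]
  (reduce (fn [acc [e a v _t diff]]
            (let [m (add (get acc [e a v] 0) diff)]
              (if (zero? m)
                (dissoc acc [e a v])
                (assoc acc [e a v] m))))
          {}
          datoms))

(def datoms
  [[1 :person/name "Ada" 10 1]
   [1 :person/name "Ada" 11 1]
   [2 :person/name "Bob" 11 1]
   [2 :person/name "Bob" 12 -1]])

(consolidate add-multiset datoms) ;; => {[1 :person/name "Ada"] 2}
(consolidate add-set datoms)      ;; => {}
```

The same stream of datoms yields different states depending on the group: under integer addition the repeated Ada datom accumulates to 2, whereas under the two-element group it cancels itself out, exactly as (add 1 1) does above.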

Intent

Datoms do not record intent, they record facts. For completeness’ sake, we note that some communication requires preservation of intent. In particular, whenever we talk about a source-of-truth, we are referring to a system that has access to user intent and the authority to impose its interpretation. Other times, preservation of intent is outright dangerous, because it allows for diverging interpretations to creep in.

Datomic’s transaction data is a representation that preserves intent:

;; intent
[data-fn args*]

Datomic is therefore designed to act as a source-of-truth, because it transforms intent-aware inputs (which require interpretation in the form of transaction functions) into intent-less datoms. 3DF, in contrast, doesn’t know anything about the correct interpretation of user intent, and therefore expects datoms as input.

When replicating / propagating information, we want to be careful to only ever send intent-less data. As soon as more than one system has authority to impose interpretation, we are playing the game of distributed consensus, which is not a fun game at all.
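The step from intent to intent-less datoms can be sketched as a small interpreter over [data-fn args*] forms. Everything here is hypothetical: the data-fns table, the :account/deposit function, the map-shaped db and the interpret function are made up for illustration and do not reflect how Datomic implements transaction functions.

```clojure
;; Hypothetical sketch. `db` is assumed to be a plain map of
;; eid -> attribute -> value, standing in for a real database value.

(def data-fns
  {:db/add     (fn [_db e a v] [[e a v 1]])
   :db/retract (fn [_db e a v] [[e a v -1]])
   ;; Interpreting a deposit requires the current state, which is why
   ;; only the source-of-truth can turn this intent into facts.
   :account/deposit
   (fn [db e amount]
     (let [old-balance (get-in db [e :account/balance])
           new-balance (+ (or old-balance 0) amount)]
       (cond-> []
         old-balance (conj [e :account/balance old-balance -1])
         true        (conj [e :account/balance new-balance 1]))))})

(defn interpret
  "Expands one [data-fn args*] form into [e a v t diff] datoms at time t."
  [db t [data-fn & args]]
  (mapv (fn [[e a v diff]] [e a v t diff])
        (apply (get data-fns data-fn) db args)))

(interpret {1 {:account/balance 100}} 42 [:account/deposit 1 50])
;; => [[1 :account/balance 100 42 -1] [1 :account/balance 150 42 1]]
```

The resulting datoms can be replicated safely, because a receiver no longer needs to know what a deposit means or what the balance was when it happened.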

Summary

We have seen a number of generalizations and tweaks to the humble datom. Let’s list the most important categories again and give them names.

[data-fn args*] ;; intent-preserving novelty
[e a v t diff]  ;; controlled novelty
[uuid a v diff] ;; open novelty

Intent-preserving novelty is uniquely suited to record all user interactions with your system. All other forms of novelty should be derivable from a stored representation of user intent.

Systems that share entity identification and timestamping schemes can communicate controlled novelty amongst each other. Most commonly, these are database peers partaking in the replication of transactions.

Finally, some systems are interested in information exchange without a shared notion of entity ids and timestamps. Consider again a web application that at any point in time maintains some local-only entities, some shared with server A, and some shared with server B. Communication must now happen via open novelty. In a fully peer-to-peer setting, intent-preserving novelty can be used in combination with a UUID scheme.

This concludes our little bestiary.