A year-end snapshot of thoughts about a range of topics, some of which I hope to publish more detailed notes about eventually. It doubles as a necessarily incomplete collection of interesting papers and articles I’ve read and re-read this year.
Databases and Stream Processing
Databases are collections of time-varying relations, streams are derivatives of time-varying relations1. For this to work, relations must be represented as multisets, not as sets.
New data structures designed from this perspective2 have me hopeful that mutually beneficial unification of the two is possible.
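One way to make the relation/stream duality concrete: represent a relation as a multiset of tuples, and a stream as batches of signed multiplicity changes to that multiset. A minimal Python sketch, with all data invented for illustration:

```python
from collections import Counter

# A relation as a multiset: tuple -> multiplicity (possibly > 1).
relation = Counter({("alice", 3): 1, ("bob", 5): 2})

# A stream as the derivative of that relation over time: batches of
# (tuple, diff) changes, where diff is a signed multiplicity.
stream = [
    [(("carol", 7), +1)],   # carol's tuple appears
    [(("bob", 5), -1)],     # one copy of bob's tuple is retracted
]

def apply_batch(rel, batch):
    """Integrate one batch of diffs into the multiset."""
    for tup, diff in batch:
        rel[tup] += diff
        if rel[tup] == 0:
            del rel[tup]  # keep the representation canonical
    return rel

for batch in stream:
    apply_batch(relation, batch)

assert relation == Counter({("alice", 3): 1, ("bob", 5): 1, ("carol", 7): 1})
```

The multiset requirement shows up in the second batch: retracting one of two identical tuples only makes sense if the representation tracks multiplicities, which a plain set cannot.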
What will remain is the distinction between data-driven and demand-driven systems, two answers to a foundational architectural decision that (at first glance) couldn’t have more contrasting implications.
Incremental View Maintenance (IVM)
IVM3 is the field of inquiry most immediately concerned with the unification of databases and streams. Given a database D, a query Q, and changes ΔD to one or more input relations, the goal of IVM is to determine the changes ΔQ to the output relation, such that Q(D) + ΔQ = Q(D + ΔD), i.e. such that the maintained output corresponds to reevaluating the query from scratch.
The IVM problem for relational databases is equivalent to computing the derivative of an expression in relational algebra4.
Alternatively, IVM can be practiced bottom-up, starting from derivatives of data sources (i.e. input streams) and describing the composition of those into derivatives of higher-level relational expressions5.
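The bottom-up composition of derivatives can be made concrete with the classic delta rule for a binary join, Δ(R ⋈ S) = ΔR ⋈ S + R ⋈ ΔS + ΔR ⋈ ΔS. A minimal sketch over counted multisets (all names and data invented for illustration):

```python
from collections import Counter

def plus(a, b):
    """Multiset sum that tolerates negative multiplicities (retractions),
    unlike Counter's built-in `+`, which drops non-positive counts."""
    out = Counter(a)
    for tup, diff in b.items():
        out[tup] += diff
        if out[tup] == 0:
            del out[tup]
    return out

def join(r, s):
    """Equi-join on the first attribute; multiplicities multiply."""
    out = Counter()
    for (k1, a), m in r.items():
        for (k2, b), n in s.items():
            if k1 == k2:
                out[(k1, a, b)] += m * n
    return out

def delta_join(r, s, dr, ds):
    """Derivative of r ⋈ s: ΔR ⋈ S + R ⋈ ΔS + ΔR ⋈ ΔS."""
    return plus(plus(join(dr, s), join(r, ds)), join(dr, ds))

# The incremental update agrees with reevaluation from scratch:
r, s = Counter({(1, "a"): 1}), Counter({(1, "x"): 2})
dr, ds = Counter({(1, "b"): 1}), Counter({(1, "x"): -1})
assert plus(join(r, s), delta_join(r, s, dr, ds)) == join(plus(r, dr), plus(s, ds))
```

Because delta rules like this exist for each relational operator, derivatives of higher-level expressions can be assembled compositionally from derivatives of their inputs.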
Multiverse Databases / Selective Replication
Modify the IVM problem slightly, to the following: given a query Q and changes ΔD to a collection of input relations D, maintain a collection of output relations V, such that V = Q(D + ΔD) at all times.
If we restrict this formulation and require that enumerating the maintained output is possible with only constant delay per result tuple, then we enter the fascinating world of factorized databases6 and worst-case optimal join algorithms7.
In other words: we are interested in a way to compress a database losslessly with respect to a specific query. This problem is apparently itself a facet of a more general problem, known as “The Chase”8, about which I understand next to nothing (for now!).
Whatever it may or may not be called, solving this problem would give us a generic tool to efficiently replicate a portion of a database, where that portion is described declaratively via a relational query — an essential building block for multiverse databases9.
Multiverse databases are interesting for a number of reasons, one of them being the ability to replicate highly restricted versions of a backend database to frontend clients10.
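A toy sketch of that building block, with the relational query stood in for by a simple selection predicate (all names and the visibility rule are invented for the example):

```python
from collections import Counter

def visible_to(user):
    """A declaratively described 'portion' of the database: here, just a
    selection predicate standing in for a full relational query."""
    return lambda row: row["owner"] == user or row["public"]

def replicate(changelog, predicate):
    """Maintain a restricted partial replica by filtering the upstream
    diff stream through the client's query."""
    replica = Counter()
    for row, diff in changelog:
        if predicate(row):
            key = tuple(sorted(row.items()))  # hashable row representation
            replica[key] += diff
            if replica[key] == 0:
                del replica[key]
    return replica

log = [
    ({"id": 1, "owner": "alice", "public": False}, +1),
    ({"id": 2, "owner": "bob", "public": True}, +1),
    ({"id": 3, "owner": "bob", "public": False}, +1),
]
alice_view = replicate(log, visible_to("alice"))
assert len(alice_view) == 2  # alice sees rows 1 and 2; row 3 never reaches her
```

The interesting (and hard) part elided here is supporting arbitrary relational queries, joins included, rather than per-row predicates; that is where factorized representations come in.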
Transaction processing really is a three-step process:
- Work hard to minimize the number of conflicts.
- Actually change things.
- Work hard to correct the conflicts that did arise.
For almost a decade now, most of the thinking has gone into step (1)11. After spending some time working with Frank McSherry, I have grown much more interested in approaches that invest the bare minimum in step (1), and instead focus on doing step (3) as efficiently as possible.
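One way to read that approach: invest nothing in conflict avoidance beyond a version check, and put the effort into detecting and repairing conflicts after the fact. A minimal, single-process optimistic-concurrency sketch (all names invented for illustration):

```python
class Store:
    """A toy versioned key-value store with optimistic commits."""

    def __init__(self):
        self.data = {}      # key -> value
        self.version = {}   # key -> version counter

    def read(self, key):
        return self.data.get(key, 0), self.version.get(key, 0)

    def commit(self, read_set, writes):
        # Step (3): detect conflicts after the fact and reject, letting
        # the caller retry, instead of locking up front in step (1).
        for key, seen_version in read_set.items():
            if self.version.get(key, 0) != seen_version:
                return False  # someone changed what we read; correct by retrying
        for key, value in writes.items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1
        return True

def transfer(store, src, dst, amount):
    while True:  # optimistic retry loop
        (a, va), (b, vb) = store.read(src), store.read(dst)
        if store.commit({src: va, dst: vb}, {src: a - amount, dst: b + amount}):
            return

store = Store()
store.data = {"a": 10, "b": 0}
transfer(store, "a", "b", 3)
assert store.data == {"a": 7, "b": 3}
```

Validation-and-retry is only one flavor of step (3); compensation-based approaches go further and repair already-committed effects rather than rejecting them.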
The shortage of software talent, combined with the need for big non-tech players to catch up, will impose incredible rigor on engineering departments. They won’t be able to afford the same amount of wild-west experimentation that the field has grown used to. Software engineering has some catching up to do before it can reach the scale of more established engineering practices.
We might therefore be entering an age where organizational leverage is most important. How far can your organization get with one or two great engineers? Can they, supported by the right tools, make a larger team of junior engineers successful? I view the success of languages with heavily integrated tooling, like Go and Rust, as signs of this trend.
All companies are becoming tech companies, all tech companies are becoming fintechs. The current generation of engineers is working within environments that place a much more immediate dollar figure on architecture decisions. Some are looking forward to taking this to the extreme14. Problems remain15. Many opportunities for rethinking system architecture present themselves1617.
> […] for cases when the increase in execution time can be offset by reduction in waiting time, legacy systems and applications will not suffer from degraded overall performance on disaggregated architectures.
>
> Marzoev, Agarwal, “Benefits of Resource Disaggregation”
- Use immutable state wherever possible.
- Model essential state in relations.
- From those, derive all other state as views.
- Express derivations in an appropriate language fragment20.
- Prefer dataflow to control flow5.
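As a toy illustration of these principles (schema and names invented for the example): essential state lives in an immutable relation, and aggregate state is derived from it as a view rather than updated in place.

```python
# Essential state: an immutable, append-only relation of
# (order_id, customer, amount) facts.
orders = (
    (1, "alice", 30),
    (2, "bob", 20),
    (3, "alice", 50),
)

def totals_by_customer(orders):
    """Derived state: a view over the essential relation. It is never
    maintained by hand and can always be recomputed from scratch."""
    view = {}
    for _order_id, customer, amount in orders:
        view[customer] = view.get(customer, 0) + amount
    return view

assert totals_by_customer(orders) == {"alice": 80, "bob": 20}
```

Recomputing the view from scratch on every change is of course the naive strategy; IVM, as discussed above, is what makes the same architecture efficient.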
Successful businesses are information dissemination machines, designed to build machines, that then build products or deliver services. Entrepreneurship is continuously finding solutions to a string of communication problems, and simultaneously scaling those solutions beyond face-to-face discussions.