30 December 2019

Perspectives 2019

A year-end snapshot of thoughts about a range of topics, some of which I hope to publish more detailed notes about eventually. It doubles as a necessarily incomplete collection of interesting papers and articles I’ve read and re-read this year.

Databases and Stream Processing

Databases are collections of time-varying relations, streams are derivatives of time-varying relations1. For this to work, relations must be represented as multisets, not as sets.

New data structures designed from this perspective2 have me hopeful that mutually beneficial unification of the two is possible.

What will remain is the distinction between data-driven and demand-driven systems, two answers to a foundational architectural decision that (at first glance) couldn’t have more contrasting implications.

Incremental View Maintenance (IVM)

IVM3 is the field of inquiry most immediately concerned with the unification of databases and streams. Given a database \(D\), a query \(Q\), and changes to one or more input relations, the goal of IVM is to determine changes to an output relation, such that the output relation corresponds to the reevaluation \(Q(D)\) from scratch.

The IVM problem for relational databases is equivalent to computing the derivative of an expression in relational algebra4.

Alternatively, IVM can be practiced bottom-up, starting from derivatives of data sources (i.e. input streams) and describing the composition of those into derivatives of higher-level relational expressions5.

Multiverse Databases / Selective Replication

Modify the IVM problem slightly, to the following: given a query \(Q\), and changes to a collection of input relations \(D = \{R_1...R_n\}\), maintain a collection of output relations \(D' = \{F_1...F_m\}\), such that \(Q(D) = Q(D')\).

If we restrict this formulation and require that enumerating \(Q(D')\) is possible with only constant delay per result tuple, then we enter the fascinating world of factorized databases6 and worst-case optimal join algorithms7.

In other words: we are interested in a way to compress a database losslessly with respect to a specific query. This problem is apparently itself a facet of a more general problem, known as “The Chase”8, about which I understand next to nothing (for now!).

Whatever it may or may not be called, solving this problem would give us a generic tool to efficiently replicate a portion of a database, where that portion is described declaratively via a relational query — an essential building block for multiverse databases9.

Multiverse databases are interesting for a number of reasons, one of them being the ability to replicate highly restricted versions of a backend database to frontend clients10.

Transaction Processing

Transaction processing really is a three step process:

  1. Work hard to minimize the number of conflicts.
  2. Actually change things.
  3. Work hard to correct the conflicts that did arise.

For almost a decade now, most of the thinking has gone into step (1)11. After spending some time working with Frank McSherry, I have grown much more interested in approaches that invest the bare minimum in step (1), and instead focus on doing step (3) as efficiently as possible.

See Kleppmann et al. on Online Event Processing12, transaction repair in LogicBlox13, and of course differential computation5.

Software Engineering

The shortage of software talent, combined with the need for big non-tech players to catch up, will impose incredible rigor on engineering departments. They won’t be able to afford the same amounts of wild west experimentation that the field has grown used to. Software engineering has some catching up to do, before it can reach the scale of more established engineering practices.

We might therefore be entering an age where organizational leverage is most important. How far can your organization get with one or two great engineers? Can they, supported by the right tools, make a larger team of junior engineers successful? I view the success of languages with heavily integrated tooling, like Go and Rust, as signs of this trend.

All companies are becoming tech companies, all tech companies are becoming fintechs. The current generation of engineers is working within environments that place a much more immediate dollar figure on architecture decisions. Some are looking forward to take this to the extreme14. Problems remain15. Many opportunities for rethinking system architecture present themselves1617.

[…] for cases when the increase in execution time can be offset by reduction in waiting time, legacy systems and applications will not suffer from degraded overall performance on disaggregated architectures.

Marzoev, Agarwal, “Benefits of Resource Disaggregation”

Out of the Tar Pit1819

  1. Use immutable state wherever possible.
  2. Model essential state in relations.
  3. From those, derive all other state as views.
  4. Express derivations in appropriate language fragment20.
  5. Prefer dataflow to control flow5.

See also 212223.

Entrepreneurship

Successful businesses are information dissemination machines, designed to build machines, that then build products or deliver services. Entrepreneurship is continuously finding solutions to a string of communication problems, and to simultaneously scale those solutions beyond face-to-face discussions.

  1. Begoli, Akidau et al. 2019, “One SQL to Rule Them All” 

  2. McSherry, Lattuada, Schwarzkopf 2018, “K-Pg: Shared State in Differential Dataflows” 

  3. Gupta, Mumick, Subrahmanian 1993, “Maintaining views incrementally” 

  4. Koch, Ahmad et al. 2013, “DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views” 

  5. McSherry, Murray, et al. 2013, “Differential dataflow”  2 3

  6. Factorized Databases 

  7. Abo Khamis, Ngo, Rudra, “Juggling Functions Inside a Database” 

  8. en.wikipedia.org/wiki/Chase_(algorithm) 

  9. Marzoev, Araújo et al. 2019, “Towards Multiverse Databases” 

  10. Bach, Sandstede 2019, “Right Inside the Database” 

  11. Thomson, Diamond, et al. 2012, “Calvin: Fast Distributed Transactions for Partitioned Database Systems” 

  12. Kleppmann, Beresford, Svingen 2019, “Online Event Processing” 

  13. Aref, ten Cate, et al. 2015, “Design and Implementation of the LogicBlox System” 

  14. Wardley 2016, “Why the fuss about serverless?” 

  15. Jonas et al. 2019, “A Berkeley View on Serverless Computing” 

  16. Marzoev, Agarwal, “Benefits of Resource Disaggregation” 

  17. Dageville, Cruanes et al. 2016, “The Snowflake Elastic Data Warehouse” 

  18. Moseley, Marks 2006, “Out of the Tar Pit” 

  19. Göbel 2019, “Out of the Tar Pit” @ PWL Zurich 

  20. Alvaro 2015, “I See What You Mean” 

  21. Smaragdakis 2019, “Next-Paradigm Programming Languages” 

  22. Hellerstein 2010, “The Declarative Imperative” 

  23. Valiant 1990, “A Bridging Model for Parallel Computation”