

10.2.1 Localizing Write Access

A very large number of edit operations (hundreds of thousands, millions) sees dismal write performance, as each new insertion may cause a non-consecutive (random) disk write.

Very large transactions may avoid excessive disk writing, but they would have to be large enough that every node receives several changes. Due to their long running time, very large transactions offer little in the way of durability, and they require that a large part of, or perhaps the entire, database be brought into memory.

One method of addressing the scattered write pattern characteristic of b-trees, without requiring long-running transactions or a very large memory space, is to collect many insertion/deletion commands and then execute them on the b-tree in key-sorted order. The data structure collecting the edit commands may reside entirely in memory, or may itself be written to disk.
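As an illustration, here is a minimal sketch of such an in-memory edit collector, in Python. The b-tree API it assumes, insert(key, value) and delete(key), is hypothetical, and a later edit to a key supersedes an earlier one.

    class EditCollector:
        """Collects edit commands in memory and applies them in key order."""

        def __init__(self, btree):
            self.btree = btree   # assumed API: insert(key, value), delete(key)
            self.edits = {}      # key -> ("put", value) or ("del", None)

        def put(self, key, value):
            self.edits[key] = ("put", value)

        def delete(self, key):
            self.edits[key] = ("del", None)

        def apply_all(self):
            # Sorting first turns the scattered edits into one ordered
            # sweep over the b-tree's nodes.
            for key in sorted(self.edits):
                op, value = self.edits[key]
                if op == "put":
                    self.btree.insert(key, value)
                else:
                    self.btree.delete(key)
            self.edits.clear()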

It may work by recording the random edits and writing them out in key-consecutive runs. The b-tree update would occur when the number of collected edit items reaches a certain threshold. The number of keys written in one run would be relatively small (e.g. two orders of magnitude below the number of collected modified keys), with each b-tree update pass starting from where the last one stopped.
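A self-contained sketch of this threshold-and-runs scheme follows. FLUSH_THRESHOLD and RUN_SIZE are assumed, illustrative values chosen to reflect the two-orders-of-magnitude ratio mentioned above, and the b-tree API is again hypothetical.

    import bisect

    FLUSH_THRESHOLD = 100_000  # assumed: flush once this many edits collect
    RUN_SIZE = 1_000           # assumed: ~two orders of magnitude fewer per run

    class RunFlusher:
        """Flushes small, key-consecutive runs of collected edits into a b-tree."""

        def __init__(self, btree):
            self.btree = btree   # assumed API: insert(key, value), delete(key)
            self.edits = {}      # key -> ("put", value) or ("del", None)
            self.cursor = None   # key where the previous run stopped

        def record(self, key, op, value=None):
            self.edits[key] = (op, value)
            if len(self.edits) >= FLUSH_THRESHOLD:
                self.flush_one_run()

        def flush_one_run(self):
            keys = sorted(self.edits)
            # Resume where the previous run stopped, wrapping around, so
            # each update pass touches only a small, contiguous key range.
            start = 0 if self.cursor is None else bisect.bisect_right(keys, self.cursor)
            if start >= len(keys):
                start = 0
            run = keys[start:start + RUN_SIZE]
            for key in run:
                op, value = self.edits.pop(key)
                if op == "put":
                    self.btree.insert(key, value)
                else:
                    self.btree.delete(key)
            if run:
                self.cursor = run[-1]

Keeping a cursor between runs makes each flush resume where the previous one stopped, sweeping the whole key space in many small passes rather than one long one.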

Keeping the edit-collector data structure in memory only may provide the best performance, but it would not be a durable solution and it would break database transactionality. One option addressing both issues is to keep the edits in a database table. The latter solution is not without issues itself.

The problems with the table collector come from the fact that the data structure is itself backed by a disk b-tree, and hence it exhibits the same write performance issues it tries to solve. However, the collector table would be smaller than the table or tables whose write access it optimizes, and it may have very large blocks so that it does not itself scatter the writes too much. For the same reason, the tables for which the writing is optimized (localized, that is) may themselves have large user data nodes.

A single collector table may not be enough, as the table may be either too big to have good write characteristics (it would have too many nodes) or too small to localize the writes for the tables it fronts.

A hierarchical solution of two or more tables would very likely achieve good edit localization. It would have a first, concentrating table into which the edit commands are initially collected, spilling into a second, larger one when it reaches a certain item count. The first table would have a very small number of blocks, possibly a single very large one. The second table would allow an order of magnitude more collected items than the first and would itself feature large nodes, and so on.
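The two-level arrangement could look roughly like the following sketch, with in-memory dictionaries standing in for the collector tables; the capacities and the b-tree API are assumed, illustrative values, not part of any actual implementation.

    LEVEL0_CAPACITY = 10_000    # assumed: small first table, few (large) blocks
    LEVEL1_CAPACITY = 100_000   # assumed: an order of magnitude larger

    class TieredCollector:
        """Two-level edit collector: level 0 spills into level 1, which
        merges into the main b-tree when it fills up."""

        def __init__(self, btree):
            self.btree = btree   # assumed API: insert(key, value), delete(key)
            self.level0 = {}     # stands in for the small collector table
            self.level1 = {}     # stands in for the larger collector table

        def record(self, key, op, value=None):
            self.level0[key] = (op, value)
            if len(self.level0) >= LEVEL0_CAPACITY:
                # Later edits supersede earlier ones for the same key.
                self.level1.update(self.level0)
                self.level0.clear()
            if len(self.level1) >= LEVEL1_CAPACITY:
                self._merge_into_btree()

        def _merge_into_btree(self):
            for key in sorted(self.level1):
                op, value = self.level1[key]
                if op == "put":
                    self.btree.insert(key, value)
                else:
                    self.btree.delete(key)
            self.level1.clear()

A real implementation would merge level 1 into the b-tree in small key-consecutive runs, as in the earlier sketch, rather than all at once.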

Large transactions are still required for the write-localizing solution to work, but their size need not go beyond the range of hundreds to thousands of updates.

Data structures of this design are ordinarily described as log-structured merge (LSM) trees, although most implementations are not built over read/write b-trees, instead using read-only b-tree-like indexes to organize the merge levels.

