182 lines
10 KiB
Text
182 lines
10 KiB
Text
Diff Handling Framework
|
|
=======================
|
|
|
|
Within the support library, in the namespace `lib::diff`, there is a collection of loosely coupled of tools
|
|
known as »the diff framework«. It revolves around generic representation and handling of structural differences.
|
|
Beyond some rather general assumptions, to avoid stipulating the usage of specific data elements or containers,
|
|
the framework is kept _generic_, cast in terms of *elements*, *sequences* and *strategies*
|
|
for access, indexing and traversal.
|
|
|
|
.motivation
|
|
**********************
|
|
Diff handling is multi purpose within Lumiera:
|
|
Diff representation is seen as a meta language and abstraction mechanism;
|
|
it enables tight collaboration without the need to tie and tangle the involved implementation data structures.
|
|
Used this way, diff representation reduces coupling and helps to cut down overall complexity — so to justify
|
|
the considerable amount of complexity seen within the diff framework implementation.
|
|
**********************
|
|
|
|
Definitions
|
|
-----------
|
|
element::
|
|
the atomic unit treated in diff detection, representation and application. +
|
|
Elements are considered to be
|
|
|
|
- lightweight copyable values
|
|
- equality comparable
|
|
- bearing distinct identity
|
|
- unique _as far as concerned_
|
|
|
|
sequence::
|
|
data is delivered in the form of a sequence, which might or might not be _ordered,_
|
|
but in any case will be traversed once only.
|
|
|
|
diff::
|
|
the changes necessary to transform an input sequence (``old sequence'') into a target sequence (``new sequence'')
|
|
|
|
diff language::
|
|
differences are spelled out in linearised form: as a sequence of constant-size diff actions, called »diff verbs«
|
|
|
|
diff verb::
|
|
a single element within a diff. Diff verbs are conceived as operations, which,
|
|
when applied consuming the input sequence, will produce the target sequence of the diff.
|
|
|
|
diff application::
|
|
the process of consuming a diff (sequence of diff verbs), with the goal to produce some effect at the
|
|
_target_ of diff application. Typically we want to apply a diff to a data sequence, to mutate it
|
|
into a new shape, conforming with the shape of the diff's ``target sequence''
|
|
|
|
diff generator::
|
|
a facility producing a diff, which is a sequence of diff verbs.
|
|
Typically, a diff generator works lazily, demand driven.
|
|
|
|
diff detector::
|
|
special kind of diff generator, which takes two data sequences as input:
|
|
an ``old sequence'' and a ``new sequence''. The diff detector traverses and compares
|
|
these sequences to produce a diff, which describes the steps necessary to transform
|
|
the ``old'' shape into the ``new'' shape of the data.
|
|
|
|
|
|
List Diff Algorithm
|
|
-------------------
|
|
While in general this is a well studied subject, in Lumiera we'll confine ourselves to a very
|
|
specific flavour of diff handling: we rely on _elementary atomic units_ with well established
|
|
object identity. And in addition, within the scope of one coherent diff handling task,
|
|
we require those elements to be 'unique'. The purpose of this design decision is to segregate
|
|
the notorious matching problem and treat diff handling in isolation.
|
|
|
|
Effectively this means that, for any given element, there can be at most _one_ matching
|
|
counterpart in the other sequence, and the presence of such can be detected by using an *index*.
|
|
In fact, we retrieve an index for every sequence involved into the diff detection task;
|
|
this is our trade-off for simplicity in the diff detection algorithm.footnote:[traditionally,
|
|
diff detection schemes, especially those geared at text diff detection, engage into great lengths
|
|
of producing an ``optimal'' diff, which effectively means to build specifically tuned pattern
|
|
or decision tables, from which the final diff can then be pulled or interpreted.
|
|
We acknowledge that in our case building a lookup table index with additional annotations can be
|
|
as bad as O(n^2^) and worse; we might well be able to do better, but likely for the price of
|
|
turning the algorithm into some kind of mental challenge.]
|
|
In case this turns out as a performance problem, we might consider integrating the index
|
|
maintenance into the data structure to be diffed, which shifts the additional impact of
|
|
indexing onto the data population phase.footnote:[in the general tree diff case this is far
|
|
from trivial, since we need an self-contained element index for every node, and we need the
|
|
ability to take a snapshot of the ``old'' state before mutating a node into ``new'' shape]
|
|
|
|
Element classification
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
By using the indices of the old and the new sequence, we are able to _classify_ each element:
|
|
|
|
- elements only present in the new sequence are treated as *inserts*
|
|
- elements only present in the old sequence are treated as *deletes*
|
|
- elements present in both sequences form the *permutation*
|
|
|
|
Processing pattern
|
|
~~~~~~~~~~~~~~~~~~
|
|
We _consume both the old and the new sequence synchronously, while emitting the diff sequence_.
|
|
|
|
The diff describes a sequence of operations, which, when applied, consume a sequence congruent
|
|
to the old sequence, while emitting a sequence congruent to the new sequence. We use the
|
|
following *list diff language* here:
|
|
|
|
verb `ins(elm)`::
|
|
insert the given argument element `elm` at the _current processing position_
|
|
into the target sequence. This operation allows to inject new data
|
|
verb `del(elm)`::
|
|
delete the _next_ element `elm` at _current position._
|
|
For sake of verification, the element to be deleted is also included as argument (redundancy).
|
|
verb `pick(elm)`::
|
|
accepts the _next_ element at _current position_ into the resulting altered sequence.
|
|
The element is given redundantly as argument.
|
|
verb `find(elm)`::
|
|
effect a re-ordering of the target list contents. This verb requires to search
|
|
for the (next respective single occurrence of the) given element in the remainder
|
|
of the datastructure, to bring it forward and insert it as the _next_ element.
|
|
verb `skip(elm)`::
|
|
processing hint, emitted at the position where an element previously fetched by
|
|
some `find` verb happened to sit in the old order. This allows an optimising
|
|
implementation to ``fetch'' a copy and just drop or skip the original,
|
|
thereby avoiding to shift other elements.
|
|
|
|
Since _inserts_ and _deletes_ can be detected and emitted right at the processing frontier,
|
|
for the remaining theoretical discussion, we consider the insert / delete part filtered
|
|
away conceptually, and concentrate on generating the permutation part.
|
|
|
|
Handling sequence permutation
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
Permutation handling turns out to be the turning point and tricky part of diff generation;
|
|
the handling of new and missing members can be built on top, once we manage to describe a
|
|
permutation diff. If we consider -- for a theoretical discussion -- the _inserts_ and _deletes_
|
|
to be filtered away, what remains is a permutation of index numbers to describe the re-ordering.
|
|
We may describe this re-ordering by the index numbers in the new sequence, given in the order
|
|
of the old sequence. For such a re-ordering permutation, the theoretically optimal result
|
|
can be achieved by http://wikipedia.org/wiki/Cycle_sort[Cycle Sort] in _linear time_.footnote:[
|
|
assuming random access by index is possible, *Cycle Sort* walks the sequence once. Whenever
|
|
encountering an element out-of order, i.e. new postion != current position, we leap to the
|
|
indicated new position, which necessarily will be out-of-order too, so we can leap from there
|
|
to the next indicated position, until we jump back to the original position eventually. Such
|
|
a closed cycle can then just be _rotated_ into proper position. Each permutation can be
|
|
decomposed into a finite number of closed cycles, which means, after rotating all cycles
|
|
out of order, the permutation will be sorted completely.] Starting from such ``cycle rotations'',
|
|
we possibly could work out a minimal set of moving operations.
|
|
|
|
But there is a twist: Our design avoids using index numbers, since we aim at _stream processing_
|
|
of diffs. We do not want to communicate index numbers to the consumer of the diff; rather we
|
|
want to communicate reference elements with our _diff verbs_. Thus we prefer the most simplistic
|
|
processing mechanism, which happens to be some variation of *Insertion Sort*.footnote:[to support
|
|
this choice, *Insertion Sort* -- in spite of being O(n^2^) -- turns out to be the best choice for
|
|
sorting small data sets for reasons of cache locality; even typical Quicksort implementations
|
|
switch to insertion sorting of small subsets for performance reasons]
|
|
|
|
|
|
Implementation and Complexity
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
Based on these choices, we're able to consume two permutations of the same sequence simultaneously,
|
|
while emitting `find` and `pick` verbs to describe the re-ordering. Consider the two sequences
|
|
split into an already-processed part, and a part still-to-be-processed.
|
|
|
|
.Invariant
|
|
Matters are arranged such, that, in the to-be-processed part, each element appearing at the
|
|
front of the ``new'' sequence _can be consumed right away_.
|
|
|
|
Now, to arrive at that invariant, we use indices to determine
|
|
|
|
- if the element at head of the old sequence is not present in the new sequence, which means
|
|
it has to be deleted
|
|
- while an element at head of the new sequence but not present in the old sequence has to be inserted
|
|
- and especially a non-matching element at the old sequence prompts us to fetch the right
|
|
element from further down in the sequence and insert it a current position
|
|
|
|
after that, the invariant is (re)established and we can consume the element and move the
|
|
processing point forward.
|
|
|
|
For generating the diff description, we need index tables of the ``old'' and the ``new'' sequence,
|
|
which causes a O(n log n) complexity and storage in the order of the sequence length. Application
|
|
of the diff is quadratic, due to the find-and-insert passes.footnote:[In the theoretical treatment
|
|
of diff problems it is common to introduce a *distance metric* to describe how _far apart_ the
|
|
two sequences are in terms of atomic changes. This helps to make the quadratic (or worse) complexity
|
|
of such algorithms look better: if we know the sequences are close, the nested sub-scans will be
|
|
shorter than the whole sequence (with n·d < n^2^). +
|
|
However, since our goal is to support permutations and we have to deal with arbitrary sequences, such
|
|
an argument looks somewhat pointless. Let's face it, structural diff computation is expensive; the only
|
|
way to keep matters under control is to keep the local sequences short, which prompts us to exploit
|
|
structural knowledge instead of comparing the entire data as flat sequence]
|
|
|