Diff Handling Framework
=======================

Within the support library, in the namespace `lib::diff`, there is a collection of loosely coupled tools
known as »the diff framework«. It revolves around generic representation and handling of structural differences.
Beyond some rather general assumptions, and to avoid stipulating the usage of specific data elements or containers,
the framework is kept _generic_, cast in terms of *elements*, *sequences* and *strategies*
for access, indexing and traversal.

.motivation
**********************
Diff handling is multi purpose within Lumiera:
Diff representation is seen as a meta language and abstraction mechanism;
it enables tight collaboration without the need to tie and tangle the involved implementation data structures.
Used this way, diff representation reduces coupling and helps to cut down overall complexity, which justifies
the considerable amount of complexity seen within the diff framework implementation.
**********************

Definitions
-----------
element::
the atomic unit treated in diff detection, representation and application. +
Elements are considered to be

- lightweight copyable values
- equality comparable
- bearing distinct identity
- unique _as far as concerned_, i.e. within the scope of one coherent diff handling task

sequence::
data is delivered in the form of a sequence, which might or might not be _ordered,_
but in any case will be traversed once only.

diff::
the changes necessary to transform an input sequence (``old sequence'') into a target sequence (``new sequence'')

diff language::
differences are spelled out in linearised form: as a sequence of constant-size diff actions, called »diff verbs«

diff verb::
a single element within a diff. Diff verbs are conceived as operations which,
when applied while consuming the input sequence, will produce the target sequence of the diff.

diff application::
the process of consuming a diff (sequence of diff verbs), with the goal to produce some effect at the
_target_ of diff application. Typically we want to apply a diff to a data sequence, to mutate it
into a new shape, conforming with the shape of the diff's ``target sequence''.

diff generator::
a facility producing a diff, which is a sequence of diff verbs.
Typically, a diff generator works lazily, demand driven.

diff detector::
special kind of diff generator, which takes two data sequences as input:
an ``old sequence'' and a ``new sequence''. The diff detector traverses and compares
these sequences to produce a diff, which describes the steps necessary to transform
the ``old'' shape into the ``new'' shape of the data.


List Diff Algorithm
-------------------
While in general this is a well studied subject, in Lumiera we'll confine ourselves to a very
specific flavour of diff handling: we rely on _elementary atomic units_ with well established
object identity. And in addition, within the scope of one coherent diff handling task,
we require those elements to be 'unique'. The purpose of this design decision is to segregate
the notorious matching problem and treat diff handling in isolation.

Effectively this means that, for any given element, there can be at most _one_ matching
counterpart in the other sequence, and the presence of such can be detected by using an *index*.
In fact, we retrieve an index for every sequence involved in the diff detection task;
this is our trade-off for simplicity in the diff detection algorithm.footnote:[traditionally,
diff detection schemes, especially those geared at text diff detection, go to great lengths
to produce an ``optimal'' diff, which effectively means to build specifically tuned pattern
or decision tables, from which the final diff can then be pulled or interpreted.
We acknowledge that in our case building a lookup table index with additional annotations can be O(n^2^);
we might well be able to do better, but certainly at the price of an algorithm more mentally challenging.]
In case this turns out to be a performance problem, we might consider integrating the index
maintenance into the data structure to be diffed, which shifts the additional impact of
indexing onto the data population phase.footnote:[in the general tree diff case this is far
from trivial, since we need a self-contained element index for every node, and we need the
ability to take a snapshot of the ``old'' state before mutating a node into ``new'' shape]

Element classification
~~~~~~~~~~~~~~~~~~~~~~
By using the indices of the old and the new sequence, we are able to _classify_ each element:

- elements only present in the new sequence are treated as *inserts*
- elements only present in the old sequence are treated as *deletes*
- elements present in both sequences form the *permutation*
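
The classification above can be illustrated with two lookup tables. The following Python snippet is only a sketch to make the idea concrete; it is not the C++ code from `lib::diff`. The function name `classify` and the use of plain dictionaries as index are assumptions made for this example, and elements are assumed to be hashable and unique within each sequence.

```python
def classify(old, new):
    """Split the elements of two sequences into inserts, deletes and permutation."""
    old_index = {elm: pos for pos, elm in enumerate(old)}   # index of the old sequence
    new_index = {elm: pos for pos, elm in enumerate(new)}   # index of the new sequence

    inserts = [elm for elm in new if elm not in old_index]  # only in new
    deletes = [elm for elm in old if elm not in new_index]  # only in old
    permutation = [elm for elm in old if elm in new_index]  # in both, listed in old order
    return inserts, deletes, permutation


inserts, deletes, permutation = classify(["a", "b", "c", "d"], ["d", "x", "b"])
print(inserts)      # ['x']
print(deletes)      # ['a', 'c']
print(permutation)  # ['b', 'd']
```

Each membership test is a single index lookup, so the classification itself needs only one pass over each sequence once the indices exist.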

Processing pattern
~~~~~~~~~~~~~~~~~~
We _consume both the old and the new sequence synchronously, while emitting the diff sequence_.

The diff describes a sequence of operations, which, when applied, consume a sequence congruent
to the old sequence, while emitting a sequence congruent to the new sequence. We use the
following *list diff language* here:

verb `ins(elm)`::
insert the given argument element `elm` at the _current processing position_
into the target sequence. This operation allows new data to be injected.
verb `del(elm)`::
delete the _next_ element `elm` at _current position._
For the sake of verification, the element to be deleted is also included as argument (redundancy).
verb `pick(elm)`::
accept the _next_ element at _current position_ into the resulting altered sequence.
The element is given redundantly as argument.
verb `push(elm)`::
effect a re-ordering of the target list contents. This verb takes
the _next_ element, which happens to sit at the _current processing position_, and
_pushes it back_ further into the list, to be placed at a position _behind_ the
_anchor element_ `elm` given as argument.
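
To make the effect of the four verbs tangible, here is a minimal interpreter written in Python. It is a sketch under assumptions and not the framework's C++ implementation: the function name `apply_diff`, the encoding of a verb as a `(name, element)` tuple, and the choice to report verification failures as `ValueError` are all invented for this illustration.

```python
def apply_diff(old, diff):
    """Apply a list diff, given as (verb, element) pairs, to `old`; return the new list."""
    src = list(old)   # part of the old sequence still to be consumed
    out = []          # target sequence built so far

    def expect_next(elm):
        # the redundant argument allows verifying that diff and data agree
        if not src or src[0] != elm:
            raise ValueError(f"diff does not match data at element {elm!r}")

    for verb, elm in diff:
        if verb == "ins":
            out.append(elm)                    # inject new data at current position
        elif verb == "del":
            expect_next(elm)
            src.pop(0)                         # drop the next element
        elif verb == "pick":
            expect_next(elm)
            out.append(src.pop(0))             # accept the next element as-is
        elif verb == "push":
            if not src:
                raise ValueError("nothing left to push")
            moved = src.pop(0)
            # list.index raises ValueError when the anchor is not present
            src.insert(src.index(elm) + 1, moved)
        else:
            raise ValueError(f"unknown verb {verb!r}")
    if src:
        raise ValueError("diff did not consume the whole sequence")
    return out


diff = [("del", "a"), ("pick", "b"), ("ins", "x"), ("push", "d"), ("pick", "d"), ("pick", "c")]
print(apply_diff(["a", "b", "c", "d"], diff))   # ['b', 'x', 'd', 'c']
```

Note that `push` only rearranges the part still to be consumed; the pushed element shows up in the result later, when a `pick` reaches it at its new position.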

Since _inserts_ and _deletes_ can be detected and emitted right at the processing frontier,
for the rest of this theoretical discussion, we consider the insert / delete part filtered
away conceptually, and concentrate on generating the permutation part.

Handling sequence permutation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This paragraph describes how to consume two permutations of the same sequence simultaneously,
while emitting `push` and `pick` verbs to describe the re-ordering. Consider the sequences
split into an already-processed part, and a part still-to-be-processed.

.Invariant
Matters are arranged such that, in the to-be-processed part, each element appearing at the
front of the ``new'' sequence _can be picked right away_.

Now, to arrive at that invariant, we especially have to deal with the case that a different
(not matching) element appears at the front of the ``old'' list. We have to emit additional
`push` verbs to get rid of non-matching elements in the ``old'' order, until we get into a state
where the invariant is re-established (and we're able to `pick` to consume the same element
from the existing sequence and the target sequence). Obviously, the tricky part is how to
determine the *anchor element* for this `push` directive...

- we need to be sure the anchor _is indeed present_ in the current shape of the sequence in processing.
- the anchor must be in the right place, so as to conform to the target sequence at the point of picking it.
- it is desirable to emit at most one `push` directive for any given element; we want it to settle at the
  right place with a single shot

.Some example push-directives
[frame="topbot",grid="none", width="40%", float="right"]
[cols=">s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m"]
|======
|1 ||1 ||1 ||1 ||1 ||1 ||1 ||1|

|2 ||2 ||2 ||2 ||2 ||2 ||2 ||2|

|4 |->3
|5 |->3
|4 |->3
|6 |->3
|6 |->3
|5 |->3
|7 |->5
|5 |->4

|5 |->4
|4 |->3
|5 |->4
|5 |->3
|4 |->3
|4 |->3
|4 |->3
|7 |->5

||
||
|6 |->5
|4 |->3
|5 |->4
|6 |->5
||
||

|3 ||3 ||3 ||3 ||3 ||3 ||3 ||3 |

||
||
||
||
||
||
|5 |
|4 |
|======

To find out about rules for choosing the anchor element, we look at the sequence in ``old'' order,
but we indicate the _index number in the ``new'' order_ for each element. Any discontinuity in the
ascending sequence of these numbers indicates that we have to push elements back, until encountering
the lowest number still missing. The examples given in the table show such a gap after the
second element: we have to push back until we find the third element.

Basically we get some kind of increasing ``*water level*'': the continuous sequence prefix, the highest
number where all predecessors are present already. An element at this level can be picked and thus
consumed, since we're in conformance with the desired target sequence _up to this point_. But any
elements still ``above water level'' cannot yet be consumed and need to be pushed back, since
some predecessor has still to arrive. If we attribute each element with the water level reached
_at the point when we are visiting this element,_ we get a criterion for possible anchor elements:
what is above water level cannot be an anchor, since it needs to move itself. But any element
at water level is usable. And, in addition, any element already pushed once can serve as an anchor
too. This follows by a recursive argument: it has been moved behind a proper anchor, and thus will
in turn remain stable. Of all the possible candidates we have to use the largest possible predecessor,
otherwise there would be the possibility of messing up the ordering (e.g. if you place 6 behind 3
instead of 5).
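
The water level attribution can be made concrete with a small Python sketch. It reflects one possible reading of the description above and is not taken from the framework: the function name `annotate_water_level` and the shape of the returned tuples are assumptions, and the recorded level is the one reached _after_ visiting the element. Numbers are 1-based, as in the table.

```python
def annotate_water_level(old, new):
    """Return (element, number, level, at_level) for each element, in ``old'' order."""
    number = {elm: n for n, elm in enumerate(new, start=1)}   # index number in new order
    seen, level, rows = set(), 0, []
    for elm in old:
        n = number[elm]
        at_level = (n == level + 1)    # all predecessors 1..n-1 were visited already
        seen.add(n)
        while level + 1 in seen:       # raise the level over the continuous prefix
            level += 1
        rows.append((elm, n, level, at_level))
    return rows


for row in annotate_water_level(list("abdec"), list("abcde")):
    print(row)
# ('a', 1, 1, True)
# ('b', 2, 2, True)
# ('d', 4, 2, False)
# ('e', 5, 2, False)
# ('c', 3, 5, True)
```

In this run, `d` and `e` are visited while the level is stuck at 2, so they are above water level and have to be pushed; visiting `c` closes the gap and lets the level jump to 5.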

.Rules
. pick at water level
. push anything above water level
. use as anchor the largest possible from...
* elements at water level
* or elements already pushed into place
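
The three rules can be turned into a small generator for the permutation part. The Python sketch below is an illustration under assumptions and not the actual detector in `lib::diff`: the name `permutation_diff` and the `(verb, element)` tuple encoding are invented here, both inputs are assumed to hold exactly the same unique, hashable elements, and the working list is simulated naively instead of being tracked through index annotations.

```python
def permutation_diff(old, new):
    """Yield (verb, element) pairs describing how to re-order `old` into `new`."""
    number = {elm: n for n, elm in enumerate(new)}   # index of the new sequence (0-based)

    # preprocessing pass over `old`: find the elements at water level when visited
    settled, seen, level = set(), set(), 0           # level: lowest number not yet seen
    for elm in old:
        n = number[elm]
        if n == level:
            settled.add(n)                           # all predecessors already present
        seen.add(n)
        while level in seen:
            level += 1

    # emitting pass: replay the effect of each verb on a working copy
    work = list(old)
    while work:
        elm = work.pop(0)
        n = number[elm]
        if n in settled:
            yield ("pick", elm)                      # rule 1 (also covers pushed elements)
        else:
            # rule 3: nested search for the largest usable predecessor
            anchor = new[max(a for a in settled if a < n)]
            yield ("push", anchor)                   # rule 2
            work.insert(work.index(anchor) + 1, elm)
            settled.add(n)                           # pushed once: may serve as anchor now


print(list(permutation_diff([1, 2, 4, 5, 3], [1, 2, 3, 4, 5])))
# [('pick', 1), ('pick', 2), ('push', 3), ('push', 4), ('pick', 3), ('pick', 4), ('pick', 5)]
```

The printed result corresponds to the first column of the example table: 4 is pushed behind 3, then 5 behind the freshly placed 4. The `max(...)` search and the `work.index(...)` lookup are the nested sub-passes that make this naive version quadratic.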

Implementation and Complexity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We need an index lookup for an element from the ``old'' sequence to find the corresponding index number
in the ``new'' sequence. Based on this attribution, the ``water level'' attribution can be calculated
in the same linear pass. So we get two preprocessing passes, one for the ``new'' sequence and one for
the ``old'', using lookups into the ``new''-index. After these preparations, the diff can be emitted
in a further pass.

In fact we do not even need the numerical ``water level''; we only need the relations. This allows us to extend
the argumentation to include the deletes and inserts and treat everything from a single list. But, unfortunately,
the search for suitable anchor elements turns the algorithm into *quadratic complexity*: essentially this
is a nested sub-pass to find a maximum, so O(n^2^) will dominate the O(n log n) from indexing.footnote:[In
the theoretical treatment of diff problems it is common to introduce a *distance metric* to describe
how _far apart_ the two sequences are in terms of atomic changes. This helps to make the quadratic
(or worse) complexity of such algorithms look better: if we know the sequences are close, the nested
sub-scans will be shorter than the whole sequence (with n·d < n^2^). In our case, we would be able to
find the anchor in close vicinity of the current position. +
However, since our goal is to support permutations and we have to deal with arbitrary sequences, such
an argument is somewhat pointless. Let's face it, structural diff computation is expensive; the only
way to keep matters under control is to keep the local sequences short, which means to exploit structural
knowledge instead of comparing the entire data as a flat sequence.]
The additional space requirement footnote:[in _addition_ to the storage for the ``old'' and ``new'' sequence
plus the storage for the generated diff output] of our solution is O(`len(old)` + `len(new)`).