This is a theoretical description of our method; it gives the reasoning why it is correct, plus an assessment of size and complexity order.

Diff Handling Framework
=======================

Within the support library, in the namespace `lib::diff`, there is a collection of loosely coupled tools
known as »the diff framework«. It revolves around generic representation and handling of structural differences.
Beyond some rather general assumptions, and to avoid stipulating the usage of specific data elements or containers,
the framework is kept _generic_, cast in terms of *elements*, *sequences* and *strategies*
for access, indexing and traversal.

.motivation
**********************
Diff handling serves multiple purposes within Lumiera:
diff representation is seen as a meta language and abstraction mechanism;
it enables tight collaboration without the need to tie and tangle the involved implementation data structures.
Used this way, diff representation reduces coupling and helps to cut down overall complexity, which justifies
the considerable amount of complexity seen within the diff framework implementation.
**********************

Definitions
-----------
element::
the atomic unit treated in diff detection, representation and application. +
Elements are considered to be

- lightweight copyable values
- equality comparable
- bearing distinct identity
- unique _as far as concerned_, i.e. within the scope of one diff handling task

sequence::
data is delivered in the form of a sequence, which might or might not be _ordered_,
but in any case will be traversed only once.

diff::
the changes necessary to transform an input sequence (``old sequence'') into a target sequence (``new sequence'')

diff language::
differences are spelled out in linearised form: as a sequence of constant-size diff actions, called »diff verbs«

diff verb::
a single element within a diff. Diff verbs are conceived as operations which,
when applied while consuming the input sequence, will produce the target sequence of the diff.

diff application::
the process of consuming a diff (a sequence of diff verbs), with the goal of producing some effect at the
_target_ of diff application. Typically we want to apply a diff to a data sequence, to mutate it
into a new shape, conforming to the shape of the diff's ``target sequence''

diff generator::
a facility producing a diff, which is a sequence of diff verbs.
Typically, a diff generator works lazily and demand-driven.

diff detector::
a special kind of diff generator, which takes two data sequences as input:
an ``old sequence'' and a ``new sequence''. The diff detector traverses and compares
these sequences to produce a diff, which describes the steps necessary to transform
the ``old'' shape into the ``new'' shape of the data.


List Diff Algorithm
-------------------
While in general this is a well studied subject, in Lumiera we confine ourselves to a very
specific flavour of diff handling: we rely on _elementary atomic units_ with well established
object identity. In addition, within the scope of one coherent diff handling task,
we require those elements to be 'unique'. The purpose of this design decision is to segregate
the notorious matching problem and treat diff handling in isolation.

Effectively this means that, for any given element, there can be at most _one_ matching
counterpart in the other sequence, and the presence of such a counterpart can be detected by using an *index*.
In fact, we retrieve an index for every sequence involved in the diff detection task;
this is our trade-off for simplicity in the diff detection algorithm.footnote:[traditionally,
diff detection schemes, especially those geared at text diff detection, go to great lengths
to produce an ``optimal'' diff, which effectively means building specifically tuned pattern
or decision tables, from which the final diff can then be pulled or interpreted.
We acknowledge that in our case building a lookup table index with additional annotations can be O(n^2^);
we might well be able to do better, but certainly at the price of a more mentally challenging algorithm.]
In case this turns out to be a performance problem, we might consider integrating the index
maintenance into the data structure to be diffed, which shifts the additional impact of
indexing onto the data population phase.footnote:[in the general tree diff case this is far
from trivial, since we need a self-contained element index for every node, and we need the
ability to take a snapshot of the ``old'' state before mutating a node into its ``new'' shape]

Element classification
~~~~~~~~~~~~~~~~~~~~~~
By using the indices of the old and the new sequence, we are able to _classify_ each element:

- elements only present in the new sequence are treated as *inserts*
- elements only present in the old sequence are treated as *deletes*
- elements present in both sequences form the *permutation*

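The classification can be sketched in a few lines. The following Python fragment is an illustration only, not the `lib::diff` implementation; the function name `classify` and the use of plain dictionaries as index are assumptions made for this example.

```python
def classify(old, new):
    """Sort elements into inserts, deletes and the common permutation part."""
    # index: element -> position; a dict stands in for the real index here
    old_index = {elm: pos for pos, elm in enumerate(old)}
    new_index = {elm: pos for pos, elm in enumerate(new)}
    if len(old_index) != len(old) or len(new_index) != len(new):
        raise ValueError("elements must be unique within each sequence")

    inserts = [elm for elm in new if elm not in old_index]
    deletes = [elm for elm in old if elm not in new_index]
    permutation = [elm for elm in old if elm in new_index]   # kept in old order
    return inserts, deletes, permutation


inserts, deletes, permutation = classify(list("abcde"), list("ceaxb"))
```

Here `x` ends up as the only insert, `d` as the only delete, and `a`, `b`, `c`, `e` form the permutation part.
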
Processing pattern
~~~~~~~~~~~~~~~~~~
We _consume both the old and the new sequence synchronously, while emitting the diff sequence_.

The diff describes a sequence of operations which, when applied, consume a sequence congruent
to the old sequence, while emitting a sequence congruent to the new sequence. We use the
following *list diff language* here:

verb `ins(elm)`::
insert the given argument element `elm` at the _current processing position_
into the target sequence. This operation allows injecting new data.
verb `del(elm)`::
delete the _next_ element `elm` at the _current position._
For the sake of verification, the element to be deleted is also included as argument (redundancy).
verb `pick(elm)`::
accept the _next_ element at the _current position_ into the resulting altered sequence.
The element is given redundantly as argument.
verb `push(elm)`::
effect a re-ordering of the target list contents. This verb takes
the _next_ element, which happens to sit at the _current processing position_, and
_pushes it back_ further into the list, to be placed at a position _behind_ the
_anchor element_ `elm` given as argument.

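To make the meaning of the four verbs concrete, here is a minimal interpreter sketch in Python. It is not the `lib::diff` API: the name `apply_diff`, the helper `_expect` and the encoding of verbs as `(verb, element)` tuples are assumptions for illustration. It further assumes that a pushed element is later accepted through an explicit `pick` once the processing position reaches it, which the description above does not spell out.

```python
def _expect(work, pos, elm):
    """Verify the redundantly given element against the current position."""
    if pos >= len(work) or work[pos] != elm:
        raise ValueError(f"element {elm!r} not found at current position")


def apply_diff(old, diff):
    """Replay a list diff on a copy of `old` and return the resulting list."""
    work = list(old)
    pos = 0                                   # current processing position
    for verb, elm in diff:
        if verb == "ins":
            work.insert(pos, elm)             # inject new data at the frontier
            pos += 1
        elif verb == "del":
            _expect(work, pos, elm)
            del work[pos]
        elif verb == "pick":
            _expect(work, pos, elm)
            pos += 1
        elif verb == "push":
            if pos >= len(work):
                raise ValueError("push: input sequence exhausted")
            moved = work.pop(pos)
            anchor_pos = work.index(elm, pos) # ValueError if the anchor is missing
            work.insert(anchor_pos + 1, moved)
        else:
            raise ValueError(f"unknown verb {verb!r}")
    if pos != len(work):
        raise ValueError("diff ended before the input sequence was consumed")
    return work


diff = [("push", "d"), ("pick", "b"), ("ins", "x"),
        ("del", "c"), ("pick", "d"), ("pick", "a")]
result = apply_diff(["a", "b", "c", "d"], diff)
```

Applied to the old sequence `a b c d`, this diff yields `b x d a`: the leading `a` is pushed behind the anchor `d`, `x` is inserted and `c` is deleted.
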
Since _inserts_ and _deletes_ can be detected and emitted right at the processing frontier,
for the rest of this theoretical discussion we consider the insert / delete part as filtered
away conceptually, and concentrate on generating the permutation part.

Handling sequence permutation
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
This paragraph describes how to consume two permutations of the same sequence simultaneously,
while emitting `push` and `pick` verbs to describe the re-ordering. Consider the sequences
split into an already-processed part and a part still to be processed.

.Invariant
Matters are arranged such that, in the to-be-processed part, each element appearing at the
front of the ``new'' sequence _can be picked right away_.

Now, to arrive at that invariant, we especially have to deal with the case that a different
(non-matching) element appears at the front of the ``old'' list. We have to emit additional
`push` verbs to get rid of non-matching elements in the ``old'' order, until we get into a state
where the invariant is re-established (and we're able to `pick` to consume the same element
from the existing sequence and the target sequence). Obviously, the tricky part is how to
determine the *anchor element* for this `push` directive...

- we need to be sure the anchor _is indeed present_ in the current shape of the sequence being processed.
- the anchor must be in the right place, so as to conform to the target sequence at the point of picking it.
- it is desirable to emit at most one `push` directive for any given element; we want it to settle at the
  right place with a single shot

.Some example push-directives
[frame="topbot",grid="none", width="40%", float="right"]
[cols=">s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m"]
|======
|1 ||1 ||1 ||1 ||1 ||1 ||1 ||1|

|2 ||2 ||2 ||2 ||2 ||2 ||2 ||2|

|4 |->3
|5 |->3
|4 |->3
|6 |->3
|6 |->3
|5 |->3
|6 |->5
|5 |->4

|5 |->4
|4 |->3
|5 |->4
|5 |->3
|4 |->3
|4 |->3
|4 |->3
|7 |->5

||
||
|6 |->5
|4 |->3
|5 |->4
|6 |->5
||
||

|3 ||3 ||3 ||3 ||3 ||3 ||3 ||3 |

||
||
||
||
||
||
|5 |
|4 |
|======

To find out about rules for choosing the anchor element, we look at the sequence in ``old'' order,
but we indicate the _index number in the ``new'' order_ for each element. Any discontinuity in the
ascending sequence of these numbers indicates that we have to push elements back, until encountering
the lowest number still missing. The examples given in the table show such a gap after the
second element: we have to push back until we find the third element.

Basically we get some kind of increasing ``*water level*'': the continuous sequence prefix, i.e. the highest
number where all predecessors are present already. An element at this level can be picked and thus
consumed, since we're in conformance with the desired target sequence _up to this point_. Any
elements still ``above water level'', however, can not yet be consumed but need to be pushed back, since
some predecessor has yet to arrive. If we attribute each element with the water level reached
_at the point when we are visiting this element,_ we get a criterion for possible anchor elements:
what is above water level can not be an anchor, since it needs to move itself. But any element
at water level is usable. In addition, any element already pushed once can serve as an anchor
too. This follows by a recursive argument: it has been moved behind a proper anchor, and thus will
in turn remain stable. Of all the possible candidates we have to use the largest possible predecessor,
otherwise there would be the possibility of messing up the ordering (e.g. if you place 6 behind 3
instead of 5).

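The water level attribution can be computed in one linear scan. The sketch below is an illustration in Python, with the hypothetical name `water_level_marks`. It assumes that both sequences are permutations of the same unique elements, and it relies on one reading of the description above: an element sits at water level when visited exactly if no smaller number follows it in the ``old'' order.

```python
def water_level_marks(old, new):
    """Annotate `old` with new-order numbers and mark each element pick/push."""
    number = {elm: pos + 1 for pos, elm in enumerate(new)}   # 1-based, as in the table
    numbers = [number[elm] for elm in old]

    marks = [None] * len(numbers)
    lowest_behind = float("inf")
    for i in range(len(numbers) - 1, -1, -1):                # scan from the back
        # a smaller number further behind means a predecessor has yet to arrive
        marks[i] = "pick" if numbers[i] < lowest_behind else "push"
        lowest_behind = min(lowest_behind, numbers[i])
    return list(zip(numbers, marks))


marks = water_level_marks([1, 2, 6, 4, 3, 5], [1, 2, 3, 4, 5, 6])
```

For the old order `1 2 6 4 3 5`, the elements 6 and 4 are marked `push`, while 1, 2, 3 and 5 are marked `pick`.
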
.Rules
. pick at water level
. push anything above water level
. use as anchor the largest possible from...
  * elements at water level
  * or elements already pushed into place

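The three rules can be turned into a small generator for the permutation part. The following Python sketch is one possible reading of the rules, not the actual implementation; the name `permutation_diff` and the tuple encoding of verbs are assumptions. It simulates the sequence while emitting verbs, and it emits an explicit `pick` for an element once it has been pushed into place and is reached again.

```python
def permutation_diff(old, new):
    """Emit pick/push verbs turning `old` into `new` (same elements, reordered)."""
    number = {elm: pos for pos, elm in enumerate(new)}     # index of the new order
    if len(number) != len(new) or len(old) != len(new) or set(old) != set(new):
        raise ValueError("old and new must be permutations of unique elements")

    # elements that will be picked in place: no smaller number follows them
    settled = set()
    lowest_behind = len(new)
    for elm in reversed(old):
        if number[elm] < lowest_behind:
            settled.add(elm)
            lowest_behind = number[elm]

    work = list(old)      # simulated state of the sequence during application
    diff = []
    level = 0             # water level: work[:level] already equals new[:level]
    while level < len(work):
        elm = work[level]
        if number[elm] == level:
            diff.append(("pick", elm))                      # rule 1
            level += 1
        else:
            # rule 3: largest predecessor among settled elements (nested scan)
            anchor = max((s for s in settled if number[s] < number[elm]),
                         key=number.get)
            work.pop(level)                                 # rule 2
            work.insert(work.index(anchor) + 1, elm)
            settled.add(elm)                                # pushed into place
            diff.append(("push", anchor))
    return diff


verbs = permutation_diff([1, 2, 6, 4, 3, 5], [1, 2, 3, 4, 5, 6])
```

For the old order `1 2 6 4 3 5` this produces `pick 1`, `pick 2`, `push` behind 5, `push` behind 3, followed by picks of 3, 4, 5 and 6, in line with the corresponding column of the example table. The nested `max` scan is what makes the sketch quadratic.
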
Implementation and Complexity
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
We need an index lookup for an element from the ``old'' sequence to find the corresponding index number
in the ``new'' sequence. Based on this attribution, the ``water level'' attribution can be calculated
in the same linear pass. So we get two preprocessing passes: one over the ``new'' sequence and one over
the ``old'' sequence, using lookups into the ``new''-index. After these preparations, the diff can be emitted
in a further pass.

In fact we do not even need the numerical ``water level''; we only need the relations. This allows us to extend
the argumentation to include the deletes and inserts and treat everything from a single list. Unfortunately,
the search for suitable anchor elements turns the algorithm into *quadratic complexity*: essentially this
is a nested sub-pass to find a maximum, so O(n^2^) will dominate the O(n log n) from indexing.footnote:[In
the theoretical treatment of diff problems it is common to introduce a *distance metric* to describe
how _far apart_ the two sequences are in terms of atomic changes. This helps to make the quadratic
(or worse) complexity of such algorithms look better: if we know the sequences are close, the nested
sub-scans will be shorter than the whole sequence (with n·d < n^2^). In our case, we would be able to
find the anchor in close vicinity of the current position. +
However, since our goal is to support permutations and we have to deal with arbitrary sequences, such
an argument is somewhat pointless. Let's face it, structural diff computation is expensive; the only
way to keep matters under control is to keep the local sequences short, which means exploiting structural
knowledge instead of comparing the entire data as a flat sequence]
The additional space requirement footnote:[in _addition_ to the storage for the ``old'' and ``new'' sequence
plus the storage for the generated diff output] of our solution is O(`len(old)` + `len(new)`).

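As a rough summary, the cost estimate can be written down as follows. This is only a sketch under the assumptions stated above: n denotes the length of the longer sequence, indexing is taken as O(n log n), and each anchor search is assumed to scan up to the whole sequence.

```latex
\begin{aligned}
T(n) &= \underbrace{O(n \log n)}_{\text{indexing}}
      + \underbrace{\sum_{i=1}^{n} O(n)}_{\text{anchor search per element}}
      = O(n^2) \\
S(n) &= O\bigl(\mathrm{len}(old) + \mathrm{len}(new)\bigr)
\end{aligned}
```
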