DOC: reasoning behind the construction of our list diff algorithm
This is a theoretical description of our method; it gives the reasoning why it is correct, plus an assessment of size and complexity order.
parent 42f69b6cb5
commit fe9105f321
2 changed files with 114 additions and 3 deletions
@ -72,8 +72,8 @@ this is our trade-off for simplicity in the diff detection algorithm.footnote:[t
diff detection schemes, especially those geared at text diff detection, go to great lengths
to produce an ``optimal'' diff, which effectively means to build specifically tuned pattern
or decision tables, from which the final diff can then be pulled or interpreted.
We acknowledge that in our case building a lookup table index can be O(n log n); we might
well be able to do better, but certainly for the price of an algorithm more mentally challenging.]
We acknowledge that in our case building a lookup table index with additional annotations can be O(n^2^);
we might well be able to do better, but certainly for the price of an algorithm more mentally challenging.]
In case this turns out to be a performance problem, we might consider integrating the index
maintenance into the data structure to be diffed, which shifts the additional impact of
indexing onto the data population phase.footnote:[in the general tree diff case this is far
@ -137,5 +137,100 @@ determine the *anchor element* for this `push` directrive...
- it is desirable to emit at most one `push` directive for any given element; we want it to settle at the
right place with a single shot

.Some example push-directives
[frame="topbot",grid="none", width="40%", float="right"]
[cols=">s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m,>s,<2m"]
|======
|1 ||1 ||1 ||1 ||1 ||1 ||1 ||1|

|2 ||2 ||2 ||2 ||2 ||2 ||2 ||2|

|4 |->3
|5 |->3
|4 |->3
|6 |->3
|6 |->3
|5 |->3
|6 |->5
|5 |->4

|5 |->4
|4 |->3
|5 |->4
|5 |->3
|4 |->3
|4 |->3
|4 |->3
|7 |->5

||
||
|6 |->5
|4 |->3
|5 |->4
|6 |->6
||
||

|3 ||3 ||3 ||3 ||3 ||3 ||3 ||3 |

||
||
||
||
||
||
|5 |
|4 |
|======

To derive the rules for choosing the anchor element, we look at the sequence in ``old'' order,
but we label each element with its _index number in the ``new'' order_. Any discontinuity in the
ascending sequence of these numbers indicates that we have to push elements back until we encounter
the lowest number still missing. The examples given in the table show such a gap after the
second element -- we have to push back until we find the third element.

Basically we get some kind of increasing ``*water level*'': the continuous sequence prefix, i.e. the highest
number for which all predecessors are present already. An element at this level can be picked and thus
consumed -- since we conform to the desired target sequence _up to this point_. Any
elements still ``above water level'' cannot yet be consumed; they need to be pushed back, since
some predecessor has still to arrive. If we attribute each element with the water level reached
_at the point when we are visiting this element,_ we get a criterion for possible anchor elements:
whatever is above water level cannot be an anchor, since it needs to move itself. But any element
at water level is usable. In addition, any element already pushed once can serve as an anchor
too. This follows by a recursive argument: it has been moved behind a proper anchor, and thus will
in turn remain stable. Of all the possible candidates we have to use the largest possible predecessor,
otherwise there would be the possibility of messing up the ordering (e.g. if you place 6 behind 3
instead of 5).

.Rules
. pick at water level
. push anything above water level
. use as anchor the largest possible predecessor from...
* elements at water level
* or elements already pushed into place
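
The rules above can be tried out on the numbered sequences used in the table. The following is only a minimal Python sketch under simplifying assumptions: the input is a permutation of 1..n (no inserts or deletes), the name `push_plan` and the textual output format are made up for illustration, and the later `pick` of an element that re-appears behind its anchor is left out.

```python
def push_plan(numbers):
    """numbers: the ``old'' sequence, each element replaced by its 1-based
    index in the ``new'' order (assumed to be a permutation of 1..n)."""
    # First pass: which numbers are at water level at the moment we visit them?
    at_level, seen, level = set(), set(), 0
    for k in numbers:
        if k == level + 1:          # all predecessors have been visited already
            at_level.add(k)
        seen.add(k)
        while level + 1 in seen:    # the water level rises over everything seen so far
            level += 1

    # Second pass: pick at water level, push everything else behind an anchor.
    plan, pushed = [], set()
    for k in numbers:
        if k in at_level:
            plan.append(f"pick {k}")
        else:
            # largest possible predecessor among usable anchors (nested scan)
            anchor = max(c for c in at_level | pushed if c < k)
            plan.append(f"push {k}->{anchor}")
            pushed.add(k)
    return plan


print(push_plan([1, 2, 4, 5, 3]))
print(push_plan([1, 2, 6, 4, 3, 5]))
```

Note that an element counts as ``at water level'' even if it is visited only later in the ``old'' sequence, which is why 5 can serve as anchor for 6 in the second call.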

Implementation and Complexity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We need an index lookup for an element from the ``old'' sequence to find the corresponding index number
in the ``new'' sequence. Based on this labelling, the ``water level'' attribution can be calculated
in the same linear pass. So we get two preprocessing passes, one for the ``new'' sequence and one for
the ``old'', using lookups into the ``new''-index. After these preparations, the diff can be emitted
in a further pass.
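
The three passes might be laid out roughly as follows. This is a hedged Python sketch, not the actual implementation: it assumes `new` is a duplicate-free permutation of `old`, and the tuple format of the directives as well as the helper `apply_diff` (which replays the directives, purely to check the result) are illustrative inventions.

```python
def list_diff(old, new):
    """Directives turning `old` into `new` (assumed: a duplicate-free permutation)."""
    # pass 1 over `new`: index lookup, element -> 1-based position in the new order
    new_index = {elem: pos for pos, elem in enumerate(new, start=1)}
    assert len(new_index) == len(new) == len(old) and set(old) == set(new)

    # pass 2 over `old`: label each element and note whether it is at water level when visited
    labelled, seen, level = [], set(), 0
    for elem in old:
        k = new_index[elem]
        labelled.append((elem, k, k == level + 1))
        seen.add(k)
        while level + 1 in seen:
            level += 1
    at_level = {k for _, k, usable in labelled if usable}

    # pass 3: emit the diff; the anchor search is the nested (quadratic) part
    diff, pushed = [], set()
    for elem, k, usable in labelled:
        if usable:
            diff.append(("pick", elem))
            while k + 1 in pushed:      # pushed elements re-appear right behind their anchor
                k += 1
                diff.append(("pick", new[k - 1]))
        else:
            anchor = max(c for c in at_level | pushed if c < k)
            diff.append(("push", elem, new[anchor - 1]))
            pushed.add(k)
    return diff


def apply_diff(old, diff):
    """Replay the directives on a copy of `old`; used here only for verification."""
    work, pos = list(old), 0
    for verb, elem, *anchor in diff:
        assert work[pos] == elem, "directive does not match the input"
        if verb == "pick":
            pos += 1
        else:                           # move the element directly behind its anchor
            del work[pos]
            work.insert(work.index(anchor[0], pos) + 1, elem)
    return work


old, new = list("abfdce"), list("abcdef")
diff = list_diff(old, new)
print(diff)
print(apply_diff(old, diff) == new)
```

The generator itself never moves any element; it only records which elements have been pushed, so that their re-appearance behind the anchor can be answered with a `pick`.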

In fact we do not even need the numerical ``water level''; we only need the relations. This allows us to extend
the argument to include the deletes and inserts and to treat everything from a single list. Unfortunately,
the search for suitable anchor elements gives the algorithm *quadratic complexity*: essentially this
is a nested sub-pass to find a maximum -- O(n^2^) will dominate the O(n log n) from indexing.footnote:[In
the theoretical treatment of diff problems it is common to introduce a *distance metric* to describe
how _far apart_ the two sequences are in terms of atomic changes. This helps to make the quadratic
(or worse) complexity of such algorithms look better: if we know the sequences are close, the nested
sub-scans will be shorter than the whole sequence (with n·d < n^2^). In our case, we would be able to
find the anchor in close vicinity of the current position. +
However, since our goal is to support permutations and we have to deal with arbitrary sequences, such
an argument is somewhat pointless. Let's face it, structural diff computation is expensive; the only
way to keep matters under control is to keep the local sequences short, which means to exploit structural
knowledge instead of comparing the entire data as a flat sequence]
The additional space requirements footnote:[in _addition_ to the storage for the ``old'' and ``new'' sequence
plus the storage for the generated diff output] of our solution are O(`len(old)` + `len(new)`).
@ -7723,7 +7723,7 @@ Before we can consider a diffing technique, we need to clarify the primitive ope
&rarr; [[Implementation considerations|TreeDiffImplementation]]
</pre>
</div>
<div title="TreeDiffImplementation" creator="Ichthyostega" modifier="Ichthyostega" created="201412210015" modified="201412210150" tags="Model GuiPattern design draft" changecount="8">
<div title="TreeDiffImplementation" creator="Ichthyostega" modifier="Ichthyostega" created="201412210015" modified="201412220418" tags="Model GuiPattern design draft" changecount="10">
<pre>//This page details decisions taken for implementation of Lumiera's diff handling framework//
This topic is rather abstract, since diff handling is multi-purpose within Lumiera: diff representation is seen as a meta language and abstraction mechanism; it enables tight collaboration without the need to tie and tangle the involved implementation data structures. Used this way, diff representation reduces coupling and helps to cut down overall complexity -- which justifies the considerable amount of complexity seen within the diff framework implementation.

@ -7741,6 +7741,22 @@ Provided as a loosely coupled collection of tools in the namespace {{{lib::diff}
:a diff represents the changes necessary to transform an input sequence ("old sequence") into a target sequence ("new sequence")
:differences are spelled out in ''linearised form'': as a sequence of constant-size diff actions (called &raquo;diff verbs&laquo;)
:they are conceived as operations, which, when applied consuming the input sequence, will produce the target sequence of the diff.

!{{red{WIP 12/2014}}}building a list diff detector
Transforming the algorithm sketch (&rarr; see [[technical documentation|http://lumiera.org/documentation/technical/library/DiffFramework.html]]) into working code poses some challenges at the level of technical details.

!!!interface of the index component
Obviously we want the helper indices to be an internal component abstraction, so the outline of the algorithm remains legible. Additionally, this allows us to introduce a strategy for obtaining the index. This is important, since in case of performance problems we might consider integrating the indexing effort into the external data structure, because this might make it possible to exploit external structural knowledge.
So the challenge is to come up with an API that is neither too high-level nor too low-level.

!!!calculating the »water level«
While obvious in theory, this is far from trivial in the presence of inserts and deletes, because it is no longer obvious when we encounter the next applicable element: it is no longer "n+1" but rather "n+d" with d interspersed deletes. We need to look ahead and write back our findings.

!!!criteria for the anchor search
The search for the anchor used in a push operation is basically a nested scan. But the range to scan, the abort condition and the selection of elements to be excluded from the search are technically challenging, since they rely on information available only transiently, right within the main diff generation pass. It boils down to very precise timing: when to exploit which additional "side-effect"-like knowledge.

!!!invocation mechanics for the decision of push vs pick
This is tricky since we're crossing the abstraction barrier created by the index component. The actual problem arises since we do //not want to execute// our own mutations while generating the diff -- that is, we do not actually want to push around input elements, just to keep track of what has been pushed already, and where. Unfortunately this means that the pushed elements are out of sight; we do not know when to assume that a pushed element re-appears in the input sequence at the new position. We need the index to help us with that information: the index needs to tell us when the next element we see is actually //behind// the position where the next {{{pick}}} directive is to be emitted to consume the pending element from the new sequence.
</pre>
</div>
<div title="TreeDiffModel" creator="Ichthyostega" modifier="Ichthyostega" created="201410270313" modified="201412210002" tags="Model GuiPattern spec draft" changecount="47">