Write down some fundamental insight regarding diff algorithms

Fischlurch 2014-11-07 03:58:37 +01:00
parent 85d24e980d
commit ed54f44b5e


@ -7703,7 +7703,25 @@ Using transitions is a very basic task and thus needs viable support by the GUI.
Because of this experience, ichthyo wants to support a more general case of transitions, which have N output connections, behave similarly to their "simple" counterpart, but leave out the mixing step. As a plus, such transitions can be inserted at the source ports of N clips or between any intermediary or final output pipes as well. Any transition processor capable of handling this situation should provide some flag, in order to decide whether it can be placed in such a manner. (Within the builder, encountering an inconsistently placed transition is just a [[building error|BuildingError]].)
</pre>
</div>
<div title="TreeDiffModel" creator="Ichthyostega" modifier="Ichthyostega" created="201410270313" modified="201411040353" tags="Model GuiPattern spec draft" changecount="38">
<div title="TreeDiffFundamentals" creator="Ichthyostega" modifier="Ichthyostega" created="201411070156" modified="201411070257" tags="Concepts design" changecount="6">
<pre>At several points within the implementation we encounter problems of structural difference computation, like finding the effective changes in a tree data structure, or using a diff representation as notification message format. It seems worthwhile to point out some observations and to relate our approach(es) to the problem as it is generally known.
!classical diff problem
One quite special flavour of diff calculation is in widespread use: the classical diff of textual data. Here, the treated data has a completely uniform linear structure -- it is either a sequence of lines or a sequence of bytes (&quot;binary diff&quot;). The individual data cells are built from a finite alphabet of comparable values (characters). Since a given character or word may be present several times in the same document, we need to establish a //matching// as foundation of any difference calculation: we need to identify the identical parts in order to find the differences. This matching is not unique, and it largely determines the differences found, so it is hard to decide if a given diff is //correct.// Thus, the core idea to attack and solve this classical diff problem was to look for an ''optimal diff''. Some variations include
* longest common substrings
* longest common subsequence
* minimal edit script to perform the diff
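To illustrate why the matching is not unique, here is a minimal sketch of the longest common subsequence variant in Python. It uses the textbook dynamic programming scheme; the function name {{{lcs}}} and the sample inputs are assumptions made for this illustration only and do not refer to any existing code.

```python
def lcs(a, b):
    # table[i][j] holds the length of the longest common
    # subsequence of the suffixes a[i:] and b[j:]
    zero_row = [0] * (len(b) + 1)
    table = [list(zero_row) for _ in range(len(a) + 1)]
    for i in reversed(range(len(a))):
        for j in reversed(range(len(b))):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # walk the table front to back to recover ONE optimal matching;
    # the tie-breaking rule below is an arbitrary choice
    matching = []
    i, j = 0, 0
    while i != len(a) and j != len(b):
        if a[i] == b[j]:
            matching.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] == table[i][j]:
            i += 1      # dropping a[i] does not shorten the result
        else:
            j += 1      # otherwise dropping b[j] is the safe step
    return matching
```

For the inputs {{{ab}}} and {{{ba}}}, both {{{a}}} and {{{b}}} are optimal matchings of length 1. Which one gets reported depends solely on the tie-breaking rule in the walk, and the resulting diff changes accordingly.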
!challenge of the diff problem
While, from a mathematical point of view, the above optimisation problem can be considered solved, in practice the available solutions are far from perfect. The fundamental assumption of a linear sequence as the basis of the diff turns out to be an oversimplification -- real world data carries meaning and can be judged by imposing additional structure. There can be structure violating diffs and there can be nonsensical and misleading diffs. In the light of this notion, every diff is in fact a structural diff.
Unfortunately, attempts to amend the beautiful mathematical solution by incorporating this additional structure turn out to be incredibly hard. Either the general case problem can be proven to be ~NP-hard, or exploiting some special structural properties renders the meaning of an &quot;optimal&quot; diff solution more or less arbitrary. The quest for a generic diff problem and a universal solution turns out to be a dead end.
!fundamentals of diff handling
The most fundamental distinction is the difference between //finding// a diff and //representing// a diff. The former is concerned with uncovering structural relations, while the latter deals with knowledge about structural relations and thus is more general. It is possible to capture structural relations while they emerge -- this way describing a process of transformation. An equally fundamental distinction is between //reordering// elements and //mutating// them. This is related to the notion of //identity,// which in turn implies an underlying model of the elements and entities to be considered for diffing. If, as an example, we model the tokens, functions and classes of program code as mere characters, we will be happily matching curly braces and cannot expect meaningful diffs. So we need a theory about what can possibly happen, and we use this as a foundation to establish the representation of such possible processes. The choices taken here can make all the difference towards efficient and usable methods. The //matching problem// should be viewed from here: a matching is actually a hypothesis about possible processes of transformation -- this observation explains why the matching strategy is at the core of any diff algorithm. If we use some arbitrary notion of optimality, we have to consider endless combinations and end up with a lot of accidental complexity. Yet if we manage to base the calculation on a representation well aligned with the nature of the entities in question, the matching can be as simple as retrieving a given entry by ID. The actual work is reduced to extracting the actual changes in data.
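As a rough sketch of that last point, assume every element carries a unique and stable ID. The matching then collapses into a dictionary lookup, and reordering can be told apart from mutation. The function {{{diff_by_id}}} and the layout of elements as (ident, payload) pairs are hypothetical, chosen only for this Python illustration.

```python
def diff_by_id(old, new):
    # old, new: lists of (ident, payload) pairs.
    # Assumption: idents are unique within each list.
    old_index = dict(old)
    new_index = dict(new)

    # matching == lookup by ID, no search for an optimum involved
    removed = [i for i, _ in old if i not in new_index]
    inserted = [i for i, _ in new if i not in old_index]

    # mutation: same identity, different payload
    mutated = [i for i, p in new
               if i in old_index and old_index[i] != p]

    # reordering: the surviving IDs appear in a different sequence
    kept_old = [i for i, _ in old if i in new_index]
    kept_new = [i for i, _ in new if i in old_index]

    return {
        'removed': removed,
        'inserted': inserted,
        'mutated': mutated,
        'reordered': kept_old != kept_new,
    }
```

Note that no notion of optimality appears anywhere in this sketch: the result is fully determined by the identity scheme, and the remaining work is a plain comparison of payloads.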
Before we can consider a diffing technique, we need to clarify the primitive operations used as foundation. These primitives form a language, and choosing this language is a matter of context. We might base the representation language on the situation where the diff is applied, we may care for the conciseness of the representation, or for the effort and space required to apply it. We should consider if the focus is on reading and understanding the diff, if we intend to derive something directly from the diff representation -- say, whether, what and when something changed -- and we should consider how the context where the diff is established relates to the context where it is applied.
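To make the idea of primitives forming a language more tangible, here is one conceivable choice, sketched in Python. It is geared towards cheap application: the diff is a flat sequence of verbs, consumed in a single pass over the source without any lookahead. The verbs {{{ins}}}, {{{del}}} and {{{pick}}} and the function {{{apply_diff}}} are hypothetical and serve only as illustration; they are not meant as the representation actually chosen.

```python
def apply_diff(source, script):
    # script: list of (verb, element) pairs, read strictly left to right.
    #   ins  : emit a new element into the target
    #   del  : drop the next source element
    #   pick : carry the next source element over unchanged
    src = list(source)
    pos = 0
    target = []
    for verb, elem in script:
        if verb == 'ins':
            target.append(elem)
        elif verb in ('pick', 'del'):
            # both verbs name the element they expect, so a diff
            # applied to the wrong source is detected immediately
            if pos == len(src) or src[pos] != elem:
                raise ValueError('diff does not match source at position %d' % pos)
            if verb == 'pick':
                target.append(elem)
            pos += 1
        else:
            raise ValueError('unknown verb: %s' % verb)
    if pos != len(src):
        raise ValueError('diff ends before the source is consumed')
    return target
```

This particular language trades conciseness for simplicity of application: unchanged elements must be mentioned explicitly, and a reordering can only be expressed as a deletion plus an insertion. A language built for human readers or for compact transmission would make different choices.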
</pre>
</div>
<div title="TreeDiffModel" creator="Ichthyostega" modifier="Ichthyostega" created="201410270313" modified="201411070254" tags="Model GuiPattern spec draft" changecount="40">
<pre>For the purpose of handling updates in the GUI timeline display efficiently, we need to determine and represent //structural differences//.
We build a slightly abstracted representation of tree changes and use this to propagate //change notifications// to the actual widgets. To keep the whole process space efficient, a demand-driven, stateless implementation approach is chosen. This breaks the problem down into several layered stages.
* our model is a heterogeneous tree &amp;rArr; use demand-driven recursion
@ -7711,6 +7729,8 @@ We build a slightly abstracted representation of tree changes and use this to pr
* find changes in ordered collections of children &amp;rArr; symbolic list diffing algorithm
* problems with identity and state &amp;rArr; encapsulate state and have an airtight object identity scheme
Doubtless we're dealing with a highly specific application here. &amp;rarr; see [[discussion of diffing solutions|TreeDiffFundamentals]]
!list diffing algorithm
| !source data|!|!desired result |
|(a~~1~~, a~~2~~, a~~3~~, a~~4~~, a~~5~~) |&amp;hArr;| {{{delete}}}(a~~1~~, a~~2~~)&lt;br/&gt;{{{update}}}(a~~3~~, a~~5~~, a~~4~~)&lt;br/&gt;{{{insert}}}(//before a~~3~~//, b~~1~~)&lt;br/&gt;{{{insert}}}(//before a~~4~~//, b~~2~~, b~~3~~)&lt;br/&gt;{{{append}}}(b~~4~~)|