Based on the building blocks developed thus far,
it was possible to assemble a typical media processing chain
* two source nodes
* one of these passes data through a filter
* a mixer node on top to combine both chains
* time-based automation for processing parameters
As actual computation, hash-chaining on blocks of
reproducible random data was used, allowing to verify
for every data word that expected computations were
carried out, in the expected order.
This is a high-level integration test to sum up this development effort
* an advanced refactoring was carried out to introduce a
flexible and fully-typed binding for the ''processing-functor''
* this entailed a complete rework of the `FeedManifold` to integrate
inline storage for a ''parameter tuple'' and input / output ''buffer tuples''
* optional ''parameter functors'' were included into the design at a deep level,
closely related to the binding of the processing-functor
* the chosen design is thus a compromise between ''everything nodes''
and a ''dedicated parameter-handling'' at invocation level
As a proof-of-concept, an scheme to handle extended parameters was devised,
using a special »Param Agent Node« and extension storage blocks in stack memory.
While not immediately necessary, this design exercise proves the overall design
is flexible enough to accommodate future extended needs.
This is some quite technical and redundant code, which largely maps
the configured elements from the Builder-DSL level down into the delegate
builder functors. For the ''Param Agent Node,'' most of the structure
is already embedded deep into the `ParamWeavingPattern`, by virtue of a
tuple of parameter-functors, which are supplied to the builder-API
as a `ParamBuildSpec` (which in fact is in itself a builder and will be
used on a higher level to fill in suitable parameter-functors)
This changeset is assumed to complete the definition of a builder and
weaving pattern for a ''Param Agent Scheme'' — yet only the tests to be
elaborated next will show the extent to which this is true....
After this extended excursion to lift the internals of Node invocation
to the use of structured and typed data (notably the invocation parameters),
the »Playback Vertical Slice« continues to push ahead towards the goal of integration.
The existing code has been re-oriented and some aspects of node invocation have been reworked
in a prototyping effort, which (in part though the aforementioned rework)
is meanwhile on a good path to lead to a consolidated final version.
* ✔ building a simple Render Node works now with the revamped code
* 🔁invoking this simple Node ''should be just one step away'' (since all parts are known to work)
* ⌛ the next step would then be to build a Node outfitted with a ''Parameter Functor'', which is the new concept introduced by recent changes
* ⌛ this should then get us at the point to take the hurdle of invoking one of our **Random Test** functions as a Render Node
Remove left-overs from the preceding prototypical implementation,
which is now obliterated by the change to a flexibly configured `FeedManifold`
with structured, typed storage for buffers and for parameter data.
The Render Node invocation sequence, as rearranged and reworked for the »Playback Vertical Slice«, now seems reasonably clear and settled.
Adding extensive documentation to describe the conventions and structures worked out thus far;
moreover, start makeover of old documentation in the !TiddlyWiki to remove concepts obviously obsoleted now...
After augmenting our `lib/random.hpp` abstraction framework to add the necessary flexibility,
a common seeding scheme was ''built into the Test-Runner.''
* all tests relying on some kind of randomness should invoke `seedRand()`
* this draws a seed from the `entropyGen` — which is also documented in the log
* individual tests can now be launched with `--seed` to force a dedicated seed
* moreover, tests should build a coherent structure of linked generators,
especially when running concurrently. The existing tests were adapted accordingly
All usages of `rand()` in the code base were investigated and replaced
by suitable calls to our abstraction framework; the code base is thus
isolated from the actual implementation, simplifying further adaptation.
The immediate next goal is to verify properties of render nodes
generated by the builder framework; two kinds of validations
can be distinguished
* structural aspects of the wiring
* the fact that processing functionality is invoked in proper order
Looking into the structural aspects brings about the necessity
to identify the actual processing function bound into some functor.
Some recapitulation of goals and requirements revealed, that this
can not be a merely technical identity record — because the intention
is to base the ''cache key'' on chained processing node identities,
so that the key is stable as long as the user-visible results will be
equivalent. And while structural data can be aggregated, at the
core this information must be provided by the scheme embedded
into the domain ontology, which is tasked with invoking the
builder in order to implement a ''specific processing-asset''
Not entirely sure how to use the `emit()` call properly,
assuming that it means that data is complete in buffer,
but can still be read after that point
...the complexity of details is a nightmare
...still fighting to grasp a generic structure allowing to ''fold down''
the details into the specific ''domain ontologies'' for the media libraries
The immediate next step is to build some render nodes directly
in a test setting, without using any kind of ''node factory.''
Getting ahead with this task requires to identify the constituents
to be represented on the first code layer for the reworked code
(here ''first layer'' means any part that are ''not'' supplied
by generic, templated building blocks).
Notably we need to build a descriptor for the `FeedManifold` —
which in turn implies we have to decide on some fundamental aspects
of handling buffers in the render process.
To allow rework of the `ProcNode` connectivity, a lot of presumably obsoleted
draft code from 2011 has to be detached, to be able to keep it in-tree
for further reference (until the rework and refactoring is settled).
As outlined in #1367, the integration effort requires some rework
of existing code, which will be driven ahead by the `NodeLinkage_test`
* redefine Node Connectivity
* build simple `ProcNode` directly in scope
* create an `TurnoutSystem` instance
* perform a ''dummy Node-Invocation''
...which seems to be basically fine thus far
...beyond some renaming and rearranging
''it turns out that the final, crucial links,
necessary to tie all together, are yet to be developed''
Facing quite some difficulties here, since there are (at least)
two abandoned past efforts towards building a render node network
in the code base; the structure and architecture decisions from these
previous attempts seem largely valid still, yet on a technical level,
the style of construction evolved considerably in the meantime. Moreover,
these old fragments of code, written during the early stages of the
project, were lacking clear goals and anchor points at places;
the situation is quite different now in this respect.
Sticking to well proven practice, the rework will be driven by a test setup,
and will progress over three steps with increasing levels of integration.
The initial effort of building a Scheduler can now be **considered complete**
Reaching this milestone required considerable time and effort, including
an extended series of tests to weld out obvious design and implementation flaws.
While the assessment of the new Scheduler's limitation and traits is ''far from complete,''
some basic achievements could be confirmed through this extended testing effort:
* the Scheduler is able to follow a given schedule effectively,
until close up to the load limit
* the ''stochastic load management'' causes some latency on isolated events,
in the order of magnitude < 5ms
* the Scheduler is susceptible to degradation through Contention
* as mitigation, the Scheduler prefers to reduce capacity in such a situation
* operating the Scheduler effectively thus requires a minimum job size of 2ms
* the ability for sustained operation under full nominal load has been confirmed
by performing **test sequences with over 80 seconds**
* beyond the mentioned latency (<5ms) and a typical turnaround of 100µs per job
(for debug builds), **no further significant overhead** was found.
Design, Implementation and Testing were documented extensively in the [https://lumiera.org/wiki/renderengine.html#Scheduler%20SchedulerProcessing%20SchedulerTest%20SchedulerWorker%20SchedulerMemory%20RenderActivity%20JobPlanningPipeline%20PlayProcess%20Rendering »TiddlyWiki« #Scheduler]
The behaviour seems consistent and the schedule breaks at the expected point.
At first sight, concurrency seems slightly to low; detailed investigation
however shows that this is due to the structure of the load graph,
and in fact the run time comes close to optimal values.
the `BreakingPoint` tool conducts a binary search to find the ''stress factor''
where a given schedule breaks. There are some known deviations related to the
measurement setup, which unfortunately impact the interpretation of the
''stress factor'' scale. Earlier, an attempt was made, to watch those factors
empirically and work a ''form factor'' into the ''effective stress factor''
used to guide this measurement method.
Closer investigation with extended and elastic load patters now revealed
a strong tendency of the Scheduler to scale down the work resources when not
fully loaded. This may be mistaken by the above mentioned adjustments as a sign
of a structural limiation of the possible concurrency.
Thus, as a mitigation, those adjustments are now only performed at the
beginning of the measurement series, and also only when the stress factor
is high (implying that the scheduler is actually overloaded and thus has
no incentive for scaling down).
These observations indicate that the »Breaking Point« search must be taken
with a grain of salt: Especially when the test load does ''not'' contain
a high degree of inter dependencies, it will be ''stretched elastically''
rather than outright broken. And under such circumstances, this measurement
actually gauges the Scheduler's ability to comply to an established
load and computation goal.
...this seems to be the last topic for this investigation of Scheduler behaviour;
the goal is to demonstrate readiness for stable-state operation over an extended period of time
- use parameters known to produce a clean linear model
- assert on properties of this linear model
Add extended documentation into the !TiddlyWiki,
with a textual account of the various findings,
also including some of the images and diagrams,
rendered as SVG
Use the statistic functions imported recently from Yoshimi-test
to compute a linear regression model as immediate test result.
Combining several measurement series, this allows to draw conclusions
about some generic traits and limitations of the scheduler.
- the goal is to run a binary search
- the search condition should be factored out
- thus some kind of framework or DSL is required,
to separate the technicalities of the measurement
from the specifics of the actual test case.
...the idea is to use the sum of node weights per level
to create a schedule, which more closely reflects the distribution
of actual computation time. Hopefully such a schedule can then be
squeezed or stretched by a time factor to find out a ''breaking point'',
at which the Scheduler is no longer able to keep up.
while my basic assessment is still that contention will not play a significant
role given the expected real world usage scenario — when testing with
tighter schedule and rather short jobs (500µs), some phases of massive contention
can be observed, leading to significant slow-down of the test.
The major problem seems to be that extended phases of contention will
effectively cause several workers to remain in an active spinning-loop for
multiple microseconds, while also permanently reading the atomic lock.
Thus an adaptive scheme is introduced: after some repeated contention events,
workers now throttle down by themselves, with polling delays increased
with exponential stepping up to 2ms. This turns out to be surprisingly
effective and completely removes any observed delays in the test setup.
This is the first kind of integration,
albeit still with a synthetic load.
- placed two excessive load peaks in the scheduling timeline
- verified load behaviour
- verified timings
- verified that the scheduler shuts down automatically when done
- Ensure the grooming-token (lock) is reliably dropped
- also explicitly drop it prior to trageted sleeps
- properly signal when not able to acquire the token before dispatch
- amend tests broken by changes since yesterday
Notably the work-function is now completely covered, by adding
this last test, and the detailed investigations yesterday
ultimately unveiled nothing of concern; the times sum up.
Further reflection regarding the overall concept led me
to a surprising solution for the problem with priority classes.
- investigate consistency guarantees through acquire-release
==> turns out we do not need a fence, but it is tantamount
to have a guard variable and actually load and check
the value to ensure we indeed get a happens-before
- elaborate design of the WorkForce
+ no shared control variables necessary
+ no ability to forcibly shut-down the WorkForce
+ rather, all control will be exerted through the return value
of the Work-Functor
...to show in test that indeed the actual time is retrieved on each activation,
we can assign a λ -- which is rigged to increase the time on each access
...turns out there is still a lot of leeway in the possible implementation,
and seemingly it is too early to decide which case to consider the default.
Thus I'll proceed with the drafted preliminary solution...
- on primary-chain, an inhibited Gate dispatches itself into future for re-check
- on Notification, activation happens if and only if this very notification opens the Gate
- provide a specifically wired requireDirectActivation() to allow enforcing a minimal start time
Solved by special treatment of a notification, which happens
to decrement the latch to zero: in this case, the chain is
dispatched, but also the Gate is locked permanently to block
any further activations scheduled or forwareded otherwise
essentially define a concept how to ''perform'' render activities in the Scheduler.
This entails to specify the operation patterns for the four known base cases
and to establish a setup for the implementation.
Further extensive testing with parameter variations,
using the test setup in `BlockFlow_test::storageFlow()`
- Tweaks to improve convergence under extreme overload;
sudden load peaks are now accomodated typically < 5 sec
- Make the test definition parametric, to simplify variations
- Extract the generic microbenchmark helper function
- Documentation
..using a simplistic implementation for now: scale down the
Epoch-stepping by 0.9 to increase capacity accordingly.
This is done on each separate overflow event, and will be
counterbalanced by the observation of Epoch fill ratio
performed later on clean-up of completed Epochs
Iteration should just yield an Reference to an Extent,
thereby hiding all details of the actual raw storage (char[]).
This can be achieved by usind a wrapper type around a pointer
into the managing vector; from this pointer we may convert
into a vector::iterator with the trick described here
https://stackoverflow.com/a/37101607/444796
Furthermore, continued planning of the Activity-Language,
basically clarified the complete usage scenario for now;
seems all implementable right away without further difficulties
- decision to favour small memory footprint
- rather use several Activity records to express invocation
- design Activity record as »POD with constructor«
- conceptually, Activity is polymorphic, but on implementation
level, this is "folded down" into union-based data storage,
layering accessor functions on top
- decision how to handle the Extent storage (by forced-cast)
- decision to place the administrative record directly into the Extent
TODO not clear yet how to handle the implicit limitation for future deadlines
using a simple yet performant data structure.
Not clear yet if this approach is sustainable
- assuming that no value initialisation happens for POD payload
- performance trade-off growth when in wrapped-state vs using a list
- define a simple record to represent the Activity
- define a handle with an ordering function
- low-level functions to...
+ accept such a handle
+ pick it from the entrace queue
+ pass it for priorisation into the PriQueue
+ dequeue the top priority element