...attempt to get this intricate state machine sorted out
Notification turned out quite tricky, since it may emanate
from a concurrently executed phase and we try to avoid having
to protect the gate directly with a lock; rather we re-dispatch
the notification through the queue, which indirectly also ensures
that the worker de-queuing the NOTIFY-Activity operates in
management mode (single threaded, holding the GroomingToken)
Decision how to handle a failed Gate-check
- spin forward (re-scheduler) by some time amount
- this spin-offset parameter is retrieved from the Execution Context
- thus it will be some kind of engine parameter
With these determinations and the framework for the Execution Context
it is now possible to code up the logic for Gate check, which in turn
can then be verified by the watchGate diagnostics
due to technical limitations this requires to wire the adaptor
as replacement for the subject Activity, so that it can capture
and log the activation, and then pass it on to its watched subject
...turns out that util::toString does not explicitly handle pointers differently,
for very good reasons; this function must always work, always produce a simple and
compact representation, and it must be possible to instantiate the template
and take a function reference (which precludes adding an overload for pointers)
requires to supplement EventLog matching primitives
to pick and verify a specific positional argument.
Moreover, it is more or less arbitrary which job invocation parameters
are unpacked and exposed for verification; we'll have to see what is
actually required for writing tests...
doing so would contradict the fundamental architecture,
all kinds of failures and timeouts need to be handled within
Scheduler-Layer-2 rather.
Jobs are never aborted, nor do they need to know if and when they are invoked
Testcase (detect function invocation) passes now as expected
Some Library / Framework changes
- rename event-log-test.cpp
- allow the ExpectString also to work with concatenated expectation strings
Remark: there was a warning in the comment in event-log.hpp,
pointing out that negative assertions are shallow.
However, after the rework in 9/2018 (commit: d923138d1)
...this should no longer be true, since we perform proper backtracking,
leading to an exhaustive search.
ActivityMatch inherits privately from the EventMatch object,
and is thus able to delegate relevant matching queries, but
also to provide high-level special matchers.
This new design resolves the ambiguity regarding function arguments.
Moreover, we can now record the current sequence-Number as *attribute*
in the respective log record (this is the benefit of using structured
log entries instead of just a textual log), thereby avoiding the various
pitfalls with explicit bracketing sequence-number log entries
bottom line: this reworked design seems to be a better fit,
even while technically the implementation with the wrapped matcher
is somewhat ugly...
The EventLog seems to provide all the building blocks, but we need
some higher level special matchers (and maybe we also want to hide
some of the basic EventLog matchers). A soulution might be to wrap
the EventMatcher and delegate all follow-up builder calls.
This seems adequate, since the EventLog-Matcher is basically used as black box,
building up more elaborate matchers from the provided basic matchers...
Spent some time again to understand how EventLog matching works.
My feelings towards this piece of code are always the same: it is
somewhat too "tricky", but I am not aware of any other technique
to get this degree of elaborate chained matching on structured records,
short of building a dedicated matching engine from scratch.
The other alternative would be to use a flat textual log (instead of
the structured log records from EventLog), but then we'd have to
generate quite intricate regular expressions from the builder,
and I'm really doubtful it would be easier and clearer....
...turns out this is entirely generic and not tied to the context
within ActivityDetector, where it was first introduced to build a
mock functor to log all invocations.
Basically this meta-function generates a new instantiation of the
template X, using the variadic argument pack from template U<ARGS...>
...for coverage of the Activity-Language,
various invocations of unspecific functions must be verified,
with the additional twist that the implementation avoids indirections
and is thus hard to rig for tests.
Solution-Idea: provide a λ-mock to log any invocation into the
Event-Log helper, which was created some years ago to trace GUI communication...
essentially define a concept how to ''perform'' render activities in the Scheduler.
This entails to specify the operation patterns for the four known base cases
and to establish a setup for the implementation.
Further extensive testing with parameter variations,
using the test setup in `BlockFlow_test::storageFlow()`
- Tweaks to improve convergence under extreme overload;
sudden load peaks are now accomodated typically < 5 sec
- Make the test definition parametric, to simplify variations
- Extract the generic microbenchmark helper function
- Documentation
There seems to be a ''sweet spot'' for somewhat larger Epoch sizes around 500 slots.
At least in the test setup used here, which works with a load of 200 Frames / sec,
which is significantly over the typical value of 50fps (video + audio) for simple playback.
The optimisation of averaged allocation times can not be much improved **below 30ns**.
Overall, this can be considered a good result,
since this allocation scheme does way more than just allocate memory,
it also provides a means to track dependencies and lifecycle.
__For context__:
- we should strive at processing one frame in ~ 10ms
- for 10 Activity records per Frame, we currently use < 0.5 µs for
memory and dependency management in the scheduler
- this leaves enough room for the further administrative efforts
(priority queue, job planning, buffer management)
... while this a comparison of apples and oranges, since the standard
heap allocator does not offer any dependency and lifecycle managmenet,
while the BlockFlow scheme developed here is much more complex and
offers a lifetime and dependency control specifically tailored to
the needs of the Scheduler.
Anyway, with the latest tweaks and refactorings, the test case
now shows averaged times per allocation on a comparable level
(both in the range of ~30ns)
BUT -> +50% runtime in -O3 (+20ns)
Investigation seems to indicate
- that the increased (+1 Epochs, 10 -> 11) moving average
caused the Algo to perform worse (strong effect)
- that the Optimiser has problems with boost::rational, which however
yields only a minute effect (+5ns), and only on the critical path
The access via Meyers Singleton has no adverse effect,
rather the new setup gives a tiny benefit (46ns -> 37ns).
Surprisingly, the increased pre-allocation has no observable effect.
On the long run, there will be a central Render Engine parametrisation;
some parameters can even be expected to be dynamic; thus prepare the
BlockFlow allocator to fit in with this expectation
For comparison: use individual managment by refcount.
This supports the conclusion that BlockFlow is more than just a
custom allocator; it also supports a non-trivial lifetime management,
and this comes at a cost.
Playing around with various load patterns uncovers further weak spots
in the regulation mechanism. As a remedy, introduce a stronger feed-back
and especially set the target load factor from 100% -> 90%
to add some headroom to absorb intermittent load peaks
Presumably ''much more observation and fine-tuning'' will be necessary
under real-world load conditions (⟹ Ticket #1318 for later)
- BUG: must prevent the Epoch size to become excessive low
- Problem: feedback signal should not be overly aggressive
Fine-Tuning:
- Dose for Overflow-compensation is delicate
- Moving average and Overflow should be balanced
- ideally the compensatory actions should be one order of magnitude
slower than the characteristic regulation time
Improvement: perform Moving-Average calculations in doubles
...leading to PATHETICALLY bad timing comparison
...it seems clear that the Epoch-Step went to zero
(which was neither anticipated, nor protected against)
However, even individual heap allocations fare surprisingly well
under full optimisation; just they don't solve our problem with
tracking dependencies; the most simplest solution that would
also fulfil this requirement would be using shared_ptr
..as a heuristic to regulate optimal Epoch duration;
when Epochs are discarded, the effective fill factor can be used
to guess an Epoch duration time, which would (in hindsight)
lead to perfect usage of storage space
..using a simplistic implementation for now: scale down the
Epoch-stepping by 0.9 to increase capacity accordingly.
This is done on each separate overflow event, and will be
counterbalanced by the observation of Epoch fill ratio
performed later on clean-up of completed Epochs
further implementation makes clear that the AllocationHandle,
which is the primary usage front-end, has to rely both on
services of the underlying ExtentFamily allocator, as well
as on the BlockFlow itself for managing the Epoch spacing.
- fix a bug in IterExplorer: when iterating a »state core« directly,
the helper CoreYield passed the detected type through ValueTypeBindings.
This is logically wrong, because we never want to pick up some typedefs,
rather we always want to use the type directly returned from CORE::yield()
Here the iterator returns an Epoch&, which itself is again iterable
(it inherits from std::array<Activity, N>). However, it is clear
that we must not descent into such a "flatMap" style recursive expansion
- draft a simple scheme how to regulate Epoch lengths dynamically
- add diagnostics to pinpoint a given Activity and find out into which
Epoch it has been allocated; used to cover the allocator behaviour
- add preliminary deadline-check (directly instead of using the Activity)
- with this shortcut, now able to implement discarding obsoleted Epochs
- Iteration and use of the underlying `ExtentFamily` is also settled by now
💡 ''Implementation concept for the allocation scheme complete and validated''
...with the preceding IterableDecorator refactoring,
the navigation and access to the storage extents can now be
organised into a clear progression
Allocator::iterator -> EpochIter -> Epoch&
Convenience management and support functions can then be
pushed down into Epoch, while iteration control can be done
high-level in BlockFlow, based on the helpers in Epoch
Especially for the BlockFlow allocator, sanity checks are elided
for performance reasons; yet, generally speaking, it can be a very bad idea
to "optimise" away sanity checks. Thus an additional adaptor is provided
to layer such checks on top of an existing core; and IterEplorer now
always wires in this additional adaptor, and so the original behaviour
is now restored in this respect (and for the largest part of the code base)
While at first sight just a superficial variation of the existing IterStateWrapper,
it became clear with the evolution of the IterExplorer framework that
this setup represents a distinct concept, and especially lends itself
for complex and cohesive collaboration in a layered pipeline. Which
may, or may not be a good idea, depending on the circumstances.
Now, for the implementation of the scheduler memory allocation scheme,
another twist is added to the picture: we can not effort the sanity checks
on each access, even more so when layering / adapting iterators, where
it is essential that the optimiser can remove all unnecessary warts.
..this is the most simple case, where no Epochs are opened yet
..add diagnostics to inspect alloc count and deadlines
..add accessors for the first/last underlying Extent
...continue to proceed test-driven
...scheduler internals turn out to be intricate and cohesive,
and thus the only hope is to adhere to strict testing discipline
Library: add "obvious" utility to the IterExplorer, allowing to
materialise all contents of the Pipeline into a container
...use this to take a snapshot of all currently active Extent addresses
- use a checksum to prove that ctor / dtor of "content" is not invoked
- let the usage of active extents "wrap around" so that the mem block is re-used
- verify that the same data is still there
The low-level allocator is basically implemented now,
but we still need to check thoroughly that the tricky
wrap-around and expansion logic behaves sane...
(see #1311)
Using a Storage* within a wrapper as "pos" will work,
but is borderline trickery, since it amounts to subverting
the idea behind IterAdapter (which is to encapsulate a target
pointer with some control-logic in the managing container).
Using the same storage size and implementation overhead,
it is much more straight-forward to package the complete
iteration logic into a »State Core«, which in this case
however maintains a back-link to the ExtentFamily.
Iteration should just yield an Reference to an Extent,
thereby hiding all details of the actual raw storage (char[]).
This can be achieved by usind a wrapper type around a pointer
into the managing vector; from this pointer we may convert
into a vector::iterator with the trick described here
https://stackoverflow.com/a/37101607/444796
Furthermore, continued planning of the Activity-Language,
basically clarified the complete usage scenario for now;
seems all implementable right away without further difficulties
..with the ability to grow on demand..
..possibly add the new extents in the middle, by first allocating at the end
and then using the std::rotate() algo to bring them to the point
in the middle where new extents are required
- the idea is to use slot-0 in each extent for administrative metadata
- to that end, a specialised GATE-Activity is placed into slot-0
- decision to use the next-pointer for managing the next free slot
- thus we need the help of the underlying ExtentFamily for navigating Extents
Decision to refrain from any attempt to "fix" excessive memory usage,
caused by Epochs still blocked by pending IO operations. Rather, we
assume the engine uses sane parametrisation (possibly with dynamic adjustment)
Yet still there will be some safety limit, but when exceeding this limit,
the allocator will just throw, thereby killing the playback/render process
- decision to favour small memory footprint
- rather use several Activity records to express invocation
- design Activity record as »POD with constructor«
- conceptually, Activity is polymorphic, but on implementation
level, this is "folded down" into union-based data storage,
layering accessor functions on top
- decision how to handle the Extent storage (by forced-cast)
- decision to place the administrative record directly into the Extent
TODO not clear yet how to handle the implicit limitation for future deadlines
using a simple yet performant data structure.
Not clear yet if this approach is sustainable
- assuming that no value initialisation happens for POD payload
- performance trade-off growth when in wrapped-state vs using a list
....still about to find out what kinds of Activities there are,
and what reasonably to implement on layer-2 vs. layer-1
It is clear that the worker will typically invoke a doWork()
operation on layer-2, which in turn will iterate layer-1.
Each worker pulls and performs internal managmenet tasks exclusively
until encountering the next real render task, at which point it will
drop an exclusion flag and then engage into performing the actual
extended work for rendering...
- define a simple record to represent the Activity
- define a handle with an ordering function
- low-level functions to...
+ accept such a handle
+ pick it from the entrace queue
+ pass it for priorisation into the PriQueue
+ dequeue the top priority element
after completing the recent clean-up and refactoring work,
the monad based framework for recursive tree expansion
can be abandoned and retracted.
This approach from functional programming leads to code,
which is ''cool to write'' yet ''hard to understand.''
A second design attempt was based on the pipeline and decorator pattern
and integrates the monadic expansion as a special case, used here to
discover the prerequisites for a render job. This turned out to be
more effective and prolific and became standard for several exploring
and backtracking algorithms in Lumiera.
An extended series of refactoring and partial rewrites resulted
in a new definition of the `Dispatcher` interface and completes
the buildup of a Job-Planning pipeline, including the ability
to discover prerequisites and compute scheduling deadlines.
At this point, I am about to ''switch to the topic'' of the `Scheduler`,
''postponing'' the completion of the `RenderDrive` until the related
questions regarding memory management and Scheduler interface are settled.
- allow to configure the expected job runtime in the test spec
- remove link to EngineConfig and hard-wire the engine latency for now
... extended integration testing reveals two further bugs ;-)
... document deadline calculation
This finishes the last series of refactorings; the basic concept
remains the same, but in the initial version we arranged the expander
function in the pipeline to maintain a Tuple (parent, child) for the
JobTickets. Unfortunately this turned out to be insufficient, since
JobTicket is effectively const and responsible for a complete Sement,
so there is no room to memorise a Deadline for the parent dependency.
This leads to the better idea to link the JobPlanning aggregators
themselves by parent-child references, which is possible since the
whole dependency chain actually sits in the stack embedded into the
Expander (in the pipeline)
This very deep change (which requires almost complete rebuild)
was prompted by the need to process an object (JobPlanning),
which holds several references and is thus move-only, in the
middle of a complex processing pipeline with child expansion.
If this works out well, a long-standing and obnoxious problem
with transforming iterators would be solved, albeit by incurring
a (presumably small) performance overhead, since now the new
value is no longer *assigned*, but rather the existing payload
is destroyed and a new instance is copy/move constructed into
the inline buffer.
The primary purpose (and widely used in Lumieara) is to have a
Lambda create a new Object, which is then returned by value
and thus immediately moved into this inline buffer, where it
resides for further use (as long as the enclosing pipeline
stays alive). Unless such an object does very elaborate
allocations and registrations behind the scene, the
expense of assigning vs creating should be the same.
...in the hope that the Optimiser is able to elide those references entirely,
when (as is here the case) they point into another field of a larger object compound
...as a preparation for solving a logical problem with the Planning-Pipeline;
it can not quite work as intended just by passing down the pair of
current ticket and dependent ticket, since we have to calculate a chained
calculation of job deadlines, leading up to the root ticket for a frame.
My solution idea is to create the JobPlanning earlier in the pipeline,
already *before* the expansion of prerequisites, and rather to integrate
the representation of the dependency relation direcly into JobPlanning
...using hard coded values instead of observation of actual runtimes,
but at least the calculation scheme (now relocated from TimeAnchor to JobPlanning)
should be a reasonable starting point.
TODO: test fails...
- collect list of entities to be picked up from the dispatcher-pipeline
- as it turns out: there is no sensible use for the realTimeDeadline in
in the FrameCoord record ==> remove it
The initial implementation effort for Player and Job-Planning
has been reviewed and largely reworked, and some parts are now
obsoleted by the reworked alternative and can be disabled.
The basic idea will be retained though: JobPlanning is a
data aggregator and performs the final step of creating a Job
- had to fix a logical inconsistency in the underlying Expander implementation
in TreeExplorer: the source-pipeline was pulled in advance on expansion,
in order to "consume" the expanded element immediately; now we retain
this element (actually inaccessible) until all of the immediate
children are consumed; thus the (visible) state of the PipeFrameTick
stays at the frame number corresponding to the top-level frame Job,
while possibly expanding a complete tree of flexible prerequisites
This test now gives a nice visualisation of the interconnected states
in the Job-Planning pipeline. This can be quite complex, yet I still think
that this semi-functional approach with a stateful pipeline and expand functors
is the cleanest way to handle this while encapsulating all details
- fix a bug in the MockDispatcher, when duplicating the ExitNodes.
A vector-ctor with curly braces will be interpreted as std::initializer_list
- add visualisation of the contents appearing at the end of the pipeline
*** something still broken here, increments don't happen as expected
`steam/engine/mock-dispatcher.hpp |cpp` now integrates this
''complete mock setup for render jobs and frame dispatching.''
The exising `DummyJob` has been slightly adapted and renamed
to `MockJob` and is tightly integrated with the other mocks.
The implementation of a `MockDispatcher` necessitated to change
the use of `MockJobTicket`. The initial attempts used a complete
mock implementation, but this approach turned out not to be viable.
Instead — based on the ideas developed for the mock setup —
now the prospective real implementation of `JobTicket` is available
and will be used by the mock setup too. Instead of a synthetic spec,
now a setup of recursively connected `ExitNode`(s) is used; the latter
seems to develop into some kind of Facade for the render node network.
Based on this mock setup, we can now demonstrate the (mostly) complete
Job-Planning pipeline, starting from a segmentation up to render jobs,
and verify proper connectivity and job invocation.
✔
- has to be prepared / supported by the RenderEnvironmentClosure
- actual translation happens when building the Dispatcher-Pipeline
- implementation delegate through
virtual size_t Dispatcher::resolveModelPort (ModelPort)
...ouch this was insidious: the STL implementation for list does not
return a pointer to the element just allocated, but rather retrieves
and dereferences the back() / front() iterator after returning from emplace_back|front()
...which in case of re-entrant allocations is something wildly different
than the initial allocation. Thus a *cheap* and dirty placeholder implementation
just using a STL container is not possible, and we need at least
to code up likewise cheesy placeholder implementation by hand.
- separate allocation and ctor all
- use an inline buffer in the STL container
- explicitly handle ctor failures to discard allocation
- NOT THREADSAFE and likely WASTFUL in terms of performance
==> MockSupport_test now back to GREEN after complete refactoring
The existing implementation of the Player from 2012~2015 inclduded
an additional differentiation by media channel (for multichannel media)
and would build a separate CalcStream for each channel.
The in-depth analysis conducted for the ongoing »Vertical Slice« effort
revealed that this differentiation is besides the point and would never
be materialised: Since -- by definition -- all media processing has
to be done by the engine, also the generation of the final output format
including any channel multiplexing will happen in render nodes.
The only exception would be when only a single channel of multichannel
media is extracted -- yet this case would then translate into a
dedicated ModelPort.
Based on this reasoning, a lot of complexity (and some contradictions)
within the JobTicket implementation can be removed -- together with
some further leftovers of the fist attempt to build JobTickets always
from a Mock specification (we now use construction by the Segment,
based on an ExitNode, which is the expected actual implementation
for production setup)
...by defining a new scheme for access to custom allocators
...and then passing a reference to such an accessor into the
JobTicket ctor, thereby allowing the ticket istelf recursively
to place further JobTicket instances into the allocation space
--> success, test passes (finally)
Up to now, a draft/mock implementation was used, relying on a »spec tuple«,
which was fabricated by MockJobTicket. But with the introduction of
NodeGraphAttachment, the MockSequence now generates a nested ExitNode structure,
and thus the JobTicket will be created through the "real" ctor, and
no longer via MockJobTicket.
Thus it is possible to skip this whole interspersed »spec tuple«,
since ExitNode *is* already this aggregated / abstracted Spec
PROBLEM: can not implement Spec-generation, since
- we must use a λ for internal allocation of JobTickets
- but recursive type inference is not possible
Will thus need to abandon the Spec-Tuple and relocate this
traversal-and-generation code into JobTicket itself
Use another unit-test (FixtureSegment_test) to guide and cover
the transition from the existing fake-implementation to the
actual implementation, where the JobTicket will be generated
on-demand, from a NodeGraphAttachment
It turns out that the real (not mocked) implementation of JobTicket creation
is already required now for this planned (mock)Dispatcher setup;
moreover, this real implementation turns out to be almost identical
to the mock implementation written recently -- just nested structure
of prerequiste JobTickets need to be changed into a similar structur
of ExitNodes
-- as an aside: rearrange various tests to be more in-line
with the envisioned architecture of playback and engine
...this opens up yet another difficult question and a host of new problems
- how are prerequisites detected or arranged by the Builder
- how are prerequisites represented?
- what is an ExitNode in terms of implementation? A subclass of ProcNode?
- how will the actual implementation of JobTicket creation (on-demand) work?
- how to adapt the Mock implementation, while retaining the Specification
for Segments and prerequisites?