doing so would contradict the fundamental architecture,
all kinds of failures and timeouts need to be handled within
Scheduler-Layer-2 rather.
Jobs are never aborted, nor do they need to know if and when they are invoked
essentially define a concept how to ''perform'' render activities in the Scheduler.
This entails to specify the operation patterns for the four known base cases
and to establish a setup for the implementation.
Further extensive testing with parameter variations,
using the test setup in `BlockFlow_test::storageFlow()`
- Tweaks to improve convergence under extreme overload;
sudden load peaks are now accomodated typically < 5 sec
- Make the test definition parametric, to simplify variations
- Extract the generic microbenchmark helper function
- Documentation
There seems to be a ''sweet spot'' for somewhat larger Epoch sizes around 500 slots.
At least in the test setup used here, which works with a load of 200 Frames / sec,
which is significantly over the typical value of 50fps (video + audio) for simple playback.
The optimisation of averaged allocation times can not be much improved **below 30ns**.
Overall, this can be considered a good result,
since this allocation scheme does way more than just allocate memory,
it also provides a means to track dependencies and lifecycle.
__For context__:
- we should strive at processing one frame in ~ 10ms
- for 10 Activity records per Frame, we currently use < 0.5 µs for
memory and dependency management in the scheduler
- this leaves enough room for the further administrative efforts
(priority queue, job planning, buffer management)
... while this a comparison of apples and oranges, since the standard
heap allocator does not offer any dependency and lifecycle managmenet,
while the BlockFlow scheme developed here is much more complex and
offers a lifetime and dependency control specifically tailored to
the needs of the Scheduler.
Anyway, with the latest tweaks and refactorings, the test case
now shows averaged times per allocation on a comparable level
(both in the range of ~30ns)
BUT -> +50% runtime in -O3 (+20ns)
Investigation seems to indicate
- that the increased (+1 Epochs, 10 -> 11) moving average
caused the Algo to perform worse (strong effect)
- that the Optimiser has problems with boost::rational, which however
yields only a minute effect (+5ns), and only on the critical path
The access via Meyers Singleton has no adverse effect,
rather the new setup gives a tiny benefit (46ns -> 37ns).
Surprisingly, the increased pre-allocation has no observable effect.
On the long run, there will be a central Render Engine parametrisation;
some parameters can even be expected to be dynamic; thus prepare the
BlockFlow allocator to fit in with this expectation
For comparison: use individual managment by refcount.
This supports the conclusion that BlockFlow is more than just a
custom allocator; it also supports a non-trivial lifetime management,
and this comes at a cost.
Playing around with various load patterns uncovers further weak spots
in the regulation mechanism. As a remedy, introduce a stronger feed-back
and especially set the target load factor from 100% -> 90%
to add some headroom to absorb intermittent load peaks
Presumably ''much more observation and fine-tuning'' will be necessary
under real-world load conditions (⟹ Ticket #1318 for later)
- BUG: must prevent the Epoch size to become excessive low
- Problem: feedback signal should not be overly aggressive
Fine-Tuning:
- Dose for Overflow-compensation is delicate
- Moving average and Overflow should be balanced
- ideally the compensatory actions should be one order of magnitude
slower than the characteristic regulation time
Improvement: perform Moving-Average calculations in doubles
..as a heuristic to regulate optimal Epoch duration;
when Epochs are discarded, the effective fill factor can be used
to guess an Epoch duration time, which would (in hindsight)
lead to perfect usage of storage space
..using a simplistic implementation for now: scale down the
Epoch-stepping by 0.9 to increase capacity accordingly.
This is done on each separate overflow event, and will be
counterbalanced by the observation of Epoch fill ratio
performed later on clean-up of completed Epochs
further implementation makes clear that the AllocationHandle,
which is the primary usage front-end, has to rely both on
services of the underlying ExtentFamily allocator, as well
as on the BlockFlow itself for managing the Epoch spacing.
- fix a bug in IterExplorer: when iterating a »state core« directly,
the helper CoreYield passed the detected type through ValueTypeBindings.
This is logically wrong, because we never want to pick up some typedefs,
rather we always want to use the type directly returned from CORE::yield()
Here the iterator returns an Epoch&, which itself is again iterable
(it inherits from std::array<Activity, N>). However, it is clear
that we must not descent into such a "flatMap" style recursive expansion
- draft a simple scheme how to regulate Epoch lengths dynamically
- add diagnostics to pinpoint a given Activity and find out into which
Epoch it has been allocated; used to cover the allocator behaviour
- add preliminary deadline-check (directly instead of using the Activity)
- with this shortcut, now able to implement discarding obsoleted Epochs
- Iteration and use of the underlying `ExtentFamily` is also settled by now
💡 ''Implementation concept for the allocation scheme complete and validated''
...with the preceding IterableDecorator refactoring,
the navigation and access to the storage extents can now be
organised into a clear progression
Allocator::iterator -> EpochIter -> Epoch&
Convenience management and support functions can then be
pushed down into Epoch, while iteration control can be done
high-level in BlockFlow, based on the helpers in Epoch
While at first sight just a superficial variation of the existing IterStateWrapper,
it became clear with the evolution of the IterExplorer framework that
this setup represents a distinct concept, and especially lends itself
for complex and cohesive collaboration in a layered pipeline. Which
may, or may not be a good idea, depending on the circumstances.
Now, for the implementation of the scheduler memory allocation scheme,
another twist is added to the picture: we can not effort the sanity checks
on each access, even more so when layering / adapting iterators, where
it is essential that the optimiser can remove all unnecessary warts.
..this is the most simple case, where no Epochs are opened yet
..add diagnostics to inspect alloc count and deadlines
..add accessors for the first/last underlying Extent
...continue to proceed test-driven
...scheduler internals turn out to be intricate and cohesive,
and thus the only hope is to adhere to strict testing discipline
Library: add "obvious" utility to the IterExplorer, allowing to
materialise all contents of the Pipeline into a container
...use this to take a snapshot of all currently active Extent addresses
- use a checksum to prove that ctor / dtor of "content" is not invoked
- let the usage of active extents "wrap around" so that the mem block is re-used
- verify that the same data is still there
The low-level allocator is basically implemented now,
but we still need to check thoroughly that the tricky
wrap-around and expansion logic behaves sane...
(see #1311)
Using a Storage* within a wrapper as "pos" will work,
but is borderline trickery, since it amounts to subverting
the idea behind IterAdapter (which is to encapsulate a target
pointer with some control-logic in the managing container).
Using the same storage size and implementation overhead,
it is much more straight-forward to package the complete
iteration logic into a »State Core«, which in this case
however maintains a back-link to the ExtentFamily.
Iteration should just yield an Reference to an Extent,
thereby hiding all details of the actual raw storage (char[]).
This can be achieved by usind a wrapper type around a pointer
into the managing vector; from this pointer we may convert
into a vector::iterator with the trick described here
https://stackoverflow.com/a/37101607/444796
Furthermore, continued planning of the Activity-Language,
basically clarified the complete usage scenario for now;
seems all implementable right away without further difficulties
..with the ability to grow on demand..
..possibly add the new extents in the middle, by first allocating at the end
and then using the std::rotate() algo to bring them to the point
in the middle where new extents are required
- the idea is to use slot-0 in each extent for administrative metadata
- to that end, a specialised GATE-Activity is placed into slot-0
- decision to use the next-pointer for managing the next free slot
- thus we need the help of the underlying ExtentFamily for navigating Extents
Decision to refrain from any attempt to "fix" excessive memory usage,
caused by Epochs still blocked by pending IO operations. Rather, we
assume the engine uses sane parametrisation (possibly with dynamic adjustment)
Yet still there will be some safety limit, but when exceeding this limit,
the allocator will just throw, thereby killing the playback/render process
- decision to favour small memory footprint
- rather use several Activity records to express invocation
- design Activity record as »POD with constructor«
- conceptually, Activity is polymorphic, but on implementation
level, this is "folded down" into union-based data storage,
layering accessor functions on top
- decision how to handle the Extent storage (by forced-cast)
- decision to place the administrative record directly into the Extent
TODO not clear yet how to handle the implicit limitation for future deadlines
using a simple yet performant data structure.
Not clear yet if this approach is sustainable
- assuming that no value initialisation happens for POD payload
- performance trade-off growth when in wrapped-state vs using a list
....still about to find out what kinds of Activities there are,
and what reasonably to implement on layer-2 vs. layer-1
It is clear that the worker will typically invoke a doWork()
operation on layer-2, which in turn will iterate layer-1.
Each worker pulls and performs internal managmenet tasks exclusively
until encountering the next real render task, at which point it will
drop an exclusion flag and then engage into performing the actual
extended work for rendering...
- define a simple record to represent the Activity
- define a handle with an ordering function
- low-level functions to...
+ accept such a handle
+ pick it from the entrace queue
+ pass it for priorisation into the PriQueue
+ dequeue the top priority element
- allow to configure the expected job runtime in the test spec
- remove link to EngineConfig and hard-wire the engine latency for now
... extended integration testing reveals two further bugs ;-)
... document deadline calculation
...using hard coded values instead of observation of actual runtimes,
but at least the calculation scheme (now relocated from TimeAnchor to JobPlanning)
should be a reasonable starting point.
TODO: test fails...
...this opens up yet another difficult question and a host of new problems
- how are prerequisites detected or arranged by the Builder
- how are prerequisites represented?
- what is an ExitNode in terms of implementation? A subclass of ProcNode?
- how will the actual implementation of JobTicket creation (on-demand) work?
- how to adapt the Mock implementation, while retaining the Specification
for Segments and prerequisites?
right now we're lacking a complete working implementation of render node invocation,
and thus the Dispatcher implementation can only be verified with the help
of mocked jobs. However, at least a preliminary implementation of tagging the
invocation instance is available, and thus we're able to verify that
a given job instance indeed belongs to and is "backed" by a specific JobTicket.
This is prerequisite for building up a (likewise mocked) Fixture datastructure,
and this in turn was meant to form the basis for attacking an actual Scheduler
implementation, followed by a real render node invocation.
- can now create a Job from JobTicket::NIL
- on invocation this Job will to nothing
Only when the first real output backend is implemented,
we can decide if this simplistic implementation is enough,
or if an empty output must be explicitly generated...
* using a simplified preliminary implementation of hash chaining (see #1293)
* simplistic implementation of hashing for time values (half-rotation)
* for now just hashing the time into the upper part of the LUID
Maybe we can even live with that implementation for some time,
depending on how important uniform distribution of hash values is
for proper usage of the frame cache.
Needless to say, various further fine points need more consideration,
especially questions of portability (32bit anyone?). Moreover, since
frame times are typically quantised, the search space for the hashed
time values is drastically reduced; conceivably we should rather
research and implement a good hash function for 128bit and then combine
all information into a single hash key....
...using the MockJobTicket setup as point of reference,
since the actual invocation of render nodes will only be drafted
later in this "Vertical Slice" integration effort...
...requires a first attempt towards defining a `JobTiket`.
This turns out quite tricky, due to using those `LinkedElements`
(intrusive single linked list), which requires all added records
actually to live elsewhere. Since we want to use a custom allocator
later (the `AllocationCluster`), this boils down to allocating those
records only when about to construct the `JobTicket` itself.
What makes matters even worse: at the moment we use a separate spec
per Media channel (maybe these specs can be collapsed later non).
And thus we need to pass a collection -- or better an iterator
with raw specs, which in turn must reveal yet another nested
sequence for the prerequisite `JobTickets`.
Anyhow, now we're able at least to create an empty `JobTicket`,
backed by a dummy `JobFunctor`....
- decision: the Monad-style iteration framework will be abandoned
- the job-planning will be recast in terms of the iter-tree-explorer
- job-planning and frame dispatch will be disentangled
- the Scheduler will deliberately offer a high-level interface
- on this high-level, Scheduler will support dependency management
- the low-level implementation of the Scheduler will be based on Activity verbs
This is a deep refactoring to allow to represent the distance
between all valid time points as a time::Offset or time::Duration.
By design this is possible, since Time::MAX was defined as 1/30 of
the maximum value technically representable as int64_t. However,
introducing a different limiter for offsets and durations turns
out difficult, due to the inconsistencies in the exiting hierarchy
of temporal entities. Which in turn seems to stem from the unfortunate
decision to make time entities immutable, see #1261
Since the limiter is hard wired into the `time::TimeValue` constructor,
we are forced to create a "backdoor" of sorts, to pass up values
with different limiting from child classes. This would not be so
much of a problem if calculations weren't forced to go through `TimeVar`,
which does not distinguish between time points and time durations.
This solution rearranges all checks to be performed now by time::Offset,
while time::Duration will only take the absolute value at construction,
based on the fact that there is no valid construction path to yield
a duration which does not go through an offset first.
Later, when we're ready to sort out the implementation base of time values
(see #1258), this design issue should be revisited
- either we'll allow derived classes explicitly to invoke the limiter functions
- or we may be able to have an automatic conversion path from clearly
marked base implementation types, in which case we wouldn't use the
buildRaw_() and _raw() "backdoor" functions any more...
effectively we rely in the micro tick timescale promoted by libGAVL,
but it seems indicated to introduce our own constant definition.
And also clarify some comments and tests.
(this changeset does not change any values or functionality)
basically we can pick just any convention here, and so we should pick the convention in a way
that makes most sense informally, for a *human reader*. But what we previously did, was to pick
the condition such as to make it simple in some situations for the programmer....
With the predictable result: even with the disappointingly small number of usages we have up to now,
we got that condition backwards several times.
OK, so from now on!!!
Time::NEVER == Time::MAX, because "never" is as far as possible into the future
- most notably the NOBUG logging flags have been renamed now
- but for the configuration, I'll stick to "GUI" for now,
since "Stage" would be bewildering for an occasional user
- in a similar vein, most documentation continues to refer to the GUI