There seems to be a ''sweet spot'' for somewhat larger Epoch sizes around 500 slots.
At least in the test setup used here, which works with a load of 200 Frames / sec,
which is significantly over the typical value of 50fps (video + audio) for simple playback.
The optimisation of averaged allocation times can not be much improved **below 30ns**.
Overall, this can be considered a good result,
since this allocation scheme does way more than just allocate memory,
it also provides a means to track dependencies and lifecycle.
__For context__:
- we should strive at processing one frame in ~ 10ms
- for 10 Activity records per Frame, we currently use < 0.5 µs for
memory and dependency management in the scheduler
- this leaves enough room for the further administrative efforts
(priority queue, job planning, buffer management)
... while this a comparison of apples and oranges, since the standard
heap allocator does not offer any dependency and lifecycle managmenet,
while the BlockFlow scheme developed here is much more complex and
offers a lifetime and dependency control specifically tailored to
the needs of the Scheduler.
Anyway, with the latest tweaks and refactorings, the test case
now shows averaged times per allocation on a comparable level
(both in the range of ~30ns)
BUT -> +50% runtime in -O3 (+20ns)
Investigation seems to indicate
- that the increased (+1 Epochs, 10 -> 11) moving average
caused the Algo to perform worse (strong effect)
- that the Optimiser has problems with boost::rational, which however
yields only a minute effect (+5ns), and only on the critical path
The access via Meyers Singleton has no adverse effect,
rather the new setup gives a tiny benefit (46ns -> 37ns).
Surprisingly, the increased pre-allocation has no observable effect.
On the long run, there will be a central Render Engine parametrisation;
some parameters can even be expected to be dynamic; thus prepare the
BlockFlow allocator to fit in with this expectation
For comparison: use individual managment by refcount.
This supports the conclusion that BlockFlow is more than just a
custom allocator; it also supports a non-trivial lifetime management,
and this comes at a cost.
Playing around with various load patterns uncovers further weak spots
in the regulation mechanism. As a remedy, introduce a stronger feed-back
and especially set the target load factor from 100% -> 90%
to add some headroom to absorb intermittent load peaks
Presumably ''much more observation and fine-tuning'' will be necessary
under real-world load conditions (⟹ Ticket #1318 for later)
- BUG: must prevent the Epoch size to become excessive low
- Problem: feedback signal should not be overly aggressive
Fine-Tuning:
- Dose for Overflow-compensation is delicate
- Moving average and Overflow should be balanced
- ideally the compensatory actions should be one order of magnitude
slower than the characteristic regulation time
Improvement: perform Moving-Average calculations in doubles
..as a heuristic to regulate optimal Epoch duration;
when Epochs are discarded, the effective fill factor can be used
to guess an Epoch duration time, which would (in hindsight)
lead to perfect usage of storage space
..using a simplistic implementation for now: scale down the
Epoch-stepping by 0.9 to increase capacity accordingly.
This is done on each separate overflow event, and will be
counterbalanced by the observation of Epoch fill ratio
performed later on clean-up of completed Epochs
further implementation makes clear that the AllocationHandle,
which is the primary usage front-end, has to rely both on
services of the underlying ExtentFamily allocator, as well
as on the BlockFlow itself for managing the Epoch spacing.
- fix a bug in IterExplorer: when iterating a »state core« directly,
the helper CoreYield passed the detected type through ValueTypeBindings.
This is logically wrong, because we never want to pick up some typedefs,
rather we always want to use the type directly returned from CORE::yield()
Here the iterator returns an Epoch&, which itself is again iterable
(it inherits from std::array<Activity, N>). However, it is clear
that we must not descent into such a "flatMap" style recursive expansion
- draft a simple scheme how to regulate Epoch lengths dynamically
- add diagnostics to pinpoint a given Activity and find out into which
Epoch it has been allocated; used to cover the allocator behaviour
- add preliminary deadline-check (directly instead of using the Activity)
- with this shortcut, now able to implement discarding obsoleted Epochs
- Iteration and use of the underlying `ExtentFamily` is also settled by now
💡 ''Implementation concept for the allocation scheme complete and validated''
...with the preceding IterableDecorator refactoring,
the navigation and access to the storage extents can now be
organised into a clear progression
Allocator::iterator -> EpochIter -> Epoch&
Convenience management and support functions can then be
pushed down into Epoch, while iteration control can be done
high-level in BlockFlow, based on the helpers in Epoch
While at first sight just a superficial variation of the existing IterStateWrapper,
it became clear with the evolution of the IterExplorer framework that
this setup represents a distinct concept, and especially lends itself
for complex and cohesive collaboration in a layered pipeline. Which
may, or may not be a good idea, depending on the circumstances.
Now, for the implementation of the scheduler memory allocation scheme,
another twist is added to the picture: we can not effort the sanity checks
on each access, even more so when layering / adapting iterators, where
it is essential that the optimiser can remove all unnecessary warts.
..this is the most simple case, where no Epochs are opened yet
..add diagnostics to inspect alloc count and deadlines
..add accessors for the first/last underlying Extent
...continue to proceed test-driven
...scheduler internals turn out to be intricate and cohesive,
and thus the only hope is to adhere to strict testing discipline
Library: add "obvious" utility to the IterExplorer, allowing to
materialise all contents of the Pipeline into a container
...use this to take a snapshot of all currently active Extent addresses
- use a checksum to prove that ctor / dtor of "content" is not invoked
- let the usage of active extents "wrap around" so that the mem block is re-used
- verify that the same data is still there
The low-level allocator is basically implemented now,
but we still need to check thoroughly that the tricky
wrap-around and expansion logic behaves sane...
(see #1311)
Using a Storage* within a wrapper as "pos" will work,
but is borderline trickery, since it amounts to subverting
the idea behind IterAdapter (which is to encapsulate a target
pointer with some control-logic in the managing container).
Using the same storage size and implementation overhead,
it is much more straight-forward to package the complete
iteration logic into a »State Core«, which in this case
however maintains a back-link to the ExtentFamily.
Iteration should just yield an Reference to an Extent,
thereby hiding all details of the actual raw storage (char[]).
This can be achieved by usind a wrapper type around a pointer
into the managing vector; from this pointer we may convert
into a vector::iterator with the trick described here
https://stackoverflow.com/a/37101607/444796
Furthermore, continued planning of the Activity-Language,
basically clarified the complete usage scenario for now;
seems all implementable right away without further difficulties
..with the ability to grow on demand..
..possibly add the new extents in the middle, by first allocating at the end
and then using the std::rotate() algo to bring them to the point
in the middle where new extents are required
- the idea is to use slot-0 in each extent for administrative metadata
- to that end, a specialised GATE-Activity is placed into slot-0
- decision to use the next-pointer for managing the next free slot
- thus we need the help of the underlying ExtentFamily for navigating Extents
Decision to refrain from any attempt to "fix" excessive memory usage,
caused by Epochs still blocked by pending IO operations. Rather, we
assume the engine uses sane parametrisation (possibly with dynamic adjustment)
Yet still there will be some safety limit, but when exceeding this limit,
the allocator will just throw, thereby killing the playback/render process
- decision to favour small memory footprint
- rather use several Activity records to express invocation
- design Activity record as »POD with constructor«
- conceptually, Activity is polymorphic, but on implementation
level, this is "folded down" into union-based data storage,
layering accessor functions on top
- decision how to handle the Extent storage (by forced-cast)
- decision to place the administrative record directly into the Extent
TODO not clear yet how to handle the implicit limitation for future deadlines
using a simple yet performant data structure.
Not clear yet if this approach is sustainable
- assuming that no value initialisation happens for POD payload
- performance trade-off growth when in wrapped-state vs using a list
....still about to find out what kinds of Activities there are,
and what reasonably to implement on layer-2 vs. layer-1
It is clear that the worker will typically invoke a doWork()
operation on layer-2, which in turn will iterate layer-1.
Each worker pulls and performs internal managmenet tasks exclusively
until encountering the next real render task, at which point it will
drop an exclusion flag and then engage into performing the actual
extended work for rendering...
- define a simple record to represent the Activity
- define a handle with an ordering function
- low-level functions to...
+ accept such a handle
+ pick it from the entrace queue
+ pass it for priorisation into the PriQueue
+ dequeue the top priority element
- allow to configure the expected job runtime in the test spec
- remove link to EngineConfig and hard-wire the engine latency for now
... extended integration testing reveals two further bugs ;-)
... document deadline calculation
...using hard coded values instead of observation of actual runtimes,
but at least the calculation scheme (now relocated from TimeAnchor to JobPlanning)
should be a reasonable starting point.
TODO: test fails...
...this opens up yet another difficult question and a host of new problems
- how are prerequisites detected or arranged by the Builder
- how are prerequisites represented?
- what is an ExitNode in terms of implementation? A subclass of ProcNode?
- how will the actual implementation of JobTicket creation (on-demand) work?
- how to adapt the Mock implementation, while retaining the Specification
for Segments and prerequisites?
right now we're lacking a complete working implementation of render node invocation,
and thus the Dispatcher implementation can only be verified with the help
of mocked jobs. However, at least a preliminary implementation of tagging the
invocation instance is available, and thus we're able to verify that
a given job instance indeed belongs to and is "backed" by a specific JobTicket.
This is prerequisite for building up a (likewise mocked) Fixture datastructure,
and this in turn was meant to form the basis for attacking an actual Scheduler
implementation, followed by a real render node invocation.
- can now create a Job from JobTicket::NIL
- on invocation this Job will to nothing
Only when the first real output backend is implemented,
we can decide if this simplistic implementation is enough,
or if an empty output must be explicitly generated...
* using a simplified preliminary implementation of hash chaining (see #1293)
* simplistic implementation of hashing for time values (half-rotation)
* for now just hashing the time into the upper part of the LUID
Maybe we can even live with that implementation for some time,
depending on how important uniform distribution of hash values is
for proper usage of the frame cache.
Needless to say, various further fine points need more consideration,
especially questions of portability (32bit anyone?). Moreover, since
frame times are typically quantised, the search space for the hashed
time values is drastically reduced; conceivably we should rather
research and implement a good hash function for 128bit and then combine
all information into a single hash key....
...using the MockJobTicket setup as point of reference,
since the actual invocation of render nodes will only be drafted
later in this "Vertical Slice" integration effort...
...requires a first attempt towards defining a `JobTiket`.
This turns out quite tricky, due to using those `LinkedElements`
(intrusive single linked list), which requires all added records
actually to live elsewhere. Since we want to use a custom allocator
later (the `AllocationCluster`), this boils down to allocating those
records only when about to construct the `JobTicket` itself.
What makes matters even worse: at the moment we use a separate spec
per Media channel (maybe these specs can be collapsed later non).
And thus we need to pass a collection -- or better an iterator
with raw specs, which in turn must reveal yet another nested
sequence for the prerequisite `JobTickets`.
Anyhow, now we're able at least to create an empty `JobTicket`,
backed by a dummy `JobFunctor`....
- decision: the Monad-style iteration framework will be abandoned
- the job-planning will be recast in terms of the iter-tree-explorer
- job-planning and frame dispatch will be disentangled
- the Scheduler will deliberately offer a high-level interface
- on this high-level, Scheduler will support dependency management
- the low-level implementation of the Scheduler will be based on Activity verbs