...which seems to be basically fine thus far
...beyond some renaming and rearranging
''it turns out that the final, crucial links,
necessary to tie all together, are yet to be developed''
Facing quite some difficulties here, since there are (at least)
two abandoned past efforts towards building a render node network
in the code base; the structure and architecture decisions from these
previous attempts seem largely valid still, yet on a technical level,
the style of construction evolved considerably in the meantime. Moreover,
these old fragments of code, written during the early stages of the
project, were lacking clear goals and anchor points at places;
the situation is quite different now in this respect.
Sticking to well proven practice, the rework will be driven by a test setup,
and will progress over three steps with increasing levels of integration.
The initial effort of building a Scheduler can now be **considered complete**
Reaching this milestone required considerable time and effort, including
an extended series of tests to weld out obvious design and implementation flaws.
While the assessment of the new Scheduler's limitation and traits is ''far from complete,''
some basic achievements could be confirmed through this extended testing effort:
* the Scheduler is able to follow a given schedule effectively,
until close up to the load limit
* the ''stochastic load management'' causes some latency on isolated events,
in the order of magnitude < 5ms
* the Scheduler is susceptible to degradation through Contention
* as mitigation, the Scheduler prefers to reduce capacity in such a situation
* operating the Scheduler effectively thus requires a minimum job size of 2ms
* the ability for sustained operation under full nominal load has been confirmed
by performing **test sequences with over 80 seconds**
* beyond the mentioned latency (<5ms) and a typical turnaround of 100µs per job
(for debug builds), **no further significant overhead** was found.
Design, Implementation and Testing were documented extensively in the [https://lumiera.org/wiki/renderengine.html#Scheduler%20SchedulerProcessing%20SchedulerTest%20SchedulerWorker%20SchedulerMemory%20RenderActivity%20JobPlanningPipeline%20PlayProcess%20Rendering »TiddlyWiki« #Scheduler]
The behaviour seems consistent and the schedule breaks at the expected point.
At first sight, concurrency seems slightly to low; detailed investigation
however shows that this is due to the structure of the load graph,
and in fact the run time comes close to optimal values.
the `BreakingPoint` tool conducts a binary search to find the ''stress factor''
where a given schedule breaks. There are some known deviations related to the
measurement setup, which unfortunately impact the interpretation of the
''stress factor'' scale. Earlier, an attempt was made, to watch those factors
empirically and work a ''form factor'' into the ''effective stress factor''
used to guide this measurement method.
Closer investigation with extended and elastic load patters now revealed
a strong tendency of the Scheduler to scale down the work resources when not
fully loaded. This may be mistaken by the above mentioned adjustments as a sign
of a structural limiation of the possible concurrency.
Thus, as a mitigation, those adjustments are now only performed at the
beginning of the measurement series, and also only when the stress factor
is high (implying that the scheduler is actually overloaded and thus has
no incentive for scaling down).
These observations indicate that the »Breaking Point« search must be taken
with a grain of salt: Especially when the test load does ''not'' contain
a high degree of inter dependencies, it will be ''stretched elastically''
rather than outright broken. And under such circumstances, this measurement
actually gauges the Scheduler's ability to comply to an established
load and computation goal.
...this seems to be the last topic for this investigation of Scheduler behaviour;
the goal is to demonstrate readiness for stable-state operation over an extended period of time
- use parameters known to produce a clean linear model
- assert on properties of this linear model
Add extended documentation into the !TiddlyWiki,
with a textual account of the various findings,
also including some of the images and diagrams,
rendered as SVG
Use the statistic functions imported recently from Yoshimi-test
to compute a linear regression model as immediate test result.
Combining several measurement series, this allows to draw conclusions
about some generic traits and limitations of the scheduler.
- the goal is to run a binary search
- the search condition should be factored out
- thus some kind of framework or DSL is required,
to separate the technicalities of the measurement
from the specifics of the actual test case.
...the idea is to use the sum of node weights per level
to create a schedule, which more closely reflects the distribution
of actual computation time. Hopefully such a schedule can then be
squeezed or stretched by a time factor to find out a ''breaking point'',
at which the Scheduler is no longer able to keep up.
while my basic assessment is still that contention will not play a significant
role given the expected real world usage scenario — when testing with
tighter schedule and rather short jobs (500µs), some phases of massive contention
can be observed, leading to significant slow-down of the test.
The major problem seems to be that extended phases of contention will
effectively cause several workers to remain in an active spinning-loop for
multiple microseconds, while also permanently reading the atomic lock.
Thus an adaptive scheme is introduced: after some repeated contention events,
workers now throttle down by themselves, with polling delays increased
with exponential stepping up to 2ms. This turns out to be surprisingly
effective and completely removes any observed delays in the test setup.
This is the first kind of integration,
albeit still with a synthetic load.
- placed two excessive load peaks in the scheduling timeline
- verified load behaviour
- verified timings
- verified that the scheduler shuts down automatically when done
- Ensure the grooming-token (lock) is reliably dropped
- also explicitly drop it prior to trageted sleeps
- properly signal when not able to acquire the token before dispatch
- amend tests broken by changes since yesterday
Notably the work-function is now completely covered, by adding
this last test, and the detailed investigations yesterday
ultimately unveiled nothing of concern; the times sum up.
Further reflection regarding the overall concept led me
to a surprising solution for the problem with priority classes.
- investigate consistency guarantees through acquire-release
==> turns out we do not need a fence, but it is tantamount
to have a guard variable and actually load and check
the value to ensure we indeed get a happens-before
- elaborate design of the WorkForce
+ no shared control variables necessary
+ no ability to forcibly shut-down the WorkForce
+ rather, all control will be exerted through the return value
of the Work-Functor
...to show in test that indeed the actual time is retrieved on each activation,
we can assign a λ -- which is rigged to increase the time on each access
...turns out there is still a lot of leeway in the possible implementation,
and seemingly it is too early to decide which case to consider the default.
Thus I'll proceed with the drafted preliminary solution...
- on primary-chain, an inhibited Gate dispatches itself into future for re-check
- on Notification, activation happens if and only if this very notification opens the Gate
- provide a specifically wired requireDirectActivation() to allow enforcing a minimal start time
Solved by special treatment of a notification, which happens
to decrement the latch to zero: in this case, the chain is
dispatched, but also the Gate is locked permanently to block
any further activations scheduled or forwareded otherwise
essentially define a concept how to ''perform'' render activities in the Scheduler.
This entails to specify the operation patterns for the four known base cases
and to establish a setup for the implementation.
Further extensive testing with parameter variations,
using the test setup in `BlockFlow_test::storageFlow()`
- Tweaks to improve convergence under extreme overload;
sudden load peaks are now accomodated typically < 5 sec
- Make the test definition parametric, to simplify variations
- Extract the generic microbenchmark helper function
- Documentation
..using a simplistic implementation for now: scale down the
Epoch-stepping by 0.9 to increase capacity accordingly.
This is done on each separate overflow event, and will be
counterbalanced by the observation of Epoch fill ratio
performed later on clean-up of completed Epochs
Iteration should just yield an Reference to an Extent,
thereby hiding all details of the actual raw storage (char[]).
This can be achieved by usind a wrapper type around a pointer
into the managing vector; from this pointer we may convert
into a vector::iterator with the trick described here
https://stackoverflow.com/a/37101607/444796
Furthermore, continued planning of the Activity-Language,
basically clarified the complete usage scenario for now;
seems all implementable right away without further difficulties
- decision to favour small memory footprint
- rather use several Activity records to express invocation
- design Activity record as »POD with constructor«
- conceptually, Activity is polymorphic, but on implementation
level, this is "folded down" into union-based data storage,
layering accessor functions on top
- decision how to handle the Extent storage (by forced-cast)
- decision to place the administrative record directly into the Extent
TODO not clear yet how to handle the implicit limitation for future deadlines
using a simple yet performant data structure.
Not clear yet if this approach is sustainable
- assuming that no value initialisation happens for POD payload
- performance trade-off growth when in wrapped-state vs using a list
- define a simple record to represent the Activity
- define a handle with an ordering function
- low-level functions to...
+ accept such a handle
+ pick it from the entrace queue
+ pass it for priorisation into the PriQueue
+ dequeue the top priority element
after completing the recent clean-up and refactoring work,
the monad based framework for recursive tree expansion
can be abandoned and retracted.
This approach from functional programming leads to code,
which is ''cool to write'' yet ''hard to understand.''
A second design attempt was based on the pipeline and decorator pattern
and integrates the monadic expansion as a special case, used here to
discover the prerequisites for a render job. This turned out to be
more effective and prolific and became standard for several exploring
and backtracking algorithms in Lumiera.
- allow to configure the expected job runtime in the test spec
- remove link to EngineConfig and hard-wire the engine latency for now
... extended integration testing reveals two further bugs ;-)
... document deadline calculation
..this is now the third attempt, and it seems this one leads to a
clean solution for the type rebinding problem, while also allowing
to unit-test each step in isolation.
The idea is to layer a *templated* builder class on top,
but to slice it away in each step, re-assemble the pipeline
and decorate a new builder instance on top. The net result
is a tightly interconnected processing pipeline without
any spurious interspersed leftovers from the builder,
while all intermediate steps are fully operational
and can thus be unit-tested...
...which is build a »Job planning pipeline« step by step
in a test setup, and then factor that out as RenderDrive,
to supersede the existing CalcPlanContinuation and get
rid of the Monads this way...
Challenges
- there is a inconsistency with channel usage
- need to establish a way how to transport the output-Sink into the JobFunctor
- need a way to propagate the current frame number to the next planning chunk
The prototypical setup of data structures and test support components
is largely complete by now — with the exception of the `MockDispatcher`,
which will be completed while moving to the next steps pertaining the
setup of a frame dispatch pipeline.
* the existing `DummyJob` was augmented to allow verification of
association between Job and `JobTicket`
* the existing implementation of `JobTicket` was verified and augmented
to allow coverage of the whole usage cycle
* a `MockJobTicket` was implemented on top, which can be generated
from a symbolical test specification (rather than from the real
Fixture data structure)
* a complete `MockSegmentation` was developed, allowing to establish
all the aforementioned data structures without an actual backing
Render Engine. Moreover, `MockSegmentation` can be generated
from the aforementioned symbolic test specification.
* as part of this work, an algorithm to split an existing Segmentation
and to splice in new segments was developed and verified
By reasoning and analysis I conclude that the differentiation into
multiple channels is likely misplaced in JobTicket; it belongs ratther
into the Segment and should provide a suitable JobTicket for each ModelPort
Handling of prerequisites also needs to be reshaped entirely after
switching to a pipeline builder for the Job-planning pipeline; as
preliminary access point, just add an iterator over the immediate
prerequisites, thereby shifting the exploration mechanism entirely
out of the JobTicket implementation
Testcase: A simple Sementation with a single and bounded Segment
As aside, figured out how to unpack an iterator such as to
tie a fixed number of references through a structural binding:
auto const& [s1,s2,s3] = seqTuple<3> (mockSegs.eachSeg());
There are 12 distinct cases regarding the orientation of two intervals;
The Segmentation::splitSplice() operation shall insert a new Segment
and adjust / truncate / expand / split / delete existing segments
such as to retain the *Invariant* (seamless segmentation covering
the complete time axis)