Scheduler-test: complete and document stress testing effort (closes #1344)

The initial effort of building a Scheduler can now be **considered complete**.
Reaching this milestone required considerable time and effort, including
an extended series of tests to weed out obvious design and implementation flaws.

While the assessment of the new Scheduler's limitations and traits is ''far from complete,''
some basic achievements could be confirmed through this extended testing effort:
 * the Scheduler is able to follow a given schedule effectively,
   up to a point close to the load limit
 * the ''stochastic load management'' causes some latency on isolated events,
   on the order of < 5ms
 * the Scheduler is susceptible to degradation through contention
 * as mitigation, the Scheduler prefers to reduce capacity in such a situation
 * operating the Scheduler effectively thus requires a minimum job size of 2ms
 * the ability for sustained operation under full nominal load has been confirmed
   by performing **test sequences running for over 80 seconds**
 * beyond the mentioned latency (<5ms) and a typical turnaround of 100µs per job
   (for debug builds), **no further significant overhead** was found.

Design, Implementation and Testing were documented extensively in the [https://lumiera.org/wiki/renderengine.html#Scheduler%20SchedulerProcessing%20SchedulerTest%20SchedulerWorker%20SchedulerMemory%20RenderActivity%20JobPlanningPipeline%20PlayProcess%20Rendering »TiddlyWiki« #Scheduler].
This commit is contained in:
Fischlurch 2024-04-20 01:55:41 +02:00
parent a46449d5ac
commit d71eb37b52
4 changed files with 544 additions and 278 deletions


@ -56,7 +56,7 @@ END
PLANNED "Scheduler Performance" SchedulerStress_test <<END
TEST "Scheduler Performance" SchedulerStress_test <<END
return: 0
END


@ -2,7 +2,7 @@
SchedulerStress(Test) - verify scheduler performance characteristics
Copyright (C) Lumiera.org
2023, Hermann Vosseler <Ichthyostega@web.de>
2024, Hermann Vosseler <Ichthyostega@web.de>
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as
@ -28,11 +28,11 @@
#include "lib/test/run.hpp"
#include "test-chain-load.hpp"
#include "stress-test-rig.hpp"
#include "lib/test/test-helper.hpp"
#include "vault/gear/scheduler.hpp"
#include "lib/time/timevalue.hpp"
#include "lib/format-string.hpp"
#include "lib/format-cout.hpp"
#include "lib/test/diagnostic-output.hpp"//////////////////////////TODO work in distress
#include "lib/util.hpp"
using test::Test;
@ -49,6 +49,13 @@ namespace test {
/***************************************************************************//**
* @test Investigate and verify non-functional characteristics of the Scheduler.
* @remark This test can require several seconds to run and might be brittle,
* due to reliance on achieving performance within certain limits, which
* may not be attainable on some systems; notably the platform is expected
* to provide at least four independent cores for multithreaded execution.
* The performance demonstrated here confirms that a typical load scenario
* can be handled while also documenting various measurement setups
* usable for focused investigation.
* @see SchedulerActivity_test
* @see SchedulerInvocation_test
* @see SchedulerCommutator_test
@ -69,10 +76,7 @@ namespace test {
}
/** @test TODO demonstrate sustained operation under load
* - TODO this is a placeholder and works now, but need a better example
* - it should not produce so much overload, rather some stretch of steady-state processing
* @todo WIP 12/23 🔁 define implement
/** @test demonstrate test setup for sustained operation under load
*/
void
smokeTest()
@ -132,11 +136,11 @@ namespace test {
* - as in many other tests, use the massively forking load pattern
* - demonstrate how TestChainLoad computes an idealised level expense
* - verify how schedule times are derived from this expense sequence
* @todo WIP 12/23 define implement
*/
void
setup_systematicSchedule()
{
MARK_TEST_FUN
TestChainLoad testLoad{64};
testLoad.configureShape_chain_loadBursts()
.buildTopology()
@ -263,17 +267,18 @@ namespace test {
double runTime = testSetup.launch_and_wait();
double expected = testSetup.getExpectedEndTime();
CHECK (fabs (runTime-expected) < 5000);
} // Scheduler should able to follow the expected schedule
} // Scheduler should be able to follow the expected schedule
/** @test verify capability for instrumentation of job invocations
* @see IncidenceCount_test
* @todo WIP 2/24 define implement
*/
void
verify_instrumentation()
{
MARK_TEST_FUN
const size_t NODES = 20;
const size_t CORES = work::Config::COMPUTATION_CAPACITY;
auto LOAD_BASE = 5ms;
@ -306,6 +311,8 @@ namespace test {
} // should ideally spend most of the time at highest concurrency levels
using StressRig = StressTestRig<16>;
/** @test determine the breaking point towards scheduler overload
@ -322,10 +329,9 @@ namespace test {
* computed by accounting for the work units in isolation, without considering
* dependency constraints. These observed deviations are cast into an empirical
* »form factor«, which is then used to correct the applied stress factor.
* Only with taking these corrective steps, the observed stress factor at
* After applying these corrective steps, the observed stress factor at
* _breaking point_ comes close to the theoretically expected value of 1.0
* @see stress-test-rig.hpp
* @todo WIP 1/24 define implement
*/
void
search_breaking_point()
@ -365,7 +371,6 @@ namespace test {
* - optionally generate a **Gnuplot** script for visualisation
* @see vault::gear::bench::ParameterRange
* @see gnuplot-gen.hpp
* @todo WIP 4/24 define implement
*/
void
watch_expenseFunction()
@ -396,15 +401,15 @@ namespace test {
auto [socket,gradient,v1,v2,corr,maxDelta,stdev] = bench::linearRegression (results.param, results.time);
double avgConc = Setup::avgConcurrency (results);
/*
cout << "───═══───═══───═══───═══───═══───═══───═══───═══───═══───═══───"<<endl;
cout << Setup::renderGnuplot (results);
// cout << "───═══───═══───═══───═══───═══───═══───═══───═══───═══───═══───"<<endl;
// cout << Setup::renderGnuplot (results) <<endl;
cout << "───═══───═══───═══───═══───═══───═══───═══───═══───═══───═══───"<<endl;
cout << _Fmt{"Model: %3.2f·p + %3.2f corr=%4.2f Δmax=%4.2f σ=%4.2f ∅concurrency: %3.1f"}
% gradient % socket % corr % maxDelta % stdev % avgConc
<< endl;
*/
CHECK (corr > 0.80); // clearly a linear correlated behaviour
CHECK (corr > 0.80); // clearly a linearly correlated behaviour
CHECK (isLimited (0.4, gradient, 0.7)); // should be slightly above 0.5 (2ms and 4 threads => 0.5ms / Job)
CHECK (isLimited (3, socket, 9 )); // we have a spin-up and a shut-down both ~ 2ms plus some further overhead
@ -413,89 +418,30 @@ namespace test {
/** @test TODO build a load pattern to emolate a typical high work load
* @todo WIP 4/24 🔁 define implement
/** @test use an extended load pattern to emulate a typical high work load
* - using 4-step linear chains, interleaved such that each level holds 4 nodes
* - the structure overall spans out to 66 levels, leading to 3.88 nodes/level
* - load on each node is 5ms, so the overall run would take ~330ms back to back
* - this structure is first performed on the bench::BreakingPoint
* - in the second part, a similar structure with 4-times the size is performed
* as a single run, but this time with planning and execution interleaved.
* - this demonstrates the Scheduler can sustain stable high load performance
*/
void
investigateWorkProcessing()
{
ComputationalLoad cpuLoad;
cpuLoad.timeBase = 200us;
cpuLoad.calibrate();
//////////////////////////////////////////////////////////////////TODO for development only
MARK_TEST_FUN
/*
TestChainLoad testLoad{200};
testLoad.configure_isolated_nodes()
.buildTopology()
// .printTopologyDOT()
.printTopologyStatistics();
{
TRANSIENTLY(work::Config::COMPUTATION_CAPACITY) = 4;
BlockFlowAlloc bFlow;
EngineObserver watch;
Scheduler scheduler{bFlow, watch};
auto set1 = testLoad.setupSchedule(scheduler)
.withLevelDuration(200us)
.withJobDeadline(500ms)
.withUpfrontPlanning()
.withLoadTimeBase(2ms)
.withInstrumentation();
double runTime = set1.launch_and_wait();
auto stat = set1.getInvocationStatistic();
cout << "time="<<runTime/1000
<< " covered="<<stat.coveredTime / 1000
<< " avgconc="<<stat.avgConcurrency
<<endl;
}
TestChainLoad<8> testLoad{256};
testLoad.seedingRule(testLoad.rule().probability(0.6).maxVal(2))
.pruningRule(testLoad.rule().probability(0.44))
.setSeed(60)
.buildTopology()
.printTopologyDOT()
.printTopologyStatistics()
;
return;
// auto stressFac = 1.0;
// auto concurrency = 8;
//
// auto testSetup =
// testLoad.setupSchedule(scheduler)
// .withLoadTimeBase(LOAD_BASE)
// .withJobDeadline(50ms)
// .withUpfrontPlanning()
// .withAdaptedSchedule (stressFac, concurrency);
// double runTime = testSetup.launch_and_wait();
// double expected = testSetup.getExpectedEndTime();
// auto stat = testSetup.getInvocationStatistic();
//SHOW_EXPR(runTime)
//SHOW_EXPR(expected)
//SHOW_EXPR(refTime)
return;
*/
using StressRig = StressTestRig<8>;
using StressRig = StressTestRig<8>;
struct Setup : StressRig
{
// double UPPER_STRESS = 12;
//
// double FAIL_LIMIT = 1.0; //0.7;
// double TRIGGER_SDEV = 1.0; //0.25;
// double TRIGGER_DELTA = 2.0; //0.5;
// uint CONCURRENCY = 4;
uint CONCURRENCY = 4;
// bool SCHED_DEPENDS = true;
bool showRuns = true;
auto
testLoad()
{
TestLoad testLoad{256};
TestLoad testLoad{256}; // use a pattern of 4-step interleaved linear chains
testLoad.seedingRule(testLoad.rule().probability(0.6).maxVal(2))
.pruningRule(testLoad.rule().probability(0.44))
.weightRule(testLoad.value(1))
@ -506,20 +452,18 @@ cout << "time="<<runTime/1000
auto testSetup (TestLoad& testLoad)
{
return StressRig::testSetup(testLoad)
// .withBaseExpense(200us)
.withLoadTimeBase(5ms);
.withLoadTimeBase(5ms);// ◁─────────────── Load 5ms on each Node
}
};
auto [stress,delta,time] = StressRig::with<Setup>()
.perform<bench::BreakingPoint>();
SHOW_EXPR(stress)
SHOW_EXPR(delta)
SHOW_EXPR(time)
cout << "Time for 256 Nodes: "<<time<<"ms with stressFactor="<<stress<<endl;
/* ========== verify extended stable operation ============== */
// Use the same pattern, but extended to 4 times the length;
// moreover, this time planning and execution will be interviened.
// moreover, this time planning and execution will be interleaved.
TestChainLoad<8> testLoad{1024};
testLoad.seedingRule(testLoad.rule().probability(0.6).maxVal(2))
.pruningRule(testLoad.rule().probability(0.44))
@ -545,14 +489,8 @@ SHOW_EXPR(time)
.withAdaptedSchedule (1.0, 4); // ◁───────────────────── stress factor 1.0 and 4 workers
double runTime = testSetup.launch_and_wait();
auto stat = testSetup.getInvocationStatistic();
cout << "Extended Scheduler Run: "<<runTime/1e6<<"sec concurrency:"<<stat.avgConcurrency<<endl;
SHOW_EXPR(runTime)
SHOW_EXPR(stat.activeTime);
SHOW_EXPR(stat.coveredTime);
SHOW_EXPR(stat.activationCnt);
SHOW_EXPR(stat.avgConcurrency);
SHOW_EXPR(expectedHash)
SHOW_EXPR(testLoad.getHash());
CHECK (stat.activationCnt == 1024);
CHECK (expectedHash == testLoad.getHash());
CHECK (3.2 < stat.avgConcurrency);


@ -46,7 +46,7 @@ DAMAGE.
<!--}}}-->
<!--PRE-HEAD-END-->
<title> Engine - Building a Render Nodes Network from Objects in the Session </title>
<title> Engine - Building a Render Nodes Network from Media Objects in the Session </title>
<style id="styleArea" type="text/css">
#saveTest {display:none;}
#messageArea {display:none;}
@ -444,7 +444,7 @@ a.tiddlyLinkNonExisting.shadow {font-weight:bold;}
.viewer blockquote {line-height:1.5em; padding-left:0.8em;margin-left:2.5em;}
.viewer ul, .viewer ol {margin-left:0.5em; padding-left:1.5em;}
.viewer table, table.twtable {border-collapse:collapse; margin:0.8em 1.0em;}
.viewer table, table.twtable {border-collapse:collapse; margin:0.8em 1.0em; font-size:1em;}
.viewer th, .viewer td, .viewer tr,.viewer caption,.twtable th, .twtable td, .twtable tr,.twtable caption {padding:3px;}
table.listView {font-size:0.85em; margin:0.8em 1.0em;}
table.listView th, table.listView td, table.listView tr {padding:0 3px 0 3px;}
@ -6183,7 +6183,7 @@ This is the core service provided by the player subsystem. The purpose is to cre
:any details of this processing remain opaque for the clients; even the player subsystem just accesses the EngineFaçade
</pre>
</div>
<div title="PlaybackVerticalSlice" creator="Ichthyostega" modifier="Ichthyostega" created="202303272236" modified="202312281714" tags="overview impl discuss draft" changecount="37">
<div title="PlaybackVerticalSlice" creator="Ichthyostega" modifier="Ichthyostega" created="202303272236" modified="202404192303" tags="overview impl discuss draft" changecount="38">
<pre>//Integration effort to promote the development of rendering, playback and video display in the GUI//
This IntegrationSlice was started in {{red{2023}}} as [[Ticket #1221|https://issues.lumiera.org/ticket/1221]] to coordinate the completion and integration of various implementation facilities, planned, drafted and built during the last years; this effort marks the return of development focus to the lower layers (after years of focussed UI development) and will implement the asynchronous and time-bound rendering coordinated by the [[Scheduler]] in the [[Vault|Vault-Layer]]
@ -6216,7 +6216,7 @@ The Scheduler will be structured into two Layers, where the lower layer is imple
* 🗘 establish a baseline for //performance measurements//
* ⌛ adapt the [[job planning pipeline|JobPlanningPipeline]] implemented thus far to produce the appropriate {{{Activity}}} records for the scheduler
__December23__: building the Scheduler required some time and dedication, including some related topics like a [[dedicated memory management scheme|SchedulerMemory]], rework and modernisation of the [[#1279 thread handling framework|https://issues.lumiera.org/ticket/1279]], using a [[worker pool|SchedulerWorker]] and developing the [[foundation for load control|SchedulerLoadControl]]. This amounts to the creation of a considerable body of new code; some load- and stress testing helps to establish performance characteristics .
__December23__: building the Scheduler required some time and dedication, including some related topics like a [[dedicated memory management scheme|SchedulerMemory]], rework and modernisation of the [[#1279 thread handling framework|https://issues.lumiera.org/ticket/1279]], using a [[worker pool|SchedulerWorker]] and developing the [[foundation for load control|SchedulerLoadControl]]. This amounts to the creation of a considerable body of new code; some &amp;rarr;[[load- and stress testing|SchedulerTest]] helps to establish &amp;rarr;[[performance characteristics and traits|SchedulerBehaviour]].
!Decisions
;Scheduler
@ -7174,7 +7174,7 @@ Later on we expect a distinct __query subsystem__ to emerge, presumably embeddin
&amp;rarr; QuantiserImpl</pre>
</div>
<div title="Scheduler" creator="Ichthyostega" modifier="Ichthyostega" created="202304140131" modified="202404081719" tags="Rendering spec draft" changecount="31">
<div title="Scheduler" creator="Ichthyostega" modifier="Ichthyostega" created="202304140131" modified="202404192301" tags="Rendering spec draft" changecount="32">
<pre>//Invoke and control the dependency and time based execution of [[render jobs|RenderJob]]//
The Scheduler acts as the central hub in the implementation of the RenderEngine and coordinates the //processing resources// of the application. Regarding architecture, the Scheduler is located in the Vault-Layer and //running// the Scheduler is equivalent to activating the »Vault Subsystem«. An EngineFaçade acts as entrance point, providing high-level render services to other parts of the application: [[render jobs|RenderJob]] can be activated under various timing and dependency constraints. Internally, the implementation is organised into two layers:
;Layer-2: Coordination
@ -7209,6 +7209,7 @@ Processing is not driven by a centralised algorithm, but rather stochastically b
!!!Instructing the Scheduler
The Scheduler is now considered an implementation-level facility with an interface specifically tailored to the JobPlanningPipeline: the [[»Render Activity Language«|RenderActivity]]. This //builder-style// setup allows constructing an ''~Activity-Term'' to model all the structural properties of an individual rendering invocation -- it is comprised of a network of {{{Activity}}} records, which can be directly handled by the Scheduler.
!!!!Discussion of further details
&amp;rarr; [[Activity|RenderActivity]]
&amp;rarr; [[Memory|SchedulerMemory]]
&amp;rarr; [[Workers|SchedulerWorker]]
@ -7217,16 +7218,29 @@ The Scheduler is now considered an implementation-level facility with an interfa
&amp;rarr; [[Testing|SchedulerTest]]
</pre>
</div>
<div title="SchedulerBehaviour" creator="Ichthyostega" modifier="Ichthyostega" created="202404081716" modified="202404081717" tags="Rendering operational draft" changecount="6">
<div title="SchedulerBehaviour" creator="Ichthyostega" modifier="Ichthyostega" created="202404081716" modified="202404190001" tags="Rendering operational draft" changecount="18">
<pre>//Characteristic behaviour traits of the [[Scheduler]] implementation//
The design of the scheduler was chosen to fulfil some fundamental requirements:
* flexibility to accommodate a wide array of processing patterns
* direct integration of some notion of //dependency//
* ability to be re-triggered by external events
* ability to be re-triggered by external events (IO)
* reactive, but not over-reactive response to load peaks
* ability to withstand extended periods of excessive overload
* roughly precise timing with a margin of ≈ ''5''ms
The above list immediately indicates that this scheduler implementation is not oriented towards high throughput or extremely low latency, and thus can be expected to exhibit some //traits of response and behaviour.// These traits were confirmed and further investigated in the efforts for [[stress testing|SchedulerTest]] of the new Scheduler implementation. {{red{As of 4/2024}}} it remains to be seen whether the characteristics of the chosen approach are beneficial or detrimental to the emerging actual usage -- it may well turn out that some adjustments must be made, or that even a complete rewrite of the Scheduler becomes necessary.
!Randomised capacity distribution
The chosen design relies on //active workers// and a //passive scheduling service.// In line with this approach, the //available capacity// is not managed actively to match a given schedule -- rather, workers perform slightly randomised sleep cycles, which are guided by the current distance to the next Activity on schedule. These wait-cycles are delegated to the OS scheduler, which is known to respond flexibly, yet with some leeway depending on the current situation and the given wait duration, typically around some 100 microseconds. Together this implies that start times can be ''slightly imprecise'', while deadlines will be decided at the actual time of schedule. Assuming some pre-roll, the first entry on the schedule will be matched within some 100µs, while further follow-up activities depend on available capacity, which can be scattered within the »work horizon« amounting to 5 milliseconds. A ramp-up to full concurrent capacity thus typically requires 5ms -- and can take up to 20ms after an extended period of inactivity. The precise temporal ordering of entries on the schedule however will be observed strictly; when capacity becomes available, it is directed towards the first entry in line.
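The sleep-cycle guidance described above can be sketched as follows. This is an illustrative model only, assuming the function name, constants and jitter range -- it is not the actual Lumiera worker code:

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <random>

using std::chrono::microseconds;

// Hypothetical sketch: a worker derives its next sleep period from the
// distance to the first entry on the schedule, slightly randomised so that
// workers do not wake up in lock-step, and capped by the 5ms »work horizon«.
microseconds
nextSleepCycle (microseconds distanceToNextActivity, std::mt19937& rng)
{
    auto const WORK_HORIZON = microseconds{5000};          // 5 ms
    // never sleep past the work horizon, nor below the ~100µs leeway
    // the OS scheduler typically exhibits anyway
    auto target = std::clamp (distanceToNextActivity,
                              microseconds{100}, WORK_HORIZON);
    // scatter the actual wait by ±20%
    std::uniform_real_distribution<double> jitter{0.8, 1.2};
    return microseconds{std::llround (target.count() * jitter(rng))};
}
```

With this scheme, a worker close to the next Activity wakes up almost on time, while idle workers cycle in roughly 5ms steps.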
!Worker stickiness and downscaling
The allocation of capacity strongly favours active workers. When returning from active processing, a worker gets precedence when looking for further work -- yet before looking for work, any worker must first acquire the {{{GroomingToken}}}. Workers failing these hurdles will retry, but step down on repeated failure. Together this (deliberately) creates a strong preference for keeping ''only a small number of workers active'', putting excess workers into sleep cycles first and removing them from the work force altogether after some seconds of compounded inactivity. When a schedule is light, activity will rather be stretched out to fill the available time. It takes //a slight overload,// with schedule entries becoming overdue repeatedly for about 5 milliseconds, to flip this preference and scale up to the full workforce. Notably, a ''minimum job length'' is also required to keep an extended work force in active processing state. The [[stress tests|SchedulerTest]] indicate that it takes a seamless sequence of 2ms-jobs to bring more than four workers into sustained active work state.
To put this into proportion, the tests (with debug build) indicate a typical //turnover time// of 100µs, spent on acquiring the {{{GroomingToken}}}, reorganising the queue and processing through the administrative activities up to the point where the next actual {{{JobFunctor}}} is invoked. Taking into account the inherent slight randomness of timings, it thus takes a »window« of several 100µs to get yet another worker reliably into working state. This implies the theoretical danger of clogging the Scheduler with tiny jobs, leading to a build-up of congestion and eventually failed deadlines, while most of the worker capacity remains in sleep state. Based on the expected requirements for media processing however, this is not considered a relevant threat. {{red{As of 4/2024}}} more practical experience is required to confirm this assessment.
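The step-down behaviour described above can be modelled in a few lines. All names, thresholds and the state enum here are assumptions for illustration, not the actual worker-pool implementation:

```cpp
// Minimal sketch of the »stickiness« policy: a worker that repeatedly fails
// to acquire the GroomingToken steps down from contending to sleeping, and
// finally retires after compounded inactivity.
enum class WorkerState { CONTEND, SLEEP, RETIRE };

struct WorkerPolicy
{
    int failedAcquisitions = 0;
    int idleCycles         = 0;

    WorkerState
    onAcquireAttempt (bool gotGroomingToken)
    {
        if (gotGroomingToken)
        {   // active workers get precedence again on the next round
            failedAcquisitions = 0;
            idleCycles = 0;
            return WorkerState::CONTEND;
        }
        if (++failedAcquisitions < 3)
            return WorkerState::CONTEND;            // retry immediately
        failedAcquisitions = 0;
        ++idleCycles;
        return idleCycles < 100? WorkerState::SLEEP // back off into a sleep cycle…
                               : WorkerState::RETIRE; // …and eventually leave the pool
    }
};
```

A light schedule thus keeps most workers cycling through `SLEEP`, while a single successful acquisition resets a worker back into active contention.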
!Limitations
The [[memory allocation scheme|SchedulerMemory]] is tied to the deadlines of planned activities. To limit the memory pool size and keep search times for allocation blocks within reasonable bounds, this arrangement imposes a ''hard limit'' for planning ''deadlines into the future'', set at roughly 20 seconds from //current time.// It is thus necessary to break computations down into manageable chunks, and to perform schedule planning as an ongoing effort, packaged into ''planning jobs'' scheduled alongside the actual work processing. Incidentally, the mentioned limitation pertains to setting deadlines (and by extension also to defining start times) -- no limitation whatsoever exists on //actual run times,// other than availability of work capacity. Once started, a job may run for an hour, but it is not possible to schedule a follow-up job in advance for such an extended span into the future.
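The planning-horizon rule described above amounts to a simple admission check; the constant and function names below are illustrative stand-ins, not the actual SchedulerMemory interface:

```cpp
#include <cstdint>

// Hypothetical sketch: deadlines may lie at most ~20s beyond current time,
// so longer computations must be broken into chunks, planned incrementally
// by ongoing planning jobs.
constexpr int64_t PLANNING_HORIZON_ms = 20'000;

bool
deadlineAcceptable (int64_t now_ms, int64_t deadline_ms)
{
    return deadline_ms > now_ms
       and deadline_ms - now_ms <= PLANNING_HORIZON_ms;
}
```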
A similar and related limitation stems from the handling of internal administrative work. A regular »tick« job is repeatedly scheduled +50ms into the future. This job is declared with a deadline of some 100 milliseconds -- if a load peak happens to cause an extended slippage beyond that tolerance frame, a ''Scheduler Emergency'' is triggered, discarding the complete schedule and pausing all current calculation processes. The Scheduler thus allows only for ''local and rather limited overload''. The job planning -- which must be performed as an ongoing process with continuation -- is expected to be reasonably precise and must include capacity management on a higher level. The Scheduler is not prepared for handling a ''workload of unknown size''.
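The »tick« watchdog logic just described can be sketched as follows; the struct, constants and the plain millisecond clock are simplifications assumed for illustration:

```cpp
#include <cstdint>

// Hypothetical model: a maintenance »tick« is re-scheduled +50ms ahead with
// a ~100ms deadline; picking it up past that deadline signals Emergency.
struct TickJob
{
    int64_t start;      // absolute time [ms]
    int64_t deadline;   // absolute time [ms]
};

constexpr int64_t TICK_PERIOD_ms   = 50;
constexpr int64_t TICK_DEADLINE_ms = 100;

TickJob
nextTick (int64_t now_ms)
{
    return TickJob{now_ms + TICK_PERIOD_ms,
                   now_ms + TICK_PERIOD_ms + TICK_DEADLINE_ms};
}

bool
isEmergency (TickJob const& tick, int64_t now_ms)
{   // slippage beyond the tolerance frame discards the whole schedule
    return now_ms > tick.deadline;
}
```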
</pre>
</div>
<div title="SchedulerLoadControl" creator="Ichthyostega" modifier="Ichthyostega" created="202310240240" modified="202311010317" tags="Rendering operational spec draft" changecount="60">
@ -7317,7 +7331,7 @@ The primary scaling effects exploited to achieve this level of performance are t
&amp;rarr; [[Scheduler performance testing|SchedulerTest]]
</pre>
</div>
<div title="SchedulerProcessing" creator="Ichthyostega" modifier="Ichthyostega" created="202312281750" modified="202404150019" tags="Rendering operational draft" changecount="9">
<div title="SchedulerProcessing" creator="Ichthyostega" modifier="Ichthyostega" created="202312281750" modified="202404192304" tags="Rendering operational draft" changecount="40">
<pre>At first sight, the internals of [[Activity|RenderActivity]] processing may seem overwhelmingly complex -- especially since there is no active »processing loop« which might serve as a starting point for the understanding. It is thus necessary to restate the working mode of the Scheduler: it is an //accounting and direction service// for the //active// [[render workers|SchedulerWorker]]. Any processing happens stochastically and is driven by various kinds of events --
* a //worker// becoming ready to perform further tasks
* an external //IO event// {{red{12/23 only planned yet}}}
@ -7328,13 +7342,44 @@ The last point highlights a //circular structure:// the planning job itself was
From the outside, the Scheduler appears as a service component, exposing two points-of-access: Jobs can be added to the schedule (planned), and a worker can retrieve the next instruction, which means either performing an (opaque) computation or sleeping for some given short period of time. Jobs are further distinguished into processing tasks, IO activities and meta jobs, which are related to the self-regulation of the scheduling process. On closer look however, several distinct realms can be identified, each offering a unique perspective of operation.
!!!The Activity Realm
[[Activities|SchedulerActivity]] are part of an //activity language// to describe patterns of processing, and will be arranged into activity terms; what is scheduled is actually the entrance point to such a term. Within this scope it is assumed abstractly that there is some setup and arrangement to accomplish the //activation// of activities. At the level of the activity language, this is conceptualised and represented as an {{{ExecutionCtx}}}, providing a set of functions with defined signature and abstractly defined semantics. This allows to define and even implement the entirety of activity processing without any recurrence to implementation structures of the scheduler. Rather, the functions in the execution-context will absorb any concern not directly expressed in relations of activities (i.e. anything not directly expressed as language term).
[[Activities|RenderActivity]] are part of an //activity language// to describe patterns of processing, and will be arranged into activity terms; what is scheduled is actually the entrance point to such a term. Within this scope, it is assumed abstractly that there is some setup and arrangement to accomplish the //activation// of activities. At the level of the activity language, this is conceptualised and represented as an {{{ExecutionCtx}}}, providing a set of functions with defined signature and abstractly defined semantics. This allows to define and even implement the entirety of activity processing without any recurrence to implementation structures of the scheduler. Rather, the functions in the execution-context will absorb any concern not directly expressed in relations of activities (i.e. anything not directly expressed as language term).
While the ability to reason about activities and verify their behaviour in isolation is crucial to allow for the //openness// of the activity language (able to accommodate future requirements not foreseeable at this time), it also creates some kind of »double bottom« -- which can be challenging when it comes to reasoning about the processing steps within the scheduler. While all functions of the execution-context are in fact wired into internals of the scheduler for the actual usage, it is indispensable to keep both levels separate and to ensure each level is logically consistent in itself.
While the possibility to reason about activities and verify their behaviour in isolation is crucial to allow for the //openness// of the activity language (its ability to accommodate future requirements not foreseeable at this time), it also creates some kind of »double bottom« -- which can be challenging when it comes to reasoning about the processing steps within the scheduler. While all functions of the execution-context are in fact wired into internals of the scheduler for the actual usage, it is indispensable to keep both levels separate and to ensure each level is logically consistent in itself.
One especially relevant example is the handling of a notification from the activity chain; it may highlight how these two levels are actually interwoven, yet must be kept separate for understanding the functionality and the limitations. A notification is optionally added to a chain, typically to cause further activities in another chain after completion of the first chain's {{{JobFunctor}}}. On the //language level,// the logical structure of execution becomes clear: the {{{NOTIFY}}}-Activity is part of the chain in an activity term, and thus will be activated eventually. For each kind of Activity, it is well defined what activation entails, as can be seen in the function {{{Activity::activate}}} (see at the bottom of [[activity.hpp|https://issues.lumiera.org/browser/Lumiera/src/vault/gear/activity.hpp]]) &amp;rArr; the activity will invoke the λ-post in the execution-context, passing the target of notification. The formal semantics of this hook is that a posted activity-term will be //dispatched;// and the dispatch of an Activity is likewise defined depending on what kind of Activity is posted. The typical case will have a {{{GATE}}}-Activity as target, and dispatching towards such a gate causes the embedded condition to be checked, possibly passing the activation forward to the chain behind the gate. The condition in the gate check entails looking at the gate's deadline and a pre-established prerequisite count, which is decremented when receiving a notification. So this clearly describes operational semantics, albeit on a conceptual level.
Looking into the bindings within the Scheduler implementation however adds a further layer of meaning. When treated within the scope of the scheduler, any Activity passed at some point through the λ-post translates into an //activation event// -- which is the essence of what can be placed into the scheduler queue. As such, it defines a start point and a deadline, adding an entirely different perspective of planning and time-bound execution. And the execution-context takes on a very specific meaning here, as it defines a //scope// to draw inferences from, as can be seen in {{{ExecutionCtx::post()}}} in the Scheduler itself. The context carries contextual timing information, which is combined with timing information from the target and timing information given explicitly. While the activity language only contains the notion of //sequencing,// within the Scheduler implementation actual timing constraints become relevant: in the end, the notification actually translates into another entry placed into the queue, at a starting time derived contextually from its target.
!!!Queues and Activation Events
The core of the Scheduler implementation is built up from two layers of functionality
;Layer-1
:The scheduler-invocation revolves around ''Activation Events'', which are //instructed,// fed into //prioritisation// and finally //retrieved for dispatch.//
;Layer-2
:The scheduler-commutator adds the notions of ''planning'' and ''dispatch'' and is responsible for coordinating activations and ''concurrency''
Activation Events are the working medium within the Scheduler implementation. They are linked to a definition of a processing pattern given as Activity-chain, which is then qualified further through a start-time, a deadline and other metadata. The Scheduler uses its own //time source// (available as a function in the execution-context) -- which is in fact implemented by access to the high-resolution timer of the operating system. Relying on time comparisons and a priority queue, the Scheduler can decide if and when an Activation Event actually should „happen“. These decisions and the dispatch of each event are moreover embedded into the execution-context, which (for performance reasons) is mostly established on-the-fly within the current call graph, starting from an //anchor event// and refining the temporal constraints on each step. Moreover, the {{{ExecutionCtx}}} object holds a link to the Scheduler implementation, allowing the λ-hooks -- as required by the Activity-language -- to be bound into Scheduler implementation functions.
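The Layer-1 prioritisation described above can be sketched as a start-time-ordered priority queue that only hands out events once due and silently drops entries past their deadline. This is an illustrative model with assumed names, not the actual queue code:

```cpp
#include <cstdint>
#include <optional>
#include <queue>
#include <vector>

// Hypothetical activation event: start time, deadline and a stand-in
// reference to the Activity chain to dispatch.
struct ActivationEvent
{
    int64_t start;      // earliest dispatch time [µs]
    int64_t deadline;   // latest permissible dispatch [µs]
    int     chainID;    // link to the Activity chain (stand-in)
};

struct LaterStart
{
    bool operator() (ActivationEvent const& a, ActivationEvent const& b) const
    {   return a.start > b.start; }         // min-heap on start time
};

class ActivationQueue
{
    std::priority_queue<ActivationEvent,
                        std::vector<ActivationEvent>, LaterStart> queue_;
public:
    void feed (ActivationEvent ev) { queue_.push (ev); }

    // retrieve the next event due at `now`, skipping missed deadlines
    std::optional<ActivationEvent>
    retrieve (int64_t now)
    {
        while (!queue_.empty())
        {
            ActivationEvent head = queue_.top();
            if (head.start > now) break;    // nothing due yet
            queue_.pop();
            if (now <= head.deadline)
                return head;
            // else: deadline missed -> drop (real code would flag this)
        }
        return std::nullopt;
    }
};
```

Temporal ordering is strict: whenever capacity shows up, the earliest due entry is handed out first.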
The Scheduler has two entrance avenues, and the first one to mention is the ''Job planning''. The interface to this part takes on the form of a builder-DSL, class {{{ScheduleSpec}}}; its purpose is to build an Activity-term and to qualify the temporal aspects of its performance. Moreover, several such terms can be connected through //notification links// (see above for a description of the notification functionality). This interface is not meant to be „really" public, insofar as it requires an understanding of the Activity language and the formal semantics of dispatch to be used properly. For a successful execution, the proper kind of execution scheme or {{{Term::Template}}} must be used, the given timing data should match up and be located sufficiently into the future; the planning must ensure not to attach notification links from activities which could possibly be dispatched already, and it //must never touch// an Activity past its deadline (since the memory of such an Activity can be reused for other Activities without further notice).
A lot can be expressed by choosing the start time, deadline, connectivity and further metadata of an Activity -- and it is assumed that the job planning will involve some kind of capacity management and temporal windowing to allocate the processing work on a higher level. Moreover, each term can be marked with a ''manifestation ID'', and the execution of already planned Activities for some specific manifestation ID can be disabled globally, making those planned Activities effectively disappear instantaneously from the schedule. This feature is provided as a foundation to allow some //amending of the schedule// after the fact -- which obviously must be done based on a clear understanding of the current situation, as the Scheduler itself has no means to check for duplicate or missing calculation steps. A similar kind of //expressive markup// is the ability to designate ''compulsory'' jobs. If such a job is encountered past its deadline, a »Scheduler Emergency« will ensue, bringing all ongoing calculations to a halt. The expectation is that some higher level of coordination will then cause a clean re-boot of the affected calculation streams.
!!!The Worker Aspect
A second avenue of calling into the Scheduler is related to an altogether different perspective: the organisation and coordination of work. In this design, the Scheduler is a //passive service,// supporting a pool of workers, which //actively ask for further work// when ready. At the implementation level, a worker is outfitted with a {{{doWork()}}} λ-function. So the individual worker is actually not aware what kind of work is performed -- its further behaviour is defined as a reaction to the //sequencing mark// returned from such a »work-pull«. This sequencing mark is a state code of type {{{Activity::Proc}}} and describes the expected follow-up behaviour, which can be to {{{PASS}}} on and continue pulling, enter a {{{WAIT}}} cycle, be {{{KICK}}}ed aside due to contention, or an instruction to {{{HALT}}} further processing and quit the pool.
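The reaction pattern described above can be condensed into a small sketch. The enum mirrors the state codes of {{{Activity::Proc}}}, while the loop body and the function names are illustrative assumptions; notably, a real worker would sleep in the {{{WAIT}}} case and apply a randomised back-off after {{{KICK}}}, which is elided here:

```cpp
#include <cassert>
#include <functional>

// State codes as returned from a »work-pull« (simplified mirror of Activity::Proc)
enum class Proc { PASS, WAIT, KICK, HALT };

// Simplified worker reaction loop: the worker knows nothing about the work itself;
// it only interprets the sequencing mark returned by the doWork() λ-function.
// Returns the number of pull calls performed before the worker quits.
inline int workerLoop(std::function<Proc()> doWork, int maxPulls)
{
    int pulls = 0;
    while (pulls < maxPulls)
    {
        ++pulls;
        switch (doWork())
        {
            case Proc::PASS: continue;      // keep pulling immediately
            case Proc::KICK: continue;      // contention: (sketch) back off, then retry
            case Proc::WAIT: continue;      // (sketch) a real worker would sleep here
            case Proc::HALT: return pulls;  // quit the pool
        }
    }
    return pulls;
}
```

The point of the design is visible even in this reduced form: all control flow is driven from the worker side, so no dedicated manager thread is needed.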
Following this avenue further down will reveal an intricate interplay of the //essentially random event// of a worker asking for work, and the current timing situation in relation to the //Scheduler queue head,// which is the next Activation Event to be considered. Moreover, based on the call context, the processing of these pull-calls is either treated as //incoming capacity// or //outgoing capacity.// In this context, a pull-call initiated by the worker is considered »incoming«, since the availability of the worker's capacity stems from the worker itself, its current state and behaviour pattern (which, of course, is a result of the previous history of work-pulling). Quite to the contrary, when the call-graph of the »pull« has just returned from processing a {{{JobFunctor}}}, the worker again equates to available capacity -- a capacity however, which is directly tied to the processing situation currently managed by the scheduler, and is about to move out of the scheduler's direct control, hence the term »outgoing«. By virtue of this distinction, the Scheduler is able to enact a //strong preference// towards //keeping active workers active// -- even while this might be detrimental at times to the goal of increasing usage of available capacity: whenever possible, the Scheduler attempts to utilise outgoing capacity immediately again, while, on the other hand, incoming capacity needs to „fit into the situation" -- it must be able to acquire the {{{GroomingToken}}}, which implies that all other workers are currently either engaged in work processing or have receded into a sleep cycle. The reasoning behind this arrangement is that the internal and administrative work is small in comparison to the actual media processing; each worker is thus utilised alongside to advance the Scheduler's internal machinery through some steps, until hitting on the next actual processing task.
Any internal and administrative work, and anything possibly changing the Scheduler's internal state, must be performed under the proviso of holding the {{{GroomingToken}}} (which is implemented as an //atomic variable// holding the respective worker's thread-ID). Internal work includes //feeding// newly instructed Activation Events from the entrance queue into the priority queue, reorganising the priority queue when retrieving the »head element«, but also state tracking computations related to the [[capacity control|SchedulerLoadControl]]. Another aspect of internal processing are the »meta jobs« used to perform the ''job planning'', since these jobs typically cause addition of further Activity terms into the schedule. And besides that, there is also the »''Tick''«, which is a special duty-cycle job, re-inserted repeatedly during active processing every 50 milliseconds. Tasks to be fulfilled by the »tick« are hard-wired in code, encompassing crucial maintenance work to clean-up the memory manager and reclaim storage of Activities past their deadline; a load state update hook is also included here -- and when the queues have fallen empty, the »tick« will cease to re-insert itself, sending the Scheduler into paused state. After some idle time, the {{{WorkForce}}} (worker pool) will scale down autonomously, effectively placing the Render Engine into standby.
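The {{{GroomingToken}}} mechanism described above can be sketched as a lock-free guard built on an atomic variable holding the owner's thread-ID. The class shown here is an assumption-laden simplification (illustrative names, re-entrancy folded into the acquire operation), not the actual Lumiera code:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Sketch of a GroomingToken-like guard: an atomic variable holding the owning
// thread's ID; only the holder may touch Scheduler internals.
class GroomingToken
{
    std::atomic<std::thread::id> owner_ {std::thread::id{}};

public:
    // attempt to take the token; succeeds when free, or when already held by us
    bool tryAcquire()
    {
        std::thread::id expected {};   // default-constructed id means »no owner«
        return owner_.compare_exchange_strong(expected, std::this_thread::get_id())
            or owner_.load() == std::this_thread::get_id();
    }

    // only the current holder may release
    void release()
    {
        if (owner_.load() == std::this_thread::get_id())
            owner_.store(std::thread::id{});
    }

    bool heldBySelf() const { return owner_.load() == std::this_thread::get_id(); }
};
```

The compare-and-swap ensures that at most one thread at a time can perform the internal and administrative work, without ever blocking in the kernel.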
!!!Process Control
Seen as a service and a central part of the Lumiera Render Engine, the Scheduler has a distinct operational state of its own, with the need to regulate some specifics of its processing modalities, and the ability to transition from one mode of operation to another one. The most obvious and most consequential distinction regards whether the Scheduler is //running,// or //in stand-by// or even //disabled.// Starting the discussion with the latter: the Scheduler cannot be disabled directly as such, but confining the {{{WorkForce}}} to size zero will effectively render the Scheduler defunct; no processing whatsoever can take place then, even after adding a job to the schedule. Taking this approach is mostly of relevance for testing, however, since having such a crippled service at the core of the application is not commendable, due to breaking various implicit assumptions regarding cause and effect of media processing. In a similar vein, it is not clear if the engine can be run successfully on a worker pool with just a single worker. In theory it seems this should be possible though.
For use within the application it can thus be assumed that the {{{WorkForce}}} is somehow configured with a //maximal capacity// -- which should relate to the number of independent (virtual) cores available on the hardware platform. The active processing then needs to be //seeded// by injecting the initial ''planning job'' for the desired stream-of-calculations; this in turn will have to establish the initial schedule and then re-add itself as a //continuation// planning job. Adding a seed-job this way causes the Scheduler to »ignite« -- which means to pass a signal for scale-up to the worker pool and then immediately to enter the first duty-cycle (»tick«) processing directly. This serves the purpose of updating and priming the state of the [[memory manager|SchedulerMemory]] and of any state pertaining to operational load control. From this point on, the »tick« jobs will automatically reproduce themselves every 50ms, until some »tick« cycle detects an empty scheduler queue. As discussed above, such a finding will discontinue the »tick« and later cause the {{{WorkForce}}} to scale down again. Notably, no further regulation signals should be necessary, since the actual Scheduler performance is driven by the worker's »pull« calls. Workers will show up with some initial randomisation, and the stochastic capacity control will serve to re-distribute their call-back cycles in accordance with the current situation at the Scheduler queue head.
A special operational condition mentioned above is the ''Scheduler Emergency''. This condition arises when timing constraints and capacity limitations lead to breaking the logical assumptions of the Activity Language, without actually breaking the low-level operational state within the implementation (meaning all queue and state management data structures remain in good health). This condition can not be cured by the Scheduler alone, and there is the serious danger that such a situation will reproduce itself, unless adjustments and corrective actions are taken on a higher level of application state control.
!!!!💡 see also
&amp;rarr; [[Behaviour traits|SchedulerBehaviour]]
&amp;rarr; [[Stress testing|SchedulerTest]]
&amp;rarr; [[Rendering]]
</pre>
</div>
<div title="SchedulerRequirements" modifier="Ichthyostega" created="201107080145" modified="201112171835" tags="Rendering spec draft discuss">
@ -7352,35 +7397,36 @@ While the ability to reason about activities and verify their behaviour in isola
The way other parts of the system are built requires us to obtain a guaranteed knowledge of some job's termination. It is possible to obtain that knowledge with some limited delay, but it needs to be absolutely reliable (violations leading to segfault). The requirements stated above assume this can be achieved through //jobs with guaranteed execution.// Alternatively we could consider installing specific callbacks -- in this case the scheduler itself has to guarantee the invocation of these callbacks, even if the corresponding job fails or is never invoked. It doesn't seem there is any other option.
</pre>
</div>
<div title="SchedulerTest" creator="Ichthyostega" modifier="Ichthyostega" created="202312281814" modified="202404172226" tags="Rendering operational draft img" changecount="119">
<div title="SchedulerTest" creator="Ichthyostega" modifier="Ichthyostega" created="202312281814" modified="202404182232" tags="Rendering operational draft img" changecount="170">
<pre>With the Scheduler testing effort [[#1344|https://issues.lumiera.org/ticket/1344]], several goals are pursued
* by exposing the new scheduler implementation to excessive overload, its robustness can be assessed and defects can be spotted
* with the help of a systematic, calibrated load, characteristic performance limits and breaking points can be established
* when performed in a reproducible way, these //stress tests// can also serve to characterise the actual platform
* a synthetic load emulating standard use cases allows to watch complex interaction patterns and to optimise the implementation
* a synthetic load, emulating standard use cases, allows to watch complex interaction patterns and to optimise the implementation
!A synthetic load for performance testing
In fall 2023, a load generator component was developed [[#1346|https://issues.lumiera.org/ticket/1346]] as foundation for Scheduler testing.
The {{{TestChainLoad}}} is inspired by the idea of a blockchain, and features a //graph// of //processing steps (nodes)// to model complex interconnected computations. Each //node// in this graph stands for one render job to be scheduled, and it is connected to //predecessors// and //successors//. The node is able to compute a chained hash based on the hash values of its predecessors. When configured with a //seed value//, this graph has the capability to compute a well defined result hash, which involves the invocation of every node in the right order. For verification, this computation can be done sequentially, by a single thread performing linear pass over the graph. The actual test however is to traverse the graph to produce a schedule of render jobs, each linked to a single node and defining prerequisites in accordance to the graph's structure. It is up to the scheduler then to ensure dependency relations and work out a way to pass on activation, so that the proper result hash will be reproduced at the end.
In December 2023, a load generator component was developed [[#1346|https://issues.lumiera.org/ticket/1346]] as foundation for Scheduler testing.
The {{{TestChainLoad}}} is inspired by the idea of a blockchain, and features a //graph// of //processing steps (nodes)// to model complex interconnected computations. Each //node// in this graph stands for one render job to be scheduled, and it is connected to //predecessors// and //successors//. The node is able to compute a chained hash based on the hash values of its predecessors. When configured with a //seed value//, this graph has the capability to compute a well defined result hash, which involves the invocation of every node in proper order. For verification, this computation can be done sequentially, by a single thread performing a linear pass over the graph. The actual test however is to traverse the graph to produce a schedule of render jobs, each linked to a single node and defining prerequisites in accordance to the graph's structure. It is up to the scheduler then to uphold dependency relations and work out a way to pass on the node activation, so that the correct result hash will be reproduced at the end.
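The chained-hash idea behind {{{TestChainLoad}}} can be illustrated with a small sketch: each node combines the hashes of its predecessors into its own hash, so a sequential pass yields a well defined result, which the concurrently scheduled execution must reproduce. The hash-mixing scheme and the data layout here are assumptions for illustration, not Lumiera's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Hash = std::size_t;

// boost::hash_combine-style mixing (illustrative choice)
inline Hash combine(Hash acc, Hash pred)
{
    return acc ^ (pred + 0x9e3779b97f4a7c15ULL + (acc << 6) + (acc >> 2));
}

struct Node
{
    Hash hash{0};
    std::vector<Node*> pred;   // predecessor links define the graph topology

    // fold the predecessor hashes into this node's hash, starting from a seed
    void compute(Hash seed)
    {
        hash = seed;
        for (Node* p : pred)
            hash = combine(hash, p->hash);
    }
};
```

Since the fold is deterministic for a fixed predecessor order, any scheduling that respects the dependency relations must arrive at the same exit hash -- which is precisely what makes the final hash a cheap correctness check for the Scheduler.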
The Scheduler thus follows the connectivity pattern, and defining a specific pattern characterises the load to produce. Patterns are formed by connecting (or re-connecting) nodes, based on the node's hash -- which implies that patterns are generated. The generation process and thus the emerging form is controlled and defined by a small number of rules. Each rule defines a probability mapping; when fed with a node hash, some output parameter value is generated randomly yet deterministically. The following //topology control rules// are recognised
The Scheduler thus follows the connectivity pattern, and defining a specific pattern characterises the load to produce. Patterns are formed by connecting (or re-connecting) nodes, based on the node's hash -- which implies that patterns are generated. The generation process and thus the emerging form is guided and defined by a small number of rules. Each rule establishes a probability mapping; when fed with a node hash, some output parameter value is generated randomly yet deterministically.
The following //topology control rules// are recognised:
;seeding rule
:when non-zero, the given number of new start (seed) nodes is injected.
:all seed nodes hold the same, preconfigured fixed seed value and only have successors, no predecessors
:all seed nodes hold the same, preconfigured fixed seed value and have successors only, but no predecessors
;expansion rule
:when non-zero, the current node forks out to the given additional number of successor nodes
;reduction rule
:when non-zero, the current node will join the given number of additional predecessor nodes
;pruning rule
:when non-zero, the current node will be an exit node and terminate the current chain without successor
:when non-zero, the current node will be an exit node without successor, and terminate the current chain
:when all ongoing chains happen to be terminated, a new seed node is injected automatically
;weight rule
:when non-zero, a computational weight with the given level is applied when invoking the node as render job
:when non-zero, a computational weight with the given degree (multiplier) is applied on invocation of the node as render job
:the //base time// of that weight can be configured (in microseconds) and is roughly calibrated for the current machine
* //without any rule,// the default is to connect linear chains with one predecessor and one successor
* the node graph holds a fixed preconfigured number of nodes
* the //width// is likewise preconfigured, i.e. the number of nodes on the same »level« and typically started at the same time
* generation is automatically terminated with an exit node, joining all remaining chains
* the //width// is likewise preconfigured, i.e. the number of nodes on the same »level«, typically to be started at the same time
* generation is automatically terminated with an exit node at all remaining chains
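The rules listed above share a common shape: fed with a node hash, each produces a parameter value that is randomly distributed yet fully reproducible. The following sketch shows one plausible way to build such a deterministic probability mapping; the concrete encoding (per-mille trigger threshold, modulo value draw) is an assumption for illustration, not the actual {{{TestChainLoad}}} rule implementation:

```cpp
#include <cassert>
#include <cstddef>

// A topology control rule as deterministic probability mapping:
// given a node hash, decide whether the rule triggers, and if so,
// derive a parameter value (e.g. the number of successor forks).
struct Rule
{
    std::size_t probabilityPermille;  // chance (‰) for the rule to trigger
    std::size_t maxVal;               // when triggered: value in [1..maxVal]

    std::size_t operator() (std::size_t nodeHash) const
    {
        if (nodeHash % 1000 >= probabilityPermille)
            return 0;                              // rule not triggered
        return 1 + (nodeHash / 1000) % maxVal;     // deterministic »random« value
    }
};
```

Because the mapping depends only on the node hash, re-generating the graph with the same seed reproduces the exact same topology -- a prerequisite for reproducible stress tests.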
! Scheduler Stress Testing
The point of departure for any stress testing is to show that the subject is resilient and will break in controlled ways only. Since the Scheduler relies on the limited computational resources available in the system, overload is easy to provoke by adding too many render jobs; delay will build up until some job's deadline is violated, at which point the Scheduler will drop this job (and any follow-up jobs with unmet dependencies). Much more challenging however is the task to find out about //the boundary of regular scheduler operation.// The domain of regular operation can be defined by the ability of the scheduler to follow and conform to the timings set out explicitly in the schedule. Obviously, short and localised load peaks can be accommodated, yet once a persistent backlog builds up, the schedule starts to slip and the calculation process will flounder.
@ -7390,11 +7436,11 @@ A method to determine such a »''breaking point''« in a systematic way relies o
Observing this breaking point in correlation with various load patterns will unveil performance characteristics and weak spots of the implementation.
&amp;rarr; [[Scheduler behaviour traits|SchedulerBehaviour]]
Another, quite different avenue of testing is to investigate a ''steady full-load state'' of processing. Contrary to the //breaking point// technique discussed above, for this method a fluid, homogenous schedule is required, and effects of scaffolding, ramp-up and load adaptation should be minimised. By watching a constant flow of back-to-back processing, in a state of //saturation,// the boundary capabilities for throughput and parallelisation can be derived, ideally expressed as a model of processing efficiency. A setup for this kind of investigation would be to challenge the scheduler with a massive load peak of predetermined size: a set of jobs //without any further interdependencies,// which are scheduled effectively instantaneously, so that the scheduler is immediately in a state of total overload. The actual measurement entails to watch the time until completing this work load, together with the individual job activation times during that period; the latter can be integrated to account for the //effective parallelism// and the amount of time in //impeded state,// where at most single threaded processing is observed. Challenging the Scheduler with a random series of such homogenous load peaks allows build a correlation table and to compute a linear regression model.
Another, quite different avenue of testing is to investigate a ''steady full-load state'' of processing. Contrary to the //breaking point// technique discussed above, for this method a fluid, homogeneous schedule is required, and effects of scaffolding, ramp-up and load adaptation should be minimised. By watching a constant flow of back-to-back processing, in a state of //saturation,// the boundary capabilities for throughput and parallelisation can be derived, ideally expressed as a model of processing efficiency. A viable setup for this kind of investigation would be to challenge the scheduler with a massive load peak of predetermined size: a set of jobs //without any further interdependencies,// which are scheduled effectively instantaneously, so that the scheduler is immediately in a state of total overload. The actual measurement entails watching the time until completion of this work load peak, together with the individual job activation times during that period; the latter can be integrated to account for the //effective parallelism// and the amount of time in //impeded state,// where at most single threaded processing is observed. Challenging the Scheduler with a random series of such homogeneous load peaks allows to build a correlation table and to compute a linear regression model.
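The linear regression mentioned here is a plain least-squares fit over (peak size, completion time) pairs. The following minimal sketch shows the computation; variable names and the {{{LinRegression}}} struct are illustrative, and the term »socket« follows the wording used below for the constant offset of the model:

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// time ≈ gradient·size + socket  (socket = constant base offset / intercept)
struct LinRegression { double gradient, socket; };

// ordinary least-squares fit over (size, time) measurement pairs
LinRegression fit(std::vector<std::pair<double,double>> const& runs)
{
    double n = runs.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (auto const& [x, y] : runs)
    {
        sx += x;  sy += y;
        sxx += x*x;  sxy += x*y;
    }
    double gradient = (n*sxy - sx*sy) / (n*sxx - sx*sx);
    return {gradient, (sy - gradient*sx) / n};
}
```

In the actual measurement, the gradient estimates the marginal cost per job, while the intercept collects ramp-up, tear-down and other fixed overheads.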
! Observations
!!!Breaking Point and Stress
Several investigations to determine the »breaking point« of a schedule were conducted with the load topology depicted to the right. This load pattern is challenging on various levels. There are dependency chains leading from the single start node to the three exit nodes, and thus the order of processing must be strictly observed. Moreover, several nodes bear //no weight,// and so the processing for those jobs returns immediately, producing mostly administrative overhead. Some nodes however are attributed with a weight up to 3.
Several investigations to determine the »breaking point« of a schedule were conducted with the load topology depicted to the right. This load pattern is challenging on various levels. There are dependency chains leading from the single start node to the three exit nodes, and thus the order of processing must be strictly observed. Moreover, several nodes bear //no weight,// and so the processing for those jobs returns immediately, producing mostly administrative overhead. Some nodes however are attributed with a weight up to factor 3.
&lt;html&gt;&lt;img title=&quot;Load topology with 64 nodes joined into a dependency chain, used for »breaking point« search&quot; src=&quot;dump/2024-04-08.Scheduler-LoadTest/Topo-10.svg&quot; style=&quot;float:right; margin-left:2ex&quot;/&gt;&lt;/html&gt;For execution, this weight is loaded with a base time, for example ''500''µs. An //adapted schedule// is generated based on the //node layers,// and using a simplified heuristic to account both for the accumulated node weight found within a given level, and the ability for speed-up through concurrency. Nodes without a weight are assumed to take no time (a deliberate simplification), while possible parallelisation is applied solely as factor based on the node count, completely disregarding any concerns of »optimal stacking«. This leads to the following schedule:
@ -7426,21 +7472,26 @@ Several investigations to determine the »breaking point« of a schedule were co
| 24| 28.167| 14.083ms|
| 25| 30.867| 15.433ms|
| 26| 32.200| 16.100ms|
The tests were typically performed with ''4 workers''. Thus e.g. at begin of Layer-3, a factor 2 is added as increment, since Node-2 is attributed with weight≔2; in the .dot-diagram, the weight is added as suffix, behind the node hash mark, in this case ({{{2: 95.2}}}). With 500µs as weight, the node(s) in Layer-2 will thus be scheduled at t=1ms. To discuss a more interesting example, Layer-19 holds 5 nodes with weight≔2. Scheduling for 4 workers will allow to parallelise 4 nodes, but require another round for the remaining node. Thus an increment of +4 is added at the beginning of Layer-20, thus scheduling the two following nodes ({{{34: D8.3}}}) and ({{{35: B0.2}}}) at t=10ms. So there is a combined weight≡5 in Layer-20, and the two nodes are parallelised, thus allocating an offset of +5/2 · 500µs, placing Layer-21 at t=11.25ms
This heuristic time allocation leads to a schedule, which somehow considers the weight distribution, yet is deliberately unrealistic, since it does not consider any base effort, nor does it fully account for the limited worker pool size. At the beginning, the Scheduler will thus be partially idle waiting, while at the end, a massive short overload peak is added, further exacerbated by the dependency constraints. After all, the objective of this test is to tighten or stretch this schedule by a constant ''stress factor'', and to search the point at which a cascading catastrophic slippage can be observed.
The tests were typically performed with ''4 workers''.
The computed schedule takes this into account, but only approximately: it considers the number of nodes in each layer, but not their dependencies on predecessors, nor their possibly differing weight (the simulated computation duration per node).
&lt;html&gt;&lt;div style=&quot;clear: both&quot;/&gt;&lt;/html&gt;
To explain this calculation, e.g. at the beginning of Level-3, a factor 2 is added as increment, since Node-2 is attributed with weight-factor ≔ 2; in the .dot-diagram above, the weight is indicated as suffix, attached behind the node hash mark, in this case ({{{2: 95.2}}}). With 500µs as weight, the node(s) in Level-2 will thus be scheduled at t≔1ms. To discuss another, more interesting example, Level-19 holds 5 nodes with weight-factor ≔ 2. Scheduling for 4 workers will allow to parallelise 4 nodes, but require another round for the remaining node. Thus an increment of +4 is added at the beginning of Level-20, thereby scheduling the two following nodes ({{{34: D8.3}}}) and ({{{35: B0.2}}}) at t≔10ms. Given their weight factors, a combined weight ≡ 5 is found in Level-20, and the two nodes are assumed to be parallelised, thus (in a simplified manner) allocating an offset of +5/2 · 500µs. The following Level-21 is thus placed at t ≔ 11.25ms
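The averaging part of this offset arithmetic can be replayed in a tiny sketch (assuming 4 workers and a base weight of 500µs, as in the text). Note this covers only the simple case where the combined level weight is divided by the achievable parallelism, as used for Level-2 and Level-20 above; the rounds-based handling of Level-19 is a separate heuristic and not reproduced here:

```cpp
#include <cassert>
#include <cmath>

// Offset (in ms) allocated after a level: combined node weight, divided by the
// achievable parallelism (node count capped by pool size), times the base weight.
// Pure illustration of the arithmetic described in the text, not Lumiera code.
double levelOffsetMs(int combinedWeight, int nodeCount, int workers, double baseMs)
{
    int parallel = nodeCount < workers ? nodeCount : workers;
    return double(combinedWeight) / parallel * baseMs;
}
```

With these numbers, Level-2 (one node, weight 2) yields an offset of 1ms, and Level-20 (two nodes, combined weight 5) yields 1.25ms, matching the schedule positions quoted above.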
This heuristic time allocation leads to a schedule which takes the weight distribution into account to some degree, yet is deliberately unrealistic, since it does not consider any base effort, nor does it fully account for the limited worker pool size. At the beginning, the Scheduler will thus be partially idle waiting, while at the end, a massive short overload peak is added, further exacerbated by the dependency constraints. It should be recalled that the objective of this test is to tighten or stretch this schedule by a constant ''stress factor'', and to search the point at which a cascading catastrophic slippage can be observed.
And such a well defined breaking point can indeed be determined reliably. However -- initial experiments placed this event at a stress-factor closely above 0.5 -- which is way off any expectation. The point in question is not the absolute value of slippage, which is expectedly strong, due to the overload peak at the end; rather, we are looking for the point at which the scheduler is unable to follow even the general outline of this schedule. And the expectation would be that this happens close to the nominal schedule, which implies stress-factor ≡ 1. Further //instrumentation// was thus added, allowing to capture invocations of the actual processing function; values integrated from these events allowed to draw conclusions about various aspects of the actual behaviour, especially
* the actual job run times consistently showed a significant drift towards longer run times (slower execution) than calibrated
* the effective average concurrency deviates systematically from the average speed-up factor, as assumed in the schedule generation; notably this deviation is stable over a wide range of stress factors, but obviously depends strongly on the actual load graph topology and worker pool size; it can thus be interpreted as a ''form factor'' to describe the topological ability to map a given local node connectivity to a scheduling situation.
By virtue of the instrumentation, both effects can be determined //empirically// during the test run, and compounded into a correction factor, applied to the scale of the stress-factor value determined as result. With this correction in place, the observed »breaking point« moved ''very close to 1.0''. This is considered an //important result;// both corrected effects relate to circumstances considered external to the Scheduler implementation -- which, beyond that, seems to handle timings and the control of the processing path //without significant further overhead.//
&lt;html&gt;&lt;div style=&quot;clear: both&quot;/&gt;&lt;/html&gt;
!!!Overload
Once the Scheduler is //overloaded,// the actual schedule does not matter much any more -- within the confines of an important limitation: the deadlines. The schedule may slip, but when a given job is pushed ahead beyond its deadline, the general promise of the schedule is //broken.// This draws on an important distinction; deadlines are hard, while start times can be shifted to accommodate. As long as the timings stay within the overall confines, as defined by the deadlines, the scheduling is able to absorb short load peaks. While in this mode of operation, no further timing waits are performed, rather, jobs are processed in order, as defined by their start times; when done with one job, a worker immediately retrieves the next job, which in state of overload is likewise overdue. So this setup allows to observe the efficiency of the „mechanics“ of job invocation.
Once the Scheduler is //overloaded,// the actual schedule does not matter much any more -- within the confines of an important limitation: the deadlines. The schedule may slip, but when a given job is pushed ahead beyond its deadline, the general promise of the schedule is //broken.// This draws on an important distinction; deadlines are hard, while start times can be shifted to accommodate. As long as the timings stay within the overall confines, as defined by the deadlines, the scheduling is able to absorb short load peaks. During this mode of operation, no further timing waits are performed, rather, jobs are processed in order, as defined by their start times; when done with one job, a worker immediately retrieves the next job, which -- in state of overload -- is likewise overdue. So this setup allows to observe the efficiency of the „mechanics“ of job invocation.
&lt;html&gt;&lt;img title=&quot;Load Peak with 8ms&quot; src=&quot;dump/2024-04-08.Scheduler-LoadTest/Graph13.svg&quot; style=&quot;float:right; width: 80ex; margin-left:2ex&quot;/&gt;&lt;/html&gt;
The measurement shown to the right uses a pool of ''8 workers'' to processes a load peak of jobs, each loaded with a processing function calibrated to run ''8''ms. Workers are thus occupied with processing the job load for a significant amount of time -- and so the probability of workers asking for work at precisely the same time is low. Since the typical {{{ActicityTerm}}} for [[regular job processing|RenderOperationLogic]] entails dropping the {{{GroomingToken}}} prior to invocation of the actual {{{JobFunctor}}}, another worker can access the queues meanwhile, and process Activities up to invocation of the next {{{JobFunctor}}}
With such a work load, the //worker pull processing// plays out to its full strength; since there is no »manager« thread, the administrative work is distributed evenly to all workers, and performed on average without imposing any synchronisation overhead to other workers. Especially with larger load peaks, the concurrency converges towards the theoretical maximum of 8, as can be seen at the light blue vertical bars in the secondary diagram below (left Y scale: concurrency). The remaining headroom can be linked (by investigation of trace dumps) to the inevitable ramp-up and tear-down; the work capacity shows up with some distribution of random delay, and thus it typically takes several milliseconds until all workers got their first task. Moreover, there is the overhead of the management work, which happens outside the measurement bracket inserted around the invocation of the job function -- even while in this load scenario also the management work is done concurrent to the other worker's payload processing, it is not accounted as part of the payload effort, and thus reduces the average concurrency.
The measurement shown to the right used a pool of ''8 workers'' to process a //load peak of jobs,// each loaded with a processing function calibrated to run ''8''ms. Workers are thus occupied with processing the job workload for a significant amount of time -- and so the probability of workers accidentally asking for work at precisely the same time is rather low. Since the typical {{{ActivityTerm}}} for [[regular job processing|RenderOperationLogic]] entails dropping the {{{GroomingToken}}} prior to invocation of the actual {{{JobFunctor}}}, another worker can access the queues in the meantime, and handle Activities up to invocation of the next {{{JobFunctor}}}
With such a work load, the //worker pull processing// plays out to its full strength; since there is no »manager« thread, the administrative work is distributed evenly to all workers, and performed on average without imposing any synchronisation overhead on other workers. Especially with longer load peaks, the concurrency was observed to converge towards the theoretical maximum of 8 (on this machine) -- as can be seen at the light blue vertical bars in the secondary diagram below (where the left Y scale displays the average concurrency). The remaining headroom can be linked (by investigation of trace dumps) to the inevitable ramp-up and tear-down; the work capacity shows up with some distribution of random delay, and thus it typically takes several milliseconds until all workers have received their first task. Moreover, there is the overhead of the management work, which happens outside the measurement bracket inserted around the invocation of the job function -- even while in this load scenario the management work is also done concurrently to the other workers' payload processing, it is not captured as part of the payload effort, and thus reduces the average concurrency accounted.
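For illustration, the //worker pull cycle// just described can be sketched as follows -- a strongly simplified model: the queue and job types here are mere placeholders, and only the handshake on the {{{GroomingToken}}} (acquire, groom the queue, drop the token before payload invocation) mirrors the actual design.

```cpp
#include <atomic>
#include <deque>
#include <functional>
#include <utility>

// Simplified model of the worker-pull cycle: a worker must hold the
// (atomic) GroomingToken while mutating the queues, but drops it
// *before* invoking the JobFunctor, so that other workers can groom
// the queues concurrently with the payload computation.
struct SchedulerModel {
    std::atomic<bool> groomingToken{false};
    std::deque<std::function<void()>> queue;   // placeholder for the Activity queues

    bool acquireGroomingToken() {
        bool expect = false;
        return groomingToken.compare_exchange_strong(expect, true);
    }
    void dropGroomingToken() { groomingToken.store(false); }

    // One pull: groom the queue under the token, then release it
    // and run the job payload outside the critical section.
    bool pullWork() {
        if (!acquireGroomingToken())
            return false;                      // contention: caller may back off
        if (queue.empty()) {
            dropGroomingToken();
            return false;
        }
        auto job = std::move(queue.front());
        queue.pop_front();
        dropGroomingToken();                   // token dropped prior to invocation
        job();                                 // invoke the JobFunctor payload
        return true;
    }
};
```

Note how the token is released before the payload runs, so the critical section covers only the queue manipulation -- which is why long payloads make simultaneous token access improbable.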
The dependency of processing time on load size is clearly linear, with a very high correlation (0.98). A ''linear regression model'' indicates a gradient very close to the expected value of 1ms/job (8ms nominal job time distributed to 8 cores). The slight deviation is due to the fact that //actual job times// (&amp;rarr; dark green dots) tend to diverge to higher values than calibrated, an effect consistently observed on this machine throughout this scheduler testing effort. An explanation might be that the calibration of the work load is done in a tight loop (single threaded or multi threaded does not make much of a difference here), while in the actual processing within the scheduler, some global slowdown is generated by cache misses, pipeline stalls and the coordination overhead caused by accessing the atomic {{{GroomingToken}}} variable. Moreover, the linear model indicates a socket overhead, which can largely be attributed to the ramp-up / tear-down phase, where -- inevitably -- not all workers can be put to work. In accordance with that theory, the socket overhead indeed increases with larger job times. The probabilistic capacity management employed in this Scheduler implementation adds a further socket overhead; the actual work start depends on workers pulling further work, which, depending on the circumstances, happens more or less randomly at the beginning of each test run. Incidentally, there is further significant scaffolding overhead, which is not accounted for in the numbers presented here: at start, the worker pool must be booted, the jobs for the processing load must be planned, and a dependency on a wake-up job is maintained, prompting the controlling test-thread to collect the measurement data and to commence with the next run in the series.
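The ''linear regression model'' referred to throughout these measurements is a plain ordinary-least-squares fit of run time against job count; the names {{{gradient}}} and {{{socket}}} below merely follow the terminology used in this text.

```cpp
#include <vector>
#include <utility>
#include <cmath>

// Ordinary least-squares fit  time ≈ gradient·jobs + socket,
// as used here to model observed run time against load size;
// »socket« designates the constant offset (ramp-up, tear-down etc.)
struct LinearModel { double gradient, socket; };

LinearModel fitLinear(std::vector<std::pair<double,double>> const& samples) {
    double n = samples.size(), sx=0, sy=0, sxx=0, sxy=0;
    for (auto [x,y] : samples) { sx+=x; sy+=y; sxx+=x*x; sxy+=x*y; }
    double gradient = (n*sxy - sx*sy) / (n*sxx - sx*sx);
    double socket   = (sy - gradient*sx) / n;
    return {gradient, socket};
}
```

A fit over data points lying exactly on //time = 1·jobs + 5// yields gradient 1 and socket 5, matching the interpretation given above: the gradient is the marginal cost per job, the socket the fixed per-run overhead.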
Note also that all these investigations were performed with ''debug builds''.
However, ''very short jobs'' cause ''significant loss of efficiency''.
The example presented to the right uses a similar setup (''8 workers''), but reduced the calibrated job-time to only ''200''µs. This comes close to the general overhead required for retrieving and launching the next job, which amounts to a further 100µs when running a debug build, including the ~30µs required solely to access, set and drop the {{{GroomingToken}}}. Consequently, only a small number of //other workers// get a chance to acquire the {{{GroomingToken}}} and then to work through the Activities up to the next Job invocation, before the first worker already returns -- causing a significant amount of ''contention'' on the {{{GroomingToken}}}. Now, the handling of work-pull requests in the Scheduler implementation is arranged in a way to prefer workers just returning from active processing. Thus (intentionally) only a small subset of the workers is able to pull work repeatedly, while the other workers will encounter a series of »contention {{{KICK}}}« events; an inbuilt //contention mitigation scheme// responds to this kind of repeated pull-failure by interspersing sleep cycles, thereby effectively throttling down the {{{WorkForce}}} until contention events return to an acceptable level. This mitigation is important, since contention, even just on an atomic variable, can cause a significant global slow-down of the system.
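The //contention mitigation scheme// can be pictured roughly as follows -- an illustrative sketch only: the threshold and the sleep-cycle progression are assumptions for the example, not the actual {{{WorkForce}}} parameters.

```cpp
#include <chrono>

// Sketch of a contention mitigation scheme: repeated failures to
// acquire the GroomingToken ("KICK" events) cause a worker to
// intersperse growing (but bounded) sleep cycles, effectively
// throttling the work force down until contention subsides.
// KICK_LIMIT and the backoff progression are assumed values.
class ContentionGuard {
    int kicks_ = 0;
    static constexpr int KICK_LIMIT = 3;
public:
    // to be called whenever a work-pull failed due to contention;
    // returns the sleep time to intersperse before the next attempt
    std::chrono::microseconds onKick() {
        if (++kicks_ < KICK_LIMIT)
            return std::chrono::microseconds{0};   // retry immediately
        int over = kicks_ - KICK_LIMIT;
        return std::chrono::microseconds{100 << (over > 6 ? 6 : over)};
    }
    void onSuccess() { kicks_ = 0; }               // successful pull resets
};
```

A worker encountering a series of KICKs thus sleeps for progressively longer periods, while a worker that manages to pull work keeps running at full speed -- matching the observed effect that only a small subset of workers remains active on short payloads.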
As net effect, most of the load peaks are just handled by two workers, especially for larger load sizes; most of the available processing capacity remains unused for such short running payloads. Moreover, on average a significant amount of time is spent with partially blocked or impeded operation (&amp;rarr; light green circles), since administrative work must be done non-concurrently. Depending on the perspective, the behaviour, as exposed in this test, can be seen as a weakness -- or as the result of a deliberate trade-off made by the reliance on active work-pulling and a passive Scheduler.
The actual average in-job time (&amp;rarr; dark green dots) is offset significantly here, and closer to 400µs -- which is also confirmed by the gradient of the linear model (0.4ms / 2 Threads ≙ 0.2ms/job). With shorter load sizes below 90 jobs, increased variance can be observed, and measurements can no longer be subsumed under a single linear relation -- in fact, data points seem to be arranged into several groups with differing, yet mostly linear correlation, which also explains the negative socket value of the overall computed model; using only the data points with &gt; 90 jobs would yield a model with slightly lower gradient but a positive offset of ~2ms.
&lt;html&gt;&lt;div style=&quot;clear: both&quot;/&gt;&lt;/html&gt;
Further measurement runs with other parameter values fit well in between the two extremes presented above. It can be concluded that this Scheduler implementation strongly favours larger job sizes starting with several milliseconds, when it comes to processing through an extended homogeneous work load without many job interdependencies. Such larger lot sizes can be handled efficiently and close to expected limits, while very small jobs massively degrade the available performance. This can be attributed both to the choice of a randomised capacity distribution, and of pull processing without a central manager.
!!!Stationary Processing
The ultimate goal of //load- and stress testing// is to establish a notion of //full load// and to demonstrate adequate performance under //nominal load conditions.// Thus, after investigating overheads and the breaking point of a complex schedule, a measurement setup was established with load patterns deemed „realistic“ -- based on knowledge regarding some typical media processing demands encountered for video editing. Such a setup entails small dependency trees of jobs loaded with computation times around 5ms, interleaving several challenges up to the available level of concurrency. To determine viable parameter bounds, the //breaking-point// measurement method can be applied to an extended graph with this structure, to find out at which level the computations will use the system's abilities to such a degree that it is not able to move along any faster.
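The //breaking-point// method itself amounts to a search over a stress factor: tighten the schedule step by step until the Scheduler can no longer keep up. A hedged sketch, with the actual Scheduler run abstracted into a predicate (the real measurement observes deadline slippage instead); the bounds and iteration count are arbitrary example values:

```cpp
#include <functional>
#include <cmath>

// Sketch of the »breaking point« measurement: bisect on a stress
// factor to locate the point where the system can no longer hold
// the schedule. holdsSchedule(lo) is expected to succeed and
// holdsSchedule(hi) to fail; the predicate stands in for an
// actual Scheduler test run.
double findBreakingPoint(std::function<bool(double)> holdsSchedule,
                         double lo = 0.1, double hi = 4.0) {
    for (int i = 0; i < 30; ++i) {
        double mid = (lo + hi) / 2;
        if (holdsSchedule(mid)) lo = mid; else hi = mid;
    }
    return (lo + hi) / 2;   // stress factor at the breaking point
}
```

In practice each probe of the predicate is itself a noisy measurement, so the real procedure has to average repeated runs rather than rely on a single clean cut-off as in this sketch.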
&lt;html&gt;&lt;img title=&quot;Load topology for stationary processing with 8 cores&quot; src=&quot;dump/2024-04-08.Scheduler-LoadTest/Topo-20.svg&quot; style=&quot;width:100%&quot;/&gt;&lt;/html&gt;
This pattern can be processed
* with 8 workers in overall 192ms
For comparison, another, similar load pattern was used, which however is compris
| ≙ per level|5.15ms |3.5 |
| avg.conc|3.5 |5.3 |
These observations indicate ''adequate handling without tangible overhead''.
When limited to ''4 workers'', the concurrency of ∅ 3.5 is only slightly below the average number of 3.88 Nodes/Level, and the time per level is near optimal, taking into account the fact (established by the overload measurements) that the actual job load tends to be slightly above the calibrated value of 5ms. The setup with ''8 workers'' shows that further workers can be used to accommodate a tighter schedule, but then the symptoms for //breaking the schedule// are already reached at a nominally lower stress value, and only 5.3 of 8 workers will be active on average — this graph topology simply does not offer more work load locally, since 75% of all Nodes have a predecessor.
@@font-size:1.4em; 🙛 -- 🙙@@
Building upon these results, an extended probe with 64k nodes was performed -- this time with planning activities interwoven for each chunk of 32 nodes -- leading to a run time of 87.4 seconds (5.3ms per node) and concurrency ≡ 3.95
This confirms the ''ability for steady-state processing at full load''.
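As a plausibility check on these figures (assuming 64k ≡ 65536 nodes): the total run time should come out near //nodes · per-node time / concurrency//, and indeed 65536 · 5.3ms / 3.95 ≈ 87.9s, close to the measured 87.4s.

```cpp
// Cross-check of the steady-state figures quoted above: with n nodes,
// an effective per-node computation time and a given average
// concurrency, the expected total run time in seconds is
//   n · perNodeMs / 1000 / concurrency
double expectedRuntimeSec(long nodes, double perNodeMs, double concurrency) {
    return nodes * perNodeMs / 1000.0 / concurrency;
}
```

The ~0.5% residual against the measured value is plausibly covered by the interwoven planning activities and ramp-up, which this back-of-the-envelope relation ignores.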
&lt;html&gt;&lt;div style=&quot;clear: both&quot;/&gt;&lt;/html&gt;</pre>
</div>
<div title="SchedulerWorker" creator="Ichthyostega" modifier="Ichthyostega" created="202309041605" modified="202312281745" tags="Rendering operational spec draft" changecount="21">
Shutdown is initiated by sending a message to the dispatcher loop. This causes t
<div title="SideBarOptions" modifier="CehTeh" created="200706200048">
<pre>&lt;&lt;search&gt;&gt;&lt;&lt;closeAll&gt;&gt;&lt;&lt;permaview&gt;&gt;&lt;&lt;newTiddler&gt;&gt;&lt;&lt;saveChanges&gt;&gt;&lt;&lt;slider chkSliderOptionsPanel OptionsPanel &quot;options »&quot; &quot;Change TiddlyWiki advanced options&quot;&gt;&gt;</pre>
</div>
<div title="SiteSubtitle" modifier="Ichthyostega" created="200706190044" modified="202404192306" changecount="1">
<pre>Building a Render Nodes Network from Media Objects in the Session</pre>
</div>
<div title="SiteTitle" modifier="Ichthyostega" created="200706190042" modified="200708080212">
<pre>Engine</pre>
