Scheduler Load and Stress Testing
=================================
:date: 2024-04-08

Following the initial integration, an extended series of load and stress tests
was conducted to determine the limitations and weaknesses of the new Scheduler
implementation. During that time, from December to April, several bugs and problems
were identified, improving the maturity of design and code. In the first phase,
the goal was to subject the implementation to increasingly challenging load patterns.
In the second phase, several instrumentation and measurement methods were developed
to establish characteristic traits and boundary conditions quantitatively.

This directory holds a loose collection of material related to these testing efforts.
Trace-dumps were produced by adding print statements at relevant locations; over time,
a change set with a suitable setup of such diagnostic print statements was assembled,
which can be applied by `git cherry-pick`. Moreover, load patterns generated by the
`TestChainLoad` tooling can be visualised as a _Graphviz_ diagram, and the collection
of tools in the `StressTestRig` generates report output and data visualisation as a
_Gnuplot_ script. Raw measurement data is stored as CSV (see 'csv.hpp').
Breaking Point Testing
----------------------
Topo-10::
Topology of the processing load used as a typical example for _breaking a schedule._
This graph with 64 nodes is generated by the pre-configured rule set
`configureShape_chain_loadBursts()`; it starts as a single linear chain, yet »bursts«
into an excessively interconnected parallel graph roughly in the middle. Due to
the dependency connections, capacity is typically underused at first, followed
by a short and massive overload peak. Deliberately, there is not much room for
speed-up through parallelisation.
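+
To illustrate the described shape, the following self-contained sketch emits a Graphviz
(DOT) graph of a linear chain which »bursts« into a densely cross-linked parallel section
in the middle. This is _not_ the actual generation rule behind
`configureShape_chain_loadBursts()`; the burst parameters are invented for illustration.
The output can be rendered with `dot -Tsvg`.
+
[source,cpp]
----
#include <iostream>

int main()
{   // hypothetical shape parameters, chosen to mimic the described topology
    int const N = 64;                       // overall node count (as in Topo-10)
    int const burstBegin = 24, burstEnd = 40, fanOut = 4;

    std::cout << "digraph topo {\n";
    for (int i = 1; i < N; ++i)             // linear chain outside the burst region
        if (i <= burstBegin or i > burstEnd)
            std::cout << "  n" << i-1 << " -> n" << i << ";\n";
    for (int i = burstBegin; i < burstEnd; ++i)
        for (int j = 1; j <= fanOut; ++j)   // burst: excessive cross-links
            if (i + j <= burstEnd)
                std::cout << "  n" << i << " -> n" << i+j << ";\n";
    std::cout << "}\n";
}
----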
Load Peak Testing
-----------------
Dump-10::
Example of a typical clean run working through a homogeneous load peak.
With load patterns in this category, contention is not an issue and, after the
initial ramp-up, all workers are fully loaded and immediately commence with the
next job in the sequence, typically requiring ~70µs for the administrative work
between processing jobs (note: all these measurements were done with *debug builds*,
without compiler optimisation). As an artefact of the measurement setup, a
follow-up job is wired as a dependency onto each work job; the dependency wiring
and dependency checking are performed outside the active measurement interval,
so that only the additional posting of the follow-up check adds a constant
administrative overhead to each job, which can be estimated at about +30µs.
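+
Taken together, this suggests a per-job cost estimate. The following is a back-of-envelope
reckoning from the numbers quoted above; the correspondence with the 100µs base overhead
mentioned under Graph-15 below is an inference, not a measured result.
+
[source,cpp]
----
#include <iostream>

int main()
{   // numbers as quoted in the text, measured with debug builds
    double const adminBetweenJobs_us = 70;   // administrative work between processing jobs
    double const followUpPost_us     = 30;   // estimated posting of the follow-up check

    std::cout << "per-job overhead : "
              << adminBetweenJobs_us + followUpPost_us << "µs\n";   // roughly 100µs
}
----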
Graph-10::
Plot from a parametric measurement series with the same settings as used for
Dump-10; settings in this range are known to produce clean processing
with no obvious irregular overhead. In this case:
+
- the series comprises 40 measurements, covering a parameter range 10...100
- the free parameter is the number of jobs in a homogeneous load peak
- through built-in instrumentation, the job activation can be observed, allowing
one to derive the average work job time, the average time at full concurrency and
the amount of time with limitations (less than two active workers)
+
Notably the concurrency approaches the theoretical limit with increasing peak length,
yet some headroom remains. However, by logical reasoning it becomes clear that there
_must be a start-up and shutdown phase_ in each run, and each of these phases must be
budgeted with one Job length (in this case 2ms start-up and 2ms shutdown); start-up counts
as _one Job completed,_ while shutdown counts as _N-1 Jobs completed._ Taking these
considerations into account, 4ms of the socket (constant offset) indicated by the linear
model can be explained, so there remains a further socket of 5ms. Various observations
seem to indicate that these *5ms* can be considered a *generic scheduling latency.*
+
The actual job times are always slightly above the calibrated value (the load operation
is calibrated before each test on the specific hardware; it could be shown that this
calibration also holds under multithreaded execution, but the behaviour degrades
when contention, lock coordination and cache misses come into play).
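+
The accounting behind this reasoning can be made explicit with a small calculation sketch.
The total socket of 9ms is an assumption here, read back from the 4ms + 5ms split stated
above; the job time of 2ms is as in this measurement series.
+
[source,cpp]
----
#include <iostream>

int main()
{
    double const t_job  = 2.0;                    // [ms] calibrated job computation time
    double const socket = 9.0;                    // [ms] offset of the linear model (assumed)

    double const rampUp    = t_job;               // start-up phase = one job length
    double const rampDown  = t_job;               // shutdown phase = one job length
    double const explained = rampUp + rampDown;   // 4ms explained structurally

    std::cout << "residual socket: " << socket - explained
              << "ms\n";                          // 5ms: the generic scheduling latency
}
----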
Graph-11::
There seem to be limitations on the achievable degree of concurrency, however.
Using the same basic setup, again with a job load of 2ms, but this time with an
8-worker pool (the maximum possible on this machine), the average concurrency
barely surpasses 7 workers, even in an extended range up to 200 jobs. Moreover,
in similar runs, a small number of excessive outliers was observed, hinting at
some pattern which prevents full usage of the available capacity.
Dump-11::
One example trace with 200 Jobs, Job load 2ms and an 8-worker pool (all instrumented
runs investigated took notably longer than the corresponding runs in the Graph,
which may be due to overheads created by the print statements). Some observations:
+
- planning expense is significant, which causes planning to extend beyond the start
of the first »tick« job. At this point, the complete workforce shows up as expected,
and subsequently goes into contention wait. _This seems to have further ramifications._
- almost 10ms pass until more workers return from sleep states; this is the expected
behaviour, given that there was a pause of 10ms after all planning work was complete.
- on the surface, the subsequent processing pattern looks completely regular
- yet whenever a worker happens to hit contention, it immediately retracts for
an extended period of several milliseconds.
Dump-12::
Based on that observation, the _stickiness_ of the contention encounter was reduced;
the step-down no longer follows a power series, and thus far fewer successful work-pulls
are required to erase the memory of a contention event (sketched below). The effect can
be seen clearly; there are no longer streaks of contention in the middle of the processing,
and workers encountering contention return faster to regular work.
+
Yet some further irregularities remain: sometimes an intended pause of 2ms takes 6 or 10ms,
without obvious reason. Presumably these are effects of hardware and OS scheduling, which
effectively prevent the processing from reaching full load on all 8 cores.
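+
The underlying mechanism can be pictured with the following self-contained sketch, which
is _not_ the actual Scheduler code; levels, factors and timings are invented for
illustration. A contention hit raises a back-off level, translating into a sleep period
before the next work-pull, while successful pulls step the level down again. The steeper
this step-down, the fewer successful pulls are needed to erase the memory of a
contention event.
+
[source,cpp]
----
#include <algorithm>
#include <iostream>

class ContentionBackoff
{
    int level_ = 0;

public:
    void onContention()     { level_ = std::min (level_+1, 8); }
    void onSuccessfulPull() { level_ = std::max (level_-3, 0); }   // steep step-down: forget quickly

    int sleepMicros()  const { return level_? 50 << level_ : 0; }  // 100µs ... 12.8ms
};

int main()
{
    ContentionBackoff backoff;
    backoff.onContention();
    backoff.onContention();
    std::cout << backoff.sleepMicros() << "µs\n";   // 200µs after two contention hits
    backoff.onSuccessfulPull();
    std::cout << backoff.sleepMicros() << "µs\n";   // 0µs: one successful pull erased the memory
}
----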
Graph-13::
Increasing the load per job again by a factor of 4, setting the nominal job computation load
to 8ms, further supports the observed tendency: with an increasing number of jobs, the
concurrency converges even closer to the theoretical limit, but misses it by a narrow margin.
Interestingly, the distribution of timings around the regression line is also tighter here,
and, again, the linear model matches the expectation closely (8ms with 8 cores => ∅ 1ms
per job).
Graph-14::
Going in the opposite direction, with only a small job computation load of 500µs, the
parallelisation even remains below 4 workers on average, due to throttling of the workers'
pull cycles when encountering repeated contention on the »Grooming Token«.
Graph-15::
As an extreme example, here the job computation load was set to 200µs, which is barely
above the known base overhead of 100µs required to schedule a single job (with debug build).
Notably, a break in the pattern appears here: below 90 jobs, in some cases a higher concurrency
is achieved, thus distinguishing two separate sub-populations with different behaviour patterns.
Above 90 jobs, all cases are handled by two workers, and the behaviour is much more uniform.
From watching individual runs with trace-dump, it can be hypothesised that the amount of
contention at the beginning of the work phase determines the emerging work pattern.
If, by chance, workers show up interleaved and evenly spaced, a higher number of workers is
able to pick a job and remain on duty for follow-up jobs as well, since workers returning from
active work get precedence in picking the next job.
+
Another surprising observation is the drift of the actual in-job times; even though the workload
was pre-calibrated for this hardware, the actual computation in the context of the scheduler is
reproducibly slower (at least on my machine). Below 90 jobs, the spread of this observed value
is also larger, as is the spread of time in _impeded state,_ which is defined as less than two
workers processing active job content at a given time.
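+
Given per-job activity intervals from the instrumentation, the time in impeded state can be
computed by a simple sweep over the activation events. The following self-contained sketch
illustrates the definition above; it is not the actual instrumentation code.
+
[source,cpp]
----
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

struct Interval { double start, end; };                    // [ms] one job actively processed

// total time where less than two workers process active job content
double
impededTime (std::vector<Interval> const& jobs)
{
    if (jobs.empty()) return 0;
    std::vector<std::pair<double,int>> events;             // (time, +1 start / -1 end)
    for (auto const& j : jobs)
    {   events.emplace_back (j.start, +1);
        events.emplace_back (j.end  , -1);
    }
    std::sort (events.begin(), events.end());

    double impeded = 0, prev = events.front().first;
    int active = 0;                                        // currently active workers
    for (auto const& [t, delta] : events)
    {   if (active < 2)
            impeded += t - prev;                           // span counted as impeded
        active += delta;
        prev = t;
    }
    return impeded;
}

int main()
{   // two jobs overlapping only within [1ms,2ms] : 2ms impeded (0..1 and 2..3)
    std::cout << impededTime ({{0,2},{1,3}}) << "ms\n";
}
----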