lumiera_/doc/devel/dump/2024-04-08.Scheduler-LoadTest/index.txt
Ichthyostega 1d4f6afd18 Scheduler-test: complete and document the Load-peak tests
- use parameters known to produce a clean linear model
- assert on properties of this linear model

Add extended documentation into the !TiddlyWiki,
with a textual account of the various findings,
also including some of the images and diagrams,
rendered as SVG
2024-04-12 02:23:31 +02:00


Scheduler Load and Stress Testing
=================================
:date: 2024-04-08

Following the initial integration, an extended series of load and stress tests
was conducted to determine the limitations and weaknesses of the new Scheduler
implementation. During that time, from December to April, several bugs and problems
were identified, improving the maturity of design and code. In the first phase,
the goal was to subject the implementation to increasingly challenging load patterns.
In the second phase, several instrumentation and measurement methods were developed
to establish characteristic traits and boundary conditions quantitatively.
This directory holds a loose collection of material related to these testing efforts.
Trace dumps were produced by adding print statements at relevant locations; over time,
a change set with a suitable setup of such diagnostic print statements was assembled,
which can be applied by `git cherry-pick`. Moreover, load patterns generated by the
`TestChainLoad` tooling can be visualised as a _Graphviz diagram_, and the collection
of tools in the `StressTestRig` generates report output and data visualisation as
_Gnuplot_ scripts. Raw measurement data is stored as CSV (see 'csv.hpp').

Breaking Point Testing
----------------------
Topo-10::
Topology of the processing load used as a typical example for _breaking a schedule._
This graph with 64 nodes is generated by the pre-configured rules
`configureShape_chain_loadBursts()`; it starts as a single linear chain, yet »bursts«
into an excessively interconnected parallel graph roughly in the middle. Due to
the dependency connections, the capacity is typically underused at first, followed
by a short and massive overload peak. Deliberately, there is not much room for
speed-up through parallelisation.

Load Peak Testing
-----------------
Dump-10::
Example of a typical clean run working through a homogeneous load peak.
With load patterns in this category, contention is not an issue and, after the
initial ramp-up, all workers are fully loaded and immediately commence with the
next job in line, typically requiring ~70µs for the administrative work between
processing jobs (note: all these measurements were done with *debug builds*,
without compiler optimisation). As an artefact of the measurement setup, a
follow-up job is wired as a dependency onto each work job; the dependency wiring
and dependency checking are performed outside the active measurement interval,
so that only the additional posting of the follow-up check adds a constant
administrative overhead to each job, estimated to be about +30µs.
Graph-10::
Plot from a parametric measurement series with the same settings as used for
Dump-10; settings in this range are known to produce clean processing
with no obvious irregular overhead. In this case:
+
- the series comprises 40 measurements, covering a parameter range of 10...100
- the free parameter is the number of jobs in a homogeneous load peak
- through built-in instrumentation, the job activation can be observed, allowing
  derivation of the average work-job time, the average time in full concurrency
  and the amount of time with limitations (less than two active workers)
+
Notably the concurrency approaches the theoretical limit with increasing peak length,
yet some headroom remains. However, by logical reasoning it becomes clear that there
_must be a start-up and a shutdown phase_ in each run, and each of these phases must be
accounted with one job length (in this case 2ms start-up and 2ms shutdown); start-up counts
as _one job completed,_ while shutdown counts as _N-1 jobs completed._ Taking these
considerations into account, 4ms of the constant offset (»socket«) indicated by the linear
model can be explained, so a further offset of 5ms remains. Various observations seem to
indicate that these *5ms* can be considered a *generic scheduling latency.*
+
The actual job times are always slightly above the calibrated value (the load operation
is calibrated before each test on the specific hardware; it could be shown that this
calibration also holds under multithreaded execution, but the behaviour degrades
when contention, lock coordination and cache misses come into play).
Graph-11::
However, there seem to be limitations on the achievable degree of concurrency.
Using the same basic setup, again with a job load of 2ms, but this time with an
8-worker pool (the maximum possible on this machine), even in an extended range of
up to 200 jobs the average concurrency barely surpasses 7 workers. Moreover, in
similar runs, a small number of excessive outliers was observed, hinting at some
pattern which prevents full usage of the available capacity.
Dump-11::
One example trace with 200 jobs, a job load of 2ms and an 8-worker pool (all instrumented
runs investigated took notably longer than the corresponding runs in the graph,
which may be due to overhead created by the print statements). Some observations:
+
- planning expense is significant, which causes the planning to overrun the start
  of the first »tick« job. At this point, the complete workforce shows up as expected,
  and subsequently goes into contention wait. _This seems to have further ramifications._
- almost 10ms pass until more workers return from sleep states; this is the expected
  behaviour, given that there was a pause of 10ms after all planning work was complete.
- on the surface, the following processing pattern looks completely regular
- yet whenever a worker happens to hit contention, it immediately retracts for
  an extended period of several milliseconds.
Dump-12::
Based on that observation, the _stickiness_ of the contention encounter was reduced;
the step-down is no longer governed by a power series, and thus it takes far fewer
successful work-pulls to erase the memory of a contention event. The effect can be
seen clearly: there are no longer stretches of contention in the middle of the
processing, and workers encountering contention return to regular work faster.
+
Yet some further irregularities remain: sometimes a nominal pause of 2ms takes 6 or 10ms,
without obvious reason. Presumably these are effects of hardware and OS scheduling -- which
effectively prevent the processing from reaching full load on all 8 cores.
Graph-13::
Increasing the load per job again by a factor of 4, setting the nominal job computation load
to 8ms, further supports the observed tendency: with an increasing number of jobs in a row, the
concurrency converges even closer to the theoretical limit, but misses it by a narrow margin.
Interestingly, the distribution of timings around the regression line is also tighter here,
and, again, the linear model matches the expectation closely (8ms with 8 cores => ∅ 1ms).
Graph-14::
Going in the opposite direction, with only a small job computation load of 500µs, the
parallelisation remains below even 4 workers on average, due to throttling of the workers'
pull cycles when encountering repeated contention on the »Grooming Token«.
Graph-15::
As an extreme example, here the job computation load was set to 200µs, which is barely
above the known base overhead of 100µs required to schedule a single job (with a debug build).
Notably, a break in the pattern appears here: below 90 jobs, in some cases a higher concurrency
is achieved, thus distinguishing two separate sub-populations with different behaviour patterns.
Above 90 jobs, all cases are handled by two workers, and the behaviour is much more uniform.
From watching individual runs with the trace dump, it can be hypothesised that the amount of
contention at the beginning of the work phase determines the emerging work pattern.
If, by chance, workers show up interleaved and evenly spaced, a higher number of workers is
able to pick a job and remain on duty for follow-up jobs as well, since workers returning from
active work get precedence in picking the next job.
+
Another surprising observation is the drift of the actual in-job times; even though the work load
was pre-calibrated for this hardware, the actual computation in the context of the scheduler is
reproducibly slower (at least on my machine). Below 90 jobs, the spread of this observed value
is also larger, as is the spread of time in the _impeded state,_ which is defined as less than
two workers processing active job content at a given time.