Scheduler Load and Stress Testing
=================================
:date: 2024-04-08

Following the initial integration, an extended series of load and stress tests
was conducted to determine the limitations and weaknesses of the new Scheduler
implementation. During that time, from December to April, several bugs and problems
were identified, improving the maturity of design and code. In the first phase,
the goal was to subject the implementation to increasingly challenging load patterns.
In the second phase, several instrumentation and measurement methods were developed
to establish characteristic traits and boundary conditions quantitatively.

This directory holds a loose collection of material related to these testing efforts.
Trace dumps were produced by adding print statements at relevant locations; over time,
a change set with a suitable setup of such diagnostic print statements was assembled,
which can be applied by `git cherry-pick`. Moreover, load patterns generated by the
`TestChainLoad` tooling can be visualised as a _Graphviz diagram_, and the collection
of tools in the `StressTestRig` generates report output and data visualisation as
_Gnuplot_ scripts. Raw measurement data is stored as CSV (see 'csv.hpp').

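To illustrate the kind of visualisation involved, here is a minimal sketch (a
hypothetical helper, not the actual `TestChainLoad` code) that renders a job
dependency graph as Graphviz DOT text:

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Minimal sketch: emit a DOT description for a set of dependency edges,
// so the load topology can be rendered with the `dot` tool.
// (Hypothetical helper, not the actual TestChainLoad output format.)
std::string renderDot(std::vector<std::pair<int,int>> const& edges)
{
    std::ostringstream dot;
    dot << "digraph load {\n";
    for (auto const& edge : edges)
        dot << "  n" << edge.first << " -> n" << edge.second << ";\n";
    dot << "}\n";
    return dot.str();
}
```

Piping the resulting text through `dot -Tsvg` then yields a rendering of the
load topology.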
Breaking Point Testing
----------------------
Topo-10::
Topology of the processing load used as a typical example for _breaking a schedule._
This graph with 64 nodes is generated by the pre-configured rules
`configureShape_chain_loadBursts()`; it starts as a single linear chain, yet »bursts«
into an excessively interconnected parallel graph roughly in the middle. Due to
the dependency connections, the capacity is typically underused at first, followed
by a short and massive overload peak. Deliberately, there is not much room for
speed-up through parallelisation.

Load Peak Testing
-----------------
Dump-10::
Example of a typical clean run working through a homogeneous load peak.
With load patterns in this category, contention is not an issue and, after the
initial ramp-up, all workers are fully loaded and immediately commence with the
next job in line, typically requiring ~70µs for the administrative work between
processing jobs (note: all these measurements were done with *debug builds*,
without compiler optimisation). As an artefact of the measurement setup, a
follow-up job is wired as dependency on each work job; the dependency wiring
and dependency checking are performed outside the active measurement interval,
so that only the additional posting of the follow-up check adds a constant
administrative overhead to each job, estimated at +30µs.

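The ~70µs figure can be derived from recorded job activation times. A minimal
sketch, assuming a hypothetical trace record layout (the actual instrumentation
differs): the gap between the end of one job and the start of the next on the
same worker is the administrative overhead.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using TimePoint = std::chrono::steady_clock::time_point;

// Hypothetical trace record: active interval of one processed job
struct JobRecord { TimePoint start, end; };

// Average gap in microseconds between consecutive jobs of one worker;
// this gap is the administrative overhead between processing jobs.
double avgAdminOverhead(std::vector<JobRecord> const& jobs)
{
    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    if (jobs.size() < 2) return 0.0;
    long long total = 0;
    for (std::size_t i = 1; i < jobs.size(); ++i)
        total += duration_cast<microseconds>(jobs[i].start - jobs[i-1].end).count();
    return double(total) / double(jobs.size() - 1);
}
```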
Graph-10::
Plot from a parametric measurement series with the same settings as used for
Dump-10; settings in this range are known to produce clean processing
with no obvious irregular overhead. In this case
+
- the series comprises 40 measurements, covering a parameter range of 10...100
- the free parameter is the number of jobs in a homogeneous load peak
- through built-in instrumentation, the job activation can be observed, making it
  possible to derive the average work-job time, the average time in full concurrency
  and the amount of time with limitations (less than two active workers)
+
Notably, the concurrency approaches the theoretical limit with increasing peak length,
yet some headroom remains. However, by logical reasoning it becomes clear that there
_must be a start-up and a shutdown phase_ in each run, and each phase must be reckoned
at one job length (in this case 2ms start-up and 2ms shutdown); start-up counts
as _one job completed,_ while shutdown counts as _N-1 jobs completed._ Taking these
considerations into account, 4ms of the constant offset indicated by the linear model
can be explained, so there remains a further offset of 5ms. Various observations seem
to indicate that these *5ms* can be considered a *generic scheduling latency*.
+
The actual job times are always slightly above the calibrated value (the load operation
is calibrated before each test on the specific hardware; it could be shown that this
calibration also holds under multithreaded execution, but the behaviour degrades
when contention, lock coordination and cache misses come into play).

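The offset arithmetic above can be restated numerically; the 9ms total offset
used here is inferred from the 4ms + 5ms split given in the text, not a
separately reported measurement:

```cpp
// Worked numbers from the Graph-10 discussion: start-up and shutdown each
// account for one job length, explaining part of the constant offset
// of the fitted linear model; the remainder is read as generic
// scheduling latency. (Total offset of 9ms inferred from the text.)
constexpr double jobTime_ms       = 2.0;                   // calibrated job load
constexpr double modelOffset_ms   = 9.0;                   // offset of the linear model (inferred)
constexpr double rampExplained_ms = 2 * jobTime_ms;        // start-up + shutdown = 4ms
constexpr double schedLatency_ms  = modelOffset_ms - rampExplained_ms;   // remaining 5ms
```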
Graph-11::
There seem to be limitations on the achievable degree of concurrency, however.
Using the same basic setup, again with a job load of 2ms, but this time with an
8-worker pool (the maximum possible on this machine), the average concurrency
barely surpasses 7 workers, even in an extended range of up to 200 jobs. Moreover,
in similar runs, a small number of excessive outliers was observed, hinting at
some pattern which prevents full usage of the available capacity.

Dump-11::
One example trace with 200 jobs, a job load of 2ms and an 8-worker pool (all
instrumented runs investigated took notably longer than the corresponding runs
in the Graph, which can be attributed to overheads created by the print
statements). Some observations:
+
- planning expense is significant, which causes the planning to surpass the start
  of the first »tick« job. At this point, the complete workforce shows up as expected,
  and subsequently goes into contention wait. _This seems to have further ramifications._
- almost 10ms pass until more workers return from sleep states; this is the expected
  behaviour, given that there was a pause of 10ms after all planning work was complete.
- on the surface level, the following processing pattern looks completely regular
- yet whenever a worker happens to hit contention, it immediately retracts for
  an extended period of several milliseconds.

Dump-12::
Based on that observation, the _stickiness_ of the contention encounter was reduced;
the step-down is not by a power series, and thus it takes far fewer successful
work-pulls to erase the memory of a contention event. The effect can be seen clearly:
there are no longer extended strikes of contention in the middle of the processing,
and workers encountering contention return faster to regular work.
+
Yet some further irregularities remain: sometimes an assumed pause of 2ms takes 6 or
10ms, without obvious reason. Presumably these are effects of hardware and OS
scheduling -- which effectively prevent the processing from reaching full load on
all 8 cores.

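One possible reading of this tuning, as a minimal sketch (hypothetical, the
actual scheduler code differs): each contention encounter raises a counter which
governs the retraction time, and each successful work-pull steps it down
linearly, so few pulls suffice to erase the contention memory.

```cpp
#include <algorithm>

// Sketch of the contention "stickiness" idea (hypothetical, not the
// actual scheduler implementation): a per-worker contention level
// dictates how long the worker retracts after hitting contention.
struct ContentionMemory
{
    int level = 0;

    void onContention()     { level = std::min(level + 1, 10); }

    // linear step-down: one successful work-pull removes one level,
    // so the memory of a contention event is erased quickly
    void onSuccessfulPull() { level = std::max(level - 1, 0); }

    // retraction time grows with the memorised contention level
    int backoffMicros() const { return level * 100; }
};
```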
Graph-13::
Increasing the load per job again by a factor of 4, setting the nominal job computation
load to 8ms, further supports the observed tendency: with an increasing number of jobs
in a row, the concurrency converges even closer to the theoretical limit, but misses it
by a narrow margin. Interestingly, the distribution of timings around the regression
line is also tighter here, and, again, the linear model matches the expectation closely
(8ms with 8 cores => ∅ 1ms).

Graph-14::
Going in the opposite direction, with only a small job computation load of 500µs, the
parallelisation remains even below 4 workers on average, due to throttling of the
workers' pull cycles when encountering repeated contention on the »Grooming Token«.

Graph-15::
As an extreme example, here the job computation load was set to 200µs, which is barely
above the known base overhead of 100µs required to schedule a single job (with a debug
build). Notably, a break in the pattern appears here: below 90 jobs, in some cases a
higher concurrency is achieved, thus distinguishing two separate sub-populations with
different behaviour patterns. Above 90 jobs, all cases are handled by two workers, and
the behaviour is much more uniform. From watching individual runs with trace dumps, it
can be hypothesised that the amount of contention at the beginning of the work phase
determines the emerging work pattern. If, by chance, workers show up interleaved and
evenly spaced, a higher number of workers is able to pick a job and remain on duty for
follow-up jobs as well, since workers returning from active work get precedence in
picking the next job.
+
Another surprising observation is the drift of the actual in-job times; even though the
workload was pre-calibrated for this hardware, the actual computation in the context of
the scheduler is reproducibly slower (at least on my machine). Below 90 jobs, the spread
of this observed value is also larger, as is the spread of time in _impeded state,_
which is defined as less than two workers processing active job content at a given time.

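The impeded-state metric can be computed from the recorded job intervals with a
simple sweep over activation events. A minimal sketch (hypothetical, not the
actual instrumentation code):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Interval = std::pair<double,double>;   // (start, end) of a job, in ms

// Sum up the time within the run where fewer than two workers were
// processing active job content concurrently ("impeded state").
double impededTime(std::vector<Interval> const& jobs)
{
    std::vector<std::pair<double,int>> events;   // (time, +1 on start / -1 on end)
    for (auto const& job : jobs)
    {
        events.emplace_back(job.first,  +1);
        events.emplace_back(job.second, -1);
    }
    std::sort(events.begin(), events.end());
    double impeded = 0, prev = 0;
    int  active  = 0;
    bool started = false;
    for (auto const& ev : events)
    {
        if (started && active < 2)       // span since previous event was impeded
            impeded += ev.first - prev;
        active += ev.second;
        prev = ev.first;
        started = true;
    }
    return impeded;
}
```

For example, two jobs spanning 0...4ms and 1...3ms overlap only in the middle,
so the first and last millisecond count as impeded.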