Scheduler Load and Stress Testing
=================================
:date: 2024-04-08

Following the initial integration, an extended series of load and stress tests
was conducted to determine the limitations and weaknesses of the new Scheduler
implementation. During that time, from December to April, several bugs and problems
were identified, improving the maturity of design and code. In the first phase,
the goal was to subject the implementation to increasingly challenging load patterns.
In the second phase, several instrumentation and measurement methods were developed
to establish characteristic traits and boundary conditions quantitatively.

This directory holds a loose collection of material related to these testing efforts.
Trace-dumps were produced by adding print statements at relevant locations; over time,
a change set with a suitable setup of such diagnostic print statements was assembled,
which can be applied by `git cherry-pick`. Moreover, load patterns generated by the
`TestChainLoad` tooling can be visualised as a _Graphviz_ diagram, and the collection
of tools in the `StressTestRig` generates report output and data visualisation as a
_Gnuplot_ script. Raw measurement data is stored as CSV (see 'csv.hpp').
Breaking Point Testing
----------------------
Topo-10::
Topology of the processing load used as a typical example for _breaking a schedule._
This graph with 64 nodes is generated by the pre-configured rule set
`configureShape_chain_loadBursts()`; it starts as a single linear chain, yet »bursts«
into an excessively interconnected parallel graph roughly in the middle. Due to
the dependency connections, capacity is typically underused at first, followed
by a short and massive overload peak. Deliberately, there is not much room for
speed-up through parallelisation.
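+
To illustrate the described shape, the following self-contained sketch emits a Graphviz
(DOT) graph of a linear chain which »bursts« into a densely cross-linked parallel section
in the middle. This is _not_ the actual generation rule behind
`configureShape_chain_loadBursts()`; the burst parameters are invented for illustration.
The output can be rendered with `dot -Tsvg`.
+
[source,cpp]
----
#include <iostream>

int main()
{   // hypothetical shape parameters, chosen to mimic the described topology
    int const N = 64;                       // overall node count (as in Topo-10)
    int const burstBegin = 24, burstEnd = 40, fanOut = 4;

    std::cout << "digraph topo {\n";
    for (int i = 1; i < N; ++i)             // linear chain outside the burst region
        if (i <= burstBegin or i > burstEnd)
            std::cout << "  n" << i-1 << " -> n" << i << ";\n";
    for (int i = burstBegin; i < burstEnd; ++i)
        for (int j = 1; j <= fanOut; ++j)   // burst: excessive cross-links
            if (i + j <= burstEnd)
                std::cout << "  n" << i << " -> n" << i+j << ";\n";
    std::cout << "}\n";
}
----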
Load Peak Testing
-----------------
Dump-10::
Example of a typical clean run working through a homogeneous load peak.
With load patterns in this category, contention is not an issue and, after the
initial ramp-up, all workers are fully loaded and immediately commence with the
next job in the sequence, typically requiring ~70µs for the administrative work
between processing jobs (note: all these measurements were done with *debug builds*,
without compiler optimisation). As an artefact of the measurement setup, a
follow-up job is wired as a dependency onto each work job; the dependency wiring
and dependency checking are performed outside the active measurement interval,
so that only the additional posting of the follow-up check adds a constant
administrative overhead to each job, which can be estimated at about +30µs.
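+
Taken together, this suggests a per-job cost estimate. The following is a back-of-envelope
reckoning from the numbers quoted above; the correspondence with the 100µs base overhead
mentioned under Graph-15 below is an inference, not a measured result.
+
[source,cpp]
----
#include <iostream>

int main()
{   // numbers as quoted in the text, measured with debug builds
    double const adminBetweenJobs_us = 70;   // administrative work between processing jobs
    double const followUpPost_us     = 30;   // estimated posting of the follow-up check

    std::cout << "per-job overhead : "
              << adminBetweenJobs_us + followUpPost_us << "µs\n";   // roughly 100µs
}
----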
Graph-10::
Plot from a parametric measurement series with the same settings as used for
Dump-10; settings in this range are known to produce clean processing
with no obvious irregular overhead. In this case:
+
- the series comprises 40 measurements, covering a parameter range 10...100
- the free parameter is the number of jobs in a homogeneous load peak
- through built-in instrumentation, the job activation can be observed, allowing
one to derive the average work job time, the average time at full concurrency and
the amount of time with limitations (less than two active workers)
+
Notably the concurrency approaches the theoretical limit with increasing peak length,
yet some headroom remains. However, by logical reasoning it becomes clear that there
_must be a start-up and shutdown phase_ in each run, and each of these phases must be
budgeted with one Job length (in this case 2ms start-up and 2ms shutdown); start-up counts
as _one Job completed,_ while shutdown counts as _N-1 Jobs completed._ Taking these
considerations into account, 4ms of the socket (constant offset) indicated by the linear
model can be explained, so there remains a further socket of 5ms. Various observations
seem to indicate that these *5ms* can be considered a *generic scheduling latency.*
+
The actual job times are always slightly above the calibrated value (the load operation
is calibrated before each test on the specific hardware; it could be shown that this
calibration also holds under multithreaded execution, but the behaviour degrades
when contention, lock coordination and cache misses come into play).
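+
The accounting behind this reasoning can be made explicit with a small calculation sketch.
The total socket of 9ms is an assumption here, read back from the 4ms + 5ms split stated
above; the job time of 2ms is as in this measurement series.
+
[source,cpp]
----
#include <iostream>

int main()
{
    double const t_job  = 2.0;                    // [ms] calibrated job computation time
    double const socket = 9.0;                    // [ms] offset of the linear model (assumed)

    double const rampUp    = t_job;               // start-up phase = one job length
    double const rampDown  = t_job;               // shutdown phase = one job length
    double const explained = rampUp + rampDown;   // 4ms explained structurally

    std::cout << "residual socket: " << socket - explained
              << "ms\n";                          // 5ms: the generic scheduling latency
}
----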
Graph-11::
There seem to be limitations on the achievable degree of concurrency, however.
Using the same basic setup, again with a job load of 2ms, but this time with an
8-worker pool (the maximum possible on this machine), the average concurrency
barely surpasses 7 workers, even in an extended range up to 200 jobs. Moreover,
in similar runs, a small number of excessive outliers was observed, hinting at
some pattern which prevents full usage of the available capacity.
Dump-11::
One example trace with 200 Jobs, Job load 2ms and an 8-worker pool (all instrumented
runs investigated took notably longer than the corresponding runs in the Graph,
which may be due to overheads created by the print statements). Some observations:
+
- planning expense is significant, which causes planning to extend beyond the start
of the first »tick« job. At this point, the complete workforce shows up as expected,
and subsequently goes into contention wait. _This seems to have further ramifications._
- almost 10ms pass until more workers return from sleep states; this is the expected
behaviour, given that there was a pause of 10ms after all planning work was complete.
- on the surface, the subsequent processing pattern looks completely regular
- yet whenever a worker happens to hit contention, it immediately retracts for
an extended period of several milliseconds.
Dump-12::
Based on that observation, the _stickiness_ of the contention encounter was reduced;
the step-down no longer follows a power series, and thus far fewer successful work-pulls
are required to erase the memory of a contention event (sketched below). The effect can
be seen clearly; there are no longer streaks of contention in the middle of the processing,
and workers encountering contention return faster to regular work.
+
Yet some further irregularities remain: sometimes an intended pause of 2ms takes 6 or 10ms,
without obvious reason. Presumably these are effects of hardware and OS scheduling, which
effectively prevent the processing from reaching full load on all 8 cores.
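+
The underlying mechanism can be pictured with the following self-contained sketch, which
is _not_ the actual Scheduler code; levels, factors and timings are invented for
illustration. A contention hit raises a back-off level, translating into a sleep period
before the next work-pull, while successful pulls step the level down again. The steeper
this step-down, the fewer successful pulls are needed to erase the memory of a
contention event.
+
[source,cpp]
----
#include <algorithm>
#include <iostream>

class ContentionBackoff
{
    int level_ = 0;

public:
    void onContention()     { level_ = std::min (level_+1, 8); }
    void onSuccessfulPull() { level_ = std::max (level_-3, 0); }   // steep step-down: forget quickly

    int sleepMicros()  const { return level_? 50 << level_ : 0; }  // 100µs ... 12.8ms
};

int main()
{
    ContentionBackoff backoff;
    backoff.onContention();
    backoff.onContention();
    std::cout << backoff.sleepMicros() << "µs\n";   // 200µs after two contention hits
    backoff.onSuccessfulPull();
    std::cout << backoff.sleepMicros() << "µs\n";   // 0µs: one successful pull erased the memory
}
----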
Graph-13::
Increasing the load per job again by a factor of 4, setting the nominal job computation load
to 8ms, further supports the observed tendency: with an increasing number of jobs, the
concurrency converges even closer to the theoretical limit, but misses it by a narrow margin.
Interestingly, the distribution of timings around the regression line is also tighter here,
and, again, the linear model matches the expectation closely (8ms with 8 cores => ∅ 1ms
per job).
Graph-14::
Going in the opposite direction, with only a small job computation load of 500µs, the
parallelisation even remains below 4 workers on average, due to throttling of the workers'
pull cycles when encountering repeated contention on the »Grooming Token«.
Graph-15::
As an extreme example, here the job computation load was set to 200µs, which is barely
above the known base overhead of 100µs required to schedule a single job (with debug build).
Notably, a break in the pattern appears here: below 90 jobs, in some cases a higher concurrency
is achieved, thus distinguishing two separate sub-populations with different behaviour patterns.
Above 90 jobs, all cases are handled by two workers, and the behaviour is much more uniform.
From watching individual runs with trace-dump, it can be hypothesised that the amount of
contention at the beginning of the work phase determines the emerging work pattern.
If, by chance, workers show up interleaved and evenly spaced, a higher number of workers is
able to pick a job and remain on duty for follow-up jobs as well, since workers returning from
active work get precedence in picking the next job.
+
Another surprising observation is the drift of the actual in-job times; even though the workload
was pre-calibrated for this hardware, the actual computation in the context of the scheduler is
reproducibly slower (at least on my machine). Below 90 jobs, the spread of this observed value
is also larger, as is the spread of time in _impeded state,_ which is defined as less than two
workers processing active job content at a given time.
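+
Given per-job activity intervals from the instrumentation, the time in impeded state can be
computed by a simple sweep over the activation events. The following self-contained sketch
illustrates the definition above; it is not the actual instrumentation code.
+
[source,cpp]
----
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

struct Interval { double start, end; };                    // [ms] one job actively processed

// total time where less than two workers process active job content
double
impededTime (std::vector<Interval> const& jobs)
{
    if (jobs.empty()) return 0;
    std::vector<std::pair<double,int>> events;             // (time, +1 start / -1 end)
    for (auto const& j : jobs)
    {   events.emplace_back (j.start, +1);
        events.emplace_back (j.end  , -1);
    }
    std::sort (events.begin(), events.end());

    double impeded = 0, prev = events.front().first;
    int active = 0;                                        // currently active workers
    for (auto const& [t, delta] : events)
    {   if (active < 2)
            impeded += t - prev;                           // span counted as impeded
        active += delta;
        prev = t;
    }
    return impeded;
}

int main()
{   // two jobs overlapping only within [1ms,2ms] : 2ms impeded (0..1 and 2..3)
    std::cout << impededTime ({{0,2},{1,3}}) << "ms\n";
}
----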