Some test runs performed remarkably smoothly, but in one case the processing
was drastically delayed due to heavy contention. The relevance of this incident
is not yet clear, since this test run uses a rather atypical load with very
short actual work jobs. Anyway, the dump logs are documented with this commit.

Scheduler Integration Testing
-----------------------------
:date: 2023-12-09

After successfully completing the unit tests for the newly built scheduler,
the next goal was to get some complete integration test runs. Moreover,
in preparation for Load Testing, the `TestChainLoad` was developed,
which is a generator for artificial yet reproducible test calculations.
This component generates a graph of interconnected nodes, where each
node will be invoked as an individual render job. However, nodes can
have predecessors and successors; in order to calculate a node's hash value,
the hash values of all predecessors must have been calculated.

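The hash chaining described above can be sketched as follows. This is an
illustrative stand-in, not the actual `TestChainLoad` implementation: the
names `Node` and `chainHash`, the seed value and the mixing constant are
assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch: each node combines the already computed hash values of all its
// predecessors; thus a node can only be evaluated after its predecessors.
struct Node
{
    size_t hash{0};
    std::vector<Node*> pred;   // predecessor nodes, must be computed first
};

// Combine the predecessor hashes deterministically;
// a node without predecessors yields just the seed value.
inline size_t
chainHash (Node const& node, size_t seed = 1)
{
    size_t acc = seed;
    for (Node* p : node.pred)
        acc = std::hash<size_t>{} (acc ^ (p->hash + 0x9e3779b9u + (acc<<6) + (acc>>2)));
    return acc;
}
```

Evaluating the nodes in topological order (predecessors first) then produces
a reproducible "exit hash" for the whole graph, which the test can verify.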
In order to see what is happening in the scheduler and the job functors
generated by `TestChainLoad::setupSchedule()`, some print statements were
added temporarily. Together, these generate a trace to indicate all planning,
scheduling and dispatching events, together with an abbreviated thread-ID
and a time relative to the first scheduler time encountered during this
test run. There is a changeset in the Git history which _removes_ all
these print statements -- and by reverting this git commit, it should be
possible to re-introduce this simplistic instrumentation.

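In the spirit of that temporary instrumentation, a minimal trace helper might
look like the following sketch. All names here (`relativeTime_us`,
`shortThreadID`, `TRACE_EVENT`) are hypothetical and do not reflect the actual
print statements removed from the code base.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;

// Microseconds elapsed since the first invocation; the first call
// latches the reference point, mirroring "time relative to the
// first scheduler time encountered".
inline long
relativeTime_us()
{
    static Clock::time_point start = Clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds> (Clock::now() - start)
                       .count();
}

// Abbreviated thread-ID: a short hash of the real (unwieldy) thread id.
inline unsigned
shortThreadID()
{
    return unsigned (std::hash<std::thread::id>{} (std::this_thread::get_id()) % 10000);
}

// One trace line per planning / scheduling / dispatching event.
#define TRACE_EVENT(msg) \
    std::printf ("Thr-%04u |%7ldus : %s\n", shortThreadID(), relativeTime_us(), (msg))
```

Note that such output is itself synchronised within the C library, which can
add contention of its own -- a caveat relevant to the Dump-03 findings below.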
This directory documents some dump logs and detailed findings from the
debugging and investigation sessions done at that time.

.TestChainLoad_test::usageExample
Generally speaking, this test presents a slightly unusual load.
It is comprised of peaks -- one for _each level of the graph_ -- each
requiring typically 5 jobs to be calculated, all with the same start time.
The graph is composed of small join-trees, 2-3 nodes deep, which are started
interleaved. In the version used here, the actual calculations produce no
tangible CPU load and thus the job processing is absolutely dominated by
the scheduler management overhead. Several workers, pulling concurrently,
may create a slowdown due to contention on the »Grooming-Token«.

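To illustrate the kind of contention meant here, the »Grooming-Token« can be
thought of as a single-owner, non-blocking lock: only one worker at a time may
pull and groom the queue, while every other worker fails the acquisition and
must retry or back off. The following is a simplified sketch under that
assumption, not the actual Scheduler implementation.

```cpp
#include <atomic>
#include <thread>

// Sketch of a single-owner token: acquisition is a lock-free
// compare-and-swap against the "no owner" state, so a second
// acquisition attempt fails while the token is held (non-reentrant).
class GroomingToken
{
    std::atomic<std::thread::id> owner_{};   // default-constructed id == "no owner"

public:
    bool
    tryAcquire()
    {
        std::thread::id expected{};          // expect: currently unowned
        return owner_.compare_exchange_strong (expected, std::this_thread::get_id());
    }

    void
    release()
    {
        owner_.store (std::thread::id{});
    }
};
```

A worker that loses this race typically spins or re-enters the scheduling
loop -- which matches the repeated failed acquisition attempts visible in
the Dump-03 log below.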
Dump-01::
  example of a nice and clean run

Dump-02::
  example of a run with insufficient planning pre-roll;
  while some nodes miss their prerequisite calculations,
  overall the Scheduler copes quite well.
+
- only two workers show up after the schedule of the first tick
- these manage to work through the initial backlog (...5000) in 900µs
- then the tick runs, and then the backlog is done after 1500µs
- now scheduling is ahead of schedule, causing mostly a single worker
  to do the complete work.
- occasionally a second worker shows up, but the contention does not
  cause much (if any) delay
- typically one _level_ is comprised of 5 jobs with the same start time,
  and these are completed after ~300µs

Dump-03::
  A test run with drastic failure
+
- seemingly, a larger fraction of the work capacity shows up simultaneously,
  roughly at t ≈ 5000µs
- likewise by accident, at that point the planning is already mostly complete
- massive contention can be observed between the workers attempting to work
  through the backlog, but (probably due to this contention) they are never
  able to catch up.
- the final wake-up job is scheduled for t = 17000µs
- it is actually invoked at ≈ 27000µs (~10ms late), wakes up the main (test)
  thread, which then aborts, since the final exit hash is still 0.
- notably there are very long stretches where a single thread repeatedly
  attempts to acquire the GroomingToken, yet fails, while no other progress
  can be seen in the log
- moreover the "now" times reported for various threads differ drastically,
  and there is quite some contradiction to the order in which the dump strings
  appear in the log. This _may hint at additional contention generated by the
  dump output itself...._