Scheduler Integration Testing
-----------------------------
:date: 2023-12-09

After successfully completing the unit tests for the newly built scheduler,
the next goal was to get some complete integration test runs. Moreover,
in preparation for Load Testing, the `TestChainLoad` was developed,
which is a generator for artificial yet reproducible test calculations.
This component generates a graph of interconnected nodes, where each
node will be invoked as an individual render job. However, nodes can
have predecessors and successors; in order to calculate a node's hash value,
the hash values of all predecessors must have been calculated.
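A rough sketch of this dependency rule (using invented names, not the actual
`TestChainLoad` implementation) could look like this:

[source,cpp]
----
#include <cstddef>
#include <vector>
#include <functional>

// Hypothetical sketch: a node's hash is folded from the hashes of all its
// predecessors, so the corresponding render job can only run after every
// predecessor job has completed.
struct Node
{
    std::vector<Node const*> pred;   // predecessor nodes in the generated graph
    size_t hash = 0;                 // 0 ≙ not yet calculated

    void calculate()
    {
        size_t h = 1;
        for (Node const* p : pred)                  // p->hash must already be final
            h = std::hash<size_t>{} (h ^ p->hash);  // fold in each predecessor
        hash = h;
    }
};
----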
In order to see what is happening in the scheduler and the job functors
generated by `TestChainLoad::setupSchedule()`, some print statements were
added temporarily. These generate a trace of all planning,
scheduling and dispatching events, each tagged with an abbreviated thread-ID
and a time relative to the first scheduler time encountered during this
test run. There is a changeset in the Git history which _removes_ all
these print statements -- and by reverting this git commit, it should be
possible to re-introduce this simplistic instrumentation.
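A minimal sketch of such an instrumentation helper, assuming an invented name
`TRACE_EVT()` rather than the code that was actually removed, could look like this:

[source,cpp]
----
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

// Hypothetical sketch: print an event tag together with an abbreviated
// thread-ID and a time in µs relative to the first event seen in this run.
inline void TRACE_EVT (char const* tag)
{
    using Clock = std::chrono::steady_clock;
    static std::atomic<Clock::rep> start{0};

    Clock::rep now = Clock::now().time_since_epoch().count();
    Clock::rep expect = 0;
    start.compare_exchange_strong (expect, now);            // latch the first timestamp

    auto relTime = std::chrono::duration_cast<std::chrono::microseconds> (
                       Clock::duration{now - start.load()}).count();
    size_t thrID = std::hash<std::thread::id>{} (std::this_thread::get_id()) % 1000;

    std::printf ("%3zu | %6lld µs : %s\n", thrID, (long long)relTime, tag);
}
----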
This directory documents some Dump-Logs and detailed findings from the
Debugging- and Investigation-Sessions done at that time.

.TestChainLoad_test::usageExample
Generally speaking, this test presents a slightly unusual load.
It is comprised of peaks -- one for _each level of the graph_ -- each
requiring typically 5 jobs to be calculated, all with the same start time.
The graph is composed of small join-trees, 2-3 nodes deep, which are started
interleaved. In the version used here, the actual calculations produce no
tangible CPU load and thus the job processing is absolutely dominated by
the scheduler management overhead. Several workers, pulling concurrently,
may create a slowdown due to contention on the »Grooming-Token«.
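The following sketch shows the general shape of such a token hand-over, using
invented names and a plain atomic flag rather than the actual Scheduler code:

[source,cpp]
----
#include <atomic>
#include <thread>

// Hypothetical sketch of »Grooming-Token« style mutual exclusion:
// only the worker currently holding the token may groom / dispatch from
// the scheduler queue; every other worker has to back off and retry.
class GroomingToken
{
    std::atomic<std::thread::id> holder_{};

public:
    bool tryAcquire()
    {
        std::thread::id expect{};                     // "no current owner"
        return holder_.compare_exchange_strong (expect, std::this_thread::get_id(),
                                                std::memory_order_acquire);
    }
    void release()
    {
        holder_.store (std::thread::id{}, std::memory_order_release);
    }
};

// Workers failing to acquire the token just yield and retry; with several
// workers pulling concurrently, this retry traffic is the observed slowdown.
inline void groomOrYield (GroomingToken& token)
{
    if (token.tryAcquire())
    {
        /* ...pull and dispatch the next due job... */
        token.release();
    }
    else
        std::this_thread::yield();
}
----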
SchedulerTestGraph.pdf::
Visualisation of the computation structure used for these tests

Dump-01::
example of a nice and clean run

Dump-02::
example of a run with insufficient planning pre-roll;
while some nodes miss their prerequisite calculations,
overall the Scheduler copes quite well.
+
- only two workers show up after the schedule of the first tick
- these manage to work through the initial backlog (...5000) in 900µs
- then the tick runs, and then the backlog is done after 1500µs
- now processing is ahead of the schedule, causing mostly a single worker
  to do the complete work.
- occasionally a second worker shows up, but the contention does not
  cause much (if any) delay
- typically one _level_ is comprised of 5 jobs with the same start time,
  and these are completed after ~300µs

Dump-03::
A test run with drastic failure
+
- seemingly, a larger fraction of the work capacity shows up simultaneously,
  roughly at t ≈ 5000µs
- likewise by accident, at that point the planning is already mostly complete
- massive contention can be observed between the workers attempting to work
  through the backlog, but (probably due to this contention) they are never able
  to catch up.
- final wake-up job is scheduled for t = 17000µs
- it is actually invoked at ≈ 27000µs (~10ms late), wakes up the main (test)
  thread, which then aborts, since the final exit hash is still 0.
- notably there are very long stretches where a single thread repeatedly
  attempts to acquire the GroomingToken, yet fails, while no other progress
  can be seen in the log
- moreover the "now" times reported for various threads differ drastically,
  and partly contradict the order in which the dump strings
  appear in the log. This _may hint at additional contention generated by the
  dump output itself...._

Dump-04::
Example of a successful run with a computational load of 100µs per step.
This example was obtained from running `TestChainLoad_test::usageExample()`
with additional print-statements inserted into the code.

Dump-05::
Example of processing a more complicated and highly interdependent graph.
This is a capture of `SchedulerService_test::processSchedule()`,
using a fixed level-spacing of 1ms, but a computational base-weight of 500µs.
+
- When looking at the graph, it is clear that this schedule is unfavourable,
  dictating excess wait times at the beginning and overloading the Scheduler
  (concurrency=8) towards the end.
- Yet in spite of these difficult conditions, parallelisation happens mostly
  at those parts of the graph where it is possible.
- Note moreover that the actual computation time spent in those nodes with
  a computational weight (suffix `.1` ... `.3`) is mostly in excess of the
  configured times. For example, a node with weight 3 (≙ 3·500µs) often needs
  2-3ms to complete. This is probably due to cache effects, since the weight
  calibration is performed in a hot microbenchmark loop, while here, in a
  more realistic setting, the CPU-load code is invoked intermingled with
  management code, atomic exchanges, sleep-waits and OS scheduler activity.
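+
A sketch of the kind of calibrated busy-loop implied here, using invented names
and a placeholder calibration constant (not the actual load generation code):
+
[source,cpp]
----
#include <cstddef>

// Hypothetical sketch: CYCLES_PER_UNIT would be determined by a hot
// micro-benchmark loop, so that one weight unit burns roughly 500µs of pure
// computation. Invoked in between real scheduler management code, the same
// loop easily takes longer, due to cold caches and other interference.
const size_t CYCLES_PER_UNIT = 1'000'000;       // placeholder from calibration

inline volatile size_t cpuLoadSink = 0;         // defeats optimising the loop away

inline void burnCpu (unsigned weight)
{
    size_t acc = cpuLoadSink;
    for (size_t i = 0; i < weight * CYCLES_PER_UNIT; ++i)
        acc = acc * 48271u % 2147483647u;       // simple arithmetic work
    cpuLoadSink = acc;                          // publish, so the work is not elided
}
----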