Scheduler Integration Testing
-----------------------------
:date: 2023-12-09

After successfully completing the unit tests for the newly built scheduler, the next goal
was to conduct some complete integration test runs. Moreover, in preparation for Load Testing,
the `TestChainLoad` was developed -- a generator for artificial yet reproducible test
calculations. This component generates a graph of interconnected nodes, where each node is
invoked as an individual render job. Nodes can have predecessors and successors; in order to
calculate a node's hash value, the hash values of all predecessors must have been calculated
beforehand.

In order to see what is happening in the scheduler and in the job functors generated by
`TestChainLoad::setupSchedule()`, some print statements were added temporarily. Together,
these generate a trace of all planning, scheduling and dispatching events, annotated with an
abbreviated thread-ID and a time relative to the first scheduler time encountered during the
test run. There is a changeset in the Git history which _removes_ all these print statements --
by reverting this commit, it should be possible to re-introduce this simplistic instrumentation.

This directory documents some dump logs and detailed findings from the debugging and
investigation sessions conducted at that time.

.TestChainLoad_test::usageExample
Generally speaking, this test presents a slightly unusual load. It is comprised of peaks --
one for _each level of the graph_ -- each typically requiring 5 jobs to be calculated, all
with the same start time. The graph is composed of small join-trees, 2-3 nodes deep, which
are started interleaved. In the version used here, the actual calculations produce no tangible
CPU load, and thus job processing is completely dominated by the scheduler's management overhead.
Several workers, pulling concurrently, may create a slowdown due to contention on the
»Grooming-Token«.

SchedulerTestGraph.pdf::
  Visualisation of the computation structure used for these tests.

Dump-01::
  Example of a nice and clean run.

Dump-02::
  Example of a run with insufficient planning pre-roll; while some nodes miss their
  prerequisite calculations, overall the Scheduler copes quite well.
+
- only two workers show up after the schedule of the first tick
- these manage to work through the initial backlog (...5000) in 900µs
- then the tick runs, and thereafter the backlog is done after 1500µs
- now scheduling is ahead of schedule, causing mostly a single worker to do the complete work
- occasionally a second worker shows up, but the contention does not cause much (if any) delay
- typically one _level_ comprises 5 jobs with the same start time, and these are completed after ~300µs

Dump-03::
  A test run with drastic failure.
+
- seemingly, a larger fraction of the work capacity shows up simultaneously, roughly at t ≈ 5000µs
- likewise by accident, at that point the planning is already mostly complete
- massive contention can be observed between the workers attempting to work through the backlog,
  which (probably due to this contention) never manage to catch up
- the final wake-up job is scheduled for t = 17000µs; it is actually invoked at ≈ 27000µs
  (~10ms late) and wakes up the main (test) thread, which then aborts, since the final exit hash is still 0
- notably, there are very long stretches where a single thread repeatedly attempts to acquire
  the GroomingToken, yet fails, while no other progress can be seen in the log
- moreover, the "now" times reported for the various threads differ drastically and partially
  contradict the order in which the dump strings appear in the log. This _may hint at additional
  contention generated by the dump output itself..._

Dump-04::
  Example of a successful run with a computational load of 100µs per step.
This example was obtained from running `TestChainLoad_test::usageExample()` with additional
print statements inserted into the code.

Dump-05::
  Example of processing a more complicated and highly interdependent graph. This is a capture
  of SchedulerService_test::processSchedule(), using a fixed level-spacing of 1ms, but a
  computational base-weight of 500µs.
+
- When looking at the graph, it becomes clear that this schedule is unfavourable, dictating
  excess wait times at the beginning and overloading the Scheduler (concurrency = 8) towards the end.
- Yet in spite of these difficult conditions, parallelisation happens mostly at those parts
  of the graph where it is possible.
- Note moreover that the actual computation time spent in nodes with a computational weight
  (suffix `.1` ... `.3`) mostly exceeds the configured times. For example, a node with weight 3
  (≙ 3·500µs) often needs 2-3ms to complete. This is probably due to cache effects: the weight
  calibration is performed in a hot microbenchmark loop, while here, in a more realistic setting,
  the CPU-load code is invoked intermingled with management code, atomic exchanges, sleep-waits
  and OS scheduler activity.