diff --git a/src/vault/gear/scheduler-commutator.hpp b/src/vault/gear/scheduler-commutator.hpp index 915e765ed..79c94586a 100644 --- a/src/vault/gear/scheduler-commutator.hpp +++ b/src/vault/gear/scheduler-commutator.hpp @@ -200,6 +200,7 @@ namespace gear { * - activity::PASS continue processing in regular operation * - activity::WAIT nothing to do now, check back later * - activity::HALT serious problem, cease processing + * - activity::SKIP to contend (spin) on GroomingToken * @note Attempts to acquire the GroomingToken for immediate * processing, but not for just enqueuing planned tasks. * Never drops the GroomingToken explicitly (unless when @@ -212,7 +213,7 @@ namespace gear { ,EXE& executionCtx ,SchedulerInvocation& layer1) { - if (!chain) return activity::WAIT; + if (!chain) return activity::SKIP; Time now = executionCtx.getSchedTime(); if (decideDispatchNow (when, now)) diff --git a/src/vault/gear/scheduler.hpp b/src/vault/gear/scheduler.hpp index 85ee5ea30..8a3f69889 100644 --- a/src/vault/gear/scheduler.hpp +++ b/src/vault/gear/scheduler.hpp @@ -26,12 +26,66 @@ ** The implementation of scheduling services is provided by an integration ** of two layers of functionality: ** - Layer-1 allows to enqueue and prioritise render activity records - ** - Layer-2 connects and coordinates activities to conduct complex calculations + ** - Layer-2 connects and coordinates activities to conduct complex calculations + ** Additionally, a [custom allocation scheme](\ref BlockFlow) is involved, + ** a [notification service](\ref EngineObserver) and the execution environment + ** for the low-level [»Activity Language«](\ref ActivityLang). Some operational + ** control and load management is delegated to the \ref LoadController.
+ ** The *purpose* of the »Scheduler Service« in the Lumiera Render Engine + ** is to coordinate the execution of [»Render Jobs«](\ref vault::gear::Job), + ** which can be controlled by a timing scheme, but also triggered in response + ** to some prerequisite event, most notably the completion of IO work. ** - ** @see SchedulerUsage_test Component integration test - ** @see scheduler.cpp implementation details + ** # Thread coordination + ** The typical situation found when rendering media is the demand to distribute + ** rather scarce computation resources to various self-contained tasks sequenced + ** in temporal and dependency order. In addition, some internal management work + ** must be conducted to order these tasks, generate further tasks and coordinate + ** the dependencies. Overall, any such internal work is by orders of magnitude + ** less expensive than the actual media calculations, which reach up into the + ** range of 1-10 milliseconds, possibly even way more (seconds for expensive + ** computations). For this reason, the Scheduler in the Lumiera Render Engine + ** uses a pool of workers, each representing one unit of computation resource + ** (a »core«), and these workers will _pull work actively,_ rather than + ** distributing, queuing and dispatching tasks to a passive set of workers. + ** And notably the »management work« is also performed by the workers themselves, + ** to the degree it is necessary to retrieve the next piece of computation. + ** So there is no dedicated »queue manager« — scheduling is driven by the workers. + ** + ** Assuming that this internal work is comparatively cheap to perform, a choice + ** was made to handle any internal state changes of the Scheduler exclusively + ** in single-threaded mode. This is achieved by an atomic lock, maintained in + ** [Layer-2 of the Scheduler implementation](\ref SchedulerCommutator::groomingToken_).
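The »grooming-token« mentioned here is essentially an atomic lock bound to a thread id. The general idea might be sketched as follows (a hedged simplification with hypothetical names — not the actual SchedulerCommutator code, which adds memory-order tuning and integration with the work-pulling logic):

```cpp
#include <atomic>
#include <thread>

// Sketch of a »grooming-token«: an atomic slot holding the id of the single
// thread currently allowed to mutate scheduler internals.
// Hypothetical simplification — not the actual SchedulerCommutator code.
class GroomingTokenSketch
  {
    std::atomic<std::thread::id> holder_{};
    
  public:
    bool
    tryAcquire()
      { // succeeds if the token is free, or already held by this thread
        std::thread::id nobody{};
        return holder_.load() == std::this_thread::get_id()
            or holder_.compare_exchange_strong (nobody, std::this_thread::get_id());
      }
    
    bool
    holds (std::thread::id id)  const
      {
        return holder_.load() == id;
      }
    
    void
    drop()
      { // relinquish, so other contenders may »groom« the queue
        holder_.store (std::thread::id{});
      }
  };
```

A thread failing to acquire the token simply refrains from touching scheduler internals and either retries later or spins briefly — which is precisely the single-threaded »grooming« discipline described above.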
+ ** Any thread looking for more work will pull a pre-configured functor, which + ** is implemented by the [work-function](\ref Scheduler::getWork()). The thread + ** will attempt to acquire the lock, designated as »grooming-token« -- but only + ** if this is necessary to perform internal changes. Since workers are calling + ** in randomly, in many cases there might be no task to perform at the moment, + ** and the worker can be instructed to go to a sleep cycle and call back later. + ** On the other hand, when load is high, workers are instructed to call back + ** immediately again to find the next piece of work. Based on assessment of + ** the current [»head time«](\ref SchedulerInvocation::headTime), a quick + ** decision will be made whether the thread's capacity is useful right now, + ** or whether this capacity will be re-focussed into another zone of the + ** scheduler's time axis, based on the distance to the next task. + ** + ** If however a thread is put to work, it will start dequeuing an entry from + ** the head of the [priority queue](\ref SchedulerInvocation::pullHead), + ** and start interpreting this entry as a _chain of render activities_ with + ** the help of the [»Activity Language«](\ref ActivityLang::dispatchChain). + ** In the typical scenario, after some preparatory checks and notifications, + ** the thread [transitions into work mode](\ref Scheduler::ExecutionCtx::work), + ** which entails [dropping the grooming-token](\ref SchedulerCommutator::dropGroomingToken). + ** Since the scheduler queue only stores references to render activities, which are + ** allocated in a [special arrangement](\ref BlockFlow) exploiting the known deadline + ** time of each task, further processing can commence concurrently.
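Dropping the grooming-token on every exit path, as described above, matches the scope-guard idiom. A hypothetical sketch, with illustrative names and a minimal stand-in for Layer-2 (the actual code performs the drop through explicit calls instead):

```cpp
#include <thread>

// Hypothetical sketch: the token clean-up described above is equivalent to a
// scope guard, dropping the grooming-token on every exit path — normal return
// as well as exception propagation. Names are illustrative, not the Lumiera API.
template<class LAY2>
class GroomingGuard
  {
    LAY2& layer2_;
    
  public:
    explicit GroomingGuard (LAY2& layer2) : layer2_{layer2} { }
    
   ~GroomingGuard()
      { // runs on any way out of the enclosing scope
        if (layer2_.holdsGroomingToken (std::this_thread::get_id()))
            layer2_.dropGroomingToken();
      }
  };

// minimal stand-in for Layer-2, just enough to demonstrate the guard
struct FakeLayer2
  {
    bool held = false;
    bool holdsGroomingToken (std::thread::id)  const { return held; }
    void dropGroomingToken()                         { held = false; }
  };
```

The destructor fires both on regular completion and while an exception unwinds the stack, which is why such a guard makes the deadlock safeguard hard to forget.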
+ ** + ** @see SchedulerService_test Component integration test + ** @see SchedulerStress_test + ** @see SchedulerUsage_test ** @see SchedulerInvocation Layer-1 ** @see SchedulerCommutator Layer-2 + ** @see activity.hpp description of »Render Activities« ** ** @todo WIP-WIP 10/2023 »Playback Vertical Slice« ** @@ -321,31 +375,51 @@ namespace gear { * - activity::PROC causes the worker to poll again immediately * - activity::SLEEP induces a sleep state * - activity::HALT terminates the worker + * @note Under some circumstances, this function depends on acquiring the »grooming-token«, + * which is an atomic lock to ensure only one thread at a time can alter scheduler internals. + * In the regular processing sequence, this token is dropped after dequeuing and processing + * some Activities, yet prior to invoking the actual »Render Job«. Explicitly dropping the + * token at the end of this function is a safeguard against deadlocking the system. + * If some other thread happens to hold the token, SchedulerCommutator::findWork + * will bail out, leading to an active spin-wait for the current thread.
*/ inline activity::Proc Scheduler::getWork() { - ExecutionCtx& ctx = ExecutionCtx::from(*this); - - return WorkerInstruction{} - .performStep([&]{ - Time now = ctx.getSchedTime(); - Time head = layer1_.headTime(); - return scatteredDelay(now, - loadControl_.markIncomingCapacity (head,now)); - }) - .performStep([&]{ - Time now = ctx.getSchedTime(); - Activity* act = layer2_.findWork (layer1_,now); - return ctx.post (now, act, ctx); - }) - .performStep([&]{ - Time now = ctx.getSchedTime(); - Time head = layer1_.headTime(); - return scatteredDelay(now, - loadControl_.markOutgoingCapacity (head,now)); - }) - ; + auto self = std::this_thread::get_id(); + auto& ctx = ExecutionCtx::from (*this); + try { + auto res = WorkerInstruction{} + .performStep([&]{ + Time now = ctx.getSchedTime(); + Time head = layer1_.headTime(); + return scatteredDelay(now, + loadControl_.markIncomingCapacity (head,now)); + }) + .performStep([&]{ + Time now = ctx.getSchedTime(); + Activity* act = layer2_.findWork (layer1_,now); + return ctx.post (now, act, ctx); + }) + .performStep([&]{ + Time now = ctx.getSchedTime(); + Time head = layer1_.headTime(); + return scatteredDelay(now, + loadControl_.markOutgoingCapacity (head,now)); + }); + + // ensure lock clean-up + if (res != activity::PASS + and layer2_.holdsGroomingToken(self)) + layer2_.dropGroomingToken(); + return res; + } + catch(...) 
+ { + if (layer2_.holdsGroomingToken (self)) + layer2_.dropGroomingToken(); + throw; + } } @@ -366,7 +440,11 @@ namespace gear { Scheduler::scatteredDelay (Time now, LoadController::Capacity capacity) { auto doTargetedSleep = [&] - { + { // ensure not to block the Scheduler after management work + auto self = std::this_thread::get_id(); + if (layer2_.holdsGroomingToken (self)) + layer2_.dropGroomingToken(); + // relocate this thread (capacity) to a time where it's more useful Offset targetedDelay = loadControl_.scatteredDelayTime (now, capacity); std::this_thread::sleep_for (std::chrono::microseconds (_raw(targetedDelay))); }; diff --git a/tests/vault/gear/scheduler-commutator-test.cpp b/tests/vault/gear/scheduler-commutator-test.cpp index 2c47148c4..55e0d7d05 100644 --- a/tests/vault/gear/scheduler-commutator-test.cpp +++ b/tests/vault/gear/scheduler-commutator-test.cpp @@ -369,7 +369,7 @@ namespace test { CHECK (not sched.holdsGroomingToken (myself)); // no effect when no Activity given - CHECK (activity::WAIT == sched.postDispatch (nullptr, now, detector.executionCtx, queue)); + CHECK (activity::SKIP == sched.postDispatch (nullptr, now, detector.executionCtx, queue)); CHECK (not sched.holdsGroomingToken (myself)); // Activity immediately dispatched when on time and GroomingToken can be acquired diff --git a/tests/vault/gear/scheduler-service-test.cpp b/tests/vault/gear/scheduler-service-test.cpp index 8e00eb9ed..9f9bad474 100644 @@ -96,7 +96,7 @@ namespace test { /** @test verify visible behaviour of the [work-pulling function](\ref Scheduler::getWork) * - use a rigged Activity probe to capture the schedule time on invocation * - additionally perform a timing measurement for invoking the work-function - * - empty invocations cost ~5µs (-O3) rsp.
~25µs (debug) + * - invoking the Activity probe itself costs 50...150µs, Scheduler internals < 50µs * - this implies we can show timing-delay effects in the millisecond range * - demonstrated behaviour * + an Activity already due will be dispatched immediately by post() @@ -140,6 +140,7 @@ namespace test { { // this test class is declared friend to get a backdoor to Scheduler internals... auto& schedCtx = Scheduler::ExecutionCtx::from(scheduler); + scheduler.layer2_.acquireGoomingToken(); schedCtx.post (start, &probe, schedCtx); }; diff --git a/wiki/renderengine.html b/wiki/renderengine.html index e78780dc1..298864b1c 100644 @@ -7290,7 +7290,7 @@ The primary scaling effects exploited to achieve this level of performance are t The way other parts of the system are built requires us to obtain a guaranteed knowledge of some job's termination. It is possible to obtain that knowledge with some limited delay, but it needs to be absolutely reliable (violations leading to segfault). The requirements stated above assume this can be achieved through //jobs with guaranteed execution.// Alternatively we could consider installing specific callbacks -- in this case the scheduler itself has to guarantee the invocation of these callbacks, even if the corresponding job fails or is never invoked. It doesn't seem there is any other option. -
+
The Scheduler //maintains a ''Work Force'' (a pool of workers) to perform the next [[render activities|RenderActivity]] continuously.//
 Each worker runs in a dedicated thread; the Activities are arranged in a way to avoid blocking those worker threads
 * IO operations are performed asynchronously {{red{planned as of 9/23}}}
@@ -7298,7 +7298,7 @@ Each worker runs in a dedicated thread; the Activities are arranged in a way to
 !Workload and invocation scheme
 Using a pool of workers to perform small isolated steps of work atomically and in parallel is a well-established pattern in high performance computing. However, the workload for rendering media is known to have some distinctive traits, calling for a slightly different approach compared with an operating system scheduler or a load balancer. Notably, the demand for resources is high, often using „whatever it takes“ -- driving the system into load saturation. The individual chunks of work, which can be computed independently, are comparatively large, and must often be computed in a constrained order. For real-time performance, it is desirable to compute data as late as possible, to avoid blocking memory with computed results. And for the final quality render, for the same reason it is advisable to proceed in data dependency order to keep as much data as possible in memory and avoid writing temporary files.
 
-This leads to a situation where it is more adequate to //distribute the scarce computation resources// to the tasks //sequenced in temporary and dependence order//. The computation tasks must be prepared and ordered -- but beyond that, there is not much that can be »managed« with a computation task. For this reason, the Scheduler in the Lumiera Render Engine uses a pool of workers, each providing one unit of computation resource (a »core«), and these workers will ''pull work'' actively, rather then distributing, queuing and dispatching tasks to a passive set of workers.
+This leads to a situation where it is more adequate to //distribute the scarce computation resources// to the tasks //sequenced in temporal and dependency order//. The computation tasks must be prepared and ordered -- but beyond that, there is not much that can be »managed« with a computation task. For this reason, the Scheduler in the Lumiera Render Engine uses a pool of workers, each providing one unit of computation resource (a »core«), and these workers will ''pull work'' actively, rather than distributing, queuing and dispatching tasks to a passive set of workers.
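The »pull work« principle described here might be sketched roughly like this (illustrative C++ with made-up names — not the actual Lumiera WorkForce interface): each worker thread repeatedly invokes a work-functor handed out by the scheduler and reacts to the returned verdict.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Rough sketch of the »pull work« principle: workers actively ask the
// scheduler for the next step, instead of having tasks dispatched to them.
// All names are illustrative, not the actual Lumiera WorkForce interface.
enum class Verdict { PROC, SLEEP, HALT };

inline void
workerLoop (std::function<Verdict()> pullWork)
{
    while (true)
        switch (pullWork())
          {
            case Verdict::PROC:     // found work — poll again immediately
                break;
            case Verdict::SLEEP:    // nothing due — back off, then call back
                std::this_thread::sleep_for (std::chrono::milliseconds(1));
                break;
            case Verdict::HALT:     // scheduler shuts down this worker
                return;
          }
}
```

The design choice is that load regulation lives entirely in the functor's verdicts: the pool itself needs no knowledge of queue state or priorities.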
 
 Moreover, the actual computation tasks, which can be parallelised, are at least an order of magnitude more expensive than any administrative work for sorting tasks, checking dependencies and maintaining process state. This leads to a scheme where a worker first performs some »management work«, until encountering the next actual computation job, at which point the worker leaves the //management mode// and transitions into //concurrent work mode//. All workers are expected to be in work mode almost all of the time, and thus we can expect not much contention between workers performing »management work« -- allowing this management work to be confined to //single threaded operation,// thereby drastically reducing the complexity of management data structures and memory allocation.
 !!!Regulating workers
diff --git a/wiki/thinkPad.ichthyo.mm b/wiki/thinkPad.ichthyo.mm
index 34d40fe1b..350022ee7 100644
--- a/wiki/thinkPad.ichthyo.mm
+++ b/wiki/thinkPad.ichthyo.mm
@@ -80636,7 +80636,7 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
-

pass on the activation down the chain @@ -82050,8 +82050,10 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- - + + + + @@ -82084,20 +82086,20 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- - + + - - + + - - - + + + @@ -82110,6 +82112,9 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
+ + +
@@ -82175,7 +82180,7 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- + @@ -82195,7 +82200,7 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- + @@ -82222,13 +82227,89 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- - - - + + + + - - + + + + + + + + + +

+ ...this is a small trick, which causes a thread to simply complete all management tasks in one sweep, until it transitions into work mode via a regular transaction, or until there is just nothing left to do immediately. This exception is not necessary for logical consistency, but in some cases it avoids an unnecessary read-write-barrier and keeps working from the cache of the respective core. Cache effects can easily cost 50-100µs +

+ +
+ +
+ + + + + + + + + + + + + + + +

+ otherwise the Scheduler could meanwhile +

+

+  be blocked for time spans of up to 20ms +

+ +
+ + + + + +

+ ...since a targeted sleep typically happens precisely when there is nothing to do right at the moment; yet part of this delay could still span a period in which Activities are scheduled (or have been fed in meanwhile) +

+ +
+
+ + + + +
+ + + + +

+ actually happens in SchedulerCommutator::postDispatch() +

+ +
+ + + + + + + + + + + + + + @@ -87135,6 +87216,10 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
+ + + +
@@ -90515,7 +90600,7 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
-

even with further variation of the parameters one hardly gets below 30ns @@ -91251,6 +91336,17 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
+ + + + +

+ ...this could indeed be difficult to find out, since we definitely do not want to log every handling of the grooming-token somewhere as an event; in addition there is the difficulty that we often drop the grooming-token again after invoking the work-function +

+ +
+ +
@@ -91291,6 +91387,51 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
+ + + + + + + + + +

+ ...since at that moment the contender has nothing else to do anyway, and capacity-wise is definitely committed to rendering. Moreover, with active polling we bother nobody else (apart from the CO₂ produced); and this case should be rare and ideally also of short duration +

+ +
+
+ + + + +

+ For we cannot actually measure any effect ⟹ it comes down to whether this approach is proportionate, and that is identical to the question whether contentions really are as extremely rare as the concept assumes. The most important starting point would therefore be to look for temporal coincidences, which however should hardly be possible without further instrumentation +

+ +
+ +
+
+ + + + + + + + +

+ ...since we only have a status code, but cannot say what should be done next and how exactly. Thereby the coupling is superficially loose and explicit, but in fact this leads to very complex collaborations +

+ +
+
+ + +
+
@@ -91845,16 +91986,13 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- - - +

 Greetings from the full moon in Pullach (not quite full yet, tomorrow is a lunar eclipse), which moves along the valley with

- -
+ @@ -91876,9 +92014,7 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- - - +

Priority determines the probability @@ -91887,15 +92023,12 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
of obtaining capacity by chance

- -
+ - - - +

Fundamentally, a render process may only be admitted when capacity is available. But the application of this principle has gradations, determined by the kind of process. @@ -91920,103 +92053,79 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
- - + - - - +

 the current situation determines how capacity is supplied; but the cross-section determines how good the chances are for a concrete task to get a share of it

- -
+ - - - +

 A free-standing task is already assigned the next capacity becoming free much earlier, via the tend-next mechanism; but as long as something still stands before it in the simple temporal order (even if overdue), that one draws the priority onto itself

- -
+
- - - +

 but as soon as it stands free, its length determines the priority

- -
+ - - - +

 Namely through its end (the deadline): once that is passed, the task is effectively void and will be removed and discarded in passing. Now if various tasks are each bounded in length, then the capacity that happens to „land“ within their radius of effect falls to them

- -
+
- - - +

 First of all, the gates are relevant here. A gate still closed can push a task backwards and thereby uncover other tasks. A triggered and finally closed gate removes the task from the competition entirely. And furthermore, tasks are also marked with a process/context-ID, which makes possible a revision and update of an entire planning run.

- -
+
- - - +

 ...which means, once we have passed through the gate, it is closed for good. Thus, in a sense, several »instances« can be scheduled, since clearing out garbage is relatively efficient in a priority queue, O(logₙ), and in our case runs in the single-digit µs range per individual step
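This note alludes to discarding stale entries »in passing« rather than searching and erasing them. A small illustrative sketch of such lazy garbage removal from a priority queue (hypothetical, not the actual scheduler queue):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Illustrative sketch of lazy garbage removal: overdue entries are not
// searched and erased — they are simply discarded »in passing« when they
// surface at the head, costing O(log n) per step.
// Hypothetical stand-in, not the actual Lumiera scheduler queue.
struct Entry
  {
    long deadline;
    bool operator> (Entry const& o)  const { return deadline > o.deadline; }
  };

class LazyQueue
  {
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue_;
    
  public:
    void
    push (Entry entry)
      {
        queue_.push (entry);
      }
    
    bool                       // pop next valid entry, dropping stale ones
    pop (long now, Entry& result)
      {
        while (not queue_.empty())
          {
            Entry head = queue_.top();
            queue_.pop();
            if (head.deadline < now)
                continue;      // overdue — discard in passing
            result = head;
            return true;
          }
        return false;
      }
  };
```

This is why scheduling redundant »instances« of a task is cheap: the duplicates whose deadline has already passed simply evaporate on their way through the head of the queue.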

- -
+
- - - +

 A task meant to get only residual capacity must never stand at the beginning of the window, and preferably not right at the end either. The shorter it is and the more it sits in the middle, the lower its chances. A task that must get capacity in any case merely has to reach sufficiently far back (though under the constraint of not overloading our epoch-based BlockFlow). Or it has to be repeated sufficiently often. So a general segment scheme is established, and within it there are prefabricated »slots«. These slots are claimed according to the superordinate capacity planning. If, for example, we over-provision a residual capacity 10-fold, this means scheduling each individual task (possibly with a time offset) 10+x times, where x represents an empirically determined safety margin to cushion the actual fluctuations of capacity...

- -
+
- - - +

 this way we translate multidimensional relationships @@ -92028,8 +92137,7 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
into a low-level execution scheme

- -
+
@@ -96655,19 +96763,16 @@ Date:   Thu Apr 20 18:53:17 2023 +0200
09.05.19 17:10

- Library: further narrowing down the tuple-forwarding problem -

-
    
-
    ...yet still not successful.
-
    
-
    The mechanism used for std::apply(tuple&) works fine when applied directly to the target function,
-
    but fails to select the proper overload when passed to a std::forward-call for
-
    "perfect forwarding". I tried again to re-build the situation of std::forward
-
    with an explicitly coded function, but failed in the end to supply a type parameter
-
    to std::forward suitably for all possible cases
-

- + Library: further narrowing down the tuple-forwarding problem

+
    
+
    ...yet still not successful.
+
    
+
    The mechanism used for std::apply(tuple&) works fine when applied directly to the target function,
+
    but fails to select the proper overload when passed to a std::forward-call for
+
    "perfect forwarding". I tried again to re-build the situation of std::forward
+
    with an explicitly coded function, but failed in the end to supply a type parameter
+
    to std::forward suitably for all possible cases