Master’s Thesis: Deterministic Execution for Entire ROS 2 Stacks
Recently, I submitted my master’s thesis titled “Enabling Reproducibility in ROS 2 by Ensuring Sequence Deterministic Callback Execution”. I already wrote about this topic in a previous post about deterministic ROS 2 callback scheduling, roughly half-way through the time I ended up working on it. The underlying idea and method have not changed since then; most of the work following that post went into fixing bugs in the implementation and integrating a “real-world” use case as a test, to verify not only the functionality but also the practicability of the method.
To recap the initial problem: In ROS 2, processing within a robotics stack happens in callbacks, which are triggered by data inputs (“subscriptions”) or timers. Within those callbacks, outputs may be produced (“published”), which are in turn inputs for other callbacks. Callbacks are executed by ROS nodes, which encapsulate the functionality of individual modules. For example: A sensor driver periodically polls a sensor for measurements and publishes them when available. Multiple perception modules subscribe to that input and each publish a detection (such as a detected object and its size). A tracking module then subscribes to all detections and merges them into an environment model. The result of this processing step depends on the execution speed of the detection modules, since that influences the order in which the tracking module receives its inputs. In that way, the execution of ROS stacks is nondeterministic, even with otherwise deterministic ROS nodes.
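A minimal sketch of this setup, with illustrative node and topic names (not taken from the actual stack) and `std_msgs/String` standing in for real message types:

```python
import rclpy
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from std_msgs.msg import String


class Detector(Node):
    """Subscribes to sensor measurements and publishes a detection."""

    def __init__(self, name):
        super().__init__(name)
        self.pub = self.create_publisher(String, 'detections', 10)
        self.create_subscription(String, 'measurements', self.on_measurement, 10)

    def on_measurement(self, msg):
        # Processing time varies between detectors and between runs, so the
        # order of messages on 'detections' is not fixed.
        out = String()
        out.data = self.get_name() + ' processed: ' + msg.data
        self.pub.publish(out)


class Tracker(Node):
    """Merges all detections into an environment model."""

    def __init__(self):
        super().__init__('tracker')
        self.create_subscription(String, 'detections', self.on_detection, 10)

    def on_detection(self, msg):
        # The tracking result depends on the arrival order of detections,
        # which differs between runs -> nondeterministic overall output.
        self.get_logger().info('received ' + msg.data)


def main():
    rclpy.init()
    executor = MultiThreadedExecutor()
    for node in (Detector('lidar_detector'), Detector('camera_detector'), Tracker()):
        executor.add_node(node)
    executor.spin()


if __name__ == '__main__':
    main()
```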
My contribution is a method for scheduling those callbacks such that an input is processed deterministically, not only across subsequent inputs but, more crucially, across multiple runs of the same test, simulation scenario or playback of recorded data.
Some Technical Notes⌗
The method requires a way to trigger callbacks manually, which is achieved by buffering messages on the input topics and separating input topics where necessary. Details can be found in the “Controlling Callback Invocations” section of the thesis.
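As a rough illustration of the interception idea (the topic names and the release mechanism here are my own simplification, not the orchestrator’s actual interface):

```python
import collections

import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class TopicInterceptor(Node):
    """Buffers messages on an input topic and forwards them only on demand."""

    def __init__(self):
        super().__init__('topic_interceptor')
        self.buffer = collections.deque()
        # Original publishers still publish on 'detections' ...
        self.create_subscription(String, 'detections', self.on_input, 10)
        # ... but the subscriber under test is remapped to this topic.
        self.pub = self.create_publisher(String, 'detections/intercepted', 10)

    def on_input(self, msg):
        # Never deliver directly; keep the message until the scheduling
        # logic decides that the corresponding callback may run.
        self.buffer.append(msg)

    def release_next(self):
        # Called once all ordering constraints for the next callback
        # invocation are satisfied.
        if self.buffer:
            self.pub.publish(self.buffer.popleft())
```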
The interesting, and perhaps non-obvious, part is how the framework determines when a callback is allowed to execute. This is accomplished by iteratively extending a graph representing all expected callback executions within the ROS stack. Edges in the graph constrain the execution order of the connected callbacks. Edges are added to serialize the execution of callbacks which would otherwise access a ROS node, a ROS topic or a ROS service simultaneously (or in an undefined order). This graph is then used to execute callbacks that have no outstanding constraints, and to buffer their inputs otherwise. Again, details can be found in the “Ensuring Sequence Determinism Using Callback Graphs” section of the thesis. This method has the advantage that it only serializes callbacks where necessary and does not prevent parallel execution entirely. (This does not imply that lower serialization overhead is impossible: by intercepting output topics in addition to input topics, for example, serialization of callbacks that merely access the same topics might be avoided. See this example for details. Similar improvements could be made for service calls.)
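To make the core idea more concrete, here is a heavily simplified model of such a callback graph; the class and method names are mine, and the real implementation tracks considerably more state per invocation:

```python
from dataclasses import dataclass, field


@dataclass
class CallbackGraph:
    """Vertices are expected callback invocations, edges are ordering constraints."""

    # predecessors[v] = set of invocations that must complete before v may run
    predecessors: dict = field(default_factory=dict)

    def add_invocation(self, invocation):
        self.predecessors.setdefault(invocation, set())

    def add_constraint(self, before, after):
        # Serialize two invocations that would otherwise race on a shared
        # node, topic or service.
        self.add_invocation(before)
        self.add_invocation(after)
        self.predecessors[after].add(before)

    def ready(self, invocation):
        # An invocation may execute once no unfinished predecessor remains;
        # until then, its input stays buffered.
        return not self.predecessors.get(invocation, set())

    def complete(self, invocation):
        # Remove the finished invocation and all constraints originating from it.
        self.predecessors.pop(invocation, None)
        for preds in self.predecessors.values():
            preds.discard(invocation)


# Example: the tracking callback is invoked once per detection; both
# invocations access the same node, so an edge fixes their relative order.
graph = CallbackGraph()
graph.add_constraint('tracker.on_detection(lidar)', 'tracker.on_detection(camera)')
assert graph.ready('tracker.on_detection(lidar)')
assert not graph.ready('tracker.on_detection(camera)')
graph.complete('tracker.on_detection(lidar)')
assert graph.ready('tracker.on_detection(camera)')
```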
Integrating this method with a real-world use case required significantly more work than the initial development of the orchestrator. In my experience, this was mainly due to unusual ROS node behaviors that were significantly more complex than anticipated. In this case, the culprit was the multi-object tracking module, which performs sophisticated queueing, batching and processing of sensor measurements. It tries to process inputs with the “same” timestamp in a batch, but needs to handle missing inputs and unsynchronized sensors. Processing also happens in multiple threads, separating ROS message handling from the processing of sensor data. I am currently still working on integrating my software for the purpose of integration tests, and I still regularly find deadlocks caused by previously unknown code paths in message processing, which usually result in missing status messages (the orchestrator is never informed of callback completion).
I’m sure that this complicated callback behavior is motivated by (and surely adequately addresses) several real-world problems: an environment of unsynchronized inputs, input and processing timeouts, and the apparent algorithmic requirement for batching measurements. It does, however, make it rather hard to reason about the node’s behavior, and even harder to manually run individual processing steps as intended here. I sometimes wondered whether this behavior is really required, or whether ROS should perhaps impose stricter requirements on when and how the inputs and outputs of nodes are handled. I don’t think that this kind of callback behavior is “too complicated” and that ROS should make it impossible to implement, but I believe that a (potential, new) framework of this kind should enable some kind of traceability of inputs and outputs. This would probably require associating published outputs with the current callback execution, which could likely be done via the current thread context or something similar, but I also like the concept of returning the resulting outputs from the callback function directly. Supporting early or omitted outputs may be possible by returning future objects when publishing, and returning those from the callback. This could also enforce specifying all the (possible) types of outputs for each callback. (Obviously this is not thought through properly at all, and is just an idea which would have made my work a lot easier…)
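To make the idea a bit more tangible, here is a purely hypothetical sketch of what such a traceable callback interface could look like; none of this exists in ROS 2, and all names and types are made up:

```python
from concurrent.futures import Future
from dataclasses import dataclass

from std_msgs.msg import String


@dataclass
class CallbackOutputs:
    # The callback declares up front every output it may produce ...
    detection: Future
    diagnostics: Future


def on_measurement(msg: String) -> CallbackOutputs:
    # ... and returns futures instead of publishing as a side effect, so a
    # framework could associate each output (or its absence) with exactly
    # this invocation.
    detection = Future()
    diagnostics = Future()
    out = String()
    out.data = 'object detected from ' + msg.data
    detection.set_result(out)
    # An omitted output is made explicit by completing the future with None.
    diagnostics.set_result(None)
    return CallbackOutputs(detection=detection, diagnostics=diagnostics)
```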
Concluding Remarks⌗
Developing this framework, and especially its core method, was a lot of fun over the past six months. Coming up with the callback graph concept and implementation was particularly interesting, and I think the resulting solution is a nice mix of a sufficiently abstract methodical approach and an implementation that does not deviate too much from the theory, which makes it practical to use. In hindsight, an even more methodical/theoretical derivation of the callback graph might have been interesting to look at. I lack the background in modeling such systems, but as far as I know there is a lot of relevant theory already established in the fields of real-time and distributed computing. That said, I have far more confidence in the correctness of the callback graph than in the compliance of random ROS nodes with their behavior specification…
Some rambling about ROS 2 docs⌗
There is, however, a third aspect to making this method work, one that is not under the direct control of the authors of the ROS nodes under test or of my framework, although it was an explicit goal to minimize this dependency: ROS itself. Over the past months, I realized how many aspects of ROS lack documentation and specification. Especially core concepts like message delivery, time handling and timer execution within ROS nodes, as well as details about name remapping, are not really explained. For some aspects design documents exist, but the actual implementation usually deviates from them or has important caveats, which are only found by digging up old issues and pull requests. API docs are sometimes available, but the fact that the ability to generate API docs for Python-only packages was only merged this March (after 103 comments on the PR over ~2 years) gives an indication of the state of API docs. I think some of this stems from the way the ROS community is rather loosely organized, with everyone doing things their own way (which reminds me of how ROS nodes work and communicate, heh). This opens the door for many possible contributors, of course, but it also fails to set a precedent for a certain level of documentation and expected user/developer experience. (I should write some more thoughts on this in a separate post, I guess.)
Before linking to all the results, I want to thank my supervisors Matti and Jan for making it possible for me to work on this thesis. They had many of the important ideas and plans for this in place before I even joined, always helped me with the work, and asked the right questions for me to realize the bugs in my implementations. I also want to thank Dominik, Simon and Marco for the inspiring discussions in the computer lab and for being patient rubber-ducky-debugging victims. And last but not least, I thank Team Spatzenhirn for providing a Club-Mate-filled fridge, a sofa, and as much distraction from the thesis as I needed.
Basically all the code I wrote is now available on GitHub at https://github.com/uulm-mrm/ros2_def. The hosted documentation from the repo is at https://uulm-mrm.github.io/ros2_def/, and also contains the thesis itself in HTML format. The version I submitted is also available in PDF format, but be aware that potential fixes/modifications will only be made to the HTML version.