For my master’s thesis, I am currently working on deterministic execution of a ROS stack, with the goal of evaluating system performance in a reproducible way. Since the topic came up (again) on the ROS Discourse, I decided to summarize what I’ve been working on so far.

My goal is to provide a framework that allows execution of a ROS stack within a simulator or from a ROS bag in a way that produces deterministic results, meaning that a second simulation run, or repeated playback of the ROS bag, results in exactly the same output (which may be control signals for the robot, or some kind of metric quantifying system performance). While providing that, we want to minimize the changes that have to be made to existing ROS nodes (which more or less rules out the approach of just running the core functionality of the nodes entirely without ROS).

In the following, I want to describe the scenario and the assumptions made, along with the ideas used to solve the issue. Details will follow as soon as I am done writing my thesis, which should include an evaluation of the approach using the autonomous-driving software stack used at my university…

Assumptions

There are a few assumptions made about the ROS nodes under test, and the communication between them:

  • The individual ROS nodes are deterministic: for a given sequence of inputs, they will always produce the same output. The nodes may, however, take an arbitrary amount of time to execute each callback.
  • The underlying communication middleware delivers all messages, but may
    • have arbitrary latency,
    • reorder messages published to the same topic,
    • deliver messages on different topics in arbitrary order.

Determinism of the ROS nodes is a strong requirement, but should not come as a surprise when deterministic execution of the entire stack is the goal… The assumption that the middleware may reorder messages on the same topic is perhaps not necessary. However, requiring ordered topics does not really simplify anything: if two running nodes eventually publish on the same topic, the order of their messages is not deterministic anyway.

The sources of non-determinism eliminated here are:

  • Node runtime: Node callbacks may take a non-deterministic (or even unbounded) time to execute, for example due to non-deterministic scheduling within the OS or a varying system load.
  • Communication: If two input topics of a node are published at the same time, the reception order is non-deterministic, which may influence the callback output.
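To make the second point concrete, here is a toy example in plain Python (no ROS involved). The node and its callbacks are hypothetical and only serve as an illustration: the node remembers the latest pose and combines it with every incoming scan, so the result depends on which of two simultaneously published messages is delivered first.

```python
class FusionNode:
    """Toy node: on every scan, output the scan together with the latest pose."""

    def __init__(self):
        self.latest_pose = 0
        self.outputs = []

    def on_pose(self, pose):
        self.latest_pose = pose

    def on_scan(self, scan):
        self.outputs.append((scan, self.latest_pose))


def run(delivery_order):
    """Deliver (topic, value) pairs to a fresh node in the given order."""
    node = FusionNode()
    for topic, value in delivery_order:
        getattr(node, f"on_{topic}")(value)
    return node.outputs


# Both messages were "published at the same time"; only the delivery order differs.
pose_first = run([("pose", 1), ("scan", 10)])
scan_first = run([("scan", 10), ("pose", 1)])
print(pose_first)  # [(10, 1)]
print(scan_first)  # [(10, 0)]
```

The node itself is deterministic in the sense of the assumptions above, yet the two runs disagree, which is exactly the kind of difference the orchestrator is supposed to rule out.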

Callbacks

All interesting actions in nodes happen in callbacks, which are either subscription or timer callbacks. When using simulation time, timer callbacks are not that different from subscription callbacks, since they are triggered by the corresponding /clock input. Service callbacks also exist, but are handled in a different way.

It is necessary to control when callbacks happen and to know when callbacks are done. The latter requires knowledge of all topics on which the callback publishes messages. If a callback does not normally publish any messages, we require the node to add an explicit status topic, on which a message has to be published when the callback is done.

In my implementation, the information about which callbacks exist in a node, and on which topic(s) their results are published, is provided statically in a configuration file.
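As a rough illustration of what such a static description has to contain, here is a sketch written as a Python data structure. This is not the actual file format; the node names, topic names, the "timer:10hz" trigger notation and the helper functions are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallbackSpec:
    trigger: str    # input topic, or a (hypothetical) timer identifier
    outputs: tuple  # topics on which the callback publishes when it is done


# Hypothetical description of one node with a subscription and a timer callback.
NODE_CONFIG = {
    "planner": [
        CallbackSpec(trigger="/tracked_objects", outputs=("/trajectory",)),
        # This callback publishes nothing by itself, so it gets a status topic.
        CallbackSpec(trigger="timer:10hz", outputs=("/planner/status",)),
    ],
}


def validate(config):
    """Every callback needs at least one output to signal that it is done."""
    for node, callbacks in config.items():
        for cb in callbacks:
            if not cb.outputs:
                raise ValueError(
                    f"{node}: callback on {cb.trigger!r} needs a status topic"
                )


def completion_topics(config, node, trigger):
    """Topics to wait for before the given callback counts as finished."""
    for cb in config[node]:
        if cb.trigger == trigger:
            return cb.outputs
    raise KeyError(f"{node} has no callback for {trigger!r}")


validate(NODE_CONFIG)
print(completion_topics(NODE_CONFIG, "planner", "/tracked_objects"))
# ('/trajectory',)
```

The validation step mirrors the status-topic requirement: a callback without any declared output could never be observed as finished.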

Topic-Interception

In order to control when a subscription callback is executed, we redirect input topics through the orchestrator, which can then “release” the message once its recipient is ready. This is done by remapping every topic that a node subscribes to, so that each subscription gets a unique name.
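The remapping idea can be sketched as follows. The naming scheme "/intercepted/<node>/sub/<topic>" and the function names are assumptions made for this example, not necessarily what the implementation uses; the only ROS-specific part is the "from:=to" form of a remapping rule.

```python
def intercepted_name(node, topic):
    """Hypothetical naming scheme: one unique topic per (node, subscription)."""
    return f"/intercepted/{node}/sub/{topic.strip('/')}"


def build_remappings(subscriptions):
    """Map {node: [topics]} to {node: {original_topic: intercepted_topic}}."""
    return {
        node: {topic: intercepted_name(node, topic) for topic in topics}
        for node, topics in subscriptions.items()
    }


def remap_rules(node_remappings):
    """Render the remappings of one node as 'from:=to' remapping rules."""
    return [f"{orig}:={new}" for orig, new in node_remappings.items()]


# Two nodes subscribe to /objects, but each gets its own intercepted topic.
SUBSCRIPTIONS = {"tracker": ["/image", "/objects"], "planner": ["/objects"]}
remappings = build_remappings(SUBSCRIPTIONS)
print(remap_rules(remappings["planner"]))
# ['/objects:=/intercepted/planner/sub/objects']
```

Because every subscription ends up on its own topic, the orchestrator can forward a message to one recipient while still holding it back for another.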

Orchestrator

The orchestrator is the component that ensures deterministic callback execution. It is integrated into the data provider, which could be a simulator or a rosbag player. It ensures deterministic execution by constructing a dependency graph of all callbacks that happen due to some data input. Constraints are applied to the callback graph which serialize operations at the same node, topic or service. By buffering outputs from nodes and only executing callbacks when all dependency constraints are fulfilled, deterministic callback execution is guaranteed for each data input. Combined with a deterministic simulator, or a rosbag which provides the same messages on every playback, this results in deterministic execution.
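A heavily simplified sketch of the callback graph could look like this. It only models two kinds of constraints (data dependencies and serialization of callbacks at the same node), ignores services, and executes everything strictly one after another, whereas a real orchestrator could run independent callbacks in parallel. All node, topic and callback names are hypothetical.

```python
from collections import defaultdict


def build_graph(callbacks):
    """callbacks: list of (name, node, input_topic, output_topics).

    Returns {name: set of callbacks that must finish first}.
    """
    publishers = defaultdict(list)
    for name, _node, _inp, outputs in callbacks:
        for topic in outputs:
            publishers[topic].append(name)

    deps = {name: set() for name, *_ in callbacks}
    last_on_node = {}
    for name, node, inp, _outputs in callbacks:
        # Data dependency: wait for whoever publishes this callback's input.
        deps[name].update(publishers.get(inp, []))
        # Same-node constraint: callbacks of one node run in list order.
        if node in last_on_node:
            deps[name].add(last_on_node[node])
        last_on_node[node] = name
    return deps


def execution_order(deps):
    """Release callbacks one by one; ties are broken by name, not by timing."""
    remaining = {name: set(d) for name, d in deps.items()}
    order = []
    while remaining:
        ready = sorted(name for name, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic constraints")
        nxt = ready[0]
        order.append(nxt)
        del remaining[nxt]
        for d in remaining.values():
            d.discard(nxt)
    return order


# Callbacks caused by a single /image input from the data source.
CALLBACKS = [
    ("detector.on_image", "detector", "/image", ["/objects"]),
    ("tracker.on_image", "tracker", "/image", ["/tracker/status"]),
    ("tracker.on_objects", "tracker", "/objects", ["/tracks"]),
    ("planner.on_tracks", "planner", "/tracks", ["/trajectory"]),
]
order = execution_order(build_graph(CALLBACKS))
print(order)
# ['detector.on_image', 'tracker.on_image', 'tracker.on_objects', 'planner.on_tracks']
```

The important property is that the order follows only from the graph and a fixed tie-breaking rule, never from how long a callback took or when a message happened to arrive.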

The orchestrator interfaces with the data source (simulator/bag player) such that the source offers each piece of data to the orchestrator, which then forces the source to wait until all constraints for publishing this input are met.
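The blocking behaviour towards the data source can be sketched with a condition variable. The class and method names below are invented for this example, and the single constraint modelled here ("no callback output is still outstanding") stands in for the full set of constraints in the callback graph.

```python
import threading


class InputGate:
    """Blocks the data source while callback outputs are still outstanding."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = set()  # outputs still expected from running callbacks

    def expect(self, topic):
        """Called when a callback that publishes on `topic` is released."""
        with self._cond:
            self._pending.add(topic)

    def output_received(self, topic):
        """Called when the intercepted output arrives at the orchestrator."""
        with self._cond:
            self._pending.discard(topic)
            self._cond.notify_all()

    def wait_until_publishable(self, timeout=None):
        """Called by the data source before publishing its next input.

        Returns True once nothing is pending, False if the timeout expired.
        """
        with self._cond:
            return self._cond.wait_for(lambda: not self._pending, timeout)


gate = InputGate()
gate.expect("/trajectory")

# The callback is still running, so the source is not allowed to continue.
blocked = gate.wait_until_publishable(timeout=0.01)

# Simulate the node finishing its callback a little later on another thread.
threading.Timer(0.05, gate.output_received, args=("/trajectory",)).start()
released = gate.wait_until_publishable(timeout=2.0)

print(blocked, released)  # False True
```

In a simulator, the call to wait_until_publishable would sit right before the next time step is published, so simulation time only advances once the stack has caught up.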