
How the ROS 2 Core Team Chose Zenoh as Its Alternative Middleware


In September 2023, the ROS 2 core team published a study titled "ROS 2 RMW Alternate" — a systematic evaluation of whether DDS, the middleware that has powered ROS 2 since its inception around 2015, should be supplemented with an alternative. After surveying over 180 community members, deriving a formal requirements list, and comparing more than twenty middleware candidates against those requirements, the team reached a clear conclusion: Zenoh best meets the requirements and will be developed into the new non-DDS RMW for ROS 2.

This post walks through the study's findings in the order they appear in the report.

The Problem: Eight Years of DDS Pain Points

The RMW (ROS MiddleWare) interface was designed as an abstraction layer that lets ROS 2 swap its underlying communication mechanism at compile time or runtime. All current Tier 1 implementations are DDS-based. DDS was a reasonable choice in 2015 — it had a long history in mission-critical deployments and addressed many of the same goals as ROS. But eight years of real-world use had accumulated a clear set of recurring problems.

Fully-Connected Graph

DDS maintains a fully-connected graph: every participant, topic, and service in the network must be discovered by every other participant. This produces O(n²) discovery traffic and "packet storms" when new nodes join large networks. ROS programs are accustomed to creating many topics cheaply — an assumption that does not hold with DDS at scale.
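To make the quadratic growth concrete, here is a back-of-the-envelope sketch in Python. It assumes a simplified model in which each of the n participants must discover the other n - 1; the function name is illustrative, and real DDS discovery traffic also depends on the number of endpoints per participant and on the vendor's discovery protocol.

```python
def discovery_pairs(participants: int) -> int:
    """Directed discovery relationships in a fully-connected graph.

    Simplified model (an assumption, not a DDS measurement): each of the
    n participants must discover the other n - 1, giving n * (n - 1).
    """
    if participants < 0:
        raise ValueError("participants must be non-negative")
    return participants * max(participants - 1, 0)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        pairs = discovery_pairs(n)
        print(f"{n:>5} participants -> {pairs:>9,} discovery relationships")
```

Under this model, growing a network from 10 to 100 participants raises the count from 90 to 9,900, roughly a hundredfold increase for a tenfold increase in size.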

UDP Multicast for Discovery

By default, DDS relies on UDP multicast for peer discovery. Many institutional networks disable multicast for security or performance reasons, and large WiFi deployments routinely suppress it. The failure mode is silent: nodes simply do not find each other, leaving users to diagnose a network-level issue they may not even know exists.

Large Message Transfers

DDS uses UDP as its default transport. While this gives fine-grained QoS control, UDP is far less optimised than TCP across the entire software stack — from OS kernels to network chipsets. Linux defaults to small UDP buffer sizes (~256 KB), which is routinely insufficient for the images and point clouds that are staples of robotics. Transferring large sensor data reliably requires manual kernel tuning that most users are not equipped to perform.
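A rough sketch of why the buffer size matters, in Python. The frame dimensions are an illustrative assumption (an uncompressed 1920x1080 RGB image), and the calculation assumes the message is split into datagrams of at most the maximum IPv4 UDP payload; actual DDS implementations fragment at their own protocol level, so treat the numbers as orders of magnitude only.

```python
import math
import socket

# Largest UDP payload over IPv4: 65,535 minus 20 (IP header) minus 8 (UDP header).
MAX_UDP_PAYLOAD = 65_507


def datagrams_needed(message_bytes: int, payload_bytes: int = MAX_UDP_PAYLOAD) -> int:
    """Minimum number of UDP datagrams needed to carry one message."""
    return math.ceil(message_bytes / payload_bytes)


def fits_in_buffer(message_bytes: int, buffer_bytes: int) -> bool:
    """True if a whole message could sit in the receive buffer at once."""
    return message_bytes <= buffer_bytes


def default_udp_rcvbuf() -> int:
    """Receive buffer size the OS gives a fresh UDP socket, in bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    image_bytes = 1920 * 1080 * 3  # illustrative uncompressed RGB frame
    rcvbuf = default_udp_rcvbuf()
    print(f"frame size: {image_bytes:,} bytes")
    print(f"datagrams needed: {datagrams_needed(image_bytes)}")
    print(f"default receive buffer: {rcvbuf:,} bytes")
    print(f"frame fits in buffer: {fits_in_buffer(image_bytes, rcvbuf)}")
```

A single frame of this size is about 6.2 MB, far larger than a buffer of roughly 256 KB, so the receiver has to drain the socket while the frame is still arriving or datagrams are dropped.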

WiFi Reliability

WiFi compounds the multicast and large-message problems described above. Spotty connectivity interacts badly with UDP-fragmented large messages, and disabled multicast breaks discovery. Since ROS 2 is routinely used on mobile robots and debugging laptops, poor out-of-the-box WiFi behaviour is a significant friction point.

Complex Tuning

All of the above issues are, in principle, addressable through DDS configuration. In practice, the configuration surface is enormous, the parameters that work on one network often fail on another, and new users have no clear path to the right settings. This is complexity relocated, not eliminated.

Vendor-Specific Extensions

Several DDS vendors have developed proprietary workarounds — custom discovery servers, non-standard transports, closed-source tooling. These create vendor lock-in and cannot be freely used in an open-source framework. They also diverge from each other, multiplying what users must learn.

The study is careful to note that DDS works well when expertly tuned on a well-managed network — as evidenced by its use in mission-critical systems worldwide. The goal is not to replace DDS for demanding applications, but to provide an alternative that "just works" for the broad majority of robotics use cases.

Requirements Gathering

The core team drew requirements from three sources: known ROS 2 use cases, targeted interviews with key stakeholders, and a public community survey.

The Survey

On July 31, 2023, the team posted to ROS Discourse asking the community for input. Over 180 responses were collected. A few highlights from the technical data:

  • Fleet sizes — Nearly half of respondents run fewer than 10 robots; the other half are roughly evenly split between 10–1,000 and larger fleets.
  • Topic counts — The most common range was 20–200 topics per deployment, with a significant portion exceeding 200.
  • Containers — The vast majority use Docker or Podman.
  • Network topologies — No single topology dominated. Respondents reported localhost-only, WiFi, VPN, cellular, and mixed configurations.

When asked to suggest alternative middlewares to investigate, Zenoh was the most frequently nominated option, ahead of TCPROS, MQTT, and ZeroMQ.

The Requirements

Using RFC 2119 terminology, the team derived the following requirements:

Must-have

Requirement | Notes
Pub/Sub | Peer-to-peer preferred for performance
Security: Encryption | e.g. TLS
Security: Authentication | Certificate-based peer verification
Security: Access Control | Per-identity, per-topic granularity
Graceful disconnect/reconnect | Critical for WiFi and mobile robots
Tolerance to bandwidth changes | WiFi can swing by orders of magnitude
Configure network interface | Essential for virtual networks and routing
Multi-megabyte messages | Images, point clouds at up to ~30 Hz
Fast small messages | Robot state at ~1 kHz, sub-1 kB
Restart discovery without restarting nodes | No single point of failure in safety-critical systems
Cross-platform support | Ubuntu amd64/arm64 and Windows (ROS 2 Tier 1)
OSI-approved permissive license | No copyleft
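The two message-size requirements describe very different loads. The Python sketch below turns them into link throughput and a per-message time budget. The 2 MB and 1 kB message sizes are illustrative assumptions consistent with "multi-megabyte" and "sub-1 kB"; the rates come from the table.

```python
def throughput_mbit_per_s(message_bytes: int, rate_hz: float) -> float:
    """Payload throughput in megabits per second (decimal units)."""
    return message_bytes * 8 * rate_hz / 1_000_000


def period_ms(rate_hz: float) -> float:
    """Time available per message, in milliseconds."""
    return 1000 / rate_hz


if __name__ == "__main__":
    # (label, message size in bytes, publish rate in Hz); sizes are assumptions.
    workloads = [
        ("large sensor data", 2_000_000, 30),
        ("robot state", 1_000, 1000),
    ]
    for label, size, rate in workloads:
        mbit = throughput_mbit_per_s(size, rate)
        print(f"{label}: {mbit:.0f} Mbit/s, {period_ms(rate):.2f} ms per message")
```

With these assumed sizes, the sensor stream needs 480 Mbit/s of payload bandwidth but has about 33 ms per message, while the state stream needs only 8 Mbit/s but leaves 1 ms per message. A middleware has to be good at both bulk throughput and low per-message overhead.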

Should-have

  • Built-in discovery
  • Routing across subnets
  • Shared memory for intra-host transfers
  • Peer-to-peer data connections (brokers amplify bandwidth and latency)

May-have

  • RPC support
  • Protocol debugging tooling (CLI, Wireshark plugins)
  • Message stream prioritisation
  • QoS reliability and history controls
  • Latching for late-joining subscribers
  • Static peer configuration
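One way to read the three tiers is that any unmet Must disqualifies a candidate, while Should and May items only affect how the remaining candidates rank. The Python sketch below illustrates that reading. It is not the study's scoring method: the requirement names are abbreviated subsets of the lists above, and the candidates are hypothetical.

```python
from dataclasses import dataclass, field

# Abbreviated, illustrative subsets of the tiers listed above.
MUST = frozenset({"pub_sub", "encryption", "authentication", "large_messages"})
SHOULD = frozenset({"built_in_discovery", "subnet_routing", "shared_memory"})


@dataclass
class Candidate:
    """A hypothetical middleware and the features it provides."""
    name: str
    features: set = field(default_factory=set)


def screen(candidate: Candidate) -> dict:
    """Apply RFC 2119-style tiers: any unmet MUST disqualifies."""
    missing = sorted(MUST - candidate.features)
    return {
        "name": candidate.name,
        "qualifies": not missing,
        "missing_must": missing,
        "should_met": len(SHOULD & candidate.features),
    }


if __name__ == "__main__":
    # Both candidates are made up for illustration.
    full = Candidate("hypothetical_a", set(MUST) | {"shared_memory"})
    partial = Candidate("hypothetical_b", set(MUST) - {"encryption"})
    for candidate in (full, partial):
        print(screen(candidate))
```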

Comparative Analysis

Middlewares Investigated

The team examined over twenty options. A representative selection:

Middleware | License | Existing RMW?
Eclipse Cyclone DDS | EPL 2.0 | Yes
eProsima Fast DDS | Apache 2.0 | Yes
RTI Connext | Proprietary | Yes
Zenoh | Apache 2.0 / EPL 2.0 | Yes (rmw_zenoh)
Zenoh-Pico | Apache 2.0 |
MQTT | Implementation-dependent |
ZeroMQ / nng | MPL 2.0 / MIT |
LCM | LGPL |
IceOryx | Apache 2.0 | Yes (rmw_iceoryx)
OPC-UA | Mixed |
eCal | Apache 2.0 | Yes (rmw_ecal)
Kafka | Apache 2.0 |
TCPROS (ROS 1) | BSD / Apache |
Cyphal (libcanard) | MIT |

On Performance

The team deliberately excluded detailed performance benchmarking from the comparison. Their reasoning, in summary: previous performance testing exercises (the middleware selections for Galactic and Humble) proved extremely time-intensive and could only surface gross differences, because machine configuration, network conditions, and middleware tuning all dominate the fine-grained numbers. In ideal conditions, all seriously considered options can saturate a gigabit link. The meaningful differences lie in how easily each middleware can be configured to perform well across the wide range of real-world applications.

Third-party benchmarks (including the Zenoh vs MQTT vs Kafka vs DDS comparison) were noted as illustrative but not treated as definitive evidence.

Key Takeaways from the Requirements Matrix

The full requirements-versus-middleware matrix is in Appendix A of the study. The team's summary findings:

  • Zenoh meets most requirements. Where gaps exist, either the feature is already in development or it can be layered on top of existing Zenoh mechanisms.
  • TCPROS (the ROS 1 transport) also meets most requirements, since it was designed specifically for robotics — but it is a legacy protocol without a path forward.
  • MQTT meets several requirements and is widely used in IoT, but its message size limitations and fully-brokered architecture are poor fits for ROS use cases.
  • ZeroMQ / nng meet a number of requirements and are actively used by Gazebo, but ZeroMQ is fundamentally a toolkit of networking primitives — building a fully-featured middleware on top would require substantial additional development.
  • OPC-UA meets several requirements but uses a brokered architecture and lacks built-in discovery.
  • DDS (the current choice) meets most requirements, but with the documented problems that motivated this study.
  • Kafka is widely used but complex, and its messaging model does not map naturally to ROS concepts.

Conclusion

The study concludes:

"The research has concluded that Zenoh best meets the requirements, and will be chosen as an alternative middleware. Zenoh was also the most-recommended alternative by users. It can be viewed as a modern version of the TCPROS implementation, and meets most of the ROS 2 requirements."

Zenoh satisfies every Must requirement either natively or through in-development features. It provides built-in discovery via both gossip and UDP multicast, routes across subnets through a Zenoh router, supports shared memory (experimental at the time of the study), and operates peer-to-peer — avoiding the bandwidth amplification of brokered systems. Its dual Apache 2.0 / EPL 2.0 licensing satisfies the permissive OSI requirement, and it already had an existing RMW implementation (rmw_zenoh) maintained at github.com/atolab/rmw_zenoh.
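As an illustration of how those discovery options might be expressed, the sketch below builds a Zenoh-style configuration as a Python dictionary and serialises it to JSON. The key names (mode, connect.endpoints, scouting.multicast.enabled, scouting.gossip.enabled) and the router port 7447 are assumed from Zenoh's default configuration file, and the router address is a placeholder; check all of them against the Zenoh version you run.

```python
import json


def make_peer_config(router_endpoint: str, use_multicast: bool) -> dict:
    """Build a Zenoh-style config for a peer that reaches other subnets
    through a router. Key names are assumptions; verify before use."""
    return {
        "mode": "peer",
        # Router used to reach peers outside the local subnet.
        "connect": {"endpoints": [router_endpoint]},
        "scouting": {
            # Turn multicast scouting off on networks that block it.
            "multicast": {"enabled": use_multicast},
            # Gossip lets peers learn about each other via existing links.
            "gossip": {"enabled": True},
        },
    }


if __name__ == "__main__":
    # Placeholder address from the documentation range 192.0.2.0/24.
    config = make_peer_config("tcp/192.0.2.10:7447", use_multicast=False)
    print(json.dumps(config, indent=2))
```

The point of the sketch is that a network without multicast does not have to fail silently: discovery can fall back to gossip through a known router endpoint.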

For me, having designed Zenoh precisely to work across the full compute continuum — from microcontrollers to cloud infrastructure — seeing it chosen by the ROS 2 core team as the middleware that "just works" for robotics is a meaningful validation. The problems DDS exposed at scale over eight years are exactly the ones Zenoh was built to avoid from first principles.

The next step, as noted in the study, was to begin design discussions on discourse.ros.org and develop the implementation. That work is now well underway.