This post is very much WIP

After attending a lecture on high performance computing at my university, where I first came into contact with communication technologies such as MPI and InfiniBand as a potential underlying interconnect, I wanted to experiment with them a bit more than was possible in the university lab. We were given the opportunity to run distributed software via MPI on the lab PCs, and I later found out that it is also possible to run software on an actual HPC cluster via the BWUniCluster 2 project.

What none of this allows, however, is interacting with the high-speed network hardware directly.

In this post, I want to share my experience in experimenting with Infiniband hardware, and provide some resources for anyone who wants to do this themselves.

Hardware

Fortunately, such network equipment is available cheaply on eBay, and while no longer the fastest available, the previous generation of high-performance hardware is still quite a bit faster than what I was used to. I purchased two QLogic IBA7322 InfiniBand cards and a matching cable for less than 50€ in total, with a link speed of 40Gb/s.

Resources

Resources and help for programming InfiniBand hardware were not that easy to find, but a few places turned out to contain really valuable information.

Platforms, Infrastructure

Not using dedicated, identical hardware and software setups at both ends makes setup somewhat more complicated. The required software works on Ubuntu, and I was eventually also able to get it working on Arch, but I haven't yet managed to use it on my NAS, which runs TrueNAS (FreeBSD-based).

At the time of writing (Jan 30 2022), the current kernel seems to contain a bug that prevents device initialization. A patch appears to be available, but kernel 5.16 still didn't work for me. Reverting to 5.10 resolved the issue.

Ubuntu packages:

  • libibverbs-dev: IB Verbs library
  • infiniband-diags: provides diagnostic tools like ibstat and ibstatus
  • rdma-core: “basic boot time support for systems that use the Linux kernel’s remote direct memory access (RDMA) subsystem”, whatever that means
  • opensm: The subnet manager

Subnet manager

One essential piece of software is the Subnet Manager (SM). It could run on a dedicated server in the network, but since I am just connecting two PCs directly, it has to run on one of the nodes. I used opensm, which runs as a systemd service in the background. Without an SM, no connection between the NICs is possible. Running opensm on Ubuntu is as simple as:

sudo apt install opensm
sudo systemctl start opensm

The SM is responsible for transitioning the link state from Initializing to Active.

IPoIB

After installing rdma-core, a network device should be created automatically at boot and will be visible as ibpNsM in the output of ip link. This can be used to establish an IP connection over InfiniBand (IPoIB).

The port has to be enabled on both ends of the link:

sudo ibportstate -D 0 1 enable

The link state can be queried using sudo ibportstate -D 0 1 or ibstat. It should transition from Down to Initializing once a cable has been connected, and to Active once the SM has set up the connection.

Assign different IP addresses to both ends of the connection:

sudo ip addr add 10.0.0.1/24 dev ibp1s0

It is now possible to reach the other node via IPoIB. This can be used as-is; however, IPoIB does not come close to the link speed on my setup (~10Gb/s, CPU-limited by the iperf3 server).

Software

All my software experiments are available on GitHub.

The goal I set myself for this experiment is to send an image from one PC to the other, apply some image-processing operation there, and send the result back to the first PC. I also wanted to write a C++ library that abstracts away most of the communication details.

Library: librdmacm

Library: libibverbs

Connection Establishment

Creating a connection for use with libibverbs is unfortunately not straightforward. It is simplified, however, by using rdma_cm. It provides the function rdma_resolve_addr, which, if IPoIB is already configured, can be used to obtain the InfiniBand addresses of the NICs. This also binds a context to the proper local device.
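A minimal sketch of what this can look like (this is not lifted verbatim from my library; the peer address 10.0.0.2, the port space, the timeout and the missing error handling are placeholders):

#include <rdma/rdma_cma.h>
#include <netdb.h>
#include <cstdio>

int main() {
    // Resolve the peer's IPoIB address (placeholder) to a sockaddr first
    addrinfo hints{}, *res = nullptr;
    hints.ai_family = AF_INET;
    if (getaddrinfo("10.0.0.2", nullptr, &hints, &res) != 0)
        return 1;

    rdma_event_channel *ec = rdma_create_event_channel();
    rdma_cm_id *id = nullptr;
    rdma_create_id(ec, &id, nullptr, RDMA_PS_TCP);

    // Resolves the IP address to an InfiniBand address and binds the id to the local device
    rdma_resolve_addr(id, nullptr, res->ai_addr, 2000 /* timeout in ms */);

    rdma_cm_event *event = nullptr;
    rdma_get_cm_event(ec, &event); // blocks until RDMA_CM_EVENT_ADDR_RESOLVED (or an error event)
    std::printf("cm event: %s\n", rdma_event_str(event->event));
    rdma_ack_cm_event(event);

    // id->verbs now refers to the ibv_context of the matching local device;
    // rdma_resolve_route() and rdma_connect() would follow from here.
    freeaddrinfo(res);
    return 0;
}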

Queue Pair Semantics

Queue Pairs, and queues in general, are a recurring concept in the verbs interface. Sending and receiving data over InfiniBand is always done via queues. Both sending and receiving a message require actively posting a Work Request in the form of a Work Queue Element to either the send or the receive queue. The send and receive queue on one side of the connection are combined into a Queue Pair.

A queue is also usually connected to a Completion Queue, where Completion Queue Elements are posted by the hardware as soon as a Work Queue Element has been completed.
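To make this a bit more concrete, here is roughly how a completion queue and a queue pair are created (a sketch; conn with its rdma_cm_id and protection domain, as well as the queue depths, are assumptions and not the exact code from my library):

ibv_cq *cq = ibv_create_cq(conn->id->verbs, /*cqe=*/16, nullptr, nullptr, 0);

ibv_qp_init_attr qp_attr{};
qp_attr.send_cq = cq;            // completions for send work requests go here
qp_attr.recv_cq = cq;            // completions for receive work requests go here
qp_attr.qp_type = IBV_QPT_RC;    // reliable connection
qp_attr.cap.max_send_wr = 16;    // queue depths
qp_attr.cap.max_recv_wr = 16;
qp_attr.cap.max_send_sge = 1;
qp_attr.cap.max_recv_sge = 1;

// rdma_create_qp creates the QP on the cm id and transitions it through its initial states
rdma_create_qp(conn->id, conn->pd, &qp_attr);
ibv_qp *qp = conn->id->qp;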

Buffers

Buffers are regions of memory intended for sending and receiving data. They have to be registered with the verbs library using ibv_reg_mr before use. This allows the receiving end of the connection to have the data placed directly into an application buffer, without any additional copying. In my library, I wrote the Buffer<T> class, which wraps a std::unique_ptr<T[]> and manages registering and deregistering the memory region with the verbs library. RDMAMojo mentions that registration takes some time, so I also built the BufferSet class, which holds already-registered buffers that are either unused or in flight.
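The registration itself boils down to a single ibv_reg_mr call; the sketch below shows roughly what Buffer<T> does (simplified, error handling omitted, and not necessarily identical to the class in the repository):

#include <infiniband/verbs.h>
#include <memory>
#include <cstddef>

template <typename T>
class Buffer {
public:
    Buffer(ibv_pd *pd, std::size_t count)
        : data_(std::make_unique<T[]>(count)),
          // IBV_ACCESS_LOCAL_WRITE allows the HCA to write received data into the buffer
          mr_(ibv_reg_mr(pd, data_.get(), count * sizeof(T), IBV_ACCESS_LOCAL_WRITE)) {}

    ~Buffer() {
        if (mr_) ibv_dereg_mr(mr_); // deregister the memory region again
    }

    T *data() { return data_.get(); }
    ibv_mr *getMR() { return mr_; }

private:
    std::unique_ptr<T[]> data_;
    ibv_mr *mr_;
};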

Sending

Sending a message via InfiniBand means posting a send work request to the appropriate queue pair.

In detail, first the buffer b needs to be wrapped inside a scatter-gather-element:

ibv_sge sge{};
sge.addr = reinterpret_cast<uint64_t>(b.data());
sge.length = getSendSize();
sge.lkey = b.getMR()->lkey;

And then a work request wr can be created and sent:

ibv_send_wr wr{};
wr.wr_id = reinterpret_cast<uint64_t>(b.data());
wr.opcode = IBV_WR_SEND; // Send request that must match a corresponding receive request on the peer
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED; // We want a completion notification for this send request

ibv_send_wr *bad_wr = nullptr;
fmt::print("posting send with size {}\n", sge.length);
auto send_result = ibv_post_send(conn->qp, &wr, &bad_wr);

The wr_id field allows us to attach some data to the work request, which will also be available in the corresponding work completion event once the buffer has actually been sent. Putting the pointer to the data buffer here allows us to identify the buffer and mark it as available for reuse once the verbs library is done sending:

[...]
char *receiveBufferData = reinterpret_cast<char *>(wc.wr_id);
if (wc.opcode == IBV_WC_SEND) {
    fmt::print("Send completed successfully\n");
    returnBuffer(findSendBuffer(receiveBufferData));
}

Receiving

In order to receive messages, a receive work request has to be posted to the receive side of the local queue pair. This requires allocating a buffer and knowing the message size beforehand.

ibv_sge sge{};
sge.addr = reinterpret_cast<uint64_t>(b.data());  // The previously registered buffer
sge.length = getRecvSize();                       // Size of expected message
sge.lkey = b.getMR()->lkey;

ibv_recv_wr wr{};
wr.wr_id = reinterpret_cast<uintptr_t>(b.data()); // ID to identify WR on completion
wr.next = nullptr;
wr.sg_list = &sge;
wr.num_sge = 1;

ibv_recv_wr *bad_wr = nullptr;
ibv_post_recv(conn->qp, &wr, &bad_wr);

It is not required to use message-based communication! It is also possible to use RDMA in a way that mimics writing directly to memory on the other host.
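For completeness, such a one-sided RDMA write is posted much like a send; the remote buffer address and its rkey have to be exchanged out of band first. This is a sketch, not something my library currently uses, and remoteAddr and remoteKey are placeholders:

ibv_sge sge{};
sge.addr = reinterpret_cast<uint64_t>(b.data());
sge.length = getSendSize();
sge.lkey = b.getMR()->lkey;

ibv_send_wr wr{};
wr.opcode = IBV_WR_RDMA_WRITE;       // write directly into remote memory, no receive request needed
wr.wr.rdma.remote_addr = remoteAddr; // address of a registered buffer on the peer
wr.wr.rdma.rkey = remoteKey;         // rkey of that remote memory region
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;

ibv_send_wr *bad_wr = nullptr;
ibv_post_send(conn->qp, &wr, &bad_wr);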

Completion Events

Since sending and receiving data happen completely asynchronously, information about the completion of a task is delivered via “work completions”. Receiving a work completion is rather uninteresting in the send case, as it only involves marking the send buffer as available again. For receiving, however, the receive work request might have been posted a long time ago, without knowing whether anyone would ever send anything (but knowing the size and having prepared a buffer). In this case the work completion is the notification of successful reception of data, and might trigger processing of the message.

Completions are delivered via completion queues. There exists another related concept called completion channel, which provides a file descriptor that can be used for waiting for events in the associated queue.

A Completion event channel is essentially a file descriptor that is used to deliver Work Completion notifications to a userspace process.

[…]

One or more Completion Queues can be associated with the same Completion event channel.

It is also noted on RDMAMojo that in low-latency environments it can be preferable to poll the completion queue directly. The file descriptor, however, helps if the thread processing events should sleep between events, and it simplifies terminating the waiting thread even if no completions arrive. In my library, the CompletionPoller class owns the completion queue and an associated channel, and waits for events using the channel. Once an event arrives, a callback is called.
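The core of that loop looks roughly like this (a simplified sketch; channel and cq are the completion channel and queue created earlier, and running and callback are stand-ins for members of CompletionPoller):

ibv_req_notify_cq(cq, 0); // ask for a notification on the next completion

while (running) {
    ibv_cq *ev_cq = nullptr;
    void *ev_ctx = nullptr;
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx)) // blocks on the channel's file descriptor
        break;
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0); // re-arm before draining the queue

    ibv_wc wc{};
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) { // drain all available completions
        if (wc.status == IBV_WC_SUCCESS)
            callback(wc); // e.g. return a send buffer or process a received message
        else
            fmt::print("work completion failed: {}\n", ibv_wc_status_str(wc.status));
    }
}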

Weird problems that you may run into

  • ibportstate throws errors like mad_rpc_open_port: can't open UMAD port ((null):0): Sounds bad, but the real issue is that you forgot to plug in the cable…

  • The port state does not change from Initializing to Active, the SM is running, I checked! Rebooting one or both systems fixed the issue for me every single time. 🙂