Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with RDMAvsTcpBenchmark of DiSNI over SoftRoCE #37

Closed
faramir opened this issue Dec 11, 2018 · 11 comments
Closed

Problem with RDMAvsTcpBenchmark of DiSNI over SoftRoCE #37

faramir opened this issue Dec 11, 2018 · 11 comments

Comments

@faramir
Copy link

faramir commented Dec 11, 2018

Hello,

After a hard and long time spent trying to compile Soft-iWARP (without success), I found a module RXE (kernel: 4.9.0-8-amd64) and Soft-RoCE that works without any problems (sudo apt-get -t stretch-backports install rdma-core ibverbs-providers ibverbs-utils libibverbs-dev librdmacm-dev).

I've successfully compiled and installed DiSNI on the virtual machines (Debian on VirtualBox).

The basic, simple example (com.ibm.disni.examples.SendRecv*) works like a charm.
However, I have problem with com.ibm.disni.benchmarks.RDMAvsTcpBenchmark*

Sometimes it finishes as it should, sometimes it hangs. I've added some diagnostic outputs in the classes, and sometimes it hangs after 2000 iterations of the RDMA loop, sometimes after 1942 iterations and so on.
Then I have also changed the value that is being send. Client sends current iteration number, server sends the minus current iteration number. The received value is not always the next expected value (eg. 0, 1, 1, 3, 3, 4, ... or -1, -2, -4, -5, -5, -6, -7, -8, -9, ...)

I does not have a clue, where I should look what causes the problem so I'm unable to find a solution.

Best regards,
Marek

---cut---

server-101$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer -a 192.168.56.101 -k 3000
client-102$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient -a 192.168.56.101 -k 3000

...

jstack prints (GC and other JVM threads ommited) on the server:
"main" #1 prio=5 os_prio=0 tid=0x00007f577800a800 nid=0x773 waiting on condition [0x00007f577fa75000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f5c061f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer.runRDMA(RDMAvsTcpBenchmarkServer.java:116)  # clientEndpoint.getWcEvents().take();
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer.launch(RDMAvsTcpBenchmarkServer.java:159)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer.main(RDMAvsTcpBenchmarkServer.java:48)
"Thread-1" #10 prio=5 os_prio=0 tid=0x00007f5778189000 nid=0x784 runnable [0x00007f57575bc000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCqEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaVerbsNat.getCqEvent(RdmaVerbsNat.java:165)
        at com.ibm.disni.verbs.IbvCompChannel.getCqEvent(IbvCompChannel.java:77)
        at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:120)
        at java.lang.Thread.run(Thread.java:748)
"Thread-0" #9 prio=5 os_prio=0 tid=0x00007f57781f2800 nid=0x783 runnable [0x00007f575c1c8000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCmEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaCmNat.getCmEvent(RdmaCmNat.java:193)
        at com.ibm.disni.verbs.RdmaEventChannel.getCmEvent(RdmaEventChannel.java:75)
        at com.ibm.disni.RdmaCmProcessor.run(RdmaCmProcessor.java:68)
        at java.lang.Thread.run(Thread.java:748)

on the client:
"main" #1 prio=5 os_prio=0 tid=0x00007f274000a800 nid=0x748 waiting on condition [0x00007f2749ece000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f5c24ab0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient.runRDMA(RDMAvsTcpBenchmarkClient.java:115)  # endpoint.getWcEvents().take();
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient.launch(RDMAvsTcpBenchmarkClient.java:156)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient.main(RDMAvsTcpBenchmarkClient.java:46)
"Thread-1" #10 prio=5 os_prio=0 tid=0x00007f2740209800 nid=0x759 runnable [0x00007f26f9bf4000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCqEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaVerbsNat.getCqEvent(RdmaVerbsNat.java:165)
        at com.ibm.disni.verbs.IbvCompChannel.getCqEvent(IbvCompChannel.java:77)
        at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:120)
        at java.lang.Thread.run(Thread.java:748)
"Thread-0" #9 prio=5 os_prio=0 tid=0x00007f2740202000 nid=0x758 runnable [0x00007f26f9cf5000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCmEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaCmNat.getCmEvent(RdmaCmNat.java:193)
        at com.ibm.disni.verbs.RdmaEventChannel.getCmEvent(RdmaEventChannel.java:75)
        at com.ibm.disni.RdmaCmProcessor.run(RdmaCmProcessor.java:68)
        at java.lang.Thread.run(Thread.java:748)
@faramir
Copy link
Author

faramir commented Dec 11, 2018

@patrickstuedi
Copy link
Member

@petro-rudenko Can you take a look at this?

@petro-rudenko
Copy link
Contributor

Interesting - reproduce on my environment. So overall barrier is working - it sending after receiving. The issue with stall data on some iterations:

RDMA: 0 < 301
recv    0
send    0
RDMA: 1 < 301
recv    0
send    -1
RDMA: 2 < 301
recv    2
send    -2

Checking.

@petro-rudenko
Copy link
Contributor

@faramir thanks, can you please call postRecv.execute() after endpoint.getWcEvents().take(); like in #38 PR. With it got consistent result.

@faramir
Copy link
Author

faramir commented Dec 14, 2018

Thank you @petro-rudenko . Results are consistent now.

However, I'm still facing problem with the occasional deadlock. Eg. server sends -1195 and waits for receiving message by client, and the client sends 1196 and waits for receiving message by server... Are you able to reproduce also this issue?

        /* Server */
        for (int i = 0; i < loopCount + 1; i++) {
            // Recv PING
            clientEndpoint.getWcEvents().take();
            postRecv.execute();
            System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
            recvBuf.clear();

            //Send PONG
            sendBuf.asIntBuffer().put(0, -i);
            System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
            postSend.execute();
            clientEndpoint.getWcEvents().take();
            sendBuf.clear();
        }
        /* Client */
        for (int i = 0; i < loopCount + 1; i++) {
            // Send PING
            sendBuf.asIntBuffer().put(0, i);
            sendBuf.clear();
            System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
            postSend.execute();
            endpoint.getWcEvents().take();

            // Recv PONG
            endpoint.getWcEvents().take();
            postRecv.execute();
            System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
            recvBuf.clear();
        }
server$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarksDMAvsTcpBenchmarkServer -a 192.168.56.101 -k 3000
client$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarksDMAvsTcpBenchmarkClient -a 192.168.56.101 -k 3000

Example output files atteched.
client-102.txt
server-101.txt

@patrickstuedi
Copy link
Member

I haven't looked at the RdmavsTcpBenchmark in too much detail, but a quick glance made me wonder if there is not a bug with the receive posting. I think the receive at the client should always be posted prior to sending a request. Moving the receive posting just before reaping the event it not enough, the response might already have come back.

@faramir
Copy link
Author

faramir commented Dec 14, 2018

What I have just seen (in the output client-102.txt and server-101.txt): on the server, just before freeze, it "received" two times the same value:

RDMA: 1193 < 3001
recv    1193
send    -1193
RDMA: 1194 < 3001
recv    1194
send    -1194
RDMA: 1195 < 3001
recv    1194
send    -1195

while on the client it sends the right values:

RDMA: 1193 < 3001
send    1193
recv    -1193
RDMA: 1194 < 3001
send    1194
recv    -1194
RDMA: 1195 < 3001
send    1195
recv    -1194
RDMA: 1196 < 3001
send    1196

One of the interesting recording of this problem is available here: https://youtu.be/QSt9wZqkloo (after a while, just before freezing, client received next notification and go one step forward).

@petro-rudenko
Copy link
Contributor

For me doesn't reproduce, but yes try as @patrickstuedi suggested. Client:

 for (int i = 0; i < loopCount + 1; i++) {
            System.err.println("RDMA: " + i + " < " + (loopCount + 1));

            postRecv.execute();
            // Send PING
            sendBuf.asIntBuffer().put(0, i);
            sendBuf.clear();
            postSend.execute();
            System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
            wc = endpoint.getWcEvents().take();

            // Recv PONG
            wc = endpoint.getWcEvents().take();
            System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
            recvBuf.clear();
        }

@faramir
Copy link
Author

faramir commented Dec 14, 2018

I tried it. Unfortunately, the freezing is still there.

I decided to do simpler test - Client sends only messages to Server, Server receives only messages. That test also lead to deadlock.

/* Client */
for (int i = 0; i < loopCount + 1; i++) {
    System.err.println("RDMA: " + i + " < " + (loopCount + 1));
    // Send PING
    sendBuf.asIntBuffer().put(0, i);
    sendBuf.clear();
    System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
    postSend.execute();
    endpoint.getWcEvents().take();
}

/* Server */
for (int i = 0; i < loopCount + 1; i++) {
    System.err.println("RDMA: " + i + " < " + (loopCount + 1));
    // Recv PING
    postRecv.execute();
    clientEndpoint.getWcEvents().take();
    System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
    recvBuf.clear();
}

I've tried to invoke clientEndpoint.postRecv(clientEndpoint.getWrList_recv()) instead of using postRecv variable every time in the loop, and to invoke free() after execution, but all those attempts do not help - freeze is still there.

@petro-rudenko
Copy link
Contributor

petro-rudenko commented Jan 28, 2019

@faramir just watch this talk about rxe / SoftRoce. They had similar issues to yours, where state machine in rxe could just stuck. Long story short:

  • Can you try SoftRoce with the latest kernel
  • If you're using virtualization - make sure you configured KVM / openwswitch (virtualbox could not work)
  • Try to compile this kernel module: https://github.com/larrystevenwise/krping - and try to run ping with different message sizes.

@mokdeve
Copy link

mokdeve commented Apr 19, 2019

Is there any size limit ?

when I run RDMAvsTcpBenchmarkClient -a IP -k 3000 -s 16384,
RDMA result:
Total time: 1363.831266 ms
Bidirectional bandwidth: 0.06712907722339899 Gb/s
Bidirectional average latency: 0.454610422 ms
TCP result:
Total time: 1800.555879 ms
Bidirectional bandwidth: 0.05084692757541461 Gb/s
Bidirectional average latency: 0.600185293 ms

when I run RDMAvsTcpBenchmarkClient -a IP -k 3000 -s 32768,
blocked at RDMAvsTcpBenchmarkClient::client channel set up

RDMAvsTcpBenchmarkServer -a IP -k 3000 -s 32768
blocked at RDMAvsTcpBenchmarkClient::client channel set up

PS: Linux 5.0.2 soft-roce

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants