-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Need proper handling for RDMA_CM_EVENT_ROUTE_ERROR #41
Comments
Dealing with partially initialized enpoint seems requiring a lot of modification and tests to existing implementations. |
@lynus So you're saying, a client gets RDMA_CM_EVENT_ROUTE_ERROR and after that blocks indefinitely, correct? |
@patrickstuedi Yes,I have confirmed that. This is due to RdmaEndpoint.dispatchCmEvent does not notifyAll() when receiving unknow events. |
Did you check if pulling notifyAll() out of the "if" statements helps? https://github.com/zrlio/disni/blob/master/src/main/java/com/ibm/disni/RdmaEndpoint.java#L135 |
Pulling notifyAll() out of if statement can fix the indefinite loop problem mentioned above. However, that's no enough. Say endpoint received route resolve error , then connect() gets notified and throws a IOException. Now the endpoint is in partial initialized state, which disni so far can not properly handle. For example, this route-resolve-failed endpoint can not simply recall connect, because it has already called rdma_resolve_addr, and bond to a device. On the other hand, if ep is of RdmaActiveEndpoint type, close() implementation of this type will call RdmaActiveEndpointGroup.close(), which would throw exception on
due to
In summary, disni lacks the consideration of handling partially initialized ep.
|
Ok, understand. But can we not solve this issue even simpler:
|
This only makes RdmaEndpoint itself robust, not the derived class. I still think isResourceAllocated() is needed, to inform derived class whether init() has been successfully executed. For example, the close() of DaRPCEndpoint should be: |
Why not reset the endpoint to initial state if it fails in the connect? Resetting can be easily handled in the connect method. Or is there any sensible use-case in retrying the connect from a specific state. If that is the case I would rather expose this in the interface, e.g. by having multiple methods to connect. |
@PepperJo We can't simply set connState to CONN_STATE_INITIALIZED and then connect() again, because |
@lynus the problem is that it is hidden to the user that the connect() method performs multiple steps. I don't like users calling connect() multiple times with connect starting wherever the last error occurred. Because this can change the semantic of the connect call e.g. think about dynamic routeable environments. So I propose whenever we failed in the connect call we set the endpoint in failure state and throw an exception whenever a method is called other than close(). |
I encountered in a situation where client randomly receive RDMA_CM_EVENT_ROUTE_ERROR, when client connects the server at a relatively high frequence. Disni does not handle this case properly, as it would block indefinitely in RdmaEndpoint.connect().
I think a new 'CONN_STATE_ERROR' CM state should be introduced to RdmaEndpoint and properly handled in connect() and close().
The text was updated successfully, but these errors were encountered: