Getting OOM when reconnecting socket #8

Open
physiii opened this issue Dec 9, 2017 · 8 comments

@physiii

physiii commented Dec 9, 2017

I want to reconnect to a socket if it is closed.

So I set a connect flag in the LWS_CALLBACK_CLOSED handler:

case LWS_CALLBACK_CLOSED:
	//vTaskDelay(1000/portTICK_PERIOD_MS);
	printf("re-connecting with token protocol\n");
	token_connect = true;
	token_received = false;
	break;

That flag triggers a call to lws_client_connect_via_info in main.c:

if (token_connect && token_conn_count >= 10) {
	printf("%s token protocol\n",tag);
	vTaskDelay(3000/portTICK_PERIOD_MS);
	token_connecting = true;
	token_req_sent = false;
	//token_connect = false;  
	wsi_token = NULL;
	i.pwsi = &wsi_token;
	i.protocol = "token-protocol";
	i.path = "/tokens";
	wsi_token = lws_client_connect_via_info(&i);
	token_conn_count = 0;
}

This works when I manually restart my server:

[power-protocol] callback_token: 50
re-connecting with token protocol
4: error on reading from skt : 104
[lws_service loop]
[connection-loop] token protocol
4: _realloc: size 544: client wsi
4: _realloc: size 192: client ws struct
4: _realloc: size 5152: client stash
4: _realloc: size 996: ah struct
4: _realloc: size 512: ah data
4: lws_client_connect_2: 0x3ffd51c4: address 192.168.0.10

However, if I leave my server running for a few hours, the socket inevitably closes and reconnecting fails with "OOM":

[connection-loop] token protocol
4: _realloc: size 544: client wsi
4: _realloc: size 192: client ws struct
4: OOM

Can you tell me how I can auto-reconnect when a socket is closed? It's strange to me that it works when I manually restart the server but not when it happens after leaving it running.

@lws-team
Member

lws-team commented Dec 9, 2017

I can't see your whole code, but this condition alone looks like it may cause your code to keep creating additional connections until OOM once you have had the first successful connect:

if (token_connect && token_conn_count >= 10) {

In that case the client connections find no reason to close. New ones keep getting created when token_conn_count hits 10.
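The effect described above can be illustrated with a small model. This is only a sketch: `fake_connect()`, `struct conn`, `loop_pass()` and `POOL_SIZE` are hypothetical stand-ins (a fixed pool plays the role of the device heap), not lws code and not the original main.c.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4	/* models the small device heap; the value is arbitrary */

/* Hypothetical stand-in for a client connection; not the real struct lws. */
struct conn {
	bool in_use;
};

static struct conn pool[POOL_SIZE];

/* Hypothetical stand-in for lws_client_connect_via_info(); NULL models "OOM". */
static struct conn *fake_connect(void)
{
	for (size_t n = 0; n < POOL_SIZE; n++) {
		if (!pool[n].in_use) {
			pool[n].in_use = true;
			return &pool[n];
		}
	}
	return NULL;
}

/*
 * One pass of a loop that reconnects whenever a counter reaches 10.
 * Nothing checks whether an earlier connection is still open, so every
 * trigger claims another slot. Returns false once the pool is exhausted.
 */
static bool loop_pass(bool want_connect, unsigned int *count)
{
	if (want_connect && ++(*count) >= 10) {
		*count = 0;
		return fake_connect() != NULL;
	}
	return true;
}
```

In this model the connect flag is never cleared and no slot is ever released, so the fifth trigger fails no matter how slowly the counter runs.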

@physiii
Author

physiii commented Dec 10, 2017

Okay, a slightly different approach:

I should probably wait for a response before trying to reconnect. So I need to set token_connect to true when I receive a connection error:

re-connecting with token protocol
4: lws_client_connect_2: 0x3ffd7e80: address 192.168.0.10
4: Connect failed errno=128

I have been using callbacks like LWS_CALLBACK_CLOSED and SYSTEM_EVENT_STA_GOT_IP but how do I get LWS_ERRNO from client-handshake.c in a callback so I can try the connection again?

@lws-team
Member

I think you are missing the point... once you opened a connection, why would it ever close? But your code keeps opening new connections at intervals. So there is an OOM eventually... resetting the server forces all your open connections to close, avoiding the OOM...

You don't need to know errno: lws will retry the connect if the error is nonfatal, or else inform you with LWS_CALLBACK_CLIENT_CONNECTION_ERROR that it met something fatal.

@physiii
Author

physiii commented Dec 11, 2017

Okay I see why there is an OOM.

I don't think it retries the connection, because I don't see LWS_CALLBACK_CLIENT_WRITEABLE after I reset my server - just LWS_CALLBACK_HTTP_DROP_PROTOCOL then LWS_CALLBACK_CLOSED, so I can't write to that socket anymore. It also doesn't retry the connection if I attempt lws_client_connect_via_info before SYSTEM_EVENT_STA_GOT_IP, so I wait for that event before I try to connect.

Why do I no longer get LWS_CALLBACK_CLIENT_WRITEABLE if it is retrying the connection?

@lws-team
Member

It retries the accept you showed erroring out, if nothing fatal happened. You must do whole reconnects in the way you have been, but with a bit more care tracking the state of any existing connect.

The accept and particularly SSL_accept() are multistep things requiring network roundtrips. Because LWS is nonblocking, things like accept() that normally just stall your thread until they complete return immediately and need to be retried later. LWS takes care of that for you.
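The retry-later pattern described above can be sketched as a toy model. All names here (`struct handshake`, `handshake_step()`, `service_until_done()`) are hypothetical and only illustrate the idea; they are not lws internals.

```c
enum step_result { STEP_PENDING, STEP_DONE };

/* Hypothetical model of a handshake that needs several network round trips. */
struct handshake {
	int roundtrips_left;
};

/*
 * Nonblocking step: consume at most one round trip and return immediately
 * instead of stalling the caller until the whole handshake completes.
 */
static enum step_result handshake_step(struct handshake *h)
{
	if (h->roundtrips_left > 0)
		h->roundtrips_left--;

	return h->roundtrips_left == 0 ? STEP_DONE : STEP_PENDING;
}

/*
 * Event-loop side: retry the step on each pass until it completes.
 * Returns the number of passes used, or -1 if max_passes was not enough.
 */
static int service_until_done(struct handshake *h, int max_passes)
{
	for (int pass = 1; pass <= max_passes; pass++) {
		if (handshake_step(h) == STEP_DONE)
			return pass;
		/* a real loop would service other sockets here before retrying */
	}

	return -1;
}
```

The retry in this model happens inside the loop, so the caller never sees the intermediate "pending" results. That matches the statement that the library handles this kind of retry, while a whole reconnect after a close remains the application's job.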

WRITEABLE only comes once you have a successful connection and have asked for it.

@physiii
Author

physiii commented Dec 12, 2017

Thank you for the clarification and patience.

I added a connect flag back to the lws_service loop, so when I receive LWS_CALLBACK_CLOSED I set the connect flag to true, triggering lws_client_connect_via_info on that socket. I then wait for LWS_CALLBACK_CLIENT_ESTABLISHED to set the connect flag to false again.

The problem is that lws_client_connect_via_info runs again before I get LWS_CALLBACK_CLIENT_ESTABLISHED and creates redundant connections. How can I know if a connection failed after lws_client_connect_via_info? I would wait until I get LWS_CALLBACK_CLIENT_CONNECTION_ERROR to try again, but I never see that callback - just an error message like Connect failed errno=128.

Also, when you say to ask for a connection, I assume that means running lws_client_connect_via_info.

@lws-team
Member

There is a test client example in lws. Although this is about the only demo code for ESP32, it is all based on normal lws, where there are more examples.

https://github.com/warmcat/libwebsockets/blob/master/test-apps/test-client.c

What it does is treat the client wsi pointer returned by lws_client_connect_via_info() as the flag. If it's NULL, it will try to connect, after considering a ratelimit. If the client closes or gets LWS_CALLBACK_CLIENT_CONNECTION_ERROR, it sets the copy of the client wsi to NULL, signalling it should retry.
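A minimal, library-free sketch of that pointer-as-flag scheme follows. It is a model of the idea rather than the code in test-client.c: `struct conn`, `enum event`, `start_connect()`, `maybe_connect()` and `on_event()` are hypothetical stand-ins, the 2-second interval is an arbitrary assumption, and the current time is passed in as a parameter so the behaviour is easy to follow.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real lws types or callback reasons. */
struct conn {
	int id;
};

enum event { EV_ESTABLISHED, EV_CONNECTION_ERROR, EV_CLOSED };

struct client {
	struct conn *wsi;	/* NULL means no connection or attempt exists */
	unsigned int last_try;	/* time of the last attempt, in seconds */
	unsigned int attempts;	/* number of attempts started so far */
};

static struct conn the_conn;

/* Hypothetical stand-in for lws_client_connect_via_info(). */
static struct conn *start_connect(struct client *c)
{
	c->attempts++;
	return &the_conn;
}

/* Allow at most one attempt every `secs` seconds. */
static bool ratelimit_ok(unsigned int *last, unsigned int now, unsigned int secs)
{
	if (now - *last < secs)
		return false;

	*last = now;
	return true;
}

/* Called once per service-loop pass: connect only if nothing is in flight. */
static void maybe_connect(struct client *c, unsigned int now)
{
	if (c->wsi == NULL && ratelimit_ok(&c->last_try, now, 2))
		c->wsi = start_connect(c);
}

/* Called from the protocol callback: forget the handle on failure or close. */
static void on_event(struct client *c, enum event ev)
{
	if (ev == EV_CONNECTION_ERROR || ev == EV_CLOSED)
		c->wsi = NULL;	/* signals the loop to retry */
}
```

Because the pointer stays non-NULL from the moment an attempt starts, the loop cannot start a second attempt while the first is still pending, which is the property the counter-based trigger lacked.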

Note that LWS_CALLBACK_CLIENT_CONNECTION_ERROR comes on the first protocol of the vhost, i.e., vhost->protocols[0]. That is because the active protocol is not negotiated until there has been a successful connection.
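The routing rule in that note can be pictured with a toy dispatcher. Everything in it (`struct protocol`, `struct vhost`, `route()`, `deliver()`, the `errors_seen` counter) is a hypothetical model for illustration, not how lws is implemented.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the lws callback reasons. */
enum event { EV_CONNECTION_ERROR, EV_ESTABLISHED };

struct protocol {
	const char *name;
	int errors_seen;	/* incremented by this protocol's handler */
};

struct vhost {
	struct protocol *protocols;
	size_t count;
};

struct conn {
	struct vhost *vh;
	struct protocol *negotiated;	/* NULL until the handshake succeeds */
};

/* Before negotiation, the only protocol available to deliver to is protocols[0]. */
static struct protocol *route(struct conn *c)
{
	return c->negotiated ? c->negotiated : &c->vh->protocols[0];
}

/* Deliver an event to whichever protocol the connection currently maps to. */
static void deliver(struct conn *c, enum event ev)
{
	struct protocol *p = route(c);

	if (ev == EV_CONNECTION_ERROR)
		p->errors_seen++;
}
```

The practical consequence is that a connection error for a client that asked for "token-protocol" would be seen only by the first protocol's callback, so the code that resets the client wsi pointer needs to live there.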

@physiii
Author

physiii commented Dec 12, 2017

Okay, getting close.

I'm using your ratelimit function and the client wsi pointer as a flag, but the pointer is not being set to NULL if an attempt was made while the server is not running. Since no connection is made, LWS_CALLBACK_CLOSED is not called to set the wsi pointer to NULL, and since there's no indication that lws_client_connect_via_info failed besides Connect failed errno=128, it never retries.

I also tried setting the client wsi pointer to NULL and making another attempt if LWS_CALLBACK_CLIENT_ESTABLISHED isn't called after some time, but that gives an OOM.


2 participants