Occasional ca connect issue #16
This may be a manifestation of one of the long-standing "gotchas" of libca.
I just noticed I forgot to put the flushes in the test script, although they are there in aioca where the problem manifests itself. Let me try again with flushes...
We still get the occasional long connection time with the flushes inserted:

```python
import ctypes
import time
from queue import Queue

import pytest
from epicscorelibs.ca import cadef

NUM = 10000  # number of PVs served by the `ioc` fixture (not shown here)


@pytest.mark.parametrize("iteration", [i for i in range(NUM)])
def test_get_pv(iteration, ioc):
    q = Queue()

    @cadef.connection_handler
    def put_on_queue(args):
        if args.op == cadef.CA_OP_CONN_UP:
            q.put(None)

    t = time.time()
    chid = ctypes.c_void_p()
    cadef.ca_create_channel(
        "LONGOUT%d" % iteration, put_on_queue, 0, 0, ctypes.byref(chid)
    )
    cadef.ca_flush_io()  # push the search request out without waiting
    q.get()  # block until the handler reports CA_OP_CONN_UP
    cadef.ca_clear_channel(chid.value)
    cadef.ca_flush_io()
    assert time.time() - t < 2
```
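The test above relies on an `ioc` pytest fixture and a `NUM` constant that are not shown in the thread. A minimal sketch of what the fixture might look like, modeled on the standalone script posted later in this issue (the fixture scope and the context-creation placement are assumptions):

```python
import subprocess
import sys
import tempfile

import pytest
from epicscorelibs.ca import cadef

NUM = 10000  # matches the constant used by the test's parametrize


@pytest.fixture(scope="session")
def ioc():
    # Create the (preemptive) CA client context once for the whole session
    cadef.ca_context_create(1)
    with tempfile.NamedTemporaryFile() as records:
        # Database with NUM longout records, served by a python soft IOC
        records.write(
            b"\n".join(b'record(longout, "LONGOUT%d") {\n}' % i for i in range(NUM))
        )
        records.flush()
        process = subprocess.Popen(
            [sys.executable, "-m", "epicscorelibs.ioc", "-d", records.name],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        yield process
        process.communicate("exit()")
```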
With these additional flush_io calls we get fewer of the long connection times, but maybe 1 in 10000 is still over the 2s threshold.
We see this occasionally on beamlines with R3.14.12.7 too; beamlines that do thousands of CA connects (with cothread) at the start of their scripts occasionally have one that crosses the 5s threshold for reporting an error.
Are there other CA servers/clients on the host/network where you are running this test? E.g. is the test client receiving beacons from CA servers other than the IOC being run by the script? Could you try isolating your test by setting: (maybe replacing …)
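The suggested settings were not preserved in this copy of the thread. A plausible reconstruction using the standard CA client isolation variables, set from Python before the CA context is created (the exact values are assumptions):

```python
import os

# Only search the loopback address, and do not add the host's broadcast
# addresses to the search list; this isolates the client from other CA
# servers on the network. Must be set before cadef.ca_context_create(1).
os.environ["EPICS_CA_AUTO_ADDR_LIST"] = "NO"
os.environ["EPICS_CA_ADDR_LIST"] = "127.0.0.1"
```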
I'm inside a container running under podman, using a bridge network.
I've set those environment variables and run the tests again; I get 2 failures out of 10000.
Another possible scenario is if the newly started test IOC, after sending an initial beacon, receives a burst of search requests from other clients. This might result in some of the searches from the test client being dropped.
Running with 100000 iterations I see 3 iterations with times over the 2s threshold. The latter two instances have associated warnings from (I think) …
It might be interesting to restructure this script to record all of the times, to see if there is any trend.
There have long been occasional reports of (re)connect slowdowns with CA at scale. So far these have been anecdotal and/or not reproduced. While these may not all have the same underlying cause, it is useful to have a test case which is reproducible (eventually: 100000 iterations takes me ~45 minutes to run).

@ralphlange @anjohnson fyi.
I've done this:

```python
import ctypes
import subprocess
import sys
import tempfile
import time
from queue import Queue

import matplotlib.pyplot as plt
import numpy as np
from epicscorelibs.ca import cadef

NUM = 10000

cadef.ca_context_create(1)

with tempfile.NamedTemporaryFile() as records:
    # Write a database with NUM longout records and start a soft IOC on it
    records.write(
        b"\n".join(b'record(longout, "LONGOUT%d") {\n}' % i for i in range(NUM))
    )
    records.flush()
    process = subprocess.Popen(
        [sys.executable, "-m", "epicscorelibs.ioc", "-d", records.name],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )

    q = Queue()

    @cadef.connection_handler
    def put_on_queue(args):
        if args.op == cadef.CA_OP_CONN_UP:
            q.put(None)

    times = []
    for i in range(NUM):
        # Time a single connect/disconnect cycle
        t = time.time()
        chid = ctypes.c_void_p()
        cadef.ca_create_channel("LONGOUT%d" % i, put_on_queue, 0, 0, ctypes.byref(chid))
        cadef.ca_flush_io()
        q.get()  # wait for CA_OP_CONN_UP
        cadef.ca_clear_channel(chid.value)
        cadef.ca_flush_io()
        delta = time.time() - t
        times.append(delta)
        print(f"Iteration {i} took {delta}s", end="\r", flush=True)
        if delta > 2:
            print()  # keep the slow iterations visible

    times = np.array(times)
    print(f"Min: {np.min(times)}")
    print(f"Max: {np.max(times)}")
    print(f"Mean: {np.mean(times)}")
    print(f"Std: {np.std(times)}")
    plt.plot(times)
    plt.show()
    process.communicate("exit()")
```
Running it gives output: [plot of per-iteration connection times not reproduced]
Thanks, I'm happy to see that there is no trend. I don't have a good feel for what could be causing these occasional delays, so the following is speculative.

The number of channels on the circuit drops to zero on every iteration, so each connect also pays for a full circuit setup. If so, then the delays should go away if you add a second channel, which is kept open throughout the entire test. Waiting for this one to connect would also synchronize with test IOC startup, removing the longer delay on the first iteration.

I also have old suspicions that there are problems with the search back-off timer logic of libca. This might manifest as discrete steps in the anomalous delays. However, I don't think ~9 samples is enough to show this one way or the other. Maybe setting …
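A minimal sketch of how the discrete-steps idea could be checked, assuming `times` is the numpy array collected by the script above (not code from the thread):

```python
import numpy as np

# Assumes `times` is the per-iteration numpy array from the script above
slow = np.sort(times[times > 2.0])
print(f"{len(slow)} anomalous delays: {slow}")

# If the search back-off timer is involved, the slow values should cluster
# around a few discrete values (the back-off intervals) rather than
# spreading uniformly between 2s and the maximum.
counts, edges = np.histogram(slow, bins=20)
for count, lo, hi in zip(counts, edges, edges[1:]):
    if count:
        print(f"{lo:5.2f}s - {hi:5.2f}s: {count}")
```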
Interesting, keeping a channel open to the IOC for the duration makes it work reliably:

```python
import ctypes
import subprocess
import sys
import tempfile
import time
from queue import Queue

import matplotlib.pyplot as plt
import numpy as np
from epicscorelibs.ca import cadef

NUM = 100000

if __name__ == "__main__":
    cadef.ca_context_create(1)
    with tempfile.NamedTemporaryFile() as records:
        records.write(
            b"\n".join(b'record(longout, "LONGOUT%d") {\n}' % i for i in range(NUM))
        )
        records.flush()
        process = subprocess.Popen(
            [sys.executable, "-m", "epicscorelibs.ioc", "-d", records.name],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        try:
            q = Queue()

            @cadef.connection_handler
            def put_on_queue(args):
                if args.op == cadef.CA_OP_CONN_UP:
                    q.put(None)

            def connect(pv_name):
                chid = ctypes.c_void_p()
                cadef.ca_create_channel(pv_name, put_on_queue, 0, 0, ctypes.byref(chid))
                cadef.ca_flush_io()
                q.get()  # wait for CA_OP_CONN_UP
                return chid

            def disconnect(chid):
                cadef.ca_clear_channel(chid.value)
                cadef.ca_flush_io()

            times = []
            # Keep one channel open for the whole run, so the TCP circuit
            # to the IOC is never torn down
            chid0 = connect("LONGOUT0")
            for i in range(1, NUM):
                t = time.time()
                chid = connect("LONGOUT%d" % i)
                disconnect(chid)
                delta = time.time() - t
                times.append(delta)
                print(f"Iteration {i} took {delta}s", end="\r", flush=True)
                if delta > 2:
                    print()
            disconnect(chid0)

            times = np.array(times)
            print(f"Min: {np.min(times)}")
            print(f"Max: {np.max(times)}")
            print(f"Mean: {np.mean(times)}")
            print(f"Std: {np.std(times)}")
            plt.plot(times)
            plt.show()
        finally:
            process.communicate("exit()")
```

What might this indicate? The docs on …
I think this points towards a problem with the connection/circuit phase, and away from a problem with the search phase. So there is probably no need to try changing …

My working theory is that a ca_clear_channel() of the only channel on a circuit results in the circuit being closed, and that the teardown can occasionally delay the setup of the circuit for the next connection.

This theory is mostly informed by my experience with pvAccessCPP, which had, and still has, some similar problems with worker synchronization when closing channels. So still a bit of a guess.
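One way to probe this theory without a dedicated keep-alive channel would be to overlap the lifetimes of consecutive channels, so the circuit never drops to zero channels. A sketch of a drop-in replacement for the timing loop in the script above, reusing its `connect()` and `disconnect()` helpers (an editorial sketch, not code from the thread):

```python
# Drop-in replacement for the timing loop above; assumes the connect() and
# disconnect() helpers and the `times` list from that script are in scope.
prev = connect("LONGOUT0")
for i in range(1, NUM):
    t = time.time()
    chid = connect("LONGOUT%d" % i)  # circuit is still held open by `prev`
    disconnect(prev)  # safe to drop now: the new channel keeps the circuit up
    prev = chid
    times.append(time.time() - t)
disconnect(prev)
# If the circuit-teardown theory is right, this variant should show the same
# reliable timings as the keep-alive version, since the TCP circuit to the
# IOC is never closed between iterations.
```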
When doing a lot of ca_connects in a short space of time (1000-ish in a few seconds), some of them take longer than usual to connect. I've had a go at reproducing this, and I have a fairly minimal reproducer:
This will make an IOC with 1000 PVs, then do a ca_connect and wait for connection to each of those PVs serially. Most take about 0.1s to connect, but about 1 in 1000 will take between 2 and 9 seconds. Have you ever seen this @mdavidsaver?