Fix race condition in subnet deletion API #663
base: master
Conversation
Add a document about transactional semantics for Alcor caches in general and specifically about Ignite caches.
This pull request fixes 1 alert when merging 8160d4a into 559f2bd - view on LGTM.com fixed alerts:
    return currentVpcState;
}

@Override
This is good, but there is a convention about the naming. The convention is to call someCache.getTransaction().start(), .commit(), and .rollback() directly, instead of wrapping the transaction calls in a method and calling that. It would be better if this code followed the convention.
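A minimal sketch of that convention follows; the ICache wrapper, the VpcEntity value type, and the exact method signatures are assumptions modeled on this comment, not copied from Alcor's code.

import org.apache.ignite.transactions.Transaction;

// Sketch only: start/commit/rollback are called directly on the cache's transaction,
// rather than being hidden behind a helper method.
void updateVpcState(ICache<String, VpcEntity> vpcCache, String vpcId) throws Exception {
    Transaction tx = vpcCache.getTransaction().start();   // start, per the convention
    try {
        VpcEntity current = vpcCache.get(vpcId);           // read inside the transaction
        // ... modify `current` as needed ...
        vpcCache.put(vpcId, current);                      // write inside the same transaction
        tx.commit();                                       // publish the change atomically
    } catch (Exception e) {
        tx.rollback();                                     // undo partial changes on failure
        throw e;
    }
}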
Another change is required for this to actually work: the Ignite configuration must declare that this cache has an atomicity mode of TRANSACTIONAL. Here is the declaration.
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<!-- Set the cache name. -->
<property name="name" value="com.futurewei.alcor.web.entity.vpc.VpcEntity"/>
<!-- Set the cache mode. -->
<property name="atomicityMode" value="TRANSACTIONAL"/>
<!-- Other cache parameters. -->
<property name="cacheMode" value="PARTITIONED"/>
</bean>
In kubernetes/services/ignite_config.xml there is a cacheConfiguration section containing a list of cache beans; the above declaration should be added under that list tag. Everyone needs this update to the Ignite config file to avoid running into this issue, and in addition the Ignite database should be deleted (the default location is <your_ignite_home>/work). If anyone has valuable data in the Ignite DB, let me know and we can delete just this cache.
If there are any questions about this extra work, ping me.
@pkommoju Thanks. We have updated the transaction naming and added a new cacheConfiguration for VpcRepo.
This pull request fixes 1 alert when merging 639a0cd into 559f2bd - view on LGTM.com fixed alerts:
    <!-- Other cache parameters. -->
    <property name="cacheMode" value="PARTITIONED"/>
</bean>

<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <!-- Set the cache name. -->
    <property name="name" value="dpm_nodeinfo_cache"/>
Looks good. Best to actually try to start Ignite with this change; one can never be sure about XML nesting and Ignite's naming oddities.
This pull request fixes 1 alert when merging 85de8fb into 559f2bd - view on LGTM.com fixed alerts:
This pull request fixes 1 alert when merging d6ed075 into 559f2bd - view on LGTM.com fixed alerts:
This pull request fixes 1 alert when merging 016aaf5 into faed9db - view on LGTM.com fixed alerts:
This pull request fixes 1 alert when merging 2ae9045 into 9887247 - view on LGTM.com fixed alerts:
CacheConfiguration cfg = new CacheConfiguration();
cfg.setName(getIpAddrCacheName(rangeId));
cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);

ICache<String, IpAddrAlloc> ipAddrCache =
-       cacheFactory.getCache(IpAddrAlloc.class, getIpAddrCacheName(rangeId));
+       cacheFactory.getCache(IpAddrAlloc.class, cfg);
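For reference, here is a minimal sketch of the same idea against the plain Ignite API; the cache name, key and value types are placeholders for illustration and do not reflect Alcor's cacheFactory.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;

public class TransactionalCacheSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Declare the cache as TRANSACTIONAL so that transactions are honoured.
            CacheConfiguration<String, String> cfg = new CacheConfiguration<>("IpAddrCache-example");
            cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);

            IgniteCache<String, String> cache = ignite.getOrCreateCache(cfg);

            try (Transaction tx = ignite.transactions().txStart()) {
                cache.put("10.0.0.1", "activated");   // isolated until commit
                tx.commit();
            }   // if commit() was not reached, close() rolls the transaction back
        }
    }
}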
@pkommoju We are hitting an IP allocation failure, likely caused by the change in this file. Could you try to reproduce this issue locally and propose a solution (and commit code directly to this branch)? Thanks.
More detailed logs from IP manager, failure at the end:
2021-08-20 08:12:44.053 INFO 32420 --- [nio-9004-exec-1] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.repo.IpAddrRangeRepo.createIpAddrRange() startTime: 808185834948500ns, endTime: 808186051074000ns, duration: 216ms
2021-08-20 08:12:44.054 INFO 32420 --- [nio-9004-exec-1] c.f.a.p.s.implement.IpAddrServiceImpl : Create ip address range success, request: IpAddrRangeRequest{id='44a1b42c-8c74-4fed-a9e8-f0db3b804568', vpcId='9192a4d4-ffff-4ece-b3f0-8d36e3d88038', subnetId='9192a4d4-ffff-4ece-b3f0-8d36e3d88000', ipVersion=4, firstIp='10.0.0.1', lastIp='10.0.0.254', usedIps=0, totalIps=254}
2021-08-20 08:12:44.108 INFO 32420 --- [nio-9004-exec-1] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.service.implement.IpAddrServiceImpl.createIpAddrRange() startTime: 808185826488100ns, endTime: 808186105872800ns, duration: 279ms
2021-08-20 08:12:44.108 INFO 32420 --- [nio-9004-exec-1] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.controller.IpAddrController.createIpAddrRange() startTime: 808185806863500ns, endTime: 808186106032800ns, duration: 299ms
2021-08-20 08:12:44.199 INFO 32420 --- [nio-9004-exec-1] i.j.internal.reporters.LoggingReporter : Span reported: 19c5df46e43f0868:119f40ed61423030:19c5df46e43f0868:1 - createIpAddrRange
2021-08-20 08:12:45.145 INFO 32420 --- [nio-9004-exec-2] global : Getting or creating cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:12:45.266 INFO 32420 --- [nio-9004-exec-2] global : Retrieved cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:12:45.441 INFO 32420 --- [nio-9004-exec-2] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.repo.IpAddrRangeRepo.allocateIpAddr() startTime: 808187127741900ns, endTime: 808187439280300ns, duration: 311ms
2021-08-20 08:12:45.441 INFO 32420 --- [nio-9004-exec-2] c.f.a.p.s.implement.IpAddrServiceImpl : Allocate ip address success, request: IpAddrRequest{ipVersion=4, vpcId='9192a4d4-ffff-4ece-b3f0-8d36e3d88038', subnetId='9192a4d4-ffff-4ece-b3f0-8d36e3d88000', rangeId='44a1b42c-8c74-4fed-a9e8-f0db3b804568', ip='10.0.0.1', state='activated'}
2021-08-20 08:12:45.451 INFO 32420 --- [nio-9004-exec-2] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.service.implement.IpAddrServiceImpl.allocateIpAddr() startTime: 808187127625600ns, endTime: 808187449775400ns, duration: 322ms
2021-08-20 08:12:45.451 INFO 32420 --- [nio-9004-exec-2] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.controller.IpAddrController.allocateIpAddr() startTime: 808187127480000ns, endTime: 808187450186000ns, duration: 322ms
2021-08-20 08:12:45.459 INFO 32420 --- [nio-9004-exec-2] i.j.internal.reporters.LoggingReporter : Span reported: f7abf3c5dd709f12:cdd48c47c05d4a75:1156c528c55d9674:1 - allocateIpAddr
2021-08-20 08:19:07.366 INFO 32420 --- [nio-9004-exec-5] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.repo.IpAddrRangeRepo.createIpAddrRange() startTime: 808569352538200ns, endTime: 808569364686900ns, duration: 12ms
2021-08-20 08:19:07.367 INFO 32420 --- [nio-9004-exec-5] c.f.a.p.s.implement.IpAddrServiceImpl : Create ip address range success, request: IpAddrRangeRequest{id='c49181bb-4ad0-4cdf-a240-3443fa00df65', vpcId='9192a4d4-ffff-4ece-b3f0-8d36e3d88038', subnetId='9192a4d4-ffff-4ece-b3f0-8d36e3d88001', ipVersion=4, firstIp='10.0.1.1', lastIp='10.0.1.254', usedIps=0, totalIps=254}
2021-08-20 08:19:07.367 INFO 32420 --- [nio-9004-exec-5] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.service.implement.IpAddrServiceImpl.createIpAddrRange() startTime: 808569352499400ns, endTime: 808569365131900ns, duration: 12ms
2021-08-20 08:19:07.367 INFO 32420 --- [nio-9004-exec-5] c.f.alcor.common.stats.StatisticsAspect : com.futurewei.alcor.privateipmanager.controller.IpAddrController.createIpAddrRange() startTime: 808569352336500ns, endTime: 808569365194600ns, duration: 12ms
2021-08-20 08:19:07.369 INFO 32420 --- [nio-9004-exec-5] i.j.internal.reporters.LoggingReporter : Span reported: c854d7eac6e2fd40:1ff2cfc38a8d1790:c854d7eac6e2fd40:1 - createIpAddrRange
2021-08-20 08:19:07.418 INFO 32420 --- [nio-9004-exec-6] global : Getting or creating cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:19:07.420 INFO 32420 --- [nio-9004-exec-6] global : Retrieved cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:19:07.424 WARN 32420 --- [nio-9004-exec-6] c.f.a.p.repo.IpAddrRangeRepo : Allocate ip address from 44a1b42c-8c74-4fed-a9e8-f0db3b804568 failed
2021-08-20 08:19:07.425 INFO 32420 --- [nio-9004-exec-6] global : Getting or creating cache IpAddrCache-c49181bb-4ad0-4cdf-a240-3443fa00df65 AtomicityMode is TRANSACTIONAL
I will work on it.
I will work on it.
Thanks. This issue was triggered when a subnet is created: SM => PM => Private IP Manager, to create a gateway port and allocate an IP.
I did build and run it on the Jenkins machine and it worked fine. I am doing a second run to make sure I didn't pick up the wrong run environment.
I am able to reproduce it. The subnet got created, but the IP address range in question (in the cache named IpAddrCache-e4efed2e-33ab-4783-a7ce-0254dfd695d2) is empty. Whether there should be something there at the point of failure, or whether an empty cache is a legitimate condition that has to be handled, is what I am going to debug and find out.
I set the logging level in PIM to INFO and the ping test succeeded. The IP address cache has three entries, but since this state is from after the test finished, it is not clear whether the cache being empty when the test failed is a legitimate state.
Since adding debug output made the test succeed, I suspect timing (race conditions). Need to investigate further.
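To illustrate the kind of read-check-write race being suspected, here is a generic sketch against the plain Ignite API (this is not Alcor's allocation code, and the key/value names are made up): without a transaction, two threads can both observe an IP as free and both allocate it, whereas a PESSIMISTIC/REPEATABLE_READ transaction locks the key on the first read so the second thread only proceeds after the first commits.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class IpAllocationSketch {
    // Returns true if this caller won the IP, false if someone else already holds it.
    static boolean tryAllocate(Ignite ignite, IgniteCache<String, String> ipCache, String ip) {
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
            String state = ipCache.get(ip);       // locks the key in PESSIMISTIC mode
            if (!"FREE".equals(state)) {
                return false;                     // already taken; tx rolls back on close()
            }
            ipCache.put(ip, "ALLOCATED");         // no other thread can interleave here
            tx.commit();
            return true;
        }
    }
}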
It ran fine with logging=INFO three times, and once without the change. I am rerunning the Jenkins job as-is now. The nature of the failure suggests that there may be more race conditions which have not yet been uncovered (or rather, which this failure has uncovered).
@pkommoju Could you share a link to one Jenkins run which points to this branch? In our local dev environment, we hit the cache issue 100% of the time.
This PR proposes fixes for two issues.