Skip to content
This repository has been archived by the owner on Mar 31, 2023. It is now read-only.

Fix race condition in subnet deletion API #663

Open
wants to merge 43 commits into
base: master
Choose a base branch
from

Conversation

xieus
Copy link
Contributor

@xieus xieus commented Jul 9, 2021

This PR proposes two issue fix

@xieus xieus added bug Something isn't working perf testing Performance Testing labels Jul 9, 2021
@xieus xieus added this to the Version 0.17.2021.07.30 milestone Jul 9, 2021
@xieus xieus self-assigned this Jul 9, 2021
@lgtm-com
Copy link

lgtm-com bot commented Jul 9, 2021

This pull request fixes 1 alert when merging 8160d4a into 559f2bd - view on LGTM.com

fixed alerts:

  • 1 for Boxed variable is never null


return currentVpcState;
}

@Override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good but there is a convention about the naming. The convention is to call someCache.getTransaction().start(), .commit(), .rollback() instead of wrapping the transaction calls in a method and calling it. It would be better if this code follows the convention.

Another change is required for this to actually work. Ignite configuration declare that this cache must have atomicity mode of TRANSACTIONAL. Here is the declatration.

        <bean class="org.apache.ignite.configuration.CacheConfiguration">
            <!-- Set the cache name. -->
             <property name="name" value="com.futurewei.alcor.web.entity.vpc.VpcEntity"/>
            <!-- Set the cache mode. -->
            <property name="atomicityMode" value="TRANSACTIONAL"/>
            <!-- Other cache parameters. -->
            <property name="cacheMode" value="PARTITIONED"/>
         </bean>

In kubernetes/services/ignite_config.xml there is this section

under which there is tag . The above declaration should be added under the tag.

Every one needs this update to the ignite config file to not run into this issue, in addition the iginte database should be deleted. (default is <your_ignite_home>/work). If anyone has valuable data in the ignite DB, let me know we can delete just this cache.

If there are any questions about this extra work, ping me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkommoju Thanks. We have updated the transaction naming and added a new cacheConfiguration for VpcRepo.

@lgtm-com
Copy link

lgtm-com bot commented Jul 9, 2021

This pull request fixes 1 alert when merging 639a0cd into 559f2bd - view on LGTM.com

fixed alerts:

  • 1 for Boxed variable is never null

<!-- Other cache parameters. -->
<property name="cacheMode" value="PARTITIONED"/>
</bean>

<bean class="org.apache.ignite.configuration.CacheConfiguration">
<!-- Set the cache name. -->
<property name="name" value="dpm_nodeinfo_cache"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Best to actually try to start Ignite with this change, one can never be sure about XML nestings and Ignite's naming oddities.

@lgtm-com
Copy link

lgtm-com bot commented Jul 10, 2021

This pull request fixes 1 alert when merging 85de8fb into 559f2bd - view on LGTM.com

fixed alerts:

  • 1 for Boxed variable is never null

@lgtm-com
Copy link

lgtm-com bot commented Jul 13, 2021

This pull request fixes 1 alert when merging d6ed075 into 559f2bd - view on LGTM.com

fixed alerts:

  • 1 for Boxed variable is never null

@xieus xieus linked an issue Jul 22, 2021 that may be closed by this pull request
@lgtm-com
Copy link

lgtm-com bot commented Jul 22, 2021

This pull request fixes 1 alert when merging 016aaf5 into faed9db - view on LGTM.com

fixed alerts:

  • 1 for Boxed variable is never null

@lgtm-com
Copy link

lgtm-com bot commented Aug 20, 2021

This pull request fixes 1 alert when merging 2ae9045 into 9887247 - view on LGTM.com

fixed alerts:

  • 1 for Boxed variable is never null

Comment on lines +169 to +174
CacheConfiguration cfg = new CacheConfiguration();
cfg.setName(getIpAddrCacheName(rangeId));
cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);

ICache<String, IpAddrAlloc> ipAddrCache =
cacheFactory.getCache(IpAddrAlloc.class, getIpAddrCacheName(rangeId));
cacheFactory.getCache(IpAddrAlloc.class, cfg);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkommoju We are hitting a IP allocation failure, likely causing by the change in this file. Could you try to reproduce this issue in your local and propose a solution (and make code directly to this branch)? Thanks.

More detailed logs from IP manager, failure at the end:

2021-08-20 08:12:44.053  INFO 32420 --- [nio-9004-exec-1] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.repo.IpAddrRangeRepo.createIpAddrRange() startTime: 808185834948500ns, endTime: 808186051074000ns, duration: 216ms
2021-08-20 08:12:44.054  INFO 32420 --- [nio-9004-exec-1] c.f.a.p.s.implement.IpAddrServiceImpl    : Create ip address range success, request: IpAddrRangeRequest{id='44a1b42c-8c74-4fed-a9e8-f0db3b804568', vpcId='9192a4d4-ffff-4ece-b3f0-8d36e3d88038', subnetId='9192a4d4-ffff-4ece-b3f0-8d36e3d88000', ipVersion=4, firstIp='10.0.0.1', lastIp='10.0.0.254', usedIps=0, totalIps=254}
2021-08-20 08:12:44.108  INFO 32420 --- [nio-9004-exec-1] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.service.implement.IpAddrServiceImpl.createIpAddrRange() startTime: 808185826488100ns, endTime: 808186105872800ns, duration: 279ms
2021-08-20 08:12:44.108  INFO 32420 --- [nio-9004-exec-1] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.controller.IpAddrController.createIpAddrRange() startTime: 808185806863500ns, endTime: 808186106032800ns, duration: 299ms
2021-08-20 08:12:44.199  INFO 32420 --- [nio-9004-exec-1] i.j.internal.reporters.LoggingReporter   : Span reported: 19c5df46e43f0868:119f40ed61423030:19c5df46e43f0868:1 - createIpAddrRange
2021-08-20 08:12:45.145  INFO 32420 --- [nio-9004-exec-2] global                                   : Getting or creating cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:12:45.266  INFO 32420 --- [nio-9004-exec-2] global                                   : Retrieved cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:12:45.441  INFO 32420 --- [nio-9004-exec-2] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.repo.IpAddrRangeRepo.allocateIpAddr() startTime: 808187127741900ns, endTime: 808187439280300ns, duration: 311ms
2021-08-20 08:12:45.441  INFO 32420 --- [nio-9004-exec-2] c.f.a.p.s.implement.IpAddrServiceImpl    : Allocate ip address success, request: IpAddrRequest{ipVersion=4, vpcId='9192a4d4-ffff-4ece-b3f0-8d36e3d88038', subnetId='9192a4d4-ffff-4ece-b3f0-8d36e3d88000', rangeId='44a1b42c-8c74-4fed-a9e8-f0db3b804568', ip='10.0.0.1', state='activated'}
2021-08-20 08:12:45.451  INFO 32420 --- [nio-9004-exec-2] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.service.implement.IpAddrServiceImpl.allocateIpAddr() startTime: 808187127625600ns, endTime: 808187449775400ns, duration: 322ms
2021-08-20 08:12:45.451  INFO 32420 --- [nio-9004-exec-2] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.controller.IpAddrController.allocateIpAddr() startTime: 808187127480000ns, endTime: 808187450186000ns, duration: 322ms
2021-08-20 08:12:45.459  INFO 32420 --- [nio-9004-exec-2] i.j.internal.reporters.LoggingReporter   : Span reported: f7abf3c5dd709f12:cdd48c47c05d4a75:1156c528c55d9674:1 - allocateIpAddr
2021-08-20 08:19:07.366  INFO 32420 --- [nio-9004-exec-5] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.repo.IpAddrRangeRepo.createIpAddrRange() startTime: 808569352538200ns, endTime: 808569364686900ns, duration: 12ms
2021-08-20 08:19:07.367  INFO 32420 --- [nio-9004-exec-5] c.f.a.p.s.implement.IpAddrServiceImpl    : Create ip address range success, request: IpAddrRangeRequest{id='c49181bb-4ad0-4cdf-a240-3443fa00df65', vpcId='9192a4d4-ffff-4ece-b3f0-8d36e3d88038', subnetId='9192a4d4-ffff-4ece-b3f0-8d36e3d88001', ipVersion=4, firstIp='10.0.1.1', lastIp='10.0.1.254', usedIps=0, totalIps=254}
2021-08-20 08:19:07.367  INFO 32420 --- [nio-9004-exec-5] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.service.implement.IpAddrServiceImpl.createIpAddrRange() startTime: 808569352499400ns, endTime: 808569365131900ns, duration: 12ms
2021-08-20 08:19:07.367  INFO 32420 --- [nio-9004-exec-5] c.f.alcor.common.stats.StatisticsAspect  : com.futurewei.alcor.privateipmanager.controller.IpAddrController.createIpAddrRange() startTime: 808569352336500ns, endTime: 808569365194600ns, duration: 12ms
2021-08-20 08:19:07.369  INFO 32420 --- [nio-9004-exec-5] i.j.internal.reporters.LoggingReporter   : Span reported: c854d7eac6e2fd40:1ff2cfc38a8d1790:c854d7eac6e2fd40:1 - createIpAddrRange
2021-08-20 08:19:07.418  INFO 32420 --- [nio-9004-exec-6] global                                   : Getting or creating cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:19:07.420  INFO 32420 --- [nio-9004-exec-6] global                                   : Retrieved cache IpAddrCache-44a1b42c-8c74-4fed-a9e8-f0db3b804568 AtomicityMode is TRANSACTIONAL
2021-08-20 08:19:07.424  WARN 32420 --- [nio-9004-exec-6] c.f.a.p.repo.IpAddrRangeRepo             : Allocate ip address from 44a1b42c-8c74-4fed-a9e8-f0db3b804568 failed
2021-08-20 08:19:07.425  INFO 32420 --- [nio-9004-exec-6] global                                   : Getting or creating cache IpAddrCache-c49181bb-4ad0-4cdf-a240-3443fa00df65 AtomicityMode is TRANSACTIONAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will work on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will work on it.

Thanks. This issue was triggered when a subnet is created. SM=>PM=>Private ip manager to create a gateway port and allocate ip.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did build and run it on Jenkins machine and it worked fine. I am doing a second run to make sure I didn't pick up wrong run environment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am able to reproduce it. subnet got created but the IP address range in question (in the cache named IpAddrCache-e4efed2e-33ab-4783-a7ce-0254dfd695d2) is empty. Should there be something at the point of failure or is it legitimate condition that has to be dealt with is something I am going to debug and find.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set logging level in PIM to INFO and ping test succeeded. The IP address cache has three entries, but since this state is after the test finished, it is not clear the cache being is empty when the test failed is a legitimate state.
Since adding debug out made the test succeed, I am suspecting timing (race conditions). Need to further investigate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It ran fine with logging=INFO three times, and once without the change. I am rerunning the Jenkins job as is now. The nature of failure suggests that there may be more race conditions which have not been uncovered (or rather this failure has uncovered).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkommoju Could you share a link of one Jenkins run which points to this branch? In local dev environment, we always hit the cache issue, 100%.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working perf testing Performance Testing
Projects
None yet
3 participants