Executor Segmentation fault when callback group is destroyed while spinning #2664
Comments
Possibly a duplicate of #2445 |
I reproduced this issue in rolling. GDB output:
A non-existent guard condition is accessed. It is currently suspected to be related to the notify_guard_condition_ in the temporarily created callback group.
The newly created callback group is stored in the node using a weak pointer.
I will investigate further. |
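The weak-pointer relationship described above can be sketched as follows. This is a simplified stand-in, not the actual rclcpp types: `CallbackGroup` and `notify_group` here are hypothetical, but they illustrate why any user of the group must `lock()` the weak pointer before touching it.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for rclcpp::CallbackGroup: the node keeps only
// a weak_ptr, so the executor must lock() before every use.
struct CallbackGroup {
  bool guard_condition_triggered = false;
};

// Returns true if the group was still alive and could be notified.
bool notify_group(const std::weak_ptr<CallbackGroup> & weak_group) {
  if (auto group = weak_group.lock()) {  // promote to shared_ptr
    group->guard_condition_triggered = true;
    return true;
  }
  return false;  // group (and its guard condition) already destroyed
}
```

If the executor instead cached a raw pointer obtained from an earlier `lock()`, destroying the group would leave that pointer dangling, which matches the crash seen in the GDB output.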
I was also able to reproduce, but it does seem to exclusively happen on fastrtps for me and not with the cyclone middleware, perhaps a sequencing issue? #2665 |
@mjcarroll @Barry-Xu-2018 i only can reproduce this issue with |
…inning ros2/rclcpp#2664 Signed-off-by: Tomoya Fujita <[email protected]>
According to the stack traces in #2664 (comment) and #2445 (comment), it looks like a different problem. |
But freed |
The root cause: see rclcpp/rclcpp/src/rclcpp/executors/executor_entities_collector.cpp, lines 299 to 302 at 37e3688.
In Executor::collect_entities(), all guard conditions (including the interrupt, shutdown, node, and callback group guard conditions) are gathered; see rclcpp/rclcpp/src/rclcpp/executor.cpp, lines 672 to 677 at 37e3688.
In storage_rebuild_rcl_wait_set_with_sets(), the waitable is added to the wait set.
Finally, it calls ExecutorNotifyWaitable::add_to_wait_set(); see rclcpp/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp, lines 54 to 65 at 37e3688.
If the guard condition isn't freed, the executor gets notified about the callback group change and the guard condition is removed before the next call to rmw_wait(), so the issue doesn't occur. So the root cause is unrelated to rmw. |
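The hazard in this chain can be sketched in a few lines. This is an illustrative model, not rclcpp code: the raw-pointer vector mimics the non-owning `rcl_wait_set_t`, and `WaitSetSketch`/`keep_alive` are hypothetical names. The point is that locking each weak pointer and keeping the resulting `shared_ptr` for the duration of the wait prevents the handle from dangling even if the owner drops its reference:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// A guard condition with some opaque middleware handle.
struct GuardCondition { int handle = 0; };

struct WaitSetSketch {
  // Raw pointers mimic the rcl wait set, which does not own anything.
  std::vector<GuardCondition *> guard_conditions;

  // Safe variant: hold shared ownership for as long as the wait set
  // is in use, so the raw handles above cannot dangle.
  std::vector<std::shared_ptr<GuardCondition>> keep_alive;

  void add(const std::weak_ptr<GuardCondition> & weak_gc) {
    if (auto gc = weak_gc.lock()) {
      keep_alive.push_back(gc);              // extend lifetime across the wait
      guard_conditions.push_back(gc.get());  // raw handle for the rmw layer
    }
  }
};
```

Without the `keep_alive` vector, destroying the callback group between `add_to_wait_set()` and `rmw_wait()` frees the guard condition while the wait set still references it, which is exactly the reported segfault.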
@Barry-Xu-2018 thank you for checking on this issue. could you also enlighten me why this problem does not happen with cyclonedds if that is the problem with rclcpp? |
It is related to the implementation of rmw_cyclonedds:

```cpp
// Detach guard conditions
if (gcs) {
  for (size_t i = 0; i < gcs->guard_condition_count; i++) {
    auto x = static_cast<CddsGuardCondition *>(gcs->guard_conditions[i]);
    if (ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems)) {
      bool dummy;
      dds_take_guardcondition(x->gcondh, &dummy);
      trig_idx++;
    } else {
      gcs->guard_conditions[i] = nullptr;
    }
    nelems++;
  }
}
```

Based on the debugging results, when the guard condition is released, |
See rclcpp/rclcpp/src/rclcpp/executor.cpp, line 748 at fbe602f.
There's no way to directly lock the guard condition of the callback group in wait_set_.wait(). |
Possibly the same problem as ros2/demos#654 |
Not sure. After fixing the issue, we can try to see if the problem still occurs. |
@mjcarroll @fujitatomoya @clalancette The root cause has been found. Please look at #2664 (comment). |
My code is not the same as in ros2/demos#654, but I have tested not |
It looks like rmw_cyclonedds checks the waitset trigger index after the wait call. If the index does not match, it sets nullptr for
to be honest, i am not really sure either, including this approach is the appropriate path... @mjcarroll @alsora @jmachowinski any thoughts? |
This is a design failure in the ExecutorNotifyWaitable. Before creating the rmw wait set, we acquire all shared pointers to make sure that they don't run out of scope during the rmw_wait. The bug is in the ExecutorNotifyWaitable, as it only holds weak pointers to the guard conditions. Therefore the GuardConditions can run out of scope during the rmw_wait, causing the segfault. As a fix for jazzy, I would suggest something like this:

```cpp
void
Executor::wait_for_work(std::chrono::nanoseconds timeout)
{
  TRACETOOLS_TRACEPOINT(rclcpp_executor_wait_for_work, timeout.count());

  // Clear any previous wait result
  this->wait_result_.reset();

  // We need to make sure that callback groups don't go out of scope
  // during the wait. As in jazzy they are not covered by the DynamicStorage,
  // we explicitly hold them here as a bugfix.
  std::vector<rclcpp::CallbackGroup::SharedPtr> cbgs;

  {
    std::lock_guard<std::mutex> guard(mutex_);

    if (this->entities_need_rebuild_.exchange(false) || current_collection_.empty()) {
      this->collect_entities();
    }

    auto callback_groups = this->collector_.get_all_callback_groups();
    cbgs.reserve(callback_groups.size());  // reserve, not resize: we push_back below
    for (const auto & w_ptr : callback_groups) {
      auto shr_ptr = w_ptr.lock();
      if (shr_ptr) {
        cbgs.push_back(std::move(shr_ptr));
      }
    }
  }

  this->wait_result_.emplace(wait_set_.wait(timeout));
  if (!this->wait_result_ || this->wait_result_->kind() == WaitResultKind::Empty) {
    RCUTILS_LOG_WARN_NAMED(
      "rclcpp",
      "empty wait set received in wait(). This should never happen.");
  } else {
    // Drop the references to the callback groups before trying to execute anything
    cbgs.clear();

    if (this->wait_result_->kind() == WaitResultKind::Ready && current_notify_waitable_) {
      auto & rcl_wait_set = this->wait_result_->get_wait_set().get_rcl_wait_set();
      if (current_notify_waitable_->is_ready(rcl_wait_set)) {
        current_notify_waitable_->execute(current_notify_waitable_->take_data());
      }
    }
  }
}
``` |
@jmachowinski thanks! @Barry-Xu-2018 can you try #2683 to see if that fixes the problem? |
Jmachowinski's solution is almost the same as what I had in mind in #2664 (comment). I've already verified that this method can fix the problem. See rclcpp/rclcpp/include/rclcpp/wait_set_template.hpp, lines 676 to 677 at 7e9ff5f.
The fix, however, handles the callback group separately in a different place, which is why I was considering whether there might be a better approach. |
@Barry-Xu-2018 Sorry, I missed your solution... I think this was already written somewhere else, but just to sum it up: for rolling, we should afterwards implement a cleaner solution, like locking the callback groups in the DynamicStorage. |
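The cleaner rolling-era approach suggested above can be sketched as follows. All names here (`StorageSketch`, `rebuild`, `alive_count`) are illustrative, not the actual rclcpp DynamicStorage API: on each rebuild, the storage locks every weak callback-group pointer and keeps the resulting shared pointers until the next rebuild, so nothing handed to the wait set can be destroyed mid-wait.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a callback group.
struct CallbackGroupSketch { bool alive = true; };

class StorageSketch {
public:
  void add_group(const std::shared_ptr<CallbackGroupSketch> & group) {
    weak_groups_.push_back(group);
  }

  // Lock all weak pointers; expired groups are silently dropped.
  // The locked shared_ptrs pin the groups until the next rebuild.
  void rebuild() {
    locked_groups_.clear();
    for (const auto & weak_group : weak_groups_) {
      if (auto group = weak_group.lock()) {
        locked_groups_.push_back(group);
      }
    }
  }

  std::size_t alive_count() const { return locked_groups_.size(); }

private:
  std::vector<std::weak_ptr<CallbackGroupSketch>> weak_groups_;
  std::vector<std::shared_ptr<CallbackGroupSketch>> locked_groups_;
};
```

The design choice here is that user code can still destroy a group at any time, but the destruction only takes effect at a well-defined point (the next rebuild), never in the middle of a wait.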
Bug report
Required Info:
Issue description
The rclcpp::Executor crashes when a timer with a non-default callback group is destroyed before the executor stops spinning. The bug also appears if a callback group is created and destroyed without being added to the timer.
The bug does not manifest in the example below when:
Steps to reproduce issue
See the full gist
Expected behavior
The node runs and terminates without issue.
Actual behavior
The node fails with a segmentation fault in rclcpp::Executor::spin().
Additional information
Backtrace: