Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set maintenance mode or stop corosync/pacemaker on update #563

Closed
wants to merge 1 commit into from
Closed

set maintenance mode or stop corosync/pacemaker on update #563

wants to merge 1 commit into from

Conversation

Itxaka
Copy link
Member

@Itxaka Itxaka commented Jul 22, 2016

As per all the docs by pacemaker, the services should be stopped for a software update of the cluster stack. If other packages are being updated but not the cluster stack, then it's enough to set maintenance mode.

FIXME: however if the update is happening on a remote node, then the crm command is unavailable so we need to set maintenance mode another way. UPDATE: something like crowbar/crowbar-ha#187 might fix this.

Depends on crowbar/crowbar-ha#183

block do
%x{#{command}}
node.run_state['found_ha_packages'] = $?.exitstatus
Chef::Log.debug("Run #{command}, got exit status #{node.run_state['found_ha_packages']}.")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/StringLiteralsInInterpolation: Prefer double-quoted strings inside interpolations. (https://github.com/SUSE/style-guides/blob/master/Ruby.md#stylestringliteralsininterpolation)

@@ -91,6 +120,12 @@
end # block
end # ruby_block

%w(corosync pacemaker).each do |s|

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/WordArray: Use [] for an array of words. (https://github.com/SUSE/style-guides/blob/master/Ruby.md#stylewordarray)

only_if do
ha = node.run_state["found_ha_packages"]
is_cluster = !search(:node, "roles:pacemaker-cluster-member").empty?
return !ha && is_cluster && node.run_state["needs_update"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont return here

end # ruby_block

service "pacemaker" do
action :start
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not if is a remote node

@Itxaka
Copy link
Member Author

Itxaka commented Apr 12, 2017

needs rebase + adam patch + some changes around

::Chef::Recipe.send(:include, CrowbarPacemaker::MaintenanceModeHelpers)
::Chef::Resource.send(:include, CrowbarPacemaker::MaintenanceModeHelpers)


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/EmptyLines: Extra blank line detected.

::Chef::Recipe.send(:include, CrowbarPacemaker::MaintenanceModeHelpers)
::Chef::Resource.send(:include, CrowbarPacemaker::MaintenanceModeHelpers)


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/EmptyLines: Extra blank line detected.

@Itxaka
Copy link
Member Author

Itxaka commented Apr 17, 2017

@aspiers ready for re-review.

Spent most of the day testing the different scenarios and it seems to work perfectly.

On a node part of a cluster, non-remote:
When there are package to update but no ha_packages -> sets maintenance mode before update, then chef lifts the maintenance mode at the end of the run
When there are ha packages to update -> stops the services (pacemaker/corosync) then starts pacemaker after the update
Where there are no packages to update -> does nothing, duh!

Sadly, there is some code duplication in regards to https://github.com/crowbar/crowbar-ha/blob/master/chef/cookbooks/crowbar-pacemaker/libraries/maintenance_mode_helpers.rb#L60

This is because the helper on crowbar-pacemaker is already using an execute resource which cannot be called from a ruby_block.
Unfortunately, we need to wrap the crm maintenance mode in a chef resource block, as we need the previous steps to have run to obtain the proper variables in order to fire off the maintenance mode, and you cannot wrap an execute resource inside a ruby_block one (Actually you probably can, but its a huge hack and terrible)

So instead we create several resources that get triggered in the case that maintenance mode is required due to the late evaluation of the variables.

I could not find a better way of doing it while respecting the chef workflow.

service s do
action :stop
only_if { node.run_state["found_ha_packages"] }
not_if { node[:pacemaker][:is_remote] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should make sure that node[:pacemaker] exists; If I'm not missing sometthing, this could could be executed at non-pacemaker nodes as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Itxaka
Copy link
Member Author

Itxaka commented Apr 18, 2017

Missing exceptions for non-cluster nodes in some of the resources!

true
end
only_if do
ha = node.run_state["found_ha_packages"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha seems bit confusing, could you rename this variable? (Or just not create it at all and directly do a check for run_state at line 104)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

end
only_if do
ha = node.run_state["found_ha_packages"]
is_cluster = !search(:node, "roles:pacemaker-cluster-member").empty?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wtf, this is wrong. I wanted to check if the node is part of a cluster, this will just search for any nodes that have that role ¬_¬

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, true. Look also at the same search used earlier.

@@ -17,6 +17,9 @@
# limitations under the License.
#

::Chef::Recipe.send(:include, CrowbarPacemaker::MaintenanceModeHelpers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious - how is this different from ::Chef::Recipe.include CrowbarPacemaker::MaintenanceModeHelpers ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chef-ism. http://lists.opscode.com/sympa/arc/chef/2011-05/msg00286.html

Actually, I think that due the restructuring of the resources in my last update this is not required, I'll retry with the usual include to see

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed anymore, reverting to the usual include

@aspiers
Copy link
Member

aspiers commented Apr 18, 2017

@Itxaka I'm still a bit confused how this is supposed to handle the remote case?

@Itxaka
Copy link
Member Author

Itxaka commented Apr 18, 2017

@aspiers Lets see, for remote case, there should be no pacemaker/corosync services runnig so it wont stop them.

The question is, for remote nodes, when upgrading them, do we need to set the node into maintenance mode as well? The HA guide does not mention remote nodes so Im a bit lost in there and what to do with them.

… nodes (bsc#983617)

Make the update barclamp aware of HA nodes and deal with them properly.
When HA packages need to be updated, stop the HA services before the update
and start them again after the package have been updated.
When normal packages are updated, set the node in maintenance mode as mentioned
on the HA guide.
Also do not stop or set maintenance mode if the node is a remote_node.

Updates on normal nodes should not be affected by this changes.
end
end

["corosync", "pacemaker"].each do |s|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HA docs only say to stop pacemaker, why are both here? Could you add a comment?

command += '|egrep -q "corosync|pacemaker"'
system("zypper #{command}")
# exit 0: found, 1 not found
node.run_state["found_ha_packages"] = $?.exitstatus ? true : false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with hound here, I don't think $? is a cryptic perlism.

But I think the logic is wrong - $?.exitstatus produces the numeric exit code, and grep will return 1 if no matches were found, so 1 ? true : false results in true, so the pacemaker services get stopped even when it's not necessary.

Also, if needs_update ends up being false you could probably skip the extra zypper call.

true
end
only_if do
is_cluster = node.role? "pacemaker-cluster-member"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With crowbar/crowbar-ha#187 this could probably include pacemaker-remote roles too, yes?

# HA packages are NOT gonna be updated
# And Node is part of a cluster
# And there is packages to update
execute "crm --wait node maintenance" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add the remote node name following what crowbar/crowbar-ha#187 did

@aspiers
Copy link
Member

aspiers commented Oct 5, 2017

Revisiting this due to Bug 1061834 – installing updates causes HA failures. @Itxaka Any chance you could amend the commit message to contain the full bsc URL? Thanks!

@Itxaka
Copy link
Member Author

Itxaka commented Oct 5, 2017

Ummm, seems that the branch for this PR migth have been removed somehow, so I would need to create a different PR with that branch covered somehow I guess?

@Itxaka
Copy link
Member Author

Itxaka commented Oct 5, 2017

Moved to #1353 @cmurphy comments are still unresolved on that PR

@Itxaka
Copy link
Member Author

Itxaka commented Aug 16, 2019

doesnt seem to be needed feel free to restore if so

@Itxaka Itxaka closed this Aug 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

8 participants