Mass Upgrade Infrastructure SOP

Every once in a while, we need to apply mass upgrades to our servers for security fixes and other updates.

Contact Information

Owner

Fedora Infrastructure Team

Contact

Fedora Infrastructure channel on Matrix (chat.fedoraproject.org), Fedora NOC channel on Matrix (chat.fedoraproject.org), sysadmin-main, https://docs.fedoraproject.org/en-US/infra/

Location

All over the world.

Servers

all

Purpose

Apply kernel/other upgrades to all of our servers

Preparation

Mass updates are usually applied every few months, or sooner if some critical bugs have been fixed. Mass updates are done outside of freeze windows to avoid causing any problems for Fedora releases.

The following items are all done before the actual mass update:

  • Plan an outage window (or windows) outside of a freeze.

  • File an outage ticket in the fedora-infrastructure tracker, using the outage template. This should describe the exact date/time, the duration, and what is included.

  • Get the outage ticket reviewed by someone else to confirm there are no mistakes in it.

  • Send an outage announcement to the infrastructure and devel-announce lists (for outages that affect contributors only) or to infrastructure, devel-announce, and announce (for outages that affect all users).

  • Add a 'planned' outage to fedorastatus. This will show the planned outage there for higher visibility.

  • Set up a HackMD or other shared document that lists all the virthosts and bare metal hosts that need rebooting, organized per day. This is used to coordinate which admin is handling which server(s).

Typically updates/reboots are done in three separate steps:

  • all staging hosts

  • non outage causing production hosts

  • outage causing production hosts

It has been somewhat common to use Mon/Tues/Wed for these different steps, but as long as the production outage is not on a Friday it is fine to fit it around people's schedules.

Staging

Any updates that can be tested in staging or a pre-production environment should be tested there first. This includes new kernels, updates to core database applications and libraries, web applications, and so on. This is typically done a couple of days before the actual outage, or even the day before. Too far in advance and things may have changed again, so it's important to do this just before the production updates; but for a significant update, doing it only the day before can make it difficult to coordinate fixes for any non-trivial problems.
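
For example, updates can be pushed to the staging hosts from batcave01 with something like the following (this assumes a 'staging' group exists in the ansible inventory; adjust the group name to whatever the inventory actually uses):

sudo ansible staging -m shell -a 'yum clean all; yum update -y'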

Non outage causing hosts

Some hosts can be safely updated/rebooted without an outage because they have multiple machines behind a load balancer, are not visible to end users, or for other reasons. These updates are typically done outside the outage window, before the outage itself. These hosts include proxies and a number of virthosts whose VMs meet these criteria.
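
To double-check which VMs a given virthost carries before updating it outside the outage, you can list them from batcave01 (the virthost name below is a placeholder):

sudo ansible <virthost fqdn> -m shell -a 'virsh list --all'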

Special Considerations

While this may not be a complete list, here are some special things that must be taken into account before rebooting certain systems:

Post reboot action

The following machines require post-boot actions (mostly entering passphrases). Make sure admins who have the passphrases are on hand for the reboot:

  • backup01 (ssh agent passphrase for backup ssh key)

  • sign-vault01 (NSS passphrase for sigul service and luks passphrase)

  • sign-bridge01 (run: 'sigul_bridge -dvv' after it comes back up, no passphrase needed)

  • autosign01 (NSS passphrase for robosignatory service and luks passphrase)

  • buildvm-s390x-15/16/16 (needs the sshfs mount of the koji volume redone)

  • batcave01 (ssh agent passphrase for ansible ssh key)

  • notifs-backend01 (raise the rabbitmq consumer timeout, then restart fmn-backend@1 and the 24 fmn-worker units; the exact commands are laid out after this list)
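
For notifs-backend01, the commands to run on that host after it comes back up are (the first raises the rabbitmq consumer timeout, the second restarts the fmn backend, and the loop restarts each of the 24 fmn workers):

rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
systemctl restart fmn-backend@1
for i in $(seq 1 24); do echo $i; systemctl restart fmn-worker@$i | cat; done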

Bastion01, Bastion02, and the openvpn server

If a reboot of bastion01 is done during an outage, nothing needs to be changed here. However, if bastion01 will be down for an extended period of time, openvpn can be switched to bastion02 by stopping openvpn-server@openvpn on bastion01 and starting it on bastion02.

on bastion01: 'systemctl stop openvpn-server@openvpn'
on bastion02: 'systemctl start openvpn-server@openvpn'

The process can be reversed after the other host is back. Clients try bastion01 first, then bastion02 if it's down. It's important to make sure all the clients are using one machine or the other, because if they are split across both, routing between machines may be confused.
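
To confirm which side is currently serving VPN clients, you can check the service state on both hosts:

on bastion01: 'systemctl status openvpn-server@openvpn'
on bastion02: 'systemctl status openvpn-server@openvpn'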

NOTE: Your fellow admins will likely be using bastion01 to access batcave01 and run the update playbooks, so rebooting either of these machines needs extreme coordination so that people aren’t in the middle of doing other things.

batcave01

batcave01 is our ansible control host. It's where you run the playbooks mentioned in this SOP. However, it too needs updating and rebooting, and you cannot use the vhost_reboot playbook for it, since that would be rebooting its own virthost. For this host you should go to the virthost and 'virsh shutdown' all the other VMs, then 'virsh shutdown' batcave01, then reboot the virthost manually.
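
A sketch of that sequence, run on batcave01's virthost (the VM name is a placeholder; 'virsh list' shows the actual ones):

virsh list
virsh shutdown <other-vm>   (repeat for each VM other than batcave01)
virsh shutdown batcave01
reboot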


noc01 / dhcp server

noc01 is our dhcp server. Unfortunately, when rebooting the vmhost that contains the noc01 VM, that vmhost has no dhcp server to answer it when it boots and tries to configure its network to talk to the tang server. To work around this you can run a simple dhcpd on batcave01: start it there, let the vmhost with noc01 come up, and then stop it. Ideally we would set up another dhcp host at some point to avoid this issue.

on batcave01: 'systemctl start dhcpd'

Remember to stop it ('systemctl stop dhcpd' on batcave01) after the vmhost comes back up.

COPR / OpenQA

All of these hosts are generally updated outside of the "mass" update so that the people who monitor those machines can be present.

Special package management directives

Sometimes we need to exclude something from being updated. This can be done with the package_excludes variable. Set it and the playbooks doing updates will exclude the listed items.

This variable is set in ansible/host_vars or ansible/group_vars for the host or group.
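
A minimal sketch of what this looks like, using a hypothetical host (check existing host_vars/group_vars entries for whether the value is a list or a space-separated string):

in ansible/host_vars/somehost.fedoraproject.org:
package_excludes: "kernel* grub2*"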

Update Leader

Each update should have a Leader appointed. This person will be in charge of doing any read-write operations and delegating tasks to others. If you aren't specifically asked by the Leader to reboot or change something, please don't. The Leader will assign machine groups to reboot, or ask specific people to look at machines that didn't come back up from reboot or aren't working right after reboot. It's important to avoid multiple people operating on a single machine in a read-write manner and interfering with each other's changes.

Usually for a mass update/reboot there will be a HackMD or similar document that tracks which machines have already been rebooted and who is working on which one. Please check with the Leader for a link to this document.

Updates and Reboots via playbook

People should mostly use the vhost_update_reboot.yml playbook, which runs both vhost_update.yml to apply updates and vhost_reboot.yml to reboot the host and its VMs. This can be called via rbac-playbook so that non-sysadmin-main people can help (they need to be in the sysadmin-updates group).

For hosts handled outside the outage, you probably want to use these to make sure updates are applied before rebooting (applying updates can take a lot of time, especially when not done in parallel). However, once updates are applied globally before the outage, you will still want to use the update_reboot playbook (the update part should then be very fast).
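
For example, something like the following from batcave01 (the virthost fqdn is a placeholder, and how the target is selected, a limit as shown here or an extra variable, should be checked against the playbook itself):

sudo rbac-playbook vhost_update_reboot.yml -l <virthost fqdn>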

By far the most common problem we have is that machines don’t come back after a reboot. This is usually a firmware booting problem, or a luks problem. Both can be solved by logging into the console for the vmhost and seeing what the error is and fixing it manually.

To monitor a machine you are updating/rebooting, you can run the following in another window (also from batcave01): 'mtr --displaymode 1 -i 4 <host>'

Also read the "how to restart a server" docs.

Checking hosts have updated

Additionally, you should use the updates-uptime-cmd.py Python script on batcave01 to see which machines have updates available and/or need to be rebooted. You'll need to run its update sub-command before viewing the usual information, to get the latest data.
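
For example, to refresh the data first (this assumes the script is in your path on batcave01; only the update sub-command is named in this SOP, so check the script's help for the reporting sub-commands):

updates-uptime-cmd.py update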

There are older playbooks check-for-nonvirt-updates.yml and check-for-updates.yml, but the above script should be easier to use and give clearer results.

Doing the upgrade

If possible, system upgrades should be done in advance of the reboot (with relevant testing of new packages on staging). To do the upgrades, make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages (see the Infrastructure Yum Repo SOP).

Before the outage, ansible can be used to apply all updates to the hosts, or to just the staging hosts before those are done. Something like:

ansible hostlist -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'

Aftermath

  1. Make sure that everything’s running fine

  2. Check nagios for alerts and clear them all

  3. Re-enable nagios notifications after they are cleared.

  4. Make sure to perform any manual post-boot setup (such as entering passphrases for encrypted volumes)

  5. Consider running check-for-updates or check-for-nonvirt-updates to confirm that all hosts are updated.

  6. Close fedorastatus outage

  7. Close outage ticket.

Non virthost reboots

If you need to reboot specific hosts and make sure they recover, consider using:

sudo ansible -m reboot hostname