SOP Add an OCP4 Node to an Existing Cluster
A Red Hat OpenShift Container Platform 4.x cluster was installed some time ago (1+ days ago), and additional worker nodes are required to increase the capacity of the cluster.
Steps
- Add the new nodes to the Ansible inventory file in the appropriate group, e.g.:

    [ocp_workers]
    worker01.ocp.iad2.fedoraproject.org
    worker02.ocp.iad2.fedoraproject.org
    worker03.ocp.iad2.fedoraproject.org

    [ocp_workers_stg]
    worker01.ocp.stg.iad2.fedoraproject.org
    worker02.ocp.stg.iad2.fedoraproject.org
    worker03.ocp.stg.iad2.fedoraproject.org
    worker04.ocp.stg.iad2.fedoraproject.org
    worker05.ocp.stg.iad2.fedoraproject.org
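  Once the inventory is updated, a quick sanity check can confirm the new hosts land in the expected groups. A minimal sketch, assuming it is run from the root of the ansible repository; the exact invocation may differ:

    # Illustrative: show the hosts Ansible resolves for each worker group
    ansible-inventory -i inventory --graph ocp_workers
    ansible-inventory -i inventory --graph ocp_workers_stg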
- Add the new host_vars for each new host being added; see the following examples for VM vs. bare metal hosts:

    # control plane VM
    inventory/host_vars/ocp01.ocp.iad2.fedoraproject.org

    # compute bare metal
    inventory/host_vars/worker01.ocp.iad2.fedoraproject.org
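  A practical way to create the new file is to copy the host_vars of an existing host of the same type and adjust the host-specific values. The sketch below is illustrative; worker04 stands in for a hypothetical new node:

    # Hypothetical example: start from an existing bare metal worker's host_vars
    # and edit the host-specific values (e.g. IP address, MAC address)
    cp inventory/host_vars/worker01.ocp.iad2.fedoraproject.org \
       inventory/host_vars/worker04.ocp.iad2.fedoraproject.org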
- If the nodes are compute or worker nodes, they must also be added to the following group_vars: proxies for prod, proxies_stg for staging.

    inventory/group_vars/proxies:     ocp_nodes:
    inventory/group_vars/proxies_stg: ocp_nodes_stg:
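  These variables hold the lists of OCP worker hosts consumed by the proxy configuration. A minimal sketch of an entry, assuming the variable is a plain YAML list of hostnames (check the existing file for the exact format):

    # inventory/group_vars/proxies (illustrative excerpt)
    ocp_nodes:
      - worker01.ocp.iad2.fedoraproject.org
      - worker02.ocp.iad2.fedoraproject.org
      - worker03.ocp.iad2.fedoraproject.org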
- Changes must be made to the roles/dhcp_server/files/dhcpd.conf.noc01.iad2.fedoraproject.org file so that DHCP gives the node a fixed IP address based on its MAC address and tells the node to reach out to the next-server, where it can find the UEFI boot configuration:

    host worker01-ocp {                       # UPDATE THIS
      hardware ethernet 68:05:CA:CE:A3:C9;    # UPDATE THIS
      fixed-address 10.3.163.123;             # UPDATE THIS
      filename "uefi/grubx64.efi";
      next-server 10.3.163.10;
      option routers 10.3.163.254;
      option subnet-mask 255.255.255.0;
    }
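  After the dhcp_server playbook (run in a later step) deploys this file to noc01, the configuration can be validated there before booting the node. An illustrative check, assuming the deployed path is the standard /etc/dhcp/dhcpd.conf:

    # Illustrative: on noc01, test the dhcpd configuration for syntax errors
    sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf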
- Changes must be made to DNS. To do this, one must be a member of sysadmin-main; if you are not, send a patch request to the Fedora Infra mailing list for review and it will be merged by the sysadmin-main members. See the following examples for the worker01.ocp nodes in production and staging:

    master/163.3.10.in-addr.arpa:      123          IN PTR  worker01.ocp.iad2.fedoraproject.org.
    master/166.3.10.in-addr.arpa:      118          IN PTR  worker01.ocp.stg.iad2.fedoraproject.org.
    master/iad2.fedoraproject.org:     worker01.ocp IN A    10.3.163.123
    master/stg.iad2.fedoraproject.org: worker01.ocp IN A    10.3.166.118
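  Once the DNS change is merged and deployed, the forward and reverse records can be verified from any host in IAD2. An illustrative check for the production example above:

    # Forward record
    dig +short worker01.ocp.iad2.fedoraproject.org
    # Reverse (PTR) record
    dig +short -x 10.3.163.123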
- Run the playbooks to deploy the DHCP/TFTP changes and to update the haproxy config to monitor the new nodes and add them to the load balancer:

    sudo rbac-playbook groups/noc.yml -t "tftp_server,dhcp_server"
    sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd'
- DHCP instructs the node to reach out to the next-server when it is handed an IP address. The next-server runs a tftp server which contains the kernel, the initramfs, and the UEFI boot configuration uefi/grub.cfg. Contained in this grub.cfg are the following entries relating to the OCP4 nodes:

    menuentry 'RHCOS 4.8 worker staging' {
      linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.166.50/rhcos/worker.ign
      initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
    }

    menuentry 'RHCOS 4.8 worker production' {
      linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.163.65/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.163.65/rhcos/worker.ign
      initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
    }
  When a node is booted up and reads this UEFI boot configuration, the menu option must be manually selected:

  - To add a node to the staging cluster, choose: RHCOS 4.8 worker staging
  - To add a node to the production cluster, choose: RHCOS 4.8 worker production
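  To confirm the boot menu a node will actually see, grub.cfg can be fetched directly from the tftp server. An illustrative check, using the next-server address from the DHCP entry above (assumes curl is built with tftp support):

    # Illustrative: fetch the UEFI boot menu from the tftp server
    curl -s tftp://10.3.163.10/uefi/grub.cfg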
- Connect to the os-control01 node which corresponds to the ENV the new node is being added to. Verify that you are authenticated correctly to the OpenShift cluster as the system:admin user:

    oc whoami
    system:admin
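  If oc reports a different user or no session, authentication is typically established via a kubeconfig kept on os-control01. A hypothetical sketch; the actual kubeconfig path is environment-specific:

    # Hypothetical: point oc at the cluster's admin kubeconfig, then re-check
    export KUBECONFIG=/path/to/kubeconfig
    oc whoami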
- Contained within the UEFI boot menu configuration are links to the web server running on the os-control01 host specific to the ENV. This server should only run when we wish to reinstall an existing node or install a new node. Start it manually using systemctl:

    systemctl start httpd.service
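  A quick way to confirm the web server is serving the RHCOS artifacts referenced in grub.cfg is to request one of them; once the installation has finished, the service can be stopped again. An illustrative check using the production ignition URL from the boot menu above:

    # Illustrative: confirm the worker ignition config is reachable
    curl -I http://10.3.163.65/rhcos/worker.ign

    # Stop the server again once the new node has been installed
    systemctl stop httpd.service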
- Boot up the node and select the appropriate menu entry to install it into the correct cluster. Wait until the node displays an SSH login prompt with the node's name. It may reboot several times during the process.
- As the new nodes are provisioned, they will attempt to join the cluster. They must first be accepted. From the os-control01 node, run the following:

    # List the CSRs. A Pending status means a worker/compute node is attempting
    # to join the cluster and must be approved.
    oc get csr

    # Accept all pending node CSRs (one-liner)
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve

  This process usually needs to be repeated twice for each new node.
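  After the second round of CSRs has been approved, the node should register with the cluster and eventually report Ready. An illustrative check from os-control01:

    # Watch for remaining pending CSRs and confirm the new node reaches Ready
    oc get csr
    oc get nodes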
To see more information about adding new worker/compute nodes to a user-provisioned infrastructure (UPI) based OCP4 cluster, see the detailed steps at [1], [2].