Hardware Troubleshooting Power Issue

Overview

This SOP shows some of the steps required to troubleshoot and diagnose a power issue with one of our servers. The issue was tracked in this infra ticket: https://pagure.io/fedora-infrastructure/issue/11950

Symptoms and background:

  • The server is not responding at all and will not power on.

  • Reaching the management interface of RDU2-CC devices is a bit trickier than in IAD2. There is a private management VLAN there, but it is only reachable via cloud-noc-os01.rdu-cc.fedoraproject.org. The usual approach is to use the sshuttle tool to transparently forward traffic to devices on that network, for example: sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org

  • The devices are all in the 172.23.1.0/24 network. They are listed in ansible-private/docs/rdu-networks.txt; this host is 172.23.1.105.

  • The management password can be obtained from the Bitwarden Vault.

  • The logs show voltages outside the correct range.

  • The contact at RDU2-CC is James Gibson.

Contact Information

Owner

Fedora Infrastructure Team

Contact

#fedora-admin, sysadmin-main

Purpose

Document the steps used to troubleshoot and diagnose a power issue on a server.

Requirements

  • sshuttle to access the network at RDU2-CC

  • Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.

  • Access to ansible-private repo.

Troubleshooting Steps

Connect to the management VLAN for the RDU2-CC network:

This is only required because this server is not in the IAD2 datacenter. Use sshuttle to connect to the 172.23.1.0/24 management network. Run it directly from your laptop, not from batcave01:

sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org
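
As a minimal sketch of this step, the shell helpers below build the sshuttle command from the subnet and jump host named in this SOP and sanity-check that an address belongs to the management /24. The function names tunnel_cmd and in_mgmt_net are hypothetical, chosen for illustration; only the sshuttle invocation itself comes from this SOP.

```shell
#!/bin/sh
# Subnet and jump host as given in this SOP.
MGMT_NET="172.23.1.0/24"
MGMT_PREFIX="172.23.1"
JUMP_HOST="cloud-noc-os01.rdu-cc.fedoraproject.org"

# Print the sshuttle command so it can be reviewed before running it.
# (Hypothetical helper name; run the printed command from your laptop.)
tunnel_cmd() {
    printf 'sshuttle %s -r %s\n' "$MGMT_NET" "$JUMP_HOST"
}

# Return 0 if $1 looks like an address inside the management /24.
in_mgmt_net() {
    prefix=${1%.*}   # everything before the last dot
    last=${1##*.}    # the last octet
    [ "$prefix" = "$MGMT_PREFIX" ] || return 1
    case "$last" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric last octet
    esac
    [ "$last" -le 255 ]
}
```

For example, in_mgmt_net 172.23.1.105 succeeds, so that address should be reachable once the tunnel is up.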

SSH to batcave01 and retrieve the IP address for this machine

SSH to batcave01, go to the ansible-private repo, and read the IP address for this machine from docs/rdu-networks.txt.
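
The lookup can be sketched as a small grep wrapper. This assumes the networks file is plain text with a host name and its management IP on the same line; the real layout of rdu-networks.txt may differ, and the function name lookup_mgmt_entry is hypothetical.

```shell
#!/bin/sh
# Print every line of a networks file that mentions the given host.
# Assumption: host name and management IP appear on the same line.
lookup_mgmt_entry() {
    host=$1
    file=$2
    # -i: ignore case; -F: treat the host name as a fixed string, not a regex
    grep -i -F -- "$host" "$file"
}
```

On batcave01, from a checkout of ansible-private, this would be called as: lookup_mgmt_entry HOSTNAME docs/rdu-networks.txt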

Open the Management Console

With the IP address, visit https://IP in a browser to access the iDRAC management console, for example: https://172.23.1.105/
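
Before opening a browser, it can help to confirm that the console answers through the tunnel. The following is a sketch only: mgmt_url and mgmt_reachable are hypothetical helper names, and skipping certificate verification with -k assumes the iDRAC presents a certificate your laptop does not trust.

```shell
#!/bin/sh
# Build the management console URL for an IP address.
mgmt_url() {
    printf 'https://%s/\n' "$1"
}

# Return 0 if the console answers over HTTPS within 5 seconds.
# -k skips certificate verification (assumption, see above),
# -s silences progress output, -o /dev/null discards the page body.
mgmt_reachable() {
    curl -k -s -o /dev/null --max-time 5 "$(mgmt_url "$1")"
}
```

With the sshuttle tunnel running, mgmt_reachable 172.23.1.105 returning non-zero suggests the tunnel or the iDRAC itself is down, rather than a browser problem.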

Retrieve the username and password from Bitwarden

This is a prod machine, so use the username and password from Bitwarden to log in.

Once logged in, retrieve the service tag for this server

The service tag (XXXXXXX) is on the summary page of the management console. It is required to prove to Dell tech support that the server is under warranty.

Open a tech support ticket with Dell

Open a ticket with tech support chat: https://www.dell.com/support/incidents-online/en-ie/ContactUs/Dynamic?spestate

Collect logs from the server for Dell

See the following Dell article for how to collect logs for tech support: https://www.dell.com/support/kbdoc/en-us/000126308/export-a-supportassist-collection-via-idrac9

Dell requested firmware updates on the iDRAC and the server, along with a reseat of the OCP card

Contacted James Gibson internally and opened a ticket in ServiceNow, requesting that he arrange a trip to the datacenter to reseat the OCP card. The firmware update on the iDRAC itself succeeded, but the firmware on the server could not be updated because the server will not power on.

OCP reseat carried out

James visited the RDU2 datacenter and carried out this work. Reseating the OCP card had no effect. He then troubleshot further: with one PSU removed, the server was still stuck in a reboot cycle, but after reattaching it and removing the other PSU, the server booted fine. This suggested that we had identified a faulty PSU.

Request to reupload logs

Dell's first request was to generate the TSR logs as a zip file and forward them to Dell. Because the TSR may be too big to attach to an email, upload it at https://tdm.dell.com/file-upload instead. The form requires a service request number, so be sure to ask the Dell technician for one.
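
To decide between email and the upload site, the size of the TSR zip can be checked first. This is a sketch under stated assumptions: the helper name tsr_needs_upload is hypothetical, and the 10 MiB default is an assumed attachment limit, not a figure from Dell or this SOP.

```shell
#!/bin/sh
# Return 0 if the file is larger than the limit (in bytes).
# The 10 MiB default is an assumed limit; adjust it for your mail system.
tsr_needs_upload() {
    file=$1
    limit=${2:-10485760}
    # Arithmetic expansion strips any padding that wc adds around the count.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$limit" ]
}
```

If tsr_needs_upload succeeds for the exported zip, use the upload form rather than attaching the file to an email.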

Swap PSU1 with PSU2

Dell requested the following check be carried out: "Please swap PSU1 with PSU2 and check if the server will power up. If the issue persists, test PSU2 in slot 1 and confirm. Once completed, collect logs and share them so we can proceed with action."

Both PSUs seem functional

James Gibson swapped the PSUs in this server on Friday, and the server is powering on as normal. So it appears both PSUs are in fact working; perhaps something is wrong with the chassis the units go into? Informed Dell and waiting on an update to see what to troubleshoot next.

Dell suggests plugging the hardware into a different power point

Dell replied: "Since both ports have been tested, I’m thinking this could be an external issue or a configuration issue. Are the PSUs set to redundant? When plugged in at the same time, are they being plugged into the same outlet/UPS? If so, can we test by plugging them into different outlets/UPS?"

This appears to have resolved our issue.

Forwarded this information to James Gibson to see what he thought. The power was moved to different power points with the second PSU reattached, and the server now appears to be working correctly. Closed the ticket with Dell.