Hardware Troubleshooting Power Issue
Overview
This SOP shows some of the steps required to troubleshoot and diagnose a power issue with one of our servers. The work was tracked in this Infra ticket: https://pagure.io/fedora-infrastructure/issue/11950
Symptoms:
- This server is not responding at all, and will not power on.
- Getting to the management interfaces of RDU2-CC devices is a bit trickier than in IAD2. We have a private management VLAN there, but it’s only reachable via cloud-noc-os01.rdu-cc.fedoraproject.org. I usually use sshuttle to transparently forward my traffic to devices on that network. That looks something like: sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org
- The devices are all in the 172.23.1.0/24 network. There’s a list of them in ansible-private/docs/rdu-networks.txt, but this host is 172.23.1.105.
- The management password can be obtained from the Bitwarden Vault.
- Logs show issues with voltages not being in the correct range.
- At RDU2-CC we have a contact: James Gibson.
Contact Information
- Owner: Fedora Infrastructure Team
- Contact: #fedora-admin, sysadmin-main
- Purpose: Provide basic orientation and introduction to the sysadmin group
Requirements
- sshuttle to access the network at RDU2-CC.
- Bitwarden Vault access. Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.
- Access to the ansible-private repo.
Troubleshooting Steps
This step is only required because this server is not in the IAD2 datacenter. Use sshuttle to connect to the 172.23.1.0/24 management network (run it directly from your laptop, not from batcave01): sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org
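As a rough illustration of the step above, this Python sketch checks that a management address falls inside 172.23.1.0/24 and builds the matching sshuttle command line. The helper names are made up for this example. Only the network, the jump host and the sshuttle arguments come from this SOP.

```python
import ipaddress
import shlex

# Values taken from this SOP.
MGMT_NETWORK = ipaddress.ip_network("172.23.1.0/24")
JUMP_HOST = "cloud-noc-os01.rdu-cc.fedoraproject.org"


def is_rdu_cc_mgmt_address(address: str) -> bool:
    """Return True if the address is inside the RDU2-CC management network."""
    return ipaddress.ip_address(address) in MGMT_NETWORK


def sshuttle_command() -> list[str]:
    """Build the sshuttle argument list shown in this SOP (hypothetical helper)."""
    return ["sshuttle", str(MGMT_NETWORK), "-r", JUMP_HOST]


if __name__ == "__main__":
    # Check the host from this SOP, then print the command to run from your laptop.
    print(is_rdu_cc_mgmt_address("172.23.1.105"))
    print(shlex.join(sshuttle_command()))
```

If the address is not in the management range, the tunnel will not help and the device is probably reachable some other way.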
SSH to batcave01, open the ansible-private repo, and read the IP address for this machine from docs/rdu-networks.txt.
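The layout of docs/rdu-networks.txt is not described here, so the following is only a sketch under assumptions. It scans whatever text it is given for lines mentioning a search term and pulls out any IPv4 addresses that fall in the 172.23.1.0/24 management range. The function name, the search term and the sample text in the usage note are hypothetical.

```python
import ipaddress
import re

# Management network from this SOP.
MGMT_NETWORK = ipaddress.ip_network("172.23.1.0/24")
# Loose pattern for dotted-quad candidates; ipaddress validates them below.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_mgmt_addresses(text: str, search_term: str) -> list[str]:
    """Return management-network addresses found on lines containing search_term."""
    matches = []
    for line in text.splitlines():
        if search_term.lower() not in line.lower():
            continue
        for candidate in IPV4_PATTERN.findall(line):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                # Looked like an address but is not valid (for example 999.1.1.1).
                continue
            if address in MGMT_NETWORK:
                matches.append(str(address))
    return matches
```

For example, reading the file with open("docs/rdu-networks.txt").read() and passing part of the hostname as the search term should list candidate management addresses, which you would still confirm by eye.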
With the IP address, visit https://IP in a browser to access the iDRAC management console, for example: https://172.23.1.105/
This is a prod machine, so use the username and password from Bitwarden to log in.
Get the service tag (XXXXXXX); it’s on the summary page of the management console. Dell tech support requires it to confirm that the server is under warranty.
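If you would rather not click through the web console, the service tag can usually also be read over the iDRAC Redfish API. The sketch below assumes the system resource lives at /redfish/v1/Systems/System.Embedded.1 and that the SKU property carries the service tag, which is how Dell's Redfish implementation is generally documented. Check the result against the summary page. The function names are made up for this example.

```python
import base64
import json
import ssl
import urllib.request

# Assumed resource path on the iDRAC Redfish service; confirm for your firmware.
SYSTEM_PATH = "/redfish/v1/Systems/System.Embedded.1"


def extract_service_tag(system_resource: dict) -> str:
    """Return the SKU property, which is assumed to hold the Dell service tag."""
    return system_resource.get("SKU", "")


def fetch_system_resource(host: str, username: str, password: str) -> dict:
    """GET the system resource from the iDRAC using HTTP basic authentication."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{host}{SYSTEM_PATH}",
        headers={"Authorization": f"Basic {credentials}", "Accept": "application/json"},
    )
    # iDRACs commonly present a self-signed certificate, so verification is
    # disabled here. Only do this over the private management network.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return json.load(response)
```

With the sshuttle tunnel up, extract_service_tag(fetch_system_resource("172.23.1.105", user, password)) should return the tag, where user and password are the credentials from Bitwarden.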
Open a ticket with tech support chat: https://www.dell.com/support/incidents-online/en-ie/ContactUs/Dynamic?spestate
For how to collect logs for tech support, see: https://www.dell.com/support/kbdoc/en-us/000126308/export-a-supportassist-collection-via-idrac9
Contacted James Gibson internally and opened a ticket in ServiceNow, requesting that he arrange a trip to the datacenter to reseat the OCP card. Updated the firmware on the iDRAC itself successfully, but could not update the firmware on the server because it won’t turn on.
James finally managed to get out to the RDU2 datacenter and carry out this work. Reseating the OCP card had no effect. However, he troubleshot further: with one PSU removed the server was still stuck in a reboot cycle, but after reattaching it and removing the other PSU, the server booted fine. So we think we have identified a faulty PSU.
The first request from Dell was to generate the TSR logs (a zip file) and forward them to Dell. Because the TSR might be too big to attach to an email, upload it at https://tdm.dell.com/file-upload instead. The form requires a service request number, so be sure to ask the Dell technician for one.
Dell requested the following check be carried out: Please swap PSU1 with PSU2 and check if the server will power up. If the issue persists, test PSU2 in slot 1 and confirm. Once completed, collect logs and share so we can proceed with action.
James Gibson swapped the PSUs in this server on Friday, and the server is powering on as normal. So it appears both PSUs are in fact working; perhaps something is wrong with the chassis the units are going into? Informed Dell and waiting on an update to see what to troubleshoot next.
Since both ports have been tested, I’m thinking this could be an external issue or a configuration issue. Are the PSUs set to redundant? When plugged in at the same time, are they plugged into the same outlet/UPS? If so, can we test by plugging them into different outlets/UPSes?
Forwarded this information to James Gibson to see what he thinks. We have moved the power to different power points, with the second PSU reattached, and the server appears to be working correctly now. Closed the ticket with Dell.
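After a change like this, it can be worth confirming that the iDRAC reports both PSUs and the voltage sensors as healthy. The sketch below takes a payload in the shape of the Redfish Power resource, which iDRAC is assumed to serve at /redfish/v1/Chassis/System.Embedded.1/Power, and lists anything whose health is not OK. The function name and the sample payload are made up for illustration. Fetching the payload is left out.

```python
# Assumed resource path for PSU and voltage data; confirm for your firmware.
POWER_PATH = "/redfish/v1/Chassis/System.Embedded.1/Power"


def unhealthy_power_items(power_resource: dict) -> list[str]:
    """List power supplies and voltage sensors whose Redfish health is not OK."""
    problems = []
    for section in ("PowerSupplies", "Voltages"):
        for item in power_resource.get(section, []):
            status = item.get("Status") or {}
            # Empty slots and missing sensors report State "Absent"; skip them.
            if status.get("State") == "Absent":
                continue
            if status.get("Health") != "OK":
                problems.append(f"{section}: {item.get('Name')} is {status.get('Health')}")
    return problems


if __name__ == "__main__":
    # Made-up payload shaped like a Redfish Power resource.
    sample = {
        "PowerSupplies": [
            {"Name": "PS1 Status", "Status": {"Health": "OK", "State": "Enabled"}},
            {"Name": "PS2 Status", "Status": {"Health": "Critical", "State": "Enabled"}},
        ],
        "Voltages": [
            {"Name": "PS1 Voltage 1", "Status": {"Health": "OK", "State": "Enabled"}},
        ],
    }
    for problem in unhealthy_power_items(sample):
        print(problem)
```

An empty result only means the iDRAC sees nothing wrong. As this case showed, the PSUs themselves can be fine while the outlet or UPS feeding them is the real problem.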