The switch has rebooted unexpectedly
Model:
AS5916-54XL, AS5916-54XKS.
Symptom:
If the switch has rebooted unexpectedly, follow the steps below to troubleshoot the hardware.
Diagnosis:
Step 1. Check the power source and environment
First, check how many units on the same rack power source have rebooted unexpectedly. If only one unit on the rack has rebooted and the other units have not, the cause is not a rack power outage or an abnormal external power supply.
Step 2. Diagnose the mainboard and PSUs
The power monitor shuts the switch down when it detects a failure on certain power rails.
Edgecore switches support a power monitor feature on the mainboard. The power monitor shuts down the switch when it detects a problem between the mainboard and the PSUs. If the switch is in the "shutdown" state, the PSUs might be at fault. In that case, send the whole unit (mainboard + PSUs) back for RMA repair service.
If the switch has rebooted unexpectedly (in other words, you can see the console output after booting), the event was not triggered by the power monitor. This means there is no problem between the mainboard and the PSUs.
Step 3. Check the BMC and CPU uptime on the NOS.
Compare the BMC uptime with the CPU uptime on the NOS. If the CPU uptime is less than the BMC uptime (for example, BMC uptime: 8,640,000 seconds; CPU uptime: 3,600 seconds), only the CPU has rebooted, so the switch was not rebooted by a power outage. In other words, the reboot was triggered by the NOS software. Please collect all information and escalate the problem to the NOS vendor.
How to check the BMC and CPU uptime on ONIE?
Use the "ipmitool" command to get the BMC uptime on ONIE.
# ipmitool raw 0x34 0x07
3e 55 01 00
EX: The uptime is returned in seconds as hexadecimal bytes. Read the bytes from right to left.
00 01 55 3e = 0x0001553e = 87358s (total time). This means the BMC has been on for 87358 seconds.
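The right-to-left reading above can be scripted. The following is a minimal sketch in POSIX shell, assuming the raw output is a list of space-separated hexadecimal bytes with the least significant byte first, as in the example. The function name decode_bmc_uptime is hypothetical and is not part of ipmitool.

```shell
# Hypothetical helper: convert raw bytes such as "3e 55 01 00" to seconds.
decode_bmc_uptime() {
    hex=""
    # Prepend each byte so the final string is in right-to-left order.
    for byte in $1; do
        hex="${byte}${hex}"
    done
    # printf accepts a 0x-prefixed constant and prints it as decimal.
    printf '%d\n' "0x${hex}"
}

# Example with the bytes shown above:
#   decode_bmc_uptime "3e 55 01 00"
# On a live system the argument could come from the command itself:
#   decode_bmc_uptime "$(ipmitool raw 0x34 0x07)"
```

Passing the bytes in as a string keeps the conversion separate from the ipmitool call, so it can be checked against a value that was copied from a console log.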
Use the "uptime" command to get the CPU uptime on ONIE. The command might differ between NOSes. Please refer to the NOS user guide.
# uptime
04:52:51 up 10 min, load average: 0.00, 0.00, 0.00
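Once both uptimes are available in seconds, the comparison from Step 3 can be sketched as below. The function name classify_reboot and the optional margin argument are assumptions made for illustration. The margin is only a way to ignore the small gap between BMC start and CPU start after a real power cycle, and the article does not specify a value for it.

```shell
# Hypothetical helper: apply the Step 3 rule to two uptimes in seconds.
#   $1 = BMC uptime, $2 = CPU uptime, $3 = optional margin (default 0)
classify_reboot() {
    bmc_seconds=$1
    cpu_seconds=$2
    margin=${3:-0}
    if [ $((${cpu_seconds} + ${margin})) -lt "${bmc_seconds}" ]; then
        # The CPU has been up for less time than the BMC: only the CPU rebooted.
        echo "cpu-only"
    else
        # Both restarted at about the same time, so the rule does not apply.
        echo "not-cpu-only"
    fi
}

# Example with the figures from Step 3:
#   classify_reboot 8640000 3600
# On a Linux-based system the CPU uptime in whole seconds can usually be
# read with (assumption: /proc/uptime is present):
#   cut -d. -f1 /proc/uptime
```

A "cpu-only" result corresponds to the case described in Step 3, where the problem should be escalated to the NOS vendor.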
Caution: If the ipmitool command is not available on the NOS, please consult the NOS vendor to get the BMC and CPU uptime.
Step 4. Check the CPU CPLD last reset reason register on the NOS.
Use the "ipmitool" command to get the CPU CPLD last reset reason register value.
# ipmitool raw 0x34 0x22 0x65 0x24
10
If the BMC is disabled by the NOS, try this i2c command to get the value.
# i2cget -f -y 0 0x65 0x24
0x10
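The two commands above can be combined so that the i2c path is tried only when the BMC path fails. This is a minimal sketch that assumes a failing ipmitool call returns a non-zero exit status when the BMC is disabled. The function name read_reset_reason is hypothetical.

```shell
# Hypothetical helper: read the register through the BMC, else through I2C.
read_reset_reason() {
    if command -v ipmitool >/dev/null 2>&1 &&
       value=$(ipmitool raw 0x34 0x22 0x65 0x24 2>/dev/null); then
        echo "$value"
    elif command -v i2cget >/dev/null 2>&1 &&
         value=$(i2cget -f -y 0 0x65 0x24 2>/dev/null); then
        echo "$value"
    else
        # Neither tool worked; the NOS vendor has to supply the value.
        echo "unavailable"
        return 1
    fi
}
```

The value is printed exactly as the tool returned it, so it may appear as "10" from ipmitool or as "0x10" from i2cget.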
Caution: If the ipmitool command is not available on the NOS, please consult the NOS vendor to get the CPU CPLD last reset reason register value.
The values of the CPU CPLD last reset reason register are as follows:
Value = 10: Normal reboot, including warm boot and cold boot (power cycle).
Value = 80: The switch was rebooted by the NOS. Bit 7 of CPU CPLD 0x24 was set to 1 by the NOS.
Value = 01: The switch was rebooted by the WDT (watchdog timeout) because the NOS crashed. First, check whether the WDT (bit 4 of CPU CPLD 0x65 0x03) is enabled on the NOS.
Value = 02: The switch was rebooted by the NOS. Bit 0 of CPU CPLD 0x04 was set to 0 by the NOS.
Value = 04: The switch was rebooted by the reset button. A user pressed the reset button, which caused the reboot.
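The values listed above can be turned into a small lookup for use in a collection script. This is a minimal sketch covering only the five listed values. The function name reset_reason is hypothetical, and the input is assumed to be a single byte such as "10" (ipmitool style) or "0x10" (i2cget style).

```shell
# Hypothetical helper: map a last reset reason value to its meaning.
reset_reason() {
    # Drop whitespace and an optional 0x prefix so both tool formats match.
    value=$(printf '%s' "$1" | tr -d ' \t\r\n' | sed 's/^0[xX]//')
    case "$value" in
        10) echo "normal reboot (warm boot or cold boot)" ;;
        80) echo "rebooted by NOS (bit 7 of CPU CPLD 0x24 set to 1)" ;;
        01) echo "rebooted by WDT (watchdog timeout)" ;;
        02) echo "rebooted by NOS (bit 0 of CPU CPLD 0x04 set to 0)" ;;
        04) echo "rebooted by reset button" ;;
        *)  echo "unknown value: $value" ;;
    esac
}

# Example:
#   reset_reason 0x10
```

Any value outside the list is reported as unknown and is not guessed at, since only these five values are documented here.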
Please get the value and report it to the NOS vendor and Edgecore Support.