
Thursday, March 6, 2014

Should an N7K-M132XP-12/L module be RMAed or not?

What is the step-by-step process to determine whether an N7K-M132XP-12 or N7K-M132XP-12L module in a Nexus 7000 switch needs to be RMAed?

Scenario 1: N7K-M132XP-12 or N7K-M132XP-12L “TestPortLoopback” diagnostic test failed

Symptoms:

The diagnostic test fails and the following syslog message is observed:
2011 Dec 10 11:47:11 %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:18 TestPortLoopback failed 10 consecutive times. Faulty module:Module 18 affected ports:23 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC

N7K2# show diagnostic result module 18
Current bootup diagnostic level: complete
Module 18: 10 Gbps Ethernet Module

Test results: (. = Pass, F = Fail, I = Incomplete,
U = Untested, A = Abort, E = Error disabled)

1) EOBCPortLoopback--------------> .
2) ASICRegisterCheck-------------> E
3) PrimaryBootROM----------------> .
4) SecondaryBootROM--------------> .
5) PortLoopback:

Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
-----------------------------------------------------
U U I I I I I I U U I . I . I .

Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
-----------------------------------------------------
U U . . U U E . U U I I I I I I

6) RewriteEngineLoopback:

Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
-----------------------------------------------------
. . . . . . . . . . . . . . . .

Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
-----------------------------------------------------
. . . . . . . . . . . . . . . .

N7K2# show module
Mod Ports Module-Type Model Status
--- ----- -------------------------------- ------------------ ------------
16 32 10 Gbps Ethernet Module N7K-M132XP-12 ok
17 32 10 Gbps Ethernet Module N7K-M132XP-12 ok
18 32 10 Gbps Ethernet Module N7K-M132XP-12 ok

Mod Online Diag Status
--- ------------------
16 Fail
17 Pass
18 Fail
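
The per-port rows in the "show diagnostic result" output above can be read mechanically. The following is a minimal sketch, assuming the "Port ..." header row and the result row have been copied out of the output as two strings; the function names are hypothetical and are not part of NX-OS.

```python
from typing import Dict


def parse_port_results(port_line: str, result_line: str) -> Dict[int, str]:
    """Pair each port number with its result code (., F, I, U, A, E)."""
    # The first token of the header row is the literal word "Port".
    ports = [int(tok) for tok in port_line.split()[1:]]
    results = result_line.split()
    if len(ports) != len(results):
        raise ValueError("port row and result row have different lengths")
    return dict(zip(ports, results))


def failed_ports(results: Dict[int, str]) -> Dict[int, str]:
    """Keep only ports marked F (Fail) or E (Error disabled)."""
    return {port: code for port, code in results.items() if code in ("F", "E")}


ports_17_32 = parse_port_results(
    "Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32",
    "U U . . U U E . U U I I I I I I",
)
print(failed_ports(ports_17_32))
```

With the PortLoopback rows for ports 17-32 shown above, only port 23 is reported, which matches the port named in the syslog message.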

Checklist:

This is likely due to CSCtn81109 or CSCti95293.

To verify whether you have hit these bugs, perform the following checks:

(1) Check whether the running NX-OS version matches the version in which the bugs were found (per the DDTS). Both bugs are fixed and verified in 5.2(4) and later releases.

(2) Determine when the diag message was observed. “show log” gives the timestamp of the diag test failure; then check whether any CPU issue occurred around the same time. An overwhelmed CPU can sometimes cause the diag port loopback test to fail. This is not conclusive, but it is a good data point to collect.

(3) Collect additional logs (the examples use module 1; substitute the affected module number):
  1. “tac-pac bootflash:tech.txt”  <<== This command may take several minutes to complete
  2. “show tech module 1”
  3. “show tech gold”
  4. “show hardware internal errors module 1 | diff -s” <<== run this command a few times

(4) Clear the diagnostic results and re-run the tests while the CPU is not overwhelmed:
  1. # show diagnostic result module 1
  2. # diagnostic clear result module all
  3. (config)# no diagnostic monitor module 1 test 5 (check the test number using "show diagnostic content module X")
  4. (config)# diagnostic monitor module 1 test 5
  5. # diagnostic start module 1 test 5
  6. # show diagnostic result module 1 test 5 (the test may take a few minutes to complete)
  7. # show module internal exceptionlog module 1
  8. # show module internal event-history errors
  9. # show hardware internal errors module 1
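
For check (1), the version comparison can be sketched as follows. This is an illustration only: it assumes version strings of the simple form "major.minor(maintenance)" with an optional letter suffix, such as 5.2(4) or 5.2(3a), and it treats a numerically higher version as "later". Other NX-OS version formats are rejected rather than guessed at. The function names are hypothetical.

```python
import re
from typing import Tuple

# Release in which both bugs are reported fixed, per the checklist above.
FIXED_IN = "5.2(4)"

VERSION_RE = re.compile(r"(\d+)\.(\d+)\((\d+)([a-z]?)\)")


def parse_nxos_version(version: str) -> Tuple[int, int, int, str]:
    """Split e.g. '5.2(3a)' into (5, 2, 3, 'a')."""
    match = VERSION_RE.fullmatch(version.strip())
    if match is None:
        raise ValueError("unsupported version format: %r" % version)
    major, minor, maint, letter = match.groups()
    return int(major), int(minor), int(maint), letter


def has_fix(running: str, fixed_in: str = FIXED_IN) -> bool:
    """True if the running version is the fixed release or a later one."""
    return parse_nxos_version(running) >= parse_nxos_version(fixed_in)


for candidate in ("5.1(3)", "5.2(3a)", "5.2(4)", "6.0(2)"):
    print(candidate, has_fix(candidate))
```

A result of False only means the running release predates the fix; the remaining checks are still needed to confirm the match.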

If the module recovers and the diag test passes, the failure was most likely due to the DDTSs mentioned above, since an actual hardware failure should fail the diag consistently.

If the module fails the diag consistently, contact Cisco TAC for further analysis.

Scenario 2: M1 modules get reset and/or links flap

Symptoms:

2012  Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$  %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed  10 consecutive times. Faulty module: affected  ports:3,5,7,11,13,15,19,21,23,27,29,31 Error:Loopback test failed.  Packets lost on the LC at the MAC ASIC

2012 Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$  %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed  10 consecutive times. Faulty module: affected  ports:4,6,8,12,14,16,20,22,24,26,28,30,32 Error:Loopback test failed.  Packets lost on the LC at the Queueing engine ASIC
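
When many of these messages need to be compared, the fields can be pulled out with a regular expression. The sketch below is written against the message text shown above only; the pattern, function name, and dictionary keys are assumptions for illustration, and other NX-OS releases may format the message differently.

```python
import re
from typing import Any, Dict, Optional

FAIL_RE = re.compile(
    r"%DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL:\s*"
    r"Module:(?P<module>\d+)\s+TestPortLoopback failed\s+"
    r"(?P<count>\d+)\s+consecutive times\.\s*"
    r"Faulty module:\s*(?P<faulty>.*?)\s*affected\s+ports:(?P<ports>[\d,]+)\s+"
    r"Error:(?P<error>.*)$"
)
ASIC_RE = re.compile(r"at the (?P<asic>.+?) ASIC")


def parse_portloopback_fail(line: str) -> Optional[Dict[str, Any]]:
    """Return the fields of a PORTLOOPBACK_TEST_FAIL message, or None."""
    match = FAIL_RE.search(line)
    if match is None:
        return None
    # Collapse the repeated spaces that appear in some log exports.
    error = " ".join(match.group("error").split())
    asic = ASIC_RE.search(error)
    return {
        "module": int(match.group("module")),
        "consecutive_failures": int(match.group("count")),
        "faulty_module": match.group("faulty"),  # may be empty, as in Scenario 2
        "ports": [int(p) for p in match.group("ports").split(",") if p],
        "error": error,
        "asic": asic.group("asic") if asic else None,
    }
```

Applied to the first message above, this yields module 3, twelve affected ports, and "MAC" as the ASIC named in the error text.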

Checklist:

This is highly likely due to CSCtt43115.
It is NOT a hardware failure, and no RMA/EFA is required.

Collect all reported logs and the sequence of events that occurred:
- show tech detail
- show accounting log
- show logging

Make sure that the configuration (specifically SPAN) and the symptoms match those described in the DDTS Release-note enclosure.

Note: This issue is applicable to all M1 module types.

Click here for Relevant Field Notice:


Scenario 3: N7K-M132XP-12 device-based process crashes consistently

Symptoms:

Nexus7000 switch reports:
%SYSMGR-SLOT8-2-SERVICE_CRASHED: Service "lamira_usd" (PID 1944) hasn't caught signal 6 (core will be saved).

Here, Lamira is the L3/L4 forwarding engine.
The above message indicates that the lamira_usd service for this ASIC has crashed and dumped core.

Checklist:

In most cases, a lamira_usd crash is caused by a recurring TCAM parity error, and the fix is to RMA/EFA the card.

This checklist explains how to confirm that a lamira_usd crash is caused by multiple uncorrectable TCAM parity errors, which can lead to a fast resolution of the case.

(1) Collect following logs, by attaching to the module:
  1. # attach module <#>
  2. module-<#># show logging onboard exception-log
  3. module-<#># show logging onboard internal lamira

Note: The exception-log must be collected from the onboard log. “show module internal exceptionlog module ” won't suffice.

(2) Note the time at which lamira_usd crashed.

(3) Now check the onboard exception-log for the 10-20 minutes around the time of the LC crash.

// About 12 minutes before lamira_usd crashed
Exception Log Record : Tue Jul 10 04:05:40 2012 (840169 us)
Device Id : 81
Device Name : Lamira
Device Error Code : c5101210(H)
Device Error Type : ERR_TYPE_HW
Device Error Name : NULL
Device Instance : 1
Sys Error : Generic failure
Errtype : INFORMATIONAL
PhyPortLayer : Ethernet
Port(s) Affected :
Error Description : LM_INT_CL1_TCAM_B_PARITY_ERR <== uncorrectable parity error in TCAM B
DSAP : 211
UUID : 382
Time : Tue Jul 10 04:05:40 2012
(840168 usecs 4FFC0C84(H) jiffies)

(4) Compare the TCAM exception above with the exception log record at the time of the crash

Exception Log Record : Tue Jul 10 04:17:02 2012 (270399 us) <== 12 minutes after the TCAM parity error
Device Id : 34304
Device Name : 0x8600
Device Error Code : 7e010000(H)
Device Error Type : NULL
Device Error Name : NULL
Device Instance : 0
Sys Error : (null)
Errtype : CATASTROPHIC
PhyPortLayer : 0x0
Port(s) Affected :
Error Description : lamira_usd hap reset <== lamira crashed because the number of parity errors in TCAM B violated the HA policy (HAP)
DSAP : 0
UUID : 16777216
Time : Tue Jul 10 04:17:02 2012
(270399 usecs 4FFC0F2E(H) jiffies)

(5) If the lamira crash was caused by a HAP reset due to multiple TCAM parity errors, the LC should be RMAed/EFAed. Otherwise, if lamira crashed for some other reason, continue your normal troubleshooting. The onboard lamira log (“show logging onboard internal lamira”) will help Cisco TAC root-cause it.
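
Steps (2) through (4) can be sketched as a small script that scans saved exception-log text for TCAM parity errors preceding a lamira_usd HAP reset. This is only an illustration under stated assumptions: the record layout is taken from the two samples above, the 20-minute default window comes from the "10-20 minutes" guidance in step (3), and the function names are hypothetical. It does not replace reading the log.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

RECORD_RE = re.compile(
    r"^Exception Log Record\s*:\s*"
    r"(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)
DESC_RE = re.compile(r"^Error Description\s*:\s*(.*)$")


def parse_records(text: str) -> List[Dict]:
    """Extract the timestamp and Error Description of each record."""
    records: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = RECORD_RE.match(line)
        if header:
            stamp = " ".join(header.group(1).split())
            when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            current = {"time": when, "description": ""}
            records.append(current)
            continue
        desc = DESC_RE.match(line)
        if desc and current is not None:
            # Drop any "<== ..." annotation added when the log was pasted.
            current["description"] = desc.group(1).split("<==")[0].strip()
    return records


def parity_errors_before_reset(
    records: List[Dict], window: timedelta = timedelta(minutes=20)
) -> List[Tuple[Dict, Dict]]:
    """Pair each TCAM parity error with a HAP reset that follows it closely."""
    resets = [r for r in records if "lamira_usd hap reset" in r["description"]]
    pairs = []
    for reset in resets:
        for rec in records:
            desc = rec["description"]
            is_parity = "TCAM" in desc and "PARITY_ERR" in desc
            gap = reset["time"] - rec["time"]
            if is_parity and timedelta(0) <= gap <= window:
                pairs.append((rec, reset))
    return pairs
```

An empty result suggests the crash had some other cause, in which case normal troubleshooting continues as described in step (5).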

Scenario 4: All M1 modules fail specific diagnostic tests, like PortLoopback or RewriteEngineLoopback test

On the Nexus 7000, if there is an issue between the active supervisor engine and an xbar module that causes diagnostic packets to be dropped, the supervisor engine may report diagnostic test failures for multiple or all ports on multiple or all modules.

This issue requires manual investigation and isolation of the faulty sup engine.

The condition that caused the tests to go into the ErrDisabled state may be transient.
It is recommended to run the tests on demand to see whether the condition persists.

First, try to clear the ErrDisabled state of the test:

N7K# diagnostic clear result module 1 test ?
  <1-6>  Test ID(s)
  all    Select all

To run on-demand test:
N7K# diagnostic start module test

To stop the test:
N7K# diagnostic stop module test

The sup engine does not trigger a failover or reset as a corrective action to recover from this condition. An enhancement request has been filed to add such a corrective action:
CSCth03474  - n7k/GOLD:Improve Fault Isolation of N7K-GOLD

Citation - This blog post does not reflect original content from the author. Rather, it summarizes content relevant to the topic from different sources on the web. The sources might include online discussion boards, forums, websites, and others.
