
How to Initially Troubleshoot a Junos Chassis Cluster

Posted by chco, Sep 09 2010 05:09 AM

From time to time things can go wrong. You can be driving along in your car and a tire can blow out; sometimes a firewall can crash. Nothing made by humans is immune to unforeseen failure. Because of this, the administrator must be prepared to deal with the worst possible scenarios. This excerpt from Junos Security discusses the methods an administrator can use to troubleshoot a chassis cluster gone awry.
There are a few commands to use when looking into an issue. The administrator first needs to identify the status of the cluster and determine whether its members are communicating.

The show chassis cluster status command, although simple in nature, shows the administrator the status of the cluster. It shows which node is primary for each redundancy group and the status of each node, giving insight into which device should be passing traffic on the network. Here’s a sample:

{primary:node1}
root@SRX210-B> show chassis cluster status
Cluster ID: 1
Node              Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                254         secondary      no       no
    node1                1           primary        no       no

Redundancy group: 1 , Failover count: 2
    node0                254         primary        no       no
    node1                1           secondary      no       no

{primary:node1}
root@SRX210-B>


Things to look for here are that both nodes show as up, that both have a priority greater than zero, that each has a status of primary, secondary, or secondary-hold, and that one and only one node is primary for each redundancy group. Generally, if those conditions are met, the cluster should be in good shape. If one of the nodes does not appear in this output, communication with that node has been lost, and the administrator should connect to it directly and verify that it can communicate.
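
For illustration, here is a hypothetical rendering of what a lost peer can look like in this output (the exact columns vary by Junos release); note the priority of 0 and the lost status for the unreachable node:

{primary:node0}
root@SRX210-A> show chassis cluster status
Cluster ID: 1
Node              Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                254         primary        no       no
    node1                0           lost           no       no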

To validate that the two nodes can communicate, use the show chassis cluster control-plane statistics command, which shows the messages being sent between the two members. The send and receive counters should be incrementing on both nodes. If they are not, something may be wrong with the control or fabric links. Here is an example:

{primary:node0}
root@SRX210-A> show chassis cluster control-plane statistics
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 124
        Heartbeat packets received: 95
        Heartbeat packet errors: 0
Fabric link statistics:
    Probes sent: 122
    Probes received: 56
    Probe errors: 0

{primary:node0}
root@SRX210-A>


Again, this command should be familiar, as it has been used earlier in this chapter. If these counters are not increasing, check the fabric and control plane interfaces. The method for checking the fabric interfaces is the same across all SRX products.
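
A quick way to summarize the health of both links at once is the show chassis cluster interfaces command, which reports the control and fabric link states in one place. The following is a trimmed, illustrative example; the exact fields vary by platform and Junos release:

{primary:node0}
root@SRX210-A> show chassis cluster interfaces
Control link 0 name: fxp1
Control link status: Up

Fabric interfaces:
    Name    Child-interface    Status
    fab0    fe-0/0/4           up
    fab0    fe-0/0/5           up
    fab1    fe-2/0/4           up
    fab1    fe-2/0/5           up
Fabric link status: Up

{primary:node0}
root@SRX210-A>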

Next, let’s check the fabric links. It’s important to verify that the fabric link and its child links are in the up state:

{primary:node0}
root@SRX210-A> show interfaces terse
Interface               Admin Link Proto    Local            Remote
--snip--
fe-0/0/4.0              up    up   aenet    --> fab0.0
fe-0/0/5                up    up
fe-0/0/5.0              up    up   aenet    --> fab0.0
--snip--
fe-2/0/4.0              up    up   aenet    --> fab1.0
fe-2/0/5                up    up
fe-2/0/5.0              up    up   aenet    --> fab1.0
--snip--
fab0                    up    up
fab0.0                  up    up   inet     30.17.0.200/24
fab1                    up    up
fab1.0                  up    up   inet     30.18.0.200/24
--snip--
{primary:node0}
root@SRX210-A>


If any child link of the fabric link, fabX, shows a down state, that interface is physically down on the node and must be restored before fabric communications can resume.
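
If a child link is down but its cabling checks out, it is also worth confirming which physical interfaces are actually bound to the fabric links. A minimal configuration sketch, using the member interfaces from the example above, looks like this:

{primary:node0}[edit]
root@SRX210-A# show interfaces fab0
fabric-options {
    member-interfaces {
        fe-0/0/4;
        fe-0/0/5;
    }
}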

The control link is the most critical to verify, and the procedure varies per SRX platform type. On the branch devices, the interface that acts as the control link must be checked, using the same procedure as for any physical interface. The following example from an SRX210 shows that the control link interfaces on both nodes are up:

{primary:node0}
root@SRX210-A> show interfaces terse
Interface               Admin Link Proto    Local             Remote
--snip--
fe-0/0/7                up    up
--snip--
fe-2/0/7                up    up
--snip--

{primary:node0}
root@SRX210-A>
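
If the control link interface shows up but errors are still suspected, its counters can be examined like those of any other physical interface. Here is a hedged, heavily trimmed sketch of that check:

{primary:node0}
root@SRX210-A> show interfaces fe-0/0/7 extensive
Physical interface: fe-0/0/7, Enabled, Physical link is Up
--snip--
  Input errors:
    Errors: 0, Drops: 0, Framing errors: 0, Runts: 0
--snip--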


On the data center SRXs, there is no direct way to check the state of the control ports. Because the ports hang off dedicated switches inside the SRX and are not typical interfaces, they cannot be checked like other ports. It is possible, however, to check the internal switch on the SCB to ensure that packets are being received from the card. Generally, if the port is up and configured correctly, there should be no reason why it won’t communicate, but checking the internal switch will show that packets are passing from the SPC to the RE. There will be other communications coming from the card as well, but this at least provides insight into the traffic. To perform the check, you must know the node and the FPC that hosts the control link. In the following command, the specified port coincides with the FPC number of the SPC that has the control port:

{primary:node0}
root@SRX5800-1> show chassis ethernet-switch statistics 1 node 0
node0:
------------------------------------------------------------------
Displaying port statistics for switch 0
Statistics for port 1 connected to device FPC1:
  TX Packets 64 Octets        7636786
  TX Packets 65-127 Octets    989668
  TX Packets 128-255 Octets   37108
  TX Packets 256-511 Octets   35685
  TX Packets 512-1023 Octets  233238
  TX Packets 1024-1518 Octets  374077
  TX Packets 1519-2047 Octets  0
  TX Packets 2048-4095 Octets  0
  TX Packets 4096-9216 Octets  0
  TX 1519-1522 Good Vlan frms  0
  TX Octets                   9306562
  TX Multicast Packets        24723
  TX Broadcast Packets        219029
  TX Single Collision frames  0
  TX Mult. Collision frames   0
  TX Late Collisions          0
  TX Excessive Collisions     0
  TX Collision frames         0
  TX PAUSEMAC Ctrl Frames     0
  TX MAC ctrl frames          0
  TX Frame deferred Xmns      0
  TX Frame excessive deferl   0
  TX Oversize Packets         0
  TX Jabbers                  0
  TX FCS Error Counter        0
  TX Fragment Counter         0
  TX Byte Counter             1335951885
  RX Packets 64 Octets        6672950
  RX Packets 65-127 Octets    2226967
  RX Packets 128-255 Octets   39459
  RX Packets 256-511 Octets   34332
  RX Packets 512-1023 Octets  523505
  RX Packets 1024-1518 Octets  51945
  RX Packets 1519-2047 Octets  0
  RX Packets 2048-4095 Octets  0
  RX Packets 4096-9216 Octets  0
  RX Octets                   9549158
  RX Multicast Packets        24674
  RX Broadcast Packets        364537
  RX FCS Errors               0
  RX Align Errors             0
  RX Fragments                0
  RX Symbol errors            0
  RX Unsupported opcodes      0
  RX Out of Range Length      0
  RX False Carrier Errors     0
  RX Undersize Packets        0
  RX Oversize Packets         0
  RX Jabbers                  0
  RX 1519-1522 Good Vlan frms 0
  RX MTU Exceed Counter       0
  RX Control Frame Counter    0
  RX Pause Frame Counter      0
  RX Byte Counter             999614473

{primary:node0}
root@SRX5800-1>


The output looks like standard port statistics from a switch, and reviewing it will validate that packets are coming from the SPC. The SRX3000 line has its control ports on the SFB, and since there is nothing to configure for them, there is little to look at on the interface; there, it is best to focus on the output of the show chassis cluster control-plane statistics command.

If checking the interfaces yields mixed results, where they seem to be up but are not passing traffic, it’s possible to reboot the node that is in the degraded state. The risk is that the node may come up in a split-brain state, with both nodes believing they are primary. Since that is a possibility, it’s best to disable the node’s interfaces, or physically disconnect all of them except the control and fabric links; the ports can even be disabled on the switch they connect to. This way, if the node comes up believing it is primary, it will not interrupt traffic. A correctly operating node with a minimal control port and fabric port configuration should be able to communicate with its peer. If, after a reboot, it still cannot communicate with the other node, verify the configuration and cabling. Lastly, the chassis or the cluster interfaces themselves may be bad.
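
As a sketch of that procedure (the interface names here are illustrative), the revenue ports can be disabled on the degraded node before rebooting it, leaving only the control and fabric links active:

{secondary:node1}[edit]
root@SRX210-B# set interfaces fe-2/0/2 disable
root@SRX210-B# set interfaces fe-2/0/3 disable
root@SRX210-B# commit
commit complete

{secondary:node1}
root@SRX210-B> request system reboot
Reboot the system ? [yes,no] (no) yes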

Junos Security

Learn more about this topic from Junos Security.

Junos® Security is the complete and authorized introduction to the new Juniper Networks SRX hardware series. This book not only provides a practical, hands-on field guide to deploying, configuring, and operating SRX, it also serves as a reference to help you prepare for any of the Junos Security Certification examinations offered by Juniper Networks. Network administrators and security professionals will learn how to use SRX Junos services gateways to address an array of enterprise data network requirements -- including IP routing, intrusion detection, attack mitigation, unified threat management, and WAN acceleration.


