Troubleshooting ODA Network Connectivity

Setting up an ODA in a customer’s environment can either go very well or give you lots of trouble. It all depends on having your install checklist completed, reviewed by the customer, and any questions answered ahead of time.


I’ve installed dozens of ODAs in a variety of configurations, ranging from a simple bare metal install to a complex virtualized install with multiple VMs and networks. Now understand that I’m not a network engineer, nor do I play one on TV, but I know enough about networking to have a civil conversation with a 2nd level network admin without getting too far out of my comfort zone. Knowing this, I can certainly appreciate the level of complexity involved in configuring and supporting an enterprise grade network.


Having said that, I find that when there are issues with a deployment, whether it’s an ODA, ZFS appliance, Exadata or other device, at least 80% of the time network misconfigurations are the culprit. I can’t tell you how many times I’ve witnessed misconfigurations where the network admin swore up and down that the settings were correct when in fact they were wrong. It usually involves checking, re-checking and checking yet again to finally uncover the culprit. Below, I’ll outline some of the snafus I’ve been involved with and the troubleshooting that can help resolve them.


  • Cabling: Are you sure the cables are all plugged into the right place?

If you didn’t personally cable the ODA and you’re having network issues, don’t go too long without validating the cabling yourself. In this case, the fancy setup charts are a lifesaver! On the X5-2 ODAs, for example, the InfiniBand private interconnect is replaced by the 10Gb fiber Ethernet option if the customer needs 10Gb Ethernet over fiber. There is only one expansion slot available, so unfortunately it’s either/or. As a result, the private interconnect then runs over net0 and net1 with crossover cables (green and yellow) between the two compute nodes instead of the InfiniBand cables. This can be missed very easily. Also make sure the storage cables are all connected to the proper ports for your configuration, whether it’s one storage shelf or two. Storage cabling mistakes will typically be caught shortly after deploying the OS image, whether it’s virtualized or bare metal. There’s a storagetopology check that gets run during the install process that will catch most cabling mistakes, but it’s best not to chance it.

  • Switch configuration: Trunk port vs. Access port

When you configure a switch port, you need to tell the switch what kind of traffic will pass through that port. One of the important items is which network(s) the server attached to this port needs to talk on. If you’re configuring a standalone physical server, chances are you won’t need to talk on more than one VLAN. In this case, it’s usually appropriate to configure the switch port as an access port. You can still put the server on a non-default VLAN (a VLAN other than 1), but the VLAN “tags” get stripped off at the switch and the server never sees them.

If, however, you’re setting up a VMware server or another machine that uses virtualization technology, it’s more likely that the VMs running on that server will need to talk on more than one VLAN through the same network adapter(s). In this case, you would need to set the port mode to trunk. You then need to assign all the VLANs that the server will need to communicate on to that trunk port. The server is then responsible for reading the VLAN tags and passing the traffic to the appropriate destination on the server. This is one of the areas where the switch is usually configured incorrectly. Most of the time, the network engineer fails to configure trunk mode on the port, forgets to assign the proper VLANs to the port, or gets the native VLAN setting wrong.

There is a difference between the default VLAN and a native VLAN. The default VLAN is always present and is typically needed for intra-network device communication to take place. Things like Cisco’s CDP (Cisco Discovery Protocol) use this VLAN. The native VLAN, if configured, is treated much like an access port from the perspective of the network adapter on the server. The server NIC does not need a VLAN interface configured on top of it to talk on the native VLAN. If you want to talk on any other VLAN on this port, however, you would need to configure a VLAN interface on the server to receive those packets. I’ve not seen the native VLAN used in a lot of configurations where more than one VLAN is needed, but it is most certainly a valid configuration. Have the network team check these settings and make sure you understand how they should apply to your device.
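To make the tagged vs. untagged distinction concrete, here is a minimal Python sketch of how a receiver can tell the two apart by looking at the Ethernet header. The frames are hand-built and hypothetical (the MAC addresses and VLAN 120 are made up for illustration), and this is not how the ODA itself processes traffic; it only shows where the 802.1Q tag sits in a frame.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def vlan_of(frame: bytes):
    """Return the VLAN ID carried in an Ethernet frame, or None if untagged."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet header")
    # Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: EtherType or TPID.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != TPID_8021Q:
        return None  # untagged: what an access port or the native VLAN delivers
    if len(frame) < 18:
        raise ValueError("frame too short for a VLAN tag")
    # The 16-bit TCI follows the TPID; its low 12 bits are the VLAN ID.
    (tci,) = struct.unpack("!H", frame[14:16])
    return tci & 0x0FFF


# Hypothetical frames: same MACs, one untagged IPv4, one tagged for VLAN 120.
macs = bytes.fromhex("ffffffffffff" "001122334455")
payload = b"\x00" * 46
untagged = macs + struct.pack("!H", 0x0800) + payload
tagged = macs + struct.pack("!HH", TPID_8021Q, 120) + struct.pack("!H", 0x0800) + payload

print(vlan_of(untagged))  # None
print(vlan_of(tagged))    # 120
```

An untagged frame is what the server sees on an access port or on a trunk’s native VLAN; anything else on the trunk arrives with the tag in place and needs a VLAN interface on the server to claim it.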

  • Switch configuration: Aggregated ports vs. regular ports

Most switches can combine anywhere from 2 to as many as 8 ports to provide higher throughput/utilization of the ports as well as redundancy at the same time. This is referred to in different ways depending on your switch vendor. Cisco calls it EtherChannel, HP calls it Dynamic LACP trunking, while Extreme Networks refers to it as sharing (LAG). However you want to refer to it, it’s an implementation of the IEEE link aggregation standard (originally 802.3ad, now 802.1AX), commonly referred to as Link Aggregation or LACP (Link Aggregation Control Protocol). Normally when you configure a pair of network interfaces on a server together, it’s to provide redundancy and avoid a SPOF (Single Point Of Failure). I’ll refer to the standard Linux implementation, mainly because I’m familiar with the different methods of load balancing that are typically employed. This isn’t to say that other OSes don’t have this capability (almost all do); I’m just not very experienced with all of them.

Active-Backup (Linux bonding driver mode=1) is a very simple implementation in which a primary interface is used for all traffic until that interface fails. The traffic then moves over to the backup interface and communication is restored almost seamlessly. There are other load balancing modes besides this one that don’t require any special configuration on the switch, each with its own strengths and weaknesses.

LACP, which does require a specific configuration on the switch ports involved in order to work, tends to be more performant while still maintaining redundancy. The main reason for this is that the network driver on the server and the switch exchange control frames (LACPDUs, sent to the multicast group MAC address 01:80:c2:00:00:02) to keep both partners up to date on the status of the link. This allows all the ports in the LACP group to carry traffic at the same time. With two NICs and enough distinct flows, the load splits close to 50/50, roughly doubling aggregate throughput (or better with more NICs in the group).

The reason I’m talking about this in the first place is the configuration that needs to be in place on the switch if you’re going to use LACP. If you configure your network driver for Active-Backup mode but the switch ports are set to LACP, you likely won’t see any packets at all on the server. Likewise, if you have LACP configured on the server but the switch isn’t properly set up to handle it, you’ll get the same result. This is another setting that commonly gets misconfigured. Other parameters such as STP (Spanning Tree Protocol), lacp_rate and passive vs. active LACP are also commonly misconfigured. Sometimes the configuration also has to be split between two switches (again, no SPOF), and an MLAG configuration needs to be properly set up to allow LACP to work between switches. Effectively, MLAG is one way of making two switches appear as one from a network protocol perspective, and it is required to span multiple switches within a LACP port group. The takeaway here is to have the network admin verify their configuration on the switch(es) and ports involved.
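Before asking the network admin to re-check the switch, it helps to know for certain what the server side is doing. On Linux, the bonding driver reports this in /proc/net/bonding/<bond name>. Below is a rough Python sketch that pulls the mode and per-slave link state out of that text. The SAMPLE string is a hand-written, abbreviated stand-in for the real file (the interface names eth0 and eth1 are assumptions), so treat it as an illustration rather than exact output from an ODA.

```python
import re

# Abbreviated, hand-written stand-in for /proc/net/bonding/bond0.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""


def summarize_bond(text):
    """Extract the bonding mode and each slave's MII status from bond status text."""
    mode_match = re.search(r"^Bonding Mode:\s*(.+)$", text, re.MULTILINE)
    mode = mode_match.group(1).strip() if mode_match else None
    slaves = {}
    current = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # Only MII lines that follow a "Slave Interface" line belong to a slave.
            slaves[current] = value
    return {
        "mode": mode,
        "uses_lacp": bool(mode and "802.3ad" in mode),
        "slaves": slaves,
    }


print(summarize_bond(SAMPLE))
```

On a live host you would feed it the real file instead, for example `summarize_bond(open("/proc/net/bonding/bond0").read())`, where bond0 is whatever your bond is actually called. If `uses_lacp` is False but the switch ports are in an LACP group (or the other way around), you have found your mismatch.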

  • Link speed: how fast can the server talk on the network?

Sometimes a server is capable of communicating at 10 Gb/s versus the more common 1 Gb/s, most typically over either copper or fiber media. It used to be that you had to force switches to talk at 1 Gb/s in order for the server to negotiate that speed. This was back when 1 Gb/s was newer and the handshake protocol that takes place between the NIC and the switch port at connection time was not as mature as it is now. However, as a holdover from those halcyon days of yore, some network admins are still prone to setting port speeds manually rather than letting them auto-negotiate like a good network admin should. Thus you have servers connecting at 1 Gb/s when they should be running at 10 Gb/s. Again, just something to keep in mind if you’re having speed issues.
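A quick way to see what speed a Linux interface actually negotiated is the `speed` file under /sys/class/net/<interface>/, which reports Mb/s. Here is a small Python sketch that compares it against what you expected. The interface name eth0 and the 10000 Mb/s expectation in the example call are assumptions; substitute your own.

```python
from pathlib import Path


def negotiated_speed_mbps(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mb/s, or None if it can't be determined."""
    try:
        raw = (Path(sysfs_root) / iface / "speed").read_text().strip()
        speed = int(raw)
    except (OSError, ValueError):
        return None  # interface missing, link down, or nothing usable reported
    # A non-positive value (typically -1) means the driver doesn't know the speed.
    return speed if speed > 0 else None


def check_speed(iface, expected_mbps, sysfs_root="/sys/class/net"):
    """Describe whether the interface negotiated at least the expected speed."""
    actual = negotiated_speed_mbps(iface, sysfs_root)
    if actual is None:
        return f"{iface}: speed unknown (link down?)"
    if actual < expected_mbps:
        return f"{iface}: negotiated {actual} Mb/s, expected {expected_mbps} Mb/s"
    return f"{iface}: OK at {actual} Mb/s"


if __name__ == "__main__":
    # Hypothetical interface name and expectation; adjust for your host.
    print(check_speed("eth0", 10000))
```

If this reports 1000 on a port that should be running at 10 Gb/s, that’s your cue to ask whether the switch port speed was hard-set.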

  • Cable Quality: what speed is your cable rated at?

There are currently four common ratings for copper Ethernet cables. They are by no means the only ones, but these are the most commonly used in datacenters. They all have to do with how fast you can send data through the cables. Cat 5 is capable of transmitting up to 1 Gb/s. Cat 5e was an improvement on Cat 5 and introduced some enhancements that limited crosstalk (interference) between the 8 wires of a standard Ethernet cable. Cat 6 and 6a are further improvements on those standards, now allowing speeds of up to 10 Gb/s (Cat 6 only over shorter runs, Cat 6a over the full 100 m). Basically, the newer the Cat x number/letter, the faster you can safely transmit data without data loss or corruption. The reason I mention this is that I’ve been burned on more than one occasion when using Cat 5 for 1 Gb/s and had too much crosstalk, which severely limited throughput and resulted in a lot of collisions. Replacing the cable with a new Cat 5e or higher rated cable almost always fixed the problem. If you’re having communication problems, rule this out early on so you’re not chasing your tail in other areas.
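A marginal cable usually shows up as climbing error counters on the interface. On Linux those counters live under /sys/class/net/<interface>/statistics/. The Python sketch below reads a handful of them; which counters matter most is my own pick, and the interface name in the example call is an assumption.

```python
from pathlib import Path

# Counters that tend to move when the physical layer is unhealthy.
COUNTERS = ("rx_errors", "rx_crc_errors", "rx_frame_errors", "collisions")


def read_error_counters(iface, sysfs_root="/sys/class/net"):
    """Read selected error counters for an interface; None means unreadable."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    counters = {}
    for name in COUNTERS:
        try:
            counters[name] = int((stats_dir / name).read_text())
        except (OSError, ValueError):
            counters[name] = None
    return counters


def suspicious(counters):
    """Keep only the counters that are readable and non-zero."""
    return {name: value for name, value in counters.items() if value}


if __name__ == "__main__":
    # Hypothetical interface name; adjust for your host.
    print(suspicious(read_error_counters("eth0")))
```

A single snapshot only tells you errors have happened at some point since boot. Run it twice a minute or so apart while pushing traffic; counters that keep growing point at the cable (or the port) rather than at old history.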

  • IP Networking: Ensuring you have accurate network configurations

I’ve had a lot of problems in this area. The biggest problem seems to be that not all customers have taken the time to review and fill out the pre-install checklist. This checklist prompts you for all the networking information you’ll need to do the install. If you’ve been given IP information, make sure it’s correct before you tear your hair out. I’ve been given multiple configurations at the same customer for the same appliance, and each time there was something critically wrong that kept me from talking on the network. Configuring VLANs can be especially trying because if you have it wrong, you just won’t see any traffic. With regular non-VLAN configurations, if you put yourself on the wrong physical switch port or network, you can always sniff the network (tcpdump is now installed as part of the ODA software). This doesn’t really work with VLAN traffic. Other things to verify would be your subnet mask and default gateway. If either of these is misconfigured, you’re gonna have problems. Also, as I mentioned earlier, don’t make the mistake of assuming you have to create a VLAN interface on the ODA just because you’re connected to a trunked port. Remember, the native VLAN traffic is passed on to the server with the VLAN tags stripped off, so it uses a regular network interface (e.g. net1).
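Some of these mistakes can be caught before you ever plug in. As a rough sketch, Python’s standard `ipaddress` module can sanity check that the address, netmask and gateway you were handed at least agree with each other. The function name and the example addresses are made up for illustration, and passing this check does not prove the values are right for the customer’s network, only that they are internally consistent.

```python
import ipaddress


def check_ip_settings(address, netmask, gateway):
    """Return a list of problems found; an empty list means the basics line up."""
    try:
        iface = ipaddress.ip_interface(f"{address}/{netmask}")
    except ValueError as exc:
        return [f"bad address or netmask: {exc}"]
    try:
        gw = ipaddress.ip_address(gateway)
    except ValueError as exc:
        return [f"bad gateway: {exc}"]

    problems = []
    net = iface.network
    if gw not in net:
        problems.append(f"gateway {gw} is not inside {net}")
    if gw == iface.ip:
        problems.append("gateway and host address are identical")
    # Skip this check for /31 and /32, where these addresses can be valid hosts.
    if net.prefixlen < 31 and iface.ip in (net.network_address, net.broadcast_address):
        problems.append(f"{iface.ip} is the network or broadcast address of {net}")
    return problems


# Hypothetical values as they might appear on a pre-install checklist.
print(check_ip_settings("10.1.20.15", "255.255.255.0", "10.1.20.1"))  # []
print(check_ip_settings("10.1.20.15", "255.255.255.0", "10.1.21.1"))
```

The second call flags the gateway as being outside the 10.1.20.0/24 network, which is exactly the kind of one-digit checklist typo that can cost you an afternoon.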

These are just some of the pitfalls you may encounter.  I hope some of this has helped!
