Linux Exploit: How Making Your System More Secure Can Hurt You

With all the security exploits in the wild these days, it pays to protect your data.  One would think that encrypting your filesystems would be a good step in the right direction.  Normally it would, but in this case not so much.  Unlike Iron Maiden’s nod to the doomsday clock in the song 2 Minutes to Midnight, now it doesn’t even take that long to compromise a system.  No- watch out, here comes Cryptsetup.  We’ll do it for you in 30 seconds!  Ok, maybe too many references to British heavy metal bands and cult classic movies but you get the point.

 


CVE-2016-4484: Cryptsetup Initrd root Shell was first revealed to “the public” about a week ago at the DeepSec security conference held in Austria and a few days later on the web at large.  The kicker is that it only applies if you have encrypted your system partition.  There is a much more detailed writeup here on how it works, how to tell if you’re vulnerable and how to fix it.  The good news is that apparently not many people are encrypting their system partitions, because I can’t believe that nobody has run across this until now, even by accident (who hasn’t accidentally left something like a book pressing the Enter key on their keyboard?).

Putting the Oracle SPARC M7 Chip through its paces

From time to time I get an opportunity to dive under the hood of some pretty cool technologies in my line of work.  Being an Oracle Platinum Partner, Collier IT specializes in Oracle based hardware and software solutions.  On the hardware side we work with Exadata, Oracle Database Appliance and the Oracle ZFS Appliance just to name a few.  We have a pretty nice lab that includes our own Exadata and ODA, and just recently a T7-2.

 

Featuring the new SPARC M7 chip released in October of 2015 with Software in Silicon technology, the M7-x and T7-x server line represents a huge leap forward in Oracle Database performance.  The difference between the M7 and T7 servers is basically size and power.  The chip itself is called M7, not to be confused with the server model M7-x.  The T7-x servers also use the same M7 processor.  Hopefully that clears up any confusion on this going forward.  Here’s a link to a datasheet that outlines the server line in more detail.

 

In addition to faster on-chip encryption and real time data integrity checking, SQL query acceleration provides an extremely compelling use case for consolidation while maintaining a high level of performance and security with virtually no overhead.  The SPARC line of processors has come a very long way indeed since its infancy.  Released in late 1987, it was designed from the start to provide a highly scalable architecture around which to build a compute package that ranged from embedded processors all the way up to large server based CPUs while utilizing the same core instruction set.  The name SPARC itself stands for Scalable Processor ARChitecture.  Based on the RISC (Reduced Instruction Set Computer) architecture, operations are designed to be as simple as possible.  This helps achieve nearly one instruction per CPU cycle, which allows for greater speed and simplicity of hardware.  Furthermore, this helps promote consolidation of other functions such as memory management or floating point operations on the same chip.

 

Some of what the M7 chip is doing has actually been done in principle for decades.  Applications such as hardware video acceleration or cryptographic acceleration leverage instruction sets hard coded into the processor itself, yielding incredible performance.  Think of it as a CPU that has only one job in life- to do one thing and do it very fast.  Modern CPUs such as Intel x86 processors have many, many jobs to perform and they have to juggle all of them at once.  They are very powerful; however, because of the sheer number of jobs they are asked to perform, they don’t really excel at any one thing.  Call them a jack of all trades and master of none.  What a dedicated hardware accelerator does for video playback, for example, is what Oracle is doing for database operations such as SQL in the M7 chip.  The M7 processor is still a general purpose CPU, but with the ability to perform database related instructions in hardware at machine level speeds with little to no overhead.  Because of this, the SPARC M7 can outperform general purpose processors that have to timeshare those types of instructions with all the other workloads they’re being asked to perform.

 

A great analogy would be comparing an athlete who competes in a decathlon to a sprint runner.  The decathlete is very good at running fast, however he needs to be proficient in 9 other areas of competition.  Because of this, the decathlete cannot possibly be as good at running fast as the sprinter because the sprinter is focusing on doing just one thing and being the best at it.  In the same vein, the M7 chip also performs SQL instructions like a sprinter.  The same applies to encryption and real time data compression.

 

Having explained this concept, we can now get into practical application.  The most common use case will be for accelerating Oracle Database workloads.  I’ll spend some time digging into that in my next article.  Bear in mind that other workloads such as cryptography and data compression are accelerated in hardware as well.

 

Over the past few weeks, we’ve been doing some benchmark comparisons between 3 very different Oracle Database hardware configurations.  The Exadata (x5), the Oracle Database Appliance (x5) and an Oracle T7-2 are the three platforms that were chosen.  There is a white paper that Collier IT is in the process of developing which I will be a part of.  Because the data is not yet fully analyzed, I can’t go into specifics on the results.  What I can say is that the T7-2 performed amazingly well from a price/performance perspective compared to the other two platforms.

 

Stay tuned for more details on a new test with the S7 and a Nimble CS-500 array as well as a more in-depth look at how the onboard acceleration works including some practical examples.

Ecessa ClariLink installation

The company I work for recently purchased a WAN load balancer for our office as we converted to VoIP phones from our old POTS system and we need them to work ALL the time.  We are an Ecessa partner and like to eat our own dog food so to speak, so this was a natural fit for us.

 

Ecessa has been in the WAN virtualization business since 2000 and the company (previously known as Astrocom) has been around since 1968.  The device we decided to implement is called a ClariLink 175.  There are a number of things that this little gem is capable of doing.  For our primary purposes, it is able to perform transparent failover of SIP calls without losing the session.  This means that if I start a VoIP call that goes out our Comcast link and that link goes down, the ClariLink fails that call over to our CenturyLink connection without dropping the call.  Slick huh?

 

This is all well and good, however there are a number of intricacies to how this works under the hood and they all differ based on your VoIP provider.  Ours (ANPI/Voyant) is a hosted solution provider meaning that the PBX is in “the cloud” and our IP phones connect to it over the internet to make calls.  Because of this, they enforce SIP authentication to make sure someone can’t fire up a desktop and pretend to be me and call a bunch of 1-900 numbers!

 

The ClariLink has numerous configuration options including the ability to proxy SIP traffic.  This is how it can perform seamless failover of calls even though our external IP address changes when traffic starts coming from a different provider.  The ClariLink sends a SIP re-INVITE to the provider with the new source IP address mid-call when it detects that one of the internet connections has gone down.  Because the device is proxying the traffic, it is able to modify the SIP packets and change some bits of information to make sure the provider maintains the call even though the source IP address is different.  I don’t know of any desktop phones today that are capable of doing that- cool stuff.
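
To make the idea a little more concrete, here is a minimal Python sketch of the kind of rewrite a SIP proxy could perform when the outbound link changes.  This is not Ecessa’s implementation- the function name, the sample message and the IP addresses (taken from the documentation address ranges) are all my own assumptions, and a real proxy also has to deal with authentication, dialog state and recomputing Content-Length.

```python
def rewrite_source_ip(sip_message: str, old_ip: str, new_ip: str) -> str:
    """Return a copy of sip_message with old_ip swapped for new_ip in the
    Via and Contact headers and in the SDP connection (c=) line.

    Hypothetical helper for illustration only.
    """
    rewritten = []
    for line in sip_message.split("\r\n"):
        # Only touch the lines that advertise where we can be reached.
        if line.lower().startswith(("via:", "contact:", "c=")):
            line = line.replace(old_ip, new_ip)
        rewritten.append(line)
    return "\r\n".join(rewritten)


# A made-up mid-call INVITE (Content-Length omitted to keep the sketch short).
SAMPLE_INVITE = "\r\n".join([
    "INVITE sip:bob@example.com SIP/2.0",
    "Via: SIP/2.0/UDP 198.51.100.10:5060;branch=z9hG4bK776asdhds",
    "From: <sip:alice@example.com>;tag=1928301774",
    "To: <sip:bob@example.com>;tag=a6c85cf",
    "Call-ID: a84b4c76e66710",
    "CSeq: 2 INVITE",
    "Contact: <sip:alice@198.51.100.10:5060>",
    "Content-Type: application/sdp",
    "",
    "v=0",
    "c=IN IP4 198.51.100.10",
    "m=audio 49170 RTP/AVP 0",
])

if __name__ == "__main__":
    # Pretend the first link (198.51.100.10) died and the second one took over.
    print(rewrite_source_ip(SAMPLE_INVITE, "198.51.100.10", "203.0.113.20"))
```

Note that the Call-ID and the From/To tags are left alone: those are what let the provider recognize the re-INVITE as belonging to the call that is already in progress.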

 

We’re actually in the process of implementing this feature with our phone system and haven’t quite got it working just yet.  I plan on posting a second article once things are up and running to talk in more detail about how it works.  In the meantime, I’m having a lot of fun setting up QoS rules to keep someone from chewing up all our bandwidth by uploading pictures to their Google Drive ;).

Using VVOLs with vSphere 6 and Nimble

VMware Virtual Volumes is a concept that represents a major paradigm shift from the way storage is used in VMware today.

Below is a short 5 minute video that explains the basic concept of VVOLs.

 

Additionally, knowing the difference between communicating with LUNs as in the old world and communicating with PEs (Protocol Endpoints) is crucial to understanding what VVOLs brings to the table and why.

In short, a PE is actually a special LUN on the storage array that the ESXi server uses to communicate with the array.  It’s not a LUN in the traditional sense, but more like a logical gateway to talk to the array.  I would say in some ways it’s similar in function to a gatekeeper LUN on an EMC array.  That LUN in turn maps to multiple sub-LUNs that make up the VM’s individual storage related components (vmdk, vswp, vmsd, vmsn etc).  When the host wants to talk to a LUN, it sends the request to the address of the PE “LUN” with an offset address of the actual LUN on the storage array.  Two things immediately came to mind once I understood this concept:

  1. Since all communication related to the sub-volumes is a VASA function, what happens when vCenter craps the bed?
  2. If I only have 1 PE, isn’t that going to be a huge bottleneck for storage I/O?

The answers to these and other questions are handily dealt with in a post here by VMware vExpert Paul Meehan.  Again- the short version is that vCenter is not needed after the host boots and gets information on PEs and address offsets.  Where it IS needed, however, is during a host boot.  Secondly, I/O traffic actually goes through the individual volumes, not the PE.  Remember, the PE is a logical LUN that serves as a map to the actual volumes underneath.
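
If it helps, here’s how I picture the PE in my head, written as a tiny Python toy model.  To be clear, this is only a sketch of the concept- the class, the IDs and the volume names are all made up, and it doesn’t reflect how ESXi or the array actually implement any of this.

```python
from dataclasses import dataclass, field


@dataclass
class ProtocolEndpoint:
    """Toy model of a PE: one address the host targets, plus a table that
    maps offset (secondary) IDs to the volumes that sit behind it."""
    pe_id: str
    bindings: dict[int, str] = field(default_factory=dict)

    def bind(self, secondary_id: int, volume: str) -> None:
        # In real life the bindings are negotiated through VASA; here it is
        # just a dictionary entry.
        self.bindings[secondary_id] = volume

    def route(self, secondary_id: int) -> str:
        # The host addresses "PE + offset" and the request lands on the
        # individual volume. No vCenter lookup is involved at this point.
        return self.bindings[secondary_id]


# Hypothetical VM made up of three sub-volumes behind one PE.
pe = ProtocolEndpoint("pe-0")
pe.bind(1, "vm01-config")
pe.bind(2, "vm01-data.vmdk")
pe.bind(3, "vm01-swap.vswp")

if __name__ == "__main__":
    print(pe.route(2))
```

Once the bindings table is populated, routing a request is just a local lookup, which is the intuition behind vCenter not being needed after the host has booted.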

This brings me to the next video- understanding PEs.  This link starts about 12 minutes into an hour long presentation, at the point where PEs are discussed.  Feel free to watch the entire video if you want to learn more!

 

Finally, let’s walk through how to set up VVOLs on your Nimble array.  There are a few pre-requisites before you can start:

  • NOS version 3.x or newer
  • vSphere 6.x or newer

Here’s the step by step process:

  1. Connect to the web interface of the local array
  2. Click on Administration -> VMware integration
  3. Fill in the following information
    • vCenter Name (this can be a vanity name- doesn’t have to be the address of the host)
    • choose the proper subnet on your Nimble array to communicate with vCenter
    • vCenter Host (FQDN or IP address)
    • Credentials
    • Check Web Client and VASA Provider
    • Click Save (This registers vCenter with the storage array and installs the VASA 2.0 provider)
  4. Navigate to Manage -> Storage Pools
  5. Select the Pool in which you want to create the VVOLs (for most sites this will be default)
  6. Click New Folder
  7. Change the Management Type to VMware Virtual Volumes
  8. Give the folder a Name and Description
  9. Set the size of the folder
  10. Choose the vCenter that you registered above, then click Create

Now you have a storage container on the Nimble array that you can use to create VVOLs.  Let’s look at the VMware side now:

  1. Connect to the vSphere web client for your vCenter 6 instance (this will not work with the thick client)
  2. Navigate to Storage and highlight your ESX server
  3. Click on Datastores on the right side of the window
  4. Click on the icon to create a new datastore
  5. Select your location (datacenter) then click next
  6. Select type VVOL then click next
  7. You should see at least one container- click next.  If not, try rescanning your HBAs in the web client and start again
  8. Assign which host(s) will need access to the VVOL and click next
  9. On the summary screen- click finish

You should now see a new datastore.  Now let’s create a VM in the datastore and see what it looks like in the Nimble web interface!

  1. In vCenter, navigate to hosts and clusters
  2. Right click on your host to create a new virtual machine
  3. Click next under creation type to create a new virtual machine
  4. Give the VM a name, select the location where it should be created and click next
  5. Select the VVOL no requirements policy under VM storage policy
  6. Select the VVOL datastore that is compatible and click next
  7. Select ESXi 6.0 and later under the VM compatibility drop-down and click next
  8. Choose the appropriate guest OS family and version then click next
  9. Adjust the virtual hardware to meet your needs and click next
  10. At the summary screen, verify all settings are correct and click Finish

Now if you navigate to Manage volumes in your Nimble web interface you will see multiple volumes for each VM you created.  Instead of putting all the .vmdk, .vmx, .vswp and other files inside a single datastore on a single LUN, each object is its own volume.  This is what allows you to set performance policies on a per VM basis because each volume can be treated differently.  You can set a high performance policy on your production VMs and low performance on dev/test for example.  Normally you would have to split your VMs into separate datastores and manage the performance policies on a per datastore level.  The problem with this is that you still have no visibility into each VM in that datastore at the storage layer.  With VVOLs, you can see latency, throughput and even noisy neighbor information on a per VM basis in the Nimble web interface!

 

Dirty COW Linux Vulnerability – CVE-2016-5195

An exploit in the memory mapping section of the kernel has been newly reported.  It’s actually been in the kernel for years but just recently became much more dangerous due to recent changes in the kernel structure.  Here’s the alert from Red Hat’s website:

 

Red Hat Product Security has been made aware of a vulnerability in the Linux kernel that has been assigned CVE-2016-5195. This issue was publicly disclosed on October 19, 2016 and has been rated as Important.

Background Information

A race condition was found in the way the Linux kernel’s memory subsystem handled the copy-on-write (COW) breakage of private read-only memory mappings. An unprivileged local user could use this flaw to gain write access to otherwise read-only memory mappings and thus increase their privileges on the system.

This could be abused by an attacker to modify existing setuid files with instructions to elevate privileges. An exploit using this technique has been found in the wild.

 

Here’s a great description of how the exploit works in a 12 minute YouTube video.

 

Patch patch patch!!

Adding new hosts to vSphere cluster with RDM disks

I was working with a customer to assist them in upgrading their cluster from 5.1 to 6.0 U2.  They had started off with a 2 host cluster, then added a third node to the cluster.  The SAN is an HP EVA6000.  When VMware was first set up and volumes were provisioned and presented, there were only the first two hosts in the cluster.  After the third host was added, naturally the SAN admin presented the volumes to the new host.  What was missed was making sure the LUN number for each volume that was previously presented to just the first two hosts was the same LUN number when presented to the third host.

 

We were running into some problems performing a vMotion of a VM with RDMs to the new host.  It was complaining that the target host couldn’t see the disks, even though I was able to verify both in the GUI and CLI that it absolutely could see them.  I was able to vMotion between the two original hosts, however, so this had me stumped.  I had the SAN admin double check the presentation for the RDM disk on that VM and that’s when I saw the LUN number mismatch.

 

The fix was to power off the host, unpresent the volumes, present the volumes making sure to use the same LUN number as the other two, then power the host back up.  After doing this, our problems were solved!
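
In hindsight this is an easy thing to check for up front.  Here’s a rough Python sketch of the comparison, assuming you’ve already collected the LUN number each host sees for each volume (however you gather it- the host names, volume names and numbers below are invented for illustration).

```python
from collections import defaultdict


def find_lun_mismatches(presentations):
    """Given {host: {volume: lun_number}}, return {volume: {host: lun_number}}
    for every volume that is presented with different LUN numbers."""
    by_volume = defaultdict(dict)
    for host, volumes in presentations.items():
        for volume, lun in volumes.items():
            by_volume[volume][host] = lun
    # A volume is a problem if the hosts disagree on its LUN number.
    return {
        volume: hosts
        for volume, hosts in by_volume.items()
        if len(set(hosts.values())) > 1
    }


# Hypothetical presentation data: the third host got a different LUN number
# for the RDM volume.
presentations = {
    "esx01": {"vol-datastore-01": 1, "vol-rdm-01": 5},
    "esx02": {"vol-datastore-01": 1, "vol-rdm-01": 5},
    "esx03": {"vol-datastore-01": 1, "vol-rdm-01": 7},
}

if __name__ == "__main__":
    for volume, hosts in find_lun_mismatches(presentations).items():
        print(f"{volume}: {hosts}")
```

Running something like this before adding a host to the cluster would have flagged the RDM volume right away.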

Windows Wifi troubleshooting tools

Have you ever tried connecting your laptop to a Wifi network and for one reason or another it failed?  It can be extremely frustrating, even to a seasoned vet who knows their way around Windows.  The big problem is that you get virtually no information from the connection failure.  No logs, no error codes, nothing.

There is a reporting tool built into Windows 8 and newer that will compile a very detailed report showing each connection attempt, its status and a ton of other stuff.  Here’s how to run the report and where it gets put:

 

  • Open a command prompt as administrator
  • Run the following command
    • netsh wlan show wlanreport
  • Note the path where the html file is generated.  It should be C:\ProgramData\Microsoft\Windows\WlanReport\wlan-report-latest.html
  • Open your favorite web browser and point it to that file.  Voilà!
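
If you find yourself doing this a lot, the steps above can be wrapped in a few lines of Python.  Treat this as a sketch: the function names are mine, it assumes the report lands in the default location shown above, and it still needs to be run from an elevated prompt.

```python
import os
import subprocess
import sys
import webbrowser
from pathlib import PureWindowsPath


def wlan_report_path(program_data=r"C:\ProgramData"):
    """Build the path where the report is expected to be written
    (assumes the default location)."""
    return PureWindowsPath(
        program_data, "Microsoft", "Windows", "WlanReport",
        "wlan-report-latest.html",
    )


def generate_and_open_report():
    """Run the netsh command from the steps above, then open the result."""
    subprocess.run(["netsh", "wlan", "show", "wlanreport"], check=True)
    program_data = os.environ.get("ProgramData", r"C:\ProgramData")
    webbrowser.open(wlan_report_path(program_data).as_uri())


if __name__ == "__main__":
    if sys.platform == "win32":
        generate_and_open_report()
    else:
        # Nothing to run on other platforms; just show the expected path.
        print("Report would be written to", wlan_report_path())
```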

 

Here’s a snippet of what some of the report looks like:

[Screenshot: WLAN report session list]

 

There is a LOT more information below this including extremely detailed information about all the network adapters on the system, the system itself and some script output.  If you click on the items in the session list above, it will bring you to a detailed log of that session and why it was able to or not able to connect.

 

Suffice it to say this is an invaluable tool to review logs and information all in one place.

OVM Server for x86 version 3.4.2 released!

Oracle has just released the latest version of Oracle VM for x86 and announced it at OpenWorld.  There are some really cool additions that enhance the stability and usability of Oracle VM.  Here are some of the new features:

 

Installation and Upgrades

Oracle VM Manager support for previous Oracle VM Server releases
As of Oracle VM Release 3.4.2, Oracle VM Manager supports current and previous Oracle VM Server releases. For more information, see Chapter 6, Oracle VM Manager Support for Previous Oracle VM Server releases.

Infrastructure

Support for NVM Express (NVMe) devices
Oracle VM Server now discovers NVMe devices and presents them to Oracle VM Manager, where the NVMe device is available as a local disk that you can use to store virtual machine disks or create storage repositories.

The following rules apply to NVMe devices:

Oracle VM Server for x86
  • To use the entire NVMe device as a storage repository or for a single virtual machine physical disk, you should not partition the NVMe device.
  • To provision the NVMe device into multiple physical disks, you should partition it on the Oracle VM Server where the device is installed. If an NVMe device is partitioned then Oracle VM Manager displays each partition as a physical disk, not the entire device.

    You must partition the NVMe device outside of the Oracle VM environment. Oracle VM Manager does not provide any facility for partitioning NVMe devices.

  • NVMe devices can be discovered if no partitions exist on the device.
  • If Oracle VM Server is installed on an NVMe device, then Oracle VM Server does not discover any other partitions on that NVMe device.

Oracle VM Server for SPARC
  • Oracle VM Manager does not display individual partitions on an NVMe device but only a single device.

    Oracle recommends that you create a storage repository on the NVMe device if you are using Oracle VM Server for SPARC. You can then create as many virtual disks as required in the storage repository. However, if you plan to create logical storage volumes for virtual machine disks, you must manually create ZFS volumes on the NVMe device. See Creating ZFS Volumes on NVMe Devices in the Oracle VM Administration Guide.

Using Oracle Ksplice to update the dom0 kernel
Oracle Ksplice capabilities are now available that allow you to update the dom0 kernel for Oracle VM Server without requiring a reboot. Your systems remain up to date with their OS vulnerability patches and downtime is minimized. A Ksplice update takes effect immediately when it is applied. It is not an on-disk change that only takes effect after a subsequent reboot.

Note

This does not impact the underlying Xen hypervisor.

Depending on your level of support, contact your Oracle support representative for assistance before using Oracle Ksplice to update the dom0 kernel for Oracle VM Server. For more information, see Oracle VM: Using Ksplice Uptrack Document ID 2115501.1, on My Oracle Support at: https://support.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=2115501.1.

Extended SCSI functionality available for virtual machines
Oracle VM now provides additional support for SCSI functionality to virtual machines:

  • Linux guests can now retrieve vital product data (VPD) page 0x84 information from physical disks if the device itself makes it available.
  • Microsoft Windows Server guests can use SCSI-3 persistent reservation to form a Microsoft Failover Cluster in an upcoming Oracle VM Paravirtual Drivers for Microsoft Windows release. See the Oracle VM Paravirtual Drivers for Microsoft Windows documentation for information about the availability of failover cluster capabilities on specific Microsoft Operating System versions.

Dom0 kernel upgraded
The dom0 kernel for Oracle VM Server is updated to Oracle Unbreakable Enterprise Kernel Release 4 Quarterly Update 2 in this release.

Package additions and updates
  • The ovmport-1.0-1.el6.4.src.rpm package is added to the Oracle VM Server ISO to support Microsoft Clustering and enable communication between Dom0 and DomU processes using the libxenstore API.
  • The Perl package is updated to perl-5.10.1-141.el6_7.1.src.rpm.
  • The Netscape Portable Runtime (NSPR) package is updated to nspr-4.11.0-1.el6.x86_64.rpm.
  • The openSCAP package is updated to openscap-1.2.8-2.0.1.el6.rpm.
  • The Linux-firmware package is updated to linux-firmware-20160616-44.git43e96a1e.0.12.el6.src.rpm.

Performance and Scalability

Oracle VM Manager performance enhancements
This release enhances the performance of Oracle VM Manager by reducing the number of non-critical events that Oracle VM Server sends to Oracle VM Manager when a system goes down.

Note

If you are running a large Oracle VM environment, it is recommended to increase the amount of memory allocated to the Oracle WebLogic Server. This ensures that adequate memory is available when required. See Increasing the Memory Allocated to Oracle WebLogic Server in the Oracle VM Administration Guide for more information.

Oracle VM Server for x86 performance optimization
For information on performance optimization goals and techniques for Oracle VM Server for x86, see Optimizing Oracle VM Server for x86 Performance, on Oracle Technology Network at: http://www.oracle.com/technetwork/server-storage/vm/ovm-performance-2995164.pdf.

Xen 4.4.4 performance and scalability updates
  • Improved memory allocation: Host system performance is improved by releasing memory more efficiently when tearing down domains, for example, migrating a virtual machine from one Oracle VM Server to another or deleting a virtual machine. This ensures that the host system can manage other guest systems more effectively without experiencing issues with performance.
  • Improved aggregate performance: Oracle VM Server now uses ticket locks for spinlocks, which improves aggregate performance on large scale machines with more than four sockets.
  • Improved performance for Windows and Solaris guests: Microsoft Windows and Oracle Solaris guests with the HVM or PVHVM domain type can now specify local APIC vectors to use as upcall notifications for specific vCPUs. As a result, the guests can more efficiently bind event channels to vCPUs.
  • Improved workload performance: Changes to the Linux scheduler ensure that workload performance is optimized in this release.
  • Improved grant locking: Xen-netback multi-queue improvements take advantage of the grant locking enhancements that are now available in Oracle VM Server Release 3.4.2.
  • Guest disk I/O performance improvements: Block scalability is improved through the implementation of the Xen block multi-queue layer.

Usability

Oracle VM Manager Rule for Live Migration
To prevent failure of live migration, and subsequent issues with the virtual machine environment, a rule has been added to Oracle VM Manager, as follows:

Oracle VM Manager does not allow you to perform a live migration of a virtual machine to or from any instance of Oracle VM Server with a Xen release earlier than xen-4.3.0-55.el6.22.18. This rule applies to any guest OS.

Table 3.1 Live Migration Paths between Oracle VM Server Releases using Oracle VM Manager Release 3.4.2


Where the live migration path depends on the Xen release, you should review the following details:

Xen Release (from)                       | Xen Release (to)              | Live Migration Available?
xen-4.3.0-55.el6.x86_64                  | xen-4.3.0-55.el6.0.17.x86_64  | No
xen-4.3.0-55.el6.22.18.x86_64 and newer  | xen-4.3.0-55                  | Yes

For example, as a result of this live migration rule, all virtual machines in an Oracle VM server pool running Oracle VM Server Release 3.3.2 with Xen version xen-4.3.0-55.el6.22.9.x86_64 must be stopped before migrating to Oracle VM Server Release 3.4.2.
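
One thing that is easy to trip over here is that 22.9 is older than 22.18, even though a plain string comparison says otherwise.  Here is a small Python sketch of the rule that compares the pieces numerically.  It is a simplification of my own, not Oracle’s logic and not a full RPM version comparison: it only understands package names shaped like the ones quoted above, and the function names are made up.

```python
import re

# Matches names like xen-4.3.0-55.el6.22.18.x86_64 (arch suffix optional).
_XEN_RE = re.compile(
    r"^xen-(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.el6((?:\.\d+)*)(?:\.x86_64)?$"
)


def _ints(text):
    """Turn '4.3.0' or '.22.18' into a tuple of ints; '' becomes ()."""
    return tuple(int(part) for part in text.split(".") if part)


def xen_key(package):
    """Return a sortable key (version, release, el6 suffix) for a Xen package."""
    match = _XEN_RE.match(package)
    if match is None:
        raise ValueError(f"unrecognized Xen package name: {package}")
    version, release, suffix = match.groups()
    return (_ints(version), _ints(release), _ints(suffix))


# Minimum release named in the live migration rule.
MINIMUM = xen_key("xen-4.3.0-55.el6.22.18")


def live_migration_allowed(source, target):
    """Both ends must be at or above the minimum Xen release."""
    return xen_key(source) >= MINIMUM and xen_key(target) >= MINIMUM


if __name__ == "__main__":
    old = "xen-4.3.0-55.el6.22.9.x86_64"
    new = "xen-4.3.0-55.el6.22.18.x86_64"
    print(live_migration_allowed(old, new))  # False: 22.9 is older than 22.18
    print(live_migration_allowed(new, new))  # True
```

You would feed it the package names reported on each Oracle VM Server before planning the upgrade.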

Tip

Run the following command on Oracle VM Server to find the Xen version:

# rpm -qa | grep "xen"

PVHVM hot memory modification
As of this release, it is possible to modify the memory allocated to running PVHVM guests without a reboot. Additionally, Oracle VM Manager now allows you to set the allocated memory to a value that is different to the maximum memory available.

Note
  • Hot memory modification is supported on x86-based PVHVM guests running on Linux OS and guests running on Oracle VM Server for SPARC. For x86-based PVHVM guests running on Oracle Solaris OS, you cannot change the memory if the virtual machine is running.
  • See the Oracle VM Paravirtual Drivers for Microsoft Windows documentation for information about the availability of hot memory modification on PVHVM guests that are running a Microsoft Windows OS. You must use a Windows PV Driver that supports hot memory modification or you must stop the guest before you modify the memory.
  • Oracle VM supports hot memory modification through Oracle VM Manager only. If you have manually created unsupported configurations, such as device passthrough, hot memory modification is not supported.

Security

  • Oracle MySQL patch update: This release of Oracle VM includes the July 2016 Critical Patch Update for MySQL. (23087189)
  • Oracle WebLogic patch update: This release of Oracle VM includes the July 2016 Critical Patch Update for WebLogic. (23087185)
  • Oracle Java patch update: This release of Oracle VM includes the July 2016 Critical Patch Update for Java. (23087198).
  • Xen security advisories: The following Xen security advisories are included in this release:
    • XSA-154 (CVE-2016-2270)
    • XSA-170 (CVE-2016-2271)
    • XSA-172 (CVE-2016-3158 and CVE-2016-3159)
    • XSA-173 (CVE-2016-3960)
    • XSA-175 (CVE-2016-4962)
    • XSA-176 (CVE-2016-4480)
    • XSA-178 (CVE-2016-4963)
    • XSA-179 (CVE-2016-3710 and CVE-2016-3712)
    • XSA-180 (CVE-2014-3672)
    • XSA-182 (CVE-2016-6258)
    • XSA-185 (CVE-2016-7092)
    • XSA-187 (CVE-2016-7094)
    • XSA-188 (CVE-2016-7154)

 

 

Hands on with FireEye

I recently had a chance to get some soak time with some of FireEye’s suite of cyber security hardware at a customer site.  They deployed NX, HX and CM appliances into their network.  DTI (Dynamic Threat Intelligence) was also purchased; I’ll go into that more in a later post.  Following is an eye chart of FireEye’s comprehensive suite of products as well as a more in depth description of the products that were deployed at this particular customer site:

 

[Image: FireEye product suite overview]

NX

The NX appliance is responsible for monitoring and stopping web based attacks, zero day web exploits and multi-protocol callbacks.  What this means is that the appliance is constantly monitoring traffic coming into your network.  It looks for suspicious activity based on known exploits and how they work (e.g. modifying the registry to turn off the firewall, turning off anti-virus or spawning multiple processes and deleting the original executable to hide itself).  Once it finds something suspicious, it can analyze the behavior of the potential threat using its Multi-Vector Virtual Execution (MVX) engine.  The MVX engine will detonate the payload in an isolated and heavily instrumented VM environment where it can log exactly what the exploit does and how it does it.  Once it has this information and it has identified the exploit, it can automatically block it from getting into your network.

 

HX/HXD

The HX/HXD appliances are used to monitor endpoints (Windows desktops/laptops, servers or even cellphones and tablets) for compromises.  They monitor all endpoints across the entire organization at once and are able to correlate suspicious activity.  Once a threat is identified, you then have the option of downloading a triage package that consists of detailed information about what the system was doing, or even isolating or containing an endpoint from the network so it can’t cause any additional harm to the environment.  The appliances are typically deployed both in the internal network and the DMZ.  This gives the additional ability to protect remote endpoints that connect externally as well as internal ones.

 

CM

The CM appliance is basically the command center for FireEye that is able to communicate with all other appliances and provide a comprehensive view of what is going on.  It has the ability to reach into email, file storage, endpoint, network and mobile platforms and correlate activities in a single pane of glass.  One of the big benefits of this product is its ability to stop multi-vector attacks that span multiple platforms.  By deploying the FireEye NX, EX, FX, HX and AX series together with the FireEye CM series, the analysis of blended threats, such as pinpointing a spear-phishing email used to distribute malicious URLs, and correlating a perimeter alert to the endpoint, becomes possible.

 

The vast majority of customers that purchase HX appliances also purchase DTI for its obvious advantages.  The MVX engine is the really cool part of what FireEye has to offer.  Below is a description of MVX:

The FireEye Malware Protection System features dynamic, real-time analysis for advanced malware using our patent-pending, multi-flow Multi-Vector Virtual Execution (MVX) engine. The MVX engine captures and confirms zero-day, and targeted APT attacks by detonating suspicious files, Web objects, and email attachments within instrumented virtual machine environments.

The MVX engine performs multi-flow analysis to understand the full context of an advanced targeted attack. Stateful attack analysis is critical to trigger analysis of the entire attack lifecycle, from initial exploit to data exfiltration. This is why point products that focus on a single attack object (e.g., malware executable (EXE), dynamic linked library (DLL), or portable document format (PDF) file types) will miss the vast majority of advanced attacks as they are blind to the full attack lifecycle.

 

The customer was able to deploy the endpoint software to hundreds of agents automatically by using Group Policy profiles to push out the installer and run it silently.  Once that was done, we tested containment which essentially takes the machine off the network until you can decide how you want to react.  If the endpoint has been determined to be safe, you can then un-contain it through the GUI.
 
We did run into what seems to be a rather interesting glitch during testing of the containment process.  We contained a machine that was originally on the internal network.  We then placed it on the VPN and due to some initial configuration issues, it was unable to contact the HXD appliance to receive the un-contain instruction.  The result was a machine that could not communicate on the network, with no way to fix it.  Surprisingly, we were able to revert to a previous system restore point from before the agent had been installed.  Thus we circumvented the whole containment process.
 
I’ve not yet decided if this truly is a bad thing since reverting to a point before the agent install would essentially rid the machine of the exploit as well.  Regardless, I was a bit dismayed at how easy it was to bypass containment.  Assuming the end user was not malicious (they wouldn’t have infected their own machine) I’m not sure this is a really viable scenario.  One possibility would be an exploit that was aware of the mechanics of how the agent works and communicates- in which case it could theoretically block communication to the HX.  This would manifest as an endpoint that hasn’t checked in for a while and would probably arouse suspicion as well.
 
In summary- I’m extremely impressed with FireEye’s ability to detect and block very complex and coordinated attacks where other products fall down completely.  The MVX engine in particular is something to behold- the level of instrumentation of an exploit is truly incredible.  To take this a step further, if you purchase the AX appliance, this also gives you the ability to do forensics against the exploit including a video of the exploit during detonation (along with all the other telemetry that is captured).  This could prove to be invaluable to root cause analysis in situations where it’s required to determine exactly how an exploit works.

Nimble CASL filesystem overview


 

As a Nimble partner, Collier IT has worked with a number of customers to install, configure and deploy Nimble arrays.  We also have a CS-300 and a CS-500 in our lab so I get to use them on a daily basis.  I wanted to take some time to give a quick overview of how Nimble’s CASL filesystem works and how it differentiates itself from other storage vendor technologies.

 

These days, you can’t throw a rock in the Storage industry without hitting about a dozen new vendors every month touting their flashy new storage array and how it’s better than all the others out there.  I’ve used a lot of them and they all seem to share some common pitfalls:

 

 


  • Performance drops as the array fills up, sometimes at only 80% of capacity or less!

Nobody in their right mind would intentionally push their SAN up to 100% utilization.  It’s not that hard though to lose track of thin provisioning and how much actual storage you have left.  As most storage arrays get close to full capacity, they start to slow down.  The reason for this is based on how the filesystem architecture is designed to handle free space and how it writes to disk.  Consider NetApp’s WAFL filesystem for example.  It is based on the WIFS (Write In Free Space) concept.  As new blocks are written to disk, instead of overwriting blocks in place (which adds a TON of overhead), the data is redirected to a new location on disk and is written as a full stripe across the disk group.  This allows for fast sequential reads because the blocks are laid out in a very contiguous manner.  Once the amount of free space starts to diminish, you find that what’s left is scattered across different locations on the array.  It’s not contiguous anymore and there is significantly more time spent seeking for reads and writes, slowing down the array.

One of the big benefits of the CASL filesystem is that even though it is also a WIFS filesystem, it does not fill holes.  Instead, it uses a lightweight sweeping process to consolidate the holes into full stripe writes in free space in the background.  The filesystem is designed from the ground up to run these sweeps very efficiently, and it also utilizes flash disk to speed up the process even more.  What this does is allow the array to ALWAYS write in a full stripe.  What’s more- CASL is also able to combine blocks of different sizes into a full stripe.  The big benefit here is that you get a very low overhead method of performing compression inline that doesn’t slow the array down.

 

 


  • Write performance is inconsistent, especially when dealing with a lot of random write patterns

The CASL filesystem has been designed from the ground up to deal with one of the Achilles’ heels of most arrays: the cache design.  Typical arrays will employ flash or DRAM as a cache mechanism for a fast write acknowledgement.  This is all well and good until you get into a situation where you have a lot of random reads and writes over a sustained period of time at a high rate of throughput.  That describes most I/O workloads now with virtualization and storage consolidation.  We aren’t just streaming video anymore folks- we’re updating dozens of storage profiles simultaneously, each with their own read and write characteristics.  The problem with other storage array caching mechanisms is that once this sustained load gets to the point where the controller(s) can’t flush the cache to spinning disk as fast as the data is coming in, you get throttled.

Nimble has a different approach to caching that was designed from the ground up to be not only scalable but media agnostic.  It doesn’t matter if you’re writing to spinning disk or SSD.  Here’s a quick breakdown:

  1. Write is received at storage array and stored in the active controller’s NVRAM cache
  2. Write is mirrored to standby controller’s NVRAM
  3. At this point, the write is acknowledged
  4. Write is shadow copied to DRAM
  5. While the data is in DRAM, it is analyzed for compression.  If the data is a good candidate for compression, the array determines the best compression algorithm to use for that type of data and it is compressed.  If it’s not a good candidate for compression (JPG for example) then it will not be compressed at all.
  6. Data is grouped into stripes of 4.5 MB and then written to disk in a full RAID stripe write operation

Here, the big performance benefits are mainly reduction in I/O to spinning disk and targeted inline compression.  This is achieved because we’re not blindly flushing data to disk as cache mechanisms fill up.  Instead we analyze the writes in memory, compress them inline and write them out to disk in a much more efficient manner.  The compression leverages the processing power on the array that is capable of compressing at 300 MB/sec per core or faster.  As a result- you experience orders of magnitude fewer IOPS from the controller to the disk due to both compressing of the data and the way data is written.  What would have been maybe 1000 IOPS is now reduced to as little as 10 IOPS in some cases!  This is why Nimble doesn’t have to spend a lot of money on 15k or even 10k SAS drives on the back end.
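To make the write path above a little more concrete, here is a rough Python sketch of the coalescing idea: compress a block only when it actually shrinks, then pack variable-sized blocks into 4.5 MB stripes.  This is only an illustration and not Nimble’s code.  The function names and the use of zlib are my own assumptions, and a real array chooses its compression per data type and handles RAID layout, none of which is modeled here.

```python
import os
import zlib

STRIPE_BYTES = int(4.5 * 1024 * 1024)  # stripe size quoted in the write path list


def maybe_compress(block: bytes) -> bytes:
    """Return the compressed block only if compression actually saves space."""
    packed = zlib.compress(block)  # zlib is a stand-in, not the array's algorithm
    return packed if len(packed) < len(block) else block


def coalesce(writes):
    """Pack variable-sized blocks into stripes; each stripe is one full write."""
    stripes, current, used = [], [], 0
    for block in writes:
        data = maybe_compress(block)
        # Start a new stripe when the next block would overflow the current one.
        if current and used + len(data) > STRIPE_BYTES:
            stripes.append(current)
            current, used = [], 0
        current.append(data)
        used += len(data)
    if current:
        stripes.append(current)
    return stripes


if __name__ == "__main__":
    # 1000 small random (incompressible) 4 KB host writes
    host_writes = [os.urandom(4096) for _ in range(1000)]
    stripes = coalesce(host_writes)
    print(f"{len(host_writes)} host writes -> {len(stripes)} stripe write(s)")
```

Even with incompressible data, 1000 writes of 4 KB add up to roughly 4 MB, which fits inside a single 4.5 MB stripe.  That is the intuition behind the large drop in back end IOPS, although a full stripe write still touches every disk in the RAID group.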

To protect against data loss before the data is written to disk, both controllers have super capacitors that will hold the contents of NVRAM safe until you restore power and then write to disk.  Redundant controllers also guard against data corruption/loss in the event of the primary controller failure.


  • Poor SSD write life

 A common problem since SSDs have come about is that they eventually “wear out” due to the fact that the NAND flash substrate can only sustain a finite number of erase cycles before it becomes unusable.  Without getting into all the details of things like write amplification, garbage collection and flash cell degradation, understand that the less you write to an SSD generally the better off it will be.  Due to the nature of how typical arrays utilize SSD as a cache layer, inherently there will be a lot of writing.

As I described earlier when talking about write performance, Nimble designed their filesystem from the ground up to minimize the number of writes to SSD or just disk in general.  A side benefit of that is the fact that they also don’t need to use more expensive SLC SSDs in their arrays due to the lower number of writes needed.


  • My read performance sucks, especially with random reads

Typical storage arrays employ multiple caching layers to help boost read performance.  It is understood that the worst case scenario is having to read all data from slow spinning disk.  Even the fastest 15k SAS drives can only sustain about 150-170 IOPS per drive.  So the standard drill is when a read request comes in, the cache layer is queried for the data and if it exists there and hasn’t been modified, it is sent to the client.  This is the fastest read operation.  Next you go to the secondary cache- typically SSD(s).  The same thing happens- if the data is there, it’s read from slower SSD and served up to the client.  Finally if the data isn’t cached or if it has changed since it was in cache then you experience a “cache-miss” and data is read from slow spinning disk.

Nimble is smarter about how it handles caching.  First NVRAM (much faster than DRAM) is checked for the data.  Then DRAM is checked.  Flash cache is the next step- if data is found there it is checksummed and uncompressed on the fly, then returned.  Finally spinning disk serves up any missing data if none of it is in cache.  The beautiful thing about CASL is that it will keep track of read patterns and make a decision on whether or not the data that was just served up from disk should be held in a higher level cache.
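The lookup order described above can be sketched in a few lines of Python.  This is a simplified model under my own assumptions: each tier is a plain dictionary, the checksum and decompression steps on the flash tier are left out, and the promotion rule (copy a block into flash once it has been read from disk a couple of times) is a hypothetical stand-in for whatever heuristics CASL really uses.

```python
from collections import Counter


class TieredReader:
    """Toy model of a read path: NVRAM -> DRAM -> flash cache -> disk."""

    def __init__(self, nvram, dram, flash, disk, promote_after=2):
        self.tiers = [("nvram", nvram), ("dram", dram), ("flash", flash)]
        self.flash = flash
        self.disk = disk
        self.promote_after = promote_after  # hypothetical promotion threshold
        self.disk_reads = Counter()         # per-block count of disk reads

    def read(self, key):
        """Return (tier_name, data) from the first tier that holds the block."""
        for name, tier in self.tiers:
            if key in tier:
                return name, tier[key]
        data = self.disk[key]  # cache miss: fall through to spinning disk
        self.disk_reads[key] += 1
        if self.disk_reads[key] >= self.promote_after:
            self.flash[key] = data  # block looks hot, keep a copy in flash
        return "disk", data


if __name__ == "__main__":
    reader = TieredReader(nvram={}, dram={}, flash={}, disk={"blk7": b"payload"})
    for _ in range(3):
        print(reader.read("blk7")[0])
```

In this toy version the first two reads of a block come from disk and the third is served from flash, which is the general behavior the paragraph above describes: read patterns decide what earns a place in the faster tiers.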

I haven’t talked about all of the technologies that CASL employs let alone some of the other benefits of owning Nimble storage.  Suffice it to say I’m excited about the future of Nimble.

ODA Patching – get ahead of yourself?

I was at a customer site deploying an X5-2 ODA.  They are standardizing on the 12.1.2.6.0 patch level.  Even though 12.1.2.7.0 is currently the latest, they don’t want to be on the bleeding edge.  Recall that the 12.1.2.6.0 patch doesn’t include infrastructure patches (mostly firmware) so you have to install 12.1.2.5.0 first, run the --infra patch to get the firmware and then update to 12.1.2.6.0.

 

We unpacked the 12.1.2.5.0 patch on both systems and then had an epiphany.  Why don’t we just unpack the 12.1.2.6.0 patch as well and save some time later?  What could possibly go wrong?  Needless to say, when we went to install or even verify the 12.1.2.5.0 patch it complained as follows:

ERROR: Patch version must be 12.1.2.6.0

 

Ok, so there has to be a way to clean that patch off the system so I can use 12.1.2.5.0 right?  I stumbled across the oakcli manage cleanrepo command and thought for sure that would fix things up nicely.  Ran it and I got this output:

 


[root@CITX-5ODA-ODABASE-NODE0 tmp]# oakcli manage cleanrepo --ver 12.1.2.6.0
Deleting the following files...
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OAK/12.1.2.6.0/Base
Deleting the files under /DOM0OAK/12.1.2.6.0/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/Seagate/ST95000N/SF04/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/Seagate/ST95001N/SA03/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/WDC/WD500BLHXSUN/5G08/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H101860SFSUN600G/A770/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/Seagate/ST360057SSUN600G/0B25/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/H106060SDSUN600G/A4C0/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/H109060SESUN600G/A720/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/HUS1560SCSUN600G/A820/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/HSCAC2DA6SUN200G/A29A/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/HSCAC2DA4SUN400G/A29A/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/STEC/ZeusIOPs-es-G3/E12B/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/STEC/Z16IZF2EUSUN73G/9440/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/ORACLE/DE2-24P/0018/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/ORACLE/DE2-24C/0018/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/ORACLE/DE3-24C/0291/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/11.05.03.00/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/11.05.03.00/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X4370-es-M2/3.0.16.22.f-es-r100119/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/H109090SESUN900G/A720/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/STEC/Z16IZF4EUSUN200G/944A/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H7240AS60SUN4.0T/A2D2/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H7240B520SUN4.0T/M554/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H7280A520SUN8.0T/P554/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/SUN/T4-es-Storage/0342/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/11.05.03.00/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x005d/4.230.40-3739/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0097/06.00.02.00/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/Mellanox/0x1003/2.11.1280/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X4170-es-M3/3.2.4.26.b-es-r101722/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X4-2/3.2.4.46.a-es-r101689/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X5-2/3.2.4.52-es-r101649/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/HMP/2.3.4.0.1/Base
Deleting the files under /DOM0HMP/2.3.4.0.1/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/IPMI/1.8.12.4/Base
Deleting the files under /DOM0IPMI/1.8.12.4/Base
Deleting the files under /JDK/1.7.0_91/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/ASR/5.3.1/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OPATCH/12.1.0.1.0/Patches/6880880
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OPATCH/12.0.0.0.0/Patches/6880880
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OPATCH/11.2.0.4.0/Patches/6880880
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/GI/12.1.0.2.160119/Patches/21948354
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/12.1.0.2.160119/Patches/21948354
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/11.2.0.4.160119/Patches/21948347
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/11.2.0.3.15/Patches/20760997
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/11.2.0.2.12/Patches/17082367
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OEL/6.7/Patches/6.7.1
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OVM/3.2.9/Patches/3.2.9.1
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OVS/12.1.2.6.0/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/11.05.02.00/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/11.05.02.00/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/GI/12.1.0.2.160119/Base

 

So I assumed that this fixed the problem.  Nope…

 


[root@CITX-5ODA-ODABASE-NODE0 tmp]# oakcli update -patch 12.1.2.5.0 --verify

ERROR: Patch version must be 12.1.2.6.0

 

 

Ok so more searching the CLI manual and the oakcli help pages came up with bupkis.  So I decided to do an strace of the oakcli command I had just run.  As usual- there was a LOT of garbage I didn’t care about or didn’t know what it was doing.  I did find however that it was reading the contents of a file that looked interesting to me:

 


[pid 5509] stat("/opt/oracle/oak/pkgrepos/System/VERSION", {st_mode=S_IFREG|0777, st_size=19, ...}) = 0
[pid 5509] open("/opt/oracle/oak/pkgrepos/System/VERSION", O_RDONLY) = 3
[pid 5509] read(3, "version=12.1.2.6.0\n", 8191) = 19
[pid 5509] read(3, "", 8191) = 0
[pid 5509] close(3) = 0
[pid 5509] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
[pid 5509] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f159799d000
[pid 5509] write(1, "\n", 1
) = 1
[pid 5509] write(1, "ERROR: Patch version must be 12."..., 40ERROR: Patch version must be 12.1.2.6.0
) = 40
[pid 5509] exit_group(0) = ?
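If you want to cut through the noise in a trace like the one above, a small filter that lists only the files a command opened successfully can help.  Here is a rough Python sketch.  The regular expression is my own and only covers the common open/openat line shapes, so treat it as a starting point rather than a complete strace parser.

```python
import re

# Matches lines such as:
#   [pid 5509] open("/some/path", O_RDONLY) = 3
#   openat(AT_FDCWD, "/some/path", O_RDONLY|O_CLOEXEC) = 3
OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:AT_FDCWD, )?"([^"]+)", [^)]*\) = (-?\d+)'
)


def opened_files(trace_lines):
    """Return paths that were opened successfully, in first-seen order."""
    seen = []
    for line in trace_lines:
        match = OPEN_RE.search(line)
        if not match:
            continue
        path, result = match.group(1), int(match.group(2))
        if result >= 0 and path not in seen:  # negative result = failed open
            seen.append(path)
    return seen


if __name__ == "__main__":
    sample = [
        '[pid 5509] stat("/opt/oracle/oak/pkgrepos/System/VERSION", {...}) = 0',
        '[pid 5509] open("/opt/oracle/oak/pkgrepos/System/VERSION", O_RDONLY) = 3',
        '[pid 5509] read(3, "version=12.1.2.6.0\\n", 8191) = 19',
    ]
    for path in opened_files(sample):
        print(path)
```

Feed it the lines of a saved trace (strace can write one with -o, and -f follows child processes) and the interesting files tend to stand out quickly.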

 

There were a dozen or so lines after that, but I had what I needed.  Apparently /opt/oracle/oak/pkgrepos/System/VERSION contains the version of the latest patch that has been unpacked.  The system software version is kept somewhere else because after I unpacked the 12.1.2.6.0 patch, I ran an oakcli show version and it reported 12.1.2.5.0.  But the VERSION file referenced earlier said 12.1.2.6.0.  I assume that when I unpacked the 12.1.2.6.0 patch, it updated this file.  So what I wound up doing was changing the VERSION file back to 12.1.2.5.0 as well as deleting the folder /opt/oracle/oak/pkgrepos/System/12.1.2.6.0.  Once I did this, everything worked as I expected.  I was able to verify and install the --infra portion of 12.1.2.5.0 and continue on my merry way.
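For anyone who would rather script the workaround than edit files by hand, here is a rough Python sketch of the same steps.  To be clear, this is not an Oracle tool and the function names are my own.  It assumes the VERSION file holds a single version=x.x.x.x.x line (which is what the strace output showed) and it keeps a backup copy of the file before touching anything.  Try it against a scratch copy of the directory first.

```python
import shutil
from pathlib import Path


def read_version(system_dir):
    """Parse the single 'version=...' line in the VERSION file."""
    text = (Path(system_dir) / "VERSION").read_text()
    key, _, value = text.strip().partition("=")
    if key != "version" or not value:
        raise ValueError(f"unexpected VERSION contents: {text!r}")
    return value


def roll_back_unpacked_patch(system_dir, unwanted, wanted):
    """Undo an unpacked patch: remove its folder and reset VERSION."""
    system_dir = Path(system_dir)
    version_file = system_dir / "VERSION"
    current = read_version(system_dir)
    if current != unwanted:
        raise RuntimeError(f"VERSION is {current}, expected {unwanted}")
    # Keep a copy of the original file in case this needs to be reversed.
    shutil.copy2(version_file, system_dir / "VERSION.bak")
    # Remove the folder for the patch that was unpacked too early.
    shutil.rmtree(system_dir / unwanted, ignore_errors=True)
    version_file.write_text(f"version={wanted}\n")


if __name__ == "__main__":
    # On the ODA in this post the directory was /opt/oracle/oak/pkgrepos/System
    print(read_version("/opt/oracle/oak/pkgrepos/System"))
```

Calling roll_back_unpacked_patch with the System directory, "12.1.2.6.0" and "12.1.2.5.0" mirrors what I did manually.  The check at the top makes it refuse to run if VERSION doesn’t hold the version you expect to remove.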

 

This highlights the fact that there isn’t a known way (to me at least) to delete an unpacked patch via oakcli or any python scripts I’ve been able to find yet.  Also- as an aside I tried just deleting the VERSION file assuming it would be rebuilt by oakcli and it didn’t.  I got this:

 


[root@CITX-5ODA-ODABASE-NODE0 System]# oakcli update -patch 12.1.2.5.0 --verify
ERROR : Couldn't find the VERSION file to extract the current allowed version

 

So I just recreated the file and all was good.  I was hoping that the oak software didn’t maintain some sort of binary formatted database that kept track of all this information- I think I got lucky in this case.  Hope this helps someone out in a pinch!

New Oracle ODA X6 configurations officially released today!

Today, Oracle has announced the release of two new ODA configurations squarely targeted at the S in SMB.  I blogged about this back on June 9th here.  A few differences to note:

  • Two new commands replace oakcli (oakcli is gone).
    • odacli – perform “lifecycle” activities for the ODA appliance (provisioning and configuring)
    • odaadmcli – administer and configure the running appliance attributes
  • All new web based user interface used to deploy appliance.  Command line obviously still available but not required anymore to deploy.
  • No more virtualization or shared storage on the Small and Medium configuration

 

I’m not sure if I’ll have a chance to lay hands on the new hardware any time soon but if I do I’ll definitely give first impressions here!

ODA X6-2 in the wild!


It looks like Oracle has deployed their newest server (the X6-2) into the ODA appliance lineup now.  It’s already an option on the Exadata, BDA and ZDLRA.  There are now 3 different configurations available, 2 of which don’t include shared storage and have a much lower price point.  You can also run Oracle Database SE2 or EE on the two smaller configurations; however, neither one offers the virtualization option that’s been around since the original V1 ODA.

 

Here are the 3 options:

Oracle Database Appliance X6-2S ($18k):
One E5-2630 v4 2.2GHz 10 core CPU
6.4 TB (2 x 3.2 TB) NVMe SSDs *
128 GB (4 x 32 GB) DDR4-2400 Main Memory **
Two 480 GB SATA SSDs (mirrored) for OS
Two onboard 10GBase-T Ethernet ports
Dual-port 10GbE SFP+ PCIe

Notes: 
* You can add up to 2 more NVMe SSDs for a total of 4
** An optional memory expansion kit is available that brings this configuration up to 384GB

 

Oracle Database Appliance X6-2M ($24k):
Two E5-2630 v4 2.2GHz 10 core CPUs
6.4 TB (2 x 3.2 TB) NVMe SSDs *
256 GB (8 x 32 GB) DDR4-2400 Main Memory **
Two 480 GB SATA SSDs (mirrored) for OS
Four onboard 10GBase-T Ethernet ports
Dual-port 10GbE SFP+ PCIe

Notes:
* You can add up to 2 more NVMe SSDs for a total of 4
** An optional memory expansion kit is available that brings this configuration up to 768GB

 

Oracle Database Appliance X6-2HA (?):
TBD – information about this configuration isn’t available yet.  More info coming soon!

X5-2 ODA upgrade from 12.1.2.5.0 to 12.1.2.6.0 observations


More fun with patching!  So this time I’m doing a fresh virtualized install and I decided to take my own sage advice of installing 12.1.2.5.0 first to get the firmware patches.  I ran into a bunch of other issues which will be the topic of a different post but I digress.  I got 12.1.2.5.0 fully installed, ODA_BASE deployed, everything was happy.

 

Remember that starting with version 12.1.2.6.0, you have to patch each node separately with the --local option for the infra patches.  So I started the patch on node 0 and it got almost all the way to the end at step 12 where oakd is being patched.  I ran into the “known issue” in 888888.1 item 9:

9.  During the infra patching, after step 12 completed, IPMI, HMP done, if it appeared to be hang during Patching OAK with the following two lines
                               INIT: Sending processes the TERM signal
                               INIT: no more processes left in this runlevel
JDK is not patched, the infra patching is not complete to the end.  
Workaround:  To reboot the appeared hang node manually, then run 
# oakcli update -patch 12.1.2.6 --clean

# oakcli update -patch 12.1.2.6.0 --infra --local
To let it complete the infra patch cleanly.  

I waited about 30 minutes at this step before I started to wonder, and sure enough after checking some log files in /opt/oracle/oak/onecmd/tmp/ it thought oakd was fully patched.  What I found is that oakd gets whacked because the patch doesn’t fully complete.  After doing the reboot that’s recommended in the workaround above, sure enough oakd was not running.  What’s more- now when I boot ODA_BASE the console doesn’t get to the login prompt and you can’t do anything there, even though you can ssh in just fine.  So I ran the --clean option then kicked off the patch again.  This time it complained that oakd wasn’t running on the remote node.  It was in fact running on node1 but node0 oakd was not.  I suspect that when the ODA communicates to oakd between nodes it’s using the local oakd to do so.

 

So I manually restarted oakd by running /etc/init.d/init.oak restart and then oakd was running.  I rebooted ODA_BASE on node0 just to be sure everything was clean then kicked off the infra patch again.  This time it went all the way through and finished.  The problem now is that the ODA_BASE console is unresponsive no matter what I do so I’ll be opening a case with Oracle support to get a WTF.  I’ll update this post with their answer/solution.  If I were a betting man I’d say they’ll tell me to update to 12.1.2.7.0 to fix it.  We’ll see…

 

As an aside- one of the things that 12.1.2.6.0 does is perform an in-place upgrade of Oracle Linux 5.11 to version 6.7 for ODA_BASE.  I’ve never done a successful update that way and in fact, Red Hat doesn’t support it.  I guess I can see why they would want to do an update rather than a fresh install but it still feels very risky to me.

ODA Software v12.1.2.6.0 possible bug

I’ve been updating some X5-2 ODA’s for a customer of mine to version 12.1.2.6.0 in preparation for deployment.  I came across a stubborn bug that proved to be a little tricky to solve.  I was having a problem with ODA_BASE not fully completing the boot cycle after initial deployment and as a result I couldn’t get into the ODA_BASE console to configure firstnet.

 

The customer has some strict firewall rules for the network that these ODA’s sit in so I also couldn’t connect to the VNC console on port 5900.  If you’re gonna implement 12.1.2.6.0 on an X5-2 ODA, I’d recommend installing 12.1.2.5.0 first and then updating to 12.1.2.6.0.  I’ve not been able to determine for sure what the problem was- I originally thought it had something to do with firmware because 12.1.2.6.0 doesn’t update any of the firmware due to a big ODA_BASE OS version update from 5.11 to 6.7.  Apparently the thought was that the update would either be too big or take too long to download/install so they skip firmware in this release.  Here is the readme for the 12.1.2.6.0 update:

 

This Patch bundle consists of the Jan 2016 12.1.0.2.160119 GI Infrastructure and RDBMS – 12.1.0.2.160119, 11.2.0.4.160119, and 11.2.0.3.15.  The Grid Infrastructure release 12.1.0.2.160119 upgrade is included in this patch bundle.  The database patches 12.1.0.2.160119, 11.2.0.4.160119, 11.2.0.3.15 and 11.2.0.2.12 are included in this patch bundle. Depending on the current version of the system being patched, usually all other infrastructure components like Controller, ILOM, BIOS, and disk firmware etc will also be patched; due to this release focus on the major OS update from OL5 to OL6.7; all other infrastructure components will not be patches.  In a virtualized environment, usually all other infrastructure components on dom0 will also be patched; in this release, we skip them.  To avoid all other infrastructure components version too far behind, the minimum version required is 12.1.2.5.0 for infra and GI.  As part of the Appliance Manager 12.1.2.6, a new parameter has been introduced to control the rolling of ODA patching from one node to another.  This is the first release to provide this functionality to allow you to control when the second node to be patched.

 

I wound up having to re-image to 12.1.2.5.0 and then upgraded as I stated above.  That fixed the problem.  I’m not sure- it may have been a bad download or a glitch in the ODA_BASE bundle because I checked against our own X5-2 ODA and it has the same problem with a fresh install of 12.1.2.6.0 and all of the firmware is up to date.  In hindsight, I probably should have given more credence to this message but it would have added hours onto the install process.  As it is, it more than doubled the time because of the troubleshooting needed.  Lesson learned…