Setting up fail2ban on a Linux server

Fail2ban is an open-source, cross-platform tool that leverages your firewall to block persistent threats from actors trying to break into your server.  There are a number of services that typically run by default on most standard Linux distros, and those are exactly what attackers probe for.  I work mostly with Oracle Linux, a derivative of Red Hat Enterprise Linux.  I like it because it’s free to download and use; you only pay for support if you want it.

Setting it up is fairly straightforward.  Out of the box it comes with a bunch of rule definitions that you can borrow and adjust to fit your needs.  There are a few things you have to do after installation to make it functional, and I’ll walk through a very basic set of steps that you can build on.

fail2ban is found in the Extra Packages for Enterprise Linux (EPEL) repository.  Make sure it’s enabled before trying to install or you won’t get very far.  If you want to install just that package without leaving the repository enabled, use this command to install:

# yum --enablerepo=ol7_developer_EPEL install fail2ban

If you want to leave the repository enabled for additional software installs or updates to fail2ban, there are a couple extra steps (this applies to Oracle Linux 7.x):

  • change the enabled= flag to 1 for the EPEL repository in /etc/yum.repos.d/public-yum-ol7.repo
name=Oracle Linux $releasever Development Packages ($basearch)
enabled=1  <<- make sure this is a 1
  • install the fail2ban software
# yum install fail2ban

Once the software successfully installs, you should have a new folder called /etc/fail2ban.  You need to create /etc/fail2ban/jail.local with the contents below:

[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/secure
bantime = -1
banaction = iptables-multiport
findtime = 900
maxretry = 3

This file enables the sshd jail and does the following:

  • identifies the path to the log file that sshd writes to when users try to log into the system
  • sets the ban time to forever (-1)
  • determines how many attempts over how many seconds should trigger a ban
    • findtime sets the window, in seconds, over which access attempts are counted
    • maxretry sets how many times one person is allowed to try to log in within that window before they’re banned

I’ve found these numbers seem to work well for me, YMMV.
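With jail.local in place, enable and start the service, then check that the jail is live.  This is a quick sketch using the stock systemd and fail2ban-client commands; the status output will vary by system:

```shell
# make fail2ban start at boot and start it now
systemctl enable fail2ban
systemctl start fail2ban

# show the sshd jail's current failure and ban counters
fail2ban-client status sshd
```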

NOTE: If you’re using selinux for added security, you’ll need to update your selinux policy and change the file context of jail.local so selinux doesn’t block you from making changes to it.  Here’s a quick overview of what I had to do:

    • update the selinux-policy* packages
# yum update -y selinux-policy*
    • change the context of the jail.local file to fail2ban_exec_t
# chcon -t fail2ban_exec_t jail.local
    • restart fail2ban and verify there are no selinux errors
# systemctl restart fail2ban

If you still have problems, drop me a message, or Google is always your friend!

Now, let’s make some final additions to make the IP address bans persist across restarts and reboots.

  • Create an empty file called /etc/fail2ban/persistent.bans
  • Modify /etc/fail2ban/action.d/iptables-multiport.conf as seen below.  The added lines are the cat/while pipeline at the end of actionstart (which replays saved bans whenever fail2ban starts) and the echo line under actionban (which records each new ban to persistent.bans).
# Fail2Ban configuration file
# Author: Cyril Jaquier
# Modified by Yaroslav Halchenko for multiport banning

[INCLUDES]

before = iptables-common.conf

[Definition]

# Option:  actionstart
# Notes.:  command executed once at the start of Fail2Ban.
# Values:  CMD
actionstart = <iptables> -N f2b-<name>
              <iptables> -A f2b-<name> -j <returntype>
              <iptables> -I <chain> -p <protocol> -m multiport --dports <port> -j f2b-<name>
              cat /etc/fail2ban/persistent.bans | awk '/^fail2ban-<name>/ {print $2}' \
              | while read IP; do iptables -I f2b-<name> 1 -s $IP -j <blocktype>; done

# Option:  actionstop
# Notes.:  command executed once at the end of Fail2Ban
# Values:  CMD
actionstop = <iptables> -D <chain> -p <protocol> -m multiport --dports <port> -j f2b-<name>
             <iptables> -F f2b-<name>
             <iptables> -X f2b-<name>

# Option:  actioncheck
# Notes.:  command executed once before each actionban command
# Values:  CMD
actioncheck = <iptables> -n -L <chain> | grep -q 'f2b-<name>[ \t]'

# Option:  actionban
# Notes.:  command executed when banning an IP. Take care that the
#          command is executed with Fail2Ban user rights.
# Tags:    See jail.conf(5) man page
# Values:  CMD
actionban = <iptables> -I f2b-<name> 1 -s <ip> -j <blocktype>
            echo "fail2ban-<name> <ip>" >> /etc/fail2ban/persistent.bans

# Option:  actionunban
# Notes.:  command executed when unbanning an IP. Take care that the
#          command is executed with Fail2Ban user rights.
# Tags:    See jail.conf(5) man page
# Values:  CMD
actionunban = <iptables> -D f2b-<name> -s <ip> -j <blocktype>

  • Restart fail2ban
# systemctl restart fail2ban
  • Check the log file to make sure everything looks ok and, in the case of selinux, that it allowed the configuration changes.
# journalctl -xe
  • Look at the firewall contents to see if any IP addresses have already been banned. If you have a system that’s directly connected to the internet and has port 22 forwarded you may see some in a matter of minutes.
# iptables -L -n

Happy hunting!


Log into an SFTP server with FileZilla using an SSH key


  1. create an account on the sftp server (optionally in a chroot, sftp-only environment for safety)
  2. generate an RSA SSH key with PuTTYgen
  3. save the key in both public and private key (.ppk) formats
  4. copy the public key to the appropriate location on the remote system you’re connecting to (usually ~username/.ssh/authorized_keys)
  5. copy the private .ppk key you created to the local system
  6. open FileZilla
  7. click File -> Site Manager
  8. click New Site and fill in the following:
    • Host: {address of sftp server}
    • Port: 22
    • Protocol: SFTP
    • Logon Type: Key file
    • User: {username created on the sftp server}
    • Key file: {browse to the location where you saved the .ppk file}
  9. click Connect
  10. in the left pane, cd to the location of the file you want to upload
  11. drag the file you want to upload from the left pane to the right pane and wait for the upload to complete
  12. close the program



Log into UNIX servers with a key via PuTTY

  • Create an RSA keypair with no passphrase using PuTTYgen
  • Save the public key
  • Save the private key (.ppk format)
  • Configure PuTTY to use your private key
    • Connection -> SSH -> Auth -> Private key file for authentication
  • Configure PuTTY to automatically log you in with your username
    • Connection -> Data -> Auto-login username
  • Save the profile in PuTTY
  • Copy your public key to ~/.ssh/authorized_keys on each server you want to connect to
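That last step, getting your public key into authorized_keys on the server, can be scripted if you still have a password login.  A minimal sketch; the key string below is a placeholder, and note that PuTTYgen’s key must first be exported in OpenSSH’s one-line format (shown in PuTTYgen’s top text box):

```shell
# append an OpenSSH-format public key line to authorized_keys,
# creating ~/.ssh with the permissions sshd requires
PUBKEY='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABexample user@laptop'   # placeholder key
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
echo "$PUBKEY" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```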

Oracle X7-2 HA ODA fiber interface issues

I was working with a customer to deploy an X7-2 HA ODA a while back.  They opted to use 10GbE fiber (as most customers do) for their public network interface.  One problem I ran into quite early on turned out to be a bug in the driver for those onboard SFP ports.  They can actually negotiate up to 25Gb in addition to 10Gb, and that’s what caused my problem: the fiber switch ports couldn’t do 25Gb, but that’s what the onboard adapters were trying to negotiate.

I applied the updated driver to the NIC and rebooted.  Nothing.  No link, no packets.  I verified with ethtool that the NIC was now only advertising 10Gb as its max speed and not 25 like before.  After some troubleshooting, I disabled autonegotiation and forced each of the two adapters to 10Gb.  A few seconds later the link came up and I was able to ping my default gateway!

After a reboot, however, same thing: no link.



I wound up having to put the following lines at the end of /etc/rc.local on each compute node:

# btbond1 - force em2/em3 to 10 gig Ethernet speed
/sbin/ethtool -s em2 autoneg off speed 10000
/sbin/ethtool -s em3 autoneg off speed 10000
sleep 10
ping -c 5 {default gateway}
sleep 10
ping -c 5 {default gateway}

This basically ensures the adapters get forced to 10Gb, then pings the gateway to “wake up” the interfaces at the end of the system boot process.  Make sure you don’t forget to limit the ping to 5 packets or something reasonable, otherwise guess what you’re going to see on your console every second until eternity?
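To confirm after boot that the forced settings stuck, ethtool’s status output is enough (em2 as the example interface from above):

```shell
# show negotiated speed, autoneg state and link status for em2
ethtool em2 | grep -E 'Speed|Auto-negotiation|Link detected'
```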

I’ve been assured this has been fixed in newer driver releases, and I’m pretty confident it will be, or there are going to be quite a few pissed off Oracle customers out there!

Rescan virtual disk in a VMware linux VM without reboot

I’ve run into this situation a number of times.  I get a request from a user to resize a filesystem and add some space to it.  Thankfully, through the magic of virtualization I can change the size of the disk on the fly (assuming there are no snapshots in effect).


Resizing the disk in VMware goes fine, so I log into the VM.  Normally for a physical machine with SAN or even SCSI disks, I’d go to /sys/class/scsi_host/ and figure out which adapter the disk I want to resize is sitting on.  Usually a combination of fdisk and lsscsi will give me the info I need here.  Good to go!  So I cd into the hostX folder that represents the right controller.  Here’s where things go south: I’ve had luck with sending an all-HCTL scan to pick up newly added disks, but it doesn’t register the new size of an existing disk:

echo "- - -" > scan
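If you’re not sure which hostX is the right one, the same wildcard scan can be sent to every SCSI host in one shot.  A small sketch (needs root; the sysfs root is parameterized only to make a dry run easy):

```shell
# send the all-HCTL ("- - -") scan to every SCSI host adapter
SYSFS_ROOT=${SYSFS_ROOT:-/sys}
for scan in "$SYSFS_ROOT"/class/scsi_host/host*/scan; do
  [ -e "$scan" ] || continue                    # glob matched nothing
  echo "- - -" > "$scan" 2>/dev/null || true    # skip hosts we can't write to
done
```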


By the way, when I mentioned sending an all HCTL scan, let me explain what that means. HCTL stands for scsi_Host, Channel, Target and Lun. When you’re looking at the output of lsscsi as you’ll see below shortly, you’ll see some numbers separated by colons like such:

[2:0:3:0] disk VMware Virtual disk 1.0 /dev/sdd

The four numbers here represent the following:
2 = scsi_host
    This is the numeric iteration of the host bus adapter that the scsi device is connected to.  Think of a scsi card or fibre channel card here.  The first scsi_host is usually one of the adapters built into most servers, as the internal ones usually get enumerated first during POST.

0 = Channel
    This is the channel on the HBA that is being referred to.  Think of a dual channel SCSI card or a dual port Fiber Channel HBA.  

3 = Target
    This refers to the SCSI target of the device we're looking at.  In this case, a good description would be an internal SCSI card that has a tape drive, CDROM and a couple hard drives attached to it.  Each of those devices would have a separate "target" to address that device specifically.

0 = Logical Unit Number or LUN
    This is the representation of a sub unit of a target.  A good example would be an optical disk library where the drive itself gets a SCSI target and assumes LUN 0, then the optical disks themselves get assigned LUNs so they can be addressed.  This also more commonly comes into play when you have a SAN that is exporting multiple disks (a.k.a. LUNs).  Say you have an EMC SAN that is presenting 30 disks to a server.  Based on conventional SCSI limitations, most cards can only address up to 15 targets per HBA channel (some will do up to 24 but they are extremely rare).  In this scenario you would need a couple SCSI HBAs to see all those disks.  Now picture thousands of disks... You see where I'm going with this.

I’ve always had luck seeing newly presented disks by doing an all HCTL scan, even in VMware.  But I always wound up having to reboot the damn VM just to get it to recognize the new size of the disk.  Well I stumbled upon a slightly different process today that lets me do what I’ve been trying to do.  Here’s the breakdown:

  • Determine which disk you want to resize.  fdisk -l usually does the trick:
[root@iscsi ~]# fdisk -l

Disk /dev/sda: 32.2 GB, 32212254720 bytes
64 heads, 32 sectors/track, 30720 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0009e4a5

Device Boot Start End Blocks Id System
/dev/sda1 * 2 501 512000 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 502 30720 30944256 8e Linux LVM
Partition 2 does not end on cylinder boundary.

Disk /dev/mapper/VolGroup-lv_root: 27.5 GB, 27455913984 bytes
255 heads, 63 sectors/track, 3337 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/VolGroup-lv_swap: 4227 MB, 4227858432 bytes
255 heads, 63 sectors/track, 514 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb: 1099.5 GB, 1099511627776 bytes
255 heads, 63 sectors/track, 133674 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x02020202

Disk /dev/sdc: 12.9 GB, 12884901888 bytes
64 heads, 32 sectors/track, 12288 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

WARNING: GPT (GUID Partition Table) detected on '/dev/sdd'! The util fdisk doesn't support GPT. Use GNU Parted.

Disk /dev/sdd: 16.6 GB, 32212254720 bytes
256 heads, 63 sectors/track, 3900 cylinders
Units = cylinders of 16128 * 512 = 8257536 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdd1 1 2081 16777215+ ee GPT
[root@iscsi ~]#
  • Ok, so /dev/sdd is the disk I want to resize.  I go resize it in VMware and re-run fdisk, but I still get the same size- no shock there.
  • Now let’s use the lsscsi command to show us some information about the scsi devices that the OS sees.  You may have to install this tool first (yum install lsscsi).
[root@iscsi ~]# lsscsi -v
[1:0:0:0] cd/dvd NECVMWar VMware IDE CDR10 1.00 /dev/sr0
dir: /sys/bus/scsi/devices/1:0:0:0 [/sys/devices/pci0000:00/0000:00:07.1/host1/target1:0:0/1:0:0:0]
[2:0:0:0] disk VMware Virtual disk 1.0 /dev/sda
dir: /sys/bus/scsi/devices/2:0:0:0 [/sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/target2:0:0/2:0:0:0]
[2:0:1:0] disk Nimble Server 1.0 /dev/sdb
dir: /sys/bus/scsi/devices/2:0:1:0 [/sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/target2:0:1/2:0:1:0]
[2:0:2:0] disk Nimble Server 1.0 /dev/sdc
dir: /sys/bus/scsi/devices/2:0:2:0 [/sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/target2:0:2/2:0:2:0]
[2:0:3:0] disk VMware Virtual disk 1.0 /dev/sdd
dir: /sys/bus/scsi/devices/2:0:3:0 [/sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/target2:0:3/2:0:3:0]


  • I used the -v (verbose) flag to have it tell me more information about each scsi device.  This gives us a great shortcut to decoding where in /sys/class/scsi_device our target resides.
  • To tell the scsi subsystem to rescan the scsi device, we simply echo a 1 into the rescan file located inside the folder identified above.  In our case, the folder is /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/target2:0:3/2:0:3:0.  If you cd into this folder, you see a bunch of entries.  BE CAREFUL here: in most cases these file entries represent a live view of what the system is seeing or doing.  If you do the wrong thing, you could tell the disk to power off, or maybe something even more destructive.  There’s no failsafe; the OS isn’t going to ask you if you’re sure you want to do this.  We’re poking around in the live kernel through the “back door”, if you will.  Here’s a listing of what my system shows:
[root@iscsi 2:0:3:0]# ls -la
total 0
drwxr-xr-x 8 root root 0 Mar 21 09:22 .
drwxr-xr-x 4 root root 0 Mar 21 09:22 ..
drwxr-xr-x 3 root root 0 Mar 21 09:22 block
drwxr-xr-x 3 root root 0 Mar 21 09:33 bsg
--w------- 1 root root 4096 Mar 21 09:33 delete
-r--r--r-- 1 root root 4096 Mar 21 09:33 device_blocked
-rw-r--r-- 1 root root 4096 Mar 21 09:33 dh_state
lrwxrwxrwx 1 root root 0 Mar 21 09:33 driver -> ../../../../../../../bus/scsi/drivers/sd
-r--r--r-- 1 root root 4096 Mar 21 09:33 evt_media_change
lrwxrwxrwx 1 root root 0 Mar 21 09:30 generic -> scsi_generic/sg4
-r--r--r-- 1 root root 4096 Mar 21 09:33 iocounterbits
-r--r--r-- 1 root root 4096 Mar 21 09:33 iodone_cnt
-r--r--r-- 1 root root 4096 Mar 21 09:33 ioerr_cnt
-r--r--r-- 1 root root 4096 Mar 21 09:33 iorequest_cnt
-r--r--r-- 1 root root 4096 Mar 21 09:33 modalias
-r--r--r-- 1 root root 4096 Mar 21 09:33 model
drwxr-xr-x 2 root root 0 Mar 21 09:33 power
-r--r--r-- 1 root root 4096 Mar 21 09:33 queue_depth
-r--r--r-- 1 root root 4096 Mar 21 09:33 queue_type
--w------- 1 root root 4096 Mar 21 09:33 rescan
-r--r--r-- 1 root root 4096 Mar 21 09:30 rev
drwxr-xr-x 3 root root 0 Mar 21 09:22 scsi_device
drwxr-xr-x 3 root root 0 Mar 21 09:33 scsi_disk
drwxr-xr-x 3 root root 0 Mar 21 09:33 scsi_generic
-r--r--r-- 1 root root 4096 Mar 21 09:30 scsi_level
-rw-r--r-- 1 root root 4096 Mar 21 09:33 state
lrwxrwxrwx 1 root root 0 Mar 21 09:33 subsystem -> ../../../../../../../bus/scsi
-rw-r--r-- 1 root root 4096 Mar 21 09:33 timeout
-r--r--r-- 1 root root 4096 Mar 21 09:33 type
-rw-r--r-- 1 root root 4096 Mar 21 09:33 uevent
-r--r--r-- 1 root root 4096 Mar 21 09:33 vendor
[root@iscsi 2:0:3:0]#
  • The file we’re interested in is called “rescan”.  The way these generally work is that you can poke a value into the kernel by echoing it into the file, as if you were appending to a text file.  Depending on the kernel parameter you’re working with, the value you poke in determines what action it takes; they generally take a 1 or 0 for true or false.  In this case we want the kernel to rescan this device, so we echo “1” > rescan.  This tells the kernel to take another look at the device and register any changes that have been made since the system first became aware of it at boot time.
[root@iscsi 2:0:3:0]# echo "1" > rescan
[root@iscsi 2:0:3:0]# fdisk -l /dev/sdd

WARNING: GPT (GUID Partition Table) detected on '/dev/sdd'! The util fdisk doesn't support GPT. Use GNU Parted.

Disk /dev/sdd: 32.2 GB, 32212254720 bytes
256 heads, 63 sectors/track, 3900 cylinders
Units = cylinders of 16128 * 512 = 8257536 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdd1 1 2081 16777215+ ee GPT


Notice the size of /dev/sdd is now 32GB where it was 16GB before.  Congratulations!  Ok, don’t get all excited just yet; your journey has just begun.  Now you have to update the partition table to reflect the new number of sectors so the OS can use the new space, and then resize whatever filesystem resides on that disk.  If you’re using LVM or software RAID, you’ll need to make some changes at that level first before messing with the filesystem.  The good news is most if not all of this can be done online without having to reboot.
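For the LVM case, those follow-on steps usually look something like this sketch; the partition, volume group and logical volume names are examples from a hypothetical layout, not commands to paste blindly:

```shell
# grow partition 1 on /dev/sdd to use the new space (parted handles GPT,
# unlike the old fdisk), then walk the change up the LVM stack
parted /dev/sdd resizepart 1 100%
pvresize /dev/sdd1                            # PV picks up the larger partition
lvextend -l +100%FREE /dev/VolGroup/lv_data   # grow the LV (example name)
resize2fs /dev/VolGroup/lv_data               # grow an ext filesystem online
```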


I hope this helps and let me know if you have any questions or input as to how to do this better!


Oracle Announcements Coming

With Oracle OpenWorld coming up in a week or so, we’re used to Oracle using this event to make major announcements about new products, features or generally good stuff.  They’ve already quietly announced their next SPARC chip.  This year, keep your ears peeled for some very impactful but not all that shocking news about one of their current products.  There’s one new product in particular that I’m interested in: the SuperCluster M8.  From what I’ve heard, it should be able to run circles around the current Exadata.  Additionally, there have been announcements about the support life for Solaris as well.


I can’t say much more until it’s public knowledge, but I would enjoy some spirited discussion about it after it comes out.



Stay tuned!!

Putting the Oracle SPARC M7 Chip through its paces

From time to time I get an opportunity to dive under the hood of some pretty cool technologies in my line of work.  Being an Oracle Platinum Partner, Collier IT specializes in Oracle based hardware and software solutions.  On the hardware side we work with Exadata, the Oracle Database Appliance and the Oracle ZFS Appliance, just to name a few.  We have a pretty nice lab that includes our own Exadata and ODA, and just recently a T7-2.


Featuring the new SPARC M7 chip released in October of 2015 with Software in Silicon technology, the M7-x and T7-x server line represents a huge leap forward in Oracle Database performance.  The difference between the M7 and T7 servers is basically size and power.  The chip itself is called the M7, not to be confused with the server model M7-x; the T7-x servers use the same M7 processor.  Hopefully that clears up any confusion going forward.  Here’s a link to a datasheet that outlines the server line in more detail.


In addition to faster on-chip encryption and real-time data integrity checking, SQL query acceleration provides an extremely compelling use case for consolidation while maintaining a high level of performance and security with virtually no overhead.  The SPARC line of processors has come a very long way indeed since its infancy.  Released in late 1987, it was designed from the start to provide a highly scalable architecture around which to build compute packages ranging from embedded processors all the way up to large server CPUs, all utilizing the same core instruction set.  The name SPARC itself stands for Scalable Processor ARChitecture.  Based on the RISC (Reduced Instruction Set Computer) approach, operations are designed to be as simple as possible.  This helps achieve nearly one instruction per CPU cycle, which allows for greater speed and simplicity of hardware, and it also helps consolidate other functions such as memory management or floating point operations onto the same chip.


Some of what the M7 chip is doing has actually been done in principle for decades.  Applications such as hardware video acceleration or cryptographic acceleration leverage instruction sets hard coded into the processor itself, yielding incredible performance.  Think of it as a CPU that has only one job in life: to do one thing and do it very fast.  Modern CPUs such as Intel’s x86 chips have many jobs to perform and have to juggle all of them at once.  They are very powerful, but because of the sheer number of jobs they are asked to perform, they don’t really excel at any one thing; call them a jack of all trades and master of none.  What a dedicated hardware accelerator does for video playback, for example, is what Oracle is doing with database operations such as SQL in the M7 chip.  The M7 is still a general purpose CPU, but it has the ability to perform database related instructions in hardware at machine level speeds with little to no overhead.  Because of this, the SPARC M7 is able to outperform other general purpose processors that have to timeshare those types of instructions along with all the other workloads they’re being asked to perform.


A great analogy would be comparing an athlete who competes in a decathlon to a sprinter.  The decathlete is very good at running fast, but he needs to be proficient in nine other areas of competition.  Because of this, the decathlete cannot possibly be as fast as the sprinter, who focuses on doing just one thing and being the best at it.  In the same vein, the M7 chip performs SQL instructions like a sprinter.  The same applies to encryption and real-time data compression.


Having explained the concept, we can now get into practical application.  The most common use case will be accelerating Oracle Database workloads, and I’ll spend some time digging into that in my next article.  Bear in mind that other workloads, such as cryptography and data compression, benefit from the hardware acceleration as well.


Over the past few weeks, we’ve been doing some benchmark comparisons between 3 very different Oracle Database hardware configurations.  The Exadata (x5), the Oracle Database Appliance (x5) and an Oracle T7-2 are the three platforms that were chosen.  There is a white paper that Collier IT is in the process of developing which I will be a part of.  Because the data is not yet fully analyzed, I can’t go into specifics on the results.  What I can say is that the T7-2 performed amazingly well from a price/performance perspective compared to the other two platforms.


Stay tuned for more details on a new test with the S7 and a Nimble CS-500 array as well as a more in depth look at how the onboard acceleration works including some practical examples.








ODA Patching – get ahead of yourself?

I was at a customer site deploying an X5-2 ODA.  They are standardizing on the patch level.  Even though is currently the latest, they don’t want to be on the bleeding edge.  Recall that the patch doesn’t include infrastructure patches (mostly firmware), so you have to install first, run the --infra patch to get the firmware, and then update to


We unpacked the patch on both systems and then had an epiphany.  Why don’t we just unpack the patch as well and save some time later?  What could possibly go wrong?  Needless to say, when we went to install or even verify the patch it complained as follows:

ERROR: Patch version must be


Ok, so there has to be a way to clean that patch off the system so I can use right?  I stumbled across the oakcli manage cleanrepo command and thought for sure that would fix things up nicely.  I ran it and got this output:


[root@CITX-5ODA-ODABASE-NODE0 tmp]# oakcli manage cleanrepo --ver
Deleting the following files...
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OAK/
Deleting the files under /DOM0OAK/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/Seagate/ST95000N/SF04/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/Seagate/ST95001N/SA03/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/WDC/WD500BLHXSUN/5G08/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H101860SFSUN600G/A770/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/Seagate/ST360057SSUN600G/0B25/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/H106060SDSUN600G/A4C0/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/H109060SESUN600G/A720/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/HUS1560SCSUN600G/A820/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/HSCAC2DA6SUN200G/A29A/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/HSCAC2DA4SUN400G/A29A/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/STEC/ZeusIOPs-es-G3/E12B/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/STEC/Z16IZF2EUSUN73G/9440/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/ORACLE/DE2-24P/0018/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/ORACLE/DE2-24C/0018/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/ORACLE/DE3-24C/0291/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X4370-es-M2/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HITACHI/H109090SESUN900G/A720/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/STEC/Z16IZF4EUSUN200G/944A/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H7240AS60SUN4.0T/A2D2/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H7240B520SUN4.0T/M554/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Disk/HGST/H7280A520SUN8.0T/P554/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Expander/SUN/T4-es-Storage/0342/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x005d/4.230.40-3739/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0097/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/Mellanox/0x1003/2.11.1280/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X4170-es-M3/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X4-2/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Ilom/SUN/X5-2/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/HMP/
Deleting the files under /DOM0HMP/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/IPMI/
Deleting the files under /DOM0IPMI/
Deleting the files under /JDK/1.7.0_91/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/ASR/5.3.1/Base
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OPATCH/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OPATCH/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OPATCH/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/GI/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/DB/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OEL/6.7/Patches/6.7.1
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OVM/3.2.9/Patches/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/OVS/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/thirdpartypkgs/Firmware/Controller/LSI-es-Logic/0x0072/
Deleting the files under $OAK_REPOS_HOME/pkgrepos/orapkgs/GI/


So I assumed that this fixed the problem.  Nope…


[root@CITX-5ODA-ODABASE-NODE0 tmp]# oakcli update -patch --verify

ERROR: Patch version must be



Ok, so more searching of the CLI manual and the oakcli help pages came up with bupkis.  So I decided to do an strace of the oakcli command I had just run.  As usual, there was a LOT of garbage I didn’t care about or didn’t understand, but I did find that it was reading the contents of a file that looked interesting to me:


[pid 5509] stat("/opt/oracle/oak/pkgrepos/System/VERSION", {st_mode=S_IFREG|0777, st_size=19, ...}) = 0
[pid 5509] open("/opt/oracle/oak/pkgrepos/System/VERSION", O_RDONLY) = 3
[pid 5509] read(3, "version=\n", 8191) = 19
[pid 5509] read(3, "", 8191) = 0
[pid 5509] close(3) = 0
[pid 5509] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
[pid 5509] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f159799d000
[pid 5509] write(1, "\n", 1
) = 1
[pid 5509] write(1, "ERROR: Patch version must be 12."..., 40ERROR: Patch version must be
) = 40
[pid 5509] exit_group(0) = ?


There were a dozen or so lines after that, but I had what I needed.  Apparently /opt/oracle/oak/pkgrepos/System/VERSION contains the version of the latest patch that has been unpacked.  The system software version is kept somewhere else, because after I unpacked the patch I ran an oakcli show version and it reported  But the VERSION file referenced earlier said  I assume that when I unpacked the patch, it updated this file.  So what I wound up doing is changing the VERSION file back to as well as deleting the folder /opt/oracle/oak/pkgrepos/System/  Once I did this, everything worked as I expected.  I was able to verify and install the --infra portion of and continue on my merry way.


This highlights the fact that there isn’t a known way (to me at least) to delete an unpacked patch via oakcli or any python scripts I’ve been able to find yet.  Also, as an aside, I tried just deleting the VERSION file assuming oakcli would rebuild it, and it didn’t.  I got this:


[root@CITX-5ODA-ODABASE-NODE0 System]# oakcli update -patch --verify
ERROR : Couldn't find the VERSION file to extract the current allowed version


So I just recreated the file and all was good.  I was hoping that the oak software didn’t maintain some sort of binary-formatted database keeping track of all this information, and I think I got lucky in this case.  Hope this helps someone out in a pinch!

X5-2 ODA upgrade from to observations


More fun with patching!  So this time I’m doing a fresh virtualized install and I decided to take my own sage advice of installing first to get the firmware patches.  I ran into a bunch of other issues which will be the topic of a different post but I digress.  I got fully installed, ODA_BASE deployed, everything was happy.


Remember that starting with version, you have to patch each node separately with the --local option for the infra patches.  So I started the patch on node 0 and it got almost all the way to the end, at step 12 where oakd is being patched.  I ran into the “known issue” in MOS note 888888.1, item 9:

9.  During the infra patching, after step 12 completed, IPMI, HMP done, if it appeared to be hang during Patching OAK with the following two lines
                               INIT: Sending processes the TERM signal
                               INIT: no more processes left in this runlevel
JDK is not patched, the infra patching is not complete to the end.  
Workaround:  To reboot the appeared hang node manually, then run 
# oakcli update -patch --clean

# oakcli update -patch --infra --local
To let it complete the infra patch cleanly.  

I waited about 30 minutes at this step before I started to wonder, and sure enough, after checking some log files in /opt/oracle/oak/onecmd/tmp/, it thought oakd was fully patched.  What I found is that oakd gets whacked because the patch doesn’t fully complete.  After doing the reboot that’s recommended in the workaround above, sure enough oakd was not running.  What’s more, now when I boot ODA_BASE the console doesn’t get to the login prompt and you can’t do anything, even though you can ssh in just fine.  So I ran the --clean option, then kicked off the patch again.  This time it complained that oakd wasn’t running on the remote node.  It was in fact running on node1, but node0’s oakd was not.  I suspect that when the ODA communicates with oakd between nodes, it’s using the local oakd to do so.


So I manually restarted oakd by running /etc/init.d/init.oak restart, and then oakd was running.  I rebooted ODA_BASE on node0 just to be sure everything was clean, then kicked off the infra patch again.  This time it went all the way through and finished.  The problem now is that the ODA_BASE console is non-responsive no matter what I do, so I’ll be opening a case with Oracle support to get a WTF.  I’ll update this post with their answer/solution.  If I were a betting man, I’d say they’ll tell me to update to to fix it.  We’ll see…


As an aside, one of the things that does is an in-place upgrade of Oracle Linux 5.11 to version 6.7 for ODA_BASE.  I’ve never done a successful upgrade that way and, in fact, Red Hat doesn’t support it.  I guess I can see why they would want to do an upgrade rather than a fresh install, but it still feels very risky to me.

Are IT pillars narrowing?

Back in the 1990s when I got into the IT field, things were obviously different than they are now.  The ways in which things are different are somewhat alarming to me.  I’m talking about IT specializations such as network, server, desktop and virtualization engineers, for example.

I’m starting to see focus areas narrow to specific areas of expertise as things get more and more complex. Not too long ago if you were a server admin, chances are you worked on anything from racking and stacking the server to setting up ports on the switch to allocating storage and installing the OS.

Not so much these days. We have facilities engineers who rack the server and run the cables. We have a network team split into multiple sub teams based on function who get the network portion done. Then we have a SAN team who provisions the storage. Finally your server admin does the install and patches.

To some degree this is a good thing.  There are more and more complex technologies that require a higher level of training and knowledge to effectively configure and support.  Still, it makes me somewhat nervous that I might be required to limit my expertise to just a small focus area.

What happens when a paradigm shift occurs in the tech world?  Take Novell Netware for example.  It used to be almost a given that most companies used it for file and print.  Some of you reading this article today may not even know what Netware is.  If I were a Netware admin I’d be kinda screwed right about now wouldn’t I?

Maybe I’m just being paranoid, but maybe not?  I personally have branched out into a number of fields including OS support, virtualization, storage and networking.  I enjoy all these technologies, and it makes me somewhat more valuable because I can integrate them better than someone who focuses on only one pillar.  Even though I do it at the expense of not being as “deep” in any particular field, I’m okay with this because it lets me explore more and be better prepared for that big paradigm shift that’s out there somewhere.

What are your thoughts on this?  I’d love to hear your point of view!

Solaris Commands – off the beaten path

Oracle blogger Giri Mandalika posted some somewhat obscure commands that can be very useful when troubleshooting Solaris.  These commands mostly work with Solaris 10; however, some of them are specific to Solaris 11:


Interrupt Statistics : intrstat utility
The intrstat utility can be used to monitor interrupt activity generated by various hardware devices, along with the CPU that serviced each interrupt and the CPU time spent servicing those interrupts.  On a busy system, the stats intrstat reports may help figure out which devices are keeping the system busy with interrupts.
.. [idle system] showing the interrupt activity on first two vCPUs ..

# intrstat -c 0-1 5

device | cpu0 %tim cpu1 %tim
cnex#0 | 0 0.0 0 0.0
ehci#0 | 0 0.0 0 0.0
hermon#0 | 0 0.0 0 0.0
hermon#1 | 0 0.0 0 0.0
hermon#2 | 0 0.0 0 0.0
hermon#3 | 0 0.0 0 0.0
igb#0 | 0 0.0 0 0.0
ixgbe#0 | 0 0.0 0 0.0
mpt_sas#0 | 18 0.0 0 0.0
vldc#0 | 0 0.0 0 0.0

device | cpu0 %tim cpu1 %tim
cnex#0 | 0 0.0 0 0.0
ehci#0 | 0 0.0 0 0.0
hermon#0 | 0 0.0 0 0.0
hermon#1 | 0 0.0 0 0.0
hermon#2 | 0 0.0 0 0.0
hermon#3 | 0 0.0 0 0.0
igb#0 | 0 0.0 0 0.0
ixgbe#0 | 0 0.0 0 0.0
mpt_sas#0 | 53 0.2 0 0.0
vldc#0 | 0 0.0 0 0.0

Check the outputs of the following as well.
# echo ::interrupts | mdb -k
# echo ::interrupts -d | mdb -k

Physical Location of Disk : croinfo & diskinfo commands
Both the croinfo and diskinfo commands provide information about the chassis, receptacle, and occupant relative to all disks or to a specific disk.  Note that the croinfo and diskinfo utilities share the same executable binary and function in an identical manner.  The main difference is the defaults used by each utility.

# croinfo
D:devchassis-path t:occupant-type c:occupant-compdev
------------------------------ --------------- ---------------------
/dev/chassis//SYS/MB/HDD0/disk disk c0t5000CCA0125411FCd0
/dev/chassis//SYS/MB/HDD1/disk disk c0t5000CCA0125341F0d0
/dev/chassis//SYS/MB/HDD2 - -
/dev/chassis//SYS/MB/HDD3 - -
/dev/chassis//SYS/MB/HDD4/disk disk c0t5000CCA012541218d0
/dev/chassis//SYS/MB/HDD5/disk disk c0t5000CCA01248F0B8d0
/dev/chassis//SYS/MB/HDD6/disk disk c0t500151795956778Ed0
/dev/chassis//SYS/MB/HDD7/disk disk c0t5001517959567690d0

# diskinfo -oDcpd
D:devchassis-path c:occupant-compdev p:occupant-paths d:occupant-devices
------------------------------ --------------------- ----------------------------------------------------------------------------- -----------------------------------------
/dev/chassis//SYS/MB/HDD0/disk c0t5000CCA0125411FCd0 /devices/pci@400/pci@1/pci@0/pci@0/LSI,sas@0/iport@1/disk@w5000cca0125411fd,0 /devices/scsi_vhci/disk@g5000cca0125411fc
/dev/chassis//SYS/MB/HDD1/disk c0t5000CCA0125341F0d0 /devices/pci@400/pci@1/pci@0/pci@0/LSI,sas@0/iport@2/disk@w5000cca0125341f1,0 /devices/scsi_vhci/disk@g5000cca0125341f0
/dev/chassis//SYS/MB/HDD2 - - -
/dev/chassis//SYS/MB/HDD3 - - -
/dev/chassis//SYS/MB/HDD4/disk c0t5000CCA012541218d0 /devices/pci@700/pci@1/pci@0/pci@0/LSI,sas@0/iport@1/disk@w5000cca012541219,0 /devices/scsi_vhci/disk@g5000cca012541218
/dev/chassis//SYS/MB/HDD5/disk c0t5000CCA01248F0B8d0 /devices/pci@700/pci@1/pci@0/pci@0/LSI,sas@0/iport@2/disk@w5000cca01248f0b9,0 /devices/scsi_vhci/disk@g5000cca01248f0b8
/dev/chassis//SYS/MB/HDD6/disk c0t500151795956778Ed0 /devices/pci@700/pci@1/pci@0/pci@0/LSI,sas@0/iport@4/disk@w500151795956778e,0 /devices/scsi_vhci/disk@g500151795956778e
/dev/chassis//SYS/MB/HDD7/disk c0t5001517959567690d0 /devices/pci@700/pci@1/pci@0/pci@0/LSI,sas@0/iport@8/disk@w5001517959567690,0 /devices/scsi_vhci/disk@g5001517959567690
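As a side note on the “same executable binary” point: a common way to implement that pattern (busybox-style) is to dispatch on the name the program was invoked as.  Whether croinfo/diskinfo actually do it this way is my assumption, not documented behavior, but here’s an illustrative sketch:

```shell
# Illustrative argv[0] dispatch: one script, behavior picked by invocation name.
# (This is a guess at the croinfo/diskinfo mechanism, not Solaris source.)
cat > /tmp/multitool <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
    croinfo)  echo "chassis-oriented defaults" ;;
    diskinfo) echo "disk-oriented defaults" ;;
    *)        echo "unknown personality" ;;
esac
EOF
chmod +x /tmp/multitool

# Two names, one binary:
ln -sf /tmp/multitool /tmp/croinfo
ln -sf /tmp/multitool /tmp/diskinfo
/tmp/croinfo     # prints: chassis-oriented defaults
/tmp/diskinfo    # prints: disk-oriented defaults
```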

Monitoring Network Traffic Statistics : dlstat command
The dlstat command reports network traffic statistics for all datalinks, or for a specific datalink, on a system.

# dlstat -i 5 net0
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
           net0  163.12M   39.93G  206.14M   43.63G
           net0      312  196.59K      146  370.80K
           net0      198  172.18K      121  121.98K
           net0      168   91.23K       93  195.57K

For the complete list of options along with examples, please consult the Solaris Documentation.

Fault Management : fmstat utility
The Solaris Fault Manager gathers and diagnoses problems detected by the system software, and initiates self-healing activities such as disabling faulty components.  The fmstat utility can be used to check the statistics associated with the Fault Manager.
fmadm config lists all active fault management modules that are currently participating in fault management.  The -m option can be used to report the diagnostic statistics for a specific fault management module; fmstat without any options reports stats from all fault management modules.

# fmstat 5
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 1.0 8922.5 96 0 0 0 12b 0
disk-diagnosis 1342 0 1.1 8526.0 96 0 0 0 0 0
disk-transport 0 0 1.0 8600.3 96 1 0 0 56b 0
zfs-diagnosis 139 75 1.0 8864.5 96 0 4 12 672b 608b
zfs-retire 608 0 0.0 15.2 0 0 0 0 4b 0
# fmstat -m cpumem-retire 5
auto_flts 0 auto-close faults received
bad_flts 0 invalid fault events received
cacheline_fails 0 cacheline faults unresolveable
cacheline_flts 0 cacheline faults resolved
cacheline_nonent 0 non-existent retires
cacheline_repairs 0 cacheline faults repaired
cacheline_supp 0 cacheline offlines suppressed

InfiniBand devices : List & Show Information about each device
ibv_devices lists all available IB devices, whereas ibv_devinfo shows information about all devices or about a specific IB device.

# ibv_devices
device node GUID
------ ----------------
mlx4_0 0021280001cee63a
mlx4_1 0021280001cee492
mlx4_2 0021280001cee4aa
mlx4_3 0021280001cee4ea

# ibv_devinfo -d mlx4_0
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.7.8130
node_guid: 0021:2800:01ce:e63a
sys_image_guid: 0021:2800:01ce:e63d
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: SUN0160000002
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 56
port_lid: 95
port_lmc: 0x00
link_layer: IB

port: 2
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 56
port_lid: 96
port_lmc: 0x00
link_layer: IB

Other commands and utilities such as ibstatus, fwflash or cfgadm can also be used to retrieve similar information.

PCIe Hot-Plugging : hotplug command
When the hotplug service is enabled on a Solaris system, the hotplug command can be used to bring hot-pluggable devices online or offline without physically adding or removing the device from the system.
The following command lists all physical [hotplug] connectors along with their current status.

# hotplug list -c
Connection State Description

For detailed instructions to hotplug a device, check the Solaris documentation out.

Join a Solaris 10 or 11 server to a Microsoft AD domain


I found a great infodoc that explains how to join an Oracle Solaris 10 or 11 server to Microsoft Active Directory for authentication.  The Solaris 10 procedure is not supported by Oracle, so you may want to stick with third-party tools like Centrify or Likewise to at least have a support mechanism from them in the event of any problems.

The Solaris 11 solution, however, is supported via Kerberos, and I know a few customers of my own who would be interested in implementing this!

The infodoc number is 1485462.1 and it contains references to other documents within MOS so you’ll need an account to access them.

RedHat Atomic Host – Linux on a diet never looked this good!

RedHat has released a new variant of their popular operating system called Atomic Host.  It is essentially a stripped-down distro that contains only the components needed to run Docker containers.  This provides a smaller footprint, reduces your attack surface, and makes for quicker boots.  Gonna check it out maybe this weekend!

Slashdot article:

RedHat Site info