The thickening debate over thin provisioning

I’ve been working with different storage vendors for a number of years now.  Most, if not all, of them now support thin provisioning as a way to maximize your storage utilization.  Before thin provisioning came along, you had to pre-allocate X amount of storage to a given application, and then it was usable only by that application.  This proved to be potentially very wasteful: most admins don’t allocate only the amount of storage they need right now, they project growth over at least a 2-3 year period, so the “left over” space that wasn’t being used at the time was locked away and essentially wasted.


Enter thin provisioning.  Now we have a way of overcommitting our storage to get better utilization out of what we have.  For example, let’s say you allocate a 100GB thin provisioned LUN to a server or VM.  Initially, that 100GB LUN occupies zero space on the SAN.  At some point the server starts writing data to the volume- let’s say it has written 20GB so far.  On the SAN side, the only blocks consumed at this point are the 20GB that have been written.  The remaining 80GB is still unused on the SAN and can be allocated to other volumes.  Now here’s where it gets tricky and you have to be careful.  Let’s say we have a SAN with 1TB of capacity (I know- small by today’s standards, but work with me on this).  Here’s an example of storage allocation to illustrate my point:

  • 5x 100GB thin provisioned LUNs -> Windows server
  • 1x 300GB thin provisioned LUN -> Linux server
  • 2x 400GB thin provisioned LUNs -> VMware environment

Wait a minute- that’s 1.6TB and you told me I only had 1TB!  This is exactly what I’m referring to when I say you have to be careful.  In the example above, we’ve actually given out more storage than we have.  This magic is possible because the SAN tells the “clients” that they have whatever size LUN we’ve given them, and they all believe they really have that much space.  In actuality, we have a pool of storage from which we allocate blocks to each of the LUNs I’ve presented.  Don’t worry- this is actually a fairly common scenario in the real world.  The differentiator is that it’s done with a full understanding of how the data is used, combined with close monitoring of available space on the SAN.
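If it helps to see the arithmetic, here’s a rough Python sketch of the overcommit idea.  The names (ThinPool, provision_lun, write) are invented for illustration- this isn’t any SAN vendor’s actual API:

```python
# Toy model of a thin provisioned pool.  Purely illustrative- the class and
# method names are made up, not a real SAN's management interface.

class ThinPool:
    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb      # physical space in the pool
        self.provisioned_gb = 0             # total size promised to clients
        self.used_gb = 0                    # blocks actually written so far

    def provision_lun(self, size_gb):
        # Provisioning costs nothing physically- we're just making a promise.
        self.provisioned_gb += size_gb

    def write(self, gb):
        # Physical blocks are only consumed when data is actually written.
        if self.used_gb + gb > self.capacity_gb:
            raise RuntimeError("Pool is out of physical space!")
        self.used_gb += gb

pool = ThinPool(capacity_gb=1000)           # our 1TB SAN (calling 1TB 1000GB to keep it simple)

for size_gb in [100] * 5 + [300] + [400] * 2:   # 5x100GB, 1x300GB, 2x400GB
    pool.provision_lun(size_gb)

print(f"Provisioned: {pool.provisioned_gb}GB")   # 1600GB promised...
print(f"Physically used: {pool.used_gb}GB")      # ...but 0GB consumed so far
print(f"Overcommit ratio: {pool.provisioned_gb / pool.capacity_gb:.1f}x")
```

Every client believes its full LUN exists on day one; the pool only gives up physical space as real writes land on it, which is exactly why the 1.6x overcommit works- right up until it doesn’t.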


Here’s a quick explanation of what happens if you don’t do it right.  As blocks of data are written to the LUNs, storage blocks are allocated on the SAN and marked as used, and the pool of available storage shrinks by that much.  At some point, if you keep consuming storage, you will find yourself in the unenviable position of running out of space on the SAN.  What’s worse, all of the consumers of the storage still think they’ve got more space to use, so they ALL run out of space at the same time.  We all know what happens when Windows can’t write to the C: drive, Linux can’t write to the root filesystem or VMware can’t write to its datastore.  You’re blessed with blue or purple screens of death or kernel panics on all of your clients.  It’s not the servers’ fault- you told them they had the storage space.  It’s not the SAN’s fault- you told it to allocate that much storage to each server.  Guess whose fault it is?  You guessed it- YOURS!  This is a very real possibility and I’ve worked with multiple customers who have run into this exact problem.  In some cases, depending on how many clients were impacted and how hard (or even possible) recovery was, unlucky admins have even experienced an unplanned RGE.  RGE stands for Resume Generating Event.  Ya don’t wanna be that guy…

[Image: fired employee]
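The boring way to stay off that list is to watch the pool relentlessly.  Here’s a minimal sketch of the kind of check I’d want running on a schedule- the threshold numbers are arbitrary examples, and pulling the real capacity figures would go through your SAN’s own management interface, which I’ve deliberately left out:

```python
# Minimal pool-utilization check.  The thresholds are example values only;
# how you fetch the real numbers depends entirely on your SAN's management
# API or CLI, which is deliberately left as an exercise here.

WARN_PCT = 70      # time to start planning for more capacity
CRITICAL_PCT = 90  # stop provisioning and start expanding or migrating NOW

def check_pool(capacity_gb, used_gb, provisioned_gb):
    used_pct = 100 * used_gb / capacity_gb
    overcommit = provisioned_gb / capacity_gb
    if used_pct >= CRITICAL_PCT:
        status = "CRITICAL"
    elif used_pct >= WARN_PCT:
        status = "WARNING"
    else:
        status = "OK"
    return f"{status}: {used_pct:.0f}% of pool used, {overcommit:.1f}x overcommitted"

# Example: our 1TB pool with 820GB physically written and 1.6TB promised out.
print(check_pool(capacity_gb=1000, used_gb=820, provisioned_gb=1600))
```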


There’s one nasty side effect of thin provisioning we haven’t even discussed yet, however.  The thickening… oooh, that sounds scary!  It’s not as scary as an RGE, but it’s still something to be aware of for sure.  Basically, it has to do with what happens after you thin provision a LUN and the consumer has been using it for a while.  When you first assign thin provisioned storage to a consumer, it occupies no space.  In my example, I have an HP P4000 SAN connected to a VMware ESXi 5.1 server with a Windows 2008 R2 VM running on it.  I’ve assigned a 1TB thin provisioned LUN from the P4000 directly to the VM via an RDM (Raw Device Mapping) in VMware.  Here’s a picture of what your storage looks like at this point:

[Image: start]

At this point, everything’s fine.  From all perspectives, I have 1TB of space available on my shiny new LUN and no space has been used yet.  So time goes by and I start using my storage in Windows.  I’ve copied about 850GB of data to the LUN, consisting of some ISO images, a few database backups and some program dumps.  Now here’s what things look like storage-wise:

[Image: data written]

Ok, nothing unexpected here.  I’ve copied about 850GB of data to the LUN and, as expected, I’m using up about 850GB of storage on my SAN.  VMware also tracks how much space I’m using and it reports the same amount.  The Windows VM also thinks it’s using up 850GB.  So far so good!  Now, more time has gone by and I’ve deleted some unneeded files- I got rid of some of the program dumps that are no longer valid and a few database backups as well.  Now I’m down to about 400GB used.  Here’s where things tend to sneak up on you and make life interesting.  Windows thinks I’m only using 400GB of space- rightfully so, as I’ve deleted some files.  However, from the SAN’s perspective, I’m still using 850GB.  How is this possible?  I deleted the files in Windows and that space is no longer needed.  Well, the SAN doesn’t know that.  To get a better understanding of why there is now a discrepancy, we first need to discuss what actually happens when you delete files in Windows.


This is a very simplified description of what happens; I’m not going to go into too much detail here except to cover the basic principle of what happens when you delete files.  It is out of the scope of this document to talk about things like media descriptors and byte offsets or all of the other minute details of NTFS and how it works.  When you delete a file, you’re actually just de-referencing the data that is stored on the disk.  There is an MFT (Master File Table) that keeps track of where all the files reside on disk as well as a mapping, or offset, to the first block of data in each file.  There is also a cluster bitmap, which is a table responsible for letting Windows know where it can write new blocks of data and where it can’t.  When a file is deleted in Windows, it is marked as deleted in the MFT.  The cluster bitmap is also updated to mark those previously allocated blocks as eligible to be written to again- this is why deleting files is so fast.  It would be terribly inefficient and wasteful for Windows to actually go out and write zeros to each and every block of a file that you deleted, wouldn’t it?  It would take minutes to delete even a 50GB file.  So even though the file was marked as deleted in the MFT, the actual data that made up the contents of the file is still sitting on disk; it’s just no longer accessible through Windows Explorer.  Technically speaking, if you wanted to recover that file and nothing had been written to the disk since you deleted it, you could theoretically go out and recreate the entry in the MFT if you knew the initial offset of the first block of the file’s data.  Back to our scenario- here’s a depiction of what things look like at this point:

[Image: data deleted]
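If it helps to see that principle in code, here’s a cartoon version in Python.  This is nothing like the real NTFS on-disk structures- just a made-up MFT dictionary and cluster bitmap, enough to show that a delete flips bookkeeping bits and leaves the data blocks themselves untouched:

```python
# Cartoon version of what an NTFS delete does.  The real MFT and cluster
# bitmap are far more involved; this only shows the key point- deleting a
# file flips some bookkeeping bits but never touches the data clusters.

mft = {}                     # filename -> list of cluster numbers
cluster_bitmap = [0] * 64    # 0 = free to write, 1 = in use
disk = [None] * 64           # the actual data sitting in each cluster

def write_file(name, data_blocks):
    clusters = []
    for block in data_blocks:
        c = cluster_bitmap.index(0)      # find a free cluster
        cluster_bitmap[c] = 1
        disk[c] = block                  # data physically lands here
        clusters.append(c)
    mft[name] = clusters

def delete_file(name):
    # Mark the clusters as reusable and drop the MFT entry.
    # Note: disk[] is NOT zeroed- the old data is still physically there.
    for c in mft[name]:
        cluster_bitmap[c] = 0
    del mft[name]

write_file("dump1.dmp", ["AAAA", "BBBB", "CCCC"])
delete_file("dump1.dmp")
print(disk[:4])              # ['AAAA', 'BBBB', 'CCCC', None] - data still on disk
print(cluster_bitmap[:4])    # [0, 0, 0, 0] - but the clusters are free again
```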

Now that we understand, in a basic fashion, what happens when a file is deleted in Windows, we can move on to why the heck the SAN still thinks that data is there and being used.  Understand that the storage layer doesn’t have any visibility into Windows, VMware or any other layer beyond the point where it presented the LUN.  It doesn’t know that my Windows VM just deleted 450GB of data from the LUN.  It just knows that, over time, it’s been asked to write blocks of data totaling 850GB.  So given what we now know happens when you delete a file in Windows, it makes sense that the storage layer thinks it still needs to hold on to the full 850GB of data.

NOTE: VMware’s VAAI UNMAP storage API, as well as the VMware Tools installed inside the guest VM, can give VMware some insight into what’s actually going on inside Windows.  In the case of VAAI UNMAP, it can potentially reverse some of the negative effects of thin provisioning.  For the purpose of our discussion, however, let’s suspend disbelief and assume that we don’t have VAAI or the VMware Tools installed and configured.
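Just so you can picture what we’re giving up, here’s a toy illustration of what an UNMAP-style hint buys you when it is available.  This is purely conceptual- it is not the actual VAAI or SCSI UNMAP interface, and the numbers are made up:

```python
# Toy illustration of what a TRIM/UNMAP-style hint does for a thin LUN.
# Conceptual only- this is not the real VAAI or SCSI UNMAP interface.

allocated = set()            # LBAs the SAN has backed with physical blocks

def write(lbas):
    allocated.update(lbas)   # first write to an LBA consumes a physical block

def unmap(lbas):
    # The guest/hypervisor explicitly tells the array these LBAs are no
    # longer in use, so the array can hand the physical blocks back to the pool.
    allocated.difference_update(lbas)

write(range(0, 850))         # 850 "GB" written
print(len(allocated))        # 850 backed blocks
unmap(range(400, 850))       # guest deletes data AND the hint reaches the SAN
print(len(allocated))        # 400 - space actually reclaimed in the pool
```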

To take this one step further, let’s look at one more tweak to our scenario.  Say we’ve deleted the data and are now sitting at 400GB of used space from Windows’ perspective.  What do you suppose would happen if we wrote another 500GB of new data in Windows?  Recall what happened when we deleted the data in the first place- the blocks were simply de-referenced, and the SAN doesn’t know anything about data being deleted.  Since the SAN thinks the old data still exists, including the data we deleted in Windows, new writes would have to go somewhere else, wouldn’t they?  Wait a minute- we had 850GB of data to begin with, then we deleted 450GB, but we added 500GB of new data.  By the SAN’s math that’s 850GB plus 500GB, or 1350GB, which is more than the 1TB we have!  I’m sure you’re asking yourself at this point: wouldn’t I run out of space?  The answer is no, and I’ll tell you why- because Windows told the SAN where to write the new 500GB of data.  Some of the blocks will be written where the deleted data was previously stored and some will be written to empty blocks.  If you’re thoroughly confused at this point, allow me to show you visually how this is possible.  First, though, you need to understand that the blocks Windows writes are logically mapped to blocks on the SAN.  A block written to storage location 0x8000 according to the Windows cluster bitmap may actually be written to storage location 0x15D1DC7 according to the SAN.  Both sides maintain a table of blocks and their corresponding locations, but they are two different tables, and those tables are mapped dynamically as data is written.  This diagram shows how NTFS maps data blocks as they are written, deleted and re-written:

[Image: NTFS block mapping]

The graphic above illustrates two important things.  The first is that even though we thought we had used up 850GB of our 1TB on the SAN, in reality we still had more space available in Windows because we had deleted files.  The diagram shows how Windows effectively tells the storage layer, “I know I told you to write these blocks before; forget that and write these new blocks there instead.”  We’re still limited to 1TB of total space, but because of how the blocks are logically mapped from Windows to the SAN, we still have room to store our data.  The second thing illustrated above is the thickening concept this article is all about.  If you think of a high-water mark on the SAN as data gets written, deleted and re-written, you start to understand why thin provisioned volumes eventually get “thickened” over time.  Because of this, you should have a set of guidelines in place to help you determine when it’s appropriate to thin provision and when to use thick volumes in the first place.  One primary rule of thumb I use is this: if the LUN will be written to frequently, it’s probably not a good candidate for thin provisioning.  You won’t really gain much space savings in the end, and you could potentially find yourself in that ugly out-of-space problem I mentioned early on.
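If you want to watch the thickening happen in fast-forward, here’s a toy simulation that couples the two layers of bookkeeping- a guest-side view that happily reuses freed blocks, and a SAN-side view that (with no UNMAP, per the note above) only ever allocates.  The block counts and churn pattern are invented purely for illustration:

```python
import random

# Toy simulation of a 1000-block thin LUN being written to, deleted from and
# re-written over time.  Guest-side bookkeeping reuses freed blocks; the SAN
# side (with no UNMAP) only ever allocates, so its usage ratchets upward.

random.seed(42)
LUN_BLOCKS = 1000

guest_used = set()       # blocks the filesystem currently considers in use
san_allocated = set()    # blocks the SAN has ever been asked to write

for cycle in range(1, 6):
    # Write 300 blocks to wherever the guest says there's free space.
    free = list(set(range(LUN_BLOCKS)) - guest_used)
    new_blocks = set(random.sample(free, 300))
    guest_used |= new_blocks
    san_allocated |= new_blocks      # SAN allocates on first write, never frees

    # Delete 250 blocks' worth of files.  The SAN never hears about it.
    deleted = set(random.sample(sorted(guest_used), 250))
    guest_used -= deleted

    print(f"cycle {cycle}: guest thinks {len(guest_used)} blocks used, "
          f"SAN has allocated {len(san_allocated)}")
```

Run it and the guest-side number stays small while the SAN-side number ratchets up toward the full size of the LUN- that ratchet is the high-water mark doing the “thickening”.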

Hopefully this has shed some light on what happens over time with thin provisioned volumes and what can happen if you’re not paying attention.

