Windows Server 2012 Deduplication and ConfigMgr 2012

 

Windows Server 2012 brought along a really nice bunch of features, and a lot of work went into Disk Management: we saw Storage Pools, and a feature that I hadn’t checked out, Deduplication. Since Windows Server 2012 we’ve been able to make use of this feature in the data center. Johan Arwidmark reminded me of the ConfigMgr team blog article on ConfigMgr and Deduplication for Distribution Point Content Libraries, so I thought I’d spend some time checking out how incredible the savings are, and how to get it up and running quickly in case I visit a customer who has this requirement.

There’s a primer you should look through on the Windows Server Storage Team blog that gives you a good introduction to the Deduplication feature.

Now wasn’t that worth reading? You now know a lot about Deduplication. A quick summary:

  • Deduplication is for data volumes only; it doesn’t work on boot or system drives
  • It skips files smaller than 32 KB and any files that have Extended Attributes
  • It doesn’t support Cluster Shared Volumes or encrypted file systems
  • You shouldn’t use Deduplication on application folders; this rules out ConfigMgr, SQL, or any other application’s own folders
  • Be prepared for corruption to occur at some point in time; maybe it will, maybe it won’t, but have a plan for it. The loss of a shared chunk may result in many files being lost. Chunk caching\duplication (hotspot chunks) looks to be able to catch this when it happens, but isn’t guaranteed. It’s worth quoting from the above blog, as this is important to note:

Reliability and Risk Mitigations

Even with RAID and redundancy implemented in your system, data corruption risks exist due to various disk anomalies, controller errors, firmware bugs or even environmental factors, like radiation or disk vibrations. Deduplication raises the impact of a single chunk corruption since a popular chunk can be referenced by a large number of files. Imagine a chunk that is referenced by 1000 files is lost due to a sector error; you would instantly suffer a 1000 file loss.

Backup Support: We have support for fully-optimized backup using the in-box Windows Server Backup tool and we have several major vendors working on adding support for optimized backup and un-optimized backup. We have a selective file restore API to enable backup applications to pull files out of an optimized backup.

Reporting and Detection: Any time the deduplication filter notices a corruption it logs it in the event log, so it can be scrubbed. Checksum validation is done on all data and metadata when it is read and written. Deduplication will recognize when data that is being accessed has been corrupted, reducing silent corruptions.

Redundancy: Extra copies of critical metadata are created automatically. Very popular data chunks receive entire duplicate copies whenever it is referenced 100 times. We call this area “the hotspot”, which is a collection of the most popular chunks.

Repair: A weekly scrubbing job inspects the event log for logged corruptions and fixes the data chunks from alternate copies if they exist. There is also an optional deep scrub job available that will walk through the entire data set, looking for corruptions and it tries to fix them. When using a Storage Spaces disk pool that is mirrored, deduplication will reach over to the other side of the mirror and grab the good version. Otherwise, the data will have to be recovered from a backup. Deduplication will continually scan incoming chunks it encounters looking for the ones that can be used to fix a corruption.
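
The weekly scrubbing job mentioned above can also be kicked off on demand rather than waiting for the schedule. As a rough sketch (the Scrubbing job type and the -Full switch for a deep scrub are documented for Start-DedupJob, but check them on your build), once Deduplication is up and running on a volume you could do:

Start-DedupJob -Volume E: -Type Scrubbing

# Optional deep scrub that walks the entire data set looking for corruptions
Start-DedupJob -Volume E: -Type Scrubbing -Full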

 

ConfigMgr administrators are already familiar with the concepts behind Deduplication from the ConfigMgr Single Instance Storage feature. The technical details are obviously different, but you’re looking at the same thing, essentially pattern matching\scanning for commonality. Deduplication works with what it calls chunks, variable-sized blocks of roughly 32–128 KB; files with common chunks are candidates for storage reduction. ConfigMgr Single Instance Storage makes its savings based on whole files being identical to each other, so Deduplication with chunks is just a more granular equivalent.

 

To get this going in a lab VM, I had previously moved the Content Libraries off C: and onto a new E: volume using the ConfigMgr 2012 R2 Toolkit Content Library Transfer tool. That satisfies the Deduplication requirement for a data volume, and the folders I want to deduplicate are all there; this is what it looks like:

 

image
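
For reference, the move itself was done with ContentLibraryTransfer.exe from the toolkit. From memory the call looks roughly like this (a sketch only; check the toolkit documentation for the exact switches before running it against a live site server):

ContentLibraryTransfer.exe -SourceDrive C -TargetDrive E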

I’m going to include two of those folders for deduplication and exclude the rest.

Include:

  • SCCMContentLib
  • SMSPKGE$

Exclude:

  • Everything else!

I’ve had to exclude the package source location because, as the article I linked to above says, ConfigMgr does not support content source locations that have files residing on reparse points. Deduplication stores optimized files as reparse points, so things might not work out well. If you’ve not heard of reparse points before, there’s a primer here: http://support.microsoft.com/kb/205524.
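
If you ever want to double-check that a package source folder hasn’t picked up any reparse points, a quick PowerShell check along these lines does the job (E:\PkgSource here is just my layout):

# List any files under the package source carrying the ReparsePoint attribute (there should be none)
Get-ChildItem -Path E:\PkgSource -Recurse -File |
    Where-Object { $_.Attributes -band [System.IO.FileAttributes]::ReparsePoint }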

Throughout this guide we’ll be using PowerShell. Everything we do here can also be performed using the GUI, and on occasion I’ll show some GUI shots to give you a feel for where things are.

 

Installing Deduplication

The feature’s been built in since Windows Server 2012; you just need to add it:

 

Import-Module ServerManager

Add-WindowsFeature -Name FS-Data-Deduplication

 

image
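
If you want to double-check that the feature landed before carrying on, a quick query will confirm it:

Get-WindowsFeature -Name FS-Data-Deduplication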

 

Enabling Deduplication

My data volume is waiting for deduplication to be switched on; installing the feature doesn’t activate it on the volume, so let’s do that first:

 

Enable-DedupVolume -Volume E:

 

image

 

Configuring Deduplication

We can tell the deduplication feature to exclude folders and file types, and configure it further. I’m going to tell it to deduplicate any file older than 1 day (if you set this to 0, it will begin processing files immediately), to exclude a bunch of folders, and to skip compression on file types that are already well compressed such as PNG, CAB, ZIP and LZA; the rest can be left at the defaults:

Set-DedupVolume -Volume E: -MinimumFileAgeDays 1 -ExcludeFolder E:\PkgSource,E:\SMSPKG,E:\SMSPKGSIG,E:\SMSSIG$ -NoCompressionFileType PNG,CAB,ZIP,LZA

 

image
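
You can also read the settings straight back off the volume to confirm they took; the property names here are what Get-DedupVolume exposed on my server, so treat this as a quick sketch:

Get-DedupVolume -Volume E: | Format-List MinimumFileAgeDays, ExcludeFolder, NoCompressionFileType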

 

We can see that the feature is installed, enabled and configured by visiting the Server Manager Dashboard, selecting the disk and right-clicking the volume you’ve enabled Deduplication on:

 

image

 

The schedule hasn’t been tweaked, but you can go in here and make further changes:

 

image
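
The schedules can be read and changed from PowerShell too. As a hedged example (the schedule name and times below are made up for illustration), this lists the built-in jobs and adds an extra weekend optimization window:

Get-DedupSchedule

New-DedupSchedule -Name "WeekendOptimization" -Type Optimization -Days Saturday,Sunday -Start 08:00 -DurationHours 10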

 

Start Deduplication

It’s ready, but the Deduplication filter isn’t doing anything yet; to engage it you need to kick off a job that will begin optimization of the volume:

 

Start-DedupJob -Type Optimization -Volume E:

 

image
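
Start-DedupJob also takes a few switches that are handy for that first big run; as an example (values picked arbitrarily), you can give the job more memory, raise its priority and have the cmdlet wait until it finishes:

Start-DedupJob -Type Optimization -Volume E: -Memory 50 -Priority High -Wait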

 

Monitoring progress

Deduplication should now be in progress, and you can check on the job that’s doing the work:

 

Get-DedupJob

 

image
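
If you’d rather not keep re-running that by hand, a small polling loop works too; this is a rough sketch that assumes the Progress property on the job object is a percentage, which is what I saw in the lab:

# Report on the running job every 30 seconds until it leaves the queue
while ($job = Get-DedupJob -Volume E: -ErrorAction SilentlyContinue) {
    "{0} job on {1}: {2}% complete" -f $job.Type, $job.Volume, $job.Progress
    Start-Sleep -Seconds 30
}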

 

And you can review the statistics as well:

Get-DedupStatus

 

image

The InPolicyFiles column shows how many files fall within the deduplication policy and are being watched for common chunks, and OptimizedFiles shows how many files have already been deduplicated.
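
You can pull just those counters (plus the space figures) out with a select rather than eyeballing the whole table; the property names match what Get-DedupStatus returned in my lab:

Get-DedupStatus -Volume E: | Select-Object Volume, InPolicyFiles, OptimizedFiles, SavedSpace, FreeSpace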

 

While the Deduplication filter is busy yielding massive storage savings, you can stop the job, stop deduplication of new files or remove deduplication entirely.

 

Stop any existing Deduplication jobs

This cancels any existing deduplication jobs, whether they’re deduplicating a volume or un-optimizing (removing Deduplication from) a volume:

 

Stop-DedupJob -Volume E:

 

Stop further Deduplication on the Volume

This stops new files from being deduplicated; it does not remove Deduplication from the volume:

 

Disable-DedupVolume -Volume E:

Remove Deduplication entirely

You’ll probably run into a severe lack of storage space if you use this command after the deduplicated volume has been in use for a while, as the files being returned to their original state will probably exceed the actual free storage space; think of too much popcorn in the microwave:

 

Start-DedupJob -Type UnOptimization -Volume E:
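
Before running that, it’s worth a quick back-of-the-envelope check that the volume can actually hold the re-inflated data. A rough sketch, assuming the SavedSpace and SizeRemaining properties behave as they did in my lab:

$saved = (Get-DedupVolume -Volume E:).SavedSpace
$free  = (Get-Volume -DriveLetter E).SizeRemaining
"Needs roughly {0:N1} GB back, {1:N1} GB free" -f ($saved / 1GB), ($free / 1GB)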

 

Update Deduplication statistics

Statistics gathering runs on a schedule, but you can trigger a statistics update outside of the schedule:

 

Update-DedupStatus -Volume E:

 

image

 

Completion and reviewing the results

 

Eventually the Deduplication job will complete, and using a few commands we can check out the savings and various other statistics:

 

Get-DedupStatus

 

image

 

Get-DedupStatus | fl

 

image

 

Get-DedupVolume -Volume E:

 

image

 

Get-DedupVolume -Volume E: | fl

 

image

 

Get-DedupMetadata -Volume E:

 

image

 

You can also get some of this information from the GUI:

 

image

 

And something worth noting: from File Explorer, deduplicated files will look a little unusual. The Size on disk value reflects only the chunks the file is actually occupying on the volume, not the total size of the whole file:

 

image

 

As a feature it’s a no-brainer to switch this on if you’re running Windows Server 2012 and can put your Distribution Point Content Libraries onto volumes that Deduplication supports. You’re bound to have a big OSD store and lots of Applications and Packages, and the results you get with Deduplication on the ConfigMgr Distribution Point Content Libraries are nothing short of jaw-dropping. There’s also a tie-in with BranchCache to further optimise link bandwidth usage; quoting the ConfigMgr blog posting:

Another benefit for Windows is that the sub-file chunking and indexing engine is shared with the BranchCache feature. When a Windows Server at the home office is running deduplication the data chunks are already indexed and are ready to be quickly sent over the WAN if needed. This saves a ton of WAN traffic to a branch office.

 

Some handy links:

PowerShell: Enable-DedupVolume

PowerShell: Get-DedupVolume

PowerShell: Set-DedupVolume

PowerShell: Update-DedupStatus

PowerShell: Get-DedupStatus

PowerShell: Start-DedupJob

PowerShell: Get-DedupMetadata

TechNet: Install and Configure Data Deduplication

TechNet Blog: Windows Server Storage Team Blog

TechNet Blog: ConfigMgr Blog