ConfigMgr High Availability and the Content Library

Sub-titled “ConfigMgr High Availability feature and making the Content Library highly available”.

In this article I’m going to focus specifically on considerations around where to place the Content Library for High Availability purposes.

The Content Library, a ‘layer’ that flies under the radar for most and resides on the Primary in ConfigMgr, now has to be moved off the Primary before High Availability can be enabled.

It cannot be moved back.

Most designs out there have a Distribution Point on the Primary, which is fed by the Content Library, which in turn is used to feed remote Distribution Points.

The Content Library itself is ‘fed’ by the Content Source locations you specify when introducing new content.

If a server that hosts the content source, assuming it isn’t the Primary, were to fail, it would be a simple exercise for an administrator to create a new share to act as a repository while the server hosting the original content source share is recovered. If you wanted, you wouldn’t even need to recover the server hosting the content source at all (again, provided it isn’t the Primary): you could recover the data itself, put it onto the new share, then change the content source locations for the existing content using the tooling that is out there, or automation via PoSh. A labour-intensive exercise, but doable.
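As a rough illustration of that automation, here’s a minimal PoSh sketch for repointing classic Packages at a replacement content source share via WMI. The site code, server names and the assumption that the folder layout is mirrored on the new share are all mine; Applications, drivers and the other content types store their source paths differently and would need their own handling.

```powershell
# Minimal sketch: repoint classic Package source paths at a replacement share.
# Assumes site code 'PS1' and that the folder layout under the new share mirrors
# the old one. Run on the SMS Provider / site server with admin rights.
$siteCode  = 'PS1'                   # assumption - your site code
$oldPrefix = '\\OLDSERVER\Source$'   # assumption - failed content source share
$newPrefix = '\\NEWSERVER\Source$'   # assumption - replacement share

Get-WmiObject -Namespace "root\sms\site_$siteCode" -Class SMS_Package |
    Where-Object { $_.PkgSourcePath -like "$oldPrefix*" } |
    ForEach-Object {
        $_.PkgSourcePath = $_.PkgSourcePath.Replace($oldPrefix, $newPrefix)
        $_.Put() | Out-Null          # commit the change back to the SMS Provider
    }
```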

This is not the case with the Content Library.

You cannot simply create a new location for the Content Library and carry on, because you would need to ‘move’ the original Content Library to the new location first. Catch-22: the server hosting the Content Library share is most likely down, so all energy has to be focused on recovering said server to regain access to the share, and once the share is accessible again there is no need to proceed with a ‘move’ at all.

Without the Content Library, and read\write access to it, no new content can be created. Many wheels can still turn, including the site server role now, but we’re not highly available until the Content Library is brought back online.

It can quite literally paralyze operations.

That Content Library share is now very special indeed.

So as of Build 1806 of ConfigMgr Current Branch, the options for managing the Content Library are quite narrow: move it onto an SMB share, which puts the onus on you to introduce complex infrastructure to support that share.

Let’s explore what this means.

In the diagram below we can see a High Availability design mock-up showing the minimal amount of ‘moving parts’ necessary to implement High Availability in Build 1806, while keeping client communications and console access pinned to two site systems. The Primary is left to get on with being a Primary and processing load as fast as possible, which is key: we’re operating an essentially queue-based product here, and we always try to tease out congestion \ chokepoints \ bottlenecks where we can.

* SIRs = Single Instance Roles, the Service Connection Point being a prime example

As I stated above, the roles can be dispersed differently, depending on personal preference really. The SMS Provider can be installed onto the Primaries, as is technically possible in Build 1806, along with every other role barring the Distribution Point and the Content Library; or the SMS Providers could, I believe, be installed onto the SQL servers, so that their communications with SQL are not network-based. SQL, however, has to remain remote from the Primaries for this cut of the feature.

So how are we handling the Content Library in the above design?

We’re using an SMB share on a site system to host the content library, which now introduces a weakness, a vulnerability.

It is now a single point of failure due to the dependence on the site system hosting the SMB share. A classic textbook problem.

On the site system itself, the data can be made highly available using physical disks in a RAID 10 configuration presented to the site system, or SAN-presented storage; lose a disk here or there and, with a spare kicking in, the data is safe and will remain accessible.

If the site system is installed onto a virtual machine, and a lot of sites are running on virtual machines, either on-premises or in Azure or other cloud services, it will benefit from the host’s underlying protections and performance (RAID or LUN disks from a SAN, SSDs).

But if we lose the host, we’ve lost access to the share, which means we’re paralyzed until we do something about it; even if we could switch the disk over to another site system, no dice.

Time for a cup of tea and some Disaster Recovery.


So you’ve got this super-fancy Highly Available hierarchy, and you’ve touted it as the bee’s knees to management, but if you lose one site system, the one that hosts the SMB share for the Content Library, there is a lot of noise generated and energy required to bring things back onto the rails with some DR. It’s a look, just not a good look.

Another consideration is that the lost site system may contain the SIRs, the single instance roles such as the Service Connection Point, Endpoint Protection Point and Asset Intelligence Synchronization Point. Their loss is easily recovered from, as are the others: simply remove them from the missing site system and place them onto the alternate site system, then in your own time perform DR or rebuild the failed site system. Your LLD should plan for both site systems to access the internet for these services.

Having the Content Library as a SPOF is still a better look than pre-High Availability, since the site server role, the old blocker, is now taken care of and still ticking along nicely: critically, servicing client registration requests and thus not impeding OSD, and processing work queues, working magic as wheels within wheels continue to turn, with the site systems providing the redundancy needed to do so, along with clustered SQL and role redundancy.

Just no new content can be authored into the hierarchy.

There is another way. One that Conan and I are just getting into.

Make the share itself highly available by using a Cluster Share.

The hostname becomes an alias, and we can survive cluster node failure as a result.

Perfect.

Now in our design, where would the best place to house this cluster share be?

As ever, showing his kingly wisdom and overall knowledge of SCCM design, Conan nails it: yes, on the SQL servers, since they already have a Windows Cluster set up. By Crom, his wisdom is strong today.

I’m not going to pore over the storage plumbing and wiring options available in 2018, or the myriad of technologies and acronyms available to pull this off; the options are diverse and will be pretty unique to each environment hosting the services. But I will briefly show you how I put together a “get to know you” mock-up in my lab, using the Hyper-V Shared Virtual Disk feature to share a single disk between VMs on the same Hyper-V host, and Windows Clustering to leverage the clustered File Server role inside the SQL VMs, so as to produce a cluster share for the Content Library to reside on.

First off, I’m using a single server running the Hyper-V role to host the six VMs that form the High Availability lab.

I do not have a SAN, or any fancy storage technology, so I have to pull off some unsupported tricks to make things work: installing the Failover Clustering feature onto the Hyper-V server and using the FLTMC.exe command to attach the shared virtual disk filter to a given volume.
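For reference, the host-side prep boils down to something like the following lab-only sketch; installing Failover Clustering is what brings the svhdxflt filter driver onto the host, and the attach itself is shown a little further down:

```powershell
# Lab-only sketch, run on the standalone Hyper-V host.
# The Failover Clustering feature brings the shared virtual disk filter (svhdxflt)
# along with it; the fltmc attach shown later in the post does the rest.
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
```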

So yeah, Hyper-V has an enabling feature called Shared Virtual Disk, which uses a disk file type with the extension VHDS that can be shared amongst multiple virtual machines on the same host, or, with further plumbing\wiring, between hosts, which is what you will need to do if you’re thinking of doing this in production.

There are some restrictions, which mean I have to store the VHDS files on a different volume than the traditional VHD files.

I store all my VMs on a pretty fast RAID 10 array, but I also have a couple of 2TB disks plugged in and not doing much, along with an idle 64GB SSD.

I gave the SSD a drive letter, and nominated it as the target to host the VHDS files. This caused major problems which I’ll mention shortly.

At this point I had to perform said unsupported trick, so as to beguile and force Hyper-V into allowing the SSD to be used for the VHDS files. The shot shows I used the F: volume, but that was on the second pass around due to some fudgery on my part; initially I used the E: drive letter, which shows up in the other shots below:

Command line: fltmc.exe attach svhdxflt F:\

If you tried doing this using the management console in Hyper-V, it’d deny you as the SATA3 HDD isn’t supported:

And unless you have some complex infrastructure behind you for the lab, as in storage appliances, this manual work-around needs to be applied. Keep in mind it is for a lab with minimal kit, and the manual application of this filter to the new drive isn’t permanent, needing to be re-applied after every reboot.
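If re-applying it by hand gets old, a startup scheduled task along these lines will reattach the filter for you after each reboot; this is a hedged sketch and the volume letter is an assumption:

```powershell
# Lab sketch: reattach the svhdxflt filter at boot, since a manual fltmc attach
# does not survive a restart. Point the argument at whichever volume holds the
# VHDS files (F:\ here is an assumption).
$action  = New-ScheduledTaskAction -Execute 'fltmc.exe' -Argument 'attach svhdxflt F:\'
$trigger = New-ScheduledTaskTrigger -AtStartup
Register-ScheduledTask -TaskName 'Reattach svhdxflt' -Action $action -Trigger $trigger `
    -User 'SYSTEM' -RunLevel Highest
```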

A lot further into my lab work the Hyper-V host began to BSOD.

That was worrying lol.

It began after starting the move of the Content Library, which got about 40% of the way through when the crashes started (I was log watching …).

The penny dropped and I figured it out after about three BSODs: clearly the SVHDXFLT filter applied to the SSD was causing the crashes:

But why?

Well, the Content Library was larger than the SSD, topping out at around 100GB, 99% of which is patches!

The SVHDXFLT driver was literally causing the Hyper-V host to fall over.

Anyway, to overcome this I chose a 2TB HDD, moved the existing VHDS files from the SSD onto it, issued the fltmc.exe command again to attach the svhdxflt filter to the new volume, and finally altered the two SQL server VMs’ configuration settings so that the shared disk path pointed at the new location.
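Repointing the VMs at the relocated VHDS can be done from the Hyper-V host with something like this sketch; the VM names, controller positions and path are assumptions, so check yours with Get-VMHardDiskDrive first and power the VMs off before changing the path:

```powershell
# Sketch: point both SQL VMs at the relocated VHDS file (VMs powered off).
# VM names, controller number/location and the path are assumptions for this lab.
foreach ($vm in 'SQL01','SQL02') {
    Set-VMHardDiskDrive -VMName $vm -ControllerType SCSI -ControllerNumber 0 `
        -ControllerLocation 1 -Path 'G:\VHDS\ContentLibrary.vhds'
}
```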

Obviously the BSODs stopped.

Keep in mind that if the VMs are split across Hyper-V hosts, or are individual physical hosts, you’ve got a lot more storage wiring to carry out to present a shared disk, which is what you will want to do in production (have I mentioned this enough times already? hehe). I chose a quick and easy way to do it.

The next thing to do, once the chosen volume is made ready for hosting shared virtual disks, is to create a new shared virtual disk on it:

Deposit the VHD Set shared virtual disk onto the volume set aside for the VHDS files. We can then present this shared virtual disk to both SQL servers as shown here:
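In PowerShell, the equivalent of those two steps looks roughly like this lab sketch, run on the Hyper-V host; the path, size and VM names are assumptions:

```powershell
# Create the VHD Set (.vhds) on the volume set aside for shared virtual disks,
# then attach it to both SQL VMs. Path, size and VM names are assumptions.
New-VHD -Path 'F:\VHDS\ContentLibrary.vhds' -SizeBytes 300GB -Dynamic

foreach ($vm in 'SQL01','SQL02') {
    # -SupportPersistentReservations marks the attachment as a shared drive
    Add-VMHardDiskDrive -VMName $vm -Path 'F:\VHDS\ContentLibrary.vhds' `
        -SupportPersistentReservations
}
```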

From one of the SQL servers, use Disk Management to initialise the disk, format it and assign a drive letter that is unique across all of the cluster’s nodes. I did this from the other SQL server as well, and noted that my writes to the disk from either SQL server were not reflected on the other. I could write to the volume presented to either SQL server, but neither could see the writes of the other. I put it down to not knowing well enough what I was supposed to be doing here, and thought it wouldn’t be possible to proceed further.
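(In hindsight that behaviour seems expected: a plain NTFS volume isn’t coordinated between nodes the way a Cluster Shared Volume is, so each node keeps its own view of the disk until it is handed over to the cluster.) The disk prep itself only needs doing from one node, and in PowerShell looks something like this sketch, with the drive letter and label being assumptions:

```powershell
# Sketch: initialise and format the newly presented shared disk from ONE node.
# Drive letter and volume label are assumptions for this lab.
Get-Disk | Where-Object PartitionStyle -eq 'RAW' |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -DriveLetter L -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel 'ContentLibrary'
```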

Choose one of the SQL servers and fire up Failover Cluster Manager. Create a new disk, and select the shared virtual disk that is presented in the wizard.

If all goes well it’ll be accepted and show in the view as a cluster disk:
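The PowerShell equivalent of adding the disk, run from one of the SQL nodes, is along these lines:

```powershell
# Sketch: pull in any disks visible to the cluster but not yet added.
Get-ClusterAvailableDisk | Add-ClusterDisk
```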

Now head to Roles and add the File Server role, choosing ‘File Server for general use’.

Provide the listener name (I used FSListener) for this cluster role, along with an IP:
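Or as a PowerShell sketch, where the cluster disk name and listener IP are assumptions to swap for your own:

```powershell
# Sketch: create the clustered File Server role with a client access point.
# 'Cluster Disk 1' and the IP address are assumptions - use your own values.
Add-ClusterFileServerRole -Name 'FSListener' -Storage 'Cluster Disk 1' `
    -StaticAddress '192.168.1.50'
```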

Then create a share on the File Server.

Choose the SMB Share – Quick option, and the FSListener that we created:

Next we give the share a name:
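For reference, the wizard boils down to something like this in PowerShell, run on the node that currently owns the file server role; the share name, path and the account granted access are assumptions for this lab:

```powershell
# Sketch: create a continuously available share scoped to the FSListener
# client access point. Share name, path and account are assumptions.
New-SmbShare -Name 'ContentLibrary' -Path 'L:\ContentLibrary' `
    -ScopeName 'FSListener' -ContinuouslyAvailable $true `
    -FullAccess 'LAB\CM01$'   # site server computer account needs Full Control
```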

And that’s it. We now have a highly available clustered share to move our Content Library onto.

I cannot stress enough at this point that for production the storage would be wired properly, not like this; this is just to get you up and running in a basic lab that has no network storage presented to it.

If you type \\FSListener into Start\Run you should be able to explore the newly created share. Try creating a file on there.

For production you’d need to review the permissions on the share to secure it, as by default the thing is wide open to authenticated users.
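Tightening it up can be done with the SMB cmdlets; a rough sketch follows, with the broad group and the site server account both being assumptions to match to your environment:

```powershell
# Sketch: strip the broad access the wizard granted and give the site server's
# computer account Full Control instead. Group and account names are assumptions;
# remember the NTFS permissions on the folder need the same treatment.
Revoke-SmbShareAccess -Name 'ContentLibrary' -ScopeName 'FSListener' `
    -AccountName 'Everyone' -Force
Grant-SmbShareAccess -Name 'ContentLibrary' -ScopeName 'FSListener' `
    -AccountName 'LAB\CM01$' -AccessRight Full -Force
```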

Now all we have to do to wrap up here is head to the SCCM console and initiate a move of the Content Library from our Site system to the clustered share:

After a wait, while keeping an eye on the DistMgr log, the Content Library will have been moved, and the SPOF mentioned at the front of this post will no longer exist. With the design outlined at the front of this post altered to show the repositioning of the Content Library onto a cluster share, you now have the highest level of availability possible in ConfigMgr today. That’s pretty rad in and of itself.
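If you’d rather tail the log from PowerShell than CMTrace, something like this works; the install path is an assumption, so adjust it to your site server’s ConfigMgr directory:

```powershell
# Sketch: follow distmgr.log while the Content Library move runs.
# Install path is an assumption - point it at your ConfigMgr Logs folder.
Get-Content 'E:\Program Files\Microsoft Configuration Manager\Logs\distmgr.log' -Wait -Tail 50
```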

Although this is really cool to do, and I had fun discovering how to carry this out in my lab, I felt like I was entry-level with the technologies involved, knowing the new stuff in light amounts, and just scraping through.

This post was written as I flowed through on my first pass, rather than, as with other posts, where I’m more prepared. So if there are parts missing, just keep plugging away and you’ll get there; I’ve proven it all fits together over here in my lab.

I feel all of this is too much heavy-lifting and sheer overkill to keep the content library around.

Sure, most customers will have the tooling to make this happen, but it’s taken something that is very approachable by the masses, High Availability, and turned it into something a sliver of the masses may pass by, simply because of the additional requirements involved in making the Content Library highly available. Or they may decide not to use a clustered share and put up with outages while the site system hosting the content is recovered, which isn’t really high availability.

I’m sure this could be simplified; it should be built into the product rather than being pushed out to complex infrastructure to manage the problem.

I would like, and have spoken about, a passive\backup Content Library, so that we can place the thing in multiple locations. In the design at the front of this post that could be the site systems used for the human\client-facing roles: failure of a single site system would result in falling back to the passive\backup library on the other, no fancy infrastructure wiring needed, and it would give us time to perform DR on the failed site system in the background as a low-priority task.

But we build with what we’ve got, and try to pour honey into the product group’s ear.

For a first go, and with the requirements to implement all of this readily available, this will no doubt be the way most people configure the Content Library for availability.

Use the comments below to let me know how you’ve designed for high availability of the Content Library, especially the hardware and configurations you’ve brought to bear. :)

Some reading to give you a light headache:

https://blogs.technet.microsoft.com/filecab/2016/03/25/smb-transparent-failover-making-file-shares-continuously-available-2/

https://docs.microsoft.com/en-us/windows-server/failover-clustering/failover-cluster-csvs

https://blogs.technet.microsoft.com/askpfeplat/2012/10/10/windows-server-2012-storage-spaces-is-it-for-you-could-be/

https://blogs.technet.microsoft.com/josebda/2013/07/31/windows-server-2012-r2-storage-step-by-step-with-storage-spaces-smb-scale-out-and-shared-vhdx-virtual/