High Availability–Failing over to a new Passive

I’ve been very much interested in the High Availability feature since its inception.

I’ve followed its evolution in technical preview, kicking its tyres and discussing it with the product group and MVP peers. Seeing the feature leave the development dock of technical preview and set sail in current branch is exciting.

Exciting because the High Availability feature is a high-value design element for ConfigMgr Hierarchies, something architects have been waiting a very long time for.

We’re used to high availability for most roles now, with a few single-instance roles causing ripples, but the crown jewel is the site server role.

With all the redundancy in place, business can continue, but nothing new can be authored without that site server role and, critically, client registrations would cease to function, causing OSD failures.

High Availability in its first form makes the Site server role highly available, with a small lead time between transitions. That lead time is mostly down to failover being a manual exercise in Build 1806, and to a delay of up to 30 minutes before the transition completes. This will ease out to become instant when automatic failover is introduced. And as this feature iterates, we’ll see a reduction in the server count.

There’s more that High Availability can be used for. In the future we’re going to be able to build multiple primaries to leverage elastic computing, scaling back as demand tails off. Right now we can use High Availability to completely remove the need to do an in-place OS upgrade on a primary ever again.

I put together a quick poll on Twitter to gauge interest in the feature. 427 votes is surely not representative, but it gives me an idea of interest nonetheless.

Over half of the participants do not have High Availability on the radar. The remaining half breaks down almost equally between not interested and implementing soon, with a small percentage implementing right now.

I’ll do another poll later in the year. I’m sure by that time we’ll see far more adoption as the penny drops.

In this post, I’m going to focus on showing you how to recover from an unrecoverable primary site failure, using an already built High Availability ConfigMgr hierarchy.

At some point I’ll rattle out a HA build-out guide. It is pretty straightforward, and I see that others have done this already. Alongside my now-ancient guide on SQL AlwaysOn, I don’t see what more I could contribute beyond the good work they have already laid down, but I may have a go at it nonetheless.

Right then. Let’s define the lab I’m using, to make it easier to track what is about to happen.

The server and role manifest for the lab hierarchy looks like this as of Build 1806:

  • Node 1 – L3CMN1 – ConfigMgr (Active)
      • No roles other than Site server role
  • Node 2 – L3CMN3 – ConfigMgr (Passive)
      • No roles other than Site server role
  • Node 3 – L3CMSS1 – Site system
      • SMS Provider
      • MP, DP, SUP, FSP, SCP, needed SMB Shares for Content Library, SQL AlwaysOn database transfer and the Windows Cluster File Share Witness
  • Node 4 – L3CMSS2 – Site system
      • MP, DP, SUP, FSP
  • Node 5 – L3CMSQL1 – SQL Server (Clustered)
      • Reporting Services
      • Reporting Point
      • WSUS Database (in an AG!)
      • SCCM Database (AG)
      • Reporting Services Database #1
  • Node 6 – L3CMSQL2 – SQL Server (Clustered)
      • Reporting Services
      • Reporting Point
      • WSUS Database (in an AG!)
      • SCCM Database (AG)
      • Reporting Services Database #2
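
If you prefer to track the lab as data, the manifest above can be written down as a small Python sketch. The dictionary layout and the helper name servers_with are purely illustrative (they are not part of ConfigMgr), and the SMB shares and databases are left out for brevity.

```python
# Lab manifest from the list above, as plain data.
# Layout and names are illustrative only; shares and databases are omitted.
LAB = {
    "L3CMN1": ["Site server (Active)"],
    "L3CMN3": ["Site server (Passive)"],
    "L3CMSS1": ["SMS Provider", "MP", "DP", "SUP", "FSP", "SCP"],
    "L3CMSS2": ["MP", "DP", "SUP", "FSP"],
    "L3CMSQL1": ["Reporting Services", "Reporting Point"],
    "L3CMSQL2": ["Reporting Services", "Reporting Point"],
}


def servers_with(role, manifest=LAB):
    """Return the servers that host the given role, sorted by name."""
    return sorted(name for name, roles in manifest.items() if role in roles)


print(servers_with("MP"))
print(servers_with("SMS Provider"))
```

A lookup like this makes it quick to see which boxes still cover a role when one of the nodes is switched off.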

This is a lab build-out and not for production, so not all site roles are installed or referenced. Their inclusion and placement is a matter of simple consideration and action. For example, the Reporting Point and Reporting Services elements can be located on the site systems instead of on the SQL servers, and the Asset Intelligence Synchronisation point can be put on the Site systems.

In the above mock-up I’ve placed the SMS Providers onto the Site systems instead of the Primaries. Placing the SMS Provider onto the primaries is also a supported option in Build 1806.

When Build 1810 arrives, we’ll be able to condense this solution down to 4 nodes by placing the SQL AlwaysOn and Windows Cluster onto the two primary site servers and jettisoning remote SQL.

At some point we may very well arrive at a 2 node solution, time will tell. There are a few interesting technical barriers that need to be overcome before such a design can be realised, making it unrealistic in the relatively short term. I figure that 4 nodes to make a rock-solid highly available hierarchy is a cost worth paying today.

So let’s have a play with my HA lab, literally switching off a passive site server to check out how unplanned events are handled.

We could have switched off the active primary instead, but that would have incurred a time penalty while we wait for the transition to take place and the passive to become active. I might do that at the end of this guide, let’s see.

Handling an unrecoverable passive Site server

My two Primaries are called:

  1. L3CMN1 (Active)
  2. L3CMN3 (Passive)

Node 2 ‘took one for the team’ during testing.

I’ve turned off L3CMN3.Lab1.com, so as to simulate the complete and unrecoverable failure of one of the two nodes of a High Availability cluster.

As you can see in the shot below, L3CMN3.Lab1.com is listed as a Site system, which has the Site Server role installed:


* That lower-case server entry is annoying me too, sorry

If you try to remove the site system by right-clicking on it in the top panel, and selecting delete, it’ll complain that there is a role there:

So clearly the first step is to remove that role from the site system. The role is shown in the Site System Roles panel; right-click the role there and select Delete:

Give it a few moments, then select the site system in the top panel and select Delete:

L3CMN3.Lab1.com is gone:

We’re currently running with one primary, so we’re going to need to build ourselves a new passive primary at some point. We’d obviously schedule this for out of hours, although there is little, if any, operational down-time while building a new passive primary.

Building a new passive Primary

Ready up a new OS, give it a name and a fixed IP, install IIS + ADK in accordance with a Primary site’s needs, and set permissions so that the active primary and everyone else can say hello. Then create a new Site system in the ConfigMgr console; here I am creating L3CMN4.Lab1.com:

And I’m going to add the Site server in passive mode role:

Let it copy the files to the server, and install to C:\Program Files\Microsoft Configuration Manager

The new passive primary L3CMN4 now shows in the Servers and Site System Roles list:

Note that by default an SMS Provider is not installed with the passive primary. I recommend not placing them there if you have site systems to hand for client-facing roles, as it will complicate and extend recovery times. Maybe in the future when we’re down to 2 or 3 nodes, it may be more appropriate to house everything on the primaries, everything, maybe.

Keep an eye on the FailOverMgr log, and visit the new Primary and check out its setup log. I recommend using LogLauncher and punching in the servers that make up your HA setup; use it to open the logs, it will make life easier.

While we’re waiting for the build to complete, we can button-mash the refresh button in the console. If it fails, it is most likely due to permissions or prerequisites; the FailOverMgr log and site setup logs will help you shore the issues up, and a Retry can be performed (right-click the node in Sites > (Site server) > Nodes tab).

L3CMN4 is in place as a passive Primary, and we’re back to high availability of the Site server role:

With minimal roles deployed to the Primary, the option to give up on classical DR and build out a new primary site server on failure becomes a reality, and a good course of action to take.

If you have Roles installed onto the Primaries, things get a little bit more complicated, especially for the SMS Provider, and when Build 1810 arrives and SQL is co-located with the Primaries, building a new passive node will require a lot more to achieve.

All very doable though.

I figure I could build a passive and get it back into a Windows Cluster, then install SQL using a pre-configured unattended file, get the availability group feature running and join it back in to the existing AG being held alive by the active primary node in time for tea. Well, over a few hours, if things are well-connected and it’s all running on good equipment (not laggy servers). But there will be down-time, due to a site reset being necessary to complete the SMS Provider removal process, which has to take place before a new primary can be built.

In a moment I’m going to nuke the Active Primary by turning it off.

As you can guess from the serialisation, L3CMN1 is the very first primary built, so offing it is like cutting the cord in a big way. We’re not swinging back and forth between an artificial primary and a real primary; each primary in a High Availability configuration is a real primary, which cares not whether its originator still exists, just whether it can reach the database and site systems.

Let’s step back a moment and ponder things.

How many of you are fixated on keeping the primary alive at all costs, treating it like an irreplaceable treasure chest?

Isn’t it a strange feeling to begin treating a primary like a throw-away lighter, something that can easily be replaced?

That’s new, exciting as I said.

High Availability is a game-changer, it is literally forcing us to look at a primary, which we’ve treated with affection and protected at all costs in the past, as something easily replaced and ‘disposable’. Almost.

3rd Party product integration with primaries may require some tending, which would complicate a High Availability design or recovery procedure. I’m exploring that.

Ok let’s have some more fun.

Making life complicated, a dead primary with SMS Provider and Roles

I’ve now installed the SMS Provider and a Management Point onto L3CMN1, before I sacrifice it to the Hyper-V gods for this guide.

SMS Provider coverage now looks like this:

  1. L3CMN1 – SMS Provider
  2. L3CMSS1 – SMS Provider
  3. L3CMSS2 – SMS Provider

Having the SMS Provider on the Active Primary will complicate things a little when it comes time to perform magic. That is deliberate, so I can show you one way to back this stuff out and remove an unrecoverable passive site server with roles installed.

Remember there is no console installed onto a newly built passive, so when I turn the lights off on L3CMN1, I’m going to transition to the other primary and install the console there. This will get me back to using the console to talk to the SMS Providers on the two site systems.

When you install the console manually on the newly minted passive primary, use ConsoleSetup.exe, as this will download any prerequisites needed for the console to operate fully such as the VC redistributables. This caught me out as I installed using the MSI.

So first let’s turn off the VM for L3CMN1, the very first Primary in this Hierarchy.

You did us proud …

Done.

We now need to induce a fail-over, as this is not automatic yet. Go ahead and promote the passive to active, then wait for the next site server status check, which occurs every 1800 seconds (30 minutes):

The console will show that the promotion is taking place:

Eyeball the FailOverMgr log, check the timestamp of the last site server status check and add 30 minutes to it; that’s your marker for the transition taking place.

Look for the Detected this site server … entry to confirm that the site server is now the active primary:
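
The arithmetic is trivial, but as a minimal Python sketch it looks like this. The 1800 second interval is the figure the FailOverMgr log quotes; the function name and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta

# Interval quoted by the FailOverMgr log ("will check site server status in 1800 seconds").
STATUS_CHECK_INTERVAL = timedelta(seconds=1800)


def next_status_check(last_check):
    """Return the earliest time the next site server status check is due."""
    return last_check + STATUS_CHECK_INTERVAL


# Hypothetical timestamp read from the log by eye.
last_seen = datetime(2018, 8, 15, 10, 12, 0)
print(next_status_check(last_seen))  # 2018-08-15 10:42:00
```

Working with datetime rather than adding 30 to the minutes by hand keeps the result right when the check straddles the hour or midnight.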

will check site server status in 1800 seconds…
Waiting for notification changes for maximum 1800 seconds…
Wait for update notification timed out.
returned public key [0602000000A400005253413100080000010001009B347C14EFE055C110E963DE…]
Crypto exchange public key (0602000000A400005253…) is found in ConfigMgr database. And it is consistent. Done.
Trying to update identification site servers registry value to [0;L3CMN1.Lab1.com;1;L3CMN4.Lab1.com;]
Detected this site server (L3CMN4.Lab1.com) is now active. Previous active site server is L3CMN1.Lab1.com
Successfully reported ConfigMgr update status (SiteServer=L3CMN4.Lab1.com, SubStageID=0xf0001, IsComplete=1, Progress=1, Applicable=1)
Successfully reported ConfigMgr update status (SiteServer=L3CMN4.Lab1.com, SubStageID=0xf0001, IsComplete=2, Progress=100, Applicable=1)
Successfully reported ConfigMgr update status (SiteServer=L3CMN4.Lab1.com, SubStageID=0xf0002, IsComplete=1, Progress=1, Applicable=1)
Stopping service SMS_SITE_COMPONENT_MANAGER on server L3CMN1.Lab1.com
Failed to stop service SMS_SITE_COMPONENT_MANAGER on server L3CMN1.Lab1.com, Win32 Error = 1722, dwRet=1
Successfully reported ConfigMgr update status (SiteServer=L3CMN4.Lab1.com, SubStageID=0xf0002, IsComplete=3, Progress=100, Applicable=1)
Some ConfigMgr services are not stopped successfully on site server L3CMN1.Lab1.com
Successfully reported ConfigMgr update status (SiteServer=L3CMN4.Lab1.com, SubStageID=0xf0003, IsComplete=1, Progress=1, Applicable=1)
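
If you would rather not scan the log by eye, here is a rough Python sketch that picks the interesting entries out of lines like the ones above. The regular expressions are written against the message text shown in this post only; the function names are my own, and I have not assumed anything about what the individual IsComplete values mean, the sketch simply reports the last values seen per SubStageID.

```python
import re

# Patterns based only on the message text shown in the excerpt above.
ACTIVE_RE = re.compile(
    r"Detected this site server \((?P<active>[^)]+)\) is now active\. "
    r"Previous active site server is (?P<previous>[\w.-]+)"
)
STATUS_RE = re.compile(
    r"SubStageID=(?P<substage>0x[0-9a-fA-F]+), "
    r"IsComplete=(?P<is_complete>\d+), Progress=(?P<progress>\d+)"
)


def find_failover(lines):
    """Return (new_active, previous_active) from the first match, or None."""
    for line in lines:
        match = ACTIVE_RE.search(line)
        if match:
            return match.group("active"), match.group("previous")
    return None


def substage_progress(lines):
    """Return the last reported (IsComplete, Progress) for each SubStageID."""
    latest = {}
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            latest[match.group("substage")] = (
                int(match.group("is_complete")),
                int(match.group("progress")),
            )
    return latest


SAMPLE = [
    "Detected this site server (L3CMN4.Lab1.com) is now active. "
    "Previous active site server is L3CMN1.Lab1.com",
    "Successfully reported ConfigMgr update status (SiteServer=L3CMN4.Lab1.com, "
    "SubStageID=0xf0001, IsComplete=1, Progress=1, Applicable=1)",
    "Successfully reported ConfigMgr update status (SiteServer=L3CMN4.Lab1.com, "
    "SubStageID=0xf0001, IsComplete=2, Progress=100, Applicable=1)",
]

print(find_failover(SAMPLE))
print(substage_progress(SAMPLE))
```

Feed it the lines of the real log (however you choose to read the file) and find_failover returns None until the transition entry has been written.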

I’d like to be able to trigger this event immediately. I haven’t tried recycling the failover thread; I figure that might induce the check. The next time I build a passive primary I’ll try it.

When you see the FailOverMgr thread being restarted, it’s the cue to head back to the console to confirm the passive server is now the active server:

There is still work being done. If you visit the SiteComp log you’ll see that things are being reinstalled; wait for this to settle down.

Wait over, things have settled down, good. Here you can see that, since we created the Management Point role as well as the SMS Provider, the Component Server role is listed alongside the Site system role on the now unrecoverable passive site server:
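
"Settled down" is a judgement call. One crude way to approximate it, and this is purely my own assumption rather than anything ConfigMgr defines, is to treat a log as settled once it has not been written to for a few minutes. A minimal Python sketch, with a hypothetical helper name and threshold:

```python
import os
import tempfile
import time


def is_settled(log_path, quiet_seconds=300, now=None):
    """True if the file at log_path has not been modified for quiet_seconds.

    The five minute default is an arbitrary assumption, not a product value.
    """
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(log_path)) >= quiet_seconds


# Demo against a throwaway file with a fixed modification time.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, "SiteComp.log")
    with open(demo_log, "w") as handle:
        handle.write("demo entry\n")
    os.utime(demo_log, (1000, 1000))  # pretend the last write was at t=1000
    print(is_settled(demo_log, quiet_seconds=300, now=1400))  # quiet for 400 s
    print(is_settled(demo_log, quiet_seconds=300, now=1200))  # quiet for 200 s
```

It is no substitute for reading the log, since a quiet file can simply mean a stalled component, but it is a handy signal for when to go and look.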

We will have to remove the Site server role and SMS Provider before we can build a new passive primary.

Remove the Site server role on L3CMN1 by right-clicking it and selecting Delete:

The role leaves the console:

Have a look at SiteComp. There is an issue that rarely manifests if you remove the site server role and then immediately remove other roles; if you’ve encountered it you’ll see references to not being able to find some files. This is a known bug which is being fixed in the next hotfix release. Until the hotfix arrives, it is recommended that you perform a site reset at this point, before you remove any further roles.

Go ahead and perform a Site reset before you proceed just to play it safe.

Right-click the Management Point and remove it, leaving it looking like this:

Next up we head to setup.exe in the installation folder for ConfigMgr and remove the SMS Provider from L3CMN1, which would otherwise block us when we attempt to create a new passive primary:

Keep an eye on the ConfigMgrSetup log on the active primary to witness the removal of the provider, and as usual wait for the SiteComp log to settle down after.
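
To recap the order of play, here is the clean-up sequence from this section as a small Python sketch. It encodes nothing more than my reading of the steps in this post (the site reset is only there as the workaround until the hotfix lands); the function name and role strings are illustrative.

```python
def steps_before_new_passive(roles_on_dead_server):
    """List the clean-up steps, in order, before a new passive can be built.

    Follows the sequence used in this post; not an official procedure.
    """
    remaining = list(roles_on_dead_server)
    steps = []
    if "Site server" in remaining:
        remaining.remove("Site server")
        steps.append("Delete the Site server role in the console")
        if remaining:
            # Workaround: reset before removing any further roles.
            steps.append("Perform a site reset")
    for role in remaining:
        if role != "SMS Provider":
            steps.append(f"Delete the {role} role in the console")
    if "SMS Provider" in remaining:
        # The provider is removed with setup.exe, not from the console.
        steps.append("Remove the SMS Provider with setup.exe")
    return steps


for step in steps_before_new_passive(["Site server", "SMS Provider", "Management Point"]):
    print(step)
```

With only the Site server role on the dead box, as in the first half of this post, the list collapses to a single step.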

You can see that the defunct provider has been removed:

I introduced a new server called L3CMN5, and deployed the passive Primary role:

Now we have High Availability healthy again, let’s turn to the now redundant L3CMN1.

If you fire up the Console you’ll see that L3CMN1 is listed, and still shows both the Component server and Site system roles:

Not showing up? Bonus. But sometimes the component role is a bit sticky and can be a tad resilient, as in it does not behave predictably when it comes time for removal.

If it is still showing, we can attempt to remove the Component server role by initiating a Site reset, using Setup.exe from the installation folder. Remember that this is disruptive, and will cause an outage of the site server role:


Keep an eye on the SiteComp log, wait for things to fully settle down.

Open the Console, visit the Sites and Site servers node and:

It’ll either still be there, or the Site system will be the only entry shown.

I’ve followed this procedure multiple times now. On the first occasion I left it a day and a half and performed a site reset, and the component server role disappeared. On this occasion I chased things through pretty quickly, with the site reset performed within the same hour that the site server was decommissioned.

Okay I’ve given it a full day since I wrote the above paragraph, and the component server role is still there.

Persistence is key :)

When the Component server role eventually does drop out, and only Site system shows, it is a mere matter of selecting the site system and deleting it to wrap up. You’ll see that complaints about the defunct site server are no longer shown in the SiteComp log.

I hope it was fun reading along, let me know your thoughts on High Availability, either via Twitter (@RobMVP) or in my blog’s comment section (configmgr2012.com).

Robert