ConfigMgr–Software Update Point Affinity

 

There are quite a few cool features in the Build 1702 release of System Center Configuration Manager Current Branch. The one I’m going to go over here is the feature that gives Software Update Points Boundary Group awareness, or boundary affinity.

 

This complements the already boundary group-aware Distribution Point, Management Point and State Migration Point roles, and with this introduction, we finally have usage control over all the content roles.

 

In a flat, well-connected environment with multiple Software Update Points, or in an Azure-based environment, it doesn’t really matter which SUP a device anywhere in the Hierarchy uses. The only real consideration is that if a SUP goes down, a surge in network traffic will take place as devices attempt to synchronise with a new SUP. Multiplied across devices that switch at roughly the same time, and that sit in a remote office behind a slow link, that surge can be huge and disruptive.

 

When a device synchronises with a SUP, the payload can vary from 20MB to 200MB. The variance comes from a dynamic scan performed between the ConfigMgr Agent and the Software Update Point (with the Windows Update Agent (WUA) and Windows Server Update Services (WSUS) underpinning both): the OS and other information is sent to the WSUS instance that sits underneath the SUP, and a customised payload is returned to the device for it to use for patch scanning. Stripping out the OS patches that are not needed is what makes the payload size variable. The cost of switching between SUPs depends on the number of patches in the database that are applicable to the device, and on whether the newly selected SUP shares SQL with the previously used SUP (delta sync) or each uses its own SQL instance (full sync).

 

A good example of the hazards of a population switching to another SUP is a Secondary Site server whose SUP goes offline. This causes the devices using the SUP on the secondary to switch to one of the parent Primary Site’s SUPs. I’ve heard this effect referred to as “jumping the gate”: unexpectedly crossing the WAN link and causing network congestion, which is often the very reason you deployed a Secondary in the first place, so as to benefit from compression between the sites and the offloading of roles from the Primary, reducing resource (disk, memory, CPU) and network utilisation. There isn’t any resiliency with this design approach because your Secondary can only have one SUP, and if that fails, you literally depend on your network fabric stopping the connections on the device side of the WAN link, or your WAN link will take a hit from the resync traffic.

 

There are other examples of when total SUP control would be a boon, such as a SUP located in a DMZ. If that SUP goes offline, devices will try to reach across to other remote SUPs, but in most cases ports 8530/8531 will be blocked by the router handling traffic for the DMZ, so as to stop any outflow across the network. This introduces one of a few problems you’ll encounter while trying to manage SUP selection, which we’ve previously worked around by editing the WSUS Scan Retry Error Code list held in the ConfigMgr Site Control File, tweaked and delivered to all devices using a manual, one-time change performed at the Primary Site server. If that doesn’t work, your final throw of the dice is to configure how the router responds to requests to reach an endpoint (a remote SUP): the router’s response influences how the Windows Update Agent behaves, and if WUA thinks the error is a recoverable one it will wait for service to return, and literally stick to the wrong SUP. Configuring the router to deny the socket requests rather than dropping them often helps devices throw an error code that can be added to the Scan Retry Error Code list, so that WUA gives up trying to connect to a SUP it will never be able to speak to.
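
To make the error-code behaviour concrete, here is a minimal sketch of the decision the client effectively has to make: is the error WUA returned on the retry list (give up on this SUP and move on), or is it treated as transient (keep waiting)? The codes and list below are illustrative examples only, not the actual contents of a Site Control File.

    # Illustrative model of the "retry vs. wait" decision described above.
    # The list contents are an example, not the real Site Control File values.
    WSUS_SCAN_RETRY_ERROR_CODES = {
        0x80072EE2,  # WinHTTP/WinInet timeout
        0x80072EFD,  # cannot connect to the server
    }

    def should_switch_sup(scan_error_code: int) -> bool:
        """True if the error should make the client give up and try the next SUP."""
        return scan_error_code in WSUS_SCAN_RETRY_ERROR_CODES

    print(should_switch_sup(0x80072EFD))  # True  - move on to another SUP
    print(should_switch_sup(0x80244022))  # False - WUA keeps waiting and retrying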

 

Those days are behind us with Build 1702 available. We can now forget about micro-managing SUP selection and let the Hierarchy take care of it using Boundary Groups. You can literally drop a down-level SUP into a part of the network and be highly selective about who uses it and whether it can be fallen back to, something we couldn’t do before with any ease. We can now rethink how we’ve designed for SUP usage in existing ConfigMgr designs.

 

Right off the bat, it is worth noting that the Software Update Point affinity feature is always on (see below for an explanation), unlike Management Point affinity, which can be toggled in Hierarchy Settings as seen below:

 

image

 

It is pretty easy to test SUP affinity in a simple lab setup. You’ll need to service ConfigMgr and install Build 1702, resulting in the following versions (or higher):

 

  • ConfigMgr Site version 5.00.8498.1000 for 1702 (Console version 5.00.8498.1500)
  • Client version 5.00.8498.1007 (both are needed before the functionality can be used)
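
If you want to confirm the client version on a test device from a script rather than the console, here is a minimal sketch. It assumes it runs elevated on a device with the ConfigMgr client installed, and shells out to PowerShell to read the client’s WMI namespace.

    import subprocess

    # Ask the ConfigMgr client for its version via the root\ccm WMI namespace.
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command",
         r"(Get-WmiObject -Namespace root\ccm -Class SMS_Client).ClientVersion"],
        capture_output=True, text=True, check=True,
    )
    client_version = result.stdout.strip()
    print(f"ConfigMgr client version: {client_version}")

    # SUP affinity needs at least the 1702 client (5.00.8498.1007).
    parts = [int(p) for p in client_version.split(".")]
    print("1702 or later client:", parts[2:] >= [8498, 1007])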

 

Your lab will have to have at least the following to test the SUP Affinity feature:

 

  • Build 1702 Primary Site server (MP\Active SUP)
  • Build 1702 Site system (MP\Down-level SUP)
  • ConfigMgr-supported Windows OS with the Build 1702 Client, to test Software Update Point selection
  • Three IP Range boundaries, used to split up a single IP Subnet
  • Two Boundary Groups
  • A Grin for when you see how much control you have over a SUP now!

 

Here are the three Boundaries:

 

image

 

These IP Range Boundaries let us single out an IP Address (192.168.1.126) that will be used for our tests to switch between using the Site server and the Site system, and back to the Site server again. You could extend this range to include more than one IP address, but obviously edit the other two ranges to accommodate the change.

 

We then create two Boundary Groups, SiteServer (Active SUP) and SiteSystem (Down-level SUP):

 

image

 

You can see that the Default-Site-Boundary-Group boundary group is also listed in the shot above. It is worth noting that when upgrading to Build 1702, you’ll find all the Site’s Software Update Points have been added there during the upgrade. This emulates the existing behaviour of pooled SUPs with non-deterministic selection, and is why there is no check box for this feature in Hierarchy Settings as there is for the Management Point role. If your design includes usage of this boundary group, consider what functionality you want from the SUPs in terms of fallback selection. For the lab I removed all the references so that there is no fallback available.

 

This passage from the documentation is worth noting, as it describes exactly how a client will handle SUP fail-over:

 

When a client that already has a software update point fails to reach it, the client can then fallback to find another. When using fallback, the client receives a list of all software update points from its current boundary group. If it fails to find an available server for 120 minutes, it will then fallback to its neighbor boundary groups and the default site boundary group. Fallback to both boundary groups happens at the same time because the software update points fallback time to neighbor groups is set to 120 minutes and cannot be changed. 120 minutes is also the default period used for fallback to the default site boundary group. When a client falls back to both a neighbor and default site boundary group, the client attempts to contact software update points from the neighbor boundary group before trying to use one from the default site boundary group.

 

So a device will hang around trying to reach its SUP for 120 minutes before trying the neighbour and default site boundary groups, with a preference for the neighbours’ SUPs. That 120-minute timeout is non-editable, so for design purposes it influences how you configure fallback and needs to be factored into any design.
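
As a design aid, the documented behaviour reads roughly like the sketch below: the current boundary group’s SUPs are the only candidates until the fixed 120-minute window expires, after which neighbour and default site boundary group SUPs join the list, with neighbours preferred. This is only a model of the quoted documentation, not ConfigMgr’s actual implementation, and the server names are placeholders.

    FALLBACK_MINUTES = 120  # fixed for SUPs and cannot be changed

    def candidate_sups(minutes_without_sup, current_group, neighbour_groups, default_site_group):
        """Model the documented SUP fallback order (not the real implementation)."""
        candidates = list(current_group)
        if minutes_without_sup >= FALLBACK_MINUTES:
            # Neighbour and default site fallback open at the same time,
            # but neighbour SUPs are attempted first.
            candidates += list(neighbour_groups) + list(default_site_group)
        return candidates

    print(candidate_sups(30,  ["SUP-A"], ["SUP-B"], ["SUP-C"]))  # ['SUP-A']
    print(candidate_sups(130, ["SUP-A"], ["SUP-B"], ["SUP-C"]))  # ['SUP-A', 'SUP-B', 'SUP-C']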

 

The documentation also refers to existing devices, and states the following:

 

The continued use of an existing software update point even when that server is not in the client’s current boundary group is intentional. This is because a change of software update point can result in a large use of network bandwidth as the client synchronizes data with the new software update point. The delay in transition can help to avoid saturating your network should all your clients switch to a new software update point at the same time.

 

The feature owners at Microsoft decided to be cautious and try not to cause any heavy network utilisation just from upgrading the Hierarchy to 1702. If you leave all SUPs in the fallback boundary group post-upgrade, no devices will switch to another SUP and perform a delta or full resync causing network congestion. This gives you control over possible network surges straight off the bat (nice), and ample time to plan out your boundary groups for SUP usage and perform the transition.

 

Once you have upgraded to Build 1702 and implemented your planned changes to the boundary groups and their SUP references, you can do one of the following to make a device switch to the SUPs referenced in the Boundary Groups it is a member of:

 

  • Perform a new installation or reinstallation of the ConfigMgr Client, noting that OSD builds honour boundary group-based SUP selection in the same way as new installations
  • Send existing ConfigMgr Clients the Switch to next Software Update Point client notification action, which is honoured when the next Scan Cycle is initiated by a deployment, a schedule, or manually (automation; see the sketch below this list)
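
For the automation route, the Software Updates Scan Cycle can be kicked off through the client’s WMI interface using the well-known schedule ID for that cycle. A minimal sketch, assuming it runs elevated on the client and shells out to PowerShell:

    import subprocess

    # Well-known schedule ID for the Software Updates Scan Cycle.
    SCAN_CYCLE = "{00000000-0000-0000-0000-000000000113}"

    # Trigger the scan on the local client; the SUP switch requested by the
    # client notification is honoured when this scan runs.
    subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command",
         rf"Invoke-WmiMethod -Namespace root\ccm -Class SMS_Client "
         rf"-Name TriggerSchedule -ArgumentList '{SCAN_CYCLE}'"],
        check=True,
    )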

 

Okay, so the whole “controlling SUPs is very important” message should be truly imparted by now; let’s get on with walking through how easy SUP affinity is to control in Build 1702.

 

I only have one Subnet here in my lab, and to isolate a single device on that subnet for testing I used IP Range boundaries, splitting a single subnet on the last octet into three distinct boundaries:

 

  • 1 to 125
  • 126 <—Test device IP Address (192.168.1.126)
  • 127 to 254

 

In the shots below you’ll see that we added the following IP Ranges to the SiteServer (Active SUP) Boundary Group:

 

  • 192.168.1.1-192.168.1.125
  • 192.168.1.127-192.168.1.254

 

And then we add the following IP Ranges to the SiteSystem (Down-level SUP) Boundary Group:

 

  • 192.168.1.126-192.168.1.126
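
To sanity-check which boundary group, and therefore which SUP, a given address lands in, the ranges can be modelled with Python’s ipaddress module. The group names and ranges below mirror the lab configuration just described:

    from ipaddress import ip_address

    # Lab boundary groups and their IP range boundaries (inclusive).
    BOUNDARY_GROUPS = {
        "SiteServer (Active SUP)": [
            ("192.168.1.1", "192.168.1.125"),
            ("192.168.1.127", "192.168.1.254"),
        ],
        "SiteSystem (Down-level SUP)": [
            ("192.168.1.126", "192.168.1.126"),
        ],
    }

    def boundary_group_for(ip):
        """Return the boundary group whose IP range boundaries contain the address."""
        addr = ip_address(ip)
        for group, ranges in BOUNDARY_GROUPS.items():
            if any(ip_address(lo) <= addr <= ip_address(hi) for lo, hi in ranges):
                return group
        return None

    print(boundary_group_for("192.168.1.126"))  # SiteSystem (Down-level SUP)
    print(boundary_group_for("192.168.1.50"))   # SiteServer (Active SUP)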

 

This is how it looks once setup:

 

SiteServer (Active SUP)

 

image

 

Any IP Address within the IP Ranges defined above will use the Active SUP:

 

image

 

SiteSystem (Down-level SUP)

 

image

 

The single IP Address defined above will use the Down-level SUP:

 

image

 

I built a Windows 10 Build 1607 Virtual Machine and installed the ConfigMgr agent there, giving the machine the 192.168.1.126 IP address so that it will become the SUP test device and use the down-level SUP:

 

image

 

This device is now a member of the SiteSystem (Down-level SUP) Boundary Group. This Boundary Group, as we know, contains a reference to the down-level SUP, so eventually the device will receive a SUP list containing just one entry: the down-level SUP.

 

We can determine which SUP is going to be used for a scan by looking at the ScanAgent log on the device. In the shot below we have a device that is using the Active SUP, whereas we have configured it to use the down-level SUP:

 

image
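
Outside of the logs, another quick check is the WSUS server that the ConfigMgr client has set via local policy, since that is what the Windows Update Agent will actually scan against. A minimal sketch reading that value from the standard WindowsUpdate policy key (run on the client; the example URL in the comment is illustrative):

    import winreg

    # The ConfigMgr client points the Windows Update Agent at the chosen SUP via
    # local policy, which lands under this WindowsUpdate policy key.
    KEY_PATH = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        wu_server, _ = winreg.QueryValueEx(key, "WUServer")

    print(f"Current WSUS/SUP URL: {wu_server}")  # e.g. http://sup01.lab.local:8530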

 

We know from the documentation that the device will not switch for as long as its current SUP is healthy, even if the SUP in its Boundary Group is different from the one it is currently using, so we have to do something to induce this behaviour. Luckily, the Product Group, who mentioned this to me, have built in a way to carry out this task from the Console, using the Switch to next Software Update Point client notification action.

 

We issue the Switch to next Software Update Point Client Notification Action, and we can see its delivery if we open up the CcmNotificationAgent log on the device:

 

image

 

We see the GUID and tasktype entry show up, but the device will not perform a switch until it performs its next Scan Cycle.

 

This Client Notification is also reflected in the ScanAgent log on the device:

 

image

 

Once we kick off a Scan Cycle manually, we see the actual switch take place in the ScanAgent log. Note that you don’t have to initiate the scan cycle yourself to complete the procedure; the next time a scan is induced by a deployment or on schedule, the switch will take place then:

 

image

 

And we see the same in the WUAHandler log:

 

image
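
If you’d rather watch the switch from a script than scroll through CMTrace, here is a rough sketch that pulls the WSUS/SUP URLs out of WUAHandler.log. It assumes the default client log path, and simply greps for URLs rather than relying on exact log message text, which varies between client versions:

    import re
    from pathlib import Path

    # Default ConfigMgr client log location; adjust if your install path differs.
    LOG = Path(r"C:\Windows\CCM\Logs\WUAHandler.log")

    # Pull every http(s) URL mentioned in the log; the most recent entries show
    # which WSUS/SUP the Windows Update Agent is being pointed at.
    urls = re.findall(r"""https?://[^\s"'<>]+""", LOG.read_text(errors="ignore"))

    for url in urls[-5:]:
        print(url)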

 

Perfect. The test device has moved to the correct SUP, the down-level one, as desired.

 

Using this technique, the entire device estate could be moved to nominated SUPs with minimal effort, while remaining totally in control.

 

To test the feature out further:

  • Swap the references around in the two boundary groups
  • Initiate the Client Notification Action followed by a Scan cycle
  • Observe the device switching back to the Active SUP
  • Repeat these steps to switch back to the down-level SUP

 

The final thing worth mentioning is that if you have a Site system that hosts a combination of a Distribution Point, a Management Point or a Software Update Point, and you wish to separate the roles out between different Boundary Groups, you’ll need to do so at the Site system level, splitting the roles across separate Site systems. This is because Boundary Group references do not refer to the role, but to the Site system hosting the role. That would make for a good UserVoice suggestion: atomise the bounding beyond Site systems and down to individual roles.

 

A long overdue feature for ConfigMgr, and now that it is here, the old story about not being able to control usage for the core roles can be put to bed. With MP, DP, SUP and SMP affinity, we can produce more optimised hierarchy designs. A real bonus for those of us who get a kick out of designing complex hierarchies.

 

Thanks and big props to the Product Group!

 

Smile