Skype for Business – Standard Edition Pool Failover Disaster [UPDATED]
Last week I blogged about my experiences trying to simulate a DR event by performing a planned failover of a paired Standard Edition pool deployment. Those of you who read that post will know it was not a pleasant experience. For those who have not read it, here is the link: http://wp.me/p5atJy-cz
Over the weekend, I spent a considerable amount of time trying to find a safe workaround for the problem without being too destructive. I decided to start from scratch and install a brand new domain with Skype for Business RTM, to eliminate any customer variables and any CUs that could contribute to this problem. Unfortunately, I was able to replicate the issue all too easily, even with zero configuration applied (literally an out-of-the-box vanilla config), and had to revert the deployment using the steps in my previous post.
However, I was able to work out a procedure that will produce less blood, sweat and tears for Skype admins.
The problem, as I understand it, is that the Invoke-CsManagementServerFailover cmdlet does not update the copy of the CMS held in the backup server's RTCLOCAL database (the local copy) to make the backup server the master. Instead, the local CMS copy still expects the primary server to be the active master of the CMS database, and this is where the verification fails and replication stops.
This is a definite bug / issue with Skype for Business and not a PICNIC (Problem In Chair Not In Computer) issue.
The process I have come to understand works is as follows:
- First, make sure you have an up to date backup for the CMS and Location Databases Export-CsConfiguration –Filename c:\cms.zip and Export-CsLisConfiguration –Filename c:\lis.zip
- Make sure that you have no replication issues by running Invoke-CsManagementStoreReplication
- Next check and confirm replication was successful by running Get-CsManagementStoreReplicationStatus and check that both Front End servers return TRUE
Also check event viewer on both servers for Informational event ID 3013 to ensure replication has completed
- Next run a manual synchronisation of the CMS database by using Invoke-CsBackupServiceSync –PoolFqdn fe1.domain.local –Confirm:$false
On the master server you should see 4 events that confirm the synchronisation was successful: Event IDs 4066, 4090, 2038, and 3013 again
On the backup server you should see an informational event ID 3013 to confirm the data was received.
- If you have an edge pool, at this moment change the next hop to the backup server by running Set-CsEdgeServer –Identity edge.domain.local –Registrar fe2.domain.local
- From the backup server, run the following command: Invoke-CsManagementServerFailover –BackupSqlServerFqdn fe2.domain.local –BackupSqlInstanceName RTC –Force
We expect this to fail as “normal” on the validation process and you can press CTRL + C to cancel this or it will retry for 15 minutes
- Make sure the AD SCP has updated by running the Get-CsManagementConnection command
If not, run Set-CsConfigurationStoreLocation –SqlServerFqdn fe2.domain.local –SqlInstanceName rtc
- At this point the CMS won’t be in a happy state, running Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus will return empty
and so will Get-CsManagementStoreReplicationStatus
- You will also notice the File Transfer Agent in a stopped state on the primary server – Start this service.
- Next on the backup server we need to install the local configuration store again, pulling information from the CMS database using the deployment wizard
- Once complete, run the command Enable-CsTopology
- Now run Step 2 from the deployment wizard or bootstrapper.exe from command line to reinstall any missing components
- After this has run, restart the following services on the backup server: Skype for Business File Transfer Agent, Skype for Business Master Replicator Agent, Skype for Business Replica Replicator Agent
- Check that the CMS has now become properly active on the backup server by running Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus
Then run the Get-CsManagementStoreReplicationStatus command to make sure the backup server replication and all other servers are replicating
- Check the SCP connection point is now pointing at the backup server by running Get-CsManagementConnection
- Finally, check that topology is accurately showing the backup server as the active CMS server in the pool by running Get-CsService –CentralManagement
- Now we can run the Invoke-CsPoolFailover –PoolFqdn fe1.domain.local command to fail the users and services over
- Once complete, change your internal DNS records to point to the backup server and change the reverse proxy rules as required.
This now completes the failover process.
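The failover steps above can be sketched as the following command sequence. Treat it as a checklist to run one step at a time, not as a script: the CMS failover is expected to fail validation, and some steps belong on the primary while others belong on the backup server. The server names are the placeholders used in this post. The 'Lync Server' event log name, the service display-name wildcards and the Bootstrapper.exe path are my assumptions, so check them in your own deployment.

```powershell
# Placeholder names from this post: fe1 = primary, fe2 = backup, edge = Edge pool.
$primary = 'fe1.domain.local'
$backup  = 'fe2.domain.local'
$edge    = 'edge.domain.local'

# Back up the CMS and the Location (LIS) database first.
Export-CsConfiguration    -FileName 'C:\cms.zip'
Export-CsLisConfiguration -FileName 'C:\lis.zip'

# Force replication, then review the status of every replica.
Invoke-CsManagementStoreReplication
Get-CsManagementStoreReplicationStatus | Format-Table ReplicaFqdn, UpToDate, LastStatusReport

# Manual Backup Service sync from the primary pool.
Invoke-CsBackupServiceSync -PoolFqdn $primary -Confirm:$false

# Look for the confirmation events (log name 'Lync Server' is an assumption).
Get-WinEvent -FilterHashtable @{ LogName = 'Lync Server'; Id = 4066, 4090, 2038, 3013 } -MaxEvents 20 |
    Format-Table TimeCreated, Id, ProviderName

# Point the Edge pool's next hop at the backup server.
Set-CsEdgeServer -Identity $edge -Registrar $backup

# On the backup server: fail over the CMS. Expect validation to fail;
# press CTRL + C to cancel the retries.
Invoke-CsManagementServerFailover -BackupSqlServerFqdn $backup -BackupSqlInstanceName 'RTC' -Force

# Check the SCP. Only if it still shows the primary, uncomment the next line.
Get-CsManagementConnection
# Set-CsConfigurationStoreLocation -SqlServerFqdn $backup -SqlInstanceName 'rtc'

# On the primary: start the File Transfer Agent again.
Get-Service -DisplayName 'Skype for Business*File Transfer Agent*' | Start-Service

# On the backup server: re-run "Install Local Configuration Store" in the
# deployment wizard, publish the topology, then re-run Step 2 (or Bootstrapper.exe).
Enable-CsTopology
# & "$env:ProgramFiles\Skype for Business Server 2015\Deployment\Bootstrapper.exe"   # assumed default path

# On the backup server: restart the replication agents.
Get-Service -DisplayName 'Skype for Business*File Transfer Agent*',
                         'Skype for Business*Master Replic*',
                         'Skype for Business*Replica Replicator Agent*' | Restart-Service

# Verify that the CMS is active on the backup server and replicating.
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
Get-CsManagementStoreReplicationStatus
Get-CsManagementConnection
Get-CsService -CentralManagement

# Finally, fail the users and services over.
Invoke-CsPoolFailOver -PoolFqdn $primary
```

The DNS and reverse proxy changes are left out because they depend entirely on your environment.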
To fail the CMS, users and service back to the primary server, perform the following steps.
- If the primary server is still powered on after failover, stop the remaining Skype for Business services that are running: Stop-CsWindowsService -Force
- Now run Start-CsPool –PoolFqdn fe1.domain.local –Confirm:$false
- Once started run the Invoke-CsManagementStoreReplication command
- Then run the Invoke-CsBackupServiceSync –PoolFqdn fe2.domain.local command
- Check Event viewer on both servers for Event IDs 4066, 4090, 2038, and 3013 (Same as step 4 in failover process)
- Next change the edge server's next hop pool back to the primary server using Set-CsEdgeServer –Identity edge.domain.local –Registrar fe1.domain.local
- Next run the Invoke-CsManagementServerFailover –BackupSqlServerFqdn fe1.domain.local –BackupSqlInstanceName rtc –Force command
Again, we expect this to fail verification
- Check the AD SCP has updated by running the following command Get-CsManagementConnection (should show fe1)
- After this has run, restart the following services on both servers: Skype for Business File Transfer Agent, Skype for Business Master Replicator Agent, Skype for Business Replica Replicator Agent
- Next run Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus command. This should return the primary server back as the active master and active file transfer agent
- Then run the Get-CsManagementStoreReplicationStatus command to check replication between servers. All should return true
- Now fail the users and services back to the primary server using Invoke-CsPoolFailBack –PoolFqdn fe1.domain.local command
- Once completed, readjust your DNS records and reverse proxy rules to point back to the primary server.
This completes the failback process.
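As with the failover, the failback steps can be sketched as a command sequence to run one step at a time. The names are again the placeholders from this post, and the service display-name wildcards are my assumption about how the services appear on your servers.

```powershell
# Placeholder names from this post: fe1 = primary, fe2 = backup, edge = Edge pool.
$primary = 'fe1.domain.local'
$backup  = 'fe2.domain.local'
$edge    = 'edge.domain.local'

# On the primary, if it stayed powered on: stop any services still running.
Stop-CsWindowsService -Force

# Bring the primary pool back and resynchronise.
Start-CsPool -PoolFqdn $primary -Confirm:$false
Invoke-CsManagementStoreReplication
Invoke-CsBackupServiceSync -PoolFqdn $backup

# Point the Edge pool's next hop back at the primary.
Set-CsEdgeServer -Identity $edge -Registrar $primary

# Move the CMS back. Expect validation to fail, as it did during failover.
Invoke-CsManagementServerFailover -BackupSqlServerFqdn $primary -BackupSqlInstanceName 'rtc' -Force

# The SCP should now show fe1.
Get-CsManagementConnection

# On both servers: restart the replication agents.
Get-Service -DisplayName 'Skype for Business*File Transfer Agent*',
                         'Skype for Business*Master Replic*',
                         'Skype for Business*Replica Replicator Agent*' | Restart-Service

# Confirm the primary is the active master again.
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus

# Any replica listed here is not yet up to date.
Get-CsManagementStoreReplicationStatus | Where-Object { -not $_.UpToDate }

# Fail the users and services back.
Invoke-CsPoolFailBack -PoolFqdn $primary
```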
There may be certain scenarios where this does not work as expected. In this case the active master and active file transfer agent will refuse to be set, and we have to revert to the CMS backup we took before performing the failover.
If this is the case here is the problem. In the CMS database both servers are listed as masters and file transfer agents, but only one server is registered (and that’s a backup server).
How it should look is this:
I attempted to edit this table and found that the dbo.Task table relies on the ComponentId field from the dbo.Component table to assign the master and file transfer agent tasks to the right CMS server. So I edited both tables to set the backup server as the registered master and FTA host using the following SQL commands:
UPDATE dbo.Component SET Registered = 1, LastUnregister=NULL WHERE ComponentId = 1
UPDATE dbo.Component SET Registered = 1, LastUnregister=NULL WHERE ComponentId = 2
I then updated the dbo.Task table to reflect the registered master and FTA:
UPDATE dbo.Task SET ComponentID = 1 WHERE TaskName = 'Master Replication Task'
UPDATE dbo.Task SET ComponentID = 2 WHERE TaskName = 'File Transfer Task'
And then removed the old primary server from the dbo.Component table
After restarting the services on both servers I found that this process DID NOT WORK!! I am posting this step to show that editing the CMS is:
- Not Supported
- Only gives you more headaches!
If you are at this stage, the only procedure left is to restore the CMS from the backup you took at the beginning of the process. Here is how to do it:
- On both servers, stop the Backup Service, File Transfer Agent and Master Replicator Agent services
- On the primary server run Install-CsDatabase –CentralManagementDatabase –SqlServerFqdn fe1.domain.local –SqlInstanceName RTC –Clean
- On the primary server run Import-CsConfiguration –Filename c:\cms.zip
- On the primary server run Import-CsLisConfiguration –Filename c:\lis.zip
- On the backup server run Install-CsDatabase –CentralManagementDatabase –SqlServerFqdn fe2.domain.local –SqlInstanceName RTC –Clean
- On each server run the install local configuration store step from the deployment wizard
- On both servers, start the services you stopped in step 1
- Check CMS has been restored to the primary server by running Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus
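The restore can be sketched the same way. This is a rough outline under the same assumptions as before (placeholder server names from this post, service display-name wildcards that may differ on your servers). The -Clean switch wipes the existing CMS database, so only run it if you really do have the c:\cms.zip and c:\lis.zip backups to hand.

```powershell
# Service display-name patterns are an assumption; adjust to match your servers.
$services = 'Skype for Business*Backup Service*',
            'Skype for Business*File Transfer Agent*',
            'Skype for Business*Master Replic*'

# On both servers: stop the Backup Service, File Transfer Agent and Master Replicator Agent.
Get-Service -DisplayName $services | Stop-Service

# On the primary: rebuild an empty CMS database and import the backups.
Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn 'fe1.domain.local' -SqlInstanceName 'RTC' -Clean
Import-CsConfiguration    -FileName 'C:\cms.zip'
Import-CsLisConfiguration -FileName 'C:\lis.zip'

# On the backup server: rebuild its CMS database as well.
Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn 'fe2.domain.local' -SqlInstanceName 'RTC' -Clean

# On each server: re-run "Install Local Configuration Store" in the deployment wizard,
# then start the services that were stopped above.
Get-Service -DisplayName $services | Start-Service

# Confirm the CMS is back on the primary.
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
```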
To summarise this post: CMS failover can be done, but not by following the TechNet documented method alone. It may still fail, and if it does, you need to revert your CMS back to the state it was in before initiating the failover. So please, please take a backup of the CMS. It takes literally 2 seconds and will save you days of pain.
Mark is an Independent Microsoft Teams Consultant with over 15 years experience in Microsoft Technology. Mark is the founder of Commsverse, a dedicated Microsoft Teams conference and former MVP. You can follow him on twitter @UnifiedVale