Tuesday, November 16, 2010

Increase Exchange 2010 DAG Failover Threshold

One of our clients has a four-node Exchange 2010 DAG spread across four different countries worldwide.

The client reported that one of the sites, which had a slight bandwidth issue, was consistently failing its active mailbox database over from the local site to the Dublin HQ site. When we manually moved the database back to the original local site, it would randomly fail over to the Dublin HQ site again, presumably due to the intermittent latency on the Internet connection at that local site.

The customer requested that I find a way to increase the failover threshold or tolerance for the DAG so that it doesn't fail over as frequently without losing the functionality of High Availability.

After searching for quite a while for a way to do this in the Exchange Management Shell, I found some information relating not to Exchange Server but to the Windows Server 2008 Cluster Service, which is essentially what the DAG uses for its clustering technology when it is first created.

Using a standard Command Prompt (cmd), I started playing with the 'cluster' command and looking into what switches it used and what they could be applied to.

Here's what I came up with:

Type 'cluster /list' to display the name of the cluster that is present on the server.

When you run 'cluster /prop' from the command line, it returns a number of values relating to the cluster, two of which are the following:

CrossSubnetDelay = 1000 (this is the default 1000 milliseconds which equals 1 second per heartbeat check)

CrossSubnetThreshold = 5 (this is the default number of heartbeats that can be missed before failover)
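
To pick out just these two values from the full property list, the output can be filtered. This is a minimal sketch, assuming it is run on a DAG member so that 'cluster /prop' targets the local cluster; findstr is the standard Windows text filter:

```bat
rem Show only the cross-subnet heartbeat properties of the local cluster.
rem /i makes the match case-insensitive.
cluster /prop | findstr /i "CrossSubnet"
```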

I changed the CrossSubnetDelay value to make the heartbeat check in every 2 seconds instead of the default 1 second by using the command below:

cluster /cluster:<ClusterName> /prop CrossSubnetDelay=2000

With this new setting, along with the default CrossSubnetThreshold value of 5 missed heartbeats, the Cluster service now waits for 10 seconds (5 x 2000 milliseconds) before initiating a failover to a different DAG member.
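
The tolerance window is simply the heartbeat interval multiplied by the number of heartbeats that can be missed. As a rough illustration, the small batch file below does that arithmetic for the values used in this post; the variable names are my own and are not cluster properties:

```bat
@echo off
rem Failover tolerance = heartbeat interval (ms) x missed heartbeats allowed.
rem DelayMs, Threshold and ToleranceMs are illustrative names only.
set /a DelayMs=2000
set /a Threshold=5
set /a ToleranceMs=DelayMs*Threshold
echo Tolerance: %ToleranceMs% ms
```

Changing the two numbers lets you see what window a given combination would produce before you apply it to the cluster.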

The CrossSubnetDelay value can be increased to a maximum of 4000 milliseconds when the cluster spans subnets. (For nodes on the same subnet, the equivalent SameSubnetDelay property has a maximum of 2000 milliseconds.)

The CrossSubnetThreshold value can be modified with a value anywhere from 3 to 10.
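
The threshold can be changed with the same syntax as the delay. This is a sketch only, assuming a cluster named DAG01 (a made-up name; use the one reported by 'cluster /list') and assuming you want the maximum of 10 missed heartbeats:

```bat
rem DAG01 is a hypothetical cluster name; substitute your own.
rem Allow up to 10 missed heartbeats across subnets (the default is 5).
cluster /cluster:DAG01 /prop CrossSubnetThreshold=10

rem Confirm the new value.
cluster /cluster:DAG01 /prop | findstr /i "CrossSubnetThreshold"
```

Combined with a CrossSubnetDelay of 2000, this would give a window of roughly 20 seconds before a failover is initiated.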

This workaround / solution may need some tweaking with values until you reach the desired tolerance on your DAG.

It is also worth noting the property values before and after you run the above commands and, as always, make sure you have a full backup of your Exchange environment before you do anything like this!
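
One way to keep that record is to save the property list to a file before and after the change. This is a sketch under the same assumption of a cluster named DAG01 (hypothetical); the file names are arbitrary:

```bat
rem DAG01 is a hypothetical cluster name; the file names are arbitrary.
cluster /cluster:DAG01 /prop > cluster-props-before.txt

rem ... make your changes here ...

cluster /cluster:DAG01 /prop > cluster-props-after.txt

rem fc is the built-in file compare tool; it lists the lines that differ.
fc cluster-props-before.txt cluster-props-after.txt
```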

9 comments:

  1. Kevin, we have a similar issue:
    I have a client with the following:

    two sites connected with a 3MB pipe.
    Each site has
    - Domain Controllers
    - 1 Dedicated CAS/HUB server
    - 1 Dedicated Mailbox DAG server.

    At the main site the CAS/HUB server is also the File Witness Server.
    All servers are running Server 2008 R2 and Exchange 2010 SP1 without any hot fix rollups. (I have looked at the Rollups and they do not fix this specific issue.)

    The 3 MB pipe has errors or is over utilized and so the DAG member at the DR site (DAG-DR) frequently loses connection to the file share witness and then drops out of the DAG. Then about 2 to 5 minutes later it rejoins the DAG. The client is working on the WAN link to increase the size or fix the errors. This is not the main problem.

    The main issue is that sometimes, when the DAG member at the DR site drops out of the cluster, the DAG member at the main site (DAG-MAIN) logs errors that it cannot talk to the file share witness and then it drops out of the cluster as well. If I look in the cluster log I see it trying to get access to the file share witness. It tries up to 30 times, then quits trying and dismounts the databases.

    What I think is happening (but I could be wrong) is the DAG-DR has put a lock on the FileWitness, but when the network failure occurs it does not release the lock and then the DAG-Main cannot get access to the FileWitness so it dismounts the databases to prevent split brain.

    This should not occur. If DAG-DR stops or the network drops then DAG-MAIN should continue functioning because the file witness is on the same LAN as the DAG-MAIN. Any ideas as to what is causing this?

  2. Hi Lori, thanks for the comments.


    A few things to check are as follows:

    Ensure that the 'Exchange Trusted Subsystem' security group is added to the 'Local Administrators' group of the File Share Witness (FSW) server

    If User Account Control (UAC) is enabled on the FSW server, then turn this off for a period of time and see if the issue is resolved

    Ensure you have a working DHCP server in each site and that it has enough free IP addresses in its scope at all times as the DAG cluster uses DHCP to assign IP configurations out of the box

    Ensure that all Exchange Servers have the exact same patch and SP level and always update to the latest hotfix rollups and service packs


    If all else fails and the DAG is still not functioning the way it should, I would recommend removing the DAG and recreating it (hopefully, if the DBs aren't too big, this shouldn't be a major undertaking as there are only 2 DAG members)

  3. Actually, the engineer on site modified the threshold settings you mentioned in the original post and that fixed the issue. That said, we'll double check on the additional criteria you mention. Many thanks!

  4. We have three servers in our cluster. After changing the settings do I need to restart the cluster service and if so on all three, or just on the server I run the commands on?

    Thanks!

    Replies
    1. Hi there,

      From what I recall on this issue (it's been a good while since I wrote this post!), I didn't need to restart the cluster service for these modifications to take effect.

      Kevin

  5. Guys, thanks for this fix, absolutely amazing to stumble upon after months of DBs failing over randomly despite all attempts. I do have another question along the lines, I have 2 sites

    site A - mbx01, cas01, cas02 (cas02 isn't doing much but server to load balance hub transport and as primary witness server, mbx01 has 6 active DBs and 1 passive which is replicating from site B's active DB)

    site b - mbx02, cas03 (cas03 is alternate witness server, mbx02 has 1 active db and 6 passive)

    The idea is simple: if site A is down we fail over to site B and vice versa. This works perfectly if MBX01 and MBX02 are down for whatever reason.

    if CAS02 and CAS03 are down the failover doesn't occur and if the entire site A or B is down then no failover takes place either direction

    I have spent a considerable number of hours/days with MS Exchange support to come up with a step-by-step manual failover process, which also doesn't seem to work: I have tried following it on more than one occasion and have yet to be successful

    Does anyone have a good procedure to fail over when either a witness server is down and/or the entire site is down?

    Thanks in advance

  6. Anonymous .... we have the same setup as you and I have created a site fail over Doc I can share with you. Let me know your email and I can send it.

    Replies
    1. can you share it with me please

      my email rana.rtb@gmail.com

  7. Hello,
    Can you please share the site failover doc (Exchange Server) which you have created? My email is rani.evening@gmail.com
