AlwaysOn Availability Groups have made a big splash in the SQL world and are quickly becoming the HA and DR technology of choice for many SQL Server environments. Crucial to their success is the ability to move data between the replicas in the Availability Group quickly and efficiently. In the previous post we discussed design concepts for offloaded backups in AlwaysOn Availability Groups, specifically we focused on how data is moved between AGs and the potential impact on backups and recovery. It is important to measure and trend replication health and this article introduces techniques and queries that you can use in your environment to measure and trend replication health and some of the nuances of the data reported in DMVs.
Measuring Availability Group Replication Latency
Central to measuring replication health is the sys.dm_hadr_database_replica_states DMV. On the primary replica this DMV returns rows representing the current state for each database and it’s replicas participating in AvailabilityGroups.
The key fields we’re going to focus on for our monitoring are:
- log_send_queue_size – the amount of log records not sent to a secondary
- redo_queue_size – the amount of log records not yet redone on the secondary
- last_commit_time – the time of the last committed log record on a replica
- last_redo_time – the time of the last log record was redone on a replica
Measuring AG Latency When All Replicas Are Online
We want to measure replication latency as this directly impacts the recovery point of databases in the event of a failure. Under normal operations, when all replicas are online, monitoring the from the perspective of the primary can provide insight to our replication workload for trending and analysis. Doing so will help identify if there is a need for changes in things like WAN capacity, dedicated network connections for replication or reducing log generation on the primary.
Here is an example of an AG with each node online and a very small log_send_queue_size and redo_queue_size. All of the examples in this article are executed on the primary (SQL14-A). So in this result set log_send_queue_size and redo_queue_size values are NULL for the primary replica, since the primary isn’t replicating to itself, and reported values for the secondary replicas.
SELECT r.replica_server_name , DB_NAME(rs.database_id) AS [DatabaseName] , rs.log_send_queue_size , rs.log_send_rate , rs.redo_queue_size , rs.redo_rate FROM sys.dm_hadr_database_replica_states rs JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id WHERE DB_NAME(rs.database_id) = 'TestAG1' ORDER BY r.replica_server_name;
Measuring AG Latency When A Replica Is Offline
The information in sys.dm_hadr_database_replica_states shows the current state known to the server executing the DMV query, so it is possible for information to be stale as other replicas may not be reporting due to an inability to communicate with the primary. For example, When a replica is disconnected from the primary, the log_send_queue_size field for a database on that replica changes to NULL and the remaining values are no longer updated.
In this example we use the same query as the above, we simulated a network outage which caused SQL14-C to be disconnected from the primary and it’s log_send_queue_size is now reporting NULL and log_send_rate, redo_queue_size and redo_rate are the last reported values from the secondary.
NULL Isn’t A Good Value To Trend or Report
In the event of a system failure we need information to report to the business about the state of the system. With a secondary is offline and the log_send_queue_size reporting NULL, it’s hard to gauge the amount of data loss. We need to rely on another attribute when querying the DMV for health of replication. We can use last_commit_time which indicates the time of the last committed record in the transaction log and retains the last communicated value from the secondaries rather that switching to NULL. There are several other *_time and *_LSN counters in the DMV which are updated with the time of the most recent interaction rather than reflecting what is the current state of replication health, more on this in a second.
In this example, we use a slightly different query to calculate the difference between the last_commit_time values on secondary replicas and the primary replica, this gives us the amount of time in seconds the secondaries are behind the primary. Keep in mind, this is the commit time in the transaction log, so if there aren’t any transactions for a period of time before your replicas are disconnected that will be included in this measurement. The result set shows a secondary that has been disconnected for several minutes.
SELECT r.replica_server_name , DB_NAME(rs.database_id) AS [DatabaseName] , ISNULL(DATEDIFF(SECOND, rs.last_commit_time, prs.last_commit_time), 0) AS [SecsBehindPrimary] , prs.last_commit_time AS [Primary_last_commit_time] , rs.last_commit_time AS [Secondary_last_commit_time] , rs.last_redone_time AS [Secondary_last_redone_time] FROM sys.dm_hadr_database_replica_states rs JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id JOIN sys.dm_hadr_database_replica_states prs ON r.group_id = prs.group_id AND prs.group_database_id = rs.group_database_id AND rs.is_local = 0 AND prs.is_primary_replica = 1 WHERE DB_NAME(rs.database_id) = 'TestAG1' ORDER BY r.replica_server_name;
Measuring AG latency When A Replica Is Reconnected
When that replica is reconnected the current log_send_queue_size value is reported and the *_time and *_LSN counters in the DMV are updated with the time of the most recent interaction not necessarily the time of the change recorded in the log record being sent or redone. For example, last_redone_time is the time of the last redone log record, not the time of the data change in the log record being redone.
In this example, we use the first query from above, SQL14-C is reconnected to the primary and log_send_queue_size is reporting 232,822KB of log in the log_send_queue_size and 15,052KB in the redo_queue_size. Recall that this is the amount of log that must be sent to and redone on the secondaries.
We constructed a test that included two Availability Group replicas on different subnets. We generated an insert heavy workload which caused a measurable replication workload and added 50ms latency on the network router using tc. During the test, we disconnected the secondary replica from the network, which causes the log_send_queue_size to build on the primary. We then reconnected the replica the log_send_queue_size is reported properly and the redo_queue builds on the secondary.
The chart below shows the data movement between the queues. The chart samples start just before the secondary is reconnected, upon connection the log_send_queue_size reports it’s current value and begins working it’s way down. You can also see the redo_queue_size fluctuate based on the progress of the redo thread on the secondary replica. While this is a contrived example, with a trivial workload, in practice this can be a protracted event if there is an excessive amount of change during the time the replicas are disconnected a potentially RPO impacting event.
What to do next
- As you design and operate your system, measure replication latency log_send_queue_size and redo_queue_size when the replicas are online and healthy for trending and analysis
- In the event of an outage be sure to know how far your replicas are behind, this is a critical piece of information for the business
- As discussed in the first article, develop operations that better support your design, perhaps patching and heavy database maintenance do not happen at the same time
- Knowing this information and help you better meet the recovery point objective for your organization