Monthly Archives: June 2015

Monitoring Availability Groups with Redgate SQL Monitor

In previous posts here and here we discussed AlwaysOn Availability Group replication latency and monitoring concepts, specifically the importance of monitoring the send_queue and redo_queue. In this post I’m going to show you a technique for monitoring Availability Group replication latency with Redgate SQL Monitor and its Custom Metric functionality.

Here’s the rub, monitoring AGs is a little interesting for the following reasons

  1. We’re interested in trending and monitoring and that isn’t built into SQL Server or SSMS’s AlwaysOn Dashboard.  Both provide only point in time values. 
  2. We’ll need to monitor the health of the Availability Group as a whole. So we want to track performance data on all replicas in the AG. But interestingly the redo queue and send queue values in the DMVs on the primary are always NULL. So we need to get those values from the secondary replicas.
  3. Further, to work this into SQL Monitor’s Custom Metric framework we’ll need to limit our query’s result set to a single row and value.

Redo Queue

The redo queue is the amount of log records that haven’t been sent to a secondary replica in an AG. We want to track this as it is a measure of the amount of data on a secondary that is not yet redone into the database and can impact operations offloaded to secondaries

The bold text headings of the next section match the configuration areas on the Custom Metric screen in SQL Monitor.

Enter the T-SQL query that will collect data:

SELECT  rs.redo_queue_size
FROM    sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id
                                            AND r.replica_id = rs.replica_id
WHERE   rs.database_id = DB_ID()
        AND r.replica_server_name = @@SERVERNAME
        AND rs.redo_queue_size IS NOT NULL;

This query will execute on each replica in the AG and will return data only for that replica. So if a replica does go offline we will loose the ability to query its information…but hey it’s offline. So it will be pretty hard to query regardless. But in reality this isn’t really good since the send queue values on the primary are always NULL. 

Select instance to collect from

Select all the instance that are in the AG you want to monitor. The T-SQL query will handle multiple Availability Groups as the query restricts its data to database ID on a particular server and a database can only be a member of one AG.

Choose databases to collect from

Please select all the databases that you are interested in monitoring.

Set collection frequency

Collect data every minute. I wish we could set this to a lower value.

Use collected or calculated values

Leave this unchecked. The DMVs queries return point in time values that we’ll want to graph the data as absolute values.

Send Queue

The send queue is used to store log records that will be sent from the primary to the secondary replica. We want to track this as it is a measure of the amount of data that is not on a secondary replica and can impact recovery point if there is a failure of a primary.

All of the configuration settings for this Custom Metric are the same as above except the T-SQL Query. Here’s that code:

Enter the T-SQL query that will collect data:

SELECT  rs.log_send_queue_size
FROM    sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id
                                            AND r.replica_id = rs.replica_id
WHERE   rs.database_id = DB_ID()
        AND r.replica_server_name = @@SERVERNAME
        AND rs.log_send_queue_size IS NOT NULL

Send Queue Custom Metric Analysis

So let’s see this in action, like the last post we have a system with three replicas. Two are on the same subnet and a one on a simulated remote subnet. We added a 1000ms delay on the simulated WAN link to the remote location to simulate poor network conditions. 
Send Queue Availability Group Custom Metric
With the WAN latency added and a heavy write workload running, the primary replica is trying to send data to a secondary replica faster than the network can support and has to store the data in the send_queue and SQL Monitor gives us the ability to quickly see this in it’s Analysis tab.
Now let’s simulate a network outage by knocking out the WAN link that connects our primary site and secondary site. Sadly, at this point we loose visibility into the send_queue values on the unreachable secondary, but we’ll have the most recent value before it went offline.
With the WAN link down, the primary replica cannot send data to the secondary at the remote subnet and has to start queuing the data in the send_queue until the secondary comes back online.

Redo Queue Custom Metric Analysis

So with the WAN link down the send_queue builds up and once service is restored, all the log from the primary is sent over to the secondary. But at a rate that is faster than can be processed by the redo thread, so it goes into the redo_queue…let’s see that in a graph
Redo Queue Availability Group Custom Metric
Synthetic tests are great, but here is a screenshot from an actual workload from a production system. In this sample you can see a surge in the send_queue followed by a surge in the redo queue. The workload in this sample is database maintenance doing index rebuilds and more interesting is that this these replicas are on the same subnet!

Alwayson Availbility Group Replication Latency


I hope these Custom Metrics are useful to you and provide some insight into the replication health of your Availability Groups. Having a firm understanding of your AG replication and latency will allow you to answer some hard questions when it comes to where your data is at what point in time. 
Twitter @nocentino
Did you like this post? If so, click here for to sign up for our newsletter to get more SQL goodness delivered right to your inbox! 

When was your last database backup?

Its pretty often that you have to sit down at a SQL Server and need sort out what the backup situation is. One of the first things that I check is, when did the last backup for each database complete? But answering that question is getting more complicated. If you’re using Availability Groups, you could be offloading your backups to a secondary and that can skew your backup data.  In Availability Groups, database backup history is only stored on the instance that the backup executed on.

So I pulled together a quick script that gave me the following information:

  • Database name
  • Recovery model
  • If the database is in an Availability Group
  • Availability Group Preferred Backup Replica status
  • Date and time of the last full backup
  • Date and time of the last log backup
  • Date and time of the last differential backup
  • Days since the last full backup
  • Days since the last differential backup
  • Minutes since the last transaction log backup

I chose to aggregate the data on the backup type…full, differential and log and when did the last backup of each type complete.

With this data, I can do a quick scan across a row I can determine the recovery model, if the database is in an AG or not, if this is the preferred backup replica and if the appropriate backup type is happening at a regular interval. Also, since I’m efficient (*cough* lazy) I added a quick calculation to show how long in it’s been since a full or differential has completed in days and for log backups how long in minutes.

      , db.recovery_model_desc
      , CASE WHEN agdb.database_id IS NOT NULL THEN 'AG Database'
             ELSE 'Not AG Database'
        END [IsAgDb]
      , CASE WHEN sys.fn_hadr_backup_is_preferred_replica( = 1
             THEN 'YES'
             ELSE 'NO'
        END [IsPreferredReplica]
      , t.LastFullBackup
      , t.LastDiffBackup
      , t.LastLogBackup
      , t.DaysSinceLastFullBackup
      , t.DaysSinceLastDiffBackup
      , t.MinutesSinceLastLogBackup
FROM    sys.databases db
        LEFT OUTER JOIN ( SELECT    p.database_name
                                  , MAX(p.[D]) [LastFullBackup]
                                  , MAX(p.[I]) [LastDiffBackup]
                                  , MAX(p.[L]) [LastLogBackup]
                                  , DATEDIFF(DAY, MAX(p.[D]), GETDATE()) [DaysSinceLastFullBackup]
                                  , DATEDIFF(DAY, MAX(p.[I]), GETDATE()) [DaysSinceLastDiffBackup]
                                  , DATEDIFF(MINUTE, MAX(p.[L]), GETDATE()) [MinutesSinceLastLogBackup]
                          FROM      msdb.dbo.backupset bs PIVOT ( MAX(bs.backup_finish_date) FOR [type] IN ( [D], [L], [I] ) )  AS p
                          GROUP BY  p.database_name
                        ) t ON = t.database_name
        LEFT OUTER JOIN sys.dm_hadr_database_replica_states agdb ON agdb.database_id = db.database_id
                                                              AND agdb.is_local = 1;

Here’s some sample output from the script, I chose to leave the NULLs in there to really stand out. In this output you can see that the transaction logs on the AG databases haven’t been backed up in a long time on this instance. We should keep looking to find out why :)

last database backup sql server

Hopefully this script will help you quickly determine what your backups situation is!

Twitter @nocentino