Availability Group DMVs Reporting Incorrect Values

In my opinion one of the key features of SQL Server 2016 is the rebuilt and optimized log redo mechanism for AlwaysOn Availability Groups. Check out the many new AG features here. Check out my posts here and here to learn about how Availability Groups move data.

Early last week I was conducting a load test using SQL Server 2016 and wanted to compare the performance of the log redo thread with that of SQL Server 2014. To establish baseline the performance of 2014, I constructed a load test using a heavy insert workload on the primary. To measure that workload I used the following script to pull database replication performance data from sys.dm_hadr_database_replica_states 

Oh NO! The DMV data is incorrect!!! 

What I found was alarming, the DMV is reporting incorrect values for log_send_rate and redo_rate! This test was on conducted on SQL Server 2014 SP1 build 12.00.4100. This was truly unfortunate because I found this issue on a Monday and I was giving a talk on Performance Monitoring Availability Groups THAT SATURDAY at SQL Saturday #484 – Chicago. So off I went to find a new data source for the presentation and completely rewrite all the demos. The first thing I did was checked Connect, yep there it was on there. There’s also a mention of a fix in CU5 here. But I was on a build newer than that already, SP1. So I installed 2014 SP1 CU5…and yup…still bad data. 

Getting the right data from the right place

So, how did I solve this little issue? Perfmon! There are two performance counters in perfmon that hold the same data as the DMVs. So I’ll query those for the load test, SQLServer:Availability Replica and SQLServer:Database Replica. So I went about modifying the script to pull the data from perfmon so I could have accurate data for the talk that Saturday!!!

The technique is pretty straight forward. You can query perfmon using t-sql from the sys.dm_os_performance_counters DMV. But to do so you need to know that there are different types of counters and those types report their data differently. The specific counters I want to query are called “delta counters” as described by Jason Strate here. The counters we’re querying, specifically ‘Redone Bytes/sec’ and ‘Log Bytes/sec’ report on a one second average. So I take a sample, wait one second, take another sample and difference the two values and that is the actual value of what happened in that interval. 

So, we now have our own calculated log_send_rate and redo_rate. So now we need to work them back into the monitoring code that I already have coded. We also need to take into account where this data is being sampled from, so there’s a CASE statement in there too. So below is the code to pull the correct data from perfmon for the redo and send queue from perfmon and also additional Availability Group performance data such as send queue size and redo queue size. 

Calculating Availability Group Latency

With the correct values being reported, we can also calculate send and redo latency. In other words, how long in time a replica is behind the primary. We do this by dividing the queue size by the rate for both the send queue and the redo queue. This is a pretty cool value basically telling you how long it will take your replica to catch up to the primary. 

A quick demo 

With a small workload running generating about 7MB of transaction log as reported by the counter SQLServer:Databases – Log Bytes Flushed/sec here’s what we see…returning from the query defined above on a system running SQL 2014 SP1 CU5.

Availability Group Latency - DMV.png

From the result set you can see that the DMV is reporting about 140MB/sec for log_send_rate to both replicas, but the perfmon counter is reporting around 7MB/sec for that same value. I confirmed the usage by also looking actual network utilization. 

The important message here is to validate your data, in conducting those load tests 2014, I knew something wasn’t right with the data when the send rate was 112Gb/sec! The code above shows both the DMV value and the value derived from perfmon.

Good luck, and happy monitoring!

Want to chat about SQL Server or Availability Groups, feel free to drop me a line at aen@centinosystems.com

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.