Monthly Archives: March 2016

Availability Group DMVs Reporting Incorrect Values

In my opinion one of the key features of SQL Server 2016 is the rebuilt and optimized log redo mechanism for AlwaysOn Availability Groups. Check out the many new AG features here. Check out my posts here and here to learn about how Availability Groups move data.

Early last week I was conducting a load test using SQL Server 2016 and wanted to compare the performance of the log redo thread with that of SQL Server 2014. To establish baseline the performance of 2014, I constructed a load test using a heavy insert workload on the primary. To measure that workload I used the following script to pull database replication performance data from sys.dm_hadr_database_replica_states 

Oh NO! The DMV data is incorrect!!! 

What I found was alarming, the DMV is reporting incorrect values for log_send_rate and redo_rate! This test was on conducted on SQL Server 2014 SP1 build 12.00.4100. This was truly unfortunate because I found this issue on a Monday and I was giving a talk on Performance Monitoring Availability Groups THAT SATURDAY at SQL Saturday #484 – Chicago. So off I went to find a new data source for the presentation and completely rewrite all the demos. The first thing I did was checked Connect, yep there it was on there. There’s also a mention of a fix in CU5 here. But I was on a build newer than that already, SP1. So I installed 2014 SP1 CU5…and yup…still bad data. 

Getting the right data from the right place

So, how did I solve this little issue? Perfmon! There are two performance counters in perfmon that hold the same data as the DMVs. So I’ll query those for the load test, SQLServer:Availability Replica and SQLServer:Database Replica. So I went about modifying the script to pull the data from perfmon so I could have accurate data for the talk that Saturday!!!

The technique is pretty straight forward. You can query perfmon using t-sql from the sys.dm_os_performance_counters DMV. But to do so you need to know that there are different types of counters and those types report their data differently. The specific counters I want to query are called “delta counters” as described by Jason Strate here. The counters we’re querying, specifically ‘Redone Bytes/sec’ and ‘Log Bytes/sec’ report on a one second average. So I take a sample, wait one second, take another sample and difference the two values and that is the actual value of what happened in that interval. 

So, we now have our own calculated log_send_rate and redo_rate. So now we need to work them back into the monitoring code that I already have coded. We also need to take into account where this data is being sampled from, so there’s a CASE statement in there too. So below is the code to pull the correct data from perfmon for the redo and send queue from perfmon and also additional Availability Group performance data such as send queue size and redo queue size. 

Calculating Availability Group Latency

With the correct values being reported, we can also calculate send and redo latency. In other words, how long in time a replica is behind the primary. We do this by dividing the queue size by the rate for both the send queue and the redo queue. This is a pretty cool value basically telling you how long it will take your replica to catch up to the primary. 

A quick demo 

With a small workload running generating about 7MB of transaction log as reported by the counter SQLServer:Databases – Log Bytes Flushed/sec here’s what we see…returning from the query defined above on a system running SQL 2014 SP1 CU5.

Availability Group Latency - DMV.png

From the result set you can see that the DMV is reporting about 140MB/sec for log_send_rate to both replicas, but the perfmon counter is reporting around 7MB/sec for that same value. I confirmed the usage by also looking actual network utilization. 

The important message here is to validate your data, in conducting those load tests 2014, I knew something wasn’t right with the data when the send rate was 112Gb/sec! The code above shows both the DMV value and the value derived from perfmon.

Good luck, and happy monitoring!

Want to chat about SQL Server or Availability Groups, feel free to drop me a line at aen@centinosystems.com

My thoughts on SQL Server on Linux

Well yesterday was a big day in the SQL Community, Microsoft announced that they will be developing a version of SQL Server for Linux. Check out the announcement here.

SQL Server <3's Linux

Image Source – Microsoft – http://bit.ly/1U8Afd3

This leaves us with a lot of questions, in talking with one of my customers this morning he asked some pretty cool questions. Here’s how the conversation went…

Do you think it’s going to be a replacement for MySQL/MariaDB?

My initial thoughts are no, and here’s what I’m thinking. The internals of SQL Server are managed by a component called SQLOS. This is what sits between the SQL Server Database Engine and your hardware. So when SQL needs something from the operation system such as a chunk of memory, a file, or access to the network, it asks SQLOS which asks the Operating System for that resource then it returns access to that resource to SQLOS. So while I have zero inside knowledge of what’s happening inside Microsoft, it’s my best guess that this is where Microsoft is going to make the changes to SQL Server. This way the core database engine and it’s other components will not have to change. Software engineering isolation at it’s finest, right?

So my response to the question was no, I don’t think it’s going to be an immediate replacement of MySQL/MariaDB…I think the application and data tier will stay the same (for now) and that the underlying infrastructure is what’s going to change. 

How is this going to impact DBAs? 

My background is in classic computer science, I’ve studied operating systems in detail academically…what’s cool about that is this has allowed me to have an in depth understanding of how UNIX/LINUX systems work. That’s what we’re taught in those nerdy CS classes. SQL on Linux is going to offer a huge advantage to the DBA…we now will have access to the OS internals. We can read the code, we will know exactly what’s happening under the hood! I think the big change to manage here is our troubleshooting methodologies for isolating performance issues. New tools, new techniques. But in the end it’s going to be fun, we get to learn more cool stuff! But hey guess what…Linux…Windows…it’s just an operating system. 

How is this going to impact HADR components such as Availability Groups?

Again, I have no special knowledge of what’s going on inside Microsoft…but think it through. Most of the implementation of AGs is inside the engine. Similar to mirroring, transaction log blocks move from one replica to another. Now quorum might need a little attention, as that’s provided by Windows. But hey, Linux has the ability to establish quorum for a cluster too.

These are just my thoughts, a quick post for you guys. I’ve had tons of Linux exposure over the years in production systems and in the end…it’s all the same stuff under the hood :)

I’d love to hear your thoughts on this, please do comment below.

Please feel free to contact me with any questions regarding Linux or other SQL Server related issues at: aen@centinosystems.com