Tag Archives: Monitoring

dbfs – command line access to SQL Server DMVs

With SQL Server on Linux, Microsoft has recognized that they’re opening up their products to a new set of users, people that aren’t used to Windows and its tools. In the Linux world we have a set of tools that work with our system performance data and present that to us as text. Specifically, the placeholder for nearly all of the Linux kernel’s performance and configuration data is the /proc virtual file system, procfs. Inside here you can find everything you need that represents the running state of your system. Processes, memory utilization, disk performance data…all of this is presented as files inside of directories inside /proc.

Now, let’s take this idea and extend it to SQL Server. In SQL Server we have DMVs, dynamic management views. These represent the current running state of our SQL Server. SQL Server exposes the data in DMVs as table data that we can query using T-SQL.

So, Microsoft saw the need to bring these two things together: we can expose the internals of SQL Server and its DMVs to the command line via a virtual file system. And that’s exactly what dbfs does, it exposes all of SQL Server’s DMVs as text files in a directory. When you access one of the text files…you’ll execute a query against the SQL Server and the query output comes back to you via standard output to your Linux console. From there you can use any of your Linux command line fu…and do what you want with the data returned.

Setting up dbfs

So first, let’s go ahead and set this up. I already have the Microsoft SQL Server repo configured so I can install via yum. If you have SQL on Linux installed, you likely already have this repo too. If not, go ahead and follow the repo setup instructions here. To install dbfs we use yum on RHEL based distributions.
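Something like this should do it (assuming the package is published as dbfs in the Microsoft repo):

sudo yum install dbfs -y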

First off, think about what’s going on under the hood here…we’re going to allow the system to execute queries against DMVs…so let’s try to keep this as secure as possible. I’m going to create a user that is allowed only to query DMVs, with the VIEW SERVER STATE permission. So let’s do that…
 
Let’s log into our SQL Server via SQLCMD and execute this code to create a user named dbfs_user.
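Here’s a sketch of that; the passwords are placeholders you’d swap for your own, and since VIEW SERVER STATE is a server-level permission, a login is what we create:

sqlcmd -S localhost -U sa

CREATE LOGIN dbfs_user WITH PASSWORD = 'SomeStrongPassword1!';
GO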

Once created, let’s assign this user permissions to query DMVs
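This is the only permission the account needs:

GRANT VIEW SERVER STATE TO dbfs_user;
GO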
The next step is to create a directory where dbfs will place all the files representing the DMVs we wish to query.
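I’ll use /tmp/dbfs in these examples; the path itself is just my choice, any directory will do:

mkdir /tmp/dbfs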
Now, let’s go ahead and configure dbfs. I’m going to place its configuration file in /etc/ since that’s the standard location for configuration files on Linux systems.
And inside that file, let’s use the following configuration. Pretty straightforward: define a configuration name (here you see server1), the hostname (the locally installed SQL instance), the username and password of the user we just created, and a version. While this isn’t very well documented, the code here shows that if you’re on version 16 (SQL Server 2016) or newer it will create the dbfs files with a .json file extension, which exposes your DMV data as…you guessed it, JSON. Also, if you want to add a second server to dbfs, just repeat the configuration inside the same text file.
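Here’s a sketch of that file; the exact layout is from my setup, with placeholder credentials matching the login we created above:

# /etc/dbfs.config
[server1]
hostname=localhost
username=dbfs_user
password=SomeStrongPassword1!
version=16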

Running dbfs

Now with all the preliminaries out of the way, let’s launch dbfs. The basic syntax is the program name with the -c parameter pointing to the configuration file we just created and the -m parameter pointing to the directory we want to “mount” our DMVs into.
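With the paths from our setup, that looks like this:

dbfs -c /etc/dbfs.config -m /tmp/dbfs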
Now, what’s interesting about dbfs is that if you log out, dbfs stays running. Honestly, I don’t like that; if this is the case it should be running as a service managed by systemd or whatever init daemon you’re using on your Linux distribution. I mentioned that on their GitHub repo. If this is going to be a user process, then I should have the choice to background the task myself.

Using dbfs

Looking at the source for dbfs, it gets a list of all DMVs from sys.system_views on the SQL Server you configured it to connect to, then creates a file for each and every one of those DMVs. So we have full coverage of all the DMVs available to us, and since you can use any bash command line fu to access the data…the options are really limitless. Microsoft has a few good demos on the GitHub repo here. Let’s walk through a few examples now.
 
Accessing a DMV

This is pretty straightforward: you read from the file just like you would read from any other file on a Linux system. So let’s do that…we add the column -t option to make sure all the columns are aligned in the output.
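Here’s a sketch using dm_exec_connections; the path assumes the mount point and configuration name from our setup, with dbfs creating a directory per configured server:

cat /tmp/dbfs/server1/dm_exec_connections | column -t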

And our output looks like this…

Notice in the output above how the connect_time column is split incorrectly? We need to tell column to use the tab as a delimiter; by default it uses whitespace. So let’s do that…
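column’s -s option sets the input delimiter:

cat /tmp/dbfs/server1/dm_exec_connections | column -t -s $'\t'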
And now our output looks much better

Selecting off a subset of columns

Well, you probably noticed that the output is a bit unruly since it’s outputting all of the DMV’s columns. So let’s tame that a bit and pull out particular columns. To do that we’ll use a tool called awk, which will print out columns based on their numeric index, so $1 is the first column and so on.
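For example, to pull out the first and seventh columns:

cat /tmp/dbfs/server1/dm_exec_connections | awk '{ print $1, $7 }'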
And our output looks like this
Something isn’t right…as DBAs we think of things in rows and columns. So we’re going to count across the top and think the 7th column is going to yield the 7th column and its data for each row, right? Well, it will, but data processed by awk is whitespace delimited by default and is processed row by row. So the 7th column in the second line isn’t the same as the output in the first line. This can be really frustrating if your row data has spaces in it…like, you know, dates.
 
So let’s fix that…the output from the DMVs via dbfs is tab delimited. We can define our delimiter for awk with -F, which will allow for whitespace in our data, breaking the data only on the tabs. Let’s hope there aren’t any tabs in our data!
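Same example as before, now splitting the input on tabs:

cat /tmp/dbfs/server1/dm_exec_connections | awk -F'\t' '{ print $1, $7 }'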
And the output from that looks like this, much better but we don’t have the nice columns.
We’re so close. We can’t just throw column on the end to make this nice and columnar, because awk with this configuration will remove the tab delimiters from its output stream. column by default will do the same thing too, but we can let column do the work for us and have it print tab delimiters in its output stream.
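Here’s a sketch of that pipeline, assuming a util-linux column that supports -o: column aligns the fields and keeps a tab as the output separator, so awk can still split on tabs afterward:

cat /tmp/dbfs/server1/dm_exec_connections | column -t -s $'\t' -o $'\t' | awk -F'\t' '{ print $1, $7 }'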
And voila, we end up with some nice neatly formatted output

Searching in Text

We can search for text in the output using grep. Here’s a quick example looking for the dedicated admin connection in dm_os_schedulers.
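The DAC scheduler shows up in the status column, so a case-insensitive grep finds it:

grep -i dac /tmp/dbfs/server1/dm_os_schedulers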
And here’s the output. 

SQL folks…keep in mind, grep will only output the lines matched, so we lose the column headers here since they’re part of the standard output stream when accessing the file/DMV data.

Moving forward with dbfs

We need the ability to execute more complex queries from the command line. Vin Yu mentions this here. As DBAs we already have our scripts that we use day to day to help us access, and more importantly make sense of, the data in the DMVs. So dbfs should allow us to execute those scripts somehow. I’m thinking we can have it read a folder on the local Linux system at runtime, create files for those scripts, throw them in the mounted directory, and allow them to be accessed like any of the other DMVs. The other option is we place those scripts as views on the server and access them via dbfs. Pros and cons either way. Since it’s open source…I’m thinking about implementing this myself :)

Next, somehow we need the ability to maintain column context throughout the output stream; for DBAs it’s going to be a tough sell having to deal with that. I know JSON is available, but we’re talking about DBAs and sysadmins here as a target audience.

In closing, this is a great step forward…giving access into the DMVs from the command line opens up SQL Server to a set of people who are used to accessing performance data this way. Bravo!

Monitoring SLAs with SQL Monitor Reporting

Proactive Reporting for SQL Server

If you’re a return reader of this blog you know I write often about monitoring and performance of Availability Groups. I’m a very big proponent of using monitoring techniques to ensure you’re meeting your service level agreements in terms of recovery time objective and recovery point objective. In my in-person training sessions on “Performance Monitoring AlwaysOn Availability Groups”, I emphasize the need for knowing what your system’s baseline for healthy replication is and knowing when your system deviates from that baseline. From a monitoring perspective, there are really two key concepts here I want to dig into…reactive monitoring and proactive monitoring.

Reactive Monitoring

Reactive monitoring is configuring a metric, setting thresholds for alerting and reacting when you get the alert. This type of monitoring is critical to the operations of your system. The alerts we configure should model the healthy state of our system…when our system deviates outside of that state, we certainly want to know about that so that we can act…well really react accordingly.

Proactive Monitoring

Proactive monitoring with an alert based monitoring tool is a little harder. What DBAs and architects do is periodically sit down and go through their existing monitoring systems and review the data over some time interval. And if we’re honest with ourselves we try to do this at regular intervals but don’t get to it very often because we’re busy. And when we do finally get in there to look it’s usually to do a post mortem on some sort of production issue…then very often we find that a critical performance attribute had been slowly creeping up over time until it reached a tipping point and caused a production issue. We do our analysis, make our system corrections and move on. Still not exactly proactive. Mostly because there is still a person in the process.

Reporting on System State

With our Reactive Monitoring model, we already define what a healthy system state is. And let’s take people out of the equation. In Redgate’s latest release of SQL Monitor they added a reporting module. In here you can define reports that represent the state of your system, and you can get a snapshot view of what’s critical to you about your SQL Server environment. So if you are running Availability Groups, like I mentioned above, you can configure your report to have those critical performance metrics already set up so you can quickly get to them via the SQL Monitor interface. But better yet, you can schedule the report to be delivered right to your inbox. Yes, another email. But the report is a simple PDF…you give it a glance, process the data and move on. This saves you from having to go into SQL Monitor’s web interface to view the state of your system. Find something odd? Jump into the Web UI and start digging for the root cause.

Reporting gives us two key advantages

  1. Point in time snapshots of our system’s health – we’ll get that report dropped into our mailbox, and on that report are the most critical performance metrics that define the health of our system.
  2. Ability to see slowly changing trends – reporting helps us focus on trends. It allows us to zoom out a bit and view our system over time. Less reacting, more “proacting”

OK, and one more…for me, as a consultant, I can define reports for clients and have them emailed to me periodically to review. Let’s take a second to build a simple report together.

Creating Your Own Reports

Now normally, I’d show you a bunch of screenshots on how to create a report in SQL Monitor…but rather than do that, go ahead and start trying out the reporting interface yourself using Redgate’s publicly available SQL Monitor Demo Site!


A couple of reports I think you should look at right away:

  1. Example Report – this is the landing page on the demo site and as you can see we get a snapshot of our servers’ performance.
  2. SCC – Custom Metrics – in SQL Monitor you can add your own custom metrics, these are things that you think are business critical…the report in the demo site shows you SQLServerCentral custom metrics of Emails Sent per Hour and Forum Posts per Hour.
  3. Storage – here you’ll find things like fastest filling disk, disk space used, and database sizes. 
  4. Create Your Own – Download the trial version and create your own report. What are the things that are important to you that you want to have right at your fingertips when you log in or have a report land in your inbox with that data?
In closing, know that your monitoring strategy should take into account both types of monitoring, reactive and proactive, so you can ensure you’re meeting your service levels. Start pushing things into the proactive category as much as possible. Leverage SQL Monitor’s reporting to better understand the state of your system.

Understanding Network Latency and Impact on Availability Group Replication

When designing Availability Group systems, one of the first pieces of information I ask clients for is how much transaction log their databases generate. *Roughly*, this is going to account for how much data needs to move between their Availability Group replicas. With that number we can start working towards the infrastructure requirements for their Availability Group system. I do this because I want to ensure the network has a sufficient amount of bandwidth to move the transaction log generated between all the replicas. Basically, are the pipes big enough to handle the generated workload? But bandwidth is only part of the story, we also need to ensure latency is low. Why? Well, we’re going to explore that together in this post!

Network Latency Defined

First, let’s define network latency. It’s how long it takes for a piece of information to move from source to destination. In computer networks, latency is often measured in milliseconds, sometimes microseconds on really fast networks. The most common way we measure network latency is with ping. The measurement ping provides is the time from when the ICMP request is sent until the time it is replied to. This is how long it takes to move a piece of information from source to destination. But the size of the data sent by default is only 56 bytes of payload (64 bytes with the ICMP header)…that’s not very much data. So really, ping isn’t a good way to measure latency for data intensive applications. As you add more data to the transmission, your latency will increase due to the fact that the payload being transmitted is larger and those bytes have to be placed on the wire and read from the wire on the other side. This all contributes to your network latency. So we really want to measure what our network latency is with our data transmission size.

server0:~ demo$ ping 192.168.1.1

PING 192.168.1.1 (192.168.1.1): 56 data bytes

64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=0.072 ms
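To measure at something closer to your application’s transmission size, you can tell ping to send a larger payload; 1400 bytes here is just an arbitrary, larger-than-default size:

ping -s 1400 192.168.1.1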


Network Latency and Availability Groups

Now let’s talk about Availability Group replication and network latency. Availability Groups replicate data over your network using Database Mirroring Endpoints, which are TCP sockets used to move data between the primary and its replicas. When designing Availability Groups, we often think about things in terms of bandwidth…how much data do I need to move between my replicas? But there’s another design factor you need to consider: network latency. Why? Hint: it’s not going to have anything to do with synchronous availability mode and HADR_SYNC_COMMIT waits. Let’s talk about some fundamentals of TCP for a second.

How TCP Moves Data

The unit of transfer in TCP is called a TCP segment. TCP requires positive acknowledgement: each segment sent must be acknowledged by the receiver, which sends an acknowledgement back to the sender confirming receipt of the segment. From those acknowledgements, TCP tracks how long it takes for a receiver to acknowledge each segment. How long? That’s where network latency comes in…acknowledgements will take at least as long as your network’s latency.

Now if TCP waited for each and every TCP segment to be acknowledged before sending the next segment, our system would never fully realize the network link’s capacity. We’d only consume the bandwidth of one segment at a time, which isn’t very much. To overcome this, TCP has a concept called the “Congestion Window”: TCP can have several unacknowledged segments in flight at a point in time and thus consume more of our network connection’s capacity. The number of unacknowledged segments in flight depends on our network’s conditions, specifically latency and reliability. If our network latency is high…TCP will reduce the number of segments in the Congestion Window, which means our effective bandwidth utilization will be low. If the network latency is low, TCP will increase the number of unacknowledged segments in our Congestion Window and our effective bandwidth utilization will be high. If the link isn’t very reliable, TCP will decrease the Congestion Window’s size in response to unacknowledged segments, i.e. “dropped packets”, so if your network line is dropping packets for some reason…expect a reduction in throughput. The Congestion Window is also variable; TCP will increase and decrease its size as network conditions change.

Determining your maximum throughput is pretty easy, check out this calculator here. Enter your physical link bandwidth, your round trip latency time, and the size of your TCP window and you’ll get the maximum throughput of a single TCP stream; in other words, how fast a single TCP connection can transmit data.

Here’s a practical example: if you have a 1Gb link to your DR site and it has 40ms of latency, a single TCP stream will only use about 13Mb/sec. Wow…huh? CIOs everywhere just cringed reading that.
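Here’s the arithmetic behind that number, assuming the default 64KB TCP window:

Maximum throughput = TCP window size / round trip time
                   = 65,536 bytes / 0.040 sec
                   = 1,638,400 bytes/sec ≈ 13 Mb/sec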

Now this is just the maximum for a single TCP stream; we can of course potentially have multiple streams sending. In AGs, each database mirroring endpoint uses a TCP stream between the primary and each secondary replica in a standard Availability Group (not Distributed).

How Does Network Latency Impact My Availability Group’s Availability?

  1. Effective bandwidth can decrease – As network latency increases, effective bandwidth decreases…this means our transaction log throughput is reduced.
  2. Availability Group Flow Control – AGs track the number of unacknowledged messages sent from primary to secondary, and if this gets too high the primary will enter flow control mode and will slow down or stop sending messages to the secondary. What can cause the number of unacknowledged AG messages to increase? Network latency. A secondary can initiate flow control too; if it’s experiencing resource contention, it will message the primary and say slow down.

In both of these cases our ability to move transaction log between replicas is reduced due to network latency, and this can impact availability. If you want to know more about how AGs move data, check out this post and this one. Also, it’s important to note these issues are present regardless of AG availability mode. Synchronous-commit or asynchronous-commit, this will impact your replication latency. In fact, this doesn’t just apply to AGs; it’s ANY single TCP stream.

Where do we go from here?

Well, if you’re designing or operating high availability systems using Availability Groups, make sure you understand your network infrastructure and its performance characteristics. If you have AG replicas in a remote data center and it’s across a high latency link…test your workload to ensure you’re getting the throughput you need for your transaction log volume. You may have to tune your TCP settings on your operating system to better handle high latency connections or even have to make network changes to support your workload. 

If you want to dive deeper into how TCP works check out my Pluralsight course – LFCE: Advanced Linux Networking. While it’s not a SQL Server course, the TCP concepts are the same.


SQLMonitor Adds Graphical Query Plans!

The SQLMonitor team at Redgate has been releasing updates at a much more rapid rate…what’s this mean to you? More fixes and more features. In this latest release, they certainly added something special…Graphical Query Plans! Yes, right inside of SQLMonitor’s user interface. Why is this important? Well, for me, when I’m troubleshooting a performance issue…I usually start with identifying what system resource is being taxed and try to zoom in from there on the root cause. Now let’s say the root cause is a poorly performing query. SQLMonitor lets you find that query pretty easily, but stopped short when it came to diagnosing the actual performance issue in the query.

In this article I’m going to show you how to drill down to find a graphical query plan for a poorly performing query!

In full disclosure, I’m a Friend of Redgate…but I can tell you this, I’ve been using SQLMonitor for a lot longer than I was a FoRG. I truly believe it’s a great product.

First up, download the new SQL Monitor latest release – here

Once installed and once you’ve collected some data, navigate to the new server overview page by clicking on your server’s name. A couple of things I want to point out on this page: first, you get a clear layout of the critical system resources, CPU, memory and disk I/O, and also included is Waits…all crucial information. With this you’ll quickly be able to determine if there is a resource problem. All the menus have context, so when you click on that spike in the CPU chart, the rest of the data on the page will adjust to that point in time.


Figure 1: New Server Overview

Right below the resource charts is the top 10 queries list; this view has always been my go-to spot to find “that query” that’s performing poorly. And since the data is all zoomed in on the point in time we clicked on in the previous chart, being able to sort the query list by CPU, physical or logical I/O, and by average or total usage gives you the ability to quickly sort through tons of data and sift out that one query.


Figure 2: Top 10 Query List

Now, once we’ve zoomed in on “that query” causing you grief, you can select it on the left in the list by clicking on the row in the “query text” column, and that will bring up the query details on the right. With that window up, you get the full query text, any query level waits, and a plan hash to pull the query plan from the plan cache. But in the latest release of SQLMonitor there’s now a “View Query Plan” button. CLICK THAT!


Figure 3: Query Details

Now you get a graphical query plan highlighting what just happened! Prior to this version, SQLMonitor would get you all the way to this point and we had to grab the query hash and take it back into SSMS or other products for graphical analysis…that can be time consuming.


Figure 4: Graphical Query Plan!

With this new functionality, SQLMonitor takes you all the way through your troubleshooting workflow, from the reporting of your issue, enabling you to very quickly sift out the performance anomaly and help identify its cause.

Availability Group DMVs Reporting Incorrect Values

In my opinion one of the key features of SQL Server 2016 is the rebuilt and optimized log redo mechanism for AlwaysOn Availability Groups. Check out the many new AG features here. Check out my posts here and here to learn about how Availability Groups move data.

Early last week I was conducting a load test using SQL Server 2016 and wanted to compare the performance of the log redo thread with that of SQL Server 2014. To baseline the performance of 2014, I constructed a load test using a heavy insert workload on the primary. To measure that workload I used the following script to pull database replication performance data from sys.dm_hadr_database_replica_states.
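A sketch of that kind of query, joining the replica states DMV to the replica names:

SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.log_send_queue_size,
       drs.log_send_rate,
       drs.redo_queue_size,
       drs.redo_rate
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id;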

Oh NO! The DMV data is incorrect!!! 

What I found was alarming: the DMV is reporting incorrect values for log_send_rate and redo_rate! This test was conducted on SQL Server 2014 SP1, build 12.00.4100. This was truly unfortunate because I found this issue on a Monday and I was giving a talk on Performance Monitoring Availability Groups THAT SATURDAY at SQL Saturday #484 – Chicago. So off I went to find a new data source for the presentation and completely rewrite all the demos. The first thing I did was check Connect, and yep, there it was. There’s also a mention of a fix in CU5 here. But I was on a build newer than that already, SP1. So I installed 2014 SP1 CU5…and yup…still bad data.

Getting the right data from the right place

So, how did I solve this little issue? Perfmon! There are two performance counters in perfmon that hold the same data as the DMVs. So I’ll query those for the load test, SQLServer:Availability Replica and SQLServer:Database Replica. So I went about modifying the script to pull the data from perfmon so I could have accurate data for the talk that Saturday!!!

The technique is pretty straightforward. You can query perfmon using T-SQL from the sys.dm_os_performance_counters DMV. But to do so you need to know that there are different types of counters, and those types report their data differently. The specific counters I want to query are called “delta counters”, as described by Jason Strate here. The counters we’re querying, specifically ‘Redone Bytes/sec’ and ‘Log Bytes Received/sec’, report on a one second average. So I take a sample, wait one second, take another sample and difference the two values, and that is the actual value of what happened in that interval.
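Here’s a minimal sketch of that sample-and-difference technique, using the Database Replica counters for illustration:

-- first sample of the cumulative counter values
SELECT object_name, counter_name, instance_name, cntr_value
INTO #sample1
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Redone Bytes/sec', 'Log Bytes Received/sec');

-- wait one second, then difference a second sample against the first
WAITFOR DELAY '00:00:01';

SELECT s2.counter_name, s2.instance_name,
       s2.cntr_value - s1.cntr_value AS bytes_per_sec
FROM sys.dm_os_performance_counters s2
JOIN #sample1 s1
  ON s1.object_name = s2.object_name
 AND s1.counter_name = s2.counter_name
 AND s1.instance_name = s2.instance_name
WHERE s2.counter_name IN ('Redone Bytes/sec', 'Log Bytes Received/sec');

DROP TABLE #sample1;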

So, we now have our own calculated log_send_rate and redo_rate, and we need to work them back into the monitoring code we already have. We also need to take into account where this data is being sampled from, so there’s a CASE statement in there too. So below is the code to pull the correct redo and send rate data from perfmon, along with additional Availability Group performance data such as send queue size and redo queue size.

Calculating Availability Group Latency

With the correct values being reported, we can also calculate send and redo latency; in other words, how far in time a replica is behind the primary. We do this by dividing the queue size by the rate, for both the send queue and the redo queue. This is a pretty cool value, basically telling you how long it will take your replica to catch up to the primary.
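Shown here with the DMV columns for clarity (in the actual script the rates come from perfmon); both the queue sizes and the rates are reported in KB, so the division yields seconds:

SELECT ar.replica_server_name,
       drs.log_send_queue_size / NULLIF(drs.log_send_rate, 0) AS send_latency_sec,
       drs.redo_queue_size / NULLIF(drs.redo_rate, 0) AS redo_latency_sec
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id;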

A quick demo 

With a small workload running, generating about 7MB/sec of transaction log as reported by the counter SQLServer:Databases – Log Bytes Flushed/sec, here’s what we see returning from the query defined above on a system running SQL 2014 SP1 CU5.

Availability Group Latency – DMV

From the result set you can see that the DMV is reporting about 140MB/sec for log_send_rate to both replicas, but the perfmon counter is reporting around 7MB/sec for that same value. I confirmed this by also looking at actual network utilization.

The important message here is to validate your data. In conducting those load tests on 2014, I knew something wasn’t right when the send rate was reported as 112Gb/sec! The code above shows both the DMV value and the value derived from perfmon.

Good luck, and happy monitoring!


Monitoring Availability Groups with Redgate SQL Monitor

In previous posts here and here we discussed AlwaysOn Availability Group replication latency and monitoring concepts, specifically the importance of monitoring the send_queue and redo_queue. In this post I’m going to show you a technique for monitoring Availability Group replication latency with Redgate SQL Monitor and its Custom Metric functionality.

Here’s the rub, monitoring AGs is a little interesting for the following reasons

  1. We’re interested in trending and monitoring and that isn’t built into SQL Server or SSMS’s AlwaysOn Dashboard.  Both provide only point in time values. 
  2. We’ll need to monitor the health of the Availability Group as a whole, so we want to track performance data on all replicas in the AG. But interestingly, the redo queue and send queue values in the DMVs on the primary are always NULL. So we need to get those values from the secondary replicas.
  3. Further, to work this into SQL Monitor’s Custom Metric framework we’ll need to limit our query’s result set to a single row and value.

Redo Queue

The redo queue is the amount of log records that have been received on a secondary replica in an AG but have not yet been redone into the database. We want to track this as it is a measure of the amount of data on a secondary that is not yet redone into the database, which can impact operations offloaded to secondaries.

The bold text headings of the next section match the configuration areas on the Custom Metric screen in SQL Monitor.

Enter the T-SQL query that will collect data:
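A sketch of what that query can look like; it returns a single value, the redo queue for the current database on the local replica:

SELECT ISNULL(drs.redo_queue_size, 0) AS redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.database_id = DB_ID()
  AND drs.is_local = 1;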

This query will execute on each replica in the AG and will return data only for that replica. So if a replica does go offline, we will lose the ability to query its information…but hey, it’s offline, so it will be pretty hard to query regardless. In reality there isn’t a better option, since the send queue values on the primary are always NULL.

Select instance to collect from

Select all the instances that are in the AG you want to monitor. The T-SQL query will handle multiple Availability Groups, as the query restricts its data to a database ID on a particular server, and a database can only be a member of one AG.

Choose databases to collect from

Please select all the databases that you are interested in monitoring.

Set collection frequency

Collect data every minute. I wish we could set this to a lower value.

Use collected or calculated values

Leave this unchecked. The DMV queries return point-in-time values, so we’ll want to graph the data as absolute values.

Send Queue

The send queue is used to store log records that have not yet been sent from the primary to a secondary replica. We want to track this as it is a measure of the amount of data that is not on a secondary replica, which can impact your recovery point if there is a failure of the primary.

All of the configuration settings for this Custom Metric are the same as above except the T-SQL Query. Here’s that code:

Enter the T-SQL query that will collect data:
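Same shape as the redo queue metric, just swapping in the send queue column; again, a sketch:

SELECT ISNULL(drs.log_send_queue_size, 0) AS log_send_queue_size
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.database_id = DB_ID()
  AND drs.is_local = 1;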

Send Queue Custom Metric Analysis

So let’s see this in action. Like the last post, we have a system with three replicas: two on the same subnet and one on a simulated remote subnet. We added a 1000ms delay on the simulated WAN link to the remote location to simulate poor network conditions.
 
Send Queue Availability Group Custom Metric
 
With the WAN latency added and a heavy write workload running, the primary replica is trying to send data to a secondary replica faster than the network can support and has to store the data in the send_queue, and SQL Monitor gives us the ability to quickly see this in its Analysis tab.
 
Now let’s simulate a network outage by knocking out the WAN link that connects our primary site and secondary site. Sadly, at this point we lose visibility into the send_queue values on the unreachable secondary, but we’ll have the most recent value before it went offline.
 
With the WAN link down, the primary replica cannot send data to the secondary at the remote subnet and has to start queuing the data in the send_queue until the secondary comes back online.

Redo Queue Custom Metric Analysis

So with the WAN link down the send_queue builds up, and once service is restored, all the log from the primary is sent over to the secondary, but at a rate faster than can be processed by the redo thread, so it goes into the redo_queue…let’s see that in a graph.
 
Redo Queue Availability Group Custom Metric
 
 
Synthetic tests are great, but here is a screenshot from an actual workload on a production system. In this sample you can see a surge in the send_queue followed by a surge in the redo_queue. The workload in this sample is database maintenance doing index rebuilds, and more interesting is that these replicas are on the same subnet!
 

AlwaysOn Availability Group Replication Latency

 

 
I hope these Custom Metrics are useful to you and provide some insight into the replication health of your Availability Groups. Having a firm understanding of your AG replication and latency will allow you to answer some hard questions when it comes to where your data is at any point in time.
 
 

When was your last database backup?

It’s pretty often that you have to sit down at a SQL Server and sort out what the backup situation is. One of the first things that I check is: when did the last backup for each database complete? But answering that question is getting more complicated. If you’re using Availability Groups, you could be offloading your backups to a secondary, and that can skew your backup data. In Availability Groups, database backup history is only stored on the instance where the backup executed.

So I pulled together a quick script that gave me the following information:

  • Database name
  • Recovery model
  • If the database is in an Availability Group
  • Availability Group Preferred Backup Replica status
  • Date and time of the last full backup
  • Date and time of the last log backup
  • Date and time of the last differential backup
  • Days since the last full backup
  • Days since the last differential backup
  • Minutes since the last transaction log backup

I chose to aggregate the data on the backup type…full, differential and log…and when the last backup of each type completed.
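Here’s a sketch of that script, driving off msdb.dbo.backupset; sys.fn_hadr_backup_is_preferred_replica provides the preferred backup replica status:

SELECT d.name AS database_name,
       d.recovery_model_desc,
       CASE WHEN drs.database_id IS NOT NULL THEN 'Yes' ELSE 'No' END AS in_ag,
       sys.fn_hadr_backup_is_preferred_replica(d.name) AS preferred_backup_replica,
       MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full,
       MAX(CASE WHEN b.type = 'I' THEN b.backup_finish_date END) AS last_diff,
       MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log,
       DATEDIFF(DAY,    MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END), GETDATE()) AS days_since_full,
       DATEDIFF(DAY,    MAX(CASE WHEN b.type = 'I' THEN b.backup_finish_date END), GETDATE()) AS days_since_diff,
       DATEDIFF(MINUTE, MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END), GETDATE()) AS minutes_since_log
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b
       ON b.database_name = d.name
LEFT JOIN sys.dm_hadr_database_replica_states drs
       ON drs.database_id = d.database_id AND drs.is_local = 1
GROUP BY d.name, d.recovery_model_desc, drs.database_id;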

With this data, a quick scan across a row tells me the recovery model, whether the database is in an AG or not, whether this is the preferred backup replica, and whether the appropriate backup type is happening at a regular interval. Also, since I’m efficient (*cough* lazy) I added a quick calculation to show how long it’s been since a full or differential has completed, in days, and for log backups, how long in minutes.

Here’s some sample output from the script. I chose to leave the NULLs in there so they really stand out. In this output you can see that the transaction logs on the AG databases haven’t been backed up in a long time on this instance. We should keep looking to find out why :)


Hopefully this script will help you quickly determine what your backup situation is!



Designing for offloaded log backups in AlwaysOn Availability Groups – Part 2 – Monitoring

AlwaysOn Availability Groups have made a big splash in the SQL world and are quickly becoming the HA and DR technology of choice for many SQL Server environments. Crucial to their success is the ability to move data between the replicas in the Availability Group quickly and efficiently. In the previous post we discussed design concepts for offloaded backups in AlwaysOn Availability Groups; specifically, we focused on how data is moved between AGs and the potential impact on backups and recovery. It is important to measure and trend replication health, and this article introduces techniques and queries you can use in your environment to do just that, along with some of the nuances of the data reported in the DMVs.

Measuring Availability Group Replication Latency

Central to measuring replication health is the sys.dm_hadr_database_replica_states DMV. On the primary replica this DMV returns rows representing the current state for each database and its replicas participating in Availability Groups.

The key fields we’re going to focus on for our monitoring are:

  • log_send_queue_size – the amount of log records not sent to a secondary
  • redo_queue_size – the amount of log records not yet redone on the secondary
  • last_commit_time – the time of the last committed log record on a replica
  • last_redone_time – the time the last log record was redone on a replica

Measuring AG Latency When All Replicas Are Online

We want to measure replication latency as this directly impacts the recovery point of databases in the event of a failure. Under normal operations, when all replicas are online, monitoring from the perspective of the primary can provide insight into our replication workload for trending and analysis. Doing so will help identify if there is a need for changes in things like WAN capacity, dedicated network connections for replication, or reducing log generation on the primary.

Here is an example of an AG with each node online and a very small log_send_queue_size and redo_queue_size. All of the examples in this article are executed on the primary (SQL14-A). So in this result set the log_send_queue_size and redo_queue_size values are NULL for the primary replica, since the primary isn’t replicating to itself, with reported values for the secondary replicas.
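The query behind these examples is essentially this sketch, run on the primary so it returns a row per database per replica:

SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.log_send_queue_size,
       drs.log_send_rate,
       drs.redo_queue_size,
       drs.redo_rate,
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name;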

Availability Group Latency

Measuring AG Latency When A Replica Is Offline

The information in sys.dm_hadr_database_replica_states shows the current state known to the server executing the DMV query, so it is possible for information to be stale, as other replicas may not be reporting due to an inability to communicate with the primary. For example, when a replica is disconnected from the primary, the log_send_queue_size field for a database on that replica changes to NULL and the remaining values are no longer updated.

In this example we use the same query as above; we simulated a network outage which caused SQL14-C to be disconnected from the primary, and its log_send_queue_size is now reporting NULL while log_send_rate, redo_queue_size and redo_rate are the last reported values from the secondary.

Availability Group Latency 

NULL Isn’t A Good Value To Trend or Report

In the event of a system failure we need information to report to the business about the state of the system. With a secondary offline and the log_send_queue_size reporting NULL, it’s hard to gauge the amount of data loss. We need to rely on another attribute when querying the DMV for the health of replication. We can use last_commit_time, which indicates the time of the last committed record in the transaction log and retains the last communicated value from the secondaries rather than switching to NULL. There are several other *_time and *_LSN counters in the DMV which are updated with the time of the most recent interaction rather than reflecting the current state of replication health; more on this in a second.

In this example, we use a slightly different query to calculate the difference between the last_commit_time values on the secondary replicas and the primary replica; this gives us the amount of time in seconds the secondaries are behind the primary. Keep in mind, this is the commit time in the transaction log, so if there aren’t any transactions for a period of time before your replicas are disconnected, that will be included in this measurement. The result set shows a secondary that has been disconnected for several minutes.
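Here’s a sketch of that comparison query (is_primary_replica is available on SQL Server 2014 and up):

SELECT ar.replica_server_name,
       drs.last_commit_time,
       DATEDIFF(SECOND, drs.last_commit_time, p.last_commit_time) AS seconds_behind_primary
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id
JOIN sys.dm_hadr_database_replica_states p
  ON p.database_id = drs.database_id
 AND p.is_primary_replica = 1
WHERE drs.is_primary_replica = 0;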


Measuring AG Latency When A Replica Is Reconnected

When that replica is reconnected, the current log_send_queue_size value is reported and the *_time and *_LSN counters in the DMV are updated with the time of the most recent interaction, not necessarily the time of the change recorded in the log record being sent or redone. For example, last_redone_time is the time of the last redone log record, not the time of the data change in the log record being redone.

In this example, we use the first query from above. SQL14-C is reconnected to the primary and is reporting 232,822KB of log in the log_send_queue_size and 15,052KB in the redo_queue_size. Recall that this is the amount of log that must be sent to and redone on the secondary.

Availability Group Latency

Example

We constructed a test that included two Availability Group replicas on different subnets. We generated an insert-heavy workload which caused a measurable replication workload, and added 50ms latency on the network router using tc. During the test, we disconnected the secondary replica from the network, which causes the log_send_queue_size to build on the primary. We then reconnected the replica; the log_send_queue_size is reported properly and the redo queue builds on the secondary.

The chart below shows the data movement between the queues. The chart samples start just before the secondary is reconnected; upon connection the log_send_queue_size reports its current value and begins working its way down. You can also see the redo_queue_size fluctuate based on the progress of the redo thread on the secondary replica. While this is a contrived example with a trivial workload, in practice this can be a protracted event if there is an excessive amount of change during the time the replicas are disconnected, a potentially RPO-impacting event.

Availability Group Latency Chart

What to do next

  • As you design and operate your system, measure replication latency (log_send_queue_size and redo_queue_size) when the replicas are online and healthy, for trending and analysis
  • In the event of an outage be sure to know how far your replicas are behind, this is a critical piece of information for the business
  • As discussed in the first article, develop operations that better support your design, perhaps patching and heavy database maintenance do not happen at the same time
  • Knowing this information will help you better meet the recovery point objective for your organization
 