Tag Archives: Availability Groups

Speaking at SQL Saturday Pensacola!

I’m proud to announce that I will be speaking at SQL Saturday Pensacola on June 3rd 2017! Check out the amazing schedule!

If you don’t know what SQLSaturday is, it’s a whole day of SQL Server training available to you at no cost!

If you haven’t been to a SQLSaturday, what are you waiting for? Sign up now!

My presentation is “Designing High Availability Database Systems using AlwaysOn Availability Groups”.

Abstract:

Are you looking for a high availability solution for your business critical application? You’ve heard about AlwaysOn Availability Groups and they seem like a good solution, but you don’t know where to start. It all starts with a solid design. In this session we introduce the core concepts needed to design an Availability Group based system, covering topics such as recovery objectives, replica placement, failover requirements, synchronization models, quorum, backup and recovery, and monitoring. This session is modeled after real world client engagements conducted by Centino Systems that have led to many successful Availability Group based systems supporting tier 1 business critical applications.

Learning Objectives: 

This session highlights the importance of doing thorough design work up front. Attendees will learn the core concepts needed for successful Availability Group based systems, including recovery objectives, replica placement, failover requirements, synchronization models, quorum, backup and recovery, and monitoring. From this session attendees will have a firm footing on where to start when designing their AlwaysOn Availability Group based systems.

Why Did Your Availability Group Creation Fail?

Availability Groups are a fantastic way to provide high availability and disaster recovery for your databases, but they aren’t exactly the easiest thing in the world to pull off correctly. To do it right, there’s a lot of planning and effort that goes into your Availability Group topology. The funny thing about AGs is as hard as they are to plan…they’re pretty easy to implement…but sometimes things can go wrong. In this post I’m going to show you how to look into things when your AG creation fails.

While working at a customer site today I encountered an error that I hadn’t seen before when creating an Availability Group. So I’m going to walk you through what happened and how I fixed it. If your AGs fail at creation, you can follow this process to dig into why.

First, let’s try to create our Availability Group
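
Here’s a sketch of the sort of CREATE AVAILABILITY GROUP statement we ran; the AG, database, and replica names here are illustrative, not the client’s:

-- Minimal two-replica, synchronous AG (names and endpoint URLs are placeholders)
CREATE AVAILABILITY GROUP [AG1]
FOR DATABASE [YourDatabase]
REPLICA ON
    N'SQL16-A' WITH (
        ENDPOINT_URL = N'TCP://SQL16-A.yourdomain.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC),
    N'SQL16-B' WITH (
        ENDPOINT_URL = N'TCP://SQL16-B.yourdomain.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC);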

But, that fails and we get this error…it tells me what happened and to go look in the SQL Server error log for more details.

OK, so let’s look in the SQL Server error log and see what we find.
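
If you’d rather search the error log with T-SQL than scroll through it in SSMS, the undocumented but long-standing sp_readerrorlog is handy. A quick sketch; parameter 1 is the log number (0 is current), parameter 2 the log type (1 is the SQL Server error log), parameter 3 a search string:

-- Search the current SQL Server error log for Availability Group messages
EXEC sp_readerrorlog 0, 1, N'availability group';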

Clearly something is up, the AG tried to come online but couldn’t.

The error here says to check out the Windows Server Failover Clustering log…so let’s go ahead and do that. But that’s not as straightforward as you’d think. WSFC does write to the event log, but the errors are pretty generic for this issue. Here’s what you’ll see in the System Event Log and the Cluster Events section in the Failover Cluster Manager.

Wow, that’s informative, right? Luckily we still have more information to look into.

Let’s dig deeper using the WSFC cluster logs

The cluster logs need to be queried; they’re not readily available as text for us. We can write them out to file with the PowerShell cmdlet Get-ClusterLog. Let’s make a directory and dump the logs into there.
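
Here’s a minimal sketch of that, assuming C:\ClusterLogs as the dump directory (adjust the path for your environment):

# Create a directory for the logs, then dump the cluster log from every node into it
New-Item -Path 'C:\ClusterLogs' -ItemType Directory -Force
Get-ClusterLog -Destination 'C:\ClusterLogs' -UseLocalTime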

Now we have some data to look through!

When we look at the contents of the cluster log files generated by Get-ClusterLog, we’re totally on the other side of the spectrum when it comes to information verbosity. The logs so far have been pretty terse and haven’t really told us what’s causing the failure…well, dig through this log and you’ll likely find your reason and a lot more information. It’s good stuff to look at to get an understanding of the internals of WSFCs. Now, the reason my Availability Group creation failed was permissions. Check out the log entries.

Well that’s pretty clear about what’s going on…the process creating the AG couldn’t connect to SQL Server to run the very important sp_server_diagnostics stored procedure. A quick internet search to find a fix yielded this article from Mike Fal (b | t) which points to this Microsoft article detailing the issue and fix.

For those that don’t want to click the links here’s the code to adjust the permissions and allow your Availability Group to create.
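
As described in the linked Microsoft article, the fix is granting the NT AUTHORITY\SYSTEM login the permissions it needs (verify against the article for your build):

-- Allow the local SYSTEM account to run sp_server_diagnostics
-- and manage the Availability Group during creation
GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM];
GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM];
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM];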

So to review…here’s how we found our issue.

  1. Read the error the create script gives you
  2. Read the SQL Server error log
  3. Look at your System Event log
  4. Dump your Cluster Logs and review

Use this technique if you find yourself in a situation where your AG won’t come online or worse…fails over unexpectedly or won’t come back online.

Monitoring SLAs with SQL Monitor Reporting

Proactive Reporting for SQL Server

If you’re a returning reader of this blog you know I write often about monitoring and performance of Availability Groups. I’m a very big proponent of using monitoring techniques to ensure you’re meeting your service level agreements in terms of recovery time objective and recovery point objective. In my in person training sessions on “Performance Monitoring AlwaysOn Availability Groups”, I emphasize the need for knowing your system’s baseline for healthy replication and knowing when your system deviates from that baseline. From a monitoring perspective, there are really two key concepts here I want to dig into…reactive monitoring and proactive monitoring.

Reactive Monitoring

Reactive monitoring is configuring a metric, setting thresholds for alerting and reacting when you get the alert. This type of monitoring is critical to the operations of your system. The alerts we configure should model the healthy state of our system…when our system deviates outside of that state, we certainly want to know about it so that we can act…well, really react accordingly.

Proactive Monitoring

Proactive monitoring with an alert based monitoring tool is a little harder. What DBAs and architects do is periodically sit down and go through their existing monitoring systems, reviewing the data over some time interval. And if we’re honest with ourselves, we try to do this at regular intervals but don’t get to it very often because we’re busy. And when we do finally get in there to look, it’s usually to do a post mortem on some sort of production issue…then very often we find that a critical performance attribute had been slowly creeping up over time until it reached a tipping point and caused the issue. We do our analysis, make our system corrections and move on. Still not exactly proactive, mostly because there is still a person in the process.

Reporting on System State

With our reactive monitoring model, we already define what a healthy system state is. Now let’s take people out of the equation. In Redgate’s latest release of SQL Monitor they added a reporting module. In here you can define reports that represent the state of your system, giving you a snapshot view of what’s critical to you about your SQL Server environment. So if you are running Availability Groups, like I mentioned above, you can configure your report with those critical performance metrics already set up so you can quickly get to them via the SQL Monitor interface. But better yet, you can schedule the report to be delivered right to your inbox. Yes, another email. But the report is a simple PDF…you give it a glance, process the data and move on. This saves you from having to go into SQL Monitor’s web interface to view the state of your system. Find something odd? Jump into the Web UI and start digging for the root cause.

Reporting gives us two key advantages

  1. Point in time snapshots of our system’s health – we’ll get that report dropped into our mailbox and on that report is the most critical performance metrics that define the health of our system.
  2. Ability to see slowly changing trends – reporting helps us focus on trends. It allows us to zoom out a bit and view our system over time. Less reacting, more “proacting”

OK and one more…for me, as a consultant, I can define reports for clients and have them emailed to me periodically to review. Let’s take a second to build a simple report together

Creating Your Own Reports

Now normally, I’d show you a bunch of screenshots on how to create a report in SQL Monitor…but rather than do that…go ahead and start trying out the reporting interface yourself using Redgate’s publicly available SQL Monitor Demo Site!


A couple of reports I think you should look at right away are:

  1. Example Report – this is the landing page on the demo site and as you can see we get a snapshot of our servers’ performance.
  2. SCC – Custom Metrics – in SQL Monitor you can add your own custom metrics, these are things that you think are business critical…the report in the demo site shows you SQLServerCentral custom metrics of Emails Sent per Hour and Forum Posts per Hour.
  3. Storage – here you’ll find things like fastest filling disk, disk space used, and database sizes. 
  4. Create Your Own – Download the trial version and create your own report. What are the things that are important to you that you want to have right at your fingertips when you log in or have a report land in your inbox with that data?
In closing, know that your monitoring strategy should take into account both types of monitoring, reactive and proactive, so you can ensure you’re meeting your service levels. Start pushing things into the proactive category as much as possible. Leverage SQL Monitor’s reporting to better understand the state of your system.

Understanding Network Latency and Impact on Availability Group Replication

When designing Availability Group systems one of the first pieces of information I ask clients for is how much transaction log their databases generate. *Roughly*, this is going to account for how much data needs to move between their Availability Group replicas. With that number we can start working towards the infrastructure requirements for their Availability Group system. I do this because I want to ensure the network has a sufficient amount of bandwidth to move the transaction log generated between all the replicas. Basically, are the pipes big enough to handle the generated workload? But bandwidth is only part of the story; we also need to ensure latency is low. Why? Well, we’re going to explore that together in this post!

Network Latency Defined

First, let’s define network latency. It’s how long it takes for a piece of information to move from source to destination. In computer networks, latency is often measured in milliseconds, sometimes microseconds on really fast networks. The most common way we measure network latency is with ping. The measurement ping provides is the time from when the ICMP request is sent until the time it is replied to. This is how long it takes to move a piece of information from source to destination. But the size of the data sent by default is only 64 bytes…that’s not very much data. So really, ping isn’t a good way to measure latency for data intensive applications. As you add more data to the transmission, your latency will increase due to the fact that the payload being transmitted is larger and those bytes have to be placed on the wire and read from the wire on the other side. This all contributes to your network latency. So we really want to measure our network latency at our data transmission size.

server0:~ demo$ ping 192.168.1.1

PING 192.168.1.1 (192.168.1.1): 56 data bytes

64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=0.072 ms


Network Latency and Availability Groups

Now let’s talk about Availability Group replication and network latency. Availability Groups replicate data over your network using Database Mirroring Endpoints, which are TCP sockets used to move data between the primary and its replicas. When designing Availability Groups, we often think about things in terms of bandwidth…how much data do I need to move between my replicas? But there’s another design factor you need to consider: network latency. Why? Hint, it’s not going to have anything to do with synchronous availability mode and HADR_SYNC_COMMIT waits. Let’s talk about some fundamentals of TCP for a second.

How TCP Moves Data

The unit of transfer in TCP is called a TCP segment. TCP requires positive acknowledgement: each segment sent must be acknowledged by the receiver sending an acknowledgement back to the sender confirming receipt of the segment. From that acknowledgement, TCP tracks how long it takes for a receiver to acknowledge each segment. How long? That’s where network latency comes in…acknowledgements will take at least as long as your network’s latency.

Now if TCP waited for each and every TCP segment to be acknowledged before sending the next segment, our system would never fully realize the network link’s capacity. We’d only consume the bandwidth of one segment at a time, which isn’t very much. To overcome this, TCP has a concept called the “Congestion Window”: TCP can have several unacknowledged segments in flight at a point in time and thus consume more of our network connection’s capacity. The number of unacknowledged segments in flight depends on our network’s conditions, specifically latency and reliability. If our network latency is high, TCP will reduce the number of segments in the Congestion Window, which means our effective bandwidth utilization will be low. If the network latency is low, TCP will increase the number of unacknowledged segments in our Congestion Window and our effective bandwidth utilization will be high. If the link isn’t very reliable, TCP will decrease the Congestion Window’s size in response to unacknowledged segments, i.e. “dropped packets”, so if your network line is dropping packets for some reason…expect a reduction in throughput. The Congestion Window is also variable; TCP will grow and shrink it as network conditions change.

Determining your maximum throughput is pretty easy; check out this calculator here. Enter your physical link bandwidth, your round trip latency time, and the size of your TCP window, and you’ll get the maximum throughput of a single TCP stream; in other words, how fast a single TCP connection can transmit data.

Here’s a practical example, if you have a 1Gb link to your DR site and it has 40ms of latency a single TCP stream will only use 13Mb/sec. Wow…huh? CIOs everywhere just cringed reading that.
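
If you want to check that math yourself, assuming the default 64KB TCP window with no window scaling: the window caps the unacknowledged data in flight, and you can only refill it once per round trip, so

65,535 bytes x 8 = 524,280 bits in flight per round trip
524,280 bits / 0.040 seconds = ~13.1Mb/sec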

Now this is just the maximum for a single TCP stream; we can of course potentially have multiple streams sending. In AGs, each database mirroring endpoint uses a TCP stream between the primary and each secondary replica in a standard Availability Group (not Distributed).

How Does Network Latency Impact My Availability Group’s Availability?

  1. Effective bandwidth can decrease – As network latency increases, effective bandwidth decreases…this means our transaction log throughput is reduced.
  2. Availability Group Flow Control – AGs track the number of unacknowledged messages sent from primary to secondary, and if this gets too high the primary will enter flow control mode and will slow down or stop sending messages to the secondary. What can cause the number of unacknowledged AG messages to increase? Network latency. A secondary can initiate flow control too: if it’s experiencing resource contention, it will message the primary and say slow down. 

In both of these cases our ability to move transaction log between replicas is reduced due to network latency, and this can impact availability. If you want to know more about how AGs move data check out this post and this one. Also, it’s important to note that these issues are present regardless of AG availability mode; synchronous-commit or asynchronous-commit, this will impact your replication latency. In fact, this doesn’t just apply to AGs; it’s ANY single TCP stream.

Where do we go from here?

Well, if you’re designing or operating high availability systems using Availability Groups, make sure you understand your network infrastructure and its performance characteristics. If you have AG replicas in a remote data center across a high latency link, test your workload to ensure you’re getting the throughput you need for your transaction log volume. You may have to tune the TCP settings on your operating system to better handle high latency connections, or even make network changes to support your workload.

If you want to dive deeper into how TCP works check out my Pluralsight course – LFCE: Advanced Linux Networking. While it’s not a SQL Server course, the TCP concepts are the same.

Want more SQL goodness like this sent straight to your inbox? Sign up for our newsletter here!

Using Extended Events to Visualize Availability Group Replication Internals

SQL 2014 Service Pack 2 was recently released by Microsoft and there are a ton of great new features and enhancements in this release. This isn’t just a collection of bug fixes…there’s some serious value in this Service Pack. Check out the full list here. One of the key things added in this Service Pack is an enhancement of the Extended Events for AlwaysOn Availability Group replication.

Why are the new Availability Group Extended Events interesting?

If you’ve used Availability Groups in production systems with high transaction volumes you know that replication latency can impact your availability. If you want to brush up on that check out our blog posts on AG Replication Latency, Monitoring for replication latency, and issues with the DMVs when monitoring. These new extended events add insight at nearly every point inside your Availability Group’s replication. More importantly they also include duration. So using these Extended Events we can pinpoint latency inside our Availability Group replication.

Here’s the list and description of the new Extended Events:

  • hadr_log_block_group_commit – Log block group commit processing duration
  • log_block_pushed_to_logpool – Log block start push to log pool
  • log_flush_start – Asynchronous log write start
  • hadr_log_block_compression – Log block compression processing duration
  • hadr_capture_log_block – Primary has captured a log block
  • hadr_capture_filestream_wait
  • ucs_connection_send_msg – UCS transport connection sent message
  • hadr_log_block_send_complete – After a log block message has been sent. This event is only used for fail points
  • log_flush_complete – Reports that synchronous send is complete
  • hadr_receive_harden_lsn_message – When a new hardened LSN is received from the secondary
  • hadr_db_commit_mgr_harden – Transaction commit harden result from Hadron Commit management
  • hadr_transport_receive_log_block_message – Fires when we’re receiving new log block message
  • hadr_log_block_decompression – Log block decompression processing duration
  • hadr_apply_log_block – Secondary is going to append a log block to the log
  • hadr_send_harden_lsn_message – Crafting a message to send containing a new hardened LSN on a secondary.  Test only
  • hadr_lsn_send_complete – After an LSN message has been sent.
The source for this list and its descriptions is Amit Banerjee’s HADR Virtual Chapter Talk and KB3173156

Using Extended Events to visualize Availability Group Replication Internals

Using these Extended Events, we’re able to get some insight into the replication internals of our Availability Groups. Like I described a second ago, we can use these to measure the duration of each event and sift out performance issues in our replication. But we can also use them to get a better understanding of what goes into Availability Group replication, let’s see how.
 
Here’s the steps I used to generate the chart below:
  1. Let’s add an Event Session with all of these new Events above to the primary and secondary replicas of a two replica synchronous Availability Group (see the sketch just after this list)
  2. Then with the sessions running, insert only one row on the primary replica. This will cause the AG to replicate the data change event and trigger the new Extended Events on both the primary and secondary replicas
  3. Stop the Event Sessions
  4. Open the Event file in SSMS and View Target Data on both the primary and secondary replicas
  5. Order the Event data by timestamp
  6. Merge the event data together based on the timestamp (I did this in a spreadsheet)
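
Here’s a minimal sketch of the Event Session from step 1. The session name and file target are illustrative, and the package prefixes (sqlserver, ucs) can vary by build, so verify the event names against sys.dm_xe_objects on your instance:

-- Event session capturing the new AG replication events from KB3173156
CREATE EVENT SESSION [AGReplicationTrace] ON SERVER
ADD EVENT sqlserver.hadr_log_block_group_commit,
ADD EVENT sqlserver.log_block_pushed_to_logpool,
ADD EVENT sqlserver.log_flush_start,
ADD EVENT sqlserver.hadr_log_block_compression,
ADD EVENT sqlserver.hadr_capture_log_block,
ADD EVENT sqlserver.hadr_capture_filestream_wait,
ADD EVENT ucs.ucs_connection_send_msg,
ADD EVENT sqlserver.hadr_log_block_send_complete,
ADD EVENT sqlserver.log_flush_complete,
ADD EVENT sqlserver.hadr_receive_harden_lsn_message,
ADD EVENT sqlserver.hadr_db_commit_mgr_harden,
ADD EVENT sqlserver.hadr_transport_receive_log_block_message,
ADD EVENT sqlserver.hadr_log_block_decompression,
ADD EVENT sqlserver.hadr_apply_log_block,
ADD EVENT sqlserver.hadr_send_harden_lsn_message,
ADD EVENT sqlserver.hadr_lsn_send_complete
ADD TARGET package0.event_file (SET filename = N'AGReplicationTrace.xel')
WITH (MAX_DISPATCH_LATENCY = 1 SECONDS);

ALTER EVENT SESSION [AGReplicationTrace] ON SERVER STATE = START;
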
With this list we have the entire replication sequence of Events for that single insert replicating data from the primary to the secondary from the perspective of both the primary and secondary. I took that information and made the following graphic for you visualizing what it takes to replicate data in a synchronous Availability Group.


 Figure 1: Availability Group Replication Extended Events

From Figure 1, let’s walk through the events…

On the primary

  • The first event that fires is hadr_log_block_group_commit starting things off. It’s my understanding that this initializes the needed code and program state for AG replication
  • Then up next is log_block_pushed_to_logpool. This fires when a log block is copied to the log pool. The log pool is a special memory location used to store log blocks that may need to be read again, since it’s in memory it keeps the readers from doing a physical IO on the transaction log
  • Then log_flush_start, this fires when log blocks are flushed to disk, normal write ahead logging
  • Once the log block has been copied into the log pool and flushed to disk, it’s read from the log pool and compressed then the hadr_log_block_compression event fires
  • After it’s compressed the AG “captures” the log block for replication to the secondary with hadr_capture_log_block 
  • Then the AG builds and sends the message to the secondary and ucs_connection_send_msg fires
  • Once the message is sent the hadr_log_block_send_complete event fires
  • We mark the synchronous send complete with log_flush_complete

On the secondary

  • On the other side of the fence, basically the inverse occurs. We mark the receipt of the message with hadr_transport_receive_log_block_message
  • The log block is decompressed and hadr_log_block_decompression fires
  • Then the log block is appended to the local log buffer and hadr_apply_log_block fires
  • Then the log block is copied to the local log pool and log_block_pushed_to_logpool fires
  • Next, the asynchronous log write happens and log_flush_start marks that start of that
  • And once the asynchronous log write finishes, log_flush_complete fires
  • Once hardened locally, we need to craft a message to send to the primary marking that the LSN is hardened, and hadr_send_harden_lsn_message fires
  • ucs_connection_send_msg then fires when the message is sent to the primary
  • hadr_lsn_send_complete fires, marking the end of sending the message

Back on the primary

  • hadr_receive_harden_lsn_message fires, marking that the secondary has acknowledged receipt of the sent LSN
  • And to wrap it all up hadr_db_commit_mgr_harden marks the end of the log block replication for this sequence

A few important notes…

  • Many of these Extended Events fire at different points and multiple times during replication under different conditions. There’s a field called mode, which is an integer value, that marks various conditions for the events. I have yet to decode what each value of mode is for all of the events but will update this blog as I get some insight.
     
  • It’s very important to note that this example highlights an insert into an Availability Group with a primary and only one secondary configured in synchronous Availability Mode. The code path and the subsequent Extended Events that fire can change based on Availability Mode (sync/async), replication health, and many other conditions that can impact replication.

I hope that I’ve shown you some interesting concepts that you can use to help troubleshoot your Availability Group replication using Extended Events. This post highlights how you can use Extended Events to get some insight into the internals of Availability Group replication.

If you have any questions about this or your Availability Group configuration or performance please feel free to email me at aen@centinosystems.com

Please follow me on Twitter @nocentino to keep up with new blog posts

References

Great references on the internals of the log pool – https://sqlactions.com/2014/03/31/how-it-works-logpool/

HADR Virtual Chapter – http://hadrvc.sqlpass.org/Home.aspx?EventID=5304

Availability Group DMVs Reporting Incorrect Values

In my opinion one of the key features of SQL Server 2016 is the rebuilt and optimized log redo mechanism for AlwaysOn Availability Groups. Check out the many new AG features here. Check out my posts here and here to learn about how Availability Groups move data.

Early last week I was conducting a load test using SQL Server 2016 and wanted to compare the performance of the log redo thread with that of SQL Server 2014. To establish a baseline of 2014’s performance, I constructed a load test using a heavy insert workload on the primary. To measure that workload I used a script that pulls database replication performance data from sys.dm_hadr_database_replica_states.
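
The original script isn’t reproduced here, but its heart was a query along these lines against the DMV (column list trimmed for brevity):

-- Snapshot of send/redo queue sizes and rates for every replica
SELECT ar.replica_server_name,
       drs.database_id,
       drs.log_send_queue_size,
       drs.log_send_rate,
       drs.redo_queue_size,
       drs.redo_rate
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id;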

Oh NO! The DMV data is incorrect!!! 

What I found was alarming: the DMV is reporting incorrect values for log_send_rate and redo_rate! This test was conducted on SQL Server 2014 SP1, build 12.00.4100. This was truly unfortunate because I found this issue on a Monday and I was giving a talk on Performance Monitoring Availability Groups THAT SATURDAY at SQL Saturday #484 – Chicago. So off I went to find a new data source for the presentation and completely rewrite all the demos. The first thing I did was check Connect; yep, there it was. There’s also a mention of a fix in CU5 here. But I was already on a build newer than that, SP1. So I installed 2014 SP1 CU5…and yup…still bad data.

Getting the right data from the right place

So, how did I solve this little issue? Perfmon! There are two performance counter objects in perfmon that hold the same data as the DMVs: SQLServer:Availability Replica and SQLServer:Database Replica. So I went about modifying the script to pull the data from perfmon so I could have accurate data for the talk that Saturday!

The technique is pretty straightforward. You can query perfmon using T-SQL from the sys.dm_os_performance_counters DMV. But to do so you need to know that there are different types of counters, and those types report their data differently. The specific counters I want to query are called “delta counters” as described by Jason Strate here. The counters we’re querying, specifically ‘Redone Bytes/sec’ and ‘Log Bytes/sec’, report on a one second average. So I take a sample, wait one second, take another sample and difference the two values, and that is the actual value of what happened in that interval.
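
Here’s a minimal sketch of that sample-and-difference technique for the redo rate (the database name is illustrative):

DECLARE @sample1 BIGINT, @sample2 BIGINT;

-- First sample of the cumulative counter value
SELECT @sample1 = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND counter_name = 'Redone Bytes/sec'
  AND instance_name = 'YourAGDatabase';

WAITFOR DELAY '00:00:01';

-- Second sample one second later; the difference is the per-second rate
SELECT @sample2 = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND counter_name = 'Redone Bytes/sec'
  AND instance_name = 'YourAGDatabase';

SELECT @sample2 - @sample1 AS redo_rate_bytes_per_sec;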

So, we now have our own calculated log_send_rate and redo_rate, and we need to work them back into the monitoring code I already had. We also need to take into account where this data is being sampled from, so there’s a CASE statement in there too. The full script pulls the correct send and redo rates from perfmon (the simplified sketch above shows the core technique) along with additional Availability Group performance data such as send queue size and redo queue size.

Calculating Availability Group Latency

With the correct values being reported, we can also calculate send and redo latency; in other words, how far in time a replica is behind the primary. We do this by dividing the queue size by the rate, for both the send queue and the redo queue. This is a pretty cool value, basically telling you how long it will take your replica to catch up to the primary.
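
As a quick worked example with illustrative numbers: a send queue of 140,000KB draining at a send rate of 7,000KB/sec means the replica is roughly 140,000 / 7,000 = 20 seconds behind the primary.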

A quick demo 

With a small workload running, generating about 7MB/sec of transaction log as reported by the counter SQLServer:Databases – Log Bytes Flushed/sec, here’s what we see returning from the query defined above on a system running SQL 2014 SP1 CU5.

Availability Group Latency - DMV

From the result set you can see that the DMV is reporting about 140MB/sec for log_send_rate to both replicas, but the perfmon counter is reporting around 7MB/sec for that same value. I confirmed the perfmon number by also looking at actual network utilization.

The important message here is to validate your data. In conducting those load tests on 2014, I knew something wasn’t right with the data when the send rate was 112Gb/sec! The code above shows both the DMV value and the value derived from perfmon.

Good luck, and happy monitoring!

Want to chat about SQL Server or Availability Groups, feel free to drop me a line at aen@centinosystems.com

Monitoring Availability Groups with Redgate SQL Monitor

In previous posts here and here we discussed AlwaysOn Availability Group replication latency and monitoring concepts, specifically the importance of monitoring the send_queue and redo_queue. In this post I’m going to show you a technique for monitoring Availability Group replication latency with Redgate SQL Monitor and its Custom Metric functionality.

Here’s the rub: monitoring AGs is a little interesting for the following reasons:

  1. We’re interested in trending and monitoring and that isn’t built into SQL Server or SSMS’s AlwaysOn Dashboard.  Both provide only point in time values. 
  2. We’ll need to monitor the health of the Availability Group as a whole. So we want to track performance data on all replicas in the AG. But interestingly the redo queue and send queue values in the DMVs on the primary are always NULL. So we need to get those values from the secondary replicas.
  3. Further, to work this into SQL Monitor’s Custom Metric framework we’ll need to limit our query’s result set to a single row and value.

Redo Queue

The redo queue is the amount of log records received on a secondary replica that haven’t yet been redone into the database. We want to track this as it is a measure of the amount of data on a secondary that is not yet redone and can impact operations offloaded to secondaries.

The bold text headings of the next section match the configuration areas on the Custom Metric screen in SQL Monitor.

Enter the T-SQL query that will collect data:
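
The original metric query isn’t reproduced here, but a minimal single-row, single-value sketch looks something like this, assuming SQL Monitor executes it in the context of each monitored database:

-- Redo queue size (KB) for this database on the local replica only
SELECT COALESCE(SUM(drs.redo_queue_size), 0) AS redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1
  AND drs.database_id = DB_ID();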

This query will execute on each replica in the AG and will return data only for that replica. So if a replica does go offline we will lose the ability to query its information…but hey, it’s offline, so it will be pretty hard to query regardless. And in reality, querying only the primary isn’t an option anyway, since the send queue values on the primary are always NULL.

Select instance to collect from

Select all the instances that are in the AG you want to monitor. The T-SQL query will handle multiple Availability Groups, as the query restricts its data to the database ID on a particular server and a database can only be a member of one AG.

Choose databases to collect from

Select all the databases that you are interested in monitoring.

Set collection frequency

Collect data every minute. I wish we could set this to a lower value.

Use collected or calculated values

Leave this unchecked. The DMV queries return point in time values, and we want to graph the data as absolute values.

Send Queue

The send queue is used to store log records that will be sent from the primary to the secondary replica. We want to track this as it is a measure of the amount of data that is not on a secondary replica and can impact recovery point if there is a failure of a primary.

All of the configuration settings for this Custom Metric are the same as above except the T-SQL Query. Here’s that code:

Enter the T-SQL query that will collect data:
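
Again, a minimal single-value sketch with the same assumptions as the redo queue metric above:

-- Send queue size (KB) for this database on the local replica only
SELECT COALESCE(SUM(drs.log_send_queue_size), 0) AS log_send_queue_size
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1
  AND drs.database_id = DB_ID();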

Send Queue Custom Metric Analysis

So let’s see this in action. Like the last post, we have a system with three replicas: two are on the same subnet and one is on a simulated remote subnet. We added a 1000ms delay on the simulated WAN link to the remote location to simulate poor network conditions.
 
Send Queue Availability Group Custom Metric
 
With the WAN latency added and a heavy write workload running, the primary replica is trying to send data to a secondary replica faster than the network can support and has to store the data in the send_queue. SQL Monitor gives us the ability to quickly see this in its Analysis tab.
 
Now let’s simulate a network outage by knocking out the WAN link that connects our primary site and secondary site. Sadly, at this point we lose visibility into the send_queue values on the unreachable secondary, but we’ll have the most recent value before it went offline.
 
With the WAN link down, the primary replica cannot send data to the secondary at the remote subnet and has to start queuing the data in the send_queue until the secondary comes back online.

Redo Queue Custom Metric Analysis

So with the WAN link down the send_queue builds up, and once service is restored, all the log from the primary is sent over to the secondary, but at a rate faster than the redo thread can process it, so it goes into the redo_queue…let’s see that in a graph.
 
Redo Queue Availability Group Custom Metric
 
 
Synthetic tests are great, but here is a screenshot from an actual workload on a production system. In this sample you can see a surge in the send_queue followed by a surge in the redo_queue. The workload in this sample is database maintenance doing index rebuilds, and more interestingly, these replicas are on the same subnet!
 

AlwaysOn Availability Group Replication Latency

 

 
I hope these Custom Metrics are useful to you and provide some insight into the replication health of your Availability Groups. Having a firm understanding of your AG replication and latency will allow you to answer some hard questions when it comes to where your data is at any point in time.
 
Twitter @nocentino
Did you like this post? If so, click here to sign up for our newsletter and get more SQL goodness delivered right to your inbox!
 

Designing for offloaded log backups in AlwaysOn Availability Groups – Part 2 – Monitoring

AlwaysOn Availability Groups have made a big splash in the SQL world and are quickly becoming the HA and DR technology of choice for many SQL Server environments. Crucial to their success is the ability to move data between the replicas in the Availability Group quickly and efficiently. In the previous post we discussed design concepts for offloaded backups in AlwaysOn Availability Groups; specifically, we focused on how data is moved between AGs and the potential impact on backups and recovery. It is important to measure and trend replication health, and this article introduces techniques and queries that you can use in your environment to do just that, along with some of the nuances of the data reported in the DMVs.

Measuring Availability Group Replication Latency

Central to measuring replication health is the sys.dm_hadr_database_replica_states DMV. On the primary replica this DMV returns rows representing the current state for each database and its replicas participating in Availability Groups.

The key fields we’re going to focus on for our monitoring are:

  • log_send_queue_size – the amount of log records not sent to a secondary
  • redo_queue_size – the amount of log records not yet redone on the secondary
  • last_commit_time – the time of the last committed log record on a replica
  • last_redone_time – the time the last log record was redone on a replica

Measuring AG Latency When All Replicas Are Online

We want to measure replication latency as this directly impacts the recovery point of databases in the event of a failure. Under normal operations, when all replicas are online, monitoring from the perspective of the primary can provide insight into our replication workload for trending and analysis. Doing so will help identify if there is a need for changes in things like WAN capacity, dedicated network connections for replication, or reducing log generation on the primary.
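
A sketch of the kind of query used for the examples in this article (run on the primary; the column list is trimmed):

-- Replication latency snapshot, one row per database per replica
SELECT ar.replica_server_name,
       drs.database_id,
       drs.log_send_queue_size,
       drs.log_send_rate,
       drs.redo_queue_size,
       drs.redo_rate,
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name;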

Here is an example of an AG with each node online and a very small log_send_queue_size and redo_queue_size. All of the examples in this article are executed on the primary (SQL14-A). So in this result set the log_send_queue_size and redo_queue_size values are NULL for the primary replica, since the primary isn’t replicating to itself, and are reported for the secondary replicas.

Availability Group Latency

Measuring AG Latency When A Replica Is Offline

The information in sys.dm_hadr_database_replica_states shows the current state known to the server executing the DMV query, so it is possible for information to be stale as other replicas may not be reporting due to an inability to communicate with the primary. For example, when a replica is disconnected from the primary, the log_send_queue_size field for a database on that replica changes to NULL and the remaining values are no longer updated.

In this example we use the same query as above. We simulated a network outage which caused SQL14-C to be disconnected from the primary; its log_send_queue_size is now reporting NULL, and log_send_rate, redo_queue_size and redo_rate are the last reported values from the secondary.

Availability Group Latency 

NULL Isn’t A Good Value To Trend or Report

In the event of a system failure we need information to report to the business about the state of the system. With a secondary offline and the log_send_queue_size reporting NULL, it’s hard to gauge the amount of data loss. We need to rely on another attribute when querying the DMV for the health of replication. We can use last_commit_time, which indicates the time of the last committed record in the transaction log and retains the last communicated value from the secondaries rather than switching to NULL. There are several other *_time and *_LSN counters in the DMV which are updated with the time of the most recent interaction rather than reflecting the current state of replication health; more on this in a second.

In this example, we use a slightly different query to calculate the difference between the last_commit_time values on secondary replicas and the primary replica, this gives us the amount of time in seconds the secondaries are behind the primary. Keep in mind, this is the commit time in the transaction log, so if there aren’t any transactions for a period of time before your replicas are disconnected that will be included in this measurement. The result set shows a secondary that has been disconnected for several minutes.
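
That slightly different query looks something like this sketch (is_primary_replica is available on SQL Server 2014 and up):

-- Seconds each secondary is behind the primary, based on last_commit_time
SELECT ar.replica_server_name,
       DATEDIFF(ss, drs.last_commit_time, pri.last_commit_time) AS seconds_behind_primary
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
JOIN sys.dm_hadr_database_replica_states pri
    ON pri.database_id = drs.database_id
   AND pri.is_primary_replica = 1
WHERE drs.is_primary_replica = 0;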


Measuring AG Latency When A Replica Is Reconnected

When that replica is reconnected, the current log_send_queue_size value is reported, and the *_time and *_LSN counters in the DMV are updated with the time of the most recent interaction, not necessarily the time of the change recorded in the log record being sent or redone. For example, last_redone_time is the time of the last redone log record, not the time of the data change in the log record being redone.

In this example, we use the first query from above. SQL14-C is reconnected to the primary and is reporting 232,822KB in the log_send_queue_size and 15,052KB in the redo_queue_size. Recall that this is the amount of log that must be sent to and redone on the secondaries.

Availability Group Latency

Example

We constructed a test that included two Availability Group replicas on different subnets. We generated an insert heavy workload which caused a measurable replication workload, and added 50ms of latency on the network router using tc. During the test, we disconnected the secondary replica from the network, which caused the log_send_queue_size to build on the primary. We then reconnected the replica; the log_send_queue_size is reported properly and the redo_queue_size builds on the secondary.

The chart below shows the data movement between the queues. The chart samples start just before the secondary is reconnected; upon connection the log_send_queue_size reports its current value and begins working its way down. You can also see the redo_queue_size fluctuate based on the progress of the redo thread on the secondary replica. While this is a contrived example with a trivial workload, in practice this can be a protracted, potentially RPO impacting event if there is an excessive amount of change during the time the replicas are disconnected.

Availability Group Latency Chart

What to do next

  • As you design and operate your system, measure replication latency (log_send_queue_size and redo_queue_size) when the replicas are online and healthy, for trending and analysis
  • In the event of an outage be sure to know how far your replicas are behind, this is a critical piece of information for the business
  • As discussed in the first article, develop operations that better support your design, perhaps patching and heavy database maintenance do not happen at the same time
  • Knowing this information will help you better meet the recovery point objective for your organization
 
If you need help with your Availability Groups or other parts of your SQL Server environment, please feel free to contact me 
 

Twitter @nocentino

Designing for offloaded log backups in AlwaysOn Availability Groups – Part 1

AlwaysOn Availability Groups made their initial appearance in SQL 2012 and have generated a lot of buzz: HA and DR in one! Even with AGs, backups are still integral to your DR strategy, and with AGs you’re given the option to offload backups to a secondary replica. In this blog we’re going to talk about offloaded log backups and the potential impact to your databases’ recoverability under certain conditions. We’ll begin with some preliminaries on data movement in AGs.

How data is moved in Availability Groups

Roughly speaking data is synchronized between replicas in an Availability Group by sending log blocks from the transaction log of the primary (read/write) replica over a database mirroring endpoint to the secondary replicas. However, log blocks are not immediately sent, they are passed through a set of queues.

  • Send queue – a queue used to store log records that will be sent from the primary to the secondary replica 
  • Redo queue – a queue used to store log records received on the secondary, that have to be “played back” on the secondary replica

The send queue’s impact on recovery point

In the event that log generation on the primary exceeds the rate at which log blocks are sent to secondaries, or a secondary becomes unavailable, log blocks can build up in the send queue. If there is unsent data in the send queue, this is the amount of data at risk in the event of a disastrous failure of the primary replica.

You might be thinking to yourself: I configured the AG in synchronous availability mode, I should be safe. Even in synchronous mode, the send queue can build up. A secondary replica in an Availability Group can become unavailable for any number of reasons: patching, a network outage, etc. In this situation the data destined for the secondary replica(s) is stored in the send queue. If the secondary comes back online, the primary will work diligently to send the data to the secondary.

The redo queue’s impact on recovery point

In the event that log blocks received by a secondary exceed the rate at which the records can be processed by the redo thread, the redo queue will grow. This can happen when the secondary simply cannot keep up with the rate of change on the primary or during an outage of a secondary. A practical example of the latter is when a secondary comes back online after an outage and there was a large amount of data change during that outage. Think of the times when database maintenance is running on a weekend and the network team just happens to be updating switch firmware. All of that change is queued in the send queue on the primary, and when the secondary is back online, quickly shipped over to the redo queue on the secondary.

Now, with that huge chunk of data change hardened on the secondary and in the redo queue, one would think we’re in the clear. Well, sort of. Yes, your data is in two places, but now the offloaded transaction log backups on your secondary may start to fail if the secondary replica is too far behind the primary. Specifically, in the event that the database’s last backup LSN (the log record beyond the end of the log backup) on the primary is greater than the local redo LSN on the secondary, your backups will fail. This is a protection to prevent gaps in the continuity of the transaction log in backups, and a condition you need to be aware of when designing offloaded backups: while it persists the database is unable to take a successful log backup, impacting RPO.

Availability Group LSN log

Figure A: last LSN and redo LSN positions in the transaction log

Designing for offloaded backups in AlwaysOn Availability Groups

Availability Groups allow us to offload backups to secondary replicas. This is a great option as it reduces IO on the primary replica. In doing so, system designers need to be aware of the impact the health of AG replication has on offloaded backups. Understand when your data is at risk, during times when the secondaries are not completely caught up with the changes from the primary, and develop techniques to mitigate that risk and protect the RPO and RTO the business expects.

Here are a few things to keep in mind:

  • Understand and minimize the amount of log generation that occurs in your databases and design to support that load
  • Monitor log_send_queue_size and redo_queue_size in sys.dm_hadr_database_replica_states on replicas to measure impact on recovery point objectives 
  • Understand your system’s operations, consider downtime for patching and network maintenance
  • Understand resource contention on shared infrastructure, are you competing for things like network bandwidth, disk IO?
In the next article we’ll explore some DMVs that will allow you to get insight into AG replication health, and in the final post of this series we’ll show how to use Redgate’s SQL Monitor custom metrics for trending and analysis.

If you need help with your Availability Groups or other parts of your SQL Server environment, please feel free to contact me
 

Twitter @nocentino

Availability Group Read-only Routing

This morning at a customer site I was researching an issue where Availability Group read-only routing was not working correctly. Quickly I was able to determine the issue was a misconfigured read-only routing list. In this blog post I’ll show you the requirements for read-only routing in Availability Groups, how I determined the routing list was the issue, and what to do to fix the situation. The requirements for read-only routing in Availability Groups are:

  1. A configured listener
  2. At least one replica is configured for read-only access
  3. Each secondary is configured with a URL
  4. Each replica has a configured routing list
  5. The replica being routed to must be synchronized or synchronizing.

Now, let’s explore the requirements and then what happens when condition 5 is not met, and the steps to find and remedy the issue.

Is there a listener configured?
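
A quick check against the catalog views, a sketch along the lines of what I ran:

-- Requirement 1: is there a listener?
SELECT agl.dns_name, agl.port
FROM sys.availability_group_listeners agl;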

This shows us that there is one AG listener; requirement 1 met! Next, let’s review the current access configuration and read-only routing URL configuration of each replica.
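
Something like this sketch:

-- Requirements 2 and 3: read-only access and a routing URL on each replica
SELECT ar.replica_server_name,
       ar.secondary_role_allow_connections_desc,
       ar.read_only_routing_url
FROM sys.availability_replicas ar;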

Screen Shot 2015 01 25 at 12 14 38 PM1

So we can see each is configured for read-only access, indicated by secondary_role_allow_connections_desc = ALL, and that each replica has a read-only routing URL. Excellent, requirements 2 and 3 are met; let’s move on.

Let’s explore the routing lists for the Availability Group
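
The routing lists live in sys.availability_read_only_routing_lists; a sketch of the query:

-- Requirement 4: the routing list for each replica when it holds the primary role
SELECT pri.replica_server_name AS primary_replica,
       sec.replica_server_name AS routed_to_replica,
       rorl.routing_priority
FROM sys.availability_read_only_routing_lists rorl
JOIN sys.availability_replicas pri ON rorl.replica_id = pri.replica_id
JOIN sys.availability_replicas sec ON rorl.read_only_replica_id = sec.replica_id
ORDER BY primary_replica, rorl.routing_priority;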

From this output we can see that SQL14-B is the read-only replica when SQL14-A is primary, and SQL14-A is the read-only replica when SQL14-B is primary. Requirement 4 met, right? Yes, but let’s dig deeper. Let’s test out connectivity under the condition where the current read-only replica does not meet the routing requirements listed above. Stop data movement to SQL14-B, which changes the database state to Not Synchronizing.
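
Suspending data movement is a one-liner on the secondary (the database name here is assumed):

-- Run on SQL14-B to stop data movement for this database
ALTER DATABASE [YourDatabase] SET HADR SUSPEND;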

Now let’s test connectivity with SQLCMD, which now has ApplicationIntent as a command line parameter. You’ll need the -d parameter, which sets the database context, for read-only routing to work.
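
Something along these lines (listener and database names assumed):

sqlcmd -S tcp:AG1-Listener -d YourDatabase -E -K ReadOnly -Q "SELECT @@SERVERNAME"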

Yikes, that didn’t work. Shouldn’t it have just failed back to SQL14-A? It didn’t. Why not? Well, our only configured read-only replica, SQL14-B, no longer meets the connection requirements. We need to tell the system what to do in this state. Let’s change our read-only routing lists to include both the primary and the read-only replica.

What we need to do is configure the read-only routing list to point to all possible read-only replicas, including the primary.
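
A sketch of that change (the AG name here is illustrative):

-- Each replica's routing list now includes both replicas, secondary first
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL14-A' WITH
    (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL14-B', N'SQL14-A')));

ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL14-B' WITH
    (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL14-A', N'SQL14-B')));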

Let’s check our routing list again, as you can see each server now has two entries and the priority is set for the entries relative to their position in our routing list

So let’s test with SQLCMD again

Excellent, the AG is back up and running for applications configured with ApplicationIntent=ReadOnly, but we’re not quite there yet; we’re still pointing to the primary. We need to restart data movement on SQL14-B to get our database state back to Synchronized so the workload can use the read-only secondary.
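
Again a one-liner on the secondary (database name assumed):

-- Run on SQL14-B to resume data movement
ALTER DATABASE [YourDatabase] SET HADR RESUME;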

And let’s try with SQLCMD once more

Success! The key takeaway here is that the read-only routing list needs to include all possible read-only replicas, including the primary, not just the servers that are current secondaries.

Please feel free to contact me with any questions regarding Availability Groups or other SQL Server related issues at aen@centinosystems.com

References: Configure Read-Only Routing for an Availability Group (SQL Server)

Shout out to Jared Karney for picking up an error in this post! Thanks!