Tag Archives: Engineering

Using Extended Events to Visualize Availability Group Replication Internals

SQL 2014 Service Pack 2 was recently released by Microsoft and there is a ton of great new features and enhancements in this release.This isn’t just a collection of bug fixes…there’s some serious value in this Service Pack. Check out the full list here. One of the key things added in this Service Pack is an enhancement of the Extended Events for AlwaysOn Availability Group replication.

Why are the new Availability Group Extended Event interesting?

If you’ve used Availability Groups in production systems with high transaction volumes you know that replication latency can impact your availability. If you want to brush up on that check out our blog posts on AG Replication Latency, Monitoring for replication latency, and issues with the DMVs when monitoring. These new extended events add insight at nearly every point inside your Availability Group’s replication. More importantly they also include duration. So using these Extended Events we can pinpoint latency inside our Availability Group replication.

Here’s the list and description of the new Extended Events:

  • hadr_log_block_group_commit – Log block group commit processing duration
  • log_block_pushed_to_logpool – Log block start push to log pool
  • log_flush_start – Asynchronous log write start
  • hadr_log_block_compression – Log block compression processing duration
  • hadr_capture_log_block – Primary has captured a log block
  • hadr_capture_filestream_wait
  • ucs_connection_send_msg – UCS transport connection sent message
  • hadr_log_block_send_complete – After a log block message has been sent. This event is only used for fail points
  • log_flush_complete – Reports that synchronous send is complete
  • hadr_receive_harden_lsn_message – When a new hardened LSN from the secondary
  • hadr_db_commit_mgr_harden – Transaction commit harden result from Hadron Commit management
  • hadr_transport_receive_log_block_message – Fires when we’re receiving new log block message
  • hadr_log_block_decompression – Log block decompression processing duration
  • hadr_apply_log_block – Secondary is going to append a log block to the log
  • hadr_send_harden_lsn_message – Crafting a message to send containing a new hardened LSN on a secondary.  Test only
  • hadr_lsn_send_complete – After an LSN message has been sent.
The source for this list and it’s descriptions is Amit Banerjee’s HADR Virtual Chapter Talk and KB3173156

Using Extended Events to visualize Availability Group Replication Internals

Using these Extended Events, we’re able to get some insight into the replication internals of our Availability Groups. Like I described a second ago, we can use these to measure the duration of each event and sift out performance issues in our replication. But we can also use them to get a better understanding of what goes into Availability Group replication, let’s see how.
Here’s the steps I used to generate the chart below:
  1. Let’s add an Event Session with all of these new Events above to the primary and secondary replicas of a two replica synchronous Availability Group
  2. Then with the sessions running, insert only one row on the primary replica. This will cause the AG to replicate the data change event and trigger the new Extended Events on both the primary and secondary replicas
  3. Stop the Event Sessions
  4. Open the Event file in SSMS and View Target Data on both the primary and secondary replicas
  5. Order the Event data by timestamp
  6. Merge the event data together based on the timestamp (I did this in a spreadsheet)
With this list we have the entire replication sequence of Events for that single insert replicating data from the primary to the secondary from the perspective of both the primary and secondary. I took that information and made the following graphic for you visualizing what it takes to replicate data in a synchronous Availability Group.

AG Synchrnous Replication Internals

 Figure 1: Availability Group Replication Extended Events

From Figure 1, let’s walk through the events…

On the primary

  • The first event that fires is hadr_log_block_group_commit starting things off. It’s my understanding that this initializes the needed code and program state for AG replication
  • Then up next is log_block_pushed_to_logpool. This fires when a log block is copied to the log pool. The log pool is a special memory location used to store log blocks that may need to be read again, since it’s in memory it keeps the readers from doing a physical IO on the transaction log
  • Then log_flush_start, this fires when log blocks are flushed to disk, normal write ahead logging
  • Once the log block has been copied into the log pool and flushed to disk, it’s read from the log pool and compressed then the hadr_log_block_compression event fires
  • After it’s compressed the AG “captures” the log block for replication to the secondary with hadr_capture_log_block 
  • Then builds and sends the message to the secondary and ucs_connection_send_msg fires
  • Once the message is sent the hadr_log_block_send_complete event fires
  • We mark the synchronous send complete with log_flush_complete

On the secondary

  • On the other side of the fence, basically the inverse occurs. We mark the receipt of the message with hadr_transport_receive_log_block_message
  • The log block is decompressed and hadr_log_block_decompression fires
  • Then the log block is appended to the local log buffer and hadr_apply_log_block fires
  • Then the log block is copied to the local log pool and hadr_pushed_to_logpool fires
  • Next, the asynchronous log write happens and log_flush_start marks that start of that
  • And once the asynchronous log write finishes, log_flush_complete fires
  • Once hardened locally we need to craft a message to send to the primary marking the the LSN is hardened and hadr_send_harden_lsn_message fires
  • ucs_connection_send_msg then transmits fires when the message is sent to the primary
  • hadr_lsn_send_complete marks the fires marking the end of sending the message

Back on the primary

  • hadr_receive_harden_lsn_message fires marking that the secondary has acknowledging receipt of the send LSN
  • And to wrap it all up hadr_db_commit_mgr_harden marks the end of the log block replication for this sequence

A few important notes…

  • Many of these Extended Events fire at different points and multiple times during replication under different conditions. There’s a field called mode, which is an integer value, that marks various conditions for the events. I have yet to decode what each value of mode is for all of the events but will update this blog as I get some insight.
  • It’s very important to note that this example highlights an insert into an Availability Group with a primary and only one secondary configured in synchronous Availability Mode. The code path and the subsequent Extended Events that fire can change based on Availability Mode (sync/async), replication health, and many other conditions that can impact replication.

I hope that I’ve shown you some interesting concepts that you can use to help troubleshoot your Availability Group replication using Extended Events. This post, highlights how you can use Extended Events to get some insight into the the internals of Availability Group replication.

If you have any questions about this or your Availability Group configuration or performance please feel free to email me at aen@centinosystems.com

Please follow me on Twitter @nocentino to keep up with new blog posts


Great references on the internals of the log pool – https://sqlactions.com/2014/03/31/how-it-works-logpool/

HADR Virtual Chapter – http://hadrvc.sqlpass.org/Home.aspx?EventID=5304

SQL Server, Persistent Memory on NVDIMMs and DAX

Paradigm Shift!

What do I mean by that? Every once in a while a technology comes along and changes the way things are done, moves the bar…well last week Microsoft released a Channel 9 video on persistent memory using NVDIMMs and DAX on Windows 2016…then combining it with SQL Server! This is one of those technologies that moves the bar! Check it out here.

Why is this important?

Relational databases like SQL Server use a transaction log to ensure the durability of the transactional operations to the database. This is so it can ensure its data is consistent in the event of a system failure. SQL Server uses a logging protocol called write ahead logging (WAL). This means that the data change operations must be written to stable, persistent storage before the process can change data in the database file…so this means our database throughput is at the mercy of the performance of the device the changes (log records) are written to.

Another facet of this is disk latency. Historically this has been terribly inefficient, especially in the case of spinning HDDs due to seek time and write latency. Disk latency has gotten much better recently with SSDs and NVMe. To combat disk latency, SQL Server doesn’t perform a disk write for each and every log record. SQL Server writes log records to what’s called a log buffer in main memory, once the log buffer is full it or a transaction commits (or aborts) it flushes that buffer to disk in sector aligned IO operations. This means your transactions have to wait for the disk IO to complete before they’re considered hardened to disk. This is a core bottleneck in RDBMS systems.

What changed?

Well, in NVDIMMs we have a DIMM, which is normally used for main memory, is now backed with a battery backed flash array. This means we can have access to stable storage at main memory speeds, think less than 10 microseconds, not milliseconds. An order of magnitude performance gain! NVDIMMs have two modes of operation, storage mode and memory mode. Storage mode will present the device as a block device to your OS. Memory mode will present this as a memory mapped device. Windows Server 2016 has the drivers to deal with both modes. In the first case you’ll get a block device and all the code that goes along with making a block IO request, the advantage of this mode is no application code changes, it’s just a block device like any other. In the second mode, memory mode, it presents byte addressable memory to the OS as storage, but you may need to change code to take full advantage of this. So, Microsoft has developed a feature called DirectAcces (DAX) in Windows Server 2016 to present memory mode storage to the OS bypassing traditional block IO code. This is where the insanity ensues! You’ll literally have stable storage available to your applications at main memory speeds, committing to stable storage is now just a memory copy!!! Watch the video for performance numbers!

How this will impact you?

Well, in the video Lindsey Allen highlights a code change in SQL Server that modifies how the log buffer writes data. Log writes are immediately considered hardened when committed to the log buffer. We no longer have to wait for the disk IO to complete, side stepping disk latency and IO altogether. This leads to a 2x improvement in the demo application in the video. These tests compare NVMe and DAX, and NVMe is PCI-express attached flash which is as good as it gets for persistent storage today (well until memory mode NVDIMMS take over). You’ll notice that CPU spikes during their load tests, this isn’t an uncommon issue when you remove your storage bottlenecks from your system.

What’s it going to cost?

Looking online, HPE has a 8GB NVDIMM available for $800 at online wholesalers, this is in DIMM form factor. Some are built directly into the motherboard, others are on cards.

Questions that I have?

  1. Foreign Memory Access – In a NUMA system is there be a performance penalty or inconsistent performance when threads need to read or write to a foreign memory area. Log writing is single threaded, so this should be something that we can manage. It does go multithreaded in 2016, so the best thing to do here is test your workload and measure.
  2. Hypervisor Support – can we leverage this awesomeness in a virtual machine?
  3. Error Detection – how to errors percolate back up into the OS/SQL so we know when something goes wrong. Likely using standard disk IO alerts will apply here.

Well hopefully I’ve convinced you how big of a deal this is and you’re as excited as I am!

Please let reach out to me if you like to know more about this – aen@centinosystems.com

CPU Scheduling Basics – Windows and SQL Server

In this post we’re going to introduce the basics of CPU scheduling.

In a computer system, only one thing can happen at a time. More specifically, only one task can be on a processor at a point in time. This can expand to several tasks if the system has multiple processors or a processor with multiple cores, which most modern systems have. For example, a four core system can potentially execute four tasks concurrently. 

So since only one task can be on a logical processor (or core) at one point in time, something has to manage which task gets access to the CPU. This is what the scheduler does. CPU scheduling is a very (VERY) deep topic and I’m going to introduce you to a couple basic concepts that can help you understand how scheduling works in Windows and SQL Server. So when you see things slow down in either of one of these technologies you know where you can start to look first.

If a task isn’t on a CPU, it’s not doing any processing. It’s simply waiting. Each CPU has its own task scheduler and the task scheduler is the thing that goes and assigns work to the logical CPU (or core).

Let’s check out a few examples of schedulers…

First up is Windows, it uses a preemptive, priority scheduler. This means processes with higher priority are allowed to access the CPU first, but also will stop a running process to schedule a higher priority process. Each process gets fixed amount of time on the processor, called a quantum, then Windows moves that process off the CPU and places into a queue waiting for access to the CPU again, so no one process dominates CPU time. 

SQL Server uses a non-preemptive, co-operative scheduler. Which also uses the quantum concept, but the task voluntarily yields and is switched off, placed into a queue waiting for access to the CPU again. Again, this ensures that no one task dominates the CPU, but here moving off the CPU is left up to the thread rather than the scheduler. In SQL Server a task is your work, that query you want to execute. A worker is the execution context of your work. A task is assigned to a worker, which is assigned to a scheduler for execution on the CPU. A scheduler is the manager of what gets on the CPU, there is one per logical processor.

Both Windows and SQL Server schedulers have a concept of process state and there are a couple of deeper nuances for each but here I’m going to generalize them into three states:

  • Running – a task executing on the CPU, an actively running process.
  • Waiting – a task is waiting on something before it can get back onto the CPU. For example, a process could be waiting on an I/O operating such as a read from a disk or network socket.
  • Runnable – a task waiting to get onto the CPU, nothing is preventing it from getting onto the CPU (no waiting)

You often hear CPU measured in percent CPU up to 100. But measuring CPU in terms of percent used is only part of the picture. I like to measure how long processes and tasks are waiting to get access to the CPU. Because what really we’re interested in, is how fast our work is progressing and it can only progress when it’s on a CPU doing work.

So here are a few basic ways to measure CPU pressure in terms of what’s waiting for access to a CPU

  • In Windows there’s a perfmon counter called processor queue length, anything in the queue is waiting to get onto the CPU. Each processor has its own queue. Check out a deeper dive here.
  • In SQL Server, we measure the number of tasks that are on the scheduler’s runnable queue, the lower the better. You can find that in the sys.dm_os_schedulers DMV in a column named runnable_task_count. Each CPU will have scheduler, and each will have a runnable task count. The query below will show you the runnable tasks, current tasks, and tasks waiting for a worker per scheduler. One row will be returned per scheduler.
select	  scheduler_id		--ID of the scheduler
	, cpu_id		--ID of the CPU
	, runnable_tasks_count	--a worker, with a task, waiting on CPU.
	, current_tasks_count	--total tasks associated with this scheduler.
	, work_queue_count	--total tasks waiting for a worker.
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
ORDER BY work_queue_count DESC

I hope this helps you get a basic understanding of CPU scheduling in Windows and SQL Server and starts you on a journey of learning more about how it works!

Feel free to contact me to discuss more!

Did you like this post? If so, click here for to sign up for our newsletter to get more SQL goodness delivered right to your inbox!

Twitter @nocentino

Load Testing Your Storage Subsystem with Diskspd – Part II

In this post we’re going discuss how to implement load testing of your storage subsystem with DiskSpd. We’re going to craft tests to measure bandwidth and latency for specific access patterns and IO sizes. In the last post “Load Testing Your Storage Subsystem with Diskspd”  we looked closely at access patterns and I/O size and discussed the impact each has on key performance attributes. 

Diskspd command options

Let’s start with some common command options, don’t get caught up on the syntax. Diskspd’s documentation is fantastic. It’s included with the program download here. Here I’m going to tell you why I set these settings this way, so you can adjust them as needed for your environments.

  • Duration ( -d<seconds> ) – this is the runtime of the test, the longer the better. The longer your test the more likely you’ll smooth out any performance anomalies, such as competing for shared resources. A longer test will likely invalidate any caches along the I/O path that may cache data and skew your results. We’re trying to measure the I/O capacity of the whole pipeline…not any caches.
  • Threads ( -T<count> or -f<count> ) – if -T is used this is the number of threads per file. If -f is used this is the number of threads for the whole test. For smaller systems I usually set number of threads to the number of cores. On larger systems I start at 16 and change it up or down based on results. For systems with very fast I/O paths you may need to add additional I/O threads to max out the throughput.
  • Outstanding I/Os ( -o<count>) – the number of I/Os ready to be dispatched per thread. Your storage subsystem may be fantastic, so when testing for throughput you may need to stack on more I/Os to increase the pressure. I usually start with this equal to the number of spindles in my LUN. Then I increase outstanding I/Os until I start to see latency increase. Once you see that, congrats you just saturated your I/O subsystem! Use this in conjunction with threads when trying to saturate an I/O path. If latency is already at unacceptable levels, reduce outstanding I/Os…but you’ll likely start to see a reduction in throughput. Try to find the sweet spot between minimum latency and maximum bandwidth. If each match the physical attributes of your disk subsystem you’re heading in the right direction. If outstanding I/Os is set to 1, the I/O is synchronous, all other values are asynchronous.…more on this later. 
  • File Size ( -c<size> ) – the file size, I like to have this be larger than the largest cache in the I/O pipeline. This includes your HBA, SAN controller…anything along the way between the running process and the disk. 
  • Block Size ( -b<size>[K|M|G|b] ) – the size of the IO, this is what we’ll change this to match varying I/O patterns in SQL Server.
  • Disable hardware write and software caching ( -h ) – we want to disable software (file system) caching and request disabling hardware caching. Disabling hardware caching is only a request of the storage hardware and that’s one of the reasons why want to ensure the file size we use is larger than the largest cache in our I/O path. This is enabled on all of the tests in this post. Further, for durability reasons most major relational database systems, SQL Server included, do not use the file system cache. They rely on their own caching mechanisms. 
  • IO Pattern – discussed in detail in our previous post here
    • Random I/O ( -r 
    • Sequential I/O ( -s ) – if using multiple threads, use -si this will coordinate the threads’ access into the file ensuring a sequential access pattern.
  • Write Percentage-w<percentage> ) – 0 is all reads, 100 is all writes.You can choose any value between, but I like to isolate read and write tests for analysis. 
  • Measure latency statistics ( -L ) – the whole reason we’re doing this is to understand our performance, go ahead and turn this on.

Impact of I/O Access Patterns

Here are some example Diskspd tests that implement sequential and random access patterns. These tests simulate index seeks/point queries and index scans/range queries.

  • Random

    diskspd.exe -d15 -o32 -t4 -b64K -h -r -L -w0 D:\TEST\iotest.dat

    This test will run for 15 seconds, with 32 outstanding IOs, using 4 threads, with 64k IOs. The hardware and software caches are disabled, access pattern is random and is read only. In our previous post we defined the characteristics of this access pattern, we should expect lower bandwidths and higher latencies in this test. This is due to the drives having to physically move to service the random I/O requests. This test is similar to an index seek/point query in SQL Server  (SSDs will still exhibit slightly higher latencies on random access as discussed in the last post here).

  • Sequential

    diskspd.exe -d15 -o32 -t4 -b64K -h -si -L -w0 D:\TEST\iotest.dat

    This test is the same as above, but uses a sequential access pattern. With sequential I/O we should see higher bandwidths with lower latencies. This is due to the data being physically contiguous on the drive. This test is similar to an index scan/range query in SQL Server.

Impact of I/O sizes

For these tests we’ll explore two I/O sizes. We’ll simulate a log buffer flush using a small 60KB synchronous, a small, single threaded, sequential write. Then we will simulate a backup operation with a much larger, multithreaded, sequential write.

  • Tranaction log simulation  

    diskspd.exe -d15 -o1 -t1 -b60K -h -s -L -w100 D:\TEST\iotest.dat

    In this test we simulate the writing of full transaction log records. The test is configured for synchronous I/Os by setting the outstanding I/O and threads to 1. Each I/O is 60KB and writes sequentially to the data file. We’re really trying to measure latency in the I/O subsystem and determine if there are any potential bottlenecks. 

  • Backup operation simulation

    diskspd.exe -d15 -o32 -t4 -b512K -h -si -L -w0 D:\TEST\iotest.dat 

    In this test we simulate the writing of a backup file. The test is configured for asynchronous, parallel I/Os by setting the outstanding I/O parameter to 32 and threads to 4. Each I/O is 512KB and reads sequentially from the file. We’re really trying to tax the I/O subsystem and reach a saturation point so we can really determine how much data our disk subsystem can move for reads. 

In this post we showed you how use Diskspd to craft tests to measure bandwidth and latency, two key attributes of your disk subsystem. In our next post in this series run some tests that simulate SQL Server I/O access patterns and review output. 

Building Debug Symbols – Troubleshooting symbol building

Recently,I have been working with the Debugger Symbols for SQL Server to generate call stacks and learn more about the internals of SQL Server. 
I approached one of our clients about doing this on a non-production system and they thought it be great. They would get better insight on their workload, I would get access to a real workload. win..win, right? Even in their stage environment they have a pretty heavy workload so it would be a good candidate for generating call stack data.
When I sat down to build the debug symbols on their non-production system I could not get it to work for all the binaries that I needed for this test. Not having all the symbols means the function call name resolution will not work for the calls defined in the missing symbol files. Nobody likes spotty data and I wanted to find out why the symbol files did not build.
The system is a SQL Server Enterprise 2012 SP2 + the online index rebuild hotfix 2969896 (build 11.0.5522). So in my test lab I was able to build symbol files for 2012 SP2, then I installed the hotfix and was able to reproduce what I saw on the staging system…missing symbol files.
So I needed an explanation for this and a plausible one is, Microsoft is not providing data on the symbol servers for hotfixed build and likely not for other non-mainstream builds.
These are the binaries that fail once the online index hotfix is applied…likely the binaries affected by the hotfix right?.
SYMCHK: sqlaccess.dll        FAILED  – SqlAccess.pdb mismatched or not found
SYMCHK: sqldk.dll            FAILED  – sqldk.pdb mismatched or not found
SYMCHK: sqllang.dll          FAILED  – sqllang.pdb mismatched or not found
SYMCHK: sqlmin.dll           FAILED  – sqlmin.pdb mismatched or not found
SYMCHK: sqlos.dll            FAILED  – SQLOS.pdb mismatched or not found
SYMCHK: sqltses.dll          FAILED  – sqltses.pdb mismatched or not found
Please feel free to contact me with any questions regarding performance or other SQL Server related issues at: aen@centinosystems.com

Tracing call stacks in SQL Server – Installing the Debugging Tools for Windows (WinDbg) on Windows 8.1 and generating debug symbols for SQL Server binaries

Where to get the debugger tools

To generate the needed symbols you will need the “Windows Software Development Kit (SDK) for Windows 10” download here: https://dev.windows.com/en-us/downloads/windows-10-sdk and click “Download the standalone SDK”

  • I like to select the Download option and pick “Debugging Tools for Windows” this will download all of the installation files

How to install the debugger tools 

The installation of this software is very straight forward, Then find in the downloaded files \Windows Kits\10\StandaloneSDK\Installers\X64 Debuggers And Tools-x64_en-us.msi and install the debugger and install the tools with the default settings. 


Generating the symbol files for SQL Server

Once installed you will need to generate the symbols, below are two commands one for the SQL Server executable and one for all of the dlls that are in the SQL working directory. Paul Randal @PaulRandal recommends a tip in his blog for using *.dll to gather symbol data on all the dlls in the directory at the expense of time and a small amount of disk space.

  • Change into the BINN directory
    • cd C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Binn
  • “C:\Program Files\Windows Kits\10\Debuggers\x64\symchk.exe” sqlservr.exe /s SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
  • “C:\Program Files\Windows Kits\10\Debuggers\x64\symchk.exe” *.dll  /s SRV*c:\symbols*http://msdl.microsoft.com/download/symbols

Please feel free to contact me with any questions regarding performance or other SQL Server related issues at: aen@centinosystems.com

Tracing call stacks in SQL Server – Introduction

At this fall’s SQLIntersection conference in Las Vegas I attended, Paul Randal’s (t|b)“Performance Troubleshooting Using Latches and Waits” precon. Where at he asked for some assistance compiling data for a project he’s working on. The project that would require installing the “Debugging Tools for Windows” and generating debug symbols for the SQL binaries. I have always intended to work with the debug symbols to find the call stack traces and experiment with what SQL Server does during certain events, like creating a database, inserting a row and such. These are topics that interest me as a computer scientist and a SQL Server professional and also can help our clients in understand conditions when trying to get a handle on obscure performance anomalies.

In this multi-part blog series I will document the process of

  • Installing the Debugging Tools for Windows (WinDbg) on Windows 8.1 and generating debug symbols for SQL Server binaries
  • Implementing the debug symbols on a SQL Instance
  • Demonstrate how to build a call stack trace
  • Document the stack traces generated during interesting scenarios such as creating a database, adding a table, inserting a row and more

What is a symbol file?

A symbol file is used by a program, usually a debugger, at runtime to translate the address offsets of an machine binary to human readable names from the programming constructs such as function calls.

What is a call stack?

As each function is called, it’s address offset (i.e. program counter) is pushed onto the stack in a stack frame. The currently executing function is on the top of the stack. Subsequent function calls are pushed onto the top of the stack and the calling function is pushed down in the stack. This is the call stack. Each entry will have an address offset of the function call in the binary executable. In the execution of a program a function will call another function and so on. This call stack can hint at what is happening inside the execution of the program. 

What do we care about the call stacks when gathering wait and latch data for SQL Server?

This technique allows us identity code execution paths, waits, and latches from inside the SQL Server’s executing process. From this we can have greater insight as to under which conditions these events occur.

What does a call stack in SQL Server look like?


Next up in this series is “Installing the Debugging Tools for Windows (WinDbg) on Windows and generating debug symbols for SQL Server binaries

Please feel free to contact me with any questions regarding performance or other SQL Server related issues at: aen@centinosystems.com

References used for this project: