Category Archives: Performance

SQL Server on Linux – External Memory Pressure

In this blog post we’re going to explore how SQL Server on Linux responds to external memory pressure. On Windows based SQL Server systems we’ve become accustomed to the OS signaling to SQL Server that there’s a memory shortage. When signaled, SQL Server will kindly start shrinking it’s memory caches, including the buffer pool, to maintain overall system stability and usability. We’ll that story is a little different in SQL Server on Linux…let’s look and see how SQL Server on Linux responds to external memory pressure

The Scenario

Our SQL Server on Linux system is a CentOS 7.4 VM, running SQL Server 2017 (RTM). There is 12GB of physical memory and 80% of that is dedicated to SQL Server via mssql-conf about 9.2GB, the default configuration. I’ve also set Max Server memory to 8GB.

So let’s start up SQL Server on Linux and query the buffer pool cache sizes when there’s no data loaded. I’m going to look in two places. sys.dm_os_memory_clerks and also sys.dm_os_buffer_descriptors.  In Figures 1 and 2, we can get a feel for the memory distribution across the system and also inside the buffer pool.

Screen Shot 2017 10 19 at 2 01 04 PM

Figure 1 – Top memory consumers at system startup

Buffer Pool - Cold

 Figure 2 – Buffer Pool memory distribution at system startup

Now, I’m going to load a 8GB table into the buffer pool, maxing out SQL Server’s MAX memory instance setting. Now we’ll look at the allocations again. In Figures 3 and 4, we can see the buffer pool is the dominant consumer of memory and that our database is the dominant consumer in the buffer pool. So far so good, eh?

Memory Clerks - Warm

 Figure 3 – Top memory consumers after server reached maximum memory

Buffer Pool - Warm

 Figure 4 – Buffer Pool memory distribution after table loaded.

Getting Memory Information From Linux

We can use tools like ps, top and htop to look our are virtual and physical memory allocations. We can also look in the /proc virtual file system for our process and look at the status file. In here we’ll find the point in time status of a process, and most importantly the types of memory allocations for a process. We’ll get granular data on the virtual memory allocations and also the resident set size of the process. Here are the interesting values in the status file we’re going to focus on today.

  • VmSize – total current virtual address space of the process
  • VmRSS – total amount of physical memory currently allocated to the process
  • VmSwap – total amount of virtual memory currently paged out to the swap file (disk)

We can use the command pidof sqlservr to find the process ID for our SQL on Linux process, in our case it’s 14689. When running SQL Server on Linux there is always two processes for sqlservr, we’re going to look the one with the higher process ID.

Now to access the status information for our process we look in /proc/14689/status

Below is the output from our process’ status file, filtering for the values we’re interested in. You can see about 10.16GB of Virtual Memory (VmSize), 7.82GB of which is in actual RAM (VmRSS) and 622MB in the swap file (VmSwap). Nothing special to see here, just the baseline information for our process with a populated buffer pool.

VmSize: 10661380 kB

VmRSS:   8201240 kB

VmSwap:   636952 kB

Let’s Add Some Pressure

Using a small C program I wrote, I use calloc to allocate large sections of memory and then I have the program continually write data into the memory buffer to ensure those pages stay in physical memory. Using this program, let’s allocate 4GB of RAM. I’m choosing this value because it’s going to cause my system to exceed it’s physical memory but NOT exceed the amount virtual memory in my system when including the swap file’s size. Things can get really dicey if you exceed physical and virtual memory size, processes will start to get killed. Microsoft documents a case here.

Let’s look at the memory data from /proc/17141/status for my memory allocator program. In the output below you can see we have just about 4GB of virtual memory (VmSize) allocated and nearly the same amount of memory that’s resident in physical memory (VmRSS) and no data in the swap file.

VmSize:  4198476 kB

VmRSS:   4194484 kB

VmSwap:        0 kB


Now that 4GB program has caused Linux to need to make room in physical memory for this process. It does this by swapping least recently used pages from memory out to disk. So under memory pressure, let’s look at the SQL Server process’ memory allocations according to Linux. In the output below we see we still have a VmSize of around 10GB, but our VmRSS value has decreased dramatically. In fact, our VmRSS is now only 3.53GB. It decreased by 4.29GB!  VmSwap has increased to 5.04GB. Wow, that’s a huge portion of the process swapped to disk. 

VmSize: 10843320 kB

VmRSS:   3705476 kB

VmSwap:  5289752 kB

What Does SQL Server Think About All of This?

With our memory allocation program running and no workload running inside SQL Server, let’s look at sys.dm_os_memory_clerks and sys.dm_os_buffer_descriptors again. Something interesting happens, in Figures 5 and 6, when we look at sys.dm_os_memory_clerks and sys.dm_os_buffer_descriptors things changed but only slightly  But from the data above we do know that only 3.54GB of the process is actually in memory (VmRSS) at this point. Which means a large portion of SQL Server’s process is actually NOT in memory anymore…they’re on disk. The reality is, we simply do not know which portions of the process are in memory or on disk at this level.  You could use tools like pmap to determine which sections of memory are on disk, but we’ll leave that for another day. Try it out…it’s fun ;)

Swapped out? Why not?

Figure 5 – Top memory consumers after server reached maximum memory and with a significant portion of the process swapped to disk. Just about the same as it was without external memory pressure.

Buffer Pool - Warm but paged

Figure 6 – Buffer Pool memory distribution after table loaded and with a significant portion of the process swapped to disk. Just about the same as it was without external memory pressure.

Things aren’t what they seem

Looking at wait stats during query execution is going to be a little more interesting when SQL Server thinks it’s data pages are in memory, but a significant portion of them are actually on disk. In fact, when querying my test table with a simple SELECT COUNT(*) the wait types don’t indicate the fact that SQL Server is now actually disk bound due to the large portions of the buffer pool being paged out to disk. Under normal conditions, this query takes 11 seconds to run…when we’re paging to disk, it takes 2 minutes and 11 seconds to run. 

So let’s check out the wait types reported during our query output here in Figure 7. We see SOS_SCHEDULER_YIELD…which makes sense…long running query, that’s pretty slow and it’s quantum is being exhausted and the thread is being rescheduled. We do see some disk IO waits with PAGEIOLATCH_SH, but not much…what’s happening here is my data table is 8GB and my Max server memory is also 8GB so there’s a little overflow there and will read from disk for some of the data. From a performance standpoint, SQL Server has no idea that it’s data pages are likely swapped out to in this low memory situation. It thinks those pages are in the buffer pool…they certainly are…but a large chunk of the buffer pool is now actually swapped out to disk. 

Wait stats, who needs wait stats?

Figure 7 – Wait types reported during query execution with a significant portion of the process swapped to disk. Wait stats query from @brento 

What does this all mean?

Well, SQL Server is on a new operating system now, Linux. It’s upon us as DBAs to know how our system works as a whole and what to do when we’re in situations where performance falls off a cliff due to memory contention. This case is certainly a contrived case by putting so much external pressure on our process…but you, the DBA, need to know how SQL Server on Linux is going to react in these conditions and how to get your system back on track when things go south. Good monitoring and also good system resource management are needed. Certainly a properly size system in terms of physical memory and swap file size is needed. We toed the line here in our example ensuring we didn’t exceed physical memory plus swap file size. If you do exceed physical and swap size memory allocations will fail and your process will likely crash or be killed.

Speaking of monitoring, I’m speaking at PASS Summit this year on…you guessed “Monitoring Linux Performance for the SQL Server Admin” on Friday November 3th at 8am. If you want to learn more, I’ll see you there. 

Speaking at PASS Summit 2017

I’m very pleased to announce that I will be speaking at PASS Summit 2017!  This is my first time speaking at PASS Summit and I’m very excited to be doing so! What’s more, is I get to help blaze new ground on a emerging technology SQL Server on Linux! My session is Monitoring Linux Performance for the SQL Server Admin so if you’re a Windows or SQL Server administrator, this session is for you. We’ll look at some of the internals of SQL Server on Linux and dive into Linux OS internals and show you where to look inside Linux for most important performance data for your SQL Server. I hope to see you there!


Monitoring Linux Performance for the SQL Server Admin


So you’re a SQL Server administrator and you just installed SQL Server on Linux. It’s a whole new world. Don’t fear, it’s just an operating system. It has all the same components Windows has and in this session we’ll show you that. We will look at the Linux operating system architecture and show you where to look for the performance data you’re used to! Further we’ll dive into SQLPAL and how it architecture and internals enables high performance for your SQL Server. By the end of this session you’ll be ready to go back to the office and have a solid understanding of performance monitoring Linux systems and SQL on Linux. We’ll look at the core system components of CPU, Disk, Memory and Networking monitoring techniques for each and look some of the new tools available including new DMVs and DBFS.


PASS Summit 2017

dbfs – command line access to SQL Server DMVs

With SQL Server on Linux, Microsoft has recognized that they’re opening up their products to a new set of users. People that aren’t used to Windows and it’s tools. In the Linux world we have a set of tools that work with our system performance data and present that to us as text. Specifically, the placeholder for nearly all of the Linux kernel’s performance and configuration data is the /proc virtual file system, procfs. Inside here you can find everything you need that represents the running state of your system. Processes, memory utilization, and disk performance data all of this is presented as files inside of directories inside /proc.

Now, let’s take this idea and extend it to SQL Server. In SQL Server we have DMVs, dynamic management views. These represent to current running state of our SQL Server. SQL Server exposes the data in DMVs as table data that we can query using T-SQL. 

So, Microsoft saw the need to bring these two things together, we can expose the internals of SQL Server and its DMVs to the command line via a virtual file system. And that’s exactly what dbfs does, it exposes all of SQL Server’s DMVs as text files in a directory. When you access one of the text files…you’ll execute query against the SQL Server and the query output comes back to you via standard output to you Linux console. From there you can use any of your Linux command line fu…and do what you want with the data returned. 

Setting up dbfs

So first, let’s go ahead and set this up. I already have the Microsoft SQL Server repo configured so I can install via yum. If you have SQL on Linux installed, you likely already have this repo too. If not, go ahead and follow the repo setup instructions here. To install dbfs we use yum on RHEL based distributions.

First off, think about what’s going on under the hood here…we’re going to allow the system to execute queries against DMVs…so let’s try to keep this as secure as possible, I’m going to create a user that is allowed to only query DMVs with the VIEW SERVER STATE permission. So let’s do that…
Let’s log into our SQL Server via SQLCMD
And execute this code to create a user named dbfs_user 

Once created, let’s assign this user permissions to query DMVs
The next step is we need to create a directory where dbfs will place all the files representing the DMVs we wish to query
Now, let’s go ahead and configure dbfs. I’m going to place it’s configuration file in /etc/ since that’s the standard location for configuration files on Linux systems.
And inside that file, let’s use the following configuration. Pretty straight forward. Define a configuration name, here you see server1, the hostname which is the locally installed SQL instance. We’ll use the username and password of the user we just created and also defined is a version. While this isn’t very well documented, the code here shows that if you’re on version 16 (SQL Server 2016) or newer it will create files dbfs files with a .json file extension which exposes your DMV data as…you guessed it JSON. Also if you want to add a second server to dbfs, just repeat the configuration inside the same text file.

Running dbfs

Now with all the preliminaries out of the way, let’s launch dbfs. Basic syntax here, the actual program name with the parameter -c pointing to the configuration file we just created and the -m parameter pointing to the directory we want to “mount” our DMVs into.
Now, what’s interesting about dbfs is if you log out dbfs stays running. Honestly, I don’t like that, if this is the case it should be running as a service managed by systemd or whatever init daemon you’re using on your Linux distribution. I mentioned that on their GitHub repo. If this is going to be a user process, then I should have the choice the background the task myself.

Using dbfs

Looking at the source for dbfs it gets a list of all DMVs from sys.system_views from the SQL Server you configured it to connect to, then creates a file for each and every one of those DMVs. So we have full coverage of all the DMVs available to us and since you can use any bash command line fu to access the data now…the options are really limitless. Microsoft has a few good demos on the GitHub repo here. Let’s walk through a few examples now.
Accessing a DMV

This is pretty straight forward, you read from the file just like you would read from any other file on a Linux system. So let’s do that…we add the column -t option to make sure all the columns are aligned in the output.

And our output looks like this…

Notice in the output above how the connect_time column is split incorrectly? We need to tell column to use the tab as a delimiter. By default it uses whitespaces. So let’s do that…
And now our output looks much better

Selecting off a subset of columns

Well you probably noticed that the output is a bit unruly since it’s outputting all of the DMV’s columns. So let’s tame that a bit and pull out particular columns. To do that we’ll use a tool called awk which will print out columns based on the numeric index, so $1 is the first column and so on. 
And our output looks like this
Something isn’t right…as DBAs we think of things in rows and columns. So we’re going to count across the top and think the 7th column is going to yield the 7th column and it’s data for each row, right? Well, it will but data processed by awk is whitespace delimited by default and is processed row by row. So the 7th column in the second line isn’t the same as the output in the first line. This can be really frustrating if your row data has spaces in it…like you know…dates.
So let’s fix that…the output from the DMVs via dbfs is tab delimited. We can define our delimiter for awk with -F which will allow for whitespaces in our data. Breaking the data only on the tabs. Let’s hope there isn’t any tabs in our data!
And the output from that looks like this, much better but we don’t have the nice columns.
We’re so close, we can’t throw column on the end to make this nice and columnar because awk with this configuration it will remove the tab delimiters on it’s output stream. column by default will do the same thing too, but we can let column do the work for us and have it print tab delimiters in it’s output stream. 
And voila, we end up with some nice neatly formatted output

Searching in Text

We can search for text in the output using grep, here’s a quick example looking for the dedicated admin connection in dm_os_schedulers
And here’s the output. 

SQL folks…keep in mind, grep will only output lines matched, so we loose the column headers here since they’re part of the standard output stream when accessing the file/DMV data.

Moving forward with dbfs

We need the ability to execute more complex queries from the command line. Vin Yu mentions this here. As DBAs we already have our scripts that we use day to day to help us access, and more importantly make sense of, the data in the DMVs. So dbfs should allow us to execute those scripts somehow. I’m thinking we can have it read a folder on the local Linux system at runtime, create files for those scripts and throw them in the mounted directory and allow them to be accesses like any of the other DMVs. The other option is we place those scripts as views on the server and access them via dbfs. Pros and cons either way. Since it’s open source…I’m thinking about implementing this myself :)

Next is, somehow we need the ability to maintain column context throughout the output stream, for DBAs it’s going to be tough sell having to deal with that. I know JSON is available, but we’re talking about DBAs and sysadmins here as a target audience. 

In closing is a great step forward…giving access into the DMVs from the command line opens up SQL Server to a set of people who are used to accessing performance data this way. Bravo! 

Using Extended Events to Visualize Availability Group Replication Internals

SQL 2014 Service Pack 2 was recently released by Microsoft and there is a ton of great new features and enhancements in this release.This isn’t just a collection of bug fixes…there’s some serious value in this Service Pack. Check out the full list here. One of the key things added in this Service Pack is an enhancement of the Extended Events for AlwaysOn Availability Group replication.

Why are the new Availability Group Extended Event interesting?

If you’ve used Availability Groups in production systems with high transaction volumes you know that replication latency can impact your availability. If you want to brush up on that check out our blog posts on AG Replication Latency, Monitoring for replication latency, and issues with the DMVs when monitoring. These new extended events add insight at nearly every point inside your Availability Group’s replication. More importantly they also include duration. So using these Extended Events we can pinpoint latency inside our Availability Group replication.

Here’s the list and description of the new Extended Events:

  • hadr_log_block_group_commit – Log block group commit processing duration
  • log_block_pushed_to_logpool – Log block start push to log pool
  • log_flush_start – Asynchronous log write start
  • hadr_log_block_compression – Log block compression processing duration
  • hadr_capture_log_block – Primary has captured a log block
  • hadr_capture_filestream_wait
  • ucs_connection_send_msg – UCS transport connection sent message
  • hadr_log_block_send_complete – After a log block message has been sent. This event is only used for fail points
  • log_flush_complete – Reports that synchronous send is complete
  • hadr_receive_harden_lsn_message – When a new hardened LSN from the secondary
  • hadr_db_commit_mgr_harden – Transaction commit harden result from Hadron Commit management
  • hadr_transport_receive_log_block_message – Fires when we’re receiving new log block message
  • hadr_log_block_decompression – Log block decompression processing duration
  • hadr_apply_log_block – Secondary is going to append a log block to the log
  • hadr_send_harden_lsn_message – Crafting a message to send containing a new hardened LSN on a secondary.  Test only
  • hadr_lsn_send_complete – After an LSN message has been sent.
The source for this list and it’s descriptions is Amit Banerjee’s HADR Virtual Chapter Talk and KB3173156

Using Extended Events to visualize Availability Group Replication Internals

Using these Extended Events, we’re able to get some insight into the replication internals of our Availability Groups. Like I described a second ago, we can use these to measure the duration of each event and sift out performance issues in our replication. But we can also use them to get a better understanding of what goes into Availability Group replication, let’s see how.
Here’s the steps I used to generate the chart below:
  1. Let’s add an Event Session with all of these new Events above to the primary and secondary replicas of a two replica synchronous Availability Group
  2. Then with the sessions running, insert only one row on the primary replica. This will cause the AG to replicate the data change event and trigger the new Extended Events on both the primary and secondary replicas
  3. Stop the Event Sessions
  4. Open the Event file in SSMS and View Target Data on both the primary and secondary replicas
  5. Order the Event data by timestamp
  6. Merge the event data together based on the timestamp (I did this in a spreadsheet)
With this list we have the entire replication sequence of Events for that single insert replicating data from the primary to the secondary from the perspective of both the primary and secondary. I took that information and made the following graphic for you visualizing what it takes to replicate data in a synchronous Availability Group.

AG Synchrnous Replication Internals

 Figure 1: Availability Group Replication Extended Events

From Figure 1, let’s walk through the events…

On the primary

  • The first event that fires is hadr_log_block_group_commit starting things off. It’s my understanding that this initializes the needed code and program state for AG replication
  • Then up next is log_block_pushed_to_logpool. This fires when a log block is copied to the log pool. The log pool is a special memory location used to store log blocks that may need to be read again, since it’s in memory it keeps the readers from doing a physical IO on the transaction log
  • Then log_flush_start, this fires when log blocks are flushed to disk, normal write ahead logging
  • Once the log block has been copied into the log pool and flushed to disk, it’s read from the log pool and compressed then the hadr_log_block_compression event fires
  • After it’s compressed the AG “captures” the log block for replication to the secondary with hadr_capture_log_block 
  • Then builds and sends the message to the secondary and ucs_connection_send_msg fires
  • Once the message is sent the hadr_log_block_send_complete event fires
  • We mark the synchronous send complete with log_flush_complete

On the secondary

  • On the other side of the fence, basically the inverse occurs. We mark the receipt of the message with hadr_transport_receive_log_block_message
  • The log block is decompressed and hadr_log_block_decompression fires
  • Then the log block is appended to the local log buffer and hadr_apply_log_block fires
  • Then the log block is copied to the local log pool and hadr_pushed_to_logpool fires
  • Next, the asynchronous log write happens and log_flush_start marks that start of that
  • And once the asynchronous log write finishes, log_flush_complete fires
  • Once hardened locally we need to craft a message to send to the primary marking the the LSN is hardened and hadr_send_harden_lsn_message fires
  • ucs_connection_send_msg then transmits fires when the message is sent to the primary
  • hadr_lsn_send_complete marks the fires marking the end of sending the message

Back on the primary

  • hadr_receive_harden_lsn_message fires marking that the secondary has acknowledging receipt of the send LSN
  • And to wrap it all up hadr_db_commit_mgr_harden marks the end of the log block replication for this sequence

A few important notes…

  • Many of these Extended Events fire at different points and multiple times during replication under different conditions. There’s a field called mode, which is an integer value, that marks various conditions for the events. I have yet to decode what each value of mode is for all of the events but will update this blog as I get some insight.
  • It’s very important to note that this example highlights an insert into an Availability Group with a primary and only one secondary configured in synchronous Availability Mode. The code path and the subsequent Extended Events that fire can change based on Availability Mode (sync/async), replication health, and many other conditions that can impact replication.

I hope that I’ve shown you some interesting concepts that you can use to help troubleshoot your Availability Group replication using Extended Events. This post, highlights how you can use Extended Events to get some insight into the the internals of Availability Group replication.

If you have any questions about this or your Availability Group configuration or performance please feel free to email me at

Please follow me on Twitter @nocentino to keep up with new blog posts


Great references on the internals of the log pool –

HADR Virtual Chapter –

SQL Server, Persistent Memory on NVDIMMs and DAX

Paradigm Shift!

What do I mean by that? Every once in a while a technology comes along and changes the way things are done, moves the bar…well last week Microsoft released a Channel 9 video on persistent memory using NVDIMMs and DAX on Windows 2016…then combining it with SQL Server! This is one of those technologies that moves the bar! Check it out here.

Why is this important?

Relational databases like SQL Server use a transaction log to ensure the durability of the transactional operations to the database. This is so it can ensure its data is consistent in the event of a system failure. SQL Server uses a logging protocol called write ahead logging (WAL). This means that the data change operations must be written to stable, persistent storage before the process can change data in the database file…so this means our database throughput is at the mercy of the performance of the device the changes (log records) are written to.

Another facet of this is disk latency. Historically this has been terribly inefficient, especially in the case of spinning HDDs due to seek time and write latency. Disk latency has gotten much better recently with SSDs and NVMe. To combat disk latency, SQL Server doesn’t perform a disk write for each and every log record. SQL Server writes log records to what’s called a log buffer in main memory, once the log buffer is full it or a transaction commits (or aborts) it flushes that buffer to disk in sector aligned IO operations. This means your transactions have to wait for the disk IO to complete before they’re considered hardened to disk. This is a core bottleneck in RDBMS systems.

What changed?

Well, in NVDIMMs we have a DIMM, which is normally used for main memory, is now backed with a battery backed flash array. This means we can have access to stable storage at main memory speeds, think less than 10 microseconds, not milliseconds. An order of magnitude performance gain! NVDIMMs have two modes of operation, storage mode and memory mode. Storage mode will present the device as a block device to your OS. Memory mode will present this as a memory mapped device. Windows Server 2016 has the drivers to deal with both modes. In the first case you’ll get a block device and all the code that goes along with making a block IO request, the advantage of this mode is no application code changes, it’s just a block device like any other. In the second mode, memory mode, it presents byte addressable memory to the OS as storage, but you may need to change code to take full advantage of this. So, Microsoft has developed a feature called DirectAcces (DAX) in Windows Server 2016 to present memory mode storage to the OS bypassing traditional block IO code. This is where the insanity ensues! You’ll literally have stable storage available to your applications at main memory speeds, committing to stable storage is now just a memory copy!!! Watch the video for performance numbers!

How this will impact you?

Well, in the video Lindsey Allen highlights a code change in SQL Server that modifies how the log buffer writes data. Log writes are immediately considered hardened when committed to the log buffer. We no longer have to wait for the disk IO to complete, side stepping disk latency and IO altogether. This leads to a 2x improvement in the demo application in the video. These tests compare NVMe and DAX, and NVMe is PCI-express attached flash which is as good as it gets for persistent storage today (well until memory mode NVDIMMS take over). You’ll notice that CPU spikes during their load tests, this isn’t an uncommon issue when you remove your storage bottlenecks from your system.

What’s it going to cost?

Looking online, HPE has a 8GB NVDIMM available for $800 at online wholesalers, this is in DIMM form factor. Some are built directly into the motherboard, others are on cards.

Questions that I have?

  1. Foreign Memory Access – In a NUMA system is there be a performance penalty or inconsistent performance when threads need to read or write to a foreign memory area. Log writing is single threaded, so this should be something that we can manage. It does go multithreaded in 2016, so the best thing to do here is test your workload and measure.
  2. Hypervisor Support – can we leverage this awesomeness in a virtual machine?
  3. Error Detection – how to errors percolate back up into the OS/SQL so we know when something goes wrong. Likely using standard disk IO alerts will apply here.

Well hopefully I’ve convinced you how big of a deal this is and you’re as excited as I am!

Please let reach out to me if you like to know more about this –

CPU Scheduling Basics – Windows and SQL Server

In this post we’re going to introduce the basics of CPU scheduling.

In a computer system, only one thing can happen at a time. More specifically, only one task can be on a processor at a point in time. This can expand to several tasks if the system has multiple processors or a processor with multiple cores, which most modern systems have. For example, a four core system can potentially execute four tasks concurrently. 

So since only one task can be on a logical processor (or core) at one point in time, something has to manage which task gets access to the CPU. This is what the scheduler does. CPU scheduling is a very (VERY) deep topic and I’m going to introduce you to a couple basic concepts that can help you understand how scheduling works in Windows and SQL Server. So when you see things slow down in either of one of these technologies you know where you can start to look first.

If a task isn’t on a CPU, it’s not doing any processing. It’s simply waiting. Each CPU has its own task scheduler and the task scheduler is the thing that goes and assigns work to the logical CPU (or core).

Let’s check out a few examples of schedulers…

First up is Windows, it uses a preemptive, priority scheduler. This means processes with higher priority are allowed to access the CPU first, but also will stop a running process to schedule a higher priority process. Each process gets fixed amount of time on the processor, called a quantum, then Windows moves that process off the CPU and places into a queue waiting for access to the CPU again, so no one process dominates CPU time. 

SQL Server uses a non-preemptive, co-operative scheduler. Which also uses the quantum concept, but the task voluntarily yields and is switched off, placed into a queue waiting for access to the CPU again. Again, this ensures that no one task dominates the CPU, but here moving off the CPU is left up to the thread rather than the scheduler. In SQL Server a task is your work, that query you want to execute. A worker is the execution context of your work. A task is assigned to a worker, which is assigned to a scheduler for execution on the CPU. A scheduler is the manager of what gets on the CPU, there is one per logical processor.

Both Windows and SQL Server schedulers have a concept of process state and there are a couple of deeper nuances for each but here I’m going to generalize them into three states:

  • Running – a task executing on the CPU, an actively running process.
  • Waiting – a task is waiting on something before it can get back onto the CPU. For example, a process could be waiting on an I/O operating such as a read from a disk or network socket.
  • Runnable – a task waiting to get onto the CPU, nothing is preventing it from getting onto the CPU (no waiting)

You often hear CPU measured in percent CPU up to 100. But measuring CPU in terms of percent used is only part of the picture. I like to measure how long processes and tasks are waiting to get access to the CPU. Because what really we’re interested in, is how fast our work is progressing and it can only progress when it’s on a CPU doing work.

So here are a few basic ways to measure CPU pressure in terms of what’s waiting for access to a CPU

  • In Windows there’s a perfmon counter called processor queue length, anything in the queue is waiting to get onto the CPU. Each processor has its own queue. Check out a deeper dive here.
  • In SQL Server, we measure the number of tasks that are on the scheduler’s runnable queue, the lower the better. You can find that in the sys.dm_os_schedulers DMV in a column named runnable_task_count. Each CPU will have scheduler, and each will have a runnable task count. The query below will show you the runnable tasks, current tasks, and tasks waiting for a worker per scheduler. One row will be returned per scheduler.

I hope this helps you get a basic understanding of CPU scheduling in Windows and SQL Server and starts you on a journey of learning more about how it works!

Feel free to contact me to discuss more!

Did you like this post? If so, click here for to sign up for our newsletter to get more SQL goodness delivered right to your inbox!

Twitter @nocentino

Load Testing Your Storage Subsystem with Diskspd – Part III

In our final post in our “Load Testing Your Storage Subsystem with Diskspd” series, we’re going to look at output from Diskspd and run some tests and interpret results. In our first post we showed how performance can vary based on access pattern and IO size. In our second post we showed how to design a test to highlight those performance characteristics and in this post we’ll execute those tests and review the results. 

First let’s walk through the output from Diskspd, for now don’t focus on the actual results. There are four major sections:

  • Test Parameters – here is the test’s parameters. Including the exact command line parameters executed. This is great for reproducing tests.
  • CPU Usage – CPU usage for the test, recall if you are not using all your bandwidth, you may want to add threads. If your CPU burn is high, you may want to back off on the number of threads.
  • Performance – this is the meat of the test. Here we see bandwidth measured in MB/sec and latency measured in microseconds. With SSDs and today’s super fast storage I/O subsystems, you’ll likely need this level of accuracy. This is alone beats SQLIO in my opinion. I’m not much a fan of IOPs since those numbers require that you know the size of the IO for it to have any meaning. Check out Jeremiah Peschka’s article on this here. Remember, focus on minimizing latency and maximizing I/O please refer back Part I and Part II posts in this series for details.
  • Histogram – this gives a great representation of how your test did over the whole run. In this example, 99% of the time our latency was less than 0.654ms…that’s pretty super.

Impact of I/O Access Patterns

  • Random

    diskspd.exe -d15 -o32 -t2 -b64K -h -r -L -w0 C:\TEST\iotest.dat

    In this test you can see the that there is high throughput and very low latency. This disk is a PCIe attached SSD, so it performs well with a random IO access pattern.

  • Sequential

    diskspd.exe -d15 -o32 -t2 -b64K -h -s -L -w0 C:\TEST\iotest.dat

    In this test you can see that the sequential I/O pattern yields a similar performance profile to the random IO test on the SSD. Recall that an SSD does not have to move a disk head or rotate a platter. The access latency to any location on the drive has the same latency cost.  

Impact of I/O sizes

  • Tranaction log simulation  

    diskspd.exe -d15 -o1 -t1 -b60K -h -s -L -w100 C:\TEST\iotest.dat

    This test measures access latency of single thread with a very small data transfer, as you can see latency is very low at 0.289. This is expected on a low latency device such as a local attached SSD.

  • Backup operation simulation

    diskspd.exe -d15 -o32 -t4 -b512K -h -s -L -w0 C:\TEST\iotest.dat

    And finally, our test simulating reading data for a backup. The larger I/Os have a higher latency but also yield a higher transfer rate at 2,225MB/sec.

In this series of post we introduced you into some theory on how drives access data, we presented tests on how to explore the performance profile of your disk subsystem and reviewed Diskspd output for those tests. This should give you the tools and ideas you need to load test your disk subsystem and ensure your SQL Servers will perform well when you put them into production!

Load Testing Your Storage Subsystem with Diskspd – Part II

In this post we’re going discuss how to implement load testing of your storage subsystem with DiskSpd. We’re going to craft tests to measure bandwidth and latency for specific access patterns and IO sizes. In the last post “Load Testing Your Storage Subsystem with Diskspd”  we looked closely at access patterns and I/O size and discussed the impact each has on key performance attributes. 

Diskspd command options

Let’s start with some common command options, don’t get caught up on the syntax. Diskspd’s documentation is fantastic. It’s included with the program download here. Here I’m going to tell you why I set these settings this way, so you can adjust them as needed for your environments.

  • Duration ( -d<seconds> ) – this is the runtime of the test, the longer the better. The longer your test the more likely you’ll smooth out any performance anomalies, such as competing for shared resources. A longer test will likely invalidate any caches along the I/O path that may cache data and skew your results. We’re trying to measure the I/O capacity of the whole pipeline…not any caches.
  • Threads ( -T<count> or -f<count> ) – if -T is used this is the number of threads per file. If -f is used this is the number of threads for the whole test. For smaller systems I usually set number of threads to the number of cores. On larger systems I start at 16 and change it up or down based on results. For systems with very fast I/O paths you may need to add additional I/O threads to max out the throughput.
  • Outstanding I/Os ( -o<count>) – the number of I/Os ready to be dispatched per thread. Your storage subsystem may be fantastic, so when testing for throughput you may need to stack on more I/Os to increase the pressure. I usually start with this equal to the number of spindles in my LUN. Then I increase outstanding I/Os until I start to see latency increase. Once you see that, congrats you just saturated your I/O subsystem! Use this in conjunction with threads when trying to saturate an I/O path. If latency is already at unacceptable levels, reduce outstanding I/Os…but you’ll likely start to see a reduction in throughput. Try to find the sweet spot between minimum latency and maximum bandwidth. If each match the physical attributes of your disk subsystem you’re heading in the right direction. If outstanding I/Os is set to 1, the I/O is synchronous, all other values are asynchronous.…more on this later. 
  • File Size ( -c<size> ) – the file size, I like to have this be larger than the largest cache in the I/O pipeline. This includes your HBA, SAN controller…anything along the way between the running process and the disk. 
  • Block Size ( -b<size>[K|M|G|b] ) – the size of the IO, this is what we’ll change this to match varying I/O patterns in SQL Server.
  • Disable hardware write and software caching ( -h ) – we want to disable software (file system) caching and request disabling hardware caching. Disabling hardware caching is only a request of the storage hardware and that’s one of the reasons why want to ensure the file size we use is larger than the largest cache in our I/O path. This is enabled on all of the tests in this post. Further, for durability reasons most major relational database systems, SQL Server included, do not use the file system cache. They rely on their own caching mechanisms. 
  • IO Pattern – discussed in detail in our previous post here
    • Random I/O ( -r 
    • Sequential I/O ( -s ) – if using multiple threads, use -si this will coordinate the threads’ access into the file ensuring a sequential access pattern.
  • Write Percentage-w<percentage> ) – 0 is all reads, 100 is all writes.You can choose any value between, but I like to isolate read and write tests for analysis. 
  • Measure latency statistics ( -L ) – the whole reason we’re doing this is to understand our performance, go ahead and turn this on.

Impact of I/O Access Patterns

Here are some example Diskspd tests that implement sequential and random access patterns. These tests simulate index seeks/point queries and index scans/range queries.

  • Random

    diskspd.exe -d15 -o32 -t4 -b64K -h -r -L -w0 D:\TEST\iotest.dat

    This test will run for 15 seconds, with 32 outstanding IOs, using 4 threads, with 64k IOs. The hardware and software caches are disabled, access pattern is random and is read only. In our previous post we defined the characteristics of this access pattern, we should expect lower bandwidths and higher latencies in this test. This is due to the drives having to physically move to service the random I/O requests. This test is similar to an index seek/point query in SQL Server  (SSDs will still exhibit slightly higher latencies on random access as discussed in the last post here).

  • Sequential

    diskspd.exe -d15 -o32 -t4 -b64K -h -si -L -w0 D:\TEST\iotest.dat

    This test is the same as above, but uses a sequential access pattern. With sequential I/O we should see higher bandwidths with lower latencies. This is due to the data being physically contiguous on the drive. This test is similar to an index scan/range query in SQL Server.

Impact of I/O sizes

For these tests we’ll explore two I/O sizes. We’ll simulate a log buffer flush using a small 60KB synchronous, a small, single threaded, sequential write. Then we will simulate a backup operation with a much larger, multithreaded, sequential write.

  • Tranaction log simulation  

    diskspd.exe -d15 -o1 -t1 -b60K -h -s -L -w100 D:\TEST\iotest.dat

    In this test we simulate the writing of full transaction log records. The test is configured for synchronous I/Os by setting the outstanding I/O and threads to 1. Each I/O is 60KB and writes sequentially to the data file. We’re really trying to measure latency in the I/O subsystem and determine if there are any potential bottlenecks. 

  • Backup operation simulation

    diskspd.exe -d15 -o32 -t4 -b512K -h -si -L -w0 D:\TEST\iotest.dat 

    In this test we simulate the writing of a backup file. The test is configured for asynchronous, parallel I/Os by setting the outstanding I/O parameter to 32 and threads to 4. Each I/O is 512KB and reads sequentially from the file. We’re really trying to tax the I/O subsystem and reach a saturation point so we can really determine how much data our disk subsystem can move for reads. 

In this post we showed you how use Diskspd to craft tests to measure bandwidth and latency, two key attributes of your disk subsystem. In our next post in this series run some tests that simulate SQL Server I/O access patterns and review output. 

Load Testing Your Storage Subsystem with Diskspd

One of the primary activities I do before bringing SQL Server into production is load testing the storage subsystem. On a new system this is critical because I want to ensure that we’re “getting what we’ve paid for” when it comes to the disk subsystem. All too often there’s a configuration issue, component mismatch, a fundamental misunderstanding of the technology or worse an insufficient disk subsystem…these all can lead to poor disk performance. Even if it’s the simplest test, its imperative to measure performance as it’s significantly harder to make changes to a SQL Server once a database is in production. So do your testing. This is especially an important topic if your disks are not direct attached or in a shared storage environment such as a SAN or VMware data store. Storage networks, controllers, shelves…it gets complicated fast!

In this article we’re going to discuss what we’re looking for when load testing your storage subsystem and introduce DiskSpd a tool for performance testing disks using varying IO patterns and sizes.

Key measurements

The two metrics of key interest to us in a disk subsystem are bandwidth and access latency. Bandwidth, often referred to as data transfer rate, is how much data can be moved in a time interval, think gigabytes per second. Access latency or access time is how long a disk transaction takes from request to the delivery of the requested data. Latency is measured in milliseconds for HDD and microseconds for solid state drives. I don’t get hung up on IOPs as much, as they’re really a function of latency. Keep your latency low and your IOPs will likely be high.

Access Patterns

The access pattern in which data is read from a drive can have significant implications on the bandwidth and latency of the request. There are two access patterns that we’re concerned with when load testing a system, sequential and random. Each with its own performance characteristics.

Sequential access is when an application requests a block of data and the next block requested is physically adjacent on the disk. On a hard disk drive (HDD) the drive’s head does not have to move and the disk platter simply rotates under the head to read the next block off the disk. On a solid state disk (SSD) this becomes less of a concern as SSDs access latency is constant across the drive. In SQL Server, sequential IO is analogous to table/index scans, translation log writes and database backups. When load testing a system we use sequential IO patterns to saturate the disk subsystems IO path and determine if there any physical bottlenecks limiting bandwidth between the application driving the IO and the device serving up the data…the disk drive, the storage network (interconnect) or the SAN.  

Hard Disk Drive

Figure 1: Hard Disk Drive – Image from Wikipedia

Random access is when an application requests a block of data and the next block requested may not physically adjacent on the disk. On an HDD the drive’s head may have to move and the disk’s platter rotate to read the next block off the disk. This all contributes to our access latency. Drive vendors refer to this as seek time in drive specifications. A solid state disk is composed of an array of flash memory chips, each of which have fixed time access latency. So on an SSD if an application requests a block, the request for the next block can be serviced in a fixed amount of time, regardless of its location on the disk (there is a small amount of latency when switching between flash packages). This means random IO patterns can be serviced more efficiently as we do not have to wait for a physically moving component to access the next block…and that’s the game changer for SSDs when compared with HDDs.

Further the access latency of an IO on an SSD is measured in microseconds, this is an order of magnitude faster than an HDD which is in milliseconds. In SQL Server, random IO patterns can occur on index seeks, data file writes and operations that read from the transaction log. When load testing a system we use random IO patterns to find the overall access latency to the disk subsystem and determine if there are any components in the system that are not servicing the requests “fast enough” or in other words contributing to access latency.

Variable IO Sizes

An application can request data in variable IO sizes. For example, SQL Server can perform IOs in 8KB, 64KB, 128KB, 256KB and more. The size of the IO impacts both latency and bandwidth. A small IO can have a lower access latency as it is measured from the request of the IO until the delivery of all of the data requested. A small IO potentially can consume less bandwidth as well, as each IO translates into physical disk accesses, each of which will have some access latency. So of key importance when measuring smaller IO sizes is a low access latency, if the disk subsystem can service the IO quickly enough then we can see higher bandwidths on smaller operations. But we’re really shooting for getting the IO completed as quickly as possible.

A larger IO request can have a higher access latency since the measurement is from the beginning of the request until it is finished. Simply put a larger IO will take longer to transfer because the IO is moving more data in one operation. The IO is still has to pay the cost of the initial access latency, but usually the dominant factor in the duration of the IO is the transfer. So of key importance when measuring larger IO sizes is higher bandwidth, but still keep an eye on latency.


Enter Diskspd, a tool from Microsoft which allows us to performance test a disk subsystem. Diskspd allows us to define specific IO patterns, IO sizes, file sizes, read or write access, number of threads, access stride, and many more options. One key feature of Diskspd is it reports access latency time in microseconds, which is becoming more important as SSDs are becoming more common in the enterprise. SQLIO has served us well for years, but Diskspd is really the next generation of testing.

In our next post we’ll discuss how to performance test a disk subsystem using Diskspd and look at some of the key values of bandwidth, latency, and likely IOPs too.