Category Archives: Performance

Memory Settings for Running SQL Server in Kubernetes

People often ask me what’s the number one thing to look out for when running SQL Server on Kubernetes…the answer is memory settings. In this post, we’re going to dig into why you need to configure resource limits in your SQL Server’s Pod Spec when running SQL Server workloads in Kubernetes. I’m running these demos in Azure Kubernetes Service (AKS), but these concepts apply to any SQL Server environment running in Kubernetes. 

Let’s deploy SQL Server in a Pod without any resource limits.  In the yaml below, we’re using a Deployment to run one SQL Server Pod with a PersistentVolumeClaim for our instance directory and also frontending the Pod with a Service for access. 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql-deployment-2017
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
        app: mssql-2017
  template:
    metadata:
      labels:
        app: mssql-2017
    spec:
      hostname: sql3
      containers:
      - name: mssql
        image: 'mcr.microsoft.com/mssql/server:2017-CU16-ubuntu'
        ports:
        - containerPort: 1433
        env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mssql
              key: SA_PASSWORD
        volumeMounts:
        - name: mssqldb
          mountPath: /var/opt/mssql
      volumes:
      - name: mssqldb
        persistentVolumeClaim:
          claimName: pvc-sql-2017
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-sql-2017
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: managed-premium
---
apiVersion: v1
kind: Service
metadata:
  name: mssql-svc-2017
spec:
  selector:
    app: mssql-2017
  ports:
    - protocol: TCP
      port: 1433
      targetPort: 1433
  type: LoadBalancer

Running a Workload Against our Pod…then BOOM!

With that Pod deployed, I loaded up a HammerDB TPC-C test with about 10GB of data and drove a workload against our SQL Server. Then while monitoring the workload…boom HammerDB throws connection errors and crashes. Let’s look at why.

First thing’s first, let’s check the Pods status with kubectl get pods. We’ll that’s interesting I have 13 Pods. 1 has a Status of Running and the remainder have are Evicted

kubectl get pods
NAME                                     READY   STATUS    RESTARTS   AGE
mssql-deployment-2017-8698fb8bf5-2pw2z   0/1     Evicted   0          8m24s
mssql-deployment-2017-8698fb8bf5-4bn6c   0/1     Evicted   0          8m23s
mssql-deployment-2017-8698fb8bf5-4pw7d   0/1     Evicted   0          8m25s
mssql-deployment-2017-8698fb8bf5-54k6k   0/1     Evicted   0          8m27s
mssql-deployment-2017-8698fb8bf5-96lzf   0/1     Evicted   0          8m26s
mssql-deployment-2017-8698fb8bf5-clrbx   0/1     Evicted   0          8m27s
mssql-deployment-2017-8698fb8bf5-cp6ml   0/1     Evicted   0          8m27s
mssql-deployment-2017-8698fb8bf5-ln8zt   0/1     Evicted   0          8m27s
mssql-deployment-2017-8698fb8bf5-nmq65   0/1     Evicted   0          8m21s
mssql-deployment-2017-8698fb8bf5-p2mvm   0/1     Evicted   0          25h
mssql-deployment-2017-8698fb8bf5-stzfw   0/1     Evicted   0          8m23s
mssql-deployment-2017-8698fb8bf5-td24w   1/1     Running   0          8m20s
mssql-deployment-2017-8698fb8bf5-wpgcx   0/1     Evicted   0          8m22s

What Just Happened?

Let’s keep digging and look at kubectl get events to see if that can help us sort out what’s happening…reading through these events a lot is going on. Let’s start at the top, we can see that our original Pod mssql-deployment-2017-8698fb8bf5-p2mvm is Killed and the line below that tells us why. The Node had a MemoryPressure condition. A few lines below that we see that our mssql container was using 4461532Ki which exceeded its request of 0 (more on why it’s 0 in a bit). So then our Deployment Controller sees that our Pod is no longer up and running so the Deployment controller does what it’s supposed to do start a new Pod in the place of the failed Pod.
 
The scheduler in Kubernetes will try to place a Pod back onto the same Node if the Node is still available, in our case aks-agentpool-43452558-0. And each time the scheduler places the Pod back onto the same Node it find that the MemoryPressure condition is still true, so after the 10th try the scheduler selects a new Node, aks-agentpool-43452558-3 to run our Pod. And in the last line of the output below we can see that once the workload is moved to aks-agentpool-43452558-3 the MemoryPressure condition goes away on aks-agentpool-43452558-0 since it’s no longer running our workload. 
 
kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE      REASON                      OBJECT                                        MESSAGE
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-clrbx    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-clrbx to aks-agentpool-43452558-0
17m         Warning   EvictionThresholdMet        node/aks-agentpool-43452558-0                 Attempting to reclaim memory
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-clrbx
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-ln8zt
17m         Normal    Killing                     pod/mssql-deployment-2017-8698fb8bf5-p2mvm    Stopping container mssql
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-54k6k    The node had condition: [MemoryPressure].
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-p2mvm    The node was low on resource: memory. Container mssql was using 4461532Ki, which exceeds its request of 0.
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-cp6ml    The node had condition: [MemoryPressure].
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-cp6ml    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-cp6ml to aks-agentpool-43452558-0
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-54k6k    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-54k6k to aks-agentpool-43452558-0
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-clrbx    The node had condition: [MemoryPressure].
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-cp6ml
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-54k6k
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-ln8zt    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-ln8zt to aks-agentpool-43452558-0
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-96lzf    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-96lzf to aks-agentpool-43452558-0
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-96lzf
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-ln8zt    The node had condition: [MemoryPressure].
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-96lzf    The node had condition: [MemoryPressure].
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-4pw7d    The node had condition: [MemoryPressure].
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-4pw7d    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-4pw7d to aks-agentpool-43452558-0
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-4pw7d
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-2pw2z    The node had condition: [MemoryPressure].
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-2pw2z    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-2pw2z to aks-agentpool-43452558-0
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-2pw2z
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-4bn6c    The node had condition: [MemoryPressure].
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-4bn6c
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   Created pod: mssql-deployment-2017-8698fb8bf5-stzfw
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-4bn6c    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-4bn6c to aks-agentpool-43452558-0
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-stzfw    The node had condition: [MemoryPressure].
17m         Normal    SuccessfulCreate            replicaset/mssql-deployment-2017-8698fb8bf5   (combined from similar events): Created pod: mssql-deployment-2017-8698fb8bf5-td24w
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-wpgcx    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-wpgcx to aks-agentpool-43452558-0
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-wpgcx    The node had condition: [MemoryPressure].
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-stzfw    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-stzfw to aks-agentpool-43452558-3
17m         Warning   Evicted                     pod/mssql-deployment-2017-8698fb8bf5-nmq65    The node had condition: [MemoryPressure].
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-nmq65    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-nmq65 to aks-agentpool-43452558-0
17m         Normal    NodeHasInsufficientMemory   node/aks-agentpool-43452558-0                 Node aks-agentpool-43452558-0 status is now: NodeHasInsufficientMemory
17m         Normal    Scheduled                   pod/mssql-deployment-2017-8698fb8bf5-td24w    Successfully assigned default/mssql-deployment-2017-8698fb8bf5-td24w to aks-agentpool-43452558-3
16m         Normal    SuccessfulAttachVolume      pod/mssql-deployment-2017-8698fb8bf5-td24w    AttachVolume.Attach succeeded for volume "pvc-f35b270a-e063-11e9-9b6d-ee8baa4f9319"
15m         Normal    Pulling                     pod/mssql-deployment-2017-8698fb8bf5-td24w    Pulling image "mcr.microsoft.com/mssql/server:2017-CU16-ubuntu"
15m         Normal    Pulled                      pod/mssql-deployment-2017-8698fb8bf5-td24w    Successfully pulled image "mcr.microsoft.com/mssql/server:2017-CU16-ubuntu"
15m         Normal    Started                     pod/mssql-deployment-2017-8698fb8bf5-td24w    Started container mssql
15m         Normal    Created                     pod/mssql-deployment-2017-8698fb8bf5-td24w    Created container mssql
12m         Normal    NodeHasSufficientMemory     node/aks-agentpool-43452558-0                 Node aks-agentpool-43452558-0 status is now: NodeHasSufficientMemory
 
But guess what…we’re going to have the same problem on this new Node. If we run our workload again, our memory allocation will grow and Kubernetes will kill the Pod again once the MemoryPressure condition is met. So what do we do…how can we prevent our nodes from going into a MemoryPressure condition? 

Understanding Allocatable Memory in Kubernetes 

Using kubectl describe nodein the output below there’s a section Allocatable. In there we can see that amount of allocatable resources on this Node in terms of CPU, disk, RAM and Pods. These are the amount of resources available to run user Pods on this Node. And there we see the amount of allocatable memory is 4667840Ki (~4.45GB) so we have about that much memory to run our workloads. The amount here is a function of the amount of memory in the Node and reservations made by Kubernetes for system functions, more on that here. Our AKS cluster VMs are Standard DS2 v2 which have 2vCPU and 7GB of RAM, so about 2.55GB is reserved for other uses.  The output below is from after our Pod was evicted so we can see the LastTransitionTime shows the last time a condition occurred and for MemoryPressure we can see an event at 7:53 AM. The other LastTransitionTimes are from when the Node was started. Another key point is in the Events section where we can see the conditions change state.
 
kubectl describe nodes aks-agentpool-43452558-0
Name:               aks-agentpool-43452558-0
...output omitted...
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 10 Sep 2019 16:20:00 -0500   Tue, 10 Sep 2019 16:20:00 -0500   RouteCreated                 RouteController created a route
  MemoryPressure       False.  Sat, 28 Sep 2019 07:58:56 -0500.  Sat, 28 Sep 2019 07:53:55 -0500.  KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 28 Sep 2019 07:58:56 -0500   Tue, 10 Sep 2019 16:18:27 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sat, 28 Sep 2019 07:58:56 -0500   Tue, 10 Sep 2019 16:18:27 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sat, 28 Sep 2019 07:58:56 -0500   Tue, 10 Sep 2019 16:18:27 -0500   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  Hostname:    aks-agentpool-43452558-0
  InternalIP:  10.240.0.6
Capacity:
 attachable-volumes-azure-disk:  8
 cpu:                            2
 ephemeral-storage:              101584140Ki
 hugepages-1Gi:                  0
 hugepages-2Mi:                  0
 memory:                         7113152Ki
 pods:                           110
Allocatable:
  attachable-volumes-azure-disk:  8
 cpu:                            1931m
 ephemeral-storage:              93619943269
 hugepages-1Gi:                  0
 hugepages-2Mi:                  0
 memory: 4667840Ki  pods:                           110
...output omitted...
Events:
Type     Reason                     Age                  From                               Message
  ----     ------                     ----                 ----                               -------
  Warning  EvictionThresholdMet       10m                  kubelet, aks-agentpool-43452558-0  Attempting to reclaim memory
  Normal   NodeHasInsufficientMemory  10m                  kubelet, aks-agentpool-43452558-0  Node aks-agentpool-43452558-0 status is now: NodeHasInsufficientMemory
  Normal   NodeHasSufficientMemory    5m15s (x2 over 14d)  kubelet, aks-agentpool-43452558-0  Node aks-agentpool-43452558-0 status is now: NodeHasSufficientMemory

SQL Server’s View of Memory on Kubernetes Nodes

When using a Pod with no memory limits defined in the Pod Spec (which is why we saw 0 for the limits in the Event entry) SQL Server sees 5557MB (~5.4GB) memory available and thinks it has that to use. Why is that? Well, SQL Server on Linux looks at the base OS to see how much memory is available on the system and by default uses approximately 80% of that memory due its architecture (SQLPAL).
2019-09-28 14:46:16.23 Server      Detected 5557 MB of RAM. This is an informational message; no user action is required. 
This is bad news in our situation, Kubernetes has only 4667840Ki (~4.45GB) to allocate before setting the MemoryPressure condition which will cause our Pod to be Evicted and Terminated. And as with our workload running SQL Server allocates memory, primarily to the buffer pool, and it exceeds the Allocatable amount of memory for the Node Kubernetes kills our Pod to protect the Node and the cluster as a whole. 

Configuring Pod Limits for SQL Server

So how do we fix all of this? We need to set a resource limit in our Pod Spec. Limits allow us to control the amount of a particular resource exposed to a Pod. And in our case, we want to limit the amount of memory we want SQL Server to see. In our environment we know we have  4667840Ki (~4.45GB) of Allocatable memory for user Pods on Nodes so lets set a value lower than that…and to be super safe I’m going to use 3GB. In the code below you can see in the Pod Spec for our mssql container we have a section for resources, limits and a value of memory: “3Gi”.

    spec:
      hostname: sql3
      containers:
      - name: mssql
        image: 'mcr.microsoft.com/mssql/server:2017-CU16-ubuntu'
        ports:
        - containerPort: 1433
        env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mssql
              key: SA_PASSWORD
        resources:
          limits:
            memory: "3Gi"
        volumeMounts:
        - name: mssqldb
          mountPath: /var/opt/mssql
      volumes:
      - name: mssqldb
        persistentVolumeClaim:
          claimName: pvc-sql-system-2017
With this configured we limit the amount of memory SQL Server sees to 3GB. Given that the container is running SQL Server on Linux, SQL Server will actually see about 80% of that 2458MB
2019-09-28 14:01:46.16 Server      Detected 2458 MB of RAM. This is an informational message; no user action is required.

Summary

With that, I hope you can see why I consider memory settings the number one thing to look out for when deploying SQL Server in Kubernetes.  Setting appropriate values will ensure that your SQL Server instance on Kubernetes stays up and running and happily with the other workloads you have running in your cluster.  What’s the best value to set? We need to take into account the amount of memory on the Node, the amount of memory we need to run our workload in SQL Server, and the reservations needed by both Kubernetes and SQLPAL. Additionally, we should set max server memory instance level setting inside of SQL Server to limit the amount of memory that’s allocatable. My suggestion to you is to configure both a resource limit at the Pod Spec and configure max server memory at the instance level.

If you want to read more about resource management and Pod eviction check out this resources:

 

 

 

 

Data Persistency and Advanced SQL Server Disk Topologies in Kubernetes

When working with SQL Server in containers and Kubernetes storage is a key concept. In this post, we’re going to walk through how to deploy SQL Server in Kubernetes with Persistent Volumes for the system and user databases.

One of the key principals of Kubernetes is the ephemerality of Pods. No Pod is every redeployed, a completely new Pod is created. If a Pod dies, for whatever reason, a new Pod is created in its place there is no continuity in the state of that Pod. The newly created Pod will go back to the initial state of the container image defined in the Pod’s spec. This is very valuable for stateless workloads, not so much for stateful workloads like SQL Server.

This means that for a stateful workload like SQL Server we need to store both configuration and data externally from the Pod to maintain state through the recreation of a Pod. Kubernetes give us constructs two constructs to do that, environment variables and Persistent Volumes. 

Using Environment Variables for Container Configuration

Container-based applications use environment variables for configuration at startup. The SQL Server container has a collection of environment variables that can be used to configure it at container startup. We will leverage two of those in this configuration. MSSQL_DATA_DIR and MSSQL_LOG_DIR these allow us to define a file system locations for user database and log files. When the SQL Server container is started inside the Pod, it reads the environment variables at runtime and sets its configuration based on those values. We define these variables as part of the Pod Spec. We will cover that configuration below.

Using Persistent Volumes to Maintain Database State

To persist the state of our SQL Server container, we will configure SQL Server to store its data and log files for both user and system databases on Persistent Volumes.

First, let’s review how SQL Server in a container starts up. During the initial startup, the SQL Server process checks to see if there are any system databases in the default system file location which is, /var/opt/mssql/data. If there are none the system databases are copied there, if they are there no action is taken. 

To add persistently to the system databases, and really all of the other components of SQL Server such as the Error Log and other system files, we will configure /var/opt/mssql so that it is backed by a Persistent Volume.

By placing the system databases on a Persistent Volume, when a Pod is recreated and the Persistent Volumes are attached and mounted in the same location when the SQL Server process starts up it sees the system databases and has what it needs to maintain state between creation.

If there are records for user databases in the system databases, SQL Server will start the process of bringing those databases online as well. We certainly the default location for user databases is /var/opt/mssql/data but we are going to override that with an environment variable for both the data and log directories, placing each on a dedicated Persistent Volumes.

Let’s walk through that configuration together. 

Persistent Volume Claims

In this configuration, we will use dynamic storage provisioning. In dynamic provisioning, a Persistent Volume Claim (PVC) is used to request a Persistent Volume (PV) from a Storage Class. In this case, we’ll be using AKS’s managed-premium Storage Class. 

Here we define three PVCs, one for each place we want Persistent Volume, for the system files and databases and the user database and log files.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "pvc-sql-data"
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "pvc-sql-system"
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "pvc-sql-log"
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 10Gi

Deployment

In the Pod spec for our Deployment, we want to define several elements to support this configuration. 

  • Volumes – define volumes that can be mounted by this Pod. In this case, we’re creating and naming three volumes, backed by the PVCs defined above.
  • volumeMounts – volumes mounted into the container and their mountPath, location. This maps the names from the named Volumes to a location in the filesystem in the container.
  • env – due to the ephemerality of the container in the Pod, we need to tell SQL Server at start up that the data and log files will be stored in a specified directory. We are leaving the system databases and files in the default location which is /var/opt/mssql
The net effect of this storage configuration is that we are mapping the Persistent Volumes into a particular location in the filesystem inside the container. 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mssql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mssql
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: mssql
        image: 'mcr.microsoft.com/mssql/server:2017-latest'
        ports:
        - containerPort: 1433
        env:
        - name: ACCEPT_EULA
          value: 'Y'
        - name: MSSQL_DATA_DIR
          value: '/data'
        - name: MSSQL_LOG_DIR
          value: '/log'
        - name: SA_PASSWORD
          value: 'S0methingS@Str0ng!'
        volumeMounts:
        - name: mssql-system
          mountPath: /var/opt/mssql
        - name: mssql-data
          mountPath: /data
        - name: mssql-log
          mountPath: /log
      volumes:
      - name: mssql-system
        persistentVolumeClaim:
          claimName: pvc-sql-system
      - name: mssql-data
        persistentVolumeClaim:
          claimName: pvc-sql-data
      - name: mssql-log
        persistentVolumeClaim:
          claimName: pvc-sql-log

Service

We’ll front end our SQL Server with a public IP address and a load balancer. 

apiVersion: v1
kind: Service
metadata:
  name: mssql-deployment
spec:
  selector:
    app: mssql
  ports:
    - protocol: TCP
      port: 31433
      targetPort: 1433
  type: LoadBalancer

Apply the Configuration

Save the code above into a YAML file and deploy it into SQL Server.

kubectl apply -f deployment-advanced-disk.yaml

You’ll get this output

persistentvolumeclaim/pvc-sql-data created
persistentvolumeclaim/pvc-sql-system created
persistentvolumeclaim/pvc-sql-log created
deployment.apps/mssql-deployment created
service/mssql-deployment created

Confirm the configuration

We can use kubectl get pv to list the Persistent Volumes (PV) dynamically allocated by our cluster. Here there are three Persistent Volumes. The key here is the status is Bound, which means they are bound to a PVC. I also want to point out the Reclaim Policy is Delete. This means if the PVC is deleted, the PV will be deleted at a cleanup interval sometime in the future. 

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS      REASON   AGE
pvc-e0b418ef-6e69-11e9-a433-f659caf6a6f5   10Gi       RWO            Delete           Bound    default/pvc-sql-data     managed-premium            11m
pvc-e0cf2345-6e69-11e9-a433-f659caf6a6f5   10Gi       RWO            Delete           Bound    default/pvc-sql-system   managed-premium            11m
pvc-e0ea01a8-6e69-11e9-a433-f659caf6a6f5   10Gi       RWO            Delete           Bound    default/pvc-sql-log      managed-premium            11m

With kubectl get pvc we get a list of the PVCs in our configuration, once for each we defined above. The key here is the status is Bound, or that they are bound to a PV.

kubectl get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc-sql-data     Bound    pvc-e0b418ef-6e69-11e9-a433-f659caf6a6f5   10Gi       RWO            managed-premium   12m
pvc-sql-log      Bound    pvc-e0ea01a8-6e69-11e9-a433-f659caf6a6f5   10Gi       RWO            managed-premium   12m
pvc-sql-system   Bound    pvc-e0cf2345-6e69-11e9-a433-f659caf6a6f5   10Gi       RWO            managed-premium   12m 

Now let’s use kubectl describe pods to get the deep dive info about our storage configuration and how it’s mapped into the Pod. 

There are three keep places in the output below I want to point you to

  • Containers: mssql: Environment: you’ll find the two environment variables set for the data and log directories. Configured as /data and /log
  • Mounts: we see the file system location inside the container and the name of the Volumes defined in the Pod Spec
  • Volumes: we see the name of the Volumes, their type, claim name and the read/write status.
  • Events: this is a log of the events for the creation of this Pod. Key here is that sometimes the container will come up prior to the storage being available to the Pod. That’s what the error below is, but it clears itself up and the container is able to start.
kubectl describe pods
Name:               mssql-deployment-df4cf5c4c-nf8lf
Namespace:          default
Priority:           0
PriorityClassName:
Node:               aks-nodepool1-89481420-2/10.240.0.6
Start Time:         Sat, 04 May 2019 07:41:59 -0500
Labels:             app=mssql
                    pod-template-hash=df4cf5c4c
Annotations:
Status:             Running
IP:                 10.244.1.51
Controlled By:      ReplicaSet/mssql-deployment-df4cf5c4c
Containers:
  mssql:
    Container ID:   docker://f2320ae8f94c24fbb04214b903b4a218b82e9548f8d88a95daa7e207eeaa42b4
    Image:          mcr.microsoft.com/mssql/server:2017-latest
    Image ID:       docker-pullable://mcr.microsoft.com/mssql/server@sha256:39554141d307f2d40d2abfc54e3a0eea3aa527e58f616496c6f3ed3245a2e2b1
    Port:           1433/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 04 May 2019 07:44:21 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      ACCEPT_EULA:                   Y
      MSSQL_DATA_DIR:                /data
      MSSQL_LOG_DIR:                 /log
      SA_PASSWORD:                   S0methingS@Str0ng!
      KUBERNETES_PORT_443_TCP_ADDR:  cscluster-kubernetes-cloud-fd0c5e-8bca8b54.hcp.centralus.azmk8s.io
      KUBERNETES_PORT:               tcp://cscluster-kubernetes-cloud-fd0c5e-8bca8b54.hcp.centralus.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://cscluster-kubernetes-cloud-fd0c5e-8bca8b54.hcp.centralus.azmk8s.io:443
      KUBERNETES_SERVICE_HOST:       cscluster-kubernetes-cloud-fd0c5e-8bca8b54.hcp.centralus.azmk8s.io
    Mounts:
      /data from mssql-data (rw)
      /log from mssql-log (rw)
      /var/opt/mssql from mssql-system (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z9sbf (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  mssql-system:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-sql-system
    ReadOnly:   false
  mssql-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-sql-data
    ReadOnly:   false
  mssql-log:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-sql-log
    ReadOnly:   false
  default-token-z9sbf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-z9sbf
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age   From                               Message
  ----     ------                  ----  ----                               -------
  Normal   Scheduled               13m   default-scheduler                  Successfully assigned default/mssql-deployment-df4cf5c4c-nf8lf to aks-nodepool1-89481420-2
  Normal   SuccessfulAttachVolume  13m   attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-e0ea01a8-6e69-11e9-a433-f659caf6a6f5"
  Normal   SuccessfulAttachVolume  12m   attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-e0cf2345-6e69-11e9-a433-f659caf6a6f5"
  Normal   SuccessfulAttachVolume  12m   attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-e0b418ef-6e69-11e9-a433-f659caf6a6f5"
  Warning  FailedMount             11m   kubelet, aks-nodepool1-89481420-2  Unable to mount volumes for pod "mssql-deployment-df4cf5c4c-nf8lf_default(027c46f7-6e6a-11e9-a433-f659caf6a6f5)": timeout expired waiting for volumes to attach or mount for pod "default"/"mssql-deployment-df4cf5c4c-nf8lf". list of unmounted volumes=[mssql-system mssql-data]. list of unattached volumes=[mssql-system mssql-data mssql-log default-token-z9sbf]
  Normal   Pulled                  11m   kubelet, aks-nodepool1-89481420-2  Container image "mcr.microsoft.com/mssql/server:2017-latest" already present on machine
  Normal   Created                 11m   kubelet, aks-nodepool1-89481420-2  Created container
  Normal   Started                 11m   kubelet, aks-nodepool1-89481420-2  Started container

Creating a Database and Verifying File Location

With this code, we’ll get our IP address for our SQL Server service then we’ll create a database and query master_files for a list of data files. Notice I’m defining my service port as 31443 which is what we defined when creating our service in the earlier step.

SVCIP=$(kubectl get svc mssql-deployment | grep mssql-deployment |  awk '{print $4}')
sqlcmd -S $SVCIP,31433 -U sa -Q 'CREATE DATABASE TestDB1' -P $PASSWORD
sqlcmd -S $SVCIP,31433 -U sa -Q 'SELECT name,physical_name from sys.master_files' -P $PASSWORD


And we’ll get this output, you can see all of the system databases backed by /var/opt/mssql and our user database is on /data and the log is on /log. All backed by Persistent Volumes.

master        /var/opt/mssql/data/master.mdf
mastlog       /var/opt/mssql/data/mastlog.ldf
tempdev       /var/opt/mssql/data/tempdb.mdf
templog       /var/opt/mssql/data/templog.ldf
modeldev      /var/opt/mssql/data/model.mdf
modellog      /var/opt/mssql/data/modellog.ldf
MSDBData      /var/opt/mssql/data/MSDBData.mdf
MSDBLog       /var/opt/mssql/data/MSDBLog.ldf
TestDB1       /data/TestDB1.mdf
TestDB1_log   /log/TestDB1_log.ldf

Confirming Persistency

Let’s go ahead and delete our Pod to confirm that when it’s recreated by our Deployment our data is still there. 

kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
mssql-deployment-df4cf5c4c-nf8lf   1/1     Running   0          4d2h

kubectl delete pod mssql-deployment-df4cf5c4c-nf8lf 
pod "mssql-deployment-df4cf5c4c-nf8lf" deleted

Once the Pod is recreated, let’s query master files to see where our databases are located. And you’ll find that your the database created in the previous step persisted between Pod creations.

sqlcmd -S $SVCIP,31433 -U sa -Q 'SELECT name,physical_name from sys.master_files' -P $PASSWORD

master        /var/opt/mssql/data/master.mdf
mastlog       /var/opt/mssql/data/mastlog.ldf
tempdev       /var/opt/mssql/data/tempdb.mdf
templog       /var/opt/mssql/data/templog.ldf
modeldev      /var/opt/mssql/data/model.mdf
modellog      /var/opt/mssql/data/modellog.ldf
MSDBData      /var/opt/mssql/data/MSDBData.mdf
MSDBLog       /var/opt/mssql/data/MSDBLog.ldf
TestDB1       /data/TestDB1.mdf
TestDB1_log   /log/TestDB1_log.ldf

SQL Server on Linux – External Memory Pressure with 2019 CTP2

In this blog post we’re going to revisit how SQL Server on Linux responds to external memory pressure. This is a very long post, and it ends with me not knowing exactly what’s going on…but the journey is pretty fun…let’s go!
 
On Windows-based SQL Server systems we’ve become accustomed to the OS signaling to SQL Server that there’s a memory shortage. When signaled, SQL Server will kindly start shrinking it’s memory caches, including the buffer pool, to maintain overall system stability and usability. Well that story is a little different in SQL Server on Linux…last year I wrote a similar post for SQL Server 2017 RTM and how it reacted to external memory pressure, check that out here! That was quite a dramatic story, you can literally cause SQL Serve to swap nearly its entire process address space out to disk! Now, let’s look and see how SQL Server on Linux responds to external memory pressure in SQL Server 2019 CTP2.

The Scenario

Our SQL Server on Linux system is a CentOS 7.5 VM, running SQL Server 2019 (CTP2). There is 12GB of physical memory and 80% of that is dedicated to SQL Server via mssql-conf about 9.2GB, the default configuration. I’ve also set Max Server memory to 8GB for the SQL Server instance.

So let’s start up SQL Server on Linux 2019 CTP2 and query the SQL Server’s cache sizes when there’s no data loaded. I’m going to look in  sys.dm_os_memory_clerks for memory allocation accounting information. In Figures 1 we can get a feel for the memory distribution across the system’s caches. Special shoutout to the new “Azure Data Studio” I’m using in these demos.

Screen Shot 2018 10 13 at 6 46 19 AM

 Figure 1 – Top memory consumers at system startup

Now, I’m going to load an 8GB table into the buffer pool, maxing out SQL Server’s MAX memory instance setting. Now we’ll look at the allocations again. In Figure 2, we can see the buffer pool is the dominant consumer of memory and that our database is the dominant consumer in the buffer pool. So far so good, eh?

Screen Shot 2018 10 13 at 6 50 58 AM

Figure 2 – Top memory consumers after server reached maximum memory

Getting Memory Information From Linux

We can use tools like pstop and htop to look our are virtual and physical memory allocations. We can also look in the /proc virtual file system for our process and look at the status file. In here we’ll find the point in time status of a process, and most importantly the types of memory allocations for a process. We’ll get granular data on the virtual memory allocations and also the resident set size of the process. Here are the interesting values in the status file we’re going to focus on today.

  • VmSize – total current virtual address space of the process
  • VmRSS – total amount of physical memory currently allocated to the process
  • VmSwap – total amount of virtual memory currently paged out to the swap file (disk)

We can use the command pidof sqlservr to find the process ID for our SQL on Linux process, in our case it’s 14689. When running SQL Server on Linux there is always two processes for sqlservr, we’re going to look the one with the higher process ID as the one with the lower PID is the watchdog process.

Now to access the status information for our process we look in /proc/14689/status

Below is the output from our process’ status file, filtering for the values we’re interested in. You can see about 16.4GB of Virtual Memory (VmSize), 7.9GB of which is in actual RAM (VmRSS) and 0MB in the swap file (VmSwap). Wait a second, 16GB of for VmSize? Yea, 16GB for VmSize. That’s the reserved allocation for the whole process. More on that in a second…

VmSize:  17165472 kB

VmRSS:    8579016 kB

VmSwap:         0 kB 


Table 1 – Process memory for SQL Server 2019

Now, if we look back at the previous post here on SQL Server 2017 RTM, the memory distribution was a bit different. We see about 10.16GB of memory in VmSize

VmSize:  10661380 kB

VmRSS:    8201240 kB

VmSwap:         0 kB


Table 2 – Process memory for SQL Server 2017

Phantom Memory Allocation?

Now, I very rarely post something when I don’t have an answer for a specific behavior but this case I don’t. My thoughts on this are, that 16GB is the reserved allocation for the whole process. Linux uses a demand paging allocation system so the majority of the process access space is just a reservation and isn’t backed by actual pages. So it’s not *really* consuming physical memory. But what I do find interesting is the process seems to carry that allocation forward, as I loaded memory in, the VmSize went from 8GB (which is the VmSize when SQL Server Starts) to 16GB (when I loaded the 8GB table). I’ve reported this to the product team and I am awaiting for an answer on the ~8GB discrepancy between the two tests.

Let’s Add Some Pressure

Using a small C program I wrote, I use calloc to allocate large sections of contiguous memory and then I have the program continually write data into the memory buffer to ensure those pages stay in physical memory. Using this program, let’s allocate 7GB of RAM. I’m choosing this value because its going to cause my system to exceed it’s physical memory but NOT exceed the amount virtual memory in my system when including the swap file’s size which is 5.8GB on my system. Things can get really dicey if you exceed physical and virtual memory size, processes will start to get killed. Microsoft documents a case here. Just so you know how important adding a monitor for external memory pressure is, when I allocated 7.5GB and exceeded physical + virtual memory Linux killed both the memory allocation program and SQL Server to protect the stability of the system, oomkiller is a thing to look out for!

Now, let’s look at the memory accounting information at /proc/PID/status for my memory allocator program. In the output below you can see we have just about 7GB of virtual memory (VmSize) allocated and nearly the same amount of memory that’s resident in physical memory (VmRSS) and no data in the swap file.

VmSize:  7344248 kB

VmRSS:   7340312 kB

VmSwap:        0 kB


Table 3 – Process memory for external memory allocation program
 
Now in SQL Server 2017 with that 7GB program running would cause Linux to need to make room in physical memory for this process. Linux does this by swapping least recently used pages from memory out to disk. So under external memory pressure, let’s look at the SQL Server process’ memory allocations according to Linux. In the output below we see we still have a VmSize of around 10GB, but our VmRSS value has decreased dramatically. In fact, our VmRSS is now only 2.95GB. VmSwap has increased to 5.44GB. Wow, that’s a huge portion of the SQL Server process swapped to disk.
 

VmSize:  10700328 kB

VmRSS:    3098456 kB

RssAnon:  3096308 kB

VmData:  10442840 kB

VmSwap:   5711856 kB

Table 4 – Process memory for SQL Server 2017 on Linux under external memory pressure

In SQL Server 2019, there’s a different outcome! In the data below we see our 16GB VmSize which won’t change much because of the virtual address space for the process. With that large external process running SQL Server reduced VmRSS from 7.9GB (from Table 1) to 2.8GB only placing about 4.68GB in the swap file. That doesn’t sound much better, does it? I thought SQL Server was going to react to the external memory pressure…let’s keep digging and ask SQL Server what it thinks about this.

VmSize:  17159552 kB

VmRSS:    2938684 kB

VmData:  16806460 kB

VmSwap:   4911120 kB


Table 5 – Process memory for SQL Server 2019 on Linux under external memory pressure

What’s SQL Server Think About This?

In SQL Server on Linux 2017, we saw a large chunk of the process address space swap out and we saw those pages stay resident in SQL Server’s buffer pool or really any part of the SQL Server process address space since the Linux kernel is an equal opportunity swapper…so anything in the caches managed by SQL Server didn’t know if that memory was resident in physical memory or was actually on disk. In figure 3, we see the top memory consumers after external memory pressure, our buffer pool is still fully populated with our table data.

Screen Shot 2018 10 13 at 8 24 02 AM

Figure 3 – Top memory consumers after external memory pressure in SQL Server 2017

Ok, let’s look at what happens in SQL Server 2019 with external memory pressure. Looking at the memory distribution, we see that the memory clerk for SQL Server’s buffer pool reduced its memory footprint from 7,227MB (the value from way back in Figure 2) when the buffer pool was first loaded with our large table. Wh then applied external memory pressure with our 7GB process and now we see the buffer pool is reduced to 3,502MB a reduction of 3,725MB. Cool, looks like we dumped our buffer pool to protect the system but not the whole buffer pool, there’s still 3,502MB in memory. But the SQL Server 2019 on Linux process still swapped out about that much data (Table 5)…what’s going on? I still don’t know. Let’s keep digging.
 
Screen Shot 2018 10 13 at 8 24 39 AM

Figure 4 – Top memory consumers after external memory pressure in SQL Server 2019

Now, in SQL Server 2017 on Linux Resource Monitor didn’t exist…let’s look at what we have in SQL Server 2019…using the query from Amit Banerjee and Sudarshan Narasimhan in this post here let’s see what happened according to the newly implemented Resource Monitor for SQL Server 2019 on Linux.
 
Screen Shot 2018 10 13 at 10 02 45 AM
 
Figure 5 – Resource Monitor output before and after external memory pressure
 
From the output of this query, we can see that System Indicator is 2 during periods of external memory pressure. This is a signal from Resource Monitor to SQL Server to dump its caches to protect the health of the overall system.  We certainly observed that in the output in Figure 4. But what we also found is that Linux is placing 4.68GB of memory into the swap file and that SQL Server is holding on to 3,502MB of memory in the buffer pool while the VmRSS of the process is 2.8GB (Table 5). So the question remains, why is SQL Server swapping so much of its memory out to disk and holding on to so much memory in the buffer pool even though it’s seeing the low memory signal from Resource Monitor. This data is further backed up by the fact that we see SQL_CommitedMemory_KB staying relatively stable in the light of the external memory pressure, the expectation is that would reduce. So it’s like it’s not quite finishing the job deallocating the memory. So let’s ask Linux what it thinks about this. 

Resource Monitor on Linux – How I think it works…

If you’ve read this blog before you know I like to use strace to find out how applications interact with the OS via system calls. Well, let’s use that technique to try to figure out what happens when SQL Server starts dumping it’s cache under external memory pressure. 
 
In the last section, we learned that Resource Monitor can track high and low memory conditions in Linux. In this section, let’s see how it interacts with the Linux memory management system to deallocate memory.
 
In the output below (Table 6) you can see the aggregated system calls when I’m applying external memory pressure to SQL Server on Linux. From that output, it looks like SQL Server on Linux uses memory mapped files, which is a file access technique where files are opened mapped directly into a process’ memory address space. The process can then access the file via direct memory addressing rather than using system calls.  This type of access actually makes quite well to the internal file structures used in SQL Server data files.
 
During external memory pressure, the madvise is the most frequently called system call during this period. This system call is used by applications to advise (hint) the kernel on what to do with ranges of memory under certain conditions. In the last section, we observed that Resource Monitor can sense the external memory pressure and signals SQL Server to clear its caches and this is likely the implementation of how those caches are dumped.  
 
Under memory pressure, the madvise system call is called with the MADV_DONTNEED flag for large address ranges in SQL Server’s process address space, the memory mapped file. The MADV_DONTNEED flag tells the kernel that the application does not expect to access these pages in the near future and they can be discarded and eventually freed rather than swapped out to disk. See the kernel source code here for how it works.

madvise(0x69274a000, 8192, MADV_DONTNEED) = 0

 
My theory is that this is how Resource Monitor is signaling to the OS that these pages are no longer needed and should be deallocated due to the semantics of the madvise system call the pages are marked as not needed. So we see the reduction in the size reported by SQL Server’s memory clerks. But what I find interesting is the amount of memory that’s still hitting swap when we look at the memory accounting information for the individual processes and the IO profile of the system during external memory pressure. So it seems like the kernel or the application is holding onto pages and they’re hitting swap rather than being deallocated.
 
This is just my theory and I’m trying to bring the pieces together in this emerging technology since it’s still in CTP. So if you have some insight as to what’s happening at this layer, I’d love to hear from you :) 
 

strace -p PID -f -c

% time     seconds  usecs/call     calls    errors syscall

—— ———– ———– ——— ——— —————-

 75.98  575.483472       12188     47217     11425 futex

  8.24   62.370591      275976       226           epoll_wait

  7.73   58.578205     2253008        26        14 restart_syscall

  3.44   26.050226     2170852        12           io_getevents

  1.60   12.096475      549840        22           nanosleep

  1.55   11.726022      266501        44        44 rt_sigtimedwait

  0.71    5.387356          14    389517           clock_gettime

  0.49    3.688034          24    154405           madvise

  0.12    0.944458         803      1176           munmap

  0.07    0.495154         343      1445           mmap

  0.06    0.444399        7796        57           open

  0.01    0.062301        1093        57           read

  0.00    0.013327         579        23        23 stat

  0.00    0.009003          46       196           gettid

  0.00    0.005058          43       119           sched_setaffinity

  0.00    0.004188          68        62           clone

  0.00    0.004018          65        62           set_robust_list

  0.00    0.003902          36       109           rt_sigprocmask

  0.00    0.003194          56        57           fstat

  0.00    0.002914          47        62           sched_getaffinity

  0.00    0.002331          38        62           sigaltstack

  0.00    0.001805          29        62           arch_prctl

  0.00    0.001575          28        57           close

  0.00    0.001182         394         3           io_submit

  0.00    0.000672          42        16           sched_yield

  0.00    0.000506          22        23           rt_sigaction

  0.00    0.000011          11         1           fdatasync

—— ———– ———– ——— ——— —————-

100.00  757.380379                595118     11506 total

 
Table 6: Aggregated system calls for SQL Server on Linux under external memory pressure
 

 % time     seconds  usecs/call     calls    errors syscall

—— ———– ———– ——— ——— —————-

 71.52  569.206916       33673     16904      4444 futex

 12.46   99.164048      819538       121           epoll_wait

  6.90   54.920907     2387866        23        17 restart_syscall

  2.73   21.715336        1084     20033           io_getevents

  2.15   17.123567     1902619         9           nanosleep

  2.07   16.494124      868112        19        19 rt_sigtimedwait

  1.77   14.122895        1070     13202           io_submit

  0.22    1.780524         218      8180           munmap

  0.10    0.769185          93      8262           mmap

  0.06    0.502749         125      4026           clock_gettime

  0.00    0.006171         386        16           sched_setaffinity

  0.00    0.005702          62        92           read

  0.00    0.004128         197        21           fstat

  0.00    0.003996         118        34           sched_yield

  0.00    0.003961         165        24           gettid

  0.00    0.003112         346         9           arch_prctl

  0.00    0.002615         291         9           sched_getaffinity

  0.00    0.002528          84        30           rt_sigprocmask

  0.00    0.002045          85        24           close

  0.00    0.001929          92        21           open

  0.00    0.001924         214         9           clone

  0.00    0.001752         195         9           set_robust_list

  0.00    0.001647          97        17           madvise

  0.00    0.001198         599         2           fdatasync

  0.00    0.000989         110         9           sigaltstack

  0.00    0.000632          57        11        11 stat

  0.00    0.000446          45        10           rt_sigaction

  0.00    0.000351          18        19           write

  0.00    0.000316          11        29           epoll_ctl

  0.00    0.000310          44         7           writev

  0.00    0.000159          11        15         7 readv

  0.00    0.000051          26         2           socket

  0.00    0.000021          11         2           setsockopt

  0.00    0.000014           4         4           getcpu

  0.00    0.000009           3         3           getsockname

  0.00    0.000009           5         2           getpeername

  0.00    0.000004           4         1           accept

  0.00    0.000004           2         2           fcntl

—— ———– ———– ——— ——— —————-

100.00  795.846274                 71212      4498 total

 
Table 7: Aggregated system calls for SQL Server on Linux under external memory pressure
 
 

SQL Server on Linux – External Memory Pressure

In this blog post we’re going to explore how SQL Server on Linux responds to external memory pressure. On Windows based SQL Server systems we’ve become accustomed to the OS signaling to SQL Server that there’s a memory shortage. When signaled, SQL Server will kindly start shrinking it’s memory caches, including the buffer pool, to maintain overall system stability and usability. We’ll that story is a little different in SQL Server on Linux…let’s look and see how SQL Server on Linux responds to external memory pressure

The Scenario

Our SQL Server on Linux system is a CentOS 7.4 VM, running SQL Server 2017 (RTM). There is 12GB of physical memory and 80% of that is dedicated to SQL Server via mssql-conf about 9.2GB, the default configuration. I’ve also set Max Server memory to 8GB.

So let’s start up SQL Server on Linux and query the buffer pool cache sizes when there’s no data loaded. I’m going to look in two places. sys.dm_os_memory_clerks and also sys.dm_os_buffer_descriptors.  In Figures 1 and 2, we can get a feel for the memory distribution across the system and also inside the buffer pool.

Screen Shot 2017 10 19 at 2 01 04 PM

Figure 1 – Top memory consumers at system startup

Buffer Pool - Cold

 Figure 2 – Buffer Pool memory distribution at system startup

Now, I’m going to load a 8GB table into the buffer pool, maxing out SQL Server’s MAX memory instance setting. Now we’ll look at the allocations again. In Figures 3 and 4, we can see the buffer pool is the dominant consumer of memory and that our database is the dominant consumer in the buffer pool. So far so good, eh?

Memory Clerks - Warm

 Figure 3 – Top memory consumers after server reached maximum memory

Buffer Pool - Warm

 Figure 4 – Buffer Pool memory distribution after table loaded.

Getting Memory Information From Linux

We can use tools like ps, top and htop to look our are virtual and physical memory allocations. We can also look in the /proc virtual file system for our process and look at the status file. In here we’ll find the point in time status of a process, and most importantly the types of memory allocations for a process. We’ll get granular data on the virtual memory allocations and also the resident set size of the process. Here are the interesting values in the status file we’re going to focus on today.

  • VmSize – total current virtual address space of the process
  • VmRSS – total amount of physical memory currently allocated to the process
  • VmSwap – total amount of virtual memory currently paged out to the swap file (disk)

We can use the command pidof sqlservr to find the process ID for our SQL on Linux process, in our case it’s 14689. When running SQL Server on Linux there is always two processes for sqlservr, we’re going to look the one with the higher process ID.

Now to access the status information for our process we look in /proc/14689/status

Below is the output from our process’ status file, filtering for the values we’re interested in. You can see about 10.16GB of Virtual Memory (VmSize), 7.82GB of which is in actual RAM (VmRSS) and 622MB in the swap file (VmSwap). Nothing special to see here, just the baseline information for our process with a populated buffer pool.

VmSize: 10661380 kB

VmRSS:   8201240 kB

VmSwap:   636952 kB

Let’s Add Some Pressure

Using a small C program I wrote, I use calloc to allocate large sections of memory and then I have the program continually write data into the memory buffer to ensure those pages stay in physical memory. Using this program, let’s allocate 4GB of RAM. I’m choosing this value because it’s going to cause my system to exceed it’s physical memory but NOT exceed the amount virtual memory in my system when including the swap file’s size. Things can get really dicey if you exceed physical and virtual memory size, processes will start to get killed. Microsoft documents a case here.

Let’s look at the memory data from /proc/17141/status for my memory allocator program. In the output below you can see we have just about 4GB of virtual memory (VmSize) allocated and nearly the same amount of memory that’s resident in physical memory (VmRSS) and no data in the swap file.

VmSize:  4198476 kB

VmRSS:   4194484 kB

VmSwap:        0 kB

 

Now that 4GB program has caused Linux to need to make room in physical memory for this process. It does this by swapping least recently used pages from memory out to disk. So under memory pressure, let’s look at the SQL Server process’ memory allocations according to Linux. In the output below we see we still have a VmSize of around 10GB, but our VmRSS value has decreased dramatically. In fact, our VmRSS is now only 3.53GB. It decreased by 4.29GB!  VmSwap has increased to 5.04GB. Wow, that’s a huge portion of the process swapped to disk. 

VmSize: 10843320 kB

VmRSS:   3705476 kB

VmSwap:  5289752 kB

What Does SQL Server Think About All of This?

With our memory allocation program running and no workload running inside SQL Server, let’s look at sys.dm_os_memory_clerks and sys.dm_os_buffer_descriptors again. Something interesting happens, in Figures 5 and 6, when we look at sys.dm_os_memory_clerks and sys.dm_os_buffer_descriptors things changed but only slightly  But from the data above we do know that only 3.54GB of the process is actually in memory (VmRSS) at this point. Which means a large portion of SQL Server’s process is actually NOT in memory anymore…they’re on disk. The reality is, we simply do not know which portions of the process are in memory or on disk at this level.  You could use tools like pmap to determine which sections of memory are on disk, but we’ll leave that for another day. Try it out…it’s fun ;)

Swapped out? Why not?

Figure 5 – Top memory consumers after server reached maximum memory and with a significant portion of the process swapped to disk. Just about the same as it was without external memory pressure.

Buffer Pool - Warm but paged

Figure 6 – Buffer Pool memory distribution after table loaded and with a significant portion of the process swapped to disk. Just about the same as it was without external memory pressure.

Things aren’t what they seem

Looking at wait stats during query execution is going to be a little more interesting when SQL Server thinks it’s data pages are in memory, but a significant portion of them are actually on disk. In fact, when querying my test table with a simple SELECT COUNT(*) the wait types don’t indicate the fact that SQL Server is now actually disk bound due to the large portions of the buffer pool being paged out to disk. Under normal conditions, this query takes 11 seconds to run…when we’re paging to disk, it takes 2 minutes and 11 seconds to run. 

So let’s check out the wait types reported during our query output here in Figure 7. We see SOS_SCHEDULER_YIELD…which makes sense…long running query, that’s pretty slow and it’s quantum is being exhausted and the thread is being rescheduled. We do see some disk IO waits with PAGEIOLATCH_SH, but not much…what’s happening here is my data table is 8GB and my Max server memory is also 8GB so there’s a little overflow there and will read from disk for some of the data. From a performance standpoint, SQL Server has no idea that it’s data pages are likely swapped out to in this low memory situation. It thinks those pages are in the buffer pool…they certainly are…but a large chunk of the buffer pool is now actually swapped out to disk. 

Wait stats, who needs wait stats?

Figure 7 – Wait types reported during query execution with a significant portion of the process swapped to disk. Wait stats query from @brento 

What does this all mean?

Well, SQL Server is on a new operating system now, Linux. It’s upon us as DBAs to know how our system works as a whole and what to do when we’re in situations where performance falls off a cliff due to memory contention. This case is certainly a contrived case by putting so much external pressure on our process…but you, the DBA, need to know how SQL Server on Linux is going to react in these conditions and how to get your system back on track when things go south. Good monitoring and also good system resource management are needed. Certainly a properly size system in terms of physical memory and swap file size is needed. We toed the line here in our example ensuring we didn’t exceed physical memory plus swap file size. If you do exceed physical and swap size memory allocations will fail and your process will likely crash or be killed.

Speaking of monitoring, I’m speaking at PASS Summit this year on…you guessed “Monitoring Linux Performance for the SQL Server Admin” on Friday November 3th at 8am. If you want to learn more, I’ll see you there. 

Speaking at PASS Summit 2017

I’m very pleased to announce that I will be speaking at PASS Summit 2017!  This is my first time speaking at PASS Summit and I’m very excited to be doing so! What’s more, is I get to help blaze new ground on a emerging technology SQL Server on Linux! My session is Monitoring Linux Performance for the SQL Server Admin so if you’re a Windows or SQL Server administrator, this session is for you. We’ll look at some of the internals of SQL Server on Linux and dive into Linux OS internals and show you where to look inside Linux for most important performance data for your SQL Server. I hope to see you there!

 

Monitoring Linux Performance for the SQL Server Admin

Abstract

So you’re a SQL Server administrator and you just installed SQL Server on Linux. It’s a whole new world. Don’t fear, it’s just an operating system. It has all the same components Windows has and in this session we’ll show you that. We will look at the Linux operating system architecture and show you where to look for the performance data you’re used to! Further we’ll dive into SQLPAL and how it architecture and internals enables high performance for your SQL Server. By the end of this session you’ll be ready to go back to the office and have a solid understanding of performance monitoring Linux systems and SQL on Linux. We’ll look at the core system components of CPU, Disk, Memory and Networking monitoring techniques for each and look some of the new tools available including new DMVs and DBFS.

 

PASS Summit 2017

dbfs – command line access to SQL Server DMVs

With SQL Server on Linux, Microsoft has recognized that they’re opening up their products to a new set of users. People that aren’t used to Windows and it’s tools. In the Linux world we have a set of tools that work with our system performance data and present that to us as text. Specifically, the placeholder for nearly all of the Linux kernel’s performance and configuration data is the /proc virtual file system, procfs. Inside here you can find everything you need that represents the running state of your system. Processes, memory utilization, and disk performance data all of this is presented as files inside of directories inside /proc.

Now, let’s take this idea and extend it to SQL Server. In SQL Server we have DMVs, dynamic management views. These represent to current running state of our SQL Server. SQL Server exposes the data in DMVs as table data that we can query using T-SQL. 

So, Microsoft saw the need to bring these two things together, we can expose the internals of SQL Server and its DMVs to the command line via a virtual file system. And that’s exactly what dbfs does, it exposes all of SQL Server’s DMVs as text files in a directory. When you access one of the text files…you’ll execute query against the SQL Server and the query output comes back to you via standard output to you Linux console. From there you can use any of your Linux command line fu…and do what you want with the data returned. 

Setting up dbfs

So first, let’s go ahead and set this up. I already have the Microsoft SQL Server repo configured so I can install via yum. If you have SQL on Linux installed, you likely already have this repo too. If not, go ahead and follow the repo setup instructions here. To install dbfs we use yum on RHEL based distributions.

sudo yum install dbfs -y
First off, think about what’s going on under the hood here…we’re going to allow the system to execute queries against DMVs…so let’s try to keep this as secure as possible, I’m going to create a user that is allowed to only query DMVs with the VIEW SERVER STATE permission. So let’s do that…
 
Let’s log into our SQL Server via SQLCMD
sqlcmd -H localhost -U sa -p 
And execute this code to create a user named dbfs_user 
CREATE LOGIN [dbfs_user]
WITH PASSWORD=N'ThisSureIsntSecure',
CHECK_EXPIRATION=OFF,
CHECK_POLICY=OFF
GO
Once created, let’s assign this user permissions to query DMVs
GRANT VIEW SERVER STATE TO [dbfs_user]
The next step is we need to create a directory where dbfs will place all the files representing the DMVs we wish to query
mkdir /tmp/dbfs  
Now, let’s go ahead and configure dbfs. I’m going to place it’s configuration file in /etc/ since that’s the standard location for configuration files on Linux systems.
sudo vi /etc/dbfs.config
And inside that file, let’s use the following configuration. Pretty straight forward. Define a configuration name, here you see server1, the hostname which is the locally installed SQL instance. We’ll use the username and password of the user we just created and also defined is a version. While this isn’t very well documented, the code here shows that if you’re on version 16 (SQL Server 2016) or newer it will create files dbfs files with a .json file extension which exposes your DMV data as…you guessed it JSON. Also if you want to add a second server to dbfs, just repeat the configuration inside the same text file.
[server1]
hostname=localhost
username=dbfs_user
password=ThisSureIsntSecure
version=16

Running dbfs

Now with all the preliminaries out of the way, let’s launch dbfs. Basic syntax here, the actual program name with the parameter -c pointing to the configuration file we just created and the -m parameter pointing to the directory we want to “mount” our DMVs into.
dbfs -c /etc/dbfs.config -m /tmp/dbfs/ 
Now, what’s interesting about dbfs is if you log out dbfs stays running. Honestly, I don’t like that, if this is the case it should be running as a service managed by systemd or whatever init daemon you’re using on your Linux distribution. I mentioned that on their GitHub repo. If this is going to be a user process, then I should have the choice the background the task myself.

Using dbfs

Looking at the source for dbfs it gets a list of all DMVs from sys.system_views from the SQL Server you configured it to connect to, then creates a file for each and every one of those DMVs. So we have full coverage of all the DMVs available to us and since you can use any bash command line fu to access the data now…the options are really limitless. Microsoft has a few good demos on the GitHub repo here. Let’s walk through a few examples now.
 
Accessing a DMV

This is pretty straight forward, you read from the file just like you would read from any other file on a Linux system. So let’s do that…we add the column -t option to make sure all the columns are aligned in the output.
cat dm_exec_connections | column -t

And our output looks like this…

session_id  most_recent_session_id  connect_time  net_transport  protocol_type  protocol_version  endpoint_id  encrypt_option  auth_scheme  node_affinity  num_reads  num_writes  last_read  last_write  net_packet_size  client_net_address  client_tcp_port  local_net_address  local_tcp_port  connection_id  parent_connection_id  most_recent_sql_handle
51          51                      Jun           19             2017           09:46:33:660AM    TCP          TSQL            1946157060   4              FALSE      NTLM        0          7           7                Jun                 19               2017               09:46:34:103AM  Jun            19                    2017                    09:46:34:107AM  8000  127.0.0.1  37641  127.0.0.1  1433  EDC82F4B-D333-4DCA-88BB-4AB2CE9  02000000a0c09f36765d6d3b8a15a90772d74e103ac8b653000000000000000
Notice in the output above how the connect_time column is split incorrectly? We need to tell column to use the tab as a delimiter. By default it uses whitespaces. So let’s do that…
cat dm_exec_connections  | column -t -s $'\t'
And now our output looks much better
session_id  most_recent_session_id  connect_time                net_transport  protocol_type  protocol_version  endpoint_id  encrypt_option  auth_scheme  node_affinity  num_reads  num_writes  last_read                   last_write                  net_packet_size  client_net_address  client_tcp_port  local_net_address  local_tcp_port  connection_id                    parent_connection_id  most_recent_sql_handle
51          51                      Jun 19 2017 09:29:36:340PM  TCP            TSQL           1895825409        4            FALSE           SQL          0              7          7           Jun 19 2017 09:29:36:347PM  Jun 19 2017 09:29:36:350PM  4096             127.0.0.1           42784            127.0.0.1          1433            EDFB041F-B319-4098-B4DE-80739A7                        01000100f0e88f076013d837050000000000000000000000000000000000000

Selecting off a subset of columns

Well you probably noticed that the output is a bit unruly since it’s outputting all of the DMV’s columns. So let’s tame that a bit and pull out particular columns. To do that we’ll use a tool called awk which will print out columns based on the numeric index, so $1 is the first column and so on. 
cat dm_exec_connections | awk '{ print $1,$3,$4,$7 }'
And our output looks like this
session_id connect_time net_transport endpoint_id
51 Jun 19 TCP
Something isn’t right…as DBAs we think of things in rows and columns. So we’re going to count across the top and think the 7th column is going to yield the 7th column and it’s data for each row, right? Well, it will but data processed by awk is whitespace delimited by default and is processed row by row. So the 7th column in the second line isn’t the same as the output in the first line. This can be really frustrating if your row data has spaces in it…like you know…dates.
 
So let’s fix that…the output from the DMVs via dbfs is tab delimited. We can define our delimiter for awk with -F which will allow for whitespaces in our data. Breaking the data only on the tabs. Let’s hope there isn’t any tabs in our data!
cat dm_exec_connections | awk -F $'\t' '{ print $1,$3,$4,$7 }'
And the output from that looks like this, much better but we don’t have the nice columns.
session_id  connect_time  net_transport  endpoint_id
51          Jun           19             2017         02:29:36:340PM  TCP  4
We’re so close, we can’t throw column on the end to make this nice and columnar because awk with this configuration it will remove the tab delimiters on it’s output stream. column by default will do the same thing too, but we can let column do the work for us and have it print tab delimiters in it’s output stream. 
cat dm_exec_connections | column -t -s $'\t' -o $'\t' | awk -F $'\t' '{ print $1,$3,$4,$7 }'
And voila, we end up with some nice neatly formatted output
session_id connect_time               net_transport endpoint_id
51         Jun 19 2017 02:29:36:340PM TCP           4          

Searching in Text

We can search for text in the output using grep, here’s a quick example looking for the dedicated admin connection in dm_os_schedulers
grep DAC dm_os_schedulers
And here’s the output. 
00000005391c0040	64	1048576	0	VISIBLE ONLINE (DAC)	1	1	1	3	5	2	940180	0	00000005392aa160	000000053906e040	0000000539070040	4000	44	0	0

SQL folks…keep in mind, grep will only output lines matched, so we loose the column headers here since they’re part of the standard output stream when accessing the file/DMV data.

Moving forward with dbfs

We need the ability to execute more complex queries from the command line. Vin Yu mentions this here. As DBAs we already have our scripts that we use day to day to help us access, and more importantly make sense of, the data in the DMVs. So dbfs should allow us to execute those scripts somehow. I’m thinking we can have it read a folder on the local Linux system at runtime, create files for those scripts and throw them in the mounted directory and allow them to be accesses like any of the other DMVs. The other option is we place those scripts as views on the server and access them via dbfs. Pros and cons either way. Since it’s open source…I’m thinking about implementing this myself :)

Next is, somehow we need the ability to maintain column context throughout the output stream, for DBAs it’s going to be tough sell having to deal with that. I know JSON is available, but we’re talking about DBAs and sysadmins here as a target audience. 

In closing is a great step forward…giving access into the DMVs from the command line opens up SQL Server to a set of people who are used to accessing performance data this way. Bravo! 

Using Extended Events to Visualize Availability Group Replication Internals

SQL 2014 Service Pack 2 was recently released by Microsoft and there is a ton of great new features and enhancements in this release.This isn’t just a collection of bug fixes…there’s some serious value in this Service Pack. Check out the full list here. One of the key things added in this Service Pack is an enhancement of the Extended Events for AlwaysOn Availability Group replication.

Why are the new Availability Group Extended Event interesting?

If you’ve used Availability Groups in production systems with high transaction volumes you know that replication latency can impact your availability. If you want to brush up on that check out our blog posts on AG Replication Latency, Monitoring for replication latency, and issues with the DMVs when monitoring. These new extended events add insight at nearly every point inside your Availability Group’s replication. More importantly they also include duration. So using these Extended Events we can pinpoint latency inside our Availability Group replication.

Here’s the list and description of the new Extended Events:

  • hadr_log_block_group_commit – Log block group commit processing duration
  • log_block_pushed_to_logpool – Log block start push to log pool
  • log_flush_start – Asynchronous log write start
  • hadr_log_block_compression – Log block compression processing duration
  • hadr_capture_log_block – Primary has captured a log block
  • hadr_capture_filestream_wait
  • ucs_connection_send_msg – UCS transport connection sent message
  • hadr_log_block_send_complete – After a log block message has been sent. This event is only used for fail points
  • log_flush_complete – Reports that synchronous send is complete
  • hadr_receive_harden_lsn_message – When a new hardened LSN from the secondary
  • hadr_db_commit_mgr_harden – Transaction commit harden result from Hadron Commit management
  • hadr_transport_receive_log_block_message – Fires when we’re receiving new log block message
  • hadr_log_block_decompression – Log block decompression processing duration
  • hadr_apply_log_block – Secondary is going to append a log block to the log
  • hadr_send_harden_lsn_message – Crafting a message to send containing a new hardened LSN on a secondary.  Test only
  • hadr_lsn_send_complete – After an LSN message has been sent.
The source for this list and it’s descriptions is Amit Banerjee’s HADR Virtual Chapter Talk and KB3173156

Using Extended Events to visualize Availability Group Replication Internals

Using these Extended Events, we’re able to get some insight into the replication internals of our Availability Groups. Like I described a second ago, we can use these to measure the duration of each event and sift out performance issues in our replication. But we can also use them to get a better understanding of what goes into Availability Group replication, let’s see how.
 
Here’s the steps I used to generate the chart below:
  1. Let’s add an Event Session with all of these new Events above to the primary and secondary replicas of a two replica synchronous Availability Group
  2. Then with the sessions running, insert only one row on the primary replica. This will cause the AG to replicate the data change event and trigger the new Extended Events on both the primary and secondary replicas
  3. Stop the Event Sessions
  4. Open the Event file in SSMS and View Target Data on both the primary and secondary replicas
  5. Order the Event data by timestamp
  6. Merge the event data together based on the timestamp (I did this in a spreadsheet)
With this list we have the entire replication sequence of Events for that single insert replicating data from the primary to the secondary from the perspective of both the primary and secondary. I took that information and made the following graphic for you visualizing what it takes to replicate data in a synchronous Availability Group.

AG Synchrnous Replication Internals

 Figure 1: Availability Group Replication Extended Events

From Figure 1, let’s walk through the events…

On the primary

  • The first event that fires is hadr_log_block_group_commit starting things off. It’s my understanding that this initializes the needed code and program state for AG replication
  • Then up next is log_block_pushed_to_logpool. This fires when a log block is copied to the log pool. The log pool is a special memory location used to store log blocks that may need to be read again, since it’s in memory it keeps the readers from doing a physical IO on the transaction log
  • Then log_flush_start, this fires when log blocks are flushed to disk, normal write ahead logging
  • Once the log block has been copied into the log pool and flushed to disk, it’s read from the log pool and compressed then the hadr_log_block_compression event fires
  • After it’s compressed the AG “captures” the log block for replication to the secondary with hadr_capture_log_block 
  • Then builds and sends the message to the secondary and ucs_connection_send_msg fires
  • Once the message is sent the hadr_log_block_send_complete event fires
  • We mark the synchronous send complete with log_flush_complete

On the secondary

  • On the other side of the fence, basically the inverse occurs. We mark the receipt of the message with hadr_transport_receive_log_block_message
  • The log block is decompressed and hadr_log_block_decompression fires
  • Then the log block is appended to the local log buffer and hadr_apply_log_block fires
  • Then the log block is copied to the local log pool and hadr_pushed_to_logpool fires
  • Next, the asynchronous log write happens and log_flush_start marks that start of that
  • And once the asynchronous log write finishes, log_flush_complete fires
  • Once hardened locally we need to craft a message to send to the primary marking the the LSN is hardened and hadr_send_harden_lsn_message fires
  • ucs_connection_send_msg then transmits fires when the message is sent to the primary
  • hadr_lsn_send_complete marks the fires marking the end of sending the message

Back on the primary

  • hadr_receive_harden_lsn_message fires marking that the secondary has acknowledging receipt of the send LSN
  • And to wrap it all up hadr_db_commit_mgr_harden marks the end of the log block replication for this sequence

A few important notes…

  • Many of these Extended Events fire at different points and multiple times during replication under different conditions. There’s a field called mode, which is an integer value, that marks various conditions for the events. I have yet to decode what each value of mode is for all of the events but will update this blog as I get some insight.
     
  • It’s very important to note that this example highlights an insert into an Availability Group with a primary and only one secondary configured in synchronous Availability Mode. The code path and the subsequent Extended Events that fire can change based on Availability Mode (sync/async), replication health, and many other conditions that can impact replication.

I hope that I’ve shown you some interesting concepts that you can use to help troubleshoot your Availability Group replication using Extended Events. This post, highlights how you can use Extended Events to get some insight into the the internals of Availability Group replication.

If you have any questions about this or your Availability Group configuration or performance please feel free to email me at aen@centinosystems.com

Please follow me on Twitter @nocentino to keep up with new blog posts

References

Great references on the internals of the log pool – https://sqlactions.com/2014/03/31/how-it-works-logpool/

HADR Virtual Chapter – http://hadrvc.sqlpass.org/Home.aspx?EventID=5304

SQL Server, Persistent Memory on NVDIMMs and DAX

Paradigm Shift!

What do I mean by that? Every once in a while a technology comes along and changes the way things are done, moves the bar…well last week Microsoft released a Channel 9 video on persistent memory using NVDIMMs and DAX on Windows 2016…then combining it with SQL Server! This is one of those technologies that moves the bar! Check it out here.

Why is this important?

Relational databases like SQL Server use a transaction log to ensure the durability of the transactional operations to the database. This is so it can ensure its data is consistent in the event of a system failure. SQL Server uses a logging protocol called write ahead logging (WAL). This means that the data change operations must be written to stable, persistent storage before the process can change data in the database file…so this means our database throughput is at the mercy of the performance of the device the changes (log records) are written to.

Another facet of this is disk latency. Historically this has been terribly inefficient, especially in the case of spinning HDDs due to seek time and write latency. Disk latency has gotten much better recently with SSDs and NVMe. To combat disk latency, SQL Server doesn’t perform a disk write for each and every log record. SQL Server writes log records to what’s called a log buffer in main memory, once the log buffer is full it or a transaction commits (or aborts) it flushes that buffer to disk in sector aligned IO operations. This means your transactions have to wait for the disk IO to complete before they’re considered hardened to disk. This is a core bottleneck in RDBMS systems.

What changed?

Well, in NVDIMMs we have a DIMM, which is normally used for main memory, is now backed with a battery backed flash array. This means we can have access to stable storage at main memory speeds, think less than 10 microseconds, not milliseconds. An order of magnitude performance gain! NVDIMMs have two modes of operation, storage mode and memory mode. Storage mode will present the device as a block device to your OS. Memory mode will present this as a memory mapped device. Windows Server 2016 has the drivers to deal with both modes. In the first case you’ll get a block device and all the code that goes along with making a block IO request, the advantage of this mode is no application code changes, it’s just a block device like any other. In the second mode, memory mode, it presents byte addressable memory to the OS as storage, but you may need to change code to take full advantage of this. So, Microsoft has developed a feature called DirectAcces (DAX) in Windows Server 2016 to present memory mode storage to the OS bypassing traditional block IO code. This is where the insanity ensues! You’ll literally have stable storage available to your applications at main memory speeds, committing to stable storage is now just a memory copy!!! Watch the video for performance numbers!

How this will impact you?

Well, in the video Lindsey Allen highlights a code change in SQL Server that modifies how the log buffer writes data. Log writes are immediately considered hardened when committed to the log buffer. We no longer have to wait for the disk IO to complete, side stepping disk latency and IO altogether. This leads to a 2x improvement in the demo application in the video. These tests compare NVMe and DAX, and NVMe is PCI-express attached flash which is as good as it gets for persistent storage today (well until memory mode NVDIMMS take over). You’ll notice that CPU spikes during their load tests, this isn’t an uncommon issue when you remove your storage bottlenecks from your system.

What’s it going to cost?

Looking online, HPE has a 8GB NVDIMM available for $800 at online wholesalers, this is in DIMM form factor. Some are built directly into the motherboard, others are on cards.

Questions that I have?

  1. Foreign Memory Access – In a NUMA system is there be a performance penalty or inconsistent performance when threads need to read or write to a foreign memory area. Log writing is single threaded, so this should be something that we can manage. It does go multithreaded in 2016, so the best thing to do here is test your workload and measure.
  2. Hypervisor Support – can we leverage this awesomeness in a virtual machine?
  3. Error Detection – how to errors percolate back up into the OS/SQL so we know when something goes wrong. Likely using standard disk IO alerts will apply here.

Well hopefully I’ve convinced you how big of a deal this is and you’re as excited as I am!

Please let reach out to me if you like to know more about this – aen@centinosystems.com

CPU Scheduling Basics – Windows and SQL Server

In this post we’re going to introduce the basics of CPU scheduling.

In a computer system, only one thing can happen at a time. More specifically, only one task can be on a processor at a point in time. This can expand to several tasks if the system has multiple processors or a processor with multiple cores, which most modern systems have. For example, a four core system can potentially execute four tasks concurrently. 

So since only one task can be on a logical processor (or core) at one point in time, something has to manage which task gets access to the CPU. This is what the scheduler does. CPU scheduling is a very (VERY) deep topic and I’m going to introduce you to a couple basic concepts that can help you understand how scheduling works in Windows and SQL Server. So when you see things slow down in either of one of these technologies you know where you can start to look first.

If a task isn’t on a CPU, it’s not doing any processing. It’s simply waiting. Each CPU has its own task scheduler and the task scheduler is the thing that goes and assigns work to the logical CPU (or core).

Let’s check out a few examples of schedulers…

First up is Windows, it uses a preemptive, priority scheduler. This means processes with higher priority are allowed to access the CPU first, but also will stop a running process to schedule a higher priority process. Each process gets fixed amount of time on the processor, called a quantum, then Windows moves that process off the CPU and places into a queue waiting for access to the CPU again, so no one process dominates CPU time. 

SQL Server uses a non-preemptive, co-operative scheduler. Which also uses the quantum concept, but the task voluntarily yields and is switched off, placed into a queue waiting for access to the CPU again. Again, this ensures that no one task dominates the CPU, but here moving off the CPU is left up to the thread rather than the scheduler. In SQL Server a task is your work, that query you want to execute. A worker is the execution context of your work. A task is assigned to a worker, which is assigned to a scheduler for execution on the CPU. A scheduler is the manager of what gets on the CPU, there is one per logical processor.

Both Windows and SQL Server schedulers have a concept of process state and there are a couple of deeper nuances for each but here I’m going to generalize them into three states:

  • Running – a task executing on the CPU, an actively running process.
  • Waiting – a task is waiting on something before it can get back onto the CPU. For example, a process could be waiting on an I/O operating such as a read from a disk or network socket.
  • Runnable – a task waiting to get onto the CPU, nothing is preventing it from getting onto the CPU (no waiting)

You often hear CPU measured in percent CPU up to 100. But measuring CPU in terms of percent used is only part of the picture. I like to measure how long processes and tasks are waiting to get access to the CPU. Because what really we’re interested in, is how fast our work is progressing and it can only progress when it’s on a CPU doing work.

So here are a few basic ways to measure CPU pressure in terms of what’s waiting for access to a CPU

  • In Windows there’s a perfmon counter called processor queue length, anything in the queue is waiting to get onto the CPU. Each processor has its own queue. Check out a deeper dive here.
     
  • In SQL Server, we measure the number of tasks that are on the scheduler’s runnable queue, the lower the better. You can find that in the sys.dm_os_schedulers DMV in a column named runnable_task_count. Each CPU will have scheduler, and each will have a runnable task count. The query below will show you the runnable tasks, current tasks, and tasks waiting for a worker per scheduler. One row will be returned per scheduler.
select	  scheduler_id		--ID of the scheduler
	, cpu_id		--ID of the CPU
	, runnable_tasks_count	--a worker, with a task, waiting on CPU.
	, current_tasks_count	--total tasks associated with this scheduler.
	, work_queue_count	--total tasks waiting for a worker.
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
ORDER BY work_queue_count DESC

I hope this helps you get a basic understanding of CPU scheduling in Windows and SQL Server and starts you on a journey of learning more about how it works!

Feel free to contact me to discuss more!

Did you like this post? If so, click here for to sign up for our newsletter to get more SQL goodness delivered right to your inbox!

Twitter @nocentino

Load Testing Your Storage Subsystem with Diskspd – Part III

In our final post in our “Load Testing Your Storage Subsystem with Diskspd” series, we’re going to look at output from Diskspd and run some tests and interpret results. In our first post we showed how performance can vary based on access pattern and IO size. In our second post we showed how to design a test to highlight those performance characteristics and in this post we’ll execute those tests and review the results. 

First let’s walk through the output from Diskspd, for now don’t focus on the actual results. There are four major sections:

  • Test Parameters – here is the test’s parameters. Including the exact command line parameters executed. This is great for reproducing tests.
    Command Line: diskspd.exe -d15 -o1 -F1 -b60K -h -s -L -w100 C:\TEST\iotest.dat
    
    Input parameters:
    
    	timespan:   1
    	-------------
    	duration: 15s
    	warm up time: 5s
    	cool down time: 0s
    	measuring latency
    	random seed: 0
    	path: 'C:\TEST\iotest.dat'
    		think time: 0ms
    		burst size: 0
    		software and hardware write cache disabled
    		performing write test
    		block size: 61440
    		number of outstanding I/O operations: 1
    		thread stride size: 0
    		IO priority: normal
    
  • CPU Usage – CPU usage for the test, recall if you are not using all your bandwidth, you may want to add threads. If your CPU burn is high, you may want to back off on the number of threads.
    Results for timespan 1:
    *******************************************************************************
    
    actual test time:	15.00s
    thread count:		1
    proc count:		2
    
    CPU |  Usage |  User  |  Kernel |  Idle
    -------------------------------------------
       0|  30.10%|   1.04%|   29.06%|  69.89%
       1|   0.10%|   0.10%|    0.00%|  99.78%
    -------------------------------------------
    avg.|  15.10%|   0.57%|   14.53%|  84.84%
    
  • Performance – this is the meat of the test. Here we see bandwidth measured in MB/sec and latency measured in microseconds. With SSDs and today’s super fast storage I/O subsystems, you’ll likely need this level of accuracy. This is alone beats SQLIO in my opinion. I’m not much a fan of IOPs since those numbers require that you know the size of the IO for it to have any meaning. Check out Jeremiah Peschka’s article on this here. Remember, focus on minimizing latency and maximizing I/O please refer back Part I and Part II posts in this series for details.
    Total IO
    thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |      3162378240 |        51471 |     201.04 |    3431.10 |    0.289 |     2.816 | C:\TEST\iotest.dat (20GB)
    -----------------------------------------------------------------------------------------------------
    total:        3162378240 |        51471 |     201.04 |    3431.10 |    0.289 |     2.816
    
    Read IO
    thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |               0 |            0 |       0.00 |       0.00 |    0.000 |       N/A | C:\TEST\iotest.dat (20GB)
    -----------------------------------------------------------------------------------------------------
    total:                 0 |            0 |       0.00 |       0.00 |    0.000 |       N/A
    
    Write IO
    thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |      3162378240 |        51471 |     201.04 |    3431.10 |    0.289 |     2.816 | C:\TEST\iotest.dat (20GB)
    -----------------------------------------------------------------------------------------------------
    total:        3162378240 |        51471 |     201.04 |    3431.10 |    0.289 |     2.816
    
  • Histogram – this gives a great representation of how your test did over the whole run. In this example, 99% of the time our latency was less than 0.654ms…that’s pretty super.
      %-ile |  Read (ms) | Write (ms) | Total (ms)
    ----------------------------------------------
        min |        N/A |      0.059 |      0.059
       25th |        N/A |      0.163 |      0.163
       50th |        N/A |      0.193 |      0.193
       75th |        N/A |      0.218 |      0.218
       90th |        N/A |      0.258 |      0.258
       95th |        N/A |      0.312 |      0.312
       99th |        N/A |      0.654 |      0.654
    3-nines |        N/A |     17.926 |     17.926
    4-nines |        N/A |     18.906 |     18.906
    5-nines |        N/A |    583.568 |    583.568
    6-nines |        N/A |    583.568 |    583.568
    7-nines |        N/A |    583.568 |    583.568
    8-nines |        N/A |    583.568 |    583.568 
        max |        N/A |    583.568 |    583.568
    

Impact of I/O Access Patterns

  • Random

    diskspd.exe -d15 -o32 -t2 -b64K -h -r -L -w0 C:\TEST\iotest.dat

    Read IO
    thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |     16066543616 |       245156 |    1021.49 |   16343.84 |    1.896 |     0.286 | C:\TEST\iotest.dat (20GB)
         1 |     16231759872 |       247677 |    1031.99 |   16511.91 |    1.877 |     0.207 | C:\TEST\iotest.dat (20GB)
    -----------------------------------------------------------------------------------------------------
    total:       32298303488 |       492833 |    2053.48 |   32855.75 |    1.886 |     0.250

    In this test you can see the that there is high throughput and very low latency. This disk is a PCIe attached SSD, so it performs well with a random IO access pattern.

  • Sequential

    diskspd.exe -d15 -o32 -t2 -b64K -h -s -L -w0 C:\TEST\iotest.dat

    Read IO
    thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |     16094724096 |       245586 |    1022.21 |   16355.35 |    1.895 |     0.260 | C:\TEST\iotest.dat (20GB)
         1 |     16263544832 |       248162 |    1032.93 |   16526.91 |    1.875 |     0.185 | C:\TEST\iotest.dat (20GB)
    -----------------------------------------------------------------------------------------------------
    total:       32358268928 |       493748 |    2055.14 |   32882.26 |    1.885 |     0.225

    In this test you can see that the sequential I/O pattern yields a similar performance profile to the random IO test on the SSD. Recall that an SSD does not have to move a disk head or rotate a platter. The access latency to any location on the drive has the same latency cost.  

Impact of I/O sizes

  • Tranaction log simulation  

    diskspd.exe -d15 -o1 -t1 -b60K -h -s -L -w100 C:\TEST\iotest.dat

    Write IO
    thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |      3162378240 |        51471 |     201.04 |    3431.10 |    0.289 |     2.816 | C:\TEST\iotest.dat (20GB)
    -----------------------------------------------------------------------------------------------------
    total:        3162378240 |        51471 |     201.04 |    3431.10 |    0.289 |     2.816

    This test measures access latency of single thread with a very small data transfer, as you can see latency is very low at 0.289. This is expected on a low latency device such as a local attached SSD.
     

  • Backup operation simulation

    diskspd.exe -d15 -o32 -t4 -b512K -h -s -L -w0 C:\TEST\iotest.dat

    Read IO
    thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |      8552185856 |        16312 |     543.17 |    1086.33 |   29.434 |    26.063 | C:\TEST\iotest.dat (20GB)
         1 |      8846311424 |        16873 |     561.85 |    1123.69 |   28.501 |    25.373 | C:\TEST\iotest.dat (20GB)
         2 |      8771338240 |        16730 |     557.09 |    1114.17 |   28.777 |    25.582 | C:\TEST\iotest.dat (20GB)
         3 |      8876720128 |        16931 |     563.78 |    1127.56 |   28.440 |    25.353 | C:\TEST\iotest.dat (20GB)
    -----------------------------------------------------------------------------------------------------
    total:       35046555648 |        66846 |    2225.88 |    4451.76 |   28.783 |    25.593
    

    And finally, our test simulating reading data for a backup. The larger I/Os have a higher latency but also yield a higher transfer rate at 2,225MB/sec.

In this series of post we introduced you into some theory on how drives access data, we presented tests on how to explore the performance profile of your disk subsystem and reviewed Diskspd output for those tests. This should give you the tools and ideas you need to load test your disk subsystem and ensure your SQL Servers will perform well when you put them into production!