Why Did Your Availability Group Creation Fail?

Availability Groups are a fantastic way to provide high availability and disaster recovery for your databases, but they aren’t exactly the easiest thing in the world to pull off correctly. To do it right, there’s a lot of planning and effort that goes into your Availability Group topology. The funny thing about AGs is that as hard as they are to plan…they’re pretty easy to implement…but sometimes things can go wrong. In this post I’m going to show you how to look into things when creating your AGs fails.

When working at a customer site today I encountered an error that I hadn’t seen before when creating an Availability Group. So I’m going to walk you through what happened and how I fixed it. If your AGs fail at creation, you can follow this process to dig into why.

First, let’s try to create our Availability Group:

USE [master]
GO
CREATE AVAILABILITY GROUP [SQL-A]
WITH (AUTOMATED_BACKUP_PREFERENCE = SECONDARY,
DB_FAILOVER = OFF,
DTC_SUPPORT = NONE)
FOR
REPLICA ON N'SQL-A' WITH (ENDPOINT_URL = N'TCP://SQL-A.lab.local:5022',
FAILOVER_MODE = AUTOMATIC, AVAILABILITY_MODE = SYNCHRONOUS_COMMIT, SESSION_TIMEOUT = 10,
BACKUP_PRIORITY = 50, PRIMARY_ROLE(ALLOW_CONNECTIONS = ALL), SECONDARY_ROLE(ALLOW_CONNECTIONS = NO));
GO

But, that fails and we get this error…it tells me what happened and to go look in the SQL Server error log for more details.

Msg 41131, Level 16, State 0, Line 3
Failed to bring availability group 'AG1' online.  The operation timed out. Verify that the local Windows Server Failover Clustering (WSFC) node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persists, you might need to drop the availability group and create it again.
Msg 41152, Level 16, State 2, Line 3
Failed to create availability group 'AG1'. The operation encountered SQL Server error 41131 and has been rolled back. Check the SQL Server error log for more details. When the cause of the error has been resolved, retry CREATE AVAILABILITY GROUP command.

OK, so let’s look in the SQL Server error Log and see what we find.

The state of the local availability replica in availability group 'AG1' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'.  The state changed because the local availability replica is joining the availability group.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
The state of the local availability replica in availability group 'AG1' has changed from 'RESOLVING_NORMAL' to 'NOT_AVAILABLE'.  The state changed because either the associated availability group has been deleted, or the local availability replica has been removed from another SQL Server instance.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
Error: 19435, Severity: 16, State: 1.
Always On: WSFC AG integrity check failed for AG 'SQL-A' with error 41044, severity 16, state 1.

Clearly something is up, the AG tried to come online but couldn’t.
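If you’d rather search the error log from a prompt than scroll through it, here’s a minimal sketch. It assumes you know where your instance’s ERRORLOG file lives (the path in the example is hypothetical and varies by version and instance name), and Find-SqlErrorLogError is a helper name I made up for illustration.

```powershell
# Hypothetical helper: list "Error: N, Severity: N, State: N" entries from a
# SQL Server error log file, along with the message line that follows each one.
function Find-SqlErrorLogError {
    param(
        [Parameter(Mandatory)] [string] $Path   # full path to an ERRORLOG file (you supply it)
    )
    # -Context 0,1 keeps the line after each match, which holds the message text
    Select-String -Path $Path -Pattern 'Error: \d+, Severity: \d+, State: \d+' -Context 0, 1 |
        ForEach-Object {
            [pscustomobject]@{
                Error   = $_.Line.Trim()
                Message = ($_.Context.PostContext -join ' ').Trim()
            }
        }
}

# Example (the path is an assumption, adjust it for your instance):
# Find-SqlErrorLogError -Path 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Log\ERRORLOG'
```

Each object pairs the error number line with the message written right after it, which is how the two lines above appear in the log.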

The error here says to check out the Windows Server Failover Clustering log…so let’s go ahead and do that. But that’s not as straightforward as you’d think. WSFC does write to the event log, but the errors are pretty generic for this issue. Here’s what you’ll see in the System Event Log and the Cluster Events section in the Failover Cluster Manager:

Cluster resource 'AG1' of type 'SQL Server Availability Group' in clustered role 'AG1' failed.

Wow, that’s informative, right? Luckily we still have more information to look into.
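If you want to pull those same events without opening Event Viewer, something like the following should do it. This is a sketch that assumes you’re running it on the cluster node itself and that the failure happened within the last hour; adjust the time window to suit.

```powershell
# Pull recent error-level events written by the failover clustering provider
# from the System event log on the local node.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Level        = 2                         # 2 = Error
    StartTime    = (Get-Date).AddHours(-1)   # assumption: the failure was within the last hour
}

# Get-WinEvent reports an error when nothing matches, so quiet that case down
Get-WinEvent -FilterHashtable $filter -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message |
    Format-List
```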

Let’s dig deeper using the WSFC cluster logs

The cluster logs need to be generated; they’re not readily available as text for us. We can write them out to file with the PowerShell cmdlet Get-ClusterLog. Let’s make a directory and dump the logs into there.

mkdir C:\temp
Get-ClusterLog -Destination C:\temp

Now we have some data to look through!

When we look at the contents of the cluster log files generated by Get-ClusterLog, we’re totally on the other side of the spectrum when it comes to information verbosity. The logs so far have been pretty terse and haven’t really told us what’s causing the failure…well, dig through this log and you’ll likely find your reason and a lot more information. It’s good stuff to look at to get an understanding of the internals of WSFC. Now, the reason my Availability Group creation failed was permissions. Check out the log entries.

INFO  [RES] SQL Server Availability Group: [hadrag] Connect to SQL Server ...
INFO  [RES] SQL Server Availability Group: [hadrag] The connection was established successfully
INFO  [RES] SQL Server Availability Group: [hadrag] Run 'EXEC sp_server_diagnostics 10' returns following information
ERR   [RES] SQL Server Availability Group: [hadrag] ODBC Error: [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The user does not have permission to perform this action. (297)
ERR   [RES] SQL Server Availability Group: [hadrag] Failed to run diagnostics command. See previous log for error message
INFO  [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server
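The cluster log is big, so it helps to filter it down to the lines that matter. Here’s a minimal sketch that pulls only the ERR lines written by the Availability Group resource. Find-HadrError is a name I made up, and the file name pattern in the example is an assumption about how the generated log files are named in your destination folder.

```powershell
# Hypothetical helper: return the ERR lines that the SQL Server Availability Group
# resource ([hadrag]) wrote to one or more cluster log files.
function Find-HadrError {
    param(
        [Parameter(Mandatory)] [string] $Path   # a file path; wildcards are allowed
    )
    Select-String -Path $Path -CaseSensitive -Pattern 'ERR\s+\[RES\] SQL Server Availability Group' |
        ForEach-Object { $_.Line.Trim() }
}

# Example (the destination folder matches the one used earlier in this post):
# Get-ClusterLog -Destination C:\temp -TimeSpan 15   # only the last 15 minutes of log
# Find-HadrError -Path C:\temp\*.log
```

Limiting the log with -TimeSpan right after a failed attempt keeps the files small, and the filter then leaves you with just the handful of lines like the ones shown above.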

Well that’s pretty clear about what’s going on…the cluster’s Availability Group resource connected to SQL Server but didn’t have permission to run the very important sp_server_diagnostics stored procedure. A quick internet search to find a fix yielded this article from Mike Fal (b | t) which points to this Microsoft article detailing the issue and fix.

For those that don’t want to click the links here’s the code to adjust the permissions and allow your Availability Group to create.

GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM];
GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM];
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM];

So to review…here’s how we found our issue. 

  1. Read the error the create script gives you
  2. Read the SQL Server error log
  3. Look at your System Event log
  4. Dump your Cluster Logs and review

Use this technique if you find yourself in a situation where your AG won’t come online or worse…fails over unexpectedly or won’t come back online. 

New Pluralsight Course – LFCE: Network and Host Security

My new course “LFCE: Network and Host Security” is now available on Pluralsight here! If you want to learn about the course, check out the trailer here or if you want to dive right in check it out here!

This course targets IT professionals that design and maintain RHEL/CentOS based enterprises. It aligns with the Linux Foundation Certified System Administrator (LFCS) and Linux Foundation Certified Engineer (LFCE) and also Red Hat’s RHCSA and RHCE certifications. The course can be used by both the IT pro learning new skills and the senior system administrator preparing for the certification exam.

Let’s take your Linux sysadmin skills to the next level and get you started on your LFCS/LFCE learning path.

If you’re in the SQL Server community and want to learn how to secure your Linux systems…this course is for you too! You have heard that Microsoft has SQL Server for Linux now, right? If not…read this!

The modules of the course are:

  • Linux Security Concept and Architectures – Introduces you to the fundamental concepts needed for securing your environment
  • Securing Hosts and Services – iptables and TCP Wrappers – Host based firewall concepts and techniques with iptables and TCP Wrappers
  • Securing Hosts and Services – firewalld – Learn to leverage firewalld to develop more complex firewall systems…simply. Includes concepts such as zones, services, ports and NAT
  • Remote Access – OpenSSH – We’ll look at encryption, authentication and how to configure SSH for public key authentication
  • Remote Access – Tools and Techniques  – SSH is more than just remote access, we’ll look at secure copy, tunneling and how to use windowing systems such as X11 and VNC…securely.

Check out the course at Pluralsight!

Speaking at PowerShell Summit!

Speaking at PowerShell + DevOps Global Summit 2017!

I’m proud to announce that I will be speaking at PowerShell + DevOps Global Summit 2017. The conference runs from April 9th 2017 through April 12th 2017. This is an incredible event packed with fantastic content and speakers. Check out the amazing schedule!

This year I have two sessions!

On Tuesday, April 10th at 10:00AM – My session is with none other than Jason Helmick. Our session is “Cross platform Management – Windows/Linux”

Here’s the abstract

Let Jason Helmick and Anthony Nocentino take you through a fun filled, demo heavy adventure of how Windows and Linux admins can work together managing a heterogeneous environment. You will learn all you need to know from both sides of the aisle to get started!

On Wednesday, April 11th at 10:00AM – I’m presenting solo on “Linux Fundamentals for the PowerShell Expert”

Here’s the abstract

PowerShell is now available on Linux and your management wants you to leverage this shift in technology to more effectively manage your systems, but you’re a Windows guy! Don’t fear, it’s just an operating system! It has all the same components Windows has and in this session we’ll show you that.

We will look at the Linux operating system architecture and show you how to interact with and manage Linux systems! By the end of this session you’ll be ready to go back to the office and get started working with Linux. In this session we’ll cover the following:

  • Process control
  • Service control
  • Package installation
  • Configuration management
  • System resource management (CPU, disk and memory)
  • Using PowerShell to interact with Linux systems

Using dbatools for automated restore and CHECKDB

OK, so if you haven’t heard of the dbatools.io project run by Chrissy LeMaire and company…you’ve likely been living under a rock. I strongly encourage you to check it out ASAP. What they’re doing will make your life as a DBA easier…immediately. Here’s an example…

One of the things I like to do as a DBA is backup my databases, restore them to another server and run CHECKDB on them. There are some cmdlets in the dbatools project, in particular the Snowball release, that really make this easy. In this post I’m going to outline a quick solution I had to throw together this week to help me achieve this goal. We’ve all likely written code to do this using any number of technologies and techniques…wait until you see how easy it is using the dbatools project.

Requirements

  1. Automation – Complete autopilot, no human interaction.
  2. Report job status – Accurate reporting in the event the job failed, the CHECKDB failed or the restore failed.

Solution

  1. Use dbatools cmdlets for restore and CHECKDB operations
  2. Use SQL Agent Job automation, logging and alerting

So let’s walk through this implementation together.

Up first, here’s the PowerShell script used to restore and CHECKDB the database. Save this code into a file named restore_databses.ps1

Let’s walk through what’s going on here. First, the line with $ErrorActionPreference = "Stop" is crucial because it tells our script to stop when it encounters an error. Yes, that’s what I want. The job stops and the error from the cmdlets will reach the SQL Agent job we have driving the process. Using this, the job will fail, and I’ll have a nice log telling me exactly what happened.

Next we have some variables set, including the backup path and the location of the data and log files on the destination system.

Now, here’s the Restore-DbaDatabase cmdlet from the dbatools project. This cmdlet will traverse the backup path defined in the -Path parameter, find all the backups and build the restore sequence for you. Yes…really! If you don’t supply a parameter defining a point in time, it will build a restore sequence using the most recent backups available in the share. The next few parameters define the destination data and log directories and tell the restore to overwrite the database if it already exists on the destination server. The next parameter tells the restore to ignore log backups. This is sufficient in my implementation because I’m running full backups daily and I don’t need point in time recovery. You might, so give it a try. CHECKDB can take a long time, so on the Invoke-SqlCmd2 calls the final parameter tells Invoke-SqlCmd2 not to time out while running its query.

Now, I need to run some T-SQL to clean up the databases. For example, I change the recovery model, then shrink the log. This is so I don’t have a bunch of production sized log files lying around on the destination system. I do this after each restore so I can save a little space. And finally, I run CHECKDB against the database.
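To make the steps described above concrete, here’s a rough sketch of how such a script could be put together. Treat it as an illustration under assumptions, not the exact script: the function names, server name, paths and database name are all hypothetical, the shrink target and log file_id are guesses, and the parameter names follow the dbatools releases I’m aware of. Newer dbatools releases replace Invoke-Sqlcmd2 with Invoke-DbaQuery, so check what your installed version provides.

```powershell
# Sketch only. Requires the dbatools module. Every name, path and server is hypothetical.
$ErrorActionPreference = "Stop"   # a failing cmdlet stops the script, which fails the Agent job

# Build the post-restore T-SQL: simple recovery, shrink the log, then CHECKDB
function Get-PostRestoreQuery {
    param([Parameter(Mandatory)] [string] $Database)
    "ALTER DATABASE [$Database] SET RECOVERY SIMPLE;"
    "USE [$Database]; DBCC SHRINKFILE (2, 1024);"   # assumption: file_id 2 is the log, target 1024 MB
    "DBCC CHECKDB ([$Database]) WITH NO_INFOMSGS;"
}

function Invoke-RestoreAndCheck {
    param(
        [Parameter(Mandatory)] [string] $Server,
        [Parameter(Mandatory)] [string] $Database,
        [Parameter(Mandatory)] [string] $BackupPath,
        [Parameter(Mandatory)] [string] $DataDirectory,
        [Parameter(Mandatory)] [string] $LogDirectory
    )
    # Scan $BackupPath, build the restore sequence from the most recent backups and run it
    Restore-DbaDatabase -SqlInstance $Server -Path $BackupPath `
        -DestinationDataDirectory $DataDirectory -DestinationLogDirectory $LogDirectory `
        -WithReplace -IgnoreLogBackup

    foreach ($query in Get-PostRestoreQuery -Database $Database) {
        # -QueryTimeout 0 disables the timeout, since CHECKDB can run for a long time
        Invoke-Sqlcmd2 -ServerInstance $Server -Query $query -QueryTimeout 0
    }
}

# Example call (all values are placeholders):
# Invoke-RestoreAndCheck -Server 'SQL-B' -Database 'MyDatabase' `
#     -BackupPath '\\backupserver\backups\MyDatabase' -DataDirectory 'D:\DATA' -LogDirectory 'L:\LOGS'
```

Keeping the T-SQL in its own small function makes it easy to see, and change, exactly what runs against the restored database.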

If you want to do this for more than one database, you could easily parameterize this code and drive the process with a loop. You’re creative…give it a try.
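One way that loop could look is sketched below. Invoke-ForEachDatabase is a made up helper, not part of dbatools. It runs whatever action you hand it once per database, keeps going when one fails, and throws at the end so the SQL Agent job still reports a failure.

```powershell
# Hypothetical wrapper: run the same action for every database and fail at the end
# if any of them failed, so one bad restore doesn't hide the results for the rest.
function Invoke-ForEachDatabase {
    param(
        [Parameter(Mandatory)] [string[]]    $Database,
        [Parameter(Mandatory)] [scriptblock] $Action   # receives the database name as its argument
    )
    $failed = @()
    foreach ($name in $Database) {
        try {
            & $Action $name
        }
        catch {
            # Only terminating errors land here, so keep $ErrorActionPreference = "Stop"
            Write-Warning "$name failed: $($_.Exception.Message)"
            $failed += $name
        }
    }
    if ($failed.Count -gt 0) {
        throw "Restore/CHECKDB failed for: $($failed -join ', ')"
    }
}

# Example (the database names and the action body are placeholders):
# Invoke-ForEachDatabase -Database 'Sales', 'HR' -Action { param($db) Write-Output "restore and check $db here" }
```

Whether you stop at the first failure or carry on and report at the end is a design choice. I’ve sketched the second because a single broken backup shouldn’t stop the other databases from being checked.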

Now, I take all this and wrap it up in a SQL Agent job.

 Figure 1: SQL Agent Job Step Definition

Using a SQL Agent job, we get automation, reporting and alerting. I’ll know average run times, I’ll know if the job fails and have a log of why, and it sends me an email with the job’s results.

The SQL Agent job type is set to Operating system (CmdExec), rather than PowerShell. We run the job this way because we want to use the latest version of PowerShell installed on our system. In this case it’s version 5.1. I believe the SQL Agent PowerShell job step on SQL 2012 uses version 4, and when I used it, it wasn’t able to load the dbatools module.

We need to ensure we install the dbatools module as administrator. This way the module is available to everyone on the system, including the SQL Agent user, not just the user installing the module. Simply run a PowerShell session as administrator and use Install-Module dbatools. If you need more assistance check out this for help.
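Here’s roughly what that looks like, along with a quick check that the module landed somewhere every account can see it. This assumes the machine can reach the PowerShell Gallery.

```powershell
# Run in an elevated (Run as administrator) PowerShell session.
$PSVersionTable.PSVersion                       # confirm which PowerShell version you're on

Install-Module -Name dbatools -Scope AllUsers   # AllUsers installs it for every account on the machine

# Confirm the module can be found, and where it was installed
Get-Module -ListAvailable -Name dbatools | Select-Object Name, Version, ModuleBase
```

If ModuleBase points into a user profile instead of a system-wide location, the SQL Agent account probably won’t be able to load it.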

From a testing standpoint I confirmed the following things…

  1. When a restore fails, it’s logged to the SQL Agent job’s log and I get an alert.
  2. When one of the Invoke-SqlCmd2 calls fails, it’s logged to the SQL Agent job’s log and I get an alert.
  3. When CHECKDB finds a corruption in a database, it’s logged to the SQL Agent job’s log, the SQL Server Error Log and I get an alert. For testing this I used Paul Randal’s corrupt databases which he has available here.

So in this post, we discussed a solution to a common DBA problem: backing up, restoring and running CHECKDB on a set of databases. Using dbatools, you can do this with a very simple solution like the one I described here. I like simple. Simple is easier to maintain. Certainly there are some features I want to add to this. Specifically, I’d like to write some more verbose information into the SQL Agent job’s log or use the job step’s ability to log to a file. Using those logs I can easily review the exact runtimes of each restore and CHECKDB.

Give dbatools a try. You won’t be disappointed…really go there now!

TugaIT – Pre-conference workshop on PowerShell on Linux

When – Thursday, May 18, 2017

Where – TUGA IT – Lisbon, Portugal

Full Day Session – “Open Source PowerShell on Linux – Skills to Manage Your Heterogenous Data Center“ 

Registration Link – https://app.weventual.com/detalheEvento.action?iDEvento=4011

  • Early Bird Price – before 03/18/2017 – 150€
  • Normal Price – before 05/01/2017 – 200€
  • Late Registration – 05/18/2017 – 250€

PowerShell is now available on Linux and Mac and you want to use it to manage your multi-platform data center. In this workshop we will introduce Open Source PowerShell and learn why this is such a groundbreaking technology shift. Then we’ll get into the essentials of using PowerShell on Linux and Mac. We’ll start with installing PowerShell and building PowerShell from source, then work our way into using cmdlets and bash integration, building pipelines and remoting scenarios with heterogenous operating systems, and discuss Desired State Configuration.

You will learn how to

  • Set up your environment for multi-platform management
  • Bash and PowerShell scripting fundamentals
  • Building command pipelines in Bash and PowerShell
  • Toolmaking in PowerShell
  • Configure remoting in multi-platform environments
  • Configuration management basics with Desired State Configuration

Topics

  • Setting up your OpenSource PowerShell environment
  • Working with PowerShell cmdlets and bash integration
  • Comparing the PowerShell pipeline and a UNIX style text-based pipeline
  • PowerShell concepts for building more general toolmaking
  • Remoting in multi-platform environments
  • Leveraging OpenSource PowerShell in your data centers with Desired State Configuration
  • What’s next and limitations

Prerequisites 

This is a fundamentals level workshop. This workshop’s intent is to introduce you to the technologies and get you started. Attendees should have basic understanding of the Windows and Linux operating systems.


Speaking at SQLSaturday Chicago – 600!

Speaking at SQLSaturday Chicago!

I’m proud to announce that I will be speaking at SQL Saturday Chicago on March 11th 2017! And wow, 600 SQLSaturdays! This one won’t let you down. Check out the amazing schedule!

If you don’t know what SQLSaturday is, it’s a whole day of SQL Server training available to you at no cost!

If you haven’t been to a SQLSaturday, what are you waiting for! Sign up now!

My presentation is “Networking Internals for the SQL Server Professional”

Here’s the abstract for the talk

Once data leaves your SQL Server do you know what happens or is the world of networking a black box to you? Would you like to know how data is packaged up and transmitted to other systems and what to do when things go wrong? Are you tired of being frustrated with the network team? In this session we introduce how data moves between systems on networks and TCP/IP internals. We’ll discuss real world scenarios showing you how your network’s performance impacts the performance of your SQL Server and even your recovery objectives.

Friend of Redgate – 2017

I’m excited to announce that I have been named a Friend of Redgate for 2017. The program targets influential people in their respective technical communities such as SQL, .NET and ALM and enables us to participate in the conversation around product and community development.

As a multi-year awardee in the program I get to see first hand the continuing dedication Redgate has to the SQL community and to making great software. I met a ton of really cool, very dedicated people along the way. Thanks for the recognition and I look forward to another great year!

Redgate makes outstanding products! While I focus mainly on the DBA side of things such as SQL Monitor, SQL Backup and SQL Prompt there are many more. I’ve used these tools for years and let’s just say they’re awesome.

Redgate isn’t just software, they’re committed to community and education. Here are some of the things they do to support technical communities:

  • Online resources – Simple Talk, SQL Server Central, books and free eBooks. These resources aren’t marketing fluff, they’re killer content written by real experts
  • Events – hosting events, exhibiting at events and supporting user groups across the world. One word can describe this, engaged
Thank you to Redgate for this opportunity! I look forward to participating in this program, sharing my thoughts and learning as much as I can from all involved.
If you’d like to talk about Redgate’s products and where they fit into your SQL Server system please feel free to contact me.

Follow me on Twitter: @nocentino

Monitoring SLAs with SQL Monitor Reporting

Proactive Reporting for SQL Server

If you’re a return reader of this blog you know I write often about monitoring and performance of Availability Groups. I’m a very big proponent of using monitoring techniques to ensure you’re meeting your service level agreements in terms of recovery time objective and recovery point objective. In my in person training sessions on “Performance Monitoring AlwaysOn Availability Groups”, I emphasize the need for knowing what your system’s baseline for healthy replication is and knowing when your system deviates from that baseline. From a monitoring perspective, there are really two key concepts here I want to dig into…reactive monitoring and proactive monitoring.

Reactive Monitoring

Reactive monitoring is configuring a metric, setting thresholds for alerting and reacting when you get the alert. This type of monitoring is critical to the operations of your system. The alerts we configure should model the healthy state of our system…when our system deviates outside of that state, we certainly want to know about that so that we can act…well really react accordingly.

Proactive Monitoring

Proactive monitoring with an alert based monitoring tool is a little harder. What DBAs and architects do is periodically sit down and go through their existing monitoring systems and review the data over some time interval. And if we’re honest with ourselves we try to do this at regular intervals but don’t get to it very often because we’re busy. And when we do finally get in there to look it’s usually to do a post mortem on some sort of production issue…then very often we find that a critical performance attribute had been slowly creeping up over time until it reached a tipping point and caused a production issue. We do our analysis, make our system corrections and move on. Still not exactly proactive. Mostly because there is still a person in the process.

Reporting on System State

With our Reactive Monitoring model, we already define what a healthy system state is. And let’s take people out of the equation. In Redgate’s latest release of SQL Monitor they added a reporting module. In here you can define reports that will represent the state of your system and you can get a snapshot view of what’s critical to you about your SQL Server environment. So if you are running Availability Groups, like I mentioned above, you can configure your report to have those critical performance metrics already set up so you can quickly get to them via the SQL Monitor interface. But better yet, you can schedule the report to be delivered right to your inbox. Yes, another email. But the report is a simple PDF…you give it a glance, process the data and move on. This saves you from having to go into SQL Monitor’s web interface to view the state of your system. Find something odd? Jump into the Web UI and start digging for the root cause.

Reporting gives us two key advantages

  1. Point in time snapshots of our system’s health – we’ll get that report dropped into our mailbox and on that report are the most critical performance metrics that define the health of our system.
  2. Ability to see slowly changing trends – reporting helps us focus on trends. It allows us to zoom out a bit and view our system over time. Less reacting, more “proacting”

OK and one more…for me, as a consultant, I can define reports for clients and have them emailed to me periodically to review. Let’s take a second to build a simple report together.

Creating Your Own Reports

Now normally, I’d show you a bunch of screenshots on how to create a report in SQL Monitor…but rather than do that…go ahead and click on the menu below and start trying out the reporting interface yourself using Redgate’s publicly available SQL Monitor Demo Site!

A couple reports I think you should look at right away are

  1. Example Report – this is the landing page on the demo site and as you can see we get a snapshot of our servers’ performance.
  2. SCC – Custom Metrics – in SQL Monitor you can add your own custom metrics, these are things that you think are business critical…the report in the demo site shows you SQLServerCentral custom metrics of Emails Sent per Hour and Forum Posts per Hour.
  3. Storage – here you’ll find things like fastest filling disk, disk space used, and database sizes. 
  4. Create Your Own – Download the trial version and create your own report. What are the things that are important to you that you want to have right at your fingertips when you log in or have a report land in your inbox with that data?
In closing, know that your monitoring strategy should take into account both types of monitoring, reactive and proactive so you can ensure you’re meeting your service levels. Start pushing things into the proactive category as much as possible. Leverage SQL Monitor’s reporting to better understand the state of your system. 

Speaking at PowerShell Virtual Group of PASS

This month I’ll be speaking to the PowerShell Virtual Chapter of PASS. The session is on Linux OS Fundamentals for the SQL Admin. At the core of the session we will introduce you to OS concepts like managing files and file systems, installation packages, using PowerShell on Linux, managing system services, commands and processes and system resource management. This session is intended for those who have never seen or have very little exposure to Linux but are seasoned Windows or SQL administrators. Things like processes, memory utilization and writing scripts should be familiar to you but are not required.

Sign up now! https://attendee.gotowebinar.com/register/4762712017177605123

Wednesday, February 1, 12:00PM-1:00PM Eastern (GMT-5)

Abstract

PowerShell and SQL Server are now available on Linux and management wants you to leverage this shift in technology to more effectively manage your systems, but you’re a Windows admin! Don’t fear! It’s just an operating system! It has all the same components Windows has and in this session we’ll show you that. We will look at the Linux operating system architecture and show you how to interact with and manage Linux systems. By the end of this session you’ll be ready to go back to the office and get started working with Linux with a fundamental understanding of how it works.

Interested in growing your knowledge about database systems? Sign up for our newsletter today!

Weekly Newsletter

This week we started our Centino Systems weekly newsletter. Check out the first edition here!

The newsletter is going to include the latest in SQL Server and other things in technology that I think are important or interesting…and maybe you will too!

So if you’d like to subscribe to the newsletter go ahead and sign up here!