Availability Groups are a fantastic way to provide high availability and disaster recovery for your databases, but it isn’t exactly the easiest thing in the world to pull off correctly. To do it right there’s a lot of planning and effort that goes into your Availability Group topology. The funny thing about AGs is as hard as they are to plan…they’re pretty easy to implement…but sometimes things can go wrong. In this post I’m going to show you how to look into things when creating your AGs fails.
When working at a customer site today I encountered and error that I haven’t seen before when creating an Availability Group. So I’m going to walk you through what happened and how I fixed it. So if your AGs fail at creation, you can follow this process to dig into why.
First, let’s try to create our Availability Group
USE [master] GO CREATE AVAILABILITY GROUP [SQL-A] WITH (AUTOMATED_BACKUP_PREFERENCE = SECONDARY, DB_FAILOVER = OFF, DTC_SUPPORT = NONE) FOR REPLICA ON N’SQL-A’ WITH (ENDPOINT_URL = N’TCP://SQL-A.lab.local:5022', FAILOVER_MODE = AUTOMATIC, AVAILABILITY_MODE = SYNCHRONOUS_COMMIT, SESSION_TIMEOUT = 10, BACKUP_PRIORITY = 50, PRIMARY_ROLE(ALLOW_CONNECTIONS = ALL), SECONDARY_ROLE(ALLOW_CONNECTIONS = NO)); GO
But, that fails and we get this error…it tells me what happened and to go look in the SQL Server error log for more details.
Msg 41131, Level 16, State 0, Line 3 Failed to bring availability group ‘AG1' online. The operation timed out. Verify that the local Windows Server Failover Clustering (WSFC) node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persists, you might need to drop the availability group and create it again.
Msg 41152, Level 16, State 2, Line 3 Failed to create availability group ‘AG1'. The operation encountered SQL Server error 41131 and has been rolled back. Check the SQL Server error log for more details. When the cause of the error has been resolved, retry CREATE AVAILABILITY GROUP command.
OK, so let’s look in the SQL Server error Log and see what we find.
The state of the local availability replica in availability group ’AG1 has changed from ‘NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The state changed because the local availability replica is joining the availability group. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log. The state of the local availability replica in availability group ’AG1’ has changed from ‘RESOLVING_NORMAL' to 'NOT_AVAILABLE'. The state changed because either the associated availability group has been deleted, or the local availability replica has been removed from another SQL Server instance. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log. Error: 19435, Severity: 16, State: 1. Always On: WSFC AG integrity check failed for AG ’SQL-A’ with error 41044, severity 16, state 1.
Clearly something is up, the AG tried to come online but couldn’t.
The error here say check out the Windows Server Failover Clustering log…so let’s go ahead and do that. But that’s not as straightforward as you think. WSFC does write to the event log, but the errors are pretty generic for this issue. Here’s what you’ll see in the System Event Log and the Cluster Events section in the Failover Cluster Manager
Cluster resource ’AG1’ of type ‘SQL Server Availability Group’ in clustered role 'AG1' failed.
Wow, that’s informative, right? Luckily we still have more information to look into.
Let’s dig deeper with using the WSFC cluster logs
The cluster logs need to be queried, they’re not readily available as text for us. We can write them out to file with this PowerShell cmdlet Get-ClusterLogs. Let’s make a directory and dump the logs into there.
mkdir C:\temp Get-ClusterLogs -Destination C:\temp
Now we have some data to look through!
When we look at the contents of the cluster logs files generates by Get-ClusterLogs, we’re totally on the other side of the spectrum when it comes to information verbosity. The logs so far have been pretty terse and haven’t really told us about what’s causing the failure…well dig through this log and you’ll likely find your reason and a lot more information. Good stuff to look at to get an understanding of the internals of WSFCs. Now for the the reason my Availability Group creation failed was permissions. Check out the log entries.
INFO [RES] SQL Server Availability Group: [hadrag] Connect to SQL Server ... INFO [RES] SQL Server Availability Group: [hadrag] The connection was established successfully INFO [RES] SQL Server Availability Group: [hadrag] Run 'EXEC sp_server_diagnostics 10' returns following information ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error:  [Microsoft][SQL Server Native Client 11.0][SQL Server]The user does not have permission to perform this action. (297) ERR [RES] SQL Server Availability Group: [hadrag] Failed to run diagnostics command. See previous log for error message INFO [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server
Well that’s pretty clear about what’s going on…the process creating the AG couldn’t connect to SQL Server to run the very important sp_server_diagnostics stored procedure. A quick internet search to find a fix yielded this article from Mike Fal (b | t) which points to this Microsoft article detailing the issue and fix.
For those that don’t want to click the links here’s the code to adjust the permissions and allow your Availability Group to create.
GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM]; GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM]; GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM];
So to review…here’s how we found our issue.
- Read the error the create script gives you
- Read the SQL Server error log
- Look at your System Event log
- Dump your Cluster Logs and review
Use this technique if you find yourself in a situation where your AG won’t come online or worse…fails over unexpectedly or won’t come back online.