Monthly Archives: February 2015

Book Review – SQL Server Internals: In-Memory OLTP

In-Memory OLTP – a potential game changing technology

Every once in a while a technology comes out that has the potential to change things dramatically. In-Memory OLTP (Hekaton) is one of them. The design team set out with a goal of reaching an order of magnitude improvement over existing technologies and techniques. To do so they had to rethink key facets of the relational database system, latching, locking, logging and statement compilation. When a technology as potentially disruptive as this comes along it gets everyone’s attention. When an opportunity to review a book based on this technology came along it certainly is worth the effort. I spent a ton of time with this book (maybe a little too much), reading and re-reading chapters but it was worth every minute.

SQL Server Internals: In-Memory OLTP by Kalen Delaney ( twitter | blog ) is one of the first books published dedicated to the topic and definitely will be (is) the go-to book on the topic. The author is a well known SQL Server internals expert that has an extensive history of top-tier publications and presentations and this book is no exception.

SQL Server Internals: In-Memory OLTP

In-Memory OLTP introduces a fundamental shift in how the database engine stores and accesses data. This book walks you through each of these topics with almost perfect precision. I would suggest reading this book from beginning to end. This way, when needed, you can refer back to a chapter and have an understanding of its context in the larger system.

  • The first three chapters are the preliminaries of In-Memory OLTP. Chapter 2 introduces the basics and some of the limitations of the technology and in chapter 3 the reasons behind those limitations become clear with the discussion of row structure. These chapters bring into light that we’re dealing with something new and unique. A traditional in-memory data structure that respects the ACID properties of a transactional relational database system
  • I felt than the second half of Chapter 4 on range indexes struggled to eloquently describe the technical nuances of the bw-tree data structure. I read this chapter several times and still struggled. Also, it is not entirely clear what the benefit of the mapping table is, I had to refer to the research paper for clarity here
  • Chapter 5 describes transactions and the opportunistic locking design that In-Memory OLTP uses. In contrast to pessimistic locking which requires latches and locks, opportunistic locking uses a validation technique at the time a transaction attempts to commit it’s data. Even thought it’s covered in detail in other texts, I would have liked deeper coverage on read phenomena here.
  • The best chapter in the book is Chapter 6 logging, checkpoint and recovery. Logging and recovery are completely re-engineered in In-Memory OLTP. This chapter walks you through the entire process culminating with an exceptional discussion on the recovery process. This chapter is so smooth it feels like a conversation.
  • Chapter 7 describes native compilation of tables and stored procedures, it’s interesting that after the bottlenecks of disk, latching and locking are removed we’re left with compilation as a potential bottleneck. This chapter presents the techniques used by In-Memory OLTP’s native compilation feature and it’s limitations.
  • The book concludes with a chapter on configuration best practices and where In-Memory OLTP fit in the database world.  This part of the book is key for those that have to evaluate a technology for it’s architectural fit into their data-tier.

There are few shortcomings with this book, the bw-tree could have been more clearly explained. Parts of the book may be hard to approach for some, but it is covering a bleeding-edge topic that is still being actively researched. The book toes the line between IT topic and graduate level computer science. I’m surprised I didn’t see some complexity analysis in some of the chapters :)

My detailed chapter notes are available here 

In summary:

This is the book that you will reach for when you have a question about how it works. I really look forward to when In-Memory OLTP matures enough that some it current limitations are relaxed.

Thank you to Kalen and the folks at RedGate for this release!

If you need help evaluating In-Memory OLTP and where it fits in your data-tier or system architecture, please feel free to contact me at

Follow me on Twitter @nocentino

Book Review – SQL Server Internals: In-Memory OLTP – Detailed Notes

Here are my unedited chapter notes:

Chapter 1
Chapter 2
  • Collation – current version requires BIN2 on character index columns. Best to do so at the column level. Supports only sorting, comparison and grouping. Will remove the need for case sensitive code on tables and columns but not data.
  • Collate database_default to handle tempdb collation of temp objects. Research more.
  • Interpreted SQL via interop useful for ad hoc or migration of code.  Doesn’t perform as well as compiled.
  • Key point restrictions really show that the Hk tables need to be isolated bc of the interface between the engines. P 46
Chapter 3
  • The restrictions of chapter 2 become clearer in this chapter with the description of the row structure. Traditional CS data structure
  • All indexes can be thought of as covering, each row has the actual data.
  • Also clear why tables structures can’t be modified. P51
  • Good example highlighting the time stamp usage in MVCC
Chapter 4
Hash Indexes
  • Indexes must be defined at table creation – part of the header
  • Indexes are in memory structures only and operations on them are not logged
  • Hash index – hashing function and collisions explained. Linked lists in buckets
  • Index pointer – array in the row header. Null when row points to no other rows. Points to the “next” row in the index at that hashed index entry. Every row had at least one pointer to it, either directly from the hash index bucket or from another row
  • After reading a second time this was much clearer
  • Hash indexes useful for equality search with high cardinality
  • For seeking on a hash index you must provide all key columns. For example last and first with a search on last will not use the index
  • Buckets – number of elements in the has array. Should be equal or greater than the cardinality of the index key columns. So that a bucket points to a unique values chain. Too few you get collisions on different values. Too many is wasteful. dm_db_xtp_ hash_index_stats
  • Large average chain length could mean
    • 1. Empty Bucket count low. Avg and max chain lengths are similar. Many different index keys to the same bucket.
    • 2. Empty bucket count high. Max length is high in relation to average chain length. Many row with duplicate keys or skew in key values. Not evenly distributed
  • Short chains along with high empty bucket cube could indicate bucket count is set too high
Range Indexes
  • Range indexes – If cardinality is unknown or will be range searching hash indexes aren’t helpful.
  • Range index connects ties at the leaf level using a bw-tree. Latch free variant of a b-tree. Index pages are variable sized. Index page contains an ordered set of key values. For each value there’s a pointer to either another internal index page or to a leaf page pointing to a data row. At the data row there could be a list of rows linked together on the key value
  • A key distinction for bw-trees is the page pointer is a logical PID Rather than a physical memory address.  PID is the position in the mapping table which contains the physical location. Pages are never updated they are replaced and the mapping table is updated so the PID uses the new physical address
  • At the leaf level still uses the PID which points to the data row
  • Delta record – a page containing an insert or update. The mapping table is updated with the physical address of the delta record. Searching/range scanning may have to traverse this chain that is until consolidation occurs. Combining pages into a new base page. Benefit is not having to update tree during each update
  • It’s not clearly explained what the benefit of the mapping table is. 
Chapter 5
  • Pessimistic acquiring a shared lock or range lock. Opportunistic checking that rows haven’t changed before commit of transaction
  • In MVCC SQL assumes that concurrent transactions won’t interfere and performs validation checks once a transaction commits to ensure it obeys the required isolation level
  • Isolation levels used in in memory OLTP.
    • Snapshot
    • Repeatable read
    • Serializable
    • There should be a description of the read phenomenon
  • Cross container transactions are really two sub transactions. one disk based and one in memory transaction
  • If executing an explicit transaction it will run in read committed. If accessing a in memory OLTP then this will error as this is an unsupported isolation level.  Do specify the isolation level as a table hint.  Read committed is supported during auto-commit or implicit transactions
  • Technically for a single statement transaction there is no difference between read committed and snapshot on a memory optimized table
  • Snapshot cannot be used in cross container transactions because each is considered/implemented as a sub transaction
  • The default isolation level is snapshot. There’s a new DB option to set it default for memory optimized tables. Alternatively set the isolation level with a table hint
  • Monitoring transactions sys.dm_db_xtp_transactions xtp_transaction_id is the sequential global transaction ID used to begint and endt
  • Read write conflict a transaction tries to update a row that has been updated by an active transaction. But since there are no locks anomalies can still occur.
Validation phase
  1. Check for violations of isolation levels see read committed example on pg 107
  2. Wait for commit dependencies to goto 0
  3. Log the changes
  • This is the only waiting that can occur in memory optimized tables.
  • Waits could happen for commit dependencies, transaction log writes. Both which can be very short.
  1. Validate no changes violate isolation levels. Basically is the transaction globally transactionally consistent.
  2. Commit dependencies. A CD is when a transaction reads data from another transaction that has finished phase 1 of validation with valid end time stamps (transaction Id or actual time stamps?).  In practice very rare. There is an EE and wait type. Waiting_for_dependencytx_event.
  3. Logging. If any of the modified tables were created with SCHEMA_AND_DATA then SQL must log. Sql reads the write set to determine what to log. transactions track all there changes in the write set. A list of delete/insert ops with pointers to the version associated with each op
    1. Once a log record is hardened to storage the state of the transaction is changed to committed. Then reduce the dependency counters by one.
  4. Post processing. Updates time stamps of each row inserted or deleted by this transaction. Switches time stamps from transaction IDs to real time stamps.
  • Garage collection – Find rows who’s time stamps are earlier than the oldest active transaction on the system. It is marked stale
  • The actual collection process is completed in cooperation with user threads. During a regular user access. If a stale thread is encountered it will be makes expired or removed.
Chapter 6
The transaction log consists of
  • Log streams – changes made by committed transactions. Insertion and deletions of row versions.
  • Checkpoint streams
    • Data streams – contains all versions inserted during a timestamp interval
    • Delta streams – associated w data streams, contain a list of integers which versions its corresponding data steam have been deleted
  • No longer uses LSNs for sequencing uses end Ts for serialization order
  • Still only supports one log per database but due to more efficient logging the potential for log bottleneck is reduced.
  • Reduced logging – no need to log index operations. All indexes are rebuilt during recovery
  • Each atomic change isn’t logged to a single log record rather InMemory OLTP will combine into a single record
  • InMemory OLTP does not need to log uncommitted transactions – opportunistic remember?
  • For InMemory OLTP no logging is used for info. Just redo
  • 100 inserts into a disk based table generates 200 log records. Into a InMemory OLTP table only 3.
  • Continuous check pointing – checkpoints are continuous and incremental as activity accumulates. Rather than periodic like in disk based tables
  • Streaming IO – ie. sequential
  • Checkpoints are manual or every 512mb
  • Checkpoint files are stored in file stream files
  • Two types of checkpoint files data and delta. Called a checkpoint file pair
    • Data file – new versions of rows in a time interval. Append only files. Once closed only read
    • Delta file – info about about which rows have been deleted from the corresponding data file. Once the data file is closed so is the delta file. At recover used to filter out rows from the data file
  • CFP pairs can be recovered from concurrently
  • On insert a continuous checkpoint will append the row to a checkpoint file
  • A checkpoint event will close a set of files state goes from under construction to active. No more writes to these files but can be referred to by other insert update operations from other CFPs
  • Every checkpoint generates a new set of files. As such older files may no longer be needed and are merged
  • If no open transactions refer to the CFPs they transition to non-active. After a log backup they are no longer needed and can be removed by the garbage collection process
  • See page 135 for states. db_xtp_checkpoint_files
  • Pre-created – pre built as an optimization
  • The number of pre-created CFPs is equal to the number of schedulers
  • Under construction – open CFP being with continuous checkpoints being written to. A checkpoint event will close and change to active
  • Active – CFPs containing inserted/deleted rows for the last checkpoint event. Continuous checkpoint is no longer writing new data but deleted are added to delta files. During recovery active CFPs contain all inserted/deleted rows needed to restore the data before applying the tail of the log backup
  • Storage array – 8192 entry array holding CFP entries. Max 256GB of CFP data
  • CFPs in the storage array along with the tail of the log backup represent all the needed data to recover a InMemory OLTP table in a db
  • Before there are any active CFPs we can use the transaction log for recovery
  • Calling checkpoint is really closing a continuous checkpoint
  • It changes the under construction CFPs to active. Adds them to the storage array. All log entries on InMemory OLTP tables  not covered by a previous checkpoint are converted into active CFPs.
  • Once finished:
    1. all buffered InMemory OLTP writes are flushed to data and delta files
    2. checkpoint inventory (file descriptors from the previous checkpoint interval and any files added by the current checkpoint) are hardened to the log
    3. the location of the inventory is hardened to the log
  • A checkpoint event includes a timestamp which indicates that the effects of all transactions are before the checkpoint timestamp are in checkpoint files rather than the log thus the log is not needed to recover them. Still need a log backup to truncate log.
  • Each checkpoint generates new files. More files, the longer crash recover will take
  • We merge files with adjacent timestamp ranges when their active content drops below a percentage. See page 145 for table. Also when adjacent files are < 50% full. Can happen during manual checkpoints
  • Due to this process the largest storage space CFPs can use is two times larger than the corresponding memory optimized size
  • You can merge manually if needed. Very rare
  • Do the demos from this section
  • Merged source files are removed from the storage array and are eligible for garbage collection after checkpoint and log backups.
Garbage Collection of checkpoint files
  1. Checkpoint – guarantees that data in checkpoint files is no longer needed.
  2. Log backup – guarantee that the log truncation point is safely beyond where the CFPs will be needed.
  3. Unused files garbage collected – uses the same mechanism as regular file stream data.
  • We can force GC and file stream GC via a SP Page 152-3
  • If you do a manual GC if files you’ll need to make sure you account for the additional log backups.
  • Starts after the most recent checkpoint inventory has been discovered during a scan of the tail of the log. Sql server tells InMemory OLTP this location and can begin a parallel recovery. The global transaction timestamp is initialized with the highest value found in the transactions recovered
  • Each delta file is used as a filter to eliminate rows from the corresponding data file.  Each CFPs can be processed in parallel. One thread per core
  • As data loads rows are linked into their indexes. Hash indexes – rows are linked into their buckets. Range indexes – added to an existing chain or created if the key doesn’t duplicate one already encountered during recovery
  • CFP load is complete the tail of their is replayed front the time of the last checkpoint. Finished!
Chapter 7
  • A table or stored proc can be natively compiled
  • The performance benefits of native compilation are best in complex stored procedures. Use for most performance critical parts
  • TSQL for tables and stored procedures generate C code. Recompiled at instance startup
  • The execution plan is part of the DLL. There is no re compilation. You must drop and recreate if you want a new plan. Compiled ok first execution and after instance restart and AG failover
  • Intended for short basic operations many complex constructs are not allowed
  • No parameter sniffing at compilation. Compiled with parameters UNKNOWN
  • When using hash indexes the query must specify all key columns in the where clause or else scan. Also based on equality. Must specify whole value not a partial string
  • Range indexes are only forward. There are no previous pointers. You’d need an index ascending and descending
  • No Halloween protection in plan. On disk based tabled this is via spool operators. In InMemory OLTP it’s done via statement id in the row header
  • No parallel plans
  • No auto update stats. Because there’s no row mod counter
  • When creating stored procedures be sure data is loaded at compilation time
  • Need native compilation to achieve the performance gains of InMemory OLTP. Needed to reduce CPU instructions so couldn’t do interpreted stored procedures
  • Query stats are not captured by default. Need to enable with a stored procedure. Page 178. Can be used to collect on one object or all
Chapter 8
  • Memory allocation is critical. Running out of memory can cause operations to fail. If under memory pressure InMemory OLTP will more aggressively clean up old row versions
  • 256GB max for all InMemory OLTP tables in a DB. This is because of the number of checkpoint files
  • Rule of thumb is plan for 2x size of data for memory allocation. More of write heavy. Less if read heavy.  See pg 187 for more details
  • Use resource governor to control memory allocations. Many Dbs can be in the same pool. Db must be taken offline then on-lined to use pool
  • In sys.indexes range indexes are type 2 same as non clustered. Hash indexes are 7
Best practices 
  • Use collate bin2 at the column level rather than db level
  • Be accurate in you bucket count. Should be close to the number of unique values
  • For low cardinality use range instead of hash
  • Stats are not automatically updated on InMemory OLTP tables
  • Memory optimized table variables behave the same as table variables. But don’t use tempdb. Could help relieve contention
Add summary of 199
Good summary of where InMemory OLTP fits in.

Standard SQL Server Build

Often I’m asked what is the best practice for a single SQL Server installation. Well, that is a tricky questions and the answer is it always depends. Let’s discuss what a “standard” SQL Server build looks like if you had to start somewhere. Here let’s focus on Standard edition on a physical server. Enterprise edition and virtualization are topics that can stand on their own.
Processors – The higher the clock frequency the better
SQL Server is licensed by the core. Generally speaking, higher clock rates will yield higher data throughput. Meaning your query will get done faster. It’s a balance between clock speed cost and number of cores (SQL license costs). To see the impact of clock speed on your performance try adjusting your Windows power plan and time a workload…you will see the impact. Check out Glenn Berry’s post on this here and most recently here. Again since we’re likely licensing SQL by the core, we really want to feed the processor as quickly as possible, so next we discuss the need for a lot of memory and fast disks to do so.
Memory – At the moment memory is cheap, buy as much as you can afford. But not too much if you’re on Standard edition.

The maximum memory allowed on SQL Server Standard 2014 is 128GB per instance, 64GB if you’re on SQL Server Standard 2012. The more memory your system has the larger your memory allocation can be to things like the buffer pool. With a larger buffer pool, it’s more likely your query will be cached in memory and not have to retrieve your data from disk. Memory operates in nanoseconds, SSDs are microseconds and spinning disks are milliseconds. I know where I would want my data waiting for me. 
Storage – Isolate your major functions
This is where things depend highly on workload, specifically capacity and performance. This article is about the general case, so here’s we’re I’d start. Also, these basics apply to direct attached storage rather than SAN based storage. 
The disk controller should have a battery backed write cache. With a battery backed write cache your controller can tell the OS that it’s IO completed while caching the IO in its local memory. If you’re into it, add a hot spare. If you do, try to make all your drives the same size/type and it can be a global hot spare for any disk in the system this is simply a safety blanket in the event of a disk failure.
If you isolate your major functions as described below you’re ahead of the crowd already, the main reasons for this are differing IO patterns and the potential for recovery in the event of a disk failure:
Operating System – 2 disks RAID 1 – 100GB – you should just have your OS and SQL installation here. SSD if you can afford it, otherwise 15K RPM.

Databases – 4+ disks RAID 5 is OK. RAID10 is better – the number and capacity of these disks depend on your data size requirements, if you have the budget for SSD go for it, get MLC.

Transaction Logs – 2+ disks RAID 1, if you need more capacity 4+ disks RAID10 – if you have the budget for SSD go for it and get SLC

TempDB – 2+ disks RAID 1, if you need more capacity 4+ disks RAID10 – if you have the budget for SSD go for it and get SLC 
This 100% needs to be addressed in the first conversation about building a SQL Server. Where are you going to put them? How long are you going to keep them? How large are they going to be? Replication offsite? We could (and should) spend a whole day discussing backups. 
Here we describe a very generalized SQL Server hardware configuration, if you start here you’ll be in a good place. There are a ton of decision points that can drive a SQL Server hardware configuration in any direction. In future articles we’ll discuss how to configure Windows, Install SQL Server and instance level configuration basics for SQL Server.
If you need some assistance with your SQL Server installation, design and backups planning please feel free to contact me.
Follow me on Twitter: @nocentino