Whitepaper: Managing Database Schemas in a Continuous Delivery World

A whitepaper I wrote for my employer, Readify, just got published. Feel free to check it out. I’ve included the abstract below.

One of the trickier technical problems to address when moving to a continuous delivery development model is managing database schema changes. It’s much harder to roll back or roll forward database changes compared to software changes since a database, by definition, has state. Typically, organisations address this problem by having database administrators (DBAs) manually apply changes so they can manually correct any problems, but this has the downside of creating a bottleneck for deploying changes to production and also introduces human error as a factor.

A large part of continuous delivery involves the setup of a largely automated deployment pipeline that increases confidence and reduces risk by ensuring that software changes are deployed consistently to each environment (e.g. dev, test, prod).

To fit in with that model it’s important to automate database changes so that they are applied automatically and consistently to each environment thus increasing the likelihood of problems being found early and reducing the risk associated with database changes.

This report outlines an approach to managing database schema changes that is compatible with a continuous delivery development model, the reasons why the approach is important and some of the considerations that need to be made when taking this approach.

The approaches discussed in this document aren’t specific to continuous delivery and in fact should be considered regardless of your development model.

Developing ASP.NET web applications with IIS

When you File -> New Project an ASP.NET application in Visual Studio and then hit F5, by default it will spin up IIS Express and navigate to the site for you.

IIS Express is pretty cool – it runs under your user account so there’s no need to mess around with elevated privileges, it has most of the power of IIS (think web.config) and it “just works” out of the box without extra configuration needed across all dev machines 🙂

For projects that I work on every day though, I really dislike using IIS Express as a development server for the following reasons:

  • It will randomly crash
  • When it crashes I have to F5 or Ctrl+F5 in Visual Studio to restart it – I can’t just go to the last URL it was deployed to (e.g. http://localhost:port/)
  • If your code has an uncaught exception then a crash dialog pops up in your taskbar in a way that isn’t obvious and requires you to click a button before the code continues running (this can be very confusing)
  • Setting up a custom domain is tricky, is a tedious manual process and can’t run on port 80 side-by-side with proper IIS
    • Using a custom domain is often essential too – think sharing cookies between domains or integrating with third parties where you need to provide a URL other than localhost

Part of the reason IIS Express exists is because setting up IIS with a site is not a trivial process. However, once you do eventually get it set up, I find it works great from then on:

  • It’s stable
  • The URL is always available – you don’t have to use Visual Studio at all
  • Uncaught exceptions behave as expected
  • Custom domains are easy to set up, both in the IIS Manager GUI and via a variety of command-line options

In order to reduce the pain involved with setting up IIS I do two things:

  • I modify my Visual Studio taskbar icon to always run as admin (necessary to open a project bound to IIS)
  • I add a Developer Setup script to the project that developers must run once when they first clone the repository and that sets everything up for them in a matter of seconds (hopefully giving the same Open Solution -> F5 and start developing experience)
    • I’ve added an example of such a script to a Gist – the script also includes setting up the hosts file and a SQL Express database
    • I can’t claim full credit for the script – it’s been a collaborative effort over a number of projects by all of the Readifarians I’ve worked with 😀

The Idempotency issue when retrying commands with Azure SQL Database (SQL Azure)

There is a lot of information available about dealing with the transient errors that occur when using Azure SQL Database. If you are using it then it’s really important to take transient errors into account since they are one of the main differences between Azure SQL Database and SQL Server.

If you are using .NET then you are in luck because Microsoft have provided an open source library to detect and retry on these transient errors (the Transient Fault Handling Application Block). I have blogged previously about how the guidance that comes with the application block (along with most of the posts, tutorials and forum posts about it) indicates that you need to completely re-architect your system to wrap every single database statement with a retry.
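
For context, the “wrap every statement” approach looks roughly like the following sketch. This is illustrative only: the Member table is made up, and the exact namespaces and detection strategy class names vary between versions of the application block, so treat the names here as assumptions and check the version you are using.

    using System;
    using System.Data.SqlClient;
    using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

    public class MemberRepository
    {
        // Retry up to 5 times, 1 second apart, for errors the detection strategy considers transient
        private static readonly RetryPolicy Policy =
            new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(
                new FixedInterval(5, TimeSpan.FromSeconds(1)));

        public int GetMemberCount(string connectionString)
        {
            // ...and every single database call in the codebase ends up wrapped like this
            return Policy.ExecuteAction(() =>
            {
                using (var connection = new SqlConnection(connectionString))
                using (var command = new SqlCommand("SELECT COUNT(*) FROM Member", connection))
                {
                    connection.Open();
                    return (int)command.ExecuteScalar();
                }
            });
        }
    }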

I wasn’t satisfied with that advice and hence I created NHibernate.SqlAzure and more recently ReliableDbProvider (which works with ADO.NET, Entity Framework, LINQ to SQL, etc.). These frameworks allow you to drop in a couple of lines of configuration in one place and unobtrusively get transient fault handling across your application.

Easy right? A silver bullet even? Unfortunately, no.

The Idempotency issue

Today I was made aware of a post published a few months ago by a Senior Program Manager on the SQL Server team about the idempotency issue with Azure SQL Database. Unfortunately, I haven’t been able to find any more information about it – if you know anything please leave a comment.

The crux of the problem is that it is possible for a transient error to be experienced by the application when in fact the command that was sent to the server was successfully processed. Obviously, that won’t have any ill-effect for a SELECT statement, and if the SELECT is retried then there is no problem. When you have write operations (e.g. INSERTs, UPDATEs and DELETEs) then you can start running into trouble unless those commands are repeatable (i.e. idempotent).

This is particularly troubling (although in retrospect not terribly surprising) and the frustration of one of the commenters on the post sums up the situation fairly well (and in particular calls out how impractical the suggested workaround given in the post is):

How exactly would this work with higher abstraction ORMs such as Entity Framework? The updates to a whole entity graph are saved as a whole, along with complex relationships between entities. Can entity updates be mapped to stored procedures such as this in EF? I completely appreciate this post from an academic perspective, but it seems like an insane amount of work (and extremely error-prone) to map every single update/delete operation to a stored procedure such as this.

Approaches

After giving it some consideration and conferring with some of my colleagues, I can see a number of ways to deal with this (you could do something like what was suggested in the post linked to above, but frankly I don’t think it’s practical so I’m not including it). If you have any other ideas then please leave a comment below.

  1. Do nothing: transient faults are pretty rare (if you aren’t loading the database heavily) and, even when they do occur, the likelihood of hitting the idempotency issue is very low
    • In this case you would be making a decision that the potential for “corrupt” data is of a lower concern than application complexity / overhead / effort to re-architect
    • If you do go down this approach I’d consider if there is some way you can monitor / check the data to try and detect if any corruption has occurred
    • Unique keys are your friend (e.g. if you had a Member table with an identity primary key and some business logic that said emails must be unique per member, then you can use a unique key on Member.Email to protect against duplicate entries)
  2. Architect your system so that all database work is abstracted behind some sort of unit of work pattern and the central code that executes your unit of work contains your retry logic (see the sketch after this list)
    • For instance, if using NHibernate you could throw away the session on a transient error, get another one and retry the unit of work
    • While this ensures the integrity of your transactions, it does have the potential side-effect of slowing things down, since any transient error will cause the whole unit of work to be retried
  3. Ensure all of your commands are idempotent
    • While on the surface this doesn’t sound much better than having to wrap all commands with transient retry logic, it can be quite straightforward depending on the application, because most update and delete commands are probably idempotent already
    • Instead of using database-generated identities for new records, use application-generated identities (for instance generate a GUID and assign it to the id before inserting an entity) and then your insert statements will also be idempotent (assuming the database has a primary key on the id column)
    • NHibernate has automatic GUID generation capabilities and you can use the Comb GUID algorithm to avoid index fragmentation issues within the database storage
    • Alternatively, you can use strategies that generate unique integers, like HiLo in NHibernate or SnowMaker
    • If you are doing delete or update statements then you simply need to ensure that they can be executed multiple times with the same result – e.g. if you are updating a column based on its current value (e.g. UPDATE [table] SET [column] = [column] + 1 WHERE Id = [Id]) then that could be a problem if it executed twice
  4. Retry for connections only, but not commands
  5. Retry for select statements only, but not others (e.g. INSERT, UPDATE, DELETE)
  6. Don’t use Azure SQL, but instead use SQL Server on an Azure Virtual Machine
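
To make approach 2 a bit more concrete, here is a minimal sketch (not production code) of a central unit-of-work executor for NHibernate that discards the session and retries the whole unit of work on a transient error. IsTransient is a placeholder for whatever transient error detection you plug in.

    using System;
    using NHibernate;

    public static class UnitOfWork
    {
        public static void Execute(ISessionFactory sessionFactory, Action<ISession> work, int maxAttempts = 3)
        {
            for (var attempt = 1; ; attempt++)
            {
                try
                {
                    // Fresh session and transaction for every attempt
                    using (var session = sessionFactory.OpenSession())
                    using (var transaction = session.BeginTransaction())
                    {
                        work(session);
                        transaction.Commit();
                        return;
                    }
                }
                catch (Exception ex)
                {
                    // Placeholder check - plug in your transient error detection here
                    if (attempt >= maxAttempts || !IsTransient(ex))
                        throw;
                    // Otherwise: throw away the failed session and retry the whole unit of work
                }
            }
        }

        private static bool IsTransient(Exception ex)
        {
            return false; // e.g. delegate to a transient fault detection strategy
        }
    }

Usage would then be along the lines of UnitOfWork.Execute(sessionFactory, session => { /* queries and saves */ }); so the retry logic lives in exactly one place.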

Recommendations

With all that in mind, here are my recommendations:

  • Don’t shy away from protecting against transient errors – it’s still important and the transient errors are far more likely to happen than this problem
  • Use application-generated ids for table identifiers (see the sketch below)
  • Consider what approach you will take to deal with the idempotency issue (as per above list)
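
To illustrate the application-generated id recommendation, here is a minimal sketch against a hypothetical Member table: because the id is generated before the INSERT, retrying after a transient error can’t silently create a second row – the primary key on Id will reject a duplicate if the first attempt actually succeeded.

    using System;
    using System.Data.SqlClient;

    public class MemberWriter
    {
        public Guid InsertMember(string connectionString, string email)
        {
            // Generate the id in the application (a Comb GUID or HiLo value would
            // avoid the index fragmentation that random GUIDs can cause)
            var id = Guid.NewGuid();

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "INSERT INTO Member (Id, Email) VALUES (@Id, @Email)", connection))
            {
                command.Parameters.AddWithValue("@Id", id);
                command.Parameters.AddWithValue("@Email", email);
                connection.Open();
                command.ExecuteNonQuery();
            }

            return id;
        }
    }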

Unobtrusive transient fault handling in Azure SQL Database with Entity Framework, Linq To Sql, NHibernate and ADO.NET

It’s fairly well known now that when connecting to an Azure SQL Database you need to protect your application from the transient errors that occur when using it.

Luckily, Microsoft provide a library that takes care of most of the effort of detecting these errors and retrying your operations according to a retry policy.

The problem with this library is that most of the documentation out there tells you to essentially wrap every database call in your system in a retry call using the library. Let’s be honest – this is a crappy solution to the problem, particularly for legacy code bases where it might mean changes right across the codebase.

Luckily, there is a way around it:

  • If you are using Entity Framework 6 then check out the new Connection Resiliency functionality (a sketch of enabling it appears after this list), although note that the default implementation doesn’t retry for timeout errors (see the documentation of the libraries mentioned below to see why you might care about that)
  • If you are using NHibernate then you can use the library I’ve previously announced/released; NHibernate.SqlAzure
  • If you are using Entity Framework then the best unobtrusive solutions previously available on the net retry connection establishment (with other potential connection-management side-effects) and save operations, but not commands – I’d like to announce the release of a library I have created called ReliableDbProvider that solves these problems in an unobtrusive way (generally you will only need to change your config and possibly one or two lines of code)
  • If you are using ADO.NET (provided you use the Db Provider functionality for managing connections and commands) or Linq To Sql then the ReliableDbProvider library can be used as well
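
For the Entity Framework 6 option mentioned above, enabling Connection Resiliency is roughly a matter of registering an execution strategy via a DbConfiguration-derived class. A minimal sketch (the class name is arbitrary) looks like this:

    using System.Data.Entity;
    using System.Data.Entity.SqlServer;

    // EF6 picks up DbConfiguration-derived classes in the same assembly as your DbContext.
    // SqlAzureExecutionStrategy retries operations that fail with known transient error
    // codes, but (as noted above) not timeouts by default.
    public class MyDbConfiguration : DbConfiguration
    {
        public MyDbConfiguration()
        {
            SetExecutionStrategy("System.Data.SqlClient", () => new SqlAzureExecutionStrategy());
        }
    }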

Simulating transient errors when using Azure SQL Database

I recently gave a presentation on the ins and outs of using Azure SQL Database for the Perth MS Cloud User Group. If you are curious feel free to check out the presentation file at: https://github.com/robdmoore/SQLAzurePresentation

One of the things I did for the presentation was come up with a demo that illustrated the need for, and impact of, having transient error protection when talking to an Azure SQL Database. I have taken this code and cleaned it up a bit and released it on Github at: https://github.com/robdmoore/SQLAzureTransientDemo

This code is quite unique since I have managed to find a way to reliably elicit the transient exceptions (something I haven’t seen anywhere else on the net). If you need to test code that relies on these exceptions occurring and you want to be really sure it works, then feel free to check it out.

Announcing NHibernate.SqlAzure version 1.0!

I’m proud to be able to announce the release of version 1.0 of NHibernate.SqlAzure!

This library takes care of retrying the transient errors that SQL Azure throws at you when using the NHibernate ORM. It’s been in beta for the last few months and has been successfully used on a number of production websites.

Changes from 0.9 to 1.0

  • Bug fix when using Schema validation (thanks to @hmvs)
  • There is now a transient error detection strategy and associated NHibernate driver (SqlAzureClientDriverWithTimeoutRetries; say that 10 times fast!!) that retries for timeout exceptions (see the Github page for details and also thanks to @hmvs for a contribution towards this)
  • Some instances where exceptions were wrapped in NHibernate exceptions (batching, transactions) are now picked up as transient exceptions when before they were ignored
  • You can now easily log connection and command exceptions separately (see the CommandRetry and ConnectionRetry virtual methods on the driver class you use)
  • The documentation is a bit more comprehensive now
  • I finished writing all the automated tests I wanted to
  • Been road-tested on a number of sites over the last few months in production

This project is a collaborative effort along with my partner in code crime – Matt Davies – all code was either pair programmed together or reviewed by the other party.

Go and grab it from NuGet today and let me know how you go! Installation / usage instructions are on the Github page.

NHibernate.SqlAzure: Transient fault-handling NHibernate driver for use with Windows Azure SQL Database

I would like to announce the release of a software library I have written in conjunction with Matt Davies as open source via NuGet / Github: NHibernate.SqlAzure.

When using Windows Azure SQL Database (as it’s now known: what a mouthful! :P) one of the things you need to account for is the transient errors that come up as a consequence of using a high-availability database.

This essentially involves looking at the error code returned from the database and, if it’s one of a known set of numbers, retrying the communication (be it a SQL command or a connection establishment) with some sort of retry pattern. Obviously, you don’t want to just retry every query because the error that comes back might be a legitimate one.
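
At its simplest, that detection boils down to checking SqlException error numbers against a list of known transient codes – something like this minimal sketch (the numbers shown are a sample of the commonly cited SQL Azure transient errors, not an exhaustive or authoritative list):

    using System;
    using System.Data.SqlClient;
    using System.Linq;

    public static class TransientErrorDetection
    {
        // Sample of commonly cited transient error numbers for SQL Azure
        // (throttling, failover and connectivity errors) - not exhaustive
        private static readonly int[] TransientErrorNumbers =
            { 40197, 40501, 40613, 40143, 10053, 10054, 10060, 233, 64 };

        public static bool IsTransient(Exception ex)
        {
            var sqlException = ex as SqlException;
            return sqlException != null &&
                   sqlException.Errors.Cast<SqlError>()
                       .Any(error => TransientErrorNumbers.Contains(error.Number));
        }
    }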

Microsoft have helpfully released a library as part of Enterprise Library called the Transient Fault Handling Application Block that handles these transient errors as well as various retry patterns via the ReliableSqlConnection class.

While there are lots of articles out there about integrating this library with Entity Framework and LINQ to SQL, there isn’t really any guidance for NHibernate; only a few Stack Overflow posts and similar.

I’ve been working with Matt over the last few months to develop an NHibernate driver that will automatically make use of the transient connection and command retry logic within the Transient Fault Handling Application Block in an unobtrusive manner. This means that you can add one line of code to your existing applications (be it Fluent NHibernate or XML configuration) to make use of this feature. The library works fine for local connections too, since none of the error codes it looks for will usually be thrown by a non-SQL Azure database (and frankly, if they were, there would be no harm in retrying!).
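
For example, with Fluent NHibernate the change is along these lines (a rough sketch of my own – see the Github page for the definitive installation instructions and the XML configuration equivalent):

    using FluentNHibernate.Cfg;
    using FluentNHibernate.Cfg.Db;
    using NHibernate;
    using NHibernate.SqlAzure;

    public static class SessionFactoryBuilder
    {
        public static ISessionFactory Build(string connectionString)
        {
            return Fluently.Configure()
                .Database(MsSqlConfiguration.MsSql2008
                    .ConnectionString(connectionString)
                    // The one-line change: swap in the transient fault-handling driver
                    .Driver<SqlAzureClientDriver>())
                // .Mappings(...) etc. as per your existing configuration
                .BuildSessionFactory();
        }
    }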

The other cool thing about this integration is that you can make use of it for other retry situations by specifying a different transient error detection strategy and extending the ReliableSql2008ClientDriver class with a custom implementation. See the Github site for more information about that.

This library was actually fairly tricky to write because of a whole heap of explicit type-casts inside ADO.NET and NHibernate so I’m pretty proud of being able to release this!

One last note – I’ve been able to prove with automated tests that the library does actually handle transient errors by continuously pausing and restarting the SQL Express service while the tests are running and it works beautifully!! 🙂

Enjoy!