The Idempotency issue when retrying commands with Azure SQL Database (SQL Azure)

There is a lot of information available about dealing with transient errors that occur when using Azure SQL Database. If you are using it then it’s really important to take the transient errors into account since it’s one of the main differences that Azure SQL has when compared to SQL Server.

If you are using .NET then you are in luck because Microsoft have provided an open source library to detect and retry for these transient errors (The Transient Fault Handling Application Block). I have blogged previously about how the guidance that comes with the application block (along with most of the posts, tutorials and forum posts about it) indicate that you need to completely re-architect your system to wrap every single database statement with a retry.

I wasn’t satisfied with that advice and hence I created NHibernate.SqlAzure and more recently ReliableDbProvider (works with ADO.NET, EntityFramework, LinqToSql, etc.). These frameworks allow you to drop in a couple of lines of configuration at one place in your application and unobtrusively get transient fault handling in your application.

Easy right? A silver bullet even? Unfortunately, no.

The Idempotency issue

Today I was made aware of a post made by a Senior Program Manager on the SQL Server team that was posted a few months ago about the Idempotency issue with Azure SQL Database. Unfortunately, I haven’t been able to find any more information about it – if you know anything please leave a comment.

The crux of the problem is that it is possible for a transient error to be experienced by the application when in fact the command that was sent to the server was successfully processed. Obviously, that won’t have any ill-effect for a SELECT statement, and if the SELECT is retried then there is no problem. When you have write operations (e.g. INSERTs, UPDATEs and DELETEs) then you can start running into trouble unless those commands are repeatable (i.e. idempotent).

This is particularly troubling (although in retrospect not terribly surprising) and the frustrations of one of the commenters from the post sums up the situation fairly well (and in particular calls out how impractical the suggested workaround given in the post is):

How exactly would this work with higher abstraction ORMs such as Entity Framework? The updates to a whole entity graph are saved as a whole, along with complex relationships between entities. Can entity updates be mapped to stored procedures such as this in EF? I completely appreciate this post from an academic perspective, but it seems like an insane amount of work (and extremely error-prone) to map every single update/delete operation to a stored procedure such as this.

Approaches

After giving it some consideration and conferring with some of my colleagues, I can see a number of ways to deal with this (you could do something like what was suggested in the post linked to above, but frankly I don’t think it’s practical so I’m not including it). If you have any other ideas then please leave a comment below.

  1. Do nothing: transient faults (if you aren’t loading the database heavily) are pretty rare and within that the likelihood of coming across the idempotency issue is very low
    • In this case you would be making a decision that the potential for “corrupt” data is of a lower concern than application complexity / overhead / effort to re-architect
    • If you do go down this approach I’d consider if there is some way you can monitor / check the data to try and detect if any corruption has occurred
    • Unique keys are your friend (e.g. if you had a Member table with an identity primary key and some business logic that said emails must be unique per member then you can use a unique key on Member.Email to protect duplicate entries)
  2. Architect your system so that all work to the database is abstracted behind some sort of unit of work pattern and that the central code that executes your unit of work contains your retry logic
    • For instance if using NHibernate you could throw away the session on a transient error, get another one and retry the unit of work
    • While this ensures the integrity of your transactions it does have the potential side-effect of making everything a lot slower since any transient errors will cause the whole unit of work to retry (which could potentially be slow)
  3. Ensure all of your commands are idempotent
    • While on the surface this doesn’t sound much better than having to wrap all commands with transient retry logic it can be quite straightforward depending on the application because most update and delete commands are probably idempotent already
    • Instead of using database-generated identities for new records use application generated identities (for instance generate a GUID and assign it to the id before inserting an entity) and then your insert statements will also be idempotent (assuming the database has a primary key on the id column)
    • NHibernate has automatic GUID generation capabilities for you and you can use the Comb GUID algorithm to avoid index fragmentation issues within the database storage
    • Alternatively, you can use strategies to generate unique integers like HiLo in NHibernate or SnowMaker
    • If you are doing delete or update statements then you simply need to ensure that they can be executed multiple times with the same result – e.g. if you are updating a column based on it’s current value (e.g. UPDATE [table] SET [column] = [column] + 1 WHERE Id = [Id]) then that could be a problem if it executed twice
  4. Retry for connections only, but not commands
  5. Retry for select statements only, but not others (e.g. INSERT, UPDATE, DELETE)
  6. Don’t use Azure SQL, but instead use SQL Server on an Azure Virtual Machine

Recommendations

With all that in mind, here are my recommendations:

  • Don’t shy away from protecting against transient errors – it’s still important and the transient errors are far more likely to happen than this problem
  • Use application-generated ids for table identifiers
  • Consider what approach you will take to deal with the idempotency issue (as per above list)

Unobtrusive transient fault handling in Azure SQL Database with Entity Framework, Linq To Sql, NHibernate and ADO.NET

It’s fairly well known know that when connecting to an Azure SQL Database you need to protect your application from transient errors that occur when using it.

Luckily, Microsoft provide a library that takes care of most of the effort of detecting these errors and retrying your operations according to a retry policy.

The problem with this library is that most of the documentation out there tells you to essentially wrap every database call in your system in a retry call using the library. Let’s be honest – this is a crappy solution to the problem, particularly for legacy code bases where this might result in a major set of changes to the codebase.

Luckily, there is a way around it:

  • If you are using Entity Framework 6 then check out the new Connection Resiliency functionality, although note that the default implementation doesn’t retry for timeout errors (see the documentation of the below mentioned libraries to see why you might care about that)
  • If you are using NHibernate then you can use the library I’ve previously announced/released; NHibernate.SqlAzure
  • If you are using EntityFramework then the best unobtrusive solutions on the net previously have retries for connection establishment (that have other potential connection management side-effects) and save operations, but not command retries – I’d like to announce the release of a library I have created called ReliableDbProvider that solves these problems in an unobtrusive way (generally you will only need to change your config and possibly one or two lines of code)
  • If you are using ADO.NET (provided you use the Db Provider functionality for managing connections and commands) or Linq To Sql then the ReliableDbProvider library can be used as well

Simulating transient errors when using Azure SQL Database

I recently gave a presentation on the ins and outs of using Azure SQL for the Perth Ms Cloud User Group on Azure SQL Database. If you are curious feel free to check out the presentation file at: https://github.com/robdmoore/SQLAzurePresentation

One of the things I did for the presentation was come up with a demo that illustrated the need for, and impact of, having transient error protection when talking to an Azure SQL Database. I have taken this code and cleaned it up a bit and released it on Github at: https://github.com/robdmoore/SQLAzureTransientDemo

This code is quite unique since I have managed to find a way to reliably elicit the transient exceptions (something I haven’t seen anywhere else on the net). If you need to test some code that needs these exceptions to happen to run and you want to be really sure the code works then feel free to check out the code.

Announcing NHibernate.SqlAzure version 1.0!

I’m proud to be able to announce the release of version 1.0 of NHibernate.SqlAzure!

This library takes care of retrying when the transient errors that SQL Azure throws at you occur while using the NHibernate ORM. It’s been in Beta for the last few months and has been successfully used on a number of production websites.

Changes from 0.9 to 1.0

  • Bug fix when using Schema validation (thanks to @hmvs)
  • There is now a transient error detection strategy and associated NHibernate driver (SqlAzureClientDriverWithTimeoutRetries; say that 10 times fast!!) that retries for timeout exceptions (see the Github page for details and also thanks to @hmvs for a contribution towards this)
  • Some instances where exceptions were wrapped in NHibernate exceptions (batching, transactions) are now picked up as transient exceptions when before they were ignored
  • You can now easily log connection and command exceptions separately (see the CommandRetry and ConnectionRetry virtual methods on the driver class you use)
  • The documentation is a bit more comprehensive now
  • I finished writing all the automated tests I wanted to
  • Been road-tested on a number of sites over the last few months in production

This project is a collaborative effort along with my partner in code crime – Matt Davies – all code was either pair programmed together or reviewed by the other party.

Go and grab it from NuGet today and let me know how you go! Installation / usage instructions are on the Github page.

NHibernate.SqlAzure: Transient fault-handling NHibernate driver for use with Windows Azure SQL Database

I would like to announce the release of a software library I have written in conjunction with Matt Davies as open source via NuGetGithub: NHibernate.SqlAzure.

When using Windows Azure SQL Database (as it’s now known: what a mouthful! :P) one of the things that you need to account for is transient errors that come up as part of the fact you are using a high availability database.

This essentially involves looking at the error code returned from the database and if it’s one of a set of numbers then retrying the communication (be it a SQL command or a connection establishment) again with some sort of retry pattern. Obviously, you don’t want to just retry every query because the error that comes back might be a legitimate error.

Microsoft have helpfully released a library as part of Enterprise Library called the Transient Fault Handling Application Block that handles these transient errors as well as various retry patterns via the ReliableSqlConnection class.

While there are lots of articles out there about integrating this library with Entity Framework and Linq to SQL there isn’t really any guidance for NHibernate; only a few stack overflow posts and similar.

I’ve been working with Matt over the last few months to develop an NHibernate driver that will automatically make use of the transient connection and command retry logic within the Transient Fault Handling Application Block in an unobtrusive manner. This means that you can add one line of code to your existing applications (be it Fluent NHibernate or XML configuration) to make use of this feature. The library works fine for local connections (since none of the error codes it is looking for will usually get thrown by a non SQL Azure database (and frankly if they did then there is no harm in retrying!).

The other cool thing about this integration is that you can make use of it for other retry situations by specifying a different transient error detection strategy and extending the ReliableSql2008ClientDriver class with a custom implementation. See the Github site for more information about that.

This library was actually fairly tricky to write because of a whole heap of explicit type-casts inside ADO.NET and NHibernate so I’m pretty proud of being able to release this!

One last note – I’ve been able to prove with automated tests that the library does actually handle transient errors by continuously pausing and restarting the SQL Express service while the test is running and it works beautifully!! 🙂

Enjoy!

Getting up and running with Database testing quickly in .NET

When I first started learning about how to do database code in .NET and in particular how to test your database code I came across a really useful post on the Code Project.

It basically goes through a tutorial on how to get a base class set up that sets up a brand new SQL Server CE database every test run and populates it from an NHibernate configuration.

This was great, but there were a number of issues that caused it to not work for me (including the fact I decided to use Fluent NHibernate rather than writing a heap of obscure XML). As well as this, there were a couple of fundamental flaws in the way the test worked and it didn’t work with the latest versions of SQL Server CE and NHibernate at the time.

Consequently, I fixed up the code so it worked and I wanted to outline those changes and the final result here in case anyone finds it useful.

Continue reading “Getting up and running with Database testing quickly in .NET”