Be Sure with Azure .NET – Azure Table Storage (Part 2)

Storing Data in Microsoft Azure's Table Storage Service

In our Be Sure with Azure .NET – Azure Table Storage (Part 1) we covered details about Azure Table Storage such as working with NoSQL databases, how they compare to relational databases, and how to design and create your Azure Table, as well as all the available operations for persisting data.  In Part 2 on Azure Tables we're going to cover the remaining aspects of the Table Storage service, such as:

    1. Efficient and Inefficient Queries
    2. Rats! We Hit a Snag (Working with Retry Policies)
    3. Concurrency
    4. Security
    5. Secondary Indexes (Bonus)

Queries: Efficient and Inefficient

Now that we have designed and constructed our Azure Tables and are effectively persisting data to them, the next important point is how to get data back out of them.  Because of the nature and limitations of Azure Tables, querying the data can easily be inefficient.  We didn't wait until we needed to get the data out before we started thinking about querying it; we started with that in mind when designing our tables.  But despite the steps we have taken to be able to retrieve data efficiently, there are no safeguards to ensure that we don't write inefficient queries.  Therefore, we will not only cover how to write queries and retrieve data, but also look at what components are required to write efficient queries as well as how to avoid writing inefficient ones.

 

Efficient Queries

As a refresher, there is no secondary index on our tables; there is only the clustered index made up of the Partition and Row keys.  Because of this, our most efficient queries will always involve both the Partition and Row keys.  Retrieving a single entity by a specific Partition and Row key will always have the best performance.

Single Entity by Partition and Row Key

//point query: retrieving a single entity by its Partition and Row key
TableOperation query = TableOperation.Retrieve<Footwear>(partitionKey, rowKey);
TableResult result = table.Execute(query);
Footwear footwear = (Footwear) result.Result;

 

Collection by Specific Partition Key

The next most efficient queries will be those that return collections based on a specific Partition Key:

IQueryable<Footwear> query = table.CreateQuery<Footwear>().Where(f => f.PartitionKey == partitionKey);
List<Footwear> shoes = query.ToList();

 

You'll notice here we aren't using the Retrieve method but CreateQuery, which allows you to construct and execute queries with LINQ.  There are limitations to CreateQuery, as it has been specifically optimized for working with Azure Table Storage, but you can review all the supported LINQ operators that are available.

Slightly less efficient queries would involve ranges of Row Keys or ranges of Partition and/or Row keys.

IQueryable<Footwear> query = table
  .CreateQuery<Footwear>()
  .Where(f => f.PartitionKey.Equals(startPartitionKey) || f.PartitionKey.Equals(endPartitionKey));
List<Footwear> shoes = query.ToList();

 

Although we are specifying Partition Keys as part of our query, because it crosses partition boundaries it has the potential to be a fairly inefficient query, depending on the number of entities per partition as well as the number of partitions involved.

 

Inefficient Queries

(Short Commercial Break)

Before jumping into what determines an inefficient query, this is a good time to break down how a query is handled at a more granular level.  We have already talked about how the Partition Key plays the primary role in allowing your data to scale and how it relates directly to query efficiency.  Let's take a closer look at the underlying reasons why the Partition Key plays such a primary role.

This is only a high-level view of some of the Azure Storage architecture; a detailed paper is available for ground-level details.  Azure Storage is made up of a large number of storage nodes to help disperse the load from requests, and a table's partitions can easily be spread across multiple storage nodes.  A Partition Server handles all requests to a specific partition of data and can be responsible for more than one partition.

Therefore, under heavy load, Azure can split partitions of data off to their own individual Partition Servers.  Because of this, a query that spans multiple data partitions can easily cross multiple boundaries, both the data partitions themselves and the Partition Servers, and each crossing incurs a performance cost.  Shortly, we'll talk more about how to handle partition sizes as well as some of the requirements for queries that cross partition servers, such as the use of continuation tokens.

Querying All the Things

All our queries up to now have required the Partition Key, the Row Key, or both.  But the Partition and Row Keys are usually not the only two properties on a table entity.  As soon as you start writing queries that filter on the other entity properties, you have moved into the realm of inefficient queries.

IQueryable<Footwear> query = table
  .CreateQuery<Footwear>()
  .Where(f => f.Gender == "Male" && (f.Size > 4 && f.Size < 7));
IEnumerable<Footwear> shoes = query.ToList();

 

As seen above, when no Partition or Row Key is specified, you can guarantee a full table scan will be performed.  In addition, because no Partition Key was specified, the query will be sent to every Partition Server.  One improvement would be to include a Partition Key, which reduces the query to a single Partition Server and only incurs a partition scan.
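
A minimal sketch of that improvement, assuming the same Footwear entity and table from the earlier examples and a hypothetical "Athletics" partition:

//same property filter as above, but now scoped to a single partition
IQueryable<Footwear> query = table
  .CreateQuery<Footwear>()
  .Where(f => f.PartitionKey == "Athletics"
              && f.Gender == "Male" && (f.Size > 4 && f.Size < 7));

//still a scan, but only within the "Athletics" partition rather than
//being fanned out to every Partition Server
List<Footwear> shoes = query.ToList();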

Including a Partition Key with every query can drastically help with performance.  This is even more noticeable for data under heavy load.

 

 

Continuation Tokens

Continuation Tokens are more or less bookmarks for queries; they allow a query to pick up from where it left off.  There are a number of scenarios where Azure Table Storage requires you to utilize continuation tokens in order to acquire all the results available to your query, such as:

  • The query has exceeded the maximum of 1,000 returned entities
  • Execution of the query has exceeded 5 seconds
  • The query has crossed a partition boundary

 

Because of these limitations, you can see how partition size can easily play a role in performance.  Partitions that are too large have been shown to directly impact the ability to scale, while partitions that are too small can easily impact query performance.  A query whose result is 5 entities, each in its own partition, could force you to handle 4 continuation tokens to fully acquire all the results.  So, how do we handle a Continuation Token?

Well, the good news is that if you don't want to, the Storage Client SDK handles continuation tokens for you.  However, there are cases where you need to take a more granular approach to your queries, such as when you want to ensure you are not pulling back an overwhelming number of entities, or when you want to implement pagination.  Below is an example of utilizing Continuation Tokens:

 

TableQuery<Footwear> query = table
  .CreateQuery<Footwear>()
  .Where(f => f.PartitionKey == partitionKey)
  .AsTableQuery();

TableContinuationToken token = null;
List<Footwear> shoes = new List<Footwear>();

do
{
  TableQuerySegment<Footwear> queryResult = query.ExecuteSegmented(token);
  token = queryResult.ContinuationToken;
  shoes.AddRange(queryResult.Results);
} while (token != null);

 

The decision to continue should not be based on whether any results were returned, but on whether the continuation token is null.  This is because a query sent to a partition server might not return any results, yet because it crossed a Partition Server boundary it still returns a Continuation Token.  In addition, it is always recommended that you write queries to handle tokens so you don't accidentally pull back overwhelming result sets.
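
For instance, a paginated query might look something like the sketch below.  The page size of 20 and the previousToken variable (null for the first page, supplied by the caller afterwards) are illustrative assumptions; only one segment is executed per request and the returned token is handed back to the caller for the next page.

TableQuery<Footwear> pagedQuery = table
  .CreateQuery<Footwear>()
  .Where(f => f.PartitionKey == partitionKey)
  .Take(20)
  .AsTableQuery();

//execute a single segment rather than draining the whole query
TableQuerySegment<Footwear> page = pagedQuery.ExecuteSegmented(previousToken);
IEnumerable<Footwear> currentPage = page.Results;

//null means there are no further pages; otherwise hand it back for the next page
TableContinuationToken nextToken = page.ContinuationToken;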

 

Rats! We Hit a Snag (Working with Retry Policies)

Whether it's because the internet hamsters are asleep or some other snafu, you can be certain there will be situations where our attempted operations fail.  Lucky for us, we can define a Retry Policy that provides a means for a failed operation to be repeated if necessary.  This is a topic we didn't cover in the first "Be Sure with Azure" article on Blob Storage, but it applies to all the storage services.  There are a few out-of-the-box policies available as well as the ability to create custom retry policies.  The 3 available policies are:

  • Linear
  • Exponential
  • None

 

We can set the DefaultRequestOptions on our CloudTableClient, which allows us to specify settings such as our retry policy.  In addition to setting the Retry Policy in the example below, we are also setting the MaximumExecutionTime simply for the sake of pointing it out.  Retry policies allow you to specify the delta backoff time between retries as well as the maximum number of attempts:

TableClient.DefaultRequestOptions = new TableRequestOptions
{
  RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(10), 5),
  MaximumExecutionTime = TimeSpan.FromSeconds(10)
};

 

The Exponential Retry Policy forces the period of time between retries to grow exponentially: each successive retry waits roughly twice as long as the previous one, based on the 10-second delta backoff in the example above, until the maximum number of attempts has been reached.  The back-off interval is defined by the first constructor parameter, deltaBackoff, and the second parameter specifies the maximum number of retry attempts.

A Linear retry works in such a way that the interval between retries stays constant, based on the specified deltaBackoff time.  None is also a policy, but it simply returns false from its ShouldRetry() method when checked by the CloudTableClient.
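
As a quick sketch using the same TableClient as above (the 3-second interval and 3 attempts are just illustrative values), the linear flavor and the None policy would be configured like this:

TableClient.DefaultRequestOptions = new TableRequestOptions
{
  //wait a constant 3 seconds between retries, for a maximum of 3 attempts
  RetryPolicy = new LinearRetry(TimeSpan.FromSeconds(3), 3)
};

//or opt out of retrying altogether
TableClient.DefaultRequestOptions.RetryPolicy = new NoRetry();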

 

So How Do We Know When a Retry Has Occurred?

A common question is how to know that an operation failed and a retry was attempted.  There are a few options available.  One is to use Microsoft's Enterprise Library Transient Fault Handling Application Block, easily accessible through NuGet, and hook into its event handling and built-in ability to detect transient faults when they occur.

//retry up to 5 times, starting at 1 second and adding 3 seconds per retry
RetryPolicy<StorageTransientErrorDetectionStrategy> retryPolicy =
    new RetryPolicy<StorageTransientErrorDetectionStrategy>(
        new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(3)));

retryPolicy.Retrying += (obj, eventArgs) => System.Console.WriteLine("Retrying");

//turn off the storage client's own retries so the fault handling block is in charge
TableClient.DefaultRequestOptions = new TableRequestOptions
{
  RetryPolicy = new NoRetry(),
};

TableOperation query = TableOperation.Retrieve<Footwear>(partitionKey, rowKey);
TableResult result = retryPolicy.ExecuteAction(() => table.Execute(query));
Footwear shoe = (Footwear) result.Result;

 

The problem with this approach is the readability of the code itself, since all execution has to be wrapped by the RetryPolicy.  The good news is that we can create a custom policy by implementing the IRetryPolicy interface, which lets you control how a retry occurs and add extra logic such as logging.

The current retry policies available do not retry on HTTP status codes 4xx, 306, 501, and 505.  Therefore, if you do write your own custom retry policy, you will want to make sure you handle these as well.
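
As a rough sketch (the class name, delay logic, and logging are purely illustrative), a custom policy that skips those non-retryable status codes and logs each retry decision might look like the following.  IRetryPolicy and OperationContext live in the Microsoft.WindowsAzure.Storage.RetryPolicies and Microsoft.WindowsAzure.Storage namespaces.

public class LoggingRetryPolicy : IRetryPolicy
{
  private readonly TimeSpan _delay;
  private readonly int _maxAttempts;

  public LoggingRetryPolicy(TimeSpan delay, int maxAttempts)
  {
    _delay = delay;
    _maxAttempts = maxAttempts;
  }

  //the storage client asks for a fresh instance per operation
  public IRetryPolicy CreateInstance()
  {
    return new LoggingRetryPolicy(_delay, _maxAttempts);
  }

  public bool ShouldRetry(int currentRetryCount, int statusCode,
      Exception lastException, out TimeSpan retryInterval,
      OperationContext operationContext)
  {
    retryInterval = _delay;

    //don't retry client errors (4xx) or the other non-retryable codes (306, 501, 505)
    if ((statusCode >= 400 && statusCode < 500)
        || statusCode == 306 || statusCode == 501 || statusCode == 505)
    {
      return false;
    }

    System.Console.WriteLine("Retry {0} after status code {1}", currentRetryCount + 1, statusCode);
    return currentRetryCount < _maxAttempts;
  }
}

It can then be assigned to TableClient.DefaultRequestOptions.RetryPolicy just like the built-in policies.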

 

 

Concurrency

In the last article on Blob Storage, I outlined the different concurrency control mechanisms, Optimistic and Pessimistic concurrency.  If you haven't had a chance to read it or aren't familiar with these controls, flip back and read up before moving on.  A nice convenience with Azure Tables is that they use optimistic concurrency as the default mechanism.  In addition, concurrency is at the entity level, not the table level.  This is one difference from Blob Storage, where concurrency is provided at both the blob and the container level.

The ETag is the property that Azure Tables use for optimistic concurrency.  When an entity is retrieved, the ETag property is provided.  When you persist an update to the entity, the ETag is a mandatory property that needs to be set.  The service compares it against the ETag of the current entity in the table, and the update only succeeds if they match.  If not, an HTTP status code of 412 (Precondition Failed) is returned.

However, if you need to force an update, you can assign the wildcard "*" to the ETag property.  Not all requests to the Table Storage service require the ETag, and the Storage SDK reflects what the Azure Storage API requires: when attempting a Replace or Merge table operation, the client will complain before submission if you don't provide an ETag or the wildcard "*".  But when attempting an InsertOrMerge or InsertOrReplace, the ETag is not required because the service doesn't require it.
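
Here is a minimal sketch of this in action, assuming the same Footwear entity and table as in the earlier examples:

//retrieve the entity; the ETag comes back populated
TableOperation retrieve = TableOperation.Retrieve<Footwear>(partitionKey, rowKey);
Footwear shoe = (Footwear) table.Execute(retrieve).Result;

shoe.Size = 8;

try
{
  //Replace sends the ETag; the service rejects the update if the entity
  //has been changed by another writer since we retrieved it
  table.Execute(TableOperation.Replace(shoe));
}
catch (StorageException ex)
{
  if (ex.RequestInformation.HttpStatusCode == 412)
  {
    //Precondition Failed: re-fetch the entity and decide how to merge
  }
}

//forcing the update regardless of intervening changes (last write wins)
shoe.ETag = "*";
table.Execute(TableOperation.Replace(shoe));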

Azure has provided a helpful write-up on concurrency for its Storage services, including some suggestions if you require a pessimistic concurrency mechanism for your table.

If you are implementing any type of public API, approach the InsertOrReplace and InsertOrMerge operations with caution, as you will default to a last-write-wins scenario, which might be undesirable.

 

 

Security

Shared Access Signatures

When it comes to security for the Azure Storage services, there is one security control that is utilized by all of them: Shared Access Signatures (SAS).  Shared Access Signatures provide a way for you to specify which permissions a consumer has to which specific storage resources.

As a refresher for those who didn't read all the intricate details found in the previous article on Blob Storage, a Shared Access Signature is an HMAC SHA-256 hash composed from a number of query string parameters that specify details such as the specific resources, the expiration time of the granted access, and the permissions granted, just to name a few.

Shared Access Signatures come in one of two flavors: ad hoc or revocable.  At the heart of a Shared Access Signature is a Shared Access Policy.  This policy is what defines parameters such as the permissions and the expiration date of the Shared Access Signature.  It is the Shared Access Policy that makes the difference between an ad hoc and a revocable Shared Access Signature.

Ad hoc Shared Access Signature

An ad hoc Shared Access Signature is generated by using one of the main storage account access keys in the hash algorithm.  As a consequence, the only way to revoke an ad hoc Shared Access Signature is by regenerating the storage account access key that was used to generate the hash.  That ultimately means anything else that uses the storage account access key would also have to be updated.

To see how this can be utilized, the example below demonstrates the scenario of an external client that requests a Shared Access Signature from your API.  This SAS could then be used to gain access to the table storage entities it has been granted access to:

//generate Shared Access Policy
SharedAccessTablePolicy policy = new SharedAccessTablePolicy
{
  Permissions = SharedAccessTablePermissions.Add | SharedAccessTablePermissions.Update |
                SharedAccessTablePermissions.Query,
  SharedAccessExpiryTime = DateTime.UtcNow.AddHours(1),
};

//generate Shared Access Signature using the policy
string sas = table.GetSharedAccessSignature(
  policy,
  null,         //accessPolicyIdentifier (none; this is an ad hoc SAS)
  "Athletics",  //startPartitionKey
  null,         //startRowKey
  "Running",    //endPartitionKey
  null          //endRowKey
);

 

The client could then use this Shared Access Signature to create a CloudTableClient, which would operate under the restrictions placed by the Shared Access Signature.

StorageCredentials creds = new StorageCredentials(sas);

//endpoint created in this fashion to make clear what it represents
string endpoint = string.Format("http://{0}.table.core.windows.net", accountName);

CloudTableClient tableClient = new CloudTableClient(new Uri(endpoint), creds);
CloudTable table = tableClient.GetTableReference("footwear");

 

From here, we can create and execute TableOperations that are in line with those permissions.  When attempting to operate outside the bounds of the permissions granted by the Shared Access Signature, the service behaves as if the entity can't be found.  This is evident when querying for entities: the service returns null with an HTTP status code of 404 in the TableResult.  Likewise, attempting to perform an update or insert without permission causes a StorageException to be thrown, also with an HTTP status code of 404.
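
A short sketch of what that looks like in practice, using hypothetical keys that fall outside the "Athletics"–"Running" partition range granted above and assuming Footwear exposes the usual TableEntity key properties:

//querying outside the granted range: no exception, just a "not found" result
TableOperation retrieve = TableOperation.Retrieve<Footwear>("Casual", "Loafer");
TableResult result = table.Execute(retrieve);
bool visible = result.HttpStatusCode != 404 && result.Result != null;

try
{
  //inserting outside the granted range throws instead
  Footwear boot = new Footwear { PartitionKey = "Work", RowKey = "Boot" };
  table.Execute(TableOperation.Insert(boot));
}
catch (StorageException ex)
{
  //ex.RequestInformation.HttpStatusCode will also be 404 here
}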

Managed Shared Access Signatures

As mentioned, at the heart of the Shared Access Signature is the policy.  In the case of a managed (revocable) Shared Access Signature, the policy is generated and stored on the storage account as a Stored Access Policy.  Therefore, when we generate a managed Shared Access Signature, we specify the identifier of a Stored Access Policy that already exists.  This gives us a means of revoking a Shared Access Signature by simply revoking the Stored Access Policy that was used to generate it.

The first step is to create and store a Stored Access Policy in our storage account.  We can do this by creating the policy and calling SetPermissions on the CloudTable.  The link to our Shared Access Signature will be the identifier that we assign to the policy.  In this example we have named it "tablepolicy1":

TablePermissions permissions = new TablePermissions();
permissions.SharedAccessPolicies.Add(
  "tablepolicy1",
  new SharedAccessTablePolicy
  {
    SharedAccessExpiryTime = DateTime.UtcNow.AddDays(2),
    Permissions = SharedAccessTablePermissions.Add |
                  SharedAccessTablePermissions.Query
  });

table.SetPermissions(permissions);

 

From here, it's business as usual, with the slight difference of specifying the Stored Access Policy identifier when creating the Shared Access Signature:

//generate Shared Access Signature using the policy identifier
string sas = table.GetSharedAccessSignature(
    null,
    "tablepolicy1",
    "Athletics",
    null,
    "Running",
    null
);

 

Finally, as we saw in the original ad hoc example, the client who receives the Shared Access Signature can create a CloudTableClient using the SAS and conduct any table operations that fall within the specified Stored Access Policy.

StorageCredentials creds = new StorageCredentials(sas);

//endpoint created in this fashion to make clear what it represents
string endpoint = string.Format("http://{0}.table.core.windows.net", accountName);

CloudTableClient tableClient = new CloudTableClient(new Uri(endpoint), creds);
CloudTable table = tableClient.GetTableReference("footwear");

 

Despite the limit of only 5 Stored Access Policies per table, utilizing them will benefit you when the need arises to revoke a Shared Access Signature.  They centralize your distributed permissions and make management a bit easier.
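
For example, revoking could be as simple as the following sketch, which pulls the table's current permissions, removes the "tablepolicy1" policy created earlier, and saves the change; any Shared Access Signature generated from that policy stops working once it is gone:

TablePermissions permissions = table.GetPermissions();
permissions.SharedAccessPolicies.Remove("tablepolicy1");
table.SetPermissions(permissions);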

 

 

Secondary Indexes (Bonus)

I know what you're thinking: "Azure Table Storage doesn't support secondary indexes!" and that is correct.  But I thought that for a bonus section I would mention that users of NoSQL databases that don't support secondary indexes haven't completely given up on finding ways to optimize queries that aren't based on the primary keys.

We already looked at one way in Part 1 under "De-normalization", when we discussed duplicating entities with different Row Keys.  This has the benefit of providing a kind of secondary index, but it also comes with a heavy maintenance overhead and, depending on the size of your data, a significant impact on overall storage size.  If you're interested in researching other approaches to generating secondary indexes, you can read up here on how.
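
As a reminder of what that looks like, here is a rough sketch (the RowKey values and the idea of indexing by a SKU are purely illustrative) that writes the same shoe twice under different Row Keys so it can be looked up efficiently by either value:

//same shoe stored twice: once keyed by name, once keyed by a hypothetical SKU
Footwear byName = new Footwear { PartitionKey = "Athletics", RowKey = "Running-TrainerX" };
Footwear bySku  = new Footwear { PartitionKey = "Athletics", RowKey = "SKU-104522" };

//both copies share a partition key, so they can be written atomically in one batch;
//the cost is keeping the duplicates in sync on every update and the extra storage
TableBatchOperation batch = new TableBatchOperation();
batch.Insert(byName);
batch.Insert(bySku);
table.ExecuteBatch(batch);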

Conclusion

We have managed to cover a lot in this second part on Azure Table Storage, including writing efficient queries and avoiding inefficient ones, utilizing retry policies, concurrency, and security.  You can clearly see there is a lot to learn.  But one of the nice aspects of the .NET Storage Client SDK is that it makes getting started fairly easy.  If you dropped in on Part 2, take a look at Part 1 to get the first half of the story and learn how to get started with Microsoft's Azure Table Storage service.

 

Some Helpful References

  1. Supported Query Operators
  2. Retry Policies
  3. Enterprise Library 6 Fault Handling
  4. Concurrency in Azure Table Storage
  5. Secondary Indexes

About the author

Max McCarty

Max McCarty is a software developer with a passion for breathing life into big ideas. He is the founder and owner of LockMeDown.com and host of the popular Lock Me Down podcast.