Be Sure With Azure .NET – Azure Table Storage (Part 1)

Microsoft Azure Table Storage

This is the second article in an ongoing series on Microsoft Azure features and services.  If you’re just joining us, you might want to take a look at the previous Microsoft Azure Blob Storage article that covers how to get started with an Azure Storage account, development tools, using the Azure Storage Client SDK and pricing and performance expectations which we’ll not be repeating.

For this next article in the series, were going to be looking at Microsoft’ Azure’s Table Storage service. I have decided to break it into two parts as we’ll be covering a lot.  In Part 1, we’ll covering working with NoSQL databases, the differences between relational databases, designing and creating your Azure Table as well as all the available operations for persisting data.  In Part 2, we’ll cover writing queries to our tables, retrying failed operations, and concurrency and of course security.

Let’s get down to business and see what we can learn about working with Azure Tables:

  1. What do you Mean This Isn’t a Relation Database?
  2. Firing Up Our Tables
  3. Most Important Question; How To Design Your Table?
  4. Working With Azure Tables (Table Operations)
  5. Batch Operations with Entity Group Transactions
  6. Queries: Efficient and Inefficient
  7. Rats! We Hit a Snag (Working with Retry Policies)
  8. Concurrency
  9. Security
  10. Secondary Indexes

 

The Right Tool for the Right Job

As it is with most tools, whether it’s the screwdriver in your toolbox, lawnmower in your garage, or the software development IDE you’re using, each has a specific use.  This is also true with Azure Table Storage as well.  There are a specific set of cases that makes Azure Table Storage a good fit (read: it’s not a perfect fit for every job).

When we have the potential of dealing with large amounts of data (read: terabytes) that don’t have complex relationships and we want extremely fast retrieval and persistence along with the ability to easily scale out, Azure Tables shines.  This is because of a combination of the design features of Azure Tables, that if designed correctly, will provide out-box, highly scalable tables that are partitioning when under heavy load as well as extremely efficient data retrieval and persistence throughput.

Before getting into the specific details of Azure Tables, let’s take a quick stroll through comparing Azure Tables and our popular relation database.

 

What Do you Mean This Isn’t a Relation Database?

There is a high probability that you have a familiarity with Relational Databases (RDBMS) and think along those terms when it comes to storing and structuring data in a database.  However, Azure Table Storage is not a relational database, but a NoSQL database, so there is a completely different approach that needs to be taken when we set out to host our data in Azure’s Table Storage.

When we think about relational databases, we think about single table schemas, table relationships with foreign keys and constraints, stored procedures and columns and rows. All of which we aren’t burdened with when dealing with Azure Tables. This is one of the positive characteristics of NoSQL databases that allow us to focus solely on the data.  Of course, those same characteristics are what provide a relational database its strengths.

Since Azure Tables are not a perfect fit for every time we need to persist data, let’s look at some of the major differences between Azure Tables and relational databases to help draw that defining line.

 

Azure Tables vs. Relational Databases

This is not an exhaustive list of differences, but the important differences that you need to be aware to make the right decisions when working with Azure Tables and ultimately will play a significant role when we get to designing Azure Tables:

  • A Different Vocabulary:  In the traditional relational database vocabulary we make references to a table’s columns and rows.  However, in the world of NoSQL, we store entities (rows) with properties (columns) in our tables.

 

  • No schema defined relationships between tables: This lack of relationships means, you are not going to be drawing up relationships between other tables and storing their primary identifiers.  This is because or a byproduct of the fact that you can’t do joins on other NoSQL Tables.  There is a way to store entities with a different schema in the same table in order to keep related entities close at hand.

 

  • All schemas are created equal:  Unlike a relational database table that has an associated defined schema that all rows must adhere, in Azure Tables, a single table can store entities that have a different schema.  To take it a step further, we can view Azure Tables as a container of entities, while entities are simply a bag of name/value pairs of data.  Each entity within a table can contain different name/value pairs of data.  The only common denominator shared by a table’s entities are the three required properties (Partition key, Row key and Timestamp), two of which we’ll be looking at in-depth.

 

  • Indices are limited: In a relational database we have the luxury of designing indices to help with query efficiency such as a secondary index.  Out of the box, we don’t have this luxury and querying Azure Tables by entity properties, other than the required properties mentioned earlier can be very inefficient.  We’ll talk more about this when looking at important factors when designing our tables.

 

If you don’t have a relational database background, maybe much of the above won’t make a lot of sense, or if you come from a NoSQL background, I am preaching to the choir.  Without further bombardment of technical details, let’s dive in and get started with Azure Tables.  When we get there, we’ll cover the most important aspect of Azure Tables; design.

 

Firing Up Our Tables

As a reminder you can see the previous post Azure Blob Storage for getting a storage account setup.  Once setup with a Azure Storage account, we can demonstrate how easy it is to dump data to an Azure Table in 2 simple steps:

Step 1 (create the table)

CloudTable table = _client.GetTableReference(tableName);
table.CreateIfNotExists();

 

Step2 (Create and persist a dynamic Table Entity)

var dynamicEntity = new DynamicTableEntity
{
  PartitionKey = "Games",
  RowKey = "Outside",
  Properties = new Dictionary<string, EntityProperty> { { "Name", new EntityProperty("Corn Hole")} }
};

var tableOperation = TableOperation.Insert(dynamicEntity);
var result = table.Execute(tableOperation);

 

Now this was just for demonstration purposes only and we’re getting ahead of ourselves.  But, you can see how easy it is to save data to an Azure Table by just utilizing the tools already available and didn’t have to predefined custom entities to host the data I wanted to store. Starting with the next section we’ll go over creating tables and persisting data.  But, this brings us to the most important point when dealing with Azure Tables, design.

 

Most Important Question; How to Design Your Table?

If you have had anything to do with developing an application with a database backend, you probably have seen how the requirements generally start off with having some data that needs to be persisted. Therefore, the table is designed to house that data.  Its only afterwards that thought is put into how that information needs to be retrieved and utilized.  Sound familiar?

With Azure Tables, the most important question that you have to answer before you’re ready to persist data to Table Storage is, what are you going to do with the data?

We don’t have the luxury of answering this question after we’re already persisting data to Table Storage.  This is because in order to acquire the highly efficient retrieval speeds and out-of-box scale out capabilities afforded by Azure Table Storage, we have to first make important decisions on how we are going to structure the data in an Azure Table.  Remember, these entities we are storing are un-schematized, but that doesn’t mean they aren’t structured.   So, lets breakdown what important decisions we have to make for structuring our data.

Partition Key and Row Key

You have heard me mention earlier that with the NoSQL database we are storing entities that are simply bags of name/value pairs of data.  But, in addition, there is a common denominator between all entities in all tables which are the required properties referred to as the Partition Key, Row Key and Timestamp properties.  For now, the most important of these are the Partition and Row key.

Partition Key

So, how does Azure Tables automatically afford out-of-the-box scale out capabilities when data is under heavy load?  This is where the Partition Key becomes probably the most important of the two properties we are currently evaluating.  The partition key is a string property that is required for all entities.  Unlike a primary key on a relational database table, it isn’t unique.  But, it is the key (no pun indented) to allowing Azure to determine where divisions can be made in a single Azure Table and partition entities that are part of that single partition to its own partition server.

Even though every partition will be served by a Partition Server (that can be responsible for multiple partitions), it is when partitions under heavy load can be designated its own Partition Server.  It is this distribution of load across partitions that allow your Azure Table Storage to be highly scalable.

So let’s stop for a second and think about this; if you design your table with a single partition key. Such as in the case of a table that stores store product information, but you decide to make your partition key “products” and all entities fall under this single partition.  How can Azure partition your data so that it can automatically scale out your table for efficient performance? It can’t.  All 50,000,000 shoe products you store fall under the same partition.

Unfortunately, partition servers also create a boundary that will directly affect performance.  Therefore, in contrast to the all-in-one partition approach, creating unique partitions for every entity is a pattern that that will cost you the ability to perform batch operations (discussed later), incur performance penalties for insert throughput as well as when queries cross partition boundaries.

Finally, sorting is not something that is controlled after the data has been persisted.  Data in your table will be sorted in ascending order first by the partition key, then sorted ascending by the row key. Therefore, if sort is of importance, you will need to determine the ways partition key’s and row keys are defined.  A good example is how 11 would come before 2 unless padded with 0’s.

Row Key

The row key is also a string property that must be unique within the partition.  As I mentioned earlier, the idea of secondary indexes has been a lacking feature in Azure Tables.  However, the combination of the Row and Partition key creates an entities primary key and forms a single clustered index within a table.

Row keys also provide a second applied ascending sort order after the applied ascending sort order of the partition key.  Therefore, depending on your circumstances, further thought might be required on how you want data to be sorted when retrieved.

Therefore, the decision you need to make is how will the data be queried, what are the common queries that you expect to be made? Based on that answer, you need to determine how the data can be grouped into partitions.  The following is a list of guidelines that is not exhaustive, but can help making table design decisions easier.

  1. Determine what the common queries will be against the data.
  2. Based on those common queries, determine how the data can be grouped (partitions).
  3. Avoid over-sized partitions that would hinder the ability for scalability.
  4. Avoid extremely small (single entity) partitions that would negate ability for batch partitions and hinder insert throughput
  5. Consider your sort order requirements when determine exact partition and row naming

 

We will continue to look at all these points as we cover table operations to persist and query the data.  There is more to consider that just these starting points, such as how inefficient queries based on table properties are that force a full table scan.  Since this article is not strictly about designing tables, it would recommend reading this helpful MSDN article for more on how different partition sizes affect queries as well as more information on table design.

The most important question you need to ask before using or designing your Azure Tables, how will the data be queried.

 

Working With Azure Tables (Table Operations)

So now that you have decided how you want to structure your tables and you’re ready to start persisting data, let’s get down to business.  We are utilizing the .NET Azure Storage Client SDK which we fully covered in the previous Azure Blob Storage article.  There, you can learn your different options for acquiring the Storage Client SDK.   Despite having covered the details about utilizing your Storage Account’s Access keys, I am going to cover that small part again here.

Therefore, assuming you up and running with the .NET Storage Client SDK, let’s start with the minimal requirements to get started with working with an Azure Table.

Minimal Requirements

The table itself is associated with a specific Azure Storage Account.  Therefore, if we want to perform crud and query operations on specific table in our storage account, roughly, we will be required to instantiate objects that represent our storage account, a specific table client object within our storage account and finally, a reference to the table.

Keeping that in mind, the minimal requirements to work with a table would go something like the following:

  1. First, you create a storage account object that represents your Azure Storage Account
  2. Secondly, through the storage account object, you can create a Table Client object
  3. Third, through the Table Client object, obtain an object that references a table within your storage account.
  4. Finally, through the specific table reference, you can execute table operations

 

CloudStorageAccount account = new CloudStorageAccount(new StorageCredentials("your-storage-account-name", "your-storage-account-access-key"), true);
CloudTableClient tableClient = account.CreateCloudTableClient();
CloudTable table = tableClient.GetTableReference(tableName);

You can see for the CloudStorageAccount object we are creating a StorageCredentials object and passing two strings, the storage account name, and a storage account access key (more on that in a moment).  In essence, we create each required parent object, until we eventually get a reference to a Table.  Despite having covered setting up a CloudStorageAccount object in the previous linked post, below is a refresher on utilizing your storage’s access keys.

Storage Key’s

As mentioned earlier, creating the storage credentials object, you need to provide it your storage account name and either the primary or secondary base64 key.  This is all information we covered in the article on blob storage, but you can obtain this information from your Azure Portal by selecting “Manage Access Keys” under “Storage”, where it will list your storage accounts you have created.

Azure Storage Access Keys

Yes, as you might have guessed, the access keys are the keys to the kingdom, so you would not want to pass this information out.  However, for the sake of simplicity, I am demonstrating the bases for what we need in order to access a specific table storage account.  We will take a closer look at security later on in this article.

A better approach to acquiring a CloudStorageAccount would be to look at the “Windows Azure Configuration Manager”  NuGet package that can help with abstracting out some of the overhead of creating an account.

 

Creating Tables

So as we saw earlier, in order to create a table we have to have a reference to a table.  Azure has some table naming rules that we will need to follow.  Once we have a reference to a table there are a few different ways to create a table

CloudTable table = tableClient.GetTableReference(“footwear”);
table.Create();

 

Or if your aren’t aware of whether or not it has been created yet

CloudTable table = tableClient.GetTableReference(“footwear”);
table.CreateIfNotExists();

 

You might find code that repeatedly makes CreateIfNotExists method calls anytime they want to interact with Tables.  Recall, when we talked about pricing in the Azure Blob Storage, one of the aspects of billing is that it is based on transactions.  The above create example conducts multiple transactions to validate the table exists.  Therefore, you could easily incur unneeded transactional overhead especially if you are doing this on every attempt to work with a table. For the creation of tables, I would advice you create the tables ahead of time when you can.

 

Insert

In this article, we are going to look at two of the main ways to persist data to your table.  One is through explicitly defining a class in your application that implements the TableEntity class.  While a more implicit, second way to persist data to your table is through the use of the DynamicTableEntity.

We can start off by defining classes to represent our table entities, which will derive from TableEntity, creating a new instance of an entity we want to persist, followed by the table operation we want to carry out (insert) and finally executing it against the table reference we have:

public class Footwear : TableEntity
{
public double Size { get; set; }
public string Brand { get; set; }
public string Name { get; set; }
public string Gender { get; set; }
}

Footwear atheleticShoe = new Footwear()
{
  PartitionKey = "Athletics",
  RowKey = "038389_7_women",
  Brand = "AeroSpeed",
  Size = 7.5,
  Gender = "women",
  Model = 38389
};

TableOperation tableOperation = TableOperation.Insert(atheleticShoe);
TableResult result = table.Execute(tableOperation);

 

As you might have noticed, were using a combination of the model#, size and gender for the Row key.  Depending on what common queries you determine will be used for data retrieval from your table, this key can have significant importance on available unique keys within a partition and sort order.

TableResult provides an encapsulation of two important pieces of information.  One, if you are performing a query, the TableResult.Result will be the returned entity(s).  It also encapsulates the HTTP response.  In the previous article on Blob Storage we learned that the Azure Storage services is a REST API. This HTTP response can provide more insight to the results of a table operation.  Going along in this article you’ll see that I capture the returned result just for completion, but it isn’t always necessary.

 

De-normalization (A Short Commercial Break)

Since we are on the topic of inserting data in our tables, this is a good as any time to talk about a common problem with NoSQL databases; de-normalization.  We have already pointed out the significance to understanding how your table data will be queried and making the correct table design decisions ahead of time.  If your decide that you will have more than one dominant query against the data, a common solution is to insert the data multiple times with different Row key’s within a partition.  This will allow for multiple efficient queries to be performed against the same data.

A simple example using our Footwear table, where we might want to query by size or by model#, we can prefix the Row key and insert the same data twice.  The prefix will allow our application to distinguish between the different values:


Partition Key Row Key
Athletics size:07_38389_women
Athletics model:038389_7_women

 

The above example is just a possible solution, not a de-facto method for creating Row key’s, only a means for you to think about different scenarios of persisting entities.  But as mentioned earlier, this is a case where we have de-normalized our table by duplicating the data.  This isn’t the only case where de-normalization of your data might be required or be a possible solution to a problem.

Another common problem is the need for our application to work with different entities together, where the data is different, but closely related.  If we can’t define relationships between entities, we can use the un-schematized characteristic of NoSQL entities to save associated entities in the same table.  A common example would be an application that works with a Contact entity which requires working with Address entities.   Saving them all in the same table under the same partition would afford the ability to perform a batch transaction when saving a Contact and its Addresses (we’ll talk about batch transactions).

Updates

Updating existing properties on an existing entity in our table isn’t the only kind of update that can occur.  Because entities are no more than a bag of key/value pairs and there is no master schema that an entity has to follow, there is no reason why an update to an entity entity where the PartitionKey and RowKey match, might contain a complete different  set of properties and values.  Because of this, there are a few ways we might want to handle the update of an entity.  This is where Merge or Replace come in play.

Merge

If we make changes to existing properties on an existing table entity, Merge will accommodate updating the entity where the PartitionKey and RowKeyexists.  But in the case where we have an entity with different properties and want to update an existing table entity without losing those existing properties, we can merge the two entities.  This will result in a table entity with the combined properties as well as any changes to shared properties.

Before any changes we might have the following table entity:

Storage Table before merge

 

 

TableOperation query = TableOperation.Retrieve<Footwear>("Athletics", "38389_7_women");
TableResult results = table.Execute(query);
Footwear footwear = (Footwear)results.Result;

DynamicTableEntity newFootwear = new DynamicTableEntity()
{
  PartitionKey = footwear.PartitionKey,
  RowKey = footwear.RowKey,
  ETag = footwear.ETag,
  Properties = new Dictionary<string, EntityProperty>
  {
    {"PrimaryColor", new EntityProperty("Red")},
    {"SecondaryColor", new EntityProperty("White")}
  }
};

TableResult result = table.Execute(TableOperation.Merge(newFootwear));

 

In this example, we are loading an existing entity by the PartitionKey and RowKey.  I use a DynamicTableEntity to demonstrate where we want to Merge new changes to an existing entity by performing a Merge TableOperation.  We can see the final result is the combination of the existing properties as well as the new PrimaryColor and SecondaryColor properties.

Storage Table after merge

 

This is just a note that the previous DynamicTableEntity wasn’t required to make changes to an existing entity.  But you’re applications entity POCO’s might change, while you want to retain existing property information, or possibly your application has a split persistent model that that needs to merge the data in an entity.

Replace

Perhaps, we don’t want to retain previous information and we completely want to alter an existing entity.  Replace lets us completely Replace the existing entity where the PartitionKey and RowKey exists.  Taking the previous example we conduct a replace TableOperation

TableOperation replaceOperation = TableOperation.Replace(newFootwear);
TableResult result = table.Execute(replaceOperation);

 

 

And we can see how it has completely altered the structure and data for the existing table entity.

Table Storage after replace table operation

 

Because I have seen some confusion out there of the differences, just remember, Merge retains any current properties and data on an existing entity (data retention).  While Replace does just what the name implies, replaces the entity completely with the new entity (data loss)

You can’t modify the existing PartitionKey and RowKey.  In the case where this is needed, a simple insert of a new entity with the desired PartitionKey and RowKey and a deletion of the old entity will be required.

 

Delete

Using the same Footwear entity example from above, we can simply delete an entity doing the following:

TableOperation deleteOperation = TableOperation.Delete(footwear);
table.Execute(deleteOperation);

 

Be aware, that though it might appear you can simply pass in a newly constructed entity with the correct PartitionKey and RowKey, It will require that you provide a value for the ETag property, which we haven’t talked about yet and is covered under the Concurrency section.  You can force the update by using a wildcard “*” or first load the entity from the Table and pass it as the entity in the delete operation.

Hey Wait! There’s More

The Azure Table REST API provides a couple binary operations if we aren’t sure if an entity exists.  Through the CloudTableClient we can either InsertOrMerge

TableOperation insertOrMergeOperation = TableOperation.InsertOrMerge(newFootwear);
TableResult result = table.Execute(insertOrMergeOperation);

 

Or we have the option to InsertOrReplace

TableOperation insertOrReplaceOperation = TableOperation.InsertOrReplace(newFootwear);
TableResult result = table.Execute(insertOrReplaceOperation);

 

Insecure Direct Object Reference: Is your ASP.NET App Data Secure?

 

In all the above operations, we were only ever working with one table operations at a time.  Azure Tables provide a way to perform transactions through the means of Entity Group Transactions.

 

Batch Operations with Entity Group Transactions

A tip that I pointed out earlier as well as the last post on Blob Storage, is that there is an associated cost with each transaction.  You might have observed that all previous operations have been single operations involving one table entity. In the case were our application has a relationship between entities that are being stored in the same table under the same partition key we might want to perform a batch operation that allows a number of atomic operations to occur successfully or none occur at all.

Entity Group Transactions (EGT) is Azures answer to such a scenario in which we want to perform a series of atomic operations at one time and ensure they all are successful or they all fail.  There are a number of restrictions involving EGT’s. The primary restriction is that all entities involved in a group transaction require having the same Partition Key, which we mentioned earlier when we were discussing how to design our tables.  But let’s quickly look at what those restrictions are when using EGTs:

  1. All entities involved must share the same Partition Key.
  2. No more than 100 entities can be involved in a single group transaction.
  3. An entity can only appear once within a group transaction and there can only be one operation against the entity.
  4. A group transaction payload can total no more than 4mb.

 

The TableBatchOperation actually is a collection of single atomic TableOperations which we went over earlier.

TableBatchOperation batchOperations = new TableBatchOperation
{
  TableOperation.Insert(new Footwear
  {
    PartitionKey = "Athletics",
    RowKey = "model:038389_7_women",
    Brand = "AeroSpeed",
    Size = 7,
    Gender = "women",
    Model = 38389
}),

TableOperation.Insert(new Footwear
{
  PartitionKey = "Athletics",
  RowKey = "size:07_38389_women",
  Brand = "AeroSpeed",
  Size = 7,
  Gender = "women",
  Model = 38389
})
};

table.ExecuteBatch(batchOperations);

 

This was a simple operation.  But a more common example might be that we have a number of entities that we want to persist and we need to ensure that all batched TableBatchOperation(s) share the same PartitionKey and the count does not exceed 100 per operations. Suppose we had more than 100 Footwear we had to process:

IEnumerable<Footwear> footwears = GetFootwear(); //Get some unknown number of footwear objects

TableBatchOperation batchOperations = new TableBatchOperation();

foreach (var footwearGroup in footwears.GroupBy(f => f.PartitionKey))
{
  foreach (var footwear in footwearGroup)
  {
    if (batchOperations.Count < 100)
    {
      batchOperations.Add(TableOperation.InsertOrReplace(footwear));
    }
    else
    {
      table.ExecuteBatch(batchOperations);
      batchOperations = new TableBatchOperation {TableOperation.Insert(footwear)};
    }
  }
  table.ExecuteBatch(batchOperations);
  batchOperations = new TableBatchOperation();
}

if (batchOperations.Count > 0)
{
  table.ExecuteBatch(batchOperations);
}

 

Here, we group all the entities by partition key, then process all groups in batches of 100.  We then ensure if there are any batched operations left over at the end that those are processed.

TableBatchOperation(s) provides very efficient persistence of your entities and it shows.  Azure’s targeted batch processing speed is 1000 batch transactions per second. So use it when you can.

 

 

Conclusion

We have managed to cover everything from understanding Azure’s NoSQL Table Storage, differences from relational databases, designing tables, performing persistence operations and finally, how to take advantage of batch operations.  This was a lot to cover, but it’s far from over.  Part 2  covers the second half of the listed agenda at the beginning of this article.  This includes writing queries against the data, retrying failed table operations, concurrency and security.

 

 

References

  1. Understanding the Table Data Model
  2. Getting the most out of Azure Tables
  3. Designing a Scalable Table Structure
  4. Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
About the author

Max McCarty

Max McCarty is a software developer with a passion for breathing life into big ideas. He is the founder and owner of LockMeDown.com and host of the popular Lock Me Down podcast.