
3. Patterns 1

Patterns

Patterns are very exciting because they are the most powerful tool for designing schemas for MongoDB and NoSQL. Patterns are not full solutions to problems; those are complete designs. Patterns are smaller sections of those solutions: reusable units of knowledge.

We will see how to optimize large documents with the subset pattern, use the computed pattern to avoid repeated computations, structure similar fields with the attribute pattern, handle changes to your deployment without downtime with the schema versioning pattern, and much more.

Those patterns will also serve as a common language for your teams working on schema designs. Finally, having well-defined patterns and understanding when and how to use them will remove a little bit of the art in data modeling for MongoDB and make the process more predictable.

Handling Duplication, Staleness, and Referential Integrity

Patterns are a way to get the best out of your data model. Often, the main goal is to optimize your schema for a given performance-critical operation, use case, or access pattern.

However, many patterns lead to situations that require additional actions. For example:

  1. Duplication: Duplicating data across documents
  2. Data Staleness: Accepting staleness in some pieces of data
  3. Data Integrity Issues: Writing extra application-side logic to ensure referential integrity

Choosing a pattern to apply to your schema requires taking these three concerns into account. If these concerns outweigh the simplicity or performance gains the pattern provides, you should not use the pattern.

Duplication

Why do we have duplication? It is usually the result of embedding information in a given document for faster access. The concern is that it makes handling changes to duplicated information a challenge for correctness and consistency: multiple documents across different collections may need to be updated. There is a general misconception that duplication should never exist. In some cases, duplication is better than no duplication. However, not all pieces of information are affected in the same way by duplication.

Let’s start with a situation where duplicating information is better than not doing it.

Let’s link orders of products to the address of the customer that placed the order by using a reference to a customer document. Updating the address for this customer then also updates information for already fulfilled shipments, orders that have already been delivered to the customer. This is not the desired behavior: the shipments were made to the customer’s address at that point in time, either when the order was placed or before the customer changed their address. So the address referenced in a given order is unlikely to ever need to change.

Embedding a copy of the address within the shipment document ensures we keep the correct value. When the customer moves, we add another shipping address on file. Using this new address for new orders does not affect the already shipped orders.
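
For illustration, a hypothetical orders document that embeds a copy of the shipping address while keeping a reference to the customer (field names are assumptions):

    // Hypothetical "orders" document: the shipping address is copied in at the
    // time the order is placed, so later changes to the customer's address do
    // not rewrite already fulfilled shipments.
    {
        customer_id: 42,                     // reference to the customer document
        placed_on: "2017/07/23",
        shipping_address: {                  // frozen copy, duplicated on purpose
            street: "123 Main St",
            city: "Palo Alto",
            zip: "94301"
        }
    }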

The next duplication situation to consider is when the copied data never changes.

Let’s say we want to model movies and actors. Movies have many actors and actors play in many movies. So this is a typical many-to-many relationship. Avoiding duplication in a many-to-many relationship requires us to keep two collections and create references between the documents in the two collections. If we list the actors in a given movie document, we are creating duplication. However, once the movie is released, the list of actors does not change. So duplication on this unchanging information is also perfectly acceptable.
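
As a sketch (field names assumed), the movie document can embed its cast once the movie is released, since that list will not change:

    // Hypothetical "movies" document: the cast is duplicated from the
    // "actors" collection, which is acceptable because it does not change
    // once the movie is released.
    {
        title: "Dunkirk",
        cast: [
            { actor_id: 1, name: "Fionn Whitehead" },
            { actor_id: 2, name: "Tom Hardy" }
        ]
    }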

This leaves us with the last duplication situation, the duplication of a piece of information that needs to or may change with time.

For this example, let’s use the total revenue for a given movie, which is stored in the movie document, and the revenue earned per screening, stored in the screening documents. Here the duplication is essentially a single value kept in two places: the sum stored in the movie document and the revenues stored in the screening documents used to compute that sum. This type of situation, where we must keep multiple values in sync over time, makes us ask the question: does the benefit of having this sum precomputed outweigh the cost and trouble of keeping it in sync? If yes, use the computed pattern. If not, don’t use it.

Here, if we want the sum to stay synchronized, it may be the responsibility of the application to keep it in sync. That means whenever the application writes a new document to the collection or updates the value in an existing document, it must also update the sum. Alternatively, we could add another application or a background job to do it.
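
A minimal sketch of that application-side responsibility, assuming screenings and movies collections and a precomputed total_revenue field (all names are assumptions):

    // When a new screening is recorded, the application updates the
    // precomputed total on the movie document in the same logical step.
    db.screenings.insertOne({ movie_id: 1, theater: "AMC 16", revenue: 4500 });
    db.movies.updateOne(
        { _id: 1 },
        { $inc: { total_revenue: 4500 } }   // keep the duplicated sum in sync
    );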

Staleness

Staleness is about showing a user a piece of data that may be out of date. We now live in a world with more staleness than a few years ago. Due to globalization and the world being flatter, systems are now accessed by millions of concurrent users, which makes displaying up-to-the-second data to all these users more challenging.

For example, the availability of a product that is shown to a user may still have to be confirmed at checkout time. The same goes for prices of plane tickets or hotel rooms that change right before you book them.

Why do you get staleness? New events come along at such a fast rate that updating data constantly can cause performance issues. The main concern when solving this issue is data quality and reliability. We want to be able to trust the data that is stored in the database.

The right question is: for how long can the user tolerate not seeing the most up-to-date value for a specific field? For example, the user’s tolerance for staleness in whether something is still available to buy is lower than for how many people viewed or purchased a given item. When performing analytic queries, it is often understood that the data may be stale and that the analysis is based on some past snapshot. Analytic queries are often run on a secondary node, which may have stale data. It may be a fraction of a second or a few seconds out of date, but that is enough to break any guarantee that we are looking at the latest data recorded by the system.

The solution to resolve staleness in the world of big data is to batch updates.

As long as the updates run within the acceptable thresholds, staleness is not a significant problem. So yes, every piece of data has a threshold for acceptable staleness that goes from zero to whatever makes sense for the given piece of information. A common way to refresh stale data is to use a Change Stream to see what has changed in some documents and derive a list of dependent pieces of data to be updated.
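
A minimal sketch of that approach, assuming a screenings collection whose inserts should eventually be rolled up into a movie total (collection and field names are assumptions):

    // Watch for new screenings and derive which movie totals need refreshing.
    // A batch job can then recompute those totals within the staleness threshold.
    const staleMovies = new Set();
    const changeStream = db.screenings.watch([{ $match: { operationType: "insert" } }]);
    while (changeStream.hasNext()) {
        const change = changeStream.next();
        staleMovies.add(change.fullDocument.movie_id);   // to be refreshed by the batch
    }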

Referential Integrity

Our third concern when using patterns is referential integrity. It has some similarities to staleness: it may be OK for the system to have some extra or missing links, as long as they get corrected within a given period of time.

Why do we get referential integrity issues? Frequently, it is the result of deleting a piece of information, a document for example, without deleting the references to it. In the big data world, we can also associate referential integrity issues with distributed systems, where related pieces of information live on different machines.

At this time, the MongoDB server does not support foreign keys and the associated cascading deletes and updates that keep referential integrity. It is the responsibility of the application to do so. Here again, the main concern is data quality and reliability.
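
As an illustration, a sketch of that application-side cleanup when a customer is deleted (collection and field names are assumptions):

    // The application must delete (or re-point) the referencing documents itself,
    // since the server will not cascade the delete for you.
    db.customers.deleteOne({ _id: 42 });
    db.orders.deleteMany({ customer_id: 42 });   // remove now-dangling references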

Attribute Pattern

Polymorphic is one of the most frequent schema design patterns used in MongoDB: you put different types of products in one collection without going through relational acrobatics.

Our products have identification fields like manufacturer, brand, sub-brand, and enterprise that are common across the majority of products. Then there are additional fields that are common across many products, like color and size, though these values may have different units and mean different things for different products. For example, the size of a beverage made in the US may be measured in ounces, while the same drink in Europe will be measured in milliliters. As for a charger, the size is measured by its three physical dimensions. For the size of a Cherry Coke six-pack, we could say 12 ounces for a single can, or six times 12 ounces (72 ounces) for the full six-pack. Alternatively, we could list the physical dimensions and report the amount of liquid in another field. Note that physical dimensions for a beverage make sense if your main concern is the storage or transportation of the product, not the drinking of it. Then there is the third list of fields: the set of fields that will not exist in all the products. You may not even know them in advance.

They may appear in a new description that your supplier provides. For a sugary drink, you may want to know the type of sweetener, while for a battery, you are more interested in its specifications, like the amount of electricity it provides. The characteristics that are almost always present stay as regular fields; those qualify as the common schema part.

Schema design and indexing issues appear with the third list of fields. To search effectively on one of those fields, you need an index. For example, searching on the capacity of my battery would require an index, and searching on the voltage output of my battery would require another one. If you have tons of fields, you may end up with a lot of indexes.

Remember, some of these characteristics may be very specific to a few products, and the list of fields may be unpredictable. Each addition or discovery of a new characteristic may require you to add an index, modify your schema validators, and modify your user interface to show the new information. For this case, you want to use the attribute pattern.

To apply it (a sketch follows the steps):

  1. Identify the fields to transpose.
  2. For each field and its value, create a named tuple: {k: "name_of_the_field", v: "value_of_the_field"}, and place them all inside a new array field (props, for instance). The tuple could have a third member with the unit, for example.
  3. Create an index on {"props.k": 1, "props.v": 1}.
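
As a sketch, a product document before and after applying those steps, with the index created on the new array (field names are assumptions):

    // Before: one field per characteristic, each needing its own index.
    { name: "Cherry Coke 6-pack", manufacturer: "Coca-Cola", capacity: 72, unit: "ounces" }

    // After: characteristics transposed into an array of {k, v} tuples,
    // with an optional third member for the unit.
    {
        name: "Cherry Coke 6-pack",
        manufacturer: "Coca-Cola",
        props: [
            { k: "capacity", v: 72, u: "ounces" }
        ]
    }

    // A single compound index now covers searches on any characteristic.
    db.products.createIndex({ "props.k": 1, "props.v": 1 });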

Another example: a movie has different dates of release in different countries:

    {
        title: "Dunkirk",
        release_USA: "2017/07/23",
        release_Mexico: "2017/08/01",
        release_France: "2017/08/01",
        ...
    }

What if we wanted to find all the movies released between two dates across all countries? Moving to

    {
        title: "Dunkirk",
        releases: [
            {k: "USA", v: "2017/07/23" },
            {k: "Mexico", v: "2017/08/01"},
            {k: "France":, v: "2017/08/01" }
        ]
        ...
    }

makes the query much simpler.
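
With the releases array in place, a single range query covers every country. A sketch, assuming a movies collection and an index on {"releases.k": 1, "releases.v": 1}:

    // One compound index serves searches on any country and any date.
    db.movies.createIndex({ "releases.k": 1, "releases.v": 1 });

    // Find all movies with a release anywhere between two dates.
    db.movies.find({
        releases: { $elemMatch: { v: { $gte: "2017/07/01", $lte: "2017/08/15" } } }
    });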

Problem

  • Lots of similar fields with similar value types, and the need to search across those fields at once.
  • Another case is when only a subset of documents have many similar fields.

Solution Transform the fields into a new array of key/value pairs, with the key being the name of the field and the value its value. Then create an index containing both the key and the value.

Use Cases

  • Searchable characteristics of a product
  • Set of fields all having the same value type (list of dates)

Benefits/Tradeoffs

  • Easier to index
  • Allows a variety of field names
  • Allows qualifying the relationship between the key and the value with a third tuple member

Extended Reference Pattern

If you find yourself “joining” data between collections, even if the query is not horrid, performance becomes a liability at high data volumes.

Before $lookup, all joining had to be done in the application, with multiple queries involved.
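
For reference, a minimal sketch of such a server-side join with $lookup (collection and field names are assumptions, not from the original example):

    // Join each order with its customer document on the server side.
    db.orders.aggregate([
        {
            $lookup: {
                from: "customers",
                localField: "customer_id",
                foreignField: "_id",
                as: "customer"
            }
        }
    ]);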

$graphLookup allows recursive queries on the same collection, like on graph databases.
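
A sketch of a recursive traversal with $graphLookup, assuming a hypothetical employees collection where each document points to its manager via reports_to:

    // Walk the management chain upward from each employee, like a graph traversal.
    db.employees.aggregate([
        {
            $graphLookup: {
                from: "employees",
                startWith: "$reports_to",
                connectFromField: "reports_to",
                connectToField: "_id",
                as: "management_chain"
            }
        }
    ]);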

Another way would be embedding in the one-side of the 1-* relationship. But… what if the joins come from the other side? Imagine a 1-* relationship between a customer and its orders, where we usually query orders, not customers.

Embedding the most used information in the many-side (duplication) while maintaining the reference field allows us to avoid the join most of the time, while still being able to join if we must, at the expense of duplication.

Duplication management:

  • Minimize it: duplicate only fields that change rarely, and only the fields needed to avoid the join.
  • After a change in the “master” information: identify what needs to be changed and do it straight away if we must, or wait for a batch update to do it at a later stage.

In the example, duplicating is the right thing to do.
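
A sketch of what such an order might look like with an extended reference (field names assumed): the customer fields needed to process the order are copied in, while customer_id remains for the rare cases where a full join is still needed.

    // "orders" document with an extended reference to the customer.
    {
        _id: 1001,
        customer_id: 42,                     // reference kept for occasional $lookup
        customer: {                          // rarely-changing fields copied in
            name: "Ada Lovelace",
            shipping_address: { street: "123 Main St", city: "Palo Alto" }
        },
        items: [ { sku: "battery-9v", qty: 2 } ],
        total: 19.99
    }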

Problem

  • too many repetitive joins

Solution

  • identify the fields on the lookup side
  • bring those fields into the main object

Use Cases

  • Catalogs
  • Mobile applications
  • Real-time analytics

That is: optimize read operations by avoiding round-trips or touching too many pieces of data.

Benefits/Trade-offs

  • ✔ Faster reads
  • ✔ Reduced number of joins and lookups
  • ✘ Duplication if the extended reference contains data that changes a lot

Subset Pattern

MongoDB tries to keep the working set in memory (the documents and the portions of the indexes that are frequently accessed).

If the working set does not fit in memory, there will be constant thrashing: data is brought into RAM just to be evicted shortly after so that other data can take its place.

Solutions:

  • add more RAM (or more nodes to the cluster)
  • scale with sharding (or add more shards)
  • reduce the size of the working set

We will focus on the third: get rid of the part of huge documents that is not used often.

For example, for a movie, people might want to access only the main actors, top reviews or quotes. The rest can go into a separate collection.

Problem Working set does not fit in memory

Or only some of the data of the working set is frequently used.

Lots of pages evicted from memory

Solution Divide the document into two: fields that are often required and fields that are rarely required.

The resulting relationship will be a 1-1 or 1-*. Part of the information is duplicated in the most used side.
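
A sketch of the split (collection and field names assumed): the movie document keeps only the most requested subset of reviews, and the full list lives in its own collection referencing the movie.

    // "movies" document keeps a small, frequently-read subset of reviews.
    {
        _id: 1,
        title: "Dunkirk",
        top_reviews: [
            { user: "alice", rating: 5, text: "Stunning." }
        ]
    }

    // "reviews" collection holds the full list, fetched only when requested.
    { _id: 901, movie_id: 1, user: "alice", rating: 5, text: "Stunning." }
    { _id: 902, movie_id: 1, user: "bob", rating: 4, text: "Tense and gripping." }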

Use Cases

  • List of reviews of a product
  • List of comments on an article
  • List of actors in a movie

Benefits/Trade-offs

  • ✔ Smaller working set, as the most used docs are smaller
  • ✔ Shorter disk access for the most used collection
  • ✘ More round trips to the server when accessing less frequent information
  • ✘ More disk usage, as information is duplicated
This post is licensed under CC BY 4.0 by the author.