
1. Introduction to Data Modeling

One of the most common misconceptions about data modeling in MongoDB is that MongoDB is schemaless, meaning that it doesn't really matter which fields documents have, how different documents can be from one another, or even how many collections we have per database.

What MongoDB unquestionably has is a very flexible data model. There are no default rules for what a document should look like beyond being valid BSON and containing a primary key. But most importantly, all data has some sort of structure, and therefore a schema. MongoDB just happens to make it easier for us to deal with that later rather than sooner.

Before jumping into ERD and UML tooling to determine the full scope of our data structures, it tends to be preferable to start building the application and find out from that experience what the data structure should look like.

However, if we do know:

  • our usage pattern
  • how our data is accessed
  • which queries are critical to our application
  • the ratio between reads and writes

then we will be able to extract a good model, even before writing the full application, that scales with MongoDB.

Being flexible also means the model can adapt as your application changes.

Once we have a pretty good idea of how documents should be shaped and which data types their fields should hold, we'll be able to enforce those rules in MongoDB using document validation.
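As a minimal sketch of document validation, assuming a people collection shaped like the example document shown later in this post (the rules here are illustrative, not prescriptive):

    db.createCollection("people", {
        validator: {
            $jsonSchema: {
                bsonType: "object",
                required: ["firstName", "lastName", "age"],
                properties: {
                    firstName: { bsonType: "string" },
                    lastName: { bsonType: "string" },
                    age: { bsonType: "number", minimum: 0 }
                }
            }
        }
    })

With this in place, inserts that do not match the schema are rejected, while existing documents are only checked again when they are next modified.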

Another misconception is that all information, regardless of how the data is manipulated, can be stored in one single document. There are some use cases where this approach is actually correct, but in reality it is not the way applications generally use data.

Keeping the amount of information stored per document to the data that your application actually uses, and having different models to deal with historical data or other data that is not always accessed, is something we'll be looking into in Data Modeling.

And there is also the perception that there is no way to perform a join between documents in MongoDB. While MongoDB does not call $lookup a join, for many good reasons, you can still perform all sorts of joins in MongoDB.
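For instance, a left outer join between two collections can be expressed with the $lookup aggregation stage. A minimal sketch, assuming movies and reviews collections where each review stores a movie_id:

    db.movies.aggregate([
        {
            $lookup: {
                from: "reviews",          // collection to join with
                localField: "_id",        // field in the movies documents
                foreignField: "movie_id", // matching field in the reviews documents
                as: "reviews"             // name of the resulting array field
            }
        }
    ])

Each movie document comes back with an embedded reviews array holding its matching review documents.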

MongoDB Document Model

Data in MongoDB is stored in a hierarchical structure. Databases are at the top level, and each MongoDB deployment can have many databases. A database contains one or more collections, and documents are kept at the collection level.
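In mongosh, that hierarchy is visible directly; a small sketch with made-up database and collection names:

    use blog                              // switch to (and lazily create) a database
    db.people.insertOne({ name: "Ada" })  // the first insert creates the "people" collection
    show collections                      // list the collections in the current database

Databases and collections are created lazily, on the first write that targets them.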

In MongoDB, data is stored as BSON documents composed of field-value pairs, where BSON is a binary representation of JSON documents. Since BSON is not human-readable, we will stick with JSON in our examples throughout this course.

Keeping with standard JSON notation, we have an open curly bracket indicating the beginning of a document, followed by several field-value pairs separated by commas. For example:

    {
        "firstName": "Babin",
        "lastName": "Joshi",
        "age": 24,
        "phone": [
            123456789, 987654321
        ],
        "address": {
            "street": "One",
            "building": 1,
            "city": "Kathmandu",
            "province": 3,
            "country": "Nepal"
        },
        "education": [
            {
                "college": "Prasadi",
                "degree": "+2"
            },
            {
                "college": "Kathmandu University",
                "degree": "BSc. Computer Science"
            }
        ]
    }

Each value can be of any BSON data type; in this example we have strings, integers, an array, an embedded document, and an array of embedded documents.

In MongoDB, a document is a collection of attributes for an object. If you're coming from SQL, a document is like a row in a table that has been joined with the relevant rows from other tables. You have a field and its assigned value, just like each column in a row has a value.

Instead of having to query multiple tables of related data and assemble it all together, we can keep related data in a single document and pull it all down with a single query.
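For example, retrieving the person above along with their address and education takes a single round trip (the collection name is carried over from the validation sketch earlier):

    // one query returns the person together with the embedded address and education
    db.people.findOne({ "lastName": "Joshi" })

In a normalized relational design, the same read would require joining person, address, and education tables.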

We can consider a MongoDB document as being similar to a dictionary, a map, or an associative array: an object that holds a number of key-value pairs.

Since MongoDB documents support a flexible structure, we can use a single document to represent an entire object, rather than having to break the data up across multiple records as we would with a relational database.

The exact structure of a document (all the fields, values, and embedded documents) represents the schema of that document.

Documents in the same collection don’t need to have the exact same list of fields.

Furthermore, the data type in any given field can vary across documents.

We do not have to make changes at the cluster level to support this.

Another way to view this is that we can have multiple versions of our schema as the application develops over time, and all the schema versions can coexist in the same collection.
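As an illustrative sketch, two document shapes can live side by side, distinguished here by a hypothetical schema_version field:

    // version 1: phone stored as a single number
    db.people.insertOne({ "schema_version": 1, "lastName": "Joshi", "phone": 123456789 })

    // version 2: phone split into labeled numbers
    db.people.insertOne({
        "schema_version": 2,
        "lastName": "Joshi",
        "phone": { "home": 123456789, "mobile": 987654321 }
    })

The application can branch on schema_version when reading, and migrate old documents lazily as they are touched.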

Intro to Methodology

We will go over a methodology to help you through the whole process of data modeling for MongoDB. The methodology we use is composed of three phases.

  • The first phase is to describe the workload. In other words, it means gathering everything there is to know about how you will be using your data.

  • The second phase is to identify the relationships between the different entities you need to handle and choose how to model those relationships.

  • And the third phase is to apply design patterns or transformations to the current model to address performance requirements.

Let’s describe each phase a little bit more.

Our goal is to create a data model, which is often referred to as our MongoDB schema.

For example, you may have a requirements document listing the scenarios the system needs to support. Alternatively, or in addition, you may have a domain expert who can advise on what needs to be done. You may be migrating from a relational database, or you may be evolving an existing MongoDB database. In both cases, logs, stats, et cetera give you additional information about the current state of the system. If they exist, you want to use them.

Finally, someone needs to assemble this information into the schema. This is done by the data modeling expert.

So the first phase is to look at the inputs you have and create some artifacts out of them. You should be able to size the amount of data your system will have initially, in a few months, and in a few years. Recording those numbers will let you observe any major deviations once the system is in operation; such differences are a good indicator that you may have to iterate over your schema again. The same applies to the operations, the reads and the writes: you should be able to tell how many run per unit of time, and whether each query has additional requirements in terms of execution time, latency from the application, tolerance to staleness, et cetera.

For each of these operation requirements, record your thoughts and assumptions. They will also be a good indicator of whether you need to reevaluate the model later.
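As a purely hypothetical illustration, a recorded workload for a movie review site (the example used in the next phase) might contain entries such as:

  • write a review: about 10 writes/second, latency under 100 ms
  • load a movie page: about 1,000 reads/second, can tolerate data a few minutes stale
  • analytics over all reviews: one query per hour, long-running, no latency requirement

The numbers here are invented; what matters is that each operation is named, quantified, and annotated with its requirements.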

In our second phase, we start with the pieces of information that were identified. Each piece has a relationship with the others, and pieces with a one-to-one relationship tend to be grouped together in the same table or collection.

Take a movie site as an example. In modeling for a relational database, you would probably come up with three entities (actors, movies, and reviews) and place each piece of information inside the appropriate one. A movie title has a one-to-many relationship to the reviews for that movie, while the money earned by the movie has a one-to-one relationship with the title. So the movie title and its revenues go in the same entity or collection, while the reviews go in a separate collection.

With MongoDB, you follow the same process of identifying the relationships between the pieces of information. However, you also need to decide whether to embed information or keep it separate. At the end of this process, you will have a list of entities with their fields, some of them grouped together inside a common collection.
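Under those assumptions, one possible shape is revenue embedded in the movie document and reviews kept in their own collection, referencing the movie (field names are illustrative):

    // one-to-one: revenue lives inside the movie document
    db.movies.insertOne({
        "_id": 1,
        "title": "Example Movie",
        "revenue": { "gross": 100000000, "currency": "USD" }
    })

    // one-to-many: each review references its movie
    db.reviews.insertOne({
        "movie_id": 1,
        "rating": 4,
        "text": "Worth watching."
    })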

Our last phase is to apply schema design patterns. This is where you get to make your model more performant, or clearer, by applying some transformations.

If any of the input information changes, you need to assess the impact on the decisions you made in the corresponding phase. For example, if you discover another required query, get more data about the size of your problem, or run benchmarks on your current solution, feed all that new information back as input to the model. Any successful application will undergo modifications at some point in its lifetime, so be ready for new inputs. If you track why you made each decision and what the assumptions were at the time, it will be much easier to apply the needed changes.

Intro to Modeling for Simplicity vs Performance

Modeling for simplicity means we avoid any complexity that could slow down the development of the system by our engineers. Frequently, those kinds of projects have limited expectations and small requirements in terms of CPU, disk, I/O, and memory. Things are generally small and simple. You should start by identifying the most important operations for the system, and you will need to establish the relationships between the entities and fields. To keep the model simple, you are likely to group many of those pieces inside a few collections, using sub-documents or arrays to represent the one-to-one, one-to-many, and many-to-many relationships. By keeping the modeling steps to a minimum, we remain very agile, with the ability to quickly iterate on our application and reflect changes back into the model as needed.

If you model for simplicity, you will likely see fewer collections in your design, where each document contains more information and maps very well to the objects you have in your application code, those objects being represented in your favorite language as hashes, maps, dictionaries, or nested objects. Finally, as a result of having larger documents with embedded documents in them, the application is likely to perform fewer disk accesses to retrieve the information. With three collections embedded into a main one, a single read is sufficient to retrieve the information instead of four.

At the other end of our axis, we have the performance criteria.

In this scenario, resources are likely to be used to the maximum. Projects that make use of sharding to scale horizontally are likely to fall into this category, because you often shard your database when there are not enough resources available in a simple replica set.
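For reference, distributing a collection across shards means enabling sharding on the database and picking a shard key. A minimal sketch, assuming a sharded cluster is already running (the names and key choice are illustrative):

    // run in mongosh against a mongos router
    sh.enableSharding("moviesite")
    sh.shardCollection("moviesite.reviews", { "movie_id": "hashed" })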

The system may require very fast reads or writes, or it may have to support a ton of operations. All of those situations demand that you model for performance. When you model for performance, or have more complexity to handle, you want to go over all the steps of the methodology.

Again, you start by identifying the important operations, but you also quantify them in terms of metrics like operations per second and required latency, and you pin some quality attributes on those queries, such as: can the application work with data that is a little stale? Are those operations part of a larger analytic query?

If you model for performance, you will often see more collections in your design. You will also need to apply a series of schema design patterns to ensure the best usage of resources like CPU, disk, and bandwidth. And in between, you have projects that strike more of a balance, or trade-off, between those two criteria.

This post is licensed under CC BY 4.0 by the author.