Home 5. Indexing and Aggregation Pipeline
Post
Cancel

5. Indexing and Aggregation Pipeline

Aggregation Framework

The aggregation framework in its simplest form is just another way to query data in MongoDB. Everything we know how to do using the MongoDB query language (MQL) can also be done using the aggregation framework.

Example: - Let’s find all documents that have wi-fi as one of the amenities only including the price and address in the resulting cursor. With MQL, we will use the following command:-

1
2
3
4
5
6
7
8
9
10
    db.listingsAndReviews.find(
        {
            'amenities': 'Wifi'
        },
        {
            'price': 1,
            'address': 1,
            '_id': 0
        }
    ).pretty()

With the aggregation framework, we use the following command:-

1
2
3
4
5
6
7
8
9
10
11
12
    db.listingsAndReviews.aggregate(
        [
            { "$match": 
                { "amenities": "Wifi" } 
            },
            { "$project": { "price": 1,
                            "address": 1,
                            "_id": 0 }
            }
        ]
    ).pretty()

To use the aggregation framework, we use the aggregate instead of find. The reason for that is because sometimes we might want to aggregate, as in group or modify our data in some way, instead of always just filtering for the right documents. This means that you can perform operations other than finding and projecting data. But you can also claculate using aggregation. The aggregatoin framework works as a pipeline, where the order of actions in the pipeline matters. And each action is executed in the order in which we list it. Meaning that we give our data to the pipeline on our end, then we describe how this pipeline is going to treat our data using aggregation stages. And then the transformed data emerges at the end of the pipeline.

The “$group” stage

The $group stage is one of the many stages that differentiates the aggregation framework from MQL. With MQL, we can filter and update data. With the aggregation framework, we can compute and reshape data. The $group is an operator that takes the incoming stream of data and siphons it into multiple distinct reservoiors.

Syntax:

1
2
3
4
5
6
7
8
9
10
    {
        '$group': {
            '_id': <expression>, //Group By Expression
            <field1>: {
                <acumulator1>: <expression1>
            },
            ...,
            }
        }
    }

Example-1:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
    db.listingsAndReviews.aggregate(
        [
            {
                '$project': {
                    'address': 1,
                    '_id': 0
                }
            },
            {
                '$group': {
                    '_id': '$address.country'
                }
            }
        ]
    )

The above query projects only the address field value for each document, then group all documents into one document per address.country value.

Example-2:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
    db.listingsAndReviews.aggregate(
        [
            {
                '$project': {
                    'address': 1,
                    '_id': 0
                }
            },
            {
                '$group': {
                    '_id': '$address.country',
                }
            }
        ]
    )

The above query projects only the address field value for each document, then group all documents into one document per address.country value, and count one for each document in each group.

sort() and limit()

Sometimes, when we’re creating a collection, we are not interested in all the results, but are looking for the top 3 or top 10 results.

Suppose we want to find the least popoulated zip code in the zips collection. Then we willl use the following query.

1
2
3
4
5
    db.zips.find().sort(
        {
            'pop': 1
        }
    ).limit(1).pretty()

The above query gets all the documents, sorts them by their population in ascending or increasing order, and only returns the first document in the cursor, a.k.a the one with the smallest population value.

1
2
3
4
5
    db.zips.find().sort(
        {
            'pop': -1
        }
    ).limit(10).pretty()

The above query gets all the documents, sorts them by their population in descending or decreasing order, and only returns the first 10 document in the cursor, a.k.a the one with the largest population value.

The sort() and limit() are cursore methods. A cursor method is not applied to the data that is stored in the database. It is instead applied to the result set that lives in the cursor. After the curosr is populated with the filter data that’s the result of the Find command, we can then apply the sort() method which will sort the data based on the criteria that we provided.

We can sort the data by one or more fields in increasing or decreasing direction. For example: -

1
2
3
4
5
6
    db.zips.find().sort(
        {
            'pop': 1,
            'city': -1
        }
    )

The above query gets all the documenets, sorts them in the increasing order by population and decreasing order by the city name.

Indexes

Indexes are one of the most impactful way to improving query performance. An index in a databse is, by its function, similar to an index in a book, where you have an alphabetical list of names and subjects with references to the places where they occur. Index in database is a special data structure that stores a small portion of the collection’s data set in an easy to traverse form. In simple terms, an index is a data structure that optimizes queries.

Given:

1
2
3
    db.trips.find({'birth year': 1989})

    db.trips.find({'start station id': 476}).sort('birth year': 1)

The first query filters the data by the value of the birth year and the second sorts by the value of that field. Both could benefit from that index.

  • Creating an Index:
    • Single field index
      1
      2
      3
      4
      5
      
            db.trips.createIndex(
                {
                    'birth year': 1
                }
            )
      

      The above query creates an index on the birth year field in increasing order.

    • Compund index: an index on multiple fields.
      1
      2
      3
      4
      5
      6
      
            db.trips.createIndex(
                {
                    'start station id': 1,
                    'birth year': 1
                }
            )
      

Data Modeling

MongoDB doesn’t enforce how data is organized by default. So how can we decide what structure to use to store our data? Where should we create subdocuments? And where should we use arrays of values? At which point should data get its own collection?

Making these decision about the shape and structure of our data is called Data Modeling. More specifically, Data Modeling is a way to orgranize fields in a document to support our applicaiton performacen and querying capabilities.

The most important rule of thumb in data modeling with MongoDB is that data is stored in the way that it is used. This notion determines the decision that we make about the shape of our document and the number of our collections.

Note: Data that is used/acccessed together should be stored together. And as our application evolves, our data model should also evolve.

Upsert

Everything in MOSQL that can be used to locate a docuent in a collection can also be used to modify the document.

1
2
3
4
5
6
7
8
    db.collection.updateOne(
        {
            <query to locate>
        },
        {
            update
        }
    )

The first part of the update operation is the query to locate the document in question. One of the awesome features of MQL is the upsert option within the update documents.

Upsert is a hybrid of update and insert, and it should only be used when it is needed.

1
2
3
4
5
6
7
8
9
10
11
    db.collection.updateOne(
        {
            <query>
        },
        {
            <update>
        },
        {
            'upsert': true
        }
    )

By default upsert is set to false, but if it is set to ture, we can expect it to do either an update or an insert. THe update will happen if there are documents that match the filter criteria of the update operation. The insert will happen if there are no documents that match the filter criteria.

Example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
    db.iot.updateOne(
        {
            'sensor': r.sensor,
            'date': r.date,
            'valcount': {
                '$lt': 48
            }
        },
        {
            '$push': {
                'readings': {
                    'v': r.value,
                    't': r.time
                }
            },
            '$inc': {
                'valuecount': 1,
                'total': r.value
            }
        },
        {
            'upsert': true
        }
    )

This post is licensed under CC BY 4.0 by the author.