
Creating index and viewing plans of queries
In this recipe, we will look at querying the data, analyzing its performance by explaining the query plan, and then optimizing it by creating indexes.
Getting ready
For the creation of indexes, we need to have a server up and running. A simple single node is all we need. Refer to the Installing single node MongoDB recipe from Chapter 1, Installing and Starting the Server, for instructions on how to start the server. The data that we will operate on needs to be imported into the database. The steps to import the data are given in the previous recipe, Creating test data. Once this prerequisite is complete, we are good to go.
How to do it…
We are trying to write a query that finds all the zip codes in a given state.
- Execute the following query to view its plan:
> db.postalCodes.find({state:'Maharashtra'}).explain('executionStats')
Take a note of the stage, nReturned, totalDocsExamined, docsExamined, and executionTimeMillis fields in the result of the explain plan operation.
- Let's again execute the same query, but this time, we limit the output to 100 results only:
> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
- Take a note of the nReturned, totalDocsExamined, docsExamined, and executionTimeMillis fields in the result.
- We now create an index on the state and pincode fields as follows:
> db.postalCodes.createIndex({state:1, pincode:1})
- Execute the following query:
> db.postalCodes.find({state:'Maharashtra'}).explain()
Take a note of the stage, nReturned, totalDocsExamined, docsExamined, and executionTimeMillis fields in the result.
- As we want the pincodes only, we modify the query as follows and view its plan:
> db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
Take a note of the stage, nReturned, totalDocsExamined, docsExamined, and executionTimeMillis fields in the result.
How it works…
There is a lot to explain here. We will first discuss what we just did and how to analyze the stats. Next, we will discuss some points to keep in mind for index creation and some caveats.
Analyzing the plan
Okay, let's look at the first step and analyze the output of the query that we executed:
db.postalCodes.find({state:'Maharashtra'}).explain('executionStats')
The output on my machine is as follows (I am skipping the irrelevant fields for now):
{
  "stage" : "COLLSCAN",
  ...
  "nReturned" : 6446,
  "totalDocsExamined" : 39732,
  ...
  "docsExamined" : 39732,
  ...
  "executionTimeMillis" : 12,
  ...
}
The value of the stage field in the result is COLLSCAN, which means that a full collection scan (all the documents scanned one after another) has happened in order to search for the matching documents in the entire collection. The nReturned value is 6446, which is the number of results that matched the query. The totalDocsExamined and docsExamined fields have values of 39732, which is the number of documents in the collection scanned to retrieve the results. This is also the total number of documents present in the collection, and all of them were scanned for the result. Finally, executionTimeMillis is the number of milliseconds taken to retrieve the result.
Improving the query execution time
So far, the query doesn't look too good in terms of performance and there is great scope for improvement. To demonstrate how the limit applied to the query affects the query plan, we can find the query plan again without the index but with the limit clause as follows:
> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
{
  "stage" : "COLLSCAN",
  ...
  "nReturned" : 100,
  "totalDocsExamined" : 19951,
  ...
  "docsExamined" : 19951,
  ...
  "executionTimeMillis" : 8,
  ...
}
The query plan this time around is interesting. Though we still haven't created an index, we do see an improvement in the time that the query took for execution and in the number of objects scanned to retrieve the results. This is due to the fact that mongo stops scanning the remaining documents once the number of documents specified in the limit function has been reached. We can thus conclude that it is recommended that you use the limit function to restrict the number of results where the maximum number of documents accessed is known up front. This may give better query performance. The word may is important as, in the absence of an index, the collection might still be completely scanned if the number of matches is not met.
Improvement using indexes
Moving on, we then create a compound index on the state and pincode fields. The order of the index is ascending in this case (as the value is one) and is not significant unless we plan to sort on multiple keys. The order is a deciding factor as to whether the result can be sorted using the index alone or whether mongo needs to sort it in memory before returning the results. As far as the plan of the query is concerned, we can see that there is a significant improvement:
{ "executionStages" : { "stage" : "FETCH", … "inputStage" : { "stage" : "IXSCAN", … "nReturned" : 6446, "totalDocsExamined" : 6446, "docsExamined" : 6446, … "executionTimeMillis" : 4, … }
The inputStage field now has the value IXSCAN, which shows that the index is indeed used. The number of results stays, as expected, the same at 6446. The number of entries scanned in the index and the number of documents scanned in the collection have now come down to the same number of documents as in the result. This is because we have now used an index that gives us the starting document to scan, and only then is the required number of documents scanned. This is similar to using a book's index to find a word rather than scanning the entire book to search for it. As expected, the time in executionTimeMillis has reduced as well.
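Since the index stores pincode in sorted order within each state, a sort on pincode for queries that pin down the state should also be served by this index without an in-memory sort. This is a quick check you can run yourself; the following query is my own illustration and not part of the recipe's steps:
> db.postalCodes.find({state:'Maharashtra'}).sort({pincode:-1}).explain('executionStats')
If the index is used for the ordering, the plan will contain no SORT stage; the IXSCAN stage simply walks the index entries (in reverse, for a descending sort).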
Improvement using covered indexes
This leaves us with one field, executionStages, whose value is FETCH, and we will see what this means. To know what this value is, we need to look briefly at how indexes operate.
Indexes store a subset of fields of the original document in the collection. The fields present in the index are the same as those that the index is created on. The fields, however, are kept sorted in the index in an order specified during the index creation. Apart from the fields, there is an additional value stored in the index that acts as a pointer to the original document in the collection. Thus, whenever the user executes a query, the index is consulted to get a set of matches if the query contains fields that an index is present on. The pointer, stored with the index entries matching the query, is then used to make another IO operation to fetch the complete document from the collection, which is then returned to the user.
The FETCH value of executionStages indicates that the data requested by the user in the query is not entirely present in the index; an additional IO operation is needed to retrieve the entire document from the collection by following the pointer from the index. Had the value been present in the index itself, an additional operation to retrieve the document from the collection would not be necessary, and the data from the index would be returned. This is called a covered index and, as we will see next, the FETCH stage then disappears from the plan, leaving a PROJECTION stage whose input stage is IXSCAN.
In our case, we just needed the pincodes. So, why not use projection in our queries to retrieve just what we need? This would also make the query covered, as the index entry has just the state's name and the pincode, and the required data can be served completely from the index without retrieving the original document from the collection. The plan of the query in this case is interesting too.
Execute the following command:
db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
This gives us the following plan:
{ "executionStages" : { "stage" : "PROJECTION", … "inputStage" : { "stage" : "IXSCAN", … "nReturned" : 6446, "totalDocsExamined" : 0, "totalKeysExamined": 6446 "executionTimeMillis" : 4, … }
The totalDocsExamined value and the PROJECTION value of the execution stage are something to observe. As expected, the data that we requested in the projection can be served from the index alone. In this case, we scanned 6446 entries in the index, and thus the totalKeysExamined value is 6446. As the entire result was fetched from the index, our query did not fetch any documents from the collection. Hence, the value of totalDocsExamined is 0.
As this collection is small, we do not see a significant difference in the execution time of the query. This will be more evident on larger collections. Making use of indexes is great and gives us good performance. Making use of covered indexes gives us an even better performance.
Note
The explain results feature of MongoDB has had a major overhaul in version 3.0. I would suggest spending a few minutes going through its documentation at http://docs.mongodb.org/manual/reference/explain-results/.
Another thing to remember is that if your document has a lot of fields, try to use projection to retrieve only the fields we need. The _id field is retrieved every time by default; unless we plan to use it, set _id:0 so that it is not retrieved when it is not a part of the index. Executing a covered query is the most efficient way to query a collection.
Some caveats of index creations
We will now see some pitfalls in index creation and some facts when an array field is used in the index.
Some of the operators that do not use the index efficiently are the $where, $nin, and $exists operators. Whenever these operators are used in the query, one should bear in mind that a possible performance bottleneck might occur when the data size increases.
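A quick way to spot such a bottleneck on your own data is to explain one of these queries and compare the number of examined keys and documents against the number of results. The following query is a sketch of my own against the same postalCodes collection:
> db.postalCodes.find({state: {$nin: ['Maharashtra', 'Gujarat']}}).explain('executionStats')
If totalDocsExamined stays close to the total number of documents in the collection despite the index on state, the operator is not being served efficiently by the index.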
Similarly, the $in operator must be preferred over the $or operator, as both can be used to achieve more or less the same result. As an exercise, try to find the pincodes in the states of Maharashtra and Gujarat in the postalCodes collection. Write two queries: one using $or and one using the $in operator. Explain the plan for both these queries; a starting point is sketched below.
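The following two queries are my suggested starting point for the exercise, not part of the original steps:
> db.postalCodes.find({$or: [{state:'Maharashtra'}, {state:'Gujarat'}]}, {pincode:1, _id:0}).explain('executionStats')
> db.postalCodes.find({state: {$in: ['Maharashtra', 'Gujarat']}}, {pincode:1, _id:0}).explain('executionStats')
Compare the winning plans and the totalKeysExamined values in the two outputs.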
What happens when an array field is used in the index?
Mongo creates an index entry for each element present in the array field of a document. So, if there are 10 elements in an array in a document, there will be 10 index entries, one for each element in the array. However, there is a constraint while creating indexes that contain array fields. When creating indexes using multiple fields, no more than one field can be of type array; this is done to prevent a possible explosion in the number of index entries on adding even a single element to an array used in the index. If we think about it carefully, an index entry is created for each element in the array. If multiple fields of type array were allowed to be a part of an index, then we would have a large number of entries in the index, which would be a product of the lengths of these array fields. For example, a document with two array fields, each of length 10, would add 100 entries to the index if one index were allowed to be created using these two array fields.
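You can see this constraint in action on a throwaway collection; the collection name here is hypothetical:
> db.arrayTest.insert({a: [1, 2], b: [3, 4]})
> db.arrayTest.createIndex({a: 1, b: 1})
The second command fails with a "cannot index parallel arrays" error, as both a and b are arrays in the same document.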
This should be good enough, for now, to scratch the surface of a plain, vanilla index. We will see more options and types of indexes in the following few recipes.