In this blog I will explain how to perform common operations on a collection in MongoDB.

Let's consider a scenario where we have a messaging app. A user can communicate with one or more users through the application, and the system tracks all the communication in MongoDB.


We have a collection named “messages” in MongoDB.

Below are sample documents stored in the collection:

{
	"_id": "582a15149bbc2d2f8898628b",
	"senderUserId": 1234,
	"receiverUserId": 7890,
	"message": "Hi, How are you?",
	"sentDate": ISODate("2016-11-14T19:48:36.379Z"),
	"read": false
},
{
	"_id": "722a15149bbc4d2f8898628b",
	"senderUserId": 7890,
	"receiverUserId": 1234,
	"message": "I am doing well. How about you?",
	"sentDate": ISODate("2016-11-15T02:10:11.123Z"),
	"read": true
},
{
	"_id": "150000149bb42d2f8898628b",
	"senderUserId": 5678,
	"receiverUserId": 2222,
	"message": "Hola !!!",
	"sentDate": ISODate("2016-11-17T01:04:06.079Z"),
	"read": false
}

Query #1 - Create document (User sends a message to another)

db.getCollection('messages').insertOne({
	"_id": "582a15149bbc2d2f8898628b",
	"senderUserId": 1234,
	"receiverUserId": 7890,
	"message": "Hi, How are you?",
	"sentDate": ISODate("2016-11-14T19:48:36.379Z"),
	"read": false
})

Query #2 - Message history between two users with the most recent message at the bottom

db.getCollection('messages').find({
	$or: [{
		senderUserId: 1234,
		receiverUserId: 7890
	}, {
		senderUserId: 7890,
		receiverUserId: 1234
	}]
}).sort({
	sentDate: 1
})
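The symmetric $or filter above can also be built programmatically on the client side. A minimal sketch in plain JavaScript (the helper name is my own, not part of any driver):

```javascript
// Hypothetical helper: builds the two-way filter for a conversation
// between two users, matching messages sent in either direction.
function conversationFilter(userA, userB) {
	return {
		$or: [
			{ senderUserId: userA, receiverUserId: userB },
			{ senderUserId: userB, receiverUserId: userA }
		]
	};
}

// e.g. db.getCollection('messages').find(conversationFilter(1234, 7890)).sort({ sentDate: 1 })
```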

Query #3 - Retrieve all messages received by a user from other users that are still unread by the receiving user

db.getCollection('messages').find({
	receiverUserId: 1234,
	read: false
}).sort({
	sentDate: 1
})

Query #4 - Retrieve all messages sent by a user to other users

db.getCollection('messages').find({
	senderUserId: 1234,
}).sort({
	sentDate: 1
})

Query #5 - Retrieve the count of all messages sent by the user to date

db.getCollection('messages').find({
	senderUserId: 1234,
}).count()

Query #6 - Retrieve grouped count of unread messages per contact

db.getCollection('messages').aggregate([{
		$match: {
			receiverUserId: 1234,
			read: false
		}
	},
	{
		$group: {
			_id: "$senderUserId",
			count: {
				$sum: 1
			}
		}
	}

])

The only problem with the above query is that the order of the results is not guaranteed. We want the contact with the most recent communication to appear at the top. To achieve this ordering, we need to sort the results in descending order of the ‘sentDate’ field. The contact here is simply the sender.

The above query can be rewritten as follows:

db.getCollection('messages').aggregate([{
		$match: {
			receiverUserId: 1234,
			read: false
		}
	}, {
		$sort: {
			sentDate: -1
		}
	}, {
		$group: {
			_id: "$senderUserId",
			count: {
				$sum: 1
			},
			sentDate: {
				$first: "$sentDate"
			}
		}
	}, {
		$sort: {
			sentDate: -1
		}
	}, {
		$project: {
			_id: 1,
			count: 1
		}
	}
])
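The sort-then-group-then-resort logic can be illustrated in plain JavaScript over an in-memory array of message documents (a sketch of the logic, not how the server executes the pipeline; the function name is my own):

```javascript
// Returns unread counts per sender, ordered by most recent communication first.
function unreadSummary(messages, receiverUserId) {
	const latest = {};   // senderUserId -> most recent sentDate
	const counts = {};   // senderUserId -> unread message count
	for (const m of messages) {
		if (m.receiverUserId !== receiverUserId || m.read) continue;
		counts[m.senderUserId] = (counts[m.senderUserId] || 0) + 1;
		if (!latest[m.senderUserId] || m.sentDate > latest[m.senderUserId]) {
			latest[m.senderUserId] = m.sentDate;
		}
	}
	// Sort contacts by their latest sentDate, descending.
	return Object.keys(counts)
		.sort((a, b) => latest[b] - latest[a])
		.map(id => ({ _id: Number(id), count: counts[id] }));
}
```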

Query #7: Get a unique list of IDs of all people the user sent messages to

db.getCollection('messages').aggregate([{
		$match: {
			senderUserId: 1234
		}
	}, {
		$group: {
			"_id": "$receiverUserId"
		}
	}, {
		$project: {
			"receiverUserId": "$_id",
			"_id": 0
		}
	}

]);
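For reference, the same deduplication can be sketched in plain JavaScript over an in-memory array of documents (an illustration of the logic only; the helper name is my own):

```javascript
// Returns the distinct receiver IDs for messages sent by the given user.
function uniqueRecipients(messages, senderUserId) {
	const ids = messages
		.filter(m => m.senderUserId === senderUserId)
		.map(m => m.receiverUserId);
	return [...new Set(ids)];   // a Set preserves first-seen order
}
```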

Query #8: Count of messages sent by the user to each of the contacts

db.getCollection('messages').aggregate([{
		$match: {
			senderUserId: 1234
		}
	}, {
		$group: {
			"_id": "$receiverUserId",
			count: {
				$sum: 1
			}
		}
	},
	{
		$project: {
			"receiverUserId": "$_id",
			"_id": 0,
			count: 1
		}
	}
]);

Query #9: Get a paginated list of messages sent between one user and another [Pagination Approach #1: Using pageSize and pageNum]

 


db.getCollection('messages').find({
	senderUserId: 1234,
	receiverUserId: 7890
}).sort({sentDate:-1}).skip(0).limit(10)       // Fetch the first page of the paginated messages. Each page will have 10 messages


db.getCollection('messages').find({
	senderUserId: 1234,
	receiverUserId: 7890
}).sort({sentDate:-1}).skip(10).limit(10)       // Fetch the second page of the paginated messages.



General format:

db.getCollection('messages').find({
	senderUserId: 1234,
	receiverUserId: 7890
}).sort({sentDate:-1}).skip((pageNum - 1) * pageSize).limit(pageSize)       // General format for fetching page 'pageNum' with 'pageSize' messages per page
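Note the parentheses around (pageNum - 1): without them, operator precedence would compute pageNum - (1 * pageSize) instead. A small sketch of the arithmetic (the helper name is my own):

```javascript
// Computes the skip/limit window for a 1-based page number.
function pageWindow(pageNum, pageSize) {
	if (pageNum < 1 || pageSize < 1) throw new Error("pageNum and pageSize must be >= 1");
	return { skip: (pageNum - 1) * pageSize, limit: pageSize };
}

console.log(pageWindow(1, 10)); // { skip: 0, limit: 10 }
console.log(pageWindow(2, 10)); // { skip: 10, limit: 10 }
```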

Query #10: Get a list of messages sent between one user and another [Pagination Approach #2: Range-based pagination]

 

// Fetch 10 messages that were exchanged between the users before the provided timestamp.

db.getCollection('messages').find({
	senderUserId: 1234,
	receiverUserId: 7890,
	sentDate: {
		$lt: ISODate("2016-11-14T19:48:36.379Z")
	}
}).sort({
	sentDate: -1
}).limit(10)



// Fetch 10 messages that were exchanged between the users after the provided timestamp.

db.getCollection('messages').find({
	senderUserId: 1234,
	receiverUserId: 7890,
	sentDate: {
		$gt: ISODate("2016-11-14T19:48:36.379Z")
	}
}).sort({
	sentDate: 1
}).limit(10)

Query #11: Mark messages received by a user from another user as read.

 
db.getCollection('messages').update({
	receiverUserId: 1234,
	senderUserId: 7890,
	read: false
}, {
	$set: {
		read: true
	}
}, {
	multi: true
})

Let's assume that MongoDB is being used as a data store for an e-commerce application (similar to Amazon.com) on which users can shop for and buy the items they desire.

All user purchases are stored in a collection named ‘purchase_history’. Each document in the collection is structured as below:

{
	"_id": ObjectId("581006910d96be013c3ee72d"),
	"itemId": "1112-3333-4442-1111",
	"userId": "353236",
	"category": "sports",
	"purchaseDate": ISODate("2016-10-26T01:27:45.195Z"),
	"purchasePrice": 32.3,
	"currency": "DOLLARS"
}

The site also has a section that displays the 5 most recent categories associated with the products the user shopped for.

For example, if the user purchased 10 items of various categories in the following order (most recent first):

sports, kitchen, sports, furniture, sports, sports, clothing, jewellery, sports, shoes

The “Recent 5 categories” section will show the following data:

sports, kitchen, furniture, clothing, jewellery


Logically, we can achieve the above result with the following steps:

  1. Sort the data in ‘purchase_history’ by ‘purchaseDate’ in descending order so that the most recent purchase is at the top.
  2. Filter distinct values of the ‘category’ field in the sorted result of step #1.
  3. Extract the top 5 values.

The above can be achieved through a Mongo query using the Aggregation framework, combining $group with $first.

 
 db.purchase_history.aggregate([
 	{ $match: { userId: "353236" } },
 	{ $sort: { purchaseDate: -1 } },
 	{ $group: { _id: "$category", purchaseDate: { $first: "$purchaseDate" } } },
 	{ $sort: { purchaseDate: -1 } },
 	{ $limit: 5 }
 ]);
                     
 
 
 Overview:

 1. Find all records associated with the userId.
 2. Sort by purchaseDate in descending order.
 3. Group by category and project 'purchaseDate' as the first record in each group. The first record in this case will be the one with the latest 'purchaseDate' for each category.
 4. Re-sort by the projected purchaseDate because the 'group' operation does not maintain any previous ordering.
 5. Limit the output to the top 5 records.
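The same logic can be illustrated in plain JavaScript over an in-memory array of purchases (a sketch of the algorithm, not how the aggregation pipeline executes; the function name is my own):

```javascript
// Returns the n most recent distinct categories, newest first.
// Assumes each purchase has a 'category' and a 'purchaseDate' (Date or timestamp).
function recentCategories(purchases, n) {
	const sorted = [...purchases].sort((a, b) => b.purchaseDate - a.purchaseDate);
	const seen = new Set();
	const result = [];
	for (const p of sorted) {
		if (!seen.has(p.category)) {
			seen.add(p.category);
			result.push(p.category);
			if (result.length === n) break;  // stop once we have the top n
		}
	}
	return result;
}
```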
 
 
 

Some commonly used Docker commands when working on OSX with VirtualBox.

1) This ensures the boot2docker VM is up and running.

docker-machine start default


2) This command needs to be executed once for every new terminal window opened, to establish a connection with the boot2docker VM.

eval "$(docker-machine env default)"


3) This command returns the IP address of the ‘default’ VM running boot2docker

docker-machine ip default


4) Returns a list of VMs with boot2docker installed and running on VirtualBox.

docker-machine ls

Sample Output:

NAME        ACTIVE   DRIVER       STATE     URL                         SWARM   ERRORS
default     *        virtualbox   Running   tcp://192.168.99.100:2376           


5) Regenerates the TLS certificates needed to communicate with the ‘boot2docker’ VM.

docker-machine regenerate-certs default


6) Lists all available docker images

 docker images


7) Looks up the Docker container with a particular container ID.

docker ps -a | grep <container-id>


8) Removes a docker container (the container must be stopped, or use 'docker rm -f' to force removal)

docker rm <container id>


9) Removes a particular docker image

docker rmi <image-id>


10) Kills a particular docker container

docker kill <container-id>


11) Connects to the docker container and gives access to the bash shell to execute commands on the container

docker exec -i -t <container id> /bin/bash


12) Runs a docker image and does not automatically terminate the container. Allows an easy way to keep the container running and later connect to the container's shell using the “docker exec -it” command.

docker run -d <image-name> tail -f /dev/null


13) This is not a docker command. This utility is part of the VirtualBox installation and allows one to power off a VM running on VirtualBox.

VBoxManage controlvm <vm-name> poweroff


14) Again, not a docker command. A VirtualBox command to share a directory on the host OS (OSX: /Users/esrinivasan/..) with the guest operating system running on VirtualBox. The name of the shared folder is set to “/eks” here.

VBoxManage sharedfolder add default --name /eks --hostpath /Users/esrinivasan/develop/learning/Docker/dec-2016/docker-node/docker-volume


Monitoring is a very important aspect of any service. This applies not just to web applications but also to databases and other supporting tools.

On MongoDB Atlas, users have the ability to set up alerts based on various metrics (host level, DB level and networking level).

In this blog I will try to highlight some of the alerts which will be very useful for monitoring a production MongoDB cluster on Atlas.

Setting Up The Alerts:

  1. Log into MongoDB Atlas.

  2. Select your cluster.

  3. Select the ‘Alerts’ icon on the side bar.

  4. Clicking on the ‘Add’ button will bring up a pop-up window which allows you to select the various conditions under which to trigger the alerts.


Here is a sample of some of the alerts setup on MongoDB Atlas. The alerts are set up to go to

  1. Group Owner : The set of users who belong to the group associated with the MongoDB cluster.
  2. Email : Send an email to “mongo.prod.alert@corp.com”
  3. Slack : Atlas has integration with Slack and can send the alerts to a Slack channel.

There are several other integrations that can be set up, as shown in the above screenshot.

Escalation levels can be configured based on how long the error condition or state lasts. The below screenshot has the alert “Tickets Available: Reads below 20” set up as a multi-level alert, with escalation to the “prod-alerts-channel” (Slack) if the error condition lasts more than 10 minutes. The first level of alerting always goes to the “Group Owners” and the email group (mongo.prod.alert@corp.com).


Alerts (these are set up at the host target level)

Condition: Effective lock % above 80
Condition Lasts: 5 Minutes
Explanation: In WiredTiger, locks are at the document level. Write-heavy databases may see lock % up to 60%, whereas read-heavy databases may not see more than 10%. Too much document locking will lead to reduced query performance for write operations, since new writes must wait until the lock on the document is released by the previous write operation.

Condition: Disk I/O % utilization on Data Partition above 80
Condition Lasts: 5 Minutes
Explanation: 80% is a good threshold for this. MongoDB caches records in memory for quick retrieval. High I/O utilization may be due to frequent data flushes from memory to disk, or due to a large number of cache misses that cause Mongo to retrieve the missing documents from disk. The latter may be due to the smaller memory size of the host. Either way, too much disk I/O will induce latency in queries.

Condition: Tickets Available: Reads below 20
Condition Lasts: 5 Minutes
Explanation: MongoDB uses a ticketing system to control the number of concurrent operations that can be performed. Each ticket corresponds to an operation; this is analogous to threads in Java programming. By default there are 128 read tickets available on a host. If the number of available tickets reaches 0, all subsequent read operations queue up until more tickets become available.

Condition: Tickets Available: Writes below 20
Condition Lasts: 5 Minutes
Explanation: By default there are 128 write tickets available on a host. If the number of available tickets reaches 0, all subsequent write operations queue up until more tickets become available.

Condition: Queues: Readers above 100
Condition Lasts: 5 Minutes
Explanation: Reads start queuing up when the number of available read tickets is zero (0).

Condition: Queues: Writers above 100
Condition Lasts: 5 Minutes
Explanation: Writes start queuing up when the number of available write tickets is zero (0).

Condition: Replication Lag above 1 minute
Condition Lasts: 5 Minutes
Explanation: Mongo commits data to the primary first, and the same data is replicated asynchronously to the secondaries. If clients read from the secondaries, a large lag will cause inconsistent or stale data to be returned.

Condition: System: CPU (User) % above 80
Condition Lasts: 5 Minutes
Explanation: Badly written queries are one of the major reasons for CPU % shooting up. This could be due to a large number of records being scanned to return a result, or a lack of index usage.

Condition: Memory Resident above {}
Condition Lasts: 5 Minutes
Explanation: Resident memory is the total amount of memory being used by MongoDB and its processes. An alert should be triggered when usage goes above 80% of the total memory available on the host.

Condition: Connections above {}
Condition Lasts: 5 Minutes
Explanation: See https://docs.atlas.mongodb.com/connect-to-cluster/. Depending on the type of cluster you set up, the maximum number of connections should not go above 80% of the total connections possible on the instance type. Too many connections could be due to increased traffic, or to an issue in the connection-pooling mechanism of the client code where old connections are not released or existing connections are not reused.

Condition: Average Execution Time: Commands above {} ms
Condition Lasts: Immediate
Explanation: Trigger alerts if queries take more than "X" milliseconds to execute. The threshold may vary based on the use case.

Condition: Disk space % used on Data Partition above 80
Condition Lasts: 5 Minutes
Explanation: No disk space means no space for MongoDB to save data. Make sure that at least 20% of free space is left on the MongoDB hosts. This gives enough time to scale out or add more disk space.

Condition: Background Flush Average above 1000 ms
Condition Lasts: 5 Minutes
Explanation: MongoDB first makes data changes in memory and then flushes the changes to disk every 60s by default. If journaling is enabled, memory changes are journaled every 100ms and then flushed to disk. The time taken to flush data to disk depends on the disk I/O available and on the amount of data being flushed. A large number of write operations will cause a large amount of data to be flushed to disk. Use of magnetic storage could be another reason; solid-state storage devices have better I/O capabilities.

Condition: Network Bytes In or Network Bytes Out
Condition Lasts: 5 Minutes
Explanation: Optional. This alert cannot be set up right at the beginning; one needs to observe the production traffic pattern on the cluster for a couple of weeks and use those metrics to come up with a threshold for network bytes in/out. This metric can also be used in conjunction with the ops counters. Any aberration in the network pattern could indicate increased traffic or a possible attack on the system.

What is MongoDB Atlas?

MongoDB Atlas is the cloud-hosted version of MongoDB. This is the software-as-a-service flavor of the NoSQL database.

Why MongoDB Atlas?

Since this is software as a service hosted on the cloud, it has several advantages over the on-premise version:

  1. Easy and quick (around 10-15 mins) to set up a production-standard single replica set or sharded MongoDB cluster.

  2. Highly reliable cluster. The cluster is deployed on AWS (Amazon Web Services). Each member of the replica set is deployed in a different availability zone, thus providing a higher degree of fault tolerance for the replica set.

  3. No need to have NoSQL database management expertise.

  4. Provides users the ability to configure backups of the database and also restore the data from backup when necessary.

  5. Automated handling of failure scenarios. If the primary node goes down then Atlas automatically handles election of a new primary and recovery of the broken node.

  6. Tracks various system level and database level metrics. These metrics are represented as graphs for users to drill down for a particular time frame and granularity.

  7. Set up alerts based on different metrics and custom user-set thresholds.

  8. Pay as you go pricing. The pricing is dependent on your usage and the configuration you choose. This gives flexibility to organizations based on their use cases and financial capabilities.

Setting up a cluster

Once you have registered and created an account, you will be presented with a form where you can choose your desired configuration for the database.

The pricing displayed in the screenshots below is for the base minimum configuration. The price varies based on the choices you make for the below parameters.

To quickly give an overview of the choices you have:

  1. The DB engine (WiredTiger 3.2 or WiredTiger 3.4). The older MMAP (memory-mapped) engine is not supported in MongoDB Atlas.
  2. Right now the hosting is available only in the Oregon region of AWS.
  3. The size of the instances on which to host the database. The instances need to be homogeneous. Different instances have different memory and CPU. However, the storage can be configured as desired.
  4. Replication factor: the number of replicas in a replica set.
  5. Whether to set up a single replica set (stand-alone cluster) or a sharded cluster.
  6. Whether backup needs to be enabled or not.