May 20, 2015

Cassandra On AWS - Part 1 - Setup

This is the first post in a multi-part series on Cassandra. In this part I will explain how to set up Cassandra on AWS using EC2 instances.

Deployment Model

Cassandra will be set up as a multi-node cluster on AWS. AWS has several regions, and each region consists of multiple availability zones (AZs). The AZs in a region are connected to each other by low-latency links. This makes it possible to build a fault-tolerant system by distributing the service or data across different zones in a region.
Most AWS regions have 3 AZs. As a best practice, run at least one EC2 instance in each AZ of the region. If you are working with a region that has only 2 AZs, one of the AZs will have one more EC2 instance than the other.
Cassandra Architecture On AWS
Fig 1: Cassandra deployment in a region spread over multiple availability zones (AZ)
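
The placement rule above can be sketched in a few lines of Python. This is only an illustration: the function name `spread_nodes` is hypothetical, and the zone names are assumed to follow the usual us-west-2 naming.

```python
from collections import Counter
from itertools import cycle, islice


def spread_nodes(node_count, zones):
    """Assign nodes to availability zones round-robin.

    Returns a list of (node_index, zone) pairs.
    """
    if not zones:
        raise ValueError("at least one availability zone is required")
    return list(enumerate(islice(cycle(zones), node_count)))


# Zone names are assumptions used for illustration only.
placement = spread_nodes(3, ["us-west-2a", "us-west-2b", "us-west-2c"])
per_zone = Counter(zone for _, zone in placement)
print(per_zone)
```

With three nodes and two zones, the same function places two nodes in the first zone and one in the second, which is the uneven case described above.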


Setup


AWS

  • Choose a region which has three availability zones. For the purpose of this post, I have chosen us-west-2.
  • Spin up one EC2 instance in each AZ.
  • Databases should always be on private subnets of a VPC and should not be open to the internet.
  • Update the firewall settings on the security group associated with the EC2 instances to allow incoming traffic on the following ports:
    • Port 9042: used by CQL clients
    • Port 7000: used for inter-node communication in the cluster
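
The two firewall rules can be written down as data before they are applied. The sketch below builds them in the dictionary shape that boto3's EC2 `authorize_security_group_ingress` call accepts for its `IpPermissions` parameter. The helper name and the CIDR block `10.101.0.0/16` are assumptions for illustration; use the CIDR of your own VPC.

```python
import ipaddress

CQL_PORT = 9042        # CQL clients
INTERNODE_PORT = 7000  # inter-node communication


def cassandra_ingress_rules(vpc_cidr):
    """Build ingress rules that open the two Cassandra ports to one CIDR block."""
    # Raises ValueError if the CIDR is malformed.
    ipaddress.ip_network(vpc_cidr)
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": vpc_cidr, "Description": description}],
        }
        for port, description in (
            (CQL_PORT, "CQL clients"),
            (INTERNODE_PORT, "Cassandra inter-node"),
        )
    ]


# Assumed VPC range; the rules stay private because no public CIDR is listed.
rules = cassandra_ingress_rules("10.101.0.0/16")
for rule in rules:
    print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

If boto3 is installed and credentials are configured, the list could be passed as `IpPermissions=rules` together with the `GroupId` of your security group. Limiting the source range to the VPC keeps the database ports closed to the internet.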

Installable

  • Download Cassandra 2.2.4 from the URL below.
    http://downloads.datastax.com/community/dsc-cassandra-2.2.4-bin.tar.gz
  • Extract it to a location on the EC2 machine.
  • Do this on all three nodes.
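
If you would rather script the extraction step than run it by hand on each node, a minimal sketch using only the Python standard library could look like this. The function name and the install directory are hypothetical; the archive is assumed to have been downloaded already.

```python
import tarfile
from pathlib import Path


def extract_cassandra(archive_path, install_dir):
    """Extract a .tar.gz archive into install_dir and return its top-level entries."""
    install_dir = Path(install_dir)
    install_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(install_dir)
    return sorted(entry.name for entry in install_dir.iterdir())
```

For example, `extract_cassandra("dsc-cassandra-2.2.4-bin.tar.gz", "/opt/cassandra")` would unpack the tarball under `/opt/cassandra`, where `/opt/cassandra` is just a placeholder location.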

Cassandra Configuration

Update the configuration files under the conf folder of the Cassandra installation.
  
File : cassandra-rackdc.properties

dc_suffix = 2a_cassandra   # Suffix appended to the datacenter (DC) name. DC names are assigned automatically by Ec2Snitch/Ec2MultiRegionSnitch; use the same suffix on every node that should belong to the same DC.
prefer_local = true
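
To see what the snitch does with these values, here is a small Python sketch of the naming rule as far as I understand it for the EC2 snitch in Cassandra 2.x: the rack is the last part of the availability zone name, the datacenter is the region part, and dc_suffix is appended to the datacenter name. The function is a hypothetical illustration and not Cassandra code; check the real names on your cluster with nodetool status.

```python
def ec2_snitch_names(availability_zone, dc_suffix=""):
    """Return the (datacenter, rack) pair derived from an AZ name."""
    # Rack: the last dash-separated part, e.g. "2a" for "us-west-2a".
    rack = availability_zone.split("-")[-1]
    # Datacenter: the AZ name without its trailing zone letter.
    datacenter = availability_zone[:-1]
    # Assumed legacy behaviour: regions ending in "1" drop the number too,
    # so "us-east-1a" becomes "us-east".
    if datacenter.endswith("1"):
        datacenter = availability_zone[:-3]
    return datacenter + dc_suffix, rack


print(ec2_snitch_names("us-west-2a"))
print(ec2_snitch_names("us-west-2a", dc_suffix="_cassandra"))
```

Because the suffix becomes part of the datacenter name, nodes with different suffixes end up in different datacenters.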
  
  
File : cassandra.yaml

partitioner: org.apache.cassandra.dht.Murmur3Partitioner  # This is the default and we will keep it as it is. Used to hash and distribute the keys across different nodes.

endpoint_snitch: Ec2Snitch  # Use Ec2MultiRegionSnitch if the cluster spans multiple regions. Otherwise use Ec2Snitch.

listen_address: 10.101.212.201  # This will be the private IP address of the EC2 instance. Will vary from instance to instance.

broadcast_address: 10.101.212.201  # private IP address of the EC2 instance.

rpc_address: 10.101.212.201  # private IP address of the EC2 instance

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.101.212.206,10.101.214.60"   # IP addresses of the nodes acting as seeds

key_cache_size_in_mb: 100000  # Maximum size of the key cache, in MB.

data_file_directories:   # Location where the database files are stored.
   - /local/mnt/cassandra/data

commitlog_directory: /local/mnt/cassandra/commitlog  # Location where the commit logs are stored.
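
Only the address settings differ from node to node, so it can help to generate them instead of typing them three times. The sketch below prints the per-node values for one instance. The helper names are hypothetical, the IP addresses are the ones used in this post, and the output is meant to be copied into cassandra.yaml by hand; the script does not edit the file.

```python
import ipaddress


def node_overrides(private_ip, seed_ips):
    """Return the cassandra.yaml values that depend on the node's addresses."""
    # Fail early on typos in any address.
    for address in [private_ip, *seed_ips]:
        ipaddress.ip_address(address)
    return {
        "listen_address": private_ip,
        "broadcast_address": private_ip,
        "rpc_address": private_ip,
        # The seed list is the same on every node.
        "seeds": ",".join(seed_ips),
    }


def render(overrides):
    """Format the overrides as 'key: value' lines, quoting the seed list."""
    lines = []
    for key, value in overrides.items():
        if key == "seeds":
            value = '"{}"'.format(value)
        lines.append("{}: {}".format(key, value))
    return "\n".join(lines)


SEEDS = ["10.101.212.206", "10.101.214.60"]
print(render(node_overrides("10.101.212.201", SEEDS)))
```

Run it once per instance with that instance's private IP and keep the seed list unchanged.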