Cassandra Database



1. What is Cassandra?  

Short Answer

Cassandra is a distributed NoSQL database system. It is an open source data storage system designed to store and manage large volumes of data with no single point of failure.

Apache Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers. It can serve as both:
  • A real-time data store for online applications
  • A read-intensive database for business intelligence systems

Long Answer

Cassandra first started as an incubation project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made minor releases since that time. Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used in production by some of the biggest properties on the Web, including Facebook, Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.


Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there's no single point of failure. It is highly available and offers a schema-free data model.

Listed below are some of the notable points of Apache Cassandra:
  • It is scalable, fault-tolerant, and consistent.
  • It is a column-oriented database.
  • Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
  • Created at Facebook, it differs sharply from relational database management systems.
  • Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.


2. What is NoSQL?  

Short Answer

NoSQL encompasses a wide variety of different database technologies that were developed in response to the demands presented in building modern applications:
  • Developers are working with applications that create massive volumes of new, rapidly changing data types — structured, semi-structured, unstructured and polymorphic data.
  • Long gone is the twelve-to-eighteen month waterfall development cycle. Now small teams work in agile sprints, iterating quickly and pushing code every week or two, some even multiple times every day.
  • Applications that once served a finite audience are now delivered as services that must be always-on, accessible from many different devices and scaled globally to millions of users.
  • Organizations are now turning to scale-out architectures using open source software, commodity servers and cloud computing instead of large monolithic servers and storage infrastructure.
  • Relational databases were not designed to cope with the scale and agility challenges that face modern applications, nor were they built to take advantage of the commodity storage and processing power available today.

Long Answer

A NoSQL (originally referring to "non SQL", "non relational" or "not only SQL") database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and Amazon.com. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages.

Motivations for this approach include: simplicity of design, simpler "horizontal" scaling to clusters of machines (which is a problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as "more flexible" than relational database tables.

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages (instead of SQL, for instance the lack of ability to perform ad-hoc JOINs across tables), lack of standardized interfaces, and huge previous investments in existing relational databases. Most NoSQL stores lack true ACID transactions, although a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB have made them central to their designs.

Instead, most NoSQL databases offer a concept of "eventual consistency" in which database changes are propagated to all nodes "eventually" (typically within milliseconds) so queries for data might not return updated data immediately or might result in reading data that is not accurate, a problem known as stale reads. Additionally, some NoSQL systems may exhibit lost writes and other forms of data loss. Fortunately, some NoSQL systems provide concepts such as write-ahead logging to avoid data loss. For distributed transaction processing across multiple databases, data consistency is an even bigger challenge that is difficult for both NoSQL and relational databases.



3. Where Did Cassandra Come From?  

Short Answer

In Greek mythology, Cassandra was the daughter of King Priam of Troy. She could accurately predict the future, but nobody believed her. It's not entirely clear whether the data store was named for this reason, but one line of reasoning is that NoSQL database solutions are inevitable for today's and future data needs, yet they meet strong resistance (or disbelief) from the traditional RDBMS world.

Long Answer

The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra originated at Facebook in 2007 to solve that company's inbox search problem, in which they had to deal with large volumes of data in a way that was difficult to scale with traditional methods. Specifically, the team had requirements to handle huge volumes of data in the form of message copies, reverse indices of messages, and many random reads and many simultaneous random writes.

The team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Ranganathan, and Facebook engineer on the Search Team Prashant Malik as key engineers. The code was released as an open source Google Code project in July 2008. During its tenure as a Google Code project in 2008, the code was updateable only by Facebook engineers, and little community was built around it as a result. So in March 2009 it was moved to an Apache Incubator project, and on February 17, 2010 it was voted into a top-level project.

Cassandra today presents a kind of paradox: it feels new and radical, and yet it's solidly rooted in many standard, traditional computer science concepts and maxims that successful predecessors have already institutionalized. Cassandra is a realist's kind of database; it doesn't depart from the relational model to be a fun art project or experiment for smart developers. It was created specifically to solve a real-world problem that existing tools weren't able to solve. It acknowledges the limitations of prior methods and faces our new world of big data head-on.



4. How Did Cassandra Get Its Name?  

Short Answer

In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen—but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her. I speculate that it is also named as kind of a joke on the Oracle at Delphi, another seer for whom a database is named.

Long Answer

Use Cases for Cassandra

Large Deployments

You probably don't drive a semi truck to pick up your dry cleaning; semis aren't well suited for that sort of task. Lots of careful engineering has gone into Cassandra's high availability, tuneable consistency, peer-to-peer protocol, and seamless scaling, which are its main selling points. None of these qualities is even meaningful in a single-node deployment, let alone allowed to realize its full potential.

There are, however, a wide variety of situations where a single-node relational database is all we may need. So do some measuring. Consider your expected traffic, throughput needs, and SLAs. There are no hard and fast rules here, but if you expect that you can reliably serve traffic with an acceptable level of performance with just a few relational databases, it might be a better choice to do so, simply because RDBMS are easier to run on a single machine and are more familiar.

If you think you'll need at least several nodes to support your efforts, however, Cassandra might be a good fit. If your application is expected to require dozens of nodes, Cassandra might be a great fit.

Lots of Writes, Statistics, and Analysis

Consider your application from the perspective of the ratio of reads to writes. Cassandra is optimized for excellent throughput on writes.

Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve lots of writing with less predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.

According to the project wiki, Cassandra has been used to create a variety of applications, including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.

Geographical Distribution

Cassandra has out-of-the-box support for geographical distribution of data. You can easily configure Cassandra to replicate data across multiple data centers. If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.

Evolving Applications

If your application is evolving rapidly and you're in “startup mode,” Cassandra might be a good fit given its schema-free data model. This makes it easy to keep your database in step with application changes as you rapidly deploy.




5. Explain the concept of Tunable Consistency in Cassandra?  

Short Answer

Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas. Cassandra extends the concept of eventual consistency by offering tunable consistency: for any given read or write operation, the client application decides how consistent the requested data should be.

Long Answer

Tunable consistency is a characteristic that makes Cassandra a favored database choice of developers, analysts, and big data architects. Consistency refers to how up-to-date and synchronized a row of data is across all of its replicas. Cassandra's tunable consistency allows users to select the consistency level best suited to their use case. It supports two kinds of consistency: eventual consistency and strong consistency.

The former guarantees that, when no new updates are made to a given data item, all accesses eventually return the last updated value. Systems with eventual consistency are said to achieve replica convergence.

For Strong consistency, Cassandra supports the following condition:
R + W > N, where
N – Number of replicas
W – Number of nodes that need to agree for a successful write
R – Number of nodes that need to agree for a successful read.
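
The condition can be checked mechanically. The following is a minimal sketch in Python; the function names are illustrative and are not part of any Cassandra API. The reasoning is that when R + W > N, any set of R replicas that is read must share at least one replica with any set of W replicas that was written, so a read always reaches a replica holding the latest acknowledged write.

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Return True when every read set must overlap every write set (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return r + w > n


def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


N = 3
print(is_strongly_consistent(N, quorum(N), quorum(N)))  # majority writes, majority reads: True
print(is_strongly_consistent(N, 1, 1))                  # one replica each way: False
print(is_strongly_consistent(N, N, 1))                  # write to all, read from one: True
```

With three replicas, writing and reading at a majority (2 + 2 > 3) satisfies the condition, while writing and reading a single replica (1 + 1 ≤ 3) leaves only eventual consistency.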


6. List the benefits of using Cassandra?  

Unlike traditional databases, Apache Cassandra delivers near real-time performance, simplifying the work of developers, administrators, data analysts, and software engineers.
  • Instead of a master-slave architecture, Cassandra is built on a peer-to-peer architecture, so there is no single point of failure.
  • It is flexible: nodes can be added to any Cassandra cluster in any datacenter, and any client can send its request to any server.
  • Cassandra scales up and down easily as requirements change. It offers high throughput for read and write operations and does not need to be restarted while scaling.
  • Cassandra has strong data replication: data is stored at multiple locations, so users can retrieve it from another location if one node fails. Users can set the number of replicas they want to create.
  • It performs well on massive datasets, which makes it a common choice of NoSQL database.
  • It uses a column-oriented structure, which speeds up and simplifies slicing. Data access and retrieval also become more efficient with a column-based data model.
  • Apache Cassandra supports a schema-free/schema-optional data model, so you do not need to define up front all the columns your application requires.


7. What is the use of Cassandra and why to use Cassandra?  

Cassandra was designed to handle big data workloads across multiple nodes without any single point of failure. The main reasons for using Cassandra are:
  • It is fault tolerant and consistent
  • It scales from gigabytes to petabytes
  • It is a column-oriented database
  • No single point of failure
  • No need for separate caching layer
  • Flexible schema design
  • It has flexible data storage, easy data distribution, and fast writes
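  • It provides atomic, isolated, and durable writes at the row level with tunable consistency, rather than full ACID (Atomicity, Consistency, Isolation, and Durability) transactions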
  • Multi-data center and cloud capable
  • Data compression


8. In which language Cassandra is written?  

Cassandra is written in Java. It was originally designed at Facebook, offers flexible schemas, and is highly scalable for big data.



9. What was the design goal of Cassandra?  

The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure.



10. What do you understand by Commit log in Cassandra?  

Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
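
To illustrate how an append-only log supports crash recovery, here is a minimal sketch in Python. The `CommitLog` class, the one-JSON-record-per-line format, and the sync-on-every-write policy are assumptions made for illustration; they do not reflect Cassandra's actual on-disk format or sync settings.

```python
import json
import os
import tempfile


class CommitLog:
    """Append-only log; hypothetical format of one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def append(self, key, value):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # assumption: sync every write, for simplicity

    def replay(self):
        """Rebuild in-memory state by re-applying every logged write in order."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                state[record["key"]] = record["value"]
        return state


log_path = os.path.join(tempfile.mkdtemp(), "commitlog.jsonl")
log = CommitLog(log_path)
log.append("user:1", "alice")
log.append("user:2", "bob")
log.append("user:1", "alicia")  # a later write to the same key wins on replay

# Simulate a restart: a fresh object recovers state purely from the file.
recovered = CommitLog(log_path).replay()
print(recovered)
```

Because every write reaches the log before it is acknowledged, anything held only in memory at the time of a crash can be reconstructed by replaying the log from the start.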



11. Define Mem-table in Cassandra?  

Short Answer

It is a memory-resident data structure. After the commit log, data is written to the mem-table. The mem-table is an in-memory write-back cache whose content is stored in key and column format. The data in the mem-table is sorted by key, and each column family has its own mem-table, from which column data is retrieved by key. It stores writes until it is full and is then flushed to disk.

Long Answer

A memtable is basically a write-back cache of data rows that can be looked up by key. Unlike a write-through cache, writes are batched up in the memtable until it is full; when a memtable is full, it is written to disk as an SSTable. The memtable is an in-memory cache with content stored as key/column, and its data is sorted by key. Each ColumnFamily has a separate memtable, from which column data is retrieved by key. Cassandra writes go first to the CommitLog; after writing to the CommitLog, Cassandra writes the data to the memtable.
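
The batching and flushing behaviour described above can be sketched as follows. This is a simplified illustration in Python, not Cassandra's implementation: the `Memtable` class, the entry-count threshold, and the representation of a flushed table as a tuple of pairs are all assumptions.

```python
import bisect


class Memtable:
    """Sorted in-memory write buffer that flushes to an immutable table when full."""

    def __init__(self, max_entries=3):  # hypothetical flush threshold
        self.max_entries = max_entries
        self._keys = []      # keys, kept in sorted order
        self._values = {}    # key -> latest value
        self.sstables = []   # flushed tables: tuples of (key, value), sorted by key

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value
        if len(self._keys) >= self.max_entries:
            self.flush()

    def get(self, key):
        if key in self._values:
            return self._values[key]
        for table in reversed(self.sstables):  # check newest flushed table first
            for k, v in table:
                if k == key:
                    return v
        return None

    def flush(self):
        """Write the buffer out in key order and start with an empty buffer."""
        self.sstables.append(tuple((k, self._values[k]) for k in self._keys))
        self._keys, self._values = [], {}


mt = Memtable(max_entries=3)
for k, v in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    mt.put(k, v)

print(mt.sstables)  # one flushed table, sorted by key
print(mt.get("d"))  # still in memory
print(mt.get("a"))  # served from the flushed table
```

The third write fills the buffer and triggers a flush, so the keys arrive on "disk" already sorted even though they were inserted out of order.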


12. What is SSTable?  

Short Answer

SSTable, or ‘Sorted String Table’, is an important data file in Cassandra. Memtables are regularly flushed to disk as SSTables, which exist for each Cassandra table. Being immutable, SSTables do not allow data items to be added or removed once written. For each SSTable, Cassandra creates three separate structures: a partition index, a partition summary, and a bloom filter.

Long Answer

A Sorted String Table is a file of key/value string pairs, sorted by key. An SSTable can be completely mapped into memory, which allows lookups and scans without touching disk. It is a simple abstraction for efficiently storing large numbers of key-value pairs while optimizing for high-throughput, sequential read/write workloads. SSTables also carry a serializable index, serializable data, and a bloom filter.

An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks; typically each block is 64KB in size, but this is configurable. A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened.

The features of SSTables, as described for Google's Bigtable (on which Cassandra's data model is based), are:
1) SSTables are immutable;
2) Simplifies caching, sharing across GFS, etc;
3) No need for concurrency control;
4) SSTables of a tablet recorded in METADATA table;
5) Garbage collection of SSTables done by master; and
6) On tablet split, split tables can start off quickly on shared SSTables, splitting them lazily.
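
The block index described above can be sketched in Python as follows. This is an in-memory illustration only: the `SSTable` class, the tiny block size, and the choice to index each block by its first key are assumptions made for the example, not the real file format.

```python
import bisect


class SSTable:
    """Immutable sorted map split into fixed-size blocks, with an index of first keys."""

    def __init__(self, items, block_size=2):  # hypothetical block size, in entries
        pairs = sorted(dict(items).items())
        self._blocks = tuple(
            tuple(pairs[i:i + block_size]) for i in range(0, len(pairs), block_size)
        )
        # Block index: the first key of every block, held in memory.
        self._index = [block[0][0] for block in self._blocks]

    def get(self, key):
        """Use the index to pick the one block that could hold the key, then scan it."""
        pos = bisect.bisect_right(self._index, key) - 1
        if pos < 0:
            return None
        for k, v in self._blocks[pos]:
            if k == key:
                return v
        return None

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        pos = max(bisect.bisect_right(self._index, start) - 1, 0)
        for block in self._blocks[pos:]:
            for k, v in block:
                if k >= end:
                    return
                if k >= start:
                    yield k, v


table = SSTable({"apple": 1, "banana": 2, "cherry": 3, "date": 4, "fig": 5})
print(table.get("date"))
print(list(table.scan("banana", "fig")))
```

Because the table never changes after construction, the index can be built once and shared by any number of readers without locking, which is the reason no concurrency control is needed.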


13. How does Cassandra write?  

Cassandra performs a write in two steps: it first appends the write to a commit log on disk and then applies it to an in-memory structure known as the memtable. Once both steps succeed, the write is complete. Memtable contents are later written to disk as SSTables (sorted string tables). Because the commit log is append-only and the memtable lives in memory, Cassandra offers fast write performance.



14. Explain what is composite type in Cassandra?  

In Cassandra, a composite type lets you define a key or a column name as a concatenation of values of different types. A composite type can be used in two places:
  • Row Key
  • Column Name


15. How Cassandra stores data?  

  • All data is stored as bytes
  • When you specify a validator, Cassandra ensures those bytes are encoded as required
  • A comparator then orders the columns based on the ordering specific to the encoding
  • Composites are just byte arrays with a specific encoding: for each component, Cassandra stores a two-byte length, followed by the byte-encoded component, followed by an end-of-component byte
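
The composite layout in the last bullet can be sketched in Python as follows. The function names are illustrative, and the sketch assumes a big-endian length and a single zero byte as the terminator after each component.

```python
import struct


def encode_composite(*components: bytes) -> bytes:
    """Concatenate components as <2-byte length><bytes><terminator byte>."""
    out = bytearray()
    for comp in components:
        out += struct.pack(">H", len(comp))  # two-byte big-endian length
        out += comp                          # the byte-encoded component
        out += b"\x00"                       # assumed terminator: a single zero byte
    return bytes(out)


def decode_composite(data: bytes) -> list:
    """Split an encoded composite back into its component byte strings."""
    parts, offset = [], 0
    while offset < len(data):
        (length,) = struct.unpack_from(">H", data, offset)
        offset += 2
        parts.append(data[offset:offset + length])
        offset += length + 1  # skip the component and its terminator
    return parts


encoded = encode_composite(b"ab", b"c")
print(encoded)
print(decode_composite(encoded))
```

The length prefix is what lets components of different types and sizes share one byte array without any delimiter ambiguity.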


16. Mention what are the main components of Cassandra Data Model?  

The main components of Cassandra Data Model are
  • Cluster
  • Keyspace
  • Column
  • Column Family


17. Explain what is a column family in Cassandra?  

A column family in Cassandra is a collection of rows.



18. Explain what is a cluster in Cassandra?  

A cluster is a container for keyspaces. A Cassandra database is distributed over several machines that operate together. The cluster is the outermost container; it arranges the nodes in a ring and assigns data to them. Data on each node is replicated to other nodes, which take over if that node fails.
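
The ring arrangement can be illustrated with a small Python sketch. It is a simplification under stated assumptions: each node owns a single token, tokens come from SHA-256 rather than Cassandra's own partitioner, and replicas are simply the next nodes clockwise. The `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: a key belongs to the first node at or after its token."""

    def __init__(self, nodes, replication_factor=2):
        self.rf = min(replication_factor, len(nodes))
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(text):
        # Assumption: a 64-bit token taken from SHA-256, for illustration only.
        return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")

    def replicas(self, key):
        """Return the owning node followed by the next nodes clockwise."""
        start = bisect.bisect_left(self._tokens, self._token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(self.rf)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("user:42")
print(owners)

# If the first owner is down, the remaining replicas can still serve the key.
survivors = owners[1:]
print(survivors)
```

Every node can compute the same replica list from the key alone, which is why any node can accept a client request and route it without a central coordinator.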



19. Explain what is a keyspace in Cassandra?  

In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.



20. List out the other components of Cassandra?  

The other components of Cassandra are
  • Node
  • Data Center
  • Cluster
  • Commit log
  • Mem-table
  • SSTable
  • Bloom Filter

