Cassandra is a one of the NoSQL distributed database system. It is an open source data storage system effectively designed to store and manages large volume of data without any failure.Apache Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers. It can server as both
Cassandra first started as an incubation project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made minor releases since that time. Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used in production by some of the biggest properties on the Web, including Facebook, Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.
Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there's no single point of failure. It is highly available and offers a schema-free data model.Listed below are some of the notable points of Apache Cassandra:
A NoSQL (originally referring to "non SQL", "non relational" or "not only SQL") database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and Amazon.com.NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages.
Motivations for this approach include: simplicity of design, simpler "horizontal" scaling to clusters of machines (which is a problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as "more flexible" than relational database tables.
Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages (instead of SQL, for instance the lack of ability to perform ad-hoc JOINs across tables), lack of standardized interfaces, and huge previous investments in existing relational databases. Most NoSQL stores lack true ACID transactions, although a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB have made them central to their designs.
Instead, most NoSQL databases offer a concept of "eventual consistency" in which database changes are propagated to all nodes "eventually" (typically within milliseconds) so queries for data might not return updated data immediately or might result in reading data that is not accurate, a problem known as stale reads. Additionally, some NoSQL systems may exhibit lost writes and other forms of data loss. Fortunately, some NoSQL systems provide concepts such as write-ahead logging to avoid data loss. For distributed transaction processing across multiple databases, data consistency is an even bigger challenge that is difficult for both NoSQL and relational databases.
As per Greek mythology, Cassandra was the daughter of King Priam of Troy. She could accurately predict the future but nobody believed her. It's not entirely clear whether the name was given to the data store because of this, but one reasoning is that NOSQL database solutions are inevitable for today's and future data needs, but there is a big resistance (or disbelief) from the traditional RDBMS world.
The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra originated at Facebook in 2007 to solve that company's inbox search problem, in which they had to deal with large volumes of data in a way that was difficult to scale with traditional methods. Specifically, the team had requirements to handle huge volumes of data in the form of message copies, reverse indices of messages, and many random reads and many simultaneous random writes.
The team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Ranganathan, and Facebook engineer on the Search Team Prashant Malik as key engineers. The code was released as an open source Google Code project in July 2008. During its tenure as a Google Code project in 2008, the code was updateable only by Facebook engineers, and little community was built around it as a result. So in March 2009 it was moved to an Apache Incubator project, and on February 17, 2010 it was voted into a top-level project.
Cassandra today presents a kind of paradox: it feels new and radical, and yet it's solidly rooted in many standard, traditional computer science concepts and maxims that successful predecessors have already institutionalized. Cassandra is a realist's kind of database; it doesn't depart from the relational model to be a fun art project or experiment for smart developers. It was created specifically to solve a real-world problem that existing tools weren't able to solve. It acknowledges the limitations of prior methods and faces our new world of big data head-on.
In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen—but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her. I speculate that it is also named as kind of a joke on the Oracle at Delphi, another seer for whom a database is named.
You probably don't drive a semi truck to pick up your dry cleaning; semis aren't well suited for that sort of task. Lots of careful engineering has gone into Cassandra's high availability, tuneable consistency, peer-to-peer protocol, and seamless scaling, which are its main selling points. None of these qualities is even meaningful in a single-node deployment, let alone allowed to realize its full potential.
There are, however, a wide variety of situations where a single-node relational database is all we may need. So do some measuring. Consider your expected traffic, throughput needs, and SLAs. There are no hard and fast rules here, but if you expect that you can reliably serve traffic with an acceptable level of performance with just a few relational databases, it might be a better choice to do so, simply because RDBMS are easier to run on a single machine and are more familiar.
If you think you'll need at least several nodes to support your efforts, however, Cassandra might be a good fit. If your application is expected to require dozens of nodes, Cassandra might be a great fit.Lots of Writes, Statistics, and Analysis.
Consider your application from the perspective of the ratio of reads to writes. Cassandra is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve lots of writing with less predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applications, including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.Geographical Distribution
Cassandra has out-of-the-box support for geographical distribution of data. You can easily configure Cassandra to replicate data across multiple data centers. If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.Evolving Applications
If your application is evolving rapidly and you're in “startup mode,” Cassandra might be a good fit given its schema-free data model. This makes it easy to keep your database in step with application changes as you rapidly deploy.
Tunable Consistency is a phenomenal characteristic that makes Cassandra a favored database choice of Developers, Analysts and Big data Architects. Consistency refers to the up-to-date and synchronized data rows on all their replicas. Cassandra’s Tunable Consistency allows users to select the consistency level best suited for their use cases. It supports two consistencies -Eventual and Consistency and Strong Consistency.
The former guarantees consistency when no new updates are made on a given data item, all accesses return the last updated value eventually. Systems with eventual consistency are known to have achieved replica convergence.For Strong consistency, Cassandra supports the following condition:
Cassandra is written in Java. It is originally designed by Facebook consisting of flexible schemas. It is highly scalable for big data.
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure.
Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
It is a memory-resident data structure. After commit log, the data will be written to the mem-table. Mem-table is in-memory/write-back cache space consisting of content in key and column format. The data in mem- table is sorted by key, and each column family consists of a distinct mem-table that retrieves column data via key. It stores the writes until it is full, and then flushed out.
SSTable or ‘Sorted String Table,’ refers to an important data file in Cassandra. It accepts regular written memtables which are stored on disk and exist for each Cassandra table. Being immutable, SStables do not allow any further addition and removal of data items once written. For each SSTable, Cassandra creates three separate files like partition index, partition summary and a bloom filter.
Sorted String Table is a file of key/value string pairs, sorted by keys. An SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disks.An SSTable provides a persistent, ordered an immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key and to iterate over all key/value pairs in a specified key range. SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optioning for high throughput, sequential read/write workloads. Internally, each SSTable contains a sequence of blocks typically each block is 64KB in size, but this is configurable. A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened. The features of SSTable are: SSTables are immutable, simplifies caching, sharing across GFS etc., serializable index, serializable data, bloom filter, no need for concurrency control, SSTables of a tablet recorded in METADATA table, Garbage collection of SSTables done by master, on tablet split, split tables can start off quickly on shared SSTables, splitting them lazily. SSTable works in Cassandra (data format, indexing, serialization, searching). Learn more in: NoSQL Databases.
An SSTable provides a persistent, ordered an immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks typically each block is 64KB in size, but this is configurable. A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened.The features of SSTable are:
Cassandra performs the write function by applying two commits-first it writes to a commit log on disk and then commits to an in-memory structured known as memtable. Once the two commits are successful, the write is achieved. Writes are written in the table structure as SSTable (sorted string table). Cassandra offers speedier write performance.
Column family in Cassandra is referred for a collection of Rows.
A cluster is a container for keyspaces. Cassandra database is segmented over several machines that operate together. The cluster is the outermost container which arranges the nodes in a ring format and assigns data to them. These nodes have a replica which takes charge in case of data handling failure.
In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster consist of one keyspace per node.