Jump to content

How to Decide Between RDBMS and Cassandra

+ 2
  chco's Photo
Posted Jan 15 2011 02:26 AM

There are several differences between Cassandra’s model and query methods compared to what’s available in RDBMS (relational database management system), and these are important to keep in mind. This excerpt from Cassandra: The Definitive Guide will help you decide which would be a better option for your situation.
No Query Language

SQL is the standard query language used in relational databases. Cassandra has no query language. It does have an API that you access through its RPC serialization mechanism, Thrift.

No Referential Integrity

Cassandra has no concept of referential integrity, and therefore has no concept of joins. In a relational database, you could specify foreign keys in a table to reference the primary key of a record in another table. But Cassandra does not enforce this. It is still a common design requirement to store IDs related to other entities in your tables, but operations such as cascading deletes are not available.

Secondary Indexes

Here’s why secondary indexes are a feature: say that you want to find the unique ID for a hotel property. In a relational database, you might use a query like this:

SELECT hotelID FROM Hotel WHERE name = 'Clarion Midtown';


This is the query you’d have to use if you knew the name of the hotel you were looking for but not the unique ID. When handed a query like this, a relational database will perform a full table scan, inspecting each row’s name column to find the value you’re looking for. But this can become very slow once your table grows very large. So the relational answer to this is to create an index on the name column, which acts as a copy of the data that the relational database can look up very quickly. Because the hotelID is already a unique primary key constraint, it is automatically indexed, and that is the primary index; for us to create another index on the name column would constitute a secondary index, and Cassandra does not currently support this.

To achieve the same thing in Cassandra, you create a second column family that holds the lookup data. You create one column family to store the hotel names, and map them to their IDs. The second column family acts as an explicit secondary index.

Note: Support for secondary indexes is currently being added to Cassandra 0.7. This allows you to create indexes on column values. So, if you want to see all the users who live in a given city, for example, secondary index support will save you from doing it from scratch.

Sorting Is a Design Decision

In RDBMS, you can easily change the order in which records are returned to you by using ORDER BY in your query. The default sort order is not configurable; by default, records are returned in the order in which they are written. If you want to change the order, you just modify your query, and you can sort by any list of columns. In Cassandra, however, sorting is treated differently; it is a design decision. Column family definitions include a CompareWith element, which dictates the order in which your rows will be sorted on reads, but this is not configurable per query.

Where RDBMS constrains you to sorting based on the data type stored in the column, Cassandra only stores byte arrays, so that approach doesn’t make sense. What you can do, however, is sort as if the column were one of several different types (ASCII, Long integer, TimestampUUID, lexicographically, etc.). You can also use your own pluggable comparator for sorting if you wish.

Otherwise, there is no support for ORDER BY and GROUP BY statements in Cassandra as there is in SQL. There is a query type called a SliceRange, it is similar to ORDER BY in that it allows a reversal.

Denormalization

In relational database design, we are often taught the importance of normalization. This is not an advantage when working with Cassandra because it performs best when the data model is denormalized. It is often the case that companies end up denormalizing data in a relational database. There are two common reasons for this. One is performance. Companies simply can’t get the performance they need when they have to do so many joins on years’ worth of data, so they denormalize along the lines of known queries. This ends up working, but goes against the grain of how relational databases are intended to be designed, and ultimately makes one question whether using a relational database is the best approach in these circumstances.

A second reason that relational databases get denormalized on purpose is a business document structure that requires retention. That is, you have an enclosing table that refers to a lot of external tables whose data could change over time, but you need to preserve the enclosing document as a snapshot in history. The common example here is with invoices. You already have Customer and Product tables, and you’d think that you could just make an invoice that refers to those tables. But this should never be done in practice. Customer or price information could change, and then you would lose the integrity of the Invoice document as it was on the invoice date, which could violate audits, reports, or laws, and cause other problems.

In the relational world, denormalization violates Codd's normal forms, and we try to avoid it. But in Cassandra, denormalization is, well, perfectly normal. It's not required if your data model is simple. But don't be afraid of it.

The important point is that instead of modeling the data first and then writing queries, with Cassandra you model the queries and let the data be organized around them. Think of the most common query paths your application will use, and then create the column families that you need to support them.

Detractors have suggested that this is a problem. But it is perfectly reasonable to expect that you should think hard about the queries in your application, just as you would, presumably, think hard about your relational domain. You may get it wrong, and then you’ll have problems in either world. Or your query needs might change over time, and then you’ll have to work to update your data set. But this is no different from defining the wrong tables, or needing additional tables, in RDBMS.

Note: For an interesting article on how Cloudkick is using Cassandra to store metrics and monitoring data, see https://www.cloudkic..._with_cassandra.

Cassandra: The Definitive Guide

Learn more about this topic from Cassandra: The Definitive Guide.

With this hands-on guide, you'll learn how Apache Cassandra handles hundreds of terabytes of data while remaining highly available across multiple data centers -- capabilities that have attracted Facebook, Twitter, and other data-intensive companies. Cassandra: The Definitive Guide provides the technical details and practical examples you need to assess this database management system and put it to work in a production environment.

See what you'll learn


Tags:
0 Subscribe


0 Replies