Journey To Database World: Part 8 (Column Family Database - Cassandra As Example)

Other Parts of This Series:

Part 7: Journey To Database World: Part 7 (Document Database - MongoDB As Example)
Part 9: Journey To Database World: Part 9 (Graph Database - Neo4j As Example)

Column Family Database - Cassandra (Photo Credit: LinkedIn Image)

Story:

Imagine a flexible library where each bookshelf (column family) represents a category of data, such as “Science” or “History.” Within each bookshelf, books (rows) contain unique chapters (columns), and not all books need to have the same chapters. Each book has a unique ID (row key), making it easy to locate on the shelf. So if you only want to read about “Physics,” you can go directly to the “Science” bookshelf and the specific book without flipping through everything else.

What is a Column Family Database?

A Column Family Database is a type of NoSQL database that stores data in a schema-less format, organizing it into column families. Instead of predefined tables, data is grouped into column families, where each row can have a different number and type of columns. This model is particularly suited for applications requiring high scalability, low latency, and efficient storage of large amounts of structured data. Cassandra, a widely-used open-source column-family database, is a great example.

Key Concepts in Column Family Databases

Column: A basic unit of data, consisting of: Key: Identifies the column, Value: The data stored in the column, Timestamp: Helps resolve conflicts during replication.
Row: A collection of columns, identified by a row key. Rows in a column family can have a variable number of columns.
Column Family: Similar to a table in relational databases, it is a container for rows. Each row contains a unique key and a set of columns.
Partition: Data is partitioned across nodes using a partition key to distribute and retrieve data efficiently.

Use Cases of Column Family Databases

Column family databases are suitable for:

Time-Series Data: Storing sensor readings, logs, and stock prices.
Real-Time Analytics: Dashboards for monitoring user activity.
Recommendation Systems: E-commerce product recommendations.
Content Management Systems: Storing user-generated data, such as reviews or posts.
IoT Applications: Managing large-scale device data efficiently.

Benefits of Column Family Databases

Scalability: Designed for horizontal scaling across distributed systems.
Flexibility: Schema-less structure allows dynamic addition of columns.
High Performance: Handles high write throughput and read efficiency.
Data Locality: Clustering columns ensures efficient queries by storing related data together.
Fault Tolerance: Built-in replication ensures high availability and resilience.

Drawbacks of Column Family Databases

Complex Querying: Lack of support for joins and complex queries like in SQL.
Eventual Consistency: May lead to stale reads if strict consistency is required.
Design Challenges: Requires careful data modeling based on queries, not normalization.
Learning Curve: Requires understanding distributed system concepts and database-specific configurations.

When to Use a Column Family Database

Massive Scale: When the application deals with petabytes of data.
Distributed Architecture: When high availability and partition tolerance are critical.
Write-Heavy Workloads: Applications with frequent writes, such as logs or IoT data.
Low-Latency Reads: Systems needing quick data retrieval.

When Not to Use a Column Family Database

Complex Relationships: Applications needing joins, foreign keys, or transactional consistency.
Small-Scale Applications: Relational databases may be simpler and more cost-effective.
Dynamic Query Requirements: If query patterns change frequently, schema adjustments can be cumbersome.
Frequent update operation: Though write is fast but update is not efficient enough.

Cassandra as a Column Family Database

Apache Cassandra is one of the most popular column family databases. Here’s a detailed look:

Features:

Data Partitioning: Uses consistent hashing to distribute data across nodes.
Replication: Ensures high availability with a configurable replication factor.
Tunable Consistency: Allows control over consistency levels (e.g., ONE, QUORUM).
Fault Tolerance: Handles node failures seamlessly.
CQL (Cassandra Query Language): Simplifies interaction with the database.

Architecture:

Masterless: Every node has the same role, enabling high fault tolerance.
Gossip Protocol: Nodes communicate to share cluster state.

Components of Cassandra

The main components of Cassandra are:

Node: A Cassandra node is a place where data is stored.
Data center: Data center is a collection of related nodes.
Cluster: A cluster is a component which contains one or more data centers.
Commit log: In Cassandra, the commit log is a crash-recovery mechanism. Every write operation is written to the commit log.
Mem-table: A mem-table is a memory-resident data structure. After commit log, the data will be written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-tables.
SSTable: It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
Bloom filter: These are nothing but quick, nondeterministic, algorithms for testing whether an element is a member of a set. It is a special kind of cache. Bloom filters are accessed after every query.

Key Concepts in Cassandra Data Model

Cluster:

A collection of nodes (servers) that work together as a distributed system.
Data is distributed across nodes using consistent hashing.

Keyspace:

The top-level namespace in Cassandra, similar to a database in relational systems.
Defines replication strategies and factors for the data.

Column Family / Table:

Equivalent to a table in relational databases but with a schema-flexible design.
Rows in a table may have different numbers of columns.

Partition Key:

The primary component of the primary key that determines how data is distributed across nodes.
Ensures efficient data retrieval by grouping related rows together.

Clustering Key:

Defines the order of data within a partition.
Enables sorting and efficient queries for a specific partition.

Columns:

The fundamental data unit, consisting of a name, value, and timestamp.
Schema-flexible: Rows in the same table can have different sets of columns.

Cassandra Query Language (CQL) Data Types

Primitive Data Types

ascii: US-ASCII strings.
bigint: 64-bit signed integer.
blob: Arbitrary bytes in hex format.
boolean: true or false.
counter: 64-bit counter (for increments/decrements).
decimal: Variable-precision decimal.
double: 64-bit floating-point number.
float: 32-bit floating-point number.
inet: IP addresses (IPv4/IPv6).
int: 32-bit signed integer.
text / varchar: UTF-8 strings.
timestamp: Date and time (with milliseconds).
time: Time without date.
date: Date without time.
uuid: Universally unique identifier.
timeuuid: Time-based UUID.
varint: Arbitrary-precision integer.
tinyint: 8-bit signed integer.
smallint: 16-bit signed integer.
duration: Time duration.

Collection Data Types

list: Ordered collection (e.g., [1, 2, 3]).
set: Unordered unique values (e.g., {1, 2, 3}).
map: Key-value pairs (e.g., {‘key’: ‘value’}).

User-Defined Types (UDTs)

Custom types combining multiple fields (e.g., CREATE TYPE address (street text, city text, zipcode int)).

Tuple

Fixed-length collection with different types (e.g., (int, text, timestamp)).

Frozen

Marks collections or UDTs as immutable (treated as a single unit).

CQL Queries

Keyspace Operations

Create Keyspace:

`1`	`CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};`

Use Keyspace:

`1`	`USE my_keyspace;`

Drop Keyspace:

`1`	`DROP KEYSPACE my_keyspace;`

Table Operations

Create Table:

`1`	`CREATE TABLE users ( user_id UUID PRIMARY KEY, name TEXT, age INT );`

Alter Table:

`1`	`ALTER TABLE users ADD email TEXT;`

Drop Table:

`1`	`DROP TABLE users;`

Truncate Table:

`1`	`TRUNCATE TABLE users;`

Insert Data

Add a row:

`1`	`INSERT INTO users (user_id, name, age) VALUES (uuid(), 'Alice', 30);`

Query Data

Fetch all rows:

`1`	`SELECT * FROM users;`

Fetch specific rows:

`1`	`SELECT name, age FROM users WHERE user_id = <some_uuid>;`

Update Data

Update specific columns:

`1`	`UPDATE users SET age = 31 WHERE user_id = <some_uuid>;`

Delete Data

Delete specific columns:

`1`	`DELETE age FROM users WHERE user_id = <some_uuid>;`

Delete entire row:

`1`	`DELETE FROM users WHERE user_id = <some_uuid>;`

Batch Operations

Perform multiple queries in one batch:

1
2
3

BEGIN BATCH 
INSERT INTO users (user_id, name, age) VALUES (uuid(), 'Bob', 25); UPDATE users SET age = 26 WHERE name = 'Alice'; 
APPLY BATCH;

TTL (Time-to-Live)

Set expiration time for data:

`1`	`INSERT INTO users (user_id, name, age) VALUES (uuid(), 'Eve', 28) USING TTL 3600;`

Lightweight Transactions (LWT)

Conditional insert:

`1`	`INSERT INTO users (user_id, name, age) VALUES (uuid(), 'John', 35) IF NOT EXISTS;`

Conditional update:

`1`	`UPDATE users SET age = 36 WHERE user_id = <some_uuid> IF age = 35;`

Aggregate Functions

Count rows:

`1`	`SELECT COUNT(*) FROM users;`

Index Operations

Create Index:

`1`	`CREATE INDEX ON users (email);`

Drop Index:

`1`	`DROP INDEX users_email_idx;`

Summary:

Column family databases, such as Cassandra, are excellent for applications requiring high scalability, flexible schema, and low-latency reads. They are particularly suited for distributed architectures and write-heavy workloads. However, they lack support for complex queries and strict consistency, making them less suitable for applications with intricate data relationships or transactional requirements. It is ideal for use cases like IoT, real-time analytics, and time-series data storage.

Other Parts of This Series:#

Story:#

What is a Column Family Database?#

Use Cases of Column Family Databases#

Benefits of Column Family Databases#

Drawbacks of Column Family Databases#

When to Use a Column Family Database#

When Not to Use a Column Family Database#

Cassandra as a Column Family Database#

Components of Cassandra#

Key Concepts in Cassandra Data Model#

Cassandra Query Language (CQL) Data Types#

CQL Queries#

Keyspace Operations#

Table Operations#

Insert Data#

Query Data#

Update Data#

Delete Data#

Batch Operations#

TTL (Time-to-Live)#

Lightweight Transactions (LWT)#

Aggregate Functions#

Index Operations#

Summary:#