Technology choices are often influenced by hype or company politics. People tend to over-engineer, picking tools that could handle "future" traffic which rarely materializes. Of course, sometimes they just want to have fun by using something new.
Today, let's have a solid discussion about databases! We're going to start with theorems and qualities that should be considered when choosing a database solution. Then we'll talk about data models and types — SQL, NoSQL, and NewSQL. We'll dive into a few of the most popular open-source databases and a few common use cases, describing ideas for how they can be handled. The goal is to learn how to make better choices.
CAP is something most of you have already heard about, but let’s do a small recap.
In 2000, Eric Brewer presented a conjecture regarding distributed systems (if you have 2 database nodes, it probably already applies to you). It states that a system can fulfill only 2 of 3 specific traits — Consistency, Availability, and Partition Tolerance.
What do they mean?
- Consistency — all clients querying the system see exactly the same data, no matter which node they are connected to
- Availability — all requests made to the system get successful responses (i.e. it does not fail because a part of the system is not accessible)
- Partition Tolerance — if there is a network issue, e.g. a split brain breaking the cluster into 2 independent parts, the system still works
When you google CAP, the thing common to all articles is a graphic of a triangle, with each trait in 1 corner and the edges described by database types or specific products that match them. However, is it really so easy to classify modern systems? Not really. In 2012, Brewer published an update — CAP Twelve Years Later: How the "Rules" Have Changed — saying that the world is not so black and white. Today databases often have tunable settings, which may apply to a specific cluster, table, or sometimes even a single query. You have to be aware of them and know what tradeoffs each option brings. For more information, I recommend reading Brewer's article, but also Please stop calling databases CP or AP by Martin Kleppmann.
In 2010, yet another related theorem was formulated. PACELC is a CAP extension saying that in the case of a network Partition (P), the system has to choose between being Available (A) or Consistent (C). Else (E), when operating normally, it has to choose between Latency (L) and Consistency (C).
Apart from CAP, we have more qualities & features we need to think about when making a choice.
Response time is crucial for user experience. Some solutions may slow down as they grow in size. In other cases, a huge load may affect how quickly the database responds. The more complicated the queries are, the longer they take. Inserting rows that require deduplication based on some unique value will always be slower than a pure upsert, because the db needs to do more work to handle the request. All kinds of transactions will increase the latency as well. If you want to fetch a lot of data, then disk speed, network throughput, and other factors will also influence the results.
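To make the extra work behind deduplicating inserts concrete, here is a minimal sqlite3 sketch (the table and values are made up for illustration). The upsert has to consult the unique index and decide between inserting and updating, which is more work than a blind append:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")

# First insert: the unique index on email is checked, then the row is added.
conn.execute("INSERT INTO users VALUES ('a@example.com', 'Alice')")

# Upsert on the same key: the db detects the conflict and updates instead
# of failing - extra work compared to appending a brand-new row.
conn.execute("""
    INSERT INTO users VALUES ('a@example.com', 'Alicia')
    ON CONFLICT(email) DO UPDATE SET name = excluded.name
""")

name = conn.execute(
    "SELECT name FROM users WHERE email = 'a@example.com'"
).fetchone()[0]
print(name)  # Alicia
```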
Everybody would like to have 100% available systems, but that is not possible. Remember the things we talked about when discussing the CAP/PACELC theorems. In practice, to achieve higher availability, we need to add more nodes and more independent, geographically distributed data centers. It is often analyzed how many nodes may fail at the same time while the system still works and responds to queries. In some cases, a failure won't even be noticed; in others, there may be a small downtime while the database switches to the backup node.
Remember that in order to handle network partitions, we may need to sacrifice a bit of availability (writes) or data consistency.
We can distinguish 2 basic forms of scaling:
- vertical — achieved by adding more resources — CPU, RAM
- horizontal — achieved by adding more instances to the system
Some databases just can't be clustered and scaled horizontally. For high availability, they usually use the active-passive approach, where 1 database is active and changes are synchronously replicated to a passive node. Scaling is influenced by the data model. Relational data models are difficult to distribute, so the most common approach is to have a single writable node or to shard the data. On the other hand, key-value stores can be scaled horizontally very easily.
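Why are key-value stores so easy to scale out? Because each key can be deterministically assigned to a node. A toy sketch (real systems use consistent hashing so that adding a node moves only a fraction of the keys; the class and key names here are invented):

```python
import hashlib

class ShardedKV:
    """Toy key-value store spread over N independent 'nodes' (plain dicts).
    A stable hash of the key decides which shard owns it."""

    def __init__(self, num_shards=3):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Same key -> same digest -> same shard, on every call.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

kv = ShardedKV()
kv.put("user:1", "Alice")
kv.put("user:2", "Bob")
print(kv.get("user:1"))  # Alice
```

Each operation touches exactly one shard, so adding shards adds capacity almost linearly — something relational joins across nodes cannot offer.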
In order to be faster and achieve better latency, sacrifices are made. There are databases that acknowledge client operations only once the data is written to disk. However, in many cases, dbs respond to the client as soon as the change is put in memory or the disk cache. In case of failure, some data may be lost. Of course, there are preventive mechanisms — horizontally scaled clusters may still have the data on other nodes.
Databases come with various licenses. Usually, there are no limitations on their use for commercial purposes. In some cases, a few variants are offered — a basic free & open-source one and a more advanced, paid one with additional features. Nowadays, hosted and as-a-service versions are available as well. The main limitation put on the licenses concerns offering self-managed & hosted versions — so that the main company does not get any competition.
Most people first meet SQL databases and later go into the NoSQL world. It can be surprising that ACID transactions are actually a not-so-common feature. Eric Brewer (yes, the guy from the CAP theorem) defined another acronym called BASE. It can be defined as:
- Basically available — this refers to the availability from CAP.
- Soft state — state of the database can change, even when there are no operations.
- Eventual consistency — means that the system will become consistent at some point in time. This can be a result of e.g. data synchronization mechanisms happening in the background. Depending on the case, milliseconds may be enough.
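Eventual consistency can be modeled in a few lines. In this invented sketch, writes land on a primary and are queued for asynchronous replication, so a replica read may briefly see stale data until the background sync runs:

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy model: writes hit the primary immediately; a background task
    (here: replicate()) later pushes them to the replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replication happens later

    def read_replica(self, key):
        return self.replica.get(key)

    def replicate(self):
        # Stand-in for the asynchronous sync mechanism.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value

store = EventuallyConsistentStore()
store.write("balance", 100)
print(store.read_replica("balance"))  # None - replica not yet in sync
store.replicate()
print(store.read_replica("balance"))  # 100 - eventually consistent
```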
Getting ACID and full consistency is difficult and costly. What is more, for many applications, it is actually not needed. Think about the systems you have worked on: in how many cases, after writing data, could somebody be reading it 1 ms later and need a response already fully consistent with the write? Actually, not too often. Example: a simple CRUD of user accounts. People may say this should be strictly consistent! However, users don't edit their accounts too often, and even if they do, it happens mostly from 1 session, and reads may not happen within milliseconds of the write.
There are many other factors that may influence your decisions. Financial companies often need audit logs — they are not supported by all solutions. Sometimes you'd like to get a stream of data changes (see CDC, Debezium, etc.). For others, what matters is how easy it is to deploy to Kubernetes. Companies may need built-in backup capabilities, data-at-rest encryption, and many other features.
In this section, we will discuss the most popular types of databases.
I think this is the most popular and best known data model. It consists of tables connected by relations. Tables have many columns with specific types; rows are identified by primary keys — they are a good fit for storing structured data. Relations are defined using foreign keys. Data can be queried using any column, but additional indexes may improve the performance of those queries. Queries are defined, of course, using SQL (Structured Query Language), which has existed since 1974! SQL is very powerful, supports transactions, and is constantly being extended.
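A minimal sketch of these building blocks — a foreign key, an index, and a join — using Python's built-in sqlite3 module (the schema and rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id)  -- foreign key = relation
    );
    -- An extra index so queries filtering/joining on author_id stay fast.
    CREATE INDEX idx_books_author ON books(author_id);
    INSERT INTO authors VALUES (1, 'Ursula K. Le Guin');
    INSERT INTO books VALUES (1, 'The Dispossessed', 1);
""")

# SQL lets us query across the relation with a join.
rows = conn.execute("""
    SELECT a.name, b.title
    FROM books b JOIN authors a ON a.id = b.author_id
""").fetchall()
print(rows)  # [('Ursula K. Le Guin', 'The Dispossessed')]
```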
However, relational databases also have some drawbacks. Due to transactions and durability, latency won't be ultra low. They don't scale horizontally very well — relations make that difficult. To work around it, some databases support sharding or partitioning to divide data into smaller parts, but that's only a workaround. In recent years, a new category of databases emerged, called NewSQL. It brings horizontal scaling to the relational world, usually by using additional consensus algorithms like Raft or Paxos. Unfortunately, you have to remember they are not a magical solution. A single NewSQL node can be even a few times slower than a single SQL node, but you can just set up a hundred of them.
Examples of SQL databases:
Examples of NewSQL databases:
A very simple model, based on keys & values (who would have guessed!), similar to a dictionary data structure. Values are stored under specific keys and can be retrieved only by those keys. Values can be simple types like strings or numbers, or more complex ones like lists or maps. Advantages? Simple, fast, easy to scale. Disadvantages? Not very powerful read queries (mostly get value by key).
Examples of Key-value databases:
Many graph-like structures can be modeled in other databases, e.g. relational ones; however, they are not the easiest and most efficient choice for that.
Graph databases focus on nodes and edges. If you'd like to model a social network showing who is friends with whom and then find people with common friends, then a graph database is for you.
Examples of Graph databases:
Document databases are actually a special subtype of key-value stores. However, their values contain documents, e.g. in JSON, YAML, or XML. In many cases, those documents do not have to share any schema (so they may also be used for storing unstructured data), but they can refer to each other ~ have relations! What about queries? They are more powerful than in raw key-value stores; it is often possible to create queries based on the fields of inserted documents.
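The two ideas — schema-less documents and field-based queries — can be sketched with plain Python dicts (the documents and the `find` helper are invented for illustration, not any particular database's API):

```python
# Documents under keys, with no shared schema - note doc2 has no 'age'.
documents = {
    "doc1": {"type": "user", "name": "Alice", "age": 34},
    "doc2": {"type": "user", "name": "Bob"},
    "doc3": {"type": "order", "user": "Alice", "total": 25},
}

def find(docs, **criteria):
    """Return documents whose fields match all given criteria -
    the kind of query a raw key-value store cannot answer."""
    return [
        doc for doc in docs.values()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

print(find(documents, type="user"))            # doc1 and doc2
print(find(documents, type="order", user="Alice"))  # doc3 only
```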
Examples of Document-oriented databases:
You might also be interested in:
This type is sometimes called tabular, column-oriented, or column family databases. We have tables here, but those tables are not connected by any relations. Additionally, rows are divided into partitions (based on a partition key), which determines their physical location in the cluster. There are additional keys that define the order of rows inside a partition. Since this model is usually associated with solutions that are easy to scale horizontally, queries are very restricted — you may need to provide the partition key every time (so that the database knows on which node the data is located). Due to the distributed nature, indexes are also not very efficient. This makes data modeling quite challenging. Actually, there are whole modeling methodologies for these databases, and they start with defining all the queries you'd like to make against your data. In many cases, it turns out that in order to fulfill them, you need to highly denormalize your data.
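The partition-key/clustering-key split can be sketched in a few lines (a toy model with invented names, not any specific wide-column product): the partition key selects the bucket (in a real cluster, the node), and rows inside a bucket stay sorted by the clustering key.

```python
from collections import defaultdict
import bisect

class WideColumnTable:
    """Toy wide-column table: partition key -> rows kept sorted by
    clustering key, mimicking physical data placement."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps each partition ordered by the clustering key.
        bisect.insort(self.partitions[partition_key], (clustering_key, value))

    def query(self, partition_key):
        # Efficient: the partition key pinpoints which 'node' holds the data.
        return self.partitions[partition_key]

events = WideColumnTable()
events.insert("sensor-1", "2024-01-02T10:00", 21.5)
events.insert("sensor-1", "2024-01-01T10:00", 20.0)
events.insert("sensor-2", "2024-01-01T10:00", 19.0)
print(events.query("sensor-1"))  # rows ordered by the clustering key
```

Note there is no way to ask "all readings above 21.0 across all sensors" without scanning every partition — which is exactly why query-first modeling matters here.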
Examples of Wide Column databases:
There are NoSQL databases that can actually be categorized into multiple models. ArangoDB is theoretically a graph database, but its nodes may contain documents. Redis extensions allow for graph operations. Datastax Enterprise adds Spark, Solr, and graph features to Cassandra. As you can see, databases do not need to stick to a single model. What should matter to you here? What queries you'd like to run in the future and what the actual format of your data is. There is no single format that fits all problems.
Nowadays, there are so many technologies on the market that it is difficult to cover all of them. In this section, we're going to focus on a few popular ones that you may encounter during your work.
MySQL, PostgreSQL, and MariaDB are the most popular open-source RDBMS (Relational Database Management Systems) for OLTP (Online Transaction Processing). They have been on the market for many years and include many features you would not expect to find. All of them can support geospatial data. PostgreSQL even has a built-in pub-sub-like mechanism. In terms of clustering & high availability, the basic model leverages the active & passive approach, with failover when the active node fails. It is also possible to make a passive node a "Hot Standby", making it available for read queries. This way, you can scale reads, but not writes. For writes, you need to partition your data.
As these relational databases are the most popular choices, all cloud providers offer hosted versions of them, including high-availability features.
Redis is a key-value store. Although this data model seems quite simple, Redis offers tons of additional features, including geospatial indexes. It keeps data in memory, periodically writing it to disk, which makes it ultra fast (see benchmarks) but carries a bit of risk related to durability. It is often used as a cache; however, it supports transactions, Lua scripting, can be clustered, and includes a Pub/Sub mechanism leveraged e.g. by the Signal IM. All major cloud providers offer Redis in hosted variants, see e.g. Amazon ElastiCache or GCP Memorystore.
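The typical caching pattern Redis is used for — set a value with an expiry, read it until it vanishes — can be modeled in pure Python. This is only a toy stand-in for Redis's SET-with-expiry and GET semantics, not its actual client API:

```python
import time

class TTLCache:
    """In-memory cache with per-key expiry, evicting lazily on read."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired - drop it and report a miss
            return None
        return value

cache = TTLCache()
cache.set("session:42", {"user": "alice"}, ttl_seconds=0.05)
print(cache.get("session:42"))  # {'user': 'alice'}
time.sleep(0.06)
print(cache.get("session:42"))  # None - the entry expired
```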
Apache Cassandra is often used together with Big Data. A few years ago, Apple announced their cluster had over 115 000 nodes and stored over 10 PB of data. Cassandra scales horizontally in an almost linear fashion with no single point of failure. As usual, there are some drawbacks. You won't find full transactions here. As mentioned before, modeling is difficult — before starting to use Cassandra, you should define all the queries you'll need in your application, which in many cases is just not possible. DataStax offers commercially supported Cassandra distributions and hosted variants. The open-source version is also available from other cloud providers:
If you'd like to learn more about Cassandra, take a look at another of our blog posts, "7 mistakes when using Apache Cassandra":
MongoDB is probably the best known document database. It offers a few clustering models, including sharding features and the active-passive approach. Mongo supports transactions and indexes, but also aggregation operations with map-reduce-like features. It is also available in the cloud as MongoDB Atlas.
CockroachDB is a NewSQL database (relational, but scaling horizontally) created by ex-Google engineers, inspired by Google Spanner. It implements the PostgreSQL wire protocol, so it is compatible with PostgreSQL drivers. The core open-source version is free; however, Cockroach Labs offers other variants, including hosted dedicated clusters but also a serverless product.
Many companies say that they store data in Elastic. It offers a document data model with really good search capabilities. However, you may come across opinions that it shouldn't be used as your main database, see Why one shouldn't use ElasticSearch as a data storage.
Modeling a CRUD (Create Read Update Delete) User service is a quite common task. What to use? Honestly, if you are not dealing with dozens of millions of users and a 5-minute downtime for editing user accounts is not an issue, a relational database management system like PostgreSQL is enough.
For this problem, you may consider multiple options. If you have huge amounts of data to store, Apache Cassandra may be a good choice. However, in order to process that data, you'll need to leverage e.g. Apache Spark. Other options include typical time-series databases like:
- InfluxDB — a NoSQL database designed for time-series data.
- TimescaleDB — a time-series database built on top of the PostgreSQL relational database.
- Prometheus — often leveraged in monitoring systems.
Their main advantage is built-in support for time- & window-related operations.
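One example of such an operation is a tumbling-window aggregate (average per fixed, non-overlapping time bucket) — something time-series databases ship built in, sketched here in plain Python with made-up sample data:

```python
from collections import defaultdict

# (timestamp_in_seconds, value) samples, e.g. temperature readings
samples = [(0, 20.0), (30, 22.0), (61, 25.0), (90, 27.0), (125, 30.0)]

def tumbling_window_avg(points, window_seconds):
    """Average values per fixed window; keys are window start times."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // window_seconds].append(value)
    return {
        window * window_seconds: sum(vals) / len(vals)
        for window, vals in sorted(buckets.items())
    }

print(tumbling_window_avg(samples, 60))  # {0: 21.0, 60: 26.0, 120: 30.0}
```

In InfluxDB or TimescaleDB the equivalent is a one-line GROUP BY over a time bucket, executed close to the data instead of in your application.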
Yet another approach would be to store them on Apache Kafka topics with configured infinite retention. Depending on the use case, they can be processed in real-time in a streaming manner or as a batch using e.g. Apache Spark.
The most powerful full-text search engines are Elastic and Solr. However, depending on what you actually need to do, they may not be required. Many other databases include built-in support for full-text-search-like operations, so as a first step, check what is possible in your current main database. As an example, see full text search in PostgreSQL.
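At the core of all these engines sits an inverted index: a map from token to the documents containing it. A miniature, invented sketch of the idea (real engines add stemming, ranking, and much more):

```python
import re
from collections import defaultdict

docs = {
    1: "PostgreSQL ships built-in full text search",
    2: "Elastic and Solr are dedicated search engines",
}

# Inverted index: token -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(doc_id)

def search(query):
    """Ids of documents containing every query token (AND semantics)."""
    tokens = re.findall(r"\w+", query.lower())
    result = set.intersection(*(index[t] for t in tokens)) if tokens else set()
    return sorted(result)

print(search("search engines"))  # [2]
print(search("search"))          # [1, 2]
```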
In many cases, having a native, in-memory cache may be enough. You can use libraries like Caffeine or Guava Cache. If you have a larger amount of data to cache, or you want to share it among multiple application instances, then you can choose Redis or memcached.
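In Python, the standard library already covers the simple in-process case, similar in spirit to Caffeine or Guava Cache on the JVM (the lookup function below is an invented placeholder for a slow call):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)  # bounded in-process cache, LRU eviction
def expensive_lookup(user_id):
    # imagine a slow database or HTTP call here
    return {"id": user_id, "name": f"user-{user_id}"}

expensive_lookup(1)
expensive_lookup(1)  # served from the cache, the body does not run again
info = expensive_lookup.cache_info()
print(info.hits, info.misses)  # 1 1
```

Only when this local cache no longer fits in one process, or must be shared across instances, does an external store like Redis earn its operational cost.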
There is no perfect solution that matches all problems and requirements. Choosing a database is not an easy task, and usually there is more than 1 option that could work for you. In practice, most projects do not operate at a huge scale and don't need ultra-scalable NoSQL databases. Short downtimes (e.g. 5 min to switch between active & passive instances) can be totally hidden from the user with workarounds, e.g. an additional caching layer. Moreover, a few minutes of write downtime is often just acceptable. Of course, there are exceptions, for which solutions like Apache Cassandra are a good choice. Just remember that they shouldn't be the first choice for every single project, since they bring additional overhead related to difficult data modeling and implementation. It may be safer to start a project with a relational database, and once you know all the queries and how the load changes, move to another solution.
Unfortunately, you won't avoid politics in your everyday work. I have personally seen companies that in the past allowed only 1 selected database type, which was not the best fit for most of their services. There are some that bought Oracle licenses or racks many years ago and don't allow anything else. The only thing you can do is negotiate, analyze what you really need to achieve, and refrain a bit from using everything that is currently hyped.
I hope the information gathered in this article will help you choose the right database for your system. Keep in mind that, depending on the project and how much data you need to store, the best choice may actually be to leverage multiple databases!
If you’re looking for content around Big Data, subscribe to Data Times — the monthly publication curated by SoftwareMill’s Engineers.