
SQL Tutorial Series Vol. 4 — A Primer on Schemas

Stars and Snowflakes make for beautiful viewing

Bruce A. Lee

--

Welcome to part four of my SQL Tutorial Series, where I break down the fundamentals for those just learning the language.

Part 1, Part 2, Part 3

Before we get further into the nuances of writing queries to pull data, we need to cover some of the theory behind how the tables we interact with in a query are laid out, why that matters, and how it may affect your workflow.

To do that, we’ll need to talk about two of the most widely used types of schema found in relational databases.

For starters, a schema is a representation of a plan or model in the form of an outline. In databases, it's the blueprint for how tables are laid out and related to one another.

HR Database Schema

This is from the prior example I used in the last volume of the series. When talking about the layout of the database we’re querying, this is a great example of a star schema in use.

Databricks (the SaaS Data Warehousing company) defines a star schema as:

A multi-dimensional data model used to organize data in a database so that it is easy to understand and analyze. Star schemas can be applied to data warehouses, databases, data marts, and other tools. The star schema design is optimized for querying large data sets.

On the flipside of that, we have snowflake schemas which Databricks’ glossary defines as:

A multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into subdimensions. Snowflake schemas are commonly used for business intelligence and reporting in OLAP data warehouses, data marts, and relational databases.

More on Snowflake Schemas:

The schema for the Northwind DB

As I've touched on above, just as the last database had a star schema as its layout, what you see above is an excellent example of a snowflake schema.

Queries against this kind of database will need to be more structured (and more complex) and intentionally built around how you join everything together, because of the added granularity in how the database is laid out: dimensions are split out further into lookup (sub-dimension) tables.

For example, if you look at the employees table, you'll see that it's further divided into employee territories, the territories themselves, and their regions. Likewise, the customers table is further split into demographics.

This is because the snowflake schema removes redundancy in the data itself, normalizing the data for use in other applications.

In comparison, star schemas feature denormalized data; denormalization refers to deliberately repeating the same values within a table so that fewer joins are needed. (1)
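Here's a small sketch of that difference using hypothetical product/category dimension tables (the names are illustrative, not from either example database). The star-style table repeats the category name on every row; the snowflake-style version stores it once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star-style (denormalized): the category name repeats on every product row.
cur.executescript("""
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_name TEXT          -- repeated value = redundancy
);
INSERT INTO dim_product_star VALUES
    (1, 'Chai',  'Beverages'),
    (2, 'Chang', 'Beverages');  -- 'Beverages' stored twice
""")

# Snowflake-style (normalized): the category lives once in its own table.
cur.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
INSERT INTO dim_category VALUES (10, 'Beverages');
INSERT INTO dim_product VALUES (1, 'Chai', 10), (2, 'Chang', 10);
""")

# Renaming a category in the normalized layout is a single-row update;
# in the denormalized layout every duplicated row would have to change.
cur.execute("UPDATE dim_category SET category_name = 'Drinks' WHERE category_id = 10")
```

That single-row update is exactly why normalized schemas are easier to keep consistent, and repeated values are exactly where the extra storage in a star schema goes.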

Due to that normalization and lack of data redundancy, tables in a snowflake schema take up much less disk space. That adds up in a production environment: less money is spent storing data that can span terabytes or petabytes.

On the other hand, tables in a star schema are much easier to query, because there aren't as many complex joins to consider when referencing data in adjacent tables. The trade-off is that it's harder to maintain integrity, because new inserts, updates, or deletes can leave the duplicated values inconsistent. (1)

However, your queries can reference that data much more easily thanks to the redundancy brought on by denormalization. That can also save processing costs in a production-centric work environment, since big-data workloads demand throughput that's taxing on computing resources.
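To see the query-side payoff, here's a sketch of a star layout with a fact table and one flat dimension table (again, hypothetical names). The same employee-to-region lookup that took three joins in the snowflake example collapses to a single join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Star layout: one fact table joined directly to a flat dimension table.
CREATE TABLE dim_employee (
    employee_id INTEGER PRIMARY KEY,
    last_name TEXT,
    territory_name TEXT,   -- flattened into the dimension, not split out
    region_name TEXT
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    employee_id INTEGER REFERENCES dim_employee(employee_id),
    amount REAL
);
INSERT INTO dim_employee VALUES (1, 'Davolio', 'Boston', 'Eastern');
INSERT INTO fact_sales VALUES (100, 1, 250.0), (101, 1, 175.0);
""")

# One join gets sales by region; no chain of sub-dimension hops needed.
rows = cur.execute("""
SELECT d.region_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_employee d ON d.employee_id = f.employee_id
GROUP BY d.region_name
""").fetchall()
print(rows)  # [('Eastern', 425.0)]
```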

So the use-case of either really depends on the data maturity of the organization you end up working for.

Realistically, that is why you'll find that snowflake schemas are most widely adopted in organizations that have achieved maturity in their data practices.

These orgs will employ an OLAP (Online Analytical Processing) service for their data warehousing, in conjunction with either a star or snowflake schema in their tables.

Conversely, while an OLTP (Online Transactional Processing) service may resemble a snowflake schema in nature, it uses neither of the two types of schema and is optimized for transactions across systems. That said, OLTP and OLAP can be used together in enterprise data systems to great effect.

As was previously mentioned, in SQL, query complexity can sometimes make or break a project in terms of timelines and the resources available to complete it.

That is why it's extremely important to learn early on that optimizing queries is of the utmost importance if you're working with Data Warehouses, Data Lakes, Data Marts, etc.

Usually, queries break or end up woefully inefficient for a handful of reasons, for example:

  1. You’re referencing the wrong tables or fields (columns) in your joins
  2. You’re using the wrong kinds of joins in the query itself
  3. Subqueries add another layer of join complexity, which can compound both 1 & 2
  4. Something went wrong in your WHERE logic (which can also come into play with subqueries)
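Reason 2 in particular bites people silently. Here's a sketch with hypothetical customers/orders tables: an INNER JOIN quietly drops customers with no orders, where a LEFT JOIN was what the report actually needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');  -- Globex has no orders
INSERT INTO orders VALUES (10, 1);
""")

# Wrong join kind: INNER JOIN silently drops customers with no orders.
inner = cur.execute("""
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
""").fetchall()

# Intended: LEFT JOIN keeps every customer, with NULL where no order exists.
left = cur.execute("""
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
""").fetchall()

print(len(inner), len(left))  # 1 2
```

No error is raised in the first query; the rows are just gone, which is why comparing row counts against expectations is a useful sanity check while troubleshooting.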

Learning where to look when encountering broken query logic is important, as is optimizing as you go and the troubleshooting process as a whole.

More important than that, reporting to stakeholders when these issues result in the mistiming of a project is critical, because having the soft skills to navigate these hard situations is paramount if you want to succeed in data work.

When it comes to how these schemas come into play, it helps to keep in mind that in star schemas the data is usually replicated across different tables, so it should be easier to query what you need from adjacent tables.

Whereas with snowflake, it helps to be as concise with your query as possible.

So in this volume of the tutorial we learned:

  • What schemas are
  • The kinds of schemas found in the wild (I may touch on galaxy schemas later)
  • How star and snowflake schemas differ
  • Normalization vs. Denormalization
  • When you’ll likely encounter one type of schema over the other
  • A little on OLAP and OLTP services…
  • …and why optimizing queries is important

As always, thank you for your consideration of this here tutorial series!

In the next volume, I’ll go over Common Table Expressions, subqueries and temp tables.

All related concepts, but important to consider when writing complex queries that span multiple tables and require deeper logic and nuance to pull specific data.

Keep on dragon on!

Bruce

--


Bruce A. Lee

Sometimes I'll write about data, sometimes I'll just rant, but either way it's sure to be entertaining.