The Star Schema: A Hands-On Approach to Modeling

Jonathan Duran
10 min read · Jun 5, 2021

Introduction

Data modeling is an art. It's the ultimate intersection between analytics and engineering, the point at which you are simultaneously thinking about performance, extensibility, and scalability, along with the coverage of possible business queries.

I won’t get into the weeds of choosing the ‘right’ data model for your firm, however, I will say there are many data models to choose from, each with its own pros and cons, but ultimately this choice depends on 3 factors:

  1. The source data (do you have disparate flat files that share a key? or does your data come from normalized production database tables?)
  2. The skillset of your Analytics/Data Engineers — Star schemas involve much more detailed work than, say, wide denormalized tables.
  3. The business users. Nothing matters if they do not adopt the data model. This can happen if your data model causes long-running queries, inconsistent data, or is just too complicated/overwhelming for day-to-day use.

Setting The Stage

Let’s start with a few points.

  1. We are on the data team for a fictional stock market trading firm that needs some insights into its orders data. After analyzing our source data and business needs, the team has decided that modeling the orders data as a star schema would be the optimal choice.
  2. We receive data exports from our backend team. The means of transfer isn’t important, as we will start our project at the point where the data is loaded into our warehouse. The raw data is normalized and consists of three tables — Users, Orders, and Symbols.
  3. Google’s BigQuery is our data warehouse (you can use any columnar data warehouse or RDBMS to follow along).

Source Data

I’ve kept things simple here, but for anyone who isn’t familiar with the world of trading, let’s go over each table in detail.

Users — This is simply a table that stores one record per user. Users are unique and contain some extra information such as city, state, etc.

Orders — When users want to trade, they first have to place an order. This adds complexity to our model because the status of the order can change over time. For example, you may place an order to buy 100 shares of Tesla today and cancel it tomorrow.

Symbols — This table stores all the tradable symbols available at our fictitious trading firm. In the real world, this table’s entries would consist of companies like Apple, Tesla, etc. I have also included some extra information just for the sake of having more data.

* All the data is randomly generated and is not based on anything real. The Users data is created using the Python Faker library, so while it may seem like this is data on real people, it’s not. I will include the scripts used to create these tables at the end of the article.

To help visualize the relationships between the tables, let's check out an ER diagram:

ER Diagram for Backend Tables

Remember, this is not our analytical model. A data model optimized for operational use is very different from one optimized for analytics. Transforming the former into the latter is one of the reasons data and analytics engineers exist.

Loading The Raw Data Into BQ

You are free to use any database system, but do keep in mind that cloud data warehouses such as BigQuery, Redshift, and Snowflake aren’t free. If you are looking to follow along for free, set up an open-source RDBMS like PostgreSQL or MySQL. The dialect differs slightly across engines, but any syntax issues can usually be resolved with a quick Google search.

The first step is to grab the three CSV files that represent our tables from my repository (you can clone the repo to your local machine and grab the files from there). Create the corresponding tables in your database. In my case, I created a raw_hands_on_star_schema dataset (BQ’s version of a schema) that contains the three raw tables.

Raw Tables in BigQuery

BigQuery makes it very easy to load data via its UI; however, if you are using something like PostgreSQL, you will probably need to execute a few CREATE TABLE and COPY commands. Using the Users table as an example, the DDL should look like the following:

CREATE TABLE users (
    user_id INTEGER,
    first_name VARCHAR(155),
    last_name VARCHAR(155),
    email VARCHAR(155),
    city VARCHAR(155),
    state VARCHAR(155),
    date_joined DATE
);

while the COPY command looks like this:

COPY users
FROM 'path/to/local/file/users.csv'
DELIMITER ','
CSV HEADER;

Change ‘path/to/local/file/users.csv’ to the path where you saved the downloaded files. Both of these commands work in PostgreSQL but may need a bit of tweaking if you are using another relational database.

The DDL for the rest of the tables is included in the ddl.sql file in my GitHub repository.
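
For reference, here is a sketch of what the orders table DDL might look like in PostgreSQL. The column names are taken from the join query used later in the article; the types and precisions are my own assumptions, so check ddl.sql for the real definitions.

CREATE TABLE orders (
    order_id INTEGER,
    user_id INTEGER,
    symbol_id INTEGER,
    price NUMERIC(10, 2),      -- assumed precision
    quantity INTEGER,
    order_date DATE,
    buy_or_sell VARCHAR(10),   -- assumed length; e.g. 'buy' or 'sell'
    order_status VARCHAR(20)   -- assumed length; e.g. 'executed' or 'canceled'
);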

The Dimensional Design Process

There are four decisions we need to make at this stage.

  1. Selecting the business process.
  2. Declaring the grain.
  3. Identifying the dimensions.
  4. Identifying the facts.

This is literally textbook Kimball (literally: go to p. 38 of The Data Warehouse Toolkit).

The business process in our project is the orders placed by our users. Taking a quick look at the raw orders table, we see two numerical columns — price and quantity — from which we can generate key metrics for the business. These will be the facts/measures in our final fact table.
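
To make that concrete, here is a quick, purely illustrative query computing metrics directly from those two columns against the raw table. Total notional value is my own example, not a metric defined anywhere in the source data.

-- hypothetical metrics derived from price and quantity
SELECT
    SUM(price * quantity) AS total_notional_traded,
    ROUND(AVG(price), 2) AS avg_order_price
FROM `gcpuser-project.raw_hands_on_star_schema.orders`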

Now that we have our business process clearly defined, we can tackle the grain. Kimball encourages targeting the atomic grain — ‘the lowest level at which data is captured by a given business process’ — which in our case means defining our records at the order level. This way we don’t limit our business users by preemptively guessing their questions; instead, we give them the most granular data and let them explore on their own.

Next to the grain, the most important contributor to BI users adopting our model is the dimension tables. Requests usually come in at least two parts: first the measure (how much, how many, etc.), then the filter/descriptor (who, what, when, etc.). The more dimensions we make available, the more our users will be able to uncover by slicing up the data.

Finally, we move to our fact table. There are a few different types of fact tables; however, choosing one isn’t really up to us. Taking a close look at our business process and its grain, it is clear that our table is transactional. Each order is an event that happens once at a specific point in time.

📝 Other types of fact tables include the Periodic Snapshot — each row summarizes a measurement over some period of time, the Accumulating Snapshot — each row represents an instance of a business process with a defined lifecycle (such as an order or a claim), and the Factless Fact Table — a table without measures.
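
For contrast, a periodic snapshot version of our orders data would store one row per day rather than one row per order. A rough sketch of what that might look like, purely for illustration and not something we build in this article:

-- hypothetical daily periodic snapshot (illustration only)
SELECT
    order_date,
    COUNT(*) AS orders_placed,
    SUM(price * quantity) AS notional_traded
FROM `gcpuser-project.raw_hands_on_star_schema.orders`
GROUP BY order_date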

Forming the Star Schema

The easiest way to begin the design of the star schema is to denormalize the raw tables into one record. This will give us a wide view of our orders transaction and help us build out our dimensions.

The following query will do just that.

WITH users AS (
    SELECT *
    FROM `gcpuser-project.raw_hands_on_star_schema.users`
),
symbols AS (
    SELECT *
    FROM `gcpuser-project.raw_hands_on_star_schema.symbols`
),
orders AS (
    SELECT *
    FROM `gcpuser-project.raw_hands_on_star_schema.orders`
),
joined AS (
    SELECT
        u.user_id,
        u.last_name,
        u.email,
        u.city,
        u.state,
        u.date_joined,
        o.order_id,
        o.price,
        o.quantity,
        o.order_date,
        o.buy_or_sell,
        o.order_status,
        s.symbol_id,
        s.symbol,
        s.date_added,
        s.listed_at
    FROM orders AS o
    LEFT JOIN symbols AS s
        ON o.symbol_id = s.symbol_id
    LEFT JOIN users AS u
        ON o.user_id = u.user_id
)
SELECT *
FROM joined
LIMIT 1

Great! We now have a complete look at what an order transaction looks like.

Denormalized Query Result

The record is made up of our two business facts, price and quantity, along with foreign keys and various descriptors that will be used to build our dimension tables.

Lucky for us, it doesn’t look like any changes are needed to convert the raw users table into a users dimension table. The same can be said about the symbols table, and so the only real data modeling left to do is creating the date and orders dimension tables.

Dates

For the dates dimension table, there are plenty of ways to go about generating a series of past and future dates. This really depends on the dialect your database is using but should look very similar to what I have below (Once again, I am using BigQuery).

-- dim_dates
CREATE TABLE `gcpuser-project.marts_hands_on_star_schema.dim_dates` AS (
    SELECT
        date AS date_id,
        date,
        EXTRACT(MONTH FROM date) AS month,
        EXTRACT(YEAR FROM date) AS year
    FROM UNNEST(
        GENERATE_DATE_ARRAY('2014-01-01', CURRENT_DATE('America/New_York'), INTERVAL 1 DAY)
    ) AS date
)

The query above generates a table of distinct dates, one day apart, starting from the arbitrary date ‘2014-01-01’ to the current date. Since dates are distinct, I’ve chosen to also use them for the date_id field.

Additionally, this table needs to be recreated daily to include the most recent current date. BigQuery allows us to schedule queries while other engines have similar functionality to help automate jobs. Alternatively, we could have generated dates extending far into the future and not have to worry about making daily updates to the table.
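
If you would rather avoid the daily refresh, the only change is swapping the end of the date range for a fixed future date. A sketch of that variant, where the end date is an arbitrary choice of mine:

-- dim_dates generated through an arbitrary future date
CREATE TABLE `gcpuser-project.marts_hands_on_star_schema.dim_dates` AS (
    SELECT
        date AS date_id,
        date,
        EXTRACT(MONTH FROM date) AS month,
        EXTRACT(YEAR FROM date) AS year
    FROM UNNEST(
        GENERATE_DATE_ARRAY('2014-01-01', '2030-12-31', INTERVAL 1 DAY)
    ) AS date
)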

Junk Orders

We will build our fact table on top of the raw orders table, but not before removing a few descriptive columns. The buy_or_sell and order_status columns can be used to create what is called a junk dimension table, where the rows are simply all possible combinations of those columns’ values. A hash of each combination will also be computed and used as a surrogate key.

-- jnk_orders
CREATE TABLE `gcpuser-project.marts_hands_on_star_schema.jnk_orders` AS (
    WITH values AS (
        SELECT DISTINCT buy_or_sell, order_status
        FROM `gcpuser-project.raw_hands_on_star_schema.orders`
    ),
    apply_surrogate_key AS (
        SELECT
            SHA256(CONCAT(buy_or_sell, order_status)) AS jnk_orders_id,
            buy_or_sell,
            order_status
        FROM values
    )
    SELECT *
    FROM apply_surrogate_key
)

Let’s discuss this a bit further. Assume that our orders dimension table contained the values of buy_or_sell and order_status for each order_id in the fact table.

The first issue we encounter is redundancy. Given the cardinality of the possible combinations of these two columns (there are only 4 distinct combinations), storing them per order would introduce quite a bit of repetitive data in our dimension table.

The bigger issue is performance. Our orders dimension table would grow at the same rate as our orders fact table. We need to try our best to avoid such a design, and instead choose dimension tables that grow slowly relative to the fact table.

This is where a junk dimension table helps. We populate the table with all the possible values of each column, create a new surrogate key based on the combination of a row’s values, and add this new key to the fact table for reference. The result is a table that does not grow unless the underlying business process changes (e.g., we add a new order_status such as ‘modified’).
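
To make the usage concrete, filtering orders by status then goes through the junk dimension rather than a column on the fact table. A small sketch against the fact table we build below, assuming the literal values 'buy' and 'executed':

-- count executed buy orders via the junk dimension
SELECT COUNT(*) AS executed_buys
FROM `gcpuser-project.marts_hands_on_star_schema.fct_orders` AS f
LEFT JOIN `gcpuser-project.marts_hands_on_star_schema.jnk_orders` AS j
    ON f.jnk_orders_id = j.jnk_orders_id
WHERE j.buy_or_sell = 'buy'
    AND j.order_status = 'executed'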

Users & Symbols

The users and symbols dimension tables can be created as one-to-one copies of the raw tables; no changes to either table’s schema are needed.
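
In BigQuery, this can be a simple CREATE TABLE ... AS SELECT straight from the raw dataset. A minimal sketch, assuming the same marts dataset used for the other tables (the dim_symbols name is my own choice):

-- dim_users and dim_symbols as one-to-one copies of the raw tables
CREATE TABLE `gcpuser-project.marts_hands_on_star_schema.dim_users` AS (
    SELECT *
    FROM `gcpuser-project.raw_hands_on_star_schema.users`
);

CREATE TABLE `gcpuser-project.marts_hands_on_star_schema.dim_symbols` AS (
    SELECT *
    FROM `gcpuser-project.raw_hands_on_star_schema.symbols`
);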

Orders Fact Table

All that remains is building a fact table from the raw orders table. The only major change made during this stage is re-computing the jnk_orders_id hash that will allow us to connect to the jnk_orders dimension table. We do this as part of the DDL of the fact table.

-- fct_orders
CREATE TABLE `gcpuser-project.marts_hands_on_star_schema.fct_orders` AS (
    SELECT
        order_id,
        order_date AS order_date_id,
        user_id,
        SHA256(CONCAT(buy_or_sell, order_status)) AS jnk_orders_id,
        symbol_id,
        price,
        quantity
    FROM `gcpuser-project.raw_hands_on_star_schema.orders`
)

Now that everything is modeled out, let's take a look at an ERD of our star schema.

ERD Star Schema

As well as a look at the table structure inside of BigQuery.

Table Structure in BQ

One thing to note is the relationship between the date dimension and the other dimension tables through their date fields. This is an acceptable design in the Kimball pattern and is referred to as an outrigger dimension.
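
As an illustration, an outrigger join lets us describe one dimension with attributes from another, for example pulling the year a user joined from dim_dates. A sketch using the columns defined above:

-- outrigger join: enrich users with attributes from the date dimension
SELECT
    u.user_id,
    d.year AS year_joined
FROM `gcpuser-project.marts_hands_on_star_schema.dim_users` AS u
LEFT JOIN `gcpuser-project.marts_hands_on_star_schema.dim_dates` AS d
    ON u.date_joined = d.date_id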

Analysis

We now have a model that can be fed into most BI tools and made available to end-users.

Let's use our model to answer a few hypothetical questions.

  1. On average, which cities are trading the highest-priced stocks?

SELECT
    city,
    ROUND(AVG(price), 2) AS avg_price
FROM `gcpuser-project.marts_hands_on_star_schema.fct_orders` AS f
LEFT JOIN `gcpuser-project.marts_hands_on_star_schema.dim_users` AS u
    ON f.user_id = u.user_id
GROUP BY city
ORDER BY avg_price DESC
LIMIT 5
Results

2. What is the daily proportion of orders that are executed (as opposed to canceled)?

SELECT
    date,
    SUM(CASE WHEN order_status = 'executed' THEN 1 ELSE 0 END) / COUNT(*) AS proportion_executed
FROM `gcpuser-project.marts_hands_on_star_schema.fct_orders` AS f
LEFT JOIN `gcpuser-project.marts_hands_on_star_schema.dim_dates` AS d
    ON f.order_date_id = d.date_id
LEFT JOIN `gcpuser-project.marts_hands_on_star_schema.jnk_orders` AS o
    ON f.jnk_orders_id = o.jnk_orders_id
GROUP BY date
ORDER BY date DESC
Results

Wrap Up

I hope this article provided a good opportunity to practice data modeling. Making data accessible to others is, I think, the biggest value-add in the entire data spectrum.

As always, please check out the GitHub repo to follow along with the article.

👋 Cheers.
