Star and Snowflake Schema

Star Schema
Dimensional modeling: star schema Fact tables
- Holds records of a metric
- Changes regularly
- Connects to dimensions via foreign keys
- Example:
  - Supply books to stores in USA and Canada
  - Keep track of book sales
Dimension Tables
- Holds descriptions of attributes
- Does not change as often
Star Schema Example
- One dimension
Snowflake Schema (an extension)
- More than one dimension because dimension tables are normalized
What is Normalization?
- Database design technique
- Divides tables into smalller tables and connects them via relationships
- Goal reduce redundancy and increase data integrity
Book dimension of the star schema
Most likely to have repeating values:
- Author
- Publisher
- Genre
Book dimension of the snowflake schema

Normalized and Denormalized Databases

Back to our book store example

Denormalized Query

Goal: get quantity of all Octavia E.Butler books sold in Vancouver in Q4 of 2018

  
      SELECT SUM(quantity) FROM fact_booksales
          -- Join to get city
          INNER JOIN dim_store_star on fact_booksales.store_id = dim_store_star.store_id
          -- Join to get author
          INNER JOIN dim_book_star on fact_booksales.book_id = dim_book_star.book_id
          -- Join to get year and quarter
          INNER JOIN dim_time_star on fact_booksales.time_id = dim_time_star.time_id
      WHERE
     dim_store_star.city = 'Vancouver' AND dim_book_star.author = 'Octavia E. Butler' AND dim_time_star.year = 2018 AND dim_time_star.quarter = 4;     

Normalized Query

  
      SELECT
          SUM(fact_booksales.quantity)
      FROM fact_booksales
      -- Join to get city
      INNER JOIN dim_store_sf ON fact_booksales.store_id = dim_store_sf.store_id
      INNER JOIN dim_city ON dim_store_sf.city_id = dim_city_sf.city_id
      -- Join to get author
      INNER JOIN dim_book_sf ON fact_booksales.book_id = dim_book_sf.book_id
      INNER JOIN dim_author_sf ON dim_book_sf.author_id = dim_author_sf.author_id
      -- Join to get year and quarter
      INNER JOIN dim_time_sf ON fact_booksales.time_id = dim_time_sf.time_id
      INNER JOIN dim_month_sf ON dim_time_sf.month_id = dim_month_sf.month_id
      INNER JOIN dim_quarter_sf ON dim_month_sf.quarter_id = dim_quarter_sf.quarter_id
      INNER JOIN dim_year_sf ON dim_quarter_sf.year_id = dim_year_sf.year_id
  WHERE
      dim_city_sf.city = 'Vancouver'
      AND
      dim_author_sf.author = 'Octavia E.Butler'
      AND
      dim_year_sf.year = 2018 AND dim_quarter_sf.quarter = 4;

Why Normalization?
- Normalization saves space as it eliminates data redundancy
- Normalization ensures better data integrity
  1. Enforces data consistency: Must respect naming conventions because of referential integrity, e.g:- ‘California’, not ‘CA’ or ‘california’
  2. Safer updating, removing, and inserting: Less data redundancy = less records to alter
  3. Easier to redesign by extending: Smaller tables are easier to extend than larger tables
Disadvantage of Normalization
- Complex queries require more CPU

Normal Forms

Normalization
Idenfity repeating groups of data and create new tables for them
A more formal definition:
The goals of normalization are to:
- Be able to characterize the level of redundancy in a relational schema
- Provide mechanisms for transforming schemas in order to remove redundancy
Ordered from least to most normalized:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Elementary Key Normal Form (EKNF)
- Boyce-Codd Normal Form (BCNF)
- Fourth Normal Form (4NF)
- Essential Tuple Normal Form (ETNF)
- Fifth Normal Form (5NF)
- Domain-Key Normal Form (DKNF)
- Sixth Normal Form (6NF)
1NF Rules
- Each record must be unique - no duplicate rows
- Each cell must hold one value
2NF
- Must satisfy 1NF AND
  - If primary key is one column
    - then automatically satisfies 2NF
  - If there is a composite primary key
    - then each non-key column must be dependent on all the keys
3NF
- Satisifes 2NF
- No transitive dependencies: non-key columns can’t depend on other non-key columns
Data Anomalies
What is risked if we don’t normalize enough?
1. Update anomaly
2. Insertion anomaly
3. Deletion anomaly
The more normalized the database, the less prone it will be to data anomalies
Update Anomaly
Data inconsistency caused by data redundancy when updating
To update student 520s email:
- Need to update more than one record, otherwise, there will be inconsistency
- User updating needs to know about redundancy
Insertion Anomaly
Unable toadd a record due to missing atributes
Unable to insert a student who has signed up but not enrolled in any courses
Deletion Anomaly
Deletion of record(s) causes unintentional loss of data
If we delete Student 230, what happens to the data on Cleaning Data in R?

Database Schemas

Star and Snowflake Schema

Star Schema

Star Schema Example

Snowflake Schema (an extension)

What is Normalization?

Book dimension of the star schema

Book dimension of the snowflake schema

Normalized and Denormalized Databases

Back to our book store example

Denormalized Query

Normalized Query

Why Normalization?

Disadvantage of Normalization

Normal Forms

Normalization

1NF Rules

2NF

3NF

Data Anomalies

Update Anomaly

Insertion Anomaly

Deletion Anomaly

Further Reading

OLTP and OLAP

Database Views

Database Roles and Access Control