How should we organize and manage data?
- Schemas: How should my data be logically organized?
- Normalization: Should my data have minimal dependency and redundancy?
- Views: What joins will be done most ofte?
- Access Control: Should all users of the data have the same level of access
- DBMS: How do I pick between all the SQL and noSQL options?
Approaches to processing data
OLTP OLAP Online Transaction Processing Online Analytical Processing - Find the price of a book - Calculate books with best profit margin - Update latest customer transaction - Find most loyal customers - Keep track of employee hours - Decide employee of the month OLVAP vs OLTP
OLTP OLAP Purpose support daily transactions report and analyze data Data up-to-date, operational consolidated, historical Size snapshot, gigabyters archive, terrabytes Queries simple transactions & frequent updates complex, aggregate queries & limited updates Users thousands hundreds
Storing Data
Structuring Data
- Structured data
- Follows a schema
- Defined data types & relationships
- E.g:- SQL, tables in a relational database
- Unstuctured Data
- Schemaless
- Makes up most of data in the world
- E.g: - Photos, chat logs, MP3
- Semi-structued data
- Does not follow larger schema
- Self-describing structure
- E.g: - NoSQL, XML, JSON
1 2 3 4 5 6 7 8
# Example of a JSON file 'user': { 'profile_use_background_image': true, 'statuses_count': 31, 'profile_background_color': 'CODEED', 'followers_count': 3066, ... }
- Structured data
Storing data beyond traditional databases
- Traditional databases
- For storing real-time relational structured data? OLTP
- Data warehouses
- For analyzing achieved structured data? OLAP
- Data lakes
- For storing data of all structures = flexibility and scalability
- For analyzing big data
- Traditional databases
Data Warehouses
- Optimized for analytics - OLAP
- Organized for reading/aggregating data
- Usually read-only
- Contains data from multiple sources
- Massively Parallel Processing (MPP)
- Typically uses a denormalized schema and dimensional modeling
Data Marts
- Subset of data warehouses
- Dedicated to a specific topic
- Optimized for analytics - OLAP
Data Lakes
- Store all types of data at a lower cost
- E.g:- raw, operational databases, IoT device logs, real-time, relational and non-relational
- Retains all data and can take up petabytes
- Schema-on-read as opposed to schema-on-write
- Need to catalog data otherwise becomes a data swamp
- Run big data analytics using services such as Apache Spark and Hadoop
- Useful for deep learning and data discovery because activites requires so much data
- Store all types of data at a lower cost
ETL
ELT
Database Design
What is database design?
- Determines how data is logically stored
- How is data going to be read and updated?
- Uses *database models**: high-level specifications for database structure
- Most popular: relational model
- Some other options: NoSQL models, object-oriented model, network model
- Use schemas: blueprint of the database
- Defines tables, fields, relationships, indexes, and views
- When inserting data in relational databases, schemas must be respected
- Determines how data is logically stored
Data Modeling
Process of creating a data model for the data to be stored.
- Conceptual data model: describes entities, relationships, and attributes
- Tools: data structure diagrams, e.g:- entity-relational diagrams and UML diagrams
- Logical data modle: defines tables, columns, relationships
- Tools: database models and schemas, e.g:- relational model and star schema
- Physical data model: describes physical storage
- Tools: partitions, CPUs, indexes, backup systems and tablespaces
- Conceptual data model: describes entities, relationships, and attributes
Beyond the relational model-Dimensional Modeling
Adaptation of the relational model for data warehouse design
- Optimized for OLAP queries: aggregate data, not updating (OLTP)
- Built using the star schema
- Easy to interpret and extend schema
Element of dimensional modeling
Organize by:
- What is being analyzed?
- How often do entities change?
Fact tables
- Decided by business use-case
- Holds records of a metric
- Changes regularly
- Connects to dimensions via foreign keys
Dimension tables
- Holds descriptions of attributes
- DOes not change as often