Exploring Data With Aggregation
- Reviewing summarized values for each column is a common first step in analyzing data
- If the data exists in a database, fastest way to aggregate is to use SQL
- Common Summary Statistics
MIN()
for the minimum value of a columnMAX()
for the maximum value of a columnAVG()
for the mean or average value of a column
Example:
1 2 3 4 5
SELECT AVG(InternetUse) AS MeanInternetUse, MIN(InternetUse) AS MINInternet, MAX(InternetUse) AS MAXInternet FROM EconomicIndicators
- Filtering Summary Data with Where
This SQL query filters the aggregated values using a WHERE caluse
1 2 3 4 5 6
SELECT AVG(InternetUse) AS MeanInternetUse, MIN(InternetUse) AS MINInternet, MAX(InternetUse) AS MAXInternet FROM EconomicIndicators WHERE Country = 'Solomon Islands'
- Subtotaling Aggregations into Groups with GROUP BY
1 2 3 4 5 6 7
SELECT Country, AVG(InternetUse) AS MeanInternetUse, MIN(InternetUse) AS MINInternet, MAX(InternetUse) AS MAXInternet FROM EconomicIndicators GROUP BY Country
- HAVING is the WHERE for Aggregations
Cannot use
WHERE
withGROUP BY
as it will give error This throws as error1 2 3
... GROUP BY WHERE MAX(InternetUse) > 100
Instead, use
HAVING
This is how you filter with a GROUP BY1 2
GROUP BY HAVING MAX(InternetUse) > 100
Example:
1 2 3 4 5 6 7 8
SELECT Country, AVG(InternetUse) AS MeanInternetUse, MIN(GDP) AS SmallestGDP, MAX(InternetUse) AS MAXInternetUse FROM EconomicIndicators GROUP BY Country HAVING MAX(InternetUse) > 100
Finding and Resolving Missing Data
Detecting Missing Values
- When you have no data, the empty database field contains the word
NULL
- Because
NULL
is not a number, it is not possible to use=
,<
,>
to find or compare missing values - To determine if a column contains a
NULL
value, useIS NULL
andIS NOT NULL
- When you have no data, the empty database field contains the word
Returning No NULL Values in T-SQL
1 2 3 4 5 6
SELECT Country, InternetUse, Year FROM EconomicIndicators WHERE InternetUse IS NOT NULL
The above SQL query returns records where the value is InternetUse field is not
NULL
Detecting NULLs in T-SQL
1 2 3 4 5 6
SELECT Country, InternetUse, Year FROM EconomicIndicators WHERE InternetUse IS NULL
The above SQL query returns records where the value is InternetUse field is
NULL
Blank Is Not Null
- A blank is not the same as a NULL value
- May show up in columns containing text
- An empty string
''
can be used to find blank values - The best way is to look for a column where the Length or LEN > 0
1 2 3 4 5 6
SELECT Country, GDP, Year FROM EconomicIndicators WHERE LEN(GDP) > 0
Substituting missing data with a specific value using ISNULL
1 2 3 4 5
SELECT GDP, Country, ISNULL(Country, 'Unknown') AS NewCountry FROM EconomicIndicators
The above SQL query returns a new column called NewCountry which has all the same values as the Country field but the only difference being every NULL value is replaced by
Unknown
keyword.Substituting missing data with a column using ISNULL
Substituting values from one column or another with ISNULL
1 2 3 4 5
SELECT TradeGDPPercent, ImportGoodPercent, ISNULL(TradeGDPPercent,ImportGoodPercent) AS NewPercent FROM EconomicIndicators
Substituting NULL values using COALESCE
COALESCE
returns the first non-missing value
Syntax: COALESCE(value_1, value_2, value_3,….,value_n)- If
value_1
isNULL
andvalue_2
is notNULL
, returnvalue_2
- If
value_1
andvalue_2
areNULL
andvalue_3
is notNULL
, returnvalue_3
Example:
1 2 3 4 5
SELECT TradeGDPPercent, ImportGoodPercent COALESCE(TradeGDPPercent, ImportGoodPercent, 'N/A') AS NewPercent FROM EconomicIndicators
- If
Changing column values with CASE
Syntax:
CASE
WHEN Boolean_expression THEN result_expression [ …n ]
[ ELSE else_result_expression ]
ENDChanging column values with CASE IN T-SQL
1 2 3 4 5 6
SELECT Continent CASE WHEN Continent = 'Europe' or Continent = 'Asia' THEN 'Eurasia' ELSE 'Other' END AS NewContinent FROM EconomicIndicators
Using CASE statements to create value groups
We are binning the data here into discrete groups
1 2 3 4 5 6 7 8 9 10 11
SELECT Country, LifeExp, CASE WHEN LifeExp < 30 Then 1 WHEN LifeExp > 29 AND LifeExp < 40 THEN 2 WHEN LifeExp > 39 AND LifeExp < 50 THEN 3 WHEN LifeExp > 49 AND LifeExp < 60 THEN 4 ELSE 5 END AS LifeExpGroup FROM EconomicIndicators WHERE Year = 2007