Data Imputation
- Divide by zero error when calculating Avg Fare/TripDistance
- EDA uncovers hundreds of TaxiRide trip records with Trip Distance = 0
- Data Imputation methods to resolve
- Mean
- Hot Deck
- Omission
Mean Imputation
- Replace missing value with mean
- Doesn’t change the mean value
- Increases correlations with other columns
1 2 3 4 5 6 7 8 9 10 11 12 13
CREATE PROCEDURE dbo.ImputeMean AS BEGIN DECLARE @AvgTripDuration AS float SELECT @AvgTripDuration = AVG(Duration) FROM CapitalBikeShare WHERE Duration > 0 UPDATE CapitalBikeShare SET Duration = @AvgTripDuration WHERE Duration = 0 END;
Hot Deck Imputation
- Missing value set to randomly selected value
TABLESAMPLE
clause ofFROM
clause
1 2 3 4 5 6 7 8 9 10 11 12 13
CREATE FUNCTION dbo.GetDurHotDeck() RETURNS decimal(18,4) AS BEGIN RETURN (SELECT TOP 1 Duration FROM CapitalBikeShare TABLESAMPLE(1000 rows) WHERE Duration > 0) END SELECT StartDate, "TripDuration" = CASE WHEN Duration > 0 THEN Duration ELSE dbo.GetDurHotDeck() END FRom CapitalBikeShare;
Case Study of UDFs
Conversion UDFs
1 2 3 4 5 6 7 8 9 10 11 12 13 14
CREATE FUNCTION dbo.ConvertMileToMeter(@miles numeric) RETURNS numeric AS BEGIN RETURN (SELECT @miles * 1609.34) END CREATE FUNCTION dbo.ConvertCurrency(@Currency numeric, @ExchangeRate numeric) RETURNS numeric AS BEGIN RETURN (SELECT @ExchangeRate * @Currency) END
Formatting Tools
Before Formatting
1 2 3 4 5 6
SELECT DATENAME(weekday, StartDate) AS 'DayOfWeek', SUM(Duration) AS TotalDuration FROM CapitalBikeShare GROUP BY DATENAME(weekday, StartDate) ORDER BY DATENAME(weekday, StartDate)
Sort by logical weekday
1 2 3 4 5 6 7 8 9 10 11 12 13
SELECT DATENAME(weekday, StartDate) as 'DayOfWeek', SUM(Duration) as TotalDuration FROM CapitalBikeShare GROUP BY DATENAME(WEEKDAY, StartDate) ORDER BY CASE WHEN Datename(WEEKDAY, StartDate) = 'Sunday' THEN 1 WHEN Datename(WEEKDAY, StartDate) = 'Monday' THEN 2 WHEN Datename(WEEKDAY, StartDate) = 'Tuesday' THEN 3 WHEN Datename(WEEKDAY, StartDate) = 'Wednesday' THEN 4 WHEN Datename(WEEKDAY, StartDate) = 'Thursday' THEN 5 WHEN Datename(WEEKDAY, StartDate) = 'Friday' THEN 6 WHEN Datename(WEEKDAY, StartDate) = 'Saturday' THEN 7 END ASC;
1 2 3 4 5 6 7 8 9 10 11 12 13
SELECT TOP 5 FORMAT(CAST(StartDate as Date), 'd', 'de-de') AS 'German Date', FORMAT(CAST(StartDate as Date), 'd', 'en-us') AS 'US Eng Date', FORMAT(Sum(Duration), 'n', 'de-de') AS 'German Duration', FORMAT(SUM(Duration), 'n', 'en-us') AS 'US Eng Duration', FORMAT(SUM(Duration),'#,0.00') AS 'Custom Numeric' FROM CapitalBikeShare GROUP BY CAST(StartDate as Date)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
SELECT DATENAME(weekday, StartDate) AS 'DayOfWeek', FORMAT(SUM(Duration),'#,0.00') AS 'TotalDuration' FROM CapitalBikeShare GROUP BY DATENAME(WEEKDAY, StartDate) ORDER BY CASE WHEN Datename(WEEKDAY, StartDate) = 'Sunday' THEN 1 WHEN Datename(WEEKDAY, StartDate) = 'Monday' THEN 2 WHEN Datename(WEEKDAY, StartDate) = 'Tuesday' THEN 3 WHEN Datename(WEEKDAY, StartDate) = 'Wednesday' THEN 4 WHEN Datename(WEEKDAY, StartDate) = 'Thursday' THEN 5 WHEN Datename(WEEKDAY, StartDate) = 'Friday' THEN 6 WHEN Datename(WEEKDAY, StartDate) = 'Saturday' THEN 7 END ASC