Home Case Study EDA and Imputation
Post
Cancel

Case Study EDA and Imputation

  • Data Imputation

    • Divide by zero error when calculating Avg Fare/TripDistance
    • EDA uncovers hundreds of TaxiRide trip records with Trip Distance = 0
    • Data Imputation methods to resolve
      • Mean
      • Hot Deck
      • Omission
  • Mean Imputation

    • Replace missing value with mean
    • Doesn’t change the mean value
    • Increases correlations with other columns
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
          CREATE PROCEDURE dbo.ImputeMean
          AS
          BEGIN
          DECLARE @AvgTripDuration AS float
    
          SELECT @AvgTripDuration = AVG(Duration)
          FROM CapitalBikeShare
          WHERE Duration > 0
    
          UPDATE CapitalBikeShare
          SET Duration = @AvgTripDuration
          WHERE Duration = 0
          END;
    
  • Hot Deck Imputation

    • Missing value set to randomly selected value
    • TABLESAMPLE clause of FROM clause
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
          CREATE FUNCTION dbo.GetDurHotDeck()
          RETURNS decimal(18,4)
          AS BEGIN
          RETURN (SELECT TOP 1 Duration
          FROM CapitalBikeShare
          TABLESAMPLE(1000 rows)
          WHERE Duration > 0)
          END
          SELECT
              StartDate,
              "TripDuration" = CASE WHEN Duration > 0 THEN Duration 
                                      ELSE dbo.GetDurHotDeck() END
          FRom CapitalBikeShare;
    

Case Study of UDFs

  • Conversion UDFs

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    
          CREATE FUNCTION dbo.ConvertMileToMeter(@miles numeric)
          RETURNS numeric
          AS
          BEGIN
          RETURN (SELECT @miles * 1609.34)
          END
    
    
          CREATE FUNCTION dbo.ConvertCurrency(@Currency numeric, @ExchangeRate numeric)
          RETURNS numeric
          AS
          BEGIN
          RETURN (SELECT @ExchangeRate * @Currency)
          END
    

Formatting Tools

  • Before Formatting

    1
    2
    3
    4
    5
    6
    
          SELECT
              DATENAME(weekday, StartDate) AS 'DayOfWeek',
              SUM(Duration) AS TotalDuration
          FROM CapitalBikeShare
          GROUP BY DATENAME(weekday, StartDate)
          ORDER BY DATENAME(weekday, StartDate)
    
  • Sort by logical weekday

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
          SELECT DATENAME(weekday, StartDate) as 'DayOfWeek',
          SUM(Duration) as TotalDuration
          FROM CapitalBikeShare
          GROUP BY DATENAME(WEEKDAY, StartDate)
          ORDER BY
              CASE WHEN Datename(WEEKDAY, StartDate) = 'Sunday' THEN 1
                  WHEN Datename(WEEKDAY, StartDate) = 'Monday' THEN 2
                  WHEN Datename(WEEKDAY, StartDate) = 'Tuesday' THEN 3
                  WHEN Datename(WEEKDAY, StartDate) = 'Wednesday' THEN 4
                  WHEN Datename(WEEKDAY, StartDate) = 'Thursday' THEN 5
                  WHEN Datename(WEEKDAY, StartDate) = 'Friday' THEN 6
                  WHEN Datename(WEEKDAY, StartDate) = 'Saturday' THEN 7
          END ASC;
    
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
          SELECT TOP 5
          FORMAT(CAST(StartDate as Date), 'd', 'de-de')
          AS 'German Date',
          FORMAT(CAST(StartDate as Date), 'd', 'en-us')
          AS 'US Eng Date',
          FORMAT(Sum(Duration), 'n', 'de-de')
          AS 'German Duration',
          FORMAT(SUM(Duration), 'n', 'en-us')
          AS 'US Eng Duration',
          FORMAT(SUM(Duration),'#,0.00')
          AS 'Custom Numeric'
          FROM CapitalBikeShare
          GROUP BY CAST(StartDate as Date)
    

    image

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    
          SELECT DATENAME(weekday, StartDate)
          AS 'DayOfWeek',
          FORMAT(SUM(Duration),'#,0.00')
          AS 'TotalDuration'
          FROM CapitalBikeShare
          GROUP BY DATENAME(WEEKDAY, StartDate)
          ORDER BY
              CASE
                  WHEN Datename(WEEKDAY, StartDate) = 'Sunday' THEN 1
                  WHEN Datename(WEEKDAY, StartDate) = 'Monday' THEN 2
                  WHEN Datename(WEEKDAY, StartDate) = 'Tuesday' THEN 3
                  WHEN Datename(WEEKDAY, StartDate) = 'Wednesday' THEN 4
                  WHEN Datename(WEEKDAY, StartDate) = 'Thursday' THEN 5
                  WHEN Datename(WEEKDAY, StartDate) = 'Friday' THEN 6
                  WHEN Datename(WEEKDAY, StartDate) = 'Saturday' THEN 7
              END ASC
    

    image

This post is licensed under CC BY 4.0 by the author.