Home Getting started with csvkit
Post
Cancel

Getting started with csvkit

  • What is csvkit?

    csvkit:

    • is a suite of command-line tools
    • is developed in Python by Wireservice
    • offers data processing and cleaning capabilities on CSV files
    • has data capabilities that rival Python, R, and SQL
    • for documentation visit link: Documentation
  • csvkit installation

    Install csvkit using Python package manager pip:

    1
    
          pip install csvkit
    

    Upgrade csvkit to the latest version:

    1
    
          pip install --upgrade csvkit
    
  • in2csv: converting files to csv

    Syntax:

    1
    
          in2csv SpotifyData.xlsx > SpotifyData.csv
    

    Prints the first sheet in Excel to console and does not save

    1
    
          in2csv SpotifyData.xlsx
    

    > redirects the output and saves it as a new file SpotifyData.csv

    1
    
      > SpotifyData.csv
    

    Use --names or -n option to print all sheet names in SpotifyData.xlsx

    1
    
          in2csv -n SpotifyData.xlsx
    

    Use --sheet option followed by the sheet Worksheet1_Popularity to be converted.

    1
    
          in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv
    
  • csvlook: data preview on the command line

    csvlook: renders a cSV to the command line in a Markdown-compatible, fixed-width format

    Documentation:

    1
    
          csvlook -h
    

    Syntax:

    1
    
          csvlook Spotify_Popularity.csv
    
  • csvstat: descriptive stats on CSV data files

    csvstat: prints descriptive summary statistison all columns in CSV (e.g. mean, median, unique valuescoutns) Documentation:

    1
    
          csvstat - h
    

    Syntax:

    1
    
          csvstat Spotify_Popularity.csv
    

Filtering data using csvkit

  • What does it mean to filter data?

    We can create a subset of the original data file by:

    1. Filtering the data by column
    2. Filtering the data by row csvcut: filters data using column name or position csvgrep: filters data by row value through exact match, pattern matching, or even regex
  • csvcut: filtering data by column

    csvcut: fitlers and truncates CSV files by column name or column position Documentation:

    1
    
          csvcut -h
    

    Use --names or -n option to print all column names in Spotify_MusicAttributes.csv.

    1
    
          csvcut -n Spotify_MusicAttributes.csv
    

    To return the first column in the data, by position:

    1
    
          csvcut -c 1 Spotify_MusicAttributes.csv
    

    image

    To return only the first column in the data, by name:

    1
    
          csvcut -c "track_id" Spotify_MusicAttributes.csv
    

    To return the second and third column in the data, by position:

    1
    
          csvcut -c 2,3 Spotify_MusicAttributes.csv
    

    To return the second and third column in the data, by name:

    1
    
          csvcut -c "danceability", "duration_ms" Spotify_MusicAttributes.csv
    
  • csvgrep: filtering data by row value

    csvgrep:

    • fitlers by row using exact match or regex fuzzy matching
    • must be paired with one of these options: -m: followed by the exact row value to filter -r: followed with a regex pattern -f: followed by the path to a file

    Documentation:

    1
    
          csvgrep -h
    

    Example: Find in Spotify_Popularity.csv where track_id = 5RCPsfzmEpTXMCTNk7wEfQ

    1
    
          csvgrep -c 'track_id` -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv
    
    1
    
          csvgrep -c 1 -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv
    

Stacking data and chaining commands with csvkit

  • csvstack: stacking multiple CSV fies

    csvstak: stacks up the rows from two or more CSV files

    Documentation:

    1
    
          csvstack -h
    

    Stack two similar files Spotify_Rank6.csv and Spotify_Rank7.csv into one file.

    Preview the data to check schema:

    1
    
          csvlook Spotify_Rank6.csv
    

    Syntax:

    1
    2
    3
    
          csvstack Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv
    
          csvlook Spotify_AllRanks.csv
    
    1
    2
    3
    4
    
          csvstack -g "Rank6", "Rank7" -n "source"\
          Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv
    
          csvlook Spotify_AllRanks.csv
    
  • chaining command-line commands

    ; links commands together and runs sequentially

    1
    
          csvlook Spotify_All.csv; csvstat SpotifyData_All.csv
    

    && links commands togethers,but only runs the 2nd command if the 1st succeeds

    1
    
          csvlook SpotifyData_All.csv && csvstat SpotifyData_All_data.csv
    

    > re-directs the output from the 1st command to the location indicated as the 2nd

    1
    
          in2csv SpotifyData.xlsx > SpotifyData.csv
    

    | uses the output of the 1st command as input to the 2nd Example: Output of csvcut is not well informed:

    1
    
          csvcut -c "track_id", "danceability" Spotify_MusicAttributes.csv
    

    Re-format csvcut’s output by piping the output as input to csvlool:

    1
    
          csvcut -c 'track_id', 'danceablity' Spotify_Popularity.csv | csvlook
    
This post is licensed under CC BY 4.0 by the author.