What is csvkit?
csvkit
:- is a suite of command-line tools
- is developed in Python by Wireservice
- offers data processing and cleaning capabilities on CSV files
- has data capabilities that rival Python, R, and SQL
- for documentation visit link: Documentation
csvkit installation
Install
csvkit
using Python package managerpip
:1
pip install csvkit
Upgrade
csvkit
to the latest version:1
pip install --upgrade csvkit
in2csv: converting files to csv
Syntax:
1
in2csv SpotifyData.xlsx > SpotifyData.csv
Prints the first sheet in Excel to console and does not save
1
in2csv SpotifyData.xlsx
>
redirects the output and saves it as a new fileSpotifyData.csv
1
> SpotifyData.csv
Use
--names
or-n
option to print all sheet names inSpotifyData.xlsx
1
in2csv -n SpotifyData.xlsx
Use
--sheet
option followed by the sheetWorksheet1_Popularity
to be converted.1
in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv
csvlook: data preview on the command line
csvlook
: renders a cSV to the command line in a Markdown-compatible, fixed-width formatDocumentation:
1
csvlook -h
Syntax:
1
csvlook Spotify_Popularity.csv
csvstat: descriptive stats on CSV data files
csvstat
: prints descriptive summary statistison all columns in CSV (e.g. mean, median, unique valuescoutns) Documentation:1
csvstat - h
Syntax:
1
csvstat Spotify_Popularity.csv
Filtering data using csvkit
What does it mean to filter data?
We can create a subset of the original data file by:
- Filtering the data by column
- Filtering the data by row
csvcut
: filters data using column name or positioncsvgrep
: filters data by row value through exact match, pattern matching, or even regex
csvcut: filtering data by column
csvcut
: fitlers and truncates CSV files by column name or column position Documentation:1
csvcut -h
Use
--names
or-n
option to print all column names inSpotify_MusicAttributes.csv
.1
csvcut -n Spotify_MusicAttributes.csv
To return the first column in the data, by position:
1
csvcut -c 1 Spotify_MusicAttributes.csv
To return only the first column in the data, by name:
1
csvcut -c "track_id" Spotify_MusicAttributes.csv
To return the second and third column in the data, by position:
1
csvcut -c 2,3 Spotify_MusicAttributes.csv
To return the second and third column in the data, by name:
1
csvcut -c "danceability", "duration_ms" Spotify_MusicAttributes.csv
csvgrep: filtering data by row value
csvgrep
:- fitlers by row using exact match or regex fuzzy matching
- must be paired with one of these options:
-m
: followed by the exact row value to filter-r
: followed with a regex pattern-f
: followed by the path to a file
Documentation:
1
csvgrep -h
Example: Find in
Spotify_Popularity.csv
wheretrack_id
=5RCPsfzmEpTXMCTNk7wEfQ
1
csvgrep -c 'track_id` -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv
1
csvgrep -c 1 -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv
Stacking data and chaining commands with csvkit
csvstack: stacking multiple CSV fies
csvstak
: stacks up the rows from two or more CSV filesDocumentation:
1
csvstack -h
Stack two similar files
Spotify_Rank6.csv
andSpotify_Rank7.csv
into one file.Preview the data to check schema:
1
csvlook Spotify_Rank6.csv
Syntax:
1 2 3
csvstack Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv csvlook Spotify_AllRanks.csv
1 2 3 4
csvstack -g "Rank6", "Rank7" -n "source"\ Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv csvlook Spotify_AllRanks.csv
chaining command-line commands
;
links commands together and runs sequentially1
csvlook Spotify_All.csv; csvstat SpotifyData_All.csv
&&
links commands togethers,but only runs the 2nd command if the 1st succeeds1
csvlook SpotifyData_All.csv && csvstat SpotifyData_All_data.csv
>
re-directs the output from the 1st command to the location indicated as the 2nd1
in2csv SpotifyData.xlsx > SpotifyData.csv
|
uses the output of the 1st command as input to the 2nd Example: Output ofcsvcut
is not well informed:1
csvcut -c "track_id", "danceability" Spotify_MusicAttributes.csv
Re-format
csvcut
’s output by piping the output as input to csvlool:1
csvcut -c 'track_id', 'danceablity' Spotify_Popularity.csv | csvlook