Skip to content

runsascoded/dvc-utils

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

dvc-utils

Diff DVC files, optionally piping through other commands first.

dvc-utils on PyPI

Installation

pip install dvc-utils

Usage

dvc-utils --help
# Usage: dvc-utils [OPTIONS] COMMAND [ARGS]...
#
# Options:
#   --help  Show this message and exit.
#
# Commands:
#   diff  Diff a DVC-tracked file at two commits (or one commit vs. current
#         worktree), optionally passing both through another command first

The single subcommand, dvc-utils diff, is also exposed directly as dvc-dff:

dvc-diff

dvc-diff --help
# Usage: dvc-diff [OPTIONS] [exec_cmd...] <path>
#
#   Diff a file at two commits (or one commit vs. current worktree), optionally
#   passing both through `cmd` first
#
#   Examples:
#
#   dvc-utils diff -r HEAD^..HEAD wc -l foo.dvc  # Compare the number of lines
#   (`wc -l`) in `foo` (the file referenced by `foo.dvc`) at the previous vs.
#   current commit (`HEAD^..HEAD`).
#
#   dvc-utils diff md5sum foo  # Diff the `md5sum` of `foo` (".dvc" extension is
#   optional) at HEAD (last committed value) vs. the current worktree content.
#
# Options:
#   -c, --color                  Colorize the output
#   -r, --refspec TEXT           <commit 1>..<commit 2> (compare two commits) or
#                                <commit> (compare <commit> to the worktree)
#   -s, --shell-executable TEXT  Shell to use for executing commands; defaults
#                                to $SHELL (/bin/bash)
#   -S, --no-shell               Don't pass `shell=True` to Python
#                                `subprocess`es
#   -U, --unified INTEGER        Number of lines of context to show (passes
#                                through to `diff`)
#   -v, --verbose                Log intermediate commands to stderr
#   -w, --ignore-whitespace      Ignore whitespace differences (pass `-w` to
#                                `diff`)
#   -x, --exec-cmd TEXT          Command(s) to execute before diffing; alternate
#                                syntax to passing commands as positional
#                                arguments
#   --help                       Show this message and exit.

Examples

Parquet

See sample commands and output below for inspecting changes to a DVC-tracked Parquet file in a given commit.

Setup:

git clone https://github.com/hudcostreets/nj-crashes && cd nj-crashes # Clone + enter example repo
commit=c8ae28e  # Example commit that changed some DVC-tracked Parquet files
path=njdot/data/2001/NewJersey2001Accidents.pqt.dvc  # One of the changed files

Schema diff

Use parquet2json to observe schema changes to a Parquet file:

parquet_schema() {
    parquet2json "$1" schema
}
export -f parquet_schema
dvc-diff -r $commit^..$commit parquet_schema $path
Output
2d1
<   OPTIONAL BYTE_ARRAY Year (STRING);
8,10d6
<   OPTIONAL BYTE_ARRAY Crash Date (STRING);
<   OPTIONAL BYTE_ARRAY Crash Day Of Week (STRING);
<   OPTIONAL BYTE_ARRAY Crash Time (STRING);
14,17c10,13
<   OPTIONAL BYTE_ARRAY Total Killed (STRING);
<   OPTIONAL BYTE_ARRAY Total Injured (STRING);
<   OPTIONAL BYTE_ARRAY Pedestrians Killed (STRING);
<   OPTIONAL BYTE_ARRAY Pedestrians Injured (STRING);
---
>   OPTIONAL INT64 Total Killed;
>   OPTIONAL INT64 Total Injured;
>   OPTIONAL INT64 Pedestrians Killed;
>   OPTIONAL INT64 Pedestrians Injured;
20,21c16,17
<   OPTIONAL BYTE_ARRAY Alcohol Involved (STRING);
<   OPTIONAL BYTE_ARRAY HazMat Involved (STRING);
---
>   OPTIONAL BOOLEAN Alcohol Involved;
>   OPTIONAL BOOLEAN HazMat Involved;
23c19
<   OPTIONAL BYTE_ARRAY Total Vehicles Involved (STRING);
---
>   OPTIONAL INT64 Total Vehicles Involved;
29c25
<   OPTIONAL BYTE_ARRAY Mile Post (STRING);
---
>   OPTIONAL DOUBLE Mile Post;
47,48c43,44
<   OPTIONAL BYTE_ARRAY Latitude (STRING);
<   OPTIONAL BYTE_ARRAY Longitude (STRING);
---
>   OPTIONAL DOUBLE Latitude;
>   OPTIONAL DOUBLE Longitude;
51a48
>   OPTIONAL INT64 Date (TIMESTAMP(MICROS,false));

Here we can see that various date/time columns were consolidated, and several stringly-typed columns were converted to ints, floats, and booleans.

Row diff

Diff the first row of the Parquet file above (pretty-printed as JSON using jq), before and after the given commit:

pretty_print_first_row() {
    # Print first row of Parquet file as JSON, pretty-print with jq
    parquet2json "$1" cat -l 1 | jq .
}
export -f pretty_print_first_row
dvc-diff -r $commit^..$commit pretty_print_first_row $path
Output
2d1
<   "Year": "2001",
8,10d6
<   "Crash Date": "12/21/2001",
<   "Crash Day Of Week": "F",
<   "Crash Time": "1834",
14,17c10,13
<   "Total Killed": "0",
<   "Total Injured": "0",
<   "Pedestrians Killed": "0",
<   "Pedestrians Injured": "0",
---
>   "Total Killed": 0,
>   "Total Injured": 0,
>   "Pedestrians Killed": 0,
>   "Pedestrians Injured": 0,
20,21c16,17
<   "Alcohol Involved": "N",
<   "HazMat Involved": "N",
---
>   "Alcohol Involved": false,
>   "HazMat Involved": false,
23c19
<   "Total Vehicles Involved": "2",
---
>   "Total Vehicles Involved": 2,
29c25
<   "Mile Post": "",
---
>   "Mile Post": null,
47,48c43,44
<   "Latitude": "",
<   "Longitude": "",
---
>   "Latitude": null,
>   "Longitude": null,
51c47,48
<   "Reporting Badge No.": "830"
---
>   "Reporting Badge No.": "830",
>   "Date": "2001-12-21 18:34:00 +00:00"

This reflects the schema changes above.

Row count diff

parquet_row_count() {
    parquet2json "$1" rowcount
}
export -f parquet_row_count
dvc-diff -r $commit^..$commit parquet_row_count $path

This time we get no output; the given $commit didn't change the row count in the DVC-tracked Parquet file $path.

GZipped CSVs

Here's a "one-liner" I used in ctbk.dev, to normalize and compare headers of .csv.gz.dvc files between two commits:

# Save some `sed` substitution commands to file `seds`:
cat <<EOF >seds
s/station_//
s/latitude/lat/
s/longitude/lng/
s/starttime/started_at/
s/stoptime/ended_at/
s/usertype/member_casual/
EOF
# Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
r=c0..c1
# List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
gdno $r s3/ctbk/csvs/ | \
pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
Explanation of aliases
  • gdno (git diff --name-only): list files changed in the given commit range and directory
  • pel: parallel alias that prepends an echo {} to the command
  • ddcr (dvc-diff -cr): colorized diff output, revision range $r
  • guc (gunzip -c): uncompress the .csv.gz files
  • h1 (head -n1): only examine each file's header line
  • spc (tr , $'\n'): split the header line by commas (so each column name will be on one line, for easier diffing below)
  • kq (tr -d '"'): kill quote characters (in this case, header-column name quoting changed, but I don't care about that)
  • kcr (tr -d '\r'): kill carriage returns (line endings also changed)
  • snc (sed -f 'snake_case.sed'): snake-case column names
  • sdf (sed -f): execute the sed substitution commands defined in the seds file above
  • sort: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)

Note:

  • Most of these are exported Bash functions, allowing them to be used inside the parallel command.
  • I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the seds commands).

Example output:

…
s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration

This helped me see that the data update in question (c0..c1) dropped some fields (bikeid, birth_year, gender, tripduration) and added others (ride_id, rideable_type), for 202001 and later.