Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.

runsascoded/parquet-diff-test

parquet-diff-test

Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows} (see arrow#39399).

CLI

For each {engine, compression codec}:

parquet-diff-test writes a simple Parquet file:

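import os
import pandas as pd

df = pd.DataFrame([{ 'a': 111 }])
empty_df = df.iloc[:0]  # keep the schema, but subset the dataset to 0 rows
out_dir = f'out/{engine}/{compression}'  # engine, compression: the pair being tested
os.makedirs(out_dir, exist_ok=True)  # to_parquet does not create missing directories
parquet_path = f'{out_dir}/empty.parquet'
empty_df.to_parquet(parquet_path, engine=engine, compression=compression)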

In the same directory, it also writes:

  • metadata.json, which includes:
    • the file's pyarrow metadata (pyarrow.parquet.ParquetFile(...).metadata), as a dictionary
    • file size
    • file sha256 hash
  • xxd.txt: an xxd hex dump (with ASCII column) of every byte in empty.parquet
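As a rough illustration of the side files listed above, here is a minimal, stdlib-only sketch. The function names `xxd_lines` and `file_summary` are hypothetical (they are not taken from parquet-diff-test), and the pyarrow metadata dictionary is left out so the sketch has no third-party dependencies.

```python
import hashlib
import os


def xxd_lines(data: bytes, width: int = 16) -> list:
    """Format bytes like default `xxd` output: offset, hex pairs, ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_text = chunk.hex()
        # Group hex digits two bytes (four digits) at a time.
        groups = [hex_text[i:i + 4] for i in range(0, len(hex_text), 4)]
        # Pad short final lines so the ASCII column stays aligned.
        hex_col = " ".join(groups).ljust(width * 2 + width // 2 - 1)
        # Non-printable bytes are shown as '.'.
        ascii_col = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}: {hex_col}  {ascii_col}")
    return lines


def file_summary(path: str) -> dict:
    """Return the size and sha256 hash of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        data = f.read()
    return {"size": os.path.getsize(path), "sha256": hashlib.sha256(data).hexdigest()}


# The first 16 bytes of the gzip-compressed file, copied from the dump further down.
header = bytes.fromhex("5041 5231 1504 1500 1528 4c15 0015 0012")
print(xxd_lines(header)[0])
# 00000000: 5041 5231 1504 1500 1528 4c15 0015 0012  PAR1.....(L.....
```

Keeping the dump as text is what makes `git diff` between the OS branches readable.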

Results

The test.yml workflow runs parquet-diff-test on Ubuntu, macOS, and Windows, and pushes the results of each to a branch.

Here are the macos and windows branches compared to ubuntu:

Summary

  • ✅ In all cases, Parquet files generated by fastparquet are identical across OSes.
  • 🤔 In many cases, Parquet files generated by pyarrow differ across OSes.

pyarrow

Ubuntu Windows macOS
brotli
gzip ⚠️ ⚠️
lz4
snappy
zstd

fastparquet

Ubuntu Windows macOS
brotli
gzip
lz4
snappy
zstd

Full diffs

For example, here's the diff for {pyarrow, snappy}:

git diff ubuntu..macos -- out/pyarrow/snappy/xxd.txt
 00000280: 7741 4141 4145 4141 6741 4367 4141 414e  wAAAAEAAgACgAAAN
 00000290: 7742 4141 4145 4141 4141 4151 4141 4141  wBAAAEAAAAAQAAAA
 000002a0: 7741 4141 4149 4141 7741 4241 4149 4141  wAAAAIAAwABAAIAA
-000002b0: 6741 4141 4149 4141 4141 4541 4141 4141  gAAAAIAAAAEAAAAA
-000002c0: 5941 4141 4277 5957 356b 5958 4d41 414b  YAAABwYW5kYXMAAK
-000002d0: 5942 4141 4237 496d 6c75 5a47 5634 5832  YBAAB7ImluZGV4X2
-000002e0: 4e76 6248 5674 626e 4d69 4f69 4262 6579  NvbHVtbnMiOiBbey
-000002f0: 4a72 6157 356b 496a 6f67 496e 4a68 626d  JraW5kIjogInJhbm
-00000300: 646c 4969 7767 496d 3568 6257 5569 4f69  dlIiwgIm5hbWUiOi
-00000310: 4275 6457 7873 4c43 4169 6333 5268 636e  BudWxsLCAic3Rhcn
-00000320: 5169 4f69 4177 4c43 4169 6333 5276 6343  QiOiAwLCAic3RvcC
-00000330: 4936 4944 4173 4943 4a7a 6447 5677 496a  I6IDAsICJzdGVwIj
-00000340: 6f67 4d58 3164 4c43 4169 5932 3973 6457  ogMX1dLCAiY29sdW
-00000350: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69  1uX2luZGV4ZXMiOi
-00000360: 4262 6579 4a75 5957 316c 496a 6f67 626e  BbeyJuYW1lIjogbn
-00000370: 5673 6243 7767 496d 5a70 5a57 786b 5832  VsbCwgImZpZWxkX2
-00000380: 3568 6257 5569 4f69 4275 6457 7873 4c43  5hbWUiOiBudWxsLC
-00000390: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
-000003a0: 5569 4f69 4169 6457 3570 5932 396b 5a53  UiOiAidW5pY29kZS
-000003b0: 4973 4943 4a75 6457 3177 6556 3930 6558  IsICJudW1weV90eX
-000003c0: 426c 496a 6f67 496d 3969 616d 566a 6443  BlIjogIm9iamVjdC
-000003d0: 4973 4943 4a74 5a58 5268 5a47 4630 5953  IsICJtZXRhZGF0YS
-000003e0: 4936 4948 7369 5a57 356a 6232 5270 626d  I6IHsiZW5jb2Rpbm
-000003f0: 6369 4f69 4169 5656 5247 4c54 6769 6658  ciOiAiVVRGLTgifX
-00000400: 3164 4c43 4169 5932 3973 6457 3175 6379  1dLCAiY29sdW1ucy
-00000410: 4936 4946 7437 496d 3568 6257 5569 4f69  I6IFt7Im5hbWUiOi
-00000420: 4169 5953 4973 4943 4a6d 6157 5673 5a46  AiYSIsICJmaWVsZF
-00000430: 3975 5957 316c 496a 6f67 496d 4569 4c43  9uYW1lIjogImEiLC
-00000440: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
-00000450: 5569 4f69 4169 6157 3530 4e6a 5169 4c43  UiOiAiaW50NjQiLC
-00000460: 4169 626e 5674 6348 6c66 6448 6c77 5a53  AibnVtcHlfdHlwZS
-00000470: 4936 4943 4a70 626e 5132 4e43 4973 4943  I6ICJpbnQ2NCIsIC
-00000480: 4a74 5a58 5268 5a47 4630 5953 4936 4947  JtZXRhZGF0YSI6IG
-00000490: 3531 6247 7839 5853 7767 496d 4e79 5a57  51bGx9XSwgImNyZW
-000004a0: 4630 6233 4969 4f69 4237 496d 7870 596e  F0b3IiOiB7ImxpYn
-000004b0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e  JhcnkiOiAicHlhcn
-000004c0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157  JvdyIsICJ2ZXJzaW
-000004d0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69  9uIjogIjE0LjAuMi
-000004e0: 4a39 4c43 4169 6347 4675 5a47 467a 5833  J9LCAicGFuZGFzX3
-000004f0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69  ZlcnNpb24iOiAiMi
-00000500: 3478 4c6a 5169 6651 4141 4151 4141 4142  4xLjQifQAAAQAAAB
+000002b0: 6741 4141 4330 4151 4141 4241 4141 414b  gAAAC0AQAABAAAAK
+000002c0: 5942 4141 4237 496d 6c75 5a47 5634 5832  YBAAB7ImluZGV4X2
+000002d0: 4e76 6248 5674 626e 4d69 4f69 4262 6579  NvbHVtbnMiOiBbey
+000002e0: 4a72 6157 356b 496a 6f67 496e 4a68 626d  JraW5kIjogInJhbm
+000002f0: 646c 4969 7767 496d 3568 6257 5569 4f69  dlIiwgIm5hbWUiOi
+00000300: 4275 6457 7873 4c43 4169 6333 5268 636e  BudWxsLCAic3Rhcn
+00000310: 5169 4f69 4177 4c43 4169 6333 5276 6343  QiOiAwLCAic3RvcC
+00000320: 4936 4944 4173 4943 4a7a 6447 5677 496a  I6IDAsICJzdGVwIj
+00000330: 6f67 4d58 3164 4c43 4169 5932 3973 6457  ogMX1dLCAiY29sdW
+00000340: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69  1uX2luZGV4ZXMiOi
+00000350: 4262 6579 4a75 5957 316c 496a 6f67 626e  BbeyJuYW1lIjogbn
+00000360: 5673 6243 7767 496d 5a70 5a57 786b 5832  VsbCwgImZpZWxkX2
+00000370: 3568 6257 5569 4f69 4275 6457 7873 4c43  5hbWUiOiBudWxsLC
+00000380: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
+00000390: 5569 4f69 4169 6457 3570 5932 396b 5a53  UiOiAidW5pY29kZS
+000003a0: 4973 4943 4a75 6457 3177 6556 3930 6558  IsICJudW1weV90eX
+000003b0: 426c 496a 6f67 496d 3969 616d 566a 6443  BlIjogIm9iamVjdC
+000003c0: 4973 4943 4a74 5a58 5268 5a47 4630 5953  IsICJtZXRhZGF0YS
+000003d0: 4936 4948 7369 5a57 356a 6232 5270 626d  I6IHsiZW5jb2Rpbm
+000003e0: 6369 4f69 4169 5656 5247 4c54 6769 6658  ciOiAiVVRGLTgifX
+000003f0: 3164 4c43 4169 5932 3973 6457 3175 6379  1dLCAiY29sdW1ucy
+00000400: 4936 4946 7437 496d 3568 6257 5569 4f69  I6IFt7Im5hbWUiOi
+00000410: 4169 5953 4973 4943 4a6d 6157 5673 5a46  AiYSIsICJmaWVsZF
+00000420: 3975 5957 316c 496a 6f67 496d 4569 4c43  9uYW1lIjogImEiLC
+00000430: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
+00000440: 5569 4f69 4169 6157 3530 4e6a 5169 4c43  UiOiAiaW50NjQiLC
+00000450: 4169 626e 5674 6348 6c66 6448 6c77 5a53  AibnVtcHlfdHlwZS
+00000460: 4936 4943 4a70 626e 5132 4e43 4973 4943  I6ICJpbnQ2NCIsIC
+00000470: 4a74 5a58 5268 5a47 4630 5953 4936 4947  JtZXRhZGF0YSI6IG
+00000480: 3531 6247 7839 5853 7767 496d 4e79 5a57  51bGx9XSwgImNyZW
+00000490: 4630 6233 4969 4f69 4237 496d 7870 596e  F0b3IiOiB7ImxpYn
+000004a0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e  JhcnkiOiAicHlhcn
+000004b0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157  JvdyIsICJ2ZXJzaW
+000004c0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69  9uIjogIjE0LjAuMi
+000004d0: 4a39 4c43 4169 6347 4675 5a47 467a 5833  J9LCAicGFuZGFzX3
+000004e0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69  ZlcnNpb24iOiAiMi
+000004f0: 3478 4c6a 5169 6651 4141 4267 4141 4148  4xLjQifQAABgAAAH
+00000500: 4268 626d 5268 6377 4141 4151 4141 4142  BhbmRhcwAAAQAAAB
 00000510: 5141 4141 4151 4142 5141 4341 4147 4141  QAAAAQABQACAAGAA
 00000520: 6341 4441 4141 4142 4141 4541 4141 4141  cADAAAABAAEAAAAA
 00000530: 4141 4151 4951 4141 4141 4841 4141 4141  AAAQIQAAAAHAAAAA

The pyarrow metadata is the same for both; I can't tell what explains the difference.
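One way to dig further into this diff: the ASCII column above is itself base64 text. It is presumably the base64-encoded Arrow schema that pyarrow embeds in the Parquet footer, but that interpretation is an assumption, not something the dumps state. The sketch below decodes short fragments copied from the diff lines above and checks where the key name `pandas` sits relative to the JSON value. The helper name `decode_fragment` is hypothetical.

```python
import base64


def decode_fragment(text: str) -> bytes:
    """Decode a slice of a base64 stream whose 4-character alignment is unknown.

    Tries each of the 4 possible alignments and returns the first decoding
    that contains the marker b"pandas".
    """
    for skip in range(4):
        chunk = text[skip:]
        chunk = chunk[: len(chunk) - len(chunk) % 4]  # whole 4-char groups only
        raw = base64.b64decode(chunk)
        if b"pandas" in raw:
            return raw
    raise ValueError("no alignment produced the expected marker")


# ASCII columns copied from the diff above.
# ubuntu: offsets 0x2b0-0x2d0; macos: offsets 0x4f0-0x500.
ubuntu_ascii = "gAAAAIAAAAEAAAAA" "YAAABwYW5kYXMAAK" "YBAAB7ImluZGV4X2"
macos_ascii = "4xLjQifQAABgAAAH" "BhbmRhcwAAAQAAAB"

ubuntu_raw = decode_fragment(ubuntu_ascii)
macos_raw = decode_fragment(macos_ascii)

# On ubuntu, "pandas" appears just before the start of the JSON value.
print(ubuntu_raw.index(b"pandas") < ubuntu_raw.index(b'{"index'))  # True
# On macos, "pandas" appears just after the end of the JSON value.
print(macos_raw.index(b'"}') < macos_raw.index(b"pandas"))  # True
```

With these fragments, the Ubuntu bytes place `pandas` before the JSON value, while the macOS bytes place it after the JSON's closing brace. That suggests the same strings are serialized in a different order rather than with different content, which would be consistent with the metadata comparing equal. It does not by itself identify the cause.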

  • All fastparquet Parquet files are identical.
  • pyarrow Parquet files are identical between Ubuntu and Windows, except for one header byte in the gzip codec:
git diff ubuntu..windows -- out/pyarrow/gzip/xxd.txt
 00000000: 5041 5231 1504 1500 1528 4c15 0015 0012  PAR1.....(L.....
-00000010: 0000 1f8b 0800 0000 0000 0003 0300 0000  ................
+00000010: 0000 1f8b 0800 0000 0000 000a 0300 0000  ................
 00000020: 0000 0000 0000 264c 1c15 0419 2500 0619  ......&L....%...
 00000030: 1801 6115 0416 0016 1c16 4426 0026 0829  ..a.......D&.&.)
 00000040: 1c15 0415 0015 0200 0000 1504 192c 3500  .............,5.
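The changed byte sits inside the gzip member header. Per RFC 1952, a gzip member starts with the magic bytes 1f 8b, followed by the compression method, flags, a 4-byte modification time, an extra-flags byte, and then a one-byte OS field at offset 9. The sketch below reads that field from the first 32 bytes of each dump above. The helper name `gzip_os_byte` is hypothetical, and locating the member by searching for the magic bytes is a simplification that works for these short inputs.

```python
def gzip_os_byte(data: bytes) -> int:
    """Return the OS field (header byte 9) of the first gzip member in data."""
    start = data.find(b"\x1f\x8b\x08")  # magic bytes plus deflate method
    if start < 0:
        raise ValueError("no gzip member found")
    return data[start + 9]


# First 32 bytes of each file, copied from the xxd diff above.
ubuntu = bytes.fromhex(
    "5041 5231 1504 1500 1528 4c15 0015 0012"
    "0000 1f8b 0800 0000 0000 0003 0300 0000"
)
windows = bytes.fromhex(
    "5041 5231 1504 1500 1528 4c15 0015 0012"
    "0000 1f8b 0800 0000 0000 000a 0300 0000"
)

print(gzip_os_byte(ubuntu), gzip_os_byte(windows))  # 3 10
```

In RFC 1952 the value 3 denotes Unix. The value 10 on Windows is most likely whatever the bundled compression library writes on that platform (an assumption, not confirmed here). Either way, the difference is in the gzip wrapper rather than in the Parquet data itself.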

Discussion

The discrepancy between macOS and Ubuntu has made some tests inconvenient; it would be nice to understand why it occurs.

Docker

Interestingly, I see the same macOS diffs when running run.sh in an ubuntu Docker image on a macOS host machine.
