`pycsvy` won't work with Unicode data on Windows #86

alexdewar · 2024-10-01T10:06:30Z

Currently, pycsvy's various reader functions don't pass an explicit text encoding to the underlying calls to open(), which means that the platform's default will be used.

On Linux and macOS, the default encoding is usually UTF-8, but for unfortunate legacy reasons the default on Windows is a locale-dependent code page, typically cp1252. Even worse, cp1252 only covers a small subset of Unicode, so if you try writing data containing characters outside it, you'll get a UnicodeEncodeError 😞 (and reading a UTF-8 file back as cp1252 garbles any non-ASCII characters). This actually crops up more often than you'd think, in data sets containing names with accents, for example (looking at you @dalonsoa...) and I've been bitten by it a few times recently.
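A minimal sketch of the failure mode, using only the standard library. The helper name `can_encode` and the sample strings are made up for illustration and are not part of pycsvy:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Western European accents happen to exist in cp1252...
western = "José Muñoz"
# ...but Polish diacritics and emoji do not.
polish = "Łukasz Żółć"
emoji = "😞"

for sample in (western, polish, emoji):
    print(repr(sample), "cp1252:", can_encode(sample, "cp1252"),
          "utf-8:", can_encode(sample, "utf-8"))

# Reading UTF-8 bytes with the wrong codec does not raise here; it silently
# produces mojibake instead.
mojibake = "é".encode("utf-8").decode("cp1252")  # "Ã©"
```

So the error on write only appears for characters outside the code page, while the read path can fail quietly, which is arguably worse.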

It looks like UTF-8 will be made the default, but not until Python 3.15: https://peps.python.org/pep-0686/. In the meantime I think it probably makes sense to just make UTF-8 the default for pycsvy, regardless of the platform.

Another consideration is that users can specify the encoding in the YAML header. We have a couple of options here:

- If the user specifies an encoding other than UTF-8, we raise an error/warning and use UTF-8 anyway.
- We use the user's specified encoding to parse the file (not sure whether we should also read the header with this encoding -- which will require reading it twice -- or just the CSV portion of the file).
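The second option could be sketched roughly as below. This is not pycsvy's implementation: `sniff_header_encoding` is a hypothetical helper, and it assumes the header is delimited by `---` lines, that the `encoding:` key sits at the top level, and that the key itself is plain ASCII.

```python
import tempfile
from pathlib import Path


def sniff_header_encoding(path, default="utf-8"):
    """Look for a top-level `encoding:` key in a `---`-delimited header.

    Hypothetical helper: lines are read as bytes and decoded as ASCII with
    replacement, so non-ASCII header content cannot make this pass fail.
    """
    in_header = False
    with open(path, "rb") as f:
        for raw in f:
            line = raw.decode("ascii", errors="replace").strip()
            if line == "---":
                if in_header:
                    break  # closing marker: header ended without the key
                in_header = True
                continue
            if not in_header:
                break  # file does not start with a header
            if line.startswith("encoding:"):
                return line.split(":", 1)[1].strip().strip("'\"")
    return default


# Demo: a cp1252-encoded file that declares its own encoding.
content = "---\nencoding: cp1252\n---\nname\nJosé\n"
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csvy"
    path.write_bytes(content.encode("cp1252"))
    detected = sniff_header_encoding(path)    # first pass: bytes only
    text = path.read_text(encoding=detected)  # second pass: real decode
```

This is the "read it twice" cost mentioned above: one pass over the raw bytes to find the key, then a second pass that decodes the whole file with the declared encoding.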


dc2917 · 2024-10-23T12:46:26Z

I can take a look at this.

I'm not sure it's a great idea to just enforce UTF-8. Perhaps we can pass it as the default (on every OS), and then if a custom encoding is set in the YAML header, we attempt to use that instead.

dalonsoa · 2024-10-23T12:52:17Z

Go ahead with it. To be honest, I am not really sure of the implications.

I agree it will be useful to allow the user to specify the encoding. If we do so, it should be a top-level parameter to the read/write functions, not something included within the header: if the file isn't in the encoding we assume when we open it, we won't be able to read the header in the first place! So we need to know it upfront.
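A rough sketch of what a top-level parameter could look like, with UTF-8 as the default on every platform. The functions `write_rows` and `read_rows` are hypothetical stand-ins, not pycsvy's actual read/write API:

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, rows, encoding="utf-8"):
    """Write rows as CSV, always passing an explicit encoding to open()."""
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def read_rows(path, encoding="utf-8"):
    """Read rows back; the caller states the encoding upfront."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


rows = [["name", "city"], ["Łukasz", "Kraków"], ["José", "Málaga"]]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "names.csv"
    write_rows(path, rows)           # UTF-8 regardless of platform
    round_tripped = read_rows(path)  # same default on the way back
```

Because the encoding is an argument rather than header content, nothing has to be decoded before the encoding is known, and users with legacy files can still pass e.g. `encoding="cp1252"` explicitly.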

alexdewar added the bug label Oct 1, 2024

dalonsoa added the Hacktoberfest label Oct 1, 2024

dalonsoa added this to the v1.0.0 milestone Oct 1, 2024

dc2917 self-assigned this Oct 23, 2024

dc2917 linked a pull request Oct 23, 2024 that will close this issue

Set UTF-8 as default encoding when reading and writing #124

Open
