pycsvy won't work with Unicode data on Windows #86

Open
alexdewar opened this issue Oct 1, 2024 · 2 comments · May be fixed by #124
Labels: bug (Something isn't working), Hacktoberfest
@alexdewar
Contributor

Currently, pycsvy's various reader functions don't pass an explicit text encoding to the underlying open() calls, which means that the platform's default will be used.

On Linux and macOS, the default encoding is UTF-8, but for unfortunate legacy reasons the default encoding on Windows is typically cp1252 (it depends on the system locale). Even worse, this encoding covers only a small subset of Unicode, so if you try writing data containing characters outside that subset, you'll get an error 😞. This actually crops up more often than you'd think, in data sets containing names with accents, for example (looking at you @dalonsoa...) and I've been bitten by it a few times recently.
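
To illustrate the failure mode, here is a small sketch. The `can_encode` helper and the sample names are made up for this example and are not part of pycsvy; it only shows which strings the cp1252 codec accepts.

```python
import locale


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The encoding open() falls back to when none is given (platform-dependent).
print("default:", locale.getpreferredencoding(False))

# Some accented letters exist in cp1252, many others do not.
for name in ["José", "Łukasz", "Ελένη"]:
    print(name, "cp1252:", can_encode(name, "cp1252"),
          "utf-8:", can_encode(name, "utf-8"))
```

Writing a string for which `can_encode(..., "cp1252")` is false to a file opened with that encoding raises `UnicodeEncodeError`, while UTF-8 can encode all three.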

It looks like UTF-8 will be made the default, but not until Python 3.15: https://peps.python.org/pep-0686/. In the meantime I think it probably makes sense to just use UTF-8 for pycsvy, regardless of the platform.
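
Roughly, that would mean passing `encoding="utf-8"` wherever a file is opened. A minimal sketch, assuming plain `csv` reading and writing; `write_rows` and `read_rows` are hypothetical names, not pycsvy's actual functions:

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, rows):
    # An explicit encoding makes the result identical on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f))


rows = [["name", "city"], ["Łukasz", "Łódź"]]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "people.csv"
    write_rows(path, rows)
    round_tripped = read_rows(path)

print(round_tripped == rows)
```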

Another consideration is that users can specify the encoding in the YAML header. We have a couple of options here:

  1. If the user specifies an encoding other than UTF-8, we raise an error/warning and use UTF-8 anyway
  2. We use the user's specified encoding to parse the file (not sure whether we should also read the header with this encoding -- which will require reading it twice -- or just the CSV portion of the file)
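
As a rough sketch of option 2, one way to avoid opening the file twice is to read it as bytes and decode the two parts separately. This is an illustration under several assumptions, not pycsvy's implementation: the header is delimited by `---` lines, it is itself UTF-8, and the encoding is given as a top-level `encoding:` key, which is found here with a simple regex rather than a real YAML parser. `split_csvy` and `decode_csvy` are hypothetical names.

```python
import re

HEADER_MARKER = b"---"


def split_csvy(raw: bytes) -> tuple[bytes, bytes]:
    """Split raw CSVY bytes into (header, body) at the '---' marker lines."""
    lines = raw.splitlines(keepends=True)
    markers = [i for i, line in enumerate(lines) if line.strip() == HEADER_MARKER]
    if len(markers) < 2:
        raise ValueError("no YAML header delimited by '---' lines found")
    start, end = markers[0], markers[1]
    return b"".join(lines[start + 1:end]), b"".join(lines[end + 1:])


def decode_csvy(raw: bytes, default: str = "utf-8") -> tuple[str, str]:
    """Decode the header as UTF-8 and the body with the header's encoding."""
    header_bytes, body_bytes = split_csvy(raw)
    header = header_bytes.decode("utf-8")
    # Assumption: a top-level "encoding: <name>" line; not a full YAML parse.
    match = re.search(r"^encoding:\s*(\S+)\s*$", header, flags=re.MULTILINE)
    encoding = match.group(1).strip("\"'") if match else default
    return header, body_bytes.decode(encoding)


raw = b"---\nencoding: latin-1\n---\nname\nJos\xe9\n"
header, body = decode_csvy(raw)
print(body)
```

This only works if the header can be decoded before its declared encoding is known, which holds for ASCII-compatible encodings but not for something like UTF-16.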
@alexdewar added the bug label Oct 1, 2024
@dalonsoa added this to the v1.0.0 milestone Oct 1, 2024
@dc2917 self-assigned this Oct 23, 2024
@dc2917
Contributor

dc2917 commented Oct 23, 2024

I can take a look at this.

I'm not sure it's a great idea to just enforce UTF-8. Perhaps we can pass that as a default (for all OSes), and then if a custom encoding is set in the YAML header, we attempt to use that.

@dalonsoa
Collaborator

Go ahead with it. To be honest, I am not really sure of the implications.

I agree it will be useful to allow the user to specify the encoding. If we do so, it should be a top-level parameter to the read/write functions, not included within the header, because if we try to read the header with the wrong encoding, we won't be able to read it! So we need to know it upfront.
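
Something along these lines, as a sketch only. `read_csvy_text` is a hypothetical function, not the current pycsvy API, and it assumes the header sits between two `---` lines:

```python
import tempfile
from pathlib import Path


def read_csvy_text(filename, encoding: str = "utf-8") -> tuple[str, str]:
    """Return (header_text, csv_text), decoding the whole file with `encoding`."""
    header_lines, body_lines = [], []
    markers = 0
    # The caller supplies the encoding up front; UTF-8 is the default everywhere.
    with open(filename, "r", encoding=encoding, newline="") as f:
        for line in f:
            if markers < 2 and line.strip() == "---":
                markers += 1
            elif markers == 1:
                header_lines.append(line)
            elif markers == 2:
                body_lines.append(line)
    return "".join(header_lines), "".join(body_lines)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.csvy"
    path.write_bytes("---\nauthor: José\n---\nname\nJosé\n".encode("latin-1"))
    header, body = read_csvy_text(path, encoding="latin-1")

print(header, body, sep="")
```

Calling `read_csvy_text(path)` on the same file would fail with `UnicodeDecodeError`, since the Latin-1 byte for "é" is not valid UTF-8 here, which is the reason the encoding has to be known before the header is read.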

@dc2917 dc2917 linked a pull request Oct 23, 2024 that will close this issue

3 participants