Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes handling of double quotes for unescaped fields with tests #182

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

robwithhair
Copy link

related to issue #98.

This is to fix a possible Oracle / MS Excel CSV export weirdness where fields can be created which are 'unescaped' but still contain double quote characters. While it violates RFC4180, there is no reason why we cannot parse these files correctly. I would argue that it is more correct to parse the field without error than to terminate a field early, hence splitting the field and providing confusing error message.

Example of such a field unescaped BSV "Frohsinn" Mehr-Ork-Gest e.
Example of such a field escaped "BSV ""Frohsinn"" Mehr-Ork-Gest e".

Both fields can appear in CSV files. Because the first character is not a double quote, the string is considered unquoted but it can still be parsed. Previous functionality was to error with a confusing error message due to fields being split in two at the double quote character.

I have run tests and all seem to be passing. I have also added a new test for this functionality.

Although all tests are passing there is a possibility that existing code depending on the error could be affected though I believe this is unlikely. It would be wise to test with some varied known working CSV files.

…te text fields while containing quote characters. See new test case for expected functionality. Origional functionality was to fail on parse as field was terminated on doublequote.
@johannesgerer
Copy link

Yes, this would be helpful! Currently, I have to maintain a cassava fork because of this

@andreasabel
Copy link
Member

Frankly, I am a bit uneasy with the change request, because the package prominently states that it implements RFC 4180:

cassava/cassava.cabal

Lines 7 to 9 in 5a410c9

@cassava@ is a library for parsing and encoding [RFC 4180](https://tools.ietf.org/html/rfc4180)
compliant [comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) data,
which is a textual line-oriented format commonly used for exchanging tabular data.

If we should deviate from RFC 4180, then only under a flag. (Alternatively, the faithful implementation could be under flag rfc4180 or strict-rfc-4180, and the more liberal/practical implementation the default.)
@robwithhair @johannesgerer What do you think?

@j-rockel
Copy link

What is the status here? I'd also really appreciate having this as an option!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants