Mixed data types in dataframe column may cause problems downstream #327

Open
jacobthill opened this issue Nov 3, 2022 · 0 comments

@jacobthill
Contributor

I’m having an issue with the QNL post-harvest task that I noticed while trying to fix the traject config. I asked about it on Stack Overflow (https://stackoverflow.com/questions/74293940/aggregating-multiple-data-types-in-pandas-groupby) but have had no answer so far. The basic problem is that a column can contain multiple data types, which causes problems when I try to merge rows together. For example, the subject_name_namePart column contains both strings and lists. I can’t figure out how to fix this in pandas. Is there a way to do it before the data gets into a dataframe? For example, if a value is a string, can I force it into a list with one string value, i.e. "string" > ["string"]?
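One possible way to do the conversion before the data reaches a dataframe is sketched below. It assumes the harvested records are available as a list of dicts; the helper names `as_list` and `normalize_records`, the `id` field, and the sample values are hypothetical and are not taken from the harvest code.

```python
def as_list(value):
    """Return value as a list: wrap a bare string, copy a list or tuple."""
    if value is None:
        return []  # assumption: a missing value means "no entries"
    if isinstance(value, str):
        return [value]
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]  # any other scalar becomes a one-item list


def normalize_records(records, list_fields):
    """Apply as_list to the named fields of every record; leave other fields alone."""
    return [
        {key: as_list(val) if key in list_fields else val for key, val in rec.items()}
        for rec in records
    ]


# Made-up sample records showing the mixed string/list problem.
records = [
    {"id": "a", "subject_name_namePart": "Smith, Jane"},
    {"id": "b", "subject_name_namePart": ["Doe, John", "Roe, Ann"]},
]
clean = normalize_records(records, {"subject_name_namePart"})
```

The normalized `clean` list could then be passed to `pandas.DataFrame(clean)`, so the column holds only lists from the start.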

I can't think of a reason why we would ever want to preserve mixed values in a pandas column. I think we would always be okay if we automatically converted every string to a list containing that one string. We could then use parse_csv in traject on all fields.
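If the conversion has to happen inside pandas instead, a rough sketch might look like the following. It assumes rows are merged by grouping on an identifier column; the `id` column, the helper names, and the sample values are made up for illustration and do not come from the actual task.

```python
import pandas as pd


def as_list(value):
    """Wrap a bare string in a list; copy lists and tuples."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, (list, tuple)):
        return list(value)
    return []  # assumption: None/NaN means "no entries"


def merge_lists(series):
    """Concatenate the lists in a group, dropping duplicates but keeping order."""
    merged = []
    for values in series:
        for item in values:
            if item not in merged:
                merged.append(item)
    return merged


# Made-up frame with a column that mixes strings, lists, and a missing value.
df = pd.DataFrame({
    "id": ["a", "a", "b"],
    "subject_name_namePart": ["Smith, Jane", ["Doe, John", "Smith, Jane"], None],
})

# Step 1: make every cell a list. Step 2: merge rows that share an id.
df["subject_name_namePart"] = df["subject_name_namePart"].apply(as_list)
merged = df.groupby("id")["subject_name_namePart"].agg(merge_lists)
```

Once every cell is a list, the aggregation function only has to handle one type, which is the point of the proposal above.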

Currently this is blocking me from transforming QNL data.
