More than one curator? #197

Open
urinieto opened this issue Nov 20, 2018 · 13 comments
Labels
enhancement schema Issues pertaining to schema definitions

Comments

@urinieto
Contributor

From this issue I realized that the current jams schema doesn't allow for a given annotation to be curated by more than one person. Is that true? If so, we should consider enhancing the schema to allow a list of curators.

@ejhumphrey
Collaborator

could you provide an instance where one annotation is provided by more than one curator, instead of multiple annotations provided by one curator each?

@urinieto
Contributor Author

urinieto commented Nov 20, 2018

It's not that a single annotation is "provided by more than one curator" (the annotation is provided by an annotator, right?), but that a single annotation can be curated by one or more people. For example, the HEMAN dataset employs data that were curated (but not annotated) by me and that later @irisyupingren further reviewed, cleaned, published, and formatted (i.e., curated?).

Maybe I'm employing the word "curator" in the wrong way? To be honest, it's a bit confusing from the jams specification docs:

> curator : a structured object containing contact information (name and email) for the curator of this data;
> annotator : a sandbox object to describe the individual annotator — which can be a person or a program — that generated this annotation;

urinieto added a commit to marl/jams-data that referenced this issue Nov 20, 2018
…Removed Oriol until JAMS allows for more than one curator, following this: marl/jams#197
@justinsalamon
Contributor

justinsalamon commented Nov 20, 2018

@urinieto your take is correct (imo), annotator is whoever generated the specific annotation, while curator is the person (or people) who did the work of collecting all the annotations into a dataset. Under this reasoning, it makes perfect sense to have more than one curator.

@urinieto
Contributor Author

Glad we agree on this, Justin. I just reviewed the original paper, and it also feels a bit confusing to me (who wrote this paper?):

> curator (F) is itself an object with two subfields, name and email, for the contact person responsible for the annotation; and annotator (G) is another unconstrained object, which is intended to capture information about the source of the annotation

@ejhumphrey
Collaborator

ah yes, you're right + I'm wrong – curator is the person(s) responsible for collecting the annotation, the annotator is the observer.

I guess one thing that we punted on pretty hard was having an "agent" datatype in the schema; we weren't really sure what kinds of curators would crop up (people? teams? universities?), and so it got left as a single string. In hindsight, this is kind of a great problem to have, since it's lightyears ahead of unstructured text files..

maybe I could rephrase my question better: would an array of open-ended strings be enough? or is there enough data to infer what a more structured Curator object might look like?

also I definitely wrote that section of the paper, so that makes me 0/2 on this thread.

@urinieto
Contributor Author

haha even if you wrote that section, we should've pointed out how ambiguous it was (i.e., don't be too hard on yourself, we The Authors are all to blame here 💃 ).

Ok, back to your question, I would go with either a single open-ended string (e.g., "Eric Humphrey eric@humphreystitan.com & Justin Salamon jsalamon@titanic.com"), or an array of Curator objects. Since the open-ended string might become a bit too complicated to parse, the latter option seems better to me.

Also, ideally, I would allow either one Curator or a list of Curators, but not sure how ugly this would look in terms of schema design/validation.
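One way to keep the "one Curator or a list of Curators" convenience without complicating the schema itself is to normalize on load. The following is only a minimal sketch, assuming a curator is a plain dict with `name` and `email` keys; `normalize_curators` is a hypothetical helper and not part of the jams API.

```python
from typing import Dict, List, Union

# Assumed shape of a curator object: {"name": ..., "email": ...}
Curator = Dict[str, str]


def normalize_curators(value: Union[Curator, List[Curator]]) -> List[Curator]:
    """Return a list of curators, whether given a single curator or a list.

    Hypothetical helper: lets callers pass either form while the stored
    representation is always a list.
    """
    if isinstance(value, dict):
        # A single curator object becomes a one-element list.
        return [value]
    if isinstance(value, list) and all(isinstance(c, dict) for c in value):
        # Already a list of curator objects; return a shallow copy.
        return list(value)
    raise TypeError("curator must be an object or a list of objects")
```

With this approach the schema only ever has to describe the list form, and the flexibility lives in the loader.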

@bmcfee
Contributor

bmcfee commented Nov 20, 2018

I see the Curator field as more of a point of contact, rather than the person who constructed the data per se. The way I'd want to use it is to have a way to chase down whoever's responsible for bugs or revising annotations going forward, not so much as a historical attribution field. (The latter should go in the annotator field.)

In that light, I'm not sure having multiple points of contact makes much sense, but I agree that the current Curator field is lacking in many ways.

Maybe it's worth considering the data management / revisioning proposal that @mcartwright wrote up for our OSS-MIR paper in IEEE-SPL? I'm thinking about how the Curator field could be expanded / replaced by something more useful for specific purposes, e.g., where to send bug reports. In that case, maybe a URL is more appropriate than an email address? Or something else entirely?

@ejhumphrey
Collaborator

generally if a field is to be repeated, it should always be an array, and maybe specify a minimum number of elements. afaik mixing data types (allowing either a single Curator or an array of Curators) is poor form.
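As a rough illustration of the "always an array, with a minimum number of elements" idea, here is a sketch of what such a schema fragment could look like, written as a Python dict. This is a hypothetical fragment, not the actual JAMS schema; the `required` list and the `check_curators` function are assumptions made for the example. In practice a JSON Schema validator would do this check, and the hand-rolled version below only exists to keep the example dependency-free.

```python
# Hypothetical JSON Schema fragment for a list of curators (not the real JAMS schema).
CURATORS_SCHEMA = {
    "type": "array",
    "minItems": 1,  # at least one curator must be listed
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "email": {"type": "string"},
        },
        "required": ["name"],  # assumption: only the name is mandatory
    },
}


def check_curators(value, schema=CURATORS_SCHEMA):
    """Check a value against the constraints above using only the stdlib."""
    if not isinstance(value, list):
        return False
    if len(value) < schema["minItems"]:
        return False
    item = schema["items"]
    for curator in value:
        if not isinstance(curator, dict):
            return False
        # Every required key must be present.
        if any(key not in curator for key in item["required"]):
            return False
        # Every declared property that is present must be a string.
        for key in item["properties"]:
            if key in curator and not isinstance(curator[key], str):
                return False
    return True
```

Note that a single curator object (not wrapped in a list) fails this check, which matches the "no mixed data types" position.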

to @bmcfee's point, I really like the idea of thinking about why it exists. If curator equals "who do I bother", then perhaps either URLs or email addresses are equally fine?

@justinsalamon
Contributor

My 2c as the person who put the "curator" field in there in the first place :)

The intention was precisely for attribution. While a dataset can have many annotators (especially in a crowdsourcing scenario), it usually has a small set of curators who are in charge of putting the whole thing together, quality control, etc. Basically, like an art exhibition that may consist of artworks by multiple artists (annotators in this analogy), it is usually curated by just one or two people, the curators.

Personally I think it's important to have such a field, because annotator(s) != curator(s) != point of contact.

The assumption was that the first curator is also the POC, and that people would infer that on their own. If you think it's worth adding an explicit "contact" field (e.g. with an email address) I'm totally fine with that, but not at the expense of the "curator" field, IMO.

@justinsalamon
Contributor

p.s. forgot to add, in light of the above, I'd support @urinieto's proposal of making the curator field a list of Curator.

@bmcfee bmcfee added the schema Issues pertaining to schema definitions label Aug 12, 2019
@bmcfee
Contributor

bmcfee commented Aug 12, 2019

Coming back to this one, it seems to me that curator is a collection-level attribute, not an annotation-level attribute. As we start planning for #178 and more collection-oriented things, does it even make sense to keep a curator field in the annotation metadata objects?

I'm thinking it might be better to lift that up a level; annotations can belong to collections, and collections can have curators, as well as other properties: home page, DOI, etc. For my typical use cases, a DOI pointing to a zenodo page for the dataset would be perfect. From there, I can get all the attribution and contact info I need, and the maintainers can worry about keeping things up to date there. For example, if a curator changes email address, there's currently no mechanism to propagate that information back to a bunch of jams files out on the internet. Relying on zenodo (or figshare, or whatever it happens to be) for this seems like a much better approach.
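To make the "lift it up a level" idea concrete, here is a small sketch of how collection-level metadata might be modeled. The class names (`Collection`, `AnnotationMetadata`), the field names, and the example values are all hypothetical; nothing here reflects the existing jams object model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Collection:
    """Hypothetical collection-level record shared by many annotations."""
    name: str
    doi: Optional[str] = None        # e.g. a DOI resolving to a dataset landing page
    homepage: Optional[str] = None
    curators: List[str] = field(default_factory=list)  # zero or more curators


@dataclass
class AnnotationMetadata:
    """Hypothetical per-annotation metadata that points at a collection."""
    annotator: dict
    collection: Optional[Collection] = None


# Two annotations referring to the same collection object (placeholder values).
example = Collection(name="Example collection", curators=["Curator A", "Curator B"])
meta_1 = AnnotationMetadata(annotator={"name": "annotator 1"}, collection=example)
meta_2 = AnnotationMetadata(annotator={"name": "annotator 2"}, collection=example)
```

In this layout, attribution and contact details live in one place per collection rather than being copied into every annotation.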

@urinieto
Contributor Author

I agree with @bmcfee: curator should be a collection-level attribute, and there may be more than one curator associated with a collection.

My only concern is that changing this would potentially make pretty much all JAMS files to date incompatible with the new schema. Unless we do something smart about it, with deprecation warnings and so on.

@bmcfee
Contributor

bmcfee commented Aug 12, 2019

> My only concern is that changing this would potentially make pretty much all JAMS files to date incompatible with the new schema.

Yup, that'll happen. The ideal fix here will be to 1) standardize the schema into a self-contained definition (ie without namespace runtime patching) as noted in #178, 2) put the schema under proper version control, and 3) put converters in place for migrating between versions.

If we set this up properly, then migration should be pretty easy, since we're going from an "exactly one of" to a "zero or more of" type of field, though obviously the python object model will have to change to stay usable.
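A converter for the "exactly one of" to "zero or more of" change could be quite small. The sketch below assumes a JAMS file loaded as a plain dict, with a top-level `annotations` list whose entries carry `annotation_metadata.curator` as a `{name, email}` object. The target field name `curators` and the function `migrate_curator_to_list` are hypothetical, and treating an all-empty curator as "no curator" is a design choice of this example.

```python
import copy


def migrate_curator_to_list(jam: dict) -> dict:
    """Return a copy of a JAMS-like dict with 'curator' replaced by a 'curators' list.

    Hypothetical migration: the input is left untouched.
    """
    out = copy.deepcopy(jam)
    for ann in out.get("annotations", []):
        meta = ann.get("annotation_metadata")
        if meta is None:
            continue  # nothing to migrate for this annotation
        curator = meta.pop("curator", None)
        if isinstance(curator, list):
            # Already list-shaped; keep as is.
            meta["curators"] = curator
        elif isinstance(curator, dict) and any(curator.values()):
            # Exactly one curator becomes a one-element list.
            meta["curators"] = [curator]
        else:
            # Missing or empty curator becomes an empty list.
            meta["curators"] = []
    return out
```

Going the other direction (list back to a single object) would be lossy whenever more than one curator is listed, which is one reason to version the schema before making the change.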

4 participants