Map schemas between Metax Dataset schema and SD Submit schemas Datacite, Study and Dataset #327

Open
2 of 3 tasks
Tracked by #292
genie9 opened this issue Jan 10, 2022 · 13 comments · Fixed by #387

Comments

@genie9
Contributor

genie9 commented Jan 10, 2022

Description

Metadata-submitter's Datacite, Study and Dataset schemas need to be mapped to the Metax Dataset schema: https://raw.githubusercontent.com/CSCfi/metax-api/master/src/metax_api/api/rest/v2/schemas/att_dataset_schema.json.

Tasks

DoD

Testing

@genie9 genie9 self-assigned this Jan 10, 2022
@genie9 genie9 changed the title Map schemas between Metax Dataset schema and Submitter schemas Study and Dataset Map schemas between Metax Dataset schema and Submitter schemas Datacite, Study and Dataset Jan 14, 2022
@juhtornr juhtornr added this to the 0.13.0 milestone Mar 14, 2022
@genie9 genie9 changed the title Map schemas between Metax Dataset schema and Submitter schemas Datacite, Study and Dataset Map schemas between Metax Dataset schema and SD Submit schemas Datacite, Study and Dataset Mar 21, 2022
@genie9
Contributor Author

genie9 commented Mar 21, 2022

Had a conversation on possible field mappings with @heikkil, @fmorelloCSC, and @blankdots.

Already mapped fields:

| SD Submit datacite or object | Metax research_dataset |
|---|---|
| DOI | preferred_identifier |
| title (object) | title |
| description (dataset) | description |
| abstract (study) | description |
| "CSC Sensitive Data Services for Research" | publisher |
| creators | creator |
| "restricted" | access_rights |

Will be mapped with this ticket:

| SD Submit datacite or object | Metax research_dataset |
|---|---|
| dates Updated | modified |
| dates Issued | issued |
| dates Collected | temporal |
| keywords | keyword |
| alternateIdentifiers | other_identifier |
| contributors (other than Rights Holder, Data Curator, Distributor) | contributor |
| contributors Rights Holder | rights_holder |
| contributors Data Curator | curator |
| type (study/dataset) | theme |
| language | language |
| geoLocations | spatial |
| sizes (MUST standardize to bytes) | total_remote_resources_byte_size |
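
For illustration only, the contributor rows of the mapping above could be routed roughly like this. It is a minimal sketch, not the actual implementation: the `contributorType` key and the exact role strings are assumptions about the SD Submit payload, and `split_contributors` is a hypothetical helper. Distributor is skipped because the mapping lists no Metax target for it.

```python
from collections import defaultdict

# Assumed role strings, taken from the wording of the mapping in this thread.
ROLE_TO_METAX_FIELD = {
    "Rights Holder": "rights_holder",
    "Data Curator": "curator",
}
# No Metax target is listed for this role, so it is left out.
SKIPPED_ROLES = {"Distributor"}


def split_contributors(contributors):
    """Group contributors under the Metax field their role maps to."""
    grouped = defaultdict(list)
    for contributor in contributors:
        role = contributor.get("contributorType")  # assumed key name
        if role in SKIPPED_ROLES:
            continue
        # Every other role falls back to the generic "contributor" field.
        grouped[ROLE_TO_METAX_FIELD.get(role, "contributor")].append(contributor)
    return dict(grouped)
```

Keeping the routing in one dictionary means a new role needs only one new entry.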

Need some clarification:

  • Need to consult with Yrjö Leino on whether these fields can be mapped, or whether we should add an extra field to the datacite schema:
    SD Submit datacite subject -> Metax research_dataset field_of_science

  • Will this field describe the SD Submit Study that the described SD Submit Dataset belongs to?

"is_output_of": [
  {
    "title": "Producer project",
    "description": "A project that has caused the dataset to be created",
    "$ref": "#/definitions/Project"
  }
]
  • And vice versa, will this field describe the SD Submit Dataset that belongs to the SD Submit Study being described?
"relation": [
  {
    "title": "Relation",
    "description": "A related dataset or other entity"
  }
]
  • Could this be a link to REMS?
"remote_resources": [
  {
    "title": "Remote resources",
    "description": "A concrete storage or expression format for the data in the dataset, for example a file, a database or a query interface to the data.",
    "$ref": "#/definitions/WebResource"
  }
]

Fields that could be implemented in future SD Submit versions:

SD Submit
metadata_version_identifier
version_info
version_notes
bibliographic_citation
provenance

@genie9
Contributor Author

genie9 commented Mar 22, 2022

@fmorelloCSC, @heikkil, @blankdots
Datacite Dates related pondering from Metax perspective:
Dates are submitted as an array, with each date having a date type, e.g. Issued, Updated, Collected. Issued and Updated are mapped as a date string in Metax but SD Submit treats them as arrays thus the same date_type can be present several times. Is it something that can happen deliberately or would it be a mistake?

@genie9
Contributor Author

genie9 commented Mar 22, 2022

@blankdots, @fmorelloCSC, @heikkil
The schema reference for datacite subjects, https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties#6-subject, does not force the use of any specific schema for the field of science. Could it be possible to just take into use the schema used by Metax https://metax.fairdata.fi/es/reference_data/field_of_science/_search?pretty=true&size=100?

@blankdots
Contributor

blankdots commented Mar 22, 2022

@genie9

Could it be possible to just take into use the schema used by Metax

Why not both?

Issued, Updated, Collected. Issued and Updated are mapped as a date string in Metax but SD Submit treats them as arrays thus the same date_type can be present several times. Is it something that can happen deliberately or would it be a mistake?

That is derived from the Datacite schema, and it seems to be allowed, so it is deliberate.

@blankdots
Contributor

about

Could this be a link to REMS?

We will need to integrate with REMS/SD-Apply in #291 so that we can generate the workflow needed for that link. Relevant info is available at: https://github.com/CSCfi/rems/blob/master/docs/linking.md#linking-into-a-new-application

The end link will look like: https://rems-demo.rahtiapp.fi/apply-for?resource=<datacite_doi>, where <datacite_doi> is the URL of the datacite DOI for the dataset.
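
As a rough sketch of how such a link could be assembled: the base URL is the demo instance mentioned above, `rems_apply_link` is a hypothetical helper, and percent-encoding the DOI URL is an assumption here, so check the REMS linking docs for what the `resource` parameter actually expects.

```python
from urllib.parse import urlencode

# Demo instance from this thread; a real deployment would use its own host.
REMS_APPLY_URL = "https://rems-demo.rahtiapp.fi/apply-for"


def rems_apply_link(datacite_doi_url: str) -> str:
    """Build an apply-for link with the DOI URL as the `resource` query parameter."""
    # urlencode percent-encodes ":" and "/" so the DOI URL survives as one value.
    return f"{REMS_APPLY_URL}?{urlencode({'resource': datacite_doi_url})}"
```

For example, `rems_apply_link("https://doi.org/10.1234/abc")` would give a link ending in `resource=https%3A%2F%2Fdoi.org%2F10.1234%2Fabc` (the DOI is made up).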

@genie9
Contributor Author

genie9 commented Mar 22, 2022

Could it be possible to just take into use the schema used by Metax

Why not both?

Mainly to avoid the overhead of too many form fields on the same subject, especially when the Metax-related one is the one we need most. But is there some history why the current FOS classification was chosen in the beginning for datacite subjects?

@blankdots
Contributor

But is there some history why the current FOS classification was chosen in the beginning for datacite subjects?

It is the Datacite default. OK, then feel free to open a PR and propose the necessary changes.

@genie9
Contributor Author

genie9 commented Mar 22, 2022

But is there some history why the current FOS classification was chosen in the beginning for datacite subjects?

It is the Datacite default. OK, then feel free to open a PR and propose the necessary changes.

OK... I will do that.

@genie9
Contributor Author

genie9 commented Mar 23, 2022

Issued, Updated, Collected. Issued and Updated are mapped as a date string in Metax but SD Submit treats them as arrays thus the same date_type can be present several times. Is it something that can happen deliberately or would it be a mistake?

That is derived from the Datacite schema, and it seems to be allowed, so it is deliberate.

@blankdots, @fmorelloCSC, @heikkil

Then another question arises:
Metax accepts only one date each for Issued and Updated. Should we then use:

  • issued: the chronologically first occurrence
  • modified: the chronologically last occurrence

@blankdots
Contributor

Then another question arises: Metax accepts only one date each for Issued and Updated. Should we then use:

* `issued`: the chronologically first occurrence
* `modified`: the chronologically last occurrence

imo that seems reasonable.
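
A minimal sketch of that rule, assuming each entry in the datacite dates array carries `date` and `dateType` keys and that dates are plain `YYYY-MM-DD` strings (both are assumptions about the payload, and `pick_metax_dates` is a hypothetical name):

```python
def pick_metax_dates(dates):
    """Reduce a datacite dates array to single Metax issued/modified values."""
    issued = [d["date"] for d in dates if d.get("dateType") == "Issued"]
    updated = [d["date"] for d in dates if d.get("dateType") == "Updated"]

    result = {}
    # YYYY-MM-DD strings sort chronologically, so min/max on strings is enough.
    if issued:
        result["issued"] = min(issued)
    if updated:
        result["modified"] = max(updated)
    return result
```

Other date types, such as Collected, are ignored here because they map to different Metax fields.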

@genie9
Contributor Author

genie9 commented Mar 26, 2022

@heikkil, @fmorelloCSC, and @blankdots
New updates on fields and mapping possibilities

| SD Submit datacite or object | Possible Metax research_dataset field | Consideration |
|---|---|---|
| type (study/dataset) | theme | Cannot be mapped as is, because Metax uses a predefined collection from http://www.yso.fi/onto/koko/. I think we should drop this mapping. |
| language | language | Cannot be mapped as is, because Metax uses a predefined collection from http://lexvo.org/id/. Lexvo describes over 7k languages, so it is a huge collection. We have just over 200 enums now in the Datacite schema. Should we just drop the language mapping? |
| dates Updated / Issued / Collected | modified / issued / temporal | Have to be validated against the format YYYY-MM-DD. |
| contributors Rights Holder | rights_holder | This could be an organization in the future, but for now it is mapped as Person. |
| sizes | total_remote_resources_byte_size | The sizes schema is an array on the Submitter side and an integer on the Metax side. The submitter will be able to provide file sizes after integration with SD Connect or another file upload service. |
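
The YYYY-MM-DD check for dates could look roughly like the sketch below. `is_metax_date` is a hypothetical helper, not existing submitter code, and it deliberately rejects anything other than a single calendar date.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_metax_date(value: str) -> bool:
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    # The regex enforces zero padding, which strptime alone would not.
    if not _DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Shape is right but the date does not exist, e.g. 2022-02-30.
        return False
    return True
```

Date ranges would fail this check, so a Collected range headed for temporal would need to be split into start and end dates first.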

@genie9 genie9 linked a pull request Apr 4, 2022 that will close this issue
3 tasks
@genie9
Contributor Author

genie9 commented Apr 7, 2022

The easiest mappings have been merged into the metax-integration branch. Work remains on these fields:

  • is_output_of
  • relation
  • languages
  • field of science / subjects

which look doable but need some more attention.

These will be added with separate PRs in the near future.

@juhtornr juhtornr modified the milestones: 0.13.0, Sprint 29/04 Apr 12, 2022
@ainoc ainoc modified the milestones: Sprint 29/04, Sprint 13/05 May 2, 2022
@genie9 genie9 modified the milestones: Sprint 13/05, 0.14.0 May 3, 2022
@genie9
Contributor Author

genie9 commented Jun 14, 2022

The Metax remote_resources field cannot be used as a link to SD Apply, because the dataset access rights would need to be open in that case. See https://wiki.eduuni.fi/display/cscfairdata/REMS+in+Sensitive+Data+Service

@teemukataja @juhtornr
Disclaimer: the datasets that were added by hand to Etsin (e.g. https://etsin.fairdata.fi/dataset/335a6e92-5366-473a-b239-f9e52f204f9d) do have the link to SD Apply, but that is a bug: https://jira.eduuni.fi/browse/CSCFAIRMETA-1453
