
Virginia Tech Data Repository: The purpose of the Virginia Tech Data Repository is to highlight, preserve, and provide access to research products (e.g. datasets) of the Virginia Tech community, and in doing so help to disseminate the intellectual output of the university in its land-grant mission.


Overview of Curation workflow for Data Management and Curation Services

The code in this repository implements the workflow set up by the Virginia Tech Data Repository (VTDR) to download content and deposit bags to APTrust via DART. Virginia Tech Data Services uses a simple Python client for the figshare API from Cognoma, parts of the LD-Cool-P workflow from the University of Arizona, and scripting with DART. VTDR runs on the figshare for institutions platform. The workflow creates folders for VTDR articles in review (ingest content: before curator-client interactions) and after review (published content: after curator-client interactions). The content is then bagged in tarred format. For published content, bagging includes creation of an ArchivalReadme.rtf file and a README.rtf file, plus the addition of emails and a ProvenanceLog.rtf by the curator. The bagged content is then transferred to APTrust via DART and/or to Virginia Tech Libraries storage. APTrust registry checks are made to avoid overwriting existing bags.
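
For orientation, the download step builds on figshare API calls like the one below. This is a minimal sketch assuming a placeholder token and article id; the actual scripts use the Cognoma client rather than raw requests.

import requests

BASE_URL = "https://api.figshare.com/v2"
TOKEN = "your-figshare-token"  # placeholder; the real token is set via generate_config.py
ARTICLE_ID = 212121            # placeholder article id

# Fetch an article's metadata; private (in-review) items live under /account/.
resp = requests.get(
    f"{BASE_URL}/account/articles/{ARTICLE_ID}",
    headers={"Authorization": f"token {TOKEN}"},
)
resp.raise_for_status()
article = resp.json()

# Each entry in article["files"] carries a download_url for the data files.
for f in article.get("files", []):
    print(f["name"], f["download_url"])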

Detailed documentation on how to set up a Windows or Mac environment to use this code is available in ScriptsSetupAndExecution_CurationWorkflow_Windows.docx and ScriptsSetupAndExecution_CurationWorkflow_Mac.docx.

Getting Started

Detailed instructions for environment setup are at: Setting up environment for Mac and Setting up environment for Windows.

Overview of environment setup:

conda create -n curation python=3.9
conda activate curation

Clone the VTDR repository from the VTUL GitHub organization.

  • Create a token in VTDR.
  • Open generate_config_example.py from the 'Figshare-APTrust' folder and save it as generate_config.py in the 'curation' folder. Fill in the credentials (a sketch of the resulting configuration file follows this list). Details on filling in these credentials are available here for Mac and here for Windows.
  • Download the APTrust Partner Tools. Details on accessing apt-cmd.exe are available here for Mac and here for Windows.
  • To access Google Sheets, the curator needs a Google Sheets API key. Download client_secret.json and save it in the 'curation' folder.
  • Download the DART tool to deposit content to APTrust. Fill in the credentials in DART following these instructions.
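
As a rough illustration of the configuration step above, generate_config.py can be thought of as writing the credentials you enter into a configurations.ini file. This is a minimal sketch; the section and key names below are hypothetical stand-ins, not the exact names the script uses.

import configparser

# Hypothetical section/key names; the real generate_config.py defines its own.
config = configparser.ConfigParser()
config["figshare"] = {
    "token": "your-figshare-token",
    "article_id": "212121",
}
config["aptrust"] = {
    "aws_access_key_id": "your-key-id",
    "aws_secret_access_key": "your-secret",
}

with open("configurations.ini", "w") as fh:
    config.write(fh)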

Description of the workflow:

Note: For the curation workflow, only the scripts in the 'Figshare-APTrust' folder are used.

A workflow diagram with a detailed description of each block is available at: Curator workflow detailed documentation with workflow diagram.

Overview of the workflow:

[Figure: curation workflow diagram]

Above is the curation workflow diagram by Jonathan Petters. The following is an overview of each step in the diagram:

Step 1: Request to publish a dataset is received from the client by email.

Step 2: 'Ingest record created' and step 3: 'Ingest dataset bagged and deposited':

  • VTDR figshare 'item in review' metadata (i.e., requestor, author, version, date, title, article id, and email) is entered in the VTDR curation spreadsheet.
  • The article id is entered in generate_config.py; run this script to create the configurations.ini file.
  • Run IngFolder_Download_TransferBagAPTrust.py to create an ingest folder. The ingest folder contains the 'in review' article metadata and files downloaded from figshare. These are then deposited to APTrust in tar format using the DART app (a sketch of the bagging step follows this list).
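
To make the bag-and-tar step concrete, here is a minimal sketch using the bagit Python library and the standard tarfile module. The folder name and bag-info field are illustrative assumptions, and the actual workflow drives DART rather than bagit directly.

import tarfile
import bagit

# Hypothetical ingest folder produced by the download step.
ingest_folder = "ingest_212121"

# Turn the folder into a BagIt bag in place (adds manifests and bag-info.txt).
bag = bagit.make_bag(ingest_folder, {"Source-Organization": "Virginia Tech"})
bag.validate()

# Tar the bag for deposit.
with tarfile.open(f"{ingest_folder}.tar", "w") as tar:
    tar.add(ingest_folder)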

Step 4: Provenance log and client record created. These are created manually by the curator. The provenance log is added to the ingest folder created in step 3 above; it contains the interactions between the client and the curator, along with the date and the curator's name, and is in RTF format. The client record is created in VTUL LibCRM. Details on recording are found here.

Step 5: Dataset metadata on the repository platform evaluated for quality/completeness. The article is reviewed using the edit item interface on the review page of VTDR.

Step 6: Does the metadata meet publishing requirements? Is the dataset of sufficient quality to publish? In this step, the curator decides either to continue with publication of the dataset or to decline it, based on the VTDR publishing and depositing guidelines.

Step 7: Communicate with the client to get minimum metadata and suggest other dataset-sharing improvements. The curator exchanges emails with the client about improvements to their research dataset.

Step 8: Record communications in the client record. The client record created in step 4 is updated by the curator based on the interactions.

Step 9: Modify metadata and files on the repository platform with the agreement of the client. The client's record is updated on VTDR using the edit item interface, based on the curator's interactions/suggestions.

Step 10: Record modifications in the provenance log. The provenance log created in step 4 is updated by the curator.

Step 11: Add metadata to the dataset on the repository platform, and step 11a: record metadata changes in the provenance log:

  • Run the README-creation script to generate the README.rtf file, then upload this file to the client's record (see the sketch below).
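
As a rough sketch of what generating an RTF README can look like in Python (the title and body text are placeholders; the real script assembles this content from the article's metadata and date-stamps the filename, as described in the notes below):

from datetime import date

# Placeholder content; the real script pulls these from the article metadata.
title = "Example Dataset Title"
body = "This dataset was curated by Virginia Tech Data Services."

# Minimal RTF document: bold title line, body text, generation date.
rtf = (
    r"{\rtf1\ansi"
    r"{\b " + title + r"}\line "
    + body + r"\line "
    + "Generated on " + date.today().isoformat()
    + "}"
)

# A date-stamped filename avoids overwriting an earlier README.
with open(f"README_{date.today().isoformat()}.rtf", "w") as fh:
    fh.write(rtf)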

Step 12: Publish the dataset on the repository platform, and step 12a: record the dataset publication in the provenance log. The item is published on VTDR.

Step 13: Inform the client that the dataset is published; send the citation and DOI. The client is updated via email, and email interactions are saved as PDF files.

Step 14: Publication record created, and step 15: publication dataset aggregated, bagged, and deposited. The published article metadata is recorded in the VTDR spreadsheet.

  • Run PubFolder_Download.py to download the published article metadata and files.
  • Add ProvenanceLog.rtf and the email interactions to the folder created above.
  • Run PubBagDART_TransferBagAPTrust.py to upload this to APTrust, VTUL S3, and/or a local SanDisk drive, following the options in this script (a sketch of the S3 upload option appears below).
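
For the VTUL S3 option, the upload amounts to something like the following boto3 call. The bucket name and key are hypothetical, and in the real script the choice of target and the credentials come from the configuration.

import boto3

# Hypothetical bucket and bag names; the real values come from the configuration.
BUCKET = "vtul-preservation"
bag_tar = "pub_212121.tar"

s3 = boto3.client("s3")

# Skip the upload if a bag with this name already exists, mirroring the
# "avoid overwriting existing bags" check described in the overview.
existing = s3.list_objects_v2(Bucket=BUCKET, Prefix=bag_tar)
if existing.get("KeyCount", 0) == 0:
    s3.upload_file(bag_tar, BUCKET, bag_tar)
else:
    print(f"{bag_tar} already exists in {BUCKET}; not overwriting.")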

Step 16: Complete the client record for the data publication.

The workflow is complete.

Running the Batch Scripts:

From the Explorer on the left side of VSCode, open generate_config_batch_example.py and save it as generate_config_batch.py. Fill in only the values of the credentials from generate_config.py. Please note that copying whole lines from generate_config.py to generate_config_batch.py changes the formatting and causes errors; to avoid this, copy and paste only the values. For example, for the figshare token, copy the token value only and paste it into generate_config_batch.py. The only new addition is the path to the curation services actions folder where emails are to be saved:

VTCurSerFoldPath="/Users/padma/opt/anaconda3/envs/curation/test"

Make a new folder in the 'curation' folder called 'test' (or whatever you want to name it; if you use a different name, change the path above as well) and move the contents of curation services actions (emails, etc.) into this folder.

Open AutomateProvenanceLog_Batch.py, fill in the curator name and description, and save the file.
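
Conceptually, each provenance-log entry records who did what and when. A minimal sketch follows; the field layout is an assumption, and the real script writes RTF rather than plain text.

from datetime import date

# Assumed fields; AutomateProvenanceLog_Batch.py defines its own format.
curator = "Jane Curator"
description = "Downloaded ingest content and deposited bag to APTrust."

entry = f"{date.today().isoformat()} | {curator} | {description}\n"

# Append so earlier interactions are preserved.
with open("ProvenanceLog.txt", "a") as fh:
    fh.write(entry)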

Open downloadFigshareContent_batch.py

Provide the article ids on the first line of this script:

FigshareArticleID=["212121","5453543","32232"]

Add more ids, or replace these with the article ids for the ingest/publication content. Make sure the 'curation' environment is activated (in VSCode: Ctrl+Shift+P, Select Python Interpreter, pick 'curation'). Then run downloadFigshareContent_batch.py: pick 1 for Ingest or 2 for Pub, then pick 1 for demo or 4 for repo. (If you hit a "bs4 not found" error, run "pip3 install BeautifulSoup4".)
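
Structurally, the batch run amounts to prompting for those choices and looping over the ids. The sketch below is a simplified assumption about the script's flow, with the actual download logic stubbed out (see the API sketch near the top).

# Simplified sketch of the batch flow; not the actual script.
FigshareArticleID = ["212121", "5453543", "32232"]

mode = input("Pick 1 for Ingest, 2 for Pub: ")
target = input("Pick 1 for demo, 4 for repo: ")

for article_id in FigshareArticleID:
    # The real script downloads metadata/files for each article here.
    print(f"Processing article {article_id} (mode={mode}, target={target})")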

Note:

  1. The README file is created in the path provided in the configuration, with a date stamp added to the filename to avoid overwriting earlier versions (as sketched above).
  2. The README file is uploaded to the client's account after the ingest record is created and transferred to APTrust.

git rebase for accommodating changes made locally when the remote main is ahead of local:

  • Commit local changes:
cd VTDR_RepositoryServices
git status
git add --all
git commit -m 'localchanges'
  • Fetch the remote repository:

See if the remote is named 'origin' or get the name of your remote:

git remote -v
git fetch origin
  • Rebase local changes onto the remote main:
git rebase origin/main

  • Check out local main and merge with the remote:

git checkout main
git merge origin/main
