GSoC 2020 Projects
This page contains project ideas for students applying to the Google Summer of Code 2020. We recommend that prospective students join our Slack workspace to discuss project proposals. Be sure to read our Code of Conduct: respect is important, and you will be working with a team from many backgrounds.
signac is a data management framework named after the painter Paul Signac, whose colorful pointillist style resembles a collection of data "points". The signac framework is designed to help researchers design, manage, and execute computational studies. The core data management package, signac, helps users track data and metadata for file-based workflows (e.g. large molecular simulations) with features for searchability, collaboration, reproducibility, and archival. The companion package, signac-flow, automates workflow submission on high performance computing clusters operated by universities, companies, and federal research labs. The architecture of signac is specifically aimed at research, where questions change rapidly, data models are always in flux, and computing infrastructure varies widely from project to project. Portability and fast MVPs are signac's strong suits: compute some jobs, analyze the outputs, write a paper, and archive the data. The signac framework is available for Python 3.5+, can be installed with pip or conda, and is licensed under the BSD 3-Clause license.
To learn more about signac, check out the signac website and framework documentation. The signac framework is written in Python 3, so contributors should have some familiarity with the Python language. Contributors should also be familiar with Git and GitHub, and should read our guidelines for contributors. You can also follow @signacdata on Twitter.
We recommend that new contributors get started with a "good first issue" to acquaint themselves with the project and our development process. Note that the signac framework has a few separate repositories where issues are filed:
- signac, core data management package
- signac-flow, workflow automation
- signac-dashboard, rapid data visualization in a browser
- signac-docs, the central documentation repository
- signac-examples, a set of example projects
The core of signac uses JSON files to track metadata like job state points and job documents. This data is available for the user to read and modify through a synchronization mechanism that allows user modifications to be written back to disk. In this project, you will work to improve signac's internal synced data structures to enhance the design, API, and performance.
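To make the idea of a synced data structure concrete, here is a minimal sketch of a dictionary whose modifications are written back to a JSON file. It is an illustration only and does not reproduce signac's internal implementation: the class name `JSONSyncedDict` and its methods are hypothetical, and reloading the file on every access is a deliberately simple (and slow) strategy.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping


class JSONSyncedDict(MutableMapping):
    """Hypothetical mapping that mirrors its contents to a JSON file.

    Illustration only; this is not signac's actual data structure.
    """

    def __init__(self, filename):
        self._filename = filename

    def _load(self):
        # Read the current state from disk; a missing file means "empty".
        try:
            with open(self._filename) as file:
                return json.load(file)
        except FileNotFoundError:
            return {}

    def _save(self, data):
        # Write to a temporary file first, then replace the target, so a
        # crash mid-write does not leave a truncated JSON file behind.
        dirname = os.path.dirname(os.path.abspath(self._filename))
        fd, tmp_name = tempfile.mkstemp(dir=dirname)
        with os.fdopen(fd, "w") as file:
            json.dump(data, file)
        os.replace(tmp_name, self._filename)

    def __getitem__(self, key):
        return self._load()[key]

    def __setitem__(self, key, value):
        data = self._load()
        data[key] = value
        self._save(data)

    def __delitem__(self, key):
        data = self._load()
        del data[key]
        self._save(data)

    def __iter__(self):
        return iter(self._load())

    def __len__(self):
        return len(self._load())


# Usage: two objects pointing at the same file see each other's changes.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "document.json")
    document = JSONSyncedDict(path)
    document["energy"] = 1.5
    other_view = JSONSyncedDict(path)
    print(other_view["energy"])
```

Every read and write in this sketch touches the disk, which keeps the file and the in-memory view consistent at the cost of speed. Trade-offs of this kind (caching, buffering, nested containers) are the sort of design question this project is about.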
Many signac workspaces are designed as multi-dimensional parameter sweeps, and grouping jobs according to common state point parameters is necessary for tasks like making plots that average over replicates or have multi-dimensional axes. Currently, each operation acts on a single job. In this project, you will enable users to create and execute operations that accept multiple jobs as input, via job queries or manually constructed lists. This will greatly improve the power of workflows in the signac framework. Previous work resulted in a draft of this feature, which will need to be updated substantially to account for other changes in the signac-flow execution model.
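As a rough illustration of what an operation over multiple jobs might look like, the sketch below groups jobs by a shared state point parameter and averages over replicates. It uses plain dictionaries in place of signac job objects, and the names `group_by_state_point` and `average_energy` are hypothetical; this is not the signac-flow API or the earlier draft of the feature.

```python
from collections import defaultdict
from statistics import mean

# Stand-ins for signac jobs: each has a state point ("sp") and a
# document ("doc"). Real jobs are objects, not plain dictionaries.
jobs = [
    {"sp": {"temperature": 1.0, "replica": 0}, "doc": {"energy": 2.0}},
    {"sp": {"temperature": 1.0, "replica": 1}, "doc": {"energy": 4.0}},
    {"sp": {"temperature": 2.0, "replica": 0}, "doc": {"energy": 5.0}},
    {"sp": {"temperature": 2.0, "replica": 1}, "doc": {"energy": 7.0}},
]


def group_by_state_point(jobs, key):
    """Group jobs that share the same value of one state point parameter."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["sp"][key]].append(job)
    return dict(groups)


def average_energy(*jobs):
    """Hypothetical operation that takes several jobs as input."""
    return mean(job["doc"]["energy"] for job in jobs)


# Average over replicas at each temperature.
averages = {
    temperature: average_energy(*group)
    for temperature, group in group_by_state_point(jobs, "temperature").items()
}
print(averages)
```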
Workflows in signac-flow are defined using "conditions" (such as whether an output file exists) that determine what operations to run. Recent work on signac-flow has enabled automatic detection of the dependency graph, to know when an operation depends on another operation running first. In this project, you will make it possible to automatically perform an operation's dependencies before executing the desired operation. This helps with applications such as active learning over a data space, as well as enhancing the user experience for complicated workflows.
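The following sketch shows one way dependencies could be inferred from conditions and then executed before a requested operation. It is a simplified model under stated assumptions, not signac-flow's execution model: conditions are plain strings, "running" an operation just marks its post-conditions as satisfied, and the names `Operation`, `detect_dependencies`, and `run_with_dependencies` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """Hypothetical operation with named pre- and post-conditions."""
    name: str
    pre: tuple = ()
    post: tuple = ()


def detect_dependencies(operations):
    """Map each operation to the operations that produce its pre-conditions."""
    producers = {cond: op.name for op in operations for cond in op.post}
    return {
        op.name: [producers[cond] for cond in op.pre if cond in producers]
        for op in operations
    }


def run_with_dependencies(target, operations, satisfied):
    """Run `target`, first running any incomplete operations it depends on.

    `satisfied` is a set of conditions that already hold; it is updated in
    place. Returns the names of the operations executed, in order.
    """
    by_name = {op.name: op for op in operations}
    graph = detect_dependencies(operations)
    executed = []

    def visit(name, stack):
        op = by_name[name]
        if name in stack:
            raise ValueError(f"cyclic dependency involving {name!r}")
        # An operation whose post-conditions all hold is already complete.
        if op.post and all(cond in satisfied for cond in op.post):
            return
        for dependency in graph[name]:
            visit(dependency, stack | {name})
        missing = [cond for cond in op.pre if cond not in satisfied]
        if missing:
            raise RuntimeError(f"{name!r} has unmet conditions: {missing}")
        executed.append(name)       # stand-in for actually running the operation
        satisfied.update(op.post)   # assume the run fulfils its post-conditions

    visit(target, frozenset())
    return executed


operations = [
    Operation("initialize", post=("initialized",)),
    Operation("simulate", pre=("initialized",), post=("simulated",)),
    Operation("analyze", pre=("simulated",), post=("analyzed",)),
]
print(run_with_dependencies("analyze", operations, satisfied=set()))
```

Requesting `analyze` on a fresh job triggers `initialize` and `simulate` first, while operations whose post-conditions already hold are skipped.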
- Learn to automate and scale computational workflows from laptops to the world's largest supercomputers
- Improve your skills in designing user-centered APIs, working on collaborative teams, and using scientific Python
- Work on a project that will be used by scientific researchers at institutions around the globe
- We're friendly!
Our development is distributed across 5+ time zones, and we have an active Slack workspace, biweekly video calls, and biweekly development "sprints" to coordinate our efforts.