-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
165 lines (114 loc) · 6.48 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
library(dplyr)
library(bitfield)
library(CoordinateCleaner)
library(stringr)
library(magrittr)
library(knitr)
```
# bitfield <a href='https://github.com/ehrmanns/bitfield/'><img src='man/figures/logo.svg' align="right" height="200" /></a>
<!-- badges: start -->
<!-- [![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/)](https://cran.r-project.org/package=) -->
<!-- [![DOI](https://zenodo.org/badge/DOI/)](https://doi.org/) -->
[![R-CMD-check](https://github.com/ehrmanns/bitfield/workflows/R-CMD-check/badge.svg)](https://github.com/ehrmanns/bitfield/actions)
[![codecov](https://codecov.io/gh/ehrmanns/bitfield/branch/master/graph/badge.svg?token=hjppymcGr3)](https://codecov.io/gh/ehrmanns/bitfield)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
<!-- [![](http://cranlogs.r-pkg.org/badges/grand-total/)](https://cran.r-project.org/package=) -->
<!-- badges: end -->
## Overview
This package is designed to build sequences of bits (i.e., [bitfields](https://en.wikipedia.org/wiki/Bit_field)) to capture the computational footprint of any (scientific) model workflow or output. The bit sequence is then encoded as an integer value that stores a range of information into a single column of a table or a raster layer. This can be useful when documenting
- the metadata of any dataset by collecting information throughout the dataset creation process,
- a provenance graph that documents how a gridded modelled data product was built,
- intermediate data that accrue along a workflow, or
- a set of output metrics or parameters.
Think of a bit as a switch representing off and on states. A combination of a pair of bits can store four states, and n bits can accommodate 2^n states. These states could be the outcomes of (simple) tests that document binary responses, cases or numeric values. The data produced in that way could be described as meta-analytic or meta-algorithmic data, because they can be re-used to extend an analysis pipeline or algorithm by downstream applications.
## Installation
Install the official version from CRAN:
```{r, eval=FALSE}
# install.packages("bitfield")
```
Install the latest development version from github:
```{r, eval=FALSE}
devtools::install_github("EhrmannS/bitfield")
```
## Examples
```{r example}
library(bitfield)
library(dplyr, warn.conflicts = FALSE); library(CoordinateCleaner); library(stringr)
```
Let's first load an example dataset
```{r}
tbl_bityield$x # invalid (259) and improbable (0) coordinate value
tbl_bityield$y # Inf and NaN value
tbl_bityield$commodity # NA value or mislabelled term ("honey")
tbl_bityield$yield # correct range?!
tbl_bityield$year # flags (*r)
# and there is a set of valid commodity terms
validComm <- c("soybean", "maize")
```
The first step is in creating what is called `registry` in `bitfield`. This registry captures all the information required to build the bitfield
```{r}
yieldReg <- bf_registry(name = "yield_QA",
description = "this bitfield documents quality assessment in a table of yield data.")
```
Then, individual bit flags need to be grown by specifying the respective mapping function. These functions create flags for the most common applications, such as `bf_na()` (to test for missing values), `bf_case()` (to test what case/class the observations are part of),`bf_length()` (to count the number of digits of a variable), or `bf_numeric()` to encode a numeric (floating point) variable as bit sequence.
```{r}
# tests for longitude availability
yieldReg <-
bf_na(x = tbl_bityield, # specify where to determine flags
test = "x", # ... and which variable to test
pos = 1, # specify at which position to store the flag
registry = yieldReg) # provide the registry to update
# test which case an observation is part of
yieldReg <-
bf_case(x = tbl_bityield, exclusive = FALSE,
yield >= 11, yield < 11 & yield > 9, yield < 9 & commodity == "maize",
registry = yieldReg)
# test the length (number of digits) of values
yieldReg <-
bf_length(x = tbl_bityield, test = "y",
registry = yieldReg)
# store a simplified (e.g. rounded) numeric value
# yieldReg <-
# bf_numeric(x = tbl_bityield, source = "yield", precision = 3,
# registry = yieldReg)
```
Various derived functions build on these and thus require bits according to the same rules. The resulting data structure is a record of all the things that are grown on the bitfield.
```{r}
yieldReg
```
This is, however, not yet the bitfield. The registry is merely the instruction manual, so to speak, to create the bitfield and encode it as integer, with the function `bf_encode()`.
```{r}
(intBit <- bf_encode(registry = yieldReg))
```
The bitfield can be decoded based on the registry with the function `bf_decode()` at a later point in time, where the metadata contained in the bitfield can be studied or extended in a downstream application.
```{r}
bitfield <- bf_decode(x = intBit, registry = yieldReg, sep = "-")
# -> prints legend by default, which is also available in bf_env$legend
tbl_bityield |>
bind_cols(bitfield) |>
kable()
```
The column `bf_binary`, in combination with the legend, can be read one step at a time. For example, considering the first bit, we see that no observation has an `NA` value and considering the second bit, we see that observations 4 and 6 have a `yield` smaller than 9 and a `commodity` value "maize".
## Bitfields for other data-types
Not only tabular data are supported, but also gridded data such as rasters (wip).
```{r}
library(terra, warn.conflicts = FALSE)
rst_bityield <- rast(system.file("ex/rst_bityield.tif", package="bitfield"))
levels(rst_bityield$commodity) <- tibble(id = 1:3, commodity = c("soybean", "maize", "honey"))
plot(rst_bityield)
```
# To Do
- [ ] write unit tests
- [ ] include MD5 sum for a bitfield and update it each time the bitfield is grown further
- [ ] document the provenance stuff in here