R Package Structure

Happy Boxing Day! While you're dealing with your physical packages, we're looking into R packages!

The dataset this week comes from "Historical Trends in R Package Structure and Interdependency on CRAN" by Mark Padgham and Noam Ross. In that paper, they use the {pkgstats} R package to analyze the structure of R packages over time, using an archive of all packages on CRAN as of 2022-11-22. We've provided CSV versions of two of the datasets from that paper.

The paper focuses on package characteristics over time. It might be interesting to look at the distribution of similar features (such as lines of code) across packages.
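
As a rough starting point, the sketch below summarises the distribution of a lines-of-code column. It uses a small made-up data frame (`toy_pkgs`, with invented values) in place of `cran_20221122`, and base R only so that it runs without extra packages. The log10 binning is just one possible choice for counts that span several orders of magnitude.

```r
# Toy stand-in for cran_20221122; package names and values are invented.
toy_pkgs <- data.frame(
  package = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE"),
  loc_R = c(120, 450, 980, 2300, 15000)
)

# Quartiles give a quick picture of the spread.
loc_quartiles <- quantile(toy_pkgs$loc_R, probs = c(0.25, 0.5, 0.75))
print(loc_quartiles)

# Count packages per order of magnitude: (100, 1000], (1000, 10000], ...
loc_bins <- table(cut(log10(toy_pkgs$loc_R), breaks = 2:5))
print(loc_bins)
```

With the real data, the same calls should work on `cran_20221122$loc_R`, although zero or missing values would need handling before taking logarithms.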

If you're unfamiliar with some of the terminology in this dataset, you might find the R Packages book by Hadley Wickham and Jennifer Bryan helpful.

The Data

# Option 1: tidytuesdayR package 
## install.packages("tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2023-12-26')
## OR
tuesdata <- tidytuesdayR::tt_load(2023, week = 52)

cran_20221122 <- tuesdata$cran_20221122
external_calls <- tuesdata$external_calls
internal_calls <- tuesdata$internal_calls

# Option 2: Read directly from GitHub

cran_20221122 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-12-26/cran_20221122.csv')
external_calls <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-12-26/external_calls.csv')
internal_calls <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-12-26/internal_calls.csv')

If you would like to dive deeper, you can download the larger dataset with this code:

cran_all_20221122 <- readr::read_rds("https://zenodo.org/records/7414296/files/pkgstats-CRAN-all.Rds?download=1")

How to Participate

  • Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
  • Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
  • Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.

Data Dictionary

cran_20221122.csv

variable class description
package character The name of the package
version character The package version
date double The release date of that version of the package
license character License information
files_R double Number of files in the /R directory, where numbers are recursively counted in all sub-directories
files_src double Number of files in the /src directory, where numbers are recursively counted in all sub-directories
files_inst double Number of files in the /inst/include directory, where numbers are recursively counted in all sub-directories
files_vignettes double Number of files in the /vignettes directory, where numbers are recursively counted in all sub-directories
files_tests double Number of files in the /tests directory, where numbers are recursively counted in all sub-directories
loc_R double Total lines of code across all files in the /R directory
loc_src double Total lines of code across all files in the /src directory
loc_inst double Total lines of code across all files in the /inst/include directory
loc_vignettes double Total lines of code across all files in the /vignettes directory
loc_tests double Total lines of code across all files in the /tests directory
blank_lines_R double Total number of blank lines across all files in the /R directory
blank_lines_src double Total number of blank lines across all files in the /src directory
blank_lines_inst double Total number of blank lines across all files in the /inst directory
blank_lines_vignettes double Total number of blank lines across all files in the /vignettes directory
blank_lines_tests double Total number of blank lines across all files in the /tests directory
comment_lines_R double Total number of comment lines across all files in the /R directory
comment_lines_src double Total number of comment lines across all files in the /src directory
comment_lines_inst double Total number of comment lines across all files in the /inst directory
comment_lines_vignettes double Total number of comment lines across all files in the /vignettes directory
comment_lines_tests double Total number of comment lines across all files in the /tests directory
rel_space double Measure of relative white space across all files in the /R, /src, and /inst directories
rel_space_R double Measure of relative white space across all files in the /R directory
rel_space_src double Measure of relative white space across all files in the /src directory
rel_space_inst double Measure of relative white space across all files in the /inst directory
rel_space_vignettes double Measure of relative white space across all files in the /vignettes directory
rel_space_tests double Measure of relative white space across all files in the /tests directory
indentation double The number of spaces used to indent code, with values of -1 indicating indentation with tab characters
nexpr double The median number of nested expressions per line of code, counting only those lines which have any expressions
num_vignettes double Number of vignettes
num_demos double Number of demos
num_data_files double Number of data files
data_size_total double Total size of all package data
data_size_median double Median size of package data files
translations character List of translations where package includes translations files, given as a comma-separated list of (spoken) language codes
urls character Package URL(s)
bugs character URL for BugReports
desc_n_aut double Number of contributors with role of author
desc_n_ctb double Number of contributors with role of contributor
desc_n_fnd double Number of contributors with role of funder
desc_n_rev double Number of contributors with role of reviewer
desc_n_ths double Number of contributors with role of thesis advisor
desc_n_trl double Number of contributors with role of translator (relating to translation between computer languages, not spoken languages)
depends character Comma-separated character entries for all depends packages
imports character Comma-separated character entries for all imports packages
suggests character Comma-separated character entries for all suggests packages
enhances character Comma-separated character entries for all enhances packages
linking_to character Comma-separated character entries for all linking_to packages
n_fns_r double Number of functions in R
n_fns_r_exported double Number of exported R functions
n_fns_r_not_exported double Number of non-exported R functions
n_fns_src double Number of functions (or objects) in other computer languages, including functions in both src and inst/include directories
n_fns_per_file_r double Number of functions (or objects) per individual file in /R
n_fns_per_file_src double Number of functions (or objects) per individual file in source directories other than /R
npars_exported_mn double Mean number of parameters per exported R function
npars_exported_md double Median number of parameters per exported R function
loc_per_fn_r_mn double Mean lines of code per function in /R
loc_per_fn_r_md double Median lines of code per function in /R
loc_per_fn_r_exp_mn double Mean lines of code per exported function in /R
loc_per_fn_r_exp_md double Median lines of code per exported function in /R
loc_per_fn_r_not_exp_mn double Mean lines of code per non-exported function in /R
loc_per_fn_r_not_exp_md double Median lines of code per non-exported function in /R
loc_per_fn_src_mn double Mean lines of code per function in other source directories
loc_per_fn_src_md double Median lines of code per function in other source directories
languages character Programming languages used in the package
doclines_per_fn_exp_mn double Mean lines of documentation per exported function in /R
doclines_per_fn_exp_md double Median lines of documentation per exported function in /R
doclines_per_fn_not_exp_mn double Mean lines of documentation per non-exported function in /R
doclines_per_fn_not_exp_md double Median lines of documentation per non-exported function in /R
doclines_per_fn_src_mn double Mean lines of documentation per function in other source directories
doclines_per_fn_src_md double Median lines of documentation per function in other source directories
docchars_per_par_exp_mn double Mean number of documentation characters per parameter of exported R functions
docchars_per_par_exp_md double Median number of documentation characters per parameter of exported R functions
n_edges double Number of edges connecting functions (and other objects) across all languages in package code
n_edges_r double Number of edges connecting R functions (and other objects)
n_edges_src double Number of edges connecting functions (and other objects) in other languages
n_clusters double Number of distinct clusters in package network
centrality_dir_mn double Mean centrality of all network edges, calculated from directed representation of network
centrality_dir_md double Median centrality of all network edges, calculated from directed representation of network
centrality_dir_mn_no0 double Mean centrality of all network edges, calculated from directed representation of network, excluding edges with centrality of zero
centrality_dir_md_no0 double Median centrality of all network edges, calculated from directed representation of network, excluding edges with centrality of zero
centrality_undir_mn double Mean centrality of all network edges, calculated from undirected representation of network
centrality_undir_md double Median centrality of all network edges, calculated from undirected representation of network
centrality_undir_mn_no0 double Mean centrality of all network edges, calculated from undirected representation of network, excluding edges with centrality of zero
centrality_undir_md_no0 double Median centrality of all network edges, calculated from undirected representation of network, excluding edges with centrality of zero
num_terminal_edges_dir double Number of terminal edges, calculated from directed representation of network
num_terminal_edges_undir double Number of terminal edges, calculated from undirected representation of network
node_degree_mn double Mean node degree
node_degree_md double Median node degree
node_degree_max double Maximum node degree
cpl_instability_pkg double Coupling instability, a measure of the extent to which packages depend on external functionality without other packages in turn depending on them
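
Two conventions in this table can trip up a first analysis: the dependency columns hold comma-separated strings, and `indentation` uses -1 as a code for tabs. The sketch below shows one way to handle both with base R. `toy_pkgs` is made-up data, and the assumption that entries are separated by a comma with optional whitespace should be checked against the real columns.

```r
# Toy stand-in for a few columns of cran_20221122; values are invented.
toy_pkgs <- data.frame(
  package = c("pkgA", "pkgB", "pkgC"),
  imports = c("dplyr, rlang", NA, "stats"),
  indentation = c(2, -1, 4),
  stringsAsFactors = FALSE
)

# Split each imports string on commas (assumed separator), tolerating spaces.
imports_list <- strsplit(toy_pkgs$imports, ",\\s*")

# Count imports per package; a missing entry counts as zero.
n_imports <- vapply(imports_list, function(x) sum(!is.na(x)), integer(1))

# Long format: one row per (package, import) pair, dropping missing entries.
imports_long <- data.frame(
  package = rep(toy_pkgs$package, lengths(imports_list)),
  import = unlist(imports_list),
  stringsAsFactors = FALSE
)
imports_long <- imports_long[!is.na(imports_long$import), ]

# Recode the -1 sentinel so it is not mistaken for a number of spaces.
indent_style <- ifelse(toy_pkgs$indentation == -1, "tabs", "spaces")
```

The same pattern should apply to `depends`, `suggests`, `enhances`, and `linking_to`.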

external_calls.csv

variable class description
package_from character The package that makes the call
package_to character The package that the source package calls
n_total double The total number of calls from package_from to package_to
n_unique double The number of unique calls from package_from to package_to
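
One simple use of this table is to rank packages by how often other packages call them. The sketch below does that on a made-up data frame (`toy_external`, invented values) using base R. It illustrates the column layout described above and is not taken from the paper.

```r
# Toy stand-in for external_calls; package names and counts are invented.
toy_external <- data.frame(
  package_from = c("pkgA", "pkgA", "pkgB", "pkgC"),
  package_to = c("base", "stats", "base", "base"),
  n_total = c(40, 5, 12, 8),
  n_unique = c(10, 2, 6, 4)
)

# Total calls received by each target package, most-called first.
calls_received <- aggregate(n_total ~ package_to, data = toy_external, FUN = sum)
calls_received <- calls_received[order(-calls_received$n_total), ]
print(calls_received)

# Number of distinct packages calling each target package.
n_callers <- table(toy_external$package_to)
print(n_callers)
```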

internal_calls.csv

variable class description
package character The package being evaluated
n_total double The total number of calls from functions in one file to functions in another file within the package
n_unique double The number of unique calls from functions in one file to functions in another file within the package
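
Because `internal_calls` and `external_calls` share the `n_total` measure, they can be combined to ask what share of a package's recorded calls stay inside the package. The sketch below is one way to do that in base R. Both data frames are made up, and `internal_share` is a hypothetical name introduced here for illustration.

```r
# Toy stand-ins for internal_calls and external_calls; values are invented.
toy_internal <- data.frame(
  package = c("pkgA", "pkgB"),
  n_total = c(30, 10),
  n_unique = c(10, 5)
)
toy_external <- data.frame(
  package_from = c("pkgA", "pkgA", "pkgB"),
  package_to = c("base", "stats", "base"),
  n_total = c(50, 20, 30)
)

# Sum the external calls made by each package.
external_totals <- aggregate(n_total ~ package_from, data = toy_external, FUN = sum)
names(external_totals) <- c("package", "n_external")

# Join on package and compute the internal share of all recorded calls.
combined <- merge(toy_internal, external_totals, by = "package")
combined$internal_share <- combined$n_total / (combined$n_total + combined$n_external)
print(combined)
```

Packages that appear in only one of the two tables are dropped by this inner join; `merge(..., all = TRUE)` would keep them with missing values.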

Cleaning Script

The authors provided mostly [clean data](https://zenodo.org/records/7414296). We chose one of their datasets, lightly cleaned the data, and saved it as a CSV. We also split its external_calls column into two files: one for calls to functions in other files in the same package (internal_calls.csv) and one for calls to functions in other packages (external_calls.csv).

library(tidyverse)
library(here)
library(fs)

working_dir <- here::here("data", "2023", "2023-12-26")

cran_20221122_url <- "https://zenodo.org/records/7414296/files/pkgstats-CRAN-current.Rds?download=1"
cran_20221122 <- readr::read_rds(cran_20221122_url) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    dplyr::across(
      c(translations, depends:linking_to, languages, external_calls),
      \(x) {
        dplyr::na_if(x, "NA") |> 
          dplyr::na_if("")
      }
    )
  )
dplyr::glimpse(cran_20221122)


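# Expand the comma-separated external_calls column into one row per
# (package_from, package_to) pair.
calls_20221122 <- cran_20221122 |>
  dplyr::select(package_from = package, external_calls) |>
  # One row per "package_to:n_total:n_unique" entry.
  tidyr::separate_longer_delim(
    external_calls,
    ","
  ) |>
  # In at least one case, an extra "L:" prefix was picked up from a 1:10-style
  # range.
  dplyr::mutate(
    external_calls = stringr::str_remove(external_calls, "^L:")
  ) |> 
  # Split each entry into its three colon-separated fields.
  tidyr::separate_wider_delim(
    external_calls,
    ":",
    names = c("package_to", "n_total", "n_unique")
  )

# The raw column is no longer needed in the package-level table.
cran_20221122$external_calls <- NULL

# Calls to other packages.
external_calls <- calls_20221122 |> 
  dplyr::filter(package_from != package_to)
# Calls recorded against the package itself.
internal_calls <- calls_20221122 |> 
  dplyr::filter(package_from == package_to) |> 
  dplyr::select(package = package_from, n_total, n_unique)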

readr::write_csv(
  cran_20221122,
  fs::path(working_dir, "cran_20221122.csv")
)
readr::write_csv(
  external_calls,
  fs::path(working_dir, "external_calls.csv")
)
readr::write_csv(
  internal_calls,
  fs::path(working_dir, "internal_calls.csv")
)
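
To make the splitting step easier to follow, here is a minimal base R sketch of the same idea on made-up input. It is not the script above, only an approximation under the assumption (taken from the script's delimiters) that each entry has the form `package_to:n_total:n_unique` and that entries are separated by commas. The data frame `toy` and its values are invented.

```r
# Toy input mimicking the raw external_calls column; values are invented.
toy <- data.frame(
  package = c("pkgA", "pkgB"),
  external_calls = c("base:40:10,pkgA:30:10,stats:5:2", "L:base:12:6"),
  stringsAsFactors = FALSE
)

# Step 1: one row per comma-separated entry, dropping any stray "L:" prefix.
entries <- strsplit(toy$external_calls, ",", fixed = TRUE)
calls <- data.frame(
  package_from = rep(toy$package, lengths(entries)),
  entry = sub("^L:", "", unlist(entries)),
  stringsAsFactors = FALSE
)

# Step 2: split each entry into its three colon-separated fields.
parts <- do.call(rbind, strsplit(calls$entry, ":", fixed = TRUE))
calls$package_to <- parts[, 1]
calls$n_total <- as.numeric(parts[, 2])
calls$n_unique <- as.numeric(parts[, 3])
calls$entry <- NULL

# Step 3: separate self-references from calls to other packages.
internal <- calls[calls$package_from == calls$package_to, ]
external <- calls[calls$package_from != calls$package_to, ]
print(internal)
print(external)
```

In this toy input, the `pkgA:30:10` entry ends up in `internal` and the remaining three entries in `external`.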