
HedvigS/rgrambank


What this is

This repository contains a set of R functions that are useful for analysing Grambank data and other CLDF-datasets. Most of the functions are adapted from the code behind the Grambank release paper of 2023. The code of the paper was also published as grambank-analysed on Zenodo and GitHub. Part of that code has been rewritten here to produce more general functions that can easily be applied to future Grambank releases and other CLDF-datasets.

Installing package

The R-functions of this repository can be accessed as an R-package. The package is not available via CRAN; instead, you can install it directly from GitHub. The packages remotes and devtools both contain functions for installing packages from GitHub.

library(remotes)
remotes::install_github("HedvigS/rgrambank")
library(rgrambank)

Versioning

The content here will be continuously updated and periodically released with version tags. Git allows you to access the state of the repository at a particular time via commit hashes or tags. This can be used when cloning, when accessing content via URLs and when installing the package within R. We strongly encourage you to keep track of which version you use, as this makes it easier to identify issues later.

library(remotes)
remotes::install_github("HedvigS/rgrambank", ref = "v1.0")
library(rgrambank)

Structure of content

Within this repository, functions are found in the directory R and examples in example_scripts. The directory example_scripts contains R-scripts which illustrate specific functions. For example, the script example_scripts/binarise.R showcases the functions rgrambank::make_binary_ParameterTable and rgrambank::make_binary_ValueTable. This README contains a list of all the functions, linked to example scripts and with details on who wrote each function and who reviewed it. In order to run the example scripts you need to set your working directory to example_scripts, as this is how the file-paths to R and fixed are set up. The example scripts also rely on the package rcldf by Simon Greenhill for fetching Grambank and Glottobank-datasets.

Detailed descriptions of the functions' parameters and behaviour can be found in their respective scripts in the directory R or accessed via the help-pages once the package is installed.

Who did what

The entire set of code of the Grambank release-paper and grambank-analysed was primarily written by Simon Greenhill, Sam Passmore, Hedvig Skirgård, Damián Blasi, Russell Dinnage, Hannah Haynie, Angela Chira and Luke Maurits. The functions here, in rgrambank, are primarily written by Simon Greenhill and Hedvig Skirgård. Specific author(s) is/are specified for each function.

Review

The functions in this repository go through internal peer-review within the Department of Cultural and Linguistic Evolution at the Max Planck Institute for Evolutionary Anthropology. The table below tracks which functions have been reviewed and by whom.

Functions

| Function | Short description | Example scripts | Function author(s) | Review pull request | Reviewer |
| --- | --- | --- | --- | --- | --- |
| make_binary_ParameterTable.R | Takes the Grambank ParameterTable and adds binarised features for the multistate features. | example_scripts/binarise.R | Hedvig Skirgård | PR 7 | Olena Shcherbakova |
| make_binary_ValueTable.R | Takes the Grambank ValueTable and transforms multistate feature values into their binarised counterparts appropriately. | example_scripts/binarise.R | Hedvig Skirgård | PR 7 | Olena Shcherbakova |
| make_theo_scores.R | Calculates metrics per language based on theoretical linguistics: fusion, informativity, gender/noun class, flexivity, locus of marking and word order. For more details, see the supplementary material of the Grambank release paper (2023). | compare_new_old_theo_scores, example_make_theo_scores | Hedvig Skirgård, Hannah Haynie and Olena Shcherbakova | PR 7 | Olena Shcherbakova |
| varcov.spatial.3D.R | Adjusted function based on geoR::varcov.spatial. If given Longitude and Latitude, it computes haversine distances that take into account the curvature of the earth and handles the antimeridian correctly (unlike geoR::varcov.spatial). | example_scripts/example_varcov.spatial.3D.R | Original function: Paulo J. Ribeiro Jr. and Peter J. Diggle. Update: Hedvig Skirgård and Sam Passmore. | PR 12 | Angela Chira |
| reduce_ValueTable_to_unique_glottocodes.R | Removes duplicate glottocodes in the ValueTable for the same Parameter. Option for merging dialects into one entry. Read the specification of the merging method closely. | example_scripts/example_reduce_ValueTable_to_unique_glottocodes, example_make_theo_scores | Hedvig Skirgård | PR 13 | Stephen Mann |
| drop_duplicate_glottocode_tips.R | Drops, at random, tips of a tree which are mapped to the same glottocode. Option for merging dialects to one tip (i.e. dropping all dialects but one). | example_scripts/example_drop_duplicate_glottocode_tips | Hedvig Skirgård | PR 13 | Stephen Mann |
| crop_missing_data.R | Takes a CLDF ValueTable and removes parameters and languages with lots of missing data. The cut-offs are defined by the missing data in the full dataset and can be set to any value between 0 and 1. The pruning is not stepwise, i.e. it is not the case that parameters are pruned first and then languages based on the missingness after the first pruning. This can be a practical step before imputation as it reduces the amount of missing data to be imputed. For more advanced approaches, please see annagraff/densify. | example_script_worldmap_rgb.R | Hedvig Skirgård | PR 17 | Enock Appiah Tieku |
| match_to_rgb | Takes a data-frame or matrix and maps three numeric columns to colors using RGB (RedGreenBlue). | example_script_worldmap_rgb.R | Hedvig Skirgård and Damián Blasi | PR 17 | Enock Appiah Tieku |
| basemap_pacific_center | Generates a base-layer in ggplot with a Pacific-centered worldmap with a van der Grinten projection. The function outputs both a basemap ggplot layer and a data-frame which is the combination of the two input data-frames (LongLatTable and DataTable) with adjusted longitude to match the basemap layer. | example_script_worldmap_rgb.R | Hedvig Skirgård | PR 17 | Enock Appiah Tieku |
| combine_ValueTable_LanguageTable.R | Combines a CLDF ValueTable and LanguageTable in a practical manner useful for many Grambank analysis purposes. | example_enrich_language_table.R | Hedvig Skirgård | PR 17 | Enock Appiah Tieku |
| add_family_name_column.R | Adds the column "Family_name" to a LanguageTable, using Family_ID and Name. | example_enrich_language_table.R | Hedvig Skirgård | PR 17 | Enock Appiah Tieku |
| add_isolate_info.R | Marks dialects of isolates as isolates as well in the column "Is_Isolate" and fills in Family_ID. | example_enrich_language_table.R | Hedvig Skirgård | PR 17 | Enock Appiah Tieku |
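
As a rough illustration of what binarising a multistate feature involves, the base-R sketch below splits one toy feature into two binary ones. It assumes a coding where 1 means the first option, 2 the second and 3 both, and that "?" and NA stay missing. The function name binarise_sketch and the output columns a and b are hypothetical and are not part of rgrambank; the package functions operate on CLDF tables, so see example_scripts/binarise.R for real usage.

```r
# Hypothetical helper, not the rgrambank implementation.
# Assumed coding of the multistate value: "1" = option A only,
# "2" = option B only, "3" = both, anything else = missing.
binarise_sketch <- function(value) {
  a <- rep(NA_integer_, length(value))
  b <- rep(NA_integer_, length(value))
  a[value %in% c("1", "3")] <- 1L  # option A is attested
  a[value %in% "2"] <- 0L          # only option B is attested
  b[value %in% c("2", "3")] <- 1L  # option B is attested
  b[value %in% "1"] <- 0L          # only option A is attested
  data.frame(Value = value, a = a, b = b)
}

toy <- binarise_sketch(c("1", "2", "3", "?", NA))
print(toy)
```

Values outside the assumed coding are deliberately left as NA in both binary columns rather than being guessed.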

Differences between grambank/grambank-analysed and HedvigS/rgrambank

There are a few minor differences between the code in grambank/grambank-analysed (Grambank release paper of 2023) and the functions here in HedvigS/rgrambank. They are all listed here:

Theoretical scores

This difference concerns the treatment of missing data in the calculation of the theoretical scores. In grambank/grambank-analysed we subsetted the entire dataset, pruning away features and languages with large amounts of missing data while considering all features and languages at once. We then used this subset in several parts of the analysis, including the PCA and the calculation of theoretical scores. The function make_theo_scores here (R_scripts/make_theo_scores.R) instead prunes for missing data with respect to the specific features involved in each of the theoretical scores. Furthermore, the function allows users to set a different cut-off (default = 0.75). The difference is very small in practice. Below are two scatterplots of two central theoretical scores, Fusion and Informativity. In each plot, the x-axis represents the newer way of computing the score (as in HedvigS/rgrambank) and the y-axis the older (grambank/grambank-analysed).

In addition, please note that there is a small difference in how the fusion score is calculated in Skirgård et al. (2023), "Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss", compared to Shcherbakova et al. (2023), "Societies of strangers do not speak less complex languages". The features of Grambank are tagged as follows for Fusion: 1 (bound marking), 0.5 (bound marking/clitics/other morphology) and 0 (free marking). Neither of the two papers uses the features tagged 0 (which could be done by reversing them, letting more free-standing marking lead to a lower fusion score). Skirgård et al. (2023) uses the features tagged 1 and 0.5 to construct the fusion score, whereas Shcherbakova et al. (2023) uses only the features tagged 1. There are only 6 features tagged 0.5 and the difference in the overall scores is small.

The function make_theo_scores from HedvigS/rgrambank lets you choose how to compute the fusion score, with options for counting only 1, counting both 1 and 0.5, and counting 1 and 0.5 while letting 0 contribute negatively as well. Below is a comparison of the options counting only 1 and counting 1 and 0.5. The Pearson-correlation is 1.

For more details on the theoretical scores, see the supplementary material to the Grambank release paper of 2023.

Spatial Variance-Co-Variance calculations

In order to model spatial auto-correlation in regression models it is necessary to compute a variance-co-variance matrix (vcv) of the data points. This can be done using a Matérn decay function with the function varcov.spatial in the R-package geoR. However, the package geoR can be difficult to install and, in addition, the function uses stats::dist to compute distances, which is inappropriate for geographic points (see reasoning here). The code at grambank/grambank-analysed uses geoR::varcov.spatial as is, copying the source code over into a separate script in order to avoid the installation problems. The issue with the underlying distances was discovered later. The impact on the analysis was negligible, but all the same we have created an updated version of geoR::varcov.spatial in this repository which uses fields::rdist.earth for computing the distances given longitude and latitude. Unlike stats::dist, fields::rdist.earth correctly takes into account the antimeridian and the curvature of the earth. The new function is called varcov.spatial.3D and is almost identical to the original function created for geoR by Paulo J. Ribeiro Jr. and Peter J. Diggle, which was used in the Grambank release paper. The only difference is that stats::dist has been replaced by fields::rdist.earth.
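
To see why Euclidean distances on raw coordinates are a problem, the base-R sketch below compares stats::dist with a hand-written haversine distance for two points on either side of the antimeridian. The helper haversine_km and the mean earth radius of 6371 km are assumptions made for this illustration; this is not the code used in varcov.spatial.3D, which relies on fields::rdist.earth.

```r
# Hypothetical helper for illustration only: great-circle distance in km
# using the haversine formula, assuming a spherical earth (radius 6371 km).
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# Two points on the equator, two degrees apart across the antimeridian
coords <- data.frame(Longitude = c(179, -179), Latitude = c(0, 0))

# Euclidean distance on the raw degrees treats them as 358 units apart
euclid <- as.numeric(dist(coords))

# The great-circle distance is only a couple of hundred kilometres
great_circle <- haversine_km(179, 0, -179, 0)

print(c(euclid = euclid, great_circle_km = great_circle))
```

The Euclidean result is also in degrees rather than a physical unit, so it understates distances near the poles as well as overstating them across the antimeridian.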

Tip

Good to know: you can give geoR::varcov.spatial or varcov.spatial.3D distances directly, instead of coordinates. You can for example calculate distances over cost-surfaces and create a spatial vcv from them. The difference discussed above only applies when you give the function spatial coordinates and it computes the distances for you.
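
The idea of building a vcv from a precomputed distance matrix can be sketched in base R as follows. The function matern_vcv_sketch, its arguments and the toy distances are all hypothetical; it uses the standard Matérn correlation with range phi and smoothness kappa (which reduces to exp(-d/phi) when kappa = 0.5) and ignores nugget effects. It is not a substitute for geoR::varcov.spatial or varcov.spatial.3D.

```r
# Hypothetical sketch: Matern variance-covariance matrix from distances.
# d: symmetric distance matrix, sigma2: partial sill, phi: range,
# kappa: smoothness. No nugget is added.
matern_vcv_sketch <- function(d, sigma2 = 1, phi = 1, kappa = 0.5) {
  # Correlation is 1 at distance 0, so start from a matrix of ones
  rho <- matrix(1, nrow = nrow(d), ncol = ncol(d), dimnames = dimnames(d))
  pos <- d > 0
  u <- d[pos] / phi
  # Matern correlation, evaluated only for positive distances
  rho[pos] <- u^kappa * besselK(u, kappa) / (2^(kappa - 1) * gamma(kappa))
  sigma2 * rho
}

# Toy distances (e.g. in km) between three sites; these could just as
# well come from a cost-surface calculation
sites <- c("A", "B", "C")
dists_km <- matrix(c(  0, 100, 300,
                     100,   0, 250,
                     300, 250,   0),
                   nrow = 3, byrow = TRUE, dimnames = list(sites, sites))

vcv <- matern_vcv_sketch(dists_km, sigma2 = 1, phi = 200, kappa = 0.5)
print(round(vcv, 3))
```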

Cropping missing data

The function rgrambank::crop_missing_data differs from the script impute_missing_values.R (link) in how the cut-off is expressed. Here you specify how much non-missing data what remains must have (e.g. 0.75 = only features/languages with 75% data or more remain), whereas the release paper script is based on specifying an upper limit for missing data (0.25 = only languages with 25% missing data or less remain).
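
The two conventions, and the non-stepwise pruning, can be sketched in base R on a toy language-by-feature matrix. The function crop_sketch and its argument names are hypothetical, and a wide matrix is used for brevity; the real function takes a CLDF ValueTable, so consult its help page for the actual interface.

```r
# Toy data: rows are languages, columns are features, NA = missing
values <- matrix(
  c( 1,  0, NA, 1,
     1, NA, NA, 0,
     0,  1,  1, 1,
    NA, NA, NA, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("lang", 1:4), paste0("feat", 1:4))
)

# Hypothetical sketch of non-stepwise cropping: both coverage vectors are
# computed on the full matrix before anything is removed.
crop_sketch <- function(m, cutoff_lang = 0.75, cutoff_feat = 0.75) {
  lang_coverage <- rowMeans(!is.na(m))  # proportion of non-missing per language
  feat_coverage <- colMeans(!is.na(m))  # proportion of non-missing per feature
  m[lang_coverage >= cutoff_lang, feat_coverage >= cutoff_feat, drop = FALSE]
}

# An upper limit on missing data converts to a lower limit on coverage
max_missing <- 0.25
min_coverage <- 1 - max_missing

cropped <- crop_sketch(values, cutoff_lang = min_coverage,
                       cutoff_feat = min_coverage)
print(cropped)
```

With these toy values, lang1 and lang3 and feat1 and feat4 remain, because each has at least 75% non-missing data in the full matrix.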
