version 0.3.5
michalovadek authored Mar 15, 2021
1 parent 319de1d commit 9ee0eeb
Showing 26 changed files with 492 additions and 314 deletions.
3 changes: 1 addition & 2 deletions DESCRIPTION
@@ -32,7 +32,6 @@ Suggests:
tidytext,
wordcloud,
purrr,
ggplot2,
glue
ggplot2
URL: https://michalovadek.github.io/eurlex/
VignetteBuilder: knitr
2 changes: 1 addition & 1 deletion NEWS.md
@@ -4,7 +4,7 @@

- it is now possible to select all resource types available with `elx_make_query(resource_type = "any")`. Since there are nearly 1 million CELEX codes, use with discretion and expect long execution times
- results can be restricted to a particular directory code with `elx_make_query(directory = "18")` (directory code "18" denotes Common Foreign and Security Policy)
- results can be restricted to a particular sector with `elx_make_query(sector = 2)` (sector code 3 denotes EU international agreements)
- results can be restricted to a particular sector with `elx_make_query(sector = 2)` (sector code 2 denotes EU international agreements)

## Minor changes

5 changes: 3 additions & 2 deletions R/elx_make_query.R
@@ -1,8 +1,8 @@
#' Create SPARQL quries
#' Create SPARQL queries
#'
#' Generates pre-defined or manual SPARQL queries to retrieve document ids from Cellar.
#' List of available resource types: http://publications.europa.eu/resource/authority/resource-type .
#' Note that not all resource types are compatible with the pre-defined query.
#' Note that not all resource types are compatible with default parameter values.
#'
#' @importFrom magrittr %>%
#'
@@ -46,6 +46,7 @@ elx_make_query <- function(resource_type = c("directive","regulation","decision"
include_directory = FALSE, include_sector = FALSE,
order = FALSE, limit = NULL){

if (missing(resource_type)) stop("'resource_type' must be defined")
if (!resource_type %in% c("any","directive","regulation","decision","recommendation","intagr","caselaw","manual","proposal","national_impl")) stop("'resource_type' must be defined")

if (resource_type == "manual" & nchar(manual_type) < 2){
17 changes: 15 additions & 2 deletions README.md
@@ -26,11 +26,11 @@ For the moment, it is recommended to retrieve metadata one variable at a time. F
2. `dates <- elx_make_query("directive", include_date_transpos = TRUE) %>% elx_run_query()`
3. `ids %>% dplyr::left_join(lbs) %>% dplyr::left_join(dates)`

rather than `elx_make_query("directive", include_lbs = TRUE, include_date_transpos = TRUE)`. This approach should make it easier to understand the returned data frame(s), especially when some variables contain missing or duplicated data.
rather than `elx_make_query("directive", include_lbs = TRUE, include_date_transpos = TRUE)`. This approach should make it easier to understand the returned data frame(s), especially when some variables contain missing or duplicated data. Always keep an eye on whether the `work` and `celex` columns identify rows uniquely or not.
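
The uniqueness check recommended above can be sketched as follows. This is a minimal illustration on made-up data, not output from Cellar: the `work`, `celex` and `lbs` values and the helper `is_unique_key()` are hypothetical. It uses base R so that it runs without extra packages; `merge(..., all.x = TRUE)` plays the role of `dplyr::left_join()` here.

```r
# Toy stand-ins for the data frames returned by elx_run_query();
# all identifiers and legal-basis values below are invented.
ids <- data.frame(work  = c("w1", "w2", "w3"),
                  celex = c("32001L0001", "32001L0002", "32001L0003"),
                  stringsAsFactors = FALSE)

lbs <- data.frame(work  = c("w1", "w1", "w2"),
                  celex = c("32001L0001", "32001L0001", "32001L0002"),
                  lbs   = c("lb_a", "lb_b", "lb_a"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when the key columns identify each row uniquely
is_unique_key <- function(df, keys = c("work", "celex")) {
  anyDuplicated(df[, keys, drop = FALSE]) == 0
}

# Left join: keep every row of ids, add matching rows of lbs
joined <- merge(ids, lbs, by = c("work", "celex"), all.x = TRUE)

is_unique_key(ids)     # keys are unique before the join
is_unique_key(joined)  # a document with two legal bases now occupies two rows
nrow(joined)
```

In this toy case the join both duplicates a document (two legal bases) and leaves one document with a missing `lbs` value, which are the two situations the paragraph above warns about.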

One of the main contributions of the SPARQL requests is that we obtain a comprehensive list of identifiers that we can subsequently use to obtain more data relating to the document in question. While the results of the SPARQL queries are useful also for webscraping (with the `rvest` package), the function `elx_fetch_data()` enables us to fire GET requests to retrieve data on documents with known identifiers (including Cellar URI). The function currently enables downloading the title and the full text of a document in all available languages.

See the [vignette](https://michalovadek.github.io/eurlex/articles/eurlexpkg.html) for a walkthrough on how to use the package. Check function documentation for most up-to-date overview of features.
See the [vignette](https://michalovadek.github.io/eurlex/articles/eurlexpkg.html) for a walkthrough on how to use the package. Check function documentation for most up-to-date overview of features. Example use cases are shown in this [paper](https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150).

## Cite
Michal Ovádek (2021) Facilitating access to data on European Union laws, Political Research Exchange, 3:1, DOI: [10.1080/2474736X.2020.1870150](https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150)
@@ -40,6 +40,19 @@ This package nor its author are in any way affiliated with the EU Publications O

Please consider contributing to the maintanance and development of the package by reporting bugs or suggesting new features.

## Latest changes

### eurlex 0.3.5

- it is now possible to select all resource types available with `elx_make_query(resource_type = "any")`. Since there are nearly 1 million CELEX codes, use with discretion and expect long execution times
- results can be restricted to a particular directory code with `elx_make_query(directory = "18")` (directory code "18" denotes Common Foreign and Security Policy)
- results can be restricted to a particular sector with `elx_make_query(sector = 2)` (sector code 2 denotes EU international agreements)

- new feature: request date of court case submission `elx_make_query(include_date_lodged = TRUE)`
- new feature: request type of court procedure and outcome `elx_make_query(include_court_procedure = TRUE)`
- new feature: request directory code of legal act `elx_make_query(include_directory = TRUE)`
- `elx_curia_list()` has a new default parameter `parse = TRUE` which creates separate columns for `ecli`, `see_case`, `appeal` applying regular expressions on `case_info`

## Useful resources
Guide to CELEX numbers: https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html

28 changes: 17 additions & 11 deletions doc/eurlexpkg.R
@@ -17,27 +17,36 @@ results <- dirs %>% select(-force,-date)

## -----------------------------------------------------------------------------
query_dir %>%
glue::as_glue() # for nicer printing
cat() # for nicer printing

elx_make_query(resource_type = "caselaw") %>%
glue::as_glue()
cat()

elx_make_query(resource_type = "manual", manual_type = "SWD") %>%
glue::as_glue()
cat()


## -----------------------------------------------------------------------------
elx_make_query(resource_type = "directive", include_date = TRUE, include_force = TRUE) %>%
glue::as_glue()
cat()

# minimal query: elx_make_query(resource_type = "directive")

elx_make_query(resource_type = "recommendation", include_date = TRUE, include_lbs = TRUE) %>%
glue::as_glue()
cat()

# minimal query: elx_make_query(resource_type = "recommendation")


## -----------------------------------------------------------------------------
# request documents from directory 18 ("Common Foreign and Security Policy")
# and sector 3 ("Legal acts")

elx_make_query(resource_type = "any",
directory = "18",
sector = 3) %>%
cat()

## ----runquery, eval=FALSE-----------------------------------------------------
# results <- elx_run_query(query = query_dir)
#
@@ -65,18 +74,14 @@ rec_eurovoc %>%


## ----eurovoctable-------------------------------------------------------------

eurovoc_lookup <- elx_label_eurovoc(uri_eurovoc = rec_eurovoc$eurovoc)

print(eurovoc_lookup)


## ----appendlabs---------------------------------------------------------------

rec_eurovoc %>%
left_join(eurovoc_lookup)


## -----------------------------------------------------------------------------
eurovoc_lookup <- elx_label_eurovoc(uri_eurovoc = rec_eurovoc$eurovoc,
alt_labels = TRUE,
@@ -86,7 +91,6 @@ rec_eurovoc %>%
left_join(eurovoc_lookup) %>%
select(celex, eurovoc, labels)


## ----getdatapur, message = FALSE, warning=FALSE, error=FALSE------------------
# the function is not vectorized by default
elx_fetch_data(results$work[1],"title")
@@ -117,7 +121,9 @@ dirs %>%

## -----------------------------------------------------------------------------
dirs %>%
ggplot(aes(x = as.Date(date), y = celex)) +
filter(!is.na(force)) %>%
mutate(date = as.Date(date)) %>%
ggplot(aes(x = date, y = celex)) +
geom_point(aes(color = force), alpha = 0.1) +
theme(axis.text.y = element_blank(),
axis.line.y = element_blank(),
41 changes: 26 additions & 15 deletions doc/eurlexpkg.Rmd
@@ -2,7 +2,7 @@
title: "eurlex: Retrieve data on European Union law in R"
output: rmarkdown::html_vignette
description: >
Retrieve efficiently tidy data on European Union law in R with
Retrieve data on European Union law in R with
pre-defined SPARQL and REST queries.
vignette: >
%\VignetteIndexEntry{eurlex: Retrieve data on European Union law in R}
@@ -29,6 +29,8 @@ The `eurlex` R package attempts to significantly reduce the overhead associated

The `eurlex` package currently envisions the typical use-case to consist of getting bulk information about EU legislation into R as fast as possible. The package contains three core functions to achieve that objective: `elx_make_query()` to create pre-defined or customized SPARQL queries; `elx_run_query()` to execute the pre-made or any other manually input query; and `elx_fetch_data()` to fire GET requests for certain metadata to the REST API.

The package also contains largely self-explanatory functions for retrieving data on EU court cases (`elx_curia_list()`) and Council votes (`elx_council_votes()`) from outside Eur-Lex.

## `elx_make_query()`: Generate SPARQL queries

The function `elx_make_query` takes as its first argument the type of resource to be retrieved from the semantic database that powers Eur-Lex (and other publications) called Cellar.
@@ -55,35 +57,47 @@ The choice of resource type is then reflected in the SPARQL query generated by t

```{r}
query_dir %>%
glue::as_glue() # for nicer printing
cat() # for nicer printing
elx_make_query(resource_type = "caselaw") %>%
glue::as_glue()
cat()
elx_make_query(resource_type = "manual", manual_type = "SWD") %>%
glue::as_glue()
cat()
```

There are various ways of querying the same information in the Cellar database due to the existence of several overlapping classes and identifiers describing the same resources. The queries generated by the function should offer a reliable way of obtaining exhaustive results, as they have been validated by the helpdesk of the Publication Office. At the same time, it is always possible there will be issues either on the query or the database side; please report any you encounter through Github.

The other arguments in `elx_make_query()` relate to additional metadata to be returned. The results include by default the [CELEX number](https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html) and exclude corrigenda (corrections of errors in legislation). Other data needs to be opted into. Make sure to select ones that are logically compatible (e.g. case law does not have a legal basis). More options should be added in the future.

Note that availability of data for each variable has an impact on the results. The data frame returned by the query will be shrunken to the size of the variable with most missing data. It is recommended to always compare results from a desired query to a minimal query requesting only celex ids.
Note that availability of data for each variable might have an impact on the results. The data frame returned by the query might be shrunken to the size of the variable with most missing data. It is recommended to always compare results from a desired query to a minimal query requesting only celex ids.
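
A rough sketch of the comparison recommended above, assuming two result data frames that each carry a `celex` column. The data are invented placeholders; in practice `minimal` would come from something like `elx_make_query(resource_type = "directive") %>% elx_run_query()` and `full` from the query with extra variables switched on.

```r
# Invented placeholder results; real ones come from elx_run_query()
minimal <- data.frame(celex = c("c1", "c2", "c3", "c4", "c5"),
                      stringsAsFactors = FALSE)

full <- data.frame(celex = c("c1", "c2", "c4"),
                   date  = c("1970-01-01", "1971-02-03", "1975-06-30"),
                   stringsAsFactors = FALSE)

# CELEX ids present in the minimal query but absent from the fuller one
dropped <- setdiff(minimal$celex, full$celex)

# Share of documents that survive when the extra variable is requested
share_kept <- length(unique(full$celex)) / length(unique(minimal$celex))

dropped
share_kept
```

If `dropped` is non-empty, the extra variable is missing for those documents, and joining separately requested variables onto the minimal result is the safer route.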

```{r}
elx_make_query(resource_type = "directive", include_date = TRUE, include_force = TRUE) %>%
glue::as_glue()
cat()
# minimal query: elx_make_query(resource_type = "directive")
elx_make_query(resource_type = "recommendation", include_date = TRUE, include_lbs = TRUE) %>%
glue::as_glue()
cat()
# minimal query: elx_make_query(resource_type = "recommendation")
```

You can also decide to not specify any resource types, in which case all types of documents will be returned. As there are over a million documents with a CELEX identifier, this is likely not efficient for a majority of users. But since version 0.3.5 it is possible to request documents belonging to a particular ["sector"](https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html) or [directory code](https://eur-lex.europa.eu/browse/directories/legislation.html).

```{r}
# request documents from directory 18 ("Common Foreign and Security Policy")
# and sector 3 ("Legal acts")
elx_make_query(resource_type = "any",
directory = "18",
sector = 3) %>%
cat()
```

Now that we have a query, we are ready to run it.

## `elx_run_query()`: Execute SPARQL queries
@@ -135,20 +149,16 @@ rec_eurovoc %>%
By default, the endpoint returns the EuroVoc concept codes rather than the labels (keywords). The function `elx_label_eurovoc()` needs to be called to obtain a look-up table with the labels.

```{r eurovoctable}
eurovoc_lookup <- elx_label_eurovoc(uri_eurovoc = rec_eurovoc$eurovoc)
print(eurovoc_lookup)
```

The results include labels only for unique identifiers, but with `dplyr::left_join()` it is straightforward to append the labels to the entire dataset.

```{r appendlabs}
rec_eurovoc %>%
left_join(eurovoc_lookup)
```

As elsewhere in the API, we can tap into the multilingual nature of EU documents also when it comes to the EuroVoc keywords. Moreover, most concepts in the thesaurus are associated with alternative labels; these can be returned as well (separated by a comma).
@@ -161,7 +171,6 @@ eurovoc_lookup <- elx_label_eurovoc(uri_eurovoc = rec_eurovoc$eurovoc,
rec_eurovoc %>%
left_join(eurovoc_lookup) %>%
select(celex, eurovoc, labels)
```

## `elx_fetch_data()`: Fire GET requests
@@ -186,7 +195,7 @@ print(dir_titles)
```

Note that text requests are by far the most time-intensive; requesting the full text for thousands of documents is liable to extend the run-time into hours. Currently, no method for downloading text in non-html/plain formats is implemented, which means pdf-only texts will be missing from the results.^[It is worth pointing out that the html and pdf contents of older case law differs. Whereas typically the html file is only going to contain a summary and grounds of a judgment, the pdf should also contain background to the dispute.]
Note that text requests are by far the most time-intensive; requesting the full text for thousands of documents is liable to extend the run-time into hours. Texts are retrieved from html by priority, but methods for pdfs and .docs are also implemented.^[It is worth pointing out that the html and pdf contents of older case law differs. Whereas typically the html file is only going to contain a summary and grounds of a judgment, the pdf should also contain background to the dispute.] The function even handles multi-document resources (by pasting them together).

# Application

@@ -213,7 +222,9 @@ Directives become naturally outdated with time. It might be all the more interes

```{r}
dirs %>%
ggplot(aes(x = as.Date(date), y = celex)) +
filter(!is.na(force)) %>%
mutate(date = as.Date(date)) %>%
ggplot(aes(x = date, y = celex)) +
geom_point(aes(color = force), alpha = 0.1) +
theme(axis.text.y = element_blank(),
axis.line.y = element_blank(),
@@ -251,6 +262,6 @@ dirs_1970_title %>%

I use term-frequency inverse-document frequency (tf-idf) to weight the importance of the words in the wordcloud. If we used pure frequencies, the wordcloud would largely consist of words conveying little meaning ("the", "and", ...).

This is an extremely basic application of the `eurlex` package. Much more sophisticated methods can be used to analyse both the content and metadata of European Union legislation. If the package is useful for your research, please consider citing it.
This is an extremely basic application of the `eurlex` package. Much more sophisticated methods can be used to analyse both the content and metadata of European Union legislation. If the package is useful for your research, please consider citing the [accompanying paper](https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150).^[Michal Ovádek (2021) Facilitating access to data on European Union laws, Political Research Exchange, 3:1, DOI: [10.1080/2474736X.2020.1870150](https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150)]

