The CEPS EurLex dataset

The CEPS EurLex dataset: “142.036 EU laws from 1952-2019 with full text and 22 variables: The dataset contains 142.036 EU laws – almost the entire corpus of the EU’s digitally available legal acts passed between 1952 and 2019. It encompasses the three types of legally binding acts passed by the EU institutions: 102.304 regulations, 4.070 directives, and 35.798 decisions in the English language. The dataset was scraped from the official EU legal database (Eur-lex.eu) and transformed into machine-readable CSV format with the programming languages R and Python.
The dataset was collected by the Centre for European Policy Studies (CEPS) for the TRIGGER project (https://trigger-project.eu/). We hope that it will facilitate future quantitative and computational research on the EU.

Brief description:

  • The dataset is organised in tabular format, with each law representing one row and the columns representing 23 variables.
  • The full text of 134.633 laws is included (column “act_raw_text”). For newer laws, the text was scraped from the HTML pages on Eur-lex.eu, while for older laws, the text was extracted from (scanned) PDF documents (if available in English).
  • 22 additional variables are included, such as ‘Act_name’, ‘Act_type’, ‘Subject_matter’, ‘Authors’, ‘Date_document’, ‘ELI_link’, ‘CELEX’ (a unique identifier for every law). Please see the “CEPS_EurLex_codebook.pdf” file for an explanation of all variables.
  • Given its size, the dataset was uploaded in different batches to facilitate usage. Some Excel files are provided for non-technical users. We recommend, however, the use of the CSV files, since Excel does not save large amounts of data properly. EurLex_all.csv is the master file containing all data.”
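
For readers who want to try the CSV route recommended above, here is a minimal sketch in Python using only the standard library. It assumes the file has a header row with the column names quoted in the description ('CELEX', 'Act_type', 'act_raw_text'). The function name summarise_acts and the sample rows are hypothetical and purely illustrative; the actual values stored in 'Act_type' are documented in the codebook, not here.

```python
import csv
import io
from collections import Counter

# Full-text fields can be far larger than the csv module's default
# per-field limit (131072 characters), so raise the limit first.
csv.field_size_limit(2**31 - 1)


def summarise_acts(csv_file):
    """Return (total rows, rows with full text, Counter of rows per Act_type).

    csv_file is any open text file or file-like object with a header row.
    Column names are taken from the dataset description; this helper is
    a hypothetical illustration, not part of the dataset's tooling.
    """
    by_type = Counter()
    with_text = 0
    total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        by_type[row.get("Act_type") or "unknown"] += 1
        # Count a law as having full text only if the field is non-empty.
        if (row.get("act_raw_text") or "").strip():
            with_text += 1
    return total, with_text, by_type


# Tiny in-memory stand-in for EurLex_all.csv (made-up rows for illustration).
sample = io.StringIO(
    "CELEX,Act_type,act_raw_text\n"
    "X0001,Regulation,Some text\n"
    "X0002,Directive,\n"
    "X0003,Regulation,More text\n"
)
total, with_text, by_type = summarise_acts(sample)
print(total, with_text, dict(by_type))
```

To run this against the real master file, one would pass an opened file instead of the sample, for example open("EurLex_all.csv", newline="", encoding="utf-8"). The UTF-8 encoding is an assumption and may need adjusting. Reading row by row keeps memory use low, which matters given the size of the full-text column.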
