Materials and Methods
Methods
The study draws on quantitative methods from corpus linguistics and language acquisition research. It is also informed by terminology research in translation studies and in law and language, as well as by drafting and terminological practices.
Materials: Terms
The study uses two large datasets:
- Defined terms extracted from EU, UK and Irish legal acts (dataset 1)
- Terms in the English section of the EU’s IATE term base (dataset 2).
The former reflects the perspective of drafters, whereas the latter reflects the perspective of terminologists. Both datasets were uploaded to Sketch Engine for analysis.
- Dataset 1: Defined terms in EU, UK and Irish legal acts
Since automatic term recognition turned out to be ineffective, defined terms, a special category of terms, were extracted from 10-year corpora of legal acts covering the period 2010-2019. The focus corpus is the EU corpus of English-language directives and regulations. The reference corpora are the Irish corpus of public legal acts (IPA) and the UK corpus of Public General Acts (UKPGA), representing two major English-language jurisdictions in Europe: Ireland and the United Kingdom. The years 2010-2019 were chosen to ensure that both reference countries were part of the EU throughout the sampling period.
Details of the corpora.
Corpus | # of files | Words | Terms, tokens | Terms, types |
REG: REGULATIONS | 1040 | 9168665 | 7043 | 5218 |
DIR: DIRECTIVES | 156 | 2328195 | 2228 | 1575 |
UKPGA: UK Public General Acts | 306 | 3537394 | 7319 | 4205 |
IPA: Irish Public Acts | 244 | 3615306 | 7018 | 3879 |
The EU files were downloaded automatically from the EUR-Lex directory of legal acts. The search was limited to basic acts in force as at 2020 and excluded delegated and implementing acts. UK public general acts were downloaded from the UK statute database in the “Original as enacted” version. Irish public acts (Acts of the Oireachtas) were downloaded from the Irish Statute Book (eISB) website. All downloads excluded amending acts. To improve comparability between the corpora, the corpus files were annotated in Notepad++ to separate the normative parts (enacting terms) from other sections, i.e. citations, recitals and annexes in the EU corpus and schedules in the Irish and UK acts. The files were then uploaded to Sketch Engine, part-of-speech tagged and lemmatized. Defined terms were extracted from the corpora using wildcards and filtering for the lemma ‘mean’ within five positions to the right. The dataset thus covers the whole population of defined terms that meet these criteria. The terms were exported to an Excel sheet, manually verified and cleaned.
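The filtering step described above can be illustrated with a minimal Python sketch. It is not the query used in Sketch Engine, which is not reproduced here. It assumes that a defined term appears in quotation marks and that a form of the verb ‘mean’ occurs within five word positions to its right. The function name, the list of word forms and the sample text are hypothetical.

```python
import re

# Assumption: a defined term is a quoted expression, and a form of the
# lemma "mean" occurs within five word positions to its right.
QUOTED = re.compile(r'[‘“"]([^‘’“”"]+)[’”"]')
MEAN_FORMS = {"mean", "means", "meant"}
WINDOW = 5  # number of word positions inspected to the right

def extract_defined_terms(text):
    """Return quoted expressions followed by a form of 'mean' within WINDOW words."""
    terms = []
    for match in QUOTED.finditer(text):
        # Words that follow the closing quotation mark, cut to the window size.
        right_context = re.findall(r"[A-Za-z]+", text[match.end():])[:WINDOW]
        if any(word.lower() in MEAN_FORMS for word in right_context):
            terms.append(match.group(1).strip())
    return terms

# Hypothetical sample in the style of a definitions section.
sample = (
    "‘consumer’ means any natural person. "
    "“trader”, in this Act, means a person acting for business purposes. "
    "‘goods’, as that expression is used in commercial practice, means tangible items."
)
terms = extract_defined_terms(sample)
print(terms)
```

In this sample the third definition is missed because ‘means’ falls outside the five-word window. A fixed window therefore trades recall for precision, which is one reason why manual verification of the extracted list remains necessary.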
- Dataset 2: IATE datasets
IATE (Interactive Terminology for Europe, https://iate.europa.eu/home), the multilingual interinstitutional terminology database of the European institutions, is one of the world’s largest term bases and covers the 24 official EU languages. The terms were exported from the English section of the publicly available version of IATE, copied to Excel and semi-automatically cleaned. This resulted in the IATE FULL dataset, composed of 668,980 terms and 541,163 entries.
IATE datasets
| IATE FULL | IATE EU | IATE EU LAW | IATE LAW | IATE PRIMARY |
English terms | 668980 | 40764 | 3326 | 45616 | 111611 |
English entries | 541163 | 31115 | 2333 | 36659 | 71345 |
In search of prototypes, four smaller, more controlled datasets were then extracted, ranging from 0.5% to 11% of the full dataset: IATE EU, IATE EU LAW and IATE LAW, based on thematic criteria, and IATE PRIMARY, based on a qualitative criterion. For the thematic exports, EuroVoc thematic domains were used: IATE EU → 10 European Union; IATE EU LAW → 1011 European Union Law; and IATE LAW → 12 Law. These subsets reflect EU terminologists’ perception of which terms qualify as EU terms, EU legal terms and legal terms, respectively. IATE PRIMARY comprises entries marked as ‘primary’, that is, those which meet minimum quality standards of information content (about 12% of IATE entries). Primary entries therefore offer a higher-quality snapshot of IATE: entries considered important enough to be elaborated in more detail.
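The thematic selection can be sketched in Python as follows. This is only an illustration, not the IATE export procedure itself. It assumes that each entry carries one or more EuroVoc-based codes and that a sub-domain code begins with the code of its parent domain, so that 1011 falls under 10. The sample entries, their codes and the function name are hypothetical.

```python
# Hypothetical entries: (entry id, English term, EuroVoc-based codes).
entries = [
    (1, "directive", ["1011"]),
    (2, "European Council", ["1006"]),
    (3, "tort", ["1211"]),
    (4, "fishing quota", ["5641"]),
]

# Subset names and codes as given in the text.
SUBSET_CODES = {"IATE EU": "10", "IATE EU LAW": "1011", "IATE LAW": "12"}

def select_by_code(records, code):
    """Keep records with at least one code that starts with the given code.

    Assumption: codes are hierarchical, so prefix matching also captures
    every sub-domain of a top-level domain.
    """
    return [r for r in records if any(c.startswith(code) for c in r[2])]

subsets = {name: select_by_code(entries, code) for name, code in SUBSET_CODES.items()}

for name, records in subsets.items():
    print(name, [term for _, term, _ in records])
```

Under this assumption IATE EU LAW is a proper subset of IATE EU, whereas IATE LAW is selected independently of both.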