Materials and Methods

Methods

The study draws on quantitative methods from corpus linguistics and language acquisition research. It is also informed by terminology research in translation studies and in law and language, as well as by drafting and terminological practices.

Materials: Terms

The study uses two large datasets:

Defined terms extracted from EU, UK and Irish legal acts (dataset 1);
Terms in the English section of the EU’s IATE term base (dataset 2).

The former reflects the perspective of drafters, whereas the latter reflects that of terminologists. Both datasets were uploaded to Sketch Engine for analysis.

Dataset 1: Defined terms in EU, UK and Irish legal acts

Since automatic term recognition turned out to be ineffective, defined terms, a special category of terms, were extracted from 10-year corpora of legal acts covering the period from 2010 to 2019. The focus corpus is the EU corpus of English-language directives and regulations. The reference corpora are the Irish corpus of public legal acts (IPA) and the UK corpus of Public General Acts (UKPGA), representing two major English-language jurisdictions in Europe: Ireland and the United Kingdom. The years 2010-2019 were chosen to ensure that both reference countries were EU Member States throughout the sampling period.

Details of the corpora.

Corpus	Files	Words	Terms, tokens	Terms, types
REG: REGULATIONS	1040	9168665	7043	5218
DIR: DIRECTIVES	156	2328195	2228	1575
UKPGA: UK Public General Acts	306	3537394	7319	4205
IPA: Irish Public Acts	244	3615306	7018	3879

The EU files were downloaded automatically from the EUR-Lex directory of legal acts. The search was limited to basic acts in force as at 2020 and excluded delegated and implementing acts. UK Public General Acts were downloaded from the UK statute database in the “Original as enacted” version. Irish public acts (Acts of the Oireachtas) were downloaded from the Irish Statute Book (eISB) website. All downloads excluded amending acts.

To improve comparability between the corpora, the files were annotated in Notepad++ to separate the normative parts (enacting terms) from other sections, i.e. citations, recitals and annexes in the EU corpus and schedules in the Irish and UK acts. The files were then uploaded to Sketch Engine, part-of-speech tagged and lemmatized. Defined terms were extracted with wildcard queries and filtered by the lemma “mean” occurring up to the fifth position to the right. The dataset thus covers the whole population of defined terms that meet these criteria. The terms were exported to an Excel sheet, verified manually and cleaned.
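The extraction itself was carried out in Sketch Engine. The filtering logic can nevertheless be illustrated with a minimal Python sketch, which assumes that defined terms are enclosed in quotation marks and approximates the lemma “mean” with a fixed list of word forms. The function name, the regular expression and the sample sentence are illustrative assumptions and do not reproduce the queries used in the study.

```python
import re

# Word forms treated as the lemma "mean". This list is an assumption; in the
# study the lemma was assigned by the tagger in Sketch Engine.
MEAN_FORMS = {"mean", "means", "meant", "meaning"}

# Candidate term: a span enclosed in curly or straight double quotation marks,
# or in curly single quotation marks (assumed convention for defined terms).
QUOTED = re.compile(r'[“‘"]([^”’"]+)[”’"]')


def find_defined_terms(text, window=5):
    """Return quoted spans followed by a form of 'mean' within `window` tokens."""
    terms = []
    for match in QUOTED.finditer(text):
        # Word tokens to the right of the closing quotation mark.
        following = re.findall(r"\w+", text[match.end():])[:window]
        if any(token.lower() in MEAN_FORMS for token in following):
            terms.append(match.group(1).strip())
    return terms


sample = (
    "‘credit institution’ means an undertaking; “vehicle” shall mean a car; "
    "the ‘annex’ is attached to this Regulation hereto."
)
print(find_defined_terms(sample))  # ['credit institution', 'vehicle']
```

The window parameter mirrors the five-position limit: a definition such as “vehicle”, for the purposes of this Act, means a car would be missed with the default window because the verb occurs at the seventh position.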

Dataset 2: IATE datasets

IATE (Interactive Terminology for Europe, https://iate.europa.eu/home) is the multilingual interinstitutional terminology database of the European institutions and one of the world’s largest term bases, covering the 24 official EU languages. The terms were exported from the English section of the publicly available version of IATE, copied to Excel and semi-automatically cleaned. This resulted in the IATE FULL dataset, composed of 668,980 terms in 541,163 entries.

IATE datasets

	IATE FULL	IATE EU	IATE EU LAW	IATE LAW	IATE PRIMARY
English terms	668980	40764	3326	45616	111611
English entries	541163	31115	2333	36659	71345

In search of prototypes, four smaller, more controlled datasets were further extracted, ranging from 0.5% to 11% of the full dataset: IATE EU, IATE EU LAW and IATE LAW, based on thematic criteria, and IATE PRIMARY, based on a qualitative criterion. The thematic exports used EuroVoc thematic categories: IATE EU → 10 European Union; IATE EU LAW → 1011 European Union Law; and IATE LAW → 12 Law. These subsets reflect EU terminologists’ perception of which terms qualify as EU terms, EU legal terms and legal terms, respectively. IATE PRIMARY comprises entries marked as ‘primary’, that is, those which meet minimum quality standards of information content (about 12% of IATE entries). Primary entries therefore offer a higher-quality snapshot of IATE: entries considered important enough to be elaborated in more detail.
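The thematic selection can be sketched in Python as follows. The sketch assumes a hypothetical export in which each row holds an entry identifier, a term and a list of EuroVoc codes, and it assumes that a subset contains every row with at least one code falling under the subset's code (prefix matching). The row layout, the function names and the sample rows are invented for illustration; only the three code mappings come from the text above.

```python
# EuroVoc codes used for the three thematic subsets (taken from the text).
SUBSET_CODES = {
    "IATE EU": "10",        # 10 European Union
    "IATE EU LAW": "1011",  # 1011 European Union Law
    "IATE LAW": "12",       # 12 Law
}


def select_subset(rows, subset_name):
    """Keep rows with at least one code under the subset's EuroVoc code.

    Each row is a hypothetical (entry_id, term, codes) tuple. Prefix matching
    on the code is an assumption about how the hierarchy is applied.
    """
    prefix = SUBSET_CODES[subset_name]
    return [row for row in rows if any(code.startswith(prefix) for code in row[2])]


def count_terms_and_entries(rows):
    """Return (number of terms, number of distinct entries)."""
    return len(rows), len({entry_id for entry_id, _, _ in rows})


# Invented sample rows: entry 1 holds two synonymous terms.
rows = [
    (1, "acquis", ["1011"]),
    (1, "Community acquis", ["1011"]),
    (2, "European Council", ["1006"]),
    (3, "tort", ["1211"]),
    (4, "wheat", ["6006"]),
]

eu_rows = select_subset(rows, "IATE EU")
print(count_terms_and_entries(eu_rows))  # (3, 2)
```

Counting terms and entries separately reflects the structure of a term base, in which one entry can hold several synonymous terms, so the number of terms is at least as large as the number of entries.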