Which NLP corpus?

What is an NLP text corpus?

An NLP corpus (plural: “corpora”) is a collection of text organised into datasets. Corpora can be composed of newspapers, novels, social media posts, transcripts of interviews or television shows, or other authentic text. An NLP corpus is useful for corpus linguistics, computational linguistics, and natural language processing research. A corpus can be raw text as originally spoken or written, or it can include annotations, for example part-of-speech tags (noun, verb, adjective) or topic labels.
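To make the difference between raw and annotated text concrete, here is a minimal sketch in Python. The sentence, the tags (loosely following the Universal POS tag names), and the topic label are made up for illustration and are not taken from any corpus listed on this page.

```python
from collections import Counter

# A raw corpus entry: just the text as written.
raw_sentence = "The cat sat on the mat"

# The same sentence with hand-written, illustrative annotations:
# one (token, part-of-speech) pair per word, plus a topic label.
annotated_sentence = {
    "tokens": [
        ("The", "DET"),
        ("cat", "NOUN"),
        ("sat", "VERB"),
        ("on", "ADP"),
        ("the", "DET"),
        ("mat", "NOUN"),
    ],
    "topic": "animals",
}


def tag_counts(sentence):
    """Count how often each part-of-speech tag occurs in one annotated sentence."""
    return Counter(tag for _, tag in sentence["tokens"])


def words_with_tag(sentence, tag):
    """Return the tokens that carry the given part-of-speech tag."""
    return [token for token, t in sentence["tokens"] if t == tag]
```

The annotated form lets you ask questions that the raw string cannot answer directly, such as which words are nouns. Real annotated corpora usually ship in their own formats, so this dictionary layout is only one possible representation.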

List of multilingual text corpora for natural language processing

If you want to train your own large language model, develop a stylometry model, or simply hone your NLP skills, you will need an NLP corpus to work with.

Some of the best-known NLP corpora include:

The Open American National Corpus (OANC)
The British National Corpus (BNC)
Project Gutenberg
Text corpora on authorship attribution (University of Neuchatel, the CLC group)
The CLEF PAN corpora and tasks
The Federalist Papers: a series of 85 essays written by Alexander Hamilton, John Jay, and James Madison, some of which are of disputed authorship.

One easy way to load an NLP corpus in your code is with the Hugging Face Datasets package, which downloads corpora published on the Hugging Face Hub.
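As a rough sketch of what that can look like, the Python below wraps the package's `load_dataset` function and adds a simple size estimate. The helper names `iter_texts` and `count_tokens` are hypothetical, and the default column name `text` is an assumption that holds for many but not all datasets.

```python
def iter_texts(dataset_name, config_name=None, split="train", text_column="text"):
    """Yield the text of each record in a Hugging Face dataset.

    The import sits inside the function so that count_tokens() below can be
    used even when the third-party `datasets` package is not installed
    (pip install datasets). The first call needs network access.
    """
    from datasets import load_dataset

    dataset = load_dataset(dataset_name, config_name, split=split)
    for record in dataset:
        yield record[text_column]


def count_tokens(texts):
    """Count whitespace-separated tokens: a rough proxy for corpus size."""
    total = 0
    for text in texts:
        total += len(text.split())
    return total
```

For example, `count_tokens(iter_texts("wikitext", "wikitext-2-raw-v1"))` should download the WikiText-2 training split and report its approximate size, assuming that dataset identifier is still available on the Hub. Whitespace splitting is a crude tokeniser, so the result will generally differ from published token counts.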

Below is a table of NLP corpora in different languages. If you know of a corpus that is missing, please contact Thomas Wood at Fast Data Science, the maintainer of this list.

Corpus	Links	Language	Number of tokens	Licence
ACL Anthology Reference Corpus (ARC)	download	English	6219633	Creative Commons Attribution
The ACTRES Parallel Corpus (P-ACTRES 2.0)	download	English-Spanish	4000000	?
Afrikaans corpus from Wikipedia	about	Afrikaans	22000000	?
Amarna letters	about	Akkadian	358 tablets	CC BY
American National Corpus	download about	English	22000000	Restricted
Amharic web corpus (amWaC)	download	Amharic	26000000	?
Annotated Corpora of Classical Tibetan (ACTib) version 2.0	download	Tibetan	170000000	?
ArabCC	download	English	203654	?
Arabic Learner Corpus (ALC)	download	Arabic	282732	CC-BY-SA 4.0 license
Arabic Web Corpus (arWaC)	download	Arabic	174000000	?
Araneum Russicum	download	Russian	?	?
Asosoft text corpus – Central Kurdish	download	Kurdish	188000000	non-commercial
Bank of English	about	English	?	?
Belarusian N-korpus	download	Belarusian	165000000	?
Bergen Corpus of London Teenage Language (COLT)	about	English	2000000	?
Bijankhan Corpus	download	Persian	2600000	General Public License
BookCorpus	about	English	985000000	?
Brexit corpus	download	English	100000000	?
British Academic Spoken English (BASE)	download	British English	1644942	?
British Academic Spoken English Corpus (BASE)	about	English	1477281	free
British Academic Written English (BAWE)	download	British English	6900000	?
British Academic Written English Corpus (BAWE)	about	English	6968089	free
British Law Report Corpus	download	British English	8850000	?
British National Corpus	download about	English	100000000	Custom open licence
The British National Corpus (BNC)	download	British English	100000000	?
Brown Corpus	download about	English	1014312	non-commercial
Buckeye Corpus	download	English	300000	Open access
Bulgarian National Corpus	download	Bulgarian	1200000000	CC BY 4.0
Cambridge English Corpus	download	English	1800000	?
Cambridge Learner Corpus	about	English	40000000	?
CC-100 (Commoncrawl): Monolingual Datasets from Web Crawl Data	download	Multiple languages: Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan, Czech, Welsh, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Frisian, Irish, Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Kurdish, Kyrgyz, Latin, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Nepali, Dutch, Norwegian, Oromo, Oriya, Punjabi, Polish, Pashto, Portuguese, Romanian, Russian, Sanskrit, Sindhi, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Sundanese, Swedish, Swahili, Tamil, Telugu, Thai, Filipino, Turkish, Uyghur, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Chinese (Simplified), Chinese (Traditional)	294582300000	See terms of use
CELEN: Learner Corpus of Spanish in Japan	about	Spanish	658467	free
CETENFolha	download	Portuguese	25000000	?
CHILDES	download	Multiple languages	?	Creative Commons Attribution 4.0 International License
Chinese/English Political Interpreting Corpus	download	Chinese/English	6500000	Creative Commons license
COBUILD	about	English	?	?
COMPARA – Portuguese/English parallel corpora	download	Portuguese-English	191053	?
CorALit	download	Lithuanian	?	?
CorCenCC National Corpus of Contemporary Welsh	about	Welsh	11000000	CC-BY-SA v4 license
CORE Corpus	download	English	50000000	CC BY-SA 4.0
CoRoLa - The Reference Corpus of the Contemporary Romanian Language (Corpus reprezentativ al limbii române contemporane)	download	Romanian	?	?
Coronavirus Corpus	download	English	1500000000	Licence
Corpus Inscriptionum Insularum Celticarum	about	Old Irish	9000 inscriptions	?
Corpus Inscriptionum Semiticarum	about	Semitic languages	?	?
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ	about	N'Ko	4102593	free
Corpus of Academic Journal Articles (CAJA)	download	English	79000000	restricted
Corpus of Academic Written and Spoken English (CAWSE)	download	English (Chinese learners)	1000000	?
Corpus of American Soap Operas	download	English	100000000	Licence
Corpus of Contemporary American English	about	English	1000000000	?
The Corpus of Electronic Texts	download	Irish	19000000	No explicit license
Corpus of English Dialogues	download	English	1200000	?
A Corpus of English Dialogues 1560-1760	download	English	1200000	?
Corpus of Estonian Web sentences 2020 Corpus	download	Estonian	331491665	?
Corpus of Historical American English (COHA)	download	English	475000000	Licence
Corpus of Political Speeches	download	?	6269359	CC BY-NC-ND 4.0
Corpus of US Supreme Court Opinions	download	English	130000000	?
Corpus Resource Database (CoRD)	download	English	?	?
Coruña Corpus	download	English	400000	CC BY-NC-ND 4.0
COVID-19 Open Research Dataset (CORD-19)	about	English	1443530655	free
Croatian Language Corpus	about	Croatian	90000000	?
Croatian National Corpus	about	Croatian	216800000	?
Czech National Corpus	download	Czech	2200000000	Licence
DBLP Discovery Dataset (D3)	download	English	?	CC BY-NC 4.0
The Digital Parisian Stage Corpus	download	French	173000	Apache License 2.0
Early English Books Online	download	English	800000000	Unlimited simultaneous users
Eastern Armenian National Corpus	download	Armenian	?	?
EcoLexicon English Corpus (EEC)	download about	English	23169446	free
Electronic Text Corpus of Sumerian Literature	download	Sumerian	?	?
Elsevier	download	English	43125207462	CC-BY 4.0
English as a Lingua Franca in Academic Settings (ELFA)	download	English	1000000	?
Enron Corpus	download	English	13810266	public domain
Environment corpus	download	English	61000000	?
EUR-Lex corpus	download	Multiple languages	840000000	CC-BY-NC-SA licence
Europarl Corpus	about	Multiple languages	60000000	Licence
Europarl spoken parallel – English	about	English	15099625	free
Europarl spoken parallel – French	about	French	16815290	free
Europarl spoken parallel – Polish	about	Polish	13034164	free
Europarl spoken parallel – Spanish	about	Spanish	15513307	free
European Parliament Proceedings Parallel Corpus 1996–2011	download	Multiple languages	596694486	CC0: Public Domain
Film corpus	download	English	21000000	?
General Internet Corpus of Russian	download about	Russian	1500000000	?
General regionally annotated corpus of Ukrainian	download	Ukrainian	373000000	?
German Political Speeches Corpus	download	German	13000000	CC-BY-SA
German Reference Corpus	download	German	55000000000	?
Global Web-Based English (GloWbE)	download	English	1900000000	Licence
Glosbe	download	Multiple languages	?	?
Google Books Ngram Corpus	download	Multiple languages	?	CC BY 3.0
GRALIS	download	Slavic	?	?
Guangwai - Lancaster Chinese Learner Corpus	about	Chinese Simplified	1289060	free
Guangwai-Lancaster Chinese Learner Corpus	download	Chinese	1200000	?
GUM corpus	download	English	204000	?
Hamshahri Corpus (Persian)	about	Persian	160000 articles	?
Hansard Corpus	download	British English	1600000000	?
InterCorp	download	Multiple languages	?	?
International Corpus of English	download	English	1000000	Not to be transferred, lent, or re-sold
International Corpus of Learner English (ICLE)	download	English	5500000	?
Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD)	about	Irish	478445	free
iWeb: The Intelligent Web-based Corpus	download	English	14000000000	?
Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles	download	Japanese-English	?	Creative Commons Attribution-Share-Alike License 3.0
The JRC-Acquis Multilingual Parallel Corpus	download	Multiple languages	44119860	CC - BY
Kanaanäische und Aramäische Inschriften	about	Multiple languages	?	?
Kotonoha Japanese language corpus	download	Japanese	?	?
Kurdish Language Corpus	download	Kurdish	14898062	?
Lancaster-Oslo-Bergen Corpus	about	English	161192	?
Language Grid	download	Multiple languages	?	?
LatinISE corpus	download	Latin	13000000	CC BY-NC-SA 4.0
linguatools	download	German English	?	?
LIVAC Synchronous Corpus	download	Chinese	2400000	Creative Commons Attribution-Share Alike 4.0
LOB Corpus	download	British English	1000000	Licence
Louvain International Database of Spoken English Interlanguage (LINDSEI)	download	English	100000	?
MacMorpho	download	Portuguese (Brazil)	6000000	?
Magpie corpus	download	English	4500000	Creative Commons Attribution 4.0 International license
MagyarOK teaching materials for Hungarian, levels A1 to B2	about	Hungarian	144832	free
MagyarOK teaching materials for Hungarian, levels A1 to B2 (old version)	about	Hungarian	144832	free
Maldivian Wikipedia Corpus (dvwiki)	about	Maldivian	500000	?
Maltese Web Corpus (mtWaC)	download	Maltese	110000000	?
MasakhaNER 1.0: Africa-centric Transfer Learning for Named Entity Recognition	paper download	Multiple languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yorùbá	3353113	CC-BY-4.0-NC
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition	download	Multiple languages: Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yorùbá, Bambara, Ewe, Fon, Twi, Ghomala, Chichewa, Setwana, chiShona, isiXhosa, isiZulu	3627462	CC-BY-4.0-NC
Medical Web Corpus	download	English	33000000	?
METCLIL: Metaphor in EMI seminars	about	English	110493	free
The Movie Corpus	download	English	200000000	Licence
Multicultural London English corpus	download	English	2400000	?
myCAT – Olanto	download	Multiple languages	10000000	?
National Corpus of Polish	about	Polish	1000000000	CC-BY 4.0
Neo-Assyrian Text Corpus Project	download	Akkadian (Neo-Assyrian)	?	?
Nepali National corpus (NNC)	download	Nepali	13000000	?
Nepali Text Corpus	download	Nepali	90000000	CC BY 4.0
The New Corpus for Ireland	download	Irish	30000000	?
New model Corpus	download	English	100000000	?
News on the Web (NOW)	download	English	17500000000	Custom Dataset Terms
NTU — Multilingual Corpus	download	Multiple languages	595000	Creative Commons – Attribute 3.0
Nunavut Hansard	download	Inuktitut-English	25000000	?
N’ko corpus	download	Manding	?	?
Old Bailey Corpus	download	English	120000000	CC-BY-SA 4.0 license
The Open American National Corpus	download	American English	15000000	?
Open American National Corpus	download about	English	11000000	Custom open license
Open Richly Annotated Cuneiform Corpus	download	Multiple languages	741100	CCA
OPUS	download	Multiple languages	30000000	CC0-1.0
Oxford Children’s Corpus (OCC)	download	English	30000000	?
Oxford English Corpus	download	English	2100000000	?
ParaSol	download	Slavic	?	?
Persian in MULTEXT-EAST corpus	download	Persian	13006 entries	Non-commercial
Polish Parliamentary Corpus (PPC)	download	Polish	553858723	CC-BY
Project Gutenberg	download	English	3000000000	Licence
PropBank	download	Multiple languages	?	?
PTC: Persian Today Corpus	download	Persian	1000000	?
Quranic Arabic Corpus	about	Quranic Arabic	77430	NU public license
RE3D (Relationship and Entity Extraction Evaluation Dataset)	download	English	?	?
Reference Corpus of Contemporary Portuguese	download	Portuguese	312000000	?
RusAge: Corpus for Age-Based Text Classification	download	Russian	5592 texts	CC BY 4.0
Russian Corpus of Biographical Texts	download	Russian	?	CC BY 4.0
Russian National Corpus	download	Russian	2000000000	non-commercial
RuTweetCorp	download	Russian	?	?
Santa Barbara Corpus of Spoken American English	download	English	249000	Creative Commons Attribution-No Derivative Works 3.0 United States License
Scottish Corpus of Texts and Speech	download	Scots	4700000	CC
SeedLing	download	Multiple languages	150000000	?
The Self-dialogue Corpus	download	English	3653313	?
SinMin	download	Sinhalese	70000000	?
Sketch Engine	about	Multiple languages	60000000000	?
Slovenian National Corpus	download	Slovenian	1134693933	CC BY-NC 4.0
Slovenian Reference Corpus (FidaPLUS)	download	Slovenian	600000000	?
Spanish text corpus by Molino de Ideas	download	Spanish	660000000	?
Spoken English Corpus	about	English	52637	?
Strathy Corpus of Canadian English	download	English	50000000	?
TalkBank	download	Multiple languages	?	CC BY-NC-SA 3.0
Tatar Mixed Corpus	download	Tatar	100000000	?
Tatoeba	about download	Multiple languages	?	CC BY 2.0 FR / CC0 1.0
TAUS	download	Multiple languages	?	?
Tekstaro de Esperanto	download	Esperanto	5177208	Creative Commons Attribution 4.0 License
The TenTen Corpus Family	download	Multiple languages	10000000000	?
TEP: Tehran English-Persian Parallel Corpus	download	Persian	8900000	?
TERMSEARCH	download	English-Russian	?	?
Thesaurus Linguae Graecae	download	Greek	110000000	Licence
Tigrinya web corpus (tiWaC)	download	Tigrinya	2000000	?
TIME Magazine Corpus	download	English	100000000	?
Timestamped JSI web corpora	download	Multiple languages	?	?
TIMIT	download	English	?	LDC User Agreement for Non-Members
TMC: Tehran Monolingual Corpus	download	Persian	250000000	?
TradooIT	download	English/French/Spanish	?	?
Transhistorical Corpus of Written English (TCWE)	about	English	501633	free
Trinity Lancaster Corpus	download	English	4200000	Creative Commons Attribution 4.0 International
TS Corpus	download	Turkish	1300000000	?
Turkish National Corpus	download	Turkish	50000000	?
The TV Corpus	download	English	325000000	?
Ubuntu Dialogue Corpus v2.0	download	English	100000000	?
Ukrainian Language Corpus on the Mova.info Linguistic Portal	download	Ukrainian	?	?
University of Pittsburgh English Language Institute Corpus (PELIC)	download	English	4200000	Creative Commons Attribution No Derivatives 4.0 International
VerbNet	about	English	?	Licence
Vienna-Oxford International Corpus of English (VOICE)	download	English	1023043	Creative Commons Attribution-NonCommercial-ShareAlike 3.0
WaCky	download	Multiple languages	1000000000	?
Wellington Corpus of Spoken New Zealand English	download	English	1000000	Creative Commons Attribution License
Wikipedia Comparable Corpora	about	Multiple languages	?	Creative Commons Attribution Share-alike
Wikipedia Corpus	download	English	1900000000	Creative Commons Attribution-ShareAlike 3.0 License
Yiddish Wikipedia Corpus (yiwiki)	about	Yiddish	2000000	?
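
Because the table is tab-separated, it can also be searched programmatically. The sketch below assumes the rows have been saved as tab-separated text with the same five column headers; the helper names `parse_token_count` and `find_corpora` are hypothetical. Any size cell that is not a plain integer, such as `?` or `358 tablets`, is treated as unknown and skipped.

```python
import csv
import io

# Three rows copied from the table, kept inline so the sketch is self-contained.
SAMPLE_TSV = (
    "Corpus\tLinks\tLanguage\tNumber of tokens\tLicence\n"
    "Amarna letters\tabout\tAkkadian\t358 tablets\tCC BY\n"
    "Bijankhan Corpus\tdownload\tPersian\t2600000\tGeneral Public License\n"
    "TMC: Tehran Monolingual Corpus\tdownload\tPersian\t250000000\t?\n"
)


def parse_token_count(cell):
    """Return the token count as an int, or None if the cell is not a plain integer."""
    cell = cell.strip()
    if cell.isascii() and cell.isdigit():
        return int(cell)
    return None


def find_corpora(tsv_text, language, min_tokens=0):
    """Return names of corpora whose Language cell mentions `language`
    and whose known token count is at least `min_tokens`."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    matches = []
    for row in reader:
        tokens = parse_token_count(row["Number of tokens"])
        if tokens is None or tokens < min_tokens:
            continue
        if language.lower() in row["Language"].lower():
            matches.append(row["Corpus"])
    return matches
```

The language match is a case-insensitive substring test, so searching for "English" would also return rows such as "British English" or "English-Spanish". Rows with unknown sizes are excluded even when `min_tokens` is 0, which is a design choice you may want to change.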
