An NLP corpus (plural: “corpora”) is a collection of text organised into datasets. Corpora can be composed of newspapers, novels, social media posts, transcripts of interviews or television shows, or other authentic text. They are used in corpus linguistics, computational linguistics, and natural language processing research. A corpus may consist of raw text as originally spoken or written, or it may include annotations, for example part-of-speech tags (noun, verb, adjective) or topic labels.
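To make the difference between raw and annotated text concrete, here is a minimal sketch in plain Python. The sentence, the tag names, and the topic label are all invented for illustration; real corpora each use their own file formats and tag sets.

```python
from collections import Counter

# Raw corpus entry: text exactly as written (hypothetical example sentence).
raw_sentence = "The cat sat on the mat"

# Annotated version of the same sentence: each token paired with a
# part-of-speech tag. The tag names are illustrative assumptions.
annotated_sentence = [
    ("The", "DET"),
    ("cat", "NOUN"),
    ("sat", "VERB"),
    ("on", "ADP"),
    ("the", "DET"),
    ("mat", "NOUN"),
]

# A document-level annotation, such as a topic label, can sit alongside the tokens.
annotated_document = {"topic": "pets", "tokens": annotated_sentence}

# The annotation layer does not change the underlying text:
# joining the tokens gives back the raw sentence.
reconstructed = " ".join(token for token, _ in annotated_sentence)

# Annotations make simple linguistic queries possible, e.g. counting tags
# or pulling out every noun.
tag_counts = Counter(tag for _, tag in annotated_sentence)
nouns = [token for token, tag in annotated_sentence if tag == "NOUN"]
```

With an annotated corpus, a question such as "which nouns occur most often?" becomes a simple filter over the tags rather than a guess based on the raw string.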
If you want to train your own large language model, develop a stylometry model, or simply hone your NLP skills, you will need an NLP corpus to work with.
Some of the best known NLP corpora include:
One easy way to load an NLP corpus in your code is with the Hugging Face Datasets package.
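As a rough sketch of what that can look like, the snippet below wraps `load_dataset` from the `datasets` package (installed with `pip install datasets`). The helper names `load_corpus_texts` and `count_whitespace_tokens` are hypothetical, and the default column name `text` is an assumption: each corpus on the Hub has its own identifier, configurations, and column names, so check the dataset card before relying on them.

```python
def load_corpus_texts(name, config=None, split="train", text_column="text"):
    """Download a corpus from the Hugging Face Hub and return its texts as a list.

    Requires `pip install datasets` and network access on first use.
    `text_column` is an assumption; many datasets name their columns differently.
    """
    # Imported lazily so the token-counting helper below works without the package.
    from datasets import load_dataset

    dataset = load_dataset(name, config, split=split)
    return [row[text_column] for row in dataset]


def count_whitespace_tokens(texts):
    """Crude token count: split each text on whitespace and add up the pieces.

    Published token counts usually come from a proper tokeniser,
    so expect this number to differ from official figures.
    """
    return sum(len(text.split()) for text in texts)
```

For example, assuming the `Salesforce/wikitext` dataset with the `wikitext-2-raw-v1` configuration is still available on the Hub, `load_corpus_texts("Salesforce/wikitext", "wikitext-2-raw-v1", split="test")` should return a list of strings that can be passed straight to `count_whitespace_tokens`.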
Below you can search a comprehensive table of NLP corpora in different languages. If you see one that is missing, please contact Thomas Wood at Fast Data Science, the maintainer of this list.
Corpus | Links | Language | Number of tokens | Licence |
---|---|---|---|---|
ACL Anthology Reference Corpus (ARC) | download | English | 6219633 | Creative Commons Attribution |
The ACTRES Parallel Corpus (P-ACTRES 2.0) | download | English-Spanish | 4000000 | ? |
Afrikaans corpus from Wikipedia | about | Afrikaans | 22000000 | ? |
Amarna letters | about | Akkadian | 358 tablets | CC BY |
American National Corpus | download about | English | 22000000 | Restricted |
Amharic web corpus (amWaC) | download | Amharic | 26000000 | ? |
Annotated Corpora of Classical Tibetan (ACTib) version 2.0 | download | Tibetan | 170000000 | ? |
ArabCC | download | English | 203654 | ? |
Arabic Learner Corpus (ALC) | download | Arabic | 282732 | CC-BY-SA 4.0 license |
Arabic Web Corpus (arWaC) | download | Arabic | 174000000 | ? |
Araneum Russicum | download | Russian | ? | ? |
Asosoft text corpus – Central Kurdish | download | Kurdish | 188000000 | non-commercial |
Bank of English | about | English | ? | ? |
Belarusian N-korpus | download | Belarusian | 165000000 | ? |
Bergen Corpus of London Teenage Language (COLT) | about | English | 2000000 | ? |
Bijankhan Corpus | download | Persian | 2600000 | General Public License |
BookCorpus | about | English | 985000000 | ? |
Brexit corpus | download | English | 100000000 | ? |
British Academic Spoken English (BASE) | download | British English | 1644942 | ? |
British Academic Spoken English Corpus (BASE) | about | English | 1477281 | free |
British Academic Written English (BAWE) | download | British English | 6900000 | ? |
British Academic Written English Corpus (BAWE) | about | English | 6968089 | free |
British Law Report Corpus | download | British English | 8850000 | ? |
British National Corpus | download about | English | 100000000 | Custom open licence |
The British National Corpus (BNC) | download | British English | 100000000 | ? |
Brown Corpus | download about | English | 1014312 | non-commercial |
Buckeye Corpus | download | English | 300000 | Open access |
Bulgarian National Corpus | download | Bulgarian | 1200000000 | CC BY 4.0 |
Cambridge English Corpus | download | English | 1800000 | ? |
Cambridge Learner Corpus | about | English | 40000000 | ? |
CC-100 (Commoncrawl): Monolingual Datasets from Web Crawl Data | download | Multiple languages: Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Belarusian, Bulgarian, Bengali, Bengali (Romanized), Breton, Bosnian, Catalan, Czech, Welsh, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Frisian, Irish, Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Hindi (Romanized), Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Kurdish, Kyrgyz, Latin, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Burmese (Zawgyi), Nepali, Dutch, Norwegian, Oromo, Oriya, Punjabi, Polish, Pashto, Portuguese, Romanian, Russian, Sanskrit, Sindhi, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Sundanese, Swedish, Swahili, Tamil, Tamil (Romanized), Telugu, Telugu (Romanized), Thai, Filipino, Turkish, Uyghur, Ukrainian, Urdu, Urdu (Romanized), Uzbek, Vietnamese, Xhosa, Yiddish, Chinese (Simplified), Chinese (Traditional) | 294582300000 | See terms of use |
CELEN: Learner Corpus of Spanish in Japan | about | Spanish | 658467 | free |
CETENFolha | download | Portuguese | 25000000 | ? |
CHILDES | download | Multiple languages | ? | Creative Commons Attribution 4.0 International License. |
Chinese/English Political Interpreting Corpus | download | Chinese/English | 6500000 | Creative Commons license |
COBUILD | about | English | ? | ? |
COMPARA – Portuguese/English parallel corpora | download | Portuguese-English | 191053 | ? |
CorALit | download | Lithuanian | ? | ? |
CorCenCC National Corpus of Contemporary Welsh | about | Welsh | 11000000 | CC-BY-SA v4 license |
CORE Corpus | download | English | 50000000 | CC BY-SA 4.0 |
CoRoLa - The Reference Corpus of the Contemporary Romanian Language (Corpus reprezentativ al limbii române contemporane) | download | Romanian | ? | ? |
Coronavirus Corpus | download | English | 1500000000 | Licence |
Corpus Inscriptionum Insularum Celticarum | about | Old Irish | 9000 inscriptions | ? |
Corpus Inscriptionum Semiticarum | about | Semitic languages | ? | ? |
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ | about | N'Ko | 4102593 | free |
Corpus of Academic Journal Articles (CAJA) | download | English | 79000000 | restricted |
Corpus of Academic Written and Spoken English (CAWSE) | download | Chinese | 1000000 | ? |
Corpus of American Soap Operas | download | English | 100000000 | Licence |
Corpus of Contemporary American English | about | English | 1000000000 | ? |
The Corpus of Electronic Texts | download | Irish | 19000000 | No explicit license |
Corpus of English Dialogues | download | English | 1200000 | ? |
A Corpus of English Dialogues 1560-1760 | download | English | 1200000 | ? |
Corpus of Estonian Web sentences 2020 Corpus | download | Estonian | 331491665 | ? |
Corpus of Historical American English (COHA) | download | English | 475000000 | Licence |
Corpus of Political Speeches | download | ? | 6269359 | CC BY-NC-ND 4.0 |
Corpus of US Supreme Court Opinions | download | English | 130000000 | ? |
Corpus Resource Database (CoRD) | download | English | ? | ? |
Coruña Corpus | download | English | 400000 | CC BY-NC-ND 4.0 |
COVID-19 Open Research Dataset (CORD-19) | about | English | 1443530655 | free |
Croatian Language Corpus | about | Croatian | 90000000 | ? |
Croatian National Corpus | about | Croatian | 216800000 | ? |
Czech National Corpus | download | Czech | 2200000000 | Licence |
DBLP Discovery Dataset (D3) | download | English | ? | CC BY-NC 4.0 |
The Digital Parisian Stage Corpus | download | French | 173000 | Apache License 2.0 |
Early English Books Online | download | English | 800000000 | Unlimited simultaneous users |
Eastern Armenian National Corpus | download | Armenian | ? | ? |
EcoLexicon English Corpus (EEC) | download about | English | 23169446 | free |
Electronic Text Corpus of Sumerian Literature | download | Sumerian | ? | ? |
Elsevier | download | English | 43125207462 | CC-BY 4.0 |
English as a Lingua Franca in Academic Settings (ELFA) | download | English | 1000000 | ? |
Enron Corpus | download | English | 13810266 | public domain |
Environment corpus | download | English | 61000000 | ? |
EUR-Lex corpus | download | Multiple languages | 840000000 | CC-BY-NC-SA licence |
Europarl Corpus | about | Multiple languages | 60000000 | Licence |
Europarl spoken parallel – English | about | English | 15099625 | free |
Europarl spoken parallel – French | about | French | 16815290 | free |
Europarl spoken parallel – Polish | about | Polish | 13034164 | free |
Europarl spoken parallel – Spanish | about | Spanish | 15513307 | free |
European Parliament Proceedings Parallel Corpus 1996–2011 | download | Multiple languages | 596694486 | CC0: Public Domain |
Film corpus | download | English | 21000000 | ? |
General Internet Corpus of Russian | download about | Russian | 1500000000 | ? |
General regionally annotated corpus of Ukrainian | download | Ukrainian | 373000000 | ? |
German Political Speeches Corpus | download | German | 13000000 | CC-BY-SA |
German Reference Corpus | download | German | 55000000000 | ? |
Global Web-Based English (GloWbE) | download | English | 1900000000 | Licence |
Glosbe | download | Multiple languages | ? | ? |
Google Books Ngram Corpus | download | Multiple languages | ? | CC BY 3.0 |
GRALIS | download | Slavic | ? | ? |
Guangwai - Lancaster Chinese Learner Corpus | about | Chinese Simplified | 1289060 | free |
Guangwai-Lancaster Chinese Learner Corpus | download | Chinese | 1200000 | ? |
GUM corpus | download | English | 204000 | ? |
Hamshahri Corpus (Persian) | about | Persian | 160000 articles | ? |
Hansard Corpus | download | British English | 1600000000 | ? |
InterCorp | download | Multiple languages | ? | ? |
International Corpus of English | download | English | 1000000 | May not be transferred, lent, or re-sold |
International Corpus of Learner English (ICLE) | download | English | 5500000 | ? |
Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) | about | Irish | 478445 | free |
iWeb: The Intelligent Web-based Corpus | download | English | 14000000000 | ? |
Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles | download | Japanese-English | ? | Creative Commons Attribution-Share-Alike License 3.0 |
The JRC-Acquis Multilingual Parallel Corpus | download | Multiple languages | 44119860 | CC - BY |
Kanaanäische und Aramäische Inschriften | about | Multiple languages | ? | ? |
Kotonoha Japanese language corpus | download | Japanese | ? | ? |
Kurdish Language Corpus | download | Kurdish | 14898062 | ? |
Lancaster-Oslo-Bergen Corpus | about | English | 161192 | ? |
Language Grid | download | Multiple languages | ? | ? |
LatinISE corpus | download | Latin | 13000000 | CC BY-NC-SA 4.0 |
linguatools | download | German English | ? | ? |
LIVAC Synchronous Corpus | download | Chinese | 2400000 | Creative Commons Attribution-Share Alike 4.0 |
LOB Corpus | download | British English | 1000000 | Licence |
Louvain International Database of Spoken English Interlanguage (LINDSEI) | download | English | 100000 | ? |
MacMorpho | download | Portuguese (Brazil) | 6000000 | ? |
Magpie corpus | download | English | 4500000 | Creative Commons Attribution 4.0 International license |
MagyarOK teaching materials for Hungarian, levels A1 to B2 | about | Hungarian | 144832 | free |
MagyarOK teaching materials for Hungarian, levels A1 to B2 (old version) | about | Hungarian | 144832 | free |
Maldivian Wikipedia Corpus (dvwiki) | about | Maldivian | 500000 | ? |
Maltese Web Corpus (mtWaC) | download | Maltese | 110000000 | ? |
MasakhaNER 1.0: Africa-centric Transfer Learning for Named Entity Recognition | paper download | Multiple languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yorùbá | 3353113 | CC-BY-4.0-NC |
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition | download | Multiple languages: Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yorùbá, Bambara, Ewe, Fon, Twi, Ghomala, Chichewa, Setwana, chiShona, isiXhosa, isiZulu | 3627462 | CC-BY-4.0-NC |
Medical Web Corpus | download | English | 33000000 | ? |
METCLIL: Metaphor in EMI seminars | about | English | 110493 | free |
The Movie Corpus | download | English | 200000000 | Licence |
Multicultural London English corpus | download | English | 2400000 | ? |
myCAT – Olanto | download | Multiple languages | 10000000 | ? |
National Corpus of Polish | about | Polish | 1000000000 | CC-BY 4.0 |
Neo-Assyrian Text Corpus Project | download | Akkadian (Neo-Assyrian) | ? | ? |
Nepali National corpus (NNC) | download | Nepali | 13000000 | ? |
Nepali Text Corpus | download | Nepali | 90000000 | CC BY 4.0 |
The New Corpus for Ireland | download | Irish | 30000000 | ? |
New model Corpus | download | English | 100000000 | ? |
News on the Web (NOW) | download | English | 17500000000 | Custom Dataset Terms |
NTU — Multilingual Corpus | download | Multiple languages | 595000 | Creative Commons Attribution 3.0 |
Nunavut Hansard | download | Inuktitut-English | 25000000 | ? |
N’ko corpus | download | Manding | ? | ? |
Old Bailey Corpus | download | English | 120000000 | CC-BY-SA 4.0 license |
The Open American National Corpus | download | American English | 15000000 | ? |
Open American National Corpus | download about | English | 11000000 | Custom open license |
Open Richly Annotated Cuneiform Corpus | download | Multiple languages | 741100 | CCA |
OPUS | download | Multiple languages | 30000000 | CC0-1.0 |
Oxford Children’s Corpus (OCC) | download | English | 30000000 | ? |
Oxford English Corpus | download | English | 2100000000 | ? |
ParaSol | download | Slavic | ? | ? |
Persian in MULTEXT-EAST corpus | download | Persian | 13006 entries | Non-commercial |
Polish Parliamentary Corpus (PPC) | download | Polish | 553858723 | CC-BY |
Project Gutenberg | download | English | 3000000000 | Licence |
PropBank | download | Multiple languages | ? | ? |
PTC: Persian Today Corpus | download | Persian | 1000000 | ? |
Quranic Arabic Corpus | about | Quranic Arabic | 77430 | GNU public license |
RE3D (Relationship and Entity Extraction Evaluation Dataset) | download | English | ? | ? |
Reference Corpus of Contemporary Portuguese | download | Portuguese | 312000000 | ? |
RusAge: Corpus for Age-Based Text Classification | download | Russian | 5592 texts | CC BY 4.0 |
Russian Corpus of Biographical Texts | download | Russian | ? | CC BY 4.0 |
Russian National Corpus | download | Russian | 2000000000 | non-commercial |
RuTweetCorp | download | Russian | ? | ? |
Santa Barbara Corpus of Spoken American English | download | English | 249000 | Creative Commons Attribution-No Derivative Works 3.0 United States License |
Scottish Corpus of Texts and Speech | download | Scots | 4700000 | CC |
SeedLing | download | Multiple languages | 150000000 | ? |
The Self-dialogue Corpus | download | English | 3653313 | ? |
SinMin | download | Sinhalese | 70000000 | ? |
Sketch Engine | about | Multiple languages | 60000000000 | ? |
Slovenian National Corpus | download | Slovenian | 1134693933 | CC BY-NC 4.0 |
Slovenian Reference Corpus (FidaPLUS) | download | Slovenian | 600000000 | ? |
Spanish text corpus by Molino de Ideas | download | Spanish | 660000000 | ? |
Spoken English Corpus | about | English | 52637 | ? |
Strathy Corpus of Canadian English | download | English | 50000000 | ? |
TalkBank | download | Multiple languages | ? | CC BY-NC-SA 3.0 |
Tatar Mixed Corpus | download | Tatar | 100000000 | ? |
Tatoeba | about download | Multiple languages | ? | CC BY 2.0 FR / CC0 1.0 |
TAUS | download | Multiple languages | ? | ? |
Tekstaro de Esperanto | download | Esperanto | 5177208 words | Creative Commons Attribution 4.0 License. |
The TenTen Corpus Family | download | Multiple languages | 10000000000 | ? |
TEP: Tehran English-Persian Parallel Corpus | download | Persian | 8900000 | ? |
TERMSEARCH | download | English-Russian | ? | ? |
Thesaurus Linguae Graecae | download | Greek | 110000000 | Licence |
Tigrinya web corpus (tiWaC) | download | Tigrinya | 2000000 | ? |
TIME Magazine Corpus | download | English | 100000000 | ? |
Timestamped JSI web corpora | download | Multiple languages | ? | ? |
TIMIT | download | English | ? | LDC User Agreement for Non-Members |
TMC: Tehran Monolingual Corpus | download | Persian | 250000000 | ? |
TradooIT | download | English/French/Spanish | ? | ? |
Transhistorical Corpus of Written English (TCWE) | about | English | 501633 | free |
Trinity Lancaster Corpus | download | English | 4200000 | Creative Commons Attribution 4.0 International |
TS Corpus | download | Turkish | 1300000000 | ? |
Turkish National Corpus | download | Turkish | 50000000 | ? |
The TV Corpus | download | English | 325000000 | ? |
Ubuntu Dialogue Corpus v2.0 | download | English | 100000000 | ? |
Ukrainian Language Corpus on the Mova.info Linguistic Portal | download | Ukrainian | ? | ? |
University of Pittsburgh English Language Institute Corpus (PELIC) | download | English | 4200000 | Creative Commons Attribution No Derivatives 4.0 International |
VerbNet | about | English | ? | Licence |
Vienna-Oxford International Corpus of English (VOICE) | download | English | 1023043 | Creative Commons Attribution-NonCommercial-ShareAlike 3.0 |
WaCky | download | Multiple languages | 1000000000 | ? |
Wellington Corpus of Spoken New Zealand English | download | English | 1000000 | Creative Commons Attribution License |
Wikipedia Comparable Corpora | about | Multiple languages | ? | Creative Commons Attribution Share-alike |
Wikipedia Corpus | download | English | 1900000000 | Creative Commons Attribution-ShareAlike 3.0 License |
Yiddish Wikipedia Corpus (yiwiki) | about | Yiddish | 2000000 | ? |
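If you copy rows from a pipe-delimited table like the one above into a script, a small parser makes it easy to filter by language and sort by size. The following is only a sketch: the function names are hypothetical, the sample rows are copied from the table, and the parser assumes five cells per row. It tolerates the mixed formats that size columns in lists like this tend to contain (plain integers, a trailing unit such as "words", "?" for unknown, and spelled-out magnitudes such as "million"), and it treats counts of tablets, articles, or similar units as unknown because they are not token counts.

```python
import re

# Multipliers for counts written with words, e.g. "2.4 million".
_MULTIPLIERS = {"million": 1_000_000, "billion": 1_000_000_000}

# Units that do not measure tokens; such counts are treated as unknown.
_NON_TOKEN_UNITS = {"tablets", "inscriptions", "articles", "entries", "texts"}


def parse_token_count(cell):
    """Return the token count in a table cell as an int, or None if unknown."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]*)", cell.strip().lower())
    if match is None:  # e.g. "?"
        return None
    number, unit = match.groups()
    if unit in _NON_TOKEN_UNITS:
        return None
    return round(float(number) * _MULTIPLIERS.get(unit, 1))


def parse_row(line):
    """Split one pipe-delimited row into a dict (assumes exactly five cells)."""
    cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
    corpus, links, language, tokens, licence = cells
    return {
        "corpus": corpus,
        "links": links.split(),
        "language": language,
        "tokens": parse_token_count(tokens),
        "licence": None if licence == "?" else licence,
    }


def largest_corpora(rows, language, limit=5):
    """Corpora whose language cell mentions `language`, largest first.

    This is a substring match, so "English" also matches "British English"
    and parallel corpora such as "English-Spanish".
    """
    matches = [
        row for row in rows
        if language.lower() in row["language"].lower() and row["tokens"] is not None
    ]
    return sorted(matches, key=lambda row: row["tokens"], reverse=True)[:limit]


# A few rows copied from the table above.
SAMPLE_ROWS = [
    "Brown Corpus | download about | English | 1014312 | non-commercial |",
    "British National Corpus | download about | English | 100000000 | Custom open licence |",
    "Bank of English | about | English | ? | ? |",
    "Amarna letters | about | Akkadian | 358 tablets | CC BY |",
    "Tekstaro de Esperanto | download | Esperanto | 5177208 words | Creative Commons Attribution 4.0 License. |",
]

rows = [parse_row(line) for line in SAMPLE_ROWS]
english = largest_corpora(rows, "English")
```

Rows with an unknown size are dropped from the ranking rather than sorted as zero, so a corpus such as the Bank of English does not appear to be the smallest one.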