Which NLP corpus?


What is an NLP text corpus?

An NLP corpus (plural: “corpora”) is a collection of text organised into datasets. Corpora can be composed of newspapers, novels, social media posts, transcripts of interviews or television shows, or other authentic text. An NLP corpus is useful for corpus linguistics, computational linguistics, and natural language processing research. A corpus can be raw text as originally spoken or written, or it can include annotations, such as part-of-speech tags (noun, verb, adjective) or topic labels.
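To make the difference between raw and annotated text concrete, here is a minimal sketch in Python. The sentence, the tag names (loosely following the Universal POS tag set) and the helper `tokens_with_tag` are illustrative assumptions, not the format of any particular corpus in the table below.

```python
# Raw corpus: plain text exactly as it was written or spoken.
raw_corpus = ["The cat sat on the mat."]

# Annotated corpus: the same sentence as (token, part-of-speech) pairs.
# The tags are illustrative and loosely follow the Universal POS tag set.
annotated_corpus = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"), (".", "PUNCT")],
]


def tokens_with_tag(corpus, tag):
    """Return every token in an annotated corpus that carries the given tag."""
    return [token for sentence in corpus for token, t in sentence if t == tag]


nouns = tokens_with_tag(annotated_corpus, "NOUN")
print(nouns)
```

Real annotated corpora usually ship in a dedicated file format rather than as Python literals, but the idea is the same: each token carries one or more labels alongside its surface form.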

List of multilingual text corpora for natural language processing

Whether you want to train your own large language model, develop a stylometry model, or simply hone your NLP skills, you will need an NLP corpus to work with.

Some of the best known NLP corpora include:

One easy way to load an NLP corpus in your code is with the HuggingFace Datasets package.
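As a rough sketch of what that can look like, the snippet below streams a corpus with `load_dataset` from the `datasets` package and counts its tokens. The helper names `iter_corpus_texts` and `count_tokens`, the `text` field name, and the dataset identifier in the commented example are assumptions for illustration; check the dataset card of the corpus you pick for its actual configuration and field names.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional


def iter_corpus_texts(path: str, name: Optional[str] = None,
                      split: str = "train",
                      text_field: str = "text") -> Iterator[str]:
    """Stream the text field of a corpus hosted on the Hugging Face Hub."""
    # Imported lazily so count_tokens below works without the package installed.
    from datasets import load_dataset  # pip install datasets

    # streaming=True iterates over records without downloading everything first.
    dataset = load_dataset(path, name, split=split, streaming=True)
    for record in dataset:
        yield record[text_field]


def count_tokens(texts: Iterable[str]) -> Counter:
    """Count lower-cased, whitespace-separated tokens (a deliberately naive tokeniser)."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


# Example usage (needs network access; the dataset name is only an assumption):
#   from itertools import islice
#   texts = islice(iter_corpus_texts("wikitext", "wikitext-2-raw-v1"), 1000)
#   print(count_tokens(texts).most_common(10))
```

Whitespace splitting is only a stand-in: token counts like those in the table below depend heavily on the tokeniser used, so expect your own numbers to differ.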

Below you can search a comprehensive table of NLP corpora in different languages. If you see one that is missing, please contact Thomas Wood at Fast Data Science, the maintainer of this list.

| Corpus | Links | Language | Number of tokens | Licence |
|---|---|---|---|---|
| ACL Anthology Reference Corpus (ARC) | download | English | 6219633 | Creative Commons Attribution |
| The ACTRES Parallel Corpus (P-ACTRES 2.0) | download | English-Spanish | 4000000 | ? |
| Afrikaans corpus from Wikipedia | about | Afrikaans | 22000000 | ? |
| Amarna letters | about | Akkadian | 358 tablets | CC BY |
| American National Corpus | download about | English | 22000000 | Restricted |
| Amharic web corpus (amWaC) | download | Amharic | 26000000 | ? |
| Annotated Corpora of Classical Tibetan (ACTib) version 2.0 | download | Tibetan | 170000000 | ? |
| ArabCC | download | English | 203654 | ? |
| Arabic Learner Corpus (ALC) | download | Arabic | 282732 | CC-BY-SA 4.0 license |
| Arabic Web Corpus (arWaC) | download | Arabic | 174000000 | ? |
| Araneum Russicum | download | Russian | ? | ? |
| Asosoft text corpus – Central Kurdish | download | Kurdish | 188000000 | non-commercial |
| Bank of English | about | English | ? | ? |
| Belarusian N-korpus | download | Belarusian | 165000000 | ? |
| Bergen Corpus of London Teenage Language (COLT) | about | English | 2000000 | ? |
| Bijankhan Corpus | download | Persian | 2600000 | General Public License |
| BookCorpus | about | English | 985000000 | ? |
| Brexit corpus | download | English | 100000000 | ? |
| British Academic Spoken English (BASE) | download | British English | 1644942 | ? |
| British Academic Spoken English Corpus (BASE) | about | English | 1477281 | free |
| British Academic Written English (BAWE) | download | British English | 6900000 | ? |
| British Academic Written English Corpus (BAWE) | about | English | 6968089 | free |
| British Law Report Corpus | download | British English | 8850000 | ? |
| British National Corpus | download about | English | 100000000 | Custom open licence |
| The British National Corpus (BNC) | download | British English | 100000000 | ? |
| Brown Corpus | download about | English | 1014312 | non-commercial |
| Buckeye Corpus | download | English | 300000 | Open access |
| Bulgarian National Corpus | download | Bulgarian | 1200000000 | CC BY 4.0 |
| Cambridge English Corpus | download | English | 1800000 | ? |
| Cambridge Learner Corpus | about | English | 40000000 | ? |
| CC-100 (Commoncrawl): Monolingual Datasets from Web Crawl Data | download | Multiple languages: Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Belarusian, Bulgarian, Bengali, Bengali, Breton, Bosnian, Catalan, Czech, Welsh, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Frisian, Irish, Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Hindi, Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Kurdish, Kyrgyz, Latin, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Burmese, Nepali, Dutch, Norwegian, Oromo, Oriya, Punjabi, Polish, Pashto, Portuguese, Romanian, Russian, Sanskrit, Sindhi, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Sundanese, Swedish, Swahili, Tamil, Tamil, Telugu, Telugu, Thai, Filipino, Turkish, Uyghur, Ukrainian, Urdu, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Chinese (Simplified), Chinese (Traditional) | 294582300000 | See terms of use |
| CELEN: Learner Corpus of Spanish in Japan | about | Spanish | 658467 | free |
| CETENFolha | download | Portuguese | 25000000 | ? |
| CHILDES | download | Multiple languages | ? | Creative Commons Attribution 4.0 International License. |
| Chinese/English Political Interpreting Corpus | download | Chinese/English | 6500000 | Creative Commons license |
| COBUILD | about | English | ? | ? |
| COMPARA – Portuguese/English parallel corpora | download | Portuguese-English | 191053 | ? |
| CorALit | download | Lithuanian | ? | ? |
| CorCenCC National Corpus of Contemporary Welsh | about | Welsh | 11000000 | CC-BY-SA v4 license |
| CORE Corpus | download | English | 50000000 | CC BY-SA 4.0 |
| CoRoLa - The Reference Corpus of the Contemporary Romanian Language (Corpus reprezentativ al limbii române contemporane) | download | Romanian | ? | ? |
| Coronavirus Corpus | download | English | 1500000000 | Licence |
| Corpus Inscriptionum Insularum Celticarum | about | Old Irish | 9000 inscriptions | ? |
| Corpus Inscriptionum Semiticarum | about | Semitic languages | ? | ? |
| Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ | about | N'Ko | 4102593 | free |
| Corpus of Academic Journal Articles (CAJA) | download | English | 79000000 | restricted |
| Corpus of Academic Written and Spoken English (CAWSE) | download | Chinese | 1000000 | ? |
| Corpus of American Soap Operas | download | English | 100000000 | Licence |
| Corpus of Contemporary American English | about | English | 1000000000 | ? |
| The Corpus of Electronic Texts | download | Irish | 19000000 | No explicit license |
| Corpus of English Dialogues | download | English | 1200000 | ? |
| A Corpus of English Dialogues 1560-1760 | download | English | 1200000 | ? |
| Corpus of Estonian Web sentences 2020 Corpus | download | Estonian | 331491665 | ? |
| Corpus of Historical American English (COHA) | download | English | 475000000 | Licence |
| Corpus of Political Speeches | download | ? | 6269359 | CC BY-NC-ND 4.0 |
| Corpus of US Supreme Court Opinions | download | English | 130000000 | ? |
| Corpus Resource Database (CoRD) | download | English | ? | ? |
| Coruña Corpus | download | English | 400000 | CC BY-NC-ND 4.0 |
| COVID-19 Open Research Dataset (CORD-19) | about | English | 1443530655 | free |
| Croatian Language Corpus | about | Croatian | 90000000 | ? |
| Croatian National Corpus | about | Croatian | 216800000 | ? |
| Czech National Corpus | download | Czech | 2200000000 | Licence |
| DBLP Discovery Dataset (D3) | download | English | ? | CC BY-NC 4.0 |
| The Digital Parisian Stage Corpus | download | French | 173000 | Apache License 2.0 |
| Early English Books Online | download | English | 800000000 | Unlimited simultaneous users |
| Eastern Armenian National Corpus | download | Armenian | ? | ? |
| EcoLexicon English Corpus (EEC) | download about | English | 23169446 | free |
| Electronic Text Corpus of Sumerian Literature | download | Sumerian | ? | ? |
| Elsevier | download | English | 43125207462 | CC-BY 4.0 |
| English as a Lingua Franca in Academic Settings (ELFA) | download | English | 1000000 | ? |
| Enron Corpus | download | English | 13810266 | public domain |
| Environment corpus | download | English | 61000000 | ? |
| EUR-Lex corpus | download | Multiple languages | 840000000 | CC-BY-NC-SA licence |
| Europarl Corpus | about | Multiple languages | 60000000 | Licence |
| Europarl spoken parallel – English | about | English | 15099625 | free |
| Europarl spoken parallel – French | about | French | 16815290 | free |
| Europarl spoken parallel – Polish | about | Polish | 13034164 | free |
| Europarl spoken parallel – Spanish | about | Spanish | 15513307 | free |
| European Parliament Proceedings Parallel Corpus 1996–2011 | download | Multiple languages | 596694486 | CC0: Public Domain |
| Film corpus | download | English | 21000000 | ? |
| General Internet Corpus of Russian | download about | Russian | 1500000000 | ? |
| General regionally annotated corpus of Ukrainian | download | Ukrainian | 373000000 | ? |
| German Political Speeches Corpus | download | German | 13000000 | CC-BY-SA |
| German Reference Corpus | download | German | 55000000000 | ? |
| Global Web-Based English (GloWbE) | download | English | 1900000000 | Licence |
| Glosbe | download | Multiple languages | ? | ? |
| Google Books Ngram Corpus | download | Multiple languages | ? | CC BY 3.0 |
| GRALIS | download | Slavic | ? | ? |
| Guangwai - Lancaster Chinese Learner Corpus | about | Chinese Simplified | 1289060 | free |
| Guangwai-Lancaster Chinese Learner Corpus | download | Chinese | 1200000 | ? |
| GUM corpus | download | English | 204000 | ? |
| Hamshahri Corpus (Persian) | about | Persian | 160000 articles | ? |
| Hansard Corpus | download | British English | 1600000000 | ? |
| InterCorp | download | Multiple languages | ? | ? |
| International Corpus of English | download | English | 1000000 | non transferable, lent, or re-sold |
| International Corpus of Learner English (ICLE) | download | English | 5500000 | ? |
| Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) | about | Irish | 478445 | free |
| iWeb: The Intelligent Web-based Corpus | download | English | 14000000000 | ? |
| Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles | download | Japanese-English | ? | Creative Commons Attribution-Share-Alike License 3.0 |
| The JRC-Acquis Multilingual Parallel Corpus | download | Multiple languages | 44119860 | CC - BY |
| Kanaanäische und Aramäische Inschriften | about | Multiple languages | ? | ? |
| Kotonoha Japanese language corpus | download | Japanese | ? | ? |
| Kurdish Language Corpus | download | Kurdish | 14898062 | ? |
| Lancaster-Oslo-Bergen Corpus | about | English | 161192 | ? |
| Language Grid | download | Multiple languages | ? | ? |
| LatinISE corpus | download | Latin | 13000000 | CC BY-NC-SA 4.0 |
| linguatools | download | German English | ? | ? |
| LIVAC Synchronous Corpus | download | Chinese | 2400000 | Creative Commons Attribution-Share Alike 4.0 |
| LOB Corpus | download | British English | 1000000 | Licence |
| Louvain International Database of Spoken English Interlanguage (LINDSEI) | download | English | 100000 | ? |
| MacMorpho | download | Portuguese (Brazil) | 6000000 | ? |
| Magpie corpus | download | English | 4500000 | Creative Commons Attribution 4.0 International license |
| MagyarOK teaching materials for Hungarian, levels A1 to B2 | about | Hungarian | 144832 | free |
| MagyarOK teaching materials for Hungarian, levels A1 to B2 (old version) | about | Hungarian | 144832 | free |
| Maldivian Wikipedia Corpus (dvwiki) | about | Maldivian | 500000 | ? |
| Maltese Web Corpus (mtWaC) | download | Maltese | 110000000 | ? |
| MasakhaNER 1.0: Africa-centric Transfer Learning for Named Entity Recognition | paper download | Multiple languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yorùbá | 3353113 | CC-BY-4.0-NC |
| MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition | download | Multiple languages: Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yorùbá, Bambara, Ewe, Fon, Twi, Ghomala, Chichewa, Setwana, chiShona, isiXhosa, isiZulu | 3627462 | CC-BY-4.0-NC |
| Medical Web Corpus | download | English | 33000000 | ? |
| METCLIL: Metaphor in EMI seminars | about | English | 110493 | free |
| The Movie Corpus | download | English | 200000000 | Licence |
| Multicultural London English corpus | download | English | 2.4 million | ? |
| myCAT – Olanto | download | Multiple languages | 10000000 | ? |
| National Corpus of Polish | about | Polish | 1000000000 | CC-BY 4.0 |
| Neo-Assyrian Text Corpus Project | download | Assyrian Neo-Aramaic | ? | ? |
| Nepali National corpus (NNC) | download | Nepali | 13000000 | ? |
| Nepali Text Corpus | download | Nepali | 90000000 | CC BY 4.0 |
| The New Corpus for Ireland | download | Irish | 30000000 | ? |
| New model Corpus | download | English | 100000000 | ? |
| News on the Web (NOW) | download | English | 17500000000 | Custom Dataset Terms |
| NTU — Multilingual Corpus | download | Multiple languages | 595000 | Creative Commons – Attribute 3.0 |
| Nunavut Hansard | download | Inuktitut- English | 25000000 | ? |
| N’ko corpus | download | Manding | ? | ? |
| Old Bailey Corpus | download | English | 120000000 | CC-BY-SA 4.0 license |
| The Open American National Corpus | download | American English | 15000000 | ? |
| Open American National Corpus | download about | English | 11000000 | Custom open license |
| Open Richly Annotated Cuneiform Corpus | download | Multiple languages | 741100 | CCA |
| OPUS | download | Multiple languages | 30000000 | CC0-1.0 |
| Oxford Children’s Corpus (OCC) | download | English | 30000000 | ? |
| Oxford English Corpus | download | English | 2.1 billion | ? |
| ParaSol | download | Slavic | ? | ? |
| Persian in MULTEXT-EAST corpus | download | Persian | 13006 entries | Non-commercial |
| Polish Parliamentary Corpus (PPC) | download | Polish | 553858723 | CC-BY |
| Project Gutenberg | download | English | 3000000000 | Licence |
| PropBank | download | Multiple languages | ? | ? |
| PTC: Persian Today Corpus | download | Persian | 1000000 | ? |
| Quranic Arabic Corpus | about | Quranic Arabic | 77430 | NU public license |
| RE3D (Relationship and Entity Extraction Evaluation Dataset) | download | English | ? | ? |
| Reference Corpus of Contemporary Portuguese | download | Portuguese | 312000000 | ? |
| RusAge: Corpus for Age-Based Text Classification | download | Russian | 5592 texts | CC BY 4.0 |
| Russian Corpus of Biographical Texts | download | Russian | ? | CC BY 4.0 |
| Russian National Corpus | download | Russian | 2000000000 | non-commercial |
| RuTweetCorp | download | Russian | ? | ? |
| Santa Barbara Corpus of Spoken American English | download | English | 249000 | Creative Commons Attribution-No Derivative Works 3.0 United States License |
| Scottish Corpus of Texts and Speech | download | Scots | 4700000 | CC |
| SeedLing | download | Multiple languages | 150000000 | ? |
| The Self-dialogue Corpus | download | English | 3653313 | ? |
| SinMin | download | Sinhalese | 70000000 | ? |
| Sketch Engine | about | Multiple languages | 60000000000 | ? |
| Slovenian National Corpus | download | Slovenian | 1134693933 | CC BY-NC 4.0 |
| Slovenian Reference Corpus (FidaPLUS) | download | Slovenian | 600000000 | ? |
| Spanish text corpus by Molino de Ideas | download | Spanish | 660000000 | ? |
| Spoken English Corpus | about | English | 52637 | ? |
| Strathy Corpus of Canadian English | download | English | 50000000 | ? |
| TalkBank | download | Multiple languages | ? | CC BY-NC-SA 3.0 |
| Tatar Mixed Corpus | download | Tatar | 100000000 | ? |
| Tatoeba | about download | Multiple languages | ? | CC BY 2.0 FR / CC0 1.0 |
| TAUS | download | Multiple languages | ? | ? |
| Tekstaro de Esperanto | download | Esperanto | 5177208 words | Creative Commons Attribution 4.0 License. |
| The TenTen Corpus Family | download | Multiple languages | 10000000000 | ? |
| TEP: Tehran English-Persian Parallel Corpus | download | Persian | 8900000 | ? |
| TERMSEARCH | download | English-Russian | ? | ? |
| Thesaurus Linguae Graecae | download | Greek | 110000000 | Licence |
| Tigrynia web corpus (tiWac) | download | Tigrinya | 2000000 | ? |
| TIME Magazine Corpus | download | English | 100000000 | ? |
| Timestamped JSI web corpora | download | Multiple languages | ? | ? |
| TIMIT | download | English | ? | DC User Agreement for Non-Members |
| TMC: Tehran Monolingual Corpus | download | Persian | 250000000 | ? |
| TradooIT | download | English/French/Spanish | ? | ? |
| Transhistorical Corpus of Written English (TCWE) | about | English | 501633 | free |
| Trinity Lancaster Corpus | download | English | 4200000 | Creative Commons Attribution 4.0 International |
| TS Corpus | download | Turkish | 1300000000 | ? |
| Turkish National Corpus | download | Turkish | 50000000 | ? |
| The TV Corpus | download | English | 325000000 | ? |
| Ubuntu Dialogue Corpus v2.0 | download | English | 100000000 | ? |
| Ukrainian Language Corpus on the Mova.info Linguistic Portal | download | Ukrainian | ? | ? |
| University of Pittsburgh English Language Institute Corpus (PELIC) | download | English | 4200000 | Creative Commons Attribution No Derivatives 4.0 International |
| VerbNet | about | English | ? | Licence |
| Vienna-Oxford International Corpus of English (VOICE) | download | English | 1023043 | Creative Commons Attribution-NonCommercial-ShareAlike 3.0 |
| WaCky | download | Multiple languages | 1000000000 | ? |
| Wellington Corpus of Spoken New Zealand English | download | English | 1000000 | Creative Commons Attribution License |
| Wikipedia Comparable Corpora | about | Multiple languages | ? | Creative Commons Attribution Share-alike |
| Wikipedia Corpus | download | English | 1900000000 | Creative Commons Attribution-ShareAlike 3.0 License |
| Yiddish Wikipedia Corpus (yiwiki) | about | Yiddish | 2000000 | ? |
