This is a list of links to lexical databases and corpora, organized by language or language group. The resources on this page were initially compiled from announcements on the LINGUIST List and from web-search results. The list is not intended to be exhaustive; it is a place to organize and store potentially useful links as I encounter them. Suggestions for additional links to include on this page are welcome.
Lexical database resources (lemmas, wordforms, frequency information)
Collections
- The CELEX Lexical Databases (Dutch, English, German), Max Planck Institute for Psycholinguistics
  - More documentation, University of Pennsylvania
- List of WordNets in the world, Global WordNet Association
- Multi WordNet, Fondazione Bruno Kessler
  - Italian WordNet aligned with Princeton English WordNet
  - Also has access to WordNets for Spanish, Portuguese, Hebrew, Romanian, and Latin
French
- Lexique, University of Paris V
German
- GermaNet, University of Tübingen
- Noun Associations for German, Saarland University
English
- The CMU Pronouncing Dictionary of American English, Carnegie Mellon University
- WordNet (English), Princeton University
- MRC Psycholinguistic Database, University of Western Australia
- Lists of high-frequency English lemmas and wordforms (data from CCAE)
- The Verb Semantics Ontology Project, CSLI, Stanford University
- Twitter Current English Lexicon, Illocution Incorporated
  - Top 10,000 words and bigrams from their English Twitter corpus, with frequency information
Italian
Spanish
- Spanish FrameNet (SFN), Autonomous University of Barcelona and International Computer Science Institute
Corpora
Collections | Chinese (Mandarin) | English | Icelandic | Indo-European | Italian | Japanese | Persian/Farsi | Polish | Portuguese | Spanish | Sumerian | Swedish | Turkish
Collections
- The University of Oxford Text Archive (browse a list of available texts and corpora)
- Querying Internet corpora, Leeds University
- SMULTRON - Stockholm MULtilingual TReebank, University of Zurich
  - Includes treebanks in English, German, Swedish, French, and Spanish
- CHILDES -- Child Language Data Exchange System, Carnegie Mellon University
  - Child data files from various languages are represented
Chinese (Mandarin)
- Leiden Weibo Corpus, Leiden University
English
- American English corpora, Brigham Young University
  - Corpus of Contemporary American English (COCA)
  - TIME Magazine Corpus of American English
  - Corpus of Historical American English (COHA)
  - Google Books American English Corpus
- Michigan Corpus of Academic Spoken English, University of Michigan
- British National Corpus (BNC)
  - Official BNC site
  - BNC via Brigham Young University
  - Phrases In English -- separate utility, but uses BNC data
- Penn and Penn-Helsinki corpora of historical and modern English
  - Penn-Helsinki Parsed Corpus of Middle English, 2nd edition (PPCME2)
  - Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME)
  - Penn-Helsinki Parsed Corpus of Modern British English (PPCMBE)
  - Parsed Corpus of Early English Correspondence (PCEEC)
- Scottish English corpora, University of Glasgow
- The Salamanca Corpus - Digital Archive of English Dialect Texts, University of Salamanca
  - Contains PDF and DOC versions of texts which represent British dialects, 1500-1950
- Germanic possessive -s: an empirical, historical and theoretical study, University of Manchester
  - Database of categorized examples of possessive constructions in English (BNC) and Swedish (GSLC)
Icelandic
Indo-European
- Pragmatic Resources in Old IE Languages (PROIEL), University of Oslo
  - Parallel New Testament texts/translations in Ancient Greek, Latin, Gothic, Armenian, and Old Church Slavonic
Italian
- La Repubblica corpus (Italian newspaper texts), University of Bologna
Japanese
- BCCWJ: Balanced Corpus of Contemporary Written Japanese (KOTONOHA), NINJAL
Persian/Farsi
- Bijankhan corpus, University of Tehran -- for NLP
- Persian Treebank, Free University of Berlin (data from Bijankhan corpus)
- Hamshahri corpus, University of Tehran -- for information retrieval
Polish
Portuguese
- Corpus do Português, Brigham Young University
Spanish
- Corpus del Español, Brigham Young University
- Spanish Learner Language Oral Corpora (L2 Spanish; L1=English), University of Southampton
Sumerian
- Electronic Text Corpus of Sumerian Literature (ETCSL), University of Oxford
Swedish
- Göteborg Spoken Language Corpus (GSLC), Göteborg University
- Germanic possessive -s: an empirical, historical and theoretical study, University of Manchester
  - Database of categorized examples of possessive constructions in English (BNC) and Swedish (GSLC)
Turkish
- TS corpus, Taner Sezer
Lists of lexical-database and corpus resources
- Corpora and Corpus-based Computational Linguistics, Manuel Barbera
- Corpora4Learning.net, Sabine Braun
- Corpus Linguistics and Written Language Resources - Bibliography, Joaquim Llisterri
- Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources, Christopher Manning
- Text Analysis and Corpus Linguistics resources, SIL International
- List of Corpora, W3-Corpora Project, University of Essex
Last update and link check: March 2012