The Linguistics Lab maintains a collection of natural language corpora. Some highlights are listed below. Many of these have been released through the Linguistic Data Consortium (LDC). UGA's membership in the LDC is funded annually by contributions from cooperating academic units. Faculty and students in these cooperating units enjoy unrestricted access to the entire collection.
For information about becoming a cooperating academic unit, please contact John Hale via <email@example.com>.
This was the first million-word electronic corpus of English, created in 1961 at Brown University. It spans about fifteen different categories of text.
Manually-corrected phrase structure trees for English, including 1.2 million words of newspaper text from the Wall Street Journal.
531 million tokens of American English sampled from 1990--2017 across categories such as Academic, Fiction, Magazine, Newspaper, and Spoken. This version is annotated with lemmas and parts of speech. Provided courtesy of the UGA Library.
400+ million words from the period 1810--2008 facilitate diachronic investigation. Provided courtesy of the UGA Library.
78 million words from Philosophical Transactions of the Royal Society of London 1665--1920 with lemmas, parts of speech and spelling normalization. For more information see Fischer et al., LREC 2020.
1 million words of written Welsh tagged with parts of speech, lemmas and a classification of morphophonemic mutation types. Courtesy of Jonathan Jones.
100 million words of text from a variety of British sources that are annotated with parts of speech and lemmas as well as sociolinguistic variables such as speaker age, social class and geographical region. 91% was published between 1985 and 1993.
Audio and all available transcriptions of the 7.5 million words of the spoken portion of the British National Corpus, including TextGrids aligned at the word and phone level, with associated speaker metadata. For more details see Coleman et al. 2012.
11.4 million tokens, orthographically transcribed from smartphone recordings made between 2012 and 2016. Substantial speaker metadata is included. Annotations include parts of speech, lemmas and a system of semantic tags. For more details see Brezina et al. 2018.
English varieties from the UK, Canada, East Africa, Hong Kong, India, Ireland, Jamaica, the Philippines, Singapore and the USA. Each component corpus contains about one million words. The ICE-GB word annotations (but not syntactic trees) are searchable using IMS Open Corpus Workbench.
Approximately 800 thousand words of newswire text from Agence France-Presse annotated with parts of speech, morphology and phrase structure.
About 100 thousand words from both Spanish newswire and discussion forums, with extensive morphological and syntactic annotations.
About 500 thousand words of Spanish newswire, with extensive morphological and syntactic annotations.
2 billion words of Spanish from 21 different countries, with lemmas and parts of speech. Courtesy of Mark Davies.
180 million words from the Portuguese newspaper "Publico" 1991--1998 with morphological and syntactic annotations.
About 2 billion words from Brazilian web pages, collected in WaCky style (see below). Courtesy of Aline Villavicencio.
1 billion words from web pages in Brazil, Portugal, Angola & Mozambique with POS tags from Eckhard Bick's "Palabras" tagger. Courtesy of Mark Davies.
About 650 thousand words drawn from the newspaper Le Monde 1989--1994, annotated with syntactic constituents, syntactic categories, lemmas and compounds.
Dependency, constituency and morphology annotations for Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.
355 thousand tokens of the German newspaper Frankfurter Rundschau annotated with syntactic structures.
About 40 million words of European Parliamentary proceedings aligned across translations into English, German, Spanish, French, Italian and Dutch.
This corpus consists of 5-10 minute snippets from 120 phone calls, each 30 minutes in length. The calls were made by native English speakers from various places in North America, mostly to family members or close friends. Roughly 90 of the calls were placed to persons living outside North America, but all are in English. Holdings include both audio and transcripts.
This corpus comprises interviews with 40 speakers from Columbus, Ohio, recorded 1992--2000. Each interview lasted about 60 minutes, and the corpus totals more than 300,000 words of speech. Created primarily for studying phonological variation in American English, it was gathered via a "modified sociolinguistic interview format" to target a representative sample of forms and frequency of phonological variants. The corpus' .wav files are transcribed and force-aligned at the segment level.
Orthography, phonology, morphology and attestation frequency information for words in English, German and Dutch.
About 1.3 billion words from articles that appeared in the New York Times 1982--2007 with automatically-assigned lemmas and part-of-speech tags.
Between 1.2 and 1.9 billion tokens each of French, German and Italian as crawled from the World Wide Web. Also includes about 800 million tokens of English Wikipedia as it was in 2009. These corpora are annotated with lemmas and parts of speech. For more details see Baroni et al. 2009.
Parallel translations of the Bible into 100 languages. For more information see Christodoulopoulos and Steedman 2015.
A corpus of tweets collected by Jordan Graham as part of her MA thesis. These tweets all include the word "police" and either the hashtag #BlackLivesMatter or the hashtag #BlueLivesMatter, and are dated either May 25--26, 2020 or June 3--4, 2020. Each subset comprises 2000 tweets, except for the May #BlueLivesMatter set, where the scraping operation yielded only 81.