This is a sampling of the corpora that are housed in our repositories. 


Arabic Treebank Part 1, Part 2, & Part 3

With over 1800 newswire stories from various outlets, this fantastic collection has morphology, gloss, and syntactic treebank annotation in accordance with the Penn Arabic Treebank (PATB). With hundreds of thousands of tokens, this is an excellent resource not only for studying Arabic but also for natural language processing. Many publications have already been based on this data set.

Sample publications: one, two


BYU corpora: COCA, COHA, GloWbE Corpus

These renowned corpora have been the source for a great many research projects and publications. COCA is an incredible, balanced corpus that is updated with 20 million words per year. COHA (Corpus of Historical American English), with more than 400 million words, is a great resource for tracing the rise and fall of word usage in English since 1810.

These corpora can be found online for limited use here: