CRL has a number of natural language corpora that are available to members of the Center. Some of these corpora are accessible on-line; others exist on CD-ROM, so special arrangements must be made to use them. Because most of the corpora are proprietary and have usage restrictions, access is available only to researchers affiliated with CRL.

NEW: A Linux port of the Penn Treebank search utility tgrep is now available from CRL.

CHILDES Child Language Data Exchange System. Child language productions from a variety of researchers.
WordNet 1.6 Lexical database (See
Linux: /home/corpora/wordnet-1.6
Windows: \\\corpora\wordnet-1.6
Macintosh: smb://

Penn Treebank Penn's Linguistic Data Consortium (LDC) collection, including Brown (Kucera-Francis), Wall Street Journal, and other sources; some text is parsed and can be searched with the tgrep program. (See
Linux: /home/corpora/treebank
Windows: \\\corpora\treebank
Macintosh: smb://
North American News Text Corpus Large (~350 million word) corpus of newswire text. (See
Wall Street Journal 1987/parsed ~25 million words of parsed text from WSJ (text from LDC; parsed version courtesy of Eugene Charniak)
Linux: /home/corpora/wsj87
Windows: \\\corpora\wsj87
Macintosh: smb://
Spanish Language News Corpus Large (~172 million word) corpus of newswire text. (See
European Languages News Corpus ~100 million words of French, 90 million words of German, and 15 million words of Portuguese; newswire text. (See
Hansard Parallel Text in English and French Parallel English/French texts drawn from Canadian Parliament discussions. (See
CELEX Lexical databases (word lemmas, phonology, morphology, frequency) for Dutch, German, and English. (See
Linux: /home/corpora/celex
Windows: \\\corpora\celex
Macintosh: smb://
British National Corpus (100 million word searchable corpus; Windows software for more extensive searching is also available)
Linux: /home/corpora/BNC
Windows: \\\corpora\BNC
Macintosh: smb://

NOTE: Access to the above file locations is restricted. Please contact CRL to request access.
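
Since access to these locations is restricted, it can help to confirm which corpus directories your account can actually read before starting work. The following is a minimal sketch, assuming a Linux machine on which the paths listed above are mounted. The function name check_access and the dictionary CORPUS_DIRS are illustrative and are not part of any CRL tool.

```python
import os

# Linux paths exactly as listed on this page. Access is restricted,
# so some or all of them may be unreadable for a given account.
CORPUS_DIRS = {
    "WordNet 1.6": "/home/corpora/wordnet-1.6",
    "Penn Treebank": "/home/corpora/treebank",
    "Wall Street Journal 1987/parsed": "/home/corpora/wsj87",
    "CELEX": "/home/corpora/celex",
    "British National Corpus": "/home/corpora/BNC",
}


def check_access(dirs=CORPUS_DIRS):
    """Map each corpus name to True if its directory exists and the
    current user can read and traverse it, otherwise False."""
    status = {}
    for name, path in dirs.items():
        status[name] = os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)
    return status


if __name__ == "__main__":
    for name, ok in check_access().items():
        print(f"{name}: {'accessible' if ok else 'not accessible'}")
```

If a corpus is reported as not accessible, the directory may be missing on that machine or your account may not have been granted access yet; in the latter case, contact CRL as described above.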

In addition, there are a large number of electronic databases and other resources that are useful for the psycholinguist, linguist, or computational linguist. A few of these include the following:

MRC Psycholinguistic Database Interface (Kucera-Francis; number of letters/phonemes/syllables; ratings of word familiarity, concreteness, imageability, meaningfulness; age of acquisition; etc.)
Edinburgh Associative Thesaurus (on-line word association norms)
Oxford Text Archive (a collection of several thousand electronic texts and linguistic corpora)
Association for Computational Linguistics
Institute for Scientific Information The Web of Science Citation Databases
UCSD Libraries and Library Resources

