Center for Research in Language, UCSD

Resources

Corpora

CRL has a number of natural language corpora that are available to members of the Center. Some of these corpora are accessible on-line; others exist on CD-ROM, so special arrangements must be made to use them. Because most of the corpora are proprietary and have usage restrictions, access is available only to researchers affiliated with CRL.

NEW: A Linux port of the Penn Treebank search utility tgrep is now available from CRL.

CHILDES Child Language Data Exchange System. Child language productions from a variety of researchers.
http://childes.psy.cmu.edu/
WordNet 1.6 Lexical database (See http://www.cogsci.princeton.edu/~wn/)
server: crl.ucsd.edu folder: /home/corpora/celex
Penn Treebank Penn's Linguistic Data Consortium (LDC) collection, including Brown (Kucera-Francis), Wall Street Journal, and other sources; some text is parsed and can be searched with the tgrep program. (See http://www.ldc.upenn.edu/)
server: crl.ucsd.edu folder: /home/corpora/treebank
North American News Text Corpus Large (~350 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21)
CD-ROM
Wall Street Journal 1987/parsed ~25 million word parsed text from WSJ (text from LDC; parsed version courtesy Eugene Charniak)
server: crl.ucsd.edu folder: /home/corpora/wsj87
Spanish Language News Corpus Large (~172 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T9)
CD-ROM
European Languages News Corpus ~100 million words of French, 90 million words of German, and 15 million words of Portuguese; newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T11)
CD-ROM
Hansard Parallel Text in English and French Parallel English/French texts drawn from Canadian Parliament discussions. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T20)
CD-ROM
CELEX Lexical databases (word lemmas, phonology, morphology, frequency) for Dutch, German, and English. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14)
server: crl.ucsd.edu folder: /home/corpora/celex
British National Corpus (100 million word searchable corpus; Windows software for more extensive searching is also available)
http://sara.natcorp.ox.ac.uk/lookup.html
server: crl.ucsd.edu folder: /home/corpora/BNC (permission for access required)

NOTE: The folder locations listed above are also accessible via AppleShare, through the CRL Windows workgroup, or by SFTP using your authorized account. If you are unable to access one of these folders, contact CRL.

In addition, a large number of electronic databases and other resources are useful for the psycholinguist, linguist, or computational linguist. A few of these are listed below:

MRC Psycholinguistic Database Interface (Kucera-Francis; number of letters/phonemes/syllables; ratings of word familiarity, concreteness, imageability, meaningfulness; age of acquisition; etc.)
http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm
Edinburgh Associative Thesaurus (on-line word association norms)
http://monkey.cis.rl.ac.uk/Eat/htdoc/eat.html
Oxford Text Archive (a collection of several thousand electronic texts and linguistic corpora)
http://ota.ahds.ac.uk/
Association for Computational Linguistics http://www.aclweb.org/
Institute for Scientific Information The Web of Science Citation Databases
http://isi1.isiknowledge.com/portal.cgi
PubMed http://www.ncbi.nlm.nih.gov/PubMed/
UCSD Libraries and Library Resources http://libraries.ucsd.edu