
Corpora

CRL has a number of natural language corpora that are available to members of the Center. Some of these corpora are accessible on-line; others exist only on CD-ROM, so special arrangements must be made to use them. Because most of the corpora are proprietary and have usage restrictions, access is available only to researchers affiliated with CRL.

NEW: A Linux port of the Penn Treebank search utility tgrep is now available from CRL.

CHILDES Child Language Data Exchange System. Child language productions from a variety of researchers.
http://childes.psy.cmu.edu/
WordNet 1.6 Lexical database. (See http://www.cogsci.princeton.edu/~wn/)
Linux: /home/corpora/wordnet-1.6
Windows: \\slice.ucsd.edu\corpora\wordnet-1.6
Macintosh: smb://slice.ucsd.edu/corpora/wordnet-1.6

Penn Treebank Penn's Linguistic Data Consortium (LDC) collection, including Brown (Kucera-Francis), Wall Street Journal, and other sources; some text is parsed and can be searched with the tgrep program. (See http://www.ldc.upenn.edu/)
Linux: /home/corpora/treebank
Windows: \\slice.ucsd.edu\corpora\treebank
Macintosh: smb://slice.ucsd.edu/corpora/treebank
North American News Text Corpus Large (~350 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21)
CD-ROM
Wall Street Journal 1987/parsed ~25 million word parsed text from WSJ (text from LDC; parsed version courtesy Eugene Charniak)
Linux: /home/corpora/wsj87
Windows: \\slice.ucsd.edu\corpora\wsj87
Macintosh: smb://slice.ucsd.edu/corpora/wsj87
Spanish Language News Corpus Large (~172 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T9)
CD-ROM
European Languages News Corpus ~100 million words of French, 90 million words of German, and 15 million words of Portuguese; newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T11)
CD-ROM
Hansard Parallel Text in English and French Parallel English/French texts drawn from Canadian Parliament discussions. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T20)
CD-ROM
CELEX Lexical databases (word lemmas, phonology, morphology, frequency) for Dutch, German, and English. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14)
Linux: /home/corpora/celex
Windows: \\slice.ucsd.edu\corpora\celex
Macintosh: smb://slice.ucsd.edu/corpora/celex
British National Corpus (100 million word searchable corpus; Windows software for more extensive searching is also available)
http://sara.natcorp.ox.ac.uk/lookup.html
Linux: /home/corpora/BNC
Windows: \\slice.ucsd.edu\corpora\BNC
Macintosh: smb://slice.ucsd.edu/corpora/BNC

NOTE: Access to the above file locations is restricted. Please contact CRL to request access.
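The network locations listed above follow one pattern per platform, so a script can build the right location from the share name. The following is a minimal sketch, not a CRL-provided tool: the function name `corpus_path` and the `CORPORA` tuple are hypothetical, and the paths are simply those shown in the listing. It only builds the location string; it does not grant or check access.

```python
import platform

# Share names exactly as they appear in the listing above.
CORPORA = ("wordnet-1.6", "treebank", "wsj87", "celex", "BNC")


def corpus_path(name, system=None):
    """Return the listed location of a corpus for a given operating system.

    `system` takes the values returned by platform.system():
    "Linux", "Windows", or "Darwin" (Macintosh). If omitted, the
    current machine's value is used.
    """
    if name not in CORPORA:
        raise ValueError("no network location listed for %r" % name)
    if system is None:
        system = platform.system()
    if system == "Linux":
        return "/home/corpora/" + name
    if system == "Windows":
        # UNC path; backslashes are escaped in the Python literal.
        return "\\\\slice.ucsd.edu\\corpora\\" + name
    if system == "Darwin":
        return "smb://slice.ucsd.edu/corpora/" + name
    raise ValueError("no location listed for platform %r" % system)
```

For example, `corpus_path("treebank", "Linux")` gives `/home/corpora/treebank`. Corpora held only on CD-ROM are deliberately left out of `CORPORA`, since the listing gives no network location for them.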

In addition, there are a large number of electronic databases and other resources that are useful for the psycholinguist, linguist, or computational linguist. A few of these include the following:

MRC Psycholinguistic Database Interface (Kucera-Francis; number of letters/phonemes/syllables; ratings of word familiarity, concreteness, imageability, meaningfulness; age of acquisition; etc.)
http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm
Edinburgh Associative Thesaurus (on-line word association norms)
http://monkey.cis.rl.ac.uk/Eat/htdoc/eat.html
Oxford Text Archive (a collection of several thousand electronic texts and linguistic corpora)
http://ota.ahds.ac.uk/
Association for Computational Linguistics http://www.aclweb.org/
Institute for Scientific Information The Web of Science Citation Databases
http://isi1.isiknowledge.com/portal.cgi
PubMed http://www.ncbi.nlm.nih.gov/PubMed/
UCSD Libraries and Library Resources http://libraries.ucsd.edu
