
Corpora

CRL maintains a number of natural language corpora that are available to members of the Center. Some of these corpora are accessible on-line; others exist only on CD-ROM, so special arrangements must be made to use them. Because most of the corpora are proprietary and carry usage restrictions, access is available only to researchers affiliated with CRL.

NEW: A Linux port of the Penn Treebank search utility tgrep is now available from CRL.

CHILDES Child Language Data Exchange System. Child language productions from a variety of researchers.
http://childes.psy.cmu.edu/
WordNet 1.6 Lexical database. (See http://www.cogsci.princeton.edu/~wn/)
Linux: /home/corpora/wordnet-1.6
Windows: \\slice.ucsd.edu\corpora\wordnet-1.6
Macintosh: smb://slice.ucsd.edu/corpora/wordnet-1.6

Penn Treebank Penn's Linguistic Data Consortium (LDC) collection, including Brown (Kucera-Francis), Wall Street Journal, and other sources; some text is parsed and can be searched with the tgrep program. (See http://www.ldc.upenn.edu/)
Linux: /home/corpora/treebank
Windows: \\slice.ucsd.edu\corpora\treebank
Macintosh: smb://slice.ucsd.edu/corpora/treebank
North American News Text Corpus Large (~350 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21)
CD-ROM
Wall Street Journal 1987/parsed ~25 million words of parsed text from the WSJ (text from LDC; parsed version courtesy of Eugene Charniak)
Linux: /home/corpora/wsj87
Windows: \\slice.ucsd.edu\corpora\wsj87
Macintosh: smb://slice.ucsd.edu/corpora/wsj87
Spanish Language News Corpus Large (~172 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T9)
CD-ROM
European Languages News Corpus ~100 million words of French, 90 million words of German, and 15 million words of Portuguese; newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T11)
CD-ROM
Hansard Parallel Text in English and French Parallel English/French texts drawn from Canadian Parliament discussions. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T20)
CD-ROM
CELEX Lexical databases (word lemmas, phonology, morphology, frequency) for Dutch, German, and English. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14)
Linux: /home/corpora/celex
Windows: \\slice.ucsd.edu\corpora\celex
Macintosh: smb://slice.ucsd.edu/corpora/celex
British National Corpus (100 million word searchable corpus; Windows software for more extensive searching is also available)
http://sara.natcorp.ox.ac.uk/lookup.html
Linux: /home/corpora/BNC
Windows: \\slice.ucsd.edu\corpora\BNC
Macintosh: smb://slice.ucsd.edu/corpora/BNC

NOTE: Access to the above file locations is restricted. Please contact CRL to request access.
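
The on-line locations listed above follow one pattern per platform: a local path under /home/corpora on Linux, a UNC share on Windows, and an smb:// URL on Macintosh, each ending in the same directory name. As a minimal sketch of that pattern (the function name `corpus_location` and the `CORPUS_DIRS` tuple are hypothetical helpers introduced here for illustration, not CRL tools; only the path prefixes and directory names come from the listing), the location for a given corpus could be built like this:

```python
import sys

# Directory names exactly as they appear in the listing above.
CORPUS_DIRS = ("wordnet-1.6", "treebank", "wsj87", "celex", "BNC")


def corpus_location(name, platform=None):
    """Return the listed location of a corpus directory for a platform.

    `platform` uses the same identifiers as sys.platform
    ("linux", "win32", "darwin"); it defaults to the current machine.
    This only builds the string; it does not check that access
    has been granted or that the share is mounted.
    """
    if name not in CORPUS_DIRS:
        raise ValueError(f"unknown corpus directory: {name!r}")
    if platform is None:
        platform = sys.platform
    if platform.startswith("win"):
        # Windows: UNC path on the slice.ucsd.edu share
        return "\\\\slice.ucsd.edu\\corpora\\" + name
    if platform == "darwin":
        # Macintosh: SMB URL for the same share
        return "smb://slice.ucsd.edu/corpora/" + name
    # Linux (and anything else): local path
    return "/home/corpora/" + name
```

For example, `corpus_location("treebank", "linux")` gives /home/corpora/treebank. The corpora that exist only on CD-ROM are deliberately left out, since the listing gives no on-line location for them.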

In addition, there are a large number of electronic databases and other resources that are useful for the psycholinguist, linguist, or computational linguist. A few of these include the following:

MRC Psycholinguistic Database Interface (Kucera-Francis; number of letters/phonemes/syllables; ratings of word familiarity, concreteness, imageability, meaningfulness; age of acquisition; etc.)
http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm
Edinburgh Associative Thesaurus (on-line word association norms)
http://monkey.cis.rl.ac.uk/Eat/htdoc/eat.html
Oxford Text Archive (a collection of several thousand electronic texts and linguistic corpora)
http://ota.ahds.ac.uk/
Association for Computational Linguistics http://www.aclweb.org/
Institute for Scientific Information The Web of Science Citation Databases
http://isi1.isiknowledge.com/portal.cgi
PubMed http://www.ncbi.nlm.nih.gov/PubMed/
UCSD Libraries and Library Resources http://libraries.ucsd.edu

Announcements

CRL is excited to present the latest CRL Newsletter, featuring the technical report:
Flexible use of perceptuomotor knowledge in lexical and semantic decision tasks
Ben D. Amsel

Congratulations to Dr. Marta Kutas on being awarded the 2015 Distinguished Career Contributions Award. Dr. Kutas will give her award lecture on Saturday, March 28, 2015 in San Francisco.