Data Sets Available on IRKM Lab Machines

 

Our group has purchased licenses for several data sets, and our industry partners also share some data with us. Please check the license of each individual data set carefully. By default, restrict read access to the data to yourself only, unless you have discussed it with Yi and are clear about the license of the data.

We also crawled some data sets from the internet. If you need to purchase other data sets, please contact Yi.

Some of the data are available at /home/shared/data on irkm.cse.ucsc.edu (Linux).

Some of the data are on \\castlerock (Windows).
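If you keep a working copy of licensed data on the Linux machine, one way to follow the "read access to yourself only" default is to strip the group and other permission bits from that copy. The sketch below is only an illustration under stated assumptions: the function name restrict_to_owner and the example path are hypothetical, and it assumes you own the files you are changing.

```python
import os
import stat


def restrict_to_owner(root):
    """Strip group/other permission bits from root and everything under it.

    Hypothetical helper for illustration; owner bits are left unchanged.
    """
    targets = [root]
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        targets.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    for path in targets:
        if os.path.islink(path):
            continue  # leave symlinks alone rather than chmod their targets
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode & stat.S_IRWXU)  # keep only the owner's rwx bits
```

For example, restrict_to_owner(os.path.expanduser("~/data/my_copy")) would make a (hypothetical) private copy unreadable to other users. It does not change who owns the files, and it should not be run on the shared directory itself.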

For standard information retrieval

The data sets usually include a corpus, a set of queries, and a set of relevance judgements (ask Yi for the data).

  • Reuters Corpus (1996-1997)
  • NIST TREC CD 1-4
  • AQUAINT Corpus
  • TREC Blog data
  • TREC Genomic Track data (2004, 2005)
  • TREC Spam track data
  • TREC Medical Records Track
  • TREC Session Track
  • TREC diversity/novelty related tracks: ask Yi
  • TREC patent search track
  • TREC Twitter search track
  • TREC Web track
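As a rough illustration of how the relevance judgements in these collections are typically consumed, the sketch below reads lines in the common TREC qrels layout: topic ID, iteration, document ID, and relevance label, separated by whitespace. The function name load_qrels is hypothetical, and individual tracks may use variants of this layout, so check the documentation of the track you are working with.

```python
from collections import defaultdict


def load_qrels(lines):
    """Parse TREC-style qrels lines into {topic: {docno: relevance}}.

    Assumes four whitespace-separated columns per line:
    topic ID, iteration, document ID, relevance label.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, relevance = parts
        qrels[topic][docno] = int(relevance)
    return dict(qrels)
```

Because the function accepts any iterable of lines, it can be given an open file object directly, for example the result of open() on a qrels file obtained from Yi.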

Other Data sets

  • Linguistic Data Consortium (UCSC Library Copies)
  • Delicious data sets
  • Epinion data sets
  • An online shopping website's customer and product data
  • America Online (AOL) Query Log
  • Microsoft Live Search Query Log (TBA)
  • Microsoft AdCenter 100 million query log (search queries, ad impressions, ad clicks; TBA)
  • Novelty detection data sets
  • A unified IMDB, MovieLens, and Netflix data set is available on irkm under /home/jonathan/movies.
  • Metafilter stats are on irkm under /home/jonathan/metafilter.com

Ongoing IR Evaluations

Need to get your data annotated?

If you need many annotators for small tasks, try Amazon Mechanical Turk (suggested rate: $2-$4/hr).

If you need a few annotators for a long period of time, try oDesk (https://www.odesk.com/w/odesk_story) or hire work-study undergraduate students ($5/hr for work-study students).

If you need many annotators for many tasks, try building a game (suggested rate: -$1-$0/hr).