
Commonly Used Natural Language Datasets
Reposted from:
Treebanks and annotated corpora useful for training POS taggers, parsers, etc.
- Penn Treebank
- WSJ Corpus
- NEGRA German corpus
- TIGER corpus
- Alpino Treebank
- BulTreeBank
- Turin University Treebank
- Prague Dependency Treebank
Corpora annotated with semantic relations
- PropBank
- NomBank
- FrameNet
- SALSA
Text classification corpora
- Reuters dataset
- Newsgroup datasets
Parallel corpora used in machine translation
Text summarization
- DUC-2001, 2002, 2003, 2004, 2005, 2006, 2007
- TAC-2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015
- Gigaword
- LCSTS
Machine Reading
- CNN
- Microsoft .09268
- Microsoft MARCO
- SQuAD
Others
- TREC
- SemEval
- Microsoft COCO