# Sources

The evaluation datasets for German come from four sources:

1. Deep Semantic Analogies Dataset
2. Bachelor's thesis by Andreas Müller: https://devmount.github.io/GermanWordEmbeddings/
3. Word Similarity: https://www.informatik.tu-darmstadt.de/ukp/research_6/data/semantic_relatedness/german_relatedness_datasets/index.en.jsp
4. Multilingual SimLex999 and WordSim353: http://leviants.com/ira.leviant/MultilingualVSMdata.html

Deep Semantic Analogies Dataset
--------------------------------------------

Paper: https://www.aclweb.org/anthology/W15-0105

This collection contains six newly created semantic datasets, one per file:

* de_re-rated_Schm280.txt
* de_sem-para_SemRel.txt
* en_sem-para_BLESS.txt
* en_sem-para_SemRel.txt
* de_toefl_subset.txt
* de_trans_Google_analogies.txt

For a detailed description of the data, please refer to the paper (see reference below).

For questions, please contact Maximilian Koeper (koepermn@ims.uni-stuttgart.de), Christian Scheible (scheibcn@ims.uni-stuttgart.de), or Sabine Schulte im Walde (schulte@ims.uni-stuttgart.de).

File descriptions:
------------------

* de_re-rated_Schm280.txt contains the re-rated version of the Schm280 set (Schmidt et al. 2001).
  Schm280 consists of 280 translated word pairs from WordSim353. We re-rated these pairs, asking
  10 judges under the same conditions as in WordSim353. We call the resulting dataset WordSim280.
  Each line contains a word pair and the mean similarity score in [0,10].

* en_sem-para_SemRel.txt and de_sem-para_SemRel.txt contain analogy questions based on the word
  pairs from (Scheible and Schulte im Walde, 2014). Each question is of the form A:B::C:D.
  The questions cover the relations adj-antonym, noun-hyperonym, noun-synonym, noun-antonym, and
  verb-antonym. For more details, please refer to the paper.
  Each file consists of several sections (delimited by header lines), one per relation.
  Within a section, each line lists the four related words A, B, C, and D of an analogy
  "A is to B as C is to D".

* en_sem-para_BLESS.txt was constructed the same way as the SemRel datasets, but based on
  hyperonymy and meronymy relations from the BLESS dataset (Baroni & Lenci, 2011).
  The format is the same as for the SemRel files.

* de_toefl_subset.txt is a subset of the German word choice questions from the University of
  Darmstadt (Mohammad et al., 2007). We removed all questions that contain phrases in order to
  obtain a challenge of a difficulty comparable to the English TOEFL data.
  Each line contains a question of the form "stem correct_answer distractor1 distractor2 distractor3".

* de_trans_Google_analogies.txt is the German translation of the Google (Mikolov et al., 2013a)
  analogy set. We omit the adjective-adverb relation as this distinction does not exist in German.
  The format is again the same as for the SemRel files.

Reference:
----------

@inproceedings{KoeperScheibleSchulte2015IWCS,
  title = {Multilingual Reliability and ``Semantic'' Structure of Continuous Word Spaces},
  author = {Maximilian K\"oper and Christian Scheible and Sabine {Schulte im Walde}},
  booktitle = {Proceedings of the 11th International Conference on Computational Semantics (IWCS 2015) -- Short Papers},
  address = {London, UK},
  year = {2015}
}
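
Reading the files (sketch):
---------------------------

The line formats described under "File descriptions" are simple enough to read with a few lines of Python. The following is a minimal sketch, not an official loader. It rests on assumptions that this README does not state: fields are separated by whitespace, each word is a single token, and section header lines in the analogy files start with ":" (as in the original Google analogy set). The function names and the `header_prefix` parameter are hypothetical; adjust them after inspecting the actual files.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

Analogy = Tuple[str, str, str, str]  # (A, B, C, D): "A is to B as C is to D"


class ChoiceQuestion(NamedTuple):
    stem: str
    correct: str
    distractors: List[str]


def parse_analogies(
    lines: Iterable[str], header_prefix: str = ":"
) -> Dict[str, List[Analogy]]:
    """Group analogy questions by relation section.

    ASSUMPTION: header lines start with `header_prefix`; the README only
    says that sections are "delimited by header lines".
    """
    sections: Dict[str, List[Analogy]] = {}
    current = "unlabeled"  # used if questions appear before any header
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(header_prefix):
            current = line[len(header_prefix):].strip()
            sections.setdefault(current, [])
            continue
        words = line.split()  # ASSUMPTION: whitespace-separated fields
        if len(words) != 4:
            raise ValueError(f"expected 4 words, got {len(words)}: {line!r}")
        sections.setdefault(current, []).append(
            (words[0], words[1], words[2], words[3])
        )
    return sections


def parse_choice_question(line: str) -> ChoiceQuestion:
    """Parse 'stem correct_answer distractor1 distractor2 distractor3'."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    stem, correct, *distractors = fields
    return ChoiceQuestion(stem, correct, distractors)


def parse_similarity(line: str) -> Tuple[str, str, float]:
    """Parse a word pair followed by its mean similarity score in [0, 10]."""
    word1, word2, score = line.split()
    return word1, word2, float(score)


# Usage, assuming the files are UTF-8 encoded and in the working directory:
# with open("de_trans_Google_analogies.txt", encoding="utf-8") as f:
#     sections = parse_analogies(f)
# with open("de_toefl_subset.txt", encoding="utf-8") as f:
#     questions = [parse_choice_question(l) for l in f if l.strip()]
```

The parsers raise `ValueError` on lines with an unexpected number of fields instead of skipping them, so a wrong assumption about the separator or header format shows up immediately.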