Text Mining - Datasets

List of ACM Abstracts

Short description	List of 10934 URLs of ACM Papers used for keyword extraction
Description	The collection consists of 10934 URLs of ACM abstracts. Keywords are available for each article. There are 27336 distinct keywords, 21634 of which are used only once in the collection; 2 keywords (evaluation and security) are used more than 100 times. The great majority of keywords, 21642, consist of more than one word. Our multiword detection algorithm identified 4817 multiwords. Each article has at least 1 and at most 10 keywords, with an average of 4.5 keywords. We consider this set of keyword annotations a gold standard and evaluate extracted keywords by computing precision and recall against it. Obviously, many terms that were not selected by the authors or editors of an article might nevertheless be good keywords. About 52% of the articles have a keyword that is selected as a candidate term after preprocessing. Thus the theoretically optimal precision if only one keyword is selected is 0.52.
Goal	Keyword extraction from abstracts.
Results	Prec@5: 0.095 - Recall@5: 0.11 - F1@5: 0.10
Source	Wartena, Christian, Rogier Brussee and Wout Slakhorst. "Keyword Extraction Using Word Co-occurrence". In 2010 21st International Conference on Database and Expert Systems Applications, Bilbao, 2010, pp. 54-58. doi: 10.1109/DEXA.2010.32
Download	ACM-URLs.txt
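The Prec@5, Recall@5 and F1@5 figures above compare the five top-ranked extracted keywords of an article with its author-assigned keywords. The following is a minimal sketch of such a scoring function. The function names, the exact string matching and the averaging over articles are assumptions made for illustration; the paper may normalise keywords or average differently.

```python
def precision_recall_f1_at_k(ranked, gold, k=5):
    """Score the top-k extracted keywords of one article against its gold keywords.

    ranked: extracted keywords, best first, without duplicates (assumption).
    gold:   keywords assigned by the authors or editors.
    """
    top_k = ranked[:k]
    gold_set = set(gold)
    hits = sum(1 for keyword in top_k if keyword in gold_set)
    # Precision is taken over k slots, even if fewer than k keywords were extracted.
    precision = hits / k
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(scores):
    """Average (precision, recall, f1) triples over all articles."""
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


if __name__ == "__main__":
    # Toy article: the keyword lists are invented for this example.
    ranked = ["security", "access control", "evaluation", "privacy", "network"]
    gold = ["security", "privacy", "intrusion detection", "firewalls"]
    print(precision_recall_f1_at_k(ranked, gold, k=5))
```

Because an article has at most 10 and on average 4.5 gold keywords, precision at 5 cannot reach 1.0 for most articles, which is one reason the absolute scores on this dataset are low.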

Synonyms (Frequent and infrequent words from Wordnet)

Short description	List of 849 pairs of synonyms and 8967 pairs of non-synonymous English words.
Description	The dataset consists of pairs of English words collected from Wordnet. We started with a list of pairs (w1, w2) such that (1) w1 and w2 are single words, (2) w1 occurs at least two times in the British National Corpus and (3) w1 and w2 share at least one sense. This resulted in a list of 24 576 word pairs. From this list we selected all pairs for which the Jaccard coefficient of the sets of senses of the two words is at least 0.7. After filtering out all pairs containing a word that was not found in the ukWaC corpus, a list of 849 pairs remained. These word pairs are considered synonyms in the following. Next, the second components of the 24 576 word pairs were reordered randomly. The resulting list of new word pairs was filtered such that both words of each pair occur in the ukWaC corpus and have no sense in common. This resulted in a list of 8967 word pairs.
Goal	Distinguish synonyms from non-synonyms.
Results	AUC = 0.86
Source	Wartena, Christian. “Distributional Similarity of Words with Different Frequencies.” In Distributional Similarity of Words with Different Frequencies, pp 8–11, 2013.
Further papers	Wartena, Christian. “On the Effect of Word Frequency on Distributional Similarity.” In Proceedings of the 12th Edition of the Konvens Conference, Hildesheim, Germany, October 8-10, 2014, edited by Josef Ruppenhofer and Gertrud Faaß, 1–10. Universitätsbibliothek Hildesheim, 2014.
Download	Synonymlist.zip
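The two selection steps described above, the Jaccard threshold on sense sets and the random re-pairing for negative examples, can be sketched as follows. This is an illustration under assumptions: the `senses` mapping from a word to its set of sense identifiers is a hypothetical stand-in for a Wordnet lookup, and the corpus frequency filters (British National Corpus, ukWaC) are left out.

```python
import random


def jaccard(a, b):
    """Jaccard coefficient |A n B| / |A u B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_synonyms(pairs, senses, threshold=0.7):
    """Keep the pairs whose sense sets overlap with a Jaccard coefficient >= threshold."""
    return [
        (w1, w2)
        for w1, w2 in pairs
        if jaccard(senses[w1], senses[w2]) >= threshold
    ]


def shuffled_negatives(pairs, senses, seed=0):
    """Re-pair first components with randomly reordered second components
    and keep only the pairs that have no sense in common."""
    firsts = [w1 for w1, _ in pairs]
    seconds = [w2 for _, w2 in pairs]
    random.Random(seed).shuffle(seconds)
    return [
        (w1, w2)
        for w1, w2 in zip(firsts, seconds)
        if not set(senses[w1]) & set(senses[w2])
    ]


if __name__ == "__main__":
    # Placeholder words with invented sense identifiers.
    senses = {"alpha": {1, 2, 3}, "beta": {1, 2, 3, 4}, "gamma": {3, 5, 6, 7}}
    pairs = [("alpha", "beta"), ("alpha", "gamma")]
    print(select_synonyms(pairs, senses))
```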

Synonyms from SC53

Short description	List of 1752 pairs of synonyms and 1752 pairs of non-synonymous English words.
Description	The SC53 dataset is derived from the Bullinaria and Levy (2007) dataset for a semantic classification task, which contains 530 words taken from 53 semantic categories (10 words per category). From the SC53 dataset, 3504 pairs of semantically related and unrelated words have been constructed: 1752 pairs of words belong to the same category, while 1752 randomly chosen pairs consist of words from two different categories. The pairs of words that belong to the same category are considered positive examples, or semantically related pairs, since they share the same semantic category. The pairs built from different semantic categories are considered negative examples.
Goal	Distinguish synonyms from non-synonyms.
Results	Accuracy = 0.97
Source	Aga, Rosa Tsegaye, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme. "Learning thesaurus relations from distributional features". In Proceedings of the 10th Edition of the LREC Conference, Portorož, Slovenia, 23 - 28 May 2016
Download	SC53.rar
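Building positive pairs from words of the same category and an equal number of random negative pairs from different categories can be sketched as below. The function name and the toy categories are assumptions for illustration, and the code assumes that every word belongs to exactly one category. Note that 53 categories of 10 words allow 53 × 45 = 2385 same-category pairs, so the 1752 positive pairs of the dataset are a subset; the description above does not say how that subset was chosen, and the sketch simply enumerates all pairs.

```python
import itertools
import random


def build_pairs(categories, n_negative=None, seed=0):
    """categories: dict mapping a category name to a list of words.

    Returns (positives, negatives): all same-category word pairs and
    n_negative distinct random pairs of words from two different categories.
    """
    rng = random.Random(seed)
    positives = [
        pair
        for words in categories.values()
        for pair in itertools.combinations(sorted(words), 2)
    ]
    if n_negative is None:
        n_negative = len(positives)

    # Guard against an endless loop: there are only this many cross-category pairs.
    total = sum(len(words) for words in categories.values())
    max_negative = total * (total - 1) // 2 - len(positives)
    if n_negative > max_negative:
        raise ValueError("not enough cross-category pairs available")

    names = sorted(categories)
    negatives = set()
    while len(negatives) < n_negative:
        c1, c2 = rng.sample(names, 2)  # two different categories
        pair = (rng.choice(categories[c1]), rng.choice(categories[c2]))
        negatives.add(tuple(sorted(pair)))  # store each unordered pair once
    return positives, sorted(negatives)


if __name__ == "__main__":
    toy = {
        "fruit": ["apple", "pear", "plum"],
        "tool": ["saw", "drill", "file"],
        "bird": ["owl", "crow", "wren"],
    }
    positives, negatives = build_pairs(toy, seed=1)
    print(len(positives), len(negatives))
```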

Eurovoc Level 2 Synonyms

Short description	List of 10 000 pairs of synonyms and 10 000 pairs of non-synonymous English words.
Description	Eurovoc is a multilingual thesaurus developed by the European Commission's Publications Office as a controlled vocabulary for the intellectual indexing of documents (Office for Official Publications of the European Communities, 1995). The Eurovoc thesaurus is divided into 127 micro-thesauri. A collection of pairs has been built by selecting, from this data set, 10 000 pairs in which both words are taken from the same micro-thesaurus and 10 000 pairs in which the words are taken from two different micro-thesauri.
Goal	Distinguish synonyms from non-synonyms.
Results	Accuracy = 0.92
Source	Aga, Rosa Tsegaye, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme. "Learning thesaurus relations from distributional features". In Proceedings of the 10th Edition of the LREC Conference, Portorož, Slovenia, 23 - 28 May 2016
Download	Eurovoc2.rar

Eurovoc Level 1 Synonyms

Short description	List of 10 000 pairs of synonyms and 10 000 pairs of non-synonymous English words.
Description	The top-level concepts of the Eurovoc micro-thesauri described above, 528 in total, have been taken as semantic categories. For each category, all narrower concepts have been collected and their preferred and alternative labels have been considered as terms for that category. Then all terms that belong to more than one category or that consist of more than two words have been removed. Finally, all categories that contain fewer than 10 terms have been removed. This leaves 190 categories with a total of 2386 terms. The largest category consists of 44 terms. From this dataset two sets of pairs have been constructed. The first set has 10 000 pairs of words belonging to the same category, and the second set has 10 000 randomly chosen pairs of words from two different categories. We refer to this set of pairs as Eurovoc1, since the terms are equivalent when going up one level in the Eurovoc concept hierarchy.
Goal	Distinguish synonyms from non-synonyms.
Results	Accuracy = 0.88
Source	Aga, Rosa Tsegaye, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme. "Learning thesaurus relations from distributional features". In Proceedings of the 10th Edition of the LREC Conference, Portorož, Slovenia, 23 - 28 May 2016
Download	Eurovoc1.rar

Eurovoc Level 0 Synonyms

Short description	List of 2175 pairs of synonyms and 2335 pairs of non-synonymous English words.
Description	Pairs of words from the original Eurovoc thesaurus (before applying the modifications for Eurovoc1 and Eurovoc2) have been sampled by taking preferred and alternative labels of the same concept as synonymous terms, and pairs of labels of different concepts as non-synonymous pairs. For the negative examples, an equal distribution of easy and difficult pairs has been ensured. Thus, 20% of the pairs consist of words from concepts at a distance of 1 step, where a step may follow any relation specified in the thesaurus. A further 20% have been taken from concepts at a distance of 2 steps, and so on. For the last 20%, pairs of concepts at a distance of at least 5 steps have been used. In all cases it has been ensured that no shorter path exists. For this set, 2175 synonymous and 2335 non-synonymous word pairs have been taken.
Goal	Distinguish synonyms from non-synonyms.
Results	Accuracy = 0.84
Source	Aga, Rosa Tsegaye, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme. "Learning thesaurus relations from distributional features". In Proceedings of the 10th Edition of the LREC Conference, Portorož, Slovenia, 23 - 28 May 2016
Download	Eurovoc0.rar
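The distance-based selection of negative examples can be sketched with a breadth-first search, which yields the shortest path length between concepts and thereby guarantees that no shorter path exists. This is a sketch under assumptions: the thesaurus is represented as a hypothetical undirected adjacency dict in which every relation counts as one step, concepts that are not connected at all are ignored, and the subsequent step of picking labels for the sampled concepts is left out.

```python
import random
from collections import deque


def distances_from(graph, start):
    """Breadth-first search: shortest number of steps from start to each reachable concept."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def stratified_negative_concept_pairs(graph, per_bucket, seed=0):
    """Sample up to per_bucket concept pairs for each distance 1, 2, 3, 4 and >= 5.

    graph: dict mapping a concept to its neighbours; must be symmetric.
    Returns a dict from bucket (1..5, where 5 means 'at least 5') to a list of pairs.
    """
    rng = random.Random(seed)
    buckets = {1: [], 2: [], 3: [], 4: [], 5: []}
    for source in sorted(graph):
        for target, d in distances_from(graph, source).items():
            # source < target keeps each unordered pair once and skips d == 0.
            if source < target:
                buckets[min(d, 5)].append((source, target))
    return {
        bucket: rng.sample(pairs, min(per_bucket, len(pairs)))
        for bucket, pairs in buckets.items()
    }


if __name__ == "__main__":
    # Toy thesaurus: a chain of seven concepts c0 - c1 - ... - c6.
    nodes = [f"c{i}" for i in range(7)]
    graph = {node: [] for node in nodes}
    for a, b in zip(nodes, nodes[1:]):
        graph[a].append(b)
        graph[b].append(a)
    print(stratified_negative_concept_pairs(graph, per_bucket=2, seed=3))
```

Drawing the same number of pairs from every bucket gives each distance class a share of 20%, as in the description above.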

STW Level 1 Synonyms

Short description	List of 10 000 pairs of synonyms and 10 000 pairs of non-synonymous English words.
Description	STW (Standard-Thesaurus Wirtschaft) is a German economic thesaurus. It is divided into 7 sub-thesauri (Economics, Business Administration, Economic Sectors, Products, Neighboring Sciences, Geographical Terms, General Words). Each part consists of a hierarchy of notations and descriptors. Descriptors have broader, narrower, and related terms. We took all terms (i.e. labels of descriptors) from 6 of the sub-thesauri (leaving out the sub-thesaurus with general words) that belong to only one notation and consist of at most 2 words. Subsequently, all words belonging to a notation with fewer than 5 terms in the sample have been removed. This gives us 419 concepts (one concept per notation) with a total of 11 599 terms. There are 5 concepts with over 100 terms. The largest concept has 233 terms. From this set, random pairs of words have been selected, 10 000 from the same concept and 10 000 from different concepts. We restricted the selection to words occurring at least once in deWaC. This set is referred to as STW1.
Goal	Distinguish synonyms from non-synonyms.
Results	Accuracy = 0.80
Source	Aga, Rosa Tsegaye, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme. "Learning thesaurus relations from distributional features". In Proceedings of the 10th Edition of the LREC Conference, Portorož, Slovenia, 23 - 28 May 2016
Download	STW1.rar

STW Level 0 Synonyms

Short description	List of 10 000 pairs of synonyms and 10 000 pairs of non-synonymous English words.
Description	The STW0 dataset has been constructed following the procedure described above for the STW1 dataset. Pairs of words have been selected in which the terms are labels of the same descriptor or of different descriptors. This dataset has 10 000 positive and 10 000 negative pairs. The distribution of negative pairs has been controlled in the same way as for the Eurovoc0 dataset.
Goal	Distinguish synonyms from non-synonyms.
Results	Accuracy = 0.71
Source	Aga, Rosa Tsegaye, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme. "Learning thesaurus relations from distributional features". In Proceedings of the 10th Edition of the LREC Conference, Portorož, Slovenia, 23 - 28 May 2016
Download	STW0.rar

STW German Synonyms

Short description	List of 66 pairs of synonyms and 66 pairs of non-synonymous German words extracted from the German economic thesaurus STW.
Description	Each positive pair consists of a preferred and an alternative label of the same concept. The negative pairs are chosen randomly and are guaranteed not to be labels of the same concept.
Goal	Distinguish synonyms from non-synonyms.
Results	AUC = 0.94
Source	Tutorial on Distributional Semantics
Download	STW-Syn.txt
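The AUC reported for this task can be read as the probability that a randomly chosen synonym pair receives a higher similarity score than a randomly chosen non-synonym pair. The sketch below computes it directly from this definition. The similarity scores are assumed to come from some distributional similarity measure that is not part of this sketch, and the example values are invented.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) score pairs in which the
    positive example scores higher; ties count as half a win.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Invented similarity scores for three synonym and three non-synonym pairs.
    synonym_scores = [0.9, 0.8, 0.4]
    non_synonym_scores = [0.7, 0.3, 0.2]
    print(auc(synonym_scores, non_synonym_scores))
```

The quadratic loop is adequate for a list of 66 + 66 pairs; for large datasets a rank-based computation would be the usual choice.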

Semantic Classification - Eurovoc Data

Short description	List of 2386 English words under 190 semantic categories.
Description	Eurovoc is a multilingual thesaurus developed by the European Commission's Publications Office as a controlled vocabulary for the intellectual indexing of documents (Office for Official Publications of the European Communities, 1995). The Eurovoc thesaurus is divided into 127 micro-thesauri. The top-level concepts of these micro-thesauri, 528 in total, have been taken as semantic categories. For each category, all narrower concepts have been collected and their preferred and alternative labels have been considered as terms for that category. Then all terms that belong to more than one category or that consist of more than two words have been removed. Finally, all categories that contain fewer than 10 terms have been removed. This leaves 190 categories with a total of 2386 terms.
Goal	Classifying semantically related words.
Results	Accuracy = 0.71
Source	Aga, Rosa Tsegaye, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme. "Integrating Distributional and Lexical Information for Semantic Classification of Words using MRMF". In Proceedings of the 26th Edition of the COLING Conference, Osaka, Japan, 13 - 16 December 2016
Download	SC_Eurovoc.rar
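The filtering steps described above can be sketched as follows, in the same order: first drop ambiguous and long terms, then drop small categories. The function name and the toy input are assumptions for illustration, and words are counted by splitting on whitespace.

```python
from collections import Counter


def filter_categories(categories, max_words=2, min_terms=10):
    """categories: dict mapping a category name to an iterable of terms.

    1. Remove terms that occur in more than one category.
    2. Remove terms that consist of more than max_words words.
    3. Remove categories left with fewer than min_terms terms.
    """
    # Number of categories in which each term occurs.
    membership = Counter(
        term for terms in categories.values() for term in set(terms)
    )
    filtered = {}
    for name, terms in categories.items():
        kept = sorted(
            term
            for term in set(terms)
            if membership[term] == 1 and len(term.split()) <= max_words
        )
        if len(kept) >= min_terms:
            filtered[name] = kept
    return filtered


if __name__ == "__main__":
    # Toy input with placeholder terms; min_terms is lowered to fit the example.
    toy = {
        "A": ["x y", "z", "shared", "one two three"],
        "B": ["shared", "q"],
    }
    print(filter_categories(toy, min_terms=2))
```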

Abbreviation Disambiguation - Coling 2018 Data

Short description	Set of 19 954 examples for 4 365 ambiguous acronyms from image captions in scientific papers along with their contextually correct definition from different domains.
Description	The NOA project has a large collection of papers, comprising 712,438 papers from open access journals at the time the dataset was created. This collection contains journals from almost all scientific disciplines. We found 2,838,713 occurrences of words of at least 3 letters written in all capitals, which are thus likely to be acronyms. For 25,336 acronyms with a total of 628,470 occurrences we found a definition in the same paper using a simple regular expression. For 379,509 additional acronyms we could find an unambiguous definition in another paper from the same journal. In total about 36% of all potential acronyms could be expanded. For 67% of the remaining acronyms we have multiple possible expansions from other papers. From our database we collected a set of acronyms that have at least two and at most five definitions that are not too similar. As a criterion for similarity we used trigram overlap, requiring that the Jaccard index of the sets of trigrams be smaller than 0.4. This gave us 4,365 acronyms.
Goal	Disambiguation of abbreviation expansions.
Results	Accuracy = 0.84
Source	Charbonnier, Jean and Christian Wartena. "Using Word Embeddings for Unsupervised Acronym Disambiguation". In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20 - 26 August 2018
Download	abbr.zip
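Two steps of the description above lend themselves to a short sketch: spotting acronym candidates (words of at least 3 capital letters) and checking that the definitions of an acronym are sufficiently dissimilar. The code rests on assumptions: trigrams are taken to be character trigrams of the lower-cased definition, the threshold is applied to every pair of definitions, and the regular expression used to find definitions in the paper is not reproduced because it is not given here. The example definitions are invented.

```python
import re
from itertools import combinations

# A word consisting of three or more capital letters only.
ACRONYM = re.compile(r"\b[A-Z]{3,}\b")


def acronym_candidates(text):
    """Return all words in text that are written in capitals and have >= 3 letters."""
    return ACRONYM.findall(text)


def char_trigrams(text):
    """Set of character trigrams of the lower-cased text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_jaccard(a, b):
    """Jaccard index of the trigram sets of two definitions."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def is_ambiguous(definitions, threshold=0.4):
    """True if there are 2 to 5 definitions and every pair is dissimilar."""
    return 2 <= len(definitions) <= 5 and all(
        trigram_jaccard(a, b) < threshold
        for a, b in combinations(definitions, 2)
    )


if __name__ == "__main__":
    print(acronym_candidates("The MRI and CT scans in the USA"))
    print(is_ambiguous(["magnetic resonance imaging", "mean reciprocal index"]))
    print(is_ambiguous(["magnetic resonance imaging", "magnetic resonance image"]))
```

The similarity check keeps near-duplicate expansions, such as spelling or inflection variants of one definition, from being counted as separate meanings.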

Predicting the Concreteness for German Words - SwissText Konvens 2020 Merged Data

Short description	A merged dataset from the different sources mentioned in the paper with normalized values for all words.
Description	To create a larger dataset for use with machine learning algorithms, we merged the largest known datasets of German words with concreteness values. We used words present in all datasets as calibration words and normalized the values to a common scale accordingly. Please see the paper for the exact references to the datasets used.
Goal	Predicting the concreteness of German words.
Results	Pearson Correlation (5-fold cross validation) = 0.861
Source	Charbonnier, Jean and Christian Wartena. "Predicting the Concreteness for German Words". To be published.
Download	MergedConcreteness.zip
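One way to normalise several rating datasets to a common scale with the help of calibration words is to fit a linear mapping on the shared words and apply it to all words of a dataset. The sketch below does this with an ordinary least-squares fit and also shows the Pearson correlation used in the reported result. The choice of a linear fit is an assumption, not necessarily the procedure of the paper, and the words and values in the example are placeholders.

```python
import math


def fit_linear(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rescale(source, reference):
    """Map the values of source onto the scale of reference.

    Both arguments are dicts from word to rating. Words present in both
    serve as calibration words; at least two with different values are needed.
    """
    shared = sorted(set(source) & set(reference))
    slope, intercept = fit_linear(
        [source[w] for w in shared], [reference[w] for w in shared]
    )
    return {word: slope * value + intercept for word, value in source.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Placeholder ratings: reference on a 1-5 scale, source on a 0-100 scale.
    reference = {"a": 1.0, "b": 3.0, "c": 5.0}
    source = {"a": 0.0, "b": 50.0, "c": 100.0, "d": 25.0}
    print(rescale(source, reference))
```

With more than two datasets, each one would be mapped onto the same reference scale in this way before merging.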