What is Multi-SimLex?
A large multilingual resource for lexical semantics
Multi-SimLex is a large-scale multilingual resource for lexical semantics. The current version of Multi-SimLex provides human judgments on the semantic similarity of word pairs for as many as 12 monolingual and 66 cross-lingual datasets. The languages covered are typologically diverse and represent both major languages (e.g., Mandarin Chinese, Spanish, Russian) and less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels.
A new benchmark for evaluating representation learning models
Multi-SimLex datasets can be used as a new benchmark to evaluate multilingual and cross-lingual representation learning architectures on the semantic similarity task. Thanks to its extensive coverage, Multi-SimLex provides novel opportunities for experimental evaluation and analysis. We have already evaluated a range of state-of-the-art monolingual and cross-lingual representation models on its monolingual and cross-lingual benchmarks. Please see the Multi-SimLex paper for benchmarks and results.
A collaborative ongoing community project
Together with the first batch of Multi-SimLex data, we are launching a community effort to extend this resource to many more of the world's 7,000+ languages. We release the guidelines we developed and used for creating Multi-SimLex, and we hope they will encourage others to translate and annotate Multi-SimLex-style datasets for additional languages. Please get in touch with us if you have any questions or when you have a dataset to submit to our Multi-SimLex repository! And THANK YOU in advance - the resulting extensive resource will be hugely valuable for advancing multilingual NLP.
Download Multi-SimLex
Download the entire current Multi-SimLex dataset in two files: 1) a translation file containing the translations of all 1,888 original English word pairs into the currently available languages, and 2) a scores file containing the corresponding similarity scores. Each pair has a unique ID, which can be used to index into both files.
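Because both files share the pair ID, they can be joined directly. The snippet below is a minimal sketch in Python, assuming hypothetical file names (translations.csv, scores.csv) and an ID column; the actual file and column names may differ in the release.

```python
import pandas as pd

# Hypothetical file and column names -- check the released files for the
# actual layout before running this.
translations = pd.read_csv("translations.csv")  # translated word pairs, one row per pair ID
scores = pd.read_csv("scores.csv")              # similarity scores, one row per pair ID

# Join the two files on the shared pair ID so that each row holds a
# translated pair together with its similarity scores.
merged = translations.merge(scores, on="ID", how="inner")
print(merged.head())
```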
Download individual language files in CSV format. Each column contains the scores from one annotator for all 1,888 pairs.
- ARA (Arabic)
- CMN (Mandarin Chinese)
- CYM (Welsh)
- ENG (English)
- EST (Estonian)
- FIN (Finnish)
- FRA (French)
- HEB (Hebrew)
- POL (Polish)
- RUS (Russian)
- SPA (Spanish)
- SWA (Kiswahili)
- YUE (Yue Chinese)
You can also download the cross-lingual datasets for all 66 language pairs described in our paper.
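For the individual language files listed above, a common first step is to average the per-annotator columns into a single score per pair. A minimal sketch, assuming a hypothetical file name eng.csv and annotator columns named annotator1, annotator2, ...; check the released files for the exact layout:

```python
import pandas as pd

# Hypothetical file name and column naming; the first columns identify the
# word pair and each remaining column holds one annotator's 1,888 scores.
df = pd.read_csv("eng.csv")
annotator_cols = [c for c in df.columns if c.startswith("annotator")]

# Gold score per pair = mean over annotators.
df["mean_score"] = df[annotator_cols].mean(axis=1)
print(df["mean_score"].describe())  # sanity check: values should lie in [0, 6]
```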
Contribute! Help us extend Multi-SimLex
We invite the wider NLP community to join the effort to expand Multi-SimLex beyond the current sample of 12 languages. We particularly welcome the creation of Multi-SimLex datasets for under-resourced and typologically diverse languages. To contribute, please follow the protocol and guidelines below when creating your language dataset (further details are in the Multi-SimLex paper):
Translation
Translators for each target language should be instructed to find direct or approximate translations for the 1,888 word pairs that satisfy the following rules:
- All pairs in the translated set must be unique (i.e., no duplicate pairs).
- Translating two words from the same English pair into the same word in the target language is not allowed. For example, it is not allowed to translate car and automobile to the same Spanish word coche.
- The translated pairs must preserve the semantic relation between the two words whenever possible. This means that, when multiple translations are possible, the translation that best conveys the semantic relation found in the original English pair should be selected.
- If a single-word translation is not possible in the target language, a multi-word expression (MWE) can be used to convey the nearest possible semantics, subject to the rules above.
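The first two rules can be checked automatically before submission. Below is a minimal sketch in Python (not an official validator), assuming the translated pairs are held as a list of (word1, word2) tuples:

```python
from collections import Counter

def check_translated_pairs(pairs):
    """Flag violations of rules 1 and 2: duplicate pairs, and pairs whose
    two words were translated into the same target-language word."""
    # Rule 1: all pairs must be unique (order-insensitive check).
    normalized = [tuple(sorted(w.strip().lower() for w in p)) for p in pairs]
    duplicates = [p for p, n in Counter(normalized).items() if n > 1]
    # Rule 2: the two words of a pair must not share a translation.
    identical = [p for p in pairs if p[0].strip().lower() == p[1].strip().lower()]
    return duplicates, identical

# Toy usage with made-up Spanish translations:
pairs = [("coche", "automóvil"), ("coche", "coche"), ("coche", "automóvil")]
dups, same = check_translated_pairs(pairs)
print("duplicate pairs:", dups)       # [('automóvil', 'coche')]
print("identical-word pairs:", same)  # [('coche', 'coche')]
```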
Annotation
All submissions must consist of at least ten valid annotations for each pair. Annotators should be native speakers of the target language (or have a similar fluency level). All annotators are required to abide by the following instructions:
- Each annotator must assign an integer score between 0 and 6 (inclusive) indicating how semantically similar the two words in a given pair are. A score of 6 indicates very high similarity (i.e., perfect synonymy), while 0 indicates no similarity.
- Each annotator must score the entire set of 1,888 pairs in the dataset. The pairs must not be split between different annotators.
- Annotators may spread the workload over a period of approximately 2-3 weeks, and may use external sources (e.g., dictionaries, thesauri, WordNet) if required.
- Annotators are kept anonymous and must not communicate with each other during the annotation process.
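A submission can be sanity-checked against these instructions before the adjudication rounds below: at least ten annotators, every annotator covering all 1,888 pairs, and only integer scores in [0, 6]. A minimal sketch, assuming the table layout of the per-language files (one row per pair, one column per annotator):

```python
import pandas as pd

def check_submission(df, annotator_cols, n_pairs=1888, min_annotators=10):
    """Basic checks on a raw annotation table: one row per pair,
    one column per annotator. Raises AssertionError on violations."""
    assert len(annotator_cols) >= min_annotators, "need at least ten annotators"
    assert len(df) == n_pairs, f"expected {n_pairs} pairs, got {len(df)}"
    scores = df[annotator_cols]
    assert scores.notna().all().all(), "every annotator must rate every pair"
    assert ((scores >= 0) & (scores <= 6)).all().all(), "scores must lie in [0, 6]"
    assert (scores == scores.round()).all().all(), "scores must be integers"

# Usage (the column naming is an assumption):
# check_submission(df, [c for c in df.columns if c.startswith("annotator")])
```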
To ensure the quality of the collected ratings, we ask all submissions to follow our adjudication protocol, which consists of the following three rounds:
- Round 1: All annotators are asked to follow the instructions outlined above and to rate all 1,888 pairs with integer scores between 0 and 6.
- Round 2: We compare the scores of all annotators and identify, for each annotator, the pairs that show the most disagreement. We ask the annotators to reconsider the assigned scores for those pairs only. The annotators may choose to either change or keep their scores. As in Round 1, the annotators have no access to the scores of the other annotators, and the process is anonymous. This gives annotators a chance to correct human errors or reconsider their judgments, and has been shown to be very effective in reaching consensus. To identify the pairs with the most disagreement, for each annotator we mark the i-th pair if the rated score $s_i$ satisfies $s_i \geq \mu_i + 1.5 \,\vee\, s_i \leq \mu_i - 1.5$, where $\mu_i$ is the mean of the other annotators' scores.
- Round 3: We compute the average agreement of each annotator with the other annotators, measured as the average Spearman's correlation against all other annotators. We discard the scores of the annotators that show the least average agreement with all other annotators, while maintaining at least ten annotators per language by the end of this round. The actual process is done in multiple iterations: (S1) we measure the average agreement of each annotator with every other annotator (this corresponds to the APIAA measure, see below); (S2) if we still have more than 10 valid annotators and the lowest average score is higher than in the previous iteration, we remove the annotator with the lowest score and rerun (S1).
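To illustrate the Round 2 criterion, the sketch below marks, for each annotator, the pairs whose score deviates from the mean of the other annotators' scores by at least 1.5. It is a simplified sketch assuming a pairs-by-annotators NumPy matrix, not the exact adjudication code used for Multi-SimLex:

```python
import numpy as np

def pairs_to_rereview(scores, threshold=1.5):
    """scores: (n_pairs, n_annotators) array of Round 1 ratings.
    Returns a boolean mask of the same shape: entry (i, a) is True when
    annotator a's score for pair i deviates from the mean of the *other*
    annotators' scores by at least `threshold` (the Round 2 criterion)."""
    n_pairs, n_ann = scores.shape
    totals = scores.sum(axis=1, keepdims=True)
    mean_others = (totals - scores) / (n_ann - 1)  # leave-one-out mean per pair
    return np.abs(scores - mean_others) >= threshold

# Each annotator then re-rates only the pairs flagged in their own column;
# Round 3 afterwards drops the annotators with the lowest average agreement
# (see the APIAA/AMIAA sketch in the next section).
```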
Measuring agreement
We measure the agreement between annotators using two metrics: average pairwise inter-annotator agreement (APIAA) and average mean inter-annotator agreement (AMIAA):

$$\mathrm{APIAA} = \frac{2 \sum_{i,j} \rho(s_i, s_j)}{N(N-1)} \qquad \mathrm{AMIAA} = \frac{\sum_{i} \rho(s_i, \mu_i)}{N}, \;\text{where}\; \mu_i = \frac{\sum_{j \neq i} s_j}{N-1}$$

where $\rho(s_i, s_j)$ is the Spearman's correlation between annotators $i$ and $j$'s scores for all pairs in the dataset, and $N$ is the number of annotators. See the Multi-SimLex paper for the agreement scores of our current languages.
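Both metrics can be computed directly from a pairs-by-annotators score matrix. A minimal sketch using scipy.stats.spearmanr (same assumed matrix layout as the sketch above):

```python
import numpy as np
from scipy.stats import spearmanr

def apiaa(scores):
    """Average pairwise inter-annotator agreement: the mean Spearman
    correlation over all unordered annotator pairs, i.e.
    2 * sum_{i<j} rho(s_i, s_j) / (N * (N - 1))."""
    n = scores.shape[1]
    rhos = [spearmanr(scores[:, i], scores[:, j])[0]
            for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(rhos))

def amiaa(scores):
    """Average mean inter-annotator agreement: each annotator's Spearman
    correlation with the mean of the other annotators, averaged over annotators."""
    n = scores.shape[1]
    totals = scores.sum(axis=1, keepdims=True)
    rhos = []
    for i in range(n):
        mu_i = (totals[:, 0] - scores[:, i]) / (n - 1)  # mean of the other annotators
        rhos.append(spearmanr(scores[:, i], mu_i)[0])
    return float(np.mean(rhos))
```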
Submit your dataset
Finally, please don’t forget to submit your dataset to us so that we can make it part of the growing Multi-SimLex database and it can benefit the whole community. To do this, please get in touch with us by email and we will give instructions for upload: multisimlex [at] gmail.com
Multi-SimLex paper
For details of Multi-SimLex, our annotation guidelines, the benchmark results and more, please see and cite the following paper:
Ivan Vulić, Simon Baker, Edoardo Maria Ponti, Ulla Petti, Ira Leviant, Kelly Wing, Olga Majewska, Eden Bar, Matt Malone, Thierry Poibeau, Roi Reichart and Anna Korhonen. 2020. Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity
Multi-Simlex team
Multi-SimLex is developed in collaboration between the Language Technology Lab at the University of Cambridge and the Faculty of Industrial Engineering and Management at the Technion.
We gratefully acknowledge the support of the European Research Council (ERC) Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no. 648909).
Contact us
For any queries related to Multi-SimLex, please contact us: multisimlex [at] gmail.com.