Terms of use:


Available Resources:


Equity Evaluation Corpus


Scripts for Best-Worst Scaling

Best-Worst Scaling (BWS), also known as Maximum Difference Scaling (MaxDiff), is an annotation scheme based on comparative judgments (Louviere and Woodworth, 1990; Cohen, 2003; Louviere et al., 2015). Annotators are given four items (a 4-tuple) and asked which item is the Best (highest in terms of the property of interest) and which is the Worst (lowest in terms of the property of interest). These annotations can then be easily converted into real-valued scores of association between the items and the property, and the scores yield a ranked list of the items by their association with the property of interest.
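The conversion from Best-Worst annotations to real-valued scores can be illustrated with a simple counting procedure: an item's score is the proportion of times it was chosen as Best minus the proportion of times it was chosen as Worst. The sketch below assumes that procedure; the function name `bws_scores`, the input layout, and the example words are hypothetical and are not taken from the distributed scripts.

```python
from collections import Counter

def bws_scores(annotations):
    """Score items from Best-Worst annotations by simple counting.

    annotations: iterable of (items, best, worst), where items is the
    tuple shown to the annotator. This input layout is an assumption.
    Returns {item: (#best - #worst) / #appearances}, a value in [-1, 1].
    """
    appearances, best_counts, worst_counts = Counter(), Counter(), Counter()
    for items, best, worst in annotations:
        if best not in items or worst not in items or best == worst:
            raise ValueError(f"inconsistent annotation: {items!r}, {best!r}, {worst!r}")
        appearances.update(items)   # every item in the tuple was seen once
        best_counts[best] += 1
        worst_counts[worst] += 1
    return {
        item: (best_counts[item] - worst_counts[item]) / appearances[item]
        for item in appearances
    }

# Toy annotations for three 4-tuples (made-up data).
annotations = [
    (("great", "good", "bad", "awful"), "great", "awful"),
    (("great", "good", "okay", "bad"), "great", "bad"),
    (("good", "okay", "bad", "awful"), "good", "awful"),
]
scores = bws_scores(annotations)

# Rank items from highest to lowest score; ties are broken alphabetically.
ranking = sorted(scores, key=lambda item: (-scores[item], item))
```

With real data, each tuple would normally be judged by several annotators, and every judgment would contribute one entry to `annotations`.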

We showed (Kiritchenko and Mohammad, 2017) that BWS ranks terms more reliably than rating scales (RS) do: when the term rankings obtained from two groups of annotators for the same set of terms are compared, the correlation between the two rankings produced by BWS is significantly higher than the correlation between the two rankings produced by RS. The difference in reliability is more marked when about 5N or fewer total annotations are obtained (N being the number of terms), which is the case in many NLP annotation projects (Strapparava and Mihalcea, 2007; Socher et al., 2013; Mohammad and Turney, 2013).
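To make the comparison of two rankings concrete, the following sketch computes a rank correlation between the scores produced by two groups of annotators. Spearman's coefficient is used here as an assumption, since the text above does not name the coefficient; the function names and the score values are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(scores_a, scores_b):
    """Spearman correlation over the items present in both score dicts."""
    items = sorted(set(scores_a) & set(scores_b))
    ranks_a = average_ranks([scores_a[item] for item in items])
    ranks_b = average_ranks([scores_b[item] for item in items])
    return pearson(ranks_a, ranks_b)

# Made-up scores from three annotator groups for the same five terms.
group_a = {"great": 0.9, "good": 0.5, "okay": 0.0, "bad": -0.4, "awful": -1.0}
group_b = {"great": 0.8, "good": 0.6, "okay": -0.1, "bad": -0.5, "awful": -0.9}
group_c = {"great": 0.8, "good": -0.1, "okay": 0.6, "bad": -0.5, "awful": -0.9}

rho_ab = spearman(group_a, group_b)  # identical orderings
rho_ac = spearman(group_a, group_c)  # "good" and "okay" swapped
```

A higher coefficient means the two groups order the terms more similarly, which is the sense of reliability used in the comparison above.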

We provide scripts to assist with Best-Worst Scaling annotations. The package includes:
- a script to produce 4-tuples with desired term distributions,
- a script to produce real-valued scores from Best-Worst annotations,
- a script to calculate split-half reliability of the annotations.
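As an illustration of the first step, the sketch below generates random 4-tuples in which no term repeats within a tuple, no two tuples contain the same set of terms, and every term appears roughly equally often. It is a minimal sketch under those assumptions and is not the distributed script; `generate_tuples` and its parameters (`rounds`, `seed`, `max_tries`) are hypothetical names.

```python
import random

def generate_tuples(items, tuple_size=4, rounds=2, seed=0, max_tries=100):
    """Build tuples by shuffling the item list `rounds` times and cutting
    each shuffle into chunks of `tuple_size`. Each item therefore appears
    at least `rounds` times."""
    items = list(dict.fromkeys(items))  # drop duplicate terms, keep order
    if len(items) < tuple_size:
        raise ValueError("need at least tuple_size distinct items")
    rng = random.Random(seed)
    seen = set()   # item sets of tuples accepted in earlier rounds
    tuples = []
    for _ in range(rounds):
        for _ in range(max_tries):
            shuffled = items[:]
            rng.shuffle(shuffled)
            chunks = [shuffled[i:i + tuple_size]
                      for i in range(0, len(shuffled), tuple_size)]
            last = chunks[-1]
            if len(last) < tuple_size:
                # Pad a short final chunk with other randomly chosen items.
                pool = [x for x in items if x not in last]
                last.extend(rng.sample(pool, tuple_size - len(last)))
            keys = [frozenset(chunk) for chunk in chunks]
            # Accept the round only if it repeats no earlier tuple.
            if len(set(keys)) == len(keys) and not seen.intersection(keys):
                seen.update(keys)
                tuples.extend(tuple(chunk) for chunk in chunks)
                break
        else:
            raise RuntimeError("could not find a new set of distinct tuples")
    return tuples

# Ten made-up terms, two rounds: three tuples per round, six in total.
terms = ["great", "good", "fine", "okay", "dull",
         "poor", "bad", "awful", "superb", "grim"]
tuples = generate_tuples(terms, rounds=2, seed=0)
```

Increasing `rounds` yields more tuples and more appearances per term; a different target distribution over terms would need a different sampling strategy.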

We have used BWS to annotate single words, short phrases, and whole tweets for emotion and sentiment intensity.
Kiritchenko, S., and Mohammad, S. (2016). Capturing Reliable Fine-Grained Sentiment Associations by Crowdsourcing and Best-Worst Scaling. Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), San Diego, California.
Kiritchenko, S., and Mohammad, S. (2017). Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada.


WikiArt Emotions Dataset


Manually Created Sentiment Lexicons Using Best-Worst Scaling


Automatically Generated Sentiment Lexicons


Data Manually Annotated for Sentiment and Emotion