The ideal scoring setup: Caption weights and calculating results
Written By Akim Mansaray and Sherman Charles
Graphics by Akim Mansaray and Sherman Charles
When deciding on a scoring setup for your contest, there are several things to consider, such as the scoresheet, judges, and awards, just to name a few. Regardless, the primary focus must be fostering educational goals and developmental progress. Assessing artistic performance is understandably difficult because of the varied viewpoints and subjective factors that shape performance curricula. Thus, the goal of this blog post is to lay out what needs to be considered to run a quality, transparent contest with integrity, where education is the prize.
The following four posts discuss 1) Choosing scoresheet categories, 2) Caption weights and calculating results, 3) Choosing judges, and 4) Awards and time penalties. This post is the second of the four. If you have any questions about this post’s content, or if you would like to share your view on this topic, feel free to add your comments at the bottom of this page. Don’t forget to read the other three posts!
Caption weights and calculating results
Let’s say you have a list of criteria that captures all of the important details to evaluate, but you find that the proportion of points between captions (a caption here refers to a group of categories that evaluates a broad element of a performance, such as Vocals, Visuals, or Band) is less than desirable. To get the proportions you want, there are two options.
The first option is to modify the categories themselves. One could collapse related categories into a single one (e.g., Appearance and Poise become Appearance & Poise), remove extraneous categories, or change the maximum point value of some categories. While these are the simplest ways of adjusting your ideal scoresheet to fit your ideal caption proportions and weighting, the result is less detail and less precise information, and the meaning of individual scores becomes more opaque. For example, how is a director or student supposed to know whether a 7 out of 10 in Appearance & Poise means they need to improve the costuming and makeup, or their presence on stage? What if one of the categories you removed happens to be one that some performances miss the mark on and would benefit from feedback? What is the difference between a 3 out of 5 for one category, a 3 out of 7 for another, and a 6 out of 10 for a third? So while these might be the easiest ways to achieve your desired proportions, there are drawbacks to consider. We are by no means discouraging you from doing this; we are simply bringing these points to your attention. A successful scoresheet that employs this approach is the Tyson Showchoir sheet, developed and tested by Dr. LaDona Tyson and currently used at all contests in Mississippi.
A second option is to mathematically weight each caption. This approach requires multiplying or dividing the scores for at least one, if not all, captions. Take this scenario: we have the perfect showchoir scoresheet, but the captions Vocals and Visuals have the same number of categories, making the weighting between them 50% Vocals and 50% Visuals. For some in the showchoir world, this is not ideal; many believe that Vocals should be worth more than Visuals, say 60% Vocals and 40% Visuals. With this weighting, the Vocals caption is worth more than the Visuals caption, 1.5 times (or 50% more) to be precise. To achieve this, one can multiply the Vocals caption score by 1.5 (among other approaches that will not be discussed here). Notice that this does not multiply the categories within Vocals by 1.5, but rather the total points for the entire Vocals caption. This means that all categories are still evaluated on a 10-point scale, making it easy for directors and students to interpret the scores they get back from the judges. What changes is the number of points each caption contributes to the final overall score. One downside of this approach is that the total weighted scores differ from the total raw scores, which can cause some confusion among competitors. The primary benefit is that one can create an ideal scoresheet without needing to worry about the number of points in each caption. The Carmen Showchoir and Carmen Advanced Showchoir scoresheets use this approach. Below are screenshots of the Carmen Showchoir scoresheet from within the Carmen Scoring System. The captions Music (Vocals outside of CA) and Show (Visuals outside of CA) have the same number of categories, and thus the same number of points. This sheet can be used as is, or weighting can be applied to achieve a 60/40 proportion between the captions.
You may have noticed that the two approaches outlined above are the same strategies educators use to make sure that all of the assignments for a course are balanced in the desired way when determining final grades. In fact, this is exactly what one is doing when designing a scoring system – creating a gradebook! A teacher allocates a certain number of points or a certain weight to homework, in-class assignments, exams, final projects, and so on, so that each has the desired impact on the final grade.
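To make the arithmetic concrete, here is a minimal sketch in Python of how a caption-level multiplier works. The category scores, variable names, and point totals are hypothetical and purely for illustration; they are not taken from any real scoresheet.

```python
# Hypothetical example: two captions with the same number of 10-point categories,
# so their raw totals carry equal weight (50/50).
vocals_categories = [8, 7, 9, 8, 7]    # five categories, 10 points each -> 50 possible
visuals_categories = [9, 8, 8, 7, 9]   # five categories, 10 points each -> 50 possible

vocals_raw = sum(vocals_categories)    # 39 of 50
visuals_raw = sum(visuals_categories)  # 41 of 50

# Multiply the Vocals caption total (not each individual category) by 1.5
# to shift the caption proportion from 50/50 to 60/40.
VOCALS_WEIGHT = 1.5

vocals_weighted = vocals_raw * VOCALS_WEIGHT   # 58.5 of a possible 75
overall_total = vocals_weighted + visuals_raw  # 99.5 of a possible 125

print(f"Vocals share of possible points:  {75 / 125:.0%}")   # 60%
print(f"Visuals share of possible points: {50 / 125:.0%}")   # 40%
```

Because the multiplier is applied to the caption total rather than to each category, the 10-point category scores that directors and students receive are unchanged; only the caption's contribution to the overall total shifts.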
However, one of the biggest differences between a standard classroom gradebook and a scoring system such as this is that there are many graders, all of whom have an impact on the final outcome. Thus, we need an accurate, representative, and honest way of calculating results from all of the judges’ scores combined. Since we have already written on this topic in a few of our posts, this will be brief. If you want to read up, see the posts Raw Scores versus Rank and Consensus Scoring Methods: Lessons from the French Revolution.
Simply using the total Raw Scores for each group from all of the judges seems like an obvious and simple method for calculating the results, but it has a major flaw. The problem centers on how judges decide what score to give. There is a general lack of understanding when it comes to assessing qualitative elements, especially in education. Scoring something like a performance is subject to countless factors, such as personal taste, background, training, and physical perspective, and it is impossible to control all of them even though we try. Because of this, judges score each performance relative to their experience, background, and previously observed performances; as a result, one judge's 7 out of 10 might be equivalent to another judge's 8 out of 10. This means there can be wild variation between the judges' scores for any set of performances, which can have unforeseen consequences on the final results, whether by coincidence or by the intention of a savvy judge. While efforts to standardize judging practices can be helpful, it is much safer to assume that this is in fact happening at your contest and to take measures to ensure it does not negatively impact the final results. The way to do this is to use a scoring method that removes the influence of point swings and differences in scoring habits, namely the Condorcet Method.
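To see the flaw in action, here is a tiny, hypothetical example. The group names and scores below are invented for illustration (totals out of 100 per judge):

```python
# Hypothetical raw totals (out of 100) from three judges for two groups.
# Judges 1 and 2 prefer Group A by a small margin; Judge 3 prefers Group B
# and scores with a much wider point spread.
scores = {
    "Group A": [85, 84, 70],
    "Group B": [83, 82, 95],
}

totals = {group: sum(s) for group, s in scores.items()}
print(totals)  # {'Group A': 239, 'Group B': 260}
# Group B wins on raw totals even though two of the three judges scored Group A higher.
```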
The Condorcet Method does not care about the total points one group has versus another. Briefly, this method uses pair-offs to determine which group wins over all other groups (see Consensus Scoring Methods: Lessons from the French Revolution for a more detailed explanation). It controls for all point swings and differences between judges, AND it eliminates the power of any single judge or minority of judges to decide the outcome. Regardless of the choices that you make about your scoresheet or your judging panel, we highly encourage you to avoid using Raw Scores and use the Condorcet Method in all cases where you have a panel of judges scoring a performance.
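As a sketch of the pair-off idea (this is not the exact implementation used by the Carmen Scoring System, and the per-judge scores are again hypothetical), a Condorcet-style comparison counts, for every pair of groups, how many judges scored one above the other:

```python
from itertools import combinations

# Hypothetical per-judge totals for three groups, extending the example above.
scores = {
    "Group A": [85, 84, 70],
    "Group B": [83, 82, 95],
    "Group C": [80, 79, 68],
}

def pairwise_wins(scores):
    """For each pair of groups, count how many judges scored each one higher."""
    results = {}
    for a, b in combinations(scores, 2):
        a_wins = sum(sa > sb for sa, sb in zip(scores[a], scores[b]))
        b_wins = sum(sb > sa for sa, sb in zip(scores[a], scores[b]))
        results[(a, b)] = (a_wins, b_wins)
    return results

for (a, b), (wins_a, wins_b) in pairwise_wins(scores).items():
    print(f"{a} vs {b}: {wins_a}-{wins_b}")
# Group A vs Group B: 2-1
# Group A vs Group C: 3-0
# Group B vs Group C: 3-0
# Group A beats every other group head-to-head, so it is the Condorcet winner,
# even though Group B has the highest raw total (260 vs 239).
```

Because each judge's scores are only ever compared with that same judge's scores for other groups, a judge who scores high across the board and a judge who scores low across the board carry exactly the same influence.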
Lastly, given the guiding principle that education should come first at contests, averaging scores between judges to determine final results should be avoided. There are two reasons: 1) a single judge can determine the outcome of a contest; and 2) as a result, directors and students can get the wrong impression of how well they did relative to other groups as characterized by the rest of the judges' scores. Even if all of the other judges agree on which ensemble should win, the odd one out can pull that ensemble's average down enough to take the win away. Averaging judges' scores is a disservice to the educational goals of any contest.
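A quick, hypothetical illustration of how a single outlier drags an average (scores out of 10, invented for this example):

```python
# Hypothetical scores (out of 10) from four judges for two groups.
group_a = [9, 9, 9, 3]   # three judges place Group A clearly on top; one scores it far lower
group_b = [8, 8, 8, 8]   # every judge gives Group B a solid but lower score than those 9s

print(sum(group_a) / len(group_a))  # 7.5
print(sum(group_b) / len(group_b))  # 8.0
# Averaging hands the win to Group B even though three of the four judges preferred Group A;
# under a pair-off (Condorcet) comparison, Group A would win the head-to-head 3-1.
```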
If you find the above information daunting, there is no need to worry! The Carmen Scoring System automatically takes care of all of the math for you. When you set up your contest within our system, you choose your scoresheet as well as your scoring method; the judges enter the scores and Carmen does the rest. You don't have to invent your own formula or algorithm or do any math by hand. Even if you are a spreadsheet master, there is still a risk of human error when entering formulas somewhere in the long chain of calculations. Our methods have proven themselves time and time again at contests of all shapes and sizes across the US and UK. Count on the Carmen Scoring System to deliver fast, trustworthy, and transparent results every time you use it. To find out more about the services we offer, go to our information page or contact us with any questions you may have. Let's take your contest to the next level!
Recommended Readings
Much of what is written above is inspired by the following publications. We encourage you all to take the time to read through these thoughtful papers to gain better insight into music assessment, its purpose, its value, and its intended outcomes. You can find many of them by simply searching for them in your favorite search engine or by clicking the links below.
Napoles, J. (2009). The effects of score use on musicians’ ratings of choral performances. Journal of Research in Music Education, 57(3), 267–279. https://doi.org/10.1177/0022429409343423
Tyson, L. (2020). Analysis of a research-based show choir competition adjudication rubric: Reliability and user perceptions [Doctoral dissertation, University of Southern Mississippi]. https://aquila.usm.edu/dissertations/1784/
Wesolowski, B. C., Wind, S. A., & Engelhard, G. (2016). Examining rater precision in music performance assessment. Music Perception, 33(5), 662–678. https://doi.org/10.1525/mp.2016.33.5.662