What can scoring data tell us about showchoir?


Written by Sherman Charles

Edited by Trent DeBonis

Graphic by Cleidson da Silva

As curious members of the showchoir community, we often find ourselves with many questions about the judging and scoring processes and what makes them tick.

Is it harder to receive good scores in vocals or in choreography?

Are there such things as objectively ‘good’ and ‘bad’ showchoir groups?

Do certain judges tend to score groups more critically than others?

Over the last 4 years, Carmen has logged 209,708 data points from 700 choirs, 262 judges, and 115 competitions. That’s every single score that every judge gave to every group they saw. It is a treasure trove of information that helps us answer the questions above and many more.

We started this discussion on Eric Van Cleave’s Facebook Live event. Eric is an arranger, judge, and clinician in the showchoir world, and he had many of the same questions that we did. You can watch the show on YouTube here. On the show, alongside our colleagues Shawn Miller (director at Loveland, OH) and Tim Smith (director at Olentangy, OH), we discussed what we can learn from the information that Carmen has saved. The response to this subject matter was so overwhelmingly positive that we wanted to keep the discussion going on the blog. Below, you will find the data we discussed on the show as well as some additional points that we didn’t cover. So, without further ado, let’s dive right into it! Please feel free to add comments below and add to the discussion; we want to know what else the showchoir community is curious about.

What scoring categories have lower/higher average scores?

This question was posed by Mr. Van Cleave when he first approached us about doing the show. We thought it might be interesting to know which scoresheet categories tend to get lower or higher scores than others. Below, you will find a graph showing every score recorded for every category in the Carmen Scoring System.


Average scores for each category in the Carmen Scoring System arranged from lowest (left) to highest (right). The category names are highlighted according to their caption – Vocals/Music (purple) and Visuals/Show (orange). Each red dot is an individual data point.

This graph is overwhelming at first glance (I haven’t seen a box-and-whisker plot since the 6th grade!), but allow us to make sense of it. There are two key points that we can take away from the information.

  1. All of the categories have a very similar median/range.

    • They all tend to fall between a 7 and 9, with a median around 7.5 or 8. This means that most showchoirs score between a 7/10 and 9/10 on any given category.

  2. Vocal/music categories (purple) tend to receive lower scores than visual/choreography categories (orange).

    • Note: non-highlighted categories are either band categories or categories we could not confidently classify as vocals or choreography.

As you can see, we can learn very interesting things from this information. We think that these patterns could be the result of a few factors…

  1. Scoresheets are often vocal/music heavy (thus providing more opportunities to score lower).

  2. There tend to be more vocal/music judges on any given panel.

  3. Most choirs do not sing as well as they dance.

All that said, there are, of course, a couple of inconsistencies in this dataset that are worth pointing out. First, not every competition uses the same scoresheet, so the data is a bit messy in places. (Example: the lowest average scoring category is “Staging & Transitions”, but if you look carefully, you can find several other categories with similar names that have higher average scores.) This most likely skews the results to a certain degree. Second, not every category is out of 10 possible points (most are, but not all). In order to compare each category, we normalized all scores to look as if they were out of 10, which probably distorted at least some of the data.
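To make that normalization concrete, here is a minimal sketch of the kind of rescaling we mean, written in Python with pandas. The table layout and column names below are hypothetical placeholders for the example, not the actual Carmen schema.

```python
import pandas as pd

# Hypothetical table of individual scores: 'points' is what the judge awarded,
# 'max_points' is the most that category could award on that scoresheet.
scores = pd.DataFrame({
    "category":   ["Vocal Tone", "Choreography", "Staging & Transitions"],
    "points":     [8.0, 18.0, 7.5],
    "max_points": [10.0, 20.0, 10.0],
})

# Rescale every score so it reads as if the category were out of 10.
scores["normalized"] = scores["points"] / scores["max_points"] * 10

# Order categories by their average normalized score, lowest first.
print(scores.groupby("category")["normalized"].mean().sort_values())
```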

How often do choirs change place from Prelims to Finals?

This question was also posed by Mr. Van Cleave. In our database we have a total of 77 competitions that had a preliminary round and a finals round. Of those events, 53 (~70%) exhibited one or more place changes (at least two choirs swapping rank). This surprised both me and Mr. Van Cleave; we both thought it would be less common.


The total number of instances at least two choirs changed place from prelims to finals.

Although this graph is very intriguing, it does not provide much detail. We wanted to learn more! To do this, we looked into each individual placement and how those tend to change. We found that 1st and 6th place tend to stay the same from prelims to finals, but 3rd and 4th place change about 50% of the time.


The total number of instances choirs in 1st-6th place changed from prelims to finals.
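For readers curious how numbers like these get tallied, here is a minimal sketch of one way to count place changes between rounds, assuming a hypothetical results table (the column names are ours, not Carmen’s).

```python
import pandas as pd

# Hypothetical finals results: one row per finalist per competition, with that
# choir's placement after prelims and after finals.
results = pd.DataFrame({
    "competition": [1, 1, 1, 2, 2, 2],
    "choir":       ["A", "B", "C", "D", "E", "F"],
    "prelim_rank": [1, 2, 3, 1, 2, 3],
    "final_rank":  [1, 3, 2, 1, 2, 3],
})

# Flag every choir whose placement moved between rounds.
results["changed"] = results["prelim_rank"] != results["final_rank"]

# Share of competitions with at least one place change (the ~70% figure).
swapped = results.groupby("competition")["changed"].any()
print(f"Competitions with a place change: {swapped.mean():.0%}")

# How often each prelim placement ends up somewhere else in finals.
print(results.groupby("prelim_rank")["changed"].mean())
```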

There are a couple conclusions that we can draw from this information.

  1. If you want to win Grand Champion, you should go into finals in first place. (That is to say, do your best in Prelims and you have a better chance at winning Finals)

  2. If you are in 3rd or 4th place going into finals, do your absolute best and you have a 50/50 shot at improving your placement.

  3. These trends could be an indication that 6 spots are at least one (and maybe two) too many. Or, since 1st place never really changes, maybe finals don’t really do much for discovering a “true winner” of a competition.

Maybe finals aren’t really worth it… More on this in a later blog post!

How much of a difference is there between the top finalists and the bottom finalists?

After seeing the above data, Mr. Van Cleave asked us if we could find out whether there are major score gaps between the finalists or between making and not making finals. That is to say, does the Grand Champion tend to win ‘by a landslide,’ or is it always a close race? Once again, the results were surprising to us. In order to compare competitions that use different scoresheets, we normalized the data to reflect the percentage of total points that a choir earned.
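As a rough illustration of that normalization, the sketch below converts raw totals into percentages of the points available. The table layout and column names are assumptions for the example, not the actual Carmen schema.

```python
import pandas as pd

# Hypothetical totals: one row per choir per round, with the points it earned
# and the maximum its competition's scoresheet allowed.
totals = pd.DataFrame({
    "round":           ["Prelims", "Prelims", "Finals", "Finals"],
    "place":           [1, 2, 1, 2],
    "points_earned":   [412.0, 398.5, 845.0, 830.0],
    "points_possible": [450.0, 450.0, 900.0, 900.0],
})

# Express each result as a percentage of the points available, so competitions
# with different scoresheets can sit on the same axis.
totals["pct_of_possible"] = totals["points_earned"] / totals["points_possible"] * 100

# Typical percentage earned by each placement in each round.
print(totals.groupby(["round", "place"])["pct_of_possible"].median())
```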

Remarkably, there are no major gaps between any of the places regardless of the round. This is still true even if the data are divided into region and state.


Boxplots of percentage of total possible points for each place during preliminary and finals rounds.

Let’s try and make sense of what we’re seeing. We think there are two possible explanations for these results.

  1. The top groups at each of these competitions were scored relatively close to each other, but there is likely a clear division between each of the places.

  2. Judges might be inflating or deflating their scores in order to evenly space groups in the order that they see fit.

The second point makes the most sense, since showchoir judges are always scoring each group relative to the next rather than against some idealized standard (see Raw Scores vs. Ranks for a detailed discussion of this topic).

How many good, average, and bad groups are there?

Since I was in high school, I have noticed a steady increase not only in the number of showchoirs and competitions around the world, but also in the number of really good choirs and tight competitions. I remember that at most of the competitions we attended, there might be one or two exceptionally good choirs that were ‘expected’ to win. Everyone else was really competing for 2nd or 3rd place. But now (2020), many, if not most, competitions are a battle for the top 4, 5, or even 6 places. That is to say, there are many more really good choirs now than there were just 15 years ago.

But how many good choirs are there? How many really do go above and beyond the average? And do bad showchoirs really exist? If so, do they get equally bad scores?

Conveniently, we already tend to separate choirs into divisions, and those divisions tend to reflect the level of each choir, though sometimes indirectly through school or choir size or the like. Take a look at the following graph.


Scores by Division. Divisions are ordered from lowest (left) to highest (right) scores. Boxplots show where each division’s scores (orange dots) tend to be.

It might be difficult to see, but the graph above shows all of the divisions that are in the Carmen Scoring Database (with some modifications) organized by the median score. The low scoring divisions are on the left and the high scoring divisions are on the right. Not surprisingly, we see higher performing divisions on the right and lower performing divisions on the left. (So, yes, good groups get good scores, and bad groups get poor ones). We would like to point out a few interesting facts in the data though.

First, while the Single Gender and Women’s divisions tend to score lower in the Midwest, South, and East Coast, they actually do better than the Mixed divisions in California (though only by a hair). This is true at every level (Novice, Intermediate, and Advanced). Even the Advanced Men division scores higher than the Advanced Mixed Tier 2 division in CA. It’s not entirely clear why, but it could be related to how single gender choirs are cast compared to mixed choirs in each region.

Second, some middle school divisions score just as high as their high school counterparts, particularly in CA and the South.

Third, as expected, the smaller divisions tend to underperform compared to their larger-division counterparts.

With this in mind, let’s collapse all of these divisions into smaller Types.
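For the programmatically inclined, a collapse like this might look roughly like the sketch below. The division names and the mapping itself are purely illustrative, not the actual Carmen taxonomy.

```python
# Purely illustrative mapping from the many division names that appear on
# scoresheets down to a handful of broader "types" (not the Carmen taxonomy).
DIVISION_TO_TYPE = {
    "Advanced Mixed":        "HS Mixed",
    "Advanced Mixed Tier 2": "HS Mixed",
    "Advanced Women":        "HS Single Gender",
    "Advanced Men":          "HS Single Gender",
    "Middle School Mixed":   "MS Mixed",
    "Finals":                "Finals",
}

def division_type(division: str) -> str:
    """Collapse a raw division name into its broader type."""
    return DIVISION_TO_TYPE.get(division, "Other")

print(division_type("Advanced Men"))  # -> HS Single Gender
```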


Scores by Type

We have reduced all of the divisions into just a few “archetypes,” if you will. This gives a broader picture of the overall patterns in our database. The graph shows that, overall, high school groups score higher than middle school groups, single gender groups score lower than mixed groups, there probably isn’t enough data to draw any conclusions about elementary or collegiate groups, and finals groups score much higher than everyone else. More importantly, though, we see that all types except finals tend to have very similar ranges and medians. Finalists outperform everyone, but not all competitions have preliminary and final rounds, so let’s take the finals category out of the equation and see what happens.


Scores by Choir. Each mark on the x-axis represents one of the 700 choirs in the Carmen Scoring System. They are organized from lowest average score to highest average score, represented by the purple dots. Individual scores are represented by the orange dots.

If we put all 700 choirs side by side in order of average score, we can see that there are a few really low scoring groups, a lot of middle scoring groups, and a few high scoring groups. However, they do not form a straight-line pattern. Instead, we see what looks like a sideways S. We will henceforth refer to this as an ‘s-curve.’ This curve tells us that while most groups land somewhere in the middle, the really good groups score much higher, and the not-so-good groups score much lower, than the groups in that middle pack.
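If you wanted to build a curve like this yourself, a minimal sketch might look like the following; the table of scores is hypothetical and merely stands in for the normalized scores in our database.

```python
import pandas as pd

# Hypothetical long table of normalized scores, one row per score per choir.
scores = pd.DataFrame({
    "choir": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "score": [5.1, 5.4, 7.6, 7.9, 8.0, 8.2, 9.4, 9.6],
})

# Average each choir's scores and sort from lowest to highest; plotting these
# averages in order is what produces the sideways-S shape.
choir_means = scores.groupby("choir")["score"].mean().sort_values()
print(choir_means)
```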

One interpretation of this information is that to be among the top groups, you must outperform all other groups in every single category. It is not an option to just sing well, just dance well, or have amazing sets, costumes, props, and special effects. You must have EVERYTHING – the whole package. That’s how you get higher average scores. (We have a feeling that many people will object to this idea, but we are simply interpreting the data, which suggest that the best groups have the best of everything.)

Is there such a thing as high scoring or low scoring judges? And how does this affect the results?

In previous blog posts, we emphasized that showchoir scoring is 100% subjective and 100% relative. In a nutshell, this means that each judge uses their own personal experiences and background to score each participating choir against all other competitors. At the same time, each judge works around their own average score to reflect how well a choir did in each category on a scoresheet. Let’s see what these average scores can tell us…


Box plot of all scores in the Carmen Scoring System.

The above graph displays a boxplot of every single score that has ever been awarded to a showchoir in the Carmen Scoring System. The median score is at about 8 points, which was a point higher than we expected. That means the typical score in our system is an 8 out of 10. The majority of scores fall between a 7 and a 9, and 99.3% of scores are at a 5 or above. Very rarely (0.7% of the time) do we see scores below a 5.
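These are ordinary summary statistics. Computed over a hypothetical stand-in for our score data, the calculation might look like this:

```python
import pandas as pd

# Hypothetical series standing in for every normalized score in the database.
all_scores = pd.Series([8.0, 7.5, 9.0, 6.5, 8.5, 4.5, 8.0, 7.0])

print("Median score:", all_scores.median())
print("Middle 50% of scores:", all_scores.quantile(0.25), "to", all_scores.quantile(0.75))
print("Share of scores at 5 or above:", (all_scores >= 5).mean())
```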

Now, let’s look at how each individual judge compares to the plot.


Scores by Judge. Each mark on the x-axis represents one of the 262 judges in the Carmen Scoring System. They are organized from lowest average score to highest average score, represented by the purple dots. Individual scores are represented by the green dots.

Do you recognize the apparent pattern in this graph? You should! This graph looks remarkably similar to the graph displaying scores by choir; it has an s-curve shape to it. We can draw similar conclusions here: most judges have an average score around 7.5, a few judges have very high average scores, and a few have very low average scores. In other words, yes, high and low scoring judges exist!

All that said, we think there is a more important conclusion to draw here. Because there are so many middle-ground scoring judges and so few high and low scoring judges, the fear of a single judge ‘throwing’ a competition (dramatically changing its outcome), consciously or unconsciously, is real. All it takes is one of those judges in a single panel of average scoring judges to skew the results. However, this only happens if you are using Raw Scores to determine the results (see the previous post Raw Scores vs. Ranks). If you convert scores to Ranks, then this disparity is eliminated.
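To show what that conversion looks like in practice, here is a minimal sketch of ranking each judge’s raw scores before combining them. The panel, choirs, and totals are invented for the example, and summing ranks is just one simple way to combine them (see the consensus methods post mentioned below for others).

```python
import pandas as pd

# Hypothetical panel: each judge's raw total for each choir. Notice that J2
# scores everyone much lower than J1 does.
raw = pd.DataFrame({
    "judge": ["J1", "J1", "J1", "J2", "J2", "J2"],
    "choir": ["A",  "B",  "C",  "A",  "B",  "C"],
    "total": [92.0, 88.0, 85.0, 71.0, 70.5, 64.0],
})

# Rank within each judge (1 = that judge's top choir). Because ranking happens
# judge by judge, a chronically high- or low-scoring judge can no longer drag
# the combined result up or down.
raw["rank"] = raw.groupby("judge")["total"].rank(ascending=False)

# One simple way to combine the panel: sum the ranks (lower is better).
print(raw.groupby("choir")["rank"].sum().sort_values())
```

Once each judge’s scores become ranks, J2’s habit of scoring low no longer moves the combined result. And with a panel of 3 vocal and 2 visual judges each contributing one rank, vocals account for 3 of the 5 ranks, which is the 60/40 split mentioned below.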

In conclusion, there is absolutely no viable reason anyone should be using Raw Scores to calculate their results. If you want to maintain a certain proportion of vocal to visual scores, you can weight your judging panel instead of your scoresheet. That is to say, have more vocal judges than visual judges. We have found that the best ratio is 3 vocal judges and 2 visual judges, which results in a 60/40 ratio when converted to rank. What to do with those ranks is an entirely different topic which we discuss in a previous blog post, Consensus Scoring Methods: Lessons from the French Revolution. If you are currently planning your judging panel for your next competition, we highly encourage you to take this into consideration. The data doesn’t lie.

What do you think of the data?

Let us know your thoughts and concerns below. If you have any questions that weren't addressed here, we want to hear from you! We would love to direct this conversation to our audience’s interests.