Thoughts on scoring words
Status: Draft | January 2025
What makes a word a good word for a crossword? What makes it interesting? Some words might entertain a solver, such as TANTALIZE. Others might disgust or frustrate them: consider MOIST or SSW (the direction). Some words are overused and can cause eyes to roll — EGOT and OLEO — while others can excite and provoke wonder.
Good words are the backbone of any word puzzle. When combined into a grid, they almost become a form of poetry: a combination of words that engages and delights the solver. The fill can be predetermined in places, as the letters and words require, but it can also have moments of whimsy and surprise, or clever moments that provoke thought.
This document is an attempt to enumerate measurable dimensions along which words can be interesting for word puzzles, and to propose a few ways of using those dimensions to score word lists. There are no absolutes when it comes to language: everyone’s lived experience is different, and the language each person knows, speaks, and is familiar with varies from person to person. The score a word gets here may underrepresent its value to a particular solver. Nevertheless, this attempts to provide some structure.
Crossword setters generally try to write for a common audience and make their puzzles accessible. However, there are no absolutes when it comes to culture. Words that are familiar to some are obscure to others. There’s a reason crosswords thrive in local newspapers. A common location provides at least some common grounding for a setter to target.
For a wonderful musing and history of this from a gendered perspective, read Anna Shechtman’s The Riddles of the Sphinx.
Puzzle Kinds
It’s worth noting that how each puzzle kind uses words affects how setters approach them. Standard crosswords use a lot of filler words and may have less flexibility in what to choose. Cryptics, on the other hand, have relatively few words in their grids and far more ability to choose them carefully.
Overall approach
We propose assigning scores to words to get better results when creating grids. These scores would be surfaced in both the Word List and Autofill functionalities. We start with the following assumptions:
Variety is key. First and foremost, having a good set of different types of words keeps the solver entertained and engaged.
Don’t clump traits. It’s bad form to have too much similarity in one section of a puzzle.
Where possible use familiar words… It’s fine to send your solver to the dictionary for some words, but if they need a dictionary to make any progress you might be making it too hard.
…but not too many. Expecting to stretch your solvers’ vocabulary is a plus. In addition, occasionally you need to reach for an obscure word to make an otherwise strong section fill.
Human editing is best. Perhaps in the future it will be possible to have AI create high-quality grids, but the best ones will still have a high degree of human intervention. This is meant as an assistive tool and shouldn’t be used to override editorial control.
Traits
We propose a few measurable traits for a word, each of which can have a numerical rating. These dimensions can be used to drive variety in a grid, and give the autofill something to work with beyond word shape. These ratings are considered independently of the grid being filled and can be precomputed beforehand. The traits proposed are:
Lexical interest
Frequency
Familiarity
Definition count
Sentiment
Each word can have a score for each trait. That would give the setter the ability to assess the overall grid and make decisions. It could also be used by the autofill to pick better words.
Details
We propose a way of measuring each of the traits below. For each trait, we discuss what it captures and touch a little on how to calculate it. It will take quite some experimentation to turn these into a practical score.
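To make that concrete, here is a minimal sketch (names and field layout are hypothetical, not anything in the codebase) of what a per-word trait record and a setter-weighted composite could look like:

```python
from dataclasses import dataclass

# Hypothetical record shape, not an existing API: one normalized 0.0-1.0
# value per trait, plus a composite the setter can weight to taste.
@dataclass
class WordTraits:
    word: str
    lexical_interest: float
    frequency: float
    familiarity: float
    definition_count: float
    sentiment: float

    def composite(self, weights: dict[str, float]) -> float:
        """Weighted blend of the traits; unweighted traits are ignored."""
        return sum(getattr(self, trait) * w for trait, w in weights.items())

# Example with made-up numbers: a setter who prizes unusual letter
# patterns over raw familiarity.
knave = WordTraits("KNAVE", lexical_interest=0.8, frequency=0.3,
                   familiarity=0.7, definition_count=0.2, sentiment=0.5)
print(knave.composite({"lexical_interest": 2.0, "familiarity": 1.0}))
```

The numbers above are invented; the point is only the shape of the data the setter and the autofill would consume.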
Lexical Interest: Bigrams and Trigrams
Unusual-looking words that catch the eye are often a plus in crosswords, and a good way to differentiate. One way for a word to look unusual is to have an unexpected run of characters.
For example, I would argue that KNAVE is more interesting than THINK. They both have an N and a K in them, but the KN bigram is rarer than NK. Likewise, there are some trigrams that are fairly rare — for example OXC in OXCART.
This is an easier score to calculate, as we don’t need additional datasets. Go through the word and check whether any pair or triple of letters is unexpected. If any of them pass a rarity threshold, we add to the score.
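A minimal sketch of that check, assuming the bigram and trigram counts are built from the word list itself and treating the threshold as a knob to experiment with:

```python
from collections import Counter
from math import log

def ngram_counts(words: list[str], n: int) -> Counter:
    """Count every n-letter run across the word list."""
    counts: Counter = Counter()
    for w in words:
        w = w.upper()
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts

def lexical_interest(word: str, bigrams: Counter, trigrams: Counter,
                     threshold: float = 8.0) -> float:
    """Add a bonus for each bigram/trigram rarer than the threshold.

    Rarity is negative log frequency, so larger means rarer; the
    threshold and the add-one smoothing are placeholders to tune.
    """
    word = word.upper()
    score = 0.0
    for n, counts in ((2, bigrams), (3, trigrams)):
        total = sum(counts.values())
        for i in range(len(word) - n + 1):
            rarity = -log((counts[word[i:i + n]] + 1) / (total + 1))
            if rarity > threshold:
                score += rarity - threshold
    return score
```

With counts drawn from a typical word list, KNAVE would pick up a bonus for its KN run while THINK would not, provided the threshold is tuned so that NK stays below it.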
Frequency
How often a word is used may be an interesting characteristic: a word that’s used a lot reads very differently from one that rarely appears in print. Fortunately, we can use the Google Ngram dataset to calculate the frequency of each word. There are also great lists of words used in existing puzzles that we could use to constrain it to crosswords.
Links:
https://books.google.com/ngrams/
https://cryptics.georgeho.org/
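One possible shape for the resulting score, assuming we have already exported per-word totals from the Ngram data into a local word-to-count table (the export itself isn’t shown, and the scaling constant is a guess):

```python
from math import log10

def frequency_score(word: str, counts: dict[str, int]) -> float:
    """Map raw corpus counts onto a rough 0-1 scale via log frequency.

    Counts span many orders of magnitude, so a log scale keeps very
    common words from swamping everything else. Unknown words score 0.
    """
    count = counts.get(word.lower(), 0)
    if count == 0:
        return 0.0
    # Assume roughly 12 orders of magnitude between the rarest and the
    # most common words in the corpus; this constant is a guess to tune.
    return min(log10(count) / 12.0, 1.0)

# Example with made-up counts:
counts = {"run": 2_500_000_000, "oxcart": 120_000}
print(frequency_score("RUN", counts), frequency_score("OXCART", counts))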
Familiarity
Familiarity is akin to frequency but is a different rating. There are plenty of words that are seldom used but that the average reader would know or recognize. In addition, words can be familiar to solvers without being in common parlance. Familiarity is harder to determine, though there are efforts out there to build a table. We’ll have to research this more.
Links:
https://arxiv.org/pdf/1806.03431
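If that research turns up a usable word-to-familiarity table, plugging it in could be as simple as a lookup. The file layout and column names below are purely hypothetical:

```python
import csv

def load_familiarity(path: str) -> dict[str, float]:
    """Load a hypothetical CSV with 'word' and 'familiarity' columns."""
    table: dict[str, float] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            table[row["word"].lower()] = float(row["familiarity"])
    return table

def familiarity_score(word: str, table: dict[str, float]) -> float:
    # Words missing from the table fall back to a neutral midpoint.
    return table.get(word.lower(), 0.5)
```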
Number of definitions and parts of speech
This trait is particularly crossword-centric. Some words (think SET and RUN) have a lot of different meanings, and are useful for cryptics. We could compose a score valuing words that have a higher number of definitions or multiple parts of speech. We have the data to determine this already.
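As one possible source (not necessarily the one we’d ship with), WordNet via NLTK makes counting senses and parts of speech a few lines; our own dictionary data could slot into the same shape:

```python
# Requires NLTK and its WordNet data: pip install nltk, then run
# nltk.download("wordnet") once before first use.
from nltk.corpus import wordnet

def definition_score(word: str) -> tuple[int, int]:
    """Return (number of senses, number of distinct parts of speech)."""
    synsets = wordnet.synsets(word.lower())
    parts_of_speech = {s.pos() for s in synsets}
    return len(synsets), len(parts_of_speech)

# SET and RUN come back with dozens of senses across several parts of
# speech, while a word like OXCART has only one.
print(definition_score("set"), definition_score("run"), definition_score("oxcart"))
```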
Sentiment and beauty / Swearing and profanity
It’s possible to determine the sentiment around a word: people have done surveys to determine whether a given word reads as positive or negative. Along the same lines, there are profane words that people probably don’t want to see while solving a crossword over breakfast (a Will Shortz rule).
NOTE: The Peter Broda list has its own scoring system. Empirically, it strongly values profanity and crassness, and we’ve already gotten bug reports about some of its words. We may want to track profanity as a trait separate from sentiment.
Links:
https://github.com/stdlib-js/datasets-liu-positive-opinion-words-en
https://en.wikipedia.org/wiki/Phonaesthetics
https://www.sciencedirect.com/science/article/abs/pii/0749596X86900215
https://github.com/surge-ai/profanity
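A rough sketch of wiring those lists in, keeping profanity as its own flag rather than folding it into sentiment. The loader assumes simple one-word-per-line exports, which is an assumption about the lists’ formats rather than a description of them:

```python
def load_wordset(path: str) -> set[str]:
    """One word per line; comment lines starting with ';' are skipped."""
    with open(path) as f:
        return {line.strip().lower()
                for line in f
                if line.strip() and not line.startswith(";")}

def sentiment_score(word: str, positive: set[str], negative: set[str]) -> float:
    """+1.0 for positive-list words, -1.0 for negative-list words, else 0.0."""
    w = word.lower()
    if w in positive:
        return 1.0
    if w in negative:
        return -1.0
    return 0.0

def is_profane(word: str, profanity: set[str]) -> bool:
    # Kept as a separate flag so a setter can filter hard rather than
    # just down-weighting via sentiment.
    return word.lower() in profanity
```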
Other possibilities
It’s worth talking about a few things that are either too situational to be good dimensions or are hard to measure and calculate.
Word Shape
The shape of a word. That is to say, the graphemes that are combined to create it. This is highly situational and can’t be precomputed. For example, consider a standard crossword with the word WHEY in it. That word will work really well in the last row, as every letter in it is a valid and relatively high-frequency last letter for the down clues. If it shifts up a row, you start running into problems: down words ending in Y? and H? are much rarer, and it’s not nearly as good a word in that position. The same word would have very different scores based on where it is.
The autofill algorithm tries to account for that by checking the crossing words for their frequency. As a result, we can most likely skip this factor when precomputing scores.
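To illustrate the reasoning rather than propose another precomputed trait, here is a small sketch of the last-letter check that makes WHEY comfortable in the bottom row:

```python
from collections import Counter

def last_letter_frequencies(words: list[str]) -> Counter:
    """How often each letter appears as the final letter of a word."""
    return Counter(w.upper()[-1] for w in words if w)

def last_row_friendliness(word: str, last_letters: Counter) -> float:
    """Average share of words ending in each letter of the candidate.

    A high value means every crossing down entry can plausibly end on
    one of this word's letters, which is what makes WHEY comfortable
    in the bottom row.
    """
    total = sum(last_letters.values())
    return sum(last_letters[ch] / total for ch in word.upper()) / len(word)
```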
Clue words
Crosswords are primarily about the clues, of course. Some words — especially for cryptics — just lead to good clues: words with other words inside them, or words with clever common homophones or anagrams. The fact that SILENT and LISTEN are anagrams is a good (though overused) example of this. The aforementioned WHEY is a homophone of WAY and WEIGH, which also makes it valuable in cryptics.
It’s also worth considering word fragments. For example, words with other words embedded within them make for good cryptic answers too (and are great for rebus puzzles and crossword themes).
I don’t have a good concept of how to measure this trait, yet.
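That said, one narrow slice of it can be detected mechanically from the word list alone: anagram partners and embedded words. A rough sketch, with no claim that this captures clue quality overall:

```python
from collections import defaultdict

def anagram_partners(words: list[str]) -> dict[str, list[str]]:
    """Group words by their sorted letters; SILENT and LISTEN share a bucket."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for w in words:
        buckets["".join(sorted(w.upper()))].append(w.upper())
    return {w: [p for p in bucket if p != w]
            for bucket in buckets.values() for w in bucket}

def embedded_words(word: str, vocabulary: set[str], min_len: int = 3) -> set[str]:
    """Find shorter vocabulary words hidden inside the given word.

    The vocabulary is assumed to be uppercase.
    """
    word = word.upper()
    found = set()
    for i in range(len(word)):
        for j in range(i + min_len, len(word) + 1):
            fragment = word[i:j]
            if fragment != word and fragment in vocabulary:
                found.add(fragment)
    return found
```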