Thoughts on scoring words
Status: Draft | January 2025
What makes a word a good word for a crossword? What makes it interesting? Some words might entertain a solver, such as TANTALIZE. Others might disgust or frustrate them: consider MOIST or SSW (the direction). Some words are overused and can cause eyes to roll — EGOT and OLEO — while others can excite and provoke wonder.
Good words are the backbone of any word puzzle. When combined into a grid, they almost become a form of poetry: a combination of words that engages and delights the solver. The fill can be predetermined in places, as the letters and words require, but it can also have moments of whimsy and surprise, or clever moments that provoke thought.
This document is an attempt to enumerate measurable dimensions along which words can be interesting for word puzzles, and to propose a few ways of using those dimensions to score word lists. There are no absolutes when it comes to language: everyone’s lived experience is different, and the language each person knows, speaks, and is familiar with varies from person to person. The score a word gets here may underrepresent its value to a particular solver. Nevertheless, this attempts to provide some structure.
Crossword setters generally try to write for a common audience and make their puzzles accessible. However, there are no absolutes when it comes to culture. Words that are familiar to some are obscure to others. There’s a reason crosswords thrive in local newspapers. A common location provides at least some common grounding for a setter to target.
For a wonderful musing and history of this from a gendered perspective, read Anna Shechtman’s The Riddles of the Sphinx.
Puzzle Kinds
It’s worth noting that how each puzzle kind uses words affects how setters approach them. Standard crosswords use a lot of filler words and may have less flexibility in what to choose. Cryptics, on the other hand, have relatively few words in their grids and far more ability to choose them carefully.
Overall approach
We propose assigning scores to words to get better results when creating grids. These scores would be surfaced in both the Word List and Autofill functionalities. We start with the following assumptions:
Variety is key. First and foremost, having a good set of different types of words keeps the solver entertained and engaged.
Don’t clump traits. It’s bad form to have too much similarity in one section of a puzzle.
Where possible use familiar words… It’s fine to send your solver to the dictionary for some words, but if they need a dictionary to make any progress you might be making it too hard.
…but not too many. Expecting to stretch your solvers’ vocabulary is a plus. In addition, occasionally you need to reach for an obscure word to make an otherwise strong section fill.
Human editing is best. Perhaps in the future it will be possible to have AI create high-quality grids, but the best ones will still have a high degree of human intervention. This is meant as an assistive tool and shouldn’t be used to override editorial control.
Traits
We propose a few measurable traits for a word, each of which can have a numerical rating. These dimensions can be used to drive variety in a grid, and give the autofill something to work with beyond word shape. These ratings are considered independently of the grid being filled and can be precomputed beforehand. The traits proposed are:
Lexical interest
Frequency
Familiarity
Definition count
Sentiment
Each word can have a score for each trait. That would give the setter the ability to assess the overall grid and make decisions. It could also be used by the autofill to pick better words.
Details
We propose a way of measuring each of the traits below. For each trait, we discuss what it captures and touch a little on how to calculate it. It will take quite some experimentation to turn these into a practical score.
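To make that concrete, here is a minimal sketch (names and field layout are hypothetical, not anything in the codebase) of what a per-word trait record and a setter-weighted composite could look like:

```python
from dataclasses import dataclass

# Hypothetical record shape, not an existing API: one normalized 0.0-1.0
# value per trait, plus a composite the setter can weight to taste.
@dataclass
class WordTraits:
    word: str
    lexical_interest: float
    frequency: float
    familiarity: float
    definition_count: float
    sentiment: float

    def composite(self, weights: dict[str, float]) -> float:
        """Weighted blend of the traits; unweighted traits are ignored."""
        return sum(getattr(self, trait) * w for trait, w in weights.items())

# Example with made-up numbers: a setter who prizes unusual letter
# patterns over raw familiarity.
knave = WordTraits("KNAVE", lexical_interest=0.8, frequency=0.3,
                   familiarity=0.7, definition_count=0.2, sentiment=0.5)
print(knave.composite({"lexical_interest": 2.0, "familiarity": 1.0}))
```

The numbers above are invented; the point is only the shape of the data the setter and the autofill would consume.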
Lexical Interest: Bigrams and Trigrams
Unusual-looking words that catch the eye are often a plus in crosswords, and a good way to differentiate. One way for a word to look unusual is to have an unexpected run of characters.
For example, I would argue that KNAVE is more interesting than THINK. They both have an N and a K in them, but the KN bigram is rarer than NK. Likewise, there are some trigrams that are fairly rare — for example OXC in OXCART.
This is an easier score to calculate, as we don’t need additional datasets. Go through the word and check whether any pair or triple of letters is unexpected. If any of them pass a rarity threshold, we add to the score.
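A minimal sketch of that check, assuming the bigram and trigram counts are built from the word list itself and treating the threshold as a knob to experiment with:

```python
from collections import Counter
from math import log

def ngram_counts(words: list[str], n: int) -> Counter:
    """Count every n-letter run across the word list."""
    counts: Counter = Counter()
    for w in words:
        w = w.upper()
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts

def lexical_interest(word: str, bigrams: Counter, trigrams: Counter,
                     threshold: float = 8.0) -> float:
    """Add a bonus for each bigram/trigram rarer than the threshold.

    Rarity is negative log frequency, so larger means rarer; the
    threshold and the add-one smoothing are placeholders to tune.
    """
    word = word.upper()
    score = 0.0
    for n, counts in ((2, bigrams), (3, trigrams)):
        total = sum(counts.values())
        for i in range(len(word) - n + 1):
            rarity = -log((counts[word[i:i + n]] + 1) / (total + 1))
            if rarity > threshold:
                score += rarity - threshold
    return score
```

With counts drawn from a typical word list, KNAVE would pick up a bonus for its KN run while THINK would not, provided the threshold is tuned so that NK stays below it.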
Frequency
How often a word is used may be an interesting characteristic: a word that’s used a lot reads very differently from one that rarely appears in print. Fortunately, we can use the Google Ngram dataset to calculate the frequency of each word. There are also great lists of words used in existing puzzles that we could use to constrain it to crosswords.
Links:
https://books.google.com/ngrams/
https://cryptics.georgeho.org/
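One possible shape for the resulting score, assuming we have already exported per-word totals from the Ngram data into a local word-to-count table (the export itself isn’t shown, and the scaling constant is a guess):

```python
from math import log10

def frequency_score(word: str, counts: dict[str, int]) -> float:
    """Map raw corpus counts onto a rough 0-1 scale via log frequency.

    Counts span many orders of magnitude, so a log scale keeps very
    common words from swamping everything else. Unknown words score 0.
    """
    count = counts.get(word.lower(), 0)
    if count == 0:
        return 0.0
    # Assume roughly 12 orders of magnitude between the rarest and the
    # most common words in the corpus; this constant is a guess to tune.
    return min(log10(count) / 12.0, 1.0)

# Example with made-up counts:
counts = {"run": 2_500_000_000, "oxcart": 120_000}
print(frequency_score("RUN", counts), frequency_score("OXCART", counts))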
Familiarity
Familiarity is akin to frequency but is a different rating. There are plenty of words that are seldom used but that the average reader would know or recognize. In addition, words can be familiar to solvers without being in common parlance. Familiarity is harder to determine, though there are efforts out there to build a table. We’ll have to research this more.
Links:
https://arxiv.org/pdf/1806.03431
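If that research turns up a usable word-to-familiarity table, plugging it in could be as simple as a lookup. The file layout and column names below are purely hypothetical:

```python
import csv

def load_familiarity(path: str) -> dict[str, float]:
    """Load a hypothetical CSV with 'word' and 'familiarity' columns."""
    table: dict[str, float] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            table[row["word"].lower()] = float(row["familiarity"])
    return table

def familiarity_score(word: str, table: dict[str, float]) -> float:
    # Words missing from the table fall back to a neutral midpoint.
    return table.get(word.lower(), 0.5)
```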
Number of definitions and parts of speech
This trait is particularly crossword-centric. Some words (think SET and RUN) have a lot of different meanings, and are useful for cryptics. We could compose a score valuing words that have a higher number of definitions or multiple parts of speech. We have the data to determine this already.
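As one possible source (not necessarily the one we’d ship with), WordNet via NLTK makes counting senses and parts of speech a few lines; our own dictionary data could slot into the same shape:

```python
# Requires NLTK and its WordNet data: pip install nltk, then run
# nltk.download("wordnet") once before first use.
from nltk.corpus import wordnet

def definition_score(word: str) -> tuple[int, int]:
    """Return (number of senses, number of distinct parts of speech)."""
    synsets = wordnet.synsets(word.lower())
    parts_of_speech = {s.pos() for s in synsets}
    return len(synsets), len(parts_of_speech)

# SET and RUN come back with dozens of senses across several parts of
# speech, while a word like OXCART has only one.
print(definition_score("set"), definition_score("run"), definition_score("oxcart"))
```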
Sentiment and beauty / Swearing and profanity
It’s possible to determine the sentiment around a word: people have done surveys to determine whether a given word reads as positive or negative. Along the same lines, there are profane words that people probably don’t want to see while solving a crossword over breakfast (a Will Shortz rule).
NOTE: The Peter Broda list has its own scoring system. Empirically, it strongly values profanity and crassness, and we’ve already gotten bug reports about some of its words. We may want to track profanity as a trait separate from sentiment.
Links:
https://github.com/stdlib-js/datasets-liu-positive-opinion-words-en
https://en.wikipedia.org/wiki/Phonaesthetics
https://www.sciencedirect.com/science/article/abs/pii/0749596X86900215
https://github.com/surge-ai/profanity
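A rough sketch of wiring those lists in, keeping profanity as its own flag rather than folding it into sentiment. The loader assumes simple one-word-per-line exports, which is an assumption about the lists’ formats rather than a description of them:

```python
def load_wordset(path: str) -> set[str]:
    """One word per line; comment lines starting with ';' are skipped."""
    with open(path) as f:
        return {line.strip().lower()
                for line in f
                if line.strip() and not line.startswith(";")}

def sentiment_score(word: str, positive: set[str], negative: set[str]) -> float:
    """+1.0 for positive-list words, -1.0 for negative-list words, else 0.0."""
    w = word.lower()
    if w in positive:
        return 1.0
    if w in negative:
        return -1.0
    return 0.0

def is_profane(word: str, profanity: set[str]) -> bool:
    # Kept as a separate flag so a setter can filter hard rather than
    # just down-weighting via sentiment.
    return word.lower() in profanity
```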
Other possibilities
It’s worth talking about a few things that are either too situational to be good dimensions or are hard to measure and calculate.
Word Shape
The shape of a word. That is to say, the graphemes that are combined to create it. This is highly situational and can’t be precomputed. For example, consider a standard crossword with the word WHEY in it. That word will work really well in the last row, as every letter in it is a valid and relatively high-frequency last letter for the down clues. If it shifts up a row, you start running into problems: down words ending in Y? and H? are much rarer, and it’s not nearly as good a word in that position. The same word would have very different scores based on where it is.
The autofill algorithm tries to account for that by checking the crossing words for their frequency. As a result, we can most likely skip this factor when precomputing scores.
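To illustrate the reasoning rather than propose another precomputed trait, here is a small sketch of the last-letter check that makes WHEY comfortable in the bottom row:

```python
from collections import Counter

def last_letter_frequencies(words: list[str]) -> Counter:
    """How often each letter appears as the final letter of a word."""
    return Counter(w.upper()[-1] for w in words if w)

def last_row_friendliness(word: str, last_letters: Counter) -> float:
    """Average share of words ending in each letter of the candidate.

    A high value means every crossing down entry can plausibly end on
    one of this word's letters, which is what makes WHEY comfortable
    in the bottom row.
    """
    total = sum(last_letters.values())
    return sum(last_letters[ch] / total for ch in word.upper()) / len(word)
```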
Clue words
Crosswords are primarily about the clues, of course. Some words — especially for cryptics — just lead to good clues: words with other words inside them, or words with clever common homophones or anagrams. The fact that SILENT and LISTEN are anagrams is a good (though overused) example of this. The aforementioned WHEY is a homophone of WAY and WEIGH, which also makes it valuable in cryptics.
It’s also worth considering word fragments. For example, words with other words embedded within them make for good cryptic answers too (and are great for rebus puzzles and crossword themes).
I don’t have a good concept of how to measure this trait, yet.
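That said, one narrow slice of it can be detected mechanically from the word list alone: anagram partners and embedded words. A rough sketch, with no claim that this captures clue quality overall:

```python
from collections import defaultdict

def anagram_partners(words: list[str]) -> dict[str, list[str]]:
    """Group words by their sorted letters; SILENT and LISTEN share a bucket."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for w in words:
        buckets["".join(sorted(w.upper()))].append(w.upper())
    return {w: [p for p in bucket if p != w]
            for bucket in buckets.values() for w in bucket}

def embedded_words(word: str, vocabulary: set[str], min_len: int = 3) -> set[str]:
    """Find shorter vocabulary words hidden inside the given word.

    The vocabulary is assumed to be uppercase.
    """
    word = word.upper()
    found = set()
    for i in range(len(word)):
        for j in range(i + min_len, len(word) + 1):
            fragment = word[i:j]
            if fragment != word and fragment in vocabulary:
                found.add(fragment)
    return found
```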