This document describes one possible algorithm to generate a filled crossword grid from a starting template
- Ensure you are using NodeJS 22 or above
- From the
Work
directory, runcat complex.txt | node generate.js
- Tested with GHC 9.4.8+, Cabal 3.10+
- From the
Algorithm
directory, runcabal run -- betterfill -d ../dict_fr.txt ../Data/complex.txt
A starting template must be provided, containing the following
- It is a CARRIAGE RETURN delimited text file, where each line has the same length
- Each line contains only the following characters :
.
: the DOT character : it represents an empty cell in the crossword grid that the program intents to fill#
: the HASH character : it represents a black cell in the crossword gridA...Z
: Any of the 26 uppercase letters of the usual ASCII alphabet
A quick input template example :
#C.....
CHIEN#.
.A.....
.T.....
.S#....
A more fancy example - inspired from A dog's day from Gnome Crosswords :
## ##### ##
#HA#.....#..#
#.I...........#
#.C#.......#..#
#.##.#...#.##.#
## #.#...#.# ##
#.......#
#..###..#
#..#..#
#####
This program supports any Latin-derived dictionary, that is any dictionary text file meeting the following conditions :
- It is a text file containing one word per line
- Each line contains only
A...Z
characters
A quick example for French :
AAAI
ABAISSA
ABAISSABLE
...
ZYTHONS
ZYTHUM
ZYTHUMS
One may generate a dictionary.txt file using aspell
for any language, by replacing fr
below with the language codename, e.g. en
for english :
#!/bin/bash
aspell dump master fr \
| iconv -f utf8 -t ascii//TRANSLIT//IGNORE \
| tr '[:lower:]' '[:upper:]' \
| grep -v '-' \
| grep -v "'" \
| tr -d "^'\`\"" \
| grep -v '^.$' \
| tee >( sort | uniq > dictionary.txt )
NOTE: This program intentionaly does not ship with any official dictionary, say the Collins or the Larousse, because it allows the user to remove unwanted words - Allowing to create (Ex1) a dictionary tailored for kids - the author usecase, (Ex2) A conjugation-free dictionary or (Ex3) An animals only dictionary (Ex4) A non-nsfw dictionary
Prepare the prerequisites :
- A
dictionary.txt
file in the Work folder - An input
template.txt
file
Then run cat template.txt | node generate.js
from Work
or cabal run -- betterfill -d ../dict_fr.txt template.txt
from Algorithm
. Hopefully the program will terminate with a filled grid on standard output.
NOTE: The average user can stop reading here. The next chapters describe the algorithm used by this program to try and fill the grid in a reasonable time - give it 200 seconds maximum.
Statement of the problem : finding word candidates with fixed letters. For instance in french consider the following (part of a) grid :
. | R | O | . | S | . | . |
The following regex /^.RO.S..$/
applied to the dictionary will perfectly compute the candidate set, including but not limited to CROISER, CROISES, CROISEZ. The problem is :
- Applying a general regex is slow, especially compared to walking character by character and compare by hand (in a sufficiently low level language or environment e.g. C or asm)
- Scanning the entire dictionary is notoriously slow, there are 1755 french words of length 7 with a
R
as a second letter but there are 229353 words in the same dictionary. Let's precompute them.
To speed up candidate words search the following structure helps :
- The structure is a Map
- The key is the word length N
- The value is another Map
- Which key is the one-letter regex, i.e. a fixed letter L at a fixed position P
- And which value is the Set of words of length N which contain the letter L at position P
Map
├──2->Map
│ ├──A.->Set(AC,AH,AI,AL,AN,AS,AU,AV)
│ ├──B.->Set(BD,BU)
│ ├──C.->Set(...)
│ ├──...
│ ├──.Z->Set(DZ,HZ)
├──3->Map
│ ├──A..->Set(ACE,...,AXE)
│ ├──B..->Set(BAC,...,BYE)
│ ├──...
├──...
├──7->Map
│ ├──...
│ ├──.R.....->Set(...,CROISER,CROISES,CROISEZ,...)
│ ├──...
│ ├──..O....->Set(...,CROISER,CROISES,CROISEZ,...)
│ ├──...
│ ├──....S..->Set(...,CROISER,CROISES,CROISEZ,...)
│ ├──...
├──...
├──25->Map
│ ├──A........................->Set(ANTICONSTITUTIONNELLEMENT)
│ ├──...
Using this structure, finding a 7 letters word matching .RO.S..
is
- Finding the 3 sets matching
.R.....
,..O....
and....S..
- Performing the intersection of those sets
This structure allows performant search of words matching patterns - and is a classic memoization
In crossword grids, the slots where word are actually placed are defined by
- Any contiguous serie of non-black cells, of length at least two
- In this program, it is any serie of non HASH or SPACE of two or more
- Thus looking for both DOT or a LETTER
As this program will have to iterate through those slots a lot, let's precompute them - instead of performing this static work every time.
Considering the grid :
ALPIN.
R....#
C#.#..
H.....
E...#.
The 13 slots are :
01 From 0,0 to 5,0 - contains ALPIN.
02 From 0,1 to 4,1 - contains R....
03 From 4,2 to 5,2 - contains ..
04 From 0,3 to 5,3 - contains H.....
05 From 0,4 to 3,4 - contains E...
06 From 0,0 to 0,4 - contains ARCHE
07 From 1,0 to 1,1 - contains L.
08 From 1,3 to 1,4 - contains ..
09 From 2,0 to 2,4 - contains P....
10 From 3,0 to 3,1 - contains I.
11 From 3,3 to 3,4 - contains ..
12 From 4,0 to 4,3 - contains N...
13 From 5,2 to 5,4 - contains ...
If the word RIMEZ
is tried at slot 02, it creates two impossible situations :
- slot 09 now contains
PM...
- slot 12 now contains
NZ..
Each time a candidate is tried at a slot, it can be eliminated if it creates a lack of candidates for any of the impacted slot.
Let's state the algorithm like that
- Start from the (parsed) template grid
- Find the slots - two or more consecutive non-black cells
- Choose one of those slots wisely
- Maximize the number of known letters - fills slots near to previous populated slots
- Minimize the number of unknown letters
- Minimize the number of candidates for the slot
- It's a heuristic at this point - the goal being to avoid costly branches
- For the chosen slot, sort candidates
- Compute the frequency of each letter in crossing slots
- Add all this frequency to get a score
- Basically this maximizes the chance of having candidates in the crossing slots
- It's a heuristic at this point - the goal being to avoid costly branches
- Iterate all candidates
- Place the candidate in the slot, modifying the grid
- Check every crossing slot containing a modified cell of the grid
- If some word still can be placed at this crossing slot, skip
- Else it's a dead-end, eliminate the candidate
- If the candidate has not been eliminated, perform a recursion, at step 2
- After recursion, restore the grid
- When no slot has at least one empty cell, it's a solution. Claim victory
NOTE: To iterate the candidate, use the pattern tree described in the previous chapters.
Usage: autofill [-d dictionary] [-b] [-v] [-q] [-n] [-h|--help] [gridfile]
-d dictionary
use another dictionary file than./dictionary.txt
-b
bypass the AC-3 algorithm. AC-3 speeds a lot by pre-elminiating candidates if the grid is at least 1/3 filled. Disable on a nearly empty grid.-v
be verbose. Expect debugging information-q
be silent. No animation, only output a solution-n
don't sort candidates. Can speed up a little on grids with a high density of black cells-h
output usage helpgridfile
a filename or-
forstdin
. Defaults tostdin
Usage: betterfill [-d dictionary] [-q] [-h|--help] [gridfile]
-d dictionary
use another dictionary file than./dictionary.txt
-q
be silent. No animation, only output a solution-h
output usage helpgridfile
a filename or-
forstdin
. Defaults tostdin