Bootstrapping project-specific spell-checkers

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

Claus Huitfeldt, University of Bergen

11 July 2019

Slides:
http://mlcd.blackmesatech.com/Talks


Overview

  • spell checking in DH
    • challenges
    • meeting those challenges
  • pilot study
    • constructing project-specific dictionary
    • dictionary size and error detection
    • dictionary size and signal : noise ratio
    • dictionary size and corpus size
  • toward better spell checking

Why is spell-checking hard in DH?

Why is spell checking hard in DH projects?

  • Transcriptions seek to reproduce original, not correct it; idiosyncratic spelling is a challenge.
  • Off-the-shelf dictionaries cover
    • current varieties
    • of widely spoken (= commercially important) languages.
    They don't cover older forms or minority languages.
  • Some languages lack orthographic norms; inconstant spelling.
  • XML documents are often polyglot (header, text, annotation, ... and markup)

Meeting the challenges: idiosyncratic spelling

How to deal with idiosyncratic (but consistent) spelling?

  • Goal: avoid erroneous flags on correctly transcribed words.
  • Goal: catch unconscious spelling corrections by transcribers.
  • N.B. Software knows and cares about the dictionary, not orthographic norms.
  • If the author consistently writes neice not niece, or giebt not gibt, then put neice and giebt in the dictionary, and take niece and gibt out.
  • That is: make your own dictionary.

Meeting the challenges: variant language forms

How to deal with older languages? under-resourced languages? non-standard varieties?

  • We need a dictionary.
  • There are no off-the-shelf dictionaries.
  • So: make your own dictionary.

Meeting the challenges: inconstant spelling

What to do when a language has no orthographic norms?

  • Spelling may vary scribe to scribe,
  • or line by line.
  • But cannot be completely unpredictable. (We can read the text!)
  • One word, multiple spellings (colour, color): substitution of the wrong form is undetectable.
  • But some errors are still detectable: colr, colorr, colur, colro, ...

Meeting the challenges: polyglot texts

What to do when documents have multiple languages?

Filter.

  • Use xml:lang* (or other markup) to choose which words to check with which dictionary.

    • * Oxygen does this out of the box.
  • Extract one alpha text† for each distinct language / dictionary (language variety).

    • † Alpha text: list of word forms which should be meaningfully checkable.
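As a sketch of the filtering step, the following Python function groups word tokens by the xml:lang in scope, yielding one alpha text per language. The function name and the regex-based tokenization are illustrative assumptions, not the projects' actual tooling.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Clark notation for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def alpha_texts(xml_string, default_lang="und"):
    """Group word tokens by the xml:lang in scope, so that each
    list can be checked against its own dictionary."""
    root = ET.fromstring(xml_string)
    texts = defaultdict(list)

    def walk(elem, lang):
        lang = elem.get(XML_LANG, lang)   # xml:lang inherits downward
        if elem.text:
            texts[lang].extend(re.findall(r"\w+", elem.text))
        for child in elem:
            walk(child, lang)
            if child.tail:                # tail text is in the parent's scope
                texts[lang].extend(re.findall(r"\w+", child.tail))

    walk(root, default_lang)
    return dict(texts)
```

For example, a transcription with an embedded foreign phrase yields two separate alpha texts, one per dictionary.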

Pilot study: questions

Pilot project used data from two projects (details) to ask:

  • How can we construct project-specific dictionaries?
  • What should they contain / exclude?
  • How much work is involved?
  • How big does the dictionary need to be?
  • How much transcribed text does that need?

Constructing a project-specific dictionary

How to construct a dictionary?

Simplest method:

  1. Start with empty dictionary.
  2. Check a text.
  3. For each word flagged, choose:
    • If correct, add to dictionary.
    • Otherwise correct it.

Simplest, but not necessarily fastest or easiest. (Alternative approach.)
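The simplest method above can be sketched as a loop over tokens, with a callback standing in for the human judgement at each flag. The function and parameter names are hypothetical, for illustration only.

```python
def bootstrap_dictionary(tokens, is_correct):
    """Start with an empty dictionary; every flagged (unknown) form
    is either confirmed and added, or marked for correction.
    `is_correct` stands in for the transcriber's judgement."""
    dictionary = set()
    corrections = []
    for token in tokens:
        if token in dictionary:
            continue                      # known form: not flagged
        if is_correct(token):
            dictionary.add(token)         # correct form: add to dictionary
        else:
            corrections.append(token)     # transcription error: fix in text
    return dictionary, corrections
```

Note that each form is judged only once: after neice is added, later occurrences pass silently.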

Can we control the hailstorm of false reports?

What to include in the dictionary? What to exclude?

Ideally:

  • include all forms that occur correctly
  • exclude all forms that occur incorrectly (= transcription errors)

Ideal is unattainable:

  • A form may be sometimes correct, sometimes incorrect (real-word errors).
  • No dictionary is complete; there are always new correct forms.

How big does it have to be?

How big must the dictionary be to catch real errors?
Answer: even an empty dictionary catches real errors.

Figure 1. Dictionary size versus number of correct reports.

The catch

The problem: empty dictionaries also produce noise (false reports).
Fortunately, the rate falls fast.

Figure 2. Dictionary size versus number of false reports.

What to include in the dictionary? What to exclude? (2)

Tradeoff:

  • more signal (minimize missed flags)
  • less noise (minimize false reports)

Smaller dictionaries: more flags (more signal, but more noise).

Bigger dictionaries: fewer flags (less noise, but also less signal).

How many extra false reports is another correct report worth? 100? 10? 1?

For concreteness, we assume 10.
That is: we want a signal : noise ratio of 1 : 10 (= 0.1).

What to include in the dictionary? What to exclude? (3)

How to decide: play a simple game between In (include it!) and Out (exclude it!):

  • If catching one more error is worth looking at 10 incorrect flags, then for any given word form:
  • Every correct occurrence* is worth 1 point for In.
  • Every incorrect occurrence* is worth 10 points for Out.
  • If the score is equal, In wins.

* Count correct and incorrect occurrences in a representative sample. (Or just guess.)

(Boring arithmetic expression also available.)
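The scoring rule reduces to a one-line comparison; the function name and the `noise_weight` parameter are our own labels for the quantities in the game above.

```python
def include_in_dictionary(correct, incorrect, noise_weight=10):
    """The In/Out game: each correct occurrence scores 1 point for In,
    each incorrect occurrence scores `noise_weight` points for Out.
    Ties go to In."""
    return correct >= incorrect * noise_weight
```

Applied to the Chalmers examples below: negociating (11 : 0) is In; negotiating (4 : 20) and fora (1 : 190) are Out.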

What to include in the dictionary? What to exclude? (4)

Examples from Chalmers, n = 10:

  • negociating 11 correct, 0 incorrect. Score 11:0, In.
  • negotiating 4 correct, 2 incorrect. Score 4:20, Out.
  • biit 0 correct, 3 incorrect. Score 0:30, Out.
  • infixing 0 correct, 1 incorrect. Score 0:10, Out.
  • pic 0 correct, 3 incorrect. Score 0:30, Out.
  • fora 1 correct, 19 incorrect. Score 1:190, Out.

How big must the dictionary be?

With size, signal : noise ratio improves.
For Wittgenstein and Chalmers, ca 15,000 forms achieves S/N = 0.10.

Figure 3. Dictionary size versus signal/noise ratio.

How much data do we need?

How much text produces a dictionary of what size?
Some variation, but a clear pattern.
For a 15,000 word dictionary, we want 200,000 tokens of text.

Figure 4. Corpus size versus dictionary size.

Summary

  • Making project-specific dictionaries is doable.
  • A very small dictionary (1300 forms) can cover 90% of all tokens.
  • A larger dictionary (15,000 forms) can have signal : noise ratio of 1 : 10.
  • For a 15,000-form dictionary, we need ca. 200,000 words of text.
  • Real-word errors cannot be detected this way.

Can we do better?

What is spell-checking?

In general, spell checking has several parts.

  • statistical model of language
    (assigns probability to words, sentences)
    E.g. text is sequence of equiprobable known forms.
  • acceptability threshold
    (low-probability tokens are likely errors)
    E.g. p = 0
  • word similarity measure
    (embodies some theory of error)
    E.g. Levenshtein / Damerau edit distance (or phonetic distance)
  • word similarity threshold
    (dictionary words closest to error are possible corrections)
    E.g. edit distance = 1 (or 2).

Each of these can be varied, producing different kinds of spell checkers.
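The four parts can be assembled into a minimal checker. This sketch uses the example choices above (equiprobable known forms, threshold p = 0, Levenshtein distance, edit distance ≤ 1); the function names are our own.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions,
    deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def check(tokens, dictionary, max_dist=1):
    """Flag tokens with probability 0 under the equiprobable-known-forms
    model (i.e. not in the dictionary), and suggest dictionary words
    within the similarity threshold as possible corrections."""
    for token in tokens:
        if token not in dictionary:        # acceptability threshold: p = 0
            suggestions = sorted(w for w in dictionary
                                 if edit_distance(token, w) <= max_dist)
            yield token, suggestions
```

Swapping any one component (a word-bigram model, a phonetic distance, a nonzero probability threshold) yields a different checker from the same skeleton.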

Future work

  • XML support
  • Alternative models
    • character n-grams
    • word n-grams (n > 1)
    • POS n-grams
    • ...

Transcription-project partners sought!
