Reading and Writing Electronic Text: Midterm Project

My midterm project devises a new form of electronic poetry called dead language poetry, in which one finds an ancient, untranslated or untranslatable text and attempts to generate rhythmically plausible translations in another language (here, English).

I’ve created a Python program that generates texts conforming to this form by examining digital transcriptions of the Voynich Manuscript and reconstructing a translation of the manuscript using lines remixed from a different source text.

The program has several components:

cleanup_text.py: preprocesses source texts to split them at punctuation boundaries to produce one phrase per line.
cmudict_to_json.py: preprocesses the cmudict pronunciation dictionary into the more easily readable JSON format.
process_voynich.py: extracts a consistent transcription of the Voynich Manuscript from the transcription data file.
translate.py: chooses phrases from the source text that could be plausible rhythmic translations of lines from the Voynich Manuscript by examining the consonant and vowel rhythms of each phrase.
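
The two preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the set of punctuation marks treated as phrase boundaries, and the choice to keep only the first pronunciation of each word are all assumptions.

```python
import json
import re

# Assumption: a phrase ends at any run of these punctuation marks.
PHRASE_BOUNDARY = re.compile(r"[.,;:!?]+")
# Alternate pronunciations in cmudict are listed as WORD(1), WORD(2), ...
VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def split_phrases(text):
    """Split text at punctuation and return one cleaned-up phrase per item."""
    chunks = PHRASE_BOUNDARY.split(text)
    phrases = [" ".join(chunk.split()) for chunk in chunks]  # collapse whitespace
    return [phrase for phrase in phrases if phrase]


def cmudict_to_dict(lines):
    """Parse cmudict-style lines ("WORD  PH1 PH2 ...") into {word: [phonemes]}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phonemes = line.split()
        if VARIANT_SUFFIX.search(word):
            continue  # assumption: keep only the first listed pronunciation
        entries[word.lower()] = phonemes
    return entries


if __name__ == "__main__":
    for phrase in split_phrases("one phrase, then another; and a third."):
        print(phrase)
    print(json.dumps(cmudict_to_dict(["LIGHT  L AY1 T"])))
```

Writing the dictionary out with `json.dumps` gives a file that later steps can load in one call instead of re-parsing the original dictionary format.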

Specifically, translate.py examines each line of the Voynich Manuscript input, generates a plausible set of phonemes for each word in the line, and converts the phonemes into a sequence of consonant and vowel markers. The three lines below show a manuscript line, its generated phonemes, and the resulting markers:

fachys ykal ar ataiin shol shory cth res y kor sholdy

F-AE0-CH-EY0-EH1-S EY0-K-AE1-L AA1-R AE0-T-AE0-AY0-EH1-N SH-OW0-EH1-L SH-AO1-R-IY0 S-T-IY1-EY1-CH R-EY1-Z W-AY1 K-AO1-R SH-OW0-L-D-W-AY1

CVCVVC VCVC VC VCVVVC CVVC CVCV CVVC CVC CV CVC CVCV
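
The phoneme-to-marker step can be sketched as below. It is a hedged reconstruction from the sample above rather than the code in translate.py: the function names are hypothetical, and the rule that a run of consecutive consonants collapses to a single C (while each vowel keeps its own V) is inferred from the example, where S-T-IY1-EY1-CH becomes CVVC.

```python
def phoneme_class(phoneme):
    """Vowels in this notation end in a stress digit (0, 1 or 2); consonants do not."""
    return "V" if phoneme[-1].isdigit() else "C"


def cv_pattern(word_phonemes, separator="-"):
    """Convert one word's phonemes, e.g. "SH-AO1-R-IY0", to a marker string."""
    markers = []
    for phoneme in word_phonemes.split(separator):
        marker = phoneme_class(phoneme)
        # Assumption inferred from the sample: adjacent consonants merge into one C.
        if marker == "C" and markers and markers[-1] == "C":
            continue
        markers.append(marker)
    return "".join(markers)


def line_pattern(line_phonemes):
    """Convert a whole line of space-separated words to space-separated markers."""
    return " ".join(cv_pattern(word) for word in line_phonemes.split())


if __name__ == "__main__":
    print(line_pattern("SH-AO1-R-IY0 S-T-IY1-EY1-CH R-EY1-Z"))
```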

Then the program selects the most similar line from the source text to serve as the translation. It compares the consonant and vowel representation of each source line with the representation of the ancient line, using edit distance as the measure of similarity. For example, these two representations differ by only a few edits:

CVC CV CVC CVVVC CVC CVVVC

CVC CV CVC CVCV CVC CVCVC
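
The comparison can be sketched with a standard Levenshtein edit distance. The sketch assumes the whole marker strings are compared character by character, spaces included; whether translate.py does this or compares word by word is not stated, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_phrase(target_pattern, candidates):
    """Return the (phrase, pattern) pair whose pattern is nearest to the target.

    candidates is a list of (phrase, pattern) pairs from the source text.
    """
    return min(candidates, key=lambda pair: edit_distance(target_pattern, pair[1]))


if __name__ == "__main__":
    print(edit_distance("CVC CV CVC CVVVC CVC CVVVC",
                        "CVC CV CVC CVCV CVC CVCVC"))
```

Under this character-level assumption, the two representations shown above are three edits apart.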

I’ve generated three candidate poems that I found satisfactory:

f1r_lightinaugust: a combination of folio f1r and William Faulkner’s Light in August.
f39v_rosettastone: a combination of folio f39v and an English translation of the Rosetta Stone.
f81r_odyssey: a combination of folio f81r and Homer’s Odyssey.

How well does the output of my computer program conform to my invented poetic form?

The output conforms relatively well; however, there are almost no perfect rhythmic matches. Most lines differ by 5-10 phonemes, but some flexibility enhances the illusion that the translation is legitimate.

Could a human do it better?

Yes. A human would have greater control over the theme and content of the translation, and would probably be able to match the rhythm exactly, phoneme for phoneme.

Also, especially with shorter source texts, a translation can be repeated when one source line matches many ancient lines in rhythm. A human can avoid this obvious error entirely.
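
One way a program could reduce such repeats is to use each source phrase at most once. The sketch below is only an illustration of that idea and is not part of the project as described; the function name is hypothetical, and the distance function is passed in so that any measure (such as edit distance) can be used.

```python
def translate_without_repeats(ancient_patterns, candidates, distance):
    """Greedy selection in which every source phrase is used at most once.

    ancient_patterns: list of marker strings, one per ancient line.
    candidates: list of (phrase, pattern) pairs from the source text.
    distance: function taking two patterns and returning a number.
    """
    remaining = list(candidates)
    chosen = []
    for pattern in ancient_patterns:
        if not remaining:
            break  # the source text has run out of unused phrases
        best = min(remaining, key=lambda pair: distance(pattern, pair[1]))
        remaining.remove(best)
        chosen.append(best[0])
    return chosen


if __name__ == "__main__":
    # Stand-in distance for the demo only: difference in pattern length.
    length_gap = lambda a, b: abs(len(a) - len(b))
    source = [("first phrase", "CVC"), ("second phrase", "CVCV")]
    print(translate_without_repeats(["CVC", "CVC"], source, length_gap))
```

The trade-off is that later lines may receive a worse rhythmic match, because their best candidate has already been taken.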

How does my choice of source text (my “raw material”) affect the character and quality of the poems that my program generates?

I chose the Rosetta Stone text because the resulting translations were very convincing: both the Voynich Manuscript and the Rosetta Stone are ancient relative to the modern day.

I chose the Odyssey text because it also matches the time period of an ancient text. However, the story-like quality of the Odyssey doesn’t match my expectations of what a translation of the Voynich Manuscript would be, so it provides an interesting juxtaposition.

I chose Faulkner’s Light in August precisely for its stark contrast, being a more modern text. The juxtaposition of the two texts, combined with the oddly consistent mixture of Faulkner’s lines, leads to a surprising result.

See the README on GitHub.