NLTK is a Python library for extensive natural language processing. It comes with a lot of ‘data’, among them a series of ‘books’. When you load the books, you get a nice ordered list of them, starting with Melville’s Moby Dick, Jane Austen’s Sense and Sensibility, and the Book of Genesis. Next follow the Inaugural Address Corpus, the Chat Corpus, Monty Python and the Holy Grail, the Wall Street Journal, the Personals Corpus and Chesterton’s The Man Who Was Thursday.
These are the figures for the Book of Genesis:
* “man” is mentioned 114 times
* “woman” is mentioned 20 times
* “he” is mentioned 648 times
* “she” is mentioned 161 times
FYI, “sex”, “sensual”, “sensuality” and “copulation” are mentioned 0 times; “flesh” is mentioned 26 times, 6 of them in combination with circumcision.
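Counts like these can be reproduced with a few lines of Python. What follows is only a minimal sketch, not necessarily the way the figures above were obtained: `count_words` and `load_genesis_tokens` are hypothetical helper names, and ignoring case is an assumption that may shift the numbers slightly. The loader relies on NLTK's `genesis` corpus (file `english-kjv.txt`), which has to be downloaded first; the small sample keeps the example runnable without it.

```python
from collections import Counter


def count_words(tokens, targets, ignore_case=True):
    """Return a dict mapping each target word to its number of occurrences."""
    if ignore_case:
        # Assumption: "The" and "the" count as the same word.
        tokens = (t.lower() for t in tokens)
        targets = [w.lower() for w in targets]
    counts = Counter(tokens)
    return {w: counts[w] for w in targets}


def load_genesis_tokens():
    """Hypothetical helper: needs NLTK plus nltk.download('genesis')."""
    from nltk.corpus import genesis
    return genesis.words("english-kjv.txt")


# A tiny hand-made sample so the sketch runs without any corpus data.
sample = ("And the man said , The woman whom thou gavest "
          "to be with me , she gave me of the tree").split()

result = count_words(sample, ["man", "woman", "he", "she"])
print(result)
```

With the corpus installed, `count_words(load_genesis_tokens(), ["man", "woman", "he", "she"])` should give figures for the whole book.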
The longest words are 15 characters long: ‘Zaphnathpaaneah’ and ‘interpretations’.
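Finding the longest words comes down to taking the maximum length over the vocabulary and keeping every word that reaches it. A minimal sketch, assuming the text is available as a list of tokens (for instance from NLTK's Genesis corpus); `longest_words` is a hypothetical name and the sample list is made up for illustration:

```python
def longest_words(tokens):
    """Return (length, words) for the longest alphabetic words in tokens."""
    # Build the vocabulary: unique tokens, punctuation and numbers dropped.
    vocab = {t for t in tokens if t.isalpha()}
    if not vocab:
        return 0, []
    max_len = max(len(w) for w in vocab)
    return max_len, sorted(w for w in vocab if len(w) == max_len)


# Hand-made sample; replace it with the tokens of the full text.
sample = ["Pharaoh", "called", "Joseph", ",", "Zaphnathpaaneah",
          "interpretations", "belong", "to", "God", "interpretations"]

length, words = longest_words(sample)
print(length, words)
```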
26% of the text consists of three-letter words, 11599 occurrences in total. Here is the collection of these words, in alphabetical order, organised for a mosaic of words on the wall in a metro station or some intelligence quiz on TV:
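The share of three-letter words and their alphabetical collection can be computed along these lines. This is a sketch under assumptions: `fixed_length_words` is a hypothetical helper, the percentage is taken relative to all tokens including punctuation, and only purely alphabetic tokens count as words. Different choices here would change the percentage.

```python
def fixed_length_words(tokens, length=3):
    """Return (occurrences, share_of_all_tokens, sorted_unique_words)."""
    tokens = list(tokens)
    hits = [t for t in tokens if len(t) == length and t.isalpha()]
    # Assumption: the share is measured against every token in the text.
    share = len(hits) / len(tokens) if tokens else 0.0
    return len(hits), share, sorted(set(hits))


# Hand-made sample; replace it with the tokens of the full text.
sample = "And God said , Let there be light : and there was light .".split()

occurrences, share, words = fixed_length_words(sample)
print(occurrences, f"{share:.0%}", words)
```

Python sorts uppercase letters before lowercase ones, so capitalised words come first in the resulting list.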
Of the 2615 words in the lexicon of Genesis, the following are all capitalised (title case) and count more than 10 letters each: Abelmizraim, Allonbachuth, Beerlahairoi, Canaanitish, Chedorlaomer, Girgashites, Hazarmaveth, Hazezontamar, Ishmeelites, Jegarsahadutha, Jehovahjireh, Kirjatharba, Melchizedek, Mesopotamia, Peradventure, Philistines, Zaphnathpaaneah.
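A selection like this can be made by filtering the vocabulary on two conditions: the word is in title case and it has more than 10 letters. A minimal sketch using Python's built-in `str.istitle()`; `long_title_words` is a hypothetical name and the sample tokens are made up for illustration. Note that a word at the start of a sentence, such as ‘Peradventure’, passes the title-case test as well.

```python
def long_title_words(tokens, min_letters=11):
    """Return the sorted unique title-case words with at least min_letters."""
    return sorted({t for t in tokens
                   if t.istitle() and len(t) >= min_letters})


# Hand-made sample; replace it with the tokens of the full text.
sample = ["And", "Melchizedek", "king", "of", "Salem", "Peradventure",
          "interpretations", "Mesopotamia", "Melchizedek"]

names = long_title_words(sample)
print(names)
```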
Algorithms like these are useful for finding suitable names. ‘Hazezontamar’ sounds so exotic. Gamers thought of it before.