Counting words

December 8th, 2015

While exploring NLTK, the Natural Language Toolkit, I came across an interesting way of ‘reading’ The Book of Genesis.
This post is a small report. You can find a Dutch variant here.

NLTK is a library for Python that allows extensive language processing. It comes with a lot of ‘data’, among them a series of ‘books’. When you load the books, you get a nice ordered list of books, starting with Melville’s Moby Dick, Jane Austen’s Sense and Sensibility, and the Book of Genesis. Next follow the Inaugural Address Corpus, the Chat Corpus, Monty Python and the Holy Grail, the Wall Street Journal, the Personals Corpus and Chesterton’s The Man Who Was Thursday.

A ‘distant’ algorithmic reading of The Book of Genesis taught me details like those in this dispersion plot:
[Figure: genesis_man_woman (dispersion plot)]
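
A dispersion plot marks, for each word, the positions in the text where that word occurs. The sketch below computes those positions for a plain list of tokens. The helper name `word_offsets` and the tiny sample sentence are illustrative assumptions, not the code behind the plot. With NLTK and matplotlib installed, `text3.dispersion_plot(["man", "woman"])` after `from nltk.book import *` should draw such a plot directly.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Tiny illustrative sample; with NLTK the tokens would come from text3.
sample = "and the man said the woman whom thou gavest to be with me".split()
print(word_offsets(sample, ["man", "woman"]))
```

Each list of positions corresponds to one row of tick marks in the plot.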

These are the figures:
* “man” is mentioned 114 times
* “woman” is mentioned 20 times
* “he” is mentioned 648 times
* “she” is mentioned 161 times
FYI, “sex”, “sensual”, “sensuality” and “copulation” are mentioned 0 times; “flesh” is mentioned 26 times, 6 of them in combination with circumcision.
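
Figures like these come from counting exact token matches. Here is a minimal sketch on a plain token list; the helper name `count_words` and the sample sentence are assumptions for illustration. In NLTK, `text3.count("man")` gives the same kind of count. It is case-sensitive, so “He” and “he” are counted separately; whether the figures above merge the two is not stated.

```python
from collections import Counter


def count_words(tokens, targets):
    """Return how often each target word occurs (exact, case-sensitive match)."""
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur, so no KeyError here.
    return {target: counts[target] for target in targets}


# Illustrative sample, not the full text of Genesis.
sample = "And the man said , The woman whom thou gavest to be with me , she gave me of the tree".split()
print(count_words(sample, ["man", "woman", "she", "the", "sex"]))
```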

The longest words count 15 characters each, and are: ‘Zaphnathpaaneah’ and ‘interpretations’.
26% of the text consists of 3-letter words, 11599 occurrences in total. Below you find the collection of these words, in alphabetical order, ready for a mosaic of words on the wall of a metro station or some intelligence quiz on TV:
“Abr,All,And,Ard,Are,Art,Ask,Bow,But,Buz,Can,Dan,Day,Din,Egy,Ehi,Eno,Eri,Eve,For,Gad,Get,God,Hai,Ham,His,How,Hul,Huz,Isa,Jac,Job,Kor,Lay,Let,Lie,Lot,Lud,Luz,Mam,Man,Nay,Nod,Not,Now,Our,Out,Pau,Put,Reu,Say,See,Set,She,Sod,The,Thy,Two,Who,Why,Yea,Yet,Zar,add,aga,age,air,all,alo,and,any,are,ark,art,ash,ask,ass,bad,bak,bed,bou,bow,bre,but,buy,can,chi,clo,cru,cry,cup,cut,day,dea,dew,did,die,dim,doe,dry,dwe,ear,eat,end,ewe,fai,far,fat,fed,few,fie,fig,fir,fle,flo,fly,for,fou,fro,gat,get,goa,got,gre,gro,had,han,her,hid,hil,him,his,hor,hou,how,ill,inn,jud,kid,lad,lan,law,lay,led,let,lie,man,may,men,met,mou,nig,nor,not,now,oak,off,oil,old,one,oth,our,out,own,pea,pit,pla,pow,put,ram,ran,red,rib,rid,riv,rul,run,sac,sad,sat,saw,say,sea,see,set,she,sin,sir,sit,six,sle,sod,son,sou,sow,spe,spi,ste,sto,sun,tak,tar,ten,the,thi,thy,tim,too,top,tru,two,voi,vow,war,was,wat,way,who,why,wit,wiv,wor,wot,yea,yet,you”
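
Word-length statistics like these can be sketched on a plain token list as follows. The helper names and the sample sentence are illustrative assumptions. The share is taken over all tokens passed in; whether the 26% above counts punctuation tokens in the total is an assumption I cannot confirm from the figures alone.

```python
def longest_words(tokens):
    """Return the distinct words that share the maximum length, sorted."""
    vocabulary = set(tokens)
    max_length = max(len(word) for word in vocabulary)
    return sorted(word for word in vocabulary if len(word) == max_length)


def length_share(tokens, length):
    """Return (occurrences, fraction of all tokens) for tokens of the given length."""
    hits = sum(1 for token in tokens if len(token) == length)
    return hits, hits / len(tokens)


def words_of_length(tokens, length):
    """Return the distinct words of the given length in alphabetical order."""
    return sorted({token for token in tokens if len(token) == length})


# Illustrative sample, not the full text of Genesis.
sample = "In the beginning God created the heaven and the earth".split()
print(longest_words(sample))
print(length_share(sample, 3))
print(words_of_length(sample, 3))
```

Python's default string sort puts capitalized words before lowercase ones, which matches the order of the list above.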

Of the 2615 words in the lexicon of Genesis, the following are title-cased (they start with a capital letter) and count more than 10 letters each: Abelmizraim, Allonbachuth, Beerlahairoi, Canaanitish, Chedorlaomer, Girgashites, Hazarmaveth, Hazezontamar, Ishmeelites, Jegarsahadutha, Jehovahjireh, Kirjatharba, Melchizedek, Mesopotamia, Peradventure, Philistines, Zaphnathpaaneah.
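
A filter of this kind can be sketched with Python's built-in `str.istitle()`. The helper name `long_title_words` and the sample tokens are assumptions for illustration; with NLTK, the same expression could be run over `set(text3)`.

```python
def long_title_words(tokens, min_length=10):
    """Return the distinct title-cased words longer than min_length, sorted."""
    vocabulary = set(tokens)  # the 'lexicon': every distinct token once
    return sorted(
        word for word in vocabulary
        if word.istitle() and len(word) > min_length
    )


# Illustrative sample, not the full text of Genesis.
sample = [
    "And", "Melchizedek", "king", "of", "Salem", "brought", "forth",
    "bread", "Peradventure", "interpretations", "Melchizedek",
]
print(long_title_words(sample))
```

Note that `istitle()` also accepts ordinary words that merely open a sentence, which would explain why “Peradventure” sits among the proper names.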

Algorithms like these are useful for finding suitable names. ‘Hazezontamar’ is so exotic. Gamers thought of it before.
