Using Douglas Harper's Online Etymology Dictionary, Kinde matched words with their linguistic origins, then wrote a script to color-code each word in a given passage by origin and link it to its dictionary entry. The results look like this:
Things get a little more interesting when he increases the length and complexity of the passage, using an excerpt from Charles Dickens's Great Expectations:
Now we have a nice mix of word origins, although Old English is still clearly the main source. Interestingly, when he analyzes a passage from Mark Twain's The Adventures of Tom Sawyer, the percentage of words derived from Old English drops to 72.9% of the total, in part due to the addition of Greek, Old Norse, Scandinavian, and Native American to the list of word origins. A legal text describing nations' borders at sea provides even more insight: the percentage of words derived from Old English drops to 64.4%, while Latin and Old French combine to provide nearly 20% of the total. The further we get from everyday spoken English, the lower the percentage of Old English-sourced words appears to be, and the more formal education a text reflects on the part of its author, the more frequently we find words of Latin and Greek origin. Case in point, the following excerpt from a medical text:
At the beginning of his post, he mentions having wanted to write an app that would analyze any given text in this manner, and I can imagine spending (wasting?) hours and hours comparing different branches of science, different journals, or even publications from different research groups. It would be fascinating to see whether papers written in English by a French research group, for example, tend to have a higher frequency of French-derived words than those from a German group. I know I lean on cognates frequently when speaking another language, even when the cognate doesn't mean exactly what I intend, simply because they come to mind much more readily. Who knows what kinds of hidden linguistic patterns we might find in academic publishing?
Unfortunately, he's abandoned this idea as infeasible (for now, at least) given the amount of "manual intervention" his current analysis required. A basic, much less visually pleasing version exists over at the Etymology Discovery Message Board (though it pains me to direct traffic to a site with an its/it's mistake in the first line). Until Mike Kinde, or another like-minded programmer, can figure out how to automate his process for the masses, we'll have to be content with these analyses as a reminder of just how complicated (but interesting!) English can be.
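For the curious, here is a rough idea of what the easy, automatable part of such a tool might look like. This is only a minimal sketch in Python, not Kinde's actual script: the `ORIGINS` table, the `COLORS` palette, the `BASE_URL` placeholder, and the function names are all my own assumptions, and the handful of origin labels are illustrative rather than checked against the dictionary. The hard part, building a complete and correct lookup table, is presumably where all that "manual intervention" comes in.

```python
import re
from collections import Counter
from html import escape

# Hypothetical, hand-filled lookup table (illustrative labels only).
ORIGINS = {
    "the": "Old English",
    "house": "Old English",
    "sky": "Old Norse",
    "justice": "Old French",
}

# Hypothetical color palette, one color per origin.
COLORS = {
    "Old English": "#1f77b4",
    "Old Norse": "#2ca02c",
    "Old French": "#d62728",
    "Unknown": "#7f7f7f",
}

# Placeholder address; a real tool would point at an actual dictionary entry.
BASE_URL = "https://example.org/word/"

WORD_RE = re.compile(r"[A-Za-z']+")


def origin_of(word, origins=ORIGINS):
    """Look up a word's origin, falling back to 'Unknown'."""
    return origins.get(word.lower(), "Unknown")


def origin_shares(text, origins=ORIGINS):
    """Return each origin's share of the words in text, as a percentage."""
    counts = Counter(origin_of(w, origins) for w in WORD_RE.findall(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {origin: 100.0 * n / total for origin, n in counts.items()}


def color_code(text, origins=ORIGINS):
    """Wrap every word in a colored link; escape everything in between."""
    parts, pos = [], 0
    for match in WORD_RE.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        word = match.group(0)
        origin = origin_of(word, origins)
        color = COLORS.get(origin, COLORS["Unknown"])
        href = BASE_URL + escape(word.lower())
        parts.append(
            f'<a href="{href}" title="{escape(origin)}" '
            f'style="color:{color}">{escape(word)}</a>'
        )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)
```

Calling `origin_shares` on a passage gives the kind of percentage breakdown quoted above, and `color_code` produces the HTML for the colored, linked version. Any word missing from the table falls into "Unknown", which is exactly the gap a person would have to fill in by hand.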