Difference between revisions of "How "great" are the Great Books"

From Code4Lib
[[Category:Code4Lib2011TalksProposals]]

Revision as of 19:30, 15 November 2010

How "great" are the Great Books?

  • Eric Lease Morgan, University of Notre Dame (emorgan at nd.edu)

First published in 1952, the Great Books of the Western World was a set of books intended to represent the best of Western literature and to help readers further their liberal arts education. Sixty volumes in its second edition, it included works by Plato, Aristotle, Shakespeare, Milton, Galileo, Kepler, Melville, Darwin, etc. These great books were selected based on the way they discussed a set of 102 "great ideas" such as art, astronomy, beauty, evil, evolution, mind, nature, poetry, revolution, science, will, wisdom, etc. How "great" are these books, and how "great" are the ideas expressed in them?

Given full-text versions of these books, it is almost trivial to use the "great ideas" as input and apply relevancy ranking algorithms against the texts, thus creating a sort of score -- a "Great Ideas Coefficient". Term Frequency/Inverse Document Frequency (TFIDF) is a well-established measure for computing just this sort of thing:

relevancy = ( c / t ) * log( d / f ) where:

  • c = number of times a given word appears in a document
  • t = total number of words in a document
  • d = total number of documents in a corpus
  • f = total number of documents containing a given word
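As a minimal sketch, the formula can be written as a small Python function. The function name tfidf is only illustrative, and the natural logarithm is an assumption, since the formula does not name a base; a different base rescales every score by the same constant and leaves the ranking unchanged.

```python
import math


def tfidf(c, t, d, f):
    """Relevancy of one word in one document: (c / t) * log(d / f).

    c: number of times the word appears in the document
    t: total number of words in the document
    d: total number of documents in the corpus
    f: number of documents containing the word
    The natural logarithm is an assumption; the formula does not fix a base.
    """
    if t == 0 or f == 0:
        # Empty document, or a word found nowhere in the corpus: score it 0.
        return 0.0
    return (c / t) * math.log(d / f)


# A word appearing 3 times in a 100-word document,
# and found in 5 of the 10 documents in the corpus.
print(tfidf(3, 100, 10, 5))
```

A word that appears in every document scores zero, because log( d / f ) is log( 1 ). This is how the measure discounts words that do not distinguish one book from another.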

Thus, to calculate the Great Ideas Coefficient of a "great book", I sum the relevancy scores of all the "great ideas" for that book. Plato's Republic might have a cumulative score of 525 while Aristotle's On The History Of Animals might have a cumulative score of 251. Books with a larger Coefficient could be considered greater. Given such a score, a person could measure a book's "greatness" and compare it to the scores of other books. Which book is the "greatest"? We could also compare the score to other measurable things, such as a book's length or date, to see if there are correlations. Are "great books" longer or shorter than others? Do longer books contain more "great ideas"? Are there other books that were not included in the set that maybe should have been included?
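The summation can be sketched in Python as follows. Everything here is an assumption made for illustration, not the processing used for the talk: the function names, the simple lowercase tokenizer, the natural logarithm, and the three-book toy corpus with four single-word ideas. Multi-word ideas would need phrase matching, which this sketch does not attempt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical pre-processing: lowercase, keep runs of ASCII letters only.
    return re.findall(r"[a-z]+", text.lower())


def great_ideas_coefficients(corpus, ideas):
    """Return {title: sum of (c / t) * log(d / f) over all ideas}."""
    counts = {title: Counter(tokenize(text)) for title, text in corpus.items()}
    d = len(counts)  # total number of documents in the corpus
    # f for each idea: number of documents containing it at least once
    doc_freq = {
        idea: sum(1 for words in counts.values() if words[idea] > 0)
        for idea in ideas
    }
    scores = {}
    for title, words in counts.items():
        t = sum(words.values())  # total number of words in this document
        total = 0.0
        for idea in ideas:
            f = doc_freq[idea]
            if t == 0 or f == 0:
                continue  # idea absent from the corpus, or empty document
            total += (words[idea] / t) * math.log(d / f)
        scores[title] = total
    return scores


# Toy data, invented for this example only.
corpus = {
    "Book A": "beauty and wisdom guide the mind toward wisdom",
    "Book B": "the mind studies nature",
    "Book C": "a tale of the sea",
}
ideas = ["beauty", "mind", "nature", "wisdom"]

scores = great_ideas_coefficients(corpus, ideas)
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score:.3f}")
```

With this toy data, Book A ranks first, Book B second, and Book C scores zero because it contains none of the ideas. Because each term is divided by the document's length, a longer book does not score higher merely for being longer.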

The first part of this talk describes the text pre-processing steps needed to calculate an accurate TFIDF value for each item in the corpus. The second part discusses the results and statistical analysis. Finally, I will outline the remaining work, such as refining the analysis and extending the current quantitative process to a web implementation.