* David Lacy, Library Software Development Specialist, Villanova University (david dot lacy at villanova dot edu)
We have recently rearchitected our homegrown digital library using an all-XML framework. The system comprises a data repository residing in a native XML database (eXist-DB), a metadata editor built with a Java-based XForms processor (Orbeon Forms), and a series of services for image manipulation, OCR processing, and OAI-PMH serving. In this talk, I will detail our workflow from scanning to online publishing, demonstrate the software's flexible configuration and features, and show how these steps allow rapid digital preservation and online access. Oh, and it's open source, so I'll show you where to get it as well.
==Programming Latent Semantic Analysis for Large Digital Corpora==
* Wally Hooper, Indiana University (whooper@indiana.edu)
* Kirk Hess, Indiana University (kirhess@indiana.edu)
The Chymistry of Isaac Newton Project [http://www.chymistry.org http://www.chymistry.org] is publishing one hundred eighteen alchemical manuscripts written by Isaac Newton, thirty-two of which are now publicly available, encoded in TEI and Unicode and served using the eXtensible Text Framework (XTF) engine [http://www.cdlib.org/services/publishing/tools/xtf/]. The National Science Foundation has funded a three-year project (2009–12, #0620868) to develop computational tools for the analysis of the language in Newton's alchemical corpus. This project is applying tools from the fields of computational linguistics, information retrieval, and network science to mine and analyze Newton’s manuscripts.
One technique, Latent Semantic Analysis (LSA), has been used by the project to build a set of tools for discovering the semantic structure and organization of the corpus; these tools have uncovered shared passages, phrases, and technical vocabulary across the manuscripts. We thought many projects with TEI data might want to do LSA but may not know how. We’ll discuss creating tools that apply LSA to TEI-encoded text using XSL, Perl, PHP, and a mathematical/statistical software package (e.g. Matlab); having a supercomputer handy is helpful but not required! We'll walk through our method for chunking text, building a term-document matrix, executing singular value decomposition, and outputting the results both as correlated document pairs and in GraphML format so they can be analyzed in a network analysis and visualization tool (e.g. Network Workbench [http://nwb.slis.indiana.edu]).
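For readers who have not seen the pipeline before, the steps named above (chunking, term-document matrix, singular value decomposition, correlated pairs, GraphML) might look roughly like the sketch below. This is not the project's code: it uses Python with NumPy as a stand-in for the XSL/Perl/PHP/Matlab toolchain, and the function names, the fixed-size word chunks, the raw term counts, and the cosine threshold are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

import numpy as np


def chunk_words(text, size):
    """Split text into consecutive chunks of `size` lowercase word tokens.
    (Fixed-size chunks are an assumption; the project's chunking may differ.)"""
    words = re.findall(r"[a-z]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def term_document_matrix(chunks):
    """Return (vocabulary, matrix): one row per term, one column per chunk."""
    vocab = sorted({w for c in chunks for w in c})
    counts = [Counter(c) for c in chunks]
    matrix = np.array([[cnt[t] for cnt in counts] for t in vocab], dtype=float)
    return vocab, matrix


def lsa_document_vectors(matrix, k):
    """Project each chunk into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:k].T * s[:k]  # one row per chunk, scaled by singular values


def correlated_pairs(doc_vectors, threshold):
    """Return (i, j, cosine) for chunk pairs with cosine similarity >= threshold."""
    norms = np.linalg.norm(doc_vectors, axis=1)
    pairs = []
    for i, j in combinations(range(len(doc_vectors)), 2):
        denom = norms[i] * norms[j]
        if denom == 0:
            continue  # skip chunks with no weight in the latent space
        cos = float(doc_vectors[i] @ doc_vectors[j] / denom)
        if cos >= threshold:
            pairs.append((i, j, cos))
    return pairs


def to_graphml(n_chunks, pairs):
    """Serialize chunks as nodes and correlated pairs as weighted edges."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "w", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for i in range(n_chunks):
        ET.SubElement(graph, "node", id=f"chunk{i}")
    for i, j, cos in pairs:
        edge = ET.SubElement(graph, "edge", source=f"chunk{i}", target=f"chunk{j}")
        ET.SubElement(edge, "data", key="w").text = f"{cos:.4f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Toy text, not drawn from the Newton corpus.
    text = ("mercury sulphur salt mercury sulphur mercury salt sulphur "
            "light prism colour light")
    chunks = chunk_words(text, 4)
    _, matrix = term_document_matrix(chunks)
    pairs = correlated_pairs(lsa_document_vectors(matrix, k=2), threshold=0.5)
    print(to_graphml(len(chunks), pairs))
```

In this toy run the first two chunks share their vocabulary and end up linked by an edge, while the third chunk stays isolated. A real corpus would usually call for term weighting (for example tf-idf) and a sparse or truncated SVD routine, which is where the supercomputer starts to help.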