2012 talks proposals

1,053 bytes added, 08:44, 21 November 2011
The existing body of Open Access scholarly research is a well-classified and well-described dataset. Institutional Repositories, however, often lack the resources to invest in cataloging and maintaining rich metadata descriptions of contributed content, especially when collections are populated and maintained by non-librarians. A great deal of classifiable detail already exists within the files submitted to scholarly repositories. Using existing Open Source technologies capable of extracting this information, a process can be offered to submitters and repository maintainers that suggests appropriate subject classifications and content types for descriptive metadata during submission and update of repository items. This talk will provide an overview of an approach that uses machine learning to automatically populate subject classifications and content types.
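
As a rough illustration of the idea of suggesting subjects from text already present in a submitted file, the sketch below uses a small multinomial naive Bayes classifier with add-one smoothing. The proposal does not name a specific algorithm or toolkit, so the choice of naive Bayes, the class name <code>SubjectSuggester</code>, and the methods <code>train</code> and <code>suggest</code> are all assumptions made for illustration, not the approach presented in the talk.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class SubjectSuggester:
    """Hypothetical helper: multinomial naive Bayes over word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # subject -> word frequencies
        self.doc_counts = Counter()              # subject -> number of training texts
        self.vocabulary = set()                  # every word seen during training

    def train(self, text, subject):
        """Record one already-classified text under its subject."""
        tokens = tokenize(text)
        self.word_counts[subject].update(tokens)
        self.doc_counts[subject] += 1
        self.vocabulary.update(tokens)

    def suggest(self, text, top_n=3):
        """Return up to top_n subjects, most probable first."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for subject, n_docs in self.doc_counts.items():
            counts = self.word_counts[subject]
            total_words = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[subject] = score
        return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In a repository workflow, the training texts would presumably come from items that already carry curated subject metadata, and the ranked suggestions would be shown to the submitter for confirmation rather than applied automatically.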
 
== Mining Wikipedia for Book Articles ==
* Paul Deschner, Harvard Library Innovation Lab, deschner@law.harvard.edu
 
Suppose you were developing a browsing tool for library materials and wanted to include Wikipedia articles and categories whenever available -- how would you do it? There is no API or other data service which one can use to get a comprehensive listing of every page in Wikipedia devoted to the discussion of a book.
 
This talk will focus on the tools, workflows and data sources we have used to approach this problem. Tools and workflows include the use of Infobox ISBNs and other standard identifiers, analysis of Wikipedia categories and category hierarchies, exploitation of article abstracts and titles, and Mechanical Turk resources. Data sources include DBpedia triple stores and Wikimedia XML/SQL dumps. So far, we have harvested around 60,000 book articles. This is an exploration in dealing with open, relatively unstructured Web content, and in aggregating answers to the same question using quite diverse techniques.
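
The Infobox ISBN technique mentioned above can be sketched as follows: scan an article's wikitext for an <code>isbn</code> template parameter and keep only values whose check digit validates. This is a minimal sketch, not the Lab's actual pipeline. The function names <code>normalize_isbn</code> and <code>extract_isbns</code> are hypothetical, and it assumes the wikitext has already been pulled out of a Wikimedia XML dump and that the parameter is spelled <code>isbn</code>.

```python
import re

# Matches an "| isbn = ..." parameter line inside a template (assumed spelling).
ISBN_PARAM = re.compile(r"^\s*\|\s*isbn\s*=\s*([^\n|}]*)", re.IGNORECASE | re.MULTILINE)


def normalize_isbn(raw):
    """Strip hyphens and spaces; return the ISBN if its checksum is valid, else None."""
    candidate = re.sub(r"[\s-]", "", raw).upper()
    if re.fullmatch(r"\d{9}[\dX]", candidate):
        # ISBN-10: weights 10..1, 'X' counts as 10, sum must be divisible by 11.
        total = sum((10 - i) * (10 if ch == "X" else int(ch))
                    for i, ch in enumerate(candidate))
        return candidate if total % 11 == 0 else None
    if re.fullmatch(r"\d{13}", candidate):
        # ISBN-13: alternating weights 1 and 3, sum must be divisible by 10.
        total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                    for i, ch in enumerate(candidate))
        return candidate if total % 10 == 0 else None
    return None


def extract_isbns(wikitext):
    """Return the distinct valid ISBNs found in isbn parameters, in order of appearance."""
    found = []
    for match in ISBN_PARAM.finditer(wikitext):
        isbn = normalize_isbn(match.group(1))
        if isbn and isbn not in found:
            found.append(isbn)
    return found
```

Values carrying extra text, such as a binding note in parentheses, are rejected by this sketch. A fuller version would need to tolerate such variation, which is one reason a single technique is unlikely to find every book article on its own.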
[[Category: Code4Lib2012]]