2012 talks proposals

1,843 bytes added, 00:42, 19 November 2011

What's the right metadata standard to use for a digital repository? No single standard fits documents, videos, newspapers, audio files, local data, etc., and there is no standard to rule them all. So what do you do? At UC San Diego Libraries, we went down a conceptual level and attempted to hold every piece of metadata and give each holding place some context, hopefully in a common namespace. RDF has proven to be the ideal solution, and it allows us to work with MODS, PREMIS, MIX, and just about anything else we've tried. It also opens up the potential for data re-use and authority control as other metadata owners start thinking about and expressing their data in the same way. I'll talk about our workflow, which takes metadata from a stew of various sources (CSV dumps, spreadsheet data of varying richness, MARC data, and MODS data), has our Metadata Specialists create an assembly plan and normalize the metadata into METS, and then ingests it into our digital asset management system. The result is a [http://dl.dropbox.com/u/6923768/Work/DAMS%20object%20rdf%20graph.png beautiful graph] of RDF triples, with metadata poised to be expressed as [https://libraries.ucsd.edu/digital/ HTML], RSS, METS, and XML, and with linked data possibilities that we are just starting to explore.
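
To illustrate the idea of holding each piece of metadata as a statement with context in a common namespace, here is a minimal sketch in plain Python. The namespace URL, the predicate names, and the sample values are all hypothetical and are not taken from the UC San Diego system; the serializer covers only the simple N-Triples cases needed here.

```python
from collections import defaultdict

# Hypothetical namespace, for illustration only.
NS = "http://example.org/dams/"

# Each piece of metadata is one (subject, predicate, object) statement.
triples = [
    (NS + "object/1", NS + "vocab#title", "Sample photograph"),
    (NS + "object/1", NS + "vocab#format", "image/tiff"),
    (NS + "object/1", NS + "vocab#partOf", NS + "collection/7"),
    (NS + "collection/7", NS + "vocab#title", "Sample collection"),
]


def group_by_subject(statements):
    """Collect the predicate/object pairs that describe each subject."""
    grouped = defaultdict(list)
    for subject, predicate, obj in statements:
        grouped[subject].append((predicate, obj))
    return dict(grouped)


def to_ntriples(statements):
    """Serialize statements as N-Triples lines.

    Objects that start with "http" are written as IRIs; everything else is
    written as a plain string literal with backslashes and quotes escaped.
    """
    lines = []
    for subject, predicate, obj in statements:
        if obj.startswith("http"):
            obj_text = f"<{obj}>"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            obj_text = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {obj_text} .")
    return "\n".join(lines)


graph = group_by_subject(triples)
ntriples_text = to_ntriples(triples)
```

Because every statement has the same shape regardless of the source standard, fields that originate in MODS, PREMIS, or MIX can sit side by side in one graph.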

== HathiTrust Large Scale Search: Scalability meets Usability ==

* Tom Burton-West, DLPS, University of Michigan Library, tburtonw AT umich edu

[http://www.hathitrust.org/ HathiTrust Large-Scale search] provides full-text search services over nearly 10 million full-text books using Solr for the back-end. Our index is around 5-6 TB in size and each shard contains over 3 billion unique terms due to content in over 400 languages and dirty OCR.

Searching the full text of 10 million books often results in very large result sets. By conference time, a number of [http://www.hathitrust.org/full-text-search-features-and-analysis features] designed to help users narrow down large result sets and do exploratory searching will either be in production or in preparation for release. There are often trade-offs between implementing desirable user features and keeping response time reasonable, in addition to the traditional search trade-offs of precision versus recall.

We will discuss various [http://www.hathitrust.org/blogs/large-scale-search scalability] and usability issues including:

* Trade-offs between desirable user features and keeping response time reasonable and scalable

* Our solution to providing the ability to search within the 10 million books and also search within each book

* Migrating the [http://babel.hathitrust.org/cgi/mb personal collection builder application] from a separate Solr instance to an app which uses the same back-end as full-text search

* Design of a scalable multilingual spelling suggester

* Providing advanced search features combining MARC metadata with OCR

** The dismax mm and tie parameters

** Weighting issues and tuning relevance ranking

* Displaying only the most "relevant" facets

* Tuning relevance ranking

* Dirty OCR issues

* CJK tokenizing and other multilingual issues
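
As background for the dismax <code>mm</code> and <code>tie</code> parameters mentioned in the list above, the following is a minimal sketch of how such a request might be assembled and how <code>tie</code> affects scoring. The field names (<code>title</code>, <code>author</code>, <code>ocr</code>), the boosts, and the default values are assumptions chosen for illustration and are not HathiTrust's actual configuration. The scoring function mirrors the documented disjunction-max behaviour: the best field score plus <code>tie</code> times the sum of the other field scores.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query, field_boosts, mm="75%", tie=0.1):
    """Build Solr request parameters for a DisMax query.

    field_boosts maps a field name to its boost and becomes the qf parameter.
    mm is the minimum-should-match spec; tie is the tie-breaker weight.
    """
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": qf,
        "mm": mm,
        "tie": str(tie),
    }


def dismax_score(per_field_scores, tie):
    """Combine per-field scores the way a disjunction-max query does."""
    best = max(per_field_scores)
    return best + tie * (sum(per_field_scores) - best)


# Hypothetical fields: MARC-derived metadata boosted above the OCR text.
params = build_dismax_params("moby dick", {"title": 10, "author": 5, "ocr": 1})
query_string = urlencode(params)
```

With <code>tie</code> set to 0, only the best-matching field counts toward the score; with <code>tie</code> set to 1, the field scores are simply summed. Values in between let a strong metadata match dominate while matches in the OCR text still contribute.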

[[Category: Code4Lib2012]]
