2014 Prepared Talk Proposals
Proposals for Prepared Talks:
Prepared talks are 20 minutes (including setup and questions), and should focus on one or more of the following areas:
- Projects you've worked on which incorporate innovative implementation of existing technologies and/or development of new software
- Tools and technologies – How to get the most out of existing tools, standards and protocols (and ideas on how to make them better)
- Technical issues - Big issues in library technology that should be addressed or better understood
- Relevant non-technical issues – Concerns of interest to the Code4Lib community which are not strictly technical in nature, e.g. collaboration, diversity, organizational challenges, etc.
To Propose a Talk
- Log in to the wiki in order to submit a proposal. If you are not already registered, follow the instructions to do so.
- Provide a title and brief (500 words or fewer) description of your proposed talk.
- If you so choose, you may also indicate when, if ever, you have presented at a prior Code4Lib conference. This information is completely optional, but it may assist us in opening the conference to new presenters.
As in past years, the Code4Lib community will vote on proposals that they would like to see included in the program. This year, however, only the top 10 proposals will be guaranteed a slot at the conference. Additional presentations will be selected by the Program Committee in an effort to ensure diversity in program content. Community votes will, of course, still weigh heavily in these decisions.
Presenters whose proposals are selected for inclusion in the program will be guaranteed an opportunity to register for the conference. The standard conference registration fee will still apply.
Proposals can be submitted through Friday, November 8, 2013, at 5pm PST. Voting will commence on November 18, 2013 and continue through December 6, 2013. The final line-up of presentations will be announced in early January, 2014.
Creating a new Greek-Dutch dictionary
- Caspar Treijtel, University of Amsterdam, email@example.com
At present, no complete (ancient) Greek-Dutch dictionary is available online. A new dictionary is currently under construction at Leiden University, with software being developed at the University of Amsterdam. The team in Leiden has already begun preparing the data, with about 6,000 lemmas approved so far. The ultimate goal is to produce both a print version and an online open access version from the same source documents. The software needed for this was developed in a project funded by CLARIN-NL.
For the production of lemmas we have implemented an advanced workflow. The (generally non-technical) users create lemmas using MS Word, which is both familiar and easy to use. We have developed a custom software module that carefully migrates the Word documents into deeply structured XML by analyzing the structure and semantics of the lemmas, and falling back on heuristics in ambiguous cases. Although we initially envisioned the oXygen XML Author component as the main tool for creating new lemmas, we obtained excellent results with the migrator module and therefore decided to continue using MS Word as the primary composition tool. The main advantage of this is that the editors are much more familiar with Word than with any other WYSIWYG editor. Lemmas that have been migrated to XML are stored in an XML database and can be further edited using oXygen XML Author.
Greek morphology is complicated. In order to use a dictionary effectively, a rather high level of initial language competence is necessary for the user to be able to relate the word form s/he finds in a text to the correct basic lemma form, where the definition of the word can be found. Using a Greek morphological database we have been able to facilitate the search for lemmas. A ‘lemmatizer’ module gives the possible parsings of the word forms and the lemmas they can be derived from. This enables the user to type in the word as found in the text and be redirected to the correct lemma.
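The lookup described above can be sketched as an index from inflected word forms to candidate lemmas. This is only a toy illustration: the sample rows and the names `ANALYSES`, `build_index`, and `lemmatize` are hypothetical and are not taken from the project's software or its morphological database.

```python
import unicodedata
from collections import defaultdict

# Hypothetical sample rows: (inflected form, lemma, parse).
# A real morphological database would hold a very large number of these.
ANALYSES = [
    ("λόγου", "λόγος", "noun, genitive singular"),
    ("ἔλυσα", "λύω", "verb, aorist indicative active, 1st singular"),
    ("λύει", "λύω", "verb, present indicative active, 3rd singular"),
    ("λύει", "λύω", "verb, present indicative middle/passive, 2nd singular"),
]


def normalize(form):
    """Normalize Unicode so that differently composed accents compare equal."""
    return unicodedata.normalize("NFC", form)


def build_index(analyses):
    """Group every (lemma, parse) pair under its inflected form."""
    index = defaultdict(list)
    for form, lemma, parse in analyses:
        index[normalize(form)].append((lemma, parse))
    return index


def lemmatize(index, form):
    """Return all possible (lemma, parse) pairs for a form as typed by the user."""
    return list(index.get(normalize(form), []))
```

A form with more than one possible parse simply returns several pairs, and the interface can then offer each candidate lemma to the reader.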
The online dictionary is still being worked on; have a look at http://www.woordenboekgrieks.nl/ for the beta version. A newer test version with additional features can be found at http://angel.ic.uva.nl:8600/.
- construction of the dictionary: Prof. Ineke Sluiter, Classics department of Leiden University; Prof. Albert Rijksbaron, University of Amsterdam
- publisher of the dictionary: Amsterdam University Press
- design/typesetting dictionary: TaT Zetwerk (http://www.tatzetwerk.nl/)
- software development: Digital Production Center, University Library, University of Amsterdam
- project funding: CLARIN-NL (http://www.clarin.nl/)
- morphological database for use by the lemmatizer: courtesy of Prof. Helma Dik, University of Chicago (based on data of the Perseus Project)
Using Drupal to drive alternative presentation systems
- Cary Gordon, The Cherry Hill Company, firstname.lastname@example.org
Recently, we have been building systems that use angular.js, Rails, or other systems for presentation, while leveraging Drupal's sophisticated content management capabilities on the back end.
So far, these have been one-way systems, but as we move to Drupal 8 we are beginning to explore ways to further decouple the presentation and CMS functions.
A Book, a Web Browser and a Tablet: How Bibliotheca Alexandrina's Book Viewer Framework Makes It Possible
- Mohammed Abu ouda, Bibliotheca Alexandrina (The new Library of Alexandria)
Many institutions around the world are engaged in digitization projects that aim to preserve the human knowledge present in books and make it available through multiple channels to people around the globe. These efforts will surely help close the digital gap, particularly with the arrival of affordable e-readers, mobile phones and network coverage. However, the digital reading experience has not yet reached its full potential. Many readers miss features they like in their good old books and wish to find them in their digital counterparts. In an attempt to create a unique digital reading experience, Bibliotheca Alexandrina (BA) created a flexible book viewing framework that is currently used to access its collection of more than 300,000 digital books in five different languages, which includes the largest collection of digitized Arabic books.
Using open source tools, BA used the framework to develop a modular book viewer that can be deployed in different environments and is currently at the heart of various BA projects. The Book viewer provides several features creating a more natural reading experience. As with physical books, the reader can now personalize the books he reads by adding annotations like highlights, underlines and sticky notes to capture his thoughts and ideas in addition to being able to share the book with friends on social networks. The reader can perform a search across the content of the book receiving highlighted search results within the pages of the book. More features can be further added to the book viewer through its plugin architecture.
Structured data NOW: seeding schema.org in library systems
- Dan Scott, Laurentian University
- Previous code4lib presentations: CouchDB is sacrilege... mmm, delicious sacrilege at Code4Lib 2008
The semantic web, linked data, and structured data are all fantastic ideas with a barrier imposed by implementation constraints. If a library's system does not allow customizations, or the institution lacks skilled human resources, it does not matter how enthused that library might be about publishing structured data... it will not happen. However, if the software in use simply publishes structured data by default, then the web will be populated for free. Really! No extra resources necessary.
This presentation highlights Dan's work with systems such as Evergreen, Koha, and VuFind to enable the publication of schema.org structured data out-of-the-box. Along the way, we reflect on the current state of the W3C Schema.org Bibliographic Extension community group efforts to shape the evolution of the schema.org vocabulary. Finally, hold on tight as we contemplate next steps and the possibilities of a world where structured data is the norm on the web.
- Bret Davidson, North Carolina State University Libraries, email@example.com
- Previous Code4Lib Presentations: Visualizing library data with D3.js at Code4Lib 2013
WebSockets for Real-Time and Interactive Interfaces
- Jason Ronallo, NCSU Libraries, firstname.lastname@example.org
Previous Code4Lib presentations:
Watching the Google Analytics Real-Time dashboard for the first time was mesmerizing. As soon as someone visited a site, I could see what page they were on. For a digital collections site with a lot of images, it was fun to see what visitors were looking at. But getting from Google Analytics to the image or other content that was currently being viewed was cumbersome. The real-time experience was something I wanted to share with others. I'll show you how I used a WebSocket service to create a real-time interface to digital collections.
In the Hunt Library at NCSU we have some large video walls. I wanted to make HTML-based exhibits that featured viewer interactions. I'll show you how I converted Listen to Wikipedia into a bring-your-own-device interactive exhibit. With WebSockets any HTML page can be remote controlled by any internet-connected device.
I will attempt to include real-time audience participation.
Rapid Development of Automated Tasks with the File Analyzer
- Terry Brady, Georgetown University Libraries, email@example.com
The Georgetown University Libraries have customized the File Analyzer and Metadata Harvester application (https://github.com/Georgetown-University-Libraries/File-Analyzer) to solve a number of library automation challenges:
- validating digitized and reformatted files
- validating vendor statistics for COUNTER compliance
- preparing collections of digital files for archiving and ingest
- manipulating ILS import and export files
The File Analyzer application was used by the US National Archives to validate 3.5 million digitized images from the 1940 Census. After implementing a customized ingest workflow within the File Analyzer, the Georgetown University Libraries were able to process an ingest backlog of over a thousand files of digital resources into DigitalGeorgetown, the Libraries’ Digital Collections and Institutional Repository platform. Georgetown is currently developing customized workflows that integrate Apache Tika, BagIt, and MARC conversion utilities.
The File Analyzer is a desktop application with a powerful framework for implementing customized file validation and transformation rules. As new rules are deployed, they are presented to users within a user interface that is easy (and powerful) to use.
Learn about the functionality that is available for download and how you can use this tool to automate workflows from digital collections to ILS ingests to electronic resources statistics, and discuss opportunities to collaborate on enhancements to this application!
GeoHydra: How to Build a Geospatial Digital Library with Fedora
- Darren Hardy, Stanford University, firstname.lastname@example.org
Geographically-rich data are exploding and putting fear in those trying to tackle integrating them into existing digital library infrastructures. Building a spatial data infrastructure that integrates with your digital library infrastructure need not be a daunting task. We have successfully deployed a geospatial digital library infrastructure using Fedora and open-source geospatial software. We'll discuss the primary design decisions and technologies that led to a production deployment within a few months. Briefly, our architecture revolves around discovery, delivery, and metadata pipelines using open-source OpenGeoPortal, Solr, GeoServer, PostGIS, and GeoNetwork technologies, plus the proprietary ESRI ArcMap -- the GIS industry's workhorse. Finally, we'll discuss the key skillsets needed to build and maintain a spatial data infrastructure.
Under the Hood of Hadoop Processing at OCLC Research
- Previous Code4Lib presentations: 2006: "The Case for Code4Lib 501c(3)"
Apache Hadoop is widely used by Yahoo!, Google, and many others to process massive amounts of data quickly. OCLC Research uses a 40-node compute cluster with Hadoop and HBase to process the 300 million MARC records of WorldCat in various ways. This presentation will explain how Hadoop MapReduce works and illustrate it with specific examples and code. The role of the jobtracker in both monitoring and reporting on processes will be explained. String searching WorldCat will also be demonstrated live.
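The map, shuffle, and reduce phases mentioned above can be sketched in a few lines. This is a single-process, in-memory simulation for illustration only; it does not use the Hadoop API, and the record layout and the function names (`map_record`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical rather than anything from OCLC's code.

```python
from collections import defaultdict

# Hypothetical stand-ins for bibliographic records.
RECORDS = [
    {"id": 1, "subjects": ["Maps", "Greece"]},
    {"id": 2, "subjects": ["Maps"]},
    {"id": 3, "subjects": ["Greece", "Dictionaries"]},
]


def map_record(record):
    """Map phase: emit one (key, value) pair per subject in a record."""
    for subject in record["subjects"]:
        yield subject, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list."""
    pairs = (pair for record in records for pair in map_record(record))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework performs the shuffle; the shape of the three steps is the same.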
Quick and Easy Data Visualization with Google Visualization API and Google Chart Libraries
- Bohyun Kim, Florida International University, email@example.com
- No previous Code4Lib presentations
Does most of the data that your library collects stay in spreadsheets, or get published as static tables with a series of boring numbers? Do your library stakeholders spend more time collecting the data than using it as a decision-making tool because the data is presented in a way that makes it hard for them to quickly grasp its significance?
This talk will provide an overview of the Google Visualization API and Google Chart Libraries to get you started on the way to quickly query and visualize your library data from remote data sources (e.g. a Google Spreadsheet or your own database) with (or without) cool-looking user-controls, animation effects, and even a dashboard.
Leap Motion + Rare Books: A hands-free way to view and interact with rare books in 3D
- Juan Denzer, Binghamton University, firstname.lastname@example.org
- No previous Code4Lib presentations
As rare books become more delicate over time, making them available to the public becomes harder. We at Binghamton University Library have developed an application that makes it easier to view rare books without ever having to touch them. We have combined the Leap Motion hands-free device and 3D rendered models to create a new virtual experience for the viewer.
The application allows the user to rotate and zoom in on a 3D representation of a rare book. The user is also able to ‘open’ the virtual book and flip through it using a natural user interface, such as swiping the hand left or right to turn the page.
The application is built on the .NET framework and is written in C#. 3D models are created using simple 3D software such as SketchUp or Blender. Scans of the book cover and spine are created using simple flatbed scanners. The inside pages are scanned using overhead scanners.
This talk will discuss the technologies used in developing the application and how virtually any library could implement it with virtually no coding at all. The presentation will include a demonstration of the software and a chance for audience members to experience the Rare Book Leap Motion App themselves.
Course Reserves Unleashed!
- Bobbi Fox, Library Technology Services, Harvard University, email@example.com
- Gloria Korsman, Andover-Harvard Theological Library
- No previous Code4Lib presentations
Hey kids! Remember when SOAP was used for something other than washing? Our sophisticated (and highly functional) Course Reserves Request system does!
However, while the system is great for submitting and processing course reserve requests, the student-facing presentation through Harvard’s home-grown -- and soon to be replaced -- LMS leaves a lot to be desired.
Follow along as we leverage Solr 4 as a NoSQL database, along with more progressive RESTful API techniques, to release Reserves data into the wild without interfering with reserves request processing -- and, in the process, open up the opportunity for other schools to feed their data in as well.
We Are All Disabled! Universal Web Design Making Web Services Accessible for Everyone
- Cynthia Ng, Accessibility Librarian, CILS at Langara College
- No previous Code4Lib presentations (not counting lightning talks)
We’re building and improving tools and services all the time, but do you only develop for the “average” user or add things for “disabled” users? We all use “assistive” technology accessing information in a multitude of ways with different platforms, devices, etc. Let’s focus on providing web services that are accessible to everyone without it being onerous or ugly. The aim is to get you thinking about what you can do to make web-based services and content more accessible for all from the beginning or with small amounts of effort whether you're a developer or not.
The goal of the presentation is to provide both developers and content creators with information on simple, practical ways to make web content and web services more accessible. However, rather than thinking about putting in extra effort or making adjustments for those with disabilities, I want to help people think about how to make their websites more accessible for all users through universal web design.
Personalize your Google Analytics Data with Custom Events and Variables
- Josh Wilson, Systems Integration Librarian, State Library of North Carolina - firstname.lastname@example.org
At the State Library of North Carolina, we had more specific questions about the use of our digital collections than standard Google Analytics (GA) could answer. A few implementations of custom events and custom variables later, we have our answers.
- Capturing the content of specific metadata fields in CONTENTdm as Custom Events
- Recording Drupal taxonomy terms as Custom Variables
In both instances, this data deepened our understanding of how our sites and collections were being used, and in turn, we were able to report usage more accurately to content contributors and other stakeholders.
Behold Fedora 4: The Incredible Shrinking Repository!
Esmé Cowles, UC San Diego Library. Previous talk: All Teh Metadatas Re-Revisited (2013)
- One repository contains untold numbers of digital objects and powers many Hydra and Islandora apps
- It speaks RDF, but contains no triplestore! (triplestores sold separately, SPARQL Update may be involved, some restrictions apply)
- Flexible enough to tie itself in knots implementing storage and access control policies
- Witness feats of strength and scalability, with dramatically increased performance and clustering
- Plumb the depths of bottomless hierarchies, and marvel at the metadata woven into the very fabric of the repository
- Ponder the paradox of ingesting large files by not ingesting them
- Be amazed as Fedora 4 swallows other systems whole (including Fedora 3 repositories)
- Watch novice developers set up Fedora 4 from scratch, with just a handful of incantations to Git and Maven
The Fedora Commons Repository is the foundation of many digital collections, e-research, digital library, archives, digital preservation, institutional repository and open access publishing systems. This talk will focus on how Fedora 4 improves core repository functionality, adds new features, maintains backwards compatibility, and addresses the shortcomings of Fedora 3.
Organic Free-Range API Development - Making Web Services That You Will Actually Want to Consume
- Steve Meyer and Karen Coombs, OCLC
Building web services can have great benefits by providing reusability of data and functionality. Underpinning your applications with a web service will allow you to write code once and support multiple environments: your library's web app, mobile applications, the embedded widget in your campus portal. However, building a web service is its own kind of artful programming. Doing it well requires attention to many of the same techniques and requirements as building web applications, though with different outcomes.
So what are the usability principles for web services? How do you build a web service that you (and others) will actually want to use? In this talk, we’ll share some of the lessons learned - the good, the bad, and the ugly - through OCLC's work on the WorldCat Metadata API. This web service is a sophisticated API that provides external clients with read and write access to WorldCat data. It provides a model to help aspiring API creators navigate the potential complications of crafting a web service. We'll cover:
- Loose coupling of data assets and resource-oriented data modeling at the core
- Coding to standards vs. exposure of an internal data model
- Authentication and security for web services: API Keys, Digital Signing, OAuth Flows
- Building web services that behave as a suite so it looks like the left hand knows what the right hand is doing
So at the end of the day, your team will know your API is a very good egg after all.
If accepted, the presenters intend to produce and share a Quick Guide for building a web service that will reflect content presented in the talk.
Lucene's Latest (for Libraries)
Lucene powers the search capabilities of practically all library discovery platforms, by way of Solr, etc. The Lucene project evolves rapidly, and it's a full-time job to keep up with the ever improving features and scalability. This talk will distill and showcase the most relevant(!) advancements to date.
The Why and How of Very Large Displays in Libraries.
- Cory Lown, NCSU Libraries, email@example.com
Previous Code4Lib Presentations:
- How People Search the Library from a Single Search Box 2012
- Enhancing Discoverability with Virtual Shelf Browse 2010
Built into the walls of NC State's new Hunt Library are several Christie MicroTile Display Wall Systems. What does a library do with a display that's seven feet tall and over twenty feet wide? I'll talk about why libraries might want large displays like this, what we're doing with them right now, and what we might do with them in the future. I'll talk about how these displays factor into planning for new and existing web projects. And I'll get into the fun details of how you build web applications that scale from the very small browser window on a phone all the way up to a browser window with about 14 million pixels (about 10 million more than a dual 24" monitor desktop setup).
Your Library, Anywhere: A Modern, Responsive Library Catalogue at University of Toronto Libraries
- Bilal Khalid, Gordon Belray, Lisa Gayhart (firstname.lastname@example.org)
- No previous Code4Lib presentations
With the recent surge in the mobile device market and an ever expanding patron base with increasingly divergent levels of technical ability, the University of Toronto Libraries embarked on the development of a new catalogue discovery layer to fit the needs of its diverse users.
The result: a mobile-friendly, flexible and intuitive web application that brings the full power of a faceted library catalogue to users without compromising quality or performance, employing Responsive Web Design principles. This talk will discuss: application development; service improvements; interface design; and user outreach, testing, and project communications. Feedback and questions from the audience are very welcome. If time runs short, we will be available for questions and conversation after the presentation.
(Note: A version of this content has been provisionally accepted as an article for Code4Lib Journal, January 2014 publication.)
All Tiled Up
- Mike Graves, MIT Libraries (email@example.com)
You've got maps. You even scanned and georeferenced them. Now what? Running a full GIS stack can be expensive, and overkill in some cases. The good news is that you have a lot more options now than you did just a few years ago. I'd like to present some lighter weight solutions to making georeferenced images available on the Web.
This talk will provide an introduction to MBTiles. I'll go over what they are, how you create them, how you use them and why you would use them.
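As a rough picture of what an MBTiles file is, the sketch below builds one in memory. It assumes the layout given in the MBTiles specification: a SQLite database with a `metadata` table and a `tiles` table whose rows are numbered in the TMS scheme (counted from the bottom of the map). The helper names and the sample tile bytes are hypothetical, and this is not code from the talk.

```python
import sqlite3


def create_mbtiles(conn):
    """Create the two core tables an MBTiles file is expected to contain."""
    conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
    conn.execute(
        "CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER, "
        "tile_row INTEGER, tile_data BLOB)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX tile_index "
        "ON tiles (zoom_level, tile_column, tile_row)"
    )


def xyz_to_tms_row(zoom, y):
    """Convert a web-map (XYZ) row, counted from the top, to a TMS row."""
    return (2 ** zoom - 1) - y


def put_tile(conn, zoom, x, y, data):
    """Store image bytes for the tile addressed by XYZ coordinates."""
    conn.execute(
        "INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)",
        (zoom, x, xyz_to_tms_row(zoom, y), data),
    )


def get_tile(conn, zoom, x, y):
    """Return the stored bytes for an XYZ tile, or None if it is missing."""
    row = conn.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (zoom, x, xyz_to_tms_row(zoom, y)),
    ).fetchone()
    return row[0] if row else None
```

Because the whole tile set lives in one SQLite file, serving a georeferenced image can come down to a single indexed lookup per tile request.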
The Great War: Image Interoperability to Facebook
- Rob Sanderson, Los Alamos National Laboratory (firstname.lastname@example.org)
- (Code4Lib 2006: Library Text Mining)
- Rob Warren, Carleton University
- No previous presentations
Using a pipeline constructed from Linked Open Data and other interoperability specifications, it is possible to merge and re-use image and textual data from distributed library collections to build new, useful tools and applications. Starting with the OAI-PMH interface to CONTENTdm, we will take you on a tour through the International Image Interoperability Framework and Shared Canvas, to a cross-institutional viewer, and on to image analysis for the purpose of building a historical Facebook by finding and tagging people in photographs. The World War One collections are drawn from multiple institutions and merged by the machine learning code.
The presentation will focus on the (open source) toolchain and the benefits of the use of standards throughout: OAI-PMH to get the metadata, IIIF for interaction with the images, the Shared Canvas ontology for describing collections of digitized objects, Open Annotation for tagging things in the images and specialized ontologies that are specific to the contents. The tools include standard RDF / OWL technologies, JSON-LD, ImageMagick and OpenCV for image analysis.
- Julia Bauder, Grinnell College Libraries (bauderj-at-grinnell-dot-edu)
- No previous presentations at national Code4Lib conferences
As the corpus of articles, books, and other resources searched by discovery systems continues to get bigger, searchers are more and more frequently confronted with unmanageably large numbers of results. How can we help users make sense of 10,000 hits and find the ones they actually want? Facets help, but making sense of a gigantic sidebar of facets is not an easy task for users, either. During this talk, I will explain how we will soon be using Solr 4’s pivot queries and hierarchical visualizations (e.g., treemaps) from D3.js to let patrons view and manipulate search results. We will be doing this with our VuFind 2.0 catalog, but this technique will work with any system running Solr 4. I will also talk about early student reaction to our tests of these visualization features.
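One way to picture the approach above is to request a pivot facet from Solr and reshape the nested response into the name/children/value structure that D3's hierarchical layouts commonly consume. The sketch below is an assumption-laden illustration, not the presenter's code: the field names `format` and `language`, the base URL, and the helper names are hypothetical, while `facet.pivot` is the standard Solr parameter for pivot faceting.

```python
from urllib.parse import urlencode


def pivot_query_url(base_url, query, fields):
    """Build a Solr select URL that asks only for a pivot facet (no documents)."""
    params = {
        "q": query,
        "rows": 0,
        "wt": "json",
        "facet": "true",
        "facet.pivot": ",".join(fields),
    }
    return base_url + "?" + urlencode(params)


def pivot_to_tree(nodes, name="results"):
    """Convert a list of Solr pivot nodes into a nested name/children/value tree."""
    children = []
    for node in nodes:
        child = {"name": str(node["value"])}
        if node.get("pivot"):
            # Inner pivot levels become child nodes of this value.
            child["children"] = pivot_to_tree(node["pivot"])["children"]
        else:
            # Leaves carry the hit count, which sizes the treemap cell.
            child["value"] = node["count"]
        children.append(child)
    return {"name": name, "children": children}


# Hypothetical excerpt of facet_counts.facet_pivot["format,language"].
SAMPLE_PIVOT = [
    {"field": "format", "value": "Book", "count": 3, "pivot": [
        {"field": "language", "value": "English", "count": 2},
        {"field": "language", "value": "French", "count": 1},
    ]},
    {"field": "format", "value": "Map", "count": 1},
]
```

The resulting tree can be serialized as JSON and handed to the browser, where clicking a treemap cell could narrow the search with the matching filter query.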
PeerLibrary – open source cloud based collaborative library
PeerLibrary is a new open source project and a cloud service providing collaborative reading, sharing and storing. Users can upload publications they want to read (currently in PDF format), read them in the browser in real-time with others, highlight, annotate, and organize their own or a collaborative library. PeerLibrary provides a search engine to search over all uploaded open access publications. Additionally, it aims to collaboratively aggregate the open layer of knowledge on top of these publications through public annotations and references users will add to publications. In this way publications would not just be available to read, but accessible to the general public as well. Currently, it is aimed at the scientific community and scientific publications.
See screencast here.
It is still in development, and a beta launch is planned for the end of November.
Who was where when, or finding biographical articles on Wikipedia by place and time
- Emily Morton-Owens, The Seattle Public Library (presenting on work from NYU)
- No previous c4l presentations
It's easy to answer the question "What important people were in Paris in 1939?" But what about Virginia in the 1750s or Scandinavia in the 14th century? I created a tool that allows you to search for biographies in a generally applicable way, using a map interface. I would like to present updates to my thesis project, which combines a crawler written in Java that extracts information from Wikipedia articles, with a MongoDB data store and a frontend in Python.
The input to the project is freetext of entire articles in Wikipedia; this is important to allow us to pick up Benjamin Franklin not just in the single most obvious place of Philadelphia but also in London, Paris, Boston, etc. I can talk about my experiments disambiguating place names (approaches pioneered on newspaper articles were actually unhelpful on this type of text) and setting up a processing queue that does not become mired in the biographies of every human who ever played soccer. I also want to mitigate some of the implementation choices I made due to my academic deadline and improve the accuracy/usability.
What I hope to show is that I was able to develop a novel and useful reference tool automatically, using fairly simple heuristics that are a far cry from the hand-cataloging familiar to many librarians.
You can try out the original version (this server is inconveniently set to be updated/rebooted on 11/8, so it may be temporarily unavailable).
Good!, DRY, and Dynamic: Content Strategy for Libraries (Especially the Big Ones)
- Michael Schofield, Nova Southeastern University Libraries, email@example.com
- No previous code4lib presentations.
The responsibilities of the #libweb are exploding [it’s a good thing] and it is no longer uncommon for libraries to manage or even home-grow multiple applications and sites. Often it is at this point where the web people begin to suffer the absence of a content strategy when, say, business hours need to be updated sitewide a half-dozen times.
We were already feeling this crunch when we decided to further complicate the Nova Southeastern University Libraries by splitting the main library website into two. The Alvin Sherman Library, Research, and Information Technology Center is a unique joint-use facility that serves not only the academic community but the public of Broward County - and marketing a hyperblend of content through one portal just wasn't cutting it. With a web team of two, we knew that managing all this rehashed, disparate content was totally unsustainable.
I want to share in this talk how I went about making our library content DRY (“don’t repeat yourself”): input content in one place--blurbs, policies, featured events, featured databases, book reviews, business hours, and so on--and syndicate it everywhere - even, sometimes, dynamically target that content for specific audiences or contexts. It is a presentation that is a little about workflow, a little more about browser and context detection, a tangent about content-modeling the CMS, and a lot about APIs, syndication, and performance.
No code, no root, no problem? Adventures in SaaS and library discovery
- Erin White, VCU
- No previous C4L presentations
In 2012 VCU was an eager early adopter of Ex Libris' cloud service Alma as an ILS, ERM, link resolver, and single-stop, de-silo'd public-facing discovery tool. This has been a disruptive change that has shifted our systems staff's day-to-day work, relationships with others in the library, and relationships with vendors.
I'll share some of our experiences and takeaways from implementing and maintaining a cloud service:
- Seeking disruption and finding it
- Changing expectations of service and the reality of unplanned downtime
- Communication and problem resolution with non-IT library staff
- Working with a vendor that uses agile development methodology
- Benefits and pitfalls of creating customizations and code workarounds
- Changes in library IT/coders' roles with SaaS
...as well as thoughts on the philosophy of library discovery vs real-life experiences in moving to a single-search model.
Building for others (and ourselves): the Avalon Media System
- Michael B Klein, Senior Software Developer, Northwestern University
- Public Datasets in the Cloud (code4lib 2010)
- The Avalon Media System: A Next Generation Hydra Head For Audio and Video Delivery (code4lib 2013)
- Julie Rudder, Digital Initiatives Project Manager, Northwestern University
- no previous code4lib presentations
Avalon Media System is a collaborative effort between development teams at Northwestern and Indiana Universities. Our goal is to produce an open source media management platform that works well for us, but is also widely adopted and contributed to by other institutions. We believe that building a strong user and contributor community is vital to the success and longevity of the project, and have developed the system with this goal in mind. We will share lessons learned, pains and successes we’ve had releasing two versions of the application since last year.
Our presentation will cover our experiences:
- providing flexible, admin-friendly distribution and installation options
- building with abstraction, customization and local integrations in mind
- prioritizing features (user stories)
- attracting code contributions from other institutions
- gathering community feedback
- creating a product rather than a bag of parts
How to check your data to provide a great data product? Data quality as a key product feature at Europeana
- Péter Király, portal backend developer, Europeana
- No previous C4L presentations
Europeana.eu - Europe's digital library, archive and museum - aggregates more than 30 million metadata records from more than 2200 institutions. The records come from libraries, archives, museums and every other kind of cultural institution, from very different systems and metadata schemas, and are typically transformed several times until they are ingested into the Europeana data repository. Europeana builds a consolidated database from these records, creating reliable and consistent services for end-users (a search portal, search widget, mobile apps, thematic sites etc.) and an API, which supports our strategic goal of data for reuse in education, creative industries, and the cultural sector. A reliable "data product" is thus at the core of our own software products, as well as those of our API partners.
Much effort is needed to smooth out local differences in the metadata curation practice of our data providers. We need a solid framework to measure the consistency of our data and provide feedback to decision-makers inside and outside the organisation. We can also use this metrics framework to ask content providers to improve their own metadata. Of course, a data-quality-driven approach requires that we also improve the data transformation steps of the Europeana ingestion process itself. Data quality issues heavily define what new features we are able to create in our user interfaces and API, and might actually affect the design and implementation of our underlying data structure, the Europeana Data Model.
In the presentation I briefly describe the Europeana metadata ingestion process, show the data quality metrics, the measuring techniques (using the Europeana API, Solr and MongoDB queries), some typical problems (both trivial and difficult ones), and finally the feedback mechanism we propose to deploy.
Keywords: Europeana, data quality, EDM, API, Apache Solr, MongoDB, #opendata, #openglam
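As an illustration of the kind of metric such a framework might report, here is a minimal sketch of a field-completeness measure: the share of records in which each field carries a non-empty value. It is not Europeana's implementation. The field names and sample records are hypothetical, and a plain list of dictionaries stands in for records that would in practice come from the API, Solr or MongoDB.

```python
# Hypothetical field names; the real Europeana Data Model defines its own properties.
REQUIRED_FIELDS = ("title", "creator", "rights")


def is_filled(value):
    """Return True when a metadata value carries actual content."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())  # whitespace-only strings count as empty
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def field_completeness(records, fields=REQUIRED_FIELDS):
    """Return, per field, the fraction of records with a non-empty value."""
    total = len(records)
    result = {}
    for field in fields:
        filled = sum(1 for record in records if is_filled(record.get(field)))
        result[field] = filled / total if total else 0.0
    return result


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"title": "Map of Europe", "creator": "Unknown", "rights": "CC0"},
        {"title": "  ", "creator": ["A. Painter"]},
        {"title": "Letter", "creator": [], "rights": ""},
    ]
    for name, share in field_completeness(sample).items():
        print(f"{name}: {share:.0%}")
```

Computing the same figures per data provider would give each provider a simple, comparable number to act on.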
Teach your Fedora to Fly: scaling out your digital repository
- Aaron Coburn, Software Developer, Amherst College
- No previous C4L presentations
Fedora is a great repository system for managing large collections of digital objects, but what happens when a popular food magazine begins directing a large number of readers to a manuscript showing Emily Dickinson’s own recipe for doughnuts? While Fedora excels in its support of XML-based metadata, it doesn’t always perform well under a high volume of traffic. Nor is it especially tolerant of network or hardware failures.
This presentation will show how we are making heavy use of a Fedora repository while at the same time insulating it almost entirely from any web traffic. Starting with a distributed web front-end built with Node.js, and caching most of the user-accessible content from Fedora in an elastic, fault-tolerant Riak (NoSQL) cluster, we have eliminated nearly all single points of failure in the system. It also means that our production system is spread across twelve separate servers, where asynchrony and Map-Reduce are king. And aside from being blazing fast, it is also entirely Hydra-compliant.
Furthermore, we will attempt to answer the question: if Fedora crashes and the visitors to your site don’t notice, did it really fail?
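The insulation idea can be sketched in a few lines: requests are answered from a cache, so content that has already been copied stays available even when the repository behind it is down. This is only an illustration under stated assumptions, not the Amherst architecture. An in-memory dictionary stands in for the Riak cluster, and `fetch_from_backend` is a hypothetical callable standing in for a request to Fedora.

```python
class CachedRepository:
    """Serve objects from a cache so that readers rarely touch the backend."""

    def __init__(self, fetch_from_backend):
        # fetch_from_backend: hypothetical callable taking an object id and
        # returning its content; it may raise if the backend is unavailable.
        self._fetch = fetch_from_backend
        self._cache = {}  # stand-in for a distributed key-value store

    def refresh(self, object_id):
        """Copy one object into the cache (meant to run off the request path)."""
        self._cache[object_id] = self._fetch(object_id)

    def get(self, object_id):
        """Return cached content; contact the backend only on a cache miss."""
        if object_id in self._cache:
            return self._cache[object_id]
        content = self._fetch(object_id)
        self._cache[object_id] = content
        return content
```

With a cache warmed by `refresh`, a backend outage affects only objects that were never copied; everything else continues to be served.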
Using Open Source Software and Freeware to Preserve and Deliver Digital Videos
- Wei Fang, Head of Digital Services, Rutgers University Law Library
- Jiebei Luo, Digital Projects Initiative Intern, Rutgers University
- No previous C4L presentations
The Rutgers University Law Library has been the official digital repository for New Jersey Supreme Court oral arguments since 2002. This large video collection contains approximately 3,000 videos totaling 400 GB, or 6,000 viewing hours. As the collection expanded, the existing database and static website could no longer efficiently support the library’s daily operations or meet its patrons’ search needs. By utilizing open source software and freeware such as Ubuntu, FFmpeg, Solr and Drupal, the library has been able to develop a complete solution that re-encodes videos, embeds subtitles, integrates the Solr search engine with a content management system to support full-text subtitle search, automatically updates video metadata records in the library catalog system and, finally, provides a plug-in-free, HTML5-based Web interface for patrons to view the videos online. The aspects below will be presented in detail at the conference:
- Video codec comparison
- Server-side batch video encoding/re-encoding
- HTML5 video tag and embedding subtitles
- Integrating the Solr search engine and the Drupal content management system with the database to retrieve videos by full-text search, especially within subtitle files
- Incorporating video metadata with the library catalog system
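The batch re-encoding step could be scripted along the following lines. This is a rough sketch, not the library's actual workflow. The H.264/AAC codec choice, the `.mp4` target container and the file names are assumptions made for illustration. The function only builds FFmpeg argument lists and does not run them.

```python
from pathlib import Path


def build_reencode_commands(sources, out_dir, video_codec="libx264", audio_codec="aac"):
    """Build one ffmpeg argument list per source video.

    The default codecs are illustrative assumptions, not a recommendation
    taken from the proposal.
    """
    commands = []
    for source in sources:
        source = Path(source)
        # Keep the base name and switch the container to MP4.
        target = Path(out_dir) / (source.stem + ".mp4")
        commands.append([
            "ffmpeg",
            "-i", str(source),    # input file
            "-c:v", video_codec,  # video encoder
            "-c:a", audio_codec,  # audio encoder
            str(target),          # output file
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in build_reencode_commands(["argument_001.wmv"], "encoded"):
        print(" ".join(command))
```

Each argument list can be handed to `subprocess.run(command, check=True)` on a server where FFmpeg is installed.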