2015 Prepared Talk Proposals

Code4Lib 2015 is a loosely-structured conference that provides people working at the intersection of libraries/archives/museums/cultural heritage and technology with a chance to share ideas, be inspired, and forge collaborations. For more information about the Code4Lib community, please visit http://code4lib.org/about/. The conference will be held at the Portland Hilton & Executive Tower in Portland, Oregon, from February 9-12, 2015.

Proposals for Prepared Talks:

We encourage everyone to propose a talk.

Prepared talks are 20 minutes (including setup and questions), and should focus on one or more of the following areas:

Projects you've worked on which incorporate innovative implementation of existing technologies and/or development of new software
Tools and technologies – How to get the most out of existing tools, standards and protocols (and ideas on how to make them better)
Technical issues – Big issues in library technology that should be addressed or better understood
Relevant non-technical issues – Concerns of interest to the Code4Lib community which are not strictly technical in nature, e.g. collaboration, diversity, organizational challenges, etc.

Proposals can be submitted through Friday, November 7, 2014 at 5pm PST (GMT−8). Voting will start on November 11, 2014 and continue through November 25, 2014. The URL to submit votes will be announced on the Code4Lib website and mailing list and will require an active code4lib.org account to participate. The final list of presentations will be announced in early- to mid-December.

Proposals for Prepared Talks:

Log in to the Code4Lib wiki and edit this wiki page using the prescribed format. If you are not already registered, follow the instructions to do so. Provide a title and brief (500 words or fewer) description of your proposed talk. If you so choose, you may also indicate when, if ever, you have presented at a prior Code4Lib conference. This information is completely optional, but it may assist voters in opening the conference to new presenters.

Please follow the formatting guidelines:


== Talk Title: ==
 
* Speaker's name,  email address, and (optional) affiliation
* Second speaker's name, email address, and affiliation, if second speaker

Abstract of no more than 500 words.

Talk Proposals

No cataloging software? Need more than Dublin Core? No problem!: Experiences with CollectiveAccess

Sean Q. Hendricks, sqhendr@clemson.edu, Clemson University
Rachel Wittmann, rwittma@clemson.edu, Clemson University

Clemson University Libraries has implemented the open-source software CollectiveAccess for customized digital collection needs. CollectiveAccess is an open-source project with the goal of providing a flexible way to manage and publish museum and archival collections. Several applications are associated with the project; the most used are Providence (for cataloging and entering metadata) and Pawtucket (for displaying objects in a collection for the public). Many installation profiles based on existing library standards, such as Dublin Core, are readily available, and there is a robust syntax for creating your own profiles to fit custom-tailored metadata schemas. Plus, the user interface allows you to modify the metadata profile quickly and easily.

In this talk, we will discuss:

Our experiences with installing Providence and creating an installation profile that satisfies the needs of many of the Clemson Libraries digital archiving processes.
The stumbling blocks experienced in that process and how they were resolved.
The available plugins sourcing widely used authorities, such as Library of Congress thesauri and GeoNames.org, and how they have been used by our projects.
A brief overview of the export and import functions and also current workflow practices within Providence.
Future plans & the role of CollectiveAccess at Clemson University Libraries

Getting ContentDM and Wordpress to Play Together

Sean Q. Hendricks, sqhendr@clemson.edu, Clemson University

Clemson University Libraries has a very strong program for digitizing and archiving photographs, and the Digital Imaging team processes many hundreds of photographs every month. These images are managed using different methods, including ContentDM, a digital collection manager.

ContentDM provides various methods for searching and displaying photographs, along with their metadata. However, recent initiatives have resulted in the need to leverage those collections into exhibits displayed on other library-related websites, such as that of our Special Collections unit. Clemson Libraries has invested heavily in Wordpress as our content management system of choice, and it seemed most efficient not to have to export and import images into our Wordpress sites in order to provide exhibited images.

Fortunately, ContentDM has provided an API to many of their functions, allowing the extraction of metadata and even rescaled images through URLs. This project has been developing a plugin for Wordpress that integrates with ContentDM through shortcodes that Wordpress editors can easily include in their content. These shortcodes allow editors to choose how many images, which images from which collections, thumbnail sizes, etc. to display in different gallery styles. Plans are for it to allow integration with different plugins such as Fancybox and Masonry.

In this presentation, I will demonstrate the current state of the plugin and discuss future plans.

Refinery — An open source locally deployable web platform for the analysis of large document collections

Daeil Kim, The New York Times, daeil.kim@nytimes.com

Refinery is an open source web platform for the analysis of large unstructured document collections. It extracts meaningful semantic themes within documents, also known as "topics", which can be thought of as word clouds composed of terms that frequently co-occur with one another. Once this semantic index is formed, one can extract relevant documents related to these topics and further refine their contents through a summarization process that allows users to search for phrases that are relevant to them within the corpus. The goal of Refinery is to make this whole process easier and to provide some of the latest scalable versions of these learning algorithms in an intuitive web-based interface. Refinery is also meant to be run locally, thus bypassing the need for securing document collections over the internet. The talk will go through some of the technologies involved and a demo of the app.

For more info check out http://www.docrefinery.org.

Drupal 8 — Evolution & Revolution

Cary Gordon, The Cherry Hill Company, cgordon@chillco.com

Drupal 8 is in beta and nearing release. Among its many features, it notably has become more developer friendly through its adoption of the Symfony PHP framework along with Symfony's outstanding set of libraries (like Guzzle) and tools (like Composer). And, in implementing the Twig theming system, it can begin to escape PHPTemplate. These moves also make it easier to create headless systems that use Angular.js and other systems for presentation, or even forgo presentation entirely.

From the site-builder's perspective, Drupal 8 provides a much smoother experience and makes it easier to build and implement site recipes.

Using GameSalad to Build a Gamified Information Literacy Mobile App for Higher Education

Stanislav 'Stan' Bogdanov, stan@stanrb.com, Adelphi University and Boglio LLC

GameSalad is a popular tool for developing mobile and desktop games with little actual programming. In this presentation, Stan Bogdanov breaks down the development process he followed while building mobiLit, a mobile app with the goal of being the first open-source gamified information literacy app to be used as part of a college-level information literacy curriculum. He will go through the basics of using GameSalad to create an app that can be easily customized by non-programmers and the instructional principles used to teach the material in a mobile medium. Stan will also go through two qualitative design studies he did on the app and discuss their results and the lessons learned from building mobiLit. The session will conclude with an overview of the next steps for the mobiLit project.

The Impossible Search: Pulling data from multiple unknown sources

Riley Childs, no official affiliation (currently a Senior in High School at Charlotte United Christian Academy), rchilds (AT) cucawarriors.com

It's easy to search data whose structure you know, but what if you need to pull in data from sources that don't have a standard structure? The ability to search community events along with your standard catalog search results is an example, but often the only way to pull these events is through XML, JSON, (insert structured format here), or even just raw HTML. But how do you get that structure? That simple question is what makes this impossible. Defining and processing this structure takes a lot of manual labor, especially if the data you are pulling is just HTML, and every time you add data to the index you have to run it all through a script to convert it into a format that Solr or another index can use. This talk will focus on Solr, but the principles explained will apply to many other indexes.
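As a rough illustration of the normalization step described above (not the presenter's code), the following Python sketch converts one JSON feed and one HTML page into a common list of flat documents. The field names (`id`, `title`, `source`), the JSON key `name`, and the `event-title` CSS class are all hypothetical, chosen only for the example.

```python
import json
from html.parser import HTMLParser


class EventTitleParser(HTMLParser):
    """Collect the text of elements marked class="event-title" (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; match the exact class value.
        if ("class", "event-title") in attrs:
            self._capturing = True
            self.titles.append("")

    def handle_data(self, data):
        if self._capturing:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture, so nested
        # markup inside a title would be cut short.
        self._capturing = False


def docs_from_json(payload, source):
    """Map a JSON array of objects with a "name" key to flat index documents."""
    items = json.loads(payload)
    return [
        {"id": f"{source}-{i}", "title": item["name"].strip(), "source": source}
        for i, item in enumerate(items)
    ]


def docs_from_html(markup, source):
    """Map scraped HTML titles to the same flat document shape."""
    parser = EventTitleParser()
    parser.feed(markup)
    return [
        {"id": f"{source}-{i}", "title": title.strip(), "source": source}
        for i, title in enumerate(parser.titles)
    ]


json_feed = '[{"name": "Story Time"}]'
html_page = '<ul><li class="event-title">Book Sale </li><li class="event-title">Chess Club</li></ul>'
docs = docs_from_json(json_feed, "city") + docs_from_html(html_page, "parks")
```

Once every source is reduced to the same document shape, the list could be serialized with `json.dumps` and sent to an index; the manual labor the abstract mentions lies in writing one such mapping per source.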

What! You're Not Using Docker?

Cary Gordon, The Cherry Hill Company, cgordon@chillco.com

Boring part: Docker[1] is a container system that provides benefits similar to virtualization with only a fraction of the overhead. Scintillating part: Docker can host four to six times as many service instances as systems such as Xen or VMware on a given piece of hardware. But that's not all! Docker also makes it simple(r) to create transportable instances, so you can spin up development servers on your laptop.

[1]https://www.docker.com/

Video Accessibility, WebVTT, and Timed Text Track Tricks

Jason Ronallo, jronallo@gmail.com, NCSU Libraries

Video on the Web presents new challenges and opportunities. How do you make your video more accessible to those with various disabilities and needs? I'll show you how. This presentation will focus on how to write and deliver captions, subtitles, audio descriptions, and timed metadata tracks for Web video using the WebVTT W3C standard. Encoding timed text tracks in this way opens up opportunities for new functionality on your websites beyond accessibility. The presentation will show some examples of the potential for using timed text tracks in creative ways. I'll cover all the HTML and JavaScript you will need to know as well as some of the CSS and other bits you could probably do without but are too fun to pass up.
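For readers unfamiliar with the format, here is a minimal Python sketch (an illustration, not material from the talk) that writes caption cues as WebVTT text. The helper names `format_timestamp` and `build_webvtt` and the sample cues are made up for the example; the output follows the basic WebVTT layout of a `WEBVTT` header followed by cues separated by blank lines.

```python
def format_timestamp(seconds):
    """Render a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_webvtt(cues):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        if end <= start:
            raise ValueError("cue end time must be after its start time")
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


sample_cues = [
    (0.0, 2.5, "Welcome to the library."),
    (2.5, 6.0, "Today we look at our archives."),
]
vtt_text = build_webvtt(sample_cues)
```

A file with this content can be attached to an HTML `video` element through a `track` element (for example with `kind="captions"`), which is the delivery mechanism the abstract refers to.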

Categorizing Records with Random Forests

Geoffrey Boushey, geoffrey.boushey@ucsf.edu, UCSF Library

Academic libraries are increasingly responsible for providing ingest, search, discovery, and analysis for data sets. Emerging techniques from data science and machine learning can provide librarians and developers with an opportunity to generate new insights and services from these document collections. This presentation will provide a brief overview of common machine learning classification techniques, then dive into a more detailed example using a random forest to assign keywords to research data sets. The talk will emphasize the insight that can be gained from machine learning rather than the inner workings of the algorithms. The overall goal of this presentation is to provide librarians and developers with the context to recognize an opportunity to apply machine learning categorization techniques at their home campuses and organizations.

Data Science in Libraries

Devon Smith, smithde@oclc.org, OCLC

Data Science is increasing in buzz and hype. I'll go over what it is, what it isn't, and how it fits in libraries.

PDF metadata extraction for academic literature

Kevin Savage, kevin.savage at mendeley.com, Mendeley
Joyce Stack, joyce.stack at mendeley.com, Mendeley

Mendeley recently added a "document from file" endpoint to its API, which attempts to extract metadata such as title and authors directly from PDF files. This talk will describe at a high level the machine learning methods we used, including how we measured and tuned our model. We will then delve more deeply into our stack, the tools we used, some of the things that didn't work, and why PDFs are the worst thing ever to compute over.

Giving Users What They Want: Record Grouping in VuFind

Mark Noble, mark@marmot.org, Marmot Library Network

In 2013, Marmot did extensive usability studies with patrons to determine what was difficult in the catalog. Many patrons had problems sifting through all of the various formats and editions of a title. In 2014 we developed a method for grouping records so only a single work is shown in search results and all formats and editions are listed under that work. We will discuss our definition of a 'work' based on FRBR principles; combining metadata from MARC records with metadata from other sources like OverDrive; the technical details of Record Grouping; the design decisions made during implementation; and the reaction from users and staff.
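The general idea of grouping records under a work can be sketched in a few lines of Python. This is only an assumption-laden illustration: the grouping key used here (normalized title plus normalized author) and the record fields are hypothetical and are not Marmot's actual definition of a work.

```python
import re
from collections import defaultdict


def _squash(value):
    """Lowercase, turn punctuation into spaces, and collapse whitespace.

    Simplification: non-ASCII letters are dropped by this pattern.
    """
    value = re.sub(r"[^a-z0-9]+", " ", value.lower())
    return " ".join(value.split())


def grouping_key(title, author):
    """Build a hypothetical work key from a title and an author string."""
    normalized_title = re.sub(r"^(the|a|an) ", "", _squash(title))
    return (normalized_title, _squash(author))


def group_records(records):
    """Group record dicts (with "title" and "author" keys) by work key."""
    works = defaultdict(list)
    for record in records:
        works[grouping_key(record["title"], record["author"])].append(record)
    return dict(works)


records = [
    {"id": "marc:1", "title": "The Hobbit", "author": "Tolkien, J. R. R.", "format": "Book"},
    {"id": "overdrive:9", "title": "Hobbit", "author": "Tolkien, J.R.R.", "format": "eBook"},
    {"id": "marc:2", "title": "Dune", "author": "Herbert, Frank", "format": "Audiobook"},
]
works = group_records(records)
```

With these sample records, the print and eBook editions of the same title land under one key, so a search result could show a single entry with both formats listed beneath it.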

Topic Space: a mobile augmented reality recommendation app

Jim Hahn, jimhahn@illinois.edu, University of Illinois at Urbana-Champaign

The Topic Space module (http://minrvaproject.org/modules_topicspace.php ) was developed with an IMLS Sparks! Grant to investigate augmented reality technologies for in-library recommendations. The funding allowed for sustained university community collaboration by the University Library, the Graduate School of Library and Information Science, as well as graduate student programmers sourced from the Department of Computer Science. Collaborators designed app functionality and identified relevant open source libraries that could power optical character recognition (OCR) functionality from within the mobile phone.

Topic Space allows a user to take a picture of an item's call number in the book stacks. The module will show the user other books that are relevant but that are not shelved nearby. It can also show users books that are normally shelved here but that are currently checked out. Recommendations are based on Library of Congress subject headings and ILS circulation data, which indicate recommendation candidates based on total check-outs.
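One simple way to combine subject headings with circulation totals is sketched below in Python. It is a guess at the general shape of such a ranking, not the project's algorithm; the record fields (`call_number`, `subjects`, `total_checkouts`) and the sample data are invented for the example.

```python
def recommend(scanned, catalog, limit=3):
    """Rank items sharing a subject heading with the scanned item by check-outs."""
    scanned_subjects = set(scanned["subjects"])
    candidates = [
        item
        for item in catalog
        if item["call_number"] != scanned["call_number"]
        and scanned_subjects & set(item["subjects"])
    ]
    # Most-circulated candidates first; ties keep catalog order.
    candidates.sort(key=lambda item: item["total_checkouts"], reverse=True)
    return candidates[:limit]


catalog = [
    {"call_number": "QA76.9 .D3", "subjects": ["Databases"], "total_checkouts": 12},
    {"call_number": "Z699 .A1", "subjects": ["Information retrieval", "Databases"], "total_checkouts": 40},
    {"call_number": "PS3545 .H16", "subjects": ["Fiction"], "total_checkouts": 99},
    {"call_number": "QA76.9 .D26", "subjects": ["Databases"], "total_checkouts": 7},
]
scanned_item = catalog[0]
picks = recommend(scanned_item, catalog, limit=2)
```

Here the heavily circulated fiction title is ignored because it shares no heading with the scanned item, while a related book shelved under a different class number ranks first.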

Research questions included development of back end (server-side) pattern matching algorithms for recommendations, and a rapid formative evaluation of interface design that would provide optimal user experience for navigation of the book stacks as a context to recommendations.

Along with the Topic Space native app, grant collaborators prototyped web-based recommendations which could serve as a new way of providing readers' advisory and “more like this” recommendations from discovery interfaces accessed through desktop browsers. Outcomes of the grant include the availability of the Topic Space module within the Minrva app on the Android Play store and an experimental Backbone.js based Topic Space web app.

Leveling Up Your Git Workflow

Megan Kudzia, moneill@albion.edu, Albion College Library
Kate Sears, eks11@albion.edu, Albion College Library

Have you started experimenting with Git on your own, but now you need to include others in your projects? Learn from our mistakes! Transitioning from a one-person Git workflow and repo structure to a structure that includes multiple people (including student workers) is not for the faint of heart. We'll talk about why we decided to work this way, our path to developing a Git culture amongst ourselves, conceptual and technical difficulties we've faced, what we learned, and where we are now. Pretty pictures (aka workflow drawings) included.

Drone Loaning Program: Because Laptops are so last century

* Uche Enwesi, uenwesi@umd.edu, University of Maryland Libraries
* Francis Kayiwa, fkayiwa@umd.edu, University of Maryland Libraries

At the University of Maryland we are in the very early stages of looking into allowing our student body to get their hands on a drone. Yes, that's right: we will let students take out a drone for n hours to work on projects of their choosing. The talk will cover the logistics of getting a program of this sort from concept to "Is the drone available?". If people sign waivers, we will also promise not to crash the drone into Code4Lib attendees.

Got Git? Getting More Out of Your GitHub Repositories

* Terry Brady, twb27@georgetown.edu, Georgetown University Library

This presentation will discuss how librarians, developers, and system administrators at Georgetown University are maximizing their use of public and private GitHub repositories.

In addition to all of the great benefits of using Git for code management, the GitHub interface provides a powerful set of tools to showcase a project and to keep your users informed of developments to your project. These tools can assist with marketing and outreach, turning your code repository into a focus of conversation!

Style-able Project Pages
Project Wikis
Project Release Notes/Portfolios
Web Resources That Can Be Directly Requested
Gists for code sharing
Private Repositories and Organizational Groups
Pull Request Conversation Tracking
Customized Issue management

Quick Wins for Every Department in the Library - File Analyzer!

* Terry Brady, twb27@georgetown.edu, Georgetown University Library

The Georgetown University Library has customized workflows for nearly every department in our library with a single code base.

Analyzing MARC records for the Cataloging department
Transferring ILS invoices for the University Account System for the Acquisitions department
Delivering patron fines to the Bursar’s office for the Access Service department
Summarizing student worker timesheet data for the Finance department
Validating COUNTER compliant reports for the Electronic Resources department
Generating ingest packages for the Digital Services department
Validating checksums for the Preservation department

Learn how you can customize the File Analyzer to become a hero in your library!
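To make one of the listed tasks concrete, here is a small Python sketch of checksum validation against a manifest. It is not File Analyzer code; the function names, the manifest shape (file name mapped to an expected SHA-256 hex digest), and the status labels are assumptions made for the illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a file's hex digest, reading in chunks to handle large files."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(directory, manifest):
    """Compare files in a directory against expected digests.

    manifest maps a file name to its expected hex digest; the result maps
    each file name to "ok", "mismatch", or "missing".
    """
    results = {}
    for name, expected in manifest.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif file_checksum(path) == expected.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `validate_manifest("/path/to/files", manifest)` returns a per-file report that a preservation workflow could log or review.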

The Geospatial World is Moving from Maps on the Web to Maps of the web. Libraries can too

Mita Williams, mita@uwindsor.ca, User Experience Librarian, University of Windsor

The transition from paper maps to digital ones changed much more than the maps themselves; it changed the very foundation of how we work and how we find each other. Now maps are transforming again. The Geospatial World is moving away from GIS systems that are institutionally focused, expensive, feature-burdened, and that bind data into a complicated, demanding, user-hostile interface. This transition from digital to web-based digital geospatial tools has brought growth and development in new forms of map-based investigative journalism, activism, scholarship, and business ventures. This talk will highlight the conditions and strategies that made these changes possible as a means to draw a path that librarians may follow through our own work, dragons notwithstanding.

Building Your Own Federated Search

Rich Trott, Richard.Trott@ucsf.edu, UC San Francisco

Advances in modern browsers have created some interesting possibilities for federated search. This presentation will cover common techniques and pitfalls in building a federated search. We will discuss what principles guided our decisions when implementing our own federated search. We will show tools we've built and our findings from building and using experimental prototypes.

Your higher education institution likely offers dozens of online resources for educators, students, researchers, and the public. And each of these online resources likely has its own search tool. But users can't be expected to search in dozens of different interfaces to find what they're looking for. A typical solution for this issue is federated search.

Indexing Linked Data with LDPath

Chris Beer, cabeer@stanford.edu, Stanford University Libraries

LDPath [1] is a simple query language for indexing linked open data, with support for caching, content negotiation, and integration with non-RDF endpoints. This talk will demonstrate the features and potential of the language and framework to index a resource with links into id.loc.gov, viaf.org, geonames.org, etc., to build an application-ready document.

[1] http://marmotta.apache.org/ldpath/language.html

Show Me the Money: Integrating an LMS with Payment Providers

Josh Weisman, Josh.Weisman@exlibrisgroup.com, Development Director-Resources Management, Ex Libris Group

In order to provide an easy and convenient way for patrons to pay fines, we are exploring ways to integrate the library management system with online payment providers such as PayPal. With many library management systems being designed and developed for the cloud, we should be able to provide the frictionless user experience our patrons have come to expect from online transactions. In this session we'll discuss strategies for integration and review a sample application which uses REST APIs from a library management system to integrate with PayPal.

Shibboleth Federated Authentication for Library Applications

Scott Fisher, scott.fisher@ucop.edu, California Digital Library
Ken Weiss, ken.weiss@ucop.edu, California Digital Library

Shibboleth is the most widely-used method to provide single-sign-on authentication to academic applications where users come from many different institutions. Shibboleth, the InCommon education and research trust framework, and the SAML protocol comprise a very powerful - but very complicated - solution to this very complicated problem. Scott and Ken have implemented Shibboleth for multiple library applications. They will share their understanding of the good, the bad, and the underlying spaghetti that makes it all work. Ken will discuss some of the technical aspects of the solution, touching on optimal and non-optimal use cases, administrative challenges, and authorization concerns. Scott will describe the implementation pattern for multi-institution single-sign-on that the California Digital Library has evolved, using the recently released Dash application (http://dash.cdlib.org) as an example.

Scientific Data: A Needs Assessment Journey

Vicky Steeves, vsteeves@amnh.org, American Museum of Natural History

While surveying digital research and collections data in the research science divisions at the American Museum of Natural History in NYC (as a part of my National Digital Stewardship Residency project), I have come across the big data hogs (genome sequencing and CT scanning) and the little pieces of data (images, publications), all equally important not only to scientific discovery but also as nodes in the history of science.

In this session, I will discuss the development of my needs assessment surveys for scientific datasets and the interview process with Museum curators and researchers as background, segueing into an explanation of the results. I will then combine my findings into preliminary selection criteria to choose tools for digital preservation and management unique to scientific datasets. This will broach a discussion on emerging standards, tools, and technologies in big data, specific to research science.

I will conclude with preliminary findings on emerging technology that can be used to answer concerns surrounding the management and digital preservation of these data. I am hoping the Q&A session can be used both to answer questions about my project and to function as a way for you (the larger tech-savvy library community) to discuss the tools I’ve touched on in this talk.

Feminist Human Computer Interaction (HCI) in Library Software

Bess Sadler, bess@stanford.edu, Stanford University Libraries

Libraries are not neutral repositories of knowledge. Library classification systems and search technologies tend to reflect the inequalities, biases, ethnocentrism, and power imbalances of the societies in which they are built [1]. How might we better resist these tendencies in the library software we create? This talk will examine some qualities of feminist HCI (pluralism, self-disclosure, participation, ecology, advocacy, and embodiment) [2] through the lens of library software.

[1] Olson, Hope A. (2002). The Power to Name: Locating the Limits of Subject Representation in Libraries. Dordrecht, The Netherlands: Kluwer Academic Publishers.

[2] Bardzell, Shaowen. Feminist HCI: Taking Stock and Outlining an Agenda for Design. CHI 2010: HCI For All. http://dmrussell.net/CHI2010/docs/p1301.pdf

Heiðrún: DPLA's Metadata Harvesting, Mapping and Enhancement System

Audrey Altman, audrey at dp.la, Digital Public Library of America
Gretchen Gueguen, gretchen at dp.la, Digital Public Library of America
Mark Breedlove, mb at dp.la, Digital Public Library of America

The Digital Public Library of America aggregates metadata for over 8 million objects from more than 24 direct partners, or Hubs, using its Metadata Application Profile (MAP), an RDF metadata application profile based on the Europeana Data Model. After working with the initial system for harvesting, mapping and enhancing our Hubs’ metadata for a year, we realized that it was inadequate for working with data at this scale. There were architectural issues; it was opaque to non-developer and partner staff; there were inadequate tools for quality assurance and analysis; and the system was unaware that it was working with RDF data. As the network of Hubs expanded and we ingested more metadata, it became harder and harder to know when or why a harvest, a mapping task, or an enrichment went wrong because the tools for quality assurance were largely inadequate.

The DPLA Content and Technology teams decided to develop a new system from the ground up to address those problems. Development of Heidrun, the internal version of the new system, started in October 2014. Heidrun’s goals are to make it easier for us to harvest and map metadata from various sources and in a variety of schemas to the DPLA MAP, to better enrich that metadata using external data sources, and to actively involve our partners in the ingestion process through access to better QA tools. Heidrun and its componentry are built on Ruby on Rails, Blacklight, and ActiveTriples. Our presentation will give some background on our design principles and processes used during development, the architecture of the system, and its functionality. We plan to release a version of Heidrun and its components as a generalized metadata aggregation system for use by DPLA Hubs and others working to aggregate cultural heritage metadata.

OS or GTFO: Program or Perish

Tessa Fallon, tessa.fallon@gmail.com

Description TBD

Creating Dynamic— and Cheap!— Digital Displays with HTML 5 Authoring Software

Chris Woodall, cmwoodall@salisbury.edu, Salisbury University Libraries

Would your library like to have large digital signage that displays dynamic information such as library hours, weather, room availability, and more? Have you looked into purchasing large digital signage, only to be turned off by the high price tag and lack of customization available with commercial solutions? Our library has developed a cheap and effective alternative to these systems using HTML 5 authoring software, a large TV, and freely-available APIs from Google, Springshare, and others. At this session, you’ll learn about the system that we have in place for displaying dynamic and easily-updatable information on our library’s large digital display, and how you can easily create something similar for your library.

REPOX: Metadata Blender

John Mignault, jmignault@metro.org, Empire State Digital Network

With the growth in the number of hubs providing metadata to the Digital Public Library of America, many of them are using REPOX, a tool originally created for the Europeana project. We'll take a look at REPOX and its capabilities and how it can be useful for ingesting and transforming metadata.

Beyond Open Source

Jason Casden, jmcasden@ncsu.edu, NCSU Libraries
Bret Davidson, bddavids@ncsu.edu, NCSU Libraries

The Code4Lib community has produced an increasingly impressive collection of open source software over the last decade, but much of this creative work remains out of reach for large portions of the library community. Do the relatively privileged institutions represented by a majority of Code4Lib participants have a professional responsibility to support the adoption of their innovations?

Drawing from old and new software packaging and distribution approaches (from freeware to Docker), we will propose extending the open source software values of collaboration and transparency to include the wide and affordable distribution of software. We believe this will not only simplify the process of sharing our applications within the Code4Lib community, but also make it possible for less well resourced institutions to actually use our software. We will identify areas of need, present our experiences with the users of our own open source projects, discuss our attempts to go beyond open source, and make an argument for the internal value of supporting and encouraging a vibrant library ecosystem.

Making It Work: Problem Solving Using Open Source at a Small Academic Library

Adam Strohm, astrohm@iit.edu, Illinois Institute of Technology
Max King, mking9@iit.edu, Illinois Institute of Technology

The Illinois Institute of Technology campus was added to the National Register of Historic Places in 2005, and contains a building, Mies van der Rohe's S.R. Crown Hall, that was named a National Historic Landmark in 2001. Creating a digital resource that can adequately showcase the campus and its architecture is challenge enough in and of itself, but doing so as a two-person team of relative newcomers, at a university library without dedicated programmers on staff, ups the ante considerably. The challenges of technical know-how, staff time, and funding are nothing new to anyone working on digital projects at a university library, and are amplified when doing so at a smaller institution. This talk covers the conception, development, and design of the campus map site that was built, concentrating on the problem-solving strategies developed to cope with limited technical and financial resources. We'll talk about our approach to development with Open Source software, including Omeka, along with the Neatline and Simile Timeline plugins. We'll also discuss the juggling act of designing for mobile mapping functionality without sacrificing desktop design, weighing the costs of increased functionality versus our ability to time-effectively include that functionality, and the challenge of building a site that could be developed iteratively, with an eye towards future enhancement and sustainability. Finally, we’ll provide recommendations for other librarians at smaller institutions for their own efforts at digital development.

Recording Digitization History: Metadata Options for the Process History of Audiovisual Materials

Peggy Griesinger, peggy_griesinger@moma.org, Museum of Modern Art

The Museum of Modern Art has amassed a large collection of audiovisual materials over its many decades of existence. In order to preserve these materials, much of the audiovisual collection has been digitized. This is a complex process involving numerous steps and devices, and the methods used for digitization can have an effect on the quality of the file that is preserved. Therefore, knowing exactly how something was digitized is critical for future stewards of these objects to be able to properly care for and preserve them. However, detailed technical information about the processes involved in the digitization of audiovisual materials is not defined explicitly in most metadata schemas used for audiovisual materials. In order to record process history using existing metadata standards, some level of creativity is required to allow existing standards to express this information.

This talk will detail different metadata standards, including PBCore, PREMIS, and reVTMD, that can be implemented as methods of recording this information. Specifically, the talk will examine efforts to integrate this metadata into the Museum of Modern Art’s new digital repository, the DRMC. This talk will provide background on the DRMC as well as MoMA’s specific institutional needs for process history metadata, then discuss different metadata implementations we have considered to document process history.

Pig Kisses Elephant: Building Research Data Services for Web Archives

Jefferson Bailey, jefferson@archive.org, Internet Archive
Vinay Goel, vinay@archive.org, Internet Archive

More and more libraries and archives are creating web archiving programs. For both new and established programs, these archives can consist of hundreds of thousands, if not millions, of born-digital resources within a single collection; as such, they are ideally suited for large-scale computational study and analysis. Yet current access methods for web archives consist largely of browsing the archived web in the same manner as browsing the live web, and the size of these collections and the complexity of the WARC format can make aggregate analysis difficult. This talk will describe a project to create new ways for users and researchers to access and study web archives by offering extracted and post-processed datasets derived from web collections. Working with the 325+ institutions and their 2600+ collections within the Archive-It service, the Internet Archive is building methods to deliver a variety of datasets culled from collections of web content, including extracted metadata packaged in JSON, longitudinal link graph data, named entities, and other types of data. The talk will cover the technical details of building dataset production pipelines with Apache Pig, Hadoop, and tools like Stanford NER, the programmatic aspects of building data services for archives and researchers, and ongoing work to create new ways to access and study web archives.

Awesome Pi, LOL!

Matt Connolly, mconnolly@cornell.edu, Cornell University Library
Jennifer Colt, jrc88@cornell.edu, Cornell University Library

Inspired by Harvard Library Lab’s “Awesome Box” project, Cornell’s Library Outside the Library (LOL) group is piloting a more automated approach to letting our users tell us which materials they find particularly stunning. Armed with a Raspberry Pi, a barcode scanner, and some bits of kit that flash and glow, we have ventured into the foreign world of hardware development. This talk will discuss what it’s like for software developers and designers to get their hands dirty, how patrons are reacting to the Awesomizer, and LOL’s not-afraid-to-fail philosophy of experimentation.

You Gotta Keep 'em Separated: The Case for "Bento Box" Discovery Interfaces

Jason Thomale, jason.thomale@unt.edu, University of North Texas Libraries

I know, I know--proposing a talk about Resource Discovery is like, so 2010.

The thing is, practically all of us--in academic libraries at least--have a similar setup for discovery, with just a few variations, and so talking about it still seems useful. Stop me if this sounds familiar. You've got a single search box on the library homepage as a starting point for discovery. And it's probably a tabbed affair, with an option for searching the catalog for books, an option for searching a discovery service for articles, an option for searching databases, and maybe a few others. Maybe you have an option to search everything at once--probably the default, if you have it. And, if you're a crazy hepcat, maybe you only have your one search that searches everything, with no tabs.

Now, the question is, for your "everything" search, are you doing a combined list of results, or are you doing it bento-box style, with a short results list from each category displayed in its own compartment?

At UNT, we've been holding off on implementing an "everything" search, for various reasons. One reason is that the evidence for either style hasn't been very clear. There's this persistent paradox that we just can't reconcile: users tell us, through word and action, that they prefer searching Google, yet libraries aren't Google, and there are valid design reasons why we shouldn't try to oversimplify our discovery interfaces to be like Google. And there's user data that supports both sides.

Holding off on making this decision has granted us 2 years of data on how people use our tabbed search interface that does not include an "everything" search. Recently I conducted a thorough analysis of this data--specifically the usage and query data for our catalog and discovery system (Summon). And I think it helps make the case for a bento box style discovery interface. To be clear, it isn't exactly the smoking gun that I was hoping for, but the picture it paints I think is telling. At the very least, it points away from a combined-results approach.

I'm proposing a talk discussing the data we've collected, the trends we've seen, and what I think it all means--plus other reasons that we're jumping on the "bento box" discovery bandwagon and why I think "bento box" is at this point the path that least sells our souls.

Don’t know about you, but I’m feeling like SHA-2!: Checksumming with Taylor Swift

Ashley Blewer!, ashley.blewer@gmail.com

Checksum technology is used all over the place, from git commits to authenticating Linux packages. In the digital preservation field, it is most commonly used to monitor materials in storage for changes over time and to confirm that files were transmitted intact during duplication. But do you even checksum, bro? I want this talk to move checksums from a position of mysterious macho jargon to something everyone can understand and want to use. I think a lot of people have heard of checksums but don’t know where to begin when it comes to actually using them at their institution. And cryptography is hella intimidating! This talk will cover what checksums are, how they can be integrated into a library or archival workflow, how to protect collections that require additional levels of security, which algorithms are used to verify file fixity and how they differ, and other aspects of cryptographic technology. Oh, and please note that all points in this talk will be emphasized or lightly performed through Taylor Swift lyrics. Seriously, this talk will consist of at least 50% Taylor Swift. Can you, like, even?
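
For readers who have not computed a checksum before, here is a minimal sketch of a fixity check using Python's standard hashlib module. It is an illustration only, not material from the talk: the function names file_checksum and verify_fixity, the sample file name, and the choice of SHA-256 as the default algorithm are all assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks.

    Hypothetical helper for illustration; any algorithm name that
    hashlib.new() accepts (e.g. "md5", "sha1", "sha256") can be used.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected, algorithm="sha256"):
    """Return True if the file still matches a previously recorded checksum."""
    return file_checksum(path, algorithm) == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = Path(workdir) / "sample.txt"  # hypothetical file
        sample.write_bytes(b"original content")

        # Record the checksum at ingest time.
        recorded = file_checksum(sample)
        print(verify_fixity(sample, recorded))  # True: file is unchanged

        # Simulate a change in storage, then check again.
        sample.write_bytes(b"altered content")
        print(verify_fixity(sample, recorded))  # False: file has changed
```

In a real workflow the recorded value would be stored alongside the object (in a manifest or a metadata record) and rechecked on a schedule; where and how it is stored is an institutional decision that this sketch does not address.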

Level Up Your Coding with Code Club (yes, you can talk about it)

Coral Sheldon-Hess, coral@sheldon-hess.org

Reading code is a necessary part of becoming a better developer. It gives you more experience and more insight into How Things Are (or Aren't) Done; it builds your intuition about how to solve problems with code; and it increases your confidence that you, too, can tackle whatever technological problems you're facing.

But you don't have to read code alone! (Which is good. It's really not fun to read code alone.)

In late 2014, a group of librarians formed two Code Clubs, inspired by this talk by Saron (of Bloggytoons fame). I'd like to tell you about how we've structured our Code Clubs, what has gone well, what we've learned, and what you need to do to form your own Code Club. I'll share a list of the codebases we've looked at, too, to help you get your own Code Club off the ground!

The Growth of a Programmer

Joshua Gomez, Getty Research Institute, jgomez@getty.edu

Just like people in other creative endeavors, software developers can experience periods of great productivity or find themselves in a rut. After contemplating the alternating periods in my own career, I've noticed several factors that have affected my own professional growth and happiness, including mentorship, structure, community, teamwork, environment, and formal education. Not all of the factors need to be present at all times, but some mixture of them is critical for continued growth. In this talk, I will articulate these factors, discuss how they can affect a developer's career, and explain how they can be sought out when missing. This talk is aimed at both new developers looking to strike their own path and the veterans who lead or mentor them.

Developing a Fedora 4.0 Content Model for Disk Images

Matthew Farrell, matthew.j.farrell@duke.edu, Duke University Libraries
Alexandra Chassanoff, achass@email.unc.edu, BitCurator Access Project Manager

As the acquisition of born-digital materials grows, institutions are seeking methods to facilitate easy ingest into their repositories and provide access to disk images and files derived or extracted from disk images. In this session, we describe our development of a Fedora 4.0 content model for disk images, including acceptable image file formats and the rationale behind those choices. We will also discuss efforts to integrate the disk image content model into the BitCurator Access environment. Unlike generalized, format-agnostic content models, which might treat the disk image as a generic bitstream, a content model designed for disk images enables expression of relationships among associated content in the collection, such as files extracted from images and other born-digital and digitized material associated with the same creator. It also enables capture of file-system attributes such as file paths, timestamps, and whether files are allocated or deleted. Finally, a disk image content model suggests further steps repositories can take to transform and re-use associated metadata generated during the creation and forensic analysis of the disk image.
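
To make the contrast with a generic bitstream concrete, the kind of information described above can be sketched as a small data structure. This is not the presenters' content model and not the Fedora 4 API: every class, field, and value below (DiskImage, ExtractedFile, image_format, the sample identifier and creator) is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ExtractedFile:
    """One file recorded from a disk image (hypothetical structure)."""
    path: str                      # file path inside the image's file system
    modified: Optional[datetime]   # file-system timestamp, if one was recorded
    allocated: bool                # False for a deleted (unallocated) file


@dataclass
class DiskImage:
    """A disk image plus its relationships to derived content (hypothetical)."""
    identifier: str
    image_format: str              # e.g. "raw"; acceptable formats are a policy choice
    creator: str                   # links the image to other material by the same creator
    files: List[ExtractedFile] = field(default_factory=list)

    def deleted_files(self) -> List[ExtractedFile]:
        """Return the files marked as deleted in the image's file system."""
        return [f for f in self.files if not f.allocated]


if __name__ == "__main__":
    image = DiskImage(identifier="img-0001", image_format="raw", creator="Example Creator")
    image.files.append(ExtractedFile("/docs/draft.txt", datetime(1998, 5, 1), True))
    image.files.append(ExtractedFile("/docs/old_draft.txt", None, False))
    print([f.path for f in image.deleted_files()])  # ['/docs/old_draft.txt']
```

A format-agnostic model would hold only the identifier and the bytes; the extra fields here are what make questions such as "which deleted files came from this creator's disks?" answerable.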

Data acquisition and publishing tools in R

Scott Chamberlain, scott@ropensci.org, rOpenSci/UC Berkeley

R is an open source programming environment that is widely used among researchers in many fields. R is powerful because it's free, increasingly robust, and facilitates reproducible research, an increasingly sought-after goal in academia. Although tools for data manipulation/visualization/analysis are well developed in R, data acquisition and publishing tools are not. rOpenSci is a collaborative effort to create the tools necessary to complete the reproducible research workflow. This presentation discusses the need for these tools and gives examples, including interacting with the repositories Mendeley, Dryad, DataONE, and Figshare. In addition, we are building tools for searching scholarly metadata and acquiring full text of open access articles in a standardized way across metadata providers (e.g., Crossref, DataCite, DPLA) and publishers (e.g., PLOS, PeerJ, BMC, Pubmed). Last, we are building out tools for data reading and writing in Ecological Metadata Language (EML).