Changes

2014 Prepared Talk Proposals

2,213 bytes added, 20:18, 8 November 2013

no edit summary

'''To Propose a Talk'''

* Log in to the wiki in order to submit a proposal. If you are not already registered, follow the instructions to do so.

* Provide a title and brief (500 words or fewer) ~~description~~ description of your proposed talk.

* If you so choose, you may also indicate when, if ever, you have presented at a prior Code4Lib conference. This information is completely optional, but it may assist us in opening the conference to new presenters.

Many of us are using the excellent Lucene library (or SOLR appliance) to provide search functionality. These systems contain number of features to adjust relevancy ranking of hits, but we may not know how to use them. In this presentation, I'll present the available options - eg. what is the default ranking 'Vector space model, what are the alternatives (eg. BM25) and what are the other options we have to tweak and adjust the ranking of the hits (eg. boost factors, functions). But even if we know how to deploy these adjustments and tweaks, we are still left in dark. We do not know whether the change we've just rolled out had a significant (statistically significant) effect or maybe it was just a waste of time and resources? A/B testing is one option, but there may be a much better one - so called "Multi-Armed Bandits Approach". And in this talk I'd like to show how we are experimenting with this strategy to adjust [http://labs.adsabs.harvard.edu/adsabs/ ADS search engine].

== Building Worker Queues with AWS and Resque ==

* Eric Rochester [http://scholarslab.org Scholars' Lab], erochest@virginia.edu

* Scott Turnbull [http://aptrust.org/ Academic Preservation Trust], scott.turnbull@aptrust.org

A common task in larger systems is to be able to process large input files automatically. Often users can drop those files into a shared directory on AWS or on NFS or another shared drive. Those files need to be processed and potentially integrated into a system. This task has come up recently in the University of Virginia libraries in allowing users to add GIS data to the system and in setting up a system for the Academic Preservation Trust (http://aptrust.org/) that ingests files and resources into the preservation system.

This system is built by loosely coupling a number of different technologies. This allows us to easily interoperate and communicate between different system and programming environments. Because the interfaces are well defined, it’s also fairly simple to switch out technologies as the requirements of the system change.

The process is fairly simple:

First, a Ruby daemon monitors an AWS S3 bucket that others can upload new files into. This daemon creates a Resque status task, adds a marker for the task in a database, and continues monitoring.

Second, Resque mediates incoming job requests and routes them to the appropriate workers which may be in Java, Go, or Ruby. The diversity of technologies that Resque can manage allows great latitude to leverage the appropriate tool for a specific job. While processing, it updates the status for that job and coordinates processing with other jobs.

Finally, a page that is integrated into a larger Rails app provides a novice-user-friendly view of the status of the workers and allows basic tasks such as restarting the job.

This architecture allows us to swap in the technology that best fits each part of the process, and it makes it easier to maintain the system. We use this to integrate and coordinate between tasks handled in Java, Ruby, and Go, and it provides an effective way to interoperate with these programming languages and the respective strengths that they bring to this system.

[[:Category:Code4Lib2014]]

EricRochester

edit