2014 Breakout II (Wednesday)
- 1 UX
- 2 Securing EZproxy
- 3 Tech services
- 4 AngularJS
- 5 BIBFRAME 2 & Linked Data
- 6 Unusual searches & long searches
- 7 ResCarta
- 8 OCLC institution RDF project
- 9 Digital Preservation
Notes by @erinrwhite again. Y'all cannot escape me
NCSU's UX department is cross-functional and has members from across departments. Looking at creating cross-channel experiences from digital to real life. Working on consistency across experiences. Expanded on UMich's UX department to create a UX research team.
Research: NCSU does a research project every month. NCSU is also training new library fellows to infuse User Experience work into their projects. Growing the culture of UX within the organization.
How do you work in harmony with a dev team when sometimes the UX team can be the roadblock to development? Need to get a workflow that works so that everyone can move quickly.
UXing web pages vs. entire web applications: they're totally different experiences so need different approaches to user experience evaluation.
Guerrilla research: go out into the public spaces of your library to test prototypes or design ideas. Make it quick. User research doesn't have to be a huge deal.
If you can't give money as remuneration, give 'em candy bars. But make the candy bars full-size, not the minis.
Librarians are users too...right?
How do we push back against librarians' assertions that pages/interfaces should look a certain way?
Research with users can *sometimes* help.
Need to communicate your evidence to your library. UT hired someone last year just to do IT communication (!).
Numbers don't always work. Need a visual tool if possible (e.g. a heatmap). If you can compile a video or audio of user interviews or usability testing, that can be very powerful.
Recommendation: 37Signals' book Getting Real on helping choose things that are/aren't important and moving on.
Publish your damn work!
As a community, we need to get better about sharing our work with each other so we don't have to keep reinventing the wheel.
(Mag II) At the request of several individuals I'm keeping actual names of individuals involved in the discussion private. If you contribute or add to these notes, feel free to identify yourself, but please don't identify others
There was a mixture of general discussion on ezproxy, as several individuals haven't had experience with any security issues but came to the breakout hoping to learn more about ezproxy. The notes below are a bit disjointed for that reason.
Started some general discussion based off of recent issues an institution had with a large number of compromised accounts being used by two different organizations.
Organization 1: a website called scihub.org that acts as a web search/proxy itself and rotates through compromised accounts to fetch articles.
Organization 2: a group based in China that seems to be employing actual humans to go and download lists of contents. (Downloading happens during Chinese business hours, there's no traffic on Chinese holidays, etc.)
Some approaches taken by this University have been:
- rsyslog the ezproxy logs w/ campus IT. The campus IT runs the Shibboleth instance, which feeds into a Splunk system with several other inputs including VPN and machine usage. Allows for detection of compromised accounts logging in from different areas of the world near the same time. Also allows more folks to aid in detecting compromises.
- x-forwarded-for turned on with cooperating vendors. This allowed them to spot ip addresses used multiple times w/ a proxy.
- some ip address of compromised accounts.
VPN blocking was brought up, but not many people are practicing it. There have been issues with organizations using compromised accounts of other schools to make tracking the origin IP harder. There's a github project for this (I believe the person was talking about https://github.com/bertrama/ezproxy-config/blob/master/reject-ip-vpn).
Some folks are looking for excessive downloads via scripts. Also, the same username showing up from different IP addresses at roughly the same time is flagged as suspicious behavior / a possibly compromised account.
Also common to have campus IT looking for multiple logins from same spot.
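A rough sketch of the "same username, different IPs, roughly the same time" check described above. The `(timestamp, username, ip)` tuple layout and the 30-minute window are assumptions for illustration, not EZproxy's or Splunk's actual format; a real setup would likely add GeoIP lookups, since different IPs alone don't prove different parts of the world.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_concurrent_logins(events, window=timedelta(minutes=30)):
    """Return {username: set_of_ips} for users seen from 2+ IPs within `window`.

    `events` is an iterable of (timestamp, username, ip) tuples. This layout
    is hypothetical; adapt the parsing to whatever your logs contain.
    """
    by_user = defaultdict(list)
    for ts, user, ip in events:
        by_user[user].append((ts, ip))

    flagged = {}
    for user, logins in by_user.items():
        logins.sort()  # chronological order
        for i, (ts_a, ip_a) in enumerate(logins):
            for ts_b, ip_b in logins[i + 1:]:
                if ts_b - ts_a > window:
                    break  # sorted, so every later login is even further away
                if ip_a != ip_b:
                    flagged.setdefault(user, set()).update({ip_a, ip_b})
    return flagged


sample = [
    (datetime(2014, 3, 26, 9, 0), "alice", "192.0.2.10"),
    (datetime(2014, 3, 26, 9, 5), "alice", "198.51.100.7"),
    (datetime(2014, 3, 26, 9, 0), "bob", "192.0.2.20"),
    (datetime(2014, 3, 26, 15, 0), "bob", "203.0.113.5"),
]
suspects = flag_concurrent_logins(sample)
```

Here `alice` is flagged (two IPs five minutes apart) and `bob` is not (six hours apart). Flagged accounts are only candidates for a human to review.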
Some more general discussion on maintaining ezproxy came up:
Several schools have web interfaces for specifying resources which then create the config files automatically. A backup is made of the last config file. Allows librarians to edit without having IT be a bottleneck.
Most common problem is actually issues w/ space due to logs. Political/cost issue more than technical. Logrotate w/ bzip2 can help, but will hinder analysis. Retention policy is important.
One school keeps logs for 90 days to aid with detection from vendor complaints, then anonymizes/sanitizes them to allow analysis without individual details.
Several schools never include identifying information in logs, for various political and privacy concerns (linking identity w/ URLs searched). This makes investigating compromises more difficult than necessary.
Some questions on how difficult it is to maintain EZproxy. General consensus is pretty easy. Will lock up and require restarts, but pretty rarely. At least one school uses version control w/ the config files to make it easy to roll back from mistakes. One place is also using puppet to push out some of these files.
Alumni configuration is a common issue; most folks seem to break alumni apart into their own group and then either have local logins or a shibboleth attribute.
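One way the sanitizing step might look: replace each username with a keyed hash so sessions by the same person can still be grouped, without the name appearing in the file. This is a sketch only. It assumes a space-separated, NCSA-style line with the username in the third field, which may not match your LogFormat, and `secret_key` is a hypothetical value you would keep out of the log archive.

```python
import hashlib
import hmac


def pseudonymize(username, secret_key):
    """Return a stable, non-reversible token for `username` (keyed SHA-256)."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def sanitize_line(line, secret_key, user_field=2):
    """Replace the username field of one log line with its pseudonym.

    `user_field` is the zero-based position of the username; the default
    assumes "host ident user ..." ordering (an assumption, check your logs).
    """
    fields = line.split(" ")
    if len(fields) > user_field and fields[user_field] != "-":
        fields[user_field] = pseudonymize(fields[user_field], secret_key)
    return " ".join(fields)


key = b"replace-with-a-real-secret"
raw = '192.0.2.10 - alice [26/Mar/2014:09:00:00 -0400] "GET http://example.org/ HTTP/1.1" 200 512'
clean = sanitize_line(raw, key)
```

Note that the IP address and URL are left alone here; depending on local policy those may need the same treatment.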
Several schools using shibboleth.
Some other points that came up:
Also important: making sure that the entire stack of the authentication/authorization is properly protected, harder to even trust inside of the network.
Some discussion also came up on password policies (or what to try to get campus IT to enforce):
- Make sure strong passwords enforced
- Make sure new passwords are checked for similarity to previous passwords, to avoid an easily guessable password once an account is already compromised
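A minimal sketch of a "too similar to the previous password" check using Python's standard `difflib`. It assumes the old password is available in plain text at change time (for example, because the user just typed it into the change form); stored hashes can't be compared this way. The `0.7` threshold is an arbitrary illustration, not a recommendation.

```python
from difflib import SequenceMatcher


def too_similar(old_password, new_password, threshold=0.7):
    """Return True if the new password looks like a small edit of the old one."""
    a, b = old_password.lower(), new_password.lower()
    # One password simply containing the other is an easy guess.
    if a and b and (a in b or b in a):
        return True
    # ratio() is 1.0 for identical strings and near 0.0 for unrelated ones.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

With these settings `too_similar("Spring2014!", "Spring2015!")` is flagged, while an unrelated passphrase is not.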
We shared projects, challenges, and areas of interest
- Linked data for acquisitions info and the Global Open Knowledgebase
- Changing roles for catalogers -- description of unique resources, data extraction and manipulation, linked data
- ILS migrations
- Managing multiple systems and silos (ERM, ILS, ERP, archives)
- Managing DDA (demand driven acquisitions)
- Skills for catalogers -- computational thinking
- Trends toward fewer professional librarians in tech services
- Accepting ambiguity
- CORAL open source ERM
A few discussion topics emerged
- We use different ones -- mostly we get whatever our IT department already has
- Helpful for representing electronic resources -- there's no physical presence to remind you to do the work
- Helpful for metrics
- One barrier to use can be training others to use the systems rather than contacting an individual directly
- A lot of people have to be involved -- collections, tech services, IT
- Duplicate records between existing collection and DDA records -- we don't always realize where duplication exists
- People want to be able to activate DDA records in their e-resources knowledgebase -- ideally we'd have our book jobber help with updating the kb
- Important to have a good vendor rep
- We weren't able to understand the entire process at the start -- every step was like a new discovery
- A challenge is getting quality records and identifying records that need additional work
MARC record services -- how do you evaluate quality?
- For ebooks, they can be really bad
- Many people are using MarcEdit to do batch processing of records, find things that need to be fixed
- Suggested use of regular expressions to pull things out of leader field
- Another common practice is to use various methods to convert MARC records to Excel and look at errors there
- Some of us are using OpenRefine to find problems
- Some of us are becoming more error tolerant, but the cool stuff that people do is dependent on good data
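As an illustration of the regex-on-the-leader suggestion above, here is a small sketch that pulls a few positions out of the 24-character MARC 21 leader (05 record status, 06 type of record, 07 bibliographic level, 17 encoding level). The choice of which encoding levels to flag in `needs_review` is an assumption for the example; pick whatever matches local practice.

```python
import re

# MARC 21 leader: 24 characters, positions 00-23.
LEADER_RE = re.compile(
    r"^(?P<length>\d{5})"      # 00-04 record length
    r"(?P<status>.)"           # 05 record status
    r"(?P<type>.)"             # 06 type of record
    r"(?P<bib_level>.)"        # 07 bibliographic level
    r".{9}"                    # 08-16 not examined here
    r"(?P<encoding_level>.)"   # 17 encoding level
    r".{6}$"                   # 18-23 not examined here
)


def leader_fields(leader):
    """Return the named leader positions as a dict, or None if malformed."""
    match = LEADER_RE.match(leader)
    return match.groupdict() if match else None


def needs_review(leader, suspect_levels=frozenset("3578")):
    """Flag malformed leaders and encoding levels in `suspect_levels`.

    The default set (abbreviated, partial, minimal, prepublication) is a
    hypothetical local policy, not a standard.
    """
    fields = leader_fields(leader)
    return fields is None or fields["encoding_level"] in suspect_levels
```

For example, `needs_review("01142nam a22002777a 4500")` is true because position 17 is `7` (minimal level), while a full-level leader such as `"00714cam a2200205 a 4500"` passes.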
- IRC - #angularjs freenode
Modules, Tools, Features
Misc Resources mentioned and more-or-less related to Angular:
- http://firebase.com/docs/angular/ (cloud back end)
- Other "No Backend" solutions: http://nobackend.org/solutions.html
- http://emberjs.com/ - okay, not angular
BIBFRAME 2 & Linked Data
Notes based off Tweets made by group during the session:
Starting comment: This is the year of testers, early implementers, of BIBFRAME.
- How to get from MARC to BIBFRAME? Request to explain tools, scripts.
- Issues with linking different ontologies to build linked data networks, SKOS brought up, being discussed
- How do people feel about the concept of event? being discussed
- When do we have to switch? When will the vendors build applications in BIBFRAME so then libraries can follow?
- How far does Bibframe extend, and when do you say this is no longer Bibframe's job?
- Explain Place, dates, agents as three attributes?
- Discussing expressions versus works (making expression into relationships)
- Model: Works, Instances, with relationships between Works that has expression
- Brief mention of Named entity extraction work for finding these attributes
- What happens when you link to an ontology, then it changes? URIs play the important role here.
- VIVO Project shout out! http://t.co/OK9DLJXVWw
- Example of a collection put through BIBFRAME from A&M http://t.co/K4vnwdczAO
- A group member who transcribed records from MARC to BIBFRAME had their *SERIES* records come out correctly
- Variances in cataloging practices will also be a huge issue for transcribing records
- Member of group on experiences w/ transcribing MARC to BIBFRAME records: tried two tools, and they didn't give the same output
- Battle lines have been drawn: discussing Dublin Core and its simplicity (good? bad?)
- Music cataloging in FRBR and BIBFRAME being discussed now - diving into the deep end
- Locally, we choose, self-select ontologies we need. But if we want more exposure for data, need to explain, share.
- Going from catalogers to metadata librarians at an institutional level, trying to start retraining people now.
- Creating an entire ontology for all of human history would be overwhelming
- Response to this concern about ontologies: 'But an ontology is domain knowledge, it takes multiple domains/ontologies to cover all of human history'
Unusual searches & long searches
This group met to talk about unusual searches, especially extremely long searches, copied and pasted citations, and other issues related to serving niche searches.
Some of the possible solutions include:
- Look for DOI, ISBN, or other identifiers in the query, extract them, and make the request to a service using these IDs.
- Remove extraneous characters from the beginning of a string that may indicate copied and pasted text.
- Truncate a long query at a certain character length (80 to 100?) assuming that the most useful text appears at the start of the query.
- Use a regex to identify a citation by detecting some combination of words commonly used in citations (Vol., Iss., pp.), four digit years, and other combinations of numbers.
- It would be useful to test this regex against a search corpus to check for false matches.
- Once a citation is identified, either certain characters could be removed from the query or a citation parser such as Brown's FreeCite could be used.
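The identifier and citation ideas above could be sketched roughly like this. The patterns are illustrative guesses, not something the group tested: the DOI pattern is a loose approximation, ISBN candidates are kept only if their check digit validates, and the citation check just counts how many citation-like hints appear. As noted, it would need testing against a real search corpus for false matches.

```python
import re

DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ISBN_CANDIDATE_RE = re.compile(r"\b(?:\d[- ]?){9,12}[\dXx]\b")

CITATION_HINTS = [
    re.compile(r"\b(?:vol|iss|pp)\.", re.IGNORECASE),  # Vol., Iss., pp.
    re.compile(r"\b(?:19|20)\d{2}\b"),                  # four-digit year
    re.compile(r"\b\d+\s*\(\d+\)"),                     # volume(issue), e.g. 12(3)
    re.compile(r"\b\d+\s*[-\u2013]\s*\d+\b"),           # page range, e.g. 34-56
]


def _valid_isbn(s):
    """Check the ISBN-10 or ISBN-13 check digit of a separator-free string."""
    if len(s) == 10 and s[:9].isdigit():
        total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
        return total % 11 == 0
    if len(s) == 13 and s.isdigit():
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
        return total % 10 == 0
    return False


def extract_identifiers(query):
    """Return DOIs and checksum-valid ISBNs found in a search query."""
    dois = [m.rstrip(".,;)") for m in DOI_RE.findall(query)]
    isbns = []
    for candidate in ISBN_CANDIDATE_RE.findall(query):
        normalized = re.sub(r"[- ]", "", candidate).upper()
        if _valid_isbn(normalized):
            isbns.append(normalized)
    return {"doi": dois, "isbn": isbns}


def looks_like_citation(query, min_hints=2):
    """Heuristic: True if at least `min_hints` citation-like patterns match."""
    return sum(1 for p in CITATION_HINTS if p.search(query)) >= min_hints
```

A query that yields an identifier can be routed straight to a resolver or known-item lookup; one that only `looks_like_citation` could be cleaned up or handed to a citation parser.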
Other things noted:
- If you truncate a query don't truncate in the middle of a word or else recall may be worse.
- Log queries that return zero hits as a way to find types of queries that may need some post-processing.
- Is there a way to provide smarter, live results for things such as library hours, similar to the way Google provides live flight tracking information directly in the results list?
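A small sketch of truncating a long query without cutting a word in half, per the note above. The 100-character default comes from the range floated in discussion; it is a starting point, not a tested value.

```python
def truncate_query(query, max_chars=100):
    """Shorten `query` to at most `max_chars`, cutting only between words."""
    query = " ".join(query.split())  # collapse runs of whitespace
    if len(query) <= max_chars:
        return query
    # If the cut lands exactly at the end of a word, keep everything before it.
    if query[max_chars] == " ":
        return query[:max_chars]
    head, sep, _ = query[:max_chars].rpartition(" ")
    # With no space to fall back on (one very long token), cut hard.
    return head if sep else query[:max_chars]
```

For example, `truncate_query("the quick brown fox jumps", 12)` gives `"the quick"` rather than `"the quick br"`.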
We gathered in the ballroom and had an active conversation about the philosophy of keeping archives in a reduced set of file formats with standardized metadata. We reviewed directory structures and METS collection level details. For a future reduction of coding and costs we advise the reduction of file formats (normalization) on ingestion into a structured archive.
Justin from Artefactual shared their philosophy and thoughts on use of METS collection level file contents.
Historically systems like NDNP are gate keeper validation systems and we should be building digital archive creation systems. Build to a standard under code control rather than code to check hand made datasets.
OCLC institution RDF project
Cost issues, billing departments, charging grant projects one-time vs. multiple
Internal vs. external hosting
Trusted Digital Repository, TRAC, ISO standard
Geographic distribution, what does that actually mean
? who is using checksums and how often they are verifying
UNC - make sure checksums checked every quarter, throttle/stagger checking
? Has anyone had checksum checks fail?
only time is user error, checking wrong one, files are changed after initial checksum
video - frame-level checksum, part of ffmpeg, make frame level information and checksum that
? how much code/time is done to check on problems with checksums?
manual vs. auto repair, prefer manual intervention
how often to check tapes, without further damaging tape
for testing, there's a tool that will flip bits
disaster recovery testing
hesitance to test/break files on production
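For anyone wanting to try fixity checking and bit-flip testing away from production, a minimal sketch follows. SHA-256 and the `{path: digest}` manifest layout are assumptions; the notes don't say which algorithm or tooling anyone uses. `flip_bit` is a stand-in for the kind of tool mentioned above and should only ever be pointed at a scratch copy.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest, reading in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest):
    """Return the paths whose current digest is missing or differs.

    `manifest` maps file path -> expected hex digest (hypothetical layout).
    """
    failures = []
    for path, expected in manifest.items():
        if not os.path.exists(path) or sha256_of(path) != expected:
            failures.append(path)
    return failures


def flip_bit(path, byte_offset=0, bit=0):
    """Flip one bit in a non-empty file, in place. For test copies only."""
    with open(path, "r+b") as f:
        f.seek(byte_offset)
        original = f.read(1)
        f.seek(byte_offset)
        f.write(bytes([original[0] ^ (1 << bit)]))
```

To stagger checks across a quarter, one option is to call `verify` on a different slice of the manifest each night rather than on everything at once.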
ZFS, self-healing filesystem, replication (worried about replicating checksum errors)
? about viruses, malicious scripts
UNC runs ClamAV on everything, and makes sure everyone is an authorized user
AV Artifact Atlas - visual glossary of damage types to a/v files
tape backup of everything can take too long to run (days)
rely on multiple copies of objects on disk
format migrations - no one has really done it yet
archivematica wiki is great resource
normalization on ingest
emulation as a service - possible collaboration in community
Major issues for Digital Preservation
- storage (terabytes coming in each year, no cost-effective solutions for growing needs)
- staffing (for smaller institutions)
- funding model/sustainability (some charge for services, some funding by Campus IT)
- research data, grants, data management planning tool
- how long can we offer to store files
- trying to convince Provost that library storage is like library shelf space and needs to be funded
- split funding, from graduate schools or president's office
- some work on service level agreements, tiers of service
- file retrievals may not be tracked anywhere, if so can't tell what hasn't been retrieved
NDSA Levels of Preservation - http://www.digitalpreservation.gov/ndsa/activities/levels.html