Latest revision as of 13:02, 25 October 2018
Code4Lib Montreal 2018-10-23
Attendees
- Chris Trudeau - recent McGill SIS graduate
- Martin ?? - Health Sciences liaison, McGill
- Stephana Bretweiser - CCA
- Tim Walsh - Digital Preservation librarian, Concordia
- John ?? - Digital Archivist, Concordia
- Clara Turp - Metadata Analyst Librarian, McGill
- Jessica Reeve - Senior Electronic Resources
- Tomasz Langenbauer - Digital Projects, Concordia
- Rebecca Nicholson - Web group, McGill
- Eka Grguric - Web Librarian, McGill
- Dan Scott - Systems librarian, Laurentian / McGill student
Mandat du groupe et description / Group's description and mandate
Brief discussion about what the mandate of the group should be:
- Learning about technology and coding through doing; workshops
- Building a community - across universities, colleges, public institutions in Montreal
- Informal
We like https://code4lib.org/about
- Action: Clara will customize the Code4Lib statement so that it reflects a Montreal and bilingual context
Presentations
Sarah Severson: sick, will present conference report from DLF next time
Chris Trudeau: citations to reserves
Idea: instead of faculty emailing the library with their individual requests for items that need to be placed on reserve, why not extract the citations from the course outline / syllabus (in PDF or Word format) and automatically generate reserve requests?
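A minimal sketch of the idea, assuming the syllabus text has already been extracted from the PDF or Word file and that citations are written in an APA-like style ("Surname, I. (Year). Title..."). The function names, the request fields, and the sample citations are made up for illustration; real syllabi would need a much more forgiving parser.

```python
import re

# Heuristic (an assumption, not a standard): an APA-style citation starts with
# "Surname, I." and has a four-digit year in parentheses somewhere after it.
CITATION_RE = re.compile(r"^[A-Z][\w'\-]+,\s+(?:[A-Z]\.\s*)+.*\((?:19|20)\d{2}\)")


def extract_citations(syllabus_text):
    """Return the lines of a syllabus that look like APA-style citations."""
    citations = []
    for line in syllabus_text.splitlines():
        # Drop leading list markers such as "-", "*" or a bullet character.
        candidate = line.strip().lstrip("-*• ").strip()
        if CITATION_RE.match(candidate):
            citations.append(candidate)
    return citations


def to_reserve_request(citation, course_code):
    """Wrap a citation in a hypothetical reserve-request record."""
    return {"course": course_code, "citation": citation, "status": "requested"}


if __name__ == "__main__":
    # Placeholder syllabus with invented citations.
    sample = (
        "Week 1: Introduction (2018)\n"
        "- Doe, J. (2015). An example book title. Example Press.\n"
        "Office hours: Tuesdays\n"
        "Roe, A. B., & Doe, J. (2009). Another made-up title. Sample Journal, 1(2), 3-4.\n"
    )
    for citation in extract_citations(sample):
        print(to_reserve_request(citation, "EXMP 101"))
```

Anything the heuristic misses would still have to be submitted by hand, so a tool like this would complement the existing request channels rather than replace them.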
Feedback
- McGill used to have faculty upload syllabi, but eventually stopped because of resistance ("private information")
- McGill accepts reserve requests in any format: email, in person, paper
- Tomasz built something like this for Concordia in 2009 and is willing to share it. Faculty, however, wanted three ways to submit: upload the entire syllabus, paste in a full citation, or fill out the citation field by field
Tim Walsh, Bulk Reviewer
This is a project Tim started working on while a Harvard Fellow over the summer; the idea is to put forensics tools to work for archives. This requires identifying individual files accurately, rather than the broader "yeah, it looks like there are credit card numbers on this hard drive" answer that forensic investigators are interested in.
- Identifies, reviews, and removes sensitive files in disk images and directories, regardless of file format
- Sensitive info - SSN, credit card numbers, phone numbers, email addresses, internet history, EXIF metadata, GPS data, custom search terms, Windows registry (program install history)
- Built using Django, Vue.js, bulk_extractor, DFXML, and Docker
- bulk_extractor generates text files or a SQLite database that normally get processed into a histogram; Bulk Reviewer instead processes that output to support a Web browser front end and to identify the individual files that may be problematic
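The step from "there are features somewhere on this disk" to "these specific files are affected" can be sketched roughly as follows. This is not Bulk Reviewer's actual code. It assumes feature lines are tab-separated as offset, feature, context, with comment lines starting with "#", and it assumes each file's location is already known as a (filename, start, length) byte run, which in practice would come from something like DFXML. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def parse_feature_lines(lines):
    """Yield (offset, feature) pairs from tab-separated feature lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        # An offset may carry a suffix after a hyphen; keep the leading integer only.
        leading = parts[0].split("-")[0]
        if leading.isdigit():
            yield int(leading), parts[1]


def features_by_file(features, byte_runs):
    """Group features under the file whose byte run contains their offset.

    byte_runs is a list of (filename, start, length) tuples.
    Features that fall outside every byte run are ignored.
    """
    hits = defaultdict(list)
    for offset, feature in features:
        for filename, start, length in byte_runs:
            if start <= offset < start + length:
                hits[filename].append(feature)
                break
    return dict(hits)


if __name__ == "__main__":
    sample_lines = [
        "# made-up feature file\n",
        "100\t123-45-6789\tssn: 123-45-6789 on file\n",
        "5000\tsomeone@example.org\tcontact someone@example.org\n",
    ]
    sample_runs = [("letters/a.txt", 0, 1000), ("drafts/b.doc", 4096, 2048)]
    print(features_by_file(parse_feature_lines(sample_lines), sample_runs))
```

A per-file grouping like this is what a browser front end would need in order to let an archivist review and remove files one at a time.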
Problems
- Many false positives (e.g. every 9-digit number is identified as an SSN); Tim isn't sure that any of these tools produce results with a high level of confidence
- The tooling is all American-centric, so adding a Canadian identifier such as a SIN (Social Insurance Number) requires writing C++ (Tomasz is willing to help!)
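One way to cut down on false positives for SINs is a checksum test: Canadian SINs carry a Luhn check digit, so most arbitrary 9-digit numbers can be rejected. The sketch below is in Python purely for illustration (a real bulk_extractor scanner would be C++, as noted above), and the function names are hypothetical. Passing the check does not prove that a number is a real SIN; it only filters out roughly nine in ten random 9-digit strings.

```python
import re

# Nine digits, optionally grouped in threes with spaces or hyphens.
NINE_DIGITS_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")


def passes_luhn(digits):
    """Return True if a string of digits satisfies the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sin_candidates(text):
    """Return 9-digit numbers in text that pass the Luhn check."""
    results = []
    for match in NINE_DIGITS_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if passes_luhn(digits):
            results.append(digits)
    return results


if __name__ == "__main__":
    # 046 454 286 is a commonly published dummy SIN; 123456789 fails the check.
    print(find_sin_candidates("SIN 046 454 286, order number 123456789"))
```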
Next meeting
- November - Sarah and John to present
- Mid-December - social