C4lMW14 - Code4Lib Journal as epub

2,755 bytes added, 00:40, 24 July 2014
Created git repo: https://github.com/jtgorman/c4l-journal-as-epub
Images are in issue, not w/ article
Runs Wordpress, maybe use Anthologize
 
 
EPub2 Tutorial: http://www.ibm.com/developerworks/xml/tutorials/x-epubtut/index.html
 
Writing ePub3: http://idpf.org/sites/default/files/digital-book-conference/presentations/db2012/DB2012_Liz_Castro.pdf
 
Dan Scott's suggestion: make it sustainable by building on top of
http://wiki.code4lib.org/index.php/Code4Lib_Journal_WordPress_Customizations
$ zip -Xr9Dq my-book.epub *
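
The zip command above covers only the recursive step. The EPUB container format also expects a `mimetype` entry as the first file in the archive, stored without compression. A minimal Python sketch of the whole packaging step follows; the function name `build_epub` and the directory layout are assumptions for illustration, not part of the project.

```python
import os
import zipfile


def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path, writing the mimetype entry first.

    Hypothetical helper: epub_path should live outside source_dir so the
    archive does not try to include itself.
    """
    with zipfile.ZipFile(epub_path, "w") as epub:
        # "mimetype" goes first and is stored, not deflated.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()  # deterministic traversal order
            for name in sorted(files):
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written above
                epub.write(full_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Calling `build_epub("my-book", "my-book.epub")` would then play the role of the zip invocation, with the mimetype handling built in.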
 
 
Pandoc (uses the Haskell Platform) http://johnmacfarlane.net/pandoc/installing.html
 
Wordpress w/ Pandoc? https://blogs.aalto.fi/blog/epublishing-with-pandoc/
 
 
NATURAL LANGUAGE:
 
For an issue, create .ncx / .end files from the issue index, <spine /> and <manifest /> in .opf
 
Save HTML output for each article, index, list in <manifest />, .ncx / .end
 
Sort into folder for relationships
 
Zip, rename .epub, save to download
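
As a rough illustration of the `<manifest />` and `<spine />` step, here is a minimal Python sketch that writes an EPUB 2 style .opf from an ordered list of saved article files. The function name `build_opf`, the `toc.ncx` file name, the item ids, and the fixed `en` language are all assumptions made for the example; a real issue would also need manifest entries for images and stylesheets.

```python
from xml.sax.saxutils import escape, quoteattr


def build_opf(title, identifier, articles):
    """Return an OPF package document (EPUB 2 style) as a string.

    articles: saved article file names in reading order (hypothetical
    names such as "article1.html").
    """
    manifest = ['<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>']
    spine = []
    for number, href in enumerate(articles, start=1):
        item_id = "article%d" % number
        manifest.append('<item id="%s" href=%s media-type="application/xhtml+xml"/>'
                        % (item_id, quoteattr(href)))
        spine.append('<itemref idref="%s"/>' % item_id)

    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">',
        '  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">',
        '    <dc:title>%s</dc:title>' % escape(title),
        '    <dc:identifier id="bookid">%s</dc:identifier>' % escape(identifier),
        '    <dc:language>en</dc:language>',
        '  </metadata>',
        '  <manifest>',
    ]
    lines += ['    ' + item for item in manifest]
    lines += ['  </manifest>', '  <spine toc="ncx">']
    lines += ['    ' + ref for ref in spine]
    lines += ['  </spine>', '</package>']
    return "\n".join(lines) + "\n"
```

The spine order follows the order of the list passed in, so feeding it the article order from the issue index keeps the reading order of the issue.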
 
 
http://codex.wordpress.org/XML-RPC_Support
 
 
https://wordpress.org/plugins/demomentsomtres-wp-export/
 
Creating ePub with image files
 
On an article page, save the page as a local file (journal.htm, in this example).
This saves the content file as well as the image files.
Then run this command:
pandoc -f html -t epub --toc -o journal.epub journal.htm
This generates a journal.epub file with images.
 
Idea came from: https://blogs.aalto.fi/blog/epublishing-with-pandoc/
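
Pandoc accepts several input files and concatenates them, so the same command could in principle cover a whole issue. The Python sketch below only assembles and runs that command; the function names and the file names in the usage note are hypothetical, and it assumes pandoc is installed and on the PATH.

```python
import subprocess


def pandoc_epub_command(html_files, epub_path):
    """Build the pandoc argument list used above, for one or more saved pages."""
    return ["pandoc", "-f", "html", "-t", "epub", "--toc",
            "-o", epub_path] + list(html_files)


def convert_issue(html_files, epub_path):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_epub_command(html_files, epub_path), check=True)
```

For example, `convert_issue(["article1.htm", "article2.htm"], "issue1.epub")` would attempt to produce one epub from two saved article pages.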
 
 
 
Jon's quick & crazy hack...
get_links.xslt
 
 
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
 
<xsl:template match="fullTextUrl">
<xsl:value-of select="." /><xsl:text>
</xsl:text>
</xsl:template>
 
<xsl:template match="text()" />
 
</xsl:stylesheet>
 
 
 
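# Fetch the DOAJ feed for issue 1 and save it as toc.xml
wget -O toc.xml http://journal.code4lib.org/issues/issue1/feed/doaj
# Mirror each article page one level deep, converting links for local viewing
xsltproc get_links.xslt toc.xml | xargs -I{} wget -r -l 1 --no-parent -k {}
# Crawl again without --no-parent, keeping only image files
xsltproc get_links.xslt toc.xml | xargs -I{} wget -r -l 1 -A jpg,jpeg,png,gif -k {}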
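
For comparison, the link-extraction step of the stylesheet can be sketched in Python. This is an illustration under assumptions, not the code in the repo: the function name `full_text_urls` is made up, and unlike the XSLT (which matches only an un-namespaced `fullTextUrl`) it compares local names, so it would also match a namespaced element.

```python
import xml.etree.ElementTree as ET


def full_text_urls(feed_xml):
    """Return the text of every fullTextUrl element in a DOAJ feed string."""
    root = ET.fromstring(feed_xml)
    urls = []
    for element in root.iter():
        # Strip any "{namespace}" prefix and compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "fullTextUrl" and element.text:
            urls.append(element.text.strip())
    return urls
```

The returned list could then be handed to whatever download step is used, in place of the `xsltproc | xargs` pipe.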
 
Summary
======
 
Unfortunately we didn't get a WordPress VM set up in time that would emulate the settings of the journal.code4lib.org site.
 
We looked at a couple of plugins, but all of them looked like they would still require several manual steps (the goal would be for every new issue to be released as an epub automatically).
 
Downloading the page via save-as and converting it with Calibre did a decent job, but is awkward.
 
XML-RPC seems to require a username + password, but might be feasible.
 
The problem with most scraping programs (wget was mainly used, although some sites seem to advocate for HTTrack) is
* the list of links on the left-hand side to other issues gets followed
* the images are stored at paths unrelated to the paths the posts are on
 
So if you scrape the page and restrict the crawl to just that level and lower, you don't get the images, but otherwise you get much more than you want. And it's still largely clumsy and not automated.
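
One possible way around the image-path problem, sketched here in Python and not something that was tried at the time, is to read the image URLs out of each saved article and fetch exactly those, so the crawl never has to leave the article's own path. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs for every <img src="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the article URL.
            url = urljoin(self.base_url, src)
            if url not in self.image_urls:
                self.image_urls.append(url)


def image_urls(html, base_url):
    """Return the de-duplicated image URLs found in html, in page order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.image_urls
```

Each URL in the result could then be downloaded individually, which avoids both the sidebar links and the broad recursive crawl.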
 
- Summary added by Jon Gorman after the fact....