C4lMW14 - Code4Lib Journal as epub

Latest revision as of 00:40, 24 July 2014

Useful information:


Created git repo https://github.com/jtgorman/c4l-journal-as-epub

Images are stored with the issue, not with the article

The journal runs on WordPress; maybe use Anthologize



http://wiki.code4lib.org/index.php/Code4Lib_Journal_Entries_in_Directory_of_Open_Access_Journals

EPub3: http://www.ibm.com/developerworks/library/x-richlayoutepub/

EPub2 Tutorial: http://www.ibm.com/developerworks/xml/tutorials/x-epubtut/index.html

Writing ePub3: http://idpf.org/sites/default/files/digital-book-conference/presentations/db2012/DB2012_Liz_Castro.pdf

Dan Scott's suggestion: make it sustainable by building on top of http://wiki.code4lib.org/index.php/Code4Lib_Journal_WordPress_Customizations


Zipping the EPUB (the mimetype file must be the first entry, stored uncompressed):

$ zip -0Xq my-book.epub mimetype

$ zip -Xr9Dq my-book.epub *


Pandoc (uses the Haskell Platform) http://johnmacfarlane.net/pandoc/installing.html

Wordpress w/ Pandoc? https://blogs.aalto.fi/blog/epublishing-with-pandoc/


NATURAL LANGUAGE:

For an issue, create the .ncx / .end files from the issue index, and the <spine /> and <manifest /> in the .opf

Save the HTML output for each article and the index; list them in <manifest /> and the .ncx / .end

Sort the files into folders matching their relationships

Zip, rename to .epub, make available for download
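The manifest/spine step above can be sketched locally, with no downloading: given a list of saved article files, emit the <manifest /> and <spine /> fragments for content.opf. The file names and id values here are invented for illustration.

```shell
# Generate <manifest/> and <spine/> fragments for a list of article files.
articles="article1.xhtml article2.xhtml"   # hypothetical saved HTML output
{
  echo '<manifest>'
  i=0
  for a in $articles; do
    i=$((i+1))
    printf '  <item id="art%d" href="%s" media-type="application/xhtml+xml"/>\n' "$i" "$a"
  done
  echo '</manifest>'
  echo '<spine toc="ncx">'
  i=0
  for a in $articles; do
    i=$((i+1))
    printf '  <itemref idref="art%d"/>\n' "$i"
  done
  echo '</spine>'
} > opf-fragment.xml
```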


http://codex.wordpress.org/XML-RPC_Support


https://wordpress.org/plugins/demomentsomtres-wp-export/

Creating ePub with image files

For a single article: save the article page as a local file (journal.htm in this example), which saves the content file as well as the image files. Then run: pandoc -f html -t epub --toc -o journal.epub journal.htm. This generates a journal.epub file with the images included.

Idea came from: https://blogs.aalto.fi/blog/epublishing-with-pandoc/


Jon's quick & crazy hack... get_links.xsl


<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />

<xsl:template match="fullTextUrl">
  <xsl:value-of select="." /><xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="text()" />

</xsl:stylesheet>


wget http://journal.code4lib.org/issues/issue1/feed/doaj
mv doaj toc.xml
xsltproc get_links.xsl toc.xml | xargs -n 1 -i{} wget -r -l 1 --no-parent -k {}
xsltproc get_links.xsl toc.xml | xargs -n 1 -i{} wget -r -l 1 -A jpg,jpeg,png,gif -k {}
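For comparison, the link extraction can also be done without xsltproc. The <fullTextUrl> element name comes from the stylesheet above, but the sample feed below is invented (the real DOAJ feed's structure may differ), and sed-on-XML is fragile next to a real parser.

```shell
# Hypothetical miniature of the DOAJ feed, just to exercise the pattern.
cat > toc-sample.xml <<'EOF'
<records>
  <record><fullTextUrl format="html">http://journal.code4lib.org/articles/22</fullTextUrl></record>
  <record><fullTextUrl format="html">http://journal.code4lib.org/articles/23</fullTextUrl></record>
</records>
EOF
# Pull out the element text, one URL per line (assumes one fullTextUrl
# per line and no nesting).
sed -n 's/.*<fullTextUrl[^>]*>\(.*\)<\/fullTextUrl>.*/\1/p' toc-sample.xml > links.txt
cat links.txt
```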

Summary

Unfortunately we didn't get a WordPress VM set up in time that would emulate the settings of the journal.code4lib.org site.

We looked at a couple of plugins, but all looked like they would still require several manual steps (the goal would be for every new issue to be released as an epub automatically).

Downloading the page via save-as and converting with Calibre did a decent job, but the process is awkward.

XML-RPC seems to require a username + password, but might be feasible.

The problem with most scraping programs (mainly wget was used, although some sites advocate HTTrack) is:

  • the left-hand list of links to other issues gets pulled in
  • the images are stored at paths unrelated to the posts

So if you scrape the page and restrict it to just that level and lower, you don't get the images, but without the restriction you get far more than the issue. Either way it's still largely clumsy and not automated.

- Summary added by Jon Gorman after the fact....