Dynamic Documents with R and knitr 2nd ed

Reading the book alone is not enough to learn this; you have to search online for more material. A very poor book.


Title: Dynamic Documents with R and knitr 2nd ed
Authors: Yihui Xie
Edition: 2
Finished Date: 2017-04-29
Rating: 1
Language: English
Genres: Programming, R, Software, RStudio
Level: Entry
Publishers: Chapman and Hall/CRC
Publication Date: 2015-06-24
ISBN: 978-1498716963
Format: Pdf
Pages: 294
Download: Pdf

resources

typographic conventions of the book

  • bold text: package names
  • italic: function names
  • typewriter font: inline code
  • serif fonts: filenames

source code produces the output => maintain the source code only

literate programming paradigm

  1. write program to do computing
  2. write narratives to explain what is being done by the program code

Technically, literate programming involves three steps, all of which can be implemented in software packages, while the authors keep control of the style of the output:

  1. parse the source document and separate code from narratives
  2. execute source code and return results
  3. mix results from the source code with the original narratives
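
The three steps above can be sketched in a few lines of base R. This is only a toy illustration, not how knitr is implemented: the document content, the single-chunk assumption, and the `##` output prefix are all made up for the example.

```r
# A toy source document: narrative plus one code chunk in Sweave-like syntax
doc <- c(
  "The sum is computed below.",
  "<<>>=",
  "x <- 1 + 2",
  "x * 10",
  "@",
  "Done."
)

# Step 1: parse the document and separate code from narrative
# (assumes exactly one chunk, for simplicity)
begin <- grep("^<<.*>>=$", doc)
end   <- grep("^@$", doc)
code  <- doc[(begin + 1):(end - 1)]

# Step 2: execute the code and capture what would be printed at the console
env <- new.env()
results <- capture.output(
  for (expr in parse(text = code)) {
    res <- withVisible(eval(expr, envir = env))
    if (res$visible) print(res$value)
  }
)

# Step 3: mix the results with the original narrative
output <- c(
  doc[seq_len(begin - 1)],  # narrative before the chunk
  code,                     # the source code itself
  paste("##", results),     # the results, marked with a comment prefix
  doc[-seq_len(end)]        # narrative after the chunk
)
writeLines(output)
```

Only `doc` has to be maintained by hand; everything in `output` is regenerated from it.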

Ch 2. reproducible research

Reproducible research (RR) is one possible by-product of dynamic documents, but dynamic documents do not absolutely guarantee RR.

example: Monte Carlo simulation

  • we run a simulation with a certain random seed and get a good estimate of a parameter, but the result is actually due to a “lucky” random seed.
  • Although we can strictly reproduce the estimate, it is not actually reproducible in the general sense. Similar problems exist in optimization algorithms, e.g., different starting values can lead to different roots of the same equation.
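
A small sketch of the "lucky seed" problem, assuming a made-up simulation (estimating pi from random points in the unit square; the function name `estimate_pi` and the sample size are my own choices, not from the book):

```r
# Estimate pi by the fraction of random points that fall inside
# the quarter unit circle; the seed fully determines the result
estimate_pi <- function(seed, n = 1000) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# Strictly reproducible: the same seed always gives the same estimate
same <- identical(estimate_pi(1), estimate_pi(1))

# Not reproducible in the general sense: other seeds give other estimates,
# so reporting only the best-looking one would be misleading
estimates <- sapply(1:20, estimate_pi)
range(estimates)
```

Reporting the spread over many seeds says more about the method than any single seed does.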

2.1 literature

the term reproducible research

  • first proposed: Jon Claerbout @ Stanford University (Fomel and Claerbout, 2009)
  • the final product of research is not only the paper itself, but also the full computational environment used to produce the results in the paper such as the code and data necessary for reproduction of the results and building upon the research.
  • RR is often related to literate programming

    Knuth 1984

  • recommendation

    • source files

      • put them under the same directory
      • use relative paths whenever possible
    • Do not change the working directory after the computing has started

      set working directory in the very beginning of an R session

      the working directory is set to the directory of the source document before knitr is called to compile documents.

      f <- function(...) {
        # setwd() returns the previous working directory; store it in owd
        owd <- setwd("a/different/dir/")
        # restore the working directory when the function exits
        on.exit(setwd(owd), add = TRUE)
        # now you can work under a/different/dir ...
      }
  • Compile the documents in a clean R session

    in the end

    1. start a new R session
    2. compile a report in the batch mode
    3. all the results are freshly generated from the code.
  • Avoid the commands that require human interaction

    write the filename explicitly

    read.table('a-specific-file.txt')

  • Avoid environment variables for data analysis

    because they require additional instructions for users to set up, and humans can simply forget to do this. If there are any options to set up, do it inside the source document.

  • attach to the report

    • the output of sessionInfo() or devtools::session_info()
    • instructions on how to compile the document

      it is better to provide the instructions in the form of a computer script; e.g., a shell script, a Makefile, or a batch file.

2.3 barriers of reproducible reports

  • data can be huge
  • confidentiality of data
  • outdated software version
  • documents may compile differently on different operating systems
  • competition

    one may choose not to release the code or data with the report due to the fact that potential competitors can easily get everything for free, whereas the original authors have invested a large amount of money and effort

Ch.3 a first look

packageVersion("knitr")
# if not the latest version, run update.packages()

requirement of installation

  • LaTeX

    • Windows: MiKTeX
    • Mac: MacTeX
    • Linux: TeX Live
  • HTML: nothing

  • Markdown: nothing

knit(): compile source documents

library(knitr)
knit("your-file.Rnw")

stitch(): from source R file

  • knitr provides a template of the source document with some default settings
  • Currently it has built-in templates for LaTeX, HTML, and Markdown.

  • stitch(): LaTeX

  • stitch_rhtml(): HTML
  • stitch_rmd(): Markdown

library(knitr)
stitch("your-script.R")

literate programming document

  • weaving: compile it to a report (run the code) with knit()
  • tangling: extract the program code from it with purl()

library(knitr)
purl("your-file.Rnw")
purl("your-file.Rmd")

the result: R script: consists of all code chunks in the source document
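
To try weaving and tangling without creating any files, both functions also accept the source as a character vector through the `text` argument. A minimal sketch, assuming knitr is installed; the document content and the chunk label `demo` are made up:

```r
library(knitr)

# A tiny R Markdown document held in memory
src <- c(
  "Some narrative.",
  "",
  "```{r demo}",
  "1 + 1",
  "```",
  "",
  "Inline result: `r 2 * 3`."
)

# Weave: run the code and mix the results into the narrative
woven <- knit(text = src, quiet = TRUE)
cat(woven)

# Tangle: keep only the program code
tangled <- purl(text = src, quiet = TRUE)
cat(tangled)
```

With `text =` both calls return a character string instead of writing an output file, which is handy for quick experiments.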

Rnw: change options from Sweave to knitr

Ch.4 editors

4.1 RStudio

4.2 LyX

4.3 Emacs/ESS

4.4 other editors

Ch.5 document formats

3 key components of the design of knitr package

  1. a source parser
  2. a code evaluator
  3. an output renderer

parser

  1. parse the source document
  2. identify computer code and inline code

evaluator

  1. execute the code
  2. return results

renderer

  1. format the results in an appropriate format
  2. combine with the original documentation

relationship between the knitr components and the document format

  • independent of the document format: evaluator

  • have relation to document format:

    • parser: input syntax
    • renderer: output syntax

5.1 input syntax

regular expression

  • identify

    • code blocks, i.e., chunks
    • other elements

      • inline code
  • the patterns are kept in the all_patterns object

    • defined in pattern.R

      all_patterns = list(
      `rnw` = list(
      chunk.begin = '^\\s*<<(.*)>>=.*$', chunk.end = '^\\s*@\\s*(%+.*|)$',
      inline.code = '\\\\Sexpr\\{([^}]+)\\}', inline.comment = '^\\s*%.*',
      ref.chunk = '^\\s*<<(.+)>>\\s*$', header.begin = '(^|\n)[^%]*\\s*\\\\documentclass[^}]+\\}',
      document.begin = '\\s*\\\\begin\\{document\\}'),

      `brew` = list(inline.code = '<%[=]{0,1}\\s+([^%]+)\\s+[-]*%>'),

      `tex` = list(
      chunk.begin = '^\\s*%+\\s*begin.rcode\\s*(.*)', chunk.end = '^\\s*%+\\s*end.rcode',
      chunk.code = '^%+', ref.chunk = '^%+\\s*<<(.+)>>\\s*$',
      inline.comment = '^\\s*%.*', inline.code = '\\\\rinline\\{([^}]+)\\}',
      header.begin = '(^|\n)[^%]*\\s*\\\\documentclass[^}]+\\}',
      document.begin = '\\s*\\\\begin\\{document\\}'),

      `html` = list(
      chunk.begin = '^\\s*<!--\\s*begin.rcode\\s*(.*)',
      chunk.end = '^\\s*end.rcode\\s*-->', ref.chunk = '^\\s*<<(.+)>>\\s*$',
      inline.code = '<!--\\s*rinline(.+?)-->', header.begin = '\\s*<head>'),

      `md` = list(
      chunk.begin = '^[\t >]*```+\\s*\\{[.]?([a-zA-Z]+.*)\\}\\s*$',
      chunk.end = '^[\t >]*```+\\s*$',
      ref.chunk = '^\\s*<<(.+)>>\\s*$', inline.code = '`r +([^`]+)\\s*`'),

      `rst` = list(
      chunk.begin = '^\\s*[.][.]\\s+\\{r(.*)\\}\\s*$',
      chunk.end = '^\\s*[.][.]\\s+[.][.]\\s*$', chunk.code = '^[.][.]',
      ref.chunk = '^\\.*\\s*<<(.+)>>\\s*$', inline.code = ':r:`([^`]+)`'),

      `asciidoc` = list(
      chunk.begin = '^//\\s*begin[.]rcode(.*)$', chunk.end = '^//\\s*end[.]rcode\\s*$',
      chunk.code = '^//+', ref.chunk = '^\\s*<<(.+)>>\\s*$',
      inline.code = '`r +([^`]+)\\s*`|[+]r +([^+]+)\\s*[+]',
      inline.comment = '^//.*'),

      `textile` = list(
      chunk.begin = '^###[.]\\s+begin[.]rcode(.*)$',
      chunk.end = '^###[.]\\s+end[.]rcode\\s*$',
      ref.chunk = '^\\s*<<(.+)>>\\s*$',
      inline.code = '@r +([^@]+)\\s*@',
      inline.comment = '^###[.].*')
      )
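
To see what these patterns actually do, the `md` entries can be tried directly with base R regex functions. This is only a sketch: the two patterns are copied from the list above, while the sample lines and variable names are invented.

```r
# Patterns copied from the `md` entry of all_patterns
chunk_begin <- '^[\t >]*```+\\s*\\{[.]?([a-zA-Z]+.*)\\}\\s*$'
inline_code <- '`r +([^`]+)\\s*`'

# A made-up R Markdown fragment, one element per line
src <- c("Some text", "```{r foo, echo=FALSE}", "x <- 1", "```")

# Which lines open a chunk?
is_begin <- grepl(chunk_begin, src)

# The first group captures the engine name plus the chunk options
opts <- sub(chunk_begin, "\\1", src[is_begin])

# Inline code: the first group captures the R expression
txt <- "The value is `r 1 + 1`."
m <- regmatches(txt, regexec(inline_code, txt))[[1]]

is_begin  # only the second line is a chunk header
opts      # "r foo, echo=FALSE"
m[2]      # "1 + 1"
```

The closing fence on the last line does not match `chunk.begin` because it has no `{...}` part; it is caught by `chunk.end` instead.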
Reread page 28 after reviewing regular expressions.

Two more technical notes about the regular expression above:
1. \s denotes a white space in regular expressions, but in R we have to write double backslashes because \\ in an R string really means one backslash (the first backslash escapes the second character, which is also a backslash); the backslash as the escape character can be rather confusing to beginners, and the rule of thumb is, when you want a real backslash, you may need two backslashes;
2. the parentheses () in the regular expression group a series of characters so that we can extract them with back references, e.g., we can extract the second group of characters from abbbc.
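
Both notes can be checked at the console. A short sketch in base R; the pattern with three groups is my own choice for illustrating the abbbc case and may not be the exact one printed in the book:

```r
# Note 1: "\\s" in R source is a two-character string, a backslash and an s
n_chars <- nchar("\\s")
n_chars  # 2

# The regex engine then reads \s as "white space"
has_space <- grepl("a\\sb", c("a b", "ab"))
has_space  # TRUE FALSE

# Note 2: three groups (a)(b+)(c); \\2 refers back to the second group
second <- gsub("(a)(b+)(c)", "\\2", "abbbc")
second  # "bbb"
```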

5.1.1 chunk options

The chapter title is "document formats"... why not just call it "syntax"? That would be easier to understand.

The syntax for chunk options is almost exactly the same as the syntax for function arguments in R

option = value

as long as the option values are valid R code, they are valid for knitr => we can write arbitrary valid R code for chunk options, which makes a source document programmable

Example

```{r}
bar <- 2
```

```{r, eval=if(bar<5) TRUE else FALSE}
print(3)
```

short form: eval=bar<5

5.1.2 chunk label

chunk label: label = "character"

  • data type: character

    • quoted
    • unquoted: knitr quotes it internally
  • label= can be omitted when the label is passed by position (as the first option), not by name
  • unique

    • if not unique: knitr stops and gives an error

      because: there is a potential danger that the files generated from one chunk may overwrite those of the other chunk

  • purpose: generate external files

    • images
    • cache files
  • label is empty: automatically generate a label of the form unnamed-chunk-i, i = 1,2,3…

Example

```{r foo}
```{r foo-bar}
```{r "foo"}
```{r 'foo-bar'}
```{r label="foo"}
```{r echo=FALSE, label="foo-bar"}

5.1.3 global options

opts_chunk in defaults.R

opts_chunk = new_defaults(list(

eval = TRUE, echo = TRUE, results = 'markup', tidy = FALSE, tidy.opts = NULL,
collapse = FALSE, prompt = FALSE, comment = '##', highlight = TRUE,
strip.white = TRUE, size = 'normalsize', background = '#F7F7F7',

cache = FALSE, cache.path = 'cache/', cache.vars = NULL, cache.lazy = TRUE,
dependson = NULL, autodep = FALSE, cache.rebuild = FALSE,

fig.keep = 'high', fig.show = 'asis', fig.align = 'default', fig.path = 'figure/',
dev = NULL, dev.args = NULL, dpi = 72, fig.ext = NULL, fig.width = 7, fig.height = 7,
fig.env = 'figure', fig.cap = NULL, fig.scap = NULL, fig.lp = 'fig:', fig.subcap = NULL,
fig.pos = '', out.width = NULL, out.height = NULL, out.extra = NULL, fig.retina = 1,
external = TRUE, sanitize = FALSE, interval = 1, aniopts = 'controls,loop',

warning = TRUE, error = TRUE, message = TRUE,

render = NULL,

ref.label = NULL, child = NULL, engine = 'R', split = FALSE, include = TRUE, purl = TRUE

))
  • global options are shared across all the following chunks after the location in which the options are set
  • local options in chunk override global options
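
A minimal sketch of changing the global defaults, assuming knitr is installed; the option values here are arbitrary. In a real document this call would sit in an early setup chunk.

```r
library(knitr)

# Change the defaults for all chunks that follow
opts_chunk$set(echo = FALSE, fig.width = 5)

# Read the current global values back
echo_now  <- opts_chunk$get("echo")
width_now <- opts_chunk$get("fig.width")
echo_now   # FALSE
width_now  # 5

# Go back to knitr's built-in defaults
opts_chunk$restore()
```

A chunk header such as `{r, echo=TRUE}` would still show its code, because local options override the global ones.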

5.1.4 chunk syntax

Documentation here

<<>>=
code
<<>>=
code
@
More documentation

5.2 document formats

code chunks can be indented by any number of spaces in all document formats

5.2.1 Markdown

problem of Markdown: each derivative has its own definition of certain elements, such as tables

CommonMark http://commonmark.org/: an attempt to standardize the Markdown syntax

Pandoc’s Markdown is compatible with the CommonMark standard

5.2.2 LaTeX

5.2.3 HTML

use comment syntax in order to write R code in HTML document

  1. create an R HTML file
  2. write code

begin code: <!--begin.rcode
end code: end.rcode-->

chunk options: <!--begin.rcode fig.width=7, fig.height=6

inline codes: <!--rinline pi -->

5.2.4 reStructuredText

page 36

reStructuredText (reST) document: http://docutils.sourceforge.net/rst.html

like Markdown

more powerful, more complicated

5.2.5 AsciiDoc

page 37

https://en.wikipedia.org/wiki/AsciiDoc

5.2.6 Textile

page 37

https://en.wikipedia.org/wiki/Textile_(markup_language)

5.2.7 customization

page 37

use one’s own syntax to parse a source document

knit_patterns: manage regular expressions

knit_patterns$get(c("chunk.begin", "chunk.end", "inline.code"))

override the default syntax: knit_patterns$set()

Example

knit_patterns$set(
  chunk.begin = "^<<r(.*)", chunk.end = "^r>>$",
  inline.code = "\\{\\{([^}]+)\\}\\}"
)

when parsing a source document, knitr

  1. matches a pattern list to the filename extension
  2. if none is found, searches the code chunks

    to check whether their syntax matches one of the existing pattern lists

5.3 output renderers

  • function eval() in base package: execute inline R code

    1. parse
    2. evaluate

      eval(parse(text = "1+1"))

  • evaluate code chunks: evaluate package

    loop

    1. evaluate package, function evaluate()

      1. takes a chunk of R source code
      2. evaluate
      3. return a list. 6 possible classes

        • character: normal text output
        • source: source code
        • warning
        • message
        • error
        • recordedplot: plots
    2. knitr package: object knit_hooks: a list of output hook functions that construct the final output based on the output format

the form of a hook function

hook_fun <- function(x, options) {
  # returns a character string with markup
}

  • x: raw output from R
  • options: a list of current chunk options
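
A hook of this form can be written and tested as an ordinary R function before it is given to knitr. In this sketch the hook name `hook_pre` and the HTML markup are my own invention; the only assumption about `options` is that it carries the `engine` option listed among the global options.

```r
# A hypothetical output hook: wrap raw text output in an HTML <pre> block
hook_pre <- function(x, options) {
  paste0('<pre class="', options$engine, '-output">', x, "</pre>\n")
}

# Call it by hand with a fake raw output and a minimal options list
out <- hook_pre("## [1] 2", list(engine = "R"))
cat(out)

# To use it in a real document it would be registered with
# knitr::knit_hooks$set(output = hook_pre)
```

Calling the hook by hand like this makes it easy to check the markup before compiling a whole document.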