├── .Rbuildignore
├── .gitignore
├── .travis.yml
├── CONTRIBUTING.md
├── ChangeLog
├── DESCRIPTION
├── LICENSE
├── NAMESPACE
├── R
    ├── file_sample_exact.r
    ├── file_sample_prob.r
    ├── filesampler-package.r
    ├── reactor.r
    ├── sample_csv.r
    ├── sample_lines.r
    ├── utils.r
    └── wc.r
├── README.md
├── cleanup
├── configure
├── configure.ac
├── inst
    ├── CITATION
    ├── benchmarks
    │   ├── .gitignore
    │   ├── README
    │   ├── makebig
    │   ├── sampler.r
    │   └── wc.r
    └── rawdata
    │   └── small.csv
├── man
    ├── file_sample_exact.Rd
    ├── file_sample_prop.Rd
    ├── filesampler.Rd
    ├── print-wc.Rd
    ├── sample_csv.Rd
    ├── sample_lines.Rd
    └── wc.Rd
├── src
    ├── Makevars.in
    ├── Rfilesampler.h
    ├── filesampler
    │   ├── Makefile
    │   ├── check_avx.h
    │   ├── error.h
    │   ├── file_sampler.c
    │   ├── filesampler.h
    │   ├── rebalance.c
    │   ├── safeomp.h
    │   ├── utils.h
    │   ├── utils_sample.h
    │   └── wc.c
    ├── filesampler_native.c
    ├── samplers.c
    └── wc.c
├── tests
    ├── exact.R
    ├── prop.R
    └── wc.R
├── tools
    └── ax_check_compile_flag.m4
└── vignettes
    ├── IEEEtran.bst
    ├── build_pdf.sh
    ├── filesampler.Rnw
    └── include
        ├── 00-acknowledgement.tex
        ├── filesampler.bib
        ├── settings.tex
        └── uch_small.png


/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^\.travis\.yml$
2 | CONTRIBUTING.md
3 | 
4 | inst/benchmarks/big.csv
5 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | src/Makevars
 2 | aclocal.m4
 3 | 
 4 | *.o
 5 | *.lo
 6 | *.so
 7 | *.log
 8 | *.status
 9 | *~
10 | *.swp
11 | 
12 | *.pdf
13 | 
14 | *.aux
15 | *.bbl
16 | *.blg
17 | *.out
18 | *.toc
19 | 
20 | inst/doc
21 | 


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: R
2 | cache: packages
3 | warnings_are_errors: true
4 | 
5 | env: 
6 |   global:
7 |     - CRAN: https://cran.rstudio.com
8 |     - _R_CHECK_FORCE_SUGGESTS_=FALSE
9 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # How to Contribute
 2 | 
 3 | If you're reading this and thinking of contributing, first, thanks so much! All contributions big and small are welcome. This document is a set of guidelines (not firm rules) to help the process.
 4 | 
 5 | Thanks!
 6 | 
 7 | 
 8 | ## Reporting a Bug or Requesting a Feature
 9 | Please open an issue at the project repository's issue tracker on GitHub.  If there is already an open issue describing your problem/request, feel free to join the discussion inside the existing issue, but please do not open a new one.
10 | 
11 | 
12 | ## Submitting Patches/Corrections
13 | You can submit a pull request (PR) to the project repository on GitHub. Please try to keep PRs as "small" as possible (don't try to combine several large changes into one PR).
14 | 
15 | For small changes, the log can be simple, e.g. "fixed a typo". For larger changes that touch many files, make sure to give a reasonably detailed description of the change(s).
16 | 
17 | If the project uses a continuous integration (CI) service such as Travis CI, then the PR must pass CI tests before it will be merged.
18 | 
19 | 
20 | ## Coding Conventions
21 | If you submit a patch that modifies/adds code, please try to keep the style at least reasonably similar to the one used in the codebase.  For both C and R code, please use the following conventions:
22 | 
23 |   * Indent using two spaces.
24 |   * Use the [Allman](https://en.wikipedia.org/wiki/Indent_style#Allman_style) control statement style.  Do not use braces (but do indent) for one liners following a control statement.
25 |   * Always put spaces after commas and around operators. So for example, use `foo(a, b, c)` not `foo(a,b,c)`; and use `x += 1` not `x+=1`.
26 |   * If you create new functionality (new function, new arguments to an old function), add or expand a test in the `tests/` sub-tree.
27 |   * The `master` branch of this repository should always pass `R CMD check --as-cran` with no NOTEs and no WARNINGs.
28 |   * Update the `ChangeLog` file in the root of this directory explaining what you did, and end the line with your initials or full name.  For example:
29 |   ```
30 |   Fixed a memory leak in src/foo.c. (ABC)
31 |   ```
32 | 
33 | 
34 | ## Copyright
35 | This document is released as [CC0, public domain](https://creativecommons.org/choose/zero/). Feel free to re-use as much of this as you like with or without attribution if you find the template useful.
36 | 
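
The coding conventions listed in CONTRIBUTING.md (two-space indent, Allman braces, no braces on one-line bodies, spaces after commas and around operators) might look like this in R. This is only an illustrative sketch; `clamp` is a hypothetical function and is not part of the package.

```r
# Hypothetical example function, written only to illustrate the style.
clamp = function(x, lo, hi)
{
  if (x < lo)
    return(lo)
  else if (x > hi)
    return(hi)
  
  x
}

clamped = clamp(5, lo=0, hi=3)
```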


--------------------------------------------------------------------------------
/ChangeLog:
--------------------------------------------------------------------------------
 1 | Release 0.4-0:
 2 |   * Integrate exact sampler into sample_csv().
 3 |   * Refactored internals for better code-reuse.
 4 |   * Drop assertthat dependency.
 5 |   * Re-wrote vignette.
 6 |   * Switch to LaTeX vignette.
 7 |   * Register native routines.
 8 |   * Use AVX for line counter when available (Daniel Lemire).
 9 |   * Add wc_w().
10 | 
11 | Release 0.3-1:
12 |   * Internal reorganization.
13 |   * Add option to turn off counts in wc().
14 |   * Optimize wc().
15 |   * Better error checking/handling across the board.
16 |   * Better internal library/R separation for easier code re-use.
17 | 
18 | Release 0.3-0:
19 |   * Rework API.
20 |   * Added vignette.
21 |   * Added nmax reader option.
22 |   * Fix broken help/NAMESPACE issues.
23 |   * Change versioning format.
24 |   * Change to assertthat.
25 | 
26 | Release 0.2-0:
27 |   * Removed gnu extensions (strnlen).
28 |   * Fixed numerous packaging and documentation issues.
29 | 
30 | Release 0.1-0:
31 |   * Added basic sampler.
32 |   * Added reservoir sampler.
33 |   * Added misc utilities (wc, etc).
34 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: filesampler
 2 | Type: Package
 3 | Title: File Sampler
 4 | Version: 0.4-0
 5 | Description: A collection of utilities for reading subsamples of flat text files
 6 |     by line in a reasonably efficient manner.  We do so by sampling 
 7 |     as the input file is scanned and randomly choosing whether or not to
 8 |     dump the current line to an external temporary file.  This 
 9 |     temporary file is then read back into R.  For (aggressive) 
10 |     'downsampling', this is a very effective strategy; for resampling,
11 |     you are much better off reading the full dataset into memory.
12 | License: BSD 2-clause License + file LICENSE
13 | Depends:
14 |     R (>= 3.5.0)
15 | Imports:
16 |     utils
17 | Suggests:
18 |     memuse (>= 3.0.0)
19 | NeedsCompilation: yes
20 | ByteCompile: yes
21 | Authors@R: c(
22 |     person("Drew", "Schmidt", role = c("aut", "cre"), email="wrathematics@gmail.com"),
23 |     person("Daniel", "Lemire", role="ctb", comment="vectorized line counter")
24 |     )
25 | Maintainer: Drew Schmidt <wrathematics@gmail.com>
26 | URL: https://github.com/wrathematics/filesampler
27 | BugReports: https://github.com/wrathematics/filesampler/issues
28 | RoxygenNote: 7.0.2
29 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2015-2018
2 | COPYRIGHT HOLDER: Drew Schmidt
3 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | S3method(print,wc)
 4 | export(file_sample_exact)
 5 | export(file_sample_prop)
 6 | export(sample_csv)
 7 | export(sample_lines)
 8 | export(wc)
 9 | export(wc_l)
10 | export(wc_w)
11 | importFrom(utils,read.csv)
12 | useDynLib(filesampler,R_fs_sample_exact)
13 | useDynLib(filesampler,R_fs_sample_prop)
14 | useDynLib(filesampler,R_fs_wc)
15 | 


--------------------------------------------------------------------------------
/R/file_sample_exact.r:
--------------------------------------------------------------------------------
 1 | #' Exact File Sampler
 2 | #' 
 3 | #' Randomly sample lines from an input text file.
 4 | #' 
 5 | #' @details
 6 | #' The sampling is done in two passes of the input file.  First, the number of
 7 | #' lines in the input file is determined by scanning through the file as
 8 | #' quickly as possible (i.e., it should be completely I/O bound).  Next, an
 9 | #' index of lines to keep is produced by reservoir sampling.  Finally, the
10 | #' input file is scanned again line by line, with the chosen lines dumped into
11 | #' a temporary file.
12 | #' 
13 | #' If the output file (the one given by the \code{outfile} argument) is
14 | #' "large" and to be read into memory (which isn't really appropriate for text
15 | #' files in the first place!), then this strategy is probably not appropriate.
16 | #' 
17 | #' @param nlines
18 | #' The (exact) number of lines to sample from the input file.
19 | #' @param infile
20 | #' Location of the file (as a string) to be subsampled.
21 | #' @param outfile
22 | #' Output file location (as a string).
23 | #' @param header
24 | #' Is a header (line of column names) on the first line of the csv file?
25 | #' @param nskip
26 | #' Number of lines to skip. If \code{header=TRUE}, then this only applies to
27 | #' lines after the header.
28 | #' @param verbose
29 | #' Should linecounts of the input file and the number of lines sampled be
30 | #' printed?
31 | #' 
32 | #' @return
33 | #' \code{NULL}
34 | #' 
35 | #' @useDynLib filesampler R_fs_sample_exact
36 | #' @export
37 | file_sample_exact = function(nlines, infile, outfile=tempfile(), header=TRUE, nskip=0, verbose=FALSE)
38 | {
39 |   check.is.posint(nlines)
40 |   check.is.string(infile)
41 |   infile = abspath(infile)
42 |   check.is.string(outfile)
43 |   check.is.flag(header)
44 |   check.is.natnum(nskip)
45 |   check.is.flag(verbose)
46 |   
47 |   .Call(R_fs_sample_exact, as.integer(verbose), as.integer(header), as.integer(nskip), as.integer(nlines)-1L, infile, outfile)
48 |   
49 |   invisible()
50 | }
51 | 
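
The two-pass strategy described in the documentation of `file_sample_exact()` can be sketched in plain R. This is a minimal illustration under stated assumptions, not the package's C implementation: `reservoir_indices` is a hypothetical helper using the textbook reservoir algorithm (Algorithm R), header handling is ignored, and the file is read fully into memory purely to keep the example short.

```r
# Hypothetical helper, not part of filesampler: choose k of n line indices
# by reservoir sampling (Algorithm R).
reservoir_indices = function(n, k)
{
  keep = seq_len(min(n, k))    # fill the reservoir with the first k indices
  if (n > k)
  {
    for (i in (k + 1):n)
    {
      j = sample.int(i, 1L)    # uniform draw from 1..i
      if (j <= k)
        keep[j] = i            # index i enters the reservoir with probability k/i
    }
  }
  
  sort(keep)
}

# A small temporary file stands in for the real input.
infile = tempfile()
writeLines(paste("row", 1:100), infile)

nlines = length(readLines(infile))     # pass 1: count the lines
idx = reservoir_indices(nlines, 10)    # choose exactly 10 line numbers
sampled = readLines(infile)[idx]       # pass 2: keep only the chosen lines
unlink(infile)
```

Unlike the proportional sampler, this always yields exactly the requested number of lines (or every line, if the file is shorter than the request).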


--------------------------------------------------------------------------------
/R/file_sample_prob.r:
--------------------------------------------------------------------------------
 1 | #' Proportional File Sampler
 2 | #' 
 3 | #' Randomly sample lines from an input text file.
 4 | #' 
 5 | #' @details
 6 | #' The sampling is done in one pass of the input file, dumping lines to a
 7 | #' temporary file as the input is read.
 8 | #' 
 9 | #' If the output file (the one given by the \code{outfile} argument) is
10 | #' "large" and to be read into memory (which isn't really appropriate for text
11 | #' files in the first place!), then this strategy is probably not appropriate.
12 | #' 
13 | #' @param p
14 | #' Proportion to retain; should be a numeric value between 0 and 1.
15 | #' @param infile
16 | #' Location of the file (as a string) to be subsampled.
17 | #' @param outfile
18 | #' Output file location (as a string).
19 | #' @param header
20 | #' Is a header (line of column names) on the first line of the csv file?
21 | #' @param nskip
22 | #' Number of lines to skip. If \code{header=TRUE}, then this only applies to
23 | #' lines after the header.
24 | #' @param nmax
25 | #' Max number of lines to read.  If \code{nmax==0}, then there is no read cap.
26 | #' @param verbose
27 | #' Should linecounts of the input file and the number of lines sampled be
28 | #' printed?
29 | #' 
30 | #' @return
31 | #' \code{NULL}
32 | #' 
33 | #' @useDynLib filesampler R_fs_sample_prop
34 | #' @export
35 | file_sample_prop = function(p, infile, outfile=tempfile(), header=TRUE, nskip=0, nmax=0, verbose=FALSE)
36 | {
37 |   check.is.scalar(p)
38 |   check.is.string(infile)
39 |   infile = abspath(infile)
40 |   check.is.string(outfile)
41 |   check.is.flag(header)
42 |   check.is.natnum(nskip)
43 |   check.is.natnum(nmax)
44 |   check.is.flag(verbose)
45 |   
46 |   if (p == 0)
47 |     stop("no lines available for input")
48 |   if (p < 0 || p > 1)
49 |     stop("Argument 'p' must be between 0 and 1")
50 |   
51 |   .Call(R_fs_sample_prop, verbose, header, as.integer(nskip), as.integer(nmax), as.double(p), infile, outfile)
52 |   
53 |   invisible()
54 | }
55 | 
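
The one-pass approach described in the documentation of `file_sample_prop()` can be sketched in plain R as follows. This is a hedged illustration only: `sample_prop_sketch` is a hypothetical function, it assumes each non-header line is kept independently with probability `p` and that the header is always kept, and it omits `nskip` and `nmax`. The package itself does this work in C.

```r
# Hypothetical pure-R sketch of one-pass proportional sampling.
sample_prop_sketch = function(p, infile, outfile, header=TRUE)
{
  inp = file(infile, open="r")
  out = file(outfile, open="w")
  on.exit({close(inp); close(out)})
  
  first = TRUE
  repeat
  {
    line = readLines(inp, n=1L)
    if (length(line) == 0L)
      break
    
    # Assumption: always keep the header; keep other lines with probability p.
    if ((first && header) || runif(1) < p)
      writeLines(line, out)
    
    first = FALSE
  }
  
  invisible()
}

infile = tempfile()
outfile = tempfile()
writeLines(c("id", paste0("r", 1:50)), infile)
sample_prop_sketch(0.2, infile, outfile)
kept = readLines(outfile)
```

Because each line is decided independently, the number of lines in `kept` varies from run to run; only the expected proportion is controlled.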


--------------------------------------------------------------------------------
/R/filesampler-package.r:
--------------------------------------------------------------------------------
 1 | #' File Sampler
 2 | #' 
 3 | #' A simple package for reading subsamples of flat text files by line in a
 4 | #' reasonably efficient manner.  We do so by sampling as the input file is
 5 | #' scanned and randomly choosing whether or not to dump the current line to an
 6 | #' external temporary file.  This temporary file is then read back into R.  For
 7 | #' (aggressive) downsampling, this is a very effective strategy; for resampling,
 8 | #' you are much better off reading the full dataset into memory.
 9 | #' 
10 | #' The basic function performs the sampling/reading in a single pass. There is
11 | #' an alternative version which allows for an exact amount to be subsampled via
12 | #' reservoir sampling, but this version requires 2 passes through the data.  The
13 | #' heavy lifting is done entirely in C.
14 | #'
15 | #' This package, including the underlying C library, is licensed under the
16 | #' permissive 2-clause BSD license.
17 | #' 
18 | #' @author Drew Schmidt \email{wrathematics@@gmail.com}
19 | #' @references Project URL: \url{https://github.com/wrathematics/filesampler}
20 | #'
21 | #' @importFrom utils read.csv
22 | #' 
23 | #' @name filesampler
24 | #' @docType package
25 | #' @title File Sampler
26 | #' @keywords package
27 | NULL
28 | 


--------------------------------------------------------------------------------
/R/reactor.r:
--------------------------------------------------------------------------------
  1 | is.badval <- function(x)
  2 | {
  3 |   is.na(x) || is.nan(x) || is.infinite(x)
  4 | }
  5 | 
  6 | is.inty <- function(x)
  7 | {
  8 |   abs(x - round(x)) < 1e-10
  9 | }
 10 | 
 11 | is.zero <- function(x)
 12 | {
 13 |   abs(x) < 1e-10
 14 | }
 15 | 
 16 | is.negative <- function(x)
 17 | {
 18 |   x < 0
 19 | }
 20 | 
 21 | is.annoying <- function(x)
 22 | {
 23 |   length(x) != 1 || is.badval(x)
 24 | }
 25 | 
 26 | 
 27 | 
 28 | check.is.flag <- function(x)
 29 | {
 30 |   if (!is.logical(x) || is.annoying(x))
 31 |   {
 32 |     nm <- deparse(substitute(x))
 33 |     stop(paste0("argument '", nm, "' must be TRUE or FALSE"), call.=FALSE)
 34 |   }
 35 |   
 36 |   invisible(TRUE)
 37 | }
 38 | 
 39 | 
 40 | 
 41 | check.is.scalar <- function(x)
 42 | {
 43 |   if (!is.numeric(x) || is.annoying(x))
 44 |   {
 45 |     nm <- deparse(substitute(x))
 46 |     stop(paste0("argument '", nm, "' must be a single number (not NA, Inf, NaN)"), call.=FALSE)
 47 |   }
 48 |   
 49 |   invisible(TRUE)
 50 | }
 51 | 
 52 | 
 53 | 
 54 | check.is.string <- function(x)
 55 | {
 56 |   if (!is.character(x) || is.annoying(x))
 57 |   {
 58 |     nm <- deparse(substitute(x))
 59 |     stop(paste0("argument '", nm, "' must be a single string"), call.=FALSE)
 60 |   }
 61 |   
 62 |   invisible(TRUE)
 63 | }
 64 | 
 65 | 
 66 | 
 67 | check.is.int <- function(x)
 68 | {
 69 |   if (!is.numeric(x) || is.annoying(x) || !is.inty(x))
 70 |   {
 71 |     nm <- deparse(substitute(x))
 72 |     stop(paste0("argument '", nm, "' must be an integer"), call.=FALSE)
 73 |   }
 74 |   
 75 |   invisible(TRUE)
 76 | }
 77 | 
 78 | 
 79 | 
 80 | check.is.natnum <- function(x)
 81 | {
 82 |   if (!is.numeric(x) || is.annoying(x) || !is.inty(x) || is.negative(x))
 83 |   {
 84 |     nm <- deparse(substitute(x))
 85 |     stop(paste0("argument '", nm, "' must be a natural number (0 or positive integer)"), call.=FALSE)
 86 |   }
 87 |   
 88 |   invisible(TRUE)
 89 | }
 90 | 
 91 | 
 92 | 
 93 | check.is.posint <- function(x)
 94 | {
 95 |   if (!is.numeric(x) || is.annoying(x) || !is.inty(x) || is.negative(x) || is.zero(x))
 96 |   {
 97 |     nm <- deparse(substitute(x))
 98 |     stop(paste0("argument '", nm, "' must be a positive integer"), call.=FALSE)
 99 |   }
100 |   
101 |   invisible(TRUE)
102 | }
103 | 
104 | 
105 | 
106 | check.is.function <- function(x)
107 | {
108 |   if (!is.function(x))
109 |   {
110 |     nm <- deparse(substitute(x))
111 |     stop(paste0("argument '", nm, "' must be a function"), call.=FALSE)
112 |   }
113 |   
114 |   invisible(TRUE)
115 | }
116 | 


--------------------------------------------------------------------------------
/R/sample_csv.r:
--------------------------------------------------------------------------------
 1 | #' Read Sample of CSV
 2 | #' 
 3 | #' The function will read (as csv) approximately p*nlines lines. So 
 4 | #' if \code{p=.1}, then we will get roughly (probably not exactly) 10% of the
 5 | #' data.  This is the analogue of the base R function \code{read.csv()}.
 6 | #' 
 7 | #' @details
 8 | #' This function scans over the text of the input file and at each step,
 9 | #' randomly chooses whether or not to include the current line into a
10 | #' downsampled file. Each selected line is placed in a temporary file, before
11 | #' being read into R via \code{read.csv()}.  Additional arguments to this
12 | #' function (those passed via \code{...}) are handed to the reader, which is
13 | #' \code{read.csv()} by default, and so if their behavior is unclear, you
14 | #' should examine the \code{read.csv()} help file.
15 | #' 
16 | #' If \code{verbose=TRUE}, then something like:
17 | #' 
18 | #' \code{Read 12207 lines (0.001\%) of 12174948 line file.}
19 | #' 
20 | #' will be printed to the terminal. This counts the header (if there is one) as
21 | #' one of the lines read and as one of the lines possible.
22 | #' 
23 | #' @param file
24 | #' Location of the file (as a string) to be subsampled.
25 | #' @param param
26 | #' The downsampling parameter. For the "proportional" method, this is the
27 | #' proportion to retain and should be a numeric value between 0 and 1. For the
28 | #' exact method, this is the total number of lines to read in.
29 | #' @param method
30 | #' A string indicating the type of read method to use. Options are
31 | #' "proportional" and "exact".
32 | #' @param reader
33 | #' A function specifying the reader to use. The default is 
34 | #' \code{utils::read.csv}. Other options include \code{data.table::fread()} and
35 | #' \code{readr::read_csv()}.  Note the first argument of the reader should be
36 | #' the file to read in and the second should be the
37 | #' \code{header}/\code{col_names} argument.  This would require writing a small
38 | #' wrapper for \code{fread()}.
39 | #' @param header
40 | #' Is a header (line of column names) on the first line of the csv file?
41 | #' @param nskip
42 | #' Number of lines to skip. If \code{header=TRUE}, then this only applies to
43 | #' lines after the header.
44 | #' @param nmax
45 | #' Max number of lines to read. If nmax==0, then there is no read cap. Ignored
46 | #' if \code{method="exact"}.
47 | #' @param verbose
48 | #' Should linecounts of the input file and the number of lines sampled be
49 | #' printed?
50 | #' @param ...
51 | #' Additional arguments passed to the csv reader.
52 | #' 
53 | #' @return
54 | #' A dataframe, as with \code{read.csv()}.
55 | #' 
56 | #' @examples
57 | #' library(filesampler)
58 | #' file = system.file("rawdata/small.csv", package="filesampler")
59 | #' 
60 | #' # Read in a 5% random subsample of the rows.
61 | #' data = sample_csv(file, param=.05)
62 | #' 
63 | #' # Read in 10 randomly sampled rows.
64 | #' data = sample_csv(file, param=10, method="exact")
65 | #'
66 | #' @export
67 | sample_csv = function(file, param, method="proportional", reader=utils::read.csv, header=TRUE, nskip=0, nmax=0, verbose=FALSE, ...)
68 | {
69 |   check.is.function(reader)
70 |   method = match.arg(tolower(method), c("proportional", "exact"))
71 |   
72 |   outfile = tempfile()
73 |   
74 |   if (method == "proportional")
75 |   {
76 |     p = param
77 |     file_sample_prop(p=p, infile=file, outfile=outfile, header=header, nskip=nskip, nmax=nmax, verbose=verbose)
78 |   }
79 |   else if (method == "exact")
80 |   {
81 |     nlines = param
82 |     file_sample_exact(nlines=nlines, infile=file, outfile=outfile, header=header, nskip=nskip, verbose=verbose)
83 |   }
84 |   
85 |   
86 |   reader_nm = deparse(substitute(reader))
87 |   if (grepl(reader_nm, pattern="read_csv"))
88 |     data = reader(outfile, col_names=header, ...)
89 |   else
90 |     data = reader(outfile, header=header, ...)
91 |   
92 |   unlink(outfile)
93 |   return(data)
94 | }
95 | 
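
The `reader` argument of `sample_csv()` expects a function taking the file first and accepting a `header` argument. A custom reader along those lines might look like the sketch below. `read_semicolon` is a hypothetical name chosen for illustration, and the example calls it directly on a temporary file rather than through the package.

```r
# Hypothetical wrapper: file first, header second, everything else via `...`.
read_semicolon = function(file, header, ...)
{
  utils::read.table(file, header=header, sep=";", ...)
}

# Call it the same way sample_csv() calls its reader: reader(outfile, header=header, ...)
tmp = tempfile()
writeLines(c("x;y", "1;a", "2;b"), tmp)
df = read_semicolon(tmp, header=TRUE, stringsAsFactors=FALSE)
unlink(tmp)
```

With the package installed, such a function could presumably be passed as `sample_csv(file, param=.05, reader=read_semicolon)`. Note that `sample_csv()` inspects the deparsed name of the reader and passes `col_names` instead of `header` when that name contains `read_csv`, so a wrapper whose name contains that string should accept `col_names`.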


--------------------------------------------------------------------------------
/R/sample_lines.r:
--------------------------------------------------------------------------------
 1 | #' Read Sample Lines of Text File
 2 | #' 
 3 | #' The function will read approximately p*nlines lines of a flat text
 4 | #' file. So if \code{p=.1}, then we will get roughly (probably not exactly)
 5 | #' 10% of the data.  This is the analogue of the base R function
 6 | #' \code{readLines()}.
 7 | #' 
 8 | #' @details
 9 | #' This function scans over the text of the input file and at each step, randomly
10 | #' chooses whether or not to include the current line into a downsampled file.
11 | #' Each selected line is placed in a temporary file, before being read into R
12 | #' via \code{readLines()}.  Additional arguments to this function (those other
13 | #' than \code{file}, \code{p}, and \code{verbose}) are passed to \code{readLines()},
14 | #' and so if their behavior is unclear, you should examine the \code{readLines()}
15 | #' help file.
16 | #' 
17 | #' If \code{verbose=TRUE}, then something like:
18 | #' 
19 | #' \code{Read 12207 lines (0.001\%) of 12174948 line file.}
20 | #' 
21 | #' will be printed to the terminal.  This counts the header (if there is one)
22 | #' as one of the lines read and as one of the lines possible.
23 | #' 
24 | #' @param file
25 | #' Location of the file (as a string) to be subsampled.
26 | #' @param n
27 | #' As in \code{readLines()}.
28 | #' @param p
29 | #' Proportion to retain; should be a numeric value between 0 and 1.
30 | #' @param nskip
31 | #' Number of lines to skip.
32 | #' @param nmax
33 | #' Max number of lines to read.  If nmax==0, then there is no read cap.
34 | #' @param verbose
35 | #' Logical; indicates whether or not linecounts of the input file and the number
36 | #' of lines sampled should be printed.
37 | #' @param ...
38 | #' Additional arguments passed to \code{readLines()}.
39 | #' 
40 | #' @return
41 | #' A character vector, as with \code{readLines()}.
42 | #' 
43 | #' @examples
44 | #' library(filesampler)
45 | #' file = system.file("rawdata/small.csv", package="filesampler")
46 | #' sample_lines(file, p=.05)
47 | #'
48 | #' @export
49 | sample_lines = function(file, n=-1L, p=.1, nskip=0, nmax=0, verbose=FALSE, ...)
50 | {
51 |   check.is.int(n)
52 |   
53 |   if (p == 0)
54 |     return(character(0))
55 |   
56 |   if (n > 0 && n < nskip)
57 |     return(character(0))
58 |   
59 |   outfile = tempfile()
60 |   file_sample_prop(verbose=verbose, header=FALSE, nskip=nskip, nmax=nmax, p=p, infile=file, outfile=outfile)
61 |   
62 |   data = readLines(outfile, n=n, ...)
63 |   unlink(outfile)
64 |   data
65 | }
66 | 
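
The phrase "roughly (probably not exactly)" in the documentation above can be made concrete with a little arithmetic. Assuming each line is kept independently with probability `p` (an assumption about the sampler, not something stated in the source), the number of lines kept is binomial. The line count below is taken from the sample verbose message; `p = 0.001` is assumed for illustration.

```r
n = 12174948    # lines in the file, from the sample verbose message
p = 0.001       # retention proportion (assumed for this illustration)

expected = n * p                    # mean of a Binomial(n, p) count
std_dev = sqrt(n * p * (1 - p))     # its standard deviation

# How far the 12207 lines in the sample message sit from the mean,
# measured in standard deviations.
z = (12207 - expected) / std_dev
```

Under these assumptions the expected count is about 12175 lines with a standard deviation of about 110, so a result such as 12207 is well within ordinary run-to-run variation.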


--------------------------------------------------------------------------------
/R/utils.r:
--------------------------------------------------------------------------------
 1 | title_case = function(x) gsub(x, pattern="(^|[[:space:]])([[:alpha:]])", replacement="\\1\\U\\2", perl=TRUE)
 2 | 
 3 | abspath = function(file)
 4 | {
 5 |   p = path.expand(file)
 6 |   if (!file.exists(p))
 7 |     stop("file does not exist")
 8 |   
 9 |   p
10 | }
11 | 


--------------------------------------------------------------------------------
/R/wc.r:
--------------------------------------------------------------------------------
 1 | #' Count Letters, Words, and Lines of a File
 2 | #' 
 3 | #' Count the characters, words, and lines of a file, similar to the Unix \code{wc} utility.
 4 | #' 
 5 | #' @details
 6 | #' \code{wc_l()} is a shorthand for counting only lines, similar to \code{wc -l}
 7 | #' in the terminal. Likewise \code{wc_w()} is analogous to \code{wc -w} for
 8 | #' words.
 9 | #' 
10 | #' @param file
11 | #' Location of the file (as a string) from which the counts will be generated.
12 | #' @param chars,words,lines
13 | #' Should char/word/line counts be shown? At least one of the three must be
14 | #' \code{TRUE}.
15 | #' 
16 | #' @return
17 | #' A list containing the requested counts.
18 | #' 
19 | #' @examples
20 | #' library(filesampler)
21 | #' file = system.file("rawdata/small.csv", package="filesampler")
22 | #' data = wc(file=file)
23 | #'
24 | #' @name wc
25 | #' @rdname wc
26 | NULL
27 | 
28 | 
29 | 
30 | #' @useDynLib filesampler R_fs_wc
31 | #' @rdname wc
32 | #' @export
33 | wc = function(file, chars=TRUE, words=TRUE, lines=TRUE)
34 | {
35 |   check.is.string(file)
36 |   check.is.flag(chars)
37 |   check.is.flag(words)
38 |   check.is.flag(lines)
39 |   
40 |   if (!chars && !words && !lines)
41 |     stop("at least one of the arguments 'chars', 'words', or 'lines' must be TRUE")
42 |   
43 |   file = abspath(file)
44 |   ret = .Call(R_fs_wc, file, chars, words, lines)
45 |   
46 |   counts = list(chars=ret[1L], words=ret[2L], lines=ret[3L])
47 |   class(counts) = "wc"
48 |   attr(counts, "file") = file
49 |   
50 |   counts
51 | }
52 | 
53 | 
54 | 
55 | #' @rdname wc
56 | #' @export
57 | wc_w = function(file)
58 | {
59 |   wc(file=file, chars=FALSE, words=TRUE, lines=FALSE)
60 | }
61 | 
62 | 
63 | 
64 | #' @rdname wc
65 | #' @export
66 | wc_l = function(file)
67 | {
68 |   wc(file=file, chars=FALSE, words=FALSE, lines=TRUE)
69 | }
70 | 
71 | 
72 | 
73 | #' @title Print \code{wc} objects
74 | #' @description Printing for \code{wc()}
75 | #' @param x \code{wc} object
76 | #' @param ... unused
77 | #' @name print-wc
78 | #' @rdname print-wc
79 | #' @method print wc
80 | #' @export
81 | print.wc = function(x, ...)
82 | {
83 |   cat("file:  ", attr(x, "file"), "\n")
84 |   
85 |   x = x[which(x != -1)]
86 |   
87 |   maxlen = max(sapply(names(x), nchar))
88 |   names = gsub(names(x), pattern="_", replacement=" ")
89 |   names = title_case(x=names)
90 |   spacenames = simplify2array(lapply(names, function(str) paste0(str, ":", paste0(rep(" ", maxlen-nchar(str)), collapse=""))))
91 |   
92 |   cat(paste(spacenames, x, sep=" ", collapse="\n"), "\n")
93 |   invisible()
94 | }
95 | 
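
As a rough illustration of what a `wc_l()`-style line count involves, the sketch below counts newline bytes in fixed-size chunks using base R only. `count_lines` is a hypothetical helper and `chunk_size` an arbitrary choice; the package's own counter is written in C (with AVX where available, per the ChangeLog) and is not reproduced here.

```r
# Hypothetical helper, not part of filesampler.
count_lines = function(path, chunk_size=65536L)
{
  con = file(path, open="rb")
  on.exit(close(con))
  
  n = 0
  repeat
  {
    buf = readBin(con, what="raw", n=chunk_size)
    if (length(buf) == 0L)
      break
    
    # 0x0a is the newline byte; count its occurrences in this chunk.
    n = n + sum(buf == as.raw(10L))
  }
  
  n
}

tmp = tempfile()
writeLines(c("a b", "c", "d e f"), tmp)
nl = count_lines(tmp)
unlink(tmp)
```

Reading in binary chunks keeps memory use bounded regardless of file size, which is the same reason the exact sampler's first pass can be I/O bound.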


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # filesampler
 2 | 
 3 | * **Version:** 0.4-0
 4 | * **Status:** [![Build Status](https://travis-ci.org/wrathematics/filesampler.png)](https://travis-ci.org/wrathematics/filesampler)
 5 | * **License:** [BSD 2-Clause](http://opensource.org/licenses/BSD-2-Clause)
 6 | * **Project home:** https://github.com/wrathematics/filesampler
 7 | * **Bug reports:** https://github.com/wrathematics/filesampler/issues
 8 | * **Author:** Drew Schmidt
 9 | 
10 | 
11 | 
12 | This is a simple R package to quickly read a random sample of lines of a flat text file (such as a csv) into R.  This allows you to get a subsample into R without having to read the (possibly large) file into memory first.
13 | 
14 | If you would like to contribute to this project, please see the `CONTRIBUTING.md` file.  Idea inspired by Eduardo Arino de la Rubia's [fast_sample](https://github.com/earino/fast_sample).
15 | 
16 | 
17 | 
18 | ## Installation
19 | 
20 | You can install the stable version from [the HPCRAN](https://hpcran.org) using the usual `install.packages()`:
21 | 
22 | ```r
23 | install.packages("filesampler", repos="https://hpcran.org")
24 | ```
25 | 
26 | The development version is maintained on GitHub:
27 | 
28 | 
29 | ```r
30 | remotes::install_github("wrathematics/filesampler")
31 | ```
32 | 
33 | 
34 | 
35 | ## Package Use
36 | 
37 | ```r
38 | library(filesampler)
39 | ret <- sample_csv(file, param=.001)
40 | ```
41 | 
42 | There is also a `sample_lines()` function for reading in (line) subsamples of unstructured text akin to `readLines()`.
43 | 
44 | For more information, including benchmarks and implementation details, please see the package vignette:
45 | 
46 | ```r
47 | vignette("filesampler", package="filesampler")
48 | ```
49 | 
50 | 
51 | 
52 | ## Code Re-Use
53 | 
54 | The C code in the `src/filesampler` tree of this package can easily be re-purposed for use outside of R with some minor modifications.  The components that need to be edited can be found in the file:
55 | 
56 |   * `src/filesampler/utils.h`
57 | 
58 | Detailed explanations are contained there, but you will need:
59 | 
60 | * a uniform random number generator
61 | * a print function (probably `printf()`)
62 | * an interrupt checker
63 | 


--------------------------------------------------------------------------------
/cleanup:
--------------------------------------------------------------------------------
 1 | #! /bin/sh
 2 | 
 3 | rm -rf ./src/Makevars
 4 | 
 5 | rm -rf ./src/*.o
 6 | rm -rf ./src/filesampler/*.o
 7 | 
 8 | rm -rf ./src/*.so*
 9 | rm -rf ./src/*.dll
10 | 


--------------------------------------------------------------------------------
/configure.ac:
--------------------------------------------------------------------------------
 1 | AC_PREREQ([2.69])
 2 | AC_INIT(DESCRIPTION)
 3 | AC_CONFIG_MACRO_DIRS([tools/])
 4 | 
 5 | # Get C compiler from R
 6 | : ${R_HOME=`R RHOME`}
 7 | if test -z "${R_HOME}"; then
 8 |   echo "could not determine R_HOME"
 9 |   exit 1
10 | fi
11 | CC=`"${R_HOME}/bin/R" CMD config CC`
12 | CFLAGS=`"${R_HOME}/bin/R" CMD config CFLAGS`
13 | CPPFLAGS=`"${R_HOME}/bin/R" CMD config CPPFLAGS`
14 | 
15 | 
16 | AC_PROG_CC_C99
17 | AC_OPENMP
18 | if test -n "${OPENMP_CFLAGS}"; then
19 |   have_omp="yes"
20 |   OMP_FLAGS="\$(SHLIB_OPENMP_CFLAGS)"
21 | else
22 |   have_omp="no"
23 |   OMP_FLAGS=""
24 | fi
25 | 
26 | 
27 | 
28 | AX_CHECK_COMPILE_FLAG("-mavx2", [have_avx2="yes"], [have_avx2="no"])
29 | if test "X${have_avx2}" = Xyes; then
30 |   AVX2_FLAGS="-mavx2"
31 | else
32 |   AVX2_FLAGS=""
33 | fi
34 | 
35 | 
36 | echo " "
37 | echo "**************** Results of filesampler package configure ****************"
38 | echo " "
39 | echo "* OpenMP Report"
40 | echo "*   >> Compiler support: ${have_omp}"
41 | echo "*   >> CFLAGS = ${OMP_FLAGS}"
42 | echo "* avx2 Report"
43 | echo "*   >> Compiler support: ${have_avx2}"
44 | echo "*   >> CPPFLAGS = ${AVX2_FLAGS}"
45 | echo "**************************************************************************"
46 | echo " "
47 | 
48 | AC_SUBST(OMP_FLAGS)
49 | AC_SUBST(AVX2_FLAGS)
50 | AC_OUTPUT(src/Makevars)
51 | 


--------------------------------------------------------------------------------
/inst/CITATION:
--------------------------------------------------------------------------------
 1 | year <- sub("-.*", "", meta$Date)
 2 | note <- sprintf("{R} package version %s", meta$Version)
 3 | 
 4 | citEntry(
 5 |   entry = "Misc",
 6 |   title = "{filesampler}: File Line Sampler",
 7 |   author = personList(as.person("Drew Schmidt")),
 8 |   year = year,
 9 |   note = note,
10 |   url = "https://cran.r-project.org/package=filesampler",
11 |   textVersion = NULL,
12 |   key = "filesamplerPackage"
13 | )
14 | 
15 | citEntry(
16 |   entry = "Manual",
17 |   title = "Introduction to the {filesampler} Package",
18 |   author = personList(as.person("Drew Schmidt")),
19 |   year = "2016",
20 |   note = "{R} Vignette",
21 |   url = "https://cran.r-project.org/package=filesampler",
22 |   textVersion = NULL,
23 |   key = "filesamplerVignette"
24 | )
25 | 


--------------------------------------------------------------------------------
/inst/benchmarks/.gitignore:
--------------------------------------------------------------------------------
1 | big.csv
2 | 


--------------------------------------------------------------------------------
/inst/benchmarks/README:
--------------------------------------------------------------------------------
1 | These benchmarks require the use of a reasonably large csv file. You
2 | can generate one by running the script `makebig` found in this directory.
3 | 


--------------------------------------------------------------------------------
/inst/benchmarks/makebig:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | startfile=../rawdata/small.csv
 4 | 
 5 | rm -rf med.csv
 6 | rm -rf big.csv
 7 | 
 8 | len=`wc -l < $startfile`
 9 | 
10 | head -n 1 $startfile >> med.csv
11 | 
12 | for ((i=0; i<160; i++)); do
13 |   tail -n $(( $len - 1 )) $startfile >> med.csv
14 | done
15 | 
16 | 
17 | 
18 | len=`wc -l < med.csv`
19 | 
20 | head -n 1 med.csv >> big.csv
21 | 
22 | for ((i=0; i<1000; i++)); do
23 |   tail -n $(( $len - 1 )) med.csv >> big.csv
24 | done
25 | 
26 | 
27 | rm -rf med.csv
28 | 


--------------------------------------------------------------------------------
/inst/benchmarks/sampler.r:
--------------------------------------------------------------------------------
 1 | library(filesampler)
 2 | 
 3 | 
 4 | system.time({
 5 |     x <- sample_csv("big.csv", method="exact", param=15000, verbose=TRUE)
 6 | })
 7 | 
 8 | 
 9 | system.time({
10 |     x <- sample_csv("big.csv", param=.001, verbose=TRUE)
11 | })
12 | 


--------------------------------------------------------------------------------
/inst/benchmarks/wc.r:
--------------------------------------------------------------------------------
1 | library(filesampler)
2 | 
3 | system.time(x <- wc_l("big.csv"))
4 | 
5 | x
6 | 


--------------------------------------------------------------------------------
/inst/rawdata/small.csv:
--------------------------------------------------------------------------------
  1 | "A","B","C","D","E","F"
  2 | 31,"v","A",0.0344356081914157,0.197601571670833,79.807882879395
  3 | 58,"e","J",0.868477246491238,-0.0333204715610167,80.296481680125
  4 | 5,"d","B",0.469008938642219,-0.507745240265884,17.0324605982751
  5 | 17,"z","C",0.263196594081819,-0.894154055132274,49.2029465176165
  6 | 30,"c","T",0.423068210249767,-0.714639366414706,33.1179875787348
  7 | 75,"y","P",0.157697404269129,-1.05675687342485,15.0995138473809
  8 | 30,"m","U",0.663265228504315,-0.102609033060013,46.2487461487763
  9 | 26,"l","S",0.697267947718501,-0.351352275280531,28.2710071466863
 10 | 46,"p","W",0.0227476919535547,-1.26253520594915,71.3425914221443
 11 | 100,"w","I",0.0328309496399015,-2.2742194420414,76.8699974031188
 12 | 7,"i","D",0.953019385691732,-0.526115032090435,91.7900401726365
 13 | 22,"e","C",0.214634924661368,0.0298457049024103,52.8432966885157
 14 | 49,"b","N",0.73877848428674,0.735700914198922,47.3630763380788
 15 | 22,"o","Z",0.930282746674493,1.69084953123455,45.2439673081972
 16 | 82,"i","C",0.95590949524194,0.13338503874271,39.8316274024546
 17 | 42,"w","F",0.559732376364991,-0.222795841608947,52.3474555672146
 18 | 76,"f","L",0.444616575026885,-0.308119184474722,76.059751638677
 19 | 91,"b","J",0.939074153313413,-2.14593634852752,58.5156215843745
 20 | 96,"m","H",0.776781423948705,1.4502067197792,79.699759890791
 21 | 94,"i","C",0.289326652418822,0.1935193313665,55.3891222900711
 22 | 84,"n","F",0.848534285090864,1.03334443771633,95.9706183988601
 23 | 21,"f","Q",0.0960773911792785,-0.230727972829568,15.320910315495
 24 | 100,"r","U",0.515066175023094,1.58910198179705,21.091499782633
 25 | 92,"b","B",0.647247697459534,1.20942736228284,34.8187624011189
 26 | 8,"n","A",0.00861937203444541,0.292516057068796,30.1586005091667
 27 | 83,"h","B",0.220667276298627,0.00146272657865459,62.4190913466737
 28 | 33,"q","L",0.0601193737238646,2.01678033592448,88.6710710334592
 29 | 7,"p","K",0.433341701282188,0.795315551223573,33.1174146966077
 30 | 90,"x","D",0.169272042112425,-0.610111017730115,66.5069763478823
 31 | 6,"p","B",0.145372492726892,1.52611855611189,20.8286189078353
 32 | 100,"y","G",0.304517379263416,-1.05102990058431,45.0845562084578
 33 | 90,"s","I",0.602839902974665,1.84877027567577,59.3807370681316
 34 | 82,"d","W",0.257859059143811,1.30801004183715,37.6233650557697
 35 | 59,"w","U",0.146642411127687,-1.69092871059025,14.5103294635192
 36 | 52,"b","K",0.571891283616424,-0.689954480114329,53.2374527514912
 37 | 3,"v","I",0.19160017836839,-0.930770873136812,89.8825674923137
 38 | 52,"o","M",0.823421220993623,0.484253791040398,32.5794443488121
 39 | 20,"f","R",0.511186928488314,-0.150959552705217,38.3852071734145
 40 | 99,"a","C",0.257107607088983,0.543330031852767,16.28650350729
 41 | 63,"k","G",0.641139208804816,0.996895748616495,62.9691184335388
 42 | 87,"y","B",0.201599461957812,0.599122824399696,34.5542569248937
 43 | 84,"o","K",0.25619876710698,-0.134000852338425,41.6240883222781
 44 | 49,"j","S",0.473985320422798,-0.551402560151181,80.1333197625354
 45 | 99,"w","A",0.948514126939699,1.57938969908825,87.4724759580567
 46 | 48,"j","X",0.158811556873843,0.627735521292107,79.1880375985056
 47 | 2,"g","O",0.293799062492326,-0.822778808870904,41.1412728042342
 48 | 69,"c","L",0.754705770406872,0.390535823507935,94.5743023580872
 49 | 55,"z","F",0.478943325113505,-0.364710776856741,61.6175709129311
 50 | 96,"y","D",0.759227265138179,0.205341535526123,63.838811442256
 51 | 12,"g","Z",0.101827198406681,-0.00428392052516273,10.5943197244778
 52 | 87,"f","A",0.654482287121937,-0.322827993815207,88.8147027604282
 53 | 87,"f","R",0.0575539630372077,-0.281809239856949,54.9242287687957
 54 | 48,"j","X",0.504813719773665,0.683105702318004,71.3473828113638
 55 | 40,"x","Z",0.874165998306125,1.36229750817053,58.0776743311435
 56 | 49,"o","O",0.146845723502338,-0.00241839228845686,62.0573334139772
 57 | 67,"c","S",0.706441656686366,1.19764725630863,30.4078086558729
 58 | 13,"p","Z",0.34508125577122,-0.455623686756772,40.096492273733
 59 | 40,"g","X",0.536776724969968,-0.0951942274812661,10.4143589432351
 60 | 63,"r","Z",0.903075464535505,0.502183970814897,58.7581716710702
 61 | 13,"k","Q",0.831190430326387,1.86223311514226,87.2440657229163
 62 | 45,"q","S",0.851588690653443,-0.993020661598526,33.6388454330154
 63 | 49,"m","P",0.647244467865676,-0.59059429990917,91.2623915518634
 64 | 94,"i","Y",0.0192667245864868,-0.243914617726999,10.9829535800964
 65 | 77,"r","G",0.248602577485144,0.356037725132404,42.5667992956005
 66 | 94,"e","O",0.953355941688642,-0.502098317746785,52.9305877559818
 67 | 27,"q","J",0.23831829521805,-0.640516447399472,94.6382070984691
 68 | 72,"n","F",0.407367554027587,-0.0713510222814287,90.4908181494102
 69 | 28,"n","W",0.573690160643309,1.48579320914896,72.228856375441
 70 | 1,"c","K",0.762256102170795,0.615465325415559,88.3274596068077
 71 | 83,"q","H",0.0123868996743113,0.680072131251977,84.4729636912234
 72 | 73,"r","R",0.0478137016762048,1.01801018855128,10.6682984228246
 73 | 11,"s","L",0.803982262965292,-0.19969314014496,19.4303638185374
 74 | 54,"b","G",0.155096606584266,-0.933558833688986,17.3642539186403
 75 | 78,"z","Z",0.792126497719437,-0.534543097232384,34.1601846390404
 76 | 88,"q","V",0.366345942718908,-0.114730600414802,92.6017211284488
 77 | 7,"t","H",0.428722943877801,1.57858338038043,60.1827564346604
 78 | 8,"w","M",0.0831752363592386,1.28651271683734,29.9501179973595
 79 | 4,"j","X",0.141073858132586,1.34795892329137,36.4506742311642
 80 | 87,"g","S",0.260312086436898,-0.0554640013763592,92.6275063771755
 81 | 14,"w","H",0.190684856381267,0.221300020385139,27.2539445431903
 82 | 93,"a","E",0.168578451033682,-0.696105815815523,47.6984555623494
 83 | 72,"z","T",0.909249585820362,-0.651621286646339,70.1360652432777
 84 | 88,"t","Q",0.55341205233708,-0.803369540743316,68.9707933459431
 85 | 66,"m","P",0.697604088578373,-1.6475732415293,52.6957668946125
 86 | 73,"v","X",0.586017430527136,0.321069618408937,29.9387905886397
 87 | 74,"x","P",0.104086846811697,0.457174348627436,94.8964916937985
 88 | 63,"l","V",0.564066298073158,1.63904016175932,77.7651392738335
 89 | 67,"c","A",0.317764449398965,1.65236337759661,66.9126302446239
 90 | 38,"v","V",0.762690836330876,-0.907516528439995,18.7640961213037
 91 | 5,"j","D",0.258540550014004,-1.4848916678904,44.5786042907275
 92 | 27,"u","G",0.616374830948189,-1.04993173547399,16.9220584654249
 93 | 64,"c","O",0.180769482627511,0.109428632280286,73.9784279139712
 94 | 23,"k","J",0.779640080407262,0.797267568837994,59.7707860195078
 95 | 45,"u","Z",0.914203838445246,-0.358003181560163,12.4600219866261
 96 | 95,"v","E",0.533451511291787,0.386028503446159,70.4833873175085
 97 | 5,"z","N",0.4724566496443,0.895458647868482,21.5520231472328
 98 | 8,"j","T",0.316375133348629,0.129911444978728,61.0537318745628
 99 | 67,"w","P",0.231593899428844,0.712171107833738,47.2880723653361
100 | 91,"w","M",0.69922315934673,0.763018783592163,11.7304084543139
101 | 32,"r","E",0.143097473075613,-1.67635300018962,99.3239515228197
102 | 


--------------------------------------------------------------------------------
/man/file_sample_exact.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/file_sample_exact.r
 3 | \name{file_sample_exact}
 4 | \alias{file_sample_exact}
 5 | \title{Exact File Sampler}
 6 | \usage{
 7 | file_sample_exact(
 8 |   nlines,
 9 |   infile,
10 |   outfile = tempfile(),
11 |   header = TRUE,
12 |   nskip = 0,
13 |   verbose = FALSE
14 | )
15 | }
16 | \arguments{
17 | \item{nlines}{The (exact) number of lines to sample from the input file.}
18 | 
19 | \item{infile}{Location of the file (as a string) to be subsampled.}
20 | 
21 | \item{outfile}{Output file location (as a string).}
22 | 
23 | \item{header}{Is a header (line of column names) on the first line of the csv file?}
24 | 
25 | \item{nskip}{Number of lines to skip. If \code{header=TRUE}, then this only applies to
26 | lines after the header.}
27 | 
28 | \item{verbose}{Should linecounts of the input file and the number of lines sampled be
29 | printed?}
30 | }
31 | \value{
32 | \code{NULL}
33 | }
34 | \description{
35 | Randomly sample lines from an input text file.
36 | }
37 | \details{
38 | The sampling is done in two passes of the input file.  First, the number of
39 | lines in the input file is determined by scanning through the file as
40 | quickly as possible (i.e., it should be completely I/O bound).  Next, an
41 | index of lines to keep is produced by reservoir sampling.  Finally, the
42 | input file is scanned again line by line, with the chosen lines written to
43 | the output file (a temporary file by default).
44 | 
45 | If the output file (the one named by \code{outfile}) is
46 | "large" and to be read into memory (which isn't really appropriate for text
47 | files in the first place!), then this strategy is probably not appropriate.
48 | }
49 | 
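
The reservoir-sampling step described in the details above can be sketched in C as follows. This is a minimal illustration of the textbook "Algorithm R", not the package's actual implementation: the names `reservoir_indices` and `unif_below` are hypothetical, and `rand()` stands in for R's RNG, which the package uses instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: roughly uniform integer in [0, bound), bound > 0.
   rand() is a stand-in for R's RNG and limits the granularity of the draw. */
static uint64_t unif_below(uint64_t bound)
{
  double u = rand() / (RAND_MAX + 1.0);  /* u lies in [0, 1) */
  return (uint64_t) (u * (double) bound);
}

/* Fill keep[0..k-1] with k distinct line indices drawn from 0..n-1.
   Returns 0 on success, or -1 if more lines are requested than exist. */
int reservoir_indices(uint64_t n, uint64_t k, uint64_t *keep)
{
  assert(keep != NULL || k == 0);

  if (k > n)
    return -1;

  /* Start with the first k lines in the reservoir. */
  for (uint64_t i = 0; i < k; i++)
    keep[i] = i;

  /* Line i replaces a random reservoir slot with probability k/(i+1). */
  for (uint64_t i = k; i < n; i++)
  {
    uint64_t j = unif_below(i + 1);
    if (j < k)
      keep[j] = i;
  }

  return 0;
}
```

With `n` taken from the first pass over the file, the second pass would write out only those lines whose index appears in `keep`; sorting `keep` first makes that a single sequential scan.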


--------------------------------------------------------------------------------
/man/file_sample_prop.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/file_sample_prob.r
 3 | \name{file_sample_prop}
 4 | \alias{file_sample_prop}
 5 | \title{Proportional File Sampler}
 6 | \usage{
 7 | file_sample_prop(
 8 |   p,
 9 |   infile,
10 |   outfile = tempfile(),
11 |   header = TRUE,
12 |   nskip = 0,
13 |   nmax = 0,
14 |   verbose = FALSE
15 | )
16 | }
17 | \arguments{
18 | \item{p}{Proportion to retain; should be a numeric value between 0 and 1.}
19 | 
20 | \item{infile}{Location of the file (as a string) to be subsampled.}
21 | 
22 | \item{outfile}{Output file location (as a string).}
23 | 
24 | \item{header}{Is a header (line of column names) on the first line of the csv file?}
25 | 
26 | \item{nskip}{Number of lines to skip. If \code{header=TRUE}, then this only applies to
27 | lines after the header.}
28 | 
29 | \item{nmax}{Max number of lines to read.  If \code{nmax==0}, then there is no read cap.}
30 | 
31 | \item{verbose}{Should linecounts of the input file and the number of lines sampled be
32 | printed?}
33 | }
34 | \value{
35 | \code{NULL}
36 | }
37 | \description{
38 | Randomly sample lines from an input text file.
39 | }
40 | \details{
41 | The sampling is done in one pass of the input file, dumping lines to a
42 | temporary file as the input is read.
43 | 
44 | If the output file (the one named by \code{outfile}) is
45 | "large" and to be read into memory (which isn't really appropriate for text
46 | files in the first place!), then this strategy is probably not appropriate.
47 | }
48 | 
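
The single-pass strategy described in the details above can be sketched in C as follows. This is an illustration under stated assumptions rather than the package's code: the function name `sample_lines_prop` and the 4096-byte buffer are hypothetical, `rand()` stands in for R's RNG, and header, `nskip` and `nmax` handling are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy each line of `in` to `out` independently with probability p.
   Returns the number of lines written, or -1 if p is outside [0, 1]. */
long sample_lines_prop(FILE *in, FILE *out, double p)
{
  char buf[4096];          /* illustrative size; long lines span several reads */
  int at_line_start = 1;   /* true when the next fgets() begins a new line */
  int keep = 0;
  long written = 0;

  assert(in != NULL && out != NULL);

  if (p < 0.0 || p > 1.0)
    return -1;

  while (fgets(buf, sizeof buf, in) != NULL)
  {
    /* Decide once per line, not once per buffer fill. */
    if (at_line_start)
    {
      keep = (rand() / (RAND_MAX + 1.0)) < p;
      if (keep)
        written++;
    }

    if (keep)
      fputs(buf, out);

    size_t len = strlen(buf);
    at_line_start = (len > 0 && buf[len - 1] == '\n');
  }

  return written;
}
```

Because the keep/drop decision is made per line as it is read, the input length never needs to be known in advance, which is why the number of lines retained is only approximately `p` times the line count.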


--------------------------------------------------------------------------------
/man/filesampler.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/filesampler-package.r
 3 | \docType{package}
 4 | \name{filesampler}
 5 | \alias{filesampler}
 6 | \title{File Sampler}
 7 | \description{
 8 | File Sampler
 9 | }
10 | \details{
11 | A simple package for reading subsamples of flat text files by line in a
12 | reasonably efficient manner.  We do so by sampling as the input file is
13 | scanned and randomly choosing whether or not to dump the current line to an
14 | external temporary file.  This temporary file is then read back into R.  For
15 | (aggressive) downsampling, this is a very effective strategy; for resampling,
16 | you are much better off reading the full dataset into memory.
17 | 
18 | The basic function performs the sampling/reading in a single pass. There is
19 | an alternative version which allows for an exact amount to be subsampled via
20 | reservoir sampling, but this version requires 2 passes through the data.  The
21 | heavy lifting is done entirely in C.
22 | 
23 | This package, including the underlying C library, is licensed under the
24 | permissive 2-clause BSD license.
25 | }
26 | \references{
27 | Project URL: \url{https://github.com/wrathematics/filesampler}
28 | }
29 | \author{
30 | Drew Schmidt \email{wrathematics@gmail.com}
31 | }
32 | \keyword{package}
33 | 


--------------------------------------------------------------------------------
/man/print-wc.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/wc.r
 3 | \name{print-wc}
 4 | \alias{print-wc}
 5 | \alias{print.wc}
 6 | \title{Print \code{wc} objects}
 7 | \usage{
 8 | \method{print}{wc}(x, ...)
 9 | }
10 | \arguments{
11 | \item{x}{\code{wc} object}
12 | 
13 | \item{...}{unused}
14 | }
15 | \description{
16 | Printing for \code{wc()}
17 | }
18 | 


--------------------------------------------------------------------------------
/man/sample_csv.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/sample_csv.r
 3 | \name{sample_csv}
 4 | \alias{sample_csv}
 5 | \title{Read Sample of CSV}
 6 | \usage{
 7 | sample_csv(
 8 |   file,
 9 |   param,
10 |   method = "proportional",
11 |   reader = utils::read.csv,
12 |   header = TRUE,
13 |   nskip = 0,
14 |   nmax = 0,
15 |   verbose = FALSE,
16 |   ...
17 | )
18 | }
19 | \arguments{
20 | \item{file}{Location of the file (as a string) to be subsampled.}
21 | 
22 | \item{param}{The downsampling parameter. For the "proportional" method, this is the
23 | proportion to retain and should be a numeric value between 0 and 1. For the
24 | exact method, this is the total number of lines to read in.}
25 | 
26 | \item{method}{A string indicating the type of read method to use. Options are
27 | "proportional" and "exact".}
28 | 
29 | \item{reader}{A function specifying the reader to use. The default is 
30 | \code{utils::read.csv}. Other options include \code{data.table::fread()} and
31 | \code{readr::read_csv()}.  Note the first argument of the reader should be
32 | the file to read in and the second should be the
33 | \code{header}/\code{col_names} argument.  This would require writing a small
34 | wrapper for \code{fread()}.}
35 | 
36 | \item{header}{Is a header (line of column names) on the first line of the csv file?}
37 | 
38 | \item{nskip}{Number of lines to skip. If \code{header=TRUE}, then this only applies to
39 | lines after the header.}
40 | 
41 | \item{nmax}{Max number of lines to read. If nmax==0, then there is no read cap. Ignored
42 | if \code{method="exact"}.}
43 | 
44 | \item{verbose}{Should linecounts of the input file and the number of lines sampled be
45 | printed?}
46 | 
47 | \item{...}{Additional arguments passed to the csv reader.}
48 | }
49 | \value{
50 | A dataframe, as with \code{read.csv()}.
51 | }
52 | \description{
53 | With the proportional method, the function will read (as csv) approximately
54 | p*nlines lines, where p is the value of \code{param}. So if \code{param=.1}, then
55 | we will get roughly (probably not exactly) 10\% of the data.  This is the analogue of the base R function \code{read.csv()}.
56 | }
57 | \details{
58 | This function scans over the text of the input file and at each step,
59 | randomly chooses whether or not to include the current line in a
60 | downsampled file. Each selected line is placed in a temporary file, before
61 | being read into R via \code{read.csv()}.  Additional arguments to this
62 | function (those passed via \code{...}) are
63 | passed to \code{read.csv()}, and so if their behavior is unclear, you should
64 | examine the \code{read.csv()} help file.
65 | 
66 | If \code{verbose=TRUE}, then something like:
67 | 
68 | \code{Read 12207 lines (0.001\%) of 12174948 line file.}
69 | 
70 | will be printed to the terminal. This counts the header (if there is one) as
71 | one of the lines read and as one of the lines possible.
72 | }
73 | \examples{
74 | library(filesampler)
75 | file = system.file("rawdata/small.csv", package="filesampler")
76 | 
77 | # Read in a 5\% random subsample of the rows.
78 | data = sample_csv(file, param=.05)
79 | 
80 | # Read in 10 randomly sampled rows.
81 | data = sample_csv(file, param=10, method="exact")
82 | 
83 | }
84 | 


--------------------------------------------------------------------------------
/man/sample_lines.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/sample_lines.r
 3 | \name{sample_lines}
 4 | \alias{sample_lines}
 5 | \title{Read Sample Lines of Text File}
 6 | \usage{
 7 | sample_lines(file, n = -1L, p = 0.1, nskip = 0, nmax = 0, verbose = FALSE, ...)
 8 | }
 9 | \arguments{
10 | \item{file}{Location of the file (as a string) to be subsampled.}
11 | 
12 | \item{n}{As in \code{readLines()}.}
13 | 
14 | \item{p}{Proportion to retain; should be a numeric value between 0 and 1.}
15 | 
16 | \item{nskip}{Number of lines to skip.}
17 | 
18 | \item{nmax}{Max number of lines to read.  If nmax==0, then there is no read cap.}
19 | 
20 | \item{verbose}{Logical; indicates whether or not linecounts of the input file and the number
21 | of lines sampled should be printed.}
22 | 
23 | \item{...}{Additional arguments passed to \code{readLines()}.}
24 | }
25 | \value{
26 | A character vector, as with \code{readLines()}.
27 | }
28 | \description{
29 | The function will read approximately p*nlines lines of a flat text
30 | file. So if \code{p=.1}, then we will get roughly (probably not exactly)
31 | 10\% of the data.  This is the analogue of the base R function
32 | \code{readLines()}.
33 | }
34 | \details{
35 | This function scans over the text of the input file and at each step, randomly
36 | chooses whether or not to include the current line in a downsampled file.
37 | Each selected line is placed in a temporary file, before being read into R
38 | via \code{readLines()}.  Additional arguments to this function (those other
39 | than \code{file}, \code{p}, and \code{verbose}) are passed to \code{readLines()},
40 | and so if their behavior is unclear, you should examine the \code{readLines()}
41 | help file.
42 | 
43 | If \code{verbose=TRUE}, then something like:
44 | 
45 | \code{Read 12207 lines (0.001\%) of 12174948 line file.}
46 | 
47 | will be printed to the terminal.  This counts the header (if there is one)
48 | as one of the lines read and as one of the lines possible.
49 | }
50 | \examples{
51 | library(filesampler)
52 | file = system.file("rawdata/small.csv", package="filesampler")
53 | sample_lines(file, p=.05)
54 | 
55 | }
56 | 


--------------------------------------------------------------------------------
/man/wc.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/wc.r
 3 | \name{wc}
 4 | \alias{wc}
 5 | \alias{wc_w}
 6 | \alias{wc_l}
 7 | \title{Count Letters, Words, and Lines of a File}
 8 | \usage{
 9 | wc(file, chars = TRUE, words = TRUE, lines = TRUE)
10 | 
11 | wc_w(file)
12 | 
13 | wc_l(file)
14 | }
15 | \arguments{
16 | \item{file}{Location of the file (as a string) from which the counts will be generated.}
17 | 
18 | \item{chars, words, lines}{Should char/word/line counts be shown? At least one of the three must be
19 | \code{TRUE}.}
20 | }
21 | \value{
22 | A list containing the requested counts.
23 | }
24 | \description{
25 | Counts the characters, words, and/or lines of a file.
26 | }
27 | \details{
28 | \code{wc_l()} is a shorthand for counting only lines, similar to \code{wc -l}
29 | in the terminal. Likewise \code{wc_w()} is analogous to \code{wc -w} for
30 | words.
31 | }
32 | \examples{
33 | library(filesampler)
34 | file = system.file("rawdata/small.csv", package="filesampler")
35 | data = wc(file=file)
36 | 
37 | }
38 | 
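
For readers curious what a `wc -l`-style count involves, here is a minimal C sketch that counts newline characters in fixed-size chunks. It is an assumption-laden illustration, not the package's `wc.c`: the name `count_lines` and the 64 KiB buffer are hypothetical, and no SIMD or OpenMP is used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Count newline characters from the current position of `fp` to EOF,
   which is what `wc -l` reports. */
uint64_t count_lines(FILE *fp)
{
  char buf[65536];   /* illustrative chunk size */
  uint64_t n = 0;
  size_t got;

  assert(fp != NULL);

  while ((got = fread(buf, 1, sizeof buf, fp)) > 0)
  {
    const char *p = buf;
    const char *end = buf + got;

    /* memchr() jumps to the next '\n' within the bytes actually read. */
    while ((p = memchr(p, '\n', (size_t) (end - p))) != NULL)
    {
      n++;
      p++;
    }
  }

  return n;
}
```

As with `wc -l`, a final line that lacks a trailing newline is not counted.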


--------------------------------------------------------------------------------
/src/Makevars.in:
--------------------------------------------------------------------------------
 1 | PKG_CPPFLAGS = @AVX2_FLAGS@
 2 | PKG_CFLAGS = @OMP_FLAGS@
 3 | PKG_LIBS = @OMP_FLAGS@
 4 | 
 5 | FS_OBJECTS = filesampler/file_sampler.o filesampler/wc.o
 6 | R_OBJECTS = filesampler_native.o samplers.o wc.o
 7 | OBJECTS = $(FS_OBJECTS) $(R_OBJECTS)
 8 | 
 9 | all: $(SHLIB)
10 | 
11 | $(SHLIB): $(OBJECTS)
12 | 


--------------------------------------------------------------------------------
/src/Rfilesampler.h:
--------------------------------------------------------------------------------
 1 | #ifndef R_FILESAMPLER_H_
 2 | #define R_FILESAMPLER_H_
 3 | 
 4 | 
 5 | #include <R.h>
 6 | #include <Rinternals.h>
 7 | 
 8 | #define CHARPT(x,i) ((char*)CHAR(STRING_ELT(x,i)))
 9 | #define INT(x) INTEGER(x)[0]
10 | #define DBL(x) REAL(x)[0]
11 | 
12 | 
13 | #endif
14 | 


--------------------------------------------------------------------------------
/src/filesampler/Makefile:
--------------------------------------------------------------------------------
 1 | CC = gcc
 2 | CFLAGS = -fopenmp -O3 -std=c99 -fPIC -Wall -Wno-unused-function
 3 | 
 4 | OBJECTS = file_sampler.o wc.o
 5 | 
 6 | all: shlib
 7 | 
 8 | shlib: clean $(OBJECTS)
 9 | 	$(CC) $(CFLAGS) -shared -o libfilesampler.so $(OBJECTS)
10 | 
11 | clean:
12 | 	rm -f ./libfilesampler.so
13 | 


--------------------------------------------------------------------------------
/src/filesampler/check_avx.h:
--------------------------------------------------------------------------------
 1 | #ifndef FILESAMPLER_CHECK_AVX_H
 2 | #define FILESAMPLER_CHECK_AVX_H
 3 | 
 4 | 
 5 | #if defined(__INTEL_COMPILER) && (__INTEL_COMPILER >= 1300)
 6 |   #include <immintrin.h>
 7 |   #define CHECK_FOR_AVX2 _may_i_use_cpu_feature(_FEATURE_AVX2)
 8 | #elif (defined(__GNUC__) || defined(__clang__))
 9 |   #define CHECK_FOR_AVX2 __builtin_cpu_supports("avx2")
10 | #else
11 |   #define CHECK_FOR_AVX2 0
12 | #endif
13 | 
14 | 
15 | static inline int has_avx2(void)
16 | {
17 |   static int check = -1;
18 |   if (check < 0 )
19 |     check = CHECK_FOR_AVX2;
20 |   
21 |   return (check > 0);
22 | }
23 | 
24 | 
25 | #endif
26 | 


--------------------------------------------------------------------------------
/src/filesampler/error.h:
--------------------------------------------------------------------------------
 1 | /*  Copyright (c) 2015-2016, Schmidt
 2 |     All rights reserved.
 3 |     
 4 |     Redistribution and use in source and binary forms, with or without
 5 |     modification, are permitted provided that the following conditions are met:
 6 |     
 7 |     1. Redistributions of source code must retain the above copyright notice,
 8 |     this list of conditions and the following disclaimer.
 9 |     
10 |     2. Redistributions in binary form must reproduce the above copyright
11 |     notice, this list of conditions and the following disclaimer in the
12 |     documentation and/or other materials provided with the distribution.
13 |     
14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | */
26 | 
27 | 
28 | #ifndef FILESAMPLER_ERROR_H_
29 | #define FILESAMPLER_ERROR_H_
30 | 
31 | 
32 | #include "utils.h"
33 | 
34 | // Parameter checks
35 | #define INVALID_PROB    -1
36 | #define INVALID_NSKIP   -2
37 | 
38 | #define INVALID_PROB_MSG  "Invalid `p` specified. Must be between 0 and 1."
39 | #define INVALID_NSKIP_MSG "Invalid `nskip` specified; "
40 | 
41 | 
42 | // Things that went wrong
43 | #define READ_FAIL       -3
44 | #define WRITE_FAIL      -4
45 | #define MALLOC_FAIL     -5
46 | #define USER_INTERRUPT  -6
47 | 
48 | #define READ_FAIL_MSG       "Could not read infile; perhaps it doesn't exist?"
49 | #define WRITE_FAIL_MSG      "Could not generate tempfile for writing for some reason?"
50 | #define MALLOC_FAIL_MSG     "Out of memory."
51 | #define USER_INTERRUPT_MSG  "Process killed by user interrupt."
52 | 
53 | 
54 | static inline void fs_checkret(const int ret)
55 | {
56 |   switch (ret)
57 |   {
58 |     case 0:
59 |       break;
60 |     case INVALID_PROB:
61 |       fs_error_fun(ret, INVALID_PROB_MSG);
62 |       break;
63 |     case INVALID_NSKIP:
64 |       fs_error_fun(ret, INVALID_NSKIP_MSG);
65 |       break;
66 |     case READ_FAIL:
67 |       fs_error_fun(ret, READ_FAIL_MSG);
68 |       break;
69 |     case WRITE_FAIL:
70 |       fs_error_fun(ret, WRITE_FAIL_MSG);
71 |       break;
72 |     case MALLOC_FAIL:
73 |       fs_error_fun(ret, MALLOC_FAIL_MSG);
74 |       break;
75 |     case USER_INTERRUPT:
76 |       fs_error_fun(ret, USER_INTERRUPT_MSG);
77 |       break;
78 |     default:
79 |       fs_error_fun(ret, "Unknown error code; please report this to the developers.");
80 |   }
81 |   
82 |   return;
83 | }
84 | 
85 | 
86 | #endif
87 | 


--------------------------------------------------------------------------------
/src/filesampler/file_sampler.c:
--------------------------------------------------------------------------------
  1 | /*  Copyright (c) 2015-2016, Drew Schmidt
  2 |     All rights reserved.
  3 | 
  4 |     Redistribution and use in source and binary forms, with or without
  5 |     modification, are permitted provided that the following conditions are met:
  6 | 
  7 |     1. Redistributions of source code must retain the above copyright notice,
  8 |     this list of conditions and the following disclaimer.
  9 | 
 10 |     2. Redistributions in binary form must reproduce the above copyright
 11 |     notice, this list of conditions and the following disclaimer in the
 12 |     documentation and/or other materials provided with the distribution.
 13 | 
 14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
 18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 25 | */
 26 | 
 27 | 
 28 | #include <string.h>
 29 | #include <stdio.h>
 30 | #include <stdlib.h>
 31 | 
 32 | #include "filesampler.h"
 33 | #include "safeomp.h"
 34 | #include "utils.h"
 35 | 
 36 | 
 37 | #define HAS_NEWLINE ((readlen > 0) && (buf[readlen-1] == '\n'))
 38 | 
 39 | static inline void read_header(char *buf, FILE *fp_read, FILE *fp_write, uint64_t *nlines_in, uint64_t *nlines_out)
 40 | {
 41 |   size_t readlen;
 42 |   
 43 |   while (fgets(buf, BUFLEN, fp_read) != NULL)
 44 |   {
 45 |     fprintf(fp_write, "%s", buf);
 46 |     
 47 |     readlen = strnlen(buf, BUFLEN);
 48 |     
 49 |     if (HAS_NEWLINE)
 50 |     {
 51 |       (*nlines_in)++;
 52 |       (*nlines_out)++;
 53 |       break;
 54 |     }
 55 |   }
 56 | }
 57 | 
 58 | 
 59 | 
 60 | /**
 61 |  * @file
 62 |  * @brief 
 63 |  * File Sampler
 64 |  *
 65 |  * @details
 66 |  * This function takes an input file and randomly subsamples it at
 67 |  * the given proportion p line-by-line, with the randomly chosen lines
 68 |  * being placed in the given output file.  
 69 |  * 
 70 |  * If the file has many lines, then the size of the output file
 71 |  * should be roughly p*nlines(input).  If you want to specify
 72 |  * an exact number of lines, see file_sampler_exact().
 73 |  *
 74 |  * @param verbose
 75 |  * Input.  Indicates whether character/word/line counts of the input
 76 |  * file (discovered while filling the output) should be printed.
 77 |  * @param header
 78 |  * Input.  Indicates whether or not there is a header line (as in a
 79 |  * csv).
 80 |  * @param nskip
 81 |  * Input.  Number of lines to skip.  If header=true and nskip>0, then
 82 |  * the number of lines skipped applies to post-header lines only.
 83 |  * @param nmax
 84 |  * Input.  Max number of lines to read.  If nmax==0 then there is no max.
 85 |  * @param p
 86 |  * Input.  Proportion of lines from input file to (randomly) retain.
 87 |  * The proportion retained is not guaranteed to be exactly p, but 
 88 |  * will be very close for large files.
 89 |  * @param input
 90 |  * Input.  Absolute path to input file.
 91 |  * @param output
 92 |  * Input.  Absolute path to output file.
 93 |  *
 94 |  * @note
 95 |  * This call uses R's global RNG state and (as written) is not thread-safe.
 96 |  * 
 97 |  * @return
 98 |  * 0 on success; otherwise one of the error codes from error.h (e.g. INVALID_PROB).
 99 |  */
100 | int fs_sample_prop(const bool verbose, const bool header, uint32_t nskip, uint32_t nmax, const double p, const char *input, const char *output)
101 | {
102 |   int ret = 0;
103 |   FILE *fp_read, *fp_write;
104 |   char *buf;
105 |   size_t readlen;
106 |   // Have to track cases where buffer is too small for fgets()
107 |   bool should_write = false;
108 |   bool singleread = true;
109 |   bool checkmax = nmax ? true : false;
110 |   uint64_t nlines_in = 0, nlines_out = 0;
111 |   
112 |   if (p < 0. || p > 1.)
113 |     return INVALID_PROB;
114 |   else if (p == 0.)
115 |   {
116 |     if (verbose)
117 |       PRINTFUN("Read 0 lines of unknown length file (p == 0).\n");
118 |     
119 |     return 0;
120 |   }
121 |   
122 |   
123 |   fp_read = fopen(input, "r");
124 |   if (!fp_read)
125 |     return READ_FAIL;
126 |   
127 |   fp_write = fopen(output, "w");
128 |   if (!fp_write)
129 |   {
130 |     fclose(fp_read);
131 |     return WRITE_FAIL;
132 |   }
133 |   
134 |   buf = malloc(BUFLEN * sizeof(char));
135 |   if (buf == NULL)
136 |   {
137 |     fclose(fp_read);
138 |     fclose(fp_write);
139 |     return MALLOC_FAIL;
140 |   }
141 |   
142 |   
143 |   if (header)
144 |     read_header(buf, fp_read, fp_write, &nlines_in, &nlines_out);
145 |   
146 |   
147 |   STARTRNG;
148 |   
149 |   while (fgets(buf, BUFLEN, fp_read) != NULL)
150 |   {
151 |     if ((nlines_in % INTERRUPT_CHECK_NUM == 0) && check_interrupt())
152 |     {
153 |       ret = USER_INTERRUPT;
154 |       goto cleanup;
155 |     }
156 |     
157 |     if (singleread)
158 |     {
159 |       if (RUNIF < p)
160 |         should_write = true;
161 |       else
162 |         should_write = false;
163 |     }
164 |     
165 |     if (nskip)
166 |       nskip--;
167 |     else if (should_write)
168 |     {
169 |       nlines_out++;
170 |       fprintf(fp_write, "%s", buf);
171 |       
172 |       if (checkmax)
173 |       {
174 |         nmax--;
175 |         if (!nmax) break;
176 |       }
177 |     }
178 |     
179 |     readlen = strnlen(buf, BUFLEN);
180 |     
181 |     // check if multiple reads for this one line are needed (i.e., line longer than buffer)
182 |     if (HAS_NEWLINE)
183 |     {
184 |       nlines_in++;
185 |       should_write = false;
186 |       singleread = true;
187 |     }
188 |     else
189 |       singleread = false;
190 |   }
191 |   
192 |   if (verbose)
193 |   {
194 |     if (checkmax && !nmax)
195 |       PRINTFUN("Read nmax=%llu lines of unknown length file.\n", (unsigned long long) (nlines_out - header));
196 |     else
197 |       PRINTFUN("Read %llu lines (%.5f%%) of %llu line file.\n", (unsigned long long) nlines_out, 100. * (double) nlines_out/nlines_in, (unsigned long long) nlines_in);
198 |   }
199 |   
200 |   
201 |   cleanup:
202 |     ENDRNG;
203 |     fclose(fp_read);
204 |     fclose(fp_write);
205 |     free(buf);
206 |   
207 |   return ret;
208 | }
209 | 
210 | 
211 | 
212 | // ------------------------------------------------------
213 | // exact reader
214 | // ------------------------------------------------------
215 | 
216 | static inline int unif_rand_int(const int low, const int high)
217 | {
218 |   return low + (int) ((high + 1 - low)*RUNIF);
219 | }
220 | 
221 | 
222 | 
223 | static int res_sampler(const uint32_t nskip, const uint64_t nlines_in, const uint32_t nlines_out, uint64_t **samp)
224 | {
225 |   *samp = malloc(nlines_out * sizeof(**samp));
226 |   if (*samp == NULL)
227 |     return MALLOC_FAIL;
228 |   
229 |   SAFE_FOR_SIMD
230 |   for (uint32_t i=0; i<nlines_out; i++)
231 |     (*samp)[i] = nskip + i+1;
232 |   
233 |   STARTRNG;
234 |   
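235 |   // sequential on purpose: every step depends on the RNG stream and on earlier writes
236 |   for (uint64_t i=nlines_out; i<nlines_in; i++)
237 |   {
238 |     uint32_t j = unif_rand_int(0, i); // uniform on 0..i, so this line is kept with probability nlines_out/(i+1)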
239 |     if (j < nlines_out)
240 |       (*samp)[j] = nskip + i+1;
241 |   }
242 |   
243 |   ENDRNG;
244 |   
245 |   return 0;
246 | }
247 | 
248 | 
249 | 
250 | static int comp(const void *a, const void *b)
251 | {
252 |   return (*(const uint64_t*)a > *(const uint64_t*)b) - (*(const uint64_t*)a < *(const uint64_t*)b);
253 | }
254 | 
255 | 
256 | 
257 | /**
258 |  * @file
259 |  * @brief 
260 |  * File Sampler (Exact)
261 |  *
262 |  * @details
263 |  * This function takes an input file and randomly subsamples
264 |  * line-by-line producing exactly n lines in the subsample, with 
265 |  * the randomly chosen lines being placed in the given output file.  
266 |  * 
267 |  * If the file has many lines, it's probably just as good (and 
268 |  * certainly much faster) to instead use fs_sample_prop(), which
269 |  * randomly subsamples at a proportion.
270 |  *
271 |  * @param verbose
272 |  * Input.  Indicates whether a summary line should be printed
273 |  * after sampling.
274 |  * @param header
275 |  * Input.  Indicates whether or not there is a header line (as in a csv).
276 |  * @param nskip
277 |  * Input.  Number of lines to skip.  If header=true and nskip>0, then
278 |  * the number of lines skipped applies to post-header lines only.
279 |  * @param nlines_out
280 |  * Input.  The exact number of input lines to (randomly) retain.  The input
281 |  * line count is computed internally with fs_wc().
282 |  * @param input
283 |  * Input.  Absolute path to input file.
284 |  * @param output
285 |  * Input.  Absolute path to output file.
286 |  *
287 |  * @note
288 |  * This call uses R's global RNG state and (as written) is not thread-safe.
289 |  * 
290 |  * @return
291 |  * 0 on success; otherwise one of the error codes from error.h (e.g. INVALID_NSKIP).
292 |  */
293 | int fs_sample_exact(const bool verbose, const bool header, const uint32_t nskip, uint64_t nlines_out, const char *input, const char *output)
294 | {
295 |   int ret;
296 |   FILE *fp_read, *fp_write;
297 |   char *buf;
298 |   size_t readlen;
299 |   // Have to track cases where buffer is too small for fgets()
300 |   bool should_write = false;
301 |   bool singleread = true;
302 |   uint64_t *samp;
303 |   uint64_t nlines_in;
304 |   uint64_t current_line = 0;
305 |   uint64_t lines_read = 0;
306 |   
307 |   
308 |   ret = fs_wc(input, false, NULL, false, NULL, true, &nlines_in);
309 |   if (ret)
310 |     return ret;
311 |   
312 |   
313 |   if (nskip > nlines_in)
314 |     return INVALID_NSKIP;
315 |   
316 |   fp_read = fopen(input, "r");
317 |   if (!fp_read) 
318 |     return READ_FAIL;
319 |   
320 |   fp_write = fopen(output, "w");
321 |   if (!fp_write)
322 |   {
323 |     fclose(fp_read);
324 |     return WRITE_FAIL;
325 |   }
326 |   
327 |   buf = malloc(BUFLEN * sizeof(char));
328 |   if (buf == NULL)
329 |   {
330 |     fclose(fp_read);
331 |     fclose(fp_write);
332 |     return MALLOC_FAIL;
333 |   }
334 |   
335 |   
336 |   if (header)
337 |     read_header(buf, fp_read, fp_write, &nlines_in, &nlines_out);
338 |   
339 |   ret = res_sampler(nskip, nlines_in, nlines_out, &samp);
340 |   if (ret) 
341 |     goto cleanup;
342 |   
343 |   qsort(samp, nlines_out, sizeof(uint64_t), comp);
344 |   
345 |   STARTRNG;
346 |   
347 |   
348 |   while (fgets(buf, BUFLEN, fp_read) != NULL)
349 |   {
350 |     if (lines_read == nlines_out)
351 |       break;
352 |     
353 |     if ((current_line % INTERRUPT_CHECK_NUM == 0) && check_interrupt())
354 |     {
355 |       ret = USER_INTERRUPT;
356 |       goto fullcleanup;
357 |     }
358 |     
359 |     if (singleread)
360 |     {
361 |       if (samp[lines_read] == current_line)
362 |         should_write = true;
363 |       else
364 |         should_write = false;
365 |     }
366 |     
367 |     if (should_write)
368 |     {
369 |       lines_read++;
370 |       fprintf(fp_write, "%s", buf);
371 |     }
372 |     
373 |     readlen = strnlen(buf, BUFLEN);
374 |     
375 |     if (HAS_NEWLINE)
376 |     {
377 |       current_line++;
378 |       should_write = false;
379 |       singleread = true;
380 |     }
381 |     else
382 |       singleread = false;
383 |   }
384 |   
385 |   
386 |   if (verbose)
387 |   {
388 |     nlines_in -= header;
389 |     PRINTFUN("Read %llu lines (%.5f%%) of %llu line file.\n", (unsigned long long) nlines_out, 100. * (double) nlines_out/nlines_in, (unsigned long long) nlines_in);
390 |   }
391 |   
392 |   
393 |   fullcleanup:
394 |     ENDRNG;
395 |     free(samp);
396 |   
397 |   cleanup:
398 |     fclose(fp_read);
399 |     fclose(fp_write);
400 |     free(buf);
401 |   
402 |   return ret;
403 | }
404 | 
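
The exact sampler above picks its line numbers with reservoir sampling. As a point of comparison, here is a minimal standalone sketch of the textbook version (Algorithm R), not the package's implementation. The names `rand_below`, `reservoir_indices`, and `cmp_u64` are hypothetical, indices are 0-based, and `rand()` stands in for R's RNG.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: uniform integer on 0..n-1, built on rand().
   rand() has only RAND_MAX+1 distinct values, so this is coarse for large n. */
static uint64_t rand_below(uint64_t n)
{
  /* rand() / (RAND_MAX + 1.0) lies in [0, 1), so the product is below n */
  return (uint64_t) (n * (rand() / (RAND_MAX + 1.0)));
}

/* Fill samp[0..k-1] with k distinct indices drawn from 0..n-1.
   Returns 0 on success, -1 if k > n. */
static int reservoir_indices(uint64_t n, uint64_t k, uint64_t *samp)
{
  if (k > n)
    return -1;

  /* start with the first k indices in the reservoir */
  for (uint64_t i = 0; i < k; i++)
    samp[i] = i;

  /* index i replaces a random slot with probability k/(i+1) */
  for (uint64_t i = k; i < n; i++)
  {
    uint64_t j = rand_below(i + 1);  /* uniform on 0..i inclusive */
    if (j < k)
      samp[j] = i;
  }

  return 0;
}

/* qsort() comparator that avoids truncating a 64-bit difference to int */
static int cmp_u64(const void *a, const void *b)
{
  const uint64_t x = *(const uint64_t *) a;
  const uint64_t y = *(const uint64_t *) b;
  return (x > y) - (x < y);
}
```

Sorting the result with `qsort(samp, k, sizeof(*samp), cmp_u64)` puts the chosen indices in file order, so a single forward pass over the input can emit them.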


--------------------------------------------------------------------------------
/src/filesampler/filesampler.h:
--------------------------------------------------------------------------------
 1 | /*  Copyright (c) 2015-2016, Schmidt
 2 |     All rights reserved.
 3 |     
 4 |     Redistribution and use in source and binary forms, with or without
 5 |     modification, are permitted provided that the following conditions are met:
 6 |     
 7 |     1. Redistributions of source code must retain the above copyright notice,
 8 |     this list of conditions and the following disclaimer.
 9 |     
10 |     2. Redistributions in binary form must reproduce the above copyright
11 |     notice, this list of conditions and the following disclaimer in the
12 |     documentation and/or other materials provided with the distribution.
13 |     
14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | */
26 | 
27 | 
28 | #ifndef FILESAMPLER_H_
29 | #define FILESAMPLER_H_
30 | 
31 | 
32 | #include <stdbool.h>
33 | #include <stdint.h>
34 | 
35 | #include "error.h"
36 | 
37 | #define BUFLEN 8192
38 | #define INTERRUPT_CHECK_NUM 1024
39 | 
40 | // file_sampler.c
41 | int fs_sample_prop(const bool verbose, const bool header, uint32_t nskip, uint32_t nmax, const double p, const char *input, const char *output);
42 | int fs_sample_exact(const bool verbose, const bool header, const uint32_t nskip, uint64_t nlines_out, const char *input, const char *output);
43 | 
44 | // wc.c
45 | int fs_wc(const char *file, const bool chars, uint64_t *nchars, const bool words, uint64_t *nwords, const bool lines, uint64_t *nlines);
46 | 
47 | 
48 | #endif
49 | 


--------------------------------------------------------------------------------
/src/filesampler/rebalance.c:
--------------------------------------------------------------------------------
 1 | /*  Copyright (c) 2015-2018, Drew Schmidt
 2 |     All rights reserved.
 3 | 
 4 |     Redistribution and use in source and binary forms, with or without
 5 |     modification, are permitted provided that the following conditions are met:
 6 | 
 7 |     1. Redistributions of source code must retain the above copyright notice,
 8 |     this list of conditions and the following disclaimer.
 9 | 
10 |     2. Redistributions in binary form must reproduce the above copyright
11 |     notice, this list of conditions and the following disclaimer in the
12 |     documentation and/or other materials provided with the distribution.
13 | 
14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | */
26 | 
27 | 
28 | #include <string.h>
29 | #include <stdio.h>
30 | #include <stdlib.h>
31 | 
32 | #include "filesampler.h"
33 | #include "utils.h"
34 | 
35 | 
36 | #define HEADER_ALL 0
37 | #define HEADER_FIRST 1
38 | #define HEADER_NONE 2
39 | 
40 | 
41 | int fs_rebalance(const uint32_t num_outfiles, const uint32_t num_infiles, const char **infiles, const int inheader, const int outheader, const char *outdir)
42 | {
43 |   int ret = 0;
44 |   FILE *fp_read, *fp_write;
45 |   char *buf;
46 |   uint64_t total_lines = 0;
47 |   uint64_t outfile_lines, rem;
48 |   
49 |   if (num_infiles == num_outfiles)
50 |     return ret;
51 |   
52 |   
53 |   buf = malloc(BUFLEN * sizeof(char));
54 |   if (buf == NULL)
55 |     return MALLOC_FAIL;
56 |   
57 |   for (uint32_t i=0; i<num_infiles; i++)
58 |   {
59 |     uint64_t nlines;
60 |     ret = fs_wc(infiles[i], false, NULL, false, NULL, true, &nlines);
61 |     if (ret) goto cleanup;
62 |     total_lines += nlines;
63 |   }
64 |   
65 |   outfile_lines = total_lines / num_outfiles;
66 |   rem = total_lines - outfile_lines*num_outfiles;
67 |   
68 |   
69 |   
70 |   
71 |   for (uint32_t i=0; i<num_infiles; i++)
72 |   {
73 |     fp_read = fopen(infiles[i], "r");
74 |     if (!fp_read)
75 |     {
76 |       ret = READ_FAIL;
77 |       goto cleanup;
78 |     }
79 |     
80 |     fp_write = fopen(output, "w");
81 |     if (!fp_write)
82 |     {
83 |       fclose(fp_read);
84 |       ret = WRITE_FAIL;
85 |       goto cleanup;
86 |     }
87 |   }
88 |   
89 |   
90 | cleanup:
91 |   free(buf);
92 |   return ret;
93 | }
94 | 


--------------------------------------------------------------------------------
/src/filesampler/safeomp.h:
--------------------------------------------------------------------------------
 1 | /*  Copyright (c) 2015-2016, Schmidt
 2 |     All rights reserved.
 3 |     
 4 |     Redistribution and use in source and binary forms, with or without
 5 |     modification, are permitted provided that the following conditions are met:
 6 |     
 7 |     1. Redistributions of source code must retain the above copyright notice,
 8 |     this list of conditions and the following disclaimer.
 9 |     
10 |     2. Redistributions in binary form must reproduce the above copyright
11 |     notice, this list of conditions and the following disclaimer in the
12 |     documentation and/or other materials provided with the distribution.
13 |     
14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | */
26 | 
27 | #ifndef __COOP_OMPUTILS_H__
28 | #define __COOP_OMPUTILS_H__
29 | 
30 | 
31 | #define OMP_MIN_SIZE 1000
32 | 
33 | 
34 | #ifdef _OPENMP
35 | #include <omp.h>
36 | #if _OPENMP >= 201307
37 | #define OMP_VER_4
38 | #elif _OPENMP >= 200805
39 | #define OMP_VER_3
40 | #endif
41 | #endif
42 | 
43 | 
44 | // Insert SIMD pragma if supported
45 | #ifdef OMP_VER_4
46 | #define SAFE_SIMD _Pragma("omp simd")
47 | #define SAFE_FOR_SIMD _Pragma("omp for simd")
48 | #define SAFE_PARALLEL_FOR_SIMD _Pragma("omp parallel for simd")
49 | #else
50 | #define SAFE_SIMD 
51 | #define SAFE_FOR_SIMD
52 | #define SAFE_PARALLEL_FOR_SIMD
53 | #endif
54 | 
55 | 
56 | #endif
57 | 


--------------------------------------------------------------------------------
/src/filesampler/utils.h:
--------------------------------------------------------------------------------
 1 | // This file is free and unencumbered software released into the public domain.
 2 | // You may modify it for any purpose with or without attribution.
 3 | // See the Unlicense specification for full details http://unlicense.org/
 4 | 
 5 | // To make the filesampler library useable outside of R, you need to modify
 6 | // every definition in this file appropriately, and delete the R header files.
 7 | // Comments in this file are to help someone porting this to another (non-R)
 8 | // system.
 9 | 
10 | // See utils_sample.h for an example.
11 | 
12 | 
13 | #ifndef FILESAMPLER_UTILS_H_
14 | #define FILESAMPLER_UTILS_H_
15 | 
16 | #define UNUSED(x) (void)(x)
17 | 
18 | // ----------------------------------------------------------------------------
19 | // RNG
20 | // ----------------------------------------------------------------------------
21 | 
22 | #include <R.h>      // delete me
23 | #include <Rmath.h>  // delete me
24 | 
25 | // rng initializer and ender; probably sufficient to set the second def to a blank
26 | #define STARTRNG GetRNGstate()
27 | #define ENDRNG PutRNGstate()
28 | 
29 | // generate a single random uniform number
30 | #define RUNIF unif_rand()
31 | 
32 | 
33 | 
34 | // ----------------------------------------------------------------------------
35 | // Printing
36 | // ----------------------------------------------------------------------------
37 | 
38 | // replace with regular printf() (probably)
39 | #define PRINTFUN Rprintf
40 | 
41 | 
42 | 
43 | // ----------------------------------------------------------------------------
44 | // Interrupt checker
45 | // ----------------------------------------------------------------------------
46 | 
47 | // Write an appropriate check_interrupt() function (or just set to always return
48 | // false if you're lazy).
49 | 
50 | #include <Rinternals.h>   // delete me
51 | #include <R_ext/Utils.h>  // delete me
52 | 
53 | #include <stdbool.h>
54 | 
56 | 
57 | static void check_interrupt_fun(void *ignored)
58 | {
59 |   UNUSED(ignored);
60 |   R_CheckUserInterrupt();
61 | }
62 | 
63 | // returns `true` if user interrupts with C-c, and `false` if not
64 | static inline bool check_interrupt()
65 | {
66 |   return (R_ToplevelExec(check_interrupt_fun, NULL) == FALSE);
67 | }
68 | 
69 | 
70 | 
71 | // ----------------------------------------------------------------------------
72 | // Error handler
73 | // ----------------------------------------------------------------------------
74 | 
75 | // Raise an R error (does not return); a non-R port might print to stderr and exit
76 | static inline void fs_error_fun(const int err_num, const char *err_msg)
77 | {
78 |   UNUSED(err_num);
79 |   error("%s", err_msg);
80 | }
81 | 
82 | 
83 | #endif
84 | 


--------------------------------------------------------------------------------
/src/filesampler/utils_sample.h:
--------------------------------------------------------------------------------
 1 | // This file is free and unencumbered software released into the public domain.
 2 | // You may modify it for any purpose with or without attribution.
 3 | // See the Unlicense specification for full details http://unlicense.org/
 4 | 
 5 | // A sample, modified version of utils.h
 6 | 
 7 | #ifndef FILESAMPLER_UTILS_H_
 8 | #define FILESAMPLER_UTILS_H_
 9 | 
10 | // ----------------------------------------------------------------------------
11 | // RNG
12 | // ----------------------------------------------------------------------------
13 | 
14 | #include <stdbool.h>  // bool
15 | #include <stdlib.h>   // rand(), RAND_MAX, exit()
16 | #define STARTRNG
17 | #define ENDRNG
18 | 
19 | #define RUNIF ((double) rand() / (RAND_MAX + 1.0))
20 | 
21 | 
22 | 
23 | // ----------------------------------------------------------------------------
24 | // Printing
25 | // ----------------------------------------------------------------------------
26 | 
27 | #define PRINTFUN printf
28 | 
29 | 
30 | 
31 | // ----------------------------------------------------------------------------
32 | // Interrupt checker
33 | // ----------------------------------------------------------------------------
34 | 
35 | static inline bool check_interrupt()
36 | {
37 |   return false;
38 | }
39 | 
40 | 
41 | 
42 | // ----------------------------------------------------------------------------
43 | // Error handler
44 | // ----------------------------------------------------------------------------
45 | 
46 | #include <stdio.h>
47 | 
48 | static inline void fs_error_fun(const int err_num, const char *err_msg)
49 | {
50 |   fputs(err_msg, stderr);
51 |   exit(err_num);
52 | }
53 | 
54 | 
55 | #endif
56 | 
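
To show how a non-R build might exercise shims like these, here is a minimal sketch of proportional line sampling that also copes with lines longer than the read buffer. It is an illustration under stated assumptions rather than the library's code: `runif_sketch` and `sample_lines_sketch` are hypothetical names, and `rand()` is only a stand-in RNG.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for RUNIF in a non-R build: uniform on [0, 1). */
static double runif_sketch(void)
{
  return rand() / (RAND_MAX + 1.0);
}

/* Copy each line of `in` to `out` with probability p; returns lines written.
   buf/buflen is caller-provided scratch space and may be shorter than a line. */
static uint64_t sample_lines_sketch(FILE *in, FILE *out, double p, char *buf, int buflen)
{
  bool line_start = true;  /* next fgets() chunk begins a new line */
  bool keep = false;       /* decision for the line currently being read */
  uint64_t written = 0;

  while (fgets(buf, buflen, in) != NULL)
  {
    if (line_start)
    {
      keep = (runif_sketch() < p);  /* one draw per line, not per chunk */
      if (keep)
        written++;
    }

    if (keep)
      fputs(buf, out);

    /* a chunk that ends in '\n' finishes the current line */
    size_t len = strlen(buf);
    line_start = (len > 0 && buf[len - 1] == '\n');
  }

  return written;
}
```

Making the keep/skip decision only at the start of a line means every chunk of an over-long line shares the same fate, so output lines are never truncated or spliced.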


--------------------------------------------------------------------------------
/src/filesampler/wc.c:
--------------------------------------------------------------------------------
  1 | /*  Copyright (c) 2015-2017, Drew Schmidt and Daniel Lemire
  2 |     All rights reserved.
  3 | 
  4 |     Redistribution and use in source and binary forms, with or without
  5 |     modification, are permitted provided that the following conditions are met:
  6 | 
  7 |     1. Redistributions of source code must retain the above copyright notice,
  8 |     this list of conditions and the following disclaimer.
  9 | 
 10 |     2. Redistributions in binary form must reproduce the above copyright
 11 |     notice, this list of conditions and the following disclaimer in the
 12 |     documentation and/or other materials provided with the distribution.
 13 | 
 14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
 18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 25 | */
 26 | 
 27 | 
 28 | #include <ctype.h> // isspace()
 29 | 
 30 | #include "check_avx.h"
 31 | #include "filesampler.h"
 32 | #include "safeomp.h"
 33 | #include "utils.h"
 34 | 
 35 | 
 36 | #ifdef __AVX2__
 37 |   // we have AVX2 support
 38 |   #ifndef _MSC_VER
 39 |     /* Non-Microsoft C/C++-compatible compiler */
 40 |     #include <x86intrin.h> // on some recent GCC, this will declare posix_memalign
 41 |   #else
 42 |     /* Microsoft C/C++-compatible compiler */
 43 |     #include <intrin.h>
 44 |   #endif
 45 | 
 46 |   // http://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/
 47 |   static inline size_t linefeedcount_avx2(char *const restrict buffer, const size_t size)
 48 |   {
 49 |     size_t answer = 0;
 50 |     __m256i cnt = _mm256_setzero_si256();
 51 |     __m256i newline = _mm256_set1_epi8('\n');
 52 |     size_t i = 0;
 53 |     uint8_t tmpbuffer[sizeof(__m256i)];
 54 |     
 55 |     while (i + 32 <= size)
 56 |     {
 57 |       size_t remaining = size - i;
 58 |       size_t howmanytimes =  remaining / 32;
 59 |       
 60 |       if (howmanytimes > 256)
 61 |         howmanytimes = 256;
 62 |       
 63 |       const __m256i *buf = (const __m256i*) (buffer + i);
 64 |       size_t j = 0;
 65 |       
 66 |       for (; j + 3 <  howmanytimes; j+= 4)
 67 |       {
 68 |         __m256i newdata1 = _mm256_lddqu_si256(buf + j);
 69 |         __m256i newdata2 = _mm256_lddqu_si256(buf + j + 1);
 70 |         __m256i newdata3 = _mm256_lddqu_si256(buf + j + 2);
 71 |         __m256i newdata4 = _mm256_lddqu_si256(buf + j + 3);
 72 |         __m256i cmp1 = _mm256_cmpeq_epi8(newline, newdata1);
 73 |         __m256i cmp2 = _mm256_cmpeq_epi8(newline, newdata2);
 74 |         __m256i cmp3 = _mm256_cmpeq_epi8(newline, newdata3);
 75 |         __m256i cmp4 = _mm256_cmpeq_epi8(newline, newdata4);
 76 |         __m256i cnt1 = _mm256_add_epi8(cmp1, cmp2);
 77 |         __m256i cnt2 = _mm256_add_epi8(cmp3, cmp4);
 78 |         cnt = _mm256_add_epi8(cnt, cnt1);
 79 |         cnt = _mm256_add_epi8(cnt, cnt2);
 80 |       }
 81 |       
 82 |       for (; j <  howmanytimes; j++)
 83 |       {
 84 |         __m256i newdata = _mm256_lddqu_si256(buf + j);
 85 |         __m256i cmp = _mm256_cmpeq_epi8(newline, newdata);
 86 |         cnt = _mm256_add_epi8(cnt, cmp);
 87 |       }
 88 |       
 89 |       i += howmanytimes * 32;
 90 |       cnt = _mm256_subs_epi8(_mm256_setzero_si256(), cnt);
 91 |       _mm256_storeu_si256((__m256i *) tmpbuffer, cnt);
 92 |       
 93 |       for (unsigned int k = 0; k < sizeof(__m256i); ++k)
 94 |         answer += tmpbuffer[k];
 95 |       
 96 |       cnt = _mm256_setzero_si256();
 97 |     }
 98 |     
 99 |     for (; i < size; i++)
100 |     {
101 |       if (buffer[i] == '\n')
102 |         answer++;
103 |     }
104 |     
105 |     return answer;
106 |   }
107 | #endif
108 | 
109 | 
110 | 
111 | static inline size_t linefeedcount_fallback(char *const restrict buffer, const size_t size)
112 | {
113 |   uint64_t nl = 0;
114 |   char *ptr = buffer;
115 |   char *last = buffer + size;
116 |   
117 |   while ((ptr = memchr(ptr, '\n', last - ptr)))
118 |   {
119 |     ptr++;
120 |     nl++;
121 |   }
122 |   
123 |   return nl;
124 | }
125 | 
126 | 
127 | 
128 | static inline size_t linefeedcount(char *const restrict buffer, const size_t size)
129 | {
130 | #ifdef __AVX2__
131 |   if (has_avx2())
132 |     return linefeedcount_avx2(buffer, size);
133 |   else
134 | #endif
135 |     return linefeedcount_fallback(buffer, size);
136 | }
137 | 
138 | 
139 | 
140 | // -----------------------------------------------------------------------------
141 | // wrappers
142 | // -----------------------------------------------------------------------------
143 | 
144 | static inline int wc_charsonly(FILE *restrict fp, char *restrict buf, uint64_t *restrict nchars)
145 | {
146 |   size_t readlen = BUFLEN;
147 |   uint64_t nc = 0;
148 |   
149 |   while (readlen == BUFLEN)
150 |   {
151 |     if (check_interrupt())
152 |       return USER_INTERRUPT;
153 |     
154 |     readlen = fread(buf, sizeof(char), BUFLEN, fp);
155 |     nc += readlen;
156 |   }
157 |   
158 |   *nchars = nc;
159 |   
160 |   return 0;
161 | }
162 | 
163 | 
164 | 
165 | static inline int wc_linesonly(FILE *restrict fp, char *restrict buf, uint64_t *restrict nlines)
166 | {
167 |   size_t readlen = BUFLEN;
168 |   uint64_t nl = 0;
169 |   
170 |   while (readlen == BUFLEN)
171 |   {
172 |     if (check_interrupt())
173 |       return USER_INTERRUPT;
174 |     
175 |     readlen = fread(buf, sizeof(*buf), BUFLEN, fp);
176 |     nl += linefeedcount(buf, readlen);
177 |   }
178 |   
179 |   *nlines = nl;
180 |   
181 |   return 0;
182 | }
183 | 
184 | 
185 | 
186 | static inline int wc_nowords(FILE *restrict fp, char *restrict buf, uint64_t *restrict nchars, uint64_t *restrict nlines)
187 | {
188 |   size_t readlen = BUFLEN;
189 |   uint64_t nc = 0;
190 |   uint64_t nl = 0;
191 |   
192 |   while (readlen == BUFLEN)
193 |   {
194 |     if (check_interrupt())
195 |       return USER_INTERRUPT;
196 |     
197 |     readlen = fread(buf, sizeof(*buf), BUFLEN, fp);
198 |     nl += linefeedcount(buf, readlen);
199 |     nc += readlen;
200 |   }
201 |   
202 |   *nchars = nc;
203 |   *nlines = nl;
204 |   
205 |   return 0;
206 | }
207 | 
208 | 
209 | 
210 | static inline int wc_nolines(FILE *restrict fp, char *restrict buf, uint64_t *restrict nchars, uint64_t *restrict nwords)
211 | {
212 |   size_t readlen = BUFLEN;
213 |   uint64_t nc = 0;
214 |   uint64_t nw = 0;
215 |   
216 |   while (readlen == BUFLEN)
217 |   {
218 |     if (check_interrupt())
219 |       return USER_INTERRUPT;
220 |     
221 |     readlen = fread(buf, sizeof(char), BUFLEN, fp);
222 |     
223 |     SAFE_FOR_SIMD
224 |     for (size_t i=0; i<readlen; i++)
225 |     {
226 |       if (isspace((unsigned char) buf[i]))
227 |         nw++;
228 |       
229 |       nc++;
230 |     }
231 |   }
232 |   
233 |   *nchars = nc;
234 |   *nwords = nw;
235 | 
236 |   return 0;
237 | }
238 | 
239 | 
240 | 
241 | static inline bool isnewline(const char c)
242 | {
243 |   return (c=='\n') ? true : false;
244 | }
245 | 
246 | static inline int wc_full(FILE *restrict fp, char *restrict buf, uint64_t *restrict nchars, uint64_t *restrict nwords, uint64_t *restrict nlines)
247 | {
248 |   uint64_t nc = 0;
249 |   uint64_t nw = 0;
250 |   uint64_t nl = 0;
251 |   size_t readlen = BUFLEN;
252 | 
253 |   while (readlen == BUFLEN)
254 |   {
255 |     if (check_interrupt())
256 |       return USER_INTERRUPT;
257 |     
258 |     readlen = fread(buf, sizeof(char), BUFLEN, fp);
259 |     nc += readlen;
260 |     
261 |     SAFE_FOR_SIMD
262 |     for (size_t i=0; i<readlen; i++)
263 |     {
264 |       if (isnewline(buf[i]))
265 |       {
266 |         nw++;
267 |         nl++;
268 |       }
269 |       else
270 |       {
271 |         if (isspace((unsigned char) buf[i]))
272 |           nw++;
273 |       }
274 |     }
275 |   }
276 |   
277 |   *nchars = nc;
278 |   *nwords = nw;
279 |   *nlines = nl;
280 |   
281 |   return 0;
282 | }
283 | 
284 | 
285 | 
286 | /**
287 |  * @file
288 |  * @brief
289 |  * Wordcounts
290 |  *
291 |  * @details
292 |  * This function emulates the standard unix wc command, to generate
293 |  * letter, word, and line counts from a text file.
294 |  *
295 |  * @param file
296 |  * Input.  Absolute path to input file.
297 |  * @param chars,words,lines
298 |  * Input.  Flags indicating which of the three counts should be computed.
299 |  * @param nchars
300 |  * Output, passed by reference.  On successful return, the value
301 |  * is set to the number of "letters" (characters) in the file.
302 |  * @param nwords
303 |  * Output, passed by reference.  On successful return, the value
304 |  * is set to the number of "words" (space separated tokens) in the
305 |  * file.
306 |  * @param nlines
307 |  * Output, passed by reference.  On successful return, the value
308 |  * is set to the number of lines in the file.
309 |  *
310 |  * @return
311 |  * 0 on success; otherwise one of the error codes from error.h (e.g. READ_FAIL).
312 |  */
313 | int fs_wc(const char *file, const bool chars, uint64_t *nchars, 
314 |   const bool words, uint64_t *nwords, const bool lines, uint64_t *nlines)
315 | {
316 |   int ret = 0;
317 |   FILE *fp;
318 |   char *buf;
319 |   
320 |   fp = fopen(file, "r");
321 |   if (!fp)
322 |     return READ_FAIL;
323 |   
324 |   buf = malloc(BUFLEN * sizeof(*buf));
325 |   if (buf == NULL)
326 |   {
327 |     fclose(fp);
328 |     return MALLOC_FAIL;
329 |   }
330 |   
331 |   if (!chars && !words && lines)
332 |     ret = wc_linesonly(fp, buf, nlines);
333 |   else if (chars && !words && lines)
334 |     ret = wc_nowords(fp, buf, nchars, nlines);
335 |   else if (chars && words && !lines)
336 |     ret = wc_nolines(fp, buf, nchars, nwords);
337 |   else if (chars && !words && !lines)
338 |     ret = wc_charsonly(fp, buf, nchars);
339 |   else
340 |     ret = wc_full(fp, buf, nchars, nwords, nlines);
341 |   
342 |   fclose(fp);
343 |   free(buf);
344 |   
345 |   return ret;
346 | }
347 | 
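
The lines-only path above boils down to reading fixed-size chunks and counting `'\n'` bytes in each one. A minimal portable sketch of that idea follows, using only `memchr()` and no AVX2. `count_newlines` and `count_lines_stream` are hypothetical names, and the 4096-byte chunk size is an arbitrary choice for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Count '\n' bytes in buf[0..size-1] by hopping between memchr() hits. */
static uint64_t count_newlines(const char *buf, size_t size)
{
  uint64_t nl = 0;
  const char *ptr = buf;
  const char *end = buf + size;

  while (ptr < end && (ptr = memchr(ptr, '\n', (size_t) (end - ptr))) != NULL)
  {
    ptr++;  /* resume the search just past the newline we found */
    nl++;
  }

  return nl;
}

/* Count lines in an open stream by reading it in fixed-size chunks. */
static uint64_t count_lines_stream(FILE *fp)
{
  char chunk[4096];
  uint64_t nl = 0;
  size_t got;

  while ((got = fread(chunk, 1, sizeof(chunk), fp)) > 0)
    nl += count_newlines(chunk, got);

  return nl;
}
```

As with the Unix `wc -l`, a final line that lacks a trailing newline is not counted.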


--------------------------------------------------------------------------------
/src/filesampler_native.c:
--------------------------------------------------------------------------------
 1 | /* Automatically generated. Do not edit by hand. */
 2 | 
 3 | #include <R.h>
 4 | #include <Rinternals.h>
 5 | #include <R_ext/Rdynload.h>
 6 | #include <stdlib.h>
 7 | 
 8 | extern SEXP R_fs_sample_exact(SEXP verbose, SEXP header, SEXP nskip_, SEXP nlines_out_, SEXP input, SEXP output);
 9 | extern SEXP R_fs_sample_prop(SEXP verbose, SEXP header, SEXP nskip_, SEXP nmax_, SEXP p, SEXP input, SEXP output);
10 | extern SEXP R_fs_wc(SEXP input, SEXP chars_, SEXP words_, SEXP lines_);
11 | 
12 | static const R_CallMethodDef CallEntries[] = {
13 |   {"R_fs_sample_exact", (DL_FUNC) &R_fs_sample_exact, 6},
14 |   {"R_fs_sample_prop", (DL_FUNC) &R_fs_sample_prop, 7},
15 |   {"R_fs_wc", (DL_FUNC) &R_fs_wc, 4},
16 |   {NULL, NULL, 0}
17 | };
18 | void R_init_filesampler(DllInfo *dll)
19 | {
20 |   R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
21 |   R_useDynamicSymbols(dll, FALSE);
22 | }
23 | 


--------------------------------------------------------------------------------
/src/samplers.c:
--------------------------------------------------------------------------------
 1 | /*  Copyright (c) 2015-2016, Drew Schmidt
 2 |     All rights reserved.
 3 | 
 4 |     Redistribution and use in source and binary forms, with or without
 5 |     modification, are permitted provided that the following conditions are met:
 6 | 
 7 |     1. Redistributions of source code must retain the above copyright notice,
 8 |     this list of conditions and the following disclaimer.
 9 | 
10 |     2. Redistributions in binary form must reproduce the above copyright
11 |     notice, this list of conditions and the following disclaimer in the
12 |     documentation and/or other materials provided with the distribution.
13 | 
14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | */
26 | 
27 | 
28 | #include "Rfilesampler.h"
29 | #include "filesampler/filesampler.h"
30 | 
31 | 
32 | SEXP R_fs_sample_prop(SEXP verbose, SEXP header, SEXP nskip_, SEXP nmax_, SEXP p, SEXP input, SEXP output)
33 | {
34 |   int ret;
35 |   
36 |   const uint32_t nskip = (uint32_t) INT(nskip_);
37 |   const uint32_t nmax = (uint32_t) INT(nmax_);
38 |   
39 |   ret = fs_sample_prop(INT(verbose), INT(header), nskip, nmax, DBL(p), CHARPT(input, 0), CHARPT(output, 0));
40 |   fs_checkret(ret);
41 |   
42 |   return R_NilValue;
43 | }
44 | 
45 | 
46 | 
47 | SEXP R_fs_sample_exact(SEXP verbose, SEXP header, SEXP nskip_, SEXP nlines_out_, SEXP input, SEXP output)
48 | {
49 |   int ret;
50 |   
51 |   const uint32_t nskip = (uint32_t) INT(nskip_);
52 |   const uint32_t nlines_out = (uint32_t) INT(nlines_out_);
53 |   
54 |   ret = fs_sample_exact(INT(verbose), INT(header), nskip, nlines_out, CHARPT(input, 0), CHARPT(output, 0));
55 |   fs_checkret(ret);
56 |   
57 |   return R_NilValue;
58 | }
59 | 
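
Note on the casts above: both wrappers convert the values of nskip_, nmax_ and nlines_out_ to uint32_t. Assuming the INT macro (defined in Rfilesampler.h, not shown here) yields a plain C int, a negative value would wrap to a very large unsigned number rather than fail. The test in tests/exact.R expects param=-1 to be rejected with an error message, which suggests the R layer validates arguments before these wrappers run. The sketch below shows the wrap-around and one possible guard; sketch_to_u32 is a hypothetical helper, not part of the package.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert an int to uint32_t only when it is
   non-negative. Returns false and leaves *out untouched otherwise. */
static bool sketch_to_u32(int x, uint32_t *out)
{
  if (x < 0)
    return false;

  *out = (uint32_t) x;
  return true;
}

/* What an unchecked cast does: conversion to an unsigned type is modulo
   2^32, so -1 becomes UINT32_MAX (4294967295). */
static uint32_t sketch_unchecked_cast(int x)
{
  return (uint32_t) x;
}
```

Checking on the R side keeps the C wrappers short, at the cost of trusting every caller to go through the R functions.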


--------------------------------------------------------------------------------
/src/wc.c:
--------------------------------------------------------------------------------
 1 | /*  Copyright (c) 2015-2016, Drew Schmidt
 2 |     All rights reserved.
 3 | 
 4 |     Redistribution and use in source and binary forms, with or without
 5 |     modification, are permitted provided that the following conditions are met:
 6 | 
 7 |     1. Redistributions of source code must retain the above copyright notice,
 8 |     this list of conditions and the following disclaimer.
 9 | 
10 |     2. Redistributions in binary form must reproduce the above copyright
11 |     notice, this list of conditions and the following disclaimer in the
12 |     documentation and/or other materials provided with the distribution.
13 | 
14 |     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
15 |     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
16 |     TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
17 |     PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
18 |     CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
19 |     EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
20 |     PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
21 |     PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
22 |     LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
23 |     NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 |     SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | */
26 | 
27 | 
28 | #include "Rfilesampler.h"
29 | #include "filesampler/filesampler.h"
30 | 
31 | #define COUNTS(n) REAL(counts)[n]
32 | #define NCHARS  0
33 | #define NWORDS  1
34 | #define NLINES  2
35 | 
36 | #define BADVAL -1.0
37 | 
38 | 
39 | SEXP R_fs_wc(SEXP input, SEXP chars_, SEXP words_, SEXP lines_)
40 | {
41 |   SEXP counts;
42 |   int ret;
43 |   uint64_t nchars, nwords, nlines;
44 |   
45 |   const bool chars = INT(chars_);
46 |   const bool words = INT(words_);
47 |   const bool lines = INT(lines_);
48 |   
49 |   PROTECT(counts = allocVector(REALSXP, 3));
50 |   
51 |   ret = fs_wc(CHARPT(input, 0), chars, &nchars, words, &nwords, lines, &nlines);
52 |   fs_checkret(ret);
53 |   
54 |   COUNTS(NCHARS) = chars ? (double) nchars : BADVAL;
55 |   COUNTS(NWORDS) = words ? (double) nwords : BADVAL;
56 |   COUNTS(NLINES) = lines ? (double) nlines : BADVAL;
57 |   
58 |   UNPROTECT(1);
59 |   return counts;
60 | }
61 | 
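
Note on the return type: R_fs_wc() stores the uint64_t counts in a REALSXP (double) vector and uses -1.0 for counts that were not requested. R's native integer type is 32 bits, so doubles are a plausible way to carry larger counts; that motivation is an assumption, since the source does not state it. A double represents every integer up to 2^53 exactly, which the hypothetical helpers below make explicit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Every integer in [0, 2^53] is exactly representable as an IEEE 754 double. */
#define SKETCH_MAX_EXACT ((uint64_t) 1 << 53)

/* Conservative check: true when the count is guaranteed to survive the
   conversion to double unchanged. Larger values may or may not round. */
static bool sketch_count_fits_exactly(uint64_t n)
{
  return n <= SKETCH_MAX_EXACT;
}

/* Mirror of the conversion used for the returned counts, with the same
   -1.0 sentinel for a count that was not requested. */
static double sketch_count_or_sentinel(bool requested, uint64_t n)
{
  return requested ? (double) n : -1.0;
}
```

Files with more than 2^53 characters are far beyond practical sizes, so the conversion is exact for any realistic input.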


--------------------------------------------------------------------------------
/tests/exact.R:
--------------------------------------------------------------------------------
 1 | library(filesampler)
 2 | 
 3 | file <- system.file("rawdata/small.csv", package="filesampler")
 4 | 
 5 | ### Argument checks
 6 | 
 7 | # nlines
 8 | badp <- "<simpleError: argument 'nlines' must be a positive integer>"
 9 | badval <- tryCatch(sampled <- sample_csv(file, param=-1, method="exact"), error=capture.output)
10 | stopifnot(all.equal(badp, badval))
11 | 
12 | 
13 | 
14 | ### general functionality
15 | set.seed(1234)
16 | sampled <- sample_csv(file, param=5, method="exact")
17 | 
18 | sampled_actual <-
19 | structure(list(A = c(13L, 28L, 83L, 88L, 87L), B = structure(c(3L, 
20 | 2L, 4L, 4L, 1L), .Label = c("g", "n", "p", "q"), class = "factor"), 
21 |     C = structure(c(5L, 4L, 1L, 3L, 2L), .Label = c("H", "S", 
22 |     "V", "W", "Z"), class = "factor"), D = c(0.34508125577122, 
23 |     0.573690160643309, 0.0123868996743113, 0.366345942718908, 
24 |     0.260312086436898), E = c(-0.455623686756772, 1.48579320914896, 
25 |     0.680072131251977, -0.114730600414802, -0.0554640013763592
26 |     ), F = c(40.096492273733, 72.228856375441, 84.4729636912234, 
27 |     92.6017211284488, 92.6275063771755)), .Names = c("A", "B", 
28 | "C", "D", "E", "F"), class = "data.frame", row.names = c(NA, 
29 | -5L))
30 | 
31 | 
32 | stopifnot(all.equal(sampled, sampled_actual))
33 | 


--------------------------------------------------------------------------------
/tests/prop.R:
--------------------------------------------------------------------------------
 1 | library(filesampler)
 2 | file = system.file("rawdata/small.csv", package="filesampler")
 3 | 
 4 | 
 5 | ### Argument checks
 6 | 
 7 | # p
 8 | expect = "Argument 'p' must be between 0 and 1>"
 9 | have = tryCatch(sampled <- sample_csv(file, param=-1), error=capture.output)
10 | stopifnot(grepl(expect, have))
11 | have = tryCatch(sampled <- sample_csv(file, param=1.1), error=capture.output)
12 | stopifnot(grepl(expect, have))
13 | 
14 | # nmax
15 | set.seed(1234)
16 | sampled = sample_csv(file, param=.05, nmax=1)
17 | 
18 | sampled_actual =
19 | structure(list(A = 30L, B = structure(1L, .Label = "m", class = "factor"), 
20 |     C = structure(1L, .Label = "U", class = "factor"), D = 0.663265228504315, 
21 |     E = -0.102609033060013, F = 46.2487461487763), .Names = c("A", 
22 | "B", "C", "D", "E", "F"), class = "data.frame", row.names = c(NA, 
23 | -1L))
24 | 
25 | stopifnot(all.equal(sampled, sampled_actual))
26 | 
27 | 
28 | 
29 | ### general functionality
30 | 
31 | # read
32 | set.seed(1234)
33 | sampled = sample_csv(file, param=.05)
34 | 
35 | sampled_actual =
36 | structure(list(A = c(30L, 92L, 6L, 49L, 77L, 54L, 67L), B = structure(c(2L, 
37 | 1L, 3L, 2L, 4L, 1L, 5L), .Label = c("b", "m", "p", "r", "w"), class = "factor"), 
38 |     C = structure(c(4L, 1L, 1L, 3L, 2L, 2L, 3L), .Label = c("B", 
39 |     "G", "P", "U"), class = "factor"), D = c(0.663265228504315, 
40 |     0.647247697459534, 0.145372492726892, 0.647244467865676, 
41 |     0.248602577485144, 0.155096606584266, 0.231593899428844), 
42 |     E = c(-0.102609033060013, 1.20942736228284, 1.52611855611189, 
43 |     -0.59059429990917, 0.356037725132404, -0.933558833688986, 
44 |     0.712171107833738), F = c(46.2487461487763, 34.8187624011189, 
45 |     20.8286189078353, 91.2623915518634, 42.5667992956005, 17.3642539186403, 
46 |     47.2880723653361)), .Names = c("A", "B", "C", "D", "E", "F"
47 | ), class = "data.frame", row.names = c(NA, -7L))
48 | 
49 | stopifnot(all.equal(sampled, sampled_actual))
50 | 
51 | # verbose
52 | verb = capture.output(invisible(sample_csv(file, param=.05, verbose=TRUE)))
53 | verb_actual = "Read 4 lines (0.03960%) of 101 line file."
54 | stopifnot(all.equal(verb, verb_actual))
55 | 


--------------------------------------------------------------------------------
/tests/wc.R:
--------------------------------------------------------------------------------
 1 | library(filesampler)
 2 | 
 3 | file <- system.file("rawdata/small.csv", package="filesampler")
 4 | 
 5 | nchars = 6426
 6 | nwords = 101
 7 | nlines = 101
 8 | 
 9 | truth = c(nchars, nwords, nlines)
10 | test = as.integer(wc(file))
11 | stopifnot(all.equal(truth, test))
12 | 
13 | truth = c(-1, -1, nlines)
14 | test = as.integer(wc(file, FALSE, FALSE))
15 | stopifnot(all.equal(truth, test))
16 | 
17 | truth = c(nchars, -1, nlines)
18 | test = as.integer(wc(file, TRUE, FALSE, TRUE))
19 | stopifnot(all.equal(truth, test))
20 | 
21 | truth = c(nchars, nwords, -1)
22 | test = as.integer(wc(file, lines=FALSE))
23 | stopifnot(all.equal(truth, test))
24 | 
25 | truth = c(nchars, -1, -1)
26 | test = as.integer(wc(file, words=FALSE, lines=FALSE))
27 | stopifnot(all.equal(truth, test))
28 | 


--------------------------------------------------------------------------------
/tools/ax_check_compile_flag.m4:
--------------------------------------------------------------------------------
 1 | # ===========================================================================
 2 | #  https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html
 3 | # ===========================================================================
 4 | #
 5 | # SYNOPSIS
 6 | #
 7 | #   AX_CHECK_COMPILE_FLAG(FLAG, [ACTION-SUCCESS], [ACTION-FAILURE], [EXTRA-FLAGS], [INPUT])
 8 | #
 9 | # DESCRIPTION
10 | #
11 | #   Check whether the given FLAG works with the current language's compiler
12 | #   or gives an error.  (Warnings, however, are ignored)
13 | #
14 | #   ACTION-SUCCESS/ACTION-FAILURE are shell commands to execute on
15 | #   success/failure.
16 | #
17 | #   If EXTRA-FLAGS is defined, it is added to the current language's default
18 | #   flags (e.g. CFLAGS) when the check is done.  The check is thus made with
19 | #   the flags: "CFLAGS EXTRA-FLAGS FLAG".  This can for example be used to
20 | #   force the compiler to issue an error when a bad flag is given.
21 | #
22 | #   INPUT gives an alternative input source to AC_COMPILE_IFELSE.
23 | #
24 | #   NOTE: Implementation based on AX_CFLAGS_GCC_OPTION. Please keep this
25 | #   macro in sync with AX_CHECK_{PREPROC,LINK}_FLAG.
26 | #
27 | # LICENSE
28 | #
29 | #   Copyright (c) 2008 Guido U. Draheim <guidod@gmx.de>
30 | #   Copyright (c) 2011 Maarten Bosmans <mkbosmans@gmail.com>
31 | #
32 | #   This program is free software: you can redistribute it and/or modify it
33 | #   under the terms of the GNU General Public License as published by the
34 | #   Free Software Foundation, either version 3 of the License, or (at your
35 | #   option) any later version.
36 | #
37 | #   This program is distributed in the hope that it will be useful, but
38 | #   WITHOUT ANY WARRANTY; without even the implied warranty of
39 | #   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
40 | #   Public License for more details.
41 | #
42 | #   You should have received a copy of the GNU General Public License along
43 | #   with this program. If not, see <https://www.gnu.org/licenses/>.
44 | #
45 | #   As a special exception, the respective Autoconf Macro's copyright owner
46 | #   gives unlimited permission to copy, distribute and modify the configure
47 | #   scripts that are the output of Autoconf when processing the Macro. You
48 | #   need not follow the terms of the GNU General Public License when using
49 | #   or distributing such scripts, even though portions of the text of the
50 | #   Macro appear in them. The GNU General Public License (GPL) does govern
51 | #   all other use of the material that constitutes the Autoconf Macro.
52 | #
53 | #   This special exception to the GPL applies to versions of the Autoconf
54 | #   Macro released by the Autoconf Archive. When you make and distribute a
55 | #   modified version of the Autoconf Macro, you may extend this special
56 | #   exception to the GPL to apply to your modified version as well.
57 | 
58 | #serial 5
59 | 
60 | AC_DEFUN([AX_CHECK_COMPILE_FLAG],
61 | [AC_PREREQ(2.64)dnl for _AC_LANG_PREFIX and AS_VAR_IF
62 | AS_VAR_PUSHDEF([CACHEVAR],[ax_cv_check_[]_AC_LANG_ABBREV[]flags_$4_$1])dnl
63 | AC_CACHE_CHECK([whether _AC_LANG compiler accepts $1], CACHEVAR, [
64 |   ax_check_save_flags=$[]_AC_LANG_PREFIX[]FLAGS
65 |   _AC_LANG_PREFIX[]FLAGS="$[]_AC_LANG_PREFIX[]FLAGS $4 $1"
66 |   AC_COMPILE_IFELSE([m4_default([$5],[AC_LANG_PROGRAM()])],
67 |     [AS_VAR_SET(CACHEVAR,[yes])],
68 |     [AS_VAR_SET(CACHEVAR,[no])])
69 |   _AC_LANG_PREFIX[]FLAGS=$ax_check_save_flags])
70 | AS_VAR_IF(CACHEVAR,yes,
71 |   [m4_default([$2], :)],
72 |   [m4_default([$3], :)])
73 | AS_VAR_POPDEF([CACHEVAR])dnl
74 | ])dnl AX_CHECK_COMPILE_FLAGS
75 | 


--------------------------------------------------------------------------------
/vignettes/IEEEtran.bst:
--------------------------------------------------------------------------------
   1 | %%
   2 | %% IEEEtran.bst
   3 | %% BibTeX Bibliography Style file for IEEE Journals and Conferences (unsorted)
   4 | %% Version 1.14 (2015/08/26)
   5 | %% 
   6 | %% Copyright (c) 2003-2015 Michael Shell
   7 | %% 
   8 | %% Original starting code base and algorithms obtained from the output of
   9 | %% Patrick W. Daly's makebst package as well as from prior versions of
  10 | %% IEEE BibTeX styles:
  11 | %% 
  12 | %% 1. Howard Trickey and Oren Patashnik's ieeetr.bst  (1985/1988)
  13 | %% 2. Silvano Balemi and Richard H. Roy's IEEEbib.bst (1993)
  14 | %% 
  15 | %% Support sites:
  16 | %% http://www.michaelshell.org/tex/ieeetran/
  17 | %% http://www.ctan.org/pkg/ieeetran
  18 | %% and/or
  19 | %% http://www.ieee.org/
  20 | %% 
  21 | %% For use with BibTeX version 0.99a or later
  22 | %%
  23 | %% This is a numerical citation style.
  24 | %% 
  25 | %%*************************************************************************
  26 | %% Legal Notice:
  27 | %% This code is offered as-is without any warranty either expressed or
  28 | %% implied; without even the implied warranty of MERCHANTABILITY or
  29 | %% FITNESS FOR A PARTICULAR PURPOSE! 
  30 | %% User assumes all risk.
  31 | %% In no event shall the IEEE or any contributor to this code be liable for
  32 | %% any damages or losses, including, but not limited to, incidental,
  33 | %% consequential, or any other damages, resulting from the use or misuse
  34 | %% of any information contained here.
  35 | %%
  36 | %% All comments are the opinions of their respective authors and are not
  37 | %% necessarily endorsed by the IEEE.
  38 | %%
  39 | %% This work is distributed under the LaTeX Project Public License (LPPL)
  40 | %% ( http://www.latex-project.org/ ) version 1.3, and may be freely used,
  41 | %% distributed and modified. A copy of the LPPL, version 1.3, is included
  42 | %% in the base LaTeX documentation of all distributions of LaTeX released
  43 | %% 2003/12/01 or later.
  44 | %% Retain all contribution notices and credits.
  45 | %% ** Modified files should be clearly indicated as such, including  **
  46 | %% ** renaming them and changing author support contact information. **
  47 | %%*************************************************************************
  48 | 
  49 | 
  50 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  51 | %% DEFAULTS FOR THE CONTROLS OF THE BST STYLE %%
  52 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  53 | 
  54 | % These are the defaults for the user adjustable controls. The values used
  55 | % here can be overridden by the user via IEEEtranBSTCTL entry type.
  56 | 
  57 | % NOTE: The recommended LaTeX command to invoke a control entry type is:
  58 | % 
  59 | %\makeatletter
  60 | %\def\bstctlcite{\@ifnextchar[{\@bstctlcite}{\@bstctlcite[@auxout]}}
  61 | %\def\@bstctlcite[#1]#2{\@bsphack
  62 | %  \@for\@citeb:=#2\do{%
  63 | %    \edef\@citeb{\expandafter\@firstofone\@citeb}%
  64 | %    \if@filesw\immediate\write\csname #1\endcsname{\string\citation{\@citeb}}\fi}%
  65 | %  \@esphack}
  66 | %\makeatother
  67 | %
  68 | % It is called at the start of the document, before the first \cite, like:
  69 | % \bstctlcite{IEEEexample:BSTcontrol}
  70 | %
  71 | % IEEEtran.cls V1.6 and later does provide this command.
  72 | 
  73 | 
  74 | 
  75 | % #0 turns off the display of the number for articles.
  76 | % #1 enables
  77 | FUNCTION {default.is.use.number.for.article} { #1 }
  78 | 
  79 | 
  80 | % #0 turns off the display of the paper and type fields in @inproceedings.
  81 | % #1 enables
  82 | FUNCTION {default.is.use.paper} { #1 }
  83 | 
  84 | 
  85 | % #0 turns off the display of urls
  86 | % #1 enables
  87 | FUNCTION {default.is.use.url} { #1 }
  88 | 
  89 | 
  90 | % #0 turns off the forced use of "et al."
  91 | % #1 enables
  92 | FUNCTION {default.is.forced.et.al} { #0 }
  93 | 
  94 | 
  95 | % The maximum number of names that can be present beyond which an "et al."
  96 | % usage is forced. Be sure that num.names.shown.with.forced.et.al (below)
  97 | % is not greater than this value!
  98 | % Note: There are many instances of references in IEEE journals which have
  99 | % a very large number of authors as well as instances in which "et al." is
 100 | % used profusely.
 101 | FUNCTION {default.max.num.names.before.forced.et.al} { #10 }
 102 | 
 103 | 
 104 | % The number of names that will be shown with a forced "et al.".
 105 | % Must be less than or equal to max.num.names.before.forced.et.al
 106 | FUNCTION {default.num.names.shown.with.forced.et.al} { #1 }
 107 | 
 108 | 
 109 | % #0 turns off the alternate interword spacing for entries with URLs.
 110 | % #1 enables
 111 | FUNCTION {default.is.use.alt.interword.spacing} { #1 }
 112 | 
 113 | 
 114 | % If alternate interword spacing for entries with URLs is enabled, this is
 115 | % the interword spacing stretch factor that will be used. For example, the
 116 | % default "4" here means that the interword spacing in entries with URLs can
 117 | % stretch to four times normal. Does not have to be an integer. Note that
 118 | % the value specified here can be overridden by the user in their LaTeX
 119 | % code via a command such as: 
 120 | % "\providecommand\BIBentryALTinterwordstretchfactor{1.5}" in addition to
 121 | % that via the IEEEtranBSTCTL entry type.
 122 | FUNCTION {default.ALTinterwordstretchfactor} { "4" }
 123 | 
 124 | 
 125 | % #0 turns off the "dashification" of repeated (i.e., identical to those
 126 | % of the previous entry) names. The IEEE normally does this.
 127 | % #1 enables
 128 | FUNCTION {default.is.dash.repeated.names} { #1 }
 129 | 
 130 | 
 131 | % The default name format control string.
 132 | FUNCTION {default.name.format.string}{ "{f.~}{vv~}{ll}{, jj}" }
 133 | 
 134 | 
 135 | % The default LaTeX font command for the names.
 136 | FUNCTION {default.name.latex.cmd}{ "" }
 137 | 
 138 | 
 139 | % The default URL prefix.
 140 | FUNCTION {default.name.url.prefix}{ "[Online]. Available:" }
 141 | 
 142 | 
 143 | % Other controls that cannot be accessed via IEEEtranBSTCTL entry type.
 144 | 
 145 | % #0 turns off the terminal startup banner/completed message so as to
 146 | % operate more quietly.
 147 | % #1 enables
 148 | FUNCTION {is.print.banners.to.terminal} { #1 }
 149 | 
 150 | 
 151 | 
 152 | 
 153 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 154 | %% FILE VERSION AND BANNER %%
 155 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 156 | 
 157 | FUNCTION{bst.file.version} { "1.14" }
 158 | FUNCTION{bst.file.date} { "2015/08/26" }
 159 | FUNCTION{bst.file.website} { "http://www.michaelshell.org/tex/ieeetran/bibtex/" }
 160 | 
 161 | FUNCTION {banner.message}
 162 | { is.print.banners.to.terminal
 163 |      { "-- IEEEtran.bst version" " " * bst.file.version *
 164 |        " (" * bst.file.date * ") " * "by Michael Shell." *
 165 |        top$
 166 |        "-- " bst.file.website *
 167 |        top$
 168 |        "-- See the " quote$ * "IEEEtran_bst_HOWTO.pdf" * quote$ * " manual for usage information." *
 169 |        top$
 170 |      }
 171 |      { skip$ }
 172 |    if$
 173 | }
 174 | 
 175 | FUNCTION {completed.message}
 176 | { is.print.banners.to.terminal
 177 |      { ""
 178 |        top$
 179 |        "Done."
 180 |        top$
 181 |      }
 182 |      { skip$ }
 183 |    if$
 184 | }
 185 | 
 186 | 
 187 | 
 188 | 
 189 | %%%%%%%%%%%%%%%%%%%%%%
 190 | %% STRING CONSTANTS %%
 191 | %%%%%%%%%%%%%%%%%%%%%%
 192 | 
 193 | FUNCTION {bbl.and}{ "and" }
 194 | FUNCTION {bbl.etal}{ "et~al." }
 195 | FUNCTION {bbl.editors}{ "eds." }
 196 | FUNCTION {bbl.editor}{ "ed." }
 197 | FUNCTION {bbl.edition}{ "ed." }
 198 | FUNCTION {bbl.volume}{ "vol." }
 199 | FUNCTION {bbl.of}{ "of" }
 200 | FUNCTION {bbl.number}{ "no." }
 201 | FUNCTION {bbl.in}{ "in" }
 202 | FUNCTION {bbl.pages}{ "pp." }
 203 | FUNCTION {bbl.page}{ "p." }
 204 | FUNCTION {bbl.chapter}{ "ch." }
 205 | FUNCTION {bbl.paper}{ "paper" }
 206 | FUNCTION {bbl.part}{ "pt." }
 207 | FUNCTION {bbl.patent}{ "Patent" }
 208 | FUNCTION {bbl.patentUS}{ "U.S." }
 209 | FUNCTION {bbl.revision}{ "Rev." }
 210 | FUNCTION {bbl.series}{ "ser." }
 211 | FUNCTION {bbl.standard}{ "Std." }
 212 | FUNCTION {bbl.techrep}{ "Tech. Rep." }
 213 | FUNCTION {bbl.mthesis}{ "Master's thesis" }
 214 | FUNCTION {bbl.phdthesis}{ "Ph.D. dissertation" }
 215 | FUNCTION {bbl.st}{ "st" }
 216 | FUNCTION {bbl.nd}{ "nd" }
 217 | FUNCTION {bbl.rd}{ "rd" }
 218 | FUNCTION {bbl.th}{ "th" }
 219 | 
 220 | 
 221 | % This is the LaTeX spacer that is used when a larger than normal space
 222 | % is called for (such as just before the address:publisher).
 223 | FUNCTION {large.space} { "\hskip 1em plus 0.5em minus 0.4em\relax " }
 224 | 
 225 | % The LaTeX code for dashes that are used to represent repeated names.
 226 | % Note: Some older IEEE journals used something like
 227 | % "\rule{0.275in}{0.5pt}\," which is fairly thick and runs right along
 228 | % the baseline. However, the IEEE now uses a thinner, above baseline,
 229 | % six dash long sequence.
 230 | FUNCTION {repeated.name.dashes} { "------" }
 231 | 
 232 | 
 233 | 
 234 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 235 | %% PREDEFINED STRING MACROS %%
 236 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 237 | 
 238 | MACRO {jan} {"Jan."}
 239 | MACRO {feb} {"Feb."}
 240 | MACRO {mar} {"Mar."}
 241 | MACRO {apr} {"Apr."}
 242 | MACRO {may} {"May"}
 243 | MACRO {jun} {"Jun."}
 244 | MACRO {jul} {"Jul."}
 245 | MACRO {aug} {"Aug."}
 246 | MACRO {sep} {"Sep."}
 247 | MACRO {oct} {"Oct."}
 248 | MACRO {nov} {"Nov."}
 249 | MACRO {dec} {"Dec."}
 250 | 
 251 | 
 252 | 
 253 | %%%%%%%%%%%%%%%%%%
 254 | %% ENTRY FIELDS %%
 255 | %%%%%%%%%%%%%%%%%%
 256 | 
 257 | ENTRY
 258 |   { address
 259 |     assignee
 260 |     author
 261 |     booktitle
 262 |     chapter
 263 |     day
 264 |     dayfiled
 265 |     edition
 266 |     editor
 267 |     howpublished
 268 |     institution
 269 |     intype
 270 |     journal
 271 |     key
 272 |     language
 273 |     month
 274 |     monthfiled
 275 |     nationality
 276 |     note
 277 |     number
 278 |     organization
 279 |     pages
 280 |     paper
 281 |     publisher
 282 |     school
 283 |     series
 284 |     revision
 285 |     title
 286 |     type
 287 |     url
 288 |     volume
 289 |     year
 290 |     yearfiled
 291 |     CTLuse_article_number
 292 |     CTLuse_paper
 293 |     CTLuse_url
 294 |     CTLuse_forced_etal
 295 |     CTLmax_names_forced_etal
 296 |     CTLnames_show_etal
 297 |     CTLuse_alt_spacing
 298 |     CTLalt_stretch_factor
 299 |     CTLdash_repeated_names
 300 |     CTLname_format_string
 301 |     CTLname_latex_cmd
 302 |     CTLname_url_prefix
 303 |   }
 304 |   {}
 305 |   { label }
 306 | 
 307 | 
 308 | 
 309 | 
 310 | %%%%%%%%%%%%%%%%%%%%%%%
 311 | %% INTEGER VARIABLES %%
 312 | %%%%%%%%%%%%%%%%%%%%%%%
 313 | 
 314 | INTEGERS { prev.status.punct this.status.punct punct.std
 315 |            punct.no punct.comma punct.period 
 316 |            prev.status.space this.status.space space.std
 317 |            space.no space.normal space.large
 318 |            prev.status.quote this.status.quote quote.std
 319 |            quote.no quote.close
 320 |            prev.status.nline this.status.nline nline.std
 321 |            nline.no nline.newblock 
 322 |            status.cap cap.std
 323 |            cap.no cap.yes}
 324 | 
 325 | INTEGERS { longest.label.width multiresult nameptr namesleft number.label numnames }
 326 | 
 327 | INTEGERS { is.use.number.for.article
 328 |            is.use.paper
 329 |            is.use.url
 330 |            is.forced.et.al
 331 |            max.num.names.before.forced.et.al
 332 |            num.names.shown.with.forced.et.al
 333 |            is.use.alt.interword.spacing
 334 |            is.dash.repeated.names}
 335 | 
 336 | 
 337 | %%%%%%%%%%%%%%%%%%%%%%
 338 | %% STRING VARIABLES %%
 339 | %%%%%%%%%%%%%%%%%%%%%%
 340 | 
 341 | STRINGS { bibinfo
 342 |           longest.label
 343 |           oldname
 344 |           s
 345 |           t
 346 |           ALTinterwordstretchfactor
 347 |           name.format.string
 348 |           name.latex.cmd
 349 |           name.url.prefix}
 350 | 
 351 | 
 352 | 
 353 | 
 354 | %%%%%%%%%%%%%%%%%%%%%%%%%
 355 | %% LOW LEVEL FUNCTIONS %%
 356 | %%%%%%%%%%%%%%%%%%%%%%%%%
 357 | 
 358 | FUNCTION {initialize.controls}
 359 | { default.is.use.number.for.article 'is.use.number.for.article :=
 360 |   default.is.use.paper 'is.use.paper :=
 361 |   default.is.use.url 'is.use.url :=
 362 |   default.is.forced.et.al 'is.forced.et.al :=
 363 |   default.max.num.names.before.forced.et.al 'max.num.names.before.forced.et.al :=
 364 |   default.num.names.shown.with.forced.et.al 'num.names.shown.with.forced.et.al :=
 365 |   default.is.use.alt.interword.spacing 'is.use.alt.interword.spacing :=
 366 |   default.is.dash.repeated.names 'is.dash.repeated.names :=
 367 |   default.ALTinterwordstretchfactor 'ALTinterwordstretchfactor :=
 368 |   default.name.format.string 'name.format.string :=
 369 |   default.name.latex.cmd 'name.latex.cmd :=
 370 |   default.name.url.prefix 'name.url.prefix :=
 371 | }
 372 | 
 373 | 
 374 | % This IEEEtran.bst features a very powerful and flexible mechanism for
 375 | % controlling the capitalization, punctuation, spacing, quotation, and
 376 | % newlines of the formatted entry fields. (Note: IEEEtran.bst does not need
 377 | % or use the newline/newblock feature, but it has been implemented for
 378 | % possible future use.) The output states of IEEEtran.bst consist of
 379 | % multiple independent attributes and, as such, can be thought of as being
 380 | % vectors, rather than the simple scalar values ("before.all", 
 381 | % "mid.sentence", etc.) used in most other .bst files.
 382 | % 
 383 | % The more flexible and complex design used here was motivated in part by
 384 | % the IEEE's rather unusual bibliography style. For example, the IEEE ends the
 385 | % previous field item with a period and large space prior to the publisher
 386 | % address; the @electronic entry types use periods as inter-item punctuation
 387 | % rather than the commas used by the other entry types; and URLs are never
 388 | % followed by periods even though they are the last item in the entry.
 389 | % Although it is possible to accommodate these features with the conventional
 390 | % output state system, the seemingly endless exceptions make for convoluted,
 391 | % unreliable and difficult to maintain code.
 392 | %
 393 | % IEEEtran.bst's output state system can be easily understood via a simple
 394 | % illustration of two most recently formatted entry fields (on the stack):
 395 | %
 396 | %               CURRENT_ITEM
 397 | %               "PREVIOUS_ITEM
 398 | %
 399 | % which, in this example, is to eventually appear in the bibliography as:
 400 | % 
 401 | %               "PREVIOUS_ITEM," CURRENT_ITEM
 402 | %
 403 | % It is the job of the output routine to take the previous item off of the
 404 | % stack (while leaving the current item at the top of the stack), apply its
 405 | % trailing punctuation (including closing quote marks) and spacing, and then
 406 | % to write the result to BibTeX's output buffer:
 407 | % 
 408 | %               "PREVIOUS_ITEM," 
 409 | % 
 410 | % Punctuation (and spacing) between items is often determined by both of the
 411 | % items rather than just the first one. The presence of quotation marks
 412 | % further complicates the situation because, in standard English, trailing
 413 | % punctuation marks are supposed to be contained within the quotes.
 414 | % 
 415 | % IEEEtran.bst maintains two output state (aka "status") vectors which
 416 | % correspond to the previous and current (aka "this") items. Each vector
 417 | % consists of several independent attributes which track punctuation,
 418 | % spacing, quotation, and newlines. Capitalization status is handled by a
 419 | % separate scalar because the format routines, not the output routine,
 420 | % handle capitalization and, therefore, there is no need to maintain the
 421 | % capitalization attribute for both the "previous" and "this" items.
 422 | % 
 423 | % When a format routine adds a new item, it copies the current output status
 424 | % vector to the previous output status vector and (usually) resets the
 425 | % current (this) output status vector to a "standard status" vector. Using a
 426 | % "standard status" vector in this way allows us to redefine what we mean by
 427 | % "standard status" at the start of each entry handler and reuse the same
 428 | % format routines under the various inter-item separation schemes. For
 429 | % example, the standard status vector for the @book entry type may use
 430 | % commas for item separators, while the @electronic type may use periods,
 431 | % yet both entry handlers exploit many of the exact same format routines.
 432 | % 
 433 | % Because format routines have write access to the output status vector of
 434 | % the previous item, they can override the punctuation choices of the
 435 | % previous format routine! Therefore, it becomes trivial to implement rules
 436 | % such as "Always use a period and a large space before the publisher." By
 437 | % pushing the generation of the closing quote mark to the output routine, we
 438 | % avoid all the problems caused by having to close a quote before having all
 439 | % the information required to determine what the punctuation should be.
 440 | %
 441 | % The IEEEtran.bst output state system can easily be expanded if needed.
 442 | % For instance, it is easy to add a "space.tie" attribute value if the
 443 | % bibliography rules mandate that two items have to be joined with an
 444 | % unbreakable space. 
 445 | 
 446 | FUNCTION {initialize.status.constants}
 447 | { #0 'punct.no :=
 448 |   #1 'punct.comma :=
 449 |   #2 'punct.period :=
 450 |   #0 'space.no := 
 451 |   #1 'space.normal :=
 452 |   #2 'space.large :=
 453 |   #0 'quote.no :=
 454 |   #1 'quote.close :=
 455 |   #0 'cap.no :=
 456 |   #1 'cap.yes :=
 457 |   #0 'nline.no :=
 458 |   #1 'nline.newblock :=
 459 | }
 460 | 
 461 | FUNCTION {std.status.using.comma}
 462 | { punct.comma 'punct.std :=
 463 |   space.normal 'space.std :=
 464 |   quote.no 'quote.std :=
 465 |   nline.no 'nline.std :=
 466 |   cap.no 'cap.std :=
 467 | }
 468 | 
 469 | FUNCTION {std.status.using.period}
 470 | { punct.period 'punct.std :=
 471 |   space.normal 'space.std :=
 472 |   quote.no 'quote.std :=
 473 |   nline.no 'nline.std :=
 474 |   cap.yes 'cap.std :=
 475 | }
 476 | 
 477 | FUNCTION {initialize.prev.this.status}
 478 | { punct.no 'prev.status.punct :=
 479 |   space.no 'prev.status.space :=
 480 |   quote.no 'prev.status.quote :=
 481 |   nline.no 'prev.status.nline :=
 482 |   punct.no 'this.status.punct :=
 483 |   space.no 'this.status.space :=
 484 |   quote.no 'this.status.quote :=
 485 |   nline.no 'this.status.nline :=
 486 |   cap.yes 'status.cap :=
 487 | }
 488 | 
 489 | FUNCTION {this.status.std}
 490 | { punct.std 'this.status.punct :=
 491 |   space.std 'this.status.space :=
 492 |   quote.std 'this.status.quote :=
 493 |   nline.std 'this.status.nline :=
 494 | }
 495 | 
 496 | FUNCTION {cap.status.std}{ cap.std 'status.cap := }
 497 | 
 498 | FUNCTION {this.to.prev.status}
 499 | { this.status.punct 'prev.status.punct :=
 500 |   this.status.space 'prev.status.space :=
 501 |   this.status.quote 'prev.status.quote :=
 502 |   this.status.nline 'prev.status.nline :=
 503 | }
 504 | 
 505 | 
 506 | FUNCTION {not}
 507 | {   { #0 }
 508 |     { #1 }
 509 |   if$
 510 | }
 511 | 
 512 | FUNCTION {and}
 513 | {   { skip$ }
 514 |     { pop$ #0 }
 515 |   if$
 516 | }
 517 | 
 518 | FUNCTION {or}
 519 | {   { pop$ #1 }
 520 |     { skip$ }
 521 |   if$
 522 | }
 523 | 
 524 | 
 525 | % convert the strings "yes" or "no" to #1 or #0 respectively
 526 | FUNCTION {yes.no.to.int}
 527 | { "l" change.case$ duplicate$
 528 |     "yes" =
 529 |     { pop$  #1 }
 530 |     { duplicate$ "no" =
 531 |         { pop$ #0 }
 532 |         { "unknown boolean " quote$ * swap$ * quote$ *
 533 |           " in " * cite$ * warning$
 534 |           #0
 535 |         }
 536 |       if$
 537 |     }
 538 |   if$
 539 | }
 540 | 
 541 | 
 542 | % pushes true if the single char string on the stack is in the
 543 | % range of "0" to "9"
 544 | FUNCTION {is.num}
 545 | { chr.to.int$
 546 |   duplicate$ "0" chr.to.int$ < not
 547 |   swap$ "9" chr.to.int$ > not and
 548 | }
 549 | 
 550 | % multiplies the integer on the stack by a factor of 10
 551 | FUNCTION {bump.int.mag}
 552 | { #0 'multiresult :=
 553 |     { duplicate$ #0 > }
 554 |     { #1 -
 555 |       multiresult #10 +
 556 |       'multiresult :=
 557 |     }
 558 |   while$
 559 | pop$
 560 | multiresult
 561 | }
 562 | 
 563 | % converts a single character string on the stack to an integer
 564 | FUNCTION {char.to.integer}
 565 | { duplicate$ 
 566 |   is.num
 567 |     { chr.to.int$ "0" chr.to.int$ - }
 568 |     {"noninteger character " quote$ * swap$ * quote$ *
 569 |           " in integer field of " * cite$ * warning$
 570 |     #0
 571 |     }
 572 |   if$
 573 | }
 574 | 
 575 | % converts a string on the stack to an integer
 576 | FUNCTION {string.to.integer}
 577 | { duplicate$ text.length$ 'namesleft :=
 578 |   #1 'nameptr :=
 579 |   #0 'numnames :=
 580 |     { nameptr namesleft > not }
 581 |     { duplicate$ nameptr #1 substring$
 582 |       char.to.integer numnames bump.int.mag +
 583 |       'numnames :=
 584 |       nameptr #1 +
 585 |       'nameptr :=
 586 |     }
 587 |   while$
 588 | pop$
 589 | numnames
 590 | }
 591 | 
 592 | 
 593 | 
 594 | 
 595 | % The output routines write out the *next* to the top (previous) item on the
 596 | % stack, adding punctuation and such as needed. Since IEEEtran.bst maintains
 597 | % the output status for the top two items on the stack, these output
 598 | % routines have to consider the previous output status (which corresponds to
 599 | % the item that is being output). Full independent control of punctuation,
 600 | % closing quote marks, spacing, and newblock is provided.
 601 | % 
 602 | % "output.nonnull" does not check for the presence of a previous empty
 603 | % item.
 604 | % 
 605 | % "output" does check for the presence of a previous empty item and will
 606 | % remove an empty item rather than outputting it.
 607 | % 
 608 | % "output.warn" is like "output", but will issue a warning if it detects
 609 | % an empty item.
 610 | 
 611 | FUNCTION {output.nonnull}
 612 | { swap$
 613 |   prev.status.punct punct.comma =
 614 |      { "," * }
 615 |      { skip$ }
 616 |    if$
 617 |   prev.status.punct punct.period =
 618 |      { add.period$ }
 619 |      { skip$ }
 620 |    if$ 
 621 |   prev.status.quote quote.close =
 622 |      { "''" * }
 623 |      { skip$ }
 624 |    if$
 625 |   prev.status.space space.normal =
 626 |      { " " * }
 627 |      { skip$ }
 628 |    if$
 629 |   prev.status.space space.large =
 630 |      { large.space * }
 631 |      { skip$ }
 632 |    if$
 633 |   write$
 634 |   prev.status.nline nline.newblock =
 635 |      { newline$ "\newblock " write$ }
 636 |      { skip$ }
 637 |    if$
 638 | }
 639 | 
 640 | FUNCTION {output}
 641 | { duplicate$ empty$
 642 |     'pop$
 643 |     'output.nonnull
 644 |   if$
 645 | }
 646 | 
 647 | FUNCTION {output.warn}
 648 | { 't :=
 649 |   duplicate$ empty$
 650 |     { pop$ "empty " t * " in " * cite$ * warning$ }
 651 |     'output.nonnull
 652 |   if$
 653 | }
 654 | 
 655 | % "fin.entry" is the output routine that handles the last item of the entry
 656 | % (which will be on the top of the stack when "fin.entry" is called).
 657 | 
 658 | FUNCTION {fin.entry}
 659 | { this.status.punct punct.no =
 660 |      { skip$ }
 661 |      { add.period$ }
 662 |    if$
 663 |    this.status.quote quote.close =
 664 |      { "''" * }
 665 |      { skip$ }
 666 |    if$
 667 | write$
 668 | newline$
 669 | }
 670 | 
 671 | 
 672 | FUNCTION {is.last.char.not.punct}
 673 | { duplicate$
 674 |    "}" * add.period$
 675 |    #-1 #1 substring$ "." =
 676 | }
 677 | 
 678 | FUNCTION {is.multiple.pages}
 679 | { 't :=
 680 |   #0 'multiresult :=
 681 |     { multiresult not
 682 |       t empty$ not
 683 |       and
 684 |     }
 685 |     { t #1 #1 substring$
 686 |       duplicate$ "-" =
 687 |       swap$ duplicate$ "," =
 688 |       swap$ "+" =
 689 |       or or
 690 |         { #1 'multiresult := }
 691 |         { t #2 global.max$ substring$ 't := }
 692 |       if$
 693 |     }
 694 |   while$
 695 |   multiresult
 696 | }
 697 | 
 698 | FUNCTION {capitalize}{ "u" change.case$ "t" change.case$ }
 699 | 
 700 | FUNCTION {emphasize}
 701 | { duplicate$ empty$
 702 |     { pop$ "" }
 703 |     { "\emph{" swap$ * "}" * }
 704 |   if$
 705 | }
 706 | 
 707 | FUNCTION {do.name.latex.cmd}
 708 | { name.latex.cmd
 709 |   empty$
 710 |     { skip$ }
 711 |     { name.latex.cmd "{" * swap$ * "}" * }
 712 |   if$
 713 | }
 714 | 
 715 | % IEEEtran.bst uses its own \BIBforeignlanguage command which directly
 716 | % invokes the TeX hyphenation patterns without the need of the Babel
 717 | % package. Babel does a lot more than switch hyphenation patterns and
 718 | % its loading can cause unintended effects in many class files (such as
 719 | % IEEEtran.cls).
 720 | FUNCTION {select.language}
 721 | { duplicate$ empty$ 'pop$
 722 |     { language empty$ 'skip$
 723 |         { "\BIBforeignlanguage{" language * "}{" * swap$ * "}" * }
 724 |       if$
 725 |     }
 726 |   if$
 727 | }
 728 | 
 729 | FUNCTION {tie.or.space.prefix}
 730 | { duplicate$ text.length$ #3 <
 731 |     { "~" }
 732 |     { " " }
 733 |   if$
 734 |   swap$
 735 | }
 736 | 
 737 | FUNCTION {get.bbl.editor}
 738 | { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ }
 739 | 
 740 | FUNCTION {space.word}{ " " swap$ * " " * }
 741 | 
 742 | 
 743 | % Field Conditioners, Converters, Checkers and External Interfaces
 744 | 
 745 | FUNCTION {empty.field.to.null.string}
 746 | { duplicate$ empty$
 747 |     { pop$ "" }
 748 |     { skip$ }
 749 |   if$
 750 | }
 751 | 
 752 | FUNCTION {either.or.check}
 753 | { empty$
 754 |     { pop$ }
 755 |     { "can't use both " swap$ * " fields in " * cite$ * warning$ }
 756 |   if$
 757 | }
 758 | 
 759 | FUNCTION {empty.entry.warn}
 760 | { author empty$ title empty$ howpublished empty$
 761 |   month empty$ year empty$ note empty$ url empty$
 762 |   and and and and and and
 763 |     { "all relevant fields are empty in " cite$ * warning$ }
 764 |     'skip$
 765 |   if$
 766 | }
 767 | 
 768 | 
 769 | % The bibinfo system provides a way for the electronic parsing/acquisition
 770 | % of a bibliography's contents as is done by ReVTeX. For example, a field
 771 | % could be entered into the bibliography as:
 772 | % \bibinfo{volume}{2}
 773 | % Only the "2" would show up in the document, but the LaTeX \bibinfo command
 774 | % could do additional things with the information. IEEEtran.bst does provide
 775 | % a \bibinfo command via "\providecommand{\bibinfo}[2]{#2}". However, it is
 776 | % currently not used as the bogus bibinfo functions defined here output the
 777 | % entry values directly without the \bibinfo wrapper. The bibinfo functions
 778 | % themselves (and the calls to them) are retained for possible future use.
 779 | % 
 780 | % bibinfo.check avoids acting on missing fields while bibinfo.warn will
 781 | % issue a warning message if a missing field is detected. Prior to calling
 782 | % the bibinfo functions, the user should push the field value and then its
 783 | % name string, in that order.
 784 | 
 785 | FUNCTION {bibinfo.check}
 786 | { swap$ duplicate$ missing$
 787 |     { pop$ pop$ "" }
 788 |     { duplicate$ empty$
 789 |         { swap$ pop$ }
 790 |         { swap$ pop$ }
 791 |       if$
 792 |     }
 793 |   if$
 794 | }
 795 | 
 796 | FUNCTION {bibinfo.warn}
 797 | { swap$ duplicate$ missing$
 798 |     { swap$ "missing " swap$ * " in " * cite$ * warning$ pop$ "" }
 799 |     { duplicate$ empty$
 800 |         { swap$ "empty " swap$ * " in " * cite$ * warning$ }
 801 |         { swap$ pop$ }
 802 |       if$
 803 |     }
 804 |   if$
 805 | }
 806 | 
 807 | 
 808 | % The IEEE separates large numbers with more than 4 digits into groups of
 809 | % three. The IEEE uses a small space to separate these number groups. 
 810 | % Typical applications include patent and page numbers.
 811 | 
 812 | % number of consecutive digits required to trigger the group separation.
 813 | FUNCTION {large.number.trigger}{ #5 }
 814 | 
 815 | % For numbers longer than the trigger, this is the blocksize of the groups.
 816 | % The blocksize must be less than the trigger threshold, and 2 * blocksize
 817 | % must be greater than the trigger threshold (can't do more than one
 818 | % separation on the initial trigger).
 819 | FUNCTION {large.number.blocksize}{ #3 }
 820 | 
 821 | % What is actually inserted between the number groups.
 822 | FUNCTION {large.number.separator}{ "\," }
 823 | 
 824 | % So as to save on integer variables by reusing existing ones, numnames
 825 | % holds the current number of consecutive digits read and nameptr holds
 826 | % the number that will trigger an inserted space.
 827 | FUNCTION {large.number.separate}
 828 | { 't :=
 829 |   ""
 830 |   #0 'numnames :=
 831 |   large.number.trigger 'nameptr :=
 832 |   { t empty$ not }
 833 |   { t #-1 #1 substring$ is.num
 834 |       { numnames #1 + 'numnames := }
 835 |       { #0 'numnames := 
 836 |         large.number.trigger 'nameptr :=
 837 |       }
 838 |     if$
 839 |     t #-1 #1 substring$ swap$ *
 840 |     t #-2 global.max$ substring$ 't :=
 841 |     numnames nameptr =
 842 |       { duplicate$ #1 nameptr large.number.blocksize - substring$ swap$
 843 |         nameptr large.number.blocksize - #1 + global.max$ substring$
 844 |         large.number.separator swap$ * *
 845 |         nameptr large.number.blocksize - 'numnames :=
 846 |         large.number.blocksize #1 + 'nameptr :=
 847 |       }
 848 |       { skip$ }
 849 |     if$
 850 |   }
 851 |   while$
 852 | }
 853 | 
 854 | % Converts all single dashes "-" to double dashes "--".
 855 | FUNCTION {n.dashify}
 856 | { large.number.separate
 857 |   't :=
 858 |   ""
 859 |     { t empty$ not }
 860 |     { t #1 #1 substring$ "-" =
 861 |         { t #1 #2 substring$ "--" = not
 862 |             { "--" *
 863 |               t #2 global.max$ substring$ 't :=
 864 |             }
 865 |             {   { t #1 #1 substring$ "-" = }
 866 |                 { "-" *
 867 |                   t #2 global.max$ substring$ 't :=
 868 |                 }
 869 |               while$
 870 |             }
 871 |           if$
 872 |         }
 873 |         { t #1 #1 substring$ *
 874 |           t #2 global.max$ substring$ 't :=
 875 |         }
 876 |       if$
 877 |     }
 878 |   while$
 879 | }
 880 | 
 881 | 
 882 | % This function detects entries with names that are identical to that of
 883 | % the previous entry and replaces the repeated names with dashes (if the
 884 | % "is.dash.repeated.names" user control is nonzero).
 885 | FUNCTION {name.or.dash}
 886 | { 's :=
 887 |    oldname empty$
 888 |      { s 'oldname := s }
 889 |      { s oldname =
 890 |          { is.dash.repeated.names
 891 |               { repeated.name.dashes }
 892 |               { s 'oldname := s }
 893 |             if$
 894 |          }
 895 |          { s 'oldname := s }
 896 |        if$
 897 |      }
 898 |    if$
 899 | }
 900 | 
 901 | % Converts the number string on the top of the stack to
 902 | % "numerical ordinal form" (e.g., "7" to "7th"). There is
 903 | % no artificial limit to the upper bound of the numbers as the
 904 | % two least significant digits determine the ordinal form.
 905 | FUNCTION {num.to.ordinal}
 906 | { duplicate$ #-2 #1 substring$ "1" =
 907 |       { bbl.th * }
 908 |       { duplicate$ #-1 #1 substring$ "1" =
 909 |           { bbl.st * }
 910 |           { duplicate$ #-1 #1 substring$ "2" =
 911 |               { bbl.nd * }
 912 |               { duplicate$ #-1 #1 substring$ "3" =
 913 |                   { bbl.rd * }
 914 |                   { bbl.th * }
 915 |                 if$
 916 |               }
 917 |             if$
 918 |           }
 919 |         if$
 920 |       }
 921 |     if$
 922 | }
 923 | 
 924 | % If the string on the top of the stack begins with a number
 925 | % (e.g., 11th), then replace the string with the leading number
 926 | % it contains. Otherwise retain the string as-is. s holds the
 927 | % extracted number, t holds the part of the string that remains
 928 | % to be scanned.
 929 | FUNCTION {extract.num}
 930 | { duplicate$ 't :=
 931 |   "" 's :=
 932 |   { t empty$ not }
 933 |   { t #1 #1 substring$
 934 |     t #2 global.max$ substring$ 't :=
 935 |     duplicate$ is.num
 936 |       { s swap$ * 's := }
 937 |       { pop$ "" 't := }
 938 |     if$
 939 |   }
 940 |   while$
 941 |   s empty$
 942 |     'skip$
 943 |     { pop$ s }
 944 |   if$
 945 | }
 946 | 
 947 | % Converts the ordinal word string (e.g., "third") on the top of the
 948 | % stack to Arabic numeral string form. Handles "first" through "tenth";
 948 | % any other string is left unchanged.
 949 | FUNCTION {word.to.num}
 950 | { duplicate$ "l" change.case$ 's :=
 951 |   s "first" =
 952 |     { pop$ "1" }
 953 |     { skip$ }
 954 |   if$
 955 |   s "second" =
 956 |     { pop$ "2" }
 957 |     { skip$ }
 958 |   if$
 959 |   s "third" =
 960 |     { pop$ "3" }
 961 |     { skip$ }
 962 |   if$
 963 |   s "fourth" =
 964 |     { pop$ "4" }
 965 |     { skip$ }
 966 |   if$
 967 |   s "fifth" =
 968 |     { pop$ "5" }
 969 |     { skip$ }
 970 |   if$
 971 |   s "sixth" =
 972 |     { pop$ "6" }
 973 |     { skip$ }
 974 |   if$
 975 |   s "seventh" =
 976 |     { pop$ "7" }
 977 |     { skip$ }
 978 |   if$
 979 |   s "eighth" =
 980 |     { pop$ "8" }
 981 |     { skip$ }
 982 |   if$
 983 |   s "ninth" =
 984 |     { pop$ "9" }
 985 |     { skip$ }
 986 |   if$
 987 |   s "tenth" =
 988 |     { pop$ "10" }
 989 |     { skip$ }
 990 |   if$
 991 | }
 992 | 
 993 | 
 994 | % Converts the string on the top of the stack to numerical
 995 | % ordinal (e.g., "11th") form.
 996 | FUNCTION {convert.edition}
 997 | { duplicate$ empty$ 'skip$
 998 |     { duplicate$ #1 #1 substring$ is.num
 999 |         { extract.num
1000 |           num.to.ordinal
1001 |         }
1002 |         { word.to.num
1003 |           duplicate$ #1 #1 substring$ is.num
1004 |             { num.to.ordinal }
1005 |             { "edition ordinal word " quote$ * edition * quote$ *
1006 |               " may be too high (or improper) for conversion" * " in " * cite$ * warning$
1007 |             }
1008 |           if$
1009 |         }
1010 |       if$
1011 |     }
1012 |   if$
1013 | }
1014 | 
1015 | 
1016 | 
1017 | 
1018 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1019 | %% LATEX BIBLIOGRAPHY CODE %%
1020 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1021 | 
1022 | FUNCTION {start.entry}
1023 | { newline$
1024 |   "\bibitem{" write$
1025 |   cite$ write$
1026 |   "}" write$
1027 |   newline$
1028 |   ""
1029 |   initialize.prev.this.status
1030 | }
1031 | 
1032 | % Here we write out all the LaTeX code that we will need. The most involved
1033 | % code sequences are those that control the alternate interword spacing and
1034 | % foreign language hyphenation patterns. The heavy use of \providecommand
1035 | % gives users a way to override the defaults. Special thanks to Javier Bezos,
1036 | % Johannes Braams, Robin Fairbairns, Heiko Oberdiek, Donald Arseneau and all
1037 | % the other gurus on comp.text.tex for their help and advice on the topic of
1038 | % \selectlanguage, Babel and BibTeX.
1039 | FUNCTION {begin.bib}
1040 | { "% Generated by IEEEtran.bst, version: " bst.file.version * " (" * bst.file.date * ")" *
1041 |   write$ newline$
1042 |   preamble$ empty$ 'skip$
1043 |     { preamble$ write$ newline$ }
1044 |   if$
1045 |   "\begin{thebibliography}{"  longest.label  * "}" *
1046 |   write$ newline$
1047 |   "\providecommand{\url}[1]{#1}"
1048 |   write$ newline$
1049 |   "\csname url@samestyle\endcsname"
1050 |   write$ newline$
1051 |   "\providecommand{\newblock}{\relax}"
1052 |   write$ newline$
1053 |   "\providecommand{\bibinfo}[2]{#2}"
1054 |   write$ newline$
1055 |   "\providecommand{\BIBentrySTDinterwordspacing}{\spaceskip=0pt\relax}"
1056 |   write$ newline$
1057 |   "\providecommand{\BIBentryALTinterwordstretchfactor}{"
1058 |   ALTinterwordstretchfactor * "}" *
1059 |   write$ newline$
1060 |   "\providecommand{\BIBentryALTinterwordspacing}{\spaceskip=\fontdimen2\font plus "
1061 |   write$ newline$
1062 |   "\BIBentryALTinterwordstretchfactor\fontdimen3\font minus \fontdimen4\font\relax}"
1063 |   write$ newline$
1064 |   "\providecommand{\BIBforeignlanguage}[2]{{%"
1065 |   write$ newline$
1066 |   "\expandafter\ifx\csname l@#1\endcsname\relax"
1067 |   write$ newline$
1068 |   "\typeout{** WARNING: IEEEtran.bst: No hyphenation pattern has been}%"
1069 |   write$ newline$
1070 |   "\typeout{** loaded for the language `#1'. Using the pattern for}%"
1071 |   write$ newline$
1072 |   "\typeout{** the default language instead.}%"
1073 |   write$ newline$
1074 |   "\else"
1075 |   write$ newline$
1076 |   "\language=\csname l@#1\endcsname"
1077 |   write$ newline$
1078 |   "\fi"
1079 |   write$ newline$
1080 |   "#2}}"
1081 |   write$ newline$
1082 |   "\providecommand{\BIBdecl}{\relax}"
1083 |   write$ newline$
1084 |   "\BIBdecl"
1085 |   write$ newline$
1086 | }
1087 | 
1088 | FUNCTION {end.bib}
1089 | { newline$ "\end{thebibliography}" write$ newline$ }
1090 | 
1091 | FUNCTION {if.url.alt.interword.spacing}
1092 | { is.use.alt.interword.spacing
1093 |     { is.use.url
1094 |         { url empty$ 'skip$ {"\BIBentryALTinterwordspacing" write$ newline$} if$ }
1095 |         { skip$ }
1096 |       if$
1097 |     }
1098 |     { skip$ }
1099 |   if$
1100 | }
1101 | 
1102 | FUNCTION {if.url.std.interword.spacing}
1103 | { is.use.alt.interword.spacing
1104 |     { is.use.url
1105 |         { url empty$ 'skip$ {"\BIBentrySTDinterwordspacing" write$ newline$} if$ }
1106 |         { skip$ }
1107 |       if$
1108 |     }
1109 |     { skip$ }
1110 |   if$
1111 | }
1112 | 
1113 | 
1114 | 
1115 | 
1116 | %%%%%%%%%%%%%%%%%%%%%%%%
1117 | %% LONGEST LABEL PASS %%
1118 | %%%%%%%%%%%%%%%%%%%%%%%%
1119 | 
1120 | FUNCTION {initialize.longest.label}
1121 | { "" 'longest.label :=
1122 |   #1 'number.label :=
1123 |   #0 'longest.label.width :=
1124 | }
1125 | 
1126 | FUNCTION {longest.label.pass}
1127 | { type$ "ieeetranbstctl" =
1128 |     { skip$ }
1129 |     { number.label int.to.str$ 'label :=
1130 |       number.label #1 + 'number.label :=
1131 |       label width$ longest.label.width >
1132 |         { label 'longest.label :=
1133 |           label width$ 'longest.label.width :=
1134 |         }
1135 |         { skip$ }
1136 |       if$
1137 |     }
1138 |   if$
1139 | }
1140 | 
1141 | 
1142 | 
1143 | 
1144 | %%%%%%%%%%%%%%%%%%%%%
1145 | %% FORMAT HANDLERS %%
1146 | %%%%%%%%%%%%%%%%%%%%%
1147 | 
1148 | %% Lower Level Formats (used by higher level formats)
1149 | 
1150 | FUNCTION {format.address.org.or.pub.date}
1151 | { 't :=
1152 |   ""
1153 |   year empty$
1154 |     { "empty year in " cite$ * warning$ }
1155 |     { skip$ }
1156 |   if$
1157 |   address empty$ t empty$ and
1158 |   year empty$ and month empty$ and
1159 |     { skip$ }
1160 |     { this.to.prev.status
1161 |       this.status.std
1162 |       cap.status.std
1163 |       address "address" bibinfo.check *
1164 |       t empty$
1165 |         { skip$ }
1166 |         { punct.period 'prev.status.punct :=
1167 |           space.large 'prev.status.space :=
1168 |           address empty$
1169 |             { skip$ }
1170 |             { ": " * }
1171 |           if$
1172 |           t *
1173 |         }
1174 |       if$
1175 |       year empty$ month empty$ and
1176 |         { skip$ }
1177 |         { t empty$ address empty$ and
1178 |             { skip$ }
1179 |             { ", " * }
1180 |           if$
1181 |           month empty$
1182 |             { year empty$
1183 |                 { skip$ }
1184 |                 { year "year" bibinfo.check * }
1185 |               if$
1186 |             }
1187 |             { month "month" bibinfo.check *
1188 |               year empty$
1189 |                  { skip$ }
1190 |                  { " " * year "year" bibinfo.check * }
1191 |               if$
1192 |             }
1193 |           if$
1194 |         }
1195 |       if$
1196 |     }
1197 |   if$
1198 | }
1199 | 
1200 | 
1201 | FUNCTION {format.names}
1202 | { 'bibinfo :=
1203 |   duplicate$ empty$ 'skip$ {
1204 |   this.to.prev.status
1205 |   this.status.std
1206 |   's :=
1207 |   "" 't :=
1208 |   #1 'nameptr :=
1209 |   s num.names$ 'numnames :=
1210 |   numnames 'namesleft :=
1211 |     { namesleft #0 > }
1212 |     { s nameptr
1213 |       name.format.string
1214 |       format.name$
1215 |       bibinfo bibinfo.check
1216 |       't :=
1217 |       nameptr #1 >
1218 |         { nameptr num.names.shown.with.forced.et.al #1 + =
1219 |           numnames max.num.names.before.forced.et.al >
1220 |           is.forced.et.al and and
1221 |             { "others" 't :=
1222 |               #1 'namesleft :=
1223 |             }
1224 |             { skip$ }
1225 |           if$
1226 |           namesleft #1 >
1227 |             { ", " * t do.name.latex.cmd * }
1228 |             { s nameptr "{ll}" format.name$ duplicate$ "others" =
1229 |                 { 't := }
1230 |                 { pop$ }
1231 |               if$
1232 |               t "others" =
1233 |                 { " " * bbl.etal emphasize * }
1234 |                 { numnames #2 >
1235 |                     { "," * }
1236 |                     { skip$ }
1237 |                   if$
1238 |                   bbl.and
1239 |                   space.word * t do.name.latex.cmd *
1240 |                 }
1241 |               if$
1242 |             }
1243 |           if$
1244 |         }
1245 |         { t do.name.latex.cmd }
1246 |       if$
1247 |       nameptr #1 + 'nameptr :=
1248 |       namesleft #1 - 'namesleft :=
1249 |     }
1250 |   while$
1251 |   cap.status.std
1252 |   } if$
1253 | }
1254 | 
1255 | 
1256 | 
1257 | 
1258 | %% Higher Level Formats
1259 | 
1260 | %% addresses/locations
1261 | 
1262 | FUNCTION {format.address}
1263 | { address duplicate$ empty$ 'skip$
1264 |     { this.to.prev.status
1265 |       this.status.std
1266 |       cap.status.std
1267 |     }
1268 |   if$
1269 | }
1270 | 
1271 | 
1272 | 
1273 | %% author/editor names
1274 | 
1275 | FUNCTION {format.authors}{ author "author" format.names }
1276 | 
1277 | FUNCTION {format.editors}
1278 | { editor "editor" format.names duplicate$ empty$ 'skip$
1279 |     { ", " *
1280 |       get.bbl.editor
1281 |       capitalize
1282 |       *
1283 |     }
1284 |   if$
1285 | }
1286 | 
1287 | 
1288 | 
1289 | %% date
1290 | 
1291 | FUNCTION {format.date}
1292 | {
1293 |   month "month" bibinfo.check duplicate$ empty$
1294 |   year  "year" bibinfo.check duplicate$ empty$
1295 |     { swap$ 'skip$
1296 |         { this.to.prev.status
1297 |           this.status.std
1298 |           cap.status.std
1299 |          "there's a month but no year in " cite$ * warning$ }
1300 |       if$
1301 |       *
1302 |     }
1303 |     { this.to.prev.status
1304 |       this.status.std
1305 |       cap.status.std
1306 |       swap$ 'skip$
1307 |         {
1308 |           swap$
1309 |           " " * swap$
1310 |         }
1311 |       if$
1312 |       *
1313 |     }
1314 |   if$
1315 | }
1316 | 
1317 | FUNCTION {format.date.electronic}
1318 | { month "month" bibinfo.check duplicate$ empty$
1319 |   year  "year" bibinfo.check duplicate$ empty$
1320 |     { swap$ 
1321 |         { pop$ }
1322 |         { "there's a month but no year in " cite$ * warning$
1323 |         pop$ ")" * "(" swap$ *
1324 |         this.to.prev.status
1325 |         punct.no 'this.status.punct :=
1326 |         space.normal 'this.status.space :=
1327 |         quote.no 'this.status.quote :=
1328 |         cap.yes  'status.cap :=
1329 |         }
1330 |       if$
1331 |     }
1332 |     { swap$ 
1333 |         { swap$ pop$ ")" * "(" swap$ * }
1334 |         { "(" swap$ * ", " * swap$ * ")" * }
1335 |       if$
1336 |     this.to.prev.status
1337 |     punct.no 'this.status.punct :=
1338 |     space.normal 'this.status.space :=
1339 |     quote.no 'this.status.quote :=
1340 |     cap.yes  'status.cap :=
1341 |     }
1342 |   if$
1343 | }
1344 | 
1345 | 
1346 | 
1347 | %% edition/title
1348 | 
1349 | % Note: The IEEE considers the edition to be closely associated with
1350 | % the title of a book. So, in IEEEtran.bst the edition is normally handled 
1351 | % within the formatting of the title. The format.edition function is 
1352 | % retained here for possible future use.
1353 | FUNCTION {format.edition}
1354 | { edition duplicate$ empty$ 'skip$
1355 |     { this.to.prev.status
1356 |       this.status.std
1357 |       convert.edition
1358 |       status.cap
1359 |         { "t" }
1360 |         { "l" }
1361 |       if$ change.case$
1362 |       "edition" bibinfo.check
1363 |       "~" * bbl.edition *
1364 |       cap.status.std
1365 |     }
1366 |   if$
1367 | }
1368 | 
1369 | % This is used to format the booktitle of a conference proceedings.
1370 | % Here we use the "intype" field to provide the user a way to 
1371 | % override the word "in" (e.g., with things like "presented at").
1372 | % Use of intype stops the emphasis of the booktitle to indicate that
1373 | % we no longer mean the written conference proceedings, but the
1374 | % conference itself.
1375 | FUNCTION {format.in.booktitle}
1376 | { booktitle "booktitle" bibinfo.check duplicate$ empty$ 'skip$
1377 |     { this.to.prev.status
1378 |       this.status.std
1379 |       select.language
1380 |       intype missing$
1381 |         { emphasize
1382 |           bbl.in " " *
1383 |         }
1384 |         { intype " " * }
1385 |       if$
1386 |       swap$ *
1387 |       cap.status.std
1388 |     }
1389 |   if$
1390 | }
1391 | 
1392 | % This is used to format the booktitle of a collection.
1393 | % Here the "intype" field is not supported, but "edition" is.
1394 | FUNCTION {format.in.booktitle.edition}
1395 | { booktitle "booktitle" bibinfo.check duplicate$ empty$ 'skip$
1396 |     { this.to.prev.status
1397 |       this.status.std
1398 |       select.language
1399 |       emphasize
1400 |       edition empty$ 'skip$
1401 |         { ", " *
1402 |           edition
1403 |           convert.edition
1404 |           "l" change.case$
1405 |           * "~" * bbl.edition *
1406 |         }
1407 |       if$
1408 |       bbl.in " " * swap$ *
1409 |       cap.status.std
1410 |     }
1411 |   if$
1412 | }
1413 | 
1414 | FUNCTION {format.article.title}
1415 | { title duplicate$ empty$ 'skip$
1416 |     { this.to.prev.status
1417 |       this.status.std
1418 |       "t" change.case$
1419 |     }
1420 |   if$
1421 |   "title" bibinfo.check
1422 |   duplicate$ empty$ 'skip$
1423 |     { quote.close 'this.status.quote :=
1424 |       is.last.char.not.punct
1425 |         { punct.std 'this.status.punct := }
1426 |         { punct.no 'this.status.punct := }
1427 |       if$
1428 |       select.language
1429 |       "``" swap$ *
1430 |       cap.status.std
1431 |     }
1432 |   if$
1433 | }
1434 | 
1435 | FUNCTION {format.article.title.electronic}
1436 | { title duplicate$ empty$ 'skip$
1437 |     { this.to.prev.status
1438 |       this.status.std
1439 |       cap.status.std
1440 |       "t" change.case$ 
1441 |     }
1442 |   if$
1443 |   "title" bibinfo.check
1444 |   duplicate$ empty$ 
1445 |     { skip$ } 
1446 |     { select.language }
1447 |   if$
1448 | }
1449 | 
1450 | FUNCTION {format.book.title.edition}
1451 | { title "title" bibinfo.check
1452 |   duplicate$ empty$
1453 |     { "empty title in " cite$ * warning$ }
1454 |     { this.to.prev.status
1455 |       this.status.std
1456 |       select.language
1457 |       emphasize
1458 |       edition empty$ 'skip$
1459 |         { ", " *
1460 |           edition
1461 |           convert.edition
1462 |           status.cap
1463 |             { "t" }
1464 |             { "l" }
1465 |           if$
1466 |           change.case$
1467 |           * "~" * bbl.edition *
1468 |         }
1469 |       if$
1470 |       cap.status.std
1471 |     }
1472 |   if$
1473 | }
1474 | 
1475 | FUNCTION {format.book.title}
1476 | { title "title" bibinfo.check
1477 |   duplicate$ empty$ 'skip$
1478 |     { this.to.prev.status
1479 |       this.status.std
1480 |       cap.status.std
1481 |       select.language
1482 |       emphasize
1483 |     }
1484 |   if$
1485 | }
1486 | 
1487 | 
1488 | 
1489 | %% journal
1490 | 
1491 | FUNCTION {format.journal}
1492 | { journal duplicate$ empty$ 'skip$
1493 |     { this.to.prev.status
1494 |       this.status.std
1495 |       cap.status.std
1496 |       select.language
1497 |       emphasize
1498 |     }
1499 |   if$
1500 | }
1501 | 
1502 | 
1503 | 
1504 | %% how published
1505 | 
1506 | FUNCTION {format.howpublished}
1507 | { howpublished duplicate$ empty$ 'skip$
1508 |     { this.to.prev.status
1509 |       this.status.std
1510 |       cap.status.std
1511 |     }
1512 |   if$
1513 | }
1514 | 
1515 | 
1516 | 
1517 | %% institutions/organization/publishers/school
1518 | 
1519 | FUNCTION {format.institution}
1520 | { institution duplicate$ empty$ 'skip$
1521 |     { this.to.prev.status
1522 |       this.status.std
1523 |       cap.status.std
1524 |     }
1525 |   if$
1526 | }
1527 | 
1528 | FUNCTION {format.organization}
1529 | { organization duplicate$ empty$ 'skip$
1530 |     { this.to.prev.status
1531 |       this.status.std
1532 |       cap.status.std
1533 |     }
1534 |   if$
1535 | }
1536 | 
1537 | FUNCTION {format.address.publisher.date}
1538 | { publisher "publisher" bibinfo.warn format.address.org.or.pub.date }
1539 | 
1540 | FUNCTION {format.address.publisher.date.nowarn}
1541 | { publisher "publisher" bibinfo.check format.address.org.or.pub.date }
1542 | 
1543 | FUNCTION {format.address.organization.date}
1544 | { organization "organization" bibinfo.check format.address.org.or.pub.date }
1545 | 
1546 | FUNCTION {format.school}
1547 | { school duplicate$ empty$ 'skip$
1548 |     { this.to.prev.status
1549 |       this.status.std
1550 |       cap.status.std
1551 |     }
1552 |   if$
1553 | }
1554 | 
1555 | 
1556 | 
1557 | %% volume/number/series/chapter/pages
1558 | 
1559 | FUNCTION {format.volume}
1560 | { volume empty.field.to.null.string
1561 |   duplicate$ empty$ 'skip$
1562 |     { this.to.prev.status
1563 |       this.status.std
1564 |       bbl.volume 
1565 |       status.cap
1566 |         { capitalize }
1567 |         { skip$ }
1568 |       if$
1569 |       swap$ tie.or.space.prefix
1570 |       "volume" bibinfo.check
1571 |       * *
1572 |       cap.status.std
1573 |     }
1574 |   if$
1575 | }
1576 | 
1577 | FUNCTION {format.number}
1578 | { number empty.field.to.null.string
1579 |   duplicate$ empty$ 'skip$
1580 |     { this.to.prev.status
1581 |       this.status.std
1582 |       status.cap
1583 |          { bbl.number capitalize }
1584 |          { bbl.number }
1585 |        if$
1586 |       swap$ tie.or.space.prefix
1587 |       "number" bibinfo.check
1588 |       * *
1589 |       cap.status.std
1590 |     }
1591 |   if$
1592 | }
1593 | 
1594 | FUNCTION {format.number.if.use.for.article}
1595 | { is.use.number.for.article 
1596 |      { format.number }
1597 |      { "" }
1598 |    if$
1599 | }
1600 | 
1601 | % The IEEE does not seem to tie the series so closely with the volume
1602 | % and number as is done in other bibliography styles. Instead the
1603 | % series is treated somewhat like an extension of the title.
1604 | FUNCTION {format.series}
1605 | { series empty$ 
1606 |    { "" }
1607 |    { this.to.prev.status
1608 |      this.status.std
1609 |      bbl.series " " *
1610 |      series "series" bibinfo.check *
1611 |      cap.status.std
1612 |    }
1613 |  if$
1614 | }
1615 | 
1616 | 
1617 | FUNCTION {format.chapter}
1618 | { chapter empty$
1619 |     { "" }
1620 |     { this.to.prev.status
1621 |       this.status.std
1622 |       type empty$
1623 |         { bbl.chapter }
1624 |         { type "l" change.case$
1625 |           "type" bibinfo.check
1626 |         }
1627 |       if$
1628 |       chapter tie.or.space.prefix
1629 |       "chapter" bibinfo.check
1630 |       * *
1631 |       cap.status.std
1632 |     }
1633 |   if$
1634 | }
1635 | 
1636 | 
1637 | % The intended use of format.paper is for paper numbers of inproceedings.
1638 | % The paper type can be overridden via the type field.
1639 | % We allow the type to be displayed even if the paper number is absent
1640 | % for things like "postdeadline paper"
1641 | FUNCTION {format.paper}
1642 | { is.use.paper
1643 |      { paper empty$
1644 |         { type empty$
1645 |             { "" }
1646 |             { this.to.prev.status
1647 |               this.status.std
1648 |               type "type" bibinfo.check
1649 |               cap.status.std
1650 |             }
1651 |           if$
1652 |         }
1653 |         { this.to.prev.status
1654 |           this.status.std
1655 |           type empty$
1656 |             { bbl.paper }
1657 |             { type "type" bibinfo.check }
1658 |           if$
1659 |           " " * paper
1660 |           "paper" bibinfo.check
1661 |           *
1662 |           cap.status.std
1663 |         }
1664 |       if$
1665 |      }
1666 |      { "" } 
1667 |    if$
1668 | }
1669 | 
1670 | 
1671 | FUNCTION {format.pages}
1672 | { pages duplicate$ empty$ 'skip$
1673 |     { this.to.prev.status
1674 |       this.status.std
1675 |       duplicate$ is.multiple.pages
1676 |         {
1677 |           bbl.pages swap$
1678 |           n.dashify
1679 |         }
1680 |         {
1681 |           bbl.page swap$
1682 |         }
1683 |       if$
1684 |       tie.or.space.prefix
1685 |       "pages" bibinfo.check
1686 |       * *
1687 |       cap.status.std
1688 |     }
1689 |   if$
1690 | }
1691 | 
1692 | 
1693 | 
1694 | %% technical report number
1695 | 
1696 | FUNCTION {format.tech.report.number}
1697 | { number "number" bibinfo.check
1698 |   this.to.prev.status
1699 |   this.status.std
1700 |   cap.status.std
1701 |   type duplicate$ empty$
1702 |     { pop$ 
1703 |       bbl.techrep
1704 |     }
1705 |     { skip$ }
1706 |   if$
1707 |   "type" bibinfo.check 
1708 |   swap$ duplicate$ empty$
1709 |     { pop$ }
1710 |     { tie.or.space.prefix * * }
1711 |   if$
1712 | }
1713 | 
1714 | 
1715 | 
1716 | %% note
1717 | 
1718 | FUNCTION {format.note}
1719 | { note empty$
1720 |     { "" }
1721 |     { this.to.prev.status
1722 |       this.status.std
1723 |       punct.period 'this.status.punct :=
1724 |       note #1 #1 substring$
1725 |       duplicate$ "{" =
1726 |         { skip$ }
1727 |         { status.cap
1728 |           { "u" }
1729 |           { "l" }
1730 |         if$
1731 |         change.case$
1732 |         }
1733 |       if$
1734 |       note #2 global.max$ substring$ * "note" bibinfo.check
1735 |       cap.yes  'status.cap :=
1736 |     }
1737 |   if$
1738 | }
1739 | 
1740 | 
1741 | 
1742 | %% patent
1743 | 
1744 | FUNCTION {format.patent.date}
1745 | { this.to.prev.status
1746 |   this.status.std
1747 |   year empty$
1748 |     { monthfiled duplicate$ empty$
1749 |         { "monthfiled" bibinfo.check pop$ "" }
1750 |         { "monthfiled" bibinfo.check }
1751 |       if$
1752 |       dayfiled duplicate$ empty$
1753 |         { "dayfiled" bibinfo.check pop$ "" * }
1754 |         { "dayfiled" bibinfo.check 
1755 |           monthfiled empty$ 
1756 |              { "dayfiled without a monthfiled in " cite$ * warning$
1757 |                * 
1758 |              }
1759 |              { " " swap$ * * }
1760 |            if$
1761 |         }
1762 |       if$
1763 |       yearfiled empty$
1764 |         { "no year or yearfiled in " cite$ * warning$ }
1765 |         { yearfiled "yearfiled" bibinfo.check 
1766 |           swap$
1767 |           duplicate$ empty$
1768 |              { pop$ }
1769 |              { ", " * swap$ * }
1770 |            if$
1771 |         }
1772 |       if$
1773 |     }
1774 |     { month duplicate$ empty$
1775 |         { "month" bibinfo.check pop$ "" }
1776 |         { "month" bibinfo.check }
1777 |       if$
1778 |       day duplicate$ empty$
1779 |         { "day" bibinfo.check pop$ "" * }
1780 |         { "day" bibinfo.check 
1781 |           month empty$ 
1782 |              { "day without a month in " cite$ * warning$
1783 |                * 
1784 |              }
1785 |              { " " swap$ * * }
1786 |            if$
1787 |         }
1788 |       if$
1789 |       year "year" bibinfo.check 
1790 |       swap$
1791 |       duplicate$ empty$
1792 |         { pop$ }
1793 |         { ", " * swap$ * }
1794 |       if$
1795 |     }
1796 |   if$
1797 |   cap.status.std
1798 | }
1799 | 
1800 | FUNCTION {format.patent.nationality.type.number}
1801 | { this.to.prev.status
1802 |   this.status.std
1803 |   nationality duplicate$ empty$
1804 |     { "nationality" bibinfo.warn pop$ "" }
1805 |     { "nationality" bibinfo.check
1806 |       duplicate$ "l" change.case$ "united states" =
1807 |         { pop$ bbl.patentUS }
1808 |         { skip$ }
1809 |       if$
1810 |       " " *
1811 |     }
1812 |   if$
1813 |   type empty$
1814 |     { bbl.patent "type" bibinfo.check }
1815 |     { type "type" bibinfo.check }
1816 |   if$  
1817 |   *
1818 |   number duplicate$ empty$
1819 |     { "number" bibinfo.warn pop$ }
1820 |     { "number" bibinfo.check
1821 |       large.number.separate
1822 |       swap$ " " * swap$ *
1823 |     }
1824 |   if$ 
1825 |   cap.status.std
1826 | }
1827 | 
1828 | 
1829 | 
1830 | %% standard
1831 | 
1832 | FUNCTION {format.organization.institution.standard.type.number}
1833 | { this.to.prev.status
1834 |   this.status.std
1835 |   organization duplicate$ empty$
1836 |     { pop$ 
1837 |       institution duplicate$ empty$
1838 |         { "institution" bibinfo.warn }
1839 |         { "institution" bibinfo.warn " " * }
1840 |       if$
1841 |     }
1842 |     { "organization" bibinfo.warn " " * }
1843 |   if$
1844 |   type empty$
1845 |     { bbl.standard "type" bibinfo.check }
1846 |     { type "type" bibinfo.check }
1847 |   if$  
1848 |   *
1849 |   number duplicate$ empty$
1850 |     { "number" bibinfo.check pop$ }
1851 |     { "number" bibinfo.check
1852 |       large.number.separate
1853 |       swap$ " " * swap$ *
1854 |     }
1855 |   if$ 
1856 |   cap.status.std
1857 | }
1858 | 
1859 | FUNCTION {format.revision}
1860 | { revision empty$
1861 |     { "" }
1862 |     { this.to.prev.status
1863 |       this.status.std
1864 |       bbl.revision
1865 |       revision tie.or.space.prefix
1866 |       "revision" bibinfo.check
1867 |       * *
1868 |       cap.status.std
1869 |     }
1870 |   if$
1871 | }
1872 | 
1873 | 
1874 | %% thesis
1875 | 
1876 | FUNCTION {format.master.thesis.type}
1877 | { this.to.prev.status
1878 |   this.status.std
1879 |   type empty$
1880 |     {
1881 |       bbl.mthesis
1882 |     }
1883 |     { 
1884 |       type "type" bibinfo.check
1885 |     }
1886 |   if$
1887 | cap.status.std
1888 | }
1889 | 
1890 | FUNCTION {format.phd.thesis.type}
1891 | { this.to.prev.status
1892 |   this.status.std
1893 |   type empty$
1894 |     {
1895 |       bbl.phdthesis
1896 |     }
1897 |     { 
1898 |       type "type" bibinfo.check
1899 |     }
1900 |   if$
1901 | cap.status.std
1902 | }
1903 | 
1904 | 
1905 | 
1906 | %% URL
1907 | 
1908 | FUNCTION {format.url}
1909 | { is.use.url
1910 |     { url empty$
1911 |       { "" }
1912 |       { this.to.prev.status
1913 |         this.status.std
1914 |         cap.yes 'status.cap :=
1915 |         name.url.prefix " " *
1916 |         "\url{" * url * "}" *
1917 |         punct.no 'this.status.punct :=
1918 |         punct.period 'prev.status.punct :=
1919 |         space.normal 'this.status.space :=
1920 |         space.normal 'prev.status.space :=
1921 |         quote.no 'this.status.quote :=
1922 |       }
1923 |     if$
1924 |     }
1925 |     { "" }
1926 |   if$
1927 | }
1928 | 
1929 | 
1930 | 
1931 | 
1932 | %%%%%%%%%%%%%%%%%%%%
1933 | %% ENTRY HANDLERS %%
1934 | %%%%%%%%%%%%%%%%%%%%
1935 | 
1936 | 
1937 | % Note: In many journals, the IEEE (or the authors) tend not to show the number
1938 | % for articles, so the display of the number is controlled here by the
1939 | % switch "is.use.number.for.article"
1940 | FUNCTION {article}
1941 | { std.status.using.comma
1942 |   start.entry
1943 |   if.url.alt.interword.spacing
1944 |   format.authors "author" output.warn
1945 |   name.or.dash
1946 |   format.article.title "title" output.warn
1947 |   format.journal "journal" bibinfo.check "journal" output.warn
1948 |   format.volume output
1949 |   format.number.if.use.for.article output
1950 |   format.pages output
1951 |   format.date "year" output.warn
1952 |   format.note output
1953 |   format.url output
1954 |   fin.entry
1955 |   if.url.std.interword.spacing
1956 | }
1957 | 
1958 | FUNCTION {book}
1959 | { std.status.using.comma
1960 |   start.entry
1961 |   if.url.alt.interword.spacing
1962 |   author empty$
1963 |     { format.editors "author and editor" output.warn }
1964 |     { format.authors output.nonnull }
1965 |   if$
1966 |   name.or.dash
1967 |   format.book.title.edition output
1968 |   format.series output
1969 |   author empty$
1970 |     { skip$ }
1971 |     { format.editors output }
1972 |   if$
1973 |   format.address.publisher.date output
1974 |   format.volume output
1975 |   format.number output
1976 |   format.note output
1977 |   format.url output
1978 |   fin.entry
1979 |   if.url.std.interword.spacing
1980 | }
1981 | 
1982 | FUNCTION {booklet}
1983 | { std.status.using.comma
1984 |   start.entry
1985 |   if.url.alt.interword.spacing
1986 |   format.authors output
1987 |   name.or.dash
1988 |   format.article.title "title" output.warn
1989 |   format.howpublished "howpublished" bibinfo.check output
1990 |   format.organization "organization" bibinfo.check output
1991 |   format.address "address" bibinfo.check output
1992 |   format.date output
1993 |   format.note output
1994 |   format.url output
1995 |   fin.entry
1996 |   if.url.std.interword.spacing
1997 | }
1998 | 
1999 | FUNCTION {electronic}
2000 | { std.status.using.period
2001 |   start.entry
2002 |   if.url.alt.interword.spacing
2003 |   format.authors output
2004 |   name.or.dash
2005 |   format.date.electronic output
2006 |   format.article.title.electronic output
2007 |   format.howpublished "howpublished" bibinfo.check output
2008 |   format.organization "organization" bibinfo.check output
2009 |   format.address "address" bibinfo.check output
2010 |   format.note output
2011 |   format.url output
2012 |   fin.entry
2013 |   empty.entry.warn
2014 |   if.url.std.interword.spacing
2015 | }
2016 | 
2017 | FUNCTION {inbook}
2018 | { std.status.using.comma
2019 |   start.entry
2020 |   if.url.alt.interword.spacing
2021 |   author empty$
2022 |     { format.editors "author and editor" output.warn }
2023 |     { format.authors output.nonnull }
2024 |   if$
2025 |   name.or.dash
2026 |   format.book.title.edition output
2027 |   format.series output
2028 |   format.address.publisher.date output
2029 |   format.volume output
2030 |   format.number output
2031 |   format.chapter output
2032 |   format.pages output
2033 |   format.note output
2034 |   format.url output
2035 |   fin.entry
2036 |   if.url.std.interword.spacing
2037 | }
2038 | 
2039 | FUNCTION {incollection}
2040 | { std.status.using.comma
2041 |   start.entry
2042 |   if.url.alt.interword.spacing
2043 |   format.authors "author" output.warn
2044 |   name.or.dash
2045 |   format.article.title "title" output.warn
2046 |   format.in.booktitle.edition "booktitle" output.warn
2047 |   format.series output
2048 |   format.editors output
2049 |   format.address.publisher.date.nowarn output
2050 |   format.volume output
2051 |   format.number output
2052 |   format.chapter output
2053 |   format.pages output
2054 |   format.note output
2055 |   format.url output
2056 |   fin.entry
2057 |   if.url.std.interword.spacing
2058 | }
2059 | 
2060 | FUNCTION {inproceedings}
2061 | { std.status.using.comma
2062 |   start.entry
2063 |   if.url.alt.interword.spacing
2064 |   format.authors "author" output.warn
2065 |   name.or.dash
2066 |   format.article.title "title" output.warn
2067 |   format.in.booktitle "booktitle" output.warn
2068 |   format.series output
2069 |   format.editors output
2070 |   format.volume output
2071 |   format.number output
2072 |   publisher empty$
2073 |     { format.address.organization.date output }
2074 |     { format.organization "organization" bibinfo.check output
2075 |       format.address.publisher.date output
2076 |     }
2077 |   if$
2078 |   format.paper output
2079 |   format.pages output
2080 |   format.note output
2081 |   format.url output
2082 |   fin.entry
2083 |   if.url.std.interword.spacing
2084 | }
2085 | 
2086 | FUNCTION {manual}
2087 | { std.status.using.comma
2088 |   start.entry
2089 |   if.url.alt.interword.spacing
2090 |   format.authors output
2091 |   name.or.dash
2092 |   format.book.title.edition "title" output.warn
2093 |   format.howpublished "howpublished" bibinfo.check output 
2094 |   format.organization "organization" bibinfo.check output
2095 |   format.address "address" bibinfo.check output
2096 |   format.date output
2097 |   format.note output
2098 |   format.url output
2099 |   fin.entry
2100 |   if.url.std.interword.spacing
2101 | }
2102 | 
2103 | FUNCTION {mastersthesis}
2104 | { std.status.using.comma
2105 |   start.entry
2106 |   if.url.alt.interword.spacing
2107 |   format.authors "author" output.warn
2108 |   name.or.dash
2109 |   format.article.title "title" output.warn
2110 |   format.master.thesis.type output.nonnull
2111 |   format.school "school" bibinfo.warn output
2112 |   format.address "address" bibinfo.check output
2113 |   format.date "year" output.warn
2114 |   format.note output
2115 |   format.url output
2116 |   fin.entry
2117 |   if.url.std.interword.spacing
2118 | }
2119 | 
2120 | FUNCTION {misc}
2121 | { std.status.using.comma
2122 |   start.entry
2123 |   if.url.alt.interword.spacing
2124 |   format.authors output
2125 |   name.or.dash
2126 |   format.article.title output
2127 |   format.howpublished "howpublished" bibinfo.check output 
2128 |   format.organization "organization" bibinfo.check output
2129 |   format.address "address" bibinfo.check output
2130 |   format.pages output
2131 |   format.date output
2132 |   format.note output
2133 |   format.url output
2134 |   fin.entry
2135 |   empty.entry.warn
2136 |   if.url.std.interword.spacing
2137 | }
2138 | 
2139 | FUNCTION {patent}
2140 | { std.status.using.comma
2141 |   start.entry
2142 |   if.url.alt.interword.spacing
2143 |   format.authors output
2144 |   name.or.dash
2145 |   format.article.title output
2146 |   format.patent.nationality.type.number output
2147 |   format.patent.date output
2148 |   format.note output
2149 |   format.url output
2150 |   fin.entry
2151 |   empty.entry.warn
2152 |   if.url.std.interword.spacing
2153 | }
2154 | 
2155 | FUNCTION {periodical}
2156 | { std.status.using.comma
2157 |   start.entry
2158 |   if.url.alt.interword.spacing
2159 |   format.editors output
2160 |   name.or.dash
2161 |   format.book.title "title" output.warn
2162 |   format.series output
2163 |   format.volume output
2164 |   format.number output
2165 |   format.organization "organization" bibinfo.check output
2166 |   format.date "year" output.warn
2167 |   format.note output
2168 |   format.url output
2169 |   fin.entry
2170 |   if.url.std.interword.spacing
2171 | }
2172 | 
2173 | FUNCTION {phdthesis}
2174 | { std.status.using.comma
2175 |   start.entry
2176 |   if.url.alt.interword.spacing
2177 |   format.authors "author" output.warn
2178 |   name.or.dash
2179 |   format.article.title "title" output.warn
2180 |   format.phd.thesis.type output.nonnull
2181 |   format.school "school" bibinfo.warn output
2182 |   format.address "address" bibinfo.check output
2183 |   format.date "year" output.warn
2184 |   format.note output
2185 |   format.url output
2186 |   fin.entry
2187 |   if.url.std.interword.spacing
2188 | }
2189 | 
2190 | FUNCTION {proceedings}
2191 | { std.status.using.comma
2192 |   start.entry
2193 |   if.url.alt.interword.spacing
2194 |   format.editors output
2195 |   name.or.dash
2196 |   format.book.title "title" output.warn
2197 |   format.series output
2198 |   format.volume output
2199 |   format.number output
2200 |   publisher empty$
2201 |     { format.address.organization.date output }
2202 |     { format.organization "organization" bibinfo.check output
2203 |       format.address.publisher.date output
2204 |     }
2205 |   if$
2206 |   format.note output
2207 |   format.url output
2208 |   fin.entry
2209 |   if.url.std.interword.spacing
2210 | }
2211 | 
2212 | FUNCTION {standard}
2213 | { std.status.using.comma
2214 |   start.entry
2215 |   if.url.alt.interword.spacing
2216 |   format.authors output
2217 |   name.or.dash
2218 |   format.book.title "title" output.warn
2219 |   format.howpublished "howpublished" bibinfo.check output 
2220 |   format.organization.institution.standard.type.number output
2221 |   format.revision output
2222 |   format.date output
2223 |   format.note output
2224 |   format.url output
2225 |   fin.entry
2226 |   if.url.std.interword.spacing
2227 | }
2228 | 
2229 | FUNCTION {techreport}
2230 | { std.status.using.comma
2231 |   start.entry
2232 |   if.url.alt.interword.spacing
2233 |   format.authors "author" output.warn
2234 |   name.or.dash
2235 |   format.article.title "title" output.warn
2236 |   format.howpublished "howpublished" bibinfo.check output 
2237 |   format.institution "institution" bibinfo.warn output
2238 |   format.address "address" bibinfo.check output
2239 |   format.tech.report.number output.nonnull
2240 |   format.date "year" output.warn
2241 |   format.note output
2242 |   format.url output
2243 |   fin.entry
2244 |   if.url.std.interword.spacing
2245 | }
2246 | 
2247 | FUNCTION {unpublished}
2248 | { std.status.using.comma
2249 |   start.entry
2250 |   if.url.alt.interword.spacing
2251 |   format.authors "author" output.warn
2252 |   name.or.dash
2253 |   format.article.title "title" output.warn
2254 |   format.date output
2255 |   format.note "note" output.warn
2256 |   format.url output
2257 |   fin.entry
2258 |   if.url.std.interword.spacing
2259 | }
2260 | 
2261 | 
2262 | % The special entry type which provides the user interface to the
2263 | % BST controls
2264 | FUNCTION {IEEEtranBSTCTL}
2265 | { is.print.banners.to.terminal
2266 |     { "** IEEEtran BST control entry " quote$ * cite$ * quote$ * " detected." *
2267 |       top$
2268 |     }
2269 |     { skip$ }
2270 |   if$
2271 |   CTLuse_article_number
2272 |   empty$
2273 |     { skip$ }
2274 |     { CTLuse_article_number
2275 |       yes.no.to.int
2276 |       'is.use.number.for.article :=
2277 |     }
2278 |   if$
2279 |   CTLuse_paper
2280 |   empty$
2281 |     { skip$ }
2282 |     { CTLuse_paper
2283 |       yes.no.to.int
2284 |       'is.use.paper :=
2285 |     }
2286 |   if$
2287 |   CTLuse_url
2288 |   empty$
2289 |     { skip$ }
2290 |     { CTLuse_url
2291 |       yes.no.to.int
2292 |       'is.use.url :=
2293 |     }
2294 |   if$
2295 |   CTLuse_forced_etal
2296 |   empty$
2297 |     { skip$ }
2298 |     { CTLuse_forced_etal
2299 |       yes.no.to.int
2300 |       'is.forced.et.al :=
2301 |     }
2302 |   if$
2303 |   CTLmax_names_forced_etal
2304 |   empty$
2305 |     { skip$ }
2306 |     { CTLmax_names_forced_etal
2307 |       string.to.integer
2308 |       'max.num.names.before.forced.et.al :=
2309 |     }
2310 |   if$
2311 |   CTLnames_show_etal
2312 |   empty$
2313 |     { skip$ }
2314 |     { CTLnames_show_etal
2315 |       string.to.integer
2316 |       'num.names.shown.with.forced.et.al :=
2317 |     }
2318 |   if$
2319 |   CTLuse_alt_spacing
2320 |   empty$
2321 |     { skip$ }
2322 |     { CTLuse_alt_spacing
2323 |       yes.no.to.int
2324 |       'is.use.alt.interword.spacing :=
2325 |     }
2326 |   if$
2327 |   CTLalt_stretch_factor
2328 |   empty$
2329 |     { skip$ }
2330 |     { CTLalt_stretch_factor
2331 |       'ALTinterwordstretchfactor :=
2332 |       "\renewcommand{\BIBentryALTinterwordstretchfactor}{"
2333 |       ALTinterwordstretchfactor * "}" *
2334 |       write$ newline$
2335 |     }
2336 |   if$
2337 |   CTLdash_repeated_names
2338 |   empty$
2339 |     { skip$ }
2340 |     { CTLdash_repeated_names
2341 |       yes.no.to.int
2342 |       'is.dash.repeated.names :=
2343 |     }
2344 |   if$
2345 |   CTLname_format_string
2346 |   empty$
2347 |     { skip$ }
2348 |     { CTLname_format_string
2349 |       'name.format.string :=
2350 |     }
2351 |   if$
2352 |   CTLname_latex_cmd
2353 |   empty$
2354 |     { skip$ }
2355 |     { CTLname_latex_cmd
2356 |       'name.latex.cmd :=
2357 |     }
2358 |   if$
2359 |   CTLname_url_prefix
2360 |   missing$
2361 |     { skip$ }
2362 |     { CTLname_url_prefix
2363 |       'name.url.prefix :=
2364 |     }
2365 |   if$
2366 | 
2367 | 
2368 |   num.names.shown.with.forced.et.al max.num.names.before.forced.et.al >
2369 |     { "CTLnames_show_etal cannot be greater than CTLmax_names_forced_etal in " cite$ * warning$ 
2370 |       max.num.names.before.forced.et.al 'num.names.shown.with.forced.et.al :=
2371 |     }
2372 |     { skip$ }
2373 |   if$
2374 | }
2375 | 
2376 | 
2377 | %%%%%%%%%%%%%%%%%%%
2378 | %% ENTRY ALIASES %%
2379 | %%%%%%%%%%%%%%%%%%%
2380 | FUNCTION {conference}{inproceedings}
2381 | FUNCTION {online}{electronic}
2382 | FUNCTION {internet}{electronic}
2383 | FUNCTION {webpage}{electronic}
2384 | FUNCTION {www}{electronic}
2385 | FUNCTION {default.type}{misc}
2386 | 
2387 | 
2388 | 
2389 | %%%%%%%%%%%%%%%%%%
2390 | %% MAIN PROGRAM %%
2391 | %%%%%%%%%%%%%%%%%%
2392 | 
2393 | READ
2394 | 
2395 | EXECUTE {initialize.controls}
2396 | EXECUTE {initialize.status.constants}
2397 | EXECUTE {banner.message}
2398 | 
2399 | EXECUTE {initialize.longest.label}
2400 | ITERATE {longest.label.pass}
2401 | 
2402 | EXECUTE {begin.bib}
2403 | ITERATE {call.type$}
2404 | EXECUTE {end.bib}
2405 | 
2406 | EXECUTE{completed.message}
2407 | 
2408 | 
2409 | %% That's all folks, mds.
2410 | 


--------------------------------------------------------------------------------
/vignettes/build_pdf.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/sh
 2 | 
 3 | fixVersion(){
 4 |   PKGVER=`grep "Version:" ../DESCRIPTION | sed -e "s/Version: //"`
 5 |   sed -i -e "s/myversion{.*}/myversion{${PKGVER}}/" $1
 6 | }
 7 | 
 8 | cleanVignette(){
 9 |   rm -f *.aux *.bbl *.blg *.log *.out *.toc *.dvi
10 | }
11 | 
12 | buildVignette(){
13 |   fixVersion $1
14 |   pdflatex $1
15 |   bibname=`echo "$1" | sed -e 's/\..*//'`
16 |   bibtex $bibname
17 |   pdflatex $1
18 |   pdflatex $1
19 |   Rscript -e "tools::compactPDF('${bibname}.pdf', gs_quality='ebook')"
20 | }
21 | 
22 | 
23 | cleanVignette
24 | buildVignette filesampler.Rnw
25 | cleanVignette
26 | 
27 | 
28 | mv -f *.pdf ../inst/doc/
29 | cp -f *.Rnw ../inst/doc/
30 | 


--------------------------------------------------------------------------------
/vignettes/filesampler.Rnw:
--------------------------------------------------------------------------------
  1 | %\VignetteIndexEntry{Introduction to the filesampler Package}
  2 | \documentclass[]{article}
  3 | 
  4 | 
  5 | \input{./include/settings}
  6 | 
  7 | 
  8 | \mytitle{Introduction to the filesampler Package}
  9 | \mysubtitle{}
 10 | \myversion{0.4-0}
 11 | \myauthor{
 12 | \centering
 13 | Drew Schmidt \\ 
 14 | \texttt{wrathematics@gmail.com} 
 15 | }
 16 | 
 17 | 
 18 | \begin{document}
 19 | \makefirstfew
 20 | 
 21 | 
 22 | 
 23 | \section{Introduction}\label{introduction}
 24 | 
 25 | \textbf{filesampler}~\cite{filesampler} is a simple R package for
 26 | reading random subsamples of flat text files by line in a reasonably
 27 | efficient manner. This is useful if you have a very large file (too big
 28 | to comfortably handle in memory) and you want to prototype something on
 29 | an unbiased subset.
 30 | 
 31 | Basically, each reader works by sampling as the input file is scanned
 32 | and randomly choosing whether or not to dump the current line to an
 33 | external temporary file. This temporary file is then read back into R.
 34 | For (aggressive) downsampling, this is a very effective strategy; for
 35 | resampling, you are much better off reading the full dataset into
 36 | memory.
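
As a rough illustration of this strategy, the following pure-R sketch keeps
each line independently with probability \texttt{p}, writes the kept lines to
a temporary file, and reads that file back. The function name
\texttt{sample\_lines\_sketch()} and its \texttt{chunk} argument are
hypothetical and chosen only for illustration; the package itself performs
the scan in C.

\begin{lstlisting}[language=rr]
# Hypothetical helper (not part of filesampler): proportional line sampling.
sample_lines_sketch <- function(file, p, chunk = 10000L) {
  con <- file(file, open = "r")
  on.exit(close(con))
  tmp <- tempfile()
  out <- file(tmp, open = "w")
  repeat {
    # Scan the input a chunk at a time so the full file is never in memory.
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    # Keep each line with probability p and dump it to the temporary file.
    keep <- runif(length(lines)) < p
    writeLines(lines[keep], out)
  }
  close(out)
  # Read the (much smaller) temporary file back into R.
  res <- readLines(tmp)
  unlink(tmp)
  res
}
\end{lstlisting}

Because every line is kept or dropped independently, the number of lines
returned is random and only approximately \texttt{p} times the number of
lines in the input.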
 37 | 
 38 | This package, including the underlying C library, is licensed under the
 39 | permissive 2-clause BSD license. The original idea was inspired by
 40 | Eduardo Arino de la Rubia's
 41 | \textbf{fast\_sample}~\cite{fast_sample}.
 42 | 
 43 | \subsection{Installation}\label{installation}
 44 | 
 45 | You can install the stable version from CRAN using the usual
 46 | \texttt{install.packages()}:
 47 | 
 48 | \begin{lstlisting}[language=rr]
 49 | install.packages("filesampler")
 50 | \end{lstlisting}
 51 | 
 52 | The development version is maintained on GitHub:
 53 | 
 54 | \begin{lstlisting}[language=rr]
 55 | remotes::install_github("wrathematics/filesampler")
 56 | \end{lstlisting}
 57 | 
 58 | \section{Using the Package}\label{using-the-package}
 59 | 
 60 | At its most basic, you can use just \texttt{sample\_csv()} for csv
 61 | files. Say you want to read in 0.1\% of the data. Then you would call:
 62 | 
 63 | \begin{lstlisting}[language=rr]
 64 | sample_csv(file, 0.001)
 65 | \end{lstlisting}
 66 | 
 67 | and \texttt{sample\_lines()} for reading unstructured lines (as with
 68 | \texttt{readLines()}):
 69 | 
 70 | \begin{lstlisting}[language=rr]
 71 | sample_lines(file, 0.001)
 72 | \end{lstlisting}
 73 | 
 74 | By default, each of these file samplers will sample proportionally. This
 75 | is done in a single pass of the input file. However, you can also use an
 76 | ``exact'' sampler, which uses a reservoir sampler to draw an exact
 77 | number of lines at random. Say you wanted to read in 1000 lines of a
 78 | csv. Then you would call:
 79 | 
 80 | \begin{lstlisting}[language=rr]
 81 | sample_csv(file, 1000, method="exact")
 82 | \end{lstlisting}
 83 | 
 84 | This makes two passes through the file. Implementation details are
 85 | discussed in a later section of this vignette.
 86 | 
 87 | \subsection{Other Readers}\label{other-readers}
 88 | 
 89 | To keep external dependencies to a minimum, only the base R
 90 | \texttt{read.csv()} and \texttt{readLines()} are directly supported. If
 91 | you wish to use other readers like \texttt{fread()} from
 92 | \textbf{data.table}~\cite{datatable} or \texttt{read\_csv()} from
 93 | \textbf{readr}~\cite{readr}, you can pass one of these as the
 94 | argument \texttt{reader} to the \texttt{sample\_csv()} function. Examples
 95 | of this can be found in the Benchmarks section below.
 96 | 
 97 | Note that if the downsampling is sufficiently aggressive, you will
 98 | probably not notice any performance improvement using these better
 99 | readers over \texttt{read.csv()}.
100 | 
101 | 
102 | 
103 | \section{Benchmarks}\label{benchmarks}
104 | 
105 | \subsection{Benchmark Setup}\label{benchmark-setup}
106 | 
107 | All benchmarks were performed using:
108 | 
109 | \begin{itemize}
110 | \item
111 |   A Core i5-2500K CPU @ 3.30GHz with a platter HDD
112 | \item
113 |   R 3.3.1
114 | \item
115 |   Package versions:
116 | 
117 |   \begin{itemize}
118 |   \item
119 |     0.4-0 of \textbf{filesampler}
120 |   \item
121 |     1.0.0 of \textbf{readr}
122 |   \item
123 |     1.9.6 of \textbf{data.table}
124 |   \end{itemize}
125 | \end{itemize}
126 | 
127 | Each timing shown is the result of running each test twice and taking
128 | the best (lowest) time. This is meant to keep file caching or bad RNG
129 | behavior from making one reader look unreasonably slow
130 | compared to the others. For each of the tests, there was no major
131 | difference in run times between the two runs.
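
The best-of-two convention can be expressed as a small helper. This is a
sketch only; \texttt{best\_elapsed()} is a hypothetical name and is not part
of any of the benchmarked packages.

\begin{lstlisting}[language=rr]
# Hypothetical helper: run f() reps times and report the lowest elapsed time.
best_elapsed <- function(f, reps = 2L) {
  times <- vapply(seq_len(reps),
                  function(i) system.time(f())[["elapsed"]],
                  numeric(1))
  min(times)
}

# Usage: best_elapsed(function() sample_csv(file, .001))
\end{lstlisting}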
132 | 
133 | The file \texttt{big.csv} referred to below was generated with the
134 | included \texttt{makebig} script found in the \texttt{inst/benchmarks/} tree
135 | of the source for this package.
136 | 
137 | \subsection{The Benchmarks}\label{the-benchmarks}
138 | 
139 | The package should perform very well, provided the total number of lines
140 | sampled is fairly small. For example, consider the file:
141 | 
142 | \begin{lstlisting}[language=rr]
143 | file <- "/tmp/big.csv"
144 | 
145 | library(memuse)
146 | Sys.filesize(file)
147 | ## 976.868 MiB
148 | 
149 | library(filesampler)
150 | wc_l(file)
151 | ## file:   /tmp/big.csv 
152 | ## Lines: 16000001 
153 | \end{lstlisting}
154 | 
155 | We can read in approximately 0.1\% of the input file quite quickly:
156 | 
157 | \begin{lstlisting}[language=rr]
158 | system.time(small <- sample_csv(file, .001))
159 | ##  user  system elapsed 
160 | ## 1.044   0.128   1.172 
161 | \end{lstlisting}
162 | 
163 | Compare this to the time spent reading the entire file:
164 | 
165 | \begin{lstlisting}[language=rr]
166 | system.time(full <- read.csv(file))
167 | ##    user  system elapsed 
168 | ##  72.988   0.500  73.491
169 | 
170 | system.time(full <- readr::read_csv(file, progress=FALSE))
171 | ##   user  system elapsed 
172 | ## 27.316   0.276  27.595 
173 | 
174 | system.time(full <- data.table::fread(file))
175 | ##    user  system elapsed 
176 | ##  12.328   0.100  12.430 
177 | \end{lstlisting}
178 | 
179 | Note the difference in memory usage:
180 | 
181 | \begin{lstlisting}[language=rr]
182 | dim(small)
183 | ## [1] 16036     6
184 | memuse(small)
185 | ## 568.477 KiB
186 | 
187 | dim(full)
188 | ## [1] 16000000        6
189 | memuse(full)
190 | ## 549.321 MiB
191 | \end{lstlisting}
192 | 
193 | Also notice that the in-memory size of the full data set, multiplied by
194 | the proportion of lines read in, is roughly the size of the sample:
195 | 
196 | \begin{lstlisting}[language=rr]
197 | memuse(full) * .001
198 | ## 562.505 KiB
199 | \end{lstlisting}
200 | 
201 | Obviously the \textbf{filesampler} strategy is valuable only for
202 | aggressive downsampling, and not resampling.
203 | 
204 | \subsection{Combining filesampler with Other
205 | Readers}\label{combining-filesampler-with-other-readers}
206 | 
207 | The bulk of the time spent in \texttt{sample\_csv()} is not in the csv
208 | reading/parsing itself, but rather in the downsampling. Consider:
209 | 
210 | \begin{lstlisting}[language=rr]
211 | # the default reader
212 | system.time(small <- sample_csv(file, .001, reader=read.csv))
213 | ##   user  system elapsed 
214 | ##  1.004   0.172   1.180 
215 | 
216 | # readr
217 | system.time(small <- sample_csv(file, .001, reader=readr::read_csv))
218 | ##   user  system elapsed 
219 | ##  1.024   0.108   1.130 
220 | 
221 | # data.table
222 | system.time(small <- sample_csv(file, .001, reader=data.table::fread))
223 | ##   user  system elapsed 
224 | ##  0.928   0.184   1.112 
225 | \end{lstlisting}
226 | 
227 | The read times do indeed go down, but not by much; the difference is
228 | probably unnoticeable to a human. For larger reads, the differences are
229 | more significant. We can quickly see this by reading in 1\% of the data:
230 | 
231 | \begin{lstlisting}[language=rr]
232 | system.time(small <- sample_csv(file, .01, reader=read.csv))
233 | ##   user  system elapsed 
234 | ##  1.712   0.136   1.855 
235 | 
236 | system.time(small <- sample_csv(file, .01, reader=readr::read_csv))
237 | ##   user  system elapsed 
238 | ##  1.236   0.160   1.397 
239 | 
240 | system.time(small <- sample_csv(file, .01, reader=data.table::fread))
241 | ##   user  system elapsed 
242 | ##  1.100   0.148   1.246 
243 | \end{lstlisting}
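
To put numbers on this, the elapsed times reported above can be expressed
relative to the fastest reader at each sampling proportion. The vectors
below copy the timings from the two listings in this subsection; the
variable names are chosen here for illustration.

\begin{lstlisting}[language=rr]
# Elapsed times (seconds) copied from the listings above
elapsed_small <- c(read.csv = 1.180, read_csv = 1.130, fread = 1.112)  # p = .001
elapsed_large <- c(read.csv = 1.855, read_csv = 1.397, fread = 1.246)  # p = .01

# Slowdown relative to the fastest reader
round(elapsed_small / min(elapsed_small), 2)   # roughly 1.06, 1.02, 1.00
round(elapsed_large / min(elapsed_large), 2)   # roughly 1.49, 1.12, 1.00
\end{lstlisting}

At \texttt{p=0.001} the slowest reader is about 6\% behind the fastest,
while at \texttt{p=0.01} the gap grows to nearly 50\%.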
244 | 
245 | \subsection{Exact Sampling}\label{exact-sampling}
246 | 
247 | Recall that we can sample an exact number of lines using a reservoir
248 | sampler, rather than drawing lines proportionally. We can enable this
249 | behavior by setting \texttt{method="exact"} in \texttt{sample\_csv()}.
250 | 
251 | Before, we were sampling about 16,000 and 160,000 lines (for
252 | \texttt{p=0.001} and \texttt{p=0.01} respectively), so we'll use those
253 | values for the exact sampler:
254 | 
255 | \begin{lstlisting}[language=rr]
256 | system.time(small <- sample_csv(file, 16000, method="exact"))
257 | ##   user  system elapsed 
258 | ##  1.144   0.292   1.437 
259 | 
260 | system.time(small <- sample_csv(file, 160000, method="exact"))
261 | ##   user  system elapsed 
262 | ##  1.892   0.232   2.124 
263 | \end{lstlisting}
264 | 
265 | In each case, the exact sampler is a bit slower than the proportional
266 | one, reflecting some extra work needed in constructing the reservoir
267 | (for example, getting line counts for the file).
268 | 
269 | \subsection{Conclusions and Summary}\label{conclusions-and-summary}
270 | 
271 | Based on the benchmarks, we find:
272 | 
273 | \begin{itemize}
274 | \item
275 |   If you want to very quickly read in a small amount of data,
276 |   \textbf{filesampler} is very effective. The canonical use cases for
277 |   this are if you:
278 | 
279 |   \begin{itemize}
280 |   \item
281 |     want to take a quick, unbiased peek at the data
282 |   \item
283 |     don't have enough memory to read in/operate on the full data set
284 |   \end{itemize}
285 | \item
286 |   If your data isn't very big, you can probably comfortably read it in
287 |   very quickly with \textbf{data.table}'s \texttt{fread()}.
288 | \end{itemize}
289 | 
290 | Finally, if your csv is very large, say greater than 10GiB, I would
291 | strongly encourage you to choose another (preferably binary) format. I
292 | have seen terabyte-sized csv files, but they are not a good idea!
293 | 
294 | 
295 | 
296 | \section{Implementation Details}\label{implementation-details}
297 | 
298 | Here we describe the implementation details for the file sampling
299 | functions. The other package export, \texttt{wc()}, has been
300 | aggressively optimized; but its implementation is not very interesting,
301 | so we do not belabor it here. We will also only focus on the csv
302 | samplers, as in fact the raw text and csv schemes are almost identical.
303 | 
304 | The proportional sampling is handled by the function
305 | \texttt{file\_sample\_prop()}, which samples lines at a given proportion.
306 | As the input file is scanned line-by-line, lines are randomly selected
307 | to be placed into a temporary file at the given proportion. This
308 | requires only one pass through the file. On the other hand, the exact
309 | sampling is handled by \texttt{file\_sample\_exact()}. Here we use a
310 | reservoir sampler to determine ahead of time which lines will be read,
311 | and then pass through the input file, dumping the pre-selected lines to
312 | the temporary file. This requires two passes through the file, since we
313 | need to know the total number of lines. In each case, after the
314 | downsampling takes place we read the temporary file into R using one of
315 | its csv readers.
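
The two schemes can be sketched in plain R as follows. This is only an
illustration of the idea and not the package's implementation, which does
this work in C. The function names \texttt{sample\_prop\_sketch()} and
\texttt{sample\_exact\_sketch()}, the \texttt{chunk} argument, and the use
of \texttt{sample.int()} in place of the reservoir sampler are all
assumptions made for this example.

\begin{lstlisting}[language=rr]
# Illustrative pure-R sketch; the package itself does this work in C.

# Proportional sampling: one pass, keep each data line with probability p.
sample_prop_sketch <- function(infile, p, header = TRUE, chunk = 10000L) {
  outfile <- tempfile(fileext = ".csv")
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  if (header) writeLines(readLines(con_in, n = 1L), con_out)
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0L) break
    writeLines(lines[runif(length(lines)) < p], con_out)
  }
  outfile
}

# Exact sampling: the first pass counts lines, the second copies chosen ones.
sample_exact_sketch <- function(infile, n, header = TRUE, chunk = 10000L) {
  con_in <- file(infile, open = "r")
  total <- 0
  while ((k <- length(readLines(con_in, n = chunk))) > 0L) total <- total + k
  close(con_in)
  nbody <- total - as.integer(header)
  stopifnot(n <= nbody)
  # sample.int() stands in for the reservoir sampler here
  wanted <- sample.int(nbody, n)

  outfile <- tempfile(fileext = ".csv")
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  if (header) writeLines(readLines(con_in, n = 1L), con_out)
  seen <- 0
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0L) break
    keep <- (seen + seq_along(lines)) %in% wanted
    writeLines(lines[keep], con_out)
    seen <- seen + length(lines)
  }
  outfile
}

# Usage on a small throwaway csv
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000, y = rnorm(1000)), csv, row.names = FALSE)
nrow(read.csv(sample_exact_sketch(csv, 50)))   # 50 rows
\end{lstlisting}

In both sketches the selected lines go to a temporary file first, so any
csv reader can then be applied to that file unchanged.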
316 | 
317 | 
318 | 
319 | \addcontentsline{toc}{section}{References}
320 | \bibliography{./include/filesampler}
321 | \bibliographystyle{plain}
322 | 
323 | \end{document}
324 | 


--------------------------------------------------------------------------------
/vignettes/include/00-acknowledgement.tex:
--------------------------------------------------------------------------------
 1 | \section*{Disclaimer}
 2 | Any opinions, findings, and conclusions or recommendations expressed in this 
 3 | material are those only of the authors.  The findings and conclusions in this 
 4 | article should not be construed to represent any determination or policy of
 5 | University, Agency, Administration and National Laboratory.
 6 | 
 7 | 
 8 | This manual may be incorrect or out-of-date.  The author(s) assume
 9 | no responsibility for errors or omissions, or for damages resulting
10 | from the use of the information contained herein.
11 | 
12 | This publication was typeset using \LaTeX.
13 | 
14 | \vfill
15 | 
16 | \null
17 | \vfill
18 | \copyright\ 2015--2016 Drew Schmidt.
19 | 
20 | Permission is granted to make and distribute verbatim copies of
21 | this vignette and its source provided the copyright notice and
22 | this permission notice are preserved on all copies.
23 | 


--------------------------------------------------------------------------------
/vignettes/include/filesampler.bib:
--------------------------------------------------------------------------------
 1 | @Manual{filesampler,
 2 |   title = {{filesampler: File Line Sampler}},
 3 |   author = {Drew Schmidt},
 4 |   note = {R package version 0.3-0},
 5 |   url = {https://github.com/wrathematics/filesampler}
 6 | }
 7 | 
 8 | @Manual{fast_sample,
 9 |   title = {{fast\_sample}},
10 |   author = {Eduardo Arino de la Rubia},
11 |   url = {https://github.com/earino/fast_sample}
12 | }
13 | 
14 | @Manual{readr,
15 |   title = {readr: Read Tabular Data},
16 |   author = {Hadley Wickham and Romain Francois},
17 |   year = {2015},
18 |   note = {R package version 0.2.2},
19 |   url = {https://CRAN.R-project.org/package=readr},
20 | }
21 | 
22 | @Manual{datatable,
23 |   title = {data.table: Extension of Data.frame},
24 |   author = {M Dowle and A Srinivasan and T Short and S Lianoglou with contributions from R Saporta and E Antonyan},
25 |   year = {2015},
26 |   note = {R package version 1.9.6},
27 |   url = {https://CRAN.R-project.org/package=data.table},
28 | }
29 | 


--------------------------------------------------------------------------------
/vignettes/include/settings.tex:
--------------------------------------------------------------------------------
  1 | %-------------------------------------------------------------------------------
  2 | % basic settings
  3 | %-------------------------------------------------------------------------------
  4 | 
  5 | \usepackage{parskip}
  6 | \setlength{\parskip}{.3cm}
  7 | 
  8 | \makeatletter
  9 | \newcommand\code{\bgroup\@makeother\_\@makeother\~\@makeother\$\@codex}
 10 | \def\@codex#1{{\normalfont\ttfamily\hyphenchar\font=-1 #1}\egroup}
 11 | \makeatother
 12 | %%\let\code=\texttt
 13 | \let\proglang=\textsf
 14 | 
 15 | 
 16 | \usepackage[margin=1.1in]{geometry}
 17 | \usepackage{graphicx}
 18 | \usepackage{listings}
 19 | \usepackage{hyperref}
 20 | \usepackage{xcolor}
 21 | \usepackage{xspace}
 22 | 
 23 | \definecolor{gray}{rgb}{.6,.6,.6}
 24 | \definecolor{dkgray}{rgb}{.3,.3,.3}
 25 | \definecolor{grayish}{rgb}{.9, .9, .9}
 26 | \definecolor{dkgreen}{rgb}{0,0.6,0}
 27 | \definecolor{mauve}{rgb}{0.58,0,0.82}
 28 | 
 29 | \hypersetup{
 30 |     pdfnewwindow=true,
 31 |     colorlinks=true,
 32 |     linkcolor=blue,
 33 |     linkbordercolor=blue,
 34 |     citecolor=blue,
 35 |     filecolor=magenta,
 36 |     urlcolor=blue
 37 | }
 38 | 
 39 | \lstdefinelanguage{rr}{
 40 |   language=R,
 41 |   basicstyle=\ttfamily\color{black},
 42 |   backgroundcolor=\color{grayish},
 43 |   frame=single,
 44 |   breaklines=true,
 45 |   keywordstyle=\color{blue},
 46 |   commentstyle=\color{dkgreen},
 47 |   stringstyle=\color{mauve},
 48 |   numbers=left,%none,
 49 |   numberstyle=\tiny\color{dkgray},
 50 |   stepnumber=1,
 51 |   numbersep=8pt,
 52 |   showspaces=false,
 53 |   showstringspaces=false,
 54 |   showtabs=false,
 55 |   rulecolor=\color{gray},
 56 |   tabsize=4,
 57 |   captionpos=t,
 58 | }
 59 | 
 60 | 
 61 | 
 62 | %-------------------------------------------------------------------------------
 63 | % title settings
 64 | %-------------------------------------------------------------------------------
 65 | 
 66 | \newcommand*{\plogo}{\includegraphics[scale=.5]{./include/uch_small}} 
 67 | 
 68 | \makeatletter
 69 | \newcommand\mytitle[1]{\renewcommand\@mytitle{#1}}
 70 | \newcommand\@mytitle{}
 71 | \makeatother
 72 | 
 73 | \makeatletter
 74 | \newcommand\mysubtitle[1]{\renewcommand\@mysubtitle{#1}}
 75 | \newcommand\@mysubtitle{}
 76 | \makeatother
 77 | 
 78 | \makeatletter
 79 | \newcommand\myversion[1]{\renewcommand\@myversion{#1}}
 80 | \newcommand\@myversion{}
 81 | \makeatother
 82 | 
 83 | 
 84 | \newcommand{\myauthor}[1]{\author{#1}}
 85 | 
 86 | 
 87 | %-------------------------------------------------------------------------------
 88 | %	TITLE PAGE
 89 | %-------------------------------------------------------------------------------
 90 | 
 91 | \makeatletter
 92 | \newcommand*{\titleGP}{\begingroup 
 93 | \centering
 94 | \vspace*{\baselineskip} 
 95 | 
 96 | \rule{\textwidth}{1.6pt}\vspace*{-\baselineskip}\vspace*{2pt} 
 97 | \rule{\textwidth}{0.4pt}\\[\baselineskip]
 98 | 
 99 | {\Huge \scshape\@mytitle \\[0.2\baselineskip] }
100 | 
101 | \rule{\textwidth}{0.4pt}\vspace*{-\baselineskip}\vspace{3.2pt} 
102 | \rule{\textwidth}{1.6pt}\\[\baselineskip] 
103 | 
104 | \scshape % Small caps
105 | \@mysubtitle \ \\[\baselineskip] % Tagline(s) or further 
106 | \today\par % Location and year
107 | 
108 | \vspace*{2\baselineskip} % Whitespace between location/year and editors
109 | 
110 | {\large \@author \par}
111 | 
112 | \vfill % Whitespace between editor names and publisher logo
113 | 
114 | \plogo \\[0.3\baselineskip] % Publisher logo
115 | {Version \@myversion} \\[0.3\baselineskip] % Year published
116 | % {\large THE PUBLISHER}\par % Publisher
117 | 
118 | \endgroup}
119 | \makeatother
120 | 
121 | 
122 | 
123 | %-------------------------------------------------------------------------------
124 | % first few
125 | %-------------------------------------------------------------------------------
126 | 
127 | \usepackage{lastpage}
128 | \usepackage{fancyhdr}
129 | 
130 | \pagestyle{fancy}
131 | 
132 | \newcommand{\prebodyheadfoot}{
133 |   \fancyhf{} % clear all header and footer fields
134 |   \fancyfoot{}
135 |   \renewcommand{\headrulewidth}{0pt}
136 |   \renewcommand{\footrulewidth}{0pt}
137 |   
138 |   % redefinition of the plain style:
139 |   \fancypagestyle{plain}{%
140 |   \fancyhf{} % clear all header and footer fields
141 |   \renewcommand{\headrulewidth}{0pt}
142 |   \renewcommand{\footrulewidth}{0pt}}
143 | }
144 | 
145 | \newcommand{\bodyheadfoot}{
146 |   \fancyhf{} % clear all header and footer fields
147 | % \fancyhead[L]{\slshape \rightmark}
148 |   \fancyhead[L]{\slshape \leftmark}
149 |   \fancyhead[R]{ \thepage\ of\ \pageref{LastPage}}
150 |   \renewcommand{\headrulewidth}{1pt}
151 |   \renewcommand{\footrulewidth}{0pt}
152 |   
153 |   % redefinition of the plain style:
154 |   \fancypagestyle{plain}{%
155 |   \fancyhf{} % clear all header and footer fields
156 |   \renewcommand{\headrulewidth}{0pt}
157 |   \renewcommand{\footrulewidth}{0pt}}
158 | }
159 | 
160 | 
161 | 
162 | \newcommand{\makefirstfew}{%
163 | \prebodyheadfoot
164 | \restoregeometry
165 | 
166 |   
167 | \titleGP
168 | \newpage
169 | 
170 | \input{./include/00-acknowledgement}
171 | \newpage
172 | \pagenumbering{roman}
173 | \tableofcontents
174 | \newpage
175 | 
176 | 
177 | \pagenumbering{arabic}
178 | \setcounter{page}{1}
179 | 
180 | 
181 | \bodyheadfoot
182 | \pagenumbering{arabic}
183 | \setcounter{page}{1}
184 | \pagestyle{fancy}
185 | }


--------------------------------------------------------------------------------
/vignettes/include/uch_small.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wrathematics/filesampler/b3683ea15888b0da6e1f983233c395d23cf9e2b6/vignettes/include/uch_small.png


--------------------------------------------------------------------------------