├── .Rbuildignore ├── .gitignore ├── .travis.yml ├── CONTRIBUTING.md ├── ChangeLog ├── DESCRIPTION ├── LICENSE ├── NAMESPACE ├── R ├── file_sample_exact.r ├── file_sample_prob.r ├── filesampler-package.r ├── reactor.r ├── sample_csv.r ├── sample_lines.r ├── utils.r └── wc.r ├── README.md ├── cleanup ├── configure ├── configure.ac ├── inst ├── CITATION ├── benchmarks │ ├── .gitignore │ ├── README │ ├── makebig │ ├── sampler.r │ └── wc.r └── rawdata │ └── small.csv ├── man ├── file_sample_exact.Rd ├── file_sample_prop.Rd ├── filesampler.Rd ├── print-wc.Rd ├── sample_csv.Rd ├── sample_lines.Rd └── wc.Rd ├── src ├── Makevars.in ├── Rfilesampler.h ├── filesampler │ ├── Makefile │ ├── check_avx.h │ ├── error.h │ ├── file_sampler.c │ ├── filesampler.h │ ├── rebalance.c │ ├── safeomp.h │ ├── utils.h │ ├── utils_sample.h │ └── wc.c ├── filesampler_native.c ├── samplers.c └── wc.c ├── tests ├── exact.R ├── prop.R └── wc.R ├── tools └── ax_check_compile_flag.m4 └── vignettes ├── IEEEtran.bst ├── build_pdf.sh ├── filesampler.Rnw └── include ├── 00-acknowledgement.tex ├── filesampler.bib ├── settings.tex └── uch_small.png /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^\.travis\.yml$ 2 | CONTRIBUTING.md 3 | 4 | inst/benchmarks/big.csv 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | src/Makevars 2 | aclocal.m4 3 | 4 | *.o 5 | *.lo 6 | *.so 7 | *.log 8 | *.status 9 | *~ 10 | *.swp 11 | 12 | *.pdf 13 | 14 | *.aux 15 | *.bbl 16 | *.blg 17 | *.out 18 | *.toc 19 | 20 | inst/doc 21 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: R 2 | cache: packages 3 | warnings_are_errors: true 4 | 5 | env: 6 | global: 7 | - CRAN: https://cran.rstudio.com 8 | - 
_R_CHECK_FORCE_SUGGESTS_=FALSE 9 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | If you're reading this and thinking of contributing, first, thanks so much! All contributions big and small are welcome. This document is a set of guidelines (not firm rules) to help the process. 4 | 5 | Thanks! 6 | 7 | 8 | ## Reporting a Bug or Requesting a Feature 9 | Please open an issue at the project repository's issue tracker on GitHub. If there is already an open issue describing your problem/request, feel free to join the discussion inside the existing issue, but please do not open a new one. 10 | 11 | 12 | ## Submitting Patches/Corrections 13 | You can submit a pull request (PR) to the project repository on GitHub. Please try to keep PR's as "small" as possible (don't try to combine several large changes into one PR). 14 | 15 | For small changes, the log can be simple, e.g. "fixed a typo". For larger changes that touch many files, make sure to give a reasonably detailed description of the change(s). 16 | 17 | If the project uses a continuous integration (CI) service such as Travis CI, then the PR must pass CI tests before it will be merged. 18 | 19 | 20 | ## Coding Conventions 21 | If you submit a patch that modifies/adds code, please try to keep the style at least reasonably similar to the one used in the codebase. For both C and R code, please use the following conventions: 22 | 23 | * Indent using two spaces. 24 | * Use the [Allman](https://en.wikipedia.org/wiki/Indent_style#Allman_style) control statement style. Do not use braces (but do indent) for one liners following a control statement. 25 | * Always put spaces after commas and around operators. So for example, use `foo(a, b, c)` not `foo(a,b,c)`; and use `x += 1` not `x+=1`. 
26 | * If you create new functionality (new function, new arguments to an old function), add or expand a test in the `tests/` sub-tree. 27 | * The `master` branch of this repository should always pass `R CMD check --as-cran` with no NOTE's and no WARNING's. 28 | * Update the `ChangeLog` file in the root of this directory explaining what you did, and end the line with your initials or full name. For example: 29 | ``` 30 | Fixed a memory leak in src/foo.c. (ABC) 31 | ``` 32 | 33 | 34 | ## Copyright 35 | This document is released as [CC0, public domain](https://creativecommons.org/choose/zero/). Feel free to re-use as much of this as you like with or without attribution if you find the template useful. 36 | -------------------------------------------------------------------------------- /ChangeLog: -------------------------------------------------------------------------------- 1 | Release 0.4-0: 2 | * Integrate exact sampler into sample_csv(). 3 | * Refactored internals for better code-reuse. 4 | * Drop assertthat dependency. 5 | * Re-wrote vignette. 6 | * Switch to LaTeX vignette. 7 | * Register native routines. 8 | * Use AVX for line counter when available (Daniel Lemire). 9 | * Add wc_w(). 10 | 11 | Release 0.3-1: 12 | * Internal reorganization. 13 | * Add option to turn off counts in wc(). 14 | * Optimize wc(). 15 | * Better error checking/handling across the board. 16 | * Better internal library/R separation for easier code re-use. 17 | 18 | Release 0.3-0: 19 | * Rework API. 20 | * Added vignette. 21 | * Added nmax reader option. 22 | * Fix broken help/NAMESPACE issues. 23 | * Change versioning format. 24 | * Change to assertthat. 25 | 26 | Release 0.2-0: 27 | * Removed gnu extensions (strnlen). 28 | * Fixed numerous packaging and documentation issues. 29 | 30 | Release 0.1-0: 31 | * Added basic sampler. 32 | * Added reservoir sampler. 33 | * Added misc utilities (wc, etc). 
34 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: filesampler 2 | Type: Package 3 | Title: File Sampler 4 | Version: 0.4-0 5 | Description: A collection of utilities for reading subsamples of flat text files 6 | by line in a reasonably efficient manner. We do so by sampling 7 | as the input file is scanned and randomly choosing whether or not to 8 | dump the current line to an external temporary file. This 9 | temporary file is then read back into R. For (aggressive) 10 | 'downsampling', this is a very effective strategy; for resampling, 11 | you are much better off reading the full dataset into memory. 12 | License: BSD 2-clause License + file LICENSE 13 | Depends: 14 | R (>= 3.5.0) 15 | Imports: 16 | utils 17 | Suggests: 18 | memuse (>= 3.0.0) 19 | NeedsCompilation: yes 20 | ByteCompile: yes 21 | Authors@R: c( 22 | person("Drew", "Schmidt", role = c("aut", "cre"), email="wrathematics@gmail.com"), 23 | person("Daniel", "Lemire", role="ctb", comment="vectorized line counter") 24 | ) 25 | Maintainer: Drew Schmidt 26 | URL: https://github.com/wrathematics/filesampler 27 | BugReports: https://github.com/wrathematics/filesampler/issues 28 | RoxygenNote: 7.0.2 29 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | YEAR: 2015-2018 2 | COPYRIGHT HOLDER: Drew Schmidt 3 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | S3method(print,wc) 4 | export(file_sample_exact) 5 | export(file_sample_prop) 6 | export(sample_csv) 7 | export(sample_lines) 8 | export(wc) 9 | export(wc_l) 10 | export(wc_w) 11 | importFrom(utils,read.csv) 12 | 
useDynLib(filesampler,R_fs_sample_exact) 13 | useDynLib(filesampler,R_fs_sample_prop) 14 | useDynLib(filesampler,R_fs_wc) 15 | -------------------------------------------------------------------------------- /R/file_sample_exact.r: -------------------------------------------------------------------------------- 1 | #' Exact File Sampler 2 | #' 3 | #' Randomly sample lines from an input text file. 4 | #' 5 | #' @details 6 | #' The sampling is done in two passes of the input file. First, the number of 7 | #' lines of the input file are determined by scanning through the file as 8 | #' quickly as possible (i.e., it should be completely I/O bound). Next, an 9 | #' index of lines to keep is produced by reservoir sampling. Then finally, the 10 | #' input file is scanned again line by line with the chosen lines dumped into a 11 | #' temporary file. 12 | #' 13 | #' If the output file (the one pointed to by the return of this function) is 14 | #' "large" and to be read into memory (which isn't really appropriate for text 15 | #' files in the first place!), then this strategy is probably not appropriate. 16 | #' 17 | #' @param nlines 18 | #' The (exact) number of lines to sample from the input file. 19 | #' @param infile 20 | #' Location of the file (as a string) to be subsampled. 21 | #' @param outfile 22 | #' Output file location (as a string). 23 | #' @param header 24 | #' Is a header (line of column names) on the first line of the csv file? 25 | #' @param nskip 26 | #' Number of lines to skip. If \code{header=TRUE}, then this only applies to 27 | #' lines after the header. 28 | #' @param verbose 29 | #' Should linecounts of the input file and the number of lines sampled be 30 | #' printed? 
31 | #' 32 | #' @return 33 | #' \code{NULL} 34 | #' 35 | #' @useDynLib filesampler R_fs_sample_exact 36 | #' @export 37 | file_sample_exact = function(nlines, infile, outfile=tempfile(), header=TRUE, nskip=0, verbose=FALSE) 38 | { 39 | check.is.posint(nlines) 40 | check.is.string(infile) 41 | infile = abspath(infile) 42 | check.is.string(outfile) 43 | check.is.flag(header) 44 | check.is.natnum(nskip) 45 | check.is.flag(verbose) 46 | 47 | .Call(R_fs_sample_exact, as.integer(verbose), as.integer(header), as.integer(nskip), as.integer(nlines)-1L, infile, outfile) 48 | 49 | invisible() 50 | } 51 | -------------------------------------------------------------------------------- /R/file_sample_prob.r: -------------------------------------------------------------------------------- 1 | #' Proportional File Sampler 2 | #' 3 | #' Randomly sample lines from an input text file. 4 | #' 5 | #' @details 6 | #' The sampling is done in one pass of the input file, dumping lines to a 7 | #' temporary file as the input is read. 8 | #' 9 | #' If the output file (the one pointed to by the return of this function) is 10 | #' "large" and to be read into memory (which isn't really appropriate for text 11 | #' files in the first place!), then this strategy is probably not appropriate. 12 | #' 13 | #' @param p 14 | #' Proportion to retain; should be a numeric value between 0 and 1. 15 | #' @param infile 16 | #' Location of the file (as a string) to be subsampled. 17 | #' @param outfile 18 | #' Output file location (as a string). 19 | #' @param header 20 | #' Is a header (line of column names) on the first line of the csv file? 21 | #' @param nskip 22 | #' Number of lines to skip. If \code{header=TRUE}, then this only applies to 23 | #' lines after the header. 24 | #' @param nmax 25 | #' Max number of lines to read. If \code{nmax==0}, then there is no read cap. 26 | #' @param verbose 27 | #' Should linecounts of the input file and the number of lines sampled be 28 | #' printed? 
29 | #' 30 | #' @return 31 | #' \code{NULL} 32 | #' 33 | #' @useDynLib filesampler R_fs_sample_prop 34 | #' @export 35 | file_sample_prop = function(p, infile, outfile=tempfile(), header=TRUE, nskip=0, nmax=0, verbose=FALSE) 36 | { 37 | check.is.scalar(p) 38 | check.is.string(infile) 39 | infile = abspath(infile) 40 | check.is.string(outfile) 41 | check.is.flag(header) 42 | check.is.natnum(nskip) 43 | check.is.natnum(nmax) 44 | check.is.flag(verbose) 45 | 46 | if (p == 0) 47 | stop("no lines available for input") 48 | if (p < 0 || p > 1) 49 | stop("Argument 'p' must be between 0 and 1") 50 | 51 | .Call(R_fs_sample_prop, verbose, header, as.integer(nskip), as.integer(nmax), as.double(p), infile, outfile) 52 | 53 | invisible() 54 | } 55 | -------------------------------------------------------------------------------- /R/filesampler-package.r: -------------------------------------------------------------------------------- 1 | #' File Sampler 2 | #' 3 | #' A simple package for reading subsamples of flat text files by line in a 4 | #' reasonably efficient manner. We do so by sampling as the input file is 5 | #' scanned and randomly choosing whether or not to dump the current line to an 6 | #' external temporary file. This temporary file is then read back into R. For 7 | #' (aggressive) downsampling, this is a very effective strategy; for resampling, 8 | #' you are much better off reading the full dataset into memory. 9 | #' 10 | #' The basic function performs the sampling/reading in a single pass. There is 11 | #' an alternative version which allows for an exact amount to be subsampled via 12 | #' reservoir sampling, but this version requires 2 passes through the data. The 13 | #' heavy lifting is done entirely in C. 14 | #' 15 | #' This package, including the underlying C library, is licensed under the 16 | #' permissive 2-clause BSD license. 
17 | #' 18 | #' @author Drew Schmidt \email{wrathematics@@gmail.com} 19 | #' @references Project URL: \url{https://github.com/wrathematics/filesampler} 20 | #' 21 | #' @importFrom utils read.csv 22 | #' 23 | #' @name filesampler 24 | #' @docType package 25 | #' @title File Sampler 26 | #' @keywords package 27 | NULL 28 | -------------------------------------------------------------------------------- /R/reactor.r: -------------------------------------------------------------------------------- 1 | is.badval <- function(x) 2 | { 3 | is.na(x) || is.nan(x) || is.infinite(x) 4 | } 5 | 6 | is.inty <- function(x) 7 | { 8 | abs(x - round(x)) < 1e-10 9 | } 10 | 11 | is.zero <- function(x) 12 | { 13 | abs(x) < 1e-10 14 | } 15 | 16 | is.negative <- function(x) 17 | { 18 | x < 0 19 | } 20 | 21 | is.annoying <- function(x) 22 | { 23 | length(x) != 1 || is.badval(x) 24 | } 25 | 26 | 27 | 28 | check.is.flag <- function(x) 29 | { 30 | if (!is.logical(x) || is.annoying(x)) 31 | { 32 | nm <- deparse(substitute(x)) 33 | stop(paste0("argument '", nm, "' must be TRUE or FALSE"), call.=FALSE) 34 | } 35 | 36 | invisible(TRUE) 37 | } 38 | 39 | 40 | 41 | check.is.scalar <- function(x) 42 | { 43 | if (!is.numeric(x) || is.annoying(x)) 44 | { 45 | nm <- deparse(substitute(x)) 46 | stop(paste0("argument '", nm, "' must be a single number (not NA, Inf, NaN)"), call.=FALSE) 47 | } 48 | 49 | invisible(TRUE) 50 | } 51 | 52 | 53 | 54 | check.is.string <- function(x) 55 | { 56 | if (!is.character(x) || is.annoying(x)) 57 | { 58 | nm <- deparse(substitute(x)) 59 | stop(paste0("argument '", nm, "' must be a single string"), call.=FALSE) 60 | } 61 | 62 | invisible(TRUE) 63 | } 64 | 65 | 66 | 67 | check.is.int <- function(x) 68 | { 69 | if (!is.numeric(x) || is.annoying(x) || !is.inty(x)) 70 | { 71 | nm <- deparse(substitute(x)) 72 | stop(paste0("argument '", nm, "' must be an integer"), call.=FALSE) 73 | } 74 | 75 | invisible(TRUE) 76 | } 77 | 78 | 79 | 80 | check.is.natnum <- function(x) 81 | { 82 | 
if (!is.numeric(x) || is.annoying(x) || !is.inty(x) || is.negative(x)) 83 | { 84 | nm <- deparse(substitute(x)) 85 | stop(paste0("argument '", nm, "' must be a natural number (0 or positive integer)"), call.=FALSE) 86 | } 87 | 88 | invisible(TRUE) 89 | } 90 | 91 | 92 | 93 | check.is.posint <- function(x) 94 | { 95 | if (!is.numeric(x) || is.annoying(x) || !is.inty(x) || is.negative(x) || is.zero(x)) 96 | { 97 | nm <- deparse(substitute(x)) 98 | stop(paste0("argument '", nm, "' must be a positive integer"), call.=FALSE) 99 | } 100 | 101 | invisible(TRUE) 102 | } 103 | 104 | 105 | 106 | check.is.function <- function(x) 107 | { 108 | if (!is.function(x)) 109 | { 110 | nm <- deparse(substitute(x)) 111 | stop(paste0("argument '", nm, "' must be a function"), call.=FALSE) 112 | } 113 | 114 | invisible(TRUE) 115 | } 116 | -------------------------------------------------------------------------------- /R/sample_csv.r: -------------------------------------------------------------------------------- 1 | #' Read Sample of CSV 2 | #' 3 | #' The function will read (as csv) approximately p*nlines lines. So 4 | #' if \code{p=.1}, then we will get roughly (probably not exactly) 10% of the 5 | #' data. This is the analogue of the base R function \code{read.csv()}. 6 | #' 7 | #' @details 8 | #' This function scans over the text of the input file and at each step, 9 | #' randomly chooses whether or not to include the current line into a 10 | #' downsampled file. Each selected line is placed in a temporary file, before 11 | #' being read into R via \code{read.csv()}. Additional arguments to this 12 | #' function (those other than \code{file}, \code{param}, and \code{verbose}) are 13 | #' passed to \code{read.csv()}, and so if their behavior is unclear, you should 14 | #' examine the \code{read.csv()} help file. 
15 | #' 16 | #' If \code{verbose=TRUE}, then something like: 17 | #' 18 | #' \code{Read 12207 lines (0.001\%) of 12174948 line file.} 19 | #' 20 | #' will be printed to the terminal. This counts the header (if there is one) as 21 | #' one of the lines read and as one of the lines possible. 22 | #' 23 | #' @param file 24 | #' Location of the file (as a string) to be subsampled. 25 | #' @param param 26 | #' The downsampling parameter. For the "proportional" method, this is the 27 | #' proportion to retain and should be a numeric value between 0 and 1. For the 28 | #' exact method, this is the total number of lines to read in. 29 | #' @param method 30 | #' A string indicating the type of read method to use. Options are 31 | #' "proportional" and "exact". 32 | #' @param reader 33 | #' A function specifying the reader to use. The default is 34 | #' \code{utils::read.csv}. Other options include \code{data.table::fread()} and 35 | #' \code{readr::read_csv()}. Note the first argument of the reader should be 36 | #' the file to read in and the second should be the 37 | #' \code{header}/\code{col_names} argument. This would require writing a small 38 | #' wrapper for \code{fread()}. 39 | #' @param header 40 | #' Is a header (line of column names) on the first line of the csv file? 41 | #' @param nskip 42 | #' Number of lines to skip. If \code{header=TRUE}, then this only applies to 43 | #' lines after the header. 44 | #' @param nmax 45 | #' Max number of lines to read. If nmax==0, then there is no read cap. Ignored 46 | #' if \code{method="exact"}. 47 | #' @param verbose 48 | #' Should linecounts of the input file and the number of lines sampled be 49 | #' printed? 50 | #' @param ... 51 | #' Additional arguments passed to the csv reader. 52 | #' 53 | #' @return 54 | #' A dataframe, as with \code{read.csv()}. 
55 | #' 56 | #' @examples 57 | #' library(filesampler) 58 | #' file = system.file("rawdata/small.csv", package="filesampler") 59 | #' 60 | #' # Read in a 5% random subsample of the rows. 61 | #' data = sample_csv(file, param=.05) 62 | #' 63 | #' # Read in 10 randomly sampled rows. 64 | #' data = sample_csv(file, param=10, method="exact") 65 | #' 66 | #' @export 67 | sample_csv = function(file, param, method="proportional", reader=utils::read.csv, header=TRUE, nskip=0, nmax=0, verbose=FALSE, ...) 68 | { 69 | check.is.function(reader) 70 | method = match.arg(tolower(method), c("proportional", "exact")) 71 | 72 | outfile = tempfile() 73 | 74 | if (method == "proportional") 75 | { 76 | p = param 77 | file_sample_prop(p=p, infile=file, outfile=outfile, header=header, nskip=nskip, nmax=nmax, verbose=verbose) 78 | } 79 | else if (method == "exact") 80 | { 81 | nlines = param 82 | file_sample_exact(nlines=nlines, infile=file, outfile=outfile, header=header, nskip=nskip, verbose=verbose) 83 | } 84 | 85 | 86 | reader_nm = deparse(substitute(reader)) 87 | if (grepl(reader_nm, pattern="read_csv")) 88 | data = reader(outfile, col_names=header, ...) 89 | else 90 | data = reader(outfile, header=header, ...) 91 | 92 | unlink(outfile) 93 | return(data) 94 | } 95 | -------------------------------------------------------------------------------- /R/sample_lines.r: -------------------------------------------------------------------------------- 1 | #' Read Sample Lines of Text File 2 | #' 3 | #' The function will read approximately p*nlines lines of a flat text 4 | #' file. So if \code{p=.1}, then we will get roughly (probably not exactly) 5 | #' 10% of the data. This is the analogue of the base R function 6 | #' \code{readLines()}. 7 | #' 8 | #' @details 9 | #' This function scans over the text of the input file and at each step, randomly 10 | #' chooses whether or not to include the current line into a downsampled file. 
11 | #' Each selected line is placed in a temporary file, before being read into R 12 | #' via \code{readLines()}. Additional arguments to this function (those other 13 | #' than \code{file}, \code{p}, and \code{verbose}) are passed to \code{readLines()}, 14 | #' and so if their behavior is unclear, you should examine the \code{readLines()} 15 | #' help file. 16 | #' 17 | #' If \code{verbose=TRUE}, then something like: 18 | #' 19 | #' \code{Read 12207 lines (0.001\%) of 12174948 line file.} 20 | #' 21 | #' will be printed to the terminal. This counts the header (if there is one) 22 | #' as one of the lines read and as one of the lines possible. 23 | #' 24 | #' @param file 25 | #' Location of the file (as a string) to be subsampled. 26 | #' @param n 27 | #' As in \code{readLines()}. 28 | #' @param p 29 | #' Proportion to retain; should be a numeric value between 0 and 1. 30 | #' @param nskip 31 | #' Number of lines to skip. 32 | #' @param nmax 33 | #' Max number of lines to read. If nmax==0, then there is no read cap. 34 | #' @param verbose 35 | #' Logical; indicates whether or not linecounts of the input file and the number 36 | #' of lines sampled should be printed. 37 | #' @param ... 38 | #' Additional arguments passed to \code{readLines()}. 39 | #' 40 | #' @return 41 | #' A character vector, as with \code{readLines()}. 42 | #' 43 | #' @examples 44 | #' library(filesampler) 45 | #' file = system.file("rawdata/small.csv", package="filesampler") 46 | #' sample_lines(file, p=.05) 47 | #' 48 | #' @export 49 | sample_lines = function(file, n=-1L, p=.1, nskip=0, nmax=0, verbose=FALSE, ...) 50 | { 51 | check.is.int(n) 52 | 53 | if (p == 0) 54 | return(character(0)) 55 | 56 | if (n > 0 && n < nskip) 57 | return(character(0)) 58 | 59 | outfile = tempfile() 60 | file_sample_prop(verbose=verbose, header=FALSE, nskip=nskip, nmax=nmax, p=p, infile=file, outfile=outfile) 61 | 62 | data = readLines(outfile, n=n, ...) 
63 | unlink(outfile) 64 | data 65 | } 66 | -------------------------------------------------------------------------------- /R/utils.r: -------------------------------------------------------------------------------- 1 | title_case = function(x) gsub(x, pattern="(^|[[:space:]])([[:alpha:]])", replacement="\\1\\U\\2", perl=TRUE) 2 | 3 | abspath = function(file) 4 | { 5 | p = path.expand(file) 6 | if (!file.exists(p)) 7 | stop("file does not exist") 8 | 9 | p 10 | } 11 | -------------------------------------------------------------------------------- /R/wc.r: -------------------------------------------------------------------------------- 1 | #' Count Letters, Words, and Lines of a File 2 | #' 3 | #' See title. 4 | #' 5 | #' @details 6 | #' \code{wc_l()} is a shorthand for counting only lines, similar to \code{wc -l} 7 | #' in the terminal. Likewise \code{wc_w()} is analogous to \code{wc -w} for 8 | #' words. 9 | #' 10 | #' @param file 11 | #' Location of the file (as a string) from which the counts will be generated. 12 | #' @param chars,words,lines 13 | #' Should char/word/line counts be shown? At least one of the three must be 14 | #' \code{TRUE}. 15 | #' 16 | #' @return 17 | #' A list containing the requested counts. 
18 | #' 19 | #' @examples 20 | #' library(filesampler) 21 | #' file = system.file("rawdata/small.csv", package="filesampler") 22 | #' data = wc(file=file) 23 | #' 24 | #' @name wc 25 | #' @rdname wc 26 | NULL 27 | 28 | 29 | 30 | #' @useDynLib filesampler R_fs_wc 31 | #' @rdname wc 32 | #' @export 33 | wc = function(file, chars=TRUE, words=TRUE, lines=TRUE) 34 | { 35 | check.is.string(file) 36 | check.is.flag(chars) 37 | check.is.flag(words) 38 | check.is.flag(lines) 39 | 40 | if (!chars && !words && !lines) 41 | stop("at least one of the arguments 'chars', 'words', or 'lines' must be TRUE") 42 | 43 | file = abspath(file) 44 | ret = .Call(R_fs_wc, file, chars, words, lines) 45 | 46 | counts = list(chars=ret[1L], words=ret[2L], lines=ret[3L]) 47 | class(counts) = "wc" 48 | attr(counts, "file") = file 49 | 50 | counts 51 | } 52 | 53 | 54 | 55 | #' @rdname wc 56 | #' @export 57 | wc_w = function(file) 58 | { 59 | wc(file=file, chars=FALSE, words=TRUE, lines=FALSE) 60 | } 61 | 62 | 63 | 64 | #' @rdname wc 65 | #' @export 66 | wc_l = function(file) 67 | { 68 | wc(file=file, chars=FALSE, words=FALSE, lines=TRUE) 69 | } 70 | 71 | 72 | 73 | #' @title Print \code{wc} objects 74 | #' @description Printing for \code{wc()} 75 | #' @param x \code{wc} object 76 | #' @param ... unused 77 | #' @name print-wc 78 | #' @rdname print-wc 79 | #' @method print wc 80 | #' @export 81 | print.wc = function(x, ...) 
82 | { 83 | cat("file: ", attr(x, "file"), "\n") 84 | 85 | x = x[which(x != -1)] 86 | 87 | maxlen = max(sapply(names(x), nchar)) 88 | names = gsub(names(x), pattern="_", replacement=" ") 89 | names = title_case(x=names) 90 | spacenames = simplify2array(lapply(names, function(str) paste0(str, ":", paste0(rep(" ", maxlen-nchar(str)), collapse="")))) 91 | 92 | cat(paste(spacenames, x, sep=" ", collapse="\n"), "\n") 93 | invisible() 94 | } 95 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # filesampler 2 | 3 | * **Version:** 0.4-0 4 | * **Status:** [![Build Status](https://travis-ci.org/wrathematics/filesampler.png)](https://travis-ci.org/wrathematics/filesampler) 5 | * **License:** [BSD 2-Clause](http://opensource.org/licenses/BSD-2-Clause) 6 | * **Project home**: https://github.com/wrathematics/lineSampler 7 | * **Bug reports**: https://github.com/wrathematics/lineSampler/issues 8 | * **Author:** Drew Schmidt 9 | 10 | 11 | 12 | This is a simple R package to quickly read a random sample of lines of a flat text file (such as a csv) into R. This allows you to get a subsample into R without having to read the (possibly large) file into memory first. 13 | 14 | If you would like to contribute to this project, please see the `CONTRIBUTING.md` file. Idea inspired by Eduardo Arino de la Rubia's [fast_sample](https://github.com/earino/fast_sample). 
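The single-pass idea described above can be sketched in plain R. This is only an illustration under stated assumptions: `sample_lines_sketch` is a hypothetical name that is not part of filesampler, it keeps the sampled lines in memory rather than writing them to a temporary file, and the package itself does this work in C.

```r
# Illustrative sketch only: one-pass proportional (Bernoulli) line sampling.
# `sample_lines_sketch` is a hypothetical helper, not a filesampler function.
sample_lines_sketch <- function(infile, p, header = TRUE)
{
  stopifnot(is.numeric(p), length(p) == 1, p > 0, p <= 1)

  con <- file(infile, open = "r")
  on.exit(close(con))

  kept <- character(0)
  if (header)
    kept <- readLines(con, n = 1L)  # the header line is always kept

  # Read one line at a time and keep each with probability p,
  # so the full file is never held in memory.
  while (length(line <- readLines(con, n = 1L)) > 0)
  {
    if (runif(1) < p)
      kept <- c(kept, line)
  }

  kept
}

# Usage: a 1000-row csv sampled at roughly 10%.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", paste(1:1000, 1:1000, sep = ",")), tmp)
out <- sample_lines_sketch(tmp, p = 0.1)
length(out)  # header plus about 100 rows; the exact count varies per run
unlink(tmp)
```

Because each line is kept independently, the number of rows returned is random; getting an exact count needs a different scheme, such as the package's two-pass "exact" method.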
15 | 16 | 17 | 18 | ## Installation 19 | 20 | You can install the stable version from [the HPCRAN](https://hpcran.org) using the usual `install.packages()`: 21 | 22 | ```r 23 | install.packages("filesampler", repos="https://hpcran.org") 24 | ``` 25 | 26 | The development version is maintained on GitHub: 27 | 28 | 29 | ```r 30 | remotes::install_github("wrathematics/filesampler") 31 | ``` 32 | 33 | 34 | 35 | ## Package Use 36 | 37 | ```r 38 | library(filesampler) 39 | ret <- sample_csv(file, param=.001) 40 | ``` 41 | 42 | There is also a `sample_lines()` function for reading in (line) subsamples of unstructured text akin to `readLines()`. See package vignette for more details. 43 | 44 | For more information, including benchmarks and implementation details, please see the package vignette: 45 | 46 | ```r 47 | vignette("filesampler", package="filesampler") 48 | ``` 49 | 50 | 51 | 52 | ## Code Re-Use 53 | 54 | The C code in the `src/filesampler` tree of this package can easily be re-purposed for use outside of R with some minor modifications. The components that need to be edited can be found in the file: 55 | 56 | * `src/filesampler/utils.h` 57 | 58 | Detailed explanations are contained there, but you will need: 59 | 60 | * a uniform random number generator 61 | * a print function (probably `printf()`) 62 | * an interrupt checker 63 | -------------------------------------------------------------------------------- /cleanup: -------------------------------------------------------------------------------- 1 | #! 
/bin/sh 2 | 3 | rm -rf ./src/Makevars 4 | 5 | rm -rf ./src/*.o 6 | rm -rf ./src/filesampler/*.o 7 | 8 | rm -rf ./src/*.so* 9 | rm -rf ./src/*.dll 10 | -------------------------------------------------------------------------------- /configure.ac: -------------------------------------------------------------------------------- 1 | AC_PREREQ([2.69]) 2 | AC_INIT(DESCRIPTION) 3 | AC_CONFIG_MACRO_DIRS([tools/]) 4 | 5 | # Get C compiler from R 6 | : ${R_HOME=`R RHOME`} 7 | if test -z "${R_HOME}"; then 8 | echo "could not determine R_HOME" 9 | exit 1 10 | fi 11 | CC=`"${R_HOME}/bin/R" CMD config CC` 12 | CFLAGS=`"${R_HOME}/bin/R" CMD config CFLAGS` 13 | CPPFLAGS=`"${R_HOME}/bin/R" CMD config CPPFLAGS` 14 | 15 | 16 | AC_PROG_CC_C99 17 | AC_OPENMP 18 | if test -n "${OPENMP_CFLAGS}"; then 19 | have_omp="yes" 20 | OMP_FLAGS="\$(SHLIB_OPENMP_CFLAGS)" 21 | else 22 | have_omp="no" 23 | OMP_FLAGS="" 24 | fi 25 | 26 | 27 | 28 | AX_CHECK_COMPILE_FLAG("-mavx2", [have_avx2="yes"], [have_avx2="no"]) 29 | if test "X${have_avx2}" = Xyes; then 30 | AVX2_FLAGS="-mavx2" 31 | else 32 | AVX2_FLAGS="" 33 | fi 34 | 35 | 36 | echo " " 37 | echo "**************** Results of filesampler package configure ****************" 38 | echo " " 39 | echo "* OpenMP Report" 40 | echo "* >> Compiler support: ${have_omp}" 41 | echo "* >> CFLAGS = ${OMP_FLAGS}" 42 | echo "* avx2 Report" 43 | echo "* >> Compiler support: ${have_avx2}" 44 | echo "* >> CPPFLAGS = ${AVX2_FLAGS}" 45 | echo "**************************************************************************" 46 | echo " " 47 | 48 | AC_SUBST(OMP_FLAGS) 49 | AC_SUBST(AVX2_FLAGS) 50 | AC_OUTPUT(src/Makevars) 51 | -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | year <- sub("-.*", "", meta$Date) 2 | note <- sprintf("{R} package version %s", meta$Version) 3 | 4 | citEntry( 5 | entry = "Misc", 6 | title = "{filesampler}: File Line 
Sampler", 7 | author = personList(as.person("Drew Schmidt")), 8 | year = year, 9 | note = note, 10 | url = "https://cran.r-project.org/package=filesampler", 11 | textVersion = NULL, 12 | key = "filesamplerPackage" 13 | ) 14 | 15 | citEntry( 16 | entry = "Manual", 17 | title = "Introduction to the {filesampler} Package", 18 | author = personList(as.person("Drew Schmidt")), 19 | year = "2016", 20 | note = "{R} Vignette", 21 | url = "https://cran.r-project.org/package=filesampler", 22 | textVersion = NULL, 23 | key = "filesamplerVignette" 24 | ) 25 | -------------------------------------------------------------------------------- /inst/benchmarks/.gitignore: -------------------------------------------------------------------------------- 1 | big.csv 2 | -------------------------------------------------------------------------------- /inst/benchmarks/README: -------------------------------------------------------------------------------- 1 | These benchmarks require the use of a reasonably large csv file. You 2 | can generate one by running the script `makebig` found in this directory. 
3 | -------------------------------------------------------------------------------- /inst/benchmarks/makebig: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | startfile=../rawdata/small.csv 4 | 5 | rm -rf med.csv 6 | rm -rf big.csv 7 | 8 | len=`wc -l < $startfile` 9 | 10 | head -n 1 $startfile >> med.csv 11 | 12 | for ((i=0; i<160; i++)); do 13 | tail -n $(( $len - 1 )) $startfile >> med.csv 14 | done 15 | 16 | 17 | 18 | len=`wc -l < med.csv` 19 | 20 | head -n 1 med.csv >> big.csv 21 | 22 | for ((i=0; i<1000; i++)); do 23 | tail -n $(( $len - 1 )) med.csv >> big.csv 24 | done 25 | 26 | 27 | rm -rf med.csv 28 | -------------------------------------------------------------------------------- /inst/benchmarks/sampler.r: -------------------------------------------------------------------------------- 1 | library(filesampler) 2 | 3 | 4 | system.time({ 5 | x <- sample_csv("big.csv", method="exact", param=15000, verbose=TRUE) 6 | }) 7 | 8 | 9 | system.time({ 10 | x <- sample_csv("big.csv", param=.001, verbose=TRUE) 11 | }) 12 | -------------------------------------------------------------------------------- /inst/benchmarks/wc.r: -------------------------------------------------------------------------------- 1 | library(filesampler) 2 | 3 | system.time(x <- wc_l("big.csv")) 4 | 5 | x 6 | -------------------------------------------------------------------------------- /inst/rawdata/small.csv: -------------------------------------------------------------------------------- 1 | "A","B","C","D","E","F" 2 | 31,"v","A",0.0344356081914157,0.197601571670833,79.807882879395 3 | 58,"e","J",0.868477246491238,-0.0333204715610167,80.296481680125 4 | 5,"d","B",0.469008938642219,-0.507745240265884,17.0324605982751 5 | 17,"z","C",0.263196594081819,-0.894154055132274,49.2029465176165 6 | 30,"c","T",0.423068210249767,-0.714639366414706,33.1179875787348 7 | 75,"y","P",0.157697404269129,-1.05675687342485,15.0995138473809 8 | 
30,"m","U",0.663265228504315,-0.102609033060013,46.2487461487763 9 | 26,"l","S",0.697267947718501,-0.351352275280531,28.2710071466863 10 | 46,"p","W",0.0227476919535547,-1.26253520594915,71.3425914221443 11 | 100,"w","I",0.0328309496399015,-2.2742194420414,76.8699974031188 12 | 7,"i","D",0.953019385691732,-0.526115032090435,91.7900401726365 13 | 22,"e","C",0.214634924661368,0.0298457049024103,52.8432966885157 14 | 49,"b","N",0.73877848428674,0.735700914198922,47.3630763380788 15 | 22,"o","Z",0.930282746674493,1.69084953123455,45.2439673081972 16 | 82,"i","C",0.95590949524194,0.13338503874271,39.8316274024546 17 | 42,"w","F",0.559732376364991,-0.222795841608947,52.3474555672146 18 | 76,"f","L",0.444616575026885,-0.308119184474722,76.059751638677 19 | 91,"b","J",0.939074153313413,-2.14593634852752,58.5156215843745 20 | 96,"m","H",0.776781423948705,1.4502067197792,79.699759890791 21 | 94,"i","C",0.289326652418822,0.1935193313665,55.3891222900711 22 | 84,"n","F",0.848534285090864,1.03334443771633,95.9706183988601 23 | 21,"f","Q",0.0960773911792785,-0.230727972829568,15.320910315495 24 | 100,"r","U",0.515066175023094,1.58910198179705,21.091499782633 25 | 92,"b","B",0.647247697459534,1.20942736228284,34.8187624011189 26 | 8,"n","A",0.00861937203444541,0.292516057068796,30.1586005091667 27 | 83,"h","B",0.220667276298627,0.00146272657865459,62.4190913466737 28 | 33,"q","L",0.0601193737238646,2.01678033592448,88.6710710334592 29 | 7,"p","K",0.433341701282188,0.795315551223573,33.1174146966077 30 | 90,"x","D",0.169272042112425,-0.610111017730115,66.5069763478823 31 | 6,"p","B",0.145372492726892,1.52611855611189,20.8286189078353 32 | 100,"y","G",0.304517379263416,-1.05102990058431,45.0845562084578 33 | 90,"s","I",0.602839902974665,1.84877027567577,59.3807370681316 34 | 82,"d","W",0.257859059143811,1.30801004183715,37.6233650557697 35 | 59,"w","U",0.146642411127687,-1.69092871059025,14.5103294635192 36 | 52,"b","K",0.571891283616424,-0.689954480114329,53.2374527514912 37 | 
3,"v","I",0.19160017836839,-0.930770873136812,89.8825674923137 38 | 52,"o","M",0.823421220993623,0.484253791040398,32.5794443488121 39 | 20,"f","R",0.511186928488314,-0.150959552705217,38.3852071734145 40 | 99,"a","C",0.257107607088983,0.543330031852767,16.28650350729 41 | 63,"k","G",0.641139208804816,0.996895748616495,62.9691184335388 42 | 87,"y","B",0.201599461957812,0.599122824399696,34.5542569248937 43 | 84,"o","K",0.25619876710698,-0.134000852338425,41.6240883222781 44 | 49,"j","S",0.473985320422798,-0.551402560151181,80.1333197625354 45 | 99,"w","A",0.948514126939699,1.57938969908825,87.4724759580567 46 | 48,"j","X",0.158811556873843,0.627735521292107,79.1880375985056 47 | 2,"g","O",0.293799062492326,-0.822778808870904,41.1412728042342 48 | 69,"c","L",0.754705770406872,0.390535823507935,94.5743023580872 49 | 55,"z","F",0.478943325113505,-0.364710776856741,61.6175709129311 50 | 96,"y","D",0.759227265138179,0.205341535526123,63.838811442256 51 | 12,"g","Z",0.101827198406681,-0.00428392052516273,10.5943197244778 52 | 87,"f","A",0.654482287121937,-0.322827993815207,88.8147027604282 53 | 87,"f","R",0.0575539630372077,-0.281809239856949,54.9242287687957 54 | 48,"j","X",0.504813719773665,0.683105702318004,71.3473828113638 55 | 40,"x","Z",0.874165998306125,1.36229750817053,58.0776743311435 56 | 49,"o","O",0.146845723502338,-0.00241839228845686,62.0573334139772 57 | 67,"c","S",0.706441656686366,1.19764725630863,30.4078086558729 58 | 13,"p","Z",0.34508125577122,-0.455623686756772,40.096492273733 59 | 40,"g","X",0.536776724969968,-0.0951942274812661,10.4143589432351 60 | 63,"r","Z",0.903075464535505,0.502183970814897,58.7581716710702 61 | 13,"k","Q",0.831190430326387,1.86223311514226,87.2440657229163 62 | 45,"q","S",0.851588690653443,-0.993020661598526,33.6388454330154 63 | 49,"m","P",0.647244467865676,-0.59059429990917,91.2623915518634 64 | 94,"i","Y",0.0192667245864868,-0.243914617726999,10.9829535800964 65 | 
77,"r","G",0.248602577485144,0.356037725132404,42.5667992956005 66 | 94,"e","O",0.953355941688642,-0.502098317746785,52.9305877559818 67 | 27,"q","J",0.23831829521805,-0.640516447399472,94.6382070984691 68 | 72,"n","F",0.407367554027587,-0.0713510222814287,90.4908181494102 69 | 28,"n","W",0.573690160643309,1.48579320914896,72.228856375441 70 | 1,"c","K",0.762256102170795,0.615465325415559,88.3274596068077 71 | 83,"q","H",0.0123868996743113,0.680072131251977,84.4729636912234 72 | 73,"r","R",0.0478137016762048,1.01801018855128,10.6682984228246 73 | 11,"s","L",0.803982262965292,-0.19969314014496,19.4303638185374 74 | 54,"b","G",0.155096606584266,-0.933558833688986,17.3642539186403 75 | 78,"z","Z",0.792126497719437,-0.534543097232384,34.1601846390404 76 | 88,"q","V",0.366345942718908,-0.114730600414802,92.6017211284488 77 | 7,"t","H",0.428722943877801,1.57858338038043,60.1827564346604 78 | 8,"w","M",0.0831752363592386,1.28651271683734,29.9501179973595 79 | 4,"j","X",0.141073858132586,1.34795892329137,36.4506742311642 80 | 87,"g","S",0.260312086436898,-0.0554640013763592,92.6275063771755 81 | 14,"w","H",0.190684856381267,0.221300020385139,27.2539445431903 82 | 93,"a","E",0.168578451033682,-0.696105815815523,47.6984555623494 83 | 72,"z","T",0.909249585820362,-0.651621286646339,70.1360652432777 84 | 88,"t","Q",0.55341205233708,-0.803369540743316,68.9707933459431 85 | 66,"m","P",0.697604088578373,-1.6475732415293,52.6957668946125 86 | 73,"v","X",0.586017430527136,0.321069618408937,29.9387905886397 87 | 74,"x","P",0.104086846811697,0.457174348627436,94.8964916937985 88 | 63,"l","V",0.564066298073158,1.63904016175932,77.7651392738335 89 | 67,"c","A",0.317764449398965,1.65236337759661,66.9126302446239 90 | 38,"v","V",0.762690836330876,-0.907516528439995,18.7640961213037 91 | 5,"j","D",0.258540550014004,-1.4848916678904,44.5786042907275 92 | 27,"u","G",0.616374830948189,-1.04993173547399,16.9220584654249 93 | 64,"c","O",0.180769482627511,0.109428632280286,73.9784279139712 94 | 
23,"k","J",0.779640080407262,0.797267568837994,59.7707860195078 95 | 45,"u","Z",0.914203838445246,-0.358003181560163,12.4600219866261 96 | 95,"v","E",0.533451511291787,0.386028503446159,70.4833873175085 97 | 5,"z","N",0.4724566496443,0.895458647868482,21.5520231472328 98 | 8,"j","T",0.316375133348629,0.129911444978728,61.0537318745628 99 | 67,"w","P",0.231593899428844,0.712171107833738,47.2880723653361 100 | 91,"w","M",0.69922315934673,0.763018783592163,11.7304084543139 101 | 32,"r","E",0.143097473075613,-1.67635300018962,99.3239515228197 102 | -------------------------------------------------------------------------------- /man/file_sample_exact.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/file_sample_exact.r 3 | \name{file_sample_exact} 4 | \alias{file_sample_exact} 5 | \title{Exact File Sampler} 6 | \usage{ 7 | file_sample_exact( 8 | nlines, 9 | infile, 10 | outfile = tempfile(), 11 | header = TRUE, 12 | nskip = 0, 13 | verbose = FALSE 14 | ) 15 | } 16 | \arguments{ 17 | \item{nlines}{The (exact) number of lines to sample from the input file.} 18 | 19 | \item{infile}{Location of the file (as a string) to be subsampled.} 20 | 21 | \item{outfile}{Output file location (as a string).} 22 | 23 | \item{header}{Is a header (line of column names) on the first line of the csv file?} 24 | 25 | \item{nskip}{Number of lines to skip. If \code{header=TRUE}, then this only applies to 26 | lines after the header.} 27 | 28 | \item{verbose}{Should linecounts of the input file and the number of lines sampled be 29 | printed?} 30 | } 31 | \value{ 32 | \code{NULL} 33 | } 34 | \description{ 35 | Randomly sample lines from an input text file. 36 | } 37 | \details{ 38 | The sampling is done in two passes of the input file. 
First, the number of 39 | lines of the input file is determined by scanning through the file as 40 | quickly as possible (i.e., it should be completely I/O bound). Next, an 41 | index of lines to keep is produced by reservoir sampling. Finally, the 42 | input file is scanned again line by line with the chosen lines dumped into a 43 | temporary file. 44 | 45 | If the output file (the one pointed to by the return of this function) is 46 | "large" and to be read into memory (which isn't really appropriate for text 47 | files in the first place!), then this strategy is probably not appropriate. 48 | } 49 | -------------------------------------------------------------------------------- /man/file_sample_prop.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/file_sample_prob.r 3 | \name{file_sample_prop} 4 | \alias{file_sample_prop} 5 | \title{Proportional File Sampler} 6 | \usage{ 7 | file_sample_prop( 8 | p, 9 | infile, 10 | outfile = tempfile(), 11 | header = TRUE, 12 | nskip = 0, 13 | nmax = 0, 14 | verbose = FALSE 15 | ) 16 | } 17 | \arguments{ 18 | \item{p}{Proportion to retain; should be a numeric value between 0 and 1.} 19 | 20 | \item{infile}{Location of the file (as a string) to be subsampled.} 21 | 22 | \item{outfile}{Output file location (as a string).} 23 | 24 | \item{header}{Is a header (line of column names) on the first line of the csv file?} 25 | 26 | \item{nskip}{Number of lines to skip. If \code{header=TRUE}, then this only applies to 27 | lines after the header.} 28 | 29 | \item{nmax}{Max number of lines to read. If \code{nmax==0}, then there is no read cap.} 30 | 31 | \item{verbose}{Should linecounts of the input file and the number of lines sampled be 32 | printed?} 33 | } 34 | \value{ 35 | \code{NULL} 36 | } 37 | \description{ 38 | Randomly sample lines from an input text file.
39 | } 40 | \details{ 41 | The sampling is done in one pass of the input file, dumping lines to a 42 | temporary file as the input is read. 43 | 44 | If the output file (the one pointed to by the return of this function) is 45 | "large" and to be read into memory (which isn't really appropriate for text 46 | files in the first place!), then this strategy is probably not appropriate. 47 | } 48 | -------------------------------------------------------------------------------- /man/filesampler.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/filesampler-package.r 3 | \docType{package} 4 | \name{filesampler} 5 | \alias{filesampler} 6 | \title{File Sampler} 7 | \description{ 8 | File Sampler 9 | } 10 | \details{ 11 | A simple package for reading subsamples of flat text files by line in a 12 | reasonably efficient manner. We do so by sampling as the input file is 13 | scanned and randomly choosing whether or not to dump the current line to an 14 | external temporary file. This temporary file is then read back into R. For 15 | (aggressive) downsampling, this is a very effective strategy; for resampling, 16 | you are much better off reading the full dataset into memory. 17 | 18 | The basic function performs the sampling/reading in a single pass. There is 19 | an alternative version which allows for an exact amount to be subsampled via 20 | reservoir sampling, but this version requires 2 passes through the data. The 21 | heavy lifting is done entirely in C. 22 | 23 | This package, including the underlying C library, is licensed under the 24 | permissive 2-clause BSD license.
25 | } 26 | \references{ 27 | Project URL: \url{https://github.com/wrathematics/filesampler} 28 | } 29 | \author{ 30 | Drew Schmidt \email{wrathematics@gmail.com} 31 | } 32 | \keyword{package} 33 | -------------------------------------------------------------------------------- /man/print-wc.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/wc.r 3 | \name{print-wc} 4 | \alias{print-wc} 5 | \alias{print.wc} 6 | \title{Print \code{wc} objects} 7 | \usage{ 8 | \method{print}{wc}(x, ...) 9 | } 10 | \arguments{ 11 | \item{x}{\code{wc} object} 12 | 13 | \item{...}{unused} 14 | } 15 | \description{ 16 | Printing for \code{wc()} 17 | } 18 | -------------------------------------------------------------------------------- /man/sample_csv.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/sample_csv.r 3 | \name{sample_csv} 4 | \alias{sample_csv} 5 | \title{Read Sample of CSV} 6 | \usage{ 7 | sample_csv( 8 | file, 9 | param, 10 | method = "proportional", 11 | reader = utils::read.csv, 12 | header = TRUE, 13 | nskip = 0, 14 | nmax = 0, 15 | verbose = FALSE, 16 | ... 17 | ) 18 | } 19 | \arguments{ 20 | \item{file}{Location of the file (as a string) to be subsampled.} 21 | 22 | \item{param}{The downsampling parameter. For the "proportional" method, this is the 23 | proportion to retain and should be a numeric value between 0 and 1. For the 24 | exact method, this is the total number of lines to read in.} 25 | 26 | \item{method}{A string indicating the type of read method to use. Options are 27 | "proportional" and "exact".} 28 | 29 | \item{reader}{A function specifying the reader to use. The default is 30 | \code{utils::read.csv}. Other options include \code{data.table::fread()} and 31 | \code{readr::read_csv()}. 
Note the first argument of the reader should be 32 | the file to read in and the second should be the 33 | \code{header}/\code{col_names} argument. This would require writing a small 34 | wrapper for \code{fread()}.} 35 | 36 | \item{header}{Is a header (line of column names) on the first line of the csv file?} 37 | 38 | \item{nskip}{Number of lines to skip. If \code{header=TRUE}, then this only applies to 39 | lines after the header.} 40 | 41 | \item{nmax}{Max number of lines to read. If nmax==0, then there is no read cap. Ignored 42 | if \code{method="exact"}.} 43 | 44 | \item{verbose}{Should linecounts of the input file and the number of lines sampled be 45 | printed?} 46 | 47 | \item{...}{Additional arguments passed to the csv reader.} 48 | } 49 | \value{ 50 | A dataframe, as with \code{read.csv()}. 51 | } 52 | \description{ 53 | The function will read (as csv) approximately param*nlines lines. So 54 | if \code{param=.1}, then we will get roughly (probably not exactly) 10\% of the 55 | data. This is the analogue of the base R function \code{read.csv()}. 56 | } 57 | \details{ 58 | This function scans over the text of the input file and at each step, 59 | randomly chooses whether or not to include the current line in a 60 | downsampled file. Each selected line is placed in a temporary file, before 61 | being read into R via \code{read.csv()}. Additional arguments to this 62 | function (those other than \code{file}, \code{param}, and \code{verbose}) are 63 | passed to \code{read.csv()}, and so if their behavior is unclear, you should 64 | examine the \code{read.csv()} help file. 65 | 66 | If \code{verbose=TRUE}, then something like: 67 | 68 | \code{Read 12207 lines (0.001\%) of 12174948 line file.} 69 | 70 | will be printed to the terminal. This counts the header (if there is one) as 71 | one of the lines read and as one of the lines possible.
72 | } 73 | \examples{ 74 | library(filesampler) 75 | file = system.file("rawdata/small.csv", package="filesampler") 76 | 77 | # Read in a 5\% random subsample of the rows. 78 | data = sample_csv(file, param=.05) 79 | 80 | # Read in 10 randomly sampled rows. 81 | data = sample_csv(file, param=10, method="exact") 82 | 83 | } 84 | -------------------------------------------------------------------------------- /man/sample_lines.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/sample_lines.r 3 | \name{sample_lines} 4 | \alias{sample_lines} 5 | \title{Read Sample Lines of Text File} 6 | \usage{ 7 | sample_lines(file, n = -1L, p = 0.1, nskip = 0, nmax = 0, verbose = FALSE, ...) 8 | } 9 | \arguments{ 10 | \item{file}{Location of the file (as a string) to be subsampled.} 11 | 12 | \item{n}{As in \code{readLines()}.} 13 | 14 | \item{p}{Proportion to retain; should be a numeric value between 0 and 1.} 15 | 16 | \item{nskip}{Number of lines to skip.} 17 | 18 | \item{nmax}{Max number of lines to read. If nmax==0, then there is no read cap.} 19 | 20 | \item{verbose}{Logical; indicates whether or not linecounts of the input file and the number 21 | of lines sampled should be printed.} 22 | 23 | \item{...}{Additional arguments passed to \code{readLines()}.} 24 | } 25 | \value{ 26 | A character vector, as with \code{readLines()}. 27 | } 28 | \description{ 29 | The function will read approximately p*nlines lines of a flat text 30 | file. So if \code{p=.1}, then we will get roughly (probably not exactly) 31 | 10\% of the data. This is the analogue of the base R function 32 | \code{readLines()}. 33 | } 34 | \details{ 35 | This function scans over the text of the input file and at each step, randomly 36 | chooses whether or not to include the current line in a downsampled file.
37 | Each selected line is placed in a temporary file, before being read into R 38 | via \code{readLines()}. Additional arguments to this function (those other 39 | than \code{file}, \code{p}, and \code{verbose}) are passed to \code{readLines()}, 40 | and so if their behavior is unclear, you should examine the \code{readLines()} 41 | help file. 42 | 43 | If \code{verbose=TRUE}, then something like: 44 | 45 | \code{Read 12207 lines (0.001\%) of 12174948 line file.} 46 | 47 | will be printed to the terminal. This counts the header (if there is one) 48 | as one of the lines read and as one of the lines possible. 49 | } 50 | \examples{ 51 | library(filesampler) 52 | file = system.file("rawdata/small.csv", package="filesampler") 53 | sample_lines(file, p=.05) 54 | 55 | } 56 | -------------------------------------------------------------------------------- /man/wc.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/wc.r 3 | \name{wc} 4 | \alias{wc} 5 | \alias{wc_w} 6 | \alias{wc_l} 7 | \title{Count Letters, Words, and Lines of a File} 8 | \usage{ 9 | wc(file, chars = TRUE, words = TRUE, lines = TRUE) 10 | 11 | wc_w(file) 12 | 13 | wc_l(file) 14 | } 15 | \arguments{ 16 | \item{file}{Location of the file (as a string) from which the counts will be generated.} 17 | 18 | \item{chars, words, lines}{Should char/word/line counts be shown? At least one of the three must be 19 | \code{TRUE}.} 20 | } 21 | \value{ 22 | A list containing the requested counts. 23 | } 24 | \description{ 25 | See title. 26 | } 27 | \details{ 28 | \code{wc_l()} is a shorthand for counting only lines, similar to \code{wc -l} 29 | in the terminal. Likewise \code{wc_w()} is analogous to \code{wc -w} for 30 | words. 
31 | } 32 | \examples{ 33 | library(filesampler) 34 | file = system.file("rawdata/small.csv", package="filesampler") 35 | data = wc(file=file) 36 | 37 | } 38 | -------------------------------------------------------------------------------- /src/Makevars.in: -------------------------------------------------------------------------------- 1 | PKG_CPPFLAGS = @AVX2_FLAGS@ 2 | PKG_CFLAGS = @OMP_FLAGS@ 3 | PKG_LIBS = @OMP_FLAGS@ 4 | 5 | FS_OBJECTS = filesampler/file_sampler.o filesampler/wc.o 6 | R_OBJECTS = filesampler_native.o samplers.o wc.o 7 | OBJECTS = $(FS_OBJECTS) $(R_OBJECTS) 8 | 9 | all: $(SHLIB) 10 | 11 | $(SHLIB): $(OBJECTS) 12 | -------------------------------------------------------------------------------- /src/Rfilesampler.h: -------------------------------------------------------------------------------- 1 | #ifndef R_FILESAMPLER_H_ 2 | #define R_FILESAMPLER_H_ 3 | 4 | 5 | #include 6 | #include 7 | 8 | #define CHARPT(x,i) ((char*)CHAR(STRING_ELT(x,i))) 9 | #define INT(x) INTEGER(x)[0] 10 | #define DBL(x) REAL(x)[0] 11 | 12 | 13 | #endif 14 | -------------------------------------------------------------------------------- /src/filesampler/Makefile: -------------------------------------------------------------------------------- 1 | CC = gcc 2 | CFLAGS = -fopenmp -O3 -std=c99 -Wall -Wno-unused-function 3 | 4 | OBJECTS = file_sampler.o wc.o 5 | 6 | all: shlib 7 | 8 | shlib: clean 9 | $(CC) $(CFLAGS) -shared -o libfilesampler.so $(OBJECTS) 10 | 11 | clean: 12 | rm -f ./libfilesampler.so 13 | -------------------------------------------------------------------------------- /src/filesampler/check_avx.h: -------------------------------------------------------------------------------- 1 | #ifndef FILESAMPLER_CHECK_AVX_H 2 | #define FILESAMPLER_CHECK_AVX_H 3 | 4 | 5 | #if defined(__INTEL_COMPILER) && (__INTEL_COMPILER >= 1300) 6 | #include 7 | #define CHECK_FOR_AVX2 _may_i_use_cpu_feature(_FEATURE_AVX2) 8 | #elif (defined(__GNUC__) || defined(__clang__)) 9 | 
#define CHECK_FOR_AVX2 __builtin_cpu_supports("avx2") 10 | #else 11 | #define CHECK_FOR_AVX2 0; 12 | #endif 13 | 14 | 15 | static inline int has_avx2() 16 | { 17 | static int check = -1; 18 | if (check < 0 ) 19 | check = CHECK_FOR_AVX2; 20 | 21 | return (check > 0); 22 | } 23 | 24 | 25 | #endif 26 | -------------------------------------------------------------------------------- /src/filesampler/error.h: -------------------------------------------------------------------------------- 1 | /* Copyright (c) 2015-2016, Schmidt 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer. 9 | 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 15 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 16 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 17 | PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR 18 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 19 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 20 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 21 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 22 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 23 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
25 | */ 26 | 27 | 28 | #ifndef FILESAMPLER_ERROR_H_ 29 | #define FILESAMPLER_ERROR_H_ 30 | 31 | 32 | #include "utils.h" 33 | 34 | // Parameter checks 35 | #define INVALID_PROB -1 36 | #define INVALID_NSKIP -2 37 | 38 | #define INVALID_PROB_MSG "Invalid `p` specified. Must be between 0 and 1." 39 | #define INVALID_NSKIP_MSG "Invalid `nskip` specified; " 40 | 41 | 42 | // Things that went wrong 43 | #define READ_FAIL -3 44 | #define WRITE_FAIL -4 45 | #define MALLOC_FAIL -5 46 | #define USER_INTERRUPT -6 47 | 48 | #define READ_FAIL_MSG "Could not read infile; perhaps it doesn't exist?" 49 | #define WRITE_FAIL_MSG "Could not generate tempfile for writing for some reason?" 50 | #define MALLOC_FAIL_MSG "Out of memory." 51 | #define USER_INTERRUPT_MSG "Process killed by user interrupt." 52 | 53 | 54 | static inline void fs_checkret(const int ret) 55 | { 56 | switch (ret) 57 | { 58 | case 0: 59 | break; 60 | case INVALID_PROB: 61 | fs_error_fun(ret, INVALID_PROB_MSG); 62 | break; 63 | case INVALID_NSKIP: 64 | fs_error_fun(ret, INVALID_NSKIP_MSG); 65 | break; 66 | case READ_FAIL: 67 | fs_error_fun(ret, READ_FAIL_MSG); 68 | break; 69 | case WRITE_FAIL: 70 | fs_error_fun(ret, WRITE_FAIL_MSG); 71 | break; 72 | case MALLOC_FAIL: 73 | fs_error_fun(ret, MALLOC_FAIL_MSG); 74 | break; 75 | case USER_INTERRUPT: 76 | fs_error_fun(ret, USER_INTERRUPT_MSG); 77 | break; 78 | default: 79 | fs_error_fun(ret, "Unknown error code; please report this to the developers."); 80 | } 81 | 82 | return; 83 | } 84 | 85 | 86 | #endif 87 | -------------------------------------------------------------------------------- /src/filesampler/file_sampler.c: -------------------------------------------------------------------------------- 1 | /* Copyright (c) 2015-2016, Drew Schmidt 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. 
Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer. 9 | 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 15 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 16 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 17 | PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR 18 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 19 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 20 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 21 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 22 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 23 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
25 | */ 26 | 27 | 28 | #include 29 | #include 30 | #include 31 | 32 | #include "filesampler.h" 33 | #include "safeomp.h" 34 | #include "utils.h" 35 | 36 | 37 | #define HAS_NEWLINE ((readlen > 0) && (buf[readlen-1] == '\n')) 38 | 39 | static inline void read_header(char *buf, FILE *fp_read, FILE *fp_write, uint64_t *nlines_in, uint64_t *nlines_out) 40 | { 41 | size_t readlen; 42 | 43 | while (fgets(buf, BUFLEN, fp_read) != NULL) 44 | { 45 | fprintf(fp_write, "%s", buf); 46 | 47 | readlen = strnlen(buf, BUFLEN); 48 | 49 | if (HAS_NEWLINE) 50 | { 51 | (*nlines_in)++; 52 | (*nlines_out)++; 53 | break; 54 | } 55 | } 56 | } 57 | 58 | 59 | 60 | /** 61 | * @file 62 | * @brief 63 | * File Sampler 64 | * 65 | * @details 66 | * This function takes an input file and randomly subsamples it at 67 | * the given proportion p line-by-line, with the randomly chosen lines 68 | * being placed in the given output file. 69 | * 70 | * If the file has many lines, then the size of the output file 71 | * should be roughly p*nlines(input). If you want to specify 72 | * an exact number of lines, see file_sampler_exact(). 73 | * 74 | * @param verbose 75 | * Input. Indicates whether character/word/line counts of the input 76 | * file (discovered while filling the output) should be printed. 77 | * @param header 78 | * Input. Indicates whether or not there is a header line (as in a 79 | * csv). 80 | * @param nskip 81 | * Input. Number of lines to skip. If header=true and nskip>0, then 82 | * the number of lines skipped applies to post-header lines only. 83 | * @param nmax 84 | * Input. Max number of lines to read. If nmax==0 then there is no max. 85 | * @param p 86 | * Input. Proportion of lines from input file to (randomly) retain. 87 | * The proportion retained is not guaranteed to be exactly p, but 88 | * will be very close for large files. 89 | * @param input 90 | * Input. Absolute path to input file. 91 | * @param output 92 | * Input. Absolute path to output file. 
93 | * 94 | * @note 95 | * Due to R's RNG, this call (as written) is very un-threadsafe. 96 | * 97 | * @return 98 | * The return value indicates the status of the function. 99 | */ 100 | int fs_sample_prop(const bool verbose, const bool header, uint32_t nskip, uint32_t nmax, const double p, const char *input, const char *output) 101 | { 102 | int ret = 0; 103 | FILE *fp_read, *fp_write; 104 | char *buf; 105 | size_t readlen; 106 | // Have to track cases where buffer is too small for fgets() 107 | bool should_write = false; 108 | bool singleread = true; 109 | bool checkmax = nmax ? true : false; 110 | uint64_t nlines_in = 0, nlines_out = 0; 111 | 112 | if (p < 0. || p > 1.) 113 | return INVALID_PROB; 114 | else if (p == 0.) 115 | { 116 | if (verbose) 117 | PRINTFUN("Read 0 lines of unknown length file (p == 0).\n"); 118 | 119 | return 0; 120 | } 121 | 122 | 123 | fp_read = fopen(input, "r"); 124 | if (!fp_read) 125 | return READ_FAIL; 126 | 127 | fp_write = fopen(output, "w"); 128 | if (!fp_write) 129 | { 130 | fclose(fp_read); 131 | return WRITE_FAIL; 132 | } 133 | 134 | buf = malloc(BUFLEN * sizeof(char)); 135 | if (buf == NULL) 136 | { 137 | fclose(fp_read); 138 | fclose(fp_write); 139 | return MALLOC_FAIL; 140 | } 141 | 142 | 143 | if (header) 144 | read_header(buf, fp_read, fp_write, &nlines_in, &nlines_out); 145 | 146 | 147 | STARTRNG; 148 | 149 | while (fgets(buf, BUFLEN, fp_read) != NULL) 150 | { 151 | if ((nlines_out % INTERRUPT_CHECK_NUM == 0) && check_interrupt()) 152 | { 153 | ret = USER_INTERRUPT; 154 | goto cleanup; 155 | } 156 | 157 | if (singleread) 158 | { 159 | if (RUNIF < p) 160 | should_write = true; 161 | else 162 | should_write = false; 163 | } 164 | 165 | if (nskip) 166 | nskip--; 167 | else if (should_write) 168 | { 169 | nlines_out++; 170 | fprintf(fp_write, "%s", buf); 171 | 172 | if (checkmax) 173 | { 174 | nmax--; 175 | if (!nmax) break; 176 | } 177 | } 178 | 179 | readlen = strnlen(buf, BUFLEN); 180 | 181 | // check if multiple reads for 
this one line are needed (i.e., line longer than buffer) 182 | if (HAS_NEWLINE) 183 | { 184 | nlines_in++; 185 | should_write = false; 186 | singleread = true; 187 | } 188 | else 189 | singleread = false; 190 | } 191 | 192 | if (verbose) 193 | { 194 | if (checkmax && !nmax) 195 | PRINTFUN("Read nmax=%llu lines of unknown length file.\n", (unsigned long long) (nlines_out - header)); 196 | else 197 | PRINTFUN("Read %llu lines (%.5f%%) of %llu line file.\n", nlines_out, (double) nlines_out/nlines_in, nlines_in); 198 | } 199 | 200 | 201 | cleanup: 202 | ENDRNG; 203 | fclose(fp_read); 204 | fclose(fp_write); 205 | free(buf); 206 | 207 | return ret; 208 | } 209 | 210 | 211 | 212 | // ------------------------------------------------------ 213 | // exact reader 214 | // ------------------------------------------------------ 215 | 216 | static inline int unif_rand_int(const int low, const int high) 217 | { 218 | return (int) (low + (high + 1 - low)*RUNIF); 219 | } 220 | 221 | 222 | 223 | static int res_sampler(const uint32_t nskip, const uint64_t nlines_in, const uint32_t nlines_out, uint64_t **samp) 224 | { 225 | *samp = malloc(nlines_out * sizeof(**samp)); 226 | if (*samp == NULL) 227 | return MALLOC_FAIL; 228 | 229 | SAFE_FOR_SIMD 230 | for (uint32_t i=0; i0, then 276 | * the number of lines skipped applies to post-header lines only. 277 | * @param nlines_in 278 | * Input. The number of lines of the input file. See wc(). 279 | * @param nlines_out 280 | * Input. The precise number of lines of input to (randomly) 281 | * retain. 282 | * @param input 283 | * Input. Absolute path to input file. 284 | * @param output 285 | * Input. Absolute path to output file. 286 | * 287 | * @note 288 | * Due to R's RNG, this call (as written) is very un-threadsafe. 289 | * 290 | * @return 291 | * The return value indicates the status of the function.
292 | */ 293 | int fs_sample_exact(const bool verbose, const bool header, const uint32_t nskip, uint64_t nlines_out, const char *input, const char *output) 294 | { 295 | int ret; 296 | FILE *fp_read, *fp_write; 297 | char *buf; 298 | size_t readlen; 299 | // Have to track cases where buffer is too small for fgets() 300 | bool should_write = false; 301 | bool singleread = true; 302 | uint64_t *samp; 303 | uint64_t nlines_in; 304 | uint64_t current_line = 0; 305 | uint64_t lines_read = 0; 306 | 307 | 308 | ret = fs_wc(input, false, NULL, false, NULL, true, &nlines_in); 309 | if (ret) 310 | return ret; 311 | 312 | 313 | if (nskip > nlines_in) 314 | return INVALID_NSKIP; 315 | 316 | fp_read = fopen(input, "r"); 317 | if (!fp_read) 318 | return READ_FAIL; 319 | 320 | fp_write = fopen(output, "w"); 321 | if (!fp_write) 322 | { 323 | fclose(fp_read); 324 | return WRITE_FAIL; 325 | } 326 | 327 | buf = malloc(BUFLEN * sizeof(char)); 328 | if (buf == NULL) 329 | { 330 | fclose(fp_read); 331 | fclose(fp_write); 332 | return MALLOC_FAIL; 333 | } 334 | 335 | 336 | if (header) 337 | read_header(buf, fp_read, fp_write, &nlines_in, &nlines_out); 338 | 339 | ret = res_sampler(nskip, nlines_in, nlines_out, &samp); 340 | if (ret) 341 | goto cleanup; 342 | 343 | qsort(samp, nlines_out, sizeof(uint64_t), comp); 344 | 345 | STARTRNG; 346 | 347 | 348 | while (fgets(buf, BUFLEN, fp_read) != NULL) 349 | { 350 | if (lines_read == nlines_out) 351 | break; 352 | 353 | if ((lines_read % INTERRUPT_CHECK_NUM == 0) && check_interrupt()) 354 | { 355 | ret = USER_INTERRUPT; 356 | goto fullcleanup; 357 | } 358 | 359 | if (singleread) 360 | { 361 | if (samp[lines_read] == current_line) 362 | should_write = true; 363 | else 364 | should_write = false; 365 | } 366 | 367 | if (should_write) 368 | { 369 | lines_read++; 370 | fprintf(fp_write, "%s", buf); 371 | } 372 | 373 | readlen = strnlen(buf, BUFLEN); 374 | 375 | if (HAS_NEWLINE) 376 | { 377 | current_line++; 378 | should_write = false; 379 | 
singleread = true; 380 | } 381 | else 382 | singleread = false; 383 | } 384 | 385 | 386 | if (verbose) 387 | { 388 | nlines_in -= header; 389 | PRINTFUN("Read %llu lines (%.5f%%) of %llu line file.\n", nlines_out, (double) nlines_out/nlines_in, nlines_in); 390 | } 391 | 392 | 393 | fullcleanup: 394 | ENDRNG; 395 | free(samp); 396 | 397 | cleanup: 398 | fclose(fp_read); 399 | fclose(fp_write); 400 | free(buf); 401 | 402 | return ret; 403 | } 404 | -------------------------------------------------------------------------------- /src/filesampler/filesampler.h: -------------------------------------------------------------------------------- 1 | /* Copyright (c) 2015-2016, Schmidt 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer. 9 | 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 15 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 16 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 17 | PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR 18 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 19 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 20 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 21 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 22 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 23 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 25 | */ 26 | 27 | 28 | #ifndef FILESAMPLER_H_ 29 | #define FILESAMPLER_H_ 30 | 31 | 32 | #include <stdbool.h> 33 | #include <stdint.h> 34 | 35 | #include "error.h" 36 | 37 | #define BUFLEN 8192 38 | #define INTERRUPT_CHECK_NUM 1024 39 | 40 | // file_sampler.c 41 | int fs_sample_prop(const bool verbose, const bool header, uint32_t nskip, uint32_t nmax, const double p, const char *input, const char *output); 42 | int fs_sample_exact(const bool verbose, const bool header, const uint32_t nskip, uint64_t nlines_out, const char *input, const char *output); 43 | 44 | // wc.c 45 | int fs_wc(const char *file, const bool chars, uint64_t *nchars, const bool words, uint64_t *nwords, const bool lines, uint64_t *nlines); 46 | 47 | 48 | #endif 49 | -------------------------------------------------------------------------------- /src/filesampler/rebalance.c: -------------------------------------------------------------------------------- 1 | /* Copyright (c) 2015-2018, Drew Schmidt 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer. 9 | 10 | 2.
Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 15 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 16 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 17 | PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR 18 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 19 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 20 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 21 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 22 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 23 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
25 | */ 26 | 27 | 28 | #include 29 | #include 30 | #include 31 | 32 | #include "filesampler.h" 33 | #include "utils.h" 34 | 35 | 36 | #define HEADER_ALL 0 37 | #define HEADER_FIRST 1 38 | #define HEADER_NONE 2 39 | 40 | 41 | int fs_rebalance(const uint32_t num_outfiles, const uint32_t num_infiles, const char **infiles, const int inheader, const int outheader, const char *outdir) 42 | { 43 | int ret = 0; 44 | FILE *fp_read, *fp_write; 45 | uint64_t total_lines; 46 | uint64_t outfile_lines; 47 | uint64_t rem; 48 | 49 | if (num_infiles == num_outfiles) 50 | return ret; 51 | 52 | 53 | buf = malloc(BUFLEN * sizeof(char)); 54 | if (buf == NULL) 55 | return MALLOC_FAIL; 56 | 57 | for (uint32_t i=0; i 36 | #if _OPENMP >= 201307 37 | #define OMP_VER_4 38 | #elif _OPENMP >= 200805 39 | #define OMP_VER_3 40 | #endif 41 | #endif 42 | 43 | 44 | // Insert SIMD pragma if supported 45 | #ifdef OMP_VER_4 46 | #define SAFE_SIMD _Pragma("omp simd") 47 | #define SAFE_FOR_SIMD _Pragma("omp for simd") 48 | #define SAFE_PARALLEL_FOR_SIMD _Pragma("omp parallel for simd") 49 | #else 50 | #define SAFE_SIMD 51 | #define SAFE_FOR_SIMD 52 | #define SAFE_PARALLEL_FOR_SIMD 53 | #endif 54 | 55 | 56 | #endif 57 | -------------------------------------------------------------------------------- /src/filesampler/utils.h: -------------------------------------------------------------------------------- 1 | // This file is free and unencumbered software released into the public domain. 2 | // You may modify it for any purpose with or without attribution. 3 | // See the Unlicense specification for full details http://unlicense.org/ 4 | 5 | // To make the filesampler library useable outside of R, you need to modify 6 | // every definition in this file appropriately, and delete the R header files. 7 | // Comments in this file are to help someone porting this to another (non-R) 8 | // system. 9 | 10 | // See utils_sample.h for an example. 
11 | 12 | 13 | #ifndef FILESAMPLER_UTILS_H_ 14 | #define FILESAMPLER_UTILS_H_ 15 | 16 | #define UNUSED(x) (void)(x) 17 | 18 | // ---------------------------------------------------------------------------- 19 | // RNG 20 | // ---------------------------------------------------------------------------- 21 | 22 | #include // delete me 23 | #include // delete me 24 | 25 | // rng initializer and ender; probably sufficient to set the second def to a blank 26 | #define STARTRNG GetRNGstate() 27 | #define ENDRNG PutRNGstate() 28 | 29 | // generate a single random uniform number 30 | #define RUNIF unif_rand() 31 | 32 | 33 | 34 | // ---------------------------------------------------------------------------- 35 | // Printing 36 | // ---------------------------------------------------------------------------- 37 | 38 | // replace with regular printf() (probably) 39 | #define PRINTFUN Rprintf 40 | 41 | 42 | 43 | // ---------------------------------------------------------------------------- 44 | // Interrupt checker 45 | // ---------------------------------------------------------------------------- 46 | 47 | // Write an appropriate check_interrupt() function (or just set to always return 48 | // false if you're lazy). 
49 | 50 | #include // delete me 51 | #include // delete me 52 | 53 | #include 54 | 55 | #define UNUSED(x) (void)(x) 56 | 57 | static void check_interrupt_fun(void *ignored) 58 | { 59 | UNUSED(ignored); 60 | R_CheckUserInterrupt(); 61 | } 62 | 63 | // returns `true` if user interrupts with C-c, and `false` if not 64 | static inline bool check_interrupt() 65 | { 66 | return (R_ToplevelExec(check_interrupt_fun, NULL) == FALSE); 67 | } 68 | 69 | 70 | 71 | // ---------------------------------------------------------------------------- 72 | // Error handler 73 | // ---------------------------------------------------------------------------- 74 | 75 | // Raise an R error with the given message (err_num is unused here) 76 | static inline void fs_error_fun(const int err_num, const char *err_msg) 77 | { 78 | UNUSED(err_num); 79 | error("%s", err_msg); 80 | } 81 | 82 | 83 | #endif 84 | -------------------------------------------------------------------------------- /src/filesampler/utils_sample.h: -------------------------------------------------------------------------------- 1 | // This file is free and unencumbered software released into the public domain. 2 | // You may modify it for any purpose with or without attribution.
3 | // See the Unlicense specification for full details http://unlicense.org/ 4 | 5 | // A sample, modified version of utils.h 6 | 7 | #ifndef FILESAMPLER_UTILS_H_ 8 | #define FILESAMPLER_UTILS_H_ 9 | 10 | // ---------------------------------------------------------------------------- 11 | // RNG 12 | // ---------------------------------------------------------------------------- 13 | 14 | #include <stdlib.h> 15 | 16 | #define STARTRNG 17 | #define ENDRNG 18 | 19 | #define RUNIF ((double) rand() / ((double) RAND_MAX + 1.0)) 20 | 21 | 22 | 23 | // ---------------------------------------------------------------------------- 24 | // Printing 25 | // ---------------------------------------------------------------------------- 26 | 27 | #define PRINTFUN printf 28 | 29 | 30 | 31 | // ---------------------------------------------------------------------------- 32 | // Interrupt checker 33 | // ---------------------------------------------------------------------------- 34 | 35 | static inline bool check_interrupt() 36 | { 37 | return false; 38 | } 39 | 40 | 41 | 42 | // ---------------------------------------------------------------------------- 43 | // Error handler 44 | // ---------------------------------------------------------------------------- 45 | 46 | #include <stdio.h> 47 | 48 | static inline void fs_error_fun(const int err_num, const char *err_msg) 49 | { 50 | fputs(err_msg, stderr); 51 | exit(err_num); 52 | } 53 | 54 | 55 | #endif 56 | -------------------------------------------------------------------------------- /src/filesampler/wc.c: -------------------------------------------------------------------------------- 1 | /* Copyright (c) 2015-2017, Drew Schmidt and Daniel Lemire 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer.
9 | 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 15 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 16 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 17 | PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR 18 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 19 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 20 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 21 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 22 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 23 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
25 | */ 26 | 27 | 28 | #include <ctype.h> // isspace() 29 | 30 | #include "check_avx.h" 31 | #include "filesampler.h" 32 | #include "safeomp.h" 33 | #include "utils.h" 34 | 35 | 36 | #ifdef __AVX2__ 37 | // we have AVX2 support 38 | #ifndef _MSC_VER 39 | /* Non-Microsoft C/C++-compatible compiler */ 40 | #include <x86intrin.h> // on some recent GCC, this will declare posix_memalign 41 | #else 42 | /* Microsoft C/C++-compatible compiler */ 43 | #include <intrin.h> 44 | #endif 45 | 46 | // http://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/ 47 | static inline size_t linefeedcount_avx2(char *const restrict buffer, const size_t size) 48 | { 49 | size_t answer = 0; 50 | __m256i cnt = _mm256_setzero_si256(); 51 | __m256i newline = _mm256_set1_epi8('\n'); 52 | size_t i = 0; 53 | uint8_t tmpbuffer[sizeof(__m256i)]; 54 | 55 | while (i + 32 <= size) 56 | { 57 | size_t remaining = size - i; 58 | size_t howmanytimes = remaining / 32; 59 | 60 | if (howmanytimes > 256) 61 | howmanytimes = 256; 62 | 63 | const __m256i *buf = (const __m256i*) (buffer + i); 64 | size_t j = 0; 65 | 66 | for (; j + 3 < howmanytimes; j+= 4) 67 | { 68 | __m256i newdata1 = _mm256_lddqu_si256(buf + j); 69 | __m256i newdata2 = _mm256_lddqu_si256(buf + j + 1); 70 | __m256i newdata3 = _mm256_lddqu_si256(buf + j + 2); 71 | __m256i newdata4 = _mm256_lddqu_si256(buf + j + 3); 72 | __m256i cmp1 = _mm256_cmpeq_epi8(newline, newdata1); 73 | __m256i cmp2 = _mm256_cmpeq_epi8(newline, newdata2); 74 | __m256i cmp3 = _mm256_cmpeq_epi8(newline, newdata3); 75 | __m256i cmp4 = _mm256_cmpeq_epi8(newline, newdata4); 76 | __m256i cnt1 = _mm256_add_epi8(cmp1, cmp2); 77 | __m256i cnt2 = _mm256_add_epi8(cmp3, cmp4); 78 | cnt = _mm256_add_epi8(cnt, cnt1); 79 | cnt = _mm256_add_epi8(cnt, cnt2); 80 | } 81 | 82 | for (; j < howmanytimes; j++) 83 | { 84 | __m256i newdata = _mm256_lddqu_si256(buf + j); 85 | __m256i cmp = _mm256_cmpeq_epi8(newline, newdata); 86 | cnt = _mm256_add_epi8(cnt, cmp); 87 | } 88 | 89 | i += howmanytimes * 32; 90 | cnt =
_mm256_subs_epi8(_mm256_setzero_si256(), cnt); 91 | _mm256_storeu_si256((__m256i *) tmpbuffer, cnt); 92 | 93 | for (unsigned int k = 0; k < sizeof(__m256i); ++k) 94 | answer += tmpbuffer[k]; 95 | 96 | cnt = _mm256_setzero_si256(); 97 | } 98 | 99 | for (; i < size; i++) 100 | { 101 | if (buffer[i] == '\n') 102 | answer++; 103 | } 104 | 105 | return answer; 106 | } 107 | #endif 108 | 109 | 110 | 111 | static inline size_t linefeedcount_fallback(char *const restrict buffer, const size_t size) 112 | { 113 | uint64_t nl = 0; 114 | char *ptr = buffer; 115 | char *last = buffer + size; 116 | 117 | while ((ptr = memchr(ptr, '\n', last - ptr))) 118 | { 119 | ptr++; 120 | nl++; 121 | } 122 | 123 | return nl; 124 | } 125 | 126 | 127 | 128 | static inline size_t linefeedcount(char *const restrict buffer, const size_t size) 129 | { 130 | #ifdef __AVX2__ 131 | if (has_avx2()) 132 | return linefeedcount_avx2(buffer, size); 133 | else 134 | #endif 135 | return linefeedcount_fallback(buffer, size); 136 | } 137 | 138 | 139 | 140 | // ----------------------------------------------------------------------------- 141 | // wrappers 142 | // ----------------------------------------------------------------------------- 143 | 144 | static inline int wc_charsonly(FILE *restrict fp, char *restrict buf, uint64_t *restrict nchars) 145 | { 146 | size_t readlen = BUFLEN; 147 | uint64_t nc = 0; 148 | 149 | while (readlen == BUFLEN) 150 | { 151 | if (check_interrupt()) 152 | return USER_INTERRUPT; 153 | 154 | readlen = fread(buf, sizeof(char), BUFLEN, fp); 155 | nc += readlen; 156 | } 157 | 158 | *nchars = nc; 159 | 160 | return 0; 161 | } 162 | 163 | 164 | 165 | static inline int wc_linesonly(FILE *restrict fp, char *restrict buf, uint64_t *restrict nlines) 166 | { 167 | size_t readlen = BUFLEN; 168 | uint64_t nl = 0; 169 | 170 | while (readlen == BUFLEN) 171 | { 172 | if (check_interrupt()) 173 | return USER_INTERRUPT; 174 | 175 | readlen = fread(buf, sizeof(*buf), BUFLEN, fp); 176 | nl += 
linefeedcount(buf, readlen); 177 | } 178 | 179 | *nlines = nl; 180 | 181 | return 0; 182 | } 183 | 184 | 185 | 186 | static inline int wc_nowords(FILE *restrict fp, char *restrict buf, uint64_t *restrict nchars, uint64_t *restrict nlines) 187 | { 188 | size_t readlen = BUFLEN; 189 | uint64_t nc = 0; 190 | uint64_t nl = 0; 191 | 192 | while (readlen == BUFLEN) 193 | { 194 | if (check_interrupt()) 195 | return USER_INTERRUPT; 196 | 197 | readlen = fread(buf, sizeof(*buf), BUFLEN, fp); 198 | nl += linefeedcount(buf, readlen); 199 | nc += readlen; 200 | } 201 | 202 | *nchars = nc; 203 | *nlines = nl; 204 | 205 | return 0; 206 | } 207 | 208 | 209 | 210 | static inline int wc_nolines(FILE *restrict fp, char *restrict buf, uint64_t *restrict nchars, uint64_t *restrict nwords) 211 | { 212 | size_t readlen = BUFLEN; 213 | uint64_t nc = 0; 214 | uint64_t nw = 0; 215 | 216 | while (readlen == BUFLEN) 217 | { 218 | if (check_interrupt()) 219 | return USER_INTERRUPT; 220 | 221 | readlen = fread(buf, sizeof(char), BUFLEN, fp); 222 | 223 | SAFE_FOR_SIMD 224 | for (size_t i=0; i 4 | #include 5 | #include 6 | #include 7 | 8 | extern SEXP R_fs_sample_exact(SEXP verbose, SEXP header, SEXP nskip_, SEXP nlines_out_, SEXP input, SEXP output); 9 | extern SEXP R_fs_sample_prop(SEXP verbose, SEXP header, SEXP nskip_, SEXP nmax_, SEXP p, SEXP input, SEXP output); 10 | extern SEXP R_fs_wc(SEXP input, SEXP chars_, SEXP words_, SEXP lines_); 11 | 12 | static const R_CallMethodDef CallEntries[] = { 13 | {"R_fs_sample_exact", (DL_FUNC) &R_fs_sample_exact, 6}, 14 | {"R_fs_sample_prop", (DL_FUNC) &R_fs_sample_prop, 7}, 15 | {"R_fs_wc", (DL_FUNC) &R_fs_wc, 4}, 16 | {NULL, NULL, 0} 17 | }; 18 | void R_init_filesampler(DllInfo *dll) 19 | { 20 | R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); 21 | R_useDynamicSymbols(dll, FALSE); 22 | } 23 | -------------------------------------------------------------------------------- /src/samplers.c: 
-------------------------------------------------------------------------------- 1 | /* Copyright (c) 2015-2016, Drew Schmidt 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer. 9 | 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 15 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 16 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 17 | PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR 18 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 19 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 20 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 21 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 22 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 23 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
25 | */ 26 | 27 | 28 | #include "Rfilesampler.h" 29 | #include "filesampler/filesampler.h" 30 | 31 | 32 | SEXP R_fs_sample_prop(SEXP verbose, SEXP header, SEXP nskip_, SEXP nmax_, SEXP p, SEXP input, SEXP output) 33 | { 34 | int ret; 35 | 36 | const uint32_t nskip = (uint32_t) INT(nskip_); 37 | const uint32_t nmax = (uint32_t) INT(nmax_); 38 | 39 | ret = fs_sample_prop(INT(verbose), INT(header), nskip, nmax, DBL(p), CHARPT(input, 0), CHARPT(output, 0)); 40 | fs_checkret(ret); 41 | 42 | return R_NilValue; 43 | } 44 | 45 | 46 | 47 | SEXP R_fs_sample_exact(SEXP verbose, SEXP header, SEXP nskip_, SEXP nlines_out_, SEXP input, SEXP output) 48 | { 49 | int ret; 50 | 51 | const uint32_t nskip = (uint32_t) INT(nskip_); 52 | const uint32_t nlines_out = (uint32_t) INT(nlines_out_); 53 | 54 | ret = fs_sample_exact(INT(verbose), INT(header), nskip, nlines_out, CHARPT(input, 0), CHARPT(output, 0)); 55 | fs_checkret(ret); 56 | 57 | return R_NilValue; 58 | } 59 | -------------------------------------------------------------------------------- /src/wc.c: -------------------------------------------------------------------------------- 1 | /* Copyright (c) 2015-2016, Drew Schmidt 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer. 9 | 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 15 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 16 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 17 | PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR 18 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 19 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 20 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 21 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 22 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 23 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 25 | */ 26 | 27 | 28 | #include "Rfilesampler.h" 29 | #include "filesampler/filesampler.h" 30 | 31 | #define COUNTS(n) REAL(counts)[n] 32 | #define NCHARS 0 33 | #define NWORDS 1 34 | #define NLINES 2 35 | 36 | #define BADVAL -1.0 37 | 38 | 39 | SEXP R_fs_wc(SEXP input, SEXP chars_, SEXP words_, SEXP lines_) 40 | { 41 | SEXP counts; 42 | int ret; 43 | uint64_t nchars, nwords, nlines; 44 | 45 | const bool chars = INT(chars_); 46 | const bool words = INT(words_); 47 | const bool lines = INT(lines_); 48 | 49 | PROTECT(counts = allocVector(REALSXP, 3)); 50 | 51 | ret = fs_wc(CHARPT(input, 0), chars, &nchars, words, &nwords, lines, &nlines); 52 | fs_checkret(ret); 53 | 54 | COUNTS(NCHARS) = chars ? (double) nchars : BADVAL; 55 | COUNTS(NWORDS) = words ? (double) nwords : BADVAL; 56 | COUNTS(NLINES) = lines ? 
(double) nlines : BADVAL; 57 | 58 | UNPROTECT(1); 59 | return counts; 60 | } 61 | -------------------------------------------------------------------------------- /tests/exact.R: -------------------------------------------------------------------------------- 1 | library(filesampler) 2 | 3 | file <- system.file("rawdata/small.csv", package="filesampler") 4 | 5 | ### Argument checks 6 | 7 | # nlines 8 | badp <- "" 9 | badval <- tryCatch(sampled <- sample_csv(file, param=-1, method="exact"), error=capture.output) 10 | stopifnot(all.equal(badp, badval)) 11 | 12 | 13 | 14 | ### general functionality 15 | set.seed(1234) 16 | sampled <- sample_csv(file, param=5, method="exact") 17 | 18 | sampled_actual <- 19 | structure(list(A = c(13L, 28L, 83L, 88L, 87L), B = structure(c(3L, 20 | 2L, 4L, 4L, 1L), .Label = c("g", "n", "p", "q"), class = "factor"), 21 | C = structure(c(5L, 4L, 1L, 3L, 2L), .Label = c("H", "S", 22 | "V", "W", "Z"), class = "factor"), D = c(0.34508125577122, 23 | 0.573690160643309, 0.0123868996743113, 0.366345942718908, 24 | 0.260312086436898), E = c(-0.455623686756772, 1.48579320914896, 25 | 0.680072131251977, -0.114730600414802, -0.0554640013763592 26 | ), F = c(40.096492273733, 72.228856375441, 84.4729636912234, 27 | 92.6017211284488, 92.6275063771755)), .Names = c("A", "B", 28 | "C", "D", "E", "F"), class = "data.frame", row.names = c(NA, 29 | -5L)) 30 | 31 | 32 | stopifnot(all.equal(sampled, sampled_actual)) 33 | -------------------------------------------------------------------------------- /tests/prop.R: -------------------------------------------------------------------------------- 1 | library(filesampler) 2 | file = system.file("rawdata/small.csv", package="filesampler") 3 | 4 | 5 | ### Argument checks 6 | 7 | # p 8 | expect = "Argument 'p' must be between 0 and 1>" 9 | have = tryCatch(sampled <- sample_csv(file, param=-1), error=capture.output) 10 | stopifnot(grepl(expect, have)) 11 | have = tryCatch(sampled <- sample_csv(file, param=1.1), 
error=capture.output) 12 | stopifnot(grepl(expect, have)) 13 | 14 | # nmax 15 | set.seed(1234) 16 | sampled = sample_csv(file, param=.05, nmax=1) 17 | 18 | sampled_actual = 19 | structure(list(A = 30L, B = structure(1L, .Label = "m", class = "factor"), 20 | C = structure(1L, .Label = "U", class = "factor"), D = 0.663265228504315, 21 | E = -0.102609033060013, F = 46.2487461487763), .Names = c("A", 22 | "B", "C", "D", "E", "F"), class = "data.frame", row.names = c(NA, 23 | -1L)) 24 | 25 | stopifnot(all.equal(sampled, sampled_actual)) 26 | 27 | 28 | 29 | ### general functionality 30 | 31 | # read 32 | set.seed(1234) 33 | sampled = sample_csv(file, param=.05) 34 | 35 | sampled_actual = 36 | structure(list(A = c(30L, 92L, 6L, 49L, 77L, 54L, 67L), B = structure(c(2L, 37 | 1L, 3L, 2L, 4L, 1L, 5L), .Label = c("b", "m", "p", "r", "w"), class = "factor"), 38 | C = structure(c(4L, 1L, 1L, 3L, 2L, 2L, 3L), .Label = c("B", 39 | "G", "P", "U"), class = "factor"), D = c(0.663265228504315, 40 | 0.647247697459534, 0.145372492726892, 0.647244467865676, 41 | 0.248602577485144, 0.155096606584266, 0.231593899428844), 42 | E = c(-0.102609033060013, 1.20942736228284, 1.52611855611189, 43 | -0.59059429990917, 0.356037725132404, -0.933558833688986, 44 | 0.712171107833738), F = c(46.2487461487763, 34.8187624011189, 45 | 20.8286189078353, 91.2623915518634, 42.5667992956005, 17.3642539186403, 46 | 47.2880723653361)), .Names = c("A", "B", "C", "D", "E", "F" 47 | ), class = "data.frame", row.names = c(NA, -7L)) 48 | 49 | stopifnot(all.equal(sampled, sampled_actual)) 50 | 51 | # verbose 52 | verb = capture.output(invisible(sample_csv(file, param=.05, verbose=TRUE))) 53 | verb_actual = "Read 4 lines (0.03960%) of 101 line file." 
54 | stopifnot(all.equal(verb, verb_actual)) 55 | -------------------------------------------------------------------------------- /tests/wc.R: -------------------------------------------------------------------------------- 1 | library(filesampler) 2 | 3 | file <- system.file("rawdata/small.csv", package="filesampler") 4 | 5 | nchars = 6426 6 | nwords = 101 7 | nlines = 101 8 | 9 | truth = c(nchars, nwords, nlines) 10 | test = as.integer(wc(file)) 11 | stopifnot(all.equal(truth, test)) 12 | 13 | truth = c(-1, -1, nlines) 14 | test = as.integer(wc(file, FALSE, FALSE)) 15 | stopifnot(all.equal(truth, test)) 16 | 17 | truth = c(nchars, -1, nlines) 18 | test = as.integer(wc(file, TRUE, FALSE, TRUE)) 19 | stopifnot(all.equal(truth, test)) 20 | 21 | truth = c(nchars, nwords, -1) 22 | test = as.integer(wc(file, lines=FALSE)) 23 | stopifnot(all.equal(truth, test)) 24 | 25 | truth = c(nchars, -1, -1) 26 | test = as.integer(wc(file, words=FALSE, lines=FALSE)) 27 | stopifnot(all.equal(truth, test)) 28 | -------------------------------------------------------------------------------- /tools/ax_check_compile_flag.m4: -------------------------------------------------------------------------------- 1 | # =========================================================================== 2 | # https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html 3 | # =========================================================================== 4 | # 5 | # SYNOPSIS 6 | # 7 | # AX_CHECK_COMPILE_FLAG(FLAG, [ACTION-SUCCESS], [ACTION-FAILURE], [EXTRA-FLAGS], [INPUT]) 8 | # 9 | # DESCRIPTION 10 | # 11 | # Check whether the given FLAG works with the current language's compiler 12 | # or gives an error. (Warnings, however, are ignored) 13 | # 14 | # ACTION-SUCCESS/ACTION-FAILURE are shell commands to execute on 15 | # success/failure. 16 | # 17 | # If EXTRA-FLAGS is defined, it is added to the current language's default 18 | # flags (e.g. CFLAGS) when the check is done. 
The check is thus made with 19 | # the flags: "CFLAGS EXTRA-FLAGS FLAG". This can for example be used to 20 | # force the compiler to issue an error when a bad flag is given. 21 | # 22 | # INPUT gives an alternative input source to AC_COMPILE_IFELSE. 23 | # 24 | # NOTE: Implementation based on AX_CFLAGS_GCC_OPTION. Please keep this 25 | # macro in sync with AX_CHECK_{PREPROC,LINK}_FLAG. 26 | # 27 | # LICENSE 28 | # 29 | # Copyright (c) 2008 Guido U. Draheim 30 | # Copyright (c) 2011 Maarten Bosmans 31 | # 32 | # This program is free software: you can redistribute it and/or modify it 33 | # under the terms of the GNU General Public License as published by the 34 | # Free Software Foundation, either version 3 of the License, or (at your 35 | # option) any later version. 36 | # 37 | # This program is distributed in the hope that it will be useful, but 38 | # WITHOUT ANY WARRANTY; without even the implied warranty of 39 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General 40 | # Public License for more details. 41 | # 42 | # You should have received a copy of the GNU General Public License along 43 | # with this program. If not, see <https://www.gnu.org/licenses/>. 44 | # 45 | # As a special exception, the respective Autoconf Macro's copyright owner 46 | # gives unlimited permission to copy, distribute and modify the configure 47 | # scripts that are the output of Autoconf when processing the Macro. You 48 | # need not follow the terms of the GNU General Public License when using 49 | # or distributing such scripts, even though portions of the text of the 50 | # Macro appear in them. The GNU General Public License (GPL) does govern 51 | # all other use of the material that constitutes the Autoconf Macro. 52 | # 53 | # This special exception to the GPL applies to versions of the Autoconf 54 | # Macro released by the Autoconf Archive.
When you make and distribute a 55 | # modified version of the Autoconf Macro, you may extend this special 56 | # exception to the GPL to apply to your modified version as well. 57 | 58 | #serial 5 59 | 60 | AC_DEFUN([AX_CHECK_COMPILE_FLAG], 61 | [AC_PREREQ(2.64)dnl for _AC_LANG_PREFIX and AS_VAR_IF 62 | AS_VAR_PUSHDEF([CACHEVAR],[ax_cv_check_[]_AC_LANG_ABBREV[]flags_$4_$1])dnl 63 | AC_CACHE_CHECK([whether _AC_LANG compiler accepts $1], CACHEVAR, [ 64 | ax_check_save_flags=$[]_AC_LANG_PREFIX[]FLAGS 65 | _AC_LANG_PREFIX[]FLAGS="$[]_AC_LANG_PREFIX[]FLAGS $4 $1" 66 | AC_COMPILE_IFELSE([m4_default([$5],[AC_LANG_PROGRAM()])], 67 | [AS_VAR_SET(CACHEVAR,[yes])], 68 | [AS_VAR_SET(CACHEVAR,[no])]) 69 | _AC_LANG_PREFIX[]FLAGS=$ax_check_save_flags]) 70 | AS_VAR_IF(CACHEVAR,yes, 71 | [m4_default([$2], :)], 72 | [m4_default([$3], :)]) 73 | AS_VAR_POPDEF([CACHEVAR])dnl 74 | ])dnl AX_CHECK_COMPILE_FLAGS 75 | -------------------------------------------------------------------------------- /vignettes/IEEEtran.bst: -------------------------------------------------------------------------------- 1 | %% 2 | %% IEEEtran.bst 3 | %% BibTeX Bibliography Style file for IEEE Journals and Conferences (unsorted) 4 | %% Version 1.14 (2015/08/26) 5 | %% 6 | %% Copyright (c) 2003-2015 Michael Shell 7 | %% 8 | %% Original starting code base and algorithms obtained from the output of 9 | %% Patrick W. Daly's makebst package as well as from prior versions of 10 | %% IEEE BibTeX styles: 11 | %% 12 | %% 1. Howard Trickey and Oren Patashnik's ieeetr.bst (1985/1988) 13 | %% 2. Silvano Balemi and Richard H. Roy's IEEEbib.bst (1993) 14 | %% 15 | %% Support sites: 16 | %% http://www.michaelshell.org/tex/ieeetran/ 17 | %% http://www.ctan.org/pkg/ieeetran 18 | %% and/or 19 | %% http://www.ieee.org/ 20 | %% 21 | %% For use with BibTeX version 0.99a or later 22 | %% 23 | %% This is a numerical citation style. 
24 | %% 25 | %%************************************************************************* 26 | %% Legal Notice: 27 | %% This code is offered as-is without any warranty either expressed or 28 | %% implied; without even the implied warranty of MERCHANTABILITY or 29 | %% FITNESS FOR A PARTICULAR PURPOSE! 30 | %% User assumes all risk. 31 | %% In no event shall the IEEE or any contributor to this code be liable for 32 | %% any damages or losses, including, but not limited to, incidental, 33 | %% consequential, or any other damages, resulting from the use or misuse 34 | %% of any information contained here. 35 | %% 36 | %% All comments are the opinions of their respective authors and are not 37 | %% necessarily endorsed by the IEEE. 38 | %% 39 | %% This work is distributed under the LaTeX Project Public License (LPPL) 40 | %% ( http://www.latex-project.org/ ) version 1.3, and may be freely used, 41 | %% distributed and modified. A copy of the LPPL, version 1.3, is included 42 | %% in the base LaTeX documentation of all distributions of LaTeX released 43 | %% 2003/12/01 or later. 44 | %% Retain all contribution notices and credits. 45 | %% ** Modified files should be clearly indicated as such, including ** 46 | %% ** renaming them and changing author support contact information. ** 47 | %%************************************************************************* 48 | 49 | 50 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 51 | %% DEFAULTS FOR THE CONTROLS OF THE BST STYLE %% 52 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 53 | 54 | % These are the defaults for the user adjustable controls. The values used 55 | % here can be overridden by the user via IEEEtranBSTCTL entry type. 
56 | 57 | % NOTE: The recommended LaTeX command to invoke a control entry type is: 58 | % 59 | %\makeatletter 60 | %\def\bstctlcite{\@ifnextchar[{\@bstctlcite}{\@bstctlcite[@auxout]}} 61 | %\def\@bstctlcite[#1]#2{\@bsphack 62 | % \@for\@citeb:=#2\do{% 63 | % \edef\@citeb{\expandafter\@firstofone\@citeb}% 64 | % \if@filesw\immediate\write\csname #1\endcsname{\string\citation{\@citeb}}\fi}% 65 | % \@esphack} 66 | %\makeatother 67 | % 68 | % It is called at the start of the document, before the first \cite, like: 69 | % \bstctlcite{IEEEexample:BSTcontrol} 70 | % 71 | % IEEEtran.cls V1.6 and later does provide this command. 72 | 73 | 74 | 75 | % #0 turns off the display of the number for articles. 76 | % #1 enables 77 | FUNCTION {default.is.use.number.for.article} { #1 } 78 | 79 | 80 | % #0 turns off the display of the paper and type fields in @inproceedings. 81 | % #1 enables 82 | FUNCTION {default.is.use.paper} { #1 } 83 | 84 | 85 | % #0 turns off the display of urls 86 | % #1 enables 87 | FUNCTION {default.is.use.url} { #1 } 88 | 89 | 90 | % #0 turns off the forced use of "et al." 91 | % #1 enables 92 | FUNCTION {default.is.forced.et.al} { #0 } 93 | 94 | 95 | % The maximum number of names that can be present beyond which an "et al." 96 | % usage is forced. Be sure that num.names.shown.with.forced.et.al (below) 97 | % is not greater than this value! 98 | % Note: There are many instances of references in IEEE journals which have 99 | % a very large number of authors as well as instances in which "et al." is 100 | % used profusely. 101 | FUNCTION {default.max.num.names.before.forced.et.al} { #10 } 102 | 103 | 104 | % The number of names that will be shown with a forced "et al.". 105 | % Must be less than or equal to max.num.names.before.forced.et.al 106 | FUNCTION {default.num.names.shown.with.forced.et.al} { #1 } 107 | 108 | 109 | % #0 turns off the alternate interword spacing for entries with URLs. 
110 | % #1 enables 111 | FUNCTION {default.is.use.alt.interword.spacing} { #1 } 112 | 113 | 114 | % If alternate interword spacing for entries with URLs is enabled, this is 115 | % the interword spacing stretch factor that will be used. For example, the 116 | % default "4" here means that the interword spacing in entries with URLs can 117 | % stretch to four times normal. Does not have to be an integer. Note that 118 | % the value specified here can be overridden by the user in their LaTeX 119 | % code via a command such as: 120 | % "\providecommand\BIBentryALTinterwordstretchfactor{1.5}" in addition to 121 | % that via the IEEEtranBSTCTL entry type. 122 | FUNCTION {default.ALTinterwordstretchfactor} { "4" } 123 | 124 | 125 | % #0 turns off the "dashification" of repeated (i.e., identical to those 126 | % of the previous entry) names. The IEEE normally does this. 127 | % #1 enables 128 | FUNCTION {default.is.dash.repeated.names} { #1 } 129 | 130 | 131 | % The default name format control string. 132 | FUNCTION {default.name.format.string}{ "{f.~}{vv~}{ll}{, jj}" } 133 | 134 | 135 | % The default LaTeX font command for the names. 136 | FUNCTION {default.name.latex.cmd}{ "" } 137 | 138 | 139 | % The default URL prefix. 140 | FUNCTION {default.name.url.prefix}{ "[Online]. Available:" } 141 | 142 | 143 | % Other controls that cannot be accessed via IEEEtranBSTCTL entry type. 144 | 145 | % #0 turns off the terminal startup banner/completed message so as to 146 | % operate more quietly. 
147 | % #1 enables 148 | FUNCTION {is.print.banners.to.terminal} { #1 } 149 | 150 | 151 | 152 | 153 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%% 154 | %% FILE VERSION AND BANNER %% 155 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%% 156 | 157 | FUNCTION{bst.file.version} { "1.14" } 158 | FUNCTION{bst.file.date} { "2015/08/26" } 159 | FUNCTION{bst.file.website} { "http://www.michaelshell.org/tex/ieeetran/bibtex/" } 160 | 161 | FUNCTION {banner.message} 162 | { is.print.banners.to.terminal 163 | { "-- IEEEtran.bst version" " " * bst.file.version * 164 | " (" * bst.file.date * ") " * "by Michael Shell." * 165 | top$ 166 | "-- " bst.file.website * 167 | top$ 168 | "-- See the " quote$ * "IEEEtran_bst_HOWTO.pdf" * quote$ * " manual for usage information." * 169 | top$ 170 | } 171 | { skip$ } 172 | if$ 173 | } 174 | 175 | FUNCTION {completed.message} 176 | { is.print.banners.to.terminal 177 | { "" 178 | top$ 179 | "Done." 180 | top$ 181 | } 182 | { skip$ } 183 | if$ 184 | } 185 | 186 | 187 | 188 | 189 | %%%%%%%%%%%%%%%%%%%%%% 190 | %% STRING CONSTANTS %% 191 | %%%%%%%%%%%%%%%%%%%%%% 192 | 193 | FUNCTION {bbl.and}{ "and" } 194 | FUNCTION {bbl.etal}{ "et~al." } 195 | FUNCTION {bbl.editors}{ "eds." } 196 | FUNCTION {bbl.editor}{ "ed." } 197 | FUNCTION {bbl.edition}{ "ed." } 198 | FUNCTION {bbl.volume}{ "vol." } 199 | FUNCTION {bbl.of}{ "of" } 200 | FUNCTION {bbl.number}{ "no." } 201 | FUNCTION {bbl.in}{ "in" } 202 | FUNCTION {bbl.pages}{ "pp." } 203 | FUNCTION {bbl.page}{ "p." } 204 | FUNCTION {bbl.chapter}{ "ch." } 205 | FUNCTION {bbl.paper}{ "paper" } 206 | FUNCTION {bbl.part}{ "pt." } 207 | FUNCTION {bbl.patent}{ "Patent" } 208 | FUNCTION {bbl.patentUS}{ "U.S." } 209 | FUNCTION {bbl.revision}{ "Rev." } 210 | FUNCTION {bbl.series}{ "ser." } 211 | FUNCTION {bbl.standard}{ "Std." } 212 | FUNCTION {bbl.techrep}{ "Tech. Rep." } 213 | FUNCTION {bbl.mthesis}{ "Master's thesis" } 214 | FUNCTION {bbl.phdthesis}{ "Ph.D. 
dissertation" } 215 | FUNCTION {bbl.st}{ "st" } 216 | FUNCTION {bbl.nd}{ "nd" } 217 | FUNCTION {bbl.rd}{ "rd" } 218 | FUNCTION {bbl.th}{ "th" } 219 | 220 | 221 | % This is the LaTeX spacer that is used when a larger than normal space 222 | % is called for (such as just before the address:publisher). 223 | FUNCTION {large.space} { "\hskip 1em plus 0.5em minus 0.4em\relax " } 224 | 225 | % The LaTeX code for dashes that are used to represent repeated names. 226 | % Note: Some older IEEE journals used something like 227 | % "\rule{0.275in}{0.5pt}\," which is fairly thick and runs right along 228 | % the baseline. However, the IEEE now uses a thinner, above baseline, 229 | % six dash long sequence. 230 | FUNCTION {repeated.name.dashes} { "------" } 231 | 232 | 233 | 234 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 235 | %% PREDEFINED STRING MACROS %% 236 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 237 | 238 | MACRO {jan} {"Jan."} 239 | MACRO {feb} {"Feb."} 240 | MACRO {mar} {"Mar."} 241 | MACRO {apr} {"Apr."} 242 | MACRO {may} {"May"} 243 | MACRO {jun} {"Jun."} 244 | MACRO {jul} {"Jul."} 245 | MACRO {aug} {"Aug."} 246 | MACRO {sep} {"Sep."} 247 | MACRO {oct} {"Oct."} 248 | MACRO {nov} {"Nov."} 249 | MACRO {dec} {"Dec."} 250 | 251 | 252 | 253 | %%%%%%%%%%%%%%%%%% 254 | %% ENTRY FIELDS %% 255 | %%%%%%%%%%%%%%%%%% 256 | 257 | ENTRY 258 | { address 259 | assignee 260 | author 261 | booktitle 262 | chapter 263 | day 264 | dayfiled 265 | edition 266 | editor 267 | howpublished 268 | institution 269 | intype 270 | journal 271 | key 272 | language 273 | month 274 | monthfiled 275 | nationality 276 | note 277 | number 278 | organization 279 | pages 280 | paper 281 | publisher 282 | school 283 | series 284 | revision 285 | title 286 | type 287 | url 288 | volume 289 | year 290 | yearfiled 291 | CTLuse_article_number 292 | CTLuse_paper 293 | CTLuse_url 294 | CTLuse_forced_etal 295 | CTLmax_names_forced_etal 296 | CTLnames_show_etal 297 | CTLuse_alt_spacing 298 | CTLalt_stretch_factor 299 | 
CTLdash_repeated_names 300 | CTLname_format_string 301 | CTLname_latex_cmd 302 | CTLname_url_prefix 303 | } 304 | {} 305 | { label } 306 | 307 | 308 | 309 | 310 | %%%%%%%%%%%%%%%%%%%%%%% 311 | %% INTEGER VARIABLES %% 312 | %%%%%%%%%%%%%%%%%%%%%%% 313 | 314 | INTEGERS { prev.status.punct this.status.punct punct.std 315 | punct.no punct.comma punct.period 316 | prev.status.space this.status.space space.std 317 | space.no space.normal space.large 318 | prev.status.quote this.status.quote quote.std 319 | quote.no quote.close 320 | prev.status.nline this.status.nline nline.std 321 | nline.no nline.newblock 322 | status.cap cap.std 323 | cap.no cap.yes} 324 | 325 | INTEGERS { longest.label.width multiresult nameptr namesleft number.label numnames } 326 | 327 | INTEGERS { is.use.number.for.article 328 | is.use.paper 329 | is.use.url 330 | is.forced.et.al 331 | max.num.names.before.forced.et.al 332 | num.names.shown.with.forced.et.al 333 | is.use.alt.interword.spacing 334 | is.dash.repeated.names} 335 | 336 | 337 | %%%%%%%%%%%%%%%%%%%%%% 338 | %% STRING VARIABLES %% 339 | %%%%%%%%%%%%%%%%%%%%%% 340 | 341 | STRINGS { bibinfo 342 | longest.label 343 | oldname 344 | s 345 | t 346 | ALTinterwordstretchfactor 347 | name.format.string 348 | name.latex.cmd 349 | name.url.prefix} 350 | 351 | 352 | 353 | 354 | %%%%%%%%%%%%%%%%%%%%%%%%% 355 | %% LOW LEVEL FUNCTIONS %% 356 | %%%%%%%%%%%%%%%%%%%%%%%%% 357 | 358 | FUNCTION {initialize.controls} 359 | { default.is.use.number.for.article 'is.use.number.for.article := 360 | default.is.use.paper 'is.use.paper := 361 | default.is.use.url 'is.use.url := 362 | default.is.forced.et.al 'is.forced.et.al := 363 | default.max.num.names.before.forced.et.al 'max.num.names.before.forced.et.al := 364 | default.num.names.shown.with.forced.et.al 'num.names.shown.with.forced.et.al := 365 | default.is.use.alt.interword.spacing 'is.use.alt.interword.spacing := 366 | default.is.dash.repeated.names 'is.dash.repeated.names := 367 | 
default.ALTinterwordstretchfactor 'ALTinterwordstretchfactor := 368 | default.name.format.string 'name.format.string := 369 | default.name.latex.cmd 'name.latex.cmd := 370 | default.name.url.prefix 'name.url.prefix := 371 | } 372 | 373 | 374 | % This IEEEtran.bst features a very powerful and flexible mechanism for 375 | % controlling the capitalization, punctuation, spacing, quotation, and 376 | % newlines of the formatted entry fields. (Note: IEEEtran.bst does not need 377 | % or use the newline/newblock feature, but it has been implemented for 378 | % possible future use.) The output states of IEEEtran.bst consist of 379 | % multiple independent attributes and, as such, can be thought of as being 380 | % vectors, rather than the simple scalar values ("before.all", 381 | % "mid.sentence", etc.) used in most other .bst files. 382 | % 383 | % The more flexible and complex design used here was motivated in part by 384 | % the IEEE's rather unusual bibliography style. For example, the IEEE ends the 385 | % previous field item with a period and large space prior to the publisher 386 | % address; the @electronic entry types use periods as inter-item punctuation 387 | % rather than the commas used by the other entry types; and URLs are never 388 | % followed by periods even though they are the last item in the entry. 389 | % Although it is possible to accommodate these features with the conventional 390 | % output state system, the seemingly endless exceptions make for convoluted, 391 | % unreliable and difficult to maintain code. 
392 | % 393 | % IEEEtran.bst's output state system can be easily understood via a simple 394 | % illustration of two most recently formatted entry fields (on the stack): 395 | % 396 | % CURRENT_ITEM 397 | % "PREVIOUS_ITEM 398 | % 399 | % which, in this example, is to eventually appear in the bibliography as: 400 | % 401 | % "PREVIOUS_ITEM," CURRENT_ITEM 402 | % 403 | % It is the job of the output routine to take the previous item off of the 404 | % stack (while leaving the current item at the top of the stack), apply its 405 | % trailing punctuation (including closing quote marks) and spacing, and then 406 | % to write the result to BibTeX's output buffer: 407 | % 408 | % "PREVIOUS_ITEM," 409 | % 410 | % Punctuation (and spacing) between items is often determined by both of the 411 | % items rather than just the first one. The presence of quotation marks 412 | % further complicates the situation because, in standard English, trailing 413 | % punctuation marks are supposed to be contained within the quotes. 414 | % 415 | % IEEEtran.bst maintains two output state (aka "status") vectors which 416 | % correspond to the previous and current (aka "this") items. Each vector 417 | % consists of several independent attributes which track punctuation, 418 | % spacing, quotation, and newlines. Capitalization status is handled by a 419 | % separate scalar because the format routines, not the output routine, 420 | % handle capitalization and, therefore, there is no need to maintain the 421 | % capitalization attribute for both the "previous" and "this" items. 422 | % 423 | % When a format routine adds a new item, it copies the current output status 424 | % vector to the previous output status vector and (usually) resets the 425 | % current (this) output status vector to a "standard status" vector. 
Using a 426 | % "standard status" vector in this way allows us to redefine what we mean by 427 | % "standard status" at the start of each entry handler and reuse the same 428 | % format routines under the various inter-item separation schemes. For 429 | % example, the standard status vector for the @book entry type may use 430 | % commas for item separators, while the @electronic type may use periods, 431 | % yet both entry handlers exploit many of the exact same format routines. 432 | % 433 | % Because format routines have write access to the output status vector of 434 | % the previous item, they can override the punctuation choices of the 435 | % previous format routine! Therefore, it becomes trivial to implement rules 436 | % such as "Always use a period and a large space before the publisher." By 437 | % pushing the generation of the closing quote mark to the output routine, we 438 | % avoid all the problems caused by having to close a quote before having all 439 | % the information required to determine what the punctuation should be. 440 | % 441 | % The IEEEtran.bst output state system can easily be expanded if needed. 442 | % For instance, it is easy to add a "space.tie" attribute value if the 443 | % bibliography rules mandate that two items have to be joined with an 444 | % unbreakable space. 
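%
% A minimal sketch of that "space.tie" extension, for illustration only.
% Nothing below is part of IEEEtran.bst: the name "space.tie", the value
% #3 and the "~" tie are assumptions, and the lines are left commented
% out so that the behavior of this file is unchanged.
%
%   INTEGERS { space.tie }             % declare the new attribute value
%
%   #3 'space.tie :=                   % in initialize.status.constants
%
%   prev.status.space space.tie =      % in output.nonnull, before write$
%     { "~" * }
%     { skip$ }
%   if$
%
% A format routine could then select the tie for its own item with
% "space.tie 'this.status.space :=", or impose it on the preceding item
% with "space.tie 'prev.status.space :=".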
445 | 446 | FUNCTION {initialize.status.constants} 447 | { #0 'punct.no := 448 | #1 'punct.comma := 449 | #2 'punct.period := 450 | #0 'space.no := 451 | #1 'space.normal := 452 | #2 'space.large := 453 | #0 'quote.no := 454 | #1 'quote.close := 455 | #0 'cap.no := 456 | #1 'cap.yes := 457 | #0 'nline.no := 458 | #1 'nline.newblock := 459 | } 460 | 461 | FUNCTION {std.status.using.comma} 462 | { punct.comma 'punct.std := 463 | space.normal 'space.std := 464 | quote.no 'quote.std := 465 | nline.no 'nline.std := 466 | cap.no 'cap.std := 467 | } 468 | 469 | FUNCTION {std.status.using.period} 470 | { punct.period 'punct.std := 471 | space.normal 'space.std := 472 | quote.no 'quote.std := 473 | nline.no 'nline.std := 474 | cap.yes 'cap.std := 475 | } 476 | 477 | FUNCTION {initialize.prev.this.status} 478 | { punct.no 'prev.status.punct := 479 | space.no 'prev.status.space := 480 | quote.no 'prev.status.quote := 481 | nline.no 'prev.status.nline := 482 | punct.no 'this.status.punct := 483 | space.no 'this.status.space := 484 | quote.no 'this.status.quote := 485 | nline.no 'this.status.nline := 486 | cap.yes 'status.cap := 487 | } 488 | 489 | FUNCTION {this.status.std} 490 | { punct.std 'this.status.punct := 491 | space.std 'this.status.space := 492 | quote.std 'this.status.quote := 493 | nline.std 'this.status.nline := 494 | } 495 | 496 | FUNCTION {cap.status.std}{ cap.std 'status.cap := } 497 | 498 | FUNCTION {this.to.prev.status} 499 | { this.status.punct 'prev.status.punct := 500 | this.status.space 'prev.status.space := 501 | this.status.quote 'prev.status.quote := 502 | this.status.nline 'prev.status.nline := 503 | } 504 | 505 | 506 | FUNCTION {not} 507 | { { #0 } 508 | { #1 } 509 | if$ 510 | } 511 | 512 | FUNCTION {and} 513 | { { skip$ } 514 | { pop$ #0 } 515 | if$ 516 | } 517 | 518 | FUNCTION {or} 519 | { { pop$ #1 } 520 | { skip$ } 521 | if$ 522 | } 523 | 524 | 525 | % convert the strings "yes" or "no" to #1 or #0 respectively 526 | FUNCTION {yes.no.to.int} 527 | 
{ "l" change.case$ duplicate$ 528 | "yes" = 529 | { pop$ #1 } 530 | { duplicate$ "no" = 531 | { pop$ #0 } 532 | { "unknown boolean " quote$ * swap$ * quote$ * 533 | " in " * cite$ * warning$ 534 | #0 535 | } 536 | if$ 537 | } 538 | if$ 539 | } 540 | 541 | 542 | % pushes true if the single char string on the stack is in the 543 | % range of "0" to "9" 544 | FUNCTION {is.num} 545 | { chr.to.int$ 546 | duplicate$ "0" chr.to.int$ < not 547 | swap$ "9" chr.to.int$ > not and 548 | } 549 | 550 | % multiplies the integer on the stack by a factor of 10 551 | FUNCTION {bump.int.mag} 552 | { #0 'multiresult := 553 | { duplicate$ #0 > } 554 | { #1 - 555 | multiresult #10 + 556 | 'multiresult := 557 | } 558 | while$ 559 | pop$ 560 | multiresult 561 | } 562 | 563 | % converts a single character string on the stack to an integer 564 | FUNCTION {char.to.integer} 565 | { duplicate$ 566 | is.num 567 | { chr.to.int$ "0" chr.to.int$ - } 568 | {"noninteger character " quote$ * swap$ * quote$ * 569 | " in integer field of " * cite$ * warning$ 570 | #0 571 | } 572 | if$ 573 | } 574 | 575 | % converts a string on the stack to an integer 576 | FUNCTION {string.to.integer} 577 | { duplicate$ text.length$ 'namesleft := 578 | #1 'nameptr := 579 | #0 'numnames := 580 | { nameptr namesleft > not } 581 | { duplicate$ nameptr #1 substring$ 582 | char.to.integer numnames bump.int.mag + 583 | 'numnames := 584 | nameptr #1 + 585 | 'nameptr := 586 | } 587 | while$ 588 | pop$ 589 | numnames 590 | } 591 | 592 | 593 | 594 | 595 | % The output routines write out the *next* to the top (previous) item on the 596 | % stack, adding punctuation and such as needed. Since IEEEtran.bst maintains 597 | % the output status for the top two items on the stack, these output 598 | % routines have to consider the previous output status (which corresponds to 599 | % the item that is being output). Full independent control of punctuation, 600 | % closing quote marks, spacing, and newblock is provided. 
601 | % 602 | % "output.nonnull" does not check for the presence of a previous empty 603 | % item. 604 | % 605 | % "output" does check for the presence of a previous empty item and will 606 | % remove an empty item rather than outputting it. 607 | % 608 | % "output.warn" is like "output", but will issue a warning if it detects 609 | % an empty item. 610 | 611 | FUNCTION {output.nonnull} 612 | { swap$ 613 | prev.status.punct punct.comma = 614 | { "," * } 615 | { skip$ } 616 | if$ 617 | prev.status.punct punct.period = 618 | { add.period$ } 619 | { skip$ } 620 | if$ 621 | prev.status.quote quote.close = 622 | { "''" * } 623 | { skip$ } 624 | if$ 625 | prev.status.space space.normal = 626 | { " " * } 627 | { skip$ } 628 | if$ 629 | prev.status.space space.large = 630 | { large.space * } 631 | { skip$ } 632 | if$ 633 | write$ 634 | prev.status.nline nline.newblock = 635 | { newline$ "\newblock " write$ } 636 | { skip$ } 637 | if$ 638 | } 639 | 640 | FUNCTION {output} 641 | { duplicate$ empty$ 642 | 'pop$ 643 | 'output.nonnull 644 | if$ 645 | } 646 | 647 | FUNCTION {output.warn} 648 | { 't := 649 | duplicate$ empty$ 650 | { pop$ "empty " t * " in " * cite$ * warning$ } 651 | 'output.nonnull 652 | if$ 653 | } 654 | 655 | % "fin.entry" is the output routine that handles the last item of the entry 656 | % (which will be on the top of the stack when "fin.entry" is called). 657 | 658 | FUNCTION {fin.entry} 659 | { this.status.punct punct.no = 660 | { skip$ } 661 | { add.period$ } 662 | if$ 663 | this.status.quote quote.close = 664 | { "''" * } 665 | { skip$ } 666 | if$ 667 | write$ 668 | newline$ 669 | } 670 | 671 | 672 | FUNCTION {is.last.char.not.punct} 673 | { duplicate$ 674 | "}" * add.period$ 675 | #-1 #1 substring$ "."
= 676 | } 677 | 678 | FUNCTION {is.multiple.pages} 679 | { 't := 680 | #0 'multiresult := 681 | { multiresult not 682 | t empty$ not 683 | and 684 | } 685 | { t #1 #1 substring$ 686 | duplicate$ "-" = 687 | swap$ duplicate$ "," = 688 | swap$ "+" = 689 | or or 690 | { #1 'multiresult := } 691 | { t #2 global.max$ substring$ 't := } 692 | if$ 693 | } 694 | while$ 695 | multiresult 696 | } 697 | 698 | FUNCTION {capitalize}{ "u" change.case$ "t" change.case$ } 699 | 700 | FUNCTION {emphasize} 701 | { duplicate$ empty$ 702 | { pop$ "" } 703 | { "\emph{" swap$ * "}" * } 704 | if$ 705 | } 706 | 707 | FUNCTION {do.name.latex.cmd} 708 | { name.latex.cmd 709 | empty$ 710 | { skip$ } 711 | { name.latex.cmd "{" * swap$ * "}" * } 712 | if$ 713 | } 714 | 715 | % IEEEtran.bst uses its own \BIBforeignlanguage command which directly 716 | % invokes the TeX hyphenation patterns without the need of the Babel 717 | % package. Babel does a lot more than switch hyphenation patterns and 718 | % its loading can cause unintended effects in many class files (such as 719 | % IEEEtran.cls). 
720 | FUNCTION {select.language} 721 | { duplicate$ empty$ 'pop$ 722 | { language empty$ 'skip$ 723 | { "\BIBforeignlanguage{" language * "}{" * swap$ * "}" * } 724 | if$ 725 | } 726 | if$ 727 | } 728 | 729 | FUNCTION {tie.or.space.prefix} 730 | { duplicate$ text.length$ #3 < 731 | { "~" } 732 | { " " } 733 | if$ 734 | swap$ 735 | } 736 | 737 | FUNCTION {get.bbl.editor} 738 | { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ } 739 | 740 | FUNCTION {space.word}{ " " swap$ * " " * } 741 | 742 | 743 | % Field Conditioners, Converters, Checkers and External Interfaces 744 | 745 | FUNCTION {empty.field.to.null.string} 746 | { duplicate$ empty$ 747 | { pop$ "" } 748 | { skip$ } 749 | if$ 750 | } 751 | 752 | FUNCTION {either.or.check} 753 | { empty$ 754 | { pop$ } 755 | { "can't use both " swap$ * " fields in " * cite$ * warning$ } 756 | if$ 757 | } 758 | 759 | FUNCTION {empty.entry.warn} 760 | { author empty$ title empty$ howpublished empty$ 761 | month empty$ year empty$ note empty$ url empty$ 762 | and and and and and and 763 | { "all relevant fields are empty in " cite$ * warning$ } 764 | 'skip$ 765 | if$ 766 | } 767 | 768 | 769 | % The bibinfo system provides a way for the electronic parsing/acquisition 770 | % of a bibliography's contents as is done by ReVTeX. For example, a field 771 | % could be entered into the bibliography as: 772 | % \bibinfo{volume}{2} 773 | % Only the "2" would show up in the document, but the LaTeX \bibinfo command 774 | % could do additional things with the information. IEEEtran.bst does provide 775 | % a \bibinfo command via "\providecommand{\bibinfo}[2]{#2}". However, it is 776 | % currently not used as the bogus bibinfo functions defined here output the 777 | % entry values directly without the \bibinfo wrapper. The bibinfo functions 778 | % themselves (and the calls to them) are retained for possible future use. 
779 | % 780 | % bibinfo.check avoids acting on missing fields while bibinfo.warn will 781 | % issue a warning message if a missing field is detected. Prior to calling 782 | % the bibinfo functions, the user should push the field value and then its 783 | % name string, in that order. 784 | 785 | FUNCTION {bibinfo.check} 786 | { swap$ duplicate$ missing$ 787 | { pop$ pop$ "" } 788 | { duplicate$ empty$ 789 | { swap$ pop$ } 790 | { swap$ pop$ } 791 | if$ 792 | } 793 | if$ 794 | } 795 | 796 | FUNCTION {bibinfo.warn} 797 | { swap$ duplicate$ missing$ 798 | { swap$ "missing " swap$ * " in " * cite$ * warning$ pop$ "" } 799 | { duplicate$ empty$ 800 | { swap$ "empty " swap$ * " in " * cite$ * warning$ } 801 | { swap$ pop$ } 802 | if$ 803 | } 804 | if$ 805 | } 806 | 807 | 808 | % The IEEE separates large numbers with more than 4 digits into groups of 809 | % three. The IEEE uses a small space to separate these number groups. 810 | % Typical applications include patent and page numbers. 811 | 812 | % number of consecutive digits required to trigger the group separation. 813 | FUNCTION {large.number.trigger}{ #5 } 814 | 815 | % For numbers longer than the trigger, this is the blocksize of the groups. 816 | % The blocksize must be less than the trigger threshold, and 2 * blocksize 817 | % must be greater than the trigger threshold (can't do more than one 818 | % separation on the initial trigger). 819 | FUNCTION {large.number.blocksize}{ #3 } 820 | 821 | % What is actually inserted between the number groups. 822 | FUNCTION {large.number.separator}{ "\," } 823 | 824 | % So as to save on integer variables by reusing existing ones, numnames 825 | % holds the current number of consecutive digits read and nameptr holds 826 | % the number that will trigger an inserted space. 
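%
% Worked example, traced by hand through the function below with the
% default trigger (#5), blocksize (#3) and separator ("\,"); offered as an
% illustration rather than as tested output:
%   "1234"    -> "1234"          (fewer digits than the trigger: unchanged)
%   "12345"   -> "12\,345"
%   "1234567" -> "1\,234\,567"
% The string is scanned from right to left. Once five consecutive digits
% have been read, a separator is inserted ahead of the last three; from
% then on the groups are kept to three digits each. Any nondigit resets
% the digit count and restores the trigger.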
827 | FUNCTION {large.number.separate} 828 | { 't := 829 | "" 830 | #0 'numnames := 831 | large.number.trigger 'nameptr := 832 | { t empty$ not } 833 | { t #-1 #1 substring$ is.num 834 | { numnames #1 + 'numnames := } 835 | { #0 'numnames := 836 | large.number.trigger 'nameptr := 837 | } 838 | if$ 839 | t #-1 #1 substring$ swap$ * 840 | t #-2 global.max$ substring$ 't := 841 | numnames nameptr = 842 | { duplicate$ #1 nameptr large.number.blocksize - substring$ swap$ 843 | nameptr large.number.blocksize - #1 + global.max$ substring$ 844 | large.number.separator swap$ * * 845 | nameptr large.number.blocksize - 'numnames := 846 | large.number.blocksize #1 + 'nameptr := 847 | } 848 | { skip$ } 849 | if$ 850 | } 851 | while$ 852 | } 853 | 854 | % Converts all single dashes "-" to double dashes "--". 855 | FUNCTION {n.dashify} 856 | { large.number.separate 857 | 't := 858 | "" 859 | { t empty$ not } 860 | { t #1 #1 substring$ "-" = 861 | { t #1 #2 substring$ "--" = not 862 | { "--" * 863 | t #2 global.max$ substring$ 't := 864 | } 865 | { { t #1 #1 substring$ "-" = } 866 | { "-" * 867 | t #2 global.max$ substring$ 't := 868 | } 869 | while$ 870 | } 871 | if$ 872 | } 873 | { t #1 #1 substring$ * 874 | t #2 global.max$ substring$ 't := 875 | } 876 | if$ 877 | } 878 | while$ 879 | } 880 | 881 | 882 | % This function detects entries with names that are identical to that of 883 | % the previous entry and replaces the repeated names with dashes (if the 884 | % "is.dash.repeated.names" user control is nonzero). 885 | FUNCTION {name.or.dash} 886 | { 's := 887 | oldname empty$ 888 | { s 'oldname := s } 889 | { s oldname = 890 | { is.dash.repeated.names 891 | { repeated.name.dashes } 892 | { s 'oldname := s } 893 | if$ 894 | } 895 | { s 'oldname := s } 896 | if$ 897 | } 898 | if$ 899 | } 900 | 901 | % Converts the number string on the top of the stack to 902 | % "numerical ordinal form" (e.g., "7" to "7th"). 
There is 903 | % no artificial limit to the upper bound of the numbers as the 904 | % two least significant digits determine the ordinal form. 905 | FUNCTION {num.to.ordinal} 906 | { duplicate$ #-2 #1 substring$ "1" = 907 | { bbl.th * } 908 | { duplicate$ #-1 #1 substring$ "1" = 909 | { bbl.st * } 910 | { duplicate$ #-1 #1 substring$ "2" = 911 | { bbl.nd * } 912 | { duplicate$ #-1 #1 substring$ "3" = 913 | { bbl.rd * } 914 | { bbl.th * } 915 | if$ 916 | } 917 | if$ 918 | } 919 | if$ 920 | } 921 | if$ 922 | } 923 | 924 | % If the string on the top of the stack begins with a number, 925 | % (e.g., 11th) then replace the string with the leading number 926 | % it contains. Otherwise retain the string as-is. s holds the 927 | % extracted number, t holds the part of the string that remains 928 | % to be scanned. 929 | FUNCTION {extract.num} 930 | { duplicate$ 't := 931 | "" 's := 932 | { t empty$ not } 933 | { t #1 #1 substring$ 934 | t #2 global.max$ substring$ 't := 935 | duplicate$ is.num 936 | { s swap$ * 's := } 937 | { pop$ "" 't := } 938 | if$ 939 | } 940 | while$ 941 | s empty$ 942 | 'skip$ 943 | { pop$ s } 944 | if$ 945 | } 946 | 947 | % Converts the word number string on the top of the stack to 948 | % Arabic string form. Will be successful up to "tenth". 
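%
% For illustration (traced by hand, not tested output): "Second" is
% replaced by "2" because the comparison is made on the lowercased copy
% held in s, whereas "eleventh" matches none of the tests below and is
% left on the stack unchanged.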
949 | FUNCTION {word.to.num} 950 | { duplicate$ "l" change.case$ 's := 951 | s "first" = 952 | { pop$ "1" } 953 | { skip$ } 954 | if$ 955 | s "second" = 956 | { pop$ "2" } 957 | { skip$ } 958 | if$ 959 | s "third" = 960 | { pop$ "3" } 961 | { skip$ } 962 | if$ 963 | s "fourth" = 964 | { pop$ "4" } 965 | { skip$ } 966 | if$ 967 | s "fifth" = 968 | { pop$ "5" } 969 | { skip$ } 970 | if$ 971 | s "sixth" = 972 | { pop$ "6" } 973 | { skip$ } 974 | if$ 975 | s "seventh" = 976 | { pop$ "7" } 977 | { skip$ } 978 | if$ 979 | s "eighth" = 980 | { pop$ "8" } 981 | { skip$ } 982 | if$ 983 | s "ninth" = 984 | { pop$ "9" } 985 | { skip$ } 986 | if$ 987 | s "tenth" = 988 | { pop$ "10" } 989 | { skip$ } 990 | if$ 991 | } 992 | 993 | 994 | % Converts the string on the top of the stack to numerical 995 | % ordinal (e.g., "11th") form. 996 | FUNCTION {convert.edition} 997 | { duplicate$ empty$ 'skip$ 998 | { duplicate$ #1 #1 substring$ is.num 999 | { extract.num 1000 | num.to.ordinal 1001 | } 1002 | { word.to.num 1003 | duplicate$ #1 #1 substring$ is.num 1004 | { num.to.ordinal } 1005 | { "edition ordinal word " quote$ * edition * quote$ * 1006 | " may be too high (or improper) for conversion" * " in " * cite$ * warning$ 1007 | } 1008 | if$ 1009 | } 1010 | if$ 1011 | } 1012 | if$ 1013 | } 1014 | 1015 | 1016 | 1017 | 1018 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1019 | %% LATEX BIBLIOGRAPHY CODE %% 1020 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1021 | 1022 | FUNCTION {start.entry} 1023 | { newline$ 1024 | "\bibitem{" write$ 1025 | cite$ write$ 1026 | "}" write$ 1027 | newline$ 1028 | "" 1029 | initialize.prev.this.status 1030 | } 1031 | 1032 | % Here we write out all the LaTeX code that we will need. The most involved 1033 | % code sequences are those that control the alternate interword spacing and 1034 | % foreign language hyphenation patterns. The heavy use of \providecommand 1035 | % gives users a way to override the defaults. 
Special thanks to Javier Bezos, 1036 | % Johannes Braams, Robin Fairbairns, Heiko Oberdiek, Donald Arseneau and all 1037 | % the other gurus on comp.text.tex for their help and advice on the topic of 1038 | % \selectlanguage, Babel and BibTeX. 1039 | FUNCTION {begin.bib} 1040 | { "% Generated by IEEEtran.bst, version: " bst.file.version * " (" * bst.file.date * ")" * 1041 | write$ newline$ 1042 | preamble$ empty$ 'skip$ 1043 | { preamble$ write$ newline$ } 1044 | if$ 1045 | "\begin{thebibliography}{" longest.label * "}" * 1046 | write$ newline$ 1047 | "\providecommand{\url}[1]{#1}" 1048 | write$ newline$ 1049 | "\csname url@samestyle\endcsname" 1050 | write$ newline$ 1051 | "\providecommand{\newblock}{\relax}" 1052 | write$ newline$ 1053 | "\providecommand{\bibinfo}[2]{#2}" 1054 | write$ newline$ 1055 | "\providecommand{\BIBentrySTDinterwordspacing}{\spaceskip=0pt\relax}" 1056 | write$ newline$ 1057 | "\providecommand{\BIBentryALTinterwordstretchfactor}{" 1058 | ALTinterwordstretchfactor * "}" * 1059 | write$ newline$ 1060 | "\providecommand{\BIBentryALTinterwordspacing}{\spaceskip=\fontdimen2\font plus " 1061 | write$ newline$ 1062 | "\BIBentryALTinterwordstretchfactor\fontdimen3\font minus \fontdimen4\font\relax}" 1063 | write$ newline$ 1064 | "\providecommand{\BIBforeignlanguage}[2]{{%" 1065 | write$ newline$ 1066 | "\expandafter\ifx\csname l@#1\endcsname\relax" 1067 | write$ newline$ 1068 | "\typeout{** WARNING: IEEEtran.bst: No hyphenation pattern has been}%" 1069 | write$ newline$ 1070 | "\typeout{** loaded for the language `#1'. 
Using the pattern for}%" 1071 | write$ newline$ 1072 | "\typeout{** the default language instead.}%" 1073 | write$ newline$ 1074 | "\else" 1075 | write$ newline$ 1076 | "\language=\csname l@#1\endcsname" 1077 | write$ newline$ 1078 | "\fi" 1079 | write$ newline$ 1080 | "#2}}" 1081 | write$ newline$ 1082 | "\providecommand{\BIBdecl}{\relax}" 1083 | write$ newline$ 1084 | "\BIBdecl" 1085 | write$ newline$ 1086 | } 1087 | 1088 | FUNCTION {end.bib} 1089 | { newline$ "\end{thebibliography}" write$ newline$ } 1090 | 1091 | FUNCTION {if.url.alt.interword.spacing} 1092 | { is.use.alt.interword.spacing 1093 | { is.use.url 1094 | { url empty$ 'skip$ {"\BIBentryALTinterwordspacing" write$ newline$} if$ } 1095 | { skip$ } 1096 | if$ 1097 | } 1098 | { skip$ } 1099 | if$ 1100 | } 1101 | 1102 | FUNCTION {if.url.std.interword.spacing} 1103 | { is.use.alt.interword.spacing 1104 | { is.use.url 1105 | { url empty$ 'skip$ {"\BIBentrySTDinterwordspacing" write$ newline$} if$ } 1106 | { skip$ } 1107 | if$ 1108 | } 1109 | { skip$ } 1110 | if$ 1111 | } 1112 | 1113 | 1114 | 1115 | 1116 | %%%%%%%%%%%%%%%%%%%%%%%% 1117 | %% LONGEST LABEL PASS %% 1118 | %%%%%%%%%%%%%%%%%%%%%%%% 1119 | 1120 | FUNCTION {initialize.longest.label} 1121 | { "" 'longest.label := 1122 | #1 'number.label := 1123 | #0 'longest.label.width := 1124 | } 1125 | 1126 | FUNCTION {longest.label.pass} 1127 | { type$ "ieeetranbstctl" = 1128 | { skip$ } 1129 | { number.label int.to.str$ 'label := 1130 | number.label #1 + 'number.label := 1131 | label width$ longest.label.width > 1132 | { label 'longest.label := 1133 | label width$ 'longest.label.width := 1134 | } 1135 | { skip$ } 1136 | if$ 1137 | } 1138 | if$ 1139 | } 1140 | 1141 | 1142 | 1143 | 1144 | %%%%%%%%%%%%%%%%%%%%% 1145 | %% FORMAT HANDLERS %% 1146 | %%%%%%%%%%%%%%%%%%%%% 1147 | 1148 | %% Lower Level Formats (used by higher level formats) 1149 | 1150 | FUNCTION {format.address.org.or.pub.date} 1151 | { 't := 1152 | "" 1153 | year empty$ 1154 | { "empty year in " cite$ 
* warning$ } 1155 | { skip$ } 1156 | if$ 1157 | address empty$ t empty$ and 1158 | year empty$ and month empty$ and 1159 | { skip$ } 1160 | { this.to.prev.status 1161 | this.status.std 1162 | cap.status.std 1163 | address "address" bibinfo.check * 1164 | t empty$ 1165 | { skip$ } 1166 | { punct.period 'prev.status.punct := 1167 | space.large 'prev.status.space := 1168 | address empty$ 1169 | { skip$ } 1170 | { ": " * } 1171 | if$ 1172 | t * 1173 | } 1174 | if$ 1175 | year empty$ month empty$ and 1176 | { skip$ } 1177 | { t empty$ address empty$ and 1178 | { skip$ } 1179 | { ", " * } 1180 | if$ 1181 | month empty$ 1182 | { year empty$ 1183 | { skip$ } 1184 | { year "year" bibinfo.check * } 1185 | if$ 1186 | } 1187 | { month "month" bibinfo.check * 1188 | year empty$ 1189 | { skip$ } 1190 | { " " * year "year" bibinfo.check * } 1191 | if$ 1192 | } 1193 | if$ 1194 | } 1195 | if$ 1196 | } 1197 | if$ 1198 | } 1199 | 1200 | 1201 | FUNCTION {format.names} 1202 | { 'bibinfo := 1203 | duplicate$ empty$ 'skip$ { 1204 | this.to.prev.status 1205 | this.status.std 1206 | 's := 1207 | "" 't := 1208 | #1 'nameptr := 1209 | s num.names$ 'numnames := 1210 | numnames 'namesleft := 1211 | { namesleft #0 > } 1212 | { s nameptr 1213 | name.format.string 1214 | format.name$ 1215 | bibinfo bibinfo.check 1216 | 't := 1217 | nameptr #1 > 1218 | { nameptr num.names.shown.with.forced.et.al #1 + = 1219 | numnames max.num.names.before.forced.et.al > 1220 | is.forced.et.al and and 1221 | { "others" 't := 1222 | #1 'namesleft := 1223 | } 1224 | { skip$ } 1225 | if$ 1226 | namesleft #1 > 1227 | { ", " * t do.name.latex.cmd * } 1228 | { s nameptr "{ll}" format.name$ duplicate$ "others" = 1229 | { 't := } 1230 | { pop$ } 1231 | if$ 1232 | t "others" = 1233 | { " " * bbl.etal emphasize * } 1234 | { numnames #2 > 1235 | { "," * } 1236 | { skip$ } 1237 | if$ 1238 | bbl.and 1239 | space.word * t do.name.latex.cmd * 1240 | } 1241 | if$ 1242 | } 1243 | if$ 1244 | } 1245 | { t do.name.latex.cmd } 1246 | 
if$ 1247 | nameptr #1 + 'nameptr := 1248 | namesleft #1 - 'namesleft := 1249 | } 1250 | while$ 1251 | cap.status.std 1252 | } if$ 1253 | } 1254 | 1255 | 1256 | 1257 | 1258 | %% Higher Level Formats 1259 | 1260 | %% addresses/locations 1261 | 1262 | FUNCTION {format.address} 1263 | { address duplicate$ empty$ 'skip$ 1264 | { this.to.prev.status 1265 | this.status.std 1266 | cap.status.std 1267 | } 1268 | if$ 1269 | } 1270 | 1271 | 1272 | 1273 | %% author/editor names 1274 | 1275 | FUNCTION {format.authors}{ author "author" format.names } 1276 | 1277 | FUNCTION {format.editors} 1278 | { editor "editor" format.names duplicate$ empty$ 'skip$ 1279 | { ", " * 1280 | get.bbl.editor 1281 | capitalize 1282 | * 1283 | } 1284 | if$ 1285 | } 1286 | 1287 | 1288 | 1289 | %% date 1290 | 1291 | FUNCTION {format.date} 1292 | { 1293 | month "month" bibinfo.check duplicate$ empty$ 1294 | year "year" bibinfo.check duplicate$ empty$ 1295 | { swap$ 'skip$ 1296 | { this.to.prev.status 1297 | this.status.std 1298 | cap.status.std 1299 | "there's a month but no year in " cite$ * warning$ } 1300 | if$ 1301 | * 1302 | } 1303 | { this.to.prev.status 1304 | this.status.std 1305 | cap.status.std 1306 | swap$ 'skip$ 1307 | { 1308 | swap$ 1309 | " " * swap$ 1310 | } 1311 | if$ 1312 | * 1313 | } 1314 | if$ 1315 | } 1316 | 1317 | FUNCTION {format.date.electronic} 1318 | { month "month" bibinfo.check duplicate$ empty$ 1319 | year "year" bibinfo.check duplicate$ empty$ 1320 | { swap$ 1321 | { pop$ } 1322 | { "there's a month but no year in " cite$ * warning$ 1323 | pop$ ")" * "(" swap$ * 1324 | this.to.prev.status 1325 | punct.no 'this.status.punct := 1326 | space.normal 'this.status.space := 1327 | quote.no 'this.status.quote := 1328 | cap.yes 'status.cap := 1329 | } 1330 | if$ 1331 | } 1332 | { swap$ 1333 | { swap$ pop$ ")" * "(" swap$ * } 1334 | { "(" swap$ * ", " * swap$ * ")" * } 1335 | if$ 1336 | this.to.prev.status 1337 | punct.no 'this.status.punct := 1338 | space.normal 'this.status.space := 
1339 | quote.no 'this.status.quote := 1340 | cap.yes 'status.cap := 1341 | } 1342 | if$ 1343 | } 1344 | 1345 | 1346 | 1347 | %% edition/title 1348 | 1349 | % Note: The IEEE considers the edition to be closely associated with 1350 | % the title of a book. So, in IEEEtran.bst the edition is normally handled 1351 | % within the formatting of the title. The format.edition function is 1352 | % retained here for possible future use. 1353 | FUNCTION {format.edition} 1354 | { edition duplicate$ empty$ 'skip$ 1355 | { this.to.prev.status 1356 | this.status.std 1357 | convert.edition 1358 | status.cap 1359 | { "t" } 1360 | { "l" } 1361 | if$ change.case$ 1362 | "edition" bibinfo.check 1363 | "~" * bbl.edition * 1364 | cap.status.std 1365 | } 1366 | if$ 1367 | } 1368 | 1369 | % This is used to format the booktitle of a conference proceedings. 1370 | % Here we use the "intype" field to provide the user a way to 1371 | % override the word "in" (e.g., with things like "presented at") 1372 | % Use of intype stops the emphasis of the booktitle to indicate that 1373 | % we no longer mean the written conference proceedings, but the 1374 | % conference itself. 1375 | FUNCTION {format.in.booktitle} 1376 | { booktitle "booktitle" bibinfo.check duplicate$ empty$ 'skip$ 1377 | { this.to.prev.status 1378 | this.status.std 1379 | select.language 1380 | intype missing$ 1381 | { emphasize 1382 | bbl.in " " * 1383 | } 1384 | { intype " " * } 1385 | if$ 1386 | swap$ * 1387 | cap.status.std 1388 | } 1389 | if$ 1390 | } 1391 | 1392 | % This is used to format the booktitle of collection. 1393 | % Here the "intype" field is not supported, but "edition" is. 
1394 | FUNCTION {format.in.booktitle.edition} 1395 | { booktitle "booktitle" bibinfo.check duplicate$ empty$ 'skip$ 1396 | { this.to.prev.status 1397 | this.status.std 1398 | select.language 1399 | emphasize 1400 | edition empty$ 'skip$ 1401 | { ", " * 1402 | edition 1403 | convert.edition 1404 | "l" change.case$ 1405 | * "~" * bbl.edition * 1406 | } 1407 | if$ 1408 | bbl.in " " * swap$ * 1409 | cap.status.std 1410 | } 1411 | if$ 1412 | } 1413 | 1414 | FUNCTION {format.article.title} 1415 | { title duplicate$ empty$ 'skip$ 1416 | { this.to.prev.status 1417 | this.status.std 1418 | "t" change.case$ 1419 | } 1420 | if$ 1421 | "title" bibinfo.check 1422 | duplicate$ empty$ 'skip$ 1423 | { quote.close 'this.status.quote := 1424 | is.last.char.not.punct 1425 | { punct.std 'this.status.punct := } 1426 | { punct.no 'this.status.punct := } 1427 | if$ 1428 | select.language 1429 | "``" swap$ * 1430 | cap.status.std 1431 | } 1432 | if$ 1433 | } 1434 | 1435 | FUNCTION {format.article.title.electronic} 1436 | { title duplicate$ empty$ 'skip$ 1437 | { this.to.prev.status 1438 | this.status.std 1439 | cap.status.std 1440 | "t" change.case$ 1441 | } 1442 | if$ 1443 | "title" bibinfo.check 1444 | duplicate$ empty$ 1445 | { skip$ } 1446 | { select.language } 1447 | if$ 1448 | } 1449 | 1450 | FUNCTION {format.book.title.edition} 1451 | { title "title" bibinfo.check 1452 | duplicate$ empty$ 1453 | { "empty title in " cite$ * warning$ } 1454 | { this.to.prev.status 1455 | this.status.std 1456 | select.language 1457 | emphasize 1458 | edition empty$ 'skip$ 1459 | { ", " * 1460 | edition 1461 | convert.edition 1462 | status.cap 1463 | { "t" } 1464 | { "l" } 1465 | if$ 1466 | change.case$ 1467 | * "~" * bbl.edition * 1468 | } 1469 | if$ 1470 | cap.status.std 1471 | } 1472 | if$ 1473 | } 1474 | 1475 | FUNCTION {format.book.title} 1476 | { title "title" bibinfo.check 1477 | duplicate$ empty$ 'skip$ 1478 | { this.to.prev.status 1479 | this.status.std 1480 | cap.status.std 1481 | 
select.language 1482 | emphasize 1483 | } 1484 | if$ 1485 | } 1486 | 1487 | 1488 | 1489 | %% journal 1490 | 1491 | FUNCTION {format.journal} 1492 | { journal duplicate$ empty$ 'skip$ 1493 | { this.to.prev.status 1494 | this.status.std 1495 | cap.status.std 1496 | select.language 1497 | emphasize 1498 | } 1499 | if$ 1500 | } 1501 | 1502 | 1503 | 1504 | %% how published 1505 | 1506 | FUNCTION {format.howpublished} 1507 | { howpublished duplicate$ empty$ 'skip$ 1508 | { this.to.prev.status 1509 | this.status.std 1510 | cap.status.std 1511 | } 1512 | if$ 1513 | } 1514 | 1515 | 1516 | 1517 | %% institutions/organization/publishers/school 1518 | 1519 | FUNCTION {format.institution} 1520 | { institution duplicate$ empty$ 'skip$ 1521 | { this.to.prev.status 1522 | this.status.std 1523 | cap.status.std 1524 | } 1525 | if$ 1526 | } 1527 | 1528 | FUNCTION {format.organization} 1529 | { organization duplicate$ empty$ 'skip$ 1530 | { this.to.prev.status 1531 | this.status.std 1532 | cap.status.std 1533 | } 1534 | if$ 1535 | } 1536 | 1537 | FUNCTION {format.address.publisher.date} 1538 | { publisher "publisher" bibinfo.warn format.address.org.or.pub.date } 1539 | 1540 | FUNCTION {format.address.publisher.date.nowarn} 1541 | { publisher "publisher" bibinfo.check format.address.org.or.pub.date } 1542 | 1543 | FUNCTION {format.address.organization.date} 1544 | { organization "organization" bibinfo.check format.address.org.or.pub.date } 1545 | 1546 | FUNCTION {format.school} 1547 | { school duplicate$ empty$ 'skip$ 1548 | { this.to.prev.status 1549 | this.status.std 1550 | cap.status.std 1551 | } 1552 | if$ 1553 | } 1554 | 1555 | 1556 | 1557 | %% volume/number/series/chapter/pages 1558 | 1559 | FUNCTION {format.volume} 1560 | { volume empty.field.to.null.string 1561 | duplicate$ empty$ 'skip$ 1562 | { this.to.prev.status 1563 | this.status.std 1564 | bbl.volume 1565 | status.cap 1566 | { capitalize } 1567 | { skip$ } 1568 | if$ 1569 | swap$ tie.or.space.prefix 1570 | "volume" 
bibinfo.check 1571 | * * 1572 | cap.status.std 1573 | } 1574 | if$ 1575 | } 1576 | 1577 | FUNCTION {format.number} 1578 | { number empty.field.to.null.string 1579 | duplicate$ empty$ 'skip$ 1580 | { this.to.prev.status 1581 | this.status.std 1582 | status.cap 1583 | { bbl.number capitalize } 1584 | { bbl.number } 1585 | if$ 1586 | swap$ tie.or.space.prefix 1587 | "number" bibinfo.check 1588 | * * 1589 | cap.status.std 1590 | } 1591 | if$ 1592 | } 1593 | 1594 | FUNCTION {format.number.if.use.for.article} 1595 | { is.use.number.for.article 1596 | { format.number } 1597 | { "" } 1598 | if$ 1599 | } 1600 | 1601 | % The IEEE does not seem to tie the series so closely with the volume 1602 | % and number as is done in other bibliography styles. Instead the 1603 | % series is treated somewhat like an extension of the title. 1604 | FUNCTION {format.series} 1605 | { series empty$ 1606 | { "" } 1607 | { this.to.prev.status 1608 | this.status.std 1609 | bbl.series " " * 1610 | series "series" bibinfo.check * 1611 | cap.status.std 1612 | } 1613 | if$ 1614 | } 1615 | 1616 | 1617 | FUNCTION {format.chapter} 1618 | { chapter empty$ 1619 | { "" } 1620 | { this.to.prev.status 1621 | this.status.std 1622 | type empty$ 1623 | { bbl.chapter } 1624 | { type "l" change.case$ 1625 | "type" bibinfo.check 1626 | } 1627 | if$ 1628 | chapter tie.or.space.prefix 1629 | "chapter" bibinfo.check 1630 | * * 1631 | cap.status.std 1632 | } 1633 | if$ 1634 | } 1635 | 1636 | 1637 | % The intended use of format.paper is for paper numbers of inproceedings. 1638 | % The paper type can be overridden via the type field. 
1639 | % We allow the type to be displayed even if the paper number is absent 1640 | % for things like "postdeadline paper" 1641 | FUNCTION {format.paper} 1642 | { is.use.paper 1643 | { paper empty$ 1644 | { type empty$ 1645 | { "" } 1646 | { this.to.prev.status 1647 | this.status.std 1648 | type "type" bibinfo.check 1649 | cap.status.std 1650 | } 1651 | if$ 1652 | } 1653 | { this.to.prev.status 1654 | this.status.std 1655 | type empty$ 1656 | { bbl.paper } 1657 | { type "type" bibinfo.check } 1658 | if$ 1659 | " " * paper 1660 | "paper" bibinfo.check 1661 | * 1662 | cap.status.std 1663 | } 1664 | if$ 1665 | } 1666 | { "" } 1667 | if$ 1668 | } 1669 | 1670 | 1671 | FUNCTION {format.pages} 1672 | { pages duplicate$ empty$ 'skip$ 1673 | { this.to.prev.status 1674 | this.status.std 1675 | duplicate$ is.multiple.pages 1676 | { 1677 | bbl.pages swap$ 1678 | n.dashify 1679 | } 1680 | { 1681 | bbl.page swap$ 1682 | } 1683 | if$ 1684 | tie.or.space.prefix 1685 | "pages" bibinfo.check 1686 | * * 1687 | cap.status.std 1688 | } 1689 | if$ 1690 | } 1691 | 1692 | 1693 | 1694 | %% technical report number 1695 | 1696 | FUNCTION {format.tech.report.number} 1697 | { number "number" bibinfo.check 1698 | this.to.prev.status 1699 | this.status.std 1700 | cap.status.std 1701 | type duplicate$ empty$ 1702 | { pop$ 1703 | bbl.techrep 1704 | } 1705 | { skip$ } 1706 | if$ 1707 | "type" bibinfo.check 1708 | swap$ duplicate$ empty$ 1709 | { pop$ } 1710 | { tie.or.space.prefix * * } 1711 | if$ 1712 | } 1713 | 1714 | 1715 | 1716 | %% note 1717 | 1718 | FUNCTION {format.note} 1719 | { note empty$ 1720 | { "" } 1721 | { this.to.prev.status 1722 | this.status.std 1723 | punct.period 'this.status.punct := 1724 | note #1 #1 substring$ 1725 | duplicate$ "{" = 1726 | { skip$ } 1727 | { status.cap 1728 | { "u" } 1729 | { "l" } 1730 | if$ 1731 | change.case$ 1732 | } 1733 | if$ 1734 | note #2 global.max$ substring$ * "note" bibinfo.check 1735 | cap.yes 'status.cap := 1736 | } 1737 | if$ 1738 | } 1739 | 
1740 | 1741 | 1742 | %% patent 1743 | 1744 | FUNCTION {format.patent.date} 1745 | { this.to.prev.status 1746 | this.status.std 1747 | year empty$ 1748 | { monthfiled duplicate$ empty$ 1749 | { "monthfiled" bibinfo.check pop$ "" } 1750 | { "monthfiled" bibinfo.check } 1751 | if$ 1752 | dayfiled duplicate$ empty$ 1753 | { "dayfiled" bibinfo.check pop$ "" * } 1754 | { "dayfiled" bibinfo.check 1755 | monthfiled empty$ 1756 | { "dayfiled without a monthfiled in " cite$ * warning$ 1757 | * 1758 | } 1759 | { " " swap$ * * } 1760 | if$ 1761 | } 1762 | if$ 1763 | yearfiled empty$ 1764 | { "no year or yearfiled in " cite$ * warning$ } 1765 | { yearfiled "yearfiled" bibinfo.check 1766 | swap$ 1767 | duplicate$ empty$ 1768 | { pop$ } 1769 | { ", " * swap$ * } 1770 | if$ 1771 | } 1772 | if$ 1773 | } 1774 | { month duplicate$ empty$ 1775 | { "month" bibinfo.check pop$ "" } 1776 | { "month" bibinfo.check } 1777 | if$ 1778 | day duplicate$ empty$ 1779 | { "day" bibinfo.check pop$ "" * } 1780 | { "day" bibinfo.check 1781 | month empty$ 1782 | { "day without a month in " cite$ * warning$ 1783 | * 1784 | } 1785 | { " " swap$ * * } 1786 | if$ 1787 | } 1788 | if$ 1789 | year "year" bibinfo.check 1790 | swap$ 1791 | duplicate$ empty$ 1792 | { pop$ } 1793 | { ", " * swap$ * } 1794 | if$ 1795 | } 1796 | if$ 1797 | cap.status.std 1798 | } 1799 | 1800 | FUNCTION {format.patent.nationality.type.number} 1801 | { this.to.prev.status 1802 | this.status.std 1803 | nationality duplicate$ empty$ 1804 | { "nationality" bibinfo.warn pop$ "" } 1805 | { "nationality" bibinfo.check 1806 | duplicate$ "l" change.case$ "united states" = 1807 | { pop$ bbl.patentUS } 1808 | { skip$ } 1809 | if$ 1810 | " " * 1811 | } 1812 | if$ 1813 | type empty$ 1814 | { bbl.patent "type" bibinfo.check } 1815 | { type "type" bibinfo.check } 1816 | if$ 1817 | * 1818 | number duplicate$ empty$ 1819 | { "number" bibinfo.warn pop$ } 1820 | { "number" bibinfo.check 1821 | large.number.separate 1822 | swap$ " " * swap$ * 1823 | } 
1824 | if$ 1825 | cap.status.std 1826 | } 1827 | 1828 | 1829 | 1830 | %% standard 1831 | 1832 | FUNCTION {format.organization.institution.standard.type.number} 1833 | { this.to.prev.status 1834 | this.status.std 1835 | organization duplicate$ empty$ 1836 | { pop$ 1837 | institution duplicate$ empty$ 1838 | { "institution" bibinfo.warn } 1839 | { "institution" bibinfo.warn " " * } 1840 | if$ 1841 | } 1842 | { "organization" bibinfo.warn " " * } 1843 | if$ 1844 | type empty$ 1845 | { bbl.standard "type" bibinfo.check } 1846 | { type "type" bibinfo.check } 1847 | if$ 1848 | * 1849 | number duplicate$ empty$ 1850 | { "number" bibinfo.check pop$ } 1851 | { "number" bibinfo.check 1852 | large.number.separate 1853 | swap$ " " * swap$ * 1854 | } 1855 | if$ 1856 | cap.status.std 1857 | } 1858 | 1859 | FUNCTION {format.revision} 1860 | { revision empty$ 1861 | { "" } 1862 | { this.to.prev.status 1863 | this.status.std 1864 | bbl.revision 1865 | revision tie.or.space.prefix 1866 | "revision" bibinfo.check 1867 | * * 1868 | cap.status.std 1869 | } 1870 | if$ 1871 | } 1872 | 1873 | 1874 | %% thesis 1875 | 1876 | FUNCTION {format.master.thesis.type} 1877 | { this.to.prev.status 1878 | this.status.std 1879 | type empty$ 1880 | { 1881 | bbl.mthesis 1882 | } 1883 | { 1884 | type "type" bibinfo.check 1885 | } 1886 | if$ 1887 | cap.status.std 1888 | } 1889 | 1890 | FUNCTION {format.phd.thesis.type} 1891 | { this.to.prev.status 1892 | this.status.std 1893 | type empty$ 1894 | { 1895 | bbl.phdthesis 1896 | } 1897 | { 1898 | type "type" bibinfo.check 1899 | } 1900 | if$ 1901 | cap.status.std 1902 | } 1903 | 1904 | 1905 | 1906 | %% URL 1907 | 1908 | FUNCTION {format.url} 1909 | { is.use.url 1910 | { url empty$ 1911 | { "" } 1912 | { this.to.prev.status 1913 | this.status.std 1914 | cap.yes 'status.cap := 1915 | name.url.prefix " " * 1916 | "\url{" * url * "}" * 1917 | punct.no 'this.status.punct := 1918 | punct.period 'prev.status.punct := 1919 | space.normal 'this.status.space := 1920 | 
space.normal 'prev.status.space := 1921 | quote.no 'this.status.quote := 1922 | } 1923 | if$ 1924 | } 1925 | { "" } 1926 | if$ 1927 | } 1928 | 1929 | 1930 | 1931 | 1932 | %%%%%%%%%%%%%%%%%%%% 1933 | %% ENTRY HANDLERS %% 1934 | %%%%%%%%%%%%%%%%%%%% 1935 | 1936 | 1937 | % Note: In many journals, the IEEE (or the authors) tend not to show the number 1938 | % for articles, so the display of the number is controlled here by the 1939 | % switch "is.use.number.for.article" 1940 | FUNCTION {article} 1941 | { std.status.using.comma 1942 | start.entry 1943 | if.url.alt.interword.spacing 1944 | format.authors "author" output.warn 1945 | name.or.dash 1946 | format.article.title "title" output.warn 1947 | format.journal "journal" bibinfo.check "journal" output.warn 1948 | format.volume output 1949 | format.number.if.use.for.article output 1950 | format.pages output 1951 | format.date "year" output.warn 1952 | format.note output 1953 | format.url output 1954 | fin.entry 1955 | if.url.std.interword.spacing 1956 | } 1957 | 1958 | FUNCTION {book} 1959 | { std.status.using.comma 1960 | start.entry 1961 | if.url.alt.interword.spacing 1962 | author empty$ 1963 | { format.editors "author and editor" output.warn } 1964 | { format.authors output.nonnull } 1965 | if$ 1966 | name.or.dash 1967 | format.book.title.edition output 1968 | format.series output 1969 | author empty$ 1970 | { skip$ } 1971 | { format.editors output } 1972 | if$ 1973 | format.address.publisher.date output 1974 | format.volume output 1975 | format.number output 1976 | format.note output 1977 | format.url output 1978 | fin.entry 1979 | if.url.std.interword.spacing 1980 | } 1981 | 1982 | FUNCTION {booklet} 1983 | { std.status.using.comma 1984 | start.entry 1985 | if.url.alt.interword.spacing 1986 | format.authors output 1987 | name.or.dash 1988 | format.article.title "title" output.warn 1989 | format.howpublished "howpublished" bibinfo.check output 1990 | format.organization "organization" bibinfo.check output 1991 | 
format.address "address" bibinfo.check output 1992 | format.date output 1993 | format.note output 1994 | format.url output 1995 | fin.entry 1996 | if.url.std.interword.spacing 1997 | } 1998 | 1999 | FUNCTION {electronic} 2000 | { std.status.using.period 2001 | start.entry 2002 | if.url.alt.interword.spacing 2003 | format.authors output 2004 | name.or.dash 2005 | format.date.electronic output 2006 | format.article.title.electronic output 2007 | format.howpublished "howpublished" bibinfo.check output 2008 | format.organization "organization" bibinfo.check output 2009 | format.address "address" bibinfo.check output 2010 | format.note output 2011 | format.url output 2012 | fin.entry 2013 | empty.entry.warn 2014 | if.url.std.interword.spacing 2015 | } 2016 | 2017 | FUNCTION {inbook} 2018 | { std.status.using.comma 2019 | start.entry 2020 | if.url.alt.interword.spacing 2021 | author empty$ 2022 | { format.editors "author and editor" output.warn } 2023 | { format.authors output.nonnull } 2024 | if$ 2025 | name.or.dash 2026 | format.book.title.edition output 2027 | format.series output 2028 | format.address.publisher.date output 2029 | format.volume output 2030 | format.number output 2031 | format.chapter output 2032 | format.pages output 2033 | format.note output 2034 | format.url output 2035 | fin.entry 2036 | if.url.std.interword.spacing 2037 | } 2038 | 2039 | FUNCTION {incollection} 2040 | { std.status.using.comma 2041 | start.entry 2042 | if.url.alt.interword.spacing 2043 | format.authors "author" output.warn 2044 | name.or.dash 2045 | format.article.title "title" output.warn 2046 | format.in.booktitle.edition "booktitle" output.warn 2047 | format.series output 2048 | format.editors output 2049 | format.address.publisher.date.nowarn output 2050 | format.volume output 2051 | format.number output 2052 | format.chapter output 2053 | format.pages output 2054 | format.note output 2055 | format.url output 2056 | fin.entry 2057 | if.url.std.interword.spacing 2058 | } 2059 | 
2060 | FUNCTION {inproceedings} 2061 | { std.status.using.comma 2062 | start.entry 2063 | if.url.alt.interword.spacing 2064 | format.authors "author" output.warn 2065 | name.or.dash 2066 | format.article.title "title" output.warn 2067 | format.in.booktitle "booktitle" output.warn 2068 | format.series output 2069 | format.editors output 2070 | format.volume output 2071 | format.number output 2072 | publisher empty$ 2073 | { format.address.organization.date output } 2074 | { format.organization "organization" bibinfo.check output 2075 | format.address.publisher.date output 2076 | } 2077 | if$ 2078 | format.paper output 2079 | format.pages output 2080 | format.note output 2081 | format.url output 2082 | fin.entry 2083 | if.url.std.interword.spacing 2084 | } 2085 | 2086 | FUNCTION {manual} 2087 | { std.status.using.comma 2088 | start.entry 2089 | if.url.alt.interword.spacing 2090 | format.authors output 2091 | name.or.dash 2092 | format.book.title.edition "title" output.warn 2093 | format.howpublished "howpublished" bibinfo.check output 2094 | format.organization "organization" bibinfo.check output 2095 | format.address "address" bibinfo.check output 2096 | format.date output 2097 | format.note output 2098 | format.url output 2099 | fin.entry 2100 | if.url.std.interword.spacing 2101 | } 2102 | 2103 | FUNCTION {mastersthesis} 2104 | { std.status.using.comma 2105 | start.entry 2106 | if.url.alt.interword.spacing 2107 | format.authors "author" output.warn 2108 | name.or.dash 2109 | format.article.title "title" output.warn 2110 | format.master.thesis.type output.nonnull 2111 | format.school "school" bibinfo.warn output 2112 | format.address "address" bibinfo.check output 2113 | format.date "year" output.warn 2114 | format.note output 2115 | format.url output 2116 | fin.entry 2117 | if.url.std.interword.spacing 2118 | } 2119 | 2120 | FUNCTION {misc} 2121 | { std.status.using.comma 2122 | start.entry 2123 | if.url.alt.interword.spacing 2124 | format.authors output 2125 | 
name.or.dash 2126 | format.article.title output 2127 | format.howpublished "howpublished" bibinfo.check output 2128 | format.organization "organization" bibinfo.check output 2129 | format.address "address" bibinfo.check output 2130 | format.pages output 2131 | format.date output 2132 | format.note output 2133 | format.url output 2134 | fin.entry 2135 | empty.entry.warn 2136 | if.url.std.interword.spacing 2137 | } 2138 | 2139 | FUNCTION {patent} 2140 | { std.status.using.comma 2141 | start.entry 2142 | if.url.alt.interword.spacing 2143 | format.authors output 2144 | name.or.dash 2145 | format.article.title output 2146 | format.patent.nationality.type.number output 2147 | format.patent.date output 2148 | format.note output 2149 | format.url output 2150 | fin.entry 2151 | empty.entry.warn 2152 | if.url.std.interword.spacing 2153 | } 2154 | 2155 | FUNCTION {periodical} 2156 | { std.status.using.comma 2157 | start.entry 2158 | if.url.alt.interword.spacing 2159 | format.editors output 2160 | name.or.dash 2161 | format.book.title "title" output.warn 2162 | format.series output 2163 | format.volume output 2164 | format.number output 2165 | format.organization "organization" bibinfo.check output 2166 | format.date "year" output.warn 2167 | format.note output 2168 | format.url output 2169 | fin.entry 2170 | if.url.std.interword.spacing 2171 | } 2172 | 2173 | FUNCTION {phdthesis} 2174 | { std.status.using.comma 2175 | start.entry 2176 | if.url.alt.interword.spacing 2177 | format.authors "author" output.warn 2178 | name.or.dash 2179 | format.article.title "title" output.warn 2180 | format.phd.thesis.type output.nonnull 2181 | format.school "school" bibinfo.warn output 2182 | format.address "address" bibinfo.check output 2183 | format.date "year" output.warn 2184 | format.note output 2185 | format.url output 2186 | fin.entry 2187 | if.url.std.interword.spacing 2188 | } 2189 | 2190 | FUNCTION {proceedings} 2191 | { std.status.using.comma 2192 | start.entry 2193 | 
if.url.alt.interword.spacing 2194 | format.editors output 2195 | name.or.dash 2196 | format.book.title "title" output.warn 2197 | format.series output 2198 | format.volume output 2199 | format.number output 2200 | publisher empty$ 2201 | { format.address.organization.date output } 2202 | { format.organization "organization" bibinfo.check output 2203 | format.address.publisher.date output 2204 | } 2205 | if$ 2206 | format.note output 2207 | format.url output 2208 | fin.entry 2209 | if.url.std.interword.spacing 2210 | } 2211 | 2212 | FUNCTION {standard} 2213 | { std.status.using.comma 2214 | start.entry 2215 | if.url.alt.interword.spacing 2216 | format.authors output 2217 | name.or.dash 2218 | format.book.title "title" output.warn 2219 | format.howpublished "howpublished" bibinfo.check output 2220 | format.organization.institution.standard.type.number output 2221 | format.revision output 2222 | format.date output 2223 | format.note output 2224 | format.url output 2225 | fin.entry 2226 | if.url.std.interword.spacing 2227 | } 2228 | 2229 | FUNCTION {techreport} 2230 | { std.status.using.comma 2231 | start.entry 2232 | if.url.alt.interword.spacing 2233 | format.authors "author" output.warn 2234 | name.or.dash 2235 | format.article.title "title" output.warn 2236 | format.howpublished "howpublished" bibinfo.check output 2237 | format.institution "institution" bibinfo.warn output 2238 | format.address "address" bibinfo.check output 2239 | format.tech.report.number output.nonnull 2240 | format.date "year" output.warn 2241 | format.note output 2242 | format.url output 2243 | fin.entry 2244 | if.url.std.interword.spacing 2245 | } 2246 | 2247 | FUNCTION {unpublished} 2248 | { std.status.using.comma 2249 | start.entry 2250 | if.url.alt.interword.spacing 2251 | format.authors "author" output.warn 2252 | name.or.dash 2253 | format.article.title "title" output.warn 2254 | format.date output 2255 | format.note "note" output.warn 2256 | format.url output 2257 | fin.entry 2258 | 
if.url.std.interword.spacing 2259 | } 2260 | 2261 | 2262 | % The special entry type which provides the user interface to the 2263 | % BST controls 2264 | FUNCTION {IEEEtranBSTCTL} 2265 | { is.print.banners.to.terminal 2266 | { "** IEEEtran BST control entry " quote$ * cite$ * quote$ * " detected." * 2267 | top$ 2268 | } 2269 | { skip$ } 2270 | if$ 2271 | CTLuse_article_number 2272 | empty$ 2273 | { skip$ } 2274 | { CTLuse_article_number 2275 | yes.no.to.int 2276 | 'is.use.number.for.article := 2277 | } 2278 | if$ 2279 | CTLuse_paper 2280 | empty$ 2281 | { skip$ } 2282 | { CTLuse_paper 2283 | yes.no.to.int 2284 | 'is.use.paper := 2285 | } 2286 | if$ 2287 | CTLuse_url 2288 | empty$ 2289 | { skip$ } 2290 | { CTLuse_url 2291 | yes.no.to.int 2292 | 'is.use.url := 2293 | } 2294 | if$ 2295 | CTLuse_forced_etal 2296 | empty$ 2297 | { skip$ } 2298 | { CTLuse_forced_etal 2299 | yes.no.to.int 2300 | 'is.forced.et.al := 2301 | } 2302 | if$ 2303 | CTLmax_names_forced_etal 2304 | empty$ 2305 | { skip$ } 2306 | { CTLmax_names_forced_etal 2307 | string.to.integer 2308 | 'max.num.names.before.forced.et.al := 2309 | } 2310 | if$ 2311 | CTLnames_show_etal 2312 | empty$ 2313 | { skip$ } 2314 | { CTLnames_show_etal 2315 | string.to.integer 2316 | 'num.names.shown.with.forced.et.al := 2317 | } 2318 | if$ 2319 | CTLuse_alt_spacing 2320 | empty$ 2321 | { skip$ } 2322 | { CTLuse_alt_spacing 2323 | yes.no.to.int 2324 | 'is.use.alt.interword.spacing := 2325 | } 2326 | if$ 2327 | CTLalt_stretch_factor 2328 | empty$ 2329 | { skip$ } 2330 | { CTLalt_stretch_factor 2331 | 'ALTinterwordstretchfactor := 2332 | "\renewcommand{\BIBentryALTinterwordstretchfactor}{" 2333 | ALTinterwordstretchfactor * "}" * 2334 | write$ newline$ 2335 | } 2336 | if$ 2337 | CTLdash_repeated_names 2338 | empty$ 2339 | { skip$ } 2340 | { CTLdash_repeated_names 2341 | yes.no.to.int 2342 | 'is.dash.repeated.names := 2343 | } 2344 | if$ 2345 | CTLname_format_string 2346 | empty$ 2347 | { skip$ } 2348 | { 
CTLname_format_string 2349 | 'name.format.string := 2350 | } 2351 | if$ 2352 | CTLname_latex_cmd 2353 | empty$ 2354 | { skip$ } 2355 | { CTLname_latex_cmd 2356 | 'name.latex.cmd := 2357 | } 2358 | if$ 2359 | CTLname_url_prefix 2360 | missing$ 2361 | { skip$ } 2362 | { CTLname_url_prefix 2363 | 'name.url.prefix := 2364 | } 2365 | if$ 2366 | 2367 | 2368 | num.names.shown.with.forced.et.al max.num.names.before.forced.et.al > 2369 | { "CTLnames_show_etal cannot be greater than CTLmax_names_forced_etal in " cite$ * warning$ 2370 | max.num.names.before.forced.et.al 'num.names.shown.with.forced.et.al := 2371 | } 2372 | { skip$ } 2373 | if$ 2374 | } 2375 | 2376 | 2377 | %%%%%%%%%%%%%%%%%%% 2378 | %% ENTRY ALIASES %% 2379 | %%%%%%%%%%%%%%%%%%% 2380 | FUNCTION {conference}{inproceedings} 2381 | FUNCTION {online}{electronic} 2382 | FUNCTION {internet}{electronic} 2383 | FUNCTION {webpage}{electronic} 2384 | FUNCTION {www}{electronic} 2385 | FUNCTION {default.type}{misc} 2386 | 2387 | 2388 | 2389 | %%%%%%%%%%%%%%%%%% 2390 | %% MAIN PROGRAM %% 2391 | %%%%%%%%%%%%%%%%%% 2392 | 2393 | READ 2394 | 2395 | EXECUTE {initialize.controls} 2396 | EXECUTE {initialize.status.constants} 2397 | EXECUTE {banner.message} 2398 | 2399 | EXECUTE {initialize.longest.label} 2400 | ITERATE {longest.label.pass} 2401 | 2402 | EXECUTE {begin.bib} 2403 | ITERATE {call.type$} 2404 | EXECUTE {end.bib} 2405 | 2406 | EXECUTE{completed.message} 2407 | 2408 | 2409 | %% That's all folks, mds. 
2410 | -------------------------------------------------------------------------------- /vignettes/build_pdf.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | fixVersion(){ 4 | PKGVER=`grep "Version:" ../DESCRIPTION | sed -e "s/Version: //"` 5 | sed -i -e "s/myversion{.*}/myversion{${PKGVER}}/" $1 6 | } 7 | 8 | cleanVignette(){ 9 | rm -f *.aux *.bbl *.blg *.log *.out *.toc *.dvi 10 | } 11 | 12 | buildVignette(){ 13 | fixVersion $1 14 | pdflatex $1 15 | bibname=`echo "$1" | sed -e 's/\..*//'` 16 | bibtex $bibname 17 | pdflatex $1 18 | pdflatex $1 19 | Rscript -e "tools::compactPDF('$1', gs_quality='ebook')" 20 | } 21 | 22 | 23 | cleanVignette 24 | buildVignette filesampler.Rnw 25 | cleanVignette 26 | 27 | 28 | mv -f *.pdf ../inst/doc/ 29 | cp -f *.Rnw ../inst/doc/ 30 | -------------------------------------------------------------------------------- /vignettes/filesampler.Rnw: -------------------------------------------------------------------------------- 1 | %\VignetteIndexEntry{Introduction to the filesampler Package} 2 | \documentclass[]{article} 3 | 4 | 5 | \input{./include/settings} 6 | 7 | 8 | \mytitle{Introduction to the filesampler Package} 9 | \mysubtitle{} 10 | \myversion{0.4-0} 11 | \myauthor{ 12 | \centering 13 | Drew Schmidt \\ 14 | \texttt{wrathematics@gmail.com} 15 | } 16 | 17 | 18 | \begin{document} 19 | \makefirstfew 20 | 21 | 22 | 23 | \section{Introduction}\label{introduction} 24 | 25 | \textbf{filesampler}~\cite{filesampler} is a simple R package for 26 | reading random subsamples of flat text files by line in a reasonably 27 | efficient manner. This is useful if you have a very large file (too big 28 | to comfortably handle in memory) and you want to prototype something on 29 | an unbiased subset. 30 | 31 | Basically, each reader works by sampling as the input file is scanned 32 | and randomly choosing whether or not to dump the current line to an 33 | external temporary file. 
This temporary file is then read back into R. 34 | For (aggressive) downsampling, this is a very effective strategy; for 35 | resampling, you are much better off reading the full dataset into 36 | memory. 37 | 38 | This package, including the underlying C library, is licensed under the 39 | permissive 2-clause BSD license. The original idea was inspired by 40 | Eduardo Arino de la Rubia's \textbf{fast\_sample} 41 | \cite{fast_sample}. 42 | 43 | \subsection{Installation}\label{installation} 44 | 45 | You can install the stable version from CRAN using the usual 46 | \texttt{install.packages()}: 47 | 48 | \begin{lstlisting}[language=rr] 49 | install.packages("filesampler") 50 | \end{lstlisting} 51 | 52 | The development version is maintained on GitHub: 53 | 54 | \begin{lstlisting}[language=rr] 55 | remotes::install_github("wrathematics/filesampler") 56 | \end{lstlisting} 57 | 58 | \section{Using the Package}\label{using-the-package} 59 | 60 | At its most basic, you can use \texttt{sample\_csv()} for csv 61 | files. Say you want to read in 0.1\% of the data. Then you would call: 62 | 63 | \begin{lstlisting}[language=rr] 64 | sample_csv(file, 0.001) 65 | \end{lstlisting} 66 | 67 | and \texttt{sample\_lines()} for reading unstructured lines (as with 68 | \texttt{readLines()}): 69 | 70 | \begin{lstlisting}[language=rr] 71 | sample_lines(file, 0.001) 72 | \end{lstlisting} 73 | 74 | By default, each of these file samplers will sample proportionally. This 75 | is done in a single pass of the input file. However, you can also use an 76 | ``exact'' sampler, which uses a reservoir sampler to draw an exact 77 | number of lines at random. Say you wanted to read in 1000 lines of a 78 | csv. Then you would call: 79 | 80 | \begin{lstlisting}[language=rr] 81 | sample_csv(file, 1000, method="exact") 82 | \end{lstlisting} 83 | 84 | This makes two passes through the file. Implementation details are 85 | discussed in a later section of this vignette.
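To make the single-pass proportional idea concrete, here is a minimal pure-R sketch. It is an illustration only: \texttt{sample\_lines\_sketch()} and its \texttt{chunk} argument are hypothetical names that are not part of \textbf{filesampler}, and the sketch keeps the selected lines in memory rather than writing them to a temporary file. It also assumes each line is kept independently with probability \texttt{p}.

```r
# Illustrative sketch only: single-pass proportional sampling in base R.
# sample_lines_sketch() is a hypothetical helper, not a filesampler function.
sample_lines_sketch <- function(path, p, chunk = 10000L)
{
  con <- file(path, open = "r")
  on.exit(close(con))

  kept <- character(0)
  repeat
  {
    # read the next block of lines; an empty result means end of file
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L)
      break

    # keep each line independently with probability p
    kept <- c(kept, lines[runif(length(lines)) < p])
  }

  kept
}

# small demonstration on a throwaway file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)

set.seed(1234)
out <- sample_lines_sketch(tmp, 0.1)
length(out)  # close to 100, but it varies with the seed
```

Because every line is decided on its own, the number of lines returned is random; only its expected value is fixed by \texttt{p}.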
86 | 87 | \subsection{Other Readers}\label{other-readers} 88 | 89 | To keep external dependencies to a minimum, only the base R 90 | \texttt{read.csv()} and \texttt{readLines()} are directly supported. If 91 | you wish to use other readers like \texttt{fread()} from 92 | \textbf{data.table}~\cite{datatable} or \texttt{read\_csv()} from 93 | \textbf{readr}~\cite{readr}, you can pass one of these as the 94 | argument \texttt{reader} to the \code{sample_csv()} function. Examples 95 | of this can be found in the Benchmarks section below. 96 | 97 | Note that if the downsampling is sufficiently aggressive, you will 98 | probably not notice any performance improvement using these better 99 | readers over \texttt{read.csv()}. 100 | 101 | 102 | 103 | \section{Benchmarks}\label{benchmarks} 104 | 105 | \subsection{Benchmark Setup}\label{benchmark-setup} 106 | 107 | All benchmarks were performed using: 108 | 109 | \begin{itemize} 110 | \item 111 | A Core i5-2500K CPU @ 3.30GHz with a platter HDD 112 | \item 113 | R 3.3.1 114 | \item 115 | Package versions: 116 | 117 | \begin{itemize} 118 | \item 119 | 0.4-0 of \textbf{filesampler} 120 | \item 121 | 1.0.0 of \textbf{readr} 122 | \item 123 | 1.9.6 of \textbf{data.table} 124 | \end{itemize} 125 | \end{itemize} 126 | 127 | Each timing shown is the result of running each test twice and taking 128 | the best (lowest) time. This should help ensure that neither file 129 | caching nor bad RNG behavior makes one reader look unreasonably slow 130 | compared to the others. For each of the tests, there was no major 131 | difference in run times between the two runs. 132 | 133 | The file \texttt{big.csv} referred to below was generated with the 134 | included \texttt{makebig} script found in the \texttt{inst/benchmarks/} tree 135 | of the source for this package.
136 | 137 | \subsection{The Benchmarks}\label{the-benchmarks} 138 | 139 | The package should perform very well, provided the total number of lines 140 | sampled is fairly small. For example, consider the file: 141 | 142 | \begin{lstlisting}[language=rr] 143 | file <- "/tmp/big.csv" 144 | 145 | library(memuse) 146 | Sys.filesize(file) 147 | ## 976.868 MiB 148 | 149 | library(filesampler) 150 | wc_l(file) 151 | ## file: /tmp/big.csv 152 | ## Lines: 16000001 153 | \end{lstlisting} 154 | 155 | We can read in approximately 0.1\% of the input file quite quickly: 156 | 157 | \begin{lstlisting}[language=rr] 158 | system.time(small <- sample_csv(file, .001)) 159 | ## user system elapsed 160 | ## 1.044 0.128 1.172 161 | \end{lstlisting} 162 | 163 | Compare this to the time spent reading the entire file: 164 | 165 | \begin{lstlisting}[language=rr] 166 | system.time(full <- read.csv(file)) 167 | ## user system elapsed 168 | ## 72.988 0.500 73.491 169 | 170 | system.time(full <- read_csv(file, progress=FALSE)) 171 | ## user system elapsed 172 | ## 27.316 0.276 27.595 173 | 174 | system.time(full <- fread(file)) 175 | ## user system elapsed 176 | ## 12.328 0.100 12.430 177 | \end{lstlisting} 178 | 179 | Note the difference in memory usage: 180 | 181 | \begin{lstlisting}[language=rr] 182 | dim(small) 183 | ## 16036 6 184 | memuse(small) 185 | ## 568.477 KiB 186 | 187 | dim(full) 188 | ## [1] 16000000 6 189 | memuse(full) 190 | ## 549.321 MiB 191 | \end{lstlisting} 192 | 193 | Also notice that the in-memory file size of the full file times the 194 | proportion of lines read in is roughly the size of the downsampled file: 195 | 196 | \begin{lstlisting}[language=rr] 197 | memuse(full) * .001 198 | ## 562.505 KiB 199 | \end{lstlisting} 200 | 201 | Obviously the \textbf{filesampler} strategy is valuable only for 202 | aggressive downsampling, and not resampling. 
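The row count of 16036 reported above is in line with what proportional sampling should give. As a rough check, assuming each of the 16,000,000 data rows is kept independently with probability 0.001 (so that the count is binomial), the expected count and its spread can be worked out directly. The variable names below are for illustration only.

```r
# Rough sanity check of the numbers reported above, under the assumption
# that each row is kept independently with probability p.
n <- 16000000   # data rows in big.csv, as reported by dim(full)
p <- 0.001      # sampling proportion

expected <- n * p                # 16000 rows on average
spread <- sqrt(n * p * (1 - p))  # binomial standard deviation, about 126.4

# the observed 16036 rows sit about 0.28 standard deviations above the mean
z <- (16036 - expected) / spread

# the same proportionality applies to memory: MiB to KiB, scaled by p
scaled_kib <- 549.321 * 1024 * p  # about 562.5 KiB
```

A deviation of well under one standard deviation is unremarkable, so a different seed would be expected to give a count within a few hundred rows of 16000.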
203 | 204 | \subsection{Combining filesampler with Other 205 | Readers}\label{combining-filesampler-with-other-readers} 206 | 207 | The bulk of the time spent in \texttt{sample\_csv()} is not in the csv 208 | reading/parsing itself, but rather in the downsampling. Consider: 209 | 210 | \begin{lstlisting}[language=rr] 211 | # the default reader 212 | system.time(small <- sample_csv(file, .001, reader=read.csv)) 213 | ## user system elapsed 214 | ## 1.004 0.172 1.180 215 | 216 | # readr 217 | system.time(small <- sample_csv(file, .001, reader=read_csv)) 218 | ## user system elapsed 219 | ## 1.024 0.108 1.130 220 | 221 | # data.table 222 | system.time(small <- sample_csv(file, .001, reader=fread)) 223 | ## user system elapsed 224 | ## 0.928 0.184 1.112 225 | \end{lstlisting} 226 | 227 | The read times do indeed go down, but not incredibly; the difference is 228 | probably unnoticeable to a human. For larger reads, the times are more 229 | significant. We can quickly see this by reading in 1\% of the data: 230 | 231 | \begin{lstlisting}[language=rr] 232 | system.time(small <- sample_csv(file, .01, reader=read.csv)) 233 | ## user system elapsed 234 | ## 1.712 0.136 1.855 235 | 236 | system.time(small <- sample_csv(file, .01, reader=read_csv)) 237 | ## user system elapsed 238 | ## 1.236 0.160 1.397 239 | 240 | system.time(small <- sample_csv(file, .01, reader=fread)) 241 | ## user system elapsed 242 | ## 1.100 0.148 1.246 243 | \end{lstlisting} 244 | 245 | \subsection{Exact Sampling}\label{exact-sampling} 246 | 247 | Recall that we can sample an exact number of lines using a reservoir 248 | sampler, rather than drawing lines proportionally. We can enable this 249 | behavior by setting \texttt{method="exact"} in \texttt{sample\_csv()}. 
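The two-pass structure of exact sampling can be sketched in base R as follows. This is not the package's implementation: the helper names are hypothetical, and base R's \texttt{sample.int()} stands in for the reservoir sampler when choosing which line numbers to keep. The first pass counts lines; the second pass collects the pre-selected ones.

```r
# Illustrative sketch only: exact-size sampling in two passes over a file.
# Both helpers are hypothetical and are not filesampler functions.
count_lines_sketch <- function(path, chunk = 10000L)
{
  con <- file(path, open = "r")
  on.exit(close(con))

  total <- 0L
  repeat
  {
    k <- length(readLines(con, n = chunk))
    if (k == 0L)
      break
    total <- total + k
  }

  total
}

exact_lines_sketch <- function(path, n, chunk = 10000L)
{
  # pass 1: find out how many lines there are
  total <- count_lines_sketch(path, chunk)
  if (n > total)
    stop("cannot sample more lines than the file contains")

  # choose the line numbers to keep ahead of time (sample.int() is used
  # here in place of a reservoir sampler)
  keep <- sort(sample.int(total, n))

  # pass 2: collect the pre-selected lines, chunk by chunk
  con <- file(path, open = "r")
  on.exit(close(con))

  kept <- character(0)
  offset <- 0L
  repeat
  {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L)
      break

    # line numbers that fall in this chunk, shifted to chunk-local positions
    idx <- keep[keep > offset & keep <= offset + length(lines)] - offset
    kept <- c(kept, lines[idx])
    offset <- offset + length(lines)
  }

  kept
}

# small demonstration on a throwaway file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)

set.seed(1234)
out <- exact_lines_sketch(tmp, 25, chunk = 64L)
length(out)  # 25
```

Unlike proportional sampling, the size of the result is fixed in advance, at the cost of the extra counting pass.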
250 | 251 | Before we were sampling about 16,000 and 160,000 lines (for 252 | \texttt{p=0.001} and \texttt{p=0.01} respectively), so we'll use those 253 | values for the exact sampler: 254 | 255 | \begin{lstlisting}[language=rr] 256 | system.time(small <- sample_csv(file, 16000, method="exact")) 257 | ## user system elapsed 258 | ## 1.144 0.292 1.437 259 | 260 | system.time(small <- sample_csv(file, 160000, method="exact")) 261 | ## user system elapsed 262 | ## 1.892 0.232 2.124 263 | \end{lstlisting} 264 | 265 | In each case, the times are a bit slower, reflecting some extra work 266 | needed in concocting the reservoir (for example, getting line counts for 267 | the file). 268 | 269 | \subsection{Conclusions and Summary}\label{conclusions-and-summary} 270 | 271 | Based on the benchmarks, we find: 272 | 273 | \begin{itemize} 274 | \item 275 | If you want to very quickly read in a small amount of data, 276 | \textbf{filesampler} is very effective. The canonical use cases for 277 | this are if you: 278 | 279 | \begin{itemize} 280 | \item 281 | want to take a quick, unbiased peek at the data 282 | \item 283 | don't have enough memory to read in/operate on the full data set 284 | \end{itemize} 285 | \item 286 | If your data isn't very big, you can probably comfortably read it in 287 | very quickly with \textbf{data.table}'s \texttt{fread()}. 288 | \end{itemize} 289 | 290 | Finally, if your csv is very large, say greater than 10GiB, I would 291 | strongly encourage you to choose another (preferably binary) format. I 292 | have seen terabyte-sized csv's, but it's not a good idea! 293 | 294 | 295 | 296 | \section{Implementation Details}\label{implementation-details} 297 | 298 | Here we describe the implementation details for the file sampling 299 | functions. The other package export, \texttt{wc()}, has been 300 | aggressively optimized; but its implementation is not very interesting, 301 | so we do not belabor it here.
We will also only focus on the csv 302 | samplers, as in fact the raw text and csv schemes are almost identical. 303 | 304 | The proportional sampling is handled by the function 305 | \texttt{file\_sample\_prop()}, which samples lines at a given proportion. 306 | As the input file is scanned line-by-line, lines are randomly selected 307 | to be placed into a temporary file at the given proportion. This 308 | requires only one pass through the file. On the other hand, the exact 309 | sampling is handled by \texttt{file\_sample\_exact()}. Here we use a 310 | reservoir sampler to determine ahead of time which lines will be read, 311 | and then pass through the input file, dumping the pre-selected lines to 312 | the temporary file. This requires two passes through the file, since we 313 | need to know the total number of lines. In each case, after the 314 | downsampling takes place we read the temporary file into R using one of 315 | its csv readers. 316 | 317 | 318 | 319 | \addcontentsline{toc}{section}{References} 320 | \bibliography{./include/filesampler} 321 | \bibliographystyle{plain} 322 | 323 | \end{document} 324 | -------------------------------------------------------------------------------- /vignettes/include/00-acknowledgement.tex: -------------------------------------------------------------------------------- 1 | \section*{Disclaimer} 2 | Any opinions, findings, and conclusions or recommendations expressed in this 3 | material are those only of the authors. The findings and conclusions in this 4 | article should not be construed to represent any determination or policy of 5 | University, Agency, Administration and National Laboratory. 6 | 7 | 8 | This manual may be incorrect or out-of-date. The author(s) assume 9 | no responsibility for errors or omissions, or for damages resulting 10 | from the use of the information contained herein. 11 | 12 | This publication was typeset using \LaTeX. 
13 | 14 | \vfill 15 | 16 | \null 17 | \vfill 18 | \copyright\ 2015--2016 Drew Schmidt. 19 | 20 | Permission is granted to make and distribute verbatim copies of 21 | this vignette and its source provided the copyright notice and 22 | this permission notice are preserved on all copies. 23 | -------------------------------------------------------------------------------- /vignettes/include/filesampler.bib: -------------------------------------------------------------------------------- 1 | @Manual{filesampler, 2 | title = {{filesampler: File Line Sampler}}, 3 | author = {Drew Schmidt}, 4 | note = {R package version 0.4-0}, 5 | url = {https://github.com/wrathematics/filesampler} 6 | } 7 | 8 | @Manual{fast_sample, 9 | title = {{fast\_sample}}, 10 | author = {Eduardo Arino de la Rubia}, 11 | url = {https://github.com/earino/fast_sample} 12 | } 13 | 14 | @Manual{readr, 15 | title = {readr: Read Tabular Data}, 16 | author = {Hadley Wickham and Romain Francois}, 17 | year = {2015}, 18 | note = {R package version 0.2.2}, 19 | url = {https://CRAN.R-project.org/package=readr}, 20 | } 21 | 22 | @Manual{datatable, 23 | title = {data.table: Extension of Data.frame}, 24 | author = {M Dowle and A Srinivasan and T Short and S Lianoglou with contributions from R Saporta and E Antonyan}, 25 | year = {2015}, 26 | note = {R package version 1.9.6}, 27 | url = {https://CRAN.R-project.org/package=data.table}, 28 | } 29 | -------------------------------------------------------------------------------- /vignettes/include/settings.tex: -------------------------------------------------------------------------------- 1 | %------------------------------------------------------------------------------- 2 | % basic settings 3 | %------------------------------------------------------------------------------- 4 | 5 | \usepackage{parskip} 6 | \setlength{\parskip}{.3cm} 7 | 8 | \makeatletter 9 | \newcommand\code{\bgroup\@makeother\_\@makeother\~\@makeother\$\@codex} 10 |
\def\@codex#1{{\normalfont\ttfamily\hyphenchar\font=-1 #1}\egroup} 11 | \makeatother 12 | %%\let\code=\texttt 13 | \let\proglang=\textsf 14 | 15 | 16 | \usepackage[margin=1.1in]{geometry} 17 | \usepackage{graphicx} 18 | \usepackage{listings} 19 | \usepackage{hyperref} 20 | \usepackage{xcolor} 21 | \usepackage{xspace} 22 | 23 | \definecolor{gray}{rgb}{.6,.6,.6} 24 | \definecolor{dkgray}{rgb}{.3,.3,.3} 25 | \definecolor{grayish}{rgb}{.9, .9, .9} 26 | \definecolor{dkgreen}{rgb}{0,0.6,0} 27 | \definecolor{mauve}{rgb}{0.58,0,0.82} 28 | 29 | \hypersetup{ 30 | pdfnewwindow=true, 31 | colorlinks=true, 32 | linkcolor=blue, 33 | linkbordercolor=blue, 34 | citecolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue 37 | } 38 | 39 | \lstdefinelanguage{rr}{ 40 | language=R, 41 | basicstyle=\ttfamily\color{black}, 42 | backgroundcolor=\color{grayish}, 43 | frame=single, 44 | breaklines=true, 45 | keywordstyle=\color{blue}, 46 | commentstyle=\color{dkgreen}, 47 | stringstyle=\color{mauve}, 48 | numbers=left,%none, 49 | numberstyle=\tiny\color{dkgray}, 50 | stepnumber=1, 51 | numbersep=8pt, 52 | showspaces=false, 53 | showstringspaces=false, 54 | showtabs=false, 55 | rulecolor=\color{gray}, 56 | tabsize=4, 57 | captionpos=t, 58 | } 59 | 60 | 61 | 62 | %------------------------------------------------------------------------------- 63 | % title settings 64 | %------------------------------------------------------------------------------- 65 | 66 | \newcommand*{\plogo}{\includegraphics[scale=.5]{./include/uch_small}} 67 | 68 | \makeatletter 69 | \newcommand\mytitle[1]{\renewcommand\@mytitle{#1}} 70 | \newcommand\@mytitle{} 71 | \makeatother 72 | 73 | \makeatletter 74 | \newcommand\mysubtitle[1]{\renewcommand\@mysubtitle{#1}} 75 | \newcommand\@mysubtitle{} 76 | \makeatother 77 | 78 | \makeatletter 79 | \newcommand\myversion[1]{\renewcommand\@myversion{#1}} 80 | \newcommand\@myversion{} 81 | \makeatother 82 | 83 | 84 | \newcommand{\myauthor}[1]{\author{#1}} 85 | 86 | 87 | 
%------------------------------------------------------------------------------- 88 | % TITLE PAGE 89 | %------------------------------------------------------------------------------- 90 | 91 | \makeatletter 92 | \newcommand*{\titleGP}{\begingroup 93 | \centering 94 | \vspace*{\baselineskip} 95 | 96 | \rule{\textwidth}{1.6pt}\vspace*{-\baselineskip}\vspace*{2pt} 97 | \rule{\textwidth}{0.4pt}\\[\baselineskip] 98 | 99 | {\Huge \scshape\@mytitle \\[0.2\baselineskip] } 100 | 101 | \rule{\textwidth}{0.4pt}\vspace*{-\baselineskip}\vspace{3.2pt} 102 | \rule{\textwidth}{1.6pt}\\[\baselineskip] 103 | 104 | \scshape % Small caps 105 | \@mysubtitle \ \\[\baselineskip] % Tagline(s) or further 106 | \today\par % Location and year 107 | 108 | \vspace*{2\baselineskip} % Whitespace between location/year and editors 109 | 110 | {\large \@author \par} 111 | 112 | \vfill % Whitespace between editor names and publisher logo 113 | 114 | \plogo \\[0.3\baselineskip] % Publisher logo 115 | {Version \@myversion} \\[0.3\baselineskip] % Year published 116 | % {\large THE PUBLISHER}\par % Publisher 117 | 118 | \endgroup} 119 | \makeatother 120 | 121 | 122 | 123 | %------------------------------------------------------------------------------- 124 | % first few 125 | %------------------------------------------------------------------------------- 126 | 127 | \usepackage{lastpage} 128 | \usepackage{fancyhdr} 129 | 130 | \pagestyle{fancy} 131 | 132 | \newcommand{\prebodyheadfoot}{ 133 | \fancyhf{} % clear all header and footer fields 134 | \fancyfoot{} 135 | \renewcommand{\headrulewidth}{0pt} 136 | \renewcommand{\footrulewidth}{0pt} 137 | 138 | % redefinition of the plain style: 139 | \fancypagestyle{plain}{% 140 | \fancyhf{} % clear all header and footer fields 141 | \renewcommand{\headrulewidth}{0pt} 142 | \renewcommand{\footrulewidth}{0pt}} 143 | } 144 | 145 | \newcommand{\bodyheadfoot}{ 146 | \fancyhf{} % clear all header and footer fields 147 | % \fancyhead[L]{\slshape \rightmark} 148 | 
\fancyhead[L]{\slshape \leftmark} 149 | \fancyhead[R]{ \thepage\ of\ \pageref{LastPage}} 150 | \renewcommand{\headrulewidth}{1pt} 151 | \renewcommand{\footrulewidth}{0pt} 152 | 153 | % redefinition of the plain style: 154 | \fancypagestyle{plain}{% 155 | \fancyhf{} % clear all header and footer fields 156 | \renewcommand{\headrulewidth}{0pt} 157 | \renewcommand{\footrulewidth}{0pt}} 158 | } 159 | 160 | 161 | 162 | \newcommand{\makefirstfew}{% 163 | \prebodyheadfoot 164 | \restoregeometry 165 | 166 | 167 | \titleGP 168 | \newpage 169 | 170 | \input{./include/00-acknowledgement} 171 | \newpage 172 | \pagenumbering{roman} 173 | \tableofcontents 174 | \newpage 175 | 176 | 177 | \pagenumbering{arabic} 178 | \setcounter{page}{1} 179 | 180 | 181 | \bodyheadfoot 182 | \pagenumbering{arabic} 183 | \setcounter{page}{1} 184 | \pagestyle{fancy} 185 | } -------------------------------------------------------------------------------- /vignettes/include/uch_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wrathematics/filesampler/b3683ea15888b0da6e1f983233c395d23cf9e2b6/vignettes/include/uch_small.png --------------------------------------------------------------------------------