├── README.org
├── RWiki
├── admin_application_2012.txt
├── admin_application_2013.txt
├── biodiversity.txt
├── dimension_reduction_methods_for_multivariate_time_series.txt
├── gci2011.txt
├── gci2011
│   ├── debianrinstaller.txt
│   ├── guidependency.txt
│   ├── hyperspecplotspc.txt
│   ├── hyperspecplotspctests.txt
│   ├── hyperspecroxy2.txt
│   ├── hyperspecspc.txt
│   ├── hyperspecspctests.txt
│   ├── hyperspecweb.txt
│   └── softclassvalweb.txt
├── gsco2010
│   └── rbeancounter.txt
├── gsoc2010.txt
├── gsoc2010
│   ├── adinr.txt
│   ├── blogweave.txt
│   ├── cran_stats.txt
│   ├── crantastic2.txt
│   ├── gpclib_replacement.txt
│   ├── interactive-check.txt
│   ├── nosql_interface.txt
│   ├── ritk.txt
│   ├── rpy2ff.txt
│   ├── rserve_javascript.txt
│   ├── sra.txt
│   ├── suggestion_form.txt
│   ├── syrfr.txt
│   ├── template.txt
│   └── vrs.txt
├── gsoc2011.txt
├── gsoc2011
│   ├── apim.txt
│   ├── convex_optimization.txt
│   ├── cranvastime.txt
│   ├── dclusterm.txt
│   ├── dendro.txt
│   ├── explorestoch.txt
│   ├── gibbstrading.txt
│   ├── huge.txt
│   ├── image_analysis.txt
│   ├── manipulate.txt
│   ├── msfga.txt
│   ├── optile.txt
│   ├── panorama.txt
│   ├── r-em-accelerator.txt
│   ├── r-openmp.txt
│   ├── ropenclcuda.txt
│   ├── smart.txt
│   ├── sra_with_roles.txt
│   ├── tradeanalytics.txt
│   ├── wikipediar.txt
│   └── yarrcache.txt
├── gsoc2012.txt
├── gsoc2012
│   ├── hyperspec.txt
│   ├── image_labels.txt
│   ├── interactive_animations.txt
│   ├── knitr.txt
│   ├── mentor.txt
│   ├── meucci.txt
│   ├── optim.c.txt
│   ├── performanceanalytics.txt
│   ├── performancebenchmarking.txt
│   ├── portfolioanalytics.txt
│   ├── rmaxima.txt
│   ├── ropensci.txt
│   ├── rtaq.txt
│   ├── sp_econometrics.txt
│   ├── student.txt
│   ├── vectorizedoptim.txt
│   ├── wikipediar.txt
│   └── xts.txt
├── gsoc2013.txt
├── gsoc2013
│   ├── factors.txt
│   ├── iid_perfa.txt
│   ├── import_graphics.txt
│   ├── meucci.txt
│   ├── mpiprofiler.txt
│   ├── nso.txt
│   ├── porta.txt
│   ├── rapport.txt
│   ├── right.txt
│   ├── rtaq.txt
│   ├── spectralunmixing.txt
│   └── xts.txt
├── gsoc2014.txt
├── gsoc2014
│   ├── animint.txt
│   ├── bdvis.txt
│   ├── bio-interactive-web-plots.txt
│   ├── bio-mzr.txt
│   ├── citools.txt
│   ├── covariance_matrix.txt
│   ├── dimension_reduction_methods_for_multivariate_time_series.txt
│   ├── discrimination.txt
│   ├── enmgadgets.txt
│   ├── factoranalytics.txt
│   ├── manifold_optimization.txt
│   ├── pander.txt
│   ├── pbdprof.txt
│   ├── portfolioanalytics.txt
│   ├── right.txt
│   ├── roughset.txt
│   ├── spotvol.txt
│   └── tsvisualization.txt
├── gsoc2014_application.txt
├── highly_parallelized_implementations_of_robust_statistical_procedures_using_cuda.txt
├── projects.txt
├── sidebar.txt
├── student_app_template.txt
├── melange-banner.png
└── summeR-of-code.png

/README.org:
--------------------------------------------------------------------------------
This repository will be used to organize the R project's participation
in the Google Summer of Code 2015 (R-GSOC-2015)

** GitHub wiki pages

- [[https://github.com/rstats-gsoc/gsoc2015/wiki][Overview of R-GSOC-2015]].
- [[https://github.com/rstats-gsoc/gsoc2015/wiki/table-of-proposed-coding-projects][Table of project ideas]].

** Links

- [[http://www.google-melange.com/gsoc/homepage/google/gsoc2015][Official Google Summer of Code 2015 page on melange]].
--------------------------------------------------------------------------------
/RWiki/biodiversity.txt:
--------------------------------------------------------------------------------
**Title:** Biodiversity data visualization in R

**Description:**

R is increasingly being used in biodiversity information analysis. There are several R packages, like rgbif and rvertnet in the rOpenSci suite, to query, download and, to some extent, analyse the data within an R workflow. We also have packages like dismo, randomForest and SDMTools for modelling the data. It would be useful to have a package to quickly visualize biodiversity data. These visualizations would help users understand the extent of geographical, taxonomic and temporal coverage, and the gaps and biases in the data.

The functions provided would cover the following tasks:

* Data preparation - The data needs to be converted into a format suitable for visualization and analysis, i.e. dates, taxonomic classifications and geographical co-ordinates should be in uniform and usable formats.

* Geographic coverage - functions to visualize the data points on maps, and density maps at different scales (country level, degree grid and so on).

* Temporal coverage - functions to visualize the data on different temporal scales, like monthly / annual counts, seasonality of data by months, weeks or days, or a chronohorogram. [Ref: http://dx.doi.org/10.1093/bioinformatics/bts359]

* Taxonomic coverage - functions to visualize the taxonomic coverage of the data in tree map format, by number of records per species and number of species covered.

* Completeness analysis - functions to assess and visualize the completeness of the biodiversity inventory of a region. [Ref: http://dx.doi.org/10.1111/j.0906-7590.2007.04627.x]

**Mentor(s):** Javier Otegui

--------------------------------------------------------------------------------
/RWiki/gci2011.txt:
--------------------------------------------------------------------------------
====== Google Code-In 2011 ======

**!!!This page is to collect possible projects for the Google Code-In 2011. If you want to propose a project or participate in this year's GCI with R, please join our Google group:

[[https://groups.google.com/group/gci-r|gci-r@googlegroups.com]]!!!**

Details of the competition can be found [[http://google-opensource.blogspot.com/2011/10/google-code-in-are-you-in.html|here]].

===== Status and Timeline =====

* [[http://google-opensource.blogspot.com/2011/10/google-code-in-are-you-in.html|Nov 9]]: Google announces whether the R Foundation is accepted as a mentoring organisation for 2011, and proposals for R projects start to be considered.

* Nov 21: GCI goes live for 13-17 year olds, so all projects must be finalized before this.

===== Project Ideas =====

Project ideas can be listed below on per-topic wiki pages. //Please// follow the formatting used for the [[developers:projects:gsoc2010:template]] project.

==== Project proposals ====

* [[developers:projects:gci2011:hyperspecroxy2|Make documentation compatible with roxygen2]]
* [[developers:projects:gci2011:hyperspecweb|Polish hyperSpec's web page]]
* [[developers:projects:gci2011:softclassvalweb|Polish softclassval's web page]]
* [[developers:projects:gci2011:guidependency|Dependency graph for GUIs in R]]
* Import functions for various spectroscopic data formats.
  * .spc - see [[developers:projects:gci2011:hyperspecspc]] and [[developers:projects:gci2011:hyperspecspctests]]
  * I could pose similar tasks for other file formats, but would rather do that on request, as I need to contact the manufacturers with regard to the rights to publish the code open source
* [[developers:projects:gci2011:hyperspecplotspc|Refactor hyperSpec's spectra plotting function]] and [[developers:projects:gci2011:hyperspecplotspctests|write tests for the new functions]]
* [[developers:projects:gci2011:debianRinstaller|Combined installer for debianized and regular R packages]]

To add a project, use the [[developers:projects:gsoc2010:template]] //Copy this page to add a new proposal//

--------------------------------------------------------------------------------
/RWiki/gci2011/debianrinstaller.txt:
--------------------------------------------------------------------------------
====== debianRinstaller ======

**Task Summary:** Write a function/package that installs a package from the Debian R repository if the package is available there, and otherwise installs it from a regular CRAN repository.

**Task Description:**
- Check that the function has the required inputs
- Look for the package name on the Debian R PPA
- If it exists, run the Debian installer (apt, aptitude, etc.)
- else run the install.packages() function to install from a CRAN repository
- Provide appropriate messages and return values.
- Document the function, though if done well it should be self-explanatory.

**Skills needed:**
- sufficient knowledge of R to understand install.packages() and related functions
- sufficient knowledge of Debian and related distributions (Ubuntu and others) to know about the .deb infrastructure
- ability to research details of the tools above as needed, e.g., to search for the package
- ability to write short scripts in R and in some other language that can interact with the Debian package infrastructure (Python, Perl, C, C++, etc.). The R package is envisaged as having a fairly simple single my.install.packages() function that will use system() to call a script in another language. (The mentor would likely use Perl, as that is most familiar to him.)

**Rating:** Medium (likely easy if you know how; the difficulty will be in getting the right information about the Debian infrastructure and its interfacing to R).

**Category:** This is a code and interface task.

**Mentor:** John C. Nash (nashjc _at_ uottawa.ca) 2011-11-07.

--------------------------------------------------------------------------------
/RWiki/gci2011/guidependency.txt:
--------------------------------------------------------------------------------
====== Graph of Dependencies for different GUI toolkits in R ======

**Task Summary:** Develop a figure that shows the use of the different GUI toolkits available in R.

**Task Description:**
The graph should show the different GUI toolkits, and how many / which packages rely on them, i.e. which packages:
- enhance the toolkit
- depend on the toolkit
- suggest the toolkit

One difficulty that needs to be solved is that the package dependency description cannot specify that a package depends on one out of a list of alternative packages (see e.g. gWidgets). These alternatives are hopefully listed in Suggests.

[[http://csgillespie.wordpress.com/2011/03/23/graphical-display-of-r-package-dependencies/|Colin Gillespie's blog article]] is a good starting point.
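
As a rough illustration of one possible approach (the helper below is a hypothetical sketch, not part of the task description), the dependency fields can be scanned directly from the CRAN metadata:

<code R>
## Minimal sketch: count CRAN packages that Depend on / Import / Suggest /
## Enhance a given GUI toolkit package, e.g. "RGtk2" or "tcltk2".
toolkit_usage <- function(toolkit, db = available.packages()) {
  fields <- c("Depends", "Imports", "Suggests", "Enhances")
  sapply(fields, function(f) {
    deps <- db[, f]
    deps[is.na(deps)] <- ""
    ## crude match of the package name inside the comma-separated field
    sum(grepl(paste0("\\b", toolkit, "\\b"), deps))
  })
}
## toolkit_usage("RGtk2")
</code>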
**Skills needed:** R

**Rating:** Medium

**Category:** Research

**Mentor:** [[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 2011/11/04

--------------------------------------------------------------------------------
/RWiki/gci2011/hyperspecplotspc.txt:
--------------------------------------------------------------------------------
====== Refactoring of plotspc ======

**Task Summary:** Divide the function into suitable helper functions to allow easier writing of plotting functions with graphics systems other than base graphics, e.g. lattice, ggplot2, acinonyx(?)

**Task Description:**
Helper functions for
- preparation of the data
- the axes
  - x-axis: cut axes
  - y-axis: "stacking" of spectra
- plotting of filled areas (polygons)
- plotting of lines
- ...
should replace the corresponding parts of ''plotspc''.

**Skills needed:** R

**Rating:** Hard

**Category:** Code (there is a corresponding Quality Assurance task)

**Mentor:** [[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 2011/11/04

--------------------------------------------------------------------------------
/RWiki/gci2011/hyperspecplotspctests.txt:
--------------------------------------------------------------------------------
====== plotspc helper functions: tests ======

**Task Summary:** Write tests to ensure proper working of the helper functions for plotting spectra (corresponding task)

**Task Description:** Tests should be written using svUnit.

**Skills needed:** This is a follow-up task to the refactoring of the spectra plotting function

**Rating:** medium

**Category:** Quality Assurance

**Mentor:** [[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 2011/11/04

--------------------------------------------------------------------------------
/RWiki/gci2011/hyperspecroxy2.txt:
--------------------------------------------------------------------------------
====== Refactor Doc Comments to use roxygen2 ======

**Task Summary:** Make sure the doc comments work with roxygen2

**Task Description:** Currently, hyperSpec uses a patched version of roxygen for creating the man pages. As roxygen is no longer under active development and roxygen2 is available instead, the documentation should be adapted.

- check all existing doc comments for their compatibility with roxygen2
- track down incompatibilities (these will be particularly related to S4 documentation; this list may serve as a wishlist for roxygen2)
- correct the problems

**Skills needed:** R

**Rating:** Medium

**Category:** Documentation

**Mentor:** [[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 2011/11/04

--------------------------------------------------------------------------------
/RWiki/gci2011/hyperspecspc.txt:
--------------------------------------------------------------------------------
====== .spc import for hyperSpec ======

**Task Summary:** Write an import function for spectra stored in the .spc file format.

**Task Description:** The spc file format is a binary format to store spectroscopic data. The specification is available ([[https://ftirsearch.com/features/converters/gspc_udf.zip|zipped specification]]); it is a typical binary structure containing header / subheader and data sections.
hyperSpec is an R package for working with spectroscopic data; it should be extended so that .spc files can be imported.

A pure R version already exists, but is too slow for some applications.

**Skills needed:** Reading of the file should preferably be done in C/C++, which can be directly invoked via the Rcpp package.

**Rating:** hard (the file format has lots of subformats)

**Category:** Code (note there is a corresponding Quality Assurance task)

**Mentor:** [[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 2011/11/04

--------------------------------------------------------------------------------
/RWiki/gci2011/hyperspecspctests.txt:
--------------------------------------------------------------------------------
====== .spc import for hyperSpec: Test Suite ======

**Task Summary:** Write tests to ensure proper working of the .spc import function (corresponding task)

**Task Description:** A number of .spc files are available as test cases

**Skills needed:** This is a follow-up task to the coding of the import function. Tests should be written using svUnit.

**Rating:** medium

**Category:** Quality Assurance

**Mentor:** [[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 2011/11/04

--------------------------------------------------------------------------------
/RWiki/gci2011/hyperspecweb.txt:
--------------------------------------------------------------------------------
====== hyperSpec's web page ======

**Task Summary:** Give hyperSpec a nice homepage

**Task Description:** hyperSpec is hosted at [[http://hyperSpec.r-forge.r-project.org]] and needs a nice web page. The content is there in plain HTML, and a good style sheet would considerably improve the look and usability of the web page.

**Skills needed:** css/html/(php)

**Rating:** easy

**Category:** outreach

--------------------------------------------------------------------------------
/RWiki/gci2011/softclassvalweb.txt:
--------------------------------------------------------------------------------
====== Web page for softclassval ======

**Task Summary:** Write a web page for softclassval

**Task Description:** [[http://softclassval.r-forge.r-project.org|Softclassval]] needs a nice web page, too. Write/find a nice CSS and update the content of the page.

**Skills needed:** html/css/(php)

**Rating:** easy

**Category:** outreach

--------------------------------------------------------------------------------
/RWiki/gsco2010/rbeancounter.txt:
--------------------------------------------------------------------------------
====== Port Beancounter to R and extend it ======

**Summary:** Port the Beancounter (Perl) program to R and extend it by leveraging the many tools available in R

**Description:** [[http://dirk.eddelbuettel.com/code/beancounter.html|Beancounter]] is a reasonably simple yet mature Perl program that uses Perl modules to download financial data, stores it in SQL backends (with support for PostgreSQL, MySQL, SQLite and ODBC) and analyses it with a number of simple canned reports such as profit/loss and risk reports.

When Beancounter was written, R had none of the numerous finance packages available today (see e.g. the [[http://cran.r-project.org/web/views/Finance.html|Finance Task View]]). It now makes more sense to have this functionality in R, and to extend it from R with e.g. Sweave for reports etc.
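
The Test below asks for exactly this core data step; a minimal sketch (using the quantmod package as one possible download layer; the ticker symbols are arbitrary examples):

<code R>
library(quantmod)

## Download adjusted closing prices for three securities and report the
## correlation and covariance matrices of their daily log returns.
symbols <- c("IBM", "GE", "KO")
prices <- do.call(merge, lapply(symbols, function(s)
  Ad(getSymbols(s, auto.assign = FALSE))))
returns <- na.omit(diff(log(prices)))

cor(returns)
cov(returns)
</code>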
**Skills required:** R programming and packaging skills, some understanding of database programming from R, an interest in finance

**Test:** Write a simple R package that downloads data for three securities and reports a correlation and covariance matrix.

**Mentor:** Dirk Eddelbuettel. Suggestion added 2010-03-10

--------------------------------------------------------------------------------
/RWiki/gsoc2010.txt:
--------------------------------------------------------------------------------
====== Google Summer of Code 2010 ======

===== Overview =====

This wiki page will be the central hub of information about the [[http://www.r-project.org|R Project]] participation in the [[http://socghop.appspot.com/site/home/site|Google Summer of Code]] (GSoC) for 2010.

{{:developers:projects:2010soclogo.jpg|:developers:projects:2010soclogo.jpg}}

Previous projects are listed on the [[http://www.r-project.org/soc08|soc08]] and [[http://www.r-project.org/soc09|soc09]] pages.

===== Status and Timeline =====

R has been accepted as one of the participating organisations as of March 18, 2010.

As per the [[http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline|GSoC 2010 timeline]]:

* the student application period does not start until March 29, and ends April 9;
* by April 21 all mentors have to be signed up, scoring/ranking has to be completed, and double applicants will then be filtered out;
* accepted proposals will be announced on April 26;
* coding commences May 24.

See [[http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline|the GSoC timeline]] for full details.

===== Project Ideas =====

Project ideas can be listed below on per-topic wiki pages. //Please// follow the formatting used for the first suggestion.

==== Projects with mentors ====

* [[developers:projects:gsoc2010:suggestion_form]] //Use this to add new proposals// by copying the page content
* [[developers:projects:gsoc2010:gpclib_replacement]] Replacement for gpclib. (R/GEOS)
* [[developers:projects:gsoc2010:nosql_interface]] NoSQL interface for R
* [[developers:projects:gsoc2010:cran_stats]] Aggregate CRAN package download statistics across multiple mirrors
* [[developers:projects:gsoc2010:crantastic2]] Enhancements to crantastic to encourage further community building
* [[developers:projects:gsoc2010:interactive-check]] Run R CMD check interactively during package development
* [[packages:cran:hyperspec:gsoc]] Extensions to the hyperSpec package for hyperspectral data
* [[developers:projects:gsoc2010:adinr]] Automatic Differentiation (AD) in R with ADMB, cppAD or other tools.
* [[developers:projects:gsoc2010:vrs]] Virtual R Supercomputer provides an R interface to BOINC, the distributed computing platform.
* [[developers:projects:gsoc2010:sra]] Social Relations Analyses in R
* [[developers:projects:gsco2010:rbeancounter]] Port Beancounter to R and extend it
* [[developers:projects:gsoc2010:syrfr]] Symbolic Regression capable of discovering the functions which fit input data best using genetic algorithm search
* [[developers:projects:gsoc2010:rpy2ff]] Persistent storage-based vectors and arrays used by both R and Python
* [[developers:projects:gsoc2010:rserve_javascript]] Javascript client for [[http://www.rforge.net/Rserve/index.html|Rserve]] using [[http://code.google.com/p/nixysa/|nixysa]]
* [[developers:projects:gsoc2010:blogweave]] Implement a Sweave driver for blog postings
* [[developers:projects:gsoc2010:ritk]] [[http://www.itk.org|ITK]] functionality in R

==== Proposal Suggestions ====

Submission of unsolicited proposals (via, say, the mailing list mentioned in the next paragraph) is certainly welcome. However, each proposal has a big barrier to pass: it will need to explain why this particular project is of use to the community, and it will have to show how it can be achieved in three months. At a minimum, a submission should also include a review of related packages that address the same or similar problems and discuss how these packages are not sufficient to the task.

===== Discussion List =====

A [[http://groups.google.com/group/gsoc-r|Google Groups mailing list]] has been created for mentors and prospective students. Please subscribe so that you can post to ''gsoc-r@googlegroups.com'' for any discussions pertaining to R and the GSoC 2010.

--------------------------------------------------------------------------------
/RWiki/gsoc2010/blogweave.txt:
--------------------------------------------------------------------------------
====== Sweave driver for programmatic blogging ======

**Summary:** Implement a new Sweave driver that posts both text and media content to blogs

**Description:** Sweave is a powerful mechanism for literate statistical programming, in which explanatory text is intermingled with R code and computed results. Blogging (by way of WordPress, MovableType, etc.) is a powerful mechanism for disseminating information across professional and social networks. Though Sweave drivers exist for the LaTeX, ODF, and asciidoc formats, there is not currently a driver that allows direct posting of blog content with embedded R code and calculated R output (including any image graphics).

**Skills required:** some exposure to Sweave, R programming, XML-RPC, and one of the popular blogging engines (e.g. WordPress)

**Mentor:** Aaron Mackey, Asst. Prof. of Public Health Sciences, Univ. of Virginia -- ajmackey@gmail.com

--------------------------------------------------------------------------------
/RWiki/gsoc2010/cran_stats.txt:
--------------------------------------------------------------------------------
====== Aggregating CRAN statistics ======

**Summary:** Aggregate download statistics across CRAN mirrors to gain a global view of the R package ecosystem.

**Description:** Currently each CRAN mirror collects statistics on package downloads, but there is no way to centrally aggregate all download statistics. This is a problem because it makes it difficult to get a global view of the R package ecosystem: how many people are downloading packages, and which packages are most popular.
This must be done in such a way that it is easy for CRAN mirror maintainers to install, and that does not compromise user privacy (e.g. individual IP addresses should be obfuscated).

An additional component would be to develop an opt-in R package that would allow users to send in their own package usage information.

There are also challenges associated with weeding out people who are downloading every package, and with taking package dependencies into account when computing rankings.

**Skills required:** Should be able to use R to perform data manipulation and aggregation. Familiarity with Apache web statistics a plus.

**Test:** A test an applicant has to pass in order to qualify for the topic

**Mentor:** Hadley Wickham. With guidance from Stefan Theussl and optionally Ian Fellows

--------------------------------------------------------------------------------
/RWiki/gsoc2010/crantastic2.txt:
--------------------------------------------------------------------------------
====== Crantastic ======

**Summary:** Continue building on crantastic to add more features needed by the community.

**Description:** Last year's Google Summer of Code project achieved its goals of implementing the basic infrastructure for users to find, rate and review R packages. Currently crantastic receives about 6,500 visitors a month, mainly (70%) as a result of searching for R packages using Google. Some ideas for possible enhancements are:

* intelligent package diffs, so it's easy to see what has changed between versions
* a comprehensive function index, to make it easy for users to see what package a function lives in and for developers to find possible conflicts between packages
* display top related conversations from the R-help mailing list
* other ideas from http://crantastic.uservoice.com/

The developer would also be expected to be a crantastic evangelist, engaging with the R community to encourage their participation in crantastic and figuring out what will be most useful for the community.

**Skills required:** Web design/development, Ruby on Rails. Some experience using R is also helpful.

**Test:** Fork the crantastic source code from http://github.com/hadley/crantastic and improve the existing source code in some way, either by refactoring existing code to make it cleaner and more maintainable or by implementing a small new feature.

**Mentor:** Hadley Wickham

--------------------------------------------------------------------------------
/RWiki/gsoc2010/gpclib_replacement.txt:
--------------------------------------------------------------------------------
====== Replacement for gpclib ======

**Summary:** Implement a replacement for the gpclib package using an existing external topology library

**Description:** The aim is to provide topology operations on points, lines and polygons in 2D using a free-software external library. The CRAN gpclib package has an unclear and non-free license, and only does polygon clipping. Work has started on a replacement using GEOS [[http://trac.osgeo.org/geos/]] on R-Forge [[https://r-forge.r-project.org/projects/rgeos/]] (project rgeos), but is delayed. Use of Boost ggl [[http://geometrylibrary.geodan.nl/]] is another possibility, and would probably be able to build on the build techniques used by CRAN RBGL, which also uses a Boost library.

**Skills required:** R programming, C programming for R and GEOS, ability to read C++ to "get behind" the GEOS C API, R package building.
For ggl, those plus C++ programming. Desirable in both cases: a willingness to understand the OGC SFS [[http://www.opengeospatial.org/standards/sfa]], which underlies both libraries.

**Test:** Write a short C program to demonstrate specified GEOS (or ggl) operations, reading input data from file and writing output data to file (using WKT or an agreed alternative format).

**Mentor:** Roger Bivand (who had hoped to have rgeos ready by now, but is seriously stuck, though the problem (operation failure with change of scale) may be very shallow). Suggested on 2010-02-24.

--------------------------------------------------------------------------------
/RWiki/gsoc2010/interactive-check.txt:
--------------------------------------------------------------------------------
====== Interactive R CMD check ======

**Summary:** Take the pain out of package development by providing continuous feedback while you code. Fix problems as they arise rather than all at once when you want to distribute the package.

**Description:** Running R CMD check is always a frustrating process, because it uncovers problems a long time after you have made them. The aim of this project is to make package development a breeze by providing continuous feedback about your code as you write it. Ideally, you could run this package in the background and it would alert you as soon as there is a potential problem.

It would be composed of a series of pluggable components that you could choose from:

* as many steps of R CMD check as possible
* run all tests with test_that / runit / svunit
* roxygenize to produce man files
* run all examples in documentation
* check for na.rm = FALSE and drop = FALSE
* some way of sending info to a remote server to monitor progress over time
* draft email for r-packages
* tweet your package release
* pre-release: proof-read the generic announcement and description
* post-release: tag in repos, update version in description
* create changelog from scm
* push to cran
* push to r-forge, github, ...
* push to windows builder
* load data
* load code
* compile C code

Ideally it would incorporate dependency description ideas from (e.g.) rake, so that it does the minimum amount of work.

**Skills required:** Intermediate to advanced R programming skills. Experience with code development in another language, and experience working with a dependency management framework.

Ideally, the applicant would also be able to understand Perl code, so that they could help convert existing checks, written in Perl, into R code.

**Test:** Implement code that performs one of the tasks described above.

**Mentor:** Hadley Wickham, with additional advice from Uwe Ligges.

Comments from others:

//[PhG comment: note that ''svUnit'' does a big part of it. It provides a mechanism to define tests and test units that can be run interactively in R. There is one existing GUI in ''SciViews'' that reports tests that fail or raise an error (similar presentation to, say, ''jUnit''). See the vignette of the ''svUnit'' package. The API is very simple and comprises less than ten functions, so it is very easy to reimplement it for other GUIs (''ESS/Emacs'', ''JGR'', ''RKward'', ''StatEt/Eclipse'', ...).
I am willing to collaborate and incorporate possible changes/improvements/additions made on top of ''svUnit'' in that package.]//

//[Hadley comment: testing is just a small part of this. It also involves running all the existing R CMD checks, documentation etc. I would imagine running svUnit tests as one of many modules.]//

--------------------------------------------------------------------------------
/RWiki/gsoc2010/nosql_interface.txt:
--------------------------------------------------------------------------------
====== NoSQL interface for R ======

**Summary:** Provide a new package that accesses a NoSQL database

**Description:** So-called [[http://en.wikipedia.org/wiki/NoSQL|NoSQL]] databases such as
* [[http://1978th.net/tokyocabinet/|TokyoCabinet]],
* [[http://www.mongodb.org|MongoDB]],
* [[http://couchdb.apache.org/|CouchDB]],
* [[http://code.google.com/p/redis/|Redis]],
* [[http://incubator.apache.org/cassandra/|Cassandra]]
* ...
are becoming increasingly popular. They generally provide very efficient lookup of key/value pairs. Some other projects build on top of these; see e.g. [[http://labs.gree.jp/Top/OpenSource/Flare-en.html|Flare]], which extends [[http://1978th.net/tokyocabinet/|TokyoCabinet]].

Beyond a sample interface package, it would be desirable to think about a //generic// interface, similar to what the [[http://cran.r-project.org/package=DBI|DBI package]] does for SQL backends.

A related package is provided by [[http://cran.r-project.org/package=RBerkeley|RBerkeley]], even though the underlying BerkeleyDB is not typically listed as a part of the '//NoSQL Club//' but can arguably be considered an ancestor.

Also of note may be [[http://cran.r-project.org/package=RProtoBuf|RProtoBuf]], which provides an interface to the efficient and language-agnostic Protocol Buffers.

//Update on 2010/03/18//: A new package [[http://cran.r-project.org/package=rredis|rredis]] for Redis is now on CRAN.

//Update on 2010/04/29//: The package [[http://github.com/wactbprot/R4CouchDB|R4CouchDB]] for CouchDB is at GitHub.

**Skills required:** C/C++ and R programming skills, coupled with sufficient administrative skills to set some test servers up. Familiarity with a common Linux distribution may help to take advantage of pre-built backends. The [[http://cran.r-project.org/package=Rcpp|Rcpp]] package may be helpful for C++ interfaces.

**Test:** Ideally a basic proof of concept of an R function calling the underlying API to store and/or retrieve data. This is meant to provide a proof of competence for working with these technologies.

**Mentor:** Dirk Eddelbuettel (who has a preference for C++ implementations). Added 2010-02-24.

--------------------------------------------------------------------------------
/RWiki/gsoc2010/ritk.txt:
--------------------------------------------------------------------------------
====== ITK Functionality in R ======

**Summary:** Provide an R interface to medical image analysis routines (segmentation, registration, etc.) using ITK

**Description:** [[http://www.itk.org|ITK]] is an open-source software toolkit for performing registration and segmentation. Segmentation is the process of identifying and classifying data found in a digitally sampled representation.
Typically the sampled representation is an image acquired from such medical instrumentation as CT (computed tomography) or MRI (magnetic resonance imaging) scanners. Registration is the task of aligning or developing correspondences between data. For example, in the medical environment, a CT scan may be aligned with an MRI scan in order to combine the information contained in both. Several R packages now provide I/O routines for standard medical imaging data formats, and the user community would benefit from access to the sophisticated image processing routines in ITK.

**Skills required:** R programming, R package building, C++ programming for R and ITK, proficiency in CMake (for ITK). Familiarity with [[http://rcpp.r-forge.r-project.org|Rcpp]] and an interest in medical imaging are advantageous.

**Test:** Wrap a single ITK data structure with Rcpp, operate on it in ITK and return the result to R.

**Mentor:** Brandon Whitcher. Core members of the Rcpp development group have suggested that the mailing list will be able to provide support. Suggested on 2010-Mar-19.

--------------------------------------------------------------------------------
/RWiki/gsoc2010/rpy2ff.txt:
--------------------------------------------------------------------------------
====== Storage-based vectors and arrays with R and Python ======

**Summary:** Provide improved access to disk-based storage solutions for R and Python code.

**Description:** Persistent storage-based representations of vectors and arrays can be required for larger datasets, in a form that is better optimized for slicing and dicing than relational databases can be.

In R, the "ff" package is arguably the solution under the most active development, whereas the wrappers for the HDF5 or netCDF libraries expose only a small subset of the respective APIs.

The purpose of this project is to improve the access to such storage-based structures, making cross-process and cross-language access to data easy and efficient.

Depending on the skills and interest of the candidate, the overall outcome would be:

* Access ff-stored data from Python, eventually with special treatment for access from Python's numpy

* HDF5 and netCDF data passed on to R code through Python, as the Python wrappers for HDF5 and netCDF provide a more complete mapping of the HDF5 and netCDF libraries than their R counterparts do.

* memory-mapped data (R has "mmap", Python's numpy has "numpy.memmap")

The exact specifics will be discussed with the applicant.

**Skills required:** Knowledge of R, Python, and rpy2 (the Python-R bridge) is required, as well as C / C++ if optimization through lower-level access is desired.

**Test:** Using rpy2, implement a rudimentary class that can expose ff::ff_vector instances to Python. From that implementation, list anticipated limitations (performance, design, etc.) and imagine use cases and features that would be nice to have (no matter whether they can realistically be implemented or not, within the course of the SoC project or at all).

**Mentor:** Laurent Gautier [[mailto:lgautier@gmail.com|email]]. March 14th, 2010.

==== Questions: ====

* //[[jsalsman@gmail.com|James Salsman]] 2010/03/23//: will you be doing ff format reading and writing both in Python before attempting memory mapping?
* //[[lgautier@gmail.com|Laurent Gautier]] 2010/03/23//: In a first instance, the ff format reading and writing will be exposed to Python users/programmers, with the tasks delegated to the ff package by using rpy2. Native reading/writing might not be considered in the final proposal, one of the reasons being that the format is not a standard (it is only known to R/ff, and may even be ff version-specific).

----

--------------------------------------------------------------------------------
/RWiki/gsoc2010/rserve_javascript.txt:
--------------------------------------------------------------------------------
====== Javascript client for Rserve using nixysa ======

**Summary:** Javascript client for Rserve using nixysa

**Description:** [[http://code.google.com/p/nixysa/|nixysa]] is a framework written in Python to automatically generate glue code for NPAPI plugins (plugins for browsers such as [[http://www.google.com/chrome|Google Chrome]] or [[http://www.mozilla.com/en-US/|Firefox]]), letting you easily expose C++ classes to Javascript from a simple IDL representation. Nixysa was originally conceived for the needs of [[http://code.google.com/intl/fr/apis/o3d/|O3D]], but is flexible enough to support a wide range of use cases.

This project will create a browser plugin to connect to [[http://www.rforge.net/Rserve/index.html|Rserve]] based on the existing C++ client, allowing a web application to interact with Rserve.

**Skills required:** R, C++, Javascript. Familiarity with Rserve.

**Test:** A hello-world example of a nixysa plugin interacting with Rserve, for example showing a popup dialog box (alert) that displays the version of R (version$version.string)

**Mentor:** Romain Francois. Added on 17 March 2010.

--------------------------------------------------------------------------------
/RWiki/gsoc2010/sra.txt:
--------------------------------------------------------------------------------
**Summary:** Social Relations Analyses in R

**Description:** Social Relations Analyses (SRA) are an emerging trend in psychological research, both in social psychology and in personality research. The goals of SRAs are:
a) the decomposition of interpersonal phenomena (like interpersonal perceptions, or behavioral acts) into statistically independent variance components (actor effect, partner effect, and relationship effect)
b) the calculation of bivariate correlations which take the non-independence of the data into account
c) the provision of significance tests for variance components, both as a within-group test (Lashley & Bond, 1997) and as a between-group test (Kenny, 1994)

An increasing number of research projects employ this method, and it is becoming a standard method in social research. Interestingly, however, the main software for calculating these analyses consists of two DOS programs from 1995, which have some disadvantages (in addition to being DOS programs: no missing values allowed, data files have to be transformed into an odd input file format, very strict limitations on the number of cases, output only as relatively unstructured text files with 5000+ lines, ...). Hence, bringing the power of SRAs to R would be a very valuable benefit for the social science community; we know of several research groups eagerly awaiting an R implementation.
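
To illustrate the kind of decomposition involved, here is a toy sketch of the basic round-robin variance decomposition; note that it uses the plain two-way layout and ignores the small-sample corrections of the full SRM estimators:

<code R>
## Toy round-robin matrix: x[i, j] = rating of target j by perceiver i;
## the diagonal (self-ratings) is NA.
set.seed(1)
n <- 6
x <- matrix(rnorm(n * n), n, n)
diag(x) <- NA

g  <- mean(x, na.rm = TRUE)       # grand mean
ri <- rowMeans(x, na.rm = TRUE)   # perceiver (actor) means
cj <- colMeans(x, na.rm = TRUE)   # target (partner) means

actor        <- ri - g            # simplified actor effects
partner      <- cj - g            # simplified partner effects
relationship <- sweep(sweep(x, 1, ri), 2, cj) + g  # dyadic residual part
</code>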
First steps have already been taken to implement SRAs in R: a first stable version of TripleR (which calculates SRAs in round-robin designs; version 0.1) is already hosted on CRAN and in productive use. The goal of this GSoC project would be to extend the existing version of TripleR by allowing multiple groups with between-group t-tests, adding handling of missing values, and implementing standard plots (presumably using ggplot2).
Furthermore, basic functions for calculating SRAs in block designs should be developed and tested against the existing DOS program.

**Skills required:** R programming, experience in package building and documentation, psychological background knowledge with specific expertise concerning social relations analyses

**Test:**

- Outline a way to implement a social relations analysis of multiple round-robin groups in the already existing version of TripleR hosted on CRAN

- Write a function for weighted t-tests that may be used for significance tests of multiple-group results (see Kenny, 1994)

**Mentor:** Stefan Schmukle (http://wwwpsy.uni-muenster.de/Psychologie.inst1/AESchmukle/en/publikationen_schmukle.html); [[mailto:schmukle@uni-muenster.de|email]]
Suggestion date: Mar 06 2010

**References:**
Kenny, D. A. (1994). Interpersonal perceptions: A social relations analysis. New York: Guilford Press.
Lashley, B. R., & Bond, C. F. (1997). Significance testing for round robin data. Psychological Methods, 2, 278-291.

--------------------------------------------------------------------------------
/RWiki/gsoc2010/suggestion_form.txt:
--------------------------------------------------------------------------------
__//**copy the source of this wiki doc, don't edit it directly**//__

====== Header ======

**Summary:** A brief project summary in just a few words, aka a "headline"

**Description:** A more detailed description, possibly enhanced with links and references.

**Skills required:** A list of required skills (programming languages, experience, methodologies, ...) for the prospective student

**Test:** A test an applicant has to pass in order to qualify for the topic

**Mentor:** A mentor willing and able to guide the student. A datestamp of when the suggestion was made.

--------------------------------------------------------------------------------
/RWiki/gsoc2010/template.txt:
--------------------------------------------------------------------------------
====== Header ======

**Task Summary:** A brief project summary in just a few words, aka a "headline"

**Task Description:** A more detailed description, possibly enhanced with links and references.

**Skills needed:** A list of required skills (programming languages, experience, methodologies, ...) for the prospective student

**Rating:** Rate the task as Easy, Medium or Hard (see the mentoring guide for how to do this.)

**Category:** What type of task is it? Code, Documentation, Outreach, Quality Assurance, Research, Training, Translation, or User Interface (see the mentoring guide for how to do this.)

**Mentor:** A mentor willing and able to guide the student. A datestamp of when the suggestion was made.
--------------------------------------------------------------------------------
/RWiki/gsoc2010/vrs.txt:
--------------------------------------------------------------------------------
====== Virtual R Supercomputer ======

**Summary:** Develop an R interface that runs on BOINC (the Berkeley Open Infrastructure for Network Computing).

**Details:** There are a large number of packages dedicated to high-performance computing with R. To get real performance improvements out of parallel computing, each of these packages either assumes direct access to a grid or adds the additional expense of using something like Amazon EC2.

BOINC has about 605,000 active computers (hosts) worldwide, processing on average 4.6 petaFLOPS as of March 4th, 2010 (http://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing). These hosts volunteer their resources for scientific research. The vrs package will allow any researcher to submit a self-contained R simulation to be run across the BOINC network of computers, giving researchers direct access to vast computing power at no cost.

**Skills:** Good R and C++ programming skills. Familiarity with basic open-source development tools like svn and make is assumed. Experience with Debian or Ubuntu Linux.

**Tests:**
- Use RInside (http://dirk.eddelbuettel.com/code/rinside.html) to write a C++ application that calculates the value of Pi using a Monte Carlo approach with R (see Rosetta Code for an example implementation of the R code: http://rosettacode.org/wiki/Monte_Carlo_methods#R). The looping part of the operation should be in the C++ layer.
- Describe, in words, what the following BOINC functions do: get_output_file_info, create_work, boinc_init_parallel, boinc_wu_cpu_time.
- Run the "Quick Start" BOINC example application (can be run as a virtual machine to limit setup time).

**Mentor:** Shane Conway (smc77@columbia.edu).

--------------------------------------------------------------------------------
/RWiki/gsoc2011/apim.txt:
--------------------------------------------------------------------------------
**Summary:** A package using structural equation modeling to analyze bidirectional effects in dyadic data structures

**Description:** The Actor-Partner Interdependence Model (APIM; Kashy & Kenny, 1999; Kenny, Kashy, & Cook, 2006) is a model of dyadic relationships that integrates a conceptual view of interdependence in two-person relationships with the appropriate statistical techniques for measuring and testing it. Structural equation modeling makes it easy to translate the two-person relationships, in the form of a path diagram, into a model to be estimated and tested using SEM. So far, there is only one R package, "dyad", which is solely based on Gottman-Murray equations. It cannot handle complex interaction effects such as partner-moderated partner effects or actor-moderated partner effects. Our goals are to: a) provide an easy interface for the calculation of APIMs with three components (actor effects, partner effects and relationship effects); b) provide chi-square tests for model fit; c) provide parameter estimates for those three components and Wald tests for them; d) provide variance/covariance estimates and likelihood tests for them; e) provide functions for newer developments like APMeMs and APMoMs (mediation and moderation in dyadic designs).
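
A basic APIM for distinguishable dyads can be specified directly as an SEM; below is a minimal sketch using the lavaan package, where the data frame ''dyads'' and the variables X1, X2, Y1, Y2 (the two partners' predictor and outcome scores) are hypothetical placeholders:

<code R>
library(lavaan)

## Actor effects (a1, a2), partner effects (p12, p21), and correlated
## predictors / residuals to capture the dyadic non-independence.
apim_model <- '
  Y1 ~ a1 * X1 + p21 * X2
  Y2 ~ a2 * X2 + p12 * X1
  X1 ~~ X2    # correlated predictors
  Y1 ~~ Y2    # correlated residuals (relationship component)
'
fit <- sem(apim_model, data = dyads)  # "dyads": one row per dyad
summary(fit, fit.measures = TRUE)     # includes the chi-square model test
</code>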
**Skills required:** R programming, experience in package building and documentation, psychological background knowledge with specific expertise concerning structural equation models for dyadic data

**Test:** Write a function to replicate the results of the APIM analysis in "Dyadic Data Analysis" (Kenny, Kashy, & Cook, 2006), Chapter 7 (Campbell et al. data; see also http://davidakenny.net/kkc/c7/c7.htm). Make sure that the function call is as general, reusable, and easy as possible.

**Mentor:** Dr. Felix Schönbrodt, Ludwig-Maximilians-Universität Munich: felix.schoenbrodt@psy.lmu.de; Co-Mentor: Prof. Mitja Back, Johannes Gutenberg University Mainz

--------------------------------------------------------------------------------
/RWiki/gsoc2011/convex_optimization.txt:
--------------------------------------------------------------------------------
**Summary**:
CVX-R: Disciplined convex programming in R

**Description**:
Convex optimization problems represent a class of optimization problems that can be solved efficiently and reliably. They include least-squares and linear programming problems. Surprisingly many problems can be solved via convex programming, though it is often difficult to recognize their convexity or to transform them into a convex form.

There is a free, GPLed software package, CVXOPT, for convex optimization, distributed as a Python module. Several modeling tools have been built on top of it, such as CVXMOD for Python. CVX for Matlab has similar goals but uses different solvers. The goal of this GSoC project will be to do a similar implementation in R and thus develop an R package for [[http://www.stanford.edu/~boyd/papers/disc_cvx_prog.html|Disciplined Convex Programming (DCP)]] like CVX and CVXMOD.

Possible approaches for developing a modeling language:

* Port the modeling language of CVX/CVXMOD to R and use standard solvers in R, i.e. [[http://cran.r-project.org/web/packages/quadprog/index.html|quadprog]], [[http://cran.r-project.org/web/packages/linprog/index.html|linprog]], and the new [[http://cran.r-project.org/web/packages/ROI/index.html|ROI]] interface
* Exploit the [[http://cran.r-project.org/web/packages/inline/index.html|inline]] package to generate high-speed solvers, as in [[http://cvxgen.com/|CVXGEN]].
* Define new S3 methods for standard operators, i.e. >=.class, as in the demo [[http://quadmod.r-forge.r-project.org|quadmod package on R-Forge]].

The developers of CVXOPT have several times indicated an interest in porting their work to R, so they should be willing to give some support to the project.
UPDATE 15 Nov 2011: TDH got several emails today from the Stanford people, including Stephen Boyd, Balasubramanian Narasimhan, and Michael Grant, confirming their interest in helping develop a DCP package for R.

Also, there are pages on the Web showing how to solve problems from the book "[[http://www.stanford.edu/~boyd/cvxbook/|Convex Optimization]]" by Boyd and Vandenberghe with CVXOPT. These examples could be used to write a vignette for the package.

**Skills required**:
Some knowledge of R and its package building features; experience with programming in Python and/or C and/or Matlab; background or strong interest in mathematical optimization.
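
To give a feel for the solver layer such a package would build on, here is a small quadratic program solved with quadprog (a generic textbook-style example, not taken from the proposal):

<code R>
library(quadprog)

## minimize    x1^2 + x2^2 - x1 - 2*x2
## subject to  x1 + x2 <= 1,  x1 >= 0,  x2 >= 0
##
## solve.QP() minimizes (1/2) x'Dx - d'x subject to A'x >= b, so the
## objective is encoded with D = 2*I and d = (1, 2), and "<=" constraints
## are negated.
Dmat <- 2 * diag(2)
dvec <- c(1, 2)
Amat <- cbind(c(-1, -1), c(1, 0), c(0, 1))  # each column is one constraint
bvec <- c(-1, 0, 0)

sol <- solve.QP(Dmat, dvec, Amat, bvec)
sol$solution  # optimal x, here (0.25, 0.75)
</code>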
**Test**:
Select two or three convex optimization problems from the literature, solve them with CVXOPT/CVXMOD in Python (and maybe with CVX in Matlab); then try to solve them with one of the optimization packages in R.

**Mentor**: There will be one mentor for each of the two sides: one from R and one from disciplined convex programming.

**R side**: Hans W. Borchers, with support from Stefan Theussl, the maintainer of the CRAN Task View on Optimization, who will assist in R/Python programming related questions.

**CVX side**: [[http://www.stanford.edu/~echu508/|Eric Chu]] will mentor the convex optimization part.

--------------------------------------------------------------------------------
/RWiki/gsoc2011/cranvastime.txt:
--------------------------------------------------------------------------------
====== cranvastime: Interactive longitudinal and temporal data plots ======

**Summary:** Interactive longitudinal data plots in cranvas, a cutting-edge data graphics package

**Background** A close connection between plots and modeling has been a moving and elusive target for interactive graphics research for decades. Fleeting glimpses, DataDesk and XLispStat, have come and gone. With the emergence of R we took two steps backwards, until now. Two recent developments have the promise to bring us forward in leaps: iplots eXtreme and cranvas. Cranvas interfaces to the Qt libraries using the R packages qtbase and qtpaint to provide interactive and dynamic graphics for large amounts of data. It provides programmable interactive graphics in R, enabling users to link plots with analytical methods. Cranvas currently has the basic plot types: scatterplots, histograms, mosaic plots, maps, parallel coordinate plots and tours. More information about cranvas can be found at https://github.com/ggobi/cranvas.

**Description** This project involves developing interactive plots for longitudinal and temporal data, the type of data used to study problems involving human health and medical treatments, wages, prices and economic indicators, and temperature and climate change. Longitudinal data plots show values for individuals, possibly at irregular time intervals; the purpose is to explore individual differences and similarities. Time series plots tend to show just one measurement for each time point, but there is implicitly an interest in studying seasonal trends and temporal dependencies. The code should follow the code structure of the existing plotting routines. The plots will need to incorporate ways to interactively explore the individual temporal patterns and seasonal trends.

**Skills required:** A good working knowledge of R and an interest and enthusiasm for visualization and exploring data.

**Test:** Generate an R function to plot a time series. Have the plot pan in the horizontal (time) direction, and wrap on itself when it extends past the boundaries of the plot space. (A minimal sketch of the wrap-around logic is given below.)

**Mentor:** Dianne Cook, Iowa State University (dicook@iastate.edu), and Heike Hofmann, Iowa State University (hofmann@iastate.edu).
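
As a rough starting point for the test above (base graphics only; the real project would use cranvas/Qt), one way to get the wrap-around panning is modulo indexing:

<code R>
## Sketch: pan a time series horizontally, wrapping past the series end.
pan_ts <- function(x, window = 100, step = 5, pause = 0.05) {
  n <- length(x)
  for (offset in seq(0, n - 1, by = step)) {
    idx <- ((offset + seq_len(window) - 1) %% n) + 1  # wrap-around indexing
    plot(seq_len(window), x[idx], type = "l",
         xlab = "time (panning)", ylab = "value")
    Sys.sleep(pause)
  }
}
## pan_ts(as.numeric(sunspots))
</code>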
--------------------------------------------------------------------------------
/RWiki/gsoc2011/dclusterm.txt:
--------------------------------------------------------------------------------
====== DClusterm: Model-based detection of disease clusters ======

**Summary:** Model-based detection of disease clusters

**Background** The analysis of disease data is important in order to detect disease outbreaks and links to risk factors. Some of the methods for cluster detection have been implemented in the DCluster package. However, a model-based approach would be of interest in order to relate disease incidence to potential risk factors.

**Description** Model-based clustering will be implemented using Generalized Linear Models (in principle, for the Poisson and Binomial families). Clusters will be modelled as dummy variables (1 = area is in a cluster, 0 = area is not in a cluster). Hence, many possible clusters will be proposed, and the most likely cluster will be selected according to a likelihood ratio test, AIC and (possibly) any other reasonable method.

**Skills required:** A good working knowledge of R and Generalized Linear Models. Some understanding of spatial statistics will be a plus.

**Test:** Fit a GLM using the North Carolina SIDS data. See example("readShapePoly") in the maptools package. In this GLM, SID74 will be the outcome and BIR74*sum(SID74)/sum(BIR74) an offset; the Poisson family will be used. In addition, a dummy variable representing a spatial cluster will be included. This dummy variable will cover 5 different contiguous regions (i.e., the value of this dummy variable will be 1 for these 5 regions and 0 otherwise). Display the residuals of this model in a map.

**Mentor:** Virgilio Gómez-Rubio, University of Castilla-La Mancha (Virgilio.Gomez@uclm.es); proposed on 2011/03/21.

--------------------------------------------------------------------------------
/RWiki/gsoc2011/dendro.txt:
--------------------------------------------------------------------------------
====== Interactive Dendrogram + Heatmap ======

**Summary:** An interactive dendrogram + heatmap plot for exploring the results of hierarchical clustering analysis

**Description:**

R is extensively used to study intrinsic relations in multidimensional data. While hierarchical clustering analysis (HCA) can be employed to reveal the internal structure of such data, there is (to my knowledge) no interactive tool to browse, select and visualize HCA results (dendrograms) in R. As this is a common use case for many scientists, who often analyze data consisting of tens of thousands of samples, I feel it would be beneficial to fill this gap.

None of the existing R dendrogram plotting tools meets both the criteria of satisfactory interactivity and high-speed processing:
* the low-level dendrogram plotting functions (''plot.dendrogram'' and ''plclust'' as well as ''identify.hclust'', all in the ''stats'' package [1]) offer very limited interactivity,
* ''iPlots'' [2], a high-interaction statistical graphics library written in Java, is quite slow in processing large dendrograms,
* there are several feature-rich packages offering HCA, but none of them allows users to interact with dendrograms and (sub)clusters in a convenient way.
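
(For comparison, the static, non-interactive baseline that already exists in base R is the one shown below; the proposal is about what this cannot do.)

<code R>
## Static baseline: hierarchical clustering with a dendrogram and a
## heatmap display, no interaction possible.
hc <- hclust(dist(scale(mtcars)))
plot(hc)                           # plain dendrogram
heatmap(scale(mtcars), scale = "none")  # heatmap with dendrograms attached
</code>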
13 | 14 | I'd like to build a dendrogram+heatmap plot which will make it possible to: 15 | 16 | - draw a dendrogram+heatmap biplot quickly (it must be capable of drawing dendrograms consisting of thousands of clusters in reasonable time) 17 | - select, color and label multiple individual (sub)clusters anywhere in the dendrogram hierarchy 18 | - perform automatic distance-based cluster selection (cutting the dendrogram at a specified height) 19 | - zoom the dendrogram 20 | - visualize selected clusters in feature space (e.g. scatterplots, parallel coordinate plots, etc.) 21 | 22 | The last point would not be implemented from scratch; existing libraries/applications can be utilized for that, either by means of integration of the dendrogram+heatmap into them (e.g. into cranvas [3]), or by means of a user-supplied callback. 23 | 24 | [1] http://stat.ethz.ch/R-manual/R-patched/library/stats/html/00Index.html 25 | 26 | [2] http://rosuda.org/iplots/ 27 | 28 | [3] https://github.com/ggobi/cranvas 29 | 30 | 31 | **Mentor:** 32 | -------------------------------------------------------------------------------- /RWiki/gsoc2011/explorestoch.txt: -------------------------------------------------------------------------------- 1 | ====== ExploreStoch ====== 2 | 3 | **Summary:** To develop functions for exploratory analysis and 4 | visualization of multivariate and high-dimensional data from 5 | stochastic dynamic processes. 6 | 7 | **Description:** The overall objective is to contribute functions 8 | that can be helpful in exploring, plotting and analyzing data from 9 | high-dimensional dynamic systems, in particular, emphasizing the 10 | internal dynamics of the process. The results may be embedded in the 11 | processdata package (https://r-forge.r-project.org/projects/multdiff/). 12 | 13 | Some concrete data examples and ideas are: 14 | 15 | * Gene expression dynamics. 16 | 17 | Time course gene expression measurements 18 | are very high-dimensional dynamic processes -- often observed at 19 | relatively few time points. Some of the scientific objectives include 20 | the ability to identify groups of genes that are co-regulated, genes 21 | that react to external stimuli, or, more ambitiously, the regulatory 22 | network among the genes. 23 | 24 | * Activities of neuron cells. 25 | 26 | Dynamic spike times and membrane 27 | potentials for neuron cells form another class of dynamic stochastic 28 | processes. A characteristic of the data is that the spike times are 29 | discrete events. Some of the scientific objectives are to model how 30 | the membrane potential causes spikes, and how spikes from multiple 31 | neurons co-occur, and perhaps to infer the network among the observed 32 | neurons from the spike times. 33 | 34 | For both examples above it is of interest to visualize the dynamics. 35 | Static (time-series) plots can be useful, in particular if the 36 | coordinates can be grouped and color coded in a sensible way. The 37 | proposal is, however, to focus on functions that produce animations, 38 | e.g. using the animation R package. Besides simple 1d-animations, 39 | 2d-animations of (X_t, dX_t) based on an SDE model are of interest. 40 | For discrete event times a suggestion for replacing the (X_t, dX_t) 41 | animation is to consider an (R_t, lambda_t) animation where R_t is the 42 | time since the last event and lambda_t is a (model-based) 43 | intensity.
The infrastructure for handling continuous process data and 44 | discrete event times is available in the processdata R package, and 45 | the R package ppstat (https://r-forge.r-project.org/projects/ppstat/) 46 | provides a class of multivariate intensity-based models. 47 | 48 | Static plots of the above can also be implemented, but a prime 49 | motivation for working with animations is to make it easier to visualize 50 | concurrency or, at least, the order in time of the dynamic behavior of 51 | the processes. 52 | 53 | To better represent signaling cascades, a 54 | more ambitious goal could be to superimpose a 1d-animation 55 | on a graph that represents the direct interactions between the process 56 | components, and, for instance, use color coding to represent the 57 | values of the process, spike times, time since the last event, etc. 58 | 59 | * Functional data and classification. 60 | 61 | One can also consider 62 | (high-frequency) observations of a continuous signal, sometimes 63 | referred to as functional data, with the purpose of discriminating 64 | between two or more groups based on the signal. A concrete data 65 | example that can be used in the project consists of acceleration 66 | signals from walking horses, where the purpose is to discriminate lame 67 | horses from healthy horses. The signals are 3d signals. 68 | 69 | For functional data, such as the horse data mentioned above, the 70 | objective is to produce static plots or animations that emphasize the 71 | discrepancies between the groups, that is, between the lame and 72 | healthy horses in the example. 73 | 74 | A successful completion of this project consists of an R implementation 75 | to solve one or more of the visualization tasks described above. The 76 | implementation must be based on data being of one of the classes 77 | implemented in the processdata package. It is encouraged that the 78 | resulting visualizations are combined with techniques for dimension 79 | reduction, clustering and color coding to aid the interpretation of 80 | the data. 81 | 82 | **Skills required:** R programming, preferably with a good knowledge of R 83 | graphics and some of the R packages for plotting and visualization, 84 | e.g. the animation, rgl and ggplot2 packages. Knowledge of stochastic processes. 85 | 86 | **Test:** Use svn to check out the code of the processdata package from R-Forge and read the vignette. 87 | 88 | **Mentor:** Niels Richard Hansen 89 | 90 | //[[Niels.R.Hansen@math.ku.dk|Niels Richard Hansen]] 06-04-2011// 91 | 92 | -------------------------------------------------------------------------------- /RWiki/gsoc2011/gibbstrading.txt: -------------------------------------------------------------------------------- 1 | ====== Gibbs sampling for detecting trade direction ====== 2 | 3 | 4 | ===== Summary: ===== 5 | Utilize a Gibbs sampling Bayesian approach to impute trade direction from financial data. 6 | 7 | ===== Description: ===== 8 | Given a series of prices, this method exploits the price movements in a stock (or bond) to infer spread and trade direction. 9 | 10 | ===== Background: ===== 11 | In financial markets, a buyer typically pays more than the previously posted price. Similarly, a seller gets less than the previously posted price. This discrepancy between the previously posted price and the price that a buyer (or seller) pays is called the spread, 2c. If one has the quote data, one can make an attempt to infer the spread using the difference between the bid and offer quotes.
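A small simulation sketch of the price model and the Roll (1984) spread estimator described below (the parameter values are arbitrary illustrations):

  set.seed(1)
  n <- 5000; c_half <- 0.05                      # true half spread c
  q <- sample(c(-1, 1), n, replace = TRUE)       # latent trade directions
  p <- cumsum(rnorm(n, sd = 0.02)) + c_half * q  # efficient price + c*q_t
  dp <- diff(p)
  2 * sqrt(-cov(dp[-1], dp[-length(dp)]))        # Roll spread, close to 2c = 0.1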
12 | 13 | Assume that the price moves up by half a spread, c, when there is a buy order. 14 | 15 | p_t=p_{t-1} + c + u_t 16 | 17 | In this setting the price moves down by half a spread when there is a sell order. 18 | p_t=p_{t-1} - c + u_t 19 | 20 | The half spread c can be estimated as the square root of -1 times the serial covariance of consecutive price changes. This is known as the Roll (1984) estimator of the spread. In practice, it does not work very well, since prices are positively autocorrelated (Harris, 1990). 21 | 22 | 23 | ===== Gibbs Sampling: ===== 24 | 25 | Hasbrouck (2009) proposes a way around it using Gibbs sampling. The starting point is the equation 26 | 27 | \delta p_t = c \delta q_t + u_t 28 | 29 | If we knew the trade direction for each trade, then the equation suggests that regressing price changes on trade direction would provide us with an estimate of the half spread. But we do not know the trade direction. In practice, we only observe \delta p_t, i.e., the price changes. The problem can be solved by treating the trade directions as latent variables and sampling from the joint distribution of q_t and c. This method works remarkably well. Hasbrouck (2009) shows that the correlation between estimates of the spread using only the price data and the spread using the quote data is 0.965. A side effect of this method is that we obtain a reasonable estimate of the trade direction for the price series. We propose to implement this estimator in R. 30 | 31 | ===== References: ===== 32 | Harris (1990) Statistical Properties of the Roll Serial Covariance Bid/Ask Spread Estimator. Journal of Finance. 33 | 34 | Hasbrouck (2009) US equities: estimating effective costs from daily data. Journal of Finance, 64(3), 1445-1477. 35 | 36 | Roll (1984) A simple implicit measure of the effective bid-ask spread in an efficient market. Journal of Finance. 37 | 38 | ===== Skills required: ===== 39 | 40 | Programming will be in R, with possible extension to C/C++/Fortran, as required for performance. 41 | 42 | The applicant must have knowledge of the literature in this area, as well as knowledge of Bayesian techniques and their implementation in R. R already has several Bayesian packages that contain Gibbs sampling; if possible, one of these should be used as a dependency, to keep the project scope to implementing, documenting, and testing the technique described in Hasbrouck (2009). 43 | 44 | ===== Test: ===== 45 | A test an applicant has to pass in order to qualify for the topic 46 | 47 | ===== Mentor: ===== 48 | Brian G. Peterson (2011-04-07) 49 | with backup mentoring from Dale Rosenthal and Nitish Sinha of the faculty at the University of Illinois, Chicago 50 | -------------------------------------------------------------------------------- /RWiki/gsoc2011/image_analysis.txt: -------------------------------------------------------------------------------- 1 | **Summary:** Image analysis for R 2 | 3 | **Description:** 4 | 5 | Being visual creatures, much of the information that humans process and make inferences from comes in the form of images. While R is incredibly strong in many aspects of statistical analysis, with multiple packages competing to address similar analysis types and issues, there are few packages dealing with the manipulation and analysis of images. This project will bring full integration of ImageJ ( http://rsbweb.nih.gov/ij/ ) to R and expand the RImageJ package, which is currently mostly a stub, into a fully functional R image analysis engine. The implementation will have two components: 6 | 7 | 1.
Easy programmatic manipulation of image objects from within R. ImageJ images and image stacks will be encapsulated in R objects (perhaps the new reference class system?). R functions will be provided to perform common manipulation and analysis tasks. 8 | 9 | 2. Interoperability with the ImageJ GUI. When in a Java-based R console such as JGR or Eclipse StatET, the ImageJ GUI can be run in parallel with R. ImageJ will be altered so that the user can export images, stacks and result tables from ImageJ to R. Likewise, image objects altered within R will be reflected within the ImageJ GUI. If Deducer is loaded, menu items and dialogs will be available to manage the relationship between R and ImageJ. Support for the ImageJ plug-in system will also be included. 10 | 11 | **Skills required:** Experience with Java is highly desirable. Experience using rJava/JRI is preferred. 12 | 13 | **Test:** 14 | 15 | 1. Using rJava, write some R code that creates and displays a JDialog with the text "Hello from R." 16 | 17 | 2. Using JRI, write Java code to create a JDialog with a JTextArea and a button named "Add". When Add is pressed, any code written in the JTextArea should be executed in R. 18 | 19 | 3. Using the RImageJ package, open and flip http://www.google.fr/intl/en_en/images/logo.gif along its x-axis ( the "Flip Horizontally" command). 20 | 21 | Hint: for 1 and 2 see http://www.deducer.org/pmwiki/pmwiki.php?n=Main.Development for some ideas on how to use rJava/JRI. For 3 see http://romainfrancois.blog.free.fr/index.php?post/2009/06/22/using-ImageJ-from-R%3A-the-RImageJ-package 22 | 23 | **Mentor:** Ian Fellows thefell@gmail.com 3-1-11 -------------------------------------------------------------------------------- /RWiki/gsoc2011/manipulate.txt: -------------------------------------------------------------------------------- 1 | ====== manipulate ====== 2 | 3 | **Summary:** Interactive graphical R applets for teaching statistics and calculus 4 | 5 | **Description:** Many concepts in statistics and calculus can be illuminated powerfully by interactive graphical displays: call them applets. For example, in a statistical modeling class, an applet can be used to show the effect of adding interaction terms to a model, or how solutions to sets of linear equations are found. In calculus, an applet can show how the quality of a polynomial approximation changes as the order is increased. Indeed, just about any static textbook illustration can be made more effective by the addition of interactivity: changing parameters, model orders, initial conditions, etc. The RStudio IDE makes possible the creation of interactive graphical applets using its built-in manipulate package. 6 | 7 | This project will develop about 20 manipulate applets to illustrate various concepts encountered in statistics and calculus. The applets will be published as part of Project MOSAIC, an NSF-supported curricular development initiative that encourages greater use of sophisticated computing in introductory college-level calculus and statistics courses. 8 | 9 | **Skills required:** Proficiency in general R programming and constructing custom graphical displays. Familiarity with basic principles of statistics and calculus (through multivariate) that will be the objects of the applets. 10 | 11 | **Test:** The ability to write an R function that can display a cobweb plot, that is, the iteration of the system x_{i+1} = g( x_i ) such as the logistic map in a chaotic regime.
(A rough example of such a plot is at http://math.la.asu.edu/~chaos/logistic.html) 12 | 13 | **Mentor:** Daniel Kaplan (Macalester College) and J.J. Allaire (RStudio). 3/11/2011. -------------------------------------------------------------------------------- /RWiki/gsoc2011/msfga.txt: -------------------------------------------------------------------------------- 1 | ====== msfga: Model Search Space Size Fit Goodness Adjustment ====== 2 | 3 | **Summary:** R's stepwise regression does not adjust goodness-of-fit statistics for the size of the search space: the statistics count only the degrees of freedom in the best fit, not the degrees of freedom in the entire model space searched to produce it. 4 | 5 | **Description:** We need to be able to adjust the goodness-of-fit statistic for stepwise regression before we can properly offer symbolic regression. People are used to adjusting the r2 statistic with the number of degrees of freedom in the best fit instead of the total number of degrees of freedom in the model space used to produce it, which is improper for stepwise or symbolic regression. We will measure the proper way to count degrees of freedom in stepwise and symbolic regression instead of just counting the number of independent variables in the best fit. We will use Monte Carlo methods to validate our answers. This will enable [[developers:projects:gsoc2010:syrfr]] for the 2012 Google Summer of Code. 6 | 7 | **Skills required:** R, stepwise regression, computing adjusted r2 and F-statistic values, Monte Carlo validation of confidence interval predictions, calculation of confidence intervals from linear and optionally nonlinear models, and generation of algebraic formula trees as a search space. Please ask if you have questions about any of these topics. 8 | 9 | **Test:** First please write a description of how r2 is adjusted for a fit's degrees of freedom in ordinary linear regression (for n observations and p predictors, the usual adjustment is adjusted r2 = 1 - (1 - r2)(n - 1)/(n - p - 1)), and how F statistics are calculated. Send that to jim -at- talknicer.com along with an R transcript showing how you calculate confidence intervals from a linear regression fit to random data and use additional random data from the same distribution function to validate the correctness of the calculated confidence intervals. I will select the best entrants and help you write your project proposal. We must get Hadley Wickham's approval of your proposal. 10 | 11 | **Mentor:** James Salsman, email jim -at- talknicer dot com, @jsalsman on Twitter, 14 February 2011. Please tweet or send mail if interested. -------------------------------------------------------------------------------- /RWiki/gsoc2011/optile.txt: -------------------------------------------------------------------------------- 1 | ====== optile: Category order optimization for graphical displays of categorical data ====== 2 | 3 | 4 | **Summary:** 5 | There are different ways of displaying categorical data, e.g. in contingency tables, mosaicplots, fluctuation diagrams, networks or (modified) parallel coordinates plots. The interpretability of these graphics decreases rapidly with the number of categories and, especially, with the number of variables. Various criteria and algorithms exist for reordering nominal categories to improve the displays and make them more informative. The aim is to design a structured implementation of these tools, providing users with flexible reordering options and an effective interface, both for controlling the analyses and for evaluating the results.
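As an illustrative toy example of such category reordering for a two-way table (the criterion here -- ordering rows and columns by the first singular vectors of the standardized residuals -- is an assumption for illustration, not the method the project must use):

  tab <- as.matrix(HairEyeColor[, , "Male"])
  E <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts
  S <- svd((tab - E) / sqrt(E))                      # standardized residuals
  tab[order(S$u[, 1]), order(S$v[, 1])]              # reordered table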
6 | 7 | 8 | **Description:** 9 | The project goal is to implement an interface in R which provides category order optimization for different types of input (such as tables, data frames or matrices) and 2- as well as k-dimensional categorical data. Applying the optimizations to graphical displays of possibly large dimensions (as with microarray data) requires computation times to be small and the interface to be easily applicable. Different criteria as well as different optimization procedures for k-dimensional data are part of the project. 10 | 11 | **Skills required:** 12 | R, C, experience with categorical data analysis and hierarchical clustering 13 | 14 | **Test:** 15 | Implement an R function which computes informative reorderings of a given two-dimensional data table according to user-defined criteria. The function must be able to handle large tables efficiently and should be included in a minimal R package which can be incorporated into other packages. Computation will be carried out in C. 16 | 17 | 18 | **Mentor:** 19 | Antony Unwin 20 | 21 | Professor of Computer-Oriented Statistics and Data Analysis 22 | 23 | University of Augsburg 24 | 25 | 86135 Augsburg, Germany 26 | 27 | Tel: + 49 821 5982218 28 | 29 | antony.unwin@math.uni-augsburg.de -------------------------------------------------------------------------------- /RWiki/gsoc2011/panorama.txt: -------------------------------------------------------------------------------- 1 | ====== panorama ====== 2 | 3 | 4 | **Summary:** Interactive web graphics for R 5 | 6 | **Description:** Research results from analysing data can now be presented on the web and there are some very nice examples, including excellent static graphic displays. However, explaining what can be seen in graphics in detail and helping readers understand them can be challenging. Viewers should be able to explore the graphics themselves to help them grasp the ideas being presented. Interactive presentations can be much more effective than static reports and offer ways of involving people in the analytic process. This involves querying, interactive controls like sliders, and linking of different graphics. 7 | 8 | Related: [[http://svgmapping.r-forge.r-project.org/SVG_Mapping/Gallery.html|SVGMapping]] [[http://www.stat.auckland.ac.nz/~paul/gridSVG/|gridSVG]] 9 | 10 | This project has the goal of designing and developing software which allows the results of visualization analyses to be presented in the form of interactive applets on the web, incorporating statistical methods using the modeling capabilities of R. 11 | 12 | **Skills required:** Experience with programming in Java, good knowledge of R, experience in data visualization and interactive software. 13 | 14 | **Test:** Write an applet for drawing a scatterplot that can be queried. Provide options for adding a regression line calculated in R using Rserve.
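The applet itself would be written in Java, but the R side of this test can be sketched in a few lines (the helper function is a hypothetical illustration of what the Java client might evaluate remotely):

  library(Rserve)
  Rserve(args = "--no-save")  # start a local Rserve daemon
  ## a helper the applet could call to get intercept and slope:
  regression_line <- function(x, y) coef(lm(y ~ x))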
15 | 16 | **Mentor:** Antony Unwin 17 | 18 | Professor of Computer-Oriented Statistics and Data Analysis 19 | 20 | University of Augsburg 21 | 22 | 86135 Augsburg, Germany 23 | 24 | Tel: + 49 821 5982218 25 | 26 | antony.unwin@math.uni-augsburg.de -------------------------------------------------------------------------------- /RWiki/gsoc2011/r-em-accelerator.txt: -------------------------------------------------------------------------------- 1 | ===== R-EM-Accelerator ===== 2 | 3 | **Title:** Convergence Acceleration of the Expectation-Maximization (EM) Algorithms in Computational Statistics: A Suite of Cutting-Edge Acceleration Schemes 4 | 5 | **Background:** The Expectation-Maximization (EM) algorithm is an approach to solving certain classes of optimization problems. Although it is most popular in statistics due to the seminal work of Dempster, Laird and Rubin (J Royal Stat Soc Ser B, 1977), the EM idea is quite broadly applicable in a variety of problems without any probabilistic data generating mechanism. Lange and others (J Comp Graph Stat, 2000) have shown that the EM algorithm can be subsumed under the more general minorization-maximization or majorization-minimization (MM) approach. Both EM and MM are globally convergent and have linear local convergence under weak regularity conditions. Their global convergence (i.e., convergence to a stationary point from any starting value) is a wonderful property that equips them with stability rarely seen in numerical algorithms. They are easy to implement and also have minimal memory requirements. All of these properties make them very attractive in simple, small to moderate-size problems. However, their linear rate of convergence is typically quite slow in many practical applications. This is a major impediment to the use of EM in high-dimensional problems or in problems where the E or M step computations are demanding. Thus, there is a real need to accelerate the slow, linear convergence of the EM. In particular, there is a need for algorithms that can accelerate EM without destroying its hallmark global convergence. 6 | 7 | **Description of the Project:** 8 | Varadhan and Roland (Scand J Stat, 2008) proposed a unique and remarkably simple acceleration scheme that is also globally convergent. This scheme, called SQUAREM, has already been implemented in an R package of the same name. This work has motivated the development of newer acceleration methods based on the classical quasi-Newton (Zhou and Lange; Stat Comp 2011) and conjugate gradient algorithms (He and Liu, Tech Report 2010), as well as those closer in spirit to SQUAREM itself (Berlinet and Roland; Stat Comp 2009). These methods can accelerate any smooth, linearly convergent fixed-point algorithm. Therefore, they will be broadly applicable to a large class of problems beyond EM and MM (e.g., the power method for computing a dominant eigenvector, equilibrium game theory in economics). The goal of this project is to develop an R package that will make all of these latest developments in fixed-point acceleration available to analysts via an easy-to-use interface that is provided by R. 9 | 10 | **Skills Required:** A good working knowledge of R and some background in optimization and statistics. 11 | 12 | **Test:** Implement in R one particular instance of an EM or MM algorithm in the statistics literature from any of the articles cited so far or from the book of McLachlan and Krishnan (The EM Algorithm and Its Extensions, Wiley 1997).
You can choose any EM or MM example. 13 | 14 | **Mentor:** Ravi Varadhan, Johns Hopkins University (rvaradhan_at_jhmi.edu). (2011-3-11) 15 | -------------------------------------------------------------------------------- /RWiki/gsoc2011/r-openmp.txt: -------------------------------------------------------------------------------- 1 | ====== OpenMP parallel framework for R ====== 2 | 3 | **Summary:** Using an OpenMP parallel framework, package developers and R users can easily write multithreaded functions without much knowledge of C/C++, Fortran and OpenMP. 4 | 5 | **Description:** Nowadays, most CPUs have a multicore/shared-memory architecture. OpenMP is an API for multi-platform shared-memory programming in C/C++ and Fortran. R needs to have an interface to OpenMP so that it benefits from the multicore/shared-memory architecture. This OpenMP parallel framework will provide the #pragma omp parallel, #pragma omp for, #pragma omp sections/section and task directives of OpenMP in R. One possible direction is to extend the "inline" package to support multithreaded programming in R. This framework can be implemented using C or Fortran or both. It will provide a simple and efficient parallel programming interface for multicore architectures to the R user/community. 6 | 7 | **Skills required:** Programming experience in R, C, C++, Fortran, OpenMP and knowledge of multithreaded programming. 8 | 9 | **Test:** Some packages will be modified using this framework to provide a performance benchmark. 10 | 11 | **Reference packages:** pnmath, multicore, foreach, doMC, romp, inline, Rcpp and bigmemory. 12 | 13 | **Mentors:** 14 | 15 | George Ostrouchov, Ph.D., Statistics and Data Sciences Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Phone: +1-865-574-3137 http://www.csm.ornl.gov/~ost , e-mail: ostrouchovg@ornl.gov 16 | 17 | Dr. Ferdinand Jamitzky, Leibniz Supercomputing Centre, of the Bavarian Academy of Sciences and Humanities, Boltzmannstraße 1, 85748 Garching bei München, Room I.2.035, Phone: ++49-89-35831-8727, e-mail: jamitzky@lrz.de 18 | 19 | Suggestion date: 11-March-2011 (12:00 UTC) 20 | 21 | 22 | **More information about the project:** 23 | 24 | Nowadays, most compilers support OpenMP directives (GNU, Intel, Microsoft, PGI, IBM, HP, Cray etc). The R-OpenMP framework/package will give R users and package developers the power of multithreaded programming with minimal effort. We will also provide some functionality using GPGPU. 25 | 26 | Our framework will provide "registerDo" functions to register the programming languages and architectures (programming languages: C and Fortran; architectures: multicore (using OpenMP), GPU (using OpenCL, CUDA, PGI directives)). It will also provide an option to register a specific compiler (e.g. Intel, GNU, Microsoft, PGI, Cray). Our proposed framework would be an extension of the "foreach" and "inline" packages. It will be somewhat similar to the "doMC" package, but the proposed package would have many more functionalities using OpenMP, OpenCL and/or CUDA. We would also like to benchmark the framework using combinations of different architectures, compilers, and operating systems. In the future, we will make it more transparent and extend it to the general "apply" functions. 27 | 28 | **Possible functionalities:** 29 | 30 | library(doC) 31 | 32 | registerDoC ("gcc -O3") ## similarly we can specify icc, pgcc etc. 33 | 34 | registerDoC ("gcc -fopenmp -O3") ## similarly we can specify icc, pgcc etc.
(the flag would be different, e.g. for Intel "-openmp") 35 | 36 | registerDoC ("gcc -ta=nvidia") ## similarly we can specify icc, pgcc etc. 37 | 38 | library(doFortran) 39 | 40 | registerDoFortran ("gfortran -O3") ## similarly we can specify ifort, pgf90, pgf77 etc. 41 | 42 | registerDoFortran ("gfortran -fopenmp -O3") ## similarly we can specify ifort, pgf90 etc. (the flag would be different, e.g. for Intel "-openmp") 43 | 44 | registerDoFortran ("gfortran -ta=nvidia") ## similarly we can specify ifort, pgf90, pgf77 etc. 45 | 46 | **Detail example:** 47 | 48 | registerDoFortran ("ifort -openmp -g -O3") 49 | 50 | myfunc <- foreach (i=1:n, x=double(n), y=double(n), .combine="+") %dopar% {y[i]<-sin(x[i])+3*cos(2*x[i])} 51 | 52 | This generates a Fortran file containing a Fortran version of the 53 | 54 | subroutine: 55 | 56 | subroutine myfunc (n, x, y) 57 | 58 | double precision x(n), y(n) 59 | 60 | !$omp parallel do 61 | 62 | do i=1,n 63 | 64 | y(i) = sin(x(i)) + 3*cos(2*x(i)) 65 | 66 | enddo 67 | 68 | end subroutine 69 | 70 | Then the Fortran code is compiled on the fly and imported as a shared 71 | 72 | object into R: 73 | 74 | dyn.load("myfunc.so") 75 | 76 | myfunc <- .Fortran ("myfunc", n=as.integer(n), x=as.double(x), y=double(n)) 77 | 78 | (Note: the OMP directives could also be exchanged for generating GPGPU code using the PGI compiler. (This is helpful, but might not have a large user base.)) 79 | -------------------------------------------------------------------------------- /RWiki/gsoc2011/ropenclcuda.txt: -------------------------------------------------------------------------------- 1 | ====== Library for OpenCL/CUDA Parallelism in R ====== 2 | 3 | 4 | **Summary:** 5 | 6 | This library will provide a way to easily create parallel functions to run on graphics processing units (GPUs) without needing to write CUDA/OpenCL code explicitly. 7 | 8 | **Description:** 9 | 10 | OpenCL and CUDA are two methods to access a GPU in order to expose general-purpose programming functionality; that is, to use a video card for functionality other than rendering textures and video (and so on). The key property of these video cards is that they have many cores (potentially hundreds) which are geared towards computing floating-point data. Being able to offload floating-point intensive work to the GPU can provide a significant performance advantage over traditional processors as a result, and since a large share of statistical computation is floating-point, this functionality can greatly enhance the speed of computations in R. The main focus will be translating R functions, with an emphasis on vector/matrix data, to CUDA/OpenCL-compatible C/C++ code so that the many benefits of general-purpose GPU computation can be taken advantage of by the R community. 11 | 12 | **Reference Packages:** 13 | 14 | foreach, inline, rcpp, romp, doMC, gputools, rgpu, codetools 15 | 16 | **Skills required:** 17 | 18 | Experience programming in C/C++, R, OpenCL, and CUDA; knowledge of GPGPU and parallel computing 19 | 20 | **Test:** 21 | 22 | Translate a snippet of R into CUDA-compatible C/C++. 23 | 24 | **Mentor:** 25 | 26 | Dr.
Ferdinand Jamitzky, Leibniz Supercomputing Centre, of the Bavarian Academy of Sciences and Humanities, Boltzmannstraße 1, 85748 Garching bei München, Room I.2.035, Phone: ++49-89-35831-8727, e-mail: jamitzky@lrz.de -------------------------------------------------------------------------------- /RWiki/gsoc2011/smart.txt: -------------------------------------------------------------------------------- 1 | ====== SMART: Sparse Multivariate Adaptive Regression Toolkit ====== 2 | 3 | 4 | **Summary:** An R package for large-scale nonparametric predictive learning and data mining 5 | 6 | **Background:** Modern data acquisition routinely produces massive amounts of very large-scale, ultrahigh-dimensional and highly complex datasets. Examples include interactive logs, unstructured text, multimedia/images, and high-throughput genomics data. Driven by the complexity of these modern datasets, highly adaptive and scalable data analysis procedures are crucially needed. However, existing large-scale learning and data mining software relies heavily on parametric models, which assume the data come from an underlying distribution that can be characterized by a finite number of parameters. These parametric models, though simple and tractable, are incapable of fully exploiting the information provided by the large amount of data due to their restrictive modeling assumptions. 7 | In this project, we propose to develop a scalable R package containing a family of modern nonparametric methods, which directly conduct inference in infinite-dimensional spaces and are more powerful at capturing the subtleties in challenging applications. The expected deliverables include careful C/C++ implementations (with easy-to-use R interfaces) of the Sparse Additive Models (Ravikumar et al. 2009), Multi-task Sparse Additive Models (Liu et al. 2008), and Greedy Sparse Additive Models (Liu and Chen 2009). The resulting package is expected to be able to handle very large datasets with millions of data points and thousands of features. This package has the potential to become a general-purpose exploratory data analysis toolbox for a wide range of data analysis practitioners. This proposed project lies at the intersection of machine learning and computational statistics. 8 | 9 | 10 | **Description:** Though many nonparametric packages exist in R (e.g. locfit, classPP, logspline, lokern, mboost, et al.), none of them is scalable to datasets with millions of data points and thousands of features (such data are quite common in large-scale text mining and preprocessing of raw genomic data). The main purpose of this project is to get “the fastest and most scalable” implementation of our flagship nonparametric method: Sparse Additive Models (SpAM). SpAM can simultaneously conduct high-dimensional prediction (e.g. regression/classification), function estimation (e.g. smoothing and curve fitting), and feature selection. It can be viewed as a nonparametric version of the Lasso regression. Using the SpAM implementation as the basic infrastructure, we will also provide high-quality implementations of several other state-of-the-art nonparametric predictive models, including Multi-task Sparse Additive Models (MT-SpAM) and Greedy Sparse Additive Models (G-SpAM). We plan to use two months to finish the basic SpAM model, and one more month to finish MT-SpAM and G-SpAM. 11 | Our implementation will feature both software engineering practice and heuristic coding tricks to make the algorithm efficient and scalable.
Example tricks include a database back-end (using Berkeley DB), adaptive coordinate descent (selectively choosing which coordinate to update instead of cyclical updates), stochastic gradient descent (a hybrid framework of online and batch estimation), path-following (calculating the full regularization path), generative skeleton + discriminative correction (using generative models to get a fast initialization, then conducting more iterations to obtain a more discriminative solution), and sparse B-spline/P-spline smoothing (in contrast to traditional kernel smoothing). To make the package more user-friendly, we will emphasize careful design patterns and software engineering. For example, the software will adopt a loosely coupled architecture, which enables users to easily tailor and modify it to fit their own purposes. Also, we will develop efficient tuning parameter selection methods so that users can use all the functions in a tuning-parameter-free mode. 12 | 13 | 14 | **Skills required:** R/C/C++ are required. Familiarity with numerical linear algebra using C is a plus. 15 | 16 | **Test:** A qualified applicant should have a strong background in numerical linear algebra. 17 | 18 | 19 | **Mentor:** Han Liu (Assistant Professor in Biostatistics and Computer Science, Johns Hopkins University). Email: hanliu@jhsph.edu or hanliu@cs.cmu.edu. With guidance from John Lafferty (Computer Science, Carnegie Mellon University), Kathryn Roeder (Statistics, Carnegie Mellon University), and Larry Wasserman (Statistics, Carnegie Mellon University). 20 | 21 | **References** 22 | 23 | 24 | **Nonparametric Greedy Algorithm for the Sparse Learning Problems**, 25 | Han Liu and Xi Chen 26 | In Advances in Neural Information Processing Systems (NIPS), 22, 2009. 27 | 28 | **Sparse Additive Models**, 29 | Pradeep Ravikumar, John Lafferty, Han Liu, Larry Wasserman 30 | Journal of the Royal Statistical Society: Series B (Statistical Methodology) (JRSSB), 2009. 31 | 32 | **Nonparametric Regression and Classification with Joint Sparsity Constraints**, 33 | Han Liu, John Lafferty, and Larry Wasserman 34 | In Advances in Neural Information Processing Systems (NIPS), 21, 2008. -------------------------------------------------------------------------------- /RWiki/gsoc2011/sra_with_roles.txt: -------------------------------------------------------------------------------- 1 | ====== Social Relations Analyses with Roles: SoRARO ====== 2 | 3 | 4 | **Summary:** A package for social relations analyses with roles will be created (using structural equation modeling), complementing the [[http://cran.r-project.org/web/packages/TripleR/index.html|TripleR]] package from GSoC 2010 5 | 6 | **Description:** Social relations analyses (Kenny, 1994) are a statistical tool to analyze round robin data of interpersonal perception or any other dyadic behavior. The interdependence of the data calls for a sophisticated statistical approach, which is outlined in Kenny et al. (2006). The package TripleR already analyzes round robin data in groups with indistinguishable members; SoRARO will extend this approach to groups with distinguishable members. Although the underlying statistics are quite different, the user interface and return values should be completely parallel to TripleR, so that users can work with both packages easily.
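A hypothetical sketch of what such a parallel interface could look like (the function name, formula style and ''roles'' argument are illustrative assumptions, not the actual TripleR or SoRARO API):

  ## a constructor mirroring a TripleR-style formula interface
  RR.role <- function(formula, data, roles) {
    structure(list(formula = formula, roles = roles, n = nrow(data)),
              class = "RRrole")
  }
  ## e.g.: RR.role(liking ~ perceiver.id * target.id | group.id,
  ##               data = dyads, roles = "family.role")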
7 | 8 | **Skills required:** R programming, experience in package building and documentation, psychological background knowledge with specific expertise concerning social relations analyses and structural equation models 9 | 10 | **Test:** Outline a formula interface that builds upon the interface of the TripleR package. Extend this interface to apply to SRA with roles. Likewise, outline the necessary data structure for SRAs with roles, again based on the TripleR data structure. 11 | 12 | **Mentor:** Felix Schönbrodt [[http://www.psy.lmu.de/allg2/people/principal_investigators/schoenbrodt/index.html|website]]; [[mailto:felix.schoenbrodt@psy.lmu.de|email]]; **Backup-Mentor**: Stefan Schmukle, Universität Münster 13 | 14 | Suggestion date: Feb 15 2010 15 | 16 | 17 | 18 | 19 | Kenny, D. A. (1994). Interpersonal perceptions: A social relations analysis. New York: Guilford Press. 20 | 21 | Kenny, D. A., Kashy, D. A., & Cook, W. L. (2006). Dyadic data analysis. New York: Guilford. -------------------------------------------------------------------------------- /RWiki/gsoc2011/tradeanalytics.txt: -------------------------------------------------------------------------------- 1 | ====== TradeAnalytics toolchain enhancements ====== 2 | 3 | 4 | ===== Summary: ===== 5 | 6 | The R-Forge project TradeAnalytics has rapidly become a core component of trade-based investment analysis in R. This GSOC project would aid in enhancing the project along any of several areas of current research and development. 7 | 8 | [[https://r-forge.r-project.org/projects/blotter/|TradeAnalytics R-Forge page]] 9 | 10 | ===== Description: ===== 11 | 12 | Several areas have been identified that could benefit from dedicated development effort 13 | 14 | 15 | 16 | ==== optional 'portfolio' applyRules loop evaluation ==== 17 | 18 | Dimension reduction/loop jumping was added to the applyRules code (via an evaluation index) to improve performance on large numbers of observations. Testing shows there are probably still some bugs here. 19 | 20 | We also want to add a ‘portfolio’ view of the evaluation index. This would give us knowledge of the other things going on in a portfolio and allow rules to react to things like current positions in other assets. It would also provide the foundation for rebalancing (below) and other things like portfolio-based hedging rules. The ‘portfolio’ view might additionally/differently require a ‘time’ view based on 'endpoints', similar to the one discussed for rebalancing rules below. 21 | 22 | 23 | 24 | ==== add endpoints support to rules evaluation ==== 25 | 26 | The evaluation index could also support periodic execution of rules. applyRules currently loops first over each asset in a portfolio; then the strategy indicators, signals, and rules are applied. An ‘endpoints’ evaluation index would allow us to apply ‘rebalance’ rules at those points in time. In some ways, this is an extension of the ‘portfolio’ evaluation index discussed above. 27 | 28 | For example, to set your maximum positions periodically based on available capital, the relative volatility of the instruments, or your overall portfolio reward/risk (via LSPM or PortfolioAnalytics), you would add a ‘rebalance’ rule to be evaluated periodically on endpoints to set the maximum positions on a per-asset basis. In another example, you might wish to periodically update account equity.
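A small sketch of how such periodic evaluation points can be located with xts (only ''endpoints()'' is real xts API here; the surrounding rule-evaluation loop is left out, as its design is part of the project):

  library(xts)
  data(sample_matrix)
  x <- as.xts(sample_matrix)
  ep <- endpoints(x, on = "months")    # row indices of month-end observations
  rebalance_dates <- index(x)[ep[-1]]  # drop the leading zero endpoints() returns
  rebalance_dates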
29 | 30 | 31 | 32 | ==== parameterization of strategies ==== 33 | 34 | 35 | Parameter testing is a core part of quantitative strategy development or forensic evaluation of trade execution. This would involve generating parameter sets (either via brute-force combinations or a more sophisticated methodology) and creating new blotter portfolios for each set. Each set is then tested, and the trades and results would be stored in separate portfolios. 36 | 37 | The tests could be evaluated in parallel (e.g. using a foreach loop). For parallel execution to work, multiple objects will need to be passed back to the master node and all the data will need to be put into the areas where other user-level functionality will expect to find it. 38 | 39 | ==== setup and teardown ==== 40 | 41 | There is currently a fair amount of boilerplate code needed to set up and tear down before/after calling applyStrategy. Adding setup and teardown slots to the strategy object would allow the user to specify the setup and teardown functions or activities in the strategy object (e.g. simple things like getSymbols or complex custom functions). A good example of potential ‘teardown’ tasks would be calling updatePortf, updateAcct, updateEndEq, and portfReturns in blotter to give you all the raw material you need for evaluating the test. 42 | 43 | 44 | ===== Skills Required & Test: ===== 45 | The successful applicant will demonstrate proficiency with R, xts, and blotter/quantstrat. This could be via identifying a patch or extension, providing a new demo script for one of the packages that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. The ideal candidate would also demonstrate prior interest or experience in finance. 46 | 47 | ===== Mentor: ===== 48 | Brian Peterson is one of the primary authors of these related packages, and would mentor the GSOC participant. 2011-03-02 49 | -------------------------------------------------------------------------------- /RWiki/gsoc2011/wikipediar.txt: -------------------------------------------------------------------------------- 1 | wikipediaR package 2 | 3 | 4 | **Summary:** A package for retrieving and handling Wikipedia articles 5 | 6 | **Description:** 7 | 8 | Wikipedia is a great source of information, which can be accessed freely and analyzed. For other resources like Twitter and OpenStreetMap there are already packages ([[http://cran.r-project.org/web/packages/twitteR/| twitteR]], [[http://osmar.r-forge.r-project.org/| osmar]]). 9 | 10 | To my knowledge there is no package covering Wikipedia articles. I wondered why nobody has written a package for that task yet, because there is a lot of interesting stuff a statistician can do with this data. The content can be analyzed using text mining tools, there is information about which other articles are linked in the text, one can examine the external links ... 11 | For example, you can take a group of articles and calculate their relatedness based on their links and the cosine similarity ([[http://www.cip.ifi.lmu.de/~molnarc/| example]]) 12 | 13 | 14 | After the 3 months of work the package should fulfill the following tasks: 15 | 16 | * The user is able to choose the source of the articles. This could be the [[http://www.mediawiki.org/wiki/API|Wikipedia API]] or a local XML dump of all articles (those can be downloaded).
In the first case the task could be done completely in R (using the [[http://www.omegahat.org/RCurl/|RCurl]] and [[http://www.omegahat.org/RSXML/|XML]] packages), but in the latter case the dump is a large file (2.5GB). I don't think one wants to parse this file with R. The idea would be to make use of other programs, e.g. read the XML file into a database and import the articles with the [[http://cran.r-project.org/web/packages/RMySQL/index.html|RMySQL package]]. 17 | 18 | * Save articles as S4 objects (other possibilities: S3 objects, lists) 19 | * Include print, summary and plot methods for articles, so the user can get a quick exploratory overview of the data. 20 | 21 | * Provide some methods which transform the articles in a suitable way, so that it will be easy for the user to analyze them with text mining tools from the [[http://tm.r-forge.r-project.org/tm|tm package]] or plot word clouds ([[http://cran.r-project.org/web/packages/wordcloud/index.html|wordcloud package]]) 22 | 23 | 24 | **Mentor:** -------------------------------------------------------------------------------- /RWiki/gsoc2011/yarrcache.txt: -------------------------------------------------------------------------------- 1 | ====== yarr Cache Extension ====== 2 | 3 | ==== Summary ==== 4 | Experiment with caching mechanisms in the ''yarr'' package framework for mixing ''R'' code with text. 5 | 6 | ==== Description ==== 7 | === What is yarr ? === 8 | ''yarr'' [[http://r-forge.r-project.org/projects/yarr]] is a new, but fully functional ''R'' package that facilitates mixing delimited ''R'' code with text, including text with other markup (//e.g.// HTML, LaTeX). The functionality of ''yarr'' has some intersection with that of the ''brew'' package. However, the ''yarr'' package is designed with flexibility and extensibility in mind. No other ''R'' package of this kind offers flexible syntax, on-the-fly extension, and facilities for handling various types of input and output simultaneously. In addition, the small code base of ''yarr'' makes it ripe for new contributors. 9 | 10 | Download the nightly build of ''yarr'' here [[http://r-forge.r-project.org/src/contrib/yarr_0.0.tar.gz|yarr_0.0.tar.gz]]. Or, install from ''R'' with 11 | install.packages("yarr", repos="http://R-Forge.R-project.org") 13 | 14 | 15 | The following ''yarr'' file illustrates a simple case where ''R'' code delimited by '<<' and '>>' is mixed with text to form an email message to the R-help mailing list. 16 | 17 | Dear R-Help, 18 | << 19 | # Function to format and cat the date 20 | d <- function() cat(format(Sys.time(), "%m-%d-%Y")) 21 | >> 22 | I want to write an email that contains R code and output, as if I 23 | had entered the code at the R prompt. For example: 24 | <<@ 25 | f <- function(x) { 26 | x+1 27 | } 28 | f(1) 29 | stem(rnorm(50)) 30 | >> 31 | Is there a way to do that without copy and pasting from R? 32 | Also, is there a way to include the date (<<= d() >>) in the 33 | text of my email? 34 | 35 | Regards, 36 | useR 37 | 38 | 39 | Evaluating this file with a call to ''yarr'', say ''yarr::yarr('email.yarr')'' yields the following output, which could be pasted into an email editor and mailed to the R-help list. 40 | 41 | 42 | Dear R-Help, 43 | 44 | I want to write an email that contains R code and output, as if I 45 | had entered the code at the R prompt.
For example: 46 | > f <- function(x) { 47 | + x+1 48 | + } 49 | > f(1) 50 | [1] 2 51 | > stem(rnorm(50)) 52 | 53 | The decimal point is at the | 54 | 55 | -2 | 1 56 | -1 | 97766 57 | -1 | 4320 58 | -0 | 8887655 59 | -0 | 4321100 60 | 0 | 000122344 61 | 0 | 566777 62 | 1 | 00003344 63 | 1 | 568 64 | 65 | Is there a way to do that without copy and pasting from R? 66 | Also, is there a way to include the date (03-03-2011) in the 67 | text of my email? 68 | 69 | Regards, 70 | useR 71 | 72 | 73 | Like the ''brew'' package, ''yarr'' is designed to work with the [[http://httpd.apache.org/|Apache]] module [[http://rapache.net/|rapache]] to facilitate dynamic web applications powered by ''R''. This [[http://biostatmatt.com/yarr/MortCalc.yarr|mortgage calculator]] demonstrates the combined use of ''yarr'' and ''rapache''. 74 | 75 | === yarr caching === 76 | In many contexts, the ''yarr'' function may be called repeatedly, with little or no change to the input file. This is common in web applications like an online [[http://biostatmatt.com/yarr/MortCalc.yarr|mortgage calculator]], but also for document authorship (//e.g.// LaTeX, Texinfo, markdown). In these cases, it may be wasteful to reevaluate each embedded ''R'' expression, especially when the ''yarr'' input is identical from one call to the next. 77 | 78 | Ideally, the ''yarr'' function would remember (cache) the input and output of earlier calls, and use this information to process the current input more efficiently. This Google Summer of Code 2011 application proposes to experiment with, and develop a caching mechanism for the ''yarr'' framework. 79 | 80 | ==== Skills required ==== 81 | Familiarity with ''R'', willingness to learn new things, and some creativity are required. Caching strategies may include manipulation of ''R'' environments and serialization of ''R'' data structures. There are a variety of caching strategies that might work for ''yarr'', and there are several examples of caching among the existing ''R'' packages that might facilitate ideas. 82 | 83 | ==== Test ==== 84 | Consider the following ''yarr'' file. Which of these sentences with embedded ''R'' expressions might be processed more efficiently using a caching mechanism, and why? 85 | 86 | 1. The answer is <<= cat(40 + 2) >>. 87 | 2. The current date and time is <<= cat(date()) >>. 88 | 3. This sentence defines a function<< f <- function(x) x+x >>. 89 | 4. This is a test for a file 'foo': <<= cat(file.exists('foo')) >> 90 | 91 | 92 | ==== Mentor ==== 93 | 03/03/2011\\ Matt Shotwell\\ Assistant Professor\\ Department of Biostatistics\\ Vanderbilt University\\ Matt.Shotwell _at_ Vanderbilt.edu 94 | -------------------------------------------------------------------------------- /RWiki/gsoc2012/image_labels.txt: -------------------------------------------------------------------------------- 1 | **Background:** the [[http://directlabels.r-forge.r-project.org/|directlabels]] package facilitates use of direct labels instead of legends in everyday statistical graphics made with lattice or [[http://had.co.nz/ggplot2/|ggplot2]]. Labeling is currently limited to textual annotations, but sometimes it is more informative to label using images, i.e. 
[[http://cbio.ensmp.fr/~thocking/image-labels.png|iris data labeled with photos of the 3 flower species]] or [[http://www.flickr.com/photos/47179524@N02/4346623091/in/photostream|scatterplot comparing several countries using their flags]] 2 | 3 | **Goal:** develop a framework for labeling plots using images instead of textual factor names. There is some preliminary code, but there are a few technical problems with plotting images in R that must be overcome: 4 | 5 | * [[https://r-forge.r-project.org/scm/viewvc.php/pkg/directlabels/etc/flags/?root=directlabels|SVG vector graphics]] using [[http://cran.r-project.org/web/packages/grImport/index.html|grImport]] 6 | * How to import SVG into R in a portable manner? Currently I use [[https://r-forge.r-project.org/scm/viewvc.php/pkg/directlabels/etc/flags/svg2ps.py?view=markup&revision=533&root=directlabels|a python script]] to convert SVG to PS, then use grImport to read PS images. 7 | * Some complicated PS images cause errors in ''grImport::PostScriptTrace()'' when attempting to convert to RGML, e.g. [[http://upload.wikimedia.org/wikipedia/commons/d/da/Flag_of_Kansas.svg]] 8 | * Some PS images are converted to RGML but then plotted incorrectly, e.g. [[http://upload.wikimedia.org/wikipedia/commons/0/01/Flag_of_California.svg]] 9 | * Possible solution: convert SVG images to hi-res raster graphics? 10 | * [[https://r-forge.r-project.org/scm/viewvc.php/pkg/directlabels/etc/iris-images/image-labels.R?view=markup&revision=541&root=directlabels|JPG raster graphics]] using [[http://bioconductor.wustl.edu/bioc/html/EBImage.html|EBImage]] 11 | * High-resolution JPG images take a long time to read into memory, and create huge files when making R plots using the ''pdf()'' device, e.g. a 20MB PDF for [[http://cbio.ensmp.fr/~thocking/image-labels.png|this plot of the iris data]]. Can we compress the images plotted with ''grid.raster''? 12 | * Possible solution: just don't use the pdf device? jpg and png images created in this way are of decent size. 13 | 14 | **Tests:** 15 | * To see how to direct label using textual annotations, make some direct labeled plots using the lattice, ggplot2, and directlabels packages. 16 | * To explore the current state of the art in importing images into R, make some plots that use vector images using the grImport package, and raster graphics using the EBImage package. 17 | * To demonstrate basic knowledge of SVG and PS syntax and semantics, make a PS or SVG plot using R, then MANUALLY edit the PS or SVG file to make a modification of the plot. 18 | 19 | **Mentors:** [[http://cbio.ensmp.fr/~thocking/|Toby Dylan Hocking]] , [[http://www.stat.auckland.ac.nz/~paul/|Paul Murrell]] -------------------------------------------------------------------------------- /RWiki/gsoc2012/interactive_animations.txt: -------------------------------------------------------------------------------- 1 | **Background/Related work:** standard R graphics are based on the pen-and-paper model, which makes animations and interactivity difficult to accomplish. Non-interactive animations can be accomplished with the ''animation'' package, and some interactions with non-animated plots can be done with the [[https://github.com/ggobi/cranvas/wiki|qtbase, qtpaint, and cranvas packages]]. Linked plots on the web are possible using [[http://www.omegahat.org/SVGAnnotation/SVGAnnotationPaper/SVGAnnotationPaper.html|SVGAnnotation]] or [[http://sjp.co.nz/projects/gridsvg/|gridSVG]], but using these to create such a visualization requires knowledge of Javascript.
The [[https://r-forge.r-project.org/scm/viewvc.php/pkg/?root=svgmaps|svgmaps]] package defines interactivity (hrefs, tooltips) in R code using igeoms, and exports SVG plots using gridSVG. [[https://github.com/trifacta/vega|Vega]] can be used for describing plots in Javascript. 2 | 3 | The [[https://github.com/tdhock/animint|animint]] package [[http://sugiyama-www.cs.titech.ac.jp/~toby/animint/|supports brushing by adding two new aesthetics to the grammar of graphics]]: 4 | * ''showSelected=variable'' means that only the subset of the data that corresponds to the selected value of ''variable'' will be shown. 5 | * ''clickSelects=variable'' means that clicking a plot element will change the currently selected value of ''variable''. 6 | The basic idea is to start from a list of ggplots, then export all their data in CSV along with a ''plot.json'' metadata file, which is then read and plotted using [[http://d3js.org|D3]]. The result is several linked, interactive, animated plots which are viewable in a web browser: 7 | 8 | ^ Title ^ Plots ^ Clickable plots ^ Variables shown ^ Animated? ^ 9 | | [[http://sugiyama-www.cs.titech.ac.jp/~toby/animint/evolution-simulator/index.html|Evolution simulator]] | 3 | 3 | 6 | yes | 10 | | [[http://sugiyama-www.cs.titech.ac.jp/~toby/animint/breakpoints/index.html|Breakpoints]] | 2 | 1 | 5 | no | 11 | | [[http://sugiyama-www.cs.titech.ac.jp/~toby/animint/mmir/index.html|Max margin interval regression]] | 4 | 2 | 9 | no | 12 | 13 | **Proposal:** implement several improvements to the ''animint'' package, so that more of the features of ggplot2 are available in these plots. In its current form, a 14 | large part of the grammar is not available, so the GSOC student should implement one or several of the following: 15 | * Scales 16 | * Facets 17 | * Stats 18 | * Coordinate transforms 19 | 20 | **Tests:** 21 | * To demonstrate some knowledge of D3, github, and ggplot2, implement support for drawing rectangles (geom_rect) or horizontal lines (geom_hline) in the animint package. 22 | * Convert an example from the animation package to an interactive animint plot, e.g. kmeans.ani, knn.ani, grad.desc. 23 | **Mentors:** [[http://sugiyama-www.cs.titech.ac.jp/~toby/|Toby Dylan Hocking]] , 24 | [[http://had.co.nz/|Hadley Wickham]] , backup [[http://yihui.name|Yihui Xie]] . 25 | -------------------------------------------------------------------------------- /RWiki/gsoc2012/knitr.txt: -------------------------------------------------------------------------------- 1 | ====== Dynamic report generation in the web with R ====== 2 | 3 | **Summary:** Extend the **[[http://yihui.name/knitr|knitr]]** package for better applications on the web, so that web pages can be easily compiled with R, like Sweave documents. 4 | 5 | **Description:** The **knitr** package is a general-purpose tool for dynamic report generation; the basic idea is like Sweave: we only write source code in a document mixed with normal text, then use a tool to evaluate the code to get output, so that we do not need to copy and paste results from the computing engine to the output document. Knitr comes with a flexible design and can be extended to multiple output formats via output hooks. Currently the LaTeX and Markdown support is almost complete, but HTML is not yet fully supported. Two missing features are code highlighting themes and animation support (both have been done for LaTeX).
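To make the target concrete, here is a minimal sketch of a web page mixing HTML with R code using knitr's Rhtml syntax, compiled with ''knit('page.Rhtml')'' (the file name is an arbitrary illustration):

  <html><body>
  <p>The mean speed is <!--rinline mean(cars$speed) --> mph.</p>
  <!--begin.rcode cars-plot, fig.width=5, fig.height=4
  plot(cars)
  end.rcode-->
  </body></html>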
6 | 7 | The first part of this project is to add the missing features in HTML, which involves CSS and the R package **[[http://cran.r-project.org/package=animation|animation]]** (function ''saveHTML()''). After this is done, the old [[http://animation.yihui.name|wiki site]] of the **animation** package can be rewritten with **knitr**. Each page is written with R source code mixed with HTML or Markdown, then compiled to the final output containing animations. This will make it easier to validate that the **animation** package works correctly and that the animations are reproducible. Depending on the progress, not all pages have to be converted in this project. The expected time span on this part is one month. 8 | 9 | The second part focuses on online collaboration of reproducible research. We will explore two possible directions: 10 | 11 | - [[http://www.omegahat.org/R/src/contrib/RGoogleDocs_0.6-0.tar.gz|RGoogleDocs]]: this package enables us to manipulate Google Documents through R, and a Google document is essentially an HTML document, so hopefully we can use **knitr** to process such a document and upload the results back to a Google account; since several people can share the same document, this may facilitate online collaboration on data analysis 12 | - [[http://johnmacfarlane.net/pandoc/|Pandoc]]: a universal document converter; we can start with a simple markdown document, compile it with **knitr**, convert it to a Word document and upload it to a Google account (a small pipeline sketch is given below) 13 | 14 | This part is also expected to be done in a month. 15 | 16 | The third part is not completely decided yet: we can either see if the work can be applied to other websites like [[http://addictedtor.free.fr/graphiques/|R Graph Gallery]], or go in another direction -- convert R package documentation to better web pages. 17 | 18 | - R Graph Gallery: the site can be built dynamically with R, which guarantees reproducibility and is hopefully easier to maintain 19 | - R doc: one problem with the current R package documentation is that we do not know the results of examples unless we copy and run the code; **knitr** should be able to extract the example code and convert it to HTML with all results displayed under the code (similar ideas can be found in the [[http://rgm2.lab.nig.ac.jp/RGM2/|R Graphical Manual]], but we want better web pages like [[https://github.com/downloads/yihui/knitr/knitr-minimal.html|this one]], with code highlighted and output commented out; this is also a chance to use much better CSS than R's default one) 20 | 21 | References: 22 | 23 | * [[http://yihui.name/knitr/|knitr homepage]] 24 | * [[https://github.com/downloads/yihui/knitr/knitr-manual.pdf|knitr main manual]] 25 | * [[https://github.com/downloads/yihui/knitr/knitr-graphics.pdf|knitr graphics manual]] 26 | * [[https://github.com/downloads/yihui/knitr/knitr-themes.pdf|knitr themes manual]] 27 | 28 | **Skills required:** Good knowledge of R (especially text processing) and basic knowledge of the **knitr** package; familiarity with HTML/CSS/JavaScript and Markdown; experience with Sweave, the [[http://cran.r-project.org/package=animation|animation]] package and RGoogleDocs is a plus, as is a longer "[[http://github.com|GitHub]] age". 29 | 30 | **Test:** Read the [[http://yihui.name/knitr/demo/minimal/|minimal demos]] on the **knitr** website, and use the **knitr** package to write a small report in HTML with a dataset (it must contain plots; a few paragraphs will be enough). 
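To make the Pandoc direction above concrete, a minimal sketch of the intended pipeline (it assumes a file ''report.Rmd'' exists and that the ''pandoc'' binary is installed and on the PATH):

  library(knitr)
  knit("report.Rmd")                         # evaluates the R chunks and writes report.md
  system("pandoc report.md -o report.docx")  # converts the Markdown output to a Word document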
31 | 32 | **Mentor:** [[http://yihui.name|Yihui Xie]] and [[http://www.linkedin.com/pub/shangxuan-zhang/4/b1/672|Shangxuan Zhang]] -------------------------------------------------------------------------------- /RWiki/gsoc2012/mentor.txt: -------------------------------------------------------------------------------- 1 | ====== Becoming a GSoC Mentor ====== 2 | 3 | ===== Mailing Lists ===== 4 | 5 | Sign up for the following mailing lists: 6 | 7 | * [[https://stat.ethz.ch/mailman/options/gsoc-mentors/ | R's mentor-private mailing list]] 8 | * [[https://groups.google.com/forum/?fromgroups#!forum/gsoc-r|students + mentors mailing list]] 9 | 10 | ===== Melange ===== 11 | Google organizes the GSoC in Melange. You need to 12 | - have a Google account 13 | - [[ http://www.google-melange.com/gsoc/profile/mentor/google/gsoc2012| Register as a mentor at Melange]] 14 | - [[http://www.google-melange.com/gsoc/request/google/gsoc2012/rproject|Request to be a mentor for R]]: telling the admins in one or two sentences who you are (consider using a nickname that is reasonably close to your real name) and why you want to become a mentor for R makes their life much easier. 15 | - You can see the proposals for R via the "Dashboard" 16 | - To sign up as a possible mentor for a proposal, click the picture below "Wish to Mentor" on the left side. 17 | - The image should then display an orange "YES", and the table "Proposals submitted to my orgs" will display your login if you select the additional column "Possible Mentor link_ids" (via "columns" at the foot of the table) -------------------------------------------------------------------------------- /RWiki/gsoc2012/meucci.txt: -------------------------------------------------------------------------------- 1 | ===== Summary ===== 2 | Extend the functionality of key ReturnAnalytics packages with the current research of Attilio Meucci, a thought leader in risk and portfolio management. 3 | 4 | ===== Description ===== 5 | Attilio Meucci is a pioneer in advanced risk and portfolio management. His innovations include Entropy Pooling (a technique for fully flexible portfolio construction), Factors on Demand (an on-the-fly factor model for optimal hedging), Effective Number of Bets (an entropy-eigenvalue statistic for diversification management), Fully Flexible Probabilities (a technique for on-the-fly stress-testing and estimation without re-pricing), and the Copula-Marginal Algorithm (an algorithm to generate panic copulas). Attilio Meucci serves as the chief risk officer at Kepos Capital LP. Concurrently, he is an adjunct professor at the Master's in Financial Engineering – Baruch College - CUNY. 6 | 7 | Attilio has included Matlab code, some of which requires the additional Optimization Toolbox, alongside his key papers to better demonstrate the concepts. We would like to convert that code to R to make it more widely accessible. 8 | 9 | Beyond the initial conversion, we will functionalize key aspects of that code and consider including functions in PerformanceAnalytics, PortfolioAnalytics, or developing a package around one or more of the concepts described above, starting with the Effective Number of Bets. 10 | 11 | PerformanceAnalytics is an R package that provides a collection of econometric functions for performance and risk analysis. PortfolioAnalytics is an extensible business-focused framework for portfolio optimization and analysis. 12 | 13 | === References === * ReturnAnalytics on R-Forge. 
https://r-forge.r-project.org/projects/returnanalytics/ 15 | 16 | * Meucci, Attilio, Managing Diversification (April 1, 2010). Risk, pp. 74-79, May 2009; Bloomberg Education & Quantitative Research and Education Paper. Available at SSRN: http://ssrn.com/abstract=1358533 17 | 18 | * Meucci, Attilio, Factors on Demand, Building a Platform for Portfolio Managers, Risk Managers and Traders. 19 | http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1565134 20 | 21 | * Meucci, Attilio, Exercises in Advanced Risk and Portfolio Management - With Step-by-Step Solutions and Fully Documented Code (August 15, 2010). Available at SSRN: http://ssrn.com/abstract=1447443 or http://dx.doi.org/10.2139/ssrn.1447443 22 | 23 | * Meucci, Attilio, Risk and Asset Allocation. Springer Finance (2005). ISBN: 3540222138 24 | 25 | ===== Test ===== 26 | The successful applicant will demonstrate proficiency with both Matlab and R. Within R, the candidate should show proficiency with xts and PerformanceAnalytics or PortfolioAnalytics. This could be via identifying a patch or extension, providing a new demo script for one of the packages that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. The ideal candidate would also demonstrate prior interest or experience in finance. 27 | 28 | =====Mentor===== 29 | Brian Peterson is one of the primary authors of these packages, and has previously mentored GSoC projects. 30 | 31 | Peter Carl is one of the primary authors of these related packages, and would mentor the GSOC participant. 2012-02-24 -------------------------------------------------------------------------------- /RWiki/gsoc2012/performanceanalytics.txt: -------------------------------------------------------------------------------- 1 | **Summary**: Add Measures and Attribution to PerformanceAnalytics and PortfolioAnalytics based on Bacon (2008) 2 | 3 | **Description**: PerformanceAnalytics is an R package that provides a collection of econometric functions for performance and risk analysis. It applies current research on return-based analysis of strategy or fund returns for risk, autocorrelation and illiquidity, persistence, style analysis and drift, clustering, quantile regression, and other topics. 4 | 5 | PerformanceAnalytics has long enjoyed contributions from users who would like to see specific functionality included. In particular, Diethelm Wuertz of ETHZ (and the well-known R/Metrics packages) has generously contributed a very large set of functions based on Bacon (2008). It is a great starting point, but many of these functions need to be finished and converted such that the functions and interfaces are consistent with other PerformanceAnalytics functions, and are appropriately documented. 6 | 7 | In addition, certain functions useful for optimization will need to be converted to be used in PortfolioAnalytics, an extensible business-focused framework for portfolio optimization and analysis. That package focuses on numeric approaches for solving non-quadratic problems useful, for example, in risk budgeting. 8 | 9 | Assuming the prior work goes well, there will be an opportunity to extend PortfolioAnalytics to cover portfolio attribution as described in Bacon (2008). This is frequently-requested functionality from the finance community that uses R. 
We would use FinancialInstrument, an R package for defining and storing meta-data for tradeable contracts (referred to as instruments, such as stocks, futures, options, etc.), to define relationships among securities to be used by attribution functions. 10 | 11 | **References**: 12 | * ReturnAnalytics on R-Forge. https://r-forge.r-project.org/projects/returnanalytics/ 13 | * Carl Bacon, "Practical Portfolio Performance Measurement and Attribution" (London: John Wiley & Sons, September 2004), ISBN 978-0-470-85679-6; 2nd Edition, May 2008, ISBN 978-0470059289 14 | 15 | 16 | **Test**: 17 | The successful applicant will demonstrate proficiency with R, and specifically with xts and PerformanceAnalytics or PortfolioAnalytics. This could be via identifying a patch or extension, providing a new demo script for one of the packages that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. The ideal candidate would also demonstrate prior interest or experience in finance. 18 | 19 | **Mentor**: Peter Carl is one of the primary authors of these related packages, and would mentor the GSOC participant. 2012-02-18 20 | 21 | Brian Peterson is one of the primary authors of these packages, has previously mentored GSoC projects, and would be backup mentor for this project. -------------------------------------------------------------------------------- /RWiki/gsoc2012/performancebenchmarking.txt: -------------------------------------------------------------------------------- 1 | **Summary**: R code for Portfolio Performance Measurement and Benchmarking from Christopherson, Carino, and Ferson (2009) 2 | 3 | **Description**: 4 | PerformanceAnalytics is an R package that provides a collection of econometric functions for performance and risk analysis. It applies current research on return-based analysis of strategy or fund returns for risk, autocorrelation and illiquidity, persistence, style analysis and drift, clustering, quantile regression, and other topics. 5 | 6 | As a key repository of important performance and risk benchmarking and attribution functions, it is important to keep up with the latest publications in the field. Christopherson (2009) is such a publication. David Cariño, a Research Fellow at Russell Investments and one of the co-authors of this text, would co-mentor this project. 7 | 8 | The majority of the measures described in Portfolio Performance Measurement and Benchmarking already exist in PerformanceAnalytics, but several newer or lesser-used functions do not. A non-exhaustive list includes: 9 | * holding period returns 10 | * linked returns 11 | * cash flow and unit value returns 12 | * money weighted returns 13 | * Modigliani measure 14 | * Appraisal Ratio 15 | * Fixed Income Risk Measures for duration, convexity, prepayment, and issuer-specific risks 16 | * market timing metrics 17 | * portfolio attribution 18 | * linking attribution 19 | * benchmark and index construction 20 | 21 | Additionally, there are multiple chapters on factor analysis that may be mostly covered by functions already in the package FactorAnalytics: overlap and required functionality will need to be cross-referenced, with necessary additional functionality created as part of this project. 22 | 23 | 24 | **References**: 25 | * ReturnAnalytics on R-Forge. https://r-forge.r-project.org/projects/returnanalytics/ 26 | 27 | * Portfolio Performance Measurement and Benchmarking. 2009. 
McGraw-Hill Finance & Investing. J. Christopherson, D. Carino, W. Ferson 28 | 29 | **Test**: 30 | The successful applicant will demonstrate proficiency with R, and specifically with xts and PerformanceAnalytics or PortfolioAnalytics. This could be via identifying a patch or extension, providing a new demo script for one of the packages that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. The ideal candidate would also demonstrate prior interest or experience in finance. 31 | 32 | **Mentor**: 33 | 34 | David Cariño is a Research Fellow at Russell Investments 35 | 36 | Doug Martin is the head of the University of Washington Computational Finance program, and would co-mentor this project. 37 | 38 | Peter Carl is one of the primary authors of these related packages, and would mentor the GSOC participant. 39 | 40 | Brian Peterson is one of the primary authors of these packages, has previously mentored GSoC projects, and would be backup mentor for this project. 2012-03-23 -------------------------------------------------------------------------------- /RWiki/gsoc2012/portfolioanalytics.txt: -------------------------------------------------------------------------------- 1 | ===== Summary ===== 2 | Add additional closed-form and global optimizer backends to PortfolioAnalytics 3 | 4 | 5 | 6 | 7 | 8 | ===== Description ===== 9 | PortfolioAnalytics is a leading R package for complex portfolio optimization, allowing for the creation of complex constraints and objectives that mirror real-world portfolio optimization problems. 10 | 11 | Many of these complex problems are 'global optimization' problems, and not tractable by closed-form optimizers. As such, PortfolioAnalytics includes a Burns-style random portfolio backend, and also utilizes the R package DEoptim. These methods work very well for global optimization objectives. 12 | 13 | Unfortunately, most people are introduced to portfolio optimization in much simpler contexts where global optimizers are not required. This project would first add support for quadratic, linear, and conic solvers to PortfolioAnalytics, and some intelligence to detect optimization objectives that would be suitable for this kind of back-end. For example, mean-variance portfolios with simple box constraints are quadratic, mean-gaussian-ETL objectives are linear or conic, etc. 14 | 15 | Additionally, more global optimizers should be added. Candidates include the packages rgenoud, soma, and pso. 16 | 17 | The new 'ROI' optimization framework for R should be examined and quite possibly integrated, as it may provide a one-stop entry to multiple additional optimization backends. Currently, ROI supports quadprog, glpk, lpop, symphony, and cplex. There seems to be some complexity in the specification of constraints, but translation from PortfolioAnalytics constraints to ROI constraints should be straightforward. 18 | 19 | === References === * ReturnAnalytics on R-Forge. 
https://r-forge.r-project.org/projects/returnanalytics/ 21 | 22 | * ROI on R-Forge: https://r-forge.r-project.org/projects/roi/ 23 | 24 | * Particle Swarm Optimization on CRAN: http://cran.r-project.org/web/packages/pso/index.html 25 | 26 | * SOMA on CRAN: http://cran.r-project.org/web/packages/soma/index.html 27 | 28 | * rgenoud on CRAN: http://cran.r-project.org/web/packages/rgenoud/index.html 29 | 30 | * quadprog on CRAN: http://cran.r-project.org/web/packages/quadprog/index.html 31 | 32 | ===== Test ===== 33 | The successful applicant will demonstrate proficiency with R. Within R, the candidate should show proficiency with xts and PortfolioAnalytics, and also with one or more optimization tools in R. This could be via identifying a patch or extension, providing a new demo script for one of the packages that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. The ideal candidate would also demonstrate prior interest or experience in finance. 34 | 35 | 36 | 37 | =====Mentor===== 38 | Guy Yollin is a Professor in the Computational Finance program at the University of Washington, former head developer at Insightful, and has extensive optimization experience. 39 | 40 | Peter Carl is one of the primary authors of these related packages, and would co-mentor the GSOC project. 41 | 42 | Brian Peterson is one of the primary authors of these packages, and has previously mentored GSoC projects. 2012-03-23 -------------------------------------------------------------------------------- /RWiki/gsoc2012/rmaxima.txt: -------------------------------------------------------------------------------- 1 | **Summary**: 2 | RMaxima --- an R package for interfacing the free Computer Algebra System Maxima. 3 | 4 | **Description**: 5 | Maxima is a free, open source Computer Algebra System (CAS), descending from the former Macsyma system on Lisp machines. Maxima's syntax is a bit older than that of the three Ma's (Mathematica, Maple, Matlab), but it is much more powerful than the symbolic algebra systems available for R right now. 6 | 7 | Maxima has been integrated with Scilab and the [[http://euler.rene-grothmann.de/|Euler Math Toolbox]]. The C++ files for Euler in particular appear relatively clear; there, communication with Maxima is realized through 'pipes'. The main C++ file can probably be directly utilized. 8 | 9 | The design of the interface is another major task: how to use symbolic expressions in R, how to reuse numeric results, and, later on, how to connect to the Rmpfr package and its high-precision arithmetic. 10 | 11 | **Skills required**: 12 | Programming in C++ and R, possibly using the 'Rcpp' and 'inline' packages. Some interest in high-precision numerics and symbolic algebra. 13 | 14 | **Test**: 15 | Have a look at the interfaces to Maxima in Scilab and the Euler Math Toolbox. 16 | Propose how such an interface could be designed for R. Download the Euler sources and identify the files that realize this bridge between Maxima and Euler. 17 | 18 | **Mentor**: 19 | Hans W. Borchers for the R side and design of the interface. A mentor for the C++ and Rcpp side is still to be found. 20 | 21 | (Please note: the author of the Euler Math Toolbox has explicitly agreed that his sources can be reused for such a project.) 
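To illustrate the pipe idea mentioned above, a very rough one-shot sketch in R (it assumes a ''maxima'' binary on the PATH and that the result appears on the last output line; a real interface would keep a persistent Maxima process open, which is exactly the design task of this project):

  ## Sketch only: send one expression to Maxima and read the printed result
  maxima_eval <- function(expr) {
    out <- system2("maxima", args = "--very-quiet",
                   input = paste0(expr, ";"), stdout = TRUE)
    out[length(out)]  # crude: assumes the result is on the last line
  }
  maxima_eval("integrate(sin(x)^2, x)")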
22 | -------------------------------------------------------------------------------- /RWiki/gsoc2012/ropensci.txt: -------------------------------------------------------------------------------- 1 | ====== rOpenSci ====== 2 | 3 | 4 | **Summary:** Dynamic access and visualization of scientific data repositories 5 | 6 | **Description:** [[http://ropensci.org|rOpenSci]] is a collaborative effort to develop R-based tools for facilitating Open Science. 7 | Projects in rOpenSci fall into two categories: those for working with the scientific literature, and those for working directly with the databases. Visit the active development hub of each project on GitHub, where you can see and download source code, see updates, and follow or join the developer discussions of issues. Most of the packages work through an API provided by the resource (database, paper archive) to access data and bring it within reach of R's powerful data-manipulation tools. 8 | 9 | See a [[http://ropensci.org/ropensci-packages/|complete list]] of our R packages currently in development. 10 | 11 | The student could choose to work on a package for a particular data repository of interest, or develop tools for visualization and exploration that could function across the existing packages. 12 | 13 | **Skills required:** Should be able to use R to perform data manipulation and aggregation. Familiarity with the R packages RCurl & XML, with linked data concepts, common web APIs, or advanced visualization tools in R would be helpful but not required; students can expect to learn these skills during the project. 14 | 15 | 16 | **Mentor:** The [[http://ropensci.org/developers/|rOpenSci dev team]], Carl Boettiger, Scott Chamberlain, and Karthik Ram, with support from [[http://ropensci.org/advisors/|rOpenSci advisors]] Hadley Wickham, Duncan Temple Lang, Bertram Ludascher, JJ Allaire and Matt Jones. -------------------------------------------------------------------------------- /RWiki/gsoc2012/rtaq.txt: -------------------------------------------------------------------------------- 1 | ====== RTAQ: add new high frequency data structure and analytical routines to RTAQ ====== 2 | 3 | ===== Summary ===== 4 | Toolkit for data management and analysis of high-frequency financial data in R. 5 | 6 | ===== Description ===== 7 | The economic value of analyzing high-frequency financial data is now obvious, both in the academic and the financial world. It is not only the basis of intraday and daily risk monitoring and forecasting and an input to the portfolio allocation process, but also the foundation of high-frequency trading. For the efficient use of high-frequency data in financial decision making, the RTAQ package of the TradeAnalytics project implements state-of-the-art techniques for cleaning, matching, liquidity, risk and correlation calculation based on high-frequency data. All these methods appropriately take into account the stylized features of high-frequency data such as microstructure noise, jumps in the price level and the non-synchronicity of the observations. 8 | The goal of this project is to develop much-needed extra functionality in this growing area: 9 | ===Data management:=== 10 | RTAQ is now designed to deal with NYSE TAQ data, which is popular in the academic world. However, among practitioners, other data sources, such as Bloomberg, InteractiveBrokers and Reuters, are popular. The goal is to develop an integrated, easy-to-use and fast interface to load and manage high-frequency data from all major data providers into R as xts objects. 
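To give a flavour of the target interface, a minimal sketch of turning one vendor's exported trade file into an xts object (the file name and column names here are invented for illustration; a real loader would handle each provider's format):

  library(xts)
  raw <- read.csv("trades_20120103.csv", stringsAsFactors = FALSE)
  trades <- xts(raw[, c("PRICE", "SIZE")],
                order.by = as.POSIXct(raw$TIME, tz = "GMT"))
  head(trades)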
11 | Transaction-by-transaction data is noisy. State-of-the-art functionality should be implemented to deal with this noisiness, such as the (volume-weighted) pre-averaging of price data. 12 | ===Volatility and covariance forecasting based on high-frequency risk measures=== 13 | RTAQ now offers a wide range of univariate and multivariate “realized” volatility measures. For financial decision making, it is important to have volatility forecasts based on those ex post measures. In this project we aim to implement the recently proposed ARFIMA, HAR, HAR-J, Hybrid GARCH, Realized GARCH and HEAVY models for forecasting volatility based on realized volatility measures. In addition, with the permission of the now swamped Scott Payseur of the //realized// package, functionality from this important package for realized volatility would be integrated into the RTAQ package, where it could be maintained by a more active project team. 14 | 15 | =====Skills required===== 16 | The successful applicant should have proficiency with R, xts and RTAQ. The ideal candidate would also have prior experience in the analysis of high-frequency data and market microstructure. 17 | 18 | =====Test===== 19 | The successful applicant must demonstrate prior experience with high-frequency data and R. This could be via identifying a patch or extension, providing a new demo script for the RTAQ package that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. 20 | 21 | =====Mentors===== 22 | Dr. Kris Boudt co-authored “RTAQ: Tools for the analysis of trades and quotes in R” and has published many articles in the area of high-frequency financial data in leading international financial and econometrics journals. 23 | 24 | Brian Peterson is one of the primary authors of the packages in the TradeAnalytics project, and has previously mentored GSOC participants. 2012-02-22. 25 | -------------------------------------------------------------------------------- /RWiki/gsoc2012/sp_econometrics.txt: -------------------------------------------------------------------------------- 1 | ====== Bayesian Spatial Econometrics with R ====== 2 | 3 | 4 | **Summary:** Translate Matlab code in the Spatial Econometrics Toolbox 5 | for Bayesian methods into R. 6 | 7 | 8 | **Description:** 9 | The Spatial Econometrics Toolbox ([[http://www.spatial-econometrics.com/]]) provides 10 | cutting-edge methods for Spatial Econometrics. Some of the functions implemented therein 11 | are already available in some R packages (e.g., spdep). Bayesian methods for Spatial 12 | Econometrics have not been implemented in R so far. This project will focus on 13 | translating the Matlab code for these methods into R. 14 | 15 | The models to translate include: 16 | 17 | - Spatial autoregressive model 18 | 19 | - Spatial autoregressive probit model 20 | 21 | - Spatial autoregressive tobit model 22 | 23 | The student will also develop a vignette based on datasets from the 24 | Spatial Econometrics Toolbox. 25 | 26 | 27 | **Skills required:** 28 | 29 | The student will need some knowledge of Matlab and R. A good understanding of 30 | computational Bayesian methods such as MCMC (and Gibbs sampling, in particular) 31 | will be necessary too. The student will need to work with some packages on spatial 32 | econometrics (such as spdep) and the coda package for processing MCMC output. 33 | 34 | 35 | **Test:** 36 | 37 | The student should demonstrate proficiency in Matlab and R. 
Some proof of 38 | having a good knowledge of computational Bayesian methods is required too 39 | (but a graduate course on Bayesian analysis will suffice). 40 | 41 | 42 | **Mentor:** Virgilio Gómez-Rubio will act as the main mentor. Another (backup) mentor will 43 | be required too. 44 | 45 | -------------------------------------------------------------------------------- /RWiki/gsoc2012/student.txt: -------------------------------------------------------------------------------- 1 | ====== Apply as a Student for GSoC ====== 2 | 3 | 4 | ===== Mailing Lists ===== 5 | 6 | Sign up for the following mailing lists: 7 | 8 | * [[https://groups.google.com/forum/?fromgroups#!forum/gsoc-r|Students + Mentors mailing list]] 9 | * Optional: [[https://groups.google.com/forum/?fromgroups#!forum/google-summer-of-code-discuss| General GSoC discussion mailing list]] 10 | 11 | 12 | 13 | 14 | 15 | ===== Melange ===== 16 | Google organizes the GSoC in Melange. You need to 17 | - Have a Google account 18 | - [[ http://www.google-melange.com/gsoc/homepage/google/gsoc2012| Register as a student at Melange]] 19 | - Apply to a project by clicking on the orange "Apply" button on the google-melange main page. Choose the R project from the list. 20 | - You can see your proposals via "My Dashboard" in the menu on the left side 21 | - Until the submission deadline you can edit your proposal. Click on "Edit proposal" above your proposal. 22 | - After that, only comments are possible. Mentors may ask questions about your proposal. Do your best to answer them completely and promptly. 23 | - If you change your mind, you can withdraw your proposal with the withdraw button on the left side of the proposal page -------------------------------------------------------------------------------- /RWiki/gsoc2012/vectorizedoptim.txt: -------------------------------------------------------------------------------- 1 | ====== Handle parallel (vectorized) objective functions in a new optimization wrapper package ====== 2 | 3 | **Summary:** Modify/wrap some optimization packages to support vectorized objective functions. In the end a new 'vectorize' option will be available in the optimization function (optim, optimx, nmkb, ...), which will then call the objective function (the 'fn' argument) for many points together, thus allowing for parallel computing in 'fn' evaluations. 4 | 5 | **Description:** The idea is to allow vectorized calls of the 'fn' argument. Example: for now, when nmkb (see [[http://cran.r-project.org/web/packages/dfoptim/dfoptim.pdf]]) is building a polytope (see [[http://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method|Nelder Mead method]]), the points are passed to 'fn' sequentially. If the points could be submitted at the same time to 'fn' (that is the 'vectorized' way), 'fn' may be able to parallelize these evaluations. 6 | 7 | When fn is quite CPU-expensive, the parallel evaluation of fn(x[1,]), fn(x[2,]), fn(x[3,]), ... is distributed over several cores/CPUs, so the time-to-result may be close to that of evaluating a single point. 8 | 9 | Vectorized evaluation would also allow gradient-based methods, e.g. Rcgmin, to be applied to functions without analytic derivative code by parallelizing the work of code like numDeriv. 10 | 11 | * **Task 1: (10 days)** Identify the possible level of parallel calls of 'fn' in different methods and optimization packages (deoptim, dfoptim, optimx, Rcgmin). At least, [[http://en.wikipedia.org/wiki/Simulated_annealing|SANN]] and Nelder-Mead are natural candidates. 
The difficulty of implementing vectorization needs to be evaluated, considering the level of the current implementation (R only, C, Fortran, ...). 12 | * **Task 2: (5-10 days)** Discussion with mentors to define methods to parallelize. Implementation. 13 | 14 | When this parallel/vectorized version is available, it should be tested on the following example (which does not benefit from parallelization at this point): 15 | 16 | flb <- function(x) { 17 | p <- length(x); sum(c(1, rep(4, p-1)) * (x - c(1, x[-p])^2)^2) 18 | } 19 | ## 25-dimensional box constrained 20 | tic <- proc.time()[3] 21 | nmkb(par=rep(3, 25), fn=flb, lower=rep(2, 25), upper=rep(4, 25)) 22 | toc <- proc.time()[3] 23 | toc-tic 24 | ## Now try with vectorized flb: 25 | flb.vectorized <- Vectorize(flb) 26 | ## The following optimization must return the same result as the previous one. 27 | tic <- proc.time()[3] 28 | nmkb.vectorized(par=rep(3, 25), fn=flb.vectorized, lower=rep(2, 25), upper=rep(4, 25)) 29 | toc <- proc.time()[3] 30 | toc-tic 31 | 32 | * **Task 3: (5 days)** Implement a ''Vectorize.foreach()'' function which does the same thing as the original ''Vectorize()'', but uses a foreach loop instead of the mapply() used there. 33 | \\ Now, using this ''Vectorize.foreach()'' instead of Vectorize() will efficiently reduce the computing time: 34 | 35 | flb.vectorized <- Vectorize.foreach(flb) 36 | ## The following optimization must return the same result as the previous one. 37 | tic <- proc.time()[3] 38 | nmkb.vectorized(par=rep(3, 25), fn=flb.vectorized, lower=rep(2, 25), upper=rep(4, 25)) 39 | toc <- proc.time()[3] 40 | toc-tic 41 | 42 | * **Task 4: (15 days)** Benchmark the previous test for different foreach backends (doMC, doMPI, doParallel, doRedis, doRNG, doSNOW). 43 | * **Task 5: (2-5 days)** Implement a ''Vectorize.multicore()'' function which does the same thing as the original ''Vectorize()'', but uses mclapply() (from the multicore package) instead of the mapply() used there. 44 | * **Task 6: (15 days)** Benchmark the previous test for different multi-CPU configurations. Mentors should help to provide access to this kind of resource. 45 | * **Task 7: (10 days)** Consider a parallel wrapping of the numDeriv package (which is a dependency of many optimization methods), in the same spirit. Benchmark with an optimization using numerical gradients. 46 | * **Final task: (5-10 days)** All the previous materials (code, samples, documentation) are to be released in a dedicated package (something like optim.vectorized) and submitted to CRAN. A possible integration into another package (optimx) or R core will be proposed, if the results are satisfying enough (and the code is clear and well documented). 47 | 48 | **Skills required:** C, the R API. Building R from sources. Basic parallel computing skills. 49 | 50 | **Test:** Write a simplified ''Vectorize.multicore'' function using mclapply instead of mapply. Benchmark efficiency using Vectorize.multicore versus Vectorize on a multiple-point call of a slowed-down function: ''slow.flb <- function(x) {Sys.sleep(1);flb(x)}'' 51 | 52 | slow.flb.vectorized <- Vectorize(slow.flb) 53 | tic <- proc.time()[3] 54 | slow.flb.vectorized(seq(from=0,to=1,length=10)) 55 | toc <- proc.time()[3] 56 | toc-tic 57 | slow.flb.vectorized.multicore <- Vectorize.multicore(slow.flb) 58 | tic <- proc.time()[3] 59 | slow.flb.vectorized.multicore(seq(from=0,to=1,length=10)) 60 | toc <- proc.time()[3] 61 | toc-tic 62 | 63 | 64 | **Mentors:** 65 | * Y. 
Richet, from [[http://www.irsn.org|IRSN]], is working in the field of computer experiments. He released a grid engine based on R for external HPC ([[http://promethee.irsn.org]]). 66 | * As backup, J. Nash (nashjc _at_ uottawa.ca). B. Peterson and J. Ulrich are also interested in the subject. Depending on their availability, they should also support the project with their deep knowledge of optimization methods. 67 | //[[yann.richet@irsn.fr|Yann Richet]] 2012/03/26// -------------------------------------------------------------------------------- /RWiki/gsoc2012/wikipediar.txt: -------------------------------------------------------------------------------- 1 | wikipediaR package 2 | 3 | 4 | **Summary:** A package for retrieving and handling Wikipedia articles 5 | 6 | **Description:** 7 | 8 | Wikipedia is a great source of information, which can be accessed freely and analyzed. For other resources like Twitter and OpenStreetMap there are already packages ([[http://cran.r-project.org/web/packages/twitteR/| twitteR]], [[http://osmar.r-forge.r-project.org/| osmar]]). 9 | 10 | To my knowledge there is no package covering Wikipedia articles. I wonder why nobody has written a package for this task yet, because there is a lot of interesting work a statistician can do with this data. The content can be analyzed using text mining tools, there is the information about which other articles are linked in the text, and one can examine the external links ... 11 | For example, you can take a group of articles and calculate their relatedness based on their links and the cosine similarity ([[http://www.cip.ifi.lmu.de/~molnarc/| example]]) 12 | 13 | 14 | After the 3 months of work the package should fulfill the following tasks: 15 | 16 | * The user is able to choose the source of the articles. This could be the [[http://www.mediawiki.org/wiki/API|Wikipedia API]] or a local XML dump of all articles (those can be downloaded). In the first case the task could be done completely in R (using the [[http://www.omegahat.org/RCurl/|RCurl]] and [[http://www.omegahat.org/RSXML/|XML]] packages), but in the latter case the dump is a large file (2.5GB). I don't think one wants to parse this file with R. The idea would be to make use of other programs, e.g. read the XML file into a database and import the articles with the [[http://cran.r-project.org/web/packages/RMySQL/index.html|RMySQL package]]. 17 | 18 | * Save articles as S4 objects (other possibilities: S3 objects, lists) 19 | * Include print, summary and plot methods for articles, so the user can have a quick explorative overview of the data. 20 | 21 | * Provide some methods which transform the articles in a suitable way, so that it will be easy for the user to analyze them with text mining tools from the [[http://tm.r-forge.r-project.org/tm|tm package]] or plot word clouds ([[http://cran.r-project.org/web/packages/wordcloud/index.html|wordcloud package]]) 22 | 23 | 24 | **Mentor:** -------------------------------------------------------------------------------- /RWiki/gsoc2012/xts.txt: -------------------------------------------------------------------------------- 1 | ====== Improvements to data visualisation, subsetting, and analytics for time series data ====== 2 | 3 | ===== Summary ===== 4 | Improve data subsetting approaches, visualisation, and data analytics for xts time series objects. 5 | 6 | 7 | ===== Description ===== 8 | xts is a subclass of zoo, designed to allow the fastest possible manipulation of large time series objects. 
Some of the core xts functionality has even been back-ported into zoo. The zoo/xts packages are some of the most widely used R packages, with over a hundred CRAN packages depending on them and many additional dependencies in other repositories. This project would add additional plotting, subsetting, and analytical features to xts to bring its supported functionality more in line with R norms for much slower objects such as data.frames, and to add additional time series analytics functionality to the package. Some of this work may be back-ported into zoo. 9 | 10 | ===Plotting=== 11 | xts currently provides a rudimentary plot.xts method, and quantmod provides a more comprehensive set of plotting functions via chartSeries and chart_Series. These functions suffer from problems with axes and borders, and lack support for panels, qqplot, boxplot, and histogram methods. Although xts has rather transparent as.* methods that allow some base R functions to be used rather painlessly, it would be beneficial to have xts versions of these methods available for ease of use. Primitives for most of these methods are available in PerformanceAnalytics, and should be adaptable to more general-purpose xts methods. 12 | 13 | ===Multi-Class Columns=== 14 | One existing strength and deficiency of xts/zoo arises from its use of 'matrix' as the underlying core data type for the main time series data. One request often heard on the various R mailing lists is to support columns of various types, as is done in data.frame. One problem with this suggestion is that data.frame objects are extremely slow, and xts objects are routinely used to hold time series data with tens of billions of observations. 15 | 16 | The general goal here would be a data.frame-like ability to hold differing column classes, while retaining the power and speed of `[.xts` and the xts methods for rbind and cbind. The likely implementation approach would construct something akin to the list-structure of data.frames, where class-specific list elements would share an index, so that subsetting indices could then be applied to each column (or grouping of columns) 17 | at the C level. It is possible that multiple columns that share a column class could be grouped into single list slots to avoid additional aggregation. 18 | 19 | ===Analytical Methods=== 20 | Many existing time series methods in R, such as AR/ARIMA, Holt-Winters, and VAR methods, assume a regular time series. While many xts series are sufficiently regular to use these methods, there is no guarantee that this is the case. This project would provide 'bridge' functionality to intelligently convert non-regular xts data to data regular enough to be converted to base R ts and passed to and from the multitude of time series functions that require regular data. The zooreg subclass from zoo will provide a template for creating a (likely non-user-facing) "xtsreg" class; with such a class, we could easily have an object that maps naturally onto ts objects and could use that code in conjunction with some appropriate translations. 21 | 22 | =====Skills required===== 23 | The successful applicant should have proficiency with R, xts, quantmod, and TTR. The ideal candidate would also have prior experience in the analysis of high-frequency data and market microstructure. 24 | 25 | =====Test===== 26 | The successful applicant must demonstrate prior experience with high-frequency data and R. 
This could be via identifying a patch or extension, providing a new demo script for the xts or quantmod packages that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. 27 | 28 | 29 | 30 | =====Mentors===== 31 | Jeffrey Ryan is the primary author of xts and quantmod, a coauthor of zoo, and primary author or coauthor of multiple other R packages. 32 | 33 | Joshua Ulrich is the primary author of TTR, coauthor of xts and DEoptim, and contributing author to quantmod and many other R packages. 34 | 35 | Brian Peterson is a contributing author of xts/quantmod, and has previously mentored GSOC participants. 2012-03-09. 36 | 37 | Stuart Greenlee is a quantitative trader based in Chicago who spends much of his time working with the xts and zoo packages. -------------------------------------------------------------------------------- /RWiki/gsoc2013/import_graphics.txt: -------------------------------------------------------------------------------- 1 | **Background/Motivation:** Graphics made using other programs can be imported into R using the grImport package, and this is useful for creating visualizations with R. For examples, see [[http://www.jstatsoft.org/v30/i04/|the grImport paper]], [[http://cbio.ensmp.fr/~thocking/image-labels.png|the iris data labeled with photos of the 3 flower species]] or [[http://www.flickr.com/photos/47179524@N02/4346623091/in/photostream|this scatterplot comparing several countries using their flags]]. 2 | 3 | **Goal:** Improve R's ability to import and plot external vector graphics. [[http://sugiyama-www.cs.titech.ac.jp/~toby/flags/index.html|This table of state flags plotted in R]] shows some problems with plotting SVGs using [[http://cran.r-project.org/web/packages/grImport/index.html|grImport]] (the basic grImport pipeline is sketched after the tests below). 4 | * How to import SVG into R in a portable manner? One solution is to use [[https://r-forge.r-project.org/scm/viewvc.php/pkg/directlabels/etc/flags/svg2ps.py?view=markup&revision=533&root=directlabels|a python script]] to convert SVG to PS, then use grImport to read PS images. Is it possible to convert SVG to PS entirely in R? It should be possible to do this in C code that uses librsvg and cairo. 5 | * Some complicated PS images cause errors in ''grImport::PostScriptTrace()'' when attempting to convert to RGML, e.g. [[http://upload.wikimedia.org/wikipedia/commons/d/da/Flag_of_Kansas.svg]]. This image is also not rendered on my iPad, but is correctly rendered by Firefox and Evince. Can we fix this bug in PostScriptTrace? 6 | * Some PS images are converted to RGML but then plotted incorrectly on the PDF graphics device. [[http://sugiyama-www.cs.titech.ac.jp/~toby/flags/index.html|See the table of state flags for examples]]. Can we fix these bugs? 7 | * Using the PNG graphics device is clearly inferior to the PDF graphics device. [[http://sugiyama-www.cs.titech.ac.jp/~toby/flags/index.html|See the table of state flags for examples]]. Can we fix these inconsistencies? 8 | 9 | **Impact:** 10 | There are two types of code that can be developed for this project: 11 | * Code for importing vector graphics could be integrated into the grImport package. 12 | * Code for rendering graphics could be integrated into the grid package, which is distributed with every copy of R. 13 | 14 | **Tests:** 15 | * Diagnose what is going wrong with any ONE of the flags that currently do not import correctly. 16 | * BONUS: suggest a fix for that one flag. 
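For orientation, the basic grImport pipeline that these tests exercise looks roughly like this (the file names are placeholders):

  library(grImport)
  ## trace the PostScript file into grImport's RGML format
  PostScriptTrace("flag.ps", "flag.xml")
  ## read the RGML description back into R and draw it with grid
  pic <- readPicture("flag.xml")
  grid.picture(pic)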
17 | 18 | **Mentors:** [[http://sugiyama-www.cs.titech.ac.jp/~toby/|Toby Dylan Hocking]] , [[http://www.stat.auckland.ac.nz/~paul/|Paul Murrell]] -------------------------------------------------------------------------------- /RWiki/gsoc2013/mpiprofiler.txt: -------------------------------------------------------------------------------- 1 | =====Title===== 2 | **Profiling Tools for Parallel Computing with R** 3 | 4 | 5 | =====Summary===== 6 | As R users become more and more sophisticated parallel programmers, the need for profiling tools becomes ever greater. This is especially true for distributed computing over MPI. Being able to generate data on communications, memory usage, I/O performance, etc. for an algorithm makes it much easier to debug and optimize that code. Many state-of-the-art tools already exist for solving this problem, but to date, no one has brought them to R. This project entails the development of a package, pbdPROF, which allows parallel R users to easily profile their parallel codes, and to use the best analytics and visualization toolbox, R, to evaluate the profiling results. 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | =====Description===== 19 | Many great tools for profiling MPI codes already exist; they merely need to be incorporated into R packages in the appropriate way. Possible parallel profiling libraries for the project include [[https://www.mcs.anl.gov/research/projects/fpmpi/WWW/index.html|fpmpi]], [[http://mpip.sourceforge.net/|mpiP]], [[http://www.nersc.gov/users/software/debugging-and-profiling/ipm|IPM]], and [[http://www.cs.uoregon.edu/research/tau/home.php|TAU]]. When these libraries are correctly linked with MPI code during compilation, they automatically generate reports which give various measures of performance data that would otherwise be difficult to obtain. This data can be incredibly valuable to a developer writing MPI codes. 20 | 21 | However, there are two problems which need to be addressed for the R community. First, if an R programmer wanted to use these libraries with, for example, [[http://cran.r-project.org/web/packages/pbdMPI/|pbdMPI]] or [[http://cran.r-project.org/web/packages/Rmpi/index.html|Rmpi]], then he or she would have to compile the profiling library as a shared library by hand and manage the compiling and linking with the appropriate R package(s). Secondly, the reports these libraries generate are often nearly incomprehensible, and must be cleaned and parsed before real insight can be obtained. 22 | 23 | To resolve these issues, we propose the development of an R package which directly solves both of these problems. The package will be tightly integrated with [[http://r-pbd.org/|pbdR]] packages, but it must be able to work easily with other packages which link with an MPI library, e.g. [[http://cran.r-project.org/web/packages/Rmpi/index.html|Rmpi]]. The package will contain the necessary library source files, which will automatically be compiled when installing the package, and export the necessary flags so that higher-level packages (such as [[http://cran.r-project.org/web/packages/pbdMPI/|pbdMPI]]) can easily be configured to link with these profiling libraries. 24 | 25 | The successful project would include: 26 | * Development of an R package, pbdPROF, which easily enables profiling of MPI codes using [[http://cran.r-project.org/web/packages/pbdMPI/|pbdMPI]] and [[http://cran.r-project.org/web/packages/Rmpi/index.html|Rmpi]]. 27 | * Development of parsing and plotting functions for this package to analyze profiler output. 
28 | * Creation of a package vignette which will explain to both users and package developers how to utilize the pbdPROF package. 29 | * Demonstrating the usefulness of the package by profiling the parallel k-means implementations in the [[http://cran.r-project.org/web/packages/pmclust/index.html|pmclust]] package. 30 | 31 | 32 | =====Skills required===== 33 | The applicant should have: 34 | * A strong interest in parallel computing, with parallel programming experience strongly preferred. 35 | * Experience using R. 36 | * Some knowledge of C, compiling, and linking. 37 | * Basic familiarity with LaTeX and git/GitHub will be useful. 38 | 39 | 40 | 41 | 42 | 43 | 44 | =====Tests===== 45 | * Write a simple example interfacing R to C using the .Call() interface. See the [[http://cran.r-project.org/doc/manuals/R-exts.html#Calling-_002eCall|Writing R Extensions]] manual for more information. 46 | * This test requires a system installation of an MPI library, such as [[http://www.open-mpi.org/|OpenMPI]] or [[http://www.mpich.org/|MPICH2]]. Compile the [[https://www.mcs.anl.gov/research/projects/fpmpi/WWW/index.html|fpmpi]] library and successfully run the "waittime" example (included with the library source) with 2 processors. After compiling both the library and the examples, the example can be run via the command "mpiexec -np 2 waittime". If the example runs successfully, then it will generate a file named fpmpi_profile.txt. 47 | 48 | 49 | =====Mentors===== 50 | The pbdR Core Team: 51 | * [[http://thirteen-01.stat.iastate.edu/snoweye/mypage/|Wei-Chen Chen]] 52 | * George Ostrouchov 53 | * Pragneshkumar Patel 54 | * Drew Schmidt 55 | 56 | 57 | 58 | =====References===== 59 | * FPMPI: https://www.mcs.anl.gov/research/projects/fpmpi/WWW/index.html 60 | * mpiP: http://mpip.sourceforge.net/ 61 | * IPM: http://www.nersc.gov/users/software/debugging-and-profiling/ipm 62 | * TAU Performance System: http://www.cs.uoregon.edu/research/tau/home.php 63 | * The Rmpi package: http://cran.r-project.org/web/packages/Rmpi/index.html 64 | * The pbdR project: http://r-pbd.org/ 65 | * The pbdMPI package: http://cran.r-project.org/package=pbdMPI 66 | * The pbdDEMO package: http://cran.r-project.org/package=pbdDEMO 67 | * The pmclust package: http://cran.r-project.org/web/packages/pmclust/index.html -------------------------------------------------------------------------------- /RWiki/gsoc2013/nso.txt: -------------------------------------------------------------------------------- 1 | ===== Title ===== 2 | **NSO -- NonSmooth Optimization Package** 3 | 4 | ===== Summary ===== 5 | Implement different approaches for the minimization of non-smooth functions, e.g. pattern-based methods, hybrid approaches, and specialized stochastic methods. 6 | 7 | ===== Description ===== 8 | These functions may be smooth in their domain, except on some regions of smaller dimension (respectively, of probability zero), and they may not be differentiable at their minimum (or maximum). Difference approximations of gradients may not be useful and may even lead to serious failures, for instance in case of noise or oscillations. Often smooth algorithms will not converge, or will converge to suboptimal solutions. 9 | 10 | Special techniques have been developed to exploit these features and thus to solve non-smooth optimization tasks more effectively: bundle techniques and subgradient projection methods for convex minimization problems, hybrid approaches that utilize BFGS steps where possible, or global stochastic solvers. 
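As a tiny concrete illustration of the kind of objective involved, the function below is a standard example that is non-differentiable exactly at its minimizer (it is only an illustration, not part of the proposed package):

  f <- function(x) sum(abs(x))                 # kink at the minimizer (0, 0)
  optim(c(2, -3), f, method = "Nelder-Mead")   # derivative-free, so it can still make progress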
11 | 12 | The goal of this project will be to build a package for non-smooth optimization that provides different methods: a derivative-free approach, a hybrid approach, and (if time allows) a more stochastic way of finding an optimum. Students applying will have a vote in deciding which approaches to start with. 13 | 14 | Some Matlab toolboxes relevant for non-smooth optimization are 'SID-PSM' and 'Hanso' (see [3] and [4]), which are available for conversion to R. Bundle/subgradient approaches could be implemented following textbook chapters on non-smooth optimization (see Part II of [1]). 15 | 16 | (Also very interesting is the work by Zhang on non-smooth least-squares problems; unfortunately, the Fortran code is not freely available, so the coding would have to be based on the article [5].) 17 | 18 | ===== Skills Required ===== 19 | * Some knowledge of optimization theory 20 | * Good programming expertise in Matlab and R 21 | * Know-how in Fortran and C may be useful 22 | 23 | ===== Tests ===== 24 | Identify some typical examples of non-smooth functions (literature, Internet, own imagination); see also reference [2] below. --- 25 | Try to find local/global minima by applying optimization routines in Matlab (or Octave) and R. --- 26 | Take a look at the NonSmooth Optimization (NSO) web page and try out some of the software packages listed there on your problem(s). --- 27 | Take a look at reference [5] and implement Wolfe's line search in R. 28 | 29 | ===== Mentors ===== 30 | * Hans W. Borchers, ABB Corporate Research 31 | * [tbd.] 32 | 33 | ===== References ===== 34 | - Part II "Nonsmooth Optimization", in: J. F. Bonnans et al., Numerical Optimization: Theoretical and Practical Aspects, Springer-Verlag, Berlin Heidelberg, 2006. 35 | - [[http://napsu.karmitsa.fi/nso/|NonSmooth Optimization]] NSO Web page by N. Karmitsa 36 | - [[http://www.cs.nyu.edu/overton/software/hanso/|HANSO: Hybrid Algorithm for Non-Smooth Optimization]] by M. L. Overton 37 | - [[http://www.mat.uc.pt/sid-psm/|SID-PSM]], L. N. Vicente, University of Coimbra 38 | - H. Zhang, [[https://www.math.lsu.edu/~hozhang/zhcpaper.html|"A derivative-free algorithm for the least-squares minimization"]] 39 | - [[http://www.norg.uminho.pt/aivaz/pswarm/|PSWARM]] Version 1.5, a global approach, C code (not the Matlab version) 40 | - A. Skajaa, [[http://cs.nyu.edu/overton/mstheses/skajaa/msthesis.pdf|Limited Memory BFGS for Nonsmooth Optimization]], Master's Thesis, Courant Institute of Mathematical Sciences, New York, 2010. 41 | 42 | ---- 43 | 44 | Please note: The authors of the Matlab codes of these approaches have explicitly agreed that their programs may be used as guidelines and for re-implementation in R under the GPL license. 45 | -------------------------------------------------------------------------------- /RWiki/gsoc2013/right.txt: -------------------------------------------------------------------------------- 1 | ====== RIGHT : R Interactive Graphics via HTml ====== 2 | 3 | **Summary:** R Interactive Graphics via HTml (RIGHT) will support easy data exploration using interactive graphics and analysis and reveal information hidden among static graphs. It will use HTML5 canvas and JavaScript, instead of Qt, to make analysis and visualization results easier to distribute to a wider audience. Furthermore, it will support an intuitive R API for easy setup. 4 | 5 | **Background/Motivation:** There are excellent packages in R that produce very polished static graphics, e.g. lattice and ggplot2. 
However, exploring data using static graphics can be tedious: users have to frequently interact with the command line to get meaningful graphs that answer even a simple question like “how are graphical elements (a point or a group of points) in one graph related to those in another graph?” Interactive graphics linking data in different graphs and interactive visualization of summary statistics, combined with an intuitive R API for setup, can help address these questions and make data exploration easy and also fun. 6 | 7 | **Description** 8 | 9 | The goal of project RIGHT is to develop interactive graphics and analysis to easily explore data using HTML5 and JavaScript with an intuitive R API. 10 | 11 | The project will have three components: 12 | - HTML5 and JavaScript libraries (RIGHTJS) for interactive visualization. Supported features will include, but not be limited to: 13 | * Various types of basic graphs including scatter plot, histogram, and box-and-whisker plot. 14 | * Interactivity features: 15 | * Tooltip: display detailed information when the mouse hovers. 16 | * Simultaneous selection: related data points on different graphs are simultaneously selected when clicked. 17 | * Multiple selections: selection with the ctrl/shift key and box selection. 18 | * Hiding: selected data points can be hidden (for, say, outlier removal for interactive analysis) via a right-click menu. 19 | * Searching: data points can be searched with Boolean statements. 20 | * Table: a table shows detailed information about the selected data points. 21 | - R API 22 | * HTML5 and JavaScript code generation is supported through an intuitive R API similar to what is found in widely used static graphics packages, e.g. base graphics, lattice, and ggplot2. 23 | * An R package with documentation will be developed for distribution through CRAN. 24 | - Interactive data analysis support 25 | * (Complex) data analysis will be done on the server side (remote or local) and the graphs will be updated interactively. For example, if a user selects and hides outliers, analysis will be performed on the server (remote or local) and graphs will be updated accordingly. 26 | 27 | **Demo:** A part of the first two components has been implemented. A demo video is available at [[http://youtube.googleapis.com/v/thlwjYFC_yY?vq=hd1080;hd=1;autoplay=1;]]. 28 | 29 | **Languages:** RIGHT will use **HTML5 canvas** & **JavaScript** for the following reasons: 30 | * Browsers natively support the event handling necessary for interactive visualization, and JavaScript engines are powerful enough to visualize reasonably large data. 31 | * The performance will only get better as websites become more complex. RIGHT can piggy-back on the performance enhancements of JavaScript engines demanded by others. 32 | * HTML5 canvas and JavaScript make it easy to disseminate analysis and visualization results to a wider audience since they are well adopted on many platforms, including mobile. 33 | 34 | **Code** https://code.google.com/p/r-interactive-graphics-via-html/ 35 | 36 | **Related work/Literature** 37 | * [[http://spotfire.tibco.com/|TIBCO Spotfire]]: [[http://www.youtube.com/user/SpotfireGuy|many interactive visualization features]] were inspired by this commercial software package. RIGHT will be open source and provide an intuitive R API for setup; in comparison, Spotfire only supports GUI-based setup. 38 | * [[https://github.com/ggobi/cranvas|cranvas]]: this is an excellent package that produces interactive visualization using Qt. 
Visualization using Qt has limited availability on mobile platforms (see [[http://doc.qt.digia.com/4.7/supported-platforms.html|Qt Supported Platforms]]). A package based on HTML5 canvas and JavaScript with similar functionality will give users more options when analysis and visualization results need to be widely distributed. 39 | * [[http://www.omegahat.org/SVGAnnotation/|SVGAnnotation]], [[http://sjp.co.nz/projects/gridsvg/|gridSVG]], [[http://svgmapping.r-forge.r-project.org/SVG_Mapping/Home.html|SVGMapping]], and [[https://github.com/hadley/r2d3|r2d3]]: SVG-based packages can provide polished visualization [[http://d3js.org/]], but the DOM-based approach makes it difficult to provide some types of interactivity. For instance, the many-to-one simultaneous selection between the scatter plot and the histogram shown in the [[http://youtube.googleapis.com/v/thlwjYFC_yY?vq=hd1080;hd=1;autoplay=1;|demo]] will be difficult to create with SVG, especially among more than two graphs. 40 | * [[http://www.rforge.net/canvas/|canvas]]: the canvas package uses HTML5 canvas but does not support interactivity. 41 | * [[http://www.rstudio.com/shiny/|RStudio Shiny]]: Shiny does not offer interactive visualization. However, the server offloading idea came from Shiny Server. 42 | 43 | **Skills required:** R, JavaScript, and HTML5 44 | 45 | **Tests** 46 | - Load the Theoph dataset and plot concentration (conc) versus time (Time) by subject (Subject) with at least two graphics packages in R. Points for each subject should have different colors, and a legend should be added. 47 | - Re-draw the plot created in problem 1 with JavaScript using the kineticJS library (http://kineticjs.com/). 48 | 49 | **Mentors:** [[http://icc.skku.ac.kr/~jaewlee|Jae W. Lee]], Jung Hoon Lee -------------------------------------------------------------------------------- /RWiki/gsoc2013/rtaq.txt: -------------------------------------------------------------------------------- 1 | **Highfrequency: add inferential methods to highfrequency** 2 | 3 | **Summary** 4 | 5 | Toolkit for data management and analysis of high-frequency financial data in R. 6 | 7 | **Description** 8 | The economic value of analyzing high-frequency financial data is now obvious, both in the academic and financial world. It is the basis not only of intraday and daily risk monitoring and forecasting and an input to the portfolio allocation process, but also of high-frequency trading. For the efficient use of high-frequency data in financial decision making, the highfrequency package implements state-of-the-art techniques for cleaning, matching, and liquidity, risk and correlation calculation based on high-frequency data. For many of the high-frequency data based estimators, analytical formulae are available to compute standard errors and do inference. 9 | 10 | The goal of this project is to develop much-needed extra functionality in this growing area: 11 | 12 | __1. Realized quarticity estimates:__ 13 | A key quantity in determining the high-frequency data based volatility measures is the integrated quarticity, estimated using realized quarticity measures. Several estimators are available. The first objective of this project is to implement these realized quarticity estimators. The second and third objectives are to apply these quarticity estimates to the calculation of standard errors for RV measures and the implementation of jump test statistics. 14 | 15 | __2. 
Standard errors and confidence bands for RV measures:__ 16 | Under suitable assumptions, standard errors on the realized volatility measures can be computed. The second goal of this project is to implement the standard error calculation for popular RV measures such as the realized variance, bipower variation, medRV/minRV and ROWVar. 17 | 18 | __3. Jump test statistics based on daily RV measures:__ 19 | Many applications based on high-frequency data require detecting the days on which the price process is characterized by a jump occurrence. A popular testing procedure for this is to compute test statistics based on the difference between a non-robust RV and a jump-robust RV measure. 20 | 21 | 22 | 23 | __4. Standard errors for the parameter estimates of the heavy model:__ 24 | The heavyModel function in highfrequency uses high-frequency-based volatility measures to forecast future volatility using a time series model. Standard errors need to be implemented for the parameters of the time series model. The final objective of this project is to implement the likelihood function of the heavy model in C to speed up the estimation. To help with this, Chris Blakely kindly shared his C code for the heavy model implementation with us. 25 | 26 | 27 | **Literature** 28 | 29 | Huang and Tauchen. 2005. The relative contribution of jumps to total price variance. Journal of Financial Econometrics 3, 456-499. 30 | 31 | Andersen, Dobrev and Schaumburg. 2011. A Functional Filtering and Neighborhood Truncation Approach to Integrated Quarticity Estimation http://econ.au.dk/fileadmin/site_files/filer_oekonomi/Working_Papers/CREATES/2011/rp11_23.pdf and the references therein. 32 | 33 | Shephard and Sheppard. 2012. Realising the future: forecasting with high-frequency-based volatility (HEAVY) models. Journal of Applied Econometrics 25, 197-231. 34 | 35 | **Skills required** 36 | 37 | The successful applicant should have proficiency with R, xts and highfrequency. The ideal candidate would also have prior experience in the analysis of high-frequency data and market microstructure. 38 | 39 | **Test** 40 | 41 | The successful applicant must demonstrate prior experience with high-frequency data and R. This could be via identifying a patch or extension, providing a new demo script for the highfrequency package that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. 42 | 43 | **Mentors** 44 | 45 | Dr. Kris Boudt and Jonathan Cornelissen are co-authors of the highfrequency package (together with Scott Payseur). They have published many articles in the area of high-frequency financial data in leading international financial and econometrics journals. Kris Boudt previously mentored GSOC participants. -------------------------------------------------------------------------------- /RWiki/gsoc2013/spectralunmixing.txt: -------------------------------------------------------------------------------- 1 | ====== spectral unmixing ====== 2 | 3 | Many types of [[http://en.wikipedia.org/wiki/Spectroscopy|spectroscopy]] follow a bilinear model, e.g. 
4 | * absorption in the UV/VIS/NIR/IR 5 | * fluorescence emission 6 | * Raman emission 7 | 8 | The spectrum //X (λ)// of a mixture of m substances is a linear combination of the pure substance spectra //S,,i,, (λ)// weighted with their concentrations //c,,i,,// 9 | 10 | %%$$X(\lambda) = \sum_{i=1}^{m} c_i S_i(\lambda)$$%% 11 | 12 | or, written with matrices for n spectra: 13 | 14 | **X**^^(//n// x //p//)^^ = **C**^^(//n// x //m//)^^ **S**^^(//m// x //p//)^^ + ε 15 | 16 | where **X** is the matrix of observed mixture spectra, measured at //p// distinct wavelengths, the concentrations are stored in **C**, and the pure component spectra in **S**. 17 | 18 | [[http://en.wikipedia.org/wiki/Imaging_spectroscopy#Unmixing|Spectral unmixing]] tries to reverse this and find pure component spectra and concentrations from the observed spectra. A number of approaches exist, which follow different heuristics and constraints (e.g. spectra and concentrations should be non-negative, the pure component spectra actually exist in the observed matrix, ...). 19 | 20 | Several algorithms are described in the literature, and I also have Matlab code for Vertex Component Analysis and NFINDR. 21 | 22 | ===== Aim of the project ===== 23 | 24 | The aim of the project is to implement/port spectral unmixing methods to R. 25 | 26 | ===== Requirements ===== 27 | 28 | R, possibly C++ 29 | 30 | git/svn 31 | 32 | ===== Test ===== 33 | 34 | Please contact me for a test task. 35 | 36 | 37 | **Mentor:** Claudia Beleites [[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 38 | **Co-Mentors:** 39 | 40 | --- //[[Claudia.Beleites@ipht-jena.de|Claudia Beleites]] 2013/04/07// 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /RWiki/gsoc2013/xts.txt: -------------------------------------------------------------------------------- 1 | ====== Improvements to data construction, subsetting, and manipulation for time series data. ====== 2 | 3 | ===== Summary ===== 4 | Improvements to data construction, subsetting, and manipulation for xts time series objects. 5 | 6 | 7 | ===== Description ===== 8 | xts is a subclass of zoo, designed to allow the fastest possible manipulation of large time series objects. Some of the core xts functionality has even been back-ported into zoo. The zoo/xts packages are some of the most widely used R packages, with over a hundred CRAN packages depending on them and many additional dependencies in other repositories. This project would add additional construction, subsetting, and analytical features to xts to bring its supported functionality more in line with R norms for much slower objects such as data.frames, and to add additional time series analytics functionality to the package. Some of this work may be back-ported into zoo. 9 | 10 | ===Multi-Class Columns=== 11 | One existing strength and deficiency of xts/zoo arises from its use of 'matrix' as the underlying core data type for the main time series data. One request often heard on the various R mailing lists is to support columns of various types as is done in data.frame. One problem with this suggestion is that data.frame objects are extremely slow, and xts objects are routinely used to hold time series data with tens of billions of observations. 12 | 13 | The general goal here would be to provide data.frame-like support for differing column classes, while retaining the power and speed of `[.xts` and the xts methods for rbind and cbind. 
The likely implementation approach would construct something akin to the list structure of data.frames, where class-specific list elements would share an index, so that subsetting indices could then be applied to each column (or grouping of columns) 14 | at the C level. It is possible that multiple columns that share a column class could be grouped into single list slots to avoid additional aggregation. 15 | 16 | 17 | =====Skills required===== 18 | The successful applicant should have proficiency with R, xts, quantmod, and TTR. The ideal candidate would also have prior experience in the analysis of high-frequency data and market microstructure. 19 | 20 | =====Test===== 21 | The successful applicant must demonstrate prior experience with high-frequency data and R. This could be via identifying a patch or extension, providing a new demo script for the xts or quantmod packages that would show understanding of the functionality provided, or by doing a detailed proposal with pseudocode for one or more potential enhancements to the toolchain. 22 | 23 | 24 | 25 | =====Mentors===== 26 | 27 | Joshua Ulrich is the primary author of TTR, coauthor of xts and DEoptim, and contributing author to quantmod, quantstrat, and many other R packages. 28 | 29 | Michael Weylandt created the xtsExtra package in GSoC 2012, and has agreed to co-mentor a project to advance it. 30 | 31 | Brian Peterson is a contributing author of xts/quantmod, as well as many other R packages, and has previously mentored GSOC participants. 32 | 33 | 34 | Jeffrey Ryan is the primary author of xts and quantmod, a coauthor of zoo, and primary or coauthor on multiple other R packages. 35 | -------------------------------------------------------------------------------- /RWiki/gsoc2014/animint.txt: -------------------------------------------------------------------------------- 1 | **Background/Related work:** standard R graphics are based on the pen and paper model, which makes animations and interactivity difficult to accomplish. Non-interactive animations can be accomplished with the ''animation'' package, and some interactions with non-animated plots can be done with the [[https://github.com/ggobi/cranvas/wiki|qtbase, qtpaint, and cranvas packages]]. Linked plots on the web are possible using [[http://www.omegahat.org/SVGAnnotation/SVGAnnotationPaper/SVGAnnotationPaper.html|SVGAnnotation]] or [[http://sjp.co.nz/projects/gridsvg/|gridSVG]], but using these to create such a visualization requires knowledge of Javascript. The [[https://r-forge.r-project.org/scm/viewvc.php/pkg/?root=svgmaps|svgmaps]] package defines interactivity (hrefs, tooltips) in R code using igeoms, and exports SVG plots using gridSVG. [[https://github.com/trifacta/vega|Vega]] can be used for describing plots in Javascript, but has limited interactivity. 2 | 3 | The [[https://github.com/tdhock/animint|animint]] package supports clicking linked plots to show/hide subsets of data, by adding two new aesthetics to the grammar of graphics: 4 | * ''showSelected=variable'' means that only the subset of the data that corresponds to the selected value of ''variable'' will be shown. 5 | * ''clickSelects=variable'' means that clicking a plot element will change the currently selected value of ''variable''. 6 | The animint package user writes only R code: a list ''L'' of ggplots, then calls ''gg2animint(L)'' which renders the interactive, animated plots in a web browser. 
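A minimal usage sketch of this workflow (not part of the proposal; the WorldBank dataset and its column names are assumptions based on the animint examples, and the exact aesthetic syntax may differ between animint versions):

<code R>
library(animint)   # devtools::install_github("tdhock/animint")
library(ggplot2)
data(WorldBank)    # example dataset assumed to ship with animint

# Two linked plots: clicking a time series selects a country, and the
# scatterplot shows only the data for the currently selected year.
L <- list(
  ts = ggplot() +
    geom_line(aes(x = year, y = life.expectancy, group = country,
                  clickSelects = country), data = WorldBank),
  scatter = ggplot() +
    geom_point(aes(x = fertility.rate, y = life.expectancy,
                   showSelected = year), data = WorldBank)
)
gg2animint(L)  # exports CSV + plots.json and renders the linked plots in a browser
</code>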
7 | 8 | Behind the scenes, the animint package developers have written R and Javascript code that does all the work. The [[https://github.com/tdhock/animint/blob/master/R/animint.R|R]] code exports all the plot data in CSV along with a ''plots.json'' metadata file. These files are then plotted in a web browser using [[https://github.com/tdhock/animint/blob/master/inst/htmljs/animint.js|Javascript code]] and the [[http://d3js.org|D3]] library. [[http://sugiyama-www.cs.titech.ac.jp/~toby/animint/|The result is several linked, interactive, animated web plots]]. 9 | 10 | **Proposal:** implement several improvements to the ''animint'' package. A [[https://github.com/tdhock/animint/blob/master/NEWS|list of TODOs can be found in the NEWS file]]. The most important improvements are: 11 | * [[http://docs.ggplot2.org/current/coord_fixed.html|coord_equal]], which would be useful for e.g. [[http://tdhock.github.io/animint/tornadoes.html|maps]]. 12 | * [[http://docs.ggplot2.org/0.9.3.1/facet_grid.html|facet_grid]], which would be useful for e.g. [[https://github.com/tdhock/animint/blob/master/examples/intreg.R|the last plot in this file]]. 13 | * Increase memory/disk/speed efficiency by e.g. saving compressed .csv.gz files rather than uncompressed .csv files. 14 | * Make tests for the SVG that is generated by the Javascript code, using e.g. [[http://phantomjs.org/|PhantomJS]] to non-interactively render the web page. One test that could be implemented is e.g. checking for at least one '''' and one '''' when you specify ''geom_text(aes(size=variable))+scale_size_continuous(range=c(10, 20))'' [[https://github.com/tdhock/animint/blob/master/examples/text_size.R|as in this example]]. 15 | * To make it easier to share interactive animations with colleagues or the public, implement an ''animint2gist(L)'' function that uploads the interactive animation to github as a gist. 16 | * Improve [[https://github.com/tdhock/animint/blob/master/R/knitr.R|knitr support]] so that several interactive animations can be easily included in an R markdown (Rmd) document. Currently it only supports including one interactive animation per document, [[https://github.com/tdhock/animint/blob/master/inst/doc/template.Rmd|as in this Rmd]] which is used to [[https://github.com/tdhock/animint/blob/master/R/doc.R|generate the examples web site]]. 17 | 18 | **Mentors:** [[https://github.com/tdhock|Toby Dylan Hocking]], [[https://github.com/srvanderplas|Susan VanderPlas]]. 19 | 20 | **Tests: do one or several --- doing more hard tests makes you more likely to be selected** 21 | * Easy: to demonstrate some knowledge of animint, use it to visualize some of the data from your domain of expertise. 22 | * Medium: translate one or several [[http://vis.supstat.com/categories.html#animation-ref|examples of the animation package]] into an animint. 23 | * Look at the source code of one of the animation package functions, e.g. grad.desc(), for a demonstration of the gradient descent algorithm. Translate the for loops and plot() calls into code that generates data.frames. In the grad.desc() example, there should be one data.frame for the contour lines, one for the arrows, and one for the values of the objective/gradient at each iteration. 24 | * Use the data.frames to make some ggplots. In the grad.desc() example, there should be one ggplot with a geom_contour and a geom_path, and another ggplot with a geom_line that shows objective or gradient value versus iteration, and a make_tallrect in the background for selecting the iteration number. 
25 | * Make a list of ggplots and pass that to gg2animint. For the grad.desc() example, the plot list should be something like list(contour=ggplot(), objective=ggplot(), time=list(variable="iteration", ms=2000)). 26 | * Hard: fork the animint source code on github, then implement one of the improvements mentioned in the Proposal section, and send tdhock a pull request. -------------------------------------------------------------------------------- /RWiki/gsoc2014/bdvis.txt: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | ====== bdvis : Biodiversity data visualizations using R ====== 6 | 7 | 8 | **Summary:** Package bdvis is a set of functions to visualize biodiversity occurrence data through R. We intend to improve the package and complete it with bug fixes, feature requests, a vignette, and submission to CRAN. 9 | 10 | **Description:** Last year we participated in GSoC and initiated the package [[http://rwiki.sciviews.org/doku.php?id=developers:projects:biodiversity|bdvis : Biodiversity data visualizations using R]], a set of functions to visualize biodiversity occurrence data through R. The package has received a good response from the user community and is being used. This year we intend to take the package further by improving some of the visualizations, implementing more features based on user feedback, and preparing a detailed vignette that describes the use of the package in detail. 11 | 12 | **Goals:** 13 | * Implement feature requests and bug fixes 14 | * Submit the package to CRAN 15 | * Implement additional features based on feedback 16 | * Write a vignette to explain the usage of each function in the package with examples (which can be submitted as a paper to a peer-reviewed journal) 17 | * Add support for the various data formats available in the R environment from packages like rgbif, rvertnet, rinat, spocc, etc. 18 | 19 | 20 | **Mentors:** Javier Otegui (Melange ID: jotegui, [[mailto:javier.otegui@gmail.com|javier.otegui@gmail.com]]), Jorge Soberón, Virgilio Gómez-Rubio -------------------------------------------------------------------------------- /RWiki/gsoc2014/bio-interactive-web-plots.txt: -------------------------------------------------------------------------------- 1 | **Summary:** Interactive plots for scientific reports 2 | 3 | **Description:** 4 | 5 | Every scientific analysis should result in a reproducible report. The ReportingTools Bioconductor package provides multiple means of report generation, including an imperative API driven by an R script, as well as a declarative interface through knitr. ReportingTools converts common R/Bioconductor data structures like data.frames and ExpressionSets into report elements, such as tables and plots, according to user-definable mappings. It supports multiple backends, with the HTML backend being the most developed. The HTML report elements have some limited interactivity (such as sortable tables). Additional interactivity is enabled through integration with the shiny package. 6 | 7 | Our goal is to add mappings from R and Bioconductor data structures to simple interactive HTML report elements, where the interactivity is implemented in the front-end, independent of shiny or any other R instance. These would include basic plots of summaries, as well as simple genomic plots, which would display alignments, per-position summaries, and annotations along the genome. 
Summarized data could be directly embedded in the report, and the client would access large datasets by querying standard biological data sources like DAS, AnnotationHub, ExperimentHub and web-accessible BAM/VCF/BigWig files. 8 | 9 | This work would occur in a new or at least separate package that could be used directly, or through its integration with ReportingTools. The API will resemble that of ggbio, with extensions inspired by animint. The API could be implemented on top of animint. The genomic plotting might rely on and even drive the development of the pViz javascript library. 10 | 11 | We aim to support simple, lightweight plots in redistributable reports. There is no intent for this to replace more sophisticated apps like epiviz(R) or shiny-based solutions. It could be summarized as an extension of animint with specific support for Bioconductor data structures and an emphasis on integration with report generation tools, in particular ReportingTools. 12 | 13 | **Skills required:** 14 | 15 | * Proficiency in Javascript and web technologies/frameworks like D3.js and backbone.js. 16 | * Familiarity with R; having publicly released an R package would be a plus. 17 | * Exposure to biological/sequencing use cases is a plus. 18 | 19 | **Test:** 20 | 21 | Use ggbio::autoplot to plot one of the GRanges objects created by example(GRanges) and create an equivalent D3-based plot that displays some aspect of the ranges (like the score) as a tooltip on mouse over. 22 | 23 | **Mentor:** 24 | 25 | Michael Lawrence (michafla @ Genentech's domain) -------------------------------------------------------------------------------- /RWiki/gsoc2014/bio-mzr.txt: -------------------------------------------------------------------------------- 1 | ===== Introduction ===== 2 | 3 | The mzR R/Bioconductor package provides a unified application programming interface to the common open and community-driven file formats and parsers available for mass spectrometry data, namely mzXML, mzML and mzData (see the current [[http://bioconductor.org/packages/devel/bioc/vignettes/mzR/inst/doc/mzR.pdf|vignette]] for details and references). It relies on C and C++ code from other third-party open-source projects and the Rcpp package to, notably, provide a direct mapping from R to C++ infrastructure. 4 | 5 | Currently, mzR provides two back-ends to read mass spectrometry raw data: 6 | 7 | * netCDF, which reads, as the name implies, netCDF data 8 | * RAMP, to read mzData and mzXML via the ISB RAMP parser. This back-end can also read mzML through the proteowizard RAMPadapter around the proteowizard infrastructure, but this interface is limited to the lowest common denominator between the mzXML/mzData/mzML formats. 9 | 10 | More details about the project can be found on the [[http://bioconductor.org/packages/devel/bioc/html/mzR.html|official package page]]. 11 | 12 | ===== Goal ===== 13 | 14 | The goal is to extend the current useful, yet limited, capabilities of mzR by adding support for the state-of-the-art [[http://proteowizard.sourceforge.net/|proteowizard]] project. 15 | 16 | ===== The tasks ===== 17 | 18 | We will provide example data in all formats and support on the domain. This project will use the mzR github page as the main collaboration and communication hub. 19 | 20 | * Subscribe to the Bioconductor and Bioc-devel mailing lists. 21 | * Give us your github account to be added as an mzR collaborator on the [[https://github.com/sneumann/mzR|github page]]. 
22 | * Install and explore the mzR and Rcpp packages and the proteowizard project. We will provide simple tasks to be coded in C++ using the proteowizard code base so that you can familiarise yourself with it. 23 | * Expand mzR's raw data and meta-data access using the proteowizard 'msdata' object type. This will allow read and write support for the widely used Proteomics Standards Initiative XML-based raw data formats and their latest definitions. 24 | * Implement support for additional open formats, in particular MSMS identification (mzIdentML) using the 'identdata' object type. 25 | * Ensure successful compilation of the updated mzR package on the respective architectures. 26 | 27 | ===== Skills needed: ===== 28 | 29 | * Good knowledge of C/C++ and the Rcpp package. 30 | * R programming. 31 | * Basic knowledge of package development and S4 OO programming. 32 | * Knowledge of mass spectrometry data and the respective formats is desired. 33 | 34 | The candidate will have to familiarise herself with the proteowizard code base. 35 | 36 | ===== Exercises: ===== 37 | 38 | See the [[https://github.com/sneumann/mzR/wiki/Exercises|exercises page]] on the mzR wiki. 39 | 40 | 41 | ===== Mentors ===== 42 | 43 | 44 | Laurent Gatto (lg390 cam ac uk, melange id is lgatto) and Steffen Neumann (sneumann ipb halle de, melange id is sneumann) with support from Dirk Eddelbuettel. -------------------------------------------------------------------------------- /RWiki/gsoc2014/discrimination.txt: -------------------------------------------------------------------------------- 1 | ====== Kernel Density Estimation and Nonparametric Discriminant Analysis for Functional Data ====== 2 | 3 | 4 | **Summary:** To develop an R package to enable the user to implement kernel density estimation and nonparametric discriminant analysis for functional data. 5 | 6 | **Description:** Functional data analysis (FDA) has received considerable attention over recent years in different fields of application. These include the fields of bioscience, system engineering, and meteorology (see e.g., Ramsay and Silverman (2006)). 7 | The need for methods in FDA arises when the sample data are a collection of curves or continuous functions, or at least observations of such curves at discrete points in time. Such observations may be subject to measurement error. One of the first steps in an analysis is to construct the underlying smooth functions, often using a smoothing technique, such as one based on smoothing splines, and the current fda package in R helps facilitate this. 8 | 9 | In discriminant analysis we want to assign a new curve to one of say k (>=2) different groups using a classification rule which has been derived from samples of data whose actual group membership is known a priori. There are a number of different approaches to this problem including: 10 | 11 | i. A regularized version of linear discriminant analysis called penalized discriminant analysis, proposed by Hastie et al (1995); 12 | 13 | ii. The nonparametric curve discrimination methods of Ferraty and Vieu (2003), where the underlying curves are represented using a semi-metric which may be based on functional principal component scores, or the values of successive derivatives, for example. The posterior classification probabilities are estimated using kernel smoothing techniques and, in particular, ideas in kernel density estimation are very important here. These are discussed by Delaigle and Hall (2010); 14 | 15 | iii. 
Using truncated versions of linear methods, as proposed by Delaigle and Hall (2012), where the truncation is achieved by projecting onto a finite number of principal components or by using partial least squares. 16 | 17 | 18 | Where the problem is to develop discrimination rules for multivariate data, there are already a number of functions available in R. The MASS package contains functions for performing both linear and quadratic discrimination, as well as logistic regression. CART methodology is catered for in the rpart package, while the klaR package, which is specifically for classification and visualization of data, contains many functions, including those for regularized discriminant analysis and nearest neighbour methods. However, these are all designed for multivariate data and are not able to utilise the form of truly functional data. 19 | 20 | The successful project would include: 21 | - A package of functions in R (which would link in with the current fda package) that would firstly enable the user to construct and visualize kernel density estimates for the different groups in the data. Secondly, it would make available a suite of nonparametric functional discrimination techniques which the user could then implement for their data. These two aims are not currently implemented in available R packages, as discussed above. 22 | - A package vignette or Journal of Statistical Software article which enables practitioners to understand and use these procedures. 23 | 24 | 25 | **Skills required:** Applicants should have: 26 | - A good working knowledge of programming in R. 27 | - A background in statistical functional data analysis, optimization, linear algebra and multivariate statistics. 28 | - Experience integrating compiled code within R. 29 | 30 | 31 | **Test:** 32 | 33 | i. The successful applicant should be able to demonstrate the ability to smooth functional data in a meaningful way using splines and carry out a functional principal components analysis using the fda package in R. To this end, they should implement these techniques on a chosen set of data and write a short report on the methods used, results and conclusions, and include the R code which has been used. 34 | 35 | ii. Write an R function to simulate a set of functional data using a self-selected procedure. 36 | 37 | iii. Develop a suggested plan for the project including a timeline for the development of the code, its documentation and testing. 38 | 39 | 40 | **Mentor:** Dr. Peter Foster (peter.foster@manchester.ac.uk), a statistician with experience in nonparametric smoothing techniques, functional data analysis, and R. 41 | 42 | 43 | **References:** 44 | 45 | Delaigle, A and Hall, P (2010). Defining Probability Density for a Distribution of Random Functions. The Annals of Statistics, 38, 1171-1193. 46 | 47 | Delaigle, A and Hall, P (2012). Achieving Near Perfect Classification for Functional Data. Journal of the Royal Statistical Society, Series B, 74, 267-286. 48 | 49 | Hastie, T, Buja, A and Tibshirani, R (1995). Penalized Discriminant Analysis. The Annals of Statistics, 23, 73-102. 50 | 51 | Ferraty, F. and Vieu, P. (2003). Curves Discrimination: A Nonparametric Functional Approach. Computational Statistics and Data Analysis, 44, 161-173. 52 | 53 | Ramsay, James O., and Silverman, Bernard W. (2006), Functional Data Analysis, 2nd ed., Springer, New York. 
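As a hedged starting point for test (i) above, a minimal sketch using simulated curves (the basis size, smoothing parameter, and simulated data are all illustrative choices, not prescribed by the project):

<code R>
library(fda)
# Simulate 20 noisy observations of a smooth curve on [0, 1].
argvals <- seq(0, 1, length.out = 101)
n <- 20
y <- outer(sin(2 * pi * argvals), rep(1, n)) +
     matrix(rnorm(101 * n, sd = 0.1), 101, n)

# Smooth with a B-spline basis and a roughness penalty on the 2nd derivative.
basis <- create.bspline.basis(rangeval = c(0, 1), nbasis = 15)
fdpar <- fdPar(basis, Lfdobj = 2, lambda = 1e-4)
fdobj <- smooth.basis(argvals, y, fdpar)$fd

# Functional principal components analysis.
pca <- pca.fd(fdobj, nharm = 3)
pca$varprop          # proportion of variance explained per harmonic
plot(pca$harmonics)  # plot the estimated eigenfunctions
</code>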
54 | 55 | 56 | -------------------------------------------------------------------------------- /RWiki/gsoc2014/enmgadgets.txt: -------------------------------------------------------------------------------- 1 | **Title**: ENMgadgets – Tools for pre- and post-processing of data in ecological niche modeling. 2 | 3 | **Description**: 4 | The ecological niche modeling technique has gained substantial popularity in estimating species distributions and is rapidly improving in its conceptual framework. Various niche modeling algorithms are available, which are implemented in the package ‘dismo’ in R. However, processing of input and output data is done outside the R environment, especially in ArcGIS. The package ENMGadgets, which aims to help users prepare the data required for niche models within the R environment, is in its infancy. We are proposing to strengthen ENMGadgets, which already has a few functions like DistanceFilter, PCARaster, PCAProjection, etc. 5 | 6 | We propose to develop functions for several pre- and post-processing tasks in ecological niche modeling, such as: 7 | 8 | * BatchCrop – to calibrate niche models, all the environmental variables are to be cropped to the same spatial resolution and extent. 9 | 10 | 11 | * Threshold – to threshold model output based on the least training probability method or by a specific percentage of training data set omission. 12 | 13 | * Partial ROC – to evaluate the performance of continuous model predictions by Maxent. Ref: doi:10.1016/j.ecolmodel.2007.11.008 14 | 15 | * MOP – to interpret the effect of extrapolation of a model calibrated in one region and projected onto a different time or space. Ref: http://dx.doi.org/10.1016/j.ecolmodel.2013.04.011 16 | 17 | * PAM – to generate a presence-absence matrix based on grid size and the species distribution output generated from niche models. 18 | 19 | 20 | 21 | 22 | 23 | **Mentor(s)** : Jorge Soberón 24 | 25 | //Could the mentor provide his email?// 26 | -------------------------------------------------------------------------------- /RWiki/gsoc2014/factoranalytics.txt: -------------------------------------------------------------------------------- 1 | ====== Improve factorAnalytics ====== 2 | ===== Summary ===== 3 | The purpose of this project is to improve the usability, reliability, features, and documentation of the factorAnalytics package. 4 | 5 | 6 | ===== Description ===== 7 | 8 | ==== Summary methods ==== 9 | The major factor model estimation functions (fitTimeSeriesFactorModel, fitFundamentalFactorModel, and fitStatisticalFactorModel) all have summary methods that display output. However, these methods do not return an object. Each summary method should be enhanced to compute informative summary statistics and information and return these values in an object. 10 | 11 | ==== Variable selection ==== 12 | The function fitTimeSeriesFactorModel supports variable selection using the functions step, regsubset, or lars. Similar functionality should be added to the function fitFundamentalFactorModel to support automatic variable selection. Also, an automated technique for choosing the appropriate number of components in a PCA factor model could be added to the function fitStatisticalFactorModel. 13 | ==== Goodness of fit measures ==== 14 | The summary methods discussed above should include factor model goodness-of-fit measures: for example, R-squared, AIC/BIC, and possibly some factor-model-specific goodness-of-fit measures. 
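As an illustration of the kind of object such a summary method could return, a minimal sketch using a plain lm() time-series regression as a stand-in for the package's fitting functions (all names here, including the class ''summary.tsfm'' and the simulated data, are hypothetical):

<code R>
# Simulated factor and asset returns (illustrative only).
set.seed(42)
facs <- data.frame(mkt = rnorm(120), smb = rnorm(120), hml = rnorm(120))
facs$asset <- 0.8 * facs$mkt + 0.2 * facs$smb + rnorm(120, sd = 0.05)

fit <- lm(asset ~ mkt + smb + hml, data = facs)

# Instead of only printing, collect the summary statistics in an object.
sumFit <- list(
  coefficients  = coef(summary(fit)),       # estimates, SEs, t-stats, p-values
  r.squared     = summary(fit)$r.squared,   # goodness of fit
  adj.r.squared = summary(fit)$adj.r.squared,
  AIC           = AIC(fit),
  BIC           = BIC(fit)
)
class(sumFit) <- "summary.tsfm"             # hypothetical class name
sumFit                                      # returned, not merely printed
</code>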
15 | ==== Unbalanced panel data ==== 16 | The current implementation of fitFundamentalFactorModel only supports balanced panel data; the function should be extended to also accept unbalanced panel data. 17 | 18 | ==== Industry factor model with an intercept term ==== 19 | The current implementation of fitFundamentalFactorModel supports fitting a Barra-type industry sector model without an intercept term (see Zivot 2005 eq. 15.9). The implementation should be extended to also fit a Barra-type industry sector model with an intercept term (see Ruppert 2010 eq. 17.15). 20 | 21 | 22 | ==== Explanatory and predictive models ==== 23 | The current implementation of fitFundamentalFactorModel is unaware of the difference between an explanatory and a predictive factor model; hence it would be the user's responsibility to lag RHS variables to have the model be predictive versus explanatory. The implementation should be enhanced to allow the user to specify whether the model is explanatory or predictive. 24 | 25 | 26 | ==== Help documentation, package vignettes, unit tests ==== 27 | The help documentation needs to be improved and include additional examples and explanation. Additional package vignettes should also be created to demonstrate the full capabilities of the package. Lastly, unit tests should be developed to help ensure the stability of the code base as new features are added. 28 | 29 | 30 | 31 | ===== Skills required: ===== 32 | Applicants should have: 33 | 34 | - Familiarity with using the factorAnalytics package. 35 | 36 | - Proficiency with R and some experience in developing in R. 37 | 38 | - Knowledge of, or comfort in quickly learning, tools such as svn, Roxygen2 and LaTeX. 39 | 40 | - Those with demonstrable experience with finance-related packages will be preferred. 41 | 42 | - A background in computer science or engineering with graduate training in finance is ideal. 43 | 44 | 45 | ===== Test: ===== 46 | 47 | A successful applicant will: 48 | 49 | - Discuss the proposed package functionality. 50 | 51 | - Write a development timeline for code implementation, documentation and testing. 52 | 53 | - Provide a complete code example of a function with documentation that demonstrates familiarity with the tools listed above. 54 | 55 | - Identify any personal commitments that conflict with their time during summer 2014. 56 | 57 | 58 | ===== Mentors ===== 59 | Guy Yollin is a faculty member in the Department of Applied Mathematics at the University of Washington. He teaches courses in statistical computing and electronic trading in the Computational Finance and Risk Management Masters of Science program. He has previously mentored GSOC projects, and would mentor the GSOC participant. 60 | Email: gyollin@uw.edu 61 | 62 | Eric Zivot is the Robert Richards Chaired Professor in Economics at the University of Washington. He previously mentored the GSOC 2013 project on factorAnalytics. 63 | Email: ezivot@u.washington.edu 64 | 65 | Doug Martin is Professor of Applied Mathematics and Director of the MS-CFRM program at the University of Washington. He is co-author with Bernd Scherer of Introduction to Modern Portfolio Optimization (2005), and is currently writing a new book on the topic with Scherer and Guy Yollin. He has previously mentored GSOC projects. 
66 | Email: doug@amath.washington.edu 67 | -------------------------------------------------------------------------------- /RWiki/gsoc2014/manifold_optimization.txt: -------------------------------------------------------------------------------- 1 | **Summary:** Build an R package for solving optimization problems over manifolds. Such optimization problems are encountered frequently in applied mathematics, statistics, and machine learning. 2 | 3 | **Description:** The goal of this project is to create an R package with a collection of functions, each of which solves an optimization problem over a matrix argument with a manifold structure. The manifolds considered include, but are not limited to, Stiefel manifolds, Grassmannian manifolds, fixed-rank manifolds, fixed-rank and positive semi-definite manifolds, oblique manifolds, rotation groups, special linear groups, and manifolds of projective matrices. Optimization methods are provided specifically for each kind of manifold. 4 | Solving an optimization problem over manifold domains has wide applications in statistics and machine learning. Most multivariate analysis methods in statistics involve such a problem, for example principal components analysis, canonical correlation analysis, Fisher’s linear discriminant analysis, and reduced-rank regression. Some other examples in the recent machine learning literature are: 5 | 6 | 1) In metric learning, a low-rank constraint on matrices allows one to learn a low-dimensional representation of the data in a discriminative way. However, naïve approaches to optimization of an objective function over the set of low-rank matrices are either prohibitively time-consuming (repeatedly using SVD) or numerically unstable (optimizing a factored representation of the low-rank matrix). In contrast, taking advantage of the manifold structure can improve the performance [5]. 7 | 8 | 2) In signal processing, adaptive estimation of the subspace of covariance matrices has been successfully applied in signal de-noising and feature extraction. Among existing methods, YAST [4] has demonstrated good stability and low complexity. However, in light of Riemannian geometry, if the problem is formulated as a search over a matrix manifold, we can further improve the YAST algorithm [3]. 9 | 10 | There is very limited support in R for optimization on manifolds. The only available R package is the one written by Dr. Adragni and Dr. Cook, called GrassmannOptim [2]. It has a limited scope and focuses mainly on Grassmannian manifolds. On the other hand, there is a Matlab toolbox for optimization on manifolds, called Manopt. It was mainly developed by P. A. Absil, who wrote a book on optimization over matrix manifolds [1]. 11 | 12 | There are several objectives of this project: 13 | 14 | - Develop a general-purpose R package whose scope is substantially broader than the scope of the existing R package GrassmannOptim. 15 | - The new R package should cover the major optimization methods on manifolds, especially the gradient-like methods. It should have most of the typical functionality of the Matlab toolbox Manopt. It will provide a class of powerful, open-source tools to the large community of R users. 16 | - The new R package will also include several algorithms that were developed recently but are not included in the existing Matlab toolbox. One such algorithm is the LORETA algorithm for fixed-rank matrices [5]. 
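To make the flavor of such gradient-like methods concrete, here is a minimal sketch (not from the proposal) of Riemannian gradient ascent on the Stiefel manifold, maximizing tr(X'AX) to recover a dominant invariant subspace; the QR-based retraction and tangent-space projection are standard ingredients, and the fixed step size is illustrative (a line search would be used in practice):

<code R>
set.seed(1)
A <- crossprod(matrix(rnorm(100), 10, 10)) / 10   # symmetric PSD test matrix
p <- 2
X <- qr.Q(qr(matrix(rnorm(10 * p), 10, p)))       # random point on the Stiefel manifold

retract <- function(Y) qr.Q(qr(Y))                # QR-based retraction back to the manifold

for (iter in 1:500) {
  G <- 2 * A %*% X                                # Euclidean gradient of tr(X'AX)
  W <- crossprod(X, G)
  rgrad <- G - X %*% (W + t(W)) / 2               # project onto the tangent space at X
  X <- retract(X + 0.01 * rgrad)                  # step and map back to the manifold
}

sum(diag(crossprod(X, A %*% X)))                    # objective value attained
sum(sort(eigen(A)$values, decreasing = TRUE)[1:2])  # sum of the two largest eigenvalues
</code>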
17 | 18 | 19 | **Skills required:** 20 | 21 | Applicants should have: 22 | - Knowledge of smooth manifolds and optimization methods over such manifolds. Some knowledge of recent developments in this subject is desirable. 23 | - Experience building an R package. 24 | - Experience developing and implementing computationally efficient algorithms. 25 | 26 | **Test:** 27 | 28 | - Write a sample package including one matrix manifold structure. In addition, there should be one algorithm based on that manifold. In the end, apply this sample package to examples, reporting the methods used and the results. 29 | - Develop a suggested plan for the project including a timeline for the development of the code, its documentation and testing. 30 | 31 | **Mentors:** 32 | 33 | Professor Jianhua Huang (jianhua at stat.tamu.edu), Department of Statistics, Texas A&M University 34 | 35 | Website: http://www.stat.tamu.edu/~jianhua/ 36 | 37 | Yihui Xie (Xie at yihui.name), Ph.D., RStudio, Inc., author of the R packages knitr, shiny, etc. 38 | 39 | Website: http://yihui.name/en/ 40 | 41 | **Reference:** 42 | 43 | [1] Absil, P. A. (2009). Optimization algorithms on matrix manifolds. Princeton University Press. 44 | 45 | [2] Adragni, K. P. (2012). GrassmannOptim: An R package for Grassmann manifold optimization. Journal of Statistical Software, 50, 1-18. 46 | 47 | [3] F. Yger, M. B. (2012). Oblique principal subspace tracking on manifold. IEEE International Conference. 48 | 49 | [4] R. Badeau, G. R. (2008). Fast and stable YAST algorithm for principal and minor subspace tracking. IEEE Transactions on Signal Processing, vol. 56, no. 8, 3437-3446. 50 | 51 | [5] Uri Shalit, D. W. (2012). Online Learning in the Embedded Manifolds of Low-Rank Matrices. Journal of Machine Learning Research, vol. 12, 429-458. 52 | -------------------------------------------------------------------------------- /RWiki/gsoc2014/pbdprof.txt: -------------------------------------------------------------------------------- 1 | =====Title===== 2 | **pbdPROF: Profiling Tools for High Performance Computing with R** 3 | 4 | 5 | =====Summary===== 6 | The pbdPROF package currently enables easy profiling of MPI-using R programs, as well as offering publication-quality, easily interpreted visualizations for profiler outputs. We intend to broaden the scope of the package and include more generic performance analysis tools, such as PAPI, Byfl, and/or TAU. 7 | 8 | 9 | =====Description===== 10 | Last year, we participated in the successful Google Summer of Code project for [[http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2013:mpiprofiler|a set of MPI profilers]], culminating in the CRAN package [[http://cran.r-project.org/web/packages/pbdPROF/index.html|pbdPROF]]. This year, we hope to parlay that successful experience to expand pbdPROF into a broader, even more useful tool for R package developers. 11 | 12 | R's own Rprof() function is extremely useful, but its profiling capabilities are limited to simple timings of R functions. This is a very good starting point in performance analysis, but for more experienced developers working with compiled code, additional performance information can be invaluable. A more general framework that offers access to, for example, low-level software/hardware counters can have tremendous impact when trying to optimize performance of compiled code. 13 | 14 | Fortunately, the HPC community is rife with excellent libraries for performing exactly this task (see the References section below). 
However, utilizing these libraries in an R package is generally far from trivial. We are proposing to make these performance analysis tools easily and readily available to R package developers. This framework would include: building a set of wrappers, bundling a subset of these tools into the pbdPROF package itself (such as with fpmpi in pbdPROF), and making the process of linking with supported external performance tools simple. Finally, the package will offer utilities for capturing outputs from these tools, parsing the outputs, and visualizing the results. 15 | 16 | In addition, the project will emphasize portability and accessibility. For the former, the package should easily build on as many major operating system platforms as each integrated performance tool will allow. For the latter, the final product should be "recognizable" to R programmers and easily integrated into an R workflow. 17 | 18 | 19 | =====Goals===== 20 | * Expand pbdPROF to include several performance analysis tools, such as PAPI, Byfl, or TAU. 21 | * Develop parsing and plotting utilities to enable quick analysis of profiler output. 22 | * Extend the package's vignette and documentation to include the new developments. 23 | 24 | 25 | 26 | =====Skills required===== 27 | The applicant should have: 28 | * Working knowledge of the C language. 29 | * Ability to work in a UNIX-like environment (Linux, BSD, Mac, etc.). 30 | * Familiarity with autotools (the GNU build system). 31 | * Experience using R, particularly [[http://cran.r-project.org/web/packages/ggplot2/index.html|ggplot2]]. 32 | * Basic familiarity with LaTeX and git/github will be useful. 33 | 34 | 35 | 36 | 37 | =====Tests===== 38 | * Write a simple C function callable by R using the native C interface (see the [[http://cran.r-project.org/doc/manuals/R-exts.html#Calling-_002eCall|Writing R Extensions]] manual for information, or see the src/ trees of the [[http://cran.r-project.org/web/packages/pbdMPI/index.html|pbdMPI]] and/or [[http://cran.r-project.org/web/packages/pbdBASE/index.html|pbdBASE]] packages for examples). This function should take a numeric (double) input and return that value plus 1.0. For quick reference, the basic build process is: compile the function as a shared library, load the library with dyn.load(), and call the compiled code via .Call(). 39 | 40 | * Build [[http://icl.cs.utk.edu/papi/|PAPI]], and run the PAPI_flops example. Provide that example's output. 41 | 42 | 43 | 44 | 45 | =====Mentors===== 46 | Primary GSoC contact: Drew Schmidt 47 | * email: 48 | * link_id: wrathematics 49 | \\ 50 | The pbdR Core Team also includes: 51 | * Wei-Chen Chen 52 | * email: 53 | * link_id: snoweye 54 | * George Ostrouchov 55 | * email: 56 | * link_id: ornlgeorge 57 | * Pragneshkumar Patel 58 | * email: 59 | * link_id: prag 60 | \\ 61 | All pbdR Core Team members will participate and help with mentoring, but the listed primary contact will be leading the mentorship. 
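For orientation, a minimal sketch of the first test above (the file name ''plus_one.c'' and the function name are our own illustrative choices, not part of the project):

<code R>
# Write the C source from R so the example is self-contained.
writeLines(c(
  "#include <R.h>",
  "#include <Rinternals.h>",
  "",
  "SEXP plus_one(SEXP x) {",
  "  SEXP out = PROTECT(allocVector(REALSXP, 1));",
  "  REAL(out)[0] = asReal(x) + 1.0;  /* input value plus 1.0 */",
  "  UNPROTECT(1);",
  "  return out;",
  "}"
), "plus_one.c")

system("R CMD SHLIB plus_one.c")                    # compile to a shared library
dyn.load(paste0("plus_one", .Platform$dynlib.ext))  # load the compiled code
.Call("plus_one", 41)                               # returns 42
</code>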
62 | 63 | 64 | 65 | =====References===== 66 | pbdPROF: 67 | * Current dev build: https://github.com/RBigData/pbdPROF 68 | * Stable CRAN build: http://cran.r-project.org/package=pbdPROF 69 | 70 | \\ 71 | Profilers: 72 | * Performance Application Programming Interface (PAPI): http://icl.cs.utk.edu/papi/ 73 | * Byfl: https://github.com/losalamos/Byfl 74 | * TAU Performance System: http://www.cs.uoregon.edu/research/tau/home.php 75 | * HPCToolkit: http://hpctoolkit.org/ 76 | * IPM: http://www.nersc.gov/users/software/debugging-and-profiling/ipm 77 | * Scalasca: http://www.scalasca.org/about/about.html 78 | 79 | \\ 80 | Vis tools for profilers: 81 | * ParaProf: https://www.cs.uoregon.edu/research/tau/docs/paraprof/ 82 | * Jumpshot: http://www.mcs.anl.gov/research/projects/perfvis/software/viewers/ 83 | 84 | \\ 85 | Related work: 86 | * The Rmpi package: http://cran.r-project.org/web/packages/Rmpi/ 87 | * The pbdR project: http://r-pbd.org/ 88 | * The pbdMPI package: http://cran.r-project.org/package=pbdMPI 89 | * The pbdDEMO package: http://cran.r-project.org/package=pbdDEMO -------------------------------------------------------------------------------- /RWiki/gsoc2014/right.txt: -------------------------------------------------------------------------------- 1 | ====== RIGHT : R Interactive Graphics via HTml ====== 2 | 3 | **Summary:** R Interactive Graphics via HTml (RIGHT) supports easy data exploration using interactive graphics and analysis, revealing information hidden among static graphs. It uses HTML5 canvas and JavaScript, instead of Qt, to make analysis and visualization results easier to distribute to a wider audience. Furthermore, it will support an intuitive R API for easy setup. 4 | 5 | **Background/Motivation:** There are excellent packages in R that produce very polished static graphics, e.g. lattice and ggplot2. However, exploring data using static graphics can be tedious: users have to frequently interact with the command line to get meaningful graphs that answer even a simple question like “how are graphical elements (a point or a group of points) in one graph related to those in another graph?” Interactive graphics linking data in different graphs and interactive visualization of summary statistics, combined with an intuitive R API for setup, can help address these questions and make data exploration easy and also fun. 6 | 7 | **Description/Goal** 8 | 9 | In GSOC 2013, the RIGHT project was successfully completed and its first version was released. However, some parts still need to be improved to capture a wide range of users’ interests. Therefore, this project proposes enhancements in two major areas: (a) improving the R API (e.g. ggplot2-like syntax) to provide a user-friendly environment, and refactoring the JavaScript code so that RIGHT can handle large amounts of data (e.g. 50K points); (b) adding server offload functionality using Shiny to enable both client- and server-side interactivity. 10 | 11 | - **Improving the R API & Refactoring the JavaScript code** 12 | * **Moving low-level statistical processing from the JavaScript side to the R side:** Calculating data on the R side is very efficient, so moving the calculation of the data used to draw graphs from the JavaScript side to the R side will improve the performance of RIGHT. The JavaScript side will then only provide drawing functionality, using the intermediate data objects from the R side. Data transfer methods (e.g. JSON) between JavaScript and R will be needed. 
* **Restructuring the R API to support a user-friendly environment:** the latest version of the RIGHT API is similar to the base-graphics API, which is not intuitive for specifying the various options needed to create diverse interactive graphs. Therefore, a state-of-the-art API, such as that of ggplot2, would be a good model for making the R API intuitive. 13 | - **Adding server offload functionality using Shiny to enable both client- and server-side interactivity** 14 | 15 | 16 | **Demo:** A demo video is available at [[http://youtube.googleapis.com/v/thlwjYFC_yY?vq=hd1080;hd=1;autoplay=1;]]. 17 | 18 | **Code** https://code.google.com/p/r-interactive-graphics-via-html/ 19 | 20 | **Related work/Literature** 21 | * [[http://spotfire.tibco.com/|TIBCO Spotfire]]: [[http://www.youtube.com/user/SpotfireGuy|many interactive visualization features]] were inspired by this commercial software package. RIGHT will be open source and provide an intuitive R API for setup; in comparison, Spotfire only supports GUI-based setup. 22 | * [[https://github.com/ggobi/cranvas|cranvas]]: this is an excellent package that produces interactive visualization using Qt. Visualization using Qt has limited availability on mobile platforms (see [[http://doc.qt.digia.com/4.7/supported-platforms.html|Qt Supported Platforms]]). A package based on HTML5 canvas and JavaScript with similar functionality will give users more options when analysis and visualization results need to be widely distributed. 23 | * [[http://www.omegahat.org/SVGAnnotation/|SVGAnnotation]], [[http://sjp.co.nz/projects/gridsvg/|gridSVG]], [[http://svgmapping.r-forge.r-project.org/SVG_Mapping/Home.html|SVGMapping]], and [[https://github.com/hadley/r2d3|r2d3]]: SVG-based packages can provide polished visualization [[http://d3js.org/]], but the DOM-based approach makes it difficult to provide some types of interactivity. For instance, the many-to-one simultaneous selection between the scatter plot and the histogram shown in the [[http://youtube.googleapis.com/v/thlwjYFC_yY?vq=hd1080;hd=1;autoplay=1;|demo]] will be difficult to create with SVG, especially among more than two graphs. 24 | * [[http://www.rforge.net/canvas/|canvas]]: the canvas package uses HTML5 canvas but does not support interactivity. 25 | * [[http://www.rstudio.com/shiny/|RStudio Shiny]]: Shiny does not offer interactive visualization. However, the server offloading idea came from Shiny Server. 26 | 27 | **Skills required:** R, JavaScript, and HTML5 28 | 29 | **Tests** 30 | - Load the Theoph dataset and plot concentration (conc) versus time (Time) by subject (Subject) with at least two graphics packages in R. Points for each subject should have different colors, and a legend should be added. 31 | - Re-draw the plot created in problem 1 with JavaScript using the kineticJS library (http://kineticjs.com/). 32 | 33 | **Mentors:** [[http://icc.skku.ac.kr/~jaewlee|Jae W. Lee]], Jung Hoon Lee, Chung Ha Sung -------------------------------------------------------------------------------- /RWiki/gsoc2014/roughset.txt: -------------------------------------------------------------------------------- 1 | 2 | 3 | ====== Package for Handling Rough Sets ====== 4 | 5 | 6 | **Summary:** A package for handling rough computations relating to reducts and knowledge in GNU/R. 7 | 8 | **Description:** To Do 9 | 10 | **Skills required:** Knowledge of rough sets, algorithms for reducts (possibly), ability to handle big data sets in GNU/R or C, or knowledge representation with rough and fuzzy sets. 
11 | 12 | 13 | **Test:** Implementation of any of the basic reduct algorithms in any language, OR implementation of any routine in concept analysis or an allied field, in addition to competence in GNU/R. 14 | 15 | **Mentor:** [[http://www.logicamani.in|A Mani]]. 05.03.2014 -------------------------------------------------------------------------------- /RWiki/gsoc2014/spotvol.txt: -------------------------------------------------------------------------------- 1 | ====== Spot volatility estimation: Methods and applications ====== 2 | ===== Summary ===== 3 | This project aims at expanding the R package highfrequency with a broad range of functionality for estimating the intraday volatility of high-frequency return series. 4 | 5 | 6 | 7 | ===== Description ===== 8 | Spot volatility estimates are important for high-frequency trading signal generation and risk management. Several spot volatility estimators have been proposed in recent years. This project aims at implementing them in the R package highfrequency: 9 | 10 | i. The stochastic volatility approach of Beltratti and Morana (2001), which in particular includes both deterministic and stochastic periodic terms. The stochastic volatility model can be cast in state space form and estimated by maximum likelihood. 11 | 12 | ii. The non-parametric kernel approach of Kristensen (2010). 13 | 14 | iii. The piecewise constant volatility approach in Fried (2012). 15 | 16 | iv. The Kalman filter of periodic intraday volatility in Bos, Janus and Koopman (2012). 17 | 18 | v. The GARCH + periodicity factor of Giot (2005), Chu and Lam (2011) and So and Xu (2013). 19 | 20 | vi. The periodic EGARCH model proposed by Bollerslev and Ghysels (1996). 21 | 22 | Besides estimation functions, the project aims at creating simulation functions to generate high-frequency data according to various high-frequency models with periodicity in the volatility of the diffusion. 23 | 24 | Finally, functions should be developed to test for jumps based on the estimated spot volatility and to forecast intraday value-at-risk under different distributional assumptions. 25 | 26 | This project is most related to the function spotVol in the highfrequency package, which implements the spot volatility estimators proposed in Boudt, Croux and Laurent (2011). 27 | 28 | ===== Skills required ===== 29 | Applicants should have: 30 | 31 | - A good working knowledge of programming in R. 32 | 33 | - A background in financial econometrics. 34 | 35 | 36 | ===== Test ===== 37 | 38 | i. The successful applicant should be able to demonstrate the ability to extend the spotVol function in the highfrequency package with the methods described above. To this end, the candidate should screen the function spotVol and propose improvements. 39 | 40 | ii. Develop a suggested plan for the project including a timeline for the development of the code, its documentation and testing. 41 | 42 | 43 | 44 | 45 | 46 | ===== Mentor ===== 47 | 48 | Dr. Kris Boudt and Jonathan Cornelissen, who are coauthors of the R package highfrequency, and Dr. Daniele Signori. All three mentors have published research on spot volatility estimation. 49 | 50 | ===== References ===== 51 | 52 | Beltratti, Andrea and Claudio Morana (2001). Deterministic and stochastic methods for estimation of intraday seasonal components with high frequency data. Economic Notes 30, 205-234. 53 | 54 | Bollerslev, T., and E. Ghysels. 1996. Periodic Autoregressive Conditional Heteroscedasticity. Journal of Business & Economic Statistics 14, 139–151. 
-------------------------------------------------------------------------------- /RWiki/gsoc2014/spotvol.txt: --------------------------------------------------------------------------------
1 | ====== Spot volatility estimation: Methods and applications ======
2 | ===== Summary =====
3 | This project aims at expanding the R package highfrequency with a broad range of functionality for estimating the intraday volatility of high-frequency return series.
4 |
5 |
6 |
7 | ===== Description =====
8 | Spot volatility estimates are important for high-frequency trading signal generation and risk management. Several spot volatility estimators have been proposed in recent years. This project aims at implementing them in the R package highfrequency:
9 |
10 | i. The stochastic volatility approach of Beltratti and Morana (2001), which in particular includes both deterministic and stochastic periodic terms. The stochastic volatility model can be cast in state-space form and estimated by maximum likelihood.
11 |
12 | ii. The non-parametric kernel approach of Kristensen (2010).
13 |
14 | iii. The piecewise-constant volatility approach in Fried (2012).
15 |
16 | iv. The Kalman filter of periodic intraday volatility in Bos, Janus and Koopman (2012).
17 |
18 | v. The GARCH + periodicity factor of Giot (2005), Chu and Lam (2011) and So and Xu (2013).
19 |
20 | vi. The periodic EGARCH model proposed by Bollerslev and Ghysels (1996).
21 |
22 | Besides estimation functions, the project aims at creating simulation functions to generate high-frequency data according to various high-frequency models with periodicity in the volatility of the diffusion. (A minimal simulation-and-estimation sketch is given below.)
23 |
24 | Finally, functions should be developed to test for jumps based on the estimated spot volatility and to forecast intraday value-at-risk under different distributional assumptions.
25 |
26 | This project is most related to the function spotVol in the highfrequency package, which implements the spot volatility estimator proposed in Boudt, Croux and Laurent (2011).
27 |
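To give a flavor of the simulation and estimation functionality, here is a minimal self-contained sketch (our own illustration, independent of the current spotVol code): it simulates one day of one-minute returns with a deterministic U-shaped intraday volatility pattern and recovers the spot volatility with a simple Gaussian-kernel smoother in the spirit of Kristensen (2010):

<code r>
set.seed(42)
n     <- 390                    # one-minute returns over a 6.5-hour trading day
t_i   <- (1:n) / n              # intraday time, rescaled to (0, 1]
sigma <- 0.01 * (1 + 8 * (t_i - 0.5)^2)       # U-shaped diurnal volatility
r     <- rnorm(n, sd = sigma / sqrt(n))       # returns under the true spot vol

## Kernel (Nadaraya-Watson type) estimate of sigma^2(t) from squared returns.
h <- 0.05                       # bandwidth, as a fraction of the day
spot_var <- sapply(t_i, function(t) {
  w <- dnorm((t_i - t) / h)     # Gaussian kernel weights
  n * sum(w * r^2) / sum(w)     # rescale to per-day variance units
})

plot(t_i, sigma, type = "l", lwd = 2,
     xlab = "Intraday time", ylab = "Spot volatility")
lines(t_i, sqrt(spot_var), col = "red")
legend("top", c("true", "kernel estimate"), col = c("black", "red"), lwd = c(2, 1))
</code>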
28 | ===== Skills required =====
29 | Applicants should have:
30 |
31 | - A good working knowledge of programming in R.
32 |
33 | - A background in financial econometrics.
34 |
35 |
36 | ===== Test =====
37 |
38 | i. The successful applicant should be able to demonstrate the ability to extend the spotVol function in the highfrequency package with the methods described above. To this end, the candidate should screen the function spotVol and propose improvements.
39 |
40 | ii. Develop a suggested plan for the project, including a timeline for the development of the code, its documentation and testing.
41 |
42 |
43 |
44 |
45 |
46 | ===== Mentor =====
47 |
48 | Dr. Kris Boudt and Jonathan Cornelissen, who are coauthors of the R package highfrequency, and Dr. Daniele Signori. All three mentors have published research on spot volatility estimation.
49 |
50 | ===== References =====
51 |
52 | Beltratti, Andrea and Claudio Morana (2001). Deterministic and stochastic methods for estimation of intraday seasonal components with high frequency data. Economic Notes 30, 205-234.
53 |
54 | Bollerslev, T. and E. Ghysels (1996). Periodic Autoregressive Conditional Heteroscedasticity. Journal of Business & Economic Statistics 14, 139-151.
55 |
56 | Bos, Charles, Janus, Pawel and Siem Jan Koopman (2012). Spot variance path estimation and its application to high-frequency jump testing. Journal of Financial Econometrics 10, 354-389.
57 |
58 | Boudt, Kris, Croux, Christophe and Laurent, Sebastien (2011). Robust estimation of intraweek periodicity in volatility and jump detection. Journal of Empirical Finance 18, 353-369.
59 |
60 | Chu, Carlin C.F. and K.P. Lam (2011). Modeling intraday volatility: A new consideration. Journal of International Financial Markets, Institutions and Money 21, 388-418.
61 |
62 | Fried, Roland (2012). On the online estimation of local constant volatilities. Computational Statistics and Data Analysis 56, 3080-3090.
63 |
64 | Giot, Pierre (2005). Market risk models for intraday data. European Journal of Finance 11, 309-324.
65 |
66 | Kristensen, Dennis (2010). Nonparametric filtering of the realized spot volatility: A kernel-based approach. Econometric Theory 26, 60-93.
67 |
68 | So, Mike K.P. and Rui Xu (2013). Forecasting intraday volatility and value-at-risk with high-frequency data. Asia-Pacific Financial Markets 20, 83-111.
69 |
-------------------------------------------------------------------------------- /RWiki/highly_parallelized_implementations_of_robust_statistical_procedures_using_cuda.txt: --------------------------------------------------------------------------------
1 | ====== robgpu: Highly parallelized implementations of robust statistical procedures using CUDA ======
2 |
3 |
4 | **Summary:** A highly parallelized framework of robust methods for the analysis of large data sets, based on utilizing the GPU via the CUDA parallel computing platform.
5 |
6 | **Description:** In modern data analysis, data sets are often large in the number of variables or observations. In addition, data sets frequently contain outliers in practice, requiring robust statistical methods for their analysis. However, applying robust methods to large data sets is computationally expensive. Furthermore, state-of-the-art methods frequently require the selection of regularization parameters, further aggravating the problem of computation time.
7 |
8 | Practical algorithms for widely used robust procedures are often based on exploring different subsets of the data and applying a nonrobust estimator to those subsets. Computation time can thus be drastically reduced by exploring subsets in a highly parallelized manner, taking advantage of the machine's GPU. The CUDA platform makes it possible to perform arbitrary calculations on supported GPUs and is therefore highly suited for this application. (A serial sketch of such a subset-based algorithm follows the list below.)
9 |
10 | The aim of this project is to develop a new R package **robgpu** by utilizing the CUDA platform to implement the following functionality in a highly parallelized manner:
11 |
12 | 1. Minimum covariance determinant (MCD) estimator of location and scatter. This is probably the most widely used robust scatter matrix estimator. It is based on finding the subset of the data of a given size that minimizes the determinant of the corresponding covariance matrix.
13 |
14 | 2. Least trimmed squares (LTS) regression. This widely used regression estimator is based on finding the subset of the data of a given size that minimizes the corresponding sum of squared residuals.
15 |
16 | 3. Regularized versions of the above estimators for high-dimensional large data. MCD and LTS cannot be computed if the number of variables exceeds the number of observations. This computational problem can be overcome by adding a penalty term on the estimated parameters to the objective function. Certain penalty terms, such as a lasso-type penalty, have the further advantage that some of the parameters are set to exactly 0, thus yielding better interpretability of the results through sparse estimates. The implementation should be general, such that different penalty functions can easily be used.
17 |
18 | 4. Integrate the highly parallelized robust methods into functionality for, e.g., outlier detection, graphical modeling, principal component analysis (PCA) or linear discriminant analysis (LDA), as time permits.
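To make the subset-search structure concrete, the following serial R sketch implements one random start of the concentration ("C-") step iteration for LTS, in the spirit of Rousseeuw and van Driessen (2006); the function name is ours. On the GPU, thousands of such random starts would run concurrently, which is precisely the parallelism this project targets:

<code r>
## One random start of LTS: fit OLS on a size-h subset, keep the h
## observations with the smallest squared residuals, and iterate until
## the subset stabilizes (a local optimum of the LTS objective).
lts_cstep <- function(X, y, h, max_iter = 50) {
  n   <- nrow(X)
  sub <- sample(n, h)                   # random initial subset
  for (k in seq_len(max_iter)) {
    fit     <- lm.fit(X[sub, , drop = FALSE], y[sub])
    res2    <- as.vector((y - X %*% fit$coefficients)^2)
    new_sub <- order(res2)[1:h]         # h smallest squared residuals
    if (setequal(new_sub, sub)) break   # converged
    sub <- new_sub
  }
  list(coef = fit$coefficients, subset = sort(sub),
       objective = sum(res2[new_sub])) # sum of the h smallest squared residuals
}

## Usage: the best of several random starts approximates the LTS fit.
set.seed(1)
X <- cbind(1, rnorm(100))
y <- as.vector(X %*% c(1, 2) + rnorm(100))
y[1:10] <- y[1:10] + 10                 # contaminate with vertical outliers
starts <- replicate(20, lts_cstep(X, y, h = 75), simplify = FALSE)
best   <- starts[[which.min(sapply(starts, `[[`, "objective"))]]
best$coef                               # close to c(1, 2) despite the outliers
</code>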
19 |
20 | **Literature**
21 |
22 | P.J. Rousseeuw, K. van Driessen (1999). A fast algorithm for the Minimum Covariance Determinant Estimator. Technometrics 41(3), 212-223.
23 |
24 | P.J. Rousseeuw, K. van Driessen (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery 12(1), 29-45.
25 |
26 | A. Alfons, C. Croux, S. Gelper (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics. In press.
27 |
28 | C. Croux, G. Haesbroeck. The regularized minimum covariance determinant estimator. Preprint.
29 |
30 | **Skills required:** Experience with R, C++, and in particular parallel computing via the CUDA platform is required. An ideal candidate would be experienced in implementing highly parallelized procedures for the analysis of large data.
31 |
32 | **Mentor:** [[http://people.few.eur.nl/alfons/|Andreas Alfons]] has authored various R packages for robust statistics written in R and C++ and is experienced in parallel computing. He has published articles on applied robust data analysis as well as statistical software in renowned international journals.
-------------------------------------------------------------------------------- /RWiki/projects.txt: --------------------------------------------------------------------------------
1 | ====== Developers' projects ======
2 |
3 | Discuss here (future) projects that are not yet mature enough to get a definitive location ([[http://R-forge.r-project.org|R-Forge]], ...) or name. This is also the section to use for writing proposals, such as those for the [[developers:projects:gsoc2012|Google Summer of Code 2012]] and [[developers:projects:gci2011|Google Code-In 2011]].
4 |
5 | **NOTE** This is a good place to provide links to ideas for Summer of Code or similar projects, including those that attract some interest but, for some reason, do not get funded in a current competition.
6 |
7 | Please, once your project finds its way to a definitive location, indicate this clearly with a link to the main project's page. If, unfortunately, you decide not to push the project further, indicate that clearly as well.
-------------------------------------------------------------------------------- /RWiki/sidebar.txt: --------------------------------------------------------------------------------
1 | ====== Developers' projects ======
2 | Section for discussing (future) projects around R development.
3 |
4 | ===== Section content =====
5 | {{indexmenu>developers:projects}}
6 |
7 |
8 | ----
9 | Back to [[developers:developers|Developers section]] | [[:start|Start page]]
-------------------------------------------------------------------------------- /RWiki/student_app_template.txt: --------------------------------------------------------------------------------
1 | ===== R GSoC Student Application Template =====
2 |
3 | Project title:
4 |
5 | Project short title (30 characters):
6 |
7 | URL of project idea page:
8 |
9 |
10 |
11 |
12 | ==== Bio of Student ====
13 |
14 | Provide a brief (text) biography, and explain why you think your background qualifies you for this project.
15 |
16 | === CONTACT INFORMATION ===
17 |
18 | Student name:
19 |
20 | Melange Link_id:
21 |
22 | Student postal address:
23 |
24 | Telephone(s):
25 |
26 | Email(s):
27 |
28 | Other communication channels: Skype/Google+, etc.:
29 |
30 |
31 | === Student affiliation ===
32 |
33 | Institution:
34 |
35 | Program:
36 |
37 | Stage of completion:
38 |
39 | Contact to verify:
40 |
41 |
42 | ==== Schedule Conflicts: ====
43 |
44 | Please list any schedule conflicts that would interfere with treating your proposed R GSoC project as a full-time job over the summer.
45 |
46 | ==== MENTORS ====
47 |
48 | Mentor names:
49 |
50 | Mentor emails:
51 |
52 | Mentor link_ids:
53 |
54 | Have you been in touch with the mentors? When and how?
55 |
56 |
57 |
58 | ==== CODING PLAN & METHODS ====
59 |
60 | Describe in detail your plan for completing the work. What functions will be written, how and when will you do design, and how will you verify the results of your coding? The sub-section headings below are examples only. Each project is different; please make your application appropriate to the work you propose.
61 |
62 | Describe perceived obstacles and challenges, and how you plan to overcome them.
63 |
64 |
65 |
66 | ==== TIMELINE ====
67 |
68 | (consult GSOC schedule)
69 |
70 | Provide a detailed timeline of how you plan to spend your summer. Don't leave testing and documentation for last, as that's almost a guarantee of a failed project.
71 |
72 | What is your contingency plan for things not going to schedule?
73 |
74 |
75 | ==== MANAGEMENT OF CODING PROJECT ====
76 |
77 | How do you propose to ensure code is submitted / tested?
78 |
79 | How often do you plan to commit? What changes in commit behavior would indicate a problem?
80 |
81 |
82 | ==== TEST ====
83 |
84 | Describe the qualification test that you have submitted to your project mentors. If feasible, include code, details, output, and examples of similar coding problems that you have solved.
85 |
86 |
87 | ==== Anything Else ====
-------------------------------------------------------------------------------- /melange-banner.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstats-gsoc/gsoc2015/f6d0f1c33cc0e4df1515cbee810bbaf72baf8f06/melange-banner.png --------------------------------------------------------------------------------
/summeR-of-code.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstats-gsoc/gsoc2015/f6d0f1c33cc0e4df1515cbee810bbaf72baf8f06/summeR-of-code.png --------------------------------------------------------------------------------