├── .gitignore ├── LICENCE ├── README.md └── advanced_scheduling.md /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /LICENCE: -------------------------------------------------------------------------------- 1 | Creative Commons Attribution 4.0 International Public License 2 | 3 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions. 4 | 5 | Section 1 – Definitions. 6 | 7 | Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image. 8 | Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License. 
9 | Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights. 10 | Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements. 11 | Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material. 12 | Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License. 13 | Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license. 14 | Licensor means the individual(s) or entity(ies) granting rights under this Public License. 15 | Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them. 
16 | Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world. 17 | You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning. 18 | 19 | Section 2 – Scope. 20 | 21 | License grant. 22 | Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to: 23 | reproduce and Share the Licensed Material, in whole or in part; and 24 | produce, reproduce, and Share Adapted Material. 25 | Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions. 26 | Term. The term of this Public License is specified in Section 6(a). 27 | Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material. 28 | Downstream recipients. 29 | Offer from the Licensor – Licensed Material. 
Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License. 30 | No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material. 31 | No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i). 32 | 33 | Other rights. 34 | Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise. 35 | Patent and trademark rights are not licensed under this Public License. 36 | To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties. 37 | 38 | Section 3 – License Conditions. 39 | 40 | Your exercise of the Licensed Rights is expressly made subject to the following conditions. 41 | 42 | Attribution. 
43 | 44 | If You Share the Licensed Material (including in modified form), You must: 45 | retain the following if it is supplied by the Licensor with the Licensed Material: 46 | identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated); 47 | a copyright notice; 48 | a notice that refers to this Public License; 49 | a notice that refers to the disclaimer of warranties; 50 | a URI or hyperlink to the Licensed Material to the extent reasonably practicable; 51 | indicate if You modified the Licensed Material and retain an indication of any previous modifications; and 52 | indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License. 53 | You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information. 54 | If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. 55 | If You Share Adapted Material You produce, the Adapter's License You apply must not prevent recipients of the Adapted Material from complying with this Public License. 56 | 57 | Section 4 – Sui Generis Database Rights. 
58 | 59 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material: 60 | 61 | for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database; 62 | if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material; and 63 | You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database. 64 | 65 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights. 66 | 67 | Section 5 – Disclaimer of Warranties and Limitation of Liability. 68 | 69 | Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You. 
70 | To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You. 71 | 72 | The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability. 73 | 74 | Section 6 – Term and Termination. 75 | 76 | This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically. 77 | 78 | Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates: 79 | automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or 80 | upon express reinstatement by the Licensor. 81 | For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License. 82 | For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License. 83 | Sections 1, 5, 6, 7, and 8 survive termination of this Public License. 84 | 85 | Section 7 – Other Terms and Conditions. 86 | 87 | The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. 
88 | Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License. 89 | 90 | Section 8 – Interpretation. 91 | 92 | For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License. 93 | To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions. 94 | No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor. 95 | Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority. 96 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # scheduling_r\_scripts 2 | 3 | ![](https://mirrors.creativecommons.org/presskit/buttons/80x15/svg/by.svg) [![DOI](https://zenodo.org/badge/297568091.svg)](https://zenodo.org/badge/latestdoi/297568091) 4 | 5 | Authors: Roel M. Hogervorst 6 | 7 | *Last change 2021-1-19* 8 | 9 | This is an overview of many of the ways you can run an R script. 10 | 11 | This has become a rather large overview but I think could help a lot of R-users. This overview is for you if you want to know how to run your batch script (do one thing without supervision) automatically. 
This overview does not talk about shiny or plumber, they are both great products that do incredible work, but they both sort of assume they run on a computer that is always on. I'm talking about scripts that you run once every day/week/hour/ etc. If you want more complex workflows look into the [advanced scheduling page](advanced_scheduling.md). 12 | 13 | I'm trying to answer the following questions about all solutions: 14 | 15 | - Where are the costs? 16 | - How easy is it to set up and use. and how easy can you transfer your work to your coworker 17 | - how easy it is to change things, the script, changing secrets or frequency? 18 | - Can you manage your entire configuration in code? 19 | - Is there logging, how easy is it see what exactly went wrong? 20 | - How precise is it and will it auto recover on failure? 21 | - how do you have to deal with secrets? can they leak? 22 | - in what country does it run (if a president of that country gets a tantrum and starts to lock others out) 23 | 24 | I'm separating out 3 usecases: 25 | 26 | 1. [You run it on your own computer](#own-computer) 27 | 2. [You have a server available (a computer that is always on and accessible to you)](#own-a-server) 28 | 3. [You use ephemeral services (serverless, pay for use only)](#ephemeral) 29 | 30 | If you are looking for some (non binding, non-legal) advice, I give some suggestions [in Advice for useRs](#advice-for-users). 31 | 32 | # Own computer 33 | 34 | [*back to top*](#scheduling_r_scripts) 35 | 36 | Think about your laptop / computer. You can run a script, but you can also make the computer run it. So in order of difficulty. 37 | 38 | 1. Run it manually (`source("script.R")`) 39 | 2. Run it on a schedule 40 | 41 | ## Running it manually 42 | 43 | [*back to top*](#scheduling_r_scripts) 44 | 45 | Option 1 is really not sustainable, if you forget, it doesn't run. However the reality is that many companies rely on such a manual step. So we cannot ignore it completely. 
Millions of people worldwide copy stuff into an Excel file from another Excel file, save it and send it to someone else. Manual is good for unstable processes, with lots of changing demands. It also helps in applying the best algorithm of all: common sense. 46 | 47 | **Where are the costs?**: High costs in people hours, maintenance of tools and training of other people. Specifically when the person doing this is doing repetitive work that a computer could have done, these are wasted hours. The cost of the computer is already paid for in other ways. 48 | 49 | **How easy is it to set up and use. and how easy can you transfer your work to your coworker**: This is probably a process that evolved over time. Setup and use are unknown, until someone new is trained to do it. 50 | 51 | **how easy it is to change things, the script, changing secrets or frequency?** : Easy to change, with some cost in new secrets that someone needs to type in. Changes in frequency only cost time that could have been spent on something else. 52 | 53 | **Can you manage your entire configuration in code?**: No, this is manual. 54 | 55 | **Is there logging, how easy is it see what exactly went wrong?**: In my experience there is a lot of hidden configuration and no logs except for the brain of the user. 56 | 57 | **How precise is it and will it auto recover on failure?**: It can be as precise as you want, but it will cost you. It will auto recover on failure, because the user can retry. 58 | 59 | **how do you have to deal with secrets? can they leak?**: There is absolutely a risk of losing secrets and a risk of choosing easy passwords because you have to type them so much. Password managers are a help here. Leaks can come from phishing or reuse of passwords. 60 | 61 | **in what country does it run**: In the country where the user is at that moment.
62 | 63 | **Links**: 64 | 65 | - why you need a [password manager](https://www.howtogeek.com/141500/why-you-should-use-a-password-manager-and-how-to-get-started/) 66 | 67 | ## Running it on a schedule on your computer 68 | 69 | [*back to top*](#scheduling_r_scripts) 70 | 71 | For Linux and Mac you have cron (configured through crontab), a system tool that executes commands on a schedule. For ease of use there is a package (maybe multiple ones) to help you with setting up your R job, called cronR (see links below). For Windows systems there is Task Scheduler, which works slightly differently but can do the same thing. There is also an R package to schedule scripts in Task Scheduler, taskscheduleR (see links below). 72 | 73 | This is an easy thing to try out for yourself. It is also an easy switch if you already have a script that runs manually without your input (except typing source). There are some small snags: on Linux the cron job may run as a different user with a different environment, so it might not have access to the same R library (you can specify the user if you want). You also have to think about the directory where the R process starts. Many of these tasks are done for you by the two packages I mention in the links of this section. 74 | 75 | **Where are the costs?**: Lower than running it manually, after some time investment to make it run for you (see the XKCD comic in the links of this section). The process, if it doesn't take all your system resources, can run while you are doing other things. 76 | 77 | **How easy is it to set up and use. and how easy can you transfer your work to your coworker**: I think the initial setup is quite some work for inexperienced workers, but once it runs you can easily hand it over to a different user, who can set it up in the same way. If your computer is turned off, the process will not run. 78 | 79 | **Can you manage your entire configuration in code?**: In theory yes, in practice no. You need to manually set up the cron task / Task Scheduler task and check that it works.
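The working-directory snag mentioned above can be handled with a small wrapper that cron calls instead of calling `Rscript` directly. This is only a minimal sketch: the function name `run_in_project`, the `logs/job.log` location and all paths in the comments are assumptions for illustration, not something cronR or cron prescribes.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command from inside a project directory
# and append its output to a log file in that project.
# The body is a subshell ( ... ), so the caller's directory is untouched.
run_in_project() (
    project_dir="$1"
    shift
    # cd first, so relative paths in the R script resolve as they do
    # when you run the script by hand from the project folder.
    cd "$project_dir" || exit 1
    mkdir -p logs || exit 1
    {
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] start: $*"
        "$@"
        status=$?
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] exit status: $status"
    } >> logs/job.log 2>&1
    # Pass the command's exit status back to the caller.
    exit "$status"
)

# Assumed usage at the end of a script such as /home/you/bin/run_job.sh:
#   run_in_project "$HOME/projects/report" Rscript script.R
# Assumed crontab entry that starts it every day at 09:00:
#   0 9 * * * /home/you/bin/run_job.sh
```

Keeping the log inside the project folder is a design choice of this sketch; it means you know where to look when a run goes wrong, whatever user cron ran the job as.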
80 | 81 | **Is there logging, how easy is it see what exactly went wrong?**: Cron does have logging. I'm not sure about Task Scheduler, but there must be some. Finding out where the logs live is sometimes difficult. 82 | 83 | **How precise is it and will it auto recover on failure?**: You will not get a message that the job failed and it will not retry with CRON. It will fail and only run again at the next scheduled time. It is quite precise: if you say start at 0900, then the process will start at 0900. 84 | 85 | **how do you have to deal with secrets? can they leak?**: It runs on your computer so it really depends on where you store the secrets. If you place them in a .Renviron file, then it depends on where you store that file. If it stays in the folder where the R process starts, then R processes started elsewhere do not load it. If you place it in your home folder, all the R processes have access to it. If you hard code the secrets in the script, then anyone who can access the script will have access. 86 | 87 | **in what country does it run**: In the country where the user is at that moment. 88 | 89 | **Links**: 90 | 91 | - For linux and mac systems: [CronR (on CRAN)](https://cran.r-project.org/web/packages/cronR/) 92 | - [help in scheduling with CRON use crontab.guru](https://crontab.guru) 93 | - for Windows systems: [taskscheduleR (on CRAN)](https://cran.r-project.org/web/packages/taskscheduleR/vignettes/taskscheduleR.html) 94 | - [XKCD, is it worth your time to automate a task?](https://xkcd.com/1205/ "There is always a relevant XKCD") 95 | - for complex workflows with multiple steps look into the [advanced scheduling page](advanced_scheduling.md). 96 | 97 | # Own a server 98 | 99 | [*back to top*](#scheduling_r_scripts) 100 | 101 | In this case you own a server. A server is just a 'laptop' (usually without a screen, and sometimes in the cloud).
Examples are: a Raspberry Pi you have lying around, an old laptop that you can use, an actual server rack in house or office, or a cloud server, for instance a virtual machine from one of the cloud providers (see links below). 102 | 103 | The largest issue is how you go from your scripts on your local computer to the server. You need some tools to send the scripts, for instance transferring files with SCP, Syncthing, Dropbox or something similar. Or you need some sort of release process with git (see for example my git remote shiny server example in the links below). 104 | 105 | The choices here are all dependent on how many jobs you run and the flexibility you seek. If you only run a few scripts and they have a fixed time, then CRON is still a super useful tool. If you have several actions that depend on each other's output, then you need something else. See the [advanced scheduling page](advanced_scheduling.md). 106 | 107 | **Where are the costs?**: The running costs of a server. This varies a lot, but is generally cheapest if you run it locally, while in cloud environments you pay for size. Also don't forget the costs in time of keeping the servers updated. 108 | 109 | **How easy is it to set up and use. and how easy can you transfer your work to your coworker**: Hard, and very hard to transfer to your coworker unless you script everything. 110 | 111 | **Can you manage your entire configuration in code?**: Yes, but in practice no. There are tools like Ansible, Chef and Puppet that can set up your computers for you. But you still need to move the files to the computer, make the files executable, set the permissions and set up CRON. 112 | 113 | **Is there logging, how easy is it see what exactly went wrong?**: As above: Cron does have logging. I'm not sure about Task Scheduler, but there must be some. Finding out where the logs live is sometimes difficult.
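Because plain cron neither retries a failed job nor tells you about it, one option is to let cron call a small retry wrapper. The following is a rough sketch under assumptions: the function name `retry`, its argument order and the paths in the comments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to max_attempts times,
# sleeping pause_seconds between failed attempts.
retry() {
    max_attempts="$1"
    pause_seconds="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Failures go to stderr, so they end up wherever cron sends job output.
        echo "attempt $attempt of $max_attempts failed: $*" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$pause_seconds"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Assumed usage at the end of a script such as /home/you/bin/retry_job.sh:
#   retry 3 60 Rscript /home/you/project/script.R
# Assumed crontab entry, every day at 09:00:
#   0 9 * * * /home/you/bin/retry_job.sh
```

If the machine has a working mail setup, cron normally mails the output of a job to the crontab owner (or to the address in `MAILTO`), so the stderr messages above can double as a basic failure notice. Whether that works on your server is something to verify rather than assume.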
114 | 115 | **How precise is it and will it auto recover on failure?**: It is CRON or one of the [advanced options](advanced_scheduling.md). So it depends very much on your setup. But in the simple case, with CRON there will be no notification of failure and no retries. 116 | 117 | **how do you have to deal with secrets? can they leak?**: Cloud connected servers with open ports are constantly pummeled by adversaries who want to take over your server and steal secrets or run cryptominers on it. They are not evil, but the opportunity is cheap. Servers need to be patched and firewalled. They can become compromised, and then your secrets will be leaked. Devices that run on your own network and that have no direct open ports to the internet are generally better off. So a Raspberry Pi or laptop on your network that sometimes calls an API is less at risk than a server that has a shiny server running that the entire internet can access. 118 | 119 | **in what country does it run**: Your own, or one of your choice, depending on the cloud provider. 120 | 121 | **links**: 122 | 123 | - I'm not going to list all the VM (virtual machine) / VPS (virtual private server) services that are available but there are a lot of options. You can look at for instance: DigitalOcean, Google Cloud Platform, Amazon Web Services, and Microsoft's Azure cloud. 124 | - [help in scheduling with CRON use crontab.guru](https://crontab.guru) 125 | - git remote example of release on server ([blogpost on how I used this on shiny server](https://blog.rmhogervorst.nl/blog/2018/02/06/setting-up-a-version-controlled-shiny-server/)) 126 | 127 | 128 | # Ephemeral 129 | 130 | [*back to top*](#scheduling_r_scripts) 131 | 132 | With ephemeral I mean the infrastructure does not really exist for you. You only care that the job is executed and you only get billed for infrastructure used. This is sometimes called serverless, even though it runs on servers.
133 | 134 | For most of these services you use a third party to run your stuff for you. That means configuration (a lot for AWS, Azure and GCP; less for heroku) and registration. 135 | 136 | ## Serverless or Function as a Service (FAAS) 137 | 138 | These are 'functions' things that run when you want them to run. Usually in response to a request, but they can be triggered by different things like a CRON type of message. 139 | 140 | It runs on cloud providers like Amazon (AWS lambda) , Google (GCP Cloud Functions ) or Microsoft (Azure Functions) or anywhere with an open source framework like openfaas. 141 | 142 | **Where are the costs?**: With FAAS you pay for execution, you get a bit for free but pay for use. The advantages are that very fast scripts cost next to nothing and you don't pay for not using it (unlike a server where you pay whether you used it or not). It is also massively parallel, you can run thousands of copies of the 'function' (you would pay for all the executions though) because they are all independent. Long running processes are expensive though. 143 | 144 | **How easy is it to set up and use. and how easy can you transfer your work to your coworker**: Most of the cloud operators have configuration possible with files and APIs, but for your first setup you would probably do it by hand. If you have something running it pays to make this all 'configuration in code', that way you can make variants and hand them to your coworkers. 145 | 146 | **Can you manage your entire configuration in code?**: Yes 147 | 148 | **Is there logging, how easy is it see what exactly went wrong?**: Yes there is logging. For GCP for example all logs are centralized in [Cloud Logging](https://cloud.google.com/logging). There you can see logs for the R script and the service running the R script. Whats more is that the logs can be used to trigger events, which can be sent on to activate trigger based workflows. 
149 | 150 | **How precise is it and will it auto recover on failure?**: FAAS often have things like a cold start and a hot start. They are superfast in hot start (milliseconds sometimes) but if you haven't used the function for a while it goes into storage and triggering it will give it a cold start and that might take 5-10 times longer. Triggering a cold function for your batch job at 0900h will maybe lose you some time but realistically it starts within a minute and so how much you care about this is up to you and your application. 151 | 152 | **how do you have to deal with secrets? can they leak?** The major cloud providers have their own secrets stores where you can retrieve your keys from. In general these stores are well protected. You could of course lose secrets when you add them to the script or when you broadcast the secrets from your script: e.g.: `cat(paste0("my secret is:", Sys.getenv("secret")))` 153 | 154 | **in what country does it run**: You can set the country/region of your choice. The three cloud providers are worldwide distributed so you are probably fine. There are even specific data centres for government work (Azure: US) or special regulated areas (Azure: China). 155 | 156 | **links**: 157 | 158 | - See the topics below for specific cloud vendors [GCP](#gcp), [Azure](#azure-functions), [AWS](#aws) 159 | - [openfaas (Function as a Service) (I don't have an R tutorial yet \#TODO )](https://www.openfaas.com/) 160 | 161 | 162 | 163 | 164 | ### GCP 165 | 166 | #### Google Cloud Build 167 | 168 | [*back to top*](#scheduling_r_scripts) 169 | 170 | Other cloud services have similar services, in the CI/CD field. All use APIs which can be triggered via cron to batch schedule scripts, with varying degrees on integration between the respective cloud services. This is a bit of detail about the Google offering, Cloud Build which is facilitated by [googleCloudRunner](https://code.markedmondson.me/googleCloudRunner/). 
It uses a yaml format to coordinate Docker containers that run when triggered by git events from platforms such as GitHub, Bitbucket or Google's own git platform Source Repositories. 171 | 172 | **Where are the costs?**: You pay for CPU running time with a monthly free tier that usually means casual use is free. 173 | 174 | **How easy is it to set up and use. and how easy can you transfer your work to your coworker**: Once you are past the authentication steps it's simple to set up, either in the Web UI or via the R package googleCloudRunner in R code. The package includes an RStudio gadget which you can point at your R script, in-line R code, or a script hosted on Google Cloud Storage. As it's using a neutral yaml format running Docker images you can give that to co-workers and they can run the exact same job without knowing R, or give them the R script that generated that yaml. 175 | 176 | **Can you manage your entire configuration in code?**: Yes, R code or yaml. 177 | 178 | **Is there logging, how easy is it see what exactly went wrong?**: Yes, in Cloud Logging. 179 | 180 | **How precise is it and will it auto recover on failure?**: The scheduler runs as specified via cron syntax. You can set up auto-retries via extra scripting. 181 | 182 | **how do you have to deal with secrets? can they leak?**: Google Secret Manager handles secrets, and googleCloudRunner has a templated build step for it, `cr_buildstep_secret()`. 183 | 184 | **in what country does it run**: Global cloud regions in US, EU and elsewhere. 185 | 186 | **links**: 187 | 188 | - ['Run R code on a schedule' use case](https://code.markedmondson.me/googleCloudRunner/articles/usecases.html#run-r-code-on-a-schedule-1) on `googleCloudRunner` website.
189 | - R on GCP Cloud Run (HTTP Containers as a service), Cloud Build (Batch jobs within Docker containers) and Cloud Scheduler (CRON in the cloud) - ([package 'googleCloudRunner'](https://code.markedmondson.me/googleCloudRunner/)) 190 | 191 | ### Azure Functions 192 | **Where are the costs?**: You pay for CPU running time with a monthly free tier that usually means casual use is free. 193 | 194 | **How easy is it to set up and use. and how easy can you transfer your work to your coworker**: There are several examples that you can follow for the most basic use-case: responding to http request. Other triggers are not yet available for R-users. With some grit you will be able to succeed. 195 | 196 | **how easy it is to change things, the script, changing secrets or frequency?**: You can set up a CI/CD pipeline where your work lives in repository, every new commit is tested, added to a docker container and pushed to a docker-repository that automatically triggers renewal in Azure. So your function is always up to date. 197 | 198 | **Can you manage your entire configuration in code?**: Yes, in yaml files. 199 | 200 | **Is there logging, how easy is it see what exactly went wrong?**: Local testing is quite doable. remote logging needs to be enabled through the cmdline or through the web portal. 201 | 202 | **How precise is it and will it auto recover on failure?**: The scheduler runs as specified via cron syntax. You can set up auto-retries via extra scripting. 203 | **how do you have to deal with secrets? can they leak?**: Azure key vault or function app configuration keep the secrets. 204 | 205 | **in what country does it run**: Global cloud regions in US, EU and elsewhere. 206 | 207 | **links**: 208 | - [great practical example of R on Azure Functions in a custom runtime by RevoDavid](https://blog.revolutionanalytics.com/2020/12/azure-functions-with-r.html) super useful! [(Repo here)](https://github.com/revodavid/R-custom-handler). 
- [more bare bones version of R on azure functions (using {httpuv} only, not plumber)](https://docs.microsoft.com/en-us/azure/azure-functions/functions-create-function-linux-custom-image?pivots=programming-language-other&tabs=bash%2Cportal)
- [Overview of serverless R examples, specifically Azure](https://github.com/RMHogervorst/rscript_serverless/tree/main/Azure)
- Older version of [Azure functions with R](https://github.com/ktaranov/azure-function-r)

### AWS Lambda

**Where are the costs?**: You pay for CPU running time with a monthly free tier that usually means casual use is free.

**How easy is it to set up and use. and how easy can you transfer your work to your coworker**:

**how easy it is to change things, the script, changing secrets or frequency?**:

**Can you manage your entire configuration in code?**:

**Is there logging, how easy is it see what exactly went wrong?**:

**How precise is it and will it auto recover on failure?**:

**how do you have to deal with secrets? can they leak?**:

**in what country does it run**: Global cloud regions in US, EU and elsewhere.

**links**:

- R on AWS Lambda ([Medium post](https://medium.com/bakdata/running-r-on-aws-lambda-9d40643551a6) & [AWS lambda R runtime](https://github.com/bakdata/aws-lambda-r-runtime))

## Serverless integrated with version control

Most of the following use Docker or something similar in the background but you usually do not have to care about it.

### GitLab

[*back to top*](#scheduling_r_scripts)

GitLab introduced 'runners' years ago. There is a huge collection of runners available. These are Docker containers that you can use. The syntax seems somewhat easier than GitHub's.

**Where are the costs?**: 2000 CI/CD minutes a month for free over all your projects. You can buy 1000 additional CI/CD minutes for 10 dollars.
If you self-host GitLab and the runners, then you pay for the servers, network etc. yourself and there is no extra charge for CI/CD. If I run this daily and every run indeed takes 10 minutes, as they do now, I do not have enough free minutes to keep running continuously; I should get my runs under 5.4 minutes.

**How easy is it to set up and use. and how easy can you transfer your work to your coworker**: I found a few examples, and if you write the file in the GitLab editor in the browser it really helps you while you type. It is not really hard, but not easy either. You can just hand over the configuration to a coworker and it will work for them too.

**Can you manage your entire configuration in code?**: Yes.

**Is there logging, how easy is it see what exactly went wrong?**: Yes, there is extensive logging. You can see the output of the container, with your commands highlighted. So you can see that it failed because you did not install `libssl-dev`, for instance.

**How precise is it and will it auto recover on failure?**: When a 'pipeline' / job fails you get a notification, but there is no automatic retry.

**how do you have to deal with secrets? can they leak?**: Under Settings > 'CI/CD' you can add env variables that are accessible in the GitLab script.

**in what country does it run**: That depends on whether you use an on-premise GitLab instance or the public version. I cannot find where the public version lives.

**links**:

- a blogpost describing how to use R on [gitlab with docker containers for package building](https://blog.methodsconsultants.com/posts/developing-r-packages-with-usethis-and-gitlab-ci-part-ii/)

### GitHub

[*back to top*](#scheduling_r_scripts)

GitHub Actions is not really meant for scheduling scripts, but it does support it. You can set up an action (see blogpost link at the bottom) to schedule a run using the CRON syntax.
GitHub uses UTC for these schedules.

**Where are the costs?**: GitHub Actions is free for 2000 actions minutes/month over all your projects. If I run this daily and every run indeed takes 8 minutes, as they do now, I can run 250 actions a month, which is enough for my use case.

**How easy is it to set up and use. and how easy can you transfer your work to your coworker**: There are more and more examples, but the setup was not super easy because the steps are slow.

**Can you manage your entire configuration in code?**: Yes, it is the only way. You create a yaml file with all the configuration.

**Is there logging, how easy is it see what exactly went wrong?**: There are logs in GitHub Actions, visible to everyone with access to the repo. But the feedback loop is a bit slow if you ask me.

**How precise is it and will it auto recover on failure?**: You get an email when the action fails but there is no auto retry.

**how do you have to deal with secrets? can they leak?**: Similar to cloud services, there is a way to store them as variables that are only accessible to the application and you. Keep in mind that everyone with write access to your repo can use the secrets in a workflow, and so could expose them.

**in what country does it run**: I actually don't know \#TODO

**links**:

- I created a blog post about github actions [here] and the github actions code lives [here](https://github.com/RMHogervorst/invertedushape/blob/main/.github/workflows/main.yml)
- [examples of R-specific github actions are collected here](https://github.com/r-lib/actions)
- there is a [book](https://ropenscilabs.github.io/actions_sandbox/) of github actions for R online.

### Heroku

[*back to top*](#scheduling_r_scripts)

Heroku is not really a version control system. But they did create something that is tightly connected to your git repository. I also did not want to create a new category for Heroku only.
Heroku is slightly more expensive than other 'cloud' providers but for that money they take over a lot of work for you.

**Where are the costs?**: You pay for addons and for running the service.

**How easy is it to set up and use. and how easy can you transfer your work to your coworker**: It is relatively easy to set up but there are some manual steps that you have to record somewhere.

**Can you manage your entire configuration in code?**: Almost (the scheduling cannot yet be configured in code).

**Is there logging, how easy is it see what exactly went wrong?**: There are no logs, for the cheap version I'm using at least.

**How precise is it and will it auto recover on failure?**: It runs around the time of the schedule, not exactly. It fails silently. But if you run it from the command line you do see output.

**how do you have to deal with secrets? can they leak?**: Similar to cloud services, there is a way to store them as variables that are only accessible to the application and you.

**in what country does it run**: By default in the USA; it is possible to run in Europe. Maybe other places too?

**links**:

- I wrote a blogpost about [running R on heroku](https://blog.rmhogervorst.nl/blog/2018/12/06/running-an-r-script-on-heroku/), and [an update from 2020](https://blog.rmhogervorst.nl/blog/2020/09/21/running-an-r-script-on-a-schedule-heroku/).

# Advice for useRs

[*back to top*](#scheduling_r_scripts)

So you want to run a script on a schedule. After this entire document, I understand that you are a bit confused. I suggest we think this through in two steps. The first step is to make sure your script is as **portable as possible**: making sure all the required things (script, secrets, package versions) are together. The second step is making some decisions.

## Making your script more portable

1. Make sure your script runs without interaction locally: from your R session try `source("name_of_your_script.R")`, and from a terminal: `Rscript name_of_your_script.R`
2. Remove all the secrets from the script and replace them with `Sys.getenv("secretname")` calls. Save the secrets in a `.Renviron` file next to the script in the same project, and let git ignore that file! Uploading that file to GitHub is the same as giving someone your secret keys.
3. Run the script again to see if it works.
4. (not required, but I really recommend it) use renv to capture the required packages

## Decisions:

- Is it one script and does it run during your working hours? The easiest way is to schedule it on your laptop; see the [advice here](#own-computer). Keep in mind that it only works if your computer is awake.
- Only a few scripts, need to run without your computer and no more than a few times per day? Go for one of the [ephemeral services, through github/gitlab](#serverless-integrated-with-version-control)
- If you have tens of scripts you might want to go for your [own server](#own-a-server). Use a spare (low power) computer locally (this is super cheap over time) or rent a computer from one of the cloud services (starts at 5 dollars/euros a month).
- If your company/place of work has many scheduled data actions (daily aggregations, summaries, etc.), it might be time for a true scheduler. Try one of the examples in [advanced scheduling](advanced_scheduling.md). Hopefully you will have someone to help you with scheduling your scripts.

Summarizing: you can set up local jobs, run it on a server, run it in the cloud, or schedule it through some ephemeral cloud services. Whatever you use, the first step is to make your script more portable. And let me know how it goes.

# Other issues, questions and solutions

- Difficulty in finding the correct directory for the process?
Use something like the [here](https://cran.r-project.org/package=here) or [rprojroot](https://cran.r-project.org/package=rprojroot) package.
- [Mark Edmondson has a great overview of scheduling R scripts on Google Cloud platform here](http://code.markedmondson.me/4-ways-schedule-r-scripts-on-google-cloud-platform/)

# Typos, mistakes etc

[*back to top*](#scheduling_r_scripts)

Yes please open an issue or pull request to fix mistakes! For additions I would like an issue first to determine if they are within scope.

You spelled CRAN wrong! Distinguish between CRAN, the Comprehensive R Archive Network, and CRON (the name comes from chronos, the Greek word for time; it is the tool on unix systems that you can use to schedule things).

# Reuse / licencing of this work

This text is licensed as CC BY 4.0 (Creative Commons Attribution 4.0 International). You are free to copy and redistribute the material in any medium or format, and to adapt, remix, transform and build upon it, even commercially. Just give me credit. See the LICENCE file for more info.

--------------------------------------------------------------------------------
/advanced_scheduling.md:
--------------------------------------------------------------------------------

# Advanced scheduling

[back to readme](README.md)

## Upgrading from CRON

You can run CRON on most unix systems (linux, macos, etc). And CRON is super useful and reliable. But the logs are sometimes difficult to find, CRON will not warn you that a job failed, and if something depends on another thing you have to check yourself.
If you have several tasks that depend on each other:

* and they are all R scripts: use [DRAKE](#drake),
* if you have large data transformation steps, different databases to use, and tasks in Python and SQL too, you might want to look into [Airflow](#airflow) or [Luigi](#luigi); these are Python-based job schedulers for data transformations.

## DRAKE

[drake docs](https://docs.ropensci.org/drake/), [drake user manual](https://books.ropensci.org/drake/), [drake description on CRAN](https://cloud.r-project.org/web/packages/drake/index.html)

From the docs:

> Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?

Drake analyses your steps, notices what is done and what is not, and even visualises your steps. It also keeps logs of your work. It may also be able to run on schedules, although I do not know for sure. #helpwanted #TODO

There is now an alternative to DRAKE by the same makers: [TARGETS](https://wlandau.github.io/targets/)

## Airflow

[Apache Airflow](https://airflow.apache.org/) is:

> created by the community to programmatically author, schedule and monitor workflows.

It is written in Python and you define tasks in Python too. It has monitoring, and a UI that helps you with your tasks. It is also integrated with many, many databases and platforms. Running R scripts is somewhat difficult. You have to use a system call to make it work, which leads to the same issues you have with CRON (from where should it start, and as what user?). Another option is to use a Docker operator with the R script in it.
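
As a rough illustration of the system-call route, the sketch below shows how a Python task might start an R script with an explicit working directory, which is one way to settle the "where does it start from" question. It uses only the Python standard library and no Airflow imports. The function names `build_rscript_command` and `run_r_script` and any paths passed to them are hypothetical, and it assumes `Rscript` is on the PATH of whichever user runs the scheduler.

```python
import subprocess
from pathlib import Path


def build_rscript_command(script_name):
    """Return the command list that runs an R script non-interactively."""
    # Hypothetical helper: assumes the Rscript executable is on the PATH.
    return ["Rscript", script_name]


def run_r_script(project_dir, script_name):
    """Run an R script from inside its project directory and return its stdout."""
    result = subprocess.run(
        build_rscript_command(script_name),
        cwd=Path(project_dir),  # start in the project so relative paths resolve
        capture_output=True,    # keep stdout/stderr so they can be logged
        text=True,              # decode output as text instead of bytes
        check=True,             # raise CalledProcessError if R exits non-zero
    )
    return result.stdout
```

A task would then call something like `run_r_script("/path/to/project", "name_of_your_script.R")`. Because of `check=True`, a failing R script raises an exception instead of failing silently, which a scheduler can pick up as a failed task.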

## Luigi

https://luigi.readthedocs.io/en/stable/index.html

--------------------------------------------------------------------------------