/limits
90 | ```
91 |
92 | `Max open files` should reflect the value above.
93 |
94 | ## App Changes
95 |
96 | Be sure to disable prepared statements, as they will not work with PgBouncer in transaction mode.
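If your app is Rails, for example, this is a one-line change in `config/database.yml` (a sketch; adjust for your setup):

```yml
production:
  adapter: postgresql
  prepared_statements: false
```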
97 |
98 | ## Statement Timeouts
99 |
100 | To use a [statement timeout](https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT), run:
101 |
102 | ```sql
103 | ALTER ROLE USERNAME1 SET statement_timeout = 5000;
104 | ```
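The value is in milliseconds, and `ALTER ROLE ... SET` only applies to new sessions. After reconnecting, you can confirm it took effect:

```sql
SHOW statement_timeout;
```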
105 |
106 | ## Congrats
107 |
108 | You’ve successfully set up PgBouncer.
109 |
--------------------------------------------------------------------------------
/archive/csp-rails.md:
--------------------------------------------------------------------------------
1 | # Adding CSP to Rails
2 |
3 | Content Security Policy can be an effective way to prevent XSS attacks. If you aren’t familiar, here’s a [great intro](https://www.html5rocks.com/en/tutorials/security/content-security-policy/).
4 |
5 | To get started with Rails, first add the header to all requests in your `ApplicationController`. We want to start by blocking content in development so we notice it, but only report it in production so nothing breaks.
6 |
7 | ```ruby
8 | before_action :set_csp
9 |
10 | # use constants and freeze for performance
11 | CSP_HEADER_NAME = (Rails.env.production? ? "Content-Security-Policy-Report-Only" : "Content-Security-Policy").freeze
12 | CSP_HEADER_VALUE = "default-src *; report-uri /csp_reports?report_only=#{CSP_HEADER_NAME.include?("Report-Only")}".freeze
13 |
14 | def set_csp
15 | response.headers[CSP_HEADER_NAME] = CSP_HEADER_VALUE
16 | end
17 | ```
18 |
19 | ## Reports
20 |
21 | Create a model to track reports.
22 |
23 | ```sh
24 | rails g model CspReport
25 | ```
26 |
27 | And in the migration, do:
28 |
29 | ```ruby
30 | class CreateCspReports < ActiveRecord::Migration
31 | def change
32 | create_table :csp_reports do |t|
33 | t.text :document_uri
34 | t.text :referrer
35 | t.text :violated_directive
36 | t.text :effective_directive
37 | t.text :original_policy
38 | t.text :blocked_uri
39 | t.integer :status_code
40 | t.text :user_agent
41 | t.boolean :report_only
42 | t.timestamp :created_at
43 | end
44 | end
45 | end
46 | ```
47 |
48 | Add a controller to create the reports.
49 |
50 | ```ruby
51 | class CspReportsController < ApplicationController
52 | skip_before_action :verify_authenticity_token
53 |
54 | def create
55 | report = JSON.parse(request.body.read)["csp-report"]
56 | CspReport.create!(
57 | document_uri: report["document-uri"],
58 | referrer: report["referrer"],
59 | violated_directive: report["violated-directive"],
60 | effective_directive: report["effective-directive"],
61 | original_policy: report["original-policy"],
62 | blocked_uri: report["blocked-uri"],
63 | status_code: report["status-code"],
64 | user_agent: request.user_agent,
65 | report_only: params[:report_only] == "true"
66 | )
67 | head :ok
68 | end
69 | end
70 | ```
71 |
72 | Don’t forget the route.
73 |
74 | ```ruby
75 | resources :csp_reports, only: [:create]
76 | ```
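Once reports start coming in, a quick way to see the most common violations (a sketch, assuming the `csp_reports` table from the migration above):

```sql
SELECT violated_directive, blocked_uri, COUNT(*) AS reports
FROM csp_reports
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY 1, 2
ORDER BY reports DESC
LIMIT 25;
```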
77 |
78 | ## Enforcing the Policy
79 |
80 | Once the reports stop, you’ll want to enforce the policy in production.
81 |
82 | ```ruby
83 | CSP_HEADER_NAME = "Content-Security-Policy".freeze
84 | ```
85 |
86 | ## Testing New Policies
87 |
88 | You can have both an enforced policy and a report-only policy, so use this to your advantage when changing policies. Make the new policy report-only for a bit before enforcing it.
89 |
90 | ```ruby
91 | before_action :set_csp_report_only
92 |
93 | # use constants and freeze for performance
94 | CSP_REPORT_ONLY_HEADER_NAME = "Content-Security-Policy-Report-Only".freeze
95 | CSP_REPORT_ONLY_HEADER_VALUE = "default-src https:; report-uri /csp_reports?report_only=true".freeze
96 |
97 | def set_csp_report_only
98 | response.headers[CSP_REPORT_ONLY_HEADER_NAME] = CSP_REPORT_ONLY_HEADER_VALUE
99 | end
100 | ```
101 |
--------------------------------------------------------------------------------
/archive/emotion-recognition-ruby.md:
--------------------------------------------------------------------------------
1 | # Emotion Recognition in Ruby
2 |
3 | Welcome to another installment of deep learning in Ruby. Today, we’ll look at [FER+](https://github.com/Microsoft/FERPlus), a deep convolutional neural network for emotion recognition developed at Microsoft. The project is open source, and there’s a pretrained model in the [ONNX Model Zoo](https://github.com/onnx/models/tree/master/vision/body_analysis/emotion_ferplus) that we can get running quickly in Ruby.
4 |
5 | First, download [the model](https://onnxzoo.blob.core.windows.net/models/opset_8/emotion_ferplus/emotion_ferplus.tar.gz) and this photo of a park ranger.
6 |
7 | 
8 |
9 |
10 | Photo from Yellowstone National Park
11 |
12 |
13 | We’ll use MiniMagick to prepare the image and the ONNX Runtime gem to run the model.
14 |
15 | ```ruby
16 | gem "mini_magick"
17 | gem "onnxruntime"
18 | ```
19 |
20 | For the image, we need to zoom in on her face, resize it to 64x64, and convert it to grayscale. Typically, we’d use a face detection model to find the bounding box and use that information to crop the image, but for simplicity, we’ll just do it manually.
21 |
22 | ```ruby
23 | img = MiniMagick::Image.open("ranger.jpg")
24 | img.crop "100x100+60+20", "-gravity", "center" # manual crop
25 | img.resize "64x64^", "-gravity", "center", "-extent", "64x64"
26 | img.colorspace "Gray"
27 | img.write "resized.jpg"
28 | ```
29 |
30 | Here’s a blown up version:
31 |
32 | 
33 |
34 | Finally, create a 64x64 matrix of the grayscale intensities.
35 |
36 | ```ruby
37 | # all channel values are the same for grayscale, so just take the first
38 | pixels = img.get_pixels.flat_map { |r| r.map(&:first) }
39 | input = OnnxRuntime::Utils.reshape(pixels, [1, 1, 64, 64])
40 | ```
41 |
42 | Now that the input is prepared, we can load and run the model.
43 |
44 | ```ruby
45 | model = OnnxRuntime::Model.new("model.onnx")
46 | output = model.predict("Input3" => input)
47 | ```
48 |
49 | We use [softmax](https://victorzhou.com/blog/softmax/) to convert the model output into probabilities.
50 |
51 | ```ruby
52 | def softmax(x)
53 | exp = x.map { |v| Math.exp(v - x.max) }
54 | exp.map { |v| v / exp.sum }
55 | end
56 |
57 | probabilities = softmax(output["Plus692_Output_0"].first)
58 | ```
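To see what softmax does, here’s a tiny standalone example with made-up inputs — the outputs are positive, sum to 1, and keep the ordering of the inputs:

```ruby
def softmax(x)
  exp = x.map { |v| Math.exp(v - x.max) }
  exp.map { |v| v / exp.sum }
end

probs = softmax([1.0, 2.0, 3.0])
# probs sums to 1.0 (within floating point error),
# and the largest input gets the largest probability
```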
59 |
60 | Then map the labels and sort by highest probability.
61 |
62 | ```ruby
63 | emotion_labels = [
64 | "neutral", "happiness", "surprise", "sadness",
65 | "anger", "disgust", "fear", "contempt"
66 | ]
67 | pp emotion_labels.zip(probabilities).sort_by { |_, v| -v }.to_h
68 | ```
69 |
70 | And the results are in:
71 |
72 | ```
73 | {
74 | "happiness" => 0.9999839207138284,
75 | "surprise" => 1.0569785479062501e-05,
76 | "neutral" => 4.826811128840592e-06,
77 | "anger" => 4.63037778140089e-07,
78 | "sadness" => 9.574742925740587e-08,
79 | "contempt" => 7.941520916580971e-08,
80 | "fear" => 2.8803367665891773e-08,
81 | "disgust" => 1.568577943664937e-08
82 | }
83 | ```
84 |
85 | There’s a 99.9% probability she looks happy in the photo. Not bad!
86 |
87 | Here’s the [complete code](https://gist.github.com/ankane/3bb4ddbf84edd7f05a24cd3697ccd9a7). Now go out and try it with your own images!
88 |
--------------------------------------------------------------------------------
/archive/pghero-2-0.md:
--------------------------------------------------------------------------------
1 | # PgHero 2.0 Has Arrived
2 |
3 | It’s been over 2 years since PgHero 1.0 was released as a performance dashboard for Postgres. Since then, a number of new features have been added.
4 |
5 | - checks for serious issues like transaction ID wraparound and integer overflow
6 | - the ability to capture and view query stats over time
7 | - suggested indexes to give you a better idea of how to optimize queries (check out [Dexter](https://ankane.org/introducing-dexter) for automatic indexing)
8 |
9 | PgHero 2.0 provides even more insight into your database performance with two additional features: query details and space stats.
10 |
11 | ## Query Details
12 |
13 | PgHero makes it easy to see the most time-consuming queries during a given time period, but it’s hard to follow an individual query’s performance over time. When you run into issues, it’s not always easy to uncover what happened. Are the top queries during an incident consistently the most time-consuming, or are they new? Did the number of calls increase or was it the average time?
14 |
15 | The new [Query Details page](https://pghero.dokkuapp.com/datakick/queries/588635171) helps solve this.
16 |
17 | 
18 |
19 | This page allows you to deep dive into an individual query. View charts of total time, average time, and calls over the past 24 hours to see how they’ve moved.
20 |
21 | For those who [annotate queries](https://ankane.org/the-origin-of-sql-queries), you’ve likely realized the comment in PgHero only tells you one of the places a query is coming from since similar queries are grouped together. Now, you can get a better idea of all the places it’s called.
22 |
23 | If you don’t annotate queries, you should!!
24 |
25 | This page also lists tables in the query and their indexes so you can quickly see if an index is missing, and an “Explain” button is usually available to help you debug (but may be missing if PgHero hasn’t captured an unnormalized version of the query recently).
26 |
27 | ## Space Stats
28 |
29 | PgHero 2.0 also helps you manage storage space. You can track the growth of tables and indexes over time and view this data on the [Space page](https://pghero.dokkuapp.com/datakick/space). To see the fastest growing relations, click on the “7d Growth” header.
30 |
31 | 
32 |
33 | In addition, this page now reports unused indexes to help reclaim space. If you use read replicas, be sure to check that indexes aren’t used on any of them before dropping.
34 |
35 | You can also view the growth for an individual table or index over the past 30 days.
36 |
37 | 
38 |
39 | Lastly, there’s syntax highlighting for all SQL for improved readability.
40 |
41 | 
42 |
43 | Much better :)
44 |
45 | So what are you waiting for? Get the [latest version](https://github.com/ankane/pghero) of PgHero today.
46 |
47 |
48 |
49 | Note: If you use PgHero outside the dashboard, there are some [breaking changes](https://github.com/ankane/pghero/blob/master/guides/Rails.md#200) from 1.x to be aware of.
50 |
--------------------------------------------------------------------------------
/archive/hybrid-cryptography-rails.md:
--------------------------------------------------------------------------------
1 | # Hybrid Cryptography on Rails
2 |
3 | 
4 |
5 | [Hybrid cryptography](https://en.wikipedia.org/wiki/Hybrid_cryptosystem) allows certain servers to encrypt data without the ability to decrypt it. This can greatly limit damage in the event of a breach.
6 |
7 | Suppose we have a service that sends text messages to customers. Customers enter their phone number through the website or mobile app.
8 |
9 | With hybrid cryptography, we can set up web servers to only encrypt phone numbers. Text messages can be sent through background jobs which run on a different set of servers - ones that can decrypt and don’t allow inbound traffic. If internal employees need to view phone numbers, they can use a separate set of web servers that are only accessible through the company VPN.
10 |
11 |  | Encrypt | Decrypt | Notes
12 | --- | --- | --- | ---
13 | Customer web servers | ✓ | |
14 | Background workers | ✓ | ✓ | No inbound traffic
15 | Internal web servers | ✓ | ✓ | Requires VPN
16 |
17 | ## Setup
18 |
19 | Install [Libsodium](https://github.com/crypto-rb/rbnacl/wiki/Installing-libsodium) and add [Lockbox](https://github.com/ankane/lockbox) and [RbNaCl](https://github.com/crypto-rb/rbnacl) to your Gemfile:
20 |
21 | ```ruby
22 | gem 'lockbox'
23 | gem 'rbnacl'
24 | ```
25 |
26 | Generate keys in the Rails console with:
27 |
28 | ```ruby
29 | Lockbox.generate_key_pair
30 | ```
31 |
32 | Store the keys with your other secrets. This is typically Rails credentials or an environment variable ([dotenv](https://github.com/bkeepers/dotenv) is great for this). Be sure to use different keys in development and production.
33 |
34 | ```sh
35 | PHONE_ENCRYPTION_KEY=...
36 | PHONE_DECRYPTION_KEY=...
37 | ```
38 |
39 | Only set the decryption key on servers that should be able to decrypt.
40 |
41 | ## Database Fields
42 |
43 | We’ll store phone numbers in an encrypted database field. Create a migration to add a new column for the encrypted data.
44 |
45 | ```ruby
46 | class AddEncryptedPhoneToUsers < ActiveRecord::Migration[5.2]
47 | def change
48 | add_column :users, :phone_ciphertext, :string
49 | end
50 | end
51 | ```
52 |
53 | In the model, add:
54 |
55 | ```ruby
56 | class User < ApplicationRecord
57 | encrypts :phone, algorithm: "hybrid", encryption_key: ENV["PHONE_ENCRYPTION_KEY"], decryption_key: ENV["PHONE_DECRYPTION_KEY"]
58 | end
59 | ```
60 |
61 | Set a user’s phone number to ensure it works.
62 |
63 | ## Files
64 |
65 | Suppose we also need to accept sensitive documents. We can take a similar approach with file uploads.
66 |
67 | For Active Storage, use:
68 |
69 | ```ruby
70 | class User < ApplicationRecord
71 | encrypts_attached :document, algorithm: "hybrid", encryption_key: ENV["PHONE_ENCRYPTION_KEY"], decryption_key: ENV["PHONE_DECRYPTION_KEY"]
72 | end
73 | ```
74 |
75 | For CarrierWave, use:
76 |
77 | ```ruby
78 | class DocumentUploader < CarrierWave::Uploader::Base
79 | encrypt algorithm: "hybrid", encryption_key: ENV["PHONE_ENCRYPTION_KEY"], decryption_key: ENV["PHONE_DECRYPTION_KEY"]
80 | end
81 | ```
82 |
83 | You can also encrypt an IO stream directly.
84 |
85 | ```ruby
86 | box = Lockbox.new(algorithm: "hybrid", encryption_key: ENV["PHONE_ENCRYPTION_KEY"], decryption_key: ENV["PHONE_DECRYPTION_KEY"])
87 | box.encrypt(params[:file])
88 | ```
89 |
90 | ## Conclusion
91 |
92 | You’ve now seen an approach for keeping your data safe in the event a server is compromised. For more on data protection, check out [Securing Sensitive Data in Rails](https://ankane.org/sensitive-data-rails).
93 |
--------------------------------------------------------------------------------
/archive/postgres-sslmode-explained.md:
--------------------------------------------------------------------------------
1 | # Postgres SSLMODE Explained
2 |
3 | When you connect to a database, Postgres uses the `sslmode` parameter to determine the security of the connection. There are many options, so here’s an analogy to web security:
4 |
5 | - `disable` is HTTP
6 | - `verify-full` is HTTPS
7 |
8 | All the other options fall somewhere in between, and by design, make fewer security guarantees than HTTPS in your browser does.
9 |
10 | 
11 |
12 | This includes the default `prefer`. The [Postgres docs](https://www.postgresql.org/docs/current/libpq-ssl.html) have a great table explaining this:
13 |
14 | 
15 |
16 | Other modes like `require` are still useful in protecting against passive attacks (sniffing), but are vulnerable to active attacks that can compromise your credentials. Tarjei Husøy created [postgres-mitm](https://thusoy.com/2016/mitming-postgres) to demonstrate this.
17 |
18 | ## Defense
19 |
20 | The best way to protect a database is to limit inbound traffic. Require a VPN or SSH tunneling through a [bastion host](https://medium.com/@bill_73959/understanding-bastions-hosts-6ccd457e41ac) to connect. This ensures connections are always secure, and even if database credentials are compromised, an attacker won’t be able to access the database.
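For example, SSH local port forwarding through a bastion (hostnames here are placeholders) lets clients connect to `localhost:5432` while the database itself accepts no direct traffic:

```sh
ssh -N -L 5432:db.internal:5432 user@bastion.example.com
```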
21 |
22 | If this is not feasible, always use `verify-full`. This includes from code, psql, SQL clients, and other tools like [pgsync](https://github.com/ankane/pgsync) and [pgslice](https://github.com/ankane/pgslice).
23 |
24 | You can specify `sslmode` in the connection URI:
25 |
26 | ```text
27 | postgresql://user:pass@host/dbname?sslmode=verify-full&sslrootcert=ca.pem
28 | ```
29 |
30 | Or use environment variables.
31 |
32 | ```sh
33 | PGSSLMODE=verify-full PGSSLROOTCERT=ca.pem
34 | ```
35 |
36 | Libraries for most programming languages have options as well.
37 |
38 | ```ruby
39 | PG.connect(sslmode: "verify-full", sslrootcert: "ca.pem")
40 | ```
41 |
42 | ## Certificates
43 |
44 | To verify an SSL/TLS certificate, the client checks it against a root certificate. Your browser ships with root certificates to verify HTTPS websites. Postgres doesn’t come with any root certificates, so to use `verify-full`, you must specify one.
45 |
46 | Here are root certificates for a number of providers:
47 |
48 | Provider | Certificate | Docs
49 | --- | --- | ---
50 | Amazon RDS | [Download](https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem) | [View](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.SSL)
51 | Google Cloud SQL | In Account | [View](https://cloud.google.com/sql/docs/postgres/connect-admin-ip)
52 | Digital Ocean | In Account | [View](https://www.digitalocean.com/docs/databases/how-to/clusters/secure-clusters/)
53 | Citus Data | [Download](https://console.citusdata.com/citus.crt) | [View](https://docs.citusdata.com/en/v8.0/cloud/security.html)
54 |
55 | There’s no way to use `verify-full` with Heroku Postgres, so use caution when connecting from networks you don’t fully trust. Instead of `heroku pg:psql`, use:
56 |
57 | ```sh
58 | heroku run psql \$DATABASE_URL
59 | ```
60 |
61 | This securely connects to a dyno before connecting to the database.
62 |
63 | If you use PgBouncer, [set up secure connections](https://ankane.org/securing-pgbouncer-amazon-rds) for it as well.
64 |
65 | ## Conclusion
66 |
67 | Hopefully this helps you understand connection security a bit better.
68 |
69 |
70 |
71 | Updates
72 |
73 | - August 2019: Added Digital Ocean
74 |
--------------------------------------------------------------------------------
/archive/ruby-ml-for-python-coders.md:
--------------------------------------------------------------------------------
1 | # Ruby ML for Python Coders
2 |
3 |
4 |
5 |
6 |
7 | Curious to try machine learning in Ruby? Here’s a short cheatsheet for Python coders.
8 |
9 | ## Data structure basics
10 |
11 | - [Numo: NumPy for Ruby](/numo)
12 | - [Daru: Pandas for Ruby](/daru)
13 |
14 | ## Libraries
15 |
16 | Category | Python | Ruby
17 | --- | --- | ---
18 | Multi-dimensional arrays | [NumPy](https://github.com/numpy/numpy) | [Numo](https://github.com/ruby-numo/numo-narray)
19 | Data frames | [Pandas](https://github.com/pandas-dev/pandas) | [Daru](https://github.com/SciRuby/daru)
20 | Visualization | [Matplotlib](https://github.com/matplotlib/matplotlib) | [Nyaplot](https://github.com/domitry/nyaplot)
21 | Predictive modeling | [Scikit-learn](https://github.com/scikit-learn/scikit-learn) | [Rumale](https://github.com/yoshoku/rumale)
22 | Gradient boosting | [XGBoost](https://github.com/dmlc/xgboost), [LightGBM](https://github.com/Microsoft/LightGBM) | [XGBoost](https://github.com/ankane/xgb), [LightGBM](https://github.com/ankane/lightgbm)
23 | Deep learning | [PyTorch](https://github.com/pytorch/pytorch), [TensorFlow](https://github.com/tensorflow/tensorflow) | [Torch-rb](https://github.com/ankane/torch-rb), [TensorFlow](https://github.com/ankane/tensorflow) (TensorFlow :construction:)
24 | Recommendations | [Surprise](https://github.com/NicolasHug/Surprise), [Implicit](https://github.com/benfred/implicit/) | [Disco](https://github.com/ankane/disco)
25 | Approximate nearest neighbors | [NGT](https://github.com/yahoojapan/NGT), [Annoy](https://github.com/spotify/annoy) | [NGT](https://github.com/ankane/ngt), [Hanny](https://github.com/yoshoku/hanny)
26 | Factorization machines | [xLearn](https://github.com/aksnzhy/xlearn) | [xLearn](https://github.com/ankane/xlearn)
27 | Natural language processing | [spaCy](https://github.com/explosion/spaCy), [NLTK](https://github.com/nltk/nltk) | [Many gems](https://github.com/arbox/nlp-with-ruby) (nothing comprehensive :cry:)
28 | Text classification | [fastText](https://github.com/facebookresearch/fastText) | [fastText](https://github.com/ankane/fasttext)
29 | Forecasting | [Prophet](https://github.com/facebook/prophet) | :cry:
30 | Optimization | [CVXPY](https://github.com/cvxgrp/cvxpy), [PuLP](https://github.com/coin-or/pulp), [SCS](https://github.com/cvxgrp/scs), [OSQP](https://github.com/oxfordcontrol/osqp) | [CBC](https://github.com/gverger/ruby-cbc), [SCS](https://github.com/ankane/scs), [OSQP](https://github.com/ankane/osqp)
31 | Reinforcement learning | [Vowpal Wabbit](https://github.com/VowpalWabbit/vowpal_wabbit) | [Vowpal Wabbit](https://github.com/ankane/vowpalwabbit)
32 | Scoring engine | [ONNX Runtime](https://github.com/Microsoft/onnxruntime) | [ONNX Runtime](https://github.com/ankane/onnxruntime), [Menoh](https://github.com/pfnet-research/menoh-ruby)
33 |
34 | This list is by no means comprehensive. Some Ruby libraries are ones I created, as mentioned [here](/new-ml-gems).
35 |
36 | If you’re planning to add Ruby support to your ML library:
37 |
38 | Category | Python | Ruby
39 | --- | --- | ---
40 | FFI (native) | [ctypes](https://docs.python.org/3/library/ctypes.html) | [Fiddle](https://ruby-doc.org/stdlib-2.7.0/libdoc/fiddle/rdoc/Fiddle.html)
41 | FFI (library) | [cffi](https://cffi.readthedocs.io/en/latest/) | [FFI](https://github.com/ffi/ffi)
42 | C++ extensions | [pybind11](https://github.com/pybind/pybind11) | [Rice](https://github.com/jasonroelofs/rice)
43 | Compile to C | [Cython](https://github.com/cython/cython) | [Rubex](https://github.com/SciRuby/rubex)
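As a taste of the Ruby side, here’s a minimal Fiddle sketch that calls `floor` from the C math library already loaded into the process:

```ruby
require "fiddle"

# look up floor() among symbols already loaded into the process
floor = Fiddle::Function.new(
  Fiddle::Handle::DEFAULT["floor"],
  [Fiddle::TYPE_DOUBLE],
  Fiddle::TYPE_DOUBLE
)

floor.call(2.7) # => 2.0
```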
44 |
45 | Give Ruby a shot for your next machine learning project!
46 |
--------------------------------------------------------------------------------
/archive/scaling-reads.md:
--------------------------------------------------------------------------------
1 | # Scaling Reads
2 |
3 | **Note:** This approach is now packaged into [a gem](https://github.com/ankane/distribute_reads) :gem:
4 |
5 | ---
6 |
7 | One of the easier ways to scale your database is to distribute reads to replicas.
8 |
9 | ## Desire
10 |
11 | Here’s the desired behavior:
12 |
13 | ```ruby
14 | User.find(1) # primary
15 |
16 | distribute_reads do
17 | # use replica for reads
18 | User.maximum(:visits_count) # replica
19 | User.find(2) # replica
20 |
21 | # until a write
22 | # then switch to primary
23 | User.create! # primary
24 | User.last # primary
25 | end
26 | ```
27 |
28 | ## Contenders
29 |
30 | We looked at a number of libraries, including [Octopus](https://github.com/tchandy/octopus), [Octoshark](https://github.com/dalibor/octoshark), and [Replica Pools](https://github.com/kickstarter/replica_pools).
31 |
32 | The winner was [Makara](https://github.com/taskrabbit/makara) - it handles failover well and has a simple configuration.
33 |
34 | ## Getting Started
35 |
36 | First, install Makara.
37 |
38 | ```ruby
39 | gem 'makara'
40 | ```
41 |
42 | There are 3 important `ENV` variables in our setup.
43 |
44 | - `DATABASE_URL` - primary database
45 | - `REPLICA_DATABASE_URL` - replica database (can use the primary database in development)
46 | - `MAKARA` - feature flag for a smooth rollout
47 |
48 | Here are sample values:
49 |
50 | ```sh
51 | DATABASE_URL=postgres://nerd:secret@localhost:5432/db_development
52 | REPLICA_DATABASE_URL=postgres://nerd:secret@localhost:5432/db_development
53 | MAKARA=true
54 | ```
55 |
56 | Next, update `config/database.yml`.
57 |
58 | ```yml
59 | development: &default
60 | <% if ENV["MAKARA"] %>
61 | url: postgresql-makara:///
62 | makara:
63 | sticky: true
64 | connections:
65 | - role: master
66 | name: primary
67 | url: <%= ENV["DATABASE_URL"] %>
68 | - name: replica
69 | url: <%= ENV["REPLICA_DATABASE_URL"] %>
70 | <% else %>
71 | adapter: postgresql
72 | url: <%= ENV["DATABASE_URL"] %>
73 | <% end %>
74 |
75 | production:
76 | <<: *default
77 | ```
78 |
79 | We don’t use the middleware, so we remove it by adding to `config/application.rb`:
80 |
81 | ```ruby
82 | config.middleware.delete Makara::Middleware
83 | ```
84 |
85 | Also, we want to read from primary by default so have to patch Makara. Create an initializer `config/initializers/makara.rb` with:
86 |
87 | ```ruby
88 | Makara::Cache.store = :noop
89 |
90 | module DefaultToPrimary
91 | def _appropriate_pool(*args)
92 | return @master_pool unless Thread.current[:distribute_reads]
93 | super
94 | end
95 | end
96 |
97 | Makara::Proxy.send :prepend, DefaultToPrimary
98 |
99 | module DistributeReads
100 | def distribute_reads
101 | previous_value = Thread.current[:distribute_reads]
102 | begin
103 | Thread.current[:distribute_reads] = true
104 | Makara::Context.set_current(Makara::Context.generate)
105 | yield
106 | ensure
107 | Thread.current[:distribute_reads] = previous_value
108 | end
109 | end
110 | end
111 |
112 | Object.send :include, DistributeReads
113 | ```
114 |
115 | To distribute reads, use:
116 |
117 | ```ruby
118 | total_users = distribute_reads { User.count }
119 | ```
120 |
121 | You can also put multiple lines in a block.
122 |
123 | ```ruby
124 | distribute_reads do
125 | User.maximum(:visits_count)
126 | Order.sum(:revenue_cents)
127 | Visit.average(:duration)
128 | end
129 | ```
130 |
131 | ## Test Drive
132 |
133 | In the Rails console, run:
134 |
135 | ```ruby
136 | User.first # primary
137 | distribute_reads { User.last } # replica
138 | ```
139 |
140 | :heart: Happy scaling
141 |
--------------------------------------------------------------------------------
/archive/encryption-keys.md:
--------------------------------------------------------------------------------
1 | # Strong Encryption Keys for Rails
2 |
3 | 
4 |
5 | Encryption is a common way to protect sensitive data. Generating a secure key is an important part of the process.
6 |
7 | [attr_encrypted](https://github.com/attr-encrypted/attr_encrypted), the popular encryption library for Rails, uses AES-256-GCM by default, which takes a 256-bit key. So how can we generate a secure one?
8 |
9 | *If you’re in a hurry, feel free to skip to [the answer](#a-better-way).*
10 |
11 | ## Take 1
12 |
13 | One way to generate a key is:
14 |
15 | ```ruby
16 | SecureRandom.base64(32).first(32)
17 | ```
18 |
19 | This generates a 32 character string that looks pretty secure. Each character has 64 possible values (letters, numbers, / and +). However, a single byte can represent 256 possible values. We’ve eliminated 75% of possible values per byte, which compounds across all 32 bytes. Here’s the math:
20 |
21 | Method | Possible Keys | Equivalent
22 | --- | --- | ---
23 | Random | 256<sup>32</sup> | 2<sup>256</sup>
24 | Take 1 | 64<sup>32</sup> | 2<sup>192</sup>
25 |
26 | This reduces the number of possible keys by 99.999999999999999994%. Luckily, computers have not (yet) been able to brute force 128-bit keys, which have 2<sup>128</sup> possible values.
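You can check this arithmetic in Ruby — taking base-2 logs shows the effective strength of each method in bits:

```ruby
# 32 random bytes: 256 possible values per byte
Math.log2(256**32) # => 256.0

# 32 base64 characters: only 64 possible values per character
Math.log2(64**32)  # => 192.0
```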
27 |
28 | ## Why 256?
29 |
30 | So why do we use 256-bit keys to begin with? Security researcher Graham Sutherland [puts it well](https://security.stackexchange.com/questions/14068/why-most-people-use-256-bit-encryption-instead-of-128-bit):
31 |
32 | “Essentially it’s about security margin. The longer the key, the higher the effective security. If there is ever a break in AES that reduces the effective number of operations required to crack it, a bigger key gives you a better chance of staying secure.”
33 |
34 | Also, quantum computers are expected to brute force in [square root time](https://blog.agilebits.com/2013/03/09/guess-why-were-moving-to-256-bit-aes-keys/). This means a 256-bit key could be brute forced in the same time as traditional computers can brute force a 128-bit key.
35 |
36 | ## A Better Way
37 |
38 | The right way to generate a random 32-byte key is:
39 |
40 | ```ruby
41 | SecureRandom.random_bytes(32)
42 | ```
43 |
44 | However, we can’t store this directly in Rails credentials or as an environment variable. We need to encode it first. Hex is a popular encoding. Rails uses this for its master key in Rails 5.2.
45 |
46 | ```ruby
47 | SecureRandom.random_bytes(32).unpack("H*").first
48 | ```
49 |
50 | Ruby provides a helper to do this:
51 |
52 | ```ruby
53 | SecureRandom.hex(32)
54 | ```
55 |
56 | To decode the key, use:
57 |
58 | ```ruby
59 | [hex_key].pack("H*")
60 | ```
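A quick round trip shows the encode/decode pair at work — 64 hex characters become 32 raw bytes, with binary encoding:

```ruby
require "securerandom"

hex_key = SecureRandom.hex(32)
raw_key = [hex_key].pack("H*")

hex_key.length                        # => 64
raw_key.bytesize                      # => 32
raw_key.encoding                      # => #<Encoding:ASCII-8BIT>
raw_key.unpack("H*").first == hex_key # => true
```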
61 |
62 | We now have a much stronger key. If you store the key as an environment variable, your model should look something like:
63 |
64 | ```ruby
65 | class User < ApplicationRecord
66 | attr_encrypted :email, key: [ENV["EMAIL_ENCRYPTION_KEY"]].pack("H*")
67 | end
68 | ```
69 |
70 | ## Libraries
71 |
72 | Libraries should educate users on how to generate sufficiently random keys. The [rbnacl](https://github.com/crypto-rb/rbnacl) gem has a neat way of enforcing this - it checks if a string is binary before allowing it as a key.
73 |
74 | ```ruby
75 | if key.encoding != Encoding::BINARY
76 | raise ArgumentError, "Insecure key - key must use binary encoding"
77 | end
78 | ```
79 |
80 | This prevents our initial (flawed) method from working. I’ve incorporated this approach into the [blind_index](https://github.com/ankane/blind_index) gem and [opened an issue](https://github.com/attr-encrypted/attr_encrypted/issues/311) with attr_encrypted to get the author’s thoughts.
81 |
82 | ## Conclusion
83 |
84 | While secure key generation provides better protection against brute force attacks, it won’t help at all if the key is compromised. Limit who has access to encryption keys as well. For more security, consider a [key management service](https://github.com/ankane/kms_encrypted) to manage your keys.
85 |
86 | Happy encrypting!
87 |
--------------------------------------------------------------------------------
/archive/tensorflow-ruby.md:
--------------------------------------------------------------------------------
1 | # TensorFlow Object Detection in Ruby
2 |
3 | The [ONNX Runtime](https://github.com/ankane/onnxruntime) gem makes it easy to run TensorFlow models in Ruby. This short tutorial will show you how. It’s based on [this tutorial](https://github.com/onnx/tensorflow-onnx/blob/master/tutorials/ConvertingSSDMobilenetToONNX.ipynb) from tf2onnx.
4 |
5 | We’ll use SSD Mobilenet, which can detect multiple objects in an image.
6 |
7 | First, download the [pretrained model](https://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz) from the official TensorFlow Models project and this awesome shot of polar bears.
8 |
9 | 
10 |
11 |
12 | Photo from the U.S. Fish and Wildlife Service
13 |
14 |
15 | Install [tf2onnx](https://github.com/onnx/tensorflow-onnx)
16 |
17 | ```sh
18 | pip install tf2onnx
19 | ```
20 |
21 | And convert the model to ONNX
22 |
23 | ```sh
24 | python -m tf2onnx.convert --opset 10 --fold_const \
25 | --saved-model ssd_mobilenet_v1_coco_2018_01_28/saved_model \
26 | --output model.onnx
27 | ```
28 |
29 | Next, install the ONNX Runtime and MiniMagick gems
30 |
31 | ```ruby
32 | gem "onnxruntime"
33 | gem "mini_magick"
34 | ```
35 |
36 | Load the image
37 |
38 | ```ruby
39 | img = MiniMagick::Image.open("bears.jpg")
40 | pixels = img.get_pixels
41 | ```
42 |
43 | And the model
44 |
45 | ```ruby
46 | model = OnnxRuntime::Model.new("model.onnx")
47 | ```
48 |
49 | Check the model inputs
50 |
51 | ```ruby
52 | p model.inputs
53 | ```
54 |
55 | The shape is `[-1, -1, -1, 3]`. `-1` indicates any size. `pixels` has the shape `[img.height, img.width, 3]`. The model is designed to process multiple images at once, which is where the extra leading dimension comes from.
56 |
57 | Let’s run the model:
58 |
59 | ```ruby
60 | result = model.predict("image_tensor:0" => [pixels])
61 | ```
62 |
63 | The model gives us a number of different outputs, like the number of detections, labels, scores, and boxes. Let’s print the results:
64 |
65 | ```ruby
66 | p result["num_detections:0"]
67 | # [3.0]
68 | p result["detection_classes:0"]
69 | # [[23.0, 23.0, 88.0, 1.0, ...]]
70 | ```
71 |
72 | We can see there were three detections, and if we look at the first three elements in the detection classes array, they are the numbers 23, 23, and 88. These correspond to [COCO](http://cocodataset.org/) labels. We can [look these up](https://github.com/amikelive/coco-labels/blob/master/coco-labels-paper.txt) and see that 23 is bear and 88 is teddy bear. Mostly right!
73 |
74 | With a bit more code, we can apply boxes and labels to the image.
75 |
76 | ```ruby
77 | coco_labels = {
78 | 23 => "bear",
79 | 88 => "teddy bear"
80 | }
81 |
82 | def draw_box(img, label, box)
83 | width, height = img.dimensions
84 |
85 | # calculate box
86 | thickness = 2
87 | top = (box[0] * height).round - thickness
88 | left = (box[1] * width).round - thickness
89 | bottom = (box[2] * height).round + thickness
90 | right = (box[3] * width).round + thickness
91 |
92 | # draw box
93 | img.combine_options do |c|
94 | c.draw "rectangle #{left},#{top} #{right},#{bottom}"
95 | c.fill "none"
96 | c.stroke "red"
97 | c.strokewidth thickness
98 | end
99 |
100 | # draw text
101 | img.combine_options do |c|
102 | c.draw "text #{left},#{top - 5} \"#{label}\""
103 | c.fill "red"
104 | c.pointsize 18
105 | end
106 | end
107 |
108 | result["num_detections:0"].each_with_index do |n, idx|
109 | n.to_i.times do |i|
110 | label = result["detection_classes:0"][idx][i].to_i
111 | label = coco_labels[label] || label
112 | box = result["detection_boxes:0"][idx][i]
113 | draw_box(img, label, box)
114 | end
115 | end
116 |
117 | # save image
118 | img.write("labeled.jpg")
119 | ```
120 |
121 | And the result:
122 |
123 | 
124 |
125 | Here’s the [complete code](https://gist.github.com/ankane/4a9681c8d9b9e814debe9e3ea836529d). Now go out and try it with your own images!
126 |
--------------------------------------------------------------------------------
/archive/numo.md:
--------------------------------------------------------------------------------
1 | # Numo: NumPy for Ruby
2 |
3 | 
4 |
5 |
6 | Photo by Jonas Svidras
7 |
8 |
9 | NumPy is an extremely popular library for machine learning in Python. It provides an efficient way to work with large, multi-dimensional arrays. What you may not know is Ruby has a library with similar functionality. It’s called Numo, and in this post, we’ll look at what you can do with it.
10 |
11 | ## Basic Operations
12 |
13 | Numo’s core data structure is the multi-dimensional array, which has methods for mathematical operations. These operations are written in C, so they’re much faster than performing the same operations in Ruby.
14 |
15 | Let’s start by creating a Numo array from a Ruby array.
16 |
17 | ```ruby
18 | x = Numo::DFloat.cast([[1, 2, 3], [4, 5, 6]])
19 | ```
20 |
21 | Each array has a shape. We created a 2x3 2D array, but arrays can be 1D, 3D, or more.
22 |
23 | ```ruby
24 | x.shape # [2, 3]
25 | ```
26 |
27 | Read a row or column with:
28 |
29 | ```ruby
30 | x[0, true] # 1st row - [1, 2, 3]
31 | x[true, 2] # 3rd column - [3, 6]
32 | ```
33 |
34 | We can add a constant value:
35 |
36 | ```ruby
37 | x + 2 # [[3, 4, 5], [6, 7, 8]]
38 | ```
39 |
40 | Or add arrays:
41 |
42 | ```ruby
43 | x + x # [[2, 4, 6], [8, 10, 12]]
44 | ```
45 |
46 | Some operations like mean and sum can be run over a specific axis.
47 |
48 | ```ruby
49 | x.sum(0) # sum of each column - [5, 7, 9]
50 | x.mean(1) # mean of each row - [2, 5]
51 | ```
52 |
53 | We can also change an array’s shape - useful for preparing data for models.
54 |
55 | ```ruby
56 | x.reshape(3, 2) # [[1, 2], [3, 4], [5, 6]]
57 | ```
58 |
59 | If you’re familiar with NumPy operations, there are [side-by-side examples](https://github.com/ruby-numo/numo-narray/wiki/100-narray-exercises) and a table showing how the [functions map](https://github.com/ruby-numo/numo-narray/wiki/Numo-vs-numpy).
60 |
61 | ## Building Models
62 |
63 | [Rumale](https://github.com/yoshoku/rumale) is a machine learning library similar to Python’s Scikit-learn. It uses Numo for inputs and outputs. Here’s a basic example of linear regression.
64 |
65 | ```ruby
66 | # generate data: y = 1 + 2(x0) + 3(x1)
67 | x = Numo::DFloat.asarray([[0, 1], [1, 0], [1, 2]])
68 | y = 1 + 2 * x[true, 0] + 3 * x[true, 1]
69 |
70 | # train
71 | model = Rumale::LinearModel::LinearRegression.new(
72 | fit_bias: true, max_iter: 10000)
73 | model.fit(x, y)
74 |
75 | # predict
76 | model.predict(x)
77 | ```
78 |
79 | Rumale has many, many models and other useful tools for:
80 |
81 | - Regression: linear, ridge, lasso, support vector machines
82 | - Classification: logistic regression, naive Bayes, K-nearest neighbors, support vector machines
83 | - Clustering: K-means, Gaussian mixture model
84 | - Dimensionality reduction: principal component analysis
85 |
86 | Scikit-learn has a great cheat sheet to help you decide what to use:
87 |
88 |
89 |
90 |
91 |
92 | Image from Scikit-learn (BSD License)
93 |
94 |
95 | ## Storing Data
96 |
97 | Numo arrays can be marshaled just like other Ruby objects. This allows you to save your work and resume it at a later time.
98 |
99 | ```ruby
100 | # save
101 | File.binwrite("x.dump", Marshal.dump(x))
102 |
103 | # load
104 | x = Marshal.load(File.binread("x.dump"))
105 | ```
106 |
107 | [Npy](https://github.com/ankane/npy) allows you to save and load arrays in the same format as NumPy. This is more performant than marshaling.
108 |
109 | ```ruby
110 | # save
111 | Npy.save("x.npy", x)
112 |
113 | # load
114 | x = Npy.load("x.npy")
115 | ```
116 |
117 | It also makes it easy to load datasets like [MNIST](https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz).
118 |
119 | ```ruby
120 | mnist = Npy.load_npz("mnist.npz")
121 | ```
122 |
123 | ## Summary
124 |
125 | You now have a basic introduction to Numo and know how to:
126 |
127 | - perform basic operations
128 | - build a model
129 | - store data
130 |
131 | Consider [Numo](https://github.com/ruby-numo/numo-narray) for your next machine learning project.
132 |
--------------------------------------------------------------------------------
/archive/securing-pgbouncer-amazon-rds.md:
--------------------------------------------------------------------------------
1 | # Securing Database Traffic with PgBouncer and Amazon RDS
2 |
3 | Securing database traffic inside your network can be a great step for defense in depth. It’s also a necessity for [Zero Trust Networks](https://www.amazon.com/Zero-Trust-Networks-Building-Untrusted/dp/1491962194).
4 |
5 | Both Amazon RDS and PgBouncer have built-in support for TLS, but it’s a little bit of work to get it set up. This tutorial will show you how.
6 |
7 | ## Direct Connections
8 |
9 | The first step is to make sure all direct connections are secure. Luckily, Amazon RDS has a parameter named `rds.force_ssl` for this. Once it’s applied, you’ll see an error if you try to connect without TLS. You can test this out with:
10 |
11 | ```sh
12 | psql "postgresql://user:secret@dbhost:5432/ssltest?sslmode=disable"
13 | ```
14 |
15 | You’ll see an error like `FATAL: no pg_hba.conf entry ... SSL off` if everything is configured correctly.
16 |
17 | There are a number of possible values for `sslmode`, which you can [read about here](https://www.postgresql.org/docs/current/static/libpq-ssl.html). The most secure (and one we want) is `verify-full`, as it provides protection against both eavesdropping and man-in-the-middle attacks. This mode requires you to provide a root certificate to verify against. AWS makes this certificate available on [their website](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.SSL).
18 |
19 | ```sh
20 | wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem
21 | ```
22 |
23 | To use it with `psql`, run:
24 |
25 | ```sh
26 | psql "postgresql://user:secret@dbhost:5432/ssltest?sslmode=verify-full&sslrootcert=rds-combined-ca-bundle.pem"
27 | ```
28 |
29 | Once connected, you should see an `SSL connection` line before the first prompt.
30 |
31 | There’s also an extension you can use (useful for non-`psql` connections).
32 |
33 | ```sql
34 | CREATE EXTENSION IF NOT EXISTS sslinfo;
35 | SELECT ssl_is_used();
36 | ```
37 |
38 | Now direct connections are good, so let’s secure connections from PgBouncer to the database.
39 |
40 | ## PgBouncer to the Database
41 |
42 | Follow [this guide](pgbouncer-setup) to set up PgBouncer. Once that’s completed, there are two settings to add to `/etc/pgbouncer/pgbouncer.ini`:
43 |
44 | ```ini
45 | server_tls_sslmode = verify-full
46 | server_tls_ca_file = /path/to/rds-combined-ca-bundle.pem
47 | ```
48 |
49 | Restart the service
50 |
51 | ```sh
52 | sudo service pgbouncer restart
53 | ```
54 |
55 | And test it
56 |
57 | ```sh
58 | psql "postgresql://user:secret@bouncerhost:6432/ssltest"
59 | ```
60 |
61 | The connection should succeed and the server should report SSL is used.
62 |
63 | ```sql
64 | SELECT ssl_is_used();
65 | ```
66 |
67 | We’ve now successfully encrypted traffic between the bouncer and the database!
68 |
69 | However, you’ll notice the `psql` prompt does not have an `SSL connection` line as it did before. You can also use `sslmode=disable` to successfully connect, and programs like `tcpdump` or [tshark](https://www.wireshark.org/docs/man-pages/tshark.html) will show unencrypted traffic between the client and the bouncer. You can test this out with:
70 |
71 | ```sh
72 | sudo tcpdump -i lo -X -s 0 'port 6432'
73 | ```
74 |
75 | Run commands in `psql` and you’ll see plaintext statements printed.
76 |
77 | ## Clients to PgBouncer
78 |
79 | This last flow is the trickiest. PgBouncer 1.7+ supports TLS, but we need to create keys and certificates for it. For this, we’ll create a private [PKI](https://en.wikipedia.org/wiki/Public_key_infrastructure). [Minica](https://github.com/jsha/minica) and [Vault](https://www.vaultproject.io/) are two ways to do this.
80 |
81 | We’ll use Minica (here are [instructions for Vault](vault-pki)). Install the latest version:
82 |
83 | ```sh
84 | sudo apt-get install minica
85 | ```
86 |
87 | And run:
88 |
89 | ```sh
90 | minica --domains bouncerhost
91 | ```
92 |
93 | We now have the files we need to connect. Add the key and certificate to `/etc/pgbouncer/pgbouncer.ini`:
94 |
95 | ```ini
96 | client_tls_sslmode = require # not verify-full
97 | client_tls_key_file = /path/to/bouncerhost/key.pem
98 | client_tls_cert_file = /path/to/bouncerhost/cert.pem
99 | ```
100 |
101 | And restart the service. To connect, we once again use `verify-full` but this time with the root certificate we generated above:
102 |
103 | ```sh
104 | psql "postgresql://user:secret@bouncerhost:6432/ssltest?sslmode=verify-full&sslrootcert=minica.pem"
105 | ```
106 |
107 | Confirm the `SSL connection` line is printed and `sslmode=disable` no longer works.
108 |
109 | We’ve now successfully encrypted traffic end-to-end!
110 |
--------------------------------------------------------------------------------
/archive/host-your-own-postgres.md:
--------------------------------------------------------------------------------
1 | # Host Your Own Postgres
2 |
3 | :elephant: Get running with the latest version of Postgres in minutes
4 |
5 | ## Set Up Server
6 |
7 | Spin up a new server with Ubuntu 16.04.
8 |
9 | Firewall
10 |
11 | ```sh
12 | sudo ufw allow ssh
13 | sudo ufw enable
14 | ```
15 |
16 | [Automatic updates](https://help.ubuntu.com/16.04/serverguide/automatic-updates.html)
17 |
18 | ```sh
19 | sudo apt-get -y install unattended-upgrades
20 | echo 'APT::Periodic::Unattended-Upgrade "1";' | sudo tee -a /etc/apt/apt.conf.d/10periodic
21 | ```
22 |
23 | Time zone
24 |
25 | ```sh
26 | sudo dpkg-reconfigure tzdata
27 | ```
28 |
29 | and select `None of the above`, then `UTC`.
30 |
31 | ## Install Postgres
32 |
33 | Install PostgreSQL 10
34 |
35 | ```sh
36 | echo "deb https://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
37 | wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
38 | sudo apt-get update
39 | sudo apt-get install -qq -y postgresql-10 postgresql-contrib
40 | ```
41 |
42 | ## Configure
43 |
44 | Edit `/etc/postgresql/10/main/postgresql.conf`.
45 |
46 | ```sh
47 | # general
48 | max_connections = 100
49 |
50 | # logging
51 | log_min_duration_statement = 100 # log queries over 100ms
52 | log_temp_files = 0 # log all temp files
53 |
54 | # stats
55 | shared_preload_libraries = 'pg_stat_statements'
56 | pg_stat_statements.max = 1000
57 | ```
58 |
59 | ## Remote Connections
60 |
61 | Enable remote connections if needed
62 |
63 | ```sh
64 | echo "host all all 0.0.0.0/0 md5" | sudo tee -a /etc/postgresql/10/main/pg_hba.conf
65 | echo "listen_addresses = '*'" | sudo tee -a /etc/postgresql/10/main/postgresql.conf
66 | sudo service postgresql restart
67 | ```
68 |
69 | And update the firewall
70 |
71 | ```sh
72 | sudo ufw allow 5432/tcp # for all ips
73 | sudo ufw allow from 127.0.0.1 to any port 5432 proto tcp # specific ip
74 | sudo ufw enable
75 | ```
76 |
77 | ## Provisioning
78 |
79 | Create a new user and database for each of your apps
80 |
81 | ```sh
82 | sudo su - postgres
83 | psql
84 | ```
85 |
86 | And run:
87 |
88 | ```sql
89 | CREATE USER myapp WITH PASSWORD 'mypassword';
90 | ALTER USER myapp WITH CONNECTION LIMIT 20;
91 | CREATE DATABASE myapp_production OWNER myapp;
92 | ```
93 |
94 | Generate a random password with:
95 |
96 | ```sh
97 | cat /dev/urandom | LC_CTYPE=C tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1
98 | ```
99 |
100 | ## Backups
101 |
102 | ### Daily
103 |
104 | Store backups on S3
105 |
106 | - [Amazon S3 Backup Scripts](https://github.com/collegeplus/s3-shell-backups/blob/master/s3-postgresql-backup.sh)
107 | - [Automatic Backups to Amazon S3 are Easy](https://rossta.net/blog/automatic-backups-to-amazon-s3-are-easy.html)
108 |
109 | *TODO: better instructions*
110 |
111 | ### Continuous
112 |
113 | Roll back to a specific point in time with [WAL-E](https://github.com/wal-e/wal-e).
114 |
115 | Opbeat has a [great tutorial](https://opbeat.com/blog/posts/postgresql-backup-to-s3-part-one/).
116 |
117 | ## Logging
118 |
119 | [Papertrail](https://papertrailapp.com) is great and has a free plan.
120 |
121 | Install remote syslog
122 |
123 | ```sh
124 | cd /tmp
125 | wget https://github.com/papertrail/remote_syslog2/releases/download/v0.13/remote_syslog_linux_amd64.tar.gz
126 | tar xzf ./remote_syslog*.tar.gz
127 | cd remote_syslog
128 | sudo cp ./remote_syslog /usr/local/bin
129 | ```
130 |
131 | Create `/etc/log_files.yml` with:
132 |
133 | ```yml
134 | files:
135 | - /var/log/postgresql/*.log
136 | destination:
137 | host: logs.papertrailapp.com
138 | port: 12345
139 | protocol: tls
140 | ```
141 |
142 | ### Archive
143 |
144 | Archive logs to S3
145 |
146 | ```sh
147 | sudo apt-get install logrotate s3cmd
148 | s3cmd --configure
149 | ```
150 |
151 | Add to `/etc/logrotate.d/postgresql-common`:
152 |
153 | ```conf
154 | sharedscripts
155 | postrotate
156 | s3cmd sync /var/log/postgresql/*.gz s3://mybucket/logs/
157 | endscript
158 | ```
159 |
160 | Test with:
161 |
162 | ```sh
163 | logrotate -fv /etc/logrotate.d/postgresql-common
164 | ```
165 |
166 | ## TODO
167 |
168 | - scripts
169 |
170 | ```sh
171 | pghost bootstrap
172 | pghost allow all
173 | pghost allow 127.0.0.1
174 | pghost backup:all
175 | pghost backup myapp
176 | pghost restore myapp
177 | pghost provision myapp
178 | pghost logs:syslog logs.papertrailapp.com 12345
179 | pghost logs:archive mybucket/logs
180 | ```
181 |
182 | - monitoring (Graphite, CloudWatch, etc)
183 |
184 | ## Resources
185 |
186 | - [Copy your server logs to Amazon S3 using Logrotate and s3cmd](https://www.shanestillwell.com/2013/04/04/copy-your-server-logs-to-amazon-s3-using-logrotate-and-s3cmd/)
187 |
--------------------------------------------------------------------------------
/archive/rails-on-heroku.md:
--------------------------------------------------------------------------------
1 | # Rails on Heroku
2 |
3 | [The official guide](https://devcenter.heroku.com/articles/getting-started-with-rails4) is a great place to start, but there’s more you can do to make life easier.
4 |
5 | :tangerine: Based on lessons learned in the early days of [Instacart](https://www.instacart.com/)
6 |
7 | ## Deploys
8 |
9 | For zero downtime deploys, enable [preboot](https://devcenter.heroku.com/articles/preboot). This will cause deploys to take a few minutes longer to go live, but it’s better than impacting your users.
10 |
11 | ```sh
12 | heroku features:enable -a appname preboot
13 | ```
14 |
15 | Add a preload check to make sure your app boots. Create `lib/tasks/preload.rake` with:
16 |
17 | ```ruby
18 | task preload: :environment do
19 | Rails.application.eager_load!
20 | ::Rails::Engine.subclasses.map(&:instance).each { |engine| engine.eager_load! }
21 | ActiveRecord::Base.descendants
22 | end
23 | ```
24 |
25 | And add a [release phase](https://devcenter.heroku.com/articles/release-phase) task to your `Procfile` to run the preload script and (optionally) migrations.
26 |
27 | ```sh
28 | release: bundle exec rails preload db:migrate
29 | ```
30 |
31 | Create a deployment script in `bin/deploy`. Here’s an example:
32 |
33 | ```sh
34 | #!/usr/bin/env bash
35 |
36 | function notify() {
37 | # add your chat service
38 | echo $1
39 | }
40 |
41 | notify "Deploying"
42 |
43 | git checkout master -q && git pull origin master -q && \
44 | git push origin master -q && git push heroku master
45 |
46 | if [ $? -eq 0 ]; then
47 | notify "Deploy complete"
48 | else
49 | notify "Deploy failed"
50 | fi
51 | ```
52 |
53 | Be sure to `chmod +x bin/deploy`. Replace the `echo` command with a call to your chat service ([Hipchat instructions](https://github.com/hipchat/hipchat-cli)).
54 |
55 | Deploy with:
56 |
57 | ```sh
58 | bin/deploy
59 | ```
60 |
61 | ## Migrations
62 |
63 | Follow best practices for [zero downtime migrations](https://github.com/ankane/strong_migrations).
64 |
65 | If you start to see errors about prepared statements after running migrations, disable them.
66 |
67 | ```yml
68 | production:
69 | prepared_statements: false
70 | ```
71 |
72 | Don’t worry! Your app will still be fast (and you’ll probably do this anyways at scale since PgBouncer requires it).
73 |
74 | ## Rollbacks
75 |
76 | Create a rollback script in `bin/rollback`.
77 |
78 | ```sh
79 | #!/usr/bin/env bash
80 |
81 | function notify() {
82 | # add your chat service
83 | echo $1
84 | }
85 |
86 | notify "Rolling back"
87 |
88 | heroku rollback
89 |
90 | if [ $? -eq 0 ]; then
91 | notify "Rollback complete"
92 | else
93 | notify "Rollback failed"
94 | fi
95 | ```
96 |
97 | Don’t forget to `chmod +x bin/rollback`. Rollback with:
98 |
99 | ```sh
100 | bin/rollback
101 | ```
102 |
103 | ## Logs
104 |
105 | Add [Papertrail](https://papertrailapp.com/) to make your logs easily searchable.
106 |
107 | ```sh
108 | heroku addons:create papertrail
109 | ```
110 |
111 | Set it up to [archive logs to S3](https://help.papertrailapp.com/kb/how-it-works/permanent-log-archives/).
112 |
113 | ## Performance
114 |
115 | Add a performance monitoring service like New Relic.
116 |
117 | ```sh
118 | heroku addons:create newrelic
119 | ```
120 |
121 | And follow the [installation instructions](https://devcenter.heroku.com/articles/newrelic).
122 |
123 | Use a [CDN](https://en.wikipedia.org/wiki/Content_delivery_network) like [Amazon CloudFront](https://devcenter.heroku.com/articles/using-amazon-cloudfront-cdn) to serve assets.
124 |
125 | ## Autoscaling
126 |
127 | Check out [HireFire](https://www.hirefire.io/).
128 |
129 | ## Productivity
130 |
131 | Use [Archer](https://github.com/ankane/archer) to enable console history.
132 |
133 | Use [aliases](https://www.digitalocean.com/community/tutorials/an-introduction-to-useful-bash-aliases-and-functions) for less typing.
134 |
135 | ```sh
136 | alias hc="heroku run rails console"
137 | ```
138 |
139 | ## Staging
140 |
141 | Create a separate app for staging.
142 |
143 | ```sh
144 | heroku create staging-appname -r staging
145 | heroku config:set RAILS_ENV=staging RACK_ENV=staging -r staging
146 | ```
147 |
148 | Deploy with:
149 |
150 | ```sh
151 | git push staging branch:master
152 | ```
153 |
154 | You may also want to password protect your staging environment.
155 |
156 | ```ruby
157 | class ApplicationController < ActionController::Base
158 | http_basic_authenticate_with name: "happy", password: "carrots" if Rails.env.staging?
159 | end
160 | ```
161 |
162 | ## Lastly...
163 |
164 | Have suggestions? [Please share](https://github.com/ankane/shorts/issues/new). For more tips, check out [Production Rails](https://github.com/ankane/production_rails).
165 |
166 | :hatched_chick: Happy coding!
167 |
--------------------------------------------------------------------------------
/archive/daru.md:
--------------------------------------------------------------------------------
1 | # Daru: Pandas for Ruby
2 |
3 | 
4 |
5 |
6 | Photo by Bruce Hong
7 |
8 |
9 | NumPy and Pandas are two extremely popular libraries for machine learning in Python. Last post, we looked at [Numo](https://ankane.org/numo), a Ruby library similar to NumPy. As luck would have it, there’s a library similar to Pandas as well. It’s called Daru, and it’s the focus of this post.
10 |
11 | ## Overview
12 |
13 | Daru is a data analysis library. Its core data structure is the data frame, which is similar to an in-memory database table. Data frames have rows and columns, and each column has a specific data type. Let’s create a data frame with the most populous countries:
14 |
15 | ```ruby
16 | df = Daru::DataFrame.new(
17 | country: ["China", "India", "USA"],
18 | population: [1433, 1366, 329] # in millions
19 | )
20 | ```
21 |
22 |
23 | Population data from the United Nations, 2019
24 |
25 |
26 | Here’s what it looks like:
27 |
28 | ```text
29 | country population
30 | 0 China 1433
31 | 1 India 1366
32 | 2 USA 329
33 | ```
34 |
35 | You can get specific columns with:
36 |
37 | ```ruby
38 | df[:country]
39 | df[:country, :population]
40 | ```
41 |
42 | Or specific rows with:
43 |
44 | ```ruby
45 | df.first(2) # first 2 rows
46 | df.last(2) # last 2 rows
47 | df.row[1] # 2nd row
48 | df.row[1..2] # 2nd and 3rd row
49 | ```
50 |
51 | ## Filtering, Sorting, and Grouping
52 |
53 | Select countries with over 1 billion people.
54 |
55 | ```ruby
56 | df.where(df[:population] > 1000)
57 | ```
58 |
59 | For equality, use `eq` or `in`.
60 |
61 | ```ruby
62 | df.where(df[:country].eq("China"))
63 | df.where(df[:country].in(["USA", "India"]))
64 | ```
65 |
66 | Negate a condition with `!`.
67 |
68 | ```ruby
69 | df.where(!df[:country].eq("India"))
70 | ```
71 |
72 | Combine operators with `&` (and) and `|` (or).
73 |
74 | ```ruby
75 | df.where(df[:country].eq("USA") | (df[:population] < 1400))
76 | ```
77 |
78 | Sort the data frame by a column with:
79 |
80 | ```ruby
81 | df.sort([:population])
82 | df.sort([:country], ascending: [false])
83 | ```
84 |
85 | You can also group data and perform aggregations.
86 |
87 | ```ruby
88 | cities = Daru::DataFrame.new(
89 | country: ["China", "China", "India"],
90 | city: ["Shanghai", "Beijing", "Mumbai"]
91 | )
92 | cities.group_by([:country]).count
93 | ```
94 |
95 | ## Combining Data Frames
96 |
97 | There are a number of ways to combine data frames. You can add rows:
98 |
99 | ```ruby
100 | countries = Daru::DataFrame.new(
101 | country: ["Indonesia", "Pakistan"],
102 | population: [271, 217] # in millions
103 | )
104 | df.concat(countries)
105 | ```
106 |
107 | Or add columns:
108 |
109 | ```ruby
110 | locations = Daru::DataFrame.new(
111 | continent: ["Asia", "Asia", "North America"],
112 | planet: ["Earth", "Earth", "Earth"]
113 | )
114 | df.merge(locations)
115 | ```
116 |
117 | You can also perform joins like in SQL.
118 |
119 | ```ruby
120 | cities = Daru::DataFrame.new(
121 | country: ["China", "China", "India"],
122 | city: ["Shanghai", "Beijing", "Mumbai"]
123 | )
124 | df.join(cities, how: :inner, on: [:country])
125 | ```
126 |
127 | ## Reading and Writing Data
128 |
129 | Daru makes it easy to load data from a CSV file.
130 |
131 | ```ruby
132 | Daru::DataFrame.from_csv("countries.csv")
133 | ```
134 |
135 | After manipulating the data, you can save it back to a CSV file.
136 |
137 | ```ruby
138 | df.write_csv("countries_v2.csv")
139 | ```
140 |
141 | You can also load data directly from Active Record.
142 |
143 | ```ruby
144 | relation = Country.where("population > 100")
145 | Daru::DataFrame.from_activerecord(relation)
146 | ```
147 |
148 | ## Plotting
149 |
150 | For plotting, use a Jupyter notebook with [IRuby](https://github.com/sciruby/iruby). Create a plot with:
151 |
152 | ```ruby
153 | df.plot type: :bar, x: :country, y: :population do |plot, diagram|
154 | plot.x_label "Country"
155 | plot.y_label "Population (millions)"
156 | diagram.color(Nyaplot::Colors.Pastel1)
157 | end
158 | ```
159 |
160 | 
161 |
162 | You can also create line charts, scatter plots, box plots, and histograms.
163 |
164 | ## Summary
165 |
166 | You’ve now seen how to use Daru to:
167 |
168 | - create data frames
169 | - filter, sort, and group data
170 | - combine data frames
171 | - create plots
172 |
173 | Try out [Daru](https://github.com/SciRuby/daru) for your next analysis.
174 |
--------------------------------------------------------------------------------
/archive/introducing-dexter.md:
--------------------------------------------------------------------------------
1 | # Introducing Dexter, the Automatic Indexer for Postgres
2 |
3 | 
4 |
5 | Your database knows which queries are running. It also has a pretty good idea of which indexes are best for a given query. And since indexes don’t change the results of a query, they’re really just a performance optimization. So why do we always need a human to choose them?
6 |
7 | Introducing [Dexter](https://github.com/ankane/dexter). Dexter indexes your database for you. You can still do it yourself, but Dexter will do a pretty good job.
8 |
9 | Dexter works in two phases:
10 |
11 | 1. Collect queries
12 | 2. Generate indexes
13 |
14 | We’ll walk through each of them.
15 |
16 | ### Phase 1: Collect
17 |
18 | You can stream Postgres log files directly to Dexter. Dexter finds lines like:
19 |
20 | ```txt
21 | LOG: duration: 14.077 ms statement: SELECT * FROM ratings WHERE user_id = 3;
22 | ```
23 |
24 | And parses out the query and duration. It uses fingerprinting to group queries. Queries with the same parse tree but different values are grouped together. For instance, both of the following queries have the same fingerprint.
25 |
26 | ```sql
27 | SELECT * FROM ratings WHERE user_id = 2;
28 | SELECT * FROM ratings WHERE user_id = 3;
29 | ```
30 |
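Dexter uses the pg_query gem for real fingerprinting. As a toy illustration of the idea (a regex-based sketch, not pg_query's actual algorithm), we can strip out literal values so queries that differ only in parameters hash to the same fingerprint:

```ruby
require "digest"

# Toy fingerprint: replace string and numeric literals with "?"
# so equivalent queries group together, then hash the result
def fingerprint(sql)
  normalized = sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")
  Digest::SHA1.hexdigest(normalized)[0, 10]
end

a = fingerprint("SELECT * FROM ratings WHERE user_id = 2;")
b = fingerprint("SELECT * FROM ratings WHERE user_id = 3;")
p a == b # true
```
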
31 | The data is aggregated to get the total execution time by fingerprint. You can get similar information from the [pg_stat_statements view](https://www.postgresql.org/docs/current/static/pgstatstatements.html), except queries in the view are normalized. This means you get:
32 |
33 | ```sql
34 | SELECT * FROM ratings WHERE user_id = ?;
35 | ```
36 |
37 | instead of
38 |
39 | ```sql
40 | SELECT * FROM ratings WHERE user_id = 3;
41 | ```
42 |
43 | However, we need the actual values to determine costs in the next step. To prevent over-indexing, you can set a threshold for the total execution time before a query is considered for indexing.
44 |
45 | ### Phase 2: Generate
46 |
47 | To generate indexes, Dexter creates hypothetical indexes to try to speed up the slow queries we’ve just collected. Hypothetical indexes show how a query’s execution plan would change if an actual index existed. They take virtually no time to create, don’t require any disk space, and are only visible to the current session. You can read more about [hypothetical indexes here](https://rjuju.github.io/postgresql/2015/07/02/how-about-hypothetical-indexes.html).
48 |
49 | The main steps Dexter takes are:
50 |
51 | 1. Filter out queries on system tables and other databases
52 | 2. Analyze tables for up-to-date planner statistics if they haven’t been analyzed recently
53 | 3. Get the initial cost of queries
54 | 4. Create hypothetical indexes on columns that aren’t already indexed
55 | 5. Get costs again and see if any hypothetical indexes were used
56 |
57 | While fairly straightforward, this approach is extremely powerful, as it uses the Postgres query planner to figure out the best index(es) for a query. Hypothetical indexes that were used AND significantly reduced cost are selected to be indexes.
58 |
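The hypothetical-index step looks roughly like this in raw SQL (a sketch using HypoPG directly, with the `ratings` table from the earlier examples):

```sql
-- requires the HypoPG extension
CREATE EXTENSION IF NOT EXISTS hypopg;

-- create a hypothetical index (no disk space used, visible only to this session)
SELECT * FROM hypopg_create_index('CREATE INDEX ON ratings (user_id)');

-- the planner can now consider it; compare the cost to the original plan
EXPLAIN SELECT * FROM ratings WHERE user_id = 3;
```
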
59 | To be safe, indexes are only logged by default. This allows you to use Dexter for index suggestions if you want to manually verify them first. When you let Dexter create indexes, they’re created concurrently to limit the impact on database performance.
60 |
61 | ```txt
62 | 2017-06-25T17:52:22+00:00 Index found: ratings (user_id)
63 | 2017-06-25T17:52:22+00:00 Creating index: CREATE INDEX CONCURRENTLY ON ratings (user_id)
64 | 2017-06-25T17:52:37+00:00 Index created: 15243 ms
65 | ```
66 |
67 | ### Trade-offs and Limitations
68 |
69 | The big advantage of indexes is faster data retrieval. On the flip side, indexes add overhead to write operations, like INSERT, UPDATE, and DELETE, as indexes must be updated as well. Indexes also take up disk space.
70 |
71 | Because of this, you may not want to index write-heavy tables. Dexter does not currently try to identify these tables automatically, but you can pass them in by hand.
72 |
73 | As for other limitations, Dexter does not try to create multicolumn indexes (edit: this is no longer the case). Dexter also assumes the search_path for queries is the same as the user running Dexter. You’ll still need to create unique constraints on your own. Dexter also requires the [HypoPG](https://github.com/HypoPG/hypopg) extension, which isn’t available on some hosted providers like Heroku and Amazon RDS.
74 |
75 | * * *
76 |
77 | It’s time to make forgotten indexes a problem of the past.
78 |
79 | [Add Dexter to your team](https://github.com/ankane/dexter) today.
80 |
81 | ### Thanks
82 |
83 | This software wouldn’t be possible without [HypoPG](https://github.com/HypoPG/hypopg), which allows you to create hypothetical indexes, and [pg_query](https://github.com/lfittl/pg_query), which allows you to parse and fingerprint queries. A big thanks to Dalibo and [Lukas Fittl](https://medium.com/@LukasFittl) respectively.
84 |
--------------------------------------------------------------------------------
/archive/postgres-users.md:
--------------------------------------------------------------------------------
1 | # Bootstrapping Postgres Users
2 |
3 | Setting up database users for an app can be challenging if you don’t do it often. Good permissions add a layer of security and can minimize the chances of developer mistakes.
4 |
5 | The three types of users we’ll cover are:
6 |
7 | Type | Description | Read | Write | Modify
8 | --- | --- | --- | --- | ---
9 | migrations | Schema changes | ✓ | ✓ | ✓
10 | app | Reading and writing data | ✓ | ✓ |
11 | analytics | Data analysis and reporting | ✓ | |
12 |
13 | Before we jump into it, there’s something you should know about new databases.
14 |
15 | ## New Databases
16 |
17 | After creating a new database, all users can access it and create tables in the `public` schema. This isn’t what we want. To fix this, run:
18 |
19 | ```sql
20 | REVOKE ALL ON DATABASE mydb FROM PUBLIC;
21 |
22 | REVOKE ALL ON SCHEMA public FROM PUBLIC;
23 | ```
24 |
25 | Be sure to replace `mydb` with your database name.
26 |
27 | ## Roles
28 |
29 | PostgreSQL uses the concept of roles to manage privileges. Roles can be used to define groups and users. A user is simply a role with a password and permission to log in.
30 |
31 | The approach we’ll take is to create a group and add users to it. This makes it easy to rotate credentials in the future: just add a second user to the group, set your app’s configuration to the new user, and remove the original one.
32 |
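For example, rotating the credentials for an `app` group later might look like this (the role names are illustrative):

```sql
-- add a replacement user to the existing group
CREATE ROLE myapp2 WITH LOGIN ENCRYPTED PASSWORD 'newsecret' IN ROLE app;

-- after switching the app's configuration to myapp2, remove the old user
DROP ROLE myapp;
```
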
33 | ## Migrations
34 |
35 | First, we need a group to manage the schema. We could use a superuser, but this isn’t a great idea, as superusers can access all databases, change permissions, and create new roles. Instead, let’s create a new group.
36 |
37 | ```sql
38 | CREATE ROLE migrations;
39 |
40 | GRANT CONNECT ON DATABASE mydb TO migrations;
41 |
42 | GRANT ALL ON SCHEMA public TO migrations;
43 |
44 | ALTER ROLE migrations SET lock_timeout TO '5s';
45 | ```
46 |
47 | We set a lock timeout so migrations don’t disrupt normal database activity while attempting to acquire a lock.
48 |
49 | Now, we can create a user who’s a member of the group.
50 |
51 | ```sql
52 | CREATE ROLE migrator WITH LOGIN ENCRYPTED PASSWORD 'secret' IN ROLE migrations;
53 |
54 | ALTER ROLE migrator SET role TO 'migrations';
55 | ```
56 |
57 | The last statement ensures tables created by the user are owned by the group.
58 |
59 | You can generate a nice password from the command line with:
60 |
61 | ```sh
62 | cat /dev/urandom | LC_CTYPE=C tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1
63 | ```
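
If you have Ruby handy, `SecureRandom` gives an equivalent result:

```ruby
require "securerandom"

# 32 random alphanumeric characters, like the shell one-liner above
puts SecureRandom.alphanumeric(32)
```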
64 |
65 | ## App
66 |
67 | Next, let’s create a group for our app. It’ll need to read and write data but shouldn’t need to modify the schema or truncate tables. We also want to set a statement timeout to prevent long running queries from degrading database performance.
68 |
69 | ```sql
70 | CREATE ROLE app;
71 |
72 | GRANT CONNECT ON DATABASE mydb TO app;
73 |
74 | GRANT USAGE ON SCHEMA public TO app;
75 |
76 | GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app;
77 |
78 | GRANT SELECT, USAGE ON ALL SEQUENCES IN SCHEMA public TO app;
79 |
80 | ALTER DEFAULT PRIVILEGES FOR ROLE migrations IN SCHEMA public
81 | GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app;
82 |
83 | ALTER DEFAULT PRIVILEGES FOR ROLE migrations IN SCHEMA public
84 | GRANT SELECT, USAGE ON SEQUENCES TO app;
85 |
86 | ALTER ROLE app SET statement_timeout TO '30s';
87 | ```
88 |
89 | > **Note:** The default privileges statements reference the group used for migrations. If you use Amazon RDS, you must run these statements as the migrator user we created above (since you don’t have access to a true superuser).
90 |
91 | Then, create a user with:
92 |
93 | ```sql
94 | CREATE ROLE myapp WITH LOGIN ENCRYPTED PASSWORD 'secret' IN ROLE app;
95 | ```
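
When it’s time to rotate credentials (as described earlier), a sketch might look like this — `myapp2` and the password are placeholder values:

```sql
-- add a second user to the app group
CREATE ROLE myapp2 WITH LOGIN ENCRYPTED PASSWORD 'newsecret' IN ROLE app;

-- deploy your app with the new credentials, then remove the old user
DROP ROLE myapp;
```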
96 |
97 | ## Analytics
98 |
99 | Finally, let’s create a group to be used for data analysis, reporting, and business intelligence tools (like [Blazer](https://github.com/ankane/blazer), our open-source one). These users are often referred to as *read-only users*. We don’t want them to be able to mistakenly update data.
100 |
101 | ```sql
102 | CREATE ROLE analytics;
103 |
104 | GRANT CONNECT ON DATABASE mydb TO analytics;
105 |
106 | GRANT USAGE ON SCHEMA public TO analytics;
107 |
108 | GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics;
109 |
110 | ALTER DEFAULT PRIVILEGES FOR ROLE migrations IN SCHEMA public
111 | GRANT SELECT ON TABLES TO analytics;
112 |
113 | ALTER ROLE analytics SET statement_timeout TO '3min';
114 | ```
115 |
116 | Once again, creating a user is relatively straightforward.
117 |
118 | ```sql
119 | CREATE ROLE bi WITH LOGIN ENCRYPTED PASSWORD 'secret' IN ROLE analytics;
120 | ```
121 |
122 | ## Summary
123 |
124 | You now know how to create different types of Postgres users. Spending a bit of time upfront to configure your users can make them easier to manage in the long run. This should give you a nice foundation.
125 |
--------------------------------------------------------------------------------
/archive/dokku-digital-ocean.md:
--------------------------------------------------------------------------------
1 | # Dokku on DigitalOcean
2 |
3 | :droplet: Your very own PaaS
4 |
5 | ## Create Droplet
6 |
7 | Create new droplet with Ubuntu 16.04. Be sure to use an SSH key.
8 |
9 | ## Install Dokku
10 |
11 | ```sh
12 | wget https://raw.githubusercontent.com/dokku/dokku/v0.12.5/bootstrap.sh
13 | sudo DOKKU_TAG=v0.12.5 bash bootstrap.sh
14 | ```
15 |
16 | And visit your server’s IP address in your browser to complete installation.
17 |
18 | If you have a domain, use virtualhost naming. Otherwise, Dokku will use a different port for each deploy of your app. You can easily add a domain later.
19 |
20 | ## Add a Firewall
21 |
22 | Create a [firewall](https://cloud.digitalocean.com/networking/firewalls)
23 |
24 | Inbound Rules
25 |
26 | - SSH from your [external IP](https://www.google.com/search?q=external+ip)
27 | - HTTP and HTTPS from all IPv4 and all IPv6
28 |
29 | Outbound Rules
30 |
31 | - ICMP, all TCP, and all UDP from all IPv4 and all IPv6
32 |
33 | ## Set Up Server
34 |
35 | Turn on [automatic updates](https://help.ubuntu.com/16.04/serverguide/automatic-updates.html)
36 |
37 | ```sh
38 | sudo apt-get -y install unattended-upgrades
39 | echo 'APT::Periodic::Unattended-Upgrade "1";' | sudo tee -a /etc/apt/apt.conf.d/10periodic
40 | ```
41 |
42 | Enable swap
43 |
44 | ```sh
45 | sudo fallocate -l 4G /swapfile
46 | sudo chmod 600 /swapfile
47 | sudo mkswap /swapfile
48 | sudo swapon /swapfile
49 | sudo sh -c 'echo "/swapfile none swap sw 0 0" >> /etc/fstab'
50 | ```
51 |
52 | Configure time zone
53 |
54 | ```sh
55 | sudo dpkg-reconfigure tzdata
56 | ```
57 |
58 | and select `None of the above`, then `UTC`.
59 |
60 | ## Deploy
61 |
62 | Get the official Dokku client locally
63 |
64 | ```sh
65 | git clone https://github.com/dokku/dokku.git ~/.dokku
66 |
67 | # add the following to either your
68 | # .bashrc, .bash_profile, or .profile file
69 | alias dokku='$HOME/.dokku/contrib/dokku_client.sh'
70 | ```
71 |
72 | Create app
73 |
74 | ```sh
75 | dokku apps:create myapp
76 | ```
77 |
78 | Add a `CHECKS` file
79 |
80 | ```txt
81 | WAIT=2
82 | ATTEMPTS=15
83 | /
84 | ```
85 |
86 | Deploy
87 |
88 | ```sh
89 | git remote add dokku dokku@dokkuhost:myapp
90 | git push dokku master
91 | ```
92 |
93 | ## Workers
94 |
95 | Dokku only runs web processes by default. If you have workers or other process types, use:
96 |
97 | ```sh
98 | dokku ps:scale worker=1
99 | ```
100 |
101 | ## One-Off Jobs
102 |
103 | ```sh
104 | dokku run rails db:migrate
105 | dokku run rails console
106 | ```
107 |
108 | ## Scheduled Jobs
109 |
110 | Two options
111 |
112 | 1. Add a [custom clock process](https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes) to your Procfile
113 |
114 | 2. Or create `/etc/cron.d/myapp` with:
115 |
116 | ```
117 | PATH=/usr/local/bin:/usr/bin:/bin
118 | SHELL=/bin/bash
119 | * * * * * dokku dokku --rm run myapp rake task1
120 | 0 0 * * * dokku dokku --rm run myapp rake task2
121 | ```
122 |
123 | ## Custom Domains
124 |
125 | ```sh
126 | dokku domains:add www.datakick.org
127 | ```
128 |
129 | ## SSL
130 |
131 | Get free SSL certificates thanks to [Let’s Encrypt](https://letsencrypt.org/). On the server, run:
132 |
133 | ```sh
134 | dokku plugin:install https://github.com/dokku/dokku-letsencrypt.git
135 | dokku letsencrypt:cron-job --add
136 | ```
137 |
138 | And locally, run:
139 |
140 | ```sh
141 | dokku config:set --no-restart DOKKU_LETSENCRYPT_EMAIL=your@email.tld
142 | dokku letsencrypt
143 | ```
144 |
145 | ## Logging
146 |
147 | Use syslog to ship your logs to a service. [Papertrail](https://papertrailapp.com) is great and has a free plan.
148 |
149 | For apps, use:
150 |
151 | ```sh
152 | dokku plugin:install https://github.com/michaelshobbs/dokku-logspout.git
153 | dokku plugin:install https://github.com/michaelshobbs/dokku-hostname.git
154 | dokku logspout:server syslog+tls://logs.papertrailapp.com:12345
155 | dokku logspout:start
156 | ```
157 |
158 | For nginx and other logs, install [remote_syslog2](https://github.com/papertrail/remote_syslog2)
159 |
160 | ```sh
161 | cd /tmp
162 | wget https://github.com/papertrail/remote_syslog2/releases/download/v0.18/remote_syslog_linux_amd64.tar.gz
163 | tar xzf ./remote_syslog*.tar.gz
164 | cd remote_syslog
165 | sudo cp ./remote_syslog /usr/local/bin
166 | ```
167 |
168 | Create `/etc/log_files.yml` with:
169 |
170 | ```yml
171 | files:
172 | - /var/log/nginx/*.log
173 | - /var/log/unattended-upgrades/*.log
174 | destination:
175 | host: logs.papertrailapp.com
176 | port: 12345
177 | protocol: tls
178 | ```
179 |
180 | And run:
181 |
182 | ```sh
183 | remote_syslog
184 | ```
185 |
186 | ## Database
187 |
188 | Check out [Host Your Own Postgres](host-your-own-postgres).
189 |
190 | ## Memcached
191 |
192 | ```sh
193 | dokku plugin:install https://github.com/dokku/dokku-memcached.git
194 | dokku memcached:create lolipop
195 | dokku memcached:link lolipop myapp
196 | ```
197 |
198 | ## Redis
199 |
200 | ```sh
201 | dokku plugin:install https://github.com/dokku/dokku-redis.git
202 | dokku redis:create lolipop
203 | dokku redis:link lolipop myapp
204 | ```
205 |
206 | ## TODO
207 |
208 | - [Monitoring](https://www.brianchristner.io/how-to-setup-docker-monitoring/)
209 |
210 | ## Bonus
211 |
212 | Find great Docker projects at [Awesome Docker](https://github.com/veggiemonk/awesome-docker).
213 |
214 | ## Resources
215 |
216 | - [Additional Recommended Steps for New Ubuntu 14.04 Servers](https://www.digitalocean.com/community/tutorials/additional-recommended-steps-for-new-ubuntu-14-04-servers)
217 |
--------------------------------------------------------------------------------
/archive/securing-user-emails-lockbox.md:
--------------------------------------------------------------------------------
1 | # Securing User Emails in Rails with Lockbox
2 |
3 | 
4 |
5 | ---
6 |
7 | *This is an update to [Securing User Emails in Rails](https://ankane.org/securing-user-emails-in-rails) with a number of improvements:*
8 |
9 | - *Works with Devise’s email changed notifications*
10 | - *Works with Devise’s reconfirmable option*
11 | - *Stores encrypted data in a single field*
12 | - *You only need to manage a single key*
13 |
14 | ---
15 |
16 | Email addresses are a common form of personal data, and they’re often stored unencrypted. If an attacker gains access to the database or backups, emails will be compromised.
17 |
18 | This post will walk you through a practical approach to protecting emails. It works with [Devise](https://github.com/plataformatec/devise), the most popular authentication framework for Rails, and is general enough to work with others.
19 |
20 | ## Strategy
21 |
22 | We’ll use two concepts to make this happen: encryption and blind indexing. Encryption gives us a way to securely store the data, and blind indexing provides a way to look it up.
23 |
24 | Blind indexing works by computing a hash of the data. You’re probably familiar with hash functions like MD5 and SHA1. Rather than one of these, we use a hash function that takes a secret key and uses [key stretching](https://en.wikipedia.org/wiki/Key_stretching) to slow down brute force attempts. You can read more about [blind indexing here](https://www.sitepoint.com/how-to-search-on-securely-encrypted-database-fields/).
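
Conceptually, a blind index is just a keyed, stretched hash. Here’s a sketch using PBKDF2 from Ruby’s standard library — an illustration only, not the exact scheme the Blind Index gem uses (the key and iteration count below are made up):

```ruby
require "openssl"

# Placeholder secret — store a real one with your other secrets
BIDX_KEY = "a-secret-key-for-the-blind-index"

# Same input -> same index, so equality lookups work, but brute
# forcing requires the key and is slowed by key stretching
def blind_index(value)
  digest = OpenSSL::KDF.pbkdf2_hmac(
    value.downcase,      # normalize so lookups match regardless of case
    salt: BIDX_KEY,
    iterations: 10_000,  # key stretching
    length: 32,
    hash: "sha256"
  )
  [digest].pack("m0")    # Base64-encode for a text column
end

puts blind_index("Test@example.org") == blind_index("test@example.org") # true
```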
25 |
26 | We’ll use the [Lockbox](https://github.com/ankane/lockbox) gem for encryption and the [Blind Index](https://github.com/ankane/blind_index) gem for blind indexing.
27 |
28 | ## Instructions
29 |
30 | Let’s assume you have a `User` model with an email field.
31 |
32 | Add to your Gemfile:
33 |
34 | ```ruby
35 | gem 'lockbox'
36 | gem 'blind_index'
37 | ```
38 |
39 | And run:
40 |
41 | ```sh
42 | bundle install
43 | ```
44 |
45 | Generate a key
46 |
47 | ```ruby
48 | Lockbox.generate_key
49 | ```
50 |
51 | Store the key with your other secrets. This is typically Rails credentials or an environment variable ([dotenv](https://github.com/bkeepers/dotenv) is great for this). Be sure to use different keys in development and production.
52 |
53 | Set the following environment variable with your key (you can use this one in development)
54 |
55 | ```sh
56 | LOCKBOX_MASTER_KEY=0000000000000000000000000000000000000000000000000000000000000000
57 | ```
58 |
59 | or create `config/initializers/lockbox.rb` with something like
60 |
61 | ```ruby
62 | Lockbox.master_key = Rails.application.credentials.lockbox_master_key
63 | ```
64 |
65 | Next, let’s replace the email field with an encrypted version. Create a migration:
66 |
67 | ```sh
68 | rails generate migration add_email_ciphertext_to_users
69 | ```
70 |
71 | And add:
72 |
73 | ```ruby
74 | class AddEmailCiphertextToUsers < ActiveRecord::Migration[5.2]
75 | def change
76 | # encrypted data
77 | add_column :users, :email_ciphertext, :string
78 |
79 | # blind index
80 | add_column :users, :email_bidx, :string
81 | add_index :users, :email_bidx, unique: true
82 |
83 | # drop original here unless we have existing users
84 | remove_column :users, :email
85 | end
86 | end
87 | ```
88 |
89 | Then migrate:
90 |
91 | ```sh
92 | rails db:migrate
93 | ```
94 |
95 | Add to your user model:
96 |
97 | ```ruby
98 | class User < ApplicationRecord
99 | encrypts :email
100 | blind_index :email
101 | end
102 | ```
103 |
104 | Create a new user and confirm it works.
105 |
106 | ## Existing Users
107 |
108 | If you have existing users, we need to backfill the data before dropping the email column.
109 |
110 | ```ruby
111 | class User < ApplicationRecord
112 | encrypts :email, migrating: true
113 | blind_index :email, migrating: true
114 | end
115 | ```
116 |
117 | Backfill the data in the Rails console:
118 |
119 | ```ruby
120 | Lockbox.migrate(User)
121 | ```
122 |
123 | Then update the model to the desired state:
124 |
125 | ```ruby
126 | class User < ApplicationRecord
127 | encrypts :email
128 | blind_index :email
129 |
130 | # remove this line after dropping email column
131 | self.ignored_columns = ["email"]
132 | end
133 | ```
134 |
135 | Finally, drop the email column.
136 |
137 | ## Reconfirmable
138 |
139 | If you use the confirmable module with `reconfirmable`, you should also encrypt the `unconfirmed_email` field.
140 |
141 | ```ruby
142 | class AddUnconfirmedEmailToUsers < ActiveRecord::Migration[5.2]
143 | def change
144 | add_column :users, :unconfirmed_email_ciphertext, :text
145 | end
146 | end
147 | ```
148 |
149 | And add `unconfirmed_email` to the list of encrypted fields:
150 |
151 | ```ruby
152 | class User < ApplicationRecord
153 | encrypts :email, :unconfirmed_email
154 | end
155 | ```
156 |
157 | ## Logging
158 |
159 | We also need to make sure email addresses aren’t logged. Add to `config/initializers/filter_parameter_logging.rb`:
160 |
161 | ```ruby
162 | Rails.application.config.filter_parameters += [:email]
163 | ```
164 |
165 | Use [Logstop](https://github.com/ankane/logstop) to filter anything that looks like an email address as an extra line of defense. Add to your Gemfile:
166 |
167 | ```ruby
168 | gem 'logstop'
169 | ```
170 |
171 | And create `config/initializers/logstop.rb` with:
172 |
173 | ```ruby
174 | Logstop.guard(Rails.logger)
175 | ```
176 |
177 | ## Summary
178 |
179 | We now have a way to encrypt emails and query for exact matches. You can apply this same approach to other fields as well. For more security, consider a [key management service](https://github.com/ankane/kms_encrypted) to manage your keys.
180 |
--------------------------------------------------------------------------------
/archive/modern-encryption-rails.md:
--------------------------------------------------------------------------------
1 | # Modern Encryption for Rails
2 |
3 | 
4 |
5 | Encrypting sensitive data at the application-level is crucial for data security. Since writing [Securing Sensitive Data in Rails](https://ankane.org/sensitive-data-rails), I haven’t been able to shake the feeling that encryption in Rails could be easier and cleaner.
6 |
7 | To address this, I created a library called [Lockbox](https://github.com/ankane/lockbox). Here are some of the principles behind it.
8 |
9 | ## Easy to Use, Hard to Misuse
10 |
11 | Many cryptography mistakes happen during implementation. Lockbox provides good defaults and is designed to be hard to misuse. You don’t need to deal with initialization vectors and it only supports secure algorithms.
12 |
13 | ## Popular Integrations
14 |
15 | Sensitive data can appear in many places, like database fields, file uploads, and strings. You shouldn’t need different libraries for each of these.
16 |
17 | Lockbox can encrypt your data in all of these forms. It has built-in integrations with Active Record, Active Storage, and CarrierWave.
18 |
19 | ## Zero Downtime Migrations
20 |
21 | At some point, you may want to encrypt existing data. This should be easy to do, and most importantly, not require any downtime. Lockbox provides a single method you can use for this once your model is configured:
22 |
23 | ```ruby
24 | Lockbox.migrate(User)
25 | ```
26 |
27 | No need to write one-off backfill scripts.
28 |
29 | ## Maximum Compatibility
30 |
31 | Encrypting attributes shouldn’t break existing code or libraries. To make this possible, methods like `attribute_changed?` and `attribute_was` should behave similarly regardless of whether or not an attribute is encrypted. Lockbox includes these methods in its test suite for maximum compatibility.
32 |
33 | This allows features like Devise’s ability to send email change notifications to work when the email attribute is encrypted, which is an important measure to prevent account hijacking.
34 |
35 | ```ruby
36 | Devise.setup do |config|
37 | config.send_email_changed_notification = true
38 | end
39 | ```
40 |
41 | You can even query encrypted attributes thanks to the [blind_index](https://github.com/ankane/blind_index) gem.
42 |
43 | ## Modern Algorithms
44 |
45 | Lockbox uses AES-GCM for [authenticated encryption](https://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken). It also supports XSalsa20 (thanks to Libsodium), which is recommended by [some cryptographers](https://latacora.micro.blog/2018/04/03/cryptographic-right-answers.html).
46 |
47 | ## Fewer Keys to Manage
48 |
49 | It’s a good practice to use a different encryption key for each field to make it more difficult for attackers and to reduce the likelihood of a [nonce collision](https://www.cryptologie.net/article/402/is-symmetric-security-solved/). However, this can be burdensome for developers.
50 |
51 | Instead, we can use a single master key and derive separate keys for each field from it. This approach is taken from [CipherSweet](https://ciphersweet.paragonie.com/internals/key-hierarchy), an encryption library for PHP and Node.js. Now developers can safely add encrypted fields without having to worry about generating and storing additional secrets.
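
To make the idea concrete, per-field keys can be derived with HKDF from Ruby’s OpenSSL — a simplified sketch, not the exact derivation CipherSweet or Lockbox performs:

```ruby
require "openssl"

master_key = OpenSSL::Random.random_bytes(32) # the one secret you manage

# Derive an independent key per table and attribute
def derive_key(master_key, table, attribute)
  OpenSSL::KDF.hkdf(
    master_key,
    salt: table,      # bind the derived key to the table...
    info: attribute,  # ...and to the attribute name
    length: 32,
    hash: "sha384"
  )
end

email_key = derive_key(master_key, "users", "email_ciphertext")
phone_key = derive_key(master_key, "users", "phone_ciphertext")
puts email_key != phone_key # each field gets its own key
```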
52 |
53 | You can still specify keys for certain fields if you prefer, but it’s no longer required. Lockbox also works with [KMS Encrypted](https://github.com/ankane/kms_encrypted) if you want to use a key management service to manage your keys.
54 |
55 | ## Built-In Key Rotation
56 |
57 | It’s good security hygiene to rotate your encryption keys from time to time. Lockbox makes this easy by allowing you to specify previous versions of keys and algorithms:
58 |
59 | ```ruby
60 | class User < ApplicationRecord
61 | encrypts :email, previous_versions: [{key: previous_key}]
62 | end
63 | ```
64 |
65 | New data is encrypted with the new key and algorithm, while older data can still be decrypted.
66 |
67 | ## Cleaner Schema
68 |
69 | [attr_encrypted](https://github.com/attr-encrypted/attr_encrypted), the de facto encryption library for database fields, uses two fields for each encrypted attribute: one for the ciphertext and another for the initialization vector.
70 |
71 | ```ruby
72 | encrypted_email
73 | encrypted_email_iv
74 | ```
75 |
76 | However, it’s possible to store both in a single field for a cleaner schema.
77 |
78 | ```ruby
79 | email_ciphertext
80 | ```
81 |
82 | ## Hybrid Cryptography
83 |
84 | Hybrid cryptography allows certain servers to encrypt data without the ability to decrypt it. This can do a better job [protecting data](https://ankane.org/decryption-keys) than symmetric cryptography when you can use it. Lockbox makes it just as easy to use hybrid cryptography.
85 |
86 | ```ruby
87 | class User < ApplicationRecord
88 | encrypts :email, algorithm: "hybrid", encryption_key: encryption_key, decryption_key: decryption_key
89 | end
90 | ```
91 |
92 | ## Updates
93 |
94 | Since this post was originally published:
95 |
96 | - Lockbox also supports [types](https://ankane.org/lockbox-types)
97 | - Here’s how to [encrypt user email addresses](https://ankane.org/securing-user-emails-lockbox)
98 | - Lockbox supports [Mongoid](https://ankane.org/modern-encryption-mongoid)
99 |
100 | ## Summary
101 |
102 | You’ve now seen what Lockbox brings to encryption for Rails. To summarize, it:
103 |
104 | - Is hard to misuse
105 | - Works with database fields, files, and strings
106 | - Makes it easy to migrate existing data without downtime
107 | - Maximizes compatibility with existing code and libraries
108 | - Uses modern algorithms
109 | - Requires you to only manage a single encryption key
110 | - Makes key rotation easy
111 | - Stores encrypted data in a single field
112 | - Supports hybrid cryptography
113 |
114 | Try out [Lockbox](https://github.com/ankane/lockbox) today.
115 |
116 | *Already use a library for encryption? No worries, it’s [easy to migrate](https://github.com/ankane/lockbox#migrating-from-another-library).*
117 |
--------------------------------------------------------------------------------
/archive/decryption-keys.md:
--------------------------------------------------------------------------------
1 | # Why and How to Keep Your Decryption Keys Off Web Servers
2 |
3 | 
4 |
5 | Suppose a worst-case scenario happens: an attacker finds a remote code execution vulnerability and creates a [reverse shell](https://hackernoon.com/reverse-shell-cf154dfee6bd) on one of your web servers. They then find the database credentials, connect to your database, and steal the data.
6 |
7 | For unencrypted data and data encrypted at the storage level, it’s game over. The attacker has it all. If data is encrypted at the application level with symmetric encryption but the encryption key is accessible from the server, it’s exactly the same. The attacker has all they need to decrypt the data offline.
8 |
9 | This is the case whether you store the encryption key in configuration management or an environment variable, or load it dynamically from an outside source. If your app can access the key, it’s vulnerable to compromise.
10 |
11 | The best way to defend against this attack is to make sure the compromised server isn’t able to decrypt data. Web servers are typically the most exposed to attacks. If your web servers accept sensitive data but don’t need to show it in its entirety back to users, they should be able to encrypt the data and write it to the database, but not decrypt it. The data can be decrypted and processed by background workers that don’t allow inbound traffic.
12 |
13 | You likely can’t do this for all of your data, but you should do it for all of the data you can. Sometimes it’s possible to show just partial information back to users, as is universally done for saved credit cards.
14 |
15 | 
16 |
17 | In these cases, you can store the partial data in a separate field which web servers can decrypt, while not allowing them to decrypt the full data.
18 |
19 | ## Practical Example
20 |
21 | Suppose we have a service that sends text messages to customers. Customers enter their phone number through the website or mobile app.
22 |
23 | We can set up web servers so they can only encrypt phone numbers. Text messages can be sent through background jobs which run on a different set of servers - ones that can decrypt and don’t allow inbound traffic. If internal employees need to view full phone numbers, they can use a separate set of web servers that are only accessible through the company VPN.
24 |
25 | Servers | Encrypt | Decrypt | Notes
26 | --- | --- | --- | ---
27 | Customer web servers | ✓ | |
28 | Background workers | ✓ | ✓ | No inbound traffic
29 | Internal web servers | ✓ | ✓ | Requires VPN
30 |
31 | If customers need to see their saved phone numbers, you can show them the last 4 digits, which are stored in a separate field.
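
For instance, the displayable part can be split into its own field when the number is saved — the field name here is hypothetical:

```ruby
phone = "+15551234567"

# Stored alongside the encrypted full number; safe to show back to users
phone_last4 = phone[-4..-1]

puts phone_last4 # "4567"
```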
32 |
33 | ## Approaches
34 |
35 | Two approaches you can take to accomplish this are:
36 |
37 | 1. Hybrid cryptography
38 | 2. Cryptography as a service
39 |
40 | ## Hybrid Cryptography
41 |
42 | Public key cryptography, or asymmetric cryptography, uses different keys to perform encryption and decryption. Servers that need to encrypt have the encryption key and servers that need to decrypt have the decryption key.
43 |
44 | However, public key cryptography is much less efficient than symmetric cryptography, so most implementations combine the two. They use public key cryptography to exchange a symmetric key, and symmetric cryptography to encrypt the data. This is called hybrid cryptography, and it’s how TLS and GPG work.
45 |
46 | X25519 is a modern key exchange algorithm that’s [widely deployed](https://ianix.com/pub/curve25519-deployment.html) and [currently recommended](https://paragonie.com/blog/2019/03/definitive-2019-guide-cryptographic-key-sizes-and-algorithm-recommendations#after-fold).
47 |
48 | [Libsodium](https://libsodium.gitbook.io/doc/), which uses X25519, is a great option for hybrid cryptography in applications. It has [libraries](https://libsodium.gitbook.io/doc/bindings_for_other_languages) for most languages.
49 |
50 | ## Cryptography as a Service
51 |
52 | Another approach is to use a service to perform encryption and decryption. This service can allow some sets of servers to encrypt and others to decrypt. You could write your own (micro)service, but there are a number of existing solutions, often called key management services (KMS).
53 |
54 | - [Vault](https://www.vaultproject.io/)
55 | - [AWS KMS](https://aws.amazon.com/kms/)
56 | - [Google Cloud KMS](https://cloud.google.com/kms/)
57 |
58 | These services don’t store the encrypted data - they just encrypt and decrypt on-demand. You can either encrypt data directly with the KMS or use envelope encryption.
59 |
60 | ### Direct Encryption
61 |
62 | With direct encryption, you don’t need to set up encryption in your app. Whenever you need to encrypt or decrypt data, simply send the data to the KMS.
63 |
64 | However, this has a few downsides. It exposes the unencrypted data to the KMS, which is disastrous if the KMS alone is breached. It’s also less efficient for large files and hosted services have a fairly low limit on the size of data you can encrypt.
65 |
66 | ### Envelope Encryption
67 |
68 | Another approach is envelope encryption, which addresses the issues above but requires encryption in your app.
69 |
70 | To encrypt, generate a random encryption key, known as a data encryption key (DEK), and use it to encrypt the data. Then encrypt the DEK with the KMS and store the encrypted version.
71 |
72 | To decrypt, decrypt the DEK with the KMS and then use it to decrypt the data. This way, the KMS only ever sees the DEK.
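
Here’s a sketch of envelope encryption in Ruby — the `kms_encrypt`/`kms_decrypt` functions are local stand-ins for real KMS API calls (in practice, the KMS key never leaves the service):

```ruby
require "openssl"
require "securerandom"

def aes_gcm_encrypt(key, plaintext)
  cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
  cipher.key = key
  iv = cipher.random_iv                  # 12 bytes for GCM
  ciphertext = cipher.update(plaintext) + cipher.final
  iv + cipher.auth_tag + ciphertext      # 12-byte IV, 16-byte tag, data
end

def aes_gcm_decrypt(key, blob)
  cipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
  cipher.key = key
  cipher.iv = blob[0, 12]
  cipher.auth_tag = blob[12, 16]
  cipher.update(blob[28..-1]) + cipher.final
end

# Stand-in for a KMS: only the service would hold KMS_KEY
KMS_KEY = SecureRandom.random_bytes(32)

def kms_encrypt(dek)
  aes_gcm_encrypt(KMS_KEY, dek)
end

def kms_decrypt(blob)
  aes_gcm_decrypt(KMS_KEY, blob)
end

# Encrypt: a random DEK encrypts the data; the KMS wraps only the DEK
dek = SecureRandom.random_bytes(32)
record = {
  blob: aes_gcm_encrypt(dek, "sensitive data"),
  encrypted_dek: kms_encrypt(dek)
}

# Decrypt: unwrap the DEK via the KMS, then decrypt the data locally
plaintext = aes_gcm_decrypt(kms_decrypt(record[:encrypted_dek]), record[:blob])
puts plaintext # "sensitive data"
```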
73 |
74 | ### Auditing
75 |
76 | Another benefit of cryptography as a service is auditing. You can see exactly when data or DEKs are decrypted, and there’s no way to get around the auditing without compromising the KMS. This makes it easy to tell which information was accessed during a breach.
77 |
78 | ## Conclusion
79 |
80 | We don’t encrypt data for a sunny day. You’ve now seen two approaches to limit damage in the event of a web server breach.
81 |
82 | If you use Ruby on Rails, I’ve written a companion piece on [hybrid cryptography](/hybrid-cryptography-rails) with code for how to do this.
83 |
--------------------------------------------------------------------------------
/archive/gem-patterns.md:
--------------------------------------------------------------------------------
1 | # Gem Patterns
2 |
3 | I’ve created [a few](https://ankane.org/opensource?language=Ruby) Ruby gems over the years, and there are a number of patterns I’ve found myself repeating that I wanted to share. I didn’t invent them, but have long forgotten where I first saw them. They are:
4 |
5 | - [Rails Migrations](#rails-migrations)
6 | - [Rails Dependencies](#rails-dependencies)
7 | - [Testing Against Multiple Dependency Versions](#testing-against-multiple-dependency-versions)
8 | - [Testing Against Rails](#testing-against-rails)
9 | - [Coding Your Gemspec](#coding-your-gemspec)
10 |
11 | Let’s dig into each of them. In the examples, the gem is called `hello`.
12 |
13 | ## Rails Migrations
14 |
15 | Create a template in `lib/generators/hello/templates/migration.rb.tt`:
16 |
17 | ```ruby
18 | class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
19 | def change
20 | # your migration
21 | end
22 | end
23 | ```
24 |
25 | The `.tt` extension denotes Thor template. [Thor](https://github.com/erikhuda/thor) is what Rails uses under the hood.
26 |
27 | Add `lib/generators/hello/install_generator.rb`
28 |
29 | ```ruby
30 | require "rails/generators/active_record"
31 |
32 | module Hello
33 | module Generators
34 | class InstallGenerator < Rails::Generators::Base
35 | include ActiveRecord::Generators::Migration
36 | source_root File.join(__dir__, "templates")
37 |
38 | def copy_migration
39 | migration_template "migration.rb", "db/migrate/install_hello.rb", migration_version: migration_version
40 | end
41 |
42 | def migration_version
43 | "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
44 | end
45 | end
46 | end
47 | end
48 | ```
49 |
50 | This lets you run:
51 |
52 | ```sh
53 | rails generate hello:install
54 | ```
55 |
56 | Change the generator path and class name to match your gem. They must exactly match what Rails expects, or the generator won’t be found.
57 |
58 | [Example](https://github.com/ankane/archer/blob/master/lib/generators/archer/install_generator.rb)
59 |
60 | ## Rails Dependencies
61 |
62 | If your gem depends on Rails, add `railties` and any other Rails libraries it needs.
63 |
64 | ```ruby
65 | spec.add_dependency "railties", ">= 5"
66 | spec.add_dependency "activerecord", ">= 5"
67 | ```
68 |
69 | I typically require a [supported version](https://rubyonrails.org/security/) of Rails.
70 |
71 | In code, don’t require Rails gems directly, as this can cause them to load early and introduce issues.
72 |
73 | ```ruby
74 | require "active_record" # bad!!
75 |
76 | ActiveRecord::Base.include(Hello::Model)
77 | ```
78 |
79 | Instead, do:
80 |
81 | ```ruby
82 | require "active_support"
83 |
84 | ActiveSupport.on_load(:active_record) do
85 | include Hello::Model
86 | end
87 | ```
88 |
89 | [Example](https://github.com/ankane/hightop/blob/master/lib/hightop.rb)
90 |
91 | ## Testing Against Multiple Dependency Versions
92 |
93 | If your gem has dependencies, you may want to test against multiple versions of a dependency. For instance, you may want to test against multiple versions of Active Record.
94 |
95 | To do this, create a `test/gemfiles` directory (or `spec/gemfiles` if you use RSpec).
96 |
97 | Create `test/gemfiles/activerecord50.gemfile` with:
98 |
99 | ```ruby
100 | source "https://rubygems.org"
101 |
102 | gemspec path: "../../"
103 |
104 | gem "activerecord", "~> 5.0.0"
105 | ```
106 |
107 | Install with:
108 |
109 | ```sh
110 | BUNDLE_GEMFILE=test/gemfiles/activerecord50.gemfile bundle install
111 | ```
112 |
113 | And run with:
114 |
115 | ```sh
116 | BUNDLE_GEMFILE=test/gemfiles/activerecord50.gemfile bundle exec rake
117 | ```
118 |
119 | [Example](https://github.com/ankane/groupdate/tree/master/test/gemfiles)
120 |
121 | On Travis CI, you can add to `.travis.yml`:
122 |
123 | ```yml
124 | gemfile:
125 | - Gemfile
126 | - test/gemfiles/activerecord50.gemfile
127 | ```
128 |
129 | You can also use a library like [Appraisal](https://github.com/thoughtbot/appraisal) to help generate and run these files.
130 |
131 | ## Testing Against Rails
132 |
133 | To test against Rails, use a library like [Combustion](https://github.com/pat/combustion). It’s designed to be used with RSpec, but I haven’t had any issues with Minitest. Combustion generates some files that aren’t needed, so I just delete them.
134 |
135 | ```ruby
136 | Combustion.initialize! :all
137 | ```
138 |
139 | [Example](https://github.com/ankane/field_test/tree/master/test)
140 |
141 | ## Coding Your Gemspec
142 |
143 | There are a variety of ways to code your gemspec. Here’s the one I like to use:
144 |
145 | ```ruby
146 | require_relative "lib/hello/version"
147 |
148 | Gem::Specification.new do |spec|
149 | spec.name = "hello"
150 | spec.version = Hello::VERSION
151 | spec.summary = "Hello world"
152 | spec.homepage = "https://github.com/you/hello"
153 | spec.license = "MIT"
154 |
155 | spec.author = "Your Name"
156 | spec.email = "you@example.com"
157 |
158 | spec.files = Dir["*.{md,txt}", "{lib}/**/*"]
159 | spec.require_path = "lib"
160 |
161 | spec.required_ruby_version = ">= 2.4"
162 |
163 | spec.add_dependency "activesupport", ">= 5"
164 |
165 | spec.add_development_dependency "bundler"
166 | spec.add_development_dependency "rake"
167 | end
168 | ```
169 |
170 | Change `files` if your gem has `app`, `config`, or `vendor` directories. I typically set the minimum Ruby version to the [oldest supported version](https://www.ruby-lang.org/en/downloads/branches/).
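For instance, a gem with Rails-style directories might use something like (the directory list here is illustrative):

```ruby
spec.files = Dir["*.{md,txt}", "{app,config,lib,vendor}/**/*"]
```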
171 |
172 | If your gem has an executable file, add:
173 |
174 | ```ruby
175 | spec.bindir = "exe"
176 | spec.executables = ["hello"]
177 | ```
178 |
179 | [Don’t check in](https://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/) `Gemfile.lock`.
180 |
181 | Some gems have moved development dependencies entirely out of the gemspec and into the Gemfile, which is another option.
182 |
183 | ## Summary
184 |
185 | You’ve now seen five patterns that can be useful for Ruby gems. Now go build something awesome!
186 |
--------------------------------------------------------------------------------
/archive/securing-user-emails-in-rails.md:
--------------------------------------------------------------------------------
1 | # Securing User Emails in Rails
2 |
3 | ---
4 |
5 | *There is an [updated version](https://ankane.org/securing-user-emails-lockbox) of this post.*
6 |
7 | ---
8 |
9 | The GDPR goes into effect next Friday. Whether or not you serve European residents, it’s a great reminder that we have the responsibility to build systems in a way that protects user privacy.
10 |
11 | Email addresses are a common form of personal data, and they’re often stored unencrypted. If an attacker gains access to the database or backups, emails will be compromised.
12 |
13 | This post will walk you through a practical approach to protecting emails. It works with [Devise](https://github.com/plataformatec/devise), the most popular authentication framework for Rails, and is general enough to work with others.
14 |
15 | ## Strategy
16 |
17 | We’ll use two concepts to make this happen: encryption and blind indexing. Encryption gives us a way to securely store the data, and blind indexing provides a way to look it up.
18 |
19 | Blind indexing works by computing a hash of the data. You’re probably familiar with hash functions like MD5 and SHA1. Rather than one of these, we use a hash function that takes a secret key and uses [key stretching](https://en.wikipedia.org/wiki/Key_stretching) to slow down brute force attempts. You can read more about [blind indexing here](https://www.sitepoint.com/how-to-search-on-securely-encrypted-database-fields/).
20 |
21 | We’ll use the [attr_encrypted gem](https://github.com/attr-encrypted/attr_encrypted) for encryption and the [blind_index gem](https://github.com/ankane/blind_index) for blind indexing.
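To make blind indexing concrete, here's a minimal sketch using PBKDF2 from Ruby's standard library. It illustrates the idea only; it's not the exact scheme the blind_index gem uses, and the key and iteration count are placeholders.

```ruby
require "openssl"

# A blind index is a slow, keyed hash of the plaintext value.
def blind_index(value, key)
  OpenSSL::PKCS5.pbkdf2_hmac(
    value,                      # plaintext to index
    key,                        # secret key (a real scheme may use a separate salt)
    20_000,                     # iterations to slow down brute force
    32,                         # output length in bytes
    OpenSSL::Digest.new("SHA256")
  ).unpack1("H*")
end

bidx = blind_index("test@example.org", "secret-key")
```

Because the same value always hashes to the same index, an equality lookup on the indexed column finds the row without ever decrypting.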
22 |
23 | ## Instructions
24 |
25 | Let’s assume you have a `User` model with an email field.
26 |
27 | Add to your Gemfile:
28 |
29 | ```ruby
30 | gem 'attr_encrypted'
31 | gem 'blind_index'
32 | ```
33 |
34 | And run:
35 |
36 | ```sh
37 | bundle install
38 | ```
39 |
40 | Next, let’s replace the email field with an encrypted version. Create a migration:
41 |
42 | ```sh
43 | rails g migration add_encrypted_email_to_users
44 | ```
45 |
46 | And add:
47 |
48 | ```ruby
49 | class AddEncryptedEmailToUsers < ActiveRecord::Migration[5.2]
50 | def change
51 | # encrypted data
52 | add_column :users, :encrypted_email, :string
53 | add_column :users, :encrypted_email_iv, :string
54 | add_index :users, :encrypted_email_iv, unique: true
55 |
56 | # blind index
57 | add_column :users, :encrypted_email_bidx, :string
58 | add_index :users, :encrypted_email_bidx, unique: true
59 |
60 | # drop original here unless we have existing users
61 | remove_column :users, :email
62 | end
63 | end
64 | ```
65 |
66 | We use one column to store the encrypted data, one to store [the IV](http://www.cryptofails.com/post/70059609995/crypto-noobs-1-initialization-vectors), and another to store the blind index.
67 |
68 | We add a unique index on the IV since reusing an IV with the same key in AES-GCM (the default algorithm for attr_encrypted) will [leak the key](https://csrc.nist.gov/csrc/media/projects/block-cipher-techniques/documents/bcm/joux_comments.pdf).
69 |
70 | Then migrate:
71 |
72 | ```sh
73 | rails db:migrate
74 | ```
75 |
76 | Next, generate two keys: one for encryption and one for the blind index. Store them as hex-encoded strings in environment variables ([dotenv](https://github.com/bkeepers/dotenv) is great for this). *Do not commit them to source control.* [Here's an explanation](https://ankane.org/encryption-keys) of why `pack` is used when reading them. You can generate keys in the Rails console with:
77 |
78 | ```ruby
79 | SecureRandom.hex(32)
80 | ```
81 |
82 | For development, you can use these:
83 |
84 | ```sh
85 | EMAIL_ENCRYPTION_KEY=0000000000000000000000000000000000000000000000000000000000000000
86 | EMAIL_BLIND_INDEX_KEY=ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
87 | ```
88 |
89 | Add to your user model:
90 |
91 | ```ruby
92 | class User < ApplicationRecord
93 | attr_encrypted :email, key: [ENV["EMAIL_ENCRYPTION_KEY"]].pack("H*")
94 | blind_index :email, key: [ENV["EMAIL_BLIND_INDEX_KEY"]].pack("H*")
95 | end
96 | ```
97 |
98 | > `pack` is used to decode the hex value
99 |
100 | Create a new user and confirm it works.
101 |
102 | ## Existing Users
103 |
104 | If you have existing users, we need to backfill the data before dropping the email column. We temporarily use a virtual attribute - `protected_email` - so we can backfill without downtime.
105 |
106 | ```ruby
107 | class User < ApplicationRecord
108 | attr_encrypted :protected_email, key: [ENV["EMAIL_ENCRYPTION_KEY"]].pack("H*"), attribute: "encrypted_email"
109 | blind_index :protected_email, key: [ENV["EMAIL_BLIND_INDEX_KEY"]].pack("H*"), attribute: "email", bidx_attribute: "encrypted_email_bidx"
110 |
111 | before_validation :protect_email, if: -> { email_changed? }
112 |
113 | def protect_email
114 | self.protected_email = email
115 | compute_protected_email_bidx
116 | end
117 | end
118 | ```
119 |
120 | Backfill the data in the Rails console:
121 |
122 | ```ruby
123 | User.where(encrypted_email: nil).find_each do |user|
124 | user.protect_email
125 | user.save!
126 | end
127 | ```
128 |
129 | Then update the model to the desired state:
130 |
131 | ```ruby
132 | class User < ApplicationRecord
133 | attr_encrypted :email, key: [ENV["EMAIL_ENCRYPTION_KEY"]].pack("H*")
134 | blind_index :email, key: [ENV["EMAIL_BLIND_INDEX_KEY"]].pack("H*")
135 |
136 | # remove this line after dropping email column
137 | self.ignored_columns = ["email"]
138 | end
139 | ```
140 |
141 | Finally, drop the email column.
142 |
143 | ## Logging
144 |
145 | We also need to make sure email addresses aren’t logged. Add to `config/initializers/filter_parameter_logging.rb`:
146 |
147 | ```ruby
148 | Rails.application.config.filter_parameters += [:email]
149 | ```
150 |
151 | Use [Logstop](https://github.com/ankane/logstop) to filter anything that looks like an email address as an extra line of defense. Add to your Gemfile:
152 |
153 | ```ruby
154 | gem 'logstop'
155 | ```
156 |
157 | And create `config/initializers/logstop.rb` with:
158 |
159 | ```ruby
160 | Logstop.guard(Rails.logger)
161 | ```
162 |
163 | ## Summary
164 |
165 | We now have a way to encrypt data and query for exact matches. You can apply this same approach to other fields as well. For more security, consider a [key management service](https://github.com/ankane/kms_encrypted) to manage your keys.
166 |
--------------------------------------------------------------------------------
/archive/new-ml-gems.md:
--------------------------------------------------------------------------------
1 | # 16 New ML Gems for Ruby
2 |
3 |
4 |
5 |
6 |
7 | In August, I set out to improve the machine learning ecosystem for Ruby. I wasn’t sure where it would go. Over the next 5 months, I ended up releasing 16 libraries and learned a lot along the way. I wanted to share some of that knowledge and introduce some of the libraries you can now use in Ruby.
8 |
9 | ## The Theme
10 |
11 | There are many great machine learning libraries for Python, so a natural place to start was to see what it’d take to bring them to Ruby. Thanks to a common theme, it turned out to be a lot less work than expected.
12 |
13 | ML libraries want to be fast. This means less time waiting and more time iterating. However, interpreted languages like Python and Ruby aren’t particularly fast. How do these libraries overcome this?
14 |
15 | The key is they do most of the work in a compiled language - typically C++ - and have wrappers for other languages like Python.
16 |
17 | This was really great news. The same approach and code could be used for Ruby.
18 |
19 | ## The Patterns
20 |
21 | Ruby has a number of ways to call C and C++ code.
22 |
23 | Native extensions are one method. They’re written in C or C++ and use [Ruby’s C API](https://silverhammermba.github.io/emberb/c/). You may have noticed gems with native extensions taking longer to install, as they need to compile.
24 |
25 | ```c
26 | void Init_stats()
27 | {
28 | VALUE mStats = rb_define_module("Stats");
29 | rb_define_module_function(mStats, "mean", mean, 2);
30 | }
31 | ```
32 |
33 | A more general way for one language to call another is a foreign function interface, or FFI. It requires a C API (due to C++ name mangling), which many machine learning libraries had. An advantage of FFI is you can define the interface in the host language - in our case, Ruby.
34 |
35 | Ruby supports FFI with Fiddle. It was added in Ruby 1.9, but appears to be [“the Ruby standard library’s best kept secret.”](https://www.honeybadger.io/blog/use-any-c-library-from-ruby-via-fiddle-the-ruby-standard-librarys-best-kept-secret/)
36 |
37 | ```ruby
38 | module Stats
39 | extend Fiddle::Importer
40 | dlload "libstats.so"
41 | extern "double mean(int a, int b)"
42 | end
43 | ```
44 |
45 | There’s also the [FFI](https://github.com/ffi/ffi) gem, which provides higher-level functionality and overcomes some limitations of Fiddle (like the ability to pass structs by value).
46 |
47 | ```ruby
48 | module Stats
49 | extend FFI::Library
50 | ffi_lib "stats"
51 | attach_function :mean, [:int, :int], :double
52 | end
53 | ```
54 |
55 | For libraries without a C API, [Rice](https://github.com/jasonroelofs/rice) provides a really nice way to bind C++ code (similar to Python’s pybind11).
56 |
57 | ```cpp
58 | void Init_stats()
59 | {
60 | Module mStats = define_module("Stats");
61 | mStats.define_singleton_method("mean", &mean);
62 | }
63 | ```
64 |
65 | Another approach is SWIG (Simplified Wrapper and Interface Generator). You create an interface file and then run SWIG to generate the bindings. Gusto has a [good tutorial](https://engineering.gusto.com/simple-ruby-c-extensions-with-swig/) on this.
66 |
67 | ```swig
68 | %module stats
69 |
70 | double mean(int, int);
71 | ```
72 |
73 | There’s also [Rubex](https://github.com/SciRuby/rubex), which lets you write Ruby-like code that compiles to C (similar to Python’s Cython). It also provides the ability to interface with C libraries.
74 |
75 | ```ruby
76 | lib ""
77 | double mean(int, int)
78 | end
79 | ```
80 |
81 | None of the approaches above are specific to machine learning, so you can use them with any C or C++ library.
82 |
83 | ## The Libraries
84 |
85 | Libraries were chosen based on popularity and performance. Many have a similar interface to their Python counterpart to make it easy to follow existing tutorials. Libraries are broken down into categories below with brief descriptions.
86 |
87 | ### Gradient Boosting
88 |
89 | [XGBoost](https://github.com/ankane/xgb) and [LightGBM](https://github.com/ankane/lightgbm) are gradient boosting libraries. Gradient boosting is a powerful technique for building predictive models that fits many small decision trees that together make robust predictions, even with outliers and missing values. Gradient boosting performs well on tabular data.
90 |
91 | ### Deep Learning
92 |
93 | [Torch-rb](https://github.com/ankane/torch-rb) and [TensorFlow](https://github.com/ankane/tensorflow) are deep learning libraries. Torch-rb is built on LibTorch, the library that powers PyTorch. Deep learning has been very successful in areas like image recognition and natural language processing.
94 |
95 | ### Recommendations
96 |
97 | [Disco](https://github.com/ankane/disco) is a recommendation library. It looks at ratings or actions from users to predict other items they might like, known as collaborative filtering. Matrix factorization is a common way to accomplish this.
98 |
99 | [LIBMF](https://github.com/ankane/libmf) is a high-performance matrix factorization library.
100 |
101 | Collaborative filtering can also find similar users and items. If you have a large number of users or items, an approximate nearest neighbor algorithm can speed up the search. Spotify [does this](https://github.com/spotify/annoy#background) for music recommendations.
102 |
103 | [NGT](https://github.com/ankane/ngt) is an approximate nearest neighbor library that performs extremely well on benchmarks (in Python/C++).
104 |
105 | Image from ANN Benchmarks, MIT license
112 |
113 | Another promising technique for recommendations is factorization machines. The traditional approach to collaborative filtering builds a model exclusively from past ratings or actions. However, you may have additional *side information* about users or items. Factorization machines can incorporate this data. They can also perform classification and regression.
114 |
115 | [xLearn](https://github.com/ankane/xlearn) is a high-performance library for factorization machines.
116 |
117 | ### Optimization
118 |
119 | Optimization finds the best solution to a problem out of many possible solutions. Scheduling and vehicle routing are two common tasks. Optimization problems have an objective function to minimize (or maximize) and a set of constraints.
120 |
121 | Linear programming is an approach you can use when the objective function and constraints are linear. Here’s a really good [introductory series](https://www.youtube.com/watch?v=0TD9EQcheZM) if you want to learn more.
122 |
123 | [SCS](https://github.com/ankane/scs) is a library that can solve [many types](https://www.cvxpy.org/tutorial/advanced/index.html#choosing-a-solver) of optimization problems.
124 |
125 | [OSQP](https://github.com/ankane/osqp) is another that’s specifically designed for quadratic problems.
126 |
127 | ### Text Classification
128 |
129 | [fastText](https://github.com/ankane/fasttext) is a text classification and word representation library. It can label documents with one or more categories, which is useful for content tagging, spam filtering, and language detection. It can also compute word vectors, which can be compared to find similar words and analogies.
130 |
131 | ### Interoperability
132 |
133 | It’s nice when languages play nicely together.
134 |
135 | [ONNX Runtime](https://github.com/ankane/onnxruntime) is a scoring engine for ML models. You can build a model in one language, save it in the ONNX format, and run it in another. Here’s [an example](/tensorflow-ruby).
136 |
137 | [Npy](https://github.com/ankane/npy) is a library for saving and loading NumPy `npy` and `npz` files. It uses [Numo](/numo) for multi-dimensional arrays.
138 |
139 | ### Others
140 |
141 | [Vowpal Wabbit](https://github.com/ankane/vowpalwabbit) specializes in online learning. It’s great for reinforcement learning as well as supervised learning where you want to train a model incrementally instead of all at once. This is nice when you have a lot of data.
142 |
143 | [ThunderSVM](https://github.com/ankane/thundersvm) is an SVM library that runs in parallel on either CPUs or GPUs.
144 |
145 | [GSLR](https://github.com/ankane/gslr) is a linear regression library powered by GSL that supports both ordinary least squares and ridge regression. It can be used alone or to improve the performance of [Eps](https://github.com/ankane/eps).
146 |
147 | ## Shout-out
148 |
149 | I wanted to also give a shout-out to another library that entered the scene in 2019.
150 |
151 | [Rumale](https://github.com/yoshoku/rumale) is a machine learning library that supports many, many algorithms, similar to Python’s Scikit-learn. Thanks [@yoshoku](https://github.com/yoshoku) for the amazing work!
152 |
153 | ## Final Word
154 |
155 | There are now many state-of-the-art machine learning libraries available for Ruby. If you’re a Ruby engineer who’s interested in machine learning, now’s a good time to try it. Also, if you come across a C or C++ library you want to use in Ruby, you’ve seen a few ways to do it. Let’s make Ruby a great language for machine learning.
156 |
--------------------------------------------------------------------------------
/archive/rails-meet-data-science.md:
--------------------------------------------------------------------------------
1 | # Rails, Meet Data Science
2 |
3 | 
4 |
5 | Organizations today have more data than ever. Predictive modeling is a powerful way to use this data to solve problems and create better experiences for customers. For instance, do a better job keeping items in stock by predicting demand or lower costs by predicting fraud. If you use Ruby on Rails, it can be tough to know how to incorporate this into your app.
6 |
7 | We’ll go over four patterns you can use for prediction with Rails. We used all four successfully during my time at [Instacart](https://www.instacart.com). They can work when you have no data scientists (when I started) as well as when you have a strong data science team.
8 |
9 | ## Patterns
10 |
11 | With predictive modeling, you first train a model and then use it to predict. The patterns can be grouped by the language used for each task:
12 |
13 | Pattern | Train | Predict
14 | --- | --- | ---
15 | 1 | 3rd Party | 3rd Party
16 | 2 | Ruby | Ruby
17 | 3 | Another Language | Ruby
18 | 4 | Another Language | Another Language
19 |
20 | Two popular languages for data science are Python and R.
21 |
22 | You can decide which pattern to use for each model you build. We’ll walk through the approaches and discuss the pros and cons of each.
23 |
24 | ## Pattern 1: Use a 3rd Party
25 |
26 | Before building a model in-house, it’s good to see what already exists. There are a number of external services you can use for specific problems. Here are a few:
27 |
28 | - Fraud - [Sift Science](https://siftscience.com/)
29 | - Recommendations - [Tamber](https://tamber.com/)
30 | - Anomaly Detection & Forecasting - [Trend](https://trendapi.org/)
31 | - NLP - [Amazon Comprehend](https://aws.amazon.com/comprehend/) and [Google Cloud Natural Language](https://cloud.google.com/natural-language/)
32 | - Vision - [AWS Rekognition](https://aws.amazon.com/rekognition/) and [Google Cloud Vision](https://cloud.google.com/vision/)
33 |
34 | Pros
35 |
36 | - Get domain knowledge from the company
37 | - Fast to implement and easy to maintain
38 |
39 | Cons
40 |
41 | - Not easy to iterate if it doesn’t fit your needs
42 | - Vendor lock-in
43 |
44 | ## Pattern 2: Train and Predict in Ruby
45 |
46 | Ruby has a number of libraries for building simple models. Simple models can perform very well since a large part of model building is [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering). This is a great option if there are no data scientists in your company or on your team. A developer can own the model end-to-end, which is great for speed and iteration.
47 |
48 | Here are a few libraries for building models in Ruby:
49 |
50 | - [Eps](https://github.com/ankane/eps) - good for beginners
51 | - [Rumale](https://github.com/yoshoku/rumale) - good for advanced users
52 | - [Xgb](https://github.com/ankane/xgb) - XGBoost
53 | - [LightGBM](https://github.com/ankane/lightgbm) - LightGBM
54 | - And [many more](https://github.com/arbox/machine-learning-with-ruby)
55 |
56 | Once a model is trained, you’ll need to store it. You can use methods provided by the library, or marshal if none exist. You can store the models as files or in the database.
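As a sketch of the Marshal approach (the model class here is a stand-in, not a real library model):

```ruby
# Stand-in for a trained model; real libraries provide their own classes.
LinearModel = Struct.new(:coefficients) do
  def predict(features)
    features.zip(coefficients).sum { |x, w| x * w }
  end
end

model = LinearModel.new([0.5, 2.0])

# Store the model as a file (a binary database column works too)
File.binwrite("model.bin", Marshal.dump(model))

# Load it later to predict
loaded = Marshal.load(File.binread("model.bin"))
loaded.predict([1.0, 3.0]) # => 6.5
```

Keep in mind that Marshal output is tied to the Ruby and class versions used to dump it, so re-dump models when those change.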
57 |
58 | Be sure to commit the code used to train models so you can update them with newer data in the future. The Rails console is a decent place to create them, or use a [Jupyter notebook](https://jupyter.org/) running [IRuby](https://github.com/SciRuby/iruby) for better visualizations (see [setup instructions for Rails](https://ankane.org/jupyter-rails)).
59 |
60 | Pros
61 |
62 | - Simple models can perform well
63 | - No need to introduce a new language
64 |
65 | Cons
66 |
67 | - Limited tools for building models
68 | - Limited model selection
69 | - Many people who have experience building models don’t know Ruby
70 |
71 | ## Pattern 3: Train in Another Language, Predict in Ruby
72 |
73 | Ruby is getting better for data science thanks to [SciRuby](https://github.com/SciRuby/sciruby). However, languages like R and Python currently have much better tools. Also, many people who have experience building models don’t know Ruby.
74 |
75 | Luckily, you can build models in another language and predict in Ruby. This way, you can use more advanced tools for visualization, validation, and tuning without adding complexity to your production stack. If you don’t have data scientists, you can use this pattern to contract with one.
76 |
77 | Here are models that can currently predict in Ruby:
78 |
79 | - [Eps](https://github.com/ankane/eps) - Linear Regression, Naive Bayes
80 | - [Scoruby](https://github.com/asafschers/scoruby) - Random Forest, GBM, Decision Tree, Naive Bayes
81 | - [Xgb](https://github.com/ankane/xgb) - XGBoost
82 | - [LightGBM](https://github.com/ankane/lightgbm) - LightGBM
83 |
84 | For this to work, models need to be stored in a shared format that both languages understand. PMML and PFA are two interchange formats. PFA is newer but has less adoption than PMML. Andrey Melentyev has a [great post](https://www.andrey-melentyev.com/model-interoperability.html) on the topic.
85 |
86 | Once again, it’s important that models are reproducible. This allows you to update them with newer data in the future. Be sure to follow software engineering best practices like:
87 |
88 | - Use source control (create a new repo or add to your existing repo)
89 | - Use a package manager for a reproducible environment
90 | - Keep credentials out of source control (use `.env` or `.Renviron`)
91 |
92 | Here are some tools you can use:
93 |
94 | Function | Python | R
95 | --- | --- | ---
96 | Package management | [Pipenv](https://pipenv.readthedocs.io/en/latest/) | [Jetpack](https://github.com/ankane/jetpack)
97 | Database access | [SQLAlchemy](https://www.sqlalchemy.org/) | [dbx](https://github.com/ankane/dbx)
98 | PMML export | [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) | [pmml](https://cran.r-project.org/package=pmml)
99 |
100 | One place to be careful is implementing the features in Ruby. It must be consistent with how they were implemented in training. To ensure this is correct, verify it programmatically. Create a CSV file with ids and predictions from the original model and confirm the Ruby predictions match. Here’s some [example code](https://github.com/ankane/eps#verifying).
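A sketch of that verification step, with a hypothetical predict function and inline CSV standing in for the exported file:

```ruby
require "csv"

# Hypothetical Ruby port of a model trained in another language
predict = ->(features) { features.sum >= 1.0 ? 1 : 0 }

# In practice, read the file exported from the original model
csv_data = <<~CSV
  id,f1,f2,prediction
  1,0.2,0.9,1
  2,0.1,0.3,0
CSV

mismatches = CSV.parse(csv_data, headers: true).reject do |row|
  predict.call([row["f1"].to_f, row["f2"].to_f]) == row["prediction"].to_i
end

puts mismatches.empty? ? "Predictions match" : "#{mismatches.size} mismatches"
```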
101 |
102 | Pros
103 |
104 | - Better tools for model building
105 | - No need to operate a new language in production
106 |
107 | Cons
108 |
109 | - Need to introduce a new language in development
110 | - Limited model selection
111 | - Need to create features in two languages
112 |
113 | ## Pattern 4: Train and Predict in Another Language
114 |
115 | The last option we’ll cover is doing both training and prediction outside Ruby. This is great if you have a team of data scientists who specialize in another language. This pattern allows data scientists to own models end-to-end.
116 |
117 | It also gives you access to models that are not available in Ruby. For instance, there are forecasting libraries like [Prophet](https://facebook.github.io/prophet/) and deep learning libraries like [TensorFlow](https://www.tensorflow.org/).
118 |
119 | The implementation depends on how predictions are generated. Two common ways are batch and real-time.
120 |
121 | ---
122 |
123 | ### Batch Predictions
124 |
125 | Batch predictions are generated asynchronously and are typically run on a regular interval. This can be every minute or once a week. An example is a daily job that updates demand forecasts for the following weeks. Predictions can be stored and later used by the Rails app as needed.
126 |
127 | Don’t be afraid to read and write directly to the database. While microservice design patterns caution against using the database as an API, we didn’t have many issues with it. When updating records, it’s also a good idea to write audits to see how predictions change over time.
128 |
129 | Jobs can be scheduled with cron, or ideally a distributed scheduler like [Mani](https://github.com/sherinkurian/mani) for high availability. If you need to let the Rails app know a job has completed, you can do this through your messaging system. HTTP works great if you don’t have one.
130 |
131 | ---
132 |
133 | ### Real-Time Predictions
134 |
135 | Real-time predictions are generated synchronously and are triggered by calls from the Rails app. An example is recommending items to a user at checkout based on what’s in their cart.
136 |
137 | HTTP is a common choice for retrieving predictions, but you can use a messaging system or even pipes. Great tools for HTTP are [Django](https://www.djangoproject.com/) and [Flask](http://flask.pocoo.org/) for Python and [Plumber](https://www.rplumber.io/) for R.
138 |
139 | ---
140 |
141 | As with the other patterns, follow best engineering practices. In addition to ones previously mentioned:
142 |
143 | - Use a framework, or at the very least a consistent project structure
144 | - Keep code [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
145 |
146 | Don’t be afraid to use Rails to manage the database schema. It’s easy enough for data scientists to learn to create and run migrations. Otherwise, you need to support another system for schema changes.
147 |
148 | To store models, you most likely won’t use an interchange format, since libraries can’t load them. Instead, use serialization specific to the language, like pickle in Python and serialize in R.
149 |
150 | If you’re deciding between Python and R, Python has more general-purpose libraries, so it’s easier to run in production.
151 |
152 | Pros
153 |
154 | - Larger selection of models available
155 | - Data scientists can own models end-to-end
156 |
157 | Cons
158 |
159 | - Need to run multiple languages in production
160 |
161 | ## Conclusion
162 |
163 | You’ve now seen four great patterns for bringing predictive models to Rails. Each has different trade-offs, so we recommend taking the simplest approach that works for you. No matter which you choose, make sure your models are reproducible.
164 |
165 | Happy modeling!
166 |
167 |
168 |
169 | Updates
170 |
171 | - May 2019: Added Rumale
172 | - August 2019: Added Xgb and LightGBM
173 |
--------------------------------------------------------------------------------
/archive/scaling-the-monolith.md:
--------------------------------------------------------------------------------
1 | # Scaling the Monolith
2 |
3 | Many companies start out with a single web application. As the team and codebase grow, things feel less organized and common tasks like booting the app and running the test suite take longer and longer. It can be tempting to turn to microservices to alleviate some of this pain. However, distributed systems add a significant amount of complexity and mental overhead.
4 |
5 | Before you decide to split apart your app, there are a number of tactics you can use to scale it [majestically](https://m.signalvnoise.com/the-majestic-monolith-29166d022228#.bst5vwy6r). Spend a significant amount of time trying to solve your existing problems before making big changes.
6 |
7 | The topics we’ll cover are:
8 |
9 | - [Code](#code)
10 | - [Errors](#errors)
11 | - [Boot Times & Memory](#boot-times-memory)
12 | - [Testing](#testing)
13 | - [Databases](#databases)
14 | - [Stability](#stability)
15 |
16 | The examples are geared towards Rails apps, but the principles apply to any codebase.
17 |
18 | ## Code
19 |
20 | Rails models and controllers tend to get larger and larger. Rails introduced [concerns](https://signalvnoise.com/posts/3372-put-chubby-models-on-a-diet-with-concerns) as one way to address this. Concerns allow you to pull out related logic into a separate file.
21 |
22 | Service objects are another nice pattern for this. Here’s [an example](https://hackernoon.com/service-objects-in-ruby-on-rails-and-you-79ca8a1c946e) of a service object. There’s not a standard way to create service objects, but it’s a good idea to decide on a convention for your app. You can use gems like [Interactor](https://github.com/collectiveidea/interactor) to establish one.
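For example, one common convention (the class and method names here are just illustrative) is a single `call` entry point per service:

```ruby
# One verb-named class per operation, with a single public entry point
class CalculateDiscount
  def self.call(subtotal)
    new(subtotal).call
  end

  def initialize(subtotal)
    @subtotal = subtotal
  end

  def call
    (@subtotal * 0.1).round(2)
  end
end

CalculateDiscount.call(50.0) # => 5.0
```

Whatever shape you pick, the win is consistency: every service is invoked and tested the same way.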
23 |
24 | Use namespaces to organize code.
25 |
26 | ```ruby
27 | class Admin::UsersController < Admin::BaseController
28 | end
29 | ```
30 |
31 | Some teams also prefer to use Rails engines, although I’m not a fan of this approach. Here’s a [good comparison](https://stackoverflow.com/a/29641532/1177228) of the pros and cons of each.
32 |
33 | ## Errors
34 |
35 | As the team grows, it’s important that errors get routed to the right place. You can use the [ownership](https://github.com/ankane/ownership) gem to help with this. Add it to controllers, jobs, and rake tasks.
36 |
37 | ```ruby
38 | class WelcomeJob < ApplicationJob
39 | owner :growth
40 | end
41 | ```
42 |
43 | `git blame` can help with assigning initial owners.
44 |
45 | ## Boot Times & Memory
46 |
47 | As your app accumulates more gems and files, its boot time and memory usage grow. There have been a number of projects over the years to speed up boot time. [Spring](https://github.com/rails/spring) was introduced in Rails 4.1 and keeps your app running in the background so it doesn’t have to boot every time you run a new command.
48 |
49 | Last year, Shopify released [Bootsnap](https://github.com/Shopify/bootsnap), which caches expensive loading computations. It’s now part of Rails 5.2 and can be used with earlier versions of Rails as well. With Bootsnap, “the core Shopify platform - a rather large monolithic application - boots about 75% faster, dropping from around 25s to 6.5s.”
50 |
51 | Another tactic is lazy loading files. Instead of incurring a speed and memory penalty at startup to load files, you can incur it the first time a request or job requires it. If it’s never needed, it’s never loaded. You can specify which gems to load in your Gemfile.
52 |
53 | ```rb
54 | gem 'groupdate', require: false
55 | ```
56 |
57 | You can also use different Bundler groups to selectively load gems for different environments.
58 |
59 | ```rb
60 | group :web do
61 | gem 'rack-attack'
62 | end
63 |
64 | group :admin_web do
65 | gem 'activeadmin'
66 | end
67 |
68 | group :worker do
69 | gem 'premailer-rails'
70 | end
71 | ```
72 |
73 | Read how to [set it up here](https://engineering.harrys.com/2014/07/29/hacking-bundler-groups.html).
74 |
75 | Use [Bumbler](https://github.com/nevir/Bumbler) to see how long each gem takes to load and [Derailed Benchmarks](https://github.com/schneems/derailed_benchmarks) to see memory usage. Focus on the top ones and leave the rest.
76 |
77 | If a gem is slow, there’s a chance it may be doing a lot of work upfront. You can try to debug the gem and fix it. Here’s an [example](https://github.com/ankane/area/commit/2c8cc47d151828ebdcce0e7060b7ac77a4c2f9ce) of speeding up initial load time by only reading a CSV file when it’s needed.
78 |
79 | ## Testing
80 |
81 | As the number of tests grows, the test suite can become slow. [TestProf](https://test-prof.evilmartians.io) provides a number of tools to profile and optimize your tests. You can also use a library like [Database Cleaner](https://github.com/DatabaseCleaner/database_cleaner) to quickly clean the database after tests.
82 |
83 | In development, you can use Guard for [Minitest](https://github.com/guard/guard-minitest) or [RSpec](https://github.com/guard/guard-rspec) to automatically run tests when relevant files are modified. Also make sure it’s easy to manually run common subsets of tests. You can use tags in RSpec for this.
84 |
85 | ```sh
86 | rspec --tag growth
87 | ```
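
Tags are just metadata on example groups, so adding one is a single keyword (hypothetical spec names):

```ruby
RSpec.describe "signup funnel", growth: true do
  it "tracks referrals" do
    # ...
  end
end
```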
88 |
89 | The key to speeding up the entire test suite is parallelization. Stripe has a [great post](https://stripe.com/blog/distributed-ruby-testing) about how they were able to get three hours of tests to run in three minutes. With continuous integration, split tests across multiple machines. Both [Travis](https://docs.travis-ci.com/user/speeding-up-the-build/#parallelizing-your-builds-across-virtual-machines) and [Circle](https://circleci.com/docs/2.0/parallelism-faster-jobs/) support this. You can use [ParallelTests](https://github.com/grosser/parallel_tests) in development to use all the cores on your machine. Rails 6 will run tests in parallel by default.
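
For ParallelTests, the flow is roughly (rake tasks from its README):

```sh
# one test database per core
rake parallel:create parallel:prepare

# run the suite across all cores
rake parallel:test   # or parallel:spec for RSpec
```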
90 |
91 | Another way to speed up tests is to change your schema dump format to SQL.
92 |
93 | ```ruby
94 | config.active_record.schema_format = :sql # in config/application.rb
95 | ```
96 |
97 | This allows you to load the database schema for tests without booting the Rails app. With Postgres, you can use:
98 |
99 | ```sh
100 | psql < db/structure.sql
101 | ```
102 |
103 | To prevent slow tests from being added, automatically fail tests that take too long. With RSpec, you can do:
104 |
105 | ```ruby
106 | RSpec.configure do |config|
107 | config.around(:each) do |example|
108 | duration = Benchmark.realtime(&example)
109 | raise "Test took over 2 seconds to run" if duration > 2
110 | end
111 | end
112 | ```
113 |
114 | Start with a higher value and ratchet it down as you fix tests that are slow. You can see the slowest tests with:
115 |
116 | ```sh
117 | rspec --profile
118 | ```
119 |
120 | As the number of tests grows, there’s a higher chance of a random network issue causing an individual test to fail. Automatically retry failing tests to cut down on noise. With RSpec, you can use [RSpec::Retry](https://github.com/NoRedInk/rspec-retry) for this.
121 |
122 | ```ruby
123 | require "rspec/retry"
124 |
125 | RSpec.configure do |config|
126 | config.around(:each) do |example|
127 | example.run_with_retry retry: 2 # must be 2 to retry once (shrug)
128 | end
129 | end
130 | ```
131 |
132 | For test failures, make sure they get routed to the committer. You can use webhooks from your CI platform to do this.
133 |
134 | ## Databases
135 |
136 | Modern relational databases can scale extremely well if you follow best practices.
137 |
138 | One of the most important things you can do is set a [statement timeout](https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts#statement-timeouts-1) to prevent bad queries from taking too many resources.
139 |
140 | ```yml
141 | production:
142 | variables:
143 | statement_timeout: 250 # ms
144 | ```
145 |
146 | It’s also good to track which queries consume the most CPU time. With Postgres, you can use [PgHero](https://github.com/ankane/pghero) for this.
147 |
148 | 
149 |
150 | Use [Marginalia](https://github.com/basecamp/marginalia) to make it easy to identify the origin of queries. This adds a comment to the end of queries like `/*application:Datakick,controller:items,action:edit*/` so you can see where they’re coming from.
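
The fields in the comment are configurable. For example, to include job names as well (per Marginalia’s README):

```ruby
# config/initializers/marginalia.rb
Marginalia::Comment.components = [:application, :controller, :action, :job]
```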
151 |
152 | Add defensive measures as well. For instance, pause low priority job queues automatically when the database CPU gets too high.
153 |
154 | ```ruby
155 | Sidekiq::Queue.new("low").pause! # queue pausing requires Sidekiq Pro
156 | ```
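
The check itself can run on a schedule. A sketch of the pausing logic — the queue objects and CPU reading are stand-ins for `Sidekiq::Queue.new("low")` and whatever your metrics provider reports:

```ruby
CPU_THRESHOLD = 80 # percent; tune for your workload

# Pause each queue when database CPU crosses the threshold.
# Returns true if the queues were paused.
def pause_low_priority_queues(queues, cpu_percent, threshold: CPU_THRESHOLD)
  return false unless cpu_percent > threshold
  queues.each(&:pause!) # pause! is a Sidekiq Pro API
  true
end
```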
157 |
158 | As the team grows, so does the chance of someone accidentally running a migration that takes down the site. [Strong Migrations](https://github.com/ankane/strong_migrations) can help prevent downtime due to database migrations. It raises an error if you try to run an unsafe operation and gives instructions for a better way to do it.
159 |
160 | 
161 |
162 | Some tables can accumulate a lot of columns. You can split them into multiple tables by concern, each with a one-to-one relationship to the original.
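
For example, rarely-read columns like a bio and avatar could move to their own table (hypothetical models):

```ruby
class User < ApplicationRecord
  has_one :profile
end

class Profile < ApplicationRecord
  belongs_to :user # profiles.user_id, with a unique index
end
```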
163 |
164 | Scale reads by fixing N+1 queries and caching frequent queries. [Bullet](https://github.com/flyerhzm/bullet) can help you identify N+1 queries. If you still have high load after spending a good amount of time on these, use [Distribute Reads](https://github.com/ankane/distribute_reads) for replicas.
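
With Distribute Reads, queries go to a replica only inside an explicit block (per the gem’s README), so the default stays safe (`User` is a stand-in model):

```ruby
distribute_reads do
  User.count # runs against a replica
end
```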
165 |
166 | Scale writes and space with additional databases. Use [Multiverse](https://github.com/ankane/multiverse) to manage them. This can also be good if you have business domains with different workloads. It adds complexity and removes the ability to join certain tables, but can increase stability.
167 |
168 | Partitioning is another strategy to manage space for tables where only recent data is needed, since old partitions can be detached or dropped. You can use [pgslice](https://github.com/ankane/pgslice) for Postgres.
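
pgslice works in explicit steps so each one can be verified before moving on (commands from its README; `visits` and `created_at` are its example names):

```sh
pgslice prep visits created_at month
pgslice add_partitions visits --intermediate --future 3
pgslice fill visits
pgslice swap visits
```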
169 |
170 | While Rails has built-in connection pooling, connections can become an issue when you have a lot of servers. With Postgres, use a connection pooler like [PgBouncer](https://ankane.org/pgbouncer-setup) when you start to hit 500 connections.
171 |
172 | Be hesitant to introduce new data stores. Most of the time you can [just table data](https://ankane.org/just-table-it). It’s often not worth having another technology to manage if your current stack can do the job.
173 |
174 | ## Stability
175 |
176 | Your monolith is one codebase, but you can increase stability by isolating different parts of the app in production. Have separate load balancers and web servers for your customer site and admin site so customers aren’t impacted if the admin site goes down. Use separate workers for different groups of queues so a backed up queue or bad job won’t affect the whole system.
177 |
178 | You can separate by business domain, which will be aligned with teams if you have vertical teams. This also allows you to scale different parts of your app independently as if they were different services.
179 |
180 | ## Conclusion
181 |
182 | As you’ve seen, there are a number of things you can do to scale your monolith. Focus on developer happiness and productivity as well as system stability. Keep track of metrics over time that impact developers, like boot time, test suite time, and deploy time. It’s also good to invest in projects that make it comfortable to ship code fast, like quick rollbacks. Overall, spend a decent amount of time trying to solve your exact pain points before breaking your app apart to solve them.
183 |
--------------------------------------------------------------------------------