├── .gitignore ├── CNAME ├── Gemfile ├── Gemfile.lock ├── LICENSE ├── README.md ├── _config.yml ├── _layouts └── default.html ├── assets └── css │ └── style.scss ├── awesome.md ├── contrib └── probability.md ├── contributors.md ├── img ├── counter_1.png ├── counter_2_top.png ├── flip_binary_tree.png ├── formula_idf_1.png ├── formula_jaccard.png ├── formula_mean.png ├── formula_pmi_1.png ├── formula_pmi_2.png ├── formula_rmse.png ├── formula_std.png ├── repo-logo.jpg ├── schema.png ├── sql_10_example.png ├── sql_11_example.png ├── sql_4_example.png ├── sql_5_example.png ├── sql_6_example.png ├── sql_7_example.png ├── sql_8_example.png └── sql_9_example.png ├── technical.md └── theory.md /.gitignore: -------------------------------------------------------------------------------- 1 | _site/ 2 | -------------------------------------------------------------------------------- /CNAME: -------------------------------------------------------------------------------- 1 | ds-interviews.org -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | source "https://rubygems.org" 2 | # Hello! This is where you manage which Jekyll version is used to run. 3 | # When you want to use a different version, change it below, save the 4 | # file and run `bundle install`. Run Jekyll with `bundle exec`, like so: 5 | # 6 | # bundle exec jekyll serve 7 | # 8 | # This will help ensure the proper Jekyll version is running. 9 | # Happy Jekylling! 10 | gem "jekyll" 11 | # This is the default theme for new Jekyll sites. You may change this to anything you like. 12 | #gem "minima", "~> 2.5" 13 | gem "jekyll-theme-cayman" 14 | gem "github-pages", group: :jekyll_plugins 15 | 16 | # If you want to use GitHub Pages, remove the "gem "jekyll"" above and 17 | # uncomment the line below. To upgrade, run `bundle update github-pages`. 18 | # gem "github-pages", group: :jekyll_plugins 19 | # If you have any plugins, put them here! 20 | group :jekyll_plugins do 21 | gem "jekyll-feed", "~> 0.12" 22 | end 23 | 24 | # Windows and JRuby does not include zoneinfo files, so bundle the tzinfo-data gem 25 | # and associated library. 26 | install_if -> { RUBY_PLATFORM =~ %r!mingw|mswin|java! } do 27 | gem "tzinfo", "~> 1.2" 28 | gem "tzinfo-data" 29 | end 30 | 31 | # Performance-booster for watching directories on Windows 32 | gem "wdm", "~> 0.1.1", :install_if => Gem.win_platform? 
33 | 34 | -------------------------------------------------------------------------------- /Gemfile.lock: -------------------------------------------------------------------------------- 1 | GEM 2 | remote: https://rubygems.org/ 3 | specs: 4 | activesupport (6.0.3.1) 5 | concurrent-ruby (~> 1.0, >= 1.0.2) 6 | i18n (>= 0.7, < 2) 7 | minitest (~> 5.1) 8 | tzinfo (~> 1.1) 9 | zeitwerk (~> 2.2, >= 2.2.2) 10 | addressable (2.7.0) 11 | public_suffix (>= 2.0.2, < 5.0) 12 | coffee-script (2.4.1) 13 | coffee-script-source 14 | execjs 15 | coffee-script-source (1.11.1) 16 | colorator (1.1.0) 17 | commonmarker (0.17.13) 18 | ruby-enum (~> 0.5) 19 | concurrent-ruby (1.1.6) 20 | dnsruby (1.61.3) 21 | addressable (~> 2.5) 22 | em-websocket (0.5.1) 23 | eventmachine (>= 0.12.9) 24 | http_parser.rb (~> 0.6.0) 25 | ethon (0.12.0) 26 | ffi (>= 1.3.0) 27 | eventmachine (1.2.7) 28 | execjs (2.7.0) 29 | faraday (1.0.1) 30 | multipart-post (>= 1.2, < 3) 31 | ffi (1.12.2) 32 | forwardable-extended (2.6.0) 33 | gemoji (3.0.1) 34 | github-pages (204) 35 | github-pages-health-check (= 1.16.1) 36 | jekyll (= 3.8.5) 37 | jekyll-avatar (= 0.7.0) 38 | jekyll-coffeescript (= 1.1.1) 39 | jekyll-commonmark-ghpages (= 0.1.6) 40 | jekyll-default-layout (= 0.1.4) 41 | jekyll-feed (= 0.13.0) 42 | jekyll-gist (= 1.5.0) 43 | jekyll-github-metadata (= 2.13.0) 44 | jekyll-mentions (= 1.5.1) 45 | jekyll-optional-front-matter (= 0.3.2) 46 | jekyll-paginate (= 1.1.0) 47 | jekyll-readme-index (= 0.3.0) 48 | jekyll-redirect-from (= 0.15.0) 49 | jekyll-relative-links (= 0.6.1) 50 | jekyll-remote-theme (= 0.4.1) 51 | jekyll-sass-converter (= 1.5.2) 52 | jekyll-seo-tag (= 2.6.1) 53 | jekyll-sitemap (= 1.4.0) 54 | jekyll-swiss (= 1.0.0) 55 | jekyll-theme-architect (= 0.1.1) 56 | jekyll-theme-cayman (= 0.1.1) 57 | jekyll-theme-dinky (= 0.1.1) 58 | jekyll-theme-hacker (= 0.1.1) 59 | jekyll-theme-leap-day (= 0.1.1) 60 | jekyll-theme-merlot (= 0.1.1) 61 | jekyll-theme-midnight (= 0.1.1) 62 | jekyll-theme-minimal (= 0.1.1) 63 | jekyll-theme-modernist (= 0.1.1) 64 | jekyll-theme-primer (= 0.5.4) 65 | jekyll-theme-slate (= 0.1.1) 66 | jekyll-theme-tactile (= 0.1.1) 67 | jekyll-theme-time-machine (= 0.1.1) 68 | jekyll-titles-from-headings (= 0.5.3) 69 | jemoji (= 0.11.1) 70 | kramdown (= 1.17.0) 71 | liquid (= 4.0.3) 72 | mercenary (~> 0.3) 73 | minima (= 2.5.1) 74 | nokogiri (>= 1.10.4, < 2.0) 75 | rouge (= 3.13.0) 76 | terminal-table (~> 1.4) 77 | github-pages-health-check (1.16.1) 78 | addressable (~> 2.3) 79 | dnsruby (~> 1.60) 80 | octokit (~> 4.0) 81 | public_suffix (~> 3.0) 82 | typhoeus (~> 1.3) 83 | html-pipeline (2.12.3) 84 | activesupport (>= 2) 85 | nokogiri (>= 1.4) 86 | http_parser.rb (0.6.0) 87 | i18n (0.9.5) 88 | concurrent-ruby (~> 1.0) 89 | jekyll (3.8.5) 90 | addressable (~> 2.4) 91 | colorator (~> 1.0) 92 | em-websocket (~> 0.5) 93 | i18n (~> 0.7) 94 | jekyll-sass-converter (~> 1.0) 95 | jekyll-watch (~> 2.0) 96 | kramdown (~> 1.14) 97 | liquid (~> 4.0) 98 | mercenary (~> 0.3.3) 99 | pathutil (~> 0.9) 100 | rouge (>= 1.7, < 4) 101 | safe_yaml (~> 1.0) 102 | jekyll-avatar (0.7.0) 103 | jekyll (>= 3.0, < 5.0) 104 | jekyll-coffeescript (1.1.1) 105 | coffee-script (~> 2.2) 106 | coffee-script-source (~> 1.11.1) 107 | jekyll-commonmark (1.3.1) 108 | commonmarker (~> 0.14) 109 | jekyll (>= 3.7, < 5.0) 110 | jekyll-commonmark-ghpages (0.1.6) 111 | commonmarker (~> 0.17.6) 112 | jekyll-commonmark (~> 1.2) 113 | rouge (>= 2.0, < 4.0) 114 | jekyll-default-layout (0.1.4) 115 | jekyll (~> 3.0) 116 | jekyll-feed (0.13.0) 117 | jekyll (>= 
3.7, < 5.0) 118 | jekyll-gist (1.5.0) 119 | octokit (~> 4.2) 120 | jekyll-github-metadata (2.13.0) 121 | jekyll (>= 3.4, < 5.0) 122 | octokit (~> 4.0, != 4.4.0) 123 | jekyll-mentions (1.5.1) 124 | html-pipeline (~> 2.3) 125 | jekyll (>= 3.7, < 5.0) 126 | jekyll-optional-front-matter (0.3.2) 127 | jekyll (>= 3.0, < 5.0) 128 | jekyll-paginate (1.1.0) 129 | jekyll-readme-index (0.3.0) 130 | jekyll (>= 3.0, < 5.0) 131 | jekyll-redirect-from (0.15.0) 132 | jekyll (>= 3.3, < 5.0) 133 | jekyll-relative-links (0.6.1) 134 | jekyll (>= 3.3, < 5.0) 135 | jekyll-remote-theme (0.4.1) 136 | addressable (~> 2.0) 137 | jekyll (>= 3.5, < 5.0) 138 | rubyzip (>= 1.3.0) 139 | jekyll-sass-converter (1.5.2) 140 | sass (~> 3.4) 141 | jekyll-seo-tag (2.6.1) 142 | jekyll (>= 3.3, < 5.0) 143 | jekyll-sitemap (1.4.0) 144 | jekyll (>= 3.7, < 5.0) 145 | jekyll-swiss (1.0.0) 146 | jekyll-theme-architect (0.1.1) 147 | jekyll (~> 3.5) 148 | jekyll-seo-tag (~> 2.0) 149 | jekyll-theme-cayman (0.1.1) 150 | jekyll (~> 3.5) 151 | jekyll-seo-tag (~> 2.0) 152 | jekyll-theme-dinky (0.1.1) 153 | jekyll (~> 3.5) 154 | jekyll-seo-tag (~> 2.0) 155 | jekyll-theme-hacker (0.1.1) 156 | jekyll (~> 3.5) 157 | jekyll-seo-tag (~> 2.0) 158 | jekyll-theme-leap-day (0.1.1) 159 | jekyll (~> 3.5) 160 | jekyll-seo-tag (~> 2.0) 161 | jekyll-theme-merlot (0.1.1) 162 | jekyll (~> 3.5) 163 | jekyll-seo-tag (~> 2.0) 164 | jekyll-theme-midnight (0.1.1) 165 | jekyll (~> 3.5) 166 | jekyll-seo-tag (~> 2.0) 167 | jekyll-theme-minimal (0.1.1) 168 | jekyll (~> 3.5) 169 | jekyll-seo-tag (~> 2.0) 170 | jekyll-theme-modernist (0.1.1) 171 | jekyll (~> 3.5) 172 | jekyll-seo-tag (~> 2.0) 173 | jekyll-theme-primer (0.5.4) 174 | jekyll (> 3.5, < 5.0) 175 | jekyll-github-metadata (~> 2.9) 176 | jekyll-seo-tag (~> 2.0) 177 | jekyll-theme-slate (0.1.1) 178 | jekyll (~> 3.5) 179 | jekyll-seo-tag (~> 2.0) 180 | jekyll-theme-tactile (0.1.1) 181 | jekyll (~> 3.5) 182 | jekyll-seo-tag (~> 2.0) 183 | jekyll-theme-time-machine (0.1.1) 184 | jekyll (~> 3.5) 185 | jekyll-seo-tag (~> 2.0) 186 | jekyll-titles-from-headings (0.5.3) 187 | jekyll (>= 3.3, < 5.0) 188 | jekyll-watch (2.2.1) 189 | listen (~> 3.0) 190 | jemoji (0.11.1) 191 | gemoji (~> 3.0) 192 | html-pipeline (~> 2.2) 193 | jekyll (>= 3.0, < 5.0) 194 | kramdown (1.17.0) 195 | liquid (4.0.3) 196 | listen (3.2.1) 197 | rb-fsevent (~> 0.10, >= 0.10.3) 198 | rb-inotify (~> 0.9, >= 0.9.10) 199 | mercenary (0.3.6) 200 | mini_portile2 (2.4.0) 201 | minima (2.5.1) 202 | jekyll (>= 3.5, < 5.0) 203 | jekyll-feed (~> 0.9) 204 | jekyll-seo-tag (~> 2.1) 205 | minitest (5.14.1) 206 | multipart-post (2.1.1) 207 | nokogiri (1.10.9) 208 | mini_portile2 (~> 2.4.0) 209 | octokit (4.18.0) 210 | faraday (>= 0.9) 211 | sawyer (~> 0.8.0, >= 0.5.3) 212 | pathutil (0.16.2) 213 | forwardable-extended (~> 2.6) 214 | public_suffix (3.1.1) 215 | rb-fsevent (0.10.3) 216 | rb-inotify (0.10.1) 217 | ffi (~> 1.0) 218 | rouge (3.13.0) 219 | ruby-enum (0.8.0) 220 | i18n 221 | rubyzip (2.3.0) 222 | safe_yaml (1.0.5) 223 | sass (3.7.4) 224 | sass-listen (~> 4.0.0) 225 | sass-listen (4.0.0) 226 | rb-fsevent (~> 0.9, >= 0.9.4) 227 | rb-inotify (~> 0.9, >= 0.9.7) 228 | sawyer (0.8.2) 229 | addressable (>= 2.3.5) 230 | faraday (> 0.8, < 2.0) 231 | terminal-table (1.8.0) 232 | unicode-display_width (~> 1.1, >= 1.1.1) 233 | thread_safe (0.3.6) 234 | typhoeus (1.3.1) 235 | ethon (>= 0.9.0) 236 | tzinfo (1.2.7) 237 | thread_safe (~> 0.1) 238 | tzinfo-data (1.2019.3) 239 | tzinfo (>= 1.0.0) 240 | unicode-display_width (1.7.0) 241 | wdm (0.1.1) 242 | zeitwerk 
(2.3.0) 243 | 244 | PLATFORMS 245 | ruby 246 | 247 | DEPENDENCIES 248 | github-pages 249 | jekyll 250 | jekyll-feed (~> 0.12) 251 | jekyll-theme-cayman 252 | tzinfo (~> 1.2) 253 | tzinfo-data 254 | wdm (~> 0.1.1) 255 | 256 | BUNDLED WITH 257 | 2.1.4 258 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. 
More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution 4.0 International Public License 58 | 59 | By exercising the Licensed Rights (defined below), You accept and agree 60 | to be bound by the terms and conditions of this Creative Commons 61 | Attribution 4.0 International Public License ("Public License"). To the 62 | extent this Public License may be interpreted as a contract, You are 63 | granted the Licensed Rights in consideration of Your acceptance of 64 | these terms and conditions, and the Licensor grants You such rights in 65 | consideration of benefits the Licensor receives from making the 66 | Licensed Material available under these terms and conditions. 67 | 68 | 69 | Section 1 -- Definitions. 70 | 71 | a. Adapted Material means material subject to Copyright and Similar 72 | Rights that is derived from or based upon the Licensed Material 73 | and in which the Licensed Material is translated, altered, 74 | arranged, transformed, or otherwise modified in a manner requiring 75 | permission under the Copyright and Similar Rights held by the 76 | Licensor. For purposes of this Public License, where the Licensed 77 | Material is a musical work, performance, or sound recording, 78 | Adapted Material is always produced where the Licensed Material is 79 | synched in timed relation with a moving image. 80 | 81 | b. Adapter's License means the license You apply to Your Copyright 82 | and Similar Rights in Your contributions to Adapted Material in 83 | accordance with the terms and conditions of this Public License. 84 | 85 | c. Copyright and Similar Rights means copyright and/or similar rights 86 | closely related to copyright including, without limitation, 87 | performance, broadcast, sound recording, and Sui Generis Database 88 | Rights, without regard to how the rights are labeled or 89 | categorized. For purposes of this Public License, the rights 90 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 91 | Rights. 92 | 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. 
Share means to provide material to the public by any means or 116 | process that requires permission under the Licensed Rights, such 117 | as reproduction, public display, public performance, distribution, 118 | dissemination, communication, or importation, and to make material 119 | available to the public including in ways that members of the 120 | public may access the material from a place and at a time 121 | individually chosen by them. 122 | 123 | j. Sui Generis Database Rights means rights other than copyright 124 | resulting from Directive 96/9/EC of the European Parliament and of 125 | the Council of 11 March 1996 on the legal protection of databases, 126 | as amended and/or succeeded, as well as other essentially 127 | equivalent rights anywhere in the world. 128 | 129 | k. You means the individual or entity exercising the Licensed Rights 130 | under this Public License. Your has a corresponding meaning. 131 | 132 | 133 | Section 2 -- Scope. 134 | 135 | a. License grant. 136 | 137 | 1. Subject to the terms and conditions of this Public License, 138 | the Licensor hereby grants You a worldwide, royalty-free, 139 | non-sublicensable, non-exclusive, irrevocable license to 140 | exercise the Licensed Rights in the Licensed Material to: 141 | 142 | a. reproduce and Share the Licensed Material, in whole or 143 | in part; and 144 | 145 | b. produce, reproduce, and Share Adapted Material. 146 | 147 | 2. Exceptions and Limitations. For the avoidance of doubt, where 148 | Exceptions and Limitations apply to Your use, this Public 149 | License does not apply, and You do not need to comply with 150 | its terms and conditions. 151 | 152 | 3. Term. The term of this Public License is specified in Section 153 | 6(a). 154 | 155 | 4. Media and formats; technical modifications allowed. The 156 | Licensor authorizes You to exercise the Licensed Rights in 157 | all media and formats whether now known or hereafter created, 158 | and to make technical modifications necessary to do so. The 159 | Licensor waives and/or agrees not to assert any right or 160 | authority to forbid You from making technical modifications 161 | necessary to exercise the Licensed Rights, including 162 | technical modifications necessary to circumvent Effective 163 | Technological Measures. For purposes of this Public License, 164 | simply making modifications authorized by this Section 2(a) 165 | (4) never produces Adapted Material. 166 | 167 | 5. Downstream recipients. 168 | 169 | a. Offer from the Licensor -- Licensed Material. Every 170 | recipient of the Licensed Material automatically 171 | receives an offer from the Licensor to exercise the 172 | Licensed Rights under the terms and conditions of this 173 | Public License. 174 | 175 | b. No downstream restrictions. You may not offer or impose 176 | any additional or different terms or conditions on, or 177 | apply any Effective Technological Measures to, the 178 | Licensed Material if doing so restricts exercise of the 179 | Licensed Rights by any recipient of the Licensed 180 | Material. 181 | 182 | 6. No endorsement. Nothing in this Public License constitutes or 183 | may be construed as permission to assert or imply that You 184 | are, or that Your use of the Licensed Material is, connected 185 | with, or sponsored, endorsed, or granted official status by, 186 | the Licensor or others designated to receive attribution as 187 | provided in Section 3(a)(1)(A)(i). 188 | 189 | b. Other rights. 190 | 191 | 1. 
Moral rights, such as the right of integrity, are not 192 | licensed under this Public License, nor are publicity, 193 | privacy, and/or other similar personality rights; however, to 194 | the extent possible, the Licensor waives and/or agrees not to 195 | assert any such rights held by the Licensor to the limited 196 | extent necessary to allow You to exercise the Licensed 197 | Rights, but not otherwise. 198 | 199 | 2. Patent and trademark rights are not licensed under this 200 | Public License. 201 | 202 | 3. To the extent possible, the Licensor waives any right to 203 | collect royalties from You for the exercise of the Licensed 204 | Rights, whether directly or through a collecting society 205 | under any voluntary or waivable statutory or compulsory 206 | licensing scheme. In all other cases the Licensor expressly 207 | reserves any right to collect such royalties. 208 | 209 | 210 | Section 3 -- License Conditions. 211 | 212 | Your exercise of the Licensed Rights is expressly made subject to the 213 | following conditions. 214 | 215 | a. Attribution. 216 | 217 | 1. If You Share the Licensed Material (including in modified 218 | form), You must: 219 | 220 | a. retain the following if it is supplied by the Licensor 221 | with the Licensed Material: 222 | 223 | i. identification of the creator(s) of the Licensed 224 | Material and any others designated to receive 225 | attribution, in any reasonable manner requested by 226 | the Licensor (including by pseudonym if 227 | designated); 228 | 229 | ii. a copyright notice; 230 | 231 | iii. a notice that refers to this Public License; 232 | 233 | iv. a notice that refers to the disclaimer of 234 | warranties; 235 | 236 | v. a URI or hyperlink to the Licensed Material to the 237 | extent reasonably practicable; 238 | 239 | b. indicate if You modified the Licensed Material and 240 | retain an indication of any previous modifications; and 241 | 242 | c. indicate the Licensed Material is licensed under this 243 | Public License, and include the text of, or the URI or 244 | hyperlink to, this Public License. 245 | 246 | 2. You may satisfy the conditions in Section 3(a)(1) in any 247 | reasonable manner based on the medium, means, and context in 248 | which You Share the Licensed Material. For example, it may be 249 | reasonable to satisfy the conditions by providing a URI or 250 | hyperlink to a resource that includes the required 251 | information. 252 | 253 | 3. If requested by the Licensor, You must remove any of the 254 | information required by Section 3(a)(1)(A) to the extent 255 | reasonably practicable. 256 | 257 | 4. If You Share Adapted Material You produce, the Adapter's 258 | License You apply must not prevent recipients of the Adapted 259 | Material from complying with this Public License. 260 | 261 | 262 | Section 4 -- Sui Generis Database Rights. 263 | 264 | Where the Licensed Rights include Sui Generis Database Rights that 265 | apply to Your use of the Licensed Material: 266 | 267 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 268 | to extract, reuse, reproduce, and Share all or a substantial 269 | portion of the contents of the database; 270 | 271 | b. if You include all or a substantial portion of the database 272 | contents in a database in which You have Sui Generis Database 273 | Rights, then the database in which You have Sui Generis Database 274 | Rights (but not its individual contents) is Adapted Material; and 275 | 276 | c. 
You must comply with the conditions in Section 3(a) if You Share 277 | all or a substantial portion of the contents of the database. 278 | 279 | For the avoidance of doubt, this Section 4 supplements and does not 280 | replace Your obligations under this Public License where the Licensed 281 | Rights include other Copyright and Similar Rights. 282 | 283 | 284 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 285 | 286 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 287 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 288 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 289 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 290 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 291 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 292 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 293 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 294 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 295 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 296 | 297 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 298 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 299 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 300 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 301 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 302 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 303 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 304 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 305 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 306 | 307 | c. The disclaimer of warranties and limitation of liability provided 308 | above shall be interpreted in a manner that, to the extent 309 | possible, most closely approximates an absolute disclaimer and 310 | waiver of all liability. 311 | 312 | 313 | Section 6 -- Term and Termination. 314 | 315 | a. This Public License applies for the term of the Copyright and 316 | Similar Rights licensed here. However, if You fail to comply with 317 | this Public License, then Your rights under this Public License 318 | terminate automatically. 319 | 320 | b. Where Your right to use the Licensed Material has terminated under 321 | Section 6(a), it reinstates: 322 | 323 | 1. automatically as of the date the violation is cured, provided 324 | it is cured within 30 days of Your discovery of the 325 | violation; or 326 | 327 | 2. upon express reinstatement by the Licensor. 328 | 329 | For the avoidance of doubt, this Section 6(b) does not affect any 330 | right the Licensor may have to seek remedies for Your violations 331 | of this Public License. 332 | 333 | c. For the avoidance of doubt, the Licensor may also offer the 334 | Licensed Material under separate terms or conditions or stop 335 | distributing the Licensed Material at any time; however, doing so 336 | will not terminate this Public License. 337 | 338 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 339 | License. 340 | 341 | 342 | Section 7 -- Other Terms and Conditions. 343 | 344 | a. The Licensor shall not be bound by any additional or different 345 | terms or conditions communicated by You unless expressly agreed. 346 | 347 | b. 
Any arrangements, understandings, or agreements regarding the 348 | Licensed Material not stated herein are separate from and 349 | independent of the terms and conditions of this Public License. 350 | 351 | 352 | Section 8 -- Interpretation. 353 | 354 | a. For the avoidance of doubt, this Public License does not, and 355 | shall not be interpreted to, reduce, limit, restrict, or impose 356 | conditions on any use of the Licensed Material that could lawfully 357 | be made without permission under this Public License. 358 | 359 | b. To the extent possible, if any provision of this Public License is 360 | deemed unenforceable, it shall be automatically reformed to the 361 | minimum extent necessary to make it enforceable. If the provision 362 | cannot be reformed, it shall be severed from this Public License 363 | without affecting the enforceability of the remaining terms and 364 | conditions. 365 | 366 | c. No term or condition of this Public License will be waived and no 367 | failure to comply consented to unless expressly agreed to by the 368 | Licensor. 369 | 370 | d. Nothing in this Public License constitutes or may be interpreted 371 | as a limitation upon, or waiver of, any privileges and immunities 372 | that apply to the Licensor or You, including from the legal 373 | processes of any jurisdiction or authority. 374 | 375 | 376 | ======================================================================= 377 | 378 | Creative Commons is not a party to its public 379 | licenses. Notwithstanding, Creative Commons may elect to apply one of 380 | its public licenses to material it publishes and in those instances 381 | will be considered the “Licensor.” The text of the Creative Commons 382 | public licenses is dedicated to the public domain under the CC0 Public 383 | Domain Dedication. Except for the limited purpose of indicating that 384 | material is shared under a Creative Commons public license or as 385 | otherwise permitted by the Creative Commons policies published at 386 | creativecommons.org/policies, Creative Commons does not authorize the 387 | use of the trademark "Creative Commons" or any other trademark or logo 388 | of Creative Commons without its prior written consent including, 389 | without limitation, in connection with any unauthorized modifications 390 | to any of its public licenses or any other arrangements, 391 | understandings, or agreements concerning use of licensed material. For 392 | the avoidance of doubt, this paragraph does not form part of the 393 | public licenses. 394 | 395 | Creative Commons may be contacted at creativecommons.org. 396 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 |
3 | 4 |
Photo by Waseem Farooq from PxHere
5 |
6 | 7 | # Data Science Interviews 8 | 9 | Data science interview questions - with answers 10 | 11 | The answers are given by the community 12 | 13 | * If you know how to answer a question — please create a PR with the answer 14 | * If there's already an answer, but you can improve it — please create a PR with improvement suggestion 15 | * If you see a mistake — please create a PR with a fix 16 | 17 | For updates, follow me on Twitter ([@Al_Grigor](https://twitter.com/Al_Grigor)) and on LinkedIn ([agrigorev](https://www.linkedin.com/in/agrigorev)) 18 | 19 | Do you want to talk about data? Join [DataTalks.Club](https://DataTalks.Club) 20 | 21 | 22 | ## Questions by category 23 | 24 | * Theoretical questions: [theory.md](theory.md) (linear models, trees, neural networks and others) 25 | * Technical questions: [technical.md](technical.md) (SQL, Python, coding) 26 | * More to come 27 | 28 | ## Contributed questions 29 | 30 | The `contrib` folder contains contributed interview questions: 31 | 32 | * Probability: [contrib/probability.md](contrib/probability.md) 33 | * Add your questions here! 34 | 35 | ## Other useful things 36 | 37 | * Awesome data science interview questions and other resources: [awesome.md](awesome.md) 38 | 39 | 40 | This is a joint effort of many people. You can see the list of contributors here: [contributors.md](contributors.md) 41 | 42 | ## License 43 | 44 | This work is licensed under a [Creative Commons Attribution 4.0 International License][cc-by]. 45 | 46 | [![CC BY 4.0][cc-by-image]][cc-by] 47 | 48 | [cc-by]: http://creativecommons.org/licenses/by/4.0/ 49 | [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png 50 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-cayman 2 | title: Data Science Interviews 3 | description: Data science interview questions, answers and other useful resources 4 | google_analytics: UA-163602297-1 5 | cover_image: img/repo-logo.jpg 6 | -------------------------------------------------------------------------------- /_layouts/default.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | {% if site.google_analytics %} 6 | 7 | 13 | {% endif %} 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | {{ page.title | default: site.title | default: site.github.repository_name }} 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 56 | 57 |
58 | {{ content }} 59 | 60 | 66 |
67 | 68 | -------------------------------------------------------------------------------- /assets/css/style.scss: -------------------------------------------------------------------------------- 1 | --- 2 | --- 3 | 4 | @import "{{ site.theme }}"; 5 | 6 | @media screen and (min-width: 64em) { 7 | .page-header { 8 | padding: 1rem 1rem; 9 | } 10 | } 11 | 12 | header a { 13 | color: #fff; 14 | } 15 | 16 | figcaption { 17 | font-size: 80%; 18 | text-align: center; 19 | } -------------------------------------------------------------------------------- /awesome.md: -------------------------------------------------------------------------------- 1 | ## Awesome Data Science Interview Resources 2 | 3 | A list of links with data science interview questions and other useful resources. 4 | 5 | Contributions are welcome! 6 | 7 | ### Questions and answers 8 | 9 | * [This repository](https://github.com/alexeygrigorev/data-science-interviews): [https://ds-interviews.org](https://ds-interviews.org) 10 | * [Data science interview questions and answers](https://github.com/iamtodor/data-science-interview-questions-and-answers) by [iamtodor](https://github.com/iamtodor) 11 | * [120+ data science interview questions](https://github.com/kojino/120-Data-Science-Interview-Questions) by [kojino](https://github.com/kojino/) 12 | * [40 Interview Questions asked at Startups in Machine Learning / Data Science](https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/) 13 | * [The Most Comprehensive Data Science & Machine Learning Interview Guide You’ll Ever Need](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-data-science-machine-learning-interview-guide/) 14 | * [Data Science Recruitment Challenges](https://github.com/alexeygrigorev/datascience-recruitment-challenges) - take-home assignments for data science positions 15 | * [Latest Data Science Interview Questions](https://www.interviewbit.com/data-science-interview-questions/) - Complete Interview Guide 16 | * [Daily Machine Learning Questions](https://today.bnomial.com/) - One machine learning question every day with answers, explanations, references.
17 | 18 | ### Questions 19 | 20 | * [The toughest data science interview](https://www.linkedin.com/posts/agrigorev_datascience-machinelearning-ml-activity-6630138658219409409-bTWh) - a post on LinkedIn 21 | * [Data Science Interview Questions](https://www.itshared.org/2015/10/data-science-interview-questions.html) from [ITShared](https://www.itshared.org/) - from 2015, but many of them are still valid 22 | 23 | 24 | ### Other useful links 25 | 26 | * Getting a Data Science Job: [Video](https://www.youtube.com/watch?v=jYYR1fH8k7o), [Slides](https://www.slideshare.net/AlexeyGrigorev/getting-a-data-science-job) 27 | * [How does a technical screening for data science positions look like?](https://www.linkedin.com/posts/agrigorev_datascience-machinelearning-ml-activity-6631245015718866944-Vb87) - a post on LinkedIn 28 | * [How to prepare for Research Engineer (ML) interview?](https://www.linkedin.com/posts/agrigorev_machinelearning-ml-interviews-activity-6622232556311990272-_dAN) - a post on LinkedIn 29 | * [How to prepare for a data science interview?](https://www.quora.com/How-do-I-prepare-for-a-data-scientist-interview) - 100+ answers on Quora 30 | * [How to Get a Data Science Job: A Ridiculously Specific Guide](http://brohrer.github.io/get_data_science_job.html) 31 | * [How to Succeed in A Data Science Interview](https://blog.pramp.com/how-to-succeed-in-a-data-science-interview-27553ab69d8a) 32 | * [Machine Learning Systems Design](https://github.com/chiphuyen/machine-learning-systems-design) 33 | * [I interviewed at five top companies in Silicon Valley in five days, and luckily got five job offers](https://medium.com/@XiaohanZeng/i-interviewed-at-five-top-companies-in-silicon-valley-in-five-days-and-luckily-got-five-job-offers-25178cf74e0f) 34 | * [Navigating working with other teams](https://counting.substack.com/p/navigating-working-with-other-teams) 35 | 36 | 37 | ### Negotiation 38 | 39 | * [Ten Rules for Negotiating a Job Offer](https://haseebq.com/my-ten-rules-for-negotiating-a-job-offer/) 40 | * [Career Advice and Salary Negotiations: Move Early and Move Often](https://thehftguy.com/2017/01/23/career-advice-and-salary-negotiations-move-early-and-move-often/) 41 | -------------------------------------------------------------------------------- /contrib/probability.md: -------------------------------------------------------------------------------- 1 | # Probability 2 | 3 | 4 | 5 | 6 | 10 | 11 |
⚠️ 7 | Both questions and answers here are given by the community. Be careful and double-check the answers before using them.
8 | If you see an error, please create a PR with a fix 9 |
12 | 13 |   14 | 15 | **Imagine you have a jar of 500 coins. 1 out of 500 is a coin with two heads and all the others have a tail and a head. You take a random coin from the jar and flip it 8 times. You observe heads 8 consecutive times. Are the chances that you took the coin with two heads higher than having drawn a regular coin with a head and a tail?** 16 | 17 | The main tool is Bayes' theorem. 18 | 19 | Define A as the event of tossing the chosen coin and getting heads 8 times, and `B_1` and `B_2` as the events of choosing the special and the fair coin respectively. We compute the odds of choosing the special coin over the fair one given the event A: 20 | - `P(B_1|A) : P(B_2|A)` 21 | 22 | If these odds are greater than 1, then the answer is yes. Otherwise, no. 23 | 24 | By Bayes' theorem (after some manipulation), 25 | - `P(B_1|A) : P(B_2|A) = (P(A|B_1) P(B_1)) : (P(A|B_2) P(B_2))` 26 | - `= (P(A|B_1) / P(A|B_2)) * (P(B_1) / P(B_2))` (*) 27 | 28 | The second ratio is the prior odds of choosing the special coin over the fair one. It equals `1/499`. 29 | 30 | The first ratio is `1/(1/2)^8 = 256`. 31 | 32 | So the odds of choosing the special coin over the fair one given the event A are `256/499`, which is less than 1. Hence it is less likely that we took the special coin than the fair one. 33 | 34 | Extra comments: 35 | - From the solution, if there were 9 consecutive heads, then the odds would be 512/499 and hence the answer would be `yes`. 36 | - The formula (*), in general, has [the form](https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing#Estimation_of_pre-_and_post-test_probability) 37 | 38 | `posterior odds = likelihood ratio of the event A * prior odds` 39 | - Another solution is to compute `P(B_1|A)` directly using Bayes' theorem and the law of total probability, 40 | 41 | ``` 42 | P(B_1|A) = P(A|B_1) P(B_1) / P(A) = P(A|B_1) P(B_1) / (P(A|B_1) P(B_1) + P(A|B_2) P(B_2)) 43 | = 1 * (1/500) / (1/500 + (1/2)^8 * (499/500)) 44 | ``` 45 | and see that it is less than `1/2`, so the answer is again no. 46 | - The problem contains several implicit assumptions. For example: 47 | - A coin with one head and one tail is called **fair** under the assumption that both sides have an equal chance of landing face up on each toss. 48 | - We assume that each toss results in either heads or tails, with no other outcomes (such as the coin landing on its edge). 49 |
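As a quick numerical check of the derivation above (not part of the original answer), the posterior odds and the posterior probability can be computed directly. The only inputs are the prior 1/500 and the likelihoods of 8 heads under each coin:

```python
# Numerical check of the Bayes computation above
p_special = 1 / 500        # prior P(B_1): picking the two-headed coin
p_fair = 499 / 500         # prior P(B_2): picking a regular coin
p_8heads_special = 1.0     # P(A|B_1): the two-headed coin always shows heads
p_8heads_fair = 0.5 ** 8   # P(A|B_2) = 1/256

posterior_odds = (p_8heads_special * p_special) / (p_8heads_fair * p_fair)
posterior_prob = (p_8heads_special * p_special) / (
    p_8heads_special * p_special + p_8heads_fair * p_fair
)

print(posterior_odds)  # 256/499 ≈ 0.513, less than 1
print(posterior_prob)  # 256/755 ≈ 0.339, less than 1/2
```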
50 | 51 | 52 | -------------------------------------------------------------------------------- /contributors.md: -------------------------------------------------------------------------------- 1 | ## Contributors 2 | 3 | The content in this repository - especially the answers - is created by the community: 4 | 5 | * [sash-ko](https://github.com/sash-ko) 6 | * [samaritanhu](https://github.com/samaritanhu) 7 | * [gabrielatrindade](https://github.com/gabrielatrindade) 8 | * [manuel-lang](https://github.com/manuel-lang) 9 | * [damiannmm](https://github.com/damiannmm) 10 | * [vasugamdha](https://github.com/vasugamdha) 11 | * [MahdiRahbar](https://github.com/MahdiRahbar) 12 | * [pedrogengo](https://github.com/pedrogengo) 13 | * [rohanadagouda](https://github.com/rohanadagouda) 14 | * [python-engineer](https://github.com/python-engineer) 15 | * [hamzag95](https://github.com/hamzag95) 16 | * [rahulmadanraju](https://github.com/rahulmadanraju) 17 | * [donaldonana](https://github.com/donaldonana) 18 | * [pymacbit](https://github.com/pymacbit) 19 | * [alikhanafer](https://github.com/alikhanafer) 20 | * [Erlemar](https://github.com/Erlemar) 21 | * [AdmiralChopper](https://github.com/AdmiralChopper) 22 | * [BorisovDm](https://github.com/BorisovDm) 23 | * [eljur](https://github.com/eljur) 24 | * [SomeSnm](https://github.com/SomeSnm) 25 | * [Hannemit](https://github.com/Hannemit) 26 | * [l1x](https://github.com/l1x) 27 | * [JeremiahKamama](https://github.com/JeremiahKamama) 28 | * [JoaquinDF](https://github.com/JoaquinDF) 29 | * [LoweLundin](https://github.com/LoweLundin) 30 | * [pranaymodukuru](https://github.com/pranaymodukuru) 31 | * [ritwikbanrg](https://github.com/ritwikbanrg) 32 | * [averkij](https://github.com/averkij) 33 | * [Tejash-Shah](https://github.com/Tejash-Shah) 34 | * [vijay-ravi](https://github.com/vijay-ravi) 35 | * [Mudit Tiwari](https://github.com/mudittiwari255) 36 | * [mrsaeeddev](https://github.com/mrsaeeddev) 37 | * [hima9](https://github.com/hima9) 38 | 39 | 40 | Full list of contributors: [contributors](https://github.com/alexeygrigorev/data-science-interviews/contributors) 41 | 42 | Thank you for your contribution! 
43 | -------------------------------------------------------------------------------- /img/counter_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/counter_1.png -------------------------------------------------------------------------------- /img/counter_2_top.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/counter_2_top.png -------------------------------------------------------------------------------- /img/flip_binary_tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/flip_binary_tree.png -------------------------------------------------------------------------------- /img/formula_idf_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/formula_idf_1.png -------------------------------------------------------------------------------- /img/formula_jaccard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/formula_jaccard.png -------------------------------------------------------------------------------- /img/formula_mean.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/formula_mean.png -------------------------------------------------------------------------------- /img/formula_pmi_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/formula_pmi_1.png -------------------------------------------------------------------------------- /img/formula_pmi_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/formula_pmi_2.png -------------------------------------------------------------------------------- /img/formula_rmse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/formula_rmse.png -------------------------------------------------------------------------------- /img/formula_std.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/formula_std.png -------------------------------------------------------------------------------- /img/repo-logo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/repo-logo.jpg 
-------------------------------------------------------------------------------- /img/schema.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/schema.png -------------------------------------------------------------------------------- /img/sql_10_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_10_example.png -------------------------------------------------------------------------------- /img/sql_11_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_11_example.png -------------------------------------------------------------------------------- /img/sql_4_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_4_example.png -------------------------------------------------------------------------------- /img/sql_5_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_5_example.png -------------------------------------------------------------------------------- /img/sql_6_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_6_example.png -------------------------------------------------------------------------------- /img/sql_7_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_7_example.png -------------------------------------------------------------------------------- /img/sql_8_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_8_example.png -------------------------------------------------------------------------------- /img/sql_9_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexeygrigorev/data-science-interviews/fcbce76df24aa473ed6dc07ec598097b9dd70b5b/img/sql_9_example.png -------------------------------------------------------------------------------- /technical.md: -------------------------------------------------------------------------------- 1 | # Technical interview questions 2 | 3 | 4 | 5 | 6 | 10 | 11 |
⚠️ 7 | The answers here are given by the community. Be careful and double-check the answers before using them.
8 | If you see an error, please create a PR with a fix 9 |
12 | 13 | The list is based on [this post](https://medium.com/data-science-insider/technical-data-science-interview-questions-f61cd9cf218?source=friends_link&sk=01f4de0de746d28fe714d92a1e91e190) 14 | 15 | 16 | ## Table of contents 17 | 18 | * [SQL](#sql) 19 | * [Coding (Python)](#coding-python) 20 | * [Algorithmic Questions](#algorithmic-questions) 21 | 22 |
23 | 24 | ## SQL 25 | 26 | Suppose we have the following schema with two tables: Ads and Events 27 | 28 | * Ads(ad_id, campaign_id, status) 29 | * status could be active or inactive 30 | * Events(event_id, ad_id, source, event_type, date, hour) 31 | * event_type could be impression, click, conversion 32 | 33 | 34 | 35 | 36 | Write SQL queries to extract the following information: 37 | 38 | **1)** The number of active ads. 39 | 40 | ```sql 41 | SELECT count(*) FROM Ads WHERE status = 'active'; 42 | ``` 43 | 44 |
45 | 46 | 47 | **2)** All active campaigns. A campaign is active if there’s at least one active ad. 48 | 49 | ```sql 50 | SELECT DISTINCT a.campaign_id 51 | FROM Ads AS a 52 | WHERE a.status = 'active'; 53 | ``` 54 | 55 |
56 | 57 | **3)** The number of active campaigns. 58 | 59 | ```sql 60 | SELECT COUNT(DISTINCT a.campaign_id) 61 | FROM Ads AS a 62 | WHERE a.status = 'active'; 63 | ``` 64 | 65 |
66 | 67 | **4)** The number of events per each ad — broken down by event type. 68 | 69 | 70 | 71 | ```sql 72 | SELECT e.ad_id, e.event_type, count(*) as "count" 73 | FROM Events AS e 74 | GROUP BY e.ad_id, e.event_type 75 | ORDER BY e.ad_id, "count" DESC; 76 | ``` 77 | 78 |
79 | 80 | **5)** The number of events over the last week per each active ad — broken down by event type and date (most recent first). 81 | 82 | 83 | 84 | ```sql 85 | SELECT a.ad_id, e.event_type, e.date, count(*) as "count" 86 | FROM Ads AS a 87 | JOIN Events AS e 88 | ON a.ad_id = e.ad_id 89 | WHERE a.status = 'active' 90 | AND e.date >= DATEADD(week, -1, GETDATE()) 91 | GROUP BY a.ad_id, e.event_type, e.date 92 | ORDER BY e.date DESC, "count" DESC; 93 | ``` 94 | 95 |
96 | 97 | **6)** The number of events per campaign — by event type. 98 | 99 | 100 | 101 | 102 | ```sql 103 | SELECT a.campaign_id, e.event_type, count(*) as count 104 | FROM Ads AS a 105 | INNER JOIN Events AS e 106 | ON a.ad_id = e.ad_id 107 | GROUP BY a.campaign_id, e.event_type 108 | ORDER BY a.campaign_id, "count" DESC 109 | ``` 110 | 111 |
112 | 113 | **7)** The number of events over the last week per each campaign and event type — broken down by date (most recent first). 114 | 115 | 116 | 117 | ```sql 118 | -- for Postgres 119 | 120 | SELECT a.campaign_id AS camp_id, e.event_type, e.date, count(*) 121 | FROM Ads AS a 122 | INNER JOIN Events AS e 123 | ON a.ad_id = e.ad_id 124 | WHERE e.date >= CURRENT_DATE - INTERVAL '1 week' 125 | GROUP BY a.campaign_id, e.event_type, e.date 126 | ORDER BY a.campaign_id, e.date DESC, "count" DESC; 127 | ``` 128 | 129 |
130 | 131 | **8)** CTR (click-through rate) for each ad. CTR = number of clicks / number of impressions. 132 | 133 | 134 | 135 | ```sql 136 | -- for Postgres 137 | 138 | SELECT impressions_clicks_table.ad_id, 139 | (impressions_clicks_table.clicks * 100 / impressions_clicks_table.impressions)::FLOAT || '%' AS CTR 140 | FROM 141 | ( 142 | SELECT e.ad_id, 143 | SUM(CASE e.event_type WHEN 'impression' THEN 1 ELSE 0 END) impressions, 144 | SUM(CASE e.event_type WHEN 'click' THEN 1 ELSE 0 END) clicks 145 | FROM Events AS e 146 | GROUP BY e.ad_id 147 | ) AS impressions_clicks_table 148 | ORDER BY impressions_clicks_table.ad_id; 149 | ``` 150 | 151 |
152 | 153 | **9)** CVR (conversion rate) for each ad. CVR = number of conversions / number of clicks. 154 | 155 | 156 | 157 | ```sql 158 | -- for Postgres 159 | 160 | SELECT conversions_clicks_table.ad_id, 161 | (conversions_clicks_table.conversions * 100 / conversions_clicks_table.clicks)::FLOAT || '%' AS CVR 162 | FROM 163 | ( 164 | SELECT e.ad_id, 165 | SUM(CASE e.event_type WHEN 'conversion' THEN 1 ELSE 0 END) conversions, 166 | SUM(CASE e.event_type WHEN 'click' THEN 1 ELSE 0 END) clicks 167 | FROM Events AS e 168 | GROUP BY e.ad_id 169 | ) AS conversions_clicks_table 170 | ORDER BY conversions_clicks_table.ad_id; 171 | ``` 172 | 173 |
174 | 175 | **10)** CTR and CVR for each ad broken down by day and hour (most recent first). 176 | 177 | 178 | 179 | 180 | ```sql 181 | -- for Postgres 182 | 183 | SELECT conversions_clicks_table.ad_id, 184 | conversions_clicks_table.date, 185 | conversions_clicks_table.hour, 186 | (impressions_clicks_table.clicks * 100 / impressions_clicks_table.impressions)::FLOAT || '%' AS CTR, 187 | (conversions_clicks_table.conversions * 100 / conversions_clicks_table.clicks)::FLOAT || '%' AS CVR 188 | FROM 189 | ( 190 | SELECT e.ad_id, e.date, e.hour, 191 | SUM(CASE e.event_type WHEN 'conversion' THEN 1 ELSE 0 END) conversions, 192 | SUM(CASE e.event_type WHEN 'click' THEN 1 ELSE 0 END) clicks, 193 | SUM(CASE e.event_type WHEN 'impression' THEN 1 ELSE 0 END) impressions 194 | FROM Events AS e 195 | GROUP BY e.ad_id, e.date, e.hour 196 | ) AS conversions_clicks_table 197 | ORDER BY conversions_clicks_table.ad_id, conversions_clicks_table.date DESC, conversions_clicks_table.hour DESC, "CTR" DESC, "CVR" DESC; 198 | ``` 199 | 200 |
201 | 202 | **11)** CTR for each ad broken down by source and day 203 | 204 | 205 | 206 | 207 | ```sql 208 | -- for Postgres 209 | 210 | SELECT conversions_clicks_table.ad_id, 211 | conversions_clicks_table.date, 212 | conversions_clicks_table.source, 213 | (impressions_clicks_table.clicks * 100 / impressions_clicks_table.impressions)::FLOAT || '%' AS CTR 214 | FROM 215 | ( 216 | SELECT e.ad_id, e.date, e.source, 217 | SUM(CASE e.event_type WHEN 'click' THEN 1 ELSE 0 END) clicks, 218 | SUM(CASE e.event_type WHEN 'impression' THEN 1 ELSE 0 END) impressions 219 | FROM Events AS e 220 | GROUP BY e.ad_id, e.date, e.source 221 | ) AS conversions_clicks_table 222 | ORDER BY conversions_clicks_table.ad_id, conversions_clicks_table.date DESC, conversions_clicks_table.source, "CTR" DESC; 223 | ``` 224 | 225 |
226 | 227 | 228 | ## Coding (Python) 229 | 230 | **1) FizzBuzz.** Print numbers from 1 to 100 231 | 232 | * If it’s a multiple of 3, print “Fizz” 233 | * If it’s a multiple of 5, print “Buzz” 234 | * If it’s a multiple of both 3 and 5 — “Fizz Buzz” 235 | * Otherwise, print the number itself 236 | 237 | Example of output: 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, Fizz Buzz, 16, 17, Fizz, 19, Buzz, Fizz, 22, 23, Fizz, Buzz, 26, Fizz, 28, 29, Fizz Buzz, 31, 32, Fizz, 34, Buzz, Fizz, ... 238 | 239 | ```python 240 | for i in range(1, 101): 241 | if i % 3 == 0 and i % 5 == 0: 242 | print('Fizz Buzz') 243 | elif i % 3 == 0: 244 | print('Fizz') 245 | elif i % 5 == 0: 246 | print('Buzz') 247 | else: 248 | print(i) 249 | ``` 250 | 251 |
252 | 253 | **2) Factorial**. Calculate a factorial of a number 254 | 255 | * `factorial(5)` = 5! = 1 * 2 * 3 * 4 * 5 = 120 256 | * `factorial(10)` = 10! = 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10 = 3628800 257 | 258 | ```python 259 | def factorial(n): 260 | result = 1 261 | for i in range(2, n + 1): 262 | result *= i 263 | return result 264 | ``` 265 | 266 | We can also write this function using recursion: 267 | 268 | ```python 269 | def factorial(n: int): 270 | if n == 0 or n == 1: 271 | return 1 272 | else: 273 | return n * factorial(n - 1) 274 | ``` 275 | 276 | 277 |
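As a sanity check (assuming the `factorial` defined above is in scope), the result can be compared against the standard library's `math.factorial`; note that, unlike the versions above, `math.factorial` raises `ValueError` for negative input:

```python
import math

# Assumes factorial() from the answer above is already defined
for n in (0, 1, 5, 10):
    assert factorial(n) == math.factorial(n)

print(math.factorial(10))  # 3628800
```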
278 | 279 | **3) Mean**. Compute the mean of numbers in a list. 280 | 281 | * `mean([4, 36, 45, 50, 75]) = 42` 282 | * `mean([]) = NaN` (use `float('NaN')`) 283 | 284 | 285 | 286 | ```python 287 | def mean(numbers): 288 | if len(numbers) > 0: 289 | return sum(numbers) / len(numbers) 290 | return float('NaN') 291 | ``` 292 | 293 |
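A short usage sketch, assuming the `mean` function above: since NaN compares unequal to everything, including itself, the empty-list case should be tested with `math.isnan` rather than `==`:

```python
import math

assert mean([4, 36, 45, 50, 75]) == 42
assert math.isnan(mean([]))   # NaN != NaN, so use isnan instead of ==
```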
294 | 295 | **4) STD**. Calculate the standard deviation of elements in a list. 296 | 297 | * `std([1, 2, 3, 4]) = 1.29` 298 | * `std([1]) = NaN` (use `float('NaN')`) 299 | * `std([]) = NaN` 300 | 301 | 302 | 303 | ```python 304 | from math import sqrt 305 | from statistics import mean 306 | 307 | def std_dev(numbers): 308 | if len(numbers) > 1: 309 | avg = mean(numbers) 310 | var = sum((i - avg) ** 2 for i in numbers) / (len(numbers) - 1) 311 | std = sqrt(var) 312 | return std 313 | return float('NaN') 314 | ``` 315 | 316 |
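As a cross-check (assuming `std_dev` from above is in scope): the standard library's `statistics.stdev` also uses the sample formula with `n - 1` in the denominator, so the two should agree up to floating-point error:

```python
import math
import statistics

values = [1, 2, 3, 4]
assert math.isclose(std_dev(values), statistics.stdev(values))  # both ≈ 1.29
```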
317 | 318 | **5) RMSE**. Calculate the RMSE (root mean squared error) of a model. The function takes in two lists: one with actual values, one with predictions. 319 | 320 | * `rmse([1, 2], [1, 2]) = 0` 321 | * `rmse([1, 2, 3], [3, 2, 1]) = 1.63` 322 | 323 | 324 | 325 | ```python 326 | import math 327 | 328 | def rmse(y_true, y_pred): 329 | assert len(y_true) == len(y_pred), 'different sizes of the arguments' 330 | squares = sum((x - y)**2 for x, y in zip(y_true, y_pred)) 331 | return math.sqrt(squares / len(y_true)) 332 | ``` 333 | 334 |
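If NumPy happens to be available (an assumption, the pure-Python solution above does not need it), the same metric can be cross-checked with a vectorized one-liner:

```python
import numpy as np

def rmse_np(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse_np([1, 2], [1, 2]))        # 0.0
print(rmse_np([1, 2, 3], [3, 2, 1]))  # ≈ 1.63
```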
335 | 336 | **6) Remove duplicates**. Remove duplicates from a list. The list is not sorted and the order of elements from the original list should be preserved. 337 | 338 | * `[1, 2, 3, 1]` ⇒ `[1, 2, 3]` 339 | * `[1, 3, 2, 1, 5, 3, 5, 1, 4]` ⇒ `[1, 3, 2, 5, 4]` 340 | 341 | ```python 342 | def remove_duplicates(lst): 343 | new_list = [] 344 | mentioned_values = set() 345 | for elem in lst: 346 | if elem not in mentioned_values: 347 | new_list.append(elem) 348 | mentioned_values.add(elem) 349 | return new_list 350 | 351 | # The solution above keeps the already seen values in a set, so each membership 352 | # check is O(1) and the whole function runs in O(n). 353 | # A shorter alternative follows: it is O(n^2) because `elem not in new_list` scans 354 | # the list, so it is only suitable for small inputs. 355 | def remove_duplicates2(lst): 356 | new_list = [] 357 | for elem in lst: 358 | if elem not in new_list: 359 | new_list.append(elem) 360 | return new_list 361 | ``` 362 | 363 |
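Another common idiom, assuming Python 3.7+ where `dict` preserves insertion order (the function name below is just for illustration): `dict.fromkeys` keeps the first occurrence of each element and runs in O(n), like the set-based solution, and likewise requires hashable elements:

```python
def remove_duplicates3(lst):
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(lst))

print(remove_duplicates3([1, 3, 2, 1, 5, 3, 5, 1, 4]))  # [1, 3, 2, 5, 4]
```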
364 | 365 | **7) Count**. Count how many times each element in a list occurs. 366 | 367 | `[1, 3, 2, 1, 5, 3, 5, 1, 4]` ⇒ 368 | * 1: 3 times 369 | * 2: 1 time 370 | * 3: 2 times 371 | * 4: 1 time 372 | * 5: 2 times 373 | 374 | ```python 375 | numbers = [1, 3, 2, 1, 5, 3, 5, 1, 4] 376 | counter = dict() 377 | for elem in numbers: 378 | counter[elem] = counter.get(elem, 0) + 1 379 | ``` 380 | or 381 | ```python 382 | from collections import Counter 383 | 384 | numbers = [1, 3, 2, 1, 5, 3, 5, 1, 4] 385 | counter = Counter(numbers) 386 | ``` 387 | 388 |
389 | 390 | **8) Palindrome**. Is string a palindrome? A palindrome is a word which reads the same backward as forwards. 391 | 392 | * “ololo” ⇒ Yes 393 | * “cafe” ⇒ No 394 | 395 | ```python 396 | def is_palindrome(s): 397 | return s == s[::-1] 398 | ``` 399 | or 400 | ```python 401 | def is_palindrome(s): 402 | for i in range(len(s) // 2): 403 | if s[i] != s[-i - 1]: 404 | return False 405 | return True 406 | ``` 407 | 408 |
409 | 410 | **9) Counter**. We have a list with identifiers of form “id-SITE”. Calculate how many ids we have per site. 411 | 412 | 413 | 414 | ```python 415 | def counter(lst): 416 |     ans = {} 417 |     for i in lst: 418 |         site = i.split('-')[-1]  # the site is the part after the last '-' 419 |         ans[site] = ans.get(site, 0) + 1 420 |     return ans 421 | ``` 422 | 423 |
424 | 425 | **10) Top counter**. We have a list with identifiers of form “id-SITE”. Show the top 3 sites. You can break ties in any way you want. 426 | 427 | 428 | 429 | ```python 430 | def top_counter(lst): 431 | site_dict = counter(lst) # using last problem's solution 432 | top_keys = sorted(site_dict, reverse=True, key=site_dict.get)[:3] 433 | return {key: site_dict[key] for key in top_keys} 434 | ``` 435 | 436 |
437 | 438 | **11) RLE**. Implement RLE (run-length encoding): encode each character by the number of times it appears consecutively. 439 | 440 | * `'aaaabbbcca'` ⇒ `[('a', 4), ('b', 3), ('c', 2), ('a', 1)]` 441 | * (note that there are two groups of 'a') 442 | 443 | ```python 444 | def rle(s): 445 | ans, cur, num = [], None, 0 446 | for i in range(len(s)): 447 | if i == 0: 448 | cur, num = s[i], 1 449 | elif cur != s[i]: 450 | ans.append((cur, num)) 451 | cur, num = s[i], 1 452 | else: 453 | num += 1 454 | if i == len(s) - 1: 455 | ans.append((cur, num)) 456 | return ans 457 | ``` 458 | 459 | Using itertools.groupby 460 | ```python 461 | import itertools 462 | 463 | def rle(s): 464 | return [(l, len(list(g))) for l, g in itertools.groupby(s)] 465 | ``` 466 | 467 |
468 | 469 | **12) Jaccard**. Calculate the Jaccard similarity between two sets: the size of intersection divided by the size of union. 470 | 471 | * `jaccard({'a', 'b', 'c'}, {'a', 'd'}) = 1 / 4` 472 | 473 | 474 | 475 | ```python 476 | def jaccard(a, b): 477 | return len(a & b) / len(a | b) 478 | ``` 479 | 480 |
481 | 482 | **13) IDF**. Given a collection of already tokenized texts, calculate the IDF (inverse document frequency) of each token. 483 | 484 | * input example: `[['interview', 'questions'], ['interview', 'answers']]` 485 | 486 | idf(t) = log10(N / (1 + n(t))) 487 | 488 | Where: 489 | 490 | * t is the token, 491 | * n(t) is the number of documents that t occurs in, 492 | * N is the total number of documents 493 | 494 | ```python 495 | from math import log10 496 | 497 | def idf1(docs): 498 |     docs = [set(doc) for doc in docs] 499 |     n_tokens = {} 500 |     for doc in docs: 501 |         for token in doc: 502 |             n_tokens[token] = n_tokens.get(token, 0) + 1 503 |     ans = {} 504 |     for token in n_tokens: 505 |         ans[token] = log10(len(docs) / (1 + n_tokens[token])) 506 |     return ans 507 | ``` 508 | 509 | ```python 510 | import math 511 | 512 | def idf2(docs): 513 |     n_docs = len(docs) 514 | 515 |     docs = [set(doc) for doc in docs] 516 |     all_tokens = set.union(*docs) 517 | 518 |     idf_coefficients = {} 519 |     for token in all_tokens: 520 |         n_docs_w_token = sum(token in doc for doc in docs) 521 |         idf_c = math.log10(n_docs / (1 + n_docs_w_token)) 522 |         idf_coefficients[token] = idf_c 523 | 524 |     return idf_coefficients 525 | ``` 526 | 527 |
528 | 529 | **14) PMI**. Given a collection of already tokenized texts, find the PMI (pointwise mutual information) of each pair of tokens. Return top 10 pairs according to PMI. 530 | 531 | * input example: `[['interview', 'questions'], ['interview', 'answers']]` 532 | 533 | PMI is used for finding collocations in text — things like “New York” or “Puerto Rico”. For two consecutive words, the PMI between them is: 534 | 535 | PMI(t1, t2) = log2(P(t1, t2) / (P(t1) * P(t2))) 536 | 537 | The higher the PMI, the more likely these two tokens form a collocation. We can estimate PMI by counting: 538 | 539 | PMI(t1, t2) ≈ log2(N * c(t1, t2) / (c(t1) * c(t2))) 540 | 541 | Where: 542 | * N is the total number of tokens in the text, 543 | * c(t1, t2) is the number of times t1 and t2 appear together, 544 | * c(t1) and c(t2) — the number of times they appear separately. 545 | 546 | ```python 547 | import math 548 | 549 | def pmi(docs): 550 |     n_pairs = {} 551 |     n_tokens = {} 552 | 553 |     for doc in docs: 554 |         for token in doc: 555 |             n_tokens[token] = n_tokens.get(token, 0) + 1 556 |         # count each pair of consecutive tokens
 557 |         for pair in zip(doc, doc[1:]): 558 |             n_pairs[pair] = n_pairs.get(pair, 0) + 1 559 | 560 |     n_total = sum(n_tokens.values()) 561 |     ans = {} 562 |     for pair, n_pair in n_pairs.items(): 563 |         ans[pair] = math.log2(n_pair * n_total / (n_tokens[pair[0]] * n_tokens[pair[1]])) 564 | 565 |     return sorted(ans.items(), key=lambda x: x[1], reverse=True)[:10] 566 | ``` 567 | 568 | 569 |
570 | 571 | ## Algorithmic Questions 572 | 573 | **1) Two sum**. Given an array and a number N, return True if there are numbers A, B in the array such that A + B = N. Otherwise, return False. 574 | 575 | * `[1, 2, 3, 4], 5` ⇒ `True` 576 | * `[3, 4, 6], 6` ⇒ `False` 577 | 578 | Brute force, O(n2): 579 | 580 | ```python 581 | def two_sum(numbers, target): 582 | n = len(numbers) 583 | 584 | for i in range(n): 585 | for j in range(i + 1, n): 586 | if numbers[i] + numbers[j] == target: 587 | return True 588 | 589 | return False 590 | ``` 591 | 592 | Linear, O(n): 593 | 594 | ```python 595 | def two_sum(numbers, target): 596 | index = {num: i for (i, num) in enumerate(numbers)} 597 | 598 | n = len(numbers) 599 | 600 | for i in range(n): 601 | a = numbers[i] 602 | b = target - a 603 | 604 | if b in index: 605 | j = index[b] 606 | if i != j: 607 | return True 608 | 609 | return False 610 | ``` 611 | 612 | Using itertools.combinations 613 | ```python 614 | from itertools import combinations 615 | 616 | def two_sum(numbers, target): 617 | for elem in combinations(numbers, 2): 618 | if elem[0] + elem[1] == target: 619 | return True 620 | return False 621 | ``` 622 | 623 | 624 |
625 | 626 | **2) Fibonacci**. Return the n-th Fibonacci number, which is computed using this formula: 627 | 628 | * F(0) = 0 629 | * F(1) = 1 630 | * F(n) = F(n-1) + F(n-2) 631 | * The sequence is: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... 632 | 633 | ```python 634 | def fibonacci1(n): 635 |     '''naive, complexity = O(2 ** n)''' 636 |     if n == 0 or n == 1: 637 |         return n 638 |     else: 639 |         return fibonacci1(n - 1) + fibonacci1(n - 2) 640 | ``` 641 | 642 | ```python 643 | def fibonacci2(n): 644 |     '''dynamic programming, complexity = O(n)''' 645 |     base1, base2 = 0, 1 646 |     for i in range(n): 647 |         base1, base2 = base2, base1 + base2 648 |     return base1 649 | ``` 650 | 651 | ```python 652 | def fibonacci3(n): 653 |     '''matrix multiplication, complexity = O(log(n))''' 654 |     def mx_mul(m1, m2): 655 |         ans = [[0 for i in range(len(m2[0]))] for j in range(len(m1))] 656 |         for i in range(len(m1)): 657 |             for j in range(len(m2[0])): 658 |                 for k in range(len(m2)): 659 |                     ans[i][j] += m1[i][k] * m2[k][j] 660 |         return ans 661 |     def pow(a, b): 662 |         ans = [[1, 0], [0, 1]] 663 |         while b > 0: 664 |             if b % 2 == 1: 665 |                 ans = mx_mul(ans, a) 666 |             a = mx_mul(a, a) 667 |             b //= 2 668 |         return ans 669 |     ans = mx_mul(pow([[1, 1], [1, 0]], n), [[1], [0]])[1][0] 670 |     return ans 671 | ``` 672 | 673 | Memoization with a dictionary 674 | 675 | ```python 676 | memo = {0: 0, 1: 1} 677 | 678 | def fibonacci4(n): 679 |     '''top down + memoization (dictionary), complexity = O(n)''' 680 |     if n not in memo: 681 |         memo[n] = fibonacci4(n-1) + fibonacci4(n-2) 682 |     return memo[n] 683 | ``` 684 | 685 | Memoization with `lru_cache` 686 | 687 | ```python 688 | from functools import lru_cache 689 | 690 | @lru_cache() 691 | def fibonacci4(n): 692 |     if n == 0 or n == 1: 693 |         return n 694 |     return fibonacci4(n - 1) + fibonacci4(n - 2) 695 | ``` 696 | 697 | ```python 698 | def fibonacci5(n): 699 |     '''top down + memoization (list), complexity = O(n)''' 700 |     if n == 0 or n == 1: 701 |         return n 702 |     dic = [-1] * (n + 1) 703 |     dic[0], dic[1] = 0, 1 704 |     def helper(n, dic): 705 |         if dic[n] < 0: 706 |             dic[n] = helper(n - 1, dic) + helper(n - 2, dic) 707 |         return dic[n] 708 |     return helper(n, dic) 709 | ``` 710 | 711 | 712 | 713 |
714 | 715 | **3) Most frequent outcome**. We have two dice of different sizes (D1 and D2). We roll them and sum their face values. What are the most probable outcomes? 716 | 717 | * 6, 6 ⇒ [7] 718 | * 2, 4 ⇒ [3, 4, 5] 719 | 720 | ```python 721 | def most_frequent_outcome(d1, d2): 722 | len_ans = abs(d1 - d2) + 1 723 | mi = min(d1, d2) 724 | ans = [mi + i for i in range(1, len_ans + 1)] 725 | return ans 726 | ``` 727 | 728 |
729 | 730 | **4) Reverse a linked list**. Write a function for reversing a linked list. 731 | 732 | * The definition of a list node: `Node(value, next)` 733 | * Example: `a -> b -> c` ⇒ `c -> b -> a` 734 | 735 | ```python 736 | def reverse_ll(head): 737 |     last = None 738 |     point = head 739 |     while point is not None: 740 |         point.next, point, last = last, point.next, point 741 |     return last  # new head of the reversed list 742 | ``` 743 | 744 |
745 | 746 | **5) Flip a binary tree**. Write a function for rotating a binary tree. 747 | 748 | * The definition of a tree node: `Node(value, left, right)` 749 | 750 | 751 | 752 | ```python 753 | def flip_bt(head): 754 | if head is not None: 755 | head.left, head.right = head.right, head.left 756 | flip_bt(head.left) 757 | flip_bt(head.right) 758 | ``` 759 | 760 |
761 | 762 | **6) Binary search**. Return the index of a given number in a sorted array or -1 if it’s not there. 763 | 764 | * `[1, 4, 6, 10], 4` ⇒ `1` 765 | * `[1, 4, 6, 10], 3` ⇒ `-1` 766 | 767 | ```python 768 | def binary_search(lst, num): 769 | left, right = -1, len(lst) 770 | while right - left > 1: 771 | mid = (left + right) // 2 772 | if lst[mid] >= num: 773 | right = mid 774 | else: 775 | left = mid 776 | if right < 0 or right >= len(lst) or lst[right] != num: 777 | return -1 778 | else: 779 | return right 780 | ``` 781 | 782 |
783 | 784 | **7) Deduplication**. Remove duplicates from a sorted array. 785 | 786 | * `[1, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6]` ⇒ `[1, 2, 3, 4, 5, 6]` 787 | 788 | ```python 789 | def deduplication1(lst): 790 | '''manual''' 791 | ans = [] 792 | last = None 793 | for i in lst: 794 | if last != i: 795 | ans.append(i) 796 | last = i 797 | return ans 798 | 799 | def deduplication2(lst): 800 | # order is not guaranteed unless call sorted(list(set(lst))) to sort again 801 | return list(set(lst)) 802 | ``` 803 | 804 |
805 | 806 | **8) Intersection**. Return the intersection of two sorted arrays. 807 | 808 | * `[1, 2, 4, 6, 10], [2, 4, 5, 7, 10]` ⇒ `[2, 4, 10]` 809 | 810 | ```python 811 | def intersection1(lst1, lst2): 812 |     '''preserves duplicates''' 813 |     ans = [] 814 |     p1, p2 = 0, 0 815 |     while p1 < len(lst1) and p2 < len(lst2): 816 |         if lst1[p1] == lst2[p2]: 817 |             ans.append(lst1[p1]) 818 |             p1, p2 = p1 + 1, p2 + 1 819 |         elif lst1[p1] < lst2[p2]: 820 |             p1 += 1 821 |         else: 822 |             p2 += 1 823 |     return ans 824 | 825 | def intersection2(lst1, lst2): 826 |     '''removes duplicates''' 827 |     # order is not guaranteed unless call sorted(...) to sort again 828 |     return list(set(lst1) & set(lst2)) 829 | ``` 830 | 831 |
832 | 833 | **9) Union**. Return the union of two sorted arrays. 834 | 835 | * `[1, 2, 4, 6, 10], [2, 4, 5, 7, 10]` ⇒ `[1, 2, 4, 5, 6, 7, 10]` 836 | 837 | ```python 838 | def union1(lst1, lst2): 839 |     '''preserves duplicates''' 840 |     ans = [] 841 |     p1, p2 = 0, 0 842 |     while p1 < len(lst1) and p2 < len(lst2): 843 |         if lst1[p1] == lst2[p2]: 844 |             ans.append(lst1[p1]) 845 |             p1, p2 = p1 + 1, p2 + 1 846 |         elif lst1[p1] < lst2[p2]: 847 |             ans.append(lst1[p1]) 848 |             p1 += 1 849 |         else: 850 |             ans.append(lst2[p2]) 851 |             p2 += 1 852 |     return ans + lst1[p1:] + lst2[p2:]  # append whatever is left in either list 853 | 854 | def union2(lst1, lst2): 855 |     '''removes duplicates''' 856 |     # order is not guaranteed unless call sorted(...) to sort again 857 |     return list(set(lst1) | set(lst2)) 858 | ``` 859 | 860 |
861 | 862 | **10) Addition**. Implement the addition algorithm from school. Suppose we represent numbers by a list of integers from 0 to 9: 863 | 864 | * 12 is `[1, 2]` 865 | * 1000 is `[1, 0, 0, 0]` 866 | 867 | Implement the “+” operation for this representation 868 | 869 | * `[1, 1] + [1]` ⇒ `[1, 2]` 870 | * `[9, 9] + [2]` ⇒ `[1, 0, 1]` 871 | 872 | ```python 873 | def addition(lst1, lst2): 874 | def list_to_int(lst): 875 | ans, base = 0, 1 876 | for i in lst[::-1]: 877 | ans += i * base 878 | base *= 10 879 | return ans 880 | val = list_to_int(lst1) + list_to_int(lst2) 881 | ans = [int(i) for i in str(val)] 882 | return ans 883 | 884 | # another solution without int() and str() should be helpful 885 | def addition2(lst1, lst2): 886 | if len(lst2) == 0 or lst2 == [0]: 887 | return lst1[:] 888 | elif len(lst1) == 0: 889 | return lst2[:] 890 | # lst1, lst2 not empty 891 | digit1, lst1rest = lst1[-1], lst1[:-1] 892 | digit2, lst2rest = lst2[-1], lst2[:-1] 893 | digit, remainder = divmod(digit1 + digit2, 10) 894 | lst = addition2( 895 | addition2(lst1rest, [digit]), # recursively add digit to lst1 896 | lst2rest) # and then continue to add lst2 897 | ans = lst + [remainder] # add the remainder as last digit 898 | return ans 899 | ``` 900 | 901 |
902 | 903 | **11) Sort by custom alphabet**. You’re given a list of words and an alphabet (e.g. a permutation of Latin alphabet). You need to use this alphabet to order words in the list. 904 | 905 | Example: 906 | 907 | * Words: `['home', 'oval', 'cat', 'egg', 'network', 'green']` 908 | * Dictionary: `'bcdfghijklmnpqrstvwxzaeiouy'` 909 | 910 | Output: 911 | 912 | * `['cat', 'green', 'home', 'network', 'egg', 'oval']` 913 | 914 | ```python 915 | def sort_by_custom_alphabet(dictionary, words): 916 | words = sorted(words, key = lambda word: [dictionary.index(c) for c in word]) 917 | return words 918 | ``` 919 | 920 |
921 | 922 | **12) Check if a tree is a binary search tree**. In BST, the element in the root is: 923 | 924 | * Greater than or equal to the numbers on the left 925 | * Less than or equal to the number on the right 926 | * The definition of a tree node: `Node(value, left, right)` 927 | 928 | ```python 929 | def check_is_bst(head, min_val=None, max_val=None): 930 |     """Check whether a binary tree is a binary search tree. 931 | 932 |     Besides the local condition left.val <= node.val <= right.val, every node in the 933 |     left subtree must be <= the current node and every node in the right subtree must 934 |     be >= the current node. We enforce this by passing the allowed (min_val, max_val) 935 |     range down the recursion. 936 |     """ 937 |     if head is None: 938 |         return True 939 | 940 |     if min_val is not None and head.val < min_val: 941 |         return False 942 |     if max_val is not None and head.val > max_val: 943 |         return False 944 | 945 |     # left subtree values must not exceed head.val, 946 |     # right subtree values must not be below head.val 947 |     return (check_is_bst(head.left, min_val, head.val) and 948 |             check_is_bst(head.right, head.val, max_val)) 949 | ``` 950 | 951 |
962 | 963 | 964 | **13) Maximum Sum Contiguous Subarray**. You are given an array `A` of length `N`; you have to find the largest possible sum of a contiguous subarray of array `A`. 965 | * `[-2, 1, -3, 4, -1, 2, 1, -5, 4]` gives `6` as the largest sum (from the subarray `[4, -1, 2, 1]`) 966 | 967 | ```python 968 | def max_sum_subarr(list1, size): 969 |     """Kadane's algorithm, an optimal solution. 970 |     Time complexity: O(n) 971 |     Description: keep one variable with the best sum of a subarray ending at the 972 |     current index (curr_max) and one with the best sum seen so far (global_max). 973 |     At every index, curr_max either extends the previous subarray or starts a new one, 974 |     and global_max keeps the maximum of all curr_max values. 975 |     After iterating through the list, global_max contains the maximum subarray sum. 976 |     """ 977 |     curr_max = list1[0] 978 |     global_max = list1[0] 979 |     for each in range(1, size): 980 |         curr_max = max(list1[each], curr_max + list1[each]) 981 |         global_max = max(global_max, curr_max) 982 |     return global_max 983 | 984 | n = int(input()) 985 | list1 = [] 986 | for i in range(n): 987 |     num = int(input()) 988 |     list1.append(num) 989 | 990 | print(max_sum_subarr(list1, len(list1))) 991 | ``` 992 | 993 |
994 | 995 | **14) Three sum**. Given an array, and a target value, find all possible combinations of three distinct numbers such that the sum of these three distinct numbers is equal to the target value. 996 | 997 | Example: 998 | 999 | Input: [12, 3, 1, 2, -6, 5, -8, 6], 0 1000 | Output: [[-8, 2, 6], [-8, 3, 5], [-6, 1, 5]] 1001 | 1002 | ```python 1003 | def threeSum(array, target): 1004 | array.sort() 1005 | triplets = [] 1006 | 1007 | for i in range(len(array) - 2): 1008 | left = i + 1 1009 | right = len(array) - 1 1010 | while left < right: 1011 | currentSum = array[i] + array[left] + array[right] 1012 | if currentSum == target: 1013 | triplets.append([array[i], array[left], array[right]]) 1014 | left += 1 1015 | right -= 1 1016 | elif currentSum > target: 1017 | right -= 1 1018 | elif currentSum < target: 1019 | left += 1 1020 | return triplets 1021 | ``` 1022 | 1023 | **15) Find Duplicate in array** Given and array, find all duplicated value in the array 1024 | 1025 | Example: 1026 | 1027 | input: [1,2,3,4,3,4,5] 1028 | output: [3,4] 1029 | 1030 | ```python 1031 | a = [1,2,3,4,3,4,5] 1032 | 1033 | def duplicates(a): 1034 | seen = set() 1035 | duplicated = set() 1036 | for i in a: 1037 | if i in seen: 1038 | duplicated.add(i) 1039 | else: 1040 | seen.add(i) 1041 | return duplicated 1042 | ``` 1043 | -------------------------------------------------------------------------------- /theory.md: -------------------------------------------------------------------------------- 1 | # Theoretical interview questions 2 | 3 | 4 | 5 | 6 | 10 | 11 |
⚠️ 7 | The answers here are given by the community. Be careful and double check the answers before using them.
8 | If you see an error, please create a PR with a fix 9 |
12 | 13 | 14 | * The list of questions is based on [this post](https://medium.com/data-science-insider/160-data-science-interview-questions-14dbd8bf0a08?source=friends_link&sk=7acf122a017c672a95f70c7cb7b585c0) 15 | * Legend: 👶 easy ‍⭐️ medium 🚀 expert 16 | * Do you know how to answer questions without answers? Please create a PR 17 | 18 | 19 | ## Table of contents 20 | 21 | * [Supervised machine learning](#supervised-machinelearning) 22 | * [Linear regression](#linear-regression) 23 | * [Validation](#validation) 24 | * [Classification](#classification) 25 | * [Regularization](#regularization) 26 | * [Feature selection](#feature-selection) 27 | * [Decision trees](#decision-trees) 28 | * [Random forest](#random-forest) 29 | * [Gradient boosting](#gradient-boosting) 30 | * [Parameter tuning](#parameter-tuning) 31 | * [Neural networks](#neural-networks) 32 | * [Optimization in neural networks](#optimization-in-neuralnetworks) 33 | * [Neural networks for computer vision](#neural-networks-for-computervision) 34 | * [Text classification](#text-classification) 35 | * [Clustering](#clustering) 36 | * [Dimensionality reduction](#dimensionality-reduction) 37 | * [Ranking and search](#ranking-andsearch) 38 | * [Recommender systems](#recommender-systems) 39 | * [Time series](#time-series) 40 | 41 |
42 | 43 | ## Supervised machine learning 44 | 45 | **What is supervised machine learning? 👶** 46 | 47 | Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc. 48 | 49 | k-Nearest Neighbors (KNN): classify a point by looking at the k closest labeled data points. 50 | 51 | <br/>
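To make the KNN example concrete, here is a minimal sketch (not from the original answer; it assumes scikit-learn is available and uses the bundled iris dataset purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # labeled data: features X, labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)  # look at the 5 closest labeled points
model.fit(X_train, y_train)                  # learn from the labeled examples
print(model.score(X_test, y_test))           # accuracy on unseen data
```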
52 | 53 | ## Linear regression 54 | 55 | **What is regression? Which models can you use to solve a regression problem? 👶** 56 | 57 | Regression is a part of supervised ML. Regression models investigate the relationship between a dependent (target) and independent variable (s) (predictor). 58 | Here are some common regression models 59 | 60 | - *Linear Regression* establishes a linear relationship between target and predictor (s). It predicts a numeric value and has a shape of a straight line. 61 | - *Polynomial Regression* has a regression equation with the power of independent variable more than 1. It is a curve that fits into the data points. 62 | - *Ridge Regression* helps when predictors are highly correlated (multicollinearity problem). It penalizes the squares of regression coefficients but doesn’t allow the coefficients to reach zeros (uses L2 regularization). 63 | - *Lasso Regression* penalizes the absolute values of regression coefficients and allows some of the coefficients to reach absolute zero (thereby allowing feature selection). 64 | 65 |
66 | 67 | **What is linear regression? When do we use it? 👶** 68 | 69 | Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y). 70 | 71 | With a simple equation: 72 | 73 | ``` 74 | y = B0 + B1*x1 + ... + Bn * xN 75 | ``` 76 | 77 | B is regression coefficients, x values are the independent (explanatory) variables and y is dependent variable. 78 | 79 | The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. 80 | 81 | Simple linear regression: 82 | 83 | ``` 84 | y = B0 + B1*x1 85 | ``` 86 | 87 | Multiple linear regression: 88 | 89 | ``` 90 | y = B0 + B1*x1 + ... + Bn * xN 91 | ``` 92 | 93 |
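As a hedged illustration of fitting such a model (assuming scikit-learn is installed; the numbers below are made up so that roughly y = 1 + 2*x1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one explanatory variable x1
y = np.array([3.1, 4.9, 7.2, 8.8])          # target, roughly y = 1 + 2 * x1

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)        # estimates of B0 and B1
print(model.predict([[5.0]]))               # prediction for a new observation
```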
94 | 95 | **What are the main assumptions of linear regression? ⭐** 96 | 97 | There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading. 98 | 99 | 1. **Linear relationship** between features and target variable. 100 | 2. **Additivity** means that the effect of changes in one of the features on the target variable does not depend on values of other features. For example, a model for predicting revenue of a company have of two features - the number of items _a_ sold and the number of items _b_ sold. When company sells more items _a_ the revenue increases and this is independent of the number of items _b_ sold. But, if customers who buy _a_ stop buying _b_, the additivity assumption is violated. 101 | 3. Features are not correlated (no **collinearity**) since it can be difficult to separate out the individual effects of collinear features on the target variable. 102 | 4. Errors are independently and identically normally distributed (yi = B0 + B1*x1i + ... + errori): 103 | 1. No correlation between errors (consecutive errors in the case of time series data). 104 | 2. Constant variance of errors - **homoscedasticity**. For example, in case of time series, seasonal patterns can increase errors in seasons with higher activity. 105 | 3. Errors are normally distributed, otherwise some features will have more influence on the target variable than to others. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow. 106 | 107 |
108 | 109 | **What’s the normal distribution? Why do we care about it? 👶** 110 | 111 | The normal distribution is a continuous probability distribution whose probability density function takes the following formula: 112 | 113 | ![formula](https://mathworld.wolfram.com/images/equations/NormalDistribution/NumberedEquation1.gif) 114 | 115 | where μ is the mean and σ is the standard deviation of the distribution. 116 | 117 | The normal distribution derives its importance from the **Central Limit Theorem**, which states that if we draw a large enough number of samples, their mean will follow a normal distribution regardless of the initial distribution of the sample, i.e **the distribution of the mean of the samples is normal**. It is important that each sample is independent from the other. 118 | 119 | This is powerful because it helps us study processes whose population distribution is unknown to us. 120 | 121 | 122 |
123 | 124 | **How do we check if a variable follows the normal distribution? ‍⭐️** 125 | 126 | 1. Plot a histogram out of the sampled data. If you can fit the bell-shaped "normal" curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected. 127 | 2. Check Skewness and Kurtosis of the sampled data. Skewness = 0 and kurtosis = 3 are typical for a normal distribution, so the farther away they are from these values, the more non-normal the distribution. 128 | 3. Use Kolmogorov-Smirnov or/and Shapiro-Wilk tests for normality. They take into account both Skewness and Kurtosis simultaneously. 129 | 4. Check for Quantile-Quantile plot. It is a scatterplot created by plotting two sets of quantiles against one another. Normal Q-Q plot place the data points in a roughly straight line. 130 | 131 |
132 | 133 | **What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices? ‍⭐️** 134 | 135 | Prices are usually not normally distributed. Real-world (and especially uncleaned) datasets tend to be skewed, and price data is a typical example: house prices, for instance, depend on many factors and usually have a long right tail of expensive outliers. 136 | 137 | Yes, pre-processing is usually needed. Most probably, you will need to handle the outliers and/or transform the prices to make the distribution closer to normal. 138 | 139 | <br/>
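One common transformation for skewed targets such as prices is a log transform; a minimal sketch (assuming numpy; the prices are invented):

```python
import numpy as np

prices = np.array([120_000, 250_000, 300_000, 2_500_000])  # right-skewed target

log_prices = np.log1p(prices)  # train the model on log(1 + price) instead of raw prices
# ... fit a regression model on log_prices ...
back = np.expm1(log_prices)    # invert the transform to get values in the original units
```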
140 | 141 | **What methods for solving linear regression do you know? ‍⭐️** 142 | 143 | To solve linear regression, you need to find the coefficients which minimize the sum of squared errors. 144 | 145 | Matrix Algebra method: Let's say you have `X`, a matrix of features, and `y`, a vector with the values you want to predict. After going through the matrix algebra and minimization problem, you get this solution: `B = (X'X)^(-1) X'y` (the normal equation). 146 | 147 | But solving this requires you to find an inverse, which can be time-consuming, if not impossible. Luckily, there are methods like Singular Value Decomposition (SVD) or QR Decomposition that can reliably calculate this part (called the pseudo-inverse) without actually computing an inverse. The popular Python ML library `sklearn` uses SVD to solve least squares. 148 | 149 | Alternative method: Gradient Descent. See explanation below. 150 | 151 | <br/>
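A minimal numerical sketch of the matrix-algebra route (assuming numpy; `np.linalg.lstsq` solves the least-squares problem with SVD-based routines rather than an explicit inverse):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of ones models the intercept
y = np.array([2.0, 3.0, 4.0])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [1., 1.]  ->  y = 1 + 1 * x
```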
152 | 153 | **What is gradient descent? How does it work? ‍⭐️** 154 | 155 | Gradient descent is an optimization algorithm that uses the gradient (a concept from calculus) to find a local or global minimum of a function. Starting from an initial point, it repeatedly updates that point by moving in the direction of the negative gradient, scaled by a learning rate, until the updates become so small that the algorithm has effectively converged to a local or global minimum. It is widely used in machine learning, e.g. for minimizing a loss function with respect to model parameters. 156 | 157 | <br/>
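A tiny sketch of gradient descent minimizing the MSE of a linear model (illustrative only; `X`, `y`, the learning rate and the number of iterations are made up):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

w = np.zeros(2)
learning_rate = 0.1
for _ in range(1000):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of the MSE loss at the current point
    w -= learning_rate * grad              # step in the direction of the negative gradient
print(w)  # converges close to [1., 1.]
```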
158 | 159 | **What is the normal equation? ‍⭐️** 160 | 161 | Normal equations are equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); normal equations allow one to estimate the parameters of a multiple linear regression. 162 | 163 |
164 | 165 | **What is SGD  —  stochastic gradient descent? What’s the difference with the usual gradient descent? ‍⭐️** 166 | 167 | In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function. 168 | 169 | The difference lies in how the gradient of the loss function is estimated. In the usual GD, you have to run through ALL the samples in your training set in order to estimate the gradient and do a single update for a parameter in a particular iteration. In SGD, on the other hand, you use ONLY ONE or SUBSET of training sample from your training set to estimate the gradient and do the update for a parameter in a particular iteration. If you use SUBSET, it is called Minibatch Stochastic gradient Descent. 170 | 171 |
172 | 173 | **Which metrics for evaluating regression models do you know? 👶** 174 | 175 | 1. Mean Squared Error(MSE) 176 | 2. Root Mean Squared Error(RMSE) 177 | 3. Mean Absolute Error(MAE) 178 | 4. R² or Coefficient of Determination 179 | 5. Adjusted R² 180 | 181 |
182 | 183 | **What are MSE and RMSE? 👶** 184 | 185 | MSE stands for Mean Squared Error and RMSE for Root Mean Squared Error; both are metrics for evaluating regression models. MSE is the average of the squared differences between the predictions and the true values, and RMSE is the square root of MSE, which brings the error back to the units of the target variable. 186 | 187 | <br/>
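A short sketch of both metrics in plain numpy (the example matches the RMSE exercise from the coding section):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

print(mse([1, 2, 3], [3, 2, 1]))   # 8/3 ≈ 2.67
print(rmse([1, 2, 3], [3, 2, 1]))  # ≈ 1.63
```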
188 | 189 | **What is the bias-variance trade-off? 👶** 190 | 191 | **Bias** is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model. **Variance** is a model sensitivity to changes in the training dataset. 192 | 193 | **Bias-variance trade-off** is a relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible: 194 | 195 | ``` 196 | ExpectedTestError = Variance + Bias² + IrreducibleError 197 | ``` 198 | 199 | But as a model complexity increases, the bias decreases and the variance increases which leads to *overfitting*. And vice versa, model simplification helps to decrease the variance but it increases the bias which leads to *underfitting*. 200 | 201 |
202 | 203 | 204 | ## Validation 205 | 206 | **What is overfitting? 👶** 207 | 208 | Overfitting is when your model performs very well on the training set but fails to generalize to the test set, because it has fit the training data (including its noise) too closely. 209 | 210 | <br/>
211 | 212 | **How to validate your models? 👶** 213 | 214 | One of the most common approaches is splitting data into train, validation and test parts. 215 | Models are trained on train data, hyperparameters (for example early stopping) are selected based on the validation data, the final measurement is done on test dataset. 216 | Another approach is cross-validation: split dataset into K folds and each time train models on training folds and measure the performance on the validation folds. 217 | Also you could combine these approaches: make a test/holdout dataset and do cross-validation on the rest of the data. The final quality is measured on test dataset. 218 | 219 |
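A minimal sketch of the train/validation/test split described above (assuming scikit-learn; the toy data and the 60/20/20 proportions are just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # toy data

# first hold out 20% as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# then split the remaining 80% into train and validation (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```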
220 | 221 | **Why do we need to split our data into three parts: train, validation, and test? 👶** 222 | 223 | The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never "seen" before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative. 224 | 225 |
226 | 227 | **Can you explain how cross-validation works? 👶** 228 | 229 | Cross-validation is the process to separate your total training set into two subsets: training and validation set, and evaluate your model to choose the hyperparameters. But you do this process iteratively, selecting different training and validation set, in order to reduce the bias that you would have by selecting only one validation set. 230 | 231 |
232 | 233 | **What is K-fold cross-validation? 👶** 234 | 235 | K-fold cross-validation is a method of cross-validation where we choose a number of folds k and divide the dataset into k parts. We take the 1st part as the validation set and the remaining k-1 parts as the training set, then the 2nd part as the validation set and the rest as the training set, and so on. This way each part is used as the validation set exactly once, and the metric is averaged over the k runs. 236 | Plain K-fold should not be used with time series data; time-aware splits are needed instead. 237 | 238 | <br/>
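A short K-fold sketch (assuming scikit-learn; the model and dataset are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores.mean(), scores.std())  # average metric across the 5 validation folds
```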
239 | 240 | **How do we choose K in K-fold cross-validation? What’s your favorite K? 👶** 241 | 242 | There are two things to consider while choosing K: the number of models we get and the size of the validation set. We do not want the number of models to be too small, like 2 or 3; at least 4 models give a less biased estimate of the metrics. On the other hand, we would want each validation fold to be at least 20-25% of the entire data, so that a ratio of roughly 3:1 between the training and validation sets is maintained.<br/>
243 | I tend to use 4 for small datasets and 5 for large ones as K. 244 | 245 |
246 | 247 | 248 | ## Classification 249 | 250 | **What is classification? Which models would you use to solve a classification problem? 👶** 251 | 252 | Classification problems are problems in which our prediction space is discrete, i.e. there is a finite number of values the output variable can be. Some models which can be used to solve classification problems are: logistic regression, decision tree, random forests, multi-layer perceptron, one-vs-all, amongst others. 253 | 254 |
255 | 256 | **What is logistic regression? When do we need to use it? 👶** 257 | 258 | Logistic regression is a Machine Learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g. True and False, "spam" and "not spam", "churn" and "not churn" and so on. The variable is said to be a "binary" or "dichotomous". 259 | 260 |
261 | 262 | **Is logistic regression a linear model? Why? 👶** 263 | 264 | Yes, Logistic Regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. Or in other words, the output cannot depend on the product (or quotient, etc.) of its parameters. 265 | 266 |
267 | 268 | **What is sigmoid? What does it do? 👶** 269 | 270 | A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities. 271 | 272 | Sigmoid(x) = 1 / (1 + e^(-x)) 273 | 274 | <br/>
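A one-liner version in numpy, just to show the squashing behaviour:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))                     # 0.5
print(sigmoid(np.array([-5, 0, 5])))  # all values squashed into (0, 1)
```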
275 | 276 | **How do we evaluate classification models? 👶** 277 | 278 | Depending on the classification problem, we can use the following evaluation metrics: 279 | 280 | 1. Accuracy 281 | 2. Precision 282 | 3. Recall 283 | 4. F1 Score 284 | 5. Logistic loss (also known as Cross-entropy loss) 285 | 6. Jaccard similarity coefficient score 286 | 287 |
288 | 289 | **What is accuracy? 👶** 290 | 291 | Accuracy is a metric for evaluating classification models. It is calculated by dividing the number of correct predictions by the number of total predictions. 292 | 293 |
294 | 295 | **Is accuracy always a good metric? 👶** 296 | 297 | Accuracy is not a good performance metric when there is imbalance in the dataset. For example, in binary classification with 95% of A class and 5% of B class, a constant prediction of A class would have an accuracy of 95%. In the case of an imbalanced dataset, we need to choose precision, recall, or F1 score depending on the problem we are trying to solve. 298 | 299 | <br/>
300 | 301 | **What is the confusion table? What are the cells in this table? 👶** 302 | 303 | Confusion table (or confusion matrix) shows how many True positives (TP), True Negative (TN), False Positive (FP) and False Negative (FN) model has made. 304 | 305 | || | Actual | Actual | 306 | |:---:| :---: | :---: |:---: | 307 | || | Positive (1) | Negative (0) | 308 | |Predicted| Positive (1) | TP | FP | 309 | |Predicted| Negative (0) | FN | TN | 310 | 311 | * True Positives (TP): When the actual class of the observation is 1 (True) and the prediction is 1 (True) 312 | * True Negative (TN): When the actual class of the observation is 0 (False) and the prediction is 0 (False) 313 | * False Positive (FP): When the actual class of the observation is 0 (False) and the prediction is 1 (True) 314 | * False Negative (FN): When the actual class of the observation is 1 (True) and the prediction is 0 (False) 315 | 316 | Most of the performance metrics for classification models are based on the values of the confusion matrix. 317 | 318 |
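A small sketch of building this table with scikit-learn (assuming it is installed; the labels are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# with labels=[0, 1] the rows are the actual classes and the columns the predictions,
# so the layout is [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
```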
319 | 320 | **What are precision, recall, and F1-score? 👶** 321 | 322 | * Precision and recall are classification evaluation metrics: 323 | * P = TP / (TP + FP) and R = TP / (TP + FN). 324 | * Where TP is true positives, FP is false positives and FN is false negatives 325 | * In both cases the score of 1 is the best: we get no false positives or false negatives and only true positives. 326 | * F1 is a combination of both precision and recall in one score (harmonic mean): 327 | * F1 = 2 * PR / (P + R). 328 | * Max F score is 1 and min is 0, with 1 being the best. 329 | 330 |
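The same toy labels as in the confusion-matrix sketch above, scored with scikit-learn (illustrative only):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4
print(f1_score(y_true, y_pred))         # harmonic mean = 0.75
```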
331 | 332 | **Precision-recall trade-off ‍⭐️** 333 | 334 | A trade-off means that increasing one quantity leads to a decrease in the other. The precision-recall trade-off occurs when we try to increase one of them (precision or recall) while keeping the same model, typically by moving the decision threshold. 335 | 336 | In an ideal scenario with perfectly separable data, both precision and recall can reach the maximum value of 1.0. But in most practical situations the dataset is noisy and not perfectly separable: some points of the positive class lie close to the negative class and vice versa. In such cases, shifting the decision boundary can increase either precision or recall, but not both — increasing one leads to a decrease in the other. 337 | 338 | <br/>
339 | 340 | **What is the ROC curve? When to use it? ‍⭐️** 341 | 342 | ROC stands for *Receiver Operating Characteristic*. The ROC curve plots the true positive rate against the false positive rate for all possible classification thresholds. It is used when the model predicts probabilities (or scores) for a binary outcome and we want to see how well it separates the two classes across thresholds. 343 | 344 | <br/>
345 | 346 | **What is AUC (AU ROC)? When to use it? ‍⭐️** 347 | 348 | AUC stands for *Area Under the ROC Curve*. While ROC is a curve over classification thresholds, AUC summarizes it in a single number that measures separability. It is used when we need to evaluate how well the model distinguishes between the classes. The value is between 0 and 1, and the higher the better. 349 | 350 | <br/>
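A minimal AUC computation (assuming scikit-learn; the true labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points of the ROC curve
print(roc_auc_score(y_true, y_scores))              # 0.75
```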
351 | 352 | **How to interpret the AU ROC score? ‍⭐️** 353 | 354 | AUC score is the value of *Area Under the ROC Curve*. 355 | 356 | If we assume the ROC curve consists of points (FPR_i, TPR_i), then the area under it can be approximated with the trapezoidal rule: 357 | 358 | AUC ≈ Σ (FPR_(i+1) - FPR_i) * (TPR_i + TPR_(i+1)) / 2 359 | 360 | An excellent model has AUC near 1, which means it has a good measure of separability. A poor model has AUC near 0, which means it has the worst measure of separability. An AUC score of 0.5 means the model has no class separation capacity whatsoever. 361 | 362 | <br/>
363 | 364 | **What is the PR (precision-recall) curve? ‍⭐️** 365 | 366 | A *precision*-*recall curve* (or PR Curve) is a plot of the precision (y-axis) and the recall (x-axis) for different probability thresholds. Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance. 367 | 368 |
369 | 370 | **What is the area under the PR curve? Is it a useful metric? ‍⭐️** 371 | 372 | The Precision-Recall AUC is just like the ROC AUC, in that it summarizes the curve with a range of threshold values as a single score. 373 | 374 | A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. 375 | 376 | <br/>
377 | 378 | **In which cases AU PR is better than AU ROC? ‍⭐️** 379 | 380 | The difference is that AU ROC looks at the true positive rate (TPR) and the false positive rate (FPR), while AU PR looks at the positive predictive value (PPV, i.e. precision) and the true positive rate (TPR, i.e. recall). 381 | 382 | Typically, if true negatives are not meaningful to the problem, or you care more about the positive class, AU PR is going to be more useful. Otherwise, if you care equally about the positive and negative class or your dataset is quite balanced, AU ROC is a good choice. 383 | 384 | <br/>
385 | 386 | **What do we do with categorical variables? ‍⭐️** 387 | 388 | Categorical variables must be encoded before they can be used as features to train a machine learning model. There are various encoding techniques, including: 389 | - One-hot encoding 390 | - Label encoding 391 | - Ordinal encoding 392 | - Target encoding 393 | 394 |
395 | 396 | **Why do we need one-hot encoding? ‍⭐️** 397 | 398 | If we simply encode categorical variables with a Label encoder, they become ordinal which can lead to undesirable consequences. In this case, linear models will treat category with id 4 as twice better than a category with id 2. One-hot encoding allows us to represent a categorical variable in a numerical vector space which ensures that vectors of each category have equal distances between each other. This approach is not suited for all situations, because by using it with categorical variables of high cardinality (e.g. customer id) we will encounter problems that come into play because of the curse of dimensionality. 399 | 400 |
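A small sketch contrasting label encoding and one-hot encoding (assuming pandas; the toy column is invented):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# label encoding: imposes an arbitrary order (blue=0, green=1, red=2)
print(df['color'].astype('category').cat.codes)

# one-hot encoding: one binary column per category, no implied order
print(pd.get_dummies(df['color'], prefix='color'))
```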
401 | 402 | **What is "curse of dimensionality"? ‍⭐️** 403 | 404 | The curse of dimensionality is an issue that arises when working with high-dimensional data. It is often said that "the curse of dimensionality" is one of the main problems with machine learning. The curse of dimensionality refers to the fact that, as the number of dimensions (features) in a data set increases, the number of data points required to accurately learn the relationships between those features increases exponentially. 405 | 406 | A simple example where we have a data set with two features, x1 and x2. If we want to learn the relationship between these two features, we need to have enough data points so that we can accurately estimate the parameters of that relationship. However, if we add a third feature, x3, then the number of data points required to accurately learn the relationships between all three features increases exponentially. This is because there are now more parameters to estimate, and the number of data points needed to accurately estimate those parameters increases exponentially with the number of parameters. 407 | 408 | Simply put, the curse of dimensionality basically means that the error increases with the increase in the number of features. 409 | 410 |
411 | 412 | 413 | ## Regularization 414 | 415 | **What happens to our linear regression model if we have three columns in our data: x, y, z  —  and z is a sum of x and y? ‍⭐️** 416 | 417 | We would not be able to solve the regression analytically. Because z is linearly dependent on x and y, the feature matrix is rank-deficient: X'X becomes a singular (not invertible) matrix and the ordinary least squares solution is not unique. 418 |
419 | 420 | **What happens to our linear regression model if the column z in the data is a sum of columns x and y and some random noise? ‍⭐️** 421 | 422 | It creates a situation known as multicollinearity. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, leading to issues in the estimation of the regression coefficients. Here z is almost an exact linear combination of x and y, so the model cannot reliably separate their individual effects on the target: the coefficient estimates become unstable, with large standard errors. To address it, feature selection, PCA or regularization techniques (e.g. L2) may be used. 423 | 424 | <br/>
425 | 426 | **What is regularization? Why do we need it? 👶** 427 | 428 | Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data. 429 | 430 |
431 | 432 | **Which regularization techniques do you know? ‍⭐️** 433 | 434 | There are mainly two types of regularization, 435 | 1. L1 Regularization (Lasso regularization) - Adds the sum of absolute values of the coefficients to the cost function. 436 | 2. L2 Regularization (Ridge regularization) - Adds the sum of squares of coefficients to the cost function. 437 | 438 | * In both cases the penalty is multiplied by a coefficient λ (lambda), which determines the amount of regularization. 439 | 440 | <br/>
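A hedged sketch of both penalties with scikit-learn (the data is simulated so that only the first two features matter; the `alpha` parameter plays the role of lambda):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(100)  # only 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: can set some coefficients exactly to zero

print(ridge.coef_)
print(lasso.coef_)  # coefficients of the uninformative features are (near) zero
```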
441 | 442 | **What kind of regularization techniques are applicable to linear models? ‍⭐️** 443 | 444 | AIC/BIC, Ridge regression, Lasso, Elastic Net, Basis pursuit denoising, Rudin–Osher–Fatemi model (TV), Potts model, RLAD, 445 | Dantzig Selector, SLOPE 446 | 447 | <br/>
448 | 449 | **How does L2 regularization look like in a linear model? ‍⭐️** 450 | 451 | L2 regularization adds a penalty term to our cost function which is equal to the sum of squares of models coefficients multiplied by a lambda hyperparameter. This technique makes sure that the coefficients are close to zero and is widely used in cases when we have a lot of features that might correlate with each other. 452 | 453 |
454 | 455 | **How do we select the right regularization parameters? 👶** 456 | 457 | Regularization parameters can be chosen with a search over a set of candidate values. For example, in scikit-learn's linear models (https://scikit-learn.org/stable/modules/linear_model.html) the amount of regularization is controlled by the `alpha` parameter; a suitable `alpha` can be found with a grid search or random search, selecting the value that gives the lowest cross-validation (or validation) error. 458 | 459 | 460 | <br/>
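A minimal grid-search sketch for the regularization strength (assuming scikit-learn; the data and the candidate `alpha` values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(100)

param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_)  # alpha with the lowest cross-validation error
```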
461 | 462 | **What’s the effect of L2 regularization on the weights of a linear model? ‍⭐️** 463 | 464 | L2 regularization penalizes larger weights more severely (due to the squared penalty term), which encourages weight values to decay toward zero. 465 | 466 |
467 | 468 | **How does L1 regularization look like in a linear model? ‍⭐️** 469 | 470 | L1 regularization adds a penalty term to our cost function equal to the sum of the absolute values of the model’s coefficients, multiplied by a lambda hyperparameter. For example, the cost function with L1 regularization will look like: `Loss = SumOfSquaredErrors + lambda * (|B1| + |B2| + ... + |Bn|)` 471 | 472 | <br/>
473 | 474 | **What’s the difference between L2 and L1 regularization? ‍⭐️** 475 | 476 | - Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared. 477 | - Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not. 478 | - Computational efficiency: L2 has an analytical solution, while L1 does not. 479 | - Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm. 480 | 481 |
482 | 483 | **Can we have both L1 and L2 regularization components in a linear model? ‍⭐️** 484 | 485 | Yes, elastic net regularization combines L1 and L2 regularization. 486 | 487 |
488 | 489 | **What’s the interpretation of the bias term in linear models? ‍⭐️** 490 | 491 | The bias term (intercept) is the value the model predicts when all features are equal to zero, so it can be interpreted as the baseline level of the target. Don't confuse it with the bias of a model in the bias-variance sense, which is the systematic difference between the average prediction and the true value. 492 | 493 | <br/>
494 | 495 | **How do we interpret weights in linear models? ‍⭐️** 496 | 497 | Without normalizing weights or variables, if you increase the corresponding predictor by one unit, the coefficient represents on average how much the output changes. By the way, this interpretation still works for logistic regression - if you increase the corresponding predictor by one unit, the weight represents the change in the log of the odds. 498 | 499 | If the variables are normalized, we can interpret weights in linear models like the importance of this variable in the predicted result. 500 | 501 |
502 | 503 | **If a weight for one variable is higher than for another  —  can we say that this variable is more important? ‍⭐️** 504 | 505 | Yes - if your predictor variables are normalized. 506 | 507 | Without normalization, the weight represents the change in the output per unit change in the predictor. If you have a predictor with a huge range and scale that is used to predict an output with a very small range - for example, using each nation's GDP to predict maternal mortality rates - your coefficient should be very small. That does not necessarily mean that this predictor variable is not important compared to the others. 508 | 509 |
510 | 511 | **When do we need to perform feature normalization for linear models? When it’s okay not to do it? ‍⭐️** 512 | 513 | Feature normalization is necessary for L1 and L2 regularizations. The idea of both methods is to penalize all the features relatively equally. This can't be done effectively if every feature is scaled differently. 514 | 515 | Linear regression without regularization techniques can be used without feature normalization. Also, regularization can help to make the analytical solution more stable, — it adds the regularization matrix to the feature matrix before inverting it. 516 | 517 |
518 | 519 | 520 | ## Feature selection 521 | 522 | **What is feature selection? Why do we need it? 👶** 523 | 524 | Feature Selection is a method used to select the relevant features for the model to train on. We need feature selection to remove the irrelevant features which leads the model to under-perform. 525 | 526 |
527 | 528 | **Is feature selection important for linear models? ‍⭐️** 529 | 530 | Yes, it is. It can improve model performance by keeping the most important features and removing irrelevant ones, which makes the model simpler and easier to interpret and helps to reduce overfitting. 531 | 532 | <br/>
533 | 534 | **Which feature selection techniques do you know? ‍⭐️** 535 | 536 | Here are some of the feature selections: 537 | - Principal Component Analysis 538 | - Neighborhood Component Analysis 539 | - ReliefF Algorithm 540 | 541 |
542 | 543 | **Can we use L1 regularization for feature selection? ‍⭐️** 544 | 545 | Yes, because the nature of L1 regularization will lead to sparse coefficients of features. Feature selection can be done by keeping only features with non-zero coefficients. 546 | 547 |
548 | 549 | **Can we use L2 regularization for feature selection? ‍⭐️** 550 | 551 | No, Because L2 regularization does not make the weights zero but only makes them very very small. L2 regularization can be used to solve multicollinearity since it stabilizes the model. 552 | 553 |
554 | 555 | 556 | ## Decision trees 557 | 558 | **What are the decision trees? 👶** 559 | 560 | This is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. 561 | 562 | In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible. 563 | 564 | A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable. 565 | 566 | Various techniques : like Gini, Information Gain, Chi-square, entropy. 567 | 568 |
569 | 570 | **How do we train decision trees? ‍⭐️** 571 | 572 | 1. Start at the root node. 573 | 2. For each variable X, find the set S_1 that minimizes the sum of the node impurities in the two child nodes and choose the split {X*,S*} that gives the minimum over all X and S. 574 | 3. If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn. 575 | 576 |
577 | 578 | **What are the main parameters of the decision tree model? 👶** 579 | 580 | * maximum tree depth 581 | * minimum samples per leaf node 582 | * impurity criterion 583 | 584 |
585 | 586 | **How do we handle categorical variables in decision trees? ‍⭐️** 587 | 588 | Some decision tree algorithms can handle categorical variables out of the box, others cannot. However, we can transform categorical variables, e.g. with a binary or a one-hot encoder. 589 | 590 |
591 | 592 | **What are the benefits of a single decision tree compared to more complex models? ‍⭐️** 593 | 594 | * easy to implement 595 | * fast training 596 | * fast inference 597 | * good explainability 598 | 599 |
600 | 601 | **How can we know which features are more important for the decision tree model? ‍⭐️** 602 | 603 | Often, we want to find a split such that it minimizes the sum of the node impurities. The impurity criterion is a parameter of decision trees. Popular methods to measure the impurity are the Gini impurity and the entropy describing the information gain. 604 | 605 |
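For tree implementations that expose impurity-based importances (for example scikit-learn), a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion='gini', random_state=0).fit(X, y)

# total impurity decrease contributed by each feature, normalized to sum to 1
print(tree.feature_importances_)
```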
606 | 607 | 608 | ## Random forest 609 | 610 | **What is random forest? 👶** 611 | 612 | Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem). 613 | 614 |
615 | 616 | **Why do we need randomization in random forest? ‍⭐️** 617 | 618 | Random forest in an extension of the **bagging** algorithm which takes *random data samples from the training dataset* (with replacement), trains several models and averages predictions. In addition to that, each time a split in a tree is considered, random forest takes a *random sample of m features from full set of n features* (without replacement) and uses this subset of features as candidates for the split (for example, `m = sqrt(n)`). 619 | 620 | Training decision trees on random data samples from the training dataset *reduces variance*. Sampling features for each split in a decision tree *decorrelates trees*. 621 | 622 |
623 | 624 | **What are the main parameters of the random forest model? ‍⭐️** 625 | 626 | - `max_depth`: the maximum depth of each tree (longest path between the root node and a leaf) 627 | - `min_samples_split`: the minimum number of observations needed to split a given node 628 | - `max_leaf_nodes`: limits the number of leaf nodes and hence the growth of the trees 629 | - `min_samples_leaf`: the minimum number of samples in a leaf node 630 | - `n_estimators`: the number of trees 631 | - `max_samples`: the fraction of the original dataset given to any individual tree 632 | - `max_features`: the maximum number of features considered when looking for the best split 633 | 634 | <br/>
635 | 636 | **How do we select the depth of the trees in random forest? ‍⭐️** 637 | 638 | The greater the depth, the greater amount of information is extracted from the tree, however, there is a limit to this, and the algorithm even if defensive against overfitting may learn complex features of noise present in data and as a result, may overfit on noise. Hence, there is no hard thumb rule in deciding the depth, but literature suggests a few tips on tuning the depth of the tree to prevent overfitting: 639 | 640 | - limit the maximum depth of a tree 641 | - limit the number of test nodes 642 | - limit the minimum number of objects at a node required to split 643 | - do not split a node when, at least, one of the resulting subsample sizes is below a given threshold 644 | - stop developing a node if it does not sufficiently improve the fit. 645 | 646 |
647 | 648 | **How do we know how many trees we need in random forest? ‍⭐️** 649 | 650 | The number of trees in a random forest is controlled by `n_estimators`; adding more trees makes the predictions more stable (it reduces variance) rather than causing overfitting. There is no fixed rule of thumb for choosing it: it is tuned on the data, typically by starting from a reasonable default and increasing the number of trees until the validation metric stops improving, while keeping the computational cost in mind. 651 | 652 | <br/>
653 | 654 | **Is it easy to parallelize training of a random forest model? How can we do it? ‍⭐️** 655 | 656 | Yes, R provides a simple way to parallelize training of random forests on large scale data. 657 | It makes use of a parameter called multicombine which can be set to TRUE for parallelizing random forest computations. 658 | 659 | ```R 660 | rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine, 661 | .multicombine=TRUE, .packages='randomForest') %dopar% { 662 | randomForest(x, y, ntree=ntree) 663 | } 664 | ``` 665 | 666 | 667 |
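Since the trees are trained independently, scikit-learn can parallelize a random forest as well; a short sketch (the dataset and the number of trees are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# n_jobs=-1 builds the independent trees on all available CPU cores
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X, y)
```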
668 | 669 | **What are the potential problems with many large trees? ‍⭐️** 670 | 671 | - Overfitting: A large number of large trees can lead to overfitting, where the model becomes too complex and is able to memorize the training data but doesn't generalize well to new, unseen data. 672 | 673 | - Slow prediction time: As the number of trees in the forest increases, the prediction time for new data points can become quite slow. This can be a problem when you need to make predictions in real-time or on a large dataset. 674 | 675 | - Memory consumption: Random Forest models with many large trees can consume a lot of memory, which can be a problem when working with large datasets or on a limited hardware. 676 | 677 | - Lack of interpretability: Random Forest models with many large trees can be difficult to interpret, making it harder to understand how the model is making predictions or what features are most important. 678 | 679 | - Difficulty in tuning : With an increasing number of large trees the tuning process becomes more complex and computationally expensive. 680 | 681 | It's important to keep in mind that the number of trees in a Random Forest should be chosen based on the specific problem and dataset, rather than using a large number of trees by default. In practice, the number of trees in a random forest is chosen based on the trade-off between the computational cost and the performance. 682 | 683 |
684 | 685 | **What if instead of finding the best split, we randomly select a few splits and just select the best from them. Will it work? 🚀** 686 | 687 | Answer here 688 | 689 |
690 | 691 | **What happens when we have correlated features in our data? ‍⭐️** 692 | 693 | In random forest, since only a random subset of features is considered at each split, the information carried by a group of correlated features is more likely to be available at any given split than the same information carried by a single feature, so it tends to be over-represented in the trees, while the importance attributed to it gets diluted across the correlated features. 694 | 695 | In general, when you add correlated features, they largely carry the same information, which reduces the robustness and interpretability of your model. Each time you train, the model might pick one feature or the other to "do the same job", i.e. explain the same variance, reduce the same entropy, etc. 696 | 697 |
698 | 699 | 700 | ## Gradient boosting 701 | 702 | **What is gradient boosting trees? ‍⭐️** 703 | 704 | Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. 705 | 706 |
707 | 708 | **What’s the difference between random forest and gradient boosting? ‍⭐️** 709 | 710 | 1. Random Forest builds each tree independently, while Gradient Boosting builds one tree at a time, with each new tree correcting the errors of the current ensemble. 711 | 2. Random Forest combines results at the end of the process (by averaging or majority vote), while Gradient Boosting combines results along the way. 712 | 713 |
714 | 715 | **Is it possible to parallelize training of a gradient boosting model? How to do it? ‍⭐️** 716 | 717 | Yes, to a large extent. Boosting itself is sequential (each tree depends on the previous ones), but the expensive part, finding the best splits within a tree, can be parallelized across features and histogram bins, on multiple CPU cores or on a GPU. For example, XGBoost's histogram-based method (`tree_method = 'hist'`, with `tree_method = 'gpu_hist'` or `device = 'cuda'` in recent versions for GPU training) and LightGBM's multithreading make training much faster. 718 | 719 |
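A minimal XGBoost sketch (the dataset and parameter values are illustrative; the GPU flag depends on the installed XGBoost version):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

# histogram-based split finding, parallelized over all available CPU cores
model = XGBClassifier(n_estimators=300, tree_method="hist", n_jobs=-1)
# on a GPU build: tree_method="gpu_hist" (older versions) or device="cuda" (XGBoost 2.x)
model.fit(X, y)
```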
720 | 721 | **Feature importance in gradient boosting trees  —  what are possible options? ‍⭐️** 722 | 723 | Common options are the gain/split-based importances computed by the library itself (how much each feature improves the loss, or how often it is used, across all splits), permutation importance, and SHAP values. With CatBoost, for example, the built-in method `get_feature_importance` can return SHAP values (https://arxiv.org/abs/1905.04610v1), which show how much each feature contributes to each individual prediction. 724 | 725 | These scores help you understand which features drive the model and which could be excluded; a higher value means a more important feature. 726 | 727 | A simple sanity check: add a random noise column to your data (e.g. drawn from a normal distribution), calculate the feature importances and exclude all features whose importance is below that of the noise column. 728 | 729 |
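A hedged CatBoost sketch of the importance calls mentioned above (the dataset is synthetic and the iteration count is illustrative):

```python
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
train_pool = Pool(X, y)

model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(train_pool)

# default importances: one score per feature
importances = model.get_feature_importance(train_pool)

# per-sample SHAP values: shape (n_samples, n_features + 1), last column is the expected value
shap_values = model.get_feature_importance(train_pool, type="ShapValues")
```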
730 | 731 | **Are there any differences between continuous and discrete variables when it comes to feature importance of gradient boosting models? 🚀** 732 | 733 | Answer here 734 | 735 |
736 | 737 | **What are the main parameters in the gradient boosting model? ‍⭐️** 738 | 739 | There are many parameters, but below are a few key defaults. 740 | * learning_rate=0.1 (shrinkage). 741 | * n_estimators=100 (number of trees). 742 | * max_depth=3. 743 | * min_samples_split=2. 744 | * min_samples_leaf=1. 745 | * subsample=1.0. 746 | 747 |
748 | 749 | **How do you approach tuning parameters in XGBoost or LightGBM? 🚀** 750 | 751 | Depending upon the dataset, parameter tuning can be done manually or with hyperparameter optimization frameworks such as Optuna and Hyperopt. In manual tuning, the most important knobs are the tree-complexity parameters (`max_depth`, `num_leaves`, `min_child_samples` / `min_child_weight`), the learning rate together with the number of trees, and the subsampling parameters (`subsample`, `colsample_bytree`), tuned so that the model does not overfit but still captures the general patterns in the data, i.e. keeping the bias-variance trade-off in balance. 752 | 753 |
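A hedged Optuna sketch for LightGBM (the search ranges, metric and trial count are illustrative assumptions):

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)

def objective(trial):
    # sample one candidate configuration per trial
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = LGBMClassifier(**params)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```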
754 | 755 | **How do you select the number of trees in the gradient boosting model? ‍⭐️** 756 | 757 | Most implementations of gradient boosting are configured by default with a relatively small number of trees (for example, 100). The right number interacts strongly with the learning rate: a smaller learning rate usually needs more trees. Using scikit-learn we can perform a grid search over the `n_estimators` parameter, or use early stopping on a validation set so that boosting stops once the validation error no longer improves. 758 | 759 |
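A minimal scikit-learn grid-search sketch over `n_estimators` (the candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1),
    param_grid={"n_estimators": [50, 100, 200, 400, 800]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)   # the number of trees with the best cross-validated score
```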
760 | 761 | 762 | 763 | ## Parameter tuning 764 | 765 | **Which hyper-parameter tuning strategies (in general) do you know? ‍⭐️** 766 | 767 | There are several strategies for hyper-parameter tuning, but the three most popular nowadays are the following: 768 | * Grid Search is an exhaustive approach: for each hyper-parameter, the user manually gives a list of values for the algorithm to try. Grid search then evaluates the algorithm on each and every combination of hyper-parameters and returns the combination that gives the optimal result (e.g. the lowest MAE). Because it evaluates all combinations, grid search can be quite computationally expensive, and it can lead to sub-optimal results since the user needs to specify concrete values for the hyper-parameters, which is prone to error and requires domain knowledge. 769 | 770 | * Random Search is similar to grid search, but rather than specifying which values to try for each hyper-parameter, an upper and lower bound (or a distribution) is given for each hyper-parameter instead. Random values within these bounds are sampled, typically with uniform probability, and the best combination found is returned to the user. Although this seems less intuitive, less domain knowledge is necessary and, for the same budget, much more of the parameter space can be explored. 771 | 772 | * Bayesian Optimization, in a completely different framework, is a more statistical way of optimizing and is commonly used for neural networks, specifically because a single evaluation of a neural network can be computationally costly. In numerous research papers this method outperforms Grid Search and Random Search, and it is offered on the Google Cloud Platform as well as AWS. A full explanation requires a background in Bayesian statistics and Gaussian processes, but a "simple" explanation is that a much cheaper surrogate model (typically a Gaussian process) approximates the expensive objective, and an acquisition function such as probability of improvement, expected improvement or GP-UCB intelligently chooses which hyper-parameter values to try next on the computationally expensive, original algorithm. After each evaluation, the result is fed back into the surrogate as additional knowledge, and the acquisition function proposes the next set of hyper-parameters. This process continues for a specified number of iterations or amount of time, and the combination of hyper-parameters that performs best on the original algorithm is chosen. 773 | 774 | 775 |
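As a small illustration of the second strategy, a scikit-learn sketch of random search (the parameter ranges are illustrative; grid search works the same way with `GridSearchCV` and an explicit list of values):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={
        "n_estimators": randint(100, 1000),   # sampled uniformly from the given range
        "max_depth": randint(3, 20),
        "max_features": uniform(0.1, 0.9),    # fraction of features, sampled in [0.1, 1.0]
    },
    n_iter=30,   # number of random combinations to try
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```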
776 | 777 | **What’s the difference between grid search parameter tuning strategy and random search? When to use one or another? ‍⭐️** 778 | 779 | In short: grid search tries every combination of user-specified values, so it is preferable when the search space is small and you already know sensible candidate values; random search samples combinations from given ranges and is usually the better default for larger search spaces, since it explores more of the space for the same computational budget. For specifics, refer to the above answer. 780 | 781 |
782 | 783 | 784 | ## Neural networks 785 | 786 | **What kind of problems neural nets can solve? 👶** 787 | 788 | Neural nets are good at solving non-linear problems. Some good examples are problems that are relatively easy for humans (because of experience, intuition, understanding, etc), but difficult for traditional regression models: speech recognition, handwriting recognition, image identification, etc. 789 | 790 |
791 | 792 | **How does a usual fully-connected feed-forward neural network work? ‍⭐️** 793 | 794 | In a usual fully-connected feed-forward network, each neuron receives input from every element of the previous layer and thus the receptive field of a neuron is the entire previous layer. They are usually used to represent feature vectors for input data in classification problems but can be expensive to train because of the number of computations involved. 795 | 796 |
797 | 798 | **Why do we need activation functions? 👶** 799 | 800 | The main idea of using neural networks is to learn complex nonlinear functions. If we do not use an activation function between the layers of a neural network, we are just stacking multiple linear layers on top of each other, and the result collapses into a single linear function. The nonlinearity comes only from the activation function; this is why we need activation functions. 801 | 802 |
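A tiny NumPy check of this collapse (the shapes are arbitrary): two linear layers without an activation are exactly one linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

h = W2 @ (W1 @ x)                              # two "linear layers" without an activation...
np.testing.assert_allclose(h, (W2 @ W1) @ x)   # ...equal a single linear layer with W = W2 @ W1
```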
803 | 804 | **What are the problems with sigmoid as an activation function? ‍⭐️** 805 | 806 | The derivative of the sigmoid function for large positive or negative inputs is almost zero. This causes the vanishing gradient problem: during backpropagation the gradients flowing to the earlier layers become tiny, so the network learns very slowly or not at all. One possible way to mitigate this is to use the ReLU activation function. 807 | 808 |
809 | 810 | **What is ReLU? How is it better than sigmoid or tanh? ‍⭐️** 811 | 812 | ReLU is an abbreviation for Rectified Linear Unit. It is the activation function f(x) = max(0, x): it outputs 0 for all negative inputs and the input itself for all positive inputs. It is very cheap to compute, and unlike sigmoid and tanh it does not saturate for positive inputs (its gradient there is constant 1), which mitigates the vanishing gradient problem and usually makes training faster. 813 | 814 |
815 | 816 | **How we can initialize the weights of a neural network? ‍⭐️** 817 | 818 | Proper initialization of the weight matrices of a neural network is important. 819 | Broadly, there are two ways to initialize: 820 | 1. Initializing the weights with zeros. 821 | Setting the weights to zero makes your network no better than a linear model. Note that setting the biases to 0 causes no trouble: non-zero weights take care of breaking the symmetry, so even if a bias is 0, the values in every neuron are still different. 822 | 2. Initializing the weights randomly. 823 | Assigning random values to the weights is better than zero assignment, but the scale matters: 824 | * a) If the weights are initialized with very large values, the term np.dot(W,X)+b becomes large; with a saturating activation such as sigmoid() the output is mapped close to 1, where the slope of the gradient is almost flat and learning takes a lot of time. 825 | * b) If the weights are initialized with very small values, the signal (and the gradient) shrinks as it passes through each layer, so the gradients in the early layers become tiny; this problem is often referred to as the vanishing gradient. In practice, standard schemes such as Xavier/Glorot or He/Kaiming initialization scale the random values according to the layer size to avoid both extremes. 826 | 827 |
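A minimal PyTorch sketch of these options (the layer size is arbitrary):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)

nn.init.zeros_(layer.bias)              # zero biases are fine
# random initialization scaled to the layer size (the usual choice):
nn.init.xavier_uniform_(layer.weight)   # Xavier/Glorot, suited to sigmoid/tanh
# or, for ReLU networks:
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")   # He/Kaiming
```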
828 | 829 | **What if we set all the weights of a neural network to 0? ‍⭐️** 830 | 831 | If all the weights of a neural network are set to zero, every neuron in a layer produces the same output (W*x = 0), so the gradients backpropagated to all weights in a layer are identical. All the connections/weights therefore learn exactly the same thing (the symmetry is never broken), and the network cannot learn anything useful. 832 | 833 |
834 | 835 | **What regularization techniques for neural nets do you know? ‍⭐️** 836 | 837 | * L1 Regularization - defined as the sum of the absolute values of the individual parameters. The L1 penalty causes a subset of the weights to become exactly zero, suggesting that the corresponding features may safely be discarded. 838 | * L2 Regularization - defined as the sum of the squares of the individual parameters, usually controlled by a regularization hyperparameter alpha. It results in weight decay. 839 | * Data Augmentation - creating additional (synthetic) training examples by transforming the existing ones. 840 | * Dropout - a very effective and widely used regularization technique for neural nets: a few randomly chosen nodes in each layer are deactivated in the forward pass, so the network trains on a different subset of nodes in each iteration. 841 |
842 | 843 | **What is dropout? Why is it useful? How does it work? ‍⭐️** 844 | 845 | Dropout is a technique that at each training step turns off each neuron with a certain probability of *p*. This way at each iteration we train only *1-p* of neurons, which forces the network not to rely only on the subset of neurons for feature representation. This leads to regularizing effects that are controlled by the hyperparameter *p*. 846 | 847 |
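A tiny PyTorch sketch of dropout behaviour in training vs. evaluation mode (the tensor size and p are arbitrary):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # training mode: each unit zeroed with probability p, survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))   # evaluation mode: dropout is disabled, the input passes through unchanged
```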
848 | 849 | 850 | ## Optimization in neural networks 851 | 852 | **What is backpropagation? How does it work? Why do we need it? ‍⭐️** 853 | 854 | Backpropagation is the algorithm that computes the gradient of the error (loss) function with respect to every weight in the network by applying the chain rule layer by layer, starting from the output. Gradient descent (the delta rule) then uses these gradients to search the weight space for a minimum of the error function. 855 | The weights that minimize the error function are then considered to be a solution to the learning problem. 856 | 857 | We need backpropagation because the training loop relies on it: 858 | * Calculate the error – how far the model output is from the actual output. 859 | * Backpropagate – compute the gradient of the error with respect to the weights and biases. 860 | * Update the parameters – if the error is still large, update the weights and biases using the gradients, then check the error again. 861 | Repeat the process until the error becomes small enough. 862 | * Model is ready to make a prediction – once the error is minimized, you can feed inputs to the model and it will produce the output. 863 | 864 |
865 | 866 | **Which optimization techniques for training neural nets do you know? ‍⭐️** 867 | 868 | * Gradient Descent 869 | * Stochastic Gradient Descent 870 | * Mini-Batch Gradient Descent (the usual compromise between the two above) 871 | * Nesterov Accelerated Gradient 872 | * Momentum 873 | * Adagrad 874 | * AdaDelta 875 | * Adam (adaptive learning rates combined with momentum; a common default that usually converges quickly) 876 | 877 |
878 | 879 | **How do we use SGD (stochastic gradient descent) for training a neural net? ‍⭐️** 880 | 881 | SGD approximates the gradient of the loss (an expectation over the whole training set) with the gradient computed on a few randomly selected samples, i.e. a mini-batch, instead of the full data. Compared to batch gradient descent, this makes each update much cheaper on large datasets. For neural networks this reduces the training time a lot, even though the random sampling adds noise to the gradient and more update steps may be needed before convergence. 882 | 883 |
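A minimal PyTorch sketch of mini-batch SGD training (the data, network and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 20)                  # toy features
y = torch.randint(0, 2, (1000,))           # toy binary labels
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for xb, yb in loader:              # one randomly sampled mini-batch at a time
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                # backpropagation: compute gradients
        optimizer.step()               # SGD update using the noisy mini-batch gradient
```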
884 | 885 | **What’s the learning rate? 👶** 886 | 887 | The learning rate is an important hyperparameter that controls how quickly the model is adapted to the problem during the training. It can be seen as the "step width" during the parameter updates, i.e. how far the weights are moved into the direction of the minimum of our optimization problem. 888 | 889 |
890 | 891 | **What happens when the learning rate is too large? Too small? 👶** 892 | 893 | A large learning rate can accelerate the training. However, it is possible that we "shoot" too far and miss the minimum of the function that we want to optimize, which will not result in the best solution. On the other hand, training with a small learning rate takes more time but it is possible to find a more precise minimum. The downside can be that the solution is stuck in a local minimum, and the weights won't update even if it is not the best possible global solution. 894 | 895 |
896 | 897 | **How to set the learning rate? ‍⭐️** 898 | 899 | There is no straightforward way of finding the optimal learning rate for a model; it usually involves some trial and error. A small value such as 0.01 or 0.001 is a common starting point, which is then tweaked so that training neither overshoots nor converges too slowly; learning-rate schedules can also help. 900 | 901 |
902 | 903 | **What is Adam? What’s the main difference between Adam and SGD? ‍⭐️** 904 | 905 | Adam (Adaptive Moment Estimation) is an optimization technique for training neural networks that keeps running averages of the first and second moments of the gradients and adapts the learning rate of each parameter accordingly. The intuition behind Adam is that we don’t want to roll so fast just because we can jump over the minimum; we want to decrease the velocity a little bit for a careful search. 906 | 907 | Adam tends to converge faster and requires less tuning, while well-tuned SGD often converges to solutions that generalize better. 908 | Adam also smooths out the high variance of plain SGD updates, which is one of its main advantages. 909 | 910 |
911 | 912 | **When would you use Adam and when SGD? ‍⭐️** 913 | 914 | Adam tends to converge faster, while SGD often converges to more optimal solutions. 915 | 916 |
917 | 918 | **Do we want to have a constant learning rate or we better change it throughout training? ‍⭐️** 919 | 920 | Generally, it is recommended to start with a relatively high learning rate and then gradually decrease it, so the model does not overshoot the minima; at the same time we don't want to start with a very low learning rate, as the model would take too long to converge. There are many techniques to decay the learning rate. For example, in PyTorch you can use 921 | a scheduler called **StepLR**, which multiplies the learning rate of each parameter group by a factor **gamma** every **step_size** epochs (both are passed as arguments). 922 | 923 |
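A minimal PyTorch sketch of such a schedule (the model, epoch count and schedule values are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# multiply the learning rate by gamma every step_size epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... run one training epoch here (forward pass, loss.backward(), optimizer.step()) ...
    scheduler.step()   # lr is 0.1 for epochs 0-9, 0.01 for 10-19, 0.001 for 20-29
```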
924 | 925 | **How do we decide when to stop training a neural net? 👶** 926 | 927 | Monitor the error on a held-out validation set and stop training when it stops improving (early stopping). In practice, training is stopped after the validation error has not improved for a given number of epochs (the "patience"), and the weights from the best epoch are kept. 928 | 929 |
930 | 931 | **What is model checkpointing? ‍⭐️** 932 | 933 | Model checkpointing is saving the weights (and usually the optimizer state) of a model during training, for example after every epoch or whenever the validation metric improves. For long-running processes this lets you resume training from the latest checkpoint after an interruption and keep the best-performing version of the model. 934 | 935 |
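A minimal PyTorch sketch of saving and restoring a checkpoint (the tiny model and the epoch counter stand in for a real training loop):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch = 5   # e.g. the epoch we just finished

# save a checkpoint mid-training
torch.save({"epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict()}, "checkpoint.pt")

# ...later: restore and resume training from where we left off
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1
```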
936 | 937 | **Can you tell us how you approach the model training process? ‍⭐️** 938 | 939 | Answer here 940 | 941 |
942 | 943 | 944 | ## Neural networks for computer vision 945 | 946 | **How we can use neural nets for computer vision? ‍⭐️** 947 | 948 | Neural nets used in the area of computer vision are generally Convolutional Neural Networks(CNN's). You can learn about convolutions below. It appears that convolutions are quite powerful when it comes to working with images and videos due to their ability to extract and learn complex features. Thus CNN's are a go-to method for any problem in computer vision. 949 | 950 |
951 | 952 | **What’s a convolutional layer? ‍⭐️** 953 | 954 | The idea of the convolutional layer is the assumption that the information needed for making a decision often is spatially close and thus, it only takes the weighted sum over nearby inputs. It also assumes that the networks’ kernels can be reused for all nodes, hence the number of weights can be drastically reduced. To counteract only one feature being learnt per layer, multiple kernels are applied to the input which creates parallel channels in the output. Consecutive layers can also be stacked to allow the network to find more high-level features. 955 | 956 |
957 | 958 | **Why do we actually need convolutions? Can’t we use fully-connected layers for that? ‍⭐️** 959 | 960 | A fully-connected layer needs one weight per inter-layer connection, which means the number of weights which needs to be computed quickly balloons as the number of layers and nodes per layer is increased. 961 | 962 |
963 | 964 | **What’s pooling in CNN? Why do we need it? ‍⭐️** 965 | 966 | Pooling is a technique to downsample the feature map. It allows layers which receive relatively undistorted versions of the input to learn low level features such as lines, while layers deeper in the model can learn more abstract features such as texture. 967 | 968 |
969 | 970 | **How does max pooling work? Are there other pooling techniques? ‍⭐️** 971 | 972 | Max pooling is a technique where the maximum value of a receptive field is passed on in the next feature map. The most commonly used receptive field is 2 x 2 with a stride of 2, which means the feature map is downsampled from N x N to N/2 x N/2. Receptive fields larger than 3 x 3 are rarely employed as too much information is lost. 973 | 974 | Other pooling techniques include: 975 | 976 | * Average pooling, the output is the average value of the receptive field. 977 | * Min pooling, the output is the minimum value of the receptive field. 978 | * Global pooling, where the receptive field is set to be equal to the input size, this means the output is equal to a scalar and can be used to reduce the dimensionality of the feature map. 979 | 980 |
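A small PyTorch sketch of a convolution followed by 2 x 2 max pooling (the tensor sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                 # a batch with one 32x32 RGB image
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 receptive field, stride 2

features = conv(x)            # shape: (1, 16, 32, 32)
downsampled = pool(features)  # shape: (1, 16, 16, 16), each spatial dimension halved
```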
981 | 982 | **Are CNNs resistant to rotations? What happens to the predictions of a CNN if an image is rotated? 🚀** 983 | 984 | CNNs are not resistant to rotation by design. However, we can make our models resistant by augmenting our datasets with different rotations of the raw data. The predictions of a CNN will change if an image is rotated and we did not augment our dataset accordingly. A demonstration of this occurrence can be seen in [this video](https://www.youtube.com/watch?v=VO1bQo4PXV4), where a CNN changes its predicted class between a duck and a rabbit based on the rotation of the image. 985 | 986 |
987 | 988 | **What are augmentations? Why do we need them? 👶** 989 | 990 | Augmentations are an artificial way of expanding an existing dataset by applying transformations to the data, such as flips, rotations, crops, color shifts and noise. They help diversify the data and effectively increase the amount of training data when it is scarce, which usually improves generalization. 991 | 992 |
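A small torchvision sketch of a typical image augmentation pipeline (the specific transforms and their parameters are illustrative):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),    # random crop and rescale
    transforms.RandomHorizontalFlip(p=0.5),                 # mirror half of the images
    transforms.RandomRotation(degrees=15),                  # small random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # random color shifts
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # applied on the fly to each PIL image during training
```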
993 | 994 | **What kind of augmentations do you know? 👶** 995 | 996 | There are many kinds of augmentations, chosen according to the type of data you are working on. For images, common ones include geometric transformations (flips, rotations, scaling), cropping, padding, shifting, color/channel shifts, PCA-based color augmentation and noise injection. 997 | 998 |
999 | 1000 | **How to choose which augmentations to use? ‍⭐️** 1001 | 1002 | Augmentations really depend on the type of output classes and the features you want your model to learn. For eg. if you have mostly properly illuminated images in your dataset and want your model to predict poorly illuminated images too, you can apply channel shifting on your data and include the resultant images in your dataset for better results. 1003 | 1004 |
1005 | 1006 | **What kind of CNN architectures for classification do you know? 🚀** 1007 | 1008 | Image Classification 1009 | * Inception v3 1010 | * Xception 1011 | * DenseNet 1012 | * AlexNet 1013 | * VGG16 1014 | * ResNet 1015 | * SqueezeNet 1016 | * EfficientNet 1017 | * MobileNet 1018 | 1019 | The last three are designed so they use smaller number of parameters which is helpful for edge AI. 1020 | 1021 |
1022 | 1023 | **What is transfer learning? How does it work? ‍⭐️** 1024 | 1025 | Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T,or T_S ≠ T_T. In other words, transfer learning enables to reuse knowledge coming from other domains or learning tasks. 1026 | 1027 | In the context of CNNs, we can use networks that were pre-trained on popular datasets such as ImageNet. We then can use the weights of the layers that learn to represent features and combine them with a new set of layers that learns to map the feature representations to the given classes. Two popular strategies are either to freeze the layers that learn the feature representations completely, or to give them a smaller learning rate. 1028 | 1029 |
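A hedged PyTorch/torchvision sketch of the "freeze the feature extractor" strategy (the number of classes is an assumption; older torchvision versions use `pretrained=True` instead of the `weights` argument):

```python
import torch.nn as nn
from torchvision import models

num_classes = 10                                     # assumed number of target classes

model = models.resnet18(weights="IMAGENET1K_V1")     # backbone pre-trained on ImageNet
for param in model.parameters():
    param.requires_grad = False                      # freeze the pre-trained feature extractor

model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable classification head
# train only model.fc, or give the backbone a smaller learning rate instead of freezing it
```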
1030 | 1031 | **What is object detection? Do you know any architectures for that? 🚀** 1032 | 1033 | Object detection is localizing objects in an image by predicting bounding boxes around them, together with a class label for each box. 1034 | Architectures: 1035 | YOLO, Faster R-CNN, CenterNet 1036 | 1037 |
1038 | 1039 | **What is object segmentation? Do you know any architectures for that? 🚀** 1040 | 1041 | Segmentation is predicting a mask, i.e. a label for every pixel. Semantic segmentation (e.g. U-Net) does not differentiate between separate object instances of the same class, while instance segmentation (e.g. Mask R-CNN) predicts a separate mask per object. 1042 | Architectures: 1043 | Mask R-CNN, U-Net 1044 | 1045 |
1046 | 1047 | 1048 | ## Text classification 1049 | 1050 | **How can we use machine learning for text classification? ‍⭐️** 1051 | 1052 | Machine learning classification algorithms predict a class based on a numerical feature representation. This means that in order to use machine learning for text classification, we need to extract numerical features from our text data first before we can apply machine learning algorithms. Common approaches to extract numerical features from text data are bag of words, N-grams or word embeddings. 1053 | 1054 |
1055 | 1056 | **What is bag of words? How we can use it for text classification? ‍⭐️** 1057 | 1058 | Bag of Words is a representation of text that describes the occurrence of words within a document. The order or structure of the words is not considered. For text classification, we look at the histogram of the words within the text and consider each word count as a feature. 1059 | 1060 |
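A minimal scikit-learn sketch of the bag-of-words representation (the toy documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse matrix of word counts

print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # one row of counts per document
```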
1061 | 1062 | **What are the advantages and disadvantages of bag of words? ‍⭐️** 1063 | 1064 | Advantages: 1065 | 1. Simple to understand and implement. 1066 | 1067 | Disadvantages: 1068 | 1. The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations. 1069 | 2. Sparse representations are harder to model, both for computational reasons (space and time complexity) and for informational reasons (very little signal spread over a very wide representation). 1070 | 3. Discarding word order ignores the context, and in turn the meaning of words in the document. Context and meaning, if modeled, could tell apart the same words arranged differently (“this is interesting” vs “is this interesting”) or synonyms (“old bike” vs “used bike”). 1071 | 1072 |
1073 | 1074 | **What are N-grams? How can we use them? ‍⭐️** 1075 | 1076 | An n-gram is a contiguous sequence of n tokens (words or characters) from a text; for example, bigrams are pairs of consecutive words. Using n-grams as features, instead of or in addition to single words, lets a bag-of-words style model capture some local word order, e.g. how often word X is followed by word Y. 1077 | 1078 |
1079 | 1080 | **How large should be N for our bag of words when using N-grams? ‍⭐️** 1081 | 1082 | Answer here 1083 | 1084 |
1085 | 1086 | **What is TF-IDF? How is it useful for text classification? ‍⭐️** 1087 | 1088 | Term Frequency (TF) is a scoring of the frequency of a word in the current document. Inverse Document Frequency (IDF) is a scoring of how rare the word is across documents. It is used because highly frequent words may not contain as much informational content as rarer, domain-specific words: for example, words like “the” appear in virtually all documents and therefore need to be weighted down. The TF-IDF score highlights words that are distinctive (contain useful information) for a given document. 1089 | 1090 |
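A minimal scikit-learn sketch that combines TF-IDF features with a linear classifier (the toy documents and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great movie, loved it", "terrible movie, waste of time", "loved the acting"]
labels = [1, 0, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vectorizer.fit_transform(docs)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["loved this movie"])))
```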
1091 | 1092 | **Which model would you use for text classification with bag of words features? ‍⭐️** 1093 | 1094 | With sparse, high-dimensional bag-of-words features, linear models are the standard choice: 1095 | 1. Logistic regression 1096 | 2. Linear SVM 1097 | 3. (Multinomial) Naive Bayes 1098 | 1099 | Gradient boosting trees can also be used but are usually slower on very wide sparse matrices, while neural models such as CNNs, LSTMs or BERT work on word sequences and embeddings rather than on bag-of-words features. 1100 | 1101 |
1102 | 1103 | **Would you prefer gradient boosting trees model or logistic regression when doing text classification with bag of words? ‍⭐️** 1104 | 1105 | Usually logistic regression is better, because bag of words creates a matrix with a very large number of sparse columns. On such wide, sparse data logistic regression is usually faster to train than gradient boosting trees and performs at least as well. 1106 | 1107 |
1108 | 1109 | **What are word embeddings? Why are they useful? Do you know Word2Vec? ‍⭐️** 1110 | 1111 | Word embeddings are vector representations of words. Each word is mapped to one vector that tries to capture some characteristics of the word, so that similar words end up with similar vector representations. 1112 | Word embeddings help capture inter-word semantics and represent them as real-valued vectors. 1113 | 1114 |
1115 | 1116 | Word2Vec is a method to construct such an embedding. It takes a text corpus as input and outputs a set of vectors that represent the words in that corpus. 1117 | 1118 | It can be trained using two methods: 1119 | 1120 | - Continuous Bag of Words (CBOW): predict a word from its surrounding context 1121 | - Skip-Gram: predict the surrounding context from a word 1122 | 1123 |
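A minimal gensim sketch (the toy corpus is made up; in gensim 4.x the dimensionality parameter is called `vector_size`):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: Skip-Gram, sg=0: CBOW

vec = model.wv["cat"]                  # the embedding vector for "cat"
similar = model.wv.most_similar("cat") # nearest words in the embedding space
```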
1124 | 1125 | **Do you know any other ways to get word embeddings? 🚀** 1126 | 1127 | - GloVe 1128 | - BERT 1129 |
1130 | 1131 | **If you have a sentence with multiple words, you may need to combine multiple word embeddings into one. How would you do it? ‍⭐️** 1132 | 1133 | Approaches ranked from simple to more complex (the first two are sketched below): 1134 | 1135 | 1. Take an average over all words. 1136 | 2. Take a weighted average over all words, e.g. weighting by inverse document frequency (the idf part of tf-idf). 1137 | 3. Use an ML model such as an LSTM or a Transformer that consumes the sequence of word embeddings directly. 1138 | 1139 |
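A small NumPy sketch of the first two options (the embeddings and idf weights are made-up toy values):

```python
import numpy as np

embeddings = {"the": np.array([0.1, 0.0, 0.2, 0.1]),
              "movie": np.array([0.5, 0.3, 0.1, 0.7]),
              "was": np.array([0.0, 0.1, 0.0, 0.2]),
              "great": np.array([0.9, 0.8, 0.4, 0.3])}
idf = {"the": 0.1, "movie": 1.2, "was": 0.3, "great": 2.0}

sentence = ["the", "movie", "was", "great"]
vectors = np.stack([embeddings[w] for w in sentence])
weights = np.array([idf[w] for w in sentence])

simple_average = vectors.mean(axis=0)                                  # option 1
idf_weighted = (weights[:, None] * vectors).sum(axis=0) / weights.sum()  # option 2
```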
1140 | 1141 | **Would you prefer gradient boosting trees model or logistic regression when doing text classification with embeddings? ‍⭐️** 1142 | 1143 | With embeddings, the features are dense and relatively low-dimensional, so gradient boosting trees (GBTs) are often a good choice for text classification: they can learn complex, non-linear relationships between the embedding dimensions. 1144 | 1145 | Logistic regression is a linear model, which means it can only learn linear relationships between the features. This can be a limitation for text classification, where the relationships are often complex and non-linear. 1146 | 1147 | GBTs, on the other hand, learn non-linear relationships by combining multiple decision trees, which can lead to better performance on such tasks. 1148 | 1149 | In addition, GBTs tend to be more robust to outliers and noise than logistic regression, which can matter when the text data is noisy or imbalanced. 1150 | 1151 | That said, logistic regression remains a strong, fast and interpretable baseline, so it is worth comparing both and considering the computational cost and interpretability of GBTs before committing to them. 1152 | 1153 |
1154 | 1155 | **How can you use neural nets for text classification? 🚀** 1156 | 1157 | Here is a general overview of how to use neural nets for text classification: 1158 | 1159 | 1. Preprocess the text: clean the text by removing stop words, punctuation, and other irrelevant symbols; this may also involve converting the text to lowercase and stemming or lemmatizing the words. 1160 | 2. Represent the text as a vector: this can be done using a variety of methods, such as one-hot encoding or word embeddings. 1161 | 3. Build the neural net: the architecture depends on the specific task, but a typical one includes an embedding layer, one or more hidden layers, and an output layer. 1162 | 4. Train the neural net: the network is fed labeled examples of text data and adjusts its parameters to minimize the loss function, typically cross-entropy. 1163 | 5. Evaluate the neural net: once trained, it is evaluated on a held-out test set to assess its performance. 1164 | 1165 | Specific examples of text classification tasks include sentiment analysis, spam detection, topic classification and language identification. 1166 | 1167 | Neural nets have achieved state-of-the-art results on many text classification tasks. However, they can be computationally expensive to train and deploy. 1168 | 1169 |
1170 | 1171 | **How can we use CNN for text classification? 🚀** 1172 | 1173 | To use a CNN for text, the text is first represented as a sequence of word (or character) embeddings, i.e. a matrix of shape (sequence length x embedding size). One-dimensional convolutional filters of several widths (for example 2 to 5 words) slide over this sequence and act as n-gram detectors; max-over-time pooling then keeps the strongest activation of each filter, and the pooled features go into a fully-connected softmax layer. Typical applications include: 1174 | 1175 | * Sentiment analysis: classifying text as positive, negative or neutral, common in social media analysis and customer service. 1176 | * Spam detection: classifying emails as spam or not spam, common in email filtering systems. 1177 | * Topic classification: assigning text documents to topics, common in news and social media analysis. 1178 | * Language identification: identifying the language of a text document, common in translation systems. 1179 | 1180 |
1181 | 1182 | 1183 | ## Clustering 1184 | 1185 | **What is unsupervised learning? 👶** 1186 | 1187 | Unsupervised learning aims to detect patterns in data where no labels are given. 1188 | 1189 |
1190 | 1191 | **What is clustering? When do we need it? 👶** 1192 | 1193 | Clustering algorithms group objects such that similar feature points are put into the same groups (clusters) and dissimilar feature points are put into different clusters. 1194 | 1195 |
1196 | 1197 | **Do you know how K-means works? ‍⭐️** 1198 | 1199 | 1. Partition points into k subsets. 1200 | 2. Compute the seed points as the new centroids of the clusters of the current partitioning. 1201 | 3. Assign each point to the cluster with the nearest seed point. 1202 | 4. Go back to step 2 or stop when the assignment does not change. 1203 | 1204 |
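A compact NumPy sketch that mirrors these steps (the random initialization and the convergence check are simplified):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial seed points
    for _ in range(n_iter):
        # assign each point to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids as cluster means (keep the old centroid if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when the assignment no longer changes
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(300, 2))
labels, centroids = kmeans(X, k=3)
```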
1205 | 1206 | **How to select K for K-means? ‍⭐️** 1207 | 1208 | * Domain knowledge, i.e. an expert knows the value of k 1209 | * Elbow method: compute the clusters for different values of k; for each k, calculate the total within-cluster sum of squares; plot this sum against the number of clusters and use the location of the bend (the "elbow") as the number of clusters. 1210 | * Average silhouette method: compute the clusters for different values of k; for each k, calculate the average silhouette of the observations; plot the silhouette against the number of clusters and select the k with the maximum as the number of clusters. 1211 | 1212 |
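A short scikit-learn sketch of both methods (synthetic data; the candidate range of k is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_,                     # within-cluster sum of squares: look for the elbow
          silhouette_score(X, km.labels_))    # average silhouette: pick the k with the maximum
```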
1213 | 1214 | **What are the other clustering algorithms do you know? ‍⭐️** 1215 | 1216 | * k-medoids: Takes the most central point instead of the mean value as the center of the cluster. This makes it more robust to noise. 1217 | * Agglomerative Hierarchical Clustering (AHC): hierarchical clusters combining the nearest clusters starting with each point as its own cluster. 1218 | * DIvisive ANAlysis Clustering (DIANA): hierarchical clustering starting with one cluster containing all points and splitting the clusters until each point describes its own cluster. 1219 | * Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Cluster defined as maximum set of density-connected points. 1220 | 1221 |
1222 | 1223 | **Do you know how DBScan works? ‍⭐️** 1224 | 1225 | * Two input parameters: epsilon (neighborhood radius) and minPts (minimum number of points in an epsilon-neighborhood). 1226 | * A cluster is defined as a maximal set of density-connected points. 1227 | * Points p_j and p_i are density-connected w.r.t. epsilon and minPts if there is a point o such that both p_i and p_j are density-reachable from o w.r.t. epsilon and minPts. 1228 | * p_j is density-reachable from p_i w.r.t. epsilon, minPts if there is a chain of points p_i -> p_i+1 -> ... -> p_i+x = p_j such that each point is directly density-reachable from the previous one. 1229 | * p_j is directly density-reachable from p_i if dist(p_i, p_j) <= epsilon and p_i is a core point, i.e. its epsilon-neighborhood contains at least minPts points. 1230 | 1231 |
1232 | 1233 | **When would you choose K-means and when DBScan? ‍⭐️** 1234 | 1235 | * DBScan is more robust to noise. 1236 | * DBScan is better when the amount of clusters is difficult to guess. 1237 | * K-means has a lower complexity, i.e. it will be much faster, especially with a larger amount of points. 1238 | 1239 |
1240 | 1241 | 1242 | ## Dimensionality reduction 1243 | **What is the curse of dimensionality? Why do we care about it? ‍⭐️** 1244 | 1245 | Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Additional dimensions spread the data even further making high dimensional data extremely sparse. We care about it, because it is difficult to use machine learning in sparse spaces. 1246 | 1247 |
1248 | 1249 | **Do you know any dimensionality reduction techniques? ‍⭐️** 1250 | 1251 | * Singular Value Decomposition (SVD) 1252 | * Principal Component Analysis (PCA) 1253 | * Linear Discriminant Analysis (LDA) 1254 | * T-distributed Stochastic Neighbor Embedding (t-SNE) 1255 | * Autoencoders 1256 | * Fourier and Wavelet Transforms 1257 | 1258 |
1259 | 1260 | **What’s singular value decomposition? How is it typically used for machine learning? ‍⭐️** 1261 | 1262 | * Singular Value Decomposition (SVD) is a general matrix decomposition method that factors a matrix X into three matrices: L (left singular vectors), Σ (a diagonal matrix of singular values) and R^T (right singular vectors). 1263 | * For machine learning, Principal Component Analysis (PCA) is the typical application. PCA is SVD applied to the centered data matrix: the right singular vectors are the eigenvectors of the covariance matrix (the principal components), and the eigenvalues of the covariance matrix are the squared singular values divided by n - 1. The components are statistically descriptive, ordered by how much variance they explain. 1264 | * Having calculated the eigenvectors and eigenvalues, we can use the Kaiser-Guttman criterion, a scree plot or the proportion of explained variance to determine the principal components (i.e. the final dimensionality) that are useful for dimensionality reduction. 1265 | 1266 |
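A minimal NumPy sketch of PCA via SVD (the data is random and the choice of 2 components is arbitrary):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
Xc = X - X.mean(axis=0)                        # center the data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S**2 / (len(X) - 1)       # eigenvalues of the covariance matrix
explained_ratio = explained_variance / explained_variance.sum()

X_reduced = Xc @ Vt[:2].T                      # project onto the first 2 principal components
```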
1267 | 1268 | 1269 | ## Ranking and search 1270 | 1271 | **What is the ranking problem? Which models can you use to solve them? ‍⭐️** 1272 | 1273 | Answer here 1274 | 1275 |
1276 | 1277 | **What are good unsupervised baselines for text information retrieval? ‍⭐️** 1278 | 1279 | Answer here 1280 | 1281 |
1282 | 1283 | **How would you evaluate your ranking algorithms? Which offline metrics would you use? ‍⭐️** 1284 | 1285 | Answer here 1286 | 1287 |
1288 | 1289 | **What is precision and recall at k? ‍⭐️** 1290 | 1291 | Precision at k and recall at k are evaluation metrics for ranking algorithms. Precision at k shows the share of relevant items in the first *k* results of the ranking algorithm. And Recall at k indicates the share of relevant items returned in top *k* results out of all correct answers for a given query. 1292 | 1293 | Example: 1294 | For a search query "Car" there are 3 relevant products in your shop. Your search algorithm returns 2 of those relevant products in the first 5 search results. 1295 | Precision at 5 = # num of relevant products in search result / k = 2/5 = 40% 1296 | Recall at 5 = # num of relevant products in search result / # num of all relevant products = 2/3 = 66.6% 1297 | 1298 |
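A small sketch of these two metrics matching the example above (the item ids are made up):

```python
def precision_at_k(relevant, retrieved, k):
    top_k = retrieved[:k]
    return len(set(top_k) & set(relevant)) / k

def recall_at_k(relevant, retrieved, k):
    top_k = retrieved[:k]
    return len(set(top_k) & set(relevant)) / len(relevant)

relevant = {"car_a", "car_b", "car_c"}            # 3 relevant products in the shop
retrieved = ["car_a", "x", "car_b", "y", "z"]     # top-5 search results
print(precision_at_k(relevant, retrieved, 5))     # 2/5 = 0.4
print(recall_at_k(relevant, retrieved, 5))        # 2/3 = 0.667
```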
1299 | 1300 | **What is mean average precision at k? ‍⭐️** 1301 | 1302 | Answer here 1303 | 1304 |
1305 | 1306 | **How can we use machine learning for search? ‍⭐️** 1307 | 1308 | Answer here 1309 | 1310 |
1311 | 1312 | **How can we get training data for our ranking algorithms? ‍⭐️** 1313 | 1314 | Answer here 1315 | 1316 |
1317 | 1318 | **Can we formulate the search problem as a classification problem? How? ‍⭐️** 1319 | 1320 | Answer here 1321 | 1322 |
1323 | 1324 | **How can we use clicks data as the training data for ranking algorithms? 🚀** 1325 | 1326 | Answer here 1327 | 1328 |
1329 | 1330 | **Do you know how to use gradient boosting trees for ranking? 🚀** 1331 | 1332 | Answer here 1333 | 1334 |
1335 | 1336 | **How do you do an online evaluation of a new ranking algorithm? ‍⭐️** 1337 | 1338 | Answer here 1339 | 1340 |
1341 | 1342 | 1343 | ## Recommender systems 1344 | 1345 | **What is a recommender system? 👶** 1346 | 1347 | Recommender systems are software tools and techniques that provide suggestions for items that are most likely of interest to a particular user. 1348 | 1349 |
1350 | 1351 | **What are good baselines when building a recommender system? ‍⭐️** 1352 | 1353 | * The simplest baseline to compare against is non-personalized popularity: recommend the most popular (most purchased, most watched) items to everyone. 1354 | * Beyond that, a good recommender system should give relevant and personalized information. 1355 | * It should not recommend items the user knows well or finds easily. 1356 | * It should make diverse suggestions and help the user explore new items. 1357 | 1358 |
1359 | 1360 | **What is collaborative filtering? ‍⭐️** 1361 | 1362 | * Collaborative filtering is the most prominent approach to generate recommendations. 1363 | * It uses the wisdom of the crowd, i.e. it gives recommendations based on the experience of others. 1364 | * A recommendation is calculated as the average of other experiences. 1365 | * Say we want to give a score that indicates how much user u will like an item i. Then we can calculate it with the experience of N other users U as r_ui = 1/N * sum(v in U) r_vi. 1366 | * In order to rate similar experiences with a higher weight, we can introduce a similarity between users that we use as a multiplier for each rating. 1367 | * Also, as users have an individual profile, one user may have an average rating much larger than another user, so we use normalization techniques (e.g. centering or Z-score normalization) to remove the users' biases. 1368 | * Collaborative filtering does only need a rating matrix as input and improves over time. However, it does not work well on sparse data, does not work for cold starts (see below) and usually tends to overfit. 1369 | 1370 |
1371 | 1372 | **How we can incorporate implicit feedback (clicks, etc) into our recommender systems? ‍⭐️** 1373 | 1374 | In comparison to explicit feedback, implicit feedback datasets lack negative examples. For example, explicit feedback can be a positive or a negative rating, but implicit feedback may be the number of purchases or clicks. One popular approach to solve this problem is named weighted alternating least squares (wALS) [Hu, Y., Koren, Y., & Volinsky, C. (2008, December). Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on (pp. 263-272). IEEE.]. Instead of modeling the rating matrix directly, the numbers (e.g. amount of clicks) describe the strength in observations of user actions. The model tries to find latent factors that can be used to predict the expected preference of a user for an item. 1375 | 1376 |
1377 | 1378 | **What is the cold start problem? ‍⭐️** 1379 | 1380 | Collaborative filtering incorporates crowd knowledge to give recommendations for certain items. Say we want to predict how much a user will like an item: we calculate the score using the ratings of other users for that item. We can distinguish two kinds of cold start problems. First, if there is a new item that has not been rated yet, we cannot give any recommendation for it. Second, when there is a new user, we cannot calculate a similarity to any other user. 1381 | 1382 |
1383 | 1384 | **Possible approaches to solving the cold start problem? ‍⭐️🚀** 1385 | 1386 | * Content-based filtering incorporates features about items to calculate a similarity between them. In this way, we can recommend items that have a high similarity to items that a user liked already. In this way, we are not dependent on the ratings of other users for a given item anymore and solve the cold start problem for new items. 1387 | * Demographic filtering incorporates user profiles to calculate a similarity between them and solves the cold start problem for new users. 1388 | 1389 |
1390 | 1391 | 1392 | ## Time series 1393 | 1394 | **What is a time series? 👶** 1395 | 1396 | A time series is a set of observations ordered in time usually collected at regular intervals. 1397 | 1398 |
1399 | 1400 | **How is time series different from the usual regression problem? 👶** 1401 | 1402 | The principle behind causal forecasting is that the value to be predicted is dependent on the input features (causal factors). In time series forecasting, the value to be predicted is instead expected to follow a certain pattern over time, so past values of the series itself (and time-dependent structure such as trend and seasonality) are used to make the prediction. 1403 | 1404 |
1405 | 1406 | **Which models do you know for solving time series problems? ‍⭐️** 1407 | 1408 | * Simple Exponential Smoothing: approximate the time series with a weighted average that puts exponentially decreasing weights on past observations 1409 | * Trend-Corrected Exponential Smoothing (Holt‘s Method): exponential smoothing that also models the trend 1410 | * Trend- and Seasonality-Corrected Exponential Smoothing (Holt-Winter‘s Method): exponential smoothing that also models trend and seasonality 1411 | * Time Series Decomposition: decomposes a time series into the four components trend, seasonal variation, cyclic variation and irregular component 1412 | * Autoregressive models: similar to multiple linear regression, except that the dependent variable y_t depends on its own previous values rather than on other independent variables. 1413 | * Deep learning approaches (RNN, LSTM, etc.) 1414 | 1415 |
1416 | 1417 | **If there’s a trend in our series, how we can remove it? And why would we want to do it? ‍⭐️** 1418 | 1419 | We can explicitly model the trend (and/or seasonality) with approaches such as Holt's Method or Holt-Winter's Method, or remove it by differencing the series (using y_t - y_(t-1) instead of y_t) or by subtracting a fitted trend line. We want to remove the trend to reach the stationarity property of the data, since many time series approaches require stationarity. Without stationarity, the interpretation of the results of these analyses is problematic [Manuca, Radu & Savit, Robert. (1996). Stationarity and nonstationarity in time series analysis. Physica D: Nonlinear Phenomena. 99. 134-161. 10.1016/S0167-2789(96)00139-X.]. 1420 | 1421 |
1422 | 1423 | **You have a series with only one variable “y” measured at time t. How do predict “y” at time t+1? Which approaches would you use? ‍⭐️** 1424 | 1425 | We want to look at the correlation between different observations of y; this measure of correlation is called autocorrelation. Autoregressive models exploit it: they are multiple regression models in which lagged values of the series itself (y at t-1, t-2, ...) are used as the independent variables to predict the next value. 1426 | 1427 |
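A minimal statsmodels sketch of this idea (the series is synthetic and the lag order of 3 is an arbitrary choice):

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

y = np.cumsum(np.random.default_rng(0).normal(size=200))   # toy series

model = AutoReg(y, lags=3).fit()                   # regress y_t on y_{t-1}, y_{t-2}, y_{t-3}
y_next = model.predict(start=len(y), end=len(y))   # one-step-ahead forecast for t+1
print(y_next)
```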
1428 | 1429 | **You have a series with a variable “y” and a set of features. How do you predict “y” at t+1? Which approaches would you use? ‍⭐️** 1430 | 1431 | Given the assumption that the set of features gives a meaningful causation to y, a causal forecasting approach such as linear regression or multiple nonlinear regression might be useful. In case there is a lot of data and the explainability of the results is not a high priority, we can also consider deep learning approaches. 1432 | 1433 |
1434 | 1435 | **What are the problems with using trees for solving time series problems? ‍⭐️** 1436 | 1437 | Tree-based models such as random forest cannot extrapolate beyond the range of target values seen during training, so they cannot follow an increasing or decreasing trend: if the validation data contain values larger than anything in the training data, the model will only predict values close to the largest training targets (roughly averages of training data points). A common workaround is to detrend the series or to model differences instead of raw values. 1438 | 1439 |
1440 | --------------------------------------------------------------------------------