26 |
--------------------------------------------------------------------------------
/CNAME:
--------------------------------------------------------------------------------
1 | blog.scikit-learn.org
--------------------------------------------------------------------------------
/CODE-OF-CONDUCT.md:
--------------------------------------------------------------------------------
1 | # scikit-learn Code of Conduct
2 |
3 | ### Our Community, Our Values
4 |
5 | The project is hosted on [scikit-learn/scikit-learn](https://github.com/scikit-learn/scikit-learn).
6 |
7 | The decision-making process and governance structure of scikit-learn are laid out in the governance document: [scikit-learn governance and decision-making](https://scikit-learn.org/dev/governance.html#governance).
8 |
9 | We are a community based on openness and friendly, didactic discussions.
10 |
11 | We aspire to treat everybody equally, and value their contributions.
12 |
13 | Decisions are made based on technical merit and consensus.
14 |
15 | Code is not the only way to help the project. Reviewing pull requests, answering questions to help others on mailing lists or issues, organizing and teaching tutorials, working on the website, and improving the documentation are all priceless contributions.
16 |
17 | We abide by the principles of openness, respect, and consideration of others of the Python Software Foundation [Code of Conduct](https://www.python.org/psf/codeofconduct/).
18 |
--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
1 | source "https://rubygems.org"
2 |
3 | #gem "jekyll"
4 |
5 | #gem "minimal-mistakes-jekyll", "~>4.0"
6 | gem "minimal-mistakes-jekyll", :git => "https://github.com/mmistakes/minimal-mistakes.git", :tag => "4.24.0"
7 | gem "webrick", "~> 1.7"
8 | gem "jekyll-redirect-from"
9 | gem "jekyll-sitemap"
10 | gem "jekyll-archives"
11 | gem "jekyll-target-blank"
12 | gem "jekyll-paginate"
13 | gem "jekyll-twitter-plugin"
14 |
15 |
16 | # If you want to use GitHub Pages, remove the "gem "jekyll"" above and
17 | # uncomment the line below. To upgrade, run `bundle update github-pages`.
18 |
19 | gem "github-pages", group: :jekyll_plugins
20 | gem "jekyll-include-cache", group: :jekyll_plugins
21 |
22 |
23 | # If you have any plugins, put them here!
24 | group :jekyll_plugins do
25 | gem "jekyll-feed", "~> 0.12"
26 | end
27 |
28 | # Windows and JRuby do not include zoneinfo files, so bundle the tzinfo-data gem
29 | # and associated library.
30 | platforms :mingw, :x64_mingw, :mswin, :jruby do
31 | gem "tzinfo", "~> 1.2"
32 | gem "tzinfo-data"
33 | end
34 |
35 | # Performance-booster for watching directories on Windows
36 | gem "wdm", "~> 0.1.1", :platforms => [:mingw, :x64_mingw, :mswin]
37 |
38 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # scikit-learn Blog
2 |
3 | 
4 |
5 | This repository hosts the development of the [scikit-learn.org](https://scikit-learn.org/stable/) blog.
6 |
7 |
8 |
9 | ## How to Contribute
10 | Please refer to the [Guide to Contributing](https://github.com/scikit-learn/blog/blob/main/CONTRIBUTING.md).
11 | All contributions must abide by the [Code of Conduct](https://github.com/scikit-learn/blog/blob/main/CODE-OF-CONDUCT.md).
12 |
13 |
14 |
15 | ## Brand Standards
16 | This section contains scikit-learn's branding standards and guidelines.
17 |
18 | ### scikit-learn Color Palette
19 |  `RGB 41/171/226 | HEX #29ABE2 | scikit-learn Cyan`
20 |  `RGB 247/147/30 | HEX #F7931E | scikit-learn Orange`
21 |  `RGB 155/70/0| HEX #9B4600 | scikit-learn Brown`
22 |
23 | ### Logo
24 | Logos can be found in the [assets/images/](https://github.com/scikit-learn/blog/tree/main/assets/images) folder.
23 |
24 | In this interview, Andreas Mueller, lecturer in Data Science at Columbia University and core developer of the Python library scikit-learn, speaks with Reshama Shaikh about his recent work with the scikit-learn open source community.
25 |
26 |
27 | __RS) Tell us briefly about yourself__
28 |
29 | AM) I’m currently a lecturer in Data Science at Columbia University, where I teach applied machine learning. I have been a core developer of the Python library scikit-learn for the past 6 years. I recently published the book Introduction to Machine Learning with Python.
30 |
31 | Read the full interview here:
32 | [Interview with Andreas Mueller](https://mlconf.com/blog/interview-andreas-muller-lecturer-columbia-university-core-contributor-scikit-learn-reshama-shaikh/), February 2017
33 |
--------------------------------------------------------------------------------
/_posts/2018-09-30-nyc-sprint-highlights.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Highlights From The 2018 NYC WiMLDS Scikit Sprint"
3 | date: September 30, 2018
4 | categories:
5 | - Events
6 | tags:
7 | - Open Source
8 | - Sprints
9 | featured-image: nyc-sprint-2018.jpg
10 |
11 | postauthors:
12 | - name: Reshama Shaikh
13 | website: https://reshamas.github.io
14 | image: reshama_shaikh.jpeg
15 | ---
16 |
17 |
18 |
19 | {% include postauthor.html %}
20 |
21 |
22 | ## Sprint Repo
23 | The 2nd Annual NYC WiMLDS / Scikit Sprint was held on Saturday, September 29 at Stack Exchange in New York City. This is our repository for all items related to the [2018 NYC WiMLDS Scikit Sprint](https://github.com/WiMLDS/scikit-sprint-nyc-2018).
24 |
25 |
26 | ## The Sprint
27 |
28 |
21 |
22 | ## Sprint Background
23 |
24 | A 2013 study found that only [11% of open source contributors were women](https://www.newamerica.org/weekly/111/and-now-an-infuriating-statistic-about-women-and-coding/). A 2016 gender-inferred [analysis](https://blog.revolutionanalytics.com/2016/06/programmers-gender.html) examining the top 100 contributors for various programming languages found that just 2% of contributors to Python libraries on GitHub were women.
25 |
26 | To address this gender imbalance for the scikit-learn library, Andreas Mueller, core contributor, initiated an open source sprint in New York City with the local chapter of Women in Machine Learning and Data Science ([WiMLDS](http://wimlds.org)). The first sprint was held in March 2017 and the second in September 2018. This report summarizes the impact of the two events.
27 |
28 | ## Impact Summary for 2017
29 | In 2017, **5** PRs were merged in:
30 | - **4** PRs were merged at the sprint
31 | - **1** PR was merged post-sprint without any follow-up
32 | - The PR merged post-sprint was by [Sergul Aydore](https://www.linkedin.com/in/sergül-aydöre-203193a/). After attending this sprint, Sergul then went on to participate in the August 2018 [scikit-learn core sprint for advanced contributors](http://gael-varoquaux.info/programming/sprint-on-scikit-learn-in-paris-and-austin.html) in Paris. Sergul states:
33 | >Participating in the March 2017 sprint helped me learn the basics and I was able to contribute to more complicated PRs in the August 2018 sprint.
34 | - No follow-up of open PRs was conducted.
35 |
36 | ## Impact Summary for 2018
37 | A total of 16 PRs were merged in as a result of the 2018 sprint:
38 | - **4** were merged at the sprint
39 | - **4** were updated and merged post-sprint by attendees who submitted of their own accord, without any follow-up.
40 | - To date, **8** PRs were merged by the sprint organizer (me) or members of the WiMLDS community. None of the initial sprint participants merged in a PR after follow-up.
41 |
42 | Read the full report here:
43 | [Impact Report For WiMLDS Scikit Learn Sprints](https://reshamas.github.io/impact-report-for-wimlds-scikit-learn-sprints/), November 2019
44 |
--------------------------------------------------------------------------------
/_posts/2019-06-01-nick-gradient-boosting.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Understanding Gradient Boosting as a Gradient Descent"
3 | date: June 6, 2019
4 | categories:
5 | - Technical
6 | tags:
7 | - Gradient boosting
8 |
9 | featured-image: gbdt.png
10 |
11 | postauthors:
12 | - name: Nicolas Hug
13 | website: https://github.com/NicolasHug
14 | image: nicolas_hug.jpg
15 | usemathjax: true
16 | ---
17 |
18 |
19 | {% include postauthor.html %}
20 |
21 |
22 | There are a lot of resources online about gradient boosting, but not many of them explain how gradient boosting relates to gradient descent. This post is an attempt to explain gradient boosting as a (kinda weird) gradient descent.
23 |
24 | I’ll assume zero previous knowledge of gradient boosting here, but this post requires a minimal working knowledge of gradient descent.
25 |
26 | __Let’s get started!__
27 |
28 | For a given sample $$ \mathbf{x}_i $$, a gradient boosting regressor yields
29 | predictions with the following form:
30 |
31 | $$ \hat{y}_i = \sum_{m = 1}^{\text{n_iter}} h_m(\mathbf{x}_i), $$
32 |
33 | where each $$ h_m $$ is an instance of a base estimator (often called weak learner, since it usually does not need to be extremely accurate). Since the base estimator is almost always a decision tree, I’ll abusively use the term GBDT (Gradient Boosting Decision Trees) to refer to gradient boosting in general.
34 |
35 | Each of the base estimators $$ h_m $$ isn’t trying to predict the target $$ y_i $$. Instead, the base estimators are trying to predict gradients. This sum $$ \sum_{m = 1}^{\text{n_iter}} h_m(\mathbf{x}_i) $$ is actually performing a gradient descent.
36 |
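To make the claim above concrete, here is a minimal sketch (not the implementation from the full post) of gradient boosting for the least-squares loss $$ \frac{1}{2}(y_i - \hat{y}_i)^2 $$. For this loss, the negative gradient with respect to $$ \hat{y}_i $$ is just the residual $$ y_i - \hat{y}_i $$, so each base estimator is fitted to the current residuals. The helper names (`fit_stump`, `fit_gbdt`), the choice of depth-1 trees on 1-D inputs, and the `learning_rate` parameter (folded into each $$ h_m $$) are assumptions made for illustration only.

```python
# Minimal sketch of gradient boosting for the least-squares loss, in pure Python.
# Assumptions (not from the post): 1-D inputs, depth-1 trees ("stumps") as base
# estimators, and a hypothetical learning_rate folded into each h_m.


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree to 1-D data by minimizing squared error."""
    mean = sum(targets) / len(targets)
    best = None  # (error, threshold, left_value, right_value)
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((t - left_value) ** 2 for t in left)
        error += sum((t - right_value) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    if best is None:  # all inputs identical: fall back to a constant prediction
        return lambda x: mean
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def fit_gbdt(xs, ys, n_iter=50, learning_rate=0.5):
    """Return a function computing y_hat(x) as the sum of the base estimators."""
    predictions = [0.0] * len(xs)
    base_estimators = []
    for _ in range(n_iter):
        # For the loss 0.5 * (y - y_hat) ** 2, the negative gradient with
        # respect to y_hat is simply the residual y - y_hat.
        negative_gradients = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, negative_gradients)
        base_estimators.append(stump)
        # Gradient descent step in function space: move the predictions
        # along the (approximated) negative gradient.
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * h(x) for h in base_estimators)


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy data: the training error shrinks as more base estimators are added.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
for n_iter in (1, 10, 50):
    model = fit_gbdt(xs, ys, n_iter=n_iter)
    print(n_iter, mean_squared_error(ys, [model(x) for x in xs]))
```

Note that no base estimator ever sees the targets directly after the first iteration: each one only sees the gradients of the loss at the current predictions. In practice you would likely reach for scikit-learn's `GradientBoostingRegressor` or `HistGradientBoostingRegressor` rather than a hand-rolled loop like this one.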
37 | Specifically, it’s a gradient descent in a functional space. This is in contrast to what we’re used to in many other machine learning algorithms (e.g. neural networks or linear regression), where gradient descent is instead performed in the parameter space. Let’s review that briefly.
38 |
39 | Read the full blog post on Nicolas' blog:
40 | [Understanding Gradient Boosting as a gradient descent](http://nicolas-hug.com/blog/gradient_boosting_descent)
41 |
--------------------------------------------------------------------------------
/_posts/2019-06-25-nairobi-adrin.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "scikit-learn Sprint in Nairobi, Kenya"
3 | date: June 25, 2019
4 |
5 | categories:
6 | - Events
7 | tags:
8 | - Sprints
9 | - Community
10 |
11 | featured-image: nairobi-sklearn.jpg
12 |
13 | postauthors:
14 | - name: Adrin Jalali
15 | website: https://adrin.info/
16 | image: adrin-jalali.jpeg
17 | ---
18 |
19 |
20 | {% include postauthor.html %}
21 |
22 |
23 | Almost a year ago, after being the co-speaker of a “My first open source contribution” talk at PyData Berlin 2018, I myself became very motivated and started actively contributing to the scikit-learn project. I was surprised to see how much I could and had to learn to improve my contributions, and that was after over 20 years of programming experience, 6 years of which I did mostly Python, and several years of working in the industry. It wasn’t even the first time I was contributing to an open source project, but it was the first time I was actively looking for issues to fix.
24 |
25 | Read the full post on Adrin's blog:
26 | [Adrin Jalali's Blog: scikit-learn sprint at Nairobi](https://adrin.info/scikit-learn-sprint-at-nairobi-kenya.html)
27 |
--------------------------------------------------------------------------------
/_posts/2019-08-03-nairobi-impact-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Nairobi 2019 scikit-learn Sprint Impact Report"
3 | date: August 3, 2019
4 | categories:
5 | - Events
6 | tags:
7 | - Open Source
8 | - Sprints
9 | featured-image: nairobi-group.jpg
10 |
11 | postauthors:
12 | - name: Reshama Shaikh
13 | website: https://reshamas.github.io
14 | image: reshama_shaikh.jpeg
15 | ---
16 |
17 |
18 |
19 | {% include postauthor.html %}
20 |
21 |
22 | ## Sprint Background
23 |
24 | This report focuses on the summary, impact and lessons learned of the Nairobi WiMLDS scikit-learn sprint.
25 |
26 | ## Impact Summary for 2019
27 | - A total of **19** PRs were merged in:
28 | - **2** PRs were merged at the sprint
29 | - **15** PRs were merged post-sprint without any follow-up
30 | - **2** PRs were merged *with* follow-up
31 |
32 | - All outstanding PRs from the sprint were merged in after 5 weeks, well before the suggested 60-day deadline.
33 |
34 | - One attendee traveled 8 hours just to attend the sprint.
35 |
36 | - Microsoft 4Afrika has been supportive of Nairobi WiMLDS and scikit-learn and would like to continue supporting the sprint events in the future.
37 |
38 |
39 | Read the full report here:
40 | [Nairobi Wimlds 2019 Scikit Learn Sprint Impact Report](https://reshamas.github.io/nairobi-wimlds-2019-scikit-learn-sprint-impact-report/), August 2019
41 |
--------------------------------------------------------------------------------
/_posts/2020-01-07-funding-software.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Don't Fund Software That Doesn't Exist"
3 | date: January 7, 2020
4 |
5 | categories:
6 | - Funding
7 | tags:
8 | - Open Source
9 | - Free Software
10 |
11 | featured-image:
12 |
13 | postauthors:
14 | - name: Andreas Mueller
15 | website: https://amueller.github.io/
16 | image: andreas-mueller.jpg
17 | ---
18 |
19 |
20 | {% include postauthor.html %}
21 |
22 |
23 | I’ve been happy to see an increase in funding for open source software across research areas and across funding bodies. However, I observed that a majority of funding from, say, the NSF, goes to projects that do not exist yet, and where the funding is supposed to create a new project, or to extend projects that are developed and used within a single research lab. I think this top-down approach to creating software comes from a misunderstanding of the existing open source software that is used in science. This post collects thoughts on the effectiveness of current grant-based funding and how to improve it from the perspective of the grant-makers.
24 |
25 | Instead of the current approach of funding new projects, I would recommend funding existing open source software, ideally software that is widely used, underfunded and already using peer-production as its organizational principle.
26 |
27 | Read the full post on Andreas' blog:
28 | [Andreas Mueller's post: Don't fund software that doesn't exist](https://peekaboo-vision.blogspot.com/2020/01/dont-fund-software-that-doesnt-exist.html)
29 |
30 |
--------------------------------------------------------------------------------
/_posts/2020-06-27-global-online-sprint-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Data Umbrella First Global Online Sprint Report"
3 | date: June 27, 2020
4 | categories:
5 | - Events
6 | tags:
7 | - Sprints
8 | - Community
9 | featured-image: global_sprint_annonce.png
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | ---
15 |
16 |
17 | {% include postauthor.html %}
18 |
19 |
20 | ## Sprint Background
21 |
22 | This sprint was organized by [Reshama Shaikh](https://reshamas.github.io) of [Data Umbrella](https://www.dataumbrella.org) and [NYC PyLadies](http://nyc.pyladies.com/) to increase the participation of underrepresented persons in data science. The sprint was organized entirely on volunteer time.
23 |
24 | The sprint was originally scheduled to be an in-person event in New York City. It would have been the fourth consecutive year that I (Reshama Shaikh) organized a sprint in NYC. Due to the coronavirus pandemic, it was pivoted to become a virtual event.
25 |
26 | This report focuses on the summary, impact and lessons learned of the **first online** scikit-learn sprint.
27 |
28 |
29 | Read the full report here:
30 |
31 | [Data Umbrella First Global Online Sprint Report](https://blog.dataumbrella.org/data-umbrella-global-online-2020-scikit-learn-sprint-report)
32 |
--------------------------------------------------------------------------------
/_posts/2021-02-22-afme1-sprint-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Data Umbrella AFME1 Sprint Report"
3 | date: February 22, 2021
4 | categories:
5 | - Events
6 | tags:
7 | - Sprints
8 | - Community
9 | featured-image: afme1-summary.png
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | ---
15 |
16 |
17 | {% include postauthor.html %}
18 |
19 |
20 | ## Sprint Background
21 |
22 | This sprint was organized by [Data Umbrella](https://www.dataumbrella.org) to increase the participation of underrepresented persons in data science, with a focus on the geographic regions of Africa & the Middle East (AFME).
23 |
24 | ## Summary
25 |
26 | The Data Umbrella Africa & Middle East (AFME1) scikit-learn online sprint was held on February 2, 2021, and the event report is now available. About 31 participants joined from 14 countries.
27 |
28 | Check out the report for more details.
29 |
30 | Read the full report here:
31 |
32 | [Data Umbrella AFME1 Sprint Report](https://reshamas.github.io/data-umbrella-afme-2021-scikit-learn-sprint-report)
33 |
--------------------------------------------------------------------------------
/_posts/2021-07-19-latam-sprint-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Data Umbrella LATAM Sprint Report"
3 | date: July 19, 2021
4 | categories:
5 | - Events
6 | tags:
7 | - Sprints
8 | - Community
9 | featured-image: latam-group-cover.png
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | ---
15 |
16 |
17 | {% include postauthor.html %}
18 |
19 |
20 | ## Sprint Background
21 |
22 | This sprint was organized by [Data Umbrella](https://www.dataumbrella.org) to increase the participation of underrepresented persons in data science, with a focus on the geographic region of Latin America (LATAM).
23 |
24 | ## Summary
25 |
26 | The Data Umbrella Latin America scikit-learn online sprint was held on June 26, 2021, and the event report is now available. 40 participants joined from 9 countries.
27 |
28 | Check out the report for more details.
29 |
30 | Read the full report here:
31 |
32 | [Data Umbrella LATAM Sprint Report](https://blog.dataumbrella.org/data-umbrella-latam-2021-scikit-learn-sprint-report)
33 |
--------------------------------------------------------------------------------
/_posts/2021-11-20-afme2-sprint-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Data Umbrella AFME2 Sprint Report"
3 | date: November 20, 2021
4 | categories:
5 | - Events
6 | tags:
7 | - Sprints
8 | - Community
9 | featured-image: AFME2-photo.png
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | ---
15 |
16 |
17 | {% include postauthor.html %}
18 |
19 |
20 | ## Sprint Background
21 |
22 | This sprint was organized by [Data Umbrella](https://www.dataumbrella.org) to increase the participation of underrepresented persons in data science, with a focus on the geographic regions of Africa & the Middle East (AFME).
23 |
24 | ## Summary
25 |
26 | The Data Umbrella Africa & Middle East (AFME2) scikit-learn online sprint was held on October 23, 2021, and the event report is now available. 40 participants joined from 17 countries, and 57% were returning contributors.
27 |
28 | Check out the report for informative plots, created using Jupyter, Python and Plotly.
29 |
30 | Read the full report here:
31 |
32 | [Data Umbrella AFME2 Sprint Report](https://blog.dataumbrella.org/data-umbrella-afme2-2021-scikit-learn-sprint-report)
33 |
--------------------------------------------------------------------------------
/_posts/2022-01-08-jml-interview.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Interview with Juan Martín Loyola, Triage Team Member"
3 | date: January 8, 2022
4 | categories:
5 | - Team
6 | tags:
7 | - Open Source
8 | featured-image: jml-interview.png
9 |
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | - name: Juan Martín Loyola
15 | website: https://jmloyola.github.io/
16 | image: juan-martin-loyola.jpeg
17 | ---
18 |
19 |
20 |
21 | {% include postauthor.html %}
22 |
23 |
24 | We are happy to welcome Juan Martín Loyola, who joined the scikit-learn Triage Team in December 2021.
25 |
26 | In this interview, learn more about Juan Martín's journey to open source: from computer user to first contributing to PyMC, then Google Summer of Code, then Data Umbrella's Latin America open source sprint, and finally Triage Team member with scikit-learn.
27 |
28 | 1. __Tell us about yourself.__
29 |
30 | My name is Juan Martín Loyola, I'm a computer science Ph.D. student from San Luis, a province in the middle of Argentina, working on early classification models for text. This is related to the problem of document categorization where we are also interested in the classification speed (there is a cost associated with the classification delay).
31 |
32 |
33 | Read the full interview on Data Umbrella's blog:
34 | [Data Umbrella Interview: Juan Martín Loyola](https://blog.dataumbrella.org/jmloyola-opensource-experience)
35 |
--------------------------------------------------------------------------------
/_posts/2022-02-05-frenchaward.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "An Open Source Software Award for scikit-learn"
3 | date: February 7, 2022
4 | categories:
5 | - Press
6 | tags:
7 | - Open Source
8 | - Community
9 | featured-image: Frenchaward.png
10 |
11 | postauthors:
12 | - name: François Goupil
13 | website: https://www.linkedin.com/in/fran%C3%A7ois-goupil/
14 | image: francois_goupil.jpeg
15 | ---
16 |
17 |
18 |
19 | {% include postauthor.html %}
20 |
21 |
22 | We are pleased to announce that scikit-learn has received a prize for open-source scientific software from the French government. It is a great recognition for the whole community of contributors and users of a project born in France. Congratulations to the worldwide community for this great achievement!
23 |
24 | __A Community Award__
25 | scikit-learn was awarded for its very active community with more than 500,000 users per month and 2,200 contributors. scikit-learn prides itself on being able to showcase its best practices for community building, an essential element of successful open-source software and open science innovation. Congratulations to all the projects and the teams that received the open-source software and open science award today. This work is inspiring for all of us!
26 |
27 | __The Reaction of the Community__
28 |
29 | >“I literally owe my career in the data space to scikit-learn. It’s not just a framework but a school of thought regarding predictive modeling. Super well deserved, folks :)”
30 | Maykon Schots from Brazil
31 |
32 | >“Well done everyone for getting us here :)”
33 | Joel Nothman from Australia
34 |
35 | __About the Award__
36 |
37 | For the first time, the Ministry of Higher Education, Research and Innovation awarded the Open Science Prizes for Free Research Software. Ten software projects developed by French teams were rewarded for their contribution to the advancement of scientific knowledge.
38 | As part of the second National Plan for Open Science, the Open Science Awards for Free Research Software highlight projects and research teams working on the development and dissemination of free software. They aim to emphasize teams and projects contributing to the construction of a common good.
39 |
40 | The main goal of this award is to draw the attention of the scientific community to exceptional or very promising achievements, which can serve as a model for the next generations of researchers and engineers. The prizes were awarded on the decision of a jury of experts rendered by Daniel Le Berre (Lens Computer Science Research Center, University of Artois-CNRS).
41 |
42 | The awards came in three categories, which recognized:
43 | - The scientific and technical quality of the software
44 | - Building an active community of contributors and users
45 | - The essential effort to provide documentation that facilitates the use and appropriation of the software
46 | See the whole list of award recipients. This article is available in both French and English.
47 |
48 | [The official list of award winners](https://www.ouvrirlascience.fr/remise-des-prix-science-ouverte-du-logiciel-libre-de-la-recherche/)
49 |
--------------------------------------------------------------------------------
/_posts/2022-02-08-performances.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Performances and scikit-learn"
3 | date: February 8, 2022
4 | categories:
5 | - Technical
6 | tags:
7 | - Open Source
8 | - Performance
9 | featured-image: julien-performances.png
10 |
11 | postauthors:
12 | - name: Julien Jerphanion
13 | website: https://jjerphan.xyz
14 | image: jjerphan.jpg
15 | ---
16 |
17 |
18 |
19 | {% include postauthor.html %}
20 |
21 |
22 | For more than 10 years, scikit-learn has been bringing machine learning and data science to the world. Since then, the library has aimed to deliver quality implementations to its users.
23 |
24 | This series of blog posts aims to explain the ongoing work of the scikit-learn developers to boost the performance of the library.
25 |
26 | [Read more online](https://jjerphan.xyz/sklearn-perf.html)
27 |
--------------------------------------------------------------------------------
/_posts/2022-03-12-wimlds-paris-sprint.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Women in Machine Learning - A WiMLDS Paris sprint and contribution workshop"
3 | date: March 12, 2022
4 | categories:
5 | - Events
6 | tags:
7 | - Sprints
8 | - Community
9 | postauthors:
10 | - name: François Goupil
11 | website: https://www.linkedin.com/in/fran%C3%A7ois-goupil/
12 | image: francois_goupil.jpeg
13 | ---
14 |
15 |
16 |
17 | {% include postauthor.html %}
18 |
19 |
20 | Did you know that, by a rough estimate, only 6% of open source contributors are women? This is awfully low. The scikit-learn team really cares about improving its diversity, and gender is one of our focuses, so we decided to partner with Women in Machine Learning and Data Science Paris (WiMLDS Paris) to help. On Saturday morning, March 12th, we gathered for our sprint at CybelAngel! It had been a long time since we organized a face-to-face event, especially a sprint!
21 |
22 | What is a scikit-learn sprint, you may ask? The scikit-learn sprint is a hands-on “hackathon” where we work on issues in the scikit-learn GitHub repository and learn to contribute to open source. This sprint included an introductory and practical workshop about contributing to open source software.
23 |
24 | [Read more online](https://scikit-learn.fondation-inria.fr/wimlds-sprint/)
--------------------------------------------------------------------------------
/_posts/2022-03-21-behind-the-scenes.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Behind the Scenes of Data Umbrella scikit-learn Open Source Sprints"
3 | date: March 21, 2022
4 | categories:
5 | - Events
6 | tags:
7 | - Open Source
8 | featured-image: behind_the_scenes.png
9 |
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | - name: Angela Okune
15 | website: https://angelaokune.me
16 | image: angela_okune.jpg
17 | ---
18 |
19 |
20 |
21 | {% include postauthor.html %}
22 |
23 |
24 | ## Introduction
25 |
26 | Prior to 2020, most data sprints were held in person during intensive 8-hour-long days. Data Umbrella founder, Reshama Shaikh, for example, led several in-person sprints in New York (2017, 2018, 2019), Nairobi (2019) and San Francisco (2019). Data Umbrella had always been interested in developing online resources and exploring ways to enable virtual participation, but this did not become a priority until 2020, when the pandemic forced everything online, including data sprints. It was clear that an 8-hour in-person event could not just switch to being an 8-hour online event. So the move to online data sprints required the team to rethink the format and mechanisms of the event.
27 |
28 | Read the full article here:
29 | [Behind the Scenes: What It Takes to Run Data Umbrella’s scikit-learn Open Source Sprints](https://eventfund.codeforscience.org/behind-the-scenes-what-it-takes-to-run-data-umbrellas-scikit-learn-open-source-sprints/), March 2022
30 |
--------------------------------------------------------------------------------
/_posts/2022-03-28-maren-interview.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Interview with Maren Westermann: Extending the Impact of the scikit-learn Sprints to the Community"
3 | date: March 28, 2022
4 | categories:
5 | - Events
6 | - Team
7 | tags:
8 | - Open Source
9 | - Sprints
10 | - Community
11 | featured-image: maren-interview.png
12 |
13 | postauthors:
14 | - name: Reshama Shaikh
15 | website: https://reshamas.github.io
16 | image: reshama_shaikh.jpeg
17 | - name: Maren Westermann
18 | website: https://www.linkedin.com/in/dr-maren-westermann-0b8575144/
19 | image: maren-westerman.jpg
20 | ---
21 |
22 |
23 |
24 | {% include postauthor.html %}
25 |
26 |
27 | In this interview, learn more about how Maren moved from being a Data Umbrella scikit-learn sprint participant to a mentor, and then to organising [open source workshops](https://www.meetup.com/en-AU/PyLadies-Berlin/).
28 |
29 |
30 | 1. __How did you learn of the Data Umbrella scikit-learn sprints and what inspired you to attend?__
31 |
32 | I learned of the first Data Umbrella scikit-learn online sprint, which took place in June 2020, via Twitter. I was interested in contributing to open source and had already made [one contribution](https://github.com/scikit-learn/scikit-learn/pull/16681) to scikit-learn. However, when I started contributing to open source I didn’t have a network of like-minded people. I was very much looking forward to connecting with people who shared my interest in open source, data science, and scikit-learn, and to building a professional network in this field.
33 |
34 | Read the full interview here:
35 | [Interview with Maren Westermann: Extending the Impact of the scikit-learn Sprints to the Community](https://blog.dataumbrella.org/mwestermann-sprints-experience), March 2022
36 |
--------------------------------------------------------------------------------
/_posts/2022-05-04-lucy-interview.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Interview with Lucy Liu, scikit-learn Team Member"
3 | date: May 4, 2022
4 | categories:
5 | - Team
6 | tags:
7 | - Open Source
8 | featured-image: lucy_card.png
9 |
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | - name: Lucy Liu
15 | website: https://twitter.com/lucyleeow
16 | image: lucyliu.jpeg
17 | ---
18 |
19 |
20 |
21 | {% include postauthor.html %}
22 |
23 |
24 | Lucy Liu joined the scikit-learn Team in September 2020. In this interview, learn more about Lucy's journey through open source, from rstats to scikit-learn.
25 |
26 | 1. __Tell us about yourself.__
27 |
28 |     My name is Lucy. I grew up in New Zealand and I am culturally Chinese. I currently live in Australia and work for Quansight Labs.
29 |
30 |     - GitHub: [@lucyleeow](https://github.com/lucyleeow)
31 | - Twitter: [@lucyleeow](https://twitter.com/lucyleeow)
32 | - LinkedIn: [@lucy-liu](https://www.linkedin.com/in/lucy-liu-285468aa/)
33 |
34 |
35 | 1. __How did you first become involved in open source?__
36 |
37 |     I first discovered open source when I started a research Masters, after finding my clinical Optometry job unfulfilling. I loved learning to program but was initially not game enough to contribute as I was only a beginner. After my Masters, while working as a bioinformatician, I wrote some R packages for analysis of niche biomedical data and put them on GitHub. My first contribution to an existing open source project came later, when I worked at INRIA (French National Institute for Research in Digital Science and Technology) alongside the INRIA scikit-learn core developers. They helped me put up my first pull request and I have been contributing ever since!
38 |
39 | 1. __How did you get involved in scikit-learn? Can you share a few of the pull requests to scikit-learn that resonate with you?__
40 |
41 |     I’m very interested in statistics and code so I was super keen to contribute to scikit-learn. Being a relative beginner in both areas, I started by contributing to documentation, then bug fixes and features. My first PR to scikit-learn was submitted in October 2019 to improve the [multiclass classification documentation](https://github.com/scikit-learn/scikit-learn/pull/15333). I have contributed the most to the calibration module in scikit-learn (including refactoring CalibratedClassifierCV), which has been very interesting and proved useful when I later worked on post-processing of weather forecasts at the Bureau of Meteorology in Australia.
42 |
43 | Reference: [Lucy’s list of pull requests](https://github.com/scikit-learn/scikit-learn/pulls?q=is%3Apr+author%3Alucyleeow+is%3Aclosed)
44 |
45 | 1. __To which OSS projects and communities do you contribute?__
46 |
47 | I contribute to [Sphinx-Gallery](https://github.com/sphinx-gallery/sphinx-gallery) and scikit-learn. Sphinx-Gallery was a great introduction to open source for me as it is a small package that does not get a large number of issues and pull requests (unlike scikit-learn!).
48 |
49 | 1. __What do you find alluring about OSS?__
50 |
51 | I think the ability to see the source code and contribute back to the project are the best parts. If there is a feature you are interested in you can suggest and add it yourself, all the while learning from code reviews during the process!
52 |
53 | 1. __What pain points do you observe in community-led OSS?__
54 |
55 |     I think some of the positive aspects of the OSS community can also lead to pain. While it is great that you are able to get many different perspectives from people of various backgrounds, it also makes forming a consensus more difficult and slows progress. People from any geographical location can work together asynchronously, but this can also mean people work in their own silos, making it difficult to have a cohesive direction for the project. Large projects also have a steep learning curve, which is hard on new contributors and on contributors interested in becoming core developers. The latter is a problem if the project lacks core-developer time for project maintenance and reviewing PRs.
56 |
57 | 1. __If we discuss how far OS has evolved in 10 years, what would you like to see happen?__
58 |
59 | Some system that enables continuity of funding, which can combine funds from public and private sources. This would enable long term planning of OS projects and give developers more job stability. Better coordination between projects within the same area (e.g., scientific Python) would allow a better experience for users using Python for their projects.
60 |
61 | 1. __What are your favorite resources, books, courses, conferences, etc?__
62 |
63 | [Real Python](https://realpython.com/) have great tutorials and [regex101](https://regex101.com) makes regular expressions so much easier to write and review!
64 |
65 | I also love the YouTube channel [statquest](https://www.youtube.com/c/joshstarmer), which explains statistical concepts in a very accessible manner and introduces videos with a jingle - what more could you want?
66 |
67 | 1. __What are your hobbies, outside of work and open source?__
68 |
69 | I love cycling and feel strongly about designing cities for people instead of cars. I also enjoy rock climbing (indoors and outdoors), though sadly have not had much time for this recently.
70 |
--------------------------------------------------------------------------------
/_posts/2022-05-12-pyconde-keynote-reshama.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "5 Years, 10 Sprints, A scikit-learn Open Source Journey"
3 | date: May 12, 2022
4 | categories:
5 | - Events
6 | tags:
7 | - Open Source
8 | - Sprints
9 | - Community
10 | featured-image: reshama-pyconde.png
11 |
12 | postauthors:
13 | - name: Reshama Shaikh
14 | website: https://reshamas.github.io
15 | image: reshama_shaikh.jpeg
16 | ---
17 |
18 |
19 |
20 | {% include postauthor.html %}
21 |
22 |
23 |
24 | ## Video
25 |
26 |
27 |
28 | ## About
29 |
30 | We all use open source tools in various capacities, yet knowing how to contribute to open source is not as well known or accessible. The limited knowledge and education surrounding contributing to open source could be one explanation of the low participation rates by underrepresented persons in open source. Open source sprints are hands-on "workshops" or "hackathons" where contributors collaborate to resolve coding and documentation issues posted on a GitHub repository.
31 |
32 | Reshama shares how she organized her first open source sprint in 2017, which was in-person and held in New York City. Over the next 5 years, she organized in-person sprints from San Francisco, USA to Nairobi, Kenya, as well as pivoting to online sprints due to the global pandemic. In this keynote, Reshama shares highlights, challenges and lessons learned from the [sprints](https://www.dataumbrella.org/sprints).
33 |
34 | ## About Reshama Shaikh
35 | Reshama is a statistician/data scientist based in New York City. She earned her M.S. in statistics from Rutgers University. She earned her M.B.A. from NYU Stern School of Business where she studied strategy, business analytics and technology management.
36 |
37 | Reshama Shaikh is the Director of Data Umbrella. She is also on the Contributor Team for scikit-learn and [PyMC](https://docs.pymc.io/en/latest/) and an organizer for [NYC PyLadies](https://www.meetup.com/NYC-PyLadies/).
38 |
39 |
40 | ## Key Links
41 | - [Sprint Reports](https://blog.dataumbrella.org/tags/#sprint-report)
42 | - [Sprint Blogs](https://blog.dataumbrella.org/tags/#sprint-blog)
43 |
44 |
45 | ## Connecting
46 | - LinkedIn: [@reshamas](https://www.linkedin.com/in/reshamas/)
47 | - Bluesky: [@reshamas](https://bsky.app/profile/reshamas.bsky.social)
48 | - GitHub: [@reshamas](https://github.com/reshamas)
49 | - Medium: [@reshamas](https://medium.com/@reshamas)
50 | - Join the Data Umbrella [Meetup Group](https://www.meetup.com/data-umbrella/)
51 | - Subscribe to the Data Umbrella [YouTube](https://www.youtube.com/c/DataUmbrella/)
52 |
53 |
54 |
55 |
56 | ### Keynote Day
57 |
58 |
19 |
20 | In September of 2022, the [SciPy Latin America](https://pythoncientifico.ar/) conference took place in Salta, Argentina.
21 | As part of the event, we organized a [scikit-learn sprint](https://pythoncientifico.ar/events/sprints/).
22 | The main idea was to introduce the participants to the open source world and help them make their first contribution.
23 | The sprint was held in person.
24 |
25 |
26 |
27 | ## Schedule
28 | - September 27, 2022 - **Pre-sprint** - 10:00 to 12:00 (UTC-3)
29 | - September 28, 2022 - **Sprint** - 10:00 to 17:00 (UTC-3)
30 |
31 | ## Repository
32 | For more information in Spanish, [check this repository](https://github.com/jmloyola/sklearn-sprint-argentina-2022).
33 | You will find details about the event, instructions to set up the development environment, links with further information and tutorials, and an example git workflow to make a pull request for the project.
34 |
35 | ## Photos
36 |
37 |
38 |
39 | Group photo of the SciPy Latin America sprint, Salta, Argentina, 2022. Sandra Meneses and Juan Martín Loyola are projected on the screen from a Zoom call. Photo credit: Lucía Torres.
40 |
41 |
42 |
43 |
44 |
45 |
46 | Participants of the SciPy Latin America sprint working on their computers. Photo credit: Ariel Silvio Norberto Ramos.
47 |
48 |
49 |
50 | ## Acknowledgment
51 | These people made this sprint possible:
52 | - Ariel Silvio Norberto Ramos, one of the organizers of SciPy Latin America,
53 | - [Data Umbrella](https://www.dataumbrella.org/), [one of the community partners of the event](https://twitter.com/ScipyLA/status/1573710649963724802), especially Sandra Meneses and Reshama Shaikh,
54 | - The mentors that helped run the sprint.
55 |
--------------------------------------------------------------------------------
/_posts/2022-10-13-joining-forces-hugging-face.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "scikit-learn and Hugging Face join forces"
3 | date: October 13, 2022
4 |
5 | categories:
6 | - Updates
7 | - Community
8 | tags:
9 | - Open Source
10 |
11 | featured-image: HFxsklearn.png
12 |
13 | postauthors:
14 | - name: Lysandre Debut
15 | email: lysandre@huggingface.co
16 | website: https://github.com/LysandreJik
17 | image: "lysandre_debut.jpg"
18 | - name: François Goupil
19 | email: francois.goupil@inria.fr
20 | website: https://github.com/francoisgoupil
21 | image: "francois_goupil.jpeg"
22 | ---
23 |
24 |
25 | {% include postauthor.html %}
26 |
27 |
28 |
29 | [Hugging Face](https://hf.co) is happy to announce that we're partnering with [scikit-learn](https://scikit-learn.org/stable/index.html) to further our support of the machine learning tools and ecosystem.
30 |
31 | At Hugging Face, we've been putting a lot of effort into supporting deep learning, but we believe that machine learning as a whole can benefit from the tools we release. With statistical machine learning being essential in this field and scikit-learn dominating statistical ML, we're excited to partner and move forward together.
32 |
33 | As of September 2022, the Hugging Face Hub already hosts nearly 4,000 tabular classification and tabular regression model checkpoints, and we strive for this trend to continue.
34 |
35 |
36 |
39 |
40 |
41 | ## Support to the scikit-learn consortium
42 |
43 | Since June 2022, Hugging Face has been an official sponsor of the scikit-learn consortium. Through this support, Hugging Face actively promotes the development and sustainability of sklearn. As a sponsor of the scikit-learn consortium hosted at the Inria foundation, we'll now participate in the scikit-learn consortium technical committee.
44 |
45 | ## Development support
46 | To help sustain the development of the library, we're happy to welcome Adrin Jalali and Benjamin Bossan to the Hugging Face team. Adrin is a core developer of scikit-learn as well as [fairlearn](https://fairlearn.org), while Benjamin is the author of the [skorch](https://github.com/skorch-dev/skorch) library and is now a contributor to scikit-learn.
47 |
48 | Hugging Face is happy to support the development of scikit-learn through code contributions, issues, pull requests, reviews, and discussions.
49 |
50 | ## Integration to and from the Hugging Face Hub
51 |
52 | ["Skops"](https://github.com/skops-dev/skops) is the name of the framework being actively developed as the link between the scikit-learn and the Hugging Face ecosystems. With Skops, we hope to facilitate essential workflows:
53 |
54 | - The ability to push scikit-learn models on the Hugging Face Hub
55 | - The possibility to try out models directly in the browser
56 | - The automatic creation of model cards, to improve model documentation and understanding
57 | - The ability to collaborate with others on machine learning projects
58 |
59 | ### Snapshot of your work
60 |
61 | Working at the intersection of scikit-learn and the Hub offers challenges linked to the two platforms. One of these challenges is secure persistence: the ability to serialize models in a secure, safe manner.
62 |
63 | scikit-learn models (estimators, predictors, ...) are usually saved using pickle, which is notorious for not being a secure format. Sharing scikit-learn models in this format exposes receivers to potentially malicious data which could execute arbitrary code when run.
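
The risk described above can be illustrated with the standard library alone. The sketch below is not taken from scikit-learn or skops: `Payload` is a hypothetical class, and the harmless builtin `len` stands in for whatever callable a malicious file could name instead.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle tells the loader what to call."""

    def __reduce__(self):
        # pickle stores "call this callable with these arguments".
        # `len` is harmless; a hostile file could reference any importable
        # callable here, which is the danger the post describes.
        return (len, ("not a model",))


blob = pickle.dumps(Payload())

# Loading runs the stored call; the result is whatever the callable
# returned, not a Payload instance.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The receiver never asked for `len` to be called; the file decided that. This is why unpickling data from an untrusted source is unsafe.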
64 |
65 | That's where secure persistence comes in: as the Hugging Face Hub aims to provide a platform for models, the ability to share safe, secure objects is essential. We've been working on adding secure persistence for scikit-learn models in [skops#128](https://github.com/skops-dev/skops/pull/128) and [skops#145](https://github.com/skops-dev/skops/pull/145) ([doc preview](https://skops--145.org.readthedocs.build/en/145/persistence.html)). Instead of serializing using pickle, the object's contents are put into a zip file with an accompanying schema JSON file.
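
To make the "zip file plus JSON schema" idea concrete, here is a toy sketch using only the standard library. It is not the skops file format or API: the file names `schema.json` and `params.json`, the helpers `dump_state` and `load_state`, and the `ALLOWED_TYPES` allow-list are all assumptions made for illustration.

```python
import io
import json
import zipfile

# Hypothetical allow-list: the only type names this toy loader will accept.
ALLOWED_TYPES = {"LinearModel"}


def dump_state(type_name, params, file):
    """Write plain data and a small JSON schema into a zip archive."""
    with zipfile.ZipFile(file, "w") as archive:
        archive.writestr("schema.json", json.dumps({"type": type_name}))
        archive.writestr("params.json", json.dumps(params))


def load_state(file):
    """Read the archive back, refusing types that are not allow-listed."""
    with zipfile.ZipFile(file) as archive:
        schema = json.loads(archive.read("schema.json"))
        if schema["type"] not in ALLOWED_TYPES:
            raise ValueError(f"untrusted type: {schema['type']}")
        params = json.loads(archive.read("params.json"))
    return schema["type"], params


buffer = io.BytesIO()
dump_state("LinearModel", {"coef": [0.5, -1.25], "intercept": 2.0}, buffer)
buffer.seek(0)
restored_type, restored_params = load_state(buffer)
```

The design point is that the loader only parses data and checks the declared type before rebuilding anything, so opening a file never triggers a call chosen by the file's author.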
66 |
67 | Read about the Skops library in the following blog post: [Introducing Skops](https://huggingface.co/blog/skops).
68 |
69 | ## Improving interoperability
70 |
71 | Skops is an example of an integration of scikit-learn within our tools, but it is not the only example! We will strive to integrate with the rest of our ecosystem so that Hugging Face users may benefit from using scikit-learn tools and vice-versa.
72 |
73 | An example is the `evaluate` library, dedicated to efficiently evaluating machine learning models and datasets. We aim for this tool to natively support [scikit-learn metrics](https://github.com/huggingface/evaluate/issues/297) in its API.
74 |
75 | ---
76 |
77 | Through these efforts, we hope to kickstart a lasting relationship between the two ecosystems and provide simple, efficient bridges to lower the barrier of entry. We believe that educating and sharing models is the best way to foster inclusive machine learning from which all can benefit. We're excited to partner with scikit-learn for this endeavor.
78 |
--------------------------------------------------------------------------------
/_posts/2022-11-08-pandas-dataframe-output-for-sklearn-transformer.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Pandas DataFrame Output for sklearn Transformers"
3 | date: November 8, 2022
4 | categories:
5 | - Technical
6 | tags:
7 | - Performance
8 | featured-image: pandas_output_sklearn_transformers.PNG
9 |
10 | postauthors:
11 | - name: Sangam SwadiK
12 | website: https://www.linkedin.com/in/sangam-swadi-k/
13 | image: sangam_swadik.jpg
14 | ---
15 |
16 |
17 |
18 | {% include postauthor.html %}
19 |
20 |
21 | ## Video
22 |
23 |
24 | ## Upcoming feature in release 1.2
25 | Starting with the next release of [scikit-learn](https://github.com/scikit-learn/scikit-learn) (v1.2), pandas dataframe output will be available for all sklearn transformers! This will make running pipelines on dataframes much easier and provide better ways to track feature names. Previously, mapping a transformed output back into columns would be cumbersome as it might not be a one-to-one mapping in cases of complex preprocessing (e.g., polynomial features).
26 |
27 | The pandas dataframe output feature for transformers solves this by tracking features generated from pipelines automatically. The transformer output format can be configured explicitly for either **numpy** or **pandas** output formats as shown in [sklearn.set_config](https://scikit-learn.org/dev/modules/generated/sklearn.set_config.html#sklearn.set_config) and the sample code below.
28 | ```python
29 | from sklearn import set_config
30 | set_config(transform_output="pandas")
31 | ```
32 |
33 | See the sample notebook, [pandas-dataframe-output-for-sklearn-transformer.ipynb](https://github.com/scikit-learn/blog/blob/main/assets/notebooks/sklearn-pandas-df-output.ipynb) and documentation for a more detailed example and usage.
34 |
35 | ## Links to documentation and example notebook
36 | - [Pandas output for transformers documentation](https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_set_output.html#sphx-glr-auto-examples-miscellaneous-plot-set-output-py)
37 | - [pandas-dataframe-output-for-sklearn-transformer.ipynb](https://github.com/scikit-learn/blog/blob/main/assets/notebooks/sklearn-pandas-df-output.ipynb)
38 |
39 |
40 | ## Reporting bugs
41 | We'd love your feedback on this. In case of any suggestions or bugs, please report them at
42 | [scikit-learn issues](https://github.com/scikit-learn/scikit-learn/issues).
43 |
44 | Thanks 🙏🏾 to maintainers: [**Thomas J. Fan**](https://github.com/thomasjpfan), [**Guillaume Lemaitre**](https://github.com/glemaitre), [**Christian Lorentzen**](https://github.com/lorentzenchr)!!
--------------------------------------------------------------------------------
/_posts/2023-07-11-nvidia-is-a-new-sponsor.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "NVIDIA Is A New Sponsor Of The Scikit-Learn consortium at the Inria Foundation"
3 | date: November 14, 2023
4 |
5 | categories:
6 | - Funding
7 | tags:
8 | - Sponsor
9 |
10 | featured-image: NVIDIAxsklearn.jpg
11 |
12 | postauthors:
13 | - name: NVIDIA
14 | website: https://developer.nvidia.com/gpu-accelerated-libraries
15 | image: "nvidia-logo.png"
16 | - name: François Goupil
17 | email: francois.goupil@inria.fr
18 | website: https://github.com/francoisgoupil
19 | image: "francois_goupil.jpeg"
20 | ---
21 |
22 |
23 | {% include postauthor.html %}
24 |
25 |
26 | *Sponsored blog post*
27 |
28 | We are thrilled to announce that [NVIDIA](https://www.nvidia.com) has joined the [scikit-learn consortium](https://scikit-learn.fondation-inria.fr/) as a corporate partner. As a leading provider of GPU-accelerated computing solutions, we at NVIDIA recognize the importance of machine learning and the role it plays in the growth of many industries and areas of science. Our partnership with the scikit-learn consortium demonstrates our commitment to supporting the development and advancement of open-source software in the machine learning community.
29 |
30 |
31 |
34 |
35 |
36 | [Scikit-learn](https://scikit-learn.org/stable/) is a popular open-source Python library for machine learning. One of the strengths of scikit-learn is its ease of use and well-defined API. This makes it a favorite tool among data scientists and machine learning practitioners. Thanks to its active community and continuous development, scikit-learn is constantly evolving and improving.
37 |
38 | At NVIDIA, we believe that investing in open-source projects like scikit-learn is important. After all, it is a central component of the modern data stack in both science and industry. By financially supporting the scikit-learn consortium, we are contributing to the long-term sustainability of scikit-learn and helping to ensure that it remains an easy-to-use, reliable and valuable tool for years to come. Furthermore, we hope to help advance the project's development, improve its performance, and enhance its capabilities for machine learning on GPUs.
39 |
40 | Our partnership with the scikit-learn consortium will also enable us to collaborate more closely with the scikit-learn community, and provide us with insights into how we can improve NVIDIA’s [RAPIDS open-source libraries](https://developer.nvidia.com/rapids) to better serve their needs. We are committed to working with the foundation to ensure that scikit-learn remains a powerful and easy-to-use machine learning library that meets the needs of data science practitioners in science and industry.
41 |
42 | NVIDIA’s commitment to scikit-learn goes beyond financial support. We have hired [Tim Head](https://betatim.github.io), an experienced open-source maintainer, to work full-time on the project. This is not Tim’s first open-source rodeo. He has previously contributed to several high-profile open-source projects, including Project Jupyter. His focus will be reviewing pull requests and coordinating the development of large features. Tim was recently elected as a core maintainer of scikit-learn. His expertise and experience will be invaluable in ensuring the continued growth and success of the project.
43 |
44 | In summary, NVIDIA’s partnership with the scikit-learn consortium is an important step in our ongoing commitment to support the development and growth of open-source software in the machine learning community. We are excited to work with the foundation and the community of contributors to help advance the capabilities of scikit-learn and accelerate the development of machine learning applications.
45 |
46 | AI helped write this blog post!
47 |
--------------------------------------------------------------------------------
/_posts/2023-09-12-paris-dev-sprint.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "scikit-learn 2023 In-person Developer Sprint in Paris, France"
3 | date: September 10, 2023
4 |
5 | categories:
6 | - Events
7 | tags:
8 | - Sprints
9 | - Community
10 | featured-image: 2023-paris-dev-sprint.png
11 |
12 | postauthors:
13 | - name: Reshama Shaikh
14 | website: https://reshamas.github.io
15 | image: reshama_shaikh.jpeg
16 | - name: François Goupil
17 | email: francois.goupil@inria.fr
18 | website: https://github.com/francoisgoupil
19 | image: "francois_goupil.jpeg"
20 | ---
21 |
22 | {% include postauthor.html %}
23 |
24 |
25 |
26 | During the week of June 19 to 23, 2023, the scikit-learn team held its first developers' sprint since 2019! The sprint took place in person at the Dataiku office in Paris, France, and had 32 participants.
27 |
28 | The following [scikit-learn team members](https://scikit-learn.org/stable/about.html) joined the sprint:
29 |
30 | 1. Adrin Jalali
31 | 1. Arturo Amor Quiroz
32 | 1. François Goupil (@francoisgoupil)
33 | 1. Franck Charras (@fcharras)
34 | 1. Gaël Varoquaux (@GaelVaroquaux)
35 | 1. Guillaume Lemaitre (@glemaitre)
36 | 1. Jérémie du Boisberranger (@jeremiedbb)
37 | 1. Joris Van den Bossche
38 | 1. Julien Jerphanion (@jjerphan)
39 | 1. Loïc Estève
40 | 1. Maren Westermann
41 | 1. Olivier Grisel (@ogrisel)
42 | 1. Roman Yurchak
43 | 1. Thomas Fan
44 | 1. Tim Head (@betatim)
45 |
46 | The following community members joined the sprint:
47 |
48 | 1. Alexandre Landeau
49 | 1. Alexandre Vigny
50 | 1. Chaine San Buenaventura
51 | 1. Camille Troillard
52 | 1. Denis Engemann
53 | 1. Franck Charras
54 | 1. Harizo Rajaona
55 | 1. Ines (intern at Dataiku)
56 | 1. Jovan Stojanovic
57 | 1. Leo Dreyfus-Schmidt
58 | 1. Léo Grinsztajn
59 | 1. Lilian Boulard
60 | 1. Louis Fouquet
61 | 1. Riccardo Cappuzzo
62 | 1. Samuel Ronsin
63 | 1. Vincent Maladière
64 | 1. Yann Lechelle
65 |
66 |
67 |
68 |
69 |
70 | scikit-learn Developer Sprint, Paris, June 2023; Photo credit: Copyright Inria / Photo B. Fourrier, June 2023; (from left to right, back to front):
71 | Last Row: Denis Engemann, Riccardo Cappuzzo, François Goupil, Tim Head, Guillaume Lemaitre, Louis Fouquet, Jérémie du Boisberranger, Franck Charras, Léo Grinsztajn, Arturo Amor Quiroz.
72 | Middle Row: Thomas Fan, Lilian Boulard, Gaël Varoquaux, Ines, Jovan Stojanovic, Chaine San Buenaventura.
73 | First Row: Olivier Grisel, Harizo Rajaona, Vincent Maladière.
74 |
75 |
76 |
77 | ## Sponsors
78 | - Dataiku provided the space and some of the food, as well as all of the coffee.
79 | - The scikit-learn consortium organized the sprint and paid for lunch, travel, and accommodation expenses.
80 |
81 | ## Topics covered at the sprint
82 | - PR #13649: [Monotonic constraints for Tree-based models](https://github.com/scikit-learn/scikit-learn/pull/13649)
83 | - Discussed the vision and future directions for the project: what is important to keep the project relevant in the future?
84 | - Should we share some points beyond the vision statement?
85 | - Thomas F will try and create a vision statement
86 | - Discussed which technologies and developments people are keeping an eye on, with a two-year time scale in mind.
87 |    - Tim: keep improving our documentation (not just expanding it but also “gardening” to keep it readable)
88 |    - Tim: increase active outreach and communication about new features/improvements and other changes. A lot of cool things in scikit-learn are virtually unknown to the wider public (e.g. histogram-based gradient boosting being on par with LightGBM in terms of performance, …)
89 |
90 |
91 | ### What is next?
92 |
93 | We are discussing co-locating with OpenML in 2024 in Berlin, Germany to organize another developers' sprint.
94 |
95 |
96 |
97 |
98 |
99 | scikit-learn Developer Sprint, Paris, June 2023; Photo credit: Copyright Inria / Photo B. Fourrier, June 2023; (from left to right): Thomas Fan, Olivier Grisel
100 |
101 |
102 |
--------------------------------------------------------------------------------
/_posts/2023-27-11-mentoring.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "My mentored internship at scikit-learn"
3 | date: November 27, 2023
4 |
5 | categories:
6 | - Diversity
7 | tags:
8 | - Internship
9 | - Diversity
10 | - Inclusiveness
11 |
12 |
13 | postauthors:
14 | - name: Stefanie Senger
15 | email: stefanie.senger@posteo.de
16 | website: https://github.com/StefanieSenger
17 | image: "stefanie-senger.jpeg"
18 |
19 | - name: François Goupil
20 | email: francois.goupil@inria.fr
21 | website: https://github.com/francoisgoupil
22 | image: "francois_goupil.jpeg"
23 | ---
24 |
25 |
26 | {% include postauthor.html %}
27 |
28 |
29 | ## How it is to be an Intern at scikit-learn
30 |
31 | My name is Stefanie Senger, and I recently concluded a five-month mentored internship at scikit-learn, which was funded by NumFOCUS as a Small Development Grant with a clear focus on fostering diversity in open-source projects. The idea to couple a grant with mentorship traces back to Maren Westermann's initiative. She envisioned a pathway to integrate more female coders into scikit-learn through internships and support. Scikit-learn would profit from fresh perspectives and some disruption. I was the guinea pig for an initial experiment, as Maren later told me.
32 |
33 |
34 | ## Starting the Internship
35 |
36 | As someone transitioning from a non-technical background to coding, working on scikit-learn was a big thing for me. I had participated in and taught at a data science boot camp, searching diligently for a first role in the field. I never doubted I could tackle more difficult tech challenges over time, but I knew there was much to learn. Scikit-learn had a heavy-tech aura to me, and when I discovered the internship ad, I just thought: this. I was genuinely taken aback when accepted for the role, though. There are many more experienced people looking for such an opportunity, after all.
37 |
38 | When I got to know both my mentors, Adrin Jalali and Guillaume Lemaitre, better, it quickly became clear that only effort was required, and that I could ask them any question along the way. I felt very welcome in the community, including among the other people I interacted with on GitHub.
39 |
40 |
41 | ## What I Worked on
42 |
43 | I began by working on documentation and examples such as "Multi-class AdaBoosted Decision Trees," to make those more comprehensive and helpful for users. Then I took on some maintenance tasks on the code that were repetitive, so I could work out what to do from other contributors' pull requests. Guillaume discovered that one AdaBoost algorithm required deprecation, and it fell to me to carry this out. I had never looked at such a huge code base with so many layers of abstraction, and I had to learn quite a bit more Python to be able to proceed. I even got the opportunity to present an "Intro to scikit-learn" workshop at EuroSciPy, the European conference on the scientific use of Python in Basel, where I also got to know many other contributors and people from the scikit-learn team at Inria.
44 |
45 | Adrin introduced me to the challenging task of implementing a new feature for metadata routing, developed over many years by the scikit-learn community. It allows users to set metadata, such as sample weights, in meta-estimators, which can then be routed to sub-estimators and other algorithms that are able to consume it. This was partly uncharted territory and meant finding solutions where there was no predefined path and adapting tests to match the expected behavior. In the last two months of my internship, I implemented metadata routing into some meta-estimators, which was tremendously difficult but, once accomplished, has nourished my professional confidence since.
46 |
47 |
48 | ## Mentorship in Action
49 |
50 | Let me describe how the mentoring worked because Guillaume's and Adrin's support was invaluable. They would both literally drop their tasks when I had questions and point me in the right direction right away. I met Adrin twice a week, and we would co-work while I threw questions at him. Guillaume was available remotely, and I knew he would jump into a video call with me when I needed help. They both made reviewing my PRs a priority, and I got feedback on my work regularly.
51 |
52 | It was essential to have mentors signaling that it's okay to be learning and to propose tasks to me. If I had come into the project individually, I might have hesitated to take on most of the issues I ended up working on, fearing that my skills were insufficient and that I would hinder the progress of the project rather than help it. The mentoring setting gave me a justification to try things that I wasn't sure if I could do.
53 |
54 |
55 | ## Becoming a Community Member
56 |
57 | Looking ahead, I will continue contributing to scikit-learn. As I've gotten to know quite a few of the other contributors in person, I now feel part of the community. I know they care about values like openness and diversity, which I share, and while acknowledging the complexity of the code base, I know what I can learn from taking on issues and the sense of accomplishment when merging my solution into the main branch. And I love contributing to something meaningful, which is something I had always sought.
58 |
--------------------------------------------------------------------------------
/_posts/2024-05-04-authorship-info.md:
--------------------------------------------------------------------------------
1 | ---
2 | #### Blog Post Template ####
3 |
4 | #### Post Information ####
5 | title: "Note on Inline Authorship Information in scikit-learn"
6 | date: May 4, 2024
7 |
8 | #### Post Category and Tags ####
9 | # Format in titlecase without dashes (Ex. "Open Source" instead of "open-source")
10 | categories:
11 | - Updates
12 | tags:
13 | - Open Source
14 | - Machine Learning
15 | - License
16 |
17 | #### Featured Image ####
18 | featured-image: BSD_watermark.svg
19 |
20 | #### Author Info ####
21 | # Can accommodate multiple authors
22 | # Add SQUARE Author Image to /assets/images/author_images/ folder
23 | postauthors:
24 | - name: Adrin Jalali
25 | website: https://adrin.info/
26 | image: adrin-jalali.jpeg
27 | ---
28 |
29 |
30 | {% include postauthor.html %}
31 |
32 |
33 | Historically, scikit-learn's files have included authorship information similar
34 | to the following format:
35 |
36 | ```python
37 | # Authors: Author1, Author2, ...
38 | # License: BSD 3 clause
39 | ```
40 |
41 | However, after a series of discussions which you can see in detail in [this
42 | pull request](https://github.com/scikit-learn/scikit-learn/pull/28799), we could list
43 | the following caveats to the status quo:
44 |
45 | - Authorship information was not up to date and, in most cases but not always,
46 | reflected the original authors of the file;
47 | - It was unfair to all other contributors who have been contributing to the
48 | code-base;
49 | - One can check the real authors and the history of the authors of any part of
50 | the code-base using `git blame` and other `git` tools.
51 |
52 | We therefore decided to standardize all authorship information to mention
53 | "The scikit-learn developers", and to use the following license notice:
54 |
55 | ```python
56 | # Authors: The scikit-learn developers
57 | # SPDX-License-Identifier: BSD-3-Clause
58 | ```
59 |
60 | The change is to happen gradually in the coming months after April 2024.
61 |
--------------------------------------------------------------------------------
/_posts/2024-07-18-yao-interview.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Interview with Yao Xiao, scikit-learn Team Member"
3 | date: July 18, 2024
4 | categories:
5 | - Team
6 | tags:
7 | - Open Source
8 | featured-image:
9 |
10 | postauthors:
11 | - name: Reshama Shaikh
12 | website: https://reshamas.github.io
13 | image: reshama_shaikh.jpeg
14 | - name: Yao Xiao
15 | website: https://charlie-xiao.github.io/
16 | image: yao-xiao.jpeg
17 | ---
18 |
19 |
20 |
21 | {% include postauthor.html %}
22 |
23 |
24 | Yao Xiao recently earned his undergraduate degree in mathematics and computer science. He will be pursuing a Master’s degree in Computational Science and Engineering at Harvard SEAS. Yao joined the scikit-learn team in February 2024.
25 |
26 | 1. __Tell us about yourself.__
27 |
28 | My name is Yao Xiao and I live in Shanghai, China. At the time of this interview, I have just received my Bachelor’s degree in Honors Mathematics and Computer Science from NYU Shanghai, and I’m going to pursue a Master’s degree in Computational Science and Engineering at Harvard SEAS. My current research interests are in networks and systems (e.g. sys4ml and ml4sys), but this may change in the future.
29 |
30 | - GitHub: [@Charlie-XIAO](https://github.com/Charlie-XIAO)
31 | - LinkedIn: [@yao-xiao](https://www.linkedin.com/in/yao-xiao-200073244/)
32 | - Website: [https://charlie-xiao.github.io](https://charlie-xiao.github.io/)
33 |
34 | 1. __How did you first become involved in open source and scikit-learn?__
35 |
36 | In my junior year I took a course at NYU Courant called Open Source Software Development, where we needed to contribute to an open source project as our final project, and I chose scikit-learn.
37 |
38 | 1. __We would love to learn of your open source journey.__
39 |
40 | I was lucky to get involved in a pretty easy meta-issue when I first started contributing to scikit-learn. I made quite a few PRs towards that issue, familiarizing myself with the coding standards, the contributing workflow, etc. During that time I gradually explored the codebase and learned a lot from maintainers about how to write better code. After that meta-issue was completed, I decided to continue contributing since I enjoyed the experience. I started looking through the open issues, tried reproducing and investigating them, then opened PRs for those that I was able to solve. It is a cycle of becoming familiar with more parts of the codebase, which lets me make more PRs, and so on. While contributing to scikit-learn, there are sometimes issues to solve upstream, so I also had opportunities to contribute to projects like pandas and pydata-sphinx-theme. To this day I’m still far from familiar with the entire scikit-learn project, but I will definitely continue this amazing open-source journey.
41 |
42 | 1. __To which OSS projects and communities do you contribute?__
43 |
44 | I have contributed to scikit-learn, pandas, pydata-sphinx-theme, and sphinx-gallery. I’m also writing some small software projects that I have decided to make open source.
45 |
46 | 1. __What do you find alluring about OSS?__
47 |
48 | It is amazing to feel that my code is being used by so many people all around the world through contributing to open source projects. Well, it might be inappropriate to say “my code”, but I do feel like I am making actual contributions to the community instead of just writing code for myself. OSS also makes me care about things like code quality instead of merely making things “work”, which is very important for programmers but not really taught in school.
49 |
50 | 1. __What pain points do you observe in community-led OSS?__
51 |
52 | Collaboration can lead to better code but also slows down the development process. Especially when there are not enough reviewers around, issues and PRs can easily get stale or forgotten. But I would say it’s more of a tradeoff than a pain point.
53 |
54 | 1. __If we discuss how far open source has evolved in 10 years, what would you like to see happen?__
55 |
56 | I can’t speak to the past 10 years since I’ve only been involved for about one and a half years, but regarding the scientific Python ecosystem I would like to see better coordination across projects (which is already happening). For instance, a common interface for array libraries and dataframe libraries would allow downstream dependents to easily provide more flexible support for different input/output types. And as someone from China, I also hope that open source can thrive in my country some day.
57 |
58 | 1. __What are your favorite resources, books, courses, conferences, etc?__
59 |
60 | As for physical books, I would recommend *The Pragmatic Programmer* by Andy Hunt and Dave Thomas, and *Refactoring: Improving the Design of Existing Code* by Martin Fowler and Kent Beck. As for courses, I like MIT’s *The Missing Semester of Your CS Education*. For learning Python in particular, *The Python Tutorial* in the official Python documentation is good enough for me. By the way, I want to mention that the **documentation** of most languages and popular packages is very nice and is the best place to learn the most up-to-date information.
61 |
62 | 1. __What are your hobbies, outside of work and open source?__
63 |
64 | I would say my biggest hobby is programming (not for school, not for work, just for fun). I’ve recently been fascinated with [Tauri](https://v2.tauri.app/) and have written a lot of small desktop applications for myself in my spare time. Apart from this, I also love playing the piano and I’m an anime lover, so I often listen to or play piano versions of anime theme songs (mostly arranged by [Animenz](https://www.animenzpiano.com/)).
65 |
--------------------------------------------------------------------------------
/_posts/2024-09-02-survey-announcement.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Announcing the launch of the scikit-learn user survey"
3 | date: September 2, 2024
4 |
5 | categories:
6 | - Updates
7 | tags:
8 | - Community
9 | - Open Source
10 |
11 | postauthors:
12 | - name: Inessa Pawson
13 | email: inessapawson@gmail.com
14 | website: https://github.com/inessapawson
15 | image: "inessa-pawson.jpg"
16 | - name: François Goupil
17 | email: francois.goupil@inria.fr
18 | website: https://github.com/francoisgoupil
19 | image: "francois_goupil.jpeg"
20 | ---
21 |
22 |
23 | {% include postauthor.html %}
24 |
25 |
26 | We are excited to announce the launch of the scikit-learn user survey! Scikit-learn
27 | continues to evolve thanks to contributions from its diverse user community. As we plan
28 | for future releases, we want to ensure we are focusing on what matters most to you — our
29 | users.
30 |
31 | The goal of this survey is to better understand how users interact with the library,
32 | identify any pain points, and learn which features you find most useful and what’s
33 | missing. This is your chance to have a say in how the library grows and adapts to meet
34 | the evolving needs of the machine learning community.
35 |
36 | The survey will take about 15 minutes of your time. It is available in Arabic, French,
37 | English, Japanese, Mandarin, Spanish, and Portuguese. You have the option to remain
38 | completely anonymous, and the data collected will be used solely for the purpose of
39 | improving scikit-learn.
40 |
41 | This user survey is a truly collaborative effort. We would like to thank the teams from
42 | probabl, University of Oxford (UK), and POSSEE OpenTeams, as well as many scikit-learn
43 | contributors, for their time and effort in designing and translating it.
44 |
45 | Once the survey closes, we’ll analyze the responses and publish the findings in a
46 | follow-up blog post.
47 |
48 | To take the survey, visit:
49 | [https://forms.gle/p5P7AweCJCbFMzfo6](https://forms.gle/p5P7AweCJCbFMzfo6).
50 | The survey will remain open until October 14th, 2024, and we encourage you to share it with your
51 | colleagues and extended network.
52 |
53 | We value every contribution in our community, and we’re committed to making scikit-learn
54 | even better. Your feedback is the foundation upon which scikit-learn will continue to
55 | grow and evolve. We look forward to hearing from you!
56 |
--------------------------------------------------------------------------------
/_posts/templates/2022-01-01-template-post.markdown:
--------------------------------------------------------------------------------
1 | ---
2 | #### Blog Post Template ####
3 |
4 | #### Post Information ####
5 | title: "Blog Post Template"
6 | date: January 4, 2022
7 |
8 | #### Post Category and Tags ####
9 | # Format in titlecase without dashes (Ex. "Open Source" instead of "open-source")
10 | categories:
11 | - Updates
12 | tags:
13 | - Open Source
14 | - Machine Learning
15 |
16 | #### Featured Image ####
17 | featured-image: jml.png
18 |
19 | #### Author Info ####
20 | # Can accommodate multiple authors
21 | # Add SQUARE Author Image to /assets/images/author_images/ folder
22 | postauthors:
23 | - name: First Author
24 | website: https://github.com
25 | email: author@email.com
26 | image: author.jpeg
27 | ---
28 |