38 | {% if site.navigation == 1 or post_count > 0 %}
39 |
40 |
43 | {% else %}
44 |
47 | {% endif %}
48 |
49 | {% endif %}
50 |
51 |
52 |
53 | {% include footer.html %}
54 |
123 | {% if site.google_analytics_id != "" %}
124 | {% include google_analytics.html %}
125 | {% endif %}
126 |
127 |
128 |
--------------------------------------------------------------------------------
/_posts/2025-11-23-using-the-current-environment.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: page
3 | title: "Using the current environment"
4 | category: getting_started
5 | date: 2025-11-23 20:42:38
6 | ---
7 |
8 | ## Getting started
9 |
10 | We recommend using **VS Code + Jupyter** as the default development stack for DataHaskell:
11 | - VS Code as your editor
12 | - Jupyter notebooks for literate, reproducible analysis
13 | - A Haskell notebook kernel (currently IHaskell)
14 | - The DataHaskell libraries (e.g. `dataframe`, `hasktorch`, plotting, etc.)
15 |
16 | This page walks you through:
17 |
18 | 1. Installing the basic tools
19 | 2. Choosing an environment (Dev Container vs local install)
20 | 3. Verifying everything with a “hello DataHaskell” notebook
21 |
22 | ---
23 |
24 | ## 1. Install the basics
25 |
26 | You only need to do this once per machine.
27 |
28 | ### 1.1. VS Code
29 |
30 | 1. Install **Visual Studio Code** from the official website.
31 | 2. Open VS Code and install these extensions:
32 | - **Jupyter**
33 | - **Python** (used by the Jupyter extension, even if you write Haskell)
34 | - **Dev Containers** (if you plan to use the container-based environment)
35 | - **Haskell** (for syntax highlighting, type info, etc.)
36 |
37 | ### 1.2. Git
38 |
39 | Install Git so you can clone repositories:
40 |
41 | - macOS: via Homebrew (`brew install git`) or Xcode command line tools
42 | - Linux: via your package manager (e.g. `sudo apt install git`)
43 | - Windows: [Git for Windows](https://gitforwindows.org/) or via WSL (Ubuntu on Windows)
44 |
45 | ### 1.3. (Optional but recommended) Docker
46 |
47 | If you want the easiest, most reproducible setup, install Docker:
48 |
49 | - Docker Desktop (macOS/Windows) or
50 | - `docker` + `docker-compose` from your Linux distro
51 |
52 | The Dev Container–based environment assumes Docker is available.
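
You can confirm Docker is installed and running with:

```bash
docker --version
```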
53 |
54 | ---
55 |
56 | ## 2. Choose an environment
57 |
58 | You have **two main options**:
59 |
60 | 1. **Option A (recommended): VS Code Dev Container**
61 | Everything is pre-installed in a Docker image (GHC, Cabal/Stack, IHaskell, DataFrame, etc).
62 |
63 | 2. **Option B: Local installation**
64 | Install GHC, Cabal, Jupyter, IHaskell, and DataHaskell libraries directly on your machine.
65 |
66 | If you’re not sure which to choose, pick **Option A**.
67 |
68 | ---
69 |
70 | ## 3. Option A – Dev Container (recommended)
71 |
72 | This is the “batteries included” path. You get a pinned environment without polluting your global system.
73 |
74 | ### 3.1. Clone the starter repository
75 |
76 | We provide a starter repository with a ready-made environment and example notebooks:
77 |
78 | ```bash
79 | git clone https://github.com/DataHaskell/datahaskell-starter
80 | cd datahaskell-starter
81 | ```
82 |
83 | ### 3.2. Open the project in VS Code
84 |
85 | ```bash
86 | code .
87 | ```
88 |
89 | You'll get a popup asking if you want to re-open the project in a container.
90 | Select this option and VS Code will build the DataHaskell Docker image and open the project inside it.
91 |
92 | ### 3.3. Running the example notebook
93 |
94 | Open the `getting-started` notebook. You'll see a button labelled `Select Kernel` at the top right.
95 |
96 | Click it and you'll be asked to select a kernel. Go to `Jupyter Environment` and choose the Haskell kernel installed there.
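
To verify everything end to end, create a new cell and run a small "hello DataHaskell" check. This is a minimal sketch: it assumes the `dataframe` library is available in the kernel, and the CSV path is a placeholder (point it at any CSV in the repository):

```haskell
-- Load a CSV and print its shape as (rows, columns).
-- The path below is an assumption; adjust it to a CSV you have.
import qualified DataFrame as D

df <- D.readCsv "data/housing.csv"
print (D.dimensions df)
```

If this prints a pair of numbers, your environment is ready.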
97 |
98 | ## 4. Option B – Installing everything locally
99 |
100 | We recommend using Cabal for this section. Make sure GHC and Cabal are already installed (e.g. via GHCup).
101 |
102 | ```bash
103 | cabal update
104 | cabal install --lib dataframe ihaskell-dataframe dataframe-hasktorch hasktorch \
105 |   ihaskell time template-haskell \
106 |   vector text containers array random unix directory regex-tdfa \
107 |   cassava statistics monad-bayes aeson \
108 |   --force-reinstalls
109 | cabal install ihaskell --install-method=copy --installdir=/opt/bin
110 | ihaskell install --ghclib=$(ghc --print-libdir) --prefix=$HOME/.local/
111 | jupyter kernelspec install $HOME/.local/share/jupyter/kernels/haskell/
112 | jupyter notebook
113 | ```
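
To confirm the kernel was registered, list the available Jupyter kernels (the exact output varies by machine):

```bash
jupyter kernelspec list
```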
114 |
115 | Check that this setup works by running the linear regression tutorial from the DataHaskell website.
116 |
117 | > **Note:** installing packages globally like this may break some of your existing projects.
118 |
119 |
--------------------------------------------------------------------------------
/css/syntax.css:
--------------------------------------------------------------------------------
1 | .highlight .hll { background-color: #ffffcc }
2 | .highlight { background: #ffffff; }
3 | .highlight .c { color: #888888 } /* Comment */
4 | .highlight .err { color: #a61717; background-color: #e3d2d2 } /* Error */
5 | .highlight .k { color: #008800; font-weight: bold } /* Keyword */
6 | .highlight .cm { color: #888888 } /* Comment.Multiline */
7 | .highlight .cp { color: #cc0000; font-weight: bold } /* Comment.Preproc */
8 | .highlight .c1 { color: #888888 } /* Comment.Single */
9 | .highlight .cs { color: #cc0000; font-weight: bold; background-color: #fff0f0 } /* Comment.Special */
10 | .highlight .gd { color: #000000; background-color: #ffdddd } /* Generic.Deleted */
11 | .highlight .ge { font-style: italic } /* Generic.Emph */
12 | .highlight .gr { color: #aa0000 } /* Generic.Error */
13 | .highlight .gh { color: #333333 } /* Generic.Heading */
14 | .highlight .gi { color: #000000; background-color: #ddffdd } /* Generic.Inserted */
15 | .highlight .go { color: #888888 } /* Generic.Output */
16 | .highlight .gp { color: #555555 } /* Generic.Prompt */
17 | .highlight .gs { font-weight: bold } /* Generic.Strong */
18 | .highlight .gu { color: #666666 } /* Generic.Subheading */
19 | .highlight .gt { color: #aa0000 } /* Generic.Traceback */
20 | .highlight .kc { color: #008800; font-weight: bold } /* Keyword.Constant */
21 | .highlight .kd { color: #008800; font-weight: bold } /* Keyword.Declaration */
22 | .highlight .kn { color: #008800; font-weight: bold } /* Keyword.Namespace */
23 | .highlight .kp { color: #008800 } /* Keyword.Pseudo */
24 | .highlight .kr { color: #008800; font-weight: bold } /* Keyword.Reserved */
25 | .highlight .kt { color: #888888; font-weight: bold } /* Keyword.Type */
26 | .highlight .m { color: #0000DD; font-weight: bold } /* Literal.Number */
27 | .highlight .s { color: #dd2200; background-color: #fff0f0 } /* Literal.String */
28 | .highlight .na { color: #336699 } /* Name.Attribute */
29 | .highlight .nb { color: #003388 } /* Name.Builtin */
30 | .highlight .nc { color: #bb0066; font-weight: bold } /* Name.Class */
31 | .highlight .no { color: #003366; font-weight: bold } /* Name.Constant */
32 | .highlight .nd { color: #555555 } /* Name.Decorator */
33 | .highlight .ne { color: #bb0066; font-weight: bold } /* Name.Exception */
34 | .highlight .nf { color: #0066bb; font-weight: bold } /* Name.Function */
35 | .highlight .nl { color: #336699; font-style: italic } /* Name.Label */
36 | .highlight .nn { color: #bb0066; font-weight: bold } /* Name.Namespace */
37 | .highlight .py { color: #336699; font-weight: bold } /* Name.Property */
38 | .highlight .nt { color: #bb0066; font-weight: bold } /* Name.Tag */
39 | .highlight .nv { color: #336699 } /* Name.Variable */
40 | .highlight .ow { color: #008800 } /* Operator.Word */
41 | .highlight .w { color: #bbbbbb } /* Text.Whitespace */
42 | .highlight .mf { color: #0000DD; font-weight: bold } /* Literal.Number.Float */
43 | .highlight .mh { color: #0000DD; font-weight: bold } /* Literal.Number.Hex */
44 | .highlight .mi { color: #0000DD; font-weight: bold } /* Literal.Number.Integer */
45 | .highlight .mo { color: #0000DD; font-weight: bold } /* Literal.Number.Oct */
46 | .highlight .sb { color: #dd2200; background-color: #fff0f0 } /* Literal.String.Backtick */
47 | .highlight .sc { color: #dd2200; background-color: #fff0f0 } /* Literal.String.Char */
48 | .highlight .sd { color: #dd2200; background-color: #fff0f0 } /* Literal.String.Doc */
49 | .highlight .s2 { color: #dd2200; background-color: #fff0f0 } /* Literal.String.Double */
50 | .highlight .se { color: #0044dd; background-color: #fff0f0 } /* Literal.String.Escape */
51 | .highlight .sh { color: #dd2200; background-color: #fff0f0 } /* Literal.String.Heredoc */
52 | .highlight .si { color: #3333bb; background-color: #fff0f0 } /* Literal.String.Interpol */
53 | .highlight .sx { color: #22bb22; background-color: #f0fff0 } /* Literal.String.Other */
54 | .highlight .sr { color: #008800; background-color: #fff0ff } /* Literal.String.Regex */
55 | .highlight .s1 { color: #dd2200; background-color: #fff0f0 } /* Literal.String.Single */
56 | .highlight .ss { color: #aa6600; background-color: #fff0f0 } /* Literal.String.Symbol */
57 | .highlight .bp { color: #003388 } /* Name.Builtin.Pseudo */
58 | .highlight .vc { color: #336699 } /* Name.Variable.Class */
59 | .highlight .vg { color: #dd7700 } /* Name.Variable.Global */
60 | .highlight .vi { color: #3333bb } /* Name.Variable.Instance */
61 | .highlight .il { color: #0000DD; font-weight: bold } /* Literal.Number.Integer.Long */
62 |
--------------------------------------------------------------------------------
/_posts/2016-10-19-code-of-conduct.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: page
3 | title: "Code of conduct"
4 | category: community
5 | date: 2016-10-19 16:18:58
6 | ---
7 |
8 | **DataHaskell** supports the [Berlin code of conduct](http://berlincodeofconduct.org/), which has a clear goal:
9 | **To enrich a friendly, safe and welcoming environment.**
10 |
11 | ## Purpose
12 |
13 | A primary goal of all the conferences and user groups that refer to this Code of Conduct is to be inclusive to the largest number of contributors, with the most varied and diverse backgrounds possible. As such, we are committed to providing a friendly, safe and welcoming environment for all, regardless of gender, sexual orientation, ability, ethnicity, socioeconomic status and religion (or lack thereof).
14 |
15 | This Code of Conduct outlines our expectations for all those who participate in our community, as well as the consequences for unacceptable behavior.
16 |
17 | We invite all those who participate in our events to help us create safe and positive experiences for everyone.
18 |
19 | ## Open [Source/Culture/Tech] Citizenship
20 |
21 | A supplemental goal of this Code of Conduct is to increase open [source/culture/tech] citizenship by encouraging participants to recognize and strengthen the relationships between our actions and their effects on our community.
22 |
23 | Communities mirror the societies in which they exist and positive action is essential to counteract the many forms of inequality and abuses of power that exist in society.
24 |
25 | If you see someone who is making an extra effort to ensure our community is welcoming, friendly, and encourages all participants to contribute to the fullest extent, we want to know.
26 |
27 | ## Expected Behavior
28 |
29 | - Participate in an authentic and active way. In doing so, you contribute to the health and longevity of this community.
30 | - Exercise consideration and respect in your speech and actions.
31 | - Attempt collaboration before conflict.
32 | - Refrain from demeaning, discriminatory, or harassing behavior and speech.
33 | - Be mindful of your surroundings and of your fellow participants. Alert community leaders if you notice a dangerous situation, someone in distress, or violations of this Code of Conduct, even if they seem inconsequential.
34 |
35 | ## Unacceptable Behavior
36 |
37 | Unacceptable behaviors include: intimidating, harassing, abusive, discriminatory, derogatory or demeaning speech or actions by any participant in our community online, at all related events and in one-on-one communications carried out in the context of community business. Community event venues may be shared with members of the public; please be respectful to all patrons of these locations.
38 |
39 | Harassment includes: harmful or prejudicial verbal or written comments related to gender, sexual orientation, race, religion, disability; inappropriate use of nudity and/or sexual images in public spaces (including presentation slides); deliberate intimidation, stalking or following; harassing photography or recording; sustained disruption of talks or other events; inappropriate physical contact, and unwelcome sexual attention.
40 |
41 | ## Consequences of Unacceptable Behavior
42 |
43 | Unacceptable behavior from any community member, including sponsors and those with decision-making authority, will not be tolerated. Anyone asked to stop unacceptable behavior is expected to comply immediately.
44 |
45 | If a community member engages in unacceptable behavior, the community organizers may take any action they deem appropriate, up to and including a temporary ban or permanent expulsion from the community without warning (and without refund in the case of a paid event).
46 |
47 | ## If You Witness or Are Subject to Unacceptable Behavior
48 |
49 | If you are subject to or witness unacceptable behavior, or have any other concerns, please notify a community organizer as soon as possible. You can find a list of organizers to contact for each of the supporters of this code of conduct at the bottom of this page. Additionally, community organizers are available to help community members engage with local law enforcement or to otherwise help those experiencing unacceptable behavior feel safe. In the context of in-person events, organizers will also provide escorts as desired by the person experiencing distress.
50 |
51 | ## Addressing Grievances
52 |
53 | If you feel you have been falsely or unfairly accused of violating this Code of Conduct, you should notify one of the event organizers with a concise description of your grievance. Your grievance will be handled in accordance with our existing governing policies.
54 |
55 | ## Scope
56 |
57 | We expect all community participants (contributors, paid or otherwise; sponsors; and other guests) to abide by this Code of Conduct in all community venues—online and in-person—as well as in all one-on-one communications pertaining to community business.
58 |
59 | ## License and attribution
60 |
61 | Berlin Code of Conduct is distributed under a Creative Commons Attribution-ShareAlike license. It is based on the [pdx.rb code of conduct](http://pdxruby.org/codeofconduct), which is distributed under the same license.
62 |
--------------------------------------------------------------------------------
/_posts/2016-10-19-contributing-with-code.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: page
3 | title: "Contributing with code"
4 | category: community
5 | date: 2016-10-19 17:00:25
6 | ---
7 |
8 | Managing code can be hard; in Haskell there's generally more than one way to do it (TMTOWTDI). Here are some tips to make life easier for you and for everyone involved in your contribution:
9 |
10 | ## Formatting
11 | This code style guide is based on [the Haskell style guide](https://github.com/tibbe/haskell-style-guide/blob/master/haskell-style.md).
12 |
13 | ### Line Length
14 |
15 | Maximum line length is *80 characters*.
16 |
17 | ### Indentation
18 |
19 | Tabs are illegal. Use spaces for indenting. Indent your code blocks
20 | with *4 spaces*. Indent the `where` keyword two spaces to set it
21 | apart from the rest of the code and indent the definitions in a
22 | `where` clause two spaces. Some examples:
23 |
24 | ```haskell
25 | sayHello :: IO ()
26 | sayHello = do
27 | name <- getLine
28 | putStrLn $ greeting name
29 | where
30 | greeting name = "Hello, " ++ name ++ "!"
31 |
32 | filter :: (a -> Bool) -> [a] -> [a]
33 | filter _ [] = []
34 | filter p (x:xs)
35 | | p x = x : filter p xs
36 | | otherwise = filter p xs
37 | ```
38 |
39 | ### Blank Lines
40 |
41 | One blank line between top-level definitions. No blank lines between
42 | type signatures and function definitions. Add one blank line between
43 | functions in a type class instance declaration if the function bodies
44 | are large. Use your judgement.
45 |
46 | ### Whitespace
47 |
48 | Surround binary operators with a single space on either side. Use
49 | your better judgement for the insertion of spaces around arithmetic
50 | operators but always be consistent about whitespace on either side of
51 | a binary operator. Don't insert a space after a lambda.
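
For instance, a short sketch of these rules:

```haskell
-- Single space around binary operators.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- No space after a lambda.
increment :: [Int] -> [Int]
increment = map (\x -> x + 1)
```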
52 |
53 | ### Data Declarations
54 |
55 | Align the constructors in a data type definition. Example:
56 |
57 | ```haskell
58 | data Tree a = Branch !a !(Tree a) !(Tree a)
59 | | Leaf
60 | ```
61 |
62 | For long type names the following formatting is also acceptable:
63 |
64 | ```haskell
65 | data HttpException
66 | = InvalidStatusCode Int
67 | | MissingContentHeader
68 | ```
69 |
70 | Format records as follows:
71 |
72 | ```haskell
73 | data Person = Person
74 | { firstName :: !String -- ^ First name
75 | , lastName :: !String -- ^ Last name
76 | , age :: !Int -- ^ Age
77 | } deriving (Eq, Show)
78 | ```
79 |
80 | ### List Declarations
81 |
82 | Align the elements in the list. Example:
83 |
84 | ```haskell
85 | exceptions =
86 | [ InvalidStatusCode
87 | , MissingContentHeader
88 | , InternalServerError
89 | ]
90 | ```
91 |
92 | Optionally, you can skip the first newline. Use your judgement.
93 |
94 | ```haskell
95 | directions = [ North
96 | , East
97 | , South
98 | , West
99 | ]
100 | ```
101 |
102 | ### Pragmas
103 |
104 | Put pragmas immediately following the function they apply to.
105 | Example:
106 |
107 | ```haskell
108 | id :: a -> a
109 | id x = x
110 | {-# INLINE id #-}
111 | ```
112 |
113 | In the case of data type definitions you must put the pragma before
114 | the type it applies to. Example:
115 |
116 | ```haskell
117 | data Array e = Array
118 | {-# UNPACK #-} !Int
119 | !ByteArray
120 | ```
121 |
122 | ### Hanging Lambdas
123 |
124 | You may or may not indent the code following a "hanging" lambda. Use
125 | your judgement. Some examples:
126 |
127 | ```haskell
128 | bar :: IO ()
129 | bar = forM_ [1, 2, 3] $ \n -> do
130 | putStrLn "Here comes a number!"
131 | print n
132 |
133 | foo :: IO ()
134 | foo = alloca 10 $ \a ->
135 | alloca 20 $ \b ->
136 | cFunction a b
137 | ```
138 |
139 | ### Export Lists
140 |
141 | Format export lists as follows:
142 |
143 | ```haskell
144 | module Data.Set
145 | (
146 | -- * The @Set@ type
147 | Set
148 | , empty
149 | , singleton
150 |
151 | -- * Querying
152 | , member
153 | ) where
154 | ```
155 |
156 | ### If-then-else clauses
157 |
158 | Generally, guards and pattern matches should be preferred over if-then-else
159 | clauses, where possible. Short cases should usually be put on a single line
160 | (when line length allows it).
161 |
162 | When writing non-monadic code (i.e. when not using `do`) and guards
163 | and pattern matches can't be used, you can align if-then-else clauses
164 | like you would normal expressions:
165 |
166 | ```haskell
167 | foo = if ...
168 | then ...
169 | else ...
170 | ```
171 |
172 | Otherwise, you should be consistent with the 4-spaces indent rule, and the
173 | `then` and the `else` keyword should be aligned. Examples:
174 |
175 | ```haskell
176 | foo = do
177 | someCode
178 | if condition
179 | then someMoreCode
180 | else someAlternativeCode
181 | ```
182 |
183 | ```haskell
184 | foo = bar $ \qux -> if predicate qux
185 | then doSomethingSilly
186 | else someOtherCode
187 | ```
188 |
189 | The same rule applies to nested do blocks:
190 |
191 | ```haskell
192 | foo = do
193 | instruction <- decodeInstruction
194 | skip <- load Memory.skip
195 | if skip == 0x0000
196 | then do
197 | execute instruction
198 | addCycles $ instructionCycles instruction
199 | else do
200 | store Memory.skip 0x0000
201 | addCycles 1
202 | ```
203 |
204 | ### Case expressions
205 |
206 | The alternatives in a case expression can be indented using either of
207 | the two following styles:
208 |
209 | ```haskell
210 | foobar = case something of
211 | Just j -> foo
212 | Nothing -> bar
213 | ```
214 |
215 | or as
216 |
217 | ```haskell
218 | foobar = case something of
219 | Just j -> foo
220 | Nothing -> bar
221 | ```
222 |
223 | Align the `->` arrows when it helps readability.
224 |
225 | ## Imports
226 |
227 | Imports should be grouped in the following order:
228 |
229 | 1. standard library imports
230 | 2. related third party imports
231 | 3. local application/library specific imports
232 |
233 | Put a blank line between each group of imports. The imports in each
234 | group should be sorted alphabetically, by module name.
235 |
236 | Always use explicit import lists or `qualified` imports for standard
237 | and third party libraries. This makes the code more robust against
238 | changes in these libraries. Exception: The Prelude.
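
Putting these rules together, an import section might look like this (the local module name is illustrative):

```haskell
-- Standard library
import Data.List (sortBy)
import Data.Maybe (fromMaybe)

-- Third party
import qualified Data.Text as T
import qualified Data.Vector as V

-- Local application (illustrative name)
import MyApp.Config (Config)
```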
239 |
240 | ## Comments
241 |
242 | ### Punctuation
243 |
244 | Write proper sentences; start with a capital letter and use proper
245 | punctuation.
246 |
247 | ### Top-Level Definitions
248 |
249 | Comment every top level function (particularly exported functions),
250 | and provide a type signature; use Haddock syntax in the comments.
251 | Comment every exported data type. Function example:
252 |
253 | ```haskell
254 | -- | Send a message on a socket. The socket must be in a connected
255 | -- state. Returns the number of bytes sent. Applications are
256 | -- responsible for ensuring that all data has been sent.
257 | send :: Socket -- ^ Connected socket
258 | -> ByteString -- ^ Data to send
259 | -> IO Int -- ^ Bytes sent
260 | ```
261 |
262 | For functions the documentation should give enough information to
263 | apply the function without looking at the function's definition.
264 |
265 | Record example:
266 |
267 | ```haskell
268 | -- | Bla bla bla.
269 | data Person = Person
270 | { age :: !Int -- ^ Age
271 | , name :: !String -- ^ First name
272 | }
273 | ```
274 |
275 | For fields that require longer comments format them like so:
276 |
277 | ```haskell
278 | data Record = Record
279 | { -- | This is a very very very long comment that is split over
280 | -- multiple lines.
281 | field1 :: !Text
282 |
283 | -- | This is a second very very very long comment that is split
284 | -- over multiple lines.
285 | , field2 :: !Int
286 | }
287 | ```
288 |
289 | ### End-of-Line Comments
290 |
291 | Separate end-of-line comments from the code using 2 spaces. Align
292 | comments for data type definitions. Some examples:
293 |
294 | ```haskell
295 | data Parser = Parser
296 | !Int -- Current position
297 | !ByteString -- Remaining input
298 |
299 | foo :: Int -> Int
300 | foo n = salt * 32 + 9
301 | where
302 | salt = 453645243 -- Magic hash salt.
303 | ```
304 |
305 | ### Links
306 |
307 | Use in-line links economically. You are encouraged to add links for
308 | API names. It is not necessary to add links for all API names in a
309 | Haddock comment. We therefore recommend adding a link to an API name
310 | if:
311 |
312 | * The user might actually want to click on it for more information (in
313 | your judgment), and
314 |
315 | * Only for the first occurrence of each API name in the comment (don't
316 | bother repeating a link)
317 |
318 | ## Naming
319 |
320 | Use camel case (e.g. `functionName`) when naming functions and upper
321 | camel case (e.g. `DataType`) when naming data types.
322 |
323 | For readability reasons, don't capitalize all letters when using an
324 | abbreviation. For example, write `HttpServer` instead of
325 | `HTTPServer`. Exception: Two letter abbreviations, e.g. `IO`.
326 |
327 | ### Modules
328 |
329 | Use singular when naming modules e.g. use `Data.Map` and
330 | `Data.ByteString.Internal` instead of `Data.Maps` and
331 | `Data.ByteString.Internals`.
332 |
333 | ## Dealing with laziness
334 |
335 | By default, use strict data types and lazy functions.
336 |
337 | ### Data types
338 |
339 | Constructor fields should be strict, unless there's an explicit reason
340 | to make them lazy. This avoids many common pitfalls caused by too much
341 | laziness and reduces the number of brain cycles the programmer has to
342 | spend thinking about evaluation order.
343 |
344 | ```haskell
345 | -- Good
346 | data Point = Point
347 | { pointX :: !Double -- ^ X coordinate
348 | , pointY :: !Double -- ^ Y coordinate
349 | }
350 | ```
351 |
352 | ```haskell
353 | -- Bad
354 | data Point = Point
355 | { pointX :: Double -- ^ X coordinate
356 | , pointY :: Double -- ^ Y coordinate
357 | }
358 | ```
359 |
360 | Additionally, unpacking simple fields often improves performance and
361 | reduces memory usage:
362 |
363 | ```haskell
364 | data Point = Point
365 | { pointX :: {-# UNPACK #-} !Double -- ^ X coordinate
366 | , pointY :: {-# UNPACK #-} !Double -- ^ Y coordinate
367 | }
368 | ```
369 |
370 | As an alternative to the `UNPACK` pragma, you can put
371 |
372 | ```haskell
373 | {-# OPTIONS_GHC -funbox-strict-fields #-}
374 | ```
375 |
376 | at the top of the file. Including this flag in the file itself instead
377 | of e.g. in the Cabal file is preferable as the optimization will be
378 | applied even if someone compiles the file using other means (i.e. the
379 | optimization is attached to the source code it belongs to).
380 |
381 | Note that `-funbox-strict-fields` applies to all strict fields, not
382 | just small fields (e.g. `Double` or `Int`). If you're using GHC 7.4 or
383 | later you can use `NOUNPACK` to selectively opt-out for the unpacking
384 | enabled by `-funbox-strict-fields`.
385 |
386 | ### Functions
387 |
388 | Have function arguments be lazy unless you explicitly need them to be
389 | strict.
390 |
391 | The most common case when you need strict function arguments is in
392 | recursion with an accumulator:
393 |
394 | ```haskell
395 | mysum :: [Int] -> Int
396 | mysum = go 0
397 | where
398 | go !acc [] = acc
399 | go acc (x:xs) = go (acc + x) xs
400 | ```
401 |
402 | ## Misc
404 |
405 | ### Point-free style
406 |
407 | Avoid over-using point-free style. For example, this is hard to read:
408 |
409 | ```haskell
410 | -- Bad:
411 | f = (g .) . h
412 | ```
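
For comparison, a pointful version of the same function is much clearer:

```haskell
-- Good: name the arguments instead.
f x y = g (h x y)
```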
413 |
414 | ### Warnings
415 |
416 | Code should be compilable with `-Wall -Werror`. There should be no
417 | warnings.
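
One way to enforce this per file (alternatively, set `ghc-options: -Wall -Werror` in the `.cabal` file):

```haskell
{-# OPTIONS_GHC -Wall -Werror #-}
```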
418 |
419 | ## Submitting pull requests
420 |
421 | Try to submit a pull request for **everything**: from a small function to some docs, a comment, or whatever. Even if you are the author of a new repository.
422 |
423 | Pull requests help everyone get to know what you just did. Everyone learns from you and you learn from anyone that suggests changes. Isn't that awesome?
424 |
--------------------------------------------------------------------------------
/_posts/2025-11-09-linear-regression.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: page
3 | title: "Linear Regression: California House Price Prediction"
4 | category: tutorial
5 | date: 2025-11-09 14:08:18
6 | ---
7 |
8 | In this tutorial, we'll predict California housing prices using two Haskell libraries: **DataFrame** (for data wrangling) and **Hasktorch** (for machine learning).
9 |
10 | You can follow along with the code [here](https://ulwazi-exh9dbh2exbzgbc9.westus-01.azurewebsites.net/lab/tree/California_Housing.ipynb).
11 |
12 | ## What Are We Building?
13 |
14 | We're going to:
15 | 1. 📊 Load and clean real housing data
16 | 2. 🔧 Engineer some clever features
17 | 3. 🤖 Train a linear regression model
18 | 4. 🎯 Predict house prices!
19 |
20 | Think of it as teaching a computer to estimate home values based on things like location, number of rooms, and how close the house is to the ocean.
21 |
22 | ## Our libraries
23 |
24 | ### DataFrame
25 | DataFrame is the Swiss Army knife of data manipulation. It lets you work with tabular data (like CSV files) in a mostly type-safe, functional way.
26 |
27 | ### Hasktorch
28 | Hasktorch brings the power of Torch to Haskell. It lets us do numerical computing and machine learning. It has tensors (multi-dimensional arrays) which are the building blocks of neural networks.
29 |
30 | ## Let's Dive Into The Code!
31 |
32 | ### Setting Up Our Imports
33 |
34 | ```haskell
35 | {-# LANGUAGE BangPatterns #-}
36 | {-# LANGUAGE NumericUnderscores #-}
37 | {-# LANGUAGE OverloadedStrings #-}
38 | {-# LANGUAGE ScopedTypeVariables #-}
39 | {-# LANGUAGE TypeApplications #-}
40 |
41 | module Main where
42 |
43 | import Control.Monad (when)                 -- used in the training loop
44 | import Data.Text (Text)                     -- for the ocean proximity mapping
45 | import qualified Data.Vector.Unboxed as VU  -- for reading predictions back out
46 | import qualified DataFrame as D
47 | import qualified DataFrame.Functions as F
48 | import DataFrame.Hasktorch (toTensor)
49 | import Torch
50 | import DataFrame ((|>))
51 | ```
49 |
50 | **What's happening here?** We're enabling some handy language extensions and importing our tools. The `|>` operator is particularly cool: like the Unix pipe, it lets us chain operations left-to-right!
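
As a quick illustration (a hypothetical snippet, where `exampleDf` stands in for any DataFrame), these two expressions are equivalent:

```haskell
-- With the pipe: read left to right.
exampleDf |> D.exclude ["median_house_value"] |> D.select ["population"]

-- Without it: nested application, read inside out.
D.select ["population"] (D.exclude ["median_house_value"] exampleDf)
```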
51 |
52 | ### Step 1: Loading the Data
53 |
54 | ```haskell
55 | df <- D.readCsv "../data/housing.csv"
56 | ```
57 |
58 | **Simple, right?** We're loading California housing data from a CSV file. This dataset contains information about different neighborhoods—things like population, median income, and (importantly) median house values.
59 |
60 | ### Step 2: Handling Missing Data
61 |
62 | Real-world data is messy. Sometimes values are missing, and we need to deal with that:
63 |
64 | ```haskell
65 | let meanTotalBedrooms = df |> D.filterJust "total_bedrooms" |> D.mean (F.col @Double "total_bedrooms")
66 | ```
67 |
68 | **Translation:** "Hey DataFrame, take our data, filter out the rows where `total_bedrooms` is missing, then calculate the mean of what's left."
69 |
70 | We'll use this mean to fill in the blanks later. This is called **imputation**—fancy word for "educated guess filling."
71 |
72 | ### Step 3: Feature Engineering
73 |
74 | Arguably the most important part of the learning process is making sure your data is meaningful. This is called **"feature engineering."** We want to combine our features in interesting ways so that patterns become easier for the model to spot.
75 |
76 | Machine learning models are powerful, but they're not magic. They can only learn from what we give them. If we just hand over raw numbers, we're making the model work way harder than it needs to. But if we do some creative thinking and craft features that highlight the relationships we care about, we can make even a simple model perform amazingly well.
77 |
78 | In our housing example, we're going to:
79 | - Convert text categories (like "NEAR OCEAN") into numbers the model can use (0 for closest to the ocean, 4 for furthest, and 5 as a fallback for unrecognized values).
80 | - Create a brand new feature: `rooms_per_household` (because maybe spacious homes are worth more?)
81 | - Normalize everything so no single feature dominates
82 |
83 | ```haskell
84 | oceanProximityMapping :: [(Text, Int)]
85 | oceanProximityMapping = [("ISLAND", 0), ("NEAR OCEAN", 1), ("NEAR BAY", 2), ("<1H OCEAN", 3), ("INLAND", 4)]
86 |
87 | let cleaned =
88 | df
89 | |> D.impute (F.col @(Maybe Double) "total_bedrooms") meanTotalBedrooms
90 | |> D.exclude ["median_house_value"]
91 | |> D.derive "ocean_proximity" (F.recodeWithDefault 5 oceanProximityMapping (F.col "ocean_proximity"))
92 | |> D.derive
93 | "rooms_per_household"
94 | (F.col @Double "total_rooms" / F.col "households")
95 | |> normalizeFeatures
96 | ```
97 |
98 | **Let's break this pipeline down:**
99 |
100 | 1. **Impute**: Fill in those missing bedroom values with the mean we calculated
101 | 2. **Exclude**: Remove the house value column (we'll use it as labels, not features)
102 | 3. **Derive ocean_proximity**: Convert text like "NEAR OCEAN" into numbers (0-4) that our model can understand
103 | 4. **Derive rooms_per_household**: Create a new feature! Maybe houses with more rooms per household are worth more?
104 | 5. **Normalize**: Scale all features to a 0-1 range so no single feature dominates
105 |
106 | ### Feature Normalization
107 |
108 | ```haskell
109 | normalizeFeatures :: D.DataFrame -> D.DataFrame
110 | normalizeFeatures df =
111 | df
112 | |> D.fold
113 | ( \name d ->
114 | let col = F.col @Double name
115 | in D.derive name ((col - F.minimum col) / (F.maximum col - F.minimum col)) d
116 | )
117 | (D.columnNames (df |> D.selectBy [D.byProperty (D.hasElemType @Double)]))
118 | ```
119 |
120 | Neural networks do better when all the data is on the same scale, so we apply **min-max normalization** to every numeric column:
121 |
122 | ```
123 | normalized_value = (value - min) / (max - min)
124 | ```
125 |
126 | This squishes every feature to the 0-1 range. Why? Imagine if house prices ranged from 0-500,000 but number of bedrooms ranged from 0-5. The huge price numbers would dominate the small bedroom numbers during training. Normalization levels the playing field.
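
For example, if bedroom counts range from 0 to 5, a value of 2 normalizes to (2 - 0) / (5 - 0) = 0.4, and a 350,000 price in a 0-500,000 range normalizes to 0.7: both now live on the same 0-1 scale.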
127 |
128 | ### Step 4: From DataFrame to Tensors
129 |
130 | ```haskell
131 | features = toTensor cleaned
132 | labels = toTensor (D.select ["median_house_value"] df)
133 | ```
134 |
135 | **Bridge time!** We're converting our nice, clean DataFrame into Hasktorch tensors. Think of tensors as supercharged matrices that GPUs love to work with. Our `features` are what the model learns from, and `labels` are what it's trying to predict.
136 |
137 | ### Step 5: Building Our Model
138 |
139 | ```haskell
140 | init <- sample $ LinearSpec{in_features = snd (D.dimensions cleaned), out_features = 1}
141 | ```
142 |
143 | **What's a linear model?** Imagine drawing the best-fit line through a scatter plot—except we're doing it in many dimensions. The model learns:
144 |
145 | ```
146 | house_price = w₁×feature₁ + w₂×feature₂ + ... + wₙ×featureₙ + bias
147 | ```
148 |
149 | We're creating a linear layer with as many inputs as we have features (after cleaning) and 1 output (the predicted price).
150 |
151 | ```haskell
152 | model :: Linear -> Tensor -> Tensor
153 | model state input = squeezeAll $ linear state input
154 | ```
155 |
156 | This is our prediction function—feed in features, get out a price estimate.
157 |
158 | ### Step 6: Training Loop
159 |
160 | ```haskell
161 | trained <- foldLoop init 100_000 $ \state i -> do
162 | let labels' = model state features
163 | loss = mseLoss labels labels'
164 | when (i `mod` 10_000 == 0) $ do
165 | putStrLn $ "Iteration: " ++ show i ++ " | Loss: " ++ show loss
166 | (state', _) <- runStep state GD loss 0.1
167 | pure state'
168 | ```
169 |
170 | **This is where learning happens!** Let's break it down:
171 |
172 | 1. **100,000 iterations**: The model gets 100,000 chances to improve
173 | 2. **labels'**: Make predictions with current model weights
174 | 3. **loss**: How wrong are we? MSE (Mean Squared Error) measures the average squared difference between predictions and real prices
175 | 4. **Print every 10,000 steps**: Show us how we're doing!
176 | 5. **runStep with GD**: Update the model using **Gradient Descent** with a learning rate of 0.1
177 | - Think of gradient descent as rolling a ball down a hill to find the lowest point (best model)
178 | - Learning rate controls how big our steps are
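
In equation form, each step nudges every weight against its gradient:

```
new_weight = old_weight - learning_rate × (∂loss/∂weight)
```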
179 |
180 | **What you'll see:**
181 | ```
182 | Training linear regression model...
183 | Iteration: 10000 | Loss: Tensor Float [] 5.0225e9
184 | Iteration: 20000 | Loss: Tensor Float [] 4.9093e9
185 | Iteration: 30000 | Loss: Tensor Float [] 4.8576e9
186 | Iteration: 40000 | Loss: Tensor Float [] 4.8333e9
187 | Iteration: 50000 | Loss: Tensor Float [] 4.8217e9
188 | Iteration: 60000 | Loss: Tensor Float [] 4.8160e9
189 | Iteration: 70000 | Loss: Tensor Float [] 4.8130e9
190 | Iteration: 80000 | Loss: Tensor Float [] 4.8114e9
191 | Iteration: 90000 | Loss: Tensor Float [] 4.8105e9
192 | Iteration: 100000 | Loss: Tensor Float [] 4.8099e9
193 | ```
194 |
195 | ### Step 7: Making Predictions
196 |
197 | ```haskell
198 | let predictions =
199 | D.insertUnboxedVector
200 | "predicted_house_value"
201 | (asValue @(VU.Vector Float) (model trained features))
202 | df
203 | print $ D.select ["median_house_value", "predicted_house_value"] predictions
204 | ```
205 |
206 | **The grand finale!** We're:
207 | 1. Using our trained model to predict all the house values
208 | 2. Converting the tensor back to a vector
209 | 3. Adding it as a new column in our original DataFrame
210 | 4. Printing a comparison of real vs. predicted values
211 |
212 | You'll see something like:
213 | ```
214 | -------------------------------------------
215 | median_house_value | predicted_house_value
216 | --------------------|----------------------
217 | Double | Float
218 | --------------------|----------------------
219 | 452600.0 | 414079.94
220 | 358500.0 | 423011.94
221 | 352100.0 | 383239.06
222 | 341300.0 | 324928.94
223 | 342200.0 | 256934.23
224 | 269700.0 | 264944.84
225 | 299200.0 | 259094.13
226 | 241400.0 | 257224.55
227 | 226700.0 | 201753.69
228 | 261100.0 | 268698.7
229 | ...
230 | ```
231 |
232 | ## Key Concepts We Learned
233 |
234 | **DataFrame Operations:**
235 | - `|>` - Pipeline operator (read left to right!)
236 | - `readCsv` - Load data from CSV files
237 | - `impute` - Fill in missing values
238 | - `derive` - Create new columns from existing ones
239 | - `filterJust` - Remove rows with missing values
240 | - `select` / `exclude` - Choose which columns to keep
241 |
242 | **Hasktorch:**
243 | - `toTensor` - Convert DataFrames to tensors
244 | - `Linear` - Linear regression layer
245 | - `mseLoss` - Mean Squared Error loss function
246 | - `runStep` with `GD` - Gradient descent optimization
247 | - `sample` - Initialize model parameters
248 |
249 | **Machine Learning Flow:**
250 | 1. **Load Data** → Get it into your program
251 | 2. **Clean & Transform** → Handle missing values, normalize
252 | 3. **Feature Engineering** → Create useful new features
253 | 4. **Train** → Iteratively improve the model
254 | 5. **Predict** → Use the trained model on data
255 |
256 | ## Try It Yourself!
257 |
258 | **Experiment ideas:**
259 | - Change the learning rate (0.1) to see how it affects training
260 | - Add more derived features (like income per person)
261 | - Try different numbers of iterations
262 | - Use different normalization strategies
263 |
264 | ## The advantages of this approach
265 |
266 | 1. **Type Safety**: DataFrame's type system catches most errors at compile time
267 | 2. **Functional Style**: Pure functions and pipelines make data transformations clear
268 | 3. **Performance**: Hasktorch uses PyTorch's battle-tested backend
269 | 4. **Readability**: The `|>` operator makes data pipelines read like stories
270 |
271 | ## Next Steps
272 |
273 | Now that you've mastered the basics:
274 | - Try different models (polynomial regression, neural networks)
275 | - Experiment with more complex feature engineering
276 | - Learn about train/test splits and model validation
277 | - Explore Hasktorch's neural network modules
278 |
279 | ## Get involved
280 | Want to help contribute to data science in Haskell?
281 |
--------------------------------------------------------------------------------
/_posts/2025-11-09-roadmap.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: page
3 | title: "Roadmap"
4 | category: community
5 | date: 2025-11-09 13:31:54
6 | ---
7 |
8 | # DataHaskell Roadmap 2026-2027
9 |
10 | **Version**: 1.0
11 | **Date**: November 2025
12 | **Coordinators**: DataHaskell Community
13 | **Key Partners**: dataframe, Hasktorch, distributed-process
14 |
15 | ---
16 |
17 | ## Executive Summary
18 |
19 | This roadmap outlines the strategic direction for building a complete, production-ready Haskell data science ecosystem. With three major pillars already in active development—**dataframe** (data manipulation), **Hasktorch** (deep learning), and **distributed-process** (distributed computing)—we are positioned to create a cohesive platform that rivals Python, R, and Julia for data science workloads.
20 |
21 | ### Vision
22 | By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.
23 |
24 | ### Core Principles
25 | 1. **Interoperability**: Seamless integration between ecosystem components
26 | 2. **Performance**: Match or exceed Python/R performance benchmarks
27 | 3. **Ergonomics**: Intuitive APIs that lower the barrier to entry
28 | 4. **Production Ready**: Focus on reliability, monitoring, and deployment
29 | 5. **Type Safety**: Leverage Haskell's type system (where possible) to catch errors at compile time
30 |
31 | ---
32 |
33 | ## Current State Assessment
34 |
35 | ### Strengths
36 | - **dataframe**: Modern dataframe library with IHaskell integration
37 | - **Hasktorch**: Mature deep learning library with PyTorch backend and GPU support
38 | - **distributed-process**: Battle-tested distributed computing framework
39 | - **IHaskell**: A Haskell kernel for Jupyter notebooks.
40 | - Strong functional programming foundations
41 | - Excellent parallelism and concurrency primitives
42 |
43 | ### Gaps to Address
44 | - Small pool of maintainers and contributors
45 | - Fragmented visualization ecosystem
46 | - Limited data I/O format support
47 | - Incomplete documentation and tutorials
48 | - Sparse integration examples between major libraries
49 | - Limited model deployment tooling
50 |
51 | ### Critical Needs
52 | - Unified onboarding experience
53 | - Comprehensive benchmarking against Python/R
54 | - Production deployment patterns
55 | - Enterprise adoption case studies
56 |
57 | ---
58 |
59 | ## Strategic Pillars
60 |
61 | ## Pillar 1: Core Data Infrastructure
62 |
63 | ### Phase 1 (Q1-Q2 2026) - Foundation
64 | **Owner**: dataframe team
65 |
66 | **Goals**:
67 | - Complete dataframe v1 release (March 2026)
68 | - Establish dataframe as the standard tabular data library
69 | - Performance parity with Pandas/Polars for common operations
70 |
71 | **Deliverables**:
72 | 1. **dataframe v0.1.0**
73 | - SQL-like API finalized
74 | - IHaskell integration complete
75 | - Type-safe column operations
76 | - Comprehensive test suite
77 | - Apache Arrow integration
78 |
79 | 2. **File Format Support**
80 | - CSV/TSV (existing)
81 | - Parquet (high priority)
82 | - Arrow IPC format
83 | - Excel (xlsx)
84 | - JSON (nested structures)
85 | - HDF5 (coordination with scientific computing)
86 |
87 | 3. **Performance Benchmarks**
88 | - Public benchmark suite comparing to:
89 | - Pandas
90 | - Polars
91 | - dplyr/tidyverse
92 | - Focus areas: filtering, grouping, joining, aggregations
93 | - Document optimization strategies
94 |
95 | ### Phase 2 (Q3-Q4 2026) - Expansion
96 | **Owner**: dataframe + community
97 |
98 | **Goals**:
99 | - Advanced data manipulation features
100 | - Computing on files larger than memory
101 | - Integration with Cloud database systems
102 |
103 | **Deliverables**:
104 | 1. **Advanced Operations**
105 | - Window functions
106 | - Rolling aggregations
107 | - Pivot/unpivot operations
108 | - Complex joins (anti, semi)
109 | - Reshaping operations (melt, cast)
110 |
111 | 2. **Cloud database Connectivity**
112 | - Read files from AWS/GCP/Azure
113 | - PostgreSQL integration
114 | - SQLite support
115 | - Query pushdown optimization
116 | - Streaming query results
117 |
118 | ---
119 |
120 | ## Pillar 2: Statistical Computing & Visualization
121 |
122 | ### Phase 1 (Q2-Q3 2026) - Statistics Core
123 | **Owner**: Community (needs maintainer)
124 |
125 | **Goals**:
126 | - Create a unified machine learning library on top of Hasktorch and Statistics
127 | - Create unified plotting API
128 |
129 | **Deliverables**:
130 | 1. **statistics**
131 | - Extend hypothesis testing (t-test, ANOVA)
132 | - Simple regression models (linear and logistic)
133 | - Generalized linear models (GLM)
134 | - Survival analysis basics
135 | - Integration with dataframe
136 |
137 | 2. **Plotting & Visualization**
138 | - **Option A**: Extend hvega (Vega-Lite) with dataframe integration
139 | - **Option B**: Create native plotting library with backends
140 | - Priority features:
141 | - Scatter plots, line plots, bar charts
142 | - Histograms and distributions
143 | - Heatmaps and correlation plots
144 | - Interactive plots for notebooks
145 | - Export to PNG, SVG, PDF
146 |
147 | ### Phase 2 (Q4 2026 - Q1 2027) - Advanced Analytics
148 | **Owner**: Community
149 |
150 | **Deliverables**:
151 | 1. **Advanced Statistical Methods**
152 | - Mixed effects models
153 | - Time series analysis (ARIMA, state space models)
154 | - Bayesian inference (integration with existing libraries)
155 | - Causal inference methods
156 | - Spatial statistics
157 |
158 | 2. **Visualization Expansion**
159 | - Grammar of graphics implementation
160 | - Geographic/mapping support
161 | - Network visualization
162 | - 3D plotting capabilities
163 |
164 | ---
165 |
166 | ## Pillar 3: Machine Learning & Deep Learning
167 |
168 | ### Phase 1 (Q1-Q2 2026) - Integration
169 | **Owners**: Hasktorch + dataframe teams
170 |
171 | **Goals**:
172 | - Improve dataframe → tensor pipeline
173 | - Example-driven documentation
174 |
175 | **Deliverables**:
176 | 1. **dataframe ↔ Hasktorch Bridge**
177 | - Zero-copy conversion where possible
178 | - Automatic type mapping
179 | - GPU memory management
180 | - Batch loading utilities
181 |
182 | 2. **ML Workflow Examples with new unified library**
183 | - End-to-end classification (Iris, MNIST)
184 | - Regression examples (California Housing)
185 | - Time series forecasting
186 | - NLP pipeline (text classification)
187 | - Computer vision (image classification)
188 |
189 | 3. **Data Preprocessing**
190 | - Feature scaling/normalization
191 | - One-hot encoding
192 | - Missing value imputation
193 | - Train/test splitting
194 | - Cross-validation utilities
195 |
196 | ### Phase 2 (Q3-Q4 2026) - Classical ML
197 | **Owner**: Community (coordinate with Hasktorch)
198 |
199 | **Goals**:
200 | - Fill gap between dataframe and deep learning
201 | - Provide scikit-learn equivalent
202 |
203 | **Deliverables**:
204 | 1. **haskell-ml-toolkit** (new library)
205 | - Decision trees and random forests
206 | - Gradient boosting (XGBoost integration or native)
207 | - Support Vector Machines
208 | - K-means and hierarchical clustering
209 | - Dimensionality reduction (PCA, t-SNE, UMAP)
210 | - Model evaluation metrics
211 | - Hyperparameter optimization
212 |
213 | 2. **Feature Engineering**
214 | - Automatic feature generation
215 | - Feature selection methods
216 | - Polynomial features
217 | - Text feature extraction
218 |
219 | ### Phase 3 (Q1-Q2 2027) - Model Management
220 | **Owners**: Hasktorch + community
221 |
222 | **Deliverables**:
223 | 1. **Model Serialization & Versioning**
224 | - Standard model format
225 | - Version tracking
226 | - Metadata storage
227 | - Model registry concept
228 |
229 | 2. **Model Deployment**
230 | - REST API server templates
231 | - Batch prediction utilities
232 | - Model monitoring hooks
233 | - ONNX export for interoperability
234 |
235 | ---
236 |
237 | ## Pillar 4: Distributed & Parallel Computing
238 |
239 | ### Phase 1 (Q2-Q3 2026) - Core Integration
240 | **Owners**: distributed-process + dataframe teams
241 |
242 | **Goals**:
243 | - Enable distributed data processing
244 | - Provide MapReduce-style operations
245 |
246 | **Deliverables**:
247 | 1. **Distributed DataFrame Operations**
248 | - Distributed CSV/Parquet reading
249 | - Parallel groupby and aggregations
250 | - Distributed joins
251 | - Shuffle operations
252 | - Fault tolerance mechanisms
253 |
254 | 2. **distributed-ml** (new library)
255 | - Distributed model training
256 | - Parameter servers
257 | - Data parallelism primitives
258 | - Model parallelism support
259 | - Integration with Hasktorch
260 |
261 | 3. **Examples & Patterns**
262 | - Multi-node data processing
263 | - Distributed hyperparameter search
264 | - Large-scale model training
265 | - Stream processing patterns
266 |
267 | ### Phase 2 (Q4 2026 - Q1 2027) - Production Features
268 | **Owner**: distributed-process team
269 |
270 | **Deliverables**:
271 | 1. **Cluster Management**
272 | - Node discovery and registration
273 | - Health monitoring
274 | - Resource allocation
275 | - Job scheduling
276 |
277 | 2. **Cloud Integration**
278 | - AWS backend
279 | - Google Cloud backend
280 | - Kubernetes deployment patterns
281 | - Docker containerization templates
282 |
283 | ---
284 |
285 | ## Pillar 5: Developer Experience
286 |
287 | ### Phase 1 (Q1-Q2 2026) - Documentation Blitz
288 | **Owner**: All maintainers + community
289 |
290 | **Goals**:
291 | - Lower barrier to entry
292 | - Comprehensive learning path
293 |
294 | **Deliverables**:
295 | 1. **DataHaskell Website Revamp**
296 | - Clear getting started guide
297 | - Library comparison matrix
298 | - Migration guides (from Python, R)
299 | - Success stories
300 |
301 | 2. **Tutorial Series**
302 | - Installation and setup (all platforms)
303 | - Your first data analysis
304 | - DataFrames deep dive
305 | - Machine learning workflow
306 | - Distributed computing basics
307 | - Production deployment
308 |
309 | 3. **Notebook Gallery**
310 | - 20+ example notebooks covering:
311 | - Data cleaning and exploration
312 | - Statistical analysis
313 | - ML model building
314 | - Visualization
315 | - Domain-specific examples (finance, biology, etc.)
316 |
317 | ### Phase 2 (Q3-Q4 2026) - Tooling
318 | **Owner**: Community
319 |
320 | **Deliverables**:
321 | 1. **datahaskell-cli** (new tool)
322 | - Project scaffolding
323 | - Dependency management presets
324 | - Environment setup automation
325 | - Example project templates
326 |
327 | 2. **IDE Support Improvements**
328 | - VS Code IHaskell support with the DataHaskell stack working out of the box
329 | - HLS integration guides
330 | - Debugging workflows
331 | - IHaskell kernel improvements
332 |
333 | 3. **Testing & CI Templates**
334 | - Property-based testing examples
335 | - Benchmark suites
336 | - GitHub Actions templates
337 | - Continuous deployment patterns
338 |
339 | ---
340 |
341 | ## Pillar 6: Community & Ecosystem
342 |
343 | ### Ongoing Initiatives
344 |
345 | **Goals**:
346 | - Grow contributor base
347 | - Foster collaboration
348 | - Drive adoption
349 |
350 | **Deliverables**:
351 | 1. **Community Building**
352 | - Monthly community calls (starting Q1 2026)
353 | - Discord/Slack workspace
354 | - Quarterly virtual conferences
355 | - Mentorship program
356 |
357 | 2. **Contribution Framework**
358 | - Good first issues across all projects
359 | - Contribution guidelines
360 | - Code review standards
361 | - Recognition program
362 |
363 | 3. **Outreach**
364 | - Blog post series
365 | - Conference talks (Haskell Symposium, ZuriHac, etc.)
366 | - Academic collaborations
367 | - Industry partnerships
368 |
369 | 4. **Package Standards**
370 | - Naming conventions
371 | - API design guidelines
372 | - Documentation requirements
373 | - Testing standards
374 | - Version compatibility matrix
375 |
376 | ---
377 |
378 | ## Success Metrics
379 |
380 | ### Q2 2026
381 | - [ ] dataframe v1 released
382 | - [ ] 3 complete end-to-end tutorials published
383 | - [ ] Performance benchmarks showing ≥70% of Pandas speed
384 | - [ ] 5 integration examples between major libraries
385 |
386 | ### Q4 2026
387 | - [ ] 10,000+ total library downloads/month across ecosystem
388 | - [ ] 5+ active contributors
389 | - [ ] Performance parity (≥90%) with Pandas for common operations
390 | - [ ] Complete ML workflow from data to deployment documented
391 |
392 | ### Q2 2027
393 | - [ ] 2+ companies using DataHaskell
394 | - [ ] DataHaskell track at major Haskell conference
395 | - [ ] 3+ published case studies
396 | - [ ] Comprehensive distributed computing examples
397 |
398 | ### Q4 2027
399 | - [ ] Feature completeness with Python's core data science stack
400 | - [ ] 5+ production ML systems case studies
401 | - [ ] Enterprise support offerings available
402 |
403 | ---
404 |
405 | ## Resource Requirements
406 |
407 | ### Maintainer Coordination
408 | - **Monthly sync**: All pillar leads (1 hour)
409 | - **Quarterly planning**: Full maintainer group (2 hours)
410 |
411 | ### Funding Needs (Optional but Helpful)
412 | 1. **Infrastructure**
413 | - Benchmark server (GPU-enabled)
414 | - CI/CD resources
415 | - Documentation hosting
416 |
417 | 2. **Developer Support**
418 | - Part-time technical writer
419 | - Maintainer stipends or grants
420 | - Summer of Haskell projects
421 |
422 | 3. **Events**
423 | - Quarterly virtual meetups
424 | - Annual in-person hackathon
425 | - Conference sponsorships
426 |
427 | ---
428 |
429 | ## Risk Mitigation
430 |
431 | ### Technical Risks
432 |
433 | | Risk | Mitigation |
434 | |------|-----------|
435 | | Performance doesn't match Python | Early benchmarking, profiling, and optimization sprints |
436 | | Integration complexity | Defined interfaces, versioning strategy, compatibility tests |
437 | | Breaking changes in dependencies | Conservative version bounds, testing matrix |
438 |
439 | ### Community Risks
440 |
441 | | Risk | Mitigation |
442 | |------|-----------|
443 | | Maintainer burnout | Distributed ownership, recognition program, funding support |
444 | | Fragmentation | Regular coordination, shared roadmap, integration testing |
445 | | Slow adoption | Marketing efforts, case studies, migration guides |
446 |
447 | ### Ecosystem Risks
448 |
449 | | Risk | Mitigation |
450 | |------|-----------|
451 | | GHC changes break libraries | Test against multiple GHC versions, engage with GHC team |
452 | | Competing projects | Focus on collaboration, clear differentiation |
453 | | Limited contributor pool | Mentorship, good documentation, welcoming community |
454 |
455 | ---
456 |
457 | ## Decision Framework
458 |
459 | ### When to add new libraries
460 | **Criteria**:
461 | 1. Fills clear gap in ecosystem
462 | 2. Has committed maintainer
463 | 3. Integrates with existing components
464 | 4. Follows API design guidelines
465 | 5. Includes comprehensive tests and docs
466 |
467 | ### When to deprecate/consolidate
468 | **Criteria**:
469 | 1. Unmaintained for >6 months
470 | 2. Better alternative exists
471 | 3. Creates confusion in ecosystem
472 |
473 | ### Version Compatibility Policy
474 | - Support last 2 major GHC versions
475 | - Versioning per the Haskell Package Versioning Policy (PVP)
476 | - Deprecation warnings for 2 releases before removal
477 | - Compatibility matrix published on website
478 |
479 | ---
480 |
481 | ## Communication Plan
482 |
483 | ### Internal (Maintainers)
484 | - **Discord channel**: Daily async communication
485 | - **GitHub Discussions**: Technical decisions, RFCs
486 | - **Monthly video call**: Roadmap progress, blockers
487 | - **Quarterly planning session**: Next phase priorities
488 |
489 | ### External (Community)
490 | - **Blog**: Monthly progress updates
491 | - **Twitter/Social**: Weekly highlights
492 | - **Haskell Discourse**: Major announcements
493 | - **Newsletter**: Quarterly ecosystem update
494 | - **Documentation**: Always up-to-date
495 |
496 | ---
497 |
498 | ## How to Use This Roadmap
499 |
500 | This is a **living document**. We will:
501 | - Review quarterly and adjust priorities
502 | - Track progress in GitHub projects
503 | - Celebrate milestones publicly
504 | - Adapt based on community feedback
505 |
506 | **Questions?** Open a discussion on GitHub or join our community calls.
507 |
508 | ---
509 |
510 | *Let's build the future of data science in Haskell together! 🚀*
511 |
--------------------------------------------------------------------------------