└── README.md

/README.md:
--------------------------------------------------------------------------------

# 🔖 AI Data Scientist Handbook 2026
This handbook aims to cut through the noise and curate the AI tools, workflows, and resources that can actually help data scientists accelerate their growth in the age of AI.

Rather than covering general tools and resources or trying to provide an exhaustive list, the aim is to surface the resources that are most relevant to data scientists right now.

If you want to read more about the motivation behind this project, see the [About](#about) section.

## Table of Contents
- [Tools](#tools)
  - [Modern BI & Analytics Tools](#modern-bi--analytics-tools)
  - [Conversational Analytics Tools](#conversational-analytics-tools)
  - [Miscellaneous Tools](#miscellaneous-tools)
- [Foundation Models](#foundation-models)
- [Learning Resources](#learning-resources)
  - [Newsletters](#newsletters)
  - [Courses](#courses)
  - [YouTube Channels](#youtube-channels)
- [Conferences](#conferences)
- [About](#about)


## 🛠️ Tools

### Modern BI & Analytics Tools

This section highlights **modern, forward-leaning BI and analytics tools** that go beyond traditional dashboarding.
The focus is on tools that emphasize semantic layers, metrics-as-code, search-driven analytics, notebooks, and tighter integration with modern data and AI workflows.

Well-known legacy BI platforms (Looker, Power BI, Qlik, Tableau) are intentionally excluded to keep this list **high-signal** and oriented toward how analytics is evolving rather than how it has historically been done.

| Tool | Category | What It's Good At | Why It Matters for DS |
|------|----------|-------------------|------------------------|
| [Omni](https://www.omni.co/) | Semantic BI / Metrics Layer | SQL-first BI with strong semantic modeling | Bridges analytics and engineering workflows |
| [Steep](https://www.steep.app/) | Metrics & Analytics | Lightweight, modern metrics exploration | Faster iteration than traditional BI |
| [Lightdash](https://www.lightdash.com/) | Open-source BI | dbt-native BI with metrics-as-code | Fits modern analytics stacks |
| [Evidence](https://evidence.dev/) | Analytics Reporting | Markdown + SQL driven reports | BI as code, versionable insights |
| [Hex](https://hex.tech/) | Notebook BI | Notebooks, dashboards, collaboration; AI-powered conversational analytics | Analyst-to-stakeholder friendly with self-serve AI features |
| [Mode](https://mode.com/) | Analytics Platform | SQL + Python with reporting | Strong hybrid DS / BI workflows |
| [Preset](https://preset.io/) | Open-source BI | Managed Apache Superset | Scalable, customizable BI |
| [Metabase](https://www.metabase.com/) | Open-source BI | Simple querying and dashboards | Fast exploration with low friction |
| [ThoughtSpot](https://www.thoughtspot.com/) | Search-Driven Analytics | Natural language search with AI-assisted insights | Brings search-style analytics to large datasets |

#### Standalone Semantic Layer Tools

Dedicated semantic layer platforms that sit between your data warehouse and consumption tools (BI, AI, applications). They provide a single source of truth for metrics, dimensions, and business logic (decoupled from any specific BI tool).
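
To make the "single source of truth" idea concrete, here is a minimal Python sketch of a metric that is defined once and compiled to SQL for any consumer. It is only an illustration of the concept, not how any of the semantic layer tools in this section are implemented. The `Metric` class, the `METRICS` registry, the `compile_query` function, and the `orders` table are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A metric defined once, independent of any BI tool (hypothetical schema)."""
    name: str
    sql: str    # aggregate expression, e.g. "SUM(amount)"
    table: str  # source table in the warehouse (assumed to exist)


# Single source of truth: every consumer reads metric logic from this registry.
METRICS = {
    "revenue": Metric(name="revenue", sql="SUM(amount)", table="orders"),
    "order_count": Metric(name="order_count", sql="COUNT(*)", table="orders"),
}


def compile_query(metric_name: str, dimensions: list[str]) -> str:
    """Turn a metric request into a SQL string the warehouse could run."""
    metric = METRICS[metric_name]
    select_parts = [*dimensions, f"{metric.sql} AS {metric.name}"]
    query = f"SELECT {', '.join(select_parts)} FROM {metric.table}"
    if dimensions:
        query += f" GROUP BY {', '.join(dimensions)}"
    return query


# A dashboard and an AI agent request the same metric and get identical logic.
dashboard_sql = compile_query("revenue", ["region"])
agent_sql = compile_query("revenue", ["region"])
print(dashboard_sql)
```

Because the aggregation lives in one place, changing the definition of `revenue` changes it for every downstream consumer at once, which is the property the dedicated tools provide at much larger scale (with caching, access control, and multiple query APIs).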

| Tool | Type | What It's Good At | Why It Matters |
|------|------|-------------------|----------------|
| [Cube](https://cube.dev/) | Open-source / Cloud | Headless semantic layer with REST, GraphQL, MDX, and SQL APIs; caching and pre-aggregations | API-first, works with any BI tool or AI agent; strong for embedded analytics |
| [AtScale](https://www.atscale.com/) | Enterprise | Universal semantic layer with MDX/DAX support; integrates with Power BI, Tableau, Excel | Enterprise-grade; bridges legacy BI tools with modern cloud warehouses |
| [dbt Semantic Layer (MetricFlow)](https://docs.getdbt.com/docs/build/about-metricflow) | Open-source / Cloud | Metrics-as-code defined alongside dbt models; integrates with dbt Cloud | Tight integration with dbt workflows; single place for transforms + metrics |
| [Malloy](https://www.malloydata.dev/) | Open-source | Semantic modeling language from Google; compiles to SQL for BigQuery, Snowflake, Postgres, etc. | Lightweight, expressive; good for teams wanting code-first semantic models |

*If you want to know more about the semantic layer, check out [this article](https://read.futureproofds.com/p/semantic-layers-and-the-future-of-agentic-analytics).*

### Conversational Analytics Tools

Tools that let you talk to your data via conversational AI, natural-language querying, or AI-assisted analytics, but that don't quite fit into the modern BI category.

| Tool | What it does |
| --- | --- |
| [PandasAI](https://pandas-ai.com/) | AI-driven interface for Python dataframes; ask questions and get insights and visuals. |
| [Julius AI](https://julius.ai/) | Conversational AI analyst for data insights and charts (supports spreadsheets/CSV/sheet imports). |
| [Zerve AI](https://www.zerve.ai/) | Conversational interface for querying and exploring data. |
| [DataGPT](https://datagpt.com/) | Conversational AI data analyst that generates insights and deep analysis from business data. |
| [FineChatBI](https://www.fanruan.com/en/finechatbi) | Conversational analytics tool to ask questions and build dashboards and visualizations. |
| [Vanna AI](https://vanna.ai/) | Natural-language chat interface for querying SQL databases; generates SQL and charts/summaries. |
| [Powerdrill (Chat with Database)](https://powerdrill.ai/features/chat-with-database) | Chat-based analytics interface for asking questions and analyzing data without writing SQL. |
| [Wren AI](https://www.getwren.ai/) | Natural-language interface for querying and interacting with data sources. |

### Miscellaneous Tools

This section includes **tools built specifically with data scientists in mind** that don’t fit into the other categories but are still highly relevant to modern, AI-native data science workflows.

| Tool | What it’s for | Why it belongs here |
|------|---------------|---------------------|
| [Google Agent Development Kit (ADK)](https://cloud.google.com/vertex-ai/docs/agent-builder/overview) | Framework for building structured, tool-using agents | Designed for analytical and reasoning-heavy workflows, not just chatbots |
| [MCP Toolbox for Databases](https://github.com/modelcontextprotocol/servers) | Standardized way to connect agents to databases | Directly addresses a core DS need: safe, structured access to data sources |
| [Metaflow](https://metaflow.org/) | DS-first workflow and experiment framework | Built to let data scientists move from notebooks to production without heavy infrastructure |
| [cleanlab](https://cleanlab.ai/) | Data quality and label issue detection | Focuses on a uniquely DS problem: silent data and label errors that hurt model performance |


## 🤖 Foundation Models
This section lists foundation models (open-source and commercial) that aim to solve core data science problems, including models for tabular data, time series, recommendations, and multimodal analysis.

Use this section as a starting point for exploring foundation models and their capabilities.

| Model | Domain | Organization | Access Type | Primary Use Case |
| --- | --- | --- | --- | --- |
| [TimeGPT](https://www.nixtla.io/) | Time series | Nixtla | API / Open | Forecasting and anomaly detection |
| [TimesFM](https://github.com/google-research/timesfm) | Time series | Google | Open | Zero-shot forecasting |
| [Chronos](https://github.com/amazon-science/chronos-forecasting) | Time series | Amazon AWS | Open | General forecasting |
| [Moirai](https://github.com/SalesforceAIResearch/uni2ts) | Time series | Salesforce | Open | Multi-domain forecasting |
| [Toto](https://github.com/DataDog/toto) | Observability | Datadog | Open | High-cardinality forecasting |
| [MOMENT](https://github.com/moment-timeseries-foundation-model/moment) | Time series | CMU | Open | Multi-task (forecasting, anomaly, etc.) |
| [Granite TTM-R2](https://github.com/ibm-granite/granite-tsfm) | Time series | IBM | Open | Sequential prediction |
| [TabPFN](https://github.com/PriorLabs/TabPFN) | Tabular | Prior Labs | Open | Classification and regression |
| [TableGPT2](https://huggingface.co/tablegpt/TableGPT2-7B) | Tabular / NLP | Zhejiang Univ. | Open | Table question answering and code generation |
| [Netflix RecSys Model](https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39) | Recommendations | Netflix | Proprietary | Personalization at scale |
| [Spotify 2T-HGNN](https://research.atspotify.com/publications/personalized-audiobook-recommendations-at-spotify-through-graph-neural-networks) | Recommendations | Spotify | Proprietary | Cross-modal recommendations |

*If you want to know more about foundation models for data science, check out [this article](https://read.futureproofds.com/p/what-are-foundation-models-and-why-data-scientists-should-care).*


## 📚 Learning Resources

This section lists **learning resources** that go beyond generic theory and either align with *AI-native data workflows* or *applied data science with modern AI tools*.

### Newsletters
- [Future Proof Data Science](https://read.futureproofds.com/): A weekly newsletter for data scientists who want to stay relevant and grow their careers in the age of AI (and beyond).
- [Jam with AI](https://jamwithai.substack.com/): A newsletter inspired by real-world AI/ML events and projects.
- [To Data & Beyond](https://youssefh.substack.com/): A newsletter for mastering data science and AI beyond the basics.
- [Daily Dose of Data Science](https://blog.dailydoseofds.com/): A free newsletter for continuous learning about data science and ML, lesser-known techniques, and how to apply them in 2 minutes.
- [Neural Pulse](https://neuralpulse.io/subscribe): A 5-minute, human-curated newsletter delivering the best in AI, ML, and data science (twice a week).


### Courses

| Resource | What It Covers | Why It Belongs Here |
|----------|----------------|---------------------|
| [AI Workflows Bootcamp](https://futureproofds.com/) | A cohort-based program that helps data scientists master AI workflows and automation to 10× productivity, stay relevant, and accelerate their careers. | Built for data scientists, by data scientists. |
| [DeepLearning.AI Courses](https://www.deeplearning.ai/courses/) | AI/ML foundations and applied developer workflows | Useful for DS learners who need *conceptual grounding* alongside applied workflows. |
| [Building AI Agents and Agentic Workflows Specialization (Coursera)](https://www.coursera.org/specializations/building-ai-agents-and-agentic-workflows) | Building and orchestrating agent-based AI systems (LangChain, LangGraph, tool calling) | Focuses on *agentic workflows* that map directly to DS productivity scenarios. |
| [Introduction to LangGraph (LangChain Academy)](https://academy.langchain.com/courses/intro-to-langgraph) | Building stateful, multi-actor agents with LangGraph | Hands-on course for building agentic workflows directly applicable to DS automation. |


### YouTube Channels

- [Data Neighbor Podcast](https://www.youtube.com/@dataneighborpodcast): Hosted by industry veterans Hai Guan, Sravya Madipalli, and Shane Butler, covering data science careers, AI trends, and professional growth.
- [AI Engineer](https://www.youtube.com/@aiDotEngineer): Official channel from the AI Engineer conference/community, featuring talks on AI engineering, agents, and applied AI development.

*Feel free to reach out to me if you have any suggestions for channels that should be added to this list!*


## 🏆 Conferences

### United States

| Conference | Date | Location | Details |
|------------|------|----------|----------|
| [ODSC AI East 2026](https://odsc.ai/east/) | April 28-29, 2026 | Boston, MA | Various tracks including ML, NLP, MLOps, and Data Visualization. 250+ speakers. |
| [IBM Think 2026](https://www.ibm.com/events/think/) | May 4-7, 2026 | Boston, MA | Focuses on AI productivity, trusted data, scalable AI architectures, and cost optimization. |
| [Machine Learning Week 2026](https://machinelearningweek.com/) | May 5-6, 2026 | San Francisco, CA | Focuses on making AI products robust and deployment-worthy. |
| [The Data Science Conference](https://www.thedatascienceconference.com/) | May 28-29, 2026 | Chicago, IL | Vendor-free, sponsor-free, and recruiter-free conference for data science professionals. |
| [Data + AI Summit 2026](https://www.databricks.com/dataaisummit) | June 15-18, 2026 | San Francisco, CA | Hosted by Databricks. Includes discussions, networking, and hands-on training. |
| [AI Engineer World's Fair 2026](https://www.ai.engineer/worldsfair) | June 30 - July 2, 2026 | San Francisco, CA | Largest technical AI conference with 20 tracks, 250 speakers, and 6,000+ attendees. |
| [The AI Conference 2026](https://aiconference.com/) | Sept 30 - Oct 1, 2026 | San Francisco, CA | Vendor-neutral event by the creators of MLconf. Features AI research, engineering, and applied ML. |
| [ODSC AI West 2026](https://odsc.ai/west/) | Oct 27-29, 2026 | Burlingame, CA | Focuses on AI and data science with workshops, hands-on training, and strategic insights. |

### Europe

| Conference | Date | Location | Details |
|------------|------|----------|----------|
| [World AI Cannes Festival 2026](https://www.worldaicannes.com/) | Feb 12-13, 2026 | Cannes, France | Focuses on AI, ML, and data science. Features AI technologies and global innovators. |
| [AI Engineer Europe 2026](https://www.ai.engineer/europe) | April 8-10, 2026 | London, UK | First official AI Engineer Europe event. Large multitrack technical AI conference for 1,000+ AI engineers. |
| [Data Innovation Summit 2026](https://datainnovationsummit.com/) | May 6-8, 2026 | Stockholm, Sweden | Covers data governance, data literacy, and machine learning, with speakers from major companies. |
| [DATA 2026](https://data.scitevents.org/) | July 16-18, 2026 | Porto, Portugal | International conference on data science, technology, and applications. |
| [ECML PKDD 2026](https://ecmlpkdd.org/2026/) | Sept 7-11, 2026 | Naples, Italy | Premier European conference on machine learning and knowledge discovery in databases. |

# Contributing

If you want to add to the repository or you find any issues, feel free to open a PR. Please make sure each addition is placed in the relevant section or category.

# About
This repo exists because data science is entering a new phase.

AI tools are no longer “nice to have” side experiments. They are becoming part of how we actually do the work, from analysis and exploration to production workflows.

As demand grows for data scientists who understand how to integrate AI into their existing workflows, the signal-to-noise ratio is getting worse. There are endless tools, ideas, and opinions, many of them generic or borrowed from other fields.

The goal of this repo is to cut through that noise. It’s a curated set of resources that are either built specifically for data scientists or closely aligned with how we already work. Not an exhaustive list, and not a guide, just a focused snapshot of what’s worth paying attention to right now.

# FAQs
1. **How is curation done?** Curation is based on thorough research, recommendations from people I trust, my 7+ years of experience as a Data Scientist, and extensive work integrating AI into data science workflows.
2. **Are all resources free?** Most resources here will be free, but I will also include paid alternatives if they are truly valuable to your career development.
3. **How often is the repository updated?** I plan to come back here as often as possible to check that all resources are still available and relevant, and to add new ones.

If you have questions or feedback, send me a message on [LinkedIn](https://www.linkedin.com/in/andresvourakis/). Enjoy!

--------------------------------------------------------------------------------