├── legal.md ├── linux.md ├── serverless.md ├── social-media.md ├── courses.md ├── politics.md ├── secure-messaging.md ├── github.md ├── housing.md ├── hardware.md ├── enterprise.md ├── billing.md ├── note-taking.md ├── science.md ├── speaking.md ├── recruiting.md ├── interesting_research_papers.md ├── iot.md ├── management.md ├── osint.md ├── python-things.md ├── spatio-temporal.md ├── datacenter.md ├── cncf_k8s.md ├── quotes.md ├── books.md ├── money.md ├── EU_IT.md ├── communication.md ├── scala-tools.md ├── flipper.md ├── financials.md ├── leadership.md ├── health.md ├── ecommerce.md ├── dancing └── zouk.md ├── ethics.md ├── design.md ├── web.md ├── cloud-architecture.md ├── office.md ├── events.md ├── git_foo.md ├── data-platform.md ├── academic-writing.md ├── research-software-engineering.md ├── teaching.md ├── crypto_finance.md ├── networking.md ├── crypto └── defi.md ├── security-llm.md ├── data-engineering-streaming.md ├── marketing.md ├── knowledge-management.md ├── data_sources.md ├── technical-documentation.md ├── hiring_interview.md ├── development.md ├── E2E_operations.md ├── decison-science.md ├── data-governance.md ├── presentations.md ├── graphs.md ├── opsec.md ├── nlp-dl-getting-started.md ├── finance.md ├── data-management-dbt.md ├── geospatial.md ├── java-matrix.md ├── devops.md ├── owning-infrastructure.md ├── ai_tools.md ├── keeping-up-to-date.md ├── web_scraping.md ├── security.md ├── frequently_needed_stuff.md ├── streaming.md ├── startup_company.md ├── rust.md ├── data-science.md ├── real-world-dataflow-problems.md ├── llm.md ├── README.md ├── great_articles.md └── data-management.md /legal.md: -------------------------------------------------------------------------------- 1 | # legal 2 | 3 | ## NDA 4 | 5 | - https://waypointnda.com/ 6 | -------------------------------------------------------------------------------- /linux.md: -------------------------------------------------------------------------------- 1 | - use 
https://github.com/robbyrussell/oh-my-zsh as shell 2 | -------------------------------------------------------------------------------- /serverless.md: -------------------------------------------------------------------------------- 1 | # Serverless 2 | 3 | ## meta layer 4 | - https://serverless.com 5 | -------------------------------------------------------------------------------- /social-media.md: -------------------------------------------------------------------------------- 1 | # social media 2 | 3 | ## tools 4 | 5 | - https://swat.io/de/ 6 | -------------------------------------------------------------------------------- /courses.md: -------------------------------------------------------------------------------- 1 | # list of great courses 2 | 3 | ## deeplearning 4 | - http://course.fast.ai 5 | -------------------------------------------------------------------------------- /politics.md: -------------------------------------------------------------------------------- 1 | # politics 2 | 3 | ## transparency 4 | 5 | - https://abstimmung.eu/git/2025 6 | -------------------------------------------------------------------------------- /secure-messaging.md: -------------------------------------------------------------------------------- 1 | # secure messaging 2 | 3 | - https://briarproject.org/ 4 | - matrix 5 | -------------------------------------------------------------------------------- /github.md: -------------------------------------------------------------------------------- 1 | # github 2 | 3 | ## stars 4 | 5 | - https://github.com/m-ahmed-elbeskeri/Starguard 6 | -------------------------------------------------------------------------------- /housing.md: -------------------------------------------------------------------------------- 1 | # Housing 2 | 3 | ## ecological house building 4 | 5 | - https://www.baufritz.com 6 | -------------------------------------------------------------------------------- /hardware.md: 
-------------------------------------------------------------------------------- 1 | # Hardware 2 | 3 | ## hardware description language 4 | 5 | - https://spade-lang.org/ 6 | -------------------------------------------------------------------------------- /enterprise.md: -------------------------------------------------------------------------------- 1 | # Enterprise 2 | 3 | - https://churchofturing.github.io/the-enterprise-experience.html 4 | -------------------------------------------------------------------------------- /billing.md: -------------------------------------------------------------------------------- 1 | # billing 2 | 3 | ## standards 4 | 5 | ### EU 6 | 7 | - https://de.wikipedia.org/wiki/ZUGFeRD 8 | -------------------------------------------------------------------------------- /note-taking.md: -------------------------------------------------------------------------------- 1 | # Notes 2 | 3 | ## 2nd brain tools 4 | 5 | - https://joplinapp.org/ 6 | - https://obsidian.md/ 7 | - https://logseq.com/ 8 | -------------------------------------------------------------------------------- /science.md: -------------------------------------------------------------------------------- 1 | # Science 2 | 3 | ## statistics 4 | ### p-values 5 | - https://lucklab.ucdavis.edu/blog/2018/4/19/why-i-lost-faith-in-p-values 6 | -------------------------------------------------------------------------------- /speaking.md: -------------------------------------------------------------------------------- 1 | # speaking 2 | 3 | ## public speaking 4 | 5 | - https://open.spotify.com/episode/0LSCotQAW5OQeTkggFJlAU?si=yXY3SMfLR6WDMcUT7cISyg 6 | -------------------------------------------------------------------------------- /recruiting.md: -------------------------------------------------------------------------------- 1 | # recruiting 2 | # self-preparation 3 | 4 | ## CV 5 | 6 | - https://github.com/jankapunkt/latexcv 7 | - https://github.com/rendercv/rendercv 8 | 
-------------------------------------------------------------------------------- /interesting_research_papers.md: -------------------------------------------------------------------------------- 1 | # interesting research papers 2 | 3 | ## deeplearning 4 | ### embeddings 5 | - categorical handling https://arxiv.org/pdf/1604.06737.pdf 6 | -------------------------------------------------------------------------------- /iot.md: -------------------------------------------------------------------------------- 1 | # IOT 2 | 3 | ## distributed data analytics at the edge 4 | 5 | - http://edgent.apache.org 6 | 7 | ## platform 8 | 9 | - https://www.eclipse.org/ditto/ 10 | -------------------------------------------------------------------------------- /management.md: -------------------------------------------------------------------------------- 1 | # management 2 | 3 | ## guides 4 | 5 | ### Google 6 | 7 | - https://rework.withgoogle.com/guides/managers-develop-and-support-managers/steps/introduction/ 8 | -------------------------------------------------------------------------------- /osint.md: -------------------------------------------------------------------------------- 1 | # OSINT 2 | 3 | ## fact checking 4 | 5 | - http://verificationhandbook.com/ 6 | 7 | ## identities 8 | 9 | ### accounts 10 | 11 | - https://sherlockproject.xyz/ -------------------------------------------------------------------------------- /python-things.md: -------------------------------------------------------------------------------- 1 | # python things 2 | 3 | ## easy environment 4 | 5 | - jupyter 6 | - streamlit 7 | - chainlit 8 | 9 | ## sharing code 10 | 11 | - https://py.cafe/ 12 | -------------------------------------------------------------------------------- /spatio-temporal.md: -------------------------------------------------------------------------------- 1 | # geospatial time series 2 | 3 | ## regular 4 | - R 5 | - sf 6 | ## big 7 | - pyspark 8 | - 
https://github.com/sabman/PySparkGeoAnalysis/blob/master/slides.md 9 | -------------------------------------------------------------------------------- /datacenter.md: -------------------------------------------------------------------------------- 1 | # datacenter 2 | 3 | ## power 4 | 5 | - https://blog.cloudflare.com/an-introduction-to-three-phase-power-and-pdus/ 6 | - https://blog.railway.com/p/data-center-build-part-one -------------------------------------------------------------------------------- /cncf_k8s.md: -------------------------------------------------------------------------------- 1 | # CNCF 2 | 3 | ## overview 4 | - landscape https://landscape.cncf.io do not get intimidated 5 | 6 | ## k8s 7 | 8 | ### validation 9 | - https://github.com/criticalhop/kubectl-val 10 | -------------------------------------------------------------------------------- /quotes.md: -------------------------------------------------------------------------------- 1 | # great quotes 2 | 3 | 4 | - RFC1925 https://www.youtube.com/watch?v=5GvfPGcdhqI 5 | - Perfection has been reached not when there is nothing left to add, but when there is nothing left to take away 6 | -------------------------------------------------------------------------------- /books.md: -------------------------------------------------------------------------------- 1 | # books 2 | 3 | ## military 4 | 5 | - Carl von Clausewitz - Vom Kriege 6 | - https://de.wikipedia.org/wiki/Carl_von_Clausewitz 7 | - https://www.clausewitz-gesellschaft.de/wp-content/uploads/2014/12/VomKriege-a4.pdf -------------------------------------------------------------------------------- /money.md: -------------------------------------------------------------------------------- 1 | # money 2 | 3 | ## investing can make sense 4 | - https://msperlin.github.io/2018-05-12-Investing-Long-Run/ 5 | 6 | 7 | ## trading 8 | 9 | ### zero commission 10 | 11 | - https://gitlab.com/aenegri/toro 12 | 
-------------------------------------------------------------------------------- /EU_IT.md: -------------------------------------------------------------------------------- 1 | # EU IT services 2 | 3 | ## non-US IT service providers 4 | 5 | - https://misskey.de/notes/a3fzyvpre8 6 | 7 | ## EU cloud initiative 8 | 9 | - https://berthub.eu/articles/posts/now-how-to-get-that-european-cloud/ 10 | -------------------------------------------------------------------------------- /communication.md: -------------------------------------------------------------------------------- 1 | ## communication 2 | 3 | 4 | ### high performing technical teams 5 | 6 | - https://www.youtube.com/watch?v=BsYIvi3Sae8 7 | 8 | ### writing and leadership 9 | 10 | - https://www.youtube.com/watch?v=aFwVf5a3pZM 11 | -------------------------------------------------------------------------------- /scala-tools.md: -------------------------------------------------------------------------------- 1 | # great scala tooling 2 | **refactoring & linting** 3 | - https://scalacenter.github.io/scalafix/ 4 | 5 | **sbt** 6 | - https://docs.google.com/presentation/d/15gmGLiD4Eiyx8e5Kg2eXge5wD80DF6lLG8hSqc6ohmo/edit#slide=id.ge9b0b7c3_153 7 | -------------------------------------------------------------------------------- /flipper.md: -------------------------------------------------------------------------------- 1 | # flipper zero 2 | 3 | https://flipperzero.one/ 4 | 5 | ## firmwares 6 | 7 | - https://momentum-fw.dev/ 8 | 9 | ## collections 10 | 11 | - https://github.com/UberGuidoZ/Flipper-IRDB 12 | - https://github.com/UberGuidoZ/Flipper 13 | -------------------------------------------------------------------------------- /financials.md: -------------------------------------------------------------------------------- 1 | # financials 2 | 3 | > WARNING none of these links is to be considered investment advice 4 | 5 | ## green finance 6 | 7 | - https://www.faire-fonds.info/ 8 | 9 | 10 | ## budgeting 11 | 12 | ### 
plaintext 13 | 14 | - https://plainbudget.com/ 15 | -------------------------------------------------------------------------------- /leadership.md: -------------------------------------------------------------------------------- 1 | # leadership 2 | 3 | ## good times 4 | 5 | ### experts 6 | 7 | - https://idiallo.com/blog/how-to-lead-in-a-room-full-of-experts 8 | 9 | ## bad times 10 | ### technology 11 | 12 | - https://chaoticgood.management/how-to-be-a-leader-when-the-vibes-are-off/ 13 | -------------------------------------------------------------------------------- /health.md: -------------------------------------------------------------------------------- 1 | # Health 2 | 3 | ## pandemic 4 | 5 | ### covid19 6 | 7 | - https://www.nature.com/articles/s41586-020-2923-3 8 | - https://www.ecdc.europa.eu/en/news-events/forecasting-covid-19-cases-and-deaths-europe-new-hub 9 | - https://covid19forecasthub.org/ and https://viz.covid19forecasthub.org/ 10 | -------------------------------------------------------------------------------- /ecommerce.md: -------------------------------------------------------------------------------- 1 | # ecommerce 2 | 3 | ## shops 4 | 5 | - https://www.shopware.com/de/ 6 | - https://snipcart.com/blog/jamstack 7 | - https://fastspring.com/ 8 | 9 | 10 | ## forum / link aggregation 11 | 12 | - https://github.com/LemmyNet/lemmy 13 | 14 | 15 | ## analytics 16 | 17 | - https://plausible.io/ 18 | -------------------------------------------------------------------------------- /dancing/zouk.md: -------------------------------------------------------------------------------- 1 | # Zouk 2 | 3 | - https://www.youtube.com/watch?v=ITRCYkFMqfk 4 | - https://www.youtube.com/watch?v=7Ft4CgALkVI 5 | - https://www.youtube.com/watch?v=f_oqWEfVCwg 6 | - https://www.youtube.com/watch?v=2i_mvTy56wo 7 | - https://www.youtube.com/watch?v=dByCaQRO4ng 8 | - https://www.youtube.com/watch?v=iY9M34AIi4g 9 | 
-------------------------------------------------------------------------------- /ethics.md: -------------------------------------------------------------------------------- 1 | # ethics 2 | 3 | ## great quotes 4 | 5 | - Young engineers treat ethics as a speciality, something you don't really need to worry about; you just need to learn to code, change the world, disrupt something. They're like kids in a toy shop full of loaded AK-47's. https://twitter.com/yonatanzunger/status/975547532871790594 6 | -------------------------------------------------------------------------------- /design.md: -------------------------------------------------------------------------------- 1 | # design 2 | 3 | ## icon sets 4 | 5 | - https://illlustrations.co/ 6 | - https://feathericons.com/ 7 | - https://www.flaticon.com/ 8 | 9 | ## image generators 10 | 11 | - https://github.com/justjake/Gauss 12 | 13 | 14 | ## movie 15 | 16 | - https://www.pitivi.org/ 17 | 18 | 19 | ## web 20 | 21 | - https://webflow.com/ 22 | -------------------------------------------------------------------------------- /web.md: -------------------------------------------------------------------------------- 1 | # Web 2 | 3 | ## tracking 4 | 5 | - privacy friendly open source alternative to Google Analytics https://plausible.io/ 6 | 7 | 8 | ## development 9 | 10 | - https://github.com/sansyrox/robyn 11 | ### python 12 | 13 | rust interop 14 | - https://github.com/sparckles/Robyn 15 | 16 | 17 | ### cms 18 | 19 | - https://wagtail.org/ 20 | -------------------------------------------------------------------------------- /cloud-architecture.md: -------------------------------------------------------------------------------- 1 | # cloud architecture 2 | 3 | ## compliance 4 | 5 | - https://github.com/tmobile/pacbot/ 6 | 7 | ## secrets 8 | 9 | - https://github.com//tmobile/t-vault 10 | 11 | ## on-premise integration 12 | 13 | - https://github.com//tmobile/jazz 14 | 15 | 16 | ## self service infrastructure 17 | 18 | ### 
data 19 | 20 | - https://www.onyxia.sh/ 21 | -------------------------------------------------------------------------------- /office.md: -------------------------------------------------------------------------------- 1 | # Office & documentation stuff 2 | 3 | ## charting 4 | - https://mermaidjs.github.io 5 | 6 | ## physical 7 | ### room bookings 8 | 9 | - https://uniguest.com/de/beschilderung-der-reservierungsraume/ 10 | 11 | ## tooling 12 | 13 | ### video conferencing 14 | 15 | - https://meetling.de/ 16 | 17 | #### transcription 18 | 19 | - https://www.airgram.io/ 20 | -------------------------------------------------------------------------------- /events.md: -------------------------------------------------------------------------------- 1 | # organizing events 2 | 3 | ## conference 4 | 5 | - https://mini-conf.github.io 6 | - https://pretalx.com/p/about/ 7 | 8 | ## events 9 | 10 | - https://splashthat.com/ 11 | - https://github.com/aaronpk/Meetable 12 | 13 | time planning 14 | 15 | - https://crab.fit/ 16 | 17 | ## streaming 18 | 19 | - https://www.airmeet.com/ 20 | - https://streamyard.com/ 21 | -------------------------------------------------------------------------------- /git_foo.md: -------------------------------------------------------------------------------- 1 | # git foo 2 | 3 | ## utilities 4 | 5 | deleting local branches no longer present on server 6 | 7 | 8 | ``` 9 | git fetch -p && for branch in $(git for-each-ref --format '%(refname) %(upstream:track)' refs/heads | awk '$2 == "[gone]" {sub("refs/heads/", "", $1); print $1}'); do git branch -D $branch; done 10 | ``` 11 | 12 | https://stackoverflow.com/questions/7726949/remove-tracking-branches-no-longer-on-remote 13 | -------------------------------------------------------------------------------- /data-platform.md: -------------------------------------------------------------------------------- 1 | # data platform 2 | 3 | ## good content 4 | 5 | data platform engineering 6 | 7 | - 
https://davidsj.substack.com/p/data-platform-engineering?r=125hnz 8 | - https://georgheiler.com/event/magenta-data-architecture-25/ 9 | 10 | 11 | ## iam 12 | 13 | - https://opentdf.io/ 14 | - https://spiffe.io/ 15 | 16 | ## secrets 17 | 18 | - https://github.com/Infisical/infisical/blob/main/README.md 19 | - https://github.com/getsops/sops 20 | -------------------------------------------------------------------------------- /academic-writing.md: -------------------------------------------------------------------------------- 1 | # academic writing 2 | 3 | ## editing tips 4 | 5 | - sentence style 6 | - As a general comment to improve the writing, as this was the case in a couple of paragraphs: 7 | -) some paragraphs were of the form "We did A. This is because A allows us to B. We want to have B" Better writing is to say "We want B. Hence, to achieve B, we did A." This goes easier with readers. 8 | 9 | ## correction 10 | 11 | - https://languagetool.org 12 | -------------------------------------------------------------------------------- /research-software-engineering.md: -------------------------------------------------------------------------------- 1 | # Research Software Engineering (RSE) 2 | 3 | ## good links 4 | 5 | - finally DE realizes value of RSE a bit more https://gi.de/themen/beitrag/alles-ueber-research-software-engineering-1 6 | 7 | ## organizations 8 | 9 | - https://www.researchsoft.org/# 10 | - https://de-rse.org/de/index.html - https://lists.gi.de/postorius/lists/fg-rse.lists.gi.de/ 11 | 12 | ## Position papers about RSE 13 | 14 | - https://f1000research.com/articles/9-295/v2 15 | -------------------------------------------------------------------------------- /teaching.md: -------------------------------------------------------------------------------- 1 | # teaching 2 | 3 | ## programming 101 4 | 5 | - python 101 by examples of automating daily tasks https://automatetheboringstuff.com/2e/ 6 | 7 | ## data science 8 | 9 | - 
https://e2eml.school/blog.html 10 | - averages https://www.thestar.com/news/insight/2016/01/16/when-us-air-force-discovered-the-flaw-of-averages.html 11 | 12 | 13 | ## moocs 14 | - https://www.freecodecamp.org/news/here-are-380-ivy-league-courses-you-can-take-online-right-now-for-free-9b3ffcbd7b8c/ 15 | -------------------------------------------------------------------------------- /crypto_finance.md: -------------------------------------------------------------------------------- 1 | # crypto finance 2 | 3 | ## value of crypto/btc 4 | 5 | - https://digitalik.net/btc/ 6 | 7 | 8 | ## interesting platforms 9 | 10 | - https://ledn.io 11 | 12 | ## anonymous crypto coins 13 | 14 | - https://samouraiwallet.com/ 15 | - https://global.strike.me/ 16 | - https://wasabiwallet.io/ 17 | 18 | 19 | ## identifying crypto laundering 20 | 21 | - https://www.youtube.com/watch?v=1uNneo9L_jU 22 | 23 | 24 | ## privacy enhancing machinery 25 | 26 | - https://ashigaru.rs/ 27 | -------------------------------------------------------------------------------- /networking.md: -------------------------------------------------------------------------------- 1 | # networking (technical) 2 | 3 | ## hardware 4 | 5 | - https://shop.hak5.org/products/wifi-pineapple 6 | 7 | ## news 8 | 9 | - China distorted GPS to circles (https://www.schneier.com/blog/archives/2019/11/gps_manipulatio.html) 10 | - 5G PFA (antenna array) 11 | - software defined radio 12 | 13 | ## resiliency 14 | 15 | - https://bowshock.nl/irc/ 16 | - https://ripe90.ripe.net/archives/video/1582/ 17 | - https://berthub.eu/articles/posts/cyber-security-pre-war-reality-check/ -------------------------------------------------------------------------------- /crypto/defi.md: -------------------------------------------------------------------------------- 1 | # crypto 2 | 3 | ## DEFI 4 | 5 | ### XRD 6 | 7 | based on: https://node.staatenlos.ch/radix-xrd-exrd-kaufen-staken-anleitung 8 | 9 | - 
https://staatenlos.ch/staatenlos/radix-staatenlos-node/ 10 | - buy from any exchange $ or € into BTC or XRP 11 | - transfer to bitfinex 12 | - buy XRD 13 | - move to wallet (example: https://www.youtube.com/watch?v=C1-V9th3Bas) 14 | - destination address 15 | - tag 16 | - staking: https://www.youtube.com/watch?v=Q3uyVDofgA4 17 | - wallet backup securely (crypto phrases & wallet)! 18 | -------------------------------------------------------------------------------- /security-llm.md: -------------------------------------------------------------------------------- 1 | # LLM security 2 | 3 | ## scanning 4 | 5 | - https://github.com/leondz/garak 6 | 7 | ## ablation 8 | 9 | - https://huggingface.co/blog/mlabonne/abliteration 10 | 11 | ## llm validation 12 | 13 | - https://www.promptfoo.dev/ 14 | 15 | 16 | ### prompt injection 17 | 18 | - https://arstechnica.com/information-technology/2025/04/researchers-claim-breakthrough-in-fight-against-ais-frustrating-security-hole/ 19 | 20 | ## red teaming 21 | 22 | ### post exploitation 23 | 24 | - https://github.com/mbrg/power-pwn 25 | -------------------------------------------------------------------------------- /data-engineering-streaming.md: -------------------------------------------------------------------------------- 1 | # streaming data engineering 2 | 3 | ## engines 4 | 5 | ### feldera 6 | 7 | - https://github.com/feldera/feldera 8 | - https://www.youtube.com/watch?v=cn1Yaxwl6x8 9 | 10 | ### fluvio 11 | 12 | - https://github.com/infinyon/fluvio 13 | 14 | ### duckdb 15 | 16 | - https://github.com/turbolytics/sql-flow 17 | 18 | ### ark stream 19 | 20 | - https://github.com/chenquan/arkflow 21 | 22 | 23 | ## observability 24 | 25 | ### kafka 26 | 27 | - https://github.com/MAIF/yozefu 28 | 29 | 30 | ## benchmarks 31 | 32 | - https://github.com/timescale/rtabench 33 | -------------------------------------------------------------------------------- /marketing.md: 
-------------------------------------------------------------------------------- 1 | # Marketing tools 2 | 3 | ## website themes 4 | - https://thrivethemes.com/themes/ 5 | 6 | ## e-mail signatures 7 | 8 | - fancy gmail signatures in HTML https://followyoureyes.de/blog/email-client-signatur-online-erstellen 9 | 10 | 11 | ## writing 12 | 13 | - grammarly 14 | - https://quillbot.com/ 15 | - https://instatext.io 16 | - https://hemingwayapp.com/ 17 | 18 | 19 | ## logo checks 20 | 21 | - https://euipo.europa.eu/eSearch 22 | 23 | ## analytics 24 | 25 | - https://plausible.io/ 26 | 27 | 28 | ## content curation 29 | 30 | - https://buffer.com/resources/content-curation-tools/ 31 | -------------------------------------------------------------------------------- /knowledge-management.md: -------------------------------------------------------------------------------- 1 | # knowledge management 2 | 3 | ## self 4 | 5 | - https://obsidian.md/ 6 | 7 | 8 | ## semantics 9 | 10 | drawing graphs 11 | - https://data.world/integrations/grafo 12 | 13 | relation to graph 14 | - https://www.w3.org/TR/r2rml/ 15 | 16 | 17 | reference ontologies 18 | 19 | - https://www.semanticarts.com/gist/ 20 | - https://basic-formal-ontology.org/ 21 | 22 | representations 23 | 24 | - https://linkml.io/ 25 | 26 | databases 27 | - https://www.oxfordsemantic.tech/rdfox 28 | 29 | 30 | communities 31 | 32 | - https://datatransform.co/ 33 | 34 | data products 35 | 36 | - https://ekgf.github.io/dprod/ 37 | -------------------------------------------------------------------------------- /data_sources.md: -------------------------------------------------------------------------------- 1 | # interesting data sources 2 | 3 | obviously GDELT is a massive and really interesting source of data. 
4 | 5 | ## OGD data management tools/frameworks 6 | 7 | - https://ckan.org/ 8 | 9 | ### company 10 | - https://github.com/Lukhers-dev/firmenbuch-HVD/tree/main 11 | 12 | ## geospatial 13 | - http://www.geonames.org 14 | - https://github.com/kuwala-io/kuwala (geospatial popular times as osm + google maps api) 15 | - https://dataforgood.facebook.com/dfg/docs/methodology-high-resolution-population-density-maps 16 | - https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/population-density 17 | - https://github.com/giswqs/geospatial-data-catalogs 18 | -------------------------------------------------------------------------------- /technical-documentation.md: -------------------------------------------------------------------------------- 1 | # technical documentation/ technical writing 2 | 3 | ## frameworks 4 | 5 | - https://quarto.org/ 6 | - https://rmarkdown.rstudio.com/ 7 | - https://jupyter.org/ notebook based 8 | - https://jupyterbook.org/en/stable/start/publish.html 9 | - https://github.com/outerbounds/nbdoc 10 | - https://de.wikipedia.org/wiki/Markdown 11 | - https://asciidoctor.org/ 12 | - https://de.wikipedia.org/wiki/LaTeX 13 | - https://www.sphinx-doc.org/en/master/ 14 | 15 | doc tools 16 | - https://www.gitbook.com 17 | - https://docusaurus.io/ 18 | 19 | ### converters 20 | 21 | - https://pandoc.org/ 22 | 23 | ### DSL 24 | 25 | - https://rhai.rs/ (but more generic scripting) 26 | -------------------------------------------------------------------------------- /hiring_interview.md: -------------------------------------------------------------------------------- 1 | # hiring & interview tips 2 | 3 | ## blogs 4 | 5 | - https://www.zainrizvi.io/blog/the-interviewing-advice-no-one-shares/ 6 | 7 | 8 | ## IT 9 | 10 | - https://pythonawesome.com/a-command-line-tool-for-memorizing-algorithms-in-python/ 11 | 12 | ## AI 13 | 14 | - https://www.heise.de/hintergrund/Bewerbervorauswahl-per-Algorithmus-Wie-man-eine-KI-von-sich-ueberzeugt-6159470.html 15 | - 
https://www.jobscan.co 16 | - https://www.vmock.com/site 17 | - https://huyenchip.com/ml-interviews-book/ 18 | 19 | 20 | # salary negotiation 21 | 22 | - https://www.youtube.com/watch?v=KcyyKLSn5jQ 23 | - https://www.youtube.com/watch?v=6l2Frzq4Dp4 24 | - https://www.youtube.com/watch?v=KcyyKLSn5jQ 25 | 26 | 27 | # recruiting questions 28 | 29 | - https://github.com/Appsilon/awesome-interview-questions 30 | - https://github.com/DopplerHQ/awesome-interview-questions 31 | - https://towardsdatascience.com/red-flags-in-data-science-interviews-4f492bbed4c4 32 | -------------------------------------------------------------------------------- /development.md: -------------------------------------------------------------------------------- 1 | # Development 2 | 3 | ## books 4 | 5 | - site reliability, secure and reliable systems https://landing.google.com/sre/books/ really good & recommended 6 | 7 | ## distributed-systems 8 | 9 | ### chaos engineering 10 | 11 | - https://de.wikipedia.org/wiki/Cynefin-Framework 12 | - https://github.com/Netflix/vizceral 13 | 14 | 15 | ### streaming workflows 16 | 17 | - https://infinitic.io/ 18 | 19 | ## communication 20 | 21 | - https://github.com/nedbat/dinghy 22 | 23 | ## release management 24 | 25 | - https://github.com/release-drafter/release-drafter 26 | 27 | 28 | ## notifications 29 | 30 | - https://github.com/novuhq/novu 31 | 32 | ## dev tooling 33 | 34 | - https://trunk.io/ 35 | 36 | ### editors 37 | 38 | - https://zed.dev/ 39 | 40 | ## web requests 41 | 42 | - https://www.usebruno.com/ 43 | - 44 | 45 | 46 | ## Rust tips 47 | 48 | - https://blog.sdf.com/p/fast-development-in-rust-part-one 49 | - https://blog.sdf.com/p/fast-development-in-rust-part-2 50 | 51 | 52 | ## observability 53 | 54 | - devel 55 | -------------------------------------------------------------------------------- /E2E_operations.md: -------------------------------------------------------------------------------- 1 | # E2E operations 2 | - devOps 3 | - 
https://github.com/ansible/awx 4 | - dataOPs 5 | - https://github.com/selinon/selinon 6 | - https://airflow.apache.org 7 | - secOps 8 | - site reliability engineering https://landing.google.com/sre/book.html 9 | 10 | ## standardization & containers 11 | 12 | ### container formats 13 | - docker 14 | - rocket 15 | 16 | ### container orchestration & platforms 17 | - kubernetes 18 | - openshift 19 | 20 | 21 | ### CI CD tooling 22 | - spinnaker 23 | - https://github.com/GoogleCloudPlatform/skaffold 24 | - https://fabric8.io 25 | - https://argoproj.github.io/ 26 | 27 | **git native changes** 28 | - https://github.com/weaveworks/flux 29 | 30 | ## experience stories 31 | - https://blog.pragmaticengineer.com/operating-a-high-scale-distributed-system/ 32 | 33 | 34 | ## ML operations 35 | 36 | - https://ml-ops.org 37 | 38 | ## networks 39 | 40 | ### vpns 41 | 42 | - https://tailscale.com/ 43 | - https://www.firezone.dev/ 44 | 45 | 46 | # business 47 | 48 | ## HR 49 | 50 | - https://www.rippling.com/ 51 | -------------------------------------------------------------------------------- /decison-science.md: -------------------------------------------------------------------------------- 1 | # decision science 2 | 3 | ## causal inference 4 | 5 | - https://github.com/Microsoft/dowhy 6 | - https://github.com/py-why/ 7 | 8 | ### time series 9 | 10 | - http://google.github.io/CausalImpact/CausalImpact.html 11 | 12 | ## bayesian 13 | 14 | ### models 15 | 16 | - bayesian inferencing with NUTS by JAGS 17 | - https://github.com/pyro-ppl/numpyro 18 | 19 | ### examples 20 | 21 | - covid 22 | - https://www.youtube.com/watch?v=ZxR3mw-Znzc&pp=qAMBugMGCgJkZRAB and https://www.youtube.com/watch?v=_DCkJkMji0U 23 | 24 | #### tutorials 25 | 26 | - [regression and other stories](https://avehtari.github.io/ROS-Examples/) 27 | - [port to python/ pymc3 via bambi](https://github.com/bambinos/Bambi_resources) 28 | - [statistical rethinking](https://xcelab.net/rm/statistical-rethinking/) 29 | - [port to 
python/ pymc3 via bambi](https://github.com/bambinos/Bambi_resources) 30 | - [port to python using numpyro](https://github.com/fehiepsi/rethinking-numpyro) 31 | - http://eleafeit.com/teaching/ 32 | 33 | ### visualization 34 | 35 | - https://arviz-devs.github.io/arviz/ 36 | -------------------------------------------------------------------------------- /data-governance.md: -------------------------------------------------------------------------------- 1 | # Data governance 2 | 3 | - Apache Atlas 4 | - https://github.com/amerissa/atlas 5 | - https://gravitino.apache.org 6 | - https://github.com/magda-io/magda 7 | - https://open-metadata.org/ 8 | - https://github.com/odpi/egeria/blob/master/README.md 9 | - https://github.com/odpi/data-governance 10 | - https://github.com/ContinuumIO/intake 11 | - knowledge graph & tagging http://datacommons.org/ 12 | - https://datasette.readthedocs.io/en/stable/ 13 | 14 | ## governance in practice 15 | - automation of ranger policies and atlas tags https://github.com/SvenskaSpel/cobra-policytool 16 | 17 | ## standardized data models 18 | 19 | - https://cloudinformationmodel.org/ 20 | - SIEM https://www.elastic.co/guide/en/ecs/current/index.html 21 | 22 | 23 | ## metadata 24 | 25 | - https://www.querybook.org/ 26 | - https://github.com/mlcommons/croissant 27 | ### lineage 28 | - https://github.com/reata/sqllineage 29 | 30 | 31 | ## resource usage 32 | 33 | - spark telemetry 34 | - https://github.com/cerndb/SparkPlugins 35 | 36 | ## privacy 37 | 38 | - https://www.youtube.com/watch?v=EqQdaHSBbY8 39 | -------------------------------------------------------------------------------- /presentations.md: -------------------------------------------------------------------------------- 1 | # presentations 2 | 3 | ## toolkit 4 | 5 | - [logitech spotlight](https://www.logitech.com/de-at/product/spotlight-presentation-remote) 6 | - lets you zoom and highlight interactively and also works as a laser pointer in video presentations 7 | - 
[demopro](http://www.demoproapp.com/) 8 | - lets you draw on the screen and show a timer 9 | - dynamic dataset: https://drawdata.xyz/ 10 | - terminal presentation https://github.com/mfontanini/presenterm/?tab=readme-ov-file 11 | - https://github.com/BuoyantIO/demosh 12 | 13 | ## some presentations - mostly at meetups 14 | 15 | - machine learning model to production https://www.slideshare.net/GeorgHeiler/machine-learning-model-to-production 16 | - R and big data 17 | - R summer school https://docs.google.com/presentation/d/1VgocCul4ZyLVDqqr6iXsBRmmwsGhJkZj6-lzgdXs0ag/edit 18 | - meetup https://www.youtube.com/watch?v=CTy2NB_2Jyk and slides https://docs.google.com/presentation/d/1NHG7-WoEUsjrdxFjy01OmZjxWB-FZomhfrxO-QapzKg/edit 19 | - processing geospatial data https://docs.google.com/presentation/d/1wBMFVv-4iaVfbmH5GNynvSaTAMQtWpUoHuMNHP83Uys/edit?usp=sharing 20 | 21 | -------------------------------------------------------------------------------- /graphs.md: -------------------------------------------------------------------------------- 1 | # graphs and ontologies 2 | 3 | ## books 4 | 5 | - http://networksciencebook.com/ 6 | 7 | ## lectures 8 | 9 | - http://web.stanford.edu/class/cs224w/ 10 | 11 | ## databases 12 | - neo4j 13 | - https://grakn.ai 14 | - https://nebula-graph.io/en/ 15 | 16 | ### extensions 17 | 18 | - postgres 19 | - https://age.apache.org/ 20 | 21 | 22 | ## graph processing 23 | - tinkerpop 24 | - opencypher 25 | - sq 26 | - pregel 27 | - combined 28 | - https://github.com/SemyonSinchenko/ibisgraph 29 | 30 | ## [ml/ai/deep]-learning on graphs 31 | 32 | - https://www.stellargraph.io/ 33 | 34 | 35 | ### embeddings 36 | - https://github.com/facebookresearch/PyTorch-BigGraph 37 | 38 | 39 | ### graph deeplearning 40 | 41 | - overview 2022 https://towardsdatascience.com/graph-ml-in-2022-where-are-we-now-f7f8242599e0 42 | 43 | #### GNN 44 | 45 | - https://distill.pub/2021/gnn-intro/ 46 | 47 | ## network science 48 | 49 | ### graph filtering 50 | - 
https://en.wikipedia.org/wiki/Disparity_filter_algorithm_of_weighted_network 51 | 52 | 53 | ### temporal graphs 54 | 55 | - https://teneto.readthedocs.io/en/latest/tutorial/networkmeasures.html 56 | -------------------------------------------------------------------------------- /opsec.md: -------------------------------------------------------------------------------- 1 | # opsec 2 | 3 | ## videos/knowledge share 4 | 5 | - https://www.youtube.com/watch?v=717XfmnThDY https://talks.datenspuren.de/ds24/talk/CQBJL8/ 6 | - https://www.youtube.com/watch?v=R4cr2D_O5Gg https://talks.datenspuren.de/ds24/talk/J8WDNL/ 7 | - https://www.youtube.com/watch?v=FCXuUxgWP6o https://talks.datenspuren.de/ds24/talk/GNG8EG/ 8 | - https://www.youtube.com/watch?v=i0IfCfhOHN4 9 | 10 | ## communication 11 | 12 | - https://www.systemli.org/service/mail/ 13 | - https://riseup.net/ 14 | 15 | ### phones 16 | 17 | burner 18 | 19 | - https://rebeccawilliams.info/burner-phone-101/ and https://news.ycombinator.com/item?id=44967543 20 | - https://www.youtube.com/watch?v=i0IfCfhOHN4 21 | 22 | ## privacy preserving operating systems 23 | 24 | - https://grapheneos.org/ 25 | 26 | ## tools 27 | 28 | - https://www.notrace.how/ 29 | - https://antirepression.noblogs.org/ 30 | - https://antirepression.noblogs.org/129-broschuere/ 31 | 32 | ## legal toolkits 33 | 34 | - https://datenschmutz.de/ 35 | 36 | ## tracking examples 37 | 38 | - https://www.notrace.how/earsandeyes/ 39 | 40 | 41 | ## vpn 42 | 43 | evasion 44 | 45 | - https://news.ycombinator.com/item?id=45054260 46 | -------------------------------------------------------------------------------- /nlp-dl-getting-started.md: -------------------------------------------------------------------------------- 1 | basics: 2 | 3 | - BM25 4 | - https://umap-learn.readthedocs.io/en/latest/basic_usage.html 5 | - https://github.com/bmeaut/python_nlp_2018_spring/blob/master/course_material/14_Semantics_II/14_Semantics_2_lab.ipynb 6 | 7 | 8 | DL: 9 | - 
https://www.tensorflow.org/tutorials/text/word2vec 10 | - Embedding!!! 11 | - ** https://github.com/bmeaut/python_nlp_2020_fall/blob/master/labs/08_Deep_learning_nlp_lab.ipynb 12 | - corresponding lecture: https://github.com/bmeaut/python_nlp_2020_fall/blob/master/lectures/08_Deep_learning_nlp.ipynb 13 | - https://github.com/bmeaut/python_nlp_2020_fall/blob/master/lectures/09_Sequence_modeling.ipynb 14 | - fastai Text 15 | - https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb 16 | - https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb 17 | - stemming/lemmatization: removes a lot of content. Better to encode appropriately with custom tokens 18 | - https://github.com/huggingface/transformers/tree/master/examples/text-classification 19 | 20 | 21 | trees: 22 | 23 | - lightgbm 24 | - xgboost 25 | - catboost 26 | 27 | 28 | useful stuff: 29 | 30 | - https://allennlp.org/ (demo) 31 | - https://github.com/allenai/kb/issues/13 32 | - https://www.tensorflow.org/tutorials/text/classify_text_with_bert 33 | - https://github.com/fastai/fastbook 34 | -------------------------------------------------------------------------------- /finance.md: -------------------------------------------------------------------------------- 1 | # finance 2 | 3 | ## ETFs 4 | 5 | 6 | ### stocks 7 | 8 | - https://www.deltavalue.de/shiller-kgv-cape-ratio/ 9 | 10 | #### potentially interesting ones 11 | 12 | > NOTICE: this is no investment advice. You make your own choices. 
I might also not have invested in these - just keeping them here as a list of potentially interesting assets 13 | 14 | - green 15 | - https://www.finanzen.net/aktien/transalta_renewables-aktie 16 | - dividends 17 | - https://www.finanzen.net/aktien/realty-aktie (immo) 18 | 19 | ### dividends 20 | 21 | - https://www.justetf.com/de/how-to/dividend-etfs-world.html 22 | 23 | ## knowledge 24 | 25 | ### videos 26 | 27 | - https://www.youtube.com/c/Finanzfluss 28 | 29 | ### blogs 30 | 31 | - https://collabfund.com/blog/the-art-and-science-of-spending-money/ 32 | 33 | 34 | ## crypto coins 35 | 36 | ### wallets 37 | 38 | #### software 39 | 40 | - https://www.exodus.com/ 41 | 42 | #### hardware 43 | 44 | - https://trezor.io/ 45 | 46 | 47 | ## retirement 48 | 49 | - https://www.dasinvestment.com/finanzen-altersvorsorge-rentenluecke-berechnen-und-schliessen/ 50 | 51 | 52 | ## wealth management 53 | 54 | - https://ghostfol.io/en/start 55 | 56 | 57 | ## indicator and analysis 58 | 59 | - https://macrodash.streamlit.app/ 60 | 61 | 62 | ## pre-ipo investment 63 | 64 | - https://fundrise.com/Venture 65 | -------------------------------------------------------------------------------- /data-management-dbt.md: -------------------------------------------------------------------------------- 1 | # DBT 2 | 3 | useful DBT utilities 4 | 5 | ## validation 6 | 7 | - https://github.com/dbt-labs/dbt-project-evaluator 8 | - https://github.com/PicnicSupermarket/dbt-score 9 | 10 | ## data testing 11 | - normal testing 12 | - https://hub.getdbt.com/calogica/dbt_expectations/latest/ 13 | - anomaly detection 14 | - https://github.com/elementary-data/dbt-data-reliability 15 | - pre-commit 16 | - https://github.com/dbt-checkpoint/dbt-checkpoint 17 | 18 | ## CI/CD 19 | 20 | - https://medium.com/inthepipeline/youre-running-dbt-in-ci-now-what-f24c6717b9de 21 | - merge request templates 22 | - https://medium.com/inthepipeline/dbt-best-practices-in-action-at-cal-itps-data-infra-project-0d11adf5513d 23 | 
24 | ## SQL engine portability 25 | 26 | - https://github.com/tobymao/sqlglot 27 | 28 | ### multi-engine setups 29 | 30 | - https://github.com/borjavb/dbt-iceberg-poc 31 | 32 | ## schema changes 33 | 34 | - https://www.getdbt.com/coalesce-2021/surviving-schema-changes-with-automation 35 | 36 | ## DBT-cloud alternatives 37 | 38 | - https://github.com/nicholasyager/dbt-loom 39 | - https://dagster.io/ 40 | - dagster plus loom example https://github.com/cnolanminich/dbt-loom-example/tree/demo-with-dagster 41 | 42 | 43 | ## data contracts 44 | 45 | - https://github.com/bitol-io/open-data-contract-standard 46 | - http://datacontract.com/ 47 | - https://docs.getdbt.com/docs/collaborate/govern/about-model-governance 48 | 49 | ### contract implementations 50 | 51 | - https://www.youtube.com/watch?v=bVKuLcqv0kI&t=9s 52 | -------------------------------------------------------------------------------- /geospatial.md: -------------------------------------------------------------------------------- 1 | # geospatial tools 2 | 3 | ## databases 4 | 5 | - https://github.com/ULB-CoDE-WIT/MobilityDB 6 | 7 | ## spatial indices 8 | 9 | - postGIS 10 | - https://github.com/uber/h3 11 | - R trees 12 | - https://github.com/mourner/flatbush 13 | 14 | ## regression 15 | - https://github.com/r-spatial/spatialreg/ 16 | - https://pysal.org/ 17 | 18 | ## tutorials & books 19 | 20 | - https://github.com/jsignell/pydata_ann_arbor_2019 21 | - https://geocompr.robinlovelace.net/ 22 | - https://geographicdata.science/book/notebooks/05_choropleth.html 23 | - https://pythongis.org/index.html 24 | 25 | ## web visualization 26 | 27 | - gis 28 | - https://github.com/CartoDB/cartodb 29 | - https://plugins.qgis.org/plugins/qgis2web/ 30 | - https://github.com/qwc-services/qwc-docker and https://github.com/qgis/qwc2 31 | - QGIS + Geoserver 32 | - BI spatial breakdowns 33 | - https://www.metabase.com/learn/visualization/maps 34 | 35 | ## big geospatial analyses 36 | 37 | ### joins 38 | 39 | - 
https://www.youtube.com/watch?v=mrDi70FmCgk geomesa, geospark, STRTree 40 | - quad tree partitioning on disk to optimize spatial joins (john deere) https://www.youtube.com/watch?v=C1kYNsmJ7ho 41 | - H3 boundary handling https://www.youtube.com/watch?v=LP198QMdDbY 42 | 43 | ### predicate pushdown (data skipping) 44 | 45 | - https://www.youtube.com/watch?v=3D5WhCqfOo8 46 | 47 | ## spatial ML 48 | 49 | ### deeplearning 50 | 51 | - https://github.com/IBM/terratorch 52 | 53 | ## spatial analysis (fast) 54 | 55 | - https://github.com/allenai/atlantes 56 | -------------------------------------------------------------------------------- /java-matrix.md: -------------------------------------------------------------------------------- 1 | # java matrix tools 2 | 3 | ## wishlist 4 | 5 | - multiplication 6 | - addition 7 | - matrix decomposition 8 | - not only java based (option for BLAS) 9 | - vectorized (SIMD) 10 | - fast 11 | 12 | ## benchmarks: 13 | - https://github.com/optimatika/ojAlgo/wiki/Java-Matrix-Benchmark 14 | - https://java-matrix.org 15 | - https://www.reddit.com/r/scala/comments/5fcjo3/breeze_vs_nd4s/ 16 | 17 | ## available libraries 18 | - https://github.com/scalanlp/breeze 19 | - included into spark 20 | - netlib-java BLAS backend for LAPACK and netlib, which ideally are compiled with MKL support 21 | - https://nd4j.org 22 | - switching backend is transparent (BLAS, GPU) 23 | - commercially supported 24 | - http://ejml.org/wiki/index.php?title=Main_Page 25 | - no BLAS, pure java 26 | - free 27 | - 202 stars 28 | - apache 2.0 29 | - recommendation: nice, but only pure java. probably easy to get started with, not the most performant. 
a lot of algorithms implemented 30 | 31 | - https://github.com/optimatika/ojAlgo 32 | - no native code 33 | - 189 stars 34 | 35 | - https://scicomp.stackexchange.com/questions/2486/parallel-scientific-computation-software-development-language 36 | - intel MKL https://software.intel.com/en-us/articles/performance-tools-for-software-developers-how-do-i-use-intel-mkl-with-java/ 37 | - commercial licensing 38 | - https://ujmp.org 39 | - https://github.com/fommil/netlib-java/ 40 | - https://github.com/fommil/matrix-toolkits-java and https://github.com/andreas-solti/matrix-toolkits-java 41 | - https://github.com/hughperkins/jeigen 42 | - java wrapper around C library 43 | -------------------------------------------------------------------------------- /devops.md: -------------------------------------------------------------------------------- 1 | # devops / SSH 2 | 3 | https://www.youtube.com/watch?v=qvdlLTyUJ5I 4 | 5 | - creating a key: `ssh-keygen -t rsa -b 4096 -f ssh_host_rsa_key < /dev/null` 6 | 7 | 8 | # devops / ops 9 | 10 | ## identity management 11 | 12 | - https://spiffe.io/ 13 | 14 | ## secrets 15 | 16 | - https://github.com/Infisical/infisical/blob/main/README.md 17 | 18 | ## hosts 19 | 20 | ### status 21 | 22 | - https://www.openstatus.dev/ 23 | 24 | ### disks 25 | 26 | - `ncdu` disk usage fast with a GUI https://dev.yorhel.nl/ncdu 27 | 28 | 29 | ## osx 30 | 31 | ### kill 32 | 33 | - kill stuff running on port: `lsof -i tcp:3000` gives the offender - which then can be killed 34 | 35 | 36 | ## consistent environments 37 | 38 | - https://github.com/jetify-com/devbox 39 | - https://devenv.sh/ 40 | - https://devpod.sh/ 41 | - https://nixos.org/ 42 | 43 | ## hosting options 44 | 45 | - https://kiranet.org/self-hosting-like-its-2025/ 46 | 47 | ## AI support 48 | - https://tmuxai.dev/getting-started 49 | 50 | ## CI 51 | 52 | ### github actions 53 | 54 | local validation 55 | 56 | - https://github.com/bahdotsh/wrkflw 57 | 58 | 59 | ## containers 60 | 61 | ### sizing 
62 | - https://github.com/h33333333/xray 63 | 64 | 65 | ## load analysis 66 | 67 | - https://dseltzer.gitlab.io/sping/docs/ 68 | ### platform engineering 69 | 70 | - https://github.com/microsandbox/microsandbox 71 | 72 | ## tools 73 | 74 | ### text 75 | 76 | - https://www.lazyvim.org/ 77 | 78 | ### disk 79 | 80 | https://github.com/bootandy/dust 81 | 82 | ### search 83 | 84 | - https://github.com/facebook/pathpicker/ 85 | - https://github.com/BurntSushi/ripgrep 86 | 87 | ### tmux 88 | 89 | - https://github.com/gpakosz/.tmux 90 | - keys Ctrl B + 91 | - `-/_` to split horizontally/vertically 92 | - `m` for mouse mode 93 | -------------------------------------------------------------------------------- /owning-infrastructure.md: -------------------------------------------------------------------------------- 1 | # owning infrastructure - selfhosting 2 | 3 | ## IT 4 | 5 | - https://oneuptime.com/blog/post/2025-10-29-aws-to-bare-metal-two-years-later/view 6 | - https://pierce.dev/notes/go-ahead-self-host-postgres#user-content-fn-1 7 | - https://github.com/vitabaks/autobase 8 | - https://news.ycombinator.com/item?id=46336947 9 | - https://cloudnative-pg.io/ 10 | 11 | ### hosting 12 | #### edge hardware 13 | 14 | - https://www.raspberrypi.com/products/raspberry-pi-5/ 15 | - http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-6-Plus.html 16 | 17 | #### hosted 18 | - https://opalstack.com/ 19 | - https://www.hetzner.com 20 | 21 | ### database 22 | 23 | - https://autobase.tech/ 24 | ### dev platform 25 | 26 | - https://coolify.io/ 27 | ### observability 28 | 29 | - https://uptimekuma.org/ 30 | - https://github.com/aristocratos/btop 31 | - https://github.com/nicolargo/glances 32 | ### VCS 33 | 34 | - https://about.gitea.com/ 35 | - vcs workflow stacking 36 | - https://github.com/aviator-co/av for https://www.stacking.dev/ 37 | - https://github.com/jj-vcs/jj 38 | 39 | ### content creation 40 | 41 | - 
https://www.ssp.sh/blog/self-host-self-independence/ 42 | - https://www.ssp.sh/brain/quartz-publish-obsidian-vault/ 43 | - https://github.com/jackyzha0/quartz 44 | - analytics 45 | - **https://github.com/arp242/goatcounter** 46 | - https://github.com/vinceanalytics/vince 47 | - https://plausible.io/ 48 | - https://github.com/ownstats/ownstats 49 | - content 50 | - https://www.ssp.sh/brain/listmonk/ 51 | - https://listmonk.app/ 52 | - https://github.com/ping13/listmonk-rss 53 | - members 54 | - https://www.ssp.sh/brain/open-subscription-platforms/ 55 | - https://www.memberstack.com 56 | 57 | ### sync 58 | - https://syncthing.net 59 | ### documents 60 | 61 | - https://github.com/paperless-ngx/paperless-ngx 62 | 63 | #### photos 64 | - https://www.photoprism.app/ 65 | 66 | #### audio 67 | - https://www.audiobookshelf.org/ 68 | 69 | 70 | #### e-book 71 | - https://calibre-ebook.com 72 | 73 | 74 | ### OS 75 | 76 | containers 77 | - https://www.talos.dev/ 78 | 79 | ### data 80 | 81 | - https://github.com/bigdatarepublic/open-data-platform?tab=readme-ov-file 82 | - https://github.com/l-mds/local-data-stack/ 83 | 84 | ### communication 85 | 86 | - matrix 87 | - https://github.com/spantaleev/matrix-docker-ansible-deploy 88 | ## home 89 | 90 | ### social media 91 | 92 | - https://timelinize.com/ 93 | 94 | 95 | ### learning 96 | 97 | - spaced repetition 98 | - https://borretti.me/article/hashcards-plain-text-spaced-repetition 99 | -------------------------------------------------------------------------------- /ai_tools.md: -------------------------------------------------------------------------------- 1 | # AI tools 2 | 3 | https://www.linkedin.com/posts/ghimiresunil_chatgpt-activity-7021128204388057088-0jf5/?utm_source=share&utm_medium=member_android 4 | 5 | 1. Krisp: Krisp's AI removes background voices, noises, and echo from your calls, giving you peace of call 6 | Link: https://krisp.ai/ 7 | - https://www.opus.pro/ 8 | - https://invideo.io/studio 9 | 10 | 3. 
Beatoven: Create unique royalty-free music that elevates your story 11 | Link: https://www.beatoven.ai/ 12 | 13 | 4. Cleanvoice: Automatically edit your podcast episodes 14 | Link: https://cleanvoice.ai/ 15 | 16 | 5. Podcastle: Studio quality recording, right from your computer 17 | Link: https://podcastle.ai/ 18 | 19 | 6. Flair: Design branded content in a flash 20 | Link: https://flair.ai/ 21 | 22 | 7. Illustroke: Create killer vector images from text prompts 23 | Link: https://illustroke.com/ 24 | 25 | 8. Patterned: Generate the exact patterns you need for and design 26 | Link: https://www.patterned.ai/ 27 | 28 | 9. Stockimg: Generate the perfect stock photo you need, every time 29 | Link: https://stockimg.ai/ 30 | 31 | 10. Copy: AI Generated copy, that actually increases conversion 32 | Link: https://www.copy.ai/ 33 | 34 | 11. CopyMonkey: Create Amazon listings in seconds 35 | Link: http://copymonkey.ai/ 36 | 37 | 12. Ocoya: Create and schedule social media content 10x faster 38 | Link: https://www.ocoya.com/ 39 | 40 | 13. Unbounce Smart Copy: Write high-performing cold emails at scale 41 | Link: https://unbounce.com/ 42 | 43 | 14. Vidyo: Make short-form vids from long-form content in just a few clicks 44 | Link: https://vidyo.ai/ 45 | 46 | 15. Maverick: Generate personalized videos at scale 47 | Link: https://lnkd.in/dmrkz_ah 48 | 49 | 16. Quickchat: AI chatbots that automate customer service chats 50 | Link: https://www.quickchat.ai/ 51 | 52 | 17. Puzzle: Build an AI-powered knowledge base for your team and customers 53 | Link: https://www.puzzlelabs.ai/ 54 | 55 | 18. Soundraw: Stop searching for the song you need. Create it. 56 | Link: https://soundraw.io/ 57 | 58 | 19. Cleanup: Remove any unwanted object, defect, people, or text from your pictures in seconds 59 | Link: https://cleanup.pictures/ 60 | 61 | 20. Resumeworded: Improve your resume and LinkedIn profile 62 | Link: https://lnkd.in/d9EurcnX 63 | 64 | 21. 
Looka: Design your own beautiful brand 65 | Link: https://looka.com/ 66 | 67 | 22. theresanaiforthat: Comprehensive database of AIs available for every task 68 | Link: https://lnkd.in/dKhqaaF3 69 | 70 | 23. Synthesia: Create AI videos by simply typing in text. 71 | Link: https://www.synthesia.io/ 72 | 73 | 24. descript: New way to make video and podcasts 74 | Link: https://lnkd.in/d_Kdj35E 75 | 76 | 25. Otter: Capture and share insights from your meetings 77 | Link: https://otter.ai/ 78 | 79 | 26. Inkforall: AI content (Generation, Optimization, Performance)  80 | Link: https://inkforall.com/ 81 | 82 | 27. Thundercontent: Generate Content with AI 83 | Link: https://lnkd.in/djFxMZsZ 84 | 85 | 86 | ## coding 87 | 88 | - https://aider.chat/ 89 | -------------------------------------------------------------------------------- /keeping-up-to-date.md: -------------------------------------------------------------------------------- 1 | # keeping up to date 2 | 3 | ## passive 4 | 5 | follow blogs in your community 6 | 7 | I can recommend: 8 | 9 | - https://blog.fefe.de/ 10 | - https://news.ycombinator.com/ 11 | - https://www.reddit.com/ 12 | - https://www.reddit.com/r/dataengineering/ 13 | - https://www.reddit.com/r/mlscaling/ 14 | - https://www.reddit.com/r/LocalLLaMA/ 15 | - https://github.com/geoHeil/awesome-tools 16 | - https://www.heise.de/security 17 | - https://www.youtube.com 18 | - https://www.youtube.com/@mediacccde/videos 19 | - https://www.youtube.com/@PyDataTV 20 | - https://www.youtube.com/@SciPy-Conf/videos 21 | - https://www.youtube.com/@AmazonScience/videos 22 | - https://www.youtube.com/@dagsterio/videos 23 | - https://www.youtube.com/@duckdb/videos 24 | - https://www.youtube.com/@EuroSciPy/videos 25 | - https://www.youtube.com/@feldera_inc/videos 26 | - https://www.youtube.com/@H2Oai/videos 27 | - https://www.youtube.com/@MicrosoftResearch/videos 28 | - https://www.youtube.com/@neo4j/videos 29 | - https://www.youtube.com/@prefix_dev/videos 30 | - 
https://www.youtube.com/@opengeospatial/videos 31 | - https://www.youtube.com/@stan3394/videos 32 | - https://www.youtube.com/@streamnative/videos 33 | - https://www.youtube.com/@ScyllaDB/videos 34 | - security 35 | - https://www.youtube.com/@BlackHatOfficialYT/videos 36 | - https://www.youtube.com/@bsidesbudapest/videos 37 | - https://www.youtube.com/@BSidesCapeTown/videos 38 | - https://www.youtube.com/@bsidesberlin9228/videos 39 | - https://www.youtube.com/@bsideswarsaw189/videos 40 | - https://www.youtube.com/@bsidesleeds1246/videos 41 | - rust 42 | - https://www.youtube.com/@eurorust/videos 43 | - https://www.youtube.com/@RustVideos/videos 44 | - https://this-week-in-rust.org/ 45 | 46 | social media (if you follow the right people) 47 | 48 | - https://bsky.app/ 49 | - https://x.com/home 50 | 51 | Good starter packs https://blueskydirectory.com/starter-packs 52 | 53 | 54 | - join your local community: go to https://www.meetup.com or https://lu.ma 55 | 56 | some interesting podcasts 57 | 58 | - https://logbuch-netzpolitik.de/ 59 | - https://open.spotify.com/show/5oAImDY5o4HzekRGNNw2r0 60 | - https://www.dataengineeringpodcast.com/ 61 | - https://talkpython.fm/ 62 | - https://dataskeptic.com/ 63 | - https://decodinglove.tv/ 64 | - https://lexfridman.com/podcast/ 65 | - https://allin.com/ 66 | 67 | some interesting blogs 68 | 69 | - https://georgheiler.com/ ;) 70 | - https://engineering.fb.com/ 71 | - netflix engineering blog 72 | - https://www.uber.com/en-US/blog/engineering/ 73 | - https://www.schneier.com/ 74 | 75 | interesting discord groups 76 | 77 | - Data Engineering 78 | - Pixi 79 | - Open Bio ML 80 | - Scraping 81 | - Data Trails 82 | - MLOps @chipro 83 | 84 | slack groups 85 | - locally optimistic 86 | - data engineer things 87 | - RSE 88 | 89 | ## active 90 | 91 | - host your own community to become an expert in the field 92 | - organize a workshop and invite experts 93 | - write a blog (consistently) 94 | - host your own stream 95 | - actively have some 
friends/paper reading club in your field to channel relevant news together 96 | - join communities adjacent to your field and transfer solutions of your field over to their problems. It is amazing how much you can accomplish this way 97 | -------------------------------------------------------------------------------- /web_scraping.md: -------------------------------------------------------------------------------- 1 | # scraping 2 | 3 | ## tools 4 | 5 | - https://scrapy.org/ and https://spidermon.readthedocs.io/en/latest/ 6 | - https://github.com/scrapy-plugins/scrapy-playwright 7 | - drip crawler https://rapidapi.com/markhorverse-markhorverse-default/api/dripcrawler 8 | - js browser automation comparison 9 | - https://medium.com/@fujia1742/web-scraping-showdown-comparing-selenium-puppeteer-and-playwright-59f5c662ae21 10 | - https://blog.apify.com/playwright-vs-selenium-webscraping/ 11 | - https://www.zenrows.com/blog/web-scraping-tools 12 | - https://www.scrapingbee.com/ 13 | - https://www.zyte.com/scrapy-cloud/ 14 | - https://github.com/apify/crawlee-python/ 15 | 16 | - no-code/low-code tools 17 | - https://pixeljets.com/web-scraping-api/ 18 | 19 | 20 | ### AI scraping 21 | 22 | - https://www.youtube.com/watch?v=QxHE4af5BQE 23 | - https://github.com/mendableai/firecrawl 24 | - https://jina.ai/reader/ 25 | - https://github.com/VinciGit00/Scrapegraph-ai 26 | - https://github.com/confident-ai/deepeval https://docs.confident-ai.com/ 27 | - https://github.com/deepchecks/deepchecks https://docs.deepchecks.com/stable/getting-started/welcome.html 28 | - https://github.com/Skyvern-AI/skyvern 29 | - https://github.com/browser-use/browser-use 30 | - https://convergence.ai/proxy_lite/ 31 | - https://github.com/unclecode/crawl4ai 32 | 33 | ### tutorials 34 | 35 | - https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/ 36 | - https://www.zenrows.com/blog/scrapy-playwright#install-scrapy-playwright 37 | - https://playwright.dev/python/docs/debug 38 | - rust 39 | - 
https://www.zenrows.com/blog/rust-web-scraping#extract-html-data 40 | - https://www.scrapingbee.com/blog/web-scraping-rust/k 41 | - https://itehax.com/blog/web-scraping-using-rust 42 | - https://github.com/kxzk/scraping-with-rust 43 | - https://www.scrapingdog.com/blog/web-scraping-with-rust/ 44 | - https://www.reddit.com/r/rust/comments/18e5wlf/web_scraping_with_playwright/ https://github.com/mattsse/chromiumoxide 45 | - https://substack.thewebscraping.club/ 46 | 47 | ## deployments 48 | 49 | ## scaling 50 | 51 | - AWS lambda (serverless) 52 | - https://stackoverflow.com/questions/28825484/is-amazon-lambda-suitable-for-web-scraping 53 | - https://aws.amazon.com/de/blogs/architecture/serverless-architecture-for-a-web-scraping-solution/ 54 | - scrapy 55 | - https://oxylabs.io/blog/scrapy-aws-lambda 56 | 57 | ## detection / evasion 58 | 59 | - https://camoufox.com/ 60 | - `!antibot` in discord or self hosted to check 61 | - https://www.wappalyzer.com/ 62 | - https://scrapingfish.com/webscraping-benchmark 63 | - https://www.zenrows.com/solutions/web-unblocker 64 | - https://substack.thewebscraping.club/p/why-scraper-is-blocked 65 | - ip rotation 66 | - https://github.com/D4rkwat3r/aiohttp-ip-rotator 67 | - https://github.com/Ge0rg3/requests-ip-rotator 68 | ### API discovery 69 | 70 | - https://github.com/Integuru-AI/Integuru 71 | 72 | 73 | ### proxy 74 | 75 | - https://free-proxy-list.net/ 76 | - https://scrapeops.io/ 77 | - https://proxy-seller.com/de/proxy-for-scrapy/ 78 | - example 79 | - https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/ 80 | - https://www.zenrows.com/blog/scrapy-playwright#avoid-getting-blocked 81 | - https://www.zyte.com/scrapy-cloud/ 82 | 83 | 84 | 85 | ## sample scrapers 86 | 87 | - https://github.com/sibbl/wohnung-scraper 88 | -------------------------------------------------------------------------------- /security.md: -------------------------------------------------------------------------------- 1 | # security 2 | 3 | ## 
blogs/resources 4 | 5 | - https://www.nsa.gov/News-Features/the-next-wave/ 6 | - https://www.mdpi.com/2076-3417/10/16/5607 7 | - https://docs.securityonion.net/en/2.3/ 8 | - https://github.com/safe-graph/graph-fraud-detection-papers 9 | - https://github.com/dvopsway/datasploit 10 | - https://github.com/cugu/awesome-forensics 11 | - https://www.youtube.com/watch?v=nMnHBnYfIaI 12 | 13 | 14 | ## forensics 15 | 16 | ### binary analysis 17 | 18 | - https://ghidra-sre.org/ 19 | 20 | ## exploits 21 | - metasploit 22 | - armitage 23 | - threat hunting 24 | - https://kestrel.readthedocs.io/en/latest/ 25 | - https://github.com/microsoft/msticpy 26 | 27 | ## scanning 28 | 29 | ### docker 30 | 31 | - https://github.com/subtrace/subtrace 32 | 33 | ### mobile devices 34 | 35 | - https://github.com/mvt-project/mvt 36 | 37 | ### AI models 38 | 39 | - https://github.com/Azure/counterfit/ https://www.microsoft.com/security/blog/2021/05/03/ai-security-risk-assessment-using-counterfit/ 40 | - https://atlas.mitre.org/ 41 | - https://github.com/Trusted-AI/adversarial-robustness-toolbox 42 | 43 | ### APT scanning 44 | 45 | - https://github.com/cisagov/CHIRP 46 | 47 | ## big data 48 | - apache metron https://metron.apache.org 49 | 50 | ### log analysis 51 | 52 | - https://github.com/marty90/netlytics 53 | - https://github.com/s1l3nt78/sifter 54 | - https://github.com/laramies/theHarvester 55 | - https://github.com/j3ssie/Osmedeus 56 | - https://github.com/jaeles-project/jaeles 57 | 58 | #### cloud 59 | 60 | - https://github.com/matanolabs/matano 61 | 62 | ### metron tips 63 | - https://datahovel.com/2018/07/18/how-to-onboard-a-new-data-source-in-apache-metron/ 64 | 65 | ## device identification 66 | - user agent / SSDP simple service discovery protocol 67 | - ntopng & DNS queries 68 | 69 | ## security training 70 | - https://www.pagerduty.com/blog/security-training-at-pagerduty/ 71 | 72 | ## kerberos 73 | - 
https://fy.blackhats.net.au/blog/html/2017/05/23/kerberos_why_the_world_moved_on.html 74 | - debugging kerberos set `KRB5_TRACE` 75 | 76 | ## workflow SOC 77 | 78 | - https://github.com/microsoft/msticpy 79 | 80 | ## cloud security 81 | - https://forsetisecurity.org 82 | 83 | ### container 84 | - https://sysdig.com/blog/container-security-best-practices/ 85 | 86 | ## tools 87 | - great tooling overview https://open-source-security-software.net 88 | 89 | 90 | ## containers 91 | 92 | ### screening containers 93 | 94 | - https://github.com/quay/clair 95 | 96 | ## networking 97 | 98 | ### wifi 99 | - https://shop.hak5.org/ 100 | 101 | ## pentesting 102 | 103 | ### input 104 | 105 | - https://github.com/minimaxir/big-list-of-naughty-strings 106 | 107 | 108 | ## hats 109 | 110 | ### white 111 | 112 | ### grey 113 | 114 | - https://www.cyborgsecurity.com/python-malware-on-the-rise/ 115 | 116 | ### red team 117 | 118 | 119 | ## videos 120 | 121 | - spark big data security https://www.youtube.com/watch?v=5GvfPGcdhqI and https://www.youtube.com/watch?v=YxTE4mff5dk 122 | 123 | 124 | ## privacy 125 | 126 | - https://www.whonix.org/ 127 | - https://getoutline.org/de/ 128 | - suggestions https://news.ycombinator.com/item?id=28008571 129 | 130 | ## cloud 131 | 132 | ### permission escalation 133 | 134 | - https://github.com/netskopeoss/iaas_permission_mining 135 | 136 | 137 | ## data collection 138 | 139 | ### honey pot 140 | 141 | - https://github.com/telekom-security/tpotce 142 | 143 | 144 | ## CTFs 145 | 146 | - unit calculators hex, decimal, binary https://gchq.github.io/CyberChef/ 147 | 148 | ## people & organization 149 | 150 | - https://owasp.org/www-project-security-culture/v10/4-Security_Champions/ 151 | 152 | 153 | ## red teaming 154 | 155 | 156 | ### reverse shells 157 | 158 | - polymorphic 159 | - https://github.com/projectdiscovery/interactsh 160 | 161 | ## mobile phone 162 | 163 | - https://shop.nitrokey.com/de/shop?search=nitrophone 164 | 165 | ## privacy 166 | - 
https://veilid.com/about-veilid/ 167 | 168 | ### printouts 169 | - https://github.com/dfd-tud/deda 170 | 171 | ## ebpf 172 | ### encryption 173 | 174 | - https://github.com/qpoint-io/qtap 175 | -------------------------------------------------------------------------------- /frequently_needed_stuff.md: -------------------------------------------------------------------------------- 1 | # collection of frequently needed commands 2 | ## certificates 3 | **get certificates and store in keystore** 4 | ``` 5 | # view certificates 6 | openssl s_client -showcerts -connect my.host.com:443 7 | # store desired ones to a file (myfile.pem) 8 | 9 | # then base64 decode it 10 | base64 -d myfile.pem > myfile.cer 11 | 12 | # and add to keystore 13 | keytool -import -alias gateway-identity -keystore mykeystore.jks -file myfile.cer 14 | ``` 15 | 16 | ## kerberos 17 | - see contents (principals) of keytab 18 | ```bash 19 | klist -k keytab.keytab 20 | ``` 21 | - add new user (principal / keytab) 22 | 23 | ``` 24 | sudo kadmin.local 25 | kadmin.local: addprinc -randkey <> 26 | kadmin.local: ktadd -k /etc/security/keytabs/<>.keytab -norandkey <> 27 | sudo chown <> /etc/security/keytabs/<>.keytab 28 | sudo chmod 400 /etc/security/keytabs/<>.keytab 29 | kinit -kt /etc/security/keytabs/<>.keytab <> 30 | klist 31 | ``` 32 | 33 | ## ansible 34 | 35 | ```bash 36 | ansible-galaxy init rolename 37 | ``` 38 | 39 | ## java 40 | - checking classpath 41 | ```bash 42 | jar tf foo.jar | grep MyClass 43 | ``` 44 | - reload the service / restart it in case you modify the same class (just to be sure) 45 | 46 | ## big data 47 | when starting out, it might be a good idea to quickly get something up and running 48 | - cloudbreak is great 49 | - in case you want to go for something smaller 50 | - https://github.com/rmaruthiyodan/docker-hdp-lab 51 | ## spark 52 | - assuming your cluster does not install all client code on each data node remember to ship hive-site.xml as well as the tez xml file when submitting a 
spark job in yarn cluster mode 53 | - it might have worked just fine in yarn client - but will no longer work in yarn cluster without these files. 54 | - dependency clashes: shading, but with some tricks. http://asyncified.io/2016/04/07/spark-uber-jars-and-shading-with-sbt-assembly/ 55 | - sql optimization: https://blog.deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/ 56 | 57 | **data sources** 58 | - REST API as enrichment https://github.com/sourav-mazumder/Data-Science-Extensions/tree/master/spark-datasource-rest 59 | **tuning** 60 | - http://rea.tech/how-we-optimize-apache-spark-apps/ 61 | - http://fdahms.com/2015/10/04/writing-efficient-spark-jobs/ 62 | 63 | 64 | **aggregations** 65 | - multiple aggregations 66 | - UDAF 67 | - rdd statCounter 68 | - scalding: https://github.com/twitter/scalding/wiki/Aggregation-using-Algebird-Aggregators#composing-aggregators 69 | 70 | **partition handling** 71 | - https://towardsdatascience.com/writing-into-dynamic-partitions-using-spark-2e2b818a007a 72 | 73 | **spark on yarn** 74 | - vcores not chosen correctly https://stackoverflow.com/questions/25563736/yarn-is-not-honouring-yarn-nodemanager-resource-cpu-vcores/25570709#25570709 75 | 76 | ## hive 77 | - jdbc for other tools https://github.com/timveil/hive-jdbc-uber-jar 78 | 79 | ## hdfs shell 80 | 81 | - counting of files in directory `hdfs dfs -count -v -h /path/to/data/*` 82 | 83 | ## python 84 | - date handling https://www.youtube.com/watch?v=Q97vDzaNQyU 85 | 86 | 87 | ## yarn 88 | - overview https://de.hortonworks.com/blog/yarn-capacity-scheduler/ 89 | 90 | ## machine learning 91 | ### categorical handling 92 | - http://www.win-vector.com/blog/2016/11/you-should-re-encode-high-cardinality-categorical-variables/ 93 | 94 | ### cross validation 95 | - 
https://www.google.at/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwjTtunBu53YAhXGZFAKHTkcAhoQFggqMAA&url=https%3A%2F%2Fgithub.com%2Fyandexdataschool%2Fml-training-website%2Fraw%2Fgh-pages%2Fpresentations%2FSavenkov_KaggleMercedes_2017_eng.pdf&usg=AOvVaw3Zle9OmWU-FhS9QSB_dUVK 96 | 97 | ## conversions of data 98 | 99 | ### geospatial data 100 | - https://github.com/dice-cyfronet/ascii-grid-source/blob/master/src/main/java/pl/cyfronet/urbanflood/ogc/geoserver/data/ASCIIGridDataReader.java 101 | - https://stackoverflow.com/a/33473452/5200303 102 | - https://en.wikipedia.org/wiki/Ramer–Douglas–Peucker_algorithm 103 | 104 | ## ambari 105 | - changing the password follow https://community.hortonworks.com/questions/449/how-to-reset-ambari-admin-password.html 106 | ``` 107 | 1. Log on to ambari server host shell 108 | 2. Run 'psql -U ambari-server ambari' 109 | 3. Enter password 'bigdata' 110 | 4. In psql: 111 | update ambari.users set user_password='538916f8943ec225d97a9a86a2c6ec0818c1cd400e09e03b660fdaaec4af29ddbb6f2b1033b81b00' where user_name='admin' 112 | 113 | 5. Quit psql 114 | 6. Run 'ambari-server restart' 115 | 116 | This will reset the admin account back to the password of 'admin' 117 | ``` 118 | -------------------------------------------------------------------------------- /streaming.md: -------------------------------------------------------------------------------- 1 | # streaming 2 | 3 | ## hardware setup 4 | 5 | 6 | I realized that my photo camera can also record videos. Since then, I always use it for this purpose. 7 | 8 | - I have a Nikon D850, but any other decent photo camera works just as well as a webcam. 9 | - The camera is mounted on a tripod. 10 | - It is connected to a power outlet because otherwise, the battery would need to be replaced every few hours during video recordings. 
11 | 12 | ### HDMI Capturing 13 | 14 | - Theoretically, you could simply connect the camera to the computer via USB – in my case, with the Nikon Webcam Utility. However, it was limited back then (it only worked in Zoom but not in Teams or WebEx). 15 | - This was frustrating for me. 16 | - Instead, I got a mini HDMI 4-channel mixer, the [Atem Mini](https://www.blackmagicdesign.com/at/products/atemmini). It has additional advantages, for example, I can also mix the iPad or screen content with the green screen, and always be seen in the foreground in front of the PPT or coding window. 17 | - The green screen is from [Elgato](https://www.elgato.com/de/green-screen) (you didn't see it in use yesterday). 18 | 19 | ### Audio 20 | 21 | - Good audio is important – and as long as you don't want to combine it with good video simultaneously, it's super easy. I use a [Blue Yeti microphone](https://www.bluemic.com/de-de/products/yeti/), which can be easily connected via USB – done. And it's dramatically better. 22 | - Unfortunately, I have problems with the audio input on the ATEM Mini and therefore route audio and video separately (resulting in a few frames delay). Did you notice this yesterday? 23 | - Nowadays I use the Yeti for most Teams/Zoom calls and it works great. For recordings and live streams I use https://rode.com/de/microphones/wireless/wirelesspro, connected directly to my camera with timecode, for the best streaming/recording/audio experience 24 | - AH CQ18T 25 | - shure sm7db 26 | - audio playback 27 | - https://github.com/freeman-jiang/beatsync 28 | 29 | ### video 30 | 31 | - d850 32 | - secondary camera 33 | - teleprompter 34 | 35 | ### Lighting 36 | 37 | - Instead of investing in an extremely expensive camera, good audio and lighting are better investments. 38 | - I use: 39 | - 2x Key Lights from [Elgato](https://www.elgato.com/de/key-light) – they are great and you can adjust the brightness and color temperature. 
40 | - A smart lamp from [Luke Roberts](https://www.luke-roberts.com/products/smart-lamp-model-f-white?ls=de) with a slight blue/purple tint for the background, or additional white lighting when using the green screen. 41 | 42 | ### Control 43 | 44 | - With the [Stream Deck](https://www.elgato.com/de/stream-deck-mk2) you can flexibly control the lights, the HDMI mixer (ATEM), or any other actions with a hardware button and even record and play macros. 45 | - One or two USB hubs. 46 | - Lots of cables (HDMI, USB, etc.). 47 | - For controlling the Atem Mini: a small network switch and cables (only works via TCP-IP). 48 | 49 | ### On-Screen 50 | 51 | - The [Logitech Spotlight Presentation Remote](https://www.logitech.com/de-at/products/presenters/spotlight-presentation-remote.910-004862.html?crid=11) is a great laser pointer that also works remotely in Webex or for 1000 people with 3 projectors simultaneously – because it's digital. The magnifier and focus gimmicks are great! 52 | - [DemoPro](http://www.demoproapp.com/) (Mac) offers similar features for Windows for countdown timers and drawing directly on the screen and is a good addition. 53 | 54 | I hope you find some useful tips here. 
55 | 56 | 57 | ### atem mini tricks 58 | 59 | - ISO davinci resolve integration 60 | - https://www.youtube.com/watch?v=YtsVUtm9jmE 61 | - 4k https://www.youtube.com/watch?v=CZpcvqOPAKo 62 | - replay 63 | - https://www.youtube.com/watch?v=aMxg4ECnyms 64 | - https://www.youtube.com/watch?v=FcDULDr3QQM 65 | - resolve tricks 66 | - https://www.sideshowfx.net/davinci-resolve-color-panel-stream-deck-plus-mac?aff=7 https://jonnyelwyn.co.uk/film-and-video-editing/controlling-davinci-resolve-with-the-stream-deck/ 67 | - remote guest atem 68 | - https://www.youtube.com/watch?v=fKpusuY-cjs 69 | - macros 70 | - https://www.youtube.com/watch?v=BegP_UWs454 71 | 72 | ## paid 73 | - https://streamyard.com/ 74 | - https://www.ecamm.com/ 75 | 76 | ## opensource 77 | - https://obsproject.com/de 78 | - https://vdo.ninja/ 79 | - https://docs.vdo.ninja/advanced-settings/setup-parameters/and-sink 80 | - https://docs.vdo.ninja/advanced-settings/setup-parameters/and-audiooutput 81 | - !!! https://github.com/steveseguin/electroncapture !!! 
82 | - Quality https://docs.vdo.ninja/getting-started/even-higher-quality-video 83 | - https://support.volume.com/hc/en-us/articles/4425732789911-Feature-Live-Guests-with-VDO-Ninja-OBS 84 | 85 | ### obs plugins 86 | 87 | - https://obsproject.com/forum/resources/obs-shaderfilter.1736/ 88 | - https://obsproject.com/forum/resources/advanced-masks.1856/ 89 | - https://obsproject.com/forum/resources/stroke-glow-shadow.1800/ 90 | - https://obsproject.com/forum/resources/composite-blur.1780/ 91 | - https://obsproject.com/forum/resources/3d-effect.1692/ 92 | - https://obsproject.com/forum/resources/downstream-keyer.1254/ 93 | - https://obsproject.com/forum/resources/source-clone.1632/ 94 | - https://obsproject.com/forum/resources/move.913/ 95 | - audio mixer 96 | - https://www.voxengo.com/product/marvelgeq/ 97 | - https://obsproject.com/forum/resources/source-record.1285/ 98 | - https://obsproject.com/forum/resources/aitum-vertical.1715/ vertical 99 | - https://streamer.bot/features 100 | - combined source switcher https://obsproject.com/forum/resources/source-switcher.941/ with move transition https://github.com/exeldro/obs-move-transition 101 | 102 | ## AI video summarization 103 | 104 | - https://klap.app/ 105 | 106 | ## AI video generation 107 | 108 | - https://www.heygen.com/ 109 | - https://lumalabs.ai/dream-machine 110 | - https://app.runwayml.com/login gen-3 111 | - https://invideo.io/ 112 | - https://www.revid.ai/ 113 | 114 | ## transcription & captions 115 | 116 | - https://sonix.ai/de 117 | - https://www.heygen.com/ 118 | 119 | ## multi streaming 120 | 121 | - https://www.boxcast.com/ 122 | - https://restream.io 123 | 124 | ## streaming server 125 | 126 | - https://owncast.online/ 127 | 128 | 129 | ## zoom tips 130 | 131 | - https://www.youtube.com/watch?v=I473FLo-Bug 132 | - https://www.youtube.com/watch?v=aK4S4wrmhF4&t=531s 133 | 134 | ## ppt tips 135 | 136 | - wow indeed https://discussions.apple.com/thread/7179965?sortBy=rank open PPT in non full screen mode 
makes it work 137 | 138 | ## scene switching 139 | 140 | ### super source 141 | 142 | - great macros https://www.a2z-productions.com/atem-resources 143 | - visual macro builder and macros to buy https://layouts.heretorecord.com/ 144 | 145 | ## streaming utilities 146 | 147 | - https://heretorecord.gitbook.io/h2r-graphics 148 | - stingers https://stingers.heretorecord.com/ 149 | 150 | 151 | ## vst plugins 152 | 153 | ### quality 154 | 155 | - https://www.accentize.com/dxrevive/ 156 | 157 | ## stock content 158 | 159 | ### audio 160 | 161 | - https://artlist.io/ 162 | 163 | ### visual 164 | 165 | - https://pixabay.com/de/ 166 | -------------------------------------------------------------------------------- /startup_company.md: -------------------------------------------------------------------------------- 1 | # operating/Starting a company 2 | 3 | Useful services (not only) for (starting) a company 4 | ## research lab 5 | 6 | this is not so company related - but still really interesting 7 | 8 | - https://1517.substack.com/p/why-bell-labs-worked 9 | - 10 | 11 | ## founding 12 | ### storytelling 13 | 14 | - https://tome.app 15 | 16 | ## errors 17 | 18 | - https://www.youtube.com/watch?v=zFYcXvOrwDE 19 | 20 | ## internal tools 21 | 22 | - https://github.com/ToolJet/ToolJet 23 | - https://github.com/directus/directus/ 24 | - https://www.nocodb.com/ 25 | - https://budibase.com/ 26 | - https://n8n.io/cloud/?ref=osit&utm_source=affiliate 27 | - https://github.com/mailhog/MailHog 28 | 29 | ### polls 30 | 31 | - https://pollunit.com/de 32 | 33 | ### mail 34 | 35 | - encrypted 36 | - https://tuta.com/de/ 37 | 38 | ## ERP 39 | 40 | - https://github.com/frappe/erpnext 41 | 42 | ### crm 43 | 44 | - https://www.dolibarr.org/ 45 | 46 | ## HR 47 | 48 | - https://www.rippling.com/ 49 | - https://www.turing.com/ 50 | - https://eightfold.ai/ 51 | - https://www.mysteryminds.com/ 52 | - graph-based vetting of LinkedIn profiles https://www.fainds.com/plugin 53 | 54 | ### large scale HR 55 | - 
https://eightfold.ai/ 56 | 57 | ### time booking 58 | - https://www.kimai.org/ 59 | - https://sisevo.com/zeitmanagement/zeiterfassung 60 | 61 | ## marketing 62 | 63 | - https://kit.com 64 | - **https://github.com/ownstats/ownstats** 65 | - https://matomo.org/ 66 | - https://plausible.io/ 67 | - https://www.piano.io/ 68 | - https://github.com/vinceanalytics/vince 69 | - https://openpanel.dev/ 70 | - https://github.com/goenning/google-indexing-script 71 | - https://ownstats.com/ 72 | - rudderstack 73 | - https://www.netspring.io/ 74 | - A/B testing 75 | - https://www.statsig.com/ 76 | - tech documentation slack bot answers https://www.scoutos.com/ 77 | - document and link analytics 78 | - short.io 79 | - google analytics explained 80 | - https://www.youtube.com/watch?v=--RTXzVNNps 81 | - https://about.scarf.sh/ 82 | - https://www.memberstack.com/ 83 | 84 | document sending 85 | 86 | - https://www.papermark.com/de 87 | 88 | ### sales 89 | 90 | call transcription tools/AIs 91 | - https://www.gong.io/ 92 | 93 | ### analytics 94 | - https://thinkgrowth.org/the-startup-founders-guide-to-analytics-1d2176f20ac1 95 | 96 | ### dashboards 97 | 98 | - https://github.com/lightdash/lightdash#quick-start https://www.lightdash.com/ 99 | - https://www.metabase.com/ 100 | - https://evidence.dev/ 101 | 102 | ### logging 103 | 104 | - sentry 105 | - https://betterstack.com/logs 106 | - https://vector.dev/ 107 | - https://posthog.com 108 | ### secrets 109 | 110 | - https://github.com/Infisical/infisical/blob/main/README.md 111 | - mozilla sops + age 112 | 113 | ## legal 114 | 115 | - validate name is not claimed by anyone else https://euipo.europa.eu/eSearch/#basic/1+1+1+1/100+100+100+100/ 116 | - https://www.thanksroger.com 117 | 118 | ## finance 119 | 120 | - founding / payments https://stripe.com/atlas 121 | - split bills https://spliit.app/?ref=hn 122 | 123 | ### funding 124 | 125 | - https://www.fundingsimulator.com/ 126 | 127 | #### equity 128 | 129 | - 
https://comparator-one.vercel.app/ 130 | 131 | ### billing 132 | 133 | - Pricing: https://www.tier.run/ 134 | - https://www.getlago.com/ 135 | - https://github.com/getlago/lago/wiki/Stripe-Data-vs-Open‐Source-Alternatives:-a-MRR-example 136 | - https://github.com/killbill/killbill 137 | - https://hyperswitch.io/ 138 | - https://www.ballerine.com/ 139 | - https://tigerbeetle.com/ 140 | - https://www.corefin.com/ 141 | - https://github.com/flowglad/flowglad 142 | 143 | ### invoicing 144 | 145 | - https://invoiceplane.com/ 146 | 147 | ### taxes 148 | 149 | - https://www.avalara.com/eu/de/index.html 150 | - https://www.bookamat.com/ 151 | 152 | ### identity 153 | 154 | - keycloak 155 | - https://phasetwo.io/ and https://keycloak.discourse.group/t/keycloak-multi-tenancy-extensions-for-saas-applications/15426 156 | - https://www.alphabold.com/azure-ad-configuration/ 157 | - https://www.reddit.com/r/Office365/comments/mctsmx/setup_sso_office_365_with_idp_example_keycloak/ 158 | 159 | ## software 160 | 161 | ### hosting 162 | 163 | - simplicity with local https://grski.pl/self-host 164 | 165 | ### office / productivity 166 | 167 | - MS Office 168 | - https://nino.app/ 169 | - https://github.com/suitenumerique/docs 170 | - https://cryptpad.org/ 171 | 172 | ### prototyping 173 | 174 | - https://wireflow.co/ 175 | - sharing something 176 | - https://tunnl.gg/ 177 | - simple scalable hosting 178 | - https://uncloud.run/ 179 | 180 | ### developer experience 181 | 182 | - https://trunk.io/ 183 | - https://www.jetbrains.com/space/ 184 | 185 | ### data 186 | 187 | - https://github.com/sbalnojan/run-a-data-team 188 | - https://dagster.io/ 189 | - https://open-metadata.org/ 190 | - https://docs.getdbt.com/docs/introduction 191 | - https://www.metaplane.dev 192 | - https://formbricks.com/ 193 | - calendar and scheduling 194 | - https://github.com/calcom/cal.com 195 | - https://zcal.co 196 | - https://posthog.com/ 197 | - https://sentry.io/welcome/ 198 | 199 | - 
https://retina.ai/blog/dataops-principles/ 200 | #### BI 201 | 202 | - https://rowzero.io/ 203 | - https://marimo.app/ 204 | - omni bi https://omni.co/platform 205 | 206 | #### VPN 207 | 208 | - wireguard 209 | - https://github.com/trailofbits/algo 210 | - https://www.twingate.com/ 211 | - https://tailscale.com/ 212 | 213 | #### backends 214 | 215 | - https://directus.io/engine/ https://github.com/directus/directus/ 216 | - https://pocketbase.io/ 217 | - pockethost.io 218 | - https://fly.io 219 | - frontend 220 | - https://alpinejs.dev/ 221 | - https://htmx.org/ https://hypermedia.systems/ 222 | - https://vanjs.org/demo#hello-world 223 | #### frontend UI 224 | 225 | - https://ui.shadcn.com/ 226 | - https://www.figma.com/community/file/1233953507961010067/tremor-ui-kit-community 227 | - https://magicui.design/ 228 | - 229 | #### workflows 230 | 231 | - dagster https://docs.dagster.io/integrations/embedded-elt 232 | - https://www.tines.com/ 233 | - https://wem.io/ 234 | 235 | #### forms 236 | 237 | - https://formbricks.com 238 | 239 | ### monitoring 240 | 241 | - https://checkmk.com/de 242 | - https://signoz.io/ 243 | - https://github.com/vectordotdev/vector 244 | 245 | ### networking 246 | 247 | - https://github.com/netbox-community/netbox 248 | 249 | ## support 250 | 251 | - https://www.loom.com 252 | - https://github.com/rsrohan99/llamabot?__s=veeebstmvghwo4ozbvkh&utm_source=drip&utm_medium=email&utm_campaign=LlamaIndex+news%2C+2024-02-06 253 | - scout ai 254 | - pylon 255 | 256 | ## community 257 | 258 | - https://www.linen.dev/ 259 | - https://www.mightynetworks.com/ 260 | - https://github.com/discursus-data/saf_core 261 | - chat 262 | - https://revolt.chat/ 263 | 264 | ## project management 265 | 266 | - https://hedgedoc.org/ 267 | - https://wekan.github.io/ 268 | 269 | ## product 270 | 271 | ### communication 272 | 273 | - https://typedream.com/ 274 | 275 | ### strategy 276 | 277 | #### compound 278 | 279 | - https://www.youtube.com/watch?v=zAYPT6CrWRQ in 
particular 09:30 to 15:30 280 | - https://akashbajwa.substack.com/p/bundling-and-saas-power-laws 281 | - https://www.youtube.com/watch?v=BbSNA64TQ7s&t=233s 282 | 283 | ### data 284 | - data selling company (data as a service DSSV) https://blog.safegraph.com/data-as-a-service-bible-everything-you-wanted-to-know-about-running-daas-companies-d4cf4c15c038 285 | 286 | 287 | ## tool lists 288 | 289 | - https://airtable.com/shrVkzKl8rjQV38AZ/tbliSnfbucggubrwp/viwQqEKIx4rDDgD3j?blocks=hide 290 | 291 | ### document signing 292 | 293 | - docusign 294 | - https://documenso.com/ 295 | - https://github.com/OpenSignLabs/OpenSign 296 | - https://github.com/docusealco/docuseal 297 | 298 | ### video conferencing 299 | - https://opencollective.com/meet-coop/updates/announcement-meet-coop-2-0-is-here 300 | - jitsi 301 | 302 | ## growthhacking 303 | 304 | ### text 305 | 306 | - https://www.frizerly.com/ 307 | - https://byword.ai/ 308 | 309 | ### web pages 310 | 311 | - https://tips.io/ 312 | 313 | # enterprise 314 | 315 | ## HR 316 | 317 | 318 | 319 | ### communication 320 | - https://haiilo.com/ 321 | 322 | ## creator tools 323 | 324 | content sharing 325 | - https://kit.com/ 326 | 327 | 328 | ## IT 329 | 330 | ### CI/CD 331 | 332 | - https://depot.dev/ 333 | -------------------------------------------------------------------------------- /rust.md: -------------------------------------------------------------------------------- 1 | # rust 2 | 3 | ## learning rust 4 | 5 | - **https://www.zero2prod.com/index.html?country=Austria&discount_code=VAT20** truly awesome! 6 | - **https://www.howtocodeit.com/** really great! 
7 | - prototyping with rust https://corrode.dev/blog/prototyping/ 8 | - https://github.com/rust-lang/rustlings 9 | - https://rust-exercises.com/ 10 | - https://github.com/sger/RustBooks 11 | - https://rust-unofficial.github.io/patterns/idioms/coercion-arguments.html 12 | - https://www.lurklurk.org/effective-rust/ 13 | - https://rustwasm.github.io/docs/book/introduction.html 14 | - https://burn.dev/book/overview.html 15 | - https://www.manning.com/books/rust-design-patterns 16 | - https://nnethercote.github.io/perf-book/introduction.html 17 | 18 | - https://antithesis.com/blog/is_something_bugging_you/ 19 | - https://joshlf.com/files/talks/Safety%20in%20an%20Unsafe%20World.pdf 20 | 21 | ### wow presentations 22 | 23 | - https://www.youtube.com/watch?v=Lc8lBMEJQdo 24 | - https://www.youtube.com/watch?v=mtERSY1wRcA 25 | 26 | ### Tips 27 | - https://blog.sdf.com/p/fast-development-in-rust-part-one 28 | - https://blog.sdf.com/p/fast-development-in-rust-part-2 29 | - named function arguments https://elastio.github.io/bon/blog/how-to-do-named-function-arguments-in-rust 30 | - project generator 31 | - https://github.com/mainmatter/gerust 32 | 33 | ### macros 34 | - https://veykril.github.io/tlborm/ 35 | 36 | ## rust and python 37 | - https://pyo3.rs/v0.21.2/ 38 | 39 | 40 | ## toolchain 41 | 42 | - https://github.com/Canop/bacon 43 | - multi project build with workspaces https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html 44 | - optimize build config https://github.com/Kobzol/cargo-wizard 45 | - `cargo-watch` with `cargo watch -x check` 46 | - sort cargo dependencies https://github.com/DevinR528/cargo-sort 47 | - releases 48 | - https://github.com/MarcoIeni/release-plz 49 | - example https://github.com/deepcausality-rs/deep_causality 50 | 51 | ### cross compilation 52 | 53 | - https://burgers.io/cross-compile-rust-from-arm-to-x86-64 54 | - https://github.com/rust-cross/cargo-zigbuild 55 | - https://github.com/cross-rs/cross 56 | - 
https://github.com/messense/homebrew-macos-cross-toolchains 57 | - https://github.com/clux/muslrust/issues/142#issuecomment-2152638811 58 | 59 | ### static builds 60 | 61 | > In the case of MUSL ensure to: 62 | > - use an alternative allocator such as mimalloc by patching MUSL 63 | > - or directly compile it together (may be simpler for bazel) https://github.com/linkerd/linkerd2/commit/aaa6091ea86eb988e68eda16bdc9b59db9d96076#diff-63a2e5c64764a441fcbc73dc27b1a8e684df057ffd54ad39e66e2a643020eb0d 64 | > 65 | - https://kerkour.com/rust-pingoo-high-performance-allocations-mimalloc-heapless 66 | 67 | ### scaling the build beyond cargo 68 | 69 | - https://mmapped.blog/posts/17-scaling-rust-builds-with-bazel 70 | - https://users.rust-lang.org/t/static-linking-for-rust-without-glibc-scratch-image/112279/8 71 | - https://github.com/marvin-hansen/fluvio-examples and https://github.com/deepcausality-rs/deep_causality 72 | - faster linker 73 | - https://blog.rust.careers/post/compile_rust_faster/ 74 | 75 | ### ci-cd examples 76 | 77 | - https://gist.github.com/FedericoPonzi/873aea22b652572f5995f23b86543fdb 78 | - https://github.com/f2calv/multi-arch-container-rust 79 | - https://blog.urth.org/2023/03/05/cross-compiling-rust-projects-in-github-actions/ 80 | - https://github.com/dirien/rust-cross-compile 81 | 82 | 83 | ### dependencies 84 | - https://www.lurklurk.org/effective-rust/dep-graph.html cargo-udeps/cargo-deny 85 | - sbom 86 | - https://github.com/rust-secure-code/cargo-auditable 87 | - https://github.com/CycloneDX/cyclonedx-rust-cargo/ 88 | 89 | ### documentation 90 | - https://www.lurklurk.org/effective-rust/documentation.html 91 | - `#![deny(broken_intra_doc_links)]` 92 | - `#![warn(missing_docs)]` 93 | - `cargo doc` 94 | 95 | ### formatting 96 | 97 | - https://github.com/rust-lang/rustfmt ` rustup component add rustfmt`, `cargo fmt`, ` cargo fmt -- --check` 98 | 99 | ### linters 100 | 101 | - clippy https://doc.rust-lang.org/clippy/installation.html ` rustup 
component add clippy`, `cargo clippy`, `cargo clippy -- -D warnings` 102 | - https://www.schneems.com/2025/11/19/find-accidental-code-usage-with-a-custom-clippytoml/ 103 | - security audit ` cargo install cargo-audit`, `cargo audit` 104 | - https://github.com/EmbarkStudios/cargo-deny `cargo install --locked cargo-deny && cargo deny init && cargo deny check` 105 | - semver https://github.com/obi1kenobi/cargo-semver-checks 106 | - `cargo-udeps` to delete unused dependencies 107 | 108 | ### databases & ORM 109 | 110 | - https://github.com/launchbadge/sqlx 111 | 112 | ### testing 113 | 114 | - https://github.com/la10736/rstest 115 | - https://github.com/rust-lang/miri 116 | - https://github.com/rust-fuzz/cargo-fuzz 117 | - https://github.com/obi1kenobi/cargo-semver-checks 118 | - https://crates.io/crates/criterion 119 | - https://nexte.st/ 120 | - coverage 121 | - https://docs.rs/cargo-tarpaulin/latest/cargo_tarpaulin/ `cargo install cargo-tarpaulin` ` cargo tarpaulin --ignore-tests` 122 | - fuzzer 123 | - grammar 124 | - https://github.com/R9295/autarkie 125 | 126 | #### mocking 127 | - https://docs.rs/mockall/latest/mockall/ 128 | 129 | ## learnings 130 | 131 | ### macros 132 | 133 | - automatically derive macros i.e. 
for equality https://rust-exercises.com/04_traits/04_derive 134 | ``` 135 | #[derive(PartialEq)] 136 | ``` 137 | - https://docs.rs/derive_more/latest/derive_more/ 138 | 139 | ### architecture in rust 140 | 141 | - https://alexis-lozano.com/hexagonal-architecture-in-rust-1/ 142 | - project plan for successful rust https://corrode.dev/blog/successful-rust-business-adoption-checklist/ 143 | 144 | ### web frameworks 145 | 146 | - https://github.com/LukeMathWalker/pavex 147 | - https://actix.rs/ 148 | - https://github.com/tokio-rs/axum 149 | - https://github.com/poem-web/poem 150 | - https://leptos.dev/ 151 | - https://yew.rs/ 152 | - open api doc 153 | - https://docs.rs/utoipa/latest/utoipa/ 154 | - https://github.com/paperclip-rs/paperclip 155 | - https://github.com/hyperium/tonic and https://github.com/actix/actix-web/issues/2853 156 | 157 | ### web utils (built with rust) 158 | 159 | - https://exograph.dev/ 160 | 161 | ### template engines 162 | 163 | - https://github.com/djc/askama 164 | 165 | ### security 166 | 167 | (not only rust specific) 168 | 169 | - https://owasp.org/www-project-application-security-verification-standard/ 170 | - https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html 171 | - https://github.com/biandratti/passivetcp-rs 172 | 173 | #### security-web 174 | 175 | - authorization 176 | - https://github.com/resyncgg/dacquiri 177 | - https://github.com/authzed/spicedb 178 | - actix RBAC https://github.com/casbin/casbin-rs and https://github.com/casbin-rs/actix-casbin 179 | - request validation 180 | - https://github.com/ranger-ross/actix-web-validation 181 | - tools https://github.com/osirislab/awesome-rust-security/blob/master/README.md 182 | 183 | ### useful crates 184 | 185 | - errors 186 | - https://crates.io/crates/thiserror 187 | - parallelism 188 | - https://github.com/rayon-rs/rayon 189 | - frontend 190 | - Tauri https://tauri.app/ (electron like) 191 | - https://github.com/redbadger/crux 192 | - IO 193 | - 
https://opendal.apache.org/ 194 | - language models 195 | - https://docs.rs/kalosm/latest/kalosm/ 196 | - serialization 197 | - https://fory.apache.org/blog/fory_rust_versatile_serialization_framework/ 198 | ### data engineering and rust 199 | 200 | - https://blog.duyet.net/2023/06/fossil-data-platform-written-rust.html 201 | - https://rustfordata.com/chapter_1.html 202 | ## embedded 203 | 204 | - https://github.com/rust-embedded/rust-raspberrypi-OS-tutorials 205 | 206 | ## artifact handling 207 | 208 | - https://github.com/cenotelie/cratery 209 | 210 | ## logging & observability 211 | 212 | - https://signoz.io/ 213 | 214 | ### profiling 215 | - https://github.com/tikv/pprof-rs 216 | - https://github.com/pawurb/hotpath 217 | 218 | ## frameworks 219 | 220 | - actor 221 | - https://github.com/elfo-rs/elfo 222 | 223 | - ai 224 | - https://burn.dev/ 225 | 226 | ## verification 227 | 228 | - https://github.com/verus-lang/verus 229 | 230 | 231 | ## learning steps 232 | 233 | explore 234 | 235 | - https://github.com/howtocodeit/hexarch 236 | - https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html 237 | - https://leptos.dev/ 238 | - https://github.com/leptos-rs/leptos/tree/main/examples/tailwind_actix 239 | - https://github.com/bazelbuild/examples/tree/main/rust-examples#examples-to-build-rust-code 240 | -------------------------------------------------------------------------------- /data-science.md: -------------------------------------------------------------------------------- 1 | # data science 2 | 3 | ## great content 4 | 5 | - https://www.youtube.com/watch?v=xC-c7E5PK0Y 6 | - https://github.com/harvard-edge/cs249r_book 7 | 8 | ## ethics 9 | 10 | - https://book.the-turing-way.org/ 11 | 12 | 13 | ## nice overview courseware 14 | 15 | - https://mlip-cmu.github.io/s2025/ 16 | - vis teaching 17 | - https://github.com/stefmolin/data-morph 18 | 19 | ## list of useful tools 20 | 21 | - https://landscape.lfai.foundation/ 22 | - 
https://github.com/fmind/mlops-python-package 23 | - https://github.com/callmesora/llmops-python-package 24 | 25 | ## references to further similar lists 26 | 27 | - https://github.com/r0f1/datascience 28 | 29 | ## basics 30 | 31 | - https://missing.csail.mit.edu/ 32 | 33 | ## exploratory data analysis (EDA) 34 | 35 | - automated EDA 36 | - python 37 | - https://sci-analysis.readthedocs.io/en/latest/ 38 | - https://github.com/pandas-profiling/pandas-profiling 39 | - https://amueller.github.io/dabl/dev/ 40 | - https://www.visidata.org/ 41 | - https://github.com/microsoft/vscode-data-wrangler 42 | - R 43 | - https://boxuancui.github.io/DataExplorer/ 44 | 45 | ## anomaly detection 46 | 47 | - https://pyod.readthedocs.io/en/latest/index.html 48 | 49 | ## visualization 50 | 51 | - https://charticulator.com/ 52 | - https://www.datawrapper.de/ 53 | - https://flourish.studio/ 54 | - https://rawgraphs.io/ 55 | - high dimensional data https://github.com/facebookresearch/hiplot 56 | 57 | ## scaling pandas 58 | 59 | - https://github.com/ibis-project/ibis 60 | - elasticsearch: https://github.com/elastic/eland 61 | - dask 62 | - ray/modin 63 | - RAPIDS 64 | - vaex 65 | - https://ibis-project.org/ 66 | - https://github.com/pola-rs/polars 67 | - https://talkpython.fm/episodes/show/510/10-polars-tools-and-techniques-to-level-up-your-data-science 68 | - https://github.com/ddotta/awesome-polars 69 | - https://github.com/quantco/dataframely 70 | - https://github.com/JakobGM/patito 71 | - https://github.com/foxcroftjn/polars-strsim 72 | - https://github.com/zlobendog/polars_encryption 73 | - https://github.com/pola-rs/polars-xdt 74 | - https://github.com/TomBurdge/harley 75 | - https://github.com/azmyrajab/polars_ols 76 | - https://github.com/erichutchins/polars_iptools 77 | 78 | ## experiment design 79 | 80 | - http://docs.pyro.ai/en/stable/contrib.oed.html 81 | 82 | ## experiment tracking 83 | 84 | MLops overview: https://github.com/visenger/awesome-mlops 85 | 86 | - mlFlow 87 | - 
https://docs.bentoml.org 88 | - https://neptune.ai/blog/experiment-management 89 | - https://neptune.ai/blog/mlops-what-it-is-why-it-matters-and-how-to-implement-it-from-a-data-scientist-perspective 90 | - https://towardsdatascience.com/machine-learning-experiment-tracking-93b796e501b0 91 | - https://neptune.ai/blog/machine-learning-model-management 92 | 93 | ## bayesian analyses 94 | 95 | - frameworks 96 | - stan 97 | - pymc 98 | - pyro https://arviz-devs.github.io/arviz/ 99 | - https://botorch.org/ 100 | - https://arviz-devs.github.io/arviz/ 101 | - https://bayesiancomputationbook.com/welcome.html 102 | 103 | ## deep learning 104 | 105 | - books 106 | - https://github.com/fastai/fastbook 107 | - a really useful link collection on my blog: https://georgheiler.com/2020/12/03/intersting-links-about-deep-learning/ 108 | - automated/easy DL 109 | - https://github.com/geekjr/quickai 110 | - https://pythonawesome.com/a-framework-that-implements-automl-algorithms-for-model-architecture-search-at-scale/ 111 | - tabular deeplearning: 112 | - https://github.com/dreamquark-ai/tabnet 113 | 114 | - https://lets-unify.ai/ 115 | 116 | ## trees 117 | ### boosted 118 | 119 | - https://github.com/Pushp-Kharat1/PkBoost 120 | 121 | ## neighbor search 122 | 123 | - https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html 124 | - https://github.com/spotify/annoy 125 | ### subgroup discovery 126 | 127 | - https://github.com/flemmerich/pysubgroup 128 | - https://github.com/imoscovitz/wittgenstein 129 | - https://github.com/scikit-learn-contrib/skope-rules 130 | 131 | ## NLP 132 | 133 | - https://pythonawesome.com/attention-based-grapheme-to-phoneme-with-python/ 134 | 135 | ### rules 136 | - https://github.com/adaamko/POTATO 137 | 138 | ## reproducibility 139 | 140 | ### population drift monitoring 141 | 142 | - https://github.com/ing-bank/popmon 143 | 144 | ### exploration of models 145 | 146 | - exploration in a nice HTML UI by the end user to get a feel for the model 
or feedback: https://github.com/gradio-app/gradio 147 | 148 | ### fairness 149 | 150 | - https://www.youtube.com/watch?v=WfcCXlxkBb0, https://github.com/koaning/scikit-lego 151 | - https://koaning.io/posts/the-future-is-past/ 152 | - https://github.com/linkedin/LiFT 153 | 154 | 155 | ### explainable AI 156 | 157 | - https://github.com/interpretml/interpret 158 | 159 | ### version control 160 | - improve jupyter git integration http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/ 161 | - https://elyra.readthedocs.io/en/latest/ 162 | 163 | ## streaming data science 164 | ### libraries 165 | - https://github.com/alibaba/alink 166 | - https://github.com/scikit-multiflow/scikit-multiflow 167 | - https://riverml.xyz/latest/ 168 | 169 | 170 | ## data 171 | 172 | ### audio 173 | 174 | - feature calculation 175 | - https://github.com/jcvasquezc/DisVoice 176 | 177 | ### streams 178 | 179 | - jupyter 180 | - bokeh on streams in jupyter https://medium.com/swlh/streaming-time-series-with-jupyter-and-influxdb-fec33803c460 181 | 182 | 183 | ### feature store 184 | 185 | - hopsworks 186 | - Behaviors compose better than states 187 | - perhaps a simple python library with some transformations & DBT & Dagster will do the job very well and https://metriql.com/ for metrics 188 | - https://feast.dev/ 189 | 190 | ## forecasting (time series) 191 | 192 | - https://www.youtube.com/watch?v=jo12CWZ00Lo 193 | - https://github.com/linkedin/greykite 194 | - https://facebook.github.io/prophet/ 195 | - https://github.com/winedarksea/AutoTS 196 | - https://github.com/timeseriesAI/tsai 197 | - https://github.com/AIStream-Peelout/flow-forecast 198 | 199 | ### generic timeseries operations 200 | - https://khiva.readthedocs.io/en/latest/ 201 | - cleaning https://towardsdatascience.com/a-common-mistake-to-avoid-when-working-with-time-series-data-eedf60a8b4c1 202 | 203 | 204 | 205 | ## website tracking 206 | 207 | ### google analytics 208 | 209 | - schema overview 
http://storage.googleapis.com/e-nor/visualizations/bigquery/ga360-schema.html 210 | 211 | ### CDM/CDP customer data management & customer data platform 212 | 213 | basically SQL-based sync & incremental updates of data to many systems and reports without sending a lot of CSV files around 214 | 215 | - https://www.hightouch.io/ 216 | - https://www.getcensus.com/ 217 | 218 | 219 | ## deployment 220 | 221 | - https://github.com/whylabs/whylogs 222 | - clean machine learning code https://leanpub.com/cleanmachinelearningcode 223 | - code refactoring example https://github.com/moutai/tdml-example/tree/master/with_tests/tests 224 | 225 | ### cloud instance handling 226 | 227 | - cheap training using spot instances https://www.spotml.io/ 228 | 229 | 230 | # use cases 231 | 232 | ## marketing 233 | 234 | - https://facebookexperimental.github.io/Robyn/ 235 | 236 | 237 | # text processing 238 | 239 | ## ocr 240 | 241 | - https://github.com/VikParuchuri/surya 242 | 243 | 244 | # labeling 245 | 246 | - prodigy 247 | - https://github.com/TutteInstitute/thisnotthat 248 | 249 | ## clustering 250 | 251 | - https://github.com/TutteInstitute 252 | - https://www.tute.com/ 253 | 254 | 255 | ## matching data 256 | 257 | entity/identity resolution 258 | 259 | - https://linktransformer.github.io/ 260 | 261 | ## LLM resources 262 | 263 | - https://github.com/SylphAI-Inc/LLM-engineer-handbook 264 | - https://github.com/stanfordnlp/dspy 265 | - https://github.com/SylphAI-Inc/AdalFlow 266 | - https://github.com/Eladlev/AutoPrompt 267 | - https://github.com/comet-ml/opik 268 | - https://github.com/huggingface/lighteval 269 | - https://github.com/explodinggradients/ragas 270 | - https://github.com/unslothai/unsloth 271 | - https://github.com/Lightning-AI/litgpt 272 | - embeddings 273 | - https://raw.githubusercontent.com/veekaybee/what_are_embeddings/main/embeddings.pdf 274 | 275 | ## cleaning 276 | 277 | - https://github.com/cleanlab/cleanlab 278 | 279 | ## vision 280 | 281 | - object 
detection 282 | - object tracking https://github.com/roboflow/trackers 283 | 284 | 285 | ## interactive computing 286 | 287 | notebooks 288 | 289 | - https://marimo.io/ 290 | 291 | 292 | ## similarity search 293 | 294 | ### semantic hashing 295 | 296 | - https://github.com/MinishLab/semhash 297 | - https://colab.research.google.com/drive/1o1nzwXWAa8kdkEJljbJFW1VuI-3VZLUn?usp=sharing 298 | ### minhash 299 | 300 | - https://github.com/beowolx/rensa 301 | 302 | ## automl 303 | 304 | - https://github.com/autogluon/autogluon 305 | 306 | ## system design for ML 307 | 308 | - https://github.com/harvard-edge/cs249r_book 309 | - 310 | -------------------------------------------------------------------------------- /real-world-dataflow-problems.md: -------------------------------------------------------------------------------- 1 | # real world dataflow problems 2 | Some problems I observed in real dataflow pipelines 3 | 4 | ## ETL 5 | - quality controlling data prior to ingest from source systems 6 | - actually knowing (responsibility for bought applications) where to find the data(base) and who knows about the format 7 | - parameterize everything, most importantly the columns. 8 | - always describe all the columns directly - even when loading data in spark 9 | - make sure column names can be refactored with IDE support (centralized at single place / module / library) 10 | - https://www.kdnuggets.com/2018/08/self-service-data-prep-tools-6-lessons-learned.html 11 | - spark https://medium.com/adobetech/spark-on-scala-adobe-analytics-reference-architecture-7457f5614b4c 12 | - CDC schema migration https://riccomini.name/kafka-change-data-capture-breaks-database-encapsulation 13 | 14 | ### SQL 15 | SQL nowadays is so much more than SQL92 (which most people are familiar with). Arrays, json, xml ... can be handled. 
In distributed systems, ordering (total ordering vs. partial ordering within partitions) turns out to be an important concept to master as well: 16 | - spark 17 | - https://blog.deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/ 18 | - https://www.enigma.com/blog/things-i-wish-id-known-about-spark 19 | - hive https://saurzcode.in/2015/01/hive-sort-order-distribute-cluster/ 20 | 21 | ### dataflow 22 | - understand the limits of your architecture / technology and dataflows regarding speed, capacity, latency, types of data that can be handled... 23 | - define clear APIs (schemata) between different data sources and pipelines to allow building workflows on top of the ingested data streams 24 | - be clear what type of data you want to process (batch vs. streaming). Do not force batch semantics for a stream processor. Idempotent jobs will make your life much easier. 25 | - for batch workloads consider oozie over nifi. *it just works TM* 26 | - some great NiFi tips https://pierrevillard.com/best-of-nifi/ 27 | 28 | ### security 29 | - know the difference between masking field values on the fly (i.e. in ranger) vs. actually not having the permissions to view a column, which often (at least for hive) also prevents executing the `DESCRIBE TABLE` statement, so any tool like tableau which relies on this will subsequently fail 30 | - understand knox https://community.hortonworks.com/content/kbentry/113013/how-to-troubleshoot-and-application-behind-apache.html 31 | 32 | ### Project management 33 | 34 | When working on a data project it is even more important to convey a story: https://www.youtube.com/watch?v=plFPTDwk66s shows 6 points on how to improve visualization by telling a clear story. 
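Coming back to the dataflow advice in this section that idempotent jobs make life easier, here is a minimal sketch of what that can look like for a daily batch job. The function name `run_daily_job`, the `date=<day>` directory layout and the JSON output are assumptions made for illustration only; the point is that a re-run replaces the output of a day instead of appending to it.

```python
import json
import os
import tempfile
from pathlib import Path


def run_daily_job(records, run_date: str, out_root: Path) -> Path:
    """Write the output for one day; a re-run replaces it instead of appending."""
    # Hypothetical layout: one directory per processed day
    target = out_root / f"date={run_date}" / "part-0.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    rows = [r for r in records if r["date"] == run_date]
    # Write to a temporary file first, then swap it into place in one step,
    # so a crashed run never leaves a half-written result behind
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(rows, fh)
    os.replace(tmp_name, target)
    return target


records = [
    {"date": "2024-01-01", "value": 1},
    {"date": "2024-01-01", "value": 2},
    {"date": "2024-01-02", "value": 3},
]
out_root = Path(tempfile.mkdtemp())
first_run = run_daily_job(records, "2024-01-01", out_root).read_text()
second_run = run_daily_job(records, "2024-01-01", out_root).read_text()
files_for_day = sorted(p.name for p in (out_root / "date=2024-01-01").iterdir())
```

Running the job twice for the same day leaves exactly one output file with the same content, which is what makes retries and backfills safe.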
35 | 36 | ## operations 37 | - use HAproxy instead of mysql router for hive HA setups for metastore 38 | - centos os for locally installed cluster 39 | - make sure to really consider if the cluster is only meant for analytics or if it is not more reasonable to run the analytics cluster on kubernetes / openshift 40 | - own the base and cloud the spike 41 | - consider hybrid cloud (cross cloud provider) 42 | - adhere to https://www.acm.org/binaries/content/assets/public-policy/2017_usacm_statement_algorithms.pdf and especially look out for any data provenance issues 43 | - make sure to set up a file quota in HDFS per user 44 | - real world ops pains https://drive.google.com/file/d/15nxAaVXZwNFnJNVvgtKonNbzxNgTUCxP/view 45 | - java https://adoptopenjdk.net/releases.html 46 | 47 | ## scale out over multiple datacenters 48 | - https://medium.com/adobetech/creating-the-adobe-experience-platform-pipeline-with-kafka-4f1057a11ef 49 | 50 | ### monitoring 51 | - build in monitoring E2E by design 52 | 53 | ### llap 54 | llap might not start (in case of a small development cluster) if not enough memory is available or a node is down. However, currently in HDP 2.6.4 no meaningful error message is displayed 55 | 56 | ### hardware 57 | - use intel CPUs. Hadoop 3.0 will use erasure coding. The intel ISA-L library can greatly speed up compression https://issues.apache.org/jira/browse/HDFS-7285 58 | 59 | ## machine learning 60 | - no proper strategy for holdout group and prevention of feedback 61 | - model serving not scalable, no fully fledged ML solution available 62 | - when working on a machine learning prototype, the chance is high (if the results look promising) that the model needs to be deployed into a production environment. Business stakeholders will expect a smooth & quick transition to production mode (results already look so great). 
Therefore, make sure to only use data sources which are actually available in a production setting, and make sure to get the data directly at the source 63 | - understand the problem domain. Very often regular k-fold cross validation is not a perfect fit as there is a dependency on time. Use time series cross validation (possibly customized) to perform CV in a setting which resembles the actual business use case 64 | - proper understanding of the data. Check for errors (too much / not enough / wrong units / ...) 65 | - work with and talk to the department to prevent data leakage into the model 66 | - reproducibility is a problem (https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/) and the overall pipeline E2E needs to be thought out very well 67 | - model management is a big problem. This book https://mapr.com/ebooks/machine-learning-logistics/ describes it nicely 68 | - planning http://deon.drivendata.org 69 | - spark pipeline example https://engineering.autotrader.co.uk/2018/10/03/productionizing-days-to-sell.html 70 | 71 | Why do many analytics projects fail? https://www.fast.ai/2018/07/12/auto-ml-1/ 72 | 73 | 74 | It is not always about the algorithm: https://www.youtube.com/watch?v=kYMfE9u-lMo the details and thoughts around it are what matters. 75 | Think about a whole system (and ideally it is simple) which actually delivers value vs. a very complex algorithm only running in a notebook https://www.youtube.com/watch?v=68ABAU_V8qI. 76 | And more importantly: make the results tangible, for example using tools like https://github.com/gradio-app/gradio, to drive trust from business units. 77 | 78 | https://arxiv.org/abs/2107.00079 collects various anti-patterns; https://towardsdatascience.com/how-not-to-do-mlops-96244a21c35e gives a short summary of them. 
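The time series cross validation point from the machine learning list above can be sketched in a few lines. This is a minimal, dependency-free illustration, not a full CV framework: the helper name `expanding_window_splits` and its parameters are assumptions made for this example. Each fold trains only on observations that come before its test window, so no information from the future leaks into training.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # The training window grows with every fold (expanding window)
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Samples are assumed to be sorted by time already
folds = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

For real projects scikit-learn's `TimeSeriesSplit` offers a ready-made splitter of this kind; a custom variant like the one above is mostly useful when the business case needs gaps, fixed-length training windows or calendar-aligned folds.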
79 | 80 | ### business value 81 | - evaluation of models and presentation to non technical audience https://modelplot.github.io, https://github.com/gradio-app/gradio 82 | - before starting out with a data science use-case clearly define what constitutes success and how it is measured (oftentimes this means how to obtain labels in a setting where no previous labelled data was collected) 83 | 84 | ## hiring 85 | - if in doubt hire a data engineer and not a data scientist (especially when starting out to bring data driven processes into the company) assuming the engineer also has a feeling for strategy 86 | - but watch out that communication skills are great as well 87 | - make sure to have enough *process* and engineering in the team to build a solid and *regular* IT software development lifecycle process according to current standards 88 | - not every (great) developer is a great team lead 89 | - https://rework.withgoogle.com holds great resources regarding hiring & team building 90 | - if working for a small company read https://medium.com/apteo/what-makes-a-good-data-scientist-at-a-small-company-3f445d421dff 91 | - https://towardsdatascience.com/red-flags-in-data-science-interviews-4f492bbed4c4 92 | - https://www.montecarlodata.com/blog-5-on-obvious-ways-to-make-data-engineers-love-working-for-you/ 93 | - You measure your data engineer’s customer impact 94 | - Data engineers grow with the company 95 | - Your data ecosystem is easy to navigate for data engineers 96 | - You invest in the tooling data engineers love 97 | - Your data engineers have autonomy 98 | ### teams & organization 99 | - https://about.gitlab.com/handbook/business-technology/data-team/ 100 | - great article: https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/ usually a problem 101 | - either too many data engineers / site reliability engineers 102 | - or a gap (no holistic approach) as separated into segregated teams in different parts of company hierarchy 103 | - do 
**not** use cannons to kill flies https://towardsdatascience.com/what-frustrates-data-scientists-in-machine-learning-projects-3398919a7c79 104 | - inspiration for teams 105 | - https://code.facebook.com/posts/396395830836861/building-data-science-teams-to-have-an-impact-at-scale/?utm_source=codedot_rss_feed&utm_medium=rss&utm_campaign=RSS+Feed 106 | 107 | - start by working on a high profile use case, preferably for an external customer 108 | - work on a single team. Do not fight on multiple fronts (use cases) but learn together. Remember kanban. Better to get a single pipeline fully into production than several only half functioning, and then speed up later on. 109 | - hiring & retaining data talent https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf 110 | - security 111 | - https://brothke.medium.com/how-not-to-hire-for-a-senior-information-security-role-4bf71ce7ee26 112 | - tech lead experiences: https://tomgamon.com/posts/things-i-have-learned-new-tech-lead/ 113 | 114 | - learnings from the crowd: https://github.com/sdg-1/data-team-handbook 115 | 116 | ### culture 117 | - https://www.youtube.com/watch?v=VkeleGIUSM8 118 | 119 | ## big data 120 | **DO NOT do big data!** unless you really have big data and fully understand all the consequences of a distributed system. 121 | Instead, invest a couple of $$ into beefier single node computers. High single thread performance + lots of RAM will make you so much more productive. 122 | 123 | **scalability** 124 | Sometimes extreme scalability is not required! Do not get stuck in thinking you actually need it. Think of a scenario of many events for each user but the number of users being almost constant. Such a scenario can warrant some different algorithms to optimally process the data. 125 | 126 | Still, if required build for scale, i.e. for many users. 127 | But even more important: have a scalable architecture of small and reusable components. 
Git submodules can be a tool which supports this even for otherwise hard to version artifacts. 128 | 129 | **small files problem** 130 | many small files (a lot smaller than the HDFS block size) cause a performance degradation. 131 | workarounds: 132 | - combine input format (spark whole textfiles) 133 | - use sequence files 134 | - Hadoop archives HAR https://github.com/ZuInnoTe/hadoopoffice/wiki/Improve-performance-for-processing-a-lot-of-small-Office-files-with-Hadoop-Archives-(HAR) 135 | - kafka 136 | 137 | **specific helpful issues** 138 | - spark HBase kerberized: https://community.hortonworks.com/content/supportkb/198599/how-to-config-spark-to-connect-to-hbase-in-a-kerbe.html 139 | - approximation design patterns https://streaml.io/blog/eda-real-time-analytics-with-pulsar-functions 140 | 141 | **serving results of big data computation** 142 | - https://lambda.grofers.com/why-physical-storage-of-your-database-tables-might-matter-74b563d664d9 143 | 144 | ## java stuff 145 | ### containers 146 | - java and containers still do not play nicely (java9) https://mesosphere.com/blog/java-container/ 147 | 148 | 149 | ## great quotes 150 | 151 | ### innovation 152 | 153 | - If you separate the thinking about things from the doing of things, then innovation will suffer; https://berthub.eu/articles/posts/how-tech-loses-out/ 154 | 155 | 156 | ## google learnings 157 | - SRE https://sre.google/resources/practices-and-processes/twenty-years-of-sre-lessons-learned/ 158 | 159 | 160 | ## architecture and cloud 161 | 162 | https://thefrugalarchitect.com/ 163 | 164 | ## project management - general 165 | 166 | - Ensure you have no colliding / diametrically opposed goals 167 | - https://github.com/sbalnojan/run-a-data-team 168 | - https://databased.pedramnavid.com/p/its-not-enough-to-be-right?r=lgyh&utm_medium=ios&triedRedirect=true 169 | 170 | 171 | ## cloud 172 | 173 | ### regions 174 | 175 | - consider more than the basic cloud needs 176 | - consider the need for AI and big + many gpus and if 
your desired cloud region offers this capability 177 | 178 | ### de-cloudification / repatriation 179 | 180 | - ensure you control the interface (at least) if not the implementation 181 | 182 | ## logging 183 | 184 | - https://github.com/yandex/perforator 185 | 186 | - https://www.thestack.technology/warren-buffetts-geico-repatriates-work-from-the-cloud-continues-ambitious-infrastructure-overhaul/ 187 | 188 | -------------------------------------------------------------------------------- /llm.md: -------------------------------------------------------------------------------- 1 | # LLM 2 | 3 | ## good content 4 | 5 | - https://jax-ml.github.io/scaling-book/ 6 | - ocr 7 | - https://www.sergey.fyi/articles/gemini-flash-2 and https://www.runpulse.com/blog/why-llms-suck-at-ocr and https://blog.roboflow.com/florence-2-ocr/ 8 | - https://applied-llms.org/ 9 | - https://github.com/ray-project/llm-numbers 10 | - https://github.com/KalyanKS-NLP/llm-engineer-toolkit 11 | - https://koomen.dev/essays/horseless-carriages/ horseless carriages of AI 12 | - https://github.com/mlabonne/llm-course 13 | 14 | 15 | ## prompts 16 | 17 | - https://github.com/preset-io/promptimize 18 | - simple prompt engineering training from anthropic https://docs.google.com/spreadsheets/d/19jzLgRruG9kjUQNKtCg1ZjdD6l6weA6qRXG5zLIAhC8/edit?gid=150872633#gid=150872633 (Anthropic's Prompt Engineering Interactive Tutorial [PUBLIC ACCESS]) 19 | - https://colab.research.google.com/drive/1OnIipRwuHOZbKHN0haHGD0OnckBGfzqx auto optimization 20 | - prompting frameworks 21 | - https://github.com/masci/banks 22 | - https://github.com/boundaryml/baml 23 | - prompt techniques 24 | - CoRT chain of recursive thoughts https://github.com/PhialsBasement/Chain-of-Recursive-Thoughts 25 | 26 | ## tokenization 27 | 28 | - https://github.com/openai/tiktoken 29 | 30 | ## vector databases 31 | 32 | - https://github.com/facebookresearch/faiss 33 | - https://www.pinecone.io/ 34 | - https://www.trychroma.com/ 35 | 36 | ## 
hallucination 37 | - https://docs.vectara.com/docs/learn/hallucination-evaluation 38 | 39 | 40 | ## private 41 | 42 | - https://bdtechtalks.com/2023/06/01/create-privategpt-local-llm/amp/ 43 | 44 | ## stacks 45 | ### serving LLMs 46 | 47 | - https://github.com/ajndkr/lanarky 48 | - https://github.com/deepset-ai/haystack (with Elasticsearch) 49 | 50 | ### full e2e example 51 | 52 | - https://github.com/praj2408/Langchain-PDF-App-GUI 53 | - https://github.com/pipeshub-ai/pipeshub-ai 54 | 55 | ### ready made connectors 56 | 57 | - https://www.danswer.ai/ 58 | - https://n8n.io/ 59 | 60 | ## no code GUI tools 61 | 62 | - https://github.com/FlowiseAI/Flowise 63 | - https://github.com/gmpetrov/databerry 64 | - https://www.docetl.org/ 65 | 66 | ## models 67 | 68 | - https://x.com/clementdelangue/status/1934672721066991908 69 | - https://simonwillison.net/2024/Dec/24/qvq/ 70 | - https://simonwillison.net/2024/Dec/26/deepseek-v3/ 71 | 72 | 73 | ## document preprocessors 74 | 75 | - unstract.com LLM whisperer https://pg.llmwhisperer.unstract.com/ 76 | - https://github.com/Zipstack/unstract 77 | - https://cloud.llamaindex.ai llamaparse 78 | - **https://github.com/DS4SD/docling** 79 | - https://github.com/VikParuchuri/marker 80 | - https://github.com/QuivrHQ/MegaParse 81 | - https://github.com/microsoft/markitdown 82 | - https://github.com/opendatalab/PDF-Extract-Kit 83 | 84 | ### HTML cleanup 85 | 86 | - https://emschwartz.me/comparing-13-rust-crates-for-extracting-text-from-html/ 87 | ### pdf preprocessors 88 | 89 | - https://github.com/opendatalab/MinerU 90 | - https://github.com/DS4SD/docling 91 | 92 | ### entity extraction 93 | 94 | - https://python.useinstructor.com/ 95 | - https://github.com/vlm-run/vlmrun-cookbook and good comments https://news.ycombinator.com/item?id=43187209 96 | - https://github.com/boundaryml/baml 97 | 98 | #### naming entities 99 | 100 | - https://github.com/urchade/GLiNER 101 | 102 | ## security 103 | 104 | - https://github.com/NVIDIA/garak 105 | 
- https://github.com/Azure/PyRIT 106 | - https://www.promptfoo.dev/docs/red-team/quickstart/ 107 | - https://github.com/mbrg/power-pwn 108 | - cryptography 109 | - https://eprint.iacr.org/2025/288 110 | - prompt injection scanning 111 | - https://arstechnica.com/information-technology/2025/04/researchers-claim-breakthrough-in-fight-against-ais-frustrating-security-hole/ https://arxiv.org/abs/2503.18813 112 | ### scraper & automation 113 | 114 | - https://github.com/browser-use/browser-use 115 | - https://github.com/nanobrowser/nanobrowser 116 | 117 | 118 | ### jailbreaks 119 | 120 | - https://github.com/CHATS-lab/persuasive_jailbreaker 121 | 122 | ### demonstrations 123 | 124 | - https://github.com/greshake/llm-security 125 | - https://bishopfox.com/resources/rvasec2024-patch-perfect-harmonizing-llms 126 | 127 | ### ethical challenges 128 | 129 | - https://www.youtube.com/watch?v=giT0ytynSqg 130 | 131 | ### security considerations 132 | 133 | - https://www.microsoft.com/en-us/security/blog/2025/04/24/new-whitepaper-outlines-the-taxonomy-of-failure-modes-in-ai-agents/ 134 | - https://www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/Publications/ANSSI-BSI-joint-releases/LLM-based_Systems_Zero_Trust.pdf?__blob=publicationFile&v=3 135 | - https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/ 136 | 137 | ### security problems 138 | 139 | - https://theyseeyourphotos.com/ 140 | - https://www.404media.co/email/f459caa7-1a58-4f31-a9ba-3cb53a5046a4/ 141 | - https://www.npr.org/2024/12/10/nx-s1-5222574/kids-character-ai-lawsuit 142 | - https://bsky.app/profile/pedram.lol/post/3lbxzc34th22y 143 | - https://www.heise.de/news/Copilot-macht-aus-einem-Gerichtsreporter-einen-Kinderschaender-9840437.html 144 | - https://github.com/leondz/garak 145 | - https://www.youtube.com/watch?v=qyTSOSDEC5M 146 | - https://gandalf.lakera.ai/baseline genAI security training 147 | - 
https://www.heise.de/news/ChatGPT-Code-mit-betruegerischer-API-kostet-Programmierer-2500-US-Dollar-10169146.html 148 | - https://microsoft.github.io/presidio/analyzer/ 149 | - https://www.promptfoo.dev/docs/red-team/quickstart/ 150 | - https://github.com/mbrg/power-pwn 151 | - https://www.youtube.com/watch?v=-YJgcTCSzU0 152 | - https://www.youtube.com/watch?v=LcLrG_4i19E 153 | - https://www.youtube.com/watch?v=LcLrG_4i19E 154 | - backdoor 155 | - https://blog.sshh.io/p/how-to-backdoor-large-language-models 156 | 157 | - email summarization copiarate https://www.youtube.com/watch?v=84NVG1c5LRI 158 | - uncensor with abliteration https://colab.research.google.com/drive/1VYm3hOcvCpbGiqKZb141gJwjdmmCcVpR?usp=sharing 159 | 160 | 161 | identifying security issues 162 | 163 | - graph aware scanning 164 | - https://hiddenlayer.com/innovation-hub/shadowgenes-uncovering-model-genealogy/ 165 | - https://arxiv.org/abs/2501.11830 166 | - MCP 167 | - https://github.com/cisco-ai-defense/mcp-scanner 168 | #### red teaming 169 | 170 | - https://airedteamwhitepapers.blob.core.windows.net/lessonswhitepaper/MS_AIRT_Lessons_eBook.pdf 171 | - https://www.microsoft.com/en-us/security/blog/2025/01/13/3-takeaways-from-red-teaming-100-generative-ai-products/ 172 | - https://github.com/Azure/PyRIT 173 | 174 | 175 | ### data poisoning 176 | 177 | - https://arxiv.org/pdf/2302.04222 glaze 178 | - https://arxiv.org/pdf/2310.13828 night shade 179 | - https://nightshade.cs.uchicago.edu/whatis.html 180 | 181 | ## fine-tuning 182 | 183 | - https://huggingface.co/docs/peft/index 184 | - https://github.com/unslothai/unsloth 185 | - https://github.com/axolotl-ai-cloud/axolotl 186 | - https://github.com/Lightning-AI/litgpt 187 | - https://github.com/hiyouga/LLaMA-Factory 188 | - tool lists 189 | - https://huyenchip.com/llama-police 190 | - https://github.com/meta-llama/synthetic-data-kit 191 | - https://pypi.org/project/tokenlearn/ 192 | - auto eval 
https://colab.research.google.com/drive/1Igs3WZuXAIv9X0vwqiE90QlEPys8e8Oa?usp=sharing 193 | - axolotl https://colab.research.google.com/drive/1TsDKNo2riwVmU55gjuBgB1AXVtRRfRHW?usp=sharing 194 | - unsloth https://colab.research.google.com/drive/164cg_O7SV7G8kZr_JXqLd6VC7pd86-1Z?usp=sharing 195 | 196 | ### newer efficient way of fine-tuning and distillation 197 | 198 | SVD and Marchenko-Pastur seem to be really useful 199 | 200 | - https://github.com/arcee-ai/DistillKit https://www.youtube.com/watch?v=JE7SuP049mQ 201 | - https://github.com/cognitivecomputations/spectrum https://www.youtube.com/watch?v=CTncBjRgktk 202 | - https://github.com/arcee-ai/mergekit 203 | - https://www.youtube.com/watch?v=cvOpX75Kz4M 204 | - https://www.youtube.com/watch?v=qbAvOgGmFuE 205 | - https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing 206 | - https://github.com/arcee-ai/EvolKit 207 | - https://www.arcee.ai/blog/how-arcee-ai-helped-madeline-co-build-a-world-class-reasoning-model-from-first-principles- https://www.arcee.ai/product/arcee-conductor 208 | 209 | ## distributed training 210 | 211 | - https://github.com/microsoft/DeepSpeed 212 | - https://pytorch.org/tutorials/distributed/home.html 213 | - https://lightning.ai/docs/pytorch/stable/ 214 | - https://github.com/hpcaitech/ColossalAI 215 | 216 | ## prototyping 217 | 218 | ### UIs 219 | 220 | - https://github.com/open-webui/open-webui 221 | - https://lmstudio.ai/ 222 | - https://opencat.app/en/ 223 | - https://docs.chainlit.io/get-started/overview 224 | - https://www.gradio.app/ 225 | - https://streamlit.io/ 226 | 227 | ## rag frameworks 228 | 229 | - https://github.com/llmware-ai/llmware 230 | - https://github.com/infiniflow/ragflow?tab=readme-ov-file 231 | - https://github.com/Filimoa/open-parse 232 | - rag tools 233 | - https://www.onyx.app/ 234 | - https://github.com/google-gemini/gemini-fullstack-langgraph-quickstart 235 | - https://github.com/eyelevelai/groundx-on-prem 236 | - 
https://github.com/morphik-org/morphik-core 237 | - https://github.com/timescale/pgai 238 | 239 | ### graph rag 240 | 241 | - https://github.com/Muvon/octocode 242 | 243 | ### rag applications 244 | 245 | - https://arxiv.org/abs/2409.13740 246 | 247 | ## workflows 248 | 249 | - https://workflowai.com/ 250 | 251 | ## agents 252 | 253 | - https://github.com/humanlayer/12-factor-agents 254 | - https://github.com/The-Pocket/PocketFlow 255 | - https://claudiacode.com/ 256 | 257 | ### agent.md/claude.md tips 258 | 259 | - https://blog.sshh.io/p/how-i-use-every-claude-code-feature 260 | 261 | ### parallel agentic development 262 | 263 | - https://conductor.build/ 264 | 265 | ## verification & Testing 266 | 267 | - 10.17323/jle.2020.11046 268 | 269 | 270 | ## education 271 | 272 | ### audio 273 | 274 | - https://livekit.io/ 275 | - https://huggingface.co/Zyphra/Zonos-v0.1-hybrid 276 | - https://github.com/nari-labs/dia 277 | - https://github.com/KoljaB/RealtimeVoiceChat 278 | - https://github.com/resemble-ai/chatterbox 279 | 280 | ### video 281 | 282 | - explanation videos 283 | 284 | ## hosting models 285 | 286 | - https://github.com/containers/ramalama 287 | - https://ollama.com/ 288 | - https://github.com/vllm-project/vllm 289 | - https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct 290 | - https://github.com/sgl-project/sglang 291 | 292 | ## hardware 293 | 294 | on demand AI clusters 295 | 296 | - all the public cloud providers (though they often do not focus on offering a simple/streamlined GPU experience for researchers) 297 | - https://hpc-ai.com 298 | - https://lambdalabs.com/service/gpu-cloud/1-click-clusters 299 | - https://vast.ai/ 300 | - https://nebius.com/ 301 | - runpod 302 | - hyperstack 303 | - coreweave 304 | - crusoe 305 | - https://www.paperspace.com 306 | - https://huggingface.co/spaces 307 | - https://colab.research.google.com/ 308 | - https://www.anyscale.com 309 | - https://www.fluidstack.io 310 | - https://www.clustermax.ai/ 311 | 312 | ### 
validation of hardware 313 | - check cooling issues 314 | - https://github.com/huggingface/gpu-fryer 315 | - https://www.runpod.io/ 316 | - https://github.com/3b1b/manim 317 | - https://github.com/exo-explore/exo 318 | 319 | ## video generation 320 | 321 | - https://github.com/Wan-Video/Wan2.2 322 | ## realtime 323 | ### audio 324 | 325 | #### serving 326 | 327 | - https://fastrtc.org/ 328 | 329 | #### models 330 | - https://github.com/canopyai/Orpheus-TTS 331 | - https://github.com/fixie-ai/ultravox 332 | - https://www.openai.fm/ 333 | - qwen 2.5 omni 334 | - https://huggingface.co/spaces/Qwen/Qwen2.5-Omni-7B-Demo 335 | - https://huggingface.co/Qwen/Qwen2.5-Omni-7B 336 | - https://zeyuet.github.io/AudioX/ 337 | 338 | #### solutions 339 | - https://livekit.io/ 340 | - https://elevenlabs.io/ 341 | 342 | ## data analysis with AI 343 | 344 | - https://github.com/microsoft/data-formulator 345 | 346 | 347 | ## paper finders 348 | 349 | - https://allenai.org/blog/paper-finder 350 | 351 | 352 | ## applications 353 | 354 | ### schooling and tutoring 355 | 356 | - https://www.synthesis.com/tutor 357 | 358 | 359 | 360 | ## ai and politics 361 | 362 | - religion 363 | - vatican AI rules https://www.vatican.va/roman_curia/congregations/cfaith/documents/rc_ddf_doc_20250128_antiqua-et-nova_en.html 364 | 365 | ## mcp 366 | 367 | ### registry 368 | 369 | - https://github.com/IBM/mcp-context-forge 370 | 371 | 372 | ## applications 373 | 374 | ### local inferencing 375 | 376 | - https://github.com/ggml-org/LlamaBarn 377 | - https://lmstudio.ai/ 378 | - https://anythingllm.com/ 379 | - https://localai.io/ 380 | 381 | ### observability 382 | 383 | - https://github.com/Arize-ai/phoenix 384 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # awesome-tools 2 | curated list of awesome tools and libraries for specific domains 3 | 4 | ## Excel in python 5 | - 
pandas 6 | - editing data in jupyter notebooks: https://github.com/quantopian/qgrid 7 | 8 | ## geospatial processing 9 | - python 10 | - plotting raster https://github.com/fmaussion/salem 11 | - raster handling http://xarray.pydata.org/en/stable/ 12 | - multi dimensional arrays http://xarray.pydata.org/en/stable/ 13 | - spatial data including joins (works with dask) http://geopandas.org 14 | - cleaning of addresses: https://github.com/openvenues/libpostal 15 | - postgis 16 | - multi dimensional 17 | - http://www.nearimprov.com/nd-spatial-index-in-postgis/ 18 | - hadoop 19 | - http://www.geomesa.org 20 | - https://github.com/spatialcurrent/ansible-geomesa-accumulo-ubuntu 21 | - https://github.com/DataSystemsLab/GeoSpark 22 | - https://github.com/harsha2010/magellan 23 | - https://github.com/locationtech/geowave 24 | - https://github.com/locationtech/geotrellis 25 | - http://rasterframes.io 26 | - https://github.com/Esri/spatial-framework-for-hadoop and https://github.com/Esri/gis-tools-for-hadoop as well as their java api https://github.com/Esri/geometry-api-java 27 | - https://github.com/ngageoint/mrgeo 28 | - spark on windows 29 | - https://github.com/cdarlint/winutils 30 | - https://github.com/steveloughran/winutils/releases/tag/tag_2017-08-29-hadoop-2.8.1-native 31 | - https://stackoverflow.com/questions/34196302/the-root-scratch-dir-tmp-hive-on-hdfs-should-be-writable-current-permissions 32 | 33 | ### indexing 34 | - https://github.com/riskaware-ltd/open-eaggr 35 | - https://github.com/uber/h3 36 | 37 | 38 | ## nlp 39 | - http://www.nltk.org/book/ 40 | - https://github.com/keon/awesome-nlp 41 | - https://github.com/JohnSnowLabs/spark-nlp 42 | - https://github.com/databricks/spark-corenlp (check license extra carefully for commercial setup) 43 | - pyspark with https://spacy.io 44 | - https://explosion.ai 45 | - https://github.com/clulab/processors 46 | - https://github.com/google/sling 47 | - https://github.com/facebookresearch/faiss 48 | - 
https://github.com/bplank/bilstm-aux 49 | - https://github.com/facebookresearch/fastText 50 | - https://github.com/facebookresearch/InferSent 51 | - https://github.com/google/sentencepiece 52 | - https://github.com/zalandoresearch/flair 53 | - https://github.com/Microsoft/BlingFire 54 | 55 | ### chatbots 56 | - https://rasa.com 57 | 58 | ## clustering & distances 59 | - https://github.com/nmslib/nmslib 60 | - T-SNE interactive https://js.tensorflow.org 61 | - UMAP (faster alternative to T-SNE) https://github.com/lmcinnes/umap 62 | - fast DBSCAN. https://github.com/scikit-learn-contrib/hdbscan 63 | 64 | ### textual preprocessing 65 | - parsing HTML 66 | - https://github.com/GravityLabs/goose 67 | - https://github.com/ruippeixotog/scala-scraper 68 | - clustering 69 | - https://github.com/derrickburns/generalized-kmeans-clustering 70 | 71 | ## operations & monitoring 72 | - general operations 73 | - logging & alerting 74 | - https://sentry.io/welcome/ 75 | - https://bosun.org 76 | - https://www.datadoghq.com/product/ 77 | - https://opencensus.io 78 | - monitoring https://checkmk.com/de 79 | - certificates 80 | - https://certbot.eff.org and https://letsencrypt.org for free and automated https/ssl certificates 81 | - hadoop monitoring 82 | - https://github.com/linkedin/dr-elephant 83 | - https://github.com/qubole/sparklens 84 | - https://sites.google.com/site/sparkbigdebug/home 85 | - https://github.com/SparkMonitor/varOne (not maintained) 86 | - performance test https://github.com/databricks/spark-perf 87 | - https://github.com/conversant/spark-profiler 88 | - testing 89 | - https://www.slideshare.net/hkarau/testing-and-validating-distributed-systems-with-apache-spark-and-apache-beam-fosdem-2018-1?trk=v-feed 90 | - data quality 91 | - hadoop 92 | - https://github.com/FRosner/drunken-data-quality 93 | - packer base images 94 | - http://chef.github.io/bento/ 95 | - data platform 96 | - https://opendatahub.io/ 97 | 98 | ## time series 99 | **small** 100 | - 
[sktime](https://github.com/alan-turing-institute/sktime) and [sktime-dl](https://github.com/sktime/sktime-dl) 101 | - https://github.com/google/temporian 102 | - prediction 103 | - https://www.youtube.com/watch?v=0zpg9ODE6Ww 104 | - https://www.youtube.com/watch?v=68ABAU_V8qI 105 | - https://cran.r-project.org/web/packages/forecast/index.html 106 | - https://github.com/facebook/prophet 107 | - https://pythonawesome.com/probabilistic-time-series-modeling-in-python/ 108 | - bayesian structural https://cran.r-project.org/web/packages/bsts/index.html 109 | - multi prediction 110 | - https://cran.r-project.org/web/packages/forecastHybrid/index.html 111 | - https://cran.r-project.org/web/packages/opera/index.html https://github.com/Dralliag/opera 112 | - https://github.com/ellisp/forecastxgb-r-package 113 | - https://github.com/dmbee/seglearn 114 | - https://pytorch-forecasting.readthedocs.io/en/latest/index.html 115 | - feature extraction 116 | - https://github.com/blue-yonder/tsfresh 117 | - https://robjhyndman.com/hyndsight/tspackages/ 118 | - https://www.featuretools.com (but also worthwhile in general) 119 | - e2e ML pipeline generation: https://evalml.alteryx.com, https://evalml.alteryx.com/en/stable/demos/fraud.html 120 | - anomalies 121 | - https://github.com/Yelp/elastalert 122 | 123 | **hadoop** 124 | - handling & prediction 125 | - https://github.com/sryza/spark-timeseries 126 | - https://spark-summit.org/2016/events/huohua-a-distributed-time-series-analysis-framework-for-spark/ 127 | - https://github.com/twosigma/flint 128 | - https://databricks.gitbooks.io/databricks-spark-reference-applications/content/timeseries/index.html 129 | - correlation https://github.com/Sotera/correlation-approximation 130 | - https://github.com/sryza/spark-timeseries 131 | - anomaly detection 132 | - https://github.com/numenta/htm.java 133 | - https://github.com/YuhaoCheng/PyAnomaly 134 | - examples 135 | - streaming 136 | - 
https://github.com/confluentinc/demo-scene/blob/master/ksql-atm-fraud-detection/ksql-atm-fraud-detection-README.adoc 137 | - storage 138 | - http://www.chronix.io 139 | ## machine learning 140 | **model metadata** 141 | - https://github.com/IDSIA/sacred 142 | - http://studio.ml (also hyper opt) 143 | - https://github.com/mitdbg/modeldb 144 | - https://dataversioncontrol.com 145 | - https://www.comet.ml 146 | - https://aetros.com 147 | - https://github.com/ModelChimp/modelchimp 148 | - https://flyte.org/ 149 | 150 | **model building** 151 | - feature engineering 152 | - https://www.featuretools.com 153 | - small 154 | - http://scikit-learn.org/stable/ 155 | - production processes https://github.com/quantumblacklabs/kedro 156 | - R 157 | - preprocessing 158 | - https://cran.r-project.org/web/packages/vtreat/index.html 159 | - python 160 | - automl 161 | - teapot 162 | - http://automl.github.io/auto-sklearn/stable/ 163 | - https://github.com/kingfengji/gcForest 164 | - lambda https://www.youtube.com/watch?v=eF-4CrS_j_4 165 | - improved python https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda 166 | - https://github.com/mthh/jenkspy breaks 167 | - https://github.com/NeerajSarwan/CPT 168 | - hadoop 169 | - https://spark.apache.org/mllib/ 170 | - https://github.com/amidst/toolbox 171 | - fast aggregation in spark https://github.com/tdunning/t-digest 172 | - ensembling 173 | - http://ml-ensemble.com 174 | - https://github.com/viisar/brew 175 | - https://github.com/civisanalytics/civisml-extensions 176 | - specific great models 177 | - gradient boosted trees 178 | - xgboost 179 | - lightgbm 180 | - https://github.com/Azure/mmlspark 181 | - catboost https://github.com/catboost/catboost 182 | - https://github.com/fabsig/GPBoost 183 | - visualization of results 184 | - https://github.com/DistrictDataLabs/yellowbrick 185 | 186 | 187 | **model serving** 188 | - own API wrapper around original model code 189 | - http://clipper.ai 190 | - 
https://mlflow.org 191 | - https://www.acumos.org 192 | - https://polyaxon.com 193 | - http://vespa.ai 194 | - https://github.com/RedisLabsModules/redis-ml 195 | - https://riseml.com 196 | - https://github.com/Hydrospheredata/mist 197 | - https://github.com/Azure/ai-toolkit-iot-edge 198 | - https://www.dominodatalab.com and various other cloud data science workbenches 199 | - https://datmo.com 200 | - https://aws.amazon.com/de/sagemaker/ 201 | 202 | **model serialization** 203 | - PMML 204 | - https://github.com/combust/mleap 205 | 206 | **hyperparameter tuning** 207 | - https://github.com/kubeflow/katib 208 | - https://sigopt.com 209 | - https://github.com/scikit-optimize/scikit-optimize 210 | - https://github.com/Yelp/MOE 211 | 212 | **e2e** 213 | - https://www.seldon.io 214 | - http://pipeline.ai 215 | - https://datmo.com 216 | - https://docs.ray.io/en/master/serve/index.html 217 | 218 | **ml solutions** 219 | - https://predictionio.incubator.apache.org 220 | - https://www.dominodatalab.com 221 | 222 | **bridging python / r and big data** 223 | - http://blog.madhukaraphatak.com/pipe-in-spark/ 224 | - sparklyr 225 | - https://github.com/apple/turicreate out-of-core models on medium-sized data 226 | 227 | **graph processing** 228 | - hadoop 229 | - https://tinkerpop.apache.org 230 | - https://github.com/dbs-leipzig/gradoop 231 | - https://github.com/gchq/Gaffer 232 | - https://graphframes.github.io/index.html 233 | - connector to neo4j https://github.com/neo4j-contrib/neo4j-spark-connector 234 | - communities 235 | - https://github.com/Sotera/distributed-graph-analytics/tree/master/dga-graphx 236 | - non hadoop 237 | - https://neo4j.com (single master, multi slave cluster possible) 238 | - tutorial 239 | - paradise papers https://github.com/johnymontana/pp-viz/blob/master/jupyter/PP_Viz.ipynb 240 | 241 | ## cool videos 242 | - telco hadoop geospatial 243 | - https://www.youtube.com/watch?v=VtvP54Xo3Ek&feature=youtu.be 244 | - streaming and declarative models: 
https://www.youtube.com/watch?v=Do7C4UJyWCM 245 | - ml 246 | - ml pipelines https://www.youtube.com/watch?v=cpR6Vkp7ImA 247 | - shingles and pipelines https://www.youtube.com/watch?v=qkrh35IF2SU, https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science 248 | - gradient boosting comparison: https://www.youtube.com/watch?v=5CWwwtEM2TA 249 | - streaming 250 | - kafka https://www.youtube.com/watch?v=MNPI925PFD0 251 | - spark streaming in depth https://www.youtube.com/watch?v=hyZU_bw1-ow 252 | - python https://github.com/mrocklin/streamz 253 | - SQL hadoop & BI https://www.youtube.com/watch?v=v40HWIlsE_w&t=0s&list=PLSAiKuajRe2kGgi-GhMVE8IXzr5Pb3b5y&index=13 254 | - BMW self driving car & spark https://www.youtube.com/watch?v=ub2ufKrrAIs&t=0s&list=PLSAiKuajRe2kGgi-GhMVE8IXzr5Pb3b5y&index=27 255 | 256 | ## visualization 257 | - python 258 | - https://python-graph-gallery.com for inspiration 259 | - seaborn 260 | - R 261 | - ggplot2 + great themes 262 | - javascript 263 | - https://uber.github.io/deck.gl/#/ 264 | - https://dc-js.github.io/dc.js/ 265 | 266 | **bi & dashboarding** 267 | - https://metabase.com 268 | - https://looker.com 269 | - python 270 | - https://github.com/stitchfix/pyxley 271 | **notebooks** 272 | - jupyter 273 | - http://beakerx.com 274 | - https://medium.com/netflix-techblog/notebook-innovation-591ee3221233 275 | - zeppelin 276 | 277 | **type safety** 278 | - https://github.com/typelevel/frameless 279 | 280 | ## probabilistic programming 281 | - stan 282 | - pymc3 283 | - https://github.com/uber/pyro 284 | 285 | ## databases 286 | - https://www.cockroachlabs.com (spanner) 287 | - https://www.snowflake.net/de/ 288 | - https://snowplowanalytics.com/products/snowplow-open-source/ 289 | - hbase-spark 290 | - via phoenix spark 291 | - https://github.com/hortonworks/shc-release/tree/HDP-2.6.3.0-235-tag 292 | - postgres on GPUs http://www.brytlyt.com 293 | - improved cassandra scylla http://www.scylladb.com 294 | - https://www.mapd.com/platform/ 
295 | - https://clickhouse.yandex 296 | - https://github.com/biokoda/actordb 297 | - https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/amp/ 298 | ### graph dbs 299 | - streaming graphs https://github.com/NationalSecurityAgency/lemongraph 300 | 301 | **time series DBs** 302 | - https://timescale.com 303 | 304 | **big real time analytics and data integration** 305 | - https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7 306 | - https://www.quora.com/Should-I-use-Gobblin-or-Spark-Streaming-to-injest-data-from-Kafka-to-HDFS/answer/Prithiviraj-Damodaran 307 | 308 | ## scala 309 | ### configuration 310 | - typesafe configuration 311 | - https://cir.is/docs/validation 312 | - https://github.com/pureconfig/pureconfig 313 | 314 | ## bio-informatics 315 | - https://github.com/MG-RAST/AWE 316 | 317 | ## recommender 318 | - https://github.com/actionml/universal-recommender 319 | - https://github.com/DataSystemsLab/recdb-postgresql 320 | 321 | 322 | ## governance 323 | - apache atlas 324 | - cloudera navigator 325 | - https://www.waterlinedata.com (hadoop only) 326 | - https://alation.com (all) 327 | - https://www.privitar.com 328 | 329 | ## blockchain 330 | - data mining 331 | - https://www.openmined.org 332 | 333 | ## next gen big data tools 334 | - https://www.weld.rs 335 | - https://datafusion.rs 336 | - http://pachyderm.io 337 | - https://github.com/frankmcsherry/differential-dataflow 338 | -------------------------------------------------------------------------------- /great_articles.md: -------------------------------------------------------------------------------- 1 | # great articles & books 2 | collection of some nice articles 3 | 4 | ## general tips 5 | - http://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html 6 | - http://designingcx.com/cx-journey-mapping-toolkit 7 | - great and cheap resources 8 | 
- http://startupstash.com 9 | - https://github.com/toddmotto/public-apis 10 | - reproducible research: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285 11 | - https://medium.com/data-engineering/modeling-madly-8b2c72eb52be 12 | 13 | ## statistics 14 | - https://www.statisticsdonewrong.com 15 | 16 | ## processes 17 | - https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview 18 | 19 | ## great demos 20 | - https://community.hortonworks.com/repos/index.html 21 | 22 | ## dashboarding / visualization 23 | - https://eng.uber.com/on-call-dashboard/ 24 | - https://plot.ly/products/dash/ 25 | - http://pbpython.com/effective-matplotlib.html 26 | - https://resourcecards.com 27 | - https://blog.bufferapp.com/53-design-terms-explained-for-marketers 28 | - jupyter slides http://echorand.me/presentation-slides-with-jupyter-notebook.html 29 | - mapbox and http://leafletjs.com 30 | - https://uber.github.io/deck.gl/#/ 31 | ### streams 32 | - https://github.com/TU-Berlin-DIMA/i2 33 | 34 | ## flink 35 | - comcast ML pipeline https://data-artisans.com/flink-forward/resources/embedding-flink-throughout-an-operationalized-streaming-ml-lifecycle 36 | - CQRS event sourcing drive tribe platform on flink https://data-artisans.com/flink-forward/resources/panta-rhei-designing-distributed-applications-with-streams 37 | - spatio temporal event aggregation UBER https://data-artisans.com/flink-forward/resources/scaling-ubers-realtime-optimization-with-apache-flink 38 | - optimization of flink at netflix https://data-artisans.com/flink-forward/resources/scaling-flink-in-cloud 39 | - pravega https://data-artisans.com/flink-forward/resources/scaling-stream-data-pipelines 40 | - SQL motifs patterns https://www.slideshare.net/FlinkForward/flink-forward-berlin-2018-dawid-wysakowicz-detecting-patterns-in-event-streams-with-flink-sql 41 | 42 | ## dataflow 43 | - apache nifi 44 | - 
https://pierrevillard.com/2018/08/29/monitoring-driven-development-with-nifi-1-7/ 45 | - apache camel 46 | - akka + alpakka 47 | - https://kylo.io 48 | 49 | **governance** 50 | - apache atlas 51 | - https://github.com/linkedin/WhereHows 52 | 53 | ## data management principles 54 | - https://www.youtube.com/watch?v=DbViPBTq8Xc 55 | 56 | ## geospatial data 57 | - https://thehftguy.com/2017/07/19/what-does-it-really-take-to-track-100-million-cell-phones/ 58 | - https://matthewrocklin.com/blog//work/2017/09/21/accelerating-geopandas-1 59 | 60 | ### visualization 61 | - https://terrestris.github.io/react-geo/ 62 | 63 | ## machine learning 64 | 65 | - jupyter notebook tricks 66 | - https://codeburst.io/jupyter-notebook-tricks-for-data-science-that-enhance-your-efficiency-95f98d3adee4 67 | - https://github.com/mpastell/Pweave 68 | - https://michelleful.github.io/code-blog/2015/06/20/pipelines/ 69 | - xgboost lightgbm tuning https://medium.com/@Laurae2/getting-the-most-of-xgboost-and-lightgbm-speed-compiler-cpu-pinning-374c38d82b86 70 | - tutorial python ml http://nbviewer.jupyter.org/github/mdeff/python_tour_of_data_science/blob/with_outputs/python_tour_of_data_science.ipynb 71 | - advanced numpy tricks http://nbviewer.jupyter.org/github/vlad17/np-learn/blob/master/presentation.ipynb?flush_cache=true 72 | 73 | ## python tools 74 | 75 | ### ioc dependency injection 76 | - https://github.com/twgOren/roroioc 77 | 78 | ### hypothesis validation 79 | - https://eng.uber.com/maze/ 80 | 81 | ### visualization & data exploration 82 | - https://github.com/ContextLab/hypertools 83 | 84 | ### learning 85 | - https://www.kaggle.com/learn/overview 86 | - gbm https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d 87 | 88 | ### platforms 89 | - https://eng.uber.com/michelangelo/ 90 | - fb-learner flow 91 | - h2o.ai 92 | 93 | ### obtaining labelled data 94 | - https://github.com/Labelbox/Labelbox 95 | - https://prodi.gy 96 | 97 | ### categorical data 98 | - 
https://stats.stackexchange.com/q/263009/68300 99 | 100 | ### robust training 101 | - https://www.youtube.com/watch?v=5vlkVjGLsLc 102 | - https://www.youtube.com/watch?v=5vlkVjGLsLc 103 | 104 | ### deployment 105 | - https://github.com/Verizon/trapezium 106 | - http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf 107 | - https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf 108 | - https://loads.pickle.me.uk/2016/04/04/deploying-a-scikit-learn-classifier-to-production/ 109 | - https://news.ycombinator.com/item?id=13821217 110 | - https://www.kdnuggets.com/2016/11/moving-machine-learning-practice-production.html 111 | - https://github.com/lyg5623/lightgbm_predict4j 112 | - https://www.slideshare.net/xamat/10-more-lessons-learned-from-building-machine-learning-systems/12-However_It_is_not_always 113 | - https://github.com/opendatagroup/hadrian 114 | - https://docs.microsoft.com/de-de/azure/machine-learning/desktop-workbench/model-management-overview 115 | 116 | #### model explanation 117 | - https://github.com/marcotcr/lime 118 | - https://github.com/datascienceinc/Skater/blob/master/README.rst 119 | - https://github.com/slundberg/shap 120 | - https://github.com/csinva/imodels 121 | - https://github.com/pbiecek/DALEX/ 122 | - https://github.com/EthicalML/XAI 123 | 124 | ### evaluation 125 | - https://www.youtube.com/watch?v=WKAuXlsq6xw 126 | - https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html 127 | - scoring process in production https://www.youtube.com/watch?v=-rGRHrED94Y 128 | - security e2e tutorial https://github.com/albahnsen/ML_SecurityInformatics 129 | - scoring https://www.r-bloggers.com/a-budget-of-classifier-evaluation-measures/ 130 | - imbalanced scoring http://www.kdnuggets.com/2016/08/learning-from-imbalanced-classes.html 131 | 132 | ### meta modeling 133 | - https://www.youtube.com/watch?v=Q0QmziFcfU0 134 | 135 | ## deeplearning 136 | - http://course.fast.ai/start.html 137 | - 
https://github.com/astorfi/TensorFlow-World-Resources/blob/master/README.rst 138 | - https://medium.com/towards-data-science/building-a-real-time-object-recognition-app-with-tensorflow-and-opencv-b7a2b4ebdc32 139 | - https://medium.com/towards-data-science/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9 140 | - http://briansp2020.github.io/2017/11/05/fast_ai_ROCm/ 141 | - densenet https://arxiv.org/abs/1608.06993 142 | - vision 143 | - https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/ 144 | **handling images** 145 | - https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5 146 | ## big data 147 | - scalability 148 | - https://www.cakesolutions.net/teamblogs/scaling-machine-learning 149 | 150 | - spark 151 | - improve spark performance 152 | - calculate statistics https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-cost-based-optimization.html 153 | - https://github.com/high-performance-spark/high-performance-spark-examples 154 | - http://www.stephenzoio.com/creating-composable-data-pipelines-spark/ 155 | - ml & pipelines 156 | - https://www.youtube.com/watch?v=B6xequGNM20&list=PLYX1a6mVbBmzZTnuB4niJHiyzEEqYsGLN&index=6 157 | - hypothesis checking 158 | - https://fullstackml.com/how-to-check-hypotheses-with-bootstrap-and-apache-spark-cd750775286a 159 | - etl 160 | - custom encoders: https://github.com/vaquarkhan/vk-wiki-notes/wiki/Apache-Spark-custom-Encoder-example 161 | - testing 162 | - https://github.com/dwestheide/kontextfrei 163 | - streaming 164 | - https://github.com/jbonofre/beam-samples 165 | - hbase 166 | - bulk loading from spark http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/ 167 | - coprocessors 168 | - good bad and ugly of coprocessors from bloomberg https://www.youtube.com/watch?v=9NAPLmCB2sA 169 | - books 170 | - covering all the basic concepts: Designing Data-Intensive Applications 
http://shop.oreilly.com/product/0636920032175.do 171 | - streaming 172 | - kafka https://www.confluent.io/wp-content/uploads/confluent-kafka-definitive-guide-complete.pdf 173 | - security kafka https://www.confluent.io/kafka-summit-london18/kafka-as-a-service-a-tale-of-security-and-multi-tenancy 174 | - monitoring kafka https://www.confluent.io/kafka-summit-london18/monitor-kafka-like-a-pro 175 | - kafka for micro service communication https://github.com/confluentinc/qcon-microservices 176 | - mastering spark for data science - use cases end 2 end https://www.youtube.com/watch?v=B6xequGNM20&list=PLYX1a6mVbBmzZTnuB4niJHiyzEEqYsGLN&index=6 177 | - spark akka http://www.stephenzoio.com/spark-cluster-execution-with-akka/ 178 | - spark functional programming 179 | - https://www.iravid.com/posts/fp-and-spark.html 180 | - spark deployment 181 | - http://data-informed.com/spark-ml-from-lab-to-production-picking-the-right-deployment-architecture-for-your-business/ 182 | - e2e config https://github.com/cloudera-labs/envelope 183 | - pdf 184 | - https://github.com/EDS-APHP/SparkPdfExtractor?files=1 185 | 186 | ### spark ml 187 | - https://transmogrif.ai/ 188 | 189 | ## nlp 190 | - https://www.datacamp.com/community/tutorials/lda2vec-topic-model 191 | - https://juliasilge.com/blog/tidy-word-vectors/ 192 | - https://www.youtube.com/watch?time_continue=23&v=-lx2shfA-5s 193 | - https://arxiv.org/abs/1508.07909 194 | - https://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/ 195 | 196 | **embeddings** 197 | - https://distill.pub/2016/misread-tsne/ 198 | 199 | ## timeseries 200 | - https://www.kdnuggets.com/2017/11/automated-feature-engineering-time-series-data.html?lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3BgzlYOgbKSVemJoxvulGFZg%3D%3D 201 | - http://www.unofficialgoogledatascience.com/2017/07/fitting-bayesian-structural-time-series.html 202 | - https://www.kdnuggets.com/2016/11/combining-different-methods-create-advanced-time-series-prediction.html 203 | - 
https://facebookincubator.github.io/prophet/ 204 | - http://www.unofficialgoogledatascience.com/2017/04/our-quest-for-robust-time-series.html 205 | 206 | ### anomaly detection 207 | - anomaly detection https://github.com/htm-community/flink-htm 208 | - robfilter (R) 209 | - https://zedoul.github.io/cbar/articles/quickstart.html 210 | 211 | ### filtering 212 | - kalman filter and big data https://www.crcpress.com/authors/news/i3194-kalman-filter-at-the-age-of-big-data-programming-in-spark-scala 213 | - introductory book http://nbviewer.jupyter.org/github/rlabbe/Kalman-and-Bayesian-Filters-in-Python/blob/master/table_of_contents.ipynb 214 | 215 | ## streaming data 216 | - probabilistic streaming data types 217 | - http://nuit-blanche.blogspot.co.at/2016/12/datasketches-sketches-library-from.html 218 | - spark 219 | - https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/ 220 | - akka streams 221 | - api scraping http://pascalbugnion.net/blog/scraping-apis-with-akka-streams.html 222 | 223 | ### streaming intro 224 | - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 225 | 226 | ## podcasts 227 | - https://roaringelephant.org 228 | 229 | 230 | ## developer tooling 231 | - formatting 232 | - https://grysz.com/2015/11/13/reformat-java-code-in-intellij-according-to-uncle-bobs-stepdown-rule/ 233 | 234 | ## timeseries 235 | - https://www.kdnuggets.com/2017/11/automated-feature-engineering-time-series-data.html 236 | - https://eng.uber.com/neural-networks/ 237 | 238 | ## Bayesian stuff 239 | - http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/ 240 | 241 | ## great EDA (exploratory data analysis) tips 242 | - https://www.kaggle.com/thie1e/exploratory-analysis-rossmann 243 | 244 | ## sequence prediction & ranking 245 | - lambdamart / learning to rank 246 | - https://github.com/tedgueniche/IPredict compact prediction tree 247 | 248 | ## graphs 249 | - 
https://www.slideshare.net/SparkSummit/finding-graph-isomorphisms-in-graphx-and-graphframes 250 | - https://towardsdatascience.com/record-linking-with-apache-sparks-mllib-graphx-d118c5f31f83 251 | - distributed large RDF processing http://sansa-stack.net 252 | 253 | ### interactive graphs 254 | - https://github.com/nationalsecurityagency/lemongraph 255 | 256 | ## small graphs 257 | - c++ efficient stuff 258 | - https://github.com/BorgwardtLab/graph-kernels 259 | 260 | ## A/B testing 261 | - https://blog.insightdatascience.com/statistical-advice-for-a-b-testing-28654a24b9f0 262 | - http://www.unofficialgoogledatascience.com/2018/01/designing-ab-tests-in-collaboration.html 263 | - https://booking.ai/how-booking-com-increases-the-power-of-online-experiments-with-cuped-995d186fff1d 264 | 265 | 266 | ## databases 267 | ### postgres 268 | - https://www.pgcon.org/2017/schedule/attachments/443_lies-damned-lies-and-statistics.pdf 269 | 270 | ### graph databases 271 | - neo4j 272 | - http://janusgraph.org 273 | 274 | ### hadoop replacements? 
275 | - http://pachyderm.io 276 | - https://github.com/vaexio/vaex 277 | - http://dask.pydata.org 278 | - deploy python on yarn 279 | - https://conda.github.io/conda-pack/ 280 | - google spanner DB 281 | 282 | ### db tuning 283 | - https://tech.gotinder.com/geosharded-recommendations-part-1-sharding-approach-2/ 284 | 285 | ### db comparisons 286 | - accumulo cassandra hbase http://accumulosummit.com/program/talks/comparing-accumulo-cassandra-hbase/ 287 | 288 | ## mobile apps 289 | - https://github.com/infinitered/ignite 290 | 291 | ## security 292 | ### fingerprinting 293 | - https://medium.com/@rukavitsya/canvas-fingerprinting-cookies-on-steroids-253f43c7e293 294 | - https://www.blackhat.com/docs/eu-17/materials/eu-17-Shuster-Passive-Fingerprinting-Of-HTTP2-Clients-wp.pdf 295 | 296 | 297 | ## computational geometry 298 | - https://www.cgal.org 299 | - geospatial algorithms https://i11www.iti.kit.edu/teaching/sommer2013/algokartografie/index 300 | 301 | 302 | ## messaging & RPC 303 | - https://github.com/uber/prototool 304 | 305 | 306 | ## use cases 307 | ### machine learning 308 | - machine learning on source code https://github.com/src-d/awesome-machine-learning-on-source-code 309 | - https://engine.sourced.tech/v0.6.1/ 310 | - https://github.com/src-d/ml 311 | 312 | ## terminal tools 313 | - https://hackernoon.com/macbook-my-command-line-utilities-f8a121c3b019 314 | 315 | 316 | ## optimization 317 | - https://multithreaded.stitchfix.com/blog/2018/06/21/constrained-optimization/ 318 | -------------------------------------------------------------------------------- /data-management.md: -------------------------------------------------------------------------------- 1 | # Data Management 2 | 3 | ## learning 4 | 5 | - https://www.reddit.com/r/dataengineering/comments/1491swe/a_mustread_data_engineering_collection/ 6 | - stack overview https://motherduck.com/blog/data-engineering-toolkit-essential-tools/ 7 | - https://github.com/rewrite-bigdata-in-rust/RBIR 8 | 9 | 
10 | ## stack example 11 | 12 | - data 13 | - https://github.com/cnstlungu/portable-data-stack-dagster/tree/main 14 | - https://github.com/dagster-io/dagster-open-platform 15 | - https://medium.com/@edsoncezar16/asset-level-concurrency-limits-via-declarative-scheduling-c4da3a875cf4 16 | - observability in dagster with grafana/prometheus 17 | - https://metaops.solutions/blog/dagster-monitoring-prometheus-dbt-assets-sql-transformations-part-2 18 | - https://metaops.solutions/blog/dagster-monitoring-prometheus-system-metrics-custom-assets-part-1 19 | 20 | ### editor tooling 21 | 22 | databases 23 | 24 | - https://studio.outerbase.com/ 25 | 26 | ## environment handling 27 | 28 | - https://palm-cli.readthedocs.io/en/latest/introduction/examples.html 29 | 30 | ## choice of database 31 | 32 | - cloud databases 33 | - https://towardsdatascience.com/datastore-choices-sql-vs-nosql-database-ebec24d56106 really great overview 34 | 35 | - postgres based 36 | - https://github.com/paradedb/paradedb 37 | 38 | 39 | ## storage systems 40 | 41 | ### blob storage 42 | 43 | - https://tech.preferred.jp/en/blog/a-year-with-apache-ozone/ 44 | 45 | ## scheduler 46 | - https://www.digdag.io 47 | 48 | 49 | ## permissions 50 | - AWS policy generator https://github.com/salesforce/policy_sentry/blob/master/README.md 51 | ## ETL 52 | 53 | ### methodology 54 | 55 | - https://de.wikipedia.org/wiki/Data_Vault 56 | - metrics layer: https://www.sspaeti.com/blog/analytics-api-with-graphql-the-next-level-of-data-engineering/ 57 | 58 | ### input validation 59 | 60 | - https://github.com/JakobGM/patito 61 | 62 | ### batch 63 | - https://github.com/volcano-sh/volcano 64 | - excel automation https://www.xlwings.org/ 65 | - spark 66 | - SQL tuning 67 | - https://www.youtube.com/watch?v=_Ne27JcLnEc 68 | - lakehouse spark 3.x & delta: https://www.youtube.com/watch?v=iog5feADeXc 69 | - https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison 70 | - optimizing 
delta https://www.youtube.com/watch?v=o2k9PICWdx0 71 | - auto optimization https://github.com/dataflint/spark 72 | 73 | - dagster tips 74 | - https://github.com/sspaeti-com/awesome-dagster 75 | - https://github.com/sephib/dagster-graph-project 76 | - https://github.com/slamer59/dagster-mlflow 77 | - https://github.com/broadinstitute/dagster-utils 78 | - https://github.com/AntonFriberg/dagster-project-example 79 | - https://github.com/thedmi/dagster-celery-docker-example 80 | - https://github.com/kahnwong/dagster-demo 81 | - https://github.com/VladX09/dagster-celery-docker-bug/tree/non-zero-exit-example with https://github.com/dagster-io/dagster/issues/5008 82 | - https://github.com/astenuz/dagster-celery-test 83 | - https://github.com/xyzy-web/dagster-exchangerates 84 | - airflow problems 85 | - https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753 86 | - examples 87 | - https://github.com/dagster-io/mdsfest-opensource-mds 88 | 89 | #### DBT 90 | 91 | - dbt-loom with dagster https://github.com/cnolanminich/dbt-loom-example/tree/demo-with-dagster 92 | - https://github.com/borjavb/dbt-iceberg-poc 93 | 94 | #### data frames 95 | 96 | - https://github.com/xorq-labs/xorq 97 | 98 | ### schema 99 | 100 | - dynamic schema https://www.youtube.com/watch?v=No55ImP-Jic 101 | 102 | #### schema changes 103 | 104 | - https://github.com/datafold/data-diff 105 | 106 | ### streaming 107 | 108 | - www.gentlydownthe.stream 109 | - https://gazette.readthedocs.io/en/latest/ 110 | - https://github.com/drasi-project/drasi-platform 111 | 112 | #### flink deployments 113 | 114 | - k8s operator for blue green deployment 115 | - https://github.com/lyft/flinkk8soperator 116 | 117 | ### unit testing 118 | - hive sql unit testing with code coverage https://github.com/HotelsDotCom/mutant-swarm 119 | - pyspark 120 | - https://medium.com/@gu.martinm/pyspark-unit-integration-and-end-to-end-tests-c2ba71467d85 121 | 122 | ### native code 123 | 124 | - 
wrapping native code 125 | - https://github.com/bytedeco/javacpp 126 | - https://www.youtube.com/watch?v=5GvfPGcdhqI PCAP example 127 | 128 | #### spark 129 | - https://github.com/swoop-inc/spark-records 130 | - https://github.com/swoop-inc/spark-alchemy 131 | - ETL frameworks 132 | - https://github.com/CoxAutomotiveDataSolutions/waimak 133 | - https://arc.tripl.ai/ 134 | - fine tuning catalyst optimizer 135 | - https://www.youtube.com/watch?v=IjqC2Y2Hd5k 136 | - streaming optimization 137 | - https://medium.com/@Iqbalkhattra85/optimize-spark-structured-streaming-for-scale-d5dc5dee0622 138 | - delta merge optimization 139 | - https://www.youtube.com/watch?v=o2k9PICWdx0&t=50s 140 | 141 | ### classical DB 142 | - https://pgdash.io/blog/postgres-features.html 143 | 144 | 145 | ### caching 146 | 147 | - redis tips: 148 | - https://developer.redis.com/howtos/antipatterns/ 149 | 150 | ### ETL tools 151 | - https://dataform.co/ 152 | - https://www.getdbt.com 153 | 154 | ### ETL in python 155 | - https://github.com/petl-developers/petl 156 | - https://github.com/python-bonobo/bonobo 157 | - https://www.singer.io 158 | - https://meltano.com/ (reference runner for singer) 159 | - https://airbyte.io/ 160 | - iceberg 161 | - https://github.com/datazip-inc/olake 162 | 163 | #### dbt and governance 164 | 165 | - https://github.com/dbt-labs/dbt-project-evaluator 166 | - https://github.com/tobymao/sqlglot 167 | - https://github.com/Montreal-Analytics/dbt-gloss 168 | - https://github.com/offbi/pre-commit-dbt 169 | - team structure (monorepo or not) https://github.com/dbt-labs/dbt-core/discussions/5244 170 | - https://blog.montrealanalytics.com/blue-green-deployment-with-dbt-and-snowflake-922f1c658011 171 | - https://www.getdbt.com/analytics-engineering/case-for-elt-workflow/ 172 | - https://montrealanalytics.notion.site/Coalesce-Workshop-Guide-6382db82046f41599e9ec39afb035bdb and https://github.com/Montreal-Analytics/poutineshop-public and video of the workshop 
https://www.youtube.com/watch?v=L6ixHejZX5A&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=42 173 | 174 | #### unit testing data - validating assumptions 175 | 176 | - https://bulwark.readthedocs.io/en/latest/index.html 177 | - https://github.com/datafold/data-diff 178 | 179 | ### ingestion 180 | 181 | #### convenience 182 | 183 | - https://slingdata.io/ 184 | - https://airbyte.com/ 185 | - https://docs.dagster.io/integrations/embedded-elt 186 | - https://www.prequel.co/ 187 | 188 | #### big data 189 | 190 | - https://inlong.apache.org/ 191 | - https://seatunnel.apache.org/ 192 | 193 | ### orchestration 194 | - Airflow 195 | - Prefect 196 | - https://dagster.readthedocs.io 197 | 198 | ### monitoring 199 | - health dashboard https://github.com/criteo/slab 200 | 201 | ## k8s 202 | ### failures 203 | - https://github.com/hjacobs/kubernetes-failure-stories 204 | 205 | ## load tests 206 | - stress testing https://github.com/open-chaos/experiment-catalog 207 | 208 | ### unified batch & streaming 209 | - https://www.youtube.com/watch?v=4qSlsYogALo 210 | 211 | 212 | ### hybrid cloud 213 | 214 | #### data sync 215 | - NiFi https://medium.com/@abdelkrim.hadjidj/hub-and-spoke-architectures-with-nifi-site-to-site-communications-at-any-level-a-nifi-1-10-a8702f77c66e 216 | 217 | ## self service tools 218 | 219 | - https://www.gooddata.com/ 220 | 221 | 222 | ## api design 223 | 224 | - https://martinfowler.com/articles/cant-buy-integration.html 225 | 226 | 227 | 228 | # end user productivity 229 | 230 | - https://www.tadviewer.com/ 231 | 232 | # quality 233 | 234 | - https://github.com/Quantco/dataframely 235 | 236 | ## consistency 237 | 238 | # BI 239 | 240 | - https://evidence.dev/ 241 | 242 | 243 | - https://github.com/IvorySQL/IvorySQL 244 | 245 | ## scaling DBT 246 | 247 | - Unlocking model governance and multi-project deployments with dbt-meshify - Coalesce 2023 https://www.youtube.com/watch?v=FAsY0Qx8EyU https://dbt-labs.github.io/dbt-meshify/0.5/ 248 | - datafold demo
https://www.youtube.com/watch?v=5Xxm6cYRmFg 249 | - JSON schema https://www.youtube.com/watch?v=z8y7mvOUFsY https://github.com/GClunies/Reflekt 250 | 251 | ## web automation 252 | 253 | - https://playwright.dev/ (apparently good for scraping) 254 | 255 | 256 | ## modern data startups 257 | 258 | - https://localfirstweb.dev/ 259 | - https://www.5x.co/ 260 | - https://www.propeldata.com/ 261 | - https://www.tinybird.co/ 262 | 263 | 264 | ## interesting databases 265 | 266 | - Postgres 267 | - DuckDB 268 | - https://www.youtube.com/watch?v=PoHfh6O43uE 269 | - Starrocks 270 | - https://github.com/datafuselabs/databend 271 | - https://cratedb.com/ 272 | - https://www.hydra.so/ 273 | - https://neon.tech/ 274 | - https://github.com/TFMV/featherman 275 | - https://kuzudb.com/ 276 | 277 | ### streaming 278 | 279 | - https://github.com/infinyon/fluvio 280 | - https://risingwave.com/ 281 | - https://materialize.com/ 282 | - https://www.decodable.co/ 283 | 284 | ### multimodal 285 | 286 | - https://www.lancedb.com/ 287 | ### graph 288 | 289 | - https://www.youtube.com/watch?v=X_RFo616M_U 290 | - https://kuzudb.com/ replaced by https://ladybugdb.com/ 291 | 292 | ### vector 293 | 294 | - https://turbopuffer.com/ 295 | ### timeseries 296 | 297 | - timescale https://www.timescale.com/ 298 | - https://www.timeplus.com/ 299 | - https://questdb.io/ 300 | - https://tembo.io/blog/pg-timeseries 301 | - https://github.com/GreptimeTeam/greptimedb 302 | - duckdb inspired 303 | - https://github.com/Basekick-Labs/arc 304 | #### immutable 305 | 306 | - https://xtdb.com/ 307 | 308 | ## learning 309 | 310 | - https://www.reddit.com/r/dataengineering/comments/1491swe/a_mustread_data_engineering_collection/ 311 | - stack overview https://motherduck.com/blog/data-engineering-toolkit-essential-tools/ 312 | - https://github.com/rewrite-bigdata-in-rust/RBIR 313 | 314 | 315 | ## stack example 316 | 317 | - data 318 | - https://github.com/cnstlungu/portable-data-stack-dagster/tree/main 319 | - 
https://github.com/dagster-io/dagster-open-platform 320 | - https://medium.com/@edsoncezar16/asset-level-concurrency-limits-via-declarative-scheduling-c4da3a875cf4 321 | - observability in dagster with Grafana/Prometheus 322 | - https://metaops.solutions/blog/dagster-monitoring-prometheus-dbt-assets-sql-transformations-part-2 323 | - https://metaops.solutions/blog/dagster-monitoring-prometheus-system-metrics-custom-assets-part-1 324 | 325 | ### editor tooling 326 | 327 | databases 328 | 329 | - https://studio.outerbase.com/ 330 | 331 | ## environment handling 332 | 333 | - https://palm-cli.readthedocs.io/en/latest/introduction/examples.html 334 | 335 | ## choice of database 336 | 337 | - cloud databases 338 | - https://towardsdatascience.com/datastore-choices-sql-vs-nosql-database-ebec24d56106 really great overview 339 | 340 | - Postgres-based 341 | - https://github.com/paradedb/paradedb 342 | 343 | 344 | ## storage systems 345 | 346 | ### blob storage 347 | 348 | - https://tech.preferred.jp/en/blog/a-year-with-apache-ozone/ 349 | 350 | ## scheduler 351 | - https://www.digdag.io 352 | 353 | 354 | ## permissions 355 | - AWS policy generator https://github.com/salesforce/policy_sentry/blob/master/README.md 356 | ## ETL 357 | 358 | ### methodology 359 | 360 | - https://de.wikipedia.org/wiki/Data_Vault 361 | - metrics layer: https://www.sspaeti.com/blog/analytics-api-with-graphql-the-next-level-of-data-engineering/ 362 | 363 | ### input validation 364 | 365 | - https://github.com/JakobGM/patito 366 | 367 | ### batch 368 | - https://github.com/volcano-sh/volcano 369 | - excel automation https://www.xlwings.org/ 370 | - spark 371 | - SQL tuning 372 | - https://www.youtube.com/watch?v=_Ne27JcLnEc 373 | - lakehouse spark 3.x & delta: https://www.youtube.com/watch?v=iog5feADeXc 374 | - https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison 375 | - optimizing delta https://www.youtube.com/watch?v=o2k9PICWdx0 376 | - auto
optimization https://github.com/dataflint/spark 377 | 378 | - dagster tipps 379 | - https://github.com/sspaeti-com/awesome-dagster 380 | - https://github.com/sephib/dagster-graph-project 381 | - https://github.com/slamer59/dagster-mlflow 382 | - https://github.com/broadinstitute/dagster-utils 383 | - https://github.com/AntonFriberg/dagster-project-example 384 | - https://github.com/thedmi/dagster-celery-docker-example 385 | - https://github.com/kahnwong/dagster-demo 386 | - https://github.com/VladX09/dagster-celery-docker-bug/tree/non-zero-exit-example with https://github.com/dagster-io/dagster/issues/5008 387 | - https://github.com/astenuz/dagster-celery-test 388 | - https://github.com/xyzy-web/dagster-exchangerates 389 | - airflow problems 390 | - https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753 391 | - examples 392 | - https://github.com/dagster-io/mdsfest-opensource-mds 393 | 394 | #### DBT 395 | 396 | - dbt-loom with dagster https://github.com/cnolanminich/dbt-loom-example/tree/demo-with-dagster 397 | - https://github.com/borjavb/dbt-iceberg-poc 398 | 399 | #### data frames 400 | 401 | - https://github.com/xorq-labs/xorq 402 | 403 | ### schema 404 | 405 | - dynamic schema https://www.youtube.com/watch?v=No55ImP-Jic 406 | 407 | #### schema changes 408 | 409 | - https://github.com/datafold/data-diff 410 | 411 | ### streaming 412 | 413 | - www.gentlydownthe.stream 414 | - https://gazette.readthedocs.io/en/latest/ 415 | - https://github.com/drasi-project/drasi-platform 416 | 417 | #### flink deployments 418 | 419 | - k8s operator for blue green deployment 420 | - https://github.com/lyft/flinkk8soperator 421 | 422 | ### unit testing 423 | - hive sql unit testing with code coverage https://github.com/HotelsDotCom/mutant-swarm 424 | - pyspark 425 | - https://medium.com/@gu.martinm/pyspark-unit-integration-and-end-to-end-tests-c2ba71467d85 426 | 427 | ### native code 428 | 429 | - wrapping native code 430 | - 
https://github.com/bytedeco/javacpp 431 | - https://www.youtube.com/watch?v=5GvfPGcdhqI PCAP example 432 | 433 | #### spark 434 | - https://github.com/swoop-inc/spark-records 435 | - https://github.com/swoop-inc/spark-alchemy 436 | - framework ETL 437 | - https://github.com/CoxAutomotiveDataSolutions/waimak 438 | - https://arc.tripl.ai/ 439 | - fine tuning catalyst optimizer 440 | - https://www.youtube.com/watch?v=IjqC2Y2Hd5k 441 | - streaming optimization 442 | - https://medium.com/@Iqbalkhattra85/optimize-spark-structured-streaming-for-scale-d5dc5dee0622 443 | - delta merge optimization 444 | - https://www.youtube.com/watch?v=o2k9PICWdx0&t=50s 445 | 446 | ### classical DB 447 | - https://pgdash.io/blog/postgres-features.html 448 | 449 | 450 | ### cachings 451 | 452 | - redis tips: 453 | - https://developer.redis.com/howtos/antipatterns/ 454 | 455 | ### ETL tools 456 | - https://dataform.co/ 457 | - https://www.getdbt.com 458 | 459 | ### ETL in python 460 | - https://github.com/petl-developers/petl 461 | - https://github.com/python-bonobo/bonobo 462 | - https://www.singer.io 463 | - https://meltano.com/ (reference runner for singer) 464 | - https://airbyte.io/ 465 | - iceberg 466 | - https://github.com/datazip-inc/olake 467 | 468 | #### dbt and governance 469 | 470 | - https://github.com/dbt-labs/dbt-project-evaluator 471 | - https://github.com/tobymao/sqlglot 472 | - https://github.com/Montreal-Analytics/dbt-gloss 473 | - https://github.com/offbi/pre-commit-dbt 474 | - team structure https://github.com/dbt-labs/dbt-core/discussions/5244 monorepo or not 475 | - https://blog.montrealanalytics.com/blue-green-deployment-with-dbt-and-snowflake-922f1c658011 476 | - https://www.getdbt.com/analytics-engineering/case-for-elt-workflow/ 477 | - https://montrealanalytics.notion.site/Coalesce-Workshop-Guide-6382db82046f41599e9ec39afb035bdb and https://github.com/Montreal-Analytics/poutineshop-public and video of the workshop 
https://www.youtube.com/watch?v=L6ixHejZX5A&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=42 478 | 479 | #### unit testing data - validating assumptions 480 | 481 | - https://bulwark.readthedocs.io/en/latest/index.html 482 | - https://github.com/datafold/data-diff 483 | 484 | ### ingestion 485 | 486 | #### convenience 487 | 488 | - https://slingdata.io/ 489 | - https://airbyte.com/ 490 | - https://docs.dagster.io/integrations/embedded-elt 491 | - https://www.prequel.co/ 492 | 493 | #### big data 494 | 495 | - https://inlong.apache.org/ 496 | - https://seatunnel.apache.org/ 497 | 498 | ### orchestration 499 | - Airflow 500 | - Prefect 501 | - https://dagster.readthedocs.io 502 | 503 | ### monitoring 504 | - health dashboard https://github.com/criteo/slab 505 | 506 | ## k8s 507 | ### failures 508 | - https://github.com/hjacobs/kubernetes-failure-stories 509 | 510 | ## load tests 511 | - stress testing https://github.com/open-chaos/experiment-catalog 512 | 513 | ### unified batch & streaming 514 | - https://www.youtube.com/watch?v=4qSlsYogALo 515 | 516 | 517 | ### hybrid cloud 518 | 519 | #### data sync 520 | - NiFi https://medium.com/@abdelkrim.hadjidj/hub-and-spoke-architectures-with-nifi-site-to-site-communications-at-any-level-a-nifi-1-10-a8702f77c66e 521 | 522 | ## self service tools 523 | 524 | - https://www.gooddata.com/ 525 | 526 | 527 | ## api design 528 | 529 | - https://martinfowler.com/articles/cant-buy-integration.html 530 | 531 | 532 | 533 | # end user productivity 534 | 535 | - https://www.tadviewer.com/ 536 | 537 | # quality 538 | 539 | - https://github.com/Quantco/dataframely 540 | 541 | ## consistency 542 | 543 | # BI 544 | 545 | - https://evidence.dev/ 546 | 547 | 548 | - https://github.com/IvorySQL/IvorySQL 549 | 550 | ## scaling DBT 551 | 552 | - Unlocking model governance and multi-project deployments with dbt-meshify - Coalesce 2023 https://www.youtube.com/watch?v=FAsY0Qx8EyU https://dbt-labs.github.io/dbt-meshify/0.5/ 553 | - datafold demo
https://www.youtube.com/watch?v=5Xxm6cYRmFg 554 | - JSON schema https://www.youtube.com/watch?v=z8y7mvOUFsY https://github.com/GClunies/Reflekt 555 | 556 | ## web automation 557 | 558 | - https://playwright.dev/ (apparently good for scraping) 559 | 560 | 561 | ## modern data startups 562 | 563 | - https://localfirstweb.dev/ 564 | - https://www.5x.co/ 565 | - https://www.propeldata.com/ 566 | - https://www.tinybird.co/ 567 | 568 | 569 | ## interesting databases 570 | 571 | - Postgres 572 | - DuckDB 573 | - https://www.youtube.com/watch?v=PoHfh6O43uE 574 | - Starrocks 575 | - https://github.com/datafuselabs/databend 576 | - https://cratedb.com/ 577 | - https://www.hydra.so/ 578 | - https://neon.tech/ 579 | - https://github.com/TFMV/featherman 580 | - https://kuzudb.com/ 581 | 582 | ### streaming 583 | 584 | - https://github.com/infinyon/fluvio 585 | - https://risingwave.com/ 586 | - https://materialize.com/ 587 | - https://www.decodable.co/ 588 | 589 | ### multimodal 590 | 591 | - https://www.lancedb.com/ 592 | ### graph 593 | 594 | - https://www.youtube.com/watch?v=X_RFo616M_U 595 | - https://kuzudb.com/ replaced by https://ladybugdb.com/ 596 | 597 | ### vector 598 | 599 | - https://turbopuffer.com/ 600 | ### timeseries 601 | 602 | - timescale https://www.timescale.com/ 603 | - https://www.timeplus.com/ 604 | - https://questdb.io/ 605 | - https://tembo.io/blog/pg-timeseries 606 | - https://github.com/GreptimeTeam/greptimedb 607 | - duckdb inspired 608 | - https://github.com/Basekick-Labs/arc 609 | 610 | ### key-value (like) 611 | 612 | - Redis 613 | - https://github.com/valkey-io/valkey 614 | - https://aws.amazon.com/de/elasticache/ 615 | - https://aws.amazon.com/de/memorydb/ 616 | - https://www.memcached.org 617 | - https://www.bauplanlabs.com/ 618 | - https://hazelcast.com 619 | - https://www.dragonflydb.io/redis-alternative 620 | 621 | ### log structured data structures and brokers 622 | 623 | - https://github.com/microsoft/FASTER and 
https://github.com/faster-rs/faster-rs 624 | - https://redpanda.com/ 625 | - https://kafka.apache.org/ 626 | 627 | ### text search 628 | 629 | - https://github.com/paradedb/paradedb https://blog.paradedb.com/pages/elasticsearch_vs_postgres 630 | - https://github.com/quickwit-oss/tantivy 631 | - https://gitlab.science.ru.nl/informagus/zoekeend/ 632 | #### observability 633 | 634 | - https://www.scopedb.io/ 635 | 636 | ### PG addons 637 | 638 | - approximate 639 | - https://github.com/citusdata/postgresql-hll 640 | 641 | ## interesting message queues 642 | 643 | - https://zeromq.org/ 644 | 645 | ### based on PostgreSQL 646 | - https://github.com/tembo-io/pgmq 647 | - https://github.com/hatchet-dev/hatchet 648 | 649 | ## visualization and admin 650 | 651 | - https://trailbase.io/ 652 | 653 | ## self service pipelines 654 | 655 | - https://github.com/n8n-io/n8n 656 | 657 | ## end user onboarding 658 | 659 | ### web based code editors 660 | 661 | - https://devpod.sh/ 662 | 663 | # remote workspaces gitlab airgapped 664 | 665 | # rust speed tools 666 | 667 | ## storage 668 | 669 | - https://developmentseed.org/obstore/latest/ 670 | 671 | - https://gitlab.com/vtak/gitlab-workspaces-kubernetes-webhook 672 | - https://gitlab.com/groups/gitlab-org/-/epics/14001#top 673 | 674 | 675 | ## data linkage 676 | 677 | - https://github.com/moj-analytical-services/splink https://realworlddatascience.net/case-studies/posts/2023/11/22/splink.html 678 | 679 | ## template 680 | - https://copier.readthedocs.io/en/stable/ 681 | - https://github.com/sixfeetup/scaf 682 | 683 | ## model 684 | 685 | ### modeling tools 686 | 687 | - https://www.drawdb.app/ 688 | ### quality 689 | 690 | - https://practicaldatamodeling.substack.com/p/data-model-smells 691 | 692 | ## record linkage 693 | 694 | books 695 | - https://link.springer.com/book/10.1007/978-3-642-31164-2 696 | 697 | probabilistic 698 | - https://moj-analytical-services.github.io/splink/#__tabbed_1_2 699 | 700 | neat utilities 701 | 
- https://github.com/dell-research-harvard/linktransformer 702 | 703 | 704 | ## AI BI apps 705 | 706 | - https://www.rilldata.com/ 707 | - https://www.sigmacomputing.com/ 708 | 709 | ## BI 710 | 711 | - https://github.com/djbarnwal/rill-developer 712 | - https://github.com/visivo-io/visivo 713 | 714 | 715 | ## duckdb addons 716 | 717 | - https://github.com/TFMV/arrowport 718 | - https://omni.co/ 719 | - https://www.zenlytic.com/ 720 | 721 | 722 | ## semantic layer 723 | 724 | - https://www.youtube.com/watch?v=DZkXvdzYlVs https://www.malloydata.dev/ 725 | 726 | 727 | # HPC 728 | 729 | ## parallel computing 730 | 731 | - https://parsl.readthedocs.io/en/stable/ 732 | 733 | 734 | # ducklake 735 | ## sqlmesh 736 | 737 | - https://www.linkedin.com/pulse/build-open-lakehouse-your-laptop-ducklake-sqlmesh-madson-msc-mba-kjuzc/?trackingId=ys2obi8yJyaftwCY1vg0Gw%3D%3D