├── .github └── workflows │ ├── checks.yaml │ └── mirror-google-docs.yaml ├── .gitignore ├── .pre-commit-config.yaml ├── LICENSE ├── README.rst ├── ai ├── README.rst └── presentations │ ├── 2021-05-20-TF-and-onednn.pdf │ ├── 2021-05-20-oneapi-spec.pdf │ ├── 2022_03_Metagraph_v1.pdf │ ├── 2023-06-28-JAX-PJRT-Intel-GPU-meeting-note.pdf │ ├── 2023-06-28-JAX-PJRT-Intel-GPU.pdf │ ├── 21ww07_AI_TAB_Level_Zero.pdf │ ├── AI-TAB-Feb-2021.pdf │ ├── AI_TAB_oneDAL ML.pdf │ ├── AI_TAB_oneTBB_0821.pdf │ ├── Antares4SyCL.pdf │ ├── Codeplay-oneAPI-AI-TAB-Nov2021.pdf │ ├── Data-Parallel-Essentials-For-Python-oneAPI-TAB.pdf │ ├── Hugginface_Intel_Model_Opti_Julien_Simon_Matrix_Yao_2023_3_15.pdf │ ├── Joint_Matrix_Dounia_Khaldi_2023_3_15.pdf │ ├── README.rst │ ├── ai_tab_oneccl.pdf │ ├── meeting-notes-2023-03-15.pdf │ ├── oneAPI AI TAB intro March 8 2022.pdf │ ├── oneAPI and Data Parallel C++ for AI TAB.pdf │ ├── oneAPI_development_of_oneDNN_for_Armv8-A_SVE_20210210_v4.pdf │ ├── oneDNN-2022-07-14.pdf │ └── oneDNNGraph-oneAPIAITAB.final.pdf ├── cross-tab ├── README.rst └── presentations │ ├── cross-tab-2021-12-14.pdf │ └── oneAPI-Community-Forum-Annual-Meeting-2022-12.pdf ├── hardware ├── README.rst └── presentations │ ├── 22ww24_LevelZeroSpec_TAB.pdf │ ├── 22ww24_Sysman_TAB.pdf │ ├── 22ww38_UnifiedRuntimeDirectionDiscussion - Copy wob.pdf │ ├── 23ww11_LevelZeroSpecUpdate.pdf │ ├── 23ww11_ProgrammingInJulia.pdf │ ├── 23ww11_UnifiedRuntime.pdf │ ├── L0_timestamps_units.pdf │ ├── Level-Zero-Spec-v1.5.pdf │ ├── TornadoVM-oneAPIHardwareSIG-June23.pdf │ ├── Unified-Runtime-for-oneAPI-HW-SIG-062223.pdf │ ├── l0-tab-intro.pdf │ └── tiles-as-devices-l0-sig.pdf ├── image ├── README.rst ├── minutes │ ├── 2021_11_29_Minutes.rst │ ├── 2021_12_16_Minutes.rst │ ├── 2022_02_03_Minutes.rst │ └── 2022_02_17_Minutes.rst └── presentations │ ├── 2021-11-29_Slides.pdf │ ├── 2021-12-16_Slides.pdf │ ├── 2022-02-03_Slides.pdf │ ├── 2022-02-17_Slides.pdf │ ├── 2023-06-21_Slides.pdf │ └── 2023-09-21_Slides.pdf 
├── language ├── README.rst └── presentations │ ├── 2019-11-17-dpcpp-language-and-extensions.pdf │ ├── 2019-11-17-oneAPI-vision-for-TAB.pdf │ ├── 2019-11-17-oneDPL.pdf │ ├── 2020-01-28-TAB-DPCPPMeeting2_v7.pdf │ ├── 2020-03-04-TAB-C++-Minimum-Version.pdf │ ├── 2020-03-25-USM-for-TAB.pdf │ ├── 2020-04-22-oneDPL-for-TAB.pdf │ ├── 2020-05-oneDPL-for-TAB.pdf │ ├── 2020-07-01-TAB-Atomics.pdf │ ├── 2020-07-22 accessor simplification.pdf │ ├── 2020-08-26-TAB-Extension-Mechanism.pdf │ ├── 2020-08-26-TAB-LocalMemory.pdf │ ├── 2020-09-23-TAB-Extension-Naming.pdf │ ├── 2020-09-23-TAB-Function-pointers.pdf │ ├── 2020-10-28-TAB-specFeedback.pdf │ ├── 2020-12-16-TAB-DPCPP-NUMA-Discussion.pdf │ ├── 2020-12-16-TAB-oneAPI-year-one.pdf │ ├── 2021-02-24-TAB-dpcpp-implementation-prioritization.pdf │ ├── 2021-04-21-oneDPL-for-TAB.pdf │ ├── 2021-05-26-TAB-invoke_simd.pdf │ ├── 2021-07-28-TAB-DPCPP-properties.pdf │ ├── 2021-08-25-TAB-oneAPIv2-runtime.pdf │ ├── 2021-09-22-TAB-dynamic-selection.pdf │ ├── 2021-10-27-TAB-distributed-computing.pdf │ ├── 2022-09-28-TAB-SYCL-Graph.pdf │ ├── 2022-10-26-TAB-parallelism.pdf │ ├── 2023-03-14-TornadoVM.pdf │ ├── 2023-06-07-DK-matrix-oneapi-language.pdf.pdf │ ├── 2023-06-07_JointMatrix_NVIDIA.pdf.pdf │ ├── oneAPI-TAB-20220727-Kernel-Fusion.pdf │ ├── oneAPI-TAB-Rules-of-the-Road.pdf │ └── oneapi community forum governance Sept 2022.pdf ├── math ├── README.rst ├── minutes │ ├── 2020_05_20_Minutes.rst │ ├── 2020_06_03_Minutes.rst │ ├── 2020_06_17_Minutes.rst │ ├── 2020_07_01_Minutes.rst │ ├── 2020_07_15_Minutes.rst │ ├── 2020_08_12_Minutes.rst │ ├── 2020_09_09_Minutes.rst │ ├── 2020_11_11_Minutes.rst │ ├── 2021_01_27_Minutes.rst │ ├── 2021_02_24_Minutes.rst │ ├── 2021_03_24_Minutes.rst │ ├── 2021_05_19_Minutes.rst │ ├── 2021_06_16_Minutes.rst │ ├── 2021_07_14_Minutes.rst │ ├── 2021_10_06_Minutes.rst │ ├── 2022_03_23_Minutes.rst │ ├── 2022_06_08_Minutes.rst │ ├── 2022_07_27_Minutes.rst │ ├── 2022_09_21_Minutes.rst │ ├── 2022_10_05_Minutes.rst │ ├── 
2023_03_08_Minutes.rst │ ├── 2023_05_17_Minutes.rst │ ├── 2023_07_12_Minutes.rst │ ├── 2023_09_20_Minutes.rst │ └── 2023_10_25_Minutes.rst └── presentations │ ├── 2020-05-20_Slides.pdf │ ├── 2020-06-03_Slides.pdf │ ├── 2020-06-17_Slides.pdf │ ├── 2020-07-01_Slides.pdf │ ├── 2020-07-15_Slides.pdf │ ├── 2020-08-12_Slides.pdf │ ├── 2020-09-09_Slides.pdf │ ├── 2020-11-11_Slides.pdf │ ├── 2021-01-27_Slides.pdf │ ├── 2021-02-24_Slides.pdf │ ├── 2021-03-24_Slides.pdf │ ├── 2021-05-19_Slides.pdf │ ├── 2021-06-16_Slides.pdf │ ├── 2021-07-14_Slides.pdf │ ├── 2021-10-06_Slides.pdf │ ├── 2022-03-23_Slides.pdf │ ├── 2022-06-08_Slides.pdf │ ├── 2022-07-27_Slides.pdf │ ├── 2022-09-21_Slides.pdf │ ├── 2022-10-05_Slides.pdf │ ├── 2023-03-08_Slides.pdf │ ├── 2023-03-08_balint_soproni_onemkl.pdf │ ├── 2023-05-17_Finlay_Marno_onemkl.pdf │ ├── 2023-05-17_Slides.pdf │ ├── 2023-07-12_Slides.pdf │ ├── 2023-07-12_oneMKL-PortBLAS.pdf │ ├── 2023-09-20_Slides-value_or_pointer.pdf │ ├── 2023-09-20_oneMKL-Sparse.pdf │ ├── 2023-10-25_Slides.pdf │ ├── Hammond Intro DPC++ May 2020 oneMKL TAB.pdf │ └── README.rst ├── organization ├── README.rst ├── oneAPI-Community-Forum-Structure.png └── oneAPI-Policies.rst └── requirements.txt /.github/workflows/checks.yaml: -------------------------------------------------------------------------------- 1 | on: 2 | push: 3 | branches: 4 | - main 5 | pull_request: 6 | 7 | jobs: 8 | checks: 9 | runs-on: ubuntu-latest 10 | steps: 11 | - uses: actions/checkout@v3 12 | - uses: actions/setup-python@v4 13 | with: 14 | python-version: '3.10' 15 | cache: 'pip' 16 | - uses: pre-commit/action@v3.0.0 17 | -------------------------------------------------------------------------------- /.github/workflows/mirror-google-docs.yaml: -------------------------------------------------------------------------------- 1 | on: 2 | workflow_dispatch: 3 | # every day at midnight 4 | schedule: 5 | - cron: '0 0 * * *' 6 | 7 | jobs: 8 | publish: 9 | runs-on: ubuntu-latest 10 | steps: 11 | - 
name: Checkout gh-pages 12 | uses: actions/checkout@v3 13 | with: 14 | ref: gh-pages 15 | path: gh-pages 16 | - name: Publish to github pages 17 | run: | 18 | rm -rf gh-pages/* 19 | touch gh-pages/.nojekyll 20 | mkdir -p gh-pages/meeting-notes 21 | cd gh-pages 22 | 23 | curl -s -L -o meeting-notes/cross-tab.pdf https://docs.google.com/document/d/1CQ7gqc9wgFecLY7x81Y2l363zSykHNkqERg4_2sbgQs/export?format=pdf 24 | 25 | git config user.name github-actions 26 | git config user.email github-actions@github.com 27 | git add . 28 | # Ignore errors because no updates returns an error status. 29 | git commit --reset-author --amend -m "Update from github actions" 30 | git push --force origin gh-pages 31 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | # SPDX-FileCopyrightText: 2020 The Khronos Group Inc. 
2 | # 3 | # SPDX-License-Identifier: Apache-2.0 4 | 5 | # See https://pre-commit.com for more information 6 | # See https://pre-commit.com/hooks.html for more hooks 7 | 8 | exclude: math|ai 9 | 10 | repos: 11 | - repo: https://github.com/ambv/black 12 | rev: 22.12.0 13 | hooks: 14 | - id: black 15 | - repo: https://github.com/pre-commit/pre-commit-hooks 16 | rev: v4.4.0 17 | hooks: 18 | - id: trailing-whitespace 19 | - id: end-of-file-fixer 20 | - id: check-yaml 21 | - repo: https://github.com/pycqa/doc8 22 | rev: v1.1.1 23 | hooks: 24 | - id: doc8 25 | - repo: https://github.com/pycqa/flake8 26 | rev: 6.0.0 27 | hooks: 28 | - id: flake8 29 | - repo: https://github.com/pycqa/isort 30 | rev: 5.12.0 31 | hooks: 32 | - id: isort 33 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 oneAPI-SRC 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | .. image:: https://github.com/oneapi-src/oneAPI-tab/actions/workflows/checks.yaml/badge.svg 2 | :target: https://github.com/oneapi-src/oneAPI-tab/actions 3 | 4 | 5 | +---------------------------------+ 6 | | !IMPORTANT! | 7 | | | 8 | | This repository is now | 9 | | deprecated since the forum has | 10 | | moved to the UXL Foundation | 11 | | | 12 | | Find the latest up-to-date | 13 | | information on the projects | 14 | | here_ | 15 | +---------------------------------+ 16 | 17 | ================================ 18 | oneAPI Community Forum 19 | ================================ 20 | 21 | The oneAPI Community Forum exists to define a standards-based, 22 | cross-architecture open specification for accelerated computing and 23 | to foster the open-source implementations of the specification. 24 | 25 | More information can be found at oneapi.io_. 26 | 27 | This repository hosts notes and presentation materials from the 28 | oneAPI Community Forum meetings. The meetings are open and bring 29 | together experts from industry, government, and academia who help guide 30 | the oneAPI specification. 31 | 32 | The policies and governance processes are also available in this repo. 33 | 34 | The community is invited to join the meetings, review the `oneAPI 35 | Specification`_, and read the information in this repo. Contributions 36 | can be made by joining the Special Interest Groups (SIGs) or 37 | posting comments or questions as GitHub issues.
General questions can 38 | go to this repo, and issues specific to parts of the specification can 39 | go to the `Specification repo`_. 40 | 41 | To be notified of new meeting notes, become a watcher of this repo. If 42 | you have a question about how to join the SIGs, email 43 | `oneapi@codeplay.com`_. 44 | 45 | Read about the `oneAPI Community Forum governance`_ to understand 46 | the organization and processes. 47 | 48 | .. _oneapi.io: https://oneapi.io 49 | .. _`oneAPI Specification`: https://spec.oneapi.io 50 | .. _`Specification repo`: https://github.com/oneapi-src/oneapi-spec 51 | .. _`oneapi@codeplay.com`: mailto:oneapi@codeplay.com 52 | .. _`oneAPI Community Forum governance`: organization 53 | .. _here: https://www.uxlfoundation.org 54 | 55 | oneAPI Community Forum Special Interest Groups (SIGs) 56 | ----------------------------------------------------- 57 | 58 | SIGs host regular meetings to organize community proposals and 59 | contributions to the oneAPI specification. They also act as a bridge 60 | between the community and developers working on implementations of 61 | the oneAPI specification. 62 | 63 | * `Language `__ - This group covers topics related to 64 | language implementations that integrate with the oneAPI 65 | specification. 66 | 67 | * `Math `__ - This group covers topics related to math 68 | operations. 69 | 70 | * `Image `__ - This group covers topics related to image 71 | processing operations. 72 | 73 | * `AI `__ - This group covers topics related to AI operations. 74 | 75 | * `Hardware `__ - This group covers topics related to the 76 | integration of hardware and how this is defined in the oneAPI 77 | specification. 78 | 79 | Upcoming oneAPI Community Forum Meetings 80 | ---------------------------------------- 81 | 82 | .. 
list-table:: 83 | :header-rows: 1 84 | 85 | * - Date 86 | - Meeting Type 87 | - Location 88 | - How to join 89 | * - 19 September 2023, 10am-11am US Central Time 90 | - Language SIG 91 | - Virtual 92 | - Contact_ 93 | * - TBD, 9am-10am Central Time 94 | - Hardware SIG 95 | - Virtual 96 | - Contact_ 97 | * - TBD, 8am-9:30am US Central Time 98 | - AI SIG 99 | - Virtual 100 | - Contact_ 101 | * - 25 October 2023, 9am-10am US Central Time 102 | - Math SIG 103 | - Virtual 104 | - Contact_ 105 | * - 14 December 2023, 10am-11am US Central Time 106 | - Image SIG 107 | - Virtual 108 | - Contact_ 109 | 110 | .. _Contact: https://www.oneapi.io/community 111 | 112 | Find the minutes for prior meetings in the appropriate SIG section of 113 | this repository. 114 | 115 | Publishing Meeting Notes from Google Docs 116 | ----------------------------------------- 117 | 118 | To ensure access for all, we mirror meeting notes from Google Docs in 119 | GitHub as a PDF. Here is the procedure. 120 | 121 | Edit the meeting notes in Google Docs. If you want to publish a 122 | presentation, upload a PDF to GitHub. For example, here is the 123 | directory that contains presentations for the cross-tab: 124 | presentations_. Select *Add file*, then *Upload files*. After uploading 125 | the file, click on its name to view and copy the URL from your 126 | browser to the Google Doc. 127 | 128 | When meeting notes are complete, publish them to GitHub as a 129 | PDF. Mirroring is triggered automatically once a day. If you do not 130 | want to wait, select `mirror workflow`_, select *Run workflow*, and 131 | then *Run workflow*. When you see the green check, it is 132 | published. Look at `mirroring yaml`_ to get the URL. For example, see 133 | `Cross-tab PDF`_. 134 | 135 | .. _presentations: https://github.com/oneapi-src/oneAPI-tab/tree/main/cross-tab/presentations 136 | .. _`mirror workflow`: https://github.com/oneapi-src/oneAPI-tab/actions/workflows/mirror-google-docs.yaml 137 | .. 
_`mirroring yaml`: .github/workflows/mirror-google-docs.yaml 138 | .. _`Cross-tab PDF`: https://oneapi-src.github.io/oneAPI-tab/meeting-notes/cross-tab.pdf 139 | -------------------------------------------------------------------------------- /ai/presentations/2021-05-20-TF-and-onednn.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/2021-05-20-TF-and-onednn.pdf -------------------------------------------------------------------------------- /ai/presentations/2021-05-20-oneapi-spec.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/2021-05-20-oneapi-spec.pdf -------------------------------------------------------------------------------- /ai/presentations/2022_03_Metagraph_v1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/2022_03_Metagraph_v1.pdf -------------------------------------------------------------------------------- /ai/presentations/2023-06-28-JAX-PJRT-Intel-GPU-meeting-note.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/2023-06-28-JAX-PJRT-Intel-GPU-meeting-note.pdf -------------------------------------------------------------------------------- /ai/presentations/2023-06-28-JAX-PJRT-Intel-GPU.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/2023-06-28-JAX-PJRT-Intel-GPU.pdf 
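The mirroring procedure described in README.rst above downloads each Google Doc through its PDF export endpoint, which is the same `curl` call that appears in mirror-google-docs.yaml. A minimal sketch of that step is below; `export_url` is a helper name introduced here purely for illustration (the workflow inlines the full URL), while the document ID is the cross-tab ID taken from the workflow:

```shell
#!/bin/sh
# Build the Google Docs "export as PDF" URL for a given document ID.
# export_url is a hypothetical helper, not part of the repo's tooling.
export_url() {
  echo "https://docs.google.com/document/d/$1/export?format=pdf"
}

# Cross-tab meeting-notes document ID, from mirror-google-docs.yaml.
DOC_ID="1CQ7gqc9wgFecLY7x81Y2l363zSykHNkqERg4_2sbgQs"

# The workflow then fetches the PDF like this (network access required):
#   mkdir -p meeting-notes
#   curl -s -L -o meeting-notes/cross-tab.pdf "$(export_url "$DOC_ID")"
export_url "$DOC_ID"
```

Any Google Doc shared with link access can be mirrored the same way by swapping in its document ID.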
-------------------------------------------------------------------------------- /ai/presentations/21ww07_AI_TAB_Level_Zero.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/21ww07_AI_TAB_Level_Zero.pdf -------------------------------------------------------------------------------- /ai/presentations/AI-TAB-Feb-2021.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/AI-TAB-Feb-2021.pdf -------------------------------------------------------------------------------- /ai/presentations/AI_TAB_oneDAL ML.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/AI_TAB_oneDAL ML.pdf -------------------------------------------------------------------------------- /ai/presentations/AI_TAB_oneTBB_0821.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/AI_TAB_oneTBB_0821.pdf -------------------------------------------------------------------------------- /ai/presentations/Antares4SyCL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/Antares4SyCL.pdf -------------------------------------------------------------------------------- /ai/presentations/Codeplay-oneAPI-AI-TAB-Nov2021.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/Codeplay-oneAPI-AI-TAB-Nov2021.pdf -------------------------------------------------------------------------------- /ai/presentations/Data-Parallel-Essentials-For-Python-oneAPI-TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/Data-Parallel-Essentials-For-Python-oneAPI-TAB.pdf -------------------------------------------------------------------------------- /ai/presentations/Hugginface_Intel_Model_Opti_Julien_Simon_Matrix_Yao_2023_3_15.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/Hugginface_Intel_Model_Opti_Julien_Simon_Matrix_Yao_2023_3_15.pdf -------------------------------------------------------------------------------- /ai/presentations/Joint_Matrix_Dounia_Khaldi_2023_3_15.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/Joint_Matrix_Dounia_Khaldi_2023_3_15.pdf -------------------------------------------------------------------------------- /ai/presentations/README.rst: -------------------------------------------------------------------------------- 1 | ==================== 2 | AI tab presentations 3 | ==================== 4 | 5 | Presentations are in this directory. 
6 | -------------------------------------------------------------------------------- /ai/presentations/ai_tab_oneccl.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/ai_tab_oneccl.pdf -------------------------------------------------------------------------------- /ai/presentations/meeting-notes-2023-03-15.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/meeting-notes-2023-03-15.pdf -------------------------------------------------------------------------------- /ai/presentations/oneAPI AI TAB intro March 8 2022.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/oneAPI AI TAB intro March 8 2022.pdf -------------------------------------------------------------------------------- /ai/presentations/oneAPI and Data Parallel C++ for AI TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/oneAPI and Data Parallel C++ for AI TAB.pdf -------------------------------------------------------------------------------- /ai/presentations/oneAPI_development_of_oneDNN_for_Armv8-A_SVE_20210210_v4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/oneAPI_development_of_oneDNN_for_Armv8-A_SVE_20210210_v4.pdf -------------------------------------------------------------------------------- /ai/presentations/oneDNN-2022-07-14.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/oneDNN-2022-07-14.pdf -------------------------------------------------------------------------------- /ai/presentations/oneDNNGraph-oneAPIAITAB.final.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/ai/presentations/oneDNNGraph-oneAPIAITAB.final.pdf -------------------------------------------------------------------------------- /cross-tab/README.rst: -------------------------------------------------------------------------------- 1 | ================================================================ 2 | oneAPI Community Forum Combined Meeting Notes 3 | ================================================================ 4 | 5 | 2021-12-14 6 | ========== 7 | 8 | Agenda 9 | ------ 10 | 11 | ========================= ============================================ 12 | Introduction Geoff Lowney 13 | Unified Runtime Paul Petersen, Z. Waters, A. Kukanov, T. 
Smith 14 | Distributed Programming David Ozog, Maria Garzaran, Robert Cohn 15 | Domain Specific Libraries Ilya Burylov 16 | ========================= ============================================ 17 | 18 | 19 | Attendees 20 | --------- 21 | 22 | ================================= =============================== 23 | Tian, Xinmin, Intel Lowney, Geoff, Intel 24 | Slavova, Gergana S, Intel Richards, Alison, Intel 25 | Smith, Timmie, Intel Podogova, Svetlana, Intel 26 | Brodman, James, Intel Ashbaugh, Ben, Intel 27 | Lueck, Gregory M, Intel Garland, Craig, Intel 28 | SBaik, Samsung (guest) Burylov, Ilya, Intel 29 | Garzaran, Maria, Intel Sohrab Amirghodsi, Adobe 30 | Ozog, David M, Intel Kawakami, Kentaro/川上 健太, Fujitsu Labs 31 | Cohn, Robert, Intel Pat Quillen, MathWorks 32 | Edward Smyth, NAG Ike, Atsushi/池 敦, Fujitsu 33 | Luo, Ye, ANL Mehdi Goli, Codeplay 34 | Tom Deakin, Univ of Bristol Bharat Agrawal, Ansys 35 | Khanna, Rahul, Intel Rabotnikov, Mark, Philips Medical Systems 36 | Knepper, Sarah, Intel Waters, Zack S, Intel 37 | Story, Shane, Intel Kharkov, Egor, Intel 38 | Kukanov, Alexey, Intel Kubarev, Valentin, Intel 39 | Pascuzzi, Vincent, BNL Petersen, Paul, Intel 40 | Luszczek, Piotr Rafal, UTK Stefan Yurkevitch, ArrayFire 41 | Schneider, Robert, Siemens HC Ruyman Reyes, Codeplay 42 | Pennycook, John, Intel Andrew Lumsdaine, University of Washington 43 | Arunachalam, Meena, Intel John Melonakos, ArrayFire 44 | Patty, Spencer, Intel Reinders, James, Intel 45 | Dolbeau, Romain, SiPearl Li, Jian Hui, Intel 46 | Rahman, Md, Intel Sheng Zha, AWS MXNet 47 | Nevin Liber, Argonne Alastair Murray, Codeplay 48 | Paul, Sriraj, Intel Andrew Richards, Codeplay 49 | Anzt, Hartwig, KIT Ronan Keryell, Xilinx 50 | Shabunin, Maksim, Intel Erik Lindahl, Stockholm University 51 | ================================= =============================== 52 | 53 | Slides_ 54 | 55 | .. 
_Slides: presentations/cross-tab-2021-12-14.pdf 56 | 57 | * Adobe only uses the native runtime on each platform: Metal or DirectX 58 | 59 | * Need Vulkan backend support – requires more image processing, which 60 | is missing from oneAPI – prefers to write in oneAPI in C++ vs 61 | GLSL/etc. Needs texture support, image basic support, access to how 62 | a texture is stored (linearly or swizzled, 3D texture or a view on 63 | that with two images, etc) 64 | 65 | * Working on imaging support for oneAPI. 66 | 67 | * AR: Zack to connect with Ilya on image support 68 | 69 | * Does Level Zero support OpenCL? 70 | 71 | * Level Zero provides low-level access to GPU and devices, and is 72 | intended to be lower-level than OpenCL. OpenCL is a first-class citizen. 73 | 74 | * We are doing a lot of work in SYCL and the NVIDIA backend. Very 75 | interested in Level Zero – looked at Level Zero but it was so low 76 | level compared to CUDA and HIP; need something higher level to 77 | implement the SYCL support and support other platforms above CUDA 78 | and HIP – need other ways to do that; if you think in terms of 79 | SYCL, how would they do that with multiple streams in CUDA and/or 80 | multiple queues? A lot of conversations to have a unified RT for 81 | other platforms. 82 | 83 | * This is a .1 spec, would like to get additional engagement for 84 | others; in terms of a timeline, is there something we could share? 85 | We need to put the draft on GitHub – we have the AR to put 86 | that up. We do intend to have an easy migration path – for other 87 | hw. Clean up and standardize in the PI layer – we welcome your 88 | feedback as we define.
On the right track w/ the unified RT; one 89 | of the nice things through the adaptors is the ability to expose 90 | the extensions and expose the native platform experiences – if you 91 | need access for CUDA and low-level interfaces or APIs, you could 92 | be able to choose 93 | 94 | * SYCL for CUDA or for HIP – is the backend able to call cuBLAS directly? 95 | A very important feature that they need – they are happy to 96 | collaborate on this (Codeplay). 97 | 98 | * Issues with Level Zero 99 | 100 | * Unified runtime expected to abstract all the backends but they are 101 | more familiar with other APIs. 102 | 103 | * Thread safety – spec says Level Zero is not thread safe. (Zack – AR 104 | to look at that). Spec says you can call from multiple threads. 105 | 106 | * No primary context to allow libraries to interact w/o knowing each 107 | other; user exports (unfriendly for developers). Need to have a 108 | Level Zero adaptor to unify the behavior w/ other runtimes. That 109 | is more of his expectation. 110 | 111 | * Adaptors – where you need divergence, you can access the lower 112 | level platform… Need to access multiple lower level architectures 113 | – directly allocated from cuBLAS or CUDA low level (goes through 114 | the primary context and recognizes each other – independent but 115 | not convenient like that w/ CUDA). Unified RT is the opportunity 116 | for supporting those explicit constructs. 117 | 118 | * You can call from multiple threads but you need to be 119 | careful. Can't operate on the same object from multiple threads. 120 | Small clarification: you can't operate on the same object from 121 | multiple threads without synchronization. 122 | 123 | * Having this TLS magical CUDA state is a limitation for 124 | performance; and it is actually very non-thread safe, just in a 125 | more dangerous and subtle way. It also creates wrong expectations 126 | for users, e.g. CUDA libraries magically working without 127 | initialization!
so when you need to interop with native libraries, 128 | you need to explain to users all the magic that the CUDA runtime does 129 | and how to replicate it in a portable way 130 | 131 | * TLS will create more issues than the benefits we will get. It is a 132 | legacy issue in OpenMP now. 133 | 134 | * But this does not prevent providing an optional TLS layer for 135 | porting simple use cases coming from a single-thread world? Just 136 | that it should not be the default, and uninterested folks should not 137 | be performance-impacted by this. 138 | 139 | * Do people want to program CPUs with SYCL? 140 | 141 | * Yes 142 | 143 | * Would a resource manager help TBB and NUMA issues on Intel CPUs? 144 | 145 | * That is exactly why we are pursuing this. 146 | 147 | * We need to support both 32-bit (WASM) and 64-bit systems, from 148 | high-end workstations to iOS/Android devices. So portability and the 149 | ability to scale down gracefully is critical for us. Apple platforms 150 | support only their own solutions, and we have had lots of issues with 151 | OpenCL drivers on random hardware. For cloud computing we have 152 | more flexibility. 153 | 154 | * MPI could surely benefit from modern C++ bindings.... 155 | 156 | * Do you still rely on free functions in shmem? We are back into our 157 | previous TLS discussion. Perhaps using a kernel handler would be more 158 | C++ & SYCL compliant? Of course the syntax would be different, which 159 | is a problem for portability, with kh.shmem.putmem_nbi() for example 160 | instead of ::shmem_putmem_nbi() – 161 | 162 | * Do we need to be looking at different extensions? Do we need this 163 | natively in SYCL? I.e., Universal Parallel C++ (Paul). How do we 164 | think about this (not from a library point of view but have this more 165 | integrated w/ SYCL)? 166 | 167 | * For combining MPI and SYCL, have you looked at the Celerity project?
168 | https://celerity.github.io/ (Celerity: High-level C++ for 169 | Accelerator Clusters) 170 | 171 | * [Off-topic; Potential Collab] I have yet to see 'XPU' with 'X' == 172 | 'Q'. Something I'm interested in is having a qpu_selector, where 173 | this would use a QC simulator (akin to, e.g., an FPGA simulator) for 174 | Qiskit, cirq, DM-SIM, etc., perhaps via PI interface? Feel free to 175 | reach out. 176 | -------------------------------------------------------------------------------- /cross-tab/presentations/cross-tab-2021-12-14.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/cross-tab/presentations/cross-tab-2021-12-14.pdf -------------------------------------------------------------------------------- /cross-tab/presentations/oneAPI-Community-Forum-Annual-Meeting-2022-12.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/cross-tab/presentations/oneAPI-Community-Forum-Annual-Meeting-2022-12.pdf -------------------------------------------------------------------------------- /hardware/presentations/22ww24_LevelZeroSpec_TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/22ww24_LevelZeroSpec_TAB.pdf -------------------------------------------------------------------------------- /hardware/presentations/22ww24_Sysman_TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/22ww24_Sysman_TAB.pdf
-------------------------------------------------------------------------------- /hardware/presentations/22ww38_UnifiedRuntimeDirectionDiscussion - Copy wob.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/22ww38_UnifiedRuntimeDirectionDiscussion - Copy wob.pdf -------------------------------------------------------------------------------- /hardware/presentations/23ww11_LevelZeroSpecUpdate.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/23ww11_LevelZeroSpecUpdate.pdf -------------------------------------------------------------------------------- /hardware/presentations/23ww11_ProgrammingInJulia.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/23ww11_ProgrammingInJulia.pdf -------------------------------------------------------------------------------- /hardware/presentations/23ww11_UnifiedRuntime.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/23ww11_UnifiedRuntime.pdf -------------------------------------------------------------------------------- /hardware/presentations/L0_timestamps_units.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/L0_timestamps_units.pdf -------------------------------------------------------------------------------- /hardware/presentations/Level-Zero-Spec-v1.5.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/Level-Zero-Spec-v1.5.pdf -------------------------------------------------------------------------------- /hardware/presentations/TornadoVM-oneAPIHardwareSIG-June23.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/TornadoVM-oneAPIHardwareSIG-June23.pdf -------------------------------------------------------------------------------- /hardware/presentations/Unified-Runtime-for-oneAPI-HW-SIG-062223.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/Unified-Runtime-for-oneAPI-HW-SIG-062223.pdf -------------------------------------------------------------------------------- /hardware/presentations/l0-tab-intro.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/l0-tab-intro.pdf -------------------------------------------------------------------------------- /hardware/presentations/tiles-as-devices-l0-sig.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/hardware/presentations/tiles-as-devices-l0-sig.pdf -------------------------------------------------------------------------------- /image/README.rst: -------------------------------------------------------------------------------- 1 | ==================================== 2 | Image SIG - a oneAPI Community Forum 3 | 
==================================== 4 | 5 | Introduction 6 | ============ 7 | 8 | The Image SIG hosts discussions and presentations for fundamental image 9 | processing routines for high-performance computing, engineering, financial, and 10 | other applications. 11 | 12 | This SIG discusses the oneAPI Image Processing Library (oneIPL) in the planning 13 | and spec development stage. Its goal is to become an extensive library of 14 | ready-to-use, highly optimized image processing functions. As part of oneAPI, 15 | oneIPL is designed to allow execution on various computational devices: CPUs, 16 | GPUs, and other accelerators. The functionality is subdivided into several 17 | domains: filters, geometry transformations, and color and type conversions. 18 | 19 | Its royalty-free APIs help developers: 20 | 21 | 1. Use vectorization and SIMD (single instruction, multiple data) instructions 22 | on CPU and GPU. 23 | 24 | 2. Use CPU threading and GPU SIMT (single instruction, multiple threads). 25 | 26 | 3. Improve the performance of computation-intensive applications with pipelines 27 | involving image processing and other operations provided as DPC++ kernels or 28 | oneAPI libraries. 29 | 30 | In addition to domain-specific functionality, the Image SIG may also discuss 31 | overall design features like the execution model, memory model, or error 32 | handling. At times, the open source oneIPL Interfaces project, which implements 33 | the oneIPL specification, may also be discussed. 34 | 35 | Purpose of the Group 36 | ==================== 37 | 38 | The Image Special Interest Group (Image SIG) is committed to the collaborative 39 | development, understanding, and promotion of the open standard oneIPL. We aim 40 | to build a comprehensive image-processing library to set a new standard for 41 | image processing in the oneAPI ecosystem. 42 | 43 | 1. Collaborate: Gather key industry stakeholders to contribute to the oneIPL 44 | standard.
Engineers and software developers from across the industry will 45 | share insights, knowledge, and resources to collaboratively shape the future 46 | of the oneIPL specification. 47 | 48 | 2. Innovate: Encourage innovative thinking and problem-solving to advance the 49 | future of image processing within the oneAPI ecosystem. 50 | 51 | 3. Educate: Share experiences and best practices related to oneIPL with the 52 | wider oneAPI community. 53 | 54 | Strategy for Achieving Goals 55 | ============================ 56 | 57 | To achieve these goals, the following strategic approach will be taken: 58 | 59 | 1. Community Building: Foster a vibrant and active community of engineers and 60 | software developers who meet regularly to discuss, debate, and determine the 61 | direction of oneIPL. This will be achieved through online events and a clear 62 | communication channel for group members. 63 | 64 | 2. Open Collaboration: Promote an open-source mindset to enable active 65 | contributions from all members. The group will focus on collaboration, 66 | maintaining a transparent process where all members have access to resources 67 | and are encouraged to provide input on development. 68 | 69 | 3. Training and Education: Organize training sessions, workshops, and webinars 70 | to educate members and the broader oneAPI ecosystem about the features, 71 | advantages, and potential of the oneIPL library. This will also be an 72 | opportunity to gather feedback and requirements from end users. 73 | 74 | 4. Regular Updates and Reviews: Schedule periodic meetings for group members to 75 | share their experiences, progress, and challenges with the oneIPL 76 | library. These sessions will act as a platform for knowledge sharing, 77 | brainstorming solutions, and shaping the roadmap for the future of the 78 | oneIPL specification. 79 | 80 | 5. 
Integration with the oneAPI Ecosystem: Work closely with the broader oneAPI 81 | community to ensure the oneIPL library seamlessly integrates with the other 82 | elements of the oneAPI ecosystem. This will help ensure the standard is 83 | accessible, beneficial, and valuable to the community. 84 | 85 | Through this purpose and strategic approach, the Image Special Interest Group 86 | will shape the future of image processing via the oneIPL image processing 87 | library in alignment with the broader goals of the oneAPI ecosystem. We believe 88 | in the power of collaborative innovation to drive the future of image 89 | processing and look forward to contributing to this journey. 90 | 91 | --- 92 | 93 | The Image SIG is led by John Melonakos. 94 | 95 | Upcoming Meetings 96 | ================= 97 | 98 | The next Image SIG will be held on December 14, 2023, from 10am to 11am CST. 99 | 100 | To join the Image SIG, email oneapi@codeplay.com to register for our quarterly 101 | meetings. You will be added to the mailing list and notified of upcoming 102 | events. 
103 | 104 | Previous Meetings 105 | ================= 106 | 107 | | September 21, 2023 - `presentation <presentations/2023-09-21_Slides.pdf>`__ 108 | | June 21, 2023 - `presentation <presentations/2023-06-21_Slides.pdf>`__ 109 | | February 17, 2022 - `notes <minutes/2022_02_17_Minutes.rst>`__, \ 110 | `presentation <presentations/2022-02-17_Slides.pdf>`__ 111 | | February 3, 2022 - `notes <minutes/2022_02_03_Minutes.rst>`__, \ 112 | `presentation <presentations/2022-02-03_Slides.pdf>`__ 113 | | December 16, 2021 - `notes <minutes/2021_12_16_Minutes.rst>`__, \ 114 | `presentation <presentations/2021-12-16_Slides.pdf>`__ 115 | | November 29, 2021 - `notes <minutes/2021_11_29_Minutes.rst>`__, \ 116 | `presentation <presentations/2021-11-29_Slides.pdf>`__ 117 | -------------------------------------------------------------------------------- /image/minutes/2021_11_29_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneIPL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2021-11-29 6 | ========== 7 | 8 | Slides: 9 | 10 | * `oneIPL TAB kickoff`_ 11 | 12 | Attendees: 13 | 14 | * Svetlana Podogova (Intel) 15 | * Victor Getmanskiy (Intel) 16 | * Aleksandr Duev (Intel) 17 | * Craig Garland (Intel) 18 | * Alexander Alekhin (Intel) 19 | * Jeff Rogers (Intel) 20 | * Sergey Ivanov (Intel) 21 | * Abhinav Singh (Intel) 22 | * Valentin Kubarev (Intel) 23 | * Dmitriy Budnikov (Intel) 24 | * Vladimir Kostarev (Intel) 25 | * Robert Schneider (Siemens Healthineers) 26 | * An Le (Intel) 27 | * Mark Rabotnikov (Philips) 28 | * Maksim Shabunin (Intel) 29 | * Ritesh R Kulkarni (Intel) 30 | * Alison L Richards (Intel) 31 | 32 | Agenda: 33 | 34 | * Welcome to the oneIPL TAB participants 35 | * oneAPI industry initiative 36 | * oneAPI Technical Advisory Board activity 37 | * Overview of oneIPL 38 | * oneIPL TAB rules and schedule 39 | 40 | 41 | Welcome to the oneIPL TAB participants: 42 | 43 | * Welcome to the oneAPI Technical Advisory Board (or TAB) for Image 44 | Processing Kickoff meeting. We really appreciate that you were able to 45 | join our community, and our aim is to make this collaboration as 46 | effective as we can.
The goal of today's meeting is to give an 47 | overview of the oneAPI initiative, and especially of the image processing 48 | domain, to describe what kind of collaboration we expect, and to 49 | follow up with you on future TAB meetings. 50 | 51 | * The oneIPL Technical Advisory Board includes experts from Siemens 52 | Healthineers, Philips, Adobe, Samsung Medison, Sonoscape and 53 | Xinje. Experts from the OpenCV, G-API and oneIPL projects are also 54 | invited to the oneIPL TAB. Today on the bridge we also have many 55 | of the people who helped to organize all this activity: 56 | 57 | - IPL managers: Craig Garland, who leads the Intel Performance 58 | Libraries department, which includes not only the Image Processing 59 | team but also the Math Kernel Library (oneMKL) and several other 60 | domains. Svetlana Podogova and Valentin Kubarev - the engineering 61 | managers of Intel performance libraries for Image and Signal 62 | Processing. 63 | - Business Development Group 64 | - Technical Consulting Engineers 65 | 66 | oneAPI industry initiative: 67 | 68 | * Nowadays there is a very diverse environment of widely used 69 | architectures. It includes different types of platforms: 70 | CPUs, GPUs, FPGAs, and specific types of accelerators. Supporting all 71 | of these architectures is a difficult task for developers, so they 72 | really need a unified environment that 73 | allows them to program across all of them without a deep understanding 74 | of the details of each architecture. That is what the oneAPI industry 75 | initiative targets.
76 | 77 | * oneAPI is an open, unified, standards-based specification for 78 | heterogeneous programming which includes the DPC++ language spec as a 79 | core part, the Level Zero spec as the system interface for the oneAPI 80 | languages and libraries, and a number of elements covering different 81 | domains like math operations, data processing, and video processing; 82 | now we are preparing to extend it to the image processing area as 83 | well. oneAPI Spec v1 initially launched in September 2020, and since 84 | then the speed of specification development and the number of 85 | participants in the oneAPI community have been actively growing. 86 | 87 | * More information about oneAPI may be found on the `oneAPI home page`_ 88 | and the `oneAPI Spec page`_. 89 | 90 | oneAPI Technical Advisory Board activity: 91 | 92 | * The TAB is the place where future versions of the oneAPI 93 | specification are discussed. It is an invitation-based forum of 94 | industry experts, who give their input on the oneAPI 95 | Specification. It is critical to hear input from different 96 | companies to make the specification unified and able to meet 97 | the challenges of the image processing area. 98 | 99 | * There are 3 other TABs in different domains: the first TAB was 100 | created about 2 years ago for the DPC++ language and library, then the 101 | oneMKL (Math Kernel Libraries) TAB started in May 2020, and the AI TAB 102 | (oneDNN and oneCCL) has been active since February 2021. More information about 103 | the TABs may be found on the `oneAPI TAB GitHub`_. 104 | 105 | * The goal of the oneIPL TAB is to review and collect feedback on the oneIPL 106 | spec, to make the new specification address the challenges of image 107 | processing development, and to adjust it to industry needs. This 108 | input will help to shape its next revisions. 109 | 110 | * The latest version of the oneIPL Spec is published on the `oneIPL Spec 111 | page`_.
112 | 113 | Overview of oneAPI Image Processing Library (oneIPL): 114 | 115 | * Victor Getmanskiy presented a high-level overview of 116 | oneIPL. oneIPL provides a DPC++ API for image processing 117 | functionality, which is inherited from the classic Intel Integrated 118 | Performance Primitives (Intel IPP) and supports xPU 119 | execution. Every oneIPL DPC++ API contains CPU-optimized kernels and 120 | GPU-optimized kernels, and could also be extended to support other 121 | xPU optimizations in the future. 122 | 123 | * The oneIPL API provides a C++ abstraction over image data which maps to 124 | the most accelerated memory available for the format and data type: 125 | host memory, shared memory, device memory, or partially device tiled 126 | memory. 127 | 128 | * The oneIPL API is based on DPC++ and sycl::queue, making it possible to 129 | construct image processing pipelines and include any oneAPI API calls based 130 | on a DPC++ queue targeted to different xPUs. Calls are asynchronous 131 | and scheduled by the runtime for the target devices. 132 | 133 | * Currently the provisional oneIPL Spec v0.5 is published. It contains 134 | all functionality targeted for the first oneIPL beta release 135 | in 2022: the most commonly used functions, such as geometry 136 | transformations (resizing, mirroring), color conversions (RGB to plane, 137 | RGB to NV12), filtering, type conversion, 138 | and other functions. 139 | 140 | * The list of topics for the first technical discussions is in 141 | `oneIPL TAB kickoff`_. 142 | 143 | oneIPL TAB rules and schedule: 144 | 145 | * DO NOT share any confidential information or trade secrets with the 146 | group 147 | 148 | * Focus on high-level discussion of the oneIPL Specification - not on 149 | implementation details 150 | 151 | * Please submit feedback in writing on GitHub in accordance with the 152 | `oneAPI Contribution Guidelines`_.
This will allow Intel to further 153 | upstream your feedback to other standards bodies, including The 154 | Khronos Group SYCL specification. 155 | 156 | * The oneIPL TAB will be a 1-hour meeting every 2 weeks while discussing the 157 | main content of Spec v0.5 158 | 159 | * It will move to one meeting every 4 weeks after the main topics are covered 160 | 161 | * A technical expert (any TAB member) presents a proposal for the spec; the 162 | group discusses the topic and collects feedback 163 | 164 | * All the materials and meeting minutes will be published on the `oneAPI 165 | TAB GitHub`_. 166 | 167 | * Offline feedback from oneIPL TAB members will also be processed 168 | and discussed at the next meeting 169 | 170 | * The first technical meeting of the oneIPL TAB is planned for December 171 | 16th. Then we will break for the New Year holidays and start a bi-weekly 172 | series of meetings from January 20th (ww4) or February 3rd (ww6) - TBD 173 | 174 | * For cross-area topics a cross-component TAB could be 175 | organized. The first Cross TAB session is planned for December 14th - 176 | the invitation has been sent to oneIPL TAB members. Feel free to attend. 177 | 178 | .. _`oneAPI Contribution guidelines`: https://spec.oneapi.io/versions/latest/introduction.html#contribution-guidelines 179 | .. _`oneAPI TAB GitHub`: https://github.com/oneapi-src/oneAPI-tab 180 | .. _`oneAPI home page`: https://www.oneapi.io/ 181 | .. _`oneAPI Spec page`: https://www.oneapi.io/spec/ 182 | .. _`oneIPL Spec page`: https://spec.oneapi.io/oneipl/latest/index.html 183 | .. 
_`oneIPL TAB kickoff`: ../presentations/2021-11-29_Slides.pdf 184 | -------------------------------------------------------------------------------- /image/minutes/2021_12_16_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneIPL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2021-12-16 6 | ========== 7 | 8 | Slides: 9 | 10 | * `oneIPL TAB #1`_ 11 | 12 | Attendees: 13 | 14 | * Svetlana Podogova (Intel) 15 | * Victor Getmanskiy (Intel) 16 | * Mark Rabotnikov (Philips) 17 | * SungShik Baik (Samsung Medison) 18 | * Tim van der Horst (Philips) 19 | * Robert Schneider (Siemens Healthineers) 20 | * Sergey Ivanov (Intel) 21 | * Maksim Shabunin (Intel) 22 | * Sohrab Amirghodsi (Adobe) 23 | * Valentin Kubarev (Intel) 24 | * Dmitriy Budnikov (Intel) 25 | 26 | Agenda: 27 | 28 | * Welcoming the oneIPL TAB participants 29 | * Programming model of oneIPL 30 | * Execution model of oneIPL 31 | * Image processing pipelines 32 | * Image data abstraction in oneIPL 33 | * Memory model in oneIPL 34 | * Closing words and next plans on oneIPL TAB 35 | 36 | 37 | oneIPL specification walk-through: 38 | 39 | * The latest version of the oneIPL Spec is published on the `oneIPL Spec 40 | page`_. 41 | 42 | * The oneAPI Image Processing Library (oneIPL) consists of several domains, 43 | which include basic functionality, color conversion operations, filtering, and 44 | geometry-related functionality, and can potentially be extended to 3D 45 | operations. The current version of the specification covers the most important 46 | parts of the domains and might be extended in future versions. 47 | 48 | * The oneIPL programming language is SYCL 2020, based on C++17. 49 | oneIPL primitives include class data abstractions and a functional API. 50 | 51 | * The ipl::image class specifies image data, layout and supported data types, 52 | which are defined at compile-time.
Each algorithm in the specification 53 | contains a layout and data type support matrix. This matrix describes 54 | generic layouts as channel count – rows (1, 3, 4 channels), and generic 55 | data types. 56 | 57 | * Generic formats are usually supported on multiple devices. Some data 58 | types are device specific. 59 | 60 | * oneIPL APIs follow the SYCL xPU ideology. Each API is able to execute on a 61 | range of devices, if the device supports the required features. A oneIPL 62 | algorithm shall not substitute an unsupported type with a different supported 63 | type if it impacts the result. If the type is not supported on the device, 64 | an error is raised. SYCL has such checks in kernels as runtime exceptions; 65 | oneIPL can perform the check beforehand, since the queue provides information on 66 | device features, and the spec specifies the error conditions. 67 | 68 | * The execution mode is asynchronous. Algorithms are submitted to a sycl::queue 69 | and return control immediately. Execution is scheduled by the runtime, taking into 70 | account the dependencies vector. For arguments of type ipl::image, 71 | dependencies are handled automatically (example on slide 12). 72 | 73 | * Example with a splitting pipeline: two functions taking a single input and 74 | producing multiple outputs can be executed in parallel, since calls are asynchronous. 75 | 76 | * A more complex pipeline with watermarking (slide 14) – user-provided kernels 77 | like calculating an overlay and blending. Some stages can produce output 78 | required on the host, so a shared memory allocator shall be specified in that case. 79 | If the image is purely on the device, all intermediate outputs shall use a device 80 | allocator. 81 | 82 | * oneIPL image abstraction: an image can potentially be implemented over host, 83 | shared, device and special texture memory. From an implementation perspective, the 84 | first three can be done via USM (SYCL 2020); texture can be implemented 85 | via sycl::image. 86 | 87 | * Memory model.
The oneapi::ipl::image class is the basic data abstraction for image 88 | data. oneIPL provides a single abstraction over different memory types. 89 | oneIPL supports different types of memory – device, host, shared and special 90 | GPU texture (image) memory. 91 | 92 | * See more details on the `oneIPL Architecture page`_ 93 | 94 | 95 | oneIPL specification open discussion: 96 | 97 | * Robert Schneider, Siemens Healthineers: Can all APIs be used with 98 | HW-accelerated types? 99 | Victor Getmanskiy, Intel: HW-accelerated types are currently very limited 100 | by SYCL standard images – only 4-channel images of limited data types are 101 | supported. We are trying to extend image support to a wider set of 102 | format-type combinations in the next SYCL standard, so wider HW-accelerated 103 | support would be possible in that case. 104 | 105 | * Robert Schneider, Siemens Healthineers: Is there a query for HW features like 106 | format/type? 107 | Victor Getmanskiy, Intel: Good question; we also see a gap here and are considering 108 | such an extension to current SYCL images in future standards. Currently we can 109 | only query whether image/FP64/FP16 are supported or not. 110 | 111 | * Sergey Ivanov, OpenCV, Intel: You've said the SYCL queue carries the device 112 | context - do all the objects need to belong to the same context? 113 | Victor Getmanskiy, Intel: Yes, and the specification recommends that each function 114 | have runtime checks that the queue and image objects belong to the same context. 115 | The checks can be disabled in an implementation to avoid performance overhead. 116 | The sycl::queue and the data pointer on slide 12 are specified in the image 117 | constructor, so the implementation can determine if the pointer and queue are 118 | related to the same device. 119 | 120 | * Tim van der Horst, Philips: If the checks for the supported data types are 121 | done before the kernel, are they done in each call?
122 | Victor Getmanskiy, Intel: The implementation can be done in a way that allows 123 | disabling the checks to avoid performance overhead. In that case, a single initial 124 | check is required in the user application. 125 | 126 | * Robert Schneider, Siemens Healthineers: Can the ipl::image be constructed 127 | over a sycl::image? 128 | Victor Getmanskiy, Intel: Currently sycl::image 2020 is not supported by any 129 | compiler, and the SYCL image specification is being improved, but in the future, 130 | based on the improved image specification, such construction could be possible. 131 | 132 | * Sohrab Amirghodsi, Adobe: What is shared memory? 133 | Victor Getmanskiy, Intel: It is memory which is implicitly copied from host to 134 | device and back. Device and host memory must be copied explicitly. 135 | 136 | 137 | Next plans on oneIPL TAB: 138 | 139 | * The next technical meeting of the oneIPL TAB is planned for February 3rd (ww6). 140 | The invitation for the meeting will be sent in mid-January. 141 | 142 | * The next topic for discussion is `oneIPL Image data abstraction`_. 143 | 144 | .. _`oneIPL Image data abstraction`: https://spec.oneapi.io/oneipl/latest/image/index-image.html 145 | .. _`oneIPL Architecture page`: https://spec.oneapi.io/oneipl/latest/architecture/index-architecture.html 146 | .. _`oneIPL Spec page`: https://spec.oneapi.io/oneipl/latest/index.html 147 | .. 
_`oneIPL TAB #1`: ../presentations/2021-12-16_Slides.pdf 148 | -------------------------------------------------------------------------------- /image/minutes/2022_02_03_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneIPL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2022-02-03 6 | ========== 7 | 8 | Slides: 9 | 10 | * `oneIPL TAB #2`_ 11 | 12 | Attendees: 13 | 14 | * Svetlana Podogova (Intel) 15 | * Victor Getmanskiy (Intel) 16 | * Mark Rabotnikov (Philips) 17 | * Ashish Uthama (Mathworks) 18 | * Tim van der Horst (Philips) 19 | * Robert Schneider (Siemens Healthineers) 20 | * Sergey Ivanov (Intel, OpenCV/G-API) 21 | * Maksim Shabunin (Intel, OpenCV) 22 | * Sohrab Amirghodsi (Adobe) 23 | * Valentin Kubarev (Intel) 24 | * Aleksandr Duev (Intel) 25 | * Dmitriy Budnikov (Intel) 26 | * Alison L Richards (Intel) 27 | 28 | Agenda: 29 | 30 | * Introduction and Open questions discussion 31 | * Hardware-accelerated images and data formats 32 | * oneIPL image data abstraction 33 | * oneIPL image interoperability with USM 34 | * Closing words and next plans on oneIPL TAB 35 | 36 | Open question discussion: 37 | 38 | * Problem statement: Accuracy `across devices is different`_ because CPU and 39 | GPU devices support different levels of IEEE 754 compliance, the standard libraries 40 | make no claims about correct rounding, and the order of operations impacts the result 41 | since the algorithms have different flows on different devices. 42 | 43 | * oneIPL is supposed to control the precision of computations within a supported 44 | computation datatype – ComputeT, which is a template parameter of the 45 | functions. By default it is float. Double (if supported by the device) 46 | could be more precise but slower, and half might be faster but with 47 | lower accuracy.
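The ComputeT idea above can be sketched in a few lines of plain C++. This is an illustrative sketch only, not oneIPL API: `normalize_pixel` and its signature are invented here to show how an intermediate compute type could be selected per call via a template parameter.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical sketch of the ComputeT idea: the same pixel arithmetic,
// templated on the intermediate compute type. All intermediate math
// happens in ComputeT: float by default, double for more precision at
// lower speed, half for more speed at lower accuracy (where supported).
template <typename ComputeT = float>
ComputeT normalize_pixel(int value, int maximum) {
    return static_cast<ComputeT>(value) / static_cast<ComputeT>(maximum);
}

// 128/255 is not exactly representable in binary floating point, so the
// float and double results agree only to float precision; the low-order
// bits differ, which is exactly the reproducibility question above.
```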
48 | 49 | * To discuss: 50 | * Are there any specific expectations from an image processing perspective? 51 | * How important is the accuracy for different use-cases? 52 | * Are there any criteria for result similarity across devices? 53 | * Are there any image similarity metrics not related to accuracy which 54 | are required to be fulfilled, like PSNR? 55 | 56 | * **Ashish Uthama, Mathworks**: For CPU and GPU comparison we see this 57 | difference and can tolerate +-1 pixel variation. For floating point it 58 | depends on the pipeline depth: how many calculations you have. 59 | 60 | Requirement: it should be OK for the whole pipeline; no other special 61 | expectations. But it is important to have reproducibility - to have the same 62 | result on the same device. Across different devices a difference in pixels 63 | is OK. 64 | 65 | Another thing observed by Mathworks is different run-to-run 66 | results when using threading – it is important to have similar run-to-run 67 | results. 68 | 69 | Mathworks uses pixel-wise comparison, and sometimes structural 70 | similarity metrics for regression tests. 71 | Ashish Uthama shared a link to a `structural similarity article`_. 72 | 73 | * **Tim van der Horst, Philips**: The same; we also do pixel-wise comparison. 74 | 75 | * **Robert Schneider, Siemens Healthineers**: Accuracy does not necessarily need to 76 | be the same across CPU and GPU, but the difference should not be too great, and 77 | should not add artifacts or noise. 78 | 79 | +-1 is OK, but there should also not be a big difference visually, 80 | and AI algorithms should not be confused by this difference. 81 | 82 | 83 | 84 | oneIPL specification walk-through: 85 | 86 | * The latest version of the oneIPL Spec is published on the `oneIPL Spec page`_. 87 | 88 | * The current provisional spec version is 0.5; spec 0.6 is in progress.
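The acceptance criterion the members described above (pixel-wise comparison with a +-1 per-pixel tolerance) can be written as a small helper. This is a hedged sketch: the function name, signature, and default tolerance are illustrative, not anything from the oneIPL spec or the members' test suites.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Pixel-wise comparison of two 8-bit images flattened into vectors,
// tolerating a difference of at most `tolerance` counts per pixel
// (+-1 by default, matching the tolerance discussed above).
bool images_match(const std::vector<std::uint8_t>& a,
                  const std::vector<std::uint8_t>& b,
                  int tolerance = 1) {
    if (a.size() != b.size()) return false;  // shapes must agree exactly
    for (std::size_t i = 0; i < a.size(); ++i) {
        int diff = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        if (std::abs(diff) > tolerance) return false;
    }
    return true;
}
```

A regression test would run the same pipeline on CPU and GPU and assert `images_match(cpu_out, gpu_out)`; run-to-run reproducibility on a single device would instead be checked with `tolerance = 0`.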
89 | 90 | * The main differences between spec v0.5 and v0.6 are the following: 91 | * ipl::formats replaced by ipl::layouts 92 | * Image constructors are changed to remove the dependency on the implementation 93 | * The default image allocator will be USM 94 | * Methods of image auxiliary classes moved to the image API 95 | * Switched to generic template parameters 96 | * Gaussian filter with separate sigma for the x and y axes 97 | * Normalize without sycl::buffer in the spec 98 | 99 | * The sycl::image supported formats list was reduced. 100 | Mostly only the C4 ipl::image formats are mapped to hardware-accelerated 101 | images. 102 | 103 | * Currently there is no support for SYCL 2020 images in compilers, so 104 | hardware images are used only inside the implementation, and there is no option 105 | to have any public API working with SYCL images in the spec. 106 | 107 | * The SYCL image is still incomplete; the SYCL 2020 image still has no implementation 108 | in compilers. So the public API has no option to use sycl::image; the only option is 109 | the implementation and image (texture) memory access. 110 | 111 | * The list of supported formats was reduced in SYCL 2020. 112 | Q: Is there any feedback on this change? 113 | 114 | **Robert Schneider, Siemens Healthineers**: Grayscale images are very 115 | important, and the texture images are also blocking. 116 | 117 | **Ashish Uthama, Mathworks**: Agree. 118 | 119 | **Tim van der Horst, Philips**: Agree; the grayscale image format is very 120 | important for medical. 121 | 122 | **Mark Rabotnikov, Philips** (offline feedback): Support of gray-scale 123 | images: all medical scanners (like CT, MR, PET, etc.) create gray-scale 124 | images, usually 16-bit, but they can also be 8-bit. In addition, gray-scale 125 | 32-bit floating point images are widely used during processing, including 126 | AI, etc. So in my view it is very important to fully support such images. 127 | 128 | * One of the important changes in oneIPL Spec v0.6 is changing from Formats to 129 | Layouts.
130 | Reasons: 131 | 132 | * Layout is a part of compile-time dispatch and defines the algorithm 133 | in the kernel; multiple formats are mapped to a single layout: 134 | channel4 layout -> rgba/bgra/argb/abgr/cmyk/… formats 135 | 136 | * Align to the industry-standard approach (OpenCV, Python libraries, 137 | Intel® Integrated Performance Primitives/NVIDIA Performance Primitives). 138 | 139 | * Significantly affected functions: color conversions, which require a specific 140 | color format. 141 | 142 | * **Ashish Uthama, Mathworks**: Does the scope of oneIPL supported formats 143 | include 3D data or only 2D? 144 | **Victor Getmanskiy, Intel**: Only 2D formats are included right now, 145 | but this will be extended in the future to cover 3D as well 146 | (Reference: slide 5 in the `oneIPL TAB #1 presentation`_ ) 147 | 148 | * Basic terminology is discussed (region of interest, pitch, width, length) 149 | 150 | * The oneapi::ipl::image class is the basic data abstraction for image data in oneIPL. 151 | oneIPL provides a single abstraction over different memory types: host, device, 152 | shared, and special GPU images (textures). 153 | 154 | * The ipl::image class is reviewed and the APIs are discussed: 155 | 156 | **Robert Schneider, Siemens Healthineers**: What is returned by 157 | the get_pointer() function? 158 | 159 | **Victor Getmanskiy, Intel**: This is the pointer to the full image. 160 | 161 | **Robert Schneider, Siemens Healthineers**: And what if the hardware 162 | texture is used? 163 | 164 | **Victor Getmanskiy, Intel**: This is a very important question. It can be 165 | extended as soon as texture images are added to the SYCL standard with the 166 | capability to return device memory. But for now it is hard to introduce it in the 167 | spec, since it is not allowed in the SYCL standard. 168 | 169 | **Robert Schneider, Siemens Healthineers**: It would be good to make the API 170 | more forward-looking, more general, to take into account potential future 171 | extensions.
For example, introduce some structure / class / handler for it. 172 | 173 | **Ashish Uthama, Mathworks**: we may use a more specific name for this 174 | function, something like get_USM_pointer(). 175 | 176 | * **All TAB members agreed to review offline** slide 14 of the 177 | `oneIPL TAB #2`_ presentation and to provide suggestions for more 178 | general API ideas and for the function naming. 179 | 180 | 181 | * **Mark Rabotnikov, Philips** (offline feedback): Regarding the naming of the 182 | methods in the image class: in my view there is some inconsistency in naming with 183 | respect to ROI. For example, I think it would be clearer to change 184 | get_width/get_height to get_roi_width/get_roi_height. 185 | 186 | * **Sergey Ivanov, OpenCV/G-API** (offline feedback): I believe the is_roi(), 187 | get_roi_rect(), get_full_image(), get_roi(), get_width(), get_height(), 188 | get_full_width() and get_full_height() functions are redundant in the image class. 189 | These can be represented by only 3 methods: 190 | 191 | * const roi_rect& get_rect() const; 192 | 193 | * auto original() const -> image; 194 | 195 | * auto apply_roi(const roi_rect& roi_rect) const -> image; 196 | 197 | Then the following equivalences hold: 198 | * object.is_roi() -> object.get_rect() == object.original().get_rect() 199 | * object.get_full_width() -> object.original().get_rect().width() 200 | * object.get_full_height() -> object.original().get_rect().height() 201 | * object.get_width() -> object.get_rect().width() 202 | * object.get_height() -> object.get_rect().height() 203 | 204 | Also, I suppose that get_pointer() should be part of the USM template 205 | specialization only; instead it is possible to introduce: 206 | 207 | * auto get_handle() -> cl::sycl::handle 208 | 209 | * The different image constructors are discussed. 210 | 211 | * An example of a custom kernel with USM usage is discussed.
212 | 213 | * See more details on the `oneIPL Image Data Abstraction page`_ 214 | 215 | 216 | Next plans on oneIPL TAB: 217 | 218 | * The next technical meeting for oneIPL TAB is planned for February 17th (ww8) 219 | 220 | * The next topic for discussion is Memory Allocation and oneIPL Library 221 | design details. 222 | 223 | .. _`oneIPL Image Data Abstraction page`: https://spec.oneapi.io/oneipl/latest/image/index-image.html 224 | .. _`oneIPL Spec page`: https://spec.oneapi.io/oneipl/latest/index.html 225 | .. _`oneIPL TAB #2`: ../presentations/2022-02-03_Slides.pdf 226 | .. _`structural similarity article`: https://www.mathworks.com/help/images/ref/ssim.html 227 | .. _`across devices is different`: https://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf 228 | .. _`oneIPL TAB #1 presentation`: ../presentations/2021-12-16_Slides.pdf 229 | -------------------------------------------------------------------------------- /image/minutes/2022_02_17_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneIPL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2022-02-17 6 | ========== 7 | 8 | Slides: 9 | 10 | * `oneIPL TAB #3`_ 11 | 12 | Attendees: 13 | 14 | * Svetlana Podogova (Intel) 15 | * Victor Getmanskiy (Intel) 16 | * Mark Rabotnikov (Philips) 17 | * Ashish Uthama (Mathworks) 18 | * Robert Schneider (Siemens Healthineers) 19 | * Sergey Ivanov (Intel, OpenCV/G-API) 20 | * Maksim Shabunin (Intel, OpenCV) 21 | * Valentin Kubarev (Intel) 22 | * Dmitriy Budnikov (Intel) 23 | 24 | 25 | Agenda: 26 | 27 | * Introduction and Results of previous discussion of oneIPL API 28 | * oneIPL Memory allocation and temporary images 29 | * oneIPL Domains 30 | * oneIPL Error handling mechanism 31 | * Closing words and next plans on oneIPL TAB 32 | 33 | Introduction and Results of previous discussion of oneIPL API: 34 |
35 | * The current provisional spec version is 0.5; spec 0.6 is in progress and will 36 | be published soon. 37 | 38 | * The main differences between spec ver 0.5 and ver 0.6 are the following: 39 | * ipl::formats replaced by ipl::layouts 40 | * Image constructors are changed to remove the dependency on the implementation 41 | * Default image allocator will be USM 42 | * Methods of image auxiliary classes moved to the image API 43 | * Switched to generic template parameters 44 | * Gaussian filter with separate sigma for the x and y axes 45 | * Normalize without sycl::buffer in spec 46 | 47 | * At the previous meeting, the discussion about the oneIPL API was started. 48 | TAB members agreed to review slide 14 of the `oneIPL TAB #2`_ 49 | presentation offline and to provide suggestions for more general API ideas 50 | and for the function naming. 51 | 52 | The offline feedback was collected and processed (see the offline feedback in 53 | `oneIPL TAB #2 minutes`_). The following changes were applied to the 54 | image class for the oneIPL Specification ver 0.6: 55 | get_pointer() is replaced by a general get_access(args) function, which is 56 | common to the USM and Image (Texture) access abstractions. 57 | 58 | The "unspecified" return type of this API is not language-related, but a 59 | spec-level placeholder for the return type, as per the SYCL specification. 60 | It can be defined differently for a particular class specialization in the 61 | implementation. 62 | 63 | A minimal API approach is applied to the image class per the oneIPL TAB feedback: 64 | the get_rect(), get_roi() and get_origin() functions remain to provide access to 65 | the necessary part of the image, and the size accessors are now moved to the 66 | roi_rect struct (see slide 7 of `oneIPL TAB #3`_ for the details). 67 | 68 | * **Mark Rabotnikov, Philips**: Why do we need get_origin()? Isn't it the same 69 | as the image itself?
70 | 71 | **Victor Getmanskiy, Intel**: yes, it may be needed in user kernels, for example 72 | to use neighboring pixels when processing borders. 73 | It is a common case for filters, when you need to access pixels outside the 74 | processing area. Also, it may be used to create a different ROI on the same 75 | origin. 76 | 77 | 78 | oneIPL Memory Model: 79 | 80 | * The oneIPL spec supports different types of memory, assigned via AllocatorT or 81 | provided by the user via a pointer to the image constructor: host USM, device USM, 82 | shared USM and Image (texture) memory. 83 | 84 | The USM allocator has a major difference from std::allocator: it cannot be 85 | constructed without a context, so a device context is required. This complicates 86 | the APIs, since the allocator has no default constructor. 87 | 88 | * Victor Getmanskiy presented the different ways of memory allocation on host 89 | and device and the different types of access and control (see slide 9 for 90 | details) 91 | 92 | * **Ashish Uthama, Mathworks**: What happens at the point of access? 93 | Is a copy operation performed? 94 | 95 | **Victor Getmanskiy, Intel**: It depends on the allocation and access types: 96 | * If you have a device allocation and memory access on the device -> no copy is 97 | performed. 98 | * If you have a shared allocator and read access -> no copy either, but 99 | for write access a copy to the host happens after the kernel is finished. 100 | * In the image_allocator_t case the memory is allocated on both host 101 | and device, and access is possible only for write or for read, but not both. 102 | In the case of read no copy to the host happens; in the case of write a copy to the host 103 | happens after the kernel is finished. 104 | 105 | * **Ashish Uthama, Mathworks**: Actually, is this a SYCL limitation? It is not so 106 | clear: if you have read-only access, you do not know whether a copy is performed or not. 107 | I prefer an explicit copy operation, so I am wondering if this is a 108 | restriction of SYCL?
109 | 110 | **Victor Getmanskiy, Intel**: For USM it is possible to implement an 111 | explicit copy; for Image it is not controlled (a SYCL limitation). You can still 112 | explicitly copy USM memory and create an image with device_allocator_t over that 113 | memory without a copy. 114 | 115 | For Images (Textures), creating an object over device memory is currently not 116 | possible because of SYCL limitations; only construction from a host pointer is 117 | available. But we are working on that as well. 118 | 119 | * **Ashish Uthama, Mathworks**: I have a question about the device context and 120 | the usage of different data allocators. For example, if I integrate with a 3rd- 121 | party library and some data is allocated, is it possible to get the same 122 | context? 123 | 124 | **Victor Getmanskiy, Intel**: This is not done in the library, but there is a 125 | technical possibility to obtain the context; for example, the CUDA backends in 126 | open-source oneMKL are capable of working over the context of different devices like 127 | NVIDIA. I think in this case the memory type will be device, and mapping over 128 | such memory with the proper queue would create an image over the memory without an extra 129 | copy operation. 130 | 131 | * In spec ver 0.5 the default allocator was implementation-defined. 132 | For compatibility, in spec ver 0.6 the default allocator is 133 | shared USM. 134 | 135 | 136 | oneIPL Domains: 137 | 138 | * On slide 15, the domains in blue are currently described in the oneIPL 139 | Specification; the domains in yellow are to be considered in future versions of the 140 | spec. 141 | 142 | * The Color conversion domain includes conversion functions for the main formats; 143 | YUV422 and YUV444 to RGB are considered as a possible future extension. 144 | 145 | * The Color conversions Domain APIs are presented. 146 | 147 | * The Filtering Domain in oneIPL spec ver 0.6 includes the Sobel filter with a 148 | fixed 3x3 kernel size and the Gaussian filter.
A possible future extension of 149 | the oneIPL spec is considered to include Bilateral, Sobel, Filter Box and Filter 150 | Median. 151 | 152 | * **Mark Rabotnikov, Philips** (slide 20): What is crop in this context? 153 | 154 | **Victor Getmanskiy, Intel**: this controls whether the operation takes into account 155 | pixels from the origin while processing the ROI. 156 | If you turn crop ON -> the pixels outside the ROI are not accessible. 157 | 158 | **Mark Rabotnikov, Philips**: Is it needed for borders only? 159 | 160 | **Victor Getmanskiy, Intel**: It affects how the borders are processed, 161 | but you can also optimize the implementation and accelerate the processing 162 | with this option. E.g. for big images the cropped ROI can be put into faster 163 | memory, providing 2D data locality. 164 | 165 | * The Transformation Domain includes resize functions with bilinear, bicubic, 166 | lanczos and supersampling interpolation types, and also mirroring. Future 167 | extensions are: the Nearest neighbor Resize and Warp Affine bilinear 168 | functions. 169 | 170 | * **Mark Rabotnikov, Philips**: It makes a lot of sense to take into account 171 | which operations are the most useful for the AI domain. Resize is the major one for 172 | sure, but there are also some basic operations for Deep Learning and Neural 173 | Network training. Augmentation, inference, normalize – we need 174 | to take a look at this domain's needs for further Specification extensions. 175 | 176 | * Several directions of future oneIPL Spec extensions were presented; 177 | feedback and suggestions on the most important steps are appreciated. 178 | 179 | oneIPL Error Handling mechanism: 180 | 181 | * The oneIPL Error Handling mechanism relies on C++ 182 | exceptions. oneIPL additionally has a requirement to implement 183 | compile-time checks, which can be based on template parameters. 184 | 185 | * Sync and async exception handling flows were presented.
Exception types and 186 | examples from the oneIPL spec were reviewed. 187 | 188 | 189 | Next plans on oneIPL TAB: 190 | 191 | * The next technical meeting for oneIPL TAB is planned for March 3rd (ww10) 192 | 193 | * The next topic for discussion is the oneIPL Functions overview. 194 | 195 | * After covering the main topics, the oneIPL TAB will have meetings once every 196 | 4 weeks. 197 | 198 | .. _`oneIPL TAB #3`: ../presentations/2022-02-17_Slides.pdf 199 | .. _`oneIPL TAB #2`: ../presentations/2022-02-03_Slides.pdf 200 | .. _`oneIPL TAB #2 minutes`: 2022_02_03_Minutes.rst 201 | -------------------------------------------------------------------------------- /image/presentations/2021-11-29_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/image/presentations/2021-11-29_Slides.pdf -------------------------------------------------------------------------------- /image/presentations/2021-12-16_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/image/presentations/2021-12-16_Slides.pdf -------------------------------------------------------------------------------- /image/presentations/2022-02-03_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/image/presentations/2022-02-03_Slides.pdf -------------------------------------------------------------------------------- /image/presentations/2022-02-17_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/image/presentations/2022-02-17_Slides.pdf
-------------------------------------------------------------------------------- /image/presentations/2023-06-21_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/image/presentations/2023-06-21_Slides.pdf -------------------------------------------------------------------------------- /image/presentations/2023-09-21_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/image/presentations/2023-09-21_Slides.pdf -------------------------------------------------------------------------------- /language/presentations/2019-11-17-dpcpp-language-and-extensions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2019-11-17-dpcpp-language-and-extensions.pdf -------------------------------------------------------------------------------- /language/presentations/2019-11-17-oneAPI-vision-for-TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2019-11-17-oneAPI-vision-for-TAB.pdf -------------------------------------------------------------------------------- /language/presentations/2019-11-17-oneDPL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2019-11-17-oneDPL.pdf -------------------------------------------------------------------------------- /language/presentations/2020-01-28-TAB-DPCPPMeeting2_v7.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-01-28-TAB-DPCPPMeeting2_v7.pdf -------------------------------------------------------------------------------- /language/presentations/2020-03-04-TAB-C++-Minimum-Version.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-03-04-TAB-C++-Minimum-Version.pdf -------------------------------------------------------------------------------- /language/presentations/2020-03-25-USM-for-TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-03-25-USM-for-TAB.pdf -------------------------------------------------------------------------------- /language/presentations/2020-04-22-oneDPL-for-TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-04-22-oneDPL-for-TAB.pdf -------------------------------------------------------------------------------- /language/presentations/2020-05-oneDPL-for-TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-05-oneDPL-for-TAB.pdf -------------------------------------------------------------------------------- /language/presentations/2020-07-01-TAB-Atomics.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-07-01-TAB-Atomics.pdf -------------------------------------------------------------------------------- /language/presentations/2020-07-22 accessor simplification.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-07-22 accessor simplification.pdf -------------------------------------------------------------------------------- /language/presentations/2020-08-26-TAB-Extension-Mechanism.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-08-26-TAB-Extension-Mechanism.pdf -------------------------------------------------------------------------------- /language/presentations/2020-08-26-TAB-LocalMemory.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-08-26-TAB-LocalMemory.pdf -------------------------------------------------------------------------------- /language/presentations/2020-09-23-TAB-Extension-Naming.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-09-23-TAB-Extension-Naming.pdf -------------------------------------------------------------------------------- /language/presentations/2020-09-23-TAB-Function-pointers.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-09-23-TAB-Function-pointers.pdf -------------------------------------------------------------------------------- /language/presentations/2020-10-28-TAB-specFeedback.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-10-28-TAB-specFeedback.pdf -------------------------------------------------------------------------------- /language/presentations/2020-12-16-TAB-DPCPP-NUMA-Discussion.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-12-16-TAB-DPCPP-NUMA-Discussion.pdf -------------------------------------------------------------------------------- /language/presentations/2020-12-16-TAB-oneAPI-year-one.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2020-12-16-TAB-oneAPI-year-one.pdf -------------------------------------------------------------------------------- /language/presentations/2021-02-24-TAB-dpcpp-implementation-prioritization.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2021-02-24-TAB-dpcpp-implementation-prioritization.pdf -------------------------------------------------------------------------------- /language/presentations/2021-04-21-oneDPL-for-TAB.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2021-04-21-oneDPL-for-TAB.pdf -------------------------------------------------------------------------------- /language/presentations/2021-05-26-TAB-invoke_simd.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2021-05-26-TAB-invoke_simd.pdf -------------------------------------------------------------------------------- /language/presentations/2021-07-28-TAB-DPCPP-properties.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2021-07-28-TAB-DPCPP-properties.pdf -------------------------------------------------------------------------------- /language/presentations/2021-08-25-TAB-oneAPIv2-runtime.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2021-08-25-TAB-oneAPIv2-runtime.pdf -------------------------------------------------------------------------------- /language/presentations/2021-09-22-TAB-dynamic-selection.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2021-09-22-TAB-dynamic-selection.pdf -------------------------------------------------------------------------------- /language/presentations/2021-10-27-TAB-distributed-computing.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2021-10-27-TAB-distributed-computing.pdf -------------------------------------------------------------------------------- /language/presentations/2022-09-28-TAB-SYCL-Graph.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2022-09-28-TAB-SYCL-Graph.pdf -------------------------------------------------------------------------------- /language/presentations/2022-10-26-TAB-parallelism.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2022-10-26-TAB-parallelism.pdf -------------------------------------------------------------------------------- /language/presentations/2023-03-14-TornadoVM.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2023-03-14-TornadoVM.pdf -------------------------------------------------------------------------------- /language/presentations/2023-06-07-DK-matrix-oneapi-language.pdf.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2023-06-07-DK-matrix-oneapi-language.pdf.pdf -------------------------------------------------------------------------------- /language/presentations/2023-06-07_JointMatrix_NVIDIA.pdf.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/2023-06-07_JointMatrix_NVIDIA.pdf.pdf -------------------------------------------------------------------------------- /language/presentations/oneAPI-TAB-20220727-Kernel-Fusion.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/oneAPI-TAB-20220727-Kernel-Fusion.pdf -------------------------------------------------------------------------------- /language/presentations/oneAPI-TAB-Rules-of-the-Road.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/oneAPI-TAB-Rules-of-the-Road.pdf -------------------------------------------------------------------------------- /language/presentations/oneapi community forum governance Sept 2022.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/language/presentations/oneapi community forum governance Sept 2022.pdf -------------------------------------------------------------------------------- /math/README.rst: -------------------------------------------------------------------------------- 1 | =============================== 2 | oneAPI Community Forum Math SIG 3 | =============================== 4 | 5 | The Math SIG hosts discussions and presentations for fundamental 6 | mathematical routines for use in high-performance computing, 7 | engineering, financial and other applications. Functionality 8 | that is covered may include dense linear algebra, sparse linear 9 | algebra, discrete Fourier transforms, random number generators, 10 | summary statistics and vector math. 
In addition to 11 | domain-specific functionality, the Math SIG may also discuss 12 | overall design features like the execution model, memory model, 13 | or error handling. At times, the open source oneMKL Interfaces 14 | project, which is an implementation of the oneMKL specification, 15 | may also be discussed. 16 | 17 | The Math SIG is led by Sarah Knepper. 18 | 19 | To find out how to join the Math SIG `get in touch. `__ 20 | 21 | Meeting notes: 22 | ============== 23 | 24 | | `May 20, 2020 `__ 25 | | `June 3, 2020 `__ 26 | | `June 17, 2020 `__ 27 | | `July 1, 2020 `__ 28 | | `July 15, 2020 `__ 29 | | `August 12, 2020 `__ 30 | | `September 09, 2020 `__ 31 | | `November 11, 2020 `__ 32 | | `December 16, 2020 <../tab-dpcpp-onedpl/README.rst>`__ 33 | | `January 27, 2021 `__ 34 | | `February 24, 2021 `__ 35 | | `March 24, 2021 `__ 36 | | `May 19, 2021 `__ 37 | | `June 16, 2021 `__ 38 | | `July 14, 2021 `__ 39 | | `October 6, 2021 `__ 40 | | `December 14, 2021 <../cross-tab/README.rst>`__ 41 | | `March 23, 2022 `__ 42 | | `June 8, 2022 `__ 43 | | `July 27, 2022 `__ 44 | | `September 21, 2022 `__ 45 | | `October 5, 2022 `__ 46 | | `December 14, 2022 `__ 47 | | `March 8, 2023 `__ 48 | | `May 17, 2023 `__ 49 | | `July 12, 2023 `__ 50 | | `September 20, 2023 `__ 51 | | `October 25, 2023 `__ 52 | -------------------------------------------------------------------------------- /math/minutes/2020_06_03_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2020-06-03 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of oneMKL programming model <../presentations/2020-06-03_Slides.pdf>`__ 11 | * `oneMKL Specification `__ 12 | 13 | Attendees: 14 | 15 | * Peter Caday (Intel) 16 | * Marius Cornea (Intel) 17 | * Craig Garland (Intel) 18 | * Mark Hoemmen (Stellar Science) 19 | * 
Sarah Knepper (Intel) 20 | * Maria Kraynyuk (Intel) 21 | * Nevin Liber (ANL) 22 | * Piotr Luszczek (UTK) 23 | * Spencer Patty (Intel) 24 | * Pat Quillen (MathWorks) 25 | * Alison Richards (Intel) 26 | * Nichols Romero (ANL) 27 | * Shane Story (Intel) 28 | * Harry Waugh (University of Bristol) 29 | 30 | Agenda: 31 | 32 | * Welcoming remarks - all 33 | * Overview of oneMKL programming model - Maria Kraynyuk 34 | * Walk-through of the oneMKL Specification - Spencer Patty 35 | * Wrap-up and next steps 36 | 37 | Introduction: 38 | 39 | * Harry Waugh - PhD student of Simon McIntosh-Smith at the University of Bristol. Looking at how BLAS is used at different scales in HPC codes. 40 | 41 | Walk-through of the oneMKL Spec: 42 | 43 | * We'll look at parts of the oneMKL spec, which is in turn part of a bigger oneAPI specification. At the end of the day, the functionality is the most important part. 44 | * The dark blue items on slide 9 are what is currently in the oneMKL spec. At some future time, we may add some of the other features, like sparse solvers or summary statistics. Feedback on prioritization for additional domain areas would be appreciated. For multidimensional FFTs, we currently support up to 3D FFTs. 45 | * There are other parts of Intel MKL that aren't part of the oneMKL spec. 46 | 47 | `oneMKL Spec `__: 48 | 49 | * We just released revision 0.8 of the spec, with a goal to reach revision 1.0 by August. Any feedback you have would be appreciated. 50 | * We'll spend a few minutes going through the structure of the document. We have two sections: oneMKL Architecture and oneMKL Domains. 51 | * The Architecture chapter details the assumptions we are making, the design features. Someone who is writing an implementation to the spec would need to be aware of these. 52 | * In the Domains chapter, we define multiple domains of the oneAPI Math Kernel Library. It contains descriptions of the APIs, and it will continue to be extended. 
53 | * We've updated BLAS and LAPACK for revision 0.8 to be nearly complete, but the other domains will target revision 0.9. They are currently based on documentation from an earlier beta release of oneMKL. 54 | * There are both buffer-based and pointer-based (also known as USM for Unified Shared Memory) interfaces. The APIs are overloaded by data types. 55 | 56 | Discussion on lack of a matrix interface and buffers: 57 | 58 | * While this is supposed to be a C++ interface, it looks more like the traditional C interface. 59 | * Why was a matrix encapsulation type not chosen? 60 | * We did discuss this internally as we were developing APIs. There is an ongoing C++ proposal for mdarrays, so we didn't want to create our own interim types when there is a standard coming up. If it makes sense for us to do a matrix encapsulation object in the meantime, perhaps we should go down that route. 61 | * Why use 1-D buffers instead of 2-D buffers? 62 | * The 2-D buffers don't allow for leading dimensions or increments. We're currently treating SYCL buffers as essentially pointers with automatic data transfer and dependency tracking. 63 | * Do the SYCL buffers give size? Since you're not passing pointers, you could protect the user. 64 | * You can do memory safety here because you can query the size of the buffer. We haven't specified an exception to be thrown if the size of the buffer isn't sufficient, but that could be done. 65 | * Concern that the unsafety of C is being combined with the hassle of C++ (::, etc.). The SYCL buffers track dependencies, but don't encode the fact that they are matrices. 66 | * John Pennycook may have done some experiments using mdspan; we'll follow up. 67 | 68 | Other comments: 69 | 70 | * Is there a plan for human-readable names; e.g., norm() instead of nrm2()? This is being done in SLATE. 71 | * Some people are very familiar with Netlib Fortran BLAS API, so we didn't want to deviate too far from that. 
72 | * We wanted to be close to the BLAS++ from SLATE. There have been internal discussions on this; we may introduce such names in the future. 73 | * Once the proposed C++ linear algebra APIs become standard, we could extend to support those APIs. 74 | * These APIs don't support row major layout. 75 | * Yes, though there are some thoughts on how to do that: using an enum, or via a new namespace, since typically users won't combine row and column major operations. 76 | * There is value in modernizing the BLAS APIs. 77 | 78 | 79 | oneAPI MKL Programming Model (from the slides): 80 | 81 | * 3 examples showing the difference between the C and buffer APIs, the buffer and USM APIs, plus a more complex case. 82 | * The C APIs and the buffer APIs are pretty close to each other. 83 | * For DPC++, we create devices and buffers. We allocate memory on the host and wrap it in SYCL buffers. We call a BLAS function, passing the queue and buffers. 84 | * The main difference between the buffer and USM APIs is how we allocate memory. For the USM API, we create pointers and specify the context in which we want to allocate the memory. 85 | * In the more complex case, where we want to call several functions, the buffers handle the synchronization. With USM, you explicitly handle it yourself. 86 | 87 | * For the Aurora Exascale Computer project, we get a ton of feedback about batched APIs. 88 | * It would be ideal to file `Github issues `__ against the spec, or to start discussions - whatever it happens to be. We're tracking these, and it helps to get the feedback going in the community. 89 | 90 | Continuing the review of the Architecture chapter of the oneMKL Spec: 91 | 92 | Execution Model: 93 | 94 | * We have non-member functions (standalone routines), as in BLAS and LAPACK, as well as member functions (class-based encapsulations) in DFT and RNGs. Non-member functions take a queue as the first argument. We will throw exceptions instead of returning an info or error type.
95 | * For device usage, if you have multiple devices available, we depend on the DPC++ language to provide subdevices. Think of this as your NUMA type of stuff. We're not introducing language into the oneMKL specification to handle subdevices. 96 | * Both synchronous and asynchronous execution is supported. As the language evolves, we would like to have as much as possible be asynchronous, but it's not required to have asynchronous execution. 97 | 98 | * What does an exception mean for asynchronous execution? 99 | * There is a method for asynchronous exceptions to be caught and then re-thrown. If you look at the Intel oneMKL examples, you can see how to catch this. The language is moving to a point where no asynchronous execution is lost. You have an exception handler attached to your queue. 100 | * Prefer throwing exceptions over the XERBLA approach in BLAS. 101 | * It's also possible to force the queue to have in-order behavior - especially useful for USM. 102 | 103 | * Buffers handle dependencies themselves, so that is why the USM APIs take a vector of event dependencies and return an event. You can build your own dependency tree. While it's a little more work to do that, it offers more flexibility. 104 | * We assume all functions are host thread safe. In the class-based APIs, it's not necessary that two threads concurrently using the same class will be thread safe. 105 | 106 | * If I have a big matrix and I give separate chunks of it to different threads, is that okay? E.g., I have a 100x100 matrix, is it okay to give 50x50 to one thread and 50x50 to another? 107 | * The SYCL buffers' dependencies are controlled by SYCL runtime, and they can't track how they interact. That's an advantage with USM - as a programmer you know that working on one part of a matrix doesn't overlap with working on another part of the matrix. A matrix encapsulation can handle that as well. For host thread safety, we'll need to look that up for buffers. 
For USM pointers, you should be fine; they're all in the same host virtual address space. 108 | 109 | Memory model: 110 | 111 | * 2 approaches to the memory model: buffers and USM, each with its own properties. With USM, use malloc_device or malloc_shared. There is no assumption that data is host accessible, and it doesn't always make sense to assume it is; an implementation may need to be able to handle both cases (depending on how the memory was allocated). 112 | 113 | API design - logistical design - namespaces: 114 | 115 | * There is a proposal going through the oneAPI TAB meetings to change the namespace to oneapi::mkl; that is, to put everything under the same oneapi namespace. It's longer, but it makes things simpler if you can use a "using" statement. This would be done for all oneAPI libraries (TBB, etc.). 116 | * Don't like deeply nested namespaces just to get to functionality. With this, you'd need to go 3 levels deep before you can use functionality. Shorter is better! 117 | * There are issues with "using" statements. May do it for a particular function, but typically not for a whole namespace. 118 | * We've even had discussions internally about the different domain namespaces and how to differentiate between them. There are pros and cons for each variant. 119 | * We are open to suggestions here. Concrete proposals on Github would be great! Could have a flat structure as well. Something to think about for later. 120 | * Lots of people think of namespaces as package management, but they aren't really; they aren't like Python import statements. 121 | * Have namespaces and the include structure be related to each other. 122 | 123 | * Thank you all for the engaging and active discussions. Please keep them going on the `oneAPI specification repository `__. 124 | 125 | * Mark Hoemmen started a `discussion `__ on dense linear algebra functions needing encapsulations for matrices and vectors shortly after the meeting.
126 | -------------------------------------------------------------------------------- /math/minutes/2020_07_01_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2020-07-01 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Exceptions/Error handling <../presentations/2020-07-01_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Peter Caday (Intel) 15 | * Mark Hoemmen (Stellar Science) 16 | * Sarah Knepper (Intel) 17 | * Maria Kraynyuk (Intel) 18 | * Nevin Liber (ANL) 19 | * Piotr Luszczek (UTK) 20 | * Mesut Meterelliyoz (Intel) 21 | * Pat Quillen (MathWorks) 22 | * Alison Richards (Intel) 23 | * Nichols Romero (ANL) 24 | * Shane Story (Intel) 25 | * Julia Sukharina (Intel) 26 | * Harry Waugh (University of Bristol) 27 | * Maria Zhukova (Intel) 28 | 29 | Agenda: 30 | 31 | * Welcoming remarks - all 32 | * Exceptions/Error handling - Maria Kraynyuk 33 | * Open discussion on oneMKL Specification and open source oneMKL interfaces GitHub project - all 34 | * Wrap-up and next steps 35 | 36 | Exceptions/Error handling: 37 | 38 | * Motivation for different levels of detail: a wide range of applications use oneMKL. Some need no details, some just want to identify all oneMKL exceptions, and some want more details. 39 | 40 | * 3 levels of exceptions: 41 | 42 | 1. Base exception, which may inherit from std::exception but is not required to. 43 | 2. Problem-specific exception classes, like for invalid arguments or something uninitialized, which are inherited from the oneMKL base exception class. 44 | 3. Domain-specific exception classes, like for an uninitialized descriptor in DFT; inherited from problem-specific exceptions or the base class. 45 | 46 | * 7 problem-specific exceptions can cover all domains.
Differentiate between host and device memory allocation failures; device not supported; some functionality not implemented; incorrect input. For the sparse and DFT domains, there are descriptors/handles; have an exception if one wasn't initialized first. Finally, check that everything was fine during computation; since this can be very different between domains, have domain-specific exceptions as well. 47 | * The final list of domain-specific exceptions is not yet defined but will be added to the spec. 48 | 49 | Open Discussion: 50 | 51 | * The board strongly recommends requiring inheritance from std::exception, rather than merely allowing it. 52 | * Helps applications handle exceptions consistently. 53 | * In general, it's really helpful if custom exceptions inherit from ``std::exception``. 54 | 55 | * It's helpful to use existing idioms for C++ exceptions. 56 | * For example, throwing ``std::bad_alloc`` for bad allocation. 57 | 58 | * Distinguish between "user messed up" and "library messed up" errors; may not need to fuss too much with user-messed-up errors. 59 | * Things xerbla would be invoked for - like incorrect arguments - fall into the user-messed-up category. 60 | * Can start with a basic exception and subdivide more in the future. 61 | * Haven't seen code that recovers from xerbla. 62 | * There are cases for Intel MKL where users redefine xerbla because they don't want to stop but to continue. It's possible the user code would take a different path in this case. 63 | 64 | * Will there be MKL_DIRECT_CALL support? 65 | * We want to specify a pre-check system, which is currently partially implemented in the open source interfaces. We want it to be part of the spec. Allow argument checking to be switched off at compile time if it is not needed, though computational problems would still need to be checked. 66 | * We would want to use the compile-time dispatching interfaces instead of the run-time dispatching.
67 | * Need to think carefully about direct call because it requires header support. 68 | 69 | * What happens for companies that don't allow exceptions? 70 | * We have received a request to potentially use statuses instead of exceptions, though using DPC++ led to the choice of exceptions. 71 | * If necessary, we can have a system that catches everything at the top level with a status code that users can query with an API. 72 | 73 | * There is an argument that exceptions should be reserved for exceptional conditions. Providing bad data is not exceptional. 74 | * Bad allocation is exceptional; bad arguments are bugs. But things like LAPACK not converging - is that exceptional, or simply hard? 75 | * It can be handled through an ``if`` branch in user code. For numerical issues, can have a fast path in the code: try LU; if it fails, switch to QR. Debatable if it should be an exception. 76 | * Negative values of the ``info`` parameter in LAPACK are truly errors; you can't do anything. But positive values indicate a failure during computations, such as passing a non-positive-definite matrix to Cholesky (potrf). 77 | * In this case, potrf would throw an exception, and the ``info`` code of the problem can be obtained via the ``get_info()`` method of the exception object. 78 | * Concern that large production runs may assert that no exceptions are thrown, to devote all core-hours to compute. 79 | * Sometimes LU is called to determine if a matrix is singular. Users can't be expected to detect a singular matrix; the algorithm finds that out for them. 80 | * But having both error codes and exceptions on a single function call may be unwieldy. ``std::filesystem`` has both, but users do not like it when functions that return an error code also throw exceptions. 81 | * In some cases it may be okay, but we don't want a function that has informational errors and malloc exceptions tied to the same number. 82 | 83 | * With exceptions, there's an expectation you'd unwind the stack.
Can you throw an exception on the GPU, or do you have to go back to the CPU to throw it? 84 | * You need to wait. 85 | 86 | * Our interface in general is asynchronous. Reporting about the conditioning of a matrix forces things to be synchronous. 87 | * One option is device-accessible pointers, like in cuBLAS. 88 | * SYCL has a notion of asynchronous errors. To catch them, you need to register an asynchronous error handler on the queue you created. But it doesn't come with a way for user code to raise asynchronous exceptions. We could press for changes to the SYCL specification in the future. 89 | * However, even if possible, the ability to throw an exception on the device may affect performance. 90 | * In PLASMA, both asynchronous and synchronous APIs were provided. In the asynchronous API, there was no guarantee on the state of the data in the event of an error. In the synchronous API, an object was introduced to keep track of the progress. 91 | * The C++ executor proposal (P0443) has an "error channel", though that is not quite the same. 
92 | -------------------------------------------------------------------------------- /math/minutes/2020_07_15_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2020-07-15 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of oneMKL Random Number Generators domain <../presentations/2020-07-15_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Peter Caday (Intel) 15 | * Marius Cornea (Intel) 16 | * Pavel Dyakov (Intel) 17 | * Alina Elizarova (Intel) 18 | * Mehdi Goli (Codeplay) 19 | * Mark Hoemmen (Stellar Science) 20 | * Sarah Knepper (Intel) 21 | * Maria Kraynyuk (Intel) 22 | * Nevin Liber (ANL) 23 | * Piotr Luszczek (UTK) 24 | * Mesut Meterelliyoz (Intel) 25 | * Spencer Patty (Intel) 26 | * Pat Quillen (MathWorks) 27 | * Alison Richards (Intel) 28 | * Nichols Romero (ANL) 29 | * Edward Smyth (NAG) 30 | * Shane Story (Intel) 31 | * Harry Waugh (University of Bristol) 32 | 33 | Agenda: 34 | 35 | * Welcoming remarks - all 36 | * Discussion and updates from last meeting 37 | * Overview of oneMKL Random Number Generators domain - Pavel Dyakov and Alina Elizarova 38 | * Wrap-up and next steps 39 | 40 | Introduction, including how you use math libraries: 41 | 42 | * Mehdi Goli - Principal software engineer at Codeplay, based in Edinburgh, developed different math libraries in SYCL. As part of contribution to community, contributed cuBLAS backend to oneMKL open source interfaces project. In the past couple of years, developing linear algebra kernels, Eigen backend used in TensorFlow. 43 | * Edward Smyth - Works for Numerical Algorithms Group (NAG); famous for their math libraries. Working there for many years, did some early work for Intel MKL years ago. Also does a lot of HPC consultancy. 
44 | 45 | Opening up for comments on the oneMKL Spec or general discussion: 46 | 47 | * Overall, the focus from reviewing the spec has been on the linear algebra interfaces. No additional comments from what has been captured during previous oneMKL TAB meetings. 48 | * In OpenMP, you have control over the level of parallelism you get. Being able to control how much of the CPU/GPU is used would be useful. 49 | * Interoperability of SYCL and other runtimes: In the oneMKL open source interfaces project, the oneMKL APIs can work with cuBLAS underneath and the CUDA runtime, for example. 50 | 51 | Row-major support for BLAS: 52 | 53 | * What happens if you add another layout? 54 | * The approach taken matches CBLAS capabilities. If you had lots of other layouts, a matrix object would better encapsulate. This is the procedural version for now. 55 | * Why using namespaces instead of enums? 56 | * Users will typically use just column major or just row major for entire application. With the namespace approach, you do not need an enum parameter for every function. This approach also doesn't break current API. 57 | 58 | * Multi-dimensional arrays (tensors) would benefit from supporting multiple strides. There have been requests for SLATE to cover multi-dimensional strides. 59 | 60 | Overview of RNGs 61 | 62 | * Two types of classes: 63 | 64 | 1. Engines - source of randomness, hold state of random number generators. 65 | 2. Distributions - hold distribution parameters, mean, standard deviation parameters. 66 | 67 | * Two types of free functions: 68 | 69 | 1. Service routines - responsible for engine's state modification. 70 | 2. Generation routines - used to obtain random numbers from a given engine and distribution 71 | 72 | * Engine classes work with both free function types; Distributions classes work only with generation routines. 73 | 74 | * What are the return values for the generation routines? 75 | * The generated numbers are provided via output parameters. 
For buffer-based APIs, the return type is void. For USM, an event is returned. 76 | 77 | * 3 steps for workflow: 78 | 79 | 1. Create and initialize an Engine object. Can adjust Engine state by calling service function if required. 80 | 2. Create and initialize a distribution object. 81 | 3. Call the generation routine to obtain n random numbers. 82 | 83 | Deeper look at Engines classes: 84 | 85 | * Engines classes: represent source of independent and identically distributed random variables. 86 | * Engine object holds its internal state between calls; it holds a ``sycl::queue`` object, and generation is performed on the device associated with the queue. 87 | * May be different constructors for different Engines, depending on the seed needed. 88 | * Copy constructible and copy assignable. 89 | * A random number engine is just a vector of bits, which should be trivially copyable without needing to implement custom copy constructors. 90 | * Different engines have different costs for copying due to differences in internal state. But people typically pass Engines around by non-const reference, so you are not copying them all over the place. 91 | * What about move constructors? 92 | * Wanted to provide minimal set of constructors. 93 | * Engine state is modified in generate calls. 94 | * Are you going to offer threefry4x64? 95 | * No plans currently for threefry algorithm. For counter-based engines we have philox4x32x10 and ARS-5. 96 | 97 | Deeper look at Distributions classes: 98 | 99 | * Distributions represent statistical properties of produced random numbers. 100 | * Gaussian takes mean and standard deviation in constructor. Different parameters are needed for different Distributions. 101 | * Copy constructible and assignable; trivial action in this case. 102 | * The structs are just tags; why do they not have the transforms in them? 103 | * Everything happens in the generate function. 104 | * Dispatch to the right method based on the tag. 
105 | * Did not want to specify in the specification since different devices may have different implementations and work with the Engine and Distribution differently. 106 | 107 | Deeper look at generation routines: 108 | 109 | * These provide random numbers from a given engine with a given distribution. 110 | * Uses sycl::buffer<> or USM pointer provided by user as storage for the generated numbers. 111 | * How does this compare to the RNGs in C++? 112 | * It is a little different than the std:: one, since std:: does not work for different devices and has shared state between different threads. 113 | * oneDPL implements std:: like subset of random number engines and distributions which works for different devices. 114 | * The goal is to provide something that looks C++-like but in a more performant way by generating multiple numbers. 115 | * Wanted to make the user experience easier with a general entry point - a free function routine instead of an operator that provides just one number from a given Engine. But the general workflow is similar. 116 | * Needs justification in the specification why it looks different from std so it does not seem arbitrary. 117 | 118 | Deeper look at service routines: 119 | 120 | * Service routines are used to modify the state of the engine. 121 | * Two ways: 122 | 123 | 1. Skip-ahead method: Engine behaves like it has already generated num_to_skip elements. If you want to generate from one random sequence but on different devices, may use skip-ahead. 124 | 2. Leapfrog method: Produced by same engine, but with fixed increment. 125 | 126 | * Not all engines support these routines. 127 | * Typo in the text underneath the skip-ahead image - refer to the colors in the image instead. 128 | * Can I overload leapfrog and write a custom engine? 129 | * If you provide a generate function for this, then yes. 130 | * Is there any guarantee on the complexity of leapfrog() or skip_ahead()? 131 | * No, it differs from engine to engine. 
For philox4x32x10, skip_ahead() is constant time, but other engines may be more complex. 132 | 133 | General RNG usage: 134 | 135 | * What customer base uses RNG? 136 | * Not too many people at DOE use it. Most just use Boost or the built-in Fortran functions. 137 | * Lots of customers in the finance field use Monte Carlo (or quasi-Monte Carlo) simulations for risk management and options calculation. Lots of customers in physics use Monte Carlo simulations. 138 | 139 | * May want a specialized Engine, for instance, to support a thousand streams with a thousand threads. 140 | * This could be specified in the constructor. 141 | -------------------------------------------------------------------------------- /math/minutes/2020_08_12_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2020-08-12 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of oneMKL Summary Statistics domain <../presentations/2020-08-12_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Pavel Dyakov (Intel) 15 | * Alina Elizarova (Intel) 16 | * Mark Hoemmen (Stellar Science) 17 | * Sarah Knepper (Intel) 18 | * Maria Kraynyuk (Intel) 19 | * Nevin Liber (ANL) 20 | * Piotr Luszczek (UTK) 21 | * Spencer Patty (Intel) 22 | * Pat Quillen (MathWorks) 23 | * Nichols Romero (ANL) 24 | * Edward Smyth (NAG) 25 | * Shane Story (Intel) 26 | 27 | Agenda: 28 | 29 | * Welcoming remarks - all 30 | * Updates from last meeting 31 | * oneMKL Random Number Generators pass downs and open questions - Pavel Dyakov and Alina Elizarova 32 | * Overview of oneMKL Summary Statistics domain - Pavel Dyakov and Alina Elizarova 33 | * Wrap-up and next steps 34 | 35 | Updates from last meeting: 36 | 37 | * oneMKL Specification v. 0.9 was published on July 30. 38 | * oneMKL Specification v. 1.0 is scheduled for August 30. 
Cutoff date for accepting feedback is August 17, with code freeze August 28. 39 | * Feedback that will be considered for a future version (after v. 1.0) is captured in an appendix - many thanks to the oneMKL TAB members! 40 | * Going forward, oneMKL TAB will switch to a 4-week cadence, as there are still topics to cover, for consideration in a future spec version. 41 | * Please continue to file `Github issues `__ against the specification, even after version 1.0. 42 | * Can also file issues against the `oneMKL open source interfaces `__ if there are issues about the open source implementation of the specification. 43 | 44 | oneMKL Random Number Generators discussion: 45 | 46 | * Based on feedback from the previous oneMKL TAB meeting, we have removed trivial constructors to make it cleaner. Adding move constructors to v. 1.0. 47 | 48 | 5 groups of engines: 49 | 50 | 1. Modern engines with good statistical properties. 51 | 2. Engines with small period but good performance - two instantiations of linear congruential engines. 52 | 3. Hardware dependent engines. For example, ars5 can be done via hardware (HW) instruction or software, but there is a significant performance difference. Nondeterministic is implemented via HW instruction, and cannot be supported on HW lacking this instruction (like current GPUs). sfmt19937 utilizes vectorized capabilities of CPUs, but may not make sense on GPU hardware. 53 | 4. Engines that used to be popular in the past, but not so much anymore. 54 | 5. Quasi-random engines that can be used for certain use cases. 55 | 56 | Questions for feedback: 57 | 58 | * Do we need to introduce a default engine in the oneMKL spec? 59 | * Our proposal is to add to the spec, but as implementation defined. It is needed to support users who do not care about what engine they use, but need to generate random numbers. 60 | * oneMKL TAB agrees with default engine proposal. 61 | 62 | * What would the default engine be? 
63 | * Each implementation can specify its own choice. For Intel, this is still under discussion, but we are currently leaning towards the philox engine, as it has good performance on both CPUs and GPUs and excellent statistical properties. 64 | 65 | * Should HW-dependent engines be included in the oneMKL spec? 66 | * Proposal is to include them. 67 | * In the C++ specification, hardware-dependent engines correspond to `std::random_device`. 68 | * The nondeterministic engine, due to the lack of a seed, would produce different bits on different HW. But other engines can produce the same bits on different HW - including the SIMD Mersenne Twister when run on systems with different SIMD lengths. 69 | * The oneMKL TAB recommends documenting the reproducibility and agrees with including HW-dependent engines. 70 | 71 | * What happens if some hardware cannot support an engine? 72 | * An implementation can support only a partial version of the spec, so not every engine needs to be supported. 73 | * With the exceptions introduced in v. 1.0, we can specify throwing a certain exception if an engine is not supported by an implementation. 74 | * `std::random_device` falls back to the CPU. Do you want that to happen here, or, if the hardware is not available, should the engine simply not be usable? 75 | * In Intel oneMKL, we have no fallback for the nondeterministic engine. It probably makes sense not to have fallbacks, to avoid confusing users, as otherwise users would see the same results from run to run. 76 | 77 | * Except for hardware-dependent engines, do all of the other engines (in principle) work on Intel CPUs, GPUs, and potentially other vendor hardware as well? 78 | * Yes, though some, like r250, will have lower performance on the GPU. The first two groups are supported on both Intel CPUs and GPUs, and sobol is supported on both as well. Some of these engines are already implemented on Nvidia GPUs. 79 | 80 | * How should a user understand the tradeoffs of different engines? Is it a matter of performance vs. randomness?
81 | * Different engines have different periods. For example, the period of MCG is too small for some applications. Additionally, if you want parallel uses of random numbers, some are easier to parallelize. For instance, the skipahead for philox is light-weight, while it is heavier for mt19937, leading to an additional overhead to parallelize this engine. 82 | 83 | * Should outdated engines be included in the oneMKL spec? 84 | * Proposal is to keep them in the spec to provide a wide set of engines. This set can be used in older applications. 85 | * The oneMKL TAB agrees with keeping the outdated engines in the spec. 86 | 87 | Overview of oneMKL Summary Statistics domain: 88 | 89 | * Introduced Summary Statistics component in v. 0.9 of the oneMKL spec. 90 | * There is a dataset structure, service routines to make a dataset, and free functions - computational routines that compute basic statistical estimates for single/double precision. 91 | * We want to extend the set in the future. 92 | * Typical usage model: create and initialize an object for dataset; call mean function to calculate mean values. 93 | * For consistency with the rest of oneMKL spec, 1D SYCL buffers and USM pointers are used to represent multi-dimensional dataset. 94 | 95 | Dataset Structure: 96 | 97 | * The dataset structure holds all user data, with two specializations to support USM and buffer APIs. 98 | * It stores the number of dimensions and observations; the matrix of observations; and two optional parameters: weights (scale of observation in dataset) and indices (which dimensions are processed). 99 | 100 | * Discussion on struct storage for the dataset: 101 | * It is a struct instead of a class because it just stores data and nothing else. There is a template parameter for row or column major. 102 | * If it is just a struct then you do not need a constructor - you can just use curly brackets. 
You can give default values in the struct, and when you use curly-bracket initialization, anything you leave out will be initialized to the default values. 103 | * The general rule is that if the data can vary independently with no constraints, then it can be a struct. 104 | * If everything is public, then users can change the data. 105 | * A user may want to calculate the mean for some indices and different statistics for other indices. The user could change just the indices and re-use the struct for both. 106 | * However, the number of dimensions and the number of observations do seem tied to the matrix of observations. 107 | * One additional concern with data structs in general is that you can never change the order of things inside the struct - they need to be initialized in order. 108 | 109 | * Only the 1D ``sycl::buffer`` has a specialization; it will not work for 2D buffers (same issues as with dense linear algebra). 110 | 111 | Service Routines: 112 | 113 | * These support users on pre-C++17 compilers in creating a dataset object. 114 | * SYCL 2020 provisional requires C++17 as a minimum. 115 | * If C++17 is the minimum required, then we can remove these two functions and rely on deduction guides. 116 | * It is just an enumeration; it is not using constraints (C++20). 117 | * The order of the template parameters is reversed compared to the structure; this should be fixed. 118 | 119 | Computational functions: 120 | 121 | * Each takes a queue, the dataset, and a place to store results. 122 | * Different template parameters support different methods. 123 | 124 | * What does ``fast`` mean? 125 | * Fast is the fastest if there are multiple methods. Some functions only have the fast method. 126 | 127 | * Can ``fast`` be less accurate? 128 | * Yes, or it can provide slightly different results when using different numbers of threads. 129 | 130 | * The oneMKL TAB recommends specifying why users may not want to use ``fast`` - what are the tradeoffs.
May also want to choose a different name - why would someone choose a slow method? 131 | 132 | General discussion: 133 | 134 | * Has summary statistics been part of Intel MKL for a while? Who uses it? 135 | * Yes, it was part of the "vector statistics" domain in Intel MKL, which contains both RNGs and summary statistics. Summary statistics and RNGs are not really related, so that is why they are in separate domains in oneMKL. 136 | * Summary statistics routines are mostly used for some primitives in classic machine learning. 137 | * Financial service applications and Monte Carlo simulations may also use these. 138 | 139 | * There is a DOE project called `Spack `__ that is a package manager. It would be good to have oneMKL support for Spack in the future. Spack has special rules to automatically install Intel MKL via a special way of packaging Intel software: ``spack install intel-mkl``. 140 | -------------------------------------------------------------------------------- /math/minutes/2020_09_09_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2020-09-09 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of oneAPI specification and process going forward <../presentations/2020-09-09_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Robert Cohn (Intel) 15 | * Mehdi Goli (Codeplay) 16 | * Jeff Hammond (Intel) 17 | * Mark Hoemmen (Stellar Science) 18 | * Sarah Knepper (Intel) 19 | * Maria Kraynyuk (Intel) 20 | * Piotr Luszczek (UTK) 21 | * Spencer Patty (Intel) 22 | * Pat Quillen (MathWorks) 23 | * Alison Richards (Intel) 24 | * Nichols Romero (ANL) 25 | * Edward Smyth (NAG) 26 | 27 | Agenda: 28 | 29 | * Welcoming remarks 30 | * Updates from last meeting 31 | * Overall oneAPI specification and process going forward - Robert Cohn 32 | * Wrap-up and next steps 33 | 34 
| Updates from last meeting: 35 | 36 | * Verified meeting series appears correctly on calendars (meeting every 4 weeks now). 37 | * Enabling RNG domain in open source oneMKL interfaces is in progress. 38 | 39 | oneAPI Specification: 40 | 41 | * Robert Cohn is the editor for the oneAPI specification, which includes oneMKL. 42 | * Purpose of oneAPI specification: document design for stakeholders, implementers and users. Also useful for things like validation, documentation; the whole pipeline. 43 | * oneAPI spec contains 9 elements: 44 | 45 | 1. DPC++ - programming model/language for oneAPI, based on SYCL with Intel extensions. Most of the oneAPI libraries have bindings in DPC++. 46 | 2. oneDPL - standard library for DPC++. Very similar to (or at times the same as) the standard C++ library. Includes parallel STL as a way to express parallelism and executors that are extended to work with DPC++. Also has some parallel implementations of algorithms. 47 | 3. oneDNN - primitives for deep learning, like convolution. High performance implementations that work on CPU or GPU. 48 | 4. oneCCL - communication library for oneDNN; similar to MPI. Collective operations that are most valuable for deep learning. 49 | 5. Level Zero - low level driver; sits above kernel mode driver. DPC++ runtime sits on top of Level Zero, and most of the other libraries sit on top of DPC++ runtime. 50 | 6. oneDAL - library for data science; includes k-means, correlation, clustering, and other algorithms. 51 | 7. oneTBB - Threading Building Blocks - CPU only library, but it does have support for coordinating computations that may be accelerated on devices. Underlying threading for many oneAPI libraries. 52 | 8. oneVPL - video processing; includes encoding, decoding and others. 53 | 9. oneMKL - math routines. 54 | 55 | Discussion on oneCCL: 56 | 57 | * What is the history of oneCCL? It looks very similar to MPI, but slightly different. 58 | * oneCCL used to be MLSL (Machine Learning Scalability Library).
59 | * MLSL was created by some expert asynchronous MPI folks. It basically used MPI collective techniques, but was optimized more heavily for their needs and did not have to support all of the MPI standard. 60 | * See `https://arxiv.org/abs/1801.08030 `__ for paper on scaling for machine learning. 61 | 62 | Relationship of spec to product: 63 | 64 | * First product release of oneAPI implements oneAPI spec 1.0. 65 | * The products are a superset of the spec. Things like OpenMP offload are not described by the spec but may be part of the product. 66 | 67 | Roadmap of spec: 68 | 69 | * Available in GitHub repository. 70 | * 1.0 to be published in October (at the latest). 71 | * Once 1.0 is done, it will not be changed (beyond editing updates). 72 | * Various elements have new features planned for oneAPI, after 1.0 version. 73 | * Just like with oneAPI v. 1.0, incremental versions will be published before the final version. Provisional 1.1 specs will be published quarterly. In about a year, oneAPI 1.1 will be published. 74 | 75 | Governance: 76 | 77 | * Governance for spec is very similar to how open source projects are managed. 78 | * There is a core team for each element. For oneMKL, it is Maria Kraynyuk and Spencer Patty. 79 | * Contribution to spec is open to anyone. 80 | * Core team membership today is currently all Intel, but is open to anyone based on history of contribution. 81 | 82 | Getting involved: 83 | 84 | * Continue participating in TAB. 85 | * Let Sarah Knepper know if you would like a blog post/article you write to be posted on `oneapi.com `__. 
86 | -------------------------------------------------------------------------------- /math/minutes/2020_11_11_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2020-11-11 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of oneMKL Vector Math domain <../presentations/2020-11-11_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Rachel Ertl (Intel) 15 | * Mehdi Goli (Codeplay) 16 | * Sarah Knepper (Intel) 17 | * Andrey Kolesov (Intel) 18 | * Nevin Liber (ANL) 19 | * Piotr Luszczek (UTK) 20 | * Pat Quillen (MathWorks) 21 | * Nichols Romero (ANL) 22 | * Edward Smyth (NAG) 23 | * Andrey Stepin (Intel) 24 | * Shane Story (Intel) 25 | 26 | Agenda: 27 | 28 | * Welcoming remarks 29 | * Updates from last meeting 30 | * Overview of oneMKL Vector Math domain - Andrey Stepin 31 | * Wrap-up and next steps 32 | 33 | Updates from last meeting: 34 | 35 | * oneAPI Developer Summit 2020: Nevin Liber will be participating on a panel discussion. 36 | * oneMKL specification v. 1.0 released. 37 | * RNG and BLAS domains available in oneMKL open source interfaces project. 38 | 39 | Overview of Vector Math (VM) Domain: 40 | 41 | * Vector math allows users to accelerate the calculation of values of transcendental functions for vectors of different types, like float, double, and complex types. 42 | * There are different accuracies available which allow users to select between calculation speed and accuracy (up to close to correctly rounded results). 43 | * The DPC++ interface, based on C++, provides the same calculation capabilities, on the same data types and with the same types of optimizations, as the pre-existing C/Fortran APIs. 44 | * VM has a very nice capability of reporting numerical errors and fixing them on the fly if the user desires.
45 | * For example, in calculating the inverse cumulative distribution function, a user may find that some values are outside of the domain of the specified inverse function. Such out-of-domain values can be fixed by the status code handler. 46 | 47 | VM APIs: 48 | 49 | * Classic C/Fortran APIs are very simple; accept vector length, input and output vectors. Same interface for OpenMP offload. 50 | * DPC++ interfaces have explicit SYCL queues, and either buffers or simple pointers (for USM). 51 | 52 | DPC++ Usage Models: 53 | 54 | * If user wants to calculate something more than just one function, the buffers handle the dependencies (building the directed acyclic graph). 55 | * For USM, the user needs to take care of dependencies themselves. 56 | * USM API has a nice property that it can accept host pointers. If user has device pointers or shared pointers, they are also accepted. But users need to take care of setting explicit dependencies. 57 | * The utility of USM pointers seems to outweigh the convenience of the SYCL buffers. 58 | 59 | VM Error Handler: 60 | 61 | * Simple use case: calculate some functions and check there were no computational errors. 62 | * User can specify a special option via a flag. If this flag is specified, the library will register all computational errors that are associated with a queue. 63 | * User also can ask for explicit computational status for every element, like in example 3. Then custom processing can be done to handle computational errors. 64 | * User can replace results that had some error status with a value that is appropriate; for example, replacing INF with FLT_MAX. 65 | 66 | Considering future extensions for VM: 67 | 68 | * Strided API, which calculates a transcendental function with some stride. For this, USM makes sense, but buffer may not. 69 | * Thinking of extending VM to accept std::vector. 70 | * Also looking forward to improving the computational performance of vector math by allowing so-called kernel fusion.
User will program some sequence of operations, and those operations will be fused so that the GPU will not produce additional loads/stores between functions. 71 | 72 | * For the strided interface - will that involve a user-provided iterator, or an array of strides? 73 | * It is pretty limited in that it just has an increment. Basically, it increments the index by a fixed value, and you can place the outputs based on a different increment. The number of calculations is regulated by the length of the vector. 74 | * Coming from the batched linear algebra world, where you have a batch of matrices, the matrices may not be coming from one big buffer unless the matrices themselves are strided in the same manner. 75 | 76 | * Is there functionality to compute sum of absolute values for complex numbers? 77 | * It does not have such an operation, but it could be extended to have this. 78 | -------------------------------------------------------------------------------- /math/minutes/2021_01_27_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2021-01-27 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of oneMKL Batched Linear Algebra <../presentations/2021-01-27_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Marius Cornea (Intel) 15 | * Pavel Dyakov (Intel) 16 | * Rachel Ertl (Intel) 17 | * Mehdi Goli (Codeplay) 18 | * Mark Hoemmen (Stellar Science) 19 | * Louise Huot (Intel) 20 | * Sarah Knepper (Intel) 21 | * Maria Kraynyuk (Intel) 22 | * Nevin Liber (ANL) 23 | * Piotr Luszczek (UTK) 24 | * Spencer Patty (Intel) 25 | * Pat Quillen (MathWorks) 26 | * Alison Richards (Intel) 27 | * Nichols Romero (ANL) 28 | * Edward Smyth (NAG) 29 | * Shane Story (Intel) 30 | 31 | Agenda: 32 | 33 | * Welcoming remarks 34 | * Updates from last meeting 35 | * Overview of oneMKL Batched Linear
Algebra - Louise Huot and Rachel Ertl 36 | * Wrap-up and next steps 37 | 38 | Overview of Batched Linear Algebra: 39 | 40 | * We will go over support for batched linear algebra in the oneMKL specification as well as some extensions being considered. 41 | * Batching: computes multiple independent operations. 42 | * Currently support batch functionality for some BLAS and LAPACK functions, as well as all DFTs (out of scope). 43 | * 2 different batch APIs: group and strided. 44 | 45 | * Group takes independent pointers for each input matrix/vector. 46 | * Strided API is less flexible because it allows for only one size of computation, and all matrices have to be stored in the same buffer. 47 | 48 | Group API in detail: 49 | 50 | * A group gathers all computations with identical parameters, but all matrices/vectors are different. 51 | * Two extreme scenarios: 1 group and N groups. 52 | * Only supports USM API (not buffer API). 53 | * Follows same naming pattern as other functions in the oneMKL specification. 54 | * Because we are doing group APIs, all of the standard GEMM parameters become arrays, each of size equal to the number of groups (for non-matrix parameters). 55 | * One entry in a (non-matrix) parameter array corresponds to one group. 56 | * Array of pointers for matrices, of size equal to the sum of all the group sizes (total batch size). The `a` array contains pointers that point to the different `a_i` matrices. 57 | * Another visualization of the batch group computation is shown on slide 8, describing the layout of the `a` array. 58 | * After the first `group_size[0]` elements of `a`, it starts to point to matrices in the next group. 59 | * This is what had already been supported in Intel MKL on the CPU side for a number of years. 60 | 61 | Strided API in detail: 62 | 63 | * The matrix/vector parameters are a key difference with strided APIs, compared to group APIs. 64 | * Strided API supports a single group, with fixed stride between successive matrices/vectors in the batch.
65 | * Zero stride can be used for input matrices/vectors, but not for output matrices/vectors. 66 | * The strided API supports both DPC++ buffer and USM pointers. 67 | * The API is very similar to non-batch counterparts, but with the addition of stride and batch size parameters. 68 | * Walk through an example of the USM version of the strided API. 69 | * Batch of 100 matrices; the matrices happen to be square, though the general case is also supported. 70 | * Allocate memory for the matrices and pivot indices, as well as for the scratch space. 71 | * Do the computation in a try-catch, to be able to catch some exceptions. 72 | 73 | LAPACK Batch Exceptions: 74 | 75 | * The tables on slide 11 are taken from the oneMKL specification. The left side shows common exceptions across all domains, the right side shows LAPACK-specific exceptions. 76 | * Highlighted in yellow are the batch errors. 77 | * We are trying to retain information captured in the `info` parameter that users of standard LAPACK would be familiar with (via `info()` method). We want to give information on where in the batch the error occurred (via `ids()` method), and details on the exceptions (via `exceptions()` method). 78 | * The multiple inheritance hierarchy of LAPACK exceptions, shown in slide 12, gives the user a choice of which exception to catch. The left side shows the hierarchy that starts with `std::exception` and is common across oneMKL domains. The right side shows the LAPACK-specific hierarchy that starts with `oneapi::mkl::lapack::exception`, and provides the `info()` method. 79 | * If user does not care about the particulars of the exception, can just catch `std::exception`. 80 | * If user wants to catch all oneMKL-related exceptions, can just catch `oneapi::mkl::exception`. 81 | * If user wants to focus on LAPACK exceptions, then catch exceptions from the LAPACK-specific hierarchy.
82 | 83 | Speculation about future interfaces: 84 | 85 | * Coming in C++23 is mdspan, so we may want to consider adding that on top of existing interfaces. 86 | * On the right side of slide 13, another option is having a matrix object that captures matrix data and state data, where the state data could include errors. But this could be problematic, since errors do not correspond with the matrix, but with the operation. 87 | * Agreement from the TAB that matrices do not have errors, operations do. 88 | 89 | Batch API extension requests: 90 | 91 | * Have scalars passed by reference instead of value. 92 | * Have different scalars (e.g., alpha for axpy) for each computation. 93 | * Problem with current API is that we cannot extend to support those two requests because there would be conflicting APIs with different semantics. 94 | * For example, for strided API where the scalar becomes a pointer, it would be of type fp_type* alpha, but expected to store only one scalar. But for the second requested extension, it would have the same type but would point to batch_size scalars. 95 | * Concern from TAB that having APIs with double* and double** may allow users to pass one when they meant the other; consider using `std::span` instead. 96 | 97 | * Potential approaches under consideration to support these requests: 98 | 99 | 1. Naïve idea to add more inputs to provide size of arrays - not leaning toward this change since there are already lots of parameters and want to avoid introducing more parameters. 100 | 2. Could have additional parameter for stride of each array - same drawback as previous. 101 | 3. New entry point, but this adds new functions. 102 | 4. Something similar to what was proposed by SLATE: pass a vector for each parameter, of size either 1/group_count/total_batch_size (as appropriate). Sizes and increments could be vectors, but would need to handle incompatibility between different sizes. Need additional overloads to handle scalars passed by reference. 
103 | 5. Have a matrix/tensor object encapsulation. 104 | 105 | * Why use overloading, with functions having the same name? It makes debugging difficult. 106 | 107 | * Overloading in oneMKL is used to handle different data types, when performing the same type of operation (e.g., gemm instead of dgemm). A similar rationale was applied for batch function naming. 108 | * Concern from TAB that it can be hard to tell what is wrong when passing a lot of parameters and would suggest to use different names for the different types of batches. 109 | 110 | * Why use `std::vector`? `std::span` from C++20 is simply a pointer and a length. 111 | 112 | * The problem with `std::vector` is that it would be a separate deep copy into device memory. 113 | * Since the request for passing scalars by reference is for device data, this is a valid concern. 114 | * `std::vector` prevents using device data and is overkill for what is needed. 115 | * It would be easy to implement a `onemkl::span` if C++20 is not yet supported for DPC++, and it is preferable to using `std::vector`. 116 | 117 | Dense Linear Algebra Batch support in libraries: 118 | 119 | * Small summary of different batch dense linear algebra support in various libraries, and how the oneMKL specification currently covers the usage. 120 | * Would like to cover usage of different libraries on different hardware. 121 | * cuBLAS has two types of APIs, but they only allow for one group - more similar to the strided APIs. The non-strided APIs match the group API, but for only 1 group. cuBLAS also has some batched computations for lower or mixed-precision GEMM and other LAPACK routines. 122 | * MAGMA has even more batch coverage. Variable and fixed version - fixed can be covered by group API with group_count of 1. The variable batch can be covered by the group API as well, but with each group having size of 1. 
123 | * MAGMA also has some functionality not covered in the oneMKL specification, mainly additional level-3 BLAS and some LAPACK. 124 | * rocBLAS has more extended support (additional BLAS). 125 | 126 | Future considerations for batch support in oneMKL specification: 127 | 128 | * More functions, supporting variable batch (different scalars for each computation, etc.), and possibly matrix/tensor object encapsulation. 129 | -------------------------------------------------------------------------------- /math/minutes/2021_03_24_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2021-03-24 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of oneMKL FFT <../presentations/2021-03-24_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Marius Cornea (Intel) 15 | * Pavel Dyakov (Intel) 16 | * Alina Elizarova (Intel) 17 | * Mehdi Goli (Codeplay) 18 | * Mark Hoemmen (Stellar Science) 19 | * Sarah Knepper (Intel) 20 | * Maria Kraynyuk (Intel) 21 | * Nevin Liber (ANL) 22 | * Piotr Luszczek (UTK) 23 | * Spencer Patty (Intel) 24 | * Helen Parks (Intel) 25 | * Nichols Romero (ANL) 26 | * Edward Smyth (NAG) 27 | * Julia Sukharina (Intel) 28 | 29 | Agenda: 30 | 31 | * Welcoming remarks 32 | * Updates from last meeting 33 | * Overview of oneMKL FFT - Helen Parks 34 | * Wrap-up and next steps 35 | 36 | Overview of FFT: 37 | 38 | * oneMKL DPC++ FFT API is currently heavily based on Intel's Discrete Fourier Transform Interface ("DFTI interface") from oneMKL C/Fortran APIs; it is not the FFTW interface. It supports 1D/2D/3D data, which is well aligned with general FFT library support, and supports both SYCL buffers and USM data like the other oneMKL domains. 39 | 40 | Four-step workflow (similar to many FFT libraries): 41 | 42 | 1. 
Create descriptor - intended to be a lightweight operation 43 | 2. Configure descriptor - via multiple calls to set_value function; sets one parameter per call - very lightweight as well 44 | 3. Commit descriptor - this is where the heavy work is done; setting twiddle factors, determining the factorization to use. Idea is that you would commit once and compute many times. The descriptor can use the provided queue to make device-specific kernels. 45 | 4. Compute - expect to have multiple calls to compute to reuse the work done at the commit stage. The queue that is committed will be used at the compute stage. 46 | 47 | * What about making a plan separate from the queue? 48 | 49 | * The descriptor is intended to encapsulate all the information about the FFT. It is hard to get away from device-specific work, with one descriptor per device and a commit for each. It is probably possible to handle running multiple queues on the same hardware with a single descriptor and single commit call. 50 | * The create and configure calls are lightweight; commit is the heavy call. 51 | 52 | 3 main components: 53 | 54 | 1. Descriptor class (already discussed): two different constructors based on 1D or multi-dimensional FFT. There can be additional private members in an implementation that are not part of the oneMKL specification. 55 | 2. Enum classes for configuring values: config_param covers all parameters that can be configured. config_value corresponds to a subset of values that can be used while configuring. Some config_param parameters will just take numerical values; others will take config_value enums. 56 | 3. Free functions for compute: there are multiple compute functions covering different variants (forward/backward, in-place/out-of-place, etc.). 57 | 58 | * Are the … in set_value overloads, parameter packs, etc.? 59 | 60 | * Those are C …; in the Intel oneMKL product implementation, we are using a vararg list to process the call. 
61 | * The TAB strongly recommends not mixing C … with C++; there are lots of ways trouble can come up. 62 | * Type safety would be the number one concern, which would allow some issues to be found at compile-time and not only at run-time. 63 | * C++ developers may expect parameter pack instead of C …; also the C … may get deprecated in future versions of C++. In general, there is a lot of mixing of C idioms with C++ here. 64 | * We expect this to evolve in future versions of the specification. We would like this to be more type safe and to have compile-time errors when setting parameters. But it may be impossible to entirely get away from run-time errors. Individually type-safe settings may not make sense all together. Commit does prework and a final check that all parameters work together - that run-time error is hard to avoid. 65 | * Yes, but a useful error message could be provided there, instead of segfaulting or unexpectedly chopping floating point numbers in half. 66 | 67 | * Descriptors take the number of dimensions as a run-time parameter. Have you considered making the number of dimensions a compile-time parameter, since the total number of combinations is small (3)? 68 | 69 | * The descriptor could take a std::array of correct size based on the compile-time dimensions. 70 | 71 | 1D example: 72 | 73 | * Initialize data, wrap in a buffer. In this example, descriptor object is templated on double precision and complex domain. Configures backward_scale parameter. It expects same precision as the data in the set_value. 74 | 75 | * Suppose that precision were float. So the set_value function would expect the scalar to have type float. But the value that users would usually type in is 1.0/integer, which always has type double. So do users have to cast to float, or is there an overload? 76 | 77 | * This is handled through the C … args.
In the examples we provide, this line does an explicit cast on the variable (it was removed from slides to save space), to show best practices. However, as we previously discussed, better type safety would be nice. 78 | * For the common use case of float and typing 1.0/N without doing static_cast<float>(1.0/N), it could give users a completely wrong result or crash. 79 | 80 | * Highlighting the difference between buffer and USM input data: the compute functions return a SYCL event in the USM case, so the user has something they can wait on. In the buffer case, they can use the queue itself or an accessor to the data to accomplish that wait. 81 | 82 | 2D example: 83 | 84 | * Also demonstrates the batched usage, which is very common FFT usage - and for performance reasons, it is highly recommended when running on GPUs. 85 | * Need to set the distance in elements from the start of one FFT to the start of the next. The most common case is shown here, where they are contiguously stored. N1 and N2 are both referring to complex elements, since it's the complex domain. 86 | * For out-of-place, the compute_forward needs to take both input and output buffers. 87 | 88 | 3D example: 89 | 90 | * Fairly similar to the 2D example; here we have the real domain. Strides describe the data layout within a single FFT with multiple dimensions; it is always a vector of dimension D+1. Offset into the buffer as well as strides for each dimension. For the convention we have where a real forward transform has a complex backward transform, the input and output strides will almost always be different. Because the input and output domains are different, you need to reset the strides and recommit before the backward transform, according to the spec currently. 91 | 92 | * Does it make sense to have a compute_forward_and_backward function? 93 | 94 | * No, people usually want to do something in-between.
95 | 96 | * You have to do a commit for the forward and a commit for the backward; could you re-use if you want to set up once and do multiple calls? 97 | 98 | * If the input and output strides are different from forward to backward, then to use one descriptor you would have to commit every single time. Alternatively, you could set up two different descriptors and commit them each once. It may be most efficient to change the specification to avoid having to keep committing. 99 | * If you are doing this during every time step in a simulation, it would be costly. This is good feedback that it is a common use case to alternate between forward and backward many times. 100 | 101 | Future directions: 102 | 103 | * Managing scratch workspace on devices. 104 | * Current specification is heavily based on DFTI interface, but the goal is to make it very general and usable for a wide user base (e.g., current FFTW, cuFFT, DFTI, etc. users). Most FFT libraries use "plan-compute" terminology. 105 | * Making sure it is general enough for broad adoption is where we are headed. 106 | 107 | Specification questions: 108 | 109 | * Device-side APIs that can be embedded/inlined into user kernels - thoughts? 110 | 111 | * See device-side APIs as a good thing, in general, and definitely for linear algebra. 112 | 113 | * When factorizing an FFT - is this based on pre-existing heuristics, or can users time this on their machine? 114 | 115 | * Right now, in the Intel oneMKL product, it is based on our pre-analysis. That is a major difference between our implementation and the FFTW implementation. 116 | * Probably most people use defaults from FFTW anyway. But there may be some people who try to squeeze the last bit of performance out of their systems. 117 | 118 | * Is there anything that does not allow me to say: run an FFT on a queue on the GPU, and also run an FFT on a queue on the CPU? 
119 | 120 | * As of now, the specification would require you to have two descriptor objects and commit them separately to each queue (one for GPU, one for CPU). From the user perspective, it may look clunky to have two descriptors that need to be set up separately. From a performance perspective, when you have two queues on two devices, you have to do the device-side setup - need to do it for both the CPU and the GPU. So not as much of a performance hit if you want to run on two different types of hardware. But if you are running on the same hardware, but with multiple queues - then you are taking an unnecessary performance hit by committing twice. 121 | * In the case that the FFTs would be very different in size and number, you would need multiple descriptors anyway. 122 | 123 | * Running multiple queues on the same hardware - any intuitions on how common this might be? 124 | 125 | * Here the keyword is running the same FFT on the same hardware. Does not seem like it is a big deal, but maybe others would have different perspectives. 126 | * That was the intuition that went into the specification, that this use case would not be as common. But we did get a question about this, so we wanted to bring attention to this to see if intuition was wrong. 127 | 128 | General comments about oneMKL specification: 129 | 130 | * In general, oneMKL is a mix of C and C++ idioms. If I had to pick one thing to change, it would be this. 131 | * Difficulty debugging things on GPUs, so anything that can help (type safety!) would be appreciated. 132 | 133 | Requested sparse BLAS functionality: 134 | 135 | * C = A * (B_local + B_remote); B_local are locally owned rows and B_remote are rows brought in from other MPI processes. This functionality may be somewhat specific to multi-grid; it would be useful for Trilinos developers.
136 | -------------------------------------------------------------------------------- /math/minutes/2021_05_19_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2021-05-19 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of device APIs for Random Number Generator domain <../presentations/2021-05-19_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Hartwig Anzt (KIT, UTK) 15 | * Romain Dolbeau (SiPearl) 16 | * Aleksandr Duev (Intel) 17 | * Pavel Dyakov (Intel) 18 | * Alina Elizarova (Intel) 19 | * Rachel Ertl (Intel) 20 | * Mehdi Goli (Codeplay) 21 | * Mark Hoemmen (Stellar Science) 22 | * Sarah Knepper (Intel) 23 | * Maria Kraynyuk (Intel) 24 | * Valentin Kubarev (Intel) 25 | * Nevin Liber (ANL) 26 | * Piotr Luszczek (UTK) 27 | * Mesut Meterelliyoz (Intel) 28 | * Vincent Pascuzzi (LBNL) 29 | * Svetlana Podogova (Intel) 30 | * Pat Quillen (MathWorks) 31 | * Alison Richards (Intel) 32 | * Nadezhda Sidneva (Intel) 33 | * Edward Smyth (NAG) 34 | * Shane Story (Intel) 35 | 36 | Agenda: 37 | 38 | * Welcoming remarks 39 | * Updates from last meeting 40 | * Overview of device APIs for Random Number Generator domain - Alina Elizarova 41 | * Wrap-up and next steps 42 | 43 | New member introductions: 44 | 45 | * Hartwig Anzt - Leads a team at Karlsruhe Institute of Technology and affiliated with University of Tennessee Knoxville (Innovative Computing Laboratory). Worked at ICL for years and developed MAGMA-Sparse. Now developing Ginkgo, a C++ based sparse linear algebra library, targeting accelerators from different vendors. 46 | 47 | * Romain Dolbeau - Works at a small, fairly new company in Europe called SiPearl, which is part of the European Processor Initiative, with a goal to develop HPC-oriented processors in the EU.
Mostly based on ARM technology, investigating proof of concept for European-developed accelerators. Very important to have libraries, like oneMKL, to make HPC users' lives easier. 48 | 49 | * Vince Pascuzzi: Postdoctoral researcher at Lawrence Berkeley National Laboratory, porting existing high-energy physics software to GPUs. Works in Center for Computational Excellence in DOE High Energy Physics (HEP) umbrella. Looking at different portability solutions to utilize as many upcoming HPC clusters as possible. Added support of cuRAND backend to the oneMKL interfaces project. 50 | 51 | 52 | Updates from last meeting: 53 | 54 | * Mark - Fourth revision (counting from 0) of the C++ standard library linear algebra proposal will have LEWG review in June; see wg21.link/p1673r3. 55 | 56 | Overview of device APIs for random number generator domain: 57 | 58 | Terminology: 59 | 60 | * Host-side APIs: All previous APIs that were discussed in oneMKL TAB meetings can be called manual offload, or host-side APIs, as all took a sycl::queue object and work only with global memory objects. 61 | * Device-side APIs: Sometimes there is a need to call functions inside a user's own kernels, so created new APIs that are called device-side. Their main purpose is to be used inside other kernels. 62 | 63 | Motivation: 64 | 65 | * For most applications, random number generators are not the end goal. For example, in Monte-Carlo simulations, need to have some post-processing to get down to a single number. With host APIs, using a memory-bound RNG may result in large performance overhead. 66 | * Workflow and names are pretty much the same in device APIs as in host APIs, but with a few differences. 67 | * Device APIs take no global memory buffer, just generate random numbers in private memory. User can then do post processing in the same kernel and return. 
Host and device APIs are in different namespaces; can’t share state between host and device APIs directly, as for device APIs each thread would have its own copy of the engine. 68 | 69 | oneMKL RNG Device APIs Example: 70 | 71 | * Simplest example of usage: create engine and distribution. The generate function itself does not take any memory objects, just like C++ standard library random number generators. The main thing is to generate a lot of random numbers for each work item and do some post-processing with them. 72 | 73 | More complex example, using Monte Carlo to estimate pi: 74 | 75 | * In device APIs, a single kernel is used for both generation and postprocessing. In host APIs, the user must create a buffer for random numbers, create an engine from the queue, then generate by passing the SYCL buffer. 76 | * Negative impact on efficiency of the host API - it has roughly 70% overhead from reading/storing the numbers. 77 | 78 | * Are the queries on the state done in parallel and from distinct work items? 79 | 80 | * Each work item would have an independent engine and be parallelized independently. 81 | 82 | Engine classes: 83 | 84 | * All device APIs are in the oneapi::mkl::rng::device namespace. There are different sets of engines and distributions in host- and device-side APIs. Currently two available engines for device-side APIs: philox4x32x10 and mrg32k3a engines. These have small states, with about 6 and 11 integer values, respectively. 85 | 86 | * It is nice that seeds and offsets can be controlled separately, maybe coarse- versus fine-grained. What happens if offsets are the same? 87 | 88 | * It will work (it won't crash), but the statistics would not be ideal. It performs exponential skipping; if you pass an initializer_list of {0, 1}, the resulting offset is 2^64 - this is the main technique used for mrg32k3a. In philox, every skip ahead is done in constant time. 89 | 90 | Distributions: 91 | 92 | * These are pretty similar to what is available on the host.
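The trade-off discussed in the Monte Carlo pi example above, fusing generation and post-processing in one kernel instead of first materializing a buffer of random numbers, can be sketched in plain Python. Python's random module stands in for the oneMKL engines; this models only the data flow, not the DPC++ APIs:

```python
import random

def estimate_pi_fused(n_points, seed=42):
    """Device-API style: each 'work item' generates its random numbers in
    private memory and post-processes immediately, so the full random
    stream is never written out to global memory."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()   # generated and consumed in place
        inside += (x * x + y * y) <= 1.0
    return 4.0 * inside / n_points

def estimate_pi_buffered(n_points, seed=42):
    """Host-API style: first fill a buffer with 2*n random numbers, then
    read it back for post-processing - the extra store/load traffic is
    the overhead discussed above."""
    rng = random.Random(seed)
    buf = [rng.random() for _ in range(2 * n_points)]   # materialized buffer
    inside = sum((buf[2 * i] ** 2 + buf[2 * i + 1] ** 2) <= 1.0
                 for i in range(n_points))
    return 4.0 * inside / n_points

print(estimate_pi_fused(100_000))   # converges toward pi as n grows
```

With the same seed both paths consume the same random stream and produce identical estimates; the only difference is whether that stream ever touches memory, which is where the host-API overhead comes from.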
93 | 94 | Generate functions: 95 | 96 | * Two generate functions: one returns a scalar, one returns a vector. generate_single should be used to calculate a tail, for example. Can keep same engine after generating vectors of, say, size 16, when you just need one or two more scalar values. 97 | 98 | * Distribution object may be modified - keep it low cost by passing in a non-const reference. 99 | 100 | * As long as you can copy a distribution lightly, it should not matter if you pass in const in one case, and non-const in another. 101 | * Pretend I'm writing a library, which is templated on the distribution. The library could just take the distribution by value, and internally decide which one it wants (const or non-const reference). 102 | * Good observation that users may need to differentiate between the host and the device. 103 | * Most distributions are characterized by 2 or 4 parameters - very cheap to copy. Decide from which namespace to use functionality. 104 | * Fine to pass by value because it is lightweight (that is the preferred idiom nowadays). 105 | 106 | More complex example - want to keep engine states between different kernels: 107 | 108 | * This is useful when you need to initialize engines outside of the main computational block, or if you start generating in one kernel and want to continue generating in another kernel. 109 | * The example initializes the engine in one kernel, then loads the state in the next kernel to generate random numbers. 110 | 111 | Helper functions: 112 | 113 | * Auxiliary APIs to assist in the previous situation where user needs to create buffers on host, by providing the ability to submit kernel and initialize engine inside the constructor. The helper interface provides accessor APIs. Can load and store engines in the kernel, similar to sycl vectors. 114 | * As a more complex case, functor is available: user can write lambda to initialize engine whenever they want. 
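The workflow above (initialize an engine in one kernel, store its state, then load it in a later kernel to continue the sequence) can be modeled in plain Python. Here random.Random stands in for a device engine and getstate/setstate stand in for the store/load helpers; all names are illustrative, not the oneMKL API:

```python
import random

def init_engine(seed):
    """'Initialization kernel': create the engine and store its state."""
    engine = random.Random(seed)
    return engine.getstate()            # stored state (global memory in SYCL)

def generate_kernel(state, n):
    """'Generation kernel': load the stored state, generate n numbers,
    then store the advanced state so a later kernel can continue."""
    engine = random.Random()
    engine.setstate(state)              # load
    values = [engine.random() for _ in range(n)]
    return values, engine.getstate()    # store advanced state

state = init_engine(seed=123)
first, state = generate_kernel(state, 8)    # one kernel...
second, state = generate_kernel(state, 8)   # ...the next continues the stream

# The two chunks match one uninterrupted stream from the same seed.
reference = random.Random(123)
assert first + second == [reference.random() for _ in range(16)]
```

The explicit store/load is the point: since each work item holds its own private copy of the engine, continuing a sequence across kernels requires the state to be written out and reloaded, which is exactly what the helper APIs automate.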
115 | 116 | Future steps: 117 | 118 | * Plan to add RNG device APIs to oneMKL spec 1.1, and to extend the set of Engines and Distributions that are supported. 119 | * In future, may think of supporting ESIMD extension (currently experimental). 120 | 121 | * Something we are adding to Netlib LAPACK is lots of randomized matrix methods. For those, we need Gaussian matrices. Saw vectors as input arguments in interfaces. Most generators were uniform. It would be an interesting exercise to see how you would use them to generate a Gaussian matrix for a randomized method. 122 | 123 | * For host APIs, there is the distribution called Gaussian-multivariate, which produces such a matrix. But this distribution is currently only for host APIs; device APIs are just for scalar and vector output. 124 | 125 | 126 | Question about multi-GPU applications/libraries experience: 127 | 128 | * What is meant by multi-GPU? Inside same node, using multiple GPUs without using MPI, or do you mean an MPI-based approach? 129 | 130 | * Both. We are considering all options, and trying to understand if someone has experience. Some approaches use multiple GPUs without MPI and do all submissions internally. If you have experience with both, that would be great to share. 131 | 132 | * Mark: Not a fun experience; mostly pleading with MPI drivers/library to do the right thing. The programming experience is improving; want to avoid the user having to wait for the GPU to finish the MPI communication. Need to know the device is done computing something to communicate it. Used MPI to communicate both within and across a node. Used Kokkos to run GPU kernels. Could have used NCCL, but that is not a full MPI stack (kind of complicated when crossing nodes). For portability, wanted to stick to MPI for the communication layer. Christian had some success using PGAS with Kokkos::View, just for individual kernels, not all communications. But have not followed closely for a year.
133 | 134 | * Hartwig: Also have experience with both, using distinct MPI ranks, each with one GPU. Terry (from group) has a background; if of significant interest, could ask to prepare some slides. Has an executor concept (think of a queue). Each queue/executor can have a different GPU, and can share data between executors. Mostly experimented on Nvidia architectures, though maybe some with AMD. In the end, it was a lot of effort. If you want to go across multiple nodes, you need to use MPI (at least for now). 135 | -------------------------------------------------------------------------------- /math/minutes/2021_10_06_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2021-10-06 6 | ========== 7 | 8 | Materials: 9 | 10 | * `DPC++ APIs for oneMKL Data Fitting <../presentations/2021-10-06_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Marius Cornea (Intel) 15 | * Romain Dolbeau (SiPearl) 16 | * Pavel Dyakov (Intel) 17 | * Alina Elizarova (Intel) 18 | * Andrey Fedorov (Intel) 19 | * Mehdi Goli (Codeplay) 20 | * Sarah Knepper (Intel) 21 | * Maria Kraynyuk (Intel) 22 | * Nevin Liber (ANL) 23 | * Ye Luo (ANL) 24 | * Piotr Luszczek (UTK) 25 | * Vincent Pascuzzi (BNL) 26 | * Pat Quillen (MathWorks) 27 | * Alison Richards (Intel) 28 | * Nikita Semin (Intel) 29 | * Edward Smyth (NAG) 30 | * Shane Story (Intel) 31 | 32 | Agenda: 33 | 34 | * Welcoming remarks 35 | * Updates from last meeting 36 | * DPC++ APIs for oneMKL Data Fitting - Nikita Semin and Andrey Fedorov 37 | * Wrap-up and next steps 38 | 39 | New member introduction: 40 | 41 | * Ye Luo - Computational scientist at Argonne National Labs (ANL). Major was Material Science and Physics. Doing simulations of materials at atomic scale, solving Schrödinger equation. 
Work on multiple methods, including quantum Monte Carlo, and Density Functional Theory (DFT), which solves the same equation in a different manner. These heavily rely on matrices; mostly dense linear algebra, sometimes sparse. Most time spent doing BLAS - heavy on matrix-matrix multiplications, also rely on linear solvers (LU, matrix inversion, solving eigenvalues). All parallel over MPI, then over threading, and some SIMD: use all the technologies offered by modern HPC and GPUs. In Density Functional Theory, use 1D/2D/3D FFTs, depending on the problem. 42 | 43 | DPC++ APIs for oneMKL Data Fitting: 44 | 45 | * Data fitting is set of functions to perform interpolation and integration. 46 | 47 | Data layouts for coefficients and functions: 48 | 49 | * Row major, column major, and first coordinate. The "first coordinate" layout assumes independent memory for each set of coefficients/functions. It is useful if you do not want to calculate functions 1-4, but only functions 1 and 3. 50 | 51 | Usage of Data Fitting: 52 | 53 | * Many C/C++/Python libraries contain interpolation functions, as well as applications that use it. 54 | * NumPy and cuPy have only linear 1D interpolation. SciPy and Eigen use cubic splines interpolation, for example. 55 | * Splines are pretty wide-spread functionality. For most of the interpolation of the previously-mentioned libraries, it is spline based. 56 | 57 | Proposal for oneMKL specification changes: 58 | 59 | * For proof that the specification is implementable, we can have Data Fitting DPC++ APIs as an experimental feature for the Intel oneMKL product, with a chance to change interfaces based on feedback. There's no de facto standard interface for data fitting. Opened oneMKL spec issue `#379 `__ for discussions; please feel free to make any comments there. 60 | 61 | What do you think about adding Data Fitting APIs to oneMKL specification? 
62 | 63 | * The oneMKL TAB expects it would be useful to provide because making splines can be difficult, so providing portable DPC++ APIs would be user friendly. 64 | * The oneMKL TAB recommends to keep in mind how it would extend to 2D/3D. For example, providing both a scattered collection of points as well as a grid of points. 65 | * Experience with two scenarios: first is one spline function, but many evaluations, based on particle distances; second is evaluating multiple functions on a single position (row or column major). For the second scenario, that could be vectorized on CPU or GPU, but not many opportunities for optimization for the first scenario. 66 | * Use both 1D and 3D cubic splines; use 1D quintic, but that is difficult to handle. 67 | 68 | Proposed APIs: 69 | 70 | * Target is to support both Buffer and USM APIs and make them flexible to cover all use cases. 71 | * Decided to split into 3 main parts: 72 | 73 | * Spline classes, support linear, quadratic, and cubic splines. 74 | * Free functions that operate with the spline objects. 75 | * Handlers that will let us simplify the APIs, show how the data is stored. 76 | 77 | * Partitions could be uniform, non-uniform, or quasi-uniform. 78 | * Interpolation sites could be uniform, non-uniform, or sorted. 79 | * Coefficients, functions, and interpolation results can be stored in row-major, column-major, or first-coordinate. 80 | 81 | Basic example: 82 | 83 | * User should include header file and initialize some data. 84 | * User creates special handlers for the data. We can use deduction guides since DPC++ uses C++17 - user will not need to specify any templates for these handlers. 85 | * After this, create a spline object that will concentrate all the data. 86 | * Then, compute spline coefficients by calling construct function. 87 | * Then call interpolate. The user can do some work on the coefficients as desired. 
Having free functions for construct and interpolate gives the ability to create dependencies between function calls; construct and interpolate return events. 88 | * In this simple case, when the user calls interpolate right after construct, the user doesn't need to specify the dependency. 89 | * The handlers may be created on the go. 90 | 91 | * Why have a separate construct function, and not have the constructor do the work? 92 | 93 | * 2 main reasons. We want to return an event from the construct function, but cannot from a constructor. We want to let the user control when the computations are performed, so we use a separate function. 94 | * These are host APIs: the interpolation is done on the device, but called from the host. 95 | 96 | * Does the user need to carry the cfh (coefficient function handler) around, ensuring the lifetime is longer than that of the cubic_spline? 97 | 98 | * No, because cfh does not own the memory. 99 | * Handlers are mostly views, used to specify some template parameters that can be deduced. 100 | * If we want to be able to deduce template parameters, we need to use a handler. This simplifies usage. 101 | * The user can create a handler on the fly if they do not want to specify any template parameters; it does not need to be kept until the end of the interpolation. 102 | 103 | * Why doesn't the spline own its own coefficients? That is, why do the APIs separate the compute from the memory management? 104 | 105 | * There may be workflows where the coefficients are computed by the user, so this workflow is supported. But this is not the main use case. 106 | * In the more common use case where the spline computes the coefficients, the user will still need to know the size of the coefficients. 107 | * The oneMKL TAB recommends considering APIs where the spline manages the coefficients. This may be especially important once/if 2D splines are supported.
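A minimal Python model of the usage flow described above: create the spline object, call a separate construct function to compute coefficients, then call a free interpolate function. The class and function names here are hypothetical and a linear spline keeps the sketch short; the real APIs are DPC++:

```python
import bisect

class LinearSpline:
    """Toy model of the proposed workflow: computing coefficients is a
    separate construct() step, mirroring the separate construct free
    function (which in DPC++ returns a sycl::event)."""
    def __init__(self, partitions, function_values):
        self.x = list(partitions)
        self.f = list(function_values)
        self.coeffs = None              # not computed until construct()

    def construct(self):
        # Coefficient of each piece: the slope between neighbouring points.
        self.coeffs = [(self.f[i + 1] - self.f[i]) / (self.x[i + 1] - self.x[i])
                       for i in range(len(self.x) - 1)]
        return self                     # stands in for the returned event

def interpolate(spline, sites):
    """Free function operating on the spline object, as in the proposal."""
    results = []
    for s in sites:
        # Locate the sub-interval containing the interpolation site.
        i = min(max(bisect.bisect_right(spline.x, s) - 1, 0),
                len(spline.coeffs) - 1)
        results.append(spline.f[i] + spline.coeffs[i] * (s - spline.x[i]))
    return results

spline = LinearSpline([0.0, 1.0, 2.0], [0.0, 2.0, 3.0]).construct()
print(interpolate(spline, [0.5, 1.5]))   # -> [1.0, 2.5]
```

Keeping construct separate from the constructor mirrors the two reasons given above: the caller decides when the (device) computation runs, and the call site has somewhere to return a completion event from.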
108 | 109 | Advanced example: 110 | 111 | * Use an std::vector of pointers for coefficients. 112 | * Deduction guides available in C++17, so user does not need to specify any template parameters. But sometimes we cannot use deduction guides. If user wants to use quasi_uniform partition, the user needs to pass the template parameter to the handler, but then the user doesn't need to specify when creating cubic_spline. 113 | * Other part of usage model stays the same - similar usage for different situations. In these two examples, spline object has the same type and can be stored in some container. Our spline objects do not depend on how we store data (column/row major, or first coordinate) - still has the same type. 114 | 115 | Spline interface: 116 | 117 | * Interfaces for different orders of splines are nearly the same. 118 | * The class has only 2 template parameters. 119 | * The design allows us to store splines with different layouts in the same containers. 120 | * The constructor takes handlers, values corresponding to quantity of partitions and quantity of functions, and handlers for internal and boundary conditions. 121 | 122 | * Is the order of template parameters going from cubic_spline to spline_base really swapped? 123 | 124 | * Good catch; they will be the same order. 125 | 126 | * What is the role of the partition handler? 127 | 128 | * The main goal is to tell the layout; the partitions (x values) should be sorted. 129 | 130 | * Is it permitted to mix precision types? For example, double precision partition and single precision function? 131 | 132 | * No, everything has to be the same floating point type. You also cannot mix buffer and USM. 133 | 134 | * Why do we need so many separate templates and handlers? It seems like the user would have to juggle a lot of stuff: a wrapper for the coefficients plus several handlers - the user would need to write a wrapper to encapsulate all these pieces. 
135 | 136 | * We use multiple handlers because when the user wants to specify a certain layout (which can’t be deduced), the handlers let them specify it on that one handler only. Otherwise, the user would have to specify all template parameters corresponding to the layouts of coefficients/functions/partitions/boundary_conditions/internal_conditions, specify whether the coefficients were already computed, and specify the type of floating point storage (float* or sycl::buffer, or the same for double) - 7 parameters in total, which we consider inconvenient. 137 | * The oneMKL TAB suggests reducing the number of handlers/templates: It seems that 1 handler is needed along with the cubic_spline object, and perhaps management of the pointer to coefficients, but that should be sufficient. 138 | 139 | * Is it necessary to store the queue in the cubic spline? 140 | 141 | * Yes, so that it can be checked that the memory was allocated in the same context as the queue. 142 | * Note that if the spline managed the memory, then you could get away from this. 143 | * Because of this, the user would need two separate queues and two separate cubic_spline handlers if they wanted to have two queues doing the same interpolation without interfering with each other. 144 | 145 | * Is it possible to separate the execution part of the logic from the data part of the logic? 146 | 147 | * The execution part is in the construct function. It does not take a queue. 148 | 149 | Meeting ended on Slide 16, before finishing slides, but further discussion and feedback on the oneMKL specification issue `#379 `__ is highly welcome!
150 | -------------------------------------------------------------------------------- /math/minutes/2022_03_23_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2022-03-23 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Update on DPC++ APIs for oneMKL Data Fitting <../presentations/2022-03-23_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Hartwig Anzt (KIT) 15 | * Terry Cojean (KIT) 16 | * Romain Dolbeau (SiPearl) 17 | * Pavel Dyakov (Intel) 18 | * Alina Elizarova (Intel) 19 | * Andrey Fedorov (Intel) 20 | * Mehdi Goli (Codeplay) 21 | * Harshitha Gunalan (Intel) 22 | * Louise Huot (Intel) 23 | * Sarah Knepper (Intel) 24 | * Maria Kraynyuk (Intel) 25 | * Piotr Luszczek (UTK) 26 | * Vincent Pascuzzi (BNL) 27 | * Pat Quillen (MathWorks) 28 | * Alison Richards (Intel) 29 | * Nikita Semin (Intel) 30 | * George Silva (Intel) 31 | 32 | Agenda: 33 | 34 | * Welcoming remarks 35 | * Updates from last meeting 36 | * Update on DPC++ APIs for oneMKL Data Fitting - Andrey Fedorov and Nikita Semin 37 | * Wrap-up and next steps 38 | 39 | New member introduction: 40 | 41 | * Terry Cojean - Completed his PhD in 2018, during which he worked with Intel MKL BLAS and LAPACK functionality through runtime systems. He now works with Hartwig Anzt, developing the math library Ginkgo, which interfaces with different GPU technologies (Nvidia, Intel, etc.) via cuSparse, Intel oneMKL, and so forth. This includes sparse matrix-matrix multiplications, random number generators, and dot and norm from dense BLAS. He has experience in library and interface design. 42 | 43 | Updates from last meeting: 44 | 45 | * oneMKL TAB meetings will be held every two months.
46 | * In the oneMKL Interfaces open source project, LAPACK domain is now supported on Nvidia GPUs via cuSolver. 47 | * The Intel oneMKL 2022.0 product was released, with the Intel oneMKL 2022.1 release coming soon. 48 | * oneAPI Image Processing Library (oneIPL) provisional specification was released and oneIPL TAB meetings have commenced. 49 | * A Level Zero TAB will be starting soon; please let us know if you would be interested in joining. 50 | 51 | DPC++ APIs for oneMKL Data Fitting: 52 | 53 | * We introduced Data Fitting at a previous oneMKL TAB meeting. Now we are going to update what changes have been made and answer some questions from the previous meeting. What we are discussing now is part of the upcoming Intel oneMKL 2022.1 release. There is also a pull request for the oneMKL specification; please feel free to provide feedback there. 54 | 55 | What Data Fitting Is: 56 | 57 | * Data Fitting is a set of functions to perform interpolation and integration. Terminology is given on the picture. On the x-axis, there is a partition, where we split the interval into sub-intervals of partition points. There are also function values as an input. The small red points on the x-axis are interpolation sites; they are points we need to calculate function values for. Red points on the curve are interpolation results. We can operate over multiple functions; we can interpolate over multiple functions using the same partition points. Interpolation is mostly based on splines (linear, quadratic, cubic). 58 | 59 | * We currently have two data layouts: row major and column major. Row major means we place all function values for partition points for the first function, then all function values for each partition point for the second function; each horizontal line is a function. For column major, we place function values for all functions for first partition point, then for second partition point, and so forth. 
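The row-major and column-major layouts just described can be made concrete with explicit index arithmetic. This is an illustrative Python sketch, not library code:

```python
def row_major_index(func, point, n_points):
    """Row major: all partition-point values of function 0, then all of
    function 1, ... (each 'horizontal line' of the slide is one function)."""
    return func * n_points + point

def col_major_index(func, point, n_funcs):
    """Column major: values of every function at partition point 0,
    then at partition point 1, and so forth."""
    return point * n_funcs + func

# 2 functions sampled at 3 partition points.
n_funcs, n_points = 2, 3
values = [[10, 11, 12],    # function 0 at points 0..2
          [20, 21, 22]]    # function 1 at points 0..2

row = [0] * (n_funcs * n_points)
col = [0] * (n_funcs * n_points)
for f in range(n_funcs):
    for p in range(n_points):
        row[row_major_index(f, p, n_points)] = values[f][p]
        col[col_major_index(f, p, n_funcs)] = values[f][p]

print(row)  # [10, 11, 12, 20, 21, 22]
print(col)  # [10, 20, 11, 21, 12, 22]
```

By contrast, the "first coordinate" layout asked about below would simply keep one independent array per function (here, the two inner lists of `values`), with no requirement that they be contiguous.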
60 | 61 | * We want to ask you about a data layout named "first coordinate". This means we have independent memory for each set of functions. On previous slide, the function values are placed in same memory. Here, we have one memory for function values for all partition points for first function, then one memory for second function, and so forth. 62 | * Does it make sense to add this layout to the oneMKL specification? Do you have use cases where this would be useful? 63 | 64 | * No known use cases. 65 | 66 | * Is an example of this leaving out the functions you don't have? Why are the boxes misaligned in the graphic? 67 | 68 | * This would be pointing to the first value for each function. The graphic is just illustrating that memory may be non-contiguous. 69 | 70 | Updates from previous oneMKL TAB meeting: 71 | 72 | * Why do we need to have a separate construct function and not have the constructor do the work? 73 | 74 | * We want to let the user control when the computations are performed. Plus, we can't return sycl::event from the constructor, but we need it. 75 | 76 | * Is it possible to mix precision types? 77 | 78 | * We only allow same floating-point type for partitions and function values and coefficients. 79 | 80 | * Is it necessary to store the queue in the spline object? 81 | 82 | * Yes because we need the context of execution inside the spline. 83 | 84 | * Why doesn't the spline own its own coefficients? 85 | 86 | * This was a difficult question. The coefficients may live longer than the spline. We consider our splines as kind of view. 87 | 88 | * Do you have any actual use case where the coefficients outlive the spline? 89 | 90 | * We don't want to prohibit this. If the spline will own the coefficients, users may want to save it to some file, so we will need to provide extra functionality to save splines. Also the question can be asked not only about coefficients but also for function values. We may not want to lose function values for some reason. 
91 | 92 | * No disagreement, but still no concrete use case. It's fine to manage the coefficients of the spline outside of the spline, but from the point of view for ease of use, the oneMKL TAB recommends packing the coefficients in the spline. 93 | 94 | * Though there could be cases, like having precomputed spline coefficients in a file, or doing an iterative procedure, where it helps for the user to own the memory for the functions. 95 | 96 | * What we are proposing as a pull request for the oneMKL specification (and that are part of Intel oneMKL 2022.1): only USM APIs, supporting 2 types of splines, 3 types of partitions, 2 function value layouts, and 5 boundary conditions. 97 | 98 | * Nothing important noted as missing. 99 | 100 | * After some investigation, we changed our original APIs. Let's look at the spline class and interpolate free function. 101 | 102 | Spline interface: 103 | 104 | * There's a common spline class with 3 template parameters: floating point type, spline type (which order), and dimensions. 105 | * Currently support only 1D splines, but this allows further expansion. 106 | * Based on previous oneMKL TAB meeting feedback, we decided to get rid of different data handlers. 107 | * We faced the problem where we have many parameters and need to handle them somehow. We decided to move certain parameters out of the constructor and left only part of them there. 108 | * There are 2 constructors in spline class, one taking a queue, one taking device and context. Other parameters for constructors are quantity of functions to compute splines for, and if the coefficients were already computed. 109 | * We don't want to perform deep copy, so we removed copy/move constructors and assignment operators. 110 | * We have some set functions, usually first parameter is the data, and the other is a hint about the storage. 
111 | * We also have some functions that simplify spline usage: is_initialized, to tell if all values are set; get_required_coeffs_size, returns expected size for coefficients array; and a construct function that forms the spline construction by computing coefficients and returning sycl::event to user. Some splines require additional functions, like cubic splines need internal and boundary conditions set. 112 | 113 | Interpolate function: 114 | 115 | * 4 overloads, only showing 2 on slide. 116 | * This is templated by an interpolant - we didn't call it spline because we may have non-spline-based interpolation in future. We can use it with splines of different types now. 117 | * Interpolant is first parameter, then array of where we want to perform interpolation, then size of array, and where we want to store interpolation results. There is an indicator that shows the library which derivatives should be computed - each bit in this parameter indicates a certain derivative. Finally dependencies and hints for result storage and sites. 118 | * The other example overload takes a queue - this function will make computations associated with the input queue, not the queue that is associated with the interpolant. This may be needed if, for example, users want to perform interpolation with different queues where each queue corresponds to a certain subdevice but not the whole device, but using the same coefficients and other interpolation sites. 119 | * We also have an overload without computing derivatives. 120 | 121 | * Are the two hint parameters at the end independent of the interpolant? 122 | 123 | * Interpolant has its own hints for coefficients and function values; those hints are fully independent. 124 | 125 | Example: 126 | 127 | * Functionality is in experimental namespace. 128 | * User will fill initial data, then construct spline where the certain spline type is passed as template parameter. 129 | * Here we construct a spline without precomputed coefficients. 
130 | * After constructing the spline, we set some data such as partitions, coefficients, functions. 131 | * Then we call construct function to compute coefficients. 132 | * After that, the user will need to prepare data for interpolation results. 133 | * Then call interpolate function to get functional values. Example is without derivatives. 134 | 135 | * The example uses malloc_shared, which means both host and device can access. Does it also work with malloc_device? 136 | 137 | * Yes. 138 | 139 | * We're adding DPC++ data fitting interfaces to the oneMKL specification: pull request `#413 `__. Feedback is welcome! 140 | 141 | * Interfaces are also implemented in the upcoming Intel oneMKL 2022.1 release, so you can try it and may have more questions after that. 142 | * If you know someone who may be interested in spline interpolation functionality, please let us know. 143 | 144 | Wrap-up: 145 | 146 | * Any timeline for when the FFT domain will be available in the open source oneMKL interfaces project? 147 | 148 | * Currently targeting the second half of 2022. 149 | 150 | * What is the process for deprecating v1.0 of the specification? 151 | 152 | * We'll get back to you on this. 
153 | -------------------------------------------------------------------------------- /math/minutes/2022_06_08_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2022-06-08 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Update on the oneMKL interfaces open source project <../presentations/2022-06-08_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Terry Cojean (KIT) 15 | * Marius Cornea (Intel) 16 | * Mehdi Goli (Codeplay) 17 | * Harshitha Gunalan (Intel) 18 | * Mark Hoemmen (Nvidia) 19 | * Louise Huot (Intel) 20 | * Sarah Knepper (Intel) 21 | * Nevin Liber (ANL) 22 | * Ye Luo (ANL) 23 | * Piotr Luszczek (UTK) 24 | * Mesut Meterelliyoz (Intel) 25 | * Spencer Patty (Intel) 26 | * Pat Quillen (MathWorks) 27 | * George Silva (Intel) 28 | * Shane Story (Intel) 29 | 30 | Agenda: 31 | 32 | * Welcoming remarks 33 | * Updates from last meeting 34 | * Update on the oneMKL interfaces open source project - Mesut Meterelliyoz 35 | * Wrap-up and next steps 36 | 37 | Updates from last meeting: 38 | 39 | * Clarification about deprecation. We do not deprecate the specification, but we can deprecate an API in a specification. 40 | 41 | Update on the oneMKL open-source interfaces project: 42 | 43 | * Two years ago Maria Kraynyuk gave an overview of oneMKL interfaces. 44 | * oneMKL is part of the oneAPI initiative. Purpose is to achieve best possible performance on different hardware using the same source code. 45 | * This started with CPUs and GPUs; it is possible to extend to other accelerators like FPGAs, but this is not yet done. 46 | * We started with support for Intel and Nvidia GPUs; support has been extended to AMD GPUs. 
47 | 48 | 2 usage models for dispatching: 49 | 50 | * Run-time dispatch: Sample code queries two devices; one for CPU, one for GPU, and creates a queue for each device. There are two calls to gemm; the first will run on the cpu device, the second on the gpu device. 51 | * What happens in the background: we have a dispatcher library called libonemkl; when you link an application with this dispatcher library, it looks at the device info in the queue and selects and loads the third-party library based on that info. The purpose of the backend is to convert the DPC++ runtime objects from the application to third-party library specific objects. Let's say we are using cuBLAS backend: from the GPU queue it will understand NVIDIA GPU and pick the cuBLAS backend branch, and then make the cuBLAS GEMM call, and then run on the NVIDIA GPU hardware. 52 | * Run-time dispatch supports only dynamic linking. 53 | 54 | * Compile-time dispatch: The difference here is that the application uses a templated backend selector API. The user specifies what backend to use. Similar example as before, creating two selectors (one for cpu, one for gpu), and these selectors are used in the gemm call. 55 | * The difference in the compilation is that the user needs to know which library to link to. When you compile oneMKL interfaces, you'll select which library. 56 | * Compile-time dispatch supports both static and dynamic linking. 57 | 58 | * The compilation flag -fsycl can be problematic. Is it possible to avoid needing -fsycl, though the user would still need to specify a sycl library? 59 | 60 | * The -fsycl is needed for the linkage with the sycl library as well. It is not only for compiling the kernels. 61 | 62 | oneMKL interfaces - What is new? 63 | 64 | * Slide 8 summarizes what has been added over the past two years. Previously we only had BLAS domain; we've added support for random number generator (RNG) and LAPACK. Other domains are in the pipeline. 
65 | * We've added support for new hardware: AMD GPUs. 66 | * We've added support for new backends. For BLAS, we added support for NETLIB backend on CPU. We added support for cuRAND and rocRAND for RNG, rocBLAS for BLAS, and cuSOLVER for LAPACK. 67 | * We added support for new compiler: hipSYCL. This was done by Heidelberg University. 68 | * We made some changes in codebase to align with SYCL2020 specification; codebase is now compatible with SYCL2020 specification. 69 | * Earlier we had only functional tests for all APIs. There was a request to show some examples for using oneMKL interfaces project. We added examples for all domains, showing how to use compile time and run-time dispatch for each example. 70 | * We had an issue about performance between native cuBLAS and the cuBLAS backend. This was due to creating a context every time an API was called. Once the context was cached, this closed the gap and should see performance on par between these two now. 71 | * On the BLAS domain, we've been adding new APIs to the oneMKL spec, and since the oneMKL interfaces project is an implementation of the oneMKL spec, we added these new APIs to the oneMKL interfaces project, including tests. So right now, the BLAS domain should be following the oneMKL spec directly. 72 | * We've been getting more usage, with users reporting issues and improvements, so we're continually working on fixing bugs and documentation updates of this project. 73 | 74 | * Is there a timeline for sparse domain to be supported? 75 | 76 | * Hopefully next year. 77 | 78 | Support matrix by compiler: 79 | 80 | * We split support by compiler to show which domain is covered by which compiler. 
81 | * In the open source project, we support three compilers: 82 | 83 | * DPC++ - can be downloaded from the oneAPI Base Toolkit (see links on bottom right) 84 | * LLVM compiler - available on GitHub 85 | * hipSYCL compiler - available on GitHub 86 | 87 | * Intel oneMKL is tested and shipped with the DPC++ compiler, so for Intel GPUs only the DPC++ compiler is supported. 88 | * We are actively working to add the AMD rocBLAS backend using LLVM's experimental support for AMD. A pull request is in progress. Once the BLAS domain is in good shape, we'll extend to other domains. 89 | * On Windows, LLVM only supports CPU devices so far. 90 | * The focus of the rest of the presentation is hipSYCL. Currently there is support for the BLAS and RNG domains on AMD GPUs. 91 | * While the name hipSYCL may make you think it works only for AMD hardware, it also supports NVIDIA GPUs. There are open pull requests to enable the cuBLAS and cuRAND libraries with the hipSYCL compiler. 92 | 93 | hipSYCL overview: 94 | 95 | * There are four major compilers for SYCL: 96 | 97 | * DPC++, either Intel's version or the open source LLVM version 98 | * triSYCL from Xilinx 99 | * ComputeCpp from Codeplay 100 | * hipSYCL, shown in gray boxes 101 | 102 | * hipSYCL is open source and was started by Aksel Alpay as a hobby project. 103 | * The difference between hipSYCL and other implementations is that the other implementations started from OpenCL. hipSYCL doesn't use OpenCL at all; it uses OpenMP, CUDA, or ROCm directly, depending on the hardware. 104 | * They started working on this because not all hardware vendors had adopted SYCL yet, so hipSYCL helps to close that gap. 105 | * hipSYCL enables SYCL code to utilize vendor-optimized libraries, as well as vendor debug and performance tuning tools, to optimize your SYCL code. 106 | 107 | Why did we start working with hipSYCL?
108 | 109 | * Intel has 22 oneAPI Center of Excellence (CoE) programs; this includes national labs and universities around the globe. The purpose of the CoE with Heidelberg University is to bring oneAPI components to AMD hardware by leveraging hipSYCL. This is the first and only attempt to implement oneAPI with a compiler independent of DPC++. This demonstrates the purpose of oneAPI - to be an open source initiative. 110 | * Programming model: hipSYCL should support key SYCL 2020 features and compile oneAPI code that achieves performance within 80% of CUDA/ROCm performance. 111 | * hipSYCL should run oneAPI libraries. This is where oneMKL comes into the picture: the oneMKL interfaces project serves as a proof point that oneMKL can run with hipSYCL. 112 | * Last milestone - hipSYCL should be able to support the low-level Level-Zero API, so it can eventually be used for Intel GPUs. 113 | 114 | Collaboration on oneMKL interfaces for hipSYCL 115 | 116 | * Today you can build oneMKL interfaces with hipSYCL on AMD or (soon) NVIDIA GPUs. 117 | * A performance comparison table is shown on the right-hand side of slide 12. 118 | * Plots for three APIs: GEMM, GEMV, and AXPY - one API for each BLAS level. GEMM is compute bound, while GEMV and AXPY are memory bound. So you can compare performance for both compute-bound and memory-bound problems. 119 | * The dark bars are from native rocBLAS calls. The gray bars are from oneMKL interfaces. The plots show that there are basically no performance differences calling via these different interfaces. So the goal of achieving the best performance on different hardware is met. 120 | * Future work is NVIDIA GPU support (in progress) and LAPACK domain support with hipSYCL. A later goal is to add Intel GPU support with hipSYCL through Level-Zero. 121 | 122 | * Why do you explicitly need Level-Zero? 123 | 124 | * If you are using Intel GPUs today, you would use the Intel DPC++ compiler and Level-Zero (the POR backend for DPC++).
125 | * Once hipSYCL supports Intel GPUs, you would still need Level-Zero for this. 126 | * If a different GPU is targeted (AMD or NVIDIA), then Level-Zero would not be needed. 127 | * For a user to call the oneMKL interfaces in an application, no direct call to Level-Zero would be needed, regardless of the targeted backend/GPU. 128 | 129 | * In the charts, sometimes the black bar is lower than the gray bar, even for the medium range of sizes. The GEMV result for size 1000, in particular, shows a really low rocBLAS time (high performance). Why is this? 130 | 131 | * The performance charts were generated by the hipSYCL team, so we don't have insights into this. It's possible there was some instability in the runs. 132 | 133 | * For the LLVM compiler support for AMD GPUs, was this also done by the hipSYCL team or someone else? 134 | 135 | * This is being done by Intel. The LLVM HIP backend support is experimental. We have plans to add it to the oneMKL interfaces, but we may need to wait until the compiler matures, in case things break. 136 | 137 | * Is hipSYCL a compiler or a header-only library? 138 | 139 | * hipSYCL is a SYCL compiler; it provides a multi-backend implementation of SYCL for CPUs and GPUs. 140 | 141 | * Any hope to get AMD more involved? 142 | 143 | * We are always happy to have contributions! 144 | 145 | * For a future meeting topic, Mark Hoemmen would be interested in presenting on the proposal for C++ standard linear algebra algorithms (P1673).
146 | -------------------------------------------------------------------------------- /math/minutes/2022_07_27_Minutes.rst: -------------------------------------------------------------------------------- 1 | ============================================= 2 | oneMKL Technical Advisory Board Meeting Notes 3 | ============================================= 4 | 5 | 2022-07-27 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Overview of matrix transposition and copy routines <../presentations/2022-07-27_Slides.pdf>`__ 11 | 12 | Attendees: 13 | 14 | * Andrew Barker (Intel) 15 | * Terry Cojean (KIT) 16 | * Mehdi Goli (Codeplay) 17 | * Harshitha Gunalan (Intel) 18 | * Louise Huot (Intel) 19 | * Cheol Kim (Intel) 20 | * Sarah Knepper (Intel) 21 | * Maria Kraynyuk (Intel) 22 | * Nevin Liber (ANL) 23 | * Piotr Luszczek (UTK) 24 | * George Silva (Intel) 25 | * Shane Story (Intel) 26 | 27 | Agenda: 28 | 29 | * Welcoming remarks 30 | * Updates from last meeting 31 | * Overview of matrix transposition and copy routines - Andrew Barker 32 | * Wrap-up and next steps 33 | 34 | Updates from last meeting: 35 | 36 | * Planning an extra oneMKL TAB meeting on August 24. 37 | 38 | Overview and motivation: 39 | 40 | * The main motivation is that matrix transposition is something applications do a lot, but it isn't explicitly supported in standard BLAS. 41 | * There are routines in the Intel oneMKL product to do out-of-place scaling and transposition, in-place scaling and transposition, and matrix addition/scaling, called omatcopy, imatcopy, and omatadd, respectively. 42 | * Similar functionality in NVIDIA cuBLAS and AMD rocBLAS is called geam. The geam function is the only such API in cuBLAS and rocBLAS; they cover some of the other combinations by passing the same pointer for A and C or for B and C. The documentation in Intel oneMKL prohibits that, which is why it has different APIs. 43 | * op(A) may be transpose, non-transpose, or a conjugate transpose for complex arithmetic.
44 | 45 | Vendor library interfaces and support: 46 | 47 | * The omatadd function from the Intel oneMKL product is most similar to geam from NVIDIA cuBLAS/AMD rocBLAS. In-place transposition isn't supported by NVIDIA cuBLAS/AMD rocBLAS, but it is by imatcopy in Intel oneMKL. C += beta B^T, for example, is supported only in NVIDIA cuBLAS. 48 | 49 | Comparison of omatadd and cublasSgeam APIs: 50 | 51 | * Very similar list of parameters. 52 | 53 | The basic question we have for the oneMKL TAB is which direction to go: 54 | 55 | * One option is to use {i,o}matcopy/omatadd APIs. An advantage of this is that users of Intel oneMKL CPU functions would have an easy on-ramp; this would also be a quick implementation in oneMKL open source interfaces with Intel oneMKL backend. 56 | * Another option is to use the geam APIs, which would be an easy on-ramp for existing NVIDIA cuBLAS/AMD rocBLAS GPU users. 57 | 58 | Would this cover device-to-host and host-to-device copies? 59 | 60 | * On GPU for graphics, there are pitch copies, like cudaMemcpy2D. The data is not really matrix data but texture data. 61 | * MAGMA uses some of this additional functionality. For example, in LU, you may look for the pivots on the CPU but then apply the pivots on the GPU. This requires copying the data between the device and host and back, and often with a transpose because you need the matrix panel in different forms (e.g., transpose the data on the device to do pivoting). 62 | * MAGMA has routines that do device-to-host and host-to-device copying, with optional transpose in between. 63 | * The SYCL 2020 specification for images covers such host-device copies for images. However, they only support half and float - not double (because images generally aren't stored in double precision). For HPC, double precision is often needed. 64 | * Currently only on-device copying is being considered. 65 | 66 | What batched interfaces are supported? 
67 | 68 | * The Intel oneMKL product supports both group and strided batch omatcopy and imatcopy. Only strided batch is currently supported for omatadd. 69 | 70 | The example use cases and APIs are only on-device copies. 71 | 72 | * They can be either SYCL buffers or USM pointers, but they all have to be device accessible. 73 | * The on-device copy is useful; it can occur with similarity transformations. 74 | * If the oneMKL specification is not planning to support host-to-device copying, that could be handled on the user side. 75 | 76 | Is there a preference between {i,o}matcopy/omatadd and geam for function names? 77 | 78 | * That would depend on the user base. 79 | * Since users may be coming from rocBLAS and cuBLAS, the natural match would be geam. 80 | * Right now, the draft implementation uses the {i,o}matcopy/omatadd APIs. 81 | * There is no strong preference for one over the other. 82 | 83 | Is double precision planned to be supported for images in a future SYCL version? 84 | 85 | * Not currently. However, SYCL buffers can be used to copy data between host and device, and then a SYCL kernel can be written to do the transposition.
86 | -------------------------------------------------------------------------------- /math/minutes/2023_07_12_Minutes.rst: -------------------------------------------------------------------------------- 1 | ========================================= 2 | Math Special Interest Group Meeting Notes 3 | ========================================= 4 | 5 | 2023-07-12 6 | ========== 7 | 8 | Materials: 9 | 10 | * `General updates <../presentations/2023-07-12_Slides.pdf>`__ 11 | * `Integrating SYCL-BLAS into oneMKL <../presentations/2023-07-12_oneMKL-PortBLAS.pdf>`__ 12 | 13 | Attendees: 14 | 15 | * Alejandro Acosta (Codeplay) 16 | * Andrew Barker (Intel) 17 | * Frank Brill (Cadence) 18 | * Ouadie El Farouki (Codeplay) 19 | * Mehdi Goli (Codeplay) 20 | * Paolo Gorlani (Codeplay) 21 | * Sarah Knepper (Intel) 22 | * Piotr Luszczek (UTK) 23 | * John Melonakos (Intel) 24 | * Kumudha Narasimhan (Codeplay) 25 | * Helen Parks (Intel) 26 | * Pat Quillen (MathWorks) 27 | * Alison Richards (Intel) 28 | * Nicolò Scipione (Codeplay) 29 | * Shane Story (Intel) 30 | 31 | Agenda: 32 | 33 | * Welcoming remarks 34 | * Updates from last meeting 35 | * Integrating SYCL-BLAS into oneMKL - Ouadie El Farouki 36 | * Wrap-up and next steps 37 | 38 | Updates from last meeting: 39 | 40 | * The Image Special Interest Group was launched, focused on the oneAPI Image Processing Library. This and other SIGs are open for others to join, so please feel free to extend invitations to people from your own or other organizations. 41 | * rocFFT was added as a backend for the DFT domain in the open source oneMKL interfaces project. 42 | * Additional functionality was added for several different backends. 43 | 44 | SYCL-BLAS as a oneMKL backend - Ouadie El Farouki: 45 | 46 | * Paolo, Nicolò and Alejandro are colleagues who are happy to answer questions. 47 | * Motivations of BLAS (Basic Linear Algebra Subprograms): reusability through a common interface and exploiting hardware capabilities for efficiency and accuracy.
48 | * There are extensions of BLAS, including batched operations. 49 | * There are two categories of BLAS libraries: open source and proprietary. Proprietary libraries are often fine-tuned for specific hardware, while the open source ones may be more portable. 50 | 51 | SYCL-BLAS: 52 | 53 | * SYCL-BLAS is a SYCL and C++ based BLAS implementation started in 2015 by Codeplay. There have been more than 40 developers, with many papers published about it. 54 | * It follows modern C++ specifications and can be used as a header-only library. It uses template meta-programming. SYCL-BLAS was developed to be the reference SYCL-based implementation; it can be compiled by different SYCL compilers and run on any SYCL-compatible device. 55 | * Performance and portability are achieved in SYCL-BLAS through template meta-programming. 56 | * Choose SYCL_COMPILER, the TUNING_TARGET for the device to tune for (so it picks the right template parameters for the given device), and the SYCL_TARGET and SYCL_ARCH for computational purposes. 57 | * Auto-tuning - given the problem size on the target device, the auto-tuner will choose the best configuration and return the optimal kernel parameters (tiling size, cache line, work group size, etc.). 58 | * As examples of ongoing use-cases of SYCL-BLAS, it is used as an optional backend for SYCL-DNN. It is also used experimentally for a JPEG compression application that tunes the GEMM operator in the JPEG compression algorithm. 59 | 60 | SYCL-BLAS and oneMKL: 61 | 62 | * oneMKL is the open source implementation of the oneMKL specification, which is part of the 10 core specifications of oneAPI. oneMKL is broken into different domains, including BLAS. oneMKL interfaces support multiple devices through third-party libraries. There may be multiple libraries that map to a given device. 63 | * For the BLAS domain, Intel oneMKL, NVIDIA cuBLAS, and AMD rocBLAS are backends. SYCL-BLAS is a portable backend that can handle all of those devices.
If we come to support other devices, they will be supported by the SYCL-BLAS backend as well. 64 | * You can specify the SYCL-BLAS backend when building oneMKL, but you can't choose SYCL-BLAS alongside other backends; the others need to be disabled. 65 | * There are two usage models inherent to oneMKL: run-time dispatching (you don't need to give the backend at compile time; it is chosen at run time), or compile-time dispatching, where the backend is specified as a template parameter. 66 | * A concise example of compile-time dispatching is given in the slides, where the device and backend are selected, buffers are prepared, and the computation is launched. 67 | 68 | Future plans: 69 | 70 | * A pull request is already open for full support of USM. 71 | * Working on row-major support. 72 | * Increasing operator coverage (currently ~60% implemented). 73 | * Supporting complex types for operators that support them. 74 | 75 | Update: to emphasize its portability feature, SYCL-BLAS will be renamed to portBLAS. 76 | 77 | How big is your search space for auto-tuning BLAS, and how long does it take to scan it? How does the scanning of the search space to get optimal performance work? 78 | 79 | * We have a set of configurations for each device target, which are put into a JSON file. Some of them are experimental hints that include the optimal configuration for that device. The user inputs the problem size. There is a very basic, exhaustive, brute-force search over all configurations that are available for the operation (especially gemm). It launches different gemm kernels and sorts the results based on performance. It prints all configurations along with their profiling results. 80 | 81 | Since this can be used as a header-only library, what is the compilation time? 82 | 83 | * Not long. oneMKL enables the default, reference backend (not all configurations). There is a guide for enabling different backends. 84 | 85 | How does performance compare to proprietary libraries?
86 | 87 | * This is work in progress; we want a benchmark to measure this. 88 | -------------------------------------------------------------------------------- /math/minutes/2023_10_25_Minutes.rst: -------------------------------------------------------------------------------- 1 | ========================================= 2 | Math Special Interest Group Meeting Notes 3 | ========================================= 4 | 5 | 2023-10-25 6 | ========== 7 | 8 | Materials: 9 | 10 | * `Discussion on Discrete Fourier Transform APIs <../presentations/2023-10-25_Slides.pdf>`__ 11 | 12 | Attendees 13 | 14 | * Hartwig Anzt (UTK, KIT) 15 | * Andrew Barker (Intel) 16 | * Romain Biessy (Codeplay) 17 | * Hugh Bird (Codeplay) 18 | * Gajanan Choudhary (Intel) 19 | * Tadej Ciglaric (Codeplay) 20 | * Terry Cojean (KIT) 21 | * Raphael Egan (Intel) 22 | * Sarah Knepper (Intel) 23 | * Piotr Luszczek (UTK) 24 | * Finlay Marno (Codeplay) 25 | * John Melonakos (Intel) 26 | * Rob Mueller-Albrecht (Intel) 27 | * Kumudha Narasimhan (Codeplay) 28 | * Helen Parks (Intel) 29 | * Spencer Patty (Intel) 30 | 31 | Agenda: 32 | 33 | * Welcoming remarks 34 | * UXL/Math SIG Mailing Lists 35 | * Updates from last meeting 36 | * Discussion on Discrete Fourier Transform APIs - Raphael Egan 37 | * Wrap-up and next steps 38 | 39 | Updates from last meeting: 40 | 41 | * Please join the Math SIG Mailing List if you haven't already: https://lists.uxlfoundation.org/g/Math-SIG. 42 | * Starting around December 1, we will switch to use the UXL mailing list and event notification process. 43 | * oneMKL RNG device API has been added to the oneAPI spec, as well as introducing value_or_pointer wrapper for BLAS USM scalar parameters. 44 | * Adding RNG device API to the oneMKL interfaces is in progress, as is enabling the sparse BLAS domain with MKLCPU backend. 
45 | 46 | Discussion on Discrete Fourier Transform APIs - Raphael Egan 47 | 48 | * There are some changes we are intending to implement soon for closed-source oneMKL, and aligning the spec with those changes is in progress. 49 | * This presentation is for us to present the proposed changes to you and have a discussion, so we can resolve any concerns in the current proposal. 50 | * We will briefly give an overview to show the motivation for this change, and then show how the upcoming changes address the issues, illustrating with a consistent example. 51 | 52 | Motivation: 53 | 54 | * The Discrete Fourier Transform is defined by a very ugly formula. If you look at this formula in detail, the exponent's sign in the twiddle factor can take two values (+1 or -1). 55 | * When playing with complex data (where x is complex itself), this has no consequence, since y is complex as well. But if x is from the real domain, you have complex y values that satisfy some conjugate relationships, so you only need to store about half the data. 56 | * You have an implicit type that is real in the forward domain, but complex in the backward domain. This is a rather particular application, but it has serious consequences. 57 | * Even if you are playing with complex transforms but have non-unit strides, what we will talk about is very relevant. 58 | * For any well-defined transform in one direction, one can design the corresponding transform in the other direction, consistent with the fundamental roundtrip identity. 59 | * If you do a forward transform followed by a backward transform, you recover your original data. 60 | 61 | Current configuration parameters and issues: 62 | 63 | * The way the user communicates to oneMKL about the data layout is by providing the strides within a multi-dimensional signal. A stride defines the distance between successive elements in a given data set.
64 | * For distances, we require users to set them by domain (forward and backward separately), regardless of the direction they compute. But for strides, we require them to be set by input/output, which can be confusing sometimes. 65 | * This was a choice in designing the original SYCL API that departed from classical C/Fortran APIs. The original motivation was to make sure every 1D real descriptor could still do forward and backward transforms even in batched cases. The strides being set differently was overlooked. There are 2D or 3D real transforms that then have issues - as illustrated next. 66 | * A 1D real transform of length X, batched M times, is pretty simple. We only need to define the distances, which the SYCL API allows us to set by domain; the default offset and strides are used. It's always well defined. We know where we're going from and where we're going to. 67 | * For a 2D real transform of size YxX, batched M times, due to the nature of the real transform and the fact that we're going from real to complex data (storing about half due to complex conjugacy), we need to explicitly set the offset and strides. Additionally, they are different in the forward and backward domain. 68 | * Currently in the SYCL API, the user is required to use two different descriptors (for performance, to avoid re-committing). Another drawback in the current design is that if you consider a descriptor for the forward direction, there are still some backward domain things that need to be set, or else it is ill-defined. 69 | * We can't make sense of strides for the backward domain with the current settings. 70 | * Any app that ends up needing to configure the strides in a non-default way, where they are different between the forward and backward domain, would face these bottlenecks. oneMKL in particular comes with an extra head-scratcher: you can't set default values for a real descriptor.
Either you need to re-commit every time you reverse direction, or use two descriptors - resulting in a larger than necessary memory footprint. 71 | 72 | On the initial slide, x and y were in/out vectors. Now they represent lengths? 73 | 74 | * m/n are small tensors - we could have used better notation. 75 | 76 | What is X/2 if X is not divisible by 2? Is it for selecting real/imaginary components individually? 77 | 78 | * X/2 is integer division. That's why we add 1. 79 | 80 | Is X the length in complex values? 81 | 82 | * In the forward domain, it would be in real values. In the backward domain, it would be in complex values. 83 | 84 | What about real-to-real? 85 | 86 | * Real-to-real isn't really supported. We chose not to present it due to additional complexity. 87 | 88 | Upcoming changes to address these issues: 89 | 90 | * Instead of requiring users to set strides by input/output, set them instead by domain. This is just like what is currently done for batch distances. 91 | * What is struck through in "red" will enter a deprecation period, where the user will get a warning and be advised to use the new APIs. We don't want to support mixing and matching of parameters; for example, configuring a descriptor using both input strides and backward_strides isn't supported. 92 | * Now it will be what you would expect for a batched in-place real-to-complex transform. 93 | 94 | * Back to the 2D batched example: we are now setting strides by domain. Strides and distances are well defined regardless of compute direction. Internally we have all the info needed to make sure the descriptor is configured for both forward and backward successfully. So if you do a round-trip transformation, you recover the original input data consistently. 95 | 96 | * We are currently working on the specification changes. The original intention was to target these new changes for the spec v1.3 cutoff. 97 | 98 | * The Math SIG agrees it is a good change.
99 | 100 | What is the status of the closed-source oneMKL APIs, for C/Fortran and SYCL? Do they already support forward/backward strides? 101 | 102 | * Classic C and Fortran APIs don't currently allow strides and distances to be set by domain. 103 | * The changes to the SYCL APIs are targeted to the oneMKL 2024.1 release. 104 | 105 | If the spec is changed, when would we implement this in the oneMKL interfaces? Would we wait for the Intel oneMKL closed source library to be updated? 106 | 107 | * The vision is to implement sometime in 2024, after it is available in closed source. The former input/output strides will enter a deprecation period for closed-source oneMKL. 108 | 109 | How long will it be in the spec but not implemented? 110 | 111 | * We don't foresee a lot of implementation changes, but are worried about users. We don't think it would be a big change. 112 | 113 | * Currently in the open source interfaces, for CPU we are creating two descriptors, one for each direction. However, one will make sense but one won't. With these changes, both are expected to make sense in the end. 114 | 115 | cuFFT and rocFFT go with input/output. There's some benefit with the commit time - only 1 kernel. Could you do one kernel and implement with forward and backward strides? Or two commits and twice the commit time? 116 | 117 | * For real transforms in cuFFT, unless I'm mistaken, there are two different descriptors: r2c and c2r. 118 | * So that means you would still have to specify direction when you create the descriptor. 119 | 120 | What if you only want to do one direction? 121 | 122 | * That was a considered alternative. But that would have required significant changes to other APIs to ensure undefined behavior goes away. It would be clearly defined at commit which direction would be used. For some other APIs, either the compute direction is required to be explicitly set beforehand, or the names of the APIs themselves are direction-specific (r2c, c2r, etc.).
123 | 124 | Is it not possible to create a single kernel? 125 | 126 | * It's likely possible, but may have performance implications. A lot of the kernels are constrained by build-time constants (JIT-time constants), including direction. 127 | * To make the kernel more general, direction would become a compute-time constant. This is not a huge change, but it's not insignificant. There is no guarantee that it would perform just as well. 128 | * It's a tradeoff between convenience for the user, performance at commit time, and overall footprint. 129 | -------------------------------------------------------------------------------- /math/presentations/2020-05-20_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-05-20_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2020-06-03_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-06-03_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2020-06-17_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-06-17_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2020-07-01_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-07-01_Slides.pdf --------------------------------------------------------------------------------
/math/presentations/2020-07-15_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-07-15_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2020-08-12_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-08-12_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2020-09-09_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-09-09_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2020-11-11_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2020-11-11_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2021-01-27_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2021-01-27_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2021-02-24_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2021-02-24_Slides.pdf 
-------------------------------------------------------------------------------- /math/presentations/2021-03-24_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2021-03-24_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2021-05-19_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2021-05-19_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2021-06-16_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2021-06-16_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2021-07-14_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2021-07-14_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2021-10-06_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2021-10-06_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2022-03-23_Slides.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2022-03-23_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2022-06-08_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2022-06-08_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2022-07-27_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2022-07-27_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2022-09-21_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2022-09-21_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2022-10-05_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2022-10-05_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2023-03-08_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-03-08_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2023-03-08_balint_soproni_onemkl.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-03-08_balint_soproni_onemkl.pdf -------------------------------------------------------------------------------- /math/presentations/2023-05-17_Finlay_Marno_onemkl.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-05-17_Finlay_Marno_onemkl.pdf -------------------------------------------------------------------------------- /math/presentations/2023-05-17_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-05-17_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2023-07-12_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-07-12_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/2023-07-12_oneMKL-PortBLAS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-07-12_oneMKL-PortBLAS.pdf -------------------------------------------------------------------------------- /math/presentations/2023-09-20_Slides-value_or_pointer.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-09-20_Slides-value_or_pointer.pdf -------------------------------------------------------------------------------- /math/presentations/2023-09-20_oneMKL-Sparse.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-09-20_oneMKL-Sparse.pdf -------------------------------------------------------------------------------- /math/presentations/2023-10-25_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/2023-10-25_Slides.pdf -------------------------------------------------------------------------------- /math/presentations/Hammond Intro DPC++ May 2020 oneMKL TAB.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/math/presentations/Hammond Intro DPC++ May 2020 oneMKL TAB.pdf -------------------------------------------------------------------------------- /math/presentations/README.rst: -------------------------------------------------------------------------------- 1 | ============================ 2 | Presentations for Math SIG 3 | ============================ 4 | -------------------------------------------------------------------------------- /organization/README.rst: -------------------------------------------------------------------------------- 1 | ================================== 2 | oneAPI Community Forum Governance 3 | ================================== 4 | 5 | The oneAPI Community Forum aims to bring individuals and 6 | organizations together to: 7 | 8 | * Define a standards-based, cross-architecture open 
specification for 9 | accelerated computing 10 | * Foster open-source implementations of the specification 11 | 12 | The oneAPI Community Forum intends to use a lightweight model 13 | of governance to facilitate straightforward collaboration on 14 | the oneAPI specification and open source implementations. 15 | 16 | oneAPI Community Forum Policies 17 | ------------------------------- 18 | 19 | The current policies are documented `on this page`__. 20 | 21 | oneAPI Community Forum Structure 22 | -------------------------------- 23 | 24 | .. image:: oneAPI-Community-Forum-Structure.png 25 | :width: 700 26 | :alt: oneAPI Community Forum Structure 27 | 28 | oneAPI Community Forum Steering Committee 29 | ----------------------------------------- 30 | 31 | The Steering Committee is led by Rod Burns from Codeplay 32 | Software and meets quarterly. 33 | 34 | The Steering Committee is responsible for leadership of the 35 | forum, including these activities: 36 | 37 | * Agreeing on and tracking annual goals for the oneAPI Community Forum 38 | * Agreeing on the formation of new Working Groups and SIGs 39 | * Reviewing votes from the Working Groups 40 | * Owning and defining the intellectual property framework for 41 | contributions 42 | * Ratifying new versions of the specification 43 | * Approving the plans from the Marketing Committee 44 | 45 | oneAPI Community Forum Special Interest Groups (SIGs) 46 | ----------------------------------------------------- 47 | 48 | The SIGs exist to facilitate technical discussions that 49 | bring positive change to the oneAPI 50 | specification and the implementations of its elements. 51 | 52 | SIG activities include the following: 53 | 54 | * Open technical discussions relevant to specific technologies and the 55 | oneAPI specification 56 | * Facilitating discussion and presentation of proposals 57 | 58 | New SIGs can be proposed by members of the community.
59 | 60 | oneAPI Community Forum Working Groups 61 | ------------------------------------- 62 | 63 | The oneAPI Community Forum Working Groups (WGs) exist to 64 | facilitate modifications to the oneAPI specification. 65 | 66 | Working Groups will be formed in early 2023. These can be 67 | proposed by members of the oneAPI Community Forum SIGs. 68 | 69 | oneAPI Community Forum Marketing Committee 70 | ------------------------------------------ 71 | 72 | The Marketing Committee meetings are open to anyone. The 73 | committee owns the definition and execution of the 74 | marketing strategy, steers the content on the website, 75 | coordinates event activities, and coordinates marketing 76 | activities across the community. 77 | 78 | The Marketing Committee is led by Alison Richards. 79 | -------------------------------------------------------------------------------- /organization/oneAPI-Community-Forum-Structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/oneAPI-tab/97da745d5865726ed358b615c3ecd7407e8e3315/organization/oneAPI-Community-Forum-Structure.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pre-commit 2 | --------------------------------------------------------------------------------
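The `requirements.txt` above pins only the `pre-commit` tool, and the repository root contains a `.pre-commit-config.yaml`, so a typical local setup for contributors might look like the following sketch (assuming a Python environment and a clone of this repository; the exact hooks run are whatever that config file defines):

```shell
# Install the repository's pinned tooling (just pre-commit here).
pip install -r requirements.txt

# Register the git hooks defined in .pre-commit-config.yaml
# so they run automatically on each commit.
pre-commit install

# Optionally run all configured hooks against the whole tree once.
pre-commit run --all-files
```

After `pre-commit install`, the configured checks run on staged files at every `git commit`, mirroring what the `checks.yaml` workflow enforces in CI.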