├── .gitignore ├── README.md ├── adm ├── 1 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ ├── 5 │ │ └── b.txt │ ├── 6 │ │ └── b.txt │ ├── 7 │ │ └── b.txt │ ├── 8 │ │ └── b.txt │ └── b.txt ├── 2 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ ├── 5 │ │ └── b.txt │ └── b.txt ├── 3 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ ├── 1 │ │ │ └── b.txt │ │ ├── 2 │ │ │ └── b.txt │ │ ├── 3 │ │ │ └── b.txt │ │ ├── 4 │ │ │ └── b.txt │ │ └── b.txt │ ├── 5 │ │ └── b.txt │ ├── 6 │ │ └── b.txt │ ├── 7 │ │ └── b.txt │ └── b.txt ├── 4 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt ├── 5 │ └── b.txt └── b.txt ├── b.txt ├── bda ├── 1 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ ├── 5 │ │ └── b.txt │ └── b.txt ├── 2 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ ├── 5 │ │ └── b.txt │ └── b.txt ├── 3 │ └── b.txt ├── 4 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ └── b.txt ├── 5 │ └── b.txt ├── 6 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt └── b.txt ├── extract-outcomes.py ├── k ├── 1 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ ├── 1 │ │ │ └── b.txt │ │ ├── 2 │ │ │ └── b.txt │ │ └── b.txt │ ├── 3 │ │ ├── 1 │ │ │ ├── 1 │ │ │ │ └── b.txt │ │ │ ├── 2 │ │ │ │ └── b.txt │ │ │ ├── 3 │ │ │ │ ├── 1 │ │ │ │ │ └── b.txt │ │ │ │ ├── 2 │ │ │ │ │ └── b.txt │ │ │ │ └── b.txt │ │ │ ├── 4 │ │ │ │ ├── 1 │ │ │ │ │ └── b.txt │ │ │ │ ├── 2 │ │ │ │ │ └── b.txt │ │ │ │ ├── 3 │ │ │ │ │ └── b.txt │ │ │ │ └── b.txt │ │ │ ├── 5 │ │ │ │ ├── 1 │ │ │ │ │ └── b.txt │ │ │ │ ├── 2 │ │ │ │ │ └── b.txt │ │ │ │ ├── 3 │ │ │ │ │ └── b.txt │ │ │ │ └── b.txt │ │ │ ├── 6 │ │ │ │ └── b.txt │ │ │ └── b.txt │ │ ├── 2 │ │ │ └── b.txt │ │ ├── 3 │ │ │ └── b.txt │ │ ├── 4 │ │ │ └── b.txt │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ └── b.txt ├── 2 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt ├── 3 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ └── b.txt ├── 4 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt ├── 5 │ └── b.txt ├── 6 │ └── b.txt └── b.txt ├── leftover.txt ├── pe ├── 1 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ └── b.txt ├── 2 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ ├── 1 │ │ │ └── b.txt │ │ ├── 2 │ │ │ └── b.txt │ │ ├── 3 │ │ │ └── b.txt │ │ ├── 4 │ │ │ └── b.txt │ │ ├── 5 │ │ │ └── b.txt │ │ ├── 6 │ │ │ └── b.txt │ │ ├── 7 │ │ │ └── b.txt │ │ └── b.txt │ └── b.txt ├── 3 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt ├── 4 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt ├── 5 │ └── b.txt └── b.txt ├── sd ├── 1 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ ├── 1 │ │ │ └── b.txt │ │ ├── 2 │ │ │ └── b.txt │ │ ├── 3 │ │ │ └── b.txt │ │ ├── 4 │ │ │ └── b.txt │ │ ├── 5 │ │ │ ├── 1 │ │ │ │ ├── 1 │ │ │ │ │ └── b.txt │ │ │ │ ├── 2 │ │ │ │ │ └── b.txt │ │ │ │ ├── 3 │ │ │ │ │ └── b.txt │ │ │ │ ├── 4 │ │ │ │ │ └── b.txt │ │ │ │ ├── 5 │ │ │ │ │ └── b.txt │ │ │ │ ├── 6 │ │ │ │ │ └── b.txt │ │ │ │ ├── 7 │ │ │ │ │ └── b.txt │ │ │ │ ├── 8 │ │ │ │ │ └── b.txt │ │ │ │ ├── 9 │ │ │ │ │ └── b.txt │ │ │ │ ├── 10 │ │ │ │ │ └── b.txt │ │ │ │ ├── 11 │ │ │ │ │ └── b.txt │ │ │ │ ├── 12 │ │ │ │ │ └── b.txt │ │ │ │ ├── 13 │ │ │ │ │ └── b.txt │ │ │ │ └── b.txt │ │ │ ├── 2 │ │ │ │ └── b.txt │ │ │ └── b.txt │ │ 
├── 6 │ │ │ └── b.txt │ │ ├── 7 │ │ │ └── b.txt │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt ├── 2 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ ├── 5 │ │ └── b.txt │ ├── 6 │ │ └── b.txt │ ├── 7 │ │ └── b.txt │ ├── 8 │ │ └── b.txt │ └── b.txt ├── 3 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ └── b.txt ├── 4 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt ├── 5 │ └── b.txt ├── 6 │ └── b.txt ├── 7 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ ├── 4 │ │ └── b.txt │ └── b.txt ├── 8 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt └── b.txt ├── skill-tree.mm ├── synchronize-mm.py ├── update-js-from-md.py └── use ├── 1 ├── 1 │ └── b.txt ├── 2 │ └── b.txt ├── 3 │ └── b.txt ├── 4 │ └── b.txt ├── 5 │ ├── 1 │ │ └── b.txt │ ├── 2 │ │ └── b.txt │ ├── 3 │ │ └── b.txt │ └── b.txt └── b.txt ├── 2 ├── 1 │ └── b.txt ├── 2 │ └── b.txt └── b.txt ├── 3 └── b.txt ├── 4 └── b.txt ├── 5 └── b.txt ├── 6 ├── 1 │ └── b.txt ├── 2 │ └── b.txt ├── 3 │ └── b.txt ├── 4 │ └── b.txt └── b.txt ├── 7 └── b.txt └── b.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.pdf 2 | tmp 3 | *.json 4 | GWDG 5 | GWDG.pub 6 | -------------------------------------------------------------------------------- /adm/1/1/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.1 Architecture 2 | 3 | Learn about Architectures and specific hardware used in HPC systems. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understanding Motherboards 10 | * Understanding RAM 11 | * Understanding PCIe 12 | * Understanding DMMA and RDMA 13 | * Understanding Firmware 14 | * Understanding virtualisation and remembering Hypervisors, Emulation, virtualised hardware, and hardware passthrough 15 | * Understanding Temperature and cooling 16 | 17 | 18 | -------------------------------------------------------------------------------- /adm/1/2/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.2 Networking 2 | 3 | Learn about networking and protocols such as IPv4, cabeling, and others. 4 | 5 | ## Requirements 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | * Understand IPv4 and IPv6 11 | * Understand Cabling 12 | * Understand switches and cards 13 | * Understand DHCP 14 | * Understand Fabrics 15 | 16 | -------------------------------------------------------------------------------- /adm/1/3/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.3 Bootprocess 2 | 3 | Learn about how a computer boots and what is required for a successful start 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand the BIOS 10 | * Understand Netboot procedure 11 | * Understand the bootloader 12 | * Understand the Init system/systemd 13 | 14 | -------------------------------------------------------------------------------- /adm/1/4/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.4 Monitoring 2 | 3 | Learning about monitoring a system and what data is relevant for which decision. 
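The outcomes below cover CLI monitoring and the files stored under /proc. As a minimal sketch of the kind of data available there (Linux only; field names follow the usual /proc/loadavg and /proc/meminfo layouts, so verify them on the target distribution), a node's load and available memory can be sampled directly:

```python
# Minimal sketch: sample basic node health from /proc (Linux only).
def read_loadavg(path="/proc/loadavg"):
    with open(path) as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def read_meminfo(path="/proc/meminfo"):
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key.strip()] = int(value.split()[0])   # values are reported in kB
    return info

if __name__ == "__main__":
    load1, load5, load15 = read_loadavg()
    mem = read_meminfo()                                # MemAvailable needs a reasonably recent kernel
    print(f"load averages: {load1:.2f} {load5:.2f} {load15:.2f}")
    print(f"memory available: {100 * mem['MemAvailable'] / mem['MemTotal']:.1f}%")
```

The same figures are what CLI tools such as top or free present in a friendlier form.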
4 | 5 | ## Requirements 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | * Understanding CLI monitoring 11 | * Understanding files stored under /proc 12 | * Understanding Architectures for Organized monitoring tools 13 | * Understanding flapping 14 | 15 | -------------------------------------------------------------------------------- /adm/1/5/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.5 Cluster Management 2 | 3 | Learning about programs and systems for cluster management. 4 | 5 | ## Requirements 6 | 7 | * USE2.4 Remote access 8 | 9 | ## Learning Outcomes 10 | 11 | * Employ SSH connections for administrator access 12 | * Understanding System provisioning 13 | * Understanding user management 14 | 15 | -------------------------------------------------------------------------------- /adm/1/6/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.6 Resource Management 2 | 3 | Learn about managing and limiting the resources a user has available or can access. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understanding User limits 10 | * Understanding Quota 11 | * Understanding job managers 12 | 13 | -------------------------------------------------------------------------------- /adm/1/7/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.7 Security 2 | 3 | Learn about security and implications for administrating a cluster. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Remember best practices 10 | * Understand the Unix Permission Model 11 | * Understand how to verify software authenticity and integrity 12 | * Understand Network segmentationa nd firewalls 13 | * Understand SSH-keys and attack monitoring 14 | * Understand 2FA procesures 15 | * Understand Advanced fencing of sensitive data (GDPR) 16 | 17 | -------------------------------------------------------------------------------- /adm/1/8/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1.8 Electricity and Signals 2 | 3 | Learn about electricity and signals. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Remember AC and DC electricity 10 | * Remember Electrical safety 11 | * Understand how to model power consumption 12 | * Understand EMI, Ground loops, isolation and shielding 13 | * Understand Communication 14 | * Understand cables and remembering 15 | * Single ended 16 | * differential 17 | * twisted pair 18 | * coax/twinax 19 | * fiber 20 | * balancing 21 | * bandwidth 22 | * current limits 23 | * Understanding rechargeable batteries and remebering 24 | * Capacity 25 | * currentl limits 26 | * Charging 27 | * safety 28 | * disposal 29 | 30 | -------------------------------------------------------------------------------- /adm/1/b.txt: -------------------------------------------------------------------------------- 1 | # ADM1 Theoretic Principles 2 | 3 | These principles are the learning foundation of the administration work for an HPC Cluster. 4 | Many primitives and fundamentals are described and recalled such as electricity and signals, networking and infrastructure. 5 | Additionally many subjects like security, resource management, monitoring, and software stacks are described. 
6 | 7 | ## Requirements 8 | 9 | * Using knowledge (USE2) 10 | 11 | ## Learning Outcomes 12 | 13 | * Understand the hardware components of a computer and also investigate virtualisation techniques 14 | * Describe the components required for networking including cables, protocols, and fabrics 15 | * Review the boot process and investigate BIOS options, netboot procedures as well as bootloaders 16 | * Summarize monitoring options including CLI monitoring, files under /proc, and different architectures 17 | * Generalize cluster management options including SSH for administrator access, user management, and system provisioning 18 | * Understand quota for files and compute, user limits and job managers 19 | * Understand the Unix Permission model, how to verify authenticity, 2FA procedures and advanced fencing 20 | * Practice how to model power consumption and understand fundamental concepts such as cables and batteries 21 | 22 | ## Subskills 23 | 24 | * [[skill-tree:adm:1:1:b]] 25 | * [[skill-tree:adm:1:2:b]] 26 | * [[skill-tree:adm:1:3:b]] 27 | * [[skill-tree:adm:1:4:b]] 28 | * [[skill-tree:adm:1:5:b]] 29 | * [[skill-tree:adm:1:6:b]] 30 | * [[skill-tree:adm:1:7:b]] 31 | * [[skill-tree:adm:1:8:b]] 32 | -------------------------------------------------------------------------------- /adm/2/1/b.txt: -------------------------------------------------------------------------------- 1 | # ADM2.1 Network 2 | 3 | Learn about Networking and specific protocols, such as Routing, nftables, NTP, DNS, Failover, and many more. 4 | 5 | ## Requirements 6 | 7 | * Networking (ADM1.2) 8 | 9 | ## Learning Outcomes 10 | 11 | * Understand Linux Network Configuration 12 | * Understand Routing 13 | * Understand nftables 14 | * Understand NTP 15 | * Understand DNS and DHCP 16 | * Understand Failover 17 | * Understand Connecting Physical hardware 18 | * Understand TUN/TAP Networks 19 | 20 | -------------------------------------------------------------------------------- /adm/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # ADM2.2 Storage 2 | 3 | Learn about types and of storage and what they are used for. 
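Several of the NFS-related outcomes below (relevant mount options, examining deployments) can be explored from a client by reading /proc/mounts. The sketch assumes a Linux client and relies on the file's standard fstab-like field order (device, mount point, filesystem type, options):

```python
# Sketch: list NFS mounts and their options on a Linux client.
def nfs_mounts(path="/proc/mounts"):
    found = []
    with open(path) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype.startswith("nfs"):              # matches nfs and nfs4
                found.append((device, mountpoint, fstype, options.split(",")))
    return found

if __name__ == "__main__":
    for device, mountpoint, fstype, options in nfs_mounts():
        print(f"{device} on {mountpoint} ({fstype})")
        print("  options:", ", ".join(options))
```

The options column shows the protocol version, timeouts, and caching behaviour that would otherwise be specified in /etc/fstab.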
4 | 5 | ## Requirements 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | * Understand file systems and remembering 11 | * Local file systems 12 | * file systems over network 13 | * quota 14 | * snapshots 15 | * Describe common usage of NFS in a cluster 16 | * Providing a home directory to users 17 | * Share a software repository 18 | * Describe the key concepts of NFS 19 | * Software/hardware components 20 | * Security concept of NFSv3 21 | * Performance bottlenecks and caching behavior 22 | * List relevant mount options: protocol, timelimit, nfs-version 23 | * Deploy NFS infrastructure on a server and a client 24 | * Installing required software packages and services on the server 25 | * Specification of exports 26 | * Installing required software packages and the kernel module on the client 27 | * Updating fstab to mount the remote file system 28 | * Examine NFS deployments 29 | 30 | * Understand RAID 31 | * Understand Backups 32 | * Understand Data transfer 33 | * Understand Object Stores 34 | * Understand direct disk access over network and remembering 35 | * iSCSI 36 | * VMe over fabric 37 | 38 | -------------------------------------------------------------------------------- /adm/2/3/b.txt: -------------------------------------------------------------------------------- 1 | # ADM2.3 Virtualization 2 | 3 | Learning about virtualisation and how to employ it for an HPC cluster. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand networking for virtualisation 10 | * Understand image management 11 | * Understand hardware passthrough 12 | * Understand QEMU 13 | * Understand VMWare 14 | * Understand libvirt 15 | * Understand OpenStack 16 | 17 | -------------------------------------------------------------------------------- /adm/2/4/b.txt: -------------------------------------------------------------------------------- 1 | # ADM2.4 Power 2 | 3 | Learn about power and how to distribute it in an HPC cluster. 4 | 5 | ## Requirements 6 | 7 | * Electricity and Signals (ADM2.10) 8 | 9 | ## Learning Outcomes 10 | 11 | * Understand checking cables 12 | * Understand Breakers and disconnects 13 | * Understand Distributing Load 14 | * Understand Computer/switch/etc. power supplies 15 | * Understand UPS 16 | 17 | -------------------------------------------------------------------------------- /adm/2/5/b.txt: -------------------------------------------------------------------------------- 1 | # ADM2.5 Cooling Systems 2 | 3 | Learn about the cooling system and requirements for an HPC system. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand Air cooling 10 | * Understand water cooling 11 | * Understand Oil Cooling 12 | * Understand heat exchange to external 13 | 14 | -------------------------------------------------------------------------------- /adm/2/b.txt: -------------------------------------------------------------------------------- 1 | # ADM2 Cluster Infrastructure 2 | 3 | Learn about what infrastructure is required to run an HPC cluster. 4 | This includes networking, power, and cooling requirements as well as types of storage and virtualisation. 
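Power and cooling, covered by the subskills above, are easiest to reason about with a quick estimate. The numbers below (node count, per-node draw, PUE) are made-up inputs for illustration; essentially all electrical power ends up as heat the cooling system must remove:

```python
# Back-of-envelope rack power and cooling estimate (hypothetical numbers).
nodes_per_rack = 36
watts_per_node = 650        # assumed average draw under load
pue = 1.3                   # assumed power usage effectiveness of the site

it_load_kw = nodes_per_rack * watts_per_node / 1000
facility_kw = it_load_kw * pue
heat_btu_per_hour = it_load_kw * 1000 * 3.412   # 1 W is about 3.412 BTU/h

print(f"IT load per rack: {it_load_kw:.1f} kW")
print(f"Facility load (PUE {pue}): {facility_kw:.1f} kW")
print(f"Heat to remove: {heat_btu_per_hour:,.0f} BTU/h")
```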
5 | 6 | ## Requirements 7 | 8 | ## Learning Outcomes 9 | 10 | * Review Linux network configuration including routing, nftables, ntp, DNS and DHPC, failover, and TUN/TAP Networks 11 | * Describe usage of NFS in a cluster and key concepts including software/hardware components, security, and how to deploy NFS 12 | * Understand RAID including Backups, data transfers, and object stores 13 | * Understand virtualisation and investigate image management, hardware passthrough, and software such as OpenStack, VMWar, QEMU, and libvirt 14 | * Review cables, breakers and disconnects, distributing loads, UPS and power supplies 15 | * Summarize cooling options such as air, water, oil and investigate heat exchangers 16 | 17 | 18 | ## Subskills 19 | 20 | * [[skill-tree:adm:2:1:b]] 21 | * [[skill-tree:adm:2:2:b]] 22 | * [[skill-tree:adm:2:3:b]] 23 | * [[skill-tree:adm:2:4:b]] 24 | * [[skill-tree:adm:2:5:b]] 25 | -------------------------------------------------------------------------------- /adm/3/1/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.1 Management Tools 2 | 3 | Learn about specific cluster management tools. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand xCAT 10 | * Understand Clustore 11 | 12 | -------------------------------------------------------------------------------- /adm/3/2/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.2 Resource Management 2 | 3 | Learn about resource management software. 4 | 5 | ## Requirements 6 | 7 | * [[skill-tree:use:2:b]] 8 | 9 | ## Learning Outcomes 10 | 11 | * Understand SLURM 12 | * Understand SGE 13 | * Understand HTConda 14 | 15 | -------------------------------------------------------------------------------- /adm/3/3/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.3 Parallel Shell 2 | 3 | Learn about parallel shell programs. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand Clustershell 10 | 11 | -------------------------------------------------------------------------------- /adm/3/4/1/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.4.1 Stateful Image 2 | 3 | An HPC cluster consists of several central components such as storage, workload manager, subnet manager, etc., which sometimes run on dedicated servers. These servers must be deployed from the central management system. Usually these servers have local hard disks, so a stateful installation of these servers is necessary. 4 | 5 | Students should learn how to create stateful node images for HPC clusters. They should be able to use provisioning software to create and configure such images. They should know how to customize stateful node images. 
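Provisioning systems generally render per-node configuration from a node list and a template, which is what the outcomes below describe. The following is a simplified, hypothetical illustration of that idea; the node names, subnet, and ifcfg-style fragment are invented for the example and are not tied to any particular tool:

```python
# Hypothetical sketch: derive a static address per node and render an
# ifcfg-style network fragment. Real provisioning tools work from their
# own node databases and templates.
import ipaddress

TEMPLATE = """DEVICE=eth0
BOOTPROTO=static
IPADDR={ip}
NETMASK={netmask}
ONBOOT=yes
"""

def render_node_configs(prefix="node", count=4, subnet="10.1.0.0/24", offset=10):
    net = ipaddress.ip_network(subnet)
    hosts = list(net.hosts())
    for i in range(count):
        yield f"{prefix}{i + 1:02d}", TEMPLATE.format(ip=hosts[offset + i], netmask=net.netmask)

if __name__ == "__main__":
    for name, config in render_node_configs():
        print(f"--- {name} ---")
        print(config)
```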
6 | 7 | ## Requirements 8 | 9 | ## Learning Outcomes 10 | 11 | * Able to prepare the head node to deploy stateful node images 12 | * Customize stateful node images with custom partition layouts and software packages 13 | * Setup network configuration of stateful nodes interfaces (Ethernet and High-Speed) 14 | * Configure and start required services on stateful nodes 15 | * Able to deploy stateful images to cluster nodes 16 | 17 | ## Maintainer 18 | 19 | * Markus Hilger 20 | * Peter Grossöhme 21 | * HPC Engineers @ Megware 22 | -------------------------------------------------------------------------------- /adm/3/4/2/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.4.2 Stateless Image 2 | 3 | To minimise costs and downtime, many HPC compute nodes do not have local disks, but only keep the operating system in RAM. This type of deployment requires special tools to create an image that can be loaded into the RAM of a compute node. 4 | 5 | Students should learn how to create stateless node images for HPC clusters. They should be able to use provisioning software to create and configure such images. They should know how to customize stateless node images. 6 | 7 | 8 | ## Requirements 9 | 10 | ## Learning Outcomes 11 | 12 | * Able to prepare the head node to deploy stateless node images 13 | * Build bootable stateless node images on the head node using chroot or containers 14 | * Setup network configuration of stateless nodes interfaces (Ethernet and High-Speed) 15 | * Configure and start required services on stateless nodes 16 | * Able to deploy stateless images to cluster nodes 17 | 18 | ## Maintainer 19 | 20 | * Markus Hilger 21 | * Peter Grossöhme 22 | * HPC Engineers @ Megware 23 | -------------------------------------------------------------------------------- /adm/3/4/3/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.4.3 Manage Cluster Nodes 2 | 3 | During the operation of an HPC cluster, it is necessary to manually execute commands on several nodes or make temporary changes to the configuration for debugging or testing purposes. In case of problems, any logs or measurements must be collected. Furthermore, regular software updates are necessary. 4 | 5 | Students should learn basics of HPC cluster administration. They should know how to execute commands on multiple nodes, collect logs, gather debug information and update operating system software components. 6 | 7 | ## Requirements 8 | 9 | ## Learning Outcomes 10 | 11 | * Execute a commands on multiple nodes in parallel using the parallel shell 12 | * Able to update software packages on nodes 13 | * Able to gather logs and debug information from nodes and node service processors 14 | * Control nodes using the node service processor (power on/off, set boot target) 15 | * Able to use the Serial-over-LAN connection for debugging purposes 16 | 17 | ## Maintainer 18 | 19 | * Markus Hilger 20 | * Peter Grossöhme 21 | * HPC Engineers @ Megware 22 | -------------------------------------------------------------------------------- /adm/3/4/4/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.4.4 Deployment Systems 2 | 3 | Deployment systems allow an admin to boot nodes from network on a common image. 4 | This also allows the admin to be able to quickly change the boot parameters and even the full os of any computer in the network. 
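Whichever deployment system is used, day-to-day node management (ADM3.4.3 above) often means running one command on many nodes at once. Dedicated parallel shells such as ClusterShell or pdsh are the usual answer; purely to illustrate the underlying idea, the sketch below fans an ssh command out with a thread pool (host names are placeholders and passwordless ssh access is assumed):

```python
# Illustration only: run one command on several nodes concurrently over ssh.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["node01", "node02", "node03"]     # placeholder host names

def run_on(node, command="uptime"):
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, command],
        capture_output=True, text=True, timeout=30,
    )
    return node, result.returncode, result.stdout.strip()

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        for node, rc, out in pool.map(run_on, NODES):
            print(f"{node} [rc={rc}]: {out}")
```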
5 | 6 | ## Requirements 7 | 8 | ## Learning Outcomes 9 | 10 | * Understand Warewulf 11 | * Understand Bright 12 | * Understand Ansible 13 | 14 | -------------------------------------------------------------------------------- /adm/3/4/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.4 System Provisioning 2 | 3 | Learn about system provisioning and deployment. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand Stateful images 10 | * Understand Stateless images 11 | * Understand Manage cluster nodes 12 | * Understand deployment software such as Warewulf, Bright, and Ansible 13 | 14 | ## Subskills 15 | 16 | * [[skill-tree:adm:3:4:1:b]] 17 | * [[skill-tree:adm:3:4:2:b]] 18 | * [[skill-tree:adm:3:4:3:b]] 19 | * [[skill-tree:adm:3:4:4:b]] 20 | -------------------------------------------------------------------------------- /adm/3/5/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.5 User Management 2 | 3 | Learn about user management software. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand User management software 10 | 11 | -------------------------------------------------------------------------------- /adm/3/6/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.6 Monitoring 2 | 3 | Learn about specific monitoring software. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand Prometheus 10 | * Understand Telegraf 11 | * Understand Grafana 12 | * Understand Icinga 13 | 14 | -------------------------------------------------------------------------------- /adm/3/7/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3.7 IPMI 2 | 3 | Learn about the IPMI system for cluster management 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand the IMPI system 10 | 11 | 12 | -------------------------------------------------------------------------------- /adm/3/b.txt: -------------------------------------------------------------------------------- 1 | # ADM3 Cluster Management 2 | 3 | Learn about managing an HPC cluster and what software can be used for aspects such as resource management, system provisioning, and monitoring. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand management tools such as xCAT and Clustore 10 | * Understand workload manager such as SLURM, SGE, or HTConda 11 | * Understand parallel shell programs such as Clustershell 12 | * Understand Stateful and Stateless images as well as how to manage cluster nodes and deployment software 13 | * Understand User management software 14 | * Understand monitoring option such as Prometheus, Telegraf, Grafana, Icinga 15 | * Understand the IMPI system 16 | 17 | 18 | ## Subskills 19 | 20 | * [[skill-tree:adm:3:1:b]] 21 | * [[skill-tree:adm:3:2:b]] 22 | * [[skill-tree:adm:3:3:b]] 23 | * [[skill-tree:adm:3:4:b]] 24 | * [[skill-tree:adm:3:5:b]] 25 | * [[skill-tree:adm:3:6:b]] 26 | * [[skill-tree:adm:3:7:b]] 27 | -------------------------------------------------------------------------------- /adm/4/1/b.txt: -------------------------------------------------------------------------------- 1 | # ADM4.1 Setup 2 | 3 | Lear how to set up a software stack with multiple dependencies. 
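A recurring chore when building a stack from source is deriving a valid install order from the dependency graph. The sketch below shows the core idea with a topological sort over a small, made-up dependency map (the package names and their dependencies are purely illustrative):

```python
# Illustrative only: compute an install order from a made-up dependency graph.
from graphlib import TopologicalSorter   # standard library, Python 3.9+

deps = {                                  # package -> packages it depends on
    "openmpi":  {"hwloc", "ucx"},
    "ucx":      {"hwloc"},
    "hwloc":    set(),
    "hdf5":     {"openmpi"},
    "netcdf-c": {"hdf5"},
}

order = list(TopologicalSorter(deps).static_order())
print("install order:", " -> ".join(order))
```

System package managers, SPACK, and module hierarchies solve the same ordering problem at much larger scale.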
4 | 5 | ## Requirements 6 | 7 | * [[skill-tree:use:1:5:1:b]] 8 | * [[skill-tree:use:1:5:2:b]] 9 | 10 | ## Learning Outcomes 11 | 12 | * Understand system package manager and remembering 13 | * repository packages 14 | * mirroring and caching 15 | * system upgrades 16 | * rollback 17 | * Understand building from source 18 | * Understand module system and remembering 19 | * dependencies and conflicts 20 | * environmental modules 21 | * Lmod 22 | * Understand SysV Init Scripts 23 | * Understand Systemd Units for services 24 | 25 | -------------------------------------------------------------------------------- /adm/4/2/b.txt: -------------------------------------------------------------------------------- 1 | # ADM4.2 Containers 2 | 3 | Learn about containers and how they can be used as a software stack. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand container 10 | * Understand Software containerisation 11 | * Understand container deployment 12 | 13 | -------------------------------------------------------------------------------- /adm/4/3/b.txt: -------------------------------------------------------------------------------- 1 | # ADM4.3 User environment 2 | 3 | Learn about user environments and how to set the up properly. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand Shell initialization 10 | * Understand HOME Directory installs 11 | * Understand SPACK and reme,ber 12 | * concepts 13 | * system configuration 14 | * compilation 15 | 16 | -------------------------------------------------------------------------------- /adm/4/b.txt: -------------------------------------------------------------------------------- 1 | # ADM4 Software Stack 2 | 3 | Learn about software stack and how to set them up. 4 | Differentiate between system and user based software stack. 5 | Also, learn about using containers for software stacks. 6 | 7 | ## Requirements 8 | 9 | ## Learning Outcomes 10 | 11 | * Review how to set up a software stack and remember system package managers, building from source, and a module system 12 | * Understand how to use containers and their deployment, as well as software containerisation 13 | * Recognize how user can manage their own software by using HOME Directory installs, SPACK, and similar environments 14 | 15 | ## Subskills 16 | 17 | * [[skill-tree:adm:4:1:b]] 18 | * [[skill-tree:adm:4:2:b]] 19 | * [[skill-tree:adm:4:3:b]] 20 | 21 | -------------------------------------------------------------------------------- /adm/5/b.txt: -------------------------------------------------------------------------------- 1 | # ADM5 Workflow Management Systems - Snakemake 2 | 3 | Snakemake offers to setup a system-wide configuration file, selecting the 4 | default excecutor (Snakemake lingo for the execution of job, e.g. for 5 | cloud systems or HPC batch sytems) or defining remote file paths (e.g. to 6 | stage-in and out data "automagically"). 7 | 8 | ## Requirements: 9 | 10 | * [[skill-tree:use:6:3:b]] 11 | 12 | ## Learning Outcomes 13 | 14 | * Identify Snakemake's capabilities for configuring its behaviour across a cluster environment. 15 | * Configure and set-up a cluster-wide YAML-based configuration file for Snakemake that will not interfere with other systems (e.g. CWL or Nextflow) and helps avoiding performance issues such as I/O contention. 
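As a sketch of the cluster-wide YAML configuration mentioned above: the fragment below assumes a recent Snakemake release with the Slurm executor plugin installed, and both the profile location and the option names vary between versions, so treat it as illustrative rather than a drop-in file.

```yaml
# Hypothetical system-wide Snakemake profile (for example /etc/xdg/snakemake/config.yaml).
# Option names differ across Snakemake versions; check the docs of the deployed release.
executor: slurm          # assumes the slurm executor plugin is installed
jobs: 100                # cap on concurrently submitted jobs
latency-wait: 60         # tolerate shared-filesystem latency for output files
restart-times: 1         # retry failed jobs once
default-resources:
  - mem_mb=2000
  - runtime=60
```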
16 | -------------------------------------------------------------------------------- /adm/b.txt: -------------------------------------------------------------------------------- 1 | # ADM Administration 2 | 3 | Learn about administrating an HPC cluster and remember many basics about servers and other equipment. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Review theoretic principles such as hardware components, networking, boot process, monitoring, cluster management, and user management. 10 | * Summarize Linux network configuration, and investigate NFS, virtualisation and cooling options 11 | * Describe cluster management with specific tools and investigate workload managers, different types of images as well as deployment, monitoring and IPMI 12 | * Classify software stacks and review installing of software, containers and software containerisation, as well as user installs 13 | * Identify Snakemake's capabilities for configuring its behaviour across a cluster environment and configure a cluster-wide configuration 14 | 15 | 16 | ## Subskills 17 | 18 | * [[skill-tree:adm:1:b]] 19 | * [[skill-tree:adm:2:b]] 20 | * [[skill-tree:adm:3:b]] 21 | * [[skill-tree:adm:4:b]] 22 | * [[skill-tree:adm:5:b]] 23 | -------------------------------------------------------------------------------- /b.txt: -------------------------------------------------------------------------------- 1 | # SkillTree 2 | 3 | ## Subskills 4 | * [[skill-tree:adm:b]] 5 | * [[skill-tree:bda:b]] 6 | * [[skill-tree:k:b]] 7 | * [[skill-tree:pe:b]] 8 | * [[skill-tree:sd:b]] 9 | * [[skill-tree:use:b]] 10 | -------------------------------------------------------------------------------- /bda/1/5/b.txt: -------------------------------------------------------------------------------- 1 | # BDA1.5 Ethical/Privacy 2 | 3 | Ethical considerations and privacy concerns are paramount in the field of big data analytics, where the handling and analysis of vast amounts of data can have profound implications on individuals, society, and organizations. This module delves into the ethical principles, privacy regulations, and best practices for responsible data usage in big data analytics. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the ethical implications** of big data analytics on individuals, communities, and society. 10 | * **Examine privacy regulations** such as GDPR, CCPA, and HIPAA and their implications for big data analytics. 11 | * **Analyze ethical dilemmas** related to data collection, storage, analysis, and dissemination in big data projects. 12 | * **Discuss the ethical responsibilities** of data scientists, analysts, and organizations in ensuring fairness, transparency, and accountability. 13 | * **Explore the role of bias** in data collection, algorithmic decision-making, and model outcomes, and strategies for bias mitigation. 14 | * **Understand the principles of data anonymization** and pseudonymization for protecting individual privacy while preserving data utility. 15 | * **Analyze the impact of data breaches** and security vulnerabilities on individual privacy, organizational reputation, and regulatory compliance. 16 | * **Discuss ethical considerations** in data sharing and collaboration, including informed consent, data ownership, and intellectual property rights. 17 | * **Explore the ethical use of predictive analytics** in sensitive domains such as healthcare, criminal justice, and finance, and potential biases and risks. 
18 | * **Examine the ethical implications** of surveillance technologies, facial recognition systems, and biometric data collection in public and private settings. 19 | * **Discuss the ethical challenges** of algorithmic transparency, explainability, and accountability in automated decision-making systems. 20 | * **Analyze the ethical implications** of data-driven automation and artificial intelligence in reshaping labor markets, job displacement, and socioeconomic inequalities. 21 | * **Explore ethical considerations** in the use of data-driven technologies for political campaigning, voter targeting, and manipulation of public opinion. 22 | * **Discuss the ethical dimensions** of data-driven approaches to environmental monitoring, climate change mitigation, and sustainable development. 23 | * **Analyze the ethical implications** of data-driven approaches to public health surveillance, disease monitoring, and pandemic response, including issues of privacy and consent. 24 | 25 | AI generated content 26 | -------------------------------------------------------------------------------- /bda/2/1/b.txt: -------------------------------------------------------------------------------- 1 | # BDA2.1 Ophidia 2 | 3 | Ophidia is an advanced tool designed for managing and processing big data in scientific and HPC environments. This module delves into Ophidia's capabilities for large-scale analytics, particularly focusing on its support for handling and analyzing massive volumes of scientific data efficiently. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand Ophidia’s architecture** and how it integrates with existing HPC environments to support scalable data analytics. 10 | * **Learn to use Ophidia's array-based data model** for efficient data storage, retrieval, and processing. 11 | * **Implement data operations** using Ophidia's functional interface, which includes aggregation, selection, and array manipulation. 12 | * **Develop workflows using Ophidia’s workflow management tools** to automate and optimize complex data analysis tasks. 13 | * **Utilize Ophidia's large-scale analytical operators** to perform high-level, scientific data analyses and transformations. 14 | * **Explore case studies** that demonstrate the application of Ophidia in real-world scientific research, particularly in climate and environmental data analysis. 15 | * **Integrate visualization tools** with Ophidia to create insightful graphical representations of large-scale data sets. 16 | * **Assess the performance benefits** of using Ophidia in big data projects, comparing it with other analytics tools in terms of speed and scalability. 17 | * **Navigate data privacy and security considerations** in the context of using Ophidia for sensitive or proprietary scientific data. 18 | * **Participate in hands-on labs** to gain practical experience with Ophidia, focusing on setup, configuration, and execution of typical data workflows. 19 | * **Critically analyze** the suitability of Ophidia for various types of data-intensive applications in scientific research. 20 | * **Master the use of Ophidia’s built-in functions** for complex statistical analysis. 21 | * **Implement advanced data reduction techniques** to manage large datasets effectively. 22 | * **Develop custom operators** for tailored scientific analysis. 23 | * **Explore parallel data processing capabilities** to improve computational efficiency. 24 | * **Apply multidimensional data analysis** across varied scientific domains. 
25 | * **Understand metadata management** to optimize data discovery and retrieval. 26 | * **Investigate interoperability** with other big data tools and formats. 27 | * **Learn about data provenance** to enhance reproducibility. 28 | * **Engage in collaborative projects** to tackle complex challenges using Ophidia. 29 | * **Explore real-time data processing capabilities** for streaming data scenarios. 30 | 31 | AI generated content 32 | -------------------------------------------------------------------------------- /bda/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # BDA2.2 Jupyter Notebooks 2 | 3 | Jupyter Notebooks provide an interactive computing environment that facilitates code writing, visualization, and data analysis. This module explores the use of Jupyter Notebooks in the context of HPC and big data analytics, focusing on their capabilities to perform real-time data analysis and visualization. 4 | 5 | ## Requirements 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | * **Understand the architecture** of Jupyter Notebooks and their integration within HPC environments to support scalable data analysis. 11 | * **Create and manage Jupyter Notebook instances** in local and remote settings, leveraging HPC resources. 12 | * **Develop and execute code** efficiently in Notebooks using languages like Python, R, and Julia. 13 | * **Utilize interactive widgets** to create dynamic, user-interactive data visualizations and tools within Notebooks. 14 | * **Integrate Jupyter Notebooks with big data tools** like Apache Spark to handle large datasets effectively. 15 | * **Employ advanced visualization libraries** within Notebooks to represent complex datasets visually. 16 | * **Collaborate effectively using Jupyter Notebooks**, sharing notebooks for reproducible research and peer review. 17 | * **Optimize Notebook performance** in big data applications, including memory management and parallel computing techniques. 18 | * **Implement best practices for security** in Jupyter Notebooks, securing sensitive data and computational processes. 19 | * **Explore extensions and plugins** that enhance the functionality of Jupyter Notebooks, such as dashboarding tools and version control integration. 20 | * **Analyze case studies** where Jupyter Notebooks have been effectively used in big data projects across various industries. 21 | * **Participate in hands-on workshops** to gain practical experience with complex analytical tasks using Jupyter Notebooks. 22 | * **Navigate ethical and legal considerations** in using Jupyter Notebooks, especially regarding data privacy and usage rights in collaborative environments. 23 | * **Stay updated with the latest developments** in the Jupyter ecosystem, including new tools and features that can enhance data analysis workflows. 24 | 25 | AI generated content 26 | -------------------------------------------------------------------------------- /bda/2/3/b.txt: -------------------------------------------------------------------------------- 1 | # BDA2.3 Cloud 2 | 3 | Cloud computing has become integral to managing and processing big data, providing scalable resources and environments that facilitate complex data analyses. This module explores the integration of cloud computing with HPC systems, focusing on the tools and techniques that enhance big data analytics capabilities. 
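Staging datasets into object storage is one of the first practical steps behind the migration outcomes listed below. The sketch uses the boto3 library against an S3-compatible endpoint; the endpoint URL, bucket name, and file paths are placeholders, and credentials are assumed to come from the usual environment variables or configuration files:

```python
# Sketch: push a local dataset to an S3-compatible object store with boto3.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")   # placeholder endpoint

bucket = "climate-raw"                                            # placeholder bucket
key = "raw/observations_2024.nc"
s3.upload_file("observations_2024.nc", bucket, key)

head = s3.head_object(Bucket=bucket, Key=key)                     # confirm the upload
print("stored bytes:", head["ContentLength"])
```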
4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the role of cloud computing** in big data analytics, identifying how cloud resources can be leveraged to process and analyze large datasets. 10 | * **Explore various cloud services** and solutions that support big data projects, including data storage, computation, and analytics platforms. 11 | * **Implement data migration strategies** to efficiently transfer large volumes of data to and from cloud environments. 12 | * **Utilize cloud-based HPC solutions** to perform scalable and efficient data analyses, examining the trade-offs between on-premise and cloud-based HPC. 13 | * **Develop and deploy applications** in the cloud using containerization and orchestration tools like Docker and Kubernetes. 14 | * **Optimize cost and resource usage** in cloud environments, employing best practices for cloud resource management and cost-efficiency. 15 | * **Secure cloud-based data solutions**, understanding security best practices and compliance issues related to data in the cloud. 16 | * **Integrate cloud data services with existing HPC infrastructure**, ensuring seamless data flow and maintenance. 17 | * **Evaluate the performance of cloud services** using benchmarks and performance metrics specific to big data applications. 18 | * **Participate in hands-on labs** to set up and configure cloud environments for real-world data analytics scenarios. 19 | * **Assess the scalability and elasticity** of cloud solutions, understanding how to dynamically adjust resources based on workload demands. 20 | * **Navigate ethical and legal considerations** of storing and processing data in the cloud, especially in a multi-tenant environment. 21 | * **Explore innovative cloud technologies** and emerging trends that influence big data analytics, such as serverless computing and machine learning services. 22 | * **Collaborate across distributed teams** using cloud-based tools and platforms to enhance productivity and data sharing in big data projects. 23 | * **Analyze the impact of cloud computing on data governance** and regulatory compliance, focusing on data sovereignty and auditability. 24 | * **Explore advanced networking configurations** for cloud environments to enhance data transfer speeds and reduce latency in big data processing. 25 | * **Understand the use of APIs in cloud environments** to automate tasks and integrate diverse data sources and analytical tools. 26 | * **Implement disaster recovery and data backup strategies** in the cloud to ensure data integrity and availability. 27 | * **Develop multi-cloud strategies** to avoid vendor lock-in and enhance resilience in big data analytics. 28 | * **Utilize AI and machine learning workflows** in the cloud to automate data analysis and generate insights at scale. 29 | 30 | AI generated content 31 | -------------------------------------------------------------------------------- /bda/2/4/b.txt: -------------------------------------------------------------------------------- 1 | # BDA2.4 RayDP 2 | 3 | RayDP is an open-source library that seamlessly integrates Ray with Apache Spark, enhancing the capabilities of both frameworks for handling large-scale data processing and machine learning tasks. This module explores how RayDP leverages Ray's simple, flexible, and performant model with Spark’s powerful data processing capabilities, ideal for complex analytics in HPC environments. 
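To make the integration concrete, the following minimal sketch starts a Spark session on a Ray cluster with RayDP; the executor counts and memory figures are placeholders, and the exact options supported depend on the installed RayDP version:

```python
# Minimal RayDP sketch: run Spark on a Ray cluster.
import ray
import raydp

ray.init()                       # or ray.init(address="auto") to join a running cluster

spark = raydp.init_spark(
    app_name="raydp-example",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
)

df = spark.range(0, 1000)        # ordinary PySpark DataFrame backed by Ray workers
print("row count:", df.count())

raydp.stop_spark()
ray.shutdown()
```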
4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand RayDP’s architecture** and its integration points with Apache Spark, identifying how it enhances functionality in distributed data processing. 10 | * **Set up and configure RayDP** in a HPC environment, optimizing it for specific data workflows. 11 | * **Execute large-scale data processing tasks** using RayDP, combining Ray’s scalability with Spark’s data processing tools. 12 | * **Develop machine learning pipelines** utilizing RayDP, integrating Spark MLlib and Ray’s machine learning libraries. 13 | * **Optimize data operations** with RayDP for improved performance and efficiency in data handling and computation. 14 | * **Handle streaming data** using RayDP’s integration with Spark Streaming, processing real-time data feeds effectively. 15 | * **Implement complex analytics workflows** that leverage the full capabilities of both Ray and Spark through RayDP. 16 | * **Utilize RayDP for AI and deep learning tasks**, taking advantage of Ray’s support for scalable model training and inference. 17 | * **Explore the use of RayDP in various industry sectors**, such as finance, healthcare, and telecommunications, for real-world applications. 18 | * **Participate in hands-on labs** to gain practical experience with deploying, managing, and optimizing RayDP-based applications. 19 | * **Navigate the scalability and resource management challenges** in RayDP, applying best practices for large-scale data deployments. 20 | * **Assess the impact of RayDP on project outcomes**, analyzing improvements in processing times and resource utilization. 21 | * **Explore the future potential of RayDP**, discussing upcoming features and potential enhancements in the integration of Ray and Spark. 22 | * **Address security considerations and data governance** in RayDP applications, ensuring compliance with legal and ethical standards. 23 | * **Collaborate effectively in distributed teams** using RayDP, enhancing communication and project management in big data projects. 24 | * **Critically evaluate RayDP’s performance** against other big data frameworks, understanding its unique advantages and limitations. 25 | 26 | AI generated content 27 | -------------------------------------------------------------------------------- /bda/2/5/b.txt: -------------------------------------------------------------------------------- 1 | # BDA2.5 Spark-Horovod 2 | 3 | Spark-Horovod is a powerful integration that combines Apache Spark's large-scale data processing capabilities with Horovod's efficient distributed training framework for deep learning. This module explores the synergies between Spark and Horovod, focusing on how this combination can be utilized for advanced machine learning tasks in HPC environments. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the integration of Spark with Horovod**, recognizing how it facilitates distributed deep learning. 10 | * **Set up and configure Spark-Horovod environments** in HPC settings, ensuring optimal configuration for specific project requirements. 11 | * **Execute distributed training sessions** using Spark-Horovod, applying it to practical machine learning problems involving large datasets. 12 | * **Optimize data preprocessing tasks** within Spark to feed into Horovod-powered training pipelines efficiently. 13 | * **Leverage Spark’s data handling capabilities** to manage the input and output of machine learning models trained with Horovod. 
14 | * **Utilize advanced features of Horovod** such as gradient aggregation and checkpointing to improve the efficiency of model training. 15 | * **Develop scalable machine learning applications** that combine Spark’s big data processing with Horovod’s efficient computation distribution. 16 | * **Monitor and debug distributed training processes**, using tools integrated within Spark and Horovod to track performance and identify bottlenecks. 17 | * **Explore case studies** demonstrating the successful application of Spark-Horovod in industries such as finance, healthcare, and retail. 18 | * **Participate in hands-on labs** to experience real-world challenges and solutions in training deep learning models at scale. 19 | * **Navigate data governance and security issues** in distributed machine learning environments, focusing on compliance and data protection. 20 | * **Assess the performance and scalability** of Spark-Horovod, comparing it to other distributed machine learning frameworks. 21 | * **Discuss future trends and advancements** in the integration of big data processing and distributed deep learning. 22 | * **Collaborate on projects using Spark-Horovod**, fostering teamwork and knowledge exchange among peers. 23 | * **Critically evaluate the impact of using Spark-Horovod** on machine learning project outcomes, focusing on improvements in speed, scalability, and accuracy. 24 | * **Implement post-training tasks** such as model evaluation and deployment within the Spark-Horovod ecosystem. 25 | * **Analyze resource allocation and management** within Spark-Horovod to optimize computational efficiency. 26 | * **Integrate Spark-Horovod with other big data platforms** and ecosystems to enhance data flow and processing capabilities. 27 | * **Design resilient and fault-tolerant systems** using Spark-Horovod to handle failures in distributed environments. 28 | * **Advance knowledge in tuning hyperparameters** within distributed settings to maximize model performance. 29 | 30 | AI generated content 31 | -------------------------------------------------------------------------------- /bda/4/1/b.txt: -------------------------------------------------------------------------------- 1 | # BDA4.1 Preparation 2 | 3 | Data preparation is a critical step in the big data analytics pipeline, involving the collection, cleaning, transformation, and integration of diverse data sources to create a unified dataset suitable for analysis. This module covers the essential techniques and best practices for preparing data for downstream analytics tasks. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the importance** of data preparation in the big data analytics process, including its impact on the quality and reliability of analytical results. 10 | * **Explore techniques** for data acquisition and ingestion from various sources, including databases, file systems, APIs, and streaming platforms. 11 | * **Analyze methods** for data cleaning and preprocessing to address issues such as missing values, outliers, duplicates, and inconsistencies. 12 | * **Understand the principles** of data transformation and normalization to ensure data consistency and compatibility across different formats and structures. 13 | * **Explore the concepts** of feature engineering and extraction for creating new features from raw data to enhance model performance and interpretability. 
14 | * **Analyze strategies** for data integration and consolidation to combine disparate datasets into a unified schema for comprehensive analysis. 15 | * **Understand the principles** of data sampling and stratification for creating representative subsets of large datasets to facilitate exploratory analysis and model training. 16 | * **Explore techniques** for handling imbalanced datasets to address challenges associated with unequal class distributions in classification tasks. 17 | * **Analyze methods** for data partitioning and splitting to separate datasets into training, validation, and test sets for model development and evaluation. 18 | * **Understand the concepts** of data anonymization and pseudonymization to protect sensitive information and ensure compliance with privacy regulations. 19 | * **Explore techniques** for handling temporal and spatial data to capture temporal dependencies and spatial relationships in analytical models. 20 | * **Analyze the principles** of data quality assessment and validation to measure the accuracy, completeness, and consistency of datasets. 21 | * **Understand the concepts** of metadata management and documentation to catalog and annotate datasets for traceability and reproducibility. 22 | * **Explore techniques** for data enrichment and augmentation to supplement existing datasets with additional information from external sources. 23 | * **Analyze the role** of data profiling and exploratory data analysis (EDA) in gaining insights into dataset characteristics and identifying patterns and trends. 24 | * **Understand the principles** of data versioning and lineage tracking to manage changes and lineage information throughout the data lifecycle. 25 | * **Explore techniques** for data compression and storage optimization to reduce storage costs and improve data accessibility and retrieval performance. 26 | * **Analyze the challenges** and best practices associated with scaling data preparation workflows for handling increasingly large and complex datasets. 27 | 28 | AI generated content 29 | -------------------------------------------------------------------------------- /bda/4/2/b.txt: -------------------------------------------------------------------------------- 1 | # BDA4.2 Pre-processing 2 | 3 | Pre-processing is a crucial phase in the Big Data Analytics process, focusing on improving data quality and format to ensure it is suitable for complex analysis. This module delves into advanced techniques for data cleaning, normalization, transformation, and reduction, aimed at preparing raw data efficiently for analytical tasks. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Identify and rectify** inconsistencies, missing values, and outliers in big datasets to enhance data accuracy. 10 | * **Implement normalization** and standardization techniques to ensure data uniformity, making it easier to compare and analyze. 11 | * **Execute transformation methods** such as scaling, encoding, and discretization to tailor data for specific analytical models. 12 | * **Utilize dimensionality reduction** techniques, including PCA, to reduce the number of random variables under consideration, while preserving essential information. 13 | * **Automate repetitive data cleaning tasks** using scripts and libraries to increase efficiency and reduce the likelihood of human error. 14 | * **Develop and apply strategies** for dealing with unstructured data types like text and images, making them amenable to analysis. 
15 | * **Assess the impact** of various pre-processing techniques on the quality of datasets and the robustness of subsequent analyses. 16 | * **Employ advanced filtering** to remove redundant or irrelevant data features, focusing analysis on significant attributes. 17 | * **Integrate and reconcile data** from disparate sources to build comprehensive datasets ready for in-depth analysis. 18 | * **Design scalable and reproducible data preprocessing pipelines** that can be applied across various projects and datasets. 19 | * **Optimize preprocessing workflows** for improved performance in high-volume data environments. 20 | * **Explore real-world applications** where effective pre-processing has significantly enhanced the outcomes of data analytics projects. 21 | * **Navigate ethical and legal considerations** in data preprocessing, ensuring compliance with data protection laws and ethical standards. 22 | * **Explore data imputation techniques** to handle missing data effectively, tailoring approaches to the dataset and analysis needs. 23 | * **Implement feature extraction** methods to derive new variables from existing data for enhanced insights and model performance. 24 | * **Use advanced techniques for anomaly detection** to identify significant deviations in data patterns. 25 | * **Develop skills in using automation tools** for data transformation in cloud environments or platforms like Apache Spark. 26 | * **Practice data pre-processing in real-time analytics scenarios**, addressing challenges with streaming data. 27 | * **Learn to preprocess data for specific types of analysis**, such as time series or predictive modeling, customizing methods accordingly. 28 | * **Evaluate the scalability of preprocessing methods** in distributed computing environments, adapting to technology constraints. 29 | * **Address data quality issues systematically**, creating frameworks for assessing and improving data quality. 30 | 31 | 32 | AI generated content 33 | -------------------------------------------------------------------------------- /bda/4/3/b.txt: -------------------------------------------------------------------------------- 1 | # BDA4.3 Visualization 2 | 3 | Visualization is a pivotal stage in Big Data Analytics, enabling analysts to see and understand trends, outliers, and patterns in data through graphical representation. This module focuses on the principles, tools, and techniques necessary for creating effective visualizations that communicate data insights clearly and compellingly. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the principles** of data visualization, including the selection of appropriate chart types, color schemes, and data encoding techniques. 10 | * **Develop skills** to use advanced visualization tools and software for creating interactive and static visualizations. 11 | * **Apply best practices** in creating dashboards and reports that effectively communicate the results of data analysis. 12 | * **Explore the use of visualization** for big data in various scenarios such as real-time data streams, large datasets, and complex data structures. 13 | * **Design visualizations** that adhere to accessibility and usability standards to ensure that they are understandable by a diverse audience. 14 | * **Integrate visual analytics** techniques into the visualization process to enable deeper exploration and discovery of data insights. 
15 | * **Utilize scripting** with libraries like D3.js for custom visualization solutions that go beyond conventional tools. 16 | * **Critically analyze** the impact of different visualization choices on the interpretation of data. 17 | * **Implement interactive elements** in visualizations to enhance user engagement and data exploration. 18 | * **Assess the effectiveness** of visualizations in conveying the desired message and meeting analytical objectives. 19 | * **Explore case studies** where effective visualization has led to significant business or research outcomes. 20 | * **Navigate ethical considerations** in data visualization, focusing on the responsible representation of information. 21 | * **Leverage advanced visualization techniques** such as heatmaps, geospatial maps, and treemaps. 22 | * **Incorporate time-series analysis** in visualizations to effectively showcase data changes over time. 23 | * **Utilize machine learning algorithms** to identify patterns and outliers, enhancing the insights from visualizations. 24 | * **Implement dynamic and real-time visualizations** for streaming data. 25 | * **Develop custom visual solutions** using programming to extend beyond traditional tools. 26 | * **Evaluate the scalability of visualization tools** to handle Big Data effectively. 27 | * **Integrate visual data discovery tools** for interactive data exploration. 28 | * **Understand visual perception implications** in visualization design. 29 | * **Apply advanced statistical methods** for effective data aggregation in visualizations. 30 | * **Explore the role of AI and automation** in visual data analysis. 31 | 32 | AI generated content 33 | -------------------------------------------------------------------------------- /bda/4/4/b.txt: -------------------------------------------------------------------------------- 1 | # BDA4.4 Analysis 2 | 3 | Analysis is a critical phase in Big Data Analytics, where data, having been collected, cleansed, and visualized, is now subject to deeper examination to extract actionable insights. This module focuses on the advanced analytical techniques and tools necessary for effective decision-making based on large volumes of data. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Apply statistical methods** to analyze large datasets, interpreting results to make data-driven decisions. 10 | * **Utilize machine learning models** to perform predictive analytics, classifying and forecasting data trends. 11 | * **Implement unsupervised learning techniques** like clustering and dimensionality reduction to discover patterns and relationships in data. 12 | * **Evaluate model performance** using metrics specific to different types of analysis (classification, regression, clustering). 13 | * **Integrate advanced analytics techniques** into business processes to enhance operational efficiency and strategic planning. 14 | * **Develop scripts and algorithms** for automated data analysis, optimizing them for speed and accuracy. 15 | * **Conduct time-series analysis** to forecast future events based on historical data. 16 | * **Use text analytics and natural language processing (NLP)** to extract meaningful information from unstructured data. 17 | * **Perform sentiment analysis** to gauge consumer attitudes and market trends from social media and other textual data. 18 | * **Apply geospatial analysis** to interpret data related to geographic locations and environments. 
19 | * **Explore the use of big data analytics in various industry sectors** such as healthcare, finance, retail, and telecommunications. 20 | * **Navigate ethical and legal considerations** in data analysis, ensuring the privacy and security of sensitive information. 21 | * **Participate in hands-on labs and simulations** to apply analytical techniques to real-world datasets. 22 | * **Critically assess the limitations and biases** of data analysis methods, striving for accuracy and fairness in conclusions. 23 | * **Engage in collaborative projects** to tackle complex analysis tasks, sharing insights and methodologies across diverse teams. 24 | * **Stay abreast of emerging trends and technologies** in data analysis, including AI and IoT analytics. 25 | 26 | AI generated content 27 | -------------------------------------------------------------------------------- /bda/5/b.txt: -------------------------------------------------------------------------------- 1 | # BDA5 Machine Learning 2 | 3 | Machine learning techniques are essential for extracting valuable insights and predictions from big data. This module covers the principles, algorithms, and applications of machine learning in the context of big data analytics. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the fundamentals** of machine learning, including supervised learning, unsupervised learning, and reinforcement learning. 10 | * **Explore the principles** of feature engineering and feature selection for preparing input data for machine learning models. 11 | * **Analyze the role** of data preprocessing techniques such as normalization, standardization, and missing value imputation in improving model performance. 12 | * **Understand the concepts** of model evaluation, including performance metrics such as accuracy, precision, recall, and F1-score. 13 | * **Explore the concepts** of cross-validation and hyperparameter tuning for optimizing model performance and generalization. 14 | * **Analyze the architecture** of popular machine learning libraries and frameworks such as scikit-learn, TensorFlow, and PyTorch. 15 | * **Understand the principles** of linear regression and logistic regression for solving regression and classification problems, respectively. 16 | * **Explore the concepts** of decision trees, random forests, and gradient boosting for building ensemble learning models. 17 | * **Analyze the principles** of support vector machines (SVMs) for binary classification and kernel methods for nonlinear decision boundaries. 18 | * **Understand the concepts** of clustering algorithms such as k-means, hierarchical clustering, and density-based clustering for unsupervised learning tasks. 19 | * **Explore the role** of dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) for visualizing and compressing high-dimensional data. 20 | * **Analyze the principles** of neural networks and deep learning architectures for solving complex machine learning problems. 21 | * **Understand the concepts** of convolutional neural networks (CNNs) for image classification and object detection tasks. 22 | * **Explore the role** of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for sequence modeling and natural language processing tasks. 23 | * **Analyze the principles** of generative adversarial networks (GANs) for generating synthetic data and images. 
24 | * **Understand the concepts** of transfer learning and fine-tuning pre-trained models for domain adaptation and task-specific learning. 25 | * **Explore the role** of autoencoders and variational autoencoders (VAEs) for unsupervised feature learning and data generation. 26 | * **Analyze the principles** of reinforcement learning algorithms such as Q-learning and deep Q-networks (DQN) for learning optimal decision-making policies. 27 | 28 | AI generated content 29 | -------------------------------------------------------------------------------- /bda/6/1/b.txt: -------------------------------------------------------------------------------- 1 | # BDA6.1 Analysis Workflow 2 | 3 | The Analysis Workflow module provides a systematic approach to conducting data analysis within Big Data projects, emphasizing efficient processes and methodologies to handle and extract value from large datasets. This course outlines the key components and stages involved in a successful analysis workflow in the context of Big Data Analytics. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Define and outline the stages** of a typical big data analysis workflow, from data collection to data interpretation. 10 | * **Develop data ingestion strategies** to effectively gather and store data from various sources, ensuring quality and accessibility. 11 | * **Implement data cleaning and preprocessing techniques** to prepare raw data for analysis, enhancing data quality and usefulness. 12 | * **Utilize exploratory data analysis (EDA)** techniques to summarize characteristics of data and discover initial patterns. 13 | * **Construct models and hypotheses** based on statistical foundations and business intelligence insights. 14 | * **Apply advanced analytical methods** to interpret complex datasets, employing techniques such as machine learning, regression analysis, and clustering. 15 | * **Optimize workflows for efficiency and scalability**, adjusting processes to handle large volumes of data effectively. 16 | * **Automate routine data analysis tasks** using scripting and batch processing to reduce manual effort and increase reproducibility. 17 | * **Validate and refine analytical models** through iterative testing and tuning to improve accuracy and relevance. 18 | * **Communicate results effectively** to stakeholders using visualization tools and presentation techniques. 19 | * **Develop documentation and reporting standards** for analysis workflows to ensure consistency and clarity in outputs. 20 | * **Navigate ethical and compliance issues** related to data analysis, focusing on data privacy, security, and regulatory standards. 21 | * **Integrate new technologies and methodologies** into existing workflows to stay current with industry trends and enhance capabilities. 22 | * **Evaluate the impact of analysis workflows** on business outcomes, demonstrating the value of data-driven decision making. 23 | * **Collaborate in multidisciplinary teams** to bring diverse expertise into the workflow, enhancing the depth and breadth of analytical insights. 24 | * **Critically assess the limitations and biases** in analytical models and workflows, aiming for transparency and objectivity in conclusions. 25 | * **Manage and optimize the use of analytical tools and platforms** within the workflow, including selection and configuration of software and hardware resources. 26 | * **Develop skills in data simulation and synthetic data generation** to test models when actual data is incomplete or unavailable. 
27 | * **Implement continuous improvement practices** in analysis workflows to adapt and evolve with organizational needs and technological advances. 28 | * **Lead and manage big data projects** with a focus on strategic planning and cross-functional coordination. 29 | 30 | AI generated content 31 | -------------------------------------------------------------------------------- /bda/6/2/b.txt: -------------------------------------------------------------------------------- 1 | # BDA6.2 Data-driven Workflows 2 | 3 | Data-driven workflows are essential for organizations aiming to integrate data at every stage of their decision-making processes. This module covers the design, implementation, and optimization of workflows that leverage data to guide business strategies and operations effectively. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Identify the components of data-driven workflows**, understanding how data is sourced, processed, and analyzed to inform decisions. 10 | * **Design workflows that integrate data collection, analysis, and interpretation** seamlessly to support continuous improvement and real-time decision making. 11 | * **Implement automation in data workflows** to streamline operations and reduce manual errors, using tools like BPM (Business Process Management) software. 12 | * **Utilize data to predict outcomes and drive decisions**, applying predictive analytics and machine learning models within workflows. 13 | * **Develop metrics and KPIs** to measure the effectiveness of data-driven workflows and guide iterative improvements. 14 | * **Optimize workflow performance** with advanced data strategies, including data blending and real-time data processing. 15 | * **Create robust data governance frameworks** within workflows to ensure data integrity, security, and compliance with regulations. 16 | * **Train teams to leverage data-driven workflows**, emphasizing the importance of data literacy and analytical skills across the organization. 17 | * **Evaluate and select technology platforms** that best support the scalability and complexity of data-driven workflows. 18 | * **Integrate cross-functional data sources** to enrich workflow inputs and enhance the comprehensiveness of data analysis. 19 | * **Develop visualization dashboards** that represent workflow outputs and performance indicators clearly and effectively. 20 | * **Implement feedback mechanisms** in workflows to capture insights and refine processes continuously. 21 | * **Explore case studies of successful data-driven workflows** in various industries to understand best practices and common pitfalls. 22 | * **Address ethical considerations** in automating and managing data-driven decisions, particularly regarding data privacy and bias. 23 | * **Foster a culture of data-driven decision making** within organizations, encouraging proactive data utilization and analysis. 24 | * **Navigate the challenges of integrating legacy systems** with modern data workflow solutions, ensuring seamless data flows across different technologies. 25 | * **Critically assess the impact of data-driven workflows** on organizational efficiency and market responsiveness. 26 | * **Explore emerging technologies** like AI and IoT, and their roles in enhancing data-driven workflows. 27 | * **Conduct workshops and training sessions** to develop hands-on expertise in managing and optimizing data-driven workflows. 28 | * **Lead strategic initiatives** to implement and scale data-driven workflows across large organizations. 
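To make the idea of a data-driven workflow more concrete, the following minimal Python sketch wires the stages together: ingest, clean, model, and a simple decision rule driven by a quality metric. It is only an illustration under assumed names -- the file `sensor_readings.csv`, the `failure` target column, and the 0.8 quality gate are hypothetical placeholders, and pandas/scikit-learn are just one possible tool choice.

```python
# Minimal sketch of a data-driven workflow: ingest -> clean -> model -> decide.
# File name, column names, and the quality gate are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("sensor_readings.csv")          # ingest
df = df.dropna()                                  # naive cleaning step
X, y = df.drop(columns=["failure"]), df["failure"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

score = f1_score(y_test, model.predict(X_test))   # KPI used to gate the decision
print(f"F1 on hold-out data: {score:.2f}")
if score > 0.8:                                   # data-driven decision rule
    print("Model meets the quality gate; schedule predictive maintenance runs.")
else:
    print("Model below the quality gate; keep the existing process.")
```

In a production workflow each of these steps would typically be a separate, automated stage in a pipeline or BPM tool; the sketch only shows how data flows from ingestion to a decision.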
29 | 30 | AI generated content 31 | -------------------------------------------------------------------------------- /bda/6/3/b.txt: -------------------------------------------------------------------------------- 1 | # BDA6.3 Integrating BDA with HPC Workflows 2 | 3 | Integrating Big Data Analytics (BDA) with High-Performance Computing (HPC) workflows enables organizations to handle and analyze large-scale data efficiently. This module delves into the methodologies and technologies required to seamlessly combine BDA capabilities with HPC infrastructures. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the fundamental requirements** for integrating BDA into HPC environments, identifying key challenges and solutions. 10 | * **Design integrated workflows** that leverage the computational power of HPC with the analytical capabilities of BDA. 11 | * **Implement scalable data management strategies** within HPC workflows to handle vast amounts of data efficiently. 12 | * **Utilize parallel computing techniques** in BDA processes to enhance data processing and analysis speeds. 13 | * **Optimize data storage and retrieval** to support intensive BDA tasks within HPC environments. 14 | * **Develop robust data pipelines** that ensure data integrity and availability across distributed computing resources. 15 | * **Apply machine learning and AI algorithms** effectively in HPC settings to solve complex analytical problems. 16 | * **Monitor and manage the performance** of integrated BDA-HPC systems, using tools and practices that enhance reliability and scalability. 17 | * **Address security and privacy issues** relevant to BDA-HPC integrations, implementing comprehensive data protection measures. 18 | * **Evaluate the impact of integrated solutions** on organizational capabilities and business outcomes. 19 | * **Train technical teams** on the effective use of integrated BDA-HPC workflows, focusing on skills development and best practices. 20 | * **Innovate new approaches** for merging BDA with HPC, exploring cutting-edge technologies and methodologies. 21 | * **Navigate regulatory and compliance challenges** associated with deploying BDA-HPC integrations in different industries. 22 | * **Lead cross-functional projects** that utilize integrated BDA-HPC workflows to drive innovation and efficiency. 23 | * **Assess and enhance fault tolerance** and disaster recovery strategies in integrated systems. 24 | * **Explore case studies** where BDA-HPC integration has transformed business processes and analytical capabilities. 25 | * **Engage with industry experts** and thought leaders to stay updated on the latest trends and advancements in BDA-HPC integration. 26 | * **Critically evaluate software and hardware choices** for building effective BDA-HPC environments. 27 | * **Develop strategies for real-time analytics** within HPC workflows, leveraging fast data processing and analysis. 28 | * **Lead strategic planning sessions** to align BDA-HPC integration with broader organizational goals. 29 | 30 | AI generated content 31 | -------------------------------------------------------------------------------- /bda/6/b.txt: -------------------------------------------------------------------------------- 1 | # BDA6 Workflows 2 | 3 | Workflows play a crucial role in orchestrating and automating the sequence of tasks involved in Big Data Analytics (BDA), ensuring efficient data processing, analysis, and decision-making. 
In this overview, we explore the key aspects of workflows and their significance in the context of big data analytics. 4 | 5 | **Analysis Workflow (BDA6.1):** Analysis workflows are designed to streamline the process of data analysis, from data preparation to insights generation. This section discusses the components and stages of an analysis workflow, including data ingestion, data cleaning, feature engineering, model training, evaluation, and deployment. Topics also include workflow management tools, pipeline orchestration, and best practices for designing efficient analysis workflows. 6 | 7 | **Data-driven Workflows (BDA6.2):** Data-driven workflows leverage insights from data to drive the automation and optimization of business processes. This section explores the use of data-driven workflows for tasks such as predictive maintenance, anomaly detection, recommendation systems, and personalized marketing. Topics include data-driven decision-making, workflow automation tools, and the integration of data-driven workflows with enterprise systems. 8 | 9 | **Integrating BDA with HPC Workflows (BDA6.3):** Integrating Big Data Analytics (BDA) with High-Performance Computing (HPC) workflows enables the seamless execution of data-intensive tasks on HPC infrastructures. This section discusses the challenges and opportunities of integrating BDA with HPC workflows, including data movement, resource provisioning, and workflow interoperability. Topics also include workflow management systems that support both BDA and HPC workloads, enabling organizations to leverage the computational power of HPC systems for big data analytics tasks. 10 | 11 | By mastering workflows in the context of Big Data Analytics (BDA), practitioners can streamline data processing pipelines, automate repetitive tasks, and accelerate insights generation, enabling organizations to derive value from their data assets more effectively. 12 | 13 | ## Requirements 14 | 15 | ## Learning Outcomes 16 | 17 | * Analyze workflows and investigate the systematic approach. 18 | * Apply data-driven workflows for decision-making processes. 19 | * Examine the integration of big data into HPC workflows. 20 | 21 | ## Subskills 22 | 23 | * [[skill-tree:bda:6:1:b]] 24 | * [[skill-tree:bda:6:2:b]] 25 | * [[skill-tree:bda:6:3:b]] 26 | 27 | AI generated content 28 | -------------------------------------------------------------------------------- /bda/b.txt: -------------------------------------------------------------------------------- 1 | # BDA Big Data Analytics 2 | 3 | The analysis of large volumes of data is essential in HPC. 4 | Big data analytics covers concepts and tools to perform this task, traditionally in cloud environments that -- compared to HPC systems -- use cheap and less reliable hardware. 5 | However, as tools and concepts matured, they are now applied in HPC environments as well. 6 | 7 | HPC workloads utilize tools and methodology from Data Science (DS) and Artificial Intelligence (AI) to process data in order to obtain results quickly. 8 | AI can be used inside simulations or to steer workflows, while data science can be used to find interesting patterns inside the data. 9 | 10 | ## Learning Outcomes 11 | 12 | * Examine some theoretical principles of big data. 13 | * Differentiate the various tools that could be used in an HPC environment effectively. 14 | * Contrast the different steps in data processing, from preparation to post-processing, visualization, and analysis.
15 | * Describe and apply the concepts of artificial intelligence and data science. 16 | * Compose a workflow consisting of HPC and big data tools to analyze the data. 17 | 18 | ## Subskills 19 | 20 | * [[skill-tree:bda:1:b]] 21 | * [[skill-tree:bda:2:b]] 22 | * [[skill-tree:bda:3:b]] 23 | * [[skill-tree:bda:4:b]] 24 | * [[skill-tree:bda:5:b]] 25 | * [[skill-tree:bda:6:b]] 26 | -------------------------------------------------------------------------------- /extract-outcomes.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import os 4 | import sys 5 | 6 | def find_skills(): 7 | data = {} 8 | for root, dirs, files in os.walk("."): 9 | for f in files: 10 | if(f[-4:] == ".txt"): 11 | c = root[2:] 12 | if c in data: 13 | data[c].append(f) 14 | else: 15 | data[c] = [f] 16 | del data[""] 17 | return data 18 | 19 | def find_section(data, section): 20 | pos = data.find("# " + section) 21 | if pos == -1: 22 | return "" 23 | rest = data[pos + 2 + len(section):].strip() 24 | pos = rest.find("# ") 25 | if pos == -1: 26 | return rest 27 | return rest[0:pos].strip() 28 | 29 | 30 | def removeSection(data, section): 31 | pos = data.find("# " + section) 32 | if pos == -1: 33 | return data 34 | head = data[0:pos] 35 | rest = data[pos + 2:] 36 | 37 | pos = rest.find("# ") 38 | if pos == -1: 39 | return head 40 | 41 | return head + rest[pos:] 42 | 43 | def extract_outcomes(file): 44 | # read a skill file and return its "Outcomes" section 45 | fd = open(file, "r") 46 | data = fd.read() 47 | fd.close() 48 | return find_section(data, "Outcomes") 49 | 50 | 51 | exist = find_skills() 52 | 53 | for dir in exist: 54 | outcome_basic = "" 55 | files = exist[dir] 56 | files.sort() 57 | for f in files: 58 | path = dir + "/" + f 59 | outcomes = extract_outcomes(path) 60 | if len(outcomes) < 5: 61 | print("Missing learning objectives: " + path) 62 | if f == "b.txt": 63 | outcome_basic = outcomes 64 | else: 65 | if outcomes == outcome_basic: 66 | print("Same learning objectives as basic: " + path) 67 | # strip the Subskills section and rewrite the skill file only if it changed 68 | fd = open(path, "r") 69 | data = fd.read() 70 | fd.close() 71 | datanew = removeSection(data, "Subskills") 72 | if data != datanew: 73 | fd = open(path, "w") 74 | fd.write(datanew) 75 | fd.close() 76 | -------------------------------------------------------------------------------- /k/1/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1.1 System Architectures 2 | 3 | An HPC cluster is built out of a few up to many servers that are connected by a high-performance communication network. 4 | The servers are called nodes. 5 | HPC clusters use batch systems for running compute jobs. 6 | Interactive access is usually limited to a few nodes. 7 | A typical cluster consists of these parts: 8 | * various system-, hardware-, and I/O-architectures used for supercomputers, i.e. shared memory systems, distributed systems, and cluster systems 9 | * physical hardware: chassis/rack, computer system units, interconnect, power; compute node architecture with memory, local disk, and sockets 10 | * typical architecture of cluster systems consisting of nodes with different roles (e.g. so-called head, management, login, compute, interactive, visualization nodes, etc.) 11 | These nodes function as: 12 | 13 | * **Login or gateway nodes** that you land on after logging into the cluster and where you can work interactively 14 | * **Compute nodes** that are the workhorses of a cluster 15 | * **Admin or system nodes** that work in the background necessary for the operation of the cluster, e.g.
for running the batch service, or starting and shutting down compute nodes 16 | * **Disk nodes** that provide global file systems, i.e. file systems that can be used on all other kinds of nodes 17 | * **Special nodes** e.g. for data movement, visualization, or pre- and post-processing of large data sets 18 | * **Head nodes** which can mean login or admin nodes used for management either by a user or an administrator 19 | 20 | ## Learning Outcomes 21 | 22 | * Comprehend that there are nodes with several functions 23 | * Comprehend that nodes are connected by a high-performance communication network 24 | * Comprehend that global file systems are available on all nodes of the cluster and are convenient because their files can be accessed directly on all nodes 25 | * Quantitatively, parallel file systems offer higher I/O performance than classic network file systems 26 | * Qualitatively, they allow several processes to write into the same file 27 | * Comprehend the two purposes of the communication network 28 | * Enabling high-speed data communication for parallel applications running on multiple nodes 29 | * Providing a high-speed connection to the disk systems in the cluster 30 | * Understand shared memory systems 31 | * Understand distributed systems 32 | 33 | -------------------------------------------------------------------------------- /k/1/2/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1.2.1 Processing elements 2 | 3 | 4 | ## Learning Outcomes 5 | 6 | * Understand elementary processing elements like CPUs, GPUs, many-core architectures 7 | * Understand vector systems and FPGAs 8 | * Understand the NUMA architecture used for symmetric multiprocessing systems where the memory access time depends on the memory location relative to the processor 9 | * Comprehend that in traditional **CPUs** - although CPU stands for Central Processing Unit - there is no central, i.e. single, processing unit any more because today all CPUs have multiple compute cores which all have the same functionality 10 | * Comprehend that **GPUs** (Graphical Processing Units) or **GPGPUs** (General Purpose Graphical Processing Units) were originally used for image processing and displaying images on screens before people started to utilize the computing power of GPUs for other purposes 11 | * Comprehend that **FPGAs** (Field-Programmable Gate Arrays) are devices that have configurable hardware and configurations are specified by hardware description languages 12 | * Comprehend that **FPGAs** are interesting if one uses them to implement hardware features that are not available in CPUs or GPUs (e.g. low precision arithmetic that needs only a few bits) 13 | * Comprehend that **Vector units** are successors of vector computers (i.e.
the first generation of supercomputers) and that they are supposed to provide higher memory bandwidth than CPUs 14 | * Comprehend that at an abstract level the high-speed network connects compute units and main memory which leads to three main parallel computer architectures 15 | * **Shared Memory** where all compute units can directly access the whole main memory 16 | * **Distributed memory** where individual computers are connected with a network 17 | * **NUMA** (Non-Uniform Memory Access) combines properties from shared and distributed memory systems, because at the hardware level a NUMA system resembles a distributed memory system 18 | * Comprehend that in general, the effort for programming parallel applications for distributed systems is higher than for shared memory systems 19 | * Understand parallelization techniques at the instruction level of a processing element (e.g. pipelining, SIMD processing) 20 | * Understand advanced instruction sets that improve parallelization (e.g., AVX-512) 21 | * Understand hybrid approaches, e.g. combining CPUs with GPUs or FPGAs 22 | * Understand typical network topologies and architectures used for HPC systems, like fat trees based on switched fabrics using e.g. fast Ethernet (1 or 10 Gbit) or InfiniBand 23 | * Understand special or application-specific hardware (e.g. TPUs) 24 | -------------------------------------------------------------------------------- /k/1/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # K1.2.2 Network 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand network demands for HPC systems (e.g. high bandwidth and low latency) 6 | * Understand typical network architectures used for HPC systems, like fast Ethernet (1 or 10 Gbit) or InfiniBand 7 | * Understand Demands 8 | * Understand Topologies 9 | * Understand Interconnects 10 | 11 | 12 | -------------------------------------------------------------------------------- /k/1/2/b.txt: -------------------------------------------------------------------------------- 1 | # K1.2 Hardware Architectures 2 | 3 | HPC computer architectures are parallel computer architectures. A parallel computer is built out of 4 | * Compute units. 5 | * Main memory. 6 | * A high-speed network. 7 | 8 | ## Learning Outcomes 9 | 10 | * Differentiate different processing elements such as CPU, FPGA, GPU, and others. 11 | * Demonstrate networking with different topologies and interconnects. 12 | 13 | ## Subskills 14 | 15 | * [[skill-tree:k:1:2:1:b]] 16 | * [[skill-tree:k:1:2:2:b]] 17 | -------------------------------------------------------------------------------- /k/1/3/1/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.1 Media Types 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand different types of storage media 6 | 7 | -------------------------------------------------------------------------------- /k/1/3/1/2/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.2 Storage System 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand Object storage 6 | * Understand Burst Buffer 7 | * Understand Network Storage 8 | 9 | -------------------------------------------------------------------------------- /k/1/3/1/3/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.3.1 Parallel File System 2 | 3 | Parallel (or cluster) file systems are global file systems like distributed file systems, i.e.
they can be used on any node in the cluster. 4 | 5 | They are designed to deliver high I/O bandwidth and provide large disk space. 6 | 7 | Parallel file systems and programming environments have typically solved the problems of data partitioning and collective access by introducing file modes. The different modes specify the semantics of simultaneous operations by multiple processes. Once a mode is defined, conventional read and write operations are used to access the data, and their semantics are determined by the mode. The most common modes are broadcast, reduce, scatter, gather, shared, offset, and independent. 8 | 9 | ## Learning Outcomes 10 | 11 | * Comprehend that the parallel or cluster aspect is twofold: 12 | * Firstly, the hardware is parallel itself (the file system is provided by several servers that operate in a coordinated way). 13 | * Secondly, parallel I/O is enabled, i.e. more than one process can consistently write to the same file at the same time. 14 | 15 | -------------------------------------------------------------------------------- /k/1/3/1/3/2/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.3.2 Network File Systems 2 | 3 | Distributed (or network) file systems are provided by a file server which is integrated into the cluster. These file systems are available on all nodes. 4 | A classic distributed file system is the Network File System (NFS). 5 | 6 | ## Learning Outcomes 7 | 8 | * Comprehend that Network file systems are not designed for (very) high I/O loads. 9 | 10 | -------------------------------------------------------------------------------- /k/1/3/1/3/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.3 File Systems 2 | 3 | Before starting to use an HPC system it is important to basically understand file system architectures used in clusters. 4 | 5 | ## Learning Outcomes 6 | 7 | * Understand Local and pseudo File Systems. 8 | * Understand Parallel File System. 9 | * Understand network file systems. 10 | 11 | ## Subskills 12 | 13 | * [[skill-tree:k:1:3:1:3:1:b]] 14 | * [[skill-tree:k:1:3:1:3:2:b]] 15 | -------------------------------------------------------------------------------- /k/1/3/1/4/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.4.1 NetCDF 2 | 3 | NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is commonly used in climatology, meteorology, oceanography applications (e.g., weather forecasting, climate change) and GIS applications. 4 | NetCDF allows the user to describe multidimensional data and include metadata which further characterizes the data. 5 | 6 | The notation used to describe a NetCDF object is called CDL (network Common Data form Language), which provides a convenient way of describing NetCDF datasets. The NetCDF system includes utilities for producing human-oriented CDL text files from binary NetCDF datasets and vice-versa. 7 | 8 | This skill focuses on the data modeling aspect and basic tools that aid in understanding NetCDF. 9 | 10 | ## Learning Outcomes 11 | 12 | * Describe the Classic NetCDF Model characteristics and limitations. 13 | * Describe how a 3D-variable’s data values can be represented in a visualization tool. 
14 | * Discuss features of the Enhanced Data Model (NetCDF-4): groups, multiple unlimited dimensions, and new types, including user-defined types. 15 | * Analyze the implications of using NetCDF-4 compression on data. 16 | * Justify when to model data using variables or attributes based on characteristics. 17 | * Design and implement a NetCDF data model in CDL text notation for simple datasets using dimensions, variables, attributes, and coordinate variables. 18 | * Apply the in-built NetCDF utilities to create and analyze NetCDF files: 19 | * Examine the CDL data model and the data of a file using ncdump. 20 | * Generate a NetCDF file from a CDL text file using ncgen. 21 | * Convert a NetCDF file from one binary format to another, optionally changing compression and chunk size settings using nccopy. 22 | 23 | -------------------------------------------------------------------------------- /k/1/3/1/4/2/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.4.2 NetCDF/CF Conventions 2 | 3 | The Climate and Forecast (CF) conventions are metadata conventions for earth science data, intended to promote the processing and sharing of files created with the NetCDF Application Programmer Interface (API). The CF conventions define metadata that are included in the same file as the data (thus making the file "self-describing"). 4 | 5 | The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the sense that each variable in the file has an associated description of what it represents, including physical units if appropriate, and that each value can be located in space (relative to earth-based coordinates) and time. 6 | 7 | The CF conventions enable users of data from different sources to decide which data are comparable and allows building applications with powerful extraction, regridding, and display capabilities. The CF conventions are framed as a NetCDF standard, but most ideas relate to metadata design in general and not specifically to NetCDF, and hence can be contained in other formats such as XML. 8 | 9 | ## Learning Outcomes 10 | 11 | * Describe the relationships among elements of the CF-NetCDF conventions and NetCDF entities. 12 | * Locate data in space-time and as a function of other independent variables (coordinates), to facilitate processing and graphics. 13 | * Evaluate programs in C that use the NetCDF API to read and write files in a metadata-aware manner. 14 | * Judge if given metadata meets the CF standard and provides basic understandability for users. 15 | * Design a data model for NetCDF using the CF conventions. 
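As a small, hedged illustration of these conventions, the Python sketch below writes a tiny CF-style NetCDF-4 file using the netCDF4 package (one possible route besides the C API mentioned above). The variable names, dimension sizes, and values are made up for the example; the resulting CDL can then be inspected with ncdump.

```python
# Minimal sketch: create a small CF-style NetCDF-4 file (names and values are illustrative).
import numpy as np
from netCDF4 import Dataset

ds = Dataset("example.nc", "w", format="NETCDF4")
ds.Conventions = "CF-1.8"                      # global attribute announcing the convention

ds.createDimension("time", None)               # unlimited (record) dimension
ds.createDimension("lat", 2)
ds.createDimension("lon", 3)

time = ds.createVariable("time", "f8", ("time",))
time.units = "hours since 2020-01-01 00:00:00"
time.standard_name = "time"

lat = ds.createVariable("lat", "f4", ("lat",))
lat.units = "degrees_north"
lat.standard_name = "latitude"

lon = ds.createVariable("lon", "f4", ("lon",))
lon.units = "degrees_east"
lon.standard_name = "longitude"

temp = ds.createVariable("temperature", "f4", ("time", "lat", "lon"), zlib=True)
temp.units = "K"
temp.standard_name = "air_temperature"

time[:] = [0.0]
lat[:] = [50.0, 51.0]
lon[:] = [8.0, 9.0, 10.0]
temp[0, :, :] = 273.15 + np.random.rand(2, 3)  # one record of dummy data

ds.close()
# Inspect the CDL afterwards with: ncdump example.nc
```

The units and standard_name attributes on every variable, plus the coordinate variables for each dimension, are what make the file self-describing in the CF sense.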
16 | 17 | -------------------------------------------------------------------------------- /k/1/3/1/4/3/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.4.3 HDF5 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand HDF5 6 | 7 | -------------------------------------------------------------------------------- /k/1/3/1/4/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.4 Middleware 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand NetCDF 6 | * Understand NetCDF/CF Conventions 7 | * Understand HDF5 8 | 9 | ## Subskills 10 | 11 | * [[skill-tree:k:1:3:1:4:1:b]] 12 | * [[skill-tree:k:1:3:1:4:2:b]] 13 | * [[skill-tree:k:1:3:1:4:3:b]] 14 | -------------------------------------------------------------------------------- /k/1/3/1/5/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.5.1 MPI-IO 2 | 3 | MPI-IO is an API standard for parallel I/O that allows multiple processes of a parallel program to access data in a common file simultaneously. MPI-IO intends to leverage the relatively wide acceptance of the MPI interface in order to create a similar I/O interface. MPI I/O maps I/O reads and writes to message-passing sends and receives. Accesses to a file can be independent (no coordination between processes takes place) or collective (each process of the group associated with the communicator must participate in the collective access). 4 | 5 | MPI-IO supports a high-level interface to describe the partitioning of file data among processes, a collective interface describing complete transfers of global data structures between process memories and files, asynchronous I/O operations, allowing computation to be overlapped with I/O, and optimization of physical file layout on storage devices (disks). 6 | 7 | In the MPI-I/O model, each process defines the datatype for its section of the file. These are passed into the MPI-I/O routines and data is automatically read and transferred directly to local memory. There is no single large buffer and no explicit master process. In addition, each read/write access operates on a number of MPI objects which can be of any MPI basic or derived datatype. The data partitioning via MPI derived datatypes has the advantage of added flexibility and expressiveness. 8 | 9 | Non-contiguous operations and collective calls have been defined in MPI-IO which lead to a classification of data access into four levels. These levels are characterized by two orthogonal aspects: contiguous vs. non-contiguous data access, and independent vs. collective calls. Depending on the level, a different set of optimizations can be thought of. 10 | 11 | ## Learning Outcomes 12 | 13 | * Employ MPI derived datatypes for expressing the data layout in the file as well as the partitioning of the file data among the communicator processes. 14 | * Specify high-level information about I/O to the system rather than low-level system-dependent information. 15 | * Collect statistics on the actual read/write operations performed to validate the MPI-IO performance. 16 | 17 | -------------------------------------------------------------------------------- /k/1/3/1/5/2/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.5.2 PNetCDF 2 | 3 | Parallel-NetCDF (or PNetCDF) is a high-performance parallel I/O library for accessing Unidata's NetCDF. It is built upon MPI-IO, the I/O extension to MPI communications. 
Using the high-level NetCDF data structures, the Parallel-NetCDF libraries can make use of optimizations to efficiently distribute the file read and write operations between multiple processors. 4 | 5 | In addition to the conventional NetCDF read and write APIs, PNetCDF provides a new set of non-blocking APIs. Non-blocking APIs allow users to post multiple read and write requests first and let PNetCDF aggregate them into a large request, and hence achieve better performance. 6 | Through full use of existing optimizations available in the MPI-IO implementation, PNetCDF has been demonstrated to be able to deliver high-performance parallel I/O. 7 | 8 | Parallel I/O in the Unidata NetCDF library has been supported since release 4.0, for HDF5 data files. Since version 4.1.1 the Unidata NetCDF C library supports parallel I/O to classic and 64-bit offset files using the Parallel-NetCDF library, but with the NetCDF API. 9 | 10 | ## Learning Outcomes 11 | 12 | * Interpret the PNetCDF file structure: 13 | * Header 14 | * Non-record variables (all dimensions specified) 15 | * Record variables (ones with an unlimited dimension) 16 | * Store and retrieve data in PNetCDF. 17 | * Show basic API use and error checking. 18 | 19 | -------------------------------------------------------------------------------- /k/1/3/1/5/3/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.5.3 Parallel I/O principles 2 | 3 | Parallel I/O allows each processor in a multi-processor system to read and write data from multiple processes to a common file independently. 4 | 5 | Data-intensive scientific applications use parallel I/O software to access files. 6 | In HPC, increasing demands in the I/O system can cause bottlenecks. Parallel I/O plays a fundamental role in balancing the fast increase in computational power and the progress of processor architectures. 7 | 8 | ## Learning Outcomes 9 | 10 | * Use parallel I/O libraries. 11 | * Assess the implications of parallel I/O on application efficiency. 12 | * Implement an application that utilizes parallel I/O to store, retrieve and analyze data. 13 | * Improve the I/O throughput with a parallel I/O file system. 14 | * List I/O system simulators developed for parallel I/O. 15 | 16 | -------------------------------------------------------------------------------- /k/1/3/1/5/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.5 Parallel I/O 2 | 3 | Parallel I/O allows each processor in a multi-processor system to read and write data from multiple processes to a common file independently. 4 | 5 | Data-intensive scientific applications use parallel I/O software to access files. 6 | In HPC, increasing demands in the I/O system can cause bottlenecks. Parallel I/O plays a fundamental role in balancing the fast increase in computational power and the progress of processor architectures. 7 | 8 | ## Learning Outcomes 9 | 10 | * Use parallel I/O libraries with MPI-IO. 11 | * Use parallel I/O libraries with PNetCDF. 12 | * Demonstrate parallel I/O principles.
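A minimal sketch of the collective MPI-IO idea, using the mpi4py bindings (one possible route; the standard itself is language-neutral): every rank writes its own contiguous block of a shared file at an explicit byte offset with a collective call. The file name and block size are arbitrary choices for the example.

```python
# Minimal sketch of collective MPI-IO with mpi4py: each rank writes one block of a shared file.
# Run with e.g.: mpirun -np 4 python mpiio_write.py   (file name and sizes are illustrative)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

block = np.full(1024, rank, dtype=np.int32)          # this rank's data
offset = rank * block.nbytes                         # byte offset of this rank's block

fh = MPI.File.Open(comm, "blocks.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(offset, block)                       # collective write: all ranks participate
fh.Close()
```

Because the write is collective and the offsets are disjoint, all processes write into the same file consistently and the MPI-IO layer is free to aggregate the requests.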
13 | 14 | ## Subskills 15 | 16 | * [[skill-tree:k:1:3:1:5:1:b]] 17 | * [[skill-tree:k:1:3:1:5:2:b]] 18 | * [[skill-tree:k:1:3:1:5:3:b]] 19 | -------------------------------------------------------------------------------- /k/1/3/1/6/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1.6 POSIX 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand the POSIX interface 6 | 7 | -------------------------------------------------------------------------------- /k/1/3/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.1 I/O Layers 2 | 3 | ## Learning Outcomes 4 | 5 | * Discuss different media types. 6 | * Discuss different storage systems. 7 | * Discuss different file systems. 8 | * Demonstrate different middleware layers. 9 | * Sketch parallel I/O and demonstrate different APIs. 10 | * Discuss the POSIX interface for I/O. 11 | 12 | 13 | ## Subskills 14 | 15 | * [[skill-tree:k:1:3:1:1:b]] 16 | * [[skill-tree:k:1:3:1:2:b]] 17 | * [[skill-tree:k:1:3:1:3:b]] 18 | * [[skill-tree:k:1:3:1:4:b]] 19 | * [[skill-tree:k:1:3:1:5:b]] 20 | * [[skill-tree:k:1:3:1:6:b]] 21 | -------------------------------------------------------------------------------- /k/1/3/2/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.2 Data Reduction Techniques 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand Data reduction techniques 6 | 7 | -------------------------------------------------------------------------------- /k/1/3/3/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.3 Data Management 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand Cache coherence 6 | * Understand Information lifecycle management 7 | * Understand Staging 8 | * Understand Quota Policies 9 | * Understand Storage tiers 10 | * Understand Archiving 11 | 12 | -------------------------------------------------------------------------------- /k/1/3/4/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3.4 Access Pattern 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand access patterns 6 | * Understand how different access patterns influence data read and write operations 7 | 8 | -------------------------------------------------------------------------------- /k/1/3/b.txt: -------------------------------------------------------------------------------- 1 | # K1.3 Input/Output 2 | 3 | The hardware and software organization of input/output in a data center or supercomputer. 4 | 5 | ## Learning Outcomes 6 | 7 | * Analyse different I/O layers. 8 | * Demonstrate different data reduction techniques. 9 | * Outline data management with cache coherence or staging. 10 | * Outline access patterns for files and also parallel file access (a small illustration follows below).
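As a small, self-contained illustration of why access patterns matter, the Python sketch below times a sequential read against a backwards, seek-heavy read of the same file. The file name and sizes are arbitrary, and on a cached local disk the gap will be far smaller than on a parallel file system, so treat it only as a demonstration of the idea.

```python
# Illustrative only: compare a sequential read with a seek-heavy (reverse strided) read.
import os
import time

FILE, CHUNK, CHUNKS = "pattern_demo.bin", 1 << 20, 64   # 64 MiB of dummy data

with open(FILE, "wb") as f:                              # create the test file
    for _ in range(CHUNKS):
        f.write(os.urandom(CHUNK))

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f} s")

def sequential():
    with open(FILE, "rb") as f:
        while f.read(CHUNK):
            pass

def reverse_strided():
    with open(FILE, "rb") as f:
        for i in reversed(range(CHUNKS)):                # jump backwards chunk by chunk
            f.seek(i * CHUNK)
            f.read(CHUNK)

timed("sequential read", sequential)
timed("reverse strided read", reverse_strided)
os.remove(FILE)
```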
11 | 12 | 13 | ## Subskills 14 | 15 | * [[skill-tree:k:1:3:1:b]] 16 | * [[skill-tree:k:1:3:2:b]] 17 | * [[skill-tree:k:1:3:3:b]] 18 | * [[skill-tree:k:1:3:4:b]] 19 | -------------------------------------------------------------------------------- /k/1/4/b.txt: -------------------------------------------------------------------------------- 1 | # K1.4 Operation of an HPC System 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand the typical infrastructure of data and computing centers 6 | * Understand economic and business aspects in infrastructure decisions 7 | * Understand administration aspects of an HPC system 8 | * Understand user support aspects (typically on different levels) 9 | 10 | -------------------------------------------------------------------------------- /k/1/b.txt: -------------------------------------------------------------------------------- 1 | # K1 Supercomputers 2 | 3 | Supercomputers are computers that lead the world in terms of processing capacity, and particularly in the speed of calculations, at the time of their introduction. 4 | 5 | ## Learning Outcomes 6 | 7 | * Discuss various system-, hardware-, and I/O-architectures used for supercomputers, i.e. computers that led the world in terms of processing capacity, and particularly in the speed of calculations, at the time of their introduction, or share key architectural aspects with these computers 8 | * Discuss different hardware architectures. 9 | * Experiment with I/O operations on different storage media and also on parallel file systems. 10 | * Outline typical operation of data and computing centers. 11 | 12 | ## Subskills 13 | 14 | * [[skill-tree:k:1:1:b]] 15 | * [[skill-tree:k:1:2:b]] 16 | * [[skill-tree:k:1:3:b]] 17 | * [[skill-tree:k:1:4:b]] 18 | -------------------------------------------------------------------------------- /k/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # K2.2 Bounds for a Parallel Program 2 | 3 | ## Learning Outcomes 4 | 5 | * Explain how performance bounds of the various components of the HPC system (e.g. CPU, caches, memory) can limit the overall performance of a parallel program. 6 | * Explain how performance bounds of the various components of the HPC system (e.g. network, I/O) can limit the overall performance of a parallel program. 7 | 8 | -------------------------------------------------------------------------------- /k/2/3/b.txt: -------------------------------------------------------------------------------- 1 | # K2.3 Performance Characteristics 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand Performance characteristics of I/O 6 | * Understand Performance characteristics of CPU Usage 7 | * Understand Performance characteristics of Memory 8 | * Understand Performance characteristics of Communication 9 | 10 | -------------------------------------------------------------------------------- /k/2/b.txt: -------------------------------------------------------------------------------- 1 | # K2 Performance Modeling 2 | 3 | HPC systems are massively parallel, and therefore sophisticated parallel programs are required to exploit their performance potential as much as possible; this includes, but is not limited to, modeling the I/O performance.
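One classic way to reason about such performance bounds is Amdahl's law, sketched below as an illustration: if a fraction s of a program stays serial, the speedup on p processing elements is capped at 1 / (s + (1 - s)/p), no matter how many more resources are added. The serial fraction used here is just an assumed value for the example.

```python
# Illustrative sketch of Amdahl's law as a simple performance bound.
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup when a fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

serial_fraction = 0.05                      # assumed: 5% of the runtime stays serial
for p in (1, 8, 64, 512):
    print(f"{p:>4} processors -> speedup bound {amdahl_speedup(serial_fraction, p):5.1f}")
# Even with 512 processors the speedup stays below 1/0.05 = 20.
```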
4 | 5 | ## Learning Outcomes 6 | 7 | * Describe how the performance of parallel programs may be assessed. 8 | * Describe bounds for the performance of parallel programs. 9 | * Describe different performance characteristics. 10 | 11 | ## Subskills 12 | 13 | * [[skill-tree:k:2:1:b]] 14 | * [[skill-tree:k:2:2:b]] 15 | * [[skill-tree:k:2:3:b]] 16 | -------------------------------------------------------------------------------- /k/3/1/b.txt: -------------------------------------------------------------------------------- 1 | # K3.1 Level of Parallelization 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe parallelization techniques at the intra-node level (e.g. based on basic OpenMP features). 6 | * Apply the message-passing paradigm based on environments like MPI, which is the de-facto standard at the inter-node level for parallelizing programs using more than a single node. 7 | * Understand Intra- and Inter-Node, as well as Multi Level approaches 8 | 9 | -------------------------------------------------------------------------------- /k/3/2/b.txt: -------------------------------------------------------------------------------- 1 | # K3.2 Parallelization Overheads 2 | 3 | Parallelization of a program always introduces some extra work in addition to the work done by the sequential version of the program. 4 | The main sources of parallelization overhead are data communication (between processes) and synchronization (of processes and threads). 5 | Other sources are additional operations that are introduced at the algorithmic level (for example in global reduction operations) or at a lower software level (for example by address calculations). 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | * Comprehend that **Data communication** is necessary for programs that are parallelized for distributed memory computers (if data communication is not necessary the program is called trivially or embarrassingly parallel). 11 | * Comprehend that **Synchronization** plays an important role with shared memory parallelization. 12 | * Comprehend that there are also other sources of parallel inefficiency like: 13 | * Parts of a program that were not parallelized and still run serially. 14 | * Unbalanced loads. 15 | * Comprehend that there are two hardware effects that can reduce the efficiency of the execution of shared-memory parallel programs: 16 | * **NUMA** can lead to noticeable performance degradation if data locality is bad (i.e. if too much data that a thread need is not in its NUMA domain). 17 | * **False sharing** occurs if threads process data that is in the same data cache line, which can lead to serial execution or even take longer than explicitly serial execution of the affected piece of code. 18 | * Comprehend the overheads caused by redundant computations 19 | * Comprehend the problems of execution speed noise (OS jitter, cache contention, thermal throttling, etc.), and typical trade-offs (e.g. reducing the synchronization overhead by increasing the communication overhead) 20 | 21 | -------------------------------------------------------------------------------- /k/3/3/b.txt: -------------------------------------------------------------------------------- 1 | # K3.3 Domain Decomposition 2 | 3 | Domain decomposition is a technique for parallelizing programs that perform simulations in engineering or natural sciences. 4 | Such a technique is needed on distributed memory systems. 
5 | On distributed memory systems the computational work and, in general, the data have to be decomposed to enable parallel computations that employ several compute processes (which implies the possibility to run on multiple compute nodes). 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | * Describe typical decomposition strategies to split a domain into subdomains to make it suited for parallel processing. 11 | * Discuss measures like surface to volume ratio. 12 | * Comprehend that in a domain decomposition a region is decomposed, e.g. box is split into smaller ones and a mesh is decomposed into smaller parts, in order to assign these subdomains to processes. 13 | * Comprehend that in order to update a variable that is defined on a site of a mesh, or for a particle, in general data from neighbouring sites or particles is needed. 14 | * Comprehend that some neighbour regions expand beyond the sub-domain of their process, i.e. neighbour regions are partly stored on remote processes (Halo regions) and must be made available on the local process before updating can begin. 15 | * Comprehend that the halo exchange needed in parallel computer simulations is one kind of parallel overhead and has a performance impact, that everybody, who is running such simulations, should know about: 16 | * Relative overhead is approximately proportional to the surface (Halo exchange) to volume (amount of work) ratio of a sub-domain. 17 | 18 | -------------------------------------------------------------------------------- /k/3/4/b.txt: -------------------------------------------------------------------------------- 1 | # K3.4 Autoparallelization 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe the auto parallelization capabilities of current compilers (e.g. to automatically parallelize suitable loops), which are applicable at the intra-node level. 6 | 7 | -------------------------------------------------------------------------------- /k/3/b.txt: -------------------------------------------------------------------------------- 1 | # K3 Program Parallelization 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe the typical parallelization techniques used at the intra- and inter-node level of cluster systems. 6 | * Discuss the causes of parallelization overheads, which eventually prevent efficient use of an increasing number of processing elements. 7 | * Apply domain decomposition strategies (i.e. splitting a problem into pieces that allow for parallel computation). 8 | * Apply Autoparallelization to a program. 9 | 10 | ## Subskills 11 | 12 | * [[skill-tree:k:3:1:b]] 13 | * [[skill-tree:k:3:2:b]] 14 | * [[skill-tree:k:3:3:b]] 15 | * [[skill-tree:k:3:4:b]] 16 | -------------------------------------------------------------------------------- /k/4/2/b.txt: -------------------------------------------------------------------------------- 1 | # K4.2 SLURM Workload Manager 2 | 3 | SLURM is a widely used open-source workload manager providing various advanced features. 4 | 5 | 6 | ## Learning Outcomes 7 | 8 | * Run interactive jobs with salloc, a batch job with sbatch. 9 | * Explain the architecture of SLURM, i.e., the role of slurmd, srun and the injection of environment variables. 10 | * Explain the function of the tools: sacct, sbatch, salloc, srun, scancel, squeue, sinfo. 11 | * Explain time limits and the benefit of a backfill scheduler. 12 | * Comprehend that environment variables are set when running a job. 13 | * Comprehend and describe the expected behavior of simple job scripts. 
14 | * Comprehend how variables are prioritized when using the command line and a script. 15 | * Change a provided job template and embed it into shell scripts to run a variety of parallel applications. 16 | * Analyze the output generated from submitting to the job scheduler and the typically generated errors. 17 | 18 | -------------------------------------------------------------------------------- /k/4/3/b.txt: -------------------------------------------------------------------------------- 1 | # K4.3 Scheduling strategies 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand Backfilling 6 | * Understand Fair Share 7 | * Understand Shortest Job First 8 | * Understand First Come First Served 9 | 10 | -------------------------------------------------------------------------------- /k/4/b.txt: -------------------------------------------------------------------------------- 1 | # K4 Job Scheduling 2 | 3 | Parallel computers are operated differently than a normal PC: all users must share the system. 4 | Therefore, various operative procedures are in place. Users must understand these concepts and procedures to be able to use the available resources of a system to run a parallel application. 5 | A workload manager/job scheduler controls how available hardware resources are distributed among the user requests (jobs). 6 | 7 | Users of computing centers typically compete for the expensive HPC resources of cluster systems. 8 | HPC resources can be distinguished as 9 | * Shared resources (e.g. a parallel file system that is often shared across all cluster nodes and therefore shared between all users), 10 | * Not-shared resources (e.g. cluster nodes dedicated to a particular parallel program of an individual user). 11 | 12 | The configuration of the cluster system matters as well: a cluster node can also be a resource that is shared between several users. 13 | 14 | A major aspect of job scheduling is to manage these resources in a way that users are treated fairly. 15 | Accounting for users or user groups can additionally support this. 16 | 17 | ## Learning Outcomes 18 | 19 | * Comprehend the principles of job scheduling and why programs managing the jobs are required. 20 | * Demonstrate the SLURM workload manager. 21 | * Discuss different scheduling strategies. 22 | 23 | ## Subskills 24 | 25 | * [[skill-tree:k:4:1:b]] 26 | * [[skill-tree:k:4:2:b]] 27 | * [[skill-tree:k:4:3:b]] 28 | -------------------------------------------------------------------------------- /k/5/b.txt: -------------------------------------------------------------------------------- 1 | # K5 Modeling Costs 2 | 3 | The user's awareness of the costs related to the operation of an HPC system is raised. 4 | 5 | For the resources of an HPC system, a distinction is made between costs for the computing elements of the supercomputer and costs for the storage system. 6 | 7 | ## Learning Outcomes 8 | 9 | * Describe the impact of a cluster node's type (e.g. CPU type, main memory expansion, or GPU extensions) and of the storage media type (SSD, disk, or e.g. tape for long term archiving (LTA) purposes) on its costs. 10 | * Describe how to assess runtime costs for jobs. 11 | * Discuss how to assess the costs for the infrastructure of data and computing centers as well as their personnel costs. 12 | * Explain economic and business aspects, e.g., break-even considerations, when personnel costs for tuning a parallel program and savings through speedups achieved are compared.
13 | * Understand costs of resources and remember 14 | * Cluster node types 15 | * Runtime costs 16 | * Storage systems 17 | * Understand costs of data/compute centers 18 | * Understand economic and business aspects for modelling costs 19 | * Understand personnel costs of tuning and remember 20 | * Break-even considerations 21 | * Savings through speedups 22 | 23 | -------------------------------------------------------------------------------- /k/6/b.txt: -------------------------------------------------------------------------------- 1 | # K6 Data Management Plan 2 | 3 | Research Data Management (RDM) has become a much-hyped topic within the past years: Funding agencies demand data management plans (DMPs), research institutions set up data management policies and guidelines, and national projects aim at sustainably establishing research data infrastructures (NFDI). But why? 4 | Data is one of the most important assets in science. Ever faster growing volumes of data need to be handled during the project lifetime and beyond. Good scientific practice demands long-standing and traceable research outcomes. Therefore it is important not to lose track of the origin and processing of data. Additionally, aligned data management routines potentially increase the efficiency of research processes and groups. 5 | Nonetheless, the past has shown that data tends to get lost, but usually not physically: Knowledge about data gets lost over time, due to common staff turnover in science and inappropriate or missing documentation. This leads to large and expensive data silos, stuffed with useless, 'dark' data, time-consuming and frustrating to deal with. 6 | 7 | Participants should be aware of the risk of dark data and the benefits of data management. They should know about theoretical concepts of data management and practical strategies and techniques to implement aligned data management solutions. They should be able to describe and structure data, and find and use metadata standards. They should know what to consider when publishing research data and how to find suitable data repositories. 8 | 9 | ## Learning Outcomes 10 | 11 | * Understanding the risk of losing (knowledge about) data 12 | * Knowing about RDM concepts 13 | * Knowing techniques and abilities to minimize the risk of data loss 14 | * Able to adequately describe and structure data 15 | * Able to apply metadata schemas and create metadata profiles 16 | * Knowing what to consider when publishing data 17 | * Able to find suitable data repositories 18 | 19 | ## Maintainer 20 | 21 | * Christian Löschen @ TU Dresden 22 | 23 | -------------------------------------------------------------------------------- /k/b.txt: -------------------------------------------------------------------------------- 1 | # K HPC Knowledge 2 | 3 | The theoretical knowledge of HPC provides the background to understand how supercomputers and HPC environments operate. 4 | This enables practitioners to effectively use such environments. 5 | 6 | ## Learning Outcomes 7 | 8 | * Explain the hardware, software, and operation of HPC systems. 9 | * Construct and judge simple performance models for systems and applications. 10 | * Understand that there are performance frontiers. 11 | * Compare different paradigms for the parallelization of applications. 12 | * Construct and execute an HPC workflow on an HPC system. 13 | * Comprehend job scheduling principles. 14 | * Apply a cost model to compute the costs for running a workflow on an HPC system (a minimal sketch follows this list). 15 | * Discuss data management plans.
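The following is a minimal, hypothetical sketch of such a cost model: it assumes a job is charged by node-hours at an illustrative price, whereas real centers apply their own rates and accounting rules.

```python
# Minimal sketch of a node-hour cost model for a single job (illustrative only).
# The price per node-hour is a made-up example value, not a real accounting rate.

def job_cost(nodes: int, runtime_hours: float, price_per_node_hour: float = 0.85) -> float:
    """Return the charged cost of a job that used `nodes` nodes for `runtime_hours` hours."""
    node_hours = nodes * runtime_hours
    return node_hours * price_per_node_hour

if __name__ == "__main__":
    # Example: a 16-node job that ran for 3.5 hours.
    print(f"Estimated cost: {job_cost(nodes=16, runtime_hours=3.5):.2f} EUR")
```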
16 | 17 | ## Subskills 18 | 19 | * [[skill-tree:k:1:b]] 20 | * [[skill-tree:k:2:b]] 21 | * [[skill-tree:k:3:b]] 22 | * [[skill-tree:k:4:b]] 23 | * [[skill-tree:k:5:b]] 24 | * [[skill-tree:k:6:b]] 25 | -------------------------------------------------------------------------------- /leftover.txt: -------------------------------------------------------------------------------- 1 | 2 | * check disk quotas commonly used to limit the amount of disk space available for the user 3 | * write robust job scripts, e.g. to simplify job submissions by the help of automated job chaining 4 | * read keyboard input to add interactivity to scripts 5 | -------------------------------------------------------------------------------- /pe/1/1/b.txt: -------------------------------------------------------------------------------- 1 | # PE1.1 Time to Solution Constraints 2 | 3 | Optimizing the performance of engineering systems by understanding and managing time to solution constraints is vital for cost-effective and efficient project completion. This course introduces techniques and strategies to minimize the time required to reach viable solutions. 4 | 5 | ## Requirements 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | 1. **Remember the fundamental concepts** and definitions related to time to solution constraints in performance engineering. 11 | 2. **Understand the implications** of various constraints on project timelines and outcomes. 12 | 3. **Apply standard methods** and tools to measure and evaluate time to solution in different scenarios. 13 | 4. **Analyze data** to identify critical constraints that adversely affect project timelines. 14 | 5. **Evaluate the effectiveness** of different strategies for managing time to solution constraints in project environments. 15 | 6. **Create innovative solutions** and optimization strategies to effectively reduce time to solution while maintaining quality and budget considerations. 16 | 7. **Summarize key factors** and metrics that influence time to solution in engineering projects. 17 | 8. **Interpret project data and reports** to forecast potential delays and prepare mitigation strategies. 18 | 9. **Design a framework** for continuous improvement in managing time to solution constraints across multiple projects. 19 | 10. **Assess the impact** of implemented solutions on overall project efficiency and cost reduction. 20 | 11. **Formulate plans** to integrate time to solution constraints into broader project scopes and long-term strategies. 21 | 12. **Synthesize various approaches** for overcoming specific time to solution challenges in complex projects. 22 | 13. **Critique existing methodologies** and propose modifications to enhance efficiency and effectiveness in achieving desired outcomes. 23 | 14. **Construct models** to simulate different scenarios and predict the effects of various time to solution strategies. 24 | 15. **Discuss the ethical considerations** in time management, emphasizing transparency and accountability in project reporting and stakeholder communications. 25 | 16. **Explore emerging tools and technologies** that could revolutionize time to solution strategies in performance engineering. 
26 | 27 | AI generated content 28 | -------------------------------------------------------------------------------- /pe/1/2/b.txt: -------------------------------------------------------------------------------- 1 | # PE1.2 Total Cost of Ownership 2 | 3 | Understanding the Total Cost of Ownership (TCO) is crucial for making informed decisions in performance engineering. This course dives into the comprehensive accounting of all costs associated with the acquisition, operation, and maintenance of engineering assets. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | 1. **Define the concept** of Total Cost of Ownership and its relevance to performance engineering. 10 | 2. **Identify all components** that contribute to the TCO of engineering projects, including hidden and indirect costs. 11 | 3. **Calculate TCO** for various engineering assets using structured financial models. 12 | 4. **Analyze case studies** to illustrate the impact of TCO on project decision-making and long-term sustainability. 13 | 5. **Evaluate different purchasing strategies** and their effects on TCO, including lease vs. buy decisions. 14 | 6. **Assess the influence** of maintenance and operational practices on the TCO of assets. 15 | 7. **Compare TCO** across different technology solutions to determine the most cost-effective choice. 16 | 8. **Develop a strategic approach** to minimize TCO through efficient design, procurement, and asset management techniques. 17 | 9. **Utilize software tools** and frameworks to automate TCO calculations and provide actionable insights. 18 | 10. **Forecast long-term costs** and benefits associated with different project options to guide strategic planning. 19 | 11. **Interpret financial reports** and data to enhance TCO-based decision-making. 20 | 12. **Synthesize TCO analysis** with environmental and social governance (ESG) considerations to promote sustainable practices. 21 | 13. **Discuss the challenges** of accurate TCO estimation in complex and uncertain environments. 22 | 14. **Propose innovative solutions** to reduce TCO while maintaining or enhancing performance and compliance. 23 | 15. **Critique existing TCO models** and methodologies to identify areas for improvement. 24 | 16. **Explore future trends** in cost management and how they might influence TCO calculations in performance engineering. 25 | 17. **Implement risk assessment strategies** to mitigate financial uncertainties in TCO evaluations. 26 | 18. **Examine regulatory impacts** on TCO, including compliance costs and potential financial incentives. 27 | 19. **Engage with experts** through seminars and workshops to gain industry insights on TCO management. 28 | 20. **Design project scenarios** to apply TCO calculations in real-world engineering challenges. 29 | 21. **Lead discussions** on ethical considerations in cost reporting and TCO manipulation. 30 | 31 | AI generated content 32 | -------------------------------------------------------------------------------- /pe/1/b.txt: -------------------------------------------------------------------------------- 1 | # PE1 Cost Awareness 2 | 3 | Cost awareness is a crucial aspect of Performance Engineering (PE), focusing on understanding the economic implications of performance optimization decisions in High Performance Computing (HPC) environments. In this overview, we explore the importance of cost awareness, considering factors such as time to solution constraints and the total cost of ownership. 
4 | 5 | **Time To Solution Constraints (PE1.1):** Time to solution constraints refer to the limitations imposed by deadlines or time-sensitive requirements on completing computational tasks within specified time frames. This branch examines techniques for estimating the costs associated with missing time to solution constraints, including potential revenue loss, project delays, and missed opportunities. Understanding time to solution constraints enables organizations to prioritize performance optimization efforts and meet critical deadlines effectively. 6 | 7 | **Total Cost of Ownership (PE1.2):** The total cost of ownership (TCO) encompasses all direct and indirect costs associated with acquiring, deploying, and maintaining HPC infrastructure over its entire lifecycle. This section explores methodologies for calculating TCO, including initial investment costs, operational expenses, maintenance costs, and end-of-life disposal costs. Mastery of TCO analysis enables organizations to make informed decisions about HPC investments, optimize resource allocation, and minimize long-term costs while maximizing value. 8 | 9 | By incorporating cost awareness into performance engineering practices, organizations can align performance optimization efforts with budget constraints, prioritize investments based on cost-effectiveness, and achieve better overall efficiency and return on investment in HPC environments. 10 | 11 | ## Requirements 12 | 13 | ## Learning Outcomes 14 | 15 | * Identify typical constraints on the time to solution. 16 | * Estimate the total cost of ownership. 17 | 18 | ## Subskills 19 | 20 | * [[skill-tree:pe:1:1:b]] 21 | * [[skill-tree:pe:1:2:b]] 22 | 23 | AI generated content 24 | -------------------------------------------------------------------------------- /pe/2/1/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.1 Utilization Principles 2 | 3 | Mastering the principles of utilization is essential for effectively measuring and enhancing system performance. This course delves into the theoretical and practical aspects of system utilization metrics and optimization strategies. 4 | 5 | ## Requirements 6 | 7 | 8 | ## Learning Outcomes 9 | 10 | * Understand CPU vs. Elapsed Times 11 | * Understand Shared and Unshared Memory 12 | * Understand I/O Statistics (Devices and File Systems) 13 | * Understand Page Faults 14 | 15 | -------------------------------------------------------------------------------- /pe/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.2 I/O Performance 2 | 3 | There are several aspects involved in delivering high I/O performance to parallel applications, from hardware characteristics to methods that manipulate workloads to improve achievable performance. 4 | 5 | Running the same application with different I/O configurations makes it possible to tune the I/O system according to the application access pattern. 6 | 7 | One way to predict application performance in HPC systems with different I/O configurations is by using modelling and simulation techniques. 8 | Modeling the system allows assessing the obtained performance and thereby estimating the performance potentially gained by optimizations. 9 | 10 | File systems are implemented in the operating system, which deploys strategies to improve performance such as scheduling, caching and aggregation. 11 | Therefore, the observable I/O performance depends on more than the capabilities of the raw block device.
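As a minimal illustration of the kind of measurement this skill builds on, the following sketch times a simple file write and reports an effective throughput; the file name and size are arbitrary example values, and because of the caching and aggregation mentioned above the observed rate may differ substantially from what the raw block device can sustain.

```python
# Minimal sketch: measure the effective throughput of writing a file.
# Path and size are arbitrary example values; OS-level caching and aggregation
# mean the observed rate can differ greatly from the raw device bandwidth.
import os
import time

def write_throughput(path: str = "testfile.bin", size_mib: int = 256) -> float:
    block = os.urandom(1024 * 1024)          # 1 MiB of random data
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mib):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                 # push the data out of the page cache
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mib / elapsed                # MiB/s

if __name__ == "__main__":
    print(f"Effective write throughput: {write_throughput():.1f} MiB/s")
```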
12 | 13 | 14 | ## Learning Outcomes 15 | 16 | * Select performance models to assess and optimize the application I/O performance. 17 | * Identify tools capable of predicting the behavior of applications in HPC. 18 | * Apply methods to manipulate workloads to improve achievable performance. 19 | * Understand metrics that describe relevant I/O performance characteristics. 20 | * Select tools to analyze I/O performance of applications. 21 | * Apply the performance analysis tools to applications and workflows. 22 | * Interpret the results of I/O analysis tools and distill optimization strategies for the application or system configurations. 23 | 24 | -------------------------------------------------------------------------------- /pe/2/3/1/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3.1 Score-P 2 | 3 | Collecting measurement data from HPC applications can be difficult. Score-P presents a generally uniform approach to collecting profiling and tracing data that can be applied to a broad range of HPC applications and allow measurements to be collected in many common HPC environments. In order to use this measurement system effectively, users must be aware of both the common performance engineering issues that affect all performance measurements and the specifics of how to use the various components of Score-P to collect accurate and relevant performance data. 4 | 5 | ## Learning Outcomes 6 | * Able to instrument applications including one or more parallel paradigms from the following: 7 | * MPI 8 | * SHMEM 9 | * OpenMP 10 | * Pthreads 11 | * Able to instrument applications including at least one specialized form of measurement from the following: 12 | * Accelerator usage 13 | * I/O performance 14 | * Hardware counters 15 | * Memory usage 16 | * Able to score measurements and effectively describe whether a measurement includes unacceptable perturbation 17 | * Able to create and use filter files to improve measurement quality and relevance 18 | * Able to manage measurement data through environment variables and scripting 19 | * Able to automate naming/filing of experiments in appropriate locations 20 | * Able to read and understand configuration and manifest files generated by Score-P 21 | 22 | ## Maintainer 23 | 24 | * Bert Wesarg, William (Bill) Williams, ZIH Tools Team @ TU Dresden 25 | 26 | -------------------------------------------------------------------------------- /pe/2/3/2/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3.2 Scalasca 2 | 3 | Analyzing detailed event traces and identifying performance problems and their root causes can be challenging. 4 | Scalasca automatically searches for patterns of inefficient behavior in OTF2 traces collected with Score-P and identifies the time a thread or process is waiting in an interaction with another thread or process. 5 | Furthermore, it helps in identifying the root causes of those waiting times by determining imbalances in the program, highlighting the part of the code and processes/threads that contribute the most to any waiting time found. 6 | 7 | ## Learning Outcomes 8 | 9 | * Able to automatically run Score-P instrumented measurements (including automatic 10 | analysis) of applications containing 11 | - MPI, and 12 | - OpenMP and/or Pthreads. 
13 | * Able to identify waiting time in the application measurement 14 | * Able to identify the delay (imbalance) causing the waiting time 15 | * Derive solution hypotheses from the analysis report to resolve the 16 | performance problems identified. 17 | 18 | ## Maintainer 19 | 20 | * Marc-André Hermanns, HPC Group, IT Center, RWTH Aachen University 21 | -------------------------------------------------------------------------------- /pe/2/3/3/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3.3 Vampir 2 | 3 | Once tracing data has been collected by a measurement system, the challenge of interpreting this data remains. Trace data may be post-processed in a variety of ways, or it may be visualized directly. Direct visualization allows an analyst to examine the full details of the event trace in a manual analysis. Vampir is a tool that focuses on providing quality visualization to support manual trace analysis. Users of Vampir need to understand both how to use the interface in order to focus on particular areas of interest in a given trace and how to understand the trace data. Ideally, this will allow them to either collect a more focused measurement or to perform successful optimization of their code. 4 | 5 | 6 | 7 | ## Learning Outcomes 8 | * Able to launch Vampir both stand-alone and connected to a VampirServer instance if available 9 | * Able to use the function summary to determine at a high level what parts of the code may not perform well 10 | * Able to use the summary timeline to quickly identify changed behavior over time and determine next steps for investigation 11 | * Able to identify when a small number of processes have divergent behavior from the rest of the application 12 | * Able to use the performance radar and counter timelines to correlate performance metrics with the code that is executing 13 | * Able to use the communication matrix to describe communication patterns and identify potential bottlenecks and/or imbalances 14 | 15 | ## Maintainer 16 | 17 | * Bert Wesarg, William (Bill) Williams, ZIH Tools Team @ TU Dresden 18 | 19 | -------------------------------------------------------------------------------- /pe/2/3/4/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3.4 Darshan 2 | 3 | 4 | ## Learning Outcomes 5 | 6 | * Understand Darshan 7 | 8 | -------------------------------------------------------------------------------- /pe/2/3/5/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3.5 PIKA 2 | 3 | Analyzing application performance in HPC can be a very challenging task. It depends on both the performance analysis tools and the build system of your application. In most cases, applications need to be instrumented to enable performance measurements with the desired tool. PIKA provides HPC users with a first insight into the performance of their application without instrumenting the code. It supplies an interactive web interface for each HPC job that displays resource utilization for various performance metrics.
4 | 5 | ## Learning Outcomes 6 | * Able to detect pathological performance behavior: 7 | * Requested memory > used memory 8 | * Requested compute resources (CPU cores, GPUs) > used compute resources 9 | * Able to understand the resource utilization based on the application algorithm 10 | * Able to determine possible limitations by resources: 11 | * memory-bound 12 | * compute-bound 13 | * GPU-bound 14 | * IO-heavy 15 | * network-heavy 16 | * Able to find performance bottlenecks by correlating various performance metrics 17 | * e.g.: blocking I/O operations result in lower FLOPS 18 | 19 | ## Maintainer 20 | 21 | * Frank Winkler, ZIH Team @ TU Dresden 22 | -------------------------------------------------------------------------------- /pe/2/3/6/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3.6 Lo2s 2 | 3 | Collecting measurement data from HPC applications can be difficult. Lightweight Node-Level Performance Monitoring (Lo2s, IPA: [ˈloːtʊs]) provides, with its system monitoring, a novel viewpoint on application behavior, which can be used to find performance bottlenecks on the node level. In order to use this measurement system effectively, users must be aware of both the common performance-engineering issues that affect all performance measurements and the specifics of how to use the two measurement modes and the various additional metric recordings of Lo2s to collect accurate and relevant performance data. 4 | 5 | ## Learning Outcomes 6 | 7 | * Able to sample applications including one or more parallel paradigms from the following: 8 | * MPI (node-level) 9 | * OpenMP 10 | * Pthreads 11 | * Able to sample applications and nodes including at least one specialized form of measurement from the following: 12 | * I/O performance 13 | * Hardware counters 14 | * Perf metrics 15 | * Linux tracepoints 16 | * Able to use Lo2s in application monitoring and system monitoring mode 17 | 18 | ## Maintainer 19 | 20 | * Mario Bielert, ZIH Tools Team @ TU Dresden 21 | 22 | -------------------------------------------------------------------------------- /pe/2/3/7/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3.7 NVIDIA Nsight Systems 2 | 3 | CUDA applications can be optimized in various places. The NVIDIA Nsight Systems profiler helps users identify those parts of their code that are most suitable for optimizations. This includes, but is not limited to, memory transfers, compute optimizations and kernel overlap. 4 | 5 | ## Learning Outcomes 6 | 7 | * Use the CLI to identify common optimization targets 8 | * Generate traces for analysis in the GUI 9 | * Understand how to use the GUI to find inefficient parts in an application 10 | 11 | ## Maintainer 12 | 13 | * Markus Velten, ZIH Team @ TU Dresden 14 | 15 | -------------------------------------------------------------------------------- /pe/2/3/b.txt: -------------------------------------------------------------------------------- 1 | # PE2.3 Profiling tools 2 | 3 | Profiling is explained for the CPU level, where it can be supported by hardware performance counters and by sampling techniques. 4 | 5 | Sampling is used to see, by examining the program counter, what routines and source code lines of a program are responsible for which portions of the total runtime. 6 | 7 | Automatically adding trace code to a parallel program by so-called instrumentation to record its execution in a strict chronology is explained, and the difference to profiling is emphasized.
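To illustrate the basic idea of attributing runtime to individual routines, here is a small, hypothetical sketch using Python's built-in cProfile module. It records function calls deterministically rather than sampling the program counter, and it is no substitute for the HPC-scale tools covered in this skill, but the resulting per-function report conveys the same principle.

```python
# Minimal sketch: attribute runtime to individual functions with cProfile.
# cProfile uses deterministic (call-based) profiling, not statistical sampling,
# but the per-function breakdown illustrates what a profile looks like.
import cProfile
import pstats

def slow_part(n: int) -> float:
    return sum(i ** 0.5 for i in range(n))

def fast_part(n: int) -> int:
    return sum(range(n))

def workload() -> None:
    slow_part(2_000_000)
    fast_part(2_000_000)

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.runcall(workload)
    # Print the ten functions with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```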
8 | 9 | Similar techniques are explained for profiling the network level (e.g. based on InfiniBand counters and I/O server states). 10 | 11 | ## Learning Outcomes 12 | 13 | * Demonstrate the use of Score-P for collecting program traces. 14 | * Demonstrate the use of Scalasca for analyzing traces. 15 | * Demonstrate the analysis of program traces using Vampir. 16 | * Understand Darshan. 17 | * Demonstrate PIKA to check the performance of any program without instrumenting it. 18 | * Demonstrate collecting traces of a program using Lo2s. 19 | * Demonstrate the NVIDIA Nsight Systems profiler for analyzing CUDA code. 20 | 21 | ## Subskills 22 | 23 | * [[skill-tree:pe:2:3:1:b]] 24 | * [[skill-tree:pe:2:3:2:b]] 25 | * [[skill-tree:pe:2:3:3:b]] 26 | * [[skill-tree:pe:2:3:4:b]] 27 | * [[skill-tree:pe:2:3:5:b]] 28 | * [[skill-tree:pe:2:3:6:b]] 29 | * [[skill-tree:pe:2:3:7:b]] 30 | -------------------------------------------------------------------------------- /pe/2/b.txt: -------------------------------------------------------------------------------- 1 | # PE2 Measuring System Performance 2 | 3 | ## Learning Outcomes 4 | 5 | * Measure the system performance with the help of standard tools and by profiling in order to assess the runtime behavior of parallel programs. 6 | * Compute the utilization of a machine by various measures. 7 | * Discover the I/O performance of a program and select an appropriate model. 8 | 9 | 10 | ## Subskills 11 | 12 | * [[skill-tree:pe:2:1:b]] 13 | * [[skill-tree:pe:2:2:b]] 14 | * [[skill-tree:pe:2:3:b]] 15 | -------------------------------------------------------------------------------- /pe/3/1/b.txt: -------------------------------------------------------------------------------- 1 | # PE3.1 Design & Documentation 2 | 3 | Design and documentation are fundamental aspects of performance engineering, providing the framework for effective benchmarking and system evaluation. This course emphasizes the importance of meticulous design and comprehensive documentation in creating reproducible and reliable performance benchmarks. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the role** of design and documentation in the context of performance engineering. 10 | * **Develop comprehensive design documents** that clearly outline system architecture and performance criteria. 11 | * **Create detailed documentation** that supports the reproducibility and verification of performance benchmarks. 12 | * **Apply best practices** in technical writing to enhance clarity and precision in documentation. 13 | * **Utilize diagrams and flowcharts** to visually represent system designs and benchmarking processes. 14 | * **Implement version control systems** to manage changes in design documents and maintain historical records. 15 | * **Coordinate with multidisciplinary teams** to gather input and ensure all aspects of system performance are documented. 16 | * **Evaluate existing documentation** to identify gaps in information and areas for improvement. 17 | * **Integrate feedback mechanisms** into documentation practices to continually refine and improve design clarity. 18 | * **Train team members** on effective documentation techniques and standards. 19 | * **Review and update documentation** regularly to reflect system upgrades and changes. 20 | * **Develop templates and guidelines** for performance engineering documentation to standardize practices across projects.
21 | * **Discuss the ethical implications** of documentation in maintaining transparency and accountability in engineering projects. 22 | * **Explore software tools** that aid in the documentation and design process, such as CAD for system layouts and benchmarking software for performance analysis. 23 | * **Simulate documentation audits** to prepare for compliance checks and internal reviews. 24 | * **Lead workshops** on effective design and documentation strategies within the organization. 25 | * **Analyze case studies** where effective documentation has led to successful benchmarking and system optimizations. 26 | * **Implement a feedback loop** from field performance data to refine initial designs and documentation. 27 | * **Develop a comprehensive understanding** of the regulatory requirements impacting design documentation in different industries. 28 | * **Craft a documentation strategy** that aligns with organizational goals and performance benchmarks. 29 | * **Master technical communication skills** to convey complex performance data in understandable formats. 30 | * **Assess the impact of design decisions** on system performance through detailed documentation reviews. 31 | * **Incorporate sustainability considerations** into system design documentation. 32 | * **Facilitate effective knowledge transfer** through well-organized documentation practices. 33 | * **Develop performance narratives** that align technical documentation with stakeholder needs and expectations. 34 | 35 | AI generated content 36 | -------------------------------------------------------------------------------- /pe/3/2/b.txt: -------------------------------------------------------------------------------- 1 | # PE3.2 Controlled Experiments 2 | 3 | Controlled experiments are crucial for validating performance improvements and understanding the variables that impact system efficiency. This course provides a deep dive into the design, execution, and analysis of controlled experiments in the context of performance engineering. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Differentiate types of benchmarks: 10 | * The Linpack benchmark is used, for example, to build the TOP 500 list of the currently fastest supercomputers, which is updated twice a year. 11 | * For HPC users, however, synthetic tests to benchmark HPC cluster hardware (like the Linpack benchmark) are of less importance, because the emphasis lies on the determination of speedups and efficiencies of the parallel program they want to use. 12 | * Comprehend that benchmarking is very essential in the HPC environment and can be applied to a variety of issues: 13 | * What is the scalability of my program? 14 | * How many cluster nodes can be maximally used, before the efficiency drops to values which are unacceptable? 15 | * How does the same program perform in different cluster environments? 16 | * Comprehend that benchmarking is also a basis for dealing with questions emerging from tuning, e.g.: 17 | * What is the appropriate task size (big vs. small) that may have a positive performance impact on my program? 18 | * Is the use of hyper-threading technology advantageous? 19 | * What is the best mapping of processes to nodes, pinning of processes/threads to CPUs or cores, and setting memory affinities to NUMA nodes in order to speed up a parallel program? 20 | * What is the best compiler selection for my program (GCC, Intel, PGI, ...), in combination with the most suitable MPI environment (Open MPI, Intel MPI, ...)? 
21 | * What is the best compiler generation/version for my program? 22 | * What are the best compiler options regarding, for example, the optimization level -O2, -O3, ..., for building the executable program? 23 | * Is the use of PGO (Profile Guided Optimization) or other high-level optimization, e.g. using IPA/IPO (Inter-Procedural Analyzer/Inter-Procedural Optimizer), helpful? 24 | * What is the performance behavior after a (parallel) algorithm has been improved, i.e. to what extent are speedup, efficiency, and scalability improved? 25 | * Assess speedups and efficiencies as the key measures for benchmarks of a parallel program. 26 | * Benchmark the runtime behaviour of parallel programs, performing controlled experiments by providing varying HPC resources (e.g. 1, 2, 4, 8, ... cores on shared memory systems or 1, 2, 4, 8, ... nodes on distributed systems for the benchmarks). 27 | * Measure runtimes with the help of tools like: 28 | * Built-in time command, e.g. for MPI programs ('time mpirun ... my-mpi-app'). 29 | * Stand-alone 'time' program, e.g. for sequential or OpenMP programs ('/usr/bin/time my-openmp-app'). 30 | 31 | -------------------------------------------------------------------------------- /pe/3/3/b.txt: -------------------------------------------------------------------------------- 1 | # PE3.3 Strong vs. Weak Scaling 2 | 3 | Understanding the concepts of strong and weak scaling is crucial for evaluating the scalability of systems and applications in performance engineering. This course explores the differences between these scaling types, their implications on system performance, and best practices for their application in real-world scenarios. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Differentiate types of scaling: 10 | * **Weak scaling**: the problem size increases proportionally to the number of parallel processes, to analyze how big the problems are that can be solved. 11 | * **Strong scaling**: the problem size remains the same for an increasing number of processes, to analyze how fast a problem of a given size can be solved. 12 | * Interpret typical weak and strong scaling plots (a minimal sketch of the underlying speedup and efficiency measures follows this list). 13 | * Avoid typical pitfalls: 14 | * **Break-even considerations regarding the benchmark effort**: 15 | * Benchmarking also represents a certain effort, namely for providing the HPC resources and human time explicitly used for that purpose. 16 | * **Presenting fair speedups**: 17 | * For conventional speedup calculations the same version of an algorithm (the same program) is used to measure the runtimes T(sequential) and T(parallel), but for fair speedup calculations the best known sequential algorithm should be used to measure T(sequential). 18 | * **Special features of current CPU architectures**: 19 | * Features like turbo boost and hyper-threading may influence benchmark results. 20 | * **Shared nodes**: 21 | * If the same cores are potentially shared at times on a node by different programs, the value of the benchmark results may be significantly reduced or even made useless. 22 | * **Reproducibility**: 23 | * There are parallel algorithms which may produce non-deterministic results, due to inherent effects of concurrency, which in turn may lead to different (but generally equivalent) results but also to strongly differing runtimes of repeated runs.
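To connect these definitions with the speedup and efficiency measures used throughout this skill, the following is a minimal sketch with invented runtimes; real numbers would come from the controlled experiments described in PE3.2.

```python
# Minimal sketch: compute speedup and parallel efficiency for a strong-scaling
# experiment. The runtimes below are invented example values, not measurements.

def speedup(t_sequential: float, t_parallel: float) -> float:
    return t_sequential / t_parallel

def efficiency(t_sequential: float, t_parallel: float, processes: int) -> float:
    return speedup(t_sequential, t_parallel) / processes

if __name__ == "__main__":
    # Fixed problem size (strong scaling); runtimes in seconds.
    runtimes = {1: 1000.0, 2: 520.0, 4: 275.0, 8: 150.0}
    t1 = runtimes[1]
    for p, t in runtimes.items():
        print(f"{p:2d} processes: speedup = {speedup(t1, t):5.2f}, "
              f"efficiency = {efficiency(t1, t, p):5.2f}")
```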
24 | 25 | -------------------------------------------------------------------------------- /pe/3/b.txt: -------------------------------------------------------------------------------- 1 | # PE3 Benchmarking 2 | 3 | Benchmarking is the activity to measure performance reliably and to assess the obtained performance. 4 | 5 | A benchmark is an act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. 6 | 7 | Such a controlled experiment is named a benchmark, but the term is also used – apparent from the context – for the program that is, or set of programs that are used for benchmarking. 8 | 9 | For HPC users measuring the performance behavior of the parallel program(s) they use is of primary importance in order to make optimal use of HPC hardware. 10 | 11 | ## Learning Outcomes 12 | 13 | * Produce a design and the required documentation for a benchmark. 14 | * Compose a controlled experiment from a design and measure important values. 15 | * Apply both strong and weak scaling to a benchmark to find out which is the appropriate model. 16 | 17 | ## Subskills 18 | 19 | * [[skill-tree:pe:3:1:b]] 20 | * [[skill-tree:pe:3:2:b]] 21 | * [[skill-tree:pe:3:3:b]] 22 | -------------------------------------------------------------------------------- /pe/4/1/b.txt: -------------------------------------------------------------------------------- 1 | # PE4.1 I/O Tuning 2 | 3 | I/O Tuning is crucial for optimizing the input/output operations of systems, significantly impacting overall system performance. This course covers the strategies and techniques necessary for assessing and enhancing I/O performance across various computing environments. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * **Understand the fundamentals** of I/O processes and the impact on system performance. 10 | * **Identify common I/O bottlenecks** in both hardware and software and strategies to mitigate them. 11 | * **Implement I/O monitoring tools** to track performance issues and assess the effectiveness of tuning efforts. 12 | * **Optimize file system configurations** to enhance data throughput and reduce latency. 13 | * **Adjust operating system settings** to better align with the specific I/O demands of applications. 14 | * **Design and execute benchmarks** to measure I/O performance before and after tuning interventions. 15 | * **Apply advanced tuning techniques** such as RAID configurations, SSD caching, and file system optimization. 16 | * **Evaluate different storage technologies** (e.g., HDD vs. SSD) and their configurations for optimal I/O performance. 17 | * **Develop a systematic approach** to continuous I/O performance assessment and tuning. 18 | * **Synthesize I/O tuning practices with overall system performance management** to ensure holistic optimization. 19 | * **Explore the use of software-defined storage solutions** for flexible I/O performance enhancement. 20 | * **Discuss the impact of network configurations** on I/O performance, particularly in distributed systems. 21 | * **Incorporate security considerations** into I/O tuning practices to protect data integrity. 22 | * **Utilize virtualization technologies** to simulate different I/O scenarios and their impacts on system performance. 23 | * **Lead teams in collaborative projects** that focus on optimizing I/O operations across different departments. 
24 | * **Stay informed about the latest developments** in I/O technology and tuning methodologies. 25 | * **Critique existing I/O tuning methods** and propose innovative solutions based on up-to-date research. 26 | * **Develop best practices documentation** for I/O tuning that can be shared within the organization. 27 | * **Host workshops and training sessions** to disseminate I/O tuning knowledge and practices. 28 | * **Evaluate the environmental impact** of I/O tuning decisions, considering energy consumption and sustainability. 29 | * **Master the use of diagnostic tools** for identifying and resolving I/O issues. 30 | * **Examine the effects of file system choice** on I/O performance, including comparisons of file systems under different workloads. 31 | * **Facilitate effective data management strategies** to complement I/O tuning efforts. 32 | * **Integrate I/O performance metrics** into overall system health indicators. 33 | * **Train technical staff** on troubleshooting I/O problems quickly and efficiently. 34 | 35 | AI generated content 36 | -------------------------------------------------------------------------------- /pe/4/2/b.txt: -------------------------------------------------------------------------------- 1 | # PE4.2 Tuning without Modifying the Source Code 2 | 3 | This course focuses on techniques for optimizing system performance through configuration changes, hardware upgrades, and other strategies that do not involve altering the source code. It is designed for professionals who need to enhance system efficiency without the need for programming modifications. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand Small vs. Big Tasks 10 | * Understand Process Mapping to Nodes 11 | * Understand CPU and Thread Pinning 12 | * Understand Memory Affinity (NUMA) 13 | * Understand Optimized Libraries 14 | * Understand Compiler Options / Optimization Switches 15 | * Understand Profile Guided Optimization Workflow (PGO) 16 | * Understand Package Specific Options 17 | * Understand Runtime Options for MPI and OpenMP 18 | 19 | -------------------------------------------------------------------------------- /pe/4/3/b.txt: -------------------------------------------------------------------------------- 1 | # PE4.3 Tuning via Reprogramming 2 | 3 | Tuning via Reprogramming focuses on optimizing system performance through direct modifications to the software's source code. This course explores the methodologies, tools, and best practices for reprogramming to enhance efficiency and functionality. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand Efficient Algorithms 10 | * Understand Functional Units 11 | * Understand Vectorization, SIMD 12 | * Understand SIMT 13 | 14 | -------------------------------------------------------------------------------- /pe/4/b.txt: -------------------------------------------------------------------------------- 1 | # PE4 Tuning 2 | 3 | Tuning means to change some system variables/configurations (without changing the program itself) in order to optimize the performance. 4 | 5 | ## Learning Outcomes 6 | 7 | * Analyze the IO performance of a program and tune it to increase performance. 8 | * Use profiling tools such as prof, gprof and line-by-line profilers. 9 | * Define Amdahl's Law. 10 | * Tune a parallel program in order to achieve better runtimes and to optimize the usage of the HPC resources. 
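Since one of the outcomes above is to define Amdahl's Law, a minimal worked sketch follows; the serial fraction used here is an arbitrary example value. Amdahl's Law bounds the speedup of a program whose serial (non-parallelizable) fraction is s when it runs on p processing elements: S(p) = 1 / (s + (1 - s) / p).

```python
# Minimal sketch of Amdahl's Law: S(p) = 1 / (s + (1 - s) / p),
# where s is the serial (non-parallelizable) fraction of the runtime.
# The serial fraction below is an invented example value.

def amdahl_speedup(serial_fraction: float, processes: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processes)

if __name__ == "__main__":
    s = 0.05   # 5 % of the runtime cannot be parallelized (example value)
    for p in (1, 2, 4, 8, 16, 64, 1024):
        print(f"{p:5d} processes: maximum speedup = {amdahl_speedup(s, p):6.2f}")
    # Even with arbitrarily many processes the speedup stays below 1/s = 20.
```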
11 | 12 | ## Subskills 13 | 14 | * [[skill-tree:pe:4:1:b]] 15 | * [[skill-tree:pe:4:2:b]] 16 | * [[skill-tree:pe:4:3:b]] 17 | -------------------------------------------------------------------------------- /pe/5/b.txt: -------------------------------------------------------------------------------- 1 | # PE5 Optimization Cycle 2 | 3 | The workflow is represented by an optimization cycle with the steps benchmarking, gathering system performance data (e.g. via profiling), analyzing, and tuning. 4 | 5 | ## Learning Outcomes 6 | 7 | * Apply the full workflow for tuning a parallel program. 8 | 9 | -------------------------------------------------------------------------------- /pe/b.txt: -------------------------------------------------------------------------------- 1 | # PE Performance Engineering 2 | 3 | In High-Performance Computing, performant and efficient execution is important to obtain results in a timely fashion, but also to reduce the costs involved when running workflows on expensive supercomputers. 4 | Performance engineering is the field of systematic approaches for measuring and analyzing the performance of systems and applications, for modifying compile and runtime settings to increase performance (tuning), and for optimizing applications and systems. 5 | 6 | 7 | ## Learning Outcomes 8 | 9 | * Apply a cost model to compute the costs for running a workflow on an HPC system. 10 | * List typical pitfalls when measuring performance. 11 | * Execute benchmarks to obtain performance baselines for systems and applications. 12 | * Comprehend profiling in order to assess the runtime behavior of parallel programs and to detect bottlenecks. 13 | * Comprehend tunable settings in applications and systems that influence performance. 14 | 15 | ## Subskills 16 | 17 | * [[skill-tree:pe:1:b]] 18 | * [[skill-tree:pe:2:b]] 19 | * [[skill-tree:pe:3:b]] 20 | * [[skill-tree:pe:4:b]] 21 | * [[skill-tree:pe:5:b]] 22 | -------------------------------------------------------------------------------- /sd/1/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.1 Programming Languages 2 | 3 | The user learns how to complete programming tasks and gets a short overview ranging from machine and assembly languages to so-called high-level programming languages. 4 | 5 | The focus lies on the programming languages that are in widespread use within the HPC community. 6 | 7 | ## Learning Outcomes 8 | 9 | * Understand C/C++ 10 | * Understand Fortran 11 | * Understand HPX 12 | * Understand Chip level programming 13 | * Understand Interoperability 14 | * Understand IDEs 15 | * Understand Development Cycle 16 | 17 | -------------------------------------------------------------------------------- /sd/1/2/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.1 Parallel Algorithms 2 | 3 | ## Learning Outcomes 4 | 5 | * Contrast that some algorithms are embarrassingly (i.e. trivially) parallelizable while their parallelization will vary from easy to hard in practice. 6 | * Determine the computational complexity of algorithms.
7 | * Understand Parallel Nature of Algorithms 8 | * Understand Computational Complexity 9 | 10 | -------------------------------------------------------------------------------- /sd/1/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.2 Shared Memory Systems 2 | 3 | The parallel concepts of threads and processes are introduced and their impacts on performance are outlined. 4 | 5 | ## Learning Outcomes 6 | 7 | * Detect race conditions and use synchronization mechanisms to avoid them. 8 | * Estimate the problems that may result from erroneous use of synchronization mechanisms (e.g. deadlocks). 9 | * Assess data dependency situations, i.e. an instruction reading the data written by a preceding instruction in the source code, and anti-dependencies, i.e. an instruction having to read data before a succeeding instruction overwrites it, and output dependencies, i.e. instructions writing to the same memory location. 10 | * Use data parallelism, e.g. applying parallel streams of identical instructions to different elements of appropriate data structures such as arrays. 11 | * Apply the concept of functional parallelism, i.e. executing a set of distinct functions possibly using the same data. 12 | * Assess the applicability of parallel language extensions like OpenMP. 13 | * Assess concepts like software pipelining, e.g. to optimize loops by out-of-order execution. 14 | * Assess the applicability of parallel language extensions like CUDA as well as their interoperability (e.g. combining OpenACC and CUDA). 15 | * Assess parallel concepts typically used for shared memory systems, e.g. to exploit temporal locality by data reuse with efficient utilization of the memory hierarchy. 16 | * Assess concepts like software pipelining and vectorization principles. 17 | * Assess the influence of control dependencies by jumps, branches, and function calls, e.g. on pipeline filling. 18 | * Assess the applicability of parallel language extensions like OpenACC and C++ AMP. 19 | * Understand Single Instruction Multiple Data (SIMD) 20 | * Understand Synchronization 21 | * Understand Memory Hierarchy and Data Reuse 22 | * Understand Software Pipelining 23 | * Understand Dependency Pattern and remember 24 | * Data Dependencies 25 | * Control Dependencies 26 | * Understand Language Extensions and remember 27 | * Processes 28 | * CUDA 29 | * OpenACC 30 | * Threading with OpenMP and Pthreads 31 | * C++ AMP 32 | * Vectorization 33 | 34 | -------------------------------------------------------------------------------- /sd/1/2/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.3 Message Passing Systems 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand Synchronization 6 | * Understand Communication Overview and remember 7 | * Blocking 8 | * Non-Blocking 9 | * Point-to-Point 10 | * Collective 11 | * Overlay Networks 12 | 13 | -------------------------------------------------------------------------------- /sd/1/2/4/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.4 Load Balancing 2 | 3 | ## Learning Outcomes 4 | 5 | * Apply domain decomposition strategies. 6 | * Apply simple scheduling algorithms like task farming to achieve an appropriate distribution of the workloads across the multiple computing resources of the HPC system. 7 | 8 | * Apply more sophisticated approaches e.g.
based on tree structures like divide-and-conquer or work-stealing to achieve an appropriate distribution of the workloads across the multiple computing resources of the HPC system. 9 | 10 | -------------------------------------------------------------------------------- /sd/1/2/5/1/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.1 NetCDF C API 2 | 3 | The C library is the core implementation on which non-Java interfaces are built. 4 | It provides reliability, performance, and portability. 5 | 6 | ## Learning Outcomes 7 | 8 | * Write a two-dimensional array of sample data and read data from this file. 9 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 10 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 11 | * Employ ncgen to generate the C code needed to create the corresponding NetCDF file from a CDL text file. 12 | 13 | -------------------------------------------------------------------------------- /sd/1/2/5/1/10/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.10 NetCDF Ruby API 2 | 3 | RubyNetCDF is basically a one-to-one interface to the original NetCDF library, but it: 4 | * Consolidates some functions having the same purposes. 5 | * Supports some additional functions to help users by combining the raw functions. 6 | 7 | The Ruby API for NetCDF was contributed as part of the Dennou Ruby Project, providing software for data analyses, visualization, and numerical simulations for geophysical studies. 8 | RubyNetCDF supports all the functionality of the NetCDF-3 C library. 9 | 10 | ## Learning Outcomes 11 | 12 | * Write a two-dimensional array of sample data and read data from this file. 13 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 14 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 15 | 16 | -------------------------------------------------------------------------------- /sd/1/2/5/1/11/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.11 NetCDF Perl API 2 | 3 | There are two Perl APIs for NetCDF: 4 | * PDL::NetCDF, an object-oriented interface between NetCDF files and PDL objects. 5 | * NetCDFPerl, developed at Unidata, but now in maintenance mode only, and no further development is currently planned. 6 | 7 | ## Learning Outcomes 8 | 9 | * Write a two-dimensional array of sample data and read data from this file. 10 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 11 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 12 | 13 | -------------------------------------------------------------------------------- /sd/1/2/5/1/12/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.12 NetCDF Remote Data Access 2 | 3 | The goal of Unidata's Thematic Real-time Environmental Distributed Data Services (THREDDS) is to provide students, educators and researchers with coherent access to a large collection of real-time and archived datasets from a variety of environmental data sources at a number of distributed server sites.
4 | 5 | The THREDDS Data Server (TDS) is a web server that provides catalog, metadata, and data access services for scientific data. 6 | Every TDS publishes THREDDS catalogs that advertise the datasets and services it makes available. 7 | THREDDS catalogs are XML documents that list datasets and the data access services available for the datasets. 8 | Catalogs may contain metadata to document details about the datasets. 9 | TDS configuration files provide the TDS with information about which datasets and data collections are available and what services are provided for the datasets. 10 | The available remote data access protocols include OPeNDAP, OGC WCS, OGC WMS, and HTTP. 11 | The ncISO service allows THREDDS catalogs to be translated into ISO metadata records. 12 | The TDS also supports several dataset collection services including some sophisticated dataset aggregation capabilities. 13 | This allows the TDS to aggregate a collection of datasets into a single virtual dataset, greatly simplifying user access to that data collection. 14 | 15 | The TDS is open source and runs inside the open-source Tomcat Servlet container. 16 | The TDS is developed and supported by Unidata, a division of the University Corporation for Atmospheric Research (UCAR), and is sponsored by the National Science Foundation. 17 | 18 | The Unidata Local Data Manager (LDM) is a collection of cooperating programs that select, capture, manage, and distribute arbitrary data products. 19 | The system is designed for event-driven data distribution of the kind used in the Unidata Internet Data Distribution (IDD) project. 20 | LDM system includes network client and server programs designed for event-driven data distribution and is the fundamental component comprising the IDD system. 21 | 22 | ## Learning Outcomes 23 | 24 | * From the user's perspective, one should be able to retrieve data and: 25 | * Obtain information regarding the dataset, without the need to download any files: 26 | * Temporal spatial ranges, available variables, contact info, dataset details. 27 | * Retrieve user-relevant subset of the data (temporal, spatial, and variable subsetting). 28 | * Get data remotely in a variety of ways, including: 29 | * OPeNDAP (DAP2/DAP4) protocol. 30 | * ncWMS implementing the Web Map Service. 31 | * Web Coverage Service. 32 | * ncISO. 33 | * NetCDF Subset Services. 34 | * Download one file, even if data span multiple files. 35 | * From the data provider's perspective, one should be able to provide data and: 36 | * Catalog data holdings. 37 | * Aggregate data files. 38 | * Provide a quick view of data. 39 | * Easily add information (metadata) to datasets. 40 | 41 | -------------------------------------------------------------------------------- /sd/1/2/5/1/13/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.13 NetCDF principles 2 | 3 | In a simple view, NetCDF is: 4 | * A data model. 5 | * A file format. 6 | * A set of APIs and libraries for various programming languages. 7 | 8 | Together, the data model, file format, and APIs support the creation, access, and sharing of scientific data. 9 | 10 | ## Learning Outcomes 11 | 12 | * Comply with the best practices for writing NetCDF files. 13 | * Identify protocols, servers, and clients for remote data access through NetCDF interfaces. 14 | * Discuss how features of the enhanced NetCDF-4 format can be applied: 15 | * Groups. 16 | * Multiple unlimited dimensions. 17 | * User-defined types. 18 | * Data compression. 
19 | * Create short programs in C that use the NetCDF API to read and write files in the NetCDF classical data model for a given NetCDF data model. 20 | * Create short programs that use the netcdf4-python module to create and read files in NetCDF-4 format for a given NetCDF data model. 21 | 22 | -------------------------------------------------------------------------------- /sd/1/2/5/1/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.2 NetCDF C++ API 2 | 3 | The NetCDF-4 C++ API was developed for use in managing fusion research data from CCFE's innovative MAST (Mega Amp Spherical Tokamak) experiment. 4 | 5 | The API is implemented as a layer over the NetCDF-4 C interface, which means bug fixes and performance enhancements in the C interface will be immediately available to C++ developers as well. 6 | 7 | ## Learning Outcomes 8 | 9 | * Write a two-dimensional array of sample data and read data from this file. 10 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 11 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 12 | 13 | -------------------------------------------------------------------------------- /sd/1/2/5/1/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.3 NetCDF Java API 2 | 3 | The NetCDF Java library implements the Common Data Model (CDM) to interface NetCDF files to a variety of data formats (e.g., NetCDF, HDF, GRIB). 4 | Layered above the basic data access, the CDM uses the metadata contained in datasets to provide a higher-level interface to geoscience specific features of datasets, in particular, providing geolocation and data subsetting in coordinate space. 5 | 6 | ## Learning Outcomes 7 | 8 | * Employ the CDM/NetCDF-Java to read datasets in various formats. 9 | * Write a two-dimensional array of sample data and read data from this file. 10 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 11 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 12 | 13 | -------------------------------------------------------------------------------- /sd/1/2/5/1/4/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.4 NetCDF Fortran-90 API 2 | 3 | The Fortran-90 library provides current Fortran support for modelers and scientists. 4 | 5 | The Fortran-90 interface is significantly simpler than either the C or Fortran-77 interfaces because of its use of overloaded functions and optional arguments. 6 | 7 | ## Learning Outcomes 8 | 9 | * Write a two-dimensional array of sample data and read data from this file. 10 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 11 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 12 | 13 | -------------------------------------------------------------------------------- /sd/1/2/5/1/5/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.5 NetCDF Python API 2 | 3 | Python is an interpreted, object-oriented language that is supported on a wide range of hardware and operating systems. 4 | There are now several NetCDF interfaces for Python.
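As a minimal sketch of the kind of task listed in this skill's learning outcomes (writing a small two-dimensional array and reading it back), assuming the netcdf4-python module described below is installed:

```python
# Minimal sketch: write a small two-dimensional array to a NetCDF-4 file and
# read it back, using the netcdf4-python module (assumed to be installed).
import numpy as np
from netCDF4 import Dataset

# Write a 3 x 4 array of sample data with a simple units attribute.
with Dataset("sample.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("y", 3)
    nc.createDimension("x", 4)
    var = nc.createVariable("data", "f4", ("y", "x"))
    var.units = "m"
    var[:, :] = np.arange(12, dtype="f4").reshape(3, 4)

# Read the data and the attribute back and print them.
with Dataset("sample.nc", "r") as nc:
    print(nc.variables["data"][:, :])
    print(nc.variables["data"].units)
```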
5 | 6 | netcdf4-python is a Python interface to the NetCDF C library. 7 | It is implemented on top of HDF5. 8 | This module can read and write files in both the new NetCDF 4 and the old NetCDF 3 format and can create files that are readable by HDF5 clients. 9 | xray is a higher-level interface that uses netcdf4-python internally to implement a pandas-like package for N-D labelled arrays for scientific data. 10 | 11 | pycdf is a recent Python interface to the NetCDF library. 12 | It provides almost complete coverage of the NetCDF C API, wrapping it inside easy to use python classes. 13 | NetCDF arrays are handled using array objects provided either by the python numpy, Numeric or numarray packages. 14 | 15 | PyNIO is a Python package that allows read and/or write access to a variety of data formats using an interface modelled on NetCDF. 16 | Currently supported formats include NetCDF, HDF4, GRIB1 and GRIB2 (read-only), and HDF-EOS 2 Grid and Swath data (read-only). 17 | 18 | PyPNetCDF is a Python interface to PNetCDF. 19 | It allows access to NetCDF files using MPI and the library PNetCDF. 20 | The tools can read and write in a parallel way. 21 | 22 | Pupynere (PUre PYthon NEtcdf REader) allows read-access to NetCDF files using the same syntax as the Scientific.IO.NetCDF Python module. 23 | Even though it's written in Python, the module is up to 40% faster than Scientific.IO.NetCDF and pynetcdf. 24 | 25 | ## Learning Outcomes 26 | 27 | * Write a two-dimensional array of sample data and read data from this file. 28 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 29 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 30 | 31 | -------------------------------------------------------------------------------- /sd/1/2/5/1/6/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.6 NetCDF Matlab API 2 | 3 | MATLAB is a commercial integrated technical computing environment that combines numeric computation, advanced graphics and visualization, and a high-level programming language. 4 | Versions 7.7 and later of MATLAB have built-in support for reading and writing NetCDF data. 5 | 6 | Several freely-available software packages that implement a MATLAB/NetCDF interface are available: MEXNC, MexEPS, and the CSIRO MATLAB/NetCDF interface. 7 | It is also possible to call NetCDF Java library methods from MATLAB, so using it provides the advantages of the Java interface. 8 | 9 | ## Learning Outcomes 10 | 11 | * Write a two-dimensional array of sample data and read data from this file. 12 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 13 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 14 | 15 | -------------------------------------------------------------------------------- /sd/1/2/5/1/7/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.7 NetCDF IDL API 2 | 3 | IDL (Interactive Data Language) is a commercial scientific computing environment that combines mathematics, advanced data visualization, scientific graphics, and a graphical user interface toolkit to analyze and visualize scientific data. 4 | IDL supports data in NetCDF format. 
5 | 6 | ## Learning objecives 7 | 8 | * Write a two-dimensional array of sample data and read data from this file. 9 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 10 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 11 | 12 | -------------------------------------------------------------------------------- /sd/1/2/5/1/8/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.8 NetCDF NCL API 2 | 3 | NCL (NCAR Command Language) is a free interpreted language designed specifically for scientific data processing and visualization. 4 | In addition to common programming features, NCL also handles manipulation of metadata, configuration of the visualizations, import of data from a variety of data formats (including NetCDF, HDF4, HDF4-EOS, GRIB1, GRIB2) via a single function, and an algebra that supports array operations. 5 | 6 | ## Learning Outcomes 7 | 8 | * Write a two-dimensional array of sample data and read data from this file. 9 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 10 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 11 | 12 | -------------------------------------------------------------------------------- /sd/1/2/5/1/9/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1.9 NetCDF R API 2 | 3 | R is a language and environment for statistical computing and graphics. 4 | Three R NetCDF interfaces are available: 5 | * ncdf4 package. 6 | * RNetCDF. 7 | * ncvar. 8 | 9 | ## Learning Outcomes 10 | 11 | * Write a two-dimensional array of sample data and read data from this file. 12 | * Write some variables with units attributes and coordinate dimensions and read data variables and attributes from this file. 13 | * Write some four-dimensional variables using a record dimension and read from the variables from this file. 14 | 15 | -------------------------------------------------------------------------------- /sd/1/2/5/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.1 NetCDF 2 | 3 | In a simple view, NetCDF is: 4 | * A data model. 5 | * A file format. 6 | * A set of APIs and libraries for various programming languages. 7 | 8 | Together, the data model, file format, and APIs support the creation, access, and sharing of scientific data. 9 | 10 | NetCDF APIs are available for most programming languages used in geosciences. 11 | Although the NetCDF data model and format are language-independent, all NetCDF APIs are currently implemented over C or Java. 12 | Files written through one language API are readable through other language APIs and some language interfaces support remote access. 13 | 14 | Languages such as C++, Fortran-77, Fortran-90, Perl, Python, Ruby, Matlab, among others, have C-based interfaces. 15 | In particular, NetCDF4-python is a Python interface to the NetCDF C library. 
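To give a concrete feel for this interface, here is a minimal, hypothetical netcdf4-python sketch that writes a small 2-D variable with a units attribute and reads it back; the file name and variable names are illustrative assumptions, not part of the skill definition.

```python
# Minimal netcdf4-python sketch: write a small 2-D variable and read it back.
# File and variable names are illustrative only.
import numpy as np
from netCDF4 import Dataset

# Create a NetCDF-4 file with two dimensions and one variable.
with Dataset("example.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("y", 4)
    ds.createDimension("x", 6)
    temp = ds.createVariable("temperature", "f4", ("y", "x"))
    temp.units = "K"                                  # attach a units attribute
    temp[:, :] = np.arange(24, dtype="f4").reshape(4, 6)

# Re-open the file read-only and inspect data and metadata.
with Dataset("example.nc", "r") as ds:
    temp = ds.variables["temperature"]
    print(temp.units, temp[0, :])
```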
16 | 17 | ## Learning Outcomes 18 | 19 | * Examine different NetCDF programming APIs: 20 | * C 21 | * C++ 22 | * Java 23 | * Fortra-90 24 | * Python 25 | * Matlab 26 | * IDL 27 | * NCL 28 | * R 29 | * Ruby 30 | * Perl 31 | * Remote 32 | * Examine the NetCDF remote data access 33 | * Analyze NetCDF principle and investigate NetCDF files 34 | 35 | ## Subskill 36 | 37 | * [[skill-tree:sd:1:2:5:1:1:b]] 38 | * [[skill-tree:sd:1:2:5:1:2:b]] 39 | * [[skill-tree:sd:1:2:5:1:3:b]] 40 | * [[skill-tree:sd:1:2:5:1:4:b]] 41 | * [[skill-tree:sd:1:2:5:1:5:b]] 42 | * [[skill-tree:sd:1:2:5:1:6:b]] 43 | * [[skill-tree:sd:1:2:5:1:7:b]] 44 | * [[skill-tree:sd:1:2:5:1:8:b]] 45 | * [[skill-tree:sd:1:2:5:1:9:b]] 46 | * [[skill-tree:sd:1:2:5:1:10:b]] 47 | * [[skill-tree:sd:1:2:5:1:11:b]] 48 | * [[skill-tree:sd:1:2:5:1:12:b]] 49 | * [[skill-tree:sd:1:2:5:1:13:b]] 50 | -------------------------------------------------------------------------------- /sd/1/2/5/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5.2 XIOS 2 | 3 | XIOS, or XML-IO-Server, is a library dedicated to I/O management in climate codes targeted for large core simulation (> 10 000) on climate coupled models. 4 | 5 | XIOS manages the output of diagnostics and other data produced by climate component codes into files and offers temporal and spatial post-processing operations on this data.  6 | 7 | XIOS aims at simplifying the I/O management by minimizing the number of subroutines to be called and by supporting a maximum of on-line processing of the data. 8 | 9 | ## Learning Outcomes 10 | 11 | * Describe the XIOS-XML terminology and structuration. 12 | * Test XIOS on a specific architecture. 13 | * Use one or more processes dedicated exclusively to the I/O management to obtain: 14 | * Simultaneous writing and computing by an asynchronous call. 15 | * Asynchronous transfer of data from clients to servers. 16 | * Asynchronous data writing by each server. 17 | * Use of parallel file system ability via Netcdf4-HDF5 file format to obtain: 18 | * Simultaneous writing in the same single file by all servers. 19 | * No more post-processing rebuilding of the files. 20 | 21 | -------------------------------------------------------------------------------- /sd/1/2/5/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.5 I/O Programming middleware 2 | 3 | ## Learning Outcomes 4 | 5 | * Assess general concepts of HPC I/O systems (e.g. parallel file systems) and how to map the data model to the storage system, e.g. by using appropriate I/O libraries and middleware architectures. 6 | * Demonstrate the use of NetCDF as a parallel data format with different API. 7 | * Demonstrate the use of XIOS. 8 | 9 | ## Subskills 10 | 11 | * [[skill-tree:sd:1:2:5:1:b]] 12 | * [[skill-tree:sd:1:2:5:2:b]] 13 | -------------------------------------------------------------------------------- /sd/1/2/6/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.6 GPU Programming with CUDA C/C++ Programming Fundamentals 2 | 3 | NVIDIA GPUs can be programmed by using CUDA as an extension to the C/C++ programming languages. NVIDIA provides an entire CUDA software environment which includes compilers, profilers, libraries and more. CUDA allows programmers to write highly parallel compute kernels for the execution on NVIDIA GPUs. 
In order to write CUDA kernels that can utilize the GPU, an understanding of their architecture and its mapping to the CUDA language extension is required. 4 | 5 | ## Learning Outcomes 6 | 7 | * Understand the CUDA thread hierarchy: 8 | * Address individual threads within the kernel grid 9 | * Write scalable codes via grid-striding loops 10 | * Configure and modify kernel launch configurations 11 | * Understand the nature of unified memory: 12 | * Understand the necessity for memory transfers 13 | * Understand the functional principles of unified memory 14 | * Understand the potential performance hazards & ways to mitigate those 15 | * Be able to execute kernels concurrently via streams: 16 | * Use streams to fully utilize the GPU 17 | * Use streams to hide memory transfer times 18 | * Be able to query the status of a GPU & potential CUDA errors that occur during program execution 19 | 20 | ## Maintainer 21 | 22 | * Markus Velten, ZIH Tools Team @ TU Dresden 23 | 24 | -------------------------------------------------------------------------------- /sd/1/2/7/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2.7 Accelerators 2 | 3 | Accelerators are add-on cards similar to GPUs. 4 | They are designed to perform specific tasks and require additional programming. 5 | 6 | ## Learning Outcomes 7 | 8 | * Understand FPGA 9 | * Understand GraphCore 10 | * Understand Neuromorphic Computing 11 | 12 | -------------------------------------------------------------------------------- /sd/1/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.2 Parallel Programming 2 | 3 | Parallel programming of shared memory systems and message passing systems as well as load balancing is addressed. 4 | 5 | ## Learning Outcomes 6 | 7 | * Assess the parallel nature of algorithms. 8 | * Develop parallel programs using Shared memory programming. 9 | * Develop parallel programs using Message passing. 10 | * Analyse load balancing algorithms and paradigms. 11 | * Demonstrate I/O programming middleware. 12 | * Demonstrate GPU programming using CUDA. 13 | * Break down the differences between accelerators and a CPU. 14 | 15 | 16 | ## Subskills 17 | 18 | * [[skill-tree:sd:1:2:1:b]] 19 | * [[skill-tree:sd:1:2:2:b]] 20 | * [[skill-tree:sd:1:2:3:b]] 21 | * [[skill-tree:sd:1:2:4:b]] 22 | * [[skill-tree:sd:1:2:5:b]] 23 | * [[skill-tree:sd:1:2:6:b]] 24 | * [[skill-tree:sd:1:2:7:b]] 25 | -------------------------------------------------------------------------------- /sd/1/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD1.3 Efficient Algorithms and Data Structures 2 | 3 | ## Learning Outcomes 4 | 5 | * Assess the efficiency of algorithms and data structures, especially with respect to their suitability for typical (scientific) (parallel) programs, e.g. by the help of popular practice-relevant examples. 6 | 7 | -------------------------------------------------------------------------------- /sd/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD1 Programming Concepts for HPC 2 | 3 | ## Learning Outcomes 4 | 5 | * Develop programs for HPC. 6 | * Develop parallel programs for shared memory systems as well as for message passing systems (see the sketch after this list). 7 | * Develop efficient algorithms and data structures. 8 | * Assess the efficiency and suitability of algorithms and data structures for the respective application.
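One common way to get a first taste of the message-passing outcome above from Python is mpi4py; the following is a minimal sketch under the assumption that the mpi4py package and an MPI runtime are installed on the system (the script name in the comment is hypothetical).

```python
# Minimal message-passing sketch with mpi4py (run e.g. with: mpirun -n 2 python ping.py).
# Assumes the mpi4py package and an MPI runtime are available on the system.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": 42}, dest=1, tag=0)   # rank 0 sends a Python object
    print("rank 0 sent a message")
elif rank == 1:
    data = comm.recv(source=0, tag=0)           # rank 1 receives it
    print("rank 1 received:", data)
```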
9 | 10 | ## Subskills 11 | 12 | * [[skill-tree:sd:1:1:b]] 13 | * [[skill-tree:sd:1:2:b]] 14 | * [[skill-tree:sd:1:3:b]] 15 | -------------------------------------------------------------------------------- /sd/2/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.1 Integrated Development Environments 2 | 3 | ## Learning Outcomes 4 | 5 | * Configure and use integrated development environments (IDEs) like Eclipse, e.g. to seamlessly perform the typical development cycle with the steps edit, build (compile and link), and test. 6 | 7 | -------------------------------------------------------------------------------- /sd/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.2 Debugging 2 | 3 | ## Learning Outcomes 4 | 5 | * Debug a program using simple techniques such as inserting debugging output statements into the source code, e.g. using printf – also against the background of potential problems with the ordering of the (stdout) output that may exist in parallel environments like MPI. 6 | * Apply the common concepts and workflows when using a debugger (commands like step into, step over, step out, breakpoints). 7 | * Use sophisticated debuggers such as GDB. 8 | * Use sophisticated debuggers such as DDT and TotalView. 9 | 10 | -------------------------------------------------------------------------------- /sd/2/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.3 Programming Idioms 2 | 3 | This skill conveys programming idioms in general and for specific programming languages in order to help developers to solve recurring programming problems. 4 | 5 | ## Learning Outcomes 6 | 7 | * Describe programming idioms for a specific programming language, e.g. Fortran, Python, C, C++. 8 | * Recognize where programming idioms are violated and refactor the code to comply with a specific programming idiom. 9 | * Apply programming idioms for a specific programming language, e.g. Fortran, Python, C, C++. 10 | 11 | -------------------------------------------------------------------------------- /sd/2/4/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.4 Logging 2 | 3 | Logging is necessary in order to comprehend when, where, and why an error occurs during the execution. 4 | Parallel programs are prone to failures and errors during operation. 5 | Knowledge about logging concepts, the ability to apply them appropriately, and to purposefully analyze the log files is therefore essential in the context of high-performance computing. 6 | 7 | ## Learning Outcomes 8 | 9 | * Describe logging in general like log levels etc. (e.g. ERROR, WARN, INFO, DEBUG, TRACE). 10 | * Describe different logging formats. 11 | * Select appropriate information that should be logged (e.g. timestamp, pid, thread, level, logger name) in order to be able to identify the problem. 12 | * Differentiate between structured logging and text logging. 13 | * Apply logging implementations/libraries for a specific programming language like Fortran, C, C++. 14 | * Develop, maintain, and document a consistent logging concept for a program. 15 | * Implement a logging concept for a program in a specific programming language, e.g. Fortran, C, C++. 16 | * Recognize logging demands and challenges especially for distributed systems. 17 | * Select the most appropriate log format for the context. 18 | * Apply structured logging and text logging (see the sketch after this list).
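A minimal Python sketch of the ideas above (log levels, typical fields such as timestamp, pid, and logger name, and text versus structured output); the format string, logger name, and JSON field names are illustrative choices, not a prescribed standard.

```python
# Text logging with typical fields (timestamp, pid, logger name, level) and log levels.
import json
import logging
import os
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(process)d %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger("solver")
log.debug("not shown at INFO level")   # filtered out by the configured level
log.info("iteration finished")
log.error("residual diverged")

# The same event as a structured (JSON) log line, which is easier to parse and aggregate.
print(json.dumps({
    "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
    "pid": os.getpid(),
    "logger": "solver",
    "level": "ERROR",
    "msg": "residual diverged",
}))
```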
19 | 20 | -------------------------------------------------------------------------------- /sd/2/5/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.5 Exception Handling 2 | 3 | The skill conveys general concepts about exception handling, how exception handling can be implemented in a specific programming language and how a consistent exception handling policy can be defined and thoroughly followed during implementation. 4 | 5 | ## Learning Outcomes 6 | 7 | * Differentiate among the terms "mistake", "fault", "failure", and "error". 8 | * Describe exception handling concepts in general (e.g. Errors vs. Exceptions). 9 | * Articulate why it helps to write software that is robust. 10 | * Use best practices for exception handling. 11 | * Describe how exception handling is supported in a specific programming language, e.g. Fortran, C (e.g. ), C++ (i.e. try, catch, throw). 12 | * Apply appropriate exception handling in a specific programming language. 13 | 14 | -------------------------------------------------------------------------------- /sd/2/6/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.6 Coding Standards 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand and apply coding standards 6 | 7 | -------------------------------------------------------------------------------- /sd/2/7/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.7 Testing 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand testing procedures 6 | 7 | -------------------------------------------------------------------------------- /sd/2/8/b.txt: -------------------------------------------------------------------------------- 1 | # SD2.8 Portability 2 | 3 | ## Learning Outcomes 4 | 5 | * Understand portability in terms of linked libraries 6 | * Understand code that can be used on different OS 7 | 8 | -------------------------------------------------------------------------------- /sd/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD2 Programming Best Practices 2 | 3 | This skill provides knowledge about software development best practices that will help scientists to develop high-quality scientific software. 4 | 5 | ## Learning Outcomes 6 | 7 | * Analyse the benefits of using an IDE for programming. 8 | * Apply the best practices from software engineering regarding programming and debugging. 9 | * Apply programming best practices in order to develop robust and maintainable programs. 10 | * Distinguish different log levels and apply them to a program. 11 | * Dramatize the handling of exceptions for programs. 12 | * Apply coding standards to a programming project. 13 | * Apply testing procedures to a programming project. 14 | * Demonstrate programming of portable code. 
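The exception-handling outcome listed above can be made concrete with a small Python sketch; the error cases and function name chosen here are illustrative assumptions.

```python
# Minimal exception-handling sketch: handle the failures we anticipate,
# add context for the caller, and let unexpected errors propagate.
def read_positive_number(path):
    try:
        with open(path) as fh:            # may raise OSError (e.g. file missing)
            value = float(fh.read())      # may raise ValueError (not a number)
        if value <= 0:
            raise ValueError(f"expected a positive number, got {value}")
        return value
    except (OSError, ValueError) as err:
        # Only the anticipated failure modes are caught and re-raised with context.
        raise RuntimeError(f"could not read input from {path}") from err

try:
    read_positive_number("missing.txt")
except RuntimeError as err:
    print("error:", err, "| cause:", err.__cause__)
```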
15 | 16 | ## Subskills 17 | 18 | * [[skill-tree:sd:2:1:b]] 19 | * [[skill-tree:sd:2:2:b]] 20 | * [[skill-tree:sd:2:3:b]] 21 | * [[skill-tree:sd:2:4:b]] 22 | * [[skill-tree:sd:2:5:b]] 23 | * [[skill-tree:sd:2:6:b]] 24 | * [[skill-tree:sd:2:7:b]] 25 | * [[skill-tree:sd:2:8:b]] 26 | -------------------------------------------------------------------------------- /sd/3/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD3.1 Version Control 2 | 3 | This skill covers how to apply version and configuration management to the development of (parallel) programs in order to track and control changes in the sources and how to establish and maintain consistency of the program or software system throughout its life, and facilitate cooperative development. 4 | 5 | Systems like Revision Control System (RCS), Subversion (SVN), and GIT are presented. 6 | 7 | ## Learning Outcomes 8 | 9 | * Describe the basics of version control systems, e.g. what is version control. 10 | * Discuss the benefits of using version control for software development especially in a team. 11 | * Describe what is branching and merging. 12 | * Assess the difference between distributed and centralized version control systems. 13 | * Apply basic Git concepts. 14 | * Apply basic SVN concepts. 15 | * Use basic git commands such as git add, git commit, git pull, git push. 16 | * Use SVN commands. 17 | * Resolve merge conflicts. 18 | * Apply a specific workflow, such as Feature Branch Workflow, Gitflow Workflow, Centralized Workflow, Forking Worfklow. 19 | * Apply advanced git concepts and commands. 20 | * Apply Git as a version control system. 21 | * Apply SVN as a version control system. 22 | * Apply advanced git concepts, such as pull requests, branches, tags, submodules etc. 23 | * Differentiate types of workflows, such as Feature Branch Workflow, Gitflow Workflow, Centralized Workflow, Forking Worfklow. 24 | 25 | -------------------------------------------------------------------------------- /sd/3/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD3.2 Issue Tracking and Bug Tracking 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe concepts of issue/bug tracking systems and their basic concepts like task, sub-task, new feature, story, release planning, sprint planning in order to structure and organize the development process (e.g. assigning tasks to developers, reporting bugs, writing user stories, managing the stages of an issue (to do, in progress, in review, done) etc.) 6 | * Differentiate issue tracking systems, like Jira or Redmine. 7 | * Apply issue tracking in order to manage tasks, bug reports, and other issues occuring during development and enabling task assignment in the team. 8 | * Apply different issue tracking systems, like Jira or Redmine, to the development project. 9 | * Define a consistent workflow in the development team for issue tracking. 10 | 11 | -------------------------------------------------------------------------------- /sd/3/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD3.3 Release Management 2 | 3 | The benefits of release management are explained. 4 | Moreover, it is covered how software releases are managed according to a well-defined and consistent process. 
5 | 6 | ## Learning Outcomes 7 | 8 | * Describe the basics of release management and what the benefits are of applying a release management process in the context of high-performance computing. 9 | * Discuss the differences among Major Release, Minor Release, Emergency Fix (and potentially other types of releases) and what should be contained in each of them. 10 | * Correlate the tasks and steps of release management. 11 | * Apply the steps of the deployment process of the release version and the required dependencies. 12 | * Comply with the best practices of making releases identifiable via version numbers using an appropriate version numbering scheme (e.g. using the version control system). 13 | * Characterize the lifecycle of a release (including states such as stable, unstable). 14 | * Differentiate frameworks of release planning and management, e.g. SCRUM release planning and ITIL. 15 | * Apply frameworks of release planning and management like SCRUM release planning or ITIL. 16 | * Classify releases according to release categories (e.g. major, minor, emergency fix). 17 | * Plan and manage releases of scientific software and document the release including the release notes. 18 | * Apply best practices to make a release identifiable via version numbers using an appropriate version numbering scheme (e.g. using the version control system). 19 | * Find the best release management process for the team (e.g. depending on team size etc.). 20 | 21 | -------------------------------------------------------------------------------- /sd/3/4/b.txt: -------------------------------------------------------------------------------- 1 | # SD3.4 Deployment Management 2 | 3 | This skill conveys how dependencies are managed, how and why to set up different environments for development, testing, and production, how and why to automate the deployment process, and the importance of preserving and documenting reproducible software stacks that can be used by other users/researchers in order to reliably reproduce the results. 4 | 5 | ## Learning Outcomes 6 | 7 | * Describe the basics of dependency management. 8 | * Discuss why different environments for testing, development, production, and staging are necessary. 9 | * Compare different deployment environments and their specific requirements. 10 | * Use software building environments like make, Autotools, CMake. 11 | * Correlate continuous integration, delivery, and deployment and the differences between them. 12 | * Analyze the benefits, drawbacks, and tradeoffs of continuous integration, delivery, and deployment. 13 | * Set up the production, testing, and development environments. 14 | * Define and preserve reproducible software stacks to make computational results reproducible, e.g. by applying virtualization environments like VirtualBox, Docker, rkt, or BioContainers or tools for defining scientific workflows like Nextflow, or Singularity. 15 | * Configure an environment for continuous integration, delivery, and deployment using a selected continuous integration tool like Jenkins, Buildbot or Travis-CI with basic processing steps like compiling and automated testing. 16 | * Describe the basics of dependency management for different programming languages. 17 | * Use advanced software building environments like Scons and Waf. 18 | * Discuss the challenges for Portability, e.g. for the source code of programs and job scripts to avoid typical compiler-, linker-, and MPI-issues.
19 | * Use a software build and installation framework like EasyBuild. 20 | 21 | -------------------------------------------------------------------------------- /sd/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD3 Software Configuration Management 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe the purpose and importance of software configuration management, especially in the context of high-performance computing. 6 | * Demonstrate Bug and issue tracking for a software project. 7 | * Apply the basic concepts, terms and processes of SCM and apply steps of SCM in an HPC project. 8 | * Detail terms like Configuration Item, Baseline, SCM Directories, Version, Revision, and Release. 9 | 10 | ## Subskills 11 | 12 | * [[skill-tree:sd:3:1:b]] 13 | * [[skill-tree:sd:3:2:b]] 14 | * [[skill-tree:sd:3:3:b]] 15 | * [[skill-tree:sd:3:4:b]] 16 | -------------------------------------------------------------------------------- /sd/4/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD4.1 Test-driven Development and Agile Testing 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe the challenges of testing scientific applications. 6 | * Discuss test-driven and test-first concepts and understand the benefits. 7 | * Characterize what constitutes a test strategy. 8 | * Explain that there are different test types, e.g. given by the test pyramid. 9 | * Apply unit testing in a specific programming language using an appropriate unit testing framework, e.g. pfUnit for Fortran, glib testing framework for C. 10 | * Develop (agile) testing strategies. 11 | * Write different test types for the test pyramid. 12 | * Understand Continuous Integration (CI) and remember: 13 | * jenkins 14 | * buildbot 15 | * hugo 16 | * Understand Continuous Delivery / Deployment (CD) 17 | 18 | -------------------------------------------------------------------------------- /sd/4/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD4.2 Extreme Programming 2 | 3 | ## Learning Outcomes 4 | 5 | * Discuss the principles of extreme programming and when to apply it. 6 | * Apply the principles in the context of an HPC project. 7 | 8 | -------------------------------------------------------------------------------- /sd/4/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD4.3 SCRUM 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe the concepts of SCRUM, e.g. Sprint, Backlog, Planning, Daily meetings/Stand up meeting, and project velocity. 6 | * Apply practices of SCRUM. 7 | 8 | -------------------------------------------------------------------------------- /sd/4/b.txt: -------------------------------------------------------------------------------- 1 | # SD4 Agile Methods 2 | 3 | Practices of agile software development are covered in order to convey skills about collaborative and self-organizing software development advocating adaptive planning, evolutionary development, and encouraging rapid and flexible response to change. 4 | 5 | ## Learning Outcomes 6 | 7 | * Apply agile test development practices in the context of HPC (see the sketch after this list). 8 | * Demonstrate extreme programming. 9 | * Analyse the concept of SCRUM for project management.
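As a pointer for the test-first outcome above, here is a minimal Python sketch using the standard unittest module; the function under test is a hypothetical example, and in a real project the production code would live in its own module rather than next to the tests.

```python
# Test-first sketch: the tests below are written first and drive the implementation.
import unittest

def moving_average(values, window):
    """Return the arithmetic mean of the last `window` values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return sum(values[-window:]) / window

class MovingAverageTest(unittest.TestCase):
    def test_last_two_values(self):
        self.assertAlmostEqual(moving_average([1.0, 2.0, 4.0], 2), 3.0)

    def test_invalid_window_raises(self):
        with self.assertRaises(ValueError):
            moving_average([1.0], 0)

if __name__ == "__main__":
    unittest.main()
```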
10 | 11 | ## Subskills 12 | 13 | * [[skill-tree:sd:4:1:b]] 14 | * [[skill-tree:sd:4:2:b]] 15 | * [[skill-tree:sd:4:3:b]] 16 | -------------------------------------------------------------------------------- /sd/5/b.txt: -------------------------------------------------------------------------------- 1 | # SD5 Software Quality 2 | 3 | ## Learnign objectives 4 | 5 | * Understand Coding Standards 6 | * Understand Code Quality 7 | * Apply software engineering methods and practices especially in the context of high-performance computing. 8 | * Develop parallel programs and to apply software engineering methods and best practices. 9 | * Assess code quality using different metrics, e.g. length of functions, length of files, lines of code, complexity metrics, code coverage. 10 | * Use static code analysis tools in order to calculate the metrics (e.g. http://cppcheck.sourceforge.net/). 11 | * Identify bad code structures (known as bad smells) in order to assess the quality of the code design. 12 | * Understand Refactoring 13 | * Apply common code refactorings in order to improve code quality, such as extract method, extract class, rename class and when it is suitable to apply which refactoring. 14 | * Apply refactoring that are specific to programming languages (e.g. Fortran). 15 | * Understand Code Reviews 16 | * Use a review system like Gerrit to organize the code reviews. 17 | * Document code review results and resulting tasks in an issue tracking system (for example Jira). 18 | * Define checklists for code reviews. 19 | * Conduct code reviews in pairs or in a team. 20 | * Document code review results and resulting tasks in an issue tracking system (for example Jira). 21 | * Demonstrate awareness of technical debt during software development and how to pay technical debt. 22 | 23 | -------------------------------------------------------------------------------- /sd/7/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD7.1 Requirements Documentation 2 | 3 | ## Learning Outcomes 4 | 5 | * Describe which information needs to be captured in a **requirements** document 6 | * Apply the IEEE standard for software requirements specification for a structured requirement specification. 7 | * Document requirements using a specified template. 8 | 9 | -------------------------------------------------------------------------------- /sd/7/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD7.2 Software Architecture and Software Design Documentation 2 | 3 | This skill covers how software architecture and design can appropriately be documented, e.g. using templates. 4 | In order to preserve knowledge about the main components of the software and the related decisions about why the software has been designed in this specific way, it is important to document them. 5 | 6 | ## Learning Outcomes 7 | 8 | * Document the different views of the software architecture according to a specific documentation framework, e.g. 4+1 views, Views and Beyond, architecture decision frameworks (e.g. Taylor, Olaf Zimmermann). 9 | * Apply a modeling language for documenting the design and the architecture, e.g. Unified Modeling Language (UML). 
10 | 11 | -------------------------------------------------------------------------------- /sd/7/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD7.3 Source Code Documentation 2 | 3 | ## Learning objectives 4 | 5 | * Document source code using documentation generators like doxygen, pydoc, or sphinx. 6 | * Produce a consistent source code documentation according to guidelines and best practices. 7 | 8 | -------------------------------------------------------------------------------- /sd/7/4/b.txt: -------------------------------------------------------------------------------- 1 | # SD7.4 Documentation for Reproducibility 2 | 3 | ## Learning Outcomes 4 | 5 | * Document all necessary information for end-users so that they are able to reproduce the results. 6 | * Document the software stack, build instructions, input data, results etc. 7 | * Use tools for literate programming like activepapers, knitr, or jupyter to document all necessary information for end-users so that they are able to reproduce the results, especially in the context of concurrency. 8 | 9 | -------------------------------------------------------------------------------- /sd/7/b.txt: -------------------------------------------------------------------------------- 1 | # SD7 Documentation 2 | 3 | ## Learning Outcomes 4 | 5 | * Experiment with requirements and document them. 6 | * Document the entire software architecture and design appropriately. 7 | * Demonstrate source code documentation tools. 8 | * Provide documentation for developers (e.g. describing the software architecture, for extending the software etc.), which is required for reproducing a program. 9 | 10 | ## Subskills 11 | 12 | * [[skill-tree:sd:7:1:b]] 13 | * [[skill-tree:sd:7:2:b]] 14 | * [[skill-tree:sd:7:3:b]] 15 | * [[skill-tree:sd:7:4:b]] 16 | -------------------------------------------------------------------------------- /sd/8/1/b.txt: -------------------------------------------------------------------------------- 1 | # SD8.1 Programming Snakemake Workflows 2 | 3 | ## Requirements 4 | 5 | * [[skill-tree:use:1:3:b]] 6 | * Python Programming 7 | 8 | ## Learning objectives 9 | 10 | * Develop Snakemake workflows by utilising rules, input/output files, and directives such as shell, script, run, "wrappers", and resource definitions. 11 | * Integrate custom Python scripts into Snakemake workflows for dynamic data manipulation and resource parameterization. 12 | * Handle automatic software deployment using Conda, or software provisioning with Apptainer/Singularity or module files. 13 | * Modularize existing Snakemake workflows into semantic units to improve maintainability and scalability. 14 | -------------------------------------------------------------------------------- /sd/8/2/b.txt: -------------------------------------------------------------------------------- 1 | # SD8.2 Contributing Snakemake Workflows 2 | 3 | ## Requirements 4 | 5 | * [[skill-tree:sd:8:1:b]] 6 | * [[skill-tree:sd:3:1:b]] 7 | 8 | ## Learning Objectives 9 | 10 | * Implement a Continuous Integration (CI) workflow on GitHub by adapting the CI workflow of the Snakemake Catalogue Template to ensure automated testing, linting, and formatting control for reliable build and release cycles. 11 | * Modularise existing Snakemake workflow code into semantic units to enhance maintainability and enable efficient sub-component testing. 12 | * Follow the standardised folder structure to simplify navigation through the code base.
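To make the rule/input/output structure named in the SD8.1 and SD8.2 objectives concrete, here is a minimal, hypothetical Snakefile sketch (Snakemake's rule syntax is a Python-based DSL); the file names, the Conda environment file, and the samtools call are illustrative assumptions only.

```
# Minimal Snakefile sketch (hypothetical file names and commands).
# Save as "Snakefile" and run e.g. with: snakemake --cores 1 --use-conda

rule all:
    input:
        "results/sample1.sorted.bam"

rule sort_bam:
    input:
        "data/sample1.bam"
    output:
        "results/sample1.sorted.bam"
    threads: 4
    conda:
        "envs/samtools.yaml"          # software deployment via Conda, as named in SD8.1
    shell:
        "samtools sort -@ {threads} -o {output} {input}"
```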
13 | -------------------------------------------------------------------------------- /sd/8/3/b.txt: -------------------------------------------------------------------------------- 1 | # SD8.3 Workflow Management Systems Principles 2 | 3 | Workflow Management Systems are tools to enable reproducible data analysis, especially if many data analysis processing steps are involved. 4 | 5 | Workflow Management Systems cover all data analysis from A to Z, e.g. data preprocessing, quality filtering, analysis and statistical evaluation of processed data. On HPC clusters this may include starting jobs, staging data to compute nodes, running the computations, deleting temporary data, generating publication-ready reports, archiving and cleaning work environments. They guarantee portability of their workflows across systems and transparency of their processing. 6 | 7 | ## Learning objectives 8 | 9 | * Explain the purpose and benefits of a Workflow Management System. 10 | * Use Conda, Apptainer/Singularity, or module files to manage software environments for workflow execution. 11 | * Develop maintainable workflows by building modular workflow structures. 12 | * Configure advanced resource management by setting up workflow-specific resources such as memory, CPU, or cluster configurations. 13 | * Integrate custom Python scripts for quick data manipulation or dynamic resource parameterization. 14 | * Diagnose and resolve workflow errors by identifying failed jobs, analysing logs, and restarting workflows from failed steps. 15 | 16 | -------------------------------------------------------------------------------- /sd/8/b.txt: -------------------------------------------------------------------------------- 1 | # SD8 Workflow Management Systems 2 | 3 | Workflow Management Systems are tools to enable reproducible data analysis, especially if many data analysis processing steps are involved. 4 | 5 | Workflow Management Systems cover all data analysis from A to Z, e.g. data preprocessing, quality filtering, analysis and statistical evaluation of processed data. On HPC clusters this may include starting jobs, staging data to compute nodes, running the computations, deleting temporary data, generating publication-ready reports, archiving and cleaning work environments. They guarantee portability of their workflows across systems and transparency of their processing. 6 | 7 | There are some Workflow Management Systems, such as Nextflow or Snakemake, which are designed to automate this process on HPC Clusters. 8 | 9 | ## Requirements 10 | 11 | * [[skill-tree:use:1:3:b]] 12 | 13 | ## Learning objectives 14 | 15 | * Develop Snakemake workflows for the automation of processes. 16 | * Demonstrate contributing Snakemake workflows to the community. 17 | * Break down workflow management systems and investigate their structure and purpose. 18 | 19 | ## Subskills 20 | 21 | * [[skill-tree:sd:8:1:b]] 22 | * [[skill-tree:sd:8:2:b]] 23 | * [[skill-tree:sd:8:3:b]] 24 | 25 | -------------------------------------------------------------------------------- /sd/b.txt: -------------------------------------------------------------------------------- 1 | # SD Software Development 2 | 3 | Software engineering is often neglected in computational science. 4 | However, it bears the potential to increase productivity by providing scaffolding for collaborative programming, reducing coding errors and increasing the manageability of software.
5 | 6 | ## Learning Outcomes 7 | 8 | * Apply software engineering methods and best practices when developing parallel applications. 9 | * Write modular, reusable code by applying software design principles like loose coupling and information hiding. 10 | * Configure and use integrated development environments (IDEs) to seamlessly perform the typical development cycle with the steps edit, build (compile and link), and test. 11 | * Recognize where (parallel) programming idioms are violated and refactor the code to comply with a specific programming idiom. 12 | * Define and establish coding standards and conventions in a project. 13 | * Apply version and configuration management to the development of (parallel) programs in order to track and control changes in the sources and to establish and maintain consistency of the program or software system throughout its life. 14 | * Appropriately document the entire software system. 15 | * Demonstrate using a workflow management system for a given use case. 16 | 17 | ## Subskills 18 | 19 | * [[skill-tree:sd:1:b]] 20 | * [[skill-tree:sd:2:b]] 21 | * [[skill-tree:sd:3:b]] 22 | * [[skill-tree:sd:4:b]] 23 | * [[skill-tree:sd:5:b]] 24 | * [[skill-tree:sd:5:b]] 25 | * [[skill-tree:sd:7:b]] 26 | * [[skill-tree:sd:8:b]] 27 | -------------------------------------------------------------------------------- /use/1/1/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.1 Command Line Interface 2 | 3 | HPC systems are usually accessed via a Linux-based Command Line Interface (CLI) that is provided by a shell. 4 | 5 | At its core, a shell is simply a convenient tool that you can use to execute commands on a Linux computer. 6 | The shell provides a textual interface allowing to interact with the operating system and performing all possible operations, i.e., accessing and manipulating files, and running programs. 7 | However, there are various misconceptions that new users typically face when handling a shell such as the Bash. 8 | Particularly, dealing with control characters and the format expected when executing programs with arguments can be error-prone. 9 | 10 | Part of this skill is the general principles of the interaction with a shell, to execute and to stop programs. 11 | 12 | ## Requirements 13 | 14 | 15 | ## Learning Outcomes 16 | 17 | * Utilize the bash shell to execute individual programs with arguments. 18 | * Describe the meaning of the exit code of a program. 19 | * Run multiple programs after another depending on the exit code ;, &&, ||. 20 | * List the set of basic programs and their tasks: 21 | * pwd 22 | * whoami 23 | * sleep 24 | * kill 25 | * echo 26 | * clear 27 | * man 28 | * vi, vim, emacs, nano 29 | * exit 30 | * Utilize the available help of a program (--help argument and the man pages). 31 | * Interrupt or abort a program execution: 32 | * CTRL-C 33 | * CTRL-Z 34 | * using kill -9 35 | * Use the shell history to search and execute previously executed commands. 36 | * Set and print shell variables. 37 | * Print all currently set variables 38 | * Identify potential special characters that must be handled with care. 39 | * List strings that could refer to files/directories 40 | * Utilize escaping to ensure correct handling of arguments. 41 | * Understand wildcard characters to select a group of files/directories: 42 | * * 43 | * ? 
44 | * [-,] 45 | * How to close popular command line text editors with/or without saving changes: 46 | * nano 47 | * vi 48 | * emacs 49 | 50 | -------------------------------------------------------------------------------- /use/1/2/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.2 Shell Scripts 2 | 3 | The shell provides a programming language that allows you to write programs (known as shell scripts), which helps to combine commands into more complex applications. 4 | 5 | Despite initially seeming cumbersome and inefficient to many, the real power of shell scripting quickly becomes apparent to those who use it. 6 | Everyday use is to automate repetitive tasks, which would otherwise be time-consuming to complete. 7 | 8 | This skill covers the bash shell. 9 | 10 | ## Requirements 11 | 12 | * [[skill-tree:use:1:1:b]] 13 | 14 | ## Learning Outcomes 15 | 16 | * Create a basic shell script that executes a sequence of programs. 17 | * Design a script using the bash construct "if" that handles conditional execution based on: 18 | * Performing a test for the existing of a file/directory, 19 | * Testing for the presence of certain text in a file, 20 | * Design a script that performs a task repeatedly using the bash "for" or "while" loop. 21 | * Utilize debugging options for troubleshooting of shell programs: 22 | * Options to bash: -e, -x 23 | * Use shell functions to break large, complex sequences, into a series of small functions. 24 | * Learn how to manipulate filenames. 25 | * Learn to manage temporary files: 26 | * Choose an adequate file system (or top directory) for temporary files. 27 | * Automatically generate a unique folder name for temporary files. 28 | * Automatically delete temporary folder whenever the script exits. 29 | 30 | -------------------------------------------------------------------------------- /use/1/3/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.3 UNIX File System Tree 2 | 3 | The UNIX file system is organized hierarchically according to the Filesystem Hierarchy Standard. 4 | 5 | UNIX follows the philosophy that everything is a file. Directories are indicated with a "/" separator and start from the root directory "/" and different devices are linked to this tree. 6 | 7 | Files and directories can be referred to either using absolute or relative file names. 8 | 9 | Typically, several elementary programs are pre-installed and allow access and manipulation of files and directories. 10 | 11 | ## Requirements 12 | 13 | ## Learning Outcomes 14 | 15 | * Describe the organization of a hierarchical file system. 16 | * Explain the basic UNIX permission model and metadata of files. 
17 | * Describe the Filesystem Hierarchy Standard and the relevance of the directories: 18 | * etc 19 | * home 20 | * opt 21 | * lib and /usr/lib 22 | * bin and /usr/bin 23 | * sbin and /usr/sbin 24 | * tmp 25 | * Utilize tools to navigate and traverse the file system: 26 | * ls (-R, -l) 27 | * cd 28 | * pushd/popd 29 | * stat 30 | * Use tools to read files: 31 | * cat 32 | * head/tail 33 | * less/more 34 | * cmp 35 | * Utilize tools to manipulate the file system: 36 | * mkdir/rmdir 37 | * touch 38 | * cp/mv 39 | * ln 40 | * Utilize tools to identify and manipulate permissions: 41 | * chmod 42 | * chown/chgrp 43 | 44 | -------------------------------------------------------------------------------- /use/1/4/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.4 Remote Access 2 | 3 | An HPC system is accessed remotely and has its particular file systems that contain data and programs to execute. 4 | Learning the ways of general interaction with the remote system and the tools involved is essential for newcomers. 5 | 6 | Users must connect to an HPC system typically using the Secure Shell (SSH), which then starts a shell and allows the interactive access. 7 | When the user disconnects, such a session is terminated. 8 | A server-sided session that persists after disconnection enables the user to execute long-running programs remotely and allows the continuation of a previous session seamlessly. 9 | Data transfer between a local user system (e.g., desktop or laptop) and a remote system requires special tools. 10 | Interactive access is also allowed. 11 | 12 | The tools discussed here are generally valid for systems using Linux, Mac, Windows, and also mobile devices. 13 | 14 | ## Requirements 15 | 16 | ## Learning Outcomes 17 | 18 | * Describe how SSH-authentication works: 19 | * Password authentication. 20 | * Public-key authentication. 21 | * The role of an authentication agent and the security implications. 22 | * Generate an SSH public/private key under Linux. 23 | * Register a key for public-key authentication using ssh-copy-id. 24 | * Perform a remote login from Linux using SSH. 25 | * Use SSH-agent or Windows equivalents. 26 | * Use Agent forwarding to connect to a third HPC system from an HPC system that you logged into with ssh from your computer. 27 | * Know when to use and how to create a config file. 28 | * Utilize tools to transfer data between the local and remote system: 29 | * scp 30 | * sftp 31 | * rsync (-avz) 32 | * Describe how SSHFS allows mounting a remote directory to a local directory for interactive usage (Mac/Linux only), e.g. for copying files or to transparently use your favourite graphical text editor on the local computer for editing files on the remote cluster. 33 | * Utilize screen and tmux to preserve a session between logins: 34 | * Creation of a session. 35 | * Detaching from the current session. 36 | * Resuming a previous session. 37 | 38 | -------------------------------------------------------------------------------- /use/1/5/1/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.5.1 Environment Modules 2 | 3 | The Environment Modules package is a widely used tool for managing information (module files) about installed software. 4 | It comes with integration into the shell. 
5 | 6 | ## Requirements 7 | 8 | ## Learning Outcomes 9 | 10 | * Comprehend that Modules can have dependencies and conflicts: 11 | * A Module can enforce that other Modules it depends on must be loaded before the Module itself can be loaded. 12 | * Conflicting modules must not be loaded at the same time (e.g. two versions of a compiler). 13 | * Query information about packages: 14 | * list 15 | * avail 16 | * whatis 17 | * search 18 | * display 19 | * help 20 | * Load/Unload software modules: 21 | * load/unload 22 | * purge 23 | * swap 24 | * switch 25 | * Describe the MODULEPATH variable. 26 | * Describe the general dependency structure of software. 27 | * Describe how a package manager makes software available. 28 | * Understand shell variables relevant for executing and building software: 29 | * PATH for executables 30 | * LD\_LIBRARY\_PATH for libraries 31 | * MANPATH for manual pages (man command) 32 | * PKG\_CONFIG\_PATH for pkg-config 33 | * Manipulate shell variables to include additional software: 34 | * Setting shell variables for a single program by prefixing or by using export. 35 | 36 | -------------------------------------------------------------------------------- /use/1/5/2/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.5.2 Spack 2 | 3 | Spack is a package manager for Linux and macOS. 4 | It allows the user to install software with all necessary dependencies and to load relevant software. 5 | Spack creates unique configurations for software. 6 | 7 | ## Learning Outcomes 8 | 9 | * Describe the concept of configurations and the hashing used. 10 | * Query information about packages: 11 | * list 12 | * info 13 | * find (also use -d to reveal their dependencies) 14 | * spec 15 | * Load/Unload software modules. 16 | * Describe the spec syntax to specify package configurations. 17 | 18 | -------------------------------------------------------------------------------- /use/1/5/3/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.5.3 Conda environments 2 | 3 | Some languages such as Python require modules or packages. 4 | Usually these are managed by specialized software environments such as conda. 5 | The benefit of these is that a user can install packages that are then available only to that user. 6 | 7 | ## Requirements 8 | 9 | * [[skill-tree:use:1:5:1:b]] 10 | 11 | ## Learning Outcomes 12 | 13 | * Understand what conda environments are for 14 | * Understand the difference between a module system and a conda environment 15 | * Discuss how such an environment can be constructed and what dependencies it has 16 | * Prepare a conda environment for basic Python packages 17 | * Analyse the loading time for a conda environment for different file storage locations 18 | -------------------------------------------------------------------------------- /use/1/5/b.txt: -------------------------------------------------------------------------------- 1 | # USE1.5 Software Environment 2 | 3 | HPC systems generally have multiple versions of several essential software tools and software environments installed. 4 | Package management tools provide access to this wide variety of software. 5 | A user has to load the software for the current shell session to make commands available.
6 | 7 | The widely available software tools are: 8 | * Environment modules 9 | * SPACK 10 | * Conda 11 | 12 | ## Learning Outcomes 13 | 14 | * Query available software using the package manager and select the appropriate versions for deployment in the session environment. 15 | * Describe the MODULEPATH variable and understand shell variables relevant for executing and building software 16 | * Query information about packages using SPACK and install as well as load and unload software modules 17 | * Understand what conda environments are for and how a user can install their own environments 18 | 19 | ## Subskills 20 | 21 | * [[skill-tree:use:1:5:1:b]] 22 | * [[skill-tree:use:1:5:2:b]] 23 | * [[skill-tree:use:1:5:3:b]] 24 | -------------------------------------------------------------------------------- /use/1/b.txt: -------------------------------------------------------------------------------- 1 | # USE1 Cluster Operating System 2 | 3 | HPC systems are usually accessed via a Linux-based Command Line Interface (CLI). 4 | Via the CLI, users can run programs to manipulate files, load additional software, or execute programs. 5 | 6 | While there are many similarities to single-user Linux systems, HPC systems differ widely in the management of provided Software packages, and users must pay attention to the locality of data. 7 | 8 | ## Learning Outcomes 9 | 10 | * Describe the command line interface and the bash shell as well es perform basic commands. 11 | * Use and write basic shell scripts, e.g., to automate the execution of several commands. 12 | * Sketch the organization of the typical file system tree. 13 | * Utilize essential tools to navigate and manage files. 14 | * Understand how to access a Cluster via SSH. 15 | * Understand terminal multiplexers and use them to run longer commands without keeping a connection alive. 16 | * Utilize tools to transfer data between a desktop/laptop system and a remote HPC system. 17 | * Utilize package management tools that provide access to a wide variety of software. 18 | 19 | ## Subskills 20 | 21 | * [[skill-tree:use:1:1:b]] 22 | * [[skill-tree:use:1:2:b]] 23 | * [[skill-tree:use:1:3:b]] 24 | * [[skill-tree:use:1:4:b]] 25 | * [[skill-tree:use:1:5:b]] 26 | -------------------------------------------------------------------------------- /use/2/1/b.txt: -------------------------------------------------------------------------------- 1 | # USE2.1 Job Scheduling 2 | 3 | Parallel computers are operated differently than a normal PC, all users must share the system. 4 | Therefore, various operative procedures are in place. 5 | Users must understand these concepts and procedures to be able to use the available resources of a system to run a parallel application. 6 | Moreover, individual solutions can often be found in a specific system. 7 | 8 | ## Learning Outcomes 9 | 10 | * Run parallel programs in an HPC environment. 11 | * Use the command-line interface. 12 | * Write robust job scripts, e.g. to simplify job submissions by the help of automated job chaining. 13 | * Select the appropriate software environment. 14 | * Use a workload manager like SLURM or TORQUE to allocate HPC resources (e.g. CPUs) and to submit a batch job. 15 | * Consider cost aspects. 16 | * Measure system performance as a basis for benchmarking a parallel program. 17 | * Benchmark a parallel program. 18 | * Tune a parallel program from the outside via runtime options. 19 | * Apply the workflow for tuning. 
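The job-chaining outcome above is often scripted rather than done by hand; the following is a minimal Python sketch that submits one job script and chains a second one behind it using SLURM's dependency mechanism. The script names are hypothetical, and it assumes the sbatch command is available on the login node.

```python
# Minimal job-chaining sketch for SLURM (assumes sbatch is available and that
# preprocess.sh / analyse.sh are existing, hypothetical job scripts).
import subprocess

def submit(*args):
    """Run sbatch --parsable and return the numeric job ID as a string."""
    out = subprocess.run(["sbatch", "--parsable", *args],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip().split(";")[0]   # --parsable prints "jobid[;cluster]"

first = submit("preprocess.sh")
# Start the second job only if the first one finishes successfully.
second = submit(f"--dependency=afterok:{first}", "analyse.sh")
print(f"submitted {first} -> {second}")
```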
20 | 21 | -------------------------------------------------------------------------------- /use/2/2/b.txt: -------------------------------------------------------------------------------- 1 | # USE2.2 Job Scripts 2 | 3 | Job Scripts are a more efficient and more powerful way of running jobs on an HPC system. 4 | 5 | ## Requirements 6 | 7 | ## Learning Outcomes 8 | 9 | * Use the command-line interface. 10 | * Write robust job scripts, e.g. to simplify job submissions by the help of automated job chaining. 11 | * Select the appropriate software environment. 12 | * Use a workload manager to allocate HPC resources for running a parallel program interactively. 13 | * Recognize cost aspects. 14 | * Measure system performance as a basis for benchmarking a parallel program. 15 | * Benchmark a parallel program. 16 | * Tune a parallel program from the outside via runtime options. 17 | * Apply the workflow for tuning. 18 | -------------------------------------------------------------------------------- /use/2/b.txt: -------------------------------------------------------------------------------- 1 | # USE2 Running of Parallel Programs 2 | 3 | Parallel computers are operated differently than a normal PC; all users must share the system. 4 | Therefore, various operative procedures are in place. 5 | Users must understand these concepts and procedures to be able to use the available resources of a system to run a parallel application. 6 | Moreover, individual solutions can often be found in a specific system. 7 | 8 | ## Learning Outcomes 9 | 10 | * Use a workload manager like SLURM or TORQUE to allocate HPC resources (e.g. CPUs) and to submit a batch job. 11 | * Write robust job scripts, e.g. to simplify job submissions by the help of automated job chaining. 12 | 13 | ## Subskills 14 | 15 | * [[skill-tree:use:2:1:b]] 16 | * [[skill-tree:use:2:2:b]] 17 | -------------------------------------------------------------------------------- /use/3/b.txt: -------------------------------------------------------------------------------- 1 | # USE3 Building of Parallel Programs 2 | 3 | Building programs that run on an HPC system requires some thought about software, compiling, linking, and running. 4 | 5 | ## Requirements 6 | 7 | * [[skill-tree:use:1:b]] 8 | * [[skill-tree:use:2:b]] 9 | 10 | ## Learning Outcomes 11 | 12 | * Build parallel programs, e.g. via open-source packages. 13 | * Run parallel programs in an HPC environment. 14 | * Use a compiler and assess the effects of optimization switches available for the relevant compilers (e.g. GNU, Intel). 15 | * Use a linker and assess the effects of linker-specific options and environment variables (e.g. -L and LIBRARY_PATH, -rpath and LD_RUN_PATH). 16 | * Configure the relevant settings (e.g. by setting compiler and linker options), which determine how the application ought to be built with regard to the parallelization technique(s) used (e.g. OpenMP, MPI). 17 | * Use software building environments like make, Autotools, CMake. 18 | * Run parallel programs in an HPC environment. 19 | * Use a compiler and assess the effects of optimization switches available for commercial compilers (e.g. PGI, NAG). 20 | * Use efficient open-source libraries (e.g. OpenBLAS, FFTW) or highly optimized vendor libraries (e.g. Intel-MKL, IBM-ESSL). 21 | * Configure the relevant settings (e.g. by setting compiler and linker options), which determine how the application ought to be built with regard to the parallelization technique(s) used (e.g. OpenACC, C++ AMP).
-------------------------------------------------------------------------------- /use/4/b.txt: --------------------------------------------------------------------------------
# USE4 Developing Parallel Programs

In HPC, parallelization brings huge performance benefits and is omnipresent. That brings challenges when developing software: many concepts have to be considered to write parallel programs well.

## Requirements

## Learning Outcomes

* Develop parallel software.
* Code parallel programs.
* Analyze and debug parallel programs.
* Articulate the problems caused by synchronization issues such as race conditions and deadlocks.
* Analyze and optimize the performance of parallel applications.
-------------------------------------------------------------------------------- /use/5/b.txt: --------------------------------------------------------------------------------
# USE5 Automating Common Tasks

Many menial tasks in HPC can be automated: cron jobs that start shell scripts to back up data, batches that submit several jobs to a scheduler, or simple scripts that perform several steps in one go so the user does not have to. Automation makes your work more effective, freeing up your time for more complex work.

## Requirements

* [[skill-tree:use:2:b]]

## Learning Outcomes

* Recognize which tasks are eligible for automation.
* Create simple scripts to execute shell commands (a small sketch follows after this list).
* Distinguish different types of commands and loops.
* Articulate when it is more efficient to use a shell script rather than a program, and when the opposite holds.
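The following is a minimal sketch of such an automation; the script name, all paths, and the schedule are illustrative placeholders.

```bash
#!/bin/bash
# backup_results.sh -- copy results into a dated archive directory (paths are hypothetical).
# A matching crontab entry (added via `crontab -e`) could run it every night at 02:00:
#   0 2 * * * $HOME/backup_results.sh
set -euo pipefail

SRC="$HOME/project/results"                  # hypothetical results directory
DEST="/archive/$USER/results-$(date +%F)"    # hypothetical archive location, dated by day

mkdir -p "$DEST"
rsync -a "$SRC/" "$DEST/"                    # -a preserves timestamps and permissions
echo "$(date): backup of $SRC finished" >> "$HOME/backup.log"
```

On many clusters, long-running or heavy transfers are better submitted as scheduled jobs than run from cron; which variant is appropriate depends on the site's policies.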
-------------------------------------------------------------------------------- /use/6/1/b.txt: --------------------------------------------------------------------------------
# USE6.1 Selecting Workflows

Users need to select appropriate, pre-built workflows from established repositories for use in various computational and research tasks. Learners will develop an understanding of how to assess and select workflows that meet the specific needs of their project, whether using the nf-core repository for Nextflow workflows or the Snakemake Workflow Catalog for Snakemake-based workflows.

## Learning Objectives

* Understanding Workflow Repositories:
    * for Nextflow: Recall the purpose and structure of the nf-core repository, including its community-curated Nextflow pipelines.
    * for Snakemake: Describe the Snakemake Workflow Catalog as a collection of curated, tested workflows available for reuse.
* Evaluate the suitability of an available workflow for a project based on input/output formats, computational environment, and data types.
* Verify a workflow’s capability by checking for proper dependency management (e.g., Conda, Apptainer, or module file definitions) and comprehensive documentation.
* Assess whether a workflow has been checked for Continuous Integration (CI) compatibility and evaluate its maintenance status.
* Identify the types of data required for workflow input and analyse whether the available data match the workflow design.
-------------------------------------------------------------------------------- /use/6/2/b.txt: --------------------------------------------------------------------------------
# USE6.2 Workflow Configuration

## Learning Objectives

* Determine and specify the appropriate input files (e.g., raw data, reference files) required by a workflow, ensuring correct dataset selection for analysis.
* Configure input file paths and directories in workflow configuration files (e.g., config.yaml in Snakemake or params in Nextflow) to accommodate local or remote data locations.
* Modify and set programme-specific parameters (e.g., tuning alignment tools for long or short reads, quality filtering criteria, algorithm selection for statistical data evaluation) within configuration files to tailor the workflow to specific project requirements or experimental conditions.
* Explain the importance of utilising dedicated file systems for inputs and outputs (e.g., launching workflows from a HOME file system, directing outputs to dedicated or high-performance file systems) and configure workflows accordingly to optimise data management.
-------------------------------------------------------------------------------- /use/6/3/b.txt: --------------------------------------------------------------------------------
# USE6.3 Workflow Parameterization for HPC Clusters

## Requirements

* [[skill-tree:use:2:2:b]]

## Learning Objectives

* Identify the specific computational requirements (e.g., node configurations, wall time limits) necessary for running workflows efficiently in high-performance computing (HPC) environments.
* Set execution parameters (e.g., job duration, memory allocation, number of threads/MPI ranks) using workflow profile files to optimise resource usage on HPC systems, and make effective use of templates where available.
* Tune workflows to exploit concurrent execution capabilities of HPC systems (e.g. pooling jobs or using job arrays) by adjusting parameters to effectively distribute tasks across multiple nodes or cores, including those for shared memory tools.
* Set parameters related to input/output operations (e.g., data stage-in) to mitigate potential causes of I/O contention.
-------------------------------------------------------------------------------- /use/6/4/b.txt: --------------------------------------------------------------------------------
# USE6.4 Running Workflows

## Requirements

* [[skill-tree:use:5:1:3:b]]
* [[skill-tree:use:4:2:b]]
* optionally: [[skill-tree:sd:3:1:b]]

## Learning Objectives

* Identify HPC job scheduling systems (e.g. SLURM, LSF, HT-Torque) and describe their role in managing and executing workflows.
* Submit workflows to HPC systems, applying the appropriate Snakemake plugins (see the sketch after this list).
* Utilise job monitoring tools (e.g. squeue/sacct for SLURM) to track the status and performance of running workflows and identify potential issues or bottlenecks.
* Diagnose and troubleshoot common errors:
    * Interpret workflow logs to assess the execution process, identify issues, and validate the correctness of generated results.
    * Report and differentiate between programme failures (due to bugs), workflow or workflow manager issues, and HPC system-level problems (e.g. file system or node failures).
* Collect and manage output data generated during execution, instructing the Workflow Management System to produce publication-ready reports.
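The exact command line depends on the workflow manager and its version; as an illustrative sketch, a Snakemake workflow might be submitted to SLURM and monitored like this (the profile directory name is hypothetical, and older Snakemake releases use `--cluster "sbatch ..."` instead of the executor plugin):

```bash
# Submit the workflow so that each rule runs as its own SLURM job
# (recent Snakemake versions with the SLURM executor plugin installed)
snakemake --executor slurm --jobs 50 --profile profiles/slurm

# Track the generated jobs while the workflow runs
squeue -u $USER      # jobs currently pending or running
sacct -u $USER -X    # accounting summary of today's jobs, one line per job
```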
-------------------------------------------------------------------------------- /use/6/b.txt: --------------------------------------------------------------------------------
# USE6 Workflow Management Systems

Workflow Management Systems are basic tools for reproducible data analysis, especially if many data analysis steps are involved.

Workflow Management Systems cover all data analysis from A to Z, e.g. data preprocessing, quality filtering, analysis, and statistical evaluation of processed data. This may include starting jobs, staging data to compute nodes, running the computations, deleting temporary data, generating publication-ready reports, and archiving and cleaning work environments.

Some Workflow Management Systems, such as Nextflow and Snakemake, are designed to automate this process on HPC clusters.

## Learning Objectives

* Explain the purpose and benefits of a Workflow Management System.
* Identify and select an appropriate workflow from a workflow catalogue.
* Recall the importance of reproducibility in research and describe how Workflow Management Systems ensure it in HPC environments.
* Configure workflows to run optimally on a particular system (e.g. selecting appropriate queues, parameterising resources, and choosing input files from various file systems).
* Diagnose and resolve errors by identifying failed jobs, viewing logs, and restarting workflows from failed steps.

## Subskills

* [[skill-tree:use:6:1:b]]
* [[skill-tree:use:6:2:b]]
* [[skill-tree:use:6:3:b]]
* [[skill-tree:use:6:4:b]]
-------------------------------------------------------------------------------- /use/7/b.txt: --------------------------------------------------------------------------------
# USE7 Post-processing Tools

Post-processing is relevant for all applications running on an HPC system.
The process of generating results is very different from a home desktop or laptop workflow.
Software that can process small amounts of data usually struggles to visualize the larger datasets typically generated by an HPC application.

## Learning Outcomes

* Understand types of post-processing.
* Analyze post-processing workflows.
* Understand visualization tools such as ParaView.
* Understand CDO (Climate Data Operators); a small sketch follows after this list.
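As a minimal sketch of command-line post-processing with CDO, assuming a hypothetical NetCDF result file named output.nc that contains a temperature variable:

```bash
# Show the variables and grid of the (hypothetical) result file
cdo sinfov output.nc

# Reduce the data before visualization: compute the mean over all time steps
cdo timmean output.nc output_mean.nc

# Extract one variable over a lon/lat box (variable name and region are placeholders)
cdo -selname,temperature -sellonlatbox,-10,30,35,70 output.nc temperature_europe.nc
```

The reduced files are usually small enough to copy to a workstation and inspect there, e.g. with ParaView.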
-------------------------------------------------------------------------------- /use/b.txt: --------------------------------------------------------------------------------
# USE Use of the HPC Environment

HPC environments are different from local systems and cloud environments.

Practitioners typically face initial challenges in utilizing such a system efficiently.
Moreover, data centers often deploy a specific solution to set up and execute parallel applications.
That means users must be trained in a particular solution to utilize a supercomputer efficiently.

Various user roles are covered as part of this subtree: practitioners who aim to deploy existing parallel applications, testers who just run already deployed applications, and developers who create new applications.

## Learning Outcomes

* Apply tools provided by the operating system to navigate and manage files and executables.
* Use a workload manager to allocate HPC resources and run jobs using scripts.
* Select the software environment to build existing open-source projects.
* Develop novel parallel applications effectively.
* Design and deploy scripts that automate repetitive tasks.
* Construct workflows that utilize remote (distributed) environments to execute a parallel workflow.
* Illustrate the use of post-processing tools for analyzing and visualizing results.

## Subskills

* [[skill-tree:use:1:b]]
* [[skill-tree:use:2:b]]
* [[skill-tree:use:3:b]]
* [[skill-tree:use:4:b]]
* [[skill-tree:use:5:b]]
* [[skill-tree:use:6:b]]
* [[skill-tree:use:7:b]]
--------------------------------------------------------------------------------