├── .gitignore ├── imgs ├── slack.png ├── ansible.png ├── passwords.png ├── hacc-boxes.png ├── networking.png ├── spine-leaf.png ├── booking-system.png ├── infrastructure.png ├── account-renewal.png ├── booking-a-server.png ├── accessing-a-server.png ├── installed-xilinx-tools.png ├── reverting-to-vitits-workflow.png └── validating-a-xilinx-accelerator-card.png ├── hacc-removebg.png ├── docs ├── overleaf │ ├── known-limitations.md │ ├── account-renewal.md │ ├── technical-support.md │ ├── who-does-what.md │ ├── booking-system.md │ ├── features.md │ ├── operating-the-cluster.md │ ├── first-steps.md │ ├── infrastructure.md │ └── vocabulary.md ├── account-renewal.md ├── technical-support.md ├── booking-system.md ├── features.md ├── who-does-what.md ├── overleaf_export.sh ├── operating-the-cluster.md ├── first-steps.md ├── vocabulary.md └── infrastructure.md ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | 3 | .vscode 4 | 5 | localhost.yml -------------------------------------------------------------------------------- /imgs/slack.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/slack.png -------------------------------------------------------------------------------- /hacc-removebg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/hacc-removebg.png -------------------------------------------------------------------------------- /imgs/ansible.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/ansible.png -------------------------------------------------------------------------------- /imgs/passwords.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/passwords.png -------------------------------------------------------------------------------- /imgs/hacc-boxes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/hacc-boxes.png -------------------------------------------------------------------------------- /imgs/networking.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/networking.png -------------------------------------------------------------------------------- /imgs/spine-leaf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/spine-leaf.png -------------------------------------------------------------------------------- /imgs/booking-system.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/booking-system.png -------------------------------------------------------------------------------- /imgs/infrastructure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/infrastructure.png -------------------------------------------------------------------------------- /imgs/account-renewal.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/account-renewal.png -------------------------------------------------------------------------------- /imgs/booking-a-server.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/booking-a-server.png -------------------------------------------------------------------------------- /imgs/accessing-a-server.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/accessing-a-server.png -------------------------------------------------------------------------------- /imgs/installed-xilinx-tools.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/installed-xilinx-tools.png -------------------------------------------------------------------------------- /imgs/reverting-to-vitits-workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/reverting-to-vitits-workflow.png -------------------------------------------------------------------------------- /imgs/validating-a-xilinx-accelerator-card.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fpgasystems/hacc/HEAD/imgs/validating-a-xilinx-accelerator-card.png -------------------------------------------------------------------------------- /docs/overleaf/known-limitations.md: -------------------------------------------------------------------------------- 1 | # Known limitations 2 | 3 | * The U250 and U280 servers—as well as the Versal server—are virtualized. 4 | * Not all servers in the U250 cluster support the Vivado workflow (as they do not have a USB - JTAG connection). The same is true for the Versal server. 5 | * None of the QSFP28 FPGA interfaces are directly connected to other FPGAs—in a point-to-point topology—but to the corresponding leaf switch. 6 | -------------------------------------------------------------------------------- /docs/overleaf/account-renewal.md: -------------------------------------------------------------------------------- 1 | # Account renewal 2 | 3 | The guest account expires after one year. A month before the deadline, you will receive an email reminder. If you plan to keep using the ETHZ-HACC, please write back to javier.moyapaya@inf.ethz.ch an email like the one in the picture below: 4 | 5 | ![Account renewal.](./account-renewal.png "Account renewal.") 6 | 7 | Once the account expires, all user data is automatically deleted from its home directory and the account cannot be restored. 8 | -------------------------------------------------------------------------------- /docs/overleaf/technical-support.md: -------------------------------------------------------------------------------- 1 | # Technical support 2 | 3 | As mentioned **here,** we do not provide technical support and you should not use any of ETH’s emails with that purpose. Instead, please write to research_clusters@amd.com or join the **HACC Slack workspace** to interact with other researchers. 
4 | 5 | ![HACC Slack workspace.](./slack.png "HACC Slack workspace.") 6 | 7 | ## Accounts and accessibility 8 | For support related to **accounts and accessibility,** please contact Kassiem Jacobs at kassiem.jacobs@inf.ethz.ch. 9 | 10 | ## Infrastructure 11 | For topics related to the **ETHZ-HACC infrastructure,** please contact Javier Moya at javier.moyapaya@inf.ethz.ch. 12 | -------------------------------------------------------------------------------- /docs/account-renewal.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Account renewal 8 | 9 | The guest account expires after one year. A month before the deadline, you will receive an email reminder. If you plan to keep using the ETHZ-HACC, please write back to [hacc@ethz.ch](mailto:hacc@ethz.ch) an email like the one in the picture below: 10 | 11 | ![Account renewal.](../imgs/account-renewal.png "Account renewal.") 12 | *Account renewal.* 13 | 14 | Once the account expires, all user data is automatically deleted from its home directory and the account cannot be restored. -------------------------------------------------------------------------------- /docs/overleaf/who-does-what.md: -------------------------------------------------------------------------------- 1 | # Who does what 2 | The following people are relevant for our ETHZ-Heterogeneous Compute Accelerated Cluster (ETHZ-HACC) project. **For specific technical issues or requests,** please follow our dedicated **Technical support** page and avoid using the references below. 3 | 4 | ## AMD University Program coordinators 5 | * **Cathal McCabe**, **AMD University Program** Manager EMEA, AMD 6 | * **Gustavo Alonso**, Systems Group leader, ETH Zürich 7 | * **Javier Moya**, ETHZ-HACC coordinator, ETH Zürich 8 | * **Mario Ruiz**, **AMD University Program** coordinator, AMD 9 | * **Michaela Blott**, Senior Fellow, AMD 10 | 11 | ## Operations 12 | 13 | ### Systems Group 14 | * **Kassiem Jacobs**, Systems Administrator, **DCIM** 15 | 16 | ### IT Service Group (ISG) 17 | * **Manuel Maestinger**, Linux systems 18 | * **Matthias Gabathuler**, Cloud technologies, **DevOps** 19 | * **Martin Sedler**, **DCIM** leader 20 | * **Stefan Walter**, Operations coordinator 21 | -------------------------------------------------------------------------------- /docs/technical-support.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Technical support 8 | 9 | As mentioned [here,](https://www.xilinx.com/member/xup_research_clusters.html) we do not provide technical support and you should not use any of ETH’s emails with that purpose. Instead, please write to [research_clusters@amd.com](mailto:research_clusters@amd.com) or join the [HACC Slack workspace](https://join.slack.com/t/amdhacc/shared_invite/zt-1kcx60g41-0PRQUtlpmw7CGeRfwxtyXA) to interact with other researchers. 10 | 11 | ![HACC Slack workspace.](../imgs/slack.png "HACC Slack workspace.") 12 | *HACC Slack workspace.* 13 | 14 | For support related to **accounts and accessibility** or **ETHZ-HACC infrastructure,** please contact [hacc@ethz.ch.](hacc@ethz.ch) -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 FPGA @ Systems Group, ETH Zurich 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /docs/overleaf/booking-system.md: -------------------------------------------------------------------------------- 1 | # Booking system 2 | Before connecting to any HACC servers, you must make a reservation through the **booking system**. To use the booking system, please remember the following: 3 | 4 | * You must be connected to the ETH network in order to access it, and 5 | * Use your **main LDAP/Active directory password** as a part of your credentials. 6 | 7 | **Please notice that you do not need to book the build servers.** 8 | 9 | ![Booking system.](./booking-system.png "Booking system.") 10 | 11 | ## Booking rules 12 | The HACCs are a collaborative hub; many people may want access to the same limited resources. The following simple rules should help: 13 | 14 | 1. We kindly ask users to limit the bookings to the shortest period. 15 | 2. Release the booking as soon as possible if you finish early; other users may be waiting. 16 | 3. The default maximum booking time is 5 hours. 17 | 4. Note that you can book resources for the future if you need the hardware at a particular time. 18 | 5. Don’t assume there will be limited demand for a HACC during the night: other users in different time zones may want to access the HACC. 19 | 6. 
Do not systematically extend your booking times: if you predict that your experiments will require longer, submit a request to research_clusters@amd.com outlining your needs so we can arrange a plan. 20 | -------------------------------------------------------------------------------- /docs/overleaf/features.md: -------------------------------------------------------------------------------- 1 | # Features 2 | 3 | ## Systems Group RunTime 4 | By using **Systems Group RunTime**, you can easily: 5 | 6 | * Validate the fundamental infrastructure functionality of ETHZ-HACC, ensuring its reliability and performance. 7 | * Expedite project creation through intuitive multiple-choice dialogs and templates, streamlining the development of your accelerated applications. 8 | * Seamlessly integrate with **GitHub CLI** for efficient version control and collaboration. 9 | * Effectively manage ACAPs, FPGAs, multi-core CPUs, and GPUs, all through a unified device index. 10 | * Transition between **Vivado and Vitis workflows** effortlessly, eliminating the need for system reboots and enhancing your development agility. 11 | * Design, build, and deploy modern FPGA-accelerated applications with ease, leveraging your own or third-party integrations, for instance, ACCL, Coyote, or EasyNet (please refer to our **Infrastructure for heterogeneous architectures and hardware acceleration**). 12 | * Explore model-based design principles with readily available *out-of-the-box* examples. 13 | * Simplify the creation of HIP/ROCm GPU applications using the *sgutil new-build-run hip* commands. 14 | 15 | 16 | 17 | ## Managed platform 18 | We use **Infrastructure as Code (IaC)** methodology, we achieve the following: 19 | 20 | * Continuos delivery, 21 | * Avoid manual configuration to enforce consistency, 22 | * Use declarative definition files helping to document the cluster, 23 | * Easily follow Xilinx’s tools versioning release schedule, and 24 | * Deliver stable tests environments rapidly at scale. 25 | -------------------------------------------------------------------------------- /docs/booking-system.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Booking system 8 | Before connecting to any HACC servers, you must make a reservation through the [booking system](https://hacc-booking.ethz.ch/login.php). To use the booking system, please remember the following: 9 | 10 | * You must be connected to the ETH network in order to access it, and 11 | * Use your **main LDAP/Active directory password** as a part of your credentials. 12 | 13 | **Please notice that you do not need to book the build servers.** 14 | 15 | ![Booking system.](../imgs/booking-system.png "Booking system.") 16 | *Booking system.* 17 | 18 | ## Booking rules 19 | The HACCs are a collaborative hub; many people may want access to the same limited resources. The following simple rules should help: 20 | 21 | 1. We kindly ask users to limit the bookings to the shortest period. 22 | 2. Release the booking as soon as possible if you finish early; other users may be waiting. 23 | 3. The default maximum booking time is 5 hours. 24 | 4. Note that you can book resources for the future if you need the hardware at a particular time. 25 | 5. Don’t assume there will be limited demand for a HACC during the night: other users in different time zones may want to access the HACC. 26 | 6. Do not systematically extend your booking times: if you predict that your experiments will require longer, submit a request to research_clusters@amd.com outlining your needs so we can arrange a plan. -------------------------------------------------------------------------------- /docs/overleaf/operating-the-cluster.md: -------------------------------------------------------------------------------- 1 | # Operating the cluster 2 | 3 | Our HACC is provisioned and managed based on **Infrastructure as Code (IaC)**. Just as the same source code always generates the same binary, an IaC model generates the same environment every time it deploys. This allows us to reset the whole infrastructure to a defined state at any time. In fact, we can re-install and set up from scratch—without other interaction—all of the servers in our cluster in about an hour. 4 | 5 | The following figure shows a simplified model of HACC’s Ansible automation platform: 6 | 7 | ![Ansible Automation Platform (AAP).](./ansible.png "Ansible Automation Platform (AAP).") 8 | 9 | The playbooks defining our cluster are grouped into two categories: **IaaS**, virtual machines, networking setup, load balancers, connection topologies, and Debian package installation. With the PaaS playbooks, we take care of installing the software allowing users to develop their heterogeneous accelerated applications. 10 | 11 | **Thanks to IaC and AAP, we can easily follow Xilinx’s tools versioning release schedule as mentioned in **Releases.**** 12 | 13 | ## Pipelines 14 | 15 | In order to maintain the health and performance of our cluster, two different pipelines are executed to ensure servers sanity and optimal functionality: 16 | 17 | * Weekly pipeline on build servers: Deals with memory leak mitigation, resource cleanup, consistent performance, and improved reliability. 18 | * Daily pipeline on deployment servers: In addition, ensures that all servers are reverted to the **Vitis Workflow** *at the beginning of the day.* 19 | 20 | The exact pipeline execution times are reflected on the **booking system** itself. 21 | -------------------------------------------------------------------------------- /docs/features.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Features 8 | 9 | ## Managed cluster 10 | We use [Infrastructure as Code (IaC)](./vocabulary.md#infrastructure-as-code-iac) and [Ansible Automation Platform (AAP)](./vocabulary.md#ansible-automation-platform-aap) to [operate the cluster](../docs/operating-the-cluster.md#operating-the-cluster) efficiently. Under the scope of a [DevOps](./vocabulary.md#devops) methodology, we achieve the following: 11 | 12 | * Continuos delivery, 13 | * Avoid manual configuration to enforce consistency, 14 | * Use declarative definition files helping to document the cluster, 15 | * Easily follow AMD’s tools versioning release schedule, and 16 | * Deliver stable tests environments rapidly at scale. 17 | 18 | ## HACC Development (hdev) 19 | By using [HACC Development (hdev),](https://github.com/fpgasystems/hdev) you can easily: 20 | 21 | * **Validate the fundamental functionality** of ETHZ-HACC, ensuring its reliability and performance. 22 | * Expedite project creation through intuitive multiple-choice dialogs and templates, **streamlining the development of your accelerated applications.** 23 | * Seamlessly integrate with [GitHub CLI](https://cli.github.com) for efficient **version control and collaboration.** 24 | * Effectively manage ASoCs, FPGAs, multi-core CPUs, and GPUs, **all through a unified device index.** 25 | * Transition between [Vivado and Vitis workflows](./vocabulary.md#vivado-and-vitis-workflows) effortlessly, eliminating the need for system reboots and enhancing your development agility. 26 | * **Design, build, and deploy modern accelerated applications with ease,** leveraging your own or third-party integrations. 27 | 28 | -------------------------------------------------------------------------------- /docs/who-does-what.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Who does what 8 | The following people are relevant for our ETHZ-Heterogeneous Compute Accelerated Cluster (ETHZ-HACC) project. **For specific technical issues or requests,** please follow our dedicated [Technical support](https://github.com/fpgasystems/hacc/blob/main/docs/technical-support.md#technical-support) page and avoid using the references below. 9 | 10 | ## AMD University Program coordinators 11 | * [Cathal McCabe](https://www.linkedin.com/in/cathalmccabe/), [AMD University Program](https://www.xilinx.com/support/university/XUP-HACC.html) Manager EMEA, AMD 12 | * [Gustavo Alonso](https://systems.ethz.ch/people/profile.gustavo-alonso.html), Systems Group leader, ETH Zürich 13 | * [Javier Moya](https://systems.ethz.ch/people/profile.Mjk5NjU5.TGlzdC8zODkxLDEyOTU2NDI2OTI=.html), ETHZ-HACC coordinator, ETH Zürich 14 | * [Mario Ruiz](https://www.linkedin.com/in/mario-ruiz-noguera/), [AMD University Program](https://www.xilinx.com/support/university/XUP-HACC.html) coordinator, AMD 15 | * [Michaela Blott](https://www.linkedin.com/in/michaelablott/?originalSubdomain=ie), Senior Fellow, AMD 16 | 17 | ## Operations 18 | 19 | ### Systems Group 20 | * [Kassiem Jacobs](https://biol.ethz.ch/en/the-department/people/person-detail.MTI1MDA2.TGlzdC80NjAsOTIzMDMxMjIy.html), Systems Administrator, [DCIM](./vocabulary.md#dcim), [Booking system](./booking-system.md#booking-system) 21 | 22 | ### IT Service Group (ISG) 23 | * [Manuel Maestinger](https://www.isg.inf.ethz.ch/Main/ManuelMaestinger), Linux systems 24 | * [Matthias Gabathuler](https://www.isg.inf.ethz.ch/Main/MatthiasGabathuler), Cloud technologies, [DevOps](./vocabulary.md#devops) 25 | * [Martin Sedler](https://www.isg.inf.ethz.ch/Main/MartinSedler), [DCIM](./vocabulary.md#dcim) leader 26 | * [Stefan Walter](https://www.isg.inf.ethz.ch/Main/StefanWalter), Operations coordinator -------------------------------------------------------------------------------- /docs/overleaf_export.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | bold=$(tput bold) 4 | normal=$(tput sgr0) 5 | 6 | # Constants 7 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 8 | 9 | # Change to the directory 10 | cd "$DIR" 11 | 12 | # Remove overleaf directory 13 | if [ -d "$DIR/overleaf" ]; then 14 | rm -r "$DIR/overleaf" 15 | fi 16 | 17 | # Create overleaf directory 18 | mkdir "$DIR/overleaf" 19 | 20 | # Export to overleaf 21 | for file in *.md; do 22 | if [[ -f "$file" ]]; then 23 | # Create a new file with a modified name 24 | new_file="${file%.md}-tex.md" 25 | cp "$file" "$new_file" 26 | # Remove markdown links, e.g., [Devops](#devops) 27 | sed -i '' 's/\[\([^]]*\)\](\([^)]*#.*\))/**\1**/g' "$new_file" 28 | # Remove https links, e.g., [Devops](https://example.com) 29 | sed -i '' 's/\[\([^]]*\)\](https:\/\/[^)]*)/**\1**/g' "$new_file" 30 | # Remove email links, e.g., [AnyEmailAccount](mailto: AnyEmailAccount) 31 | sed -i '' 's/\[\([^]]*\)\](mailto:[^)]*)/\1/g' "$new_file" 32 | # Add \url to Any.Name@Any.Domain 33 | #sed -i '' -E 's/([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+)/\\url{\1}/g' "$new_file" 34 | # Remove HTML tags, e.g., 35 | sed -i '' 's/<[^>]*>//g' "$new_file" 36 | # Remove "Back to" links 37 | grep -v '^Back to top$' "$new_file" > temp.md && mv temp.md "$new_file" 38 | # Remove blank lines at the top 39 | awk 'NF{p=1} p' "$new_file" > temp.md && mv temp.md "$new_file" 40 | # Remove italic markdown footnotes *Anyword.* 41 | awk '/^!\[/ { p = 1; print; next } p && /^\*/ { p = 0; next } { p = 0 } 
1' "$new_file" > temp.md && mv temp.md "$new_file" 42 | # Replace ```AnyString``` with bold **AnyString** 43 | sed -i '' 's/```\([^`]*\)```/**\1**/g' "$new_file" 44 | awk '/^```$/{f=!f; next} f{print "* **" $0 "**"; next} 1' "$new_file" > tmpfile && mv tmpfile "$new_file" 45 | # scape * in *.ethz.ch 46 | sed -i '' 's/\*\.ethz\.ch/\\*\.ethz\.ch/g' "$new_file" 47 | # Replace blank spaces 48 | sed -i '' 's/\*\* / **/g' "$new_file" 49 | # Replace ../imgs with ./ 50 | sed -i '' 's/\.\.\/imgs\//\.\//g' "$new_file" 51 | # Move to the tex folder 52 | mv "$new_file" "overleaf/${new_file//-tex/}" 53 | fi 54 | done -------------------------------------------------------------------------------- /docs/operating-the-cluster.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Operating the cluster 8 | 9 | The ETHZ-HACC is provisioned and managed based on [Infrastructure as Code (IaC)](../docs/vocabulary.md#infrastructure-as-code-iac) and [Ansible Automation Platform (AAP)](../docs/vocabulary.md#ansible-automation-platform-aap). Just as the same source code always generates the same binary, an IaC model generates the same environment every time it deploys. This allows us to reset the whole infrastructure to a defined state at any time. In fact, we can re-install and set up from scratch—without other interaction—all of the servers in our cluster in about an hour. 10 | 11 | The following figure shows a simplified model of HACC’s Ansible automation platform: 12 | 13 | ![Ansible Automation Platform (AAP).](../imgs/ansible.png "Ansible Automation Platform (AAP).") 14 | *Ansible Automation Platform (AAP).* 15 | 16 | The playbooks defining our cluster are grouped into two categories: [IaaS](../docs/vocabulary.md#infrastructure-as-a-service-iaas) and [PaaS](../docs/vocabulary.md#platform-as-a-service-paas) playbooks. We refer the IaaS playbooks to the [YAML](../docs/vocabulary.md#yaml) files describing the infrastructure itself. This relates to the OS installation (including the definitions of partition sizes and similar lower-level attributes), virtual machines, networking setup, load balancers, connection topologies, and Debian package installation. With the PaaS playbooks, we take care of installing the software allowing users to develop their heterogeneous accelerated applications. 17 | 18 | **Thanks to IaC and AAP, we can easily follow Xilinx’s tools versioning release schedule as mentioned in [Releases.](../README.md#releases)** 19 | 20 | ## Pipelines 21 | 22 | In order to maintain the health and performance of our cluster, two different pipelines are executed to ensure servers sanity and optimal functionality: 23 | 24 | * Weekly pipeline on build servers: Deals with memory leak mitigation, resource cleanup, consistent performance, and improved reliability. 25 | * Daily pipeline on deployment servers: In addition, ensures that all servers are reverted to the [Vitis Workflow](vocabulary.md#vitis-workflow) *at the beginning of the day.* 26 | 27 | The exact pipeline execution times are reflected on the [booking system](https://hacc-booking.ethz.ch/login.php) itself. -------------------------------------------------------------------------------- /docs/overleaf/first-steps.md: -------------------------------------------------------------------------------- 1 | # First steps 2 | This guide will help you set up your HACC account, booking and accessing a server, and validate server’s Xilinx accelerator card. We will cover the following sections: 3 | 4 | * **Setting your passwords** 5 | * **Setting your secure remote access** 6 | * **Booking and accessing a server** 7 | * **Validating a Xilinx accelerator card** 8 | 9 | Before continuing, please make sure you have been already accepted on ETH Zürich HACC program and you have a valid user account. In any other case, please visit **Get started**. 10 | 11 | ## Setting your passwords 12 | Once your ETH account has been created, you will need to generate two different passwords: an LDAP/Active directory password and a RADIUS password. The first one is part of your main ETH credentials; the *remote authentication dial-in user service (RADIUS)* password is used for **setting your remote secure access**. Please, follow these steps to generate them: 13 | 14 | 1. Visit the ETH Zürich **Web Center,** 15 | 2. 
Click on *Forgot your password* to receive a temporal password to use with Web Center, 16 | 3. Log in to Web Center and click on *Self service/Change password*, 17 | 4. Select the *LDAPS* and *Active Directory* checkboxes and introduce your new password, and 18 | 5. Select the *Radius* checkbox and introduce your new password. 19 | 20 | ![Setting your passwords.](./passwords.png "Setting your passwords.") 21 | 22 | ## Setting your remote secure access 23 | You must be connected to the ETH network to access the cluster. If this is not the case, you first need to establish a secure remote connection—either through a **jump host**—before being able to use the HACC servers. 24 | 25 | ### Jump host 26 | To make use of ETH’s jumphost, first you would need to edit your **~/.ssh/config** file by adding the following lines: 27 | 28 | * **ServerAliveInterval 300** 29 | * **ServerAliveCountMax 12** 30 | * **Host jumphost.inf.ethz.ch** 31 | * **User ETHUSER** 32 | * **Host \*.ethz.ch !jumphost.inf.ethz.ch** 33 | * **User ETHUSER** 34 | * **ProxyJump jumphost.inf.ethz.ch** 35 | 36 | After that, you should be able to access HACC servers with SSH, for instance: **ssh ETHUSER@alveo-build-01.ethz.ch**. **Please note that for the proposed ssh-configuration file, you must include the whole domain when you try to log in to the server.** 37 | 38 | ### Virtual private network (VPN) 39 | Accessing the HACC via VPN connection will exclusively be through the **Cisco Secure Client** client. Please follow the **How to set up VPN** section for a proper configuration. **Remember to make use of your RADIUS password!** 40 | 41 | ## Booking and accessing a server 42 | After configuring our passwords and virtual private network connection, the next step would be to reserve a server through the **booking system** and then access it. **Please remember that you must be connected to the ETH network to make use of the booking system.** 43 | 44 | ### Booking a server 45 | Please, follow these steps to book a server: 46 | 47 | 1. Log in into the **booking system** using your **main LDAP/Active directory password**, 48 | 2. Once you are on the *Dashoboard* page, please click on *New booking*, 49 | 3. Select the *Time range,* the *Boards* or servers you wish to book, along with a mandatory *Comment* referring to your research activites, and 50 | 4. Press the *Book* button. 51 | 52 | We would like you to follow the **booking rules** while you work with the cluster. 53 | 54 | ![Booking a server.](./booking-a-server.png "Booking a server.") 55 | 56 | ### Accessing a server 57 | After **booking a server**—and assuming you are connected to ETH network via VPN— you should be able to access it using ssh, i.e.: **ssh jmoyapaya@alveo-u50d-05**. Please remember that for accessing a server you should also use your **main LDAP/Active directory password**: 58 | 59 | ![Accessing a server.](./accessing-a-server.png "Accessing a server.") 60 | 61 | You can also make use of **X11 forwarding** if you need to run graphical applications on the remote server (for instance, Vivado). For this, please add a -X after the ssh command, i.e.: **ssh -X jmoyapaya@alveo-u50d-05**. 
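As a quick recap of the access steps above, the sketch below collects the typical commands in one place. It is only an illustration: replace ETHUSER and the server names with your own, and note that the fully qualified name shown for the deployment server is an assumption that follows the build-server example.

```
# One-off login through the jump host (no ~/.ssh/config needed);
# -J is the command-line form of ProxyJump
ssh -J jumphost.inf.ethz.ch ETHUSER@alveo-build-01.ethz.ch

# Regular login once the proposed ~/.ssh/config is in place
# (remember to include the whole domain)
ssh ETHUSER@alveo-u50d-05.ethz.ch

# Same login with X11 forwarding, e.g. to run Vivado's GUI on the remote server
ssh -X ETHUSER@alveo-u50d-05.ethz.ch
```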
62 | 63 | ## Validating a Xilinx accelerator card 64 | Once you are logged into a server, you should be able to validate server’s accelerator card with **xbutil validate --device**: 65 | 66 | ![Validating a Xilinx accelerator card.](./validating-a-xilinx-accelerator-card.png "Validating a Xilinx accelerator card.") 67 | 68 | ### Reverting to Vitis workflow 69 | It is possible that when you log on to a server, you may find that the previous user has left the server in *Vivado mode.* In such a situation, you have the opportunity to revert the server to work again with the Vitis workflow by following the instructions on the screen: 70 | 71 | ![Reverting to Vitis workflow.](./reverting-to-vitits-workflow.png "Reverting to Vitis workflow.") 72 | 73 | ## References 74 | * [1] **Remote Access by Secure Shell (SSH) using a jump host** 75 | -------------------------------------------------------------------------------- /docs/overleaf/infrastructure.md: -------------------------------------------------------------------------------- 1 | # Infrastructure 2 | ETHZ-HACC comprises high-end servers, GPUs, reconfigurable accelerator cards, and high-speed networking. Each accelerator card has all of its Ethernet interfaces connected to a 100 GbE leaf switch to allow exploration of arbitrary network topologies for distributed computing. Additionally, we are offering a build server with development and bitstream compilation purposes. 3 | 4 | ![ETHZ-HACC is comprised of high-​end servers, reconfigurable accelerator cards, and high-​speed networking.](./infrastructure.png "ETHZ-HACC is comprised of high-​end servers, reconfigurable accelerator cards, and high-​speed networking.") 5 | 6 | There are **two types of deployment servers.** The first type of servers are equipped with only one accelerator card; the second are servers equipped with an heterogeneous variety of accelerators including GPUs, FPGAs, and ACAPs (please, see the section **HACC boxes architecture**. In total, ETHZ-HACC counts twelve GPUs, thirty-one Alveo data center accelerator cards, and seven Versal cards. The following tables give an overview of the **server names** and their **resources:** 7 | 8 | ![ETHZ-HACC server names.](./server-names.png "ETHZ-HACC server names.") 9 | 10 | ![ETHZ-HACC resources.](./resources.png "ETHZ-HACC resources.") 11 | 12 | ## *Build cluster* 13 | We are offering a *build cluster* for development and bitstream compilation purposes. Multiple users can access this machine simultaneously **without booking it first.** Please only use the HACC build servers if you do not have access to similar resources at your research institution: too many users running large jobs on this machine will likely cause builds to run slowly—or sometimes to fail. Also, avoid using the build servers for debugging or simulating your hardware. 14 | 15 | ## High-end servers and reconfigurable accelerator cards 16 | ### AMD EPYC 17 | EPYC is the world’s highest-performing x86 server processor with faster performance for cloud, enterprise, and HPC workloads. To learn more about it, please refer to the **AMD EPYC processors website** and its **data sheet.** 18 | 19 | ### Virtex Ultrascale+ 20 | Virtex UltraScale+ devices provide the highest performance and integration capabilities in a 14nm/16nm FinFET node. It also provides registered inter-die routing lines enabling >600 MHz operation, with abundant and flexible clocking to deliver a virtual monolithic design experience. 
As the industry’s most capable FPGA family, the devices are ideal for compute-intensive applications ranging from 1+Tb/s networking and machine learning to radar/early-warning systems. 21 | 22 | * **Alveo U250** 23 | * **Alveo U280** 24 | * **Alveo U50D** 25 | * **Alveo U55C** 26 | 27 | ### Versal ACAP 28 | Versal ACAPs deliver unparalleled application- and system-level value for cloud, network, and edge applications​. The disruptive 7nm architecture combines heterogeneous compute engines with a breadth of hardened memory and interfacing technologies for superior performance/watt over competing 10nm FPGAs. 29 | 30 | * **Versal VCK5000** 31 | 32 | ### Storage 33 | Each HACC users can store data on the following directories: 34 | 35 | * **/home/USERNAME**: directory on an NFS drive accessible by USERNAME from any the HACC servers. 36 | * **/mnt/scratch**: directory on an NFS drive accessible by all users from any of the HACC servers. 37 | * **/local/home/USERNAME/**: directory on the local server drive accessible by USERNAME on the HACC server. 38 | * **/tmp**: directory on the local server drive accessible by all users on the HACC server. Its content is removed every time the server is restarted. 39 | 40 | ### USB - JTAG connectivity 41 | The USB - JTAG connection allows granted users to interact directly with the FPGA by downloading bitstreams or updating memory content. The correct setup and access of a USB - JTAG connection is essential developers using working with **Vivado workflow**. 42 | 43 | ## HACC boxes architecture 44 | The following picture details the architecture of the three heterogeneous servers equipped with 2x EPYC Milan CPUs, 4x **Instinct MI200 GPUs,** 2x **Alveo U55C FPGAs,** and 2x **Versal VCK5000 ACAPs** each. 45 | 46 | ![HACC boxes architecture.](./hacc-boxes.png "HACC boxes architecture.") 47 | 48 | ## Networking 49 | 50 | ![Management, access and data networks.](./networking.png "Management, access and data networks.") 51 | 52 | ### Management network 53 | We refer to the management network as the infrastructure allowing our IT administrators to manage, deploy, update and monitor our cluster **remotely.** 54 | 55 | ### Access network 56 | The access network is the infrastructure that allows secure remote access to our **users** through SSH. 57 | 58 | ### Data network 59 | For our **high-speed networking** data network, we are using a **spine-leaf architecture**: 60 | 61 | ![Spine-leaf data network architecture.](./spine-leaf.png "Spine-leaf data network architecture.") 62 | 63 | On the server side, the CPU NICs are **ConnectX-5** adaptors. For the servers **with only one accelerator card, only one 100 GbE port is connected to the respective leaf switch.** On the other hand, **the HACC boxes have two 100 GbE ports connected to the respective leaf switch,** offering a total of 200 GbE effective bandwidth. 64 | -------------------------------------------------------------------------------- /docs/first-steps.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # First steps 8 | This guide will help you set up your HACC account, booking and accessing a server, and validate server’s Xilinx accelerator card. We will cover the following sections: 9 | 10 | * [Setting your passwords](#setting-your-passwords) 11 | * [Setting your secure remote access](#setting-your-remote-secure-access) 12 | * [Booking and accessing a server](#booking-and-accessing-a-server) 13 | * [Validating a Xilinx accelerator card](#validating-a-xilinx-accelerator-card) 14 | 15 | Before continuing, please make sure you have been already accepted on ETH Zürich HACC program and you have a valid user account. In any other case, please visit [Get started](https://www.amd-haccs.io/get-started.html). 16 | 17 | ## Setting your passwords 18 | Once your ETH account has been created, you will need to generate two different passwords: an LDAP/Active directory password and a RADIUS password. The first one is part of your main ETH credentials; the *remote authentication dial-in user service (RADIUS)* password is used for [setting your remote secure access](#setting-your-remote-secure-access). Please, follow these steps to generate them: 19 | 20 | 1. Visit the ETH Zürich [Web Center,](https://iam.password.ethz.ch/authentication/login_en.html) 21 | 2. Click on *Forgot your password* to receive a temporal password to use with Web Center, 22 | 3. Log in to Web Center and click on *Self service/Change password*, 23 | 4. Select the *LDAPS* and *Active Directory* checkboxes and introduce your new password, and 24 | 5. Select the *Radius* checkbox and introduce your new password. 25 | 26 | ![Setting your passwords.](../imgs/passwords.png "Setting your passwords.") 27 | *Setting your passwords.* 28 | 29 | ## Setting your remote secure access 30 | You must be connected to the ETH network to access the cluster. If this is not the case, you first need to establish a secure remote connection—either through a [jump host](#jump-host) [[1]](#references) or a [virtual private network (VPN)](#virtual-private-network-vpn)—before being able to use the HACC servers. 31 | 32 | ### Jump host 33 | To make use of ETH’s jumphost, first you would need to edit your ```~/.ssh/config``` file by adding the following lines: 34 | 35 | ``` 36 | ServerAliveInterval 300 37 | ServerAliveCountMax 12 38 | Host jumphost.inf.ethz.ch 39 | User ETHUSER 40 | Host *.ethz.ch !jumphost.inf.ethz.ch 41 | User ETHUSER 42 | ProxyJump jumphost.inf.ethz.ch 43 | ``` 44 | 45 | After that, you should be able to access HACC servers with SSH, for instance: ```ssh ETHUSER@alveo-build-01.ethz.ch```. **Please note that for the proposed ssh-configuration file, you must include the whole domain when you try to log in to the server.** 46 | 47 | ### Virtual private network (VPN) 48 | Accessing the HACC via VPN connection will exclusively be through the **Cisco Secure Client** client. Please follow the [How to set up VPN](https://unlimited.ethz.ch/display/itkb/VPN) section for a proper configuration. **Remember to make use of your RADIUS password!** 49 | 50 | ## Booking and accessing a server 51 | After configuring our passwords and virtual private network connection, the next step would be to reserve a server through the [booking system](https://alveo-booking.ethz.ch/login.php) and then access it. **Please remember that you must be connected to the ETH network to make use of the booking system.** 52 | 53 | ### Booking a server 54 | Please, follow these steps to book a server: 55 | 56 | 1. 
Log in into the [booking system](https://alveo-booking.ethz.ch/login.php) using your **main LDAP/Active directory password**, 57 | 2. Once you are on the *Dashoboard* page, please click on *New booking*, 58 | 3. Select the *Time range,* the *Boards* or servers you wish to book, along with a mandatory *Comment* referring to your research activites, and 59 | 4. Press the *Book* button. 60 | 61 | We would like you to follow the [booking rules](../docs/booking-system.md#booking-rules) while you work with the cluster. 62 | 63 | ![Booking a server.](../imgs/booking-a-server.png "Booking a server.") 64 | *Booking a server.* 65 | 66 | ### Accessing a server 67 | After [booking a server](#booking-a-server)—and assuming you are connected to ETH network via VPN— you should be able to access it using ssh, i.e.: ```ssh jmoyapaya@alveo-u50d-05```. Please remember that for accessing a server you should also use your **main LDAP/Active directory password**: 68 | 69 | ![Accessing a server.](../imgs/accessing-a-server.png "Accessing a server.") 70 | *Accessing a server.* 71 | 72 | You can also make use of **X11 forwarding** if you need to run graphical applications on the remote server (for instance, Vivado). For this, please add a -X after the ssh command, i.e.: ```ssh -X jmoyapaya@alveo-u50d-05```. 73 | 74 | ## Validating a Xilinx accelerator card 75 | Once you are logged into a server, you should be able to validate server’s accelerator card with ```xbutil validate --device```: 76 | 77 | ![Validating a Xilinx accelerator card.](../imgs/validating-a-xilinx-accelerator-card.png "Validating a Xilinx accelerator card.") 78 | *Validating a Xilinx accelerator card.* 79 | 80 | ### Reverting to Vitis workflow 81 | It is possible that when you log on to a server, you may find that the previous user has left the server in *Vivado mode.* In such a situation, you have the opportunity to revert the server to work again with the Vitis workflow by following the instructions on the screen: 82 | 83 | ![Reverting to Vitis workflow.](../imgs/reverting-to-vitits-workflow.png "Reverting to Vitis workflow.") 84 | *Reverting to Vitis workflow is based on the [hot-plug boot](https://github.com/fpgasystems/hacc/blob/main/docs/vocabulary.md#pci-hot-plug) process.* 85 | 86 | ## References 87 | * [1] [Remote Access by Secure Shell (SSH) using a jump host](https://www.isg.inf.ethz.ch/Main/HelpRemoteAccessSSH) -------------------------------------------------------------------------------- /docs/overleaf/vocabulary.md: -------------------------------------------------------------------------------- 1 | # Vocabulary 2 | 3 | ## Agile 4 | Agile is an iterative project management and software development approach that helps teams deliver value to their customers faster and with fewer headaches. Instead of betting everything on a long-term launch, an agile team provides work in small but consumable increments. Requirements, plans, and results are evaluated continuously, so teams have a natural mechanism for responding to change quickly. Frameworks such as **Shape Up** are considered part of Agile methodologies. 5 | 6 | ## Ansible Automation Platform (AAP) 7 | The Red Hat **Ansible Automation Platform (AAP)** is an orchestrated and open-source tool for software provisioning, configuration management, and application-deployment automation. 
Ansible uses its own **YAML** 8 | 9 | ## DCIM 10 | Data center infrastructure management (DCIM) is the integration of information technology (IT) and facility management disciplines to centralize monitoring, management and intelligent capacity planning of a data center's critical systems. 11 | 12 | ## Deployment types 13 | ### Infrastructure as a service (IaaS) 14 | Introduced in 2012 by Oracle, infrastructure as a service (IaaS) is a cloud computing service model through which computing resources are hosted in a public, private, or hybrid cloud and provided on-demand to the final users. 15 | 16 | ### Platform as a service (PaaS) 17 | Platform as a service (PaaS) is a category of cloud computing services that allows customers to provision, instantiate, run, and manage a modular bundle comprising a computing platform and one or more applications, without the complexity of building and maintaining the infrastructure typically associated with developing and launching the application(s); and to allow developers to create, develop, and package such software bundles. 18 | 19 | ## DevOps 20 | DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. Several DevOps aspects came from the **Agile** methodology. 21 | 22 | ## PCI hot-plug 23 | We refer to the PCIe hot-plug as the process that allows us to transition Xilinx accelerator cards from the Vitis to Vivado workflows or vice-versa. The critical aspect of the process is to use Linux capabilities to re-enumerate PCI devices on the fly without the need for cold or warm rebooting of the system. 24 | 25 | ### Cold boot 26 | The process of powering off and on the machine to completely reload the operating system and reset all the hardware peripherals (including all PCI devices). In the context of Xilinx Alveo Cards, a cold boot causes to pull the flashable partitions (or base shell) from the card’s PROM into the programmable logic. This operation is required to revert a server to the Vitis workflow. 27 | 28 | ### Warm boot 29 | A warm boot restarts the system without the need to interrupt the power. In the context of Xilinx Alveo Cards, a warm boot would be required to re-enumarate the number of PCI functions without restoring the base shell. This operation is required to bring a server to the Vivado workflow. 30 | 31 | ## Infrastructure as Code (IaC) 32 | Infrastructure as Code (IaC) is the process of provisioning and managing computer data centers through machine- and human-readable **YAML** definition files—rather than physical hardware configuration or interactive configuration tools. The IT infrastructure managed by this process comprises physical equipment, such as bare-metal servers, virtual machines, and associated configuration resources. The definitions may be in a version control system. 33 | 34 | ### Tools 35 | There are many tools that fulfill infrastructure automation capabilities and use IaC. Broadly speaking, any framework or tool that performs changes or configures infrastructure declaratively or imperatively based on a programmatic approach can be considered IaC. ETHZ-HACC uses **Ansible** for defining the cluster infrastructure. 36 | 37 | ### Relation to DevOps 38 | IaC can be a key attribute of enabling best practices in **DevOps**–developers become more involved in defining configuration and Ops teams get involved earlier in the development process. 
Tools that utilize IaC bring visibility to the state and configuration of servers and ultimately provide the visibility to users within the enterprise, aiming to bring teams together to maximize their efforts. 39 | 40 | ## Shape Up 41 | Instead of *Scrum,* we use **Shape Up** to shape and build our accelerated applications. To execute the techniques of the method we use **Basecamp**—a project management tool that puts all our project communication, task management, and documentation in one place (where designers and programmers work seamlessly together). 42 | 43 | ## Spine-leaf architecture 44 | A spine-leaf architecture is data center network topology that consists of two switching layers—a spine and leaf. The leaf layer consists of access switches that aggregate traffic from servers and connect directly into the spine or network core. Spine switches interconnect all leaf switches in a full-mesh topology. 45 | 46 | ## Vivado and Vitis workflows 47 | Vivado offers a hardware-centric approach to designing hardware, while Vitis offers a software-centric approach to developing *both* hardware and software. These perspectives are best represented by the languages used to make things with the two tools. 48 | 49 | ### Vivado workflow 50 | Vivado is for creating hardware designs that run in an FPGA. These either consist of a set of hardware description language (HDL, typically Verilog or VHDL) files, or of a block design, which can include a variety of pre-built IP blocks (which at their core abstract away pre-written HDL). If a design includes a processor, Vitis will also be required to write the program to run on the processor, as Vivado only handles the programmable logic. 51 | 52 | ### Vitis workflow 53 | Vitis is for writing software to run in an FPGA, and is the combination of a couple of different Xilinx tools, including what was Xilinx SDK, Vivado High-Level Synthesis (HLS), and SDSoC. The functionality of each of these is now merged together under Vitis. 54 | 55 | ## YAML 56 | YAML is a human-readable data-serialization language commonly used for configuration files and in applications where data is being stored or transmitted. 57 | -------------------------------------------------------------------------------- /docs/vocabulary.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Vocabulary 8 | 9 | ## Agile 10 | Agile is an iterative project management and software development approach that helps teams deliver value to their customers faster and with fewer headaches. Instead of betting everything on a long-term launch, an agile team provides work in small but consumable increments. Requirements, plans, and results are evaluated continuously, so teams have a natural mechanism for responding to change quickly. Frameworks such as [Shape Up](#shape-up) or [DevOps](#devops) are considered part of Agile methodologies. 11 | 12 | ## Ansible Automation Platform (AAP) 13 | The Red Hat [Ansible Automation Platform (AAP)](https://www.ansible.com) is an orchestrated and open-source tool for software provisioning, configuration management, and application-deployment automation. Ansible uses its own [YAML](#yaml)-based declarative language enabling [Infrastructure as Code (IaC).](#infrastructure-as-code-iac) 14 | 15 | ## DCIM 16 | Data center infrastructure management (DCIM) is the integration of information technology (IT) and facility management disciplines to centralize monitoring, management and intelligent capacity planning of a data center's critical systems. 17 | 18 | ## Deployment types 19 | ### Infrastructure as a service (IaaS) 20 | Introduced in 2012 by Oracle, infrastructure as a service (IaaS) is a cloud computing service model through which computing resources are hosted in a public, private, or hybrid cloud and provided on-demand to the final users. 21 | 22 | ### Platform as a service (PaaS) 23 | Platform as a service (PaaS) is a category of cloud computing services that allows customers to provision, instantiate, run, and manage a modular bundle comprising a computing platform and one or more applications, without the complexity of building and maintaining the infrastructure typically associated with developing and launching the application(s); and to allow developers to create, develop, and package such software bundles. 24 | 25 | ## DevOps 26 | DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. Several DevOps aspects came from the [Agile](#agile) methodology. 27 | 28 | ## PCI hot-plug 29 | We refer to the PCIe hot-plug as the process that allows us to transition Xilinx accelerator cards from the Vitis to Vivado workflows or vice-versa. The critical aspect of the process is to use Linux capabilities to re-enumerate PCI devices on the fly without the need for cold or warm rebooting of the system. 30 | 31 | ### Cold boot 32 | The process of powering off and on the machine to completely reload the operating system and reset all the hardware peripherals (including all PCI devices). In the context of Xilinx Alveo Cards, a cold boot causes to pull the flashable partitions (or base shell) from the card’s PROM into the programmable logic. This operation is required to revert a server to the Vitis workflow. 33 | 34 | ### Warm boot 35 | A warm boot restarts the system without the need to interrupt the power. In the context of Xilinx Alveo Cards, a warm boot would be required to re-enumarate the number of PCI functions without restoring the base shell. This operation is required to bring a server to the Vivado workflow. 
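For illustration only, the following sketch shows the generic Linux sysfs mechanism that this kind of on-the-fly re-enumeration relies on. The PCI address is hypothetical, and on the ETHZ-HACC servers the transition is handled for you by the cluster tooling rather than run by hand:

```
# Hypothetical PCI address of an accelerator card; list candidates with:
#   lspci -d 10ee:     (10ee is the Xilinx/AMD adaptive-device vendor ID)
DEV=0000:af:00.0

# Detach the device from the PCI tree without touching the rest of the system
echo 1 | sudo tee /sys/bus/pci/devices/$DEV/remove

# Ask the kernel to walk the buses again and re-enumerate whatever the card
# now exposes, with no cold or warm reboot involved
echo 1 | sudo tee /sys/bus/pci/rescan
```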
36 | 37 | ## Infrastructure as Code (IaC) 38 | Infrastructure as Code (IaC) is the process of provisioning and managing computer data centers through machine- and human-readable [YAML](#yaml) definition files—rather than physical hardware configuration or interactive configuration tools. The IT infrastructure managed by this process comprises physical equipment, such as bare-metal servers, virtual machines, and associated configuration resources. The definitions may be in a version control system. 39 | 40 | ### Tools 41 | There are many tools that fulfill infrastructure automation capabilities and use IaC. Broadly speaking, any framework or tool that performs changes or configures infrastructure declaratively or imperatively based on a programmatic approach can be considered IaC. ETHZ-HACC uses [Ansible](#ansible-automation-platform-aap) for defining the cluster infrastructure. 42 | 43 | ### Relation to DevOps 44 | IaC can be a key attribute of enabling best practices in [DevOps](#devops)–developers become more involved in defining configuration and Ops teams get involved earlier in the development process. Tools that utilize IaC bring visibility to the state and configuration of servers and ultimately provide the visibility to users within the enterprise, aiming to bring teams together to maximize their efforts. 45 | 46 | ## Shape Up 47 | Instead of *Scrum,* we use [Shape Up](https://basecamp.com/shapeup) to shape and build our accelerated applications. To execute the techniques of the method we use [Basecamp](https://basecamp.com)—a project management tool that puts all our project communication, task management, and documentation in one place (where designers and programmers work seamlessly together). 48 | 49 | ## Spine-leaf architecture 50 | A spine-leaf architecture is data center network topology that consists of two switching layers—a spine and leaf. The leaf layer consists of access switches that aggregate traffic from servers and connect directly into the spine or network core. Spine switches interconnect all leaf switches in a full-mesh topology. 51 | 52 | ## Vivado and Vitis workflows 53 | Vivado offers a hardware-centric approach to designing hardware, while Vitis offers a software-centric approach to developing *both* hardware and software. These perspectives are best represented by the languages used to make things with the two tools. 54 | 55 | ### Vivado workflow 56 | Vivado is for creating hardware designs that run in an FPGA. These either consist of a set of hardware description language (HDL, typically Verilog or VHDL) files, or of a block design, which can include a variety of pre-built IP blocks (which at their core abstract away pre-written HDL). If a design includes a processor, Vitis will also be required to write the program to run on the processor, as Vivado only handles the programmable logic. 57 | 58 | ### Vitis workflow 59 | Vitis is for writing software to run in an FPGA, and is the combination of a couple of different Xilinx tools, including what was Xilinx SDK, Vivado High-Level Synthesis (HLS), and SDSoC. The functionality of each of these is now merged together under Vitis. 60 | 61 | ## YAML 62 | YAML is a human-readable data-serialization language commonly used for configuration files and in applications where data is being stored or transmitted. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 3 |
4 | fpgasystems hdev 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Heterogeneous Accelerated Compute Cluster 13 |
14 | 15 | Under the scope of the AMD University Program, the Heterogeneous Accelerated Compute Clusters (HACCs) program is a special initiative to support novel research in adaptive compute acceleration for high-performance computing (HPC). The scope of the program is broad and encompasses systems, architecture, tools, and applications. 16 | 17 | HACCs are equipped with the latest AMD hardware and software technologies for adaptive compute acceleration research. Each cluster is specially configured to enable some of the world's foremost academic teams to conduct state-of-the-art HPC research. 18 | 19 | Five HACCs have been established at some of the world's most prestigious universities. The first of them was assigned to [Prof. Dr. Gustavo Alonso](https://people.inf.ethz.ch/alonso/) of the [Institute for Platform Computing - Systems Group (SG)](https://systems.ethz.ch) at the [Swiss Federal Institute of Technology Zurich (ETH Zurich, ETHZ)](https://ethz.ch/en.html) in 2020. 20 | 21 | ## Sections 22 | * [Account renewal](/docs/account-renewal.md#account-renewal) 23 | * [Acknowledgment and citation](#acknowledgment-and-citation) 24 | * [Booking system](/docs/booking-system.md#booking-system) 25 | * [Features](docs/features.md#features) 26 | * [First steps](docs/first-steps.md#first-steps) 27 | * [Get started](https://www.amd-haccs.io/get-started.html) 28 | * [HACC Development (hdev)](https://github.com/fpgasystems/hdev) 29 | * [Infrastructure](docs/infrastructure.md#infrastructure) 30 | * [License](#license) 31 | * [Operating the cluster](docs/operating-the-cluster.md#operating-the-cluster) 32 | * [Releases](#releases) 33 | * [Technical support](docs/technical-support.md) 34 | * [Usage guidance](#usage-guidance) 35 | * [Vocabulary](docs/vocabulary.md#vocabulary) 36 | * [Who does what](docs/who-does-what.md#who-does-what) 37 | 38 | # Releases 39 | 40 | The table below provides an overview of the current ETHZ-HACC setup across different releases: 41 | 42 |
Cluster | Ubuntu 20.04 | Ubuntu 22.04 | Vivado 2023.2 | Vivado 2024.1 | HIP/ROCm 6.2.2 | HIP/ROCm 6.3.3
BUILD
U50D
U55C
V80
ALVEO BOXES
HACC BOXES
○ Existing release.
● Existing release installed on the cluster.
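If you want to confirm which of these releases a given server actually provides, a quick check from a shell is usually enough once you are logged in. The following is a sketch that assumes the tools are already on your PATH (the welcome message described in the AMD Tools section below lists the exact install locations), and the flags may vary slightly between releases:

```bash
# Hypothetical sanity checks; adjust to the tool locations shown in the server's welcome message.
lsb_release -d          # Ubuntu release (e.g., 20.04 or 22.04)
vivado -version         # Vivado/Vitis tool version, if installed on this server
xbutil examine          # XRT version and visible Alveo/Versal devices, if XRT is installed
hipconfig --version     # HIP/ROCm version on GPU-equipped servers
```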
124 | 125 | ## Ubuntu 126 | Ubuntu releases follow the [IT Service Group of the Department of Computer Science](https://www.isg.inf.ethz.ch/Main/ServicesDesktopsAndLaptopsLinux) release schedule. 127 | 128 | ## AMD Tools 129 | ### Reconfigurable devices 130 | AMD's tool versioning for ASoCs and FPGAs follows [XRT's release schedule.](https://github.com/Xilinx/XRT/releases) All servers equipped with Alveo or Versal boards (referred to as deployment servers) are associated with a unique AMD software version. This includes XRT's Xilinx Board Utility (xbutil), Vivado, Vitis_HLS, and the flashable partitions (or base shell) running on the reconfigurable devices. Some deployment servers also have Vitis installed. Pay attention to the **welcome message,** as it will indicate the installed tools and their locations. 131 | 132 | ![Installed AMD Tools.](./imgs/installed-xilinx-tools.png "Installed AMD Tools.") 133 | *Installed AMD Tools.* 134 | 135 | 136 | 137 | AMD has officially announced the end-of-life (EOL) for its Alveo U250 and U280 data center accelerator cards. As a consequence, **we will no longer update tools or provide support for these devices.** The current tools will be **frozen at version 2023.2 (running on Ubuntu 22.04),** and no further updates will be released for these platforms. Users of the U250 and U280 are encouraged to plan for migration to alternative solutions within AMD's portfolio or other supported products. Please refer to the official AMD documentation and support channels for more details. 138 | 139 | ### Graphics Processing Units (GPUs) 140 | For GPU accelerators, HIP and ROCm tool versioning follows the [HIP release schedule](https://github.com/ROCm-Developer-Tools/HIP/releases). 141 | 142 | # Usage guidance 143 | When utilizing the HACC, please adhere to the following guidelines: 144 | 145 | * **Deployment servers:** Utilize deployment servers exclusively for testing and verification purposes. Refrain from using them for any software builds. Restrict your usage of these machines to the Vitis and HIP runtimes. 146 | 147 | * **Software builds:** For software building tasks, utilize the HACC BUILD cluster instead. This cluster allows simultaneous access by multiple users without requiring a booking. Only resort to it if you lack access to suitable build servers at your own institution. 148 | 149 | * **Tool installations:** Users are only permitted to use the tools preinstalled on the system. Avoid installing external tools without prior approval from the HACC manager. If utilizing PYNQ, you may install packages using pip3, ensuring beforehand that the package is installed system-wide. For any special requirements, contact [research_clusters@amd.com,](mailto:research_clusters@amd.com) and we will endeavor to accommodate your needs. 150 | 151 | * Lastly, ensure compliance with the [Booking rules.](./docs/booking-system.md#booking-rules) 152 | 153 | # Acknowledgment and citation 154 | 155 | We encourage ETHZ-HACC users to acknowledge the support provided by AMD and ETH Zürich for their research in presentations, papers, posters, and press releases. Please use the following acknowledgment statement and citation.
156 | 157 | ## Acknowledgment 158 | 159 | This work was supported in part by AMD under the Heterogeneous Accelerated Compute Clusters (HACC) program (formerly known as the XACC program, the Xilinx Adaptive Compute Cluster program). 160 | 161 | ## Citation 162 | 163 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8340448.svg)](https://doi.org/10.5281/zenodo.8340448) 164 | 165 | ``` 166 | @misc{moya2023hacc, 167 | author = {Javier Moya and Matthias Gabathuler and Mario Ruiz and Gustavo Alonso}, 168 | title = {fpgasystems/hacc: ETHZ-HACC}, 169 | howpublished = {Zenodo}, 170 | year = {2023}, 171 | month = sep, 172 | note = {\url{https://doi.org/10.5281/zenodo.8340448}}, 173 | doi = {10.5281/zenodo.8340448} 174 | } 175 | ``` 176 | 177 | ### Download 178 | 179 | To get a printable copy of the cited resource, please follow [this link.](https://public.3.basecamp.com/p/oQPqiHQ8yHNatsMT7zMxteZ5) 180 | 181 | # License 182 | 183 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 184 | 185 | Copyright (c) 2022 FPGA @ Systems Group, ETH Zurich 186 | 187 | Permission is hereby granted, free of charge, to any person obtaining a copy 188 | of this software and associated documentation files (the "Software"), to deal 189 | in the Software without restriction, including without limitation the rights 190 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 191 | copies of the Software, and to permit persons to whom the Software is 192 | furnished to do so, subject to the following conditions: 193 | 194 | The above copyright notice and this permission notice shall be included in all 195 | copies or substantial portions of the Software. 196 | 197 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 198 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 199 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 200 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 201 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 202 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 203 | SOFTWARE. -------------------------------------------------------------------------------- /docs/infrastructure.md: -------------------------------------------------------------------------------- 1 |
6 | 7 | # Infrastructure 8 | The ETHZ-HACC comprises build and deployment servers. **Build servers are dedicated to development and bitstream compilation,** providing a robust environment for software and hardware design. **Deployment servers,** on the other hand, host one or more acceleration devices, enabling **high-performance execution of workloads.** This separation ensures an efficient workflow, allowing developers to compile and test their applications on build servers before deploying them to accelerator-equipped machines for execution. 9 | 10 | ![ETHZ-HACC infrastructure. This diagram is for illustrative purposes and may be subject to change. Please refer to the table below for the specific server configuration.](../imgs/infrastructure.png "ETHZ-HACC infrastructure. This diagram is for illustrative purposes and may be subject to change. Please refer to the table below for the specific server configuration.") 11 | *ETHZ-HACC infrastructure. This diagram is for illustrative purposes and may be subject to change. Please refer to the table below for the specific server configuration.* 12 | 13 | ## Build servers 14 | Build servers are dedicated to development and bitstream compilation. Multiple users can access these machines simultaneously **without booking them first.** Please only use the HACC build servers if you do not have access to similar resources at your research institution: too many users running large jobs on these machines will likely cause builds to run slowly—or sometimes to fail. Also, avoid using the build servers for debugging or simulating your hardware. 15 | 16 | 17 | 18 | ## Deployment servers 19 | 20 | Deployment servers feature high-end multi-core processors, one or more accelerator devices—such as GPUs or reconfigurable accelerator cards—and high-speed networking. The following table gives an overview of the devices installed on each of them: 21 | 22 |
Cluster | Servers | ASoCs / FPGAs / GPUs: VCK5000, V80, U250, U280, U50D, U55C, MI100, MI210
U50D | alveo-u50d-[01:02]
U55C | alveo-u55c-[01:10]
V80 | alveo-v80-01
ALVEO BOXES | alveo-box-[01:02]
HACC BOXES | hacc-box-[01:02], hacc-box-03, hacc-box-04, hacc-box-05
● Number of devices.
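To see for yourself which accelerators a particular server hosts, listing its PCI devices after logging in is usually the quickest check. This is a small sketch assuming a standard Linux environment with lspci available; the exact vendor and class strings may differ between card generations:

```bash
# Reconfigurable devices: Alveo and Versal cards typically enumerate with the Xilinx vendor ID (10ee).
lspci -d 10ee:

# GPUs: AMD Instinct accelerators usually appear under the "Processing accelerators" device class.
lspci | grep -i "processing accelerators"
```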
112 | 113 | As shown in the table above, some servers are equipped with a single accelerator card, while others feature a heterogeneous mix of accelerators, including Adaptive SoCs, FPGAs, and GPUs. The section [HACC boxes architecture](#hacc-boxes-architecture) provides details on a representative heterogeneous configuration. 114 | 115 | ### AMD EPYC 116 | EPYC is the world's highest-performing x86 server processor, delivering leading performance for cloud, enterprise, and HPC workloads. To learn more about it, please refer to the [AMD EPYC processors website](https://www.amd.com/en/processors/epyc-server-cpu-family) and its [data sheet.](https://www.amd.com/system/files/documents/amd-epyc-7003-series-datasheet.pdf) 117 | 118 | ### Virtex UltraScale+ 119 | Virtex UltraScale+ devices provide the highest performance and integration capabilities in a 14nm/16nm FinFET node. They also provide registered inter-die routing lines enabling >600 MHz operation, with abundant and flexible clocking to deliver a virtual monolithic design experience. As the industry's most capable FPGA family, the devices are ideal for compute-intensive applications ranging from 1+Tb/s networking and machine learning to radar/early-warning systems. 120 | 121 | * [Alveo U250](https://www.xilinx.com/products/boards-and-kits/alveo/u250.html) 122 | * [Alveo U280](https://www.xilinx.com/products/boards-and-kits/alveo/u280.html) 123 | * [Alveo U50D](https://www.xilinx.com/products/boards-and-kits/alveo/u50.html) 124 | * [Alveo U55C](https://www.xilinx.com/applications/data-center/high-performance-computing/u55c.html) 125 | 126 | ### Versal Adaptive SoCs 127 | Versal Adaptive SoCs deliver unparalleled application- and system-level value for cloud, network, and edge applications. The disruptive 7nm architecture combines heterogeneous compute engines with a breadth of hardened memory and interfacing technologies for superior performance/watt over competing 10nm FPGAs. 128 | 129 | * [Versal VCK5000](https://www.xilinx.com/products/boards-and-kits/vck5000.html) 130 | 131 | ### Alveo V80 Compute Accelerator Cards 132 | Alveo V80 compute accelerator cards provide exceptional performance for data center, AI, and high-performance computing workloads. Built on advanced 7nm technology, they integrate high-bandwidth memory, optimized compute engines, and low-latency interconnects to deliver superior throughput and efficiency compared to previous-generation accelerators. 133 | 134 | * [Alveo V80](https://www.amd.com/en/products/accelerators/alveo/v80.html) 135 | 136 | ### Storage 137 | Each HACC user can store data in the following directories: 138 | 139 | * ```/home/USERNAME```: directory on an NFS drive accessible by USERNAME from any of the HACC servers. 140 | * ```/mnt/scratch```: directory on an NFS drive accessible by all users from any of the HACC servers. 141 | * ```/local/home/USERNAME/```: directory on the local server drive, accessible by USERNAME on that server only. 142 | * ```/tmp```: directory on the local server drive, accessible by all users on that server. Its content is removed every time the server is restarted. 143 | 144 | ### USB - JTAG connectivity 145 | The USB - JTAG connection allows authorized users to interact directly with the FPGA by downloading bitstreams or updating memory content.
The correct setup and access of a USB - JTAG connection is essential for developers working with the [Vivado workflow.](./vocabulary.md#vivado-workflow) 146 | 147 | ### HACC boxes architecture 148 | The following picture details the architecture of one of our heterogeneous servers (specifically, hacc-box-01), which is equipped with 2x EPYC Milan CPUs, 4x [Instinct MI210 GPUs,](https://www.amd.com/system/files/documents/amd-instinct-mi210-brochure.pdf) 1x 100 GbE NIC, 2x [Alveo U55C FPGAs,](https://www.xilinx.com/applications/data-center/high-performance-computing/u55c.html) and 2x [Versal VCK5000 Adaptive SoCs](https://www.xilinx.com/products/boards-and-kits/vck5000.html). **The diagram can be used as a reference for the configuration of other HACC boxes.** 149 | 150 | ![Server architecture of hacc-box-01. This diagram is for illustrative purposes and may be subject to change.](../imgs/hacc-boxes.png "Server architecture of hacc-box-01. This diagram is for illustrative purposes and may be subject to change.") 151 | *Server architecture of hacc-box-01. This diagram is for illustrative purposes and may be subject to change.* 152 | 153 | ## Networking 154 | 155 | Each server has at least three connections: one to the **management network,** one to the **access network,** and one to the high-speed **data network.** Additionally, all Ethernet interfaces of reconfigurable accelerator cards are connected to a 100 GbE leaf switch (or 200 GbE for Alveo V80 compute accelerator cards), enabling the exploration of arbitrary network topologies for distributed computing. 156 | 157 | ![Management, access and data networks. This diagram is for illustrative purposes and may be subject to change.](../imgs/networking.png "Management, access and data networks. This diagram is for illustrative purposes and may be subject to change.") 158 | *Management, access and data networks. This diagram is for illustrative purposes and may be subject to change.* 159 | 160 | ### Management network 161 | We refer to the management network as the infrastructure allowing our IT administrators to manage, deploy, update, and monitor our cluster **remotely.** 162 | 163 | ### Access network 164 | The access network is the infrastructure that allows our **users** secure remote access through SSH. 165 | 166 | ### Data network 167 | For our high-speed **data network,** we use a [spine-leaf architecture](./vocabulary.md#spine-leaf-architecture) where the L2 leaf layer is built with 100 and 200 GbE [Cisco Nexus 9000 Series](https://www.cisco.com/c/en/us/support/switches/nexus-9000-series-switches/series.html) switches and active optical cables (AOCs): 168 | 169 | ![Spine-leaf data network architecture. This diagram is for illustrative purposes and may be subject to change.](../imgs/spine-leaf.png "Spine-leaf data network architecture. This diagram is for illustrative purposes and may be subject to change.") 170 | *Spine-leaf data network architecture. This diagram is for illustrative purposes and may be subject to change.* 171 | 172 | --------------------------------------------------------------------------------