├── LICENSE
├── README.md
├── gpu_stress_test
    └── gpu_stress_test.py
├── health_checks
    ├── config.json
    ├── dmesg_whitelist.py
    ├── health_check_fixes
    │   ├── fix_ansible_fact_gathering_hang.sh
    │   ├── fix_docker.sh
    │   ├── fix_docker_info.sh
    │   ├── fix_docker_nvidia_communication.sh
    │   ├── fix_flint_version.sh
    │   ├── fix_nvidia_fabric_manager.sh
    │   ├── fix_nvidia_smi.sh
    │   ├── fix_pcie_limited.sh
    │   ├── fix_swap.sh
    │   ├── fix_vbios_version.sh
    │   ├── fix_zpool.sh
    │   ├── reinstall_nvidia.sh
    │   ├── txvr-fw-update-on-host.sh
    │   └── uninstall_nvidia.sh
    ├── health_checks.py
    ├── run_health_checks.py
    └── utils
    │   └── commands.py
├── host_validation
    ├── communication_validation_tests.py
    ├── gpu_connection_test.py
    ├── p2p_ib_test.py
    └── utils
    │   ├── dist.py
    │   ├── events.py
    │   ├── fixed_traceback.py
    │   ├── mapping.py
    │   ├── run_command.py
    │   ├── serialization.py
    │   └── timer.py
├── ib_burn
    ├── config.json
    ├── ib_burn.py
    └── ib_fabric.sh
├── requirements.txt
└── ufm_events
    └── find_problematic_events.py

/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/README.md
--------------------------------------------------------------------------------
/gpu_stress_test/gpu_stress_test.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/gpu_stress_test/gpu_stress_test.py
--------------------------------------------------------------------------------
/health_checks/config.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/config.json
--------------------------------------------------------------------------------
/health_checks/dmesg_whitelist.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/dmesg_whitelist.py
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_ansible_fact_gathering_hang.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_ansible_fact_gathering_hang.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_docker.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_docker.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_docker_info.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_docker_info.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_docker_nvidia_communication.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_docker_nvidia_communication.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_flint_version.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_flint_version.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_nvidia_fabric_manager.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_nvidia_fabric_manager.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_nvidia_smi.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_nvidia_smi.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_pcie_limited.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_pcie_limited.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_swap.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_swap.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_vbios_version.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_vbios_version.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/fix_zpool.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/fix_zpool.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/reinstall_nvidia.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/reinstall_nvidia.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/txvr-fw-update-on-host.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/txvr-fw-update-on-host.sh
--------------------------------------------------------------------------------
/health_checks/health_check_fixes/uninstall_nvidia.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_check_fixes/uninstall_nvidia.sh
--------------------------------------------------------------------------------
/health_checks/health_checks.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/health_checks.py
--------------------------------------------------------------------------------
/health_checks/run_health_checks.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/run_health_checks.py
--------------------------------------------------------------------------------
/health_checks/utils/commands.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/health_checks/utils/commands.py
--------------------------------------------------------------------------------
/host_validation/communication_validation_tests.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/communication_validation_tests.py
--------------------------------------------------------------------------------
/host_validation/gpu_connection_test.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/gpu_connection_test.py
--------------------------------------------------------------------------------
/host_validation/p2p_ib_test.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/p2p_ib_test.py
--------------------------------------------------------------------------------
/host_validation/utils/dist.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/utils/dist.py
--------------------------------------------------------------------------------
/host_validation/utils/events.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/utils/events.py
--------------------------------------------------------------------------------
/host_validation/utils/fixed_traceback.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/utils/fixed_traceback.py
--------------------------------------------------------------------------------
/host_validation/utils/mapping.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/utils/mapping.py
--------------------------------------------------------------------------------
/host_validation/utils/run_command.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/utils/run_command.py
--------------------------------------------------------------------------------
/host_validation/utils/serialization.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/utils/serialization.py
--------------------------------------------------------------------------------
/host_validation/utils/timer.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/host_validation/utils/timer.py
--------------------------------------------------------------------------------
/ib_burn/config.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/ib_burn/config.json
--------------------------------------------------------------------------------
/ib_burn/ib_burn.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/ib_burn/ib_burn.py
--------------------------------------------------------------------------------
/ib_burn/ib_fabric.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/ib_burn/ib_fabric.sh
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/requirements.txt
--------------------------------------------------------------------------------
/ufm_events/find_problematic_events.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/imbue-ai/cluster-health/HEAD/ufm_events/find_problematic_events.py
--------------------------------------------------------------------------------