├── img
│   ├── aire.png
│   ├── us_1.png
│   └── llm-d.png
├── LICENSE
├── CONTRIBUTING.md
└── README.md

--------------------------------------------------------------------------------
/img/aire.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/den-vasyliev/aire/HEAD/img/aire.png

--------------------------------------------------------------------------------
/img/us_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/den-vasyliev/aire/HEAD/img/us_1.png

--------------------------------------------------------------------------------
/img/llm-d.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/den-vasyliev/aire/HEAD/img/llm-d.png

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

... [standard Apache 2.0 license text continues] ...

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing to AIRE

Thank you for your interest in contributing to AIRE! This document provides guidelines and instructions for contributing to the project.

## Code of Conduct

By participating in this project, you agree to abide by our Code of Conduct (coming soon).
We expect all contributors to help maintain a respectful, inclusive, and collaborative environment.

## How to Contribute

There are many ways to contribute to AIRE:

1. **Report Bugs**
   - Use the GitHub Issues tracker
   - Clearly describe the issue, including steps to reproduce
   - Include relevant information such as OS, environment, etc.

2. **Suggest Enhancements**
   - Use GitHub Issues for feature requests
   - Clearly describe the feature and its use case
   - If possible, outline a technical approach

3. **Submit Code Changes**
   - Fork the repository
   - Create a new branch for your feature/fix
   - Write clear, commented code
   - Include tests where applicable
   - Submit a Pull Request

4. **Improve Documentation**
   - Fix typos or clarify existing documentation
   - Add examples and use cases
   - Translate documentation

## Development Process

1. **Fork & Clone**
   ```bash
   git clone https://github.com/your-username/aire.git
   cd aire
   ```

2. **Create a Branch**
   ```bash
   git checkout -b feature/your-feature-name
   ```

3. **Make Changes**
   - Write your code
   - Add tests
   - Update documentation

4. **Commit Changes**
   - Use clear commit messages
   - Reference relevant issues
   ```bash
   git commit -m "feat: add new feature #IssueNumber"
   ```

5. **Submit Pull Request**
   - Push to your fork
   - Submit a PR to the main repository
   - Respond to review comments

## Pull Request Guidelines

- Follow the project's coding style and conventions
- Include tests for new features
- Update documentation as needed
- One feature/fix per PR
- Keep PRs focused and manageable in size

## Commit Message Format

We follow [Conventional Commits](https://www.conventionalcommits.org/).
Each commit message should be structured as follows:

```
<type>(<scope>): <subject>

[optional body]

[optional footer]
```

### Types
- `feat`: A new feature
- `fix`: A bug fix
- `docs`: Documentation-only changes
- `style`: Changes that don't affect the code's meaning (white-space, formatting, etc.)
- `refactor`: Code changes that neither fix a bug nor add a feature
- `perf`: Code changes that improve performance
- `test`: Adding missing tests or correcting existing tests
- `chore`: Changes to the build process or auxiliary tools

### Examples
```bash
feat(auth): add JWT authentication
fix(api): handle null response from user service
docs(readme): update installation instructions
test(login): add unit tests for login validation
```

### Breaking Changes
For commits that introduce breaking changes, add `BREAKING CHANGE:` in the footer:

```
feat(api): change authentication endpoint path

BREAKING CHANGE: Authentication endpoint moved from /auth to /v2/auth
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AIRE - AI Reliability Engineering

## _Unreliable AI is worse than no AI at all._

<div align="center">
  <img src="img/aire.png" alt="AIRE Logo">
</div>
## About

AIRE is an open-source framework that applies Site Reliability Engineering (SRE) principles and practices to AI systems. It provides a comprehensive toolkit and methodology for AI/ML practitioners, Data Engineers, DevOps, and SRE teams to develop, deliver, and operate reliable AI products.

As AI becomes critical for business competitiveness, organizations face unique challenges in integrating these systems reliably and securely. AIRE bridges this gap by combining CNCF ecosystem tools with established SRE practices to enhance AI system reliability, security, and business alignment.

## Why AIRE?

[AI systems present unique operational challenges](https://medium.com/@den.vasyliev/ai-reliability-engineering-the-third-age-of-sre-1f4a71478cfa) that traditional SRE practices don't fully address:

### Key Challenges
- **Limited Visibility**: Lack of control over the AI lifecycle, including data collection, model deployment, and monitoring
- **Quality Assurance**: Ensuring model robustness against prompt attacks, data drift, and performance degradation
- **Complexity Management**: Handling intricate dependencies, configurations, and resources across environments
- **Security Concerns**: Protecting against data leakage, bias, and malicious attacks
- **Operational Integration**: Incorporating AI systems into existing DevOps workflows

AIRE addresses these challenges by providing:
- A structured approach to AI system reliability
- Tools and practices for managing AI-specific risks
- Methods for defining and measuring AI system reliability
- Integration patterns for existing MLOps and DevOps workflows
- Standardized processes for AI operations

## Features

### AIRE Framework
- Guidelines and best practices for AI reliability
- Templates and checklists for implementation
- Reference architectures integrating CNCF tools
- Implementation examples with real-world scenarios
- Documentation standards for AI systems

### AIRE Toolkit
- Open-source tools collection
- AI-specific monitoring and observability solutions
- Testing and validation utilities for LLMs
- Deployment templates for AI workflows
- Security scanning and prompt-attack prevention tools
- Language chain tracing capabilities
- AI gateway integration patterns

## Practical Use Cases

See also: [vLLM Simulator](https://llm-d.ai/docs/architecture/Components/inf-simulator)

### LLM Delivery and Deployment with Kubernetes Controllers
Based on the Flux OCI architecture, AIRE provides a streamlined approach for deploying LLMs to Kubernetes:
- **GitOps-driven Deployment**: Utilize custom controllers to manage LLM deployments through Git workflows
- **Infrastructure as Code**: Define LLM configurations, resource requirements, and scaling policies declaratively
- **Automated Rollouts**: Support for canary deployments and automated rollbacks
- **Resource Optimization**: Intelligent scheduling and resource management for GPU/CPU workloads
- **Model Versioning**: Integrated version control and model artifact management
- **Standardized OCI Image Format**: Ensure consistency by unifying LLM deployments around the OCI image format

<div align="center">
  <img src="img/us_1.png" alt="Use Case">
</div>

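The declarative, GitOps-driven flow above can be sketched with a minimal Kubernetes Deployment for an LLM inference server. This is an illustrative sketch only: the names, image tag, model, and resource figures are hypothetical placeholders, not part of AIRE.

```shell
#!/usr/bin/env sh
# Sketch: declare an LLM serving workload with explicit GPU resources.
# All identifiers (llm-server, the vllm image, the model) are placeholders.
cat > llm-deploy.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # hypothetical image tag
          args: ["--model", "facebook/opt-125m"]
          resources:
            limits:
              nvidia.com/gpu: "1"          # reserve one GPU per replica
EOF
echo "wrote llm-deploy.yaml"

# In a GitOps workflow this manifest is committed to Git (or pushed as an
# OCI artifact) and reconciled by a controller such as Flux, rather than
# applied by hand with:
#   kubectl apply -f llm-deploy.yaml
```

Because the manifest lives in version control, rollouts and rollbacks become Git operations, which is what makes canary deployments and automated rollback policies practical for model updates.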
### Observability with OpenInference
Leveraging OpenInference for comprehensive LLM observability:
- **Distributed Tracing**: End-to-end visibility into LLM request flows and chain-of-thought processes
- **Performance Metrics**: Track latency, throughput, and resource utilization
- **Semantic Logging**: Structured logging for prompt engineering and response analysis
- **Cost Monitoring**: Track token usage and associated costs
- **Quality Metrics**: Monitor hallucination rates, response quality, and model drift

### Reliability with AI Gateway
Following industry best practices for AI gateway implementation:
- **Traffic Management**: Rate limiting, load balancing, and request routing
- **Security Controls**: Authentication, authorization, and prompt validation
- **Cost Optimization**: Smart caching and request batching
- **Model Governance**: Version control, A/B testing, and shadow deployment
- **API Standardization**: Unified interface for multiple LLM providers

## Key Benefits

### For Organizations
- **Reliability**: Define and track SLOs/SLIs specific to AI systems
- **Operations**: Streamlined maintenance and monitoring
- **Development**: Faster and safer deployment cycles
- **Collaboration**: Better alignment between AI/ML and SRE teams
- **Risk Management**: Reduced operational and compliance risks

### For the CNCF Ecosystem
- **Standards Promotion**: Framework for ensuring reliability in AI workflows
- **Technology Bridge**: Adaptation of CNCF tools for AI-specific challenges
- **Enhanced Observability**: Practical implementations for AI lifecycle monitoring
- **Ethical AI**: Methods to reduce risks and ensure compliance
- **Operational Excellence**: Standardized processes for AI integration

## Documentation

- [llm-d: a Kubernetes-native, high-performance distributed LLM inference framework](https://llm-d.ai/blog/llm-d-announce)

## Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.

## Community
- [YouTube](https://youtu.be/Ef6JUVLWPwU)
- [Discord/Slack Channel]
- [Discussion Forum]
- [Community Meetings]

## Reference Architecture & Case Studies
- [Enterprise GenAI Delivery Patterns](https://itrevolution.com/product/enterprise-gen-ai-delivery-patterns)
- [FMOps/LLMOps on AWS](https://aws.amazon.com/blogs/machine-learning/fmops-llmops-operationalize-generative-ai-and-differences-with-mlops/)
- [OpenAI: Scaling Kubernetes to 7,500 nodes](https://openai.com/research/scaling-kubernetes-to-7500-nodes)
- [GPT-4 Architecture Overview](https://www.semianalysis.com/p/gpt-4-architecture-infrastructure)
- [Ray Framework](https://assets.ctfassets.net/bguokct8bxgd/26Vuu2NJLVnWkX4TkalSmB/fbc74da45885ca8e5048583f8a7e9d25/Ray_OSS_Datasheet_-_Final.pdf)

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

--------------------------------------------------------------------------------