├── img
│   ├── aire.png
│   ├── us_1.png
│   └── llm-d.png
├── LICENSE
├── CONTRIBUTING.md
└── README.md

--------------------------------------------------------------------------------
/img/aire.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/den-vasyliev/aire/HEAD/img/aire.png

--------------------------------------------------------------------------------
/img/us_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/den-vasyliev/aire/HEAD/img/us_1.png

--------------------------------------------------------------------------------
/img/llm-d.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/den-vasyliev/aire/HEAD/img/llm-d.png

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

... [standard Apache 2.0 license text continues] ...

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing to AIRE

Thank you for your interest in contributing to AIRE! This document provides guidelines and instructions for contributing to the project.

## Code of Conduct

By participating in this project, you agree to abide by our Code of Conduct (coming soon).
We expect all contributors to help maintain a respectful, inclusive, and collaborative environment.

## How to Contribute

There are many ways to contribute to AIRE:

1. **Report Bugs**
   - Use the GitHub Issues tracker
   - Clearly describe the issue, including steps to reproduce
   - Include relevant information such as OS, environment, etc.

2. **Suggest Enhancements**
   - Use GitHub Issues for feature requests
   - Clearly describe the feature and its use case
   - If possible, outline a technical approach

3. **Submit Code Changes**
   - Fork the repository
   - Create a new branch for your feature/fix
   - Write clear, commented code
   - Include tests where applicable
   - Submit a Pull Request

4. **Improve Documentation**
   - Fix typos or clarify existing documentation
   - Add examples and use cases
   - Translate documentation

## Development Process

1. **Fork & Clone**
   ```bash
   git clone https://github.com/your-username/aire.git
   cd aire
   ```

2. **Create a Branch**
   ```bash
   git checkout -b feature/your-feature-name
   ```

3. **Make Changes**
   - Write your code
   - Add tests
   - Update documentation

4. **Commit Changes**
   - Use clear commit messages
   - Reference relevant issues
   ```bash
   git commit -m "feat: add new feature #IssueNumber"
   ```

5. **Submit Pull Request**
   - Push to your fork
   - Submit a PR to the main repository
   - Respond to review comments

## Pull Request Guidelines

- Follow the project's coding style and conventions
- Include tests for new features
- Update documentation as needed
- One feature/fix per PR
- Keep PRs focused and manageable in size

## Commit Message Format

We follow [Conventional Commits](https://www.conventionalcommits.org/).
Each commit message should be structured as follows:

```
<type>(<scope>): <subject>

[optional body]

[optional footer]
```

### Types
- `feat`: A new feature
- `fix`: A bug fix
- `docs`: Documentation-only changes
- `style`: Changes that don't affect the code's meaning (white-space, formatting, etc.)
- `refactor`: Code changes that neither fix a bug nor add a feature
- `perf`: Code changes that improve performance
- `test`: Adding missing tests or correcting existing tests
- `chore`: Changes to the build process or auxiliary tools

### Examples
```bash
feat(auth): add JWT authentication
fix(api): handle null response from user service
docs(readme): update installation instructions
test(login): add unit tests for login validation
```

### Breaking Changes
For commits that introduce breaking changes, add `BREAKING CHANGE:` in the footer:

```
feat(api): change authentication endpoint path

BREAKING CHANGE: Authentication endpoint moved from /auth to /v2/auth
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AIRE - AI Reliability Engineering

## _Unreliable AI is worse than no AI at all._

<div align="center">
  <img src="img/aire.png" alt="AIRE Logo">
</div>
## About

AIRE is an open-source framework that applies Site Reliability Engineering (SRE) principles and practices to AI systems. It provides a comprehensive toolkit and methodology for AI/ML practitioners, Data Engineers, DevOps, and SRE teams to develop, deliver, and operate reliable AI products.

As AI becomes critical for business competitiveness, organizations face unique challenges in integrating these systems reliably and securely. AIRE bridges this gap by combining CNCF ecosystem tools with established SRE practices to enhance AI system reliability, security, and business alignment.

## Why AIRE?

[AI systems present unique operational challenges](https://medium.com/@den.vasyliev/ai-reliability-engineering-the-third-age-of-sre-1f4a71478cfa) that traditional SRE practices don't fully address:

### Key Challenges
- **Limited Visibility**: Lack of control over the AI lifecycle, including data collection, model deployment, and monitoring
- **Quality Assurance**: Ensuring model robustness against prompt attacks, data drift, and performance degradation
- **Complexity Management**: Handling intricate dependencies, configurations, and resources across environments
- **Security Concerns**: Protecting against data leakage, bias, and malicious attacks
- **Operational Integration**: Incorporating AI systems into existing DevOps workflows

AIRE addresses these challenges by providing:
- A structured approach to AI system reliability
- Tools and practices for managing AI-specific risks
- Methods for defining and measuring AI system reliability
- Integration patterns for existing MLOps and DevOps workflows
- Standardized processes for AI operations

## Features

### AIRE Framework
- Guidelines and best practices for AI reliability
- Templates and checklists for implementation
- Reference architectures integrating CNCF tools
- Implementation examples with real-world scenarios
- Documentation standards for AI systems

### AIRE Toolkit
- Open-source tools collection
- AI-specific monitoring and observability solutions
- Testing and validation utilities for LLMs
- Deployment templates for AI workflows
- Security scanning and prompt-attack prevention tools
- Language chain tracing capabilities
- AI gateway integration patterns

## Practical Use Cases

See also: [vLLM Simulator](https://llm-d.ai/docs/architecture/Components/inf-simulator)

### LLM Delivery and Deployment with Kubernetes Controllers
Based on the Flux OCI architecture, AIRE provides a streamlined approach for deploying LLMs to Kubernetes:
- **GitOps-driven Deployment**: Utilize custom controllers to manage LLM deployments through Git workflows
- **Infrastructure as Code**: Define LLM configurations, resource requirements, and scaling policies declaratively
- **Automated Rollouts**: Support for canary deployments and automated rollbacks
- **Resource Optimization**: Intelligent scheduling and resource management for GPU/CPU workloads
- **Model Versioning**: Integrated version control and model artifact management
- **Standardized OCI Image Format**: Ensure consistency by unifying LLM deployments around the OCI image format

<div align="center">
  <img src="img/us_1.png" alt="Use Case">
</div>

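The declarative, GitOps-driven flow above can be sketched with a minimal Kubernetes Deployment for an LLM inference server. This is an illustrative sketch only: the names, image tag, model, and resource figures are hypothetical placeholders, not part of AIRE.

```shell
#!/usr/bin/env sh
# Sketch: declare an LLM serving workload with explicit GPU resources.
# All identifiers (llm-server, the vllm image, the model) are placeholders.
cat > llm-deploy.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # hypothetical image tag
          args: ["--model", "facebook/opt-125m"]
          resources:
            limits:
              nvidia.com/gpu: "1"          # reserve one GPU per replica
EOF
echo "wrote llm-deploy.yaml"

# In a GitOps workflow this manifest is committed to Git (or pushed as an
# OCI artifact) and reconciled by a controller such as Flux, rather than
# applied by hand with:
#   kubectl apply -f llm-deploy.yaml
```

Because the manifest lives in version control, rollouts and rollbacks become Git operations, which is what makes canary deployments and automated rollback policies practical for model updates.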
### Observability with OpenInference
Leveraging OpenInference for comprehensive LLM observability:
- **Distributed Tracing**: End-to-end visibility into LLM request flows and chain-of-thought processes
- **Performance Metrics**: Track latency, throughput, and resource utilization
- **Semantic Logging**: Structured logging for prompt engineering and response analysis
- **Cost Monitoring**: Track token usage and associated costs
- **Quality Metrics**: Monitor hallucination rates, response quality, and model drift

### Reliability with AI Gateway
Following industry best practices for AI gateway implementation:
- **Traffic Management**: Rate limiting, load balancing, and request routing
- **Security Controls**: Authentication, authorization, and prompt validation
- **Cost Optimization**: Smart caching and request batching
- **Model Governance**: Version control, A/B testing, and shadow deployment
- **API Standardization**: Unified interface for multiple LLM providers

## Key Benefits

### For Organizations
- **Reliability**: Define and track SLOs/SLIs specific to AI systems
- **Operations**: Streamlined maintenance and monitoring
- **Development**: Faster and safer deployment cycles
- **Collaboration**: Better alignment between AI/ML and SRE teams
- **Risk Management**: Reduced operational and compliance risks

### For the CNCF Ecosystem
- **Standards Promotion**: Framework for ensuring reliability in AI workflows
- **Technology Bridge**: Adaptation of CNCF tools for AI-specific challenges
- **Enhanced Observability**: Practical implementations for AI lifecycle monitoring
- **Ethical AI**: Methods to reduce risks and ensure compliance
- **Operational Excellence**: Standardized processes for AI integration

## Documentation

- [llm-d: a Kubernetes-native, high-performance distributed LLM inference framework](https://llm-d.ai/blog/llm-d-announce)

## Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.

## Community
- [YouTube](https://youtu.be/Ef6JUVLWPwU)
- [Discord/Slack Channel]
- [Discussion Forum]
- [Community Meetings]

## Reference Architecture & Case Studies
- [Enterprise GenAI Delivery Patterns](https://itrevolution.com/product/enterprise-gen-ai-delivery-patterns)
- [FMOps/LLMOps on AWS](https://aws.amazon.com/blogs/machine-learning/fmops-llmops-operationalize-generative-ai-and-differences-with-mlops/)
- [OpenAI: Scaling Kubernetes to 7,500 nodes](https://openai.com/research/scaling-kubernetes-to-7500-nodes)
- [GPT-4 Architecture Overview](https://www.semianalysis.com/p/gpt-4-architecture-infrastructure)
- [Ray Framework](https://assets.ctfassets.net/bguokct8bxgd/26Vuu2NJLVnWkX4TkalSmB/fbc74da45885ca8e5048583f8a7e9d25/Ray_OSS_Datasheet_-_Final.pdf)

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

--------------------------------------------------------------------------------