├── .github ├── CODE_OF_CONDUCT.md ├── ISSUE_TEMPLATE.md └── PULL_REQUEST_TEMPLATE.md ├── .gitignore ├── CHANGELOG.md ├── CONTRIBUTING.md ├── LICENSE.md ├── README.md ├── code ├── AOAIHandler.py ├── AzureBatch.py ├── AzureStorageHandler.py ├── RunBatch.py └── Utilities.py ├── media ├── batch_accel_overview_new.png └── overview.pdf ├── requirements.txt └── templates ├── AOAI_config_template.json ├── app_config.json ├── batch_template.json └── storage_config.json /.github/CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Microsoft Open Source Code of Conduct 2 | 3 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 4 | 5 | Resources: 6 | 7 | - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) 8 | - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 9 | - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns 10 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | 4 | > Please provide us with the following information: 5 | > --------------------------------------------------------------- 6 | 7 | ### This issue is for a: (mark with an `x`) 8 | ``` 9 | - [ ] bug report -> please search issues before submitting 10 | - [ ] feature request 11 | - [ ] documentation issue or request 12 | - [ ] regression (a behavior that used to work and stopped in a new release) 13 | ``` 14 | 15 | ### Minimal steps to reproduce 16 | > 17 | 18 | ### Any log messages given by the failure 19 | > 20 | 21 | ### Expected/desired behavior 22 | > 23 | 24 | ### OS and Version? 25 | > Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?) 
26 | 27 | ### Versions 28 | > 29 | 30 | ### Mention any other details that might be useful 31 | 32 | > --------------------------------------------------------------- 33 | > Thanks! We'll be in touch soon. 34 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | ## Purpose 2 | 3 | * ... 4 | 5 | ## Does this introduce a breaking change? 6 | 7 | ``` 8 | [ ] Yes 9 | [ ] No 10 | ``` 11 | 12 | ## Pull Request Type 13 | What kind of change does this Pull Request introduce? 14 | 15 | 16 | ``` 17 | [ ] Bugfix 18 | [ ] Feature 19 | [ ] Code style update (formatting, local variables) 20 | [ ] Refactoring (no functional changes, no api changes) 21 | [ ] Documentation content changes 22 | [ ] Other... Please describe: 23 | ``` 24 | 25 | ## How to Test 26 | * Get the code 27 | 28 | ``` 29 | git clone [repo-address] 30 | cd [repo-name] 31 | git checkout [branch-name] 32 | npm install 33 | ``` 34 | 35 | * Test the code 36 | 37 | ``` 38 | ``` 39 | 40 | ## What to Check 41 | Verify that the following are valid 42 | * ... 43 | 44 | ## Other Information 45 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | /.venv 2 | /config 3 | /data 4 | /notebooks 5 | /code/__pycache__ 6 | /code/test.py -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | ## [project-title] Changelog 2 | 3 | 4 | # x.y.z (yyyy-mm-dd) 5 | 6 | *Features* 7 | * ... 8 | 9 | *Bug Fixes* 10 | * ... 11 | 12 | *Breaking Changes* 13 | * ... 
14 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to [project-title] 2 | 3 | This project welcomes contributions and suggestions. Most contributions require you to agree to a 4 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us 5 | the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. 6 | 7 | When you submit a pull request, a CLA bot will automatically determine whether you need to provide 8 | a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions 9 | provided by the bot. You will only need to do this once across all repos using our CLA. 10 | 11 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 12 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or 13 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. 14 | 15 | - [Code of Conduct](#coc) 16 | - [Issues and Bugs](#issue) 17 | - [Feature Requests](#feature) 18 | - [Submission Guidelines](#submit) 19 | 20 | ## Code of Conduct 21 | Help us keep this project open and inclusive. Please read and follow our [Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 22 | 23 | ## Found an Issue? 24 | If you find a bug in the source code or a mistake in the documentation, you can help us by 25 | [submitting an issue](#submit-issue) to the GitHub Repository. Even better, you can 26 | [submit a Pull Request](#submit-pr) with a fix. 27 | 28 | ## Want a Feature? 29 | You can *request* a new feature by [submitting an issue](#submit-issue) to the GitHub 30 | Repository. 
If you would like to *implement* a new feature, please submit an issue with 31 | a proposal for your work first, to be sure that we can use it. 32 | 33 | * **Small Features** can be crafted and directly [submitted as a Pull Request](#submit-pr). 34 | 35 | ## Submission Guidelines 36 | 37 | ### Submitting an Issue 38 | Before you submit an issue, search the archive; maybe your question has already been answered. 39 | 40 | If your issue appears to be a bug, and hasn't been reported, open a new issue. 41 | Help us to maximize the effort we can spend fixing issues and adding new 42 | features, by not reporting duplicate issues. Providing the following information will increase the 43 | chances of your issue being dealt with quickly: 44 | 45 | * **Overview of the Issue** - if an error is being thrown a non-minified stack trace helps 46 | * **Version** - what version is affected (e.g. 0.1.2) 47 | * **Motivation for or Use Case** - explain what you are trying to do and why the current behavior is a bug for you 48 | * **Browsers and Operating System** - is this a problem with all browsers? 49 | * **Reproduce the Error** - provide a live example or an unambiguous set of steps 50 | * **Related Issues** - has a similar issue been reported before? 51 | * **Suggest a Fix** - if you can't fix the bug yourself, perhaps you can point to what might be 52 | causing the problem (line of code or commit) 53 | 54 | You can file new issues by providing the above information at the corresponding repository's issues link: https://github.com/[organization-name]/[repository-name]/issues/new. 55 | 56 | ### Submitting a Pull Request (PR) 57 | Before you submit your Pull Request (PR) consider the following guidelines: 58 | 59 | * Search the repository (https://github.com/[organization-name]/[repository-name]/pulls) for an open or closed PR 60 | that relates to your submission. You don't want to duplicate effort. 
61 | 62 | * Make your changes in a new git fork: 63 | 64 | * Commit your changes using a descriptive commit message 65 | * Push your fork to GitHub: 66 | * In GitHub, create a pull request 67 | * If we suggest changes then: 68 | * Make the required updates. 69 | * Rebase your fork and force push to your GitHub repository (this will update your Pull Request): 70 | 71 | ```shell 72 | git rebase master -i 73 | git push -f 74 | ``` 75 | 76 | That's it! Thank you for your contribution! 77 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) Microsoft Corporation. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

Unofficial Azure OpenAI Batch Accelerator

2 |

Disclaimer:

This is a reference implementation of the Azure OpenAI Batch API designed to be extended for different use cases.
3 | This code is NOT intended for production use; it is a starting point/reference implementation of the Azure OpenAI (AOAI) Batch API. The code here is provided AS IS; you assume all responsibility (e.g., charges) for running it. Test in your own environment before running large use cases. Lastly, this is a work in progress and will be updated frequently. Please check back regularly for updates. 4 |

Background & Overview

5 | This accelerator is designed to help users to quickly start using the Azure OpenAI Batch API. An overview of how the accelerator works is shown below: 6 | 7 | ![Overview](media/batch_accel_overview_new.png) 8 | 9 | Key features of the accelerator are: 10 |
11 | 12 | 1. Automated Batch Job Submission and Creation 13 | 2. Multi-threaded Async Processing to Reduce Overall Processing Time 14 | 3. Automated Error Tracking 15 | 4. Multi-directory Hierarchy Support 16 | 5. Configurable Micro-batch support 17 | 6. Automated Post-job Cleanup 18 | 19 |
20 | For more details, including a detailed data flow diagram, please see this overview. 21 | 22 |

Installation & Setup

23 | Environment:

24 | 25 | 1. Python 3.11 (or higher) 26 | 2. Pip 27 | 3. An Azure Data Lake Storage (v2) account 28 | 4. An Azure OpenAI deployment 29 | 30 |
The following pip packages are required:
31 | 1. azure-storage-file-datalake
32 | 2. openai 33 | 3. tiktoken 34 | 4. requests 35 | 5. token-count 36 | 6. asyncio 37 | 7. aiohttp 38 | 39 | In addition to this, it is recommended to install these dependencies in a virtual environment to avoid conflicts (e.g., .venv) 40 |

Connecting AOAI to Azure Storage

41 | The `Storage Blob Data Contributor` role must be assigned to the AOAI service's Managed Identity to allow AOAI to access the data in the Azure Storage Account. 42 |

Configuration:

43 | There are three configuration files and one variable required to use this accelerator: 44 | 45 | 1. `AOAI_config.json` - This file contains the settings for AOAI. 46 | 2. `storage_config.json` - This file contains the settings for the Azure Data Lake Storage Account which will hold the input/output of the job. 47 | 3. `app_config.json` - This file contains the application configuration settings. 48 | 4. `APP_CONFIG` in `RunBatch.py` - This variable should be set to point to the `app_config.json` file which defines the app settings. Alternatively, this value can be set as an environment variable in the underlying OS. This will support command line parameter-based input in the future. 49 | 50 | Reference templates of these files have been provided in the `templates` directory where <> denote settings that must be filled in. 51 | Other important settings are: 52 | 53 | 1. aoai_api_version - This must be set to `2024-07-01-preview` as that's the only API version which supports the Batch API at this time. In the future, different versions can be set here. 54 | 2. batch_job_endpoint - This must be set to `/chat/completions`. 55 | 3. batch_size - This controls the 'micro batch' size, which is the number of files that will be sent to the batch service in parallel. It is set to a recommended value of `10` but can be changed 56 | based on the requirements/file sizes being sent to the batch service. 57 | 4. download_to_local - This controls whether files are downloaded locally to count the number of tokens in a file. Currently this should be set to the default value of `false` but may be used in future versions. 58 | 5. input_directory/filesystem - This is the directory and filesystem the code will check for input files, respectively. The default directory setting of `/` assumes no directories in the input filesystem. The current implementation is not recursive; if input files are stored in a directory in the input filesystem/container then it should be specified here. 59 | 6. 
output_directory/filesystem - This is the directory and filesystem the code will write output files to, respectively. The default directory setting of `/` assumes no directories in the output filesystem. 60 | 7. error_directory/filesystem - This is the directory and filesystem the code will write error files to, respectively. The default directory setting of `/` assumes no directories in the error filesystem. 61 | 8. continuous_mode - This setting controls how the code is run. If set to `true`, it will continuously check the input directory for files every 60 seconds, taking a snapshot of the files and kicking off a series of batch jobs to process until all files are processed. To stop, press `ctrl+c`. If set to `false` it will only run when executed. 62 | 63 |
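The `batch_size` setting above can be pictured with a short sketch. This is a minimal, standalone illustration of micro-batching with `asyncio.gather`, not the accelerator's own code: `process_file` is a hypothetical stand-in for the per-file work (upload, job submission, polling, download), and the function names are assumptions made for this example.

```python
import asyncio


async def process_file(name: str) -> str:
    """Hypothetical stand-in for the real per-file work."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"{name}: done"


async def process_in_micro_batches(files: list[str], batch_size: int) -> list[str]:
    """Run at most `batch_size` files concurrently, one micro-batch at a time."""
    results: list[str] = []
    for start in range(0, len(files), batch_size):
        chunk = files[start:start + batch_size]
        # Every file in this micro-batch must finish before the next one starts.
        results.extend(await asyncio.gather(*(process_file(f) for f in chunk)))
    return results


def run_micro_batches(files: list[str], batch_size: int) -> list[str]:
    """Synchronous entry point for the sketch above."""
    return asyncio.run(process_in_micro_batches(files, batch_size))
```

With `batch_size` set to `10`, thirty input files would be handled as three consecutive groups of ten. A larger value increases concurrency but also the number of files in flight against the batch service at once.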

Using the accelerator

64 | 65 | 1. Input: Upload formatted batch files to the input location specified in the `storage_config.json` configuration file. Once all files are uploaded, run `RunBatch.py` in the `code` directory. The code will run continuously or once, depending on the `continuous_mode` setting described above. 66 | 2. Output: The code will create a directory in the `processed_filesystem_system_name` location in the `storage_config.json` configuration file for each file processed, along with a timestamp of when the file was processed. The raw input file will also be moved to the `processed` directory. In addition, if there are any errors, they will be put in the `error_filesystem_system_name` location, with a timestamp. 67 | 3. Metadata: The code creates a metadata file for each input file, containing mapping information which may be useful for automated processing of results. 68 | 4. Cleanup: After processing is complete, the code will automatically clean up all files in the input directory, locally downloaded files, and all files uploaded to the AOAI Batch Service. 69 | 70 |
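Step 1 assumes the input files are already in the batch format. As a rough sketch, the snippet below builds such a file as JSON Lines, one request per line. The field layout (`custom_id`, `method`, `url`, `body`) follows the Azure OpenAI batch input format as commonly documented; please verify it against the current service documentation. The `task-N` naming scheme, the function names, and the deployment name are assumptions made for illustration.

```python
import json


def build_batch_lines(prompts: list[str], deployment: str) -> list[str]:
    """Return one JSON-encoded chat request per prompt."""
    lines = []
    for index, prompt in enumerate(prompts):
        request = {
            "custom_id": f"task-{index}",  # hypothetical naming scheme
            "method": "POST",
            "url": "/chat/completions",    # same value as batch_job_endpoint
            "body": {
                "model": deployment,       # your AOAI batch deployment name
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return lines


def write_batch_file(path: str, prompts: list[str], deployment: str) -> None:
    """Write the requests to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_batch_lines(prompts, deployment)) + "\n")
```

A file produced this way can then be uploaded to the input location before the accelerator is started.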

Issues

71 | If you have any problems using this code or would like to see a new feature added, please create a new issue using the 'Issues' tab. 72 | 73 | -------------------------------------------------------------------------------- /code/AOAIHandler.py: -------------------------------------------------------------------------------- 1 | 2 | from openai import AzureOpenAI 3 | import requests 4 | import aiohttp 5 | import datetime 6 | import asyncio 7 | 8 | class AOAIHandler: 9 | def __init__(self, config, batch=False): 10 | self.config_data = config 11 | self.model = config["aoai_deployment_name"] 12 | self.batch_endpoint = config["batch_job_endpoint"] 13 | self.completion_window = config["completion_window"] 14 | self.aoai_client = self.init_client(config) 15 | self.batch_status = {} 16 | self.azure_endpoint = config['aoai_endpoint'] 17 | self.api_version = config['aoai_api_version'] 18 | self.api_key = config["aoai_key"] 19 | def init_client(self,config): 20 | client = AzureOpenAI( 21 | azure_endpoint = config['aoai_endpoint'], 22 | api_key=config['aoai_key'], 23 | api_version=config['aoai_api_version'] 24 | ) 25 | return client 26 | async def upload_batch_input_file_async(self,input_file_name, input_file_path, session): 27 | try: 28 | url = f"{self.azure_endpoint}openai/files/import?api-version={self.api_version}" 29 | headers = { 30 | "Content-Type": "application/json", 31 | "api-key": self.api_key # Replace with your actual API key 32 | } 33 | # Define the payload 34 | payload = { 35 | "purpose": "batch", 36 | "filename": input_file_name, 37 | "content_url": input_file_path 38 | } 39 | async with session.post(url, headers=headers, json=payload) as response: 40 | return await response.json() 41 | except Exception as e: 42 | print(f"An exception occurred while uploading the file: {e}") 43 | return False 44 | def upload_batch_input_file(self,input_file_name, input_file_path): 45 | try: 46 | url = 
f"{self.azure_endpoint}openai/files/import?api-version={self.api_version}" 47 | headers = { 48 | "Content-Type": "application/json", 49 | "api-key": self.api_key # Replace with your actual API key 50 | } 51 | # Define the payload 52 | payload = { 53 | "purpose": "batch", 54 | "filename": input_file_name, 55 | "content_url": input_file_path 56 | } 57 | 58 | return requests.request("POST", url, headers=headers, json=payload) 59 | except Exception as e: 60 | print(f"An exception occurred while uploading the file: {e}") 61 | return False 62 | def delete_single(self, file_id): 63 | deletion_status = False 64 | try: 65 | # Attempt to delete the file 66 | response = self.aoai_client.files.delete(file_id) 67 | print(f"File {file_id} deleted from client successfully.") 68 | deletion_status = True 69 | except Exception as e: 70 | # Handle any exceptions that occur 71 | print(f"An error occurred while deleting file {file_id}: {e}") 72 | return deletion_status 73 | def delete_all_files(self): 74 | deletion_status = {} 75 | file_objects = self.aoai_client.files.list().data 76 | # Extracting the ids using a list comprehension 77 | file_ids = [file_object.id for file_object in file_objects] 78 | for file_id in file_ids: 79 | try: 80 | # Attempt to delete the file 81 | response = self.aoai_client.files.delete(file_id) 82 | print(f"File {file_id} deleted successfully.") 83 | deletion_status[file_id] = True 84 | except Exception as e: 85 | # Handle any exceptions that occur 86 | print(f"An error occurred while deleting file {file_id}: {e}") 87 | deletion_status[file_id] = False 88 | 89 | return deletion_status 90 | def create_batch_job(self,file_id): 91 | # Submit a batch job with the file 92 | batch_response = self.aoai_client.batches.create( 93 | input_file_id=file_id, 94 | endpoint=self.batch_endpoint, 95 | completion_window=self.completion_window, 96 | ) 97 | # Save batch ID for later use 98 | batch_id = batch_response.id 99 | self.batch_status[batch_id] = "Submitted" 100 | 
return batch_response 101 | async def wait_for_file_upload(self, file_id): 102 | status = "pending" 103 | while True: 104 | file_response = self.aoai_client.files.retrieve(file_id) 105 | status = file_response.status 106 | if status == "error": 107 | print(f"{datetime.datetime.now()} Error occurred while processing file {file_id}") 108 | break 109 | elif status == "processed": 110 | print(f"{datetime.datetime.now()} File {file_id} processed successfully.") 111 | break 112 | else: 113 | print(f"{datetime.datetime.now()} File Id: {file_id}, Status: {status}") 114 | await asyncio.sleep(5) 115 | return file_response 116 | async def wait_for_batch_job(self, batch_id): 117 | # Poll until the batch job reaches a terminal state 118 | status = "validating" 119 | while status not in ("completed", "failed", "canceled", "cancelled", "expired"): 120 | batch_response = self.aoai_client.batches.retrieve(batch_id) 121 | status = batch_response.status 122 | print(f"{datetime.datetime.now()} Batch Id: {batch_id}, Status: {status}") 123 | await asyncio.sleep(10) 124 | if status in ("failed", "expired"): 125 | print(f"Batch job {batch_id} ended with status: {status}.") 126 | elif status in ("canceled", "cancelled"): 127 | print(f"Batch job {batch_id} was canceled.") 128 | else: 129 | print(f"Batch job {batch_id} completed successfully.") 130 | return batch_response -------------------------------------------------------------------------------- /code/AzureBatch.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from Utilities import Utils 4 | import asyncio 5 | import aiohttp 6 | class AzureBatch: 7 | def __init__(self, aoai_client, input_storage_handler, 8 | error_storage_handler, processed_storage_handler, batch_path, 9 | input_directory_client, local_download_path, output_directory, error_directory, 10 | count_tokens=False): 11 | self.aoai_client = aoai_client 12 | self.input_storage_handler = input_storage_handler 13 | self.error_storage_handler = error_storage_handler 14 | 
self.processed_storage_handler = processed_storage_handler 15 | self.batch_path = batch_path 16 | self.input_directory_client = input_directory_client 17 | self.local_download_path = local_download_path 18 | self.output_directory = output_directory 19 | self.error_directory = error_directory 20 | self.count_tokens = count_tokens 21 | 22 | async def process_all_files(self,files,micro_batch_size): 23 | tasks = [] 24 | current_tasks = 0 25 | async with aiohttp.ClientSession() as session: 26 | for file in files: 27 | tasks.append(self.process_file(file, session)) 28 | current_tasks += 1 29 | if current_tasks == micro_batch_size: 30 | await asyncio.gather(*tasks) 31 | tasks = [] 32 | current_tasks = 0 33 | #Process any remaining tasks 34 | if len(tasks) > 0: 35 | await asyncio.gather(*tasks) 36 | 37 | async def process_file(self,file, session): 38 | print(f"Processing file {file}") 39 | filename_only = Utils.get_file_name_only(file) 40 | file_wo_directory = Utils.strip_directory_name(file) 41 | file_extension = Utils.get_file_extension(file_wo_directory) 42 | output_directory_name = self.output_directory+"/"+Utils.append_postfix(filename_only) 43 | error_directory_name = self.error_directory+"/"+Utils.append_postfix(filename_only) 44 | #Mark start time 45 | processing_result = {} 46 | batch_data = None 47 | try: 48 | batch_data = await self.submit_batch_job(file, file_wo_directory, error_directory_name, filename_only, session) 49 | if batch_data is None: 50 | return 51 | self.process_batch_result(batch_data, filename_only, file_extension, file_wo_directory, 52 | error_directory_name, output_directory_name) 53 | cleanup_status = self.cleanup_batch(file_wo_directory,batch_data["file_id"], batch_data["output_file_id"], batch_data["error_file_id"]) 54 | processing_result["cleanup_status"] = cleanup_status 55 | except Exception as e: 56 | #Unexpected exception during processing 57 | print(f"An error occurred while processing file: {file}. 
Error: {e}") 58 | if batch_data is not None: 59 | file_write_result = self.error_storage_handler.write_content_to_directory(batch_data["batch_file_data"],error_directory_name,filename_only) 60 | cleanup_status = self.cleanup_batch(file_wo_directory,batch_data["file_id"], batch_data["output_file_id"], batch_data["error_file_id"]) 61 | processing_result["cleanup_status"] = cleanup_status 62 | return processing_result 63 | 64 | async def submit_batch_job(self,file, file_wo_directory, error_directory_name, filename_only, session): 65 | batch_storage_path = self.batch_path + file 66 | try: 67 | if self.local_download_path is not None: 68 | output_path = os.path.join(self.local_download_path, file) 69 | batch_file_data = self.input_storage_handler.save_file_to_local(file, 70 | self.input_directory_client, output_path) 71 | if self.count_tokens: 72 | token_size = Utils.get_tokens_in_file(output_path,"gpt-4") 73 | else: 74 | batch_file_data = self.input_storage_handler.get_file_data(file_wo_directory,self.input_directory_client) 75 | batch_file_string = str(batch_file_data) 76 | if self.count_tokens: 77 | token_size = Utils.num_tokens_from_string(batch_file_string,"gpt-4") 78 | except Exception as e: 79 | print(f"Could not download file: {file}. Error: {e}") 80 | return None 81 | if self.count_tokens: 82 | print(f"File {file} has {token_size} tokens") 83 | else: 84 | token_size = "N/A" 85 | # Process the file 86 | upload_response = await self.aoai_client.upload_batch_input_file_async(file,batch_storage_path, session) 87 | if not upload_response: 88 | print(f"An error occurred while uploading file {file}. 
Please check the file and try again.") 89 | file_write_result = self.error_storage_handler.write_content_to_directory(batch_file_data,error_directory_name,file_wo_directory) 90 | cleanup_status = self.cleanup_batch(file_wo_directory,None, None, None) 91 | return None 92 | file_content_json = upload_response 93 | if "error" in file_content_json: 94 | print(f"An error occurred while uploading file {file}. Please check the file and try again.\n\nCode: "+file_content_json["error"]["code"]+"\n\nMessage: "+file_content_json["error"]["message"]) 95 | file_write_result = self.error_storage_handler.write_content_to_directory(batch_file_data,error_directory_name,file_wo_directory) 96 | cleanup_status = self.cleanup_batch(file_wo_directory,None, None, None) 97 | return None 98 | file_id = file_content_json['id'] 99 | print(f"file_id: {file_content_json['id']}") 100 | #TODO: Check if the file was uploaded successfully, if not, move to error folder and cleanup 101 | await self.aoai_client.wait_for_file_upload(file_id) 102 | try: 103 | initial_batch_response = self.aoai_client.create_batch_job(file_id) 104 | except Exception as e: 105 | print(f"An error occurred while creating batch job for file: {file}. 
Error: {e}") 106 | file_write_result = self.error_storage_handler.write_content_to_directory(batch_file_data,error_directory_name,file_wo_directory) 107 | cleanup_status = self.cleanup_batch(file_wo_directory,None, None, None) 108 | return None 109 | #Poll until the batch job reaches a terminal state 110 | (finished_batch_response) = await self.aoai_client.wait_for_batch_job(initial_batch_response.id) 111 | batch_data = { 112 | "file": file, 113 | "input_file_id": finished_batch_response.input_file_id, 114 | "batch_job_id": initial_batch_response.id, 115 | "error_file_id": finished_batch_response.error_file_id, 116 | "output_file_id": finished_batch_response.output_file_id, 117 | "token_size": token_size, 118 | "initial_batch_response": initial_batch_response, 119 | "finished_batch_response": finished_batch_response, 120 | "file_id": file_id, 121 | "batch_file_data": batch_file_data 122 | } 123 | return batch_data 124 | 125 | def process_batch_result(self,batch_data, filename_only, file_extension, file_wo_directory, 126 | error_directory_name, output_directory_name): 127 | batch_metadata = self.create_batch_metadata(batch_data) 128 | metadata_filename = f"{filename_only}_metadata."+file_extension 129 | if batch_data["error_file_id"] is not None: 130 | error_file_content = self.aoai_client.aoai_client.files.content(batch_data["error_file_id"]) 131 | error_file_content_string = str(error_file_content.text) 132 | else: 133 | errors = batch_data["finished_batch_response"].errors.data 134 | error_file_content = {} 135 | error_index = 1 136 | for error_index, error in enumerate(errors, start=1): 137 | error_file_content["Error "+str(error_index)] = error.message 138 | error_file_content_string = json.dumps(error_file_content) 139 | if batch_data["output_file_id"] is not None: 140 | output_file_content = self.aoai_client.aoai_client.files.content(batch_data["output_file_id"]) 141 | output_file_content_string = str(output_file_content.text) 142 | else: 143 | output_file_content = "" 144 | output_file_content_string = "" 
145 | filename = batch_data["file"] 146 | file_id = batch_data["initial_batch_response"].id 147 | if not error_file_content_string == "": 148 | error_filename = f"{filename_only}_error."+file_extension 149 | batch_data["error_file_name"] = error_filename 150 | error_file_content_json = error_file_content_string 151 | error_file_metadata = json.dumps(batch_metadata) 152 | error_content_write_result = self.error_storage_handler.write_content_to_directory(error_file_content_json,error_directory_name,error_filename) 153 | error_metadata_write_result = self.error_storage_handler.write_content_to_directory(error_file_metadata,error_directory_name,metadata_filename) 154 | file_write_result = self.error_storage_handler.write_content_to_directory(batch_data["batch_file_data"],error_directory_name,file_wo_directory) 155 | if error_content_write_result and error_metadata_write_result: 156 | print(f"An error file with details written to the 'error' directory.") 157 | else: 158 | print(f"There was a problem processing file: {filename} and details could not be written to storage. 
Please check {file_id} for more details.") 159 | if not output_file_content_string == "": 160 | output_filename = f"{filename_only}_output."+file_extension 161 | batch_metadata["output_file_name"] = output_filename 162 | output_file_content = self.aoai_client.aoai_client.files.content(batch_metadata["output_file_id"]) 163 | output_file_content_json = output_file_content_string 164 | output_file_metadata = json.dumps(batch_metadata) 165 | output_content_write_result = self.processed_storage_handler.write_content_to_directory(output_file_content_json,output_directory_name,output_filename) 166 | output_metadata_write_result = self.processed_storage_handler.write_content_to_directory(output_file_metadata,output_directory_name,metadata_filename) 167 | file_write_result = self.processed_storage_handler.write_content_to_directory(batch_data["batch_file_data"],output_directory_name,file_wo_directory) 168 | if output_content_write_result and output_metadata_write_result: 169 | print(f"File: {filename} has been processed successfully. Results are available in the 'processed' directory.") 170 | else: 171 | print(f"File: {filename} has been processed successfully but could not be written to storage. 
Please check {file_id} for more details.") 172 | 173 | def create_batch_metadata(self,batch_data): 174 | batch_metadata = { 175 | "file_name": batch_data["file"], 176 | "input_file_id": batch_data["finished_batch_response"].input_file_id, 177 | "batch_job_id": batch_data["initial_batch_response"].id, 178 | "error_file_id": batch_data["finished_batch_response"].error_file_id, 179 | "output_file_id": batch_data["finished_batch_response"].output_file_id, 180 | "token_size": batch_data["token_size"], 181 | "file_id": batch_data["file_id"] 182 | } 183 | return batch_metadata 184 | 185 | def cleanup_batch(self,filename,file_id, output_file_id, error_file_id): 186 | cleanup_result = {} 187 | if file_id is not None: 188 | print("Deleting input file from client...") 189 | deletion_status = self.aoai_client.delete_single(file_id) 190 | if output_file_id is not None: 191 | print("Deleting output file from client...") 192 | deletion_status = self.aoai_client.delete_single(output_file_id) 193 | if error_file_id is not None: 194 | print("Deleting error file from client...") 195 | deletion_status = self.aoai_client.delete_single(error_file_id) 196 | if self.local_download_path is not None: 197 | local_filename_with_path = os.path.join(self.local_download_path, filename) 198 | if os.path.exists(local_filename_with_path): 199 | os.remove(local_filename_with_path) 200 | print(f"File {local_filename_with_path} deleted successfully.") 201 | cleanup_result["local_file_deletion"] = True 202 | az_storage_deletion_status = self.input_storage_handler.delete_file_data(filename,self.input_directory_client) 203 | if az_storage_deletion_status: 204 | print(f"File {filename} deleted from storage successfully.") 205 | cleanup_result["az_storage_file_deletion"] = True 206 | else: 207 | print(f"An error occurred while deleting file {filename} from storage.") 208 | cleanup_result["az_storage_file_deletion"] = False 209 | return cleanup_result 
--------------------------------------------------------------------------------
/code/AzureStorageHandler.py:
--------------------------------------------------------------------------------
from azure.storage.filedatalake import (
    DataLakeServiceClient,
    DataLakeDirectoryClient,
    FileSystemClient
)
import json


class StorageHandler:
    def __init__(self, storage_account_name, storage_account_key, file_system_name=None):
        self.storage_account_name = storage_account_name
        self.storage_account_key = storage_account_key
        self.service_client = self.get_service_client_account_key(storage_account_name, storage_account_key)
        if file_system_name is not None:
            self.file_system_client = self.get_file_system_client(file_system_name)
        else:
            self.file_system_client = None
        self.byte_read_size = 50000

    def get_directories(self,path):
        paths = self.file_system_client.get_paths(path=path)
        return_paths = []
        for current_path in paths:
            if current_path.is_directory:
                return_paths.append(current_path.name)
        #No subdirectories found, return the current directory
        if len(return_paths) == 0:
            return_paths.append(path)
        return return_paths

    def write_content_to_directory(self, file_content, directory_name, output_filename):
        write_result = False
        destination_directory_client = self.get_or_create_directory_client(directory_name)
        result_file_content_status = self.write_json_to_storage(output_filename,file_content,destination_directory_client)
        if result_file_content_status:
            write_result = True
            print(f"File {output_filename} written to storage directory.")
        else:
            print(f"Error writing file {output_filename} to directory.")
        return write_result

    def get_or_create_directory_client(self,directory_name):
        dir_exists = self.check_directory_exists(directory_name)
        if dir_exists:
            directory_client = self.get_directory_client(directory_name)
        else:
            directory_client = self.create_directory(directory_name)
        return directory_client

    def write_bytes_to_storage_chunked(self, source_filename,source_directory_client,
                                       destination_filename,destination_directory_client):
        try:
            output_file_stream = destination_directory_client.get_file_client(destination_filename)
            file_content_stream = self.get_file_stream(source_filename,source_directory_client)
            byte_stream = file_content_stream.read(self.byte_read_size)
            offset = 0
            while len(byte_stream) > 0:
                size = len(byte_stream)
                if offset == 0:
                    # First chunk: create the file, replacing any existing copy.
                    output_file_stream.upload_data(data=byte_stream, overwrite=True)
                else:
                    # Later chunks: append at the running offset and flush.
                    output_file_stream.append_data(data=byte_stream, offset=offset, length=size, flush=True)
                offset += size
                byte_stream = file_content_stream.read(self.byte_read_size)
        except Exception as e:
            print(f"Error writing file {source_filename} to destination directory: {e}")

    def copy_file_to_directory(self, source_filename, source_directory, destination_filesystem_client,
                               destination_directory, destination_filename ):
        # destination_filesystem_client must be a StorageHandler: get_or_create_directory_client is defined on this class.
        source_directory_client = self.get_directory_client(source_directory)
        destination_directory_client = destination_filesystem_client.get_or_create_directory_client(destination_directory)
        self.write_bytes_to_storage_chunked(source_filename,source_directory_client,destination_filename,
                                            destination_directory_client)

        return True

    def write_json_to_storage(self,output_name,output_data,directory_client):
        # Returns True on success and False if the upload raised.
        try:
            file_client = directory_client.get_file_client(output_name)
            file_client.upload_data(output_data, overwrite=True)
        except Exception:
            return False
        return True

    def check_directory_exists(self,directory_name):
        try:
            directory_client = self.file_system_client.get_directory_client(directory_name)
            return directory_client.exists()
        except Exception:
            return False

    def create_directory(self, directory_name: str) -> DataLakeDirectoryClient:
        directory_client = self.file_system_client.create_directory(directory_name)
        return directory_client

    def get_directory_client(self, directory_name: str) -> DataLakeDirectoryClient:
        directory_client = self.file_system_client.get_directory_client(directory_name)
        return directory_client

    def get_file_list(self, path: str) -> list:
        file_list = []
        paths = self.file_system_client.get_paths(path=path)
        # Use a separate loop variable so the `path` argument is not shadowed.
        for current_path in paths:
            if not current_path.is_directory:
                file_list.append(current_path.name)
        return file_list

    def get_file_stream(self, file_name,directory_client):
        file_client = directory_client.get_file_client(file_name)
        download = file_client.download_file()
        return download

    def get_file_data(self, file_name,directory_client):
        file_client = directory_client.get_file_client(file_name)
        download = file_client.download_file()
        return download.readall()

    def delete_file_data(self, file_name,directory_client):
        return_status = True
        try:
            file_client = directory_client.get_file_client(file_name)
            file_client.delete_file()
        except Exception:
            return_status = False
        return return_status

    def save_file_to_local(self, file_name, directory_client, local_path):
        file_client = directory_client.get_file_client(file_name)
        download = file_client.download_file()
        data = download.readall()
        try:
            with open(local_path, "wb") as file:
                file.write(data)
            print(f"File {file_name} saved to local path {local_path}")
        except Exception as e:
            print(f"An error occurred while saving file {file_name} to local path {local_path}: {e}")
        return data

    def get_file_system_client(self, file_system_name: str) -> FileSystemClient:
        file_system_client = self.service_client.get_file_system_client(file_system_name)
        return file_system_client

    def get_service_client_account_key(self, account_name, account_key) -> DataLakeServiceClient:
        account_url = f"https://{account_name}.dfs.core.windows.net"
        service_client = DataLakeServiceClient(account_url, credential=account_key)

        return service_client

--------------------------------------------------------------------------------
/code/RunBatch.py:
--------------------------------------------------------------------------------
from Utilities import Utils
from AzureStorageHandler import StorageHandler
from AOAIHandler import AOAIHandler
from AzureBatch import AzureBatch
import time
import asyncio
import signal
import sys
import os

def signal_handler(sig, frame):
    print('Exiting...')
    sys.exit(0)

def main():
    signal.signal(signal.SIGINT, signal_handler)
    # Set the APP_CONFIG environment variable; the fallback below is a developer-specific Windows path.
    APP_CONFIG = os.environ.get('APP_CONFIG', r"C:\Users\dade\Desktop\AOAIBatchWorkingFork\aoai-batch-api-accelerator\config\app_config.json")
    try:
        app_config_data = Utils.read_json_data(APP_CONFIG)
        storage_config_data = Utils.read_json_data(app_config_data["storage_config"])
        storage_account_name = storage_config_data["storage_account_name"]
        storage_account_key = storage_config_data["storage_account_key"]
        input_filesystem_system_name = storage_config_data["input_filesystem_system_name"]
        error_filesystem_system_name = storage_config_data["error_filesystem_system_name"]
        processed_filesystem_system_name = storage_config_data["processed_filesystem_system_name"]
        input_directory = storage_config_data["input_directory"]
        output_directory = storage_config_data["output_directory"]
        error_directory = storage_config_data["error_directory"]
        aoai_config_data = Utils.read_json_data(app_config_data["AOAI_config"])
        BATCH_PATH = "https://"+storage_account_name+".blob.core.windows.net/"+input_filesystem_system_name+"/"
        batch_size = int(app_config_data["batch_size"])
        count_tokens = int(app_config_data["count_tokens"])
        aoai_client = AOAIHandler(aoai_config_data)
        input_storage_handler = StorageHandler(storage_account_name, storage_account_key, input_filesystem_system_name)
        error_storage_handler = StorageHandler(storage_account_name, storage_account_key, error_filesystem_system_name)
        processed_storage_handler = StorageHandler(storage_account_name, storage_account_key, processed_filesystem_system_name)
        files = input_storage_handler.get_file_list(input_directory)
        input_directory_client = input_storage_handler.get_directory_client(input_directory)
        download_to_local = app_config_data["download_to_local"]
        local_download_path = None
        if download_to_local:
            local_download_path = app_config_data["local_download_path"]
        continuous_mode = app_config_data["continuous_mode"]
        azure_batch = AzureBatch(aoai_client, input_storage_handler,
                                 error_storage_handler, processed_storage_handler, BATCH_PATH, input_directory_client,
                                 local_download_path,output_directory, error_directory,count_tokens)
    except Exception as e:
        print(f"An error occurred while initializing the application, please check the configuration. \n\n\tException:\n\n\t\t{e}\n\n")
        return
    if continuous_mode:
        print("Running in continuous mode")
        while True:
            if len(files) > 0:
                asyncio.run(azure_batch.process_all_files(files, batch_size))
            else:
                print("No files found. Sleeping for 60 seconds")
                time.sleep(60)
            # Refresh the listing on every pass so new uploads are picked up.
            files = input_storage_handler.get_file_list(input_directory)
    else:
        print("Running in on-demand mode")
        asyncio.run(azure_batch.process_all_files(files, batch_size))

#TODO: 1) Support blob storage

if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/code/Utilities.py:
--------------------------------------------------------------------------------
import json
import tiktoken
import os
from token_count import TokenCount
from datetime import datetime


class Utils:
    #Add utility to count output tokens and estimate price.
    def __init__(self):
        pass

    @staticmethod
    def strip_directory_name(file_name):
        # Keep only the last path segment, e.g. "input/a.jsonl" -> "a.jsonl".
        return file_name.split("/")[-1]

    @staticmethod
    def get_file_name_only(file_name):
        # Everything before the first "." of the base name.
        file_name_with_extension = Utils.strip_directory_name(file_name)
        file_name_with_extension_split = file_name_with_extension.split(".")
        file_name_only = file_name_with_extension_split[0]
        return file_name_only

    @staticmethod
    def read_json_data(file_name):
        with open(file_name) as json_file:
            data = json.load(json_file)
        return data

    @staticmethod
    def get_file_list(directory):
        file_list = []
        for file in os.listdir(directory):
            file_list.append(file)
        return file_list

    @staticmethod
    def num_tokens_from_string(string: str, encoding_name: str) -> int:
        # Despite its name, encoding_name is passed to tiktoken.encoding_for_model,
        # so it must be a model name rather than an encoding name.
        encoding = tiktoken.encoding_for_model(encoding_name)
        num_tokens = len(encoding.encode(string))
        return num_tokens

    @staticmethod
    def get_tokens_in_file(file, model_family):
        tc = TokenCount(model_name=model_family)
        tokens = tc.num_tokens_from_file(file)
        return tokens

    @staticmethod
    def append_postfix(file):
        datetime_string = datetime.today().strftime('%Y-%m-%d_%H_%M_%S')
        return f"{file}_{datetime_string}"

    @staticmethod
    def clean_binary_string(data):
        # Strip the leading "b'" and trailing "'" of a bytes repr, then undo the escape sequences.
        return data[2:-1].replace('\\n', '').replace('\\"', '"').replace('\\\\', '\\')

    @staticmethod
    def convert_to_json_from_binary_string(data):
        # Clean the bytes repr, then parse the resulting JSON string into a dictionary.
        data_dict = json.loads(Utils.clean_binary_string(data))
        return data_dict

    @staticmethod
    def get_file_extension(file_name):
        file_name_split = file_name.split(".")
        #No extension: the file name itself is returned
        extension = file_name
        if len(file_name_split) > 1:
            extension = file_name_split[-1]
        return extension

--------------------------------------------------------------------------------
/media/batch_accel_overview_new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/aoai-batch-api-accelerator/61315acbc5360f0ec0d6bbfa04693d9921ea9216/media/batch_accel_overview_new.png
--------------------------------------------------------------------------------
/media/overview.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/aoai-batch-api-accelerator/61315acbc5360f0ec0d6bbfa04693d9921ea9216/media/overview.pdf
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
azure-storage-file-datalake
openai
tiktoken
requests
token-count
aiohttp
--------------------------------------------------------------------------------
/templates/AOAI_config_template.json:
--------------------------------------------------------------------------------
{
    "aoai_key": "",
    "aoai_api_version": "2024-07-01-preview",
    "aoai_endpoint": "",
    "aoai_deployment_name": "",
    "batch_job_endpoint": "/chat/completions",
    "completion_window": "24h"
}
--------------------------------------------------------------------------------
/templates/app_config.json:
--------------------------------------------------------------------------------
{
    "AOAI_config": "",
    "storage_config": "",
    "local_download_path": "",
    "batch_size": 10,
    "download_to_local": false,
    "continuous_mode": true,
    "count_tokens": false
}
--------------------------------------------------------------------------------
/templates/batch_template.json:
--------------------------------------------------------------------------------
{
    "custom_id": "",
    "method": "POST",
    "url": "/chat/completions",
    "body": {
        "model": "",
        "messages": [
            {
                "role": "system",
                "content": ""
            },
            {
                "role": "user",
                "content": ""
            }
        ]
    }
}
--------------------------------------------------------------------------------
/templates/storage_config.json:
--------------------------------------------------------------------------------
{
    "storage_account_name": "",
    "storage_account_key": "",
    "input_filesystem_system_name": "",
    "processed_filesystem_system_name": "",
    "error_filesystem_system_name": "",
    "input_directory": "/",
    "output_directory": "/",
    "error_directory": "/"
}
--------------------------------------------------------------------------------
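The templates ship with empty string values, and `RunBatch.py` only reports a generic initialization error when a value is missing. The following is a minimal sketch, not part of the repository, of how the filled-in files could be checked before starting a run. The helpers `missing_keys` and `load_config` and the two key tuples are hypothetical names introduced here; the key lists are taken from the keys that `RunBatch.py` reads from `app_config.json` and `storage_config.json`.

```python
import json

# Keys RunBatch.py reads unconditionally from app_config.json.
APP_KEYS = (
    "AOAI_config", "storage_config", "batch_size",
    "download_to_local", "continuous_mode", "count_tokens",
)
# Keys RunBatch.py reads from storage_config.json.
STORAGE_KEYS = (
    "storage_account_name", "storage_account_key",
    "input_filesystem_system_name", "processed_filesystem_system_name",
    "error_filesystem_system_name",
    "input_directory", "output_directory", "error_directory",
)


def missing_keys(config, required):
    """Return the required keys that are absent or still an empty string."""
    return [key for key in required if config.get(key, "") == ""]


def load_config(path, required):
    """Load a JSON config file and fail early if a required key is unset."""
    with open(path, encoding="utf-8") as config_file:
        config = json.load(config_file)
    missing = missing_keys(config, required)
    if missing:
        raise ValueError(f"{path}: missing or empty keys: {', '.join(missing)}")
    return config
```

A call such as `load_config(path_to_app_config, APP_KEYS)` would raise a `ValueError` naming the unset keys for an untouched template. `local_download_path` is left out of `APP_KEYS` on purpose, since `RunBatch.py` reads it only when `download_to_local` is true.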