├── README.md
└── code

/README.md:
--------------------------------------------------------------------------------
# Automated ETL Pipeline for Weather Data

## Overview
This project is an **Automated ETL (Extract, Transform, Load) Pipeline** that collects, processes, and stores weather data from various sources, making the results readily available for analytics, reporting, and visualization.

## Features
- **Automated Data Extraction**: Retrieves weather data from APIs, CSV files, or databases.
- **Data Transformation**: Cleans, formats, and enriches the data to ensure consistency.
- **Efficient Data Loading**: Stores the processed data in a database or a cloud storage solution.
- **Scheduled Execution**: Uses cron jobs or workflow orchestration tools for automation.
- **Logging & Monitoring**: Tracks data processing stages and errors.

## Technologies Used
- **Python**: Main scripting language for ETL tasks.
- **Pandas**: Data manipulation and transformation.
- **SQL / PostgreSQL**: Database storage.
- **Apache Airflow / Prefect**: Workflow orchestration.
- **APIs / Web Scraping**: Data extraction from online sources.
- **AWS S3 / Google Cloud Storage** (optional): Cloud-based data storage.

## Installation
### Prerequisites
- Python 3.x
- PostgreSQL or another database service (optional)
- Required Python packages (listed in `requirements.txt`)

### Setup
1. Clone the repository:
   ```sh
   git clone https://github.com/yourusername/Automated-ETL-Pipeline-for-Weather-Data.git
   cd Automated-ETL-Pipeline-for-Weather-Data
   ```
2. Install dependencies:
   ```sh
   pip install -r requirements.txt
   ```
3. Configure the `.env` file with API keys, database credentials, and other settings.
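A minimal `.env` sketch is shown below. The variable names are illustrative assumptions (the repository does not specify them), so adapt them to whatever config loader you use:

```env
# Example .env — variable names are illustrative, not required by the code
OPENWEATHER_API_KEY=your_api_key_here
DB_HOST=localhost
DB_PORT=5432
DB_NAME=weather
DB_USER=etl_user
DB_PASSWORD=change_me
```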
## Usage
Run the ETL script manually:
```sh
python main.py
```
Or schedule the pipeline with Airflow, Prefect, or cron jobs.

## Project Structure
```
Automated-ETL-Pipeline-for-Weather-Data/
│-- src/
│   │-- extract.py        # Handles data extraction
│   │-- transform.py      # Cleans and processes data
│   │-- load.py           # Stores processed data
│   │-- config.py         # Configuration settings
│-- main.py               # Main execution script
│-- requirements.txt      # Dependencies
│-- README.md             # Project documentation
```

## Contributing
Feel free to contribute by opening an issue or submitting a pull request.

## License
This project is licensed under the MIT License.

## Contact
For any inquiries, reach out via desmondeteh@gmail.com.

--------------------------------------------------------------------------------
/code:
--------------------------------------------------------------------------------
import requests
import sqlite3
import pandas as pd
import datetime
import time
import logging
import random

# Set up logging
logging.basicConfig(
    filename="etl_pipeline.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

# OpenWeatherMap API configuration
API_KEY = "your_api_key_here"  # Replace with your API key
CITY = "New York"
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"

# SQLite database configuration
DB_NAME = "weather_data.db"
TABLE_NAME = "weather"

# Synthetic weather conditions used as a fallback when the API is unreachable
WEATHER_CONDITIONS = ["Clear", "Cloudy", "Rainy", "Stormy", "Snowy", "Foggy", "Windy"]


def extract_weather_data(city):
    """Extract weather data from the OpenWeatherMap API."""
    try:
        params = {"q": city, "appid": API_KEY, "units": "metric"}
        # A timeout prevents the pipeline from hanging on a stalled connection
        response = requests.get(BASE_URL, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()
        logging.info(f"Successfully extracted weather data for {city}.")
        return data
    except requests.exceptions.RequestException as e:
        logging.error(f"API request failed. Generating synthetic data. Error: {e}")
        return generate_synthetic_data(city)


def generate_synthetic_data(city):
    """Generate synthetic weather data when the API call fails."""
    logging.info(f"Generating synthetic weather data for {city}.")
    return {
        "name": city,
        "main": {
            "temp": round(random.uniform(-10, 40), 2),  # Temperature (-10 to 40 °C)
            "humidity": random.randint(10, 100),        # Humidity (10-100%)
            "pressure": random.randint(980, 1050),      # Pressure (980-1050 hPa)
        },
        "weather": [{"description": random.choice(WEATHER_CONDITIONS)}],
        "wind": {"speed": round(random.uniform(0, 20), 2)},  # Wind speed (0-20 m/s)
    }


def transform_weather_data(data):
    """Transform the extracted weather data into a structured format."""
    try:
        if not data:
            return None

        transformed_data = {
            "city": data["name"],
            "temperature": data["main"]["temp"],
            "humidity": data["main"]["humidity"],
            "pressure": data["main"]["pressure"],
            "weather": data["weather"][0]["description"],
            "wind_speed": data["wind"]["speed"],
            "date_time": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        }

        df = pd.DataFrame([transformed_data])
        logging.info("Successfully transformed weather data.")
        return df
    except Exception as e:
        logging.error(f"Error transforming weather data: {e}")
        return None


def load_data_to_db(df, db_name, table_name):
    """Load the transformed data into an SQLite database."""
    try:
        conn = sqlite3.connect(db_name)
        df.to_sql(table_name, conn, if_exists="append", index=False)
        conn.close()
        logging.info("Successfully loaded data into database.")
    except Exception as e:
        logging.error(f"Error loading data into database: {e}")


def run_etl():
    """Run the ETL pipeline once: extract, transform, load."""
    logging.info("ETL pipeline started.")
    data = extract_weather_data(CITY)
    transformed_data = transform_weather_data(data)
    if transformed_data is not None:
        load_data_to_db(transformed_data, DB_NAME, TABLE_NAME)
    logging.info("ETL pipeline completed.")


if __name__ == "__main__":
    while True:
        run_etl()
        time.sleep(3600)  # Run once per hour
--------------------------------------------------------------------------------
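Once the pipeline has run, the accumulated rows can be inspected directly with the standard-library `sqlite3` module. A minimal sketch follows; the helper name `fetch_latest_readings` is hypothetical, but the table and column names match those the script's `load_data_to_db` step creates:

```python
import sqlite3


def fetch_latest_readings(db_name, table_name, limit=5):
    """Return the most recent weather rows as a list of dicts,
    newest first (ordered by the date_time column)."""
    conn = sqlite3.connect(db_name)
    conn.row_factory = sqlite3.Row  # access columns by name
    try:
        rows = conn.execute(
            f"SELECT * FROM {table_name} ORDER BY date_time DESC LIMIT ?",
            (limit,),
        ).fetchall()
        return [dict(r) for r in rows]
    finally:
        conn.close()
```

For example, `fetch_latest_readings("weather_data.db", "weather", limit=3)` would return the three most recent hourly readings appended by the pipeline.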