├── IOT_Phase 4.pdf
├── Big Data Analysis Website README (1) (3).pdf
├── IBM CLOUD APPLICATION DEVELOPMENT PROJECT PHASE 2.docx
├── phase-3 ibm.py
├── phase 3 Document_development.md
├── README.md
└── CAD_Phase 5.ipynb
/IOT_Phase 4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lakshmanroy/Big_Data_Analysis_for_IBM/HEAD/IOT_Phase 4.pdf
--------------------------------------------------------------------------------
/Big Data Analysis Website README (1) (3).pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lakshmanroy/Big_Data_Analysis_for_IBM/HEAD/Big Data Analysis Website README (1) (3).pdf
--------------------------------------------------------------------------------
/IBM CLOUD APPLICATION DEVELOPMENT PROJECT PHASE 2.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lakshmanroy/Big_Data_Analysis_for_IBM/HEAD/IBM CLOUD APPLICATION DEVELOPMENT PROJECT PHASE 2.docx
--------------------------------------------------------------------------------
/phase-3 ibm.py:
--------------------------------------------------------------------------------
# Python Code

# Data Selection

import pandas as pd

# Load the dataset into a DataFrame
# (the filename below is a placeholder; point it at your rainfall CSV)
rainfall_data = pd.read_csv('rainfall_data.csv')

# Database Setup
import ibm_db

# Open a connection to the IBM Db2 database
# (the port is given once via PORT rather than appended to the hostname)
db2_conn = ibm_db.connect("DATABASE=BSJ92334;HOSTNAME=3883e7e4-18f5-4afe-be8c-fa31c41761d2.bs2io90l08kqb1od8lcg.databases.appdomain.cloud;PORT=31498;PROTOCOL=TCPIP;UID=bsj92334;PWD=9xAOjpxeWtsLcMUo;", "", "")

# Data Exploration
# For example, checking the first few rows of data
print(rainfall_data.head())

# Analysis Techniques
# For example, calculating the mean and standard deviation of rainfall
mean_rainfall = rainfall_data['Rainfall'].mean()
std_dev_rainfall = rainfall_data['Rainfall'].std()
print(f"Mean Rainfall: {mean_rainfall}")
print(f"Standard Deviation of Rainfall: {std_dev_rainfall}")

# SQL Command for Creating a Database Table

create_table_sql = '''
CREATE TABLE your_table_name (
    Year INT,
    Month INT,
    State VARCHAR(255),
    District VARCHAR(255),
    Rainfall FLOAT
)
'''

# Execute the SQL command to create the table
stmt = ibm_db.exec_immediate(db2_conn, create_table_sql)
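
# Illustrative sketch (not part of the original script): inserting the loaded
# rows into the table created above. The table and column names are the same
# placeholders used in the CREATE TABLE statement; adjust them to your schema.
insert_sql = (
    "INSERT INTO your_table_name (Year, Month, State, District, Rainfall) "
    "VALUES (?, ?, ?, ?, ?)"
)
insert_stmt = ibm_db.prepare(db2_conn, insert_sql)
for row in rainfall_data[['Year', 'Month', 'State', 'District', 'Rainfall']].itertuples(index=False):
    ibm_db.execute(insert_stmt, tuple(row))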
--------------------------------------------------------------------------------
/phase 3 Document_development.md:
--------------------------------------------------------------------------------
Big Data Analysis

This project was created for an IBM course. Deploying a big data analysis solution using IBM Cloud Databases and performing data analysis typically involves several steps. This document outlines a high-level process to help you get started. Please note that the specific steps and tools may vary based on your project requirements and the technology stack you are using.

Here's a general overview of the process:

Define Requirements:

Identify the specific data analysis needs and objectives of your project.
Determine the types of data sources and data formats you'll be working with.
Decide on the tools and technologies you'll use for data analysis.

Set Up IBM Cloud Databases:

Sign in to your IBM Cloud account, or create one if you don't have an account.
Access the IBM Cloud Databases service and create the appropriate database instances for your data storage needs. You may choose between different database options like Db2, PostgreSQL, MongoDB, etc., depending on your data requirements.
Configure security settings, such as firewall rules and authentication methods, to protect your databases.

Data Ingestion:

Load your data into the IBM Cloud Databases. This might involve batch uploads or real-time streaming, depending on your use case.
Ensure that the data is properly structured and cleaned, if necessary, for analysis.

Data Analysis Tools:

Choose the appropriate data analysis tools and libraries. IBM offers several cloud-based data analysis services, such as IBM Watson Studio, which provides data science and machine learning capabilities.
You can also use other popular data analysis tools like Python (with libraries like Pandas, NumPy, and Matplotlib), R, or SQL for querying the data.

Data Analysis:

Write and run queries, scripts, or code to analyze your data. You can perform various data analysis tasks like exploratory data analysis (EDA), statistical analysis, machine learning, etc., depending on your project goals.
Use the chosen tools to visualize the results and gain insights from your data.
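For example, a minimal exploratory pass in Python with Pandas might look like the sketch below (the file and column names are placeholders, not part of this project):

```python
import pandas as pd

# load a dataset exported from (or staged for) your cloud database
df = pd.read_csv('your_dataset.csv')

df.info()             # column types and null counts
print(df.head())      # first few rows
print(df.describe())  # summary statistics for numeric columns

# a simple statistical check on a single numeric column
print(df['your_numeric_column'].mean(), df['your_numeric_column'].std())
```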
Data Visualization:

Create visualizations and reports to communicate your findings effectively. Tools like IBM Cognos Analytics or open-source alternatives like Matplotlib, Seaborn, or Tableau can be helpful for this purpose.

Scaling and Optimization:

As your data and analysis requirements grow, consider scaling your IBM Cloud Databases resources to meet the increased demand.
Optimize your queries and analysis processes for better performance and cost-efficiency.

Data Security and Compliance:

Ensure that your data analysis solution complies with data security and privacy regulations (e.g., GDPR, HIPAA).
Implement encryption, access controls, and auditing to protect sensitive data.

Monitoring and Maintenance:

Set up monitoring and alerts to track the health and performance of your databases and data analysis workflows.
Regularly maintain and update your data analysis solution to address issues and incorporate improvements.

Documentation and Collaboration:

Document your data analysis processes, findings, and code for future reference.
Collaborate with team members and stakeholders to share insights and make data-driven decisions together.

Deployment and Automation:

Consider automating data analysis pipelines and deployment using tools like IBM Cloud Functions, Apache Airflow, or similar technologies for scalability and efficiency (a minimal Airflow sketch appears at the end of this document).

Scaling as Needed:

As your data analysis needs evolve, be prepared to scale your infrastructure and tools accordingly.

Remember that the specific steps and tools may vary depending on your project's complexity and requirements. IBM Cloud provides a range of services and tools to support big data analysis, so you can tailor your solution to fit your unique needs.
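To illustrate the Deployment and Automation step above, here is a minimal sketch of a daily pipeline, assuming Apache Airflow 2.x is available; the DAG name, task name, and `run_analysis` function are placeholders, not part of this project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_analysis():
    # query the database and write results or reports here
    pass

with DAG(
    dag_id="big_data_analysis_pipeline",  # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    analyze = PythonOperator(task_id="run_analysis", python_callable=run_analysis)
```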
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

Big Data Analysis Website README

Welcome to the Big Data Analysis website! This README file provides detailed instructions on how to navigate the website, update its content, and manage its dependencies.

Table of Contents
1. Website Navigation
2. Updating Content
3. Dependencies

1. Website Navigation
The Big Data Analysis website is designed to provide information and tools related to big data analysis. Here are the main sections and how to navigate them:
Home
• The home page provides an overview of the website and its purpose.
• It may contain announcements, news, or featured content.
• To navigate to the home page, click the "Home" link in the navigation menu.
About
• The About page provides information about the website's mission, goals, and the team behind it.
• To navigate to the About page, click the "About" link in the navigation menu.
Blog
• The Blog section contains articles and posts related to big data analysis.
• You can read and search for articles of interest.
• To access the Blog section, click the "Blog" link in the navigation menu.
Tutorials
• The Tutorials section provides step-by-step guides and tutorials on big data analysis techniques and tools.
• To explore tutorials, click the "Tutorials" link in the navigation menu.
Resources
• The Resources page offers links to external resources, books, courses, and tools for big data analysis.
• Navigate to the Resources page by clicking the "Resources" link in the navigation menu.
Contact
• If you have questions, suggestions, or need to contact the website administrators, visit the Contact page.
• Click the "Contact" link in the navigation menu to access this page.

2. Updating Content
As a website administrator or contributor, you may need to update or add new content to the website. Here are some instructions for doing so:
Blog Posts
• To add a new blog post, log in to the content management system (CMS) using your credentials.
• Navigate to the Blog section in the CMS and click "Add New Post."
• Fill in the title, content, and any metadata for the post.
• You can include text, images, and links within your blog post.
• Once the post is ready, click "Publish" or "Save as Draft" if you want to review it before publishing.
Tutorials
• To create a new tutorial, log in to the CMS.
• Visit the Tutorials section and click "Create New Tutorial."
• Add a title, description, and content for the tutorial.
• Include step-by-step instructions, code samples, and any relevant images.
• After creating the tutorial, you can either publish it immediately or save it as a draft for later review.
Resources
• To update the Resources page, log in to the CMS.
• Find the Resources section and edit the existing links or add new ones.
• Ensure that the links are up-to-date and relevant to big data analysis.

3. Dependencies
The Big Data Analysis website may have dependencies that need to be maintained. These dependencies include:
• Web Hosting: The website is hosted on a web server. Ensure the hosting subscription is up-to-date and server configurations are in order.
• Content Management System (CMS): The CMS is the platform used for content creation and management. Ensure the CMS is regularly updated for security and functionality.
• Plugins and Themes: If the website uses plugins or themes for additional functionality or design, keep them updated to the latest versions.
• Databases: If the website stores data, ensure that the database is regularly backed up and optimized for performance.
• Analytics and Tracking Tools: If you use analytics or tracking tools (e.g., Google Analytics), keep the tracking codes up-to-date and regularly review the analytics data.
• Server and Security Updates: Regularly update the server's operating system and web server software, and apply security patches to protect against vulnerabilities.
• Domain Name and SSL Certificates: Ensure the domain name registration is renewed and SSL certificates are valid for secure browsing.
--------------------------------------------------------------------------------
/CAD_Phase 5.ipynb:
--------------------------------------------------------------------------------
{"cells":[{"cell_type":"markdown","metadata":{},"source":["## 1. Imports and Initialization"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:17:57.304568Z","iopub.status.busy":"2023-05-21T13:17:57.303858Z","iopub.status.idle":"2023-05-21T13:17:58.482543Z","shell.execute_reply":"2023-05-21T13:17:58.480753Z","shell.execute_reply.started":"2023-05-21T13:17:57.304453Z"},"trusted":true},"outputs":[],"source":["ls /kaggle/input/airline-delay-and-cancellation-data-2009-2018"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:17:59.994353Z","iopub.status.busy":"2023-05-21T13:17:59.993214Z","iopub.status.idle":"2023-05-21T13:19:00.199941Z","shell.execute_reply":"2023-05-21T13:19:00.198219Z","shell.execute_reply.started":"2023-05-21T13:17:59.994277Z"},"trusted":true},"outputs":[],"source":["%pip install pyspark\n","\n","import altair as alt\n","\n","from pyspark import SparkContext, SparkConf\n","from pyspark.sql import SparkSession\n","\n","import pyspark.sql.functions as F\n","import pyspark.sql.types as T \n","from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler\n","from pyspark.ml import Pipeline\n","from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier\n","from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator\n","\n","import warnings\n","warnings.filterwarnings('ignore')"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:19:00.204241Z","iopub.status.busy":"2023-05-21T13:19:00.203792Z","iopub.status.idle":"2023-05-21T13:19:06.699918Z","shell.execute_reply":"2023-05-21T13:19:06.698467Z","shell.execute_reply.started":"2023-05-21T13:19:00.204189Z"},"trusted":true},"outputs":[],"source":["# initialize sparkSession\n","spark = SparkSession.builder.config(\"spark.executor.memory\",\"2g\").getOrCreate()\n","spark.sparkContext.setLogLevel(\"ERROR\")"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:29:13.634511Z","iopub.status.busy":"2023-05-21T13:29:13.634034Z","iopub.status.idle":"2023-05-21T13:29:13.646502Z","shell.execute_reply":"2023-05-21T13:29:13.644405Z","shell.execute_reply.started":"2023-05-21T13:29:13.634469Z"},"trusted":true},"outputs":[],"source":["spark.catalog.clearCache()"]},
{"cell_type":"markdown","metadata":{},"source":["## 2. Loading and Cleaning the Data"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:19:09.222128Z","iopub.status.busy":"2023-05-21T13:19:09.221704Z","iopub.status.idle":"2023-05-21T13:19:09.227710Z","shell.execute_reply":"2023-05-21T13:19:09.226732Z","shell.execute_reply.started":"2023-05-21T13:19:09.222089Z"},"trusted":true},"outputs":[],"source":["file_names_range = list(range(2009, 2016))\n","file_paths = [f'/kaggle/input/airline-delay-and-cancellation-data-2009-2018/{file}.csv' for file in file_names_range]"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:19:12.393823Z","iopub.status.busy":"2023-05-21T13:19:12.393353Z","iopub.status.idle":"2023-05-21T13:19:12.417595Z","shell.execute_reply":"2023-05-21T13:19:12.415914Z","shell.execute_reply.started":"2023-05-21T13:19:12.393781Z"},"trusted":true},"outputs":[],"source":["schema = T.StructType([\n"," T.StructField(\"FL_DATE\", T.TimestampType(), nullable=True),\n"," T.StructField(\"OP_CARRIER\", T.StringType(), nullable=True),\n"," T.StructField(\"OP_CARRIER_FL_NUM\", T.IntegerType(), nullable=True),\n"," T.StructField(\"ORIGIN\", T.StringType(), nullable=True),\n"," T.StructField(\"DEST\", T.StringType(), nullable=True),\n"," T.StructField(\"CRS_DEP_TIME\", T.DoubleType(), nullable=True),\n"," T.StructField(\"DEP_TIME\", T.DoubleType(), nullable=True),\n"," T.StructField(\"DEP_DELAY\", T.DoubleType(), nullable=True),\n"," T.StructField(\"TAXI_OUT\", T.DoubleType(), nullable=True),\n"," T.StructField(\"WHEELS_OFF\", T.DoubleType(), nullable=True),\n"," T.StructField(\"WHEELS_ON\", T.DoubleType(), nullable=True),\n"," T.StructField(\"TAXI_IN\", T.DoubleType(), nullable=True),\n"," T.StructField(\"CRS_ARR_TIME\", T.DoubleType(), nullable=True),\n"," T.StructField(\"ARR_TIME\", T.DoubleType(), nullable=True),\n"," T.StructField(\"ARR_DELAY\", T.DoubleType(), nullable=True),\n"," T.StructField(\"CANCELLED\", T.DoubleType(), nullable=True),\n"," T.StructField(\"CANCELLATION_CODE\", T.StringType(), nullable=True),\n"," T.StructField(\"DIVERTED\", T.DoubleType(), nullable=True),\n"," T.StructField(\"CRS_ELAPSED_TIME\", T.DoubleType(), nullable=True),\n"," T.StructField(\"ACTUAL_ELAPSED_TIME\", T.DoubleType(), nullable=True),\n"," T.StructField(\"AIR_TIME\", T.DoubleType(), nullable=True),\n"," T.StructField(\"DISTANCE\", T.DoubleType(), nullable=True),\n"," T.StructField(\"CARRIER_DELAY\", T.DoubleType(), nullable=True),\n"," T.StructField(\"WEATHER_DELAY\", T.DoubleType(), nullable=True),\n"," T.StructField(\"NAS_DELAY\", T.DoubleType(), nullable=True),\n"," T.StructField(\"SECURITY_DELAY\", T.DoubleType(), nullable=True),\n"," T.StructField(\"LATE_AIRCRAFT_DELAY\", T.DoubleType(), nullable=True),\n"," T.StructField(\"Unnamed: 27\", T.StringType(), nullable=True)\n","])"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:19:18.062917Z","iopub.status.busy":"2023-05-21T13:19:18.062443Z","iopub.status.idle":"2023-05-21T13:19:22.060190Z","shell.execute_reply":"2023-05-21T13:19:22.058722Z","shell.execute_reply.started":"2023-05-21T13:19:18.062874Z"},"trusted":true},"outputs":[],"source":["df = spark.read.schema(schema).format(\"csv\").option(\"header\", \"true\").load(file_paths)\n","\n","# Optional: load data from S3 into a Spark DataFrame instead\n","# (bucket, column, and aggregation names below are placeholders; note that\n","# .agg() expects a dict mapping a column to an aggregation function)\n","# data = spark.read.csv('s3a://your-bucket-name/data/file.csv', header=True, inferSchema=True)\n","# result = data.groupBy('column_name').agg({'column_name': 'aggregation_function'})\n","# result.show()\n"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:19:22.063151Z","iopub.status.busy":"2023-05-21T13:19:22.062732Z","iopub.status.idle":"2023-05-21T13:19:22.162767Z","shell.execute_reply":"2023-05-21T13:19:22.161620Z","shell.execute_reply.started":"2023-05-21T13:19:22.063111Z"},"trusted":true},"outputs":[],"source":["# remove null values from the cols used for classification:\n","df = df.dropna(subset=[\n"," 'FL_DATE',\n"," 'OP_CARRIER',\n"," 'OP_CARRIER_FL_NUM',\n"," 'ORIGIN',\n"," 'DEST',\n"," 'CRS_DEP_TIME',\n"," 'CRS_ARR_TIME',\n"," 'CANCELLED',\n"," 'DIVERTED',\n"," 'CRS_ELAPSED_TIME',\n"," 'DISTANCE'])\n","\n","# save df for analysis\n","analysis_df = df"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:29:20.543881Z","iopub.status.busy":"2023-05-21T13:29:20.543377Z","iopub.status.idle":"2023-05-21T13:29:20.564966Z","shell.execute_reply":"2023-05-21T13:29:20.563182Z","shell.execute_reply.started":"2023-05-21T13:29:20.543834Z"},"trusted":true},"outputs":[],"source":["# drop the columns that indirectly indicate whether a flight is cancelled (apart from the column CANCELLED)\n","# most of those cols contain null values, if the flight is cancelled\n","\n","classify_df = df.drop(\"Unnamed: 27\", \n"," \"CARRIER_DELAY\", \n"," \"WEATHER_DELAY\",\n"," \"NAS_DELAY\",\n"," \"SECURITY_DELAY\",\n"," \"LATE_AIRCRAFT_DELAY\",\n"," \"CANCELLATION_CODE\",\n"," \"DEP_TIME\",\n"," \"DEP_DELAY\",\n"," \"TAXI_OUT\",\n"," \"WHEELS_OFF\",\n"," \"WHEELS_ON\",\n"," \"TAXI_IN\",\n"," \"ARR_TIME\",\n"," \"ARR_DELAY\",\n"," \"ACTUAL_ELAPSED_TIME\", \n"," \"AIR_TIME\")"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:29:22.682527Z","iopub.status.busy":"2023-05-21T13:29:22.681254Z","iopub.status.idle":"2023-05-21T13:29:22.701867Z","shell.execute_reply":"2023-05-21T13:29:22.700694Z","shell.execute_reply.started":"2023-05-21T13:29:22.682444Z"},"trusted":true},"outputs":[],"source":["# numerical timestamp column\n","classify_df = classify_df.withColumn(\"FL_DATE\", F.unix_timestamp(\"FL_DATE\"))"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:29:23.095906Z","iopub.status.busy":"2023-05-21T13:29:23.095409Z","iopub.status.idle":"2023-05-21T13:29:23.108893Z","shell.execute_reply":"2023-05-21T13:29:23.106187Z","shell.execute_reply.started":"2023-05-21T13:29:23.095857Z"},"trusted":true},"outputs":[],"source":["classify_df.columns"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:29:51.395240Z","iopub.status.busy":"2023-05-21T13:29:51.394736Z","iopub.status.idle":"2023-05-21T13:30:51.755917Z","shell.execute_reply":"2023-05-21T13:30:51.753179Z","shell.execute_reply.started":"2023-05-21T13:29:51.395197Z"},"trusted":true},"outputs":[],"source":["# Take a subset: either balanced (with subsampling) or unbalanced\n","# we take a subset because of memory limitations\n","\n","# select subsample of positive samples\n","pos_df = classify_df.filter(F.col('CANCELLED').isin(1)).sample(fraction=0.1)\n","# select an equal amount of negative samples (number of neg samples == number of pos samples)\n","neg_df = classify_df.filter(F.col('CANCELLED').isin(0)).orderBy(F.rand()).limit(pos_df.count())\n","\n","\n","# balanced df - a subset - around 141k\n","classify_df = pos_df.union(neg_df).sample(fraction=1.0).cache()\n","\n","# unbalanced df - but a subset - around 215k\n","#classify_df = classify_df.sample(fraction=0.005).cache() "]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["#classify_df.rdd.countApprox(timeout = 1000,confidence = 0.90)"]},
{"cell_type":"markdown","metadata":{},"source":["## 3. Analysis (on the analysis_df)"]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# get the most frequent flight carriers\n","carriers_flight_count_df = analysis_df.groupBy(F.col('OP_CARRIER')).count().orderBy(F.col('count').desc())\n","top_10 = carriers_flight_count_df.limit(10).toPandas()\n","top_10 = top_10.rename(columns={'OP_CARRIER':'Carrier'})\n","top_10"]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# visualisation\n","chart = alt.Chart(top_10).mark_arc(outerRadius=260, innerRadius=75).encode(\n"," theta = alt.Theta(field=\"count\", type=\"quantitative\", stack=True),\n"," color = alt.Color('Carrier:N', scale=alt.Scale(scheme='category20'), legend=None),\n",").properties(\n"," title='Top 10 Carriers by number of flights',\n"," width=600,\n"," height=300\n",")\n","\n","pie = chart.mark_arc(outerRadius=350)\n","value_text = pie.mark_text(radius=300, size=15).encode(text=alt.Text('count:Q'))\n","\n","pie2 = chart.mark_arc(outerRadius=250)\n","text = pie2.mark_text(radius=200, size=15).encode(\n"," text=alt.Text('Carrier:N'), \n"," color=alt.value(\"#000000\")\n",")\n","\n","(chart + text + value_text).configure_view(\n"," strokeWidth=0\n",").configure_title(\n"," fontSize=18\n",")"]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# count number of cancellations per code/reason\n","carriers_flight_count_df = analysis_df.filter(F.col('CANCELLATION_CODE').isNotNull()).groupBy(F.col('CANCELLATION_CODE')).count()\n","cancellation_reasons = carriers_flight_count_df.toPandas()\n","cancellation_reasons"]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# rename col values (map cancellation codes to readable reasons)\n","# use .replace instead of chained indexing, which triggers pandas' SettingWithCopyWarning\n","cancellation_reasons['CANCELLATION_CODE'] = cancellation_reasons['CANCELLATION_CODE'].replace({\n","    'A': 'By carrier',\n","    'B': 'Due to weather',\n","    'C': 'By national air system',\n","    'D': 'For security'})\n","cancellation_reasons = cancellation_reasons.rename(columns={'CANCELLATION_CODE':'Reason'})"]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["cancellation_reasons"]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# visualisation of cancellation reasons\n","chart = alt.Chart(cancellation_reasons).mark_arc(outerRadius=180, innerRadius=50).encode(\n"," theta = alt.Theta(field=\"count\", type=\"quantitative\", stack=True),\n"," color = alt.Color('Reason:N', scale=alt.Scale(scheme='category20'), legend=None),\n",").properties(\n"," title='Reasons for flight cancellations',\n"," width=600,\n"," height=300\n",")\n","\n","pie = chart.mark_arc(outerRadius=250)\n","value_text = pie.mark_text(radius=220, size=15).encode(text=alt.Text('count:Q'))\n","\n","pie2 = chart.mark_arc(outerRadius=150)\n","text = pie2.mark_text(radius=120, size=12).encode(\n"," text=alt.Text('Reason:N'), \n"," color=alt.value(\"#000000\")\n",")\n","\n","(chart + text + value_text).configure_view(\n"," strokeWidth=0\n",").configure_title(\n"," fontSize=18\n",")"]},
{"cell_type":"markdown","metadata":{},"source":["## 4. Preprocessing"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:30:58.144366Z","iopub.status.busy":"2023-05-21T13:30:58.143674Z","iopub.status.idle":"2023-05-21T13:30:58.173211Z","shell.execute_reply":"2023-05-21T13:30:58.171098Z","shell.execute_reply.started":"2023-05-21T13:30:58.144308Z"},"trusted":true},"outputs":[],"source":["# define StringIndexer: categorical (string) cols -> to column indices, \n","# each category gets an integer based on its frequency (starting from 0)\n","\n","carrier_indexer = StringIndexer(inputCol=\"OP_CARRIER\", outputCol=\"OP_CARRIER_Index\")\n","origin_indexer = StringIndexer(inputCol=\"ORIGIN\", outputCol=\"ORIGIN_Index\")\n","dest_indexer = StringIndexer(inputCol=\"DEST\", outputCol=\"DEST_Index\")\n"]},
outputCol=\"DEST_vec\")\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:30:59.018142Z","iopub.status.busy":"2023-05-21T13:30:59.016495Z","iopub.status.idle":"2023-05-21T13:34:32.559853Z","shell.execute_reply":"2023-05-21T13:34:32.557672Z","shell.execute_reply.started":"2023-05-21T13:30:59.018074Z"},"trusted":true},"outputs":[],"source":["# Pipelining the preprocessing stages defined above \n","pipeline = Pipeline(stages=[carrier_indexer, origin_indexer, dest_indexer,\n"," onehotencoder_carrier_vector, onehotencoder_origin_vector,\n"," onehotencoder_dest_vector])\n","\n","transformed_df = pipeline.fit(classify_df).transform(classify_df)"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.570096Z","iopub.status.busy":"2023-05-21T13:34:32.568631Z","iopub.status.idle":"2023-05-21T13:34:32.683159Z","shell.execute_reply":"2023-05-21T13:34:32.681293Z","shell.execute_reply.started":"2023-05-21T13:34:32.570012Z"},"trusted":true},"outputs":[],"source":["# select columns that are combined to one feature column\n","feature_columns = transformed_df.columns\n","\n","# remove cols that whould not be in our feature cols (label col, intermediate preprocessing cols)\n","for item in [\"CANCELLED\", \"ORIGIN\", \"DEST\", \"OP_CARRIER\", \"OP_CARRIER_Index\", \"ORIGIN_Index\", \"DEST_Index\"]:\n"," feature_columns.remove(item)\n","\n","\n","assembler = VectorAssembler(inputCols=feature_columns, outputCol=\"features\")\n","\n","# build feature col\n","assembled_df = assembler.transform(transformed_df)"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.685401Z","iopub.status.busy":"2023-05-21T13:34:32.684887Z","iopub.status.idle":"2023-05-21T13:34:32.715248Z","shell.execute_reply":"2023-05-21T13:34:32.713760Z","shell.execute_reply.started":"2023-05-21T13:34:32.685347Z"},"trusted":true},"outputs":[],"source":["# select only feature and label column\n","final_classify_df = assembled_df.select(\"features\", F.col(\"CANCELLED\").alias(\"label\"))"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.721204Z","iopub.status.busy":"2023-05-21T13:34:32.719553Z","iopub.status.idle":"2023-05-21T13:34:32.730695Z","shell.execute_reply":"2023-05-21T13:34:32.729142Z","shell.execute_reply.started":"2023-05-21T13:34:32.721040Z"},"trusted":true},"outputs":[],"source":["final_classify_df.printSchema()"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.735080Z","iopub.status.busy":"2023-05-21T13:34:32.733260Z","iopub.status.idle":"2023-05-21T13:34:32.773936Z","shell.execute_reply":"2023-05-21T13:34:32.772028Z","shell.execute_reply.started":"2023-05-21T13:34:32.734999Z"},"trusted":true},"outputs":[],"source":["train, test = final_classify_df.randomSplit([.7, .3], seed=9) # 70, 30 split on balanced set or on subset of samples"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.777761Z","iopub.status.busy":"2023-05-21T13:34:32.777332Z","iopub.status.idle":"2023-05-21T13:34:32.902474Z","shell.execute_reply":"2023-05-21T13:34:32.900936Z","shell.execute_reply.started":"2023-05-21T13:34:32.777722Z"},"trusted":true},"outputs":[],"source":["#spark.catalog.clearCache()\n","# caching data into memory - models run quicker\n","train = train.repartition(32).cache()\n","test = 
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.777761Z","iopub.status.busy":"2023-05-21T13:34:32.777332Z","iopub.status.idle":"2023-05-21T13:34:32.902474Z","shell.execute_reply":"2023-05-21T13:34:32.900936Z","shell.execute_reply.started":"2023-05-21T13:34:32.777722Z"},"trusted":true},"outputs":[],"source":["#spark.catalog.clearCache()\n","# caching data into memory - models run quicker\n","train = train.repartition(32).cache()\n","test = test.repartition(32).cache()"]},
{"cell_type":"markdown","metadata":{},"source":["## 5. Training Models (on balanced and unbalanced data)"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.904815Z","iopub.status.busy":"2023-05-21T13:34:32.904111Z","iopub.status.idle":"2023-05-21T13:34:32.939332Z","shell.execute_reply":"2023-05-21T13:34:32.938249Z","shell.execute_reply.started":"2023-05-21T13:34:32.904753Z"},"trusted":true},"outputs":[],"source":["# define the models\n","log_regress = LogisticRegression(labelCol = 'label', featuresCol = 'features')\n","decision_tree = DecisionTreeClassifier(labelCol = 'label', featuresCol = 'features')\n","rand_forest = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')\n","gbt = GBTClassifier(labelCol = 'label', featuresCol = 'features')"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:32.943771Z","iopub.status.busy":"2023-05-21T13:34:32.943257Z","iopub.status.idle":"2023-05-21T13:34:47.899579Z","shell.execute_reply":"2023-05-21T13:34:47.897877Z","shell.execute_reply.started":"2023-05-21T13:34:32.943719Z"},"trusted":true},"outputs":[],"source":["log_regress_model = log_regress.fit(train)"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:47.905342Z","iopub.status.busy":"2023-05-21T13:34:47.904915Z","iopub.status.idle":"2023-05-21T13:34:53.364589Z","shell.execute_reply":"2023-05-21T13:34:53.363042Z","shell.execute_reply.started":"2023-05-21T13:34:47.905303Z"},"trusted":true},"outputs":[],"source":["decision_tree_model = decision_tree.fit(train)"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:34:53.378596Z","iopub.status.busy":"2023-05-21T13:34:53.374398Z","iopub.status.idle":"2023-05-21T13:35:04.400017Z","shell.execute_reply":"2023-05-21T13:35:04.398920Z","shell.execute_reply.started":"2023-05-21T13:34:53.378504Z"},"trusted":true},"outputs":[],"source":["rand_forest_model = rand_forest.fit(train)"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:35:04.403129Z","iopub.status.busy":"2023-05-21T13:35:04.402061Z","iopub.status.idle":"2023-05-21T13:36:21.498266Z","shell.execute_reply":"2023-05-21T13:36:21.497154Z","shell.execute_reply.started":"2023-05-21T13:35:04.403066Z"},"trusted":true},"outputs":[],"source":["gbt_model = gbt.fit(train)"]},
{"cell_type":"markdown","metadata":{},"source":["## 6. Evaluation"]},
{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:36:21.500955Z","iopub.status.busy":"2023-05-21T13:36:21.500013Z","iopub.status.idle":"2023-05-21T13:36:21.811316Z","shell.execute_reply":"2023-05-21T13:36:21.810129Z","shell.execute_reply.started":"2023-05-21T13:36:21.500891Z"},"trusted":true},"outputs":[],"source":["# Predictions on the test set\n","log_regress_predictions = log_regress_model.transform(test)\n","decision_tree_predictions = decision_tree_model.transform(test)\n","rand_forest_predictions = rand_forest_model.transform(test)\n","gbt_predictions = gbt_model.transform(test)\n"]},
\n","# A higher value of areaUnderPR indicates better model performance, with 1.0 being the maximum achievable value.\n","evaluator_PR = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderPR')\n","\n","# Accuracy\n","# in pyspark accuracy metrics is for multiclass-classification\n","evaluator_Acc = MulticlassClassificationEvaluator(labelCol='label', metricName='accuracy')\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:36:21.845815Z","iopub.status.busy":"2023-05-21T13:36:21.844857Z","iopub.status.idle":"2023-05-21T13:36:36.554815Z","shell.execute_reply":"2023-05-21T13:36:36.553701Z","shell.execute_reply.started":"2023-05-21T13:36:21.845747Z"},"trusted":true},"outputs":[],"source":["# set evaluations\n","\n","log_regress_ROC = evaluator_ROC.evaluate(log_regress_predictions)\n","decision_tree_ROC = evaluator_ROC.evaluate(decision_tree_predictions)\n","rand_forest_ROC = evaluator_ROC.evaluate(rand_forest_predictions)\n","gbt_ROC = evaluator_ROC.evaluate(gbt_predictions)\n","\n","log_regress_PR = evaluator_PR.evaluate(log_regress_predictions)\n","decision_tree_PR = evaluator_PR.evaluate(decision_tree_predictions)\n","rand_forest_PR = evaluator_PR.evaluate(rand_forest_predictions)\n","gbt_PR = evaluator_PR.evaluate(gbt_predictions)\n","\n","log_regress_Acc = evaluator_Acc.evaluate(log_regress_predictions)\n","decision_tree_Acc = evaluator_Acc.evaluate(decision_tree_predictions)\n","rand_forest_Acc = evaluator_Acc.evaluate(rand_forest_predictions)\n","gbt_Acc = evaluator_Acc.evaluate(gbt_predictions)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2023-05-21T13:36:36.559634Z","iopub.status.busy":"2023-05-21T13:36:36.559092Z","iopub.status.idle":"2023-05-21T13:36:36.571978Z","shell.execute_reply":"2023-05-21T13:36:36.570823Z","shell.execute_reply.started":"2023-05-21T13:36:36.559562Z"},"trusted":true},"outputs":[],"source":["# Print the metrics of each model - unbalanced dataset\n","print('Metric esults:')\n","print('Area under Receiver Operating Characteristic curve:')\n","print(\"Logistic Regression ROC: {:.4f}\".format(log_regress_ROC))\n","print(\"Decision Tree ROC: {:.4f}\".format(decision_tree_ROC))\n","print(\"Random Forest ROC: {:.4f}\".format(rand_forest_ROC))\n","print(\"Gradient Boosted Trees ROC: {:.4f}\".format(gbt_ROC))\n","\n","print('Area under Precision Recall curve:')\n","print(\"Logistic Regression PR: {:.4f}\".format(log_regress_PR))\n","print(\"Decision Tree PR: {:.4f}\".format(decision_tree_PR))\n","print(\"Random Forest PR: {:.4f}\".format(rand_forest_PR))\n","print(\"Gradient Boosted Trees PR: {:.4f}\".format(gbt_PR))\n","\n","print('Accuracy:')\n","print(\"Logistic Regression PR: {:.4f}\".format(log_regress_Acc))\n","print(\"Decision Tree PR: {:.4f}\".format(decision_tree_Acc))\n","print(\"Random Forest PR: {:.4f}\".format(rand_forest_Acc))\n","print(\"Gradient Boosted Trees PR: {:.4f}\".format(gbt_Acc))"]},{"cell_type":"markdown","metadata":{},"source":["#### Example Results from 2 previous runs:\n","\n","##### Unbalanced set:\n","Area under Receiver Operating Characteristic curve:\n","- Logistic Regression ROC: 0.6707\n","- Decision Tree ROC: 0.4989\n","- Random Forest ROC: 0.6255\n","- Gradient Boosted Trees ROC: 0.6616\n","\n","Area under Precision Recall curve:\n","- Logistic Regression PR: 0.0318\n","- Decision Tree PR: 0.0187\n","- Random Forest PR: 0.0283\n","- Gradient Boosted Trees PR: 0.0371\n","\n","Accuracy:\n","- Logistic Regression 
{"cell_type":"markdown","metadata":{},"source":["#### Example Results from 2 previous runs:\n","\n","##### Unbalanced set:\n","Area under Receiver Operating Characteristic curve:\n","- Logistic Regression ROC: 0.6707\n","- Decision Tree ROC: 0.4989\n","- Random Forest ROC: 0.6255\n","- Gradient Boosted Trees ROC: 0.6616\n","\n","Area under Precision Recall curve:\n","- Logistic Regression PR: 0.0318\n","- Decision Tree PR: 0.0187\n","- Random Forest PR: 0.0283\n","- Gradient Boosted Trees PR: 0.0371\n","\n","Accuracy:\n","- Logistic Regression Accuracy: 0.9834\n","- Decision Tree Accuracy: 0.9833\n","- Random Forest Accuracy: 0.9834\n","- Gradient Boosted Trees Accuracy: 0.9833\n","\n","##### Balanced set:\n","Area under Receiver Operating Characteristic curve:\n","- Logistic Regression ROC: 0.7126\n","- Decision Tree ROC: 0.5853\n","- Random Forest ROC: 0.6762\n","- Gradient Boosted Trees ROC: 0.7304\n","\n","Area under Precision Recall curve:\n","- Logistic Regression PR: 0.6907\n","- Decision Tree PR: 0.5853\n","- Random Forest PR: 0.6621\n","- Gradient Boosted Trees PR: 0.7163\n","\n","Accuracy:\n","- Logistic Regression Accuracy: 0.6529\n","- Decision Tree Accuracy: 0.6226\n","- Random Forest Accuracy: 0.6290\n","- Gradient Boosted Trees Accuracy: 0.6660"]},
{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.4"}},"nbformat":4,"nbformat_minor":4}
--------------------------------------------------------------------------------