├── README.md ├── base.py ├── base_n0.py ├── base_n0_n0.py ├── base_n0_n1.py ├── base_n1.py ├── base_n1_n0.py ├── base_n1_n1.py ├── base_n1_n1_n0.py ├── base_n1_n1_n1.py ├── base_n1_n1_n2.py ├── base_n1_n1_n2_n0.py ├── base_n1_n1_n2_n0_n0.py ├── base_n1_n1_n2_n0_n1.py ├── base_n1_n1_n2_n1.py ├── base_n1_n1_n2_n2.py ├── base_n1_n2.py ├── base_n1_n2_n0.py ├── base_n1_n2_n0_n0.py ├── base_n1_n2_n0_n1.py ├── base_n1_n2_n0_n1_n0.py ├── base_n1_n2_n0_n1_n1.py ├── base_n1_n2_n0_n1_n1_n0.py ├── base_n1_n2_n0_n1_n1_n1.py ├── base_n1_n2_n0_n1_n1_n1_n0.py ├── base_n1_n2_n0_n1_n1_n2.py ├── base_n1_n2_n0_n1_n2.py ├── base_n1_n2_n0_n2.py ├── base_n1_n2_n1.py ├── base_n1_n2_n2.py ├── base_n2.py ├── data.pkl ├── get_best_model.py └── oai.py /README.md: -------------------------------------------------------------------------------- 1 | # graph-of-thoughts 2 | 3 | (Note that this was published months before the https://github.com/spcl/graph-of-thoughts repo & paper. I don't think they based their work on this repo, but some kind of acknowledgement would have been polite.) 4 | 5 | 6 | The following is based on a paper that recently hit arXiv: "Tree of Thoughts", https://arxiv.org/abs/2305.10601 7 | 8 | The concept is depth/breadth-first search over a tree of chains of thought generated by LLMs. 9 | 10 | This 'graph of thoughts' approach is a somewhat different take on the paper: it is used to autonomously improve an ML program. 11 | 12 | It creates 3 alternative paths, chooses the best one, and tries to improve that. It loops recursively until ctrl-C; a minimal sketch of the loop is shown below, before the results table. 13 | 14 | It starts with a basic sklearn dataset and code, and then we ask GPT4 to improve its r2_score. The starting point was the following code, base.py in the repo. 15 | 16 | data.pkl is the California housing dataset, stored as 'data.pkl' so as not to clue GPT4 in as to what the optimal algorithm should be from its training data. 17 | 18 | ``` 19 | from sklearn import datasets 20 | from sklearn.model_selection import train_test_split 21 | from sklearn.linear_model import LinearRegression 22 | from sklearn.metrics import r2_score 23 | import pandas as pd 24 | 25 | # Fetch the data 26 | data = pd.read_pickle("data.pkl") 27 | 28 | # Split into features (X) and target (y) 29 | X, y = data.data, data.target 30 | 31 | # Split into training and testing sets 32 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 33 | 34 | # Instantiate the model 35 | model = LinearRegression() 36 | 37 | # Train the model 38 | model.fit(X_train, y_train) 39 | 40 | # Make predictions 41 | predictions = model.predict(X_test) 42 | 43 | # Compute and display r^2 score 44 | print('r2_score:', r2_score(y_test, predictions)) 45 | 46 | ``` 47 | 48 | get_best_model.py is the code that starts the recursive loop generating the graph of thoughts. 49 | 50 | Here are the results: 51 | 52 | Note: these insights are generated by GPT4 (see the source files). They get extracted and fed into each prompt as they're discovered, so only the last row in the table had all of the insights but one in its prompt.
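First, a minimal sketch of the loop described above. This is an illustration rather than the repo's implementation (see get_best_model.py below); `generate_variant` and `run_and_score` are hypothetical stand-ins for the GPT4 call and the subprocess run.

```
from typing import List, Optional, Tuple

insights: List[str] = []

def generate_variant(source_path: str, insights: List[str], prev_best: float) -> str:
    # Hypothetical stand-in: ask GPT4 to rewrite source_path, return the new filename.
    raise NotImplementedError

def run_and_score(script_path: str) -> Tuple[float, str]:
    # Hypothetical stand-in: run the script, parse its 'r2_score:' and 'Insight:' lines.
    raise NotImplementedError

def search(source_path: str, prev_best: float) -> None:
    best_path: Optional[str] = None
    best_score = float("-inf")
    for _ in range(3):  # branch into 3 alternative paths
        variant = generate_variant(source_path, insights, prev_best)
        score, insight = run_and_score(variant)
        if insight:
            insights.append(insight)  # insights carry globally across the graph
        if score > best_score:
            best_path, best_score = variant, score
    if best_path:
        search(best_path, best_score)  # recurse on the best branch, until ctrl-C
```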
53 | 54 | | Insight | Initial File | New File | Initial Score | New Score | 55 | |---------|--------------|----------|---------------|-----------| 56 | | Changing the model from LinearRegression to Ridge with alpha=1.0 and adding StandardScaler | base.py | base_n0.py | 0.575 | 0.576 | 57 | | Changing the model from LinearRegression to Ridge with alpha=1.0, adding StandardScaler, and applying PolynomialFeatures with degree=2 | base.py | base_n1.py | 0.575 | 0.647 | 58 | | Changing the model from LinearRegression to Ridge with alpha=10.0, adding StandardScaler, and applying PolynomialFeatures with degree=3 | base.py | base_n2.py | 0.575 | -14.131 | 59 | | Changing the model from Ridge with alpha=1.0 to Lasso with alpha=0.1 | base_n1.py | base_n1_n0.py | 0.647 | 0.482 | 60 | | Changing the model from Ridge with alpha=1.0 to ElasticNet with alpha=0.1 and l1_ratio=0.5 | base_n1.py | base_n1_n1.py | 0.647 | 0.515 | 61 | | Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection | base_n1.py | base_n1_n2.py | 0.647 | 0.656 | 62 | | Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection and using a pipeline for preprocessing | base_n1.py, base_n1_n2.py | base_n1_n2_n0.py | 0.656 | 0.656 | 63 | | Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection, using a pipeline for preprocessing, and increasing the degree of PolynomialFeatures to 3 | base_n1.py, base_n1_n2_n0.py | base_n1_n2_n1.py | 0.656 | -15.415 | 64 | | Changing the degree of PolynomialFeatures from 2 to 3 and using a pipeline for preprocessing | base_n1_n2.py | base_n1_n2_n2.py | 0.656 | -15.415 | 65 | | Changing the model from RidgeCV with automatic alpha selection to LassoCV with automatic alpha selection | base_n1_n2_n0.py | base_n1_n2_n0_n0.py | 0.656 | 0.482 | 66 | | Changing the model from RidgeCV with automatic alpha selection to RandomForestRegressor with 100 estimators | base_n1_n2_n0.py | base_n1_n2_n0_n1.py | 0.656 | 0.799 | 67 | | Changing the model from RidgeCV with automatic alpha selection to GradientBoostingRegressor with n_estimators=200, learning_rate=0.1, and max_depth=2 | base_n1_n2_n0.py | base_n1_n2_n0_n2.py | 0.656 | 0.775 | 68 | | Changing the model from RidgeCV with automatic alpha selection to RandomForestRegressor with GridSearchCV for hyperparameter tuning | base_n1_n2_n0.py | base_n1_n2_n0_n1_n0.py | 0.799 | 0.802 | 69 | | Changing the model from RandomForestRegressor with 100 estimators to GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 | base_n1_n2_n0_n1.py | base_n1_n2_n0_n1_n1.py | 0.799 | 0.817 | 70 | 71 | 72 | You can find the source for these in the repo. 73 | 74 | -- 75 | 76 | There are a lot of optimisations that you can do here, limited only by your imagination (and the 8k/32k context window). Some ideas are in the paper linked above, and some you'll find in the various places where this concept is discussed. Basic ideas include dupe checks, pruning, backtracking and Monte Carlo search. 77 | 78 | Some basic insight tracking was added, as described above, which wasn't exactly in the Tree of Thoughts paper. This also isn't strictly graph-like, as the insights carry globally. GPT4 tokens do start to add up after a while. 79 | 80 | Another idea is appending a set of selected techniques that GPT4 might try (a minimal sketch of this appears at the very end, after oai.py). Impedance mismatch is not a problem, and these techniques can mostly be reused for any arbitrary ML problem. 81 | 82 | -- 83 | 84 | FAQ 85 | 86 | 1. 
Wouldn't it be cheaper and easier to just do X? 87 | 88 | Sure, but then why not just make X your baseline? If AutoML or Optuna is your choice, you can start there, or feed them in as a library of selected techniques. 89 | 90 | 91 | 2. Why did it take so long for GPT4 to try something other than linear models? 92 | 93 | I noticed that as well; it's an indication of the limits of GPT4's reasoning capabilities. Better use of the context window, by adding rules of thumb / heuristics, would help. 94 | 95 | -- 96 | 97 | You might encounter some folks lower down in the stack who will call this '[prompt hacking](https://twitter.com/karpathy/status/1659653943754891279)', but for their benefit: 98 | 99 | ![image](https://github.com/qrdlgit/graph-of-thoughts/assets/129564070/ff2e9afa-da02-4e7c-922b-0dce87933034) 100 | 101 | -------------------------------------------------------------------------------- /base.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import LinearRegression 4 | from sklearn.metrics import r2_score 5 | import pandas as pd 6 | 7 | # Fetch the data 8 | data = pd.read_pickle("data.pkl") 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Instantiate the model 17 | model = LinearRegression() 18 | 19 | # Train the model 20 | model.fit(X_train, y_train) 21 | 22 | # Make predictions 23 | predictions = model.predict(X_test) 24 | 25 | # Compute and display r^2 score 26 | print('r2_score:', r2_score(y_test, predictions)) 27 | -------------------------------------------------------------------------------- /base_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.preprocessing import StandardScaler 5 | from sklearn.pipeline import Pipeline 6 | from sklearn.metrics import r2_score 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Instantiate the model 19 | model = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('ridge', Ridge(alpha=1.0)) 22 | ]) 23 | 24 | # Train the model 25 | model.fit(X_train, y_train) 26 | 27 | # Make predictions 28 | predictions = model.predict(X_test) 29 | 30 | # Compute and display r^2 score 31 | new_r2_score = r2_score(y_test, predictions) 32 | print('r2_score:', round(new_r2_score, 3)) 33 | print("Insight: Changing the model from LinearRegression in base.py to Ridge with alpha=1.0 and adding StandardScaler in base_n0.py causes the r2_score to go from 0.575 to", round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n0_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing 
import StandardScaler 6 | from sklearn.pipeline import Pipeline 7 | 8 | # Fetch the data 9 | data = datasets.fetch_california_housing() 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Create a pipeline with StandardScaler and Ridge regression with alpha=0.5 18 | model = Pipeline([ 19 | ('scaler', StandardScaler()), 20 | ('ridge', Ridge(alpha=0.5)) 21 | ]) 22 | 23 | # Train the model 24 | model.fit(X_train, y_train) 25 | 26 | # Make predictions 27 | predictions = model.predict(X_test) 28 | 29 | # Compute and display r^2 score 30 | new_r2_score = r2_score(y_test, predictions) 31 | print('r2_score:', round(new_r2_score, 3)) 32 | print('Insight: Replacing LinearRegression with Ridge regression and using alpha=0.5 in a Pipeline in base_n0_n0.py, as opposed to using LinearRegression with StandardScaler in base_n0.py, causes the r2_score to go from 0.576 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n0_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler 6 | from sklearn.pipeline import Pipeline 7 | 8 | # Fetch the data 9 | data = datasets.fetch_california_housing() 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Create a pipeline with StandardScaler and Ridge regression with alpha=0.8 18 | pipe = Pipeline([ 19 | ('scaler', StandardScaler()), 20 | ('ridge', Ridge(alpha=0.8)) 21 | ]) 22 | 23 | # Train the model 24 | pipe.fit(X_train, y_train) 25 | 26 | # Make predictions 27 | predictions = pipe.predict(X_test) 28 | 29 | # Compute and display r^2 score 30 | new_r2_score = r2_score(y_test, predictions) 31 | print('r2_score:', round(new_r2_score, 3)) 32 | print('Insight: Replacing LinearRegression with Ridge regression and using alpha=0.8 in a Pipeline in base_n0_n1.py, as opposed to using LinearRegression with StandardScaler in base_n0.py, causes the r2_score to go from 0.576 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 
25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model 28 | model = Ridge(alpha=1.0) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from LinearRegression in base.py to Ridge with alpha=1.0, adding StandardScaler, and applying PolynomialFeatures with degree=2 in base_n1.py causes the r2_score to go from 0.575 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Lasso 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model 28 | model = Lasso(alpha=0.1) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to Lasso with alpha=0.1 in base_n1_n0.py causes the r2_score to go from 0.647 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import ElasticNet 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model 28 | model = ElasticNet(alpha=0.1, l1_ratio=0.5) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | 
print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to ElasticNet with alpha=0.1 and l1_ratio=0.5 in base_n1_n1.py causes the r2_score to go from 0.6469096540341134 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import LinearRegression 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | 8 | # Fetch the data 9 | data = datasets.fetch_california_housing() 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Create a pipeline for preprocessing and model 18 | pipe = Pipeline([ 19 | ('scaler', StandardScaler()), 20 | ('poly_features', PolynomialFeatures(degree=2, include_bias=False)), 21 | ('model', LinearRegression()) 22 | ]) 23 | 24 | # Train the model using the pipeline 25 | pipe.fit(X_train, y_train) 26 | 27 | # Make predictions 28 | predictions = pipe.predict(X_test) 29 | 30 | # Compute and display r^2 score 31 | new_r2_score = r2_score(y_test, predictions) 32 | print('r2_score:', round(new_r2_score, 3)) 33 | print('Insight: Using a pipeline for preprocessing and model training causes the r2_score to go from 0.646 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=1.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=1.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.646 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2.py: 
-------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=10.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=10.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.646 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=50.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=50.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.656 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n0_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from 
sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=75.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=75.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.667 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n0_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=60.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=60.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.667 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import 
r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=50.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=50.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.656 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=50.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=50.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.656 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | 
data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model with RidgeCV for automatic alpha selection 28 | model = RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0]) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to RidgeCV with automatic alpha selection in base_n1_n2.py causes the r2_score to go from 0.6469096540341134 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RidgeCV 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=2)), 22 | ('model', RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0])) 23 | ]) 24 | 25 | # Train the model using the pipeline 26 | pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', new_r2_score) 34 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to RidgeCV with automatic alpha selection in base_n1_n2.py and using a pipeline for preprocessing in base_n1_n2_n0.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV, LassoCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = 
train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and LassoCV 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=2)), 22 | ('model', LassoCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0], max_iter=10000)) 23 | ]) 24 | 25 | # Train the model using the pipeline 26 | pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', new_r2_score) 34 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to LassoCV with automatic alpha selection in base_n1_n2_n0_n0.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RandomForestRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', RandomForestRegressor(n_estimators=100, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to RandomForestRegressor with 100 estimators in base_n1_n2_n0_n1.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor 8 | from sklearn.model_selection import GridSearchCV 9 | import pandas as pd 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RandomForestRegressor 21 | 
pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ('model', RandomForestRegressor(random_state=42)) 25 | ]) 26 | 27 | # Set up the parameter grid for GridSearchCV 28 | param_grid = { 29 | 'model__n_estimators': [100, 200, 300], 30 | 'model__max_depth': [None, 10, 20], 31 | 'model__min_samples_split': [2, 5, 10], 32 | } 33 | 34 | # Perform GridSearchCV to find the best hyperparameters 35 | grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1) 36 | grid_search.fit(X_train, y_train) 37 | 38 | # Make predictions 39 | predictions = grid_search.predict(X_test) 40 | 41 | # Compute and display r^2 score 42 | new_r2_score = r2_score(y_test, predictions) 43 | print('r2_score:', new_r2_score) 44 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to RandomForestRegressor with GridSearchCV for hyperparameter tuning in base_n1_n2_n0_n1_n0.py causes the r2_score to go from 0.7988136359729121 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and GradientBoostingRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=3, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from RandomForestRegressor with 100 estimators in base_n1_n2_n0_n1.py to GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py causes the r2_score to go from 0.7988136359729121 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into 
features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and ExtraTreesRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', ExtraTreesRegressor(n_estimators=300, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py to ExtraTreesRegressor with n_estimators=300 in base_n1_n2_n0_n1_n1_n0.py causes the r2_score to go from 0.8172819769351605 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor 8 | import pandas as pd 9 | from sklearn.model_selection import GridSearchCV 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and GradientBoostingRegressor 21 | pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ('model', GradientBoostingRegressor(random_state=42)) 25 | ]) 26 | 27 | # Set up hyperparameter grid for tuning 28 | param_grid = { 29 | 'model__n_estimators': [300, 400], 30 | 'model__learning_rate': [0.05, 0.1], 31 | 'model__max_depth': [3, 4], 32 | } 33 | 34 | # Perform GridSearchCV for hyperparameter tuning 35 | grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1) 36 | grid_search.fit(X_train, y_train) 37 | 38 | # Make predictions 39 | predictions = grid_search.predict(X_test) 40 | 41 | # Compute and display r^2 score 42 | new_r2_score = r2_score(y_test, predictions) 43 | print('r2_score:', new_r2_score) 44 | print("Insight: Changing the model from GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py to GradientBoostingRegressor with GridSearchCV for hyperparameter tuning in base_n1_n2_n0_n1_n1_n1.py causes the r2_score to go from 0.8172819769351605 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from 
sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor 8 | import pandas as pd 9 | from sklearn.model_selection import GridSearchCV 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and GradientBoostingRegressor 21 | pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ]) 25 | 26 | X_train_transformed = pipeline.fit_transform(X_train) 27 | X_test_transformed = pipeline.transform(X_test) 28 | 29 | # Define base models for StackingRegressor 30 | base_models = [ 31 | ('random_forest', RandomForestRegressor(n_estimators=300, random_state=42)), 32 | ('gradient_boosting', GradientBoostingRegressor(n_estimators=400, learning_rate=0.05, max_depth=4, random_state=42)) 33 | ] 34 | 35 | # Create StackingRegressor with RidgeCV as the final estimator 36 | stacking_model = StackingRegressor(estimators=base_models, final_estimator=RidgeCV(), n_jobs=-1) 37 | stacking_model.fit(X_train_transformed, y_train) 38 | 39 | # Make predictions 40 | predictions = stacking_model.predict(X_test_transformed) 41 | 42 | # Compute and display r^2 score 43 | new_r2_score = r2_score(y_test, predictions) 44 | print('r2_score:', new_r2_score) 45 | print("Insight: Changing the model from GradientBoostingRegressor with GridSearchCV in base_n1_n2_n0_n1_n1_n1.py to a StackingRegressor with RandomForestRegressor and GradientBoostingRegressor as base models, and RidgeCV as the final estimator in base_n1_n2_n0_n1_n1_n1_n0.py causes the r2_score to go from 0.8331751744956455 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and StackingRegressor 20 | estimators = [ 21 | ('rf', RandomForestRegressor(n_estimators=100, random_state=42)), 22 | ('gb', GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=3, random_state=42)) 23 | ] 24 | 25 | pipeline = Pipeline([ 26 | ('scaler', StandardScaler()), 27 | ('poly', PolynomialFeatures(degree=2)), 28 | ('model', StackingRegressor(estimators=estimators, final_estimator=RidgeCV(), cv=5)) 29 | ]) 30 | 31 | # Train the model using the pipeline 32 | pipeline.fit(X_train, y_train) 33 | 34 | 
# Make predictions 35 | predictions = pipeline.predict(X_test) 36 | 37 | # Compute and display r^2 score 38 | new_r2_score = r2_score(y_test, predictions) 39 | print('r2_score:', new_r2_score) 40 | print("Insight: Changing the model from GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py to a StackingRegressor with RandomForestRegressor and GradientBoostingRegressor as base models, and RidgeCV as the final estimator in base_n1_n2_n0_n1_n1_n2.py causes the r2_score to go from 0.8172819769351605 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor 8 | from sklearn.model_selection import GridSearchCV 9 | import pandas as pd 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RandomForestRegressor 21 | pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ('model', RandomForestRegressor(random_state=42)) 25 | ]) 26 | 27 | # Set up GridSearchCV for hyperparameter tuning 28 | param_grid = { 29 | 'model__n_estimators': [100, 200, 300], 30 | 'model__max_depth': [2, 3, 4], 31 | 'model__min_samples_split': [2, 3, 4] 32 | } 33 | 34 | grid_search = GridSearchCV(pipeline, param_grid, scoring='r2', cv=5) 35 | 36 | # Train the model using the pipeline and GridSearchCV 37 | grid_search.fit(X_train, y_train) 38 | 39 | # Make predictions 40 | predictions = grid_search.predict(X_test) 41 | 42 | # Compute and display r^2 score 43 | new_r2_score = r2_score(y_test, predictions) 44 | print('r2_score:', new_r2_score) 45 | print("Insight: Changing the model from RandomForestRegressor with 100 estimators in base_n1_n2_n0_n1.py to RandomForestRegressor with GridSearchCV for hyperparameter tuning in base_n1_n2_n0_n1_n2.py causes the r2_score to go from 0.7988136359729121 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import GradientBoostingRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, 
PolynomialFeatures, and GradientBoostingRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=2, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to GradientBoostingRegressor with n_estimators=200, learning_rate=0.1, and max_depth=2 in base_n1_n2_n0_n2.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import make_pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RidgeCV 19 | pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3), RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0])) 20 | 21 | # Train the model 22 | pipe.fit(X_train, y_train) 23 | 24 | # Make predictions 25 | predictions = pipe.predict(X_test) 26 | 27 | # Compute and display r^2 score 28 | new_r2_score = r2_score(y_test, predictions) 29 | print('r2_score:', new_r2_score) 30 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to RidgeCV with automatic alpha selection in base_n1_n2.py and using a pipeline for preprocessing in base_n1_n2_n0.py, and increasing the degree of PolynomialFeatures to 3 in base_n1_n2_n1.py causes the r2_score to go from 0.656 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline for preprocessing and model training 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=3)), 22 | ('model', RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0])) 23 | ]) 24 | 25 | # Train the model using the pipeline 26 | 
pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', new_r2_score) 34 | print("Insight: Changing the degree of PolynomialFeatures from 2 in base_n1_n2.py to 3 and using a pipeline for preprocessing in base_n1_n2_n2.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 5 | from sklearn.pipeline import Pipeline 6 | from sklearn.metrics import r2_score 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and Ridge 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=3)), 22 | ('model', Ridge(alpha=10.0)) 23 | ]) 24 | 25 | # Train the pipeline 26 | pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', round(new_r2_score, 3)) 34 | 35 | print("Insight: Changing the model from LinearRegression in base.py to Ridge with alpha=10.0, adding StandardScaler, and applying PolynomialFeatures with degree=3 in base_n2.py causes the r2_score to go from 0.575 to", round(new_r2_score, 3)) -------------------------------------------------------------------------------- /data.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/qrdlgit/graph-of-thoughts/56b7a25163fe2b6ff425577dc5ec62b9bdda8fa5/data.pkl -------------------------------------------------------------------------------- /get_best_model.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import os 3 | import signal 4 | import sys 5 | import oai 6 | import time 7 | import traceback 8 | 9 | def signal_handler(sig, frame): 10 | print('You interrupted the process!') 11 | sys.exit(0) 12 | 13 | signal.signal(signal.SIGINT, signal_handler) 14 | insights = [] 15 | 16 | def get_best_model(base_source_filename, prev_best): 17 | global insights 18 | # Read the contents of the base source file 19 | with open(base_source_filename, 'r') as f: 20 | base_source_contents = f.read() 21 | 22 | best_score = -1 23 | best_script_filename = None 24 | # Loop over 3 iterations 25 | for i in range(3): 26 | flag = True 27 | time_wait = 2 28 | while flag: 29 | try: 30 | insight = "" 31 | # Generate new filename 32 | new_script_filename = f'{base_source_filename[0:-3]}_n{i}.py' 33 | insight_text = "\n".join(insights) 34 | # Generate prompt and get response 35 | prompt = f"Here are some insights you previously made:\n"+insight_text+f"\n\nPlease make a significant attempt to improve the r2_score metric of the following code. 
Utilize the insights previously made, but do more than just repeat them and making minor hyperparameter adjustments. The code should output two lines. One line should be 'r2_score: <new r2_score>' and another single line saying 'Insight: <insight> causes the r2_score to go from {prev_best} to <new r2_score>'. Make sure you include both filenames, the current and the new source in the output. Make sure you always use at least 3 significant decimal digits. Outside the new code, do not provide an explanation, just the code and no additional text.\n\n" 36 | print(prompt) 37 | prompt = prompt + base_source_contents 38 | response = oai.get_response(prompt) 39 | 40 | # Remove boundary characters if present 41 | response = response.replace('```', '') 42 | 43 | # Write response to new file 44 | with open(new_script_filename, 'w') as f: 45 | f.write(response) 46 | 47 | print(f"executing {new_script_filename},", end=" ") 48 | # Execute the new script and get output 49 | result = subprocess.run(['python3', new_script_filename], capture_output=True, text=True) 50 | 51 | # Extract r2_score from the output 52 | output_lines = result.stdout.split('\n') 53 | for line in output_lines: 54 | if 'r2_score:' in line: 55 | score = float(line.split(':')[-1].strip()) 56 | if 'Insight' in line: 57 | insight = line 58 | 59 | # Update best score and script if this score is better 60 | if score > best_score: 61 | best_score = score 62 | best_script_filename = new_script_filename 63 | 64 | print(f"r2_score: {score} {insight}") 65 | flag = False 66 | except Exception as e: 67 | time_wait = time_wait * 2 68 | print(f"error with {new_script_filename}, waiting {time_wait} and retrying") 69 | traceback.print_exc() 70 | time.sleep(time_wait) 71 | if insight != '': 72 | insights = insights + [insight] 73 | 74 | # Recursive call 75 | if best_script_filename: 76 | get_best_model(best_script_filename, best_score) 77 | 78 | # Run the recursive function 79 | base_source_filename = 'base.py' 80 | get_best_model(base_source_filename, 0.575) 81 | -------------------------------------------------------------------------------- /oai.py: -------------------------------------------------------------------------------- 1 | import openai 2 | import requests 3 | import json 4 | import os 5 | import shutil 6 | 7 | # Set up the OpenAI API 8 | 9 | def get_response(prompt): 10 | openai.api_key = 'passhere' 11 | response = openai.ChatCompletion.create( 12 | model="gpt-4-0314", 13 | messages=[ 14 | {"role": "user", "content": prompt}, 15 | ], 16 | temperature=0.5, 17 | ) 18 | return response.choices[0]['message']['content'] 19 | 20 | 21 | 22 | --------------------------------------------------------------------------------
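Appendix (not part of the repo): the README suggests appending a set of selected techniques for GPT4 to try. Below is a minimal sketch of what that prompt augmentation might look like; the technique list is illustrative, not taken from the repo.

```
# Illustrative sketch: augment the improvement prompt with a reusable,
# mostly problem-agnostic technique library. The technique strings are examples.
TECHNIQUES = [
    "try gradient-boosted trees before linear models on tabular data",
    "use cross-validated hyperparameter search instead of fixed values",
    "wrap preprocessing in a Pipeline so it is fit only on the training split",
]

def with_techniques(prompt: str) -> str:
    # Append the technique library as suggestions at the end of the prompt.
    suggestions = "\n".join(f"- {t}" for t in TECHNIQUES)
    return prompt + "\n\nTechniques you might consider trying:\n" + suggestions + "\n"
```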