├── README.md ├── base.py ├── base_n0.py ├── base_n0_n0.py ├── base_n0_n1.py ├── base_n1.py ├── base_n1_n0.py ├── base_n1_n1.py ├── base_n1_n1_n0.py ├── base_n1_n1_n1.py ├── base_n1_n1_n2.py ├── base_n1_n1_n2_n0.py ├── base_n1_n1_n2_n0_n0.py ├── base_n1_n1_n2_n0_n1.py ├── base_n1_n1_n2_n1.py ├── base_n1_n1_n2_n2.py ├── base_n1_n2.py ├── base_n1_n2_n0.py ├── base_n1_n2_n0_n0.py ├── base_n1_n2_n0_n1.py ├── base_n1_n2_n0_n1_n0.py ├── base_n1_n2_n0_n1_n1.py ├── base_n1_n2_n0_n1_n1_n0.py ├── base_n1_n2_n0_n1_n1_n1.py ├── base_n1_n2_n0_n1_n1_n1_n0.py ├── base_n1_n2_n0_n1_n1_n2.py ├── base_n1_n2_n0_n1_n2.py ├── base_n1_n2_n0_n2.py ├── base_n1_n2_n1.py ├── base_n1_n2_n2.py ├── base_n2.py ├── data.pkl ├── get_best_model.py └── oai.py /README.md: -------------------------------------------------------------------------------- 1 | # graph-of-thoughts 2 | 3 | (Note that this was published months before the https://github.com/spcl/graph-of-thoughts repo & paper. I don't think they based their work on this repo, but some kind of acknowledgement would have been polite.) 4 | 5 | 6 | The following is based on a paper that recently hit arXiv: "Tree of Thoughts", https://arxiv.org/abs/2305.10601 7 | 8 | The concept is depth/breadth-first search over a tree of chains of thought generated by LLMs. 9 | 10 | This 'graph of thoughts' approach is a somewhat different take on the paper: it is used to autonomously improve an ML program. 11 | 12 | It creates 3 alternative paths, chooses the best one, and tries to improve that. It loops recursively until ctrl-C; a minimal sketch of the loop is shown below, before the results table. 13 | 14 | It starts with a basic sklearn dataset and code, and then we ask GPT4 to improve its r2_score. The starting point was the following code, base.py in the repo. 15 | 16 | data.pkl is the California housing dataset, stored as 'data.pkl' so as not to clue GPT4 in as to what the optimal algorithm should be from its training data. 17 | 18 | ``` 19 | from sklearn import datasets 20 | from sklearn.model_selection import train_test_split 21 | from sklearn.linear_model import LinearRegression 22 | from sklearn.metrics import r2_score 23 | import pandas as pd 24 | 25 | # Fetch the data 26 | data = pd.read_pickle("data.pkl") 27 | 28 | # Split into features (X) and target (y) 29 | X, y = data.data, data.target 30 | 31 | # Split into training and testing sets 32 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 33 | 34 | # Instantiate the model 35 | model = LinearRegression() 36 | 37 | # Train the model 38 | model.fit(X_train, y_train) 39 | 40 | # Make predictions 41 | predictions = model.predict(X_test) 42 | 43 | # Compute and display r^2 score 44 | print('r2_score:', r2_score(y_test, predictions)) 45 | 46 | ``` 47 | 48 | get_best_model.py is the code that starts the recursive loop generating the graph of thoughts. 49 | 50 | Here are the results: 51 | 52 | Note: these insights are generated by GPT4 (see the source files). They get extracted and fed into each prompt as they're discovered, so only the last row in the table had all of the insights but one in its prompt.
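First, a minimal sketch of the loop described above. This is an illustration rather than the repo's implementation (see get_best_model.py below); `generate_variant` and `run_and_score` are hypothetical stand-ins for the GPT4 call and the subprocess run.

```
from typing import List, Optional, Tuple

insights: List[str] = []

def generate_variant(source_path: str, insights: List[str], prev_best: float) -> str:
    # Hypothetical stand-in: ask GPT4 to rewrite source_path, return the new filename.
    raise NotImplementedError

def run_and_score(script_path: str) -> Tuple[float, str]:
    # Hypothetical stand-in: run the script, parse its 'r2_score:' and 'Insight:' lines.
    raise NotImplementedError

def search(source_path: str, prev_best: float) -> None:
    best_path: Optional[str] = None
    best_score = float("-inf")
    for _ in range(3):  # branch into 3 alternative paths
        variant = generate_variant(source_path, insights, prev_best)
        score, insight = run_and_score(variant)
        if insight:
            insights.append(insight)  # insights carry globally across the graph
        if score > best_score:
            best_path, best_score = variant, score
    if best_path:
        search(best_path, best_score)  # recurse on the best branch, until ctrl-C
```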
53 | 54 | | Insight | Initial File | New File | Initial Score | New Score | 55 | |---------|--------------|----------|---------------|-----------| 56 | | Changing the model from LinearRegression to Ridge with alpha=1.0 and adding StandardScaler | base.py | base_n0.py | 0.575 | 0.576 | 57 | | Changing the model from LinearRegression to Ridge with alpha=1.0, adding StandardScaler, and applying PolynomialFeatures with degree=2 | base.py | base_n1.py | 0.575 | 0.647 | 58 | | Changing the model from LinearRegression to Ridge with alpha=10.0, adding StandardScaler, and applying PolynomialFeatures with degree=3 | base.py | base_n2.py | 0.575 | -14.131 | 59 | | Changing the model from Ridge with alpha=1.0 to Lasso with alpha=0.1 | base_n1.py | base_n1_n0.py | 0.647 | 0.482 | 60 | | Changing the model from Ridge with alpha=1.0 to ElasticNet with alpha=0.1 and l1_ratio=0.5 | base_n1.py | base_n1_n1.py | 0.647 | 0.515 | 61 | | Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection | base_n1.py | base_n1_n2.py | 0.647 | 0.656 | 62 | | Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection and using a pipeline for preprocessing | base_n1.py, base_n1_n2.py | base_n1_n2_n0.py | 0.656 | 0.656 | 63 | | Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection, using a pipeline for preprocessing, and increasing the degree of PolynomialFeatures to 3 | base_n1.py, base_n1_n2_n0.py | base_n1_n2_n1.py | 0.656 | -15.415 | 64 | | Changing the degree of PolynomialFeatures from 2 to 3 and using a pipeline for preprocessing | base_n1_n2.py | base_n1_n2_n2.py | 0.656 | -15.415 | 65 | | Changing the model from RidgeCV with automatic alpha selection to LassoCV with automatic alpha selection | base_n1_n2_n0.py | base_n1_n2_n0_n0.py | 0.656 | 0.482 | 66 | | Changing the model from RidgeCV with automatic alpha selection to RandomForestRegressor with 100 estimators | base_n1_n2_n0.py | base_n1_n2_n0_n1.py | 0.656 | 0.799 | 67 | | Changing the model from RidgeCV with automatic alpha selection to GradientBoostingRegressor with n_estimators=200, learning_rate=0.1, and max_depth=2 | base_n1_n2_n0.py | base_n1_n2_n0_n2.py | 0.656 | 0.775 | 68 | | Changing the model from RidgeCV with automatic alpha selection to RandomForestRegressor with GridSearchCV for hyperparameter tuning | base_n1_n2_n0.py | base_n1_n2_n0_n1_n0.py | 0.799 | 0.802 | 69 | | Changing the model from RandomForestRegressor with 100 estimators to GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 | base_n1_n2_n0_n1.py | base_n1_n2_n0_n1_n1.py | 0.799 | 0.817 | 70 | 71 | 72 | You can find the source for these in the repo. 73 | 74 | -- 75 | 76 | There are a lot of optimisations that you can do here, limited only by your imagination (and the 8k/32k context window). Some ideas are in the paper linked above, and some you'll find in the various places where this concept is discussed. Basic ideas include dupe checks, pruning, backtracking and Monte Carlo search. 77 | 78 | Some basic insight tracking was added, as described above, which wasn't exactly in the Tree of Thoughts paper. This also isn't strictly graph-like, as the insights carry globally. GPT4 tokens do start to add up after a while. 79 | 80 | Another idea is appending a set of selected techniques that GPT4 might try (a minimal sketch of this appears at the very end, after oai.py). Impedance mismatch is not a problem, and these techniques can mostly be reused for any arbitrary ML problem. 81 | 82 | -- 83 | 84 | FAQ 85 | 86 | 1. 
Wouldn't it be cheaper and easier to just do X? 87 | 88 | Sure, but then why not just make X your baseline? If AutoML or Optuna is your choice, you can start there, or feed them in as a library of selected techniques. 89 | 90 | 91 | 2. Why did it take so long for GPT4 to try something other than linear models? 92 | 93 | I noticed that as well; it's an indication of the limits of GPT4's reasoning capabilities. Better use of the context window, by adding rules of thumb / heuristics, would help. 94 | 95 | -- 96 | 97 | You might encounter some folks lower down in the stack who will call this '[prompt hacking](https://twitter.com/karpathy/status/1659653943754891279)', but for their benefit: 98 | 99 | ![image](https://github.com/qrdlgit/graph-of-thoughts/assets/129564070/ff2e9afa-da02-4e7c-922b-0dce87933034) 100 | 101 | -------------------------------------------------------------------------------- /base.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import LinearRegression 4 | from sklearn.metrics import r2_score 5 | import pandas as pd 6 | 7 | # Fetch the data 8 | data = pd.read_pickle("data.pkl") 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Instantiate the model 17 | model = LinearRegression() 18 | 19 | # Train the model 20 | model.fit(X_train, y_train) 21 | 22 | # Make predictions 23 | predictions = model.predict(X_test) 24 | 25 | # Compute and display r^2 score 26 | print('r2_score:', r2_score(y_test, predictions)) 27 | -------------------------------------------------------------------------------- /base_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.preprocessing import StandardScaler 5 | from sklearn.pipeline import Pipeline 6 | from sklearn.metrics import r2_score 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Instantiate the model 19 | model = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('ridge', Ridge(alpha=1.0)) 22 | ]) 23 | 24 | # Train the model 25 | model.fit(X_train, y_train) 26 | 27 | # Make predictions 28 | predictions = model.predict(X_test) 29 | 30 | # Compute and display r^2 score 31 | new_r2_score = r2_score(y_test, predictions) 32 | print('r2_score:', round(new_r2_score, 3)) 33 | print("Insight: Changing the model from LinearRegression in base.py to Ridge with alpha=1.0 and adding StandardScaler in base_n0.py causes the r2_score to go from 0.575 to", round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n0_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing 
import StandardScaler 6 | from sklearn.pipeline import Pipeline 7 | 8 | # Fetch the data 9 | data = datasets.fetch_california_housing() 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Create a pipeline with StandardScaler and Ridge regression with alpha=0.5 18 | model = Pipeline([ 19 | ('scaler', StandardScaler()), 20 | ('ridge', Ridge(alpha=0.5)) 21 | ]) 22 | 23 | # Train the model 24 | model.fit(X_train, y_train) 25 | 26 | # Make predictions 27 | predictions = model.predict(X_test) 28 | 29 | # Compute and display r^2 score 30 | new_r2_score = r2_score(y_test, predictions) 31 | print('r2_score:', round(new_r2_score, 3)) 32 | print('Insight: Replacing LinearRegression with Ridge regression and using alpha=0.5 in a Pipeline in base_n0_n0.py, as opposed to using LinearRegression with StandardScaler in base_n0.py, causes the r2_score to go from 0.576 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n0_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler 6 | from sklearn.pipeline import Pipeline 7 | 8 | # Fetch the data 9 | data = datasets.fetch_california_housing() 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Create a pipeline with StandardScaler and Ridge regression with alpha=0.8 18 | pipe = Pipeline([ 19 | ('scaler', StandardScaler()), 20 | ('ridge', Ridge(alpha=0.8)) 21 | ]) 22 | 23 | # Train the model 24 | pipe.fit(X_train, y_train) 25 | 26 | # Make predictions 27 | predictions = pipe.predict(X_test) 28 | 29 | # Compute and display r^2 score 30 | new_r2_score = r2_score(y_test, predictions) 31 | print('r2_score:', round(new_r2_score, 3)) 32 | print('Insight: Replacing LinearRegression with Ridge regression and using alpha=0.8 in a Pipeline in base_n0_n1.py, as opposed to using LinearRegression with StandardScaler in base_n0.py, causes the r2_score to go from 0.576 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 
25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model 28 | model = Ridge(alpha=1.0) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from LinearRegression in base.py to Ridge with alpha=1.0, adding StandardScaler, and applying PolynomialFeatures with degree=2 in base_n1.py causes the r2_score to go from 0.575 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Lasso 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model 28 | model = Lasso(alpha=0.1) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to Lasso with alpha=0.1 in base_n1_n0.py causes the r2_score to go from 0.647 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import ElasticNet 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model 28 | model = ElasticNet(alpha=0.1, l1_ratio=0.5) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | 
print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to ElasticNet with alpha=0.1 and l1_ratio=0.5 in base_n1_n1.py causes the r2_score to go from 0.6469096540341134 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import LinearRegression 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | 8 | # Fetch the data 9 | data = datasets.fetch_california_housing() 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Create a pipeline for preprocessing and model 18 | pipe = Pipeline([ 19 | ('scaler', StandardScaler()), 20 | ('poly_features', PolynomialFeatures(degree=2, include_bias=False)), 21 | ('model', LinearRegression()) 22 | ]) 23 | 24 | # Train the model using the pipeline 25 | pipe.fit(X_train, y_train) 26 | 27 | # Make predictions 28 | predictions = pipe.predict(X_test) 29 | 30 | # Compute and display r^2 score 31 | new_r2_score = r2_score(y_test, predictions) 32 | print('r2_score:', round(new_r2_score, 3)) 33 | print('Insight: Using a pipeline for preprocessing and model training causes the r2_score to go from 0.646 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=1.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=1.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.646 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2.py: 
-------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=10.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=10.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.646 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=50.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=50.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.656 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n0_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from 
sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=75.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=75.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.667 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n0_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=60.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=60.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.667 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import 
r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=50.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=50.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.656 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n1_n2_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | 7 | # Fetch the data 8 | data = datasets.fetch_california_housing() 9 | 10 | # Split into features (X) and target (y) 11 | X, y = data.data, data.target 12 | 13 | # Split into training and testing sets 14 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 15 | 16 | # Scale the features using StandardScaler 17 | scaler = StandardScaler() 18 | X_train_scaled = scaler.fit_transform(X_train) 19 | X_test_scaled = scaler.transform(X_test) 20 | 21 | # Instantiate PolynomialFeatures 22 | poly_features = PolynomialFeatures(degree=2, include_bias=False) 23 | X_train_poly = poly_features.fit_transform(X_train_scaled) 24 | X_test_poly = poly_features.transform(X_test_scaled) 25 | 26 | # Instantiate the model 27 | model = Ridge(alpha=50.0) 28 | 29 | # Train the model 30 | model.fit(X_train_poly, y_train) 31 | 32 | # Make predictions 33 | predictions = model.predict(X_test_poly) 34 | 35 | # Compute and display r^2 score 36 | new_r2_score = r2_score(y_test, predictions) 37 | print('r2_score:', round(new_r2_score, 3)) 38 | print('Insight: Using Ridge regression with alpha=50.0 after scaling the features using StandardScaler and adding polynomial features of degree 2 causes the r2_score to go from 0.656 to', round(new_r2_score, 3)) -------------------------------------------------------------------------------- /base_n1_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | import pandas as pd 7 | 8 | # Fetch the data 9 | 
data = pd.read_pickle("data.pkl") 10 | 11 | # Split into features (X) and target (y) 12 | X, y = data.data, data.target 13 | 14 | # Split into training and testing sets 15 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 16 | 17 | # Apply StandardScaler 18 | scaler = StandardScaler() 19 | X_train = scaler.fit_transform(X_train) 20 | X_test = scaler.transform(X_test) 21 | 22 | # Apply PolynomialFeatures 23 | poly = PolynomialFeatures(degree=2) 24 | X_train = poly.fit_transform(X_train) 25 | X_test = poly.transform(X_test) 26 | 27 | # Instantiate the model with RidgeCV for automatic alpha selection 28 | model = RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0]) 29 | 30 | # Train the model 31 | model.fit(X_train, y_train) 32 | 33 | # Make predictions 34 | predictions = model.predict(X_test) 35 | 36 | # Compute and display r^2 score 37 | new_r2_score = r2_score(y_test, predictions) 38 | print('r2_score:', new_r2_score) 39 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to RidgeCV with automatic alpha selection in base_n1_n2.py causes the r2_score to go from 0.6469096540341134 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RidgeCV 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=2)), 22 | ('model', RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0])) 23 | ]) 24 | 25 | # Train the model using the pipeline 26 | pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', new_r2_score) 34 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to RidgeCV with automatic alpha selection in base_n1_n2.py and using a pipeline for preprocessing in base_n1_n2_n0.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV, LassoCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = 
train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and LassoCV 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=2)), 22 | ('model', LassoCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0], max_iter=10000)) 23 | ]) 24 | 25 | # Train the model using the pipeline 26 | pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', new_r2_score) 34 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to LassoCV with automatic alpha selection in base_n1_n2_n0_n0.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RandomForestRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', RandomForestRegressor(n_estimators=100, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to RandomForestRegressor with 100 estimators in base_n1_n2_n0_n1.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor 8 | from sklearn.model_selection import GridSearchCV 9 | import pandas as pd 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RandomForestRegressor 21 | 
pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ('model', RandomForestRegressor(random_state=42)) 25 | ]) 26 | 27 | # Set up the parameter grid for GridSearchCV 28 | param_grid = { 29 | 'model__n_estimators': [100, 200, 300], 30 | 'model__max_depth': [None, 10, 20], 31 | 'model__min_samples_split': [2, 5, 10], 32 | } 33 | 34 | # Perform GridSearchCV to find the best hyperparameters 35 | grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1) 36 | grid_search.fit(X_train, y_train) 37 | 38 | # Make predictions 39 | predictions = grid_search.predict(X_test) 40 | 41 | # Compute and display r^2 score 42 | new_r2_score = r2_score(y_test, predictions) 43 | print('r2_score:', new_r2_score) 44 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to RandomForestRegressor with GridSearchCV for hyperparameter tuning in base_n1_n2_n0_n1_n0.py causes the r2_score to go from 0.7988136359729121 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and GradientBoostingRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=3, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from RandomForestRegressor with 100 estimators in base_n1_n2_n0_n1.py to GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py causes the r2_score to go from 0.7988136359729121 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into 
features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and ExtraTreesRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', ExtraTreesRegressor(n_estimators=300, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py to ExtraTreesRegressor with n_estimators=300 in base_n1_n2_n0_n1_n1_n0.py causes the r2_score to go from 0.8172819769351605 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor 8 | import pandas as pd 9 | from sklearn.model_selection import GridSearchCV 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and GradientBoostingRegressor 21 | pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ('model', GradientBoostingRegressor(random_state=42)) 25 | ]) 26 | 27 | # Set up hyperparameter grid for tuning 28 | param_grid = { 29 | 'model__n_estimators': [300, 400], 30 | 'model__learning_rate': [0.05, 0.1], 31 | 'model__max_depth': [3, 4], 32 | } 33 | 34 | # Perform GridSearchCV for hyperparameter tuning 35 | grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1) 36 | grid_search.fit(X_train, y_train) 37 | 38 | # Make predictions 39 | predictions = grid_search.predict(X_test) 40 | 41 | # Compute and display r^2 score 42 | new_r2_score = r2_score(y_test, predictions) 43 | print('r2_score:', new_r2_score) 44 | print("Insight: Changing the model from GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py to GradientBoostingRegressor with GridSearchCV for hyperparameter tuning in base_n1_n2_n0_n1_n1_n1.py causes the r2_score to go from 0.8172819769351605 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n1_n0.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from 
sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor 8 | import pandas as pd 9 | from sklearn.model_selection import GridSearchCV 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and GradientBoostingRegressor 21 | pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ]) 25 | 26 | X_train_transformed = pipeline.fit_transform(X_train) 27 | X_test_transformed = pipeline.transform(X_test) 28 | 29 | # Define base models for StackingRegressor 30 | base_models = [ 31 | ('random_forest', RandomForestRegressor(n_estimators=300, random_state=42)), 32 | ('gradient_boosting', GradientBoostingRegressor(n_estimators=400, learning_rate=0.05, max_depth=4, random_state=42)) 33 | ] 34 | 35 | # Create StackingRegressor with RidgeCV as the final estimator 36 | stacking_model = StackingRegressor(estimators=base_models, final_estimator=RidgeCV(), n_jobs=-1) 37 | stacking_model.fit(X_train_transformed, y_train) 38 | 39 | # Make predictions 40 | predictions = stacking_model.predict(X_test_transformed) 41 | 42 | # Compute and display r^2 score 43 | new_r2_score = r2_score(y_test, predictions) 44 | print('r2_score:', new_r2_score) 45 | print("Insight: Changing the model from GradientBoostingRegressor with GridSearchCV in base_n1_n2_n0_n1_n1_n1.py to a StackingRegressor with RandomForestRegressor and GradientBoostingRegressor as base models, and RidgeCV as the final estimator in base_n1_n2_n0_n1_n1_n1_n0.py causes the r2_score to go from 0.8331751744956455 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n1_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, PolynomialFeatures, and StackingRegressor 20 | estimators = [ 21 | ('rf', RandomForestRegressor(n_estimators=100, random_state=42)), 22 | ('gb', GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=3, random_state=42)) 23 | ] 24 | 25 | pipeline = Pipeline([ 26 | ('scaler', StandardScaler()), 27 | ('poly', PolynomialFeatures(degree=2)), 28 | ('model', StackingRegressor(estimators=estimators, final_estimator=RidgeCV(), cv=5)) 29 | ]) 30 | 31 | # Train the model using the pipeline 32 | pipeline.fit(X_train, y_train) 33 | 34 | 
# Make predictions 35 | predictions = pipeline.predict(X_test) 36 | 37 | # Compute and display r^2 score 38 | new_r2_score = r2_score(y_test, predictions) 39 | print('r2_score:', new_r2_score) 40 | print("Insight: Changing the model from GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 in base_n1_n2_n0_n1_n1.py to a StackingRegressor with RandomForestRegressor and GradientBoostingRegressor as base models, and RidgeCV as the final estimator in base_n1_n2_n0_n1_n1_n2.py causes the r2_score to go from 0.8172819769351605 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n1_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import RandomForestRegressor 8 | from sklearn.model_selection import GridSearchCV 9 | import pandas as pd 10 | 11 | # Fetch the data 12 | data = pd.read_pickle("data.pkl") 13 | 14 | # Split into features (X) and target (y) 15 | X, y = data.data, data.target 16 | 17 | # Split into training and testing sets 18 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 19 | 20 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RandomForestRegressor 21 | pipeline = Pipeline([ 22 | ('scaler', StandardScaler()), 23 | ('poly', PolynomialFeatures(degree=2)), 24 | ('model', RandomForestRegressor(random_state=42)) 25 | ]) 26 | 27 | # Set up GridSearchCV for hyperparameter tuning 28 | param_grid = { 29 | 'model__n_estimators': [100, 200, 300], 30 | 'model__max_depth': [2, 3, 4], 31 | 'model__min_samples_split': [2, 3, 4] 32 | } 33 | 34 | grid_search = GridSearchCV(pipeline, param_grid, scoring='r2', cv=5) 35 | 36 | # Train the model using the pipeline and GridSearchCV 37 | grid_search.fit(X_train, y_train) 38 | 39 | # Make predictions 40 | predictions = grid_search.predict(X_test) 41 | 42 | # Compute and display r^2 score 43 | new_r2_score = r2_score(y_test, predictions) 44 | print('r2_score:', new_r2_score) 45 | print("Insight: Changing the model from RandomForestRegressor with 100 estimators in base_n1_n2_n0_n1.py to RandomForestRegressor with GridSearchCV for hyperparameter tuning in base_n1_n2_n0_n1_n2.py causes the r2_score to go from 0.7988136359729121 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n0_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | from sklearn.ensemble import GradientBoostingRegressor 8 | import pandas as pd 9 | 10 | # Fetch the data 11 | data = pd.read_pickle("data.pkl") 12 | 13 | # Split into features (X) and target (y) 14 | X, y = data.data, data.target 15 | 16 | # Split into training and testing sets 17 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 18 | 19 | # Create a pipeline with StandardScaler, 
PolynomialFeatures, and GradientBoostingRegressor 20 | pipeline = Pipeline([ 21 | ('scaler', StandardScaler()), 22 | ('poly', PolynomialFeatures(degree=2)), 23 | ('model', GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=2, random_state=42)) 24 | ]) 25 | 26 | # Train the model using the pipeline 27 | pipeline.fit(X_train, y_train) 28 | 29 | # Make predictions 30 | predictions = pipeline.predict(X_test) 31 | 32 | # Compute and display r^2 score 33 | new_r2_score = r2_score(y_test, predictions) 34 | print('r2_score:', new_r2_score) 35 | print("Insight: Changing the model from RidgeCV with automatic alpha selection in base_n1_n2_n0.py to GradientBoostingRegressor with n_estimators=200, learning_rate=0.1, and max_depth=2 in base_n1_n2_n0_n2.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n1.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import make_pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and RidgeCV 19 | pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3), RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0])) 20 | 21 | # Train the model 22 | pipe.fit(X_train, y_train) 23 | 24 | # Make predictions 25 | predictions = pipe.predict(X_test) 26 | 27 | # Compute and display r^2 score 28 | new_r2_score = r2_score(y_test, predictions) 29 | print('r2_score:', new_r2_score) 30 | print("Insight: Changing the model from Ridge with alpha=1.0 in base_n1.py to RidgeCV with automatic alpha selection in base_n1_n2.py and using a pipeline for preprocessing in base_n1_n2_n0.py, and increasing the degree of PolynomialFeatures to 3 in base_n1_n2_n1.py causes the r2_score to go from 0.656 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n1_n2_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import RidgeCV 4 | from sklearn.metrics import r2_score 5 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 6 | from sklearn.pipeline import Pipeline 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline for preprocessing and model training 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=3)), 22 | ('model', RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0])) 23 | ]) 24 | 25 | # Train the model using the pipeline 26 | 
pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', new_r2_score) 34 | print("Insight: Changing the degree of PolynomialFeatures from 2 in base_n1_n2.py to 3 and using a pipeline for preprocessing in base_n1_n2_n2.py causes the r2_score to go from 0.6558501679023107 to {:.3f}".format(new_r2_score)) -------------------------------------------------------------------------------- /base_n2.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import Ridge 4 | from sklearn.preprocessing import StandardScaler, PolynomialFeatures 5 | from sklearn.pipeline import Pipeline 6 | from sklearn.metrics import r2_score 7 | import pandas as pd 8 | 9 | # Fetch the data 10 | data = pd.read_pickle("data.pkl") 11 | 12 | # Split into features (X) and target (y) 13 | X, y = data.data, data.target 14 | 15 | # Split into training and testing sets 16 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 17 | 18 | # Create a pipeline with StandardScaler, PolynomialFeatures, and Ridge 19 | pipeline = Pipeline([ 20 | ('scaler', StandardScaler()), 21 | ('poly', PolynomialFeatures(degree=3)), 22 | ('model', Ridge(alpha=10.0)) 23 | ]) 24 | 25 | # Train the pipeline 26 | pipeline.fit(X_train, y_train) 27 | 28 | # Make predictions 29 | predictions = pipeline.predict(X_test) 30 | 31 | # Compute and display r^2 score 32 | new_r2_score = r2_score(y_test, predictions) 33 | print('r2_score:', round(new_r2_score, 3)) 34 | 35 | print("Insight: Changing the model from LinearRegression in base.py to Ridge with alpha=10.0, adding StandardScaler, and applying PolynomialFeatures with degree=3 in base_n2.py causes the r2_score to go from 0.575 to", round(new_r2_score, 3)) -------------------------------------------------------------------------------- /data.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/qrdlgit/graph-of-thoughts/56b7a25163fe2b6ff425577dc5ec62b9bdda8fa5/data.pkl -------------------------------------------------------------------------------- /get_best_model.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import os 3 | import signal 4 | import sys 5 | import oai 6 | import time 7 | import traceback 8 | 9 | def signal_handler(sig, frame): 10 | print('You interrupted the process!') 11 | sys.exit(0) 12 | 13 | signal.signal(signal.SIGINT, signal_handler) 14 | insights = [] 15 | 16 | def get_best_model(base_source_filename, prev_best): 17 | global insights 18 | # Read the contents of the base source file 19 | with open(base_source_filename, 'r') as f: 20 | base_source_contents = f.read() 21 | 22 | best_score = -1 23 | best_script_filename = None 24 | # Loop over 3 iterations 25 | for i in range(3): 26 | flag = True 27 | time_wait = 2 28 | while flag: 29 | try: 30 | insight = "" 31 | # Generate new filename 32 | new_script_filename = f'{base_source_filename[0:-3]}_n{i}.py' 33 | insight_text = "\n".join(insights) 34 | # Generate prompt and get response 35 | prompt = f"Here are some insights you previously made:\n"+insight_text+f"\n\nPlease make a significant attempt to improve the r2_score metric of the following code. 
Utilize the insights previously made, but do more than just repeat them and making minor hyperparameter adjustments. The code should output two lines. One line should be 'r2_score: <new r2_score>' and another single line saying 'Insight: <insight> causes the r2_score to go from {prev_best} to <new r2_score>'. Make sure you include both filenames, the current and the new source in the output. Make sure you always use at least 3 significant decimal digits. Outside the new code, do not provide an explanation, just the code and no additional text.\n\n" 36 | print(prompt) 37 | prompt = prompt + base_source_contents 38 | response = oai.get_response(prompt) 39 | 40 | # Remove boundary characters if present 41 | response = response.replace('```', '') 42 | 43 | # Write response to new file 44 | with open(new_script_filename, 'w') as f: 45 | f.write(response) 46 | 47 | print(f"executing {new_script_filename},", end=" ") 48 | # Execute the new script and get output 49 | result = subprocess.run(['python3', new_script_filename], capture_output=True, text=True) 50 | 51 | # Extract r2_score from the output 52 | output_lines = result.stdout.split('\n') 53 | for line in output_lines: 54 | if 'r2_score:' in line: 55 | score = float(line.split(':')[-1].strip()) 56 | if 'Insight' in line: 57 | insight = line 58 | 59 | # Update best score and script if this score is better 60 | if score > best_score: 61 | best_score = score 62 | best_script_filename = new_script_filename 63 | 64 | print(f"r2_score: {score} {insight}") 65 | flag = False 66 | except Exception as e: 67 | time_wait = time_wait * 2 68 | print(f"error with {new_script_filename}, waiting {time_wait} and retrying") 69 | traceback.print_exc() 70 | time.sleep(time_wait) 71 | if insight != '': 72 | insights = insights + [insight] 73 | 74 | # Recursive call 75 | if best_script_filename: 76 | get_best_model(best_script_filename, best_score) 77 | 78 | # Run the recursive function 79 | base_source_filename = 'base.py' 80 | get_best_model(base_source_filename, 0.575) 81 | -------------------------------------------------------------------------------- /oai.py: -------------------------------------------------------------------------------- 1 | import openai 2 | import requests 3 | import json 4 | import os 5 | import shutil 6 | 7 | # Set up the OpenAI API 8 | 9 | def get_response(prompt): 10 | openai.api_key = 'passhere' 11 | response = openai.ChatCompletion.create( 12 | model="gpt-4-0314", 13 | messages=[ 14 | {"role": "user", "content": prompt}, 15 | ], 16 | temperature=0.5, 17 | ) 18 | return response.choices[0]['message']['content'] 19 | 20 | 21 | 22 | --------------------------------------------------------------------------------
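Appendix (not part of the repo): the README suggests appending a set of selected techniques for GPT4 to try. Below is a minimal sketch of what that prompt augmentation might look like; the technique list is illustrative, not taken from the repo.

```
# Illustrative sketch: augment the improvement prompt with a reusable,
# mostly problem-agnostic technique library. The technique strings are examples.
TECHNIQUES = [
    "try gradient-boosted trees before linear models on tabular data",
    "use cross-validated hyperparameter search instead of fixed values",
    "wrap preprocessing in a Pipeline so it is fit only on the training split",
]

def with_techniques(prompt: str) -> str:
    # Append the technique library as suggestions at the end of the prompt.
    suggestions = "\n".join(f"- {t}" for t in TECHNIQUES)
    return prompt + "\n\nTechniques you might consider trying:\n" + suggestions + "\n"
```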