├── tests
│   └── test_sklearn_wrapper.py
├── data
│   ├── generated_code
│   │   ├── cmc_v3_1_code.txt
│   │   ├── cmc_v4_1_code.txt
│   │   ├── airlines_v3_1_code.txt
│   │   ├── airlines_v3_3_code.txt
│   │   ├── airlines_v4_1_code.txt
│   │   ├── airlines_v4_3_code.txt
│   │   ├── airlines_v4_4_code.txt
│   │   ├── breast-w_v3_1_code.txt
│   │   ├── breast-w_v3_3_code.txt
│   │   ├── breast-w_v3_4_code.txt
│   │   ├── breast-w_v4_3_code.txt
│   │   ├── eucalyptus_v3_2_code.txt
│   │   ├── eucalyptus_v4_2_code.txt
│   │   ├── tic-tac-toe_v3_2_code.txt
│   │   ├── tic-tac-toe_v3_3_code.txt
│   │   ├── kaggle_spaceship-titanic_v3_3_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v3_2_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v3_3_code.txt
│   │   ├── kaggle_spaceship-titanic_v4_4_code.txt
│   │   ├── balance-scale_v4_1_code.txt
│   │   ├── eucalyptus_v4_3_code.txt
│   │   ├── kaggle_playground-series-s3e12_v4_1_code.txt
│   │   ├── airlines_v3_0_code.txt
│   │   ├── airlines_v3_2_code.txt
│   │   ├── diabetes_v4_2_code.txt
│   │   ├── kaggle_playground-series-s3e12_v4_2_code.txt
│   │   ├── kaggle_pharyngitis_v3_2_code.txt
│   │   ├── kaggle_playground-series-s3e12_v4_0_code.txt
│   │   ├── breast-w_v4_1_code.txt
│   │   ├── balance-scale_v3_1_code.txt
│   │   ├── airlines_v4_0_code.txt
│   │   ├── eucalyptus_v3_0_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v3_1_code.txt
│   │   ├── balance-scale_v3_5_code.txt
│   │   ├── credit-g_v4_4_code.txt
│   │   ├── airlines_v3_4_code.txt
│   │   ├── pc1_v3_1_code.txt
│   │   ├── diabetes_v4_3_code.txt
│   │   ├── kaggle_spaceship-titanic_v4_3_code.txt
│   │   ├── kaggle_pharyngitis_v3_0_code.txt
│   │   ├── credit-g_v3_1_code.txt
│   │   ├── airlines_v4_2_code.txt
│   │   ├── credit-g_v3_2_code.txt
│   │   ├── breast-w_v4_4_code.txt
│   │   ├── diabetes_v4_4_code.txt
│   │   ├── breast-w_v3_2_code.txt
│   │   ├── eucalyptus_v3_4_code.txt
│   │   ├── eucalyptus_v4_4_code.txt
│   │   ├── diabetes_v4_0_code.txt
│   │   ├── cmc_v3_0_code.txt
│   │   ├── balance-scale_v3_3_code.txt
│   │   ├── pc1_v3_4_code.txt
│   │   ├── eucalyptus_v3_1_code.txt
│   │   ├── cmc_v4_3_code.txt
│   │   ├── credit-g_v4_0_code.txt
│   │   ├── kaggle_pharyngitis_v4_1_code.txt
│   │   ├── pc1_v4_0_code.txt
│   │   ├── breast-w_v3_0_code.txt
│   │   ├── credit-g_v4_1_code.txt
│   │   ├── kaggle_playground-series-s3e12_v3_0_code.txt
│   │   ├── pc1_v3_2_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v3_0_code.txt
│   │   ├── credit-g_v3_4_code.txt
│   │   ├── diabetes_v4_1_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v3_4_code.txt
│   │   ├── balance-scale_v4_4_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v4_3_code.txt
│   │   ├── kaggle_playground-series-s3e12_v3_1_code.txt
│   │   ├── credit-g_v3_0_code.txt
│   │   ├── balance-scale_v4_2_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v4_1_code.txt
│   │   ├── balance-scale_v4_0_code.txt
│   │   ├── pc1_v3_0_code.txt
│   │   ├── diabetes_v3_0_code.txt
│   │   ├── diabetes_v3_2_code.txt
│   │   ├── diabetes_v3_1_code.txt
│   │   ├── kaggle_playground-series-s3e12_v3_3_code.txt
│   │   ├── kaggle_playground-series-s3e12_v4_3_code.txt
│   │   ├── kaggle_spaceship-titanic_v4_0_code.txt
│   │   ├── kaggle_spaceship-titanic_v4_2_code.txt
│   │   ├── kaggle_pharyngitis_v4_2_code.txt
│   │   ├── balance-scale_v3_0_code.txt
│   │   ├── tic-tac-toe_v3_4_code.txt
│   │   ├── eucalyptus_v4_0_code.txt
│   │   ├── tic-tac-toe_v3_0_code.txt
│   │   ├── breast-w_v4_0_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v3_0_code.txt
│   │   ├── kaggle_pharyngitis_v3_4_code.txt
│   │   ├── kaggle_pharyngitis_v4_4_code.txt
│   │   ├── pc1_v4_1_code.txt
│   │   ├── balance-scale_v4_3_code.txt
│   │   ├── kaggle_spaceship-titanic_v3_0_code.txt
│   │   ├── cmc_v4_2_code.txt
│   │   ├── balance-scale_v3_4_code.txt
│   │   ├── kaggle_playground-series-s3e12_v4_4_code.txt
│   │   ├── kaggle_pharyngitis_v3_1_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v4_0_code.txt
│   │   ├── eucalyptus_v4_1_code.txt
│   │   ├── kaggle_spaceship-titanic_v3_1_code.txt
│   │   ├── cmc_v3_2_code.txt
│   │   ├── kaggle_pharyngitis_v4_3_code.txt
│   │   ├── breast-w_v4_2_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v4_2_code.txt
│   │   ├── kaggle_playground-series-s3e12_v3_2_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v3_1_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v3_3_code.txt
│   │   ├── cmc_v4_0_code.txt
│   │   ├── credit-g_v4_2_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v3_4_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v4_1_code.txt
│   │   ├── pc1_v4_4_code.txt
│   │   ├── pc1_v4_2_code.txt
│   │   ├── tic-tac-toe_v3_1_code.txt
│   │   ├── kaggle_spaceship-titanic_v3_2_code.txt
│   │   ├── diabetes_v3_3_code.txt
│   │   ├── diabetes_v3_4_code.txt
│   │   ├── tic-tac-toe_v4_2_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v4_4_code.txt
│   │   ├── balance-scale_v3_2_code.txt
│   │   ├── kaggle_playground-series-s3e12_v3_4_code.txt
│   │   ├── credit-g_v3_3_code.txt
│   │   ├── jungle_chess_2pcs_raw_endgame_complete_v4_0_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v4_4_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v3_2_code.txt
│   │   ├── cmc_v4_4_code.txt
│   │   ├── kaggle_spaceship-titanic_v3_4_code.txt
│   │   ├── cmc_v3_3_code.txt
│   │   ├── pc1_v3_3_code.txt
│   │   ├── kaggle_pharyngitis_v3_3_code.txt
│   │   ├── pc1_v4_3_code.txt
│   │   ├── tic-tac-toe_v4_4_code.txt
│   │   ├── cmc_v3_4_code.txt
│   │   ├── kaggle_spaceship-titanic_v4_1_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v4_2_code.txt
│   │   ├── tic-tac-toe_v4_1_code.txt
│   │   ├── tic-tac-toe_v4_0_code.txt
│   │   ├── eucalyptus_v3_3_code.txt
│   │   ├── credit-g_v4_3_code.txt
│   │   ├── kaggle_pharyngitis_v4_0_code.txt
│   │   ├── kaggle_health-insurance-lead-prediction-raw-data_v4_3_code.txt
│   │   ├── tic-tac-toe_v4_3_code.txt
│   │   ├── airlines_v3_4_prompt.txt
│   │   ├── airlines_v4_4_prompt.txt
│   │   ├── airlines_v3_2_prompt.txt
│   │   ├── airlines_v4_2_prompt.txt
│   │   ├── airlines_v3_0_prompt.txt
│   │   ├── airlines_v3_1_prompt.txt
│   │   ├── airlines_v3_3_prompt.txt
│   │   ├── airlines_v4_0_prompt.txt
│   │   ├── airlines_v4_1_prompt.txt
│   │   ├── airlines_v4_3_prompt.txt
│   │   └── balance-scale_v3_0_prompt.txt
│   ├── .DS_Store
│   ├── evaluations.zip
│   └── dataset_descriptions
│       ├── .DS_Store
│       ├── openml_breast-w.txt
│       ├── openml_tic-tac-toe.txt
│       ├── kaggle_health-insurance-lead-prediction-raw-data.txt
│       ├── openml_balance-scale.txt
│       ├── kaggle_pharyngitis.txt
│       ├── openml_jungle_chess_2pcs_raw_endgame_complete.txt
│       ├── kaggle_playground-series-s3e12.txt
│       ├── openml_diabetes.txt
│       ├── openml_cmc.txt
│       └── kaggle_spaceship-titanic.txt
├── caafe
│   ├── __init__.py
│   └── metrics.py
├── LICENSE.txt
├── setup.py
└── scripts
    ├── run_classifiers_script.py
    └── generate_features_script.py

/tests/test_sklearn_wrapper.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/cmc_v3_1_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/cmc_v4_1_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/airlines_v3_1_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/airlines_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/airlines_v4_1_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/airlines_v4_3_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/airlines_v4_4_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/breast-w_v3_1_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/breast-w_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/breast-w_v3_4_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/breast-w_v4_3_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/eucalyptus_v3_2_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/eucalyptus_v4_2_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/tic-tac-toe_v3_2_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/tic-tac-toe_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/kaggle_spaceship-titanic_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/caafe/__init__.py:
--------------------------------------------------------------------------------
1 | from .sklearn_wrapper import CAAFEClassifier
2 | 
--------------------------------------------------------------------------------
/data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v3_2_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noahho/CAAFE/HEAD/data/.DS_Store
--------------------------------------------------------------------------------
/data/evaluations.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noahho/CAAFE/HEAD/data/evaluations.zip
--------------------------------------------------------------------------------
/data/dataset_descriptions/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noahho/CAAFE/HEAD/data/dataset_descriptions/.DS_Store
--------------------------------------------------------------------------------
/data/dataset_descriptions/openml_breast-w.txt:
--------------------------------------------------------------------------------
1 | **Breast Cancer Wisconsin (Original) Data Set.** Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The target feature records the prognosis (malignant or benign).
--------------------------------------------------------------------------------
/data/dataset_descriptions/openml_tic-tac-toe.txt:
--------------------------------------------------------------------------------
1 | **Tic-Tac-Toe Endgame database**
2 | This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first. The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three-in-a-row").
--------------------------------------------------------------------------------
/data/generated_code/kaggle_spaceship-titanic_v4_4_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # CryoSleep and VIP interaction
3 | # Usefulness: Passengers who are both in CryoSleep and VIP might have a different probability of being transported.
4 | # Input samples: 'CryoSleep': [False, False, False], 'VIP': [False, False, False]
5 | df['CryoSleep_VIP'] = df['CryoSleep'].astype(int) & df['VIP'].astype(int)
6 | 
--------------------------------------------------------------------------------
/data/generated_code/balance-scale_v4_1_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Right moment
3 | # Usefulness: This feature represents the product of right-weight and right-distance, which helps in determining the balance scale tip according to the problem description.
4 | # Input samples: 'right-weight': [5.0, 1.0, 5.0], 'right-distance': [3.0, 1.0, 5.0]
5 | df['right_moment'] = df['right-weight'] * df['right-distance']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/eucalyptus_v4_3_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Survival_rate
3 | # Usefulness: Survival rate can be an important factor in determining the utility of a tree for soil conservation, as trees with higher survival rates may provide better long-term coverage and erosion control.
4 | # Input samples: 'Surv': [nan, 75.0, 75.0], 'Rep': [2.0, 2.0, 3.0]
5 | df['Survival_rate'] = df['Surv'] / df['Rep']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/kaggle_playground-series-s3e12_v4_1_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Feature name: gravity_ph_ratio
3 | # Usefulness: This feature combines specific gravity and pH, which may provide additional information about the urine's ability to form calcium oxalate crystals.
4 | # Input samples: 'gravity': [1.02, 1.02, 1.02], 'ph': [5.53, 5.27, 5.36]
5 | df['gravity_ph_ratio'] = df['gravity'] / df['ph']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/airlines_v3_0_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # (Distance)
3 | # Usefulness: Distance between airports can be a useful feature for predicting flight delays. Longer distances may have more potential for delays due to weather, air traffic control, etc.
4 | # Input samples: 'AirportFrom': [225.0, 39.0, 5.0], 'AirportTo': [11.0, 7.0, 60.0]
5 | df['Distance'] = ((df['AirportFrom'] - df['AirportTo'])**2)**0.5
--------------------------------------------------------------------------------
/data/generated_code/airlines_v3_2_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # (Distance)
3 | # Usefulness: Distance between airports can be a useful feature for predicting flight delays. Longer distances may have more potential for delays due to weather, air traffic control, etc.
4 | # Input samples: 'AirportFrom': [225.0, 39.0, 5.0], 'AirportTo': [11.0, 7.0, 60.0]
5 | df['Distance'] = ((df['AirportFrom'] - df['AirportTo'])**2)**0.5
--------------------------------------------------------------------------------
/data/generated_code/diabetes_v4_2_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Pedigree_Age Interaction
3 | # Usefulness: This interaction feature can help capture the combined effect of diabetes pedigree function and age on diabetes risk, as both factors can influence the likelihood of diabetes.
4 | # Input samples: 'pedi': [0.8, 0.29, 0.32], 'age': [21.0, 42.0, 55.0]
5 | df['pedigree_age_interaction'] = df['pedi'] * df['age']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/kaggle_playground-series-s3e12_v4_2_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # (Urea-to-calcium ratio)
3 | # Usefulness: Urine analysis can show elevated levels of urea and calcium in patients with kidney stones. This ratio can be an indicator of the likelihood of kidney stone formation.
4 | # Input samples: 'urea': [398.0, 178.0, 364.0], 'calc': [3.16, 3.04, 7.31]
5 | df['urea_calc_ratio'] = df['urea'] / df['calc']
--------------------------------------------------------------------------------
/data/generated_code/kaggle_pharyngitis_v3_2_code.txt:
--------------------------------------------------------------------------------
1 | # ('age_group', 'The patient age group.')
2 | # Usefulness: Younger children are more likely to have GAS infection. This feature can help in predicting the outcome of the RADT test.
3 | # Input samples: 'age_y': [4.2, 4.5, 8.5, 5.7, 6.3, 6.8, 9.0, 6.1, 7.5, 10.7]
4 | df['age_group'] = pd.cut(df['age_y'], bins=[0, 5, 10, 100], labels=['0-5', '5-10', '10+']).astype(str)
--------------------------------------------------------------------------------
/data/generated_code/kaggle_playground-series-s3e12_v4_0_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Feature name: "gravity_ph_product"
3 | # Usefulness: Combining gravity and pH values can provide useful information about the urine environment, which might be related to the formation of calcium oxalate crystals.
4 | # Input samples: 'gravity': [1.02, 1.02, 1.02], 'ph': [7.61, 5.56, 5.47]
5 | df['gravity_ph_product'] = df['gravity'] * df['ph']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/breast-w_v4_1_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Feature: Adhesion_Mitoses_Ratio
3 | # Usefulness: This feature represents the ratio between Marginal_Adhesion and Mitoses, which can help in identifying the relationship between these two features and the prognosis.
4 | # Input samples: 'Marginal_Adhesion': [1.0, 10.0, 1.0], 'Mitoses': [1.0, 7.0, 1.0]
5 | df['Adhesion_Mitoses_Ratio'] = df['Marginal_Adhesion'] / df['Mitoses']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/balance-scale_v3_1_code.txt:
--------------------------------------------------------------------------------
1 | # ('right-weight-times-distance', 'Product of right weight and right distance')
2 | # Usefulness: The product of right weight and right distance can be an important factor in determining the balance of the scale.
3 | # Input samples: 'right-weight': [5.0, 1.0, 5.0], 'right-distance': [3.0, 1.0, 5.0], ...
4 | df['right-weight-times-distance'] = df['right-weight'] * df['right-distance']
--------------------------------------------------------------------------------
/data/dataset_descriptions/kaggle_health-insurance-lead-prediction-raw-data.txt:
--------------------------------------------------------------------------------
1 | For the data and objective, it is evident that this is a Binary Classification Problem data in the Tabular Data format.
2 | A policy is recommended to a person when they land on an insurance website, and if the person chooses to fill up a form to apply, it is considered a Positive outcome (Classified as lead). All other conditions are considered Zero outcomes.
--------------------------------------------------------------------------------
/data/generated_code/airlines_v4_0_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # FlightLengthCategory (Categorical representation of flight length)
3 | # Usefulness: The length of the flight might have an impact on the delay, as longer flights might have more chances of facing issues that cause delays.
4 | # Input samples: 'Length': [30.0, 130.0, 166.0]
5 | import numpy as np
6 | df['FlightLengthCategory'] = pd.cut(df['Length'], bins=[0, 100, 200, np.inf], labels=[1, 2, 3])
7 | 
--------------------------------------------------------------------------------
/data/generated_code/eucalyptus_v3_0_code.txt:
--------------------------------------------------------------------------------
1 | # ('Rainfall+Altitude', 'Adding Rainfall and Altitude')
2 | # Usefulness: This column captures the sum of Rainfall and Altitude which may be important in predicting the utility of a species as the location of the site (Rainfall and Altitude) could affect the growth of the species differently.
3 | # Input samples: 'Rainfall': [1300.0, 850.0, 1080.0], 'Altitude': [150.0, 100.0, 180.0]
4 | df['Rainfall+Altitude'] = df['Rainfall'] + df['Altitude']
--------------------------------------------------------------------------------
/data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v3_1_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # (Feature name and description)
3 | # Usefulness: This feature calculates the difference between the strength of the white piece and the strength of the black piece, giving an idea of which color has an advantage in the game.
4 | # Input samples: 'white_piece0_strength': [7.0, 0.0, 4.0], 'black_piece0_strength': [0.0, 4.0, 0.0]
5 | df['strength_diff'] = df['white_piece0_strength'] - df['black_piece0_strength']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/balance-scale_v3_5_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # (Feature name and description)
3 | # Usefulness: This feature calculates the difference between the left and right distances and adds the information to the dataset. This can be useful because it can help the classifier understand the distribution of weight on each side of the scale.
4 | # Input samples: 'left-distance': [1.0, 3.0, 3.0], 'right-distance': [5.0, 4.0, 4.0]
5 | df['distance_diff'] = df['left-distance'] - df['right-distance']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/credit-g_v4_4_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Feature: Employment stability
3 | # Usefulness: This feature checks if the individual has a stable employment, which can help in understanding their ability to repay the credit.
4 | # Input samples: 'employment': [4, 0, 2]
5 | df['employment_stability'] = df['employment'].apply(lambda x: 1 if x >= 3 else 0)
6 | 
7 | # Dropping 'employment' column as 'employment_stability' provides a more generalized representation of employment status
8 | df.drop(columns=['employment'], inplace=True)
--------------------------------------------------------------------------------
/data/generated_code/airlines_v3_4_code.txt:
--------------------------------------------------------------------------------
1 | # (Distance, numerical)
2 | # Usefulness: Longer flights may be more prone to delays due to various reasons such as weather or air traffic control
3 | # Input samples: 'AirportFrom': [63.0, 30.0, 80.0, 0.0, 38.0, 136.0, 22.0, 71.0, 197.0, 2.0], 'AirportTo': [6.0, 77.0, 0.0, 61.0, 1.0, 3.0, 59.0, 6.0, 61.0, 45.0]
4 | import numpy as np
5 | df['Distance'] = np.sqrt((df['AirportFrom'] - df['AirportTo'])**2)
6 | df.drop(columns=['AirportFrom', 'AirportTo'], inplace=True)# (AirlineDelayRate, numerical)
--------------------------------------------------------------------------------
/data/generated_code/pc1_v3_1_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # ('total_Op'/'total_Opnd' ratio)
3 | # Usefulness: This ratio gives an idea about the proportion of operators to operands in the code.
4 | # Input samples: 'total_Op': [62.0, 20.0, 160.0], 'total_Opnd': [61.0, 17.0, 145.0]
5 | df['Op_Opnd_Ratio'] = df['total_Op'] / df['total_Opnd']
6 | # ('lOComment' / 'lOCode')
7 | # Usefulness: This feature captures the proportion of lines of code that are comments.
8 | # Input samples: 'lOComment': [5.0, 1.0, 7.0], 'lOCode': [18.0, 9.0, 35.0]
9 | df['lOComment_ratio'] = df['lOComment'] / df['lOCode']
--------------------------------------------------------------------------------
/data/generated_code/diabetes_v4_3_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # BMI age interaction
3 | # Usefulness: As age increases, the impact of BMI on diabetes risk may also increase.
4 | # Input samples: 'mass': [34.5, 26.4, 37.2], 'age': [40.0, 21.0, 45.0]
5 | df['bmi_age_interaction'] = df['mass'] * df['age']
6 | 
7 | # Glucose and blood pressure interaction
8 | # Usefulness: Higher glucose and blood pressure levels may indicate a higher risk of diabetes.
9 | # Input samples: 'plas': [117.0, 134.0, 102.0], 'pres': [88.0, 58.0, 74.0]
10 | df['glucose_pressure_interaction'] = df['plas'] * df['pres']
11 | 
--------------------------------------------------------------------------------
/data/generated_code/kaggle_spaceship-titanic_v4_3_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Feature: TotalExpense (sum of expenses in RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck)
3 | # Usefulness: Passengers with higher expenses may have different probabilities of being transported due to their activity patterns and locations on the ship.
4 | # Input samples: 'RoomService': [0.0, 0.0, 672.0], 'FoodCourt': [0.0, 0.0, 0.0], 'ShoppingMall': [0.0, 0.0, 0.0], 'Spa': [0.0, 0.0, 0.0], 'VRDeck': [0.0, 0.0, 20.0]
5 | df['TotalExpense'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
6 | 
--------------------------------------------------------------------------------
/data/generated_code/kaggle_pharyngitis_v3_0_code.txt:
--------------------------------------------------------------------------------
1 | # ('head_and_neck_symptoms', 'Presence of head and neck symptoms')
2 | # Usefulness: Presence of head and neck symptoms may indicate the presence of GAS infection and help predict a positive RADT result.
3 | # Input samples: 'headache': [0.0, 0.0, 0.0], 'tonsillarswelling': [1.0, 0.0, 0.0], 'exudate': [1.0, 1.0, 0.0], 'erythema': [1.0, 0.0, 1.0], 'petechiae': [0.0, 0.0, 0.0], 'radt': [0.0, 1.0, 0.0]
4 | df['head_and_neck_symptoms'] = ((df['headache'] == 1) | (df['tonsillarswelling'] == 1) | (df['exudate'] == 1) | (df['erythema'] == 1) | (df['petechiae'] == 1)).astype(int)
--------------------------------------------------------------------------------
/data/generated_code/credit-g_v3_1_code.txt:
--------------------------------------------------------------------------------
1 | # Explanation why the column 'own_telephone' is dropped
2 | # This column is dropped as it is not expected to have a significant impact on the classification of "class" according to dataset description and attributes.
3 | df.drop(columns=['own_telephone'], inplace=True)
4 | 
5 | # (Feature name and description)
6 | # Usefulness: This feature creates a binary indicator variable for customers who have a credit amount greater than the median credit amount.
7 | # Input samples: 'credit_amount': [1549.0, 7476.0, 2442.0]
8 | df['above_median_credit'] = df['credit_amount'] > df['credit_amount'].median()
--------------------------------------------------------------------------------
/data/generated_code/airlines_v4_2_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # FlightLengthCategory
3 | # Usefulness: This feature categorizes the flight length into short, medium, and long-haul flights, which might be useful in predicting delays as different flight lengths might have different patterns of delays.
4 | # Input samples: 'Length': [134.0, 244.0, 380.0]
5 | def categorize_flight_length(length):
6 |     if length < 200:
7 |         return 0  # Short-haul
8 |     elif 200 <= length < 500:
9 |         return 1  # Medium-haul
10 |     else:
11 |         return 2  # Long-haul
12 | 
13 | df['FlightLengthCategory'] = df['Length'].apply(categorize_flight_length)
14 | 
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Copyright by Noah Hollmann, Samuel Müller and Frank Hutter 2023.
2 | 
3 | Licensed under the Apache License, Version 2.0 (the "License");
4 | you may not use this file except in compliance with the License.
5 | You may obtain a copy of the License at
6 | 
7 |     http://www.apache.org/licenses/LICENSE-2.0
8 | 
9 | Unless required by applicable law or agreed to in writing, software
10 | distributed under the License is distributed on an "AS IS" BASIS,
11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | See the License for the specific language governing permissions and
13 | limitations under the License.
14 | 
--------------------------------------------------------------------------------
/data/dataset_descriptions/openml_balance-scale.txt:
--------------------------------------------------------------------------------
1 | **Balance Scale Weight & Distance Database**
2 | This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is the greater of (left-distance * left-weight) and (right-distance * right-weight). If they are equal, it is balanced.
3 | 
4 | Attribute description
5 | The attributes are the left weight, the left distance, the right weight, and the right distance.
--------------------------------------------------------------------------------
/data/dataset_descriptions/kaggle_pharyngitis.txt:
--------------------------------------------------------------------------------
1 | Group A streptococcus (GAS) infection is a major cause of pediatric pharyngitis, and infection with this organism requires appropriate antimicrobial therapy.
2 | 
3 | There is controversy as to whether physicians can rely on signs and symptoms to select pediatric patients with pharyngitis who should undergo rapid antigen detection testing (RADT) for GAS.
4 | 
5 | Our objective was to evaluate the validity of signs and symptoms in the selective testing of children with pharyngitis.
6 | 
7 | Now, let's use machine learning to analyze whether a diagnosis can be made from the child's symptoms and signs.
8 | Can we predict RADT positive?
--------------------------------------------------------------------------------
/data/generated_code/credit-g_v3_2_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # ('credit_per_duration', 'Useful to know how much credit is granted per month, as this can indicate if the customer is overextending themselves or not.')
3 | # Input samples: {'duration': [48.0, 12.0, 24.0], 'credit_amount': [3609.0, 3331.0, 1311.0]}
4 | df['credit_per_duration'] = df['credit_amount'] / df['duration']
5 | # ('credit_per_person', 'Useful to know how much credit is granted per person, as this can indicate if the customer is overextending themselves or not.')
6 | # Input samples: {'credit_amount': [3609.0, 3331.0, 1311.0], 'num_dependents': [1.0, 1.0, 1.0]}
7 | df['credit_per_person'] = df['credit_amount'] / (df['num_dependents'] + 1)
--------------------------------------------------------------------------------
/data/generated_code/breast-w_v4_4_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Feature: Nuclei_Features_Avg
3 | # Usefulness: Combining Bare_Nuclei, Bland_Chromatin, and Normal_Nucleoli into a single feature as they all represent characteristics of cell nuclei.
4 | # Input samples: 'Bare_Nuclei': [2.0, 10.0, 3.0], 'Bland_Chromatin': [2.0, 4.0, 4.0], 'Normal_Nucleoli': [1.0, 8.0, 10.0] 5 | df['Nuclei_Features_Avg'] = (df['Bare_Nuclei'] + df['Bland_Chromatin'] + df['Normal_Nucleoli']) / 3 6 | 7 | # Dropping redundant columns 8 | # Explanation: Dropping Bare_Nuclei, Bland_Chromatin, and Normal_Nucleoli as they are combined into Nuclei_Features_Avg 9 | df.drop(columns=['Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/diabetes_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Insulin Glucose Interaction 3 | # Usefulness: Insulin and glucose levels are directly related to diabetes risk. Their interaction term can help capture this relationship. 4 | # Input samples: 'insu': [155.0, 65.0, 0.0], 'plas': [129.0, 86.0, 151.0] 5 | df['insulin_glucose_interaction'] = df['insu'] * df['plas'] 6 | 7 | # Skin Thickness Insulin Interaction 8 | # Usefulness: Skin thickness and insulin levels can both be related to diabetes risk. Their interaction term can help capture this relationship. 9 | # Input samples: 'skin': [49.0, 52.0, 46.0], 'insu': [155.0, 65.0, 0.0] 10 | df['skin_insulin_interaction'] = df['skin'] * df['insu'] 11 | -------------------------------------------------------------------------------- /data/generated_code/breast-w_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | # Explanation: 'Marginal_Adhesion' is dropped as it has low importance in the feature importance analysis and may hurt predictive performance. 2 | df.drop(columns=['Marginal_Adhesion'], inplace=True)# ('High_Clump_Size', 'Create a binary feature indicating high values for Clump_Thickness and Cell_Size_Uniformity') 3 | # Usefulness: High values for Clump_Thickness and Cell_Size_Uniformity are known to be associated with malignancy. 
This new feature captures this information in a binary form. 4 | # Input samples: 'Clump_Thickness': [1.0, 5.0, 4.0], 'Cell_Size_Uniformity': [1.0, 2.0, 1.0], ... 5 | df['High_Clump_Size'] = ((df['Clump_Thickness'] >= 6) & (df['Cell_Size_Uniformity'] >= 6)).astype(int) -------------------------------------------------------------------------------- /data/generated_code/eucalyptus_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Seedlot_Density) 3 | # Usefulness: Some seedlots may be more dense than others, which could affect the growth of the trees. This feature gives an indication of the density of each seedlot. 4 | # Input samples: 'PMCno': [1521.0, 1524.0, 1786.0], 'Ht': [13.32, 5.5, 5.39], 'DBH': [31.38, 7.6, 6.05] 5 | df['Seedlot_Density'] = (df['Ht'] * df['DBH']) / df['PMCno'] 6 | # (Species_Survival_Rate) 7 | # Usefulness: This feature gives an indication of the average survival rate for each species. 8 | # Input samples: 'Sp': [6, 8, 1], 'Surv': [nan, 20.0, 88.0] 9 | species_mean_survival = df.groupby('Sp')['Surv'].mean() 10 | df['Species_Survival_Rate'] = df['Sp'].map(species_mean_survival) -------------------------------------------------------------------------------- /data/generated_code/eucalyptus_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Seedlot_Density) 3 | # Usefulness: Some seedlots may be more dense than others, which could affect the growth of the trees. This feature gives an indication of the density of each seedlot. 4 | # Input samples: 'PMCno': [1521.0, 1524.0, 1786.0], 'Ht': [13.32, 5.5, 5.39], 'DBH': [31.38, 7.6, 6.05] 5 | df['Seedlot_Density'] = (df['Ht'] * df['DBH']) / df['PMCno'] 6 | # (Species_Survival_Rate) 7 | # Usefulness: This feature gives an indication of the average survival rate for each species. 
8 | # Input samples: 'Sp': [6, 8, 1], 'Surv': [nan, 20.0, 88.0] 9 | species_mean_survival = df.groupby('Sp')['Surv'].mean() 10 | df['Species_Survival_Rate'] = df['Sp'].map(species_mean_survival) -------------------------------------------------------------------------------- /data/generated_code/diabetes_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Insulin and Glucose Interaction 3 | # Usefulness: Combining insulin and glucose levels may provide better insight into diabetes risk, as both are related to the body's ability to process glucose. 4 | # Input samples: 'insu': [220.0, 0.0, 96.0], 'plas': [119.0, 107.0, 115.0] 5 | df['insulin_glucose'] = df['insu'] * df['plas'] 6 | 7 | # Pregnancies and Age Interaction 8 | # Usefulness: Combining the number of times pregnant and age may help the classifier identify patterns related to diabetes risk, as older women with more pregnancies could be at higher risk. 9 | # Input samples: 'preg': [1.0, 0.0, 1.0], 'age': [29.0, 23.0, 32.0] 10 | df['pregnancies_age'] = df['preg'] * df['age'] 11 | -------------------------------------------------------------------------------- /data/generated_code/cmc_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Wife_and_husband_education) 3 | # Usefulness: Education level of both wife and husband may be important in determining contraceptive method used. High education level of both may indicate more planning and use of long-term methods. 4 | # Input samples: 'Wifes_education': [3, 3, 2], 'Husbands_education': [3, 3, 3] 5 | df['Wife_and_husband_education'] = df['Wifes_education'] + df['Husbands_education'] 6 | # (Wife_age_squared) 7 | # Usefulness: This feature may capture the non-linear relationship between wife's age and contraceptive method used. A squared term may be useful in capturing this relationship. 
8 | # Input samples: 'Wifes_age': [48.0, 32.0, 28.0] 9 | df['Wife_age_squared'] = df['Wifes_age'] ** 2 -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v3_3_code.txt: -------------------------------------------------------------------------------- 1 | # ('right-weight' * 'right-distance') 2 | # Usefulness: This column calculates the right moment of the scale, which is an important factor in determining the balance of the scale. 3 | # Input samples: 'right-weight': [3.0, 4.0, 5.0], 'right-distance': [2.0, 2.0, 5.0] 4 | df['right-moment'] = df['right-weight'] * df['right-distance']# ('left-distance' - 'right-distance') 5 | # Usefulness: This column calculates the difference between the left and right distances of the scale, which can help to determine the direction in which the scale will tip. 6 | # Input samples: 'left-distance': [4.0, 5.0, 4.0], 'right-distance': [2.0, 2.0, 5.0] 7 | df['distance-difference'] = df['left-distance'] - df['right-distance'] -------------------------------------------------------------------------------- /data/generated_code/pc1_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | # ('Halstead's Program Level') 2 | # Usefulness: Halstead's Program Level is a measure of the program's complexity. It is calculated based on the program's volume and the number of unique operators and operands. 3 | # Input samples: 'V': [305.03, 404.17, 280.0], 'uniq_Op': [18.0, 14.0, 14.0], 'uniq_Opnd': [18.0, 13.0, 18.0] 4 | df['program_level'] = df['V'] / (df['uniq_Op'] * df['uniq_Opnd']) 5 | # ('Halstead's Program Length per LOC') 6 | # Usefulness: Halstead's Program Length is a measure of the length of the program. Dividing it by the line count gives an idea of how long the program is per line. 7 | # Input samples: 'loc': [15.0, 13.0, 14.0], 'L': [0.09, 0.05, 0.1], ... 
8 | df['program_length_per_loc'] = df['L'] / df['loc'] -------------------------------------------------------------------------------- /data/generated_code/eucalyptus_v3_1_code.txt: -------------------------------------------------------------------------------- 1 | # (Feature name and description) 2 | # Usefulness: This feature captures the ratio of the altitude to the rainfall, which can help to identify if certain areas with specific rainfall and altitude combinations are more suitable for soil conservation. 3 | # Input samples: 'Altitude': [150.0, 200.0, 150.0], 'Rainfall': [1250.0, 1400.0, 900.0] 4 | df['Altitude_Rainfall_ratio'] = df['Altitude'] / df['Rainfall']# Explanation why the column 'Year' is dropped 5 | # The year of planting is unlikely to have a direct impact on the classification of "Utility", as the species and seedlot are already included in the dataset, and the effect of the year on the growth and health of the trees is likely to be indirect and complex. 6 | df.drop(columns=['Year'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/cmc_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: Education_gap 3 | # Usefulness: This feature captures the difference in education levels between the wife and husband, which may influence their contraceptive method choice. 4 | # Input samples: 'Wifes_education': [3, 0, 2], 'Husbands_education': [3, 1, 3] 5 | df['Education_gap'] = df['Wifes_education'] - df['Husbands_education'] 6 | 7 | # Feature name: Age_standard_living_interaction 8 | # Usefulness: This feature captures the interaction between the wife's age and the standard-of-living index, which may have an impact on the couple's contraceptive method choice. 
9 | # Input samples: 'Wifes_age': [37.0, 35.0, 22.0], 'Standard-of-living_index': [1, 3, 3] 10 | df['Age_standard_living_interaction'] = df['Wifes_age'] * df['Standard-of-living_index'] 11 | -------------------------------------------------------------------------------- /data/generated_code/credit-g_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: credit_amount_per_duration 3 | # Usefulness: This feature calculates the credit amount per month, which can provide insight into the monthly payment burden on the customer. 4 | # Input samples: 'credit_amount': [1224.0, 8588.0, 6615.0], 'duration': [9.0, 39.0, 24.0] 5 | df['credit_amount_per_duration'] = df['credit_amount'] / df['duration'] 6 | 7 | # Feature name: installment_commitment_ratio 8 | # Usefulness: This feature calculates the ratio of the installment rate to the credit amount. It can provide insight into the customer's ability to pay off their credit. 9 | # Input samples: 'installment_commitment': [3.0, 4.0, 2.0], 'credit_amount': [1224.0, 8588.0, 6615.0] 10 | df['installment_commitment_ratio'] = df['installment_commitment'] / df['credit_amount'] 11 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_pharyngitis_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Age and Exudate Interaction 3 | # Usefulness: Younger children may be more likely to have exudate present in their throat, which could be indicative of a more severe infection and thus more likely to be RADT positive. 
4 | # Input samples: 'age_y': [11.6, 8.2, 5.1], 'exudate': [0.0, 0.0, 0.0] 5 | df['age_exudate_interaction'] = df['age_y'] * df['exudate'] 6 | 7 | # Feature: Age and Tonsillar Swelling Interaction 8 | # Usefulness: Younger children may be more likely to have tonsillar swelling, which could be indicative of a more severe infection and thus more likely to be RADT positive. 9 | # Input samples: 'age_y': [11.6, 8.2, 5.1], 'tonsillarswelling': [0.0, 1.0, 0.0] 10 | df['age_tonsillar_swelling_interaction'] = df['age_y'] * df['tonsillarswelling'] 11 | -------------------------------------------------------------------------------- /data/generated_code/pc1_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: code_density 3 | # Usefulness: This feature represents the proportion of code lines (lOCode) to the total lines of code, comments, and blank lines. 4 | # A higher code density may indicate better code organization and lower chances of defects. 5 | # Input samples: 'lOCode': [3.0, 21.0, 37.0], 'lOComment': [0.0, 0.0, 0.0], 'lOBlank': [1.0, 4.0, 4.0] 6 | df['code_density'] = df['lOCode'] / (df['lOCode'] + df['lOComment'] + df['lOBlank']) 7 | 8 | # Feature: effort_per_line 9 | # Usefulness: This feature represents the average effort (Halstead effort) per line of code. Higher effort per line may indicate more complex code and a higher likelihood of defects. 
10 | # Input samples: 'E': [39.3, 2966.84, 7545.68], 'loc': [3.0, 21.0, 37.0] 11 | df['effort_per_line'] = df['E'] / df['loc'] 12 | -------------------------------------------------------------------------------- /data/generated_code/breast-w_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('Total_Cell_Size', 'Use the product of Cell_Size_Uniformity and Single_Epi_Cell_Size to get the total cell size of the sample.', 3 | # 'Cell_Size_Uniformity': [1.0, 2.0, 6.0], 'Single_Epi_Cell_Size': [2.0, 2.0, 10.0], ...) 4 | df['Total_Cell_Size'] = df['Cell_Size_Uniformity'] * df['Single_Epi_Cell_Size'] 5 | # ('Uniformity_Difference', 'Use the absolute difference between Cell_Size_Uniformity and Cell_Shape_Uniformity to get the uniformity difference of the sample.', 6 | # 'Cell_Size_Uniformity': [1.0, 2.0, 6.0], 'Cell_Shape_Uniformity': [1.0, 2.0, 5.0], ...) 7 | df['Uniformity_Difference'] = abs(df['Cell_Size_Uniformity'] - df['Cell_Shape_Uniformity']) 8 | 9 | # Explanation: The absolute difference between Cell_Size_Uniformity and Cell_Shape_Uniformity is a good estimate of the uniformity difference of the sample. -------------------------------------------------------------------------------- /data/generated_code/credit-g_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Installment to credit amount ratio 3 | # Usefulness: This feature calculates the ratio of installment commitment to credit amount, which can help to identify patterns in credit risk based on how much of the credit amount is being committed to installments. 
4 | # Input samples: 'installment_commitment': [4.0, 4.0, 4.0], 'credit_amount': [1549.0, 7476.0, 2442.0] 5 | df['installment_to_credit_amount_ratio'] = df['installment_commitment'] / df['credit_amount'] 6 | 7 | # Feature: Income stability 8 | # Usefulness: This feature calculates the ratio of duration to present employment, which can help to identify patterns in credit risk based on the stability of a person's income. 9 | # Input samples: 'duration': [9.0, 48.0, 27.0], 'employment': [1, 3, 4] 10 | df['income_stability'] = df['duration'] / df['employment'] 11 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_playground-series-s3e12_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Urea_to_calcium_ratio) 3 | # Usefulness: High levels of urea and calcium in urine have been associated with kidney stones. This column adds information on the ratio of urea to calcium, which is a useful predictor for the formation of calcium oxalate crystals. 4 | # Input samples: 'urea': [75.0, 187.0, 380.0], 'calc': [3.98, 6.99, 7.18] 5 | df['urea_to_calcium_ratio'] = df['urea'] / df['calc'] 6 | # (Osmo_cond_ratio) 7 | # Usefulness: High levels of osmolarity and conductivity in urine have been associated with kidney stones. This column adds information on the ratio of osmolarity to conductivity, which is a useful predictor for the formation of calcium oxalate crystals. 
8 | # Input samples: 'osmo': [527.0, 461.0, 874.0], 'cond': [25.8, 17.8, 29.5] 9 | df['osmo_cond_ratio'] = df['osmo'] / df['cond']# (Urea_minus_calcium) -------------------------------------------------------------------------------- /data/generated_code/pc1_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Feature name and description) 3 | # Usefulness: This feature calculates the ratio of unique operands to unique operators 4 | # Input samples: 'uniq_Op': [14.0, 8.0, 20.0], 'uniq_Opnd': [12.0, 9.0, 136.0] 5 | df['uniq_Opnd/uniq_Op'] = df['uniq_Opnd'] / df['uniq_Op'] 6 | # (Feature name and description) 7 | # Usefulness: This feature calculates the ratio of the number of operands to the number of operators 8 | # Input samples: 'total_Op': [28.0, 20.0, 367.0], 'total_Opnd': [24.0, 13.0, 304.0] 9 | df['total_Opnd/total_Op'] = df['total_Opnd'] / df['total_Op']# (Feature name and description) 10 | # Usefulness: This feature calculates the ratio of unique operands to the total number of operands 11 | # Input samples: 'uniq_Opnd': [12.0, 9.0, 136.0], 'total_Opnd': [24.0, 13.0, 304.0] 12 | df['uniq_Opnd/total_Opnd'] = df['uniq_Opnd'] / df['total_Opnd'] -------------------------------------------------------------------------------- /data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | # ('white_piece0_strength' / 'black_piece0_strength') 2 | # Usefulness: The ratio of the strengths of the white and black pieces provides a measure of how strong the white piece is relative to the black piece. This can be useful in predicting the outcome of the game. 
3 | # Input samples: 'white_piece0_strength': [7.0, 6.0, 7.0], 'black_piece0_strength': [0.0, 0.0, 6.0] 4 | df['strength_ratio'] = df['white_piece0_strength'] / df['black_piece0_strength']# ('white_piece0_strength' + 'black_piece0_strength') 5 | # Usefulness: The sum of the strengths of the white and black pieces provides a measure of how strong the pieces are. This can be useful in predicting the outcome of the game. 6 | # Input samples: 'white_piece0_strength': [7.0, 6.0, 7.0], 'black_piece0_strength': [0.0, 0.0, 6.0] 7 | df['total_strength'] = df['white_piece0_strength'] + df['black_piece0_strength'] -------------------------------------------------------------------------------- /data/generated_code/credit-g_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | # (Feature name and description) 2 | # Usefulness: This feature captures the ratio of credit amount to installment commitment. 3 | # Input samples: 'credit_amount': [1082.0, 5293.0, 2080.0], 'installment_commitment': [4.0, 2.0, 1.0] 4 | df['credit_installment_ratio'] = df['credit_amount'] / (df['installment_commitment'] * df['duration'])# (Feature name and description) 5 | # Usefulness: This feature captures the number of credits per person responsible for maintenance 6 | # Input samples: 'existing_credits': [2.0, 2.0, 1.0], 'num_dependents': [1.0, 1.0, 1.0] 7 | df['credits_per_person'] = df['existing_credits'] / df['num_dependents']# (Feature name and description) 8 | # Usefulness: This feature captures the ratio of credit amount to present employment. 
9 | # Input samples: 'credit_amount': [1082.0, 5293.0, 2080.0], 'employment': [4, 0, 2] 10 | df['credit_employment_ratio'] = df['credit_amount'] / df['employment'] -------------------------------------------------------------------------------- /data/generated_code/diabetes_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Age category 3 | # Usefulness: Categorizing age values into groups can help the classifier identify patterns related to diabetes risk. 4 | # Input samples: 'age': [25.0, 39.0, 21.0] 5 | def age_category(age): 6 | if age < 30: 7 | return 0 # Young 8 | elif 30 <= age < 50: 9 | return 1 # Middle-aged 10 | else: 11 | return 2 # Old 12 | 13 | df['age_cat'] = df['age'].apply(age_category) 14 | 15 | # Glucose tolerance test result category 16 | # Usefulness: Categorizing glucose tolerance test results can help the classifier identify patterns related to diabetes risk. 17 | # Input samples: 'plas': [87.0, 137.0, 134.0] 18 | def glucose_category(glucose): 19 | if glucose < 140: 20 | return 0 # Normal 21 | elif 140 <= glucose < 200: 22 | return 1 # Prediabetes 23 | else: 24 | return 2 # Diabetes 25 | 26 | df['glucose_cat'] = df['plas'].apply(glucose_category) 27 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | # ('Family_Size', 'Useful to capture the total number of individuals insured under the same policy', 'Is_Spouse': ['No', 'No', 'Yes'], 'Upper_Age': [52, 22, 63], ...) 
2 | df['Family_Size'] = df['Is_Spouse'].replace({'Yes':2, 'No':1}) + (df['Upper_Age'] > 30).astype(int)# ('Holding_Policy_Duration_Imputed_Holding_Policy_Type_Imputed', 'Useful to capture if both holding policy duration and type were imputed', 'Holding_Policy_Duration_Imputed': [0, 0, 1], 'Holding_Policy_Type_Imputed': [0, 1, 0], ...) 3 | df['Holding_Policy_Duration_Imputed_Holding_Policy_Type_Imputed'] = (df['Holding_Policy_Duration'].isna().astype(int)) * (df['Holding_Policy_Type'].isna().astype(int))# ('Premium_Per_Age', 'Useful to capture premium paid per year of life insured', 'Reco_Policy_Premium': [13112.0, 9800.0, 17280.0], 'Age': [52, 22, 63], ...) 4 | df['Premium_Per_Age'] = df['Reco_Policy_Premium'] / (df['Upper_Age'] + df['Lower_Age']) -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Right Moment 3 | # Usefulness: Calculates the moment on the right side of the balance scale, which is essential to determine the class. 4 | # Input samples: 'right-weight': [1.0, 3.0, 5.0], 'right-distance': [5.0, 2.0, 4.0] 5 | df['right_moment'] = df['right-weight'] * df['right-distance'] 6 | # Moment Difference 7 | # Usefulness: Calculates the difference between the left and right moments, which helps in determining the class. 
8 | # Input samples: 'left_moment': [2.0, 8.0, 6.0], 'right_moment': [5.0, 6.0, 20.0] 9 | df['moment_difference'] = df['left-weight'] * df['left-distance'] - df['right-weight'] * df['right-distance'] 10 | 11 | # Dropping left-weight, left-distance, right-weight, right-distance as they are not needed anymore 12 | # We have created new features (left_moment, right_moment, moment_difference) that capture the information of these columns 13 | df.drop(columns=['left-weight', 'left-distance', 'right-weight', 'right-distance'], inplace=True) -------------------------------------------------------------------------------- /data/dataset_descriptions/openml_jungle_chess_2pcs_raw_endgame_complete.txt: -------------------------------------------------------------------------------- 1 | This dataset is part of a collection of datasets based on the game "Jungle Chess" (a.k.a. Dou Shou Qi). For a description of the rules, please refer to the paper (link attached). The paper also contains a description of various constructed features. As the tablebases are a disjoint set of several tablebases based on which (two) pieces are on the board, we have uploaded all tablebases that have explicitly different content: 2 | 3 | * Rat vs. Rat 4 | * Rat vs. Panther 5 | * Rat vs. Lion 6 | * Rat vs. Elephant 7 | * Panther vs. Lion 8 | * Panther vs. Elephant 9 | * Tiger vs. Lion 10 | * Lion vs. Lion 11 | * Lion vs. Elephant 12 | * Elephant vs. Elephant 13 | * Complete (Combination of the above) 14 | * RAW Complete (Combination of the above, containing for both pieces just the rank, file and strength information). This dataset contains a classification problem similar to, e.g., the King and Rook vs. King problem and is suitable for classification tasks.
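Several of the generated jungle-chess features divide one piece strength by the other, and the sample rows quoted in those files show captured pieces with strength 0.0; pandas division then silently yields `inf` rather than raising. A minimal sketch of a guarded variant, assuming the column names used in the generated-code files (the NaN fallback for captured pieces is an assumption, chosen so a downstream imputer can handle it explicitly):

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows mirroring the strengths quoted in the comments.
df = pd.DataFrame({
    "white_piece0_strength": [7.0, 6.0, 7.0],
    "black_piece0_strength": [0.0, 0.0, 6.0],
})

# Map a zero denominator (captured piece) to NaN before dividing,
# so the ratio never produces inf.
df["strength_ratio"] = (
    df["white_piece0_strength"]
    / df["black_piece0_strength"].replace(0, np.nan)
)
```

The difference-based features in the next file (`strength_difference`, `stronger_piece`) avoid this issue entirely, which is one reason to prefer them over ratios here.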
-------------------------------------------------------------------------------- /data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: strength_difference 3 | # Usefulness: This feature calculates the difference in strength between the white and black pieces, which can be useful in determining the outcome of a potential capture. 4 | # Input samples: 'white_piece0_strength': [5.0, 6.0, 0.0], 'black_piece0_strength': [6.0, 0.0, 0.0] 5 | df['strength_difference'] = df['white_piece0_strength'] - df['black_piece0_strength'] 6 | 7 | # Feature name: stronger_piece 8 | # Usefulness: This feature indicates which piece is stronger (1 for white, -1 for black, 0 for equal strength). This can help the classifier understand the balance of power between the two pieces. 9 | # Input samples: 'white_piece0_strength': [5.0, 6.0, 0.0], 'black_piece0_strength': [6.0, 0.0, 0.0] 10 | df['stronger_piece'] = df.apply(lambda row: 1 if row['white_piece0_strength'] > row['black_piece0_strength'] else (-1 if row['white_piece0_strength'] < row['black_piece0_strength'] else 0), axis=1) 11 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_playground-series-s3e12_v3_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Urea-to-calcium ratio) 3 | # Usefulness: Urine analysis can show elevated levels of urea and calcium in patients with kidney stones. This ratio can be an indicator of the likelihood of kidney stone formation. 4 | # Input samples: 'urea': [398.0, 178.0, 364.0], 'calc': [3.16, 3.04, 7.31] 5 | df['urea_calc_ratio'] = df['urea'] / df['calc'] 6 | # (Osmolarity divided by conductivity) 7 | # Usefulness: Osmolarity and conductivity are both measures of the concentration of particles in urine. This feature captures their relationship. 
8 | # Input samples: 'osmo': [442.0, 803.0, 853.0], 'cond': [25.7, 26.0, 24.5] 9 | df['osmo_div_cond'] = df['osmo'] / df['cond']# (Calcium concentration times pH) 10 | # Usefulness: Calcium concentration and pH are both important factors in the formation of kidney stones. This feature captures their interaction. 11 | # Input samples: 'calc': [3.16, 3.04, 7.31], 'ph': [5.53, 5.27, 5.36] 12 | df['calc_times_ph'] = df['calc'] * df['ph'] -------------------------------------------------------------------------------- /data/generated_code/credit-g_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Credit to Income Ratio) 3 | # Usefulness: This feature provides the ratio of credit amount to income of the customer which is a useful metric for credit risk assessment. 4 | # Input samples: 'credit_amount': [1224.0, 8588.0, 6615.0], 'installment_commitment': [3.0, 4.0, 2.0], 'personal_status': [2, 2, 2], 'num_dependents': [1.0, 1.0, 1.0] 5 | df['credit_income_ratio'] = df['credit_amount'] / (df['installment_commitment'] * df['personal_status'] * (df['num_dependents'] + 1)) 6 | # (Age bin) 7 | # Usefulness: Age is an important factor in determining credit risk. This feature adds a categorical column that bins the age of the customer. 8 | # Input samples: 'age': [30.0, 45.0, 75.0], 'personal_status': [2, 2, 2], 'class': [0.0, 0.0, 0.0] 9 | df['age_bin'] = pd.cut(df['age'], bins=[18, 30, 40, 50, 60, 70, 120], labels=['18-29', '30-39', '40-49', '50-59', '60-69', '70+'], include_lowest=True) 10 | df.drop(columns=['age'], inplace=True) # Dropping original column as it is redundant and not needed anymore. -------------------------------------------------------------------------------- /data/dataset_descriptions/kaggle_playground-series-s3e12.txt: -------------------------------------------------------------------------------- 1 | This dataset can be used to predict the presence of kidney stones based on urine analysis. 
2 | 3 | The 79 urine specimens were analyzed in an effort to 4 | determine if certain physical characteristics of the urine might be related to the 5 | formation of calcium oxalate crystals. 6 | The six physical characteristics of the urine are: (1) specific gravity, the density of the urine relative to water; (2) pH, the negative logarithm of the hydrogen ion concentration; (3) osmolarity (mOsm), a unit used in biology and medicine but not in 7 | physical chemistry. Osmolarity is proportional to the concentration of 8 | molecules in solution; (4) conductivity (mMho, milliMho). One Mho is one 9 | reciprocal Ohm. Conductivity is proportional to the concentration of charged 10 | ions in solution; (5) urea concentration in millimoles per litre; and (6) calcium 11 | concentration (CALC) in millimoles per litre. 12 | 13 | The data is obtained from 'Physical Characteristics of Urines With and Without Crystals', a chapter from Springer Series in Statistics. -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name and description: right_moment 3 | # Usefulness: Calculates the moment on the right side of the balance scale, which is the product of right-weight and right-distance. 4 | # Input samples: 'right-weight': [5.0, 3.0, 1.0], 'right-distance': [2.0, 4.0, 4.0] 5 | df['right_moment'] = df['right-weight'] * df['right-distance'] 6 | # Feature name and description: moment_difference 7 | # Usefulness: Calculates the difference between the left and right moments, which helps to determine the tipping direction of the balance scale.
8 | # Input samples: 'left-weight': [2.0, 4.0, 1.0], 'left-distance': [2.0, 5.0, 4.0], 'right-weight': [5.0, 3.0, 1.0], 'right-distance': [2.0, 4.0, 4.0] 9 | df['moment_difference'] = (df['left-weight'] * df['left-distance']) - (df['right-weight'] * df['right-distance']) 10 | # Drop the original columns as they are now represented by the moment_difference feature and may introduce multicollinearity. 11 | df.drop(columns=['left-weight', 'left-distance', 'right-weight', 'right-distance'], inplace=True) 12 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Reco_Policy_Premium_Age_Ratio 3 | # Usefulness: This feature captures the ratio of recommended policy premium to the average age of the applicants, which can help identify the affordability of the policy for different age groups and their likelihood to respond positively. 4 | # Input samples: 'Reco_Policy_Premium': [12960.0, 21767.2, 26764.8], 'Upper_Age': [48, 58, 65], 'Lower_Age': [48, 50, 60] 5 | df['Reco_Policy_Premium_Age_Ratio'] = df['Reco_Policy_Premium'] / ((df['Upper_Age'] + df['Lower_Age']) / 2) 6 | 7 | # Feature: Health_Indicator_Holding_Policy_Type 8 | # Usefulness: Combining Health Indicator and Holding Policy Type can provide insights on whether different health conditions and policy types influence the response to recommended insurance policies. 
9 | # Input samples: 'Health Indicator': ['X2', 'X5', 'X4'], 'Holding_Policy_Type': [3.0, 4.0, 3.0] 10 | df['Health_Indicator_Holding_Policy_Type'] = df['Health Indicator'].astype(str) + '_' + df['Holding_Policy_Type'].astype(str) 11 | -------------------------------------------------------------------------------- /data/dataset_descriptions/openml_diabetes.txt: -------------------------------------------------------------------------------- 1 | 4. Relevant Information: 2 | Several constraints were placed on the selection of these instances from 3 | a larger database. In particular, all patients here are females at 4 | least 21 years old of Pima Indian heritage. ADAP is an adaptive learning 5 | routine that generates and executes digital analogs of perceptron-like 6 | devices. It is a unique algorithm; see the paper for details. 7 | 8 | 7. For Each Attribute: (all numeric-valued) 9 | 1. Number of times pregnant 10 | 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 11 | 3. Diastolic blood pressure (mm Hg) 12 | 4. Triceps skin fold thickness (mm) 13 | 5. 2-Hour serum insulin (mu U/ml) 14 | 6. Body mass index (weight in kg/(height in m)^2) 15 | 7. Diabetes pedigree function 16 | 8. Age (years) 17 | 9. Class variable (0 or 1) 18 | 19 | Relabeled values in attribute 'class' 20 | From: 0 To: tested_negative 21 | From: 1 To: tested_positive -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Left Moment 3 | # Usefulness: This feature represents the product of left weight and left distance, which is useful to determine the balance scale tip.
4 | # Input samples: 'left-weight': [4.0, 5.0, 1.0], 'left-distance': [2.0, 4.0, 4.0] 5 | df['left_moment'] = df['left-weight'] * df['left-distance'] 6 | 7 | # Balance Difference 8 | # Usefulness: This feature represents the difference between left moment and right moment, which can help to classify the balance scale tip more accurately. 9 | # Input samples: 'left-weight': [4.0, 5.0, 1.0], 'left-distance': [2.0, 4.0, 4.0], 'right-weight': [4.0, 4.0, 5.0], 'right-distance': [2.0, 4.0, 5.0] 10 | df['balance_diff'] = (df['left-weight'] * df['left-distance']) - (df['right-weight'] * df['right-distance']) 11 | 12 | # Drop redundant columns 13 | # Explanation: The original columns are now represented by the balance_diff feature, which captures the relationship between left and right moments. 14 | df.drop(columns=['left-weight', 'left-distance', 'right-weight', 'right-distance'], inplace=True) 15 | -------------------------------------------------------------------------------- /data/generated_code/pc1_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Feature name and description) 3 | # Usefulness: This feature is the ratio of unique operands to total operands 4 | # Input samples: 'uniq_Opnd': [3.0, 14.0, 28.0], 'total_Opnd': [3.0, 20.0, 39.0] 5 | df['uniq_Opnd_ratio'] = df['uniq_Opnd'] / df['total_Opnd'] 6 | # (Feature name and description) 7 | # Usefulness: This feature is the ratio of McCabe's cyclomatic complexity to Halstead's volume 8 | # Input samples: 'v(g)': [1.0, 4.0, 6.0], 'V': [19.65, 276.9, 541.74] 9 | df['complexity_volume_ratio'] = df['v(g)'] / df['V']  # (Feature name and description) 10 | # Usefulness: This feature is the ratio of McCabe's line count of code to Halstead's total operators + operands 11 | # Input samples: 'loc': [3.0, 21.0, 37.0], 'N': [7.0, 57.0, 97.0] 12 | df['code_to_operands_ratio'] = df['loc'] / df['N']  # (Feature name and description) 13 | # Usefulness: This feature is the ratio of
McCabe's cyclomatic complexity to McCabe's essential complexity 14 | # Input samples: 'v(g)': [1.0, 4.0, 6.0], 'ev(g)': [1.0, 1.0, 1.0] 15 | df['complexity_essential_ratio'] = df['v(g)'] / df['ev(g)'] -------------------------------------------------------------------------------- /data/generated_code/diabetes_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | # (GlucoseInsulinRatio) 2 | # Usefulness: The ratio of glucose levels to insulin levels can be a better indicator of diabetes risk than glucose levels alone. 3 | # Input samples: 'plas': [119.0, 107.0, 115.0], 'insu': [220.0, 0.0, 96.0] 4 | df['GlucoseInsulinRatio'] = df['plas'] / df['insu'].replace(0, float('nan'))  # 'insu' can be 0 (see samples); map 0 to NaN to avoid infinities 5 | 6 | # (AgeBMI) 7 | # Usefulness: BMI and age can interact in complex ways to increase the risk of diabetes. This feature can capture this relationship. 8 | # Input samples: 'age': [29.0, 23.0, 32.0], 'mass': [45.6, 26.4, 34.6] 9 | df['AgeBMI'] = df['age'] * df['mass'] 10 | 11 | # (BloodPressurePedigree) 12 | # Usefulness: The pedigree function can be an indicator of genetic predisposition to diabetes. This feature can capture this relationship with blood pressure. 13 | # Input samples: 'pres': [86.0, 60.0, 70.0], 'pedi': [0.81, 0.13, 0.53] 14 | df['BloodPressurePedigree'] = df['pres'] * df['pedi'] 15 | 16 | # Dropping columns that might be redundant and hurt the predictive performance of the downstream classifier (Feature selection) 17 | df.drop(columns=['preg', 'skin'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/diabetes_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | # (GlucoseInsulinRatio) 2 | # Usefulness: The ratio of glucose levels to insulin levels can be a better indicator of diabetes risk than glucose levels alone.
3 | # Input samples: 'plas': [119.0, 107.0, 115.0], 'insu': [220.0, 0.0, 96.0] 4 | df['GlucoseInsulinRatio'] = df['plas'] / df['insu'].replace(0, float('nan'))  # 'insu' can be 0 (see samples); map 0 to NaN to avoid infinities 5 | 6 | # (AgeBMI) 7 | # Usefulness: BMI and age can interact in complex ways to increase the risk of diabetes. This feature can capture this relationship. 8 | # Input samples: 'age': [29.0, 23.0, 32.0], 'mass': [45.6, 26.4, 34.6] 9 | df['AgeBMI'] = df['age'] * df['mass'] 10 | 11 | # (BloodPressurePedigree) 12 | # Usefulness: The pedigree function can be an indicator of genetic predisposition to diabetes. This feature can capture this relationship with blood pressure. 13 | # Input samples: 'pres': [86.0, 60.0, 70.0], 'pedi': [0.81, 0.13, 0.53] 14 | df['BloodPressurePedigree'] = df['pres'] * df['pedi'] 15 | 16 | # Dropping columns that might be redundant and hurt the predictive performance of the downstream classifier (Feature selection) 17 | df.drop(columns=['preg', 'skin'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/diabetes_v3_1_code.txt: -------------------------------------------------------------------------------- 1 | # (Insulin resistance) 2 | # Usefulness: Insulin resistance is a well-known risk factor for diabetes. This feature calculates insulin resistance using the HOMA-IR formula. 3 | # Input samples: 'plas': [87.0, 137.0, 134.0], 'insu': [0.0, 0.0, 291.0] 4 | df['homa_ir'] = (df['plas'] * df['insu']) / 405 5 | df.drop(columns=['insu'], inplace=True)  # (Age group) 6 | # Usefulness: Age is a well-known risk factor for diabetes. This feature categorizes age into age groups. 7 | # Input samples: 'age': [25.0, 39.0, 21.0] 8 | df['age_group'] = pd.cut(df['age'], bins=[20, 30, 40, 50, 60, 70, 81], labels=['20-29', '30-39', '40-49', '50-59', '60-69', '70+']) 9 | df.drop(columns=['age'], inplace=True)  # (Triceps skin fold thickness group) 10 | # Usefulness: Triceps skin fold thickness is a well-known risk factor for diabetes.
This feature categorizes triceps skin fold thickness into groups. 11 | # Input samples: 'skin': [23.0, 41.0, 20.0] 12 | df['skin_group'] = pd.cut(df['skin'], bins=[-1, 0, 10, 20, 30, 100], labels=['0', '1-10', '11-20', '21-30', '31+']) 13 | df.drop(columns=['skin'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/kaggle_playground-series-s3e12_v3_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Urea to Creatinine Ratio) 3 | # Usefulness: Urea and creatinine are both markers of kidney function. The ratio of the two can be used to estimate the cause of acute kidney injury. 4 | # Input samples: 'urea': [125.0, 164.0, 418.0], 'calc': [1.05, 1.16, 2.36], 'id': [265.0, 157.0, 385.0] 5 | df['urea_creatinine_ratio'] = df['urea'] / df['calc'] 6 | # (Calcium to Creatinine Ratio) 7 | # Usefulness: Calcium and creatinine are both markers of kidney function. The ratio of the two can be used to estimate the cause of acute kidney injury. 8 | # Input samples: 'calc': [1.05, 1.16, 2.36], 'urea': [125.0, 164.0, 418.0], 'id': [265.0, 157.0, 385.0] 9 | df['calcium_creatinine_ratio'] = df['calc'] / df['urea'] 10 | # (Urea to Calcium Ratio) 11 | # Usefulness: Urea and calcium are both important factors in the formation of kidney stones. The ratio of the two can help predict the presence of kidney stones. 12 | # Input samples: 'urea': [125.0, 164.0, 418.0], 'calc': [1.05, 1.16, 2.36], 'target': [0.0, 1.0, 1.0] 13 | df['urea_calcium_ratio'] = df['urea'] / df['calc'] -------------------------------------------------------------------------------- /data/generated_code/kaggle_playground-series-s3e12_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Urea to Creatinine Ratio) 3 | # Usefulness: Urea and creatinine are both markers of kidney function. 
The ratio of the two can be used to estimate the cause of acute kidney injury. 4 | # Input samples: 'urea': [125.0, 164.0, 418.0], 'calc': [1.05, 1.16, 2.36], 'id': [265.0, 157.0, 385.0] 5 | df['urea_creatinine_ratio'] = df['urea'] / df['calc'] 6 | # (Calcium to Creatinine Ratio) 7 | # Usefulness: Calcium and creatinine are both markers of kidney function. The ratio of the two can be used to estimate the cause of acute kidney injury. 8 | # Input samples: 'calc': [1.05, 1.16, 2.36], 'urea': [125.0, 164.0, 418.0], 'id': [265.0, 157.0, 385.0] 9 | df['calcium_creatinine_ratio'] = df['calc'] / df['urea'] 10 | # (Urea to Calcium Ratio) 11 | # Usefulness: Urea and calcium are both important factors in the formation of kidney stones. The ratio of the two can help predict the presence of kidney stones. 12 | # Input samples: 'urea': [125.0, 164.0, 418.0], 'calc': [1.05, 1.16, 2.36], 'target': [0.0, 1.0, 1.0] 13 | df['urea_calcium_ratio'] = df['urea'] / df['calc'] -------------------------------------------------------------------------------- /data/generated_code/kaggle_spaceship-titanic_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | # Feature name: GroupSize 2 | # Usefulness: Indicates the number of passengers traveling in the same group. Larger groups might have a higher chance of being transported together. 3 | # Input samples: 'PassengerId': ['8878_01', '7749_02', '3955_01'] 4 | group_sizes = df['PassengerId'].str.split('_', expand=True)[0].value_counts() 5 | df['GroupSize'] = df['PassengerId'].str.split('_', expand=True)[0].map(group_sizes) 6 | # Feature name: Deck 7 | # Usefulness: Extracts the deck information from the Cabin column. Passengers on different decks might have different chances of being transported. 
8 | # Input samples: 'Cabin': ['E/568/P', 'D/244/P', 'G/648/S'] 9 | df['Deck'] = df['Cabin'].str[0] 10 | df['Deck'] = df['Deck'].astype('category') 11 | 12 | # Feature name: Side 13 | # Usefulness: Extracts the side information from the Cabin column. Passengers on different sides (Port or Starboard) might have different chances of being transported. 14 | # Input samples: 'Cabin': ['E/568/P', 'D/244/P', 'G/648/S'] 15 | df['Side'] = df['Cabin'].str[-1] 16 | df['Side'] = df['Side'].astype('category') -------------------------------------------------------------------------------- /data/generated_code/kaggle_spaceship-titanic_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: GroupSize 3 | # Usefulness: Indicates the number of passengers traveling in the same group. Larger groups might have a higher chance of being transported together. 4 | # Input samples: 'PassengerId': ['8878_01', '7749_02', '3955_01'] 5 | group_sizes = df['PassengerId'].str.split('_', expand=True)[0].value_counts() 6 | df['GroupSize'] = df['PassengerId'].str.split('_', expand=True)[0].map(group_sizes) 7 | # Feature name: Deck 8 | # Usefulness: Extracts the deck information from the Cabin column. Passengers on different decks might have different chances of being transported. 9 | # Input samples: 'Cabin': ['E/568/P', 'D/244/P', 'G/648/S'] 10 | df['Deck'] = df['Cabin'].str[0] 11 | df['Deck'] = df['Deck'].astype('category') 12 | 13 | # Feature name: Side 14 | # Usefulness: Extracts the side information from the Cabin column. Passengers on different sides (Port or Starboard) might have different chances of being transported. 
15 | # Input samples: 'Cabin': ['E/568/P', 'D/244/P', 'G/648/S'] 16 | df['Side'] = df['Cabin'].str[-1] 17 | df['Side'] = df['Side'].astype('category') -------------------------------------------------------------------------------- /data/generated_code/kaggle_pharyngitis_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Respiratory_symptoms 3 | # Usefulness: This feature combines the information from 'cough' and 'rhinorrhea' to create a single feature representing respiratory symptoms, which can potentially improve the prediction of "radt". 4 | # Input samples: 'cough': [1.0, 0.0, 0.0], 'rhinorrhea': [1.0, 0.0, 0.0] 5 | df['Respiratory_symptoms'] = df[['cough', 'rhinorrhea']].sum(axis=1) 6 | 7 | # Drop 'cough' and 'rhinorrhea' as 'Respiratory_symptoms' combines their information 8 | df.drop(columns=['cough', 'rhinorrhea'], inplace=True) 9 | # Feature: Sudden_and_conjunctivitis 10 | # Usefulness: This feature combines the information from 'sudden' and 'conjunctivitis' to create a single feature representing sudden onset and conjunctivitis symptoms, which can potentially improve the prediction of "radt". 11 | # Input samples: 'sudden': [0.0, 1.0, 0.0], 'conjunctivitis': [0.0, 0.0, 0.0] 12 | df['Sudden_and_conjunctivitis'] = df[['sudden', 'conjunctivitis']].sum(axis=1) 13 | 14 | # Drop 'sudden' and 'conjunctivitis' as 'Sudden_and_conjunctivitis' combines their information 15 | df.drop(columns=['sudden', 'conjunctivitis'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('balance', 'Balance between left and right weights') 3 | # Usefulness: This column calculates the difference between the left weight and right weight, giving an indication of the balance between the two. 
4 | # Input samples: 'left-weight': [4.0, 5.0, 1.0], 'right-weight': [4.0, 5.0, 2.0], ... 5 | df['balance'] = df['left-weight'] - df['right-weight'] 6 | # ('distance_diff', 'Difference between left and right distance') 7 | # Usefulness: This column calculates the difference between the left distance and right distance, giving an indication of the balance between the two. 8 | # Input samples: 'left-distance': [2.0, 4.0, 4.0], 'right-distance': [2.0, 4.0, 5.0], ... 9 | df['distance_diff'] = df['left-distance'] - df['right-distance']  # ('weight_product', 'Product of left and right weights') 10 | # Usefulness: This column calculates the product of the left and right weights, giving an indication of the total weight involved in the balance and how the weight is distributed between the two sides. 11 | # Input samples: 'left-weight': [4.0, 5.0, 1.0], 'right-weight': [4.0, 5.0, 2.0], ... 12 | df['weight_product'] = df['left-weight'] * df['right-weight'] -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Number of "x" in the top row) 3 | # Usefulness: The number of "x" in the top row can help predict if "x" is closer to winning by having a higher chance of completing a three-in-a-row. 4 | # Input samples: 'top-left-square': [2, 0, 1], 'top-middle-square': [1, 1, 2], 'top-right-square': [0, 1, 2] 5 | df['num_x_top_row'] = df[['top-left-square', 'top-middle-square', 'top-right-square']].apply(lambda x: sum(i == 1 for i in x), axis=1) 6 | # (Number of "o" in the diagonal from top-left to bottom-right) 7 | # Usefulness: The number of "o" in the diagonal from top-left to bottom-right can help predict if "o" is closer to winning by having a higher chance of completing a three-in-a-row.
8 | # Input samples: 'top-left-square': [2, 0, 1], 'middle-middle-square': [1, 2, 2], 'bottom-right-square': [2, 0, 1] 9 | df['num_o_topleft_bottomright_diag'] = (df['top-left-square'] == 2).astype(int) 10 | df['num_o_topleft_bottomright_diag'] += (df['middle-middle-square'] == 2).astype(int) 11 | df['num_o_topleft_bottomright_diag'] += (df['bottom-right-square'] == 2).astype(int) -------------------------------------------------------------------------------- /data/generated_code/eucalyptus_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Surv_to_Rainfall_ratio (Surv divided by Rainfall) 3 | # Usefulness: The survival rate of trees in relation to the amount of rainfall can provide insights into the species' adaptability and suitability for soil conservation in different rainfall conditions. 4 | # Input samples: 'Surv': [40.0, 70.0, 45.0], 'Rainfall': [1300.0, 850.0, 1080.0] 5 | df['Surv_to_Rainfall_ratio'] = df['Surv'] / df['Rainfall'] 6 | 7 | # DBH_to_Surv_ratio (DBH divided by Surv) 8 | # Usefulness: The ratio of diameter base height to survival rate can provide insights into the tree's overall form and ability to withstand external factors, which may affect its utility for soil conservation. 9 | # Input samples: 'DBH': [26.59, 17.01, 7.89], 'Surv': [40.0, 70.0, 45.0] 10 | df['DBH_to_Surv_ratio'] = df['DBH'] / df['Surv'] 11 | 12 | # Ht_to_Surv_ratio (Ht divided by Surv) 13 | # Usefulness: The ratio of height to survival rate can provide insights into the tree's overall form and ability to withstand external factors, which may affect its utility for soil conservation.
14 | # Input samples: 'Ht': [10.8, 12.28, 5.65], 'Surv': [40.0, 70.0, 45.0] 15 | df['Ht_to_Surv_ratio'] = df['Ht'] / df['Surv'] 16 | -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Number of X's in the board) 3 | # Usefulness: The number of X's in the board can be a useful feature to predict the outcome of the game. If there are more X's than O's, it is more likely that X wins. 4 | # Input samples: 'top-left-square': [2, 2, 2], 'top-middle-square': [0, 0, 1], ... 5 | df['num_X'] = df.apply(lambda row: sum([1 for attr in row.index if row[attr] == 1]), axis=1) 6 | # (Number of O's in the board) 7 | # Usefulness: The number of O's in the board can be a useful feature to predict the outcome of the game. If there are more O's than X's, it is more likely that O wins. 8 | # Input samples: 'top-left-square': [2, 2, 2], 'top-middle-square': [0, 0, 1], ... 9 | df['num_O'] = df.apply(lambda row: sum([1 for attr in row.index if row[attr] == 2]), axis=1)  # (Whether the center square is occupied) 10 | # Usefulness: The center square is an important square in tic-tac-toe. If it is occupied, it can be more difficult for the player who did not occupy it to win. 11 | # Input samples: 'top-left-square': [2, 2, 2], 'top-middle-square': [0, 0, 1], ... 12 | df['center_occupied'] = df['middle-middle-square'].apply(lambda x: 1 if x == 1 or x == 2 else 0)  # (Whether the first player won) -------------------------------------------------------------------------------- /data/generated_code/breast-w_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Cell_Size_and_Shape_Uniformity 3 | # Usefulness: Combining Cell_Size_Uniformity and Cell_Shape_Uniformity might capture the overall uniformity of the cells, which could help in classifying the prognosis.
4 | # Input samples: 'Cell_Size_Uniformity': [1.0, 2.0, 6.0], 'Cell_Shape_Uniformity': [1.0, 2.0, 5.0] 5 | df['Cell_Size_and_Shape_Uniformity'] = df['Cell_Size_Uniformity'] * df['Cell_Shape_Uniformity'] 6 | 7 | # Feature: Adhesion_and_Mitoses 8 | # Usefulness: Combining Marginal_Adhesion and Mitoses might capture the relationship between cell adhesion and cell division, which could help in classifying the prognosis. 9 | # Input samples: 'Marginal_Adhesion': [1.0, 1.0, 6.0], 'Mitoses': [1.0, 2.0, 1.0] 10 | df['Adhesion_and_Mitoses'] = df['Marginal_Adhesion'] * df['Mitoses'] 11 | 12 | # Feature: Adhesion_and_Epi_Cell_Size 13 | # Usefulness: Combining Marginal_Adhesion and Single_Epi_Cell_Size might capture the relationship between cell adhesion and cell size, which could help in classifying the prognosis. 14 | # Input samples: 'Marginal_Adhesion': [1.0, 1.0, 6.0], 'Single_Epi_Cell_Size': [2.0, 2.0, 10.0] 15 | df['Adhesion_and_Epi_Cell_Size'] = df['Marginal_Adhesion'] * df['Single_Epi_Cell_Size'] -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | # ('Holding_Policy_Duration_Imputed', 'Useful to capture if the duration of a holding policy was imputed due to missing data.',) 2 | # Input samples: ('Holding_Policy_Duration': ['6.0', '14+', '8.0'], ...) 3 | df['Holding_Policy_Duration_Imputed'] = df['Holding_Policy_Duration'].isna().astype(int)  # ('Previously_Insured_Lower_Age', 'Useful to capture if the policy holder was previously insured and their lower age.',) 4 | # Input samples: ('Lower_Age': [20, 42, 69], 'Reco_Insurance_Type': ['Joint', 'Individual', 'Individual'], ...) 5 | df['Previously_Insured_Lower_Age'] = (df['Reco_Insurance_Type'] == 'Individual').astype(int) * df['Lower_Age']  # Explanation: 'ID' is dropped as it does not provide any useful information for the downstream classifier. 6 | df.drop(columns=['ID'], inplace=True)  # ('Holding_Policy_Duration_Imputed_Holding_Policy_Type', 'Useful to capture if the duration of a holding policy was imputed and the type of the holding policy.',) 7 | # Input samples: ('Holding_Policy_Duration': ['6.0', '14+', '8.0'], 'Holding_Policy_Type': [2.0, 2.0, 1.0], ...) 8 | df['Holding_Policy_Duration_Imputed_Holding_Policy_Type'] = df['Holding_Policy_Duration_Imputed'] * df['Holding_Policy_Type'] -------------------------------------------------------------------------------- /data/generated_code/kaggle_pharyngitis_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | # ('fever_and_cough', 'Whether the patient has both fever and cough') 2 | # Usefulness: Fever and cough are two of the most common symptoms of respiratory infections, including GAS pharyngitis. This feature captures their co-occurrence. 3 | # Input samples: 'temperature': [38.0, 39.0, 39.5], 'cough': [0.0, 1.0, 0.0] 4 | df['fever_and_cough'] = ((df['temperature'] >= 38.0) & (df['cough'] > 0)).astype(int)  # ('fever_and_petechiae', 'Whether the patient has both fever and petechiae') 5 | # Usefulness: Fever and petechiae are two of the most common symptoms of GAS infection. This feature captures their co-occurrence. 6 | # Input samples: 'temperature': [38.0, 39.0, 39.5], 'petechiae': [0.0, 0.0, 0.0] 7 | df['fever_and_petechiae'] = ((df['temperature'] >= 38.0) & (df['petechiae'] > 0)).astype(int)  # ('fever_and_rhinorrhea', 'Whether the patient has both fever and rhinorrhea') 8 | # Usefulness: Fever and rhinorrhea are two of the most common symptoms of respiratory infections, including GAS pharyngitis. This feature captures their co-occurrence.
9 | # Input samples: 'temperature': [38.0, 39.0, 39.5], 'rhinorrhea': [0.0, 0.0, 0.0] 10 | df['fever_and_rhinorrhea'] = ((df['temperature'] >= 38.0) & (df['rhinorrhea'] > 0)).astype(int) -------------------------------------------------------------------------------- /data/generated_code/kaggle_pharyngitis_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | # ('fever_and_cough', 'Whether the patient has both fever and cough') 2 | # Usefulness: Fever and cough are two of the most common symptoms of respiratory infections, including GAS pharyngitis. This feature captures their co-occurrence. 3 | # Input samples: 'temperature': [38.0, 39.0, 39.5], 'cough': [0.0, 1.0, 0.0] 4 | df['fever_and_cough'] = ((df['temperature'] >= 38.0) & (df['cough'] > 0)).astype(int)  # ('fever_and_petechiae', 'Whether the patient has both fever and petechiae') 5 | # Usefulness: Fever and petechiae are two of the most common symptoms of GAS infection. This feature captures their co-occurrence. 6 | # Input samples: 'temperature': [38.0, 39.0, 39.5], 'petechiae': [0.0, 0.0, 0.0] 7 | df['fever_and_petechiae'] = ((df['temperature'] >= 38.0) & (df['petechiae'] > 0)).astype(int)  # ('fever_and_rhinorrhea', 'Whether the patient has both fever and rhinorrhea') 8 | # Usefulness: Fever and rhinorrhea are two of the most common symptoms of respiratory infections, including GAS pharyngitis. This feature captures their co-occurrence.
9 | # Input samples: 'temperature': [38.0, 39.0, 39.5], 'rhinorrhea': [0.0, 0.0, 0.0] 10 | df['fever_and_rhinorrhea'] = ((df['temperature'] >= 38.0) & (df['rhinorrhea'] > 0)).astype(int) -------------------------------------------------------------------------------- /data/generated_code/pc1_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: complexity_ratio 3 | # Usefulness: This feature captures the relationship between "cyclomatic complexity" and "design complexity", which might help in identifying modules with high complexity ratios that are more prone to defects. 4 | # Input samples: 'v(g)': [1.0, 3.0, 6.0], 'iv(G)': [1.0, 2.0, 5.0] 5 | df['complexity_ratio'] = df['v(g)'] / df['iv(G)'] 6 | 7 | # Feature name: code_density 8 | # Usefulness: This feature captures the relationship between the total number of operators and operands and the line count of code. It can help in identifying modules with high code density that might be more prone to defects. 9 | # Input samples: 'total_Op': [62.0, 20.0, 160.0], 'total_Opnd': [61.0, 17.0, 145.0], 'loc': [18.0, 9.0, 35.0] 10 | df['code_density'] = (df['total_Op'] + df['total_Opnd']) / df['loc'] 11 | 12 | # Feature name: unique_operator_ratio 13 | # Usefulness: This feature captures the relationship between the unique operators and the total operators. It might help in identifying modules with high unique operator ratios that are more prone to defects due to higher complexity. 
14 | # Input samples: 'uniq_Op': [4.0, 12.0, 21.0], 'total_Op': [62.0, 20.0, 160.0] 15 | df['unique_operator_ratio'] = df['uniq_Op'] / df['total_Op'] -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Right Moment 3 | # Usefulness: This feature calculates the moment on the right side, which is the product of right-weight and right-distance. This is useful because the balance scale depends on the moments on both sides. 4 | # Input samples: 'right-weight': [3.0, 4.0, 5.0], 'right-distance': [2.0, 2.0, 5.0] 5 | df['right_moment'] = df['right-weight'] * df['right-distance'] 6 | # Moment Difference 7 | # Usefulness: This feature calculates the difference between the left moment and the right moment. A positive value indicates the balance scale tips to the left, a negative value indicates it tips to the right, and a value close to zero indicates it is balanced. 8 | # Input samples: 'left-weight': [2.0, 3.0, 1.0], 'left-distance': [4.0, 5.0, 4.0], 'right-weight': [3.0, 4.0, 5.0], 'right-distance': [2.0, 2.0, 5.0] 9 | df['moment_diff'] = (df['left-weight'] * df['left-distance']) - (df['right-weight'] * df['right-distance']) 10 | # Drop redundant columns 11 | # The original columns 'left-weight', 'left-distance', 'right-weight', and 'right-distance' can be dropped as they are now represented by the new features 'left_moment', 'right_moment', and 'moment_diff'. 
12 | df.drop(columns=['left-weight', 'left-distance', 'right-weight', 'right-distance'], inplace=True) 13 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name="caafe", 5 | version="0.1.5", 6 | packages=find_packages(), 7 | description="Context-Aware Automated Feature Engineering (CAAFE) is an automated machine learning tool that uses large language models for feature engineering in tabular datasets. It generates Python code for new features along with explanations for their utility, enhancing interpretability.", 8 | long_description=open("README.md").read(), 9 | long_description_content_type="text/markdown", 10 | author="Noah Hollmann, Samuel Müller, Frank Hutter", 11 | author_email="noah.homa@gmail.com", 12 | url="https://github.com/automl/CAAFE", 13 | license="LICENSE.txt", 14 | classifiers=[ 15 | "Development Status :: 3 - Alpha", 16 | "License :: Free for non-commercial use", 17 | "Programming Language :: Python :: 3.7", 18 | "Programming Language :: Python :: 3.8", 19 | "Programming Language :: Python :: 3.9", 20 | ], 21 | python_requires=">=3.7", 22 | install_requires=[ 23 | "openai==0.28", 24 | "kaggle", 25 | "openml==0.12.0", 26 | "tabpfn", 27 | ], 28 | extras_require={ 29 | "full": ["autofeat", "featuretools", "tabpfn[full]"], 30 | }, 31 | ) 32 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_spaceship-titanic_v3_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('FamilySize', 'The total number of family members aboard the Spaceship Titanic') 3 | # Usefulness: Family size may have an impact on whether a passenger was transported, as larger families may have been more likely to stick together and be transported as a group. 
4 | # Input samples: 'PassengerId': ['8878_01', '7749_02', '3955_01'], 'HomePlanet': ['Earth', 'Mars', 'Earth'], 'Cabin': ['E/568/P', 'D/244/P', 'G/648/S'] 5 | df['FamilySize'] = df.groupby(df['PassengerId'].str.split('_').str[0])['PassengerId'].transform('count') 6 | # ('IsAdult', 'Whether the passenger is an adult or not') 7 | # Usefulness: Age may have an impact on whether a passenger was transported, but instead of using age directly, we can create a binary feature based on whether the passenger is an adult or not. 8 | # Input samples: 'Age': [22.0, 37.0, 45.0] 9 | df['IsAdult'] = (df['Age'] >= 18).astype(int)  # ('IsVIPandCryo', 'Whether the passenger is both a VIP and in cryosleep') 10 | # Usefulness: Passengers who are both VIP and in cryosleep may be less likely to be transported, as they may have paid more for the voyage and been more invested in its success. 11 | # Input samples: 'VIP': [False, False, False], 'CryoSleep': [False, False, True] 12 | df['IsVIPandCryo'] = (df['VIP'] & df['CryoSleep']).astype(int) -------------------------------------------------------------------------------- /data/generated_code/cmc_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Age_and_education 3 | # Usefulness: Combining wife's age and education can show the effect of education level on the contraceptive method used at different ages. 4 | # Input samples: 'Wifes_age': [46.0, 45.0, 39.0], 'Wifes_education': [3, 2, 2] 5 | df['Age_and_education'] = df['Wifes_age'] * df['Wifes_education'] 6 | 7 | # Age_and_children 8 | # Usefulness: Combining wife's age and number of children ever born can show the relationship between age and fertility, which may influence contraceptive method choice.
9 | # Input samples: 'Wifes_age': [46.0, 45.0, 39.0], 'Number_of_children_ever_born': [5.0, 6.0, 6.0] 10 | df['Age_and_children'] = df['Wifes_age'] * df['Number_of_children_ever_born'] 11 | 12 | # Education_gap 13 | # Usefulness: The difference in education levels between the wife and husband may affect the contraceptive method choice. 14 | # Input samples: 'Wifes_education': [3, 2, 2], 'Husbands_education': [2, 3, 3] 15 | df['Education_gap'] = df['Wifes_education'] - df['Husbands_education'] 16 | 17 | # Religion_and_media 18 | # Usefulness: Combining wife's religion and media exposure can show the influence of religious beliefs and media exposure on contraceptive method choice. 19 | # Input samples: 'Wifes_religion': [1, 1, 1], 'Media_exposure': [0, 0, 0] 20 | df['Religion_and_media'] = df['Wifes_religion'] * df['Media_exposure'] 21 | -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | # ('right-weight' * 'right-distance') 2 | # Usefulness: This feature represents the right side's torque, which is a crucial factor in determining the balance of the scale. 3 | # Input samples: 'right-weight': [1.0, 3.0, 5.0], 'right-distance': [5.0, 2.0, 4.0] 4 | df['right-torque'] = df['right-weight'] * df['right-distance']  # ('left-weight' + 'right-weight') / ('left-distance' + 'right-distance') 5 | # Usefulness: This feature represents the weight-to-distance ratio for both sides of the scale, which can help determine the balance of the scale.
6 | # Input samples: 'left-weight': [1.0, 4.0, 2.0], 'left-distance': [2.0, 2.0, 3.0], 'right-weight': [1.0, 3.0, 5.0], 'right-distance': [5.0, 2.0, 4.0] 7 | df['weight-distance-ratio'] = (df['left-weight'] + df['right-weight']) / (df['left-distance'] + df['right-distance'])# ('left-weight' - 'right-weight') / ('left-distance' - 'right-distance') 8 | # Usefulness: This feature represents the difference in weight-to-distance ratio between the left and right sides of the scale, which can help determine the direction of the tipping. 9 | # Input samples: 'left-weight': [1.0, 4.0, 2.0], 'left-distance': [2.0, 2.0, 3.0], 'right-weight': [1.0, 3.0, 5.0], 'right-distance': [5.0, 2.0, 4.0] 10 | df['weight-distance-ratio-diff'] = (df['left-weight'] - df['right-weight']) / (df['left-distance'] - df['right-distance']).replace(0, float('nan'))  # NaN instead of inf when both distances are equal -------------------------------------------------------------------------------- /data/generated_code/kaggle_playground-series-s3e12_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Calc_to_ph_diff) 3 | # Usefulness: The difference between the pH and calcium concentration can give insight into the formation of calcium oxalate crystals. 4 | # Input samples: 'ph': [5.24, 5.41, 6.79], 'calc': [4.49, 0.83, 0.58] 5 | df['Calc_to_ph_diff'] = df['calc'] - df['ph'] 6 | # Explanation: The column 'id' is dropped as it is not informative for the classification task. 7 | df.drop(columns=['id'], inplace=True)# (Urea_to_gravity_ratio) 8 | # Usefulness: The ratio between urea concentration and specific gravity can give insight into the concentration of molecules in the urine which may be related to the formation of calcium oxalate crystals. 9 | # Input samples: 'urea': [550.0, 159.0, 199.0], 'gravity': [1.03, 1.01, 1.02] 10 | df['Urea_to_gravity_ratio'] = df['urea'] / df['gravity']# Explanation: The column 'osmo' is dropped as it is highly correlated with the other concentration features ('urea', 'calc').
11 | df.drop(columns=['osmo'], inplace=True)# (Urea_to_calc_ratio) 12 | # Usefulness: The ratio between urea concentration and calcium concentration can give insight into the concentration of molecules in the urine which may be related to the formation of calcium oxalate crystals. 13 | # Input samples: 'urea': [550.0, 159.0, 199.0], 'calc': [4.49, 0.83, 0.58] 14 | df['Urea_to_calc_ratio'] = df['urea'] / df['calc'] -------------------------------------------------------------------------------- /data/generated_code/kaggle_pharyngitis_v3_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('age_y_2', 'Squared age_y', 'Usefulness: This feature allows the model to capture a potential quadratic relationship between age and the outcome.', 3 | # 'age_y': [11.6, 8.2, 5.1], ...) 4 | df['age_y_2'] = df['age_y'] ** 2 5 | # ('swollen_tonsils', 'Swollen tonsils', 'Usefulness: This feature captures the presence of both swollen tonsils and tonsillar swelling.', 6 | # 'swollenadp': [1.0, 0.0, 2.0], 'tonsillarswelling': [0.0, 1.0, 0.0], ...) 7 | df['swollen_tonsils'] = ((df['swollenadp'] > 0) & (df['tonsillarswelling'] > 0)).astype(int)# ('fever_cough', 'Fever and cough', 'Usefulness: This feature captures the presence of both fever and cough.', 8 | # 'temperature': [38.8, 38.6, 39.5], 'cough': [1.0, 1.0, 0.0], ...) 9 | df['fever_cough'] = ((df['temperature'] > 38) & (df['cough'] > 0)).astype(int)# ('fever_rhinorrhea', 'Fever and rhinorrhea', 'Usefulness: This feature captures the presence of both fever and rhinorrhea.', 10 | # 'temperature': [38.8, 38.6, 39.5], 'rhinorrhea': [1.0, 1.0, 0.0], ...) 11 | df['fever_rhinorrhea'] = ((df['temperature'] > 38) & (df['rhinorrhea'] > 0)).astype(int)# ('age_pain_interaction', 'Interaction between age and pain', 'Usefulness: This feature captures the interaction between age and pain.', 12 | # 'age_y': [11.6, 8.2, 5.1], 'pain': [1.0, 1.0, 1.0], ...) 
13 | df['age_pain_interaction'] = df['age_y'] * df['pain'] -------------------------------------------------------------------------------- /data/dataset_descriptions/openml_cmc.txt: -------------------------------------------------------------------------------- 1 | 4. Relevant Information: 2 | This dataset is a subset of the 1987 National Indonesia Contraceptive 3 | Prevalence Survey. The samples are married women who were either not 4 | pregnant or do not know if they were at the time of interview. The 5 | problem is to predict the current contraceptive method choice 6 | (no use, long-term methods, or short-term methods) of a woman based 7 | on her demographic and socio-economic characteristics. 8 | 9 | 7. Attribute Information: 10 | 11 | 1. Wife's age (numerical) 12 | 2. Wife's education (categorical) 1=low, 2, 3, 4=high 13 | 3. Husband's education (categorical) 1=low, 2, 3, 4=high 14 | 4. Number of children ever born (numerical) 15 | 5. Wife's religion (binary) 0=Non-Islam, 1=Islam 16 | 6. Wife's now working? (binary) 0=Yes, 1=No 17 | 7. Husband's occupation (categorical) 1, 2, 3, 4 18 | 8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high 19 | 9. Media exposure (binary) 0=Good, 1=Not good 20 | 10. Contraceptive method used (class attribute) 1=No-use 21 | 2=Long-term 22 | 3=Short-term -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Reco_Policy_Premium_Ratio 3 | # Usefulness: The ratio between Reco_Policy_Premium and the age difference can provide information on the premium cost per age difference, which might be related to the likelihood of a positive response. 
4 | # Input samples: 'Reco_Policy_Premium': [16968.0, 11322.0, 17430.0], 'Upper_Age': [54, 42, 69], 'Lower_Age': [20, 42, 69] 5 | df['Reco_Policy_Premium_Ratio'] = df['Reco_Policy_Premium'] / (df['Upper_Age'] - df['Lower_Age'] + 1)  # +1 avoids division by zero when Upper_Age == Lower_Age 6 | # Feature: Is_Spouse_Encoded 7 | # Usefulness: Encoding the Is_Spouse feature can help the classifier better understand the relationship status information and its relation to the target variable. 8 | # Input samples: 'Is_Spouse': ['Yes', 'No', 'No'] 9 | df['Is_Spouse_Encoded'] = df['Is_Spouse'].map({'Yes': 1, 'No': 0}) 10 | 11 | # Feature: Reco_Insurance_Type_Encoded 12 | # Usefulness: Encoding the Reco_Insurance_Type feature can help the classifier better understand the type of insurance recommendation and its relation to the target variable. 13 | # Input samples: 'Reco_Insurance_Type': ['Joint', 'Individual', 'Individual'] 14 | df['Reco_Insurance_Type_Encoded'] = df['Reco_Insurance_Type'].map({'Joint': 1, 'Individual': 0}) 15 | 16 | # Dropping original Reco_Insurance_Type column as it is now encoded 17 | df.drop(columns=['Reco_Insurance_Type'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/eucalyptus_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Altitude_Rain_ratio 3 | # Usefulness: This feature calculates the ratio of altitude to rainfall which can be useful in determining the utility of the trees for soil conservation, as it takes into account the environmental factors affecting tree growth. 4 | # Input samples: 'Altitude': [150.0, 200.0, 150.0], 'Rainfall': [1250.0, 1400.0, 900.0] 5 | df['Altitude_Rain_ratio'] = df['Altitude'] / df['Rainfall'] 6 | 7 | # Survival_rate 8 | # Usefulness: This feature approximates a per-year survival rate by dividing the survival percentage ('Surv') by the number of years since planting (relative to 2021).
It can be useful in determining the utility of the trees for soil conservation, as it takes into account the survival of trees. 9 | # Input samples: 'Surv': [75.0, nan, nan], 'Year': [1983.0, 1980.0, 1986.0] 10 | df['Survival_rate'] = df['Surv'] / (2021 - df['Year']) 11 | 12 | # Avg_rating 13 | # Usefulness: This feature calculates the average rating of vigour, insect resistance, stem form, crown form, and branch form. It can be useful in determining the utility of the trees for soil conservation, as it takes into account the overall quality of trees. 14 | # Input samples: 'Vig': [4.5, 3.3, nan], 'Ins_res': [3.5, 4.0, nan], 'Stem_Fm': [3.3, 3.0, nan], 'Crown_Fm': [3.2, 3.5, nan], 'Brnch_Fm': [2.8, 3.0, nan] 15 | df['Avg_rating'] = df[['Vig', 'Ins_res', 'Stem_Fm', 'Crown_Fm', 'Brnch_Fm']].mean(axis=1) 16 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_spaceship-titanic_v3_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('IsAdult', 'Indicates whether the passenger is an adult (age >= 18).') 3 | # Usefulness: Adults may behave differently than children during an emergency situation such as a spaceship collision. This column captures this difference in behavior. 4 | # Input samples: 'Age': [2.0, 44.0, 28.0] 5 | df['IsAdult'] = (df['Age'] >= 18).astype(int) 6 | # ('FamilySize', 'Total number of family members (including self) aboard the Spaceship Titanic.') 7 | # Usefulness: Family size may impact the likelihood of a passenger being transported to another dimension. A larger family size may increase the chances of being transported. 
8 | # Input samples: 'PassengerId': ['5909_03', '4256_08', '2000_01'], 'HomePlanet': ['Earth', 'Earth', 'Europa'], 'Cabin': ['G/961/S', 'F/880/P', 'C/76/S'], 'Transported': [True, False, True] 9 | df['FamilySize'] = df.groupby(df['PassengerId'].str.split('_').str[0])['PassengerId'].transform('count')# ('IsChildandHasCabin', 'Indicates whether the passenger is a child (age < 18) and has a cabin on the spaceship.') 10 | # Usefulness: Being a child and having a cabin may indicate a higher social status and may therefore impact the likelihood of being transported to another dimension. 11 | # Input samples: 'Age': [2.0, 44.0, 28.0], 'Cabin': ['G/961/S', 'F/880/P', 'C/76/S'] 12 | df['IsChildandHasCabin'] = ((df['Age'] < 18) & (~df['Cabin'].isna())).astype(int) -------------------------------------------------------------------------------- /data/generated_code/cmc_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Wife_and_husband_education) 3 | # Usefulness: A woman's contraceptive method choice may be influenced by the education level of both herself and her husband. This new column combines the wife's and husband's education levels to capture this relationship. 4 | # Input samples: 'Wifes_education': [3, 2, 2], 'Husbands_education': [2, 3, 3] 5 | df['Wife_and_husband_education'] = df['Wifes_education'] + df['Husbands_education'] 6 | # (Number_of_children_and_age) 7 | # Usefulness: A woman's age and the number of children she has ever born may be correlated with her contraceptive method choice. This new column combines these two features to capture this relationship. 8 | # Input samples: 'Wifes_age': [46.0, 45.0, 39.0], 'Number_of_children_ever_born': [5.0, 6.0, 6.0] 9 | df['Number_of_children_and_age'] = df['Wifes_age'] * df['Number_of_children_ever_born']# Explanation: Wife's religion is dropped because it is a binary feature with low variance, which may not be useful for the downstream classifier. 
10 | df.drop(columns=['Wifes_religion'], inplace=True)# (Age_diff) 11 | # Usefulness: Intended to capture the age difference between a woman and her husband, which may be correlated with her contraceptive method choice; note, however, that the dataset has no husband's-age attribute, so the code below actually subtracts the husband's education code (1-4) from the wife's age. 12 | # Input samples: 'Wifes_age': [46.0, 45.0, 39.0], 'Husbands_education': [2, 3, 3] 13 | df['Age_diff'] = abs(df['Wifes_age'] - df['Husbands_education']) -------------------------------------------------------------------------------- /data/generated_code/kaggle_pharyngitis_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('age_y_above_10', 'Whether the patient age is above 10 years old') 3 | # Usefulness: Older children may exhibit different symptoms and signs than younger children, so this feature can help capture that difference. 4 | # Input samples: 'age_y': [4.4, 11.3, 5.8] 5 | df['age_y_above_10'] = (df['age_y'] > 10).astype(int) 6 | # ('pain_swollenadp', 'Whether the patient has both pain and swollen adenoids') 7 | # Usefulness: Pain and swollen adenoids are both common symptoms of GAS pharyngitis, and the presence of both may indicate a higher likelihood of a positive RADT result. 8 | # Input samples: 'pain': [1.0, 1.0, 1.0], 'swollenadp': [0.0, 2.0, 0.0] 9 | df['pain_swollenadp'] = ((df['pain'] == 1) & (df['swollenadp'] > 0)).astype(int)# Explanation: 'age_y_above_10' is a feature that has already been added, but it may be useful to also include the opposite feature. 10 | df['age_y_below_10'] = (df['age_y'] <= 10).astype(int) 11 | 12 | # Explanation: 'pain' and 'tender' are both symptoms that may indicate GAS pharyngitis. This feature captures the presence of either symptom. 13 | df['pain_or_tender'] = ((df['pain'] == 1) | (df['tender'] == 1)).astype(int) 14 | 15 | # Explanation: 'swollenadp' and 'tender' are both symptoms that may indicate GAS pharyngitis. This feature captures the presence of either symptom.
16 | df['swollenadp_or_tender'] = ((df['swollenadp'] > 0) | (df['tender'] == 1)).astype(int) -------------------------------------------------------------------------------- /data/generated_code/breast-w_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Chromatin_and_Nucleoli 3 | # Usefulness: Combining bland chromatin and normal nucleoli can capture the relationship between chromatin and nucleoli, which could be related to the malignancy of the tumor. 4 | # Input samples: 'Bland_Chromatin': [2.0, 2.0, 1.0], 'Normal_Nucleoli': [1.0, 1.0, 1.0] 5 | df['Chromatin_and_Nucleoli'] = df['Bland_Chromatin'] * df['Normal_Nucleoli'] 6 | 7 | # Drop columns that might be redundant after creating the new feature 8 | # Explanation: Columns 'Bland_Chromatin' and 'Normal_Nucleoli' are combined into the new feature 'Chromatin_and_Nucleoli', which might capture their relationship better. 9 | df.drop(columns=['Bland_Chromatin', 'Normal_Nucleoli'], inplace=True) 10 | 11 | # Feature: Clump_Thickness_and_Mitoses 12 | # Usefulness: Combining clump thickness and mitoses can capture the relationship between the thickness of the clump and the number of mitoses, which could be related to the malignancy of the tumor. 13 | # Input samples: 'Clump_Thickness': [1.0, 5.0, 4.0], 'Mitoses': [1.0, 1.0, 1.0] 14 | df['Clump_Thickness_and_Mitoses'] = df['Clump_Thickness'] * df['Mitoses'] 15 | 16 | # Drop columns that might be redundant after creating the new feature 17 | # Explanation: Columns 'Clump_Thickness' and 'Mitoses' are combined into the new feature 'Clump_Thickness_and_Mitoses', which might capture their relationship better. 
18 | df.drop(columns=['Clump_Thickness', 'Mitoses'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Strength difference between white and black pieces 3 | # Usefulness: This feature calculates the difference in strength between the white and black pieces, which can be useful in determining the likelihood of a piece capturing another piece or being in a winning position. 4 | # Input samples: 'white_piece0_strength': [7.0, 6.0, 4.0], 'black_piece0_strength': [4.0, 6.0, 7.0] 5 | df['strength_difference'] = df['white_piece0_strength'] - df['black_piece0_strength'] 6 | 7 | # Feature: File difference between white and black pieces 8 | # Usefulness: This feature calculates the difference in file between the white and black pieces, which can be useful in determining the likelihood of a piece capturing another piece or being in a winning position. 9 | # Input samples: 'white_piece0_file': [0.0, 0.0, 6.0], 'black_piece0_file': [5.0, 0.0, 3.0] 10 | df['file_difference'] = df['white_piece0_file'] - df['black_piece0_file'] 11 | 12 | # Feature: Product of strengths between white and black pieces 13 | # Usefulness: This feature calculates the product of strengths between the white and black pieces, which can be useful in determining the overall strength of the pieces on the board and their likelihood of capturing other pieces. 
14 | # Input samples: 'white_piece0_strength': [7.0, 6.0, 4.0], 'black_piece0_strength': [4.0, 6.0, 7.0] 15 | df['strength_product'] = df['white_piece0_strength'] * df['black_piece0_strength'] 16 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_playground-series-s3e12_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('calc_to_urea_ratio', 'Ratio of calcium concentration to urea concentration') 3 | # Usefulness: Calcium oxalate stones are the most common type of kidney stone. The ratio of calcium to urea concentration in urine has been shown to be a useful predictor of calcium oxalate stone formation. 4 | # Input samples: 'calc': [7.68, 2.17, 12.68], 'urea': [396.0, 159.0, 364.0] 5 | df['calc_to_urea_ratio'] = df['calc'] / df['urea'] 6 | # ('is_acidic', 'Whether the urine is acidic (pH < 7)') 7 | # Usefulness: The pH of urine can affect the formation of kidney stones. Urine that is too acidic or too alkaline can promote the formation of certain types of kidney stones. 8 | # Input samples: 'ph': [5.58, 5.09, 5.24] 9 | df['is_acidic'] = (df['ph'] < 7).astype(int) 10 | # ('calc_to_ph_product', 'Product of calcium concentration and pH') 11 | # Usefulness: The product of calcium concentration and pH in urine has been shown to be a useful predictor of calcium oxalate stone formation. 12 | # Input samples: 'calc': [7.68, 2.17, 12.68], 'ph': [5.58, 5.09, 5.24] 13 | df['calc_to_ph_product'] = df['calc'] * df['ph']# ('osmo_to_urea_ratio', 'Ratio of osmolarity to urea concentration') 14 | # Usefulness: The ratio of osmolarity to urea concentration in urine has been shown to be a useful predictor of kidney stone formation. 
15 | # Input samples: 'osmo': [945.0, 371.0, 703.0], 'urea': [396.0, 159.0, 364.0] 16 | df['osmo_to_urea_ratio'] = df['osmo'] / df['urea'] -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v3_1_code.txt: -------------------------------------------------------------------------------- 1 | # ('Holding_Policy_Duration_Imputed', 'Use the median of Holding_Policy_Duration to impute missing values in this column. Replace "14+" with 15.', 2 | # {'Holding_Policy_Duration': ['3.0', '3.0', '6.0']}) 3 | median_duration = df['Holding_Policy_Duration'].replace('14+', 15).astype(float).median() 4 | df['Holding_Policy_Duration_Imputed'] = df['Holding_Policy_Duration'].replace('14+', 15).astype(float).fillna(median_duration)# ('Age_Bin', 'Use the age of the person to create bins. Insurance needs may vary by age.', 5 | # {'Upper_Age': [48, 58, 65], 'Lower_Age': [48, 50, 60]}) 6 | bins = [0, 30, 40, 50, 60, 70, 100] 7 | labels = ['0-30', '30-40', '40-50', '50-60', '60-70', '70+'] 8 | df['Age_Bin'] = pd.cut(df['Upper_Age'], bins=bins, labels=labels)# ('Holding_Policy_Duration_Multiplied', 'Multiply Holding_Policy_Duration_Imputed and Holding_Policy_Type.', 9 | # {'Holding_Policy_Duration_Imputed': [3.0, 3.0, 6.0], 'Holding_Policy_Type': [3.0, 4.0, 3.0]}) 10 | df['Holding_Policy_Duration_Multiplied'] = df['Holding_Policy_Duration_Imputed'] * df['Holding_Policy_Type']# ('Family_Size', 'Use the Is_Spouse and Upper_Age-Lower_Age to create a family size column. 
This may be useful for predicting insurance needs.', 11 | # {'Is_Spouse': ['No', 'Yes', 'Yes'], 'Upper_Age': [48, 58, 65], 'Lower_Age': [48, 50, 60]}) 12 | df['Family_Size'] = df.apply(lambda row: 1 if row['Is_Spouse'] == 'No' else row['Upper_Age']-row['Lower_Age']+1, axis=1) -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v3_3_code.txt: -------------------------------------------------------------------------------- 1 | # ('Region_Code_Count', 'Use the count of Region_Code to capture the frequency of a region.') 2 | # Usefulness: The frequency of a region may be an important predictor for the insurance purchase decision. 3 | # Input samples: 'Region_Code': [598, 4855, 529], 'ID': [46203, 7682, 43204] 4 | df['Region_Code_Count'] = df.groupby('Region_Code')['ID'].transform('count')# ('Spouse_Age_Difference', 'Use the difference between Upper_Age and Lower_Age for spouses to capture the age difference within a household.') 5 | # Usefulness: The age difference within a household may be an important predictor for the insurance purchase decision, especially for married couples. 6 | # Input samples: 'Upper_Age': [56, 43, 33], 'Lower_Age': [56, 42, 20], 'Is_Spouse': ['No', 'Yes', 'No'] 7 | import numpy as np 8 | df['Spouse_Age_Difference'] = df.loc[df['Is_Spouse']=='Yes', 'Upper_Age'] - df.loc[df['Is_Spouse']=='Yes', 'Lower_Age'] 9 | df.loc[df['Is_Spouse']=='No', 'Spouse_Age_Difference'] = np.nan# ('Premium_to_Age_Ratio_Bucket', 'Use the Premium_to_Age_Ratio to create buckets.') 10 | # Usefulness: Buckets for Premium_to_Age_Ratio may be a better predictor for the insurance purchase decision than the continuous variable. 
11 | # Input samples: 'Reco_Policy_Premium': [16172.0, 19272.0, 13661.2], 'Upper_Age': [56, 43, 33] 12 | df['Premium_to_Age_Ratio_Bucket'] = pd.cut(x=df['Reco_Policy_Premium']/df['Upper_Age'], bins=[0, 200, 400, 600, 1000], labels=['0-200', '200-400', '400-600', '600+']) -------------------------------------------------------------------------------- /data/generated_code/cmc_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: Education_gap 3 | # Usefulness: The difference in education levels between the wife and husband may influence the couple's contraceptive method choice. 4 | # Input samples: 'Wifes_education': [3, 3, 2], 'Husbands_education': [3, 3, 3] 5 | df['Education_gap'] = df['Wifes_education'] - df['Husbands_education'] 6 | 7 | # Feature name: Age_per_child 8 | # Usefulness: The age per child can indicate the couple's family planning decisions, which may influence their contraceptive method choice. 9 | # Input samples: 'Wifes_age': [48.0, 32.0, 28.0], 'Number_of_children_ever_born': [6.0, 2.0, 2.0] 10 | df['Age_per_child'] = df['Wifes_age'] / (df['Number_of_children_ever_born'] + 1) 11 | # Feature name: Religion_and_working 12 | # Usefulness: The interaction between the wife's religion and her working status may influence the couple's contraceptive method choice based on cultural and financial factors. 13 | # Input samples: 'Wifes_religion': [0, 1, 0], 'Wifes_now_working%3F': [1, 1, 1] 14 | df['Religion_and_working'] = df['Wifes_religion'] * df['Wifes_now_working%3F'] 15 | # Feature name: Occupation_and_standard_living 16 | # Usefulness: The interaction between the husband's occupation and the standard-of-living index may influence the couple's contraceptive method choice based on their financial situation and job stability. 
17 | # Input samples: 'Husbands_occupation': [1, 1, 2], 'Standard-of-living_index': [3, 2, 2] 18 | df['Occupation_and_standard_living'] = df['Husbands_occupation'] * df['Standard-of-living_index'] -------------------------------------------------------------------------------- /data/generated_code/credit-g_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Credit amount to duration ratio 3 | # Usefulness: This feature represents the amount of credit per month, which can help identify the risk level of the customer. 4 | # Input samples: 'credit_amount': [3609.0, 3331.0, 1311.0], 'duration': [48.0, 12.0, 24.0] 5 | df['credit_amount_duration_ratio'] = df['credit_amount'] / df['duration'] 6 | 7 | # Credit amount to age ratio 8 | # Usefulness: This feature represents the amount of credit relative to the age of the customer, which can help assess the risk level and financial stability of the customer. 9 | # Input samples: 'credit_amount': [3609.0, 3331.0, 1311.0], 'age': [27.0, 42.0, 26.0] 10 | df['credit_amount_age_ratio'] = df['credit_amount'] / df['age'] 11 | 12 | # Duration to residence since ratio 13 | # Usefulness: This feature represents the proportion of the credit duration with respect to the time the customer has been residing at their current residence, which can help assess the risk level and stability of the customer. 14 | # Input samples: 'duration': [48.0, 12.0, 24.0], 'residence_since': [1.0, 4.0, 3.0] 15 | df['duration_residence_since_ratio'] = df['duration'] / df['residence_since'] 16 | 17 | # Drop num_dependents column as it may be redundant and hurt predictive performance 18 | # Explanation: The number of people being liable to provide maintenance for may not provide significant information about the credit risk of a customer, especially when other features like credit amount and duration are considered. 
19 | df.drop(columns=['num_dependents'], inplace=True) 20 | -------------------------------------------------------------------------------- /data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Number of pieces with strength 0 on the board) 3 | # Usefulness: This feature adds information on the number of pieces that are not contributing to the game. This may be useful in determining a player's strategy. 4 | # Input samples: 'white_piece0_strength': [0.0, 4.0, 4.0], 'black_piece0_strength': [7.0, 0.0, 0.0], ... 5 | df['num_zero_strength'] = (df['white_piece0_strength'] == 0).astype(int) + (df['black_piece0_strength'] == 0).astype(int) 6 | # (Difference in number of pieces with strength 7 between white and black) 7 | # Usefulness: This feature adds information on the relative strength of the pieces on the board. This may be useful in determining a player's strategy. 8 | # Input samples: 'white_piece0_strength': [0.0, 4.0, 4.0], 'black_piece0_strength': [7.0, 0.0, 0.0], ... 9 | df['strength_7_diff'] = (df['white_piece0_strength'] == 7).astype(int) - (df['black_piece0_strength'] == 7).astype(int)# (Difference in number of pieces with strength 6 or 7 between white and black) 10 | # Usefulness: This feature adds information on the relative strength of the pieces on the board. This may be useful in determining a player's strategy. 11 | # Input samples: 'white_piece0_strength': [0.0, 4.0, 4.0], 'black_piece0_strength': [7.0, 0.0, 0.0], ... 
12 | if 'RAW Complete' in df.columns.tolist(): 13 | df.drop(columns=['RAW Complete'], inplace=True) 14 | df['strength_6_7_diff'] = ((df['white_piece0_strength'] == 6) | (df['white_piece0_strength'] == 7)).astype(int) - ((df['black_piece0_strength'] == 6) | (df['black_piece0_strength'] == 7)).astype(int) -------------------------------------------------------------------------------- /data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Strength difference between white and black pieces 3 | # Usefulness: The difference in strength between the white and black pieces can indicate which player has a better chance of winning. 4 | # Input samples: 'white_piece0_strength': [7.0, 0.0, 4.0], 'black_piece0_strength': [0.0, 4.0, 0.0] 5 | df['strength_diff'] = df['white_piece0_strength'] - df['black_piece0_strength'] 6 | # Feature: Sum of ranks of white and black pieces 7 | # Usefulness: The sum of ranks of the white and black pieces can provide information about their overall position on the board, which can impact the game's outcome. 8 | # Input samples: 'white_piece0_rank': [7.0, 2.0, 6.0], 'black_piece0_rank': [6.0, 5.0, 7.0] 9 | df['sum_of_ranks'] = df['white_piece0_rank'] + df['black_piece0_rank']# Feature: Product of strengths of white and black pieces 10 | # Usefulness: The product of strengths of the white and black pieces can provide information about the overall power balance in the game, which can impact the game's outcome. 11 | # Input samples: 'white_piece0_strength': [7.0, 0.0, 4.0], 'black_piece0_strength': [0.0, 4.0, 0.0] 12 | df['strength_product'] = df['white_piece0_strength'] * df['black_piece0_strength']# Feature: Difference in files of white and black pieces 13 | # Usefulness: The difference in files of the white and black pieces can provide information about their horizontal position on the board, which can impact the game's outcome. 
14 | # Input samples: 'white_piece0_file': [6.0, 3.0, 1.0], 'black_piece0_file': [5.0, 0.0, 2.0] 15 | df['file_diff'] = df['white_piece0_file'] - df['black_piece0_file'] -------------------------------------------------------------------------------- /caafe/metrics.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import ( 2 | roc_auc_score, 3 | accuracy_score, 4 | balanced_accuracy_score, 5 | average_precision_score, 6 | mean_squared_error, 7 | mean_absolute_error, 8 | r2_score, 9 | ) 10 | import torch 11 | import numpy as np 12 | 13 | 14 | def auc_metric(target, pred, multi_class="ovo", numpy=False):  # ROC AUC for binary or multiclass targets; accepts torch tensors or numpy arrays 15 | lib = np if numpy else torch 16 | try: 17 | if not numpy: 18 | target = torch.tensor(target) if not torch.is_tensor(target) else target 19 | pred = torch.tensor(pred) if not torch.is_tensor(pred) else pred 20 | if len(lib.unique(target)) > 2: 21 | if not numpy: 22 | return torch.tensor( 23 | roc_auc_score(target, pred, multi_class=multi_class) 24 | ) 25 | return roc_auc_score(target, pred, multi_class=multi_class) 26 | else: 27 | if len(pred.shape) == 2: 28 | pred = pred[:, 1] 29 | if not numpy: 30 | return torch.tensor(roc_auc_score(target, pred)) 31 | return roc_auc_score(target, pred) 32 | except ValueError as e: 33 | print(e) 34 | return np.nan if numpy else torch.tensor(np.nan) 35 | 36 | 37 | def accuracy_metric(target, pred):  # accuracy from predicted probabilities: argmax for multiclass, 0.5 threshold for binary 38 | target = torch.tensor(target) if not torch.is_tensor(target) else target 39 | pred = torch.tensor(pred) if not torch.is_tensor(pred) else pred 40 | if len(torch.unique(target)) > 2: 41 | return torch.tensor(accuracy_score(target, torch.argmax(pred, -1))) 42 | else: 43 | return torch.tensor(accuracy_score(target, pred[:, 1] > 0.5)) 44 | -------------------------------------------------------------------------------- /data/generated_code/pc1_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Unique Operand
Ratio 3 | # Usefulness: The ratio of unique operands (uniq_Opnd) to total operands (total_Opnd) can provide insights into the diversity of data used in the code, which can be an indicator of code complexity and potential defects. 4 | # Input samples: 'uniq_Opnd': [18.0, 13.0, 18.0], 'total_Opnd': [23.0, 38.0, 26.0] 5 | df['unique_operand_ratio'] = df['uniq_Opnd'] / df['total_Opnd'] 6 | # Dropping correlated features 7 | # Explanation: Dropping features that are highly correlated with other features can help reduce multicollinearity and improve the performance of the downstream classifier. 8 | 9 | # 'N' is correlated with 'total_Op' and 'total_Opnd' as they all represent the total number of operators and operands. 10 | df.drop(columns=['N'], inplace=True) 11 | 12 | # 'lOCode' and 'loc' are correlated as they both represent the line count of code. 13 | df.drop(columns=['lOCode'], inplace=True) 14 | 15 | # 'V' is correlated with 'total_Op' and 'total_Opnd' as it represents the Halstead volume. 16 | df.drop(columns=['V'], inplace=True) 17 | 18 | # 'L' is correlated with 'uniq_Op' and 'uniq_Opnd' as it represents the Halstead program length. 19 | df.drop(columns=['L'], inplace=True) 20 | # Feature: Code to Comment Ratio 21 | # Usefulness: The ratio of lines of code (loc) to lines of comments (lOComment) can provide insights into the balance between code and documentation, which can be related to code maintainability and potential defects. 
22 | # Input samples: 'loc': [15.0, 13.0, 14.0], 'lOComment': [8.0, 0.0, 1.0] 23 | df['code_to_comment_ratio'] = df['loc'] / (df['lOComment'] + 1) # Adding 1 to avoid division by zero -------------------------------------------------------------------------------- /data/generated_code/pc1_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | # Feature: Code Complexity 2 | # Usefulness: Combining McCabe's cyclomatic complexity (v(g)) and Halstead's difficulty (D) to create a single metric for overall code complexity. Higher complexity may be associated with higher chances of defects. 3 | # Input samples: 'v(g)': [4.0, 1.0, 26.0], 'D': [14.0, 5.78, 22.35] 4 | df['code_complexity'] = df['v(g)'] * df['D'] 5 | 6 | # Feature: Comment Ratio 7 | # Usefulness: The ratio of lines of comments (lOComment) to lines of code (lOCode) can be an indicator of code quality. A higher ratio may indicate better documentation and lower chances of defects. 8 | # Input samples: 'lOCode': [9.0, 13.0, 166.0], 'lOComment': [0.0, 0.0, 49.0] 9 | df['comment_ratio'] = df['lOComment'] / (df['lOCode'] + 1e-6) # Adding a small constant to avoid division by zero 10 | 11 | # Feature: Blank Line Ratio 12 | # Usefulness: The ratio of blank lines (lOBlank) to lines of code (loc) can indicate the readability of the code. Higher ratios may suggest better readability and lower chances of defects. 13 | # Input samples: 'lOBlank': [5.0, 0.0, 39.0], 'loc': [11.0, 13.0, 167.0] 14 | df['blank_line_ratio'] = df['lOBlank'] / (df['loc'] + 1e-6) # Adding a small constant to avoid division by zero 15 | 16 | # Feature: Intelligence per Line of Code 17 | # Usefulness: The ratio of Halstead's intelligence (I) to lines of code (loc) can indicate the complexity of the code. Higher intelligence per line may be associated with higher chances of defects. 
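The `+ 1` / `+ 1e-6` denominator guards recurring in these pc1 files can be exercised in isolation. A minimal standalone sketch using the sample values from the comments above (not part of the repo):

```python
import pandas as pd

# Sample values from the comments; lOComment = 0 would make a plain ratio
# divide by zero (pandas yields inf, which many models handle poorly).
df = pd.DataFrame({"loc": [15.0, 13.0, 14.0], "lOComment": [8.0, 0.0, 1.0]})

# The +1 guard keeps the ratio finite for comment-free files.
df["code_to_comment_ratio"] = df["loc"] / (df["lOComment"] + 1)
```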
18 | # Input samples: 'I': [17.46, 23.35, 218.7], 'loc': [11.0, 13.0, 167.0] 19 | df['intelligence_per_loc'] = df['I'] / (df['loc'] + 1e-6) # Adding a small constant to avoid division by zero 20 | -------------------------------------------------------------------------------- /data/dataset_descriptions/kaggle_spaceship-titanic.txt: -------------------------------------------------------------------------------- 1 | Dataset Description 2 | In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system. 3 | 4 | File and Data Field Descriptions 5 | train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data. 6 | PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. 7 | HomePlanet - The planet the passenger departed from, typically their planet of permanent residence. 8 | CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. 9 | Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. 10 | Destination - The planet the passenger will be debarking to. 11 | Age - The age of the passenger. 12 | VIP - Whether the passenger has paid for special VIP service during the voyage. 13 | RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities. 14 | Name - The first and last names of the passenger. 
15 | Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v3_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Number of squares occupied) 3 | # Usefulness: Knowing the number of squares occupied can give insight into the progress of the game and help classify "Class" 4 | # Input samples: 'top-left-square': [0, 1, 2], 'top-middle-square': [0, 0, 0], 'top-right-square': [2, 2, 1], 'middle-left-square': [1, 2, 0], 'middle-middle-square': [0, 1, 2], 'middle-right-square': [2, 0, 1], 'bottom-left-square': [0, 2, 0], 'bottom-middle-square': [1, 0, 2], 'bottom-right-square': [2, 1, 1] 5 | df['num_squares_occupied'] = df.apply(lambda row: len([x for x in row if x != 0]), axis=1) 6 | # (Count of X's on the board) 7 | # Usefulness: Knowing the number of X's on the board can give insight into the progress of the game and help classify "Class" 8 | # Input samples: 'top-left-square': [0, 1, 2], 'top-middle-square': [0, 0, 0], 'top-right-square': [2, 2, 1], 'middle-left-square': [1, 2, 0], 'middle-middle-square': [0, 1, 2], 'middle-right-square': [2, 0, 1], 'bottom-left-square': [0, 2, 0], 'bottom-middle-square': [1, 0, 2], 'bottom-right-square': [2, 1, 1] 9 | df['num_X'] = df.apply(lambda row: row.tolist().count(1), axis=1)# (Count of O's on the board) 10 | # Usefulness: Knowing the number of O's on the board can give insight into the progress of the game and help classify "Class" 11 | # Input samples: 'top-left-square': [0, 1, 2], 'top-middle-square': [0, 0, 0], 'top-right-square': [2, 2, 1], 'middle-left-square': [1, 2, 0], 'middle-middle-square': [0, 1, 2], 'middle-right-square': [2, 0, 1], 'bottom-left-square': [0, 2, 0], 'bottom-middle-square': [1, 0, 2], 'bottom-right-square': [2, 1, 1] 12 | df['num_O'] = df.apply(lambda row: 
row.tolist().count(2), axis=1) -------------------------------------------------------------------------------- /data/generated_code/kaggle_spaceship-titanic_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | # ('CryoSleep_Age', 'Combination of CryoSleep and Age') 2 | # Usefulness: Passengers in cryosleep may be less likely to be transported to an alternate dimension. Combining CryoSleep status with age may provide additional information on this relationship. 3 | # Input samples: 'CryoSleep': [False, True, False], 'Age': [22.0, 57.0, 19.0] 4 | df['CryoSleep_Age'] = df['CryoSleep'].astype(int) * df['Age']# ('Total_Spending', 'Total amount spent on all luxury amenities') 5 | # Usefulness: The total amount spent on luxury amenities may provide information on the passenger's socioeconomic status and likelihood of being transported to an alternate dimension. 6 | # Input samples: 'RoomService': [0.0, 0.0, 0.0], 'FoodCourt': [0.0, 0.0, 47.0], 'ShoppingMall': [859.0, 0.0, 0.0], 'Spa': [62.0, 0.0, 263.0], 'VRDeck': [0.0, 0.0, 384.0] 7 | df['Total_Spending'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']# ('VIP_Deck_Spending', 'Combination of VIP, Cabin deck, and total spending') 8 | # Usefulness: VIP passengers may be more likely to be transported to an alternate dimension. Combining VIP status with the deck of the cabin and total spending may provide additional information on this relationship. 
9 | # Input samples: 'VIP': [False, False, False], 'Cabin': ['F/1224/P', 'C/201/P', 'E/263/S'], 'RoomService': [0.0, 0.0, 263.0], 'FoodCourt': [0.0, 47.0, 0.0], 'ShoppingMall': [859.0, 0.0, 0.0], 'Spa': [62.0, 0.0, 263.0], 'VRDeck': [0.0, 384.0, 0.0] 10 | df['VIP_Deck_Spending'] = df['VIP'].astype(int) * df['Cabin'].str.split('/').str[0] + '_' + (df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']).astype(str) -------------------------------------------------------------------------------- /data/generated_code/diabetes_v3_3_code.txt: -------------------------------------------------------------------------------- 1 | # (Plasma glucose concentration a 2 hours in an oral glucose tolerance test) - (2-Hour serum insulin (mu U/ml)) 2 | # Usefulness: This feature measures the difference between glucose and insulin levels in the blood, which is an important factor in detecting insulin resistance and diabetes. 3 | # Input samples: 'plas': [117.0, 134.0, 102.0], 'insu': [145.0, 291.0, 105.0] 4 | df['glu_ins_diff'] = df['plas'] - df['insu']# (Body mass index) * (Diabetes pedigree function) 5 | # Usefulness: This feature is a combination of BMI and diabetes pedigree function, which measures the likelihood of having diabetes based on family history. High values of this feature indicate a higher risk of diabetes. 6 | # Input samples: 'mass': [34.5, 26.4, 37.2], 'pedi': [0.4, 0.35, 0.2] 7 | df['bmi_pedi_product'] = df['mass'] * df['pedi']# (Age) * (Diabetes pedigree function) 8 | # Usefulness: This feature is a combination of age and diabetes pedigree function, which measures the likelihood of having diabetes based on family history. High values of this feature indicate a higher risk of diabetes. 
9 | # Input samples: 'age': [40.0, 21.0, 45.0], 'pedi': [0.4, 0.35, 0.2] 10 | df['age_pedi_product'] = df['age'] * df['pedi']# Drop 'skin' column 11 | # Explanation: The 'skin' column may not be useful for predicting diabetes as it measures the thickness of a fold of skin on the triceps, which may not be directly related to diabetes risk. 12 | df.drop(columns=['skin'], inplace=True)# (Body mass index)^2 13 | # Usefulness: This feature is a transformation of the 'mass' column, which may help capture non-linear relationships between BMI and diabetes risk. 14 | # Input samples: 'mass': [34.5, 26.4, 37.2] 15 | df['bmi_squared'] = df['mass'] ** 2 -------------------------------------------------------------------------------- /data/generated_code/diabetes_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | # (Age multiplied by diabetes pedigree function) 2 | # Usefulness: Older women with a family history of diabetes are at a higher risk of developing diabetes. 3 | # Input samples: 'age': [32.0, 29.0, 21.0], 'pedi': [0.97, 0.92, 0.37] 4 | df['age_pedi'] = df['age'] * df['pedi']# (Plasma glucose concentration multiplied by BMI) 5 | # Usefulness: Women with higher BMI and higher glucose concentrations have a higher risk of developing diabetes. 6 | # Input samples: 'plas': [129.0, 86.0, 151.0], 'mass': [36.4, 41.3, 42.1] 7 | df['plas_bmi'] = df['plas'] * df['mass']# (Body mass index divided by age) 8 | # Usefulness: Age and BMI are important risk factors for diabetes. This feature combines them to capture the interaction between them. 9 | # Input samples: 'mass': [36.4, 41.3, 42.1], 'age': [32.0, 29.0, 21.0] 10 | df['mass_age'] = df['mass'] / df['age']# Explanation why the column 'insu' is dropped: The 'insu' column has a high percentage of missing values (48.7%) and it is unlikely that this column will add useful information to the classifier. 
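The missing-value percentage cited for 'insu' (48.7%) is a one-line fraction to compute. A toy reproduction of the mechanism, with illustrative data only:

```python
import pandas as pd

# Two of four values missing -> fraction 0.5
df = pd.DataFrame({"insu": [145.0, None, 105.0, None]})
missing_frac = df["insu"].isna().mean()
```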
11 | df.drop(columns=['insu'], inplace=True)# (Plasma glucose concentration divided by age) 12 | # Usefulness: Age and plasma glucose concentration are important risk factors for diabetes. This feature combines them to capture the interaction between them. 13 | # Input samples: 'plas': [129.0, 86.0, 151.0], 'age': [32.0, 29.0, 21.0] 14 | df['plas_age'] = df['plas'] / df['age']# (Body mass index multiplied by diabetes pedigree function) 15 | # Usefulness: Women with higher BMI and a family history of diabetes are at a higher risk of developing diabetes. 16 | # Input samples: 'mass': [36.4, 41.3, 42.1], 'pedi': [0.97, 0.92, 0.37] 17 | df['mass_pedi'] = df['mass'] * df['pedi'] -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Diagonal top-left to bottom-right 3 | # Usefulness: This feature checks if there is a winning diagonal from top-left to bottom-right for player "x". 4 | # Input samples: 'top-left-square': [2, 1, 1], 'middle-middle-square': [1, 2, 2], 'bottom-right-square': [2, 2, 2] 5 | df['diagonal_tl_br'] = ((df['top-left-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-right-square'] == 1)).astype(int) 6 | 7 | # Diagonal top-right to bottom-left 8 | # Usefulness: This feature checks if there is a winning diagonal from top-right to bottom-left for player "x". 9 | # Input samples: 'top-right-square': [0, 2, 1], 'middle-middle-square': [1, 2, 2], 'bottom-left-square': [2, 0, 2] 10 | df['diagonal_tr_bl'] = ((df['top-right-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-left-square'] == 1)).astype(int) 11 | 12 | # Horizontal middle row 13 | # Usefulness: This feature checks if there is a winning horizontal row in the middle for player "x". 
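The per-line checks in this file cover only two diagonals and one row; all eight winning lines for player 1 can be tested the same way. A standalone sketch using this dataset's column names (not code from the repo):

```python
import pandas as pd

cols = [f"{r}-{c}-square" for r in ("top", "middle", "bottom")
        for c in ("left", "middle", "right")]
# One board where player 1 holds the entire top row
df = pd.DataFrame([[1, 1, 1, 2, 2, 0, 0, 0, 2]], columns=cols)

lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
         (0, 4, 8), (2, 4, 6)]             # diagonals
# Count satisfied lines, then cap at 1 to get a binary "player 1 wins" flag.
df["x_wins"] = sum((df[cols[a]] == 1) & (df[cols[b]] == 1) & (df[cols[c]] == 1)
                   for a, b, c in lines).clip(upper=1)
```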
14 | # Input samples: 'middle-left-square': [1, 1, 1], 'middle-middle-square': [1, 2, 2], 'middle-right-square': [2, 2, 2] 15 | df['horizontal_middle_row'] = ((df['middle-left-square'] == 1) & (df['middle-middle-square'] == 1) & (df['middle-right-square'] == 1)).astype(int) 16 | 17 | # Count x's in a row 18 | # Usefulness: This feature counts the number of "x" in each row, which can help identify if there are enough "x" to form a winning condition. 19 | # Input samples: 'top-left-square': [2, 1, 1], 'top-middle-square': [1, 1, 1], 'top-right-square': [0, 2, 1], 'middle-left-square': [1, 1, 1], 'middle-middle-square': [1, 2, 2], 'middle-right-square': [2, 2, 2], 'bottom-left-square': [2, 0, 2], 'bottom-middle-square': [1, 0, 0], 'bottom-right-square': [2, 2, 2] 20 | df['x_count'] = (df == 1).sum(axis=1) 21 | -------------------------------------------------------------------------------- /data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Strength difference between white and black pieces 3 | # Usefulness: The difference in strength between the two pieces can be an important factor in determining the outcome of the game. 4 | # Input samples: 'white_piece0_strength': [0.0, 4.0, 4.0], 'black_piece0_strength': [7.0, 0.0, 0.0] 5 | df['strength_difference'] = df['white_piece0_strength'] - df['black_piece0_strength'] 6 | 7 | # Is white piece stronger than black piece 8 | # Usefulness: This binary feature indicates whether the white piece is stronger than the black piece, which can be useful for predicting the outcome of the game. 
9 | # Input samples: 'white_piece0_strength': [0.0, 4.0, 4.0], 'black_piece0_strength': [7.0, 0.0, 0.0] 10 | df['white_piece_stronger'] = (df['white_piece0_strength'] > df['black_piece0_strength']).astype(int) 11 | 12 | # Is white piece weaker than black piece 13 | # Usefulness: This binary feature indicates whether the white piece is weaker than the black piece, which can be useful for predicting the outcome of the game. 14 | # Input samples: 'white_piece0_strength': [0.0, 4.0, 4.0], 'black_piece0_strength': [7.0, 0.0, 0.0] 15 | df['white_piece_weaker'] = (df['white_piece0_strength'] < df['black_piece0_strength']).astype(int) 16 | 17 | # Is white piece on a diagonal with black piece 18 | # Usefulness: This binary feature indicates whether the white piece and the black piece are on a diagonal, which can be useful for predicting the outcome of the game. 19 | # Input samples: 'white_piece0_file': [4.0, 0.0, 3.0], 'white_piece0_rank': [2.0, 5.0, 6.0], 'black_piece0_file': [5.0, 1.0, 1.0], 'black_piece0_rank': [8.0, 3.0, 8.0] 20 | df['diagonal'] = (abs(df['white_piece0_file'] - df['black_piece0_file']) == abs(df['white_piece0_rank'] - df['black_piece0_rank'])).astype(int) 21 | -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | # ('right-weight' * 'right-distance') 2 | # Usefulness: This feature captures the interaction between the weight and distance on the right side of the balance scale, which is a key factor in determining the class of the balance scale. 
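These weight-times-distance products are the torques on each arm, and the class follows the sign of their difference. A physics-based sketch under that assumption about the dataset's generating rule (not repo code), using the sample values from the comments:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "left-weight": [2.0, 4.0, 1.0], "left-distance": [2.0, 5.0, 4.0],
    "right-weight": [5.0, 3.0, 1.0], "right-distance": [2.0, 4.0, 4.0],
})

# torque = weight * distance; positive -> tips left, negative -> tips right,
# zero -> balanced
torque_diff = (df["left-weight"] * df["left-distance"]
               - df["right-weight"] * df["right-distance"])
df["tilt"] = np.sign(torque_diff)
```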
3 | # Input samples: 'right-weight': [5.0, 3.0, 1.0], 'right-distance': [2.0, 4.0, 4.0] 4 | df['right-interaction'] = df['right-weight'] * df['right-distance']# ('left-weight' / 'left-distance') 5 | # Usefulness: This feature captures the ratio between the weight and distance on the left side of the balance scale, which is a key factor in determining the class of the balance scale. 6 | # Input samples: 'left-weight': [2.0, 4.0, 1.0], 'left-distance': [2.0, 5.0, 4.0] 7 | df['left-ratio'] = df['left-weight'] / df['left-distance']# ('left-weight' * 'left-distance') / ('right-weight' * 'right-distance') 8 | # Usefulness: This feature captures the ratio between the left and right interactions, which is a key factor in determining the class of the balance scale. 9 | # Input samples: 'left-weight': [2.0, 4.0, 1.0], 'left-distance': [2.0, 5.0, 4.0], 'right-weight': [5.0, 3.0, 1.0], 'right-distance': [2.0, 4.0, 4.0] 10 | df['interaction-ratio'] = (df['left-weight'] * df['left-distance']) / (df['right-weight'] * df['right-distance'])# ('left-weight' - 'right-weight') / ('left-distance' - 'right-distance') 11 | # Usefulness: This feature captures the difference in weight per unit distance between the left and right sides of the balance scale, which is a key factor in determining the class of the balance scale. 12 | # Input samples: 'left-weight': [2.0, 4.0, 1.0], 'left-distance': [2.0, 5.0, 4.0], 'right-weight': [5.0, 3.0, 1.0], 'right-distance': [2.0, 4.0, 4.0] 13 | df['weight-distance-diff'] = (df['left-weight'] - df['right-weight']) / (df['left-distance'] - df['right-distance']) -------------------------------------------------------------------------------- /data/generated_code/kaggle_playground-series-s3e12_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Calc_to_ph_diff) 3 | # Usefulness: The difference between the pH and calcium concentration can give insight into the formation of calcium oxalate crystals. 
4 | # Input samples: 'ph': [5.24, 5.41, 6.79], 'calc': [4.49, 0.83, 0.58] 5 | df['Calc_to_ph_diff'] = df['calc'] - df['ph'] 6 | # Explanation: The column 'id' is dropped as it is not informative for the classification task. 7 | df.drop(columns=['id'], inplace=True)# (Calc_to_gravity_diff) 8 | # Usefulness: The difference between the specific gravity and calcium concentration can give insight into the formation of calcium oxalate crystals. 9 | # Input samples: 'gravity': [1.03, 1.01, 1.02], 'calc': [4.49, 0.83, 0.58] 10 | df['Calc_to_gravity_diff'] = df['calc'] - df['gravity']# Explanation: The column 'cond' is dropped as it is highly correlated with 'osmo' which is kept. 11 | df.drop(columns=['cond'], inplace=True)# (Urea_to_gravity_ratio) 12 | # Usefulness: The ratio between urea concentration and specific gravity can give insight into the concentration of molecules in the urine which may be related to the formation of calcium oxalate crystals. 13 | # Input samples: 'urea': [550.0, 159.0, 199.0], 'gravity': [1.03, 1.01, 1.02] 14 | df['Urea_to_gravity_ratio'] = df['urea'] / df['gravity']# Explanation: The column 'osmo' is dropped as it is highly correlated with 'Osmo_to_ph_ratio' and 'Osmo_to_calc_ratio' which are kept. 15 | df.drop(columns=['osmo'], inplace=True)# (Urea_to_calc_ratio) 16 | # Usefulness: The ratio between urea concentration and calcium concentration can give insight into the concentration of molecules in the urine which may be related to the formation of calcium oxalate crystals. 
17 | # Input samples: 'urea': [550.0, 159.0, 199.0], 'calc': [4.49, 0.83, 0.58] 18 | df['Urea_to_calc_ratio'] = df['urea'] / df['calc'] -------------------------------------------------------------------------------- /data/generated_code/credit-g_v3_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # (Credit amount to duration ratio) 3 | # Usefulness: This column captures the ratio of credit amount to duration, which gives an idea of how much the borrower will have to pay back per unit time. This is a useful metric for predicting credit risk. 4 | # Input samples: 'credit_amount': [2473.0, 522.0, 719.0], 'duration': [18.0, 12.0, 12.0] 5 | df['credit_duration_ratio'] = df['credit_amount'] / df['duration'] 6 | # (Age bin) 7 | # Usefulness: This column bins the age of the borrower into different groups. This can help capture non-linear relationships between age and credit risk. 8 | # Input samples: 'age': [25.0, 42.0, 41.0], 9 | df['age_bin'] = pd.cut(df['age'], bins=[0, 25, 40, 60, 100], labels=['young', 'middle-aged', 'senior', 'old'])# (Credit amount to installment commitment ratio) 10 | # Usefulness: This column captures the ratio of credit amount to installment commitment, which gives an idea of how much the borrower will have to pay back per installment. This is a useful metric for predicting credit risk. 11 | # Input samples: 'credit_amount': [2473.0, 522.0, 719.0], 'installment_commitment': [4.0, 4.0, 4.0] 12 | df['credit_installment_ratio'] = df['credit_amount'] / df['installment_commitment']# (Credit history and purpose combination) 13 | # Usefulness: This column combines credit history and purpose columns to capture the relationship between the two. This can help capture the borrower's credit history and their reason for taking the loan. 
14 | # Input samples: 'credit_history': [2, 4, 2], 'purpose': [2, 3, 6] 15 | df['credit_history_purpose'] = df['credit_history'].astype(str) + '_' + df['purpose'].astype(str)# Explanation: The column 'foreign_worker' has low variance and is unlikely to be useful for predicting credit risk. Therefore, it is dropped. 16 | df.drop(columns=['foreign_worker'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/jungle_chess_2pcs_raw_endgame_complete_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: strength_difference 3 | # Usefulness: This feature calculates the difference in strength between the white and black pieces, which can help the classifier understand the power dynamics between the two pieces. 4 | # Input samples: 'white_piece0_strength': [7.0, 6.0, 7.0], 'black_piece0_strength': [0.0, 0.0, 6.0] 5 | df['strength_difference'] = df['white_piece0_strength'] - df['black_piece0_strength'] 6 | 7 | # Feature name: is_adjacent 8 | # Usefulness: This feature indicates if the white and black pieces are adjacent to each other, which can help the classifier understand if the pieces are close enough to interact. 9 | # Input samples: 'white_piece0_file': [6.0, 2.0, 1.0], 'white_piece0_rank': [4.0, 7.0, 1.0], 'black_piece0_file': [5.0, 3.0, 4.0], 'black_piece0_rank': [8.0, 5.0, 2.0] 10 | df['is_adjacent'] = (((df['white_piece0_file'] - df['black_piece0_file']).abs() <= 1) & ((df['white_piece0_rank'] - df['black_piece0_rank']).abs() <= 1)).astype(int) 11 | 12 | # Feature name: stronger_piece 13 | # Usefulness: This feature indicates if the white piece is stronger than the black piece, which can help the classifier understand the power dynamics between the two pieces. 
14 | # Input samples: 'white_piece0_strength': [7.0, 6.0, 7.0], 'black_piece0_strength': [0.0, 0.0, 6.0] 15 | df['stronger_piece'] = (df['white_piece0_strength'] > df['black_piece0_strength']).astype(int) 16 | 17 | # Feature name: weaker_piece 18 | # Usefulness: This feature indicates if the black piece is stronger than the white piece, which can help the classifier understand the power dynamics between the two pieces. 19 | # Input samples: 'white_piece0_strength': [7.0, 6.0, 7.0], 'black_piece0_strength': [0.0, 0.0, 6.0] 20 | df['weaker_piece'] = (df['white_piece0_strength'] < df['black_piece0_strength']).astype(int) -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Policy_duration_years 3 | # Usefulness: Converting the Holding_Policy_Duration to numeric values in years may provide a better understanding of the duration of the policy, which could be relevant in predicting the response. 4 | # Input samples: 'Holding_Policy_Duration': ['7.0', '2.0', '1.0'] 5 | df['Policy_duration_years'] = df['Holding_Policy_Duration'].replace('14+', '15').astype(float) 6 | 7 | # Feature: Health_Indicator_numeric 8 | # Usefulness: Converting the Health Indicator to numeric values may provide a better understanding of the health condition of the policy holder, which could be relevant in predicting the response. 
9 | # Input samples: 'Health Indicator': ['X1', 'X2', 'X3'] 10 | df['Health_Indicator_numeric'] = df['Health Indicator'].str.extract(r'(\d+)').astype(float) 11 | 12 | # Drop the original Health Indicator column as it is now redundant 13 | df.drop(columns=['Health Indicator'], inplace=True) 14 | 15 | # Feature: Is_Spouse_binary 16 | # Usefulness: Converting the Is_Spouse column to binary values may provide a better understanding of whether the policy holder has a spouse or not, which could be relevant in predicting the response. 17 | # Input samples: 'Is_Spouse': ['No', 'No', 'No'] 18 | df['Is_Spouse_binary'] = (df['Is_Spouse'] == 'Yes').astype(int) 19 | 20 | # Drop the original Is_Spouse column as it is now redundant 21 | df.drop(columns=['Is_Spouse'], inplace=True) 22 | 23 | # Feature: Reco_Policy_Premium_log 24 | # Usefulness: Taking the logarithm of the Reco_Policy_Premium may help normalize the distribution of the values, which could improve the performance of the classifier. 25 | # Input samples: 'Reco_Policy_Premium': [13112.0, 9800.0, 17280.0] 26 | import numpy as np 27 | df['Reco_Policy_Premium_log'] = np.log(df['Reco_Policy_Premium']) 28 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v3_2_code.txt: -------------------------------------------------------------------------------- 1 | # ('City_Code_Count', 'Usefulness: The count of individuals in a city may be indicative of the likelihood of a response.', 2 | # 'Input samples: City_Code': ['C1', 'C5', 'C1'], 'Region_Code': [2037, 3535, 1159]) 3 | df['City_Code_Count'] = df.groupby('City_Code')['Region_Code'].transform('count')# ('Family_Size', 'Usefulness: The family size may be indicative of the likelihood of a response.', 4 | # 'Input samples: Upper_Age': [28, 52, 52], 'Lower_Age': [28, 52, 52], 'Is_Spouse': ['No', 'No', 'No']) 5 | df['Family_Size'] = df['Is_Spouse'].apply(lambda x: 2 if x == 'Yes' else
1) + (df['Upper_Age'] + df['Lower_Age'])//30# ('Premium_By_Age', 'Usefulness: The ratio of premium to age may be indicative of the likelihood of a response.', 6 | # 'Input samples: Upper_Age': [28, 52, 52], 'Lower_Age': [28, 52, 52], 'Reco_Policy_Premium': [10544.0, 11484.0, 19240.0]) 7 | df['Premium_By_Age'] = df['Reco_Policy_Premium'] / ((df['Upper_Age'] + df['Lower_Age'])/2)# ('Holding_Policy_Duration_Imputed_Ordinal', 'Usefulness: The duration of holding policy may be important in predicting response. This column imputes a value of 0 for NaNs and maps the duration to an ordinal scale.', 8 | # 'Input samples: Holding_Policy_Duration': ['3.0', '4.0', '2.0'], 'Holding_Policy_Type': [3.0, 2.0, 3.0]) 9 | df['Holding_Policy_Duration_Imputed'] = df['Holding_Policy_Duration'].fillna(0) 10 | df['Holding_Policy_Duration_Imputed_Ordinal'] = df['Holding_Policy_Duration_Imputed'].replace(['14+'], 15).astype(float)# ('Premium_Per_Region', 'Usefulness: The average premium per region may be indicative of the likelihood of a response.', 11 | # 'Input samples: Region_Code': [2037, 3535, 1159], 'Reco_Policy_Premium': [10544.0, 11484.0, 19240.0]) 12 | df['Premium_Per_Region'] = df.groupby('Region_Code')['Reco_Policy_Premium'].transform('mean') -------------------------------------------------------------------------------- /data/generated_code/cmc_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: Education_gap 3 | # Usefulness: This feature captures the difference in education levels between the wife and husband. This might be relevant as couples with similar education levels may have similar preferences for contraceptive methods. 4 | # Input samples: 'Wifes_education': [3, 1, 3], 'Husbands_education': [3, 1, 3] 5 | df['Education_gap'] = df['Wifes_education'] - df['Husbands_education'] 6 | 7 | # Feature name: Age_per_child 8 | # Usefulness: This feature represents the average age of the wife per child born. 
It may be useful to understand the relationship between the wife's age and the number of children, which could influence contraceptive method choice. 9 | # Input samples: 'Wifes_age': [45.0, 47.0, 33.0], 'Number_of_children_ever_born': [1.0, 7.0, 5.0] 10 | df['Age_per_child'] = df['Wifes_age'] / df['Number_of_children_ever_born'] 11 | df['Age_per_child'].fillna(0, inplace=True) # Fill NaN values with 0 (for cases where Number_of_children_ever_born is 0) 12 | 13 | # Feature name: Working_and_religion_interaction 14 | # Usefulness: This feature captures the interaction between the wife's working status and religion. This might be relevant as the combination of these factors could influence contraceptive method choice. 15 | # Input samples: 'Wifes_now_working%3F': [1, 1, 1], 'Wifes_religion': [1, 1, 1] 16 | df['Working_and_religion_interaction'] = df['Wifes_now_working%3F'] * df['Wifes_religion'] 17 | 18 | # Feature name: Education_and_media_interaction 19 | # Usefulness: This feature captures the interaction between wife's education and media exposure. This might be relevant as the combination of these factors could influence contraceptive method choice. 20 | # Input samples: 'Wifes_education': [3, 1, 3], 'Media_exposure': [0, 0, 0] 21 | df['Education_and_media_interaction'] = df['Wifes_education'] * df['Media_exposure'] 22 | -------------------------------------------------------------------------------- /scripts/run_classifiers_script.py: -------------------------------------------------------------------------------- 1 | """ 2 | Runs downstream classifiers on the dataset and combines it with different feature sets. 
3 | """ 4 | 5 | import argparse 6 | import torch 7 | from tabpfn.scripts import tabular_metrics 8 | from cafe_feature_engineering import data, cafe, evaluate, settings 9 | import os 10 | import openai 11 | from tabpfn.scripts.tabular_baselines import clf_dict 12 | from tabpfn import TabPFNClassifier 13 | from functools import partial 14 | 15 | 16 | if __name__ == "__main__": 17 | # Parse args 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument( 20 | "--seed", 21 | type=int, 22 | default=0, 23 | ) 24 | parser.add_argument( 25 | "--dataset_id", 26 | type=int, 27 | default=-1, 28 | ) 29 | parser.add_argument( 30 | "--prompt_id", 31 | type=str, 32 | default="v4", 33 | ) 34 | args = parser.parse_args() 35 | device = "cuda" if torch.cuda.is_available() else "cpu" 36 | classifier = TabPFNClassifier(device=device, N_ensemble_configurations=16) 37 | classifier.fit = partial(classifier.fit, overwrite_warning=True) 38 | tabpfn = partial(clf_dict["transformer"], classifier=classifier) 39 | 40 | prompt_id = args.prompt_id 41 | dataset_id = args.dataset_id 42 | seed = args.seed 43 | methods = [tabpfn, "logistic", "random_forest", "xgb", "autosklearn2", "autogluon"] 44 | 45 | cc_test_datasets_multiclass = data.load_all_data() 46 | if dataset_id != -1: 47 | cc_test_datasets_multiclass = [cc_test_datasets_multiclass[dataset_id]] 48 | 49 | metric_used = tabular_metrics.auc_metric 50 | 51 | for i in range(0, len(cc_test_datasets_multiclass)): 52 | ds = cc_test_datasets_multiclass[i] 53 | evaluate.evaluate_dataset_with_and_without_cafe( 54 | ds, seed, methods, metric_used, prompt_id=prompt_id 55 | ) 56 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_spaceship-titanic_v3_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('FamilySize', 'Number of family members travelling with the passenger') 3 | # Usefulness: The number of family members travelling with the 
passenger can be important for predicting whether the passenger was transported, as families may be more likely to be transported together. 4 | # Input samples: 'PassengerId': ['4841_02', '1040_01', '5788_01'], 'HomePlanet': ['Europa', 'Earth', 'Mars'], 'Cabin': ['B/192/S', 'G/164/S', 'F/1105/S'] 5 | df['FamilySize'] = df.groupby(df['PassengerId'].str.split('_').str[0])['PassengerId'].transform('count') 6 | 7 | # ('CabinDeck', 'The deck level of the passenger cabin') 8 | # Usefulness: The deck level of the passenger cabin may be important for predicting whether the passenger was transported, as passengers on certain decks may be more likely to be affected by the spacetime anomaly. 9 | # Input samples: 'Cabin': ['B/192/S', 'G/164/S', 'F/1105/S'], 'Destination': ['TRAPPIST-1e', 'PSO J318.5-22', 'TRAPPIST-1e'] 10 | df['CabinDeck'] = df['Cabin'].str.split('/').str[0] 11 | 12 | # ('IsAdult', 'Whether the passenger is an adult (age >= 18)') 13 | # Usefulness: Age may be an important factor in predicting whether the passenger was transported, and it is common to consider age groups such as adults vs. children. This column simplifies the age variable by categorizing passengers as adults or children. 14 | # Input samples: 'Age': [21.0, 15.0, 27.0], 'VIP': [False, False, False] 15 | df['IsAdult'] = (df['Age'] >= 18).astype(int) 16 | 17 | # ('TotalSpending', 'The total amount spent by the passenger on luxury amenities') 18 | # Usefulness: The amount spent by the passenger on luxury amenities may be indicative of their socioeconomic status or level of attachment to the current dimension, and thus may be important for predicting whether they were transported. 
19 | # Input samples: 'RoomService': [12.0, 521.0, 850.0], 'FoodCourt': [1855.0, 162.0, 0.0
--------------------------------------------------------------------------------
/data/generated_code/cmc_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # ('Husband_wife_age_difference', 'Useful to capture the age difference between husband and wife, which could be a factor in contraceptive method choice.')
3 | # Input samples: ('Wifes_age': [37.0, 35.0, 22.0], 'Husbands_education': [3, 1, 3])
4 | df['Husband_wife_age_difference'] = df['Wifes_age'] - df['Husbands_education']
5 | # ('Total_children', 'Useful to capture the total number of children a woman has, which could be a factor in contraceptive method choice.')
6 | # Input samples: ('Number_of_children_ever_born': [4.0, 4.0, 1.0], 'Wifes_now_working%3F': [1, 0, 0])
7 | df['Total_children'] = df['Number_of_children_ever_born'] + (1 - df['Wifes_now_working%3F']) * 3
8 | # ('Education_gap', 'Useful to capture the difference in education level between husband and wife, which could be a factor in contraceptive method choice.')
9 | # Input samples: ('Wifes_education': [3, 0, 2], 'Husbands_education': [3, 1, 3])
10 | df['Education_gap'] = abs(df['Wifes_education'] - df['Husbands_education'])
11 | # ('Children_per_year', 'Useful to capture the rate at which children are born, which could be a factor in contraceptive method choice.')
12 | # Input samples: ('Number_of_children_ever_born': [4.0, 4.0, 1.0], 'Wifes_age': [37.0, 35.0, 22.0])
13 | df['Children_per_year'] = df['Number_of_children_ever_born'] / (df['Wifes_age'] - 14) # assuming that the woman got married at 14
14 | # ('Age_squared', 'Useful to capture the non-linear relationship between age and contraceptive method choice.')
15 | # Input samples: ('Wifes_age': [37.0, 35.0, 22.0], 'Contraceptive_method_used': [2.0, 2.0, 2.0])
16 | df['Age_squared'] = df['Wifes_age'] ** 2
17 | # ('Working_wife', 'Useful to capture whether the wife is working or not, which could be a factor in contraceptive method choice.')
18 | # Input samples: ('Wifes_now_working%3F': [1, 0, 0], 'Standard-of-living_index': [1, 3, 3])
19 | df['Working_wife'] = df['Wifes_now_working%3F'] * (df['Standard-of-living_index'] - 2) / 2
--------------------------------------------------------------------------------
/data/generated_code/pc1_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # (Feature name and description)
3 | # Usefulness: This feature calculates the ratio of unique operands to unique operators in the code. This ratio can be used to assess the complexity of the code and can be a useful feature for predicting defects.
4 | # Input samples: 'uniq_Op': [14.0, 8.0, 16.0, 4.0, 15.0], 'uniq_Opnd': [12.0, 12.0, 17.0, 6.0, 23.0]
5 | df['uniq_Opnd/uniq_Op'] = df['uniq_Opnd'] / df['uniq_Op']
6 | # (Feature name and description)
7 | # Usefulness: This feature calculates the ratio of the total number of operands to the total number of operators in the code. This ratio can be used to assess the complexity of the code and can be a useful feature for predicting defects.
8 | # Input samples: 'total_Op': [35.0, 19.0, 32.0, 13.0, 72.0], 'total_Opnd': [22.0, 14.0, 22.0, 12.0, 58.0]
9 | df['total_Opnd/total_Op'] = df['total_Opnd'] / df['total_Op']
10 | # (Feature name and description)
11 | # Usefulness: This feature calculates the ratio of the Halstead program length to McCabe's line count of code. This ratio can be used to assess the readability of the code and can be a useful feature for predicting defects.
12 | # Input samples: 'L': [0.08, 0.21, 0.1, 0.25, 0.05], 'loc': [12.0, 8.0, 13.0, 6.0, 19.0]
13 | df['L/loc'] = df['L'] / df['loc']
14 | # (Feature name and description)
15 | # Usefulness: This feature calculates the ratio of the Halstead effort to McCabe's line count of code. This ratio can be used to assess the maintainability of the code and can be a useful feature for predicting defects.
16 | # Input samples: 'E': [3438.37, 665.58, 2820.11, 332.19, 12903.06], 'loc': [12.0, 8.0, 13.0, 6.0, 19.0]
17 | df['E/loc'] = df['E'] / df['loc']
18 | # (Feature name and description)
19 | # Usefulness: This feature calculates the ratio of the Halstead volume to McCabe's cyclomatic complexity. This ratio can be used to assess the maintainability of the code and can be a useful feature for predicting defects.
20 | # Input samples: 'V': [267.93, 142.62, 272.4, 83.05, 682.23], 'v(g)': [1.0, 1.0, 2.0, 1.0, 7.0]
21 | df['V/v(g)'] = df['V'] / df['v(g)']
--------------------------------------------------------------------------------
/data/generated_code/kaggle_pharyngitis_v3_3_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # ('age_y_above_10', 'Whether the patient age is above 10 years old')
3 | # Usefulness: Older children may exhibit different symptoms and signs than younger children, so this feature can help capture that difference.
4 | # Input samples: 'age_y': [4.4, 11.3, 5.8]
5 | df['age_y_above_10'] = (df['age_y'] > 10).astype(int)
6 | # ('pain_swollenadp', 'Whether the patient has both pain and swollen adenoids')
7 | # Usefulness: Pain and swollen adenoids are both common symptoms of GAS pharyngitis, and the presence of both may indicate a higher likelihood of a positive RADT result.
8 | # Input samples: 'pain': [1.0, 1.0, 1.0], 'swollenadp': [0.0, 2.0, 0.0]
9 | df['pain_swollenadp'] = ((df['pain'] == 1) & (df['swollenadp'] > 0)).astype(int)# Explanation: 'age_y_above_10' is a feature that has already been added, but it may be useful to also include the opposite feature.
10 | df['age_y_below_10'] = (df['age_y'] <= 10).astype(int)
11 | 
12 | # Explanation: 'pain' and 'tender' are both symptoms that may indicate GAS pharyngitis. This feature captures the presence of either symptom.
13 | df['pain_or_tender'] = ((df['pain'] == 1) | (df['tender'] == 1)).astype(int) 14 | 15 | # Explanation: 'swollenadp' and 'tender' are both symptoms that may indicate GAS pharyngitis. This feature captures the presence of either symptom. 16 | df['swollenadp_or_tender'] = ((df['swollenadp'] > 0) | (df['tender'] == 1)).astype(int) 17 | 18 | # Explanation: 'scarlet' is a symptom that may indicate scarlet fever, which is caused by GAS. This feature captures the presence of this symptom. 19 | df['scarlet_fever'] = (df['scarlet'] == 1).astype(int) 20 | 21 | # Explanation: 'conjunctivitis' is a symptom that may indicate a viral infection, which is not caused by GAS. This feature captures the absence of this symptom. 22 | df['no_conjunctivitis'] = (df['conjunctivitis'] == 0).astype(int) 23 | 24 | # Explanation: 'nauseavomit' is a symptom that may indicate a viral infection, which is not caused by GAS. This feature captures the presence of this symptom. 25 | df['nausea_or_vomiting'] = (df['nauseavomit'] == 1).astype(int) -------------------------------------------------------------------------------- /data/generated_code/pc1_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Complexity Ratio 3 | # Usefulness: This feature represents the ratio between McCabe's "cyclomatic complexity" and "design complexity". A high ratio indicates that the code is more complex and may be more prone to defects. 4 | # Input samples: 'v(g)': [1.0, 1.0, 2.0], 'iv(G)': [1.0, 1.0, 2.0] 5 | df['complexity_ratio'] = df['v(g)'] / df['iv(G)'] 6 | 7 | # Feature: Comment Ratio 8 | # Usefulness: This feature represents the ratio between the number of comment lines and the total number of lines of code. A higher ratio indicates better documentation, which may lead to fewer defects. 
9 | # Input samples: 'lOComment': [0.0, 12.0, 0.0], 'lOCode': [12.0, 8.0, 13.0] 10 | df['comment_ratio'] = df['lOComment'] / df['lOCode'] 11 | 12 | # Feature: Operand Ratio 13 | # Usefulness: This feature represents the ratio between unique operands and total operands. A higher ratio indicates a more diverse set of operands, which may affect the complexity and defect probability of the code. 14 | # Input samples: 'uniq_Opnd': [12.0, 12.0, 17.0], 'total_Opnd': [22.0, 14.0, 22.0] 15 | df['operand_ratio'] = df['uniq_Opnd'] / df['total_Opnd'] 16 | 17 | # Dropping columns that may be redundant or less informative 18 | # Explanation: Columns 'lOCode', 'lOComment', 'uniq_Opnd', and 'total_Opnd' are used to create the new features 'comment_ratio' and 'operand_ratio', so they may be less informative for the classifier. 19 | df.drop(columns=['lOCode', 'lOComment', 'uniq_Opnd', 'total_Opnd'], inplace=True) 20 | 21 | # Feature: Operator Ratio 22 | # Usefulness: This feature represents the ratio between unique operators and total operators. A higher ratio indicates a more diverse set of operators, which may affect the complexity and defect probability of the code. 23 | # Input samples: 'uniq_Op': [14.0, 8.0, 16.0], 'total_Op': [35.0, 19.0, 32.0] 24 | df['operator_ratio'] = df['uniq_Op'] / df['total_Op'] 25 | 26 | # Dropping columns that may be redundant or less informative 27 | # Explanation: Columns 'uniq_Op' and 'total_Op' are used to create the new feature 'operator_ratio', so they may be less informative for the classifier. 28 | df.drop(columns=['uniq_Op', 'total_Op'], inplace=True) 29 | -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v4_4_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Horizontal win for x 3 | # Usefulness: This feature indicates if there is a horizontal win for x in any of the three rows. 
4 | # Input samples: 'top-left-square': [2, 0, 1], 'top-middle-square': [1, 1, 2], 'top-right-square': [0, 1, 2], 'middle-left-square': [1, 0, 1], 'middle-middle-square': [1, 2, 2], 'middle-right-square': [2, 1, 2], 'bottom-left-square': [2, 2, 2], 'bottom-middle-square': [1, 2, 1], 'bottom-right-square': [2, 2, 1] 5 | df['horizontal_win_x'] = ((df['top-left-square'] == 1) & (df['top-middle-square'] == 1) & (df['top-right-square'] == 1)) | ((df['middle-left-square'] == 1) & (df['middle-middle-square'] == 1) & (df['middle-right-square'] == 1)) | ((df['bottom-left-square'] == 1) & (df['bottom-middle-square'] == 1) & (df['bottom-right-square'] == 1)) 6 | 7 | # Vertical win for x 8 | # Usefulness: This feature indicates if there is a vertical win for x in any of the three columns. 9 | # Input samples: 'top-left-square': [2, 0, 1], 'top-middle-square': [1, 1, 2], 'top-right-square': [0, 1, 2], 'middle-left-square': [1, 0, 1], 'middle-middle-square': [1, 2, 2], 'middle-right-square': [2, 1, 2], 'bottom-left-square': [2, 2, 2], 'bottom-middle-square': [1, 2, 1], 'bottom-right-square': [2, 2, 1] 10 | df['vertical_win_x'] = ((df['top-left-square'] == 1) & (df['middle-left-square'] == 1) & (df['bottom-left-square'] == 1)) | ((df['top-middle-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-middle-square'] == 1)) | ((df['top-right-square'] == 1) & (df['middle-right-square'] == 1) & (df['bottom-right-square'] == 1)) 11 | 12 | # Diagonal win for x 13 | # Usefulness: This feature indicates if there is a diagonal win for x in any of the two diagonals. 
14 | # Input samples: 'top-left-square': [2, 0, 1], 'top-middle-square': [1, 1, 2], 'top-right-square': [0, 1, 2], 'middle-left-square': [1, 0, 1], 'middle-middle-square': [1, 2, 2], 'middle-right-square': [2, 1, 2], 'bottom-left-square': [2, 2, 2], 'bottom-middle-square': [1, 2, 1], 'bottom-right-square': [2, 2, 1]
15 | df['diagonal_win_x'] = ((df['top-left-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-right-square'] == 1)) | ((df['top-right-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-left-square'] == 1))
16 | 
--------------------------------------------------------------------------------
/data/generated_code/cmc_v3_4_code.txt:
--------------------------------------------------------------------------------
1 | # (Age_above_mean)
2 | # Usefulness: This column will indicate whether the wife's age is above the mean age of all the wives in the dataset. This can be an important factor in deciding the contraceptive method used.
3 | # Input samples: 'Wifes_age': [45.0, 47.0, 33.0], (mean age = 32.98)
4 | mean_age = df['Wifes_age'].mean()
5 | df['Age_above_mean'] = (df['Wifes_age'] > mean_age).astype(int)
6 | # (Children_per_year)
7 | # Usefulness: This column will indicate the average number of children born per year for each wife. This can be an important factor in deciding the contraceptive method used.
8 | # Input samples: 'Number_of_children_ever_born': [1.0, 7.0, 5.0], 'Wifes_age': [45.0, 47.0, 33.0]
9 | df['Children_per_year'] = df['Number_of_children_ever_born'] / (df['Wifes_age'] - 18) # Assuming women get married at 18 years old.
10 | # (Number_of_children_ever_born_squared)
11 | # Usefulness: This column will capture the non-linear relationship between the number of children ever born and the contraceptive method used.
12 | # Input samples: 'Number_of_children_ever_born': [1.0, 7.0, 5.0]
13 | df['Number_of_children_ever_born_squared'] = df['Number_of_children_ever_born']**2
14 | # (Total_children)
15 | # Usefulness: This column will indicate the total number of children (including current pregnancy) for each wife. This can be an important factor in deciding the contraceptive method used.
16 | # Input samples: 'Number_of_children_ever_born': [1.0, 7.0, 5.0], 'Wifes_now_working%3F': [1, 1, 1], 'Wifes_age': [45.0, 47.0, 33.0]
17 | df['Total_children'] = df['Number_of_children_ever_born'] + ((df['Wifes_now_working%3F'] == 1) & (df['Wifes_age'] >= 20) & (df['Wifes_age'] <= 49)).astype(int)
18 | # (Husband_education_difference)
19 | # Usefulness: This column will indicate the difference in education level between the wife and husband. This can be an important factor in deciding the contraceptive method used.
20 | # Input samples: 'Wifes_education': [3, 1, 3], 'Husbands_education': [3, 1, 3]
21 | df['Husband_education_difference'] = abs(df['Wifes_education'] - df['Husbands_education'])
22 | # (Total_children_squared)
23 | # Usefulness: This column will capture the non-linear relationship between the total number of children and the contraceptive method used.
24 | # Input samples: 'Total_children': [2.0, 8.0, 6.0]
25 | df['Total_children_squared'] = df['Total_children']**2
--------------------------------------------------------------------------------
/data/generated_code/kaggle_spaceship-titanic_v4_1_code.txt:
--------------------------------------------------------------------------------
1 | 
2 | # Feature: GroupSize (Number of passengers in the same group)
3 | # Usefulness: This feature can help to identify if passengers traveling in larger groups have a higher or lower chance of being transported.
4 | # Input samples: 'PassengerId': ['5909_03', '4256_08', '2000_01']
5 | df['GroupSize'] = df.groupby(df['PassengerId'].str.split('_').str[0])['PassengerId'].transform('count')
6 | 
7 | # Feature: TotalExpenses (Total amount spent on amenities)
8 | # Usefulness: Passengers who spend more on amenities might have a different likelihood of being transported.
9 | # Input samples: 'RoomService': [0.0, 0.0, 5.0], 'FoodCourt': [0.0, 0.0, 2676.0], 'ShoppingMall': [0.0, 608.0, 13.0], 'Spa': [0.0, 0.0, 0.0], 'VRDeck': [0.0, 91.0, 157.0]
10 | df['TotalExpenses'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
11 | 
12 | # Feature: AgeGroup (Categorical age group)
13 | # Usefulness: Different age groups may have different likelihoods of being transported.
14 | # Input samples: 'Age': [2.0, 44.0, 28.0]
15 | import numpy as np
16 | bins = [0, 12, 18, 35, 60, np.inf]
17 | labels = ['Child', 'Teen', 'YoungAdult', 'Adult', 'Senior']
18 | df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)
19 | 
20 | # Dropping Age as AgeGroup contains the relevant information
21 | df.drop(columns=['Age'], inplace=True)
22 | # Feature: SameHomeAndDestination (Boolean indicating if HomePlanet and Destination are the same)
23 | # Usefulness: Passengers with the same HomePlanet and Destination might have a different likelihood of being transported.
24 | # Input samples: 'HomePlanet': ['Earth', 'Earth', 'Europa'], 'Destination': ['55 Cancri e', '55 Cancri e', '55 Cancri e']
25 | df['SameHomeAndDestination'] = df['HomePlanet'] == df['Destination']
26 | # Feature: ExpensesPerDeck (Average expenses of passengers in the same deck)
27 | # Usefulness: Passengers in the same deck with different spending patterns might have a different likelihood of being transported.
28 | # Input samples: 'Cabin': ['G/961/S', 'F/880/P', 'C/76/S'], 'TotalExpenses': [0.0, 608.0, 2851.0] 29 | df['Deck'] = df['Cabin'].apply(lambda x: x[0]) 30 | df = df.merge(df.groupby('Deck')['TotalExpenses'].mean().reset_index().rename(columns={'TotalExpenses': 'ExpensesPerDeck'}), on='Deck', how='left') 31 | 32 | # Dropping Cabin again as the Deck and GroupSize features contain relevant information from it 33 | df.drop(columns=['Cabin'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v4_2_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Reco_Policy_Premium and Age Interaction 3 | # Usefulness: This feature captures the interaction between the recommended policy premium and the average age of the person, which might help in understanding how different age groups prefer different policy premiums. 4 | # Input samples: 'Reco_Policy_Premium': [10544.0, 11484.0, 19240.0], 'Upper_Age': [28, 52, 52], 'Lower_Age': [28, 52, 52] 5 | df['Average_Age'] = (df['Upper_Age'] + df['Lower_Age']) / 2 6 | df['Premium_Age_Interaction'] = df['Reco_Policy_Premium'] / df['Average_Age'] 7 | 8 | # Holding_Policy_Type and Reco_Policy_Cat Interaction 9 | # Usefulness: This feature captures the interaction between the holding policy type and the recommended policy category, which might help in understanding how different policy categories are preferred based on the holding policy type. 
10 | # Input samples: 'Holding_Policy_Type': [3.0, 2.0, 3.0], 'Reco_Policy_Cat': [16, 17, 21] 11 | df['Policy_Type_Cat_Interaction'] = df['Holding_Policy_Type'].astype(str) + "_" + df['Reco_Policy_Cat'].astype(str) 12 | df['Policy_Type_Cat_Interaction'] = df['Policy_Type_Cat_Interaction'].astype('category') 13 | 14 | # Drop Holding_Policy_Type as it is now captured in the interaction feature 15 | df.drop(columns=['Holding_Policy_Type'], inplace=True) 16 | 17 | # Holding_Policy_Duration and Reco_Policy_Premium Interaction 18 | # Usefulness: This feature captures the interaction between the holding policy duration and the recommended policy premium, which might help in understanding how different policy premiums are preferred based on the holding policy duration. 19 | # Input samples: 'Holding_Policy_Duration': ['3.0', '4.0', '2.0'], 'Reco_Policy_Premium': [10544.0, 11484.0, 19240.0] 20 | df['Duration_Premium_Interaction'] = df['Holding_Policy_Duration'].replace('14+', '15').astype(float) * df['Reco_Policy_Premium'] 21 | 22 | # City_Code and Accomodation_Type Interaction 23 | # Usefulness: This feature captures the interaction between the city code and the accommodation type, which might help in understanding how different accommodation types are preferred in different cities. 24 | # Input samples: 'City_Code': ['C1', 'C5', 'C1'], 'Accomodation_Type': ['Rented', 'Owned', 'Owned'] 25 | df['City_Accomodation_Interaction'] = df['City_Code'].astype(str) + "_" + df['Accomodation_Type'].astype(str) 26 | df['City_Accomodation_Interaction'] = df['City_Accomodation_Interaction'].astype('category') 27 | -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v4_1_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: Row Wins for X 3 | # Usefulness: This feature counts the number of rows where X has a win (three-in-a-row). 
It helps the classifier to identify if X has won or not. 4 | # Input samples: 'top-left-square': [0, 1, 2], 'top-middle-square': [0, 0, 0], 'top-right-square': [2, 2, 1], ... 5 | df['row_wins_x'] = ((df['top-left-square'] == 1) & (df['top-middle-square'] == 1) & (df['top-right-square'] == 1) | 6 | (df['middle-left-square'] == 1) & (df['middle-middle-square'] == 1) & (df['middle-right-square'] == 1) | 7 | (df['bottom-left-square'] == 1) & (df['bottom-middle-square'] == 1) & (df['bottom-right-square'] == 1)).astype(int) 8 | 9 | # Feature: Column Wins for X 10 | # Usefulness: This feature counts the number of columns where X has a win (three-in-a-row). It helps the classifier to identify if X has won or not. 11 | # Input samples: 'top-left-square': [0, 1, 2], 'middle-left-square': [1, 2, 0], 'bottom-left-square': [0, 2, 0], ... 12 | df['col_wins_x'] = ((df['top-left-square'] == 1) & (df['middle-left-square'] == 1) & (df['bottom-left-square'] == 1) | 13 | (df['top-middle-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-middle-square'] == 1) | 14 | (df['top-right-square'] == 1) & (df['middle-right-square'] == 1) & (df['bottom-right-square'] == 1)).astype(int) 15 | 16 | # Feature: Diagonal Wins for X 17 | # Usefulness: This feature counts the number of diagonals where X has a win (three-in-a-row). It helps the classifier to identify if X has won or not. 18 | # Input samples: 'top-left-square': [0, 1, 2], 'middle-middle-square': [0, 1, 2], 'bottom-right-square': [2, 1, 1], ... 19 | df['diag_wins_x'] = ((df['top-left-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-right-square'] == 1) | 20 | (df['top-right-square'] == 1) & (df['middle-middle-square'] == 1) & (df['bottom-left-square'] == 1)).astype(int) 21 | 22 | # Feature: Total Wins for X 23 | # Usefulness: This feature combines the row, column, and diagonal wins for X into a single feature, which should simplify the model and improve its performance. 
24 | # Input samples: 'row_wins_x': [0, 1, 0], 'col_wins_x': [0, 0, 1], 'diag_wins_x': [1, 0, 0], ... 25 | df['total_wins_x'] = df['row_wins_x'] + df['col_wins_x'] + df['diag_wins_x'] 26 | 27 | # Dropping redundant columns 28 | # Explanation: Since we have combined the row, column, and diagonal wins for X into a single feature, these individual columns are not necessary for the classifier. 29 | df.drop(columns=['row_wins_x', 'col_wins_x', 'diag_wins_x'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Diagonal 2 3 | # Usefulness: This adds useful real world knowledge to classify "Class" as it checks for a win condition in the second diagonal (top-right to bottom-left). 4 | # Input samples: 'top-right-square': [1, 0, 1], 'middle-middle-square': [0, 2, 2], 'bottom-left-square': [2, 1, 1] 5 | df['diag2'] = df.apply(lambda row: 1 if row['top-right-square'] == row['middle-middle-square'] == row['bottom-left-square'] and row['top-right-square'] != 0 else 0, axis=1) 6 | 7 | # Vertical wins 8 | # Usefulness: This adds useful real world knowledge to classify "Class" as it checks for a win condition in each of the three vertical columns. 
9 | # Input samples: 'top-left-square': [2, 2, 2], 'top-middle-square': [0, 0, 1], 'top-right-square': [1, 0, 1], 'middle-left-square': [1, 0, 2], 'middle-middle-square': [0, 2, 2], 'middle-right-square': [1, 1, 2], 'bottom-left-square': [2, 1, 1], 'bottom-middle-square': [2, 0, 0], 'bottom-right-square': [2, 2, 0] 10 | df['vertical_wins'] = df.apply(lambda row: 1 if (row['top-left-square'] == row['middle-left-square'] == row['bottom-left-square'] and row['top-left-square'] != 0) or (row['top-middle-square'] == row['middle-middle-square'] == row['bottom-middle-square'] and row['top-middle-square'] != 0) or (row['top-right-square'] == row['middle-right-square'] == row['bottom-right-square'] and row['top-right-square'] != 0) else 0, axis=1) 11 | 12 | # Empty squares count 13 | # Usefulness: This adds useful real world knowledge to classify "Class" as it counts the number of empty squares in the board, which can help to identify if the game ended in a draw. 14 | # Input samples: 'top-left-square': [2, 2, 2], 'top-middle-square': [0, 0, 1], 'top-right-square': [1, 0, 1], 'middle-left-square': [1, 0, 2], 'middle-middle-square': [0, 2, 2], 'middle-right-square': [1, 1, 2], 'bottom-left-square': [2, 1, 1], 'bottom-middle-square': [2, 0, 0], 'bottom-right-square': [2, 2, 0] 15 | df['empty_squares'] = df.apply(lambda row: sum([1 for col in ['top-left-square', 'top-middle-square', 'top-right-square', 'middle-left-square', 'middle-middle-square', 'middle-right-square', 'bottom-left-square', 'bottom-middle-square', 'bottom-right-square'] if row[col] == 0]), axis=1) 16 | 17 | # Dropping less relevant columns 18 | # Explanation: Dropping the original square columns as the newly created features (diagonal, horizontal, vertical wins, and empty squares count) capture the relevant information for predicting "Class". 
19 | df.drop(columns=['top-left-square', 'top-middle-square', 'top-right-square', 'middle-left-square', 'middle-middle-square', 'middle-right-square', 'bottom-left-square', 'bottom-middle-square', 'bottom-right-square'], inplace=True) 20 | -------------------------------------------------------------------------------- /data/generated_code/eucalyptus_v3_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # ('DBH_Ht', 'Feature that combines DBH and Ht', 'This feature combines the height and diameter of the tree, which is important for determining the overall size and maturity of the tree.') 3 | df['DBH_Ht'] = df['DBH'] * df['Ht'] 4 | 5 | # ('Surv_Vig', 'Feature that combines Surv and Vig', 'This feature combines the survival and vigor of the tree, which is important for determining the health and growth potential of the tree.') 6 | df['Surv_Vig'] = df['Surv'] * df['Vig'] 7 | 8 | # ('InsRes_StemFm', 'Feature that combines Ins_res and Stem_Fm', 'This feature combines the insect resistance and stem form of the tree, which is important for determining the overall health and resistance to pests and diseases.') 9 | df['InsRes_StemFm'] = df['Ins_res'] * df['Stem_Fm'] 10 | 11 | # ('CrownFm_BrnchFm', 'Feature that combines Crown_Fm and Brnch_Fm', 'This feature combines the crown form and branch form of the tree, which is important for determining the overall shape and structure of the tree.') 12 | df['CrownFm_BrnchFm'] = df['Crown_Fm'] * df['Brnch_Fm'] 13 | 14 | # ('Altitude_Rainfall', 'Feature that combines Altitude and Rainfall', 'This feature combines the altitude and rainfall of the site, which is important for determining the environmental conditions that the tree is growing in.') 15 | df['Altitude_Rainfall'] = df['Altitude'] * df['Rainfall'] 16 | 17 | # Drop PMCno column due to high amount of missing data 18 | df.drop(columns=['PMCno'], inplace=True) 19 | # ('Altitude_Sp', 'Feature that combines Altitude and Sp', 'This 
feature combines the altitude and species of the tree, which is important for determining the environmental conditions and type of the tree.') 20 | df['Altitude_Sp'] = df['Altitude'] * df['Sp'] 21 | 22 | # ('Rainfall_Year', 'Feature that combines Rainfall and Year', 'This feature combines the rainfall and year of planting of the tree, which is important for determining the climate conditions and planting time of the tree.') 23 | df['Rainfall_Year'] = df['Rainfall'] * df['Year'] 24 | 25 | # ('DBH_Ht_Sp', 'Feature that combines DBH, Ht, and Sp', 'This feature combines the diameter, height, and species of the tree, which is important for determining the size and type of the tree.') 26 | df['DBH_Ht_Sp'] = df['DBH'] * df['Ht'] * df['Sp'] 27 | 28 | # ('Rainfall_Frosts_Locality', 'Feature that combines Rainfall, Frosts, and Locality', 'This feature combines the rainfall, frosts, and locality of the site, which is important for determining the environmental conditions that the tree is growing in.') 29 | df['Rainfall_Frosts_Locality'] = df['Rainfall'] * df['Frosts'] * df['Locality'] 30 | 31 | # Drop Stem_Fm, Crown_Fm, Brnch_Fm columns due to high amount of missing data 32 | df.drop(columns=['Stem_Fm', 'Crown_Fm', 'Brnch_Fm'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/credit-g_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature name: Credit per month 3 | # Usefulness: This feature calculates the credit amount per month, which can help identify if the customer can afford the credit based on their monthly income. 4 | # Input samples: 'credit_amount': [2473.0, 522.0, 719.0], 'duration': [18.0, 12.0, 12.0] 5 | df['credit_per_month'] = df['credit_amount'] / df['duration'] 6 | # Feature name: Age group 7 | # Usefulness: This feature categorizes customers into age groups, which can help identify patterns in credit risk based on age. 
8 | # Input samples: 'age': [25.0, 42.0, 41.0] 9 | bins = [0, 25, 35, 50, 100] 10 | labels = [0, 1, 2, 3] 11 | df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels).astype(int) 12 | 13 | # Explanation why the column 'age' is dropped 14 | # The age column is dropped because the age_group column captures the relevant information about age for credit risk prediction. 15 | df.drop(columns=['age'], inplace=True) 16 | # Feature name: Employment stability 17 | # Usefulness: This feature checks if the customer has a stable employment history, which can be an indicator of their ability to repay the credit. 18 | # Input samples: 'employment': [0, 4, 4] 19 | df['employment_stability'] = df['employment'].apply(lambda x: 1 if x >= 3 else 0) 20 | 21 | # Explanation why the column 'employment' is dropped 22 | # The employment column is dropped because the employment_stability column captures the relevant information about employment stability for credit risk prediction. 23 | df.drop(columns=['employment'], inplace=True) 24 | # Feature name: Installment to credit ratio 25 | # Usefulness: This feature calculates the ratio between the installment commitment and the credit amount, which can help identify if the customer can afford the monthly payments. 26 | # Input samples: 'installment_commitment': [4.0, 4.0, 4.0], 'credit_amount': [2473.0, 522.0, 719.0] 27 | df['installment_to_credit_ratio'] = df['installment_commitment'] / df['credit_amount'] 28 | 29 | # Explanation why the column 'installment_commitment' is dropped 30 | # The installment_commitment column is dropped because the installment_to_credit_ratio column captures the relevant information about the relationship between installment commitment and credit amount for credit risk prediction. 
31 | df.drop(columns=['installment_commitment'], inplace=True) 32 | # Feature name: Credit history risk 33 | # Usefulness: This feature checks if the customer has a risky credit history, which can be an indicator of their creditworthiness. 34 | # Input samples: 'credit_history': [2, 4, 2] 35 | df['credit_history_risk'] = df['credit_history'].apply(lambda x: 1 if x in [0, 1, 3] else 0) 36 | 37 | # Explanation why the column 'credit_history' is dropped 38 | # The credit_history column is dropped because the credit_history_risk column captures the relevant information about credit history risk for credit risk prediction. 39 | df.drop(columns=['credit_history'], inplace=True) -------------------------------------------------------------------------------- /scripts/generate_features_script.py: -------------------------------------------------------------------------------- 1 | """ 2 | Runs the CAAFE algorithm on a dataset and saves the generated code and prompt to a file. 3 | """ 4 | 5 | import argparse 6 | from functools import partial 7 | 8 | from tabpfn.scripts import tabular_metrics 9 | from tabpfn import TabPFNClassifier 10 | import tabpfn 11 | from tabpfn.scripts.tabular_baselines import clf_dict 12 | import os 13 | import openai 14 | import torch 15 | 16 | from caafe.data import get_data_split, load_all_data 17 | from caafe.caafe import generate_features 18 | 19 | 20 | def generate_and_save_feats(i, seed=0, iterative_method=None, iterations=10): 21 | if iterative_method is None: 22 | iterative_method = tabpfn 23 | 24 | ds = cc_test_datasets_multiclass[i] 25 | 26 | ds, df_train, df_test, df_train_old, df_test_old = get_data_split(ds, seed) 27 | code, prompt, messages = generate_features( 28 | ds, 29 | df_train, 30 | just_print_prompt=False, 31 | model=model, 32 | iterative=iterations, 33 | metric_used=metric_used, 34 | iterative_method=iterative_method, 35 | display_method="print", 36 | ) 37 | 38 | data_dir = os.environ.get("DATA_DIR", "data/") 39 | with open( 40 |
f"{data_dir}/generated_code/{ds[0]}_{prompt_id}_{seed}_prompt.txt", 41 | "w", 42 | ) as f: 43 | f.write(prompt) 44 | 45 | 46 | with open(f"{data_dir}/generated_code/{ds[0]}_{prompt_id}_{seed}_code.txt", "w") as f: 47 | f.write(code) 48 | 49 | 50 | 51 | if __name__ == "__main__": 52 | # Parse args 53 | parser = argparse.ArgumentParser() 54 | parser.add_argument( 55 | "--seed", 56 | type=int, 57 | default=0, 58 | ) 59 | parser.add_argument( 60 | "--dataset_id", 61 | type=int, 62 | default=-1, 63 | ) 64 | parser.add_argument( 65 | "--prompt_id", 66 | type=str, 67 | default="v3", 68 | ) 69 | parser.add_argument( 70 | "--iterations", 71 | type=int, 72 | default=10, 73 | ) 74 | args = parser.parse_args() 75 | prompt_id = args.prompt_id 76 | dataset_id = args.dataset_id 77 | iterations = args.iterations 78 | seed = args.seed 79 | 80 | model = "gpt-3.5-turbo" if prompt_id == "v3" else "gpt-4" 81 | 82 | openai.api_key = os.environ["OPENAI_API_KEY"] 83 | 84 | cc_test_datasets_multiclass = load_all_data() 85 | if dataset_id != -1: 86 | cc_test_datasets_multiclass = [cc_test_datasets_multiclass[dataset_id]] 87 | 88 | device = "cuda" if torch.cuda.is_available() else "cpu" 89 | classifier = TabPFNClassifier(device=device, N_ensemble_configurations=16) 90 | classifier.fit = partial(classifier.fit, overwrite_warning=True) 91 | tabpfn = partial(clf_dict["transformer"], classifier=classifier) 92 | metric_used = tabular_metrics.auc_metric 93 | 94 | for i in range(len(cc_test_datasets_multiclass)): 95 | generate_and_save_feats(i, seed=seed, iterations=iterations) 96 | -------------------------------------------------------------------------------- /data/generated_code/kaggle_pharyngitis_v4_0_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Feature: age_temperature_interaction 3 | # Usefulness: This feature captures the interaction between age and temperature, which may help to identify patterns related to the likelihood of
a positive RADT result. 4 | # Input samples: 'age_y': [8.2, 5.5, 6.5], 'temperature': [39.7, 38.7, 38.0] 5 | df['age_temperature_interaction'] = df['age_y'] * df['temperature'] 6 | 7 | # Feature: pain_swollenadp_tender_interaction 8 | # Usefulness: This feature captures the interaction between pain, swollenadp, and tender, which may help to identify patterns related to the likelihood of a positive RADT result. 9 | # Input samples: 'pain': [0.0, 1.0, 1.0], 'swollenadp': [1.0, 0.0, 1.0], 'tender': [1.0, 0.0, 1.0] 10 | df['pain_swollenadp_tender_interaction'] = df['pain'] * df['swollenadp'] * df['tender'] 11 | 12 | # Dropping columns that may be redundant and hurt the predictive performance 13 | # Explanation: Since we have created an interaction feature with pain, swollenadp, and tender, we can drop the individual columns to reduce the chance of overfitting. 14 | df.drop(columns=['pain', 'swollenadp', 'tender'], inplace=True) 15 | 16 | # Feature: respiratory_symptoms 17 | # Usefulness: This feature captures the presence of respiratory symptoms (cough, rhinorrhea) which may help to identify patterns related to the likelihood of a positive RADT result. 18 | # Input samples: 'cough': [1.0, 0.0, 0.0], 'rhinorrhea': [0.0, 1.0, 0.0] 19 | df['respiratory_symptoms'] = df['cough'] + df['rhinorrhea'] 20 | 21 | # Dropping columns that may be redundant and hurt the predictive performance 22 | # Explanation: Since we have created a new feature combining cough and rhinorrhea, we can drop the individual columns to reduce the chance of overfitting. 23 | df.drop(columns=['cough', 'rhinorrhea'], inplace=True) 24 | 25 | # Feature: sudden_headache_interaction 26 | # Usefulness: This feature captures the interaction between sudden onset and headache, which may help to identify patterns related to the likelihood of a positive RADT result. 
27 | # Input samples: 'sudden': [0.0, 1.0, 0.0], 'headache': [0.0, 0.0, 0.0] 28 | df['sudden_headache_interaction'] = df['sudden'] * df['headache'] 29 | 30 | # Dropping columns that may be redundant and hurt the predictive performance 31 | # Explanation: Since we have created an interaction feature with sudden and headache, we can drop the individual columns to reduce the chance of overfitting. 32 | df.drop(columns=['sudden', 'headache'], inplace=True) 33 | # Feature: age_conjunctivitis_interaction 34 | # Usefulness: This feature captures the interaction between age and conjunctivitis, which may help to identify patterns related to the likelihood of a positive RADT result. 35 | # Input samples: 'age_y': [8.2, 5.5, 6.5], 'conjunctivitis': [0.0, 0.0, 0.0] 36 | df['age_conjunctivitis_interaction'] = df['age_y'] * df['conjunctivitis'] 37 | 38 | # Dropping columns that may be redundant and hurt the predictive performance 39 | # Explanation: Since we have created an interaction feature with age and conjunctivitis, we can drop the individual columns to reduce the chance of overfitting. 40 | df.drop(columns=['conjunctivitis'], inplace=True) -------------------------------------------------------------------------------- /data/generated_code/kaggle_health-insurance-lead-prediction-raw-data_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Health_Indicator_Encoded 3 | # Usefulness: This new feature encodes the Health Indicator categorical variable into numerical values. It can help to identify the health status of the policy holder, which might affect their likelihood of being classified as a lead. 4 | # Input samples: 'Health Indicator': ['X5', 'X4', 'X2'] 5 | df['Health_Indicator_Encoded'] = df['Health Indicator'].apply(lambda x: int(x[1:])) 6 | 7 | # Holding_Policy_Duration_Encoded 8 | # Usefulness: This new feature encodes the Holding Policy Duration categorical variable into numerical values. 
It can help to identify the duration of the holding policy, which might affect their likelihood of being classified as a lead. 9 | # Input samples: 'Holding_Policy_Duration': ['2.0', '14+', '12.0'] 10 | df['Holding_Policy_Duration_Encoded'] = df['Holding_Policy_Duration'].replace('14+', '15').astype(float) 11 | 12 | # Dropping original categorical columns, as we have encoded them into numerical features 13 | df.drop(columns=['Health Indicator', 'Holding_Policy_Duration'], inplace=True) 14 | 15 | # City_Code_Encoded 16 | # Usefulness: This new feature encodes the City_Code categorical variable into numerical values. It can help to identify the city where the policy holder lives, which might affect their likelihood of being classified as a lead. 17 | # Input samples: 'City_Code': ['C4', 'C20', 'C17'] 18 | df['City_Code_Encoded'] = df['City_Code'].apply(lambda x: int(x[1:])) 19 | 20 | # Dropping original categorical column, as we have encoded it into a numerical feature 21 | df.drop(columns=['City_Code'], inplace=True) 22 | 23 | # Reco_Policy_Premium_Ratio 24 | # Usefulness: This new feature represents the ratio of the Reco_Policy_Premium to the average premium in the dataset. It can help to identify if the policy holder is applying for a policy with a premium that is higher or lower than the average, which might affect their likelihood of being classified as a lead. 25 | # Input samples: 'Reco_Policy_Premium': [16172.0, 19272.0, 13661.2] 26 | df['Reco_Policy_Premium_Ratio'] = df['Reco_Policy_Premium'] / df['Reco_Policy_Premium'].mean() 27 | 28 | # Holding_Policy_Type_Ratio 29 | # Usefulness: This new feature represents the ratio of the Holding_Policy_Type to the average Holding_Policy_Type in the dataset. It can help to identify if the policy holder is applying for a policy with a holding policy type that is more or less common, which might affect their likelihood of being classified as a lead. 
30 | # Input samples: 'Holding_Policy_Type': [4.0, 3.0, 2.0] 31 | df['Holding_Policy_Type_Ratio'] = df['Holding_Policy_Type'] / df['Holding_Policy_Type'].mean() 32 | 33 | # Reco_Policy_Cat_Ratio 34 | # Usefulness: This new feature represents the ratio of the Reco_Policy_Cat to the average Reco_Policy_Cat in the dataset. It can help to identify if the policy holder is applying for a policy with a recommended policy category that is more or less common, which might affect their likelihood of being classified as a lead. 35 | # Input samples: 'Reco_Policy_Cat': [5, 18, 22] 36 | df['Reco_Policy_Cat_Ratio'] = df['Reco_Policy_Cat'] / df['Reco_Policy_Cat'].mean() 37 | -------------------------------------------------------------------------------- /data/generated_code/tic-tac-toe_v4_3_code.txt: -------------------------------------------------------------------------------- 1 | 2 | # Top row win 3 | # Usefulness: Indicates if the top row has a winning combination for "x" 4 | # Input samples: 'top-left-square': [0, 2, 0], 'top-middle-square': [1, 0, 0], 'top-right-square': [1, 0, 1] 5 | df['top_row_win'] = (df['top-left-square'] == df['top-middle-square']) & (df['top-middle-square'] == df['top-right-square']) & (df['top-left-square'] == 1) 6 | 7 | # Middle row win 8 | # Usefulness: Indicates if the middle row has a winning combination for "x" 9 | # Input samples: 'middle-left-square': [2, 2, 0], 'middle-middle-square': [2, 1, 1], 'middle-right-square': [2, 0, 0] 10 | df['middle_row_win'] = (df['middle-left-square'] == df['middle-middle-square']) & (df['middle-middle-square'] == df['middle-right-square']) & (df['middle-left-square'] == 1) 11 | 12 | # Bottom row win 13 | # Usefulness: Indicates if the bottom row has a winning combination for "x" 14 | # Input samples: 'bottom-left-square': [1, 2, 2], 'bottom-middle-square': [0, 1, 2], 'bottom-right-square': [2, 0, 2] 15 | df['bottom_row_win'] = (df['bottom-left-square'] == df['bottom-middle-square']) & 
(df['bottom-middle-square'] == df['bottom-right-square']) & (df['bottom-left-square'] == 1) 16 | 17 | # Left column win 18 | # Usefulness: Indicates if the left column has a winning combination for "x" 19 | # Input samples: 'top-left-square': [0, 2, 0], 'middle-left-square': [2, 2, 0], 'bottom-left-square': [1, 2, 2] 20 | df['left_column_win'] = (df['top-left-square'] == df['middle-left-square']) & (df['middle-left-square'] == df['bottom-left-square']) & (df['top-left-square'] == 1) 21 | 22 | # Right column win 23 | # Usefulness: Indicates if the right column has a winning combination for "x" 24 | # Input samples: 'top-right-square': [1, 0, 1], 'middle-right-square': [2, 0, 0], 'bottom-right-square': [2, 0, 2] 25 | df['right_column_win'] = (df['top-right-square'] == df['middle-right-square']) & (df['middle-right-square'] == df['bottom-right-square']) & (df['top-right-square'] == 1) 26 | 27 | # Diagonal win (top-left to bottom-right) 28 | # Usefulness: Indicates if the diagonal from top-left to bottom-right has a winning combination for "x" 29 | # Input samples: 'top-left-square': [0, 2, 0], 'middle-middle-square': [2, 1, 1], 'bottom-right-square': [2, 0, 2] 30 | df['diag_tl_br_win'] = (df['top-left-square'] == df['middle-middle-square']) & (df['middle-middle-square'] == df['bottom-right-square']) & (df['top-left-square'] == 1) 31 | 32 | # Diagonal win (top-right to bottom-left) 33 | # Usefulness: Indicates if the diagonal from top-right to bottom-left has a winning combination for "x" 34 | # Input samples: 'top-right-square': [1, 0, 1], 'middle-middle-square': [2, 1, 1], 'bottom-left-square': [1, 2, 2] 35 | df['diag_tr_bl_win'] = (df['top-right-square'] == df['middle-middle-square']) & (df['middle-middle-square'] == df['bottom-left-square']) & (df['top-right-square'] == 1) 36 | 37 | # Dropping original columns as they are now represented by the added features 38 | df.drop(columns=['top-left-square', 'top-middle-square', 'top-right-square', 'middle-left-square', 
'middle-middle-square', 'middle-right-square', 'bottom-left-square', 'bottom-middle-square', 'bottom-right-square'], inplace=True) 39 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v3_4_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [14, 15, 0, 17, 1, 11, 5, 1, 17, 13] 9 | Flight (float64): NaN-freq [0.0%], Samples [3972.0, 649.0, 815.0, 2027.0, 1509.0, 2776.0, 330.0, 2362.0, 1628.0, 955.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [63.0, 30.0, 80.0, 0.0, 38.0, 136.0, 22.0, 71.0, 197.0, 2.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [6.0, 77.0, 0.0, 61.0, 1.0, 3.0, 59.0, 6.0, 61.0, 45.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [0, 2, 1, 2, 3, 0, 2, 2, 2, 1] 13 | Time (float64): NaN-freq [0.0%], Samples [530.0, 575.0, 485.0, 490.0, 921.0, 690.0, 1215.0, 965.0, 375.0, 985.0] 14 | Length (float64): NaN-freq [0.0%], Samples [85.0, 152.0, 142.0, 145.0, 175.0, 161.0, 149.0, 145.0, 60.0, 97.0] 15 | Delay (category): NaN-freq [0.0%], Samples [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [14, 15, 0], 'Flight': [3972.0, 649.0, 815.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v4_4_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [14, 15, 0, 17, 1, 11, 5, 1, 17, 13] 9 | Flight (float64): NaN-freq [0.0%], Samples [3972.0, 649.0, 815.0, 2027.0, 1509.0, 2776.0, 330.0, 2362.0, 1628.0, 955.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [63.0, 30.0, 80.0, 0.0, 38.0, 136.0, 22.0, 71.0, 197.0, 2.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [6.0, 77.0, 0.0, 61.0, 1.0, 3.0, 59.0, 6.0, 61.0, 45.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [0, 2, 1, 2, 3, 0, 2, 2, 2, 1] 13 | Time (float64): NaN-freq [0.0%], Samples [530.0, 575.0, 485.0, 490.0, 921.0, 690.0, 1215.0, 965.0, 375.0, 985.0] 14 | Length (float64): NaN-freq [0.0%], Samples [85.0, 152.0, 142.0, 145.0, 175.0, 161.0, 149.0, 145.0, 60.0, 97.0] 15 | Delay (category): NaN-freq [0.0%], Samples [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [14, 15, 0], 'Flight': [3972.0, 649.0, 815.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v3_2_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [13, 13, 6, 3, 4, 0, 12, 5, 17, 17] 9 | Flight (float64): NaN-freq [0.0%], Samples [321.0, 116.0, 17.0, 1623.0, 2642.0, 1444.0, 7305.0, 312.0, 244.0, 1131.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [129.0, 1.0, 4.0, 102.0, 54.0, 108.0, 65.0, 72.0, 137.0, 202.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [6.0, 6.0, 11.0, 86.0, 39.0, 42.0, 82.0, 20.0, 15.0, 15.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [3, 0, 3, 0, 1, 5, 1, 6, 0, 4] 13 | Time (float64): NaN-freq [0.0%], Samples [890.0, 783.0, 100.0, 525.0, 645.0, 780.0, 527.0, 930.0, 880.0, 1085.0] 14 | Length (float64): NaN-freq [0.0%], Samples [134.0, 244.0, 380.0, 68.0, 132.0, 49.0, 127.0, 146.0, 280.0, 70.0] 15 | Delay (category): NaN-freq [0.0%], Samples [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [13, 13, 6], 'Flight': [321.0, 116.0, 17.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v4_2_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [13, 13, 6, 3, 4, 0, 12, 5, 17, 17] 9 | Flight (float64): NaN-freq [0.0%], Samples [321.0, 116.0, 17.0, 1623.0, 2642.0, 1444.0, 7305.0, 312.0, 244.0, 1131.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [129.0, 1.0, 4.0, 102.0, 54.0, 108.0, 65.0, 72.0, 137.0, 202.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [6.0, 6.0, 11.0, 86.0, 39.0, 42.0, 82.0, 20.0, 15.0, 15.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [3, 0, 3, 0, 1, 5, 1, 6, 0, 4] 13 | Time (float64): NaN-freq [0.0%], Samples [890.0, 783.0, 100.0, 525.0, 645.0, 780.0, 527.0, 930.0, 880.0, 1085.0] 14 | Length (float64): NaN-freq [0.0%], Samples [134.0, 244.0, 380.0, 68.0, 132.0, 49.0, 127.0, 146.0, 280.0, 70.0] 15 | Delay (category): NaN-freq [0.0%], Samples [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [13, 13, 6], 'Flight': [321.0, 116.0, 17.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v3_0_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [6, 4, 4, 15, 17, 4, 11, 4, 7, 17] 9 | Flight (float64): NaN-freq [0.0%], Samples [376.0, 2056.0, 1182.0, 1018.0, 1574.0, 1860.0, 2703.0, 2720.0, 6534.0, 223.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [225.0, 39.0, 5.0, 15.0, 4.0, 183.0, 102.0, 65.0, 6.0, 123.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [11.0, 7.0, 60.0, 13.0, 41.0, 7.0, 86.0, 5.0, 261.0, 61.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [4, 3, 1, 1, 3, 4, 2, 0, 1, 4] 13 | Time (float64): NaN-freq [0.0%], Samples [1195.0, 735.0, 1198.0, 1138.0, 1005.0, 1050.0, 1375.0, 981.0, 686.0, 375.0] 14 | Length (float64): NaN-freq [0.0%], Samples [30.0, 130.0, 166.0, 200.0, 270.0, 130.0, 59.0, 83.0, 100.0, 95.0] 15 | Delay (category): NaN-freq [0.0%], Samples [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [6, 4, 4], 'Flight': [376.0, 2056.0, 1182.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v3_1_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [14, 2, 12, 17, 17, 17, 4, 4, 7, 4] 9 | Flight (float64): NaN-freq [0.0%], Samples [4265.0, 648.0, 2862.0, 2642.0, 1149.0, 595.0, 1395.0, 2123.0, 4772.0, 1430.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [113.0, 21.0, 249.0, 72.0, 233.0, 233.0, 15.0, 15.0, 25.0, 16.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [6.0, 66.0, 12.0, 76.0, 12.0, 64.0, 25.0, 5.0, 60.0, 5.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [3, 3, 4, 6, 4, 6, 1, 6, 6, 5] 13 | Time (float64): NaN-freq [0.0%], Samples [915.0, 715.0, 1075.0, 1220.0, 515.0, 430.0, 1050.0, 340.0, 625.0, 840.0] 14 | Length (float64): NaN-freq [0.0%], Samples [100.0, 173.0, 63.0, 60.0, 80.0, 65.0, 77.0, 144.0, 92.0, 82.0] 15 | Delay (category): NaN-freq [0.0%], Samples [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [14, 2, 12], 'Flight': [4265.0, 648.0, 2862.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v3_3_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [3, 13, 7, 11, 12, 17, 12, 11, 0, 3] 9 | Flight (float64): NaN-freq [0.0%], Samples [1589.0, 356.0, 6512.0, 2459.0, 7287.0, 2826.0, 7292.0, 2995.0, 797.0, 463.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [102.0, 21.0, 1.0, 108.0, 34.0, 162.0, 53.0, 129.0, 53.0, 65.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [45.0, 24.0, 120.0, 3.0, 6.0, 86.0, 6.0, 3.0, 15.0, 3.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [1, 6, 6, 4, 6, 4, 0, 3, 2, 3] 13 | Time (float64): NaN-freq [0.0%], Samples [865.0, 866.0, 461.0, 755.0, 724.0, 615.0, 745.0, 360.0, 455.0, 335.0] 14 | Length (float64): NaN-freq [0.0%], Samples [210.0, 157.0, 134.0, 170.0, 146.0, 60.0, 140.0, 215.0, 365.0, 164.0] 15 | Delay (category): NaN-freq [0.0%], Samples [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [3, 13, 7], 'Flight': [1589.0, 356.0, 6512.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v4_0_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [6, 4, 4, 15, 17, 4, 11, 4, 7, 17] 9 | Flight (float64): NaN-freq [0.0%], Samples [376.0, 2056.0, 1182.0, 1018.0, 1574.0, 1860.0, 2703.0, 2720.0, 6534.0, 223.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [225.0, 39.0, 5.0, 15.0, 4.0, 183.0, 102.0, 65.0, 6.0, 123.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [11.0, 7.0, 60.0, 13.0, 41.0, 7.0, 86.0, 5.0, 261.0, 61.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [4, 3, 1, 1, 3, 4, 2, 0, 1, 4] 13 | Time (float64): NaN-freq [0.0%], Samples [1195.0, 735.0, 1198.0, 1138.0, 1005.0, 1050.0, 1375.0, 981.0, 686.0, 375.0] 14 | Length (float64): NaN-freq [0.0%], Samples [30.0, 130.0, 166.0, 200.0, 270.0, 130.0, 59.0, 83.0, 100.0, 95.0] 15 | Delay (category): NaN-freq [0.0%], Samples [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [6, 4, 4], 'Flight': [376.0, 2056.0, 1182.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v4_1_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [14, 2, 12, 17, 17, 17, 4, 4, 7, 4] 9 | Flight (float64): NaN-freq [0.0%], Samples [4265.0, 648.0, 2862.0, 2642.0, 1149.0, 595.0, 1395.0, 2123.0, 4772.0, 1430.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [113.0, 21.0, 249.0, 72.0, 233.0, 233.0, 15.0, 15.0, 25.0, 16.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [6.0, 66.0, 12.0, 76.0, 12.0, 64.0, 25.0, 5.0, 60.0, 5.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [3, 3, 4, 6, 4, 6, 1, 6, 6, 5] 13 | Time (float64): NaN-freq [0.0%], Samples [915.0, 715.0, 1075.0, 1220.0, 515.0, 430.0, 1050.0, 340.0, 625.0, 840.0] 14 | Length (float64): NaN-freq [0.0%], Samples [100.0, 173.0, 63.0, 60.0, 80.0, 65.0, 77.0, 144.0, 92.0, 82.0] 15 | Delay (category): NaN-freq [0.0%], Samples [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [14, 2, 12], 'Flight': [4265.0, 648.0, 2862.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/airlines_v4_3_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | " 5 | Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure." 6 | 7 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 8 | Airline (int32): NaN-freq [0.0%], Samples [3, 13, 7, 11, 12, 17, 12, 11, 0, 3] 9 | Flight (float64): NaN-freq [0.0%], Samples [1589.0, 356.0, 6512.0, 2459.0, 7287.0, 2826.0, 7292.0, 2995.0, 797.0, 463.0] 10 | AirportFrom (float64): NaN-freq [0.0%], Samples [102.0, 21.0, 1.0, 108.0, 34.0, 162.0, 53.0, 129.0, 53.0, 65.0] 11 | AirportTo (float64): NaN-freq [0.0%], Samples [45.0, 24.0, 120.0, 3.0, 6.0, 86.0, 6.0, 3.0, 15.0, 3.0] 12 | DayOfWeek (int32): NaN-freq [0.0%], Samples [1, 6, 6, 4, 6, 4, 0, 3, 2, 3] 13 | Time (float64): NaN-freq [0.0%], Samples [865.0, 866.0, 461.0, 755.0, 724.0, 615.0, 745.0, 360.0, 455.0, 335.0] 14 | Length (float64): NaN-freq [0.0%], Samples [210.0, 157.0, 134.0, 170.0, 146.0, 60.0, 140.0, 215.0, 365.0, 164.0] 15 | Delay (category): NaN-freq [0.0%], Samples [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 19 | Number of samples (rows) in training dataset: 1500 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "Delay". 
22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "Delay" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'Airline': [3, 13, 7], 'Flight': [1589.0, 356.0, 6512.0], ...) 33 | (Some pandas code using Airline', 'Flight', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | -------------------------------------------------------------------------------- /data/generated_code/balance-scale_v3_0_prompt.txt: -------------------------------------------------------------------------------- 1 | 2 | The dataframe `df` is loaded and in memory. Columns are also named attributes. 3 | Description of the dataset in `df` (column dtypes might be inaccurate): 4 | "**Balance Scale Weight & Distance Database** 5 | This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is the greater of (left-distance * left-weight) and (right-distance * right-weight). If they are equal, it is balanced. 6 | 7 | Attribute description 8 | The attributes are the left weight, the left distance, the right weight, and the right distance." 9 | 10 | Columns in `df` (true feature dtypes listed here, categoricals encoded as int): 11 | left-weight (float64): NaN-freq [0.0%], Samples [4.0, 5.0, 1.0, 3.0, 1.0, 4.0, 2.0, 3.0, 5.0, 4.0] 12 | left-distance (float64): NaN-freq [0.0%], Samples [2.0, 4.0, 4.0, 3.0, 3.0, 5.0, 1.0, 3.0, 4.0, 1.0] 13 | right-weight (float64): NaN-freq [0.0%], Samples [4.0, 4.0, 5.0, 2.0, 5.0, 3.0, 2.0, 2.0, 5.0, 4.0] 14 | right-distance (float64): NaN-freq [0.0%], Samples [2.0, 4.0, 5.0, 3.0, 1.0, 3.0, 4.0, 4.0, 4.0, 2.0] 15 | class (category): NaN-freq [0.0%], Samples [1.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0] 16 | 17 | 18 | This code was written by an expert datascientist working to improve predictions. It is a snippet of code that adds new columns to the dataset. 
19 | Number of samples (rows) in training dataset: 93 20 | 21 | This code generates additional columns that are useful for a downstream classification algorithm (such as XGBoost) predicting "class". 22 | Additional columns add new semantic information, that is they use real world knowledge on the dataset. They can e.g. be feature combinations, transformations, aggregations where the new column is a function of the existing columns. 23 | The scale of columns and offset does not matter. Make sure all used columns exist. Follow the above description of columns closely and consider the datatypes and meanings of classes. 24 | This code also drops columns, if these may be redundant and hurt the predictive performance of the downstream classifier (Feature selection). Dropping columns may help as the chance of overfitting is lower, especially if the dataset is small. 25 | The classifier will be trained on the dataset with the generated columns and evaluated on a holdout set. The evaluation metric is accuracy. The best performing code will be selected. 26 | Added columns can be used in other codeblocks, dropped columns are not available anymore. 27 | 28 | Code formatting for each added column: 29 | ```python 30 | # (Feature name and description) 31 | # Usefulness: (Description why this adds useful real world knowledge to classify "class" according to dataset description and attributes.) 32 | # Input samples: (Three samples of the columns used in the following code, e.g. 'left-weight': [4.0, 5.0, 1.0], 'left-distance': [2.0, 4.0, 4.0], ...) 33 | (Some pandas code using left-weight', 'left-distance', ... to add a new column for each row in df) 34 | ```end 35 | 36 | Code formatting for dropping columns: 37 | ```python 38 | # Explanation why the column XX is dropped 39 | df.drop(columns=['XX'], inplace=True) 40 | ```end 41 | 42 | Each codeblock generates exactly one useful column and can drop unused columns (Feature selection). 
43 | Each codeblock ends with ```end and starts with "```python" 44 | Codeblock: 45 | --------------------------------------------------------------------------------
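The prompt files above all request the same response shape: one pandas codeblock that adds a single semantically motivated column (with a name/usefulness/input-samples comment header) and may drop redundant columns. As a hypothetical illustration only (the `DepHour` feature and the choice to drop `Flight` are my own, not taken from any of the generated-code files), a conforming response for the airlines prompts might look like:

```python
import pandas as pd

# Minimal frame mimicking the airlines prompts' columns (values illustrative).
df = pd.DataFrame({
    "Airline": [6, 4, 4],
    "Flight": [376.0, 2056.0, 1182.0],
    "Time": [1195.0, 735.0, 1198.0],   # scheduled departure, minutes since midnight
    "Length": [30.0, 130.0, 166.0],    # flight duration in minutes
})

# (DepHour: scheduled departure hour)
# Usefulness: delays cluster by time of day (e.g. evening congestion), so the
# departure hour adds real-world signal beyond the raw minute count.
# Input samples: 'Time': [1195.0, 735.0, 1198.0]
df["DepHour"] = (df["Time"] // 60).astype(int)

# Example of the drop format: Flight number is close to an arbitrary
# identifier, so keeping it risks overfitting on a small training set.
df.drop(columns=["Flight"], inplace=True)
```

In the actual prompt format each such block would be fenced with ```python and closed with ```end, and each block contributes exactly one new column.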