%pip install anai-opensource
import anai
from anai.preprocessing import Preprocessor
df = anai.load(
    df_filepath='/Users/arshanwar/Projects/AutoML/open_source/ANAI/examples/healthcare-dataset-stroke-data.csv')
Loading Data [*]
Data Loaded Successfully [ ✓ ]
df.head()
| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
prep = Preprocessor(dataset = df, target = 'stroke')
summary = prep.summary()
summary.head(10)
| | Stats |
---|---|
No. of Cells | 61320 |
No. of Variables | 12 |
No. of Records | 5110 |
Missing Cells | 0.3 % |
Missing Cells Count | 201 |
Duplicacy | 0.00 % |
Duplicate Cell Count | 0 |
Anomaly Count | 256 |
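The headline figures in the summary can be cross-checked by hand. The sketch below is plain Python arithmetic on the numbers printed above; it does not call ANAI, and the variable names are illustrative only.

```python
# Figures copied from the summary table.
n_records = 5110
n_variables = 12
missing_cells = 201

# Total cells is rows times columns.
n_cells = n_records * n_variables

# Share of missing cells over the whole table, in percent.
missing_pct_overall = 100 * missing_cells / n_cells

# If every missing cell sat in a single column, that column's missing share
# would be missing_cells / n_records. column_summary() reports about 3.93 %
# for bmi and 0.0 for every other column, which is consistent with this.
missing_pct_single_column = 100 * missing_cells / n_records

print(n_cells)
print(round(missing_pct_overall, 1))
print(round(missing_pct_single_column, 2))
```

The three printed values line up with `No. of Cells`, `Missing Cells`, and the `bmi` entry of `Missing Value %`, which suggests that all 201 missing cells belong to `bmi`.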
column_summary = prep.column_summary()
column_summary.head(24)
| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Type Error | ID column is not allowed | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
hide | True | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Type | NaN | Categorical | Numeric | Numeric | Numeric | Categorical | Categorical | Categorical | Numeric | Numeric | Categorical | Numeric |
Missing Value % | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9334637964774952 | 0.0 | 0.0 |
Mean | NaN | NA as column dtype is Categorical | 43.23 | 0.10 | 0.05 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 106.15 | 28.89 | NA as column dtype is Categorical | 0.05 |
Mode | NaN | NA as column dtype is Categorical | 78.00 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 93.88 | 28.70 | NA as column dtype is Categorical | 0.00 |
Maximum value | NaN | NA as column dtype is Categorical | 82.00 | 1.00 | 1.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 271.74 | 97.60 | NA as column dtype is Categorical | 1.00 |
Median | NaN | NA as column dtype is Categorical | 45.00 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 91.88 | 28.10 | NA as column dtype is Categorical | 0.00 |
Minimum value | NaN | NA as column dtype is Categorical | 0.08 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 55.12 | 10.30 | NA as column dtype is Categorical | 0.00 |
Standard Deviation | NaN | NA as column dtype is Categorical | 22.61 | 0.30 | 0.23 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 45.28 | 7.85 | NA as column dtype is Categorical | 0.22 |
99% Quartile | NaN | NA as column dtype is Categorical | 82.00 | 1.00 | 1.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 240.71 | 53.40 | NA as column dtype is Categorical | 1.00 |
90% Quartile | NaN | NA as column dtype is Categorical | 75.00 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 192.18 | 38.90 | NA as column dtype is Categorical | 0.00 |
66% Quartile | NaN | NA as column dtype is Categorical | 55.00 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 104.08 | 31.00 | NA as column dtype is Categorical | 0.00 |
33% Quartile | NaN | NA as column dtype is Categorical | 32.00 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 81.78 | 25.10 | NA as column dtype is Categorical | 0.00 |
10% Quartile | NaN | NA as column dtype is Categorical | 11.00 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 65.79 | 19.70 | NA as column dtype is Categorical | 0.00 |
1% Quartile | NaN | NA as column dtype is Categorical | 1.08 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 56.33 | 15.10 | NA as column dtype is Categorical | 0.00 |
Variance | NaN | NA as column dtype is Categorical | 511.33 | 0.09 | 0.05 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 2050.60 | 61.69 | NA as column dtype is Categorical | 0.05 |
Monotonic | NaN | NA as column dtype is Categorical | 0.00 | 0.00 | 0.00 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 0.00 | 0.00 | NA as column dtype is Categorical | 0.00 |
Mean Absolute Deviation | NaN | NA as column dtype is Categorical | 19.12 | 0.18 | 0.10 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 33.06 | 5.98 | NA as column dtype is Categorical | 0.09 |
No. of Unique Values | NaN | 3 | 104 | 2 | 2 | 2 | 5 | 2 | 3979 | 419 | 4 | 2 |
No. of Negative Values | NaN | NA as column dtype is Categorical | 0 | 0 | 0 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 0 | 0 | NA as column dtype is Categorical | 0 |
Percentage Infinite Values | NaN | NA as column dtype is Categorical | 0.0 | 0.0 | 0.0 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 0.0 | 0.0 | NA as column dtype is Categorical | 0.0 |
Skewness | NaN | NA as column dtype is Categorical | -0.14 | 2.72 | 3.95 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 1.57 | 1.06 | NA as column dtype is Categorical | 4.19 |
Shapiro_W | NaN | NA as column dtype is Categorical | 0.97 | 0.34 | 0.24 | NA as column dtype is Categorical | NA as column dtype is Categorical | NA as column dtype is Categorical | 0.81 | nan | NA as column dtype is Categorical | 0.22 |
df1 = prep.impute(method = 'mean')
df1.isna().sum()
id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
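For intuition, mean imputation replaces each missing entry with the mean of the observed entries in the same column. The following is a minimal standard-library sketch of that idea, not ANAI's implementation; `impute_mean` is a hypothetical helper, and the input is just the five `bmi` values from `df.head()`.

```python
import math
from statistics import fmean

def impute_mean(values):
    """Return a copy of `values` with NaN entries replaced by the mean of the rest."""
    observed = [v for v in values if not math.isnan(v)]
    fill = fmean(observed)
    return [fill if math.isnan(v) else v for v in values]

# bmi values from the five rows shown by df.head(); the second one is missing.
bmi_head = [36.6, float("nan"), 32.5, 34.4, 24.0]
bmi_filled = impute_mean(bmi_head)

print(bmi_filled)
# Count of NaNs left after imputation.
print(sum(math.isnan(v) for v in bmi_filled))
```

On the full column the fill value would presumably be the column mean, which the column summary reports as 28.89, rather than the mean of these five rows.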
features, labels = prep.encode(split = True)
features.head(4)
| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9046 | 0.048728 | 67.0 | 0 | 1 | 0.048728 | 0.048728 | 0.048728 | 228.69 | 36.6 | 0.048728 |
1 | 51676 | 0.048728 | 61.0 | 0 | 0 | 0.524364 | 0.048728 | 0.048728 | 202.21 | NaN | 0.048728 |
2 | 31112 | 0.524364 | 80.0 | 0 | 1 | 0.682909 | 0.524364 | 0.524364 | 105.92 | 32.5 | 0.524364 |
3 | 60182 | 0.524364 | 49.0 | 0 | 0 | 0.762182 | 0.682909 | 0.524364 | 171.23 | 34.4 | 0.048728 |
X_train, X_val, y_train, y_val, scaler = prep.prepare(
    features, labels, test_size=0.2, random_state=42, smote=False, k_neighbors=3)
X_train.shape, X_val.shape, y_val.shape, y_train.shape
((4088, 11), (1022, 11), (1022,), (4088,))
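The shapes follow from holding out 20 % of the 5110 records; the 11 columns are the 12 variables minus the `stroke` target. How `prep.prepare` splits internally is not shown in this notebook, so the sketch below is a generic shuffled hold-out split in plain Python, with `holdout_split` as a hypothetical helper.

```python
import random

def holdout_split(n_rows, test_size, seed):
    """Shuffle row indices and split them into train and validation lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Number of validation rows, rounded to the nearest whole row.
    n_val = round(n_rows * test_size)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(n_rows=5110, test_size=0.2, seed=42)
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the split reproducible, which is the role `random_state = 42` appears to play in the call above.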
ai = anai.run(
    filepath='/Users/arshanwar/Projects/AutoML/open_source/ANAI/examples/healthcare-dataset-stroke-data.csv',
    target='stroke',
    predictor=['rfc', 'cat', 'xgb', 'lgbm', 'ext'])
ANAITaskWarning: Task is getting detected automatically. To suppress this behaviour, set suppress_task_detection=True and specify task with task argument
Task: Classification
░█████╗░███╗░░██╗░█████╗░██╗
██╔══██╗████╗░██║██╔══██╗██║
███████║██╔██╗██║███████║██║
██╔══██║██║╚████║██╔══██║██║
██║░░██║██║░╚███║██║░░██║██║
╚═╝░░╚═╝╚═╝░░╚══╝╚═╝░░╚═╝╚═╝
Started ANAI [ ✓ ]
Preprocessing Started [*]
Imputing Missing Values by mean [*]
Imputing Done [ ✓ ]
Preprocessing Done [ ✓ ]
Training ANAI [*]
Ensembling on top 5 models
Training Done [ ✓ ]
Results Below
| | Name | Accuracy | Cross Validated Accuracy |
---|---|---|---|
0 | Random Forest Classifier | 99.412916 | 99.363824 |
1 | Stacking Ensembler | 99.510000 | 99.340000 |
2 | Extra Trees Classifier | 99.412916 | 99.314864 |
3 | Max Voting Ensembler | 99.410000 | 99.310000 |
4 | CatBoost Classifier | 99.412916 | 99.265904 |
5 | XGBoost Classifier | 99.412916 | 99.241574 |
6 | LightGBM Classifier | 99.412916 | 99.143715 |
Completed ANAI Run [ ✓ ]
Saved Best Model to anai_info/best/classifier/models/Random_Forest_Classifier_1655501647.pkl and its scaler to anai_info/best/classifier/scalers/Random_Forest_Classifier_Scaler_1655501647.pkl
Time Elapsed : 144.39 seconds
ai.explain('perm')
Explaining Best ANAI model [*]
Explaining ANAI Done [ ✓ ]
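Assuming `'perm'` selects permutation feature importance (the notebook does not say so explicitly), the idea is to shuffle one feature column at a time and measure how much the model's score drops. The sketch below illustrates that technique in plain Python on toy data; `accuracy`, `permutation_importance`, and `toy_model` are hypothetical names and none of this is ANAI's own code.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(model, rows, labels, seed=0):
    """Accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        # Shuffle column j only, leaving every other column untouched.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return drops

# Toy data: the label depends only on feature 0; feature 1 is pure noise.
rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [row[0] > 0.5 for row in rows]

def toy_model(row):
    """Stand-in for a fitted classifier: thresholds feature 0."""
    return row[0] > 0.5

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Shuffling the informative feature lowers accuracy, while shuffling the noise feature leaves it unchanged, so the first importance is positive and the second is zero.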
ai.result()
| | Name | Accuracy | Cross Validated Accuracy | Model |
---|---|---|---|---|
0 | Random Forest Classifier | 99.412916 | 99.363824 | (DecisionTreeClassifier(max_features='auto', r... |
1 | Stacking Ensembler | 99.510000 | 99.340000 | StackingClassifier(cv=10,\n ... |
2 | Extra Trees Classifier | 99.412916 | 99.314864 | (ExtraTreeClassifier(random_state=840703915), ... |
3 | Max Voting Ensembler | 99.410000 | 99.310000 | VotingClassifier(estimators=[('Random Forest C... |
4 | CatBoost Classifier | 99.412916 | 99.265904 | <catboost.core.CatBoostClassifier object at 0x... |
5 | XGBoost Classifier | 99.412916 | 99.241574 | XGBClassifier(base_score=0.5, booster='gbtree'... |
6 | LightGBM Classifier | 99.412916 | 99.143715 | LGBMClassifier() |