!pip install opendatasets
!pip install pandas
Collecting opendatasets
Obtaining dependency information for opendatasets from https://files.pythonhosted.org/packages/00/e7/12300c2f886b846375c78a4f32c0ae1cd20bdcf305b5ac45b8d7eceda3ec/opendatasets-0.1.22-py3-none-any.whl.metadata
Using cached opendatasets-0.1.22-py3-none-any.whl.metadata (9.2 kB)
Requirement already satisfied: tqdm in /Users/aiden/anaconda3/lib/python3.11/site-packages (from opendatasets) (4.65.0)
Collecting kaggle (from opendatasets)
Downloading kaggle-1.6.8.tar.gz (84 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.6/84.6 kB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: click in /Users/aiden/anaconda3/lib/python3.11/site-packages (from opendatasets) (8.1.7)
Requirement already satisfied: six>=1.10 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from kaggle->opendatasets) (1.16.0)
Requirement already satisfied: certifi>=2023.7.22 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from kaggle->opendatasets) (2023.7.22)
Requirement already satisfied: python-dateutil in /Users/aiden/anaconda3/lib/python3.11/site-packages (from kaggle->opendatasets) (2.8.2)
Requirement already satisfied: requests in /Users/aiden/anaconda3/lib/python3.11/site-packages (from kaggle->opendatasets) (2.31.0)
Requirement already satisfied: python-slugify in /Users/aiden/anaconda3/lib/python3.11/site-packages (from kaggle->opendatasets) (5.0.2)
Requirement already satisfied: urllib3 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from kaggle->opendatasets) (1.26.16)
Requirement already satisfied: bleach in /Users/aiden/anaconda3/lib/python3.11/site-packages (from kaggle->opendatasets) (4.1.0)
Requirement already satisfied: packaging in /Users/aiden/anaconda3/lib/python3.11/site-packages (from bleach->kaggle->opendatasets) (23.0)
Requirement already satisfied: webencodings in /Users/aiden/anaconda3/lib/python3.11/site-packages (from bleach->kaggle->opendatasets) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from python-slugify->kaggle->opendatasets) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from requests->kaggle->opendatasets) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from requests->kaggle->opendatasets) (3.4)
Using cached opendatasets-0.1.22-py3-none-any.whl (15 kB)
Building wheels for collected packages: kaggle
Building wheel for kaggle (setup.py) ... done
Created wheel for kaggle: filename=kaggle-1.6.8-py3-none-any.whl size=111967 sha256=77d0a78e41a2adda36bb180fcf9d0769417ccfb65fe3b26e78bdefa4bbdd6e64
Stored in directory: /Users/aiden/Library/Caches/pip/wheels/8c/fe/8c/71a8dd0e02634fd0e4ba4abaaf2d4a6049cccff349625331e1
Successfully built kaggle
Installing collected packages: kaggle, opendatasets
Successfully installed kaggle-1.6.8 opendatasets-0.1.22
Requirement already satisfied: pandas in /Users/aiden/anaconda3/lib/python3.11/site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from pandas) (2022.7)
Requirement already satisfied: numpy>=1.21.0 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from pandas) (1.24.3)
Requirement already satisfied: six>=1.5 in /Users/aiden/anaconda3/lib/python3.11/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
import opendatasets as od
import pandas
{"username":"aidenk1","key":"ec3c0b17264ad89d1b65e98ef1b1588f"}
od.download('https://www.kaggle.com/datasets/rkiattisak/sports-car-prices-dataset')
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:
Your Kaggle Key:
Downloading sports-car-prices-dataset.zip to ./sports-car-prices-dataset
100%|██████████| 8.36k/8.36k [00:00<00:00, 5.42MB/s]
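The cell above echoes the contents of kaggle.json into the notebook, which leaks the API key. A minimal sketch of a safer pattern, under the assumption that the standard Kaggle credential mechanisms apply (opendatasets looks for a kaggle.json file next to the notebook, and the underlying kaggle client reads the KAGGLE_USERNAME / KAGGLE_KEY environment variables); the values below are placeholders, not real credentials:

import os
import opendatasets as od

# Placeholders only; substitute real credentials outside version control.
os.environ['KAGGLE_USERNAME'] = 'your-username'
os.environ['KAGGLE_KEY'] = 'your-api-key'

od.download('https://www.kaggle.com/datasets/rkiattisak/sports-car-prices-dataset')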
import pandas as pd
df = pd.read_csv('/Users/aiden/Downloads/Sport car price.csv', encoding='cp1252', sep=',')
df
|      | Car Make    | Car Model | Year | Engine Size (L) | Horsepower | Torque (lb-ft) | 0-60 MPH Time (seconds) | Price (in USD) |
|------|-------------|-----------|------|-----------------|------------|----------------|-------------------------|----------------|
| 0    | Porsche     | 911       | 2022 | 3               | 379        | 331            | 4                       | 101,200        |
| 1    | Lamborghini | Huracan   | 2021 | 5.2             | 630        | 443            | 2.8                     | 274,390        |
| 2    | Ferrari     | 488 GTB   | 2022 | 3.9             | 661        | 561            | 3                       | 333,750        |
| 3    | Audi        | R8        | 2022 | 5.2             | 562        | 406            | 3.2                     | 142,700        |
| 4    | McLaren     | 720S      | 2021 | 4               | 710        | 568            | 2.7                     | 298,000        |
| ...  | ...         | ...       | ...  | ...             | ...        | ...            | ...                     | ...            |
| 1002 | Koenigsegg  | Jesko     | 2022 | 5               | 1280       | 1106           | 2.5                     | 3,000,000      |
| 1003 | Lotus       | Evija     | 2021 | Electric Motor  | 1972       | 1254           | 2                       | 2,000,000      |
| 1004 | McLaren     | Senna     | 2021 | 4               | 789        | 590            | 2.7                     | 1,000,000      |
| 1005 | Pagani      | Huayra    | 2021 | 6               | 764        | 738            | 3                       | 2,600,000      |
| 1006 | Rimac       | Nevera    | 2021 | Electric Motor  | 1888       | 1696           | 1.85                    | 2,400,000      |

1007 rows × 8 columns
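Note that several of these columns arrive as strings rather than numbers: prices carry thousands separators ("101,200") and 'Engine Size (L)' holds the text "Electric Motor" for the EVs. A quick sanity check (a sketch using the df just loaded) makes this visible before any modeling:

# Columns that look numeric may be parsed as strings; dtypes reveal which.
print(df.dtypes)
print(df['Engine Size (L)'].unique()[:10])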
import pandas as pd
# Load the dataset
df = pd.read_csv("/Users/aiden/Downloads/Sport car price.csv")
# Select relevant columns
selected_columns = ['Car Make', 'Car Model', 'Year', 'Engine Size (L)', 'Horsepower', 'Torque (lb-ft)', '0-60 MPH Time (seconds)', 'Price (in USD)']
# Create a new DataFrame with selected columns
df_new = df[selected_columns]
print(df_new)
Car Make Car Model Year Engine Size (L) Horsepower Torque (lb-ft) \
0 Porsche 911 2022 3 379 331
1 Lamborghini Huracan 2021 5.2 630 443
2 Ferrari 488 GTB 2022 3.9 661 561
3 Audi R8 2022 5.2 562 406
4 McLaren 720S 2021 4 710 568
... ... ... ... ... ... ...
1002 Koenigsegg Jesko 2022 5 1280 1106
1003 Lotus Evija 2021 Electric Motor 1972 1254
1004 McLaren Senna 2021 4 789 590
1005 Pagani Huayra 2021 6 764 738
1006 Rimac Nevera 2021 Electric Motor 1888 1696
0-60 MPH Time (seconds) Price (in USD)
0 4 101,200
1 2.8 274,390
2 3 333,750
3 3.2 142,700
4 2.7 298,000
... ... ...
1002 2.5 3,000,000
1003 2 2,000,000
1004 2.7 1,000,000
1005 3 2,600,000
1006 1.85 2,400,000
[1007 rows x 8 columns]
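Before fitting anything, those string columns need a cleaning pass. One possible version, sketched here (the column list mirrors selected_columns above): strip the thousands separators from the price and coerce the numeric-looking columns, leaving NaN where a cell is text such as "Electric Motor".

import pandas as pd

df = pd.read_csv("/Users/aiden/Downloads/Sport car price.csv")

# Strip thousands separators so the target becomes a float column.
df['Price (in USD)'] = df['Price (in USD)'].str.replace(',', '').astype(float)

# Coerce the numeric-looking features; text cells such as "Electric Motor"
# become NaN instead of raising.
numeric_cols = ['Year', 'Engine Size (L)', 'Horsepower',
                'Torque (lb-ft)', '0-60 MPH Time (seconds)']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(df.dtypes)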
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
class SportsCarPriceModel:
    """This class encompasses the sports car price model to predict the price of a sports car based on various factors."""
    _instance = None

    def __init__(self):
        self.cars_data = None
        self.model = None
        self.features = ['Car Make', 'Car Model', 'Year', 'Engine Size (L)', 'Horsepower', 'Torque (lb-ft)', '0-60 MPH Time (seconds)']
        self.target = 'Price (in USD)'
        self.encoder = OneHotEncoder(handle_unknown='ignore')

    def _load_data(self, file_path):
        self.cars_data = pd.read_csv(file_path)

    def _clean(self):
        # For demonstration purposes, let's assume we're dropping NaN values
        self.cars_data.dropna(inplace=True)

    def _train(self):
        X = self.cars_data[self.features]
        y = self.cars_data[self.target]
        self.model = LinearRegression()
        self.model.fit(X, y)

    @classmethod
    def get_instance(cls, file_path):
        if cls._instance is None:
            cls._instance = cls()
            cls._instance._load_data(file_path)
            cls._instance._clean()
            cls._instance._train()
        return cls._instance

    def predict_price(self, car_info):
        # Clean and prepare car data for prediction
        car_df = pd.DataFrame(car_info, index=[0])
        # Ensure the features are in the same order as in the training data
        car_df = car_df.reindex(columns=self.features, fill_value=0)
        # Predict the car price
        price = self.model.predict(car_df)
        return price[0]

def testCarPricePrediction():
    print("Step 1: Define car data for prediction:")
    car_info = {
        'Car Make': ['Ferrari'],
        'Car Model': ['488 GTB'],
        'Year': [2020],
        'Engine Size (L)': [3.9],
        'Horsepower': [660],
        'Torque (lb-ft)': [560],
        '0-60 MPH Time (seconds)': [3.0]
    }
    print("\t", car_info)
    print()

    car_model = SportsCarPriceModel.get_instance("/Users/aiden/Downloads/Sport car price.csv")
    print("Step 2:", car_model.get_instance.__doc__)
    print("Step 3:", car_model.predict_price.__doc__)

    predicted_price = car_model.predict_price(car_info)
    print('\t Predicted car price: ${:.2f}'.format(predicted_price))
    print()

if __name__ == "__main__":
    print("Begin:", testCarPricePrediction.__doc__)
    testCarPricePrediction()
Begin: None
Step 1: Define car data for prediction:
{'Brand': ['Ferrari'], 'Year': [2020], 'Engine': [5.2], 'Transmission': ['Automatic'], 'Horsepower': [650]}
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[46], line 91
89 if __name__ == "__main__":
90 print("Begin:", testCarPricePrediction.__doc__)
---> 91 testCarPricePrediction()
Cell In[46], line 81, in testCarPricePrediction()
78 print("\t", car_info)
79 print()
---> 81 car_model = SportsCarPriceModel.get_instance("/Users/aiden/Downloads/Sport car price.csv")
82 print("Step 2:", car_model.get_instance.__doc__)
84 print("Step 3:", car_model.predict_price.__doc__)
Cell In[46], line 41, in SportsCarPriceModel.get_instance(cls, file_path)
39 cls._instance._load_data(file_path)
40 cls._instance._clean()
---> 41 cls._instance._train()
42 return cls._instance
Cell In[46], line 29, in SportsCarPriceModel._train(self)
28 def _train(self):
---> 29 X = self.cars_data[self.features]
30 y = self.cars_data[self.target]
32 self.model = LinearRegression()
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3813, in DataFrame.__getitem__(self, key)
3811 if is_iterator(key):
3812 key = list(key)
-> 3813 indexer = self.columns._get_indexer_strict(key, "columns")[1]
3815 # take() does not accept boolean indexers
3816 if getattr(indexer, "dtype", None) == bool:
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
6067 else:
6068 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
6072 keyarr = self.take(indexer)
6073 if isinstance(key, Index):
6074 # GH 42790 - Preserve name from an Index
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6133, in Index._raise_if_missing(self, key, indexer, axis_name)
6130 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6133 raise KeyError(f"{not_found} not in index")
KeyError: "['Brand', 'Engine', 'Transmission'] not in index"
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
# Load the CSV file into a DataFrame
cars_data = pd.read_csv("/Users/aiden/Downloads/Sport car price.csv")
# Remove commas from 'Price (in USD)' column and convert it to numeric
cars_data['Price (in USD)'] = cars_data['Price (in USD)'].str.replace(',', '').astype(float)
# Handle non-numeric values in other columns (e.g., 'Electric Motor')
# For simplicity, we'll drop rows containing non-numeric values.
cars_data = cars_data.apply(pd.to_numeric, errors='coerce').dropna()
# Build distinct data frames on 'Price (in USD)' column
X = cars_data.drop('Price (in USD)', axis=1) # all except 'Price (in USD)'
y = cars_data['Price (in USD)'] # only 'Price (in USD)'
# Specify a minimum number of samples for the training set
min_train_samples = 100 # Adjust this value as needed
# Split data into train and test sets with a minimum number of samples for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,
                                                    shuffle=True, stratify=None)

# Ensure that the training set has at least the minimum number of samples
while len(X_train) < min_train_samples:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,
                                                        shuffle=True, stratify=None)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Test the model
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean Squared Error:', mse)
print('R^2 Score:', r2)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[43], line 24
21 min_train_samples = 100 # Adjust this value as needed
23 # Split data into train and test sets with a minimum number of samples for training
---> 24 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,
25 shuffle=True, stratify=None)
27 # Ensure that the training set has at least the minimum number of samples
28 while len(X_train) < min_train_samples:
File ~/anaconda3/lib/python3.11/site-packages/sklearn/utils/_param_validation.py:211, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
205 try:
206 with config_context(
207 skip_parameter_validation=(
208 prefer_skip_nested_validation or global_skip_validation
209 )
210 ):
--> 211 return func(*args, **kwargs)
212 except InvalidParameterError as e:
213 # When the function is just a wrapper around an estimator, we allow
214 # the function to delegate validation to the estimator, but we replace
215 # the name of the estimator by the name of the function in the error
216 # message to avoid confusion.
217 msg = re.sub(
218 r"parameter of \w+ must be",
219 f"parameter of {func.__qualname__} must be",
220 str(e),
221 )
File ~/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_split.py:2617, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2614 arrays = indexable(*arrays)
2616 n_samples = _num_samples(arrays[0])
-> 2617 n_train, n_test = _validate_shuffle_split(
2618 n_samples, test_size, train_size, default_test_size=0.25
2619 )
2621 if shuffle is False:
2622 if stratify is not None:
File ~/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_split.py:2273, in _validate_shuffle_split(n_samples, test_size, train_size, default_test_size)
2270 n_train, n_test = int(n_train), int(n_test)
2272 if n_train == 0:
-> 2273 raise ValueError(
2274 "With n_samples={}, test_size={} and train_size={}, the "
2275 "resulting train set will be empty. Adjust any of the "
2276 "aforementioned parameters.".format(n_samples, test_size, train_size)
2277 )
2279 return n_train, n_test
ValueError: With n_samples=0, test_size=0.3 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
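The ValueError is a knock-on effect of the cleaning line above: apply(pd.to_numeric, errors='coerce') runs over every column, so 'Car Make' and 'Car Model' become NaN in all 1007 rows and dropna() empties the frame, leaving n_samples=0. (Separately, the re-split loop reuses random_state=42, so it could never produce a different split even if it did run.) A sketch of a fix that coerces only the intended numeric columns:

import pandas as pd

cars_data = pd.read_csv("/Users/aiden/Downloads/Sport car price.csv")
cars_data['Price (in USD)'] = (cars_data['Price (in USD)']
                               .str.replace(',', '').astype(float))

# Coerce only the columns that are meant to be numeric; the string
# identifiers stay untouched, so dropna() no longer deletes every row.
numeric_cols = ['Year', 'Engine Size (L)', 'Horsepower',
                'Torque (lb-ft)', '0-60 MPH Time (seconds)']
cars_data[numeric_cols] = cars_data[numeric_cols].apply(pd.to_numeric,
                                                        errors='coerce')
cars_data = cars_data.dropna(subset=numeric_cols + ['Price (in USD)'])

X = cars_data[numeric_cols]
y = cars_data['Price (in USD)']
print(len(X), "usable rows")  # no longer zero, so train_test_split succeeds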
# Importing required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
class StrokeModel:
    """This class encompasses the stroke prediction model based on various patient factors."""
    _instance = None

    def __init__(self, csv_file):
        self.stroke_data = pd.read_csv(csv_file)
        self.model = None
        self.features = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi']
        self.categorical_features = ['gender', 'ever_married', 'work_type', 'Residence_type']
        self.target = 'stroke'
        self.encoder = OneHotEncoder(drop='first')

    def _clean(self):
        # Fill missing values if any
        self.stroke_data.fillna(method='ffill', inplace=True)
        # One-hot encode categorical features
        encoded_features = pd.get_dummies(self.stroke_data[self.categorical_features], drop_first=True)
        self.stroke_data = pd.concat([self.stroke_data, encoded_features], axis=1)
        self.features.extend(encoded_features.columns)

    def _train(self):
        X = self.stroke_data[self.features]
        y = self.stroke_data[self.target]

        # Set minimum number of samples for training
        min_train_samples = 100

        # Split data into train and test sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,
                                                            shuffle=True, stratify=None)

        # With a fixed random_state the split is deterministic, so re-splitting
        # in a loop could never change the result; check once and fail fast.
        if len(X_train) < min_train_samples:
            raise ValueError(f"Training set has only {len(X_train)} samples; "
                             f"at least {min_train_samples} are required.")

        # Creating and training the logistic regression model
        self.model = LogisticRegression()
        self.model.fit(X_train, y_train)

        # Making predictions on the test set
        y_pred = self.model.predict(X_test)

        # Calculating accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print("Model accuracy:", accuracy)

    @classmethod
    def get_instance(cls, csv_file):
        if cls._instance is None:
            cls._instance = cls(csv_file)
            cls._instance._clean()
            cls._instance._train()
        return cls._instance

    def predict_stroke(self, patient):
        # Clean and prepare patient data for prediction
        patient_df = pd.DataFrame(patient, index=[0])
        # One-hot encode categorical features
        encoded_features = pd.get_dummies(patient_df[self.categorical_features], drop_first=True)
        patient_df = pd.concat([patient_df, encoded_features], axis=1)
        # Ensure the features are in the same order as in the training data
        patient_df = patient_df.reindex(columns=self.features, fill_value=0)
        # Predicting stroke probability
        stroke_probability = self.model.predict_proba(patient_df)
        return stroke_probability[:, 1][0]

def testStroke():
    print("Step 1: Define patient data for prediction:")
    patient_info = {
        'age': [67],
        'gender': ['Male'],
        'hypertension': [0],
        'heart_disease': [1],
        'ever_married': ['Yes'],
        'work_type': ['Private'],
        'Residence_type': ['Urban'],
        'avg_glucose_level': [228.69],
        'bmi': [36.6]
    }
    print("\t", patient_info)
    print()

    strokeModel = StrokeModel.get_instance("/Users/aiden/Downloads/healthcare-dataset-stroke-data 2.csv")
    print("Step 2:", StrokeModel.get_instance.__doc__)
    print("Step 3:", StrokeModel.predict_stroke.__doc__)

    stroke_probability = strokeModel.predict_stroke(patient_info)
    print('\t Predicted stroke probability:', stroke_probability)
    print()

if __name__ == "__main__":
    print("Begin:", testStroke.__doc__)
    testStroke()
Begin: None
Step 1: Define patient data for prediction:
{'age': [67], 'gender': ['Male'], 'hypertension': [0], 'heart_disease': [1], 'ever_married': ['Yes'], 'work_type': ['Private'], 'Residence_type': ['Urban'], 'avg_glucose_level': [228.69], 'bmi': [36.6]}
Model accuracy: 0.9419439008480104
Step 2: None
Step 3: None
Predicted stroke probability: 0.25112228247029594
/Users/aiden/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
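The model trains and predicts, but the ConvergenceWarning means lbfgs hit its iteration cap on unscaled features such as avg_glucose_level. A sketch of the fix the message itself suggests: scale the inputs and raise max_iter. A Pipeline exposes the same fit / predict / predict_proba interface, so it can replace the bare LogisticRegression() in _train() directly.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Drop-in replacement for `self.model = LogisticRegression()` in _train():
# StandardScaler centers and scales each feature, and the larger iteration
# budget gives lbfgs room to converge cleanly.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))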