After binarizing a column of my data-set using sklearn the result is not correct. where is the code wrong?
After binarizing a column of my data-set using sklearn the result is not correct. where is the code wrong?
I pre-processing a data-set. I binarized one of the columns. after binarization I think the values are incorrect. the data has 303 observations(rows) and 14 features(columns).and the column i am binarizing is the last column.
here is a part of my code-
import pandas as pd
import numpy as np
#importing the dataset
header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_data.csv', names= header_names)
array = dataset.values
# binarize num
from sklearn.preprocessing import Binarizer
x = array[:,13:]
binarize = Binarizer(threshold=0.0).fit(x)
transform_binarize = binarize.transform(x)
array[:,13:]=transform_binarize
print(transform_binarize)
here is what the original data column look like-
0,2,1,0,0.........1,0,3,1,1,2
and here is the output of the above code-
[[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]]
I think the last ones are incorrect. I dont understand why is that.
1 Answer
1
If I am correct in assuming that this is the heart disease dataset taken from this UCI repository and the csv file is this one, then in that case these are correct values for binarizer. The original data column that you are using has a 0
in the final row, I think you missed that, try this code
0
for idx in range(0,len(x)):
print idx,x[idx],transform_binarize[idx]
Output
278 [1L] [1.]
279 [0L] [0.]
280 [2L] [1.]
281 [0L] [0.]
282 [3L] [1.]
283 [0L] [0.]
284 [2L] [1.]
285 [4L] [1.]
286 [2L] [1.]
287 [0L] [0.]
288 [0L] [0.]
289 [0L] [0.]
290 [1L] [1.]
291 [0L] [0.]
292 [2L] [1.]
293 [2L] [1.]
294 [1L] [1.]
295 [0L] [0.]
296 [3L] [1.]
297 [1L] [1.]
298 [1L] [1.]
299 [2L] [1.]
300 [3L] [1.]
301 [1L] [1.]
302 [0L] [0.] #<--- I think you missed this row while reading your dataset
If you try this code then you will that the binarizer is working exactly as it should be.
This answer is based on specific knowledge of the dataset, and does not indicate where in OP's code there is a specific problem -- which is the main request of the question ("Where is the code wrong?"). The application of this answer is too limited to be useful to future readers.
– Savage Henry
Jul 3 at 13:30
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Please add a reproducible example. See MCVE
– Vivek Kumar
Jul 3 at 7:02