Classification with KNN using R

Gifa Delyani Nursyafitri
5 min read · Nov 4, 2018

“All models are wrong, but some models are useful.” (George E.P. Box)

Source: helloacm.com

Holaa, people! Today I want to share how to do classification with KNN. But first, let me tell you about KNN.

K Nearest Neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the early 1970s. In general, a larger K value tends to be more precise because it reduces the effect of noise, but there is no guarantee. Cross-validation is another way to determine a good K value retrospectively, by using an independent dataset to validate it. (Source: https://www.saedsayad.com/k_nearest_neighbors.htm)
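To make the idea concrete, here is a minimal sketch of KNN in base R: measure the distance from the new case to every stored case, take the K closest, and let them vote. The helper name knn_predict and the toy points are my own assumptions for illustration; the rest of this post uses knn() from the class package instead.

```r
# Minimal KNN classifier: Euclidean distance plus majority vote.
# knn_predict is a hypothetical helper name, not part of any package.
knn_predict <- function(train, labels, query, k = 3) {
  # Subtract the query from every training row, column by column
  diffs <- sweep(train, 2, query)
  # Euclidean distance from the query to each training row
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows
  nearest <- labels[order(dists)[seq_len(k)]]
  # Most frequent label among the neighbours
  names(which.max(table(nearest)))
}

# Made-up training points: three "low" cases and three "high" cases
train  <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
labels <- c("low", "low", "low", "high", "high", "high")

pred_a <- knn_predict(train, labels, query = c(2, 2), k = 3)
pred_b <- knn_predict(train, labels, query = c(7, 9), k = 3)
print(pred_a)
print(pred_b)
```

Nothing is "trained" here: all the work happens at prediction time, which is why KNN has to keep the whole dataset around.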

KNN can require a lot of memory or space to store all of the data, but it only performs a calculation (or learns) when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate. (Source: https://helloacm.com/a-short-introduction-to-k-nearest-neighbors-algorithm/)

The data that I used is from Kaggle (again). Just click the link below to get the data!

The case that I will solve with this data using classification is:
If the Job Satisfaction column uses 1 = Very Dissatisfied, 2 = Dissatisfied, 3 = Satisfied and 4 = Very Satisfied, then classify the Job Satisfaction of Joni, who has the following data: Age = 35, Attrition = No, Business Travel = Travel Rarely, Daily Rate = 1373, Department = Sales, Distance From Home = 8, Education = 2, Education Field = Medical, Employee Count = 1, Environment Satisfaction = 4, Gender = Male, Hourly Rate = 50, Job Involvement = 3, Job Level = 2, Job Role = Research Scientist, Marital Status = Single, Monthly Income = 3000 and Monthly Rate = 5000.

First of all, load the data. The data is in .csv format, so we must use the appropriate function to read CSV data into R. Below is a snapshot of what the original data looks like after loading it into a data frame.
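The loading step can be sketched with read.csv(). The tiny file written below is made up purely so the example runs on its own; it starts with a UTF-8 byte-order mark (BOM), which is the usual reason a first column shows up with a name like ï..Age instead of Age.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (BOM),
# imitating a file exported from a spreadsheet program (made-up values).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Age,JobSatisfaction\n35,3\n41,2\n")), tmp)

# fileEncoding = "UTF-8-BOM" tells R to skip the BOM, so the first
# column is named "Age" rather than something like "ï..Age".
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(demo))
str(demo)
```

For the real data you would call something like klasifikasi <- read.csv("path/to/your_file.csv", fileEncoding = "UTF-8-BOM"), where the path is a placeholder. The scripts in this post were written against a data frame loaded without that option, so they keep the ï..Age name.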

Don’t forget to load the packages that we need!

library(dplyr)
library(class)
library(dbscan)

We can see all the objects (columns) using the following script. The name of the data is up to you; here I use “klasifikasi”.

objects(klasifikasi)

The snapshot below is the output from objects. There are 33 objects, including: Attrition, Business Travel, Daily Rate, Department, Distance From Home, Education, Education Field, Employee Count, Employee Number, Environment Satisfaction, Gender, Hourly Rate, Age, Job Involvement, Job Level, Job Role, Job Satisfaction, Marital Status, Monthly Income, Monthly Rate, Number Companies Worked, Over Time, Percent Salary Hike, Performance Rating, Relationship Satisfaction, Stock Option Level, Training Times Last Year, Work Life Balance, Years At Company, Years In Current Role, Years Since Last Promotion, and Years With Curr Manager.

Next, we want to know whether there are any NA values in the data. We can use the script below.

row.has.na<-apply(klasifikasi,1,function(x){any(is.na(x))})
sum(row.has.na)

Yeay, we got zero!!

Based on the output above, we know that there is no NA in the data.
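The same check can be written with a few base R helpers. This is just an alternative sketch on a made-up data frame (the values are not from the Kaggle data), in case you also want to know which columns hold the missing values.

```r
# Toy data frame with two missing values (made-up numbers)
df <- data.frame(age = c(35, NA, 41), income = c(3000, 5000, NA))

has_na        <- anyNA(df)                 # TRUE if there is at least one NA
rows_with_na  <- sum(!complete.cases(df))  # number of rows containing an NA
na_per_column <- colSums(is.na(df))        # number of NAs in each column
clean         <- df[complete.cases(df), ]  # keep only the complete rows

print(rows_with_na)
print(na_per_column)
print(clean)
```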

Next, we have to make a new data frame from the variables that we need for the analysis. We can use the script below.

# ï..Age is the Age column; the odd prefix comes from a byte-order mark in the CSV header
klasifikasi %>%
  select(JobSatisfaction,
         ï..Age,
         Attrition,
         BusinessTravel,
         DailyRate,
         Department,
         DistanceFromHome,
         Education,
         EducationField,
         EmployeeCount,
         EnvironmentSatisfaction,
         Gender,
         HourlyRate,
         JobInvolvement,
         JobLevel,
         JobRole,
         MaritalStatus,
         MonthlyIncome,
         MonthlyRate) %>%
  data.frame() -> newclass

Yeay, the data frame is ready!

After that, I need to see the class of each variable in my dataset. But first, I have to remove any rows with NA from the new data frame. Just use the script below to get the output.

#remove the NA from new dataframe
filterkerja<-data.frame(newclass[!row.has.na,])
View(filterkerja)
sapply(filterkerja, class)

The snapshot below is the result. The classes consist of integer and factor.

Next, we must convert the rating-style variables that are stored as integers into factors.

JobSatisfication= factor(filterkerja$JobSatisfaction)
Education=factor(filterkerja$Education)
EnvironmentSatisfaction=factor(filterkerja$EnvironmentSatisfaction)
JobInvolvement=factor(filterkerja$JobInvolvement)
JobLevel=factor(filterkerja$JobLevel)
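If you prefer to convert several columns in one go, the same idea can be sketched with lapply(). The toy data frame and the ordinal_cols name are assumptions for illustration only.

```r
# Toy data frame with integer-coded ratings (made-up values)
df <- data.frame(Education = c(2L, 4L, 1L),
                 JobLevel  = c(1L, 2L, 2L),
                 Age       = c(35L, 41L, 29L))

# Names of the columns that should be treated as categories
ordinal_cols <- c("Education", "JobLevel")

# Convert only those columns to factors; Age stays an integer
df[ordinal_cols] <- lapply(df[ordinal_cols], factor)

col_classes <- sapply(df, class)
print(col_classes)
```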

To find the classification for Joni’s Job Satisfaction, we have to make a new dataframe with selected variables.

faktor<-data.frame(age=filterkerja$ï..Age,
att=filterkerja$Attrition,
travel=filterkerja$BusinessTravel,
daily=filterkerja$DailyRate,
depart=filterkerja$Department,
distance=filterkerja$DistanceFromHome,
edu=Education,
edufield=filterkerja$EducationField,
count=filterkerja$EmployeeCount,
envir=EnvironmentSatisfaction,
gender=filterkerja$Gender,
hour= filterkerja$HourlyRate,
jobin=JobInvolvement,
joblev=JobLevel,
jobrol= filterkerja$JobRole,
status=filterkerja$MaritalStatus,
monthlyin=filterkerja$MonthlyIncome,
monthlyrate=filterkerja$MonthlyRate)
# Bind the columns into one numeric matrix for knn(); factor columns become their integer level codes
faktor1<-cbind(faktor$age,faktor$att,faktor$travel,faktor$daily,
faktor$depart,faktor$distance,faktor$edu,
faktor$edufield,faktor$count,faktor$envir,
faktor$gender,faktor$hour,faktor$jobin,faktor$joblev,
faktor$jobrol,faktor$status,faktor$monthlyin,
faktor$monthlyrate)
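One thing to keep in mind: KNN compares cases by distance, so columns measured in thousands (such as a daily or monthly rate) can outweigh columns measured in single digits. The analysis in this post uses the raw values, but a common option is to standardise the columns first. Below is a minimal sketch with made-up numbers; the variable names are assumptions.

```r
# Made-up training matrix: age in years and income in currency units
train <- cbind(age = c(25, 35, 45, 55), income = c(2000, 3000, 9000, 10000))
query <- c(age = 35, income = 3000)

# Standardise each column to mean 0 and standard deviation 1
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training centre and scale to the new case as well,
# so that it lives in the same space as the training rows
query_scaled <- (query - centers) / scales

print(round(colMeans(train_scaled), 10))
print(query_scaled)
```

The important detail is that the new case is scaled with the centre and spread of the training data, not with its own values.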

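After that, to find the classification, we must first define the target, which is the class label of each training row. Joni’s categorical values are entered as the integer codes of the factor levels, as noted in the comments.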

target<-(JobSatisfication)
target
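# Integer codes for Joni's categorical values (factor level positions):
# Attrition: No = 1
# BusinessTravel: Travel Rarely = 3
# Department: Sales = 3
# EducationField: Medical = 4
# Gender: Male = 2
# JobRole: Research Scientist = 7
# MaritalStatus: Single = 3
#### JONI ####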
joni<-cbind(35,1,3,1373,3,8,2,4,
1,4,2,50,3,2,7,3,3000,5000)
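The integer codes used for Joni are the positions of each value among the sorted levels of its factor. Here is a small sketch of how to look them up instead of counting by hand; the toy factor and the lookup name are assumptions, not taken from the real data.

```r
# Toy factor standing in for a column such as BusinessTravel (made-up values)
travel <- factor(c("Travel_Rarely", "Non-Travel", "Travel_Frequently", "Travel_Rarely"))

# Levels are stored in sorted order; a value's code is its position here
print(levels(travel))

# The integer codes behind the factor
codes <- as.integer(travel)

# cbind() drops the labels and keeps exactly these codes
m <- cbind(travel, n = 1:4)

# Named lookup table: level -> integer code
lookup <- setNames(seq_along(levels(travel)), levels(travel))
print(lookup)
rarely_code <- lookup[["Travel_Rarely"]]
```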

Almost finished! One more step, yeay!

The last step is to find the best K to classify Joni’s Job Satisfaction. I compared K = 1, K = 4, K = 11, K = 17 and K = 100.

hasil1<-knn(faktor1,joni,target,k=1,prob = TRUE);hasil1
hasil2<-knn(faktor1,joni,target,k=4,prob = TRUE);hasil2
hasil3<-knn(faktor1,joni,target,k=11,prob = TRUE);hasil3
hasil4<-knn(faktor1,joni,target,k=17,prob = TRUE);hasil4
hasil5<-knn(faktor1,joni,target,k=100,prob = TRUE);hasil5

Based on the results, the K with the highest probability is 13, so I decided to use that K.
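The prob value returned by knn() is the share of the K neighbours that voted for the winning class for this one case. As mentioned at the start, cross-validation is another way to judge K. Below is a hedged sketch using knn.cv() from the class package, which does leave-one-out cross-validation. It runs on the built-in iris data as a stand-in, so the numbers it prints say nothing about Joni's data.

```r
library(class)

# iris ships with R; it only stands in for the article's data here
x <- scale(iris[, 1:4])
y <- iris$Species

set.seed(1)  # knn breaks ties at random, so fix the seed
ks <- c(1, 4, 11, 17, 29)

# Leave-one-out cross-validated accuracy for each candidate k
acc <- sapply(ks, function(k) mean(knn.cv(x, y, k = k) == y))
names(acc) <- ks
print(acc)
best_k <- ks[which.max(acc)]

# prob = TRUE attaches the share of winning votes among the neighbours
new_flower <- x[1, , drop = FALSE]
pred <- knn(x, new_flower, y, k = best_k, prob = TRUE)
print(pred)
print(attr(pred, "prob"))
```

Accuracy over many held-out cases and the vote share for a single case answer different questions, so it can be worth looking at both before settling on a K.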

The conclusion is that Joni is classified with Job Satisfaction = 3, which means Joni is Satisfied with his job.

Yeah, it’s the miracle of data. We just need other data to find the classification of someone or something. I wish next time I could find the classification of the “meant to be” based on the nearest neighbor.

Thank you for reading. Hope you enjoy.

Feel free to correct me!
