R cluster analysis case study

The mclust package contains a diabetes data set (after loading mclust, run head(diabetes) to view the first few rows of the data, and ?diabetes to see the meaning of each variable). The data set contains measurements of three indicators for 145 patients. For this data set, carry out the following analyses:
(1) Using only the three indicator variables, cluster the data with k-means, find an appropriate number of clusters, and evaluate the clustering quality.
###################### Question 5: cluster analysis ############################################
library(mclust)
library(MASS)
data(diabetes)   # load the data set shipped with mclust
head(diabetes)   # show the first few rows
?diabetes        # open the help page describing each variable
The result is as follows:
> head(diabetes)   # show the first few rows
   class glucose insulin sspg
1 Normal      80     356  124
2 Normal      97     289  117
3 Normal     105     319  143
4 Normal      90     356  199
5 Normal      90     323  240
6 Normal      86     381  157
Clustering:
km <- kmeans(diabetes[, 2:4], 3, nstart = 1)   # k-means on the three measurements with k = 3
km
diabetes$cluster <- km$cluster                 # store each observation's cluster label
diabetes
The clustering results are as follows:
> km
K-means clustering with 3 clusters of sizes 33, 86, 26

Cluster means:
    glucose   insulin      sspg
1 107.42424  531.5455 323.66667
2  91.39535  359.2791 166.72093
3 241.65385 1152.8846  75.69231

Clustering vector:
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
 31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  2   2   2   2   2   2   2   2   1   2   1   2   2   2   2   1   2   2   2   2   2   1   2   2   1   1   1   1   1   1
 91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
  1   1   1   1   1   2   1   1   1   1   1   1   2   2   2   1   1   1   1   2   1   2   3   3   1   3   3   3   3   3
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
  3   3   3   1   3   3   3   3   3   3   1   3   3   1   1   1   1   3   3   3   3   3   3   3   3

Within cluster sum of squares by cluster:
[1] 1058399.6  592025.2 1738796.1
 (between_SS / total_SS =  80.5 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"
[8] "iter"         "ifault"

> diabetes
   class glucose insulin sspg cluster
1 Normal      80     356  124       2
2 Normal      97     289  117       2
3 Normal     105     319  143       2
4 Normal      90     356  199       2
5 Normal      90     323  240       2
From the output, the within-cluster sums of squares are 1058399.6, 592025.2 and 1738796.1, and between_SS / total_SS = 80.5 %. The three clusters therefore account for about 80.5% of the total sum of squares, which indicates a good clustering effect. The clusters are also broadly consistent with the original class labels.
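Question (1) also asks for an appropriate number of clusters. One common heuristic, which is not part of the original analysis, is the elbow method: run k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below assumes the mclust package is installed; the candidate range 1:8, nstart = 25 and the seed are illustrative choices, not values from the text. It also cross-tabulates a 3-cluster solution against the class column so that the agreement with the original labels can be inspected directly.

```r
library(mclust)   # provides the diabetes data set
data(diabetes)
X <- diabetes[, c("glucose", "insulin", "sspg")]

set.seed(1)       # arbitrary seed, only for reproducibility (assumption)
ks <- 1:8         # candidate numbers of clusters (assumed range)
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

# Elbow plot: look for the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Cross-tabulate a 3-cluster solution against the known class labels
km3 <- kmeans(X, centers = 3, nstart = 25)
print(table(class = diabetes$class, cluster = km3$cluster))
```

Using nstart greater than 1 makes the result less sensitive to the random initial centers than the single start used above.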

(2) When using hierarchical clustering, what are the similarities and differences between the clustering results obtained with different distances?

Basic idea: cluster analysis is a statistical method for classifying the objects that the data describe. It groups individuals (samples) or objects (variables) according to their degree of similarity (distance), so that elements in the same class are more similar to each other than to elements in other classes. The aim is to maximize the homogeneity of elements within a class and the heterogeneity of elements between classes.
Q-type cluster analysis clusters samples; R-type cluster analysis clusters variables.
Basic methods:
① Hierarchical (systematic) clustering: the shortest distance method (single linkage), the longest distance method (complete linkage), the intermediate distance method (median linkage), the class average method (average linkage), the sum of squared deviations method (Ward's method), etc.
Agglomerative: at the start, each of the n samples is its own class, and the distances between samples and between classes are defined. The two nearest classes are merged into a new class, and the distances between the new class and the other classes are computed. Merging the two closest classes is repeated, reducing the number of classes by one each time, until all samples are in a single class.
Divisive: starting from a single class containing all n samples, the class is split into two subclasses that are as far apart as possible according to some optimality criterion. Each subclass is then split in two by the same criterion, and the best of these splits is kept, so the number of classes goes from two to three. This continues until each of the n samples forms its own class or a stopping rule applies.
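The agglomerative procedure described above can be traced on a tiny example. The sketch below uses five made-up one-dimensional samples (not taken from the diabetes data) and base R's hclust() to show how the shortest distance and longest distance methods merge the same points at different heights.

```r
x <- c(1, 2, 4, 8, 16)   # five toy one-dimensional samples (illustrative)
d <- dist(x)             # Euclidean distances between samples

hc_single   <- hclust(d, method = "single")    # shortest distance method
hc_complete <- hclust(d, method = "complete")  # longest distance method

# Distance at which each successive merge happens (n - 1 = 4 merges)
print(hc_single$height)    # 1 2 4 8
print(hc_complete$height)  # 1 3 7 15

# Cutting either tree into 2 classes isolates the outlying sample 16
print(cutree(hc_single, k = 2))
print(cutree(hc_complete, k = 2))
```

Both trees merge the samples in the same order here, but the longest distance method reports larger merge heights because it measures the distance between two classes by their farthest pair of members.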
d <- dist(scale(diabetes[, 2:4]))   # distance matrix on the standardized measurements
# Compare different linkage methods
method <- c("complete", "average", "centroid", "ward.D")
hc <- hclust(d, "complete"); class1 <- cutree(hc, k = 5)   # cut the tree into 5 classes
hc <- hclust(d, "average");  class2 <- cutree(hc, k = 5)   # cut the tree into 5 classes
hc <- hclust(d, "centroid"); class3 <- cutree(hc, k = 5)   # cut the tree into 5 classes
hc <- hclust(d, "ward.D");   class4 <- cutree(hc, k = 5)   # cut the tree into 5 classes
alldata <- data.frame(diabetes, class1, class2, class3, class4)   # merge into one data frame
alldata
Clustering results:
> alldata
   class glucose insulin sspg cluster class1 class2 class3 class4
1 Normal      80     356  124       2      1      1      1      1
2 Normal      97     289  117       2      1      1      1      1
3 Normal     105     319  143       2      1      1      1      1
4 Normal      90     356  199       2      1      1      1      2
5 Normal      90     323  240       2      1      1      1      2
6 Normal      86     381  157       2      1      1      1      1
7 Normal     100     350  221       2      1      1      1      2
With hierarchical clustering, different distance (linkage) methods give different clustering results, but the differences are not particularly large. Observations at the two extremes are classified well, whereas observations with intermediate values, lying between categories, are classified poorly.
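The comparison above varies the linkage method while keeping Euclidean distance between samples. To also vary the distance itself, one possible approach is sketched below: it builds trees from three metrics supported by dist(), cuts each into 3 classes, and measures pairwise agreement with adjustedRandIndex() from mclust. The choice of average linkage, k = 3 and these three metrics are assumptions made for illustration, not part of the original analysis.

```r
library(mclust)   # diabetes data and adjustedRandIndex()
data(diabetes)
X <- scale(diabetes[, c("glucose", "insulin", "sspg")])

metrics <- c("euclidean", "manhattan", "maximum")   # metrics supported by dist()
labels <- sapply(metrics, function(m) {
  hc <- hclust(dist(X, method = m), method = "average")
  cutree(hc, k = 3)                                 # one column of labels per metric
})

# Pairwise agreement between the partitions (1 means identical partitions)
ari <- sapply(metrics, function(a) sapply(metrics, function(b)
  adjustedRandIndex(labels[, a], labels[, b])))
print(round(ari, 2))
```

Off-diagonal values close to 1 would indicate that the choice of distance changes little; lower values point to samples that move between classes.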

(3) Is the distribution of this data set suitable for density clustering? If density clustering is used, will the clustering effect be better than with the two methods above?

Density clustering, also called density-based clustering, starts from the assumption that the cluster structure can be determined by how densely the samples are distributed. Its main goal is to find high-density regions separated by low-density regions. It has the following advantages:
① Compared with K-means, DBSCAN does not require the number of clusters to be declared in advance. The number of clusters is determined dynamically by the neighborhood radius (eps) and the MinPts parameter, so it can better reflect the original characteristics of how the data are distributed in clusters. However, different parameter values can give different clustering results.
② DBSCAN can find clusters of arbitrary shape, so it is suitable for data sets with irregular distributions.
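Because the result depends on eps and MinPts, it can help to look at the data before fixing them. A common heuristic, not used in the original analysis, is to plot each point's distance to its k-th nearest neighbor in sorted order and look for a bend in the curve. The sketch below does this in base R on the unscaled measurements; k = 5 and the reference line at 50 are assumptions chosen to mirror the MinPts = 5 and eps = 50 used in this section's DBSCAN call (whether the point itself counts toward MinPts depends on the implementation).

```r
library(mclust)   # provides the diabetes data set
data(diabetes)
X <- as.matrix(diabetes[, c("glucose", "insulin", "sspg")])

k <- 5                      # number of neighbors, mirroring MinPts = 5 (assumption)
D <- as.matrix(dist(X))     # full matrix of pairwise Euclidean distances
# Sorting a row puts the point itself (distance 0) first,
# so element k + 1 is the distance to the k-th nearest neighbor
knn_dist <- apply(D, 1, function(r) sort(r)[k + 1])

plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by k-NN distance",
     ylab = paste0(k, "-NN distance"))
abline(h = 50, lty = 2)     # reference line at eps = 50

# Share of points whose k-th nearest neighbor lies within eps = 50
print(mean(knn_dist <= 50))
```

Points far above the reference line sit in sparse regions and are the ones most likely to be labeled as noise at that eps.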
library(fpc)
model <- dbscan(diabetes[, 2:4], eps = 50, MinPts = 5)
diabetes$dbscan <- model$cluster
## Clustering results for different values of eps
eps <- c(40, 50, 60, 70)
name <- c("one", "two", "three", "four")
dbdata <- diabetes[, 2:4]
for (ii in 1:length(eps)) {
  modeli <- dbscan(diabetes[, 2:4], eps = eps[ii], MinPts = 5)
  dbdata[[name[ii]]] <- as.factor(modeli$cluster)   # one column of labels per eps value
}
head(dbdata)
The clustering results are as follows:
> head(dbdata)
    glucose insulin sspg one two three four
1        80     356  124   1   1     1    1
2        97     289  117   1   1     1    1
3       105     319  143   1   1     1    1
4        90     356  199   1   1     1    1
5        90     323  240   1   1     1    1
6        86     381  157   1   1     1    1
........
138     188     958  100   0   0     2    3
139     339    1354   10   0   0     0    0
140     265    1263   83   0   0     0    0
141     353    1428   41   0   0     0    2
142     180     923   77   0   0     2    3
It can be seen that DBSCAN automatically groups the data into 3 to 4 classes (in the output of fpc's dbscan, label 0 marks noise points). The first three eps settings each give 3 classes, so dividing the data into 3 classes is more reasonable.
