# Overview

The k-nearest neighbor (k-NN) algorithm is a supervised learning algorithm used for classification. Its basic idea is to compute the **distance** between a new instance and every element of the training set, find the k closest instances (the neighbors), tally the classes they belong to, and assign the new instance the class that occurs most often.

# Principle and Steps

```python
class KNearestNeighbor:

    def __init__(...):  pass

    def train(...):     pass

    def classify(...):  pass
```

## Training (train)

Instance features are min-max normalized before training:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

```python
def train(self, X, C):
    '''X and C are the instances and their classes, respectively.'''
    # Normalize the instance data, keeping a copy of the labels.
    (self.X, self.C) = (normalize(X), C.copy())
    # Optional: build a KD-tree if one is needed.
    self.tree = KDTree()
    self.tree.create(self.X)
```
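The `normalize` helper is left undefined in the pseudocode; a minimal min-max implementation of the formula above, assuming `X` is a 2-D array with one instance per row, could be:

```python
import numpy as np

def normalize(X):
    """Min-max scale each feature column into [0, 1]."""
    X = np.asarray(X, dtype=float)
    xmin = X.min(axis=0)
    xmax = X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)  # avoid division by zero
    return (X - xmin) / span
```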

## Classification (classify)

• If training normalizes the data, then classification must normalize the query in the same way.
• If training builds a structure such as a KD-tree, then classification uses the corresponding method to find the k nearest neighbors.

```python
def classify(self, x):
    _x = normalize(x)                   # normalize x the same way as the training data
    nearest = self.find_neighbors(_x)   # find the k nearest neighbors
    freq = frequency(nearest)           # count the occurrences of each class
    return freq.sorted()[-1]            # after sorting, return the most frequent class
```

```python
def find_neighbors(self, x):
    '''Find the k points closest to x.'''
    if self.tree is None:                   # is a kd-tree in use?
        ds = self.distance(x, self.X)       # distances from x to every point
        indices = ds.argsort()[0:self.k]    # sort, then take the first k
    else:
        indices = self.tree.find_neighbors(x, self.k)
    # indices are the index positions of the k nearest neighbors
    return self.C[indices]
```
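Assembled into a runnable whole, a minimal brute-force version of the classifier above (Euclidean distance, no KD-tree, normalization omitted for brevity; the names mirror the pseudocode but the details are a sketch, not the author's exact code):

```python
import numpy as np
from collections import Counter

class KNearestNeighbor:
    def __init__(self, k):
        self.k = k

    def train(self, X, C):
        self.X = np.asarray(X, dtype=float)
        self.C = np.asarray(C)

    def classify(self, x):
        # Euclidean distance from x to every training instance.
        ds = np.sqrt(((self.X - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
        indices = ds.argsort()[:self.k]   # indices of the k nearest neighbors
        freq = Counter(self.C[indices])   # vote counts per class
        return freq.most_common(1)[0][0]  # the most frequent class
```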

## Initialization (init)

The default distance is Euclidean:

$$d = \|x - y\| = \sqrt{\sum_i (x_i - y_i)^2} \ge |x_i - y_i|$$

```python
def __init__(self, k, distance=euclidean):
    (self.k, self.distance) = (k, distance)
```
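The `euclidean` default above is assumed to be a helper along these lines: a vectorized sketch computing distances from one query point to every training row, matching the formula.

```python
import numpy as np

def euclidean(x, X):
    """Distances from a single point x to every row of X."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(X, dtype=float)
    return np.sqrt(((X - x) ** 2).sum(axis=1))
```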

# scikit-learn

```python
import numpy as np
from sklearn import neighbors

# Prepare data split into two classes, A and B.
# Class A clusters around [0, 0]; class B clusters around [1, 1].
X = np.array([[0, 0.1],   [-0.1, 0],
              [0.1, 0.1], [0, 0],
              [1, 1],     [1.1, 1],
              [1, 1.1],   [1.1, 1.1]])
C = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']

# Initialize
clf = neighbors.KNeighborsClassifier(n_neighbors=3, weights="uniform")

# Train
clf.fit(X, C)

# Classify
c = clf.predict(np.array([[0.9, 0.8]]))
print(c)
```

• n_neighbors: the parameter k.
• weights: how neighbors are weighted when voting. "uniform" gives every neighbor equal weight; "distance" weights each neighbor by the inverse of its distance.
• algorithm: the neighbor-search method, e.g. "kd_tree", "ball_tree", or "brute".
• metric: the distance formula.
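As an example of the last two parameters, the same data can be classified with a KD-tree and Manhattan (L1) distance:

```python
import numpy as np
from sklearn import neighbors

X = np.array([[0, 0.1],   [-0.1, 0],
              [0.1, 0.1], [0, 0],
              [1, 1],     [1.1, 1],
              [1, 1.1],   [1.1, 1.1]])
C = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']

clf = neighbors.KNeighborsClassifier(
    n_neighbors=3,
    algorithm='kd_tree',   # build a KD-tree instead of brute-force search
    metric='manhattan')    # L1 distance instead of the default Euclidean
clf.fit(X, C)
print(clf.predict([[0.9, 0.8]]))  # → ['B']
```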


posted on 2016-10-28 16:18 by lemene