Using Spark to Build a Logistic Regression Model to Help Helen Find a Boyfriend


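About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into the latest Machine Learning techniques, is keenly interested in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents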

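Suppose Helen has been using an online dating site to find a suitable partner. Although the site recommends different candidates, she has not found anyone she likes among them. After some reflection, she realized that the people she has dated fall into three types: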

□ People she did not like
□ People of average charm
□ People of great charm

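Despite discovering this pattern, Helen still cannot sort the matches recommended by the site into the right category. She feels she could date people of average charm from Monday to Friday, and would rather spend weekends with the highly charming ones. Helen hopes our classification algorithm can help her assign matches to the correct category more reliably. In addition, Helen has collected some data that the dating site does not record, which she believes will help classify the matches.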

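Helen has been collecting dating data for some time. She stores it in the text file datingTestSet, one sample per line, 1000 lines in total. Each sample has the following three features:
□ Frequent flyer miles earned per year
□ Percentage of time spent playing video games
□ Liters of ice cream consumed per week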

# Load the tab-separated file and split each line into its fields.
file_content = sc.textFile('/Users/youwei.tan/Desktop/datingTestSet.txt')
df = file_content.map(lambda x: x.split('\t'))
df.take(2)

The output is as follows:

[[u'40920', u'8.326976', u'0.953952', u'largeDoses'], [u'14488', u'7.153469', u'1.673904', u'smallDoses']]

Next, convert the dataset to DataFrame format with the following code:

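# Note: the first column name 'Mileage ' carries a trailing space, which is kept in the output below.
dataset = sqlContext.createDataFrame(df, ['Mileage ', 'Gametime', 'Icecream', 'label'])
dataset.show(5, False)
# printSchema is referenced without parentheses, so the bound method is displayed;
# call dataset.printSchema() to print the schema tree instead.
dataset.printSchema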

The output is as follows:

+--------+---------+--------+----------+
|Mileage |Gametime |Icecream|label     |
+--------+---------+--------+----------+
|40920   |8.326976 |0.953952|largeDoses|
|14488   |7.153469 |1.673904|smallDoses|
|26052   |1.441871 |0.805124|didntLike |
|75136   |13.147394|0.428964|didntLike |
|38344   |1.669788 |0.134296|didntLike |
+--------+---------+--------+----------+
only showing top 5 rows

<bound method DataFrame.printSchema of DataFrame[Mileage : string, Gametime: string, Icecream: string, label: string]>

Build an index dictionary for the label column, in order to convert the string labels into numeric labels.

# Collect the distinct labels and assign each one an integer index.
label_set = dataset.rdd.map(lambda x: x[3]).distinct().collect()
label_dict = dict()
i = 0
for key in label_set:
    if key not in label_dict:
        label_dict[key] = i
        i = i + 1
label_dict

Output:

{u'didntLike': 0, u'largeDoses': 1, u'smallDoses': 2}
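
Because `distinct().collect()` does not guarantee any particular ordering, the indices assigned to the labels can differ between runs. As a minimal sketch in plain Python (no Spark; `build_label_index` is a hypothetical helper, not part of the original code), sorting the labels first gives a reproducible mapping:

```python
def build_label_index(labels):
    """Map each distinct label to an integer index, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Labels as they appear in the first rows of datingTestSet.txt.
sample_labels = ['largeDoses', 'smallDoses', 'didntLike', 'didntLike', 'didntLike']
label_index = build_label_index(sample_labels)
print(label_index)
```

In the Spark code, the same effect could be obtained by iterating over `sorted(label_set)` when filling the dictionary.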

At this point every column in the dataset is still of type string and needs to be converted to a numeric type. The code is as follows:

# Convert the three feature columns to numbers and replace the string label with its index.
data = dataset.rdd.map(lambda x: [int(x[0]), float(x[1]), float(x[2]), label_dict[x[3]]])
data = sqlContext.createDataFrame(data, ['Mileage ', 'Gametime', 'Icecream', 'label'])
data.show(5, False)
# As before, printSchema without parentheses only displays the bound method.
data.printSchema
# data.selectExpr('Mileage', 'Gametime', 'Icecream', 'label').show()

Output:

+--------+---------+--------+-----+
|Mileage |Gametime |Icecream|label|
+--------+---------+--------+-----+
|40920   |8.326976 |0.953952|1    |
|14488   |7.153469 |1.673904|2    |
|26052   |1.441871 |0.805124|0    |
|75136   |13.147394|0.428964|0    |
|38344   |1.669788 |0.134296|0    |
+--------+---------+--------+-----+
only showing top 5 rows

<bound method DataFrame.printSchema of DataFrame[Mileage : bigint, Gametime: double, Icecream: double, label: bigint]>
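
The conversion of a single record can also be sketched without Spark. The helper `parse_record` below is hypothetical; it assumes tab-separated fields in the order shown in the output and reuses the label index printed earlier.

```python
LABEL_DICT = {'didntLike': 0, 'largeDoses': 1, 'smallDoses': 2}


def parse_record(line, label_dict=LABEL_DICT):
    """Turn one tab-separated line into [mileage, game_time, ice_cream, label_index]."""
    mileage, game_time, ice_cream, label = line.strip().split('\t')
    return [int(mileage), float(game_time), float(ice_cream), label_dict[label]]


record = parse_record('40920\t8.326976\t0.953952\tlargeDoses')
print(record)  # [40920, 8.326976, 0.953952, 1]
```

A line with a missing field or an unknown label raises an exception here, which is one simple way to notice malformed rows early.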

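The dataset now meets our requirements, so the next step is to build the model. Before fitting, I first standardize the features, then reduce their dimensionality with principal component analysis (PCA), and finally use a logistic regression model for classification and probability prediction. The code is as follows: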

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import PCA, StandardScaler
from pyspark.mllib.linalg import Vectors

# Merge class 2 into class 1, so that Helen's impression of a man is simply
# "charming" (1) or "not charming" (0). The merge is needed because, at the time of writing,
# pyspark.ml.classification.LogisticRegression supported only binary classification.
feature_data = data.rdd.map(lambda x: (Vectors.dense([x[i] for i in range(0, 3)]),
                                       float(1 if x[3] == 2 else x[3])))
feature_data = sqlContext.createDataFrame(feature_data, ['features', 'labels'])
# feature_data.show()

# 70/30 train/test split with a fixed seed.
train_data, test_data = feature_data.randomSplit([0.7, 0.3], 6)
# train_data.show()

# Scale each feature to unit standard deviation, project onto 2 principal components, then fit.
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures',
                        withStd=True, withMean=False)
pca = PCA(k=2, inputCol='scaledFeatures', outputCol='pcaFeatures')
lr = LogisticRegression(maxIter=10, featuresCol='pcaFeatures', labelCol='labels')
pipeline = Pipeline(stages=[scaler, pca, lr])

Model = pipeline.fit(train_data)
results = Model.transform(test_data)
# Note: 'prediction' is selected twice here; select 'labels' instead of the second
# 'prediction' to see the true class next to each prediction.
results.select('probability', 'prediction', 'prediction').show(truncate=False)

The output is as follows:

+----------------------------------------+----------+----------+
|probability                             |prediction|prediction|
+----------------------------------------+----------+----------+
|[0.22285193760551922,0.7771480623944808]|1.0       |1.0       |
|[0.19145196324973038,0.8085480367502696]|1.0       |1.0       |
|[0.25815968118089555,0.7418403188191045]|1.0       |1.0       |
|[0.1904557879847662,0.8095442120152337] |1.0       |1.0       |
|[0.23649048307318044,0.7635095169268196]|1.0       |1.0       |
|[0.19581773456064858,0.8041822654393515]|1.0       |1.0       |
|[0.17595295700627253,0.8240470429937274]|1.0       |1.0       |
|[0.2693008979176928,0.7306991020823073] |1.0       |1.0       |
|[0.19489995345665115,0.8051000465433488]|1.0       |1.0       |
|[0.2790706794240234,0.7209293205759766] |1.0       |1.0       |
|[0.2074274685125254,0.7925725314874746] |1.0       |1.0       |
|[0.2225838179162865,0.7774161820837134] |1.0       |1.0       |
|[0.23520083542636305,0.764799164573637] |1.0       |1.0       |
|[0.16390109775004727,0.8360989022499528]|1.0       |1.0       |
|[0.2032817412585787,0.7967182587414213] |1.0       |1.0       |
|[0.22397459472064782,0.7760254052793522]|1.0       |1.0       |
|[0.1987896145632484,0.8012103854367516] |1.0       |1.0       |
|[0.18503543175783838,0.8149645682421617]|1.0       |1.0       |
|[0.30849060803324585,0.6915093919667542]|1.0       |1.0       |
|[0.2472540013472057,0.7527459986527943] |1.0       |1.0       |
+----------------------------------------+----------+----------+
only showing top 20 rows
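
Each `probability` vector holds the estimated probabilities of class 0 and class 1, and in a binary logistic model the class-1 probability is the sigmoid of a linear score. The sketch below is plain Python with hypothetical helper names; it assumes a decision threshold of 0.5, which to my knowledge is Spark's default for binary logistic regression.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(probability, threshold=0.5):
    """Return 1.0 when the class-1 probability exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0


# First probability vector from the output above.
first_row = [0.22285193760551922, 0.7771480623944808]
print(predict(first_row))  # 1.0

# The score (log-odds) that the sigmoid maps back to this class-1 probability.
score = math.log(first_row[1] / first_row[0])
print(abs(sigmoid(score) - first_row[1]) < 1e-12)  # True
```

Every row shown has a class-1 probability above 0.5, which is why all twenty predictions are 1.0.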



Finally, run a simple evaluation of the model with the following code:

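from pyspark.mllib.evaluation import MulticlassMetrics

# MulticlassMetrics expects an RDD of (prediction, label) pairs.
# Note: both columns used here are 'prediction', so the predictions are compared with
# themselves and the matrix is diagonal by construction; selecting 'prediction' and
# 'labels' would compare the predictions with the true classes.
predictionAndLabels = results.select('probability', 'prediction', 'prediction').rdd.map(lambda x: (x[1], x[2]))
metrics = MulticlassMetrics(predictionAndLabels)
metrics.confusionMatrix().toArray()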

Output:

array([[  40.,    0.],
       [   0.,  257.]])
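
A confusion matrix is only informative when predictions are compared with the true labels. As a minimal plain-Python sketch (the `confusion_matrix` helper and the toy pairs are hypothetical, not results from Helen's data), rows index the true class and columns the predicted class:

```python
def confusion_matrix(pairs, num_classes=2):
    """Count (prediction, label) pairs; rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for prediction, label in pairs:
        matrix[int(label)][int(prediction)] += 1
    return matrix


# Toy (prediction, label) pairs, made up for illustration only.
toy_pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
matrix = confusion_matrix(toy_pairs)
accuracy = sum(matrix[i][i] for i in range(2)) / float(len(toy_pairs))
print(matrix)    # [[1, 1], [1, 2]]
print(accuracy)  # 0.6
```

With the Spark results, the equivalent input would be pairs built from the `prediction` and `labels` columns; off-diagonal counts then show where the model disagrees with Helen.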

1 0