Machine Learning 5: Multivariate Regression and Its Code Implementation


These notes are organized from Coursera course material; reposting and discussion are welcome.

Features

In the simple univariate linear regression model, the features are generally just the variables we already have. In practice, however, how we define the features may matter more: a feature can be any function or combination of the raw quantities. For predicting house prices, for example, we could use the floor area, the number of bathrooms, and so on as features, or instead log(area), the number of bathrooms squared, and so on. The model is then written as:

$$y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \dots + w_D h_D(x_i) + \varepsilon_i = \sum_{j=0}^{D} w_j h_j(x_i) + \varepsilon_i$$
Here the $h_j$ are our features. "Linear" regression means the model is linear in the coefficients $w_j$; the features $h_j$ themselves may be nonlinear combinations of the inputs.
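As a minimal sketch of such a feature map (all numbers below are made up for illustration):

```python
import numpy as np

# Made-up raw inputs for three houses
sqft = np.array([1180.0, 2570.0, 770.0])
bathrooms = np.array([1.0, 2.25, 1.0])

# Each h_j may be a nonlinear function of the raw inputs; the model is still
# linear in the weights w_j.
H = np.column_stack([
    np.ones_like(sqft),  # h_0(x) = 1, the intercept feature
    np.log(sqft),        # h_1(x) = log(sqft)
    bathrooms ** 2,      # h_2(x) = (number of bathrooms)^2
])

w = np.array([-100000.0, 50000.0, 10000.0])  # made-up weights
y_hat = H.dot(w)  # y_hat_i = sum_j w_j * h_j(x_i)
print(y_hat)
```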

At this point we can restate the model slightly: our raw data is the training data, i.e. $X$, and we obtain the features $h_j(x)$ from it through feature extraction.

Interpreting the Coefficients

For a linear model we can interpret the meanings of $w_1, w_2, \dots$ as follows. Say we want to interpret $\hat{w}_1$: hold all the other inputs fixed at constant values, which slices the fitted hyperplane, and $\hat{w}_1$ is then the slope of the intersection line of the two planes, i.e. the predicted change in the output per unit change in that one input with everything else held fixed.

Note: the coefficients of a polynomial model cannot be interpreted this way! In $y = w_0 + w_1 x + w_2 x^2$, for instance, we cannot change $x$ while holding $x^2$ fixed, so $w_1$ by itself is not the effect of a unit change in $x$: increasing $x$ by one changes the prediction by $w_1 + w_2(2x + 1)$, which depends on $x$.
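A small sketch with made-up coefficients to make the contrast concrete: in the linear model, a unit change in one input moves the prediction by exactly its coefficient, whereas in the polynomial model the effect of a unit change in $x$ depends on where you start.

```python
w0, w1, w2 = 5.0, 3.0, 2.0

# Linear model in two inputs: hold x2 fixed and w1 is the slope in x1
def linear(x1, x2):
    return w0 + w1 * x1 + w2 * x2

print(linear(4.0, 7.0) - linear(3.0, 7.0))  # 3.0 == w1, wherever we start

# Polynomial model: x and x^2 move together, so no single slope exists
def poly(x):
    return w0 + w1 * x + w2 * x ** 2

print(poly(4.0) - poly(3.0))    # 17.0 == w1 + w2 * (2*3 + 1)
print(poly(11.0) - poly(10.0))  # 45.0 == w1 + w2 * (2*10 + 1)
```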

Building the Multivariate Regression Model

The Closed-Form Solution

We have already obtained the model above:

$$y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \dots + w_D h_D(x_i) + \varepsilon_i = \sum_{j=0}^{D} w_j h_j(x_i) + \varepsilon_i$$

Let:

$$H = \begin{bmatrix} h_0(x_1) & h_1(x_1) & h_2(x_1) & \cdots & h_D(x_1) \\ h_0(x_2) & h_1(x_2) & h_2(x_2) & \cdots & h_D(x_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ h_0(x_N) & h_1(x_N) & h_2(x_N) & \cdots & h_D(x_N) \end{bmatrix}, \qquad \vec{W} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}, \qquad \vec{\epsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}$$

Then:

$$\vec{Y} = H\vec{W} + \vec{\epsilon}$$

Next we compute the RSS:

$$\mathrm{RSS}(\vec{w}) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = (\vec{Y} - H\vec{W})^T (\vec{Y} - H\vec{W})$$

We then state the following gradient without proof:

$$\nabla \mathrm{RSS}(\vec{w}) = -2 H^T (\vec{Y} - H\vec{W})$$

Setting $\nabla \mathrm{RSS}(\vec{w}) = 0$ yields:

$$\hat{w} = (H^T H)^{-1} H^T \vec{Y}$$
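A minimal numpy sketch of this closed-form solution (illustrative numbers; solving the linear system with np.linalg.solve evaluates the same formula more stably than forming the inverse explicitly):

```python
import numpy as np

def closed_form_weights(H, y):
    # Normal equation: solve (H^T H) w = H^T y, i.e. w = (H^T H)^{-1} H^T y
    return np.linalg.solve(H.T.dot(H), H.T.dot(y))

# Tiny made-up design matrix: an intercept column plus one feature
H = np.array([[1.0, 1180.0],
              [1.0, 2570.0],
              [1.0,  770.0]])
y = np.array([221900.0, 538000.0, 180000.0])
print(closed_form_weights(H, y))
```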

Gradient Descent

Following our earlier introduction to gradient descent, we search for the model's solution with the iteration:

$$\vec{w}^{(t+1)} \leftarrow \vec{w}^{(t)} - \eta \, \nabla \mathrm{RSS}(\vec{w}^{(t)}) = \vec{w}^{(t)} + 2\eta H^T (\vec{Y} - H\vec{w}^{(t)})$$

where $\eta$ is the step size, stopping once the magnitude of the gradient drops below a chosen tolerance.
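As a small sketch of this iteration on made-up data (a tiny matrix chosen only so the loop converges quickly; the real implementation appears in the code section below):

```python
import numpy as np

H = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # made-up design matrix
y = np.array([1.0, 2.0, 3.0])  # targets; the exact fit is w = [0, 1]
w = np.zeros(2)
eta = 0.01

# Repeated update: w <- w - eta * grad RSS(w) = w + 2 * eta * H^T (y - H w)
for t in range(1000):
    residual = y - H.dot(w)
    w = w + 2 * eta * H.T.dot(residual)
print(w)  # approaches the least-squares solution [0, 1]
```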

Code Implementation

Below, multivariate regression is implemented twice: once with graphlab's built-in functions and algorithms, and once with hand-written functions. The detailed code, data files, and usage instructions can be downloaded here (no download points required).

```python
# Multivariate regression using graphlab's built-in functions and algorithms
import graphlab
from math import log

sales = graphlab.SFrame('kc_house_data.gl/')
train_data, test_data = sales.random_split(.8, seed=0)

example_features = ['sqft_living', 'bedrooms', 'bathrooms']
example_model = graphlab.linear_regression.create(train_data, target='price',
                                                  features=example_features,
                                                  validation_set=None)

def get_residual_sum_of_squares(model, data, outcome):
    # First get the predictions
    predict_price = model.predict(data)
    # Then compute the residuals/errors
    err = predict_price - outcome
    # Then square and add them up
    RSS = (err * err).sum()
    return RSS

# Create the four new features in both TRAIN and TEST data
train_data['bedrooms_squared'] = train_data['bedrooms'].apply(lambda x: x ** 2)
test_data['bedrooms_squared'] = test_data['bedrooms'].apply(lambda x: x ** 2)
train_data['bed_bath_rooms'] = train_data['bedrooms'] * train_data['bathrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms'] * test_data['bathrooms']
train_data['log_sqft_living'] = train_data['sqft_living'].apply(lambda x: log(x))
test_data['log_sqft_living'] = test_data['sqft_living'].apply(lambda x: log(x))
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']

# Means of the new features on the TEST data
new1_mean = test_data['bedrooms_squared'].mean()
new2_mean = test_data['bed_bath_rooms'].mean()
new3_mean = test_data['log_sqft_living'].mean()
new4_mean = test_data['lat_plus_long'].mean()
print new1_mean, new2_mean, new3_mean, new4_mean
print round(new1_mean, 2), round(new2_mean, 2), round(new3_mean, 2), round(new4_mean, 2)

# Three nested feature sets
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model1 = graphlab.linear_regression.create(train_data, target='price',
                                           features=model_1_features, validation_set=None)
model2 = graphlab.linear_regression.create(train_data, target='price',
                                           features=model_2_features, validation_set=None)
model3 = graphlab.linear_regression.create(train_data, target='price',
                                           features=model_3_features, validation_set=None)

# Compute the RSS on TRAINING data for each of the three models
rss1 = get_residual_sum_of_squares(model1, train_data, train_data['price'])
rss2 = get_residual_sum_of_squares(model2, train_data, train_data['price'])
rss3 = get_residual_sum_of_squares(model3, train_data, train_data['price'])
print rss1, rss2, rss3

# Compute the RSS on TESTING data for each of the three models
rss11 = get_residual_sum_of_squares(model1, test_data, test_data['price'])
rss21 = get_residual_sum_of_squares(model2, test_data, test_data['price'])
rss31 = get_residual_sum_of_squares(model3, test_data, test_data['price'])
print rss11, rss21, rss31
```
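One point worth noting when reading the output: for nested feature sets fit by least squares, adding features can never increase the RSS on the training data, so model 3 fits the training set at least as well as models 1 and 2. The RSS on the test data, however, is free to get worse as features are added, and comparing the two is exactly how this exercise exposes overfitting.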
```python
# Hand-written multivariate regression, to get a deeper understanding of how
# it works internally, and especially of gradient descent
import graphlab
import numpy as np  # note this allows us to refer to numpy as np instead
from math import sqrt  # the magnitude of a vector g is sqrt(g[0]^2 + g[1]^2 + ...)

sales = graphlab.SFrame('kc_house_data.gl/')

# Extract the feature matrix and the output as numpy arrays
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1  # this is how you add a constant column to an SFrame
    # prepend 'constant' to the features list so it is extracted with the others
    features = ['constant'] + features
    # select the columns given by the features list (now including the constant)
    features_sframe = data_sframe[features]
    # convert the features SFrame into a numpy matrix
    feature_matrix = features_sframe.to_numpy()
    # convert the output column into a numpy array
    output_array = data_sframe[output].to_numpy()
    return (feature_matrix, output_array)

# Predict prices from a feature matrix and a vector of coefficients
def predict_output(feature_matrix, weights):
    # feature_matrix holds the features as columns; weights is a numpy array
    predictions = np.dot(feature_matrix, weights)
    return predictions

# Partial derivative of the RSS with respect to a single weight
def feature_derivative(errors, feature):
    # errors and feature are numpy arrays of the same length (number of data
    # points); the derivative is twice their dot product
    derivative = 2 * np.dot(errors, feature)
    return derivative

# Find the regression coefficients by gradient descent
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    weights = np.array(initial_weights)  # make sure it's a numpy array
    while not converged:
        # predictions based on feature_matrix and the current weights
        predictions = predict_output(feature_matrix, weights)
        # errors are predictions - output
        errs = predictions - output
        gradient_sum_squares = 0  # initialize the gradient sum of squares
        for i in range(len(weights)):  # loop over each weight
            # feature_matrix[:, i] is the feature column associated with weights[i]
            wi_dev = feature_derivative(errs, feature_matrix[:, i])
            # add the squared derivative to the gradient sum of squares
            # (used to assess convergence)
            gradient_sum_squares += wi_dev * wi_dev
            # subtract the step size times the derivative from the current weight
            weights[i] -= step_size * wi_dev
        # the gradient magnitude is the square root of the gradient sum of squares
        gradient_magnitude = sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return weights

# Now we can use the functions we just wrote
train_data, test_data = sales.random_split(.8, seed=0)

# Test out gradient descent on a single feature
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
my_weight = regression_gradient_descent(simple_feature_matrix, output,
                                        initial_weights, step_size, tolerance)
print my_weight
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)
preout = predict_output(test_simple_feature_matrix, my_weight)
print preout
# note: this is the square root of the RSS (the residual norm), not the RSS itself
RSS1 = sqrt(((test_output - preout) * (test_output - preout)).sum())
print RSS1

# A two-feature model; sqft_living15 is the average square feet of the
# nearest 15 neighbors
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9
my_weight2 = regression_gradient_descent(feature_matrix, output,
                                         initial_weights, step_size, tolerance)
print my_weight2
(simple_feature_matrix1, output1) = get_numpy_data(test_data, model_features, my_output)
preout2 = predict_output(simple_feature_matrix1, my_weight2)
print len(preout2)
print preout2
# again the square root of the RSS; output1 holds the test prices
RSS2 = sqrt(((output1 - preout2) * (output1 - preout2)).sum())
print RSS2
```
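As a quick sanity check, not part of the original notes: for the one-feature model, the gradient descent weights can be compared against the closed-form normal-equation solution derived above (reusing `simple_feature_matrix` and `output` from the script, which still hold the training matrix and the training prices):

```python
# Closed-form solution on the same training data, for comparison with my_weight;
# with the loose tolerance used above, expect rough rather than exact agreement
w_closed = np.linalg.solve(np.dot(simple_feature_matrix.T, simple_feature_matrix),
                           np.dot(simple_feature_matrix.T, output))
print(w_closed)
```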