机器学习5——多元回归及其代码实现
来源:互联网 发布:mac磁盘工具删除文件 编辑:程序博客网 时间:2024/06/10 15:00
本博客整理自coursera,欢迎转载交流。
Features
在简单的一元线性回归模型中,我们的features一般是指一些我们现有的变量。其实,在真正的使用中更重要的可能是如何定义这些特征,我们可以用一些量的函数组合来表示特征。比如预测房价,我们可以用面积,浴室个数……表示特征,也可以用log(面积),浴室个数的平方……表示特征,那么我们的模型表示为:
yi=w0h0(xi)+w1h1(xi)+...+wDhD(xi)+εi
=∑Dj=0wjhj(xi)+εi
这里的h就是我们的特征,所谓线性回归是指系数W是线性,但是特征h可以是非线性组合。
讲到这里我么你=就可以稍微改变一下我们的模型了(我们的原始数据是Training Data也即是X,我们需要通过Featyre extraction提取特征):
解释系数的意义
对于线性模型我们可以这样解释
注意:多项式函数不可以这么解释!
建立多元回归模型
数学描述法
我们已经得到上述模型:
=
令:
H⃗ =⎡⎣⎢⎢⎢⎢⎢⎢h0(x1)h0(x2)⋅⋅⋅ ⋅⋅⋅h0(xN)h1(x1)h1(x2)⋅⋅⋅⋅⋅⋅h1(xN)h2(x1)⋅⋅⋅hD(x1)h2(x2)⋅⋅⋅hD(x2)h2(xN)⋅⋅⋅hD(xN)⎤⎦⎥⎥⎥⎥⎥⎥
W⃗ =⎡⎣⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢w0w1w2⋅⋅⋅wD⎤⎦⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥ ϵ⃗ =⎡⎣⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢ε0ε1ε2⋅⋅⋅εN⎤⎦⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥
则:
Y⃗ =H⃗ ⋅W⃗ +ϵ⃗
接下来,我们计算RSS:
接下来不加证明的给出下面的公式:
令:
得到:
梯度下降法
根据我们之前对梯度下降法的介绍,我们需要作如下过程来寻找模型的解:
w⃗ (t+1)←w⃗ t−ηΔRSS(w⃗ (t))=w⃗ t+2η(Y⃗ −H⃗ W⃗ )
代码实现
实现了分别是用graphlab和自己编写函数实现多元回归模型,详细代码、数据文件和使用说明可以在这里下载,不需要积分。
#使用graphlab自带函数和算法的实现多元回归import graphlabsales = graphlab.SFrame('kc_house_data.gl/')train_data,test_data = sales.random_split(.8,seed=0)example_features = ['sqft_living', 'bedrooms', 'bathrooms']example_model = graphlab.linear_regression.create(train_data, target = 'price', features = example_features, validation_set = None)def get_residual_sum_of_squares(model, data, outcome): # First get the predictions predict_price = model.predict(data) # Then compute the residuals/errors err = predict_price-outcome # Then square and add them up RSS = (err*err).sum() return(RSS) from math import logtrain_data['bedrooms_squared'] = train_data['bedrooms'].apply(lambda x: x**2)test_data['bedrooms_squared'] = test_data['bedrooms'].apply(lambda x: x**2)# create the remaining 3 features in both TEST and TRAIN datatrain_data['bed_bath_rooms'] = train_data['bedrooms']*train_data['bathrooms']test_data['bed_bath_rooms'] = test_data['bedrooms']*test_data['bathrooms']train_data['log_sqft_living'] = train_data['sqft_living'].apply(lambda x:log(x))test_data['log_sqft_living'] = test_data['sqft_living'].apply(lambda x:log(x))train_data['lat_plus_long'] = train_data['lat']+train_data['long']test_data['lat_plus_long'] = test_data['lat']+test_data['long']#train_data.head()new1_mean = test_data['bedrooms_squared'].mean()new2_mean = test_data['bed_bath_rooms'].mean()new3_mean = test_data['log_sqft_living'].mean()new4_mean = test_data['lat_plus_long'].mean()print new1_mean,new2_mean,new3_mean,new4_meanprint round(new1_mean,2),round(new2_mean,2),round(new3_mean,2),round(new4_mean,2)model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']model_2_features = model_1_features + ['bed_bath_rooms']model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']model1 = graphlab.linear_regression.create(train_data,target='price',features=model_1_features,validation_set=None)model2 = graphlab.linear_regression.create(train_data,target='price',features=model_2_features,validation_set=None)model3 = graphlab.linear_regression.create(train_data,target='price',features=model_3_features,validation_set=None)rss1=get_residual_sum_of_squares(model1,train_data,train_data['price'])rss2=get_residual_sum_of_squares(model2,train_data,train_data['price'])rss3=get_residual_sum_of_squares(model3,train_data,train_data['price'])print rss1,rss2,rss3# Compute the RSS on TESTING data for each of the three models and record the values:rss11=get_residual_sum_of_squares(model1,test_data,test_data['price'])rss21=get_residual_sum_of_squares(model2,test_data,test_data['price'])rss31=get_residual_sum_of_squares(model3,test_data,test_data['price'])print rss11,rss21,rss31
#下面自己编写多元回归函数,以便更加深刻的理解多元回归的内部原理,尤其是加深对梯度下降的理解import graphlabsales = graphlab.SFrame('kc_house_data.gl/')import numpy as np # note this allows us to refer to numpy as np instead #编写函数获得所需数据def get_numpy_data(data_sframe, features, output): data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame # add the column 'constant' to the front of the features list so that we can extract it along with the others: features = ['constant'] + features # this is how you combine two lists # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant): features_sframe = data_sframe[features] # the following line will convert the features_SFrame into a numpy matrix: feature_matrix = features_sframe.to_numpy() # assign the column of data_sframe associated with the output to the SArray output_sarray output_sarray = data_sframe[output] # the following will convert the SArray into a numpy array by first converting it to a list output_array = output_sarray.to_numpy() return(feature_matrix, output_array)#编写根据系数和特征预测房价的函数def predict_output(feature_matrix, weights): # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array # create the predictions vector by using np.dot() predictions = np.dot(feature_matrix,weights) return(predictions)#编写求取某一点偏导数的函数def feature_derivative(errors, feature): # Assume that errors and feature are both numpy arrays of the same length (number of data points) # compute twice the dot product of these vectors as 'derivative' and return the value derivative = 2*np.dot(errors,feature) return(derivative)from math import sqrt # recall that the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)#编写多元回归求取系数的函数def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance): converged = False weights = np.array(initial_weights) # make sure it's a numpy array while not converged: # compute the predictions based on feature_matrix and weights using your predict_output() function predictions = predict_output(feature_matrix,weights) # compute the errors as predictions - output errs = predictions - output gradient_sum_squares = 0 # initialize the gradient sum of squares # while we haven't reached the tolerance yet, update each feature's weight for i in range(len(weights)): # loop over each weight # Recall that feature_matrix[:, i] is the feature column associated with weights[i] # compute the derivative for weight[i]: wi_dev = feature_derivative(errs,feature_matrix[:,i]) # add the squared value of the derivative to the gradient sum of squares (for assessing convergence) gradient_sum_squares +=np.dot(wi_dev,wi_dev) # subtract the step size times the derivative from the current weight weights[i]-= step_size*wi_dev # compute the square-root of the gradient sum of squares to get the gradient magnitude: gradient_magnitude = sqrt(gradient_sum_squares) if gradient_magnitude < tolerance: converged = True return(weights)#下面就可以开始使用我们自己编写的函数了train_data,test_data = sales.random_split(.8,seed=0)# let's test out the gradient descentsimple_features = ['sqft_living']my_output = 'price'(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)initial_weights = np.array([-47000., 1.])step_size = 7e-12tolerance = 2.5e7my_weight = regression_gradient_descent(simple_feature_matrix,output,initial_weights,step_size,tolerance)print my_weight(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)preout = predict_output(test_simple_feature_matrix,my_weight)print preoutRSS1= sqrt(((test_output-preout)*(test_output-preout)).sum())print RSS1model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. my_output = 'price'(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)initial_weights = np.array([-100000., 1., 1.])step_size = 4e-12tolerance = 1e9my_weight2 = regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance)print my_weight2simple_features = ['sqft_living','sqft_living15']my_output = 'price'(simple_feature_matrix1, output1) = get_numpy_data(test_data, simple_features, my_output)preout2 = predict_output(simple_feature_matrix1,my_weight2)print len(preout2)print preout2RSS2 = sqrt(((test_output-preout2)*(test_output-preout2)).sum())print RSS2
- 机器学习5——多元回归及其代码实现
- 机器学习——多元线性回归
- 机器学习——多元线性回归
- 机器学习 -- 多元回归实现实例
- 机器学习练习(二)——多元线性回归
- 机器学习练习(四)——多元逻辑回归
- Python实现机器学习二(实现多元线性回归)
- Python实现机器学习--实现多元线性回归
- 机器学习回归篇-多元线性回归
- 【机器学习】逻辑回归原理及其实现
- 【机器学习】逻辑回归原理及其实现
- 机器学习精简教程之三——用scikit-learn求解多元线性回归问题
- 【R笔记】R机器学习(三)——多元回归
- 机器学习——多元线性回归分析(multiple regression)及应用
- 多元线性回归 机器学习基础(1)
- 7.3多元线性回归--python机器学习
- 机器学习sklearn多元线性回归
- 机器学习sklearn多元线性回归2
- Dubbo与Zookeeper、SpringMVC整合和使用(负载均衡、容错)
- pdf转换成图片格式的转换方法
- 解决quartz和Jackrabbit重复加载两次的问题
- 减少C++代码编译时间的方法
- 用jquery使用操作checkbox出问题!
- 机器学习5——多元回归及其代码实现
- 设置交互要多久?最快只需三秒!
- 短信大量获取接口
- 四轴动力学分析
- Bit++
- 用css的border属性画三角形
- The Resource Server: Angular JS and Spring Security Part III
- 大型网站的 HTTPS 实践(四)——协议层以外的实践[2015/04/21]
- 【JZOJ4854】【NOIP2016提高A组集训第6场11.3】小澳的坐标系