Spark Streaming Custom Data Sources: Implementing a Custom Input DStream and Receiver
Source: Internet · Published by: 从程序员到架构师之路 · Editor: 程序博客网 · Date: 2024/06/02 13:25
Reference: Spark Streaming Programming Guide (official documentation): http://spark.apache.org/docs/2.0.0-preview/streaming-programming-guide.html
The code in this article is written in Scala.
The overall flow consists of the following steps:
1. Implement a custom receiver
Note the following points when implementing a custom receiver.
1.1 The receiver acts as the client side of a socket connection
A custom receiver is essentially the client side of socket programming: it receives data sent from a server at a specific IP address and port, so the IP and port used here must match those of the particular socket server.
1.2 Extend the Receiver class and override its onStart() and onStop() methods
Class: org.apache.spark.streaming.receiver.Receiver
class CustomReceiver(host: String, port: Int, isTimeOut: Boolean, sec: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
Parameters: the constructor must take host: String and port: Int, where host is the IP address of the data-source server and port is its port number. The remaining parameters are optional: isTimeOut (Boolean) indicates whether to set a read timeout (true to enable, false to disable), and sec (Int) is the timeout in seconds.
Methods to override: two methods must be overridden, onStart() and onStop().
override def onStart(): Unit = {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    override def run() { receive() }
  }.start()
}

override def onStop() {
  // There is nothing much to do as the thread calling receive()
  // is designed to stop by itself if isStopped() returns false
}
In onStart(), we start a thread that receives data over the connection; that thread calls receive(), which connects to the data-source server, reads its data, and pushes it to Spark.
onStop() does not need to contain any code.
1.3 The receive() method receives data and pushes it to Spark
The thread started in onStart() calls receive(), which reads the data sent by the data-source server and pushes it to Spark. The method is implemented as follows:
/** Create a socket connection and receive data until receiver is stopped */
private def receive() {
  val _pattern: String = "yyyy-MM-dd HH:mm:ss SSS"
  val format: SimpleDateFormat = new SimpleDateFormat(_pattern)
  val _isTimeOut: Boolean = isTimeOut
  val _sec: Int = sec
  var socket: Socket = null
  var userInput: String = null
  try {
    // Connect to host:port
    socket = new Socket(host, port)
    println(format.format(new Date()))
    println("Connection established\n")
    if (_isTimeOut) socket.setSoTimeout(_sec * 1000)
    // Until stopped or connection broken continue reading
    val reader = new BufferedReader(
      new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
    userInput = reader.readLine()
    while (!isStopped && userInput != null) {
      println(userInput)
      store(userInput)
      userInput = reader.readLine()
    }
    reader.close()
    socket.close()
    // Restart in an attempt to connect again when server is active again
    restart("Trying to connect again")
  } catch {
    case e: java.net.ConnectException =>
      // restart if could not connect to server
      restart("Error connecting to " + host + ":" + port, e)
    case t: Throwable =>
      // restart if there is any other error
      restart("Error receiving data", t)
  }
}
The key call is store(userInput), inherited from the parent class Receiver: it pushes data to Spark, where userInput is each line of data received from the custom data-source server. The rest of the code is ordinary socket-client programming.
2. Implement a mock data-source server
import java.io.OutputStreamWriter
import java.net.{ServerSocket, SocketException, SocketTimeoutException}
import java.text.SimpleDateFormat
import java.util.{Date, Scanner}

/**
 * A custom socket server that sends messages to CustomReceiver.
 */
class CustomServer(port: Int, isTimeOut: Boolean, sec: Int) {
  val _pattern: String = "yyyy-MM-dd HH:mm:ss SSS"
  val format: SimpleDateFormat = new SimpleDateFormat(_pattern)
  val _isTimeOut = isTimeOut
  val _sec: Int = sec
  val _port = port

  def onStart(): Unit = {
    // Start the thread that sends data over a connection
    new Thread("Socket Server") {
      override def run() { sServer() }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do here
  }

  def sServer(): Unit = {
    println("----------Server----------")
    println(format.format(new Date()))
    var tryingCreateServer = 1
    try {
      val server = new ServerSocket(_port)
      println("Listening, waiting for a client\n")
      if (_isTimeOut) server.setSoTimeout(_sec * 1000) // setSoTimeout takes milliseconds
      val socket = server.accept()
      println(format.format(new Date()))
      println("Connection established with the client")
      val writer = new OutputStreamWriter(socket.getOutputStream)
      println(format.format(new Date()))
      val in = new Scanner(System.in)
      // Use "\n" as the delimiter for console input (the default is whitespace)
      in.useDelimiter("\n")
      println("Enter data to send")
      var flag = in.hasNext
      while (flag) {
        val s = in.next()
        // Note: without the trailing "\n", the client cannot read the data
        writer.write(s + "\n")
        Thread.sleep(1000)
        if (socket.isClosed) {
          println("socket is closed !")
        } else {
          try {
            writer.flush()
          } catch {
            case e: java.net.SocketException =>
              println("Error: the client disconnected!")
              flag = false
              writer.close()
              socket.close()
              server.close()
              onStart()
              return
          }
        }
      }
      println(format.format(new Date()))
      println("All data sent\n")
      // Try to re-establish the listener
      if (tryingCreateServer < 5) {
        writer.close()
        socket.close()
        server.close()
        onStart()
        tryingCreateServer += 1
      }
    } catch {
      case e: SocketTimeoutException =>
        println(format.format(new Date()) + "\nNo data received for " + _sec + " seconds, shutting down\n")
        e.printStackTrace()
      case e: SocketException => e.printStackTrace()
      case e: Exception => e.printStackTrace()
    }
  }
}

object CustomServer {
  def main(args: Array[String]): Unit = {
    new CustomServer(8888, false, 0).onStart()
  }
}
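The newline-delimited line protocol that the server and receiver rely on can be demonstrated in isolation with plain JDK sockets. The sketch below (LineProtocolDemo and roundTrip are hypothetical names, not from the article) starts a throwaway server on a free port, writes each message followed by "\n", and reads the lines back with BufferedReader.readLine(), the same loop shape as receive():

```scala
import java.io.{BufferedReader, InputStreamReader, OutputStreamWriter}
import java.net.{ServerSocket, Socket}
import java.nio.charset.StandardCharsets

object LineProtocolDemo {
  // Send messages over a local socket using the "\n" line protocol
  // and return what a line-oriented client reads back.
  def roundTrip(messages: Seq[String]): List[String] = {
    val server = new ServerSocket(0) // port 0: let the OS pick a free port
    val serverThread = new Thread(() => {
      val socket = server.accept()
      val writer = new OutputStreamWriter(socket.getOutputStream, StandardCharsets.UTF_8)
      // Without the trailing "\n", the client's readLine() would never return
      messages.foreach(m => writer.write(m + "\n"))
      writer.flush()
      writer.close()
      socket.close()
    })
    serverThread.start()

    val client = new Socket("localhost", server.getLocalPort)
    val reader = new BufferedReader(
      new InputStreamReader(client.getInputStream, StandardCharsets.UTF_8))
    // Same loop shape as receive(): read lines until the stream ends (null)
    val received = Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    reader.close(); client.close(); serverThread.join(); server.close()
    received
  }
}
```

In the real receiver, each line read this way would be handed to store(...) instead of collected into a list.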
3. Subscribe to the receiver from Spark Streaming
val receiverInputDStream = ssc.receiverStream(new CustomReceiver("hadoop01", 8888, false, 0))
In the receiver constructor, the first argument "hadoop01" is the IP address or hostname of the custom data-source server, the second (8888) is the server's port number, the third (false) disables the timeout, and the fourth (0) is the timeout in seconds.
Note: the data-source server must be started before the Spark Streaming application; you can then type data into the server's console to test the whole pipeline.
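For completeness, a minimal driver program around the receiverStream call above might look like the following sketch. It assumes the CustomReceiver class from this article is on the classpath along with the spark-streaming dependency; the app name, batch interval, and word-count transformation are illustrative choices, not from the article:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CustomReceiverApp {
  def main(args: Array[String]): Unit = {
    // At least 2 local threads: one is occupied by the receiver itself,
    // so "local[1]" would leave no thread for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("CustomReceiverApp")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Attach the custom receiver (server host/port as in the article)
    val lines = ssc.receiverStream(new CustomReceiver("hadoop01", 8888, false, 0))

    // A simple word count over each 5-second batch
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The "local[2]" master is the important detail: a receiver permanently occupies one task slot, so the application needs at least one more to process the batches.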