Spark: Pre-filtering Data When Reading over JDBC


For the basics of reading MySQL from Spark, see:

http://blog.csdn.net/baifanwudi/article/details/78559177

The requirement: read a single day's worth of data from a MySQL table.

    String beginTime = day + " 00:00:00";
    String endTime = day + " 23:59:59";
    Dataset<Row> jdbcTable = spark.read().format("jdbc")
            .option("url", PropertiesConfig.URL)
            .option("dbtable", tableName)
            .option("user", PropertiesConfig.USERNAME)
            .option("password", PropertiesConfig.PASSWORD)
            .load()
            .filter("create_time between '" + beginTime + "' and '" + endTime + "'");

It turned out that Spark was loading the entire table and only then applying the filter, which was very slow.
So the question was whether the filtering could happen before the load. It can, by passing a subquery as the table name:

    String tableName = "(select * from device_info where create_time between '"
            + beginTime + "' and '" + endTime + "') as device_time_filter";
    Dataset<Row> jdbcTable = spark.read().format("jdbc")
            .option("url", PropertiesConfig.URL)
            .option("dbtable", tableName)
            .option("user", PropertiesConfig.USERNAME)
            .option("password", PropertiesConfig.PASSWORD)
            .load();
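To confirm that the subquery really is sent to MySQL rather than filtered inside Spark, printing the physical plan is a quick check. A minimal sketch (the exact plan text varies by Spark version):

    // The JDBC scan in the printed plan should reference the
    // "(select * from device_info where ...) as device_time_filter" subquery
    // rather than the bare device_info table.
    jdbcTable.explain();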

With an index on the create_time column of that MySQL table, this runs much faster.
The principle is simple: MySQL evaluates the subquery first, filtering device_info on the server side, and Spark reads the result under the alias device_time_filter.
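For reference, DataFrameReader also offers a jdbc() overload that takes an array of predicate strings; each predicate becomes the WHERE clause of one partition's query, so the filtering again happens inside MySQL. Below is a minimal sketch under the same assumptions as above (PropertiesConfig is the author's config holder; the day value and class name are hypothetical):

    import java.util.Properties;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DeviceInfoPredicateRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("jdbc-predicate-pushdown")
                    .getOrCreate();

            String day = "2017-11-20";             // hypothetical date
            String beginTime = day + " 00:00:00";
            String endTime = day + " 23:59:59";

            Properties props = new Properties();
            props.put("user", PropertiesConfig.USERNAME);
            props.put("password", PropertiesConfig.PASSWORD);

            // One predicate string per partition; each becomes the WHERE
            // clause of the SQL that Spark sends to MySQL for that partition.
            String[] predicates = {
                    "create_time between '" + beginTime + "' and '" + endTime + "'"
            };

            Dataset<Row> jdbcTable = spark.read()
                    .jdbc(PropertiesConfig.URL, "device_info", predicates, props);

            jdbcTable.show();
        }
    }

With a single predicate this behaves like the subquery trick; splitting the day into several time ranges, one per predicate, would also parallelize the read across partitions.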
