Hive计算玩家连胜&连败

来源：互联网发布：17年机械行业数据采集编辑：程序博客网时间：2024/06/11 01:42

前几天接到以需求，要求计算玩家的连胜&连败纪录，数据sample如下：

源数据的表结构如下：

CREATE TABLE `t1`(  `id` bigint,   `player_id` int,   `game_result` int,   `season` int,   `game_type` int)ROW FORMAT DELIMITED   FIELDS TERMINATED BY '\t';

数据文件如下：

id:表示比赛id

player_id:用户id

game_result:比赛结果，0代表输，1代表赢

idplayer_idgame_result112312123133210
使用python编写UDF如下：

python文件名：

win_streak_reduce.py

#! /usr/bin/pythonimport sysfor line in sys.stdin:        line = line.split('\t')        player_id = line[0]        games = [int(i) for i in line[1].split(',')]        win_streak_max = 0        lost_streak_max = 0        win_tmp = 0        lost_tmp = 0        for i in games:                if i == 1:                        lost_tmp = 0                        win_tmp += 1                else:                        lost_tmp += 1                        win_tmp = 0                if win_tmp > win_streak_max:                        win_streak_max = win_tmp                if lost_tmp > lost_streak_max:                        lost_streak_max = lost_tmp        print "{player_id}\t{win_streak_max}\t{lost_streak_max}".format(player_id = player_id, win_streak_max = win_streak_max, lost_streak_max = lost_streak_max)

最后查询：

add file /win_streak_reduce.py;select transform(player_id,games) using 'python win_streak_reduce.py' as player_id,win_streak,lost_streak from (select player_id,concat_ws(',',collect_list(cast(game_result as string))) games from (select player_id,id, game_result from t1 cluster by player_id,id) t  group by player_id) t;

整体思路是：
1. 将比赛记录按照player_id,id排序，HQL中的cluster by player_id,id，会使同一个player_id的数据被分发到一起处理。
2. concat_ws(',',collect_list(cast(game_result as string)))会把玩家的比赛结果按照id从小打到排成数组，最后转成字符串，用,分隔
3. 使用udf对比赛结果进行计算，找出连胜和连败最大值。

使用hive不久，这样的处理方式虽然能解决问题，但相信这样的处理方式显得很不专业。请各位大牛批评指正。

1 0