A simple Bloom filter implementation
Source: Internet · Editor: 程序博客网 · Time: 2024/06/10 03:04
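Computing UV (unique visitors) in stream processing is a real headache: once the data volume gets large, the intermediate state needed for deduplication becomes enormous. A recent project required UV per category, with roughly 70 million unique visitors spread over more than 200,000 categories. Storing the intermediate results and deduplicating exactly was not an option: the data would not fit in memory, and HBase could not sustain the required throughput. So I decided to approximate UV with a Bloom filter.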
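To get a feel for the memory involved, the standard Bloom filter sizing formulas can be evaluated directly. The sketch below is only a back-of-the-envelope aid: the class name `BloomSizing` and the 1% target false-positive rate are assumptions for illustration, not figures from the project.

```java
/** Back-of-the-envelope Bloom filter sizing using the standard textbook formulas. */
final class BloomSizing {
    /** Bits needed for n elements at false-positive rate p: m = -n ln p / (ln 2)^2. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions that minimises the rate: k = (m / n) ln 2. */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 70_000_000L;  // UV figure quoted in the text
        double p = 0.01;       // assumed target rate, not from the original post
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.printf("bits=%d (~%d MB), hashes=%d%n", m, m / 8 / 1_000_000, k);
    }
}
```

Under these assumptions a single filter for 70 million ids needs on the order of 80 MB, far less than keeping the ids themselves. A per-category setup would need this estimate repeated for each category's expected cardinality.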
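I wrote a small Bloom filter demo. Because the uids are all numeric, their hash values collided fairly often, so instead of feeding the uid to the Bloom filter directly, I feed it the MD5 of the uid: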
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.BitSet;

    public class SimpleBloomFilter {
        // Number of bits in the filter. The original expression 8 << 50000 yields the
        // same value, because Java only uses the low five bits of an int shift count
        // (50000 & 31 == 16, and 8 << 16 == 1 << 19). The size must stay a power of
        // two for the (cap - 1) mask in SimpleHash to work.
        private static final int DEFAULT_SIZE = 1 << 19;
        // One multiplier per hash function.
        private static final int[] seeds = new int[]{13, 31, 131, 1313, 13131, 131313};

        private final BitSet bits = new BitSet(DEFAULT_SIZE);
        private final SimpleHash[] func = new SimpleHash[seeds.length];

        public static void main(String[] args) {
            System.out.println(DEFAULT_SIZE);
            int count = 0;
            SimpleBloomFilter filter = new SimpleBloomFilter();
            // Feed 50,000 distinct ids; count approximates the number of distinct values.
            for (int i = 10000; i < 60000; i++) {
                String value = getMD5Str(String.valueOf(i));
                if (!filter.contains(value)) {
                    count++;
                    filter.add(value);
                }
            }
            System.out.println("result is :" + count);
        }

        public SimpleBloomFilter() {
            for (int i = 0; i < seeds.length; i++) {
                func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
            }
        }

        // Sets one bit per hash function.
        public void add(String value) {
            for (SimpleHash f : func) {
                bits.set(f.hash(value), true);
            }
        }

        // Returns false if the value was definitely never added; true means "probably added".
        public boolean contains(String value) {
            if (value == null) {
                return false;
            }
            for (SimpleHash f : func) {
                if (!bits.get(f.hash(value))) {
                    return false;
                }
            }
            return true;
        }

        // Returns the MD5 digest of str as a 32-character lowercase hex string.
        private static String getMD5Str(String str) {
            MessageDigest messageDigest;
            try {
                messageDigest = MessageDigest.getInstance("MD5");
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
            byte[] byteArray = messageDigest.digest(str.getBytes(StandardCharsets.UTF_8));
            StringBuilder md5StrBuff = new StringBuilder();
            for (byte b : byteArray) {
                String hex = Integer.toHexString(0xFF & b);
                if (hex.length() == 1) {
                    md5StrBuff.append('0');
                }
                md5StrBuff.append(hex);
            }
            return md5StrBuff.toString();
        }

        public static class SimpleHash {
            private final int cap;
            private final int seed;

            public SimpleHash(int cap, int seed) {
                this.cap = cap;
                this.seed = seed;
            }

            // Polynomial string hash with the given seed, masked into [0, cap).
            public int hash(String value) {
                int result = 0;
                int len = value.length();
                for (int i = 0; i < len; i++) {
                    result = seed * result + value.charAt(i);
                }
                return (cap - 1) & result;
            }
        }
    }
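Since the demo only increments `count` when `contains` returns false, every false positive makes the result undercount. A rough idea of how often that happens comes from the usual approximation p ≈ (1 - e^(-kn/m))^k. The sketch below plugs in the demo's parameters: `DEFAULT_SIZE` evaluates to 524288 bits (Java masks an int shift count to its low five bits, so `8 << 50000` is `8 << 16`), with 6 seeds and 50,000 inserted ids. The class name `BloomErrorEstimate` is made up for this illustration, and the formula assumes independent, uniformly distributed hash functions, which the simple seed-based hashes may not achieve.

```java
/** Estimates the false-positive rate of a Bloom filter, assuming ideal hash functions. */
final class BloomErrorEstimate {
    /** p ~ (1 - e^(-k n / m))^k for m bits, k hash functions and n inserted elements. */
    static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 1L << 19;  // DEFAULT_SIZE as Java evaluates it (524288)
        int k = 6;          // number of seeds in the demo
        long n = 50_000;    // ids inserted by the demo's main()
        System.out.printf("estimated false-positive rate after %d inserts: %.4f%n",
                n, falsePositiveRate(m, k, n));
    }
}
```

With these parameters the estimate stays a little under 1% even when the filter is at its fullest, so the printed count should land slightly below 50,000. For the 70 million uid workload the bit array would have to be sized accordingly.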