Mahout Bayes Algorithm Source Code Analysis (1)


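Following the previous post, which ran the Twenty Newsgroups Classification example in Mahout, this post analyzes the individual jobs that make up the algorithm. The first job is seqdirectory, which prints the following to the console: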

+ ./bin/mahout seqdirectory -i /home/mahout/mahout-work-mahout/20news-all -o /home/mahout/mahout-work-mahout/20news-seq
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/mahout/hadoop-1.0.4/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/mahout/mahout-d-0.7/mahout-examples-0.7-job.jar
13/08/26 23:38:49 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/home/mahout/mahout-work-mahout/20news-all], --keyPrefix=[], --output=[/home/mahout/mahout-work-mahout/20news-seq], --startPhase=[0], --tempDir=[temp]}
13/08/26 23:42:57 INFO driver.MahoutDriver: Program took 248530 ms (Minutes: 4.142166666666666)

The Java class behind this job is org.apache.mahout.text.SequenceFilesFromDirectory, packaged in mahout-examples-0.7-job.jar. Start by writing the following test class:

package mahout.fansy.test.bayes;

import org.apache.mahout.text.SequenceFilesFromDirectory;

public class TestSeqdirectory {

  /**
   * @param args
   * @throws Exception
   */
  public static void main(String[] args) throws Exception {
    // SequenceFilesFromDirectory sf = new SequenceFilesFromDirectory();
    String[] arg = {"-fs", "ubuntu:9000", "-jt", "ubuntu:9001",
        "-i", "/home/mahout/mahout-work-mahout/20news-all",
        "-o", "/home/mahout/mahout-work-mahout0/20news-seq"};
    SequenceFilesFromDirectory.main(arg);
  }
}

Then set a breakpoint and step through the data flow. The calls proceed as follows:

1. SequenceFilesFromDirectory.main() --> run() (line 60 of the SequenceFilesFromDirectory class). The source is:

public int run(String[] args) throws Exception {
  addOptions();

  if (parseArguments(args) == null) {
    return -1;
  }

  Map<String, String> options = parseOptions();
  Path input = getInputPath();
  Path output = getOutputPath();
  if (hasOption(DefaultOptionCreator.OVERWRITE_OPTION)) {
    Configuration conf = new Configuration();
    HadoopUtil.delete(conf, output);
  }
  String keyPrefix = getOption(KEY_PREFIX_OPTION[0]);
  Charset charset = Charset.forName(getOption(CHARSET_OPTION[0]));
  Configuration conf = getConf();
  FileSystem fs = FileSystem.get(input.toUri(), conf);
  ChunkedWriter writer = new ChunkedWriter(conf, Integer.parseInt(options.get(CHUNK_SIZE_OPTION[0])), output);

  try {
    SequenceFilesFromDirectoryFilter pathFilter;
    String fileFilterClassName = options.get(FILE_FILTER_CLASS_OPTION[0]);
    if (PrefixAdditionFilter.class.getName().equals(fileFilterClassName)) {
      pathFilter = new PrefixAdditionFilter(conf, keyPrefix, options, writer, charset, fs);
    } else {
      Class<? extends SequenceFilesFromDirectoryFilter> pathFilterClass =
          Class.forName(fileFilterClassName).asSubclass(SequenceFilesFromDirectoryFilter.class);
      Constructor<? extends SequenceFilesFromDirectoryFilter> constructor =
          pathFilterClass.getConstructor(Configuration.class,
                                         String.class,
                                         Map.class,
                                         ChunkedWriter.class,
                                         Charset.class,
                                         FileSystem.class);
      // As quoted, only five arguments are passed here (charset is missing),
      // although the constructor looked up above declares six parameters.
      pathFilter = constructor.newInstance(conf, keyPrefix, options, writer, fs);
    }
    fs.listStatus(input, pathFilter);
  } finally {
    Closeables.closeQuietly(writer);
  }
  return 0;
}

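Everything up to the try block only sets up basic parameters. Inside the try block, the check if (PrefixAdditionFilter.class.getName().equals(fileFilterClassName)) succeeds because all the other options were left at their defaults (the console output shows --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter]), so execution goes straight into the if branch: pathFilter = new PrefixAdditionFilter(conf, keyPrefix, options, writer, charset, fs);. This statement is still just parameter setup. The most important statement is the one right above finally: fs.listStatus(input, pathFilter);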
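The filter selection in run() can be imitated in plain Java without Hadoop or Mahout on the classpath. The following is only a sketch of the pattern, not Mahout code: BaseFilter, DefaultFilter, CustomFilter and create() are hypothetical names invented for illustration, and the two String parameters merely stand in for the real constructor arguments. It shows the default class being constructed directly and any other class name being resolved through reflection.

```java
import java.lang.reflect.Constructor;

public class FilterFactoryDemo {

    // Hypothetical stand-in for SequenceFilesFromDirectoryFilter.
    public abstract static class BaseFilter {
        final String prefix;
        final String charset;

        protected BaseFilter(String prefix, String charset) {
            this.prefix = prefix;
            this.charset = charset;
        }
    }

    // Hypothetical stand-in for the default PrefixAdditionFilter.
    public static class DefaultFilter extends BaseFilter {
        public DefaultFilter(String prefix, String charset) {
            super(prefix, charset);
        }
    }

    // Hypothetical user-supplied filter that is only reachable through reflection.
    public static class CustomFilter extends BaseFilter {
        public CustomFilter(String prefix, String charset) {
            super(prefix, charset);
        }
    }

    // Default class name: construct directly. Anything else: look up a public
    // constructor with the expected parameter types and pass exactly that many arguments.
    public static BaseFilter create(String className, String prefix, String charset) {
        if (DefaultFilter.class.getName().equals(className)) {
            return new DefaultFilter(prefix, charset);
        }
        try {
            Class<? extends BaseFilter> cls =
                Class.forName(className).asSubclass(BaseFilter.class);
            Constructor<? extends BaseFilter> ctor =
                cls.getConstructor(String.class, String.class);
            return ctor.newInstance(prefix, charset);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create filter " + className, e);
        }
    }

    public static void main(String[] args) {
        BaseFilter a = create(DefaultFilter.class.getName(), "", "UTF-8");
        BaseFilter b = create(CustomFilter.class.getName(), "", "UTF-8");
        System.out.println(a.getClass().getSimpleName());
        System.out.println(b.getClass().getSimpleName());
    }
}
```

With the default options used in this post, only the first branch is ever taken, which is why the reflective path does not show up while stepping through the job.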

2. fs.listStatus() --> listStatus() (line 865 of FileSystem):

public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException {
  ArrayList<FileStatus> results = new ArrayList<FileStatus>();
  listStatus(results, f, filter);
  return results.toArray(new FileStatus[results.size()]);
}

The second line of this method calls an overloaded listStatus() method, which is also defined in the same class:

private void listStatus(ArrayList<FileStatus> results, Path f,
    PathFilter filter) throws IOException {
  FileStatus listing[] = listStatus(f);
  if (listing != null) {
    for (int i = 0; i < listing.length; i++) {
      if (filter.accept(listing[i].getPath())) {
        results.add(listing[i]);
      }
    }
  }
}

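3. At first glance this only populates results and writes nothing to a file. So where does the writing happen? It is well hidden inside the if condition: the accept() call there is what triggers the write. This is the accept method at line 85 of SequenceFilesFromDirectoryFilter: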

public final boolean accept(Path current) {
  log.debug("CURRENT: {}", current.getName());
  try {
    for (FileStatus fst : fs.listStatus(current)) {
      log.debug("CHILD: {}", fst.getPath().getName());
      process(fst, current);
    }
  } catch (IOException ioe) {
    throw new IllegalStateException(ioe);
  }
  return false;
}

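The same trick, a filter that does its real work as a side effect of accept() and then rejects every path, can be reproduced with the standard library alone. The sketch below is not Hadoop code: it uses java.io.File.listFiles(FileFilter) in place of FileSystem.listStatus(Path, PathFilter), and RecordingFilter and demo() are hypothetical names. It builds a tiny temporary directory tree shaped like the 20news data and lists it through the filter.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SideEffectFilterDemo {

    // Hypothetical filter: does its real work inside accept() and then rejects the path.
    public static class RecordingFilter implements FileFilter {
        public final List<String> seen = new ArrayList<>();

        @Override
        public boolean accept(File current) {
            File[] children = current.listFiles(); // null if current is not a directory
            if (children != null) {
                for (File child : children) {
                    seen.add(current.getName() + "/" + child.getName());
                }
            }
            return false; // nothing is ever added to the caller's result array
        }
    }

    // Builds <tmp>/alt.atheism/49960, lists <tmp> through the filter, and reports
    // how many entries were accepted and what the filter saw along the way.
    public static String demo() {
        try {
            Path root = Files.createTempDirectory("news");
            Path group = Files.createDirectory(root.resolve("alt.atheism"));
            Path article = group.resolve("49960");
            Files.write(article, "hello\n".getBytes(StandardCharsets.UTF_8));

            RecordingFilter filter = new RecordingFilter();
            File[] accepted = root.toFile().listFiles(filter);
            String summary = accepted.length + " " + filter.seen;

            // Clean up the temporary tree.
            Files.delete(article);
            Files.delete(group);
            Files.delete(root);
            return summary;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "0 [alt.atheism/49960]"
    }
}
```

The listing itself comes back empty, yet every file has been visited. This matches what the quoted accept() does: it always returns false, so the results list built by listStatus() is never used.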
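Note the call to process(): all the writing happens there. Since accept() always returns false, the results list in listStatus() stays empty, and the listing is used purely to drive this traversal. The process method in question is the one at line 48 of PrefixAdditionFilter: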

protected void process(FileStatus fst, Path current) throws IOException {
  FileSystem fs = getFs();
  ChunkedWriter writer = getWriter();
  if (fst.isDir()) {
    String dirPath = getPrefix() + Path.SEPARATOR + current.getName() + Path.SEPARATOR + fst.getPath().getName();
    fs.listStatus(fst.getPath(),
                  new PrefixAdditionFilter(getConf(), dirPath, getOptions(), writer, getCharset(), fs));
  } else {
    InputStream in = null;
    try {
      in = fs.open(fst.getPath());

      StringBuilder file = new StringBuilder();
      for (String aFit : new FileLineIterable(in, getCharset(), false)) {
        file.append(aFit).append('\n');
      }
      String name = current.getName().equals(fst.getPath().getName())
          ? current.getName()
          : current.getName() + Path.SEPARATOR + fst.getPath().getName();
      writer.write(getPrefix() + Path.SEPARATOR + name, file.toString());
    } finally {
      Closeables.closeQuietly(in);
    }
  }
}

Let's go through this code in detail:

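Take the 20news data as an example. The first time this method is called, fst points to hdfs://ubuntu:9000/home/mahout/mahout-work-mahout/20news-all/alt.atheism/49960. This is a file, not a directory, so execution enters the else branch. The file is opened with in = fs.open(fst.getPath()); and a temporary StringBuilder variable, file, is created to hold the whole file. The for loop then reads the file line by line and appends each line to file. Finally, writer.write() passes the key and the contents of file to the output writer; this can be verified by inspecting the writer variable in the debugger.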


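Note that nothing has been written to the output file at this point. As observed in the debugger, the content accumulates in the writer's buffer, and the buffer is only written to the output file at the end, after the files have been read into writer.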
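The else branch of process() boils down to two steps: build a key from the prefix, the parent directory name and the file name, then read the file line by line into a single string value. The sketch below redoes both steps on plain strings, assuming nothing beyond the standard library. KeyValueSketch, buildKey() and readAll() are hypothetical names, a LinkedHashMap stands in for ChunkedWriter, and a BufferedReader stands in for FileLineIterable.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyValueSketch {

    static final String SEPARATOR = "/"; // same value as Hadoop's Path.SEPARATOR

    // Mirrors the name/key expression quoted from process(), on plain strings.
    public static String buildKey(String prefix, String currentName, String fileName) {
        String name = currentName.equals(fileName)
            ? currentName
            : currentName + SEPARATOR + fileName;
        return prefix + SEPARATOR + name;
    }

    // Reads the input line by line and re-joins the lines with '\n',
    // like the for loop that fills the StringBuilder in process().
    public static String readAll(BufferedReader in) {
        StringBuilder file = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                file.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file.toString();
    }

    public static void main(String[] args) {
        // Hypothetical in-memory stand-in for ChunkedWriter: key -> file content.
        Map<String, String> writer = new LinkedHashMap<>();
        BufferedReader in = new BufferedReader(new StringReader("From: someone\nSubject: test"));
        writer.put(buildKey("", "alt.atheism", "49960"), readAll(in));
        System.out.println(writer.keySet()); // prints "[/alt.atheism/49960]"
    }
}
```

With the default empty --keyPrefix, the key for the first article therefore comes out as /alt.atheism/49960, so the newsgroup name travels with every record as part of its key.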


Share, enjoy, grow


Please credit the source when reposting: http://blog.csdn.net/fansy1990


