Kafka#4: Storage Design


Kafka series:

  • Kafka#1: QuickStart
  • Kafka#2: Message Queue
  • Kafka#3: Distributed Design

Storage mechanism

Kafka relies heavily on the filesystem for storing and caching messages.

You might object: disk access is slow, so how can performance be guaranteed? This is exactly the kind of problem the operating system already helps us solve.

To compensate for this performance divergence modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.

The combination of an in-process cache and direct I/O mentioned above is what database systems typically use. For background on the pagecache, see the earlier article.

First, let's look at what files live under log.dir:

    $ tree /tmp/kafka-logs-1/
    /tmp/kafka-logs-1/
    ├── hello-1
    │   ├── 00000000000000000000.index
    │   └── 00000000000000000000.log
    ├── hello-2
    │   ├── 00000000000000000000.index
    │   └── 00000000000000000000.log
    ├── recovery-point-offset-checkpoint
    └── replication-offset-checkpoint
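As an aside on the file names above: each segment is named after the offset of the first message it contains, zero-padded to 20 digits. A quick sketch of that mapping (the helper segmentFileNames is illustrative, not a Kafka API):

    // Illustrative only: how a segment's base offset maps to its file names
    // (20-digit, zero-padded), mirroring the naming seen in the tree above.
    def segmentFileNames(baseOffset: Long): (String, String) = {
      val prefix = "%020d".format(baseOffset)
      (s"$prefix.log", s"$prefix.index")
    }

    // segmentFileNames(0L)      => (00000000000000000000.log, 00000000000000000000.index)
    // segmentFileNames(368769L) => (00000000000000368769.log, 00000000000000368769.index)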

pagecache

Since Kafka relies on the OS pagecache, there has to be a periodic flush to disk. This is handled by LogManager, which manages all the Logs on the local broker; each partition maps to one Log:

  private val logs = new Pool[TopicAndPartition, Log]()

Here is LogManager's startup:

    /**
     *  Start the background threads to flush logs and do log cleanup
     */
    def startup() {
      /* Schedule the cleanup task to delete old logs */
      if(scheduler != null) {
        info("Starting log cleanup with a period of %d ms.".format(retentionCheckMs))
        scheduler.schedule("kafka-log-retention",
                           cleanupLogs,
                           delay = InitialTaskDelayMs,
                           period = retentionCheckMs,
                           TimeUnit.MILLISECONDS)
        info("Starting log flusher with a default period of %d ms.".format(flushCheckMs))
        scheduler.schedule("kafka-log-flusher",
                           flushDirtyLogs,
                           delay = InitialTaskDelayMs,
                           period = flushCheckMs,
                           TimeUnit.MILLISECONDS)
        scheduler.schedule("kafka-recovery-point-checkpoint",
                           checkpointRecoveryPointOffsets,
                           delay = InitialTaskDelayMs,
                           period = flushCheckpointMs,
                           TimeUnit.MILLISECONDS)
      }
      if(cleanerConfig.enableCleaner)
        cleaner.startup()
    }

The flush itself is ultimately performed by each Log, in Log#flush:

    /**
     * Flush all log segments
     */
    def flush(): Unit = flush(this.logEndOffset)

    /**
     * Flush log segments for all offsets up to offset-1
     * @param offset The offset to flush up to (non-inclusive); the new recovery point
     */
    def flush(offset: Long) : Unit = {
      if (offset <= this.recoveryPoint)
        return
      debug("Flushing log '" + name + " up to offset " + offset + ", last flushed: " + lastFlushTime + " current time: " +
            time.milliseconds + " unflushed = " + unflushedMessages)
      // flush the LogSegments between recoveryPoint and logEndOffset to disk
      for(segment <- logSegments(this.recoveryPoint, offset))
        segment.flush()
      lock synchronized {
        if(offset > this.recoveryPoint) {
          // advance recoveryPoint to logEndOffset
          this.recoveryPoint = offset
          lastflushedTime.set(time.milliseconds)
        }
      }
    }

LogSegment#flush

    /**
     * Flush this log segment to disk
     */
    @threadsafe
    def flush() {
      LogFlushStats.logFlushTimer.time {
        log.flush()
        index.flush()
      }
    }

A Log manages all the LogSegments of its partition:

    /* the actual segments of the log */
    private val segments: ConcurrentNavigableMap[java.lang.Long/*startOffset*/, LogSegment] =
      new ConcurrentSkipListMap[java.lang.Long, LogSegment]

A LogSegment consists of one log file and one index file, which correspond on disk to the 00000000000000000000.log and 00000000000000000000.index files shown above.

The relationship between Log, LogSegment, log file, and index file is summarized below:

    +-------------------------------------------------------------------------+
    | Log                                                                     |
    | +-----------------------------+---------+-----------------------------+ |
    | | LogSegment                  |         | LogSegment                  | |
    | | +----------+ +------------+ |         | +----------+ +------------+ | |
    | | | log file | | index file | | ... ... | | log file | | index file | | |
    | | +----------+ +------------+ |         | +----------+ +------------+ | |
    | +-----------------------------+---------+-----------------------------+ |
    +-------------------------------------------------------------------------+

Back to LogManager, here is another scheduled task, LogManager#checkpointRecoveryPointOffsets:

    /**
     * Write out the current recovery point for all logs to a text file in the log directory
     * to avoid recovering the whole log on startup.
     */
    def checkpointRecoveryPointOffsets() {
      val recoveryPointsByDir = this.logsByTopicPartition.groupBy(_._2.dir.getParent.toString)
      for(dir <- logDirs) {
        val recoveryPoints = recoveryPointsByDir.get(dir.toString)
        if(recoveryPoints.isDefined)
          this.recoveryPointCheckpoints(dir).write(recoveryPoints.get.mapValues(_.recoveryPoint))
      }
    }

All this task does is write each log's recoveryPoint to the recovery-point-offset-checkpoint file; everything in the log before the recoveryPoint has already been flushed to disk.
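The checkpoint file itself is just a small text file. Below is a minimal sketch (not Kafka's OffsetCheckpoint class) of writing such a file, assuming the layout is a version line, an entry count, and then one "topic partition offset" line per partition; the helper writeCheckpoint and the sample offsets are made up for illustration.

    import java.io.PrintWriter
    import java.nio.file.Paths

    // Minimal sketch of a recovery-point checkpoint writer:
    // version line, entry count, then one "topic partition offset" line per partition.
    def writeCheckpoint(path: String, recoveryPoints: Map[(String, Int), Long]): Unit = {
      val writer = new PrintWriter(Paths.get(path).toFile)
      try {
        writer.println(0)                    // format version (assumed)
        writer.println(recoveryPoints.size)  // number of entries
        for (((topic, partition), offset) <- recoveryPoints)
          writer.println(s"$topic $partition $offset")
      } finally writer.close()
    }

    // writeCheckpoint("/tmp/kafka-logs-1/recovery-point-offset-checkpoint",
    //                 Map(("hello", 1) -> 42L, ("hello", 2) -> 17L))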

high watermark

As for the other file, replication-offset-checkpoint, it stores the HighWatermark of each replica; the periodic write is performed by ReplicaManager:

    def startHighWaterMarksCheckPointThread() = {
      if(highWatermarkCheckPointThreadStarted.compareAndSet(false, true))
        scheduler.schedule("highwatermark-checkpoint", checkpointHighWatermarks, period = config.replicaHighWatermarkCheckpointIntervalMs, unit = TimeUnit.MILLISECONDS)
    }

    /**
     * Flushes the highwatermark value for all partitions to the highwatermark file
     */
    def checkpointHighWatermarks() {
      val replicas = allPartitions.values.map(_.getReplica(config.brokerId)).collect{case Some(replica) => replica}
      val replicasByDir = replicas.filter(_.log.isDefined).groupBy(_.log.get.dir.getParentFile.getAbsolutePath)
      for((dir, reps) <- replicasByDir) {
        val hwms = reps.map(r => (new TopicAndPartition(r) -> r.highWatermark)).toMap
        try {
          highWatermarkCheckpoints(dir).write(hwms)
        } catch {
          case e: IOException =>
            fatal("Error writing to highwatermark file: ", e)
            Runtime.getRuntime().halt(1)
        }
      }
    }

So what exactly is this HighWatermark? The high watermark (HW) is the offset of the last committed message. Keep that in mind: the HW marks the last committed message, and broker recovery relies on it. Next, let's see how the HW gets incremented.

When the leader receives a ProduceRequest and stores the messages, it calls Partition#appendMessagesToLeader:

    def appendMessagesToLeader(messages: ByteBufferMessageSet) = {
      leaderIsrUpdateLock synchronized {
        val leaderReplicaOpt = leaderReplicaIfLocal()
        leaderReplicaOpt match {
          case Some(leaderReplica) =>
            val log = leaderReplica.log.get
            // write the data; note that the leader partition's logEndOffset has already advanced at this point
            val info = log.append(messages, assignOffsets = true)
            // we may need to increment high watermark since ISR could be down to 1
            maybeIncrementLeaderHW(leaderReplica)
            info
          case None =>
            throw new NotLeaderForPartitionException("Leader not local for partition [%s,%d] on broker %d"
              .format(topic, partitionId, localBrokerId))
        }
      }
    }

    /**
     * There is no need to acquire the leaderIsrUpdate lock here since all callers of this private API acquire that lock
     * @param leaderReplica
     */
    private def maybeIncrementLeaderHW(leaderReplica: Replica) {
      // collect the logEndOffset of every ISR member (including the leader)
      val allLogEndOffsets = inSyncReplicas.map(_.logEndOffset)
      // the smallest logEndOffset is the candidate for the new HW
      val newHighWatermark = allLogEndOffsets.min
      val oldHighWatermark = leaderReplica.highWatermark
      if(newHighWatermark > oldHighWatermark) {
        leaderReplica.highWatermark = newHighWatermark
        debug("Highwatermark for partition [%s,%d] updated to %d".format(topic, partitionId, newHighWatermark))
      }
      else
        debug("Old hw for partition [%s,%d] is %d. New hw is %d. All leo's are %s"
          .format(topic, partitionId, oldHighWatermark, newHighWatermark, allLogEndOffsets.mkString(",")))
    }

Replica#logEndOffset

    def logEndOffset = {
      if (isLocal)
        // for a local replica, read it straight from the log
        log.get.logEndOffset
      else
        logEndOffsetValue.get()
    }

If the other replicas in the ISR have not yet fetched the new data from the leader, then inSyncReplicas.map(_.logEndOffset).min cannot yet become the new HW. Let's look at how the HW is handled when an ISR replica syncs data, in ReplicaManager#recordFollowerPosition:

    def recordFollowerPosition(topic: String, partitionId: Int, replicaId: Int, offset: Long) = {
      val partitionOpt = getPartition(topic, partitionId)
      if(partitionOpt.isDefined) {
        partitionOpt.get.updateLeaderHWAndMaybeExpandIsr(replicaId, offset)
      } else {
        warn("While recording the follower position, the partition [%s,%d] hasn't been created, skip updating leader HW".format(topic, partitionId))
      }
    }

Partition#updateLeaderHWAndMaybeExpandIsr

    def updateLeaderHWAndMaybeExpandIsr(replicaId: Int, offset: Long) {
      leaderIsrUpdateLock synchronized {
        debug("Recording follower %d position %d for partition [%s,%d].".format(replicaId, offset, topic, partitionId))
        val replicaOpt = getReplica(replicaId)
        if(!replicaOpt.isDefined) {
          throw new NotAssignedReplicaException(("Leader %d failed to record follower %d's position %d for partition [%s,%d] since the replica %d" +
            " is not recognized to be one of the assigned replicas %s for partition [%s,%d]").format(localBrokerId, replicaId,
              offset, topic, partitionId, replicaId, assignedReplicas().map(_.brokerId).mkString(","), topic, partitionId))
        }
        val replica = replicaOpt.get
        // update this ISR member's logEndOffset
        replica.logEndOffset = offset

        // check if this replica needs to be added to the ISR
        leaderReplicaIfLocal() match {
          case Some(leaderReplica) =>
            val replica = getReplica(replicaId).get
            val leaderHW = leaderReplica.highWatermark
            // For a replica to get added back to ISR, it has to satisfy 3 conditions-
            // 1. It is not already in the ISR
            // 2. It is part of the assigned replica list. See KAFKA-1097
            // 3. It's log end offset >= leader's highwatermark
            if (!inSyncReplicas.contains(replica) && assignedReplicas.map(_.brokerId).contains(replicaId) && replica.logEndOffset >= leaderHW) {
              // expand ISR
              val newInSyncReplicas = inSyncReplicas + replica
              info("Expanding ISR for partition [%s,%d] from %s to %s"
                   .format(topic, partitionId, inSyncReplicas.map(_.brokerId).mkString(","), newInSyncReplicas.map(_.brokerId).mkString(",")))
              // update ISR in ZK and cache
              updateIsr(newInSyncReplicas)
              replicaManager.isrExpandRate.mark()
            }
            // by now every ISR member's logEndOffset may already be ahead of the old HW
            maybeIncrementLeaderHW(leaderReplica)
          case None => // nothing to do if no longer leader
        }
      }
    }

In other words, once the leader has received fetch requests from every replica in the ISR, it advances the HW to the offset carried in those fetch requests, i.e. the offset of the last committed message. The short sketch below walks through a concrete example.
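Here is a minimal toy model of that rule (HW = min of the ISR members' log end offsets), with made-up LEO values rather than Kafka's actual Replica and Partition classes:

    // Toy model of maybeIncrementLeaderHW: the HW can only advance to the
    // smallest log end offset across the ISR (leader included).
    case class Replica(brokerId: Int, var logEndOffset: Long)

    def newHighWatermark(isr: Seq[Replica], oldHW: Long): Long = {
      val candidate = isr.map(_.logEndOffset).min
      if (candidate > oldHW) candidate else oldHW
    }

    val leader   = Replica(0, logEndOffset = 120)  // just appended new messages
    val follower = Replica(1, logEndOffset = 100)  // hasn't fetched them yet

    newHighWatermark(Seq(leader, follower), oldHW = 100)  // still 100
    follower.logEndOffset = 120                           // follower's fetch catches up
    newHighWatermark(Seq(leader, follower), oldHW = 100)  // now 120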

The important offsets are summarized below.

                                HighWatermark
                                      |                   |
                                      |<-- uncommitted -->|
                                      |                   |
        +-----------------------------+-------------------+
        | MESSAGE | ... ... | MESSAGE |      MESSAGE      |
        +-------------------+-----------------------------+
                            |                             |
                            |<------   unflushed   ------>|
                            |                             |
                       RecoveryPoint                 LogEndOffset

offset index

Next, let's see how the index file works. Its job is to quickly translate the targetOffset of a fetch request into a physical position in the corresponding log file. First, the write path of the index file, starting from Log#append:

    /**
     * Append this message set to the active segment of the log, rolling over to a fresh segment if necessary.
     *
     * This method will generally be responsible for assigning offsets to the messages,
     * however if the assignOffsets=false flag is passed we will only check that the existing offsets are valid.
     *
     * @param messages The message set to append
     * @param assignOffsets Should the log assign offsets to this message set or blindly apply what it is given
     *
     * @throws KafkaStorageException If the append fails due to an I/O error.
     *
     * @return Information about the appended messages including the first and last offset.
     */
    def append(messages: ByteBufferMessageSet, assignOffsets: Boolean = true): LogAppendInfo = {
      val appendInfo = analyzeAndValidateMessageSet(messages)

      // if we have any valid messages, append them to the log
      if(appendInfo.shallowCount == 0)
        return appendInfo

      // trim any invalid bytes or partial messages before appending it to the on-disk log
      var validMessages = trimInvalidBytes(messages)

      try {
        // they are valid, insert them in the log
        lock synchronized {
          appendInfo.firstOffset = nextOffset.get

          // decide whether a new log file and index file need to be created
          // maybe roll the log if this segment is full
          val segment = maybeRoll()

          if(assignOffsets) {
            // assign offsets to the messageset
            val offset = new AtomicLong(nextOffset.get)
            try {
              // the offset is written into each message's header
              validMessages = validMessages.assignOffsets(offset, appendInfo.codec)
            } catch {
              case e: IOException => throw new KafkaException("Error in validating messages while appending to log '%s'".format(name), e)
            }
            appendInfo.lastOffset = offset.get - 1
          } else {
            // we are taking the offsets we are given
            if(!appendInfo.offsetsMonotonic || appendInfo.firstOffset < nextOffset.get)
              throw new IllegalArgumentException("Out of order offsets found in " + messages)
          }

          // Check if the message sizes are valid. This check is done after assigning offsets to ensure the comparison
          // happens with the new message size (after re-compression, if any)
          for(messageAndOffset <- validMessages.shallowIterator) {
            if(MessageSet.entrySize(messageAndOffset.message) > config.maxMessageSize)
              throw new MessageSizeTooLargeException("Message size is %d bytes which exceeds the maximum configured message size of %d."
                .format(MessageSet.entrySize(messageAndOffset.message), config.maxMessageSize))
          }

          // this is where the data actually gets stored
          // now append to the log
          segment.append(appendInfo.firstOffset, validMessages)

          // increment the log end offset
          nextOffset.set(appendInfo.lastOffset + 1)

          trace("Appended message set to log %s with first offset: %d, next offset: %d, and messages: %s"
                  .format(this.name, appendInfo.firstOffset, nextOffset.get(), validMessages))

          if(unflushedMessages >= config.flushInterval)
            flush()

          appendInfo
        }
      } catch {
        case e: IOException => throw new KafkaStorageException("I/O exception in append to log '%s'".format(name), e)
      }
    }

    /**
     * Roll the log over to a new empty log segment if necessary
     * @return The currently active segment after (perhaps) rolling to a new segment
     */
    private def maybeRoll(): LogSegment = {
      val segment = activeSegment
      // roll if any of the configured limits has been exceeded
      if (segment.size > config.segmentSize ||
          segment.size > 0 && time.milliseconds - segment.created > config.segmentMs ||
          segment.index.isFull) {
        debug("Rolling new log segment in %s (log_size = %d/%d, index_size = %d/%d, age_ms = %d/%d)."
              .format(name,
                      segment.size,
                      config.segmentSize,
                      segment.index.entries,
                      segment.index.maxEntries,
                      time.milliseconds - segment.created,
                      config.segmentMs))
        roll()
      } else {
        segment
      }
    }

    /**
     * Roll the log over to a new active segment starting with the current logEndOffset.
     * This will trim the index to the exact size of the number of entries it currently contains.
     * @return The newly rolled segment
     */
    def roll(): LogSegment = {
      val start = time.nanoseconds
      lock synchronized {
        val newOffset = logEndOffset
        val logFile = logFilename(dir, newOffset)
        val indexFile = indexFilename(dir, newOffset)
        for(file <- List(logFile, indexFile); if file.exists) {
          warn("Newly rolled segment file " + file.getName + " already exists; deleting it first")
          file.delete()
        }

        segments.lastEntry() match {
          case null =>
          case entry => entry.getValue.index.trimToValidSize()
        }

        // create the new log file and index file
        val segment = new LogSegment(dir,
                                     startOffset = newOffset,
                                     indexIntervalBytes = config.indexInterval,
                                     maxIndexSize = config.maxIndexSize,
                                     time = time)
        // register the new segment in the segments map
        val prev = addSegment(segment)
        if(prev != null)
          throw new KafkaException("Trying to roll a new log segment for topic partition %s with start offset %d while it already exists.".format(name, newOffset))

        // schedule an asynchronous flush of the old segment
        scheduler.schedule("flush-log", () => flush(newOffset), delay = 0L)

        info("Rolled new log segment for '" + name + "' in %.0f ms.".format((System.nanoTime - start) / (1000.0*1000.0)))

        segment
      }
    }

    /**
     * Add the given segment to the segments in this log. If this segment replaces an existing segment, delete it.
     * @param segment The segment to add
     */
    def addSegment(segment: LogSegment) = this.segments.put(segment.baseOffset, segment)

One detail deserves special attention: the offset is written into each message's header, in ByteBufferMessageSet#assignOffsets:

    /**
     * Update the offsets for this message set. This method attempts to do an in-place conversion
     * if there is no compression, but otherwise recopies the messages
     */
    private[kafka] def assignOffsets(offsetCounter: AtomicLong, codec: CompressionCodec): ByteBufferMessageSet = {
      if(codec == NoCompressionCodec) {
        // do an in-place conversion
        var position = 0
        buffer.mark()
        while(position < sizeInBytes - MessageSet.LogOverhead) {
          buffer.position(position)
          buffer.putLong(offsetCounter.getAndIncrement())
          position += MessageSet.LogOverhead + buffer.getInt()
        }
        buffer.reset()
        this
      } else {
        // messages are compressed, crack open the messageset and recompress with correct offset
        val messages = this.internalIterator(isShallow = false).map(_.message)
        new ByteBufferMessageSet(compressionCodec = codec, offsetCounter = offsetCounter, messages = messages.toBuffer:_*)
      }
    }

Now for the actual persistence of the data, LogSegment#append:

    /**
     * Append the given messages starting with the given offset. Add
     * an entry to the index if needed.
     *
     * It is assumed this method is being called from within a lock.
     *
     * @param offset The first offset in the message set.
     * @param messages The messages to append.
     */
    @nonthreadsafe
    def append(offset: Long, messages: ByteBufferMessageSet) {
      if (messages.sizeInBytes > 0) {
        trace("Inserting %d bytes at offset %d at position %d".format(messages.sizeInBytes, offset, log.sizeInBytes()))
        // a new index entry is only written once more than the configured
        // index.interval.bytes of log data has accumulated since the last one
        // append an entry to the index (if needed)
        if(bytesSinceLastIndexEntry > indexIntervalBytes) {
          index.append(offset, log.sizeInBytes())
          this.bytesSinceLastIndexEntry = 0
        }
        // append the messages
        log.append(messages)
        this.bytesSinceLastIndexEntry += messages.sizeInBytes
      }
    }
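This is why the index is sparse and stays small. As a back-of-the-envelope sketch, assuming the default index.interval.bytes of 4096 (check your broker config) and the 8-byte entry size we will see below in OffsetIndex#append:

    val segmentBytes  = 1L << 30          // a 1 GB segment
    val indexInterval = 4096              // one index entry per ~4 KB of log data (assumed default)
    val entryBytes    = 8                 // 4-byte relative offset + 4-byte position
    val indexBytes    = segmentBytes / indexInterval * entryBytes   // ≈ 2 MB of index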

And the index file write itself, OffsetIndex#append:

    /**
     * Append an entry for the given offset/location pair to the index. This entry must have a larger offset than all subsequent entries.
     */
    def append(offset: Long, position: Int) {
      inLock(lock) {
        require(!isFull, "Attempt to append to a full index (size = " + size + ").")
        if (size.get == 0 || offset > lastOffset) {
          debug("Adding index entry %d => %d to %s.".format(offset, position, file.getName))
          // first column of the index entry: the relative offset
          this.mmap.putInt((offset - baseOffset).toInt)
          // second column of the index entry: the caller passes log.sizeInBytes(),
          // so this is the physical position in the log file
          this.mmap.putInt(position)
          this.size.incrementAndGet()
          this.lastOffset = offset
          require(entries * 8 == mmap.position, entries + " entries but file position in index is " + mmap.position + ".")
        } else {
          throw new InvalidOffsetException("Attempt to append an offset (%d) to position %d no larger than the last offset appended (%d) to %s."
            .format(offset, entries, lastOffset, file.getAbsolutePath))
        }
      }
    }

Next, let's see how a fetch request uses the index file, starting from Log#read:

    /**
     * Read messages from the log
     * @param startOffset The offset to begin reading at
     * @param maxLength The maximum number of bytes to read
     * @param maxOffset -The offset to read up to, exclusive. (i.e. the first offset NOT included in the resulting message set).
     *
     * @throws OffsetOutOfRangeException If startOffset is beyond the log end offset or before the base offset of the first segment.
     * @return The messages read
     */
    def read(startOffset: Long, maxLength: Int, maxOffset: Option[Long] = None): MessageSet = {
      trace("Reading %d bytes from offset %d in log %s of length %d bytes".format(maxLength, startOffset, name, size))

      // check if the offset is valid and in range
      val next = nextOffset.get
      if(startOffset == next)
        return MessageSet.Empty

      // step 1: locate the LogSegment, i.e. the XXXXXXX.log and XXXXXXX.index pair;
      // segments is a ConcurrentSkipListMap
      var entry = segments.floorEntry(startOffset)

      // attempt to read beyond the log end offset is an error
      if(startOffset > next || entry == null)
        throw new OffsetOutOfRangeException("Request for offset %d but we only have log segments in the range %d to %d.".format(startOffset, segments.firstKey, next))

      // do the read on the segment with a base offset less than the target offset
      // but if that segment doesn't contain any messages with an offset greater than that
      // continue to read from successive segments until we get some messages or we reach the end of the log
      while(entry != null) {
        // step 2: read from the LogSegment
        val messages = entry.getValue.read(startOffset, maxOffset, maxLength)
        if(messages == null)
          entry = segments.higherEntry(entry.getKey)
        else
          return messages
      }

      // okay we are beyond the end of the last segment but less than the log end offset
      MessageSet.Empty
    }

LogSegment#read:

    /**
     * Read a message set from this segment beginning with the first offset >= startOffset. The message set will include
     * no more than maxSize bytes and will end before maxOffset if a maxOffset is specified.
     *
     * @param startOffset A lower bound on the first offset to include in the message set we read
     * @param maxSize The maximum number of bytes to include in the message set we read
     * @param maxOffset An optional maximum offset for the message set we read
     *
     * @return The message set read or null if the startOffset is larger than the largest offset in this log.
     */
    @threadsafe
    def read(startOffset: Long, maxOffset: Option[Long], maxSize: Int): MessageSet = {
      if(maxSize < 0)
        throw new IllegalArgumentException("Invalid max size for log read (%d)".format(maxSize))
      if(maxSize == 0)
        return MessageSet.Empty

      val logSize = log.sizeInBytes // this may change, need to save a consistent copy
      // the key step: translate the targetOffset into a physical position in the log file
      val startPosition = translateOffset(startOffset)

      // if the start position is already off the end of the log, return null
      if(startPosition == null)
        return null

      // calculate the length of the message set to read based on whether or not they gave us a maxOffset
      val length =
        maxOffset match {
          case None =>
            // no max offset, just use the max size they gave unmolested
            maxSize
          case Some(offset) => {
            // there is a max offset, translate it to a file position and use that to calculate the max read size
            if(offset < startOffset)
              throw new IllegalArgumentException("Attempt to read with a maximum offset (%d) less than the start offset (%d).".format(offset, startOffset))
            val mapping = translateOffset(offset, startPosition.position)
            val endPosition =
              if(mapping == null)
                logSize // the max offset is off the end of the log, use the end of the file
              else
                mapping.position
            min(endPosition - startPosition.position, maxSize)
          }
        }
      log.read(startPosition.position, length)
    }

    /**
     * Find the physical file position for the first message with offset >= the requested offset.
     *
     * The lowerBound argument is an optimization that can be used if we already know a valid starting position
     * in the file higher than the greatest-lower-bound from the index.
     *
     * @param offset The offset we want to translate
     * @param startingFilePosition A lower bound on the file position from which to begin the search. This is purely an optimization and
     * when omitted, the search will begin at the position in the offset index.
     *
     * @return The position in the log storing the message with the least offset >= the requested offset or null if no message meets this criteria.
     */
    @threadsafe
    private[log] def translateOffset(offset: Long, startingFilePosition: Int = 0): OffsetPosition = {
      val mapping = index.lookup(offset)
      log.searchFor(offset, max(mapping.position, startingFilePosition))
    }

There are two key methods here: index.lookup and log.searchFor.

First, index.lookup:

    /**
     * Find the largest offset less than or equal to the given targetOffset
     * and return a pair holding this offset and it's corresponding physical file position.
     *
     * @param targetOffset The offset to look up.
     *
     * @return The offset found and the corresponding file position for this offset.
     * If the target offset is smaller than the least entry in the index (or the index is empty),
     * the pair (baseOffset, 0) is returned.
     */
    def lookup(targetOffset: Long): OffsetPosition = {
      maybeLock(lock) {
        val idx = mmap.duplicate
        // binary-search for the index entry
        val slot = indexSlotFor(idx, targetOffset)
        if(slot == -1)
          OffsetPosition(baseOffset, 0)
        else
          OffsetPosition(baseOffset + relativeOffset(idx, slot), physical(idx, slot))
      }
    }

    /**
     * Find the slot in which the largest offset less than or equal to the given
     * target offset is stored.
     *
     * @param idx The index buffer
     * @param targetOffset The offset to look for
     *
     * @return The slot found or -1 if the least entry in the index is larger than the target offset or the index is empty
     */
    private def indexSlotFor(idx: ByteBuffer, targetOffset: Long): Int = {
      // compute the relativeOffset to search for
      // we only store the difference from the base offset so calculate that
      val relOffset = targetOffset - baseOffset

      // check if the index is empty
      if(entries == 0)
        return -1

      // check if the target offset is smaller than the least offset
      if(relativeOffset(idx, 0) > relOffset)
        return -1

      // binary search for the entry
      var lo = 0
      var hi = entries-1
      while(lo < hi) {
        val mid = ceil(hi/2.0 + lo/2.0).toInt
        val found = relativeOffset(idx, mid)
        // only an exact match on the entry's first column returns here
        if(found == relOffset)
          return mid
        else if(found < relOffset)
          lo = mid
        else
          hi = mid - 1
      }
      lo
    }

    // first column of the nth index entry
    /* return the nth offset relative to the base offset */
    private def relativeOffset(buffer: ByteBuffer, n: Int): Int = buffer.getInt(n * 8)

    // second column of the nth index entry
    /* return the nth physical position */
    private def physical(buffer: ByteBuffer, n: Int): Int = buffer.getInt(n * 8 + 4)

Now log.searchFor:

    /**
     * Search forward for the file position of the last offset that is greater than or equal to the target offset
     * and return its physical position. If no such offsets are found, return null.
     * @param targetOffset The offset to search for.
     * @param startingPosition The starting position in the file to begin searching from.
     */
    def searchFor(targetOffset: Long, startingPosition: Int): OffsetPosition = {
      var position = startingPosition
      // 12 bytes: an 8-byte offset plus a 4-byte message size
      val buffer = ByteBuffer.allocate(MessageSet.LogOverhead)
      val size = sizeInBytes()
      while(position + MessageSet.LogOverhead < size) {
        buffer.rewind()
        channel.read(buffer, position)
        if(buffer.hasRemaining)
          throw new IllegalStateException("Failed to read complete buffer for targetOffset %d startPosition %d in %s"
                                          .format(targetOffset, startingPosition, file.getAbsolutePath))
        buffer.rewind()
        // read the offset from this log entry
        val offset = buffer.getLong()
        if(offset >= targetOffset)
          return OffsetPosition(offset, position)
        // read the message size from this log entry
        val messageSize = buffer.getInt()
        if(messageSize < Message.MessageOverhead)
          throw new IllegalStateException("Invalid message size: " + messageSize)
        // skip to the next log entry
        position += MessageSet.LogOverhead + messageSize
      }
      null
    }
    val MessageSizeLength = 4
    val OffsetLength = 8
    val LogOverhead = MessageSizeLength + OffsetLength

To sum up the physical storage formats of the index file and the log file: each file is a sequence of entries (a worked lookup example follows the diagram).

    index entry format:
      +----------------------------------------------+
      |    relativeOffset=      |  physicalPosition  |
      | targetOffset-baseOffset |                    |  ... ...
      +-------------------------+--------------------+
      |<---  int, 4 bytes   --->|<-- int, 4 bytes -->|

    log entry format:
      +---------------------------------------------------------+
      |  targetOffset |  messageSize  |         MESSAGE         |  ... ...
      +---------------+---------------+-------------------------+
      |<-- 8 bytes -->|<-- 4 bytes -->|<-- messageSize bytes -->|
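To make the two-step lookup concrete, here is a self-contained toy model in plain Scala collections (not Kafka's mmap-backed OffsetIndex or FileMessageSet): a greatest-lower-bound lookup in the sparse index (the real OffsetIndex does this with a binary search over the mmap), followed by a forward scan of log entries from the returned position, mirroring index.lookup plus log.searchFor. All offsets, positions, and sizes below are made up:

    // Toy sparse index for a single segment with the given baseOffset.
    // indexEntries: (relativeOffset, physicalPosition), sorted by relativeOffset.
    // logEntries:   (offset, physicalPosition, messageSize), sorted by position.
    case class OffsetPosition(offset: Long, position: Int)

    def lookup(baseOffset: Long,
               indexEntries: Vector[(Int, Int)],
               targetOffset: Long): OffsetPosition = {
      val rel = (targetOffset - baseOffset).toInt
      // greatest index entry whose relative offset is <= rel (or the segment start)
      val candidates = indexEntries.takeWhile(_._1 <= rel)
      if (candidates.isEmpty) OffsetPosition(baseOffset, 0)
      else {
        val (r, pos) = candidates.last
        OffsetPosition(baseOffset + r, pos)
      }
    }

    def searchFor(logEntries: Vector[(Long, Int, Int)],
                  targetOffset: Long,
                  startingPosition: Int): Option[OffsetPosition] =
      logEntries.collectFirst {
        case (offset, pos, _) if pos >= startingPosition && offset >= targetOffset =>
          OffsetPosition(offset, pos)
      }

    // Example with a base offset of 368769 and an index entry every few messages.
    val base  = 368769L
    val index = Vector((0, 0), (4, 497), (8, 1025))   // sparse: not every offset is indexed
    val log   = Vector((368769L, 0, 120), (368770L, 132, 110), (368773L, 497, 105), (368775L, 900, 113))

    val start = lookup(base, index, targetOffset = 368775L)  // OffsetPosition(368773, 497)
    searchFor(log, 368775L, start.position)                  // Some(OffsetPosition(368775, 900))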

sendfile

The log file implementation, FileMessageSet, also uses a highly efficient file transfer mechanism: FileChannel#transferTo. See the earlier article for the JNI implementation behind this method; on Linux it takes advantage of zero copy (see the separate write-up on zero copy for details).

    /**
     * Write some of this set to the given channel.
     * @param destChannel The channel to write to.
     * @param writePosition The position in the message set to begin writing from.
     * @param size The maximum number of bytes to write
     * @return The number of bytes actually written.
     */
    def writeTo(destChannel: GatheringByteChannel, writePosition: Long, size: Int): Int = {
      // Ensure that the underlying size has not changed.
      val newSize = math.min(channel.size().toInt, end) - start
      if (newSize < _size.get()) {
        throw new KafkaException("Size of FileMessageSet %s has been truncated during write: old size %d, new size %d"
          .format(file.getAbsolutePath, _size.get(), newSize))
      }
      val bytesTransferred = channel.transferTo(start + writePosition, math.min(size, sizeInBytes), destChannel).toInt
      trace("FileMessageSet " + file.getAbsolutePath + " : bytes transferred : " + bytesTransferred
        + " bytes requested for transfer : " + math.min(size, sizeInBytes))
      bytesTransferred
    }
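For reference, here is a minimal standalone sketch (not Kafka code) of using FileChannel#transferTo to push a region of a file into any WritableByteChannel; on Linux this maps to sendfile, so the bytes never pass through user space. The helper sendFileRegion and the example paths are illustrative only:

    import java.nio.channels.{Channels, FileChannel, WritableByteChannel}
    import java.nio.file.{Paths, StandardOpenOption}

    def sendFileRegion(path: String, position: Long, count: Long, dest: WritableByteChannel): Long = {
      val fileChannel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)
      try {
        var transferred = 0L
        // transferTo may transfer fewer bytes than requested, so loop until done or EOF
        while (transferred < count) {
          val n = fileChannel.transferTo(position + transferred, count - transferred, dest)
          if (n <= 0) return transferred
          transferred += n
        }
        transferred
      } finally fileChannel.close()
    }

    // e.g. copy the first 4 KB of a segment file to stdout (a SocketChannel works the same way)
    // sendFileRegion("/tmp/kafka-logs-1/hello-1/00000000000000000000.log", 0L, 4096L,
    //                Channels.newChannel(System.out))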

References

  • https://kafka.apache.org/documentation.html
  • https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication