Handling data for failed Cassandra nodes: HintedHandOff

Source: Internet. Editor: 程序博客网. Time: 2024/06/11 11:02

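Cassandra's HintedHandOff feature preserves writes that could not reach a replica, either because the node was down or because the network between replica nodes dropped briefly. Once the nodes can reach each other again, the saved writes are replayed so that the replicas become consistent.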

1. Structure of the hints table

When a write cannot be delivered to a replica node, the data is saved in the hints system table.

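The hints system table is a composite column family. Its row key is target_id, the UUID that identifies the target node. The delivery code in section 3 obtains it with Gossiper.instance.getHostId(endpoint), so it is the node's host ID rather than a value derived from its token.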

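The composite column name has two components: hint_id, a time-based UUID generated when the hint is created, and message_version, the messaging version in effect when the hint was created.

The regular column is mutation, which holds the serialized RowMutation that needs to be delivered.

Each hint is therefore equivalent to one Column: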
* The hint schema looks like this:
*
* CREATE TABLE hints (
* target_id uuid,
* hint_id timeuuid,
* message_version int,
* mutation blob,
* PRIMARY KEY (target_id, hint_id, message_version)
* ) WITH COMPACT STORAGE;

When delivery fails, the hintFor method of RowMutation is called directly to convert the data into the format stored in the hints table.

    /**
     * Returns mutation representing a Hints to be sent to <code>address</code>
     * as soon as it becomes available.  See HintedHandoffManager for more details.
     */
    public static RowMutation hintFor(RowMutation mutation, UUID targetId) throws IOException
    {
        RowMutation rm = new RowMutation(Table.SYSTEM_KS, UUIDType.instance.decompose(targetId));
        UUID hintId = UUIDGen.getTimeUUID();

        // determine the TTL for the RowMutation
        // this is set at the smallest GCGraceSeconds for any of the CFs in the RM
        // this ensures that deletes aren't "undone" by delivery of an old hint
        int ttl = Integer.MAX_VALUE;
        for (ColumnFamily cf : mutation.getColumnFamilies())
            ttl = Math.min(ttl, cf.metadata().getGcGraceSeconds());

        // serialize the hint with id and version as a composite column name
        QueryPath path = new QueryPath(SystemTable.HINTS_CF, null, HintedHandOffManager.comparator.decompose(hintId, MessagingService.current_version));
        rm.add(path, ByteBuffer.wrap(FBUtilities.serialize(mutation, serializer, MessagingService.current_version)), System.currentTimeMillis(), ttl);

        return rm;
    }
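The TTL rule in hintFor can be isolated into a few lines. The following is a minimal sketch, not Cassandra code: the class HintTtlSketch and its method are hypothetical names, and the gc_grace_seconds values are passed in as plain integers instead of being read from column family metadata.

```java
import java.util.List;

/** Illustrative only: mirrors the TTL rule in hintFor without any Cassandra classes. */
final class HintTtlSketch {
    /**
     * Returns the smallest gc_grace_seconds among the column families a mutation touches.
     * A hint that expires no later than the shortest grace period cannot resurrect
     * data whose tombstone has already been collected.
     */
    static int hintTtl(List<Integer> gcGraceSecondsPerCf) {
        int ttl = Integer.MAX_VALUE;
        for (int gcGrace : gcGraceSecondsPerCf) {
            ttl = Math.min(ttl, gcGrace);
        }
        return ttl;
    }
}
```

For a mutation that touches column families with grace periods of 864000, 3600 and 86400 seconds, the hint would carry a TTL of 3600 seconds.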

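2. How hints are generated

To keep data safe, a NoSQL database such as Cassandra stores each piece of data on several replicas. If one node goes down or its data is damaged, the data can still be read and restored through the other nodes, so the running workload is not affected and the data can eventually be recovered.

When a write to a replica fails, two conditions decide whether a hint is recorded:

(1) Whether hinted handoff is enabled. This is controlled by the hinted_handoff_enabled option in cassandra.yaml and is on by default.

(2) Whether the Gossiper, which monitors node state, reports that the node has been down longer than the maximum hint window. The window is set by max_hint_window_in_ms in cassandra.yaml. The value shipped in the file is 3 hours, but if the option is missing the default is 1 hour.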

    public static boolean shouldHint(InetAddress ep)
    {
        if (!DatabaseDescriptor.hintedHandoffEnabled())
        {
            HintedHandOffManager.instance.metrics.incrPastWindow(ep);
            return false;
        }

        boolean hintWindowExpired = Gossiper.instance.getEndpointDowntime(ep) > DatabaseDescriptor.getMaxHintWindow();
        if (hintWindowExpired)
        {
            HintedHandOffManager.instance.metrics.incrPastWindow(ep);
            logger.trace("not hinting {} which has been down {}ms", ep, Gossiper.instance.getEndpointDowntime(ep));
        }
        return !hintWindowExpired;
    }
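To make the two conditions easier to try out, here is a minimal standalone sketch of the same decision. ShouldHintSketch is a hypothetical class: the enabled flag, the downtime and the window are passed as plain parameters instead of coming from DatabaseDescriptor and Gossiper, and the metrics and logging calls are left out.

```java
/** Illustrative only: the two checks shouldHint applies, with plain parameters. */
final class ShouldHintSketch {
    /** 3 hours in milliseconds, the value the text gives for the shipped cassandra.yaml. */
    static final long THREE_HOURS_MS = 3L * 60L * 60L * 1000L;

    static boolean shouldHint(boolean hintedHandoffEnabled, long endpointDowntimeMs, long maxHintWindowMs) {
        if (!hintedHandoffEnabled) {
            return false; // condition (1): the feature is switched off
        }
        // condition (2): only hint while the node has not been down longer than the window
        return endpointDowntimeMs <= maxHintWindowMs;
    }
}
```

Under these assumptions a node that has been down for one hour still receives hints with a 3 hour window, while a node that has been down for four hours does not.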

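Once the decision to write a hint is made, the hint is submitted as a HintRunnable, which does the actual recording. At the same time the system tracks the total number of hint tasks in progress and the number in progress for each failing node.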

    public static Future<Void> submitHint(final RowMutation mutation,
                                          final InetAddress target,
                                          final AbstractWriteResponseHandler responseHandler,
                                          final ConsistencyLevel consistencyLevel)
    {
        // local write that time out should be handled by LocalMutationRunnable
        assert !target.equals(FBUtilities.getBroadcastAddress()) : target;

        HintRunnable runnable = new HintRunnable(target)
        {
            public void runMayThrow() throws IOException
            {
                logger.debug("Adding hint for {}", target);

                writeHintForMutation(mutation, target);
                // Notify the handler only for CL == ANY
                if (responseHandler != null && consistencyLevel == ConsistencyLevel.ANY)
                    responseHandler.response(null);
            }
        };

        return submitHint(runnable);
    }

    private static Future<Void> submitHint(HintRunnable runnable)
    {
        totalHintsInProgress.incrementAndGet();
        hintsInProgress.get(runnable.target).incrementAndGet();
        return (Future<Void>) StageManager.getStage(Stage.MUTATION).submit(runnable);
    }

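These counters exist to keep healthy nodes from running out of memory because they are processing too many hints; the nodes that are still online need protecting too. The total number of in-flight hints is therefore capped at 1024 * FBUtilities.getAvailableProcessors(). Once the cap is exceeded, a write to a destination that already has hints in progress, and that would still be hinted, is rejected with an OverloadedException: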

            // avoid OOMing due to excess hints.  we need to do this check even for "live" nodes, since we can
            // still generate hints for those if it's overloaded or simply dead but not yet known-to-be-dead.
            // The idea is that if we have over maxHintsInProgress hints in flight, this is probably due to
            // a small number of nodes causing problems, so we should avoid shutting down writes completely to
            // healthy nodes.  Any node with no hintsInProgress is considered healthy.
            if (totalHintsInProgress.get() > maxHintsInProgress
                && (hintsInProgress.get(destination).get() > 0 && shouldHint(destination)))
            {
                throw new OverloadedException("Too many in flight hints: " + totalHintsInProgress.get());
            }
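The interplay between the global counter and the per-node counters is easier to see in isolation. The sketch below is an assumption-laden simplification: HintBackpressureSketch and its method names are hypothetical, targets are plain strings instead of InetAddress, and the result of shouldHint is passed in as a boolean.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: a global cap that rejects writes solely for nodes that already have hints in flight. */
final class HintBackpressureSketch {
    private final int maxHintsInProgress;
    private final AtomicInteger totalHintsInProgress = new AtomicInteger();
    private final ConcurrentMap<String, AtomicInteger> hintsInProgress = new ConcurrentHashMap<>();

    HintBackpressureSketch(int maxHintsInProgress) {
        this.maxHintsInProgress = maxHintsInProgress;
    }

    private AtomicInteger counterFor(String target) {
        return hintsInProgress.computeIfAbsent(target, key -> new AtomicInteger());
    }

    /** Called when a hint task is submitted for the target. */
    void hintStarted(String target) {
        totalHintsInProgress.incrementAndGet();
        counterFor(target).incrementAndGet();
    }

    /** Called when a hint task for the target completes. */
    void hintFinished(String target) {
        totalHintsInProgress.decrementAndGet();
        counterFor(target).decrementAndGet();
    }

    /** True when a write to the destination would be refused with an overload error. */
    boolean wouldReject(String destination, boolean shouldHint) {
        return totalHintsInProgress.get() > maxHintsInProgress
            && counterFor(destination).get() > 0
            && shouldHint;
    }
}
```

With a cap of 2 and three hints in flight for one struggling node, writes to that node are rejected while writes to a node with no hints in progress still go through, which is the behaviour the source comment describes as not shutting down writes to healthy nodes.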

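3. How hints are delivered

Hint delivery is triggered in two ways:

(1) When a node starts and joins the cluster, it schedules a background task that processes hints every 10 minutes.

(2) When a node's state changes to live, delivery of the hints stored for that node is also triggered.

The scheduled run first queries the hints table and then processes each node's data in turn. There is a hidden risk here: the query reads all of the data, so if the hints table holds too much, it can cause an OutOfMemoryError.

For each row, the row key is mapped to the node's IP address, and delivery is handed to a thread in the HintedHandoff thread pool. The pool size is set by max_hints_delivery_threads in cassandra.yaml (max_hints_delivery_threads: 2).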

        Runnable runnable = new Runnable()
        {
            public void run()
            {
                scheduleAllDeliveries();
                metrics.log();
            }
        };
        StorageService.optionalTasks.scheduleWithFixedDelay(runnable, 10, 10, TimeUnit.MINUTES);
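The periodic trigger relies on the standard scheduleWithFixedDelay call, where the delay is measured from the end of one run to the start of the next. The sketch below uses only the JDK: HintScheduleSketch is a hypothetical class, the task merely counts down a latch where the real one calls scheduleAllDeliveries, and the delay is in milliseconds rather than 10 minutes so the effect can be observed quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative only: a fixed-delay background task on a JDK scheduler. */
final class HintScheduleSketch {
    /** Returns true if the periodic task ran at least `runs` times before the timeout. */
    static boolean runsRepeatedly(int runs, long delayMs, long timeoutMs) {
        CountDownLatch latch = new CountDownLatch(runs);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            // initial delay and delay between runs are the same, as in the excerpt above
            scheduler.scheduleWithFixedDelay(latch::countDown, delayMs, delayMs, TimeUnit.MILLISECONDS);
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

Because the delay starts only after a run finishes, a slow delivery pass pushes the next pass back instead of overlapping with it.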

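Delivery to a single node is skipped if hinted handoff has been paused. It can be paused with the nodetool PAUSEHANDOFF command and resumed with RESUMEHANDOFF.

Once the target node's schema version is known to match the local one and the node is online, delivery begins:

(1) Hints are read in pages of 128 records, and each page forms one batch to send.

(2) The mutation column of each record is deserialized into the corresponding RowMutation and wrapped in a message.

(3) After a successful send, a column deletion is written to remove the corresponding record.

(4) When all data for the given node has been sent, the hints system table is force-flushed and a major compaction is run over all of its SSTables.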

    private void deliverHintsToEndpointInternal(InetAddress endpoint) throws IOException, DigestMismatchException, InvalidRequestException, InterruptedException
    {
        ColumnFamilyStore hintStore = Table.open(Table.SYSTEM_KS).getColumnFamilyStore(SystemTable.HINTS_CF);
        if (hintStore.isEmpty())
            return; // nothing to do, don't confuse users by logging a no-op handoff

        // check if hints delivery has been paused
        if (hintedHandOffPaused)
        {
            logger.debug("Hints delivery process is paused, aborting");
            return;
        }

        logger.debug("Checking remote({}) schema before delivering hints", endpoint);
        try
        {
            waitForSchemaAgreement(endpoint);
        }
        catch (TimeoutException e)
        {
            return;
        }

        if (!FailureDetector.instance.isAlive(endpoint))
        {
            logger.debug("Endpoint {} died before hint delivery, aborting", endpoint);
            return;
        }

        // 1. Get the key of the endpoint we need to handoff
        // 2. For each column, deserialize the mutation and send it to the endpoint
        // 3. Delete the subcolumn if the write was successful
        // 4. Force a flush
        // 5. Do major compaction to clean up all deletes etc.

        // find the hints for the node using its token.
        UUID hostId = Gossiper.instance.getHostId(endpoint);
        logger.info("Started hinted handoff for host: {} with IP: {}", hostId, endpoint);
        final ByteBuffer hostIdBytes = ByteBuffer.wrap(UUIDGen.decompose(hostId));
        DecoratedKey epkey =  StorageService.getPartitioner().decorateKey(hostIdBytes);

        final AtomicInteger rowsReplayed = new AtomicInteger(0);
        ByteBuffer startColumn = ByteBufferUtil.EMPTY_BYTE_BUFFER;

        int pageSize = PAGE_SIZE;
        // read less columns (mutations) per page if they are very large
        if (hintStore.getMeanColumns() > 0)
        {
            int averageColumnSize = (int) (hintStore.getMeanRowSize() / hintStore.getMeanColumns());
            pageSize = Math.min(PAGE_SIZE, DatabaseDescriptor.getInMemoryCompactionLimit() / averageColumnSize);
            pageSize = Math.max(2, pageSize); // page size of 1 does not allow actual paging b/c of >= behavior on startColumn
            logger.debug("average hinted-row column size is {}; using pageSize of {}", averageColumnSize, pageSize);
        }

        // rate limit is in bytes per second. Uses Double.MAX_VALUE if disabled (set to 0 in cassandra.yaml).
        int throttleInKB = DatabaseDescriptor.getHintedHandoffThrottleInKB();
        RateLimiter rateLimiter = RateLimiter.create(throttleInKB == 0 ? Double.MAX_VALUE : throttleInKB * 1024);

        while (true)
        {
            // check if hints delivery has been paused during the process
            if (hintedHandOffPaused)
            {
                logger.debug("Hints delivery process is paused, aborting");
                break;
            }

            QueryFilter filter = QueryFilter.getSliceFilter(epkey, new QueryPath(SystemTable.HINTS_CF), startColumn, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, pageSize);
            ColumnFamily hintsPage = ColumnFamilyStore.removeDeleted(hintStore.getColumnFamily(filter), (int)(System.currentTimeMillis() / 1000));

            if (pagingFinished(hintsPage, startColumn))
            {
                if (ByteBufferUtil.EMPTY_BYTE_BUFFER.equals(startColumn))
                {
                    // we've started from the beginning and could not find anything (only maybe tombstones)
                    break;
                }
                else
                {
                    // restart query from the first column until we read an empty row;
                    // that will tell us everything was delivered successfully with no timeouts
                    startColumn = ByteBufferUtil.EMPTY_BYTE_BUFFER;
                    continue;
                }
            }

            for (final IColumn hint : hintsPage.getSortedColumns())
            {
                // Skip tombstones:
                // if we iterate quickly enough, it's possible that we could request a new page in the same millisecond
                // in which the local deletion timestamp was generated on the last column in the old page, in which
                // case the hint will have no columns (since it's deleted) but will still be included in the resultset
                // since (even with gcgs=0) it's still a "relevant" tombstone.
                if (!hint.isLive())
                    continue;

                if (hintedHandOffPaused)
                {
                    logger.debug("Hints delivery process is paused, aborting");
                    break;
                }

                startColumn = hint.name();

                ByteBuffer[] components = comparator.split(hint.name());
                int version = Int32Type.instance.compose(components[1]);
                DataInputStream in = new DataInputStream(ByteBufferUtil.inputStream(hint.value()));
                RowMutation rm;
                try
                {
                    rm = RowMutation.serializer.deserialize(in, version);
                }
                catch (UnknownColumnFamilyException e)
                {
                    logger.debug("Skipping delivery of hint for deleted columnfamily", e);
                    deleteHint(hostIdBytes, hint.name(), hint.maxTimestamp());
                    continue;
                }

                MessageOut<RowMutation> message = rm.createMessage();
                rateLimiter.acquire(message.serializedSize(MessagingService.current_version));
                WrappedRunnable callback = new WrappedRunnable()
                {
                    public void runMayThrow() throws IOException
                    {
                        rowsReplayed.incrementAndGet();
                        deleteHint(hostIdBytes, hint.name(), hint.maxTimestamp());
                    }
                };
                IAsyncCallback responseHandler = new WriteResponseHandler(endpoint, WriteType.UNLOGGED_BATCH, callback);
                MessagingService.instance().sendRR(message, endpoint, responseHandler);
            }

            // check if node is still alive and we should continue delivery process
            if (!FailureDetector.instance.isAlive(endpoint))
            {
                logger.debug("Endpoint {} died during hint delivery, aborting", endpoint);
                return;
            }
        }

        try
        {
            compact().get();
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }

        logger.info(String.format("Finished hinted handoff of %s rows to endpoint %s", rowsReplayed, endpoint));
        if (hintedHandOffPaused)
        {
            logger.info("Hints delivery process is paused, not delivering further hints");
        }
    }
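The page size of 128 is only an upper bound: the method above shrinks it when the stored mutations are large. The arithmetic can be checked with a minimal sketch. HintPageSizeSketch is a hypothetical class, the statistics are passed in as plain numbers instead of being read from the ColumnFamilyStore, PAGE_SIZE is taken to be the 128 mentioned in the text, and the guard against a zero average size is an addition of this sketch rather than part of the excerpt.

```java
/** Illustrative only: the page-size calculation from the delivery loop, with plain inputs. */
final class HintPageSizeSketch {
    /** Assumed to match the 128-record page mentioned in the text. */
    static final int PAGE_SIZE = 128;

    static int pageSize(long meanRowSizeBytes, int meanColumns, int inMemoryCompactionLimitBytes) {
        if (meanColumns <= 0) {
            return PAGE_SIZE; // no statistics yet, keep the default
        }
        int averageColumnSize = (int) (meanRowSizeBytes / meanColumns);
        if (averageColumnSize <= 0) {
            return PAGE_SIZE; // guard added in this sketch to avoid dividing by zero
        }
        // read fewer hints per page when each one is large
        int pageSize = Math.min(PAGE_SIZE, inMemoryCompactionLimitBytes / averageColumnSize);
        // a page of 1 cannot make progress because the next page starts at the last column read
        return Math.max(2, pageSize);
    }
}
```

With a 64 MiB limit and hints averaging 1 MiB each, this yields pages of 64 hints; with hints of about 100 bytes it stays at 128; and with a single enormous hint it bottoms out at 2.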

4. Cleaning up hints

After hints have been delivered, the corresponding column deletions are written, followed by a flush and a compaction.