Proposal: Decrease Censorship Delay from 24 hours to 4 hours

TLDR: If it’s necessary to lower the window, we think lowering it to 12 hours instead of 4 hours would eliminate a significant amount of risk.

The above discussion has already covered many important points on this topic, and this post mostly just attempts to lay out the various tradeoffs involved.

The core of the discussion is the force inclusion mechanism, which allows any user to trigger the sequencing of delayed messages that were inserted into the delayed inbox sufficiently long before the current time. The mechanism doesn’t allow any change in the ordering of those delayed messages; it only reads further messages in order until a given message has been included.
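
As a rough illustration of that rule, here is a minimal Python sketch. It is a simplified model rather than the actual inbox contract: `DelayedMessage`, `force_include`, and the parameter names are hypothetical, and timestamps are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class DelayedMessage:
    index: int        # position in the delayed inbox
    enqueued_at: int  # time (seconds) the message entered the delayed inbox


def force_include(delayed_inbox, next_unread, target_index, now, delay_seconds):
    """Return the delayed messages that would be sequenced, in inbox order.

    Hypothetical model: the caller picks a target message; every unread
    message up to and including it is read in order, with no reordering.
    """
    if target_index < next_unread:
        return []  # the target has already been sequenced
    target = delayed_inbox[target_index]
    if now - target.enqueued_at < delay_seconds:
        raise ValueError("target message is not old enough to force include")
    return delayed_inbox[next_unread:target_index + 1]
```

In this model the only inputs a user controls are which message to target and when to call; the order of the returned messages is fixed by the inbox.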

This mechanism guarantees that even if the sequencer goes completely offline, users of the chain can still exit the system securely. It is also a strong guard against censorship, since any user can permissionlessly force the inclusion of a transaction.

Note that when the sequencer is operating normally, shortening the 24 hour force inclusion window has no security benefit, since the sequencer can’t backdate a transaction earlier than the most recently posted batch, which tends to be less than a minute old.

There are three main categories of scenarios where the force inclusion mechanism could be triggered:

  1. The sequencer is completely down and hasn’t accepted transactions since the last batch was posted
  2. Batches are being posted regularly, but the sequencer is not reading messages from the delayed inbox
  3. The sequencer is accepting transactions, but unable to post batches

The main stated upside of this proposal is to provide a guarantee of transaction inclusion within a (shorter) bounded window on the DAO’s Arbitrum chains even with a byzantine sequencer. The argument is that users can achieve that guarantee by posting to the delayed inbox and forcing inclusion if necessary. Therefore the upper bound would be the time to include the transaction on Ethereum plus the force inclusion window. There’s certainly a lot of value in providing a stronger guarantee of inclusion time. This proposal is for one specific way of achieving this, but I think a wider discussion of methods for achieving this would be valuable since the current proposal creates significant risks.
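
To make that bound concrete, here is a small sketch of the arithmetic. The function name is hypothetical, and the 15 minutes used for landing the delayed inbox transaction on Ethereum is purely an illustrative assumption, not a measured figure.

```python
def worst_case_inclusion_hours(l1_inclusion_minutes, force_inclusion_window_hours):
    """Upper bound on inclusion time if the sequencer never reads the message."""
    return l1_inclusion_minutes / 60 + force_inclusion_window_hours


# Assumed: 15 minutes for the delayed inbox transaction to land on Ethereum.
for window_hours in (24, 12, 4):
    print(window_hours, worst_case_inclusion_hours(15, window_hours))
```

The window dominates the bound for any plausible Ethereum inclusion time, which is why the discussion centers on the window length.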

Any of the states that would lead to force inclusion being callable can be reached by either an honest sequencer with malfunctioning software or a malicious sequencer intentionally deviating from the expected behavior. We can generally ignore the malicious sequencer case in the following analysis, since a malicious sequencer could disrupt soft confirmations and create MEV opportunities without halting delayed message inclusion. Therefore the remaining analysis will focus on the honest but malfunctioning sequencer case.

The severity of the consequences of the force inclusion mechanism being triggered differs depending on the scenario.

  1. In the first scenario, there’s no reorg at all from the force inclusion mechanism since no transactions have been confirmed by the sequencer that haven’t ended up onchain.
  2. In the second scenario, there may be a couple of minutes of soft confirmed transactions reorged, but since batches are posted by the Arbitrum One sequencer quite often (approximately every 30-60 seconds), few transactions would be affected.
  3. The third scenario is where the force inclusion mechanism would have serious consequences. If the sequencer has still been offering soft confirmations for the length of the inclusion window, there would likely be a reorg of everything since the last batch was posted leading to large scale MEV extraction opportunities. Avoiding this scenario is a very high priority.
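
The three cases can be compared with a rough sketch like the one below. It is a deliberately simplified model with hypothetical names: it assumes the reorg covers everything soft confirmed since the last posted batch, and it uses 60 seconds as a stand-in for the normal batch interval.

```python
def soft_confirmed_seconds_at_risk(sequencer_accepting, seconds_since_last_batch):
    """Rough span of soft confirmed transactions a forced inclusion could reorg."""
    if not sequencer_accepting:
        return 0  # scenario 1: nothing was soft confirmed after the last batch
    # scenarios 2 and 3: everything confirmed since the last posted batch
    return seconds_since_last_batch


WINDOW_SECONDS = 4 * 60 * 60  # the proposed 4 hour window

scenario_1 = soft_confirmed_seconds_at_risk(False, WINDOW_SECONDS)
scenario_2 = soft_confirmed_seconds_at_risk(True, 60)
scenario_3 = soft_confirmed_seconds_at_risk(True, WINDOW_SECONDS)
print(scenario_1, scenario_2, scenario_3)
```

Under these assumptions the third scenario puts the entire window of soft confirmations at risk, compared with roughly one batch interval in the second.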

As an aside, one possible way to minimize the likelihood of the third scenario would be for the sequencer to stop accepting transactions if there was an issue posting batches onchain. However the advantage of the current strategy is that soft confirmation liveness can stay available even if onchain finality isn’t advancing. Batch poster downtime incidents would be much more severe under this alternate policy.

There are a number of tradeoffs and considerations when it comes to setting the length of time before a user can trigger the force inclusion mechanism. Generally speaking, the core issue with having a shorter period is that it limits the amount of time available to respond to any issues.

The Offchain Labs Site Reliability Engineering team is of the opinion that reducing the force inclusion window to 4 hours unreasonably increases the risk of a potential reorg. Several factors contribute to this risk, which we will discuss below.

To better grasp the tradeoffs involved, we must examine the common failure cases of batch posting, which can be categorized into general infrastructure failures and software bugs requiring a codebase patch for resolution. Enhancing the batch poster’s reliability to meet the increased uptime guarantee would necessitate additional resources, including personnel and infrastructure, at an added cost. The primary concern arises when a software bug necessitates a code patch. An example response process for such an issue is as follows:

  • Batch poster encounters an error - 00:00
  • Automated alerting initiates when batches are detected as not being posted - 00:30
  • SRE member responds - 00:30-00:45
  • SRE member diagnoses the problem - 00:45-01:00
  • Software issue identified, and Nitro developer paged - 01:00-01:15
  • Nitro developer locates the bug and patches it - 01:15-03:00
  • Image build on branch commences - 03:00-03:30
  • SRE team deploys new image - 03:30-03:45
  • If not fixed, sequencer messages face reorgs once the maximum downtime limit is reached - 04:00
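
The slack left by that timeline can be worked out with a short sketch. The step end times are taken from the example above; the names are hypothetical, and the retry model is an assumption: a second attempt is taken to repeat the patch, build, and deploy steps at the same pace as the first.

```python
# End of each step, in minutes after the batch poster error,
# taken from the example timeline above.
STEP_END_MINUTES = {
    "alert fires": 30,
    "SRE responds": 45,
    "SRE diagnoses": 60,
    "Nitro developer paged": 75,
    "bug located and patched": 180,
    "image built": 210,
    "image deployed": 225,
}

FIRST_FIX_DEPLOYED = STEP_END_MINUTES["image deployed"]
# Assumption: a retry repeats patch + build + deploy at the same pace.
RETRY_CYCLE = FIRST_FIX_DEPLOYED - STEP_END_MINUTES["Nitro developer paged"]


def slack_minutes(window_hours):
    """Minutes left in the window once the first fix is deployed."""
    return window_hours * 60 - FIRST_FIX_DEPLOYED


def retries_that_fit(window_hours):
    """Whole extra patch cycles that fit in the remaining slack."""
    return max(slack_minutes(window_hours), 0) // RETRY_CYCLE
```

Under these assumptions a 4 hour window leaves 15 minutes of slack and no room for a second attempt, while a 12 hour window leaves 495 minutes, enough for about three more attempts.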

While this process can be effective in some cases, it allows a very short window for the software issue to be diagnosed, patched, and tested before deployment to avoid reaching the force inclusion window. If the patch fails to resolve the problem in production, there would be minimal time available for a secondary patch attempt. Moreover, ensuring redundancy in the on-call schedule for the Nitro team would require increased coverage, as the current setup is only in place for the SRE team.

Considering the increased costs associated with this proposal and the heightened risk of sequencer message reorgs, we believe a 4-hour window is not feasible with the current staffing if we want to ensure that there will be no reorgs. To be reasonably confident that deep reorgs could be prevented, significantly more staff would have to be added to both the SRE team and the Nitro team in order to meet the timeline. Infrastructure would also need to be transitioned to a multi-region setup instead of solely multi-availability zone to ensure tolerance for region failures.

As an alternative, we’d be much more comfortable with moving to a 12 hour force inclusion window. We think that this would be possible without significantly increased cost and staffing, and that it would improve on the guarantees Arbitrum can offer today.

Addendum

It’s interesting to consider what other protocol options are possible to provide strong inclusion guarantees to Arbitrum users. This is an open area of research, and I’ll list a few high-level ideas here as directions for future research and development.

  • The first barrier to guaranteed inclusion in the happy path is soft confirmation by the sequencer. Currently this is a centralized mechanism, but sequencer decentralization mechanisms can provide stronger censorship resistance guarantees at this step.
  • The next barrier is ensuring that transactions with soft confirmation get posted onchain to achieve Ethereum finality. Modifications to the feed mechanism to include batch signatures from the sequencer could allow a broader set of entities to ensure that sequencer approved batches get posted.

Enhancing both of those mechanisms could potentially lead to much stronger inclusion guarantees without introducing additional reorg risk, while in fact greatly reducing the existing risk.


Per the community call regarding this proposal, it seems that we have reached consensus around a 12 hour delay given the complexity / SRE constraints around the force inclusion mechanism. The upside of @shotaro’s push towards a shorter window is that it increases interoperability between chains in a “multi-rollup world.” At the same time, @hkalodner makes a key point when he suggests that it’s possible for these features to emerge from a different portion of the “design space.” Given the bandwidth limitations / Offchain’s constraints in knowledge transfer, we probably want to err on the side of caution at the start.

Two relevant directions to extend the conversation from here:

  1. Focus less on the diversity of user personas and more on fault tolerance / low-lift maintenance of the infrastructure

  2. Develop a comprehensive methodology for ecosystem parameters

To the point of (2) at present, we could imagine a curve where the x-axis is the censorship delay window and the y-axis is the probability of a [substantive] block re-org. It’s tough to map this directly to a data science problem (without some grossly simplifying assumptions) since this deals with counterfactuals and you can’t just average things out with Brownian motion. However, if we treat that model more as a heuristic, what becomes clear is that we should be willing to tolerate some non-zero probability of re-org risk (ε > 0), and that we should adjust the censorship window according to community preferences.


First of all, I would like to greatly thank all the participants of the community call. For me as a delegate it was a blast; I learned a lot, and I think we had a great, nuanced discussion.

For those that couldn’t join us, you can find the recording here: Discussion around proposal to decrease censorship delay (2023-09-06 19:10 GMT+2) - Google Drive
And the transcript (one produced automatically, so take it with a grain of salt) here: Discussion around proposal to decrease censorship delay (2023-09-06 19:10 GMT+2) - Transcript - Google Docs

I hope we’ll have more discussions like this in the future!


Just a summary of key conclusions

  • This proposal benefits a relatively small user group of dapp developers
  • Most delegates don’t understand the implications
  • Almost no one is qualified or has time to conduct thorough risk analysis
  • Decreasing the censorship period from 24 hours to 12 hours is possible, but there is nervousness about changing any protocol parameter
  • There are alternative protocol changes being researched for future Arbitrum upgrades which could circumvent the need for this kind of proposal (sequencer censorship oracle by bridging the block header of L1 → L2 delayed inbox messages)

I think it’s fair to consider this proposal inactive for now, since there isn’t enough support and there are possible alternatives, which would require protocol upgrades in the future.

Thanks for everyone’s time : )
