Proposal: Decrease Censorship Delay from 24 hours to 4 hours

Category: Constitutional - Core

Submitted by: Shotaro

Abstract

This proposal decreases the censorship-resistance delay for transaction force inclusion from 24 hours to 4 hours.

As a consequence, this change also limits the sequencer's ability to backdate transactions to 4 hours instead of 24 hours.

Motivation

Arbitrum chains today run in a trusted sequencer mode in which a server operated by Offchain Labs orders L2 transactions.

Censorship-resistant L2 transactions, however, can be submitted through the delayedInbox and force-included after a delay.

Dapps, protocols, and L3s building on Arbitrum chains are constrained by worst-case analysis: what happens if the centralized sequencer misbehaves? The sequencer can't produce fraudulent transactions, but it can censor transactions, thereby forcing a 24 hour delay in transaction inclusion on L2. In particular, this censorship possibility imposes a 24 hour minimum latency constraint on all optimistic mechanism designs.

Rationale

As detailed in the Arbitrum docs, force inclusion is delayed on purpose so that a trusted sequencer can include transactions in an orderly manner. L1 finality requires at least 12.8 minutes. A trusted sequencer can voluntarily include transactions from the L1 delayed inbox after waiting for L1 finality. This allows the sequencer to give users a soft guarantee of transaction ordering without making them wait 12.8 minutes for finality.

Ethereum mainnet finality [1] takes ~64-95 slots, so roughly 19-20 minutes at the upper range. The sequencer needs additional time to order transactions, including any delayed transactions it chooses to include; one hour total should suffice. However, out of an abundance of caution, counterbalanced against the benefits of low latency, 4 hours is chosen as a compromise. A 4 hour delay is a 6x decrease in latency for optimistic mechanisms implemented in any dapp, protocol, or L3 deployed on Arbitrum chains, and it maintains a healthy 4-8x safety margin for the quality-of-life benefit of soft-finality guarantees provided by the trusted sequencer.
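The latency and safety-margin arithmetic above can be checked with a quick calculation (a sketch; the 0.5-1 hour sequencer operating budget is an assumption taken from the reasoning above):

```typescript
// Finality figures cited in the proposal (approximate).
const SLOT_SECONDS = 12;
const FINALITY_SLOTS_MAX = 95;
const finalitySeconds = FINALITY_SLOTS_MAX * SLOT_SECONDS; // 1140 s ≈ 19 min

const currentDelayHours = 24;
const proposedDelayHours = 4;

// Assumed sequencer operating budget: 0.5-1 hour to wait for finality and order txns.
const operatingHoursLow = 0.5;
const operatingHoursHigh = 1;

const latencyImprovement = currentDelayHours / proposedDelayHours; // 6x
const safetyMarginLow = proposedDelayHours / operatingHoursHigh;   // 4x
const safetyMarginHigh = proposedDelayHours / operatingHoursLow;   // 8x

console.log({ finalitySeconds, latencyImprovement, safetyMarginLow, safetyMarginHigh });
```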

Specifications

Call setMaxTimeVariation in the SequencerInbox to update its old parameters,

   MaxTimeVariation({
       delayBlocks: 5760,
       futureBlocks: 12,
       delaySeconds: 86400,
       futureSeconds: 3600
    })

to its new parameters

   MaxTimeVariation({
       delayBlocks: 1200,
       futureBlocks: 12,
       delaySeconds: 14400,
       futureSeconds: 3600
    })

Note: futureBlocks and futureSeconds unchanged.
Side note: the current delayBlocks value is based on outdated proof-of-work consensus rules [2] about time synchronization, which allowed a maximum 15 second timestamp change per block, versus the proof-of-stake 12 second slot time.
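The block-count conversion behind the old and new parameters can be sketched as follows (assuming the pre-Merge 15 s maximum timestamp step and the post-Merge 12 s slot time noted above):

```typescript
// Old parameters: 24 h expressed in blocks under the proof-of-work
// worst-case assumption of a 15-second timestamp step per block.
const oldDelayBlocks = 5760;
const powSecondsPerBlock = 15;
const oldDelaySeconds = oldDelayBlocks * powSecondsPerBlock; // 86400 s = 24 h

// New parameters: 4 h expressed in blocks at the proof-of-stake 12 s slot time.
const newDelaySeconds = 4 * 60 * 60; // 14400 s
const posSecondsPerBlock = 12;
const newDelayBlocks = newDelaySeconds / posSecondsPerBlock; // 1200 blocks

console.log({ oldDelaySeconds, newDelaySeconds, newDelayBlocks });
```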

Changes shall be applied to the Arbitrum One, Arbitrum Nova, and Arbitrum Goerli SequencerInbox contract deployments [3]. Note that the current MaxTimeVariation parameters for all chains are identical.

  • Arbitrum One 0x1c479675ad559DC151F6Ec7ed3FbF8ceE79582B6
  • Arbitrum Nova 0x211E1c4c7f1bF5351Ac850Ed10FD68CFfCF6c21b
  • Arbitrum Goerli 0x0484A87B144745A2E5b7c359552119B6EA2917A9

Steps to Implement

  1. Execution of the parameter change

Timeline

Minimum of 34 days for the constitutional governance process.

A goal of implementing the parameter change within 3 months.

Overall Cost

Gas costs for governance contract interactions. I am willing to subsidize the proposal creation and relevant contract interactions if delegates with a proposing quorum are willing to create the proposal on my behalf.

Links

I am a new user, so I am limited to 2 links. I am posting more resources as plain strings below.

[1] - notes.ethereum(dot)org/@vbuterin/single_slot_finality
[2] - github(dot)com/ethereum/go-ethereum/blob/00a73fbcce3250b87fc4160f3deddc44390848f4/consensus/ethash/consensus.go#L46
[3] - developer.arbitrum(dot)io/useful-addresses

20 Likes

This is a well and succinctly formulated proposal. Could you, or anyone else, provide further resources/analysis of the consequences of the change, preferably by a neutral party?

7 Likes

First, let me supplement the proposal with context on my background, as it can help establish some trust.

My Background

For reference, I am a developer at Kleros, a subjective oracle project.

Many of our dapps are built around optimistic mechanisms, which makes accommodating the 24 hour censorship delay very difficult. So the motivation is practical rather than theoretical.

As a result of our focus on optimistic mechanisms, the Kleros research team has even made proposals on the Arbitrum research forum to solve the delay-griefing problem, which remains unsolved in the current version of the Arbitrum protocol and is the reason the validator set is still permissioned.

I have even made some small contributions to the L1 bridge-related contracts in Nitro, the Arbitrum One upgrade that is the current version of the protocol live today.


Could you, or anyone else, provide further resources/analysis of the consequences of the change, preferably by a neutral party?

I have received some positive feedback from Arbitrum protocol developers in the past on this idea, though I can’t speak on anyone’s behalf. I welcome the security council to review this proposal and give it a sanity check.

Most Arbitrum protocol rationale and design is not as well documented as Ethereum protocol development. There are many challenges in settling to L1. For more context on the sequencer unhappy path, I link the relevant Arbitrum doc at the end of this post [1].

AFAIK, the only other consequence of this parameter change relates to the way the sequencer assigns timestamps to transactions. Currently, under some conditions, the sequencer can backdate transactions up to 24 hours into the past. This change would limit that backdating to a maximum of 4 hours.

Security Council Review Request

I invite any Arbitrum protocol developer, and specifically members of the Arbitrum security council, to give a security review of this proposal. If AIP-1 passes, the security council will be paid a monthly salary, so I would rather the security council provide a review than burden Arbitrum protocol developers with unpaid governance work on top of their core duties; though since there is some overlap between protocol developers and the security council, there is plenty of expertise either way.


Reference

[1] github(dot)com/OffchainLabs/arbitrum-docs/blob/b714a33320b0de2723601f8891cf4e0ad990b6fd/arbitrum-docs/sequencer.mdx#unhappyuncommon-case-sequencer-isnt-doing-its-job

11 Likes

The members of the Arbitrum Security Council have not been elected at the moment (perhaps it should say "not approved"), and if the temperature-check vote does not pass, which is very likely to happen, this process will take at least 37 days.

Other than that, I like this proposal, but I would like to hear the opinion of other competent people on how much it worsens the security of the network.

4 Likes

Thank you for referring me to the documentation on this topic: arbitrum-docs/sequencer.mdx at b714a33320b0de2723601f8891cf4e0ad990b6fd · OffchainLabs/arbitrum-docs · GitHub.
It explains very well how the process works, and I now understand much better the need for these changes.

In my opinion, this is completely safe and will not lead to any critical consequences when executed.

We will definitely vote FOR this proposal after the installation of the DAO.

5 Likes

Hey, Bartek from l2beat.com here. I think this proposal requires further analysis. The length of the current delay (24 hours) has nothing to do with Ethereum finality; in fact, in the same documentation referenced by OP, we can read that:

The Sequencer will emit an L2 receipt about ~10 minutes after the transaction has been included in the delayed Inbox (the reason for this delay is to minimize the risk of short term L1 reorgs which could in turn cause an L2 reorg and invalidate the Sequencer’s L2 receipts.)

The reason for the delay is that forcing transactions via L1 potentially invalidates all the L2 transactions that the Sequencer "thought" to be finalized. In other words, suppose that Alice, having only 1 ETH, is sending Bob that 1 ETH on L2 and, at the same time, she is forcing a transaction through L1 that sends 1 ETH to Charlie. Clearly one of these transactions should revert, but which one?

Currently Sequencer will “confirm” transaction to Bob almost immediately. You will see this transaction on an explorer, you will get immediate Metamask confirmation that transfer succeeded, etc… Now imagine Sequencer confirms the transaction, and during committing batch to L1 it crashes (for whatever reason).

With the current parameters, an honest Sequencer has 24 hours to recover and still include the Alice → Bob transaction in the L1 canonical chain. What you seem to be suggesting is that we reduce these 24 hours to 4 hours. This is IMO unrealistic and dangerous.

7 Likes

Bartek, thanks for your input.

The discussion of Ethereum finality concerns the analysis of minimum delays.

All L1→L2 txns (such as ETH deposits) are sent through the delayed inbox. The trusted sequencer will normally order those transactions after finality, but it won't order txns it's censoring. Ethereum finality is a speed limit on how quickly the sequencer can include txns from the delayed inbox.

This is the difference between soft-guarantees and hard-guarantees.

Anyone who wants a hard guarantee (eg exchanges) will not depend on the sequencer’s off-chain confirmation, but wait till the batch is published on L1.

So the off-chain double spending attack (CEX deposit) is moot here.

There's no ordering confusion: Charlie gets priority. There's just L1 latency now, because the forceInclusion txn on L1 could get forked out of L1.

I'm pretty sure this is where your thought experiment fails and is not possible on Arbitrum. This is why the sequencer can only backdate txns within a boundary set by the same maxTimeVariation parameter. This constraint is specified in the Arbitrum protocol implementation, where validators process transactions in the inbox.

These details are not well documented, and I invite an Arbitrum protocol developer to provide context. The following is my understanding of Arbitrum txn ordering.

  • On L2, txns ordered by the sequencer inherit a timestamp set by the sequencer within a boundary of maxTimeVariation

  • On L1, L1->L2 txns inherit a timestamp from the L1 block header.

Timestamps on L2 are monotonically increasing, but there are no synchronization guarantees with L1

In case timestamps in txns published on L1 violate the monotonicity or maxTimeVariation boundaries, they get clamped to the nearest valid timestamp.
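My understanding of the clamping rule can be sketched as pure logic (all names here are hypothetical; the real check lives in the Nitro inbox-processing code):

```typescript
interface TimeBounds {
  minTimestamp: number; // e.g. l1Timestamp - delaySeconds
  maxTimestamp: number; // e.g. l1Timestamp + futureSeconds
}

// Clamp a sequencer-assigned timestamp into the valid window, then
// enforce monotonicity against the previous L2 timestamp.
function clampTimestamp(
  proposed: number,
  prevL2Timestamp: number,
  bounds: TimeBounds
): number {
  const inWindow = Math.min(
    Math.max(proposed, bounds.minTimestamp),
    bounds.maxTimestamp
  );
  // L2 timestamps never decrease.
  return Math.max(inWindow, prevL2Timestamp);
}

// Example: the sequencer tries to backdate beyond delaySeconds (14400 s)
// relative to an L1 timestamp of 1_000_000.
const bounds = { minTimestamp: 1_000_000 - 14400, maxTimestamp: 1_000_000 + 3600 };
const clamped = clampTimestamp(900_000, 985_000, bounds); // 985600
```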


So if txns are force-included, aside from waiting out the Ethereum speed limit so that Arbitrum validators and watchers can update their state, there is no confusion about txn ordering.


Soft-guarantees of txn inclusion improve user experience. This is possible as long as the sequencer is online.

24 hour delays hurt user experience, since any economic mechanism design in a dapp, protocol, or L3 built on top of Arbitrum faces the 24 hour delay.

Lowering the delay to 4 hours maintains the soft-guarantees of txns inclusion by a sequencer that is online. This improves the user experience of applications using optimistic mechanisms built on top of Arbitrum.

If you pay for your coffee on Arbitrum, the shopkeeper can probably accept a soft guarantee, because an L1 double spend will cost several cups of coffee in gas fees, and the customer would need to shut down the Offchain Labs sequencer server too.

If you pay for your lambo on Arbitrum, the car dealership or direct seller should wait for a hard-guarantee, yes even waiting for L1 finality.

These concerns are about off-chain double spending (yes exchanging goods and services, in the off-chain world, most commonly CEXs). On-chain double spending is not possible.


Agreed :smile: The topic is complex. So again, I welcome an Arbitrum protocol developer to chime in here on the risks. We should take a slow, calculated approach. If there are unknown security risks, we can document them better. If the txn timestamping details or my understanding are incorrect, we can document the correct details properly as well. Those are all good outcomes. If this proposal is deemed a security risk and rejected, that's a perfectly good outcome too. Either way, it's better for the Arbitrum ecosystem.

6 Likes

Thank you for bringing up these points and contributing to the conversation

3 Likes

I think we are talking past each other here, so let me try another way. As it stands, there has not been a single case where users had to issue forceInclusion(). Ever. Check the chain.

The way I see this mechanism is as an emergency escape hatch to withdraw my funds if the sequencer is down and/or censoring my withdrawal transaction. Note that for emergency withdrawals the force inclusion mechanism is only part of the story; I also need validators to post the L2 state root (and they might be down too). I certainly would not rely on this mechanism for "normal" operations such as, for example, oracle updates, in which case 4 hours may be as bad as 24 hours for many applications.

Can you elaborate how exactly “24 hour delays hurt user experience” ?

1 Like

Yes, I am aware; we are on the same page about this. I have done some homework here too :smile:

Agreed, forceInclusion is for extraordinary situations.

All dapps, protocols, and L3s implementing optimistic systems in their economic mechanism design are constrained by the 24 hour sequencer censorship limit for the resolution of timeout periods.

In particular, low value, high frequency applications will be impacted the most.

Most people think of DeFi as the only application of a blockchain. The use cases of blockchains extend far beyond that, including prediction markets (e.g. futarchy, social media content moderation), global jobs markets (escrow for digital services), and Token Curated Registries (TCRs).

For example, consider decentralized social media applications, one of my interests. Suppose Reddit, a member of the Data Availability Committee for Arbitrum Nova, were interested in adopting a decentralized content moderation system based on prediction markets. I built an implementation of a similar kind of content moderation system for Telegram using Reality.eth and Kleros. The idea is much like optimistic rollups: someone makes a claim, "This Reddit post titled 'Increase the blocksize' in the subreddit /r/Bitcoin breaks the rules and the user should be banned," and leaves a deposit. An optimistic timeout period then begins; if no one escalates the claim, it is accepted optimistically and the user is banned. If the claim is disputed, a slow, secure dispute resolution oracle resolves the dispute (a blockchain-based justice protocol like Kleros).

The 24 hour censorship period means that such a content moderation mechanism is bound by 24 hour latency in the optimistic timeout, which would negatively impact the user experience on the social media platform. If we used a lower timeout, we would go from King Elon deciding platform censorship to Offchain Labs deciding platform censorship for these types of applications. Web3 should strive to provide users a 10x better UX; that's the only way to gain serious adoption.
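The optimistic-moderation flow described above can be sketched as a tiny state machine (all names are hypothetical; deposits and the dispute oracle are stubbed out):

```typescript
type ClaimState = "Pending" | "Accepted" | "Disputed";

interface Claim {
  submittedAt: number;    // unix seconds
  timeoutSeconds: number; // optimistic challenge window
  state: ClaimState;
}

// If the timeout elapses with no challenge, the claim is accepted
// optimistically; a challenge escalates it to the slow, secure oracle.
function resolve(claim: Claim, now: number, challenged: boolean): ClaimState {
  if (claim.state !== "Pending") return claim.state;
  if (challenged) return "Disputed";
  if (now - claim.submittedAt >= claim.timeoutSeconds) return "Accepted";
  return "Pending";
}

// With a 24 h censorship window, timeoutSeconds below 86400 is unsafe;
// under this proposal, 14400 (4 h) would become the safe minimum.
const claim: Claim = { submittedAt: 0, timeoutSeconds: 14400, state: "Pending" };
const afterTimeout = resolve(claim, 14400, false); // "Accepted"
const ifChallenged = resolve(claim, 100, true);    // "Disputed"
```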

Low value, high frequency use cases can work securely with low latency in blockchain-based systems as long as they fall back on a secure, slow mechanism. In practice, with optimistic timeout periods most users will only experience the timeout, but some will escalate to a slow oracle process (again, think blockchain-based court). But if a centralized actor such as the sequencer can censor transactions for a 24 hour period, we can't use a timeout period shorter than 24 hours.

In the case of labor markets, consider microtasking use cases such as language translation. Imagine an escrow dapp in which freelancers leave deposits to accept jobs. After completing the translation they submit their work. After a challenge timeout period, they are paid and their deposit is returned; if challenged, a slow, secure oracle is queried to resolve the dispute. You don't need to imagine: we at Kleros built such an application, called Linguo.

See here an example of an optimistic timeout of less than 24 hours for funding a court appeal to approve a translation, in a dispute over translating Vitalik's blog posts into French. This application is deployed on Gnosis Chain, where it avoids expensive blockspace and censorship latency.

We should strive to build web3 applications with a much better user experience than web2 alternatives. We can build systems where the happy path, the normal user experience, is fast, and falls back on a slow, secure mechanism. We can build decentralized labor markets where 99% of freelancers get paid in much less than 24 hours (how about 4 hours :smile:), while the 1% of services that are disputed are escalated to a slow and secure oracle.


These topics touch on game theory and economic mechanism design.

I am all for safety; securing an optimistic rollup is hard. But unjustified, excessive safety margins such as the 24 hour censorship delay necessarily constrain all applications built on top. As Arbitrum's licenses are restrictive, you can't simply deploy another instance of Arbitrum with a lower censorship delay. You can deploy L3s, but those L3s are also limited by the 24 hour censorship period.

I could go on at length about the usefulness of optimistic timeout periods in the economic mechanism design of blockchain dapps with "real world" applications. But if Arbitrum wants to be a base for L3s and applications with real-world use cases, then we need to rigorously justify safety margins that come at the cost of constraining the design space of applications built on top, and therefore constraining the end user experience. I will stress that the use cases here are practical, not merely theoretical. Optimistic timeout periods are common across many protocols.

If there are unaddressed security concerns regarding the current 24 hour delay, I would be happy to be corrected and we can better document the justification for current protocol parameters. However, my proposal justifies the 4 hour delay parameter change with a 4-8x safety margin on operation and 6x improvement in latency for optimistic timeout periods.

4 Likes

I’m an Offchain Labs protocol person, who was involved in the original decision to set this parameter to 24 hours. Here’s my opinion on the tradeoffs.

First, thanks to Shotaro and Bartek who have both been valuable participants in community discussion around Arbitrum over time.

You have collectively identified the main tradeoff here, which is that a shorter censorship delay value increases the risk of a reorg in the transaction sequence; but a longer censorship delay value provides a weaker guarantee if the sequencer does engage in censorship.

Here’s the sequence reorg scenario: The sequencer has published some transactions in its feed but those have not yet been posted to the on-chain inbox. In this state the sequencer goes offline (and redundancy measures such as failover to a hot spare somehow don’t succeed). The sequencer stays offline for longer than the delay interval. Someone then force-includes a message into the sequence. This reorgs the sequencer’s feed, because the messages that were in the feed but not yet inbox-posted will need to be included in the sequence after the force-included transaction, once the sequencer recovers.

By design, Arbitrum makes transaction sequence reorgs very rare, and many applications rely on that fact by assuming that a sequence reorg won't happen. (Protocol designers have consistently warned that sequence reorgs are not impossible.) Those applications could suffer negative consequences if a sequence reorg does happen.

So the key technical question here is: how much would shortening the delay to 4 hours increase the risk of a sequence reorg? Our engineering team is analyzing that question, at the request of the Foundation, and we’ll report to the community once we have a clearer answer.

14 Likes

So the key technical question here is: how much would shortening the delay to 4 hours increase the risk of a sequence reorg? Our engineering team is analyzing that question, at the request of the Foundation, and we’ll report to the community once we have a clearer answer.

just bumping this thread, are there any preliminary findings from the engineering team?

4 Likes

Intro

Here are my preliminary findings. You can reproduce this data with the script I published in this repo, special-l2-relativity, by running yarn timewarp.

Sequencer Uptime

The sequencer usually performs 'well', staying synchronized with L1 to within about 5 minutes. This means that 'soft confirmation' from the sequencer is a 1 of 1 trust model for about 5 minutes; if you wait 5 minutes, you start to get the 1 of n trust model of the rollup. There are, however, anomalies.

On occasion, the sequencer performs 'poorly'; the all-time worst performance saw backdated transactions in batches up to 2 hours old. Since Arbitrum One upgraded to Nitro in September 2022, there have been:

  • 4 significant outages (~1-2 hour sequencer ‘offline’)
  • 4 micro outages (~ 30-60 min sequencer ‘offline’)

Since the sequencer relay feed kept running during those outages, Arbitrum was running on a 1 of 1 trust model for 1-2 hours during those periods. 'Offline' here is from the perspective of L1: from the settlement layer's point of view, Arbitrum is 'offline'. Default Arbitrum RPCs, however, rely on a trusted relay published by the sequencer. This means users by default operate under the 1 of 1 'just trust Offchain Labs' assumption, which is fine as long as the sequencer promptly publishes its promised txn batch ordering, but not fine when large delays are involved.
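A minimal sketch of the measurement behind these numbers (the published timewarp script is the real source; the batch fields here are illustrative and the data is a toy in-memory array):

```typescript
interface Batch {
  l1Timestamp: number;            // when the batch landed on L1
  oldestMessageTimestamp: number; // oldest sequencer-assigned timestamp inside it
}

// The sequencer's effective delay for a batch is how far it backdated
// its oldest message relative to the batch's L1 inclusion time.
function backdatingSeconds(b: Batch): number {
  return b.l1Timestamp - b.oldestMessageTimestamp;
}

function worstDelay(batches: Batch[]): number {
  return Math.max(...batches.map(backdatingSeconds));
}

// Toy data: one healthy batch (~5 min behind) and one outage batch (~2 h behind).
const batches: Batch[] = [
  { l1Timestamp: 10_000, oldestMessageTimestamp: 9_700 },  // 300 s
  { l1Timestamp: 50_000, oldestMessageTimestamp: 42_800 }, // 7200 s
];
const worst = worstDelay(batches); // 7200 s, i.e. a 2-hour anomaly
```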

Anomalies

These anomalies could be explained by poor operational infrastructure; but dapp, protocol, and L3 developers have to assume the worst: that the sequencer is purposely withholding L2 state publishing. No optimistic mechanism design can be used safely on Arbitrum today without 24 hour minimum latency, due to protocol design choices. Since the sequencer has complete power for 24 hours to censor and order transactions as it wishes, one cannot build any dapp, protocol, or L3 with a 1 of n trust assumption on Arbitrum without a minimum latency of 24 hours.

Shorter challenge periods in optimistic mechanisms are possible, but they imply a 1 of 1 trust assumption. You might as well use AWS instead of the Offchain Labs database for the same trust model with better reliability, uptime, and cost.

Summary

Considering historical sequencer performance, this proposal would have had zero impact on Arbitrum performance or on any 'soft confirmation' had it been in effect since the Nitro upgrade. Not once has the sequencer exceeded even the proposed 4 hour backdating limit. Had this proposal been in effect since the Nitro upgrade, however, dapps, protocols, and L3s using optimistic periods in their mechanism design would have benefited massively from a 6x decrease in minimum latency, from 24 hours to 4 hours.

This proposal strikes a compromise between applications that simply want low latency and are fine with a 1 of 1 trust assumption on the sequencer, and the dapp, protocol, and L3 developers who require a 1 of n trust assumption rollup with sub-24-hour latency. The current parameters heavily favor the low-latency users who care less about trust assumptions. Based on historical data, this proposal would have had zero impact on the 1 of 1 soft-confirmation users, and would benefit the 1 of n 'hard confirmation' users with a 6x reduction in latency.

4 Likes

ELI5, what is this proposal about?

The sequencer is a trusted server ordering transactions off-chain. Periodically, the sequencer publishes the transaction ordering on-chain, e.g. laying the 'train tracks' for the Arbitrum 'train'.


The sequencer might never publish the transaction ordering on-chain. Any Arbitrum users who choose a 1 of 1 trust assumption on the sequencer 'jump off the cliff' and trust the sequencer to lay the train tracks later. If the sequencer doesn't publish the transaction ordering on-chain, then the 1 of 1 soft-confirmation users experience an L2 block re-org. These users 'fall off the cliff'.


The sequencer has a time limit to lay the train tracks properly. Sometimes (most of the time), the sequencer does its job. But the sequencer gets tired, could take some time to respond, might be sleepy, might forget. The sequencer could also be malicious and want the 1 of 1 trusting users to fall off the cliff.

https://thumbs.gfycat.com/AchingTenseHorseshoebat-size_restricted.gif

The time window the sequencer has to 'lay the train track' for the 1 of 1 trust assumption users flying in mid-air is also the time window during which the sequencer can refuse entry to (censor) any passengers (transactions) on the Arbitrum train.

The sequencer might welcome you on the train.

The sequencer might refuse entry if you are carrying something it doesn't like. The sequencer might be complying with the local laws of the state where the train is located, which refuse large musical instruments on board (Tornado Cash contract interactions).


Summary

In designing the train, there is a balance to strike between an inclusive train that passengers can always board even if the train operator (sequencer) tries to refuse them entry, and a train that accommodates the 1 of 1 trust assumption users who are comfortable flying off the cliff and trusting the sequencer to lay the tracks for the train to proceed.

Currently, Arbitrum protocol design favors the Wile E. Coyote passengers who are comfortable with 1 of 1 trust assumptions.

Arbitrum, however, can force passengers who prefer a 1 of n trust assumption to wait 24 hours if they are denied entry. Any protocol, dapp, or L3 that uses optimistic mechanisms and wants to rely on 1 of n trust assumptions is forced to wait 24 hours.

This proposal strikes a balance between the interests of both kinds of users by decreasing the waiting time for 1 of n trust assumption and otherwise-censored passengers from 24 hours to 4 hours, while maintaining the soft-confirmation guarantee for the 1 of 1 trusting passengers. Historically, this proposal would have had zero negative impact on any 1 of 1 trusting Wile E. Coyote passengers.

6 Likes

This proposal has initiated a very interesting discussion, and I believe it would be beneficial for the community to see the analysis of "how much would shortening the delay to 4 hours increase the risk of a sequence reorg?"

Has Offchain Labs completed its analysis? If so, can they share the results? @EdFelten

14 Likes

I believe a risk analysis was conducted but has not yet been shared. At minimum, I am requesting a status update from Offchain Labs and/or the Arbitrum Foundation on the status of the risk analysis. @stonecoldpat


I briefly introduced the proposal at the end of last month's Arbitrum governance call.

@krst, you expressed some interest in further exploration of the proposal. Do you think next month's governance call (28.06.2023) is appropriate for discussing it? The proposal is technical in nature, and most of the discussion on the governance call is about different interests pushing grant proposals, so perhaps a separate call on this proposal would allow for more focused participants and discussion.

16 Likes

Excellent work @shotaro. I appreciate the additional context you provided on governance call 5 with @krst and @stonecoldpat. Is there a need for further analysis regarding the MEV-like exploits [that we outlined and suggested we have another call for]? We could also think about other gradients of risk that might be relevant here beyond just front-running.

7 Likes

Thank you for the healthy discussion, shotaro and @Bartek.

It is always interesting to hear Bartek's comments :slight_smile:

Also thank you to @EdFelten for summing this up into one question.

I would love to know the further findings: will the 4-hour delay improve the UX that much, and are there any security concerns that may come with it?

3 Likes

Following the discussion on this topic during the recent Governance Community Call, we've planned a separate call to discuss this issue in detail. I encourage all interested parties and delegates to join. Please spread the word.

Link to the Google Meet: meet.google.com/cpm-jdby-cgv
Link to the event: https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=N3ZqNzQ4dXV0bjlwaGZtMTFqdTM0b242aWcgY180MTU3OTg1ZDI0NTJkZmQ4YTkxYjZhMzZiY2NhYjM3ZGViOWJmZmU5MDUzYTRiOWJjYzRlOWZmZjllZjAyOTI0QGc&tmsrc=c_4157985d2452dfd8a91b6a36bccab37deb9bffe9053a4b9bcc4e9fff9ef02924%40group.calendar.google.com

3 Likes

TLDR: If it’s necessary to lower the window, we think lowering it to 12 hours instead of 4 hours would eliminate a significant amount of risk.

The above discussion has already covered many important points on this topic, and this post mostly just attempts to lay out the various tradeoffs involved.

The core of the discussion is the force inclusion mechanism, which allows any user to trigger the sequencing of delayed messages that were inserted into the delayed inbox sufficiently long before the current time. The mechanism doesn't allow any reordering of those delayed messages; it simply reads further messages in order until the given message has been included.

This mechanism guarantees that even if the sequencer goes completely offline, users of the chain can still exit the system securely. It also provides a strong guard against censorship since it provides a mechanism where a user can permissionlessly force the inclusion of a transaction.

Note that when the sequencer is operating normally, shortening the 24 hour force inclusion window has no security benefit since the sequencer can’t backdate a transaction earlier than the most recently posted batch which tends to be less than a minute old.
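The force-inclusion eligibility condition can be sketched as a pure check (a simplification of the real SequencerInbox logic; field and function names here are hypothetical, and the idea is that both the block-count and timestamp bounds must have elapsed):

```typescript
interface DelayedMessage {
  l1BlockNumber: number; // L1 block in which the message entered the delayed inbox
  l1Timestamp: number;   // L1 timestamp at that point
}

// A delayed message becomes force-includable once it has aged past both
// the delayBlocks and delaySeconds bounds of maxTimeVariation.
function isForceIncludable(
  msg: DelayedMessage,
  nowBlock: number,
  nowTimestamp: number,
  delayBlocks: number, // proposed: 1200
  delaySeconds: number // proposed: 14400
): boolean {
  return (
    nowBlock - msg.l1BlockNumber > delayBlocks &&
    nowTimestamp - msg.l1Timestamp > delaySeconds
  );
}

const msg = { l1BlockNumber: 100, l1Timestamp: 1_000_000 };
const ready = isForceIncludable(msg, 1400, 1_015_000, 1200, 14400);  // true
const notYet = isForceIncludable(msg, 1200, 1_015_000, 1200, 14400); // false: blocks bound not met
```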

There are three main categories of scenarios where the force inclusion mechanism could be triggered:

  1. The sequencer is completely down and hasn’t accepted transactions since the last batch was posted
  2. Batches are being posted regularly, but the sequencer is not reading messages from the delayed inbox
  3. The sequencer is accepting transactions, but unable to post batches

The main stated upside of this proposal is to provide a guarantee of transaction inclusion within a (shorter) bounded window on the DAO's Arbitrum chains, even with a byzantine sequencer. The argument is that users can achieve that guarantee by posting to the delayed inbox and forcing inclusion if necessary; the upper bound would therefore be the time to include a transaction on Ethereum plus the force inclusion window. There's certainly a lot of value in providing a stronger guarantee of inclusion time. This proposal is one specific way of achieving it, but I think a wider discussion of methods for achieving it would be valuable, since the current proposal creates significant risks.

Any of the states that would lead to force inclusion being callable can be reached either by an honest sequencer with malfunctioning software or by a malicious sequencer intentionally deviating from the expected behavior. We can generally ignore the malicious sequencer case in the following analysis, since a malicious sequencer could disrupt soft confirmation and create MEV opportunities without halting delayed message inclusion. The remaining analysis will therefore focus on the honest but malfunctioning sequencer case.

The severity of the consequences of the force inclusion mechanism being triggered differs depending on the scenario.

  1. In the first scenario, there’s no reorg at all from the force inclusion mechanism since no transactions have been confirmed by the sequencer that haven’t ended up onchain.
  2. In the second scenario, there may be a couple minutes of soft confirmed transactions reorged, but since batches are posted by the Arbitrum One sequencer quite often (every 30-60 seconds approximately), minimal transactions would be affected.
  3. The third scenario is where the force inclusion mechanism would have serious consequences. If the sequencer has continued offering soft confirmations for the length of the inclusion window, there would likely be a reorg of everything since the last batch was posted, leading to large scale MEV extraction opportunities. Avoiding this scenario is a very high priority.

As an aside, one possible way to minimize the likelihood of the third scenario would be for the sequencer to stop accepting transactions if there was an issue posting batches onchain. However the advantage of the current strategy is that soft confirmation liveness can stay available even if onchain finality isn’t advancing. Batch poster downtime incidents would be much more severe under this alternate policy.

There are a number of tradeoffs and considerations when it comes to setting the length of time before a user can trigger the force inclusion mechanism. Generally speaking, the core issue with a shorter period is that it limits the amount of time available to respond to any issues.

The Offchain Labs Site Reliability Engineering team is of the opinion that reducing the force inclusion window to 4 hours unreasonably increases the risk of a potential reorg. Several factors contribute to this risk, which we will discuss below.

To better grasp the tradeoffs involved, we must examine the common failure cases of batch posting, which can be categorized into general infrastructure failures and software bugs requiring a codebase patch for resolution. Enhancing the batch poster’s reliability to meet the increased uptime guarantee would necessitate additional resources, including personnel and infrastructure, at an added cost. The primary concern arises when a software bug necessitates a code patch. An example response process for such an issue is as follows:

  • Batch poster encounters an error - 00:00
  • Automated alerting initiates when batches are detected as not being posted - 00:30
  • SRE member responds - 00:30-00:45
  • SRE member diagnoses the problem - 00:45-01:00
  • Software issue identified, and Nitro developer paged - 01:00-01:15
  • Nitro developer locates the bug and patches it - 01:15-03:00
  • Image build on branch commences - 03:00-03:30
  • SRE team deploys new image - 03:30-03:45
  • If not fixed, sequencer messages face reorgs once the maximum downtime limit is reached - 04:00
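The timeline above can be checked mechanically: summing the step durations shows how little margin remains under a 4 hour window. The step durations below follow the example timeline; they are illustrative, not measurements.

```python
# Step durations (minutes) taken from the example response timeline above.
RESPONSE_STEPS = {
    "alerting fires":          30,
    "SRE responds":            15,
    "SRE diagnoses":           15,
    "Nitro developer paged":   15,
    "bug located and patched": 105,
    "image build":             30,
    "image deployed":          15,
}

def slack_minutes(window_hours: float) -> float:
    """Minutes left before forced reorgs once the fix is deployed."""
    return window_hours * 60 - sum(RESPONSE_STEPS.values())

print(slack_minutes(4))   # 15 min of margin under a 4 h window
print(slack_minutes(12))  # 495 min (~8.25 h) under a 12 h window
```

Fifteen minutes of slack leaves essentially no room for a failed first patch attempt, which is the core of the concern below.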

While this process can be effective in some cases, it allows a very short window for the software issue to be diagnosed, patched, and tested before deployment to avoid reaching the force inclusion window. If the patch fails to resolve the problem in production, there would be minimal time available for a secondary patch attempt. Moreover, ensuring redundancy in the on-call schedule for the Nitro team would require increased coverage, as the current setup is only in place for the SRE team.

Considering the increased costs associated with this proposal and the heightened risk of sequencer message reorgs, we believe a 4-hour window is not feasible with current staffing if we want to ensure that there will be no reorgs. To prevent deep reorgs with high confidence, significantly more staff would have to be added to both the SRE and Nitro teams to reliably meet this timeline. Infrastructure would also need to be transitioned to a multi-region setup, instead of solely multi-availability-zone, to ensure tolerance of region failures.

As an alternative, we’d be much more comfortable with moving to a 12 hour force inclusion window. We think this would be possible without significantly increased cost and staffing, and it would improve the guarantees Arbitrum can offer today.

Addendum

It’s interesting to consider what other protocol options could provide strong inclusion guarantees to Arbitrum users. This is an open area of research; I’ll list a few high level ideas here as directions for future research and development.

  • The first barrier to guaranteed inclusion in the happy path is soft confirmation by the sequencer. Currently this is a centralized mechanism, but sequencer decentralization could provide stronger censorship resistance guarantees.
  • The next barrier is ensuring that transactions with soft confirmation get posted onchain to achieve Ethereum finality. Modifications to the feed mechanism to include batch signatures from the sequencer could allow a broader set of entities to ensure that sequencer approved batches get posted.

Enhancing both of those mechanisms could potentially lead to much stronger inclusion guarantees without introducing additional reorg risk, and in fact greatly reducing the existing risk.
