[DIP v1.5] Delegate Incentive Program Questions and Feedback

Hi @danielM !

  1. Send them an email. The AF will reach out to you to specify which documents you should send.
  2. This depends entirely on the AF, so we can’t provide an ETA.
  3. Yes, and we also receive a confirmation from AF.

Hi @jameskbh

First of all, thank you for your suggestion!

This is something we had considered at the time. The reality is that today, the framework already grants 60 points just for voting. Therefore, if we wanted to shift the concept from a “penalty” to an “incentive,” we would need to adjust the framework accordingly to ensure that delegates benefiting from the multiplier do not receive an unreasonable score merely for voting.

Ultimately, the result is quite similar. We understand the distinction between penalties and incentives, but the solution we have implemented allows us to experiment with the multiplier without making extensive modifications to the framework—making it easier to roll back if necessary.

In conclusion, a modification like the one you proposed seems possible, but only after we have tested the effectiveness of the multiplier within the framework. Once again, thank you for your suggestion!

1 Like

Based on the results of the February scoring analysis and the latest change in the scoring rules, I wanted to say a few words about the latest changes:

  1. Excluding points for Communication Rationale immediately removed 10 points, which dropped most delegates from tier 2 to tier 3, while those already in tier 3 were eliminated from incentives.
    I don’t know if this was intentional, but now only 24 delegates remain in the incentive program (a threefold decrease), even though the point of the program was to attract more delegates.

  2. Considering that the budget allowed for spending up to 825k ARB per month on delegate incentives, cutting expenses to 240k ARB seems strange to me. Cost reduction was never a stated goal of this program; the spending itself is the point, since it is meant to motivate more delegates.

  3. For many, this work takes up most of their time, and constant changes to the conditions hurt those people. How do you know you are making the right adjustments if the conditions and calculations are different every month? You cannot draw a conclusion from just one month.

Conclusion:

  • monthly changes are negative for the program; give each change time so we can understand what it actually delivers
  • it is worth considering rewarding more delegates (even with smaller amounts), so as not to worsen decentralization and to achieve the program’s goals
5 Likes

I would like to point out that I agree with cp0x’s statement.

On Feb 12th, Seed Gov unilaterally changed the parameters of the Delegate programme.

These changes were all in accordance with the original rules, which give them the authority to introduce changes (as they have pointed out multiple times), so this is not a matter of “could they change it or not”.

The point raised here is that they seem to be hovering in a new gray area of abuse of power. Their points allocation (although via a rubric) is already highly subjective (I’m sure they’ll contest this statement and post a lengthy reply as to why this is not the case, but let’s be real: it is highly subjective in practice), and in the name of “enhancing quality of participation” they are overcomplicating things, introducing changes unilaterally, and demanding a specific type of participation which, under their logic, is the right participation to have. The amount of AI comments and spam seen on the forums lately is, imo, a consequence of these enforcements, and at times they do not provide actual value.

Some concrete feedback: in my opinion, having 100% voting participation rate, and having a 100% vote comment rationale (i.e. forum posts as to why a vote was cast a certain way) should be enough to earn minimum rewards. With the changes introduced by Seed Gov, this is no longer the case. I think additional and thoughtful delegate feedback would be a nice to have (and could boost rewards exponentially), but not an exclusionary measure. Perhaps we can consider a different approach when a new wave of changes (or a new program manager) is introduced.

6 Likes

Hi @SEEDGov ,

We recently reviewed the DIP results for February and would like more clarity on how scoring improvements can be made. While we understood the previous scoring system and how scores from 1 to 4 were calculated, we noticed that the maximum score has now increased from 4 to 10. Although we are familiar with the general rubric, it remains unclear how to improve our scores within this expanded range.

For example, in the proposal “Increase resilience to outside attackers by updating DAO parameters to not count ‘Abstain’ votes in Quorum,” we received a score of 7 in most categories but only 3 for impact. However, the difference between a 7, an 8, or a 9 remains unclear. What specific factors differentiate these scores, and how can we adjust our approach to achieve a higher score? With the current evaluation method, it is difficult for delegates to learn from past assessments and improve future feedback. The lack of clear distinctions between score levels makes it challenging to identify the specific improvements needed to reach a higher rating. Additionally, given the subjective nature of scoring, it can be difficult to justify why certain scores are assigned.


We also have an idea to address this, which is to explore a normalized scoring approach, a method commonly used in academic assessments to ensure fairness, consistency, and transparency by benchmarking the highest-performing responses in each rubric category as a reference for a full score.

How Normalization Works

  • Instead of using a single delegate’s overall high score as the standard, this method identifies the highest-scoring delegate for each rubric category and uses their response as the reference point for a full score.

    For example, if Jojo scores 10 in depth of analysis, while Curia receives a full score in clarity, then the benchmark for each rubric should be set based on the delegate who performed best in that specific area.

    Alternatively, if Seedgov (as the program manager) determines what constitutes a full score or identifies specific gaps in the top-scoring responses, that could also be used as a benchmark. However, this would require clearly identifying what is needed or lacking to ensure that delegates understand how to achieve a higher score.

  • Defining Key Scoring Factors: Document the specific elements that contributed to the top scores—such as depth of analysis, clarity, or relevance—to provide clear guidelines for other delegates.

  • Standardized Evaluation: Assess other delegates’ scores relative to these category-specific benchmarks, ensuring a structured, transparent, and fair scoring system.

This approach ensures that each rubric category is measured against the highest standard in that specific criterion, rather than relying on a single overall top-scoring delegate. By doing so, it allows for a more precise and meaningful evaluation of delegate performance.
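For illustration, here is a minimal sketch of the per-category normalization we have in mind, assuming each delegate already has a raw rubric score per category for the month; the delegate names, categories, and numbers are made up.

```python
# Minimal sketch of per-category normalization (illustrative names and data only).
# Each raw rubric score is rescaled against the best score in that category for
# the month, so the top performer in each category defines the full score of 10.

raw_scores = {
    # delegate: {category: raw score given by the evaluator}
    "Jojo":  {"depth": 9, "clarity": 6, "relevance": 7},
    "Curia": {"depth": 7, "clarity": 8, "relevance": 7},
}

def normalize(scores: dict, full_score: float = 10.0) -> dict:
    # Benchmark = highest raw score per category across all delegates this month.
    categories = {cat for per_delegate in scores.values() for cat in per_delegate}
    benchmark = {
        cat: max(per_delegate.get(cat, 0) for per_delegate in scores.values())
        for cat in categories
    }
    return {
        delegate: {
            cat: round(full_score * raw / benchmark[cat], 2) if benchmark[cat] else 0.0
            for cat, raw in per_delegate.items()
        }
        for delegate, per_delegate in scores.items()
    }

print(normalize(raw_scores))
# Jojo's depth (the best this month) maps to 10.0; Curia's clarity maps to 10.0.
```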

Additionally, this approach would allow delegates to compare their scores against the top-performing delegate in each category and understand:

  • How their score was calculated
  • What factors contributed to the highest score in each rubric category
  • Where they are lacking compared to the highest-performing delegate in that category

Examples of Normalized Scoring in Other Industries

Normalized scoring is widely used in academia and standardized testing to improve fairness, consistency, and accuracy:

  • University Grading: Many universities normalize exam scores to adjust for grading inconsistencies, ensuring more accurate and equitable distributions.
  • Standardized Testing: Exams like the SAT and GRE use score normalization to adjust for differences in test difficulty, ensuring that scores remain comparable over time.
  • Peer Evaluations: Online learning platforms like Coursera use normalized scoring in peer assessments to reduce grading bias and improve accuracy, achieving results comparable to expert evaluations.

Key Benefits of Normalized Scoring

  • Fairness & Consistency: Eliminates bias from evaluator subjectivity or varying assessment difficulty, ensuring scores reflect actual performance.
  • Transparency & Interpretability: Provides clear benchmarks for scoring, helping delegates understand how their scores were assigned and how to improve.
  • Objective Benchmarking: Ensures comparability across different evaluators, reducing arbitrary differences in scoring.

We believe by integrating a normalized scoring system, DIP evaluations can become more structured, transparent, and actionable, benefiting both evaluators and delegates.

We’d love to hear your thoughts on whether this approach could enhance fairness and clarity in the DIP scoring process.

3 Likes

I want to bring back this topic, adding a few other items as a suggestion.

You mentioned that we have 60 points for voting. This leaves 40 points for the delegate’s interactions that lead up to that vote (their feedback, reasoning, and comments on proposals).

I want to propose:

  • Lower the points related to voting to 50
  • Introduce the VP incentive/multiplier as a positive value (starting with 1 and basically using the same curve you use now)

Results: There is no penalty for smaller delegates, but an extra incentive to go after more VP

We want to have meaningful contributions from the delegates. We can move this needle by giving the PM evaluation more weight.

  • Increase the points related to delegate activity (Delegate Feedback, DF) to 50

It can become a fairer system; a rough sketch of this scoring appears after the list below. Results:

  • With this voting score system, there is no arbitrary “slashing” of the voting-related points based only on VP, and there are still incentives to pursue larger delegations.
  • If a larger delegate does not produce meaningful contributions, this will show in their DF points.
  • If a smaller delegate does not produce meaningful contributions, this will show in their DF points.
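To make the idea concrete, here is a rough sketch of how scoring could look under this proposal. The multiplier curve, its 1.2 cap, and the 10M ARB reference are placeholders of mine, not the program’s actual parameters.

```python
# Sketch of the proposed split (assumed values, not the official formula):
# 50 points for voting, up to 50 for Delegate Feedback, and a VP multiplier
# that starts at 1 and only adds a bonus (no penalty for smaller delegates).

def vp_multiplier(voting_power_arb: float, cap: float = 1.2) -> float:
    # Placeholder curve: grows from 1.0 toward `cap` as delegation approaches 10M ARB.
    return 1.0 + (cap - 1.0) * min(voting_power_arb / 10_000_000, 1.0)

def total_points(voting_pct: float, df_points: float, voting_power_arb: float) -> float:
    voting_points = 50 * voting_pct          # participation in votes, 0.0 to 1.0
    return round(vp_multiplier(voting_power_arb) * voting_points + df_points, 2)

# A 60k ARB delegate is no longer penalized (multiplier ~1.0) ...
print(total_points(voting_pct=1.0, df_points=20, voting_power_arb=60_000))      # ~70.06
# ... while a 10M ARB delegate earns a bonus on the voting component.
print(total_points(voting_pct=1.0, df_points=20, voting_power_arb=10_000_000))  # 80.0
```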
4 Likes

Hey @cp0x

Indeed, part of the program is to attract a larger number of delegates, but not at any cost.

We would like to analyze how the exclusion of rationales, and their merge into the Delegate Feedback parameter, has impacted the results:

| Month | Delegates with DF | Delegates without DF | AVG DF | AVG DF (%) |
| --- | --- | --- | --- | --- |
| November | 39 | 20 | 19 | 62.68% |
| December | 40 | 19 | 18 | 58.83% |
| January | 32 | 31 | 13 | 42.43% |
| February | 35 | 31 | 17 | 41.62% |

Note: This month, the DF parameter increased from 30 to 40 points.

This shows:

  • Fewer delegates are earning points for DF.

  • A noticeable decline in the AVG DF starting in January.

  • Although DF increased from 30 to 40 points, the AVG only rose from 13 in January to 17 in February, meaning 6 of the 10 additional points were not retained.

Regarding the first two points: One could argue that this reflects higher standards set by the PM—which, to be honest, is expected over time as we continuously push to raise the bar. However, we have also observed a general decline in the quality of contributions.

Additionally, the reactions observed in recent days are, at the very least, predictable. We were fully aware that raising the quality standards would mean that those who were doing only the bare minimum to qualify would be the first to complain, as they would also be the first to be excluded.

As for the last point, this partially confirms that there were hundreds of rationales being incentivized without necessarily adding value to discussions or decision-making. Delegates, of course, are free to justify their votes however they see fit, but incentivizing one-liner rationales is a different matter altogether.

In delegate programs like Optimism’s, participants do not even have a clear understanding of the criteria that the Optimism Foundation will use to evaluate contributions in the upcoming season.

However, this is possible because a delegate’s work should not depend 100% on a fixed formula—it should be more about organic interaction. We do not expect delegates to speculate and adjust their behavior solely to receive an incentive. Rather, we hope they make adjustments to contribute as much knowledge and time as possible organically, ensuring the best possible outcome for proposals.

As a final note, we are deeply concerned that some delegates seem to be more engaged when discussing delegate incentives than other critical topics within the DAO.

Hey @olimpio

The rubric is subjective—this was stated from the beginning, and it was approved in Tally. Quoting a screenshot from the on-chain proposal:

We do not understand how this can be labeled as an abuse of power when all we did was unify the scoring for two highly similar actions—justifying a vote (which is essentially a way of providing feedback) and the specific action of providing feedback.

This is an accurate observation and something we noticed in the program’s early months. In part, it is one of the reasons why we proceeded with the changes implemented in v1.6. To be fair, the previous framework “enforced” more comments than the current one. It counted comments and rationales without assessing their value. The spam we see today is more a residual effect of that previous framework rather than the current one.

The v1.6 update focuses on the ‘-ize’ in ‘professionalize’. That is why the evaluation criteria have become stricter. We understand that this may raise concerns, as it logically impacts some delegates’ compensation. However, we are fully confident that delegates will grow alongside the program and continue refining their professionalism, ultimately making Arbitrum the best DAO in the ecosystem.

We understand your point here, but we don’t fully agree. Yes, perhaps a delegate who votes on everything and justifies those votes should be able to receive compensation. The current minimum may be too high for such a task, especially if the idea is to avoid making any judgment on the rationales. However, SEEDGov is more aligned with the following statement:

The program aims for professionalization, and as we have mentioned before, that implies raising the bar over time.

In your particular case, you obtained 60 out of the 65 points needed just by voting, and you could have obtained the remaining 5 points even with one of your rationales. If you take the time to review the scoring of other delegates, you will find a lot of rationales that received a score.

To be clear, just 1 or 2 more in-depth rationales would have been enough for you in this case.

The main issue, specifically here, is:

  • Many rationales were reused from previous months where they were already incentivized (e.g., OpCo, D.A.O. Grants Program, STEP).
  • The four rationales posted on March 2 fall outside the evaluated month (February).

Regarding the repetition of rationales:

  • It’s understandable if you don’t have anything new to add, especially if your vote remains unchanged from Snapshot. But the question is: Should the DAO incentivize statements like “I voted FOR this proposal in Tally for the same reasons as stated in this previous rationale”?
  • The same question applies to the rationale for “AIP: Timeboost + Nova Fee Sweep”: Should the DAO incentivize a statement like “I voted for in snapshot.”?
  • Again, we are not saying any of this is wrong—we are simply analyzing whether the DAO should be compensating for these actions.

Note: the rest of the interactions correspond to rationales that were already incentivized in January.

Considering this, how could the framework be adjusted to include delegates who only want to vote and justify their votes (even in just one line)?

Some options we are considering:

  • Create a fourth tier with a lower base compensation, accessible to both large delegates who only wish to vote/justify and smaller delegates whose contributions were insufficient for higher tiers.
  • Grant 5 points for completing all rationales, regardless of their quality. However, we see this as a suboptimal solution, as the compensation would be too high.
  • Lower the TP requirement from 65 to 60, though this would have other effects—such as making it sufficient for delegates with VP >4M to just vote, further exacerbating the issue mentioned earlier.

It is clear that in the long term the solution points elsewhere, perhaps to have two programs in one as previously suggested by other delegates:

One that includes large voters who contribute to the quorum, with basic compensation for tasks such as voting and justifying the vote, and another broader program that includes all types of contributors who can be compensated for their contributions to the DAO.

2 Likes

I would like to echo the comments already made here and add some of my thoughts. It is evident that many are concerned because SEED has become more demanding in recent months and the latest update has left many delegates out.

Firstly, I wish to address the subjectivity of the evaluation process in the program. This is an issue we all anticipated and understood to have no easy solution, and we agreed that SEED would be fair. To date, I believe they have performed reasonably well given the complexity of the problem. However, there are areas for improvement. For instance, I support @Curia’s suggestion to standardize scores; I have yet to see a comment rated 10/10, which indicates SEED’s high standards. It might be feasible to normalize the scores by treating the highest score each month as a 10 and adjusting the other scores accordingly.

Another significant issue is the evaluation rubric that SEED shared, purported to guide our assessments. It appears that factors such as Timing, Clarity and Communication, and Relevance are overlooked; the primary focus is on the depth of analysis. This aspect is highly subjective and tends to disproportionately influence the overall scoring, overshadowing the other criteria. This might not be problematic per se, but it would be helpful if SEED could provide more guidance, perhaps with an updated rubric that clarifies what exactly they are assessing. Although SEED has started providing individual reports with advice to DIP participants, I believe this approach is not sufficiently addressing the core of what we’re seeking.

Now, I’d like to propose two last somewhat radical ideas that might help reduce the subjectivity of evaluations: one solution could be to implement tools similar to SimScore for assessing comments, which would normalize scores across the board. Another challenging but potentially beneficial approach would be to anonymize comments during evaluation so that SEED cannot see who wrote them. This could mitigate any biases SEED might have towards highly recognized individuals in the ecosystem or, conversely, those with poorer reputations.

Lastly, the major issue with the latest update was the introduction of the VP multiplier. I have no objection to DF now having more influence, or to SEED becoming stricter: these changes encourage us to professionalize and put more effort into our comments. However, the real misstep was essentially penalizing smaller delegates for having less voting power. If we had not introduced the VP multiplier change, 11 delegates who significantly contribute to the DAO, such as Angela, ACI, Maxlomu, JuanRah, Ignas, among others, would have been included in this month’s incentives. A quick fix, as @jameskbh suggests, would be to introduce the VP multiplier as a positive value instead.

3 Likes

As of today, I would fully support something like this. We have three problems here:

  • having good new delegates → I think we are solving this organically. Some folks with 50k voting power are really great and appreciated by everybody. This builds through being here, voting, and discussing, and the DIP allows for it to happen
  • having big delegates voting → we are not solving for this right now. A big delegate faces a lot of hurdles to participating in governance, while with the current quorum we would be better off just paying everybody with 1, 2, or 4M votes (even a small amount) to simply vote, and maybe write a rationale, because the impact of 10–20M votes is at the moment too valuable to forgo in favour of “smart participation”
  • a redistribution of voting power to new delegates and to old delegates deemed good by the DAO → this problem is still not solved; let’s see what happens with ARB staking.

A separate DIP, which could go live in the next iteration, would solve problems 1 and 2. We will see whether by November we indeed have these problems or other ones.

I agree with many of the arguments, considering that I proposed them myself.
However, I don’t quite agree with this:

After all, under that approach, if the level of commenting by all delegates is poor, everyone would still receive a nominal maximum. If we agree that delegates’ analysis of proposals should remain at a high level, then we need to keep the bar high.

However, I do agree that it is strange that even delegates with consistently top-level comments (like L2Beat) do not receive the maximum.

Cross-posting this feedback given on the February results post.

We agree with other delegates’ feedback that, in practice, the implementation of the voting multiplier disproportionately affects smaller delegates. Taking our case as an example: our VP multiplier of 0.8 (since we have about 60K in delegation) is the main reason we did not qualify. If a positive multiplier, like @jameskbh suggested, had been applied (for instance, 1–1.2), we would have a multiplier at the lower bound (1), resulting in 70.81 TP and still qualifying for the program; instead we have 58.84.
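For clarity, here is how those figures can be reconstructed, assuming (as the numbers suggest) that the multiplier scales only the voting component of Total Points; the ~60-point voting score is inferred rather than taken from the report.

```python
# Reconstruction of the figures above, assuming the VP multiplier scales only
# the voting component of Total Points (TP); the ~60-point voting score is inferred.

voting_score = 59.85                     # approximate voting component implied by the figures
other_points = 70.81 - voting_score      # feedback and other points

tp_penalized = 0.8 * voting_score + other_points   # current 0.8 multiplier
tp_neutral   = 1.0 * voting_score + other_points   # lower bound of a 1-1.2 multiplier

print(round(tp_penalized, 2))  # ~58.84 -> below the 65-point threshold
print(round(tp_neutral, 2))    # 70.81  -> would qualify
```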

Additionally, we suggest including more feedback, both when responding to disputes and in DIP reports, to improve clarity in the scoring and to effectively help delegates seeking to improve their participation.

Thank you for sharing your thoughts and for illustrating the impact of the VP multiplier with your own data — it’s helpful for us to understand how the current system affects delegates of varying sizes.

Regarding the suggestion of applying a positive VP multiplier (e.g., 1–1.2 as proposed by @jameskbh), we’d like to clarify that while this may seem like a straightforward solution, it would require a broader recalibration of the voting participation (VP) scoring. If we were to apply a multiplier above 1, the base VP scoring would need to be adjusted downward to avoid inflating scores across the board. For example, in order for a 1.2x multiplier to bring someone to a score of 60, the underlying VP score would need to be as low as 50. This could introduce new imbalances and complexities in the scoring mechanism.
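To illustrate the trade-off being described (the values below are purely illustrative):

```python
# If the multiplier tops out above 1, the base voting score must shrink so the
# maximum stays at 60; otherwise scores inflate across the board.

max_voting_points = 60
top_multiplier = 1.2

recalibrated_base = max_voting_points / top_multiplier   # 50.0
print(recalibrated_base)            # a 1.2x delegate still reaches 60 ...
print(1.0 * recalibrated_base)      # ... while a 1.0x delegate now only reaches 50
```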

That said, we do want to acknowledge that the approach suggested by James is actively under consideration. We’re evaluating different weighting models that could better balance fairness and incentivization for smaller delegates without introducing distortions into the scoring system.

Thanks again for your constructive input — it’s invaluable as we work toward improving the Delegate Incentive Program.

3 Likes

you will hate me for this but…

regarding this [DIP v1.6] Delegate Incentive Program: Payment Distribution Thread - #6 by SEEDGov

in the same way you’ve added the note on the Safe for the last SeedGov payment transaction… it would be cool if you also labeled all transactions… so that info is in the Safe UI for all to see.

1 Like

given this [DIP v1.6] Delegate Incentive Program Results (March 2025) - #4 by paulofonseca

could we please change the cap system to be a general cap, instead of a per-tier cap?

The current system design, paired with the fact that the ARB price is down, actively disincentivizes delegates who do a better job.

So let’s define a dollar value per ARB, say $0.30 per ARB, where if ARB trades 10% below that price, everybody’s payout gets capped by 10%, proportionally.

Instead of the current system design, where those at the top of each tier get capped far more than others.
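To illustrate, here is one possible reading of that general cap, sketched out; the $0.30 reference price comes from this post, while the USD amounts, spot price, and function name are made up.

```python
# Sketch of the suggested general cap (one reading of the proposal above):
# payouts are owed in USD, converted to ARB at spot, and when ARB trades below
# a reference price everyone takes the same proportional haircut.

REF_PRICE_USD = 0.30

def capped_arb_payout(usd_owed: float, spot_price_usd: float) -> float:
    full_arb = usd_owed / spot_price_usd                 # full entitlement at spot
    haircut = min(1.0, spot_price_usd / REF_PRICE_USD)   # e.g. 0.9 when ARB is 10% below
    return round(full_arb * haircut, 2)

# ARB at $0.27 (10% below $0.30): everyone is trimmed by the same 10%,
# instead of only those at the top of each tier hitting a hard per-tier cap.
print(capped_arb_payout(usd_owed=3_000, spot_price_usd=0.27))  # 10000.0 ARB
print(capped_arb_payout(usd_owed=1_500, spot_price_usd=0.27))  # 5000.0 ARB
```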

1 Like

I also want to highlight the problem of calculating the Delegate Feedback score

Example: let’s imagine 2 delegates: A and B.

Delegate A made one review per month, each with the maximum score of 10. That is, his base points will be 10 + 10 + 10 + 10 + 10 = 50.
Delegate B made two reviews per month, one with the maximum score of 10 and the other a weak review rated 2. That is, his base points will be (10 + 2) / 2 + (10 + 2) / 2 + (10 + 2) / 2 + (10 + 2) / 2 + (10 + 2) / 2 = 30.

It turns out that both A and B made one maximum-score contribution, but the second delegate made one more on top of it, yet he will receive almost half as many points for his activity. We punish him for trying to do more than the other.

I think this is unfair, since delegate B spent more of his time reviewing proposals but will receive half as much.


Based on this example, I propose changing the scoring so that it does not take the arithmetic mean of all the comments made by the delegate.

Either take the maximum-scoring feedback, or ignore the few lowest-scoring ones; otherwise this will lead to a decrease in delegate activity.
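To make the comparison concrete, here is a small sketch of how the current average, a “take the maximum” rule, and an “ignore the lowest” rule would treat delegates A and B in a single month; the helper function is illustrative only.

```python
# Illustrative comparison of monthly DF calculations for the A/B example above.

def monthly_df(scores, method="mean", drop_lowest=0):
    if method == "max":
        return max(scores)
    kept = sorted(scores)[drop_lowest:] or [max(scores)]  # never drop everything
    return sum(kept) / len(kept)

delegate_a = [10]        # one top-scoring comment in the month
delegate_b = [10, 2]     # the same top comment plus one weak extra comment

print(monthly_df(delegate_a), monthly_df(delegate_b))                # 10.0 vs 6.0 (current average)
print(monthly_df(delegate_a, "max"), monthly_df(delegate_b, "max"))  # 10 vs 10 (take the best comment)
print(monthly_df(delegate_b, drop_lowest=1))                         # 10.0 (ignore the weakest one)
```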

4 Likes

Hi @SEEDGov team!

Thank you again for all the operational effort and depth of analysis you bring to each monthly report. Compiling voting metrics, forum engagement, feedback scoring, and bonus allocations is a huge undertaking, and we truly appreciate how you keep the program running and the process transparent. However, we also recognize that any scoring methodology can introduce unintended incentives or edge-case behaviors, and your current Delegate Feedback average may sometimes discourage moderately-confident contributions. With that in mind, we’d like to offer a few ideas for adjustments that preserve quality incentives while still encouraging thoughtful participation.

Current challenge

  • DF is the simple average of all scored comments. Even a few lower-scoring remarks pull down a delegate’s overall DF score, which can discourage sharing “good-but-not-perfect” comments/feedback/ideas.
  • Fear of lowering one’s average may lead delegates to speak up only when absolutely sure, reducing the total volume and diversity of feedback.

This is also something @cp0x pointed out.

Design goals

After recognizing the challenges above, we believe any refinement to the DF calculation should be guided by clear objectives that both protect the integrity of high-impact insights and nurture a broad, confident level of participation.

  • Ensure only a delegate’s strongest contributions drive their DF score, incentivizing high-impact insights.
  • Allow delegates to share well-reasoned ideas, even if they aren’t guaranteed “perfect”, without fear of penalization.
  • Foster a healthy volume and variety of perspectives by not unduly punishing occasional lower-scoring comments.

Comparison of score calculation methods

These are the potential methods for calculating the DF score, which fit the design goals provided.

| Method | Calculation | Pros | Cons |
| --- | --- | --- | --- |
| Current Average | Sum(scores) ÷ N | Extremely simple; every comment counts; easy to audit | Low-quality comments directly lower the average; discourages moderately confident posts |
| Top-k Mean | Sort scores descending, then average the best k (e.g. k = 4) | Rewards only your top insights; other comments don’t hurt | Requires at least k scored comments; ignores above-average feedback beyond k comments |
| Upper-Quantile Mean | Sort scores descending, average the top X% of N (e.g. top 30%) | Captures a broader slice of above-average feedback; filters noise | Needs enough total comments to fill the quantile; adds threshold complexity |

Top-k Mean and Upper-Quantile Mean are essentially the same approach—both methods focus solely on averaging a delegate’s highest-scoring comments. The distinction between “best k” and “top X %” is secondary to the core goal: reward top-tier insights while filtering out lower-scoring noise.

In Top-k Mean, if a delegate posts 10 scored comments in a month and we set k = 4, we would simply average the highest four scores and ignore the other six. That way, everyone is encouraged to produce their best insights without worrying that a less polished comment will drag them down.
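As a concrete illustration, here is a minimal sketch of both methods, using made-up scores for a single delegate’s month; the parameter values (k = 4, top 30%) match the examples above.

```python
# Minimal sketch of Top-k Mean and Upper-Quantile Mean over one delegate's
# scored comments for a month (scores below are made up).
import math

def top_k_mean(scores, k=4):
    best = sorted(scores, reverse=True)[:k]   # best k comments (fewer if not enough)
    return sum(best) / len(best)

def upper_quantile_mean(scores, top_fraction=0.30):
    n = max(1, math.ceil(len(scores) * top_fraction))   # top 30% of N, at least one comment
    best = sorted(scores, reverse=True)[:n]
    return sum(best) / n

scores = [9, 8, 8, 7, 6, 5, 5, 4, 3, 2]   # 10 scored comments
print(sum(scores) / len(scores))          # 5.7   current average
print(top_k_mean(scores))                 # 8.0   best 4 only
print(round(upper_quantile_mean(scores), 2))  # 8.33  top 30% (3 comments)
```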

We hope this helps, and are happy to collaborate on a pilot in a future cycle!

4 Likes