Human Judgment and Credible Neutrality in Onchain Reputation | by scottonchain | Coinmonks | Nov, 2023

Careful application of traditional metrics is essential for maintaining credible neutrality

Blockchain reputation scoring is gaining widespread interest. Coinbase’s Request for Builders from August 2023 includes a call for onchain reputation scoring. Octan Network publishes an algorithmic reputation score based on PageRank. Gitcoin Passport has its own reputation score, established through linking activities and accounts to an Ethereum address. Even recent KYC innovations like Coinbase Verifications can be viewed as a simple binary reputation score.

The emergence of a widely accepted reputation score will unlock new applications, for example:

  • It can facilitate innovation in DeFi and lending, similar to how credit scores are used in traditional finance.
  • It can enable a warning system for end users, similar to URL filtering, by alerting them to potential scams.

A meaningful reputation score achieves two key objectives:

  • Measurable Quality: The algorithm can be shown quantitatively to differentiate between trustworthy and untrustworthy entities.
  • Credible Neutrality: The algorithm’s design and implementation avoid any inherent bias for or against any specific group of individuals or entities.

These two objectives of a meaningful reputation score are inherently at odds with each other, as distinguishing good actors from bad actors relies on subjective assessments.

Consider TornadoCash, a contract that is technically reliable but has been deemed malicious by certain regulatory bodies. Some stakeholders consider the reputation of this address very low, while others consider it very high. There is no credibly neutral way to resolve this.

For a reputation scoring mechanism to be widely adopted, it must meet both of these competing objectives.

Translating Quality Measurement to Blockchain Reputation

Internet ranking algorithms utilize quality scores like NDCG (Normalized Discounted Cumulative Gain) and Precision@N to evaluate their effectiveness and measure iterative improvements.

A Precision@N metric, for instance, reports the proportion of highly ranked addresses that are deemed trustworthy. In the blockchain realm, this could include addresses deployed by reputable entities like Uniswap or Aave. It may also be advantageous to include addresses linked to reputable centralized exchanges like Coinbase, or long-standing addresses (EOAs) associated with individuals.

Initially, a set of 1000-2000 reputable addresses can be curated and reported by the algorithm’s creator. Over time, this set can be standardized with community input. A Precision@1000 metric is then calculated as the proportion of the top 1000 addresses that appear on the reputable list.
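The calculation itself is simple. Here is a minimal sketch in Python; the `scores` and `reputable` values are illustrative placeholders, not real onchain data:

```python
def precision_at_n(scores: dict[str, float], reputable: set[str], n: int) -> float:
    """Fraction of the top-n scored addresses that appear in the reputable set."""
    top_n = sorted(scores, key=scores.get, reverse=True)[:n]
    return sum(1 for addr in top_n if addr in reputable) / n

# Toy example: four addresses with reputation scores.
scores = {"0xaaa": 0.9, "0xbbb": 0.7, "0xccc": 0.4, "0xddd": 0.1}
reputable = {"0xaaa", "0xccc"}

print(precision_at_n(scores, reputable, 2))  # top-2 is 0xaaa, 0xbbb → 0.5
```

In practice N would be 1000 and the score dictionary would cover millions of addresses, but the metric reduces to this same ratio.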


Because entities like Uniswap employ numerous addresses, a large list can be an output of simple, easily understood rules. For example:

For a truth set, we treat as reputable all addresses deployed or owned by Uniswap, Aave, Coinbase, or Kraken, along with any EOA with an age greater than five years and more than 10 ETH in historical outgoing transactions.
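A rule like the one above translates directly into code. The sketch below is a hypothetical illustration: the record fields (`deployer`, `type`, `age_years`, `outgoing_eth`) are assumed, not a real dataset schema:

```python
REPUTABLE_DEPLOYERS = {"uniswap", "aave", "coinbase", "kraken"}

def is_reputable(addr: dict) -> bool:
    """Apply the example rule: reputable deployer, or an old, active EOA."""
    if addr.get("deployer") in REPUTABLE_DEPLOYERS:
        return True
    is_old_eoa = addr.get("type") == "EOA" and addr.get("age_years", 0) > 5
    return is_old_eoa and addr.get("outgoing_eth", 0) > 10

# Hypothetical address records.
addresses = [
    {"address": "0x1", "deployer": "uniswap"},
    {"address": "0x2", "type": "EOA", "age_years": 6, "outgoing_eth": 25},
    {"address": "0x3", "type": "EOA", "age_years": 2, "outgoing_eth": 100},
]
truth_set = {a["address"] for a in addresses if is_reputable(a)}
print(truth_set)  # 0x1 (Uniswap-deployed) and 0x2 (old, active EOA)
```

Because the rule is mechanical, anyone can audit it and reproduce the truth set from public chain data.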

An important criterion for such a rule to be widely accepted is that it does not make a subjective judgment about controversial entities. Such entities receive a score from the algorithm, but the evaluation of the algorithm is independent.

The Precision@N approach can also be used to evaluate the algorithm’s detection of untrustworthy addresses. By compiling a list of known malicious addresses, the metric can be calculated as the proportion of top-ranked addresses that are not malicious. Alternatively, the metric can be calculated as the proportion of malicious addresses found among the bottom-ranked addresses.
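Both variants are small modifications of the Precision@N computation. A sketch, with illustrative toy data:

```python
def precision_at_top_n_clean(scores: dict[str, float], malicious: set[str], n: int) -> float:
    """Fraction of the top-n addresses that are NOT on the malicious list."""
    top_n = sorted(scores, key=scores.get, reverse=True)[:n]
    return sum(1 for a in top_n if a not in malicious) / n

def malicious_at_bottom_n(scores: dict[str, float], malicious: set[str], n: int) -> float:
    """Fraction of the bottom-n addresses that ARE on the malicious list."""
    bottom_n = sorted(scores, key=scores.get)[:n]
    return sum(1 for a in bottom_n if a in malicious) / n

scores = {"0xa": 0.95, "0xb": 0.80, "0xc": 0.30, "0xd": 0.05}
malicious = {"0xd"}

print(precision_at_top_n_clean(scores, malicious, 2))  # top-2 are clean → 1.0
print(malicious_at_bottom_n(scores, malicious, 2))     # 1 of bottom-2 is malicious → 0.5
```

The first variant rewards keeping bad actors out of the top ranks; the second rewards pushing them to the bottom.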

The Question of Credible Neutrality

While metrics like Precision@N provide an objective quality measure for reputation scoring, they raise questions about credible neutrality. This concept has become fundamental to blockchain research and development, and it can be considered a requirement for any widely adopted mechanism. Vitalik Buterin’s 2020 blog post introduced credible neutrality as follows:

A mechanism is credibly neutral if just by looking at the mechanism’s design, it is easy to see that the mechanism does not discriminate for or against any specific people.

The central question is: If a truth set for a metric like Precision@N is human-curated, does that compromise credible neutrality of the algorithm?

Human Judged Test Sets and Credible Neutrality

A human-curated truth set for a metric like Precision@N would raise concerns about credible neutrality if it were used to train the algorithm. However, we can ensure credible neutrality by requiring that the truth set is only used for evaluation, not for training. This means that the algorithm’s mechanism is not biased by the human curation process.

As a result, the algorithm itself can be considered credibly neutral. It is designed to treat all addresses fairly and objectively, regardless of their perceived trustworthiness or maliciousness as reported in the truth set.

Certain more advanced considerations carry over from traditional ranking to blockchain reputation scoring, providing additional safeguards for credible neutrality. For example, in internet ranking, it’s common to separate truth sets into “validation” sets and “test” sets. The developer optimizes the algorithm against the “validation” set, but the final quality score is reported from the “test” set. This guards against the erosion of credible neutrality that can arise from tuning the algorithm directly to the truth set.
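A validation/test split of the curated truth set can be as simple as a seeded random partition. A minimal sketch; the split fraction and seed are arbitrary choices, not prescribed values:

```python
import random

def split_truth_set(truth_set: list[str], test_fraction: float = 0.5, seed: int = 42):
    """Randomly partition the curated truth set into validation and test halves."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible/auditable
    shuffled = truth_set[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]  # (validation, test)

# Hypothetical truth set of 1000 addresses.
truth_set = [f"0x{i:03x}" for i in range(1000)]
validation, test = split_truth_set(truth_set)
print(len(validation), len(test))  # 500 500

# Tune the algorithm only against `validation`; report the final
# Precision@N only on the held-out `test` set.
```

Publishing the seed and the split procedure lets third parties verify that the reported test score was not cherry-picked.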


As blockchain technology continues to bring new users, the need for reliable and unbiased reputation scoring becomes increasingly apparent. Achieving a truly meaningful reputation score requires a delicate balance between measurable quality and credible neutrality.

A framework for objective quality metrics, such as Precision@N, provides a valuable tool for evaluating the effectiveness of reputation scoring algorithms. These metrics measure the ability to distinguish between trustworthy and malicious addresses. However, it is crucial to ensure that these metrics are applied in a way that preserves credible neutrality.

By limiting the use of human-curated truth sets to evaluation rather than training or implementation, we can ensure that the reputation scoring algorithm remains objective and unbiased. This approach allows us to leverage human judgment while maintaining the algorithm’s neutrality.