Building Blocks and Reputation Tips

By now you should feel fairly conversant in the lingua franca (the graphical grammar presented in Chapter_2) of reputation systems, and you've had some exposure to their constituent bits and pieces. We've gone over reputation statements, messages, and processes, and you've become familiar with some rudimentary-but-serviceable models.

In this chapter, we'll “level up” and explore reputation claims in greater detail. We'll describe a taxonomy of claim types and explore reputation roll-ups: the actual computations on incoming messages that produce a particular output. Functionally, different types of roll-ups yield very different types of reputation, so we'll offer guidance on when to use which. We'll end the chapter with practical advice in a section of “practitioner's tricks.”

Extending the Grammar: Building Blocks

Though understanding the elements of the reputation grammar is essential, building reputation models from the atoms up every time is time consuming and tedious. There's a lot of benefit in developing tools and shorthand for common patterns of configuring reputation statements, and templates for well-understood reputation models.

The Data: Claim Types

Remember, a fundamental component of a reputation statement is the claim: the assertion of quality that a source makes about a target. In Chap_1-The_Reputation_Statement, we discussed how claims can be either explicit (a direct statement of quality, intended by the statement's source to act as such) or implicit (representing a source-user's concrete actions associated with a target entity). These fundamentally different approaches are important because the combination of implicit and explicit claims can yield some very nuanced and robust reputation models. In other words: we should pay attention to what people say, but we should give equal weight to what they do, to determine which entities hold the community's interest.

Claims can also be of different types. One helpful distinction is between qualitative claims (claims that describe one or more qualities, which may or may not be easily measured) and quantitative claims (claims that can be measured and, in fact, are largely generated, communicated, and read back as numbers of some kind).

Reputation statements have claim values, which you can generally think of as “what you get back when you ask for a reputation's current state.” So, for instance, we can always query the system for Movies.Review.Overall.Average and get back a normalized score within the range of 0-1.

Note that the format of a claim does not always map exactly to the format in which you may wish to display (or, for that matter, gather) that claim. It's more likely that you'd want to translate the Movies.Review.Overall.Average to show your users 3 colored stars (out of 5) instead of a normalized score or percentage.

Qualitative Claim Types

Qualitative claims attempt to describe some quality of a reputable object. This quality may be as general as the object's overall “quality” (“This is an excellent restaurant!” ) or as specific as some particular dimension or aspect of the entity. (“The cinematography was stunning!” ) Generally, qualitative claim types are fuzzier than hard quantitative claims, so qualitative claims quite often end up being useful implicit claims.

This is not to say, however, that qualitative claims can't have a quantitative value when considered en masse: almost any claim type can at least be counted and displayed in the form of some simple cumulative score (or “aggregator”; we discuss the various reputation roll-ups below in Chap_3-Roll-Ups). So while we can't necessarily assign an evaluative score to a user-contributed text comment, for instance (at least not without the rest of the community involved), it's quite common on the Web to see a count of the number of comments left about an entity, as a crude indicator of that item's popularity or the level of interest it draws.

Here are some common types of qualitative claims.

Text Comments

User-contributed text comments are perhaps the most common, defining feature of user-generated content on the Web. Though debate rages about the value of such comments (which in any case differs from site to site and from community to community), no one denies that the ability to leave text comments about an item of content, whether it's a blog entry, an article, or a YouTube video, is a wildly popular form of expression on the Web.

Text comment fields typically are provided as a freeform means of expression: a little white box that users can fill in any way they choose. However, better social sites will attempt to direct comments by providing guidelines or suggestions on what may be considered on- or off-topic.

Users' comments are usually freeform (unstructured) textual data. They typically are character-constrained in some way, though the constraints vary depending on the context: the character allowance for a message board posting is generally much greater than Twitter's famous 140-character limit.

In comment fields, you can choose whether to accommodate rich-text entry and display, and you can apply certain content filters to comments up front (for instance, you can choose to prohibit profanity or disallow fully formed URLs).

Comments are often just one component of a larger compound reputation statement. Movie reviews, for instance, typically are a combination of 5-star qualitative claims (and perhaps different ones for particular aspects of the film) and one or more freeform comment-type claims.

Comments are powerful reputation claims when interpreted by humans, but they may not be easy for automated systems to evaluate. The best way to evaluate text comments varies depending on the context. If a comment is just one component of a user review, the comment can contribute to a “completeness” score for that review: reviews with comments are deemed more complete than those without (and, in fact, the comment field may be required for the review to be accepted at all).

If the comments in your system are directed at another contributor's content (for example, user comments about a photo album or message board replies to a thread), consider evaluating comments as a measure of interest or activity around that reputable entity.

Here are examples of claims in the form of text comments:

  • Flickr's Interestingness algorithm likely accounts for the rate of commenting activity on a photo when evaluating its quality.
  • On Yahoo! Local, it's possible to give an establishment a full review (with star ratings, freeform comments, and bar ratings for subfacets of a user's experience with the establishment). Or a user can simply leave a rating of 1 to 5 stars. (This option encourages quick engagement with the site.) It's easy to see that there's greater business value (and utility to the community) in full reviews with well-written text comments (provided Yahoo! Local tracks the value of the reviews internally).

In our research at Yahoo! we often probed notions of authenticity to look at how readers interpret the veracity of a claim or evaluate the authority or competence of a claimant.

We wanted to know: when people read reviews online (or blog entries, or tweets), what are the specific cues that make them more likely to accept what they're reading as accurate? Is there something about the presentation of material that makes it more trustworthy? Or is it the way the content author is presented? (Does an “expert” badge convince anyone?)

Time and again, we found that it's the content itself-the review, entry, or comment being evaluated-that makes readers' minds up. If an argument is well stated, if it seems reasonable, and if readers can agree with some aspect of it, then readers are more likely to trust the content-no matter what meta-embellishment or framing it's given.

Conversely, research shows that users don't see poorly written reviews with typos or shoddy logic as coming from legitimate or trustworthy sources. People really do pay attention to content.

Media Uploads

Reputation value can be derived from other types of qualitative claim types besides just freeform textual data. Any time a user uploads media-either in response to another piece of content (see Figure_3-1 ), or as a subcomponent of the primary contribution itself-that activity is worth noting as a claim type.

We distinguish textual claims from other media for two reasons:

  1. While text comments typically are entered in context (users type them right into the browser as they interact with your site), media uploads usually require a slightly deeper level of commitment and planning on the user's part. For example, a user might need to use an external device of some kind and edit the media in some way before uploading it. Therefore…
  2. You may want to weight these types of contributions differently from text comments (or not, depending on the context) to reflect their increased contribution value.

Figure_3-1: Video Responses to a YouTube video may boost its interest reputation.

Media uploads are qualitative claim types that are not textual in nature. They include:

  • Video
  • Images
  • Audio
  • Links
  • Collections of any of the above

When a media object is uploaded in response to another content submission, consider it as input indicating the level of activity related to the item or the level of interest in it.

When the upload is an integral part of a content submission, factor its presence, absence, or level of completion into the quality rating for that entity.

Here are examples of claims in the form of media uploads:

  • Since YouTube video responses require extra effort by the contributors and lead to viewers spending more time on the site, they should have a larger influence on the popularity rank than simple text comments.
  • A restaurant review site may attribute greater value to a review that features uploaded pictures of the reviewer's meal: it makes for a compelling display and gives a more well-rounded view of the reviewer's dining experience.

Relevant External Objects

A third type of claim is the presence or absence of inputs that are external to a reputation system. Reputation-based search relevance algorithms (which, again, lie outside the scope of this book) such as Google PageRank rely heavily on this type of claim.

A common format for such a claim is a link to an externally reachable and verifiable item of supporting data. This approach includes embedding Web 2.0 media widgets into other claim types, such as text comments.

When an external reference is provided in response to another content submission, consider it as input indicating the level of activity related to the item or the level of interest in it.

When the external reference is an integral part of a content submission, factor its presence or absence into the quality rating or level of completion for that entity.

Here are examples of claims based on external objects:

  • Some shopping review sites encourage cross-linking to other products or offsite resources as an indicator of review completeness: cross-linking demonstrates that the review author has done her homework and fully considered all options.
  • On blogs, the trackback feature originally had some value as an externally verifiable indicator of a post's quality or interest level. (Sadly, however, trackbacks have been a highly gamed spam mechanism for years.)

Quantitative Claim Types

Quantitative claims are the nuts and bolts of modern reputation systems, and they're probably what you think of first when you consider ways to assess or express an opinion about the quality of an item. Quantitative claims can be measured (by their very nature, they are measurements). For that reason, computationally and conceptually, they are easier to incorporate into a reputation system.

Normalized Value

Normalized value is the most common type of claim in reputation systems. A normalized value is always expressed as a floating-point number in a range from 0.0 to 1.0. Within the range of 0.0 to 1.0, closer to 0 is worse, closer to 1 is better. Normalization is a best practice for handling claim values because it provides ease of interpretation, integration, debugging, and general flexibility. A reputation system rarely, if ever, displays a normalized value to users. Instead, normalized values are de-normalized into a display format that is appropriate for the context of your application (they may be converted back to stars, for example).

One strength of normalized values is their general flexibility. They are the easiest of all quantitative types to perform math operations on; they are the only quantitative claim type that is finitely bounded; and they allow reputation inputs gathered in a number of different formats to be normalized with ease (and then de-normalized back to a display-specific form suitable for the context you want to display in).

Another strength of normalized value is the general utility of the format: normalizing data is the only way to perform cross-object and cross-reputation comparisons with any certainty. (Do you want your application to display “5-star restaurants” alongside “4-star hotels” ? If so, you'd better normalize those scores somewhere.)

Normalized values are also highly readable: because the bounds of a normalized score are already known, they are very easy (for you, the system architect, or others with access to the data) to read at a glance. With normalized scores, you do not need to understand the context of a score to be able to understand its value as a claim. Very little interpretation is needed.

Rank Value

A rank value is a unique positive integer. A set of rank values is limited to the number of targets in a bounded set of targets. For example, given a data set of “100 Movies from the Summer of 2009,” it is possible to have a ranked list in which each movie has exactly one value.

Here are some examples of uses for rank values:

  • Present claims for large collections of reputable entities: for example, quickly construct a list of the top 10, 20, or 100 objects in a set. One common pattern is displaying leaderboards.
  • Compare like items one-to-one, common on electronic product sales sites, like Shopping.com.
  • Build a ranked list of objects in a collection, as with Amazon's sales rank.

Scalar Value

When you think of scalar rating systems, we'd be surprised if-in your mind-you're not seeing stars. Rating systems of 3, 4, and 5 stars abound on the Web and have achieved a level of semipermanence in reputation systems. Perhaps that's because of the ease with which users can engage with star ratings-choosing a number of stars is a nice way to express an opinion beyond simple like or dislike.

More generally, a scalar value is a type of reputation claim in which a user gives an entity a “grade” somewhere along a bounded spectrum. The spectrum may be finely delineated and allow for many gradations of opinion (10-star ratings are not unheard of), or it may be binary (for example, thumbs-up/thumbs-down). Common scalar formats include:

  • Star ratings (3-, 4-, and 5-star scales are common)
  • Letter grade (A, B, C, D, F)
  • Novelty-type themes (“4 out of 5 cupcakes” )

Yahoo! Movies features letter grades for reviews. The overall grades are calculated using a combination of professional reviewers' scores (which are transformed from a whole host of different claim types, from the New York Times letter-grade style to the classic Siskel and Ebert thumbs-up/thumbs-down format) and Yahoo! user reviews, which are gathered on a 5-star system.
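
To make the combination concrete, here's a minimal sketch in Python of reducing such disparate claim formats to a common normalized value before averaging. The mapping tables and the equal weighting are our invented assumptions for illustration, not Yahoo! Movies' actual formula.

    # Hypothetical transform tables: each claim format maps to 0.0-1.0.
    LETTER_GRADES = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}
    THUMBS = {"up": 1.0, "down": 0.0}

    def normalize_stars(stars, scale=5):
        # Map a 1..scale star rating linearly onto 0.0-1.0.
        return (stars - 1) / (scale - 1)

    def overall_grade(letter, thumb, stars):
        # Equal weighting is an assumption; a real model would tune it.
        scores = [LETTER_GRADES[letter], THUMBS[thumb], normalize_stars(stars)]
        return sum(scores) / len(scores)

    print(overall_grade("B", "up", 4))  # 0.833...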

Processes: Computing Reputation

Every reputation model is made up of inputs, messages, processes, and outputs. Processes perform various tasks. In addition to creating roll-ups, in which interim results are calculated, updated, and stored, processes include transformers, which change data from one format to another, and routers, which handle input, output, and the decision making needed to direct traffic among processes. In reputation model diagrams, individual processes are represented as discrete boxes, but in practice the implementation of a process in an operational system often combines multiple roles. For example, a single process may take input, do a complex calculation, send the result as a message to another process, and perhaps return the value to the calling application, terminating that branch of the reputation model.

Processes are activated only when they receive an input message.

Roll-ups: Counters, Accumulators, Averages, Mixers, and Ratios

A roll-up process is the heart of any reputation system-it's where the primary calculation and storage of reputation statements are performed. Several generic kinds of roll-ups serve as abstract templates for the actual customized versions in operational reputation systems. Each type-counter, accumulator, average, mixer, and ratio-represents the most common simple computational unit in a model. In actual implementations, additional computation is almost always integrated with these simple patterns.

All processes receive one or more inputs, which consist of a reputation source, a target, a contextual claim name, and a claim value. In the diagrams below, unless otherwise stated, the input claim value is a normalized score. All processes that generate a new claim value, such as roll-ups and transformers, are assumed to be able to forward the new claim value to another process, even if that capability is not indicated on the diagram. By default in roll-ups, the resulting computed claim value is stored in a reputation statement by the aggregate source. A common pattern for naming the aggregate claim is to concatenate the claim context name (Movies_Acting) with a roll-up context name (Average). For example, the roll-up of many Movies_Acting_Ratings is the Movies_Acting_Average.
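
As a purely illustrative sketch, an input message carrying these items might be represented like this in Python (the field values are our own, chosen to match the naming pattern above):

    from dataclasses import dataclass

    @dataclass
    class ReputationMessage:
        source: str         # who makes the claim, e.g. "user:1234"
        target: str         # what the claim is about, e.g. "movie:42"
        claim_name: str     # contextual claim name
        claim_value: float  # normalized score in 0.0-1.0
        timestamp: float    # when the claim was made

    msg = ReputationMessage("user:1234", "movie:42",
                            "Movies_Acting_Rating", 0.8, 1234567890.0)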

Simple Counter

A simple counter roll-up, Figure_3-2, adds one to a stored numeric claim representing all the times that the process received any input.

Figure_3-2: A Simple Counter process does just what you'd expect-as inputs come in, it counts them and stores the result.

A simple counter roll-up ignores any supplied claim value. When it receives an input message, it reads (or creates) the CountOfInputs, adds one, and stores the result as the claim value for this process.
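
In code, the pattern is as minimal as the diagram suggests. Here's a sketch in Python, with an in-memory attribute standing in for the stored reputation statement:

    class SimpleCounter:
        """Counts input messages; the claim value of each input is ignored."""

        def __init__(self):
            self.count_of_inputs = 0  # the stored claim value

        def on_input(self, message=None):
            self.count_of_inputs += 1
            return self.count_of_inputs  # may be forwarded onward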

Here are pros and cons of using a simple counter roll-up:

Pros:

  • Counters are simple to maintain and can easily be optimized for high performance.

Cons:

  • A simple counter affords no way to recover from abuse. If abuse occurs, see Chap_3-Reversible_Counter.
  • Counters increase continuously over time, which tends to deflate the value of individual contributions. See Chap_3-Bias_Freshness_and_Decay.
  • Counters are more subject than any other process to first-mover effects (see Chap_3-First_Mover_Effects), especially when they are used in public reputation scores and leaderboards.

Reversible Counter

Like a simple counter roll-up, a reversible counter roll-up ignores any supplied claim value. When it receives an input message, it either adds or subtracts one from a stored numeric claim, depending on whether a stored claim already exists for this source and target.

Reversible counters, Figure_3-3 , are useful when there is a high probability of abuse (perhaps because of commercial incentive benefits such as contests-see Chap_5-Commercial_Incentives ) or when you anticipate the need to rescind inputs by users or the application for other reasons.

Figure_3-3: A Reversible Counter also counts incoming inputs, but it remembers them so that they (and their effects) may be undone later. Trust us, this can be very useful.
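
One way to implement the pattern, sketched in Python; the set of (source, target) pairs stands in for the per-input reputation statements that a real system would keep in its database:

    class ReversibleCounter:
        def __init__(self):
            self.count = 0
            self.stored = set()  # one (source, target) entry per statement

        def on_input(self, source, target):
            key = (source, target)
            if key in self.stored:
                self.stored.remove(key)  # rescind the earlier contribution
                self.count -= 1
            else:
                self.stored.add(key)     # remember it for possible reversal
                self.count += 1
            return self.count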

Here are pros and cons of using a reversible counter roll-up:

Pros:

  • Reversible counters are easy to understand.
  • Individual contributions can be reversed automatically, allowing for correction of abusive input and bugs.
  • Reversible counters allow for individual inspection of source activity across targets.

Cons:

  • A reversible counter scales with the database transaction rate, which makes it at least twice as expensive as a Chap_3-Simple_Counter.
  • Reversible counters require the equivalent of keeping a log file for every event.
  • Counters increase continuously over time, which tends to deflate the value of individual contributions. See Chap_3-Bias_Freshness_and_Decay.
  • Counters are more subject than any other process to first-mover effects (see Chap_3-First_Mover_Effects), especially when they are used in public reputation scores and leaderboards.

Simple Accumulator

A simple accumulator roll-up, Figure_3-4, adds a single numeric input value to a running sum that is stored in a reputation statement.

Figure_3-4: A Simple Accumulator process adds arbitrary amounts and stores the sum.

Here are pros and cons of using a simple accumulator roll-up:

Pros:

  • A simple accumulator is as simple as it gets; the sums of related targets can be compared mathematically for ranking.
  • Storage overhead for simple claim types is low; the system need not store each user's inputs.

Cons:

  • Older inputs can have disproportionately high value.
  • A simple accumulator affords no way to recover from abuse. If abuse occurs, see Chap_3-Reversible_Accumulator.
  • If both positive and negative values are allowed, comparison of the sums may become meaningless.

Reversible Accumulator

A reversible accumulator roll-up, Figure_3-5, either (1) stores and adds a new input value to a running sum or (2) undoes the effects of a previous addition. Consider using a Chap_3-Reversible_Accumulator if you would otherwise use a Chap_3-Simple_Accumulator but want the option either to review how individual sources contribute to the sum or to undo the effects of buggy software or abusive use. However, if you expect a very large amount of traffic, you may want to stick with a Chap_3-Simple_Accumulator: storing a reputation statement for every contribution can be prohibitively database-intensive if traffic is high.

Figure_3-5: A Reversible Accumulator process improves on the Simple model-it remembers inputs so they may be undone.

Here are pros and cons of using a reversible accumulator roll-up:

Pros:

  • Individual contributions can be reversed automatically, allowing for correction of abusive input and bugs.
  • Reversible accumulators allow for individual inspection of source activity across targets.

Cons:

  • A reversible accumulator scales with the database transaction rate, which makes it at least twice as expensive as a Chap_3-Simple_Accumulator.
  • Older inputs can have disproportionately high value.
  • If both positive and negative values are allowed, comparison of the sums may become meaningless.

Simple Average

A simple average roll-up, Figure_3-6, calculates and stores a running average, incorporating each new input.

Figure_3-6: A Simple Average process keeps a running total and count for incremental calculations.

The simple average roll-up is probably the most common basis for reputation scores. It calculates the mathematical mean of the history of inputs. Its components are a SumOfInputs, a CountOfInputs, and the process claim value: AvgOfInputs.
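
Keeping the sum and count alongside the average is what makes the roll-up incremental: each input is folded in without revisiting the full history. A minimal Python sketch:

    class SimpleAverage:
        def __init__(self):
            self.sum_of_inputs = 0.0
            self.count_of_inputs = 0
            self.avg_of_inputs = 0.0  # the process claim value

        def on_input(self, claim_value):
            self.sum_of_inputs += claim_value
            self.count_of_inputs += 1
            self.avg_of_inputs = self.sum_of_inputs / self.count_of_inputs
            return self.avg_of_inputs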

Here are pros and cons of using a simple average roll-up:

Pros:

  • Simple averages are easy for users to understand.

Cons:

  • Older inputs can have disproportionately high value compared to the average. See Chap_3-First_Mover_Effects.
  • A simple average affords no way to recover from abuse. If abuse occurs, see Chap_3-Reversible_Average.
  • Most systems that compare ratings using simple averages suffer from Chap_3-Ratings_Bias_Effects and have uneven rating distributions.
  • When simple averages are used to compare ratings and an average has very few components, it doesn't accurately reflect group sentiment. See Chap_3-Low_Liquidity_Effects.

Reversible Average

A reversible average, Figure_3-7, is a reversible version of the simple average: it keeps a reputation statement for each input and optionally uses it to reverse the effects of that input.

Figure_3-7: A Reversible Average process remembers inputs so they may be undone.

If a previous input exists for this context, the reversible average operation reverses it: the previously stored claim value is removed from the AverageOfInputs, the CountOfInputs is decremented, and the source's reputation statement is destroyed. If there is no previous input for this context, the roll-up computes a simple average (see Chap_3-Simple_Average) and stores the input claim value in a reputation statement by this source for the target with this context.
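
Read literally, the operation either reverses a prior input or folds in a new one. Here's a Python sketch of that reading, again with a dictionary standing in for the stored reputation statements:

    class ReversibleAverage:
        def __init__(self):
            self.sum = 0.0
            self.count = 0
            self.stored = {}  # (source, target) -> stored claim value

        def on_input(self, source, target, claim_value):
            key = (source, target)
            if key in self.stored:
                # Reverse: remove the old value, destroy the statement.
                self.sum -= self.stored.pop(key)
                self.count -= 1
            else:
                # Normal case: fold the value in and store it for reversal.
                self.stored[key] = claim_value
                self.sum += claim_value
                self.count += 1
            return self.sum / self.count if self.count else 0.0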

Here are pros and cons of using a reversible average roll-up:

Pros:

  • Reversible averages are easy for users to understand.
  • Individual contributions can be reversed automatically, allowing for correction of abusive input and bugs.
  • Reversible averages allow for individual inspection of source activity across targets.

Cons:

  • A reversible average scales with the database transaction rate, which makes it at least twice as expensive as a Chap_3-Simple_Average.
  • Older inputs can have disproportionately high value compared to the average. See Chap_3-First_Mover_Effects.
  • Most systems that compare ratings using averages suffer from Chap_3-Ratings_Bias_Effects and have uneven rating distributions.
  • When reversible averages are used to compare ratings and an average has very few components, it doesn't accurately reflect group sentiment. See Chap_3-Low_Liquidity_Effects.

Mixer

A mixer roll-up, Figure_3-8 , combines two or more inputs or read values into a single score according to a weighting or mixing formula. It's preferable, but not required, to normalize the input and output values. Mixers perform most of the custom calculations in complex reputation models.

Figure_3-8: A Mixer combines multiple inputs together and weights each.
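
A weighted-sum mixer can be a one-liner. In this Python sketch the inputs are assumed to be normalized, and the weights (which should sum to 1.0) are invented for illustration:

    def mix(weighted_scores):
        # Each element is a (weight, normalized_score) pair.
        return sum(w * s for w, s in weighted_scores)

    # e.g. 60% average rating, 30% review completeness, 10% comment activity
    quality = mix([(0.6, 0.82), (0.3, 0.50), (0.1, 1.00)])  # -> 0.742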

Simple Ratio

A simple ratio roll-up, Figure_3-9, counts the number of inputs (the total), separately counts the number of times the input has a value of exactly 1.0 (for example, hits), and stores the result as a text claim with the value “(hits) out of (total)”.

Figure_3-9: A Simple Ratio process keeps running sums and counts.
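
A Python sketch of the ratio roll-up, producing the text claim described above:

    class SimpleRatio:
        def __init__(self):
            self.hits = 0
            self.total = 0

        def on_input(self, claim_value):
            self.total += 1
            if claim_value == 1.0:  # only an exact 1.0 counts as a hit
                self.hits += 1
            return f"{self.hits} out of {self.total}"  # the text claim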

Reversible Ratio

If the source already has a stored input value for a target, a reversible ratio roll-up, Figure_3-10 , reverses the effect of the previous hit. Otherwise, this roll-up counts the total number of inputs (the total) and separately counts the number of times the input has a value of exactly 1.0 (hits). It stores the result as a text claim value of “(hits) out of (total)” and also stores the source's input value as a reputation statement for possible reversal and retrieval.

Figure_3-10: A Reversible Ratio process remembers inputs so they may be undone.

Transformers: Data Normalization

Data transformation is essential in complex reputation systems, in which information enters a model in many different forms. For example, consider an IP address reputation model for a mail system: it might accept this-email-is-spam votes from users, incoming traffic rates to the mail server, and a historical karma score for the user submitting each vote. Each of these values must be transformed into a common numerical range before being combined.

Furthermore, it may be useful to represent the result in a discrete Spammer/DoNotKnowIfSpammer/NotSpammer category. In this example, transformation processes, Figure_3-11 , do both the normalization and de-normalization.

Figure_3-11: Transformers normalize and de-normalize data. They are not usually independent processes.

Simple Normalization (and Weighted Transform)

Simple normalization is the process of converting a (usually scalar) score to the normalized range of 0.0 to 1.0. These transforms are often custom built and typically accomplished with functions and mapping tables.

Scalar Denormalization

Scalar denormalization is the process of converting normalized values into a regular scale, such as bronze/silver/gold, a number of stars, or a rounded percentage. These transforms are often custom built and typically accomplished with functions and mapping tables.
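
Both directions usually amount to a lookup table or a few thresholds. A small Python sketch, with the table and tier boundaries invented for illustration:

    # Normalize: 5-star scalar input -> 0.0-1.0, via a table.
    STARS_TO_NORMALIZED = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}

    # Denormalize: 0.0-1.0 -> medal tier, via thresholds.
    def to_medal(score):
        if score >= 0.8:
            return "gold"
        if score >= 0.5:
            return "silver"
        return "bronze"

    assert to_medal(STARS_TO_NORMALIZED[4]) == "silver"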

External Data Transform

An external data transform is a process that accesses a foreign database and converts its data into a locally interpretable score, usually normalized. The McAfee transformation shown in Figure_2-8 is a table-based transformation from external data to a reputation statement with a normalized score. What makes an external data transformer unique is that, because retrieving the original value is often a network operation or computationally expensive, it may be executed implicitly on demand, periodically, or only when it receives an explicit request from some external process.

Routers: Messages, Decisions, and Termination

Besides calculating the values in a reputation model, there is important meaning in the way a reputation system is wired internally and back to the application: connecting the inputs to the transformers to the roll-ups to the processes that decide who gets notified of whatever side effects the calculation indicates. This wiring is accomplished with a class of building blocks called routers: message delivery patterns, decision points, and terminators that determine the flow through the model as it executes.

Common Decision Process Patterns

We've described the process types above as pure primitives, but we don't mean to imply that your reputation processes can't or shouldn't be combinations of the various types. It's completely normal to have a simple accumulator that applies mixer semantics.

There are several common decision process patterns that change the flow of messages into, through, and out of a reputation model: evaluators, terminators and message routers of various types and combinations.

Simple Terminator

The simple terminator process is one that does not send any message to another reputation process, ending the execution of this branch of the model. Optionally, a terminator may return its claim value to the application, via a function return, a reply message, or a signal to the application environment.

Simple Evaluator

A simple evaluator process provides the basic “if … then …” statement of reputation models. It usually compares two inputs and sends a message on to one or more other processes. Remember that the inputs may arrive asynchronously and separately, so the evaluator may need to keep its own state.

Terminating Evaluator

A terminating evaluator ends the execution path started by the initial input, usually by returning or sending a signal to the application when some special condition or threshold has been met.

Message Splitter

A message splitter, Figure_3-12 , replicates a message and forwards it to more than one model event process. This operation starts multiple simultaneous execution paths for one reputation model, depending on the specific characteristics of the reputation framework implementation. See Appendix_A .

Figure_3-12: A message coming from a process may split and feed into two or more downstream processes.

Conjoint Message Delivery

Conjoint message delivery, Figure_3-13, describes the pattern in which messages from multiple different input sources are delivered to one process that treats them all as having exactly the same meaning. For example, in a very large-scale system, multiple servers may send reputation input messages to a shared reputation system environment reporting on user actions: it doesn't matter which server sent the message; the reputation model treats them all the same way. This is drawn as two message lines joining into one input on the left side of the process box.

Figure_3-13: Conjoint message paths are represented by merging lines. These two different kinds of inputs will be evaluated in exactly the same way.

Input

Reputation models are effectively dormant when inactive; the model we present in this book doesn't require any persistent processes. Based on that assumption, a reputation model is activated by a specific input arriving as a message to the model. Input gets the ball rolling. Depending on the requirements of custom reputation processes, there can be many different forms of input, but a few basic input patterns provide the common structure.

Typical Inputs

Normally, every message to a reputation process must contain several items: the source, the target, and an input value. Often, the contextual claim name and other values, such as a time stamp and a reputation process ID, also are required for the reputation system to initialize, calculate, and store the required state.

Reputation Statements as Input

Our diagramming convention shows reputation statements as inputs. That's not always strictly accurate; it's shorthand for the common method in which the application creates a reputation statement and passes a message containing the statement's context, source, claim, and target to the model. Don't confuse this notational convention with the case in which a reputation statement is the target of an input message, which is always represented as an embedded miniature version of the target reputation statement. See Chap_2-Reputation_Targets.

Periodic Inputs

Sometimes reputation models are activated on the basis of an input that's not reputation based, such as a timer that will perform an external data transform. At present, this grammar provides no explicit mechanism for reputation models to spontaneously wake up and begin executing, and this has an effect on mechanisms such as those detailed in Chap_3-Decay . So far, in the authors' experience, spontaneous reputation model activation is not necessary and keeping this constraint out has simplified high-performance implementations. However, there is no particular universal requirement for this limitation.

Output

Many reputation models terminate without explicitly returning a value to the application at all. Instead, they store the output asynchronously in reputation statements. The application then retrieves the results as reputation statements as they are needed-always getting the best possible result, even if it was generated as the result of some other user on some other server in another country.

Return Values

Simple reputation environments, in which the entire model executes serially, in-line with the actions that feed it, usually use request-reply semantics: the reputation model runs for exactly one input at a time and runs until it terminates by returning a copy of the roll-up value that it calculated. Large-scale, asynchronous reputation frameworks, such as the one described in Appendix_A, don't return results in this way. Instead, they terminate silently and sometimes send signals (see below).

Signals: Breaking Out of the Reputation Framework

Sometimes a reputation model needs to notify the application environment that something significant has happened and special handling is required. To accomplish this, the process sends a signal: a message that breaks out of the reputation framework. The mechanism of signaling is specific to each framework implementation, but in our diagramming grammar, signaling is always represented by an arrow leaving the box.

Logging

A reputation logging process provides a specialized form of output: it records a copy of the current score or message in an external store, typically using an asynchronous write. This action is usually the result of an evaluator deciding that a significant event requires special output. For example, if a user's karma score has reached a new threshold, an evaluator may decide that the hosting application should send the user a congratulatory message.

Practitioner's Tips: Reputation Is Tricky

When you begin designing a reputation model and system using our graphical grammar, it may be tempting to take elements of the grammar and just plug them together in the simplest possible combinations to create an Amazon-like rating and review system, or a Digg-like voting model, or even a points-based karma incentive model as on StackOverflow. In practice-“in the wild,” where people with myriad personal incentives interact with them both as sources of reputation and as consumers-the implementation of reputation systems is fraught with peril. In this section, we describe several pitfalls to avoid in designing reputation models.

The Power and Costs of Normalization

We make much of normalization in this book. Indeed, in almost all of the reputation models we describe, calculations are performed on numbers from 0.0 to 1.0, even when normalization and denormalization might seem to be extraneous steps. Here are the reasons that normalization of claim values is an important, powerful tool for reputation:

  • Normalized Values Are Easy to Understand
    • Normalized claim values are always in a fixed, well-understood range. When applications read your claim values from the reputation database, they know that 0.5 means the middle of the range. Without normalization, claim values are ambiguous. A claim value of 5 could mean 5 out of 5 stars, 5 on a 10-point scale, 5 thumbs up, 5 votes out of 50, or 5 points.
  • Normalized Values Are Portable (Messages and Data Sharing)
    • Probably the most compelling reason to normalize the claim values in your reputation statements and messages is that normalized data is portable across various display contexts (see Chapter_7 ) and can reuse any of the roll-up process code in your reputation framework that accepts and outputs normalized values. Other applications will not require special understanding of your claim values to interpret them.
  • Normalized Values Are Easy to Transform (Denormalize)
    • The most common representation of the average of scalar inputs is a percentage, and this denormalization is accomplished trivially by multiplying the normalized value by 100. Any normalized score may be transformed into a scalar value by using a table or, if the conversion is linear, by performing a simple multiplication. For example, converting from a 5-star rating system could be as simple as multiplying the star rating by 0.20 to get the normalized score; to get the stars back, just multiply by 5.0.

Normalization also allows the values of any claim type, such as thumbs-up (1.0)/thumbs-down (0.0), to be denormalized as a different claim type, such as a percentage (0%-100%), or turned into a 3-point scale of thumbs-up (0.66-1.0), thumb-to-side (0.33-0.66), or thumbs-down (0.0-0.33). Using a normalized score allows this conversion to take place at display time, without committing the converted value to the database. And the exact same values can be denormalized by different applications with completely different needs. As with all things, the power of normalization comes with some costs.
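
Before we get to those costs, here's what these display-time conversions look like in a minimal Python sketch, using the multipliers and band edges just described:

    def normalize_5_star(stars):
        return stars * 0.20          # the linear conversion described above

    def to_percent(score):
        return round(score * 100)    # 0.0-1.0 -> 0-100%

    def to_thumb(score):
        if score <= 0.33:
            return "thumbs-down"
        if score <= 0.66:
            return "thumb-to-side"
        return "thumbs-up"

    print(to_percent(normalize_5_star(4)))  # 80
    print(to_thumb(0.5))                    # thumb-to-side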

  • Combining Normalized Scalar Values Introduces Bias
    • Using differently normalized numbers in large reputation systems can cause unexpected biases when the original claim types were scalar values with slightly different ranges. Averaging normalized ratings from a 4-star scale (25% per star) with ratings from a 5-star scale (20% per star) leads to rounding errors that cause the scores to clump up when the average is denormalized back to 5 stars. See Table_4-1.

Table_4-1: An example of ugly side effects when normalizing/denormalizing across different scales

Scale                        | 1 star, normalized | 2 stars, normalized | 3 stars, normalized | 4 stars, normalized | 5 stars, normalized
4-star scale                 | 25                 | 50                  | 51-75               | 76-100              | N/A
5-star scale                 | 20                 | 40                  | 41-60               | 61-80               | 81-100
Averaged range, denormalized | 0-22 (1 star)      | 23-45 (2 stars)     | 46-67 (3 stars)     | 68-90 (4 stars)     | 78-100 (5 stars)

Liquidity: You Won't Get Enough Input

A question of liquidity

When is 4.0 greater than 5.0? When enough people say it is!

Consider the following problem with simple averages: it is mathematically unreasonable to compare two similar targets whose averages are made from significantly different numbers of inputs. For the first target, suppose that there are only three ratings averaging 4.667 stars, which after rounding displays as 5 stars; compare that average score to a second target with a much greater number of inputs, say 500, averaging 4.4523 stars, which after rounding displays as only 4 stars. The second target, the one with the lower average, better reflects the true consensus of the inputs, since there just isn't enough information on the first target to be sure of anything. Most simple-average displays with too few inputs shift the burden of evaluating the reputation to users by displaying the number of inputs alongside the simple average, usually in parentheses, like this: (142).

But pawning off the interpretation of averages on users doesn't help when you're ranking targets on the basis of averages-a lone rating on a brand-new item will put the item at the top of any ranked results it appears in. This effect is inappropriate and should be compensated for.

We need a way to adjust the ranking of an entity based on the quantity of ratings. Ideally, an application performs these calculations on the fly so that no additional storage is required.

We provide the following solution: a high-performance liquidity compensation algorithm to offset variability in very small sample sizes. It's used on Yahoo! sites to which many new targets are added daily, with the result that, often, very few ratings are applied to each one.

  • RankMean
    • r = SimpleMean m - AdjustmentFactor a + LiquidityWeight l * AdjustmentFactor a
  • LiquidityWeight
    • l = min(max((NumRatings n - LiquidityFloor f) / LiquidityCeiling c, 0), 1) * 2
  • Or, compactly
    • r = m - a + min(max((n - f) / c, 0.00), 1.00) * 2.00 * a

This formula produces a curve like the one in Figure_3-14. Though a more mathematically continuous curve might seem appropriate, this linear approximation can be done with simple non-recursive calculations and requires no knowledge of previous individual inputs.

Figure_3-14: The effects of the liquidity compensation algorithm.

Suggested initial values for a, c, and f (assuming normalized inputs):

  • AdjustmentFactor
    • a = 0.10

This constant is the fractional amount to remove from the score before adding back in effects based on input volume. For many applications, such as 5-star ratings, it should be within the range of integer rounding error-in this example, if the AdjustmentFactor is set much higher than 10%, a lot of 4-star entities will be ranked before 5-star ones. If it's set too much lower, it may not have the desired effect.

  • LiquidityFloor
    • f = 10

This constant is the threshold number of inputs below which liquidity is considered too low to have a positive effect on the rank. In an ideal environment, this number is between 5 and 10, and our experience with large systems indicates that it should never be set lower than 3. Higher numbers help mitigate abuse and yield a better-represented consensus of opinion.

  • LiquidityCeiling
    • c = 60

This constant is the threshold beyond which additional inputs earn no further weighting bonus; in short, past this point we trust the average to be representative of the optimum score. This number must not be lower than 30, which in statistics is the minimum sample size required for a t-score. Note that the t-score cutoff of 30 assumes the data is unmanipulated (read: random). We encourage you to consider other values for a, c, and f, especially if you have any data on the characteristics of your sources and their inputs.
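
Putting the formula and the suggested constants together, a direct Python transcription looks like the following; the example calls replay the two targets from the liquidity example above, with star averages normalized to 0.0-1.0:

    def rank_mean(simple_mean, num_ratings, a=0.10, f=10, c=60):
        # r = m - a + min(max((n - f) / c, 0), 1) * 2 * a
        liquidity_weight = min(max((num_ratings - f) / c, 0.0), 1.0) * 2.0
        return simple_mean - a + liquidity_weight * a

    # Three ratings averaging 4.667 stars vs. 500 averaging 4.4523 stars:
    print(rank_mean(4.667 / 5, 3))     # ~0.833: ranked lower
    print(rank_mean(4.4523 / 5, 500))  # ~0.990: ranked higher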

Bias, Freshness, and Decay

When you're computing reputation values from user-generated ratings, several common psychological and chronological issues will likely present themselves in your data. Often, data will be biased because of the cultural mores of an audience, or simply because of the way the application gathers and shares reputations; for example, an application may favor the display of items that were previously highly rated. Data may also grow stale when the target being evaluated is no longer relevant. For example, because of advances in technology, ratings for the features of a specific model of digital camera, such as the number of pixels in each image, may be irrelevant within a few months. Numerous solutions and workarounds exist for these problems, one of which is to implement a method to decay old contributions to your reputations. Read on for details on these problems and what you can do about them.

Ratings Bias Effects

Figure_3-15: Some real ratings distributions on Yahoo! sites. Only one of these distributions suggests a healthy, useful spread of ratings within a community. Can you spot it?

Figure_3-15 shows the graphs of 5-star ratings from nine different Yahoo! sites with all the volume numbers redacted. We don't need them, since we only want to talk about the shapes of the curves.

Eight of these graphs have what is known to reputation system aficionados as a J-curve: the far-right point (5 stars) has the very highest count, 4 stars the next highest, and 1 star a little more than the rest. Generally, a J-curve is considered less than ideal for several reasons. The average aggregate scores all clump together between 4.5 and 4.7, and therefore all display as 4 or 5 stars, which isn't very useful for visually sorting options. A J-curve also raises the question: why use a 5-point scale at all? Wouldn't you get the same effect with a simpler thumbs-up/thumbs-down scale, or maybe even just a super-simple favorite pattern?

The outlier among the graphs is for Yahoo! Autos Custom (now shut down), where users rated car profile pages created by other users. That graph has a W-curve: lots of 1-, 3-, and 5-star ratings and a healthy share of 4- and 2-star ratings, too. It was a healthy distribution and suggested that a 5-point scale was good for the community.

But why were Yahoo! Autos Custom's ratings so very different from Yahoo! Shopping, Local, Movies, and Travel?

Most likely, the biggest difference was that Autos Custom users were rating one another's content. The other sites had users evaluating static, unchanging, or feed-based content in which they didn't have a vested interest.

In fact, if you look at the curves for Shopping and Local, they are practically identical and have the flattest J-hook, with the lowest share of 1-star ratings. This similarity was a direct result of the overwhelming use pattern for those sites: users come to find a great place to eat or the best vacuum to buy, and when they search, the results with the highest ratings appear first. If a user has experienced that place or thing, he may well also rate it, if it's easy to do so, and most likely will give it 5 stars (see Chap_3-First_Mover_Effects). If the user sees an object that isn't rated but that he likes, he may also rate and/or review it, usually giving it 5 stars so that others can share his discovery; otherwise, why bother? People don't think it's worth the bother to seek out and create Internet ratings for mediocre places or things.

The curves, then, are the direct result of a product design intersecting with users' goals. This pattern, “I'm looking for good things, so I'll help others find good things,” is a prevalent form of ratings bias. An even stronger example happens when users are asked to rate episodes of TV shows: they rate every episode 4.5 stars plus or minus .5 star, because only the fans bother to rate the episodes, and no fan is ever going to rate an episode below a 3. Look at any popular current TV show on Yahoo! TV or [another site].

Our closer look at how Yahoo! Autos Custom ratings worked and how users were evaluating the content showed why 1-star ratings were given out so often: users gave feedback to other users to get them to change their behavior. Specifically, you would get one star if you (1) didn't upload a picture of your ride, or (2) uploaded a dealer stock photo of your ride. The site is Autos Custom, after all! Users reserved 5-star ratings for the best of the best. Ratings of 2 through 4 stars were actually used to evaluate the quality and completeness of the car's profile. Unlike on all the other sites graphed here, Autos Custom's 5-star scale truly represented a broad range of sentiment, and people worked to improve their scores.

One ratings curve isn't shown here: the U-curve, in which 1 star and 5 stars are disproportionately selected. Some highly controversial objects on Amazon attract this rating curve. Yahoo!'s now-defunct personal music service also saw this kind of curve when new music was introduced to established users: 1 star came to mean “Never play this song again” and 5 stars meant “More like this one, please.” If you're seeing U-curves, consider that users may be telling you something other than what you wanted to measure (or that you might need a different rating scale).

First-Mover Effects

When an application handles quantitative measures based on user input, whether it's ratings or measuring participation by counting the number of contributions to a site, several issues arise-all resulting from bootstrapping of communities-that we group together under the term first-mover effects.

  • Early Behavior Modeling and Early-Ratings Bias
    • The first people to contribute to a site have a disproportionate effect on the character and future contributions of others. After all, this is social media, and people usually try to fit into any new environment. For example, if the tone of comments is negative, new contributors will also tend to be negative, which will also lead to bias in any user-generated ratings. See Chap_3-Ratings_Bias_Effects .

When an operator introduces user-generated content and associated reputation systems, it is important to take explicit steps to model behavior for the earliest users in order to set the pattern for those who follow.

  • Discouraging New Contributors
    • Take special care with systems that contain leaderboards (see Chap_7-Leaderboard ) when they're used either for content or for users. Items displayed on leaderboards tend to stay on the leaderboards, because the more people who see those items and click, rate, and comment on them, the more who will follow suit, creating a self-sustaining feedback loop.

This loop not only keeps newer items and users from breaking into the leaderboards, it discourages new users from even making the effort to participate by giving the impression that they are too late to influence the result in any significant way. Though this phenomenon applies to all reputation scores, even for digital cameras, it's particularly acute in the case of simple point-based karma systems, which give active users ever more points for activity so that leaders, over years of feverish activity, amass millions of points, making it mathematically impossible for new users to ever catch up.

Freshness and Decay

As the discussion of first-mover effects (Chap_3-First_Mover_Effects) shows, time leaches value from reputation, but there's also the simple problem of ratings becoming stale over time as their target reputable entities change or become unfashionable. Businesses change ownership, technology becomes obsolete, and cultural mores shift.

The key insight to dealing with this problem is to remember the expression “What did you do for me this week?” When you're considering how your reputation system will display reputation and use it indirectly to modify the experience of users, remember to account for time value. A common method for compensating for time in reputation values is to apply a decay function: subtract value from the older reputations as time goes on, at a rate that is appropriate to the context. For example, digital camera ratings for resolution should probably lose half their weight every year, whereas restaurant reviews should only lose 10% of their value in the same interval.

Here are some specific algorithms for decaying a reputation score over time:

  • Linear Aggregate Decay
    • Every score in the corpus is decreased by a fixed percentage per unit of time elapsed, whenever it is recalculated. This approach performs well, but rarely updated reputations will have disproportionately high values. To compensate, a timer input can perform the decay process at regular intervals. (A sketch of this approach follows this list.)
  • Dynamic Decay Recalculation
    • Every time a score is added to the aggregate, recalculate the value of every contributing score. This method provides a smoother curve, but it tends to become computationally expensive, O(n²), over time.
  • Window-based Decay Recalculation
    • The Yahoo! Spammer IP reputation system has used a time-window-based decay calculation: a fixed-time or fixed-size window of previous contributing claim values is kept with the reputation for dynamic recalculation when needed. New values push old values out of the window, and the aggregate reputation is recalculated from the values that remain. This method produces a score based on the most recent information available, but the information for low-liquidity aggregates may still be old.
  • Time-limited Recalculation
    • This is the de facto method that most engineers use to present any information in an application: read all of the ratings in a time range from the database and compute the score just in time. This is the most costly method, because it involves hitting the database to recalculate an aggregate reputation (say, for a ranked list of hotels) even though 99% of the time the resulting value is exactly the same as it was in the previous iteration. It may also throw away reputation that is still contextually valid. Performance and reliability are usually better served by the alternative approaches described above.
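
As promised, here is a minimal Python sketch of the linear aggregate decay approach. The half-life-per-year rate echoes the digital camera example above; applying the decay lazily at recalculation time is our assumption:

    import time

    def decayed(score, last_updated, rate_per_year=0.5, now=None):
        # Shrink a stored score by a fixed percentage per elapsed year,
        # applied whenever the reputation is recalculated.
        now = time.time() if now is None else now
        years = (now - last_updated) / (365.0 * 24 * 3600)
        return score * (1.0 - rate_per_year) ** years

    # A perfect 1.0 score stored one year ago is now worth 0.5:
    YEAR = 365.0 * 24 * 3600
    print(decayed(1.0, last_updated=0.0, now=YEAR))  # 0.5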

Implementer's Notes

The massive-scale Yahoo! Reputation Platform, detailed in Appendix_A, implemented the reputation building blocks, such as the accumulator, sum, and even rolling average, in both the reputation model execution engine and the database layer. This division of labor provided important performance improvements, because the read-modify-write logic for stored reputation values is kept as close to the data store as possible. For small systems, it may be reasonable to keep the entire reputation system in memory at once, avoiding this complication. But be careful: if your site is as successful as you hope it might someday be, an all-memory design may well come back to bite you, hard.

Making Buildings From Blocks

In this chapter, we extended the grammar by defining various reputation building blocks out of which hundreds of currently deployed reputation systems are built. We also shared tips about a few surprises we've encountered that emerge when these processes interact with real human beings.

In Chapter_4 we'll combine and customize these blocks to describe full-fledged reputation models and systems that are deployed on the web today. We look at a selection of common patterns, including voting, points, and karma. We also review complex reputation models, such as those at eBay and Flickr, in considerable detail. Diagramming these currently operational examples demonstrates the expressiveness of the grammar, and the lessons learned from their challenges provide important experience to consider when designing new models.
