Main

October 12, 2010

First! Randy to be the kickoff guest for new Community Chat podcast series.

Bill Johnston and Thomas Knolls are launching a new live podcast series: Community Chat on talkshoe.

I am so honored to be the lead-off guest on their inaugural episode (Wednesday 10-13-10):



The kickoff episode of Community Chat! [We] will be discussing the premise of the Community Chat podcast with special guest Randy Farmer. Will also be getting a preview of Blog World Expo from Chuck Hemann.

I'll be talking with them about online community issues developers and operators all share in common - well, as much as I can in 10 minutes. :-) Click on the widget above to go there - it will be recorded for those who missed it live...

UPDATE: The widget now has an option to play back the session. Just choose "Kickoff" and press play. :-)

September 29, 2010

BWRS on Kindle Web - Try before you buy!

You can now read the Kindle edition of Building Web Reputation Systems on the web (search, print, etc.) and it is much cheaper than the paper version. Here's the free sample:

August 25, 2010

Oct-06-10 SVPMA Talk: Web Reputations: Putting Social Media to Work in Your Products

On October 6th, Randy will be presenting Web Reputations: Putting Social Media to Work in Your Products at the Silicon Valley Product Managers Association:

While social media were originally focused on consumers, product managers in every segment are wondering how to deal with this shift to customer interaction and communities of interest. We’re crowdsourcing ideas for our B2B products, putting up community self-support sites, and tweeting our updates. We surf user-generated content on Facebook and LinkedIn. Anonymous posts rate our products against the competition. Customer groups that love us - and hate us - are organizing on their own.

Hidden in social media are problems of web reputation: how to tell good stuff from bad, how to engage and reward contributors, scale up rating systems, and stamp out inappropriate content. Web reputations are a source of social power. As product managers, we need to understand the reputation and reward systems we put in place when we add social networking to our products/services. This talk will provide you with the sample criteria you must think about when creating a social media strategy for your product. It will identify the five most common mistakes product designers and product managers make when considering adding reputation, ratings, and reviews to their applications. It will provide you with the intellectual tools needed to avoid these pitfalls, as well as teach you how to think effectively about the alternatives. For example, questions like “Is Like always better than Ratings?”, “When should I use thumbs-down?”, and “Can I re-purpose reputation?” will be discussed.

It will be a variant of the Reputation Missteps talk - targeted at product managers. Non-members can attend for a nominal fee. This article will be updated with a registration link when it becomes available.

July 28, 2010

BWRS Peer Critique - Engage!

Eric Goldman (@ericgoldman) recently posted a detailed and thoughtful review of Building Web Reputation Systems on his blog (and gave us 4 stars on Amazon). Since his blog doesn't support comments, we've decided to respond here with copious backlinks. If you're involved with creating and managing reputation systems, you'll definitely want to follow Eric's writings.

We've received great and very positive reviews from others as well, but Eric's the first of our peers to take up the challenge of wanting more from our work. First a little about him from his bio:

Eric Goldman is an Associate Professor of Law at Santa Clara University School of Law. He also directs the school's High Tech Law Institute … [his] research focuses on Internet law, intellectual property, marketing, and the legal and social implications of new communication technologies …
He also has some great presentations online, especially a recent one about Regulating Reputation Systems [video]. It's great work that has already influenced our thinking and even our recent public presentations. Too bad we didn't even know about Eric or his work before completing the book. This lack of a coherent way to discuss reputation systems was one of the reasons we started this effort. But, we're getting ahead of ourselves…

Hopefully, if you're still interested in reading this far, you've read Eric's review. Go ahead, we'll wait—this response assumes you have, so we don't have to quote a lot of context. Bryce and Randy each have thoughts to share about the issues raised, so we'll call out our responses by name below.

"…a debate worth having"

Randy: Thank you for taking the time to write such a detailed critique. Before discussing the critical points, I really appreciate the props you give us for being the first ones to put a book together in this area, and your words of support for our experience. Hopefully it will be the first of several contributed by many authors. As you said: "…the book provides a good repository of high-value experience-based perspectives that are not readily available elsewhere. Even if the book’s recommendations are debatable, it’s a debate worth having."

We asked for this debate, and you've engaged, so let's go!

Not Enough Citations

Bryce: Hi Eric—thank you for the insightful critique. It is exactly this level of dialog that we'd hoped the book would inspire, and many of your points are dead-on.

In particular, calling us out for a paucity of cited references stings a
bit (tho' deservedly so!) Randy and I made the decision early on that we would consciously avoid writing a 'survey' book—one focused on cataloging the various market- and academe-based approaches to reputation.

And, you're right, there is a deep, rich vein of prior art to be explored there—digesting all of it, and putting it into a consumer-friendly format for product and industry folks would indeed be a fantastic resource. It's just not the book we chose to write. (And the Ariely references? Yeah, I kinda feel those stick out like a sore thumb, too—we'll definitely want to leaven the text with richer references if we get a crack at a second edition. Suggestions are welcome!)

Randy: It's tough to pick a target market for a book. Our primary experience was with product managers and web designers who were making very basic reputation system design errors. I've tried getting them to read whitepapers on ratings, reviews, reputation—and honestly it wasn't worth their time (though it should have been!) But the problem of what to cite is worse than that—the fact we were writing the very first book on the subject is a testament to how difficult it is to actually find good source material.

Much of your critique (and our response) is about terminology and
usage in this new domain—even just doing searches is non-trivial. Heck, one of the key repositories we cite in the appendix [web.si.umich.edu/reputations/] hadn't been updated in years and now appears to be gone (or moved somewhere I can't find at the moment).

For example, you say "[w]e implemented a very similar system embodying these two points back in 2000-01 at Epinions"—where is this documented? Links please! (I didn't see them in your post.) If it exists, I either missed it because I didn't know the correct keywords, or it didn't get enough link-love to show up when I tried. It would have been fantastic to be able to point at proof before embarking on the uphill battle to convince Yahoo! product managers to even try allowing the users to moderate the worst-of-the-worst content.

I so look forward to the day that the stuff in our book is common knowledge—but it isn't even close. This isn't the first new field of study I've been an early pioneer in: I'm the co-author, with Chip Morningstar, of The Lessons of Lucasfilm's Habitat—the first paper on creating and operating avatar virtual worlds, written in 1990 (it too was a practitioner's take on what had been, up to then, a field covered largely by theorists.) Lessons has been cited in over 100 books and yet there are still people building systems with the errors that Chip and I clearly identified more than 20 years ago! It's a long road we're on together.

BTW, both Bryce and I would really like to own a copy of the book that Eric
thinks our book could/should have been—does someone want to write it? Or is it really close to what we already have, and you—kind readers—just need to send us the links?

Object Reputation vs. Grading & Filtering

Bryce: I do take issue with one of your criticisms—your dismissal of content reputation as mere "grading and filtering" of content items, and your assertion that reputation for content items "does not work."

You're mistaking a useful application of reputation (the ability to sort and promote/demote, which we cover in Chapter 8) with an attribute of the object being sorted: quality, freshness, popularity, etc. These attributes are determined, of course, by community consensus and—as it turns out—there's already a pretty good term for 'a general consensus about something arrived at by a number of sources, some of them
known to you and some of them not': it's reputation.

While it's true that certain types of content are fairly immutable, the contexts in which they're embedded are infinitely variable, and make reputation an invaluable way to think about, and tabulate, these attributes.

Let's take music as an example: an MP3 track is generally fixed and, you're correct, "does not change its character unless subsequently edited." So, perhaps (and I'm actually not willing to concede this point, but more on this later) "reputation" for one particular track may be of limited value.

But how about a song? How about a specific performance of a song? Try telling any of the contributors to The DeadLists Project that 'a song is a song is a song.' They've cataloged over 40 years of concert recordings from Grateful Dead shows, and can probably tell you exactly which performances of "Stella Blue" are the superior, must-listen experiences. Different context, different expectations for reputation.

Further, how about a playlist—one in which songs appear and disappear over time, coming in and out of rotation? The tracks themselves don't change, but collections of content objects most certainly do. Tracking the reputation of a collection gives consumers valuable information to
judge that asset: am I likely to like the types of songs featured here? (Google 'Billboard Payola Scandal' and then tell me that influencing content reputation hasn't historically been a very lucrative endeavor.)

Of course, content doesn't merely spring forth like Athena from the forehead of Zeus—no, people create content. So, many times, content reputation is useful as a kind of "proxy reputation" for a person (its creator). What's the best way to know an artist's reputation? Why, look at how their works are received: how many downloads, how many sales, remixes, adds to playlists. These things are generally a much better indicator of an artist's impact than who they're dating or what hotel room they've trashed lately.

It's our contention that people and content reputation are inextricably
intertwined: to even attempt to assess one in the absence of the other, would be—and for many failed startups, has been—an exercise in futility.

And, as promised, a return to your initial point: that content doesn't change over time. This is a question that goes back at least as far as Socrates and the Sophists: are the qualities of a thing intrinsic to the thing itself, or imparted instead by the context that we situate it within? I (and, generally,
subsequent history, Aristotle notwithstanding) would argue the latter.

So, Mark Twain's Huck Finn, barring some minor edits and censored bits over the years, is indeed the same text that it's always been. But I don't think anyone would seriously argue that its reputation (our shared perception of its value, its place in our cultural fabric) hasn't changed drastically over the years.

The exact same thing takes place, on smaller scales and with less evident effects, every time someone favorites a video on YouTube, or 'Bans' an artist from their Last.fm personal channel.

Randy: Interesting that you call out Karma as a confusing term for person-reputation—I see it in a lot of white papers these days. :-) Nonetheless, all terminology should be up for debate at this point. Sorting out entity-reputation from person-reputation is important—the naming of names is negotiable. Any counter-suggestions?

Engage!

Randy: Again, we're so grateful to Eric for kickstarting the debate on these important issues. As I've said to more than one dejected-looking peer "Don't be sad that I'm critical of your ideas—that
means they are interesting enough to criticize! If I didn't like them, I'd just go do something else and ignore them." I'm now accepting that advice myself. Here's hoping that the issues around reputation systems remain interesting enough to continue criticism, discussion, and refinement.

Bryce & Randy

Please, peers, leave comments here - if there's enough interest we're happy to move the debate to the wiki…

July 13, 2010

5 Reputation Missteps [video] @Google 7/1

I gave a solo version of the 5 Reputation Missteps (and how to avoid them) at Google as a tech-talk, and the video is up:

I'm afraid I don't do anywhere near as well with Bryce's portions as he does, but this is one of the better solo presentations I've given...

If you'd like Bryce and me to respond to any comments/questions you have, please leave your comments here instead of on the video - we don't get email notifications there...

July 07, 2010

Going Meta: Web Reputation Building for Building Web Reputation Systems

Have you read Building Web Reputation Systems, the book, ebook, or wiki versions?

If so, we'd really like to read what you think, publicly, in the form of a review on any of
Amazon (UK), Borders, O'Reilly, or any other place that you like to share feedback...

All reviews are welcome, including critical ones - we're looking for more ways to improve the field and the way we communicate its subtleties. Given that the web's average star rating is 4.3, we're still too high with our low-liquidity 5.0 rating at Amazon.com...

Talk about "eating your own dog food!"

Thanks in advance,
Bryce and Randy

May 05, 2010

Web2.0 Expo Talk — 5 Reputation Missteps

Here are the slides from our presentation yesterday at the Web 2.0 Expo in San Francisco. We will soon be adding all the speaker notes to the full version on Slideshare.

April 06, 2010

Don't Display Negative Karma Redux: Unvarnished

It's Reputation Wednesday again, and the entire subject area of reputation systems seems to be heating up. For example there's been a lot of chatter about Unvarnished.

Update 4/13/2010: The Register is reporting that an eBayer is being sued in the amount of $15,000 for leaving negative feedback - more fodder for thought...

Unvarnished is a public karma system for real-world identities which will reportedly accept [and protect] negative anonymous comments, presumably from former co-workers.

This has generated a lot of chatter, mostly negative, from folks like Evelyn Rusli at TechCrunch: Unvarnished: A Clean, Well-Lighted Place For Defamation
Today, Unvarnished makes its beta debut. It’s essentially Yelp for LinkedIn: any user can create an online profile for a professional and submit anonymous reviews. You can claim your profile, but unlike LinkedIn, you have to accept every post, warts and all. And once the profile is up there’s no taking it down.

I asked co-founder, Peter Kazanjy, “Will you ever give users the option to take down their profile?” Kazanjy’s reply: “No, because if we did that, everyone would take their profile down”
...and... CNet's Molly Wood writes in Unvarnished: Person reviews or trollfest?
Because let's be clear. Though Unvarnished may be billed as a natural extension of trends that started with LinkedIn, Yelp, and even Facebook, MySpace, and message boards, there's nothing about this site that, in my opinion, doesn't lead almost immediately to rank nastiness.

After a long conversation with co-founder Peter Kazanjy, formerly of VMWare, I'm convinced that the founders (the others come from eBay and LinkedIn) really do think they're creating a site that will maintain a professional veneer, be well moderated by its users, and won't descend into personal attacks. I just don't agree.
...and perhaps a bit more positive - Craig Newmark says in Trust and reputation systems: redistributing power and influence
The most prominent experiment in directly measuring trust is Unvarnished, very recently launched in beta form. You rate what trust you have in specific individuals, and they might rate you. Unvarnished is pretty controversial, and is already attracting a lot of legal speculation. They're trying to address all the problems related to the trustworthiness of the information they receive, and if so, might become very successful.

Unvarnished: Against the Grain

This service breaks several tenets of online karma (people reputation) as outlined in Building Web Reputation Systems (wiki):

  1. Don't Display Negative Karma!(from The Dollhouse Mafia post)
    We said it best there: "Avoid negative public karma, really."
  2. Karma is Complex, Built of Indirect Input (Chapter 7 of our book draft)
    The reason for using indirect input is to establish a clear context of evaluation - eBay requires you to complete a transaction in their system before you rate a seller. There is no way for Unvarnished to tie the negative comments to an actual context (co-worker).
  3. There is a real problem with the incentives model for all the participants (we spend half a chapter on incentives and motivation for user-generated content).
    As we warn there, mixing ego-based motivation (i.e. revenge) with commercial incentives (personal brand building) is usually toxic. Unvarnished is especially problematic with the ability to leave anonymous comments. Doesn't anyone remember F*ckedCompany.com? Having been the target of comments like "Sieg Heil, Randy!" I can tell you one possible outcome for Unvarnished: Deadpool.
  4. Scanning our post summarizing Karma best practices suggests quite a few places Unvarnished might want to look at closely when creating and displaying their karma.

A colleague who worked for Wink.com, an identity-aggregation site, told me that people would get angry at the fact that a profile had been assembled on their behalf on Wink, even if it was only built by a search engine—they would often demand its removal, even though it only contained public data. Identity and privacy are sensitive topics.

The one thing I'm sure of, from my experience building online communities for over 35 years, is that the founders of Unvarnished will discover that the use-patterns will look nothing like what they've planned for or predicted. They have bitten off something in an area that is fraught with peril, and so far (in the press, at least) haven't shown any understanding of how significantly different business reviews are from public user karma, especially when people's livelihoods are at stake.

[BTW, I've signed up for the beta at getunvarnished.com - so if you're already a member, push the magic button that requests a review from me. :-)]

March 31, 2010

Incentives and Behavior: Consider the Mayor

Are you considering an incentive system for your online community or application? There's been an overwhelming amount of attention paid lately to the ways that providing incentives—points, badges or trophies—to users can influence their behaviors and contributions. If you're already sold, then pay careful attention to NY Mayor Michael Bloomberg's efforts to incentivize positive behaviors amongst the city's poorest residents:

An unusual and much-heralded program that gave poor families cash to encourage good behavior and self-sufficiency has so far had only modest effects on their lives and economic situation, according to an analysis the Bloomberg administration released on Tuesday.
In the book, we caution against intermixing market and social norms (or providing external incentives in lieu of leveraging people's already-present intrinsic motivations) and it would be easy to point to NYC's experience as supporting that stance. Easy, but—perhaps—not entirely fair. As the Times article points out, the program has at least been partially successful at lifting some citizens out of poverty.

It's interesting to note that one of the program's earliest failings, however, was its complexity. There were also problems of trust, comprehension and user education:

“I think people were confused, and there was some amount of distrust,” Ms. Brandenburg said. “For some people it sounded too good to be true. It took a while to explain to people what the offer was.”

Ms. Gibbs said many families had been perplexed by the guidelines that were laid out for them. Cash payments were eventually eliminated for actions like getting a library card and follow-up visits with a doctor.

“Too many things, too many details, more to manage in the lives of burdened, busy households,” Ms. Gibbs said, standing next to the mayor on Tuesday. “Big lesson for the future? Got to make it a lot more simple.”

These are all classic user experience problems that you, too, will wrestle with should you decide to provide incentives to influence behavior. (Hat-tip to Sam Ladner for the article-pointer.)

March 02, 2010

Coming to SxSW: Production Copies of Building Web Reputation Systems!

Bryce and I are happy to announce that Building Web Reputation Systems has gone to the printers! We're absolutely excited to share this news with you all today. It's hard to believe it's been more than a year since we started. Thank you so much to all who've been reading our work as we developed it and providing such helpful feedback—it wouldn't be half as good as it is without you!

The book will hit the retail shelves on 4/1, but if you can't wait that long you have 2 options: (1) early copies will be available from the O'Reilly booth at SxSW!; and (2) there are some eBook codes that will be made available for those willing to review the book and post it online—see the booth or contact us via email (our address is over there → in the sidebar.) I guess it will depend on your blogging karma score. :-)

If Amazon sales rank is any indicator, sales are already picking up, so it seems that, after mobbing the SxSW booth, the fastest way to get a paper copy is to preorder at O'Reilly,
Amazon, Borders, or your favorite book retailer.

For those of you who don't already know what this book is about, here's the back cover copy:

What do Amazon's product reviews, eBay's feedback score system, Slashdot's Karma System, and Xbox Live's Achievements have in common? They're all examples of successful reputation systems that enable consumer websites to manage and present user contributions most effectively. This book shows you how to design and develop reputation systems for your own sites or web applications, written by experts who have designed web communities for Yahoo! and other prominent sites.

Building Web Reputation Systems helps you ask the hard questions about these underlying mechanisms, and why they're critical for any organization that draws from or depends on user-generated content. It's a must-have for system architects, product managers, community support staff, and UI designers.

  • Scale your reputation system to handle an overwhelming inflow of user contributions
  • Determine the quality of contributions, and learn why some are more useful than others
  • Become familiar with different models that encourage first-class contributions
  • Discover tricks of moderation and how to stamp out the worst contributions quickly and efficiently
  • Engage contributors and reward them in a way that gets them to return
  • Examine a case study based on actual reputation deployments at industry-leading social sites, including Yahoo!, Flickr, and eBay

February 16, 2010

On Karma: Top-line Lessons on User Reputation Design

In Building Web Reputation Systems, we appropriate the term karma to mean a user reputation in an online service. As you might expect, karma is discussed heavily throughout the more than 300 pages. During the final editing process, it became clear that a simple summary of the main points would be helpful to those looking for guidance. It seemed that our first post in over a month (congratulations on the new delivery, Bryce!) should be something big and useful...

This post covers the following top-line points about designing karma systems, drawn from our book and other blog posts:

  • Karma is user reputation within a context
  • Karma is useful for building trust between users, and between a user and the site
  • Karma can be an incentive for participation and contributions
  • Karma is contextual and has limited utility globally. [A chessmaster is not a good eBay Seller]
  • Karma comes in several flavors - Participation, Quality and Robust (combined)
  • Karma should be complex and the result of indirect evaluations, and the formulation is often opaque
  • Personal karma is displayed only to the owner, and is good for measuring progress
  • Corporate karma is used by the site operator to find the very best and very worst users
  • Public karma is displayed to other users, which is what makes it the hardest to get right
  • Public karma should be used sparingly - it is hard to understand, isn't expected, and is easily confused with content ratings
  • Negative public karma should be avoided altogether. In karma-math -1 is not the same magnitude as +1, and information loss is too expensive.
  • Public karma often encourages competitive behavior in users, which may not be compatible with their motivations. This is most easily seen with leaderboards, but can happen any time karma scores are prominently displayed. [e.g.: Twitter follower count]

Why bother with karma? [Preface]

Karma is a reputation score for a user in a community. It may comprise many components, such as:

  • How long has this person been a member of the community?
  • What types of activities has she engaged in?
  • How well has she performed at them?
  • What do other people think about this person?

Having access to a person's reputation might help you make better-informed judgments. Judgments like…

  • Can I trust this person?
  • Should I transact with this person?
  • Is it worth my time to listen to this person?

Besides providing a means for trust between users, karma is often used as an incentive to encourage contributions to a service, or to identify specific users for special action - either recognition or corrective action. The tricky part is balancing the producer incentives against the potential for abuse and the consumers' need for good filters over the content.

Karma is contextual (local) and has limited scope [Chapter 1]

Karma is built based on the actions of a user within a context, such as a web site, or even as a member of a sub-community of a site. And those contributions are often limited to a very narrow range of actions - care must be taken not to over-generalize the value of a karma score. For example, an eBay seller's feedback karma reflects only the feelings of the buyers about the exact transactions completed. One of the known scamming patterns is for a scammer to develop strong positive karma selling a large number of smaller items and then switch to simultaneously listing a large number of high-ticket items for auction at low prices, collecting the funds, and then canceling their account. This is an evil form of reputation bankruptcy (see below).

There is a common misconception about karma - that it can be used across contexts, just as the FICO credit score is broadly used in the United States to determine suitability for issuing credit cards, purchasing a home, or even being hired for a job. Chapter 1 talks about this idea of a "Web FICO":

Several startup companies have attempted to codify a global user reputation for use across web sites, and some try to leverage a user's preexisting eBay seller's Feedback score as a primary value in their rating. They are trying to create some sort of “real person” or “good citizen” reputation system for use across all contexts. As with the FICO score, it is a bad idea to co-opt a reputation system for another purpose, and it dilutes the actual meaning of the score in its original context. The eBay Feedback score reflects only the transaction worthiness of a specific account, and it does so only for particular products bought or sold on eBay. The user behind that identity may in fact steal candy from babies, cheat at online poker, and fail to pay his credit card bills. Even eBay displays multiple types of reputation ratings within its singular limited context. There is no web FICO because there is no kind of reputation statement that can be legitimately applied to all contexts.

Participation vs. quality, and robust karma [Chapter 4]

There are two primitive forms of karma models: models that measure the amount of user participation and models that measure the quality of contributions. When these types of karma models are combined, we refer to the combined model as robust. Including both types of measures in the model gives the highest scores to the users who are both active and produce the best content.

Participation karma

Participation karma: As a user engages in various activities, they are recorded, weighted, and tallied.

Counting socially and/or commercially significant events by content creators is probably the most common type of participation karma model. This model is often implemented as a point system (Chap_4-Points), in which each action is worth a fixed number of points and the points accumulate. A participation karma model looks exactly like the figure above, where the input event represents the number of points for the action and the source of the activity becomes the target of the karma.
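As a rough illustration, here is a minimal sketch of such a point system in Python. The action names and point values are made up for the example and aren't taken from any particular site.

  # Illustrative participation-karma point system: fixed points per action, accumulated per user.
  POINT_VALUES = {
      "post_comment": 1,
      "upload_photo": 2,
      "write_review": 5,
  }

  participation_karma = {}  # user_id -> accumulated points

  def record_action(user_id, action):
      # Each action is worth a fixed number of points, and the points accumulate.
      participation_karma[user_id] = (
          participation_karma.get(user_id, 0) + POINT_VALUES.get(action, 0))

  record_action("alice", "write_review")
  record_action("alice", "post_comment")
  print(participation_karma["alice"])  # 6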

There is also a negative participation karma model, which counts how many bad things a user does. Some people call this model strikes, after the three-strikes rule of American baseball. Again, the model is the same, except that the application interprets a high score inversely.

Quality karma

A quality-karma model, such as eBay's seller feedback (Chap_4-eBay_Merchant_Feedback_Karma) model, deals solely with the quality of contributions by users. In a quality-karma model, the number of contributions is meaningless unless it is accompanied by an indication of whether each contribution is good or bad for business. The best quality-karma scores are always calculated as a side effect of other users evaluating the contributions of the target.

In the eBay example, a successful auction bid is the subject of the evaluation, and the results roll up to the seller: if there is no transaction, there should be no evaluation.
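To make the shape of such a model concrete, here is a minimal Python sketch of a quality-karma roll-up in the spirit of seller feedback. The transaction gate and the +1/0/-1 evaluation values are assumptions for illustration, not eBay's actual formula.

  from collections import defaultdict

  completed_transactions = set()        # (buyer, seller, transaction_id) tuples
  seller_feedback = defaultdict(list)   # seller -> list of evaluations

  def evaluate_transaction(buyer, seller, transaction_id, evaluation):
      # No transaction, no evaluation: only completed transactions may be rated.
      if (buyer, seller, transaction_id) not in completed_transactions:
          raise ValueError("no completed transaction to evaluate")
      seller_feedback[seller].append(evaluation)  # +1 positive, 0 neutral, -1 negative

  def quality_karma(seller):
      # Share of positive evaluations; undefined until at least one evaluation exists.
      evals = seller_feedback[seller]
      return sum(1 for e in evals if e > 0) / len(evals) if evals else None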

Robust karma

By itself, a participation-based karma score is inadequate to describe the value of a user's contributions to the community: we will caution time and again throughout the book that rewarding simple activity is an impoverished way to think about user karma. However, you probably don't want a karma score based solely on quality of contributions either. Under this circumstance, you may find your system rewarding cautious contributors-ones who, out of a desire to keep their quality-ratings high-only contribute to “safe” topics, or-once having attained a certain quality ranking-decide to stop contributing to protect that ranking.

What you really want to do is to combine quality-karma and participation-karma scores into one score-call it robust karma. The robust-karma score represents the overall value of a user's contributions: the quality component ensures some thought and care in the preparation of contributions, and the participation side ensures that the contributor is very active, that she's contributed recently, and (probably) that she's surpassed some minimal thresholds for user participation-enough that you can reasonably separate the passionate, dedicated contributors from the fly-by post-then-flee crowd.

The weight you'll give to each component depends on the application. Robust-karma scores often are not displayed to users, but may be used instead for internal ranking or flagging, or as factors influencing search ranking; see Chap_4-Keep_Your_Barn_Door_Closed for common reasons for this secrecy. But even when karma scores are displayed, a robust-karma model has the advantage of encouraging users both to contribute the best stuff (as evaluated by their peers) and to do it often.

When negative factors are included in robust-karma scores, the result is particularly useful for customer care staff-both to highlight users who have become abusive or whose contributions decrease the overall value of content on the site, and potentially to provide an increased level of service to proven-excellent users who become involved in a customer service procedure. A robust-karma model helps find the best of the best and the worst of the worst.

Robust karma: A robust-karma model might combine multiple other karma scores-measuring, perhaps, not just a user's output (Participation) but their effectiveness (or Quality) as well.
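As a sketch of how such a blend might look in code (the normalization cap and the 0.3/0.7 weights below are illustrative assumptions; as noted above, the weighting is up to the application):

  def robust_karma(participation_points, quality_score,
                   participation_cap=100, w_participation=0.3, w_quality=0.7):
      # Normalize participation to 0..1 so sheer activity alone can't dominate,
      # then blend it with the peer-evaluated quality score (also 0..1).
      participation = min(participation_points, participation_cap) / participation_cap
      return w_participation * participation + w_quality * quality_score

  # A careful, high-quality but less active contributor vs. a prolific, mediocre one:
  print(robust_karma(20, 0.95))   # ~0.73
  print(robust_karma(500, 0.40))  # ~0.58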

Unlike most content reputation, karma is implicit, opaque, and complex [Chapter 7]

A reputable entity is potentially any entry in a database, including users and content items, with one or more reputations attached to it. All kinds of reputation score types and all kinds of display and use patterns might seem equally valid for content reputation and karma, but usually they're not. To highlight the differences between content reputation and karma, we've categorized them by the ways in which they're typically calculated: simple and complex reputation.

Simple Reputation
Simple reputation is any reputation score that is generated directly by user evaluation of a reputable entity and that is subject to an elementary aggregation calculation, such as simple average. For example, simple reputation is used on most ratings-and-reviews sites. Simple reputation is direct and easy to understand.
Complex Reputation
Complex reputation is a score aggregated from multiple evaluations, including evaluations of different but related targets, calculated with an opaque method. Email sender IP reputation (used to catch spammers), Google PageRank, and eBay feedback are examples of complex reputation. It's an indirect evaluation, and users may not understand how it was calculated even if the score is displayed.
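For the simple case, the aggregation really is this elementary; a complex score folds many indirect inputs into an opaque formula and doesn't reduce to a one-liner. A minimal sketch (the 1-5 star scale is just an example):

  def simple_reputation(star_ratings):
      # Direct evaluations of one reputable entity, aggregated by a plain average.
      return sum(star_ratings) / len(star_ratings) if star_ratings else None

  print(simple_reputation([5, 4, 4, 3]))  # 4.0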

Content reputation is about things-typically inanimate objects without emotions or the ability to respond in any way to their reputation.

But karma represents the reputation of users, and users are people-they are alive, they have feelings, and they are the engine that powers your site. Karma is significantly more personal and therefore sensitive and meaningful. If a manufacturer gets a single bad product review on a web site, it probably won't even notice. But if a user gets a bad rating from a friend-or feels slighted or alienated by the way your karma system works-she might abandon an identity that has become valuable to your business. Worse yet, she might abandon your site altogether and take her content with her. (Worst of all, she might take others with her.)

Take extreme care in creating a karma system. User reputation on the web has undergone many experiments, and the primary lesson from that research is that karma should be a complex reputation and it should be displayed rarely.

Karma is complex, built of indirect inputs

Be careful with Karma-sometimes making things as simple and explicit as possible is the wrong choice for reputation:

  • Rating a user directly should be avoided. Typical implementations only require a user to click once to rate another user and are therefore prone to abuse. When direct evaluation karma models are combined with the common practice of streamlining user registration processes (on many sites opening a new account is an easier operation than changing the password on an existing account), they get out of hand quickly. See the example of Orkut in Chap_7-Display_Numbered_Levels.
  • Asking people to evaluate others directly is socially awkward. Don't put users in the position of lying about their friends.
  • Using multiple inputs presents a broader picture of the target user's value.
  • Economics research into “revealed preference,” or what people actually do, as opposed to what they say, indicates that actions provide a more accurate picture of value than elicited ratings.

Karma calculations are often opaque

Karma calculations may be opaque because the score is valuable as status, has revenue potential, and/or unlocks privileged application features.

Display karma sparingly

In Building Web Reputation Systems we separate reputation display into three categories: public (shown to other users), personal (shown only to the owner), and corporate (for company internal use.) Corporate karma is normally used to identify the very best and the very worst users for special actions, such as PR contact or account termination. Personal karma is typically used for reflecting progress against some goal - as a dieter tracks their body weight over time. Where karma display becomes challenging is when it is public.

There are several important things to consider when displaying karma to the public:

  • Publicly displayed karma should be rare because, as with content reputation, users are easily confused by the display of many reputations on the same page or within the same context.
  • Publicly displayed karma should be rare because it can create the wrong incentives for your community. Avoid sorting users by karma. See Chap_7-Leaderboards_Considered_Harmful.
  • If you do display it publicly, make karma visually distinct from any nearby content reputation. Yahoo!'s EU message board displays the karma of a post's author as a colored medallion, with the message rated with stars. But consider this: Slashdot's message board doesn't display the karma of post authors to anyone. Even the display of a user's own karma is vague: “positive,” “good,” or “excellent.” After originally displaying karma publicly as a number, over time Slashdot has shifted to an increasingly opaque display of karma.
  • Publicly displayed karma should be rare because it isn't expected. When Yahoo! Shopping added Top Reviewer karma to encourage review creation, they displayed a Top Reviewer badge with each review and rushed it out for the Christmas 2006 season. After the New Year had passed, user testing revealed that most users didn't even notice the badges. When they did notice them, many thought they meant either that the item was top rated or that the user was a paid shill for the product manufacturer or Yahoo!.

Though karma should be complex, it should still be limited to as narrow a context as possible. Don't mix shopping review karma with chess rank. It may sound silly now, but you'd be surprised how many people think they can make a business out of creating an Internet-wide trustworthiness karma.

Yahoo! holds karma scores to a higher standard than content reputation. Be very careful in applying terminology and labels to people, for several reasons:

  • Avoid labels that might appear as attacks. They set a hostile tone that will be amplified in users' responses. This caution applies to both overly positive labels (such as “hotshot” or “top” designations) and negative ones (such as “newbie” or “rookie”).
  • Avoid labels that introduce legal risks. What if a site labeled members of a health forum “experts,” and these “experts” then gave out bad advice?

These are rules of thumb that may not necessarily apply to a given context. In role-playing games, for example, publicly shared simple karma is displayed in terms of experience levels, which are inherently competitive.

Avoid negative public karma [Chapter 6]

This point is covered in detail in an earlier post The Dollhouse Mafia, or "Don't Display Negative Karma" - which anyone considering having negative karma effects in public reputation should read carefully. We'll only excerpt a small portion here:

This thinking—though seemingly intuitive—is impoverished, and is wrong in at least two important ways.

  • There can be no negative public karma-at least for establishing the trustworthiness of active users. A bad enough public score will simply lead to that user's abandoning the account and starting a new one, a process we call karma bankruptcy. This setup defeats the primary goal of karma-to publicly identify bad actors. Assuming that a karma starts at zero for a brand-new user that an application has no information about, it can never go below zero, since karma bankruptcy resets it. Just look at the record of eBay sellers with more than three red stars-you'll see that most haven't sold anything in months or years, either because the sellers quit or they're now doing business under different account names.
  • It's not a good idea to combine positive and negative inputs in a single public karma score. Say you encounter a user with 75 karma points and another with 69 karma points. Who is more trustworthy? You can't tell: maybe the first user used to have hundreds of good points but recently accumulated a lot of negative ones, while the second user has never received a negative point at all. If you must have public negative reputation, handle it as a separate score (as in the eBay seller feedback pattern).

Even eBay, with the most well-known example of public negative karma, doesn't represent how untrustworthy an actual seller might be-it only gives buyers reasons to take specific actions to protect themselves. In general, avoid negative public karma. If you really want to know who the bad guys are, keep the score separate and restrict it to internal use by moderation staff.
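A minimal sketch of what "handle it as a separate score" might look like in practice (the class and field names are illustrative, and the +1/-1 evaluations are assumptions, not eBay's actual data model):

  from dataclasses import dataclass

  @dataclass
  class PublicFeedback:
      positive: int = 0
      negative: int = 0   # kept as a separate tally, never folded into one net number

      def record(self, evaluation):
          if evaluation > 0:
              self.positive += 1
          elif evaluation < 0:
              self.negative += 1

  # Two sellers with the same net score (+70) tell very different stories
  # once the components are displayed separately:
  a = PublicFeedback(positive=70, negative=0)
  b = PublicFeedback(positive=100, negative=30)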

If you're still considering negative reputation, please [re]read the story of the Dollhouse Mafia and imagine your enemies attacking your system.

Public karma can discourage some contributors

Putting user reputations in a public ranked list creates a competitive environment, and some users' motivations are not at all compatible with being publicly recognized. Still others will see high karma as the goal of the activity instead of a benefit, and will start to change their behavior to optimize their actions around their karma instead of using the site as intended.

In Leaderboards Considered Harmful, we pointed out:

[...]ranking the members of your community—and pitting them one-against-the-other in a competitive fashion—is typically a bad idea. Like the fabled djinni of yore, leaderboards on your site promise riches (comparisons! incentives! user engagement!!) but often lead to undesired consequences.

[...]

This may be the most insidious artifact of a leaderboard community: the very presence of a leaderboard changes the community dynamic and calls into question the motivations of everyone for any action they might take.

December 16, 2009

The Sensical Moment: Asking for User Opinion When the Time is Right

If you're asking for explicit user opinions in your reputation system (ratings, reviews or even just a simple “Like”), pay special attention to exactly when you are asking for them. You'll get better data if you try to gather opinions when it makes most sense to do so: try to find the sensical moments to solicit user input.

Ideally, you'll catch reviewers in moments where they're…

Sufficiently Invested

Can you make it too easy for users to give reviews? You may not think so—if you're in the early stages of deploying your reputation system (or building your site), then you're probably more worried about getting people to use the system at all. And putting obstacles in front of potential reviewers certainly doesn't sound like a good way to alleviate those fears. But, long-term, the success of your reputation system will depend on quality, honest and unbiased opinions.

It may well be in your best interest to limit those who can, and cannot, give ratings. Require that users register, at least. Plain and simple. It should be the bare minimum level of investment that a user should make to voice an opinion on your site.

You may want to go even further. Yahoo! Answers, for instance, limits certain functions (rating questions & answers) to only those users who've achieved a certain status (Level 2) on the site.
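As a sketch of that kind of gate (the level threshold and function name are illustrative assumptions, not Yahoo!'s actual implementation):

  def can_rate(user_level, required_level=2):
      # Only users who have reached the required status level may rate questions and answers.
      return user_level >= required_level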

Recommendation: Make it easy, but not too easy, for users to give an opinion. Bake in some degree of accountability and ownership for publicly stated opinions.

Appropriately Informed

Don't ask your users to provide opinions on things they haven't experienced. This may be tricky, because the temptation will be strong to make rating objects as easy and low-friction as possible, which typically means putting rating controls in an easy-to-find location and keeping them there consistently. But consider the reputation value of 5-star ratings on YouTube (which we covered here only recently): do you suppose those generally-lackluster ratings distributions would improve if YouTube only allowed users to rate a video after first watching it? (To completion?)

This shortcoming is not limited to YouTube: years ago, Saleem Khan noted a trend on Digg where people were Digging up submissions with no way to have actually read the associated articles. (They couldn't have read them—the articles in question had gone offline, yet the favorable reviews continued to pour in.)

And even Apple has fallen victim to this oversight. Early iterations of the App Store rating system allowed for anyone to rate an iPhone app—whether they'd ever actually installed the app or not! This violates the "sufficient investment" principle, above, but it also seriously calls into question those reviewers' qualification to review. There's simply no way those ratings could have carried any real value—the reviewers weren't making informed decisions.

Apple eventually fixed this oversight. Now, you're given the opportunity to rate any app from the App Store interface, but when you try to do so for an app you've never tried?

[Screenshot: MustOwn.png — the App Store message shown when you try to rate an app you don't own]


Recommendation: Place ratings inputs either spatially or temporally downstream of the act of consumption.

But Not Overly Biased

Although Apple addressed that problem, they also introduced a new one. Now, when iPhone users attempt to delete an app from their device, they are asked to first rate the app.

[Screenshot: iphone-rate.jpg — the iPhone prompt asking you to rate an app as you delete it]

This is, of course, a horrible time to ask a user to rate an application: after they've decided they no longer need the app, and just as they're in the process of deleting it. Even an app that a user loved may fare poorly under these circumstances.

Perhaps it's truly a horrible app—in which case a bad rating would be justified— or perhaps the user just no longer has any use for it. (Maybe it's a game that he or she has already beaten, or a Twitter client made superfluous by a newer, sexier alternative.) By the time a user is uninstalling an iPhone app, the love affair with that app—if there ever was one—is unmistakably on the wane, and the average ratings likely reflect that fact.

Recommendation: Don't ask for ratings at the low-point of a user's relationship to the rated object.

And Not Too Distracted

Another major sin of the App Store's "parting shot" rating request is that it makes the act of rating into a roadblock. In this excellent comment, PJ Cabrera makes the point:

Who knows how many users are just inputting anything just to move on, without paying attention to what they're doing[?]
True, there is a "No Thanks" button, but its meaning is ambiguous and some reviewers may mistake its intent (perhaps reading it as a "Cancel this deletion" action instead.) It is hard for users to give honest and considered opinions when they are still caught up in the experience that you're asking them to evaluate.

It's common practice, when buying a new car, to receive a customer satisfaction survey from the manufacturer. (This survey is used as an input into the car-selling reputation of the dealership you bought from.) Why do you suppose that the manufacturers will typically wait a week or more before sending you the survey? It's because they know that with a little time and distance from the (often stressful) day of the transaction, you're more likely to give a measured, thoughtful, and accurate assessment of the transaction. (You're probably also more inclined to give a positive review, but that's a discussion for another post.)

Recommendation: Respect the primary tasks that a user may be engaged in on your site. Don't interrupt them unnecessarily in order to solicit ratings.

Special thanks to Laurent Stanevich for providing the iPhone app rating screenshot.

December 09, 2009

A Sneak-Peek at Reputation Concepts

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week, Bryce shares a simple work-in-progress and solicits your input to make it better.

Once upon a time, in (what feels like) a previous life, I illustrated some moderately well-received concept maps: diagrams intended to communicate some simple concepts about software systems and show the interrelationships between their moving parts.

Throughout work on Building Web Reputation Systems, it has always been my intent to attempt a compelling, engaging and fun-to-read concept map. Something to demonstrate the concepts that we've drawn on throughout the book. That was my intent anyway—it just never occurred to me how much work writing a book was going to be. So it hasn't been until fairly recently (like… um, tonight, actually) that I've been able to start pulling something together.

Adhering to our open policy, here, then is that very first rough-and-ugly (and incomplete!) sketch. (Click it for the full version on Flickr.)

[Image: RepConcepts.png — first rough sketch of the Reputation Systems Concept Map]

I usually don't use Omnigraffle in the design of these concept maps, but its looseness and speed of idea-capture just felt right for this one, so I'll probably let the general shape of the map simmer for a while in it before moving it over to Illustrator for some fun touches and polish.

This sketch is, admittedly, incomplete. I have a paper version, drafted beforehand, that's easily 150% this size (in terms of # of concepts and linkages.) Please feel free to comment here, or over on Flickr. Hopefully you've enjoyed this brief light interlude, and I'll share more about the progress on the Reputation Systems Concept Map as it evolves.

December 02, 2009

The Cake is a Lie: Reputation, Facebook Apps, and "Consent" User Interfaces

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week, Randy comes back from the IIW with a simple idea for improving application permissioning.

In early November, I attended the 9th meeting of the Internet Identity Workshop. One of the working sessions I attended was on Social Consent user interface design. After the session, I had an insight that reputation might play a pivotal role in solving one of the key challenges presented. I shared my detailed, yet simple, idea with Kevin Marks and he encouraged me to share my thoughts through a blog post—so here goes…

The Problem: Consent Dialogs

The technical requirements for the dialog are pretty simple: applications have to ask users for permission to access their sensitive personal data in order to produce the desired output—whether that's to create an invitation list, or to draw a pretty graph, or to create a personalized high-score table including your friends, or to simply sign and attach an optional profile photo to a blog comment.

The problem, however, is this—users often don't understand what they are being asked to provide, or the risks posed by granting access. It's not uncommon for a trivial quiz application to request access to virtually the same amount of data as much more "heavyweight" applications (like, say, an app to migrate your data between social networks.) Explaining this to users—in any reasonable level of detail—just before running the application causes them to (perhaps rightfully) get spooked and abandon the permission grant.

Conflicting Interests

The platform providers want to make sure that their users are making as informed a decision as possible, and that unscrupulous applications don't take advantage of their users.

The application developers want to keep the barriers to entry as low as possible. This fact creates a lot of pressure to (over)simplify the consent flow. One designer quipped that it reduces the user decision to a dialog with only two buttons: "Go" and "Go Away" (and no other text.)

The working group made no real progress. Kevin proposed creating categories, but that didn't get anywhere because it just moved the problem onto user education—"What permissions does QuizApp grant again?"

Reputation to the Rescue?

All consent dialogs of this stripe suffer from the same problem: Users are asked to make a trust decision about an application that, by definition, they know nothing about!

This is where identity meets trust, and that's the kind of problem that reputation is perfect for. Applications should have reputations in the platform's database. That reputation can be displayed as part of the information provided when granting consent.

Here's one proposed model (others are possible; this is offered as an exemplar).

The Cake is a Lie: Your Friends as Canaries in the Coal Mine of New Apps

First a formalism: when an application wants to access a user's private Information (I), it has a set of intended Purposes (P) it wishes to use that information for. Therefore, the consent could be phrased thusly:

"If you let me have your (I), I will give you (P). [Grant] [Deny]"

Example: "If you give me access to your friends list, I will give you cake."

In this system, I propose that the applications be compelled to declare this formulation as part of the consent API call. (P) would be stored along with the app's record in the platform database. So far, this is only slightly different from what we have now, and of course, the application could omit or distort the request.

This is where the reputation comes in. Whenever a user uninstalls an application, the user is asked to provide a reason (including abusive use of data), and is specifically asked whether the promise of (P) was kept.

"Did this application give you the [cake] it promised?"

All negative feedback is kept—to be re-used later when other new users install the app and encounter the consent dialog. If they have friends who have already uninstalled this application, complaining that the "If (I) then (P)" promise was false, then the moral equivalent of this would appear scrawled in the consent box:


"Randy says the [cake] was unsatisfactory.
Bryce says the [cake] was unsatisfactory.
Pamela says the application spammed her friends list."
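To make the proposal concrete, here's a minimal Python sketch of the flow described above. Every name and data structure here is an assumption for illustration; no real platform API is implied.

  from collections import defaultdict

  app_promises = {}                      # app_id -> (information requested (I), promised purpose (P))
  negative_feedback = defaultdict(list)  # app_id -> list of (user, complaint)
  friends = defaultdict(set)             # user -> set of friends

  def register_consent_request(app_id, info_requested, purpose):
      # The app must declare (I) and (P); the platform stores the promise.
      app_promises[app_id] = (info_requested, purpose)

  def record_uninstall(app_id, user, promise_kept, complaint=None):
      # On uninstall, ask whether the promised (P) was actually delivered.
      if not promise_kept:
          negative_feedback[app_id].append((user, complaint or "promise not kept"))

  def consent_dialog(app_id, user):
      # Show the declared promise, plus any complaints from the user's friends.
      info, purpose = app_promises[app_id]
      lines = ["If you let me have your %s, I will give you %s. [Grant] [Deny]" % (info, purpose)]
      for friend, complaint in negative_feedback[app_id]:
          if friend in friends[user]:
              lines.append("%s says: %s" % (friend, complaint))
      return "\n".join(lines)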

Afterthoughts

Lots of improvements are possible (not limiting it to friends, and letting early-adopters know that they are canaries in the coal mine.) These are left for future discussion.

Sure, this doesn't help early adopters.

But application reputation quickly shuts down apps that do obviously evil stuff.

Most importantly, it provides some insight to users by which they can make more informed consent decisions.

(And if you don't get the cake reference, you obviously haven't been playing Portal.)

November 18, 2009

Reputation is Identity

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry discusses the ways that reputation can make for richer user identities on your site. It is lightly adapted from our draft of Chapter 8.

Imagine you're at a party, and your friend Ted wants you to meet his friend, Mary. He might very well say something like… "I want you to meet my friend, Mary. She's the brunette over by the buffet line." A fine beginning, to be sure. It helps to know who you're dealing with. But now imagine that Ted ended there as well. He doesn't take you by the arm, walk you over to Mary, and introduce you face to face. Maybe he walks off to get another drink. Um… this does not bode well for your new friendship with Mary.

Sadly, until fairly recently, this has been the state of identity on much of the Web. When people were represented at all, they were often nothing more than a meager collection of sparse data elements: a username; maybe an avatar; just enough identifying characteristics that you might recognize them again later, but not much else.

With the advent of social on the web, things have improved. Perhaps the biggest improvement has been that people's relationships now form a sizable component of their identity and presence on most sites. Now, mutual friends or acquaintances can act as a natural entree to forming new relationships. So at least Ted now will go that extra step and walk you over to that buffet table for a proper introduction.

But, you still won't know much about Mary, will you? Once introductions are out of the way, what will you possibly have to talk about? The addition of reputation to your site will provide that much-needed final dimension to your users' identities: depth. Wouldn't it be nice to review a truly rich and deep view of Mary's identity on your site before deciding what you and she will or won't have in common?

Here are but a few reasons why user identities on your site will be stronger with reputation than they would be without.

  • Reputation is based on history, and the simple act of recording that history – a user's past actions, voting history, or the history of their relationship to the site – provides you with a lot of content (and context) that you can present to other users. This is a much richer model of identity than just a display-name and an avatar.
  • Visible histories reveal shared affinities and allow users with common interests to find each other. If you are a Top Contributor in the Board Games section of a site, then like-minded folks can find you, follow you, or invite you to participate in their activities.

    You will, however, find contexts where this is not desirable. On a question-and-answer site like Yahoo! Answers, for instance, don't be surprised to find out that many users won't want their questions about gonorrhea or chlamydia to appear as part of their historical record. Err on the side of giving your users control over what appears, or give them the ability to hide their participation history altogether.

  • A past is hard to fake. Most site identities are cheap. In and of themselves, they just don't mean much. A couple of quick form-fields, a 'Submit' button, and practically anyone (or no one – bots welcome!) can become a full-fledged member of most sites. It is much harder, however, to fake a history of interaction with a site for any duration of time.

    We don't mean to imply that it can't be done – harvesting 'deep' identities is practically an offshoot industry of the MMORPG world (See the figure above.) But it does provide a fairly high participatory hurdle to jump. When done properly, user karma can assure some level of commitment and engagement from your users. (Or at least allow you to ascertain those levels quickly.)

  • Reputation disambiguates identity conflicts. Hopefully, you've moved away from publicly identifying users on your site by their unique identifier. (You have read the Tripartite Identity Pattern, right?) But this introduces a whole new headache: identity spoofing. If your public namespace doesn't guarantee uniqueness (or even if it does – it'll be hard to guard against similar-appearing/l33t-speak equivalents and the like), then you'll have this problem.

    Once your community is at scale, trolls will take great delight in appropriating others' identities – assuming the same display name, uploading the same avatar – purely in an effort to disrupt conversations. It's not a perfect defense, but always associate a contributor's identity with his or her participation history or reputation to help mitigate these occurrences. You will, at least, have armed the community with the information they need to decide who's legit and who's an interloper.

These are some of the reasons that extending user identities with reputation is useful. Chapter 8 of Building Web Reputation Systems offers a series of considerations for how to do so most effectively.

November 11, 2009

5-Star Failure?

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry confirms that poorly chosen reputation inputs will indeed yield poor results.

Pity the poor, beleaguered 5-Star rating. Not so very long ago, it was the belle of the online ratings ball: its widespread adoption by high-profile sites like Amazon, Yahoo!, and Netflix influenced a host of imitators, and—at one point—star-ratings were practically an a priori choice for site designers when considering how best to capture their users' opinions. Their no-brainer inclusion had almost reached cargo cult design status.

This has subsided in recent years, as stars have received stiff competition from hot, upstart mechanisms like "Digg-style" voting (which we, when contributing to the Yahoo! Pattern Library, rechristened Vote to Promote) and Facebook's "Like" action (which, ahem, was "inspired by" FriendFeed, though let us not forget that Facebook also flirted, for a time, with Thumbs Up & Down ratings of feed items). Within the past two or three years, stars' 'obvious' appeal as the ratings mechanism of choice is no longer so obvious.

Even more recently, 5-star ratings' fall from grace has become almost complete. YouTube fired the first volley, declaring that people on YouTube overwhelmingly give 5 stars to videos on that site. (Readers of this site will recall that we blogged about similar J-curve distributions that are prevalent on Yahoo! as well.)

And then the venerable Wall Street Journal declared that On the Internet, Everyone's a Critic But They're Not Very Critical:

One of the Web's little secrets is that when consumers write online reviews, they tend to leave positive ratings: The average grade for things online is about 4.3 stars out of five.
And, just like that, as quickly as 'stars are it' rose to prominence, 'stars are dead' is rapidly becoming the accepted wisdom. (Don't believe me? Read the comments when TechCrunch covered the YouTube discovery, and you'll see folks all-but-rushing to prop up a variety of their 'preferred rating mechanism' in stars' place.)

Are stars dead?

This is, of course, the wrong way to frame the question. Stars, thumbs, favorites, or sliders: any of these ratings input mechanisms are dead-on-arrival if they're not carefully considered within the context of use. 5-Star ratings require a little more cognitive investment than a simple 'I Like This' statement, so--before designing 5-star ratings into your system--consider the following.

Will it be clear to users what you're asking them to assess? It's not entirely surprising that YouTube's ratings overwhelmingly tend toward the positive. That's a long-observed and well understood phenomenon in the social sciences called Acquiescence Bias. It is "the tendency of a respondent to agree with a statement when in doubt." And 5-star ratings, in the case of YouTube, are nothing but doubt. What, exactly, is a fair and accurate quantitative assessment for a video on YouTube? The input mechanism does provide some clues, in the form of text hints for the various ratings levels (ranging from 'Poor' to 'Awesome!') but these are highly subjective and - themselves - way too open to interpretation.

Is a scale necessary? If the primary decision you're asking users to make is 'good vs. bad' or 'I liked it' or 'I didn't', then are multiple steps of decisioning really adding anything to their evaluation?

Are comparisons being made? Should I, as a user, rate videos in comparison to other similar videos on YouTube? What, exactly, distinguishes a 5-star football to the groin video from a 2-star? Am I rating against like videos? Or all videos on YouTube? (Or every video I've ever seen!?)

Have they watched the video? One way to encourage more-thoughtful ratings is to place the input mechanism at the proper juncture: make some attempt, at least, to ensure that the user is rating the thing only after having experienced it. YouTube's 5-star mechanism is fixed and always-present, encouraging drive-by ratings, premature ratings or just general sloppiness of assessment.

So, are stars inappropriate for YouTube, at least in the way that they've designed them? Probably, yes.

To wrap up, some quick links. Check out this elegant and innovative design that the folks at Steepster recently rolled out, and think about the ways it cleverly addresses all four of the concerns listed above.

And to see a really in-depth study of 5-star ratings used effectively, check out Using 5-Star Ratings from Christopher Allen & Shannon Appelcline's excellent series on Systems for Collective Choice.


October 28, 2009

Ebay's Merchant Feedback System

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week, we explore, to some depth, one of the Web's longest-running and highest-profile reputation systems. (We also test-drive our new Google-maps powered zoomable diagrams. Wheee!)

EBay contains the Internet's most well-known and studied user reputation or karma system: seller feedback. Its reputation model, like most others that are several years old, is complex and continuously adapting to new business goals, changing regulations, improved understanding of customer needs, and the never-ending need to combat reputation manipulation through abuse.

Rather than detail the entire feedback karma model here, we'll focus on claims that are from the buyer and about the seller. An important note about eBay feedback is that buyer claims exist in a specific context: a market transaction - a successful bid at auction for an item listed by a seller. This specificity leads to a generally higher-quality karma score for sellers than they would get if anyone could just walk up and rate a seller without even demonstrating that they'd ever done business with them; see Chapter 1 - Implicit Reputation.

The scrolling/zooming diagram below shows how buyers influence a seller's karma scores on eBay. Though the specifics are unique to eBay, the pattern is common to many karma systems. For an explanation of the graphical conventions used, see Chapter 2.

The reputation model in this figure was derived from the following eBay pages: http://pages.ebay.com/help/feedback/scores-reputation.html and http://pages.ebay.com/services/buyandsell/welcome.html, both current as of July 2009.

We have simplified the model for illustration, specifically by omitting the processing for the requirement that only buyer feedback and Detailed Seller Ratings (DSR) provided over the previous 12 months are considered when calculating the positive feedback ratio, DSR community averages, and–by extension–power seller status. Also, eBay reports user feedback counters for the last month and quarter, which we are omitting here for the sake of clarity. Abuse mitigation features, which are not publicly available, are also excluded.

This diagram illustrates the seller feedback karma reputation model, which is made up of typical model components: two compound buyer input claims - seller feedback and detailed seller ratings - and several roll-ups of the seller's karma: community feedback ratings (a counter), feedback level (a named level), positive feedback percentage (a ratio), and the power seller rating (a label).

The context for the buyer's claims is a transaction identifier-the buyer may not leave any feedback before successfully placing a winning bid on an item listed by the seller in the auction market. Presumably, the feedback primarily describes the quality and delivery of the goods purchased. A buyer may provide two different sets of complex claims, and the limits on each vary:

  • 1. Typically, when a buyer wins an auction, the delivery phase of the transaction starts and the seller is motivated to deliver the goods of the quality advertised in a timely manner. After either a timer expires or the goods have been delivered, the buyer is encouraged to leave feedback on the seller, a compound claim in the form of a three-level rating-positive, neutral, or negative-and a short text-only comment about the seller and/or transaction. The ratings make up the main component of seller feedback karma.
  • 2. Once each week in which a buyer completes a transaction with a seller, the buyer may leave detailed seller ratings, a compound claim of four separate 5-star ratings in these categories: item as described, communications, shipping time, and shipping and handling charges. The only use of these ratings, other than aggregation for community averages, is to qualify the seller as a power seller.

EBay displays an extensive set of karma scores for sellers: the amount of time the seller has been a member of eBay; color-coded stars; percentages that indicate positive feedback; more than a dozen statistics tracking past transactions; and lists of testimonial comments from past buyers or sellers. This is just a partial list of the seller reputations that eBay puts on display.

The full list of displayed reputations almost serves as a menu of reputation types present in the model. Every process box represents a claim displayed as a public reputation to everyone, so to provide a complete picture of eBay seller reputation, we'll simply detail each output claim separately:

  • 3. The feedback score counts every positive rating given by a buyer as part of seller feedback, a compound claim associated with a single transaction. This number is cumulative for the lifetime of the account, and it generally loses its value over time-buyers tend to notice it only if it has a low value.

It is fairly common for a buyer to change this score, within some time limitations, so this effect must be reversible. Sellers spend a lot of time and effort working to change negative and neutral ratings to positive ratings to gain or to avoid losing a power seller rating. When this score changes, it is then used to calculate the feedback level.

  • 4. The feedback level claim is a graphical representation (in colored stars) of the feedback score. This process is usually a simple data transformation and normalization process; here we've represented it as a mapping table, illustrating only a small subset of the mappings. This visual system of stars on eBay relies, in part, on the assumption that users will know that a red shooting star is a better rating than a purple star. But we have our doubts about the utility of this representation for buyers. Iconic scores such as these often mean more to their owners, and they might represent only a slight incentive for increasing activity in an environment in which each successful interaction equals cash in your pocket.
  • 5. The community feedback rating is a compound claim containing the historical counts for each of the three possible seller feedback ratings-positive, neutral, and negative-over the last 12 months, so that the totals can be presented in a table showing the results for the last month, 6 months, and year. Older ratings are decayed continuously, though eBay does not disclose how often this data is updated if new ratings don't arrive. One possibility would be to update the data whenever the seller posts a new item for sale.

The positive and negative ratings are used to calculate the positive feedback percentage.

  • 6. The positive feedback percentage claim is calculated by dividing the positive feedback ratings by the sum of the positive and negative feedback ratings over the last 12 months. Note that the neutral ratings are not included in the calculation. This is a recent change reflecting eBay's confidence in the success of updates deployed in the summer of 2008 to prevent bad sellers from using retaliatory ratings against buyers who are unhappy with a transaction (known as tit-for-tat negatives). Initially this calculation included neutral ratings because eBay feared that negative feedback would be transformed into neutral ratings. It was not.

This score is an input into the highly coveted power seller rating. This means that every individual positive and negative rating given on eBay is critical: it can mean the difference between a seller acquiring power seller status or not.

  • 7. The Detailed Seller Ratings community averages are simple reversible averages for each of the four ratings categories: item as described, communications, shipping time, and shipping and handling charges. There is a limit on how often a buyer may contribute DSRs.

EBay only recently added these categories as a new reputation model because including them as factors in the overall seller feedback ratings diluted the overall quality of seller and buyer feedback. Sellers could end up in disproportionate trouble just because of a bad shipping company or a delivery that took a long time to reach a remote location. Likewise, buyers were bidding low prices only to end up feeling gouged by shipping and handling charges. Fine-grained feedback allows one-off small problems to be averaged out across the DSR community averages instead of being translated into red-star negative scores that poison trust overall. Fine-grained feedback for sellers is also actionable by them and motivates them to improve, since these DSR scores make up half of the power seller rating.

  • 8. The power seller rating, appearing next to the seller's ID, is a prestigious label that signals the highest level of trust. It includes several factors external to this model, but two critical components are the positive feedback percentage, which must be at least 98%, and the DSR community averages, which each must be at least 4.5 stars (around 90% positive). Interestingly, the DSR scores are more flexible than the feedback average, which tilts the rating toward overall evaluation of the transaction rather than the related details.
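
As a rough illustration of the roll-ups in items 6 and 8, here is a minimal Python sketch based only on the rules stated above (neutrals excluded from the ratio; a 98% positive feedback percentage and 4.5-star DSR community averages for power seller status). It is not eBay's code, and the external power seller criteria mentioned above are omitted.

    def positive_feedback_percentage(positives_12mo, negatives_12mo):
        # Item 6: neutral ratings are excluded from the ratio.
        total = positives_12mo + negatives_12mo
        if total == 0:
            return 100.0  # no positive or negative signal yet
        return 100.0 * positives_12mo / total

    def qualifies_for_power_seller(positives_12mo, negatives_12mo, dsr_averages):
        # Item 8: requires a 98%+ positive feedback percentage and every DSR
        # community average at 4.5 stars or better; other (external) criteria omitted.
        ratio_ok = positive_feedback_percentage(positives_12mo, negatives_12mo) >= 98.0
        dsr_ok = all(avg >= 4.5 for avg in dsr_averages.values())
        return ratio_ok and dsr_ok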

Though the context for the buyer's claims is a single transaction or history of transactions, the context for the aggregate reputations that are generated is trust in the eBay marketplace itself. If the buyers can't trust the sellers to deliver against their promises, eBay cannot do business. The roll-ups transform the single-transaction claims into trust in the seller, and, by extension, that same trust rolls up into eBay itself. This chain of trust is so integral and critical to eBay's continued success that they must continuously update the marketplace's interface and reputation systems.

October 21, 2009

User Motivations & System Incentives

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry summarizes our model for describing user motivations and incentives for participation in reputation systems.

This is a short summary of a large section of Chapter 6 of our book, Building Web Reputation Systems, entitled Incentives for User Participation, Quality, and Moderation. For this blog post, the content is being shuffled a bit. First we will name the motivations and related incentive models, then we'll describe how reputation systems interact with each motivational category. To read a more detailed discussion of the incentive sub-categories, see Chapter 6.

Motivations and Incentives for social media participation:

  • Altruistic motivation: for the good of others
    • Tit-for-Tat or Pay-it-Forward incentives: "I do it because someone else did it for me first"
    • Friendship incentives: "I do it because I care about others who will consume this"
    • Know-it-All or Crusader or Opinionated incentives: "I do it because I know something everyone else needs to know"
  • Commercial motivation: to generate revenue
    • Direct revenue incentives: Extracting commercial value (better yet, cash) directly from the user as soon as possible
    • Branding incentives: Creating indirect value by promotion - revenue will follow later
  • Egocentric motivation: for self-gratification
    • Fulfillment incentives: The desire to complete a task, assigned by oneself, a friend, or the application
    • Recognition incentives: The desire for the praise of others
    • The Quest for Mastery: Personal and private motivation to improve oneself

Altruistic or Sharing Incentives

Altruistic, or sharing, incentives reflect the giving nature of users who have something to share-a story, a comment, a photo, an evaluation-and who feel compelled to share it on your site. Their incentives are internal: they may feel an obligation to another user or to a friend, or they may feel loyal to (or despise) your brand.

When you're considering reputation models that offer altruistic incentives, remember that these incentives exist in the realm of social norms-they're all about sharing, not accumulating commercial value or karma points. Avoid aggrandizing users driven by altruistic incentives-they don't want their contributions to be counted, recognized, ranked, evaluated, compensated, or rewarded in any significant way. Comparing their work to anyone else's will actually discourage them from participating.

(See more on Tit-for-Tat, Friend, and Know-it-All altruistic incentives.)

Commercial Incentives

Commercial incentives reflect people's motivation to do something for money, though the money may not come in the form of direct payment from the user to the content creator. Advertisers have a nearly scientific understanding of the significant commercial value of something they call branding. Likewise, influential bloggers know that their posts build their brand, which often involves the perception of them as subject matter experts. The standing that they establish may lead to opportunities such as speaking engagements, consulting contracts, improved permanent positions at universities or prominent corporations, or even a book deal. A few bloggers may actually receive payment for their online content, but more are capturing commercial value indirectly.

Reputation models that exhibit content control patterns based on commercial incentives must communicate a much stronger user identity. They need strong and distinctive user profiles with links to each user's valuable contributions and content. For example, as part of reinforcing her personal brand, an expert in textile design would want to share links to content that she thinks her fans will find noteworthy.

But don't confuse the need to support strong profiles for contributors with the need for a strong or prominent karma system. When a new brand is being introduced to a market, whether it's a new kind of dish soap or a new blogger on a topic, a karma system that favors established participants can be a disincentive to contribute content. A community decides how to treat newcomers-with open arms or with suspicion. An example of the latter is eBay, where all new sellers must "pay their dues" and bend over backward to get a dozen or so positive evaluations before the market at large will embrace them as trustworthy vendors. Whether you need karma in your commercial incentive model depends on the goals you set for your application. One possible rule of thumb: If users are going to pass money directly to other people they don't know, consider adding karma to help establish trust.

(See more on Direct revenue and Branding commercial incentives.)

Egocentric Incentives

Egocentric incentives are often exploited in the design of online computer games and many reputation-based websites. The simple desire to accomplish a task taps into deeply hard-wired motivations described in behavioral psychology as classical and operant conditioning (which involves training subjects to respond to food-related stimuli) and schedules of reinforcement. This research indicates that people can be influenced to repeat simple tasks by providing periodic rewards, even a reward as simple as a pleasing sound.

But, an individual animal's behavior in the social vacuum of a research lab is not the same as the ways in which we very social humans reflect our egocentric behaviors to one another. Humans make teams and compete in tournaments. We follow leaderboards comparing ourselves to others and comparing groups that we associate ourselves with. Even if our accomplishments don't help another soul or generate any revenue for us personally, we often want to feel recognized for them. Even if we don't seek accolades from our peers, we want to be able to demonstrate mastery of something-to hear the message "You did it! Good job!"

Therefore, in a reputation system based on egocentric incentives, user profiles are a key requirement. In this kind of system, users need someplace to show off their accomplishments-even if only to themselves. Almost by definition, egocentric incentives involve one or more forms of karma. Even with only a simple system of granting trophies for achievements, users will compare their collections to one another. New norms will appear that look more like market norms than social norms: people will trade favors to advance their karma, people will attempt to cheat to get an advantage, and those who feel they can't compete will opt out altogether.

Egocentric incentives and karma do provide very powerful motivations, but they are almost antithetical to altruistic ones. The egocentric incentives of many systems have been over-designed, leading to communities consisting almost exclusively of experts. Consider just about any online role-playing game that has survived more than three years. For example, to retain its highest-level users and the revenue stream they produce, World of Warcraft must continually produce new content targeted at those users. If they stop producing new content for their most dedicated users, their business will collapse. This elder-game focus stunts WoW's growth -- parent company Blizzard has all but abandoned improvements aimed at acquiring new users. When new users do arrive (usually in the wake of a marketing promotion), they end up playing alone, because the veteran players are only interested in the new content and don't want to bother going through the long slog of playing through the lowest levels of the game yet again.

(See more on Fulfillment, Recognition, and Quest-for-Mastery egocentric incentives.)

October 14, 2009

A Case Study: Yahoo! Answers Community Moderation

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry announces two important milestones.


We are proud to announce that our Chapter 12 Case Study—Yahoo! Answers Community Content—is now available for review! This chapter is a doozy. Using the structure and guidance from the rest of the book, it attempts to describe, in detail, a project that has saved Yahoo! millions of dollars in customer care costs (and produced a stronger, more content-vibrant community in the process.) No excerpts here. It's all good stuff—go read it.

And, coinciding with this draft chapter release, Randy and I can also announce that we've achieved an important milestone for the book: draft complete status. Our editor Mary blessed it on Monday. We're expecting feedback from our early reviewers soon that will dictate the tempo and scope of re-writes, so… stay tuned! We will, of course, continue to blog here and stick faithfully to our Reputation Wednesday schedule.

Whew.

October 07, 2009

The Dollhouse Mafia, or "Don't Display Negative Karma"

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay explains why publicly displayed user reputation (karma) is a very bad idea. It is excerpted from Chapter 7: Objects, Inputs, Scope, and Mechanism.

Because an underlying karma score is a number, product managers often misunderstand the interaction between numerical values and online identity. The thinking goes something like this:

  • In our application context, the users' value will be represented by a single karma, which is a numerical value.
  • There are good, trustworthy users and bad, untrustworthy users, and everyone would like to know which is which, so we will display their karma.
  • We should represent good actions as positive numbers and bad actions as negative, and we'll add them up to make karma.
  • Good users will have high positive scores (and other users will interact with them), and bad users will have low negative scores (and other users will avoid them).

This thinking—though seemingly intuitive—is impoverished, and is wrong in at least two important ways.

  • There can be no negative public karma-at least for establishing the trustworthiness of active users. A bad enough public score will simply lead to that user's abandoning the account and starting a new one, a process we call karma bankruptcy. This setup defeats the primary goal of karma-to publicly identify bad actors. Assuming that a karma starts at zero for a brand-new user that an application has no information about, it can never go below zero, since karma bankruptcy resets it. Just look at the record of eBay sellers with more than three red stars-you'll see that most haven't sold anything in months or years, either because the sellers quit or they're now doing business under different account names.
  • It's not a good idea to combine positive and negative inputs in a single public karma score. Say you encounter a user with 75 karma points and another with 69 karma points. Who is more trustworthy? You can't tell: maybe the first user used to have hundreds of good points but recently accumulated a lot of negative ones, while the second user has never received a negative point at all. If you must have public negative reputation, handle it as a separate score (as in the eBay seller feedback pattern).

Even eBay, with the most well-known example of public negative karma, doesn't represent how untrustworthy an actual seller might be-it only gives buyers reasons to take specific actions to protect themselves. In general, avoid negative public karma. If you really want to know who the bad guys are, keep the score separate and restrict it to internal use by moderation staff.
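
As a toy illustration of that ambiguity (our own sketch, not code from the book), keep the positive and negative tallies separate, show only the positive one publicly, and reserve the combined view for moderators:

    class Karma:
        """Toy sketch: keep positive and negative tallies separate."""

        def __init__(self):
            self.positive = 0
            self.negative = 0   # internal use only; never displayed publicly

        def record(self, delta):
            if delta > 0:
                self.positive += delta
            else:
                self.negative += -delta

        def public_score(self):
            return self.positive                  # what other users see

        def internal_net(self):
            return self.positive - self.negative  # for moderation staff only

    # Two users with the same net score but very different histories:
    veteran, newcomer = Karma(), Karma()
    veteran.record(+100)
    veteran.record(-31)
    newcomer.record(+69)
    assert veteran.internal_net() == newcomer.internal_net() == 69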

Virtual Mafia Shakedown: Negative Public Karma

The Sims Online was a multiplayer version of the popular Sims games by Electronic Arts and Maxis in which the user controlled an animated character in a virtual world with houses, furniture, games, virtual currency (called Simoleans), rental property, and social activities. You could call it playing dollhouse online.

One of the features that supported user socialization in the game was the ability to declare that another user was a trusted friend. The feature involved a graphical display that showed the faces of users who had declared you trustworthy outlined in green, attached in a hub-and-spoke pattern to your face in the center.

People checked each other's hubs for help in deciding whether to take certain in-game actions, such as becoming roommates in a house. Decisions like these are costly for a new user – the ramifications of the decision stick with a newbie for a long time, and backing out of a bad decision is not an easy thing to do. The hub was a useful decision-making device for these purposes.

That feature was fine as far as it went, but unlike other social networks, The Sims Online allowed users to declare other users untrustworthy too. The face of an untrustworthy user appeared circled in bright red among all the trustworthy faces in a user's hub.

It didn't take long for a group calling itself the Sims Mafia to figure out how to use this mechanic to shake down new users when they arrived in the game. The dialog would go something like this:

"Hi! I see from your hub that you're new to the area. Give me all your Simoleans or my friends and I will make it impossible to rent a house.”

"What are you talking about?"

"I'm a member of the Sims Mafia, and we will all mark you as untrustworthy, turning your hub solid red (with no more room for green), and no one will play with you. You have five minutes to comply. If you think I'm kidding, look at your hub-three of us have already marked you red. Don't worry, we'll turn it green when you pay…"

If you think this is a fun game, think again-a typical response to this shakedown was for the user to decide that the game wasn't worth $10 a month. Playing dollhouse doesn't usually involve gangsters.

Avoid public negative reputation. Really.


September 30, 2009

First Mover Effects

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is concerned with important downstream effects that can arise from the first tentative days & weeks of a community's formation. It is excerpted from Chapter 4: Building Blocks and Reputation Tips.

When an application handles quantitative measures based on user input, whether it's ratings or measuring participation by counting the number of contributions to a site, several issues arise-all resulting from bootstrapping of communities-that we group together under the term first-mover effects.

Early Behavior Modeling and Early-Ratings Bias

The first people to contribute to a site have a disproportionate effect on the character and future contributions of others. After all, this is social media, and people usually try to fit into any new environment. For example, if the tone of comments is negative, new contributors will also tend to be negative, which will also lead to bias in any user-generated ratings. See Ratings Bias Effects.

When an operator introduces user-generated content and associated reputation systems, it is important to take explicit steps to model behavior for the earliest users in order to set the pattern for those who follow.

Discouraging New Contributors

Take special care with systems that contain leaderboards when they're used either for content or for users. Items displayed on leaderboards tend to stay on the leaderboards, because the more people who see those items and click, rate, and comment on them, the more who will follow suit, creating a self-sustaining feedback loop.

This loop not only keeps newer items and users from breaking into the leaderboards, it discourages new users from even making the effort to participate by giving the impression that they are too late to influence the result in any significant way. Though this phenomenon applies to all reputation scores, even for digital cameras, it's particularly acute in the case of simple point-based karma systems, which give active users ever more points for activity so that leaders, over years of feverish activity, amass millions of points, making it mathematically impossible for new users to ever catch up.

September 23, 2009

Party Crashers (or 'Who invited these clowns?')

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week, we look at some of the possible effects when unanticipated guests enter into your carefully-planned and modeled system. This essay is excerpted from Chapter 5.

Reputation can be a successful motivation for users to contribute large volumes of content and/or high-quality content to your application. At the very least, reputation can provide critical money-saving value to your customer care department by allowing users to prioritize the bad content for attention and likewise flag power users and content to be featured.

But mechanical reputation systems, of necessity, are always subject to unwanted or unanticipated manipulation: they are only algorithms, after all. They cannot account for the many, sometimes conflicting, motivations for users' behavior on a site. One of the strongest motivations of users who invade reputation systems is commercial. Spam invaded email. Marketing firms invade movie review and social media sites. And drop-shippers are omnipresent on eBay.

EBay drop-shippers put the middleman back into the online market: they are people who resell items that they don't even own. It works roughly like this:

  1. A seller develops a good reputation, gaining a seller feedback karma of at least 25 for selling items that she personally owns.
  2. The seller buys some drop-shipping software, which helps locate items for sale on eBay and elsewhere cheaply, or joins an online drop-shipping service that has the software and presents the items in a web interface.
  3. The seller finds cheap items to sell and lists them on eBay for a higher price than they're available for in stores but lower than other eBay sellers are selling them for. The seller includes an average or above-average shipping and handling charge.
  4. The seller sells an item to a buyer, receives payment, and sends an order for the item, along with a drop-shipping payment, to the drop-shipper (D), who then delivers the item to the buyer.

This model of doing business was not anticipated by the eBay seller feedback karma model, which only includes buyers and sellers as reputation entities. Drop-shippers are a third party in what was assumed to be a two-party transaction, and they cause the reputation model to break in various ways:

  • The original shippers sometimes fail to deliver the goods as promised to the buyer. The buyer then gets mad and leaves negative feedback: the dreaded red star. That would be fine, but it is the seller - who never saw or handled the goods - who receives the mark of shame, not the actual shipping party.
  • This arrangement is a big problem for the seller, who cannot afford the negative feedback if she plans to continue selling on eBay.
  • The typical options for rectifying a bungled transaction won't work in a drop-shipper transaction: it is useless for the buyer to return the defective goods to the seller. (They never originated from the seller anyway.) Trying to unwind the shipment (the buyer returns the item to the seller; the seller returns it to the drop-shipper-if that is even possible; the drop-shipper buys or waits for a replacement item and finally ships it) would take too long for the buyer, who expects immediate recompense.

In effect, the seller can't make the order right with the customer without refunding the purchase price in a timely manner. This puts them out-of-pocket for the price of the goods along with the hassle of trying to recover the money from the drop-shipper.

But a simple refund alone sometimes isn't enough for the buyer! No, depending on the amount of perceived hassle and effort this transaction has cost them, they are still likely to rate the transaction negatively overall. (And rightfully so – once it's become evident that a seller is working through a drop-shipper, many of their excuses and delays start to ring very hollow.) So a seller may have, at this point, outlayed a lot of their own time and money to rectify a bad transaction only to still suffer the penalties of a red star.

What option does the seller have left to maintain their positive reputation? You guessed it – a payoff. Not only will a concerned seller eat the price of the goods – and any shipping involved – but they will also pay an additional cash bounty (typically up to $20.00) to get buyers to flip a red star to green.

What is the cost of clearing negative feedback on drop-shipped goods? The cost of the item + $20.00 + lost time in negotiating with the buyer. That's the cost that reputation imposes on drop-shipping on eBay.

The lesson here is that a reputation model will be reinterpreted by users as they find new ways to use your site. Site operators need to keep a wary eye on the specific behavior patterns they see emerging and adapt accordingly. Chapter 10 provides more detail and specific recommendations for prospective reputation modelers.

September 16, 2009

Yahoo! Answers Community Moderation

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week, we pause to highlight a great presentation from Micah Alpern of Yahoo!

Micah Alpern is Director of User Experience for Social Search at Yahoo! and was, at one time, the lead User Experience designer for the first several iterations of Yahoo! Answers. One of the final projects Micah worked on for Answers was a reputation-intensive program to reduce the amount of abusive content that was appearing on that site.

We'll be covering this very project, Yahoo! Answers Community Moderation, as an in-depth case study in our soon-to-be-drafted Chapter 12, and Micah recently gave a fantastic presentation about it at the Wikimania 2009 conference. It covers everything from business goals, community metrics, design, and implementation to some insight into how well the project performed, and continues to perform.

If you just can't get enough, we'd also recommend you check out the video of Micah's presentation. Thanks for the presentation, Micah! (And you'll be seeing Micah, Ori Zaltzman, Yvonne French and other key drivers of that project surface in Chapter 12.)

September 09, 2009

Time Decay in Reputation Systems

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is excerpted from Chapter 4: Building Blocks and Reputation Tips.

Time leeches value from reputation: the section called "First Mover Effects" discussed how simple reputation systems grant early contributions disproportionate value over time, but there's also the simple problem that ratings become stale as their target reputable entities change or become unfashionable - businesses change ownership, technology becomes obsolete, cultural mores shift.

The key insight to dealing with this problem is to remember the expression “What did you do for me this week?” When you're considering how your reputation system will display reputation and use it indirectly to modify the experience of users, remember to account for time value. A common method for compensating for time in reputation values is to apply a decay function: subtract value from the older reputations as time goes on, at a rate that is appropriate to the context. For example, digital camera ratings for resolution should probably lose half their weight every year, whereas restaurant reviews should only lose 10% of their value in the same interval.

Here are some specific algorithms for decaying a reputation score over time:

  • Linear Aggregate Decay
    • Every score in the corpus is decreased by a fixed percentage per unit time elapsed, whenever it is recalculated. This is high-performance, but rarely updated reputations will retain disproportionately high values. To compensate, a timer input can perform the decay process at regular intervals.
  • Dynamic Decay Recalculation
    • Every time a score is added to the aggregate, recalculate the value of every contributing score. This method provides a smoother curve, but it tends to become computationally expensive (O(n²)) over time.
  • Window-based Decay Recalculation
    • The Yahoo! Spammer IP reputation system has used a time-window-based decay calculation: a fixed-time or fixed-size window of previous contributing claim values is kept with the reputation for dynamic recalculation when needed. New values push old values out of the window, and the aggregate reputation is recalculated from those that remain. This method produces a score with the most recent information available, but the information for low-liquidity aggregates may still be old.
  • Time-limited Recalculation
    • This is the de facto method that most engineers use to present any information in an application: fetch all of the ratings in a time range from the database and compute the score just in time. This is the most costly method, because it involves hitting the database every time an aggregate reputation is considered (say, for a ranked list of hotels), when 99% of the time the value is exactly the same as it was the last time it was calculated. This method may also throw away still contextually valid reputation. We recommend trying some of the higher-performance suggestions above.
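
For concreteness, here are minimal Python sketches of the first and third approaches above. The decay rate, the compounding-by-day interpretation, and the window size are illustrative assumptions, not values from the book:

    import time
    from collections import deque

    def linear_aggregate_decay(score, last_updated, now=None, decay_per_day=0.002):
        # Linear Aggregate Decay: shrink the stored score by a fixed fraction per elapsed day,
        # applied whenever the score is recalculated (or by a timer input).
        now = now if now is not None else time.time()
        days = max(0.0, (now - last_updated) / 86400.0)
        return score * (1.0 - decay_per_day) ** days

    class WindowedAverage:
        # Window-based Decay Recalculation: keep only the most recent N contributing
        # claim values; new values push old ones out, and the aggregate is recomputed.
        def __init__(self, window_size=50):
            self.values = deque(maxlen=window_size)

        def add(self, value):
            self.values.append(value)

        def score(self):
            return sum(self.values) / len(self.values) if self.values else None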

September 02, 2009

Chapters 3 (Architecture) & 9 (Uses) Are Up

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry introduces two new chapters in our book.

The wiki for Building Web 2.0 Reputation Systems has been updated with two new chapters that are very different from each other.

Chapter 3 - The Reputation Sandbox is a fairly technical discussion of the execution environment for reputation models and establishing the product requirements to construct just such a sandbox. If that last sentence didn't make any sense to you, this technically oriented chapter can safely be skipped. Perhaps you will like something from the next one...

Chapter 9 - Using Reputation: The Good, The Bad and the Ugly presents a whole host of reputation-driven strategies for improving content quality (and the perception of said) on your community-driven site. This chapter will interest UX designers, product- and community-managers and social architects of all stripes.

No excerpts this time around folks. For their respective audiences, we think the chapters themselves are packed with chewy goodness.

August 26, 2009

Tag, You're It!

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry shares some news about the production and design of our book, and asks for your help in pushing it forward to completion.

This is very exciting for us. We're close enough to draft-complete that our wonderful editor at O'Reilly, Mary, went ahead and pulled the trigger on our cover-art! Earlier this week, she shared this with us, passed along from O'Reilly Creative Director Edie Freedman.

Figure: The proposed cover comp (build_web_reputation_sys_comp.jpg)

We love it. It's a beautiful parrot, and I really like the timeless, classic appeal of it. (To be honest, I can't believe that no animal cover has featured a parrot before now. But it's true.)

For those of you paying attention, yes we did share your animal suggestions with the creative team at ORA. They enjoyed them immensely and then—true to the admonition that leads off that page—set them aside and picked this big beautiful bird. (Or 'boo-wuh' as my toddler son says after repeat viewings on Daddy's laptop.) We're pleased with the end result, and excited to see the book coming thismuchcloser to reality.

However… we still need your help! All O'Reilly books feature a tagline, and we need some good suggestions. To accompany the written proposal for the book, Randy found a fun little "O'Reilly cover generator" somewhere online, and the tagline he provided to that was "Ratings, Reviews and Karma, Oh My!" That effort was only semi-facetious—it does highlight some of the principal patterns and methods discussed in the book. Not easy to do in 2 lines of text.

So, please—if you've been following the progress of the book and have some ideas about a tagline, we'd love to hear your thoughts. Please leave a comment on this page.

August 19, 2009

Low Liquidity Compensation for Reputation Systems

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is excerpted from Chapter 4: Building Blocks and Reputation Tips. This tip provides a solution to an age-old problem with ratings.
 

A question of liquidity -

When is 4.0 > 5.0? When enough people say it is!

 
  --2007, F. Randall Farmer, Yahoo! Community Analyst

Consider the following problem with simple averages: it is mathematically unreasonable to compare two similar targets whose averages are made from significantly different numbers of inputs. For the first target, suppose that there are only three ratings averaging 4.667 stars, and compare that average score to a target with a much greater number of inputs, say 500, averaging 4.4523 stars. The second target, the one with the lower average, better reflects the true consensus of the inputs, since there just isn't enough information on the first target to be sure of anything. Most simple-average displays with too few inputs shift the burden of evaluating the reputation onto users by displaying the number of inputs alongside the simple average, usually in parentheses, like this: (142).

But pawning off the interpretation of averages on users doesn't help when you're ranking targets on the basis of averages-a lone rating on a brand-new item will put the item at the top of any ranked results it appears in. This effect is inappropriate and should be compensated for.

We need a way to adjust the ranking of an entity based on the quantity of ratings. Ideally, an application performs these calculations on the fly so that no additional storage is required.

We provide the following solution: a high-performance liquidity compensation algorithm to offset variability in very small sample sizes. It's used on Yahoo! sites to which many new targets are added daily, with the result that, often, very few ratings are applied to each one.

  • RankMean
    • r = SimpleMean m - AdjustmentFactor a + LiquidityWeight l * AdjustmentFactor a
  • LiquidityWeight
    • l = min(max((NumRatings n - LiquidityFloor f) / LiquidityCeiling c, 0), 1) * 2
  • Or
    • r = m - a + min(max((n - f) / c, 0.00), 1.00) * 2.00 * a

This formula produces a curve seen in the figure below. Though a more mathematically continuous curve might seem appropriate, this linear approximation can be done with simple nonrecursive calculations and requires no knowledge of previous individual inputs.

Figure: The effects of the liquidity compensation algorithm

Suggested initial values for a, c, and f (assuming normalized inputs):

  • AdjustmentFactor
    • a = 0.10

This constant is the fractional amount to remove from the score before adding back in effects based on input volume. For many applications, such as 5-star ratings, it should be within the range of integer rounding error-in this example, if the AdjustmentFactor is set much higher than 10%, a lot of 4-star entities will be ranked before 5-star ones. If it's set too much lower, it may not have the desired effect.

  • LiquidityFloor
    • f = 10

This constant is the threshold for which we consider the number of inputs required to have a positive effect on the rank. In an ideal environment, this number is between 5 and 10, and our experience with large systems indicates that it should never be set lower than 3. Higher numbers help mitigate abuse and get better representation in consensus of opinion.

  • LiquidityCeiling
    • c = 60

This constant is the threshold beyond which additional inputs will not get a weighting bonus. In short, we trust the average to be representative of the optimum score. This number must not be lower than 30, which in statistics is the minimum required for a t-score. Note that the t-score cutoff is 30 for data that is assumed to be unmanipulated (read: random). We encourage you to consider other values for a, c, and f, especially if you have any data on the characteristics of your sources and their inputs.
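
Translated directly into Python with the suggested initial constants, and assuming inputs normalized to the range 0.0-1.0, the compensation looks like this:

    def rank_mean(simple_mean, num_ratings, a=0.10, f=10, c=60):
        # Liquidity-compensated score used for ranking (not for display).
        liquidity_weight = min(max((num_ratings - f) / float(c), 0.0), 1.0) * 2.0
        return simple_mean - a + liquidity_weight * a

    # Three ratings averaging 0.93 (4.667 of 5 stars, normalized) now rank below
    # 500 ratings averaging 0.89 (4.4523 of 5 stars, normalized).
    assert rank_mean(0.93, 3) < rank_mean(0.89, 500)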

August 12, 2009

Ratings Bias Effects

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is excerpted from Chapter 4: Building Blocks and Reputation Tips. It uses our experience with Yahoo! data to share some thoughts surrounding user ratings bias, and how to overcome it. You may be surprised by our recommendations.

Figure: Some Yahoo! Sites Ratings Distribution: "One of these things is not like the other. One of these things just doesn't belong."


This figure shows the graphs of 5-star ratings from nine different Yahoo! sites with all the volume numbers redacted. We don't need them, since we only want to talk about the shapes of the curves.

Eight of these graphs have what is known to reputation system aficionados as a J-curve: the far right point (5 stars) has the very highest count, 4 stars the next, and 1 star a little more than the rest. Generally, a J-curve is considered less than ideal for several reasons: the average aggregate scores all clump together between 4.5 and 4.7, so they all display as 4 or 5 stars and are not so useful for visually sorting between options. Also, this sort of curve begs the question: why use a 5-point scale at all? Wouldn't you get the same effect with a simpler thumbs-up/down scale, or maybe even just a super-simple favorite pattern?

The outlier among the graphs is Yahoo! Autos Custom (now shut down), where users rated the car-profile pages created by other users; it has a W-curve. Lots of 1-, 3-, and 5-star ratings, and a healthy share of 4- and 2-star ratings as well. This is a healthy distribution and suggests that a 5-point scale is good for this community.

But why were Autos Custom's ratings so very different from Shopping, Local, Movies, and Travel?

The biggest difference is most likely that Autos Custom users were rating each other's content. The other sites had users evaluating static, unchanging or feed-based content in which they don't have a vested interest.

In fact, if you look at the curves for Shopping and Local, they are practically identical and have the flattest J-hook, giving the lowest share of 1-stars. This is a direct result of the overwhelming use pattern for those sites: users come to find a great place to eat or a vacuum to buy. They search, and the results with the highest ratings appear first. If a user has experienced that object, they may well also rate it - if it is easy to do so - and most likely will give 5 stars (see the section called "First Mover Effects"). If they see an object that isn't rated but that they like, they may also rate and/or review it, usually giving 5 stars - otherwise why bother - so that others may share in their discovery. People don't think that mediocre objects are worth the bother of seeking out and creating internet ratings for. So the curves are the direct result of the product design intersecting with the users' goals.

This pattern - I'm looking for good things, so I'll help others find good things - is a prevalent form of ratings bias. An even stronger example happens when users are asked to rate episodes of TV shows: every episode is rated 4.5 stars plus or minus 0.5 stars, because only the fans bother to rate the episodes, and no fan is ever going to rate an episode below a 3. Look at any popular running TV show on Yahoo! TV or [another site].

Looking more closely at how Autos Custom ratings worked and what content was being evaluated showed why 1-star ratings were given out so often: users were providing feedback to other users in order to get them to change their behavior. Specifically, you would get one star if you 1) didn't upload a picture of your ride, or 2) uploaded a dealer stock photo of your ride. The site is Autos Custom, after all! The 5-star ratings were reserved for the best of the best. Two through four stars were actually used to evaluate the quality and completeness of the car's profile. Unlike on all the other sites graphed here, the 5-star scale truly represented a broad range of sentiment, and people worked to improve their scores.

There is one ratings curve not shown here, the U-curve, where 1 and 5 stars are disproportionately selected. Some highly controversial objects on Amazon see this rating curve. Yahoo!'s now-defunct personal music service also saw this kind of curve when introducing new music to established users: 1 star came to mean "Never play this song again" and 5 meant "More like this one, please." If you are seeing U-curves, consider that 1) users are telling you something other than what you wanted to measure is important, and/or 2) you might need a different rating scale.

August 05, 2009

Polish & Predictability

Just a quick note. One development that Randy and I are excited about: we've solicited the help of a fantastic copy-editor, Cate de Heer, to provide a third set of eyes on our draft chapters. Cate's help is, of course, in addition to that of our superb O'Reilly Editor Mary Treseler.

We're hoping that early and ongoing copy improvements will help immensely in the latter stages of the book's development (which are rapidly approaching)—by the time our technical reviewers start their reviews, hopefully they won't be distracted by our errant commas (and my overabundant exclamation points!) Cate has already delivered revisions to Chapter 1, and we published them wholesale on the wiki, so please do check them out.

(And, if you're curious to see what a difference a thoughtful copy-edit can make to your writing, you can compare the current version with any version preceding the 'Copy Edits' checkin.)

Also, we're making an earnest attempt to be more regular with our blog-publishing schedule. So, starting next Wednesday, we'll be posting at least one meaty-sized essay on reputation matters every week. We're calling it Reputation Wednesday. A small, regular event mostly designed to keep Randy and me honest, and get us off our butts to push some of the thinking that we're putting into the book out there for conversation.

We hope that it will make some of the concepts more accessible to folks that may not have time to dive into the wiki. I'll also point out that—if you haven't done so already—now would be a wonderful time to subscribe to the feed for this blog. We promise to fill it up. I swear.