Saturday, November 17, 2007

Two Databases Walk into a Bar . . .

Yesterday I introduced Virgil Griffith, self-described "disruptive technologist," at a CalIT2-sponsored talk, "Wikiscanner: My Summer of Dilettante Data-Mining or Making a Corporation-Sized Cannon and Letting the Internet Decide Where to Point It."

Griffith opened his talk by explaining the workings of his WikiScanner, which provides a simple online form to uncover the identities of anonymous Wikipedia editors. As Virtualpolitik readers may recall, this issue about the fallibility of the giant online encyclopedia first came to light when the IP addresses of congressional staffers were banned from Wikipedia in early 2006 for manipulating their public record.

Griffith described a relatively simple three-step method for creating WikiScanner: 1) Download all of Wikipedia, 2) Purchase a database with the information about which organizations own which IP addresses, and 3) Merge them together.
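For readers who want to see what that third step might look like, here is a minimal sketch in Python. It is not Griffith's code: the sample rows, the function names, and the assumption that the ownership database is a list of non-overlapping (first IP, last IP, organization) ranges are all mine, and the addresses come from blocks reserved for documentation.

```python
import bisect
import ipaddress
from collections import defaultdict

# Hypothetical stand-in for the purchased database: (first IP, last IP, owner).
# Assumes the ranges do not overlap.
IP_RANGES = [
    ("192.0.2.0", "192.0.2.255", "Example Corp"),
    ("198.51.100.0", "198.51.100.127", "Example Agency"),
]

# Hypothetical stand-in for anonymous edits pulled from a Wikipedia dump:
# (article title, IP address recorded for the edit).
ANONYMOUS_EDITS = [
    ("Some Article", "192.0.2.17"),
    ("Another Article", "198.51.100.5"),
    ("Some Article", "203.0.113.9"),
    ("Third Article", "192.0.2.200"),
]


def build_index(ranges):
    """Convert the ranges to integers and sort them by starting address."""
    rows = sorted(
        (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), org)
        for first, last, org in ranges
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup_owner(ip, index):
    """Return the organization whose range contains ip, or None."""
    starts, rows = index
    value = int(ipaddress.ip_address(ip))
    # Find the last range that starts at or before this address.
    pos = bisect.bisect_right(starts, value) - 1
    if pos >= 0 and rows[pos][0] <= value <= rows[pos][1]:
        return rows[pos][2]
    return None


def merge_edits_with_owners(edits, ranges):
    """Group edited article titles by the organization that owns the editing IP."""
    index = build_index(ranges)
    by_org = defaultdict(list)
    for title, ip in edits:
        org = lookup_owner(ip, index)
        if org is not None:
            by_org[org].append(title)
    return dict(by_org)


if __name__ == "__main__":
    for org, titles in merge_edits_with_owners(ANONYMOUS_EDITS, IP_RANGES).items():
        print(org, titles)
```

Sorting the ranges once and using a binary search for each edit keeps the lookup cheap, which matters when one side of the merge has millions of rows. Edits from addresses that fall in no known range are simply dropped.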

In Griffith's retelling, the costs associated with this project turned out to be minimal. The first step was free and involved about 21% of Wikipedia. The second step would normally have cost about $1000 to purchase a database from a private corporation, but Griffith was able to get the database from a sales rep gratis in about two hours when he promised to put the company's logo on his WikiScanner website. However, the company that provided the information about the IP addresses of the 2,668,905 different organizations soon asked for its logo to be removed when its bid at free advertising ended up alienating many of its own clients who considered themselves WikiScanner victims. When the two databases were merged, Griffith found 187,529 different organizations that could be traced to at least one Wikipedia edit, including the CIA and the Vatican.

Griffith said that at first he thought of the WikiScanner results as his personal "basket of evil," which he could delve into at will to fling damning evidence at those he resented for injustices or willful public stupidity like the Iraq War. But Griffith decided to crowdsource the results for a number of different reasons. For example, he pointed out that the resulting database of anonymous edits was huge, and much of the information was outside the areas of his technical expertise. How would he be able to evaluate details about Pfizer or Amgen? He also wanted to avoid any possible legal liability if an offended party claimed that it had been intentionally targeted for malicious defamation. Griffith has been understandably litigation-averse since he reached a settlement with the Blackboard company, after submission of his first freshman year paper "caused him to get sued under the Sedition and Espionage Act," as he puts it. Furthermore, Griffith described media people as "lazy" and averse to "original research," and so he was pleased when Wired magazine helped him set up a Reddit-style ranking system that would produce easy-to-digest top ten lists. He showed how the current list included popular corporate villains like Diebold, Dow, ExxonMobil, and ChevronTexaco in what Griffith describes as an index that lets you see "who the Internet doesn't like."

Griffith shared his own personal favorites among the unearthed edits, which included CIA additions and subtractions that ranged from minutiae about matters nerdy and obscure (on "Light Saber Combat" styles) to entire memoirs about black ops (on "Black September" in Jordan). He pointed out that WikiScanner has even been involved in a FOIA suit over the entries apparently made by Arkansas state employees on behalf of presidential candidate Mike Huckabee in violation of election rules. According to Griffith, among the salacious cases uncovered when WikiScanner was used by the media, the editing out of personal connections to a drug baron by a Dutch princess was probably the most easily solved, since the royal house of the Netherlands confessed to the sham immediately. Finally, as a rhetorician, I loved the fact that "A Controversial Speaker" was changed to "A Voice for the Farmer" in the case of elected representative Conrad Burns.

Griffith said that the experience of reviewing results had helped him realize that even seemingly monolithic organizations like the Republican headquarters were really characterized by the random and idiosyncratic sentiments of different individuals, an argument that I have also made about the heterogeneity of rhetorical approaches of message makers even on official government websites.

WikiScanner also allowed Griffith to explore some cross-cultural comparisons: 19% of edits to the German Wikipedia and 17% of edits to the French Wikipedia are anonymous, but a whopping 40% of the edits to the Japanese Wikipedia are.

He is currently planning for the launch of WikiScanner 2.0 with input and support from the Wikimedia Foundation. Features would include new forms of surveillance for vote stacking, time overlaps, geographical location, vanity checking, and linkspam by screening for pages tagged with Google ads. (Of course, with so many blogs carrying advertising, this AdSense detector may discredit some legitimate sources of information.)

Griffith credited Wikipedia's own editing procedures with remedying many of the worst PR edits. For example, when Wal-Mart changed an entry that pointed out that wages at Wal-Mart were 20% lower than most retail stores to an assertion that the company paid double the minimum wage, Wikipedia editors were able to recognize both statements as true and construct an appropriate "although" clause to accommodate the edits of both parties.

From his case study of WikiScanner, Griffith moved into the evangelism portion of his talk, which could be reduced to a slide that read: "Amateur Data mining is fun." His list of "Tools to get you started" included standard favorites among programmers such as Python, Ruby, and MySQL. He also plugged GATE, the General Architecture for Text Engineering, which provides toolkits for text mining; other text similarity tools for grappling with strings of text for which there are only approximate rather than exact matches; and a suite of Dutch data-mining tools that could be of use to hacktivists elsewhere. As Griffith said, "every hard programming thing is already done" by someone else.
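To give a flavor of what approximate matching looks like in practice, here is a small sketch using difflib from Python's standard library. It is my own illustration rather than one of the tools Griffith named, and the organization list, the function name, and the 0.6 cutoff are all hypothetical choices.

```python
import difflib

# Hypothetical list of canonical organization names to match against.
KNOWN_ORGS = [
    "Central Intelligence Agency",
    "Wal-Mart Stores",
    "ExxonMobil Corporation",
]


def similarity(a, b):
    """Return a case-insensitive similarity score between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_org(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if nothing clears the cutoff."""
    # Compare lowercased strings, but hand back the original spelling.
    lowered = {candidate.lower(): candidate for candidate in candidates}
    matches = difflib.get_close_matches(
        name.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


if __name__ == "__main__":
    print(closest_org("Walmart Stores Inc", KNOWN_ORGS))
    print(closest_org("zzzz", KNOWN_ORGS))
```

The cutoff is the knob to watch: set it too low and unrelated names start matching, too high and ordinary spelling variants are missed.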

He also argued that there were many interesting data sources that were currently being underutilized for mash-up purposes. For example, he showed his own entry from the Notable Names Database along with Adolf Hitler's to illustrate the range of personalities and historical actors represented. He pointed out how this "IMDB for famous people" even revealed him to be a stutterer, which Griffith managed to suppress during his presentation, although he was obviously flustered by being almost a half-hour late.

He closed with some samples of cool mash-ups done by others. Given my line of work in public communication, I thought the dynamic text of this diagram of a speech by Alberto Gonzales was particularly good (best viewed in Safari), but there were also visualizations of other speeches on the same site by dissimulating orators such as Barry Bonds. He also suggested that closed captioning would provide a rich source of text for data mining, although only coverage of political speeches, proceedings, and events on C-SPAN would be free of copyright restrictions.

Griffith showed topographical mash-ups as well. In the case of Planet Sony, Dan Kaminsky provides a geographical representation of where infestations of Sony BMG's rootkit have occurred. The project eventually raised the consciousness of the Department of Defense, once it realized that it too had been hobbled by digital rights management technology, spurring what Griffith called a "Clash of the Titans" between the forces of militarism and those of intellectual property ownership. As an example of risk communication, Nation under Siege combines Google mapping information with elevation data from the USGS to show what a mere five meters of added sea level would mean for urban communities, including my own coastal city of Santa Monica.

Griffith faced some tough questions from the crowd, particularly about China, where the authorities could use the same data-mining techniques to root out political dissidents. Griffith argued that there was a kind of cost-benefit analysis that goes with making things public. To justify his utilitarian calculations he compared Yahoo's China policy to Google's. During this part of the discussion, someone pointed out that Tor had not been disabled in China as a way to conduct subversive Wikipedia edits although it had been barred in the U.S. after extended periods of abuse by American editors.

(Slides from the talk are here.)
