spamming google scholar, artificially inflating link counts

A fascinating, clear, and concisely written article; thanks to Peter Murray for the reference.

http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0013.305

Academic Search Engine Spam and Google Scholar’s Resilience Against it

Joeran Beel and Bela Gipp

Journal of Electronic Publishing

Volume 13, Issue 3, December 2010

DOI: 10.3998/3336451.0013.305

In a previous paper we provided guidelines for scholars on optimizing research articles for academic search engines such as Google Scholar. Feedback in the academic community to these guidelines was diverse. Some were concerned researchers could use our guidelines to manipulate rankings of scientific articles and promote what we call ‘academic search engine spam’. To find out whether these concerns are justified, we conducted several tests on Google Scholar. The results show that academic search engine spam is indeed—and with little effort—possible: We increased rankings of academic articles on Google Scholar by manipulating their citation counts; Google Scholar indexed invisible text we added to some articles, making papers appear for keyword searches the articles were not relevant for; Google Scholar indexed some nonsensical articles we randomly created with the paper generator SciGen; and Google Scholar linked to manipulated versions of research papers that contained a Viagra advertisement. At the end of this paper, we discuss whether academic search engine spam could become a serious threat to Web-based academic search engines.

Gaming citation counts

For me, the most interesting and frightening technique in that article is the ease of inflating citation counts for papers (and thus for journals as a whole) on Google Scholar.

The for-pay reference-graphing services like Web of Science and Scopus are presumably less susceptible, because they don’t (I think; not yet, anyway) take papers (and citation data scraped from papers) from the open web as a whole, but only from certain trusted sources.

There might still be a way to artificially inject citations into those systems; it would be interesting to try a similar experiment. One obvious way is the kind of publisher-encouraged gaming the article above mentions too: publishers that encourage their authors to cite as many articles from the same publisher as possible in the newly published article. (Something that strikes me as deeply unethical publisher behavior, but also probably commonplace.) Of course, most academic authors do this unprompted and intuitively for their own previously published articles, and it’s not considered unethical, but rather standard practice. (To some extent this raises questions about exactly how we tell the difference between ‘gaming’ citation counts and now-standard academic practices!)

The for-pay reference-graphing services try to separate out “self-citations” for this purpose. Presumably they could also separate out wider categories of “self-citation”: citations from an article to another article in the same journal, even by a different author, or even to an article in another journal by the same publisher. This latter one, while limiting the effect of publisher-encouraged ‘artificial’ citation counts, would also catch a whole bunch of ‘legitimate’ citations which presumably do tell you something about the popularity of an article.
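To make those widening categories concrete, here is a rough sketch in Python of how a citation might be bucketed as author, journal, or publisher “self-citation”. Everything in it is my own illustration: the `Paper` record, the `citation_kind` and `count_by_kind` functions, and the sample data are all hypothetical, and I have no idea how Web of Science or Scopus actually do this. It also assumes author names, journals, and publishers are already cleanly normalized, which is the hard part in real life.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical, already-normalized citation record (an assumption)."""
    title: str
    authors: frozenset
    journal: str
    publisher: str


def citation_kind(citing, cited):
    """Return the narrowest 'self-citation' category that applies."""
    if citing.authors & cited.authors:       # at least one shared author
        return "author self-citation"
    if citing.journal == cited.journal:      # same journal, different authors
        return "journal self-citation"
    if citing.publisher == cited.publisher:  # same publisher, other journal
        return "publisher self-citation"
    return "independent"


def count_by_kind(cited, citing_papers):
    """Tally the citations to `cited` by category."""
    counts = {}
    for paper in citing_papers:
        kind = citation_kind(paper, cited)
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Made-up sample data, purely for illustration.
cited = Paper("Target", frozenset({"Smith"}), "Journal A", "Publisher X")
citing = [
    Paper("P1", frozenset({"Smith", "Lee"}), "Journal B", "Publisher Y"),
    Paper("P2", frozenset({"Jones"}), "Journal A", "Publisher X"),
    Paper("P3", frozenset({"Kim"}), "Journal C", "Publisher X"),
    Paper("P4", frozenset({"Garcia"}), "Journal D", "Publisher Z"),
]
print(count_by_kind(cited, citing))
```

Reporting the counts per bucket, rather than just throwing out everything that isn’t “independent”, leaves it up to the reader how much weight to give the same-publisher citations, which is exactly the judgment call described above.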

If the for-pay reference-graphing services increase their attempts to spider content from the open web (as I’d expect they’re already working on if they’re smart), they will of course be subject to the exact same attacks as Google Scholar. It takes fancy and clever (read: expensive) algorithms to try to defend against that on the open web. Google puts lots of resources into developing such algorithms for its standard web search (more than the for-pay citation-graphing services likely have to put in), but apparently not for Scholar. (Even for Google, resources are not limitless, and there’s probably not that much money in Scholar for them at present.)

Publication venues and other citation details

Another, separate finding worth noting is the ease of faking the publication venue on Google Scholar, whether for a real article published elsewhere or for an entirely fake one. Definitely don’t trust citation details from Google Scholar results alone.

But that also means don’t trust citation information found in a PDF on the web… it could potentially be a fake PDF, right? It might not have been published where it says it was.

But most researchers probably do this all the time these days. You find a useful article on the web, how are you going to cite it except as what it claims itself to be?  Where are you going to check this that is going to be both reliable and not a seriously inconvenient time sucker?

And “useful” often means “useful as another reference to add to my literature search or other section so I have more references, cause the more the better my paper looks.” I think it’s safe to say that in many cases the researcher citing a paper hasn’t even read it.

And with sloppy citation possibly already the norm…

In my library graduate student days, at one point I found a paper that had been cited in three separate papers as “proceedings” of a conference, that looked like it would be just the thing for my research, but I couldn’t seem to track down these proceedings anywhere. Eventually I tracked down the original author of this mystery paper by email, and they told me, yeah, that was an oral conference presentation they had done, but it was never written down, let alone published in a proceedings. I am reasonably confident that at least two of the authors I found citing this mystery citation weren’t even at the conference where it was presented; they had just copied and pasted the citation from somewhere else, to add to their own reference count.

At the time I found that somewhat shocking, but I actually suspect that copying and pasting a reference from someone else’s paper into your own, without tracking down and reading the article cited, is probably pretty common. (I have run into non-conclusive but suggestive evidence of it several more times since then, including, sadly, in editing articles for the c4lj.) Even more common would be having the article in front of you, but not bothering to read it, just adding it to your reference list to bulk it up.

But I bring up this tangent because it makes me think that if someone does manage to put a fake (or incorrectly, perhaps accidentally, cited) paper in Google Scholar, it’s not unlikely to get cited at some point if its title seems useful for someone’s reference list, and once cited it’s likely to be copied and pasted into other articles too.

Once you start to look too carefully, the whole academic publishing endeavor can start to seem like a somewhat arbitrary game played by agreed-upon rules in order to justify tenure decisions, rather than an attempt to share knowledge with one’s peers or the world in general. In this light, though, the possibility of gaming Google Scholar is perhaps less alarming, as it’s really just business as usual.


5 Responses to spamming google scholar, artificially inflating link counts

  1. Mark says:

    Jonathan, you might be interested in this article: http://hdl.handle.net/2142/1697

    It is a sort of historical exegesis of a highly cited article that never existed.

  2. jrochkind says:

    Thanks Mark, that paper is fascinating, in relation to several different themes I touched upon above: mainly the issue of “phantom citations”, but also the history of computational Information Retrieval in general, and a reminder that the roots of automated ‘citation graphing’ of the kind Google Scholar does actually lie in that field.

    Also, following your link with these topics in mind gave me a new appreciation for the institutional repository (IR) “splash page”. Previously I had thought of it only as an annoyance (“just get me to the PDF, why do you make me stop at a metadata splash page?”), but this time I noticed that the PDF itself has NO provenance or citation detail on it at all; it doesn’t even tell me what journal it was published in. However, the splash page tells me it is “In Library Trends 52(4) Spring 2004: 748-764.”, as well as giving me the identity of the entity that makes this provenance claim: The Illinois Digital Environment for Access to Learning and Scholarship, presumably a university-sponsored IR which one hopes is somewhat careful about such assertions, and at any rate has a ‘contact us’ link one could follow up with for questions of provenance.

  3. Pingback: blog.ecorrado.us » Acadmic Search Engine Spamming

  4. Pingback: Impact Challenge Day 3: Create a Google Scholar Profile - Impactstory blog

  5. Pingback: Day 6: Create a Google Scholar profile – The FEDUA Research Impact Blog
