Why may Google textmine but Scientists may not?

I recently posted about why Google is not a good enough solution for searching the academic literature (because you can’t build on the results!  Read the comments on that post for more).

It is sad indeed, then, that PMC and Publishers forbid scientists and others from spidering/indexing/mining their content… while giving Google the privilege to do exactly this.

Check out the robots.txt file for PMC, look at the rules for /pmc/articles/, and notice that Googlebot is allowed, Bing and a few others are allowed, but User-Agent: * (the rest of us) is not.  The same is true for the ScienceDirect robots.txt:  Google may textmine everything; experimenting scientists, nothing.  (Hat tip to Alf Eaton on Twitter.)
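If you want to see how a rule set like this plays out for different crawlers, here is a minimal sketch using Python's standard urllib.robotparser.  The robots.txt text below is a made-up illustration of the pattern described above, not a copy of the real PMC or ScienceDirect file, and the example.org host and the "MyResearchBot" agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: one named crawler is allowed into
# /pmc/articles/, everyone else ("*") is shut out.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /pmc/articles/

User-agent: *
Disallow: /pmc/articles/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = SAMPLE_ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/pmc/articles/PMC123/"  # hypothetical URL
    for agent in ("Googlebot", "MyResearchBot"):
        print(agent, "allowed:", is_allowed(agent, url))
```

To check a live site instead, RobotFileParser also has set_url() and read(), which fetch the real file before you call can_fetch().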

Is this defensible on the grounds that Google knows what it is doing but The Rest Of Us Can Not Be Trusted?  I sure hope not.  Scientists are routinely trusted with a lot more than writing a script that won’t bring down a server.  There are other ways to ensure someone won’t bring down a server than a global robots.txt ban.
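For what it's worth, the "won't bring down a server" part is not hard.  Here is a rough sketch of one common courtesy, a fixed minimum pause between requests.  The class name, the interval, and the injectable clock and sleep arguments are my own illustrative choices, not anything PMC or a publisher specifies.

```python
import time


class PoliteThrottle:
    """Enforce a minimum number of seconds between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Pause if the last request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

A script would call wait() right before each fetch, for example with PoliteThrottle(1.0) to stay at or below one request per second.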

Perhaps a ban is the only way to prevent unauthorized redistribution of large numbers of papers gathered via spidering?  Nope.  Require people to register.  Monitor use.  Clearly state what may be redistributed, what may not, and what actions will be taken if people behave badly.

Maybe they are just waiting till Scientist-initiated indexing projects get Big and Important and Ask Nicely, and then they will write them in as allowed users.  Maybe.  But restricting play and experimentation is a pretty poor way to bring about that future, and we should not accept this as the default behaviour of the keepers of our scientific literature.

PMC calls its prohibition against bulk downloading a “copyright” issue.  That doesn’t make any sense to me.  Sounds much more like a Terms of Use issue than a copyright issue.  Am I wrong?  If so, educate me in the comments.  If I’m right, then I think we should ask PMC to change its wording because calling this a copyright issue just muddies already muddy waters.

It does appear to be, at least in part, a contract issue.  In the contract between publishers and PMC (http://t.co/EhZP5SrS1i point 16, ht again to Alf Eaton), PMC volunteers in its terms that PMC will prohibit bulk downloading.  Why does PMC include this sentence?  Is it part of the NIH Public Access law that PMC has to include this sentence?  If not, isn’t it capitulating an awful lot to publishers… basically undermining the ability of scientists to build enhanced searching tools, etc.?

(And how, given this, does Google get access?  Don’t get me wrong.  I think Google is fantastic!  I want Google to keep having access!  I just want all responsible systems to have the possibility of the same access to our publicly funded and hosted research, so that someone will build infrastructure that properly supports research and research tools.)

Anyway, these spidering policies strike me as unfair, and something that people should be talking about and complaining about and doing something about, especially as we start to craft new policies for how people and computers can access our Public Access research output under the new OSTP policy.