Discussion 1: Richard Poynder, 24th Aug 2010

Duration: 1 hour 36 mins 57 secs


About this item
Description: Richard is a freelance journalist who writes about information technology, telecommunications and intellectual property. He has particular interests in Open Access, e-Science, e-Research, copyright, patents, the Open Source and Free Software movements. He is the author of The Basement Interviews http://www.richardpoynder.co.uk/The%20Basement%20Interviews.htm. Other participants in the discussion were Peter Murray-Rust, Jordan Hatcher, Rufus Pollock, and Alberto Sicilia.
 
Created: 2011-01-31 10:43
Collection: Panton Discussions
Publisher: University of Cambridge
Copyright: University of Cambridge
Language: eng (English)
Distribution: World     (downloadable)
Credits:
Person:  Richard Poynder
Person:  Peter Murray-Rust
Person:  Jordan Hatcher
Person:  Rufus Pollock
Person:  Alberto Sicilia
Explicit content: No
Transcript:
Panton Discussion with Richard Poynder
The Panton Arms pub, Panton Street, Cambridge UK
24th August 2010
Present:
• Richard Poynder (richard.poynder@btinternet.com )
• Jordan Hatcher (jordan@opencontentlawyer.com )
• Peter Murray-Rust (pm286@cam.ac.uk )
• Rufus Pollock (rufus.pollock@okfn.org )
• Alberto Sicilia (as2050@cam.ac.uk )
Q1) What is Open Data, why is it important, what problems does it fix?
Richard
So, what is Open Data, why is it important, and what are the problems that it’s seeking to fix?
Jordan
Right, so Open Data is the application of the same sort of principles of Open Source software licensing and Open Content licensing, like Creative Commons to data in databases. It’s important because databases, and data more broadly, isn’t a rights-free area, so it’s not this area where magically IP rights just don’t apply whatsoever and people can do whatever they want, it is an area where legal rights do attach, and so from my perspective as a lawyer, it’s important to have licensing tools to make sure that the data and the databases are covered in a way that you want them to be covered. So, in order for it to actually be open, you have to remove the potential stumbling block of the licence questions…
Richard
You’re talking about automatic rights that apply once the data is created, is that what you mean?
Jordan
Yes
Richard
In the way that copyright precedes the expression of …
Jordan
Yes, so intellectual property rights more broadly are separated into those that you have to go and ask registration for, so patents and trademarks being the key ones there, and ones that arise automatically when something is created that qualifies for that right. So copyright and database rights would fall into that, as well. Contract is a very important third area, it’s not an IP right, it just depends on what terms you put up on your website.
Richard
Right, okay
PMR
Can I answer your question about the problem it needs to fix? Now, I’m going to concentrate on science here, I’m probably going to pull us back into science on a regular basis, because Open Data over all fields of endeavour is an enormous area. So in terms of science, the problem is that we now have automated methods of extracting data, analysing, processing it and so on, the most well-known are in Biosciences, Astrophysics, things of that sort. But, in principle, all recorded science can be computed technically and we want to use the power of machines to do it. Now, wherever we run up against ambiguity of right-of-access or right-of-re-use, then we have a problem. Sometimes we have the problem that it’s non-trivial to access it behind a pay-wall, behind a register-wall, behind all sorts of things like this; sometimes we have the thing that if we re-use this, and re-distribute it, then some “Owner” of the data – this is a very fluid area, who owns what and who has the right to control it –will actually say “You can’t do that”. So there are well-documented instances of people who have used scientific data and they have been requested by some owner to take them down and so forth. We’ve got to get to a situation where that is no longer a problem, that it is automatically solved and that you know where you have the right to do this, or not.
Richard
And machine-readable rights, in other words?
Peter
Of course
Richard
Okay. So that’s the problem Open Data needs to fix. Okay.
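[Editor's illustration: the "machine-readable rights" idea above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the record layout and the "license" field name are hypothetical and not something the participants describe; the two URLs are the published addresses of the PDDL 1.0 and CC0 1.0 texts.]

```python
# Hypothetical sketch: decide automatically whether a data record may be reused.
# The record layout and the "license" key are assumptions made for illustration.

PUBLIC_DOMAIN_TOOLS = {
    "http://opendatacommons.org/licenses/pddl/1.0/": "PDDL 1.0",
    "http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
}


def reuse_status(record):
    """Return a short status string for a metadata record (a dict)."""
    licence = record.get("license")
    if licence is None:
        # No statement at all: the ambiguity described in the discussion.
        return "unknown"
    if licence in PUBLIC_DOMAIN_TOOLS:
        return "public domain (" + PUBLIC_DOMAIN_TOOLS[licence] + ")"
    # Some other licence: a person (or further rules) must read the terms.
    return "check licence terms"


labelled = {"title": "crystal structure",
            "license": "http://opendatacommons.org/licenses/pddl/1.0/"}
unlabelled = {"title": "spectrum"}

print(reuse_status(labelled))
print(reuse_status(unlabelled))
```

[The point of the sketch is that a program can only answer the question when the rights statement is attached to the data in a form it can look up; an unlabelled record stays "unknown".]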
Q2) What are PDDL, CC0, how do they work and differ?
Richard
So, if we then came onto the PDDL, the PDDC, the Creative Commons Zero, if I were a scientist approaching this on the web, finding out about these, I might be a bit confused as to how they differed, and what I needed to do about them.
Jordan
Yes; so, I guess to start, the Public Domain Dedication and Licence, the PDDL, which is one of the Open Data Commons tools, and CC0, which is one of the Creative Commons tools, are just two of several open data legal tools that are out there, and the feature about both of those is that they are tools to place work in the public domain. So there are a few people out there in the world, organisations, that do not sort-of qualify for IP rights; the US government for example, is prohibited, this is a familiar example, that they don’t have copyright for works they create; now, there’s a slight wrinkle in how that actually plays out, because they may not have copyright within the United States, but they do outside of the United States. And here in the UK, and sort-of I think in every Commonwealth jurisdiction and others that sort-of come from the UK, the British legal tradition, tends to have Crown Copyright, so the starting point for governments, if you’re talking about government data, is that they do have copyright over works that they produce. They do have database rights, so the starting position is that a lot of people will have rights, and there potentially are rights in the databases in everyday life. And so you need a tool to give those rights up and place stuff in the public domain, and these are two tools that do that.
Richard
And why can’t you just put it in the public domain, why is it important to use one of these?
Jordan
Right, so, well, I turn around and ask you, “How would you put it into the public domain?”
Richard
Because copyright automatically applies.
Jordan
Right, so you’ve automatically got it, so you can’t renounce having it in the first place. You’ve got it automatically by virtue of creating something that is potentially copyrightable, there are questions, there are open questions on whether there is a copyright, or what level copyright does attach, but, it’s still an open question
Richard
So the important thing is to make a statement then, is it, that we do not want any rights in this?
Jordan
It is, absolutely, and so, the way that that works is, both sets of tools, my tool but the PDDL is much more explicit about it, work in two ways. One is that they have this public domain dedication, so it says right there in the agreement, I’m totally giving up all my rights and placing this in the public domain. It turns out that, maybe in some jurisdictions, you can’t actually give those rights up, and so what you do is you have a back-up licence, that the licence bit, which says “I’m licensing you” – which is different from giving up your rights totally, it says I’m retaining my rights but I’m giving you permission, and it says I’m licensing you to the fullest extent possible, you can do whatever you want, which is the most liberal licence imaginable.
Richard
And the difference between the PDDL and the PDDC is…?
Jordan
I have to confess I don’t know what the PDDC is…
Richard
I think it’s a Creative Commons one, isn’t it… that might have been superseded by CC0…
Jordan
Ah, so I think what you’re talking about is that Creative Commons currently has two tools now, and they’re trying to break it down into CC0 being “I own the rights for something, and I’m giving them up and placing it in the public domain”, and this other one, which is the certification one, is where you don’t own the rights to it, but you know that something is in the public domain, so you’re putting your stamp saying “This isn’t my work”, so say it’s something, say it’s some text of Shakespeare, or say it’s a photograph from 1880, something that’s clearly in the public domain, you can say “It’s not mine, but I know this in the public domain, here’s the stamp that it’s in the public domain.”
Richard
Right, okay
PMR
And from the scientist’s point of view, we hope that the differences are minimal. I mean, you know, I write a lot of software, so I know that there are two main types, the copy-left, the GPL and things of that sort, and then there’s all the rest, which are compliant with the OSI; now within that, recently, I’ve had to make detailed distinctions between various licences because of their incompatibility, so we’ve had a question about do we go from LGPL to BSD because the LGPL puts slightly more restrictions and requirements on what can be done, but I would say that at a first level, the average author of software doesn’t need to worry about that, if you get into the minutiae then it really can become quite complex and I hope that this doesn’t get to the stage where we have non-interoperable data licences at a practical level.
Richard
Right
Jordan
I think that’s a really important point. The PDDL, CC0, these are two different tools, just like you have the MPL, the Mozilla Public Licence, and the GPL, but the difference is that because the PDDL and CC0 place things in the public domain, they are totally and completely interoperable with each other, and they’re totally and completely interoperable with anything else, sort of, in the world, because there are no restrictions with it, so you…
Richard
And there’s no conflict at all between them?
Jordan
There is no conflict between them
Richard
Okay, yes
Q3) Science Commons, Open Knowledge/Data definition – what are they, how are they used?
Richard
So, we have the science commons protocol for implementing open access data, and we have the open knowledge data definition, both quite long texts; what’s the story behind those and are they complementary or…?
Jordan
The open definition defines the rights that you must have in order for something to be open, so that applies across wider than data, it’s for knowledge in general, and it’s based on OSI, its open source definition, so it has a heritage and it has some established principles behind it. So it’s a bit wider in terms of what it means, I mean I can take you through the specifics of it…
Richard
So it’s a sort of background conversation or statement that sits behind these PDDL and the Creative Commons definitions? Statements of intent, and so on?
Jordan
Well, the open definition, so a lot of people talk about openness, but they use it in very different ways, and so one of the things is, by the open definition, how we the Open Knowledge Foundation has defined openness, non-commercial restrictions, and restrictions that prohibit derivatives, aren’t open, so if you just look at this Creative Commons set of licences, there are only two out of their six main licences that are Open licences as we define it. It’s not quite that narrow, it’s that we’re a small group, and we said well, these are our terms.
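[Editor's illustration: the "two out of their six main licences" remark can be checked mechanically. The following Python sketch assumes the usual short codes for the six main Creative Commons licences and encodes only the rule stated in the discussion: attribution (BY) and share-alike (SA) conditions are acceptable, while non-commercial (NC) and no-derivatives (ND) conditions are not. The function name is hypothetical.]

```python
# Sketch of the rule described above: a licence counts as "open" only if
# every condition it imposes is attribution (BY) or share-alike (SA).

PERMITTED_CONDITIONS = {"BY", "SA"}

SIX_MAIN_LICENCES = [
    "CC-BY", "CC-BY-SA", "CC-BY-NC",
    "CC-BY-ND", "CC-BY-NC-SA", "CC-BY-NC-ND",
]


def is_open(licence_code):
    """Return True if the licence imposes only permitted conditions."""
    parts = licence_code.upper().split("-")
    conditions = {p for p in parts if p != "CC"}
    return conditions <= PERMITTED_CONDITIONS


open_licences = [code for code in SIX_MAIN_LICENCES if is_open(code)]
print(open_licences)
```

[Any code containing NC or ND fails the subset test, which leaves CC-BY and CC-BY-SA as the two open licences.]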
Richard
And that’s been an issue widely amongst the various Open movements, hasn’t it?
PMR
It has indeed.
Jordan
Yes. Creative Commons wasn’t set up primarily to support Open Access, it was set up to support a number of different licences to support creative works of all sorts. I would say that, and this may answer some of the later questions, that I find the Open Access movement very difficult until recently in not addressing this question up-front of the difference between Gratis and Libre, and I’m pleased to see that they did this about two years ago, but…
Richard
And partly as a result of your agitation…
PMR
Oh, well, thank you!
Richard
And I think you were the first one to really point out to the Open Access people that re-use was an issue, and an important issue, and I don’t think it’s really been… even though the Budapest pronouncement/declaration would seem to imply re-use.
PMR
Yes, one of the things that you find is that you cannot algorithmically interpret statements of intent as automatically covering what you need to do, so, you know, it’s like the American Constitution, that is an algorithm for government, but it has to be interpreted by skilled people all the way down, and we wanted to get away from that, so the idea that you just looked at a set of principles and you could work out on your own what was happening, that’s really… [that brings us on to the next question, why we wanted the Panton Principles…]
Q4) The Panton Principles – what do these provide that wasn’t available before?
PMR
…That brings us on to the next question, why we wanted the Panton Principles
Richard
Yes; so, what do they bring to the party that wasn’t there before?
PMR
Well, I observed, slightly from the side-lines, discussions, very in-depth discussions, between Jordan, John Wilbanks, Rufus and others, on what we needed to do for data, and I found this very complicated, and I felt that what we needed was a statement, a pragmatic statement that said that this is what you can actually do. Yes, it does rest on this, and Jordan is absolutely right, the laws don’t go away because it happens to be data, but we can make a statement that we will act in such a way as to minimize or nullify that effect in practice within a consenting community. So the Panton Principles are all about consenting communities, in other words, it’s saying that a group of people covering hopefully a wide range of activities and in this case in science have agreed that they will put their data into the public domain, that they will carry out the mechanical procedures as suggested by PDDL or CC0 or whatever, but that the practice of this, the intent of this is to make their data reusable by anybody for whatever legitimate purpose.
Richard
So the Panton Principles, one might say, is to win the heart of scientists, the PDDL/CC0 is to win the lawyers’ minds?
PMR
Brilliantly put! And that is where, if you like, any of the heated arguments come, you know because I’m likely to, you know, fuzz over the legal difficulties, so if you simply concentrate on the legal aspects, it’s very very complex, the complexity hasn’t gone away because of this but you’ve given us a point where we can sleep easily without having to worry about it.
Jordan
Yes, I mean the Panton Principles is quite important, in that it’s the statement by scientists and by people that look at Open Data about how they feel that scientific data should be treated.
Richard
So would it be the Open Data equivalent of the Budapest Declarations, was that the intent, do you think?
PMR
That was the intent. We needed something that we could all sign up to. I was also conscious of the fact that there was this, I would call it a schism, others might not, in the open access movement between on the one hand Stevan Harnad’s view that all that mattered was exposure to human eyeballs, and at the other end, a number of people who felt that guaranteed re-use of information without legal problems and contractual or organisational problems was essential. And we wanted to try and stop that happening in this area.
Jordan
I guess there are two things to pick up on. One is, coming back to this science commons protocol and open access data. So how that fits in, is what science commons is focussed narrowly on, the science community, so again compared to the open definition, open definition is just broadly around open knowledge, and the way that their protocol works, is that if you’ve written like an RFC with the internet standards, and what it does is it sets a standard for what is open access data in science, and it matches the Panton Principles, in that both call for a public domain approach, but the way that the Science Commons document is structured is sort of looking towards creating what’s called like a certification mark. And so it’s a seal of approval that whatever legal tool or method is used to place something in the public domain, actually it works, so that when the US government publishes stuff they don’t have to use the CC licence or PDDL, they may use something else that just says, “Hey, we’re the US government, we don’t have any copyright over our stuff – take it”. So Science Commons can come along and say, “Look at this document, this meets the standards that we’ve set out in the protocol” and thus qualifies for the Open Access data mark.
Richard
And there is also the Open Data logo, where does that fit into this, is it the same thing?
Jordan
You mean our sort-of web button?
Richard
Yes, which I assume is the certification mark, is it?
Jordan
It isn’t, no; it is a web-button, we don’t have a licence-chooser, like…
Richard
But it’s a trade-mark, is it?
PMR
It’s not a trade mark in the normal sense. What I see it as is a visible statement of intent by the author that this is to be regarded as Open Data.
Richard
But it might not be Open Data?
PMR
Well, it’s also recommended that that links through to one of the licences that says this is Open Data. So we have a site for crystallography, and on the front of that site, we’ve put that all the data in this should be regarded as being exposed in compliance with the PDDL. We choose PDDL because that’s the one that Jordan’s come up with. Then on every page of data, we have a little Open Data button which links through. So it’s a question, although that may or may not be a legal statement, it manages the statement of intent to be consistent with the Panton Principles. So if you like, it is a statement that it is consistent with the Panton Principles, although there is also a button for the Panton Principles as well.
Jordan
I mean, in general, doing IP management, the nuts and bolts of any business or any organisation handling their in-bound intellectual property stuff that they’re getting into the organisation, or out-bound what they’re doing, is just to be clear about what your intent is, and make it clear to the other people that you’re dealing with.
Richard
I suppose yes, that making a statement has legal standing anyway, hasn’t it?
Jordan
Ah, well, it could, I mean you’re asking me for… [fades away for effect…]
Richard
[laughs] Yes, well okay. [To Peter] Now you, a couple of years ago, longer I suppose, set up the Wikipedia page on Open Data?
PMR
Yes
Richard
And since then it’s grown into something else, hasn’t it?
PMR
The term has, the Wikipedia page hasn’t changed enormously, I mean I think there is a sort of general feeling that there is an authorship or something like this and people don’t like to go in and change it though they’re perfectly welcome to. What’s happened is that, about four years ago when I set up the page, open data was hardly used in the world, you know, you searched for open data and you would find it difficult to find very much, one or two statements from I think John Bozac made one, Tim Berners-Lee said something somewhere, I’d said one or two things, and now it’s absolutely everywhere and the fact is that “open” can mean a whole range of things, and “data” can mean a whole range of things, and it can apply to everything from the metadata that you unconsciously donate to search engines, to government data, to science and so on. So, I want to restrict the discussion as far as possible to science because science has got some unique aspects, one is that nobody has any intention of making any commercial gain out of doing the science and publishing the data, so you don’t do science for reward, right, it’s different from other creative activities. Secondly, there’s an absolute requirement in science that the experiments are reproducible or, more formally, falsifiable, a Popperian term, and therefore you need to have access to everything that the scientist had to do. So there are various pragmatic reasons why science is a special case, and also I think some moral reasons, and so we are now beginning also to see ethical reasons that a lot of science data relates to the well-being of the planet and people on this planet, so it relates to disease, climate, and all sorts of other things that really it is very questionable that it should be withheld from anyone on the planet for any reason.
Richard
Yes
Q4a) Climate data at UEA – what issues does that raise?
Richard
I had a question I might bring in at this point, and that is the climate data at the University of East Anglia, that clearly fits right in here, doesn’t it. What issues does that raise?
PMR
I’ve been looking at this, at the last open knowledge conference, OKCon, there was a presentation by a group who were looking into opening the computer code which was used in that, so that was one particular aspect. And as a result we set up a little group in the Open Knowledge Foundation to look at climate, and that’s ongoing. I did some of my own investigation, and what I found is actually that climate is very complex as a science, people will not agree on whether it’s just atmospherics, whether it involves oceanography, whether it involves ecology and things like that. So that we’ve seen public focus on a rather small area which is atmospheric climate research, and climate as a whole goes much beyond that. It’s clear that it is not at the moment reproducible what’s done, there are some technical difficulties which are that this involves getting quite complex datasets available to those who want to reproduce it, and even more difficult the protocols, it is very difficult to distribute a computational protocol, that’s objective, some of the people who have been criticising this, that you know you can’t just give all the software and all the methods, open, that is something that all scientists are going to have to face in the next ten years or so.
Richard
Of course all three of the reports – I think there were three reports on the incident – concluded that openness was an issue…
PMR
Absolutely, yes
Richard
… that the university should have been more open,…
PMR
I agree
Richard
… and there were all kinds of Freedom of Information requests made and a lot were turned down, so is the take-away point here that here is a classic case of how openness might have avoided a lot of the problems?
PMR
I think that’s probably true, there’s no a priori reason why the data should be restricted. I mean, there are fields of scientific endeavour where there need to be restrictions, human data is one, rare species is another, so there are clearly areas where one can say “No, it’s not appropriate” [unclear]. I don’t see any a priori reason why climate data shouldn’t be available to everybody from the basis of the climate research. Now, some of the issues have been… two issues I picked up were 1) that nobody outside the climate community was qualified to process it, and I would dispute this, I mean I’m a physical scientist, I understand principal components, I understand computational simulation, I’m not saying that I would make a good job of it, but there’s no reason why I shouldn’t be involved, and multi-disciplinary science is going to be important. The second reason I think why it was inappropriate – sorry, the second justification as to why it wasn’t open, was I think that people’s scientific careers depended on this, and we’ve seen this with other things linking to this. Now, what we’re seeing now is a major cultural challenge as to what scientific data is, who it belongs to, who has access to it and so on, and it is going to take time for us to work that through the system, it’s not going to be easy.
Jordan
I was just going to say absolutely; I’m not a scientist, but I’ve been working with scientists, and observing the community. It’s a big shift, because, I’ve been in academia, and that’s… the data for scientists is what enables you to publish papers that are how you advance in your career. It’s not the money that you make out of the data, it’s the journal articles that you get out of it, and so there’s both a cultural shift around when and how you get access to the data, and in finding answers in order for people to get what they need to get out of data.
Q4b) Does Open Data imply open formats?
Richard
What you say about protocols, if I understood it correctly, raises another issue, and that is “Does Open Data imply open formats?”
PMR
I think so, yes. In other words, if you put out something with a format that people don’t understand, then that’s really not open, you know just dumping your tables from something in binary form isn’t good enough.
Jordan
I think (I was just having a conversation around this!), I think that open formats are a best-practice for doing Open Data, but they’re not required to be Open Data, in the sense that if you’re legally open, and it’s in a proprietary format, then you’re free to shift it out of that format into an open format. So you don’t… I think that in part you don’t want to overly burden publishers as well, from saying “Oh well, you have to spend a bunch of time format-shifting everything into something that is open in order to get it out there and be compliant” when, if you, as Rufus says, give us the data raw, and give it us now, if you just put it in, in whatever form you have it, and make sure the legal rights attached to it do allow other people to come and do possibly digitally quite hard work in shifting it to an open format.
Richard
Would there be issues where they might not be able to do that because they didn’t have proprietary information?
PMR
Well, the problem arises from a number of things. Firstly, one of the things it’s exposing is that our practice in managing data isn’t as good as it needs to be. So in other words, most people manage their data with implicit metadata, they don’t necessarily put the scientific units, they don’t explain what this column means, it’s obvious either to their particular group of collaborators what this is, but it’s completely unintelligible to other people. So you’ve probably got again a spectrum of sharing where one type of person you share it with is somebody who is a close collaborator, so somebody says “Can I have your data?” and they say “Yes, well, we’ll instruct you how to use it, but you will therefore include us as authors on your papers” or something like that. Another is where you say, here, at the other end of the spectrum, here’s the data, we put it out we do our best endeavour to tell you what it is, but it’s up to you to find the resources and to do it and so forth. And we’re going to see all of that spectrum.
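[Editor's illustration: the contrast between implicit and explicit metadata can be made concrete. The Python sketch below is a hypothetical example, not a description of any tool used by the participants: each column of a data table carries a description and a unit, and a small check lists the columns a stranger could not interpret. The class, field and column names are all assumptions.]

```python
# Hypothetical sketch: make column metadata explicit and flag what is missing.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    name: str
    description: str = ""        # what the column means
    unit: Optional[str] = None   # scientific unit; use "1" if dimensionless


def undocumented(columns: List[Column]) -> List[str]:
    """Return names of columns lacking a description or a unit."""
    return [c.name for c in columns if not c.description or c.unit is None]


table = [
    Column("T", "Sample temperature", "K"),
    Column("a", "Unit-cell length a", "angstrom"),
    Column("col3"),  # obvious to close collaborators, opaque to everyone else
]

print(undocumented(table))
```

[A dataset released at the "best endeavour" end of the spectrum would aim for this check to return an empty list before publication.]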
Richard
Right.
PMR
There are some inescapable problems at the moment which is that if you use proprietary closed-source tools (as many scientists do) and there’s nothing wrong with this in principle, and they’re often linked to things like machines, so you buy a spectrometer, and the spectrometer puts it out in a format which can only be understood by buying a machine of that sort. That’s really quite a common type of thing and it applies to medical devices and so on. We need to get away from that, but you can’t retrospectively blame the scientist for having used that format because it’s the reasonable thing to do at that stage.
Richard
Okay
Q5) How many people have signed up to the Panton Principles?
Richard
How many people have signed up to the Panton Principles?
PMR
I think about a hundred, but really, it’s not very relevant; what’s more important is “How many people are practicing them?”, which means that “How many people are labelling things as Open Data in some form or other?”
Richard
Right.
Q7&8) Have Panton Principles been endorsed by Science Commons; do the PP represent a compromise?
Richard:
They are (Panton Principles) not endorsed by Science Commons are they?
Jordan:
John Wilbanks, who heads up Science Commons, is one of the drafters of the Panton Principles…
Richard:
I read somewhere that it hadn’t been endorsed. The other thing I understand and I think you indicated earlier on that there was some disagreement amongst people involved in Open Data about what the Principles are and how you enact them and I think the Panton Principles was a bit of a compromise, and I wonder what.....
Peter:
I think that’s over interpreting
Richard:
Okay okay
Peter:
It’s a complex area and we got together to see if we could come out with something that worked. I think it’s fair to say probably that the more you get towards the legal side of things and the licences, the more that you will get different approaches and so on, and I’m sure that John and Jordan you know have different aspects when you go down to this level, but in terms of coming up with a set of principles, I don’t think that there’s any division in the community.
Jordan:
Yeah, I think that I can talk you through. So on the science part, I think that a lot of the work that the Open Knowledge Foundation as a whole does and parts of the Panton Principles and things like the PDDL all sort of really match up so I think my personal view is that scientific data should be in the public domain when possible, I mean as Peter mentioned issues such as privacy come out and can restrict without even being a possibility.
Richard:
So there’s not an equivalent to the open access gold/green kind of....
Peter:
No
Jordan:
There isn’t on scientific data; where there is a difference of opinion is around other licensing options for databases
Richard:
Beyond the PDDL?
Jordan:
Right, so it’s whether or not an attribution-only licence or a share-alike licence is appropriate for databases outside of the scientific space.

Q9) On not using Non-Commercial licences…
Richard:
… because I think from what you’re saying Open Data never implies non-commercial, it always implies commercial options. Is that right?
Peter:/Jordan:
Yes
Richard:
Yeah

Q10) What about Share Alike - does it fit Richard Stallman's concept of freedom?
Richard:
And share-alike is another issue here. Am I right in thinking that it doesn’t always imply share-alike and therefore it might not if you wanted to use open source kind of definitions fit Richard Stallman’s vision of free. Would that be correct?
Jordan:
Well, so I guess that you could kind of look at it as maybe an inverted pyramid, right? So at the top is open, like the open definition, and maybe even beyond that it’s like some of the work that Creative Commons does that is not open under the open definition.
Richard:
Right
Jordan:
The open definition defines open knowledge more broadly, and it says that you have to have the right to use, reuse, redistribute, that sort of thing. The only restrictions that are permitted from that are attribution, to giving credit where credit’s due and share-alike, so this is the copy-left for reciprocal licensing/viral licensing or whatever your flavour for the month is for that. Those two restrictions are OK, so field of use restrictions, occasionally it pops up particularly in software forums that people don’t want the Military using their work, and so they want it to be open but prohibit Military use - that wouldn’t be open.
Richard:
As it would not be for OSI?
Jordan:
Yeah. Non-commercial restrictions are something that also come up. Having a non-commercial restriction isn't open. So and then you kind of drill down narrower into that sort of inverted pyramid, and for science, there is the recommended approach, the Panton Principles, which says public domain, there is a standard across all legal tools, the Science Commons Protocol, and then there are individual legal tools to help you place things in the public domain, so the PDDL and the CC0. I mean the whole reason we started the PDDL was because CC didn’t have a tool, they didn’t have CC0.
Richard:
Because you started out before CC0?
Jordan:
Yes, well, Science Commons was investigating data licensing for quite a long time, but they hadn't come up with a licence yet which is part of the reason why we started the PDDL, and Open Data Commons in general.

Q11) Bollier book "Viral Spiral" - Open Data does not rely on copyright
Richard:
Now these other two questions that I had which kind of roll together, I notice in David Bollier’s book, he talked about Open Data not being about copyright, not being about licences but a set of protocols, and I know that you, I think on the updater website, I saw the use of norms. I mean that’s distinctive perhaps about Open Data. Can you just sort of expand on it a bit?
Peter:
Well, I felt that one of the things from this is that we cannot get algorithms which work all the way through. I mean, just touching back on share-alike, the problem with share-alike is that it is an algorithm which, if you apply it, can get you into all sorts of tangles and so on, which is why I personally don’t like it. So if you come back here and have licences which have got whatever good intent, by the time you combine them you’ve got different... and one of the things that’s different about Open Data and Open Access is that Open Data is normally reused and brings in all sorts of other different data artefacts and so on, so you end up with a huge heterogeneous mess of bits of data. There is no way you can get an algorithm which is consistent in that, so it is intention and norms that matter here. So what we say is this is what we would “like to be done”, not what “we mandate in a Court of Law should be done”.
Richard:
Right
Jordan:
I was just going to say I’ve never read Viral Spiral but I would fundamentally disagree with the idea that when thinking about Open Data from a creative perspective that you can ignore copyright as well as other rights. So I think you can ignore it because the Law is different in every Jurisdiction, but there’s likely to be some pretty critical legal questions that come up, and your users certainly, so let’s not forget you’re trying to publish information globally because that’s the whole point of doing it (for data) in the first place, and then your users are going to assume that there’s going to be some sort of IP rights problem, I think particularly these days when they see things like Creative Commons, so you have to think about it.

Q12) Open Data philosophy and the idea of norms
Jordan:
So the next part is to talk a bit about this idea of norms. So the idea behind the PDDL, and some work that Creative Commons is doing, is that while as a legal matter there is no restriction upon what you can and cannot do with the database, so it’s in the public domain, there are some expected behaviours that the creator would like to encourage. And so if you think about it from the scientific community, one really powerful norm is citation, right, so having a norm statement is about reinforcing “Hey, you know, this may be public domain, but you’re still expected to act like a scientist”, and cite the data, the source. You’re not legally required to, but you’re not really going to publish in a reputable journal without citing anyway, so it’s just sort of reinforcing that.
Richard:
So, the end game would be to create a culture in which scientists automatically make their data open unless there’s a good reason not to?
Jordan:
Yes
Peter:
These cultures exist, so that this is very strong in bio-science. It is chipped away at times by the private sector and so on, but generally the ethos in bio-science is that data is open, data is reusable, people don’t own it and so on, so that our colleague in the OKF, Tim Hubbard from the Sanger Centre, is responsible for the daily output of the Genomic studies that they do there, and he is absolutely clear that as soon as this is created, it goes out into the public domain for anybody to use, you know, and there is no restriction upon it whatever. Norms are enforceable by things other than legal procedures. So let’s say that somebody misuses a piece of data published out there in a serious way. They misrepresent the author, they deliberately change the numbers and re-publish it and so on. These people are going to be pilloried in their own court of public opinion, they might suffer things at their place of work so their Institution might “do them” for scientific fraud, they very probably won’t get re-funded by their funding body etc. so, you know. These norms already exist, they’re not always articulated completely but....
Richard:
People know when they’ve broken them.
Peter:
Well you know, they’re not always very clear they’re in data, but I think we will get to a stage where there is a much greater emphasis on “you must publish your data”... “your research must be seen to be reproducible”; those are things which are coming. Now some scientists don’t like this, lots of them are already practiced, (for) many of them, it’s a completely new idea that you do this, and it’s a question of when you do it, how much data you give, how much help you give to people who might be reusing it, so it’s perfectly reasonable to say well, I don’t have to go to a huge amount of effort to help somebody else use my data. I put it out there you know by something that’s accepted and it’s up to them to put in the effort to write the programs or understand this or whatever. Data is not zero cost to reuse in many cases.

Q12a) Where does Open Data live - institutional, central, domain repositories?
Richard:
So where does Open Data live then? In Institutional Repositories, in large central Repositories…?
Peter:
Well, I think the first thing to understand is that all domains are different, so Astronomy is different from Computational Chemistry, from Bio-science, from Neuroscience and whatever, and we can distinguish rather crudely between “Big Science” - Hadron Colliders, Telescopes, Genomes, things like that - where you have an organisation which produces data and where, increasingly, they are making this available. And at the other end, you’ve got long tail science, which can often be numerically greater, where people are publishing individual results in individual papers, and I think that’s probably where the biggest challenge that I see comes, because there is no norm about reusing it and so forth. So for example, we’ve got robots which go out and read all of the crystallographic publications, and they get some tens of thousands of data-sets a year which we aggregate, so we’ve probably got two hundred thousand data-sets, each from a different publication.
Richard:
Right.
Peter:
Now in none of these do the authors say they’re Open Data but we’re claiming that they are Open Data and we will see what happens but it’s that sort of thing where we want to get to a culture where the authors stamp that as “I want this to be Open Data”.

Q13) "Ask for forgiveness, not permission" approach
Richard:
And this perhaps goes to the point I saw on the FAQ for the Panton Principles, it says “you may wish to take the ‘ask for forgiveness, not permission’ approach”. Not many lawyers would support that approach, would they?
Peter:
Of course they wouldn’t.
Richard:
But that’s what we’re talking about really.
Peter:
Scientists have to be pragmatists you know, they’re used to elements of risk in this, all sorts of risks, and this is a relatively low risk. If you ask for permission, you won’t get it. Asking for permission even if there’s no IP involved, somebody's got to answer it. You could end up with dozens of questions to find out whether you can use this data and we’ve tried asking Journals and you don’t know who to ask for a start. This is something else I might come onto at the end about our “Is it Open?” approach, a formal way of asking Journals and other data providers, is their data open. So you know, even asking questions is often impossible.
Richard:
Right.

Q15+16) Open Data c.f. Open Access - Open Libre and Open Gratis
Richard:
OK, so where do you think Open Data fits with Open Access because I know you used to engage with Stevan Harnad and you used to engage in the type of issues we talked about, Gratis OA and Libre OA. But I had the sense that you’ve kind of drifted away and thought that I’ll plough my own furrow and get on with it?
Peter:
Well, first of all, although I mean I’m a moderate proponent of Open Access, I’m not religious about it and I don’t think it solves… I don’t think things like Green Open Access solve any problems for scientists.
Richard:
Right.
Peter:
At the moment, there are domains where you know it will help science if all published material is openly available. I’m concentrating on the data at the moment because I think that’s achievable. It’s somebody else’s battle if you like. I do believe that if something is fully OA Libre with a CC-BY licence, then that de facto means that the data can be used as if it is Open Data. So in other words, we can go out and we find Open Access material and we can data mine it, text mine it, republish it and so on and nobody's going to come after us because that is the intent of the publisher. In doing that, the author has probably paid, normally but not always, for the privilege of having it openly available, so it would be almost barmy to say OK well it’s openly available, but you can’t reuse it.
Richard:
Jordan, anything to add to that?
Jordan:
No.
Richard:
OK.

Q17) Where does Open Science fit in?
Richard:
Open Science. Now I notice on the web there was a video of you [Peter] at Berkeley I think it was wasn’t it, and this question came up about what is Open Science, and I think you and the others on the Panel seemed to say, “we don’t need to define it now”. Could you just expand upon it a bit?
Peter:
Sure. Well I think what’s happened is, first of all, we have to give enormous credit to people like Richard Stallman for getting the whole Open Movement going. I mean I read bits of Bollier’s book yesterday and I’ve always credited Richard Stallman as one of the great figures in this, and I’m glad to see he did as well, right. So basically he has come up with a philosophy which can be applied to huge numbers of fields of endeavour, so people now say if we’ve got Open Source which works, if we’ve got Open Access, well, which can be made to work, why don’t we apply it to everything. So it’s a mixture of, I think, what I would call an internet renaissance; in other words, people are rediscovering politics and philosophy of the way in which knowledge can be moved around for human benefit. So there’s a huge number of people who want to do things for the public good and I think this generally comes under the heading of Open. So we have Open Drug Discovery, we have Open applied to almost anything you can think of, and science is one of the things it can be applied to. Now what it means I think varies a lot; I think that this is one of the problems with these sorts of statements of intent rather than algorithms. So it could mean reproducible science, it could mean that all of your science was done in an open community, they’re not the same, it could mean that you only used openly available tools for your science, a sort of Quasi Beacon approach, right. So some people have said that we shouldn’t be using any closed access software in the Open Knowledge Foundation and things like that.
Richard:
I’m sure Richard Stallman would say that (laughs).
Peter:
Indeed, and so on. It comes down to open biological materials. If you make a clone or a mouse or whatever, should that be available for everybody? So I think it’s a very broad term and I don’t think it can be easily compartmentalized more. The point of it is actually to be able to hold meetings or discussions with a very wide range of people involved and to see what comes out of it. So I think it’s a “rallying call” rather than a “definition”.

Q18) What about Open Notebook Science?
Richard:
... Open Notebook Science (ONS).
Peter:
The idea of ONS is... we started... one of the protagonists of that is Jean-Claude Bradley; he and Cameron Neylon are closely involved in the kind of things we do. I think they came up with the idea of Open Science first of all, and I think I suggested or pointed out this was too general, so they changed it to ONS.
Richard:
Right.
Peter:
This is the idea that when you create a laboratory notebook, then that is, if you like, transcluded to the whole world so that everybody has a window on what you’re doing. So it would be as if there were people doing science here, it would be as if there was a streaming video going out to the rest of the world, right. Now Jean-Claude practices this in Chemistry so that all of his lab notebooks are put up after a very modest period of probably a few hours or something like that to get it ready, because it’s actually technically very difficult to stream everything you do out in a semantic form to the web.
Richard:
Yeah, I think I saw on a mailing list or somewhere that you commented that you tried this but because you’d taken a day to put the data up you were ruled as not having fulfilled the right criteria?
Peter:
Well, we were doing Computation, as Alberto knows, we were doing Computation and taking resultant computations and giving the, what we would call, the processed data perhaps a day after this. Now the main reason was we didn’t actually have a tool to route the raw data directly onto the web; there was no reason why it shouldn't be out there, it just wasn’t. I’m involved in various data projects from JISC and others, and one of the things that’s emerging is this idea of a data chain where you start off with raw data, and almost all disciplines have got something they then call processed data or cleaned data. Cleaned is a dangerous word because people who don’t understand the process will think that it’s fudged data; it isn’t. A typical example is that you get your raw data off a sensor and you then have to calibrate it for drift or other influences or something like that, so almost all instruments will need some calibration, so that turns into calibrated or cleaned data. It may be that that is actually too much to use, so you filter it down into numbers that mean something. In one of the areas we work in, well both Crystallography and Nuclear Magnetic Resonance, you have to carry out a Fourier Transform, that’s what the science says, right, so you collect the original data and then you Fourier Transform it. Now it could be that it is actually quite difficult to put out the raw data, it’s large and so on, it’s often in proprietary format, so you put out the next step down the chain.
Richard:
So some people might be a little bit too fixated on the speed with which you put the data out?
Peter:
It’s a new area. I don’t think it’s something we should be religious about.
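
The data chain Peter describes above (raw data off a sensor, then calibrated data, then a Fourier-transformed, processed form that is easier to publish) could be sketched roughly as follows. This is a minimal illustration only: the function names, the linear drift model and the synthetic signal are all assumptions made for the example, not anything taken from the projects discussed.

```python
import cmath
import math


def calibrate(raw, drift_per_sample):
    """Subtract an assumed linear sensor drift from the raw readings."""
    return [x - drift_per_sample * i for i, x in enumerate(raw)]


def dft_magnitudes(samples):
    """Naive discrete Fourier transform; returns the magnitude of each bin."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n)
    ]


def data_chain(raw, drift_per_sample):
    """Return every stage of the chain so that any stage can be published."""
    calibrated = calibrate(raw, drift_per_sample)
    spectrum = dft_magnitudes(calibrated)
    # Real-valued input gives a mirrored spectrum, so keep the first half only.
    processed = spectrum[: len(raw) // 2 + 1]
    return {"raw": raw, "calibrated": calibrated, "processed": processed}


# Synthetic "sensor" signal (hypothetical): 4 cycles over 32 samples plus a slow drift.
N = 32
raw = [math.cos(2 * math.pi * 4 * t / N) + 0.01 * t for t in range(N)]
chain = data_chain(raw, drift_per_sample=0.01)
peak_bin = max(range(len(chain["processed"])), key=chain["processed"].__getitem__)
print("strongest frequency bin:", peak_bin)
```

Keeping each stage in the returned dictionary reflects the point made in the discussion: if the raw stage is too large or in a proprietary format, the next step down the chain can be released instead.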

Q18a) Peter's Flower - reclaiming our scholarship
Richard:
Now I think I saw a picture of you on your blog holding up a... [flower]… petals with lots of colours?
Peter:
I should have brought it along.
Richard:
(laughs) That seems to me interesting if you’re trying to broaden the discussion because I think that’s what you’re trying to do?
Peter:
Yes, so the message of that is it says “Reclaim our Scholarship!” right, and this is something that is at the root of all of this. I’m old enough that I can go back to having published papers without transferring Copyright to Publishers. I remember when it happened; I think the greatest mistake Academia made in the last 50 years was not to spot that, because they could have blocked it. I mean the university could have said no, this is inappropriate, you hang on to your copyright, and nobody spotted this and it’s cost us, I would say, 100’s of billions of dollars this mistake, seriously. If you add the value of scholarly research and so on, and the restrictions on its reuse and so on so forth, and we’re spending our time trying to claw this back. So yes this does have a political agenda which is to claw back what we believe belongs to us and data is one of the easiest points because it’s certainly in the simplistic understanding, it’s not covered by copyright, OK. It may well be covered by legal restrictions, but for most scientists, data is not copyrightable, and for a large number of Publishers, it’s not copyrightable and it shouldn’t be protected by these things. We do run into problems with things like images, I believe that a scientific image should also not be protected by copyright, or at least, not guarded by it. If you have a graph of one variable varying with another, then this is scientific data, it’s not a creative work. So yes, I want to reclaim our scholarship, I think that I had not realised how much of this was controlled by all sorts of things, I’d assumed that things like Bibliography were freely available and some people I should say that they were Libre, but now I’m told that you know, it is unclear whether we have the ability to make compilations of Bibliography which is the factual statement of authors, Journals, dates on a paper without infringing some assumed right.
Talking about restrictions, yesterday, I wanted to find out where a particular crystal structure had been published, right, and I would do this normally by searching the bibliography with the names of the authors and coming up with those papers, pulling the papers down and seeing if they have crystallography in them. Machines can do all that. I need the bibliography to resolve things like which authors have published where, and I don’t want to have to be beholden to some third party for the right to do that.
Richard:
Yeah.
Peter:
So that’s where my passion comes from.
Richard:
OK.




Q22) Scholarly publishers - how do they fit with Open Data?
Richard:
So you mentioned scholarly publishers. I’m wondering, where they fit in and how problematic they are in terms of Open Data?
Peter:
First of all, data is a problem, and everyone’s beginning to realise this. It’s a problem because it’s increasingly seen to be valuable, it is part of science, and having it in machine-readable and -processable form is going to be increasingly important for science. Tony Hey and colleagues at Microsoft have published a book called the Fourth Paradigm which talks about data-driven science, and we are moving to some extent from hypothesis-driven science to - some people call it ‘discovery science’, some people call it ‘data-driven’; doesn’t matter what it is. But it says: “here is all of this knowledge, can we make sense of it?”, rather than, “I set off with a hypothesis, can I find the data to...”
Richard:
...and the machine itself begins to make discoveries?
Peter:
Absolutely. Publishing is done with a very conventional model, which is nineteenth-century, in which the unit of publication is the article. It has all of this infrastructure of labelling, citations, page numbering; page numbering is still incredibly important, but for a machine it’s completely archaic. Some disciplines have published data alongside papers as part of the process, for whatever reason: normally I think to support the assertions of the scientists - doing that, requiring that the data is available - sometimes for reuse, to preserve it for the archive. Probably mainly the former. What’s clear now is that the explosion in the amount of data, the richness of it, and so on, means that that no longer works. People are saying it has to be published somewhere. Now in some of the domains – high-energy physics, astrophysics, bioscience sequences, structures and genomes - that’s well supported by publicly funded bodies of various sorts. In other areas, it is sometimes supported by conventional abstracters, so in chemistry, we have organizations like Chemical Abstracts, who manually abstract the current literature, they also curate it and make judgements on it of various sorts, and then, they make it available for a fee, because it’s clearly “sweat of the brow” and so forth. That model is going to break down because the amount of publication is going up, human effort only scales linearly at best - and probably worse than that - so there is a clear model at the technical level whereby if you publish data with your paper into the open domain then machines can make sense of that. The difficulty is that you’ve got to convince the whole of the publication process, which involves funders, institutions, authors, publishers, and readers (where they’re different), that this is a good model to move to. To me, it’s self-evident, but something which is self-evidently better isn’t always compelling.

Q23) Open Access publishers - are they adopting Open Data?
Peter:
Iain Hrynaszkiewicz is one of the editors there, and he was making a list of scientific repositories into which data can be put.
Richard:
What about the other, PLoS and the others?
Peter:
We’re in contact with them as well, and I think, they’re motivated towards this as well. There are no simple answers to this, because, there are some domains where it isn’t a problem, but any other domain where it is a problem, one’s going to have investment of some sort in, and it’s not clear who would invest. I’m also on UK PubMed Central, so, it’s possible that things like that could act as a repository for some of this.

Q20) How many datasets have the Open Data Logo?
Richard:
Earlier this year you - for the Panton Principles - you got a SPARC Innovator’s Award, which is great recognition by the library community and the open access and the open data people. What about, can you say anything about success in terms of, say, the number of databases that have the open logo on them, can you size what’s happening?
Peter:
It’s very small, and this is a question of simply hard work and hopefully a viral explosion of interest. If we get two or three key publishers doing this so that their journals come with little Open Data stickers on the webpages in the same way as they put the Open Access stickers on, then we will start to have essentially sort of advertising, and people will see these, click on them, see what it’s about. And so long as we’ve got good explanatory material, we hope that will spread. So that’s the most likely way to spread.

Q24) What are the obstacles to expansion of the Open Data movement?
Richard:
So what would you say are the main obstacles right now that you face in terms of pushing the Open Data button, as it were - getting the viral movement going?
Peter:
There are the technical ones, right, which is, not all data is suitable for doing this, whatever the motivation, we won’t be able to do all of that. The second is communities of practice, so, many communities of practice will not want to go to any additional effort to make this happen, so that, if you say to a scientist, you now need to submit your data file as a requirement for publication, they will say, “Get lost! I’m busy enough, I don’t want to have to do this lot.” Now, I think that will change, but I mean, this is not new, the International Union of Crystallography started requiring journals to publish crystal data in the early 1970s, I think, and not all journals liked it, and not all scientists liked it. But it’s gradually come through. So, when you’ve got an independent scientific body on board, then I think it’s more likely to happen; and where they speak for the community. The problem in some disciplines, fairly close to my own practice, is that the learned societies, the international unions, either don’t have a very high profile here, or are too closely involved with their own publishing efforts, so that you know there isn’t that detachment.
Richard:
I’m assuming, particularly having seen that plan for Berkeley, that one of the big ones is intellectual property more generally, things like patents. There seems to be a great feeling, and I think you touched on it earlier, there’s a sense in which a lot of scientists still have a very proprietorial approach to things because there might be patents coming out, they might be wanting to have a spin-out company, that sort of thing. That’s a big issue, isn’t it?
Jordan:
There are certainly two things to pick up there. One is, scientists are creators, just the same as artists and photographers, in that sense, and they’re making something and they feel proprietary about it because they made it, they put a lot of effort into it...
Richard:
And that works against your efforts, doesn’t it?
Peter:
Obviously, yes, and it’s a perfectly understandable human emotion, and it may well be one that persists for a long time or even indefinitely, but we have to work out the rules round it.
Jordan:
And that’s something that comes up with anybody trying to open their approach, that sort of human reaction of “oh well, it’s mine”. The second is that, clearly, universities are facing funding cuts and shortages and one of the ways that universities have been looking at making more money is spin out companies and that places pressure on scientists to identify […cut off]

Q25) Is Open Data inhibited by pressure for academia to have commercial outputs?
Richard:
You were saying that one of the big problems is that science has given away so much. I wonder whether one of the other problems might not be this growing sense that’s developed that the market should somehow operate in academia. I’m particularly talking about spin-outs and stuff.
Peter:
I’d like your views on this, because I think that there is a market, and I think it’s very poorly defined. Let me just come back to the question of the challenges we face. We do also have other things on our side, which in particular I think, the ones I look to the most are the funders, because they have in many cases an absolute jurisdiction over future funding for people - if people don’t accept their norms, well then they’re welcome to look for funding elsewhere, and so forth.
About the market: I have this very strange problem, which is that I don’t know what the value of funded scientific research worldwide is. I’ve tried to do some sums, and I come up with somewhere between $100 billion a year and $1 trillion a year. I haven’t found anybody with a full analysis of this. But this is funding going into universities to do research, so this is a very large industry, and this is an industry which ought to be able to manage its information flow, its evaluation, and all the rest of it. You wouldn’t find any other industry of this size which was so completely at the mercy of things over which it had very little control. The future is that you’ve got 10,000 Higher Education Institutions who are feudal, and they are competitive as much as they collaborate, probably more so. So they can’t get their act together. If they got their act together, they could solve a lot of these problems very quickly. And, at some stage, somebody might wake up and say: “OK, well look, 1% of this revenue is scholarly publishing, that’s a problem, let’s solve it.”
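
As a rough check on the scale Peter is pointing at, the sketch below applies his illustrative 1% figure to both ends of his estimated range. The 1% share and the $100 billion to $1 trillion range are the figures quoted in the discussion, not measured values, and the function name is hypothetical.

```python
def share_of_funding(total_usd, percent):
    """Return the given percentage of a dollar total, using exact integer arithmetic."""
    return total_usd * percent // 100


LOW_ESTIMATE_USD = 100_000_000_000       # $100 billion a year (lower estimate quoted)
HIGH_ESTIMATE_USD = 1_000_000_000_000    # $1 trillion a year (upper estimate quoted)

low = share_of_funding(LOW_ESTIMATE_USD, 1)
high = share_of_funding(HIGH_ESTIMATE_USD, 1)
print(f"1% of funding: ${low // 10**9} billion to ${high // 10**9} billion a year")
```

On those assumptions, the 1% comes to somewhere between $1 billion and $10 billion a year.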

Q26) Government wants Universities to create jobs - does this inhibit Open Data?
Richard:
I was very struck when I interviewed Tony Hey, when he said to me - and I hadn’t quite realized that - that universities, their mission from the government now is to create employment, to create businesses for the local economy, and that came to me like a real shock, that this was what universities are asked to do, because is that not where the attempts to make things open are going to be constantly tripped up?
Jordan:
I guess, two things: one is that I don’t think that is all that different than universities’ mission all along, if you look at things like the RAE: how academics get rated is not on the quality of their teaching, but how much research they’re doing, and that’s how universities get their five-star rating or whatever - that directly relates to how much funding they get from government. And then, there’s this shift from just doing research to looking at ways to spin-out companies. I think universities are very much in the same position as governments - and maybe slightly different - in that they shouldn’t always be in the business of looking out for spin-out opportunities; if their mission is to create business and create growth around, [they should] look at themselves as a platform: “here’s a bunch of research; you, the commercial community, here are rights to go do it” - which is an open approach, you can do it openly - and say, “here’s a bunch of data, here’s a bunch of stuff that we’re doing. You figure out the business model, you build on top of what we’ve done, and do what you do best, which is businesses.”
Richard:
Well, makes sense, doesn’t it?
Peter:
I wouldn't dispute the fact that universities have a role in generating new wealth for the community and so on, and that’s how it should be. What I think is clear is that some of the mechanisms of the last 30 years - the Bayh-Dole Act and similar things here - have actually led to a narrow and ineffective method for creating real wealth, and ok, it’s probably because of the people I talk to, but there seems to be a growing realization that micro-patenting is counterproductive to not only the practice of science but the wealth creation itself.
Richard:
And that could be good because then, as that consciousness develops, you’re pushing the open button on that side, and it all begins to make more sense, doesn’t it?
Rufus:
Yeah, I think a simple point, which is that, clearly we don’t always maximize wealth - commercial or otherwise - by being proprietary. In fact, to take a really complicated example, I read a little excerpt from a book the other day; it was amazing, it was about design of slaughterhouses by this lady who’s one of the world leaders in this. She was particularly concerned with animal welfare, but also how to build humane but also good and efficient slaughterhouses, and other ways of processing livestock. She had a whole section where she went on about the damage that patents had done, about how she had always promoted publishing openly her schemata and designs and the point she was making was twofold: one, that she thought that was more effective for disseminating this knowledge and actually getting animals treated better, but also, really important, it was not about just altruism; she said: I got far more work and consultancy than I can ever do - with McDonald’s, with all kinds of people - than I would have got being proprietary about it. She gave a very sad example, actually, of someone who’d invented a very humane and very cheap pig-stunning device, made out of bicycle parts, and that it had been patented and bought off, and then the product had been killed by a bigger competitor, who was worried about this product going to market. So even in an area you wouldn’t think of [as] Open Data, but in some ways, this woman was doing Open Data and open information in that she’s putting these schemata out there. To round that off, one of the first things I read as an economist doing research was about a guy called Rennie who’s very famous because he knew Watt during the industrial revolution, and Rennie and Watt, if you like, are the two poles - Watt famously patented a lot and was very aggressive in litigating the patents (or he and his business associate, Boulton), and also in extending the patent, and was very virulent in enforcement. 
Rennie, on the other hand, was basically an engineer who built a load of stuff to do with mill wheels and other stuff such as this, which was pretty crucial in the Industrial Revolution - and he was always giving out his designs, again, fairly freely. [It was] the same logic 200 years ago; he was saying, Watt wrote to him, going, “Why? You’re insane! Why are you doing this, this is your money, your patrimony!”, and so he’s like, “No, I get more work, more design, more building work and stuff that I can do, through me disseminating my ideas freely”. So, not only in things like science and particularly public science [where we] have got this up-front funding, where we know from economics [that] it’s always better to disseminate that freely if you can, but even in cases where you know we have got these issues, it often makes sense, and I wouldn’t go so far as to say “everything always shall be open”, no, but a lot of the time, even there, it can be both good commercially and good socially.

Q27) Patents are open - a great source of chemical information
Peter:
So actually, just parenthetically here, one of the attractions of patents is that they are open, right, I mean, they are closed for a year or whatever the jurisdiction is, but after that, anybody can do whatever they like with them. And so, we've got a project which we're showing at Science Online next week where we're reading all the European patents by machine, and we're pulling the chemistry out of them, because they're actually the biggest source of open chemistry anywhere.
Rufus
Expired patents, that's a good point.
Peter:
Oh no, they're not expired!
Rufus
They've been published - they're not secret.
Peter:
I mean, the information, the data is open. I can't make an invention which challenges that, but I can use the melting points and the solvents and all the rest of it.
Richard:
Which is how Monty Hyams started Derwent, the patent information provider - he just sat on his kitchen desk doing that sort of [thing].

Q21) What advice do you have for a scientist wanting to make their data Open?
Richard:
So, in terms of advice to scientists, about Open Data, a) what they should be doing and b) what are the pitfalls they should try and avoid?
Peter:
Well, I would say that the first thing they have to do is find out about it and it's not easy to give them advice if they don't know it exists. It's not something which is self-evident to most scientists. They have to be introduced to the problem. It's really up to us to give them as many chances of bumping into it as possible, and then I think it's up to us to give them very simple advice which is that, on the assumption that you believe that your data should be available to other people for reuse within science, these are the 3 or 4 things you should do. The first thing you should do is actually make sure that your data is well-managed during the experimental pre-publication phase. Now, if you're doing Open Notebook Science, you're probably doing that, because you've got to get it out, but most people aren't; most people have got it on their laptop - if they're lucky - and so forth, and most people suffer data loss and so on. I would say that it would be tremendous to have a concerted higher education effort on data awareness and management for incoming graduate students - I would put that at the top of my list, alongside - you've got to do lab safety, you've got to do various other things you have to conform to - I'd say data management is a key one. Once you do that, then most of the rest follows, because then you get to various stages where you can press buttons when you come to publish, and your experiment should then be published in whole somewhere or other. It's then really up to the domain to decide where it should be published. [The question arises] - should it be in central archives or institutional repositories? I'm absolutely clear on this, that it should be in central domain archives because they understand it. Putting random amounts of material into institutional repositories - I'm talking about data here, not papers - it will just get lost. 
I hope people challenge that view, but until they show me an institutional repository where data is going in and coming out, then, I'm going to maintain that we have to have these domain-specific repositories and that each domain's going to have to find ways of funding it, either by - it may be pay-for-access, there's no reason why you shouldn't have this, you can still have Open Data - it depends on the community.

Q29) What should the Open Data movement be doing now?
Richard:
I thought that a good final question would be one of yours, in fact, and that is what the Open Data movement should be doing now, how it should be advocating for Open Data, what kind of activities it should engage in?
Rufus
I think one thing is, there is a huge need to go and talk with people. It's always a difficult issue in that people are very busy - anyone is - scientists doing the actual research; remember, Open Data is a means to an end, it's not an end in itself - it's like money, it's there - so, it's very important to emphasise that. I think also, you want to tell people, but also, don't say, “Thou shalt do Open Data”, “Thou must do Open Data”, it's: “here are the ways it can help you, this is why it might be a good idea”, also “here's how you can use existing Open Data”, I think that's a very important thing. It's like Open Source software or something else, often you start out using something else and that's how we come to doing it ourselves, because we say, “that was really useful, that was really convenient, imagine it was that convenient to get my stuff!” The other thing, that Peter can probably speak to well, is: one thing I really noticed when we started the Open Knowledge Foundation is the idea of just porting things, very directly. The Free and Open Source community has worked out a lot of stuff, a lot of ideas we can just port across. One thing that certainly struck me is that coders, they might code in different languages like C or Java or Python or whatever, but they have often a very shared outlook and set of assumptions, so that talking to them about things is quite easy, relatively, whereas in Open Data, there's chemists, there might be economists, there might be people doing humanities and getting letters from Dickens, and there's much more variation in understanding, assumptions and world-view of different groups you speak to.
So I think, another big challenge when you go and talk is you need champions in each field; relatively, in software it's easy to go “once we've got someone explaining it to Python they can explain it to Java people”, whereas here we really need people also to champion it in given domains, and know what are the things that make people tick in that domain - sometimes you talk about something, turns out that those guys don't care about that at all, or it's partly solved already in this way, so you really need that knowledge to make a convincing case.

Q30) Could SPARC help encourage scientists to open their data?
Richard:
You know SPARC - I don’t know whether you’ve done anything like this for Open Data - they produced this leaflet, for Open Access, to go out to scientists. This is what Open Access is, this is what you personally can be doing, is there any scope there for getting SPARC to do something similar for Open Data?
Peter:
Certainly, yes, and you know we’re in moderately regular touch with them...
Rufus
We’ve already done some work - there’s some work been done by different people already; the most advanced one is for bibliographic data, for metadata libraries..., but, yes.
Peter:
Perhaps you can answer this, but I would try and pick up the things that made Open Source successful - it is getting large, influential opinion formers. So, for example, Microsoft, or at least Microsoft Research, are very aligned towards this, so if we can get that into any Microsoft products - they’ve got an author add-in for Open Access - if we can get that through to Open Data, that will reach a huge number of people. But they’ve got to want to use it because they’ve got to want to be doing some data-y things. You can see how it could spread through data-supporting projects.
Richard:
Yes, and Creative Commons got a great deal out of just having Microsoft put into Word that ‘here’s your Creative Commons’ button.
Peter:
Exactly. So another one is - I will probably go to UK PubMed Central and see if we can get this concept promoted on UK PubMed Central, so say, ‘this resource has Open Data’. Similarly, we go to publishers such as BioMed Central, PLoS, International Union of Crystallography, who have open access offerings, and get them to label them as well. So gradually, this will spread through and go to funders such as Wellcome and say, “can you raise this on the grants?” - funders for example have requirements now for data management, you’ve got to come up with a data management scheme, so, can you put in that the data must be released formally as Open Data, not just that you must publish your data - it’s things of that sort.
Richard:
And I suppose some case studies might do a good thing as well. I came across a great one: you know you talk about particle physicists? They’re heavily into Open Data now, and I got a great quote from the Director General, who pointed out that it costs 6 billion Euros to build the LHC, and he said ‘10 or 20 years ago we might have been able to repeat the experiment - they were simpler, cheaper, on a smaller scale. Today that is not the case, so if we need to re-evaluate the data we collect to test a new theory or adjust it to a new development, we’re going to have to be able to reuse it. That means we’re going to need to save it as Open Data.’ Of course there’s an interesting backstory to that, because when they were - the German particle physics centre DESY, they had a collider going called PETRA - and they had all this data in it, and then it disappeared, and they needed to re-analyse the data, but this is a great story here: they then discovered that the file containing the recorded luminosities of each run was stored in a private cabinet and therefore lost when the DESY archive was cleaned up. Jan Olsen, when cleaning up his office around 1997, found an old ASCII print-out of the luminosity file. Unfortunately it was printed on green recycled paper not suitable for scanning and OCR-ing. The secretary at Aachen retyped it for four weeks; the checksum routine found and recovered only four typos! So they managed to recover it, but it taught them a huge lesson. We could so easily have lost all this. And apparently this had some impact on findings later on, in QCD - quantum chromodynamics theory. It’s just a great story for Open Data there. So, cases like that, you could take in the different domains, [and] say, “this is what you could lose”.
Alberto
Another thing is to understand how different communities work. I’m doing theoretical physics, not like the experimental part, and in one way people doing theoretical physics are very idealistic, because it’s not a field where patents, and trying to make money or whatever, but in another way, it’s also a very individualistic community, because it’s all about egos and who proves this theorem. So, there is quite a strange balance to meet between people who want to do physics because they want to advance knowledge and... But the same point, how people get promoted, what price - I think this is a very difficult balance to... how the community changed the way to recognize people.
Richard:
That’s what’s essential, really.
Alberto
Not only the one who used the data, but also the one who produced the data, and how the community changed the rules of who did what and what was important as well.

Closing remarks
Peter:
Well, Richard, thank you very much indeed for coming, you were in the neighbourhood so that was convenient, and we’ll now decide what we’ll do and so on. I think that has set an extremely good precedent for [the Panton Discussions], it’s covered a lot of ground, it’s been natural, I think it will transcribe well, we’ve got to think about what’s done there; it’s open so if you want to take any or all of it and do your own thing with it we’d be more than happy for you to do it.
Richard:
Great, thank you!



A big Thank You to the transcribers: Jamaica Jones, Graham Steel, Karius Kempe

Appendix: List of Tracks

Track Question/contents
01 Brian
02 PMR giving intro
03 Richard proposing framework for session
04 Jordan & Richard describing multiple licences
05 PMR - goal is to not worry about licences
06 PMR talking about why data is not copyrightable
07 Jordan; what is Open Data; data-facts and data-databases
08 Discussion around which questions to tackle
09 Personal introductions; Jordan, PMR, Richard
10 Starting Real Questions
11 Q1) What is Open Data, why is it important, what problems does it fix
12 Q2) (question) What are PDDL, CC0, how do they work & differ
13 Interlude - phone call, memory card problems
14 Q2) (answer) What is PDDL, CC0, how do they work and differ
15 Q3) Science Commons, Open Knowledge/Data Definition - what are they, how are they used
16 Q4) The Panton Principles - what do these provide that wasn't available before
17 Q4a) Climate data at UEA - what issues does that raise
18 Q4b) Does Open Data imply open formats
19 Q5) How many people have signed up to the Panton Principles
20 Q7&8) Have Panton Principles been endorsed by Science Commons; do the PP represent a compromise
21 Q9) Not using Non-Commercial licences
22 Q10) What about Share Alike - does it fit Richard Stallman's concept of freedom
23 Q11) Bollier book "Viral Spiral" - Open Data does not rely on copyright
24 Q12) Open Data philosophy and the idea of norms
25 Q12a) Where does Open Data live - institutional, central, domain repositories
26 Q13) "Ask for forgiveness, not permission" approach
27 Q15+16) Open Data c.f. Open Access - Open Libre and Open Gratis
28 Q17) Where does Open Science fit in
29 Q18) What about Open Notebook Science
30 Q18a) Peter's Flower - reclaiming our scholarship
31 Q22) Scholarly publishers - how do they fit with Open Data
32 Q23) Open Access publishers - are they adopting Open Data
33 Q20) How many datasets have the Open Data Logo
34 Q24) What are the obstacles to expansion of the Open Data movement
35 Q25) Is Open Data inhibited by pressure for academia to have commercial outputs
36 Q26) Government wants Universities to create jobs - does this inhibit Open Data
37 Q27) Patents are open - a great source of chemical information
38 Q21) What advice do you have for a scientist wanting to make their data Open
39 Q29) What should the Open Data movement be doing now
40 Q30) Could SPARC help encourage scientists to open their data
41 31) Closing remarks