Wikisource talk:Wikisource and Project Gutenberg


What is the relationship between our Wiki text project and Michael Hart's en:Project Gutenberg? Are we cooperating, ignoring each other, or what? --Uncle Ed

I moved Ed's question from my talk page at en. It's a good question. I'm not sure they would be favourable to the idea of Wikisource based on this quote at their website:

...our experience is that people, however well-intentioned, do not keep copies up to date. We want there to be one clear source for people seeking the latest Project Gutenberg information, and we think that having a lot of out-of-date copies and partial copies scattered around the net would be a bad thing... [1]

Is there any way we could co-operate with them? Do we want to? Would there be any advantages/ disadvantages? Angela 22:49, 9 Dec 2003 (UTC)

The above comments from Project Gutenberg make good sense. There is more to this than simply saying "This work is in the public domain so I should be able to copy it." Somebody has already been quick to add all of Shakespeare's works. That's acceptable here. It's not wrong, it's just that it doesn't do any good. Shakespeare's works are already available on many other sites. Putting them on Wikisource doesn't provide any added value, and added value is what will make using Wikisource worthwhile. Project Gutenberg appears to be a solidly supported project. If you must copy another on-line source instead of adding new material, look for a project that's at risk of going out of business. Eclecticology 02:21, 10 Dec 2003 (UTC)
Having PD texts in a central, easy to link to, and easy to read (have you tried to read a 4 MB text file?) place is valuable to the other Wikimedia projects. In the future I'm sure we will be able to export the output of Wikisource pages to other Wikimedia wikis in a similar way as we now display images; [[source:Gettysburg Address]] in an edit window on Wikibooks would display the text of the Gettysburg Address on Wikibooks (which could then add whatever annotation that project wants). We could even call upon only certain lines if we wanted. --Maveric149 07:13, 10 Dec 2003 (UTC)
Please don't misunderstand me. I am as always an inclusionist at heart, and I look forward to people on the software end of things developing things similar to what you describe. Still there is a wide difference between the material that I would tolerate, and the material that I consider a big Wikimedia asset. I've argued in the past in favour of better editability, but there's still a lot of material around and that will be added to Wikisource that no-one would be motivated to annotate. Eclecticology 08:20, 10 Dec 2003 (UTC)
There are public domain texts of 19th and early 20th century works by well known authors that I am working on formatting, where I have found no available source of electronic editions, or simply raw unedited scans at one or two locations, and do consider these to be some of the more important contributions I could make here. I might have one or two of these done in the next few days, my formatting work and proofreading on them currently being somewhat sporadic. I have most of these works in printed editions in my personal library, and even where I have used electronic texts or scans rendered by others, often prefer to proofread and correct them using such hard copies, when available.
Even so, I do believe that providing "yet another" site where major and widely available "classic" texts are easily accessible, and linked to Encyclopedic articles, is important, especially one where errors can be rapidly and easily corrected, as soon as they are noticed. I have worked with MANY Project Gutenberg texts simply for my own pleasure and easy reference, and know that there have been many typographical errors that remain in many of them. I have made notes of some of the errors that I've found in some, lost track of most of these notes, and don't believe I ever actually got around to even notifying them of any errors until I was working on the Shakespeare works here. I was primarily using the raw public domain textfiles they make available (or other public domain texts available elsewhere) as a base to work from and found their primary public domain edition of "Cymbeline" was entirely missing the third and fourth acts. This rather massive error was acknowledged, but I was told it would probably take some time to correct, and was advised to use either their edition which had a claim of copyright on it, and was declared limited to personal or educational uses… (thus more limited than the GFDL) or to work from the rather archaic, poorly formatted First Folio edition they also provide as public domain text. I simply hunted around until I managed to find another source of a complete electronic edition of the play in the public domain, at a rather obscure site, and worked from that and printed editions of my own to amend the edition I had worked on. (There are also two plays that I initially worked on that I intend to reformat in the next day or two, having learned better formatting strategies while working with the others.)
Project Gutenberg is an excellent and important project that I have highly praised to others, and which will always remain so, but I believe that with many important works, providing easily accessible, easily useable, and easily correctable copies of them, Wikisource can, and eventually shall do better than they currently are doing. Even in my own personal use, I find the slightly formatted and sectionable text that we can provide is both more appealing and accessible than the raw text that they do, or the often highly segmented editions on pages cluttered with advertising that some others do. Kalki 19:33, 10 Dec 2003 (UTC)
I agree heartily with Kalki's statement that it would be useful to have an editable version of public domain books, even if they're already available on Project Gutenberg. For one thing, I would love to be able to correct many of the typos in the PG version of "The Devil's Dictionary" by Ambrose Bierce. There are 77 Google result pages for the non-existent word "loganimity" because the PG version of the Devil's Dictionary has "longanimity" mis-typed as "loganimity". Because PG is quickly becoming the premier source for public domain texts on the web, and it has no clearly available mechanism for members of the general public to correct them, these sorts of mistakes get perpetuated more and more. --64.81.243.120 00:52, 23 Sep 2004 (UTC)
Let's not forget Distributed Proofreaders. We should in fact work with them by incorporating their texts as the primary source. Then when/if we find errors and fix them, we can send them the diffs to correct their copy. But that raises a very important point - what license will our modifications to the public domain works be under? IMO all edits/corrections (and maybe even formatting) should be in the public domain as well. I for one get pissed by websites such as http://1911encyclopedia.org which claim copyright on the famous public domain encyclopedia just because they (claim) to have edited the document somewhat (replacing archaic words with more modern counterparts and fixing OCR/formatting errors). --Maveric149 22:46, 10 Dec 2003 (UTC)
I have done a small bit of work at Distributed Proofreaders helping to proofread a few texts…they are another excellent project to get involved with, but to my understanding they serve primarily as a service to Project Gutenberg, more than as any independent repository of texts. I too feel that attempting to place new restrictions on public domain source material because of minor editing or processing is absurd and contemptible, and contrary to the ultimate spirit of social contribution that I think most Wikimedia-workers (and most Project Gutenberg volunteers) work with. Clearly declaring text available in the public domain to be such, and it remaining entirely so, is something I certainly favor. Kalki 01:54, 11 Dec 2003 (UTC)
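As a rough illustration of the "send them the diffs" idea raised above, here is a minimal sketch using Python's standard difflib module. The function name, the file labels, and the sample text are all hypothetical; neither project is known to use this exact workflow.

```python
import difflib


def make_correction_diff(original: str, corrected: str, name: str = "sample.txt") -> str:
    """Return a unified diff describing the changes from `original` to `corrected`.

    The "pg/" and "wikisource/" labels are illustrative only.
    """
    diff_lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        corrected.splitlines(keepends=True),
        fromfile=f"pg/{name}",
        tofile=f"wikisource/{name}",
    )
    return "".join(diff_lines)


# Hypothetical sample text, not taken from any real etext.
pg_text = "The word loganimity appears here.\nA second line that is unchanged.\n"
ws_text = "The word longanimity appears here.\nA second line that is unchanged.\n"

patch = make_correction_diff(pg_text, ws_text)
print(patch)
```

A diff like this shows only the changed lines plus a little context, so whoever vets the correction can check each change against a printed edition without rereading the whole text.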

I wrote to Dr Greg Newby, the director of Project Gutenberg about this. His reply is below. If people here feel collaboration would be useful, perhaps they could follow this up. Unfortunately I am no longer willing to have any involvement in Wikisource. Good luck with the project, everyone. Angela 04:28, 16 Jan 2004 (UTC)

Sorry to poke my head in here, and feel free to delete this, but I'm curious why the founder of WS is leaving with such a forceful (and brief) statement. I see nothing on this in your User or Talk page. David spector 14:22, 24 February 2012 (UTC)
 Subject:	Re: Wikisource

 Hi, Angela.  Sorry for taking so long to respond to your note,
 below.  I was out of town for a week, and managed to fall
 quite behind in my correspondence.

 I think it would be worthwhile for Project Gutenberg to
 collaborate with Wikisource.  We certainly have many
 of the same interests, and probably attract some of the
 same volunteers to create content.

 Because we have such different but complimentary
 structures, I have these simple suggestions.  Maybe
 we can work on some other ideas, for better coordination
 and mutual productivity:

 1) Project Gutenberg would happily accept content from Wikisource,
 as long as it can be submitted to our copyright guidelines (basically,
 we need the scanned title page & verso; see our HOWTO at
 http://gutenberg.net).

 GFDL and similar license are OK, too, though we have mostly try to
 keep away from computer documentation (which is a large piece of GFDL
 and Creative Commons type licenses) since it's both hard to present
 effectively in editable formats, and more likely to be outdated.

 2) You are most welcome to use PG materials (we have a "small print"
 license at http://gutenberg.net/license and in each eBook, but there
 are essentially no restrictions for non-commercial use when the original
 will remain available or linked in).  One thing that would really
 be great would be to integrate some of our emerging procedures
 for bug tracking and error fixing with Wikisource, so that
 "fixed" items can get back into the PG collection. 

 Having "curators" for particular authors, titles, subjects, languages,
 etc., who could help add value to subsets of Project Gutenberg, is
 one of my biggest goals right now -- and the Wiki model is an appealing
 method for working towards that goal.

 3) We could try to set up mutual linking, a HOWTO or other materials
 that make it clear to volunteers that their work will potentially have
 the benefits of both:
	a) the long-term stability of PG, with a central
	depository, worldwide mirrors, standard metadata, etc. AND

	b) the nimble and effective Wikisource method of easily
	applying changes, collaborating informally and formally,
	and generally making it easy to move quickly

 (I'm oversimplifying our roles, of course, but the idea is to
 emphasize both shared interests and complimentary strengths.)

 We don't have a very formal process for "affiliates," but certainly
 I'd like to add you to our "links and affiliates" page, and work on
 better ways to share content and funnel volunteers to the most
 appropriate project.

 I've already spoken with folks at our Distributed Proofreaders project
 (http://www.pgdp.net), which is our greatest volunteer base and very
 prolific.  The Wiki model fits well with the activities (though not
 the details) of the DP volunteers.

 Let me know how this sounds, and sorry again for not responding
 sooner.

 Best,
  -- Greg

 Dr. Gregory B. Newby
 Chief Executive and Director
 Project Gutenberg Literary Archive Foundation http://gutenberg.net
 A 501(c)(3) not-for-profit organization with EIN 64-6221541

My biggest interest in collaboration with Greg Newby's project is the use of Wikipedia-style editing to fix typographical errors. I have just finished reading Little Women (by Alcott) and Pilgrim's Progress (Part 1 of 2, by Bunyan). Both texts contained enough typos to make it worth my while to try to submit changes. But "Distributed Proofreaders" seems a bit cumbersome; and they have a kind of boot camp that discouraged me...

I'd like to be able to go directly into the text, and change the spelling of a single word. Based on my reputation as "Uncle Ed" of Wikipedia, I think people would trust my spelling changes. And if not, they could change it right back!

From time to time, someone from Gutenberg could vet these changes and decide if the corrected version meets their high standards. Wikipedia (or WikiSource, if ya wanna get technical) would thus be a "feeder" for corrections to Gutenberg.

I don't want to exacerbate the problem of duplicates; I only want to use our Wiki software to correct errors.

Ed Poor, aka Uncle Ed
Wikipedia


I have less concern with Project Gutenberg than some of the others doing this type of thing.

  1. In the course of discussion with User:Norwikian I came to realize that the works of Thomas Browne in the Penelope Project of the University of Chicago were no longer available to the general public. Their ARTFL project also includes the Diderot Encyclopédie, and requires subscription through an educational institution. Independent scholars are out of luck.
  2. A site like http://gaslight.mtroyal.ca/, which specializes in mystery stories, has not been adding texts to its database recently. Is this a sign that its disappearance could come at any time?
  3. Some sites have works available only in pdf format. Perhaps we might like to convert these into plain text.
  4. Then too there is no shortage of works that have never been digitized.

These are the sort of things to take into account when considering what to upload. Eclecticology 21:33, 16 Jan 2004 (UTC)


I regularly work with PG, so I've seen these kinds of questions come up regularly. To summarize:
  • Scans of publicly available public domain works are themselves public domain.
  • Various ebook sites do disappear never to return, with their ebooks disappearing with them (occasionally they do turn up again -- last year I discovered a gopher server which had ebooks that were typed up in the early 90s and had previously been thought lost).
  • Distributed Proofreaders is aware of this problem and they do engage in "raiding" (PG lingo) on websites that they think are going to disappear and for which they haven't been refused permission. They do ask permission for every site that they "raid" and if they get refused they don't do it.
  • If you have an etext for a book that is just a straight copy of a PD book, Project Gutenberg will take it if you can prove it's public domain (this generally means going out and finding a physical PD copy and checking a page every twenty pages or so to make sure it matches the etext).
Overall I think we should leave books to PG, as they have the experience, volunteers, and software infrastructure to deal with them. On the other hand there are many other primary sources which Gutenberg doesn't generally include, such as speeches, raw data, extracts from books, etc. (although exceptions exist), which is what Wikisource should primarily concentrate on. --Imran 00:13, 15 Feb 2004 (UTC)

I don't think that our views are very divergent. I think that we agree that determining whether something is in the public domain is paramount to a decision to include it, even if my interpretation of the term would allow for a wider range of inclusions.

My criticism of copying PG material is based on my opinion that it is mostly a waste of time. So unless there is a reason for copying a particular work, why bother?

I don't think that we should abandon books for PG to do alone. Books are worth including just as much as all the other items that you list. Also, AFAIK PG's material is all plain text. We are in a better position to upload images to illustrate these texts. The real growth will come when the technical genius of our developers notices us enough to develop the software for annotating in a parallel scrolling box. :-) Eclecticology 01:20, 15 Feb 2004 (UTC)


Do we have to add any copyright notice to use PG's works? After reading this and their own site, I'm left a bit confused. Ambivalenthysteria 06:47, 2 May 2004 (UTC)

No. As I understand their approach, basically they won't include anything that is not already in the public domain, and they tend to be very conservative in their evaluations. They permit things to be copied from their file either with all of their additional explanations, or none of it. Since their explanations are long, and could include material with which I am not completely comfortable, I will choose to have none of it. I do however show where I got something from, and with Gutenberg material I use the words "based on ..." which avoids any claim that it is identical to their material.
Sometimes a reference to old copyright notices is important to establish that the material is in fact in the public domain. It's also important for our downstream users to be able to establish that kind of link. Eclecticology 13:08, 2 May 2004 (UTC)

I'm Jim Tinsley, one of the people from PG who posts texts. I'm also the FAQ maintainer and I manage the bulk of corrections submitted to PG, so I can answer several of the points above from everyday experience.

On corrections,

  1. Our correction submission process is a simple e-mail, as detailed in R-26, R-27 and R-28 of our FAQ at http://www.gutenberg.org/faq/, where you will find my e-mail address, and the contact address for errata is on our contacts page http://www.gutenberg.org/about/contact. We welcome corrections from anybody about any text at any time.
  2. About half of all corrections submitted are wrong. That is, the original text is correct, but the reader who suggested the "correction" thought it wasn't. I would expect any Wikisource correction submissions to improve on that average!
  3. The quality of texts varies enormously. The initial edition of Little Women, first posted back in 1996, is atrocious. The current edition, lwmen13.txt, is about as clean as I can get it. There's one spot that I think must be wrong, but I've confirmed it with a print edition, so maybe it was an atypical awkwardness on Alcott's part, or a perpetuated misprint. Other older texts may be near-perfect, or, dare I say it, even perfect. Newer texts have gone through a more regular QC process, so are unlikely ever to dip below a reasonable minimum standard, but of course they also have errors. Sigh.
  4. I've just fixed the error pointed out in the Devil's Dictionary, along with 10 or so others. Thanks for noticing!
  5. One of the sorrows in my work with PG is seeing our errors perpetuated. I mean, we like to see our texts go out in the world, and be converted and appreciated and used by as many people as possible, but unfortunately a side-effect is that our errors get perpetuated too. Site A takes a text from us. Site B takes it from Site A, and so on, and for the most part they never check back to keep up with corrections. Google for the phrases "first sild dress" or "Unbless'd, ahandon'd to the wrath of Jove" to get an idea. And "loganimity" will probably be around for years, too.
  6. We have an ongoing project to put older texts through our more recent QC process. It's going slowly, but it does mean that there is quite a lot of movement in the older texts, so if you have copied one, please check back every month or so to see whether there's an update you need to include in your own copy.
  7. If some group of you want to focus on putting some texts through a Wikisource proofing or correcting process, and have specific ideas about how to communicate that back to PG, just e-mail me, or the errata address given on our contacts page at http://www.gutenberg.org/about/contact

On copyright,

  1. The vast majority of PG texts are in the public domain in the United States, but we do post copyrighted books with the permission of the copyright holder. Such texts will be clearly marked with a copyright notice in the header, so if you do snarf one for copying, you will see it immediately.
  2. Assuming the book is in the public domain, you can use it as you please, so long as you follow the PG trademark license, which means, effectively, removing the PG header and footer or following the conditions of using the PG trademark spelled out in the header or footer.
  3. Since we post copyrighted books, we also post GFDL, but we require, when posting a GFDL'ed book, or a book under a Creative Commons license, or any other license, that the contributor really does own the copyright on that book. Under US law, merely transcribing a book, or fixing typos, or adding markup to a text, does not give the transcriber a new copyright. We are very much against people claiming a new copyright to which they are not entitled, even if they then license it under the GFDL.

On images,

  1. For the past year or so, 80% or so of new texts have been added as both HTML and plain text, XML is teething, and we are starting to archive the page images of the paper book as well -- which should solve some existing problems about correcting errors!

On Wikisource copying texts from PG,

  1. So long as the text is not one of the relatively few copyrighted texts, fine!
  2. Please check back every month or two to see whether the date on the PG file has changed. It might be a good idea to note the datestamp from the PG file on the page you create here, so that someone else can know whether an update is needed if you're no longer around. The file ls-lR on gutenberg.net is always available (well, we're having some problems with it just now, but it's nearly always available) at http://www.gutenberg.net/dirs/ls-lR or http://www.gutenberg.net/dirs/ls-lR.gz, and will provide an automated way for you to keep tabs on the latest revisions. Our filename update conventions are available in the FAQ.
  3. As a matter of etiquette, please retain the Credit Line from the PG text (that usually comes in or just after the header, and says something like "Produced by Joanna Doe and Joe Bloggs".) This is especially important when the Credits Line acknowledges the input of other archives; some of them require us to acknowledge their contributions, and while you are not bound by that, every major site that copies these texts without crediting them dilutes our promise.
  4. When you find errors, we'd be obliged if you let us know about them! JimTinsley Nov. 22, 2004 00:26:21 (UTC)
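
The datestamp check suggested in point 2 above could be automated along these lines. This is only a sketch: it assumes the ls-lR file is ordinary `ls -lR` output (directory headers ending in a colon, nine whitespace-separated columns per file), and the directory name, owner, size, and dates in the sample are invented for illustration. Fetching the file is left out; the function works on text you have already downloaded.

```python
def parse_ls_lr(listing: str) -> dict[str, str]:
    """Map 'directory/filename' to the date columns of an `ls -lR` listing.

    Assumes the conventional long format:
    perms links owner group size month day year-or-time name
    """
    stamps: dict[str, str] = {}
    current_dir = ""
    for raw in listing.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("total "):
            continue
        fields = line.split(None, 8)
        if len(fields) == 9 and fields[0].startswith("-"):
            # Regular file entry: keep the three date columns as the datestamp.
            month, day, year_or_time, name = fields[5], fields[6], fields[7], fields[8]
            stamps[f"{current_dir}/{name}"] = f"{month} {day} {year_or_time}"
        elif line.endswith(":"):
            # Directory header such as "./etext96:".
            current_dir = line[:-1]
    return stamps


def needs_update(stamps: dict[str, str], path: str, recorded_stamp: str) -> bool:
    """True if the file is listed and its datestamp differs from the one we noted."""
    current = stamps.get(path)
    return current is not None and current != recorded_stamp


# Hypothetical listing; only the filename lwmen13.txt comes from the discussion above.
sample = """\
./etext96:
total 2
-rw-r--r--   1 pg  pg   1050532 Sep 23  2004 lwmen13.txt
drwxr-xr-x   2 pg  pg      4096 Jan  1  2004 subdir
"""

stamps = parse_ls_lr(sample)
print(needs_update(stamps, "./etext96/lwmen13.txt", "Jan 5 2003"))
```

If the datestamp is recorded on the Wikisource page when a text is copied, anyone can rerun a check like this later and see whether the PG file has been revised since.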