Wikisource talk:Language domain requests/Moving pages to language domains


Moving pages to language domains

User:ThomasV has investigated this problem, and it would be great if he could summarize what he has learned so far here!

Since this is such a critical job, we may want to start a separate technical help page for it soon, but for now we might as well start right here.

Is there any possibility to make use of the language categories for this (i.e. a bot that might recognize them and transfer them to the correct domain)? Or a bot that might recognize certain language-specific characters in the title or contents of a page? Or might it be possible for a human to compose a list of page titles, and somehow "feed" it to a bot? Dovi 20:53, 10 May 2005 (UTC)

We already do have listings that are organized by language... the only problem we might have is preserving page history (which a bot cannot do). Ambush Commander 21:07, 10 May 2005 (UTC)

A proposal for moving pages

In order to keep the history of each page, we should not move pages to subdomains. It makes more sense to create a subdomain that contains a raw copy of the whole database, and then to delete what does not belong to it. This is, at least for the French subdomain, how I want to proceed.

So, I suggest that we list all the subdomains for which this creation method is requested. In order to reduce the load of work, it is better not to create all subdomains simultaneously, but to wait until enough pages have been deleted from the main subdomain. Given that the majority of wikisource articles here are in English, I think we should start with this language. Here is how I propose to proceed: (ongoing sections don't necessarily have to be completed in order for the next step to continue, but are eventually going to be finished).

  1. create en.wikisource.org as a copy of wikisource.org.
  2. delete all articles in English from wikisource.org. This phase would last a few weeks.
  3. (ongoing) delete all non-English articles from en.wikisource.org
  4. create fr.wikisource.org as a copy of wikisource.org
  5. delete all articles in French from wikisource.org. hopefully this phase will be shorter!
  6. (ongoing) delete all non-French articles from fr.wikisource.org
  7. create de.wikisource.org as a copy of wikisource.org
  8. delete all articles in German from wikisource.org.
  9. (ongoing) delete all non-German articles from de.wikisource.org

and so on...

Of course the order of creation might be different from what I suggest here. But it reduces the load to start with languages that have a lot of pages. I'd like to know who is interested in using this method. It implies that people are willing to wait a few extra weeks before their subdomain is created, and to take part in the deletion process (oh come on, we've been waiting for 8 months, what are a few weeks?)

ThomasV 22:48, 10 May 2005 (UTC)

I'm interested in this method. I really can't think of any other ways to go about moving pages to new subdomains while still preserving the page history. This method, while very tedious, would at least allow us to do that. Zhaladshar 23:46, 10 May 2005 (UTC)
I'm interested too. I will help deleting articles. --LadyInGrey 00:11, 11 May 2005 (UTC)
I do have a couple questions, though, about this. First, this will entail a LOT of deleting, won't it? I mean, if we copy ALL of Wikisource into en.wikisource.org, we will then have to delete all English articles from wikisource.org, but won't we also have to delete all non-English documents from en.wikisource.org? Since it IS the English version, all French, German, Spanish, etc. documents will have to go. Likewise, this will happen for every other subdomain as well, correct? Zhaladshar 00:22, 11 May 2005 (UTC)
The main problem with this method is it's not scalable: we'll need admins like Zhaladshar to go into overtime deleting articles like crazy, because we normal users can't delete articles. However, personally, I think this is the best way to go about separating the domains without nuking the page history. I also added an extra step, which states that the subdomains have to be cleaned too. Ambush Commander 01:28, 11 May 2005 (UTC)

I think what Thomas writes makes a lot of sense, though it will involve a huge amount of work, work that will fall entirely on the shoulders of administrators. We should try to think of ways to lessen that load.

One suggestion: When we speak about deleting all English articles from wikisource.org (after a raw copy is made at en:), followed by French, etc. - this should probably apply only to the main namespace.

But the project namespace ("Wikisource:"), with its policy and technical help pages, should probably not be deleted, even though most of their content is English. Those pages belong to the entire project, both because of their historical value, and so that the still relevant ones can be copied and adapted/translated for each language domain. Dovi 05:47, 11 May 2005 (UTC)

Sure, the namespace should not be deleted. concerning scalability, I guess we can create a few more admins for this particular task. if some people have problems with this, we can even make the statement that those creations are temporary. Concerning the amount of work, there's about 20,000 articles now, and about 45% of them are in English. For statistics, check here : Wikisource:Scriptorium#distribution_of_languages_in_wikisource. Okay they are a bit old and unreliable, but I guess it's ok to say that we'll have to delete about 10,000 articles in English during phase 2. I guess I'm able to delete more than 100 pages in one hour. ThomasV 06:48, 11 May 2005 (UTC)
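The estimate above can be checked with a few lines of arithmetic. This is only a minimal sketch using the figures quoted in the comment (about 20,000 articles, about 45% in English, roughly 100 deletions per admin per hour). The function name and the `admins` parameter are illustrative assumptions, not part of any existing tool.

```python
def deletion_hours(total_pages, share, pages_per_hour, admins=1):
    """Return (pages to delete, hours of work per admin) for one language phase."""
    pages = round(total_pages * share)
    hours = pages / (pages_per_hour * admins)
    return pages, hours


# Figures quoted in the discussion: ~20,000 articles, ~45% English,
# ~100 deletions per hour for a single admin.
pages, hours = deletion_hours(20_000, 0.45, 100)
print(f"{pages} pages, {hours:.0f} admin-hours")
```

With several admins sharing the work, the same call with a larger `admins` value gives the hours each of them would need, assuming they all work at the same rate.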
This definitely sounds like an interesting idea, and a good way to keep page histories. I am willing to give a hand in the deletion process, and I think that creating new admins to assist in this process is a very good idea (serious users only, of course, not totally at random). We should start, as Thomas V suggests, with the main languages (English, French, German, Chinese...) and delete all texts in those languages at the main site before copying the remaining contents to domains for smaller languages. I would definitely prefer to wait a month or two for a major part of the contents to be deleted before copying the remaining contents to the Danish subdomain, rather than having to delete some 19.000+ non-Danish pages at the subdomain :)
I also have a question: Can we start to add new content to the subdomains before the bulk of the main site is copied, or will that process delete all contents at the subdomain? --Christian S 13:21, 11 May 2005 (UTC)

In reply to Christian S's question - it seems to me the answer is definitely yes! There is no reason not to. Plus, not to do so would be to create even more extra work... :-)

This also brings up another issue: We need to decide on some sort of criterion for when language wikis should be set up. Any thoughts? Dovi 13:27, 11 May 2005 (UTC)

hum, I am not sure if I really understood the question, but it seems to me that the answer should be no, because I do not know if it is possible to merge two databases easily. anyway, this is a technical question, we should ask developers. concerning the schedule, I believe we should first request a creation of en.wikisource, that will be a copy of the current wikisource. then we'll see how long it takes to complete phase 2. phase 2 will last until some people decide that it is time to create their subdomain. ThomasV 13:37, 11 May 2005 (UTC)

You understood the question correctly: it was whether it is possible to merge two databases. If they can't be merged, then a subdomain shouldn't be created until it is ready to receive a copy of the current (reduced) wikisource database. I see two ways to do this:
1: Start with the 6 largest languages one at a time, that is English, then French, Romanian, Spanish, German and Polish. According to the stats at Wikisource:Scriptorium#distribution_of_languages_in_wikisource this adds up to about 86% of the texts. When the texts in these languages have been deleted we can create the rest of the language domains by copying the remaining part of the database, by then reduced to about 3,000 pages, a size that most languages should be able to clean up.
2: Start with the English domain, and then let the users who requested other language domains decide for themselves when they feel that the database is sufficiently reduced to be copied to their domain.
--Christian S 13:59, 11 May 2005 (UTC)
This project is going to take quite some time, obviously. We are toying around with the idea of adding new Admins to Wikisource for this, which is a great idea. But do we really have enough serious users who would be willing to do that? Most users register, add something, and never visit the site again. I can name just a handful of serious users--and most of them are admins!
Another question I have is that of sysop status among subdomains. Over at Wikipedia, sysop status on one subdomain does not entail status on another. You have to apply on every subdomain to become an admin. I'm sure the same will happen here, too, once language domains are created. But this seems like it will drastically reduce the number of eligible admins for any language domain to go about cleaning up each additionally created one (like English and French, which are immense). How will we get past this? Will all 19 admins on Wikisource.org be given sysop status on new subdomains as well to help with the clean-up? Zhaladshar 14:39, 11 May 2005 (UTC)

You are all right. I didn't think of the problem that it may only be possible to "dump" the entire contents of a database to a new location, and not to add all pages to an extant site. The only people who can answer all these questions are the developers. Dovi 15:10, 11 May 2005 (UTC)

Nevermind, I think I answered my own questions. Sysop status will be given to those who are requesting the new subdomain. That way, they understand the language and they are serious users. Zhaladshar 15:24, 11 May 2005 (UTC)
I agree that those who request a new subdomain are likely candidates for sysop status at the subdomains. Getting the necessary manpower to do the clean-up at the subdomain may be an issue, and maybe a database dump isn't the best solution for very small languages. The cleaning up is going to be quite a lot of work to save the history of, say, 10 or 20 pages in some small language, and it may be better (or at least a lot easier) for such very small languages just to sacrifice the histories of a small number of texts by using the transwiki method instead of a database dump. This will minimise the workload at least for the smallest languages, which will probably also have the lowest number of sysop candidates to do the job.--Christian S 15:52, 11 May 2005 (UTC)
It's by no means an ideal situation, but it might have to be done just out of necessity. Of course, for the smallest languages, a good number of articles (those in English, French, German, etc.) will already have been removed, so the process will be much lighter for smaller subdomains, and the three or four users who requested it will have a lot less work to do than those working on en.-/fr.-/de.wikisource.org. Zhaladshar 16:05, 11 May 2005 (UTC)
Yes, I agree that the transwiki method is only meant as an "emergency option", the general procedure suggested above (the database reduction and dumping) is definitely to be preferred if at all possible.--Christian S 16:33, 11 May 2005 (UTC)

Concerning the fact that only admins can do this, I bet we could greatly optimize this process by using a combination of bots and non-admin help. It could go something like this:

Bot Specs

  • Must hook into an admin account (I know, we're all leery about giving bots admin perms, but wait...)
  • Must be manually operated (that is, it speeds up the Admin in deleting pages, but isn't supposed to work by itself)
  1. Harvest links from a special page that non-admins compiled of pages that need to be deleted
  2. Present links, one by one, to admin. They mark it delete, or not-delete, and the bot takes appropriate action
  3. Bot automatically removes processed links from page and adds them to its own "ignore" cache
  4. After ~80% of the pages are done, switch to "All pages" view and finish by scanning all of that.

The main idea is to reduce the amount of interpage jumping admins have to do, so when they're deleting, it may only take, say 2 seconds for them to issue a verdict and then go to the next page. We could easily get a 1000 pages done in one hour.

Even if no one gets around to coding the bot (I might take a look into doing that), we need a listing of things that should be deleted that's easy for admins to traverse, and then remove links as pages are deleted or verified as 'keeps'. Perhaps we could get a database dump of all pages and put them into a wiki page (like Special:Allpages). With the bot, perhaps we could set up a web interface (okay, maybe it's not a bot anymore) that presents a list of pages, with little excerpts, and then you just check or uncheck them off. Does MediaWiki let you do that? What do you think? Ambush Commander 19:34, 11 May 2005 (UTC)
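The manually operated workflow in the bot specs above can be sketched as a small review queue. This is a hedged illustration only: `ReviewQueue`, `decide` and `delete_page` are hypothetical names, and the actual deletion is left to a callback because no real MediaWiki call is assumed here.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    """Pages nominated by non-admins, reviewed one at a time by an admin."""
    pending: list                                  # titles harvested from the nomination page
    ignored: set = field(default_factory=set)      # "ignore" cache of processed titles
    deleted: list = field(default_factory=list)    # titles the admin chose to delete

    def run(self, decide, delete_page):
        """decide(title) returns True to delete; delete_page(title) performs it."""
        while self.pending:
            title = self.pending.pop(0)            # remove the processed link from the list
            if title in self.ignored:
                continue                           # already judged once, never shown again
            if decide(title):
                delete_page(title)
                self.deleted.append(title)
            self.ignored.add(title)
```

The design keeps the verdict with the admin: the queue never deletes anything unless `decide` says so, which matches the requirement that the bot only speeds up a human and does not work by itself.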

I guess it's very dangerous to let a robot delete pages, but I must say that I too have been investigating the question of using a bot. My idea was to have the robot decide whether a page is in a given language, and then to perform deletion manually. but if you want to semi-automate deletions, one practical issue is that existing robots (at least those in pymediawiki) do not have a deletion capability. ThomasV 20:20, 11 May 2005 (UTC)
I could try coding one myself, I'd just need to do a bit of investigation with MediaWiki to see exactly how the HTTP interface works. Ambush Commander 20:47, 11 May 2005 (UTC)

First, I apologize for my poor English.

Second, I don't know exactly what a bot can and can't do. Perhaps a bot could select pages by language using a combination of language-specific characters (ç, é, è, à, ê in French, for example) AND the users' registered current language. In fact, I think a lot of us contribute in only one language (except on discussion pages like this one).

The steps can be :

  1. copy the entire database to each subdomain. The main database, wikisource, is frozen and used only as a backup copy, in case something goes wrong during the next steps. After the copy, contributions can be made only on the new subdomains.
  2. in each subdomain, create a specific page for each language.
  3. each language has its own bot, which calculates, for each page, the probability that the page is in the "right language". The calculation can be done as explained above.
  4. admins must flag the items they want to delete in their specific "language page" (for example, French admins do that in the "French language page"), and a bot can delete them
  5. admins must also flag items which are considered to be in the "correct language"
  6. a cross-comparison of the admins' flags for each language can be done by a bot to verify / validate the admins' selections, and to re-compute the pages which aren't flagged as "deleted" or "correct".
  7. the process ends when the "specific language page" is empty.

I think this method has some advantages :

  1. It minimises error or bad selection
  2. Time to do this is not important. The process can be done a long time after the creation of subdomain without blocking new contributions.
  3. Each subdomain is independent of the others.

Perhaps a combination of this method with the ThomasV's method could be a good solution ?

Nota Bene: some pages must be in more than one subdomain. For example, the Latin version of the Bible can be used by all subdomains. François Rey 23:53, 12 May 2005 (UTC)
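The character-based selection suggested above could look roughly like the sketch below. It is not an existing bot: the marker sets, the `min_hits` threshold and the function names are illustrative assumptions, and because many accented characters are shared between languages the result can only be a hint for a human reviewer.

```python
# Hypothetical marker sets; real texts would need larger and better-tuned sets.
LANGUAGE_MARKERS = {
    "fr": set("çéèàêùœ"),
    "de": set("äöüß"),
    "pl": set("ąćęłńśźż"),
}


def score_languages(text):
    """Count, per language, how many characters of the text are in its marker set."""
    lowered = text.lower()
    return {
        lang: sum(ch in marks for ch in lowered)
        for lang, marks in LANGUAGE_MARKERS.items()
    }


def guess_language(text, min_hits=3):
    """Return the best-scoring language code, or None when the evidence is too thin."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

A page with no marker characters at all (most English text, for example) returns `None`, so it would land on an "undefined language" list rather than being guessed at.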

Don't apologize for your English, it is not that poor.
In general I don't like the idea of a bot being able to delete pages. It may be quite a programming job to make a robot that can correctly distinguish specific languages from each other; the special character recognition is not necessarily enough, as many special characters are used, more or less frequently, in more than one language. If users have to register the language of each text anyway, then it would be a lot easier to just make a robot that recognises those language tags.
You suggest that admins should flag texts in their own language and create pages with lists of texts in a given language at all the subdomains. This seems like a cumbersome way to do it, as it takes just as long, or even longer, to flag a page and update a language list as it does to an admin to delete it manually, and if this tagging process must be done at each subdomain, then we have about 20.000 pages multiplied by the number of language subdomains (=MANY pages) to tag.
I don't agree with you that your suggestion makes the time it takes unimportant. When a subdomain is created I would like it to get cleaned up as soon as possible, so to me the time it takes and minimizing the workload is important.
I very much prefer the procedure suggested by ThomasV and let the creation of subdomains for small languages wait until the major part of the texts, that is, those in the largest languages, have been deleted from the main site (I consider my own language to be one of the smaller ones).
We shouldn't freeze the main database either, as it would make it impossible to use this domain for discussions about and coordination of the implementation of the subdomains. Instead, as the subdomains are being created, it should be strongly advertised that texts in the languages of created subdomains should be uploaded there and not at the main site (is it possible to use the MediaWiki:Sitenotice for this?). --Christian S 06:50, 13 May 2005 (UTC)
yeah, we can use Sitenotice for this. perhaps I should restate what I said above more clearly: I do not like the idea of having a bot delete pages automatically. it's just too dangerous. I'm sure a delete-bot would never be allowed in wikipedia. (and semi-automatically sounds kind of dangerous too).
in my opinion the latin version of the Bible should go to a latin subdomain if it is created. if no such domain is created, it should stay on wikisource.org. if a page contains a bilingual version of a text (for example, this ancient greek translation), then it should go to the subdomain of the language it is translated in.
concerning the creation of en.wikisource.org, I asked TimStarling about it. He said we'll have to wait a few more days, because he's too busy now. ThomasV 07:34, 13 May 2005 (UTC)
I agree that manual deletion is to be preferred over deletion bots, and yes, such a bot is definitely a no-go in most wiki projects.
The Bible in Latin should go to a latin subdomain (if created). There is no need for it in other language domains, as it will be as easily accessible by interwiki links as it would be if it were on any other language domain.
Thanks for the update on the creation of en wikisource, I really look forward to starting the process.--Christian S 07:56, 13 May 2005 (UTC)
On Latin, I agree with Thomas and Christian. Latin texts belong on la.wikisource when that is set up, and until then they belong here, linked to from the current Latin Main Page. Special editions, as Thomas pointed out (bilingual translations, or versions of Latin text with notes in modern languages) belong on the website of that language. Dovi 14:24, 13 May 2005 (UTC)
Furthermore, we are likely to get lots more questions and comments like this, so we need a place for people to discuss them and summarize the conclusions. Plus, we never formally decided our criterion for setting up language wikis (though the general trend is pretty clear, and nice, on the language requests page).
Suggestion: Let's open up pages on Language Policy (for general discussions of what pages go where, and how the subdomains relate to the "main pages" here) plus Start a new language edition (on the technical criteria for setting up a new wiki). These should be listed as links at the top of "Language domain requests" and at "Scriptorium." Dovi 14:37, 13 May 2005 (UTC)

I'm sorry, I didn't explain my idea clearly.

To make it easier to understand, I will describe how to do it for the French language, and we can imagine the same for the others.

First, a full copy of the wikisource database is made on the fr subdomain. From that moment, all French contributions must be made on this subdomain. I agree we must keep a common area shared with the other wikisource subdomains to continue this discussion or to exchange some data.

After the full copy, a bot will automatically create N pages, where N equals the number of different languages plus one page named "undefined language". Each page, for example the German page, contains wiki links to all supposedly German articles. Beside each link, a specific button can be explicitly pressed by the admin (the French admin) if he considers the article to be indeed a German article. He can verify this from the title of the article alone, if it is obvious, or by reading the content of the article.

The button can be a "suppress" button if the action can be done rapidly without re-reading the entire page, or it can be a "flag" button which indicates the article we must delete.

Then the admin can use a second bot which only deletes articles that are flagged.

These two bots don't decide which articles must be deleted; they only help the admin, first, to sort articles and, second, to delete flagged articles.

Collaborative actions could be done between all wikisource subprojects: the French admin can verify the French articles list, etc. In this case, the button will be a "confirm" button rather than a "delete" button. François Rey 11:19, 13 May 2005 (UTC)

the method I proposed does not require to program new tools, so it can be started immediately. I believe it is also more efficient, although I am not quite sure if I really understood what you propose. it seems to me that your method would take O(kN) deletions, where N is the number of articles and k the number of subdomains (not to mention the time spent programming bots). The method I proposed takes less than that. ThomasV 15:22, 13 May 2005 (UTC)
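The comparison above can be made concrete by counting deletions under each scheme. This is a sketch under stated assumptions: the per-language page counts are hypothetical (only the rough 45% English share comes from the discussion), the full-copy scheme is assumed to freeze the main site so nothing is deleted there, and the staged scheme is assumed to take languages largest first.

```python
def full_copy_deletions(counts):
    """Every subdomain gets all N pages and deletes everything not in its language."""
    total = sum(counts.values())
    return sum(total - n for n in counts.values())   # k*N minus N overall


def staged_deletions(counts):
    """Copy, clean, and shrink the main site one language at a time, largest first."""
    remaining = sum(counts.values())
    deletions = 0
    for _lang, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        deletions += remaining - n   # clean the fresh copy on the new subdomain
        deletions += n               # remove that language from the main site
        remaining -= n               # later copies start from a smaller database
    return deletions


# Hypothetical distribution of 20,000 pages over four languages.
counts = {"en": 9000, "fr": 5000, "de": 3000, "da": 3000}
print(full_copy_deletions(counts), staged_deletions(counts))
```

In the staged scheme each step costs exactly the number of pages still on the main site, so the total shrinks quickly once the largest languages are gone, whereas the full-copy total grows with every additional subdomain.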
I understand the method I proposed will put a high load of work on admins. In order to nominate more admins, I requested bureaucrat rights on wikisource.org. ThomasV 16:22, 13 May 2005 (UTC)
I must say that I agree with ThomasV on this: ThomasV's method is both faster and simpler, and by using his method we can also avoid the controversial deletion bots. Also, I don't see the need for all this sorting into languages. Let's take your example with a full copy at the fr domain. Then, at the fr domain, it would only be necessary to sort texts into two categories, "French" and "not French". Why confirm that a German text is in German, when it has to be deleted anyway as it is not French? And why not just delete the page at once, when it is being evaluated whether it should be kept or not? As an admin I find that just as easy. I support that we use the plan of ThomasV! That way the workload will be much less, especially for the smaller languages which are also the largest in number, than it would be if a full copy is made to every language subdomain, even with the aid of deletion bots.
I support that ThomasV gets bureaucrat rights. Have you made the request at meta:Requests for permissions? You probably should, as I'm not sure Ec will be around to grant you that status. --Christian S 16:43, 13 May 2005 (UTC)
Agreed. We need a bureaucrat around here. If deletion bots are not that preferable, then we don't have to use them (might take too long to program them anyway). So far, we've proposed:
  • Making a full copy of the database to language subdomains and then deleting from them in order to preserve page history
  • Make a global notification about the ongoing work? (personally, I think this is a must)
  • Creation of new pages to coordinate the language task (such as lists of English works, French works, etc.) Ambush Commander 21:27, 13 May 2005 (UTC)
I also prefer ThomasV's method. It should be kept simple. There are nearly 32,000 pages on Wikisource. With this method there are perhaps 60,000, 80,000 or so pages to be deleted. So you need admins and you need to organise it. So I suggest to announce the start of language domains and to invite all admins at all Wikimedia projects to help deleting the pages. I think most of them can be trusted and it's easier and faster to give them temporary admin status than to elect new admins for Wikisource. Then there should be a plan of who deletes which pages. You could create a list with entries like "Deleting all non-English pages starting with an A at en.wikisource" and a person who wants to do this task can add his name. So you would make sure that there are not 10 people searching for non-English articles starting with "A" while the articles with "Z" are forgotten. If you find 50 admins, each working 2 hours a day (about 250 deletions, depending very much on the servers), the work could be done in less than a week. --Jofi 21:44, 13 May 2005 (UTC)
I would prefer to promote regular users on Wikisource to permanent admin status than give some Wikipedia admins only temporary status. Besides, with the new language domains, each language is going to need a few sysops to do maintenance on those pages, so why not just give the users who are requesting a subdomain administrator status? That way, they'll probably be more inclined to stay with the project (which the Wikipedians will probably not be) and will kill two birds with one stone. Zhaladshar 21:49, 13 May 2005 (UTC)
Normally there are only very rare situations in which an admin has to delete a page. So you don't need many admins. But this is a different situation. If you wait until each sub-Wikisource has elected its admins and they have time and are willing to do the deletion jobs, then it will take many months until the work is done. So why not use "special" admins for a special situation? It's not important if they stay, it's important first to start the projects where they maybe stay in the end. --Jofi 22:22, 13 May 2005 (UTC)
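The task list and the time estimate in the comment above can be sketched as follows. The helper names are hypothetical, and the figure of about 250 deletions is read here as per admin per day, which is an interpretation of the comment rather than something it states outright.

```python
from collections import defaultdict


def batches_by_initial(titles):
    """Group page titles by their first character so each batch can be claimed once."""
    batches = defaultdict(list)
    for title in titles:
        key = title[:1].upper() or "?"   # empty titles go into a catch-all batch
        batches[key].append(title)
    return dict(batches)


def days_needed(deletions, admins, per_admin_per_day):
    """Days of work if every admin deletes at the same daily rate."""
    return deletions / (admins * per_admin_per_day)


# Upper figure from the comment: 80,000 deletions, 50 admins, ~250 deletions each per day.
print(days_needed(80_000, 50, 250))
```

Splitting by initial letter is only one possible partition; letters with many titles would give uneven batches, so a real plan might split the larger ones further.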
Hmm... we could do the elections while the cleanup is proceeding on the main domain! Ambush Commander 00:36, 14 May 2005 (UTC)

Ok, I suppose I should speak in English, so excuse me if mine is REALLY horrible. I was reading a lot about all the stuff, and I started speaking on it.wiki and on the Italian IRC channel, and now I have an opinion. First, a question:

  • Is it possible, for developers, to move pages from wikisource.org to xx.wikisource preserving histories?

If so, wouldn't it be faster and smarter to begin by moving articles in a language that is less used? I don't know if my opinion is intelligible, so I'll give an example:

  • Let's suggest we have language xx. On wikisource there are a dozen articles written in xx. Developers should only create xx.wikisource and move a dozen articles. Then xx.wikisource should be able to start working

This is the mirror image of ThomasV's way. Why should we do it this way? Simple: the moving and deleting work is similar, BUT:

  • In the same time, we'll have more local wikisources operative. With ThomasV's way, we'll have en.wikisource in 2 weeks (i hope), with mine we'll have xx.wikisource, yy.wikisource, zz.wikisource
  • Minor wikisources should not be forced to wait a lot of time, until majors end with their work.

Ooooookay. I don't know if I was clear, so... in case, just ask. --Gatto Nero 16:13, 14 May 2005 (UTC)

Well - it sounds reasonable, but there is one MAJOR problem in what you suggest: There is a limited number of developers, and they are already overloaded with development tasks. Moving pages from one domain to another will probably be of very low priority to the developers. Deleting as ThomasV suggests can be done by ordinary admins, and new admins to assist in the process can be created relatively easy, while developer access is harder to get (with developer access you can really mess things up if you are not sure what you are doing). I don't think anybody would be granted developer status just to move articles. Not that your suggestion is bad, I just don't think it is workable, as minor (or major) wikisources probably will have to wait a long time for developers to move the relevant pages due to low priority compared to development tasks and bug fixes.--Christian S 16:50, 14 May 2005 (UTC)
Uh, you're right too. Guess should we ask to developers before, maybe. I think one or two developers should be enough to make the work, leaving at the end (to admins) all the "deletion" work. It depends on them, so... --Gatto Nero 17:00, 14 May 2005 (UTC)
If it is technically possible (whether it is or not, I have no idea) you may be able to persuade a developer to move a small number of pages to a small language domain. But don't expect them to be willing to move hundreds of pages. And there is no guarantee that subdomains for the smallest languages will be requested in any near future. Unless subdomains for one or more of the smallest languages are requested it is probably not relevant to ask the developers.--Christian S 18:56, 14 May 2005 (UTC)

List of admins who want to take part in the deletion work

To use ThomasV's method there are admins of this (the main) Wikisource needed who want to do the deletions here. So if you want to, please make a note here. --Jofi 22:56, 13 May 2005 (UTC)

  • Ambivalenthysteria ?
  • ArnoLagrange ?
  • Brion VIBBER ? (inactive since Dec 2004)
  • Caton ? (= Marc)
  • Christian S ?
  • Dr Absentius ? (inactive since Apr 2004)
  • Eclecticology ?
  • Kalki ? (inactive since Dec 2004)
  • LadyInGrey ?
  • Marc ?
  • Maveric149 ?
  • Menchi ? (inactive since Sep 2004)
  • Mxn ?
  • Samuel ? (inactive since May 2004)
  • Shin-改 ?
  • Shizhao ?
  • ThomasV ?
  • Yann ?
  • Zhaladshar ?
I am not sure if this list is useful; many of these admins are inactive. in addition, I believe we will create new admins for this deletion work. ThomasV 08:56, 14 May 2005 (UTC)
The list is useful if there won't be additional admins. Following your suggestion, for example the Italians have to wait until the English, French, German, ... articles are deleted before they can start their Wikisource. If there are too few admins doing the work and the deletions take a long time, the Italians (or others) may decide not to wait until the deletions are done, but to start with a complete copy of Wikisource at once. --Jofi 09:32, 14 May 2005 (UTC)
If, say, the Italians or others don't want to wait, and the community requesting the language subdomain agrees on it, then I see no problem in letting them start with a full copy. The suggested method will decrease the workload at the subdomains, but if the individual language community prefers the greater workload to waiting, then why not? I think waiting for some of the major languages to be deleted will be beneficial to many language communities, and that most of them can see the point in waiting. Meanwhile they could always help deleting pages here to speed up the process and hence reduce the waiting time - what they delete here they won't have to delete at their own domain.
About the list of admins, I think it would be better to start two new lists: one for current admins who want to help and one for non-admin users who want to become admins in order to help.
--Christian S 11:24, 14 May 2005 (UTC)
Certainly it's better to wait until the deletions are made, than to start with a full copy. So there have to be many people able to delete pages to decrease the waiting time. --Jofi 11:44, 14 May 2005 (UTC)

Count me in as far as helping out with admin chores here. I've been keen on this project ever since Ec and I were the only two active admins, and I'd be equally keen to come back to see it to fruition. Ambivalenthysteria 05:41, 20 May 2005 (UTC)

Count me in as part of moving pages to language domains, but I need technical information on how to move pages with their edit history.--Jusjih 01:51, 25 August 2005 (UTC)

List of non-admin users who want to take part in the deletion work

Enter your name here if you want to help delete the pages. If you aren't known at Wikisource, add a short description. --Jofi 11:44, 14 May 2005 (UTC)

  • Jofi (admin at de.wikipedia)
  • François Rey (at Wikisource since April 9, 2005; at Wikipedia since November 2004). At Wikisource I am working on the French Bible (improving the presentation, splitting texts into chapters) and adding Les Misérables. I will probably not be active at the end of June and the beginning of July because I am moving house.
  • Ambush Commander (wikisource since May 8th, Wikipedia since August 2004 with 500+ edits) Currently, on wikisource, I'm working on The Economic Consequences of the Peace, the tables are really killing me (I so need to use regexps), and I know it's been quite a short time since I've joined wikisource, so I don't mind if I don't get a chance to help.
  • Jackobill (at Wikisource since April 28th 2005; I have made 120+ edits for the French Wikisource) I know I look pretty new, but I think the number of edits I made in less than one month is proof that I want to help. I added L'Île mystérieuse.
  • Dovi 18:57, 21 May 2005 (UTC)
Given the upcoming Special:Import feature (see below), I guess the copy-and-delete proposal is obsolete. User:ThomasBot has already created almost-complete lists of pages per language. You may help the bot by categorizing pages. ThomasV 19:26, 21 May 2005 (UTC)

What about google?

It is annoying to search for a term, find a link, click on it and find an empty page. It would be nice to preserve the links, at least for a few months until the search engines update their links. So, http://wikisource/wiki/stuff should have an HTTP redirect to http://fr.wikisource.org/wiki/stuff. Bogdan 14:55, 14 May 2005 (UTC)

First of all, great idea! Second, it requires MediaWiki changes. Third, it requires more work for admins. Fourth, can we compromise? Instead, put a global notice on all empty pages saying that what you're looking for *might not* be missing, followed by a list of links to all the new subdomains and their respective entries (telling readers to go to the language of the text they want), and maybe they'll find it. Ambush Commander 15:54, 14 May 2005 (UTC)
Yes, you only need to edit MediaWiki:Nogomatch and give the users links to articles with the same name in the subdomains. They will know what language they want and it's nearly no work for the admins. But the articles shouldn't be renamed in the subdomains. --Jofi 17:36, 14 May 2005 (UTC)
However, setting up Search Engine friendly HTTP redirects is going to be difficult. Let's just let google reindex the whole thing. Ambush Commander 19:46, 14 May 2005 (UTC)
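As a rough illustration of Jofi's suggestion above (offering links to the same title on each subdomain), here is a minimal Python sketch. The subdomain list and the function name are assumptions made for the example; this is not how MediaWiki:Nogomatch itself works.

```python
from urllib.parse import quote

# Hypothetical list of language subdomains; the real set is whatever gets created.
SUBDOMAINS = ["en", "fr", "de"]


def subdomain_links(title, subdomains=SUBDOMAINS):
    """Return candidate URLs for the same page title on each language subdomain."""
    # MediaWiki page URLs use underscores in place of spaces.
    path = quote(title.replace(" ", "_"))
    return [f"http://{lang}.wikisource.org/wiki/{path}" for lang in subdomains]
```

This only works as long as the articles keep their names when they arrive at the subdomains, which is the condition Jofi mentions.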

Creating a list

For the languages with a smaller number of articles, it would be easier to move the pages by creating a list with all the articles in that language, then:

  1. copy the main database to the new subdomain database
  2. a bot in the language's wikisource deletes all the articles except the ones in the list;
  3. a bot in the main wikisource deletes only the articles in the list.
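The three steps above amount to splitting the copied database into two sets. A minimal sketch, assuming page titles are plain strings; the function name and the result keys are made up for illustration, and no deletion bot is actually called:

```python
def plan_deletions(all_titles, language_titles):
    """Work out what each delete pass would remove after the database is copied."""
    all_titles = set(all_titles)
    # Ignore list entries that do not exist in the database.
    keep = set(language_titles) & all_titles
    return {
        # Step 2: the subdomain bot deletes everything that is not on the list.
        "delete_at_subdomain": sorted(all_titles - keep),
        # Step 3: the main-wiki bot deletes only the listed articles.
        "delete_at_main": sorted(keep),
    }
```

Every existing title lands in exactly one of the two results, so a page would either stay at main or stay at the subdomain, never both and never neither.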

For the Romanian language, for example, I could easily create one, as most of pages were contributed by me and I kept a pretty consistent standard.

Should I start making the list ? :-) Bogdan 19:07, 14 May 2005 (UTC)

Hum... I am not in favor of using a delete-bot, at least not in the main wikisource. It's just too dangerous.
For a subdomain, I guess a delete-bot could be used; however, it has yet to be written, tested, debugged, etc...
Concerning Romanian, I thought it was one of the main languages in wikisource. But I do not speak Romanian, so I am not really able to tell if a page is in Romanian or in some other language that resembles it. This means that the statistics I made are probably wrong, at least with respect to Romanian (see those stats in the Scriptorium). Since you know that language, maybe you could help fix those stats. ThomasV 19:21, 14 May 2005 (UTC)
Going through all articles to create a list of articles in a specific language should be almost the same work as going through all articles to delete articles in a specific language. I think the slowest part of the work won't be human beings but the servers and I prefer solutions without deletion bots. --Jofi 21:02, 14 May 2005 (UTC)
No, it's not the same work. Take this author, Autor:Ion Luca Caragiale, for example. To make a list of the 350 pages from that page you need only one click and a copy and paste. To delete those 350 pages manually, you need 350 x 3 clicks = 1050 clicks. Bogdan 21:49, 14 May 2005 (UTC)
It's less work if there is such a list. But many texts are not listed in lists or the lists are not divided into languages. And what you mentioned is only the list of one author, not the list of pages in one language. But if you want, you can create such a bot. If it's really safe and a copy of the database is made, one could try it out. --Jofi 22:44, 14 May 2005 (UTC)
::aside - Dang, that's a lot of pages we're going to be deleting. We haven't put it in perspective, have we? - endaside:: Does MediaWiki have "mass deletion" tools? Ambush Commander 22:29, 14 May 2005 (UTC) Nope, I took a look at my own installation of MediaWiki and it didn't seem to have anything like that.
A "delete instantly without question if i'm sure" button would be very useful. But if there are enough people helping it should be possible without extra tools. --Jofi 22:49, 14 May 2005 (UTC)
::laments:: The whole problem is that Wikimedia wasn't built for transwiki transfers. Meh. Ambush Commander 23:19, 14 May 2005 (UTC)
I swung over to bugzilla and found this: bugzilla:606, they've set it for the 1.5 release. I don't know how long that's going to take though. Ambush Commander 02:53, 15 May 2005 (UTC)
I don't know if it would work well for so many pages, if the revision history shall be kept, because it uses the xml-export without compression or anything else. Any minor edit in a 100kb page results in another 100kb to extract, move and import. --Jofi 09:39, 15 May 2005 (UTC)

To Wait? 1.5 MediaWiki promises Full History exports/imports of Articles

I sort of started a discussion in the list article, but I think it deserves its own topic. Bugzilla:606 has the bug report for Special:Import, an automated transwiki system. It is set for the 1.5 MediaWiki release.

If we wait for this system to be implemented, we may be able to stop our admins from having to delete loads of articles: instead, we will have them man the Transwiki tools and start moving pages.

Jofi brought up one point that the Special:Import system may not be well adapted for mass moves of pages, as marking it up in XML will inflate the page size. However, we must also contrast this with the amount of deleting the other method will take. We are, effectively, deleting the entire Wiki for just ONE copied database: a percentage consisting of the English entries from the main, and the rest from the English copy. Should we wait? Ambush Commander 00:18, 16 May 2005 (UTC)

Does anybody know when the 1.5 MediaWiki is expected to be released? If it is to be released soon, then it can be an interesting option to wait for it, but I'm not sure we want to wait many months for the release. --Christian S 10:51, 16 May 2005 (UTC)

I do not know, but this feature is probably already in cvs, so we do not need to wait for the release in order to test it. It would be great if somebody could test it. ThomasV 11:30, 16 May 2005 (UTC)

As far as I can see, only the export feature is up and running, but it is still not possible to import pages into a wiki. So, we can export pages with histories to a simple XML-file, but we have to wait for 1.5 in order to import the XML-files into the new language-wikis. Pity, it would be a really great feature to get up and running right now! --Christian S 12:10, 16 May 2005 (UTC)
It's here. It needs WikiError.php (which will come in 1.5 but is not important for SpecialImport) and doesn't seem to work yet. It simply uses Special:Export from the source and then (should) import the XML file. But we don't have to wait until 1.5. There are no dependencies. Special:Import just has to be completed. --Jofi 12:18, 16 May 2005 (UTC)
I have to correct myself: There ARE dependencies. It can't be used until 1.5 is ready. --Jofi 12:57, 16 May 2005 (UTC)

I guess if this feature is available soon, we should wait for it. However, I believe there will be no big difference between current cvs and 1.5. If it does not work in cvs yet, I do not believe it will be in 1.5. So what should I do? I already asked Tim to create en.wikisource.org (which he seems to have forgotten); maybe we should ask on meta what the best strategy is. ThomasV 12:28, 16 May 2005 (UTC)

Brion Vibber has called it a 1.5 release blocker. So it should be in 1.5. Then we have to make lists of articles in a specific language, import them (only admins of the subdomain), mark them as deletable in main wikisource and delete them (only admins of main). --Jofi 12:36, 16 May 2005 (UTC)

Are you sure that this feature would really save us some work? What you describe is about 4 operations per article. I do not know exactly how many deletions my method will involve, but if we make the assumption that languages have a geometric distribution (one language accounts for 50% of articles, the next language 25%, the next one 12.5%, and so on...) then it is possible to show that my method involves 2 deletions per article. The actual distribution of languages is different from this simplified case, but not very different. ThomasV 12:44, 16 May 2005 (UTC)
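The "2 deletions per article" figure can be checked with a short calculation. The sketch below is one reading of the copy-then-delete proposal, not something stated in the discussion: languages are split off largest first, each new subdomain starts as a copy of whatever is still on main, and the last language is split off like the others. The function names are invented for the example.

```python
from fractions import Fraction


def deletions_per_article(shares):
    """Average deletions per article when languages are split off largest first."""
    remaining = Fraction(1)  # fraction of all articles still on the main wiki
    total = Fraction(0)
    for share in shares:
        # The new subdomain is a copy of what remains on main:
        # (remaining - share) is deleted there, and share is deleted on main.
        total += (remaining - share) + share
        remaining -= share
    return total


def geometric_shares(n):
    """Shares 1/2, 1/4, ..., 1/2**n plus a last language holding the remainder."""
    shares = [Fraction(1, 2 ** i) for i in range(1, n + 1)]
    shares.append(Fraction(1, 2 ** n))
    return shares
```

Under these assumptions the total is 1 + 1/2 + 1/4 + ..., which approaches 2 as the number of languages grows, in line with ThomasV's estimate.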

I'm not sure what's better. Your suggestion: Go through "Allpages", click on article to delete, click on delete, select "I'm sure", click on delete. Do this about 60,000 times. Using SpecialImport: Go through "Allpages", add each article to a language list (copy&paste), add those lists partially to SpecialImport, make sure the articles are really imported, go to the articles at main wikisource, mark them as deletable, delete them. Do this about 30,000 times. --Jofi 13:06, 16 May 2005 (UTC)

MediaWiki 1.5 is due to go live around June. I've just asked Brion Vibber about this and he says Special:Import will definitely be part of 1.5, and that this will be the best way of splitting off the various language versions from here. Angela 13:14, 16 May 2005 (UTC)

Jofi: from what you describe, I believe my method is faster. Note that yours also involves a deletion. Another point is that my method can be performed by many users in parallel, it does not require coordination; otoh if people start building lists, they are likely to have conflicts.
Angela: did Brion say this specifically for wikisource?
ThomasV 13:36, 16 May 2005 (UTC)
Yes, he said this today specifically about Wikisource. Also, there is a deletion bot program available. See w:User:AngBot for example. Angela 14:40, 16 May 2005 (UTC)

After speaking to Tim Starling about Special:Import, I strongly suggest this method is used. The operation would need to be done once in total, not once per article. You just feed the list of articles to be moved to the French Wikisource into it, and it moves them all at once in one simple operation. Then you feed that same list to the deletion bot and it removes them from the original Wikisource. All that is needed is a list of which articles need to move where. Angela 14:49, 16 May 2005 (UTC)

Has the bot been tested? I don't see many "user" contributions. If it's the best method, we should start creating lists with articles by language. --Jofi 15:10, 16 May 2005 (UTC)
There is nothing in its contribution list because deletions are not recorded in a user's contributions (this might change in 1.5). It was run by a number of people to clear up page creation vandalism on the Chinese Wikipedia last year. It worked fine at the time, but I will check with the people maintaining the Pywikipedia framework that it will still work under MediaWiki 1.5. Angela 15:34, 16 May 2005 (UTC)
There is nothing in its contribution list because deletions are not recorded: Oh yes, certainly. I should have known that. A bot whose only activity is to delete is just so strange for me, that I didn't think about it. --Jofi 16:19, 16 May 2005 (UTC)
Following the irc discussion with Tim and Angela, I believe the Special:Import method has drawbacks: creating the list will take a lot of work (about as much as deleting pages, and maybe more if coordination between users is required). However, if we could have a robot categorize the pages, it would be much easier. There are programs around that can tell if a text is written in French with pretty good accuracy. I previously did not want to use such a program (see above) because I was considering using it for the deletion process, where a mistake would be disastrous. However, I believe it makes sense to use such a bot with the Special:Import feature. I'm willing to give it a try. ThomasV 15:22, 16 May 2005 (UTC)


Here is what needs to happen if the Import method is used:

  1. Make a list of which pages belong to which language:
    There are two ways to do this.
    1. Add a category tag to every page stating which language it is in, then derive the list for Special:Import automatically from that. No admins are needed for this. This could be done with a bot (or many bots).
    2. Create a manual list like Wikisource:List of French articles and add the relevant articles there. No admins are needed for this.
  2. When the list is finished, one person pastes the list in Special:Export, saves the result, then pastes the same list into Special:Import. An admin needs to do the Import. This is one operation per language, not per article.
  3. Run a deletion bot to remove the articles that were just imported. The deletion bot already exists and can be run by multiple people if necessary.

Angela 15:34, 16 May 2005 (UTC)

I've just updated this since the export part isn't necessary. Special:Import can work without first exporting anything. Angela 22:45, 16 May 2005 (UTC)
I tried Special:Export with about 10 articles. It took very long. So I would prefer the interwiki-import tool, if it's ready. Then the users wouldn't have to save a local copy and send it to the wikimedia servers again. --Jofi 16:19, 16 May 2005 (UTC)
I wouldn't add category tags to the pages. Every version of a page has to be added to the XML file, so this could almost double the amount of data that has to be transferred. All articles (especially the long ones) should better be left untouched. --Jofi 16:33, 16 May 2005 (UTC)
A manual list would definitely be better than adding categories. If we should add a language category we would have to edit all the pages here, and again after they have been moved to the subdomains, in order to remove the category again as it will then be obsolete. A manual list can be created by simple copy-pasting of page titles into an edit page in another browser-window. --Christian S 17:49, 16 May 2005 (UTC)
If 1.5 goes live next month I think we should wait for it and then use the import tool. That is at least what I intend to do with da.wikisource, considering the recent info from Angela. --Christian S 18:49, 16 May 2005 (UTC)


(Sorry for my English) Instead of creating a list of articles for a language, is it possible to use the current categories and move all the articles in that category? For example: Category:Español. --LadyInGrey 20:03, 16 May 2005 (UTC)

You could use the current category pages as a basis for the export/import list, but then you'll miss all the uncategorized pages. I don't know how many pages that would be for Spanish, but there are many uncategorized pages in English, as well as in other languages. As stated above, categorizing, exporting and uncategorizing the uncategorized pages (i.e. Category:Español will not be relevant once the pages in Spanish have been moved to the Spanish subdomain) will probably mean more work than manually creating an export list of the uncategorized pages, followed by the export. --Christian S 20:30, 16 May 2005 (UTC)
I do not agree. All the work is in the "create the list" step. exporting, according to Tim and Angela, can be done in one step once the list is provided. uncategorizing can be done by a bot.
In order to create the list, there are two possibilities: Categorizing all pages that belong to a given language, or creating a list of these pages, that would take place in a special page. The first method is much better, because 1/ with categories, it is easy to know whether a page already belongs to the list and 2/ if we create the list manually there will be access conflicts between users.
ThomasV 21:19, 16 May 2005 (UTC)
We now have to move about 4 GB of data (gzip compressed >190 MB current, >780 MB old, according to http://dumps.wikimedia.org/). If you use SpecialExport and SpecialImport for all English pages at once, an (uncompressed) XML file of about 2 GB size has to be created, you have to download it to your local hard drive and then to upload it again. I can't believe that this will work.
If you categorize all pages, you have the problem that the revision history grows for nearly every page. All "current" pages would become "old" and you have an additional ~400 MB of data. O.K., it's only +10%, but with lists, for example sorted by first letter, this additional data wouldn't be necessary. --Jofi 22:02, 16 May 2005 (UTC)
Jofi: you mentioned it took long to export 10 articles. how long? did you test on a local install of mediawiki? would it work for smaller languages? ThomasV 22:07, 16 May 2005 (UTC)
I also tested this. With the 12 articles in Category:Christmas carols it took less than 2 seconds. However, with the first 188 articles of Category:Deutsch, I first got a "The Wikimedia web server didn't return any response to your request" error, and then got an "XML Parsing Error". When I mentioned this to Brion, he said "we probably wouldn't do it from the web page. it's the backend system... There will be an automated import page. Someone with sufficient privs will be able to put in the name of the page to import and *poof* it comes in." So, the process is even simpler than I said above since exporting first is not required. Angela 22:45, 16 May 2005 (UTC)
I only tested the export function at Wikisource: the first 9 German articles starting with "A" (each far less than 100 KB) took somewhat more than 1 minute (I didn't count), resulting in an XML file of 2.5 MB. Depending on the revision history, it becomes too much data for the servers and the users to download and upload. But if it works within the wikimedia servers, it shouldn't be a big problem. --Jofi 22:58, 16 May 2005 (UTC)
I tried it now with a single article (Kritik_der_Urteilskraft (765,109 bytes)). I also got "The wikimedia web server didn't return any response to your request." 3 times. One should try it with a smaller number of files, using the wikimedia import from within the servers. Maybe the English articles have to be split into several lists, but that's better than crashing the servers. --Jofi 23:14, 16 May 2005 (UTC)
I suppose the English subdomain is about half of all articles. A good test of capacity would be to check whether this method is able to export all articles and reimport them into a new wiki. Tim suggested it is possible... but I really would prefer to know for sure. ThomasV 23:22, 16 May 2005 (UTC)
It sounds great that we can drop the exporting step. "Someone with sufficient privs", would that be an admin? --Christian S 05:42, 17 May 2005 (UTC)
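Jofi's idea of splitting a long article list into several smaller lists could be sketched as follows. The function name is hypothetical and a suitable batch size is unknown; it would have to be found by testing against the servers.

```python
def split_into_batches(titles, batch_size):
    """Split a list of page titles into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [titles[i:i + batch_size] for i in range(0, len(titles), batch_size)]
```

Each batch could then be fed to the import separately, so that a failure only affects one batch rather than the whole language.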

Steps to create a subdomain so far:

Here is a summary of the suggestions. I added some bots that haven't been created yet. I hope it won't be too difficult to create them. --Jofi 17:49, 17 May 2005 (UTC)

  • anybody: use "language bot" at main to get a list of pages in a specific language
  • anybody: look in the list to see if the entries are correct
  • admin at main: use "protection bot" to prevent the listed pages from being changed
  • admin at subdomain: use interwiki-import to import the listed pages to subdomain
  • anybody: use "comparison bot" to compare the pages at the subdomain with main (are they really completely imported?)
  • admin at main: use deletion bot to delete listed articles from main
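The "comparison bot" in the list above does not exist yet. One way it might work, assuming each wiki's pages have already been fetched into a mapping from title to a list of revision texts (the fetching itself is left out, and all names here are hypothetical):

```python
def compare_import(main_revisions, subdomain_revisions, titles):
    """Return the listed titles whose history at the subdomain does not match main."""
    problems = {}
    for title in titles:
        expected = main_revisions.get(title)
        actual = subdomain_revisions.get(title)
        if actual is None:
            problems[title] = "missing at subdomain"
        elif actual != expected:
            problems[title] = "history differs"
    return problems
```

Only titles that come back with no problem would be safe to hand to the deletion bot at main.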

Bot

I wrote a Bot that guesses in which language pages are written: User:ThomasBot. The goal is to generate lists that will be used in the Special:Import feature. (assuming this feature will work...)

The bot seems to work well for major languages, such as French or German. There are some miscategorized pages, but it is possible to categorize them manually. If some pages are forgotten, it will also be possible to move them later manually. So I believe it will be possible to organize the split in a relatively painless way...

The bot has a tendency to categorize English as Scots, but that is not really a problem as far as we are concerned; we'll put them together :-). It does not categorize redirect pages. ThomasV 15:56, 17 May 2005 (UTC)

It's a good start, but there are some misinterpretations I can't understand. How can the bot think ""Morning" -- means "Milking" -- to the Farmer --" might be German? A title with "the" and "to" in it should be certainly recognized as English. --Jofi 17:46, 17 May 2005 (UTC)

The bot does not take the title into account, but the content of the page (fortunately).
It makes mistakes. The text you mention is very short. The longer the text, the higher the chances that it finds the right language. ThomasV 18:04, 17 May 2005 (UTC)
I just checked again: the answer of the bot on that example was correct. It enumerates the possibilities in order of likelihood. It said that German might be a possibility, but not as likely as English. ThomasV 18:09, 17 May 2005 (UTC)
OK. I thought it wouldn't be ordered by likelihood. But for some other pages ("They have not chosen me," he said,) it doesn't know what language it is. But the page has words in it ("they", "he", "have", "could") that, in my opinion, should make it clear. I don't know how the bot works. Perhaps you could feed it additional sample texts, or decrease the number of hits it needs to be sure? --Jofi 21:39, 17 May 2005 (UTC)
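How ThomasBot actually guesses languages is not described here. Purely as an illustration of the two behaviours discussed above (candidates ranked by likelihood, and "unknown" when a text gives too few hits), here is a toy word-list guesser. The word lists, the `min_hits` threshold and the function name are all assumptions.

```python
import re

# Tiny illustrative word lists; a real guesser would use far larger models.
COMMON_WORDS = {
    "English": {"the", "to", "they", "he", "have", "could", "and", "of"},
    "German": {"der", "die", "das", "und", "nicht", "ist", "zu", "ein"},
    "French": {"le", "la", "les", "et", "est", "de", "un", "ne"},
}


def guess_languages(text, min_hits=3):
    """Return candidate languages, most likely first, or ["unknown"]."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    ranked = sorted(
        (lang for lang, score in scores.items() if score > 0),
        key=lambda lang: -scores[lang],
    )
    # Too few hits for the best candidate: refuse to guess.
    if not ranked or scores[ranked[0]] < min_hits:
        return ["unknown"]
    return ranked
```

Lowering `min_hits` corresponds to Jofi's suggestion of decreasing the number of hits needed; it classifies more short pages at the cost of more wrong guesses.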
I haven't read this page for a couple of days, and a lot of text has been added. It's hard reading for a poor Frenchman like me!
I am happy that we can use bots to make the move to each language easier.
I would like to propose a few things:
The deletion at the main wikisource can (must?) be done a long time after the import; so if we detect an error later (for example, an English text mistakenly imported to the French subdomain), we can easily repair it.
Perhaps Thomas's bot could create a "list of French pages" with a recursive algorithm, which begins at the "French main page", takes all pages listed on this main page, then does the same with the new pages it has just listed, and so on until the end of each branch (the leaves).
The same would be done for each language.
A "comparison" bot could check whether some pages are in two or more lists.
Another "comparison" bot could check whether there are pages that are in no "language list" at all.
This way, manual work will be minimized. François Rey 22:05, 17 May 2005 (UTC)
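The two "comparison" checks proposed above (pages on more than one list, and pages on no list) could be sketched like this, assuming the language lists are available as a mapping from language to page titles. The names are made up for the example.

```python
from collections import defaultdict


def check_language_lists(all_titles, language_lists):
    """Find titles listed under several languages and titles listed under none."""
    seen = defaultdict(set)
    for language, titles in language_lists.items():
        for title in titles:
            seen[title].add(language)
    duplicates = {
        title: sorted(languages)
        for title, languages in seen.items()
        if len(languages) > 1
    }
    unlisted = sorted(set(all_titles) - set(seen))
    return duplicates, unlisted
```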
We don't have to wait long until the pages at main can be deleted. If a page came to a subdomain by mistake, it can easily be re-imported to main as it had been imported to subdomain.
A recursive algorithm wouldn't be useful, because everywhere (sometimes already at the main page) there are links to pages in a different language (translations for example). So a recursive algorithm could end up in declaring all pages to be French.
The language bot should go through all existing articles and put each on a (one) list. If it's not sure it should be added to a list "language unknown". So no page could be added twice or be forgotten. --Jofi 22:19, 17 May 2005 (UTC)
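Jofi's objection to the recursive approach can be seen in a small link-following sketch. The link graph and page names in the usage note are hypothetical.

```python
from collections import deque


def reachable_pages(links, start):
    """Follow links breadth-first from start and return every page reached."""
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return visited
```

If a hypothetical French main page links to "Les Misérables" and also to an English "Main Page", the crawl returns the French chapters and every page reachable from the English side as well, so a single cross-language link is enough to pull another language's pages into the French list.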
I'm not sure there are a lot of links to pages in a different language, except for "wikisource administrative pages" like the Scriptorium, "how to", ... The advantage of a recursive algorithm is for a "book" which has a "summary page" and many chapters (like Les Misérables). Only the summary has a category tag, not the chapter pages. Will these chapter pages be recognized by the bot? François Rey 22:35, 17 May 2005 (UTC)
ThomasV's bot looks at the content of a page and also, but not only, looks if there is a category tag. Here is a part of the result: User:ThomasBot/visitedpages. --Jofi 22:44, 17 May 2005 (UTC)
Jofi: I do not think that it would be easy to train the bot on additional text (see above); I believe it is already optimized. Typically, if you provide additional training to this kind of AI, it will perform better on the text you just provided, at the expense of other texts, so the overall performance might become worse... However, one thing you could help with is to identify sequences that determine the language of a text in wikisource. For the text you mentioned, I added the template "{{Emily Dickinson Index", which can be found in a lot of English pages. See User:ThomasBot. ThomasV 05:59, 18 May 2005 (UTC)
It's not a really good algorithm (more than 1,500 unknown), but as long as we have nothing better we should use it. I added some sequences at User talk:ThomasBot. --Jofi 21:19, 19 May 2005 (UTC)

Please do not judge this bot by the sheer number of unknowns (hey, now more than 2000!). Actually, I'm pretty happy with the last version. Almost all French pages are correctly categorized. Most of the uncategorized pages (i.e. "unknown") are in non-western alphabets, and are written in languages the bot does not know about. If you look at other uncategorized pages, you'll see that there's not enough text on them. Thanks for the new sequences. Now, I think the rest of the work should be done manually. Just check the "unknown" list, and if you find a page that the bot should have recognized, add to it a sequence the bot knows (like "[[Category:English]]" or "[[Category:Mathematics]]" for example). See the sequences at User talk:ThomasBot. ThomasV 09:27, 20 May 2005 (UTC)

The bot does not recognize Russian. Whoever wants to help can also add the tag [[Category:Русский]] to all the Russian pages they find! ThomasV 12:01, 20 May 2005 (UTC)
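The "sequences" mentioned above can be thought of as a lookup from marker text to language. The sketch below is an assumption about how such a lookup might look, not ThomasBot's code; the marker strings come from this discussion, while the language each one maps to is assumed for the example.

```python
# Marker sequences taken from the discussion; the mapped languages are assumed.
MARKERS = {
    "[[Category:English]]": "English",
    "{{Emily Dickinson Index": "English",
    "[[Category:Русский]]": "Russian",
}


def language_from_markers(wikitext, markers=MARKERS):
    """Return the language of the first known marker found in the page, else None."""
    for sequence, language in markers.items():
        if sequence in wikitext:
            return language
    return None
```

A page with no known marker returns None and would fall back to content-based guessing or to the "unknown" list.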

Don't just delete

I haven't read everything on this page, so excuse me if this has been mentioned before (and maybe this is moot now... I don't know what's been done since May 20th): Instead of "simply" deleting everything that's not the correct language (assuming we're still talking about copying the database for the new language domains), why not just tag them for translation (after the new domain goes live, I mean)? The template(s) used for this purpose can automatically place the pages into special categories (e.g., Category:Translate from Spanish, except this would be in German or French or whatever's appropriate). All other categories on the pages can be removed so as to further separate them from the correct-language content. As the pages are translated (and I know this would go very slowly, especially in the less-common-language wikis), they can be deleted through the usual mechanism. At the very least, I think this could be the approach taken for the more-common-language wikis (Spanish, French, Chinese, etc.), where the native-language content will relatively quickly dwarf the incorrect-language content still waiting for translation. - dcljr 07:45, 14 July 2005 (UTC)

Heh... I neglected to check Wikisource:Language domain requests/Moving pages to language domains. I guess my comment is moot. - dcljr 09:05, 14 July 2005 (UTC)


Am I allowed to ask whether you are making progress? The last contribution to the discussion is from May. --Schandolf 19:26, 25 July 2005 (UTC)

I have the same question. Igor Filippov 03:16, 28 July 2005 (UTC)

The case of Cornish

All right, geniuses. What about languages that don't have subdomains of their own? Oh yeah, I've got an idea. Let's just move their materials to other language subdomains. Great idea. Or maybe not? Well, this is exactly what's been happening with the (reasonably substantial number of) Cornish texts. One whole bunch, which are in Cornish but have a few Latin words (to the equivalent of exeunt and vide infra) have been deleted and resurfaced in la. Another whole bunch with some English words are now in en. Do I need to make a kw request just to keep all these things from being deleted without notice or discussion anywhere? QuartierLatin1968 20:37, 11 January 2006 (UTC)