User talk:ThomasBot/archive

How can I help?

The robot looks for special strings in the pages. If none is found, it tries to guess the language using a program called TextCat.

Here is how to help:

  • Check the list in User:ThomasBot/unknown. You can help the bot by adding category tags to these pages.
  • Note: the Russian tag is Category:Русский. The Greek tags are Category:Φιλοσοφία and Category:Ελληνικά.
  • The help of somebody able to tell Russian, Serbian and Ukrainian apart would be greatly appreciated.
  • Check the other lists. There might be pages that are misclassified. In that case, category tags can resolve the problem too.
  • You may propose that a new special string be added to the robot. Please do this only if you see that a significant number of pages can be gained; do not add a new string for just one page, as it is much better to add a category tag to that page.

Here is the current set of strings recognized by the robot:

                if lowtext.find("[[category:ia:") != -1:
                    output = "interlingua (categorized)"

                elif lowtext.find("[[category:fr:") != -1:
                    output = "french (categorized)"
                elif lowtext.find(u"[[category:fran\xe7ais") != -1:
                    output = "french (categorized)"
                elif text.find("{{Philo}}") != -1:
                    output = "french (philo)"
                elif text.find("[[Wikisource:Election data]] > [[France, cantonales 2004]] >") != -1:
                    output = "french (election data)"

                elif lowtext.find("[[category:da:") != -1:
                    output = "danish (categorized)"
                    
                elif lowtext.find("[[category:psalmer") != -1:
                    output = "swedish (categorized)"

                elif lowtext.find("[[category:norsk") != -1:
                    output = "norwegian (categorized)"

                elif lowtext.find("[[category:de:") != -1:
                    output = "german (categorized)"
                elif lowtext.find("[[category:deutsch") != -1:
                    output = "german (categorized)"
                elif text.find("[[Wikisource:Autoren") != -1:
                    output = "german (author page)"
                elif text.find("[[Category:Autor (deutsch)") != -1:
                    output = "german (author page)"
                elif text.find("{{Autor|") != -1:
                    output = "german (author page)"

                elif text.find("<[[Wikisource:Authors") != -1:
                    output = "english (author page)"
                elif text.find("{{Author|") != -1:
                    output = "english (author page)"
                elif text.find("[[Category:English") != -1:
                    output = "english (categorized)"
                elif text.find("[[2001 UK general election|") != -1:
                    output = "english (election data)"
                elif text.find("This is a snapshot of the CIA World Fact Book") != -1:
                    output = "english (cia)"
                elif text.find("<[[2000 US Presidential Election]]") != -1:
                    output = "english (election data)"
                elif text.find("[[Author:Stephen Crane|other works]]") != -1:
                    output = "english (crane)"
                elif text.find("{{Emily Dickinson Index}}") != -1:
                    output = "english (Dickinson poem)"
                elif text.find("[[Category:James Fenimore Cooper") != -1:
                    output = "english (Cooper)"
                elif text.find("<[[The Book of Mormon]]") != -1:
                    output = "english (mormon)"
                elif text.find("{{Pioneers-Mini-TOC}}") != -1:
                    output = "english (pioneers)"

                elif lowtext.find("[[category:mathematics") != -1:
                    output = "no language (mathematics)"
                    
                elif lowtext.find("[[category:autori rom") != -1:
                    output = "romanian (categorized)"
                elif text.find("{{titlu|") != -1:
                    output = "romanian (title)"
                elif text.find(u'[[Category:Rom\xe2') != -1:
                    output = "romanian (unicode)"


                elif lowtext.find("[[category:autorzy") != -1:
                    output = "polish (categorized)"
                elif lowtext.find("[[category:unicode-pl") != -1:
                    output = "polish (categorized)"

                elif lowtext.find("[[category:es:") != -1:
                    output = "spanish (categorized)"
                elif lowtext.find("[[category:es-") != -1:
                    output = "spanish (categorized)"
                elif lowtext.find(u"[[category:espa\xf1ol") != -1:
                    output = "spanish (categorized)"
                elif lowtext.find("[[category:rimas de gustavo adolfo") != -1:
                    output = "spanish (categorized)"
                elif lowtext.find("[[category:obras literarias de") != -1:
                    output = "spanish (categorized)"

                elif text.find("{{Autore|") != -1:
                    output = "italian (author page)"
                elif lowtext.find("[[category:autori]]") != -1:
                    output = "italian (categorized)"
                elif lowtext.find("[[category:autori|") != -1:
                    output = "italian (categorized)"

                elif lowtext.find(u'[[category:asturianu') != -1:
                    output = "asturian (categorized)"

                elif lowtext.find(u'[[category:nahuatl') != -1:
                    output = "nahuatl (categorized)"

                elif lowtext.find(u'[[category:magyar') != -1:
                    output = "hungarian (categorized)"

                elif lowtext.find(u'[[category:portugu') != -1:
                    output = "portuguese (categorized)"

                elif lowtext.find(u'[[category:galego') != -1:
                    output = "galician (categorized)"

                elif lowtext.find(u'[[category:ca:') != -1:
                    output = "catalan (categorized)"
                elif lowtext.find(u'[[category:catal\xe0n]]') != -1:
                    output = "catalan (categorized)"

                elif text.find(u'[[Category:\u0420\u0443\u0441\u0441\u043a\u0438\u0439]]') != -1:
                    output = "russian (categorized)"
                    
                elif text.find(u'[[Category:\u03a6\u03b9\u03bb\u03bf\u03c3\u03bf\u03c6\u03af\u03b1') != -1:
                    output = "greek (categorized)"
                elif text.find(u'[[Category:\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac') != -1:
                    output = "greek (categorized)"
                elif lowtext.find(u"[[category:\u03c6\u03b9\u03bb\u03bf\u03c3\u03bf\u03c6\u03af\u03b1") != -1:
                    output = "greek (categorized)"

Please note that the results on the robot page are in general not up to date. It takes time to run the robot through the whole site, so I usually restrict it to the "unknown" pages.
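The chain of string checks above can also be read as a table of rules tried in order, with the statistical guesser as a last resort. The following is only a sketch under assumptions: the names RULES, classify and guess_language are hypothetical, the table copies just a handful of the strings listed above, and the guesser is a placeholder that always gives up. It is not the bot's actual code.

```python
# Sketch only: RULES, classify() and guess_language() are hypothetical names,
# not taken from the bot itself.

# (needle, case_insensitive, label); a few entries copied from the list above.
RULES = [
    ("[[category:fr:", True, "french (categorized)"),
    ("{{Philo}}", False, "french (philo)"),
    ("[[category:psalmer", True, "swedish (categorized)"),
    ("[[category:norsk", True, "norwegian (categorized)"),
    ("{{Author|", False, "english (author page)"),
    ("[[category:mathematics", True, "no language (mathematics)"),
]


def guess_language(text):
    """Placeholder for the statistical guesser; this stub always gives up."""
    return "unknown"


def classify(text):
    """Return the label of the first matching rule, else fall back to guessing."""
    lowtext = text.lower()
    for needle, case_insensitive, label in RULES:
        # Case-insensitive rules search the lowercased copy of the page.
        haystack = lowtext if case_insensitive else text
        if needle in haystack:
            return label
    return guess_language(text)
```

As in the elif chain, the first rule that matches wins, so the order of the table matters. Adding a new special string would then mean adding one row rather than two lines of code.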

String suggestions

[[Category:Psalmer]] This category contains, as far as I can see, only Swedish texts. --Christian S 17:59, 24 May 2005 (UTC)

thanks for the help. are some of these texts currently misclassified? ThomasV 18:02, 24 May 2005 (UTC)

I don't think there are many misclassifications, but I saw at least one "unrecognized". Also Jesus lever, graven brast is classified as Danish, but that is not a real misclassification, as it is a multilingual version (Swedish, Norwegian and Danish). I don't know if there are more misclassified texts, but it is a rather large Swedish category, and therefore a useful identifier for a number of Swedish texts, though I think you have identified most of them as Swedish already.

A few Norwegian texts have been misclassified as Danish. I removed them from the list of Danish texts a few days ago, but they have now reappeared. They are: Internasjonalen, Main Page:Norsk, Main Page:Nynorsk, Norges grunnlov, originaltekst. It is no surprise that Danish and Norwegian texts are sometimes miscategorized: unless there is a clear marker like [[Category:Da: it can be quite hard to tell them apart if you are not familiar with at least one of them, especially for old texts. --Christian S 18:33, 24 May 2005 (UTC)

ok, I added the "psalmer" string. for the 4 texts above: the "danish" page is written by the bot, so you don't want to edit it manually; it is safer to instruct the bot that these pages are Norwegian. I'll put these pages in a "norwegian" category. is there already one? if not, how should it be named? ThomasV 18:45, 24 May 2005 (UTC)
btw, I just added the "norwegian" list generated by the bot: there are 7 items, but at least one of them seems to be in Danish. could you have a look? ThomasV 18:51, 24 May 2005 (UTC)

ok, I created Category:Norsk. what is the distinction between Main Page:Norsk and Main Page:Nynorsk? ThomasV 19:06, 24 May 2005 (UTC)

A general Norwegian category (like Category:English) should be named Category:Norsk. I don't think any general Norwegian categories exist yet. I'm not quite sure about the exact difference between "norsk" and "nynorsk", but "nynorsk" ("New Norwegian") is some kind of official dialect, or something like that, with its own official spelling and grammar (it has its own Wikipedia as well). I don't think we have any texts in "nynorsk" apart from Main Page:Nynorsk; if I'm wrong, the Norwegians will have to sort it out themselves. Of the seven texts categorized as Norwegian, only one was actually in Norwegian. Here is the list of the 7 items with the correct languages:

  • E: Mathematics
  • Fyrtøiet: Danish
  • Ja, vi elsker dette landet: Norwegian
  • Keiserens nye Klæder: Danish
  • Sneemanden: Danish
  • “Hun duede ikke”: Danish
  • Var inte rädd: Swedish

This illustrates how difficult it is to make software that can distinguish correctly between Danish and Norwegian :) Swedish is easier to recognise, but errors can probably not be avoided. I think the easiest way to sort out these languages is to let the bot do its work and sort out the texts as well as it can and then tell me when it has finished. Then I'll have a look at the lists of texts in the three languages and correct the errors. --Christian S 19:36, 24 May 2005 (UTC)

the bot does its job once a day or so. you don't want to wait if you can correct errors: add tags to the misclassified pages and they will go into the right list the next time the bot sees them. I already fixed the above ones. ThomasV 07:08, 25 May 2005 (UTC)


Thomas, please add:

I added latina. the LC categories are for "Library of Congress": they include material in all languages. ThomasV 05:44, 6 Jun 2005 (UTC)
  • [[Category:Algorithms]] to Mathematics. --Jofi 10:04, 2 Jun 2005 (UTC)
concerning mathematics, I still do not know what to do: I guess constants like Pi should stay on the main wiki, while other pages in this category should go into subdomains. but that means we would need a special category for "no language" mathematics, since the current one has pages in English in it. ThomasV 16:22, 2 Jun 2005 (UTC)
I'm also not sure what to do with mathematics. But to keep it simple: it wouldn't be a problem if every subdomain had its own "pi pages". It's redundant, but what does a megabyte cost nowadays? For now, though, the mathematics pages should remain at main. The subdomains can decide later if they want to add them to their domain. Pi is public domain anyway. --Jofi 21:24, 2 Jun 2005 (UTC)
  • please change everything to "lowtext.find". Otherwise "category:english" etc. won't be found. --Jofi 22:24, 7 Jun 2005 (UTC)
  • Please move pages w. Category: to
    • Computer Programming - (any language, maybe special subdomain or together with eg Babel in general, main or src? same counts imho for Mathematics, Algorithms etc.)
    • Nederlands ( = Dutch )   - pages in Dutch

Thanks a lot in advance and thanks for all the effort you put in this so far! Thomas for president :)

--Patio 10:00, 24 Jun 2005 (UTC)
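The request above to switch every check to "lowtext.find" comes down to case sensitivity: a search for "[[Category:English" on the raw text misses a page tagged "[[category:english". A minimal sketch of the difference follows; the helper name has_tag and the sample page text are invented for illustration and are not part of the bot.

```python
# Sketch only: has_tag() and the sample page are hypothetical.

def has_tag(text, needle):
    """Case-insensitive substring test: lowercase both sides before searching."""
    return text.lower().find(needle.lower()) != -1


page = "A short poem\n[[category:english poetry]]"

# The case-sensitive search on the raw text does not find the lowercase tag.
raw_hit = page.find("[[Category:English") != -1

# Lowercasing page and needle first makes the same search succeed.
low_hit = has_tag(page, "[[Category:English")
```

Here raw_hit is False and low_hit is True. Template names such as {{Philo}} may still deserve a case-sensitive check, so lowercasing everything is a choice to make per string rather than a rule.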

Chinese pages

Many Chinese pages have been wrongly categorized as Japanese. Please have this bot fixed. (If the text has no hiragana / katakana, the text is probably not Japanese) --Hello World! 15:33, 18 Jun 2005 (UTC)

That is right. If you do not know which is Chinese, please allow me to edit your lists.--Jusjih 08:22, 10 August 2005 (UTC)
sorry but I do not know anything about Japanese or Chinese. what is a hiragana? ThomasV 13:33, 21 August 2005 (UTC)

Wikipedia already has plenty of text explaining what hiragana and katakana are (kana is the collective term for hiragana and katakana). If a text does not contain any kana, it is 99% (if not exactly) certain that it is not a Japanese page. The Unicode range U+3040 – U+309F is for hiragana, and U+30A0 – U+30FF is for katakana. Chinese articles normally do not contain kana, except for those that discuss Japanese culture. --Hello World! 16:22, 12 October 2005 (UTC)
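The kana test described above can be sketched in a few lines. This is only an illustration under assumptions: the function names and the "japanese" / "chinese" labels are hypothetical, and the fallback to "chinese" assumes the text had already been guessed as Japanese, so it is taken to contain CJK characters.

```python
# Sketch only: function names and labels are hypothetical, not the bot's own.

def contains_kana(text):
    """True if any character lies in the hiragana or katakana Unicode blocks."""
    for ch in text:
        cp = ord(ch)
        # U+3040..U+309F is hiragana, U+30A0..U+30FF is katakana.
        if 0x3040 <= cp <= 0x309F or 0x30A0 <= cp <= 0x30FF:
            return True
    return False


def refine_cjk_guess(text, guess):
    """Turn a 'japanese' guess into 'chinese' when the text has no kana at all."""
    if guess == "japanese" and not contains_kana(text):
        return "chinese"
    return guess
```

A Chinese page about Japanese culture that quotes some kana would still be kept as Japanese by this check, which matches the exception mentioned above; such pages would need a category tag.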

Var inte rädd

This is a Swedish psalm that should be deleted from here for copyright reasons. I do not have the rights to delete here, but I have found more articles to delete and have listed them somewhere. --Damast 22:19, 18 Jun 2005 (UTC)

I found some unidentified Swedish texts, now listed at User talk:Damast. The psalms that should be deleted are listed at Category talk:Psalmer. --Damast 22:45, 18 Jun 2005 (UTC)

Nederlands aka Dutch

moved to #String suggestions --Patio 10:03, 24 Jun 2005 (UTC)

Gujarati

The articles in Gujarati are tagged with Category:ગુજરાતી. And there are redirects in the list. Yann 4 July 2005 20:47 (UTC)

done ThomasV 13:30, 21 August 2005 (UTC)

Kurdish

Category is "category:Kurdî", all pages are categorized. --Erdal Ronahi 19:54, 14 July 2005 (UTC)

done ThomasV 13:30, 21 August 2005 (UTC)

Square numbers

Add Category:Square numbers to output = "no language (mathematics)". --LadyInGrey 18:54, 23 July 2005 (UTC)

done ThomasV 13:31, 21 August 2005 (UTC)

Polski

Please add Category:Tablice to output "polish (categorized)" (Category:Polski; the bot doesn't recognize them). --LadyInGrey 18:27, 22 August 2005 (UTC)

Italiano

Please add Category:Testi to output italian. There are some articles that the bot doesn't recognize. --LadyInGrey 16:44, 25 August 2005 (UTC)

ok. are they in unknown, or are they miscategorized as another language? ThomasV 16:50, 25 August 2005 (UTC)
Mixed.
  1. Example: Balado pri Pinelli is in 2 languages (Esperanto and Italian), is in category Testi, and is in the list bot/esperanto.
  2. Others are in unknown.
  3. Others in category Testi are in the list bot/italian.
  4. Don't worry, I will classify them manually.
I just restarted the bot with the new string. ThomasV 17:09, 25 August 2005 (UTC)

Wrong categorizing

Author:Max Weber was moved to German Wikisource by mistake. So please don't delete it here. --Jofi 10:29:43, 2005-08-27 (UTC)

these pages will have to be moved manually. can you do that? do not worry, they will not be deleted soon; for the moment the bot is just writing redirects to the pages that have been moved. ThomasV 10:49, 27 August 2005 (UTC)
I just realised that there is Autore:Giambattista Marino, so Autor:Giambattista Marino could be deleted. Author:Max Weber is English. Where do I have to move it? --Jofi 12:09:00, 2005-08-27 (UTC)
I was talking about the Italian one, to the it subdomain. the other one is already here. ThomasV 12:11, 27 August 2005 (UTC)

David Copperfield is not really German, it should stay at English Wikisource ;-) --Jofi 23:06:55, 2005-08-29 (UTC)

Not all pages that are said to have been moved really are moved

I see it here: Meyers Konversations-Lexikon (1888-1889). This page is not a problem, because it was contributed by an IP. I just wanted to inform you. Maybe it happened more often. --Jofi 17:14:09, 2005-08-29 (UTC)

it is because I decided to freeze the German list after you manually edited it. here is the list that was given to Brion: http://wikisource.org/w/index.php?title=User:ThomasBot/german&oldid=151593

I added the category tag to this file after that, but the robot was no longer maintaining the German list.

so the pages that are potentially missing are in the next revision of the bot: http://wikisource.org/w/index.php?title=User:ThomasBot/german&direction=next&oldid=151593

since it had 223 entries, I wanted to spare you some effort and I reduced this list down to 21, then I wrote about it in your scriptorium. I probably made a mistake there, but I do not know what. I think you should check the longer list, with 223 items. sorry for the inconvenience.

ThomasV 17:29, 29 August 2005 (UTC)

I will check it later today. Shall I edit the list manually again and add the note for moved pages at the pages? Or better no more manual editing of the list? --Jofi 17:38:50, 2005-08-29 (UTC)

just remove the pages manually after you move them. the bot will understand :-) ThomasV 17:42, 29 August 2005 (UTC)

Current list: The only page left on the bot's last list is Fermat Primality test. It's source code (in German). I don't oppose having it at de.wikisource, but the author doesn't want to add it at the moment, so I won't do it either. Old list: I didn't look at the "MKL1888:" articles on the older list, but it seems ok. Friisk Gesäts is only partially German and a German version is already at de. If I said Universal Deklaratioun vun de Mënscherechter is German, I would make myself some enemies ;-) And Siddhartha is missing at de. for some reason... --Jofi 21:51:46, 2005-08-29 (UTC)

Hr - croatian

Pages categorized, (they are) ready to be moved. SpeedyGonsales 15:29, 31 August 2005 (UTC)

Fr - français

ThomasBot, I tried, but I did not succeed. Could you transfer it for me, together with the page where the biography will be written? Thanks in advance. --213.49.65.63 08:33, 5 September 2005 (UTC) Tomahawk


Bible Croatian

Mr. ThomasBot, my translation of the Bible into Croatian was not transferred properly to the Croatian subdomain. Half of the chapters are missing. Could you do something about it? Thank you, Tomahawk_Cheerocky --Tomahawk Cheerocky 16:00, 12 September 2005 (UTC) Tomislav Dretar.

Move?

Can you please move Varmed skall jag dig lova? I did not notice at first that we have our own Wikisource.--Damast 06:03, 16 September 2005 (UTC)

too late. copy-paste it and blank it here ThomasV 07:23, 16 September 2005 (UTC)
The page is now moved. /EnDumEn 10:11, 16 September 2005 (UTC)

Latin?!?

How the hell did your bot get Latin from Maryland state laws relating to the Baltimore and Ohio Rail Road? --SPUI 18:14, 5 October 2005 (UTC)

This bot apparently isn't moving images properly

I just noticed that Advanced Automation for Space Missions was moved to en.wikisource a month ago but that all of the diagrams were left behind. I can reupload them from my originals here, but it's going to be a bit of a hassle. In future perhaps you should either move images as well or have your bot leave pages with images alone. Bryan Derksen 06:33, 14 October 2005 (UTC)

the bot was not meant to move images. you have to move them manually. ThomasV 07:47, 14 October 2005 (UTC)
I wasn't notified (I only very rarely stop by Wikisource, my userpage here redirects to where I normally live on en.wikipedia) and now it appears that some of the images that were left behind were deleted because they were "unused". It's very fortunate I still have these things stored offline, but even so this still makes for an enormous hassle since I can't just copy over the descriptive text. Color me still peeved, please be more careful with your bots in the future. Bryan Derksen 06:27, 5 December 2005 (UTC)
my answer is here [1] ThomasV 08:15, 5 December 2005 (UTC)
And for the record, consider my complaint withdrawn from your language-bot and redirected to the developers who actually did the image-breaking page-moves. Since [2] reported ThomasBot as being the user that deleted the page, I assumed that it was involved in the process of moving the text rather than just marking its language. It's probably too late now for my complaint to make much difference one way or the other but I wanted to make it clear for the record. Bryan Derksen 08:33, 5 December 2005 (UTC)

Hunted on the English Wikisource

Thomas, why are they hunting me down on the English Wikisource? I have clearly proven my identity: Thomas Dretart and Tomislav Dretar are one and the same person, holding the copyright on the books of Tomislav Dretar and Thomas Dretart. I do not understand what needs to be done; should I come in person? Here is my telephone number: 003226482600, my e-mail: drtomis@scarlet.be. Ask the Croatian administrators, the European police, my granddaughters Sylvia and Cindy François, the Belgian writers Gérard Adam and Monique Thomassetie, Les Edition du Panthéon; all of them will confirm my identity and my copyright. I feel like a criminal; I think you are a Serb Chetnik thirsting for my blood. All over the Internet there is a WANTED, dead or alive. Where is the problem? Tomislav Dretar alias Thomas Dretart.--213.49.69.209 20:13, 2 December 2005 (UTC)