This is a program MahlenMorris is working on that exposes a WikiRPCInterface Facade for WikiEngines that are not fortunate enough to be running on JSPWiki :). This allows them to be accessed via Hula and its slowly growing list of applications.
Status#11-Aug-2002: I feel bad that I haven't gotten Hoop ready for prime time yet. The good news is that the repository of pages is still being updated every 15 minutes. The HTML entity problem noted below has been dealt with, both in converting all the previous pages and when new pages are pulled.
The main reason I can't go live with Hoop is that I'm getting some situations where pulling data from the repository eventually makes the web server (Resin) completely non-responsive, even to non-Hoop calls. That sounds like it's hogging all the threads or something, or that some system resource is being used up. I spent some of yesterday further pinning down what's happening. I suspect it has to do with using RCS to get old versions of the pages, but I've no idea why it would be acting poorly in these cases when it works fine for JSPWiki.
I guess my plan will be:
- Further pin down what conditions are needed to cause the problem.
- If it really seems to be RCS acting up, maybe convert to JSPWiki's versioning code (which could be nice anyway, as it removes the overhead of spawning new processes).
- Of course, converting all my existing RCS to that could be a headache. So I may just jettison the existing change logs (keeping the files, of course).
- Maybe I should publish the latest source and have the other eyeballs spot the problem. :)
29-June-2002: XML code now removed. Now using a simple flat file format, and all 20000 files have been converted. Does seem to be running OK for now. Still want to watch how the emails go for a few days before formally announcing.
Did notice one unexpected data problem, though. "'" is showing up in the raw file as "&quot;", and other such HTML entities are in the raw data as well. Which makes sense, now that I consider it, since I'm snarfing the raw text by parsing HTML from the edit page; thus, those entities would need to be HTML-ized. But that's an issue for the EmailGenerator, since it sends back the raw text. I'll need to do conversions when snarfing, and run a converter over all the files.
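A minimal sketch of the conversion step described above, assuming that only a handful of named entities plus numeric references need decoding. The class name `EntityDecoder`, its method names, and the list of entities are all hypothetical and are not taken from Hoop's source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes HTML entities that leak into raw wiki text
// when that text is scraped out of an edit page.
class EntityDecoder {
    // Assumption: these five named entities cover the common cases.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("amp", "&");
    }

    // Matches &name; as well as decimal (&#39;) and hex (&#x27;) references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z]+);");

    // Decodes in a single pass, so "&amp;quot;" becomes "&quot;" and no further.
    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Unknown names and invalid code points are returned unchanged, not guessed at.
    private static String resolve(String body, String whole) {
        if (body.charAt(0) != '#') {
            String named = NAMED.get(body);
            return named != null ? named : whole;
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = hex
            ? Integer.parseInt(body.substring(2), 16)
            : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
            ? new String(Character.toChars(codePoint))
            : whole;
    }

    public static void main(String[] args) {
        System.out.println(decode("&quot;It&#39;s&quot; &lt;b&gt; &amp;amp;"));
    }
}
```

The same `decode` call could serve both purposes mentioned in the entry: applied to each page as it is snarfed, and run once over the files already in the repository.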
27-June-2002: Well the full pull went fine and finished a few days ago, but I'm having some problems with the XML parsing. Every now and then it complains about UTF-8 problems involving "invalid byte 1 of 1-byte UTF-8 sequence (0x85)" or what have you. I fear this may be related to the fact that I have to run Java 1.4, per Janne's comment elsewhere.
So I'm seriously considering abandoning XML for now and just using some simple delimited text format. I hate to give up on it, but no combination of settings is working for me, and I'm too confused by the DOM spec to try and sort it out. Plus, I have to deal with text encoding issues (around Japanese text) at work; I don't want to have to deal with it here as well.
22-June-2002: Happened to wake up and not be able to get back to sleep at 4 in the morning here in San Francisco (I think it was the Red Bull I had last night). Went to check on Hoop's status, and noticed it couldn't write files. Looks like Windows XP doesn't like having over 16,000 files in the same directory! So I've hacked Hoop so that it puts files that start with A-M in a separate directory, bifurcating the file space. Ugly, but works. Hoop must have been calling me out of my sleep to rescue it :).
I had thought this could happen, and had been thinking of having an A directory, a B directory, and so forth. But I don't write code like that at 4:00 AM. Maybe that should be an Extreme Programming technique: "code like you need to sleep".
Back to bed...
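The per-letter layout mentioned in the 22-June entry (an A directory, a B directory, and so forth) could be sketched roughly as follows. The class name `PageBuckets`, the `_other` fallback directory, and the `.txt` extension are assumptions made for illustration only.

```java
import java.io.File;

// Hypothetical helper: spreads page files over one subdirectory per
// initial letter so that no single directory grows too large.
class PageBuckets {
    // Returns "A".."Z" by first letter, or "_other" for anything else.
    static String bucketFor(String pageName) {
        if (pageName == null || pageName.isEmpty()) {
            return "_other";
        }
        char first = Character.toUpperCase(pageName.charAt(0));
        return (first >= 'A' && first <= 'Z') ? String.valueOf(first) : "_other";
    }

    // Builds root/<bucket>/<pageName>.txt without touching the disk.
    static File fileFor(File root, String pageName) {
        return new File(new File(root, bucketFor(pageName)), pageName + ".txt");
    }

    public static void main(String[] args) {
        System.out.println(fileFor(new File("repository"), "FrontPage").getPath());
    }
}
```

Page names are unlikely to be spread evenly across initial letters, so some buckets would probably end up much fuller than others. A second level keyed on the next letter would be one way to split those further.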
21-June-2002: Hoop seems to be cranking along in the full pull, now over halfway done.
Thought occurred to me this morning: I'll bet I could write an RSS feed using Hula, thus allowing me to provide one for WikiWikiWeb. Although I'm not sure what you do with an RSS feed once you have one, other people seem to have uses for it.
18-June-2002: I hesitate to say this, but Hoop appears to be sort of working. It's still in the midst of doing the full pull (and will be for several days, I suspect), but you can still access it via the XML-RPC interface. As a goofy demo, you can see what pages Hoop has recently pulled over at http://www.mahlen.org/Hoop/recent.jsp, and those links are clickable; they display the HTML of the page that I've snarfed. The following has been shut off, as Google was spidering the site. Even the links inside those pages are clickable; the ones that I have already link to my copy, the others link to c2.com. So slowly over the week all the links should eventually leave c2.com. Mind you, the pages you pull from my site come up a lot slower.
Ya, Google comes by us very often these days. Shows how popular we are :-) --JanneJalkanen
15-June-2002: I tried a Full Pull this afternoon. It ran for probably 8 hours before something caused an exception, causing the full pull to lose track of what it had already done. I think it was the server cutting me off again that caused the exception; of course, maybe it wasn't the "you've been overusing this server" protection at all, it may have been just the c2.com server hiccuping for a minute. Because I limit the process to one page request every 5 seconds, I only got 2700 pages (out of around 22000); there are three pulls per Wiki page, one each for raw text, HTML, and author name. At the rate it was going it would have taken about 91 hours.
I guess I'm going to have to rethink the whole strategy of the full pull. Rather than something that happens once in a big pulse and then never again, I'm going to have to make the initial pull something that happens in a background thread, which saves its state as it goes. Of course, it also needs to be hardened even more against occasional inabilities to pull a page from the server. Serves me right for starting off with the biggest Wiki of all. On the other hand, I'll know that it works for smaller ones!
Chin rubbing and pondering time for me.
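One possible shape for a pull that saves its state as it goes and survives the occasional failed request is sketched below. It is an illustration under assumptions, not Hoop's implementation: `ResumablePull`, the `PageFetcher` interface, and the one-name-per-line checkpoint file are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a throttled pull that records each finished page in a
// checkpoint file, so a restart skips work that was already done.
class ResumablePull {
    // Stand-in for whatever actually fetches and stores one page.
    interface PageFetcher {
        void fetch(String pageName) throws IOException;
    }

    private final Path checkpoint;
    private final long delayMillis;
    private final Set<String> done = new HashSet<>();

    ResumablePull(Path checkpoint, long delayMillis) throws IOException {
        this.checkpoint = checkpoint;
        this.delayMillis = delayMillis;
        if (Files.exists(checkpoint)) {
            // Each line of the checkpoint file names one completed page.
            done.addAll(Files.readAllLines(checkpoint, StandardCharsets.UTF_8));
        }
    }

    // Pulls every page not yet recorded and returns the pages that failed,
    // so the caller can retry them on a later pass.
    List<String> run(List<String> pages, PageFetcher fetcher)
            throws IOException, InterruptedException {
        List<String> failed = new ArrayList<>();
        for (String page : pages) {
            if (done.contains(page)) {
                continue;
            }
            boolean ok;
            try {
                fetcher.fetch(page);
                ok = true;
            } catch (IOException e) {
                ok = false; // a server hiccup should not abort the whole pull
            }
            if (ok) {
                Files.write(checkpoint, (page + "\n").getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                done.add(page);
            } else {
                failed.add(page);
            }
            Thread.sleep(delayMillis); // gate the request rate
        }
        return failed;
    }

    public static void main(String[] args) throws Exception {
        Path cp = Files.createTempFile("hoop-checkpoint", ".txt");
        ResumablePull pull = new ResumablePull(cp, 0);
        List<String> failed = pull.run(Arrays.asList("FrontPage", "WikiWikiWeb"),
                name -> System.out.println("pulled " + name));
        System.out.println("failed: " + failed);
    }
}
```

To run this in the background, `run` could be called from a separate thread, with the delay set to the 5 seconds per request used in the entry above.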
15-June-2002: A RecentChanges log is now being written to a file, so restarting the server doesn't lose that info. The CharConversionExceptions may be gone now, but since I'm not certain what was causing them, I'm not entirely certain that what I did fixed it. I'm going to keep an eye on that. Could be the Java 1.4/UTF-8 problem Janne has mentioned elsewhere.
But I may feel bold today and turn on the full pull. That'll take a couple days to get through.
9-June-2002: Hoop is actually sort of running. Every 15 minutes it queries the WikiWikiWeb for new pages, and updates them on my machine. It can also answer WikiRPCInterface queries. Still a number of issues to resolve; I'm occasionally getting CharConversionExceptions, for one thing, and its Recent Changes memory is lost when I make any code changes. But the concept seems sound. Nothing to show anyone yet, though.
But when it's been running reliably for a while, then I can pull the switch and do the Full Pull.
4-June-2002: Real Life has been interfering with my Hoop coding, but it's getting closer to a reality. Parsing of the RecentChanges (actually QuickChanges) page is working. I'm trying to be fairly confident that all the constituent pieces work before doing the Full Pull of c2.com. But at this point the only component that needs coding is the RPC <-> Storage bit. I suspect Janne's RPC servlet code will provide the basis of that.
Never did hear from Ward regarding this project. Perhaps he's waiting to see if it actually happens? He does say in a few places on WikiWikiWeb that he doesn't want the Wiki to be a bulletin board, and one could construe the EmailGenerator to be pushing it in that direction, especially since opinions on WikiWikiWeb run a bit hotter than they do here. But really, not so different from the RecentChanges junkies. It will be an interesting experiment to see what, if any, effect history and change notification will have.
16-May-2002: Have figured that if I gate the speed of retrieval, WikiWikiWeb doesn't complain (I've left it running for 30 minutes without incident). This will mean that it takes over 2 days to run at first, but that's OK; afterwards I'll read RecentChanges to see what pages changed over that time and re-snarf them (I have to do that RecentChanges parsing anyway).
- What's going on here? Have you managed to sort through the entire WikiWikiWeb? Did Ward go into a blood frenzy? :-) --JanneJalkanen
8-May-2002: Last night I tried to do an initial snarf of the entirety of WikiWikiWeb, but was quickly stopped by the c2.com process that detects denial of service attacks. I had anticipated this as a potential problem, and had intentionally limited my process to a single thread, but even that appears to have triggered it. I don't think the hourly updates would trigger it at this rate, but currently it'd be hard to get the initial grab of the whole thing in a timely manner, since each page takes three page reads (one each for raw text, HTML page, and most recent author). With over 20,000 pages, if I throttle the rate down to one read a second, that would take 17 hours to do, so undoubtedly inaccuracies would appear over that time. I know that may sound awfully finicky, but I like getting things right, especially when introducing them to a, mmmm, prickly bunch of people like the WikiWikiWeb has. And for all I know, even once per second could get detected as a DoS.
I saw that there is a page there on WikiMirrors, but unfortunately the existing c2.com mirrors are both too selective (not all pages are backed up) and only HTML (I really need the raw text as well to do the RPC interface correctly). I've shot off an email to Ward briefly explaining what I'm trying to do with Hoop. It occurs to me now that Hoop could easily act as the back-end to a pretty decent mirror (in fact, to any number of mirrors), with the advantage over other mirrors that it would be current within an hour. Of course, such a mirror should probably live on a machine with faster upload bandwidth than my piddly 128Kbits :).
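The 17-hour estimate in this entry follows from pages × reads per page × seconds per read. The sketch below works through it; the class and method names are hypothetical, and the calculation ignores transfer time. It gives about 16.7 hours for 20,000 pages at one read per second, and about 91.7 hours for the 22,000 pages at one read every 5 seconds quoted in the 15-June entry.

```java
// Hypothetical helper for estimating how long a throttled full pull takes.
class PullEstimate {
    // Hours needed for `pages` pages at `readsPerPage` requests each, with
    // `secondsPerRead` between consecutive requests (transfer time ignored).
    static double hours(int pages, int readsPerPage, double secondsPerRead) {
        return pages * (double) readsPerPage * secondsPerRead / 3600.0;
    }

    public static void main(String[] args) {
        System.out.println(hours(20000, 3, 1.0)); // 8-May figures
        System.out.println(hours(22000, 3, 5.0)); // 15-June figures
    }
}
```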
6-May-2002: Now able to snarf text/HTML/etc. from WikiWikiWeb pages, thanks to the Jakarta ORO regular expression code. Code for storing pages is written but untested.
Purpose#What I'm thinking about is:
- An interface for getting data from the original Wiki.
- The Facade calls a class that implements this interface for the particular Wiki or WikiEngine. This implementing class handles the details of knowing the URL structure for that Wiki, knowing how to strip raw page text from the edit page or HTML from the rendered page, parsing for links, and so forth.
- The Facade would periodically (every hour?) ask this class for recent changes, and then get the raw text and HTML text for the updated page, saving them via RCS. Janne, you may think this is overkill, but I can't think how one would do the email delta generation right otherwise, and if I'm going to this much trouble to mimic the XML-RPC interface, I'd like to do it accurately.
- No, I don't think it's overkill. Interestingly, it would also provide versioning to those Wikis that don't support it. --JanneJalkanen
- The Facade also services XML-RPC requests, probably caching a fair amount of the requested data, for speed.
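The split described in the list above, a per-Wiki implementing class behind a common interface with the Facade polling it, might look roughly like this in Java. Every name here (`WikiSource`, `FacadePoller`, the method names) is hypothetical. Storage via RCS, caching, and the XML-RPC side are left out.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Facade-side polling step: asks the source which pages changed,
// then fetches raw text and HTML for each, ready to be handed to storage.
class FacadePoller {
    // Returns page name -> { raw text, HTML } for pages changed since `since`.
    static Map<String, String[]> poll(WikiSource source, Date since) throws IOException {
        Map<String, String[]> updated = new LinkedHashMap<>();
        for (String page : source.changedSince(since)) {
            updated.put(page, new String[] { source.rawText(page), source.html(page) });
        }
        return updated;
    }

    public static void main(String[] args) throws IOException {
        // A canned in-memory source standing in for a real Wiki.
        WikiSource demo = new WikiSource() {
            public List<String> changedSince(Date since) { return Arrays.asList("FrontPage"); }
            public String rawText(String pageName) { return "raw text of " + pageName; }
            public String html(String pageName) { return "<p>" + pageName + "</p>"; }
        };
        for (Map.Entry<String, String[]> e : poll(demo, new Date(0)).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue()[0]);
        }
    }
}

// Hypothetical interface: one implementation per Wiki or WikiEngine, hiding
// that Wiki's URL structure and page-scraping details from the Facade.
interface WikiSource {
    List<String> changedSince(Date since) throws IOException;
    String rawText(String pageName) throws IOException;
    String html(String pageName) throws IOException;
}
```

Keeping the scraping details behind `WikiSource` means that supporting another Wiki would, under this sketch, only require a new implementing class.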
JanneJalkanen: I think this would be an interesting idea. If you take a closer look at the RSSGenerator.java in the current JSPWiki CVS repository, you'll notice that it would be nearly trivial to convert it to use the WikiRPCInterface, which in turn means that any Wiki could then publish RSS. Obviously, it would be better that each Wiki would support the RSS interface on its own, but still.