Hello list,
I'm thinking of implementing a flat-file backend. As Dom pointed out, metadata and versioning would need to be addressed.
For metadata, I was thinking of using Lucy[0] and storing metadata as fields.
As for versioning, I would have one current version of a file ("foo.txt"), and previous versions would follow a naming convention like "foo__2012-02-03T191540.txt". Moderation could also be represented in the filename somehow.
Additionally, if one didn't want to constrain users from using silly node names with long timestamps on the end, the versioned files could be kept in a directory named from the node.
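A minimal sketch of the naming convention described above, in Python purely for illustration. The function name `versioned_name` and the `per_node_dir` flag are hypothetical and not part of Wiki::Toolkit; the timestamp layout is copied from the "foo__2012-02-03T191540.txt" example.

```python
from datetime import datetime
from pathlib import Path


def versioned_name(current: Path, when: datetime, per_node_dir: bool = False) -> Path:
    """Return the path under which a previous version of `current` would be kept.

    Hypothetical helper: the "__<timestamp>" layout follows the example in the
    message above and is an assumption, not an existing Wiki::Toolkit format.
    """
    stamp = when.strftime("%Y-%m-%dT%H%M%S")
    if per_node_dir:
        # Variant: keep old versions in a directory named after the node,
        # so node names themselves are not restricted.
        return current.parent / current.stem / f"{stamp}{current.suffix}"
    # Default: timestamp appended to the node name, next to the current file.
    return current.with_name(f"{current.stem}__{stamp}{current.suffix}")


print(versioned_name(Path("foo.txt"), datetime(2012, 2, 3, 19, 15, 40)))
# prints: foo__2012-02-03T191540.txt
```

With `per_node_dir=True` the same call yields a path inside a `foo` directory instead, which avoids clashes with nodes whose own names happen to contain "__".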
How does that sound?
Cheers, --Ryan
[0] http://search.cpan.org/~dwheeler/Lucy-0.2.2/lib/Lucy.pod
On Fri 03 Feb 2012, Ryan Jendoubi ryan.jendoubi@gmail.com wrote:
I'm thinking of implementing a flat-file backend. As Dom pointed out, metadata and versioning would need to be addressed.
Neat idea! I seem to have missed the conversation with Dom that you mention - did this happen offline, or was it on another mailing list somewhere?
Do you plan to use this as part of another project, or is it just "oh, it would be nice to have a flat-file backend"?
Kake
Hi Kake,
On 04/02/12 16:10, Kake L Pugh wrote:
On Fri 03 Feb 2012, Ryan Jendoubi ryan.jendoubi@gmail.com wrote:
I'm thinking of implementing a flat-file backend. As Dom pointed out, metadata and versioning would need to be addressed.
Neat idea! I seem to have missed the conversation with Dom that you mention - did this happen offline, or was it on another mailing list somewhere?
Sorry, that was unnecessarily cryptic of me - I emailed Dom directly as the current maintainer.
Here's the exchange:
On 03/02/12 18:58, Dominic Hargreaves wrote:
On Fri, Feb 03, 2012 at 02:26:15PM +0000, Ryan Jendoubi wrote:
Curious about the status of Wiki::Toolkit? I'm envisioning hacking together something based on flat files, if such a thing is possible, and considering extending W::T to make that possible. I see the module includes things like its own DB abstraction though, which many people would tut at nowadays. So I'd like to know what kinds of changes might be well received and what should be left well alone.
Hi Ryan,
Good to hear that there is still some interest in Wiki::Toolkit :)
I don't see why you couldn't implement a new storage backend using flat files, but I don't think it's something we've discussed before.
I guess you'd have to think about how to store the metadata, and versioned data. The DB abstraction you've identified should be stretchable to flat file formats, once one was designed.
There is a confusingly named mailing list at http://www.earth.li/mailman/listinfo/cgi-wiki-dev which would be a good place to discuss details. There's also a dev site at http://www.wiki-toolkit.org/
Cheers, Dominic.
As to your question:
Do you plan to use this as part of another project, or is it just "oh, it would be nice to have a flat-file backend"?
I would plan to use it :-) Just for a personal site. If you're asking if I have a use-case that really demands a flat-file backend, I guess not... maybe it's just that I'd quite like to throw markup together in vim when I want, but still edit it over the interwebs when I need to as well?
As an addendum to my previous suggestions, I guess it would be good to have /all/ the previous versions kept in one central "archive" directory, which you could then hide from web spiders. That would also make sure one's directories with current versions aren't choked with previous version files if/when one goes in to modify things by hand.
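The central "archive" directory idea could look roughly like this. It is a sketch under assumptions: the directory name `archive`, the `.txt` extension, and the `save_node` helper are all hypothetical, and the timestamp records when a version was superseded rather than when it was written.

```python
import shutil
from datetime import datetime
from pathlib import Path

ARCHIVE_DIR = "archive"  # hypothetical name; this is the directory to hide from spiders


def save_node(root: Path, name: str, new_text: str, when: datetime) -> Path:
    """Write a new current version, moving any existing one into the archive."""
    current = root / f"{name}.txt"
    if current.exists():
        archive = root / ARCHIVE_DIR
        archive.mkdir(exist_ok=True)
        # The stamp is the time the old version was replaced.
        stamp = when.strftime("%Y-%m-%dT%H%M%S")
        shutil.move(str(current), str(archive / f"{name}__{stamp}.txt"))
    current.write_text(new_text, encoding="utf-8")
    return current
```

The directory holding current versions then only ever contains one file per node, which keeps hand editing tidy.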
I'm starting to think this use-case might be too bespoke to be worth contributing back to WT. More of a burden than a benefit possibly?
Cheers,
--Ryan
On Fri, Feb 03, 2012 at 07:25:26PM +0000, Ryan Jendoubi wrote:
I'm thinking of implementing a flat-file backend. As Dom pointed out, metadata and versioning would need to be addressed.
For metadata, I was thinking of using Lucy[0] and storing metadata as fields.
Hmm, I'm not too familiar with Lucy, but it looks like quite a heavyweight solution for this. Would a simple serialisation interface like JSON or YAML be better?
As for versioning, I would have one current version of a file ("foo.txt"), and previous versions would follow a naming convention like "foo__2012-02-03T191540.txt". Moderation could also be represented in the filename somehow.
Additionally, if one didn't want to constrain users from using silly node names with long timestamps on the end, the versioned files could be kept in a directory named from the node.
Well, W::T currently uses revision numbers rather than dates, so your naming convention would need to store that.
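One way the filename could carry a revision number instead of a date, sketched in Python for illustration. The "__r<N>" suffix and both helper names are assumptions made for this example; nothing here is an existing W::T convention.

```python
import re
from pathlib import Path


def revision_path(directory: Path, node: str, revision: int) -> Path:
    """Path for a given revision, using a hypothetical "<node>__r<N>.txt" layout."""
    return directory / f"{node}__r{revision}.txt"


def latest_revision(directory: Path, node: str) -> int:
    """Highest revision number stored for `node`, or 0 if there is none yet."""
    pattern = re.compile(rf"^{re.escape(node)}__r(\d+)\.txt$")
    revisions = []
    for path in directory.iterdir():
        match = pattern.match(path.name)
        if match:
            revisions.append(int(match.group(1)))
    return max(revisions, default=0)
```

A writer would then store the next version at `revision_path(directory, node, latest_revision(directory, node) + 1)`. Revision numbers are compared as integers, so r10 sorts after r2.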
Cheers, Dominic.
On 05/02/12 17:38, Dominic Hargreaves wrote:
Hmm, I'm not too familiar with Lucy, but it looks like quite a heavyweight solution for this. Would a simple serialisation interface like JSON or YAML be better?
Possibly. My attraction to Lucy was more its (semi-standard Lucene-style) keyword /indexing/ rather than being a simple metadata store.
You're right though, a JSON file would be much more "consumable". I'll check later whether there are standard modules for searching / indexing JSON.
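For comparison, a JSON-based metadata store can be sketched with nothing beyond a standard JSON library. The sidecar layout (one `.json` file next to each node file) and the helper names are hypothetical, and the sketch assumes every metadata field holds a list of strings.

```python
import json
from pathlib import Path


def write_metadata(node_file: Path, metadata: dict) -> Path:
    """Store metadata in a sidecar file, e.g. foo.txt -> foo.json (assumed layout)."""
    sidecar = node_file.with_suffix(".json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar


def find_nodes(root: Path, key: str, value: str) -> list:
    """Names of nodes whose metadata field `key` contains `value`.

    There is no index: every sidecar file is read and parsed on every query.
    """
    matches = []
    for sidecar in sorted(root.glob("*.json")):
        metadata = json.loads(sidecar.read_text(encoding="utf-8"))
        if value in metadata.get(key, []):
            matches.append(sidecar.stem)
    return matches
```

The files stay easy to read and edit by hand, at the cost of a full scan per lookup.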
Well, W::T currently uses revision numbers rather than dates, so your naming convention would need to store that.
Ah, good point.
I'm also working on a project with lots (thousands) of badly formatted, but still useful, legacy HTML documents. We're doing all we can to clean them up computationally, but I envisage that there will still be many aspects of cleaning them up which will have to be crowdsourced over time (contributors being screened for HTML competence).
This is the use-case for which I'd like to say, "Take this document and make it wiki-like"; having a solution general enough so that any given flat file could be "wikified".
Interested to hear if you have any further thoughts on something like that. Otherwise I'll start hacking something together next time I get a chance.
Cheers,
--Ryan
On Mon, Feb 06, 2012 at 06:04:59AM +0000, Ryan Jendoubi wrote:
On 05/02/12 17:38, Dominic Hargreaves wrote:
Hmm, I'm not too familiar with Lucy, but it looks like quite a heavyweight solution for this. Would a simple serialisation interface like JSON or YAML be better?
Possibly. My attraction to Lucy was more its (semi-standard Lucene-style) keyword /indexing/ rather than being a simple metadata store.
You're right though, a JSON file would be much more "consumable". I'll check later whether there are standard modules for searching / indexing JSON.
Hrm, although this does raise the point that W::T does rely on some quite complex metadata queries; performance on a collection of JSON/YAML will suck for that, so that's probably not as good an idea after all.
I can't help saying 'sqlite' at this point...
On 06/02/12 08:26, Dominic Hargreaves wrote:
Hrm, although this does raise the point that W::T does rely on some quite complex metadata queries; performance on a collection of JSON/YAML will suck for that, so that's probably not as good an idea after all. I can't help saying 'sqlite' at this point...
For the legacy HTML thing, I suppose I /could/ put everything into a db then take it out and put it back every time we have a new batch of tweaks to run. Using Lucy would essentially be a compromise to that, having the 'content' lying around as flat files but indexing them on certain metadata. I'm sure one could construct something similar using a conventional db backend but only putting the metadata in it, along with just filenames instead of the current content.
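The last option, a conventional db holding only metadata and filenames, might look like the sketch below. It uses SQLite via Python's standard `sqlite3` module; the table and column names are invented for this example and do not match the real W::T schema.

```python
import sqlite3


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) index tables; content itself stays on disk."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS node (
            name TEXT NOT NULL, revision INTEGER NOT NULL, path TEXT NOT NULL,
            PRIMARY KEY (name, revision));
        CREATE TABLE IF NOT EXISTS metadata (
            name TEXT NOT NULL, revision INTEGER NOT NULL,
            meta_key TEXT NOT NULL, meta_value TEXT NOT NULL);
        CREATE INDEX IF NOT EXISTS metadata_kv ON metadata (meta_key, meta_value);
    """)
    return conn


def index_revision(conn, name, revision, path, metadata):
    """Record where a revision lives on disk, plus its metadata (lists of strings)."""
    rows = [(name, revision, k, v) for k, values in metadata.items() for v in values]
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT INTO node VALUES (?, ?, ?)", (name, revision, path))
        conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)


def paths_with(conn, key, value):
    """File paths of all revisions carrying the given metadata key/value pair."""
    cursor = conn.execute(
        "SELECT n.path FROM node n JOIN metadata m "
        "ON m.name = n.name AND m.revision = n.revision "
        "WHERE m.meta_key = ? AND m.meta_value = ? ORDER BY n.path",
        (key, value))
    return [row[0] for row in cursor]
```

Batch tweaks can then run directly on the files, while metadata queries go through the index instead of opening every document.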
Another way to come at it I guess would be to use W::T in the usual way, but in addition generate an on-disk copy of the current revision of each document, almost like a cache.
Then if you ran a batch textual manipulation on those documents you could have a script to check for any document that's different from its most up-to-date version in the W::T database and update that version accordingly. That wouldn't be too bad.
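A rough sketch of that check, in Python for illustration. The `latest` dict stands in for whatever the W::T store returns as the newest content of each node; how that dict gets built, the `.txt` cache layout, and the function name are all assumptions.

```python
from pathlib import Path


def changed_nodes(cache_dir: Path, latest: dict) -> dict:
    """Return {node: on-disk text} for cached files that differ from the store.

    `latest` maps node name -> newest stored content (a stand-in for the database).
    """
    changed = {}
    for node, stored_text in latest.items():
        cached = cache_dir / f"{node}.txt"
        if not cached.exists():
            continue  # nothing on disk for this node, so nothing to sync
        on_disk = cached.read_text(encoding="utf-8")
        if on_disk != stored_text:
            changed[node] = on_disk
    return changed
```

Each entry in the result would then be written back as a new revision of that node, so the batch edit shows up in the wiki history like any other change.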
On the other hand, all our documents are already under version control anyway... I suppose you could have a hybrid db / git backend where metadata, backlinks etc are kept in a .gitignored db but the revisions themselves are kept in git. Again, pretty convoluted!
--Ryan
On Mon 06 Feb 2012, Dominic Hargreaves dom@earth.li wrote:
Hrm, although this does raise the point that W::T does rely on some quite complex metadata queries; performance on a collection of JSON/YAML will suck for that, so that's probably not as good an idea after all.
I wonder if this is kind of a red herring in this case. The reason for having multiple backends is that different applications have different needs and priorities. If the convenience of a flat file format outweighs the need for fast performance when manipulating metadata, then it's OK to make that tradeoff, isn't it?
Kake
On Mon, Feb 06, 2012 at 05:32:40PM +0000, Kake L Pugh wrote:
On Mon 06 Feb 2012, Dominic Hargreaves dom@earth.li wrote:
Hrm, although this does raise the point that W::T does rely on some quite complex metadata queries; performance on a collection of JSON/YAML will suck for that, so that's probably not as good an idea after all.
I wonder if this is kind of a red herring in this case. The reason for having multiple backends is that different applications have different needs and priorities. If the convenience of a flat file format outweighs the need for fast performance when manipulating metadata, then it's OK to make that tradeoff, isn't it?
If it's really a valid trade-off, and it's clearly documented, yes.