Category Archives: databases

quick method to clone a MySQL database

let’s say you have a MySQL database on db1.db and you want to clone it to db2.db

the “official” way to do this is to run a “mysqldump” on db1.db and then import the resulting .sql file into the db2.db server.

There are problems with this approach:

  • mysqldump locks the source database, making it inaccessible while the dump is happening.
  • mysqldump creates files which may be many times the size of the source database’s binary files, potentially exhausting the space on your source server before it’s even done.
  • the resulting file then needs to be imported into the target server, which could take hours depending on the size.

I needed to clone some databases in a hurry that are about 20G in size. The method I used ended up taking less than half an hour to complete, and the source database (db1.db) only had to be down for less than a minute, instead of the potential /hours/ in the mysqldump method.

  1. use rsync on db2.db to copy the data directories from db1.db to db2.db:
    cd /var/lib/ && rsync root@db1.db:/var/lib/mysql ./ -rva –progress –delete
  2. use rsync on db2.db to copy binary logs from db1.db to db2.db:
    cd /var/log/ && rsync root@db1.db:/var/log/mysql ./ -rva –progress –delete
  3. repeat 1&2 (the first time around would take some time. the second time around will be quick)
  4. on db1.db, stop the database
    service mysqld stop
  5. on db2.db, repeat 1&2 one last time
  6. on db1.db, start the database again, and start the slave service if you need to
    service mysqld start
  7. on db2.db, remove auto.cnf and any innodb log files
    cd /var/lib/mysql/ && rm -f auto.cnf ib_logfile*
  8. start the database, and start the slave if needed
    service mysqld start

With the above method, your source database will be down for only a minute or so (steps 4-6).

The reason that 1&2 are repeated 3 times:

  1. clone the db1.db database from scratch. this will take a while
  2. because it took so long to run #1, there are probably a lot of changes. repeat to get those changes
  3. when you stop db1.db, some files will get final changes as they are changed. grab those after db1.db has been stopped

You need to delete any existing innodb logs (step 7) which might cause the system to attempt to “fix” some tables it might think are broken. but, because we did a clean shutdown in step 4, this is not necessary. so delete the log files (they will be recreated automatically).

If you are doing the clone because you want to create a new slave database, then the database needs a new internal ID that it will send to the master. By deleting auto.cnf, you force the MySQL server to create a new unique ID.

Distributed File Storage, using PHP and MongoDB

Scenario:

  • Alice creates an entry on Server1, and uploads an image to it.
  • Bob views that entry on Server2, but can’t view the image because the server doesn’t have it.

There are a number of solutions to this.

  1. after each upload, push the new file out to all servers so they also have a copy
  2. mount an external file system, networked to all servers
  3. create a caching distributed file system centered around an external database

The first solution, ensuring that every uploaded file is simultaneously uploaded to all servers, is wrong for an obvious reason: hard-drive space. Imagine you have 20 servers and the file is likely only to ever be read on 3 of them (maybe they’re location-based?) – by uploading to all servers, you waste space, increasing storage costs and also slowing down the servers as they are busy doing work that they really don’t need to be doing.

The second solution is better – an external mounted solution such as NFS, S3QL, or Samba can store your files on file servers that are backed up and replicated, and are simultaneously available to all your web servers. But these solutions come at a huge speed cost – every file check involves network access, lock checking, POSIX compliance and other ugliness. Also, network file systems of this sort are very sensitive to network outages, however temporary they are.

The solution we will build in this article is to create an external file system that

  1. supports local caching of files for speed
  2. has immediate availability of files across all servers
  3. is “shardable”, so files only exist on servers where they are actually needed

Storage

To store the files, you need an external storage solution. For reasons that we will see later, the solution I use is MongoDB and its GridFS solution.

MongoDB is a NoSQL database, that stores information in binary JSON files. It is extremely scalable, and shards nicely as well, allowing us to concentrate more on our application and less on database maintenance.

To store the files, we will upload them into the MongoDB network, where they will be stored as “chunks”. Retrieving and storing the files is a simple matter, as we’ll see.

Saving Files

Up until now, all your files were recorded on the system using direct access – using file_put_contents(), for example.

We need to find all instances of these calls and route them through a new function called mdbFileSet (MongoDB File Set) that will record the file as requested, but will also upload it to the database.

In most cases, this is a simple matter – if the user-files directory is $_SERVER[‘DOCUMENT_ROOT’].’/userfiles/’, then a call such as file_put_contents($_SERVER[‘DOCUMENT_ROOT’].’/userfiles/’.$filename, $filecontent) will be replaced with mdbFileSet($filename, $filecontent). This is obviously more readable, and we are abstracting the user-files location as well, making it flexible.

The actual mdbFileSet() function works like this

  1. parameters are $fname and $file, which contain the filename (including the directories, delimited by ‘/’), and the file content as a string.
  2. check GridFS to see if the file already exists. If it does:
    1. delete the existing file (see Deleting Files in this article)
  3. copy the uploaded file to the local user-files location (to act as a cache)
  4. upload the file using GridFS

Code for the mdbFileSet function:

function mdbFileSet($fname, $file) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  }
  global $MDBVARS;
  if (strpos($fname, '/')!==false) {
    @mkdir($MDBVARS['cache'].preg_replace('/[^\/]*$/', '', $fname), 0755, true);
  }
  file_put_contents($MDBVARS['cache'].$fname, $file);
  $conn=new Mongo($MDBVARS['dbhost']);
  $db=$conn->{$MDBVARS['dbname']};                
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);
  $grid=$db->getGridFS();                    
  $existing=$grid->findOne($fname);               
  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }
  $grid->storeBytes($file, array('filename'=>$fname), array('safe'=>true));
  $conn->close();
}

You will need to set the $MDBVARS global array before running the function. I keep mine in the server’s config.php.

Example:

$MDBVARS=array(
  'cache'=>$_SERVER['DOCUMENT_ROOT'].'/userfiles/',
  'username'=>'username',
  'password'=>'password',
  'dbname'=>'filesdb',
  'dbhost'=>'mdb1.yourmongodbserver.com'
);

Replace the values in the above code with your own values.

You can test this easily. Create a test.php file with the following code:

<?php
require_once 'php/basics.php'; // link to file containing common functions
mdbFileSet('test/file.php', file_get_contents(__FILE__));
?>

The above code will upload a copy of the test.php file you just created, and will store a copy in your cache as well. After loading the file in your browser, you can test this by looking in your cache on the server:

[root@cp3 server]# ls userfiles/test/file.php -l
-rw-r--r-- 1 apache apache 142 Nov  6 10:36 userfiles/test/file.php

And also by logging into the MongoDB server and searching for the file:

> db.fs.files.find({filename:'test/file.php'})
{ "_id" : ObjectId("545b4f1560b99367688b456b"), "filename" : "test/file.php", "uploadDate" : ISODate("2014-11-06T10:36:05.121Z"), "length" : NumberLong(142), "chunkSize" : NumberLong(261120), "md5" : "00397d7306c53cda5ea9446d7bd62594" }

Before going any further, you should go through your code now and edit all your user-file-writing functions so they use the mdbFileSet() function. Everything should still work as before, but now, there will be a copy of each file saved in the MongoDB database as well.

Reading Files

Okay, so let’s say all your work so far in this article has been done on Server1. You now switch over to Server2 and want to open a record that includes an image uploaded to Server1. The image is obviously not on Server2, so how do we transparently download it to Server2 such that the end-user never needs to know?

For this, we will write a function called mdbFileGet (MongoDB File Get), which will retrieve it from the MongoDB server if it is not already cached locally. How it works:

  1. there is one parameter, $fname, which is the filename including the directories.
  2. if the file already exists in the local server’s cache, then return that file’s contents.
  3. otherwise, download the file from GridFS, store a copy in the local cache, and return the file’s contents.

There is an issue to do with the cache, which I’ll explain in a moment, but in the meantime, here is the code for the function:

function mdbFileGet($fname) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  } 
  global $MDBVARS;
  if (file_exists($MDBVARS['cache'].$fname)) {
    return file_get_contents($MDBVARS['cache'].$fname);
  }
  $conn=new Mongo($MDBVARS['dbhost']);         
  $db=$conn->{$MDBVARS['dbname']};                
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);        
  $fdata=$db->fs->files->findOne(array('filename'=>$fname));
  if (is_null($fdata)) { // file doesn't exist
    $conn->close();
    return false;
  }
  $grid=$db->getGridFS();                    
  $file=$grid->findOne(array('filename'=>$fname));
  if (strpos($fname, '/')!==false) {
    @mkdir($MDBVARS['cache'].preg_replace('/[^\/]*$/', '', $fname), 0755, true);
  } 
  $bytes=$file->getBytes();
  file_put_contents($MDBVARS['cache'].$fname, $bytes);
  $ftime=date('YmdHis', $file->file['uploadDate']->sec);
  touch($MDBVARS['cache'].$fname, $ftime);
  $conn->close();
  return $bytes;
}

For an example of this in use, let’s consider an image, /userfiles/1/image.jpg that was uploaded to Server1. It’s obviously not yet on Server2, so how do we view it there?
When loading the file up (let’s say http://server2.yourcomp.any/userfiles/1/image.jpg), the server looks directly for the image, and doesn’t find it. We need to route the request through a script that makes sure the file is there before sending it back.
To do that in this case, we can use mod_rewrite so that calls to /userfiles/[whatever] are routed to something like /php/file-get.php, which handles the work.
Edit your .htaccess file, and add in something like this:

RewriteEngine on
RewriteRule ^userfiles/.*$ /php/file-get.php [QSA,L]

Now create the file php/file-get.php:

<?php
require_once 'basics.php'; // load common functions and config.php
$fname=preg_replace('/^\/userfiles\/|\?.*/', '', $_SERVER['REQUEST_URI']);
if (strpos($fname, '..')!==false) { // hack attempt
  exit;
}
$ext=strtolower(preg_replace('/.*\./', '', $fname));
switch ($ext) {
  case 'png':
    header('Content-type: image/png');
  break;
  case 'jpg': case 'jpeg':
    header('Content-type: image/jpg');
  break;
  case 'gif':
    header('Content-type: image/gif');
  break;
  default:
    header('Content-type: ');
}
echo mdbFileGet($fname);
?>

You can see that most of the file’s code is actually just figuring out the mime-type to show. The downloading and showing of the file is done right at the last line.

You can now transparently upload files on one server and view them on another!

In fact, once the file is uploaded, you can remove it completely from all servers, and then when you next need it, just load it up through mdbFileGet() as normal and it will download again.

The caching issue that I mentioned earlier has to do with cache invalidation. Let’s say we upload image.jpg and it is distributed to a number of servers. After a few hours, we might upload a replacement image – how do we tell the servers that the old image is invalid and it should be downloaded again?

We will start solving that in the next section.

Deleting Files

Deleting files is not as obvious as it sounds. On a one-server system, it’s simply a matter of using unlink() to remove the file, and there’s no more to be said about that.

However, in a multi-server system, we have three steps:

  1. delete the local cached file
  2. delete the database-stored file
  3. find all servers that have a copy of the file and delete the file from those servers.

#1 and #2 can be solved immediately in a very simple function:

function mdbFileRemove($fname) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  }
  global $MDBVARS;
  @unlink($MDBVARS['cache'].$fname);
  $conn=new Mongo($MDBVARS['dbhost']);
  $db=$conn->{$MDBVARS['dbname']};
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);
  $grid=$db->getGridFS();
  $existing=$grid->findOne($fname);
  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }
  $conn->close();
}

The above will delete a file from the local server and from the MongoDB database, but will not clear the file from other server caches.

To delete from the other machines, we need to set up a deletion queue, which we’ll do later in the File Delete Queues section.

Creating File Delete Queues

To delete files from all servers, we need to send a message to those servers to tell them to delete their local copies of the file.

Sending a message to every single server in your network is a waste of resources, as most of the servers may not actually have a copy of the file you are trying to delete.

So, we need to adapt the mdbFileSet and mdbFileGet functions so they add a record to the database telling it exactly what servers have copies of the files. This will then allow us to target just those servers and to know that we’re not wasting time.

Edit the mdbFileSet function and change this line:

$grid->storeBytes($file, array('filename'=>$fname), array('safe'=>true));

to this:

$grid->storeBytes(
  $file,
  array('filename'=>$fname, 'servers'=>array($_SERVER['HTTP_HOST'])),
  array('safe'=>true)
);

As a test, I uploaded an image called 3184/user-photos/3184.jpg, then checked my MongoDB instance:

> db.fs.files.find({filename:'3184/user-photos/3184.jpg'})
{ "_id" : ObjectId("545b68a160b993b86b8b4567"), "filename" : "3184/user-photos/3184.jpg", "servers" : [ "cp3.myserver.com" ], "uploadDate" : ISODate("2014-11-06T12:25:05.344Z"), "length" : NumberLong(37182), "chunkSize" : NumberLong(261120), "md5" : "9def8b14cb1611097e755692d04dcbdd" }

Note the highlighted servers section. As part of the file upload, we are initialising an array which states what servers have a copy of that file.

An important thing to note as well, is that in GridFS, the file is recorded in a set of chunks which are standard MongoDB documents, and the metadata of the file is recorded in another normal document. What we look at with db.fs.files.find is the metadata, not the file chunks. It would be uneconomical to store metadata within the same document(s) as the file chunks, as checking something as simple as its creation date, or the list of servers that have it, would then involve downloading the entire file.

Next, we need to adapt the mdbFileGet() function. Change the following:

touch($MDBVARS['cache'].$fname, date('YmdHis', $file->file['uploadDate']->sec));
$conn->close();

to this:

touch($MDBVARS['cache'].$fname, date('YmdHis', $file->file['uploadDate']->sec));
$db->fs->files->update(
  array('filename'=>$fname),
  array('$push'=>array('servers'=>$_SERVER['HTTP_HOST']))
);
$conn->close();

In this, we inline-update the server array that we created in mdbFileSet(). There is no need to download, change, and re-upload the record. In fact, there is a race condition there, in that some other server may be doing the same thing at the same time. It is safer to have the MongoDB server handle the update of the document directly.

If you then open the image on another server and check the file again on the MongoDB server, you’ll see something like this:

> db.fs.files.find({filename:'3184/user-photos/3184.jpg'})
{ "_id" : ObjectId("545b68a160b993b86b8b4567"), "filename" : "3184/user-photos/3184.jpg", "servers" : [ "cp3.myserver.com", "cp4.myserver.com" ], "uploadDate" : ISODate("2014-11-06T12:25:05.344Z"), "length" : NumberLong(37182), "chunkSize" : NumberLong(261120), "md5" : "9def8b14cb1611097e755692d04dcbdd" }

Note that the servers array has an extra entry in it, but nothing else was touched. Exactly what we want.

Next, we need to adapt the mdbFileRemove function, so it builds the queue of files to delete (and what servers to delete them from).

To do that, change the following:

  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }

to this:

  if (!is_null($existing)) {
    if (isset($existing->file['servers'])) {
      $servers=array_unique($existing->file['servers']);
      $idx=array_search($_SERVER['HTTP_HOST'], $servers);
      if ($idx!==false) {
        unset($servers[$idx]);
      }
      $list=array_values(array_map(function($server) use ($fname) {
        return array(
          'filename'=>$fname,
          'server'=>$server
        );
      }, $servers));
      $ret=$db->command(array('insert'=>'deletes', 'documents'=>$list));
    }
    $grid->delete($existing->file['_id']);
  }

This code inserts an entry into a db.deletes collection on the MongoDB server for every server that has a cached copy of the file. Of course, it removes a reference to the local server before doing so, as we can handle that immediately.

After doing an update of the image on cp3.myserver.com, I then checked the MongoDB deletes collection:

> db.deletes.find()
{ "_id" : ObjectId("545b7f6165f402bccee49573"), "filename" : "3184/user-photos/3184.jpg", "server" : "cp4.myserver.com" }

This means we can now work on the next part; writing a deletion daemon.

Running a File Deletion Queue

We now have a list of the cached files and the servers that have them. But how do we tell those servers to delete those cached files?

A way to do this is to write a cron job that runs every minute and checks the MongoDB deletes collection to see if there are any cached files that need to be deleted, then call those servers and tell them to delete the files.

This script will need to run directly on the MongoDB server, so install PHP on that server. In particular, you will need the command-line version of PHP. In Centos7, it is installed like this:

[root@mdb1 ~]# yum install php-cli php-devel php-pear gcc openssl-devel
[root@mdb1 ~]# pecl install mongo
[root@mdb1 ~]# echo "extension=mongo.so" >> /etc/php.ini

On the MongoDB server, create a user called mongo (useradd mongo), and create a file called /home/mongo/checkCaches.php:

<?php
$MDBVARS=array(
        'username'=>'username',
        'password'=>'password',
        'dbname'=>'filesdb',
        'dbhost'=>'mdb1.yourmongodbserver.com',
        'apikey'=>'805f73958de1653e073e6a8c674bb1e8'
);
$conn=new Mongo($MDBVARS['dbhost']);
$db=$conn->{$MDBVARS['dbname']};
$db->authenticate($MDBVARS['username'], $MDBVARS['password']);
$fdata=$db->deletes->find();
while ($d=$fdata->getNext()) {
        $url='http://'.$d['server'].'/php/cacheClear.php';
        $ch=curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $fname=$d['filename'];
        $now=''.microtime(true);
        $md5=md5($fname.$now.$MDBVARS['apikey']);
        curl_setopt($ch, CURLOPT_POSTFIELDS, array(
                'filename'=>$fname,
                'time'=>$now,
                'md5'=>$md5
        ));
        $ret=curl_exec($ch);
        if ($ret=='ok') {
                $db->deletes->remove(array('_id'=>new MongoId($d['_id'])));

        }
        curl_close($ch);
}
?>

The $MDBVARS array is almost the same as those on the application servers. We add a new item, though, apikey, which helps us provide some authentication without needing usernames and passwords. By running the filename, the time, and the apikey through an MD5 function, we create a value that can only reasonably be reproduced by another MD5 function that knows the same details. So, we send the filename, time and MD5 result through to the target server, and if the target server can reproduce the MD5 result by MD5ing the filename, time, and its own copy of the apikey, then that’s enough proof that the call is valid.

Make sure to add the apikey entry to all your servers’ $MDBVARS arrays.

On the target server, then, we create the /php/cacheClear.php file:

<?php
require_once 'basics.php';
$fname=$_REQUEST['filename'];
$md5=md5($fname.$_REQUEST['time'].$MDBVARS['apikey']);
if ($md5!=$_REQUEST['md5']) {
  echo 'incorrect API key';
  exit;
}
if (strpos($fname, '..')!==false) { // check for hacks
  exit;
}
@unlink($MDBVARS['cache'].$fname);
echo 'ok';
?>

As usual, there is a potential flaw to consider. The checkCaches.php file on the MongoDB server goes through every delete entry in the database, but what if this takes more than a minute to finish?

If it takes more than a minute to finish, and the script is being called once a minute, then eventually, the server will have multiple copies of the script running against overlapping lists of files, and it will crash.

The solution to this is simply to add a timeout to the script, so it runs for 55 seconds (say) and then stops.

In checkCaches.php on the MongoDB server, change the following:

while ($d=$fdata->getNext()) {

to this:

$now2=time();
while ($d=$fdata->getNext()) {
  if ($time-55>$now) {
    exit;
  }

Now it will simply stop after second 55, and continue when it is called again.

Edit cron for the mongo user (su mongo -c “crontab -e”) and add this line then save the file:

* * * * * php /home/mongo/checkCache.php >/dev/null 2>/dev/null

That’s it! You now have a working distributed filesystem.

multi-tenanted CMS architecture

Last week, I did a talk at the Dublin Google buildings titled “Multi-tenanted CMS Architecture using PHP”.

Here are the slides that I used:

While talking with Google’s Brian Brazil, he explained that it is actually more efficient to use one database and many separate tables, than to separate each installation into a separate database, so one point I made (that KV-WebME uses separate databases per site) will change in the future.

I think the talk went down well, by the number of questions afterwards.

Last year, I gave a similar talk, and made the mistake of including way too much PHP in it – I had assumed that the audience would be composed of PHP developers. This year, there is just one slide of PHP, and that’s just to illustrate one possible way to build a proxy config.

Lesson’s learned for this time:

  • Talk slower. When I’m explaining something, I tend to try to get as much in as possible, so speak very fast. This makes it hard to hear what I’m saying.
  • More pictures, less words!
  • Stats. Some of the questions were around how efficient certain parts of the method were – particularly on the overhead of piping a file through a script as opposed to simply delivering it via Apache. I need to come up with numbers for that.

Overall, I was happy with this presentation.

keeping an admin session active

I had a call from a client who asked why, after logging into a CMS admin area and spending an hour or so writing a document, she was unable to submit it because it claimed she was not logged in.

The answer was that the session had expired.

On busy servers, one method of optimisation is to reduce the session-time. This makes it easier for the server to cope with a large number of visitors, but also has the undesired effect of logging people out if they take their time over anything.

One solution to this is to keep the admin session in a database table, tied to a cookie in the browser. Unfortunately, that means that every time the browser sends the cookie, it must be verified, whereas a session is usually trusted.

The workaround is to use some JavaScript to refresh the session every now and then.

I wrote a simple cyclical script which polled the server every minute to refresh the session.

Here it is, using jQuery to handle the AJAX:

function keep_session_alive(){
  setTimeout(keep_session_alive,60000);
  $.get('/ww.admin/keepalive.php');
}
setTimeout(keep_session_alive,60000);

And the server-side code is this:

<?php
session_start();

Very very simple trick. The polling could be enhanced, if you want, to alert the admin of anything interesting that’s happened on the server.

happy birthday KFM

KFM is 4 years old today:

http://verens.com/2006/05/27/file-manager-for-fckeditor-day-one/

demos of older versions no longer available, but it’s interesting to know that it’s lasted four years!

of particular interest in my own case, is that there has been no major functionality change in the last two years or so. KFM today is basically the same as KFM two years ago; just faster, more secure, and better coded.

CMS Design using PHP and jQuery

I’m happy this week. Last week, I spent some time and organised myself a bit more. In work, things are going smoothly – managed to get over a tricky piece of work and the rest is simply a list of small tasks.

For the last few weeks, I’ve been emailing and messaging Darshana at Packt Publishing, about writing a second book (jQuery 1.3 with PHP is going very well – list of reviews).

I initially wanted to write about file management, to explain how KFM works, and to help force me to improve on it. But there’s just not enough of an interested market in that – it’s too specialised.

So instead, I’ll be writing about CMS design using PHP and jQuery.

We (webworks.ie) have a CMS engine which we’ve written and improved for the last 6 or so years. We’ve open-sourced it a number of times, but never managed to generate much interest in it. We never had the time to spend on publicising it.

The book will not be specifically about that engine, but rather about the concepts that went into creating it – how a CMS works, how to manage plugins, administration, user management, and all the other little bits and pieces that every PHP developer needs to eventually address.

By way of explanation, I will be demonstrating various parts of our CMS, and explaining how and why it was built that way. I will be closely examining the other major CMSes as well, and giving alternative methods where good ones exist.

The proposed chapter list is:

  1. Introduction
  2. CMS core design
  3. User management and access control
  4. Page creation and Navigation
  5. Template Management
  6. Plugins
  7. Form creation
  8. Image Gallery
  9. Panels
  10. Search and Polls
  11. RSS and News
  12. Online Store
  13. Products

I’m really excited about this project!

jQuery 1.3 With PHP: Datatables chapter is online

Packt have put another chapter of my book online. This time, it’s the datatables chapter, which discusses how to use the Datatables plugin to organise very large data tables.

The chapter discusses how to make a large (over two million rows) database usable through an ajaxified table.

The Datatables plugin supports search, sorting and pagination.

jQuery 1.3 With PHP: calendars chapter is online

Packt have put one of the chapters of my book, jQuery 1.3 with PHP, online as an article.

I mentioned this chapter in June, and provided a small demo and the source.

I’ve had a few requests for example code on how to do this using a real database. I’ve been meaning to make the time to do that. I’ll get that done hopefully tomorrow.

In the meantime, though, please enjoy the article.

By the way, the book went to the printer only a few days ago, so if you want to avail of the pre-order price, now’s the time.

KFM2: the beginning

KFM 1.x has reached the end of its development. It has fulfilled its original purpose; to improve on the default file-manager for FCKeditor, and well-exceeded it.

From my records, I can see that KFM has been installed on over 9000 separate domains. That is quite a lot of sites, and says quite loudly how important it is that KFM2 is at least as good as KFM1 (we all know about the KDE4.0 fiasco…).

KFM2 will blow KFM1 well out of the water. It will do everything that KFM1 does, but much quicker, and with a much less monolithic architecture, allowing it to be hugely flexible. I am planning on creating a download section similar to jQuery UI’s, where you can choose the components that you want in the system, and it will be built up and configured for you, ready for download.

See, KFM 2 will be modular. If you just want a simple upload/download facility, then you will only need the core, and plugins to handle uploads and downloads. If you want the whole shebang, you will find plugins for various ways of selecting files and directories, full-on graphical UIs such as the existing KFM1 forces on the user, plugins for hooking into external file systems using FTP or whatever other method people can imagine, plugins for multi-media, file editing, databases, user authentication, search, RSS, and so on.

Basically everything that KFM1 already does, but with the option to easily remove/add bits that you want. Make it as fast or as “bling” as you want.

KFM2 will have a number of new things which are not available in any other online file-systems, that I know of.

new file-system things

The file system will be modular. You will be able to attach symbolic links to any part of the file system, linking to other external systems.

As an example, let’s look at a URL: http://example.com/kfm/get/images/site1/logo.jpg.

One of the new ideas is that this will not necessarily be located on the server named example.com.

If you are running a very busy website, you might prefer to delegate image management to a different machine deeper in your network, so you add a symbolic link /images/ linking to the image-management machine. This causes the server to issue a redirect to the browser, redirecting it to http://images.example.com/kfm/get/site1/logo.jpg.

Now, imagine that the image management server doesn’t host the logo.jpg file on itself. Maybe it’s one of a number of images.example.com servers in a load-balancing cluster, and the file is actually located on a file-server accessible only by FTP. So, we add a symbolic link in the server’s KFM, telling it how to connect to the FTP server. The image server does this, gets the required file, and sends it to the browser.

This sounds a bit complex, but at least it’s possible with the new system. The old system simply would not be able to handle that at all.

API

At the moment, if you want to select a file or directory, you fire up KFM1, wait 15 seconds for it to finish booting itself up, and select the file or directory. That’s way too long. Really, you should only need to load up a tiny widget that’s designed specifically for picking a file.

The new system will have a simple core with an API that can be connected to by little JavaScript widgets. As an example, let’s say you just want to get the file-listing for the foo directory.

At the moment, you need to go through the whole business of booting up KFM’s GUI, navigating to the right directory, then waiting for the server to connect to the database, generate thumbnails and other unnecessary stuff.

You should just have to do something simple like load up http://example.com/kfm/rpc.php?a=get_file_listing&v=/foo through jQuery’s $.get() function. The RPC script does exactly what was asked, and nothing more.

This should allow us to write some incredibly fast and tiny scripts for all the file manipulation things you could possibly want.

Think of it as the “gnu” way of doing KFM – many small tools, each of which is designed to do one task well.

Of course, you will still be able to use the original GUI with this, after it’s been re-engineered to hook into the new RPC system.

Or, you can write your own – the RPC’s API will be well-documented, and will use simple HTTP parameters, so you can write your widgets in whatever language you want – JavaScript is what I will be writing in, but there’s no reason why you’re confined to that – write a GUI in Python, C, Flash – whatever you’re comfortable with.

plugins

I think the new plugin architecture will make this profoundly extensible. I’m writing the core to be as flexible and stable as possible. You will be able to write a plugin to just about anything:

  • authenticate a user before serving a file.
  • change a user’s root directory based on the server-name or the user’s authentication.
  • the entire system is held on a different machine accessible only by FTP? fine – let’s change how files are accessed.
  • record all files and directories in a database, allowing searches to be done.
  • manipulate files online – edit text files, rotate or crop or resize images, move or rename or delete, etc. standard KFM1 stuff, these.
  • create a log plugin, so all access is recorded.
  • create a quota plugin, for denying uploads or downloads when the quota is exceeded.

It was possible for us (Benjamin and myself) to write all of these into KFM1, but as we are only two people, and very busy people at that, it’s never been the absolute top priority to get all these things done.

But, when KFM2 is released, you won’t have to wait for us anymore. You will be able to write plugins, or download plugins created by other people, which do all of the above and I’m sure there will be plugins that I simply have not imagined yet, that inventive people will come up with (Conor, I’m sure you’ll be one of those!)

languages

Languages have always been one of the cool things about KFM. Version 1.3 was released in about 13 languages in total. Unfortunately, the code got too complex in 1.4 to easily add more, but KFM2 is starting from scratch with the learning-curve of KFM1 well behind me. I’ve already made most of the big mistakes I’m likely to make, so from now on, things should be much easier.

How languages will work is that a website will be created which allows people to help out open-source projects, not just KFM by providing translations of words and sentences.

The programmer will create a screen-shot showing the text in use, and will provide the untranslated text in a format such as .po. Translators will translate the text on-line, and the translation will be available to download by the programmer as soon as it’s done.

To ensure good translation, each line will be presented to a number of different translators. The translations which agree the most with each other, will be considered the “right” one.

Using this, it will be possible for KFM to organically add languages with no interaction from myself or any other programmers. When first loaded, a plugin would try to translate messages into the user’s language. Finding that it doesn’t have the language in its database, it will try downloading it from the language translation website. If it’s not available, a request will be added that it be created. When translators that speak that language come to the website, they will be automatically given the KFM text to translate to that language. The next time the plugin checks, the translation might be done, and it will be added to the KFM instance’s local language database.

Of course, people don’t just do this stuff for the hell of it. It’s more fun if there’s a competition or reward. It would be nice if I could get a few paying customers to add their own translation requests to the website – that way I could offer a reward to the best translators (those that do the most work, or are the most consistent).

The competition aspect comes in, where you give people points for every translation they do that is matched by someone else’s translation of the same text.

Google does a similar thing, for example, where it shows images to people and asks them to come up with keywords describing the image – the commonest keywords are considered to be correct. Same trick, different target.

Anyway! Without further ado, I have a core to design, and tests to create. I was going to talk about testing with PHPUnit, automatic upgrades, and the like, but to hell with ye all – I’ve work to do 😉

a tale of prefixes

I debugged a problem for a friend of mine over the last day. He had installed bbPress on a server which also had WordPress on it, and his WordPress site just suddenly stopped serving comments and pages, and started killing the database server instead.

In the end (after a day of considering this and that and a lot of head-scratching), it turns out that the bbPress installation had been done using the same table prefix as the WordPress one.

See, when a web app is installed, it usually installs a load of database tables. The way to avoid these tables overwriting each other is to “namespace” them. The most common method is to simply place a unique prefix in front of the table names.

For example, WordPress and bbPress both have tables for posts. The default installation is to name them wp_posts and bb_posts, respectively, using wp_ and bb_ as prefixes. KFM does that as well, using kfm_ by default.

In the current case, what happened is that the bbPress installation was done using wp_ as the prefix, and some of the WordPress tables were then changed to match what bbPress expected.

In the case of wp_posts, the field post_status was changed from a varchar holding values such as “publish” or “draft”, to a smallint holding just ‘1’ or ‘0’. This had the immediate effect of unpublishing over 6000 articles on the blog. (aside: oi, WordPress – why are you not using enums?)

A few hours, spread over a day or so, were spent figuring that out then fixing that.

So, what actually happened? Why was wp_ used as the prefix?

Let’s look at the WordPress installation.

wordpress

When you install a copy of WordPress, the first thing you do is fill in the database details. This is the field for the prefix:

<input id="prefix" type="text" size="25" value="wp_" name="prefix"/>

Notice the name, “prefix”. A lot of browsers these days include auto-fill, which is sometimes useful. When encountering a form, the browser would fill in the details that it already knows, such as email address, etc.

The problem, though, is that sometimes, autofill might fill in details that really should be left blank or at their default states. An example of this is in phpBB, the popular forum software – whenever an admin goes to edit a user’s details, autofill in Firefox tends to fill in the “password” field in the user form, but it really shouldn’t do that, and can potentially break a user’s account.

This is actually quite easy to fix, though, in the installer – the installation program just needs to change to use a different name for its input box:

<input id="prefix" type="text" size="25" value="wp_" name="wp_prefix"/>

Now the value should only be autofilled by values recorded from other WordPress installations, and should not affect future installations of other software.

The thing is, this is not what broke the system I was fixing.

Let’s look at the bbPress installation.

bbpress

When writing bbPress, more thought appears to have gone into the installation. The only fields visible here are database name, user, and password. These are the most commonly changed details, so are shown prominently.

There is an “advanced” checkbox. When clicked, this shows fields that should only be changed if you are sure of what they do:

bbpress2

The fields shown here rarely need to be changed. They include database host, collation, character set and table prefix.

The default character set is UTF8, and I can’t imagine why anyone would need anything other than that (that also covers collation). My own opinion is that these ones shouldn’t even be in advanced – they should be hard-coded.

There are not many reasons why the table prefix would ever need to be changed, but in the case I was troubleshooting, the database host was held on a separate server to the site itself.

So, because the database host needed to be changed, that explains why the admin opened the Advanced section in the first place.

If the prefix input box was named “prefix”, this would be the end of the story – I might have assumed that the problem was autofill (as I did for a while), and suggest the above solution to them.

But, the authors of bbPress have obviously come across the autofill problem themselves, because they named their prefix button “bb_table_prefix”, exactly as I suggest every installer should do (now do it with WordPress, guys!):

<input [...] name="bb_table_prefix" class="text" value="bb_" tabindex="8" />

So, the problem is not the software.

In this case, I believe it’s human error. What follows is my own guess at what happened, in hind-sight and without the benefit of watching the proceedings in person.

I think what happened is that our admin’s techie had left some notes about the settings used for the WordPress installation, including database username, password, etc. Then the admin had gone to install bbPress, in the correct assumption that installation should be straightforward and compatible with WordPress.

However, I believe the problem is that the techie had left a note about table prefix, but had not explained to the admin that all the other settings must be as written but the prefix must not.

The admin had simply filled in what was written, and bbPress did what it does, and changed WordPress’s database tables.

So, who’s at fault?

I believe the problem here was a misunderstanding – to all of us who have done a million installations of a thousand different software applications, it’s very obvious what the table prefix is, and what it does. However, I don’t believe I’ve ever seen it written down, or the dangers of using an existing value.

Let’s look at the bbPress installation again, this time with the ‘?’ expanded:

bbpress3

It says “If you are running multiple bbPress sites in a single database, you will probably want to change this.” While literally correct, this actually doesn’t explain what the prefix is or why it should be changed. And most importantly, it does not say “This value must be unique in your database, or it may overwrite your existing database tables.” The help text in bbPress’s case does not actually explain the danger of using existing values for the prefix.

The techie is also at fault. If leaving notes for someone, always make sure to explain anything which needs explaining. In this case, it should be something like “If installing new software, use these database access details, but make sure that you do not use these database table prefixes.” In fact, the table prefix should not have been in the notes at all – if the value is a default, then don’t write it down.

The admin is also at fault, for following the notes to the letter and filling in the prefix where the default value was actually perfect. I won’t fault him for installing software when the techie wasn’t at hand to point out mistakes, as there really shouldn’t have been any – the prefix error was actually quite easy to make; it just happened to have devastating effects.

However, these are all minor faults. They’re all easy mistakes to make, and it was the combination of all three that caused the error.

Having said all that, I realise that I have some installers to check now to make sure I follow my own advice!