Symfony World blog is not maintained anymore. Check new sys.exit() programming blog.

data migration with symfony & doctrine

Scene from "Apocalypse Now" by Francis Ford Coppola (1979)

Diving into 'outsourcing applications with symfony': migrating data

This post was inspired by Fizyk's comment on the outsourcing applications with symfony article. Fizyk suggested that it would be a good idea to wrap all data migration functionalities into a symfony task(s). Basically, migrating data between different versions of the same project is the topic I want to focus on in this article.


new version of a project, written from scratch

In the previous article, we've been discussing developing a new version of an old project from scratch (e.g. because the old one was so inflexible that it was unable to be extended). The first thing we should do is to design new database structure that will hold all data that the present project holds (I assume that the database structure will have to be different, because if it was good, no one would redevelop entire project from scratch). Depending on your luck, it can be an easy task or a nightmare [I've already had redeveloped 2 projects with databases that were not even in the first normal form... yes, there are projects like this. Check this out: one of these projects were made by a really bad developer who never used any type of version control system. He used no development environment (only production) and no localhost machines to test new features before they are added. He made all modifications on the production server, even if it took 2 weeks until they were finished. And there was only FTP access on the server... Oh, yes, there ae projects like that... and you may be asked to improve such projects :)].


data migration

Now when we have two parallel database structures, we need to copy all data from the old one to the new one. Depending on the data (its size, structure, relations, constraints, etc.), I use one of the following:

  • pure Doctrine (Doctrine in - Doctrine out)
  • mixed (Doctrine in - SQL out)
  • pure SQL (SQL query)
For each data structure, I choose one of the above migration types very carefully (detailed descriptions below). If the project is big (and I assume this is the case), the development phase takes a lot of time. It's very important that you can run entire migration job:
  • as fast as possible: this enables you to run the migration very often, because it doesn't take too much time. And this gives you the possibility to improve the scripts, test them carefully (each mistake made during the migration will be painful when new project is started in production environment). And, finally, you can download the production (old project) database and migrate it to the new structure to compare the frontend of both projects - this helps to develop the business logic,
  • with wrapped interface: click once - make yourself a cup of tea - come back in 10 minutes - click again - go out with your dog - come back in 15 minutes - the migration is done. Or otherwise: imagine you have a project with 50 tables and you have to run 50 actions (e.g. by a click), each one for each table - or call 50 symfony tasks, not to mention that you need to call them in appropriate order (because data is related in RDBMS) and you have to watch out not to break the order. Of course, there has to be 50 small jobs, deep inside the migration system, but your interface should do the work for you automatically (check an example below).
  • avoid repeating the same manual modifications. This is difficult to discuss in general - I'll give an example. I've had an action table which stored users activity. The old project database had a record which stated that user X has replied to topic Y with post Z. And neither topic Y nor post Z existed in the database (again, if the project is so bad that it has to be redeveloped from scratch, such bad mistakes are very common). Of course, the migration system failed each time when symfony tried to insert a record that was related to a non-existent record. What to do?
    • bad solution: remove this action record from you old database local copy (you download the old project database from the production server, load it to your SQL localhost machine and everythng is 100% safe)
    • good solution: remove this record from production database, because it already points some stupid and non-existing stuff.
    Why to do so? Because if such mistakes were made in the old project (data inconsistency), there can be quite a lot of such paniful action table, which will make a single migration job really long. And after you have done the whole migration 2 or 3 times, it'll be enough or dumb work and you won't want to do it any more. Again, this is only a simple example. Each situation will probably need a different solution
There is one more important thing: no matter how complicated your migration job is, always create a detailed list of steps to make, describing what is being done at the moment, why such order and what is already migrated/what is still waiting to be migrated. With such list, performing data migration is really easy and unstressful :)


Here you have an example of a migration job order from a project I did recently (all .sql files are 'pure SQL' migrations):

where the action http://polonia/backend_dev.php/migrator/migrateAll calls the following (these are the only 'pure Doctrine' migrations):

MigrationManager::migrateConfig();
MigrationManager::migrateForbiddenPhrases();
MigrationManager::migrateQuotes();
MigrationManager::migrateArticles();
MigrationManager::migrateReadingCategories();
MigrationManager::migrateReadingTextbooks();
MigrationManager::migrateCountries();
MigrationManager::migrateRegions();
MigrationManager::migrateSubpages();


pure Doctrine migration

This is the only code example in the outsourcing applications with symfony article. And this is probably the only migration type that would be wrapped in a symfony task, as Fizyk suggested. This is really easy - just fetch data from the old system and save it in the new system.

  • used for small amounts of data
Below is another example of this data migration technique - private messages sent between users of a social website:
static public function migrateMessages()
  {
    $wiadomosci = Doctrine_Query::create()
      ->from('Wiadomosci w')
      ->orderBy('w.wiad_id')
      ->fetchArray();
 
    foreach ($wiadomosci as $row)
    {
      $message = new Message();
      $message->setId($row['wiad_id']);
      $message->setTitle($row['temat']);
      $message->setText($row['tresc']);
      $message->setFromId($row['wiad_od']);
      $message->setToId($row['wiad_do']);
      $message->setDisplayed($row['przeczytane']);
      $message->setProvoke($row['zaczep'] == 't');
      $message->setCreatedAt($row['kolumna_data']);
      $message->setUpdatedAt($row['kolumna_data']);
      $message->save();
    }
  }


pure SQL migration

You may ask, why native SQL INSERT queries are constructed to be executed immediately, instead of using Doctrine? The answer is obvious: time difference. Doctrine is nice, but in fact, it is really slow. The problem with pure Doctrine migration is that it takes a lot of time to load all data with PHP/Doctrine and then to insert it into new structure. The table from the example above held over 50'000 records. At my local machine, it took several minutes to finish the job. It's too much (data migration should be as fast as possible). The solution to this problem is simply to create a sql script that will do the job thousand times faster:

INSERT INTO NEW_DB.message (
  id,
  title,
  text,
  from_id,
  to_id,
  displayed,
  provoke,
  created_at,
  updated_at
)
SELECT
  wiad_id,
  temat,
  tresc,
  wiad_od,
  wiad_do,
  przeczytane,
  IF(zaczep = 't', 1, 0),
  kolumna_data,
  kolumna_data
FROM OLD_DB.wiadomosci
ORDER BY wiad_id
Sometimes the case may be even worse. The action table, mentioned before, held over 400'000 records and the PHP/apache configuration would need really big limits (and the migration would take several hours to finish). The following is another sql script, migrating action data:
INSERT INTO NEW_DB.action (
  created_at,
  created_by,
  type_id_external,
  target_profile_id,
  forum_post_id,
  forum_topic_id
)
SELECT kolumna_data, u_id,
CASE akcja
  WHEN "profil" THEN 1
  WHEN "logowanie" THEN 2
  WHEN "wylogowanie" THEN 3
  WHEN "pm" THEN 4
  WHEN "zaczep" THEN 5
  WHEN "post" THEN 6
  WHEN "watek" THEN 7
END,
  IF (akcja = 'profil' OR akcja = 'pm' OR akcja = 'zaczep', akcja_id, NULL),
  IF (akcja = 'post', post_id, NULL),
  IF (akcja = 'watek', akcja_id, NULL)
FROM OLD_DB.akcje ORDER BY kolumna_data ASC
Pure SQL migration may be a little bit more complicated (processing data in PHP is easier) - but you'll surely save a lot of time.
  • used for huge amounts of data, as this can be the only possibility to migrate it


mixed migration: Doctrine IN / SQL OUT

Sometimes it can be impossible to create a magnificent "INSERT INTO SELECT FROM" query. PHP, although being as slow as a dead snail (in comparison to SQL database management systems), still gives more possibilities just because being an imperative programming language. The idea of this migration type is to fetch the old system data using Doctrine, generate the SQL code and execute it. You may execute such generated script in any way you like - db console, tools like phpmyadmin or even forcing Doctrine to execute raw SQL. If someone prefers, this can be also done as a symfony task - but if the project is really big, you don't benefit from wrapping such mechanism into a symfony task.

To fully understand the idea of this data migration type, take a look at the example below:

static public function getInexistentProfileIds()
  {
    $u_ids = Doctrine_Query::create()
      ->from('Uzytkownicy u')
      ->select('u.u_id')
      ->orderBy('u.u_id')
      ->fetchArray();
 
    $ids = range(1, $u_ids[count($u_ids)-1]['u_id']);
 
    foreach ($u_ids as $u_id)
      unset($ids[$u_id['u_id'] - 1]);
 
    return $ids;
  }
static public function generateSqlToRebuildUsers()
  {
    $ids = self::getInexistentProfileIds();
    $sql = 'INSERT INTO profile (`id`) VALUES ('.implode('), (', $ids).')';
    echo $sql;
    echo '<hr />';
    $sql = 'INSERT INTO sf_guard_user (`id`, `email_address`, `username`,`is_active`) VALUES (';
    foreach ($ids as $id)
      $sql .= $id.',"NULL-'.$id.'","NULL-'.$id.'", 0), (';
    echo $sql;
  }
The reason for doing mixed migration in this case is data inconsistency in the old project. Users were creating forum topics and posts. And when a user was deleted, posts and topics still referred to it. Migrating forum data returned dozens of errors, all of them was failing constraint - assigning topic/post to a non-existent user. The solution was to create dead user records (with deleted_at set), just for the SQL not to break the constraint. The first method fetches all IDs of the users that has been deleted, the second part generates the SQL code creating those dead records into the database. This is only an example, of course, and the example above may be found controversial :), but, believe me, migrating an entire system (which was really, really badly developed) is a difficult task itself, the time you have is limited and sometimes you are forced to use somehow brutal solutions. Anyway, it's up to you how you want to grab the generated SQL code. For me it was the easiest way to display it in the browser, copy-paste to phpmyadmin and execute. But you can generate a sql file, wrap it into a symfony task or do whatever you think of. The main idea is just to make the data migration faster.


and that's it

I hope that some of my hints will be useful if you ever need to migrate data between different versions of the same project. Comments are welcome :)

No comments:

Post a Comment